Agao22 Script
Spring 2022
These notes will be updated throughout the course. They are likely to contain typos, and
they may have mistakes or lack clarity in places. Feedback and comments are welcome.
Please send to kyng@inf.ethz.ch or, even better, submit a pull request at
https://github.com/rjkyng/agao22_script.
We want to thank scribes from the 2020 edition of the course who contributed to these notes:
Hongjie Chen, Meher Chaitanya, Timon Knigge, and Tim Taubner – and we’re grateful to
all the readers who’ve submitted corrections, including Martin Kucera, Alejandro Cassis,
and Luke Volpatti.
An important note: If you're a student browsing these notes to decide whether to take this
course, please note that the current notes are incomplete. We will release Parts III and IV
later in the semester. You can take a look at last year's notes for an impression of what
the rest of the course will look like. Find them here:
https://kyng.inf.ethz.ch/courses/AGAO21/agao21_script.pdf
There will, however, be some changes to the content compared to last year.
Contents
1 Course Introduction 9
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Electrical Flows and Voltages - a Graph Problem from Middle School? . . . 9
1.3 Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4 More Graph Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.3 Characterization of convexity . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Gradient Descent - An Approach to Optimization? . . . . . . . . . . . . . . 35
3.3.1 A Quantitative Bound on Changes in the Gradient . . . . . . . . . . 35
3.3.2 Analyzing Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Accelerated Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6 Random Walks 64
6.1 A Primer on Random Walks . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2 Convergence Results for Random Walks . . . . . . . . . . . . . . . . . . . . 65
6.2.1 Making Random Walks Lazy . . . . . . . . . . . . . . . . . . . . . . . 66
6.2.2 Convergence of Lazy Random Walks . . . . . . . . . . . . . . . . . . 67
6.2.3 The Rate of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.3 Properties of Random Walks . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.3.1 Hitting Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.3.2 Commute Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
10.4.1 Normalization, a.k.a. Isotropic Position . . . . . . . . . . . . . . . . . 108
10.4.2 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
10.4.3 Martingale Difference Sequence as Edge-Samples . . . . . . . . . . . . 110
10.4.4 Stopped Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
10.4.5 Sample Norm Control . . . . . . . . . . . . . . . . . . . . . . . . . . 112
10.4.6 Random Matrix Concentration from Trace Exponentials . . . . . . . 114
10.4.7 Mean-Exponential Bounds from Variance Bounds . . . . . . . . . . . 115
10.4.8 The Overall Mean-Trace-Exponential Bound . . . . . . . . . . . . . . 116
14 The Cut-Matching Game: Expanders via Max Flow 152
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
14.2 Embedding Graphs into Expanders . . . . . . . . . . . . . . . . . . . . . . . 153
14.3 The Cut-Matching Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 154
14.4 Constructing an Expander via Random Walks . . . . . . . . . . . . . . . . . 157
17.1.2 Updates using Divergence . . . . . . . . . . . . . . . . . . . . . . . . 190
17.1.3 Understanding the Divergence Objective . . . . . . . . . . . . . . . . 192
17.1.4 Quadratically Smoothing Divergence and Local Agreement . . . . . . 193
17.1.5 Step size for divergence update . . . . . . . . . . . . . . . . . . . . . 196
Chapter 1
Course Introduction
1.1 Overview
This course will take us quite deep into modern approaches to graph algorithms using convex
optimization techniques. By studying convex optimization through the lens of graph algo-
rithms, we’ll try to develop an understanding of fundamental phenomena in optimization.
Much of our time will be devoted to flow problems on graphs. We will not only be studying
these problems for their own sake, but also because they often provide a useful setting for
thinking more broadly about optimization.
The course will cover some traditional discrete approaches to various graph problems, espe-
cially flow problems, and then contrast these approaches with modern, asymptotically faster
methods based on combining convex optimization with spectral and combinatorial graph
theory.
We will dive right into graph problems by considering how electrical current moves through
a network of resistors.
First, let us recall some middle school physics. If some of these things don't make sense to
you, don't worry: in less than a paragraph from here, we'll be back to safely doing math.
Recall that a typical battery that you buy from Migros has two endpoints, and produces what
is called a voltage difference between these endpoints.
One end of the battery will have a positive charge (I think that means an excess of positrons1 ),
and the other a negative charge. If we connect the two endpoints with a wire, then a current
will flow from one end of the battery to the other in an attempt to even out this imbalance
of charge.
We can also imagine a kind of battery that tries to send a certain amount of current through
the wires between its endpoints, e.g. 1 unit of charge per unit of time. This will be a little more
convenient to work with, so let us focus on that case.
A resistor is a piece of wire that connects two points u and v, and is completely described
by a single number r called its resistance.
1. I'm joking, of course! Try Wikipedia if you want to know more. However, you will not need it for this class.
If the voltage difference between the endpoints of the resistor is $x$, and the resistance is $r$,
then this will create a flow of charge per unit of time of $f = x/r$. This is called Ohm's Law.
Suppose we set up a bunch of wires that route electricity from our current source s to our
current sink t in some pattern:
We have one unit of charge flowing out of $s$ per unit of time, and one unit coming into
$t$. Because charge is conserved, the current flowing into any other point $u$ must equal the
amount flowing out of it. This is called Kirchhoff's Current Law.
To send one unit of current from $s$ to $t$, we must be sending it first from $s$ to $u$ and then
from $u$ to $t$. So the current on edge $(s, u)$ is 1 and the current on $(u, t)$ is 1. By Ohm's Law,
the voltage difference must also be 1 across each of the two wires. Thus if the voltage is $x$
at $s$, it must be $x + 1$ at $u$ and $x + 2$ at $t$. What is $x$? It turns out it doesn't matter: We
only care about the differences. So let us set $x = 0$.
Figure 1.5: A path of two resistors.
How much flow will go directly from $s$ to $t$ and how much via $u$?
Well, we know what the net current flowing into and out of each vertex must be, and we can
use this to set up some equations. Let us say the voltage at $s$ is $x_s$, at $u$ is $x_u$, and at $t$ is $x_t$.
Electrical flows in general graphs. Do we know enough to calculate the electrical flow
in some other network of resistors? To answer this, let us think about the network as a
graph. Consider an undirected graph $G = (V, E)$ with $|V| = n$ vertices and $|E| = m$ edges,
and let us assume $G$ is connected. Let's associate a resistance $r(e) > 0$ with every edge
$e \in E$.
To keep track of the direction of the flow on each edge, it will be useful to assign an arbitrary
direction to every edge. So let’s do that, but remember that this is just a bookkeeping tool
that helps us track where flow is going.
A flow in the graph is a vector $f \in \mathbb{R}^E$. The net flow of $f$ at a vertex $u \in V$ is defined as
$$\sum_{v \to u} f(v, u) - \sum_{u \to v} f(u, v).$$
We say a flow routes the demands d ∈ RV if the net flow at every vertex v is d (v).
We can assign a voltage to every vertex, collected in a vector $x \in \mathbb{R}^V$. Ohm's Law says that the electrical flow
induced by these voltages will be $f(u, v) = \frac{1}{r(u,v)}(x(u) - x(v))$.
Say we want to route one unit of current from vertex $s \in V$ to vertex $t \in V$. As before, we
can write an equation for every vertex saying that the voltage differences must produce the
desired net current, e.g.:
• Net current at $s$: $-1 = \sum_{(s,v)} \frac{1}{r(s,v)}\left(x(s) - x(v)\right)$
This gives us $n$ constraints, exactly as many as we have voltage variables. However, we have
to be a little careful when trying to conclude that a solution exists, yielding voltages $x$ that
induce an electrical flow routing the desired demand.
You will prove in the exercises (Week 1, Exercise 3) that a solution x exists. The proof
requires two important observations: Firstly that the graph is connected, and secondly that
summed over all vertices, the net demand is zero, i.e. as much flow is coming into the network
as is leaving it.
The incidence matrix and the Laplacian matrix. To have a more compact notation
for net flow constraints, we also introduce the edge-vertex incidence matrix of the graph,
$B \in \mathbb{R}^{V \times E}$:
$$B(v, e) = \begin{cases} 1 & \text{if } e = (u, v) \\ -1 & \text{if } e = (v, u) \\ 0 & \text{otherwise.} \end{cases}$$
With this notation, the net flow constraints can be written compactly as $Bf = d$.
This is also called a conservation constraint. In our examples so far, we have d (s) = −1,
d (t) = 1 and d (u) = 0 for all u ∈ V \ {s, t}.
If we let $R = \mathrm{diag}_{e \in E} r(e)$, then Ohm's law tells us that $f = R^{-1}B^\top x$. Putting these
observations together, we have $BR^{-1}B^\top x = d$. The voltages $x$ that induce $f$ must solve
this system of linear equations, and we can use that to compute both $x$ and $f$. It is exactly
the same linear equation as the one we considered earlier. We can show that for a connected
graph, a solution $x$ exists if and only if the flow into the graph equals the net flow out, which
we can express as $\sum_v d(v) = 0$ or $\mathbf{1}^\top d = 0$. You will show this as part of Exercise 3. This
also implies that an electrical flow routing $d$ exists if and only if the net flow into the graph
equals the net flow out, which we can express as $\mathbf{1}^\top d = 0$.
The matrix $BR^{-1}B^\top$ is called the Laplacian of the graph and is usually denoted by $L$.
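To make this concrete, here is a minimal numerical sketch (not part of the notes' formal development) that builds $B$, $R$, and $L = BR^{-1}B^\top$ and solves $Lx = d$ for the voltages. It assumes Python with numpy, and a small hypothetical triangle graph; since $L$ is singular ($\mathbf{1}$ is in its kernel), we use a least-squares solve.

```python
import numpy as np

# Hypothetical example: a triangle on vertices {0, 1, 2} with unit resistances.
# Each edge gets an arbitrary direction, here (0,1), (1,2), (0,2).
edges = [(0, 1), (1, 2), (0, 2)]
r = np.array([1.0, 1.0, 1.0])              # resistances
n, m = 3, len(edges)

B = np.zeros((n, m))                        # vertex-edge incidence matrix
for e, (u, v) in enumerate(edges):
    B[v, e] = 1.0                           # edge e = (u, v) enters v
    B[u, e] = -1.0                          # and leaves u

L = B @ np.diag(1.0 / r) @ B.T              # Laplacian L = B R^{-1} B^T

d = np.array([-1.0, 0.0, 1.0])              # route 1 unit of current from s=0 to t=2
x = np.linalg.lstsq(L, d, rcond=None)[0]    # voltages (L is singular: 1 in kernel)
f = np.diag(1.0 / r) @ B.T @ x              # electrical flow via Ohm's law

print(np.allclose(B @ f, d))                # True: f routes the demand d
```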
Now, another interesting question would seem to be: If we want to find a flow routing a
certain demand $d$, how should the flow behave in order to minimize the electrical energy
spent routing the flow? The electrical energy of a flow vector $f$ is $\mathcal{E}(f) \stackrel{\text{def}}{=} \sum_e r(e)f(e)^2$. We
can phrase this as an optimization problem:
$$\min_{f \in \mathbb{R}^E} \mathcal{E}(f) \quad \text{s.t. } Bf = d.$$
We call this problem electrical energy-minimizing flow. As we will prove later, the flow f ∗
that minimizes the electrical energy among all flows that satisfy Bf = d is precisely the
electrical flow.
A pair of problems. What about our voltages, can we also get them from some opti-
mization problem? Well, we can work backwards from the fact that our voltages solve the
equation $Lx = d$. Consider the function $c(x) = \frac{1}{2}x^\top Lx - x^\top d$. We should ask ourselves
some questions about this function $c : \mathbb{R}^V \to \mathbb{R}$. Is it continuous and continuously differ-
entiable? The answer to this is yes, and that is not hard to see. Does the function have a
minimum? This is maybe not immediately clear, but the minimum does indeed exist.
When this is minimized, the derivative of c(x ) with respect to each coordinate of x must be
zero. This condition yields exactly the system of linear equations Lx = d . You will confirm
this in Exercise 4 of the first exercise sheet.
Based on our derivative condition for the optimum, we can also express the electrical voltages
as the solution to an optimization problem, namely
$$\min_{x \in \mathbb{R}^V} c(x).$$
As you are probably aware, having the derivative of each coordinate equal zero is not a
sufficient condition for being at the optimum of a function2 .
It is also interesting to know whether all solutions to $Lx = d$ are in fact minimizers of $c$.
The answer is yes, and we will see some very general tools for proving statements like this in
Chapter 2.
Altogether, we can see that routing electrical current through a network of resistors leads to
a pair of optimization problems, whose optimal solutions, let's call them $f^*$ and $x^*$, are
related, in our case through the equation $f^* = R^{-1}B^\top x^*$ (Ohm's Law). But
why and how are these two optimization problems related?
Instead of minimizing $c(x)$, we can equivalently think about maximizing $-c(x)$, which gives
the following optimization problem: $\max_{x \in \mathbb{R}^V} -c(x)$. In fact, as you will show in the exercises
for Week 1, we have $\mathcal{E}(f^*) = -c(x^*)$, so the minimum electrical energy is exactly the
maximum value of $-c(x)$. More generally, for any flow $f$ that routes $d$ and any voltages $x$,
we have $\mathcal{E}(f) \ge -c(x)$. So, for any $x$, the value of $-c(x)$ is a lower bound on the minimum
energy $\mathcal{E}(f^*)$.
This turns out to be an instance of a much broader phenomenon, known as Lagrangian
duality, which allows us to learn a lot about many optimization problems by studying a pair
of related problems: a minimization problem, and a related maximization problem that
gives lower bounds on the optimal value of the minimization problem.
2. Consider the function in one variable $c(x) = x^3$.
Solving Lx = d . Given a graph G with resistances for the edges, and some net flow vector
d , how quickly can we compute x ? Broadly speaking, there are two very different families
of algorithms we could use to try to solve this problem.
Either, we could solve the linear equation using something like Gaussian Elimination to
compute an exact solution.
Alternatively, we could start with a guess at a solution, e.g. $x_0 = 0$, and then we could try
to make a change to $x_0$ to reach a new point $x_1$ with a lower value of $c(x) = \frac{1}{2}x^\top Lx - x^\top d$,
i.e. $c(x_1) < c(x_0)$. If we repeat a process like that for enough steps, say $t$, hopefully we
eventually reach $x_t$ with $c(x_t)$ close to $c(x^*)$, where $x^*$ is a minimizer of $c(x)$ and hence
$Lx^* = d$. Now, we also need to make sure that $c(x_t) \approx c(x^*)$ implies that $Lx_t \approx d$ in some
useful sense.
One of the most basic algorithms in this framework of "guess and adjust" is called Gradient
Descent, which we will study in Week 2. The rough idea is the following: if we make
a very small step from $x$ to $x + \delta$, then a multivariate Taylor expansion suggests that
$$c(x + \delta) - c(x) \approx \sum_{v \in V} \delta(v) \frac{\partial c(x)}{\partial x(v)}.$$
If we are dealing with a smooth convex function, this quantity is negative if we let $\delta(v) = -\epsilon \cdot \frac{\partial c(x)}{\partial x(v)}$
for some small enough $\epsilon > 0$ so the approximation holds well. So we should be able to
make progress by taking a small step in this direction. That's Gradient Descent! The name
comes from the vector of partial derivatives, which is called the gradient.
As we will see later in this course, understanding electrical problems from an optimization
perspective is crucial to develop fast algorithms for computing electrical flows and voltages,
but to do very well, we also need to borrow some ideas from Gaussian Elimination.
What running times do different approaches get?
1. Using Gaussian Elimination, we can find $x$ s.t. $Lx = d$ in $O(n^3)$ time, and with
asymptotically faster algorithms based on matrix multiplication, we can bring this
down to roughly $O(n^{2.372})$.
2. Meanwhile, Gradient Descent will get a running time of $O(n^3 m)$ or so – at least this is
what a simple analysis suggests.
3. However, we can do much better: By combining ideas from both algorithms, and a bit
more, we can get $x$ up to very high accuracy in time $O(m \log^c n)$ where $c$ is some small
constant.
Recall our plot in Figure 1.7 of the energy required to route a flow $f$ across a resistor with
resistance $r$, which was $\mathcal{E}(f) = r \cdot f^2$. We see that the function has a special structure: the
graph of the function sits below the line joining any two points $(f, \mathcal{E}(f))$ and $(g, \mathcal{E}(g))$. A
function $\mathcal{E} : \mathbb{R} \to \mathbb{R}$ that has this property is said to be convex.
Figure 1.8 shows the energy as a function of flow, along with two points (f, E(f )) and
(g, E(g)). We see the function sits below the line segment between these points.
Figure 1.8: Energy as a function of flow in a resistor with resistance $r = 1$. The function
is convex.
We can also interpret this condition as saying that for all $\theta \in [0, 1]$,
$$\mathcal{E}(\theta f + (1-\theta)g) \le \theta \mathcal{E}(f) + (1-\theta)\mathcal{E}(g).$$
Figure 1.9: A depiction of convex and non-convex sets. The sets A and B are convex since
the straight line between any two points inside them is also in the set. The set C is not
convex.
The set of flows routing $d$ is convex. It is also easy to verify that $\mathcal{E}(f) = \sum_e r(e)f(e)^2$ is a convex function, and
hence finding an electrical flow is an instance of convex minimization. Another natural flow
problem asks us to minimize the maximum amount of flow on any edge, i.e. the congestion:
$$\min_{f \in \mathbb{R}^E} \|f\|_\infty \quad \text{s.t. } Bf = d.$$
This problem is known as the Minimum Congested Flow Problem4 . It is equivalent to the
more famous Maximum Flow Problem.
The behavior of this kind of flow is very different from electrical flow. Consider the question
of whether a certain demand can be routed with $\|f\|_\infty \le 1$. Imagine sending goods from a source
$s$ to a destination $t$ using a network of train lines that all have the same capacity, and asking
whether the network is able to route the goods at the rate you want: This boils down to
whether a routing exists with $\|f\|_\infty \le 1$, if we set it up right.
We have a very fast, convex optimization-based algorithm for Minimum Congested Flow: In
$m \log^{O(1)}(n) \cdot \epsilon^{-1}$ time, we can find a flow $\tilde{f}$ s.t. $B\tilde{f} = d$ and $\|\tilde{f}\|_\infty \le (1+\epsilon)\|f^*\|_\infty$, where $f^*$
4. This version is called undirected, because the graph is undirected, and uncapacitated, because we are aiming for the same bound on the flow on all edges.
is an optimal solution, i.e. an actual minimum congestion flow routing d .
But what if we want $\epsilon$ to be very small, e.g. $1/m$? Then this running time isn't so good any-
more. But, in this case, we can use other algorithms that find a flow $f^*$ exactly. Unfortunately,
these algorithms take time roughly $m^{4/3+o(1)}$.
Just as the electrical flow problem had a dual voltage problem, so maximum flow has a dual
voltage problem, which is known as the s-t minimum cut problem.
Maximum flow, with directions and capacities. We can make the maximum flow
problem harder by introducing directed edges: To do so, we allow edges to exist in both
directions between vertices, and we require that the flow on a directed edge is always non-
negative. So now G = (V, E) is a directed graph. We can also make the problem harder
by introducing capacities. We define a capacity vector $c \in \mathbb{R}^E_{\ge 0}$ and try to minimize
$\|C^{-1}f\|_\infty$, where $C = \mathrm{diag}_{e \in E} c(e)$. Then our problem becomes
$$\min_{f \in \mathbb{R}^E} \|C^{-1}f\|_\infty \quad \text{s.t. } Bf = d, \; f \ge 0.$$
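If you want to play with small instances of this kind of problem, here is a minimal sketch using the networkx library (an external tool, not part of these notes, and not the convex-optimization methods discussed here). It computes a maximum $s$-$t$ flow subject to directed capacities, the Maximum Flow Problem mentioned above as the equivalent view of minimum congestion; the graph is a hypothetical example.

```python
import networkx as nx

# Hypothetical 4-vertex directed instance with integer capacities.
G = nx.DiGraph()
G.add_edge("s", "a", capacity=2)
G.add_edge("s", "b", capacity=1)
G.add_edge("a", "t", capacity=1)
G.add_edge("a", "b", capacity=1)
G.add_edge("b", "t", capacity=2)

flow_value, flow_dict = nx.maximum_flow(G, "s", "t")
print(flow_value)   # 3: the most flow routable from s to t
print(flow_dict)    # flow on each directed edge, respecting capacities
```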
For this capacitated, directed maximum flow problem, our best algorithms run in about
$O(m\sqrt{n})$ time in sparse graphs and $O(m^{1.483})$ in dense graphs⁵, even if we are willing to
accept a fairly low accuracy solution. If the capacities are allowed to be exponentially large,
the best running time we can get is $O(mn)$. For this problem, we do not yet know how to
improve over classical combinatorial algorithms using convex optimization.
Multi-commodity flow. We can make the problem even harder still, by simultaneously trying to
route two types of flow (imagine pipes with Coke and Pepsi). Our problem now looks like
$$\min_{f_1, f_2 \in \mathbb{R}^E} \|C^{-1}(f_1 + f_2)\|_\infty \quad \text{s.t. } Bf_1 = d_1, \; Bf_2 = d_2, \; f_1, f_2 \ge 0.$$
Solving this problem to high accuracy is essentially as hard as solving a general linear pro-
gram! We shall see later in the course how to make this statement precise.
If we additionally require that our flows must be integer valued, i.e.
$f_1, f_2 \in \mathbb{N}_0^E$, then the problem becomes NP-complete.
5. Provided the capacities are integers satisfying a condition like $c \le n^{100} \cdot \mathbf{1}$.
Random walks in a graph. Google famously uses⁶ the PageRank problem to help decide
how to rank their search results. This problem essentially boils down to computing the stable
distribution of a random walk on a graph. Suppose $G = (V, E)$ is a directed graph where each
outgoing edge $(v, u)$, which we will define as going from $u$ to $v$, has a transition probability
$p_{(v,u)} > 0$ s.t. $\sum_{z \leftarrow u} p_{(z,u)} = 1$. We can take a step of a random walk on the vertex set by
starting at some vertex $u_0 = u$, then randomly picking one of the outgoing edges $(v, u)$ with
probability $p_{(v,u)}$ and moving to the chosen vertex $u_1 = v$. Repeating this procedure, to take
a step from the next vertex $u_1$, gives us a random walk in the graph, a sequence of vertices
$u_0, u_1, u_2, \ldots, u_k$.
We let $P \in \mathbb{R}^{V \times V}$ be the matrix of transition probabilities given by
$$P_{vu} = \begin{cases} p_{(v,u)} & \text{for } (v, u) \in E \\ 0 & \text{o.w.} \end{cases}$$
Any probability distribution over the vertices can be specified by a vector $p \in \mathbb{R}^V$ where
$p \ge 0$ and $\sum_v p(v) = 1$. We say that a probability distribution $\pi$ on the vertices is a stable
distribution of the random walk if $\pi = P\pi$. A strongly connected graph always has exactly
one stable distribution.
How quickly can we compute the stable distribution of a general random walk? Under some
mild conditions on the stable distribution⁷, we can find a high accuracy approximation of $\pi$
in time $O(m \log^c n)$ for some constant $c$.
This problem does not easily fit in a framework of convex optimization, but nonetheless, our
fastest algorithms for it use ideas from convex optimization.
2. What are some problems we can solve quickly on graphs using optimization?
4. What algorithm design techniques are good for getting algorithms that quickly find
a crude approximate solution? And what techniques are best when we need to get a
highly accurate answer?
6. At least they did at some point.
7. Roughly something like $\max_v 1/\pi(v) \le n^{100}$.
Part I
Chapter 2
2.1 Overview
In this chapter, we will
3. Recall our definition of convex functions and see how convex functions can also be
understood in terms of a characterization based on first derivatives.
4. See how the first derivatives of a convex function can certify that we are at a global
minimum.
We consider optimization problems of the form
$$\min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t. } g_i(x) \le b_i \;\; \forall i \in [m],$$
where $\{g_i(x)\}_{i=1}^m$ encode the constraints. For example, in the following optimization problem
from the previous chapter
$$\min_{f \in \mathbb{R}^E} \sum_e r(e)f(e)^2 \quad \text{s.t. } Bf = d,$$
we have the constraint Bf = d . Notice that we can rewrite this constraint as Bf ≤ d and
−Bf ≤ −d to match the above setting. The set of points which respect the constraints is
called the feasible set.
Definition 2.2.1. For a given optimization problem the set $F = \{x \in \mathbb{R}^n : g_i(x) \le b_i, \forall i \in [m]\}$ is called the feasible set. A point $x \in F$ is called a feasible point, and
a point $x' \notin F$ is called an infeasible point.
Ideally, we would like to find optimal solutions for the optimization problems we consider.
Let’s define what we mean exactly.
Definition 2.2.2. For a maximization problem x ? is called an optimal solution if
f (x ? ) ≥ f (x ), ∀x ∈ F. Similarly, for a minimization problem x ? is an optimal solution
if f (x ? ) ≤ f (x ), ∀x ∈ F.
What happens if there are no feasible points? In this case, an optimal solution cannot exist,
and we say the problem is infeasible.
Definition 2.2.3. If F = ∅ we say that the optimization problem is infeasible. If
F 6= ∅ we say the optimization problem is feasible.
Figure 2.1
(i) F = [a, b]
(ii) F = [a, b)
(iii) F = [a, ∞)
In the first example, the minimum of the function is attained at $b$. In the second case the
region is open and therefore there is no minimum function value: for every point we
choose, there will always be another point with a smaller function value. Lastly, in the
third example, the region is unbounded and the function is decreasing, thus again there will
always be another point with a smaller function value.
Recall the definitions of convex sets and convex functions that we introduced in Chapter 1:
Definition 2.3.1. A set $S \subseteq \mathbb{R}^n$ is called a convex set if any two points in $S$ contain
their connecting line segment, i.e. for any $x, y \in S$ we have that $\theta x + (1-\theta)y \in S$ for any $\theta \in [0, 1]$.
Figure 2.2: This plot shows the function $f(x, y) = xy$. For any fixed $y_0$, the function
$h(x) = f(x, y_0) = xy_0$ is linear in $x$, and so is a convex function in $x$. But is $f$ convex?
Notation for this section. In the rest of this section, we frequently consider a multivari-
ate function f whose domain is a set S ⊆ Rn , which we will require to be open. When we
additionally require that S is convex, we will specify this. Note that S = Rn is both open and
convex and it suffices to keep this case in mind. Things sometimes get more complicated if
S is not open, e.g. when the domain of f has a boundary. We will leave those complications
for another time.
$$f(y) = f(x) + \nabla f(z)^\top (y - x).$$
1. See p. 248, Corollary 25.5.1 in Convex Analysis by Rockafellar (my version is the Second print, 1972). Rockafellar's corollary concerns finite convex functions, because he otherwise allows convex functions that may take on the values ±∞.
This theorem is useful for showing that the function $f$ can be approximated by the affine
function $y \mapsto f(x) + \nabla f(x)^\top (y - x)$ when $y$ is "close to" $x$ in some sense.
Figure 2.3: The convex function $f(y)$ sits above the linear function in $y$ given by
$f(x) + \nabla f(x)^\top (y - x)$.
$$Df(x)[d] = \lim_{\lambda \to 0} \frac{f(x + \lambda d) - f(x)}{\lambda}$$
and
$$\frac{f(x + \lambda d) - f(x)}{\lambda} = \nabla f(x)^\top d + \frac{o(\lambda \|d\|_2)}{\lambda};$$
letting $\lambda$ go to 0 concludes the proof.
In order to prove the characterization of convex functions in the next section we will need
the following lemma. This lemma says that any differentiable convex function can be lower
bounded by an affine function.
Figure 2.4: The convex function $f(y)$ sits above the linear function in $y$ given by
$f(x) + \nabla f(x)^\top (y - x)$.
Proof. [$\Longrightarrow$] Assume $f$ is convex. Then for all $x, y \in S$ and $\theta \in [0,1]$, if we let $z = \theta y + (1-\theta)x$, we have that
$$f(x + \theta(y - x)) - f(x) \le \theta f(y) + (1-\theta)f(x) - f(x) = \theta f(y) - \theta f(x).$$
Dividing by $\theta$ gives
$$\frac{f(x + \theta(y - x)) - f(x)}{\theta} \le f(y) - f(x),$$
and taking the limit as $\theta \to 0^+$,
$$\nabla f(x)^\top (y - x) = \lim_{\theta \to 0^+} \frac{f(x + \theta(y - x)) - f(x)}{\theta} \le f(y) - f(x).$$
[$\Longleftarrow$] Conversely, suppose the inequality holds for all pairs of points. Given $x, y \in S$ and $\theta \in [0,1]$, let $z = \theta y + (1-\theta)x$. Then
$$f(y) \ge f(z) + \nabla f(z)^\top (y - z) \quad (2.1)$$
$$f(x) \ge f(z) + \nabla f(z)^\top (x - z) \quad (2.2)$$
Multiplying (2.1) by $\theta$ and (2.2) by $(1-\theta)$ and adding, and noting that $\theta(y - z) + (1-\theta)(x - z) = 0$, we get
$$\theta f(y) + (1-\theta)f(x) \ge f(z) + \nabla f(z)^\top 0 = f(\theta y + (1-\theta)x).$$
2.4 Conditions for Optimality
We now want to find necessary and sufficient conditions for local optimality.
Definition 2.4.1. Consider a differentiable function f : S → R. A point x ∈ S at
which ∇f (x ) = 0 is called a stationary point.
Proof. Let us assume that $x$ is a local minimum for $f$. Then for all $d \in \mathbb{R}^n$, $f(x) \le f(x + \lambda d)$
for $\lambda$ small enough. Hence
$$0 \le \lim_{\lambda \to 0^+} \frac{f(x + \lambda d) - f(x)}{\lambda} = \nabla f(x)^\top d.$$
Since this holds for both $d$ and $-d$, we must have $\nabla f(x)^\top d = 0$ for all $d$, i.e. $\nabla f(x) = 0$.
For convex functions, however, it turns out that a stationary point necessarily implies that
the function is at its minimum. Together with the proposition above, this says that for a
convex function on $\mathbb{R}^n$, a point is optimal if and only if it is stationary.
Chapter 3
Notation for this chapter. In this chapter, we sometimes consider a multivariate func-
tion f whose domain is a set S ⊆ Rn , which we will require to be open. When we additionally
require that S is convex, we will specify this. Note that S = Rn is both open and convex
and it suffices to keep this case in mind. Things sometimes get more complicated if S is
not open, e.g. when the domain of f has a boundary. We will leave those complications for
another time.
The following theorem gives a useful characterization of (semi)definite matrices.
In order to prove this theorem, let us first recall the Spectral Theorem for symmetric matrices.
Theorem 3.1.3 (The Spectral Theorem for Symmetric Matrices). For all symmetric $A \in \mathbb{R}^{n \times n}$ there exist $V \in \mathbb{R}^{n \times n}$ and a diagonal matrix $\Lambda \in \mathbb{R}^{n \times n}$ s.t.
1. $A = V\Lambda V^\top$,
2. $V^\top V = I$, i.e. $V$ is orthogonal, and
3. $\Lambda_{ii} = \lambda_i(A)$.
1.
$$\lambda_i = \min_{\substack{\text{subspace } W \subseteq \mathbb{R}^n \\ \dim(W) = i}} \;\max_{x \in W, x \ne 0} \frac{x^\top A x}{x^\top x}$$
2.
$$\lambda_i = \max_{\substack{\text{subspace } W \subseteq \mathbb{R}^n \\ \dim(W) = n+1-i}} \;\min_{x \in W, x \ne 0} \frac{x^\top A x}{x^\top x}$$
Theorem 3.1.2 is an immediate corollary of Theorem 4.1.1, since the quadratic form satisfies $x^\top A x \ge \lambda_1(A)\|x\|_2^2$ for all $x \in \mathbb{R}^n$, with equality for a corresponding eigenvector.
Let us prove Part 1. Using the decomposition from Theorem 3.1.3, $A = V\Lambda V^\top$, where $\Lambda$ is a diagonal matrix of eigenvalues of $A$, which we take to be sorted in increasing order, consider $W = \operatorname{span}\{v_1, \ldots, v_i\}$. For any $x \in W$ with $\|x\|_2 = 1$, let $c = V^\top x$, so that $c(j) = 0$ for $j > i$ and $\|c\|_2 = 1$. Then
$$x^\top A x = (V^\top x)^\top \Lambda (V^\top x) = \sum_{j=1}^{i} \lambda_j c(j)^2 \le \lambda_i \|c\|_2^2 = \lambda_i.$$
So this choice of $W$ ensures the maximizer cannot achieve a value above $\lambda_i$.
But is it possible that the "minimizer" can do better by choosing a different $W$? Let
$T = \operatorname{span}\{v_i, \ldots, v_n\}$. As $\dim(T) = n+1-i$ and $\dim(W) = i$, we must have $\dim(W \cap T) \ge 1$,
by a standard property of subspaces. Hence for any $W$ of this dimension,
$$\max_{x \in W, x \ne 0} \frac{x^\top A x}{x^\top x} \ge \max_{x \in W \cap T, x \ne 0} \frac{x^\top A x}{x^\top x} \ge \min_{\substack{\text{subspace } V \subseteq T \\ \dim(V) = 1}} \;\max_{x \in V, x \ne 0} \frac{x^\top A x}{x^\top x} = \lambda_i,$$
where the last equality follows from a similar calculation to our first one. Thus, a value of at least $\lambda_i$ can always
be achieved by the "maximizer" for all $W$ of this dimension.
Part 2 can be dealt with similarly.
For example, consider
$$A := \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}.$$
For $x = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$, we have $Ax = 0$, so $\lambda = 0$ is an eigenvalue of $A$. For $x = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$, we have
$Ax = \begin{pmatrix} 2 \\ -2 \end{pmatrix} = 2x$, so $\lambda = 2$ is the other eigenvalue of $A$. As both are non-negative, by the
theorem above, $A$ is positive semidefinite.
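As a quick numerical sanity check of this example (assuming numpy):

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
print(np.linalg.eigvalsh(A))   # [0. 2.]: all non-negative, so A is PSD
```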
Since we are learning about symmetric matrices, there is one more fact that everyone should
know about them. We'll use $\lambda_{\max}(A)$ to denote the maximum eigenvalue of a matrix $A$, and
$\lambda_{\min}(A)$ the minimum.
Claim 3.1.5. For a symmetric matrix $A \in \mathbb{R}^{n \times n}$, $\|A\| = \max(|\lambda_{\max}(A)|, |\lambda_{\min}(A)|)$.
We will now use the second derivatives of a function to obtain characterizations of convexity
and optimality. We will begin by introducing the Hessian, the matrix of pairwise second
derivatives of a function. We will see that it plays a role in approximating a function via a
second-order Taylor expansion. We will then use semi-definiteness of the Hessian matrix to
characterize both conditions of optimality as well as the convexity of a function.
Definition 3.2.1. Given a function $f : S \to \mathbb{R}$, its Hessian matrix at point $x \in S$,
denoted $H_f(x)$ (also sometimes denoted $\nabla^2 f(x)$), is:
$$H_f(x) := \begin{pmatrix}
\frac{\partial^2 f(x)}{\partial x(1)^2} & \frac{\partial^2 f(x)}{\partial x(1)\partial x(2)} & \cdots & \frac{\partial^2 f(x)}{\partial x(1)\partial x(n)} \\
\frac{\partial^2 f(x)}{\partial x(2)\partial x(1)} & \frac{\partial^2 f(x)}{\partial x(2)^2} & \cdots & \frac{\partial^2 f(x)}{\partial x(2)\partial x(n)} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x(n)\partial x(1)} & \frac{\partial^2 f(x)}{\partial x(n)\partial x(2)} & \cdots & \frac{\partial^2 f(x)}{\partial x(n)^2}
\end{pmatrix}$$
$$f(y) = f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2}(y - x)^\top H_f(z)(y - x)$$
We can now give the second-order necessary conditions for local extrema via the Hessian.
Theorem 3.2.4. Let f : S → R be a function twice differentiable at x ∈ S. If x is a local
minimum, then H f (x ) is positive semidefinite.
Proof. Let us assume that x is a local minimum. We know from Proposition 3.2.3 that
∇f (x ) = 0, hence the second-order expansion at x takes the form:
$$f(x + \lambda d) = f(x) + \frac{1}{2}\lambda^2 d^\top H_f(x) d + o(\lambda^2 \|d\|_2^2)$$
Because $x$ is a local minimum, we can then derive
$$0 \le \lim_{\lambda \to 0^+} \frac{f(x + \lambda d) - f(x)}{\lambda^2} = \frac{1}{2} d^\top H_f(x) d.$$
This is true for any $d$, hence $H_f(x)$ is positive semidefinite.
Remark 3.2.5. Again, for this proposition to hold, it is important that S is open.
A local minimum thus is a stationary point and has a positive semi-definite Hessian. The
converse is almost true, but we need to strengthen the Hessian condition slightly.
Theorem 3.2.6. Let f : S → R be a function twice differentiable at a stationary point
x ∈ S. If H f (x ) is positive definite then x is a local minimum.
Proof. Let us assume that H f (x ) is positive definite. We know that x is a stationary point.
We can write the second-order expansion at x :
$$f(x + \delta) = f(x) + \frac{1}{2}\delta^\top H_f(x)\delta + o(\|\delta\|_2^2)$$
Because the Hessian is positive definite, it has a strictly positive minimum eigenvalue $\lambda_{\min}$,
and we can conclude that $\delta^\top H_f(x)\delta \ge \lambda_{\min}\|\delta\|_2^2$. From this, we conclude that when $\|\delta\|_2^2$ is
small enough, $f(x + \delta) - f(x) \ge \frac{1}{4}\lambda_{\min}\|\delta\|_2^2 > 0$. This proves that $x$ is a local minimum.
Remark 3.2.7. When H f (x ) is indefinite at a stationary point x , we have what is known
as a saddle point: x will be a minimum along the eigenvectors of H f (x ) for which the
eigenvalues are positive and a maximum along the eigenvectors of H f (x ) for which the
eigenvalues are negative.
Theorem 3.2.9. Let S ⊆ Rn be open and convex, and let f : S → R be twice continuously
differentiable.
Proof.
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x)$$
$$f(y) > f(x) + \nabla f(x)^\top (y - x).$$
Applying the same idea as in Theorem 3.3.5, we can show that in this case $f$ is strictly
convex.
3. Let $f$ be a convex function. For $x \in S$ and some small $\lambda > 0$, for any $d \in \mathbb{R}^n$ we have
that $x + \lambda d \in S$. From the Taylor expansion of $f$ we get:
$$f(x + \lambda d) = f(x) + \lambda \nabla f(x)^\top d + \frac{\lambda^2}{2} d^\top H_f(x) d + o(\lambda^2 \|d\|_2^2).$$
From Lemma 3.3.5 we get that if $f$ is convex then:
$$\frac{\lambda^2}{2} d^\top H_f(x) d + o(\|\lambda d\|^2) \ge 0.$$
Dividing by $\lambda^2$ and taking $\lambda \to 0^+$ gives us that for any $d \in \mathbb{R}^n$: $d^\top H_f(x) d \ge 0$.
Remark 3.2.10. It is important to note that if $S$ is open and $f$ is strictly convex, then
$H_f(x)$ may still (only) be positive semi-definite $\forall x \in S$. Consider $f(x) = x^4$, which is
strictly convex; the Hessian is $H_f(x) = 12x^2$, which equals 0 at $x = 0$.
3.3 Gradient Descent - An Approach to Optimization?
We have begun to develop an understanding of convex functions, and what we have learned
already suggests a way for us to try to find an approximate minimizer of a given convex
function.
Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, and we want to solve
$$\min_{x \in \mathbb{R}^n} f(x).$$
We would like to find x ∗ , a global minimizer of f . Suppose we start with some initial guess
x 0 , and we want to update it to x 1 with f (x 1 ) < f (x 0 ). If we can repeatedly make updates
like this, maybe we eventually find a point with nearly minimum function value, i.e. some
x̃ with f (x̃ ) ≈ f (x ∗ )?
Recall that $f(x_0 + \delta) = f(x_0) + \nabla f(x_0)^\top \delta + o(\|\delta\|_2)$. This means that if we choose
$x_1 = x_0 - \lambda\nabla f(x_0)$, we get $f(x_1) \approx f(x_0) - \lambda\|\nabla f(x_0)\|_2^2$, which is less than $f(x_0)$ for small enough $\lambda > 0$, unless $\nabla f(x_0) = 0$.
We say that $f$ is $\beta$-gradient Lipschitz if for all $x, y$:
$$\|\nabla f(x) - \nabla f(y)\|_2 \le \beta\|x - y\|_2.$$
You will prove this in Exercise 13 (Week 2) of the first exercise set.
Proposition 3.3.3. Let f : S → R be a β-gradient Lipschitz function. Then for all x , y ∈ S,
$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2}\|x - y\|_2^2$$
To prove this proposition, we need the following result from multi-variate calculus. This is
a restricted form of the fundamental theorem of calculus for line integrals.
$$f(y) = f(x) + \int_{\theta=0}^{1} \nabla f(x_\theta)^\top (y - x)\, d\theta, \quad \text{where } x_\theta = x + \theta(y - x).$$
Using this, we can write
$$f(y) = f(x) + \int_{\theta=0}^{1} \nabla f(x)^\top (y - x)\, d\theta + \int_{\theta=0}^{1} (\nabla f(x_\theta) - \nabla f(x))^\top (y - x)\, d\theta.$$
Next we use Cauchy-Schwarz pointwise, and we also evaluate the first integral:
$$\le f(x) + \nabla f(x)^\top (y - x) + \int_{\theta=0}^{1} \|\nabla f(x_\theta) - \nabla f(x)\|_2 \|y - x\|_2\, d\theta.$$
Then we apply $\beta$-gradient Lipschitzness and note $x_\theta - x = \theta(y - x)$:
$$\le f(x) + \nabla f(x)^\top (y - x) + \int_{\theta=0}^{1} \beta\theta\|y - x\|_2^2\, d\theta = f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2}\|y - x\|_2^2.$$
It turns out that just knowing a function f : Rn → R is convex and β-gradient Lipschitz is
enough to let us figure out a reasonable step size for Gradient Descent and let us analyze its
convergence.
We start at a point x 0 ∈ Rn , and we try to find a point x 1 = x 0 + δ with lower function
value. We will let our upper bound from Proposition 3.3.3 guide us, in fact, we could ask,
what is the best update for minimizing this upper bound, i.e. a $\delta$ solving
$$\min_{\delta \in \mathbb{R}^n} f(x_0) + \nabla f(x_0)^\top \delta + \frac{\beta}{2}\|\delta\|^2.$$
We can compute the best $\delta$ according to this upper bound by noting first that the function is
convex and continuously differentiable, and hence will be minimized at any point where the
gradient is zero. Thus we want $0 = \nabla_\delta\left(f(x_0) + \nabla f(x_0)^\top \delta + \frac{\beta}{2}\|\delta\|^2\right) = \nabla f(x_0) + \beta\delta$,
which occurs at $\delta = -\frac{1}{\beta}\nabla f(x_0)$.
Plugging this value into the upper bound, we get that $f(x_1) \le f(x_0) - \frac{\|\nabla f(x_0)\|_2^2}{2\beta}$.
Now, as our algorithm, we will start with some guess $x_0$, and then at every step we will
update our guess using the best step based on our Proposition 3.3.3 upper bound on $f$ at
$x_i$, and so we get
$$x_{i+1} = x_i - \frac{1}{\beta}\nabla f(x_i) \quad \text{and} \quad f(x_{i+1}) \le f(x_i) - \frac{\|\nabla f(x_i)\|_2^2}{2\beta}. \quad (3.1)$$
Let us try to prove that our algorithm converges toward an x with low function value.
We will measure this by looking at
$$\mathrm{gap}_i = f(x_i) - f(x^*),$$
where $x^*$ is a global minimizer of $f$ (note that there may not be a unique minimizer of $f$).
We will try to show that this function value gap grows small. Using $f(x_{i+1}) - f(x_i) = \mathrm{gap}_{i+1} - \mathrm{gap}_i$, we get
$$\mathrm{gap}_{i+1} - \mathrm{gap}_i \le -\frac{\|\nabla f(x_i)\|_2^2}{2\beta}. \quad (3.2)$$
If the $\mathrm{gap}_i$ value is never too much bigger than $\frac{\|\nabla f(x_i)\|_2^2}{2\beta}$, then this should help us show we
are making progress. But how much can they differ? We will now try to show a limit on
this.
Recall that in the previous chapter we showed the following theorem.
Theorem 3.3.5. Let $S$ be an open convex subset of $\mathbb{R}^n$, and let $f : S \to \mathbb{R}$ be a differentiable
function. Then, $f$ is convex if and only if for any $x, y \in S$ we have that $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$.
Using the convexity of $f$ and the lower bound on convex functions given by Theorem 3.3.5,
we have that
$$f(x^*) \ge f(x_i) + \nabla f(x_i)^\top (x^* - x_i). \quad (3.3)$$
Rearranging and applying Cauchy-Schwarz gets us
$$\mathrm{gap}_i = f(x_i) - f(x^*) \le \|\nabla f(x_i)\|_2 \|x_i - x^*\|_2. \quad (3.4)$$
At this point, we are essentially ready to connect Equation (3.2) with Equation (3.4) and
analyze the convergence rate of our algorithm.
However, at the moment, we see that the change $\mathrm{gap}_{i+1} - \mathrm{gap}_i$ in how close we are to
the optimum function value is governed by the norm of the gradient $\|\nabla f(x_i)\|_2$, while the
size of the gap is related to both this quantity and the distance $\|x_i - x^*\|_2$ between the
current solution $x_i$ and an optimum $x^*$. Do we need both or can we get rid of, say, the
distance? Unfortunately, with this algorithm and for this class of functions, a dependence
on the distance is necessary. However, we can simplify things considerably using the following
observation, which you will prove in the exercises (Exercise 2):
Claim 3.3.6. When running Gradient Descent as given by the step in Equation (3.1), for
all $i$, $\|x_i - x^*\|_2 \le \|x_0 - x^*\|_2$.
At this point, a simple induction will complete the proof of the following result:
$$f(x_k) - f(x^*) \le \frac{2\beta\|x_0 - x^*\|_2^2}{k+1}.$$
Carrying out this induction is one of the Week 2 exercises (Exercise 15 in the first exercise
set).
and Yudin [NY83]. The phenomenon of acceleration was perhaps first understood in the
context of quadratic functions, minimizing x > Ax − x > b when A is positive definite – for
this case, the Conjugate Gradient algorithm was developed independently by Hestenes and
Stiefel [HS+ 52] (here at ETH!), and by Lanczos [Lan52]. In the past few years, providing more
intuitive explanations of acceleration has been a popular research topic. This presentation is
based on an analysis of Nesterov’s algorithm developed by Diakonikolas and Orecchia [DO19].
We will adopt a slightly different approach to analyzing this algorithm than what we used
in the previous section for Gradient Descent.
We will use x 0 to denote the starting point of our algorithm, and we will produce a sequence
of iterates x 0 , x 1 , x 2 , . . . , x k . At each iterate x i , we will compute the gradient ∇f (x i ).
However, the way we choose x i+1 based on what we know so far will now be a little more
involved than what we did for Gradient Descent.
To help us understand the algorithm, we are going to introduce two more sequences of
iterates y 0 , y 1 , y 2 , . . . , y k ∈ Rn and v 0 , v 1 , v 2 , . . . , v k ∈ Rn .
The sequence of y i ’s will be constructed to help us get as low a function value as possible at
f (y i ), which we will consider our current solution and the last iterate y k will be the output
solution of our algorithm.
The sequence of v i ’s will be constructed to help us get a lower bound on f (x ∗ ).
By combining the upper bound on the function value of our current solution f (y i ) with a
lower bound on the function value at an optimal solution f (x ∗ ), we get an upper bound on
the gap f (y i ) − f (x ∗ ) between the value of our solution and the optimal one. Finally, each
iterate x i , which will be where we evaluate gradient ∇f (x i ), is chosen through a trade-off
between wanting to reduce the upper bound and wanting to increase the lower bound.
The upper bound sequence: $y_i$'s. The point $y_i$ will be chosen from $x_i$ to minimize an
upper bound on $f$ based at $x_i$. This is similar to what we did in the previous section. We let
$y_i = x_i + \delta_i$ and choose $\delta_i$ to minimize the upper bound $f(y_i) \le f(x_i) + \nabla f(x_i)^\top \delta_i + \frac{\beta}{2}\|\delta_i\|^2$,
which gives us
$$y_i = x_i - \frac{1}{\beta}\nabla f(x_i) \quad \text{and} \quad f(y_i) \le f(x_i) - \frac{\|\nabla f(x_i)\|_2^2}{2\beta}.$$
We denote the function value of our current solution by $U_i$:
$$U_i = f(y_i) \le f(x_i) - \frac{\|\nabla f(x_i)\|_2^2}{2\beta}. \quad (3.6)$$
Philosophizing about lower bounds¹. A crucial ingredient to establishing an upper
bound on $\mathrm{gap}_i$ was a lower bound on $f(x^*)$.
In our analysis of Gradient Descent, in Equation (3.4), we used the lower bound
$f(x^*) \ge f(x_i) - \|\nabla f(x_i)\|_2\|x_i - x^*\|_2$. We can think of the Gradient Descent analysis as
being based on a tension between two statements: Firstly that "a large gradient implies we
quickly approach the optimum" and secondly "the function value gap to optimum cannot
exceed the magnitude of the current gradient (scaled by distance to opt)".
This analysis does not use that we have seen many different function values and gradients,
and each of these can be used to construct a lower bound on the optimum value f (x ∗ ), and,
in particular, it is not clear that the last gradient provides the best bound. To do better, we
will try to use lower bounds that take advantage of all the gradients we have seen.
Definition 3.4.1. We will adopt a new notation for inner products that sometimes is
more convenient when dealing with large expressions: $\langle a, b \rangle \stackrel{\text{def}}{=} a^\top b$.
The lower bound sequence: $v_i$'s. We can introduce weights $a_i > 0$ for each step and
combine the gradients we have observed into one lower bound based on a weighted average.
Let us use $A_i = \sum_{j \le i} a_j$ to denote the sum of the weights. Now a general lower bound on
the function value at any $v \in \mathbb{R}^n$ is:
$$f(v) \ge \frac{1}{A_i}\sum_{j \le i} a_j\left(f(x_j) + \langle \nabla f(x_j), v - x_j \rangle\right)$$
However, to use Cauchy-Schwarz on each individual term here to instantiate this bound at
$x^*$ does not give us anything useful. Instead, we will employ a somewhat magical trick: we
introduce a regularization term
$$\phi(v) \stackrel{\text{def}}{=} \frac{\sigma}{2}\|v - x_0\|_2^2.$$
We will choose the value $\sigma > 0$ later. Now we derive our lower bound $L_i$:
$$f(x^*) \ge \frac{1}{A_i}\left(\phi(x^*) + \sum_{j \le i}\left(a_j f(x_j) + \langle a_j\nabla f(x_j), x^* - x_j\rangle\right)\right) - \frac{\phi(x^*)}{A_i}$$
$$\ge \min_{v \in \mathbb{R}^n}\left\{\frac{1}{A_i}\left(\phi(v) + \sum_{j \le i}\left(a_j f(x_j) + \langle a_j\nabla f(x_j), v - x_j\rangle\right)\right)\right\} - \frac{\phi(x^*)}{A_i} = L_i$$
We will let $v_i$ be the $v$ obtaining the minimum in the optimization problem appearing in
the definition of $L_i$, so that
$$L_i = \frac{1}{A_i}\left(\phi(v_i) + \sum_{j \le i}\left(a_j f(x_j) + \langle a_j\nabla f(x_j), v_i - x_j\rangle\right)\right) - \frac{\phi(x^*)}{A_i}.$$
1. YMMV. People have a lot of different opinions about how to understand acceleration, and you should take my thoughts with a grain of salt.
How we will measure convergence. We have designed the upper bound Ui and the
lower bound Li such that gapi = f (y i ) − f (x ∗ ) ≤ Ui − Li .
As you will show in Exercise 3, we can prove the convergence of Gradient Descent directly by
an induction that establishes $1/\mathrm{gap}_i \ge C \cdot i$ for some constant $C$ depending on the Lipschitz
gradient parameter $\beta$ and the distance $\|x_0 - x^*\|_2$.
To analyze Accelerated Gradient Descent, we will adopt a similar, but slightly different
strategy, namely trying to show that $(U_i - L_i)r(i)$ is non-increasing for some positive "rate
function" $r(i)$. Ideally $r(i)$ should grow quickly, which would imply that $\mathrm{gap}_i$ quickly gets
small. We will also need to show that $(U_0 - L_0)r(0) \le C$ for some constant $C$ again depending
on $\beta$ and $\|x_0 - x^*\|_2$. Then, we'll be able to conclude that $\mathrm{gap}_i \le U_i - L_i \le C/r(i)$.
Establishing the convergence rate. Let's start by looking at the change in the upper
bound scaled by our rate function:
$$A_{i+1}U_{i+1} - A_iU_i = A_{i+1}\left(f(y_{i+1}) - f(x_{i+1})\right) - A_i\left(f(y_i) - f(x_{i+1})\right) + (A_{i+1} - A_i)f(x_{i+1}) \quad (3.7)$$
$$\le -A_{i+1}\frac{\|\nabla f(x_{i+1})\|_2^2}{2\beta} \quad \text{(first term controlled by Equation (3.6))}$$
$$\quad - A_i\langle\nabla f(x_{i+1}), y_i - x_{i+1}\rangle \quad \text{(second term bounded by Theorem 3.3.5)}$$
$$\quad + a_{i+1}f(x_{i+1}) \quad \text{(third term uses } a_{i+1} = A_{i+1} - A_i\text{)}$$
The solution v i to the minimization in the lower bound Li turns out to be relatively simple
to characterize. By using derivatives to find the optimum, we first analyze the initial value
of the lower bound L0 .
Claim 3.4.2.
1. $v_0 = x_0 - \frac{a_0}{\sigma}\nabla f(x_0)$
2. $L_0 = f(x_0) - \frac{a_0}{2\sigma}\|\nabla f(x_0)\|_2^2 - \frac{\sigma}{2a_0}\|x^* - x_0\|_2^2$.
You will prove Claim 3.4.2 in Exercise 17 (Week 2) of the first exercise sheet. Noting $A_0 = a_0$,
we see from Equation (3.6) and Part 2 of Claim 3.4.2 that
$$A_0(U_0 - L_0) \le \left(\frac{a_0^2}{2\sigma} - \frac{a_0}{2\beta}\right)\|\nabla f(x_0)\|_2^2 + \frac{\sigma}{2}\|x^* - x_0\|_2^2. \quad (3.8)$$
It will be convenient to introduce notation for the rescaled lower bound $A_iL_i$ without opti-
mizing over $v$:
$$m_i(v) = \phi(v) - \phi(x^*) + \sum_{j \le i}\left(a_j f(x_j) + \langle a_j\nabla f(x_j), v - x_j\rangle\right)$$
Thus $A_iL_i = m_i(v_i)$, and $A_iL_i - A_{i+1}L_{i+1} = m_i(v_i) - m_{i+1}(v_{i+1})$. Now, it is not too hard to show the following
relationships.
Claim 3.4.3.
1. $m_i(v) = m_i(v_i) + \frac{\sigma}{2}\|v - v_i\|_2^2$
2. $v_{i+1} = v_i - \frac{a_{i+1}}{\sigma}\nabla f(x_{i+1})$
And again, you will prove Claim 3.4.3 in Exercise 18 (Week 2) of the first exercise sheet.
Hint for Part 1: note that $m_i(v)$ is a quadratic function, minimized at $v_i$, and its Hessian
equals $\sigma I$ at all $v$.
Given Claim 3.4.3, we see that
$$A_iL_i - A_{i+1}L_{i+1} = m_i(v_i) - m_{i+1}(v_{i+1}) \quad (3.9)$$
$$= -a_{i+1}f(x_{i+1}) - \langle a_{i+1}\nabla f(x_{i+1}), v_{i+1} - x_{i+1}\rangle - \frac{\sigma}{2}\|v_{i+1} - v_i\|_2^2 \quad (3.10)$$
$$= -a_{i+1}f(x_{i+1}) - \langle a_{i+1}\nabla f(x_{i+1}), v_i - x_{i+1}\rangle + \frac{a_{i+1}^2}{2\sigma}\|\nabla f(x_{i+1})\|_2^2 \quad (3.11)$$
This means that by combining Equations (3.7) and (3.11) we get
$$A_{i+1}(U_{i+1} - L_{i+1}) - A_i(U_i - L_i) \le \left(\frac{-A_{i+1}}{2\beta} + \frac{a_{i+1}^2}{2\sigma}\right)\|\nabla f(x_{i+1})\|_2^2 + \langle\nabla f(x_{i+1}), A_{i+1}x_{i+1} - a_{i+1}v_i - A_iy_i\rangle.$$
Now, this means that $A_{i+1}(U_{i+1} - L_{i+1}) - A_i(U_i - L_i) \le 0$ if
$$a_{i+1}^2 \le \frac{\sigma}{\beta}A_{i+1} \quad \text{and} \quad A_{i+1}x_{i+1} - a_{i+1}v_i - A_iy_i = 0.$$
We can get this by letting $x_{i+1} = \frac{A_iy_i + a_{i+1}v_i}{A_{i+1}}$, $\sigma = \beta$, and $a_i = \frac{i+1}{2}$, which implies that
$A_i = \frac{(i+1)(i+2)}{4} > a_i^2$.
By Equation (3.8), these parameter choices also imply that
$$A_0(U_0 - L_0) \le \frac{\beta}{2}\|x^* - x_0\|_2^2.$$
Finally, by induction, we get $A_i(U_i - L_i) \le \frac{\beta}{2}\|x^* - x_0\|_2^2$. Dividing through by $A_i$ and using
$\mathrm{gap}_i \le U_i - L_i$ results in the following theorem.
Accelerated Gradient Descent, given by the updates
$$a_i = \frac{i+1}{2}, \quad A_i = \frac{(i+1)(i+2)}{4},$$
$$v_0 = x_0 - \frac{1}{2\beta}\nabla f(x_0),$$
$$y_i = x_i - \frac{1}{\beta}\nabla f(x_i),$$
$$x_{i+1} = \frac{A_iy_i + a_{i+1}v_i}{A_{i+1}},$$
$$v_{i+1} = v_i - \frac{a_{i+1}}{\beta}\nabla f(x_{i+1}),$$
ensures that the $k$th iterate satisfies
$$f(x_k) - f(x^*) \le \frac{2\beta\|x_0 - x^*\|_2^2}{(k+1)(k+2)}.$$
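For illustration, here is a minimal sketch of these update rules (assuming numpy; the quadratic test function is a hypothetical example, not from the notes):

```python
import numpy as np

# Hypothetical quadratic f(x) = 0.5 x^T A x - b^T x with grad f(x) = A x - b.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
beta = np.linalg.eigvalsh(A).max()         # f is beta-gradient Lipschitz

a = lambda i: (i + 1) / 2                  # a_i = (i+1)/2
Asum = lambda i: (i + 1) * (i + 2) / 4     # A_i = (i+1)(i+2)/4

x = np.zeros(2)                            # x_0
v = x - grad(x) / (2 * beta)               # v_0 = x_0 - (1/2beta) grad f(x_0)
for i in range(200):
    y = x - grad(x) / beta                 # y_i = x_i - (1/beta) grad f(x_i)
    x = (Asum(i) * y + a(i + 1) * v) / Asum(i + 1)   # x_{i+1}
    v = v - a(i + 1) * grad(x) / beta      # v_{i+1} = v_i - (a_{i+1}/beta) grad f(x_{i+1})

x_star = np.linalg.solve(A, b)
print(np.linalg.norm(y - x_star))          # the output y_k approaches the minimizer
```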
Part II
Chapter 4
In this chapter, we will study graphs through linear algebra. This approach is known as
Spectral Graph Theory and turns out to be surprisingly powerful. An in-depth treatment of
many topics in this area can be found in [Spi19].
To introduce the edge-vertex incidence matrix of the graph, we first have to associate an
arbitrary direction to every edge. We then let $B \in \mathbb{R}^{V \times E}$ be given by
$$B(v, e) = \begin{cases} 1 & \text{if } e = (u, v) \\ -1 & \text{if } e = (v, u) \\ 0 & \text{otherwise.} \end{cases}$$
The edge directions are only there to help us track the meaning of signs of quantities defined
on edges: The math we do should not depend on the choice of sign.
Let $W \in \mathbb{R}^{E \times E}$ be the diagonal matrix given by $W = \mathrm{diag}(w)$, i.e. $W(e, e) = w(e)$. We
define the Laplacian of the graph as $L = BWB^\top$. Note that in the first chapter, we defined
the Laplacian as $BR^{-1}B^\top$, where $R$ is the diagonal matrix with edge resistances on the
diagonal. We want to think of high weight on an edge as expressing that two vertices are
highly connected, whereas we think of high resistance on an edge as expressing that the two
vertices are poorly connected, so we let $w(e) = 1/R(e, e)$.
The weighted adjacency matrix $A \in \mathbb{R}^{V \times V}$ of a graph is given by
$$A(u, v) = \begin{cases} w(u, v) & \text{if } \{u, v\} \in E \\ 0 & \text{otherwise.} \end{cases}$$
Note that we treat the edges as undirected here, so $A^\top = A$. The weighted degree of a
vertex is defined as $d(v) = \sum_{\{u,v\} \in E} w(u, v)$. Again we treat the edges as undirected. Let
$D = \mathrm{diag}(d)$ be the diagonal matrix in $\mathbb{R}^{V \times V}$ with weighted degrees on the diagonal.
In Problem Set 1, you showed that $L = D - A$, and that for $x \in \mathbb{R}^V$,
$$x^\top L x = \sum_{\{a,b\} \in E} w(a, b)(x(a) - x(b))^2.$$
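Here is a small numerical check of both identities on a hypothetical weighted triangle (assuming numpy):

```python
import numpy as np

# Hypothetical weighted triangle on vertices {0, 1, 2}.
edges = [(0, 1), (1, 2), (0, 2)]
w = np.array([1.0, 2.0, 3.0])
n = 3

B = np.zeros((n, len(edges)))
for e, (u, v) in enumerate(edges):
    B[v, e], B[u, e] = 1.0, -1.0
L = B @ np.diag(w) @ B.T                       # L = B W B^T

A = np.zeros((n, n))                           # weighted adjacency matrix
for e, (u, v) in enumerate(edges):
    A[u, v] = A[v, u] = w[e]
D = np.diag(A.sum(axis=1))                     # weighted degree matrix

print(np.allclose(L, D - A))                   # True: L = D - A

x = np.array([1.0, -2.0, 0.5])
quad = sum(w[e] * (x[u] - x[v]) ** 2 for e, (u, v) in enumerate(edges))
print(np.isclose(x @ L @ x, quad))             # True: x^T L x matches the sum
```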
Recall from Chapter 1 that a flow $f \in \mathbb{R}^E$ routes demands $d$ when $Bf = d$.
This is also called a conservation constraint. In our examples so far, we have $d(s) = -1$,
$d(t) = 1$ and $d(u) = 0$ for all $u \in V \setminus \{s, t\}$.
If we let $R = \mathrm{diag}_{e \in E} r(e)$, then Ohm's law tells us that electrical voltages $x$ will induce an
electrical flow $f = R^{-1}B^\top x$. We defined the electrical energy of a flow $f \in \mathbb{R}^E$ to be
$$\mathcal{E}(f) = \sum_e r(e)f(e)^2 = f^\top Rf.$$
For the flow induced by voltages $x$, this energy can be expressed directly in terms of the voltages:
$$\mathcal{E}(x) = x^\top L x.$$
The Courant-Fisher Theorem. Let us also recall the Courant-Fischer theorem, which
we proved in Chapter 3 (Theorem 4.1.1).
Theorem 4.1.1 (The Courant-Fischer Theorem). Let $A$ be a symmetric matrix in $\mathbb{R}^{n \times n}$,
with eigenvalues $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$. Then
1.
$$\lambda_i = \min_{\substack{\text{subspace } W \subseteq \mathbb{R}^n \\ \dim(W) = i}} \;\max_{x \in W, x \ne 0} \frac{x^\top A x}{x^\top x}$$
2.
$$\lambda_i = \max_{\substack{\text{subspace } W \subseteq \mathbb{R}^n \\ \dim(W) = n+1-i}} \;\min_{x \in W, x \ne 0} \frac{x^\top A x}{x^\top x}$$
In fact, from our proof of the Courant-Fischer theorem in Chapter 3, we can also extract a
slightly different statement:
Theorem 4.1.2 (The Courant-Fischer Theorem, eigenbasis version). Let $A$ be a symmetric
matrix in $\mathbb{R}^{n \times n}$, with eigenvalues $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$, and corresponding eigenvectors
$x_1, x_2, \ldots, x_n$ which form an orthonormal basis. Then
1.
$$\lambda_i = \min_{\substack{x \perp x_1, \ldots, x_{i-1} \\ x \ne 0}} \frac{x^\top A x}{x^\top x}$$
2.
$$\lambda_i = \max_{\substack{x \perp x_{i+1}, \ldots, x_n \\ x \ne 0}} \frac{x^\top A x}{x^\top x}$$
Of course, we also have $\lambda_i(A) = \frac{x_i^\top A x_i}{x_i^\top x_i}$.
Eigenvalues of the Laplacian of a Complete Graph. To get a sense of how Laplacian
eigenvalues behave, let us start by considering the $n$ vertex complete graph with unit weights,
which we denote by $K_n$. The adjacency matrix of $K_n$ is $A = \mathbf{1}\mathbf{1}^\top - I$, since it has ones
everywhere, except for the diagonal, where entries are zero. The degree matrix is $D = (n-1)I$.
Thus the Laplacian is $L = D - A = nI - \mathbf{1}\mathbf{1}^\top$.
Thus for any $y \perp \mathbf{1}$, we have $y^\top L y = ny^\top y - (\mathbf{1}^\top y)^2 = ny^\top y$.
From this, we can conclude that any $y \perp \mathbf{1}$ is an eigenvector of eigenvalue $n$, and that
$\lambda_2 = \lambda_3 = \ldots = \lambda_n = n$.
Next, let us try to understand λ2 and λn for Pn , the n vertex path graph with unit weight
edges. I.e. the graph has edges E = {{i, i + 1} for i = 1 to (n − 1)}.
This is in a sense the least well-connected unit weight graph on n vertices, whereas Kn is
the most well-connected.
We can use the eigenbasis version of the Courant-Fischer theorem to observe that the second-
smallest eigenvalue of the Laplacian is given by
$$\lambda_2(L) = \min_{\substack{x \ne 0 \\ x^\top \mathbf{1} = 0}} \frac{x^\top L x}{x^\top x}. \quad (4.1)$$
We can get a better understanding of this particular case through a couple of simple observa-
tions. Suppose $x = y + \alpha\mathbf{1}$, where $y \perp \mathbf{1}$. Then $x^\top L x = y^\top L y$, and $\|x\|_2^2 = \|y\|^2 + \alpha^2\|\mathbf{1}\|^2$.
So for any given vector, you can increase the value of $\frac{x^\top L x}{x^\top x}$ by instead replacing $x$ with the
component orthogonal to $\mathbf{1}$, which we denoted by $y$.
We can conclude from Equation (4.1) that for any vector $y \perp \mathbf{1}$,
$$\lambda_2 \le \frac{y^\top L y}{y^\top y}.$$
When we use a vector y in this way to prove a bound on an eigenvalue, we call it a test
vector.
Now, we’ll use a test vector to give an upper bound on λ2 (LPn ). Let x ∈ RV be given by
x (i) = (n + 1) − 2i, for i ∈ [n]. This vector satisfies x ⊥1. We picked this because we wanted
a sequence of values growing linearly along the path, while also making sure that the vector
is orthogonal to $\mathbf{1}$. Now
$$\lambda_2(L_{P_n}) \le \frac{\sum_{i \in [n-1]}(x(i) - x(i+1))^2}{\sum_{i=1}^{n} x(i)^2} = \frac{\sum_{i=1}^{n-1} 2^2}{\sum_{i=1}^{n}(n+1-2i)^2} = \frac{4(n-1)}{(n+1)n(n-1)/3} = \frac{12}{n(n+1)} \le \frac{12}{n^2}.$$
Later, we will prove a lower bound that shows this value is right up to a constant factor. But
the test vector approach based on the Courant-Fischer theorem doesn’t immediately work
when we want to prove lower bounds on λ2 (L).
We can see from either version of the Courant-Fischer theorem that
$$\lambda_n(L) = \max_{v \ne 0} \frac{v^\top L v}{v^\top v}. \quad (4.2)$$
Hence, for any vector $y \ne 0$,
$$\lambda_n \ge \frac{y^\top L y}{y^\top y}.$$
This means we can get a test vector-based lower bound on $\lambda_n$. Let us apply this to the
Laplacian of $P_n$. We'll try the vector $x \in \mathbb{R}^V$ given by $x(1) = -1$, $x(n) = 1$, and
$x(i) = 0$ for $i \ne 1, n$.
Here we get
$$\lambda_n(L_{P_n}) \ge \frac{x^\top L x}{x^\top x} = \frac{2}{2} = 1.$$
Again, it's not clear how to use the Courant-Fischer theorem to prove an upper bound on
$\lambda_n(L)$. But later we'll see how to prove an upper bound showing that for $P_n$, the lower bound
we obtained is right up to constant factors.
In the previous sections, we first saw a complete characterization of the eigenvalues and
eigenvectors of the unit weight complete graph on $n$ vertices, $K_n$. Namely, $L_{K_n} = nI - \mathbf{1}\mathbf{1}^\top$,
and this means that every vector $y \perp \mathbf{1}$ is an eigenvector of eigenvalue $n$.
We then looked at eigenvalues of $P_n$, the unit weight path on $n$ vertices, and we showed
using test vector bounds that
$$\lambda_2(L_{P_n}) \le \frac{12}{n^2} \quad \text{and} \quad 1 \le \lambda_n(L_{P_n}). \quad (4.3)$$
Ideally we would like to prove an almost matching lower bound on $\lambda_2$ and an almost matching
upper bound on $\lambda_n$, but it is not clear how to get that from the Courant-Fischer theorem.
To get there, we need to introduce some more tools.
We'll now introduce an ordering on symmetric matrices called the Loewner order, which I
also like to just call the positive semi-definite order. As we will see in a moment, it is a partial
order on symmetric matrices; we denote it by "$\preceq$". For convenience, we allow ourselves to
both write $A \preceq B$ and equivalently $B \succeq A$.
For a symmetric matrix $A \in \mathbb{R}^{n \times n}$ we define that
$$A \succeq 0$$
if and only if $A$ is positive semi-definite.
More generally, when we have two symmetric matrices $A, B \in \mathbb{R}^{n \times n}$, we will write $A \preceq B$ if and only if for all $x \in \mathbb{R}^n$,
$$x^\top A x \le x^\top B x. \quad (4.4)$$
This relation is a partial order; it satisfies:
1. Reflexivity: $A \preceq A$.
2. Anti-symmetry: $A \preceq B$ and $B \preceq A$ implies $A = B$.
3. Transitivity: $A \preceq B$ and $B \preceq C$ implies $A \preceq C$.
Claim 4.2.2. If $A \preceq B$, then for all $i$,
$$\lambda_i(A) \le \lambda_i(B).$$
Proof. We can prove this Claim by applying the subspace version of the Courant-Fischer
theorem:
$$\lambda_i(A) = \min_{\substack{\text{subspace } W \subseteq \mathbb{R}^n \\ \dim(W)=i}} \;\max_{x \in W, x \ne 0} \frac{x^\top A x}{x^\top x} \le \min_{\substack{\text{subspace } W \subseteq \mathbb{R}^n \\ \dim(W)=i}} \;\max_{x \in W, x \ne 0} \frac{x^\top B x}{x^\top x} = \lambda_i(B).$$
Note that the converse of Claim 4.2.2 is very much false; for example the matrices
$A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}$ and $B = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$ have equal eigenvalues, but both $A \not\preceq B$ and $B \not\preceq A$.
Remark 4.2.3. It’s useful to get used to and remember some of the properties of the
Loewner order, but all the things we have established so far are almost immediate from the
basic characterization in Equation (4.4). So, ideally, don’t memorize all these facts, instead,
try to see that they are simple consequences of the definition.
It's sometimes convenient to overload the notation for the PSD order to also apply to graphs. We
will write
$$G \preceq H$$
if $L_G \preceq L_H$.
For example, given two unit weight graphs $G = (V, E)$ and $H = (V, F)$, if $H$ is a
subgraph of $G$, then
$$L_H \preceq L_G.$$
Dropping edges will only decrease the value of the quadratic form. The same holds for decreasing
the weights of edges. The graph order notation is especially useful when we allow for scaling
a graph by a constant, say $c > 0$:
$$c \cdot H \preceq G.$$
What is $c \cdot H$? It is the same graph as $H$, but the weight of every edge is multiplied by $c$.
Now we can make statements like $\frac{1}{2}H \preceq G \preceq 2H$, which turn out to be a useful notion of the
two graphs approximating each other.
Now, we'll see a general tool for comparing two graphs $G$ and $H$ to prove inequalities like
$cH \preceq G$ for some constant $c$. Our tools won't necessarily work well for all cases, but we'll
see some examples where they do.
In the rest of the chapter, we will often need to compare two graphs defined on the same
vertex set V = {1, . . . , n} = [n].
We use Gi,j to denote the unit weight graph on vertex set [n] consisting of a single edge
between vertices i and j.
$$G_{1,n} \preceq (n-1) \cdot P_n.$$
To prove this, for a given $x \in \mathbb{R}^V$, let
$$\Delta(i) = x(i+1) - x(i).$$
The inequality we want to prove then becomes
$$(n-1)\sum_{i=1}^{n-1}(\Delta(i))^2 \ge \left(\sum_{i=1}^{n-1}\Delta(i)\right)^2.$$
But this is immediate from the Cauchy-Schwarz inequality $a^\top b \le \|a\|_2\|b\|_2$:
$$(n-1)\sum_{i=1}^{n-1}(\Delta(i))^2 = \|\mathbf{1}_{n-1}\|^2 \cdot \|\Delta\|^2 = (\|\mathbf{1}_{n-1}\| \cdot \|\Delta\|)^2 \ge (\mathbf{1}_{n-1}^\top\Delta)^2 = \left(\sum_{i=1}^{n-1}\Delta(i)\right)^2.$$
We will now use Lemma 4.2.5 to prove a lower bound on $\lambda_2(L_{P_n})$. Our strategy will be to
prove that the path $P_n$ is at least some multiple of the complete graph $K_n$, measured by the
Loewner order, i.e. $K_n \preceq f(n) \cdot P_n$ for some function $f : \mathbb{N} \to \mathbb{R}$. We can combine this with
our observation earlier that $\lambda_2(L_{K_n}) = n$ to show that
$$n = \lambda_2(L_{K_n}) \le f(n) \cdot \lambda_2(L_{P_n}),$$
and this will give our lower bound on $\lambda_2(L_{P_n})$. When establishing the inequality between $P_n$
and $K_n$, we can treat each edge of the complete graph separately, by first noting that
$$L_{K_n} = \sum_{i<j} L_{G_{i,j}}.$$
For every edge $(i, j)$ in the complete graph, we apply the Path Inequality, Lemma 4.2.5:
$$G_{i,j} \preceq (j-i)\sum_{k=i}^{j-1} G_{k,k+1} \preceq (j-i)P_n.$$
This inequality says that $G_{i,j}$ is at most $(j-i)$ times the part of the path connecting $i$ to $j$,
and that this part of the path is less than the whole.
Summing this inequality over all edges $(i, j) \in K_n$ gives
$$K_n = \sum_{i<j} G_{i,j} \preceq \sum_{i<j}(j-i)P_n.$$
To finish the proof, we compute
$$\sum_{i<j}(j-i) \le \sum_{i<j} n \le n^3.$$
So
$$L_{K_n} \preceq n^3 \cdot L_{P_n},$$
and hence, by Claim 4.2.2,
$$\frac{1}{n^2} \le \lambda_2(P_n).$$
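We can sanity-check this inequality numerically (assuming numpy): $L_{K_n} \preceq n^3 L_{P_n}$ holds if and only if $n^3 L_{P_n} - L_{K_n}$ is positive semidefinite.

```python
import numpy as np

def path_laplacian(n):
    L = np.zeros((n, n))
    for i in range(n - 1):
        L[i, i] += 1; L[i+1, i+1] += 1
        L[i, i+1] -= 1; L[i+1, i] -= 1
    return L

n = 20
L_K = n * np.eye(n) - np.ones((n, n))          # Laplacian of K_n
L_P = path_laplacian(n)

# L_K <= n^3 L_P in the Loewner order iff n^3 L_P - L_K is PSD.
print(np.linalg.eigvalsh(n**3 * L_P - L_K).min() >= -1e-6)   # True
```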
This only differs from our test vector-based upper bound in Equation (4.3) by a factor 12.
We could make this considerably tighter by being more careful about the sums.
Let's do the same analysis with the complete binary tree with unit weight edges on $n = 2^{d+1} - 1$ vertices, which we denote by $T_d$.
$T_d$ is the balanced binary tree on this many vertices, i.e. it consists of a root node, which has
two children; each of those children has two children, and so on until we reach a depth of $d$
from the root, at which point the child vertices have no more children. A simple induction
shows that indeed $n = 2^{d+1} - 1$.
We can also describe the edge set by saying that each node i has edges to its children 2i and
2i + 1 whenever the node labels do not exceed n. We emphasize that we still think of the
graph as undirected.
The largest eigenvalue. We'll start by bounding $\lambda_n(L_{T_d})$ from below using a test vector.
We let $x(i) = 0$ for all nodes that have a child node, $x(i) = -1$ for even-numbered leaf
nodes, and $x(i) = +1$ for odd-numbered leaf nodes. Note that there are $(n+1)/2$ leaf nodes,
and every leaf node has a single edge, connecting it to a parent with value 0. Thus
$$\lambda_n(L_{T_d}) \ge \frac{x^\top L x}{x^\top x} = \frac{(n+1)/2}{(n+1)/2} = 1.$$
Meanwhile, every vertex has degree at most 3, so by Claim 4.2.4, $\lambda_n(L) \le 6$. So we can
bound the largest eigenvalue above and below by a constant.
λ₂ and diameter in any graph. The following lemma gives a simple lower bound on λ₂ for any graph.

Lemma 4.2.6. For any unweighted graph G on n vertices with diameter D, we have λ₂(L_G) ≥ 1/(nD).
Proof. We will again prove a lower bound comparing G to the complete graph. For each edge (i, j) ∈ K_n, let G^{(i,j)} denote a shortest path in G from i to j. This path will have length at most D. So, we have

K_n = Σ_{i<j} G_{i,j} ⪯ Σ_{i<j} D · G^{(i,j)} ⪯ Σ_{i<j} D · G ⪯ n² D · G.

Comparing second eigenvalues on both sides then gives

n² D λ₂(G) ≥ λ₂(K_n) = n,

so λ₂(G) ≥ 1/(nD).
λ₂ in a tree. Since a complete binary tree T_d has diameter 2d ≤ 2 log₂(n), by Lemma 4.2.6,

λ₂(L_{T_d}) ≥ 1/(2n log₂(n)).
Let us give an upper bound on λ₂ of the tree using a test vector. Let x ∈ ℝ^V have x(1) = 0 and x(i) = −1 for i in the left subtree and x(i) = +1 in the right subtree. Then

λ₂(L_{T_d}) = min_{v≠0, v^⊤1=0} (v^⊤ L v)/(v^⊤ v) ≤ (x^⊤ L x)/(x^⊤ x) = 2/(n − 1).
Chapter 5
A common algorithmic problem that arises is the problem of partitioning the vertex set V of a graph G into clusters X₁, X₂, . . . , X_k such that

• for each i, the induced graph G[Xᵢ] = (Xᵢ, E ∩ (Xᵢ × Xᵢ)) is “well-connected”, and

• only an ε-fraction of edges e are not contained in any induced graph G[Xᵢ] (where ε is a very small constant).
Figure 5.1: After removing the red edges (of which there are few in relation to the total number of edges), each connected component in G is “well-connected”.
In this lecture, we make precise what “well-connected” means by introducing the notions of conductance and expanders.
Building on the last two lectures, we show that the second eigenvalue of the Laplacian L associated with the graph G can be used to certify that a graph is “well-connected” (more precisely, the second eigenvalue of a normalized version of the Laplacian). This result, called Cheeger's inequality, is one of the key tools in Spectral Graph Theory. Moreover, it can be turned into an algorithm that computes the partition efficiently!
Conductance. For a cut ∅ ⊂ S ⊂ V, we define its conductance as

φ(S) = |E(S, V \ S)| / min{vol(S), vol(V \ S)}.
It can be seen that φ(·) is symmetric in the sense that φ(S) = φ(V \ S). We define the conductance of the graph G, denoted φ(G), by

φ(G) = min_{∅⊂S⊂V} φ(S).
We note that finding the conductance of a graph G is NP-hard. However, good approxima-
tions can be found as we will see today (and in a later lecture).
Expander and Expander Decomposition. For any φ ∈ (0, 1], we say that a graph G is a φ-expander if φ(G) ≥ φ. We say that the partition X₁, X₂, . . . , X_k of the vertex set V is a φ-expander decomposition of quality q if

1. for each i, the induced graph G[Xᵢ] is a φ-expander, and

2. the number of edges not contained in any G[Xᵢ] is at most q · φ · m.
Today, we obtain a φ-expander decomposition of quality q = O(φ−1/2 · log n). In a few
lectures, we revisit the problem and obtain quality q = O(logc n) for some small constant c.
In practice, we mostly care about values φ ≈ 1.
In the graded homework, we ask you to make the procedure CertifyOrCut(G, φ) explicit,
and then to show how to use it to compute a φ-expander decomposition.
To see this, observe that we can rewrite the numerator above using the Laplacian of G as

|E(S, V \ S)| = Σ_{(u,v)∈E} (1_S(u) − 1_S(v))² = 1_S^⊤ L 1_S,
and similarly vol(S) = 1_S^⊤ D 1_S, where D = diag(d) is the degree matrix. We can now alternatively define the graph conductance of G by

φ(G) = min_{∅⊂S⊂V, vol(S)≤vol(V)/2} (1_S^⊤ L 1_S)/(1_S^⊤ D 1_S),     (5.1)

where we use that φ(S) = φ(V \ S), so the minimum value is unchanged as long as, for each set ∅ ⊂ S ⊂ V, either S or V \ S is in the collection of sets that we minimize over.
The Normalized Laplacian. Let us next define the normalized Laplacian

N = D^{−1/2} L D^{−1/2}.
To learn a bit about this new matrix, let us first look at the first eigenvalue, where we use the test vector y = D^{1/2} 1 to get, by Courant-Fischer (see Theorem 4.1.2), that

λ₁(N) ≤ (y^⊤ N y)/(y^⊤ y) = (1^⊤ L 1)/(y^⊤ y) = 0,

because D^{−1/2} D^{1/2} = I and L1 = 0 (for the former we use the assumption that G is connected). Since N is PSD (as you will show in the exercises), we also know λ₁(N) ≥ 0, so λ₁(N) = 0.
Let us use Courant-Fischer again to reason a bit about the second eigenvalue of N :
0 = d^⊤ z_S
⟺ 0 = d^⊤ (1_S − α1)
⟺ 0 = d^⊤ 1_S − α d^⊤ 1
⟺ α = (d^⊤ 1_S)/(d^⊤ 1) = vol(S)/vol(V).
To conclude the proof, it remains to argue that

(1_S^⊤ L 1_S)/(1_S^⊤ D 1_S) ≥ ½ · (z_S^⊤ L z_S)/(z_S^⊤ D z_S):
• Denominator: observe by straightforward calculations that

z_S^⊤ D z_S = vol(S) · (1 − α)² + vol(V \ S) · (−α)².
Proof. To prove the theorem, we want to show that for any z ⊥ d, we can find a set ∅ ⊂ S ⊂ V with vol(S) ≤ vol(V)/2, such that

(1_S^⊤ L 1_S)/(1_S^⊤ D 1_S) ≤ √(2 · (z^⊤ L z)/(z^⊤ D z)).     (5.4)
As a first step, we would like to change z slightly to make it more convenient to work with: we center and scale it to obtain a vector z_sc, ensuring in particular that

Σ_{i: z_sc(i)>0} d(i) ≤ vol(V)/2.
In the exercises, you will show that changing z to z_sc can only make the ratio we are interested in smaller, i.e. that

(z^⊤ L z)/(z^⊤ D z) ≥ (z_sc^⊤ L z_sc)/(z_sc^⊤ D z_sc).

Thus, if we can show that Equation (5.4) holds for z_sc in place of z, then it also follows for z itself.
We now arrive at the main idea of the proof: we define the set S_τ = {i ∈ V | z_sc(i) < τ} for some random variable τ with probability density function

p(t) = 2 · |t| for t ∈ [z_sc(1), z_sc(n)], and p(t) = 0 otherwise.     (5.5)

So, we have probability P[a < τ < b] = ∫_a^b p(t) dt.
Since the volume incident to S_τ might be quite large, let us define S for convenience by

S = S_τ if vol(S_τ) < vol(V)/2, and S = V \ S_τ otherwise.
Claim 5.3.2. We have

E_τ[1_S^⊤ L 1_S] / E_τ[1_S^⊤ D 1_S] ≤ √(2 · (z_sc^⊤ L z_sc)/(z_sc^⊤ D z_sc)).
Proof. Recall 1_S^⊤ L 1_S = |E(S_τ, V \ S_τ)|, and by choice of τ, we have a simple expression for the probability that any edge e = {i, j} ∈ E with z_sc(i) ≤ z_sc(j) is cut. We can upper bound this probability in either case by |z_sc(i) − z_sc(j)| · (|z_sc(i)| + |z_sc(j)|) (we leave this as an exercise).
Using our new upper bound, we can sum over all edges e ∈ E to conclude that

E_τ[|E(S_τ, V \ S_τ)|] ≤ Σ_{i∼j} |z_sc(i) − z_sc(j)| · (|z_sc(i)| + |z_sc(j)|)
≤ √(Σ_{i∼j} (z_sc(i) − z_sc(j))²) · √(Σ_{i∼j} (|z_sc(i)| + |z_sc(j)|)²).
The first sum should look familiar by now: it is simply the quadratic Laplacian form Σ_{i∼j} (z_sc(i) − z_sc(j))² = z_sc^⊤ L z_sc.

While this almost looks like what we want, we still have to argue that z_sc^⊤ D z_sc = E_τ[1_S^⊤ D 1_S] to finish the proof.
To this end, when unrolling the expectation, we use a simple trick that splits by cases:

E_τ[1_S^⊤ D 1_S] = Σ_{i∈V} d(i) · P[i ∈ S]
= Σ_{i∈V, z_sc(i)<0} d(i) · P[i ∈ S ∧ S = S_τ] + Σ_{i∈V, z_sc(i)≥0} d(i) · P[i ∈ S ∧ S ≠ S_τ]
= Σ_{i∈V, z_sc(i)<0} d(i) · P[z_sc(i) < τ ∧ τ < 0] + Σ_{i∈V, z_sc(i)≥0} d(i) · P[z_sc(i) ≥ τ ∧ τ ≥ 0],

where we use the centering of z_sc, the definition of S, and that the event {i ∈ S ∧ S = S_τ} can be rewritten as the event {z_sc(i) < τ ∧ τ < 0} (the other case is analogous).
Let i be a vertex with z_sc(i) < 0; then the probability P[z_sc(i) < τ ∧ τ < 0] is exactly z_sc(i)² by choice of the density function of τ (again, the case for i with z_sc(i) non-negative is analogous).
Thus, summing over all vertices, we obtain

E_τ[1_S^⊤ D 1_S] = Σ_{i∈V, z_sc(i)<0} d(i) · P[z_sc(i) < τ ∧ τ < 0] + Σ_{i∈V, z_sc(i)≥0} d(i) · P[z_sc(i) ≥ τ ∧ τ ≥ 0]
= Σ_{i∈V} d(i) · z_sc(i)² = z_sc^⊤ D z_sc.
Therefore, we can plug our result directly into Equation (5.6), and the proof is completed by dividing both sides by E_τ[1_S^⊤ D 1_S].
While Claim 5.3.2 only ensures our claim in expectation, this is already sufficient to conclude that there exists some set S that satisfies the same guarantees deterministically, as you will prove in Problem Set 4. This is often called the probabilistic method: by the definition of expectation, a random variable must, with positive probability, take a value at least as good as its expectation. We have thus proven the upper bound of Cheeger's inequality.
5.4 Conclusion
Today, we have introduced the concepts of conductance and formalized expanders and ex-
pander decompositions. These are crucial concepts that you will encounter often in literature
and also again in this course. They are a key tool in many recent breakthroughs in Theo-
retical Computer Science.
In the second part of the lecture (the main part), we discussed Cheeger's inequality, which allows us to relate the second eigenvalue of the normalized Laplacian to a graph's conductance. We summarize the full statement here.

We point out that this theorem is tight, as you will show in the exercises. The proof of Cheeger's inequality is probably the most advanced proof we have seen so far in the course. The many tricks that make the proof work might sometimes seem a bit magical, but it is important to remember that they are the result of many people polishing this proof over and over. The proof techniques used are extremely useful and can be re-used in various contexts. We therefore strongly encourage you to really understand the proof yourself!
Chapter 6
Random Walks
Today, we talk about random walks on graphs and how the spectrum of the Laplacian
guides convergence of random walks. We start by giving the definition of a random walk on
a weighted graph G = (V, E, w).
To gain some intuition for the definition, assume first that the graph G is undirected. Con-
sider a particle that is placed at a random vertex v0 initially. Then at each step the particle
is moved to a neighbor of the current vertex it is resting at, where the neighbor is chosen
uniformly at random.
If the graph is weighted, then instead of choosing a neighbor vt+1 of a vertex vt at each step
uniformly at random, one chooses a neighbor v of vt with probability w (v, vt ) divided by the
degree d (vt ).
The Random Walk Matrix. We now define the random walk matrix W by

W = A D^{−1},

and observe that for all vertices u, v ∈ V (and any t), we have that

W_{vu} = w(u, v)/d(u) if {u, v} ∈ E, and W_{vu} = 0 otherwise.
Figure 6.1: A (possibly random) walk where the red edges indicate the edges that the particle
moves along. Here the walk visits the vertices v0 = 1, v1 = 2, v2 = 3, v3 = 2, v4 = 3, v5 = 4.
• How does a random walk behave after a large number of steps are taken?
• How many steps does it take asymptotically until the random walk behaves as if an
infinite number of steps were taken?
The distribution π = d/(1^⊤ d) is stationary for the random walk, i.e. W π = π, as the following computation shows.

Proof. Let π = d/(1^⊤ d). Clearly, we have that ‖π‖₁ = Σ_{v∈V} d(v)/(1^⊤ d) = 1, so π is indeed a distribution. Further note that

W π = A D^{−1} · d/(1^⊤ d) = (A 1)/(1^⊤ d) = d/(1^⊤ d) = π.
For many graphs one can show that for t → ∞, we have that p t → π, i.e. that independent
of the starting distribution p 0 , the random walk always converges to distribution π.
Unfortunately, this is not true for all graphs: take the graph of two vertices connected by a
single edge with p 0 being 1 at one vertex and 0 at the other.
Lazy Random Walks. Luckily, we can overcome this issue by using a lazy random walk. A lazy random walk behaves just like a random walk; however, at each time step, with probability ½, instead of transitioning to a neighbor, it simply stays put. We give the lazy random walk matrix by
W̃ = ½ I + ½ W = ½ I + ½ A D^{−1}.

It is not hard to see that the stationary distribution π for W is also a stationary distribution for W̃.
Figure 6.2: A lazy random walk where the red edges indicate the edges that the particle moves
along. Here the lazy walk visits the vertices v0 = 1, v1 = 2, v2 = 2, v3 = 3, v4 = 3, v5 = 2.
Lazy Random Walks and the Normalized Laplacian. Recall that we defined N = D^{−1/2} L D^{−1/2} = I − D^{−1/2} A D^{−1/2}, equivalently D^{−1/2} A D^{−1/2} = I − N. We can therefore derive

W̃ = ½ I + ½ A D^{−1}
= ½ I + ½ D^{1/2} D^{−1/2} A D^{−1/2} D^{−1/2}
= ½ I + ½ D^{1/2} (I − N) D^{−1/2}
= ½ I + ½ D^{1/2} I D^{−1/2} − ½ D^{1/2} N D^{−1/2}
= I − ½ D^{1/2} N D^{−1/2}.
We will now start to reason about the eigenvalues and eigenvectors of W̃ in terms of the normalized Laplacian N that we are already familiar with.

For the rest of the lecture, we let ν₁ ≤ ν₂ ≤ · · · ≤ νₙ be the eigenvalues of N associated with the orthogonal eigenvectors ψ₁, ψ₂, . . . , ψₙ, where we know that such eigenvectors exist by the Spectral Theorem. We note in particular that from the last lecture, we have that ψ₁ = d^{1/2}/(1^⊤ d)^{1/2} (see Equation 5.2, where we added a normalization such that ψ₁^⊤ ψ₁ = 1).
Lemma 6.2.2. For the ith eigenvalue νᵢ of N associated with eigenvector ψᵢ, we have that W̃ has an eigenvalue (1 − ½νᵢ) associated with eigenvector D^{1/2} ψᵢ.
This yields a corollary: every eigenvalue of W̃ lies in [0, 1].

Proof. Recall that L ⪯ 2D, which implies that N ⪯ 2I. But this implies that every eigenvalue of N is in [0, 2]. Thus, using Lemma 6.2.2, the corollary follows.
We have now done enough work to obtain an interesting result. We can derive an alternative characterization of p_t by expanding p₀ along an orthogonal eigenvector basis, and then we can repeatedly apply W̃ by taking powers of the eigenvalues.
Unfortunately, W̃ is not symmetric, so its eigenvectors are not necessarily orthogonal. Instead, we use a simple trick that allows us to expand along the eigenvectors of N:

∀i, ψᵢ^⊤ D^{−1/2} p₀ = αᵢ ⟺ D^{−1/2} p₀ = Σ_{i=1}^n αᵢ ψᵢ ⟺ p₀ = Σ_{i=1}^n αᵢ D^{1/2} ψᵢ.     (6.1)
The above equivalences are best understood if you start from the middle. To get to the left side, you need to observe that multiplying both sides by ψᵢ^⊤ cancels all terms ψⱼ with j ≠ i in the sum by orthogonality. To get the right hand side expression, one can simply left-multiply by D^{1/2}. Technically, we have to show that D^{−1/2} p₀ lives in the span of the eigenvectors of N, but we leave this as an exercise.
This allows us to express a multiplication by W̃ as

p₁ = W̃ p₀ = Σ_{i=1}^n αᵢ W̃ D^{1/2} ψᵢ = Σ_{i=1}^n αᵢ (1 − νᵢ/2) D^{1/2} ψᵢ.
And as promised, if we apply W̃, the lazy random walk operator, t times, we now obtain

p_t = Σ_{i=1}^n αᵢ (1 − νᵢ/2)^t D^{1/2} ψᵢ = α₁ D^{1/2} ψ₁ + Σ_{i=2}^n αᵢ (1 − νᵢ/2)^t D^{1/2} ψᵢ,     (6.2)
where we use in the last equality that ν₁ = 0. Using this simple characterization, we immediately get that p_t → π if νᵢ > 0 for all i ≥ 2 (which is exactly when the graph is connected, as you will prove in an exercise). To see this, observe that as t grows, the second sum vanishes. We have that

lim_{t→∞} p_t = α₁ D^{1/2} ψ₁ = π.
Theorem 6.2.4. For any connected graph G, we have that the lazy random walk converges
to the stationary distribution of G.
Let us now come to the main result that we want to prove in this lecture.
Theorem 6.2.5. In any unweighted (a.k.a. unit weight) connected graph G, for any p₀, at any time step t, we have for p_t = W̃^t p₀ that

‖p_t − π‖_∞ ≤ e^{−ν₂·t/2} √n.
Instead of proving the theorem above, we prove the lemma below which gives point-wise
convergence. This makes it more convenient to derive a proof and it is not hard to deduce
the theorem above as a corollary.
Lemma 6.2.6. In any weighted connected graph G, for all a, b ∈ V, and any time step t, we have for p₀ = 1_a and p_t = W̃^t p₀ that

|p_t(b) − π(b)| ≤ e^{−ν₂·t/2} √(d_b/d_a),
where we use Cauchy-Schwarz in the last inequality, i.e. |⟨u, v⟩|² ≤ ⟨u, u⟩ · ⟨v, v⟩. Let us finally bound the two sums:
• By (6.1), Σ_{i=2}^n αᵢ² = Σ_{i=2}^n (ψᵢ^⊤ D^{−1/2} p₀)² ≤ ‖D^{−1/2} p₀‖₂² = ‖D^{−1/2} 1_a‖₂² = 1/d_a.
• Finally, we show that Σ_{i=2}^n (1_b^⊤ D^{1/2} ψᵢ)² ≤ Σ_{i=1}^n (1_b^⊤ D^{1/2} ψᵢ)² = ‖D^{1/2} 1_b‖₂² = d_b (we only show the first equality; the other steps are straightforward). We first expand the vector D^{1/2} 1_b along the eigenvectors using some values βᵢ defined by

D^{1/2} 1_b = Σ_{i=1}^n βᵢ ψᵢ ⟺ ψᵢ^⊤ D^{1/2} 1_b = βᵢ ⟺ 1_b^⊤ D^{1/2} ψᵢ = βᵢ.

We used orthogonality to get the first equivalence, and then just take the transpose to get the second.
We can now write

‖D^{1/2} 1_b‖₂² = (D^{1/2} 1_b)^⊤ (D^{1/2} 1_b) = (Σ_{i=1}^n βᵢ ψᵢ)^⊤ (Σ_{i=1}^n βᵢ ψᵢ) = Σ_{i=1}^n βᵢ².
6.3 Properties of Random Walks

We now shift our focus away from convergence of random walks and consider some interesting properties of random walks¹. Here, we are no longer interested in lazy random walks, although all proofs can be straightforwardly adapted. While in the previous section we relied on computing the second eigenvalue of the Normalized Laplacian efficiently, here we will discover that solving Laplacian systems, that is, finding an x such that Lx = b, can solve a host of problems in random walks.
One of the most natural questions one can ask about a random walk starting in a vertex a
(i.e. p 0 = 1a ) is how many steps it takes to get to a special vertex s. This quantity is called
the hitting time from a to s and we denote it by Ha,s = inf{t | v t = s}. For the rest of this
section, we are concerned with computing the expected hitting time, i.e. E[Ha,s ].
It turns out that it is more convenient to compute all expected hitting times Ha,s for vertices
a ∈ V to a fixed s. We denote by h ∈ RV , the vector with h(a) = E[Ha,s ]. We now show
that we can compute h by solving a Laplacian system Lh = b. We will see later in the
course that such systems (spoiler alert!) can be solved in time Õ(m), so this will imply a
near-linear time algorithm to compute the hitting times.
Hitting Time and the Random Walk Matrix. Let us first observe that if s = a, then
the answer becomes trivially 0, i.e. h(s) = 0.
We compute the rest of the vector by writing down a system of equations that recursively characterizes h. Observe first that for any a ≠ s, the random walk starting at a will next visit a neighbor b of a. If the selected neighbor b = s, the walk stops; otherwise, the walk needs in expectation E[H_{b,s}] additional time to move to s.
We can express this algebraically by

h(a) = 1 + Σ_{a∼b} P[v_{t+1} = b | v_t = a] · h(b) = 1 + Σ_{a∼b} (w(a, b)/d(a)) · h(b) = 1 + (W 1_a)^⊤ h = 1 + 1_a^⊤ W^⊤ h.

Rearranging, for every a ≠ s we obtain

1 = 1_a^⊤ (I − W^⊤) h.

Collecting these n − 1 equations, together with one more constraint at s, into matrix form, we get

1 − α · 1_s = (I − W^⊤) h,
¹Note that this part of the script is in large part inspired by Aaron Sidford's lecture notes on the same subject, in his really interesting course Discrete Mathematics and Algorithms.
where we have an extra degree of freedom in choosing α, in formulating the constraint 1 − α = 1_s^⊤ (I − W^⊤) h. This extra degree of freedom stems from the fact that n − 1 equations suffice for us to enforce that the returned vector x from the system is indeed the vector of hitting times (possibly shifted by the value assigned to coordinate s).
Finding Hitting Times via Laplacian System Solve. Since we assume G connected, we have that multiplying with D = D^⊤ preserves equality. Further, since W = A D^{−1}, we obtain

d − α · d(s) · 1_s = (D − A) h.
We have now formalized L and b completely. A last detail that we should not forget about is that a solution x to such a system Lx = b is not necessarily equal to h, but has the property that it is shifted from h by a multiple of the all-ones vector. Since we require h(s) = 0, we can reconstruct h from x straightforwardly by subtracting x(s)1.
Theorem 6.3.1. Given a connected graph G and a vertex s ∈ V, we can formulate a Laplacian system Lx = b (where L is the Laplacian of G) such that the expected hitting times to s are given by h = x − x(s)1. We can reconstruct h from x in time O(n).
Hitting Times and Electrical Networks. We have now seen that hitting times can be computed by formulating a Laplacian system Lx = b. You might remember that in the first lecture, we argued that a system Lx = b also solves the problem of routing a demand b via an electrical flow with voltages x.
Indeed, we can interpret computing expected hitting times h to a special vertex s as the
problem of computing the electrical voltages x where we insert (or more technically correct
apply) d (a) units of current at every vertex a 6= s and where we remove 1> d − d (s) units
of current at the vertex s. Then, we can express expected hitting time to some vertex a as
the voltage difference to s: E[Ha,s ] = h(a) = x (a) − x (s).
A topic closely related to hitting times is commute times. That is, for two vertices a, b, the commute time is the time a random walk starting in a needs to visit b and then return to a again. Thus, it can be defined as C_{a,b} = H_{a,b} + H_{b,a}.
Commute Times via Electric Flows. Recall that expected hitting times have an electric
flow interpretation.
Now, let us denote by x a solution to the Laplacian system Lx = b_b, where the demand is b_b = d − (d^⊤ 1) · 1_b ∈ ker(1)^⊥. Recall that we have E[H_{z,b}] = x(z) − x(b) for all z. Similarly, we can compute voltages y for the Laplacian system Ly = b_a where b_a = d − (d^⊤ 1) · 1_a ∈ ker(1)^⊥. Again, E[H_{z,a}] = y(z) − y(a) for all z.
But observe that this allows us to argue by linearity that E[C_{a,b}] = E[H_{a,b} + H_{b,a}] = E[H_{a,b}] + E[H_{b,a}] = x(a) − x(b) + y(b) − y(a) = (1_a − 1_b)^⊤ (x − y). But using linearity again, we can also argue that we obtain the vector z = (x − y) as a solution to the problem Lz = b_b − b_a = d^⊤ 1 · (1_a − 1_b). That is the flow that routes ‖d‖₁ units of flow from b to a.
Theorem 6.3.2. Given a graph G = (V, E), for any two fixed vertices a, b ∈ V , the expected
commute time Ca,b is given by the voltage difference between a and b for any solution z to
the Laplacian system Lz = kd k1 · (1b − 1a ).
We note that the voltage difference between a and b in an electrical flow routing demand 1_b − 1_a is also called the effective resistance R_eff(a, b). This quantity will play a crucial role in the coming lectures. In the next lecture, we introduce R_eff(a, b) slightly differently, as the energy required by the electrical flow that routes 1_b − 1_a; however, it is not hard to show that these two definitions are equivalent.

Our theorem can now be restated as saying that the expected commute time E[C_{a,b}] = ‖d‖₁ · R_eff(a, b). This is a classic result.
Chapter 7
Under the above conditions, L+ is uniquely defined and we call it the pseudoinverse of L.
Note that there are many other equivalent definitions of the pseudoinverse of some matrix A, and we can also generalize the concept to matrices that aren't symmetric or even square.

Let λᵢ, vᵢ be the i-th pair of eigenvalue and eigenvector of L, with {vᵢ}_{i=1}^n forming an orthonormal basis. Then by the spectral theorem,

L = V Λ V^⊤ = Σᵢ λᵢ vᵢ vᵢ^⊤,

where V = [v₁ · · · vₙ] and Λ = diag{λ₁, ..., λₙ}. And we can show that its pseudoinverse is exactly

L⁺ = Σ_{i: λᵢ≠0} λᵢ^{−1} vᵢ vᵢ^⊤.
Checking conditions 1), 2), 3) is immediate. We can also prove uniqueness, but this takes
slightly more work.
7.2 Electrical Flows Again
Recall the incidence matrix B ∈ RV ×E of a graph G = (V, E).
In Chapter 1, we introduced the electrical flow routing demand d ∈ ℝ^V. Let's call the electrical flow f̃ ∈ ℝ^E. The net flow constraint requires B f̃ = d. By Ohm's Law, f̃ = R^{−1} B^⊤ x for some voltage x ∈ ℝ^V, where R = diag(r) and r(e) is the resistance of edge e. We showed (in the exercises) that when d ⊥ 1, there exists a voltage x̃ ⊥ 1 s.t. f̃ = R^{−1} B^⊤ x̃ and B f̃ = d. This x̃ solves Lx = d where L = B R^{−1} B^⊤.
And we also made the following claim.
Claim 7.2.1.

f̃ = arg min_{Bf = d} f^⊤ R f, where f^⊤ R f = Σₑ r(e) f(e)².     (7.1)
You proved this in the exercises for Week 1. Let’s recap the proof briefly, just to get back
into thinking about electrical flows.
But for the electrical flow f̃ and electrical voltage x̃, we have f̃ = R^{−1} B^⊤ x̃ and L x̃ = d. So

f̃^⊤ R f̃ = (R^{−1} B^⊤ x̃)^⊤ R (R^{−1} B^⊤ x̃) = x̃^⊤ B R^{−1} B^⊤ x̃ = x̃^⊤ L x̃ = x̃^⊤ d.

Therefore,

½ f̃^⊤ R f̃ = d^⊤ x̃ − ½ x̃^⊤ L x̃.     (7.3)

By combining Equation (7.2) and Equation (7.3), we see that for all f s.t. B f = d,

½ f^⊤ R f ≥ d^⊤ x̃ − ½ x̃^⊤ L x̃ = ½ f̃^⊤ R f̃.

Thus f̃ is the minimum electrical energy flow among all flows that route demand d, proving that Equation (7.1) holds.
The drawing below shows how the quantities line up:
So when we have a graph consisting of just one edge (a, b), the effective resistance is just
Reff (a, b) = r(a, b).
In a general graph, we can also consider the energy required to route one unit of current between two vertices. For any pair a, b ∈ V, we have

R_eff(a, b) = (e_b − e_a)^⊤ L⁺ (e_b − e_a),

where e_v ∈ ℝ^V is the indicator vector of v. Note that the cost of routing F units of flow from a to b will be R_eff(a, b) · F².
Since (e_b − e_a)^⊤ 1 = 0, we know from the previous section that R_eff(a, b) = f̃^⊤ R f̃, where f̃ is the electrical flow routing 1 unit of current from a to b. Now we can write L x̃ = e_b − e_a and x̃ = L⁺ (e_b − e_a) for the corresponding electrical voltages. Now the energy of routing 1 unit of current from a to b is

R_eff(a, b) = f̃^⊤ R f̃ = x̃^⊤ L x̃ = (e_b − e_a)^⊤ L⁺ L L⁺ (e_b − e_a) = (e_b − e_a)^⊤ L⁺ (e_b − e_a),
Remark 7.3.1. We have now seen several different expressions that all take on the same value: the energy of the electrical flow. It's useful to remind yourself what these are. Consider an electrical flow f̃ that routes demand d, with associated electrical voltages x̃. We know that B f̃ = d, and f̃ = R^{−1} B^⊤ x̃, and L x̃ = d, where L = B R^{−1} B^⊤. And we have seen how to express the electrical energy using many different quantities:

f̃^⊤ R f̃ = x̃^⊤ L x̃ = d^⊤ L⁺ d = d^⊤ x̃ = f̃^⊤ B^⊤ x̃.
Claim 7.3.2. Any PSD matrix A has a PSD square root A^{1/2} s.t. A^{1/2} A^{1/2} = A.

Proof. By the spectral theorem, A = Σᵢ λᵢ vᵢ vᵢ^⊤ where {vᵢ} are orthonormal. Let A^{1/2} = Σᵢ λᵢ^{1/2} vᵢ vᵢ^⊤. Then

A^{1/2} A^{1/2} = (Σᵢ λᵢ^{1/2} vᵢ vᵢ^⊤)²
= Σᵢ λᵢ vᵢ vᵢ^⊤ vᵢ vᵢ^⊤ + Σ_{i≠j} λᵢ^{1/2} λⱼ^{1/2} vᵢ vᵢ^⊤ vⱼ vⱼ^⊤
= Σᵢ λᵢ vᵢ vᵢ^⊤,

where the last equality is due to vᵢ^⊤ vⱼ = δᵢⱼ. It's easy to see that A^{1/2} is also PSD.
Figure 7.2: A path graph with k edges.
To see this, observe that to have 1 unit of flow going from vertex 1 to vertex k + 1, we must have one unit flowing across each edge i. Let ∆(i) be the voltage difference across edge i, and f(i) the flow on the edge. Then 1 = f(i) = ∆(i)/r(i), so that ∆(i) = r(i). The electrical voltages are then x̃ ∈ ℝ^V where x̃(i) = x̃(1) + Σ_{j<i} ∆(j). Hence the effective resistance is

R_eff(1, k + 1) = d^⊤ x̃ = (e_{k+1} − e_1)^⊤ x̃ = x̃(k + 1) − x̃(1) = Σ_{i=1}^k r(i).
This behavior is sometimes known as the fact that the resistance of resistors adds up when
they are connected in series.
Let's see why. Our electrical voltages x̃ ∈ ℝ^V can be described by just the voltage difference ∆ ∈ ℝ between vertex 1 and vertex 2, i.e. x̃(2) − x̃(1) = ∆, which creates a flow on edge i of f̃(i) = ∆/r(i). Thus the total flow from vertex 1 to vertex 2 is 1 = Σᵢ ∆/r(i), so that ∆ = 1/(Σ_{i=1}^k 1/r(i)). Meanwhile, the effective resistance is also

R_eff(1, 2) = (e₂ − e₁)^⊤ x̃ = ∆ = 1/(Σ_{i=1}^k 1/r(i)).
Before proving this lemma, let’s see a claim that will help us finish the proof.
Claim 7.3.5. Let Lx̃ = e b − e a . Then for all c ∈ V , we have x̃ (b) ≥ x̃ (c) ≥ x̃ (a).
Proof. It is easy to check that conditions 1, 2, and 3 of Definition 7.3.3 are satisfied by Reff .
Let us confirm condition 4.
For any u, v, let x̃ u,v = L+ (−e u + e v ). Then
Thus,

R_eff(a, b) = (−e_a + e_b)^⊤ x̃_{a,b} = (−e_a + e_b)^⊤ (x̃_{a,c} + x̃_{c,b})
= −x̃_{a,c}(a) + x̃_{a,c}(b) − x̃_{c,b}(a) + x̃_{c,b}(b)
≤ −x̃_{a,c}(a) + x̃_{a,c}(c) − x̃_{c,b}(c) + x̃_{c,b}(b)
= R_eff(a, c) + R_eff(c, b),

where in the inequality we applied Claim 7.3.5 to show that x̃_{a,c}(b) ≤ x̃_{a,c}(c) and −x̃_{c,b}(a) ≤ −x̃_{c,b}(c).
Chapter 8
min_{x∈ℝ^V} E(x).

Let x = (y, z) where y ∈ ℝ and z ∈ ℝ^{V∖{1}}.
We will now explore how to minimize over y, given any z. Once we find an expression for y in terms of z, we will be able to reduce it to a new quadratic minimization problem in z,

E′(z) = −d′^⊤ z + ½ z^⊤ L′ z,
where d′ is a demand vector on the remaining vertices, with d′ ⊥ 1, and L′ is a Laplacian of a graph on the remaining vertices V′ = V \ {1}. We can then repeat the procedure to eliminate another variable, and so on. Eventually, we can find the full solution to our original minimization problem.
To help us understand how to minimize over the first variable, we introduce some notation for the first row and column of the Laplacian:

L = [ W    −a^⊤
      −a   diag(a) + L_{−1} ].     (8.1)

Here

[ W    −a^⊤
  −a   diag(a) ]     (8.2)

is the Laplacian of the subgraph of G containing only the edges incident on vertex 1, while L_{−1} is the Laplacian of the subgraph of G containing all edges not incident on vertex 1.
Let us also write d = (b, c) where b ∈ ℝ and c ∈ ℝ^{V∖{1}}.

Now,

E(x) = −d^⊤ x + ½ x^⊤ L x
= −(b, c)^⊤ (y, z) + ½ (y, z)^⊤ [ W −a^⊤; −a diag(a) + L_{−1} ] (y, z)
= −by − c^⊤ z + ½ (y² W − 2y a^⊤ z + z^⊤ diag(a) z + z^⊤ L_{−1} z).
Now, to minimize over y, we set ∂E(x)/∂y = 0 and get

−b + yW − a^⊤ z = 0.
Observe that

E(x) = −by − c^⊤ z + ½ (y² W − 2y a^⊤ z + z^⊤ diag(a) z + z^⊤ L_{−1} z)
= −by − c^⊤ z + (1/(2W)) (yW − a^⊤ z)² − (1/(2W)) z^⊤ a a^⊤ z + ½ z^⊤ diag(a) z + ½ z^⊤ L_{−1} z
= −by − c^⊤ z + (1/(2W)) (yW − a^⊤ z)² + ½ z^⊤ S z,
where we simplified the expression by defining S = diag(a) − (1/W) a a^⊤ + L_{−1}. Plugging in y = (1/W)(b + a^⊤ z), we get

min_y E((y, z)) = −(c + (b/W) a)^⊤ z − b²/(2W) + ½ z^⊤ S z.

Defining d′ = c + (b/W) a, we see that minimizing this expression over z is the same as minimizing E′(z) = −d′^⊤ z + ½ z^⊤ S z, since dropping the constant term −b²/(2W) does not change what the minimizing z values are.
Claim 8.1.1.

1. d′ ⊥ 1.

2. S = diag(a) − (1/W) a a^⊤ + L_{−1} is a Laplacian of a graph on the vertex set V \ {1}.
We will prove Claim 8.1.1 in a moment. From the Claim, we see that the problem of finding arg min_z E′(z) is exactly of the same form as finding arg min_x E(x), but with one fewer variable.

We can get a minimizing x that solves arg min_x E(x) by repeating the variable elimination procedure until we get down to a single variable and finding its value. We then have to work back up to get a solution for z, and then substitute that into Equation (8.3) to get the value for y.
Remark 8.1.2. In fact, this perspective on Gaussian elimination also makes sense for any
positive definite matrix. In this setting, minimizing over one variable will leave us with
another positive definite quadratic minimization problem.
Proof of Claim 8.1.1. To establish the first part, we note that 1^⊤ d′ = 1^⊤ c + b (1^⊤ a)/W = 1^⊤ c + b = 1^⊤ d = 0. To establish the second part, we notice that L_{−1} is a graph Laplacian by definition. Since the sum of two graph Laplacians is another graph Laplacian, it now suffices to show that S is a graph Laplacian.
Claim 8.1.3. A matrix M is a graph Laplacian if and only if it satisfies the following conditions:
• M> = M.
• The diagonal entries of M are non-negative, and the off-diagonal entries of M are
non-positive.
• M 1 = 0.
Let's see that Claim 8.1.3 is true. Firstly, when the conditions hold, we can write M = D − A where D is diagonal and non-negative, and A is non-negative, symmetric, and zero on the diagonal, and from the last condition D(i, i) = Σ_{j≠i} A(i, j). Thus we can view A as a graph adjacency matrix and D as the corresponding diagonal matrix of weighted degrees. Secondly, it is easy to check that the conditions hold for any graph Laplacian, so the equivalence indeed holds. Now we have to check that the claim applies to S. We leave this as an exercise for the reader.
Finally, we want to argue that the graph corresponding to S is connected. Consider any
i, j ∈ V \ {1}. Since G, the graph of L, is connected, there exists a simple path in G
connecting i and j. If this path does not use vertex 1, it is a path in the graph of L−1 and
hence in the graph of S . If the path does use vertex 1, it must do so by reaching the vertex
on some edge (v, 1) and leaving on a different edge (1, u). Replace this pair of edges with
edge (u, v), which appears in the graph of S because S (u, v) < 0. Now we have a path in
the graph of S .
In this section, we'll study how to decompose a graph Laplacian as L = LL^⊤, where L ∈ ℝ^{n×n} is a lower triangular matrix, i.e. L(i, j) = 0 for i < j. Such a factorization is
called a Cholesky decomposition. It is essentially the result of Gaussian elimination with a
slight twist to ensure the matrices maintained at intermediate steps of the algorithm remain
symmetric.
We use nnz(A) to denote the number of non-zero entries of matrix A.
Lemma 8.2.1. Given an invertible square lower triangular matrix L, we can solve the linear
equation Ly = b in time O(nnz(L)). Similarly, given an upper triangular matrix U , we can
solve linear equations U z = c in time O(nnz(U )).
We omit the proof, which is a straight-forward exercise. The algorithms for solving linear
equations in upper and lower triangular matrices are known as forward and back substitution
respectively.
Remark 8.2.2. Strictly speaking, the lemma requires us to have access to an adjacency list representation of L so that we can quickly tell where the non-zero entries are.
Dealing with pseudoinverses. But how can we solve a linear equation in L = LL^⊤, where L is not invertible? For graph Laplacians we have a simple characterization of the kernel, and because of this, dealing with the lack of invertibility turns out to be fairly easy.
We can use the following lemma which you will prove in an exercise next week.
Lemma 8.2.4. Consider a real symmetric matrix M = X Y X^⊤, where X is real and invertible and Y is real symmetric. Let Π_M denote the orthogonal projection to the image of M. Then M⁺ = Π_M (X^⊤)^{−1} Y⁺ X^{−1} Π_M.
The factorizations L = LL^⊤ that we produce will have the property that all diagonal entries of L are strictly non-zero, except that L(n, n) = 0. Let L̂ be the matrix whose entries agree with L, except that L̂(n, n) = 1. Let D be the diagonal matrix with D(i, i) = 1 for i < n and D(n, n) = 0. Then LL^⊤ = L̂ D L̂^⊤, and L̂ is invertible, and D⁺ = D. Finally, Π_L = I − (1/n) 1 1^⊤, because this matrix acts like the identity on vectors orthogonal to 1 and ensures Π_L 1 = 0; this matrix can be applied to a vector in O(n) time. Thus L⁺ = Π_L (L̂^⊤)^{−1} D L̂^{−1} Π_L, and this matrix can be applied in time O(nnz(L)).
An additive view of Gaussian Elimination. The following theorem describes Gaussian Elimination / Cholesky decomposition of a graph Laplacian.
Proof. Let L^{(0)} = L. We will use A(:, i) to denote the ith column of a matrix A. Now, for i = 1 to i = n − 1 we define

l_i = (1/√(L^{(i−1)}(i, i))) L^{(i−1)}(:, i) and L^{(i)} = L^{(i−1)} − l_i l_i^⊤.
From this claim, it follows that L^{(i−1)}(i, i) ≠ 0 for i < n, since a connected graph Laplacian on a graph with |U| > 1 vertices cannot have a zero on the diagonal, and it follows that L^{(n−1)}(n, n) = 0, because the only graph we allow on one vertex is the empty graph. This shows that Equation (8.4) holds.
Sketch of proof of Claim 8.2.6. We will focus on the first elimination, as the remaining are similar. Adopting the same notation as in Equation (8.1), we write

L^{(0)} = L = [ W −a^⊤; −a diag(a) + L_{−1} ].
Thus the first row and column of L^{(1)} are zero, as claimed. It also follows by Claim 8.1.1 that L^{(1)}({2, . . . , n}, {2, . . . , n}) is the Laplacian of a connected graph. This proves Claim 8.2.6 for the case i = 1. An induction following the same pattern can be used to prove the claim for all i < n.
Chapter 9
The Chernoff bound should be familiar to most of you, but you may not have seen the
following very similar bound. The Bernstein bound, which we will state in terms of zero-
mean variables, is much like the Chernoff bound. It also requires bounded variables. But,
when the variables have small variance, the Bernstein bound is sometimes stronger.
Theorem 9.1.2 (A Bernstein Concentration Bound). Suppose X₁, . . . , X_k ∈ ℝ are independent, zero-mean random variables with |Xᵢ| ≤ R always. Let X = Σᵢ Xᵢ, and σ² = Var[X] = Σᵢ E[Xᵢ²]. Then for t > 0,

Pr[|X| ≥ t] ≤ 2 exp(−t² / (2Rt + 4σ²)).
We will now prove the Bernstein concentration bound for scalar random variables, as a warm-
up to the next section, where we will prove a version of it for matrix-valued random variables.
To help us prove Bernstein’s bound, first let’s recall Markov’s inequality. This is a very weak
concentration inequality, but also very versatile, because it requires few assumptions.
Lemma 9.1.3 (Markov's Inequality). Suppose X ∈ ℝ is a non-negative random variable with a finite expectation. Then for any t > 0,

Pr[X ≥ t] ≤ E[X] / t.
Proof.
Proof of Theorem 9.1.2. We will focus on bounding the probability Pr[X ≥ t]. The proof that Pr[−X ≥ t] is small proceeds in the same way.

First we observe that for any θ > 0, by Markov's inequality applied to the non-negative variable exp(θX),

Pr[X ≥ t] = Pr[exp(θX) ≥ exp(θt)] ≤ exp(−θt) E[exp(θX)].

Now, let's require that θ ≤ 1/R. This will allow us to use the following bound: For all |z| ≤ 1,
exp(z) ≤ 1 + z + z².     (9.1)
We omit a proof of this, but the plots in Figure 9.1 suggest that this upper bound holds.
The reader should consider how to prove this. With this in mind, we see that

E[exp(θX)] = E[exp(θ Σᵢ Xᵢ)]
= E[Πᵢ exp(θXᵢ)]
= Πᵢ E[exp(θXᵢ)]     (because E[Y Z] = E[Y] E[Z] for independent Y and Z)
≤ Πᵢ E[1 + θXᵢ + (θXᵢ)²]
= Πᵢ (1 + θ² E[Xᵢ²])     (because the Xᵢ are zero-mean)
≤ Πᵢ exp(θ² E[Xᵢ²])     (because 1 + z ≤ exp(z) for all z ∈ ℝ)
= exp(θ² Σᵢ E[Xᵢ²]) = exp(θ² σ²).
Figure 9.1: Plotting exp(z) compared to 1 + z + z².
Thus Pr[X ≥ t] ≤ exp(−θt) E[exp(θX)] ≤ exp(−θt + θ² σ²). Now, to get the best possible bound, we'd like to minimize −θt + θ² σ² subject to the constraint 0 < θ ≤ 1/R. We compute

∂/∂θ (−θt + θ² σ²) = −t + 2θσ².

Setting this derivative to zero gives θ = t/(2σ²), and plugging that in gives

−θt + θ² σ² = −t²/(4σ²).
This choice only satisfies our constraints on θ if t/(2σ²) ≤ 1/R. Otherwise, we let θ = 1/R and note that in this case

−θt + θ² σ² = −t/R + σ²/R² ≤ −t/R + t/(2R) = −t/(2R),
where we got the inequality from t > 2σ²/R. Altogether, we can conclude that there always is a choice of θ s.t.

−θt + θ² σ² ≤ −min{ t/(2R), t²/(4σ²) } ≤ −t²/(2Rt + 4σ²).

In fact, with the benefit of hindsight, and a little algebra, we arrive at the same conclusion in another way: one can check that the following choice of θ is always valid and achieves the same bound:

θ = (1/(2σ²)) · (t − √R · t^{3/2} / √(2σ² + Rt)).
We use ‖·‖ to denote the spectral norm on matrices. Let's take a look at a version of Bernstein's bound that applies to sums of random matrices:

Pr[‖X‖ ≥ t] ≤ 2n exp(−t² / (2Rt + 4σ²)).
This basically says that the probability of X being large in spectral norm behaves like the scalar case, except the bound is larger by a factor n, where the matrices are n × n. We can get a feeling for why this might be a reasonable bound by considering the case of random diagonal matrices. Now ‖X‖ = maxⱼ |X(j, j)| = maxⱼ |Σᵢ Xᵢ(j, j)|. In this case, we need to bound the largest of the n diagonal entries: We can do this by a union bound over n instances of the scalar problem – and this also turns out to be essentially tight in some cases, meaning we can't expect a better bound in general.
In this section we will prove the Bernstein matrix concentration bound (Tropp 2011) that we saw in the previous section:

Pr[‖X‖ ≥ t] ≤ 2n exp(−t² / (2Rt + 4σ²)).

But let's collect some useful tools for the proof first.
Definition 9.2.2 (trace). The trace of a square matrix A is defined as

Tr(A) := Σᵢ A(i, i).
Let Sⁿ denote the set of all n × n real symmetric matrices, S₊ⁿ the set of all n × n positive semidefinite matrices, and S₊₊ⁿ the set of all n × n positive definite matrices. Their relation is clear: S₊₊ⁿ ⊂ S₊ⁿ ⊂ Sⁿ. For any A ∈ Sⁿ with eigenvalues λ₁(A) ≤ · · · ≤ λₙ(A), by the spectral decomposition theorem, A = V Λ V^⊤ where Λ = diagᵢ{λᵢ(A)} and V^⊤ V = V V^⊤ = I; we'll use this property without mention in the sequel.
Claim 9.2.4. Given a symmetric and real matrix A, Tr(A) = Σᵢ λᵢ, where {λᵢ} are the eigenvalues of A.

Proof.

Tr(A) = Tr(V Λ V^⊤) = Tr(Λ V^⊤ V) = Tr(Λ) = Σᵢ λᵢ.
Example. Recall that every PSD matrix A has a square root A^{1/2}. If f(x) = x^{1/2} for x ∈ ℝ₊, then f(A) = A^{1/2} for A ∈ S₊ⁿ.

Example. If f(x) = exp(x) for x ∈ ℝ, then f(A) = exp(A) = V exp(Λ) V^⊤ for A ∈ Sⁿ. Note that exp(A) is positive definite for any A ∈ Sⁿ.
Meanwhile, a function f : S → T where S, T ⊆ Sⁿ is said to be operator monotone increasing if A ⪯ B implies f(A) ⪯ f(B).
Lemma 9.2.6. Let T ⊆ ℝ. If the scalar function f : T → ℝ is monotone increasing, the matrix function X ↦ Tr(f(X)) is monotone increasing.

Proof. From previous chapters, we know if A ⪯ B then λᵢ(A) ≤ λᵢ(B) for all i. As x ↦ f(x) is monotone, λᵢ(f(A)) ≤ λᵢ(f(B)) for all i. By Claim 9.2.4, Tr(f(A)) ≤ Tr(f(B)).
From this, and the fact that x ↦ exp(x) is a monotone function on the reals, we get the following corollary.

Corollary 9.2.7. If A ⪯ B, then Tr(exp(A)) ≤ Tr(exp(B)), i.e. X ↦ Tr(exp(X)) is monotone increasing.
Lemma 9.2.8. If 0 ≺ A ⪯ B, then B^{−1} ⪯ A^{−1}, i.e. X ↦ X^{−1} is operator monotone decreasing on S₊₊ⁿ.
Proof.

∫₀^∞ (1/(1 + t) − 1/(a + t)) dt = lim_{T→∞} ∫₀^T (1/(1 + t) − 1/(a + t)) dt
= lim_{T→∞} [log(1 + t) − log(a + t)]₀^T
= log(a) + lim_{T→∞} log((1 + T)/(a + T))
= log(a).
Proof sketch of Lemma 9.2.9. Because all the matrices involved are diagonalized by the same orthogonal transformation, we can conclude from Lemma 9.2.10 that for a matrix A ≻ 0,

log(A) = ∫₀^∞ ((1/(1 + t)) I − (tI + A)^{−1}) dt.

This integral can be expressed as the limit of a sum with positive coefficients, and since the integrand (the term inside the integration symbol) is operator monotone increasing in A by Lemma 9.2.8, the result of the integral, i.e. log(A), must also be operator monotone increasing.
The following is a more general version of Lemma 9.2.6.

Lemma 9.2.11. Let T ⊂ ℝ. If the scalar function f : T → ℝ is monotone, the matrix function X ↦ Tr(f(X)) is monotone.

Remark 9.2.12. It is not always true that when f : ℝ → ℝ is monotone, f : Sⁿ → Sⁿ is operator monotone. For example, X ↦ X² and X ↦ exp(X) are not operator monotone.
Lemma 9.2.13. exp(A) ⪯ I + A + A² for any A ∈ Sⁿ with ‖A‖ ≤ 1.

Proof. Recall exp(x) ≤ 1 + x + x² for all |x| ≤ 1. Since ‖A‖ ≤ 1, i.e. |λᵢ| ≤ 1 for all i, we have 1 + λᵢ + λᵢ² − exp(λᵢ) ≥ 0 for all i, meaning I + A + A² − exp(A) ⪰ 0.
Lemma 9.2.14. log(I + A) ⪯ A for A ≻ −I.

Proof. Recall x ≥ log(1 + x) for all x > −1. Since A ≻ −I, i.e. λᵢ > −1 for all i, we have λᵢ − log(1 + λᵢ) ≥ 0 for all i, meaning A − log(I + A) ⪰ 0.
Theorem 9.2.15 (Lieb). Let f : S₊₊ⁿ → ℝ be a matrix function given by f(A) = Tr(exp(H + log(A))) for some fixed H ∈ Sⁿ. Then f is concave.

Lieb's theorem will be crucial in our proof of Theorem 9.2.1, but it is also highly non-trivial and we will omit its proof here. The interested reader can find a proof in Chapter 8 of [T+ 15].
Lemma 9.2.16 (Jensen’s inequality). E [f (X)] ≥ f (E [X]) when f is convex; E [f (X)] ≤
f (E [X]) when f is concave.
9.2.4 Proof of Matrix Bernstein Concentration Bound

Proof of Theorem 9.2.1. For any A ∈ Sⁿ, its spectral norm ‖A‖ = max{|λₙ(A)|, |λ₁(A)|} = max{λₙ(A), −λ₁(A)}. Let λ₁ ≤ · · · ≤ λₙ be the eigenvalues of X. Then,

Pr[‖X‖ ≥ t] = Pr[(λₙ ≥ t) ∨ (−λ₁ ≥ t)] ≤ Pr[λₙ ≥ t] + Pr[−λ₁ ≥ t].

Let Y := Σᵢ −Xᵢ. It's easy to see that −λₙ ≤ · · · ≤ −λ₁ are the eigenvalues of Y, implying λₙ(Y) = −λ₁(X). Since E[−Xᵢ] = E[Xᵢ] = 0 and ‖−Xᵢ‖ = ‖Xᵢ‖ ≤ R for all i, if we can bound Pr[λₙ(X) ≥ t], then applying the same bound to Y, we can also bound Pr[λₙ(Y) ≥ t]. We will also use the tower rule for expectations:

E_{U,V}[f(U, V)] = E_U[E_V[f(U, V) | U]].
Define X_{<i} = Σ_{j<i} Xⱼ, and let 0 < θ ≤ 1/R. Then one can show

Pr[λₙ ≥ t] ≤ n · exp(−θt + θ² σ²),

and hence

Pr[‖X‖ ≥ t] ≤ 2n · exp(−θt + θ² σ²).

As in the proof of the Bernstein concentration bound for scalar random variables, minimizing the RHS over 0 < θ ≤ 1/R yields

Pr[‖X‖ ≥ t] ≤ 2n · exp(−t² / (2Rt + 4σ²)).
In this section, we will see that for any dense graph, we can find another sparser graph whose
graph Laplacian is approximately the same as measured by their quadratic forms. This turns
out to be a very useful tool for designing algorithms.
Definition 9.3.1. Given A, B ∈ S₊ⁿ and ε > 0, we say

A ≈_ε B if and only if (1/(1 + ε)) A ⪯ B ⪯ (1 + ε) A.
Our goal is to find a graph G̃ = (V, Ẽ, w̃) with few edges that at the same time satisfies L_G ≈_ε L_G̃. We call G̃ a spectral sparsifier of G. Our construction will also ensure that Ẽ ⊆ E, although this is not important in most applications. Figure 9.2 shows an example of a graph G and a spectral sparsifier G̃.

Figure 9.2: A graph G and a spectral sparsifier G̃, satisfying L_G ≈_ε L_G̃ for ε = 2.42.
In particular, if L_G ≈_ε L_G̃, then every cut T ⊆ V is preserved up to a factor (1 + ε):

(1/(1 + ε)) c_G(T) ≤ c_{G̃}(T) ≤ (1 + ε) c_G(T).

Proof. Let 1_T ∈ ℝ^V be the indicator of the set T, i.e. 1_T(u) = 1 for u ∈ T and 1_T(u) = 0 otherwise. We can see that 1_T^⊤ L_G 1_T = c_G(T), and hence the theorem follows by comparing the quadratic forms.
Figure 9.3: The cut cG (T ) in G.
But how well can we spectrally approximate a graph with a sparse graph? The next theorem
gives us a nearly optimal answer to this question.
Theorem 9.3.3 (Spectral Graph Approximation by Sampling, (Spielman-Srivastava 2008)). Consider a connected graph G = (V, E, w), with n = |V|. For any 0 < ε < 1 and 0 < δ < 1, there exist sampling probabilities pₑ for each edge e ∈ E s.t. if we include each edge e in Ẽ independently with probability pₑ and set its weight w̃(e) = (1/pₑ) w(e), then with probability at least 1 − δ the graph G̃ = (V, Ẽ, w̃) satisfies

L_G ≈_ε L_G̃ and |Ẽ| ≤ O(n ε^{−2} log(n/δ)).
We are going to analyze a sampling procedure by turning our goal into a problem of matrix concentration. Recall the following fact.

Fact 9.3.5. A ⪯ B implies C A C^⊤ ⪯ C B C^⊤ for any C ∈ ℝ^{n×n}.

In particular, the condition L_G ≈_ε L_G̃ can be related to the condition Π_L ≈_ε L^{+/2} L̃ L^{+/2}, where Π_L = L^{+/2} L L^{+/2} is the orthogonal projection to the complement of the kernel of L.
Definition 9.3.6. Given a matrix A, we define Π_A to be the orthogonal projection to the complement of the kernel of A, i.e. Π_A v = 0 for v ∈ ker(A) and Π_A v = v for v ∈ ker(A)^⊥. Recall that ker(A)^⊥ = im(A^⊤).

Claim 9.3.7. For a matrix A ∈ Sⁿ with spectral decomposition A = V Λ V^⊤ = Σᵢ λᵢ vᵢ vᵢ^⊤ s.t. V^⊤ V = I, we have Π_A = Σ_{i: λᵢ≠0} vᵢ vᵢ^⊤, and Π_A = A^{+/2} A A^{+/2} = A A⁺ = A⁺ A.
From the definition, we can see that Π_L = I − (1/n) 1 1^⊤.

Now that we understand the projection Π_L, it is not hard to show the following claim.
Claim 9.3.8.

1. If Π_L ≈_ε L^{+/2} L̃ L^{+/2}, then L ≈_ε L̃.

2. If ‖L^{+/2} L̃ L^{+/2} − Π_L‖ ≤ ε/2, then Π_L ≈_ε L^{+/2} L̃ L^{+/2}.
Really, the only idea needed here is that when comparing quadratic forms in matrices with
the same kernel, we necessarily can’t have the quadratic forms disagree on vectors in the
kernel. Simple! But we are going to write it out carefully, since we’re still getting used to
these types of calculations.
Proof of Claim 9.3.8. To prove Part 1, we assume Π_L ≈_ε L^{+/2} L̃ L^{+/2}. Recall that G is a connected graph, so ker(L) = span{1}, while L̃ is the Laplacian of a graph which may or may not be connected, so ker(L̃) ⊇ ker(L), and equivalently im(L̃) ⊆ im(L). Now, for any v ∈ ker(L) we have v^⊤ L̃ v = 0 = v^⊤ L v. For any v ∈ ker(L)^⊥ we have v = L^{+/2} z for some z, as ker(L)^⊥ = im(L) = im(L^{+/2}). Hence

v^⊤ L̃ v = z^⊤ L^{+/2} L̃ L^{+/2} z ≥ (1/(1 + ε)) z^⊤ L^{+/2} L L^{+/2} z = (1/(1 + ε)) v^⊤ L v,

and similarly

v^⊤ L̃ v = z^⊤ L^{+/2} L̃ L^{+/2} z ≤ (1 + ε) z^⊤ L^{+/2} L L^{+/2} z = (1 + ε) v^⊤ L v.
To prove Part 2, suppose ‖L^{+/2} L̃ L^{+/2} − Π_L‖ ≤ ε/2, i.e.

−(ε/2) I ⪯ L^{+/2} L̃ L^{+/2} − Π_L ⪯ (ε/2) I.

But since

1^⊤ (L^{+/2} L̃ L^{+/2} − Π_L) 1 = 0,

we can in fact sharpen this to

−(ε/2) Π_L ⪯ L^{+/2} L̃ L^{+/2} − Π_L ⪯ (ε/2) Π_L.

Rearranging, we then conclude

(1 − ε/2) Π_L ⪯ L^{+/2} L̃ L^{+/2} ⪯ (1 + ε/2) Π_L.
We now have most of the tools to prove Theorem 9.3.3, but to help us, we are going to establish one small piece of helpful notation: We define a matrix function Φ : ℝ^{n×n} → ℝ^{n×n} by

Φ(A) = L^{+/2} A L^{+/2}.

We sometimes call this a “normalizing map”, because it transforms a matrix to the space where spectral norm bounds can be translated into relative error guarantees compared to the L quadratic form.
We introduce a set of independent random variables, one for each edge e, with a probability pₑ associated with the edge, which we will fix later. We then let

Yₑ = (w(e)/pₑ) · bₑ bₑ^⊤ with probability pₑ, and Yₑ = 0 otherwise.
By linearity of Φ,

E[Φ(L̃)] = Φ(E[L̃]) = Π_L.

Let us also define

Xₑ = Φ(Yₑ) − E[Φ(Yₑ)] and X = Σₑ Xₑ.
Note that this ensures E[Xₑ] = 0. We are now going to fix the edge sampling probabilities, in a way that depends on some overall scaling parameter α > 0. We let

pₑ = min(1, α · w(e) · ‖L^{+/2} bₑ bₑ^⊤ L^{+/2}‖).
In the exercises for this chapter, we will ask you to show that Equations (9.4) and (9.5) hold. This means that we can apply Theorem 9.2.1 to our X = Σₑ Xₑ, with R = 1/α and σ² = 1/α, to get

Pr[‖Π_L − L^{+/2} L̃ L^{+/2}‖ ≥ ε/2] ≤ 2n exp(−0.25 ε² / ((ε + 4)/α)).

Since 0 < ε < 1, this means that if α = 40 ε^{−2} log(n/δ), then

Pr[‖Π_L − L^{+/2} L̃ L^{+/2}‖ ≥ ε/2] ≤ 2n δ²/n² ≤ δ/2.

In the last step, we assumed n ≥ 4.
Lastly, we'd like to know that the graph G̃ is sparse. The number of edges in G̃ is equal to the number of Yₑ that come out nonzero. Thus, the expected value of |Ẽ| is

E[|Ẽ|] = Σₑ pₑ ≤ α Σₑ w(e) ‖L^{+/2} bₑ bₑ^⊤ L^{+/2}‖.
We can bound the sum of the norms with a neat trick relating it to the trace of Π_L. Note that in general for a vector a ∈ ℝⁿ, we have ‖a a^⊤‖ = a^⊤ a = Tr(a a^⊤). And hence

Σₑ w(e) ‖L^{+/2} bₑ bₑ^⊤ L^{+/2}‖ = Σₑ w(e) Tr(L^{+/2} bₑ bₑ^⊤ L^{+/2})
= Tr(L^{+/2} (Σₑ w(e) bₑ bₑ^⊤) L^{+/2})
= Tr(Π_L) = n − 1.
Hence E[|Ẽ|] ≤ α(n − 1), so by Markov's inequality |Ẽ| = O(n ε^{−2} log(n/δ)) with probability at least 1 − δ/2. Thus, by a union bound, this condition and Equation (9.3) are both satisfied with probability at least 1 − δ.
We haven’t shown how to compute the sampling probabilities efficiently, so right now, it isn’t
clear whether we can efficiently find G̃. It turns out that if we have access to a fast algorithm
for solving Laplacian linear equations, then we can find sufficiently good approximations to
the effective resistances quickly, and use these to compute G̃. An algorithm for this is
described in [SS11].
Chapter 10
Definition 10.1.1. Given a PSD matrix M and d ∈ ker(M)^⊥, let M x* = d. We say that x̃ is an ε-approximate solution to the linear equation M x = d if

‖x̃ − x*‖_M ≤ ε ‖x*‖_M, where ‖y‖_M = √(y^⊤ M y).
Remark 10.1.2. The requirement d ∈ ker(M )⊥ can be removed, but this is not important
for us.
Theorem 10.1.3 (Spielman and Teng (2004) [ST04]). Given a Laplacian L of a weighted undirected graph G = (V, E, w) with |E| = m and |V| = n and a demand vector d ∈ ℝ^V, we can find x̃ that is an ε-approximate solution to Lx = d, using an algorithm that takes time O(m logᶜ n log(1/ε)) for some fixed constant c and succeeds with probability 1 − 1/n¹⁰.
In the original algorithm of Spielman and Teng, the exponent on the log in the running time
was c ≈ 70.
Today, we are going to see a simpler algorithm. But first, we’ll look at one of the key tools
behind all algorithms for solving Laplacian linear equations quickly.
10.2 Preconditioning and Approximate Gaussian
Elimination
Recall our definition of two positive semi-definite matrices being approximately equal.
Definition 10.2.1 (Spectral approximation). Given A, B ∈ S₊ⁿ, we say that

A ≈_K B if and only if (1/(1 + K)) A ⪯ B ⪯ (1 + K) A.
Suppose we have a positive definite matrix M ∈ S₊₊ⁿ and want to solve a linear equation
M x = d . We can do this using gradient descent or accelerated gradient descent, as we
covered in Graded Homework 1. But if we have access to an easy-to-invert matrix that
happens to also be a good spectral approximation of M , then we can use this to speed up
the (accelerated) gradient descent algorithm. An example of this would be that we have
a factorization LL> ≈K M , where L is lower triangular and sparse, which means we can
invert it quickly.
The following lemma, which you will prove in Problem Set 6, makes this preconditioning precise.

Lemma 10.2.2. Given a matrix M ∈ S₊₊ⁿ, a vector d and a decomposition M ≈_K LL^⊤, we can find x̃ that ε-approximately solves M x = d, using O((1 + K) log(K/ε) (T_matvec + T_sol + n)) time.
• Tmatvec denotes the time required to compute M z given a vector z , i.e. a “matrix-vector
multiplication”.
• Tsol denotes the time required to compute L−1 z or (L> )−1 z given a vector z .
Dealing with pseudo-inverses. When our matrices have a null space, preconditioning
becomes slightly more complicated, but as long as it is easy to project to the complement
of the null space, there’s no real issue. The following describes precisely what we need (but
you can ignore the null-space issue when first reading these notes without losing anything
significant).
Lemma 10.2.3. Given a matrix M ∈ S₊ⁿ, a vector d ∈ ker(M)^⊥ and a decomposition M ≈_K LDL^⊤, where L is invertible, we can find x̃ that ε-approximately solves M x = d, using O((1 + K) log(K/ε) (T_matvec + T_sol + T_proj + n)) time.
• Tmatvec denotes the time required to compute M z given a vector z , i.e. a “matrix-vector
multiplication”.
• Tsol denotes the time required to compute L−1 z and (L> )−1 z and D + z given a vector
z.
• Tproj denotes the time required to compute ΠM z given a vector z .
Theorem 10.2.4 (Kyng and Sachdeva (2015) [KS16]). Given a Laplacian L of a weighted undirected graph G = (V, E, w) with |E| = m and |V| = n, we can find a decomposition LL^⊤ ≈_{0.5} L, such that L has number of non-zeroes nnz(L) = O(m log³ n), with probability at least 1 − 3/n⁵, in time O(m log³ n).
We can combine Theorem 10.2.4 with Lemma 10.2.3 to get a fast algorithm for solving
Laplacian linear equations.
Corollary 10.2.5. Given a Laplacian L of a weighted undirected graph G = (V, E, w ) with
|E| = m and |V| = n and a demand vector d ∈ ℝ^V, we can find x̃ that is an ε-approximate solution to Lx = d, using an algorithm that takes time O(m log³ n log(1/ε)) and succeeds with probability 1 − 1/n¹⁰.
Proof sketch. First we need to get a factorization that conforms to Lemma 10.2.3. The decomposition LL^⊤ provided by Theorem 10.2.4 can be rewritten as LL^⊤ = L̂ D L̂^⊤, where L̂ is equal to L except L̂(n, n) = 1, and we let D be the identity matrix, except D(n, n) = 0. This ensures D⁺ = D and that L̂ is invertible and lower triangular with O(m log³ n) non-zeros. We note that the inverse of an invertible lower or upper triangular matrix with N non-zeros can be applied in time O(N), given an adjacency list representation of the matrix. Finally, as ker(LL^⊤) = span{1}, we have Π_{L̂ D L̂^⊤} = I − (1/n) 1 1^⊤, and this projection matrix can be applied in O(n) time. Altogether, this means that T_matvec + T_sol + T_proj = O(m log³ n), which suffices to complete the proof.
We want to introduce some notation that will help us describe and analyze a faster version
of Gaussian elimination – one that uses sampling to create a sparse approximation of the
decomposition.
Consider a Laplacian S of a graph H and a vertex v of H. We define Star(v, S) to be the Laplacian of the subgraph of H consisting of the edges incident on v. We define

Clique(v, S) = Star(v, S) − (1/S(v, v)) S(:, v) S(:, v)^⊤.
For example, suppose

L = [ W −a^⊤; −a diag(a) + L_{−1} ].

Then

Star(1, L) = [ W −a^⊤; −a diag(a) ] and Clique(1, L) = [ 0 0; 0 diag(a) − (1/W) a a^⊤ ],

which is illustrated in Figure 10.1.
In Chapter 8, we proved that Clique(v, S) is a graph Laplacian – it follows from the proof of Claim 8.1.1 in that chapter. Thus we have the following.

Claim 10.3.1. If S is the Laplacian of a connected graph, then Clique(v, S) is a graph Laplacian.
This also provides a way to understand why Gaussian Elimination is slow in some cases. At each step, one vertex is eliminated, but a clique is added to the subgraph on the remaining vertices, making the graph denser. And at the ith step, computing Star(vᵢ, Sᵢ₋₁) takes around deg(vᵢ) time, but computing Clique(vᵢ, Sᵢ₋₁) requires around deg(vᵢ)² time. In order to speed up Gaussian Elimination, the algorithmic idea of [KS16] is to plug in a sparse approximation of the intended clique instead of the whole clique.
The following procedure CliqueSample(v, S) produces a sparse approximation of Clique(v, S). Let V be the vertex set of the graph associated with S and E the edge set. We define b_{i,j} ∈ ℝ^V to be the vector with b_{i,j}(i) = 1, b_{i,j}(j) = −1, and all other entries zero.

Algorithm 2: CliqueSample(v, S)
Input: Graph Laplacian S ∈ ℝ^{V×V} of a graph with edge weights w, and vertex v ∈ V
Output: Y_v ∈ ℝ^{V×V}, a sparse approximation of Clique(v, S)
Y_v ← 0_{n×n};
foreach Multiedge e = (v, i) from v to a neighbor i do
    Randomly pick a neighbor j of v with probability w(j, v)/d(v);
    If i ≠ j, let Y_v ← Y_v + (w(i, v) w(j, v))/(w(i, v) + w(j, v)) · b_{i,j} b_{i,j}^⊤;
return Y_v;
Remark 10.3.2. We can implement each sampling of a neighbor j in O(1) time using a
classical algorithm known as Walker’s method (also known as the Alias method or Vose’s
method). This algorithm requires an additional O(degS (v)) time to initialize a data struc-
ture used for sampling. Overall, this means the total time for O(degS (v)) samples is still
O(degS (v)).
Proof. Let C = Clique(v, S). Observe that both E[Y_v] and C are Laplacians. Thus it suffices to verify E[Y_v(i, j)] = C(i, j) for i ≠ j.
Remark 10.3.4. Lemma 10.3.3 shows that CliqueSample(v, L) produces the original
Clique(v, L) in expectation.
a
L is not actually lower triangular. However, if we let P π be the permutation matrix corresponding to
π, then P π L is lower triangular. Knowing the ordering that achieves this is enough to let us implement
forward and backward substitution for solving linear equations in L and L> .
Note that if we replace CliqueSample(π(i), S i−1 ) by Clique(π(i), S i−1 ) at each step, then
we can recover Gaussian Elimination, but with a random elimination order.
Ultimately, the proof is going to have a lot in common with our proof of Matrix Bernstein
in Chapter 9. Overall, the lesson there was that when we have a sum of independent, zero-
mean random matrices, we can show that the sum is likely to have small spectral norm if the
spectral norm of each random matrix is small, and the matrix-valued variance is also small.
Thus, to replicate the proof, we need control over

1. The spectral norm of each sample.

2. The sample variance.
But there is seemingly another major obstacle: We are trying to analyze a process where the samples are far from independent. Each time we sample edges, we add new edges to the remaining graph, which we will later sample again. This creates a lot of dependencies between the samples, which we have to handle.
However, it turns out that full independence is more than what is needed to prove concentration. Instead, it suffices to have a sequence of random variables such that each is zero-mean in expectation, conditional on the previous ones. This is called a martingale difference sequence. We'll now learn about those.
Since our analysis requires frequently measuring matrices after right and left-multiplication by L^{+/2}, we reintroduce the “normalizing map” Φ : ℝ^{n×n} → ℝ^{n×n} defined by

Φ(A) = L^{+/2} A L^{+/2}.
10.4.2 Martingales
A sequence of random variables Z_0, Z_1, …, Z_k is called a martingale if E[Z_i | Z_0, …, Z_{i−1}] = Z_{i−1} for all i.
That is, conditional on the outcome of all the previous random variables, the expectation of
Z_i equals Z_{i−1}. If we unravel the sequence of conditional expectations, we get that without
conditioning E[Z_k] = E[Z_0].
Typically, we use martingales to show a statement along the lines of “Z_k is concentrated around
E[Z_k]”.
We can also think of a martingale in terms of the sequence of changes in the Zi variables.
Let Xi = Zi − Zi−1 . The sequence of Xi s is called a martingale difference sequence. We can
now state the martingale condition as
E [Xi | Z0 , . . . , Zi−1 ] = 0.
And because Z0 and X1 , . . . , Xi−1 completely determine Z1 , . . . , Zi−1 , we could also write
the martingale condition equivalently as
E [Xi | Z0 , X1 , . . . , Xi−1 ] = 0.
Crucially, we can write
$$Z_k = Z_0 + \sum_{i=1}^{k} (Z_i - Z_{i-1}) = Z_0 + \sum_{i=1}^{k} X_i.$$
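The power of this decomposition is that it holds even when the differences depend heavily on the past, as long as each is conditionally zero-mean. The following toy simulation (our own illustration, with an arbitrary doubling-bet scheme) shows E[Z_k] = E[Z_0] surviving strong dependencies between the X_i.

import random

def run(k):
    # Z_0 = 0; each X_i is +/- the current bet with equal probability, so
    # E[X_i | past] = 0 even though the bet size depends on the history.
    z, bet = 0.0, 1.0
    for _ in range(k):
        z += bet * random.choice([-1.0, 1.0])  # the difference X_i
        bet *= 2.0  # dependence on the past does not break the martingale
    return z

samples = [run(10) for _ in range(100000)]
print(sum(samples) / len(samples))  # empirical mean of Z_10, close to 0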
In our setting, consider the first elimination step, and write
$$L_0 = L = l_1 l_1^\top + \textsc{Clique}(v, L) + L_{-1}, \qquad L_1 = l_1 l_1^\top + \textsc{CliqueSample}(v, L) + L_{-1}.$$
Taking expectations over the call to CliqueSample, by Lemma 10.3.3,
$$E[L_1] = l_1 l_1^\top + \textsc{Clique}(v, L) + L_{-1} = L_0.$$
Ultimately, our plan is to use this matrix martingale structure to show that “Ln is concen-
trated around L” in some appropriate sense. More precisely, the spectral approximation we
would like to show can be established by showing that “Φ(Ln ) is concentrated around Φ(L)”
10.4.3 Martingale Difference Sequence as Edge-Samples
$$\begin{aligned}
L_i - L_{i-1} &= l_i l_i^\top + \textsc{CliqueSample}(\pi(i), S_{i-1}) - \textsc{Star}(\pi(i), S_{i-1}) \\
&= \textsc{CliqueSample}(\pi(i), S_{i-1}) - \textsc{Clique}(\pi(i), S_{i-1}) \qquad (10.5) \\
&= \textsc{CliqueSample}(\pi(i), S_{i-1}) - E[\textsc{CliqueSample}(\pi(i), S_{i-1}) \mid \text{preceding samples}]
\end{aligned}$$
by Lemma 10.3.3.
In particular, recall that by Lemma 10.3.3, conditional on the randomness before the call to
CliqueSample(π(i), S_{i−1}), we have E[CliqueSample(π(i), S_{i−1})] = Clique(π(i), S_{i−1}).
We further introduce notation for each multi-edge sample: for e ∈ Star(π(i), S_{i−1}), we write
Y π(i),e , denoting the random edge Laplacian sampled when the algorithm is processing
multi-edge e. Thus, conditional on preceding samples, we have
$$Y_{\pi(i)} = \sum_{e \in \textsc{Star}(\pi(i), S_{i-1})} Y_{\pi(i),e} \qquad (10.6)$$
Note that even the number of multi-edges in Star(π(i), S i−1 ) depends on the preceding
samples. We also want to associate zero-mean variables with each edge. Conditional on
preceding samples, we also define
$$X_{i,e} = \Phi\big(Y_{\pi(i),e} - E[Y_{\pi(i),e}]\big) \qquad \text{and} \qquad X_i = \sum_{e \in \textsc{Star}(\pi(i), S_{i-1})} X_{i,e}.$$
Note that the X i,e variables form a martingale difference sequence, because the linearity of
Φ ensures they are zero-mean conditional on preceding randomness.
10.4.4 Stopped Martingales
Figure 10.2 shows the L̃_i martingale getting stuck at the first time L_{j*} ⋠ 1.5L.
Claim 10.4.3.
1. The sequence {L̃_i} for i = 0, …, n is a martingale.
2. ‖L^{+/2}(L̃_i − L)L^{+/2}‖ ≤ 0.5 implies ‖L^{+/2}(L_i − L)L^{+/2}‖ ≤ 0.5.
The martingale property also implies that the unconditional expectation satisfies E[L̃_n] = L.
The proof of the claim is easy to sketch: For Part 1, each difference is zero-mean if the
condition has not been violated, and is identically zero (and hence zero-mean) if it has been
violated. For Part 2, if the martingale {L̃_i} has stopped, then ‖L^{+/2}(L̃_i − L)L^{+/2}‖ ≤ 0.5
is false, and the implication is vacuously true. If, on the other hand, the martingale has not
stopped, the quantities are equal, because L̃_i = L_i, and again it's easy to see the implication
holds.
Thus, ultimately, our strategy is going to be to show that ‖L^{+/2}(L̃_n − L)L^{+/2}‖ ≤ 0.5 with
high probability. Expressed using the normalizing map Φ(·), our goal is to show that with
high probability ‖Φ(L̃_n − L)‖ ≤ 0.5.
Whenever the modified martingale X̃_i has not yet stopped, we also introduce individual
modified edge samples X̃_{i,e} = X_{i,e}. If the martingale has stopped, i.e. X̃_i = 0, then we
take these edge samples X̃_{i,e} to be zero. We can now write
$$\Phi(\tilde{L}_n - L) = \sum_{i=1}^{n} \Phi(\tilde{L}_i - \tilde{L}_{i-1}) = \sum_{i=1}^{n} \tilde{X}_i = \sum_{i=1}^{n} \sum_{e \in \textsc{Star}(\pi(i), S_{i-1})} \tilde{X}_{i,e}.$$
10.4.5 Sample Norm Control
In this subsection, we're going to see that the norm of each multi-edge sample is controlled
throughout the algorithm.
Lemma 10.4.4. Let L and S be two Laplacians on the same vertex set.1 If each multiedge
e of Star(v, S) has bounded norm in the following sense,
$$\left\| L^{+/2}\, w_S(e)\, b_e b_e^\top\, L^{+/2} \right\| \le R,$$
1
L can be regarded as the original Laplacian we care about, while S can be regarded as some intermediate
Laplacian appearing during Approximate Gaussian Elimination.
then each possible sampled multiedge e′ of CliqueSample(v, S) also satisfies
$$\left\| L^{+/2}\, w_{\mathrm{new}}(e')\, b_{e'} b_{e'}^\top\, L^{+/2} \right\| \le R.$$
Proof. Let w = w_S for simplicity. Consider a sampled edge between i and j with weight
w_new(i, j) = w(i, v)w(j, v)/(w(i, v) + w(j, v)). Then
$$\begin{aligned}
\left\| L^{+/2} w_{\mathrm{new}}(i,j)\, b_{ij} b_{ij}^\top L^{+/2} \right\|
&= w_{\mathrm{new}}(i,j) \left\| L^{+/2} b_{ij} b_{ij}^\top L^{+/2} \right\|
= w_{\mathrm{new}}(i,j) \left\| L^{+/2} b_{ij} \right\|^2 \\
&\le w_{\mathrm{new}}(i,j) \left( \left\| L^{+/2} b_{iv} \right\|^2 + \left\| L^{+/2} b_{jv} \right\|^2 \right) \\
&= \frac{w(j,v)}{w(i,v)+w(j,v)} \left\| L^{+/2}\, w(i,v) b_{iv} b_{iv}^\top L^{+/2} \right\|
 + \frac{w(i,v)}{w(i,v)+w(j,v)} \left\| L^{+/2}\, w(j,v) b_{jv} b_{jv}^\top L^{+/2} \right\| \\
&\le \frac{w(j,v)}{w(i,v)+w(j,v)} R + \frac{w(i,v)}{w(i,v)+w(j,v)} R = R.
\end{aligned}$$
The first inequality uses the triangle inequality of effective resistance in L, in that effective
resistance is a distance as we proved in Chapter 7. The second inequality just uses the
conditions of this lemma.
Remark 10.4.5. Lemma 10.4.4 only requires that each single multiedge has small norm,
not that the sum of all edges between a pair of vertices has small norm. And this lemma
tells us that, after sampling, each multiedge in the new graph still satisfies the bounded
norm condition.
From the Lemma, we can conclude that each edge sample Y_{π(i),e} satisfies ‖Φ(Y_{π(i),e})‖ ≤ R
provided the assumptions of the Lemma hold. Let’s record this observation as a Lemma.
Lemma 10.4.6. If for all e ∈ Star(v, S_i),
$$\left\| \Phi\big( w_{S_i}(e)\, b_e b_e^\top \big) \right\| \le R,$$
then every edge sample Y produced by CliqueSample(v, S_i) satisfies ‖Φ(Y)‖ ≤ R.
This also implies that ‖L^{+/2} w(ê) b_ê b_ê^⊤ L^{+/2}‖ ≤ 1.
Now, that means that if we split every original edge e of the graph into K multi-edges
e_1, …, e_K, each with a fraction 1/K of the weight, we get a new graph G′ = (V, E′, w′) such that
Claim 10.4.7.
1. Every multi-edge e_i ∈ E′ satisfies ‖Φ(w′(e_i) b_{e_i} b_{e_i}^⊤)‖ ≤ 1/K.
2. |E′| = K|E|.
Before we run Approximate Gaussian Elimination, we are going to do this multi-edge splitting
to ensure we have control over multi-edge sample norms. Combined with Lemma 10.4.4, this
immediately establishes the next lemma, because we start off with all multi-edges having
bounded norm and only produce multi-edges with bounded norm.
Lemma 10.4.8. When Algorithm 3 is run on the (multi-edge) Laplacian of G′, arising from
splitting edges of G into K multi-edges, then every edge sample Y_{π(i),e} satisfies
‖Φ(Y_{π(i),e})‖ ≤ 1/K.
10.4.6 Random Matrix Concentration from Trace Exponentials
Let us recall how matrix-valued variances come into the picture when proving concentration
following the strategy from Matrix Bernstein in Chapter 9.
For some matrix-valued random variable X ∈ S^n, we'd like to bound Pr[‖X‖ ≥ 0.5]. Using
Markov's inequality, and some observations about matrix exponentials and traces, we saw
that for all θ > 0,
$$\Pr[\|X\| \ge 0.5] \le \exp(-0.5\theta)\big( E[\operatorname{Tr}(\exp(\theta X))] + E[\operatorname{Tr}(\exp(-\theta X))] \big). \qquad (10.10)$$
We then want to bound E[Tr(exp(θX))] using Lieb's theorem. We can handle
E[Tr(exp(−θX))] similarly.
Theorem 10.4.9 (Lieb). Let f : S^n_{++} → R be the matrix function given by f(A) = Tr exp(H + log(A)) for a fixed H ∈ S^n. Then f is concave.
As observed by Tropp, this is useful for proving matrix concentration statements. Combined
with Jensen's inequality, it gives that for a random matrix X ∈ S^n and a fixed H ∈ S^n,
$$E[\operatorname{Tr}\exp(H + X)] \le \operatorname{Tr}\exp(H + \log E[\exp(X)]).$$
The next crucial step was to show that it suffices to obtain an upper bound on the matrix
E[exp(X)] w.r.t. the Loewner order. Using the following three lemmas, this conclusion is an
immediate corollary.
10.4.7 Mean-Exponential Bounds from Variance Bounds
To use Corollary 10.4.13, we need to construct useful upper bounds on E[exp(X)]. This can
be done, starting from the following lemma.
We end up wanting to bound the matrix-valued variance E[X²]. In the rest of this subsection,
we're going to see that the matrix-valued variance of the stopped martingale is bounded
throughout the algorithm.
Firstly, we note that for a single edge sample X̃_{i,e}, by Lemma 10.4.8, we have
$$\left\| \tilde{X}_{i,e} \right\| \le \left\| \Phi\big( Y_{\pi(i),e} - E[Y_{\pi(i),e}] \big) \right\| \le 1/K,$$
10.4.8 The Overall Mean-Trace-Exponential Bound
We will use E_{(<i)} to denote expectation over variables preceding the ith elimination step. We
are going to refrain from explicitly writing out conditioning in our expectations, but any inner
expectation that appears inside another outer expectation should be taken as conditional on
the outer expectation. We are going to use d_i to denote the multi-edge degree of vertex π(i)
in S_{i−1}. This is exactly the number of edge samples in the ith elimination. Note that there
is no elimination at step n (the algorithm is already finished). As a notational convenience,
let's write n̂ = n − 1. With all that in mind, we bound the mean-trace-exponential for some
parameter 0 < θ ≤ 0.5√K:
$$\begin{aligned}
& E\, \operatorname{Tr}\exp\Big( \theta \sum_{i=1}^{\hat n} \tilde{X}_i \Big) \qquad (10.12) \\
&= \mathop{E}_{(<\hat n)} \mathop{E}_{\pi(\hat n)} \mathop{E}_{\tilde{X}_{\hat n,1}} \cdots \mathop{E}_{\tilde{X}_{\hat n, d_{\hat n}-1}} \mathop{E}_{\tilde{X}_{\hat n, d_{\hat n}}} \operatorname{Tr}\exp\Big( \underbrace{\sum_{i=1}^{\hat n - 1} \theta \tilde{X}_i + \sum_{e=1}^{d_{\hat n}-1} \theta \tilde{X}_{\hat n, e}}_{H} + \theta \tilde{X}_{\hat n, d_{\hat n}} \Big) \\
&\qquad \text{($\tilde{X}_{\hat n,1}, \ldots, \tilde{X}_{\hat n, d_{\hat n}}$ are independent conditional on $(<\hat n), \pi(\hat n)$)} \\
&\le \mathop{E}_{(<\hat n)} \mathop{E}_{\pi(\hat n)} \mathop{E}_{\tilde{X}_{\hat n,1}} \cdots \mathop{E}_{\tilde{X}_{\hat n, d_{\hat n}-1}} \operatorname{Tr}\exp\Big( \sum_{i=1}^{\hat n - 1} \theta \tilde{X}_i + \sum_{e=1}^{d_{\hat n}-1} \theta \tilde{X}_{\hat n, e} + \frac{1}{K}\theta^2 \mathop{E}_{\tilde{X}_{\hat n, d_{\hat n}}} \Phi\big(Y_{\pi(\hat n), d_{\hat n}}\big) \Big) \\
&\qquad \text{by Equation (10.11) and Corollary 10.4.13} \\
&\;\;\vdots \qquad \text{repeat for each multi-edge sample } \tilde{X}_{\hat n,1}, \ldots, \tilde{X}_{\hat n, d_{\hat n}-1} \\
&\le \mathop{E}_{(<\hat n)} \mathop{E}_{\pi(\hat n)} \operatorname{Tr}\exp\Big( \sum_{i=1}^{\hat n - 1} \theta \tilde{X}_i + \sum_{e=1}^{d_{\hat n}} \frac{1}{K}\theta^2 \mathop{E}_{\tilde{X}_{\hat n, e}} \Phi\big(Y_{\pi(\hat n), e}\big) \Big) \\
&= \mathop{E}_{(<\hat n)} \mathop{E}_{\pi(\hat n)} \operatorname{Tr}\exp\Big( \sum_{i=1}^{\hat n - 1} \theta \tilde{X}_i + \frac{1}{K}\theta^2 \Phi\big(\textsc{Clique}(\pi(\hat n), S_{\hat n - 1})\big) \Big)
\end{aligned}$$
To further bound this quantity, we now need to deal with the random choice of π(n̂).
We'll be able to use this to bound the trace-exponential in a very strong way. From a random
matrix perspective, it's the following few steps that give the analysis its surprising strength.
We can treat (1/K)θ²Φ(Clique(π(n̂), S_{n̂−1})) as a random matrix. It is not zero-mean, but we
can still bound the trace-exponential using Corollary 10.4.13.
We can also bound the expected matrix exponential in that case, using a simple corollary of
Lemma 10.4.14.
Corollary 10.4.15. exp(A) ⪯ I + (1 + R)A for 0 ⪯ A with ‖A‖ ≤ R ≤ 1.
Proof. The conclusion follows after observing that for 0 ⪯ A with ‖A‖ ≤ R, we have
A² ⪯ RA. We can see this by considering the spectral decomposition of A and dealing with
each eigenvalue separately.
Next, we need a simple structural observation about the cliques created by elimination:
Claim 10.4.16. Clique(π(i), S_i) ⪯ Star(π(i), S_i) ⪯ S_i.
Next we make use of the fact that X̃_i is from the difference sequence of the stopped martingale.
This means we can assume S_i ⪯ 1.5L, since otherwise X̃_i = 0 and we get an even better
bound on the trace-exponential. To make this formal, in Equation (10.12), we ought to do a
case analysis that also includes the case X̃_i = 0 when the martingale has stopped, but we
omit this.
Thus we can conclude by Claim 10.4.16 that ‖Φ(Clique(π(i), S_i))‖ ≤ 1.5.
By our assumption 0 < θ ≤ 0.5√K, we have ‖(1/K)θ²Φ(Clique(π(i), S_{i−1}))‖ ≤ 1, so that by
Corollary 10.4.15,
$$\mathop{E}_{\pi(i)} \exp\Big( \frac{1}{K}\theta^2 \Phi\big(\textsc{Clique}(\pi(i), S_{i-1})\big) \Big) \preceq I + \frac{2}{K}\theta^2 \mathop{E}_{\pi(i)} \Phi\big(\textsc{Clique}(\pi(i), S_{i-1})\big) \qquad (10.13)$$
$$\preceq I + \frac{2}{K}\theta^2 \mathop{E}_{\pi(i)} \Phi\big(\textsc{Star}(\pi(i), S_{i-1})\big)$$
by Claim 10.4.16.
Next we observe that, because every multi-edge appears in exactly two stars, and π(i) is
chosen uniformly at random among the n + 1 − i vertices that S_{i−1} is supported on, we have
$$\mathop{E}_{\pi(i)} \textsc{Star}(\pi(i), S_{i-1}) = \frac{2}{n+1-i}\, S_{i-1}.$$
And, since we assume S_{i−1} ⪯ 1.5L, we further get
$$\mathop{E}_{\pi(i)} \exp\Big( \frac{1}{K}\theta^2 \Phi\big(\textsc{Clique}(\pi(i), S_{i-1})\big) \Big) \preceq I + \frac{6\theta^2}{K(n+1-i)}\, I.$$
We can combine this with Equation (10.12) and Corollary 10.4.13 to get
$$\begin{aligned}
E\, \operatorname{Tr}\exp\Big( \theta \sum_{i=1}^{\hat n} \tilde{X}_i \Big)
&\le \mathop{E}_{(<\hat n)} \mathop{E}_{\pi(\hat n)} \operatorname{Tr}\exp\Big( \sum_{i=1}^{\hat n - 1} \theta \tilde{X}_i + \frac{1}{K}\theta^2 \Phi\big(\textsc{Clique}(\pi(\hat n), S_{\hat n - 1})\big) \Big) \\
&\le \mathop{E}_{(<\hat n)} \operatorname{Tr}\exp\Big( \sum_{i=1}^{\hat n - 1} \theta \tilde{X}_i + \frac{6\theta^2}{K(n+1-\hat n)}\, I \Big)
\end{aligned}$$
And by repeating this analysis for each term X̃_i, we get
$$E\, \operatorname{Tr}\exp\Big( \theta \sum_{i=1}^{\hat n} \tilde{X}_i \Big) \le \operatorname{Tr}\exp\Big( \sum_{i=1}^{\hat n} \frac{6\theta^2}{K(n+1-i)}\, I \Big) \le \operatorname{Tr}\exp\Big( \frac{7\theta^2 \log(n)}{K}\, I \Big) = n \exp\Big( \frac{7\theta^2 \log(n)}{K} \Big).$$
Then, by choosing K = 200 log²(n) and θ = 0.5√K, we get
$$\exp(-0.5\theta)\, E\, \operatorname{Tr}\exp\Big( \theta \sum_{i=1}^{\hat n} \tilde{X}_i \Big) \le \exp(-0.5\theta)\, n \exp\Big( \frac{7\theta^2 \log(n)}{K} \Big) \le 1/n^5.$$
$E\, \operatorname{Tr}\exp(-\theta \sum_{i=1}^{\hat n} \tilde{X}_i)$ can be bounded by an identical argument, so that Equation (10.10)
gives
$$\Pr\Big[ \Big\| \sum_{i=1}^{\hat n} \tilde{X}_i \Big\| \ge 0.5 \Big] \le 2/n^5.$$
Thus we have established $\| \sum_{i=1}^{\hat n} \tilde{X}_i \| \le 0.5$ with high probability (Equation (10.9)), and hence
the spectral approximation guarantee of Theorem 10.2.4.
Now, all that’s left to note is that the running time is linear in the multi-edge degree of
the vertex being eliminated in each iteration (and this also bounds the number of non-zero
entries being created in L). The total number of multi-edges left in the remaining graph
stays constant at Km = O(m log2 n). Thus the expected degree in the ith elimination is
Km/(n + i − 1), because the remaining number of vertices is n + i − 1. Hence the total
running time and total number of non-zero entries created can both be bounded as
X
Km 1/(n + i − 1) = O(m log3 n).
i
We can further prove that the bound O(m log³ n) on running time and number of non-zeros
in L holds with high probability (e.g. 1 − 1/n⁵). To show this, we essentially need a scalar
Chernoff bound, except the degrees are in fact not independent, and so we need a scalar
martingale concentration result, e.g. Azuma's Inequality. This way, we complete the proof
of Theorem 10.2.4.
Part III
Chapter 11
Setup. Consider a directed graph G = (V, E, c), where V denotes vertices, E denotes edges
and c ∈ R^E, c ≥ 0, denotes edge capacities. In contrast to earlier chapters, the direction of
each edge will be important, and not just as a bookkeeping tool (which is how we previously
used it in undirected graphs). We consider an edge (u, v) ∈ E to be from u to v. Edge
capacities are associated with directed edges, and we allow both edge (u, v) and (v, u) to
exist, and they may have different capacities.
A flow is any vector f ∈ R^E.
We say that a flow is feasible when 0 ≤ f ≤ c. The constraint 0 ≤ f ensures that the flow
respects edge directions, while the constraint f ≤ c ensures that the flow does not exceed
capacity on any edge.
We still define the edge-vertex incidence matrix B ∈ R^{V×E} of G by
$$B(v, e) = \begin{cases} 1 & \text{if } e = (u, v) \\ -1 & \text{if } e = (v, u) \\ 0 & \text{otherwise.} \end{cases}$$
We say that a flow f routes the demands d ∈ R^V if Bf = d.
We will focus on the case
$$Bf = F(-e_s + e_t) = F b_{st}.$$
Flows that satisfy the above equation for some scalar F are called s-t flows, where s, t ∈ V.
The vertex s is called the source, t is called the sink, and e_s, e_t are indicator vectors for
the source and sink nodes respectively. The vector b_{st} has −1 at the source and 1 at the sink. The
maximum flow can be expressed as a linear program as follows:
maximum flow can be expressed as a linear program as follows
max F
f ∈RE ,F
s.t. Bf = F b st (11.1)
0≤f ≤c
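As a sanity check, Program (11.1) can be handed directly to an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog on a small hypothetical instance; the graph, the variable layout (f stacked with F), and all names are our own choices.

import numpy as np
from scipy.optimize import linprog

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]  # toy graph, s = 0, t = 3
cap = [1.0, 1.0, 1.0, 1.0, 1.0]
n, m, s, t = 4, len(edges), 0, 3

B = np.zeros((n, m))  # edge-vertex incidence matrix as defined above
for e, (u, v) in enumerate(edges):
    B[u, e], B[v, e] = -1.0, 1.0
b_st = np.zeros(n)
b_st[s], b_st[t] = -1.0, 1.0

# Variables x = (f, F): minimize -F subject to B f - F b_st = 0, 0 <= f <= c.
A_eq = np.hstack([B, -b_st.reshape(-1, 1)])
c_obj = np.zeros(m + 1)
c_obj[-1] = -1.0
bounds = [(0.0, cap[e]) for e in range(m)] + [(0.0, None)]

res = linprog(c_obj, A_eq=A_eq, b_eq=np.zeros(n), bounds=bounds)
print("max flow value:", res.x[-1])  # 2.0 on this instance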
Definition 11.2.2. A cycle flow is a flow f ≥ 0 that can be written as a cycle, i.e.
$$f = \alpha \sum_{e \in c} e_e$$
for some cycle c and scalar α ≥ 0.
Lemma 11.2.3 (The path-cycle decomposition lemma). Any s-t flow f can be decomposed
into a sum of s-t path flows and cycle flows such that the sum contains at most nnz(f ) terms.
Note nnz(f ) ≤ |E|.
Proof. We perform induction on the number of edges with non-zero flow, which we denote
by nnz(f). Note that by “the support of f”, we mean the set {e ∈ E : f(e) ≠ 0}.
Inductive step: Try to find a path from s to t OR a cycle in the support of f.
“Path” case. If there exists such an s-t path p, let α be the minimum flow value along the edges
of the path, i.e. α = min_{(a,b)∈p} f(a, b), and let f′ = α Σ_{e∈p} e_e.
“Cycle” case. Similarly, if we find a cycle c in the support of f, let α = min_{(a,b)∈c} f(a, b)
and f′ = α Σ_{e∈c} e_e.
In either case, f − f′ is again a flow, and α was chosen so that at least one more edge carries
zero flow, i.e. nnz(f − f′) < nnz(f), so we can recurse on f − f′.
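The induction translates into a simple (if unoptimized) procedure. Here is a Python sketch, under the assumption that the input is a feasible s-t flow with positive value, represented as a dictionary over directed edges; the helper names are our own.

def strip(flow, verts, terms, eps=1e-9):
    # Subtract alpha = min flow along the walk `verts`, saturating >= 1 edge.
    edges = list(zip(verts, verts[1:]))
    alpha = min(flow[e] for e in edges)
    terms.append((alpha, verts))
    for e in edges:
        flow[e] -= alpha
        if flow[e] <= eps:
            del flow[e]

def decompose(flow, s, t, eps=1e-9):
    # Path-cycle decomposition of an s-t flow: each strip saturates an
    # edge, so at most nnz(f) terms are produced.
    flow = {e: v for e, v in flow.items() if v > eps}
    terms = []
    while any(u == s for (u, _) in flow):  # some flow still leaves s
        walk, pos = [s], {s: 0}
        while walk[-1] != t:
            u = walk[-1]
            v = next(b for (a, b) in flow if a == u)  # follow the support
            if v in pos:  # closed a cycle: strip it and keep walking
                strip(flow, walk[pos[v]:] + [v], terms, eps)
                walk = walk[:pos[v] + 1]
                pos = {w: i for i, w in enumerate(walk)}
            else:
                pos[v] = len(walk)
                walk.append(v)
        strip(flow, walk, terms, eps)  # strip the s-t path
    return terms  # any leftover flow consists of cycles avoiding s

print(decompose({('s','a'): 1, ('a','t'): 1, ('s','b'): 1, ('b','t'): 1}, 's', 't'))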
Lemma 11.2.4. In any s-t max flow problem instance, there is an optimal flow f ∗ with a
path-cycle decomposition that has only paths and no cycles.
Proof. Let f̃ be an optimal flow. Let f ∗ be the sum of path flows in the path cycle decom-
position of f̃ . They route the same flow (as cycles contribute to no net flow from s to t).
Thus
$$Bf^* = B\tilde{f}, \qquad 0 \le f^* \le \tilde{f} \le c.$$
The first inequality follows from f* being a sum of positive path flows. The second inequality
holds as f* is upper bounded in every single entry by f̃, because we obtained it by removing
cycle flows with positive entries. The third inequality holds because f̃ is a feasible flow, so
it is upper bounded by the capacities.
An optimal flow solving the maximum flow problem may not be unique. For example,
consider the graph below with source s and sink t: there are two optimal paths in this
example. Maximum flow is a convex optimization problem but not a strongly convex problem,
as the solutions are not unique, and this is part of what makes it hard to solve.
The decomposition shown earlier provides a way to show that the maximum flow in a graph
is upper bounded by constructing graph cuts.
Given a vertex subset S ⊆ V, we say that (S, V \ S) is a cut in G and that the value of the
cut is
$$c_G(S) = \sum_{e \in E \cap (S \times (V \setminus S))} c(e).$$
Note that in a directed graph, only edges crossing the cut going from S to V \ S count
toward the cut.
Figure 11.1: Example of a cut: No edges go from S to V \ S, and so the value of this
cut is zero.
Definition. (s-t cuts). We define an s-t cut to be a subset S ⊂ V , where s ∈ S and t ∈ V \S.
A decision problem: “Given an instance of the Maximum Flow problem, is there a flow
from s to t such that f ≠ 0?”
If YES: We can decompose this flow into s-t path flows and cycle flows.
If NO: There is no flow path from s to t. Let S be the set of vertices reachable from source
s. Then (S, V \ S) is a cut in the graph, with no edges crossing from S to V \ S.
Figure 11.1 gives an example.
Upper bounding the maximum possible flow value. How can we recognize a maxi-
mum flow? Is there a way to confirm that a flow is of maximum value?
We can now introduce the Minimum Cut problem:
$$\begin{aligned} \min_{S \subseteq V} \quad & c_G(S) \\ \text{s.t.} \quad & s \in S,\;\; t \notin S \qquad (11.2) \end{aligned}$$
The Minimum Cut problem can also be phrased as a linear program, although we won't see
that today.
We’d like to obtain a tight connection between flows and cuts in the graph. As a first step,
we won’t get that, but we can at least observe that the value of any s-t cut provides an
upper bound to the maximum possible s-t flow value.
Theorem 11.3.1 (Max Flow ≤ Min Cut). The maximum s-t flow value in a directed graph
G (Program (11.1)) is upper bounded by the minimum value of any s-t cut (Program (11.2)).
I.e. if S is an s-t cut, and f a feasible s-t flow, then
val(f) ≤ c_G(S).
And in particular this holds for any minimum cut S* and maximum flow f*, i.e. val(f*) ≤
c_G(S*).
Figure 11.2: s-t Cut
Note that the edges from T to S do not count towards the cut. The above inequality holds
for all flows and all s-t cuts. This implies that max flow ≤ min cut.
This theorem is an instance of a general pattern, known as weak duality. Weak duality is a
relationship between two optimization programs, a maximization problem and a minimiza-
tion problem, where any solution to the former has its value upper bounded by the value of
any solution to the latter.
Does Algorithm 1 work? Consider the graph below with directed edges with capacities
1 at every edge. If we make a single path update as shown by the orange lines in Figure 11.3,
then afterwards, using the remaining capacity, there's no path flow we can route, as shown
in Figure 11.4. But the max flow is 2, as shown by the flow on orange edges in Figure 11.5.
So, the above algorithm does not always find a maximum flow: It can get stuck at a point
where it cannot make progress despite not having reached the maximum value.
Figure 11.3: Sending a unit s-t path flow through the graph.
Figure 11.4: Remaining edge capacities after sending a path flow through the graph as
depicted in Figure 11.3.
A better approach. It turns out we can fix the problem with the previous algorithm
using a simple idea, known as residual graphs.
The −f (e) can be treated as an edge going in the other direction with capacity f (e). By
convention, an edge in G with f (e) = c(e) is called saturated, and we do not include the
edge in Gf . The graph defined above with such capacities is called the residual graph
of f , Gf . Gf is only defined for feasible f , since otherwise the constraint f̃ ≤ c − f gives
trouble.
Suppose we start with a graph having a single edge with capacity 2, and we introduce a flow
of 1 unit. The residual graph Gf has an edge with capacity 1 going forward and an edge with
capacity −1 going forward, but we can treat the latter as +1 capacity going backwards. So
it is an edge that allows us to undo the choice made to send flow along that direction.
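A small sketch of this construction, using dictionaries over directed edges (our own representation); it reproduces the single-edge example just discussed.

def residual(cap, f):
    # Build residual capacities G_f: forward edges keep c(e) - f(e)
    # (saturated edges are dropped); flow f(e) opens a reverse edge.
    res = {}
    for (u, v), c in cap.items():
        if c - f.get((u, v), 0) > 0:
            res[(u, v)] = res.get((u, v), 0) + c - f.get((u, v), 0)
        if f.get((u, v), 0) > 0:
            res[(v, u)] = res.get((v, u), 0) + f[(u, v)]
    return res

print(residual({('u', 'v'): 2}, {('u', 'v'): 1}))
# {('u', 'v'): 1, ('v', 'u'): 1}: one unit forward, one unit undoable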
Let us consider the same example with its residual graph. The original graph is shown in
Figure 11.6.
Figure 11.6: Original graph G and an s-t path flow in G shown in orange.
Figure 11.7: The residual graph w.r.t. the flow from Figure 11.6, and new s-t path flow which
is feasible in the residual graph.
Adding both flows together, we get the paths as shown in Figure 11.8 with value 2, which is
the optimum.
Lemma 11.4.1. Suppose f, f̂ are feasible in G. Then this implies that f̂ − f (where negative
entries count as flow in the opposite direction) is feasible in Gf.
Theorem 11.4.3. A feasible flow f in G is optimal if and only if t is not reachable from s in Gf.
Proof. Let f be optimal, and suppose t is reachable from s in Gf. Then we can find an s-t path
flow f̃ that is feasible in Gf, and val(f + f̃) > val(f). f + f̃ is feasible in G by Lemma 11.4.2.
This is a contradiction, as we assumed f was optimal.
Suppose t is not reachable from s in Gf, and f is feasible, but not optimal. Let f* be
optimal; then by Lemma 11.4.1, the flow f* − f is feasible in Gf and val(f* − f) > 0. So
there exists an s-t path flow in Gf (as we can do a path decomposition of f* − f).
But this is a contradiction, as t is not reachable from s in Gf.
Theorem 11.4.4 (Max Flow = Min Cut theorem). The maximum flow in a directed graph
G equals the minimum cut.
Proof. Consider the set S = {vertices reachable from s in Gf}. Note that f saturates the
edge capacities in the cut (S, V \ S) in G: Consider any edge from S to T := V \ S in G. Since
this edge does not exist in the residual graph, we must have f(e) = c(e).
Figure 11.9: The cut between vertices reachable from s and everything else in Gf must have
all outgoing edges saturated by f.
Moreover, every edge from T to S must carry zero flow, since otherwise its reversal would
appear in Gf. Hence
val(f) = c_G(S),
which proves the Max Flow = Min Cut theorem; this is also called strong duality.
Remark 11.4.5. Note that this proof also gives an algorithm to compute an s, t-min cut
given an s, t maximum flow f ∗ : Take one side of the cut to be the set S of nodes reachable
from s in the residual graph w.r.t. f ∗ .
Ford-Fulkerson Algorithm.
Algorithm 6: Ford-Fulkerson Algorithm
repeat
    Augment the flow f by an arbitrary s-t path flow in Gf
until t is no longer reachable from s in Gf;
• Does it converge towards the max flow value?
No, it does not converge to max flow value if the updates are poor and the capacities
are irrational.
Lemma 11.4.6. Consider the Ford-Fulkerson algorithm with integer capacities. The algorithm
terminates in val(f*) augmentations, i.e. O(m val(f*)) time.
Proof. Each iteration increases the flow by at least one unit, as the capacities in Gf are
integral, and each iteration can be computed in O(m) time.
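Here is a compact Python sketch of the Ford-Fulkerson loop for integer capacities, finding an arbitrary augmenting path by DFS; the dictionary representation is our own simplification (each path search scans all edges, so this is O(m) per step but not tuned).

def ford_fulkerson(cap, s, t):
    res = dict(cap)  # residual capacities, updated in place

    def dfs(u, seen):  # find any s-t path with positive residual capacity
        if u == t:
            return [t]
        seen.add(u)
        for (a, b), c in res.items():
            if a == u and c > 0 and b not in seen:
                path = dfs(b, seen)
                if path:
                    return [u] + path
        return None

    value = 0
    while True:
        path = dfs(s, set())
        if path is None:
            return value  # no augmenting path: f is optimal
        bottleneck = min(res[(u, v)] for u, v in zip(path, path[1:]))
        for u, v in zip(path, path[1:]):
            res[(u, v)] -= bottleneck
            res[(v, u)] = res.get((v, u), 0) + bottleneck
        value += bottleneck

print(ford_fulkerson({('s','a'): 1, ('a','t'): 1, ('s','b'): 1,
                      ('b','t'): 1, ('a','b'): 1}, 's', 't'))  # 2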
Can we do better than this? Suppose we pick the maximum bottleneck capacity (minimum
capacity along path) augmenting path. This gives an algorithm that is better in some
regimes.
How do we pick the maximum bottleneck capacity augmenting path? We are going to perform
a binary search on the capacities in Gf, to find a path with maximum bottleneck capacity.
Each time our binary search has picked a threshold capacity, we then try to find an s-t path
flow in Gf using only edges with capacity above that threshold. If we find a path, the binary
search tries a higher capacity next. If we don't find a path, the binary search tries a lower
capacity next.
Using this approach, the time to find a single path is O(m log(n)) where m is the number of
edges in the graph. This path must carry at least a 1/m fraction of the total amount of flow
F̂ left in Gf: the path decomposition lemma gives a decomposition into at most m paths,
and the one carrying the most flow must carry at least the average amount, i.e. F̂/m.
So if the flow is integral, the algorithm completes when
$$\Big(1 - \frac{1}{m}\Big)^T \mathrm{val}(f^*) < 1,$$
where T is the number of augmentations. This means T = O(m log val(f*)) augmentations
suffice.
Current state-of-the-art approaches for Max Flow. For the interested reader, we’ll
briefly mention the current state-of-the-art for Maximum Flow algorithms, which is a very
active area of research.
• Strongly polynomial time algorithms: if the algorithm has to work with real-valued
capacities, the best known running time is O(mn), by Orlin.
• General integer capacities: For a long time, the state-of-the-art was a result by Goldberg
and Rao achieving a runtime of O(m min(√m, n^{2/3}) log(n) log(U)) where U is the
maximum capacity. Very recently, m^{1+o(1)} log(U) has been claimed (not yet published
in a peer-reviewed venue), see https://arxiv.org/abs/2203.00671. There are many
other recent results prior to this work that we don't have time to cover here.
Chapter 12
12.1 Overview
In this chapter we continue looking at classical max flow algorithms. We derive Dinic’s
algorithm which (unlike Ford-Fulkerson) converges in a polynomial number of iterations.
We will also show faster convergence on unit capacity graphs. The setting is the same as
last chapter: we have a directed graph G = (V, E, c) with positive capacities on the edges.
Our goal is to solve the maximum flow problem on this graph for a given source s and sink
t 6= s:
[Figure: a small unit-capacity network between s and t, with edges labeled f(e)/c(e), e.g. 0/1 and 1/1.]
Definition 12.2.1. Call an edge (u, v) admissible if `(u) + 1 = `(v). Let L be the set of
admissible edges in Gf and define the level graph of Gf to be the subgraph induced by
these edges.
The last condition makes a blocking flow ‘blocking’: it blocks any shortest augmenting path
in Gf by saturating one of its edges. Note however that a blocking flow is not necessarily a
maximum flow in the level graph.
Proof. Let L, L′ be the edge sets of the level graphs of Gf, Gf′ respectively. We would now
like to show that d_{L′}(s, t) ≥ 1 + d_L(s, t). Let's first assume that in fact d_{L′}(s, t) < d_L(s, t).
Now take a shortest s-t path in the level graph of Gf′, say s = v_0, v_1, …, v_{d_{L′}(s,t)} = t. Let
v_j be the first vertex along the path such that d_{L′}(s, v_j) < d_L(s, v_j). As d_{L′}(s, t) < d_L(s, t),
such a v_j must exist.
1
Sometimes also transliterated as Dinitz.
We’d like to understand the edges in the level graph L0 . Let E(Gf ) denote the edges of the
Gf , and let rev(L) denote the set of reversed edges of L. The level graph edges of Gf 0 must
satisfy L0 ⊆ L ∪ rev(L) ∪ (E(Gf ) \ L).
In our s-t path in L0 , the vertex vj is reached by an edge from vj−1 , and the level of vj−1 in
Gf and Gf 0 are the same, so the edge (vj−1 , vj ) ∈ L skips at least one level forward in Gf .
But, edges in L do not skip a level in Gf , and edges in rev(L) or E(Gf ) \ L do not move
from a lower level to higher level in Gf . So this edge cannot exist and we have reached a
contradiction.
Now suppose that d_{L′}(s, t) = d_L(s, t). This means there is an s-t path in L′ using edges in
L – but such a path must contain an edge saturated by the blocking flow f̂.
(Note: the reason this path must use only edges in L is similar to the previous case: the
other possible types of edges do not move to higher levels in Gf at all, making the path too
long if we use any of them.)
So we must have d_{L′}(s, t) ≥ 1 + d_L(s, t).
Proof. By the previous lemma, the level of t increases by at least one each iteration. A
shortest path can only contain each vertex once (otherwise it would contain a cycle) so the
level of any vertex is never more than n.
For graphs with unit capacities (c = 1) we can prove even better bounds on the number of
iterations.
Theorem 12.3.3. On unit capacity graphs, Dinic’s algorithm terminates in
O(min{m^{1/2}, n^{2/3}})
iterations.
1. Suppose we run Dinic’s algorithm for k iterations, obtaining a flow f that is feasible
but not necessarily yet optimal. Let f̃ be an optimal flow in Gf (i.e. f + f̃ is a
maximum flow for the whole graph), and consider a path decomposition of f̃ . Since
after k iterations any s-t path has length k or more, we use up a total capacity of at
least val(f̃ )k across all edges. But the edges in Gf are either edges from the original
graph G or their reversals (but never both) meaning the total capacity of Gf is at most
m, hence val(f̃ ) ≤ m/k.
Recalling our earlier observation that our algorithm is a special case of Ford-Fulkerson,
this implies that our algorithm will terminate after at most another m/k iterations.
Hence the number of iterations is bounded by k + m/k for any k > 0. Substituting
k = √m gives the first desired bound.
2. Suppose again that we run Dinic’s algorithm for k iterations obtaining a flow f . The
level graph of Gf partitions the vertices into sets Di = {u | `(u) = i} for i ≥ 0. As
shown before, the sink t must be at a level of at least k, meaning we have at least this
many non-empty levels starting from D0 = {s}. To simplify, discard all vertices and
levels beyond D`(t) .
Now, consider choosing a level I uniformly at random from {1, . . . , k}. Since there are
at most n − 1 vertices in total across these levels, E [|DI |] ≤ (n − 1)/k, and by Markov’s
inequality
$$\Pr[|D_I| \ge 2n/k] \le \frac{(n-1)/k}{2n/k} < \frac{1}{2}.$$
Thus strictly more than k/2 of the levels i ∈ {1, . . . , k} have |Di | < 2n/k and so there
must be two adjacent levels j, j + 1 for which this upper bound on the size holds.
There can be at most |D_j| · |D_{j+1}| ≤ 4n²/k² edges between these levels, and by the
max-flow min-cut theorem we saw in the previous chapter, this is an upper bound on
the flow in Gf, and hence on the number of iterations still needed for our algorithm to
terminate.
This means the number of iterations is bounded by k + 4n2 /k 2 for any k > 0, which is
O(n2/3 ) at k = 2n2/3 .
What has been missing from our discussion so far is the process of actually finding a blocking
flow. We can achieve this using repeated depth-first search. We repeatedly do a search in the
level graph (so only using edges L) for s-t paths and augment these. We erase edges whose
subtrees have been exhausted and do not contain any augmenting paths to t. Pseudocode
for the algorithm is given below.
Algorithm 7: FindBlockingFlow
f ← 0;
H ← L;
repeat
    P ← Dfs(s, H, t);
    if P ≠ ∅ then
        Let f̂ be a flow that saturates path P.
        f ← f + f̂;
        Remove from H all edges saturated by f̂.
    else
        return f;
end
Algorithm 8: Dfs(u, H, t)
if u = t then
    return the path P on the dfs-stack.
end
for (u, v) ∈ H do
    P ← Dfs(v, H, t);
    if P ≠ ∅ then
        return P;
    else
        Erase (u, v) from H.
    end
end
return ∅;
Lemma 12.4.1. For general graphs, FindBlockingFlow returns a blocking flow in
O(nm) time. Hence Dinic's algorithm runs in O(n²m) time on general capacity graphs.
Proof. First, consider the amount of work spent pushing edges onto the stack which eventually
results in augmentation along an s-t path consisting of those edges (i.e. adding flow
along that path). Since each augmenting path saturates at least one edge, we do at most m
augmentations. Each path has length at most n. Thus the total amount of work pushing
these edges to the stack, removing them from the stack upon augmentation, and deleting
saturated edges, can be bounded by O(mn). Now consider the work spent pushing edges
onto the stack which are later deleted because we “retreat” after not finding an s-t path. An
edge can only be pushed onto the stack once in this way, since it is then deleted. So the total
amount of work spent pushing edges to the stack and deleting them this way is O(m).
Lemma 12.4.2. For unit capacity graphs, FindBlockingFlow returns a blocking flow in
O(m) time. Hence Dinic's algorithm runs in O(min{m^{3/2}, mn^{2/3}}) time on unit capacity
graphs.
Proof. When our depth-first search traverses some edge (u, v), one of two things will happen:
either we find no augmenting path in the subtree of v, leading to the erasure of this edge,
or we find an augmenting path which will necessarily saturate (u, v), again leading to its
erasure. This means each edge will be traversed at most once by the depth-first search.
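Putting the pieces together, here is a Python sketch of Dinic's algorithm: a BFS computes the level graph, and a DFS with retreat-erasure (mirroring FindBlockingFlow) augments along shortest paths until t leaves the level graph. The edge-dictionary representation is our own choice, kept simple rather than fast.

from collections import deque

def dinic(cap, s, t):
    res = dict(cap)

    def levels():  # BFS distances in the residual graph
        lvl, q = {s: 0}, deque([s])
        while q:
            u = q.popleft()
            for (a, b), c in res.items():
                if a == u and c > 0 and b not in lvl:
                    lvl[b] = lvl[u] + 1
                    q.append(b)
        return lvl

    def dfs(u, pushed, lvl):  # advance along admissible edges only
        if u == t:
            return pushed
        for (a, b) in list(res):
            if a == u and res[(a, b)] > 0 and lvl.get(b) == lvl[u] + 1:
                got = dfs(b, min(pushed, res[(a, b)]), lvl)
                if got > 0:
                    res[(a, b)] -= got
                    res[(b, a)] = res.get((b, a), 0) + got
                    return got
        lvl[u] = None  # retreat: erase u from this phase's level graph
        return 0

    total = 0
    while True:
        lvl = levels()
        if t not in lvl:
            return total
        while True:
            got = dfs(s, float('inf'), lvl)
            if got == 0:
                break  # blocking flow found; rebuild the level graph
            total += got

print(dinic({('s','a'): 1, ('a','t'): 1, ('s','b'): 1,
             ('b','t'): 1, ('a','b'): 1}, 's', 't'))  # 2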
Finally we show that minimum cut may be formulated as a linear program. For a subset
S ⊆ V, write c_G(S) for the sum of c(e) over all edges e = (u, v) that cross the cut, i.e.
u ∈ S, v ∉ S. Note that ‘reverse’ edges are not counted in this cut. The minimum cut
problem asks us to find some S ⊆ V with s ∈ S and t ∉ S such that c_G(S) is minimal.
$$\begin{aligned} \min_{S \subseteq V} \quad & c_G(S) \\ \text{s.t.} \quad & s \in S \qquad (12.2) \\ & t \notin S \end{aligned}$$
Recall that b_e^⊤x = x(v) − x(u) for e = (u, v). This lets us phrase minimum cut as the
fractional program
$$\min_{x \in \mathbb{R}^V} \sum_{e} c(e) \max\{b_e^\top x, 0\} \quad \text{s.t.} \quad b_{s,t}^\top x = 1, \;\; 0 \le x \le 1. \qquad (12.3)$$
We can rewrite this as a proper linear program by introducing extra variables for the maximum:
$$\begin{aligned} \min_{x \in \mathbb{R}^V,\, u \in \mathbb{R}^E} \quad & c^\top u \\ \text{s.t.} \quad & b_{s,t}^\top x = 1 \\ & 0 \le x \le 1 \qquad (12.4) \\ & u \ge 0 \\ & u \ge B^\top x \end{aligned}$$
And, in fact, we’ll also see that we don’t need the constraint 0 ≤ x ≤ 1, leading to the
following simpler program:
$$\begin{aligned} \min_{x \in \mathbb{R}^V,\, u \in \mathbb{R}^E} \quad & c^\top u \\ \text{s.t.} \quad & b_{s,t}^\top x = 1 \qquad (12.5) \\ & u \ge 0 \\ & u \ge B^\top x \end{aligned}$$
Lemma 12.5.1. Programs (12.2), (12.4), and (12.5) have equal optimal values.
Proof. We start by considering equality between the optimal values of Program (12.2) and
Program (12.4). Let S be an optimal solution to Program (12.2) and take x = 1_{V\S}. Then x is
a feasible solution to Program (12.4) with cost equal to c_G(S).
Conversely, suppose x is an optimal solution to Program (12.4). Let α be uniform in [0, 1]
and define S = {v ∈ V | x(v) ≤ α}. This is called a threshold cut. We can verify that
Pr_α[e is cut by S] = max{b_e^⊤x, 0}, and hence E_α[c_G(S)] is exactly the optimization function
of (12.3) (and hence (12.4)). Since at least one outcome must do as well as the average, there
is a subset S ⊆ V achieving this value (or less).
We can use the same threshold cut to show that we can round a solution to Program (12.5)
to a cut of equal or smaller value, feasible for Program (12.2). The only difference is that in
this case, we get
$$\Pr[e \text{ is cut by } S] \le \max\{b_e^\top x, 0\}.$$
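The threshold-cut rounding for Program (12.4) is easy to implement; the sketch below (our own illustration) samples thresholds and keeps the best resulting cut. Since feasibility forces x(s) = 0 and x(t) = 1, the source and sink always land on the correct sides, and one sample already matches the fractional objective in expectation.

import random

def threshold_cut(x, cap, s, t, trials=100):
    # x: fractional solution over vertices with x(s) = 0, x(t) = 1;
    # cap: capacities over directed edges. Keep the best of many samples.
    best, best_S = float('inf'), None
    for _ in range(trials):
        alpha = random.random()
        S = {v for v in x if x[v] <= alpha}  # the threshold cut
        value = sum(c for (u, v), c in cap.items() if u in S and v not in S)
        if value < best:
            best, best_S = value, S
    return best, best_S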
Chapter 13
Link-Cut Trees
In this chapter, we will learn about a dynamic data structure that allows us to speed-up
Dinic’s algorithm even more: Link-Cut Trees. This chapter is inspired by lecture notes of
Richard Peng1 and the presentation in a very nice book on data structures and network flows
by Robert Tarjan [Tar83].
13.1 Overview
Model. We consider a directed graph G = (V, E) that is undergoing updates in the
form of edge insertions and/or deletions. We number the graph in its different versions
G0 , G1 , G2 , . . . such that G0 is the initial input graph, and Gi is the initial graph after the
first i updates were applied. Such a graph is called a dynamic graph.
In this lecture, we restrict ourselves to dynamic rooted forests, that is, we assume that every
Gi forms a directed forest where, in each tree, a single root vertex can be reached from every
other vertex in the tree. For simplicity, we assume that G0 = (V, ∅) is an empty graph.
The Interface. Let us now describe the interface of our data structure that we call a
link-cut tree. We want to support the following operations:
• Initialize(G): Creates the data structure initially and returns a pointer to it. Each
vertex is initialized to have an associated cost cost(v) equal to 0.
• AddCost(v, ∆): Add ∆ to the cost of every vertex on the path from v to the root
vertex FindRoot(v).
1
CS7510 Graph Algorithms Fall 2019, Lecture 17 at https://faculty.cc.gatech.edu/~rpeng/CS7510_
F19/
Figure 13.1: The ith version of G is a rooted forest. Here, red vertices are roots and there
are two trees. The (i + 1)th version of G differs from Gi by a single edge that was inserted
(the blue edge). Note that edge insertions are only valid if the tail of the edge was a root.
In the right picture, the former root on the right side is turned into a normal node in the
tree by the update.
• FindMin(v): Returns tuple (w, cost(w)) where w is the (first) vertex on the path from
v to FindRoot(v) of lowest cost.
• Link(u, v): Links two trees that contain u and v into a single tree by inserting the edge
(u, v). This assumes that u, v are initially in different trees and u was a root vertex.
• Cut(u, v): Cuts the edge (u, v) from the graph which causes the subtree rooted at u to
become a tree and u to become a root vertex. Assumes (u, v) is in the current graph.
Main Result. The following theorem is the main result of today’s lecture.
Theorem 13.1.1. We can implement a link-cut tree such that any sequence of m operations
takes total expected time O(m log2 n + |V |).
the property that for each node v, all items in the left subtree of v have keys less than v, and all
items in the right subtree of v have keys greater than v. This is called the search property
of the tree, because it makes it simple to search through the tree to determine whether it
contains an item with a given key.
The binary trees we studied in APC supported the following operations, while maintaining
the search property:
delete: We can remove an item from the tree with a specified key.
split: Suppose the tree contains two items v1 and v2 with keys k1 and k2 where k1 < k2 , and
suppose no item in tree has a key k in the interval (k1 , k2 ). Then the split operation
applied to these two items should split the current tree into two: One tree containing
all items with keys in (−∞, k1 ] and the other with all items in [k2 , ∞).
join: Given two binary search trees, one with keys in the interval (−∞, k], the other with
keys in the interval (k, ∞), form a single binary search tree containing the items of
both.
Treaps allow us to implement all of these operations, each with expected time O(log n) for
each operation when working over n items.
Tree rotations. A crucial operation when implementing binary search trees in general
and treaps in particular is tree rotation. A tree rotation makes a local change to the tree by
changing only a constant number of pointers around, while preserving the search property.
Given two items v1 and v2 such that v1 is a child of v2 , a tree rotation applied to these two
nodes will make v2 the child of v1 , while moving their subtrees around to preserve the search
property. This is shown in Figure 13.2. When the rotation makes v2 the right child of v1
(because v2 has the larger key), this is called a right rotation, and when it makes v2 the left
child of v1 (because v2 has a smaller key), it is called a left rotation.
Figure 13.2: Given two items v1 and v2 such that v1 is a child of v2 , a tree rotation applied to
these two nodes will make v2 the child of v1 , while moving their subtrees around to preserve
the search property. This figure shows a right rotation.
Representing Paths via Balanced Binary Search Trees. It turns out that paths
can be represented rather straight-forwardly via Balanced Binary Search Trees, with a
node/item for each vertex of the graph. For the sake of concreteness, we here use treaps to
represent paths². Most other balanced binary search trees would also work.
Note that we now represent each rooted path with a tree, and this tree has its own root, which
is usually not the root of the path. To minimize confusion, we will refer to the root of the
path as the path-root and the root of the associated tree as the tree-root (or subtree-root for
the root of a subtree).
In our earlier discussion of binary trees, we always assumed each item has a key: Instead we
will now let the key of each vertex correspond to its distance to the path-root in the path
that contains it. We will make sure the treap respects the search property w.r.t. this set of
keys/ordering, but we will not actually explicitly compute these keys or store them. Note
one important difference to the scenario of treaps-with-keys: When we have two paths, with
path-roots r1 and r2 , we will allow ourselves to join these paths in two different ways: either
so that r1 or r2 becomes the overall path-root.
Let us describe how to represent a path P in G. First, for each vertex v, we assign
it a priority denoted prio(v), which we choose to be a uniformly random integer sampled
from a large universe, say [1, n^100). We assume henceforth that prio(v) ≠ prio(w) for all
v ≠ w ∈ V.
Then, for each path P in G, we store the vertices in P in a binary tree, in particular a treap,²
T_P, and enforce the following invariants:
• Search-Property: for all v, left_{T_P}(v) precedes v on P and right_{T_P}(v) appears later
on P than v. See Figure 13.3 below for an illustration.
• Heap-Order: for all v, prio(v) is smaller than the priority of each child of v in T_P.
2
If you have not seen treaps before, don't worry, they are simple enough to understand from our
application here.
Figure 13.3: In the upper half of the picture, the original path P in G is shown, along with
the random numbers prio(v) for each v. The lower half depicts the resulting treap with
vertices on the same vertical line as before.
Depth of Vertex in a Treap. Let us next analyze the expected depth of a vertex v in a
treap T_P representing a path P. Let P = ⟨x_1, x_2, …, x_k = v, …, x_{|P|}⟩, i.e. v is the k-th vertex
on the path P. Observe that a vertex x_i with i < k is an ancestor of v in T_P if and only if
no vertex in {x_{i+1}, x_{i+2}, …, x_k} has received a smaller random number than prio(x_i). Since we
sample prio(w) uniformly at random for each w, we have that P[x_i ancestor of v] = 1/(k − i + 1).
The case where i > k is analogous and has P[x_i ancestor of v] = 1/(i − k + 1). Letting X_i be the
indicator variable for the event that x_i is an ancestor of v, it is straightforward to calculate
the expected depth of v in T_P:
$$E[\mathrm{depth}(v)] = \sum_{i \ne k} E[X_i] = \sum_{i=1}^{k-1} \frac{1}{k-i+1} + \sum_{i=k+1}^{|P|} \frac{1}{i-k+1} = H_k + H_{|P|-k+1} - 2 = O(\log |P|)$$
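The ancestor characterization is easy to check empirically; the following sketch (function name and parameters our own) estimates the expected depth of the k-th vertex on a path of length P by sampling priorities.

import random

def expected_depth(P, k, trials=1000):
    # x_i (i < k) is an ancestor of v = x_k iff prio(x_i) is smallest
    # among prio(x_i), ..., prio(x_k); symmetrically for i > k.
    total = 0
    for _ in range(trials):
        prio = [random.random() for _ in range(P)]
        depth = 0
        for i in range(P):  # 0-indexed; v sits at index k - 1
            if i < k - 1 and min(prio[i:k]) == prio[i]:
                depth += 1
            elif i > k - 1 and min(prio[k - 1:i + 1]) == prio[i]:
                depth += 1
        total += depth
    return total / trials

print(expected_depth(P=100, k=50))  # about H_50 + H_51 - 2, roughly 7.0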
Implementing PFindRoot(v). From any vertex v, we can simply follow the parent point-
ers in TP until we are at the tree-root of TP . Then, we find the right-most child of TP by
following the rightTP pointers. Finally, we return the right-most child which is the path-root.
Extra fields to help with PAddCost(v, ∆) and PFindMin(v). The key trick to do
the Link-Cut tree operations efficiently is to store the change to subtrees instead of updating
cost(v) for each affected vertex.
To this end, we store two fields ∆cost(v) and ∆min(v) for every vertex v. We let cost(v)
be the cost of each vertex, and mincost(v) denote the minimum cost of any descendant of
v in TP (where we let v be a descendant of itself). A warning: the mincost(v) value is the
minimum cost in the treap-subtree; not the minimum cost on the path between v and the
path-root that could be different. Note also that we do not explicitly maintain these fields.
Then, we maintain for each v
$$\Delta\mathrm{cost}(v) = \mathrm{cost}(v) - \mathrm{mincost}(v)$$
$$\Delta\mathrm{min}(v) = \begin{cases} \mathrm{mincost}(v) & \text{if } \mathrm{parent}_{T_P}(v) = NULL, \\ \mathrm{mincost}(v) - \mathrm{mincost}(\mathrm{parent}_{T_P}(v)) & \text{otherwise.} \end{cases}$$
With a bit of work, we can see that with these definitions, we obtain mincost(v) =
Σ_{w ∈ T_P[v]} ∆min(w), where T_P[v] is the v-to-tree-root path in T_P. We can then recover
cost(v) = mincost(v) + ∆cost(v). We say that ∆min(v) and ∆cost(v) are field-preserving if
we can compute mincost(v) and cost(v) in the above way.
which is more efficient, but also more unpleasant3 .
To implement the operation PFindMin(v) more efficiently, consider the sequence of nodes
on the path in the tree T_P (not the path P) between v and the path-root. First let v =
x_1, x_2, …, x_k = r be the nodes leading to the tree-root if v is left of the tree-root, and then
let y_1, …, y_l be the remaining nodes on the tree path to the path-root. Observe that either
the minimizer is one of these nodes (which we can check by checking their ∆min), or it is in
a right subtree of some x_i, or in any subtree of some y_j. So we just search through these
subtrees for their minimizers (looking for ∆min(v) = 0) and pick the best one. There will
only be O(log n) trees in expectation.
Next, let us discuss the implementation of the operation PAddCost(v, ∆) which is given
in pseudo-code above. The first for-loop of this algorithm ensures that the tree is again
field-preserving. This is ensured by walking down the path from the path-root of v to v
and whenever we walk to the left, we make sure to increase all costs in the right subtree by
adding ∆ to the subtree-root of the right subtree r in the form of adding it to ∆min(r). On
the vertices along the path it updates the costs by using the ∆cost(·) fields. It is easy to
prove from this that thereafter the tree is field-preserving.
Algorithm 9: PAddCost(v, ∆)
Store the vertices on the path in the tree T_P between v and the path-root, i.e.
⟨v = x_1, x_2, …, x_k = PFindRoot(v)⟩.
for i = k down to 1 do
    ∆cost(x_i) ← ∆cost(x_i) + ∆.
    if (r ← right_{T_P}(x_i)) ≠ NULL AND (i = 1 OR x_{i−1} ≠ r) then
        ∆min(r) ← ∆min(r) + ∆
    end
end
for i = 1 up to k do
    minval ← ∆cost(x_i).
    foreach w ∈ {left_{T_P}(x_i), right_{T_P}(x_i)}, w ≠ NULL do
        minval ← min{minval, ∆min(w)}.
    ∆min(x_i) ← ∆min(x_i) + minval.
    ∆cost(x_i) ← ∆cost(x_i) − minval.
    foreach w ∈ {left_{T_P}(x_i), right_{T_P}(x_i)}, w ≠ NULL do
        ∆min(w) ← ∆min(w) − minval.
end
While after the first for-loop the values can be computed efficiently, we might now be in the
situation that ∆min(w) and/or ∆cost(w) are negative for some vertices w. We may also
violate the invariants we want to preserve on the relationship between the ∆min(w) and/or
∆cost(w) fields and the true cost and mincost values at each node. We recover through the
second for-loop, which at each vertex on the tree path from v to its path-root first computes
the correct minimum in the subtree again (using the helper variable minval) and then adjusts
all values in its left/right subtrees and its own fields. Since we argued that v is at expected
depth O(log(n)), we can see that this can be done in expected O(log n) time.
3
You don't have to know these more complicated approaches for the exam, but we include them for the
interested reader.
Implementing PCut(u, v). Let us first assume that we have a dummy node d0 with
prio(d0) = 0 in the vertex set. The trick is to first treat the operation as splitting the edge
(u, v) into (u, d0) and (d0, v) by inserting the vertex d0 in between u and v in the tree T_P as a
leaf (this is always possible). Then, we can do standard binary tree rotations to re-establish
the heap-order invariant (see Figure 13.4). It is not hard to see that after O(log n) tree
rotations in expectation, the heap-order is re-established and d0 is the new tree-root of T_P. It
remains to remove d0 and make left_{T_P}(d0) and right_{T_P}(d0) tree-roots.
Figure 13.4: For PCut(u, v), we insert a vertex d0 as a leaf of either u or v to formally split
(u, v) into (u, d0) and (d0, v) (this is shown on the left). While this preserves the Search-
Property, it will violate the Heap-Order. Thus, we need to use tree rotations (shown on the
right) to push d0 to the top of the tree T_P (the arrow between the tree rotations should point
in both ways).
To ensure that the fields are correctly adjusted, we can make the ∆min(·) values of the two
vertices that are rotated equal to zero while ensuring that the tree is still field-preserving by
changing the ∆min(·) fields of the (at most 3) subtree-roots of the subtrees to be rotated,
and adapting the ∆cost(·) fields of the two vertices to be rotated. Then, after the rotation,
it is not hard to see that the tree is still field-preserving and that the procedure applied
in the second for-loop of the operation PAddCost(v, ∆) can then just be applied to the
two rotated vertices and the subtree-roots of their subtrees. This ensures that the fields
are maintained correctly. All of these operations can be implemented in O(1) time per
tree rotation. Note that PCut is basically just the treap split operation. In the exercises
for this week, we ask you to come up with the pseudo-code for maintaining that the tree is
field-preserving during tree rotations.
Notes. The operations PLink/PCut can be implemented using almost all Balanced Bi-
nary Search Trees (especially the ones you have seen in your first courses on data structures).
Thus, it is not hard to get a O(log n) worst-case time bound for all operations discussed
above.
Path Decomposition. For each rooted tree T , the idea is to decompose T into paths.
In particular, we decompose each T into a collection of vertex-disjoint paths P1 , P2 , . . . , Pk
such that each internal vertex v in T has exactly one incoming edge in some Pi . We call the
edges on some Pi solid edges and say that the other edges are dashed.
We maintain the paths P1 , P2 , . . . , Pk using the data structure described in the last section.
To avoid confusion, we use the prefix P when we invoke operations of the path data structure,
for example PFindRoot(v) finds the root of v in the path graph Pi where v ∈ Pi . We no
longer need to think about the balanced binary trees that are used internally to represent
each Pi . Instead, in this section, we use path-root to refer to the root of a path Pi and we
use tree root (or just root) to refer to a root of one of the trees in our given collection of
rooted trees.
The Expose(v) Operation. We start by discussing the most important operation of the
data structure that will be used by all other operations internally: the operation Expose(v).
This operation flips solid/dashed edges such that after the procedure the path from v to its
tree root in G is solid (possibly as a subpath in the path collection). Below you can find an
implementation of the procedure Expose(v).
Figure 13.5: The dashed edges are the edges not on any path P_i. The collection of (non-
empty) paths P_1, P_2, …, P_k can be seen to be the maximal path segments of solid edges.
Implementing Operations via Expose(v). We can now implement link-cut tree opera-
tions by invoking Expose(v) and then forwarding the operation to the path data structure.
Algorithm 11: AddCost(v, ∆)
Expose(v); PAddCost(v, ∆)
Analysis. All of the operations above can be implemented using a single Expose(·) operation
plus O(1) operations on paths. Since path operations can be executed efficiently, and
each while-loop iteration of Expose(·) runs in time O(log n), our main task is to bound the
total number of while-loop iterations in Expose(·).
To this end, we introduce a dichotomy over the vertices in G. We let parentG (v) denote the
unique parent of v in G and let sizeG (v) denote the number of vertices in the subtree rooted
at v (including v).
Definition 13.4.1. Then, we say that an edge (u, v) is heavy if sizeG (u) > sizeG (v)/2.
Otherwise, we say (u, v) is light.
It can now be seen that the number of light edges on the v-to-root path for any v is at most
lg n: every time we follow a light edge (w, w′), i.e. when size_G(w′) ≥ 2 size_G(w), we at least
double the size of the subtree, so after taking more than lg n such edges, we would have
> 2^{lg n} = n vertices in the graph (which is a contradiction).
Thus, when Expose(·) runs for many iterations, it must turn many heavy edges solid (this
is also since each vertex has at most one incoming heavy edge, so when we make a heavy
edge solid, we also don’t make any other heavy edge solid).
Claim 13.4.2. Each update can only increase the number of dashed, heavy edges by O(log n).
Proof. First observe that every time Expose(·) is invoked, it turns at most lg n heavy edges
from solid to dashed (since it has to visit a light edge to do so).
The only two operations that can cause additional dashed, heavy edges are Link(u, v) and
Cut(u, v) by toggling heavy/light. For Link(u, v), we observe that only the vertices on the
v-to-root path increase their sizes. Since there are at most lg n light edges on this path that
can turn heavy, this increases the number of dashed, heavy edges by at most lg n.
The case for Cut(u, v) is almost analogous: only vertices on the v-to-root path decrease
their sizes, which can cause heavy edges on such a path to become light, and instead a
sibling of such a vertex might become heavy. But there can be at most lg n such
new heavy edges, since otherwise the total size of the tree would again exceed n, which is a
contradiction.
Thus, after m updates, we have created at most O(m log n) dashed heavy edges. Each iteration
of Expose(·) either visits a dashed light edge (at most lg n of them per Expose) or
consumes a dashed heavy edge, i.e. turns it solid. We conclude that after m updates, the
while-loop in Expose(·) runs for at most O(m log n) iterations in total, i.e. summed across
the updates so far. Each iteration can be implemented in O(log n) expected time. This
dominates the total running time and proves Theorem 13.1.1.
Recall from Section 12.4 that computing blocking flows in a level graph L from a vertex s
to t can be done by successively running DFS(s) and routing flow along the s-t path found
if one such path exists and otherwise we know that we have found a blocking flow.
We can now speed-up this procedure by storing the DFS-tree explicitly as a dynamic tree. To
simplify exposition, we transform L to obtain a graph Transform(L) that has capacities
on vertices instead of edges. To obtain Transform(L), we simply split each edge in L and
assign the edge capacity to the mid-point vertex while assigning capacity ∞ to all vertices
that were already in L. This creates an identical flow problem with at most O(m) vertices
and edges.
Figure 13.6: Each edge (u, v) with capacity c(u, v) is split into two edges (u, m) and (m, v).
The capacity is then on the vertex m.
Finally, we give the new pseudo-code for the blocking flow procedure below.
Proof. Each edge (u, v) in the graph Transform(L) enters the link-cut tree at most once
(we only invoke Cut(u, v) when we delete (u, v) from H).
Next, observe that the first if-case requires O(1 + #edgesDeletedFromH) many tree operations.
The else-case requires O(#edgesDeletedFromH) many tree operations.
But each edge is only deleted once from H, so we have a total of O(m) tree operations over
all iterations. Since each link-cut tree operation takes amortized expected time O(log² n),
we obtain the bound on the total running time.
The correctness of this algorithm follows almost immediately using that the level graph L
(and therefore Transform(L)) is an acyclic graph.
Chapter 14
The Cut-Matching Game: Expanders via Max Flow
In this chapter, we learn about a new algorithm to compute expanders that employs max
flow as a subroutine.
14.1 Introduction
We start with a review of expanders, where we make a subtle change to the notion of an
expander in comparison to Chapter 5 to ease the exposition.
$$\psi(S) = \frac{|E(S, V \setminus S)|}{\min\{|S|, |V \setminus S|\}}$$
Note that sparsity ψ(S) differs from conductance φ(S) = |E(S, V \ S)| / min{vol(S), vol(V \ S)},
as defined in Chapter 5, in the denominator. It is straightforward to see that in a connected
graph ψ(S) ≥ φ(S) for all S.
Clearly, we again have ψ(S) = ψ(V \ S). We define the sparsity of a graph G by ψ(G) =
min∅⊂S⊂V ψ(S). For any ψ ∈ (0, n], we say a graph G is a ψ-expander with regard to sparsity,
if ψ(G) ≥ ψ. When the context is clear, we simply say that G is a ψ-expander.
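Computing the sparsity of a single cut is straightforward; here is a small sketch for an undirected graph given as an edge list (the representation is our own choice).

def sparsity(edges, S, V):
    # psi(S) = |E(S, V \ S)| / min(|S|, |V \ S|)
    S = set(S)
    crossing = sum(1 for (u, v) in edges if (u in S) != (v in S))
    return crossing / min(len(S), len(V) - len(S))

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a 4-cycle
print(sparsity(edges, {0}, set(range(4))))  # 2.0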
The Main Result. The main result of this chapter is the following theorem.
Theorem 14.1.1. There is an algorithm SparsityCertifyOrCut(G, ψ) that, given a
graph G and a parameter 0 < ψ ≤ 1, either:
• certifies that G is an Ω(ψ/log² n)-expander, or
• presents a cut (S, V \ S) of sparsity ψ(S) = O(ψ).
The algorithm runs in time O(log² n) · T_maxflow(G) + Õ(m), where T_maxflow(G) is the time
it takes to solve a Max Flow problem on G¹.
The bounds above can further be extended to compute φ-expanders (with regard to conductance).
Using current state-of-the-art Max Flow results, the above problem can be solved in
Õ(m + n^{3/2+o(1)}) time [vdBLL+21].
Definition of Embedding. Given graphs H and G that are defined over the same vertex
set, we say that a function Embed_{H→G} is an embedding if it maps each edge (u, v) ∈ H
to a u-to-v path P_{u,v} = Embed_{H→G}(u, v) in G.
Figure 14.1: In this example the red edge (u, v) in H is mapped to the red u-to-v path in G.
1
Technically, we will solve problems with two additional vertices and n additional edges but this will not
change the run-time of any known max-flow algorithm asymptotically.
We say that the congestion of Embed_{H→G} is the maximum number of times that any edge
e ∈ E(G) appears on any embedding path:
$$\mathrm{cong}(\mathrm{Embed}_{H \to G}) = \max_{e \in E(G)} \left| \{ e' \in E(H) \mid e \in \mathrm{Embed}_{H \to G}(e') \} \right|.$$
Certifying Expander Graphs via Embeddings. Let us next prove the following lemma,
which is often considered folklore.
Lemma 14.2.1. Given a ½-expander graph H and an embedding of H into G with congestion
C, then G must be an Ω(1/C)-expander.
Proof. Consider any cut (S, V \ S) with |S| ≤ |V \ S|. Since H is a ½-expander, we have that
|E_H(S, V \ S)| ≥ |S|/2. We also know, by the embedding of H into G, that for each edge
(u, v) ∈ E_H(S, V \ S), we can find a path P_{u,v} in G that also has to cross the cut (S, V \ S)
at least once. But since each edge in G is on at most C such paths, we can conclude that at
least |E_H(S, V \ S)|/C ≥ |S|/2C edges in G cross the cut (S, V \ S).
Unfortunately, the reverse of the above lemma is not true, i.e. even if there exists no
embedding of a ½-expander H into G with congestion C, G might still be an Ω(1/C)-expander.
Figure 14.2: Illustration of the steps of the Algorithm. In a), a bi-partition of V is found.
In b), the bi-partition is used to obtain a flow problem where we inject one unit of flow into
each vertex in S via super-source s and extract one unit via super-sink t. c) A path flow
decomposition. For each path, the first vertex is in S and the last vertex is in S̄. d) We find
M_i to be the one-to-one matching between endpoints in S and S̄ defined by the path flows.
The If Statement. If the flow problem can be solved exactly, then we can find a path
decomposition of the flow f in Õ(m) time (for example using a DFS) where each path starts
in S ends in S and carries one unit of flow2 . This defines a one-to-one correspondence
between the vertices in S and the vertices in S. We capture this correspondences in the
matching Mi . We will later prove the following lemma.
Lemma 14.3.1. If the algorithm returns after constructing T matchings, for an appropri-
ately chosen T = Θ(log2 n), then the graph H returned by the algorithm is a 12 -expander and
H can be embedded into G with congestion O(log2 n/ψ).
2
For simplicity, assume that the returned flow is integral.
155
The Else Statement. On the other hand, if the flow problem on G could not be solved,
then we return the min-cut of the flow problem. Such a cut can be found in O(m) time by
using the above reduction to an s-t flow problem on which one can compute maximum flow
f from which the s-t min-cut can be constructed by following the construction in the proof
of Theorem 11.4.4.
It turns out that this min-cut already is a sparse cut by the way our flow problem is defined.
Lemma 14.3.2. If the algorithm finds a cut (XS , XS ), then the returned cut satisfies
|EG (XS \ {s}, XS \ {t})| = O(ψ).
Proof. First observe that since a min-cut was returned, we have that |EG (XS , XS )| < n/2
(otherwise we could have routed the demands).
Let ns be the number of edges incident to the super-source s that cross the cut (XS , XS ).
Let nt be the number of edges incident to t that cross the cut (XS , XS ).
Figure 14.3: Set XS is enclosed by the orange circle. The thick orange edges are in the cut
and incident to super-source s. Thus they count towards ns . Here ns = 2, nt = 0. Note that
all remaining edges in the cut are black, i.e. were originally in G and therefore have capacity
1/ψ.
Observe that after taking away the vertices s and t, the cut (XS \ {s}, XS \ {t}) has less than
n/2 − ns − nt capacity. But each remaining edge has capacity 1/ψ, so the total number of
edges in the cut can be at most ψ · (n/2 − ns − nt ). Since XS \ {s} is of size at least n/2 − ns ,
and XS \ {t} is of size at least n/2 − nt , we have that the induced cut in G has
ψ · (n/2 − ns − nt )
ψ(XS \ {s}) < ≤ ψ.
min{n/2 − ns , n/2 − nt }
156
14.4 Constructing an Expander via Random Walks
Next, we give the implementation and analysis for the procedure FindBiPartition(·). We
start however by giving some more preliminaries.
Proof. Consider any cut (S, S) with |S| ≤ |S|. It is convenient to think about the random
walks in terms of probability mass that is moved around. Observe that each vertex j ∈ S
has to push at least 1/(2n) units of probability from j to i (by definition of mixing).
Figure 14.4: Each vertex j ∈ S sends at least 1/(2n) probability mass to i (red arrow).
But in order to transport it, it does have to push the mass through edges in the matchings
M1 , M2 , . . . , Mt that cross the cut.
Clearly, to move the mass from S to S it has to use the matching edges that also cross the
cut.
157
Now observe that since there are ≥ n/2 vertices in S, and each of them has to push ≥ 1/(2n)
mass to i, the total amount of probability mass pushed through the cut for i is ≥ 1/4. Since
there are |S| such vertices i, the total amount of mass that has to cross the cut is ≥ |S|/4.
But note, that after each step of the random walk, the total probability mass at each vertex
is exactly 1. Thus, at each step t0 , each edge in Mt0 crossing the cut can push at most 1/2
units of probability mass over the cut (and thereafter the edge is gone).
It follows that there must be at least |S|/2 edges in the matchings M1 , M2 , . . . Mt . But this
implies that H = ∪i Mi is a 12 -expander.
The central claim, we want to prove is the following: given a potential function for the
random walk at step t
X X
Φt = (p tj7→i − 1/n)2 = kp ti − 1/nk22 .
i,j i
To obtain the Corollary, one can simply set up a sequence of random 0-1 variables
X 1 , X 2 , . . . , X (T +1) where each X (t+1) is 1 if and only if the decrease in potential is at
least an Ω(1/ log n)-fraction of the previous potential. Since the expectation is only over the
current r in each round and we choose these independently at random, one can then use
a Chernoff bound to argue that after T rounds (for an appropriate hidden constant), one
has at least Ω(T ) rounds during which the potential is decreased substantially (unless it is
already tiny and the O(n−5 ) factor dominates).
We further observe that this implies that {M1 , M2 , . . . , MT +1 } is mixing (you can straight-
forwardly prove it by a proof of contradiction), and thereby we conclude the proof of our
main theorem.
158
Let us now give the prove of Claim 14.4.2:
Interpreting the Potential Drop. Let us start by writing out the amount by which the
potential decreases
X X (t+1)
Φt − Φ(t+1) = kp ti − 1/nk22 − kp i − 1/nk22
i i
Considering P Mt+1 , and an edge (i, j) ∈ Mt+1 . We can re-write the former sum
now matching
as i kp ti − 1/nk22 = (i,j)∈Mt+1 kp ti − 1/nk22 + kp tj − 1/nk22 as each vertex occurs as exactly
P
one endpoint of a matching edge. We can do the same for the t + 1-step walk probabilities.
(t+1) (t+1) p t +p t
Further, recall that for (i, j) ∈ Mt+1 , we have p i = pj = i 2 j . Thus,
X (t+1) (t+1)
Φt − Φ(t+1) = kp ti − 1/nk22 + kp tj − 1/nk22 − kp i − 1/nk22 − kp j − 1/nk22
(i,j)∈Mt+1
t
2
X
p i + p tj
= kp ti − 1/nk22 + kp tj − 1/nk22 − 2
2 − 1/n
.
(i,j)∈Mt+1 2
The potential thus drops by a lot if vertices i and j are matched where p ti and p tj differ
starkly. Note that this equality implies directly the remark in our claim that Φt − Φ(t+1) is
non-negative.
Understanding the Random Projection. Next, we want to further lower bound the
potential drop using the random vector u. This intuitively helps a lot in our analysis since
we are matching vertices i, j with high value u(i) and low value u(j) (or vice versa). We
will show that (w.p. ≥ 1 − n−3 )
1 X n−1 X
Φt − Φ(t+1) = kp ti − p tj k22 ≥ |u(i) − u(j)|2 . (14.1)
2 64 · log n
(i,j)∈Mt+1 (i,j)∈Mt+1
To prove this claim, we will prove this again term-wise, showing that for each pair of vertices
n−1
i, j ∈ V , we have kp ti − p tj k22 ≥ 16·log n
|u(i) − u(j)|2 w.h.p. It will then suffice to talk a union
bound over all pairs i, j.
To this end, let us make the following observations: since P u(i) = p ti · r , we have that
u(i) − u(j) = (p ti − p tj ) · r by linearity. Also note that since j p tj7→i = 1 for all i (since Πt
is doubly-stochastic), we further have that the projection (p ti − p tj ) is orthogonal to 1.
159
We can now use the following statement about random vector r to argue about the effect of
projecting (p ti − p tj ) onto r . Below, we note that we have d = n − 1 since r is chosen from
the (n − 1)-dimensional space orthogonal to 1.
`2
• E[(y > r )2 ] = d
, and
Multiplying both sides of the event by (n − 1)/(64 log n), we derive the claimed Inequality
(14.1). We can further union bound over the n/2 matching pairs to all satisfy this bound
with probability ≥ 1 − n−7 , as desired.
Relating to the Lengths of the Projections. Let µ = maxi∈S u(i), then we have by
definition that u(i) ≤ µ ≤ u(j) for all i ∈ S, j ∈ S.
Now we can write
n−1 X
Φt − Φ(t+1) ≥ |u(i) − u(j)|2
64 · log n
(i,j)∈Mt+1
n−1 X
≥ (u(i) − µ)2 + (u(j) − µ)2
64 · log n
(i,j)∈Mt+1
n−1 X
= (u(i) − µ)2
64 · log n i∈V
!
n−1 X X
= u(i)2 − 2µ · u(i) + nµ2
64 · log n i∈V i∈V
160
Taking the Expectation. Then, from the second fact of Theorem 14.4.4, we obtain that
" #
X X X X
E u(i)2 = E[u(i)2 ] = E[(p ti · r )2 ] = E[(p ti − 1/n) · r )2 ] (14.4)
i∈V i∈V i∈V i∈V
X kp t − 1/nk2 Φt
i 2
= = (14.5)
i∈V
n−1 n−1
Since again the event E occurs with very low probability, and since Φt − Φt+1 is non-negative
always, we can then conclude that in unconditionally, in expectation, the potential decreases
by Ω(Φt / log n) − O(1/n−5 ).
161
Part IV
Further Topics
162
Chapter 15
15.1 Overview
First part of this chapter introduces the concept of a separating hyperplane of two sets
followed by a proof that for two closed, convex and disjoint sets a separating hyperplane
always exists. This is a variant of the more general separating hyperplane theorem 1 due to
Minkowski. Then Lagrange multipliers x , s of a convex optimization problem
min E(y )
y
s.t. Ay = b
c(y ) ≤ 0
where L(x , s) = miny L(y , x , s). We show weak duality, i.e. L(y , x , s) ≤ E(y ) and that
assuming Slater’s condition the values of both the primal and dual is equal, which is referred
to as strong duality.
1
Wikipedia is good on this: https://en.wikipedia.org/wiki/Hyperplane separation theorem
163
15.2 Separating Hyperplane Theorem
Suppose we have two convex subsets A, B ⊆ Rn that are disjoint (A ∪ B = ∅). We wish to
show that there will always be a (hyper-)plane H that separates these two sets, i.e. A lies
on one side, and B on the other side of H.
So what exactly do we mean by Hyperplane? Let’s define it.
∀aa ∈ A : hn, a i ≥ µ
∀b ∈ B : hn, bi ≤ µ
If we replace ≥ with > and ≤ with < we say H strictly separates A and B.
It is easy to see that there exists disjoint non-convex sets that can not be separated by a
hyperplane (e.g. a point cannot be separated from a ring around it). But can two disjoint
convex sets always be strictly separated by a hyperplane? The answer is no: consider the
two-dimensional case depicted in Figure 15.1 with A = {(x, y) : x ≤ 0} and B = {(x, y) :
x > 0 and y ≥ x1 }. Clearly they are disjoint; however the only separating hyperplane is
H = {(x, y) : x = 0} but it intersects A.
One can prove that there exists a non-strictly separating hyperplane for any two disjoint
convex sets. We will prove that if we further require A,B to be closed and bounded, then a
strictly separating hyperplane always exists. (Note in the example above how our choice of
B is not bounded.)
Theorem 15.2.3 (Separating Hyperplane Theorem; closed, bounded sets). For two closed,
bounded, and disjoint convex sets A, B ∈ Rn there exists a strictly separating hyperplane H.
One such hyperplane is given by normal n = d − c and threshold µ = 12 kd k22 − kck22 ,
Proof. We omit the proof that dist(A, B) = mina ∈A,b∈B kaa − bk2 > 0, which follows from
A, B being disjoint, closed, and bounded. Now, we want to show that hn, bi > µ for all
b ∈ B; then hn, a i < µ for all a ∈ A follows by symmetry. Observe that
164
Figure 15.1: The sets A = {(x, y) : x ≤ 0} and B = {(x, y) : x > 0 and y ≥ x1 } only permit
a non-strictly separating hyperplane.
1
kd k22 − kck22
hn, d i − µ = hd − c, d i −
2
1 1
= kd k2 − d c − kd k22 + kck22
2 >
2 2
1 2
= kd − ck2 > 0.
2
So suppose there exists u ∈ B such that hn, ui − µ ≤ 0. We now look at the line defined by
the distance minimizer d and the point on the “wrong side” u. Define b(λ) = d + λ(u − d ),
and take the derivative of the distance between b(λ) and c. Evaluated at λ = 0 (which is
when b(λ) = d ), this yields
d 2
kb(λ) − ck2 = 2 hd − λd + λu − c, u − d i|λ=0 = 2 hd − c, u − d i .
dλ λ=0
However, this would imply that the gradient is strictly negative since
hn, ui − µ = hd − c, ui − hd − c, d i + hd − c, d i − µ
1 1
= hd − c, u − d i + kd k22 − hc, d i − kd k22 + kck22
2 2
1
= hd − c, u − d i + kd − ck22 ≤ 0.
2
This contradicts the minimality of d and thus concludes this proof.
165
A more general separating hyperplane theorem holds even when the sets are not closed and
bounded:
Theorem 15.2.4 (Separating Hyperplane Theorem). Given two disjoint convex sets A, B ∈
Rn there exists a hyperplane H separating them.
This insight is the core idea of Lagrange multipliers (in this case λ).
Note that here the problem is not convex, because {x : kx k22 = 1} is not convex and because
we are asking to maximize a norm. In the following we will study Lagrange multipliers for
general convex problems.
166
15.3.1 Karush-Kuhn Tucker Optimality Conditions for Convex
Problems
A full formal treament of convex duality would require us to be more careful about using
inf and sup in place of min and max, as well as considering problems that have no feasible
solutions. Today, we’ll ignore these concerns.
Let us consider a general convex optimization problem with convex objective, linear equality
constraints and convex inequality constraints
s.t. Ay = b
c(y ) ≤ 0,
Definition 15.3.1 (Primal feasibility). We say that y ∈ S is primal feasible if all constraints
are satisfied, i.e. Ay = b and c(y ) ≤ 0.
Now, as we did in our example with the 2- and p-norms, we will try to understand the
relationship between the gradient of the objective function and of the constraint functions
at an optimal solution y ∗ .
An (not quite true!) intuition. Suppose y ∗ is an optimal solution to the convex program
above. Let us additionally suppose that y ∗ is not on the boundary of S. Then, generally
speaking, because we are at a constrained minimum of E(y ∗ ), we must have that for any
infinitesimal δ s.t. y ∗ + δ is also feasible, δ > ∇E(y ∗ ) ≥ 0, i.e. the infinitesimal does not
decrease the objective. We can also view this as saying that if δ > ∇E(y ∗ ) < 0, the update
must be infeasible. But what kind of updates will make y ∗ + δ infeasible? This will be true
if a > ∗ >
j δ 6= 0 for some linear constraint j or, roughly speaking, ∇ci (y ) δ 6= 0 for some tight
inequality constraint. But, if this is true for all directions that have a negative inner product
with ∇E(y ∗ ), then we must have that −∇E(y ∗ ) can be written as a linear combination of a j ,
i.e. gradients of the linear constraint, and of gradients ∇ci (y ∗ ) of tight inequality constraints,
and furthermore the cofficients of the gradients of these tight inequality constraints must be
positive, so that moving along this direction will increase the function value and violate the
constraint.
167
To recap, given cofficients x ∈ Rm and s ∈ Rk with s(i) ≥ 0 if ci (y ∗ ) = 0, and s(i) = 0
otherwise, we should be able to write
X X
−∇y E(y ∗ ) = x (j)aa j + s(i)∇ci (y ∗ ) (15.2)
j i
Note that since y ∗ is feasible, and hence c(y ∗ ) ≤ 0, we can write the condition that s(i) ≥ 0
if ci (y ∗ ) = 0, and s(i) = 0 otherwise, in a very slick way: namely as s ≥ 0 and s > c(y ∗ ) = 0.
Traditionally, the variables in s are called slack variables, because of this, i.e. they are
non-zero only if there is no slack in the constraint. This condition has a fancy name: when
s > c(y ) = 0 for some feasible y and s ≥ 0, we say that y and s satisfy complementary
slackness. We will think of the vectors s and x as variables that help us prove optimality of
a current solution y , and we call them dual variables.
Definition 15.3.2 (Dual feasibility). We say (x , s) is dual feasible if s ≥ 0. If additionally
y is primal feasible, we say (y , x , s) is primal-dual feasible.
Now, we have essentially argued that at any optimal solution y ∗ , we must have that com-
plementary slackness holds, and that Equation (15.2) holds for some x and some s ≥ 0.
However, while this intuitive explanation is largely correct, it turns out that it can fail for
technical reasons in some weird situations2 . Nonetheless, under some mild conditions, it
is indeed true that the conditions we argued for above must hold at any optimal solution.
These conditions have a name: The Karush-Kuhn-Tucker Conditions. For convenience, we
will state the conditions using ∇c(y ) to denote the matrix whose ith column is given by
∇ci (y ). This is sometimes called the Jacobian of c.
Definition 15.3.3 (The Karush-Kuhn-Tucker (KKT) Conditions). Given a convex opti-
mization problem of form (15.1) where the domain S is open. Suppose y , x , s satisfy the
following conditions:
min x
x∈R
s.t. x2 = 0.
This has only a single feasible point x = 0, which must then be optimal. But at this point, the gradient of
the constraint function is zero, while the gradient of the objective is non-zero. Thus our informal reasoning
breaks down, because there exists an infeasible direction δ we can move along where the constraint function
grows, but at a rate of O(δ 2 ).
168
15.3.2 Slater’s Condition
There exists many different mild technical conditions under which the KKT conditions do
indeed hold at any optimal solution y . The simplest and most useful is probably Slater’s
condition.
Definition 15.3.4 (Slater’s condition with full domain). A (primal) problem as defined in
(15.1) with S = Rn fulfills Slater’s condition if there exists a strictly feasible point, i.e. there
exists ỹ s.t. Aỹ = b and c(ỹ ) < 0. This means that the strictly feasible point ỹ lies strictly
inside the set {y : c(y ) ≤ 0} defined by the inequality constraints.
One way to think about Slater’s condition is that your inequality constraints should not
restrict the solution space to be lower-dimensional. This is a degenerate case as the sublevel
sets of the inequality constraints are generically full-dimenensional and you want to avoid
this degenerate case.
We can also extend Slater’s condition to the case when the domain S is an open set. To
extend Slater’s condition to this case, we need the notion of a “relative interior”.
Definition 15.3.5 (Relative interior). Given a convex set S ⊂ Rn , the relative interior of
S is
Finally, we end with a proposition that tells us that given Slater’s condition, the KKT are
indeed necessary for optimality of our convex programs.
Proposition 15.3.7 (KKT necessary for optimality when Slater’s condition holds). Con-
sider a convex program in the form (15.1) that satisfies Slater’s condition and has an open set
S as its domain. Suppose y is a primal optimal (feasible) solution, and that y , x , s satisfy
the KKT conditions.
We will prove this lemma later, after developing some tools we will use in the proof. In fact,
we will also see later that assuming Slater’s condition, the KKT conditions are sufficient for
optimality.
169
15.3.3 The Lagrangian and The Dual Program
That is, we can write this condition as the gradient of the quantity (?) is zero. But what is
this quantity (?)? We call it the Lagrangian of the program.
Definition 15.3.8. Given a convex program (15.1), we define the Lagrangian of the program
as
L(y , x , s) = E(y ) + x > (b − Ay ) + s > c(y ).
We also define a Lagrangian only in terms of the dual variables by minimizing over y as
L(x , s) = min L(y , x , s).
y
When s ≥ 0, we have that L(y , x , s) is a convex function of y . For each y , the Lagrangian
L(y , x , s) is linear in (x , s) and hence also concave in them. Hence L(x , s) is a concave
function, because it is the pointwise minimum (over y ), of a collection of concave functions
in (x , s).
We can think of x as assigning a price to violating the linear constraints, and of s as assigning
a price to violating the inequality constraints. The KKT gradient condition tells us that at
the given prices, the there is no benefit gained from locally violating the constraints – i.e.
changing the primal solution y would not improve the cost.
Notice that if y , x , s are primal-dual feasible, then
E(y ) = E(y ) + x > (b − Ay ) as b − Ay = 0.
≥ E(y ) + x > (b − Ay ) + s > c(y ) as c(y ) ≤ 0 and s ≥ 0.
= L(y , x , s) (15.3)
Thus, for primal-dual feasible variables, the Lagrangian is always a lower bound on the
objective value.
L(x , s) is defined by minizming L(y , x , s) over y , i.e. what is the worst case value of
the lower bound L(y , x , s) across all y . We can think of this as computing how good the
given “prices” are at approximately enforcing the constraints. This naturally leads to a new
optimization problem: How can we choose our prices x , s to get the best (highest) possible
lower bound?
Definition 15.3.9 (Dual problem). We define the dual problem as
max
x ,s
min L(y , x , s) = max
x ,s
L(x , s) (15.4)
y
s≥0 s≥0
170
The dual problem is really a convex optimization problem in disguise, because we can flip the
sign of −L(x , s) to get a convex function and minimizing this is equivalent to maximizing
L(x , s).
max
x ,s
L(x , s) = − min
x ,s
−L(x , s)
s≥0 s≥0
When x and s are optimal for the dual program, we say they are dual optimal. And for
convenience, when we also have a primal optimal y , altogether, we will say that (y , x , s)
are primal-dual optimal.
The primal problem can also be written in terms of the Lagrangian.
α∗ = min max L(y , x , s) (15.5)
y x ;s≥0
This is because for a minimizing y all constraints have to be satisfied and the Lagrangian
simplifies to L(y , x , s) = E(y ). If Ax − b = 0 was violated, making x large sends
L(y , x , s) → ∞. And if c(y ) ≤ 0 is violated, we can make L(y , x , s) → ∞ by choos-
ing large s.
Note that we require s ≥ 0, as we only want to penalize the violation of the inequality
constraints in one direction, i.e. when c(y ) > 0.
For any primal-dual feasible y , x , s we have L(y , x , s) ≤ E(y ) (see Equation (15.3)) and
hence also L(x , s) = miny L(y , x , s) ≤ E(y ).
In other words maxx ;s≥0 L(x , s) = β ∗ ≤ α∗ . This is referred to as weak duality.
Using the forms in Equations (15.4) and (15.5), we can also state this as
Theorem 15.3.10 (The Weak Duality Theorem). For any convex program (15.4) and its
dual, we have
α∗ = min max L(y , x , s) ≥ max min L(y , x , s) = β ∗ .
y x ;s≥0 x ;s≥0 y
So now that we have proved weak duality β ∗ ≤ α∗ , what is strong duality? β ∗ = α∗ ? The
answer is yes, but strong duality only holds under some conditions. Again, a simple sufficient
condition is Slater’s condition (Definition 15.3.6.
Theorem 15.3.11. For a program (15.1) satisfying Slater’s condition, strong duality holds,
i.e. α∗ = β ∗ . In other words, the optimal value of the primal problem α∗ is equal to the
optimal value of the dual.
171
How are we going to prove this? Before we prove the theorem, let’s make a few
observations to get us warmed up. If you get bored, skip ahead to the proof.
It is sufficient to prove that α∗ ≤ β ∗ , as the statement then follows in conjunction with weak
duality. We define the set
which defines a hyperplane with n = (1, x , s) and µ = L(x , s) such that G is on one side.
To establish strong duality, we would like to show the existence of a hyperplane such that
for (t, v , u) ∈ G
n > (t, v , u) ≥ α∗ and n = (1, xb , sb) with sb ≥ 0.
Then we would immediately get
β ∗ ≥ L(b
x , sb) = min (1, x , s)> (t, v , u) ≥ α∗ .
(t,v ,u)∈G
Perhaps not surprisingly, we will use the Separating Hyperplane Theorem. What are the
challenges we need to deal with?
• We need to replace G with a convex set (which we will call A) and separate A from
some other convex set (which we will call B).
• We need to make sure the hyperplane normal n has 1 in the first coordinate and s ≥ 0,
and the hyperplane threshold is α∗ .
Proof of Theorem 15.3.11. For simplicity, our proof will assume that S = Rn , but only a
little extra work is required to handle the general case.
Let’s move to on finding two convex disjoints sets A, B to enable the use of the separating
hyperplane Theorem 15.2.4.
172
First set we define A, roughly speaking, as a multi-dimensional epigraph of G. More precisely
Note that A is a convex set. The proof is similar to the proof that the epigraph of a convex
function is a convex set. The optimal value of the primal program can be now written as
α∗ = min t.
(t,0,0)∈A
B := {(r ∈ R, 0 ∈ Rm , 0 ∈ Rk ) : r < α∗ }.
Now, we claim that s̃ ≥ 0. Suppose s̃(i) < 0, then for u(i) → ∞ the threshold would
grow unbounded, i.e. µ → −∞ contradicting that the threshold µ is finite by the separating
hyperplane theorem. Similarly we claim ρ̃ ≥ 0, as if this were not the case, having t → ∞
implies that µ → −∞ again contradicting the finiteness of µ.
From Equation (15.7) it follows that tρ̃ ≤ µ for all t < α∗ which implies that tρ̃ ≤ µ for
t = α∗ by taking the limit. Hence we have α∗ ρ̃ ≤ µ. From (t, v , u) ∈ A we get from
Equation (15.6)
(ρ̃, x̃ , s̃)> (t, v , u) ≥ µ ≥ α∗ ρ̃
173
and thus
(ρ̃, x̃ , s̃)> (E(y ), Ay − b, c(y )) ≥ α∗ ρ̃. (15.8)
Now we consider two cases; starting with the “good” case where ρ̃ > 0. Dividing Equa-
tion (15.8) by ρ̃ gives
x̃ > s̃ >
E(y ) + (Ay − b) + c(y ) ≥ α∗ .
ρ̃ ρ̃
Noting that the left hand side above is L(y , x̃ρ̃ , s̃ρ̃ ) and that the equation holds for arbitrary
y ; therefore also for the minimum we get
x̃ s̃
min L y , , ≥ α∗
y ρ̃ ρ̃
As Slater’s condition holds, there is an interior point ỹ , i.e. it satisfies b − Aỹ = 0 and
c(ỹ ) < 0. Together with the equation above this yields
This, however, means that there is a point in A on the wrong side of the hyperplane, as
Remark. Note that our reasoning about why s ≥ 0 in the proof above is very similar to
our reasoning for why the primal program can be written as Problem (15.5).
174
Example. As an example of A and B as they appear in the above proof, consider
min y 2
y∈(0,∞)
1/y−1≤0
This leads to α∗ = 1, y ∗ = 1, and A = {(t, u) : y ∈ (0, ∞) and t > y 2 and u ≥ 1/y − 1}, and
B = {(t, 0) : t < 1} and the separating hyperplane normal is n = (1, 2). These two sets A, B
are illustrated in Figure 15.3.
Figure 15.3: Example of the convex sets A and B we wish to separate by hyperplane.
Let’s come back to our earlier discussion of parallel gradients and the KKT conditions.
Consider a convex program (15.1) with an open set S as its domain. Suppose y ∗ is an
optimizer of the primal problem and x ∗ , s ∗ for the dual, and suppose that strong duality
holds. We thus have
L(y ∗ , x ∗ , s ∗ ) = α∗ = β ∗ .
Because L(y , x ∗ , s ∗ ) is a convex function in y , it also follows that as E : S → R and c are
differentiable then we must have that the gradient w.r.t. y is zero, i.e.
and hence when the i-th convex constraint is not active, i.e. ci (y ∗ ) < 0 the slack must be
zero, i.e. s(i) = 0. Conversely if the slack is non-zero, that is s(i) 6= 0 implies that the
175
constraint is active, i.e. ci (y ∗ ) = 0. This says precisely that the complementary slackness
condition holds at primal-dual optimal (y ∗ , x ∗ , s ∗ ). Combined with our previous observation
Equation (15.9), we get the following result.
Theorem 15.3.12. Consider a convex program (15.1) with an open domain set S and whose
dual satisfies strong duality. Then KKT conditions necessarily hold at primal-dual optimal
(y ∗ , x ∗ , s ∗ ).
Theorem 15.3.12 combined with Theorem 15.3.11 immediately imply Proposition 15.3.7.
KKT is sufficient. We’ve seen that when strong duality holds, the KKT conditions are
necessary for optimality. In fact, they’re also sufficient for optimality, as the next theorem
shows. And in this case, we do not need to assume strong duality, as now it is implied by
the KKT conditions.
Theorem 15.3.13. Consider a convex program (15.1) with an open domain set S. Then if
the KKT conditions hold at (ỹ , x̃ , s̃), they must be primal-dual optimal, and strong duality
must hold.
Proof. ỹ is global minimizer of y 7→ L(y , x̃ , s̃), since this function is convex with vanishing
gradient at ỹ . Hence,
A good reference for basic convex duality theory is Boyd’s free online book “Convex opti-
mization” (linked to on the course website). It provides a number of different interpretations
of duality. One particularly interesting one comes from economics: economists see the slack
variables s as prices for violating the constraints.
176
Chapter 16
min E(y )
s.t. Ay = b (16.1)
c(y ) ≤ 0,
max
x ,s
L(x , s) (16.2)
s≥0
whose optimal value is denoted by β ∗ . The dual is always a convex optimization program
even though the primal is non-convex. The optimal value of the primal (16.1) can also be
written as
α∗ = inf sup L(y , x , s), (16.3)
y x ;s≥0
where no constraint is imposed on the primal variable y . The optimal value of the dual
(16.2) is
β ∗ = sup inf L(y , x , s). (16.4)
x ;s≥0 y
177
Note the only difference between (16.3) and (16.4) is that the positions of “inf” and “sup”
are swapped. The weak duality theorem states that the dual optimal value is a lower bound
of the primal optimal value, i.e. β ∗ ≤ α∗ .
The Slater’s condition for (16.1) open domain S requires the existence of a strictly feasible
point, i.e. there exists ỹ ∈ S s.t. Aỹ = b and c(ỹ ) < 0. This means that the strictly feasible
point ỹ lies inside the interior of the set {y : c(y ) ≤ 0} defined by the inequality constraints.
If the domain S is not open, Slater’s condition also requires that a strictly feasible point is
in the relative interior of S (NB: when S is open, S is equal to its relative interior). The
strong duality theorem says that Slater’s condition implies strong duality, β ∗ = α∗ .
We were also introduced to the KKT conditions, and we saw that for our convex programs
with continuously differentiable objective and constraints functions, when the domain is
open, the conditions are sufficient to imply strong duality and primal-dual optimality of the
points that satisfy them. If Slater’s condition holds, we also have that KKT necessarily holds
at any optimal solution. In summary: Slater’s condition =⇒ strong duality ⇐⇒ KKT
Example. In Chapter 12, we gave a combinatorial proof of the min-cut max-flow theorem,
and showed that the min-cut program can be expressed as a linear program. Now, we will
use the strong duality theorem to give an alternative proof, and directly find the min-cut
linear program is the dual program to our maximum flow linear program.
We will assume that Slater’s condition holds for our primal program. Since scaling the flow
down enough will always ensure that capacity constraints are strictly satisfied i.e. f < c, the
only concern is to make sure that non-negativity constraints are satisfied. This means that
there is an s-t flow that sends a non-zero flow on every edge. In fact, this may not always
be possible, but it is easy to detect such edges and remove them without changing the value
of the program: an edge (u, v) should be removed if there is no path s to u or no path v to
t. We can identify all such edges using a BFS from s along the directed edges and a BFS
along reversed directed edges from t.
Slater’s condition holds whenever there is a directed path from s to t with non-zero capacity
(and if there is not, the maximum flow and minimum cut are both zero).
= max −c > s
x; s ≥ 0
b>s,t x = 1
s ≥ B >x
178
Thus switching signs gives us
The LHS of (16.5) is exactly the LP formulation of max-flow, while the RHS is exactly the
LP formulation of min-cut. Note that we treated the “constraint” 0 ≤ f as a restriction on
the domain of f rather than a constraint with a dual variable associated with it. We always
have this kind of flexibility when deciding how to compute a dual, and some choices may
lead to a simpler dual program than others.
E ∗ (z ) = suphz , y i − E(y ).
y ∈S
Remark 16.2.2. E ∗ is a convex function whether E is convex or not, since E ∗ (z) is pointwise
supremum of a family of convex (here, affine) functions of z.
In this course, we have only considered convex functions that are real-valued and continuous
and defined on a convex domain. For any such E, we have E ∗∗ = E, i.e. the Fenchel conjugate
of the Fenchel conjugate is the original function. This is a consequence of the Fenchel-Moreau
theorem, which establishes this under slightly more general conditions. We will not prove
this generally, but as part of Theorem 16.2.3 below, we sketch a proof under more restrictive
assumptions.
Example. Let E(y ) = p1 ky kpp (p > 1). We want to evaluate its Fenchel conjugate E ∗ at any
given point z ∈ Rn . Since E is convex and differentiable, the supremum must be achieved
at some y with vanishing gradient
179
Then,
E ∗ (z ) = hz , y i − E(y )
X 1 1 p
= |z (i)| p−1 +1 − |z (i)| p−1
i
p
1 1
(define q s.t. + = 1)
q p
1
= kz kqq .
q
min E(y )
y ∈Rn
s.t. Ay = b
Theorem 16.2.3 (Properties of the Fenchel conjugate). Consider a strictly convex function
E : S → R where S ⊆ Rn is an open convex set. When E is differentiable with a Hessian
that is positive definite everywhere and its gradient ∇E is surjective onto Rn , we have the
following three properties:
2. (E ∗ )∗ = E, i.e. the Fenchel conjugate of the Fenchel conjugate is the original function.
3. H E ∗ (∇E(y )) = H −1
E (y )
180
∇y E
−−→
primal point y ∇ E∗
dual point z
←−z−−
Hessian H E (y ) = H −1
E ∗ (z ) Hessian H E ∗ (y ) = H −1
E (y )
Proof sketch. Part 1. Because the gradient ∇y E is surjective onto Rn , given any z ∈ Rn ,
there exists a y such that ∇E(y ) = z . Let y (z ) be a y s.t. ∇E(y ) = z . It can be shown
that because E is strictly convex, y (z ) is unique.
The function y 7→ hz , y i − E(y ) is concave in y and has gradient z − ∇E(y ) and is hence
maximized at y = y (z ). This follows because we know a differentiable convex function is
minimized when its gradient is zero and so a differentiable concave function is maximized
when its gradient is zero.
Then, using the product rule and composition rule of derivatives,
and let z (u) denote the z obtaining the supremum, in the above program. We then have
u = ∇E ∗ (z (u)). Letting y (z ) be defined as in Part 1, we get y (z (u)) = ∇z E ∗ (z (u)) = u
∇z E ∗ (z + τ ) = y + δ, ∇y E(y + δ) = z + τ .
Then,
∇y E(y + δ) − ∇y E(y ) = τ , ∇z E ∗ (z + τ ) − ∇z E ∗ (z ) = δ.
181
Since H E (y ) measures the change of ∇y E(y ) when y changes by an infinitesimal δ, then
∇y E(y + δ) − ∇y E(y ) ≈ H E (y )δ
⇐⇒ H −1
E (y ) (∇y E(y + δ) − ∇y E(y )) ≈ δ
⇐⇒ H −1 ∗ ∗
E (y )τ ≈ δ = ∇E (z + τ ) − ∇E (z )
⇐⇒ H −1 ∗ ∗
E (y )τ ≈ ∇E (z + τ ) − ∇E (z ) (16.6)
Similarly,
H E ∗ (z )τ ≈ ∇z E ∗ (z + τ ) − ∇z E ∗ (z ) (16.7)
Comparing (16.6) and (16.7), it is easy to see
H E ∗ (z ) = H −1 −1
E (y ) ⇐⇒ H E ∗ (∇E(y )) = H E (y ).
Remark 16.2.4. Theorem 16.2.3 can be generalized to show that the Fenchel conjugate has
similar nice properties under much more general conditions, e.g. see [BV04].
∇E(y ) = Ay + b = 0,
∇δ E(y + δ) = A(y + δ) + b = 0
δ = −y − A−1 b
y + δ = −A−1 b
This gives us exactly global minimizer in just one step. However, the situation changes when
the function is not quadratic anymore and thus we do not have a constant Hessian. But
taking a step which tries to set the gradient to zero might still be a good idea.
182
16.3.2 K-stable Hessian
183
where the second inequality is due to K-stability of the inverse Hessian,
1
A−1 H E (y )−1 (1 + K)A−1 .
1+K
Meanwhile, using Taylor’s theorem and K-stability, for some ŷ between y and y ∗ , and noting
∇E(y ∗ ) = 0, we have
1
E(y ) = E(y ∗ ) + ∇E(y ∗ )> (y − y ∗ ) + (y − y ∗ )> H E (ŷ )(y − y ∗ )
2
(K + 1)
E(y ) − E(y ∗ ) ≤ (y − y ∗ )> A(y − y ∗ )
2 | {z }
=:γ
Next, our task is reduced to comparing σ and γ. y t := y ∗ + t(y − y ∗ ) (t ∈ [0, 1]) is a point
on the segment connecting y ∗ and y . Since
Z 1
∗
∇E(y ) = ∇E(y ) − ∇E(y ) = H(y t )(y − y ∗ )dt,
0
then
Z 1
∗ >
(y − y ) ∇E(y ) = (y − y ∗ )> H(y t )(y − y ∗ )dt
0
Z 1
1
≥ (y − y ∗ )> A(y − y ∗ )dt
K +1 0
γ
= (16.10)
K +1
On the other hand, define z s = ∇E(y ∗ ) + s(∇E(y ) − ∇E(y ∗ )) and then dz s = ∇E(y )ds.
Using Theorem 16.2.3, we have
Z 1
∗
y −y = H E ∗ (z s )∇E(y )ds.
0
Then,
Z 1
> ∗
∇E(y ) (y − y ) = ∇E(y )> H E ∗ (z s )∇E(y )ds
0
Z 1
≤ (K + 1) ∇E(y )> A−1 ∇E(y )ds
0
≤ (K + 1)σ (16.11)
Combining (16.10) and (16.11) yields
γ ≤ (K + 1)2 σ.
Therefore,
∗ ∗ ∗ 1
E(y + δ ) − E(y ) ≤ (E(y ) − E(y )) 1 − .
(K + 1)6
184
Remark 16.3.2. The basic idea of relating σ and γ in the above proof is writing the same
quantity, ∇E(y )> (y − y ∗ ), as two integrations along different lines. (K + 1)6 can be reduced
to (K +1)2 and even to (K +1) with more care. In some settings, Newton’s method converges
in log log(1/) steps.
Let us apply Newton’s method to convex optimization programs with only linear constraints,
min E(f )
f ∈Rm
s.t. Bf = d
min E(f 0 + ρ)
ρ∈Rm
s.t. Bρ = 0
1. What does the gradient and Hessian, and hence Newton steps, of Ê(f ) look like?
2. Does the K-stability of the Hessian of E carry over to the function Ê?
The gradient and Hessian of Ê should live in Rdim(C) and Rdim(C)×dim(C) respectively. Let ΠC
be the orthogonal projection matrix onto C , meaning ΠC is symmetric and ΠC δ = δ for
1
You don’t need to know the formal definition of isomorphism on vector spaces. In this context, it means
equivalent up to a transformation by an invertible matrix. In fact in our case, the isomorphism is given by
an orthonormal matrix.
185
any δ ∈ C. Given any f ∈ C, add to it an infinitesimal δ ∈ C, then
From this, we can deduce that the gradient of Ê at a point f ∈ C is essentially equal (up
to a fixed linear transformation independent of f ) to the projection of gradient of ∇E at f
onto the subspace C. Similarly,
1
Ê(f + δ) = E(f + δ) ≈ E(f ) + hΠC ∇E(f ), δi + hδ, ΠC H E (f )ΠC δi
2
Again from this, we can deduce that the Hessian of Ê at a point f ∈ C is essentially equal
(again up to a fixed linear transformation independent of f ) to the matrix ΠC H E (f )ΠC .
Note that X Y implies ΠC X ΠC ΠC Y ΠC , and from this we can see that the Hessian
of Ê is K-stable if the Hessian of E is. Also note that we were not terribly formal in the
discussion above. We can be more precise by replacing ΠC with a linear map from Rdim(C)
to Rm which maps any vector in Rdim(C) to a vector in C and then going through a similar
chain of reasoning.
What is a Newton step δ ∗ w.r.t. Ê? It turns out that for actually computing the Newton
step, it is easier to think again of E with a constraint that the Newton step must lie in the
subspace C. One can show that this is equivalent to the Newton step of Ê, but we omit this.
In the constrained view, δ ∗ should be a minimizer of
1
minm h∇E(f ), δi + hδ, H E (f ) δi
δ∈R | {z } 2 | {z }
Bδ=0 =:g =:H
(Lagrange duality)
1
⇐⇒ maxn minm hg , δi + hδ, Hδi − x > Bδ (16.12)
x ∈R δ∈R
| 2 {z }
Lagrangian L(δ,x )
Bδ = 0,
∇δ L(δ, x ) = g + Hδ − B > x = 0,
δ + H −1 g = H −1 B > x
Bδ +BH −1 g = BH −1 B > x
|{z}
=0
BH −1 g = BH −1 >
| {z B } x
=:L
186
Finally, the solutions to (16.12) are
(
x ∗ = L−1 BH −1 g
δ ∗ = −H −1 g + H −1 B > x ∗
It is easy to verify that Bδ ∗ = 0. Thus, our update rule is f i+1 = f i + δ ∗ . And we have the
following convergence result.
Theorem 16.3.3. Ê(f k ) − Ê(f ∗ ) ≤ · Ê(f 0 ) − Ê(f ∗ ) when k > (K + 1)6 log(1/).
Pm
Remark 16.3.4. Note if E(f ) = i=1 Ei (f (i)), then H E (f ) is diagonal. Thus, L =
BH −1 B > is indeed a Laplacian provided that B is an incidence matrix. Therefore, the
linear equations we need to solve to apply Newton’s method in a network flow setting are
Laplacians, which means we can solve them very quickly.
187
Chapter 17
In this chapter, we’ll learn about interior point methods for solving maximum flow, which is
a rich and active area of research [DS08, Mad13, LS20b, LS20a].
We’re going to frequently need to refer to vectors arising from elementwise operations com-
bining other vectors.
−−−−−−→
To that end, given two vector a ∈ Rm , and b ∈ Rm , we will use (aa (i)b(i)) to denote the
vector z with z (i) = a (i)b(i) and so on.
Throughout this chapter, when we are working in the context of some given graph G with
vertices V and edges E, we will let m = |E| and n = |V |.
The plots in this chapter were made using Mathematica, which is available to ETH students
for download through the ETH IT Shop.
max F (17.1)
f ∈RE
188
As we develop algorithms for this problem, we will assume that we know the maximum flow
value F ∗ . Let f ∗ denote some maximum flow, i.e. a flow with −c ≤ f ≤ c can val(f ∗ ) = F ∗ .
In general, an a lower bound F ≤ F ∗ will allow us to find a flow with value F , and because
of this, we can use a binary search to approximate F ∗ .
We assume the optimal value of Program (17.1) is F ∗ . Then for a given 0 ≤ α < 1 we define
a program
min V (f ) (17.2)
f ∈RE
This problem makes sense for any 0 ≤ α < 1. When α = 0, we are not routing any flow
yet. This will be our starting point. For any 0 ≤ α < 1, the scaled-down maximum flow
αf ∗ strictly satisfies the capacities −c < αf ∗ < c, and Bαf ∗ = αF ∗ b st . Hence αf ∗ is a
feasible flow for this value of α and hence V (αf ∗ ) < ∞ and so the optimal flow for the Barrier
Problem at this α must also have objective value strictly below ∞, and hence in turn strictly
satisfy the capacity constraints. Thus, if we can find the optimal flow for Program (17.2)
for α = 1 − , we will have a feasible flow with Program (17.1), the Undirected Maximum
Flow Problem, routing (1 − )F ∗ . This is how we will develop an algorithm for computing
the maximum flow.
Program (17.2) has the Lagrangian
Bf = αF ∗ b st and − c ≤ f ≤ c (17.3)
“Barrier feasibility”
∇V (f ) = B > x (17.4)
”Barrier Lagrangian gradient optimality”
Let f ∗α denote the optimal solution to Problem 17.2 for a given 0 ≤ α < 1, and let x ∗α be
optimal dual voltages such that ∇V (f ∗α ) = B > x ∗α .
189
It turns out that, if we have a solution f ∗α to this problem for some α < 1, then we can find
a solution f α+α0 for some α0 < 1 − α. And, we can compute f α+α0 using a small number of
Newton steps, each of which will only require a Laplacian linear equation solve, and hence
is computable in O(m)
e time. Concretely, for any 0 ≤ α < 1, given the optimal flow at this
α, we will be able to compute the optimal flow at αnew = α + (1 − α) 1501√m . This means that
√
after T = 150 m log(1/) updates, we have a solution for α ≥ 1 − .
We can state the update problem as
min V (δ + f ) (17.5)
δ∈RE
0 ∗
s.t. Bδ = α F b st “The Update Problem”
It turns out that for the purposes of analysis, it will be useful to ensure that our “Update
Problem” uses an objective function that is minimized at δ = 0.
This leads to a variant of the Update Problem, which we call the “Divergence Update
Problem”. We obtain our new problem by switching from V (δ + f ) as our objective to
V (δ + f ) − (V (f ) + h∇V (f ), δi) as our objective, and this is called the divergece of V w.r.t.
δ based at f .
Now, for any flow δ such that Bδ = α0 F ∗ b st , using the Lagrangian gradient condition (17.4),
we have h∇V (f ∗α ), δi = hx ∗α , α0 F ∗ b st i. Hence, for such δ, we have
V (δ + f ∗α ) − (V (f ∗α ) + h∇V (f ∗α ), δi) = V (δ + f ∗α ) − (V (f ∗α ) + hx ∗α , α0 F ∗ b st i)
We conclude that the objectives of the Update Problem (17.5) and the Divergence Update
Problem (17.6) have the same minimizer, which we denote δα∗ 0 , although, to be precise, it is
also a function of α.
Thus f ∗α + δα∗ 0 is optimal for the optimization problem
min V (f )
f ∈RE
(17.7)
s.t. Bf = (α + α )F ∗ b st
0
Lemma 17.1.1. Suppose f ∗α is the minimizer of Problem (17.2) (the Barrier Problem with
parameter α) and δα∗ 0 is the minimizer of Problem (17.6) (the Update Problem with param-
eters f ∗α and α0 ), then f ∗α + δα∗ 0 is optimal for Problem (17.2) with parameter α + α0 (i.e. a
new instance of the Barrier problem).
190
Algorithm 18: Interior Point Method
f ← 0;
α ← 0;
while α < 1 − do
a0 ← 201−α
√ ;
m
Compute δ, the minimizer of Problem (17.6);
Let f ← f + δ and α ← α + α0 ;
end
return f
Pseudotheorem 17.1.1. Let f be the minimizer of Problem (17.2). Then, when a0 ≤ 1−α
√ ,
20 m
the minimizer δ of Problem (17.7) can be computed in O(m)
e time.
The key insight in this type of interior point method is that when the update α0 is small
enough,
Theorem 17.1.2. Algorithm 18 returns a flow f that is feasible for Problem (17.1) in time
e 1.5 log(1/)).
O(m
Proof Sketch. First note that for α = 0, the minimizer of Problem (17.2) is f = 0. The
proof now essentially follows by Lemma (17.1.1), and Pseudotheorem 17.1.1. Note
1 √ that 1 − α
shrinks by a factor (1 − 20 m ) in each iteration of the while-loop, and so after 20 m log(1/)
√
iterations, we have 1 − α ≤ , at which point the loop terminates. To turn this into a formal
proof, we need to take care of the fact the proper theorem corresponding to Pseudotheo-
rem 17.1.1 only gives a highly accurate but not exact solution δ to the “Update Problem”.
But it’s possible to show that this is good enough (even though both f and δ end up not
being exactly optimal in each iteration).
Remark 17.1.3. For the maximum flow problem, when capacities are integral and polyno-
mially bounded, if we choose = m−c for some large enough constant c, given a feasible flow
with val(f ) = 1 − , is it possible to compute an exact maximum flow in nearly linear time.
Thus Theorem 17.1.2 can also be used to compute an exact maximum flow in O(m) e time,
but we omit the proof. The idea is to first round to an almost optimal, feasible integral flow
(which requires a non-trivial combinatorial algorithm), and then to recover the exact flow
using Ford-Fulkerson. See [Mad13] for details.
Remark 17.1.5. For sparse graphs with m = O(n) e and large capacities, this running time
is the best known, and improving it is major open problem.
191
17.1.3 Understanding the Divergence Objective
Note that if V (x) = − log(1 − x), then D(x) = V (x) − (V (0) + V 0 (0)x).
Figure 17.1: Plot showing V (x) = − log(1 − x) and then linear approximation
V (0) + V 0 (0)x.
We let
c + (e) = c(e) − f (e) and c − (e) = c(e) + f (e)
192
So then
Note that DV (δ) is strictly convex of over the feasible set, so the argmin is unique.
Lemma 17.1.6.
1. 1/2 ≤ D̃00 (x ) ≤ 2.
3. x2 /4 ≤ D̃(x ) ≤ x2 .
What’s happening here? We glue together D(x) for small x with its quadratic approximation
for |x| > . For x > , we “glue in” a Taylor series expansion based at x = .
193
Figure 17.3: Plot showing D(x) = − log(1 − x) and the quadratic approximation based at
x = 0.1.
We also define
X δ(e) δ(e)
D̃V (δ) = D̃ + D̃ −
e
c + (e) c − (e)
194
Note that D̃V (δ) is strictly convex of over the feasible set, so the argmin is unique.
Pseudoclaim 17.1.7. We can compute the argmin δ ∗ of Problem (17.9), the Smoothed
Update Problem, using the Newton-Steps for K-stable Hessian convex functions that we saw
in the previous chapter, in O(m)
e time.
Sketch of proof. Problem (17.9) fits the class of problems for which we showed in the pre-
vious chapter that (appropriately scaled) Newton steps converge. This is true because the
Hessian is always a 2-spectral approximation of the Hessian at D̃V (δ ∗ ), as can be shown
from Lemma 17.1.6. Because the Hessian of D̃V (δ) is diagonal, and the constraints are flow
constraints, each Newton step boils down to solving a Laplacian linear system, which can be
done to high accuracy O(m)
e time.
Remark 17.1.8. There are three things we need to modify to turn the pseudoclaim into a
true claim, addressing the errors arising from both Laplacian solvers and Newton steps:
1. We need to rephrase the claim to so that we only claim δ ∗ has been computed to high
accuracy, rather than exactly.
2. We need to show that we can construct an initial guess to start off Newton’s method
δ0 for which the value D̃V (δ0 ) is not too large. (This is easy).
3. We need show that Newton steps converge despite using a Laplacian solver that doesn’t
give exact solutions, only high accuracy solutions. (Takes a bit of work, but is ulti-
mately not too difficult).
Importantly, to ensure our overall interior point method still works, we also need to show
that it converges, even if we’re using approximate solutions everywhere. This also takes some
work to show, again is not too difficult.
Proof Sketch. We sketch the proof in the case when both f, g are differentiable: Observe
that 0 = ∇f (x ∗ ) = ∇g(x ∗ ), and hence g(x ) is also minimized at x ∗ .
We define
c
b(e) = min(c + (e), c − (e)) (17.10)
Lemma 17.1.10.
−−−−−−−Suppose δ ∗ is the argmin of Problem (17.9), the Smoothed Update Prob-
∗ −→
lem, and c (e))
< 0.1. Then δ ∗ is the argmin of Problem (17.8).
(δ (e)/b
∞
195
−−−−−−−−→
Proof. We observe that if
(δ ∗ (e)/bc (e))
< 0.1, then D̃V (δ ∗ ) = DV (δ ∗ ), and, for all
∞
τ ∈ Rm with norm
−−−−−−−−→
−−−−−−−−→
(τ (e)/b c (e))
< 0.1 −
(δ ∗ (e)/b
c (e))
∞ ∞
∗ ∗
we have that D̃V (δ + τ ) = DV (δ + τ ). Thus D̃V and DV agree on a neighborhood around
δ ∗ and hence by Lemma 17.1.9, we have that δ ∗ is the argmin of Problem (17.8).
{a, b} ∈ E
b if (a, b) ∈ E AND (b, a) ∈ E
and
c
b({a, b}) = min(c(a, b), c(b, a)).
Note that when G bf is the symmetrization of the residual graph Gf (which we defined in
Chapter 11), then c
b matches exactly the definition of c
b in Equation (17.10).
Lemma 17.1.14. Let G be an undirected, capacitated multi-graph G = (V, E, c) which is s-t
well-conditioned. Let f be the minimizer of Program (17.2). Let G bf be the symmetrization
of the residual graph Gf (in the sense of Lecture 10). Then there exists a flow δ̂ which
satisfies B δ̂ = 1−α
5
F ∗ b st and is feasible in G
bf . Note that we can also state the feasibility in
G
bf as
−
−−−−−−−→
δ̂(e)/bc (e)
≤1
∞
Proof. We recall since f is the minimizer of Program (17.2), there exists dual-optimal volt-
ages x such that
−−−−−−−−−−−−−−−−−−−−−−→
> 1 1
B x = ∇V (f ) = −
c(e) − f (e) c(e) + f (e)
From Lecture 10, we know that there is flow δ̄ that is feasible with respect to the residual
graph capacities of the graph Gf such that B δ̄ = (1 − α)F ∗ b st . Note when treating δ̄
196
as an undirected flow, feasibility in the residual graph means that δ̄(e) < c(e) − f (e) and
−δ̄(e) < c(e) + f (e). Thus,
X δ̄ δ̄
(1 − α)F ∗ b > >
st x = δ̄B x = − ≤m
e
c(e) − f (e) c(e) + f (e)
Now, because the graph is s-t well-conditioned, there are at 32 m edges directly from s to t with
capacity U and each of these e satisfy by the Lagrangian gradient optimality condition (17.4)
1 1
b>
st x = −
U − f (e) U + f (e)
m 1 1 4/5
∗
≥ b>
st x ≥ − ≥
(1 − α)F U − f (e) U + f (e) U − f (e)
So
4 (1 − α)F ∗ 1
U − f (e) ≥ ≥ (1 − α)U
5 m 2
In this case, the capacity on each of the 32 m s-t edges with capacity U in G will have
capacity (1 − α)U/2 in G
bf . This guarantees that there is feasible flow in G
bf of value at least
1 1 ∗
3
(1 − α)mU ≥ 3 (1 − α)F .
197
−−−−−−−−→
0 ∗
1
B δ̃ = α F b st and
δ̃(e)/b
c (e)
≤ √
30 m
. This means that
∞
! !
X δ̃(e) δ̃(e)
D̃V (δ̃) = D̃ + D̃ −
e
c + (e) c − (e)
!2 !2
X δ̃(e) δ̃(e)
≤ 4 +4 −
e
c + (e) c − (e)
!2
X δ̃(e)
≤ 8
e
c
b(e)
≤ 8/900 < 1/100.
This then means that the minimizer δ ∗ of Problem (17.9) also satisfies D̃V (δ̃) < 1/100.
−−−−−−→
2
∗ X δ ∗ (e) 2 δ ∗ (e) 2
c (e))
≤
(δ /b + −
∞
e
c + (e) c − (e)
δ ∗ (e) δ ∗ (e)
X
≤ D̃ + D̃ − By Lemma 17.1.6.
e
c + (e) c − (e)
= D̃V (δ̃) < 1/100.
−−−−−−→
Hence
(δ ∗ /b
c (e))
< 0.1.
∞
198
Chapter 18
Distance Oracles
In this chapter, we learn about distance oracles as presented in the seminal paper [TZ05].
Distance Oracles are data structures that allow for any undirected graph G = (V, E) to be
stored compactly in a format that allows to query for the (approximate) distance between
any two vertices u, v in the graph. The main result of this chapter is the following data
structure.
Theorem 18.0.1. There is an algorithm that, for any integer k ≥ 1 and undirected graph
G = (V, E), computes a data structure that can be stored using Õ(kn1+1/k ) bits such that on
querying any two vertices u, v ∈ V returns in O(k) time a distance estimate dist(u,
g v) such
that
dist(u, v) ≤ dist(u,
g v) ≤ (2k − 1) · dist(u, v).
Remark 18.0.2. Note that for k = 1, the theorem above is trivial: it can be solved by
computing APSP and storing the distance matrix of G.
Remark 18.0.3. We point out that given space O(n1+1/k ), approximation (2k−1) is the best
that we can hope for according to a popular and widely believed conjecture that essentially
says that there are unweighted graphs that have no cycle of length (2k+1) but have Ω̃(n1+1/k )
edges. A more careful analysis than we will carry out allows to shave all logarithmic factors
from Theorem 18.0.1 and therefore the data structure is only a factor k off in space from
optimal while also answering queries extremely efficiently. It turns out that the factor k
can also be removed in space and query time (although currently preprocessing is quite
expensive), see therefore the following (really involved) articles [Che14, Che15].
Remark 18.0.4. Also note that in directed graphs no such distance oracle is possible.
Even maintaining the transitive closure (the information of who reaches who) can only be
preserved if one stores Ω̃(n2 ) bits.
199
18.1 Warm-up: A Distance Oracle for k = 2
Let us first describe the data structure for the case where k = 2. See therefore the pseudo-
code below. Here we use the convention that dist(x, X) for some vertex x ∈ V and some
subset X ⊆ V is the minimum distance from x to any y ∈ X, formally dist(x, X) =
miny∈X dist(x, y).
The key to the algorithm is the definition of pivots and bunches. Below is an example that
illustrates their definitions.
Without further due, let us discuss the query procedure which is depicted below. It essentially
consists of checking whether the vertex v is already in the bunch of u in which case we have
stored the distance distG (u, v) explicitly. Otherwise, it uses a detour via its pivot.
200
Figure 18.1: Graph G (without the edges) where vertices are drawn according to their
distance from u. The blue and red vertices are in S. The red vertex is chosen to be the
pivot p(u). Note that another vertex in S could have been chosen to become the pivot. The
bunch of u is the vertices that are strictly withing the red circle. In particular, both blue
vertices, the red vertex and also the white vertex on the boundary of the red circle are not
in the bunch B(u).
Space Analysis. For each vertex s ∈ S, we store a hash-table with one entry for each
vertex v. This√can be stored with space O(|S|n). We have by a standard Chernoff-bound
that |S| = Õ( n) w.h.p. so this becomes Õ(n3/2 ).
Next, fix some u ∈ V \ S, and let us argue about the size of B(u) (which asymptotically
matches |Hu |). We order the vertices v1 , v2 , . . . , vn in V by their distance from u. Since we
sample uniformly at random, by a simple √Chernoff bound, we obtain that the first vertex vi
in v1 , v2 , . . . , vn that is in S has i = Õ( n) w.h.p.. But note that since√only vertices that
are strictly closer to u than vi are in B(u), this implies that |B(u)| = Õ( n).
It remains to take a careful union bound over all bad events at every vertex u ∈ V \ S to
conclude that with
√ high probability, the hash tables Hu for all u ∈ V \ S combined take total
3/2
space Õ(|V \ S| √
n) = Õ(n ). For the rest of the section, we condition on the event that
each |B(v)| = Õ( n) to use it as if it was a deterministic guarantee.
201
Preprocessing and Query Time. In order to find the distances stored in the hash
tables Hs for s ∈ S, we can simply run Dijkstra from each s ∈ S on the graph in total time
Õ(m|S|) = Õ(mn1/2 ).
To compute the pivots for each vertex u, we can insert a super-vertex s0 and add an edge
from s0 to each s ∈ S of weight 0. We can then run Dijkstra from s0 on G ∪ {s0 }. Note that
for each u, we have distG∪{s0 } (s0 , u) = distG (p(u), u) and that p(u) can be chosen to be the
closest vertex on the path from s0 to u that Dijkstra outputs along with the distances (recall
that Dijkstra can output a shortest path tree). This takes Õ(m) time.
It remains to compute the bunches B(u). Here, we use duality: we define the cluster C(w)
for every vertex w ∈ V \ S to be the set
Note the subtle difference to the bunches in that membership of v now depends on p(v) and
not on p(u)! It is not hard to see that u ∈ C(w) ⇐⇒ P w ∈ B(u). And Pit is straight-forward
to compute the bunches from the clusters in time O( v |B(v)|) = O( w |C(w)|) = Õ(n3/2 ).
Finally, it turns out that we can compute each C(w) by running Dijkstra with a small
modification.
Lemma 18.1.1 (Lemma 4.2 in [TZ05]). Consider running Dijkstra from a vertex w but
only relaxing edges incident to vertices v that satisfy distG (v, w) < distG (v, p(v)). Then,
202
the algorithm computes C(w) and all distances dist(v, w) for v ∈ C(w) in time Õ(|E(C(w))|)
where E(C(w)) are the edges that touch a vertex in C(w).
It remains to observe that the total time required to compute all clusters is
X X X X
Õ( |E(C(w))|) = Õ( |E(v)|) = Õ( |E(v)|) = Õ( |E(v)||B(v)|).
w w,v∈C(w) v,w∈B(v) v
√
But√we have upper bounded |B(v)| = Õ( n)√ for all v, thus each vertex just pays its degree
Õ( n) times and we get running time Õ(m n).
Monte Carlo vs. Las Vegas. Note that the analysis above only guarantees that the
algorithm works well with high probability. However, it is not hard to see that the algorithm
can be transformed
√ into a Las Vegas algorithm: whenever we find a bunch B(v) whose size
exceeds our Õ( n) bound, we simply re-run the algorithm. This guarantees that the final
data structure that we output indeed satisfies the guarantees stipulated in the theorem.
Note that in a sense this algorithm is almost easier than the one for k = 2 since it treats
each level in the same fashion. Here we ensure that the last set Sk+1 is empty (which would
happen with constant probability otherwise).
We make the implicit assumption throughout that Sk 6= ∅ so that pk (u) is well-defined for
each u. We also define dist(x, X) = ∞ if X is the empty set.
203
The drawing below illustrates the new definition of a bunch where we have chosen k = 3 to keep things simple.

Figure 18.2: Graph G (without the edges) where vertices are drawn according to their distance from u. All vertices are in S_1. The blue and red vertices are in S_2. The dark blue and dark red vertices are also in S_3. Finally, we have S_4 = ∅. The light red vertex is chosen as pivot p_2(u); the dark red vertex is chosen as pivot p_3(u). The bunch B(u) includes all vertices that have a black edge in this graph to u; in particular, these are the white vertices within the circle drawn in light red (B_1(u)); the light blue vertices encircled by the dark red circle (B_2(u)); and all dark blue vertices (B_3(u)).
Before we explain the query procedure, let us give a (rather informal) analysis of the space
required by our new data structure.
Space Analysis. We have for each vertex u ∈ V and each 1 ≤ i ≤ k that the bunch B_i(u) consists of all vertices in S_i that are closer to u than the closest vertex in S_{i+1}.
Now, order the vertices x_1, x_2, . . . in S_i by their distance from u. Since each vertex in S_i is sampled into S_{i+1} with probability n^{−1/k}, we have that with high probability some vertex x_j with j = O(n^{1/k} log n) is sampled into S_{i+1}. This ensures that |B_i(u)| = Õ(n^{1/k}) with high probability.
Applying this argument for all i, we have that |B(u)| = Õ(k · n^{1/k}) for each u w.h.p., and summing over all u, the total space is Õ(k · n^{1+1/k}), as claimed.
Preprocessing Time. Much like in the k = 2 construction, we can define for each u ∈ S_i \ S_{i+1} the cluster C(u) = {v ∈ V | dist_G(u, v) < dist_G(v, S_{i+1})}. Extending our analysis from before using this new definition, we get construction time Õ(k · m · n^{1/k}).
Query Operation. A straightforward way to query our new data structure for a tuple (u, v) would be to search for the smallest i such that v ∈ B(p_i(u)) and then return dist_G(u, p_i(u)) + dist_G(p_i(u), v). This can be analyzed in the same way as we did for k = 2 to obtain stretch 4k − 3 (with a little trick, one can actually prove that this strategy gives stretch 4k − 5).
However, we aim for a stretch of 2k − 1; the pseudo-code below achieves this guarantee.
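(The listing is a sketch following the dist_k query of [TZ05], matching the proof of Claim 18.2.1 below; the data layout is our own choice: B[x] is the hash table H_x, pivots[i][x] = p_i(x) with p_1(x) = x since S_1 = V, and pivot_dist[i][x] = dist_G(x, p_i(x)).)

def query(u, v, B, pivots, pivot_dist, k):
    # Returns an estimate of dist_G(u, v) with stretch at most 2k - 1.
    w, i = u, 1                 # w = p_1(u) = u
    d_uw = 0                    # maintains d_uw = dist_G(u, w)
    while w not in B[v]:        # at most k - 1 iterations, since S_k ⊆ B(v)
        i += 1
        u, v = v, u             # swap the roles of u and v
        w = pivots[i][u]        # w = p_i(u)
        d_uw = pivot_dist[i][u]
    return d_uw + B[v][w]       # dist_G(u, w) + dist_G(w, v)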
Our main tool in the analysis is the claim below, where we define ∆ = dist_G(u, v).

Claim 18.2.1. After the i-th iteration of the while-loop, we have dist_G(u, w) ≤ i∆.

This implies our theorem: after the i-th iteration we have w = p_{i+1}(u) ∈ S_{i+1}, and since S_k ⊆ B(x) for all vertices x ∈ V, the loop runs for at most k − 1 iterations. Therefore we have for the final w that dist_G(u, w) ≤ (k − 1)∆, and it remains to conclude by the triangle inequality that

dist_G(u, w) + dist_G(w, v) ≤ 2 dist_G(u, w) + ∆ ≤ (2k − 1)∆ = (2k − 1) dist_G(u, v).
Proof of Claim 18.2.1. Let w_i, u_i, v_i denote the variables w, u, v after the i-th while-loop iteration (or right before the loop for w_0, u_0, v_0).
For i = 0, we have that w_0 = u_0; thus, dist_G(u_0, w_0) = 0.
For i ≥ 1, we want to prove that if the i-th while-loop iteration is executed, then dist_G(u_i, w_i) ≤ dist_G(u_{i−1}, w_{i−1}) + ∆ (if it is not executed, then the statement follows trivially).
In order to prove this, observe that by the while-loop condition, we must have had w_{i−1} ∉ B(v_{i−1}); since w_{i−1} = p_i(u_{i−1}) ∈ S_i, this implies dist_G(v_{i−1}, w_{i−1}) ≥ dist_G(v_{i−1}, p_{i+1}(v_{i−1})).
But the while-iteration sets u_i = v_{i−1} and w_i = p_{i+1}(v_{i−1}), and therefore we have

dist_G(u_i, w_i) = dist_G(v_{i−1}, p_{i+1}(v_{i−1})) ≤ dist_G(v_{i−1}, w_{i−1}) = dist_G(v_{i−1}, p_i(u_{i−1}))
≤ dist_G(u_{i−1}, p_i(u_{i−1})) + dist_G(v_{i−1}, u_{i−1}) = dist_G(u_{i−1}, w_{i−1}) + ∆.
Bibliography
[Che14] Shiri Chechik. Approximate distance oracles with constant query time. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 654–663, 2014.
[Che15] Shiri Chechik. Approximate distance oracles with improved bounds. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 1–10, 2015.
[DO19] Jelena Diakonikolas and Lorenzo Orecchia. Conjugate gradients and accel-
erated methods unified: The approximate duality gap view. arXiv preprint
arXiv:1907.00289, 2019.
[DS08] Samuel I. Daitch and Daniel A. Spielman. Faster approximate lossy generalized
flow via interior point algorithms. In Proceedings of the Fortieth Annual ACM
Symposium on Theory of Computing, STOC ’08, pages 451–460, New York,
NY, USA, May 2008. Association for Computing Machinery.
[HS+52] Magnus R. Hestenes, Eduard Stiefel, et al. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
[LS20a] Yang P. Liu and Aaron Sidford. Faster Divergence Maximization for Faster
Maximum Flow. arXiv:2003.08929 [cs, math], April 2020.
[LS20b] Yang P. Liu and Aaron Sidford. Faster energy maximization for faster maximum
flow. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory
of Computing, pages 803–814, 2020.
[Mad13] Aleksander Madry. Navigating central path with electrical flows: From flows to
matchings, and back. In 2013 IEEE 54th Annual Symposium on Foundations
of Computer Science, pages 253–262. IEEE, 2013.
[Nes83] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR, 269:543–547, 1983.
[ST04] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for
graph partitioning, graph sparsification, and solving linear systems. In Pro-
ceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing,
STOC ’04, page 81–90, New York, NY, USA, 2004. Association for Computing
Machinery.
[T+15] Joel A. Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.
[Tar83] Robert Endre Tarjan. Data Structures and Network Algorithms. SIAM, 1983.
[Tro19] Joel A. Tropp. Matrix concentration & computational linear algebra. 2019.
[TZ05] Mikkel Thorup and Uri Zwick. Approximate distance oracles. Journal of the
ACM (JACM), 52(1):1–24, 2005.
[vdBLL+21] Jan van den Brand, Yin Tat Lee, Yang P. Liu, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, and Di Wang. Minimum Cost Flows, MDPs, and ℓ_1-Regression in Nearly Linear Time for Dense Instances. arXiv preprint arXiv:2101.05719, 2021.