
ETH Zurich

Advanced Graph Algorithms and Optimization

Rasmus Kyng & Maximilian Probst Gutenberg

Spring 2022
These notes will be updated throughout the course. They are likely to contain typos, and
they may have mistakes or lack clarity in places. Feedback and comments are welcome.
Please send to kyng@inf.ethz.ch or, even better, submit a pull request at

https://github.com/rjkyng/agao22_script.

We want to thank scribes from the 2020 edition of the course who contributed to these notes:
Hongjie Chen, Meher Chaitanya, Timon Knigge, and Tim Taubner – and we’re grateful to
all the readers who’ve submitted corrections, including Martin Kucera, Alejandro Cassis,
and Luke Volpatti.
An important note: If you're a student browsing these notes to decide whether to take this
course, please note that the current notes are incomplete. We will release Parts III and IV
later in the semester. You can take a look at last year's notes for an impression of what
the rest of the course will look like. Find them here:

https://kyng.inf.ethz.ch/courses/AGAO21/agao21_script.pdf

There will, however, be some changes to the content compared to last year.

Contents

1 Course Introduction
   1.1 Overview
   1.2 Electrical Flows and Voltages - a Graph Problem from Middle School?
   1.3 Convex Optimization
   1.4 More Graph Optimization Problems

I Introduction to Convex Optimization

2 Some Basic Optimization, Convex Geometry, and Linear Algebra
   2.1 Overview
   2.2 Optimization Problems
   2.3 A Characterization of Convex Functions
      2.3.1 First-order Taylor Approximation
      2.3.2 Directional Derivatives
      2.3.3 Lower Bounding Convex Functions with Affine Functions
   2.4 Conditions for Optimality

3 Convexity and Second Derivatives, Gradient Descent and Acceleration
   3.1 A Review of Linear Algebra
   3.2 Characterizations of Convexity and Optimality via Second Derivatives
      3.2.1 A Necessary Condition for Local Extrema
      3.2.2 A sufficient condition for local extrema
      3.2.3 Characterization of convexity
   3.3 Gradient Descent - An Approach to Optimization?
      3.3.1 A Quantitative Bound on Changes in the Gradient
      3.3.2 Analyzing Gradient Descent
   3.4 Accelerated Gradient Descent

II Spectral Graph Theory

4 Introduction to Spectral Graph Theory
   4.1 Recap: Incidence and Adjacency Matrices, the Laplacian Matrix and Electrical Energy
   4.2 Understanding Eigenvalues of the Laplacian
      4.2.1 Test Vector Bounds on λ2 and λn
      4.2.2 Eigenvalue Bounds Beyond Test Vectors
      4.2.3 The Loewner Order, aka. the Positive Semi-Definite Order
      4.2.4 Upper Bounding a Laplacian's λn Using Degrees
      4.2.5 The Loewner Order and Laplacians of Graphs
      4.2.6 The Path Inequality
      4.2.7 Lower Bounding λ2 of a Path Graph
      4.2.8 Laplacian Eigenvalues of the Complete Binary Tree

5 Conductance, Expanders and Cheeger's Inequality
   5.1 Conductance and Expanders
   5.2 A Lower Bound for Conductance via Eigenvalues
   5.3 An Upper Bound for Conductance via Eigenvalues
   5.4 Conclusion

6 Random Walks
   6.1 A Primer on Random Walks
   6.2 Convergence Results for Random Walks
      6.2.1 Making Random Walks Lazy
      6.2.2 Convergence of Lazy Random Walks
      6.2.3 The Rate of Convergence
   6.3 Properties of Random Walks
      6.3.1 Hitting Times
      6.3.2 Commute Time

7 Pseudo-inverses and Effective Resistance
   7.1 What is a (Moore-Penrose) Pseudoinverse?
   7.2 Electrical Flows Again
   7.3 Effective Resistance
      7.3.1 Effective Resistance is a Distance

8 Different Perspectives on Gaussian Elimination
   8.1 An Optimization View of Gaussian Elimination for Laplacians
   8.2 An Additive View of Gaussian Elimination

9 Random Matrix Concentration and Spectral Graph Sparsification
   9.1 Matrix Sampling and Approximation
   9.2 Matrix Concentration
      9.2.1 Matrix Functions
      9.2.2 Monotonicity and Operator Monotonicity
      9.2.3 Some Useful Facts
      9.2.4 Proof of Matrix Bernstein Concentration Bound
   9.3 Spectral Graph Sparsification

10 Solving Laplacian Linear Equations
   10.1 Solving Linear Equations Approximately
   10.2 Preconditioning and Approximate Gaussian Elimination
   10.3 Approximate Gaussian Elimination Algorithm
   10.4 Analyzing Approximate Gaussian Elimination
      10.4.1 Normalization, a.k.a. Isotropic Position
      10.4.2 Martingales
      10.4.3 Martingale Difference Sequence as Edge-Samples
      10.4.4 Stopped Martingales
      10.4.5 Sample Norm Control
      10.4.6 Random Matrix Concentration from Trace Exponentials
      10.4.7 Mean-Exponential Bounds from Variance Bounds
      10.4.8 The Overall Mean-Trace-Exponential Bound

III Combinatorial Graph Algorithms

11 Classical Algorithms for Maximum Flow I
   11.1 Maximum Flow
   11.2 Flow Decomposition
   11.3 Cuts and Minimum Cuts
   11.4 Algorithms for Max flow

12 Classical Algorithms for Maximum Flow II
   12.1 Overview
   12.2 Blocking Flows
   12.3 Dinic's Algorithm
   12.4 Finding Blocking Flows
   12.5 Minimum Cut as a Linear Program

13 Link-Cut Trees
   13.1 Overview
   13.2 Balanced Binary Search Trees: A Recap
   13.3 A Data Structure for Path Graphs
   13.4 Implementing Trees via Paths
   13.5 Fast Blocking Flow via Dynamic Trees

14 The Cut-Matching Game: Expanders via Max Flow
   14.1 Introduction
   14.2 Embedding Graphs into Expanders
   14.3 The Cut-Matching Algorithm
   14.4 Constructing an Expander via Random Walks

IV Further Topics

15 Separating Hyperplanes, Lagrange Multipliers, KKT Conditions, and Convex Duality
   15.1 Overview
   15.2 Separating Hyperplane Theorem
   15.3 Lagrange Multipliers and Duality of Convex Problems
      15.3.1 Karush-Kuhn-Tucker Optimality Conditions for Convex Problems
      15.3.2 Slater's Condition
      15.3.3 The Lagrangian and The Dual Program
      15.3.4 Strong Duality
      15.3.5 KKT Revisited

16 Fenchel Conjugates and Newton's Method
   16.1 Lagrange Multipliers and Convex Duality Recap
   16.2 Fenchel Conjugates
   16.3 Newton's Method
      16.3.1 Warm-up: Quadratic Optimization
      16.3.2 K-stable Hessian
      16.3.3 Linearly Constrained Newton's Method

17 Interior Point Methods for Maximum Flow
   17.1 An Interior Point Method
      17.1.1 A Barrier Function and an Algorithm
      17.1.2 Updates using Divergence
      17.1.3 Understanding the Divergence Objective
      17.1.4 Quadratically Smoothing Divergence and Local Agreement
      17.1.5 Step size for divergence update

18 Distance Oracles
   18.1 Warm-up: A Distance Oracle for k = 2
   18.2 Distance Oracles for any k ≥ 2
Chapter 1

Course Introduction

1.1 Overview

This course will take us quite deep into modern approaches to graph algorithms using convex
optimization techniques. By studying convex optimization through the lens of graph algo-
rithms, we’ll try to develop an understanding of fundamental phenomena in optimization.
Much of our time will be devoted to flow problems on graphs. We will not only be studying
these problems for their own sake, but also because they often provide a useful setting for
thinking more broadly about optimization.
The course will cover some traditional discrete approaches to various graph problems, espe-
cially flow problems, and then contrast these approaches with modern, asymptotically faster
methods based on combining convex optimization with spectral and combinatorial graph
theory.

1.2 Electrical Flows and Voltages - a Graph Problem from Middle School?

We will dive right into graph problems by considering how electrical current moves through
a network of resistors.
First, let us recall some middle school physics. If some of these things don't make sense to
you, don't worry: in less than a paragraph from here, we'll be back to safely doing math.
Recall that a typical battery that you buy from Migros has two endpoints, and produces what
is called a voltage difference between these endpoints.

One end of the battery will have a positive charge (I think that means an excess of positrons[1]),
and the other a negative charge. If we connect the two endpoints with a wire, then a current
will flow from one end of the battery to the other in an attempt to even out this imbalance
of charge.

Figure 1.1: A 9 volt battery with a wire attached.

We can also imagine a kind of battery that tries to send a certain amount of current through
the wires between its endpoints, e.g. 1 unit of charge per unit of time. This will be a little more
convenient to work with, so let us focus on that case.

Figure 1.2: A 1 ampere battery with a wire attached.

A resistor is a piece of wire that connects two points u and v, and is completely described
by a single number r called its resistance.
[1] I'm joking, of course! Try Wikipedia if you want to know more. However, you will not need it for this class.

If the voltage difference between the endpoints of the resistor is x, and the resistance is r
then this will create a flow of charge per unit of time of f = x/r. This is called Ohm’s Law.

Figure 1.3: Ohm’s Law for a resistor with resistance r = 1.

Suppose we set up a bunch of wires that route electricity from our current source s to our
current sink t in some pattern:

Figure 1.4: A path of two resistors.

We have one unit of charge flowing out of s per unit of time, and one unit coming into
t. Because charge is conserved, the current flowing into any other point u must equal the
amount flowing out of it. This is called Kirchhoff's Current Law.
To send one unit of current from s to t, we must be sending it first from s to u and then
from u to t. So the current on edge (s, u) is 1 and the current on (u, t) is 1. By Ohm’s Law,
the voltage difference must also be 1 across each of the two wires. Thus if the voltage is x
at s, it must be x + 1 at u and x + 2 at t. What is x? It turns out it doesn’t matter: We
only care about the differences. So let us set x = 0.

Figure 1.5: A path of two resistors.

Let us try one more example:

Figure 1.6: A network with three resistors.

How much flow will go directly from s to t and how much via u?
Well, we know what the net current flowing into and out of each vertex must be, and we can
use this to set up some equations. Let us say the voltage at s is x_s, at u is x_u, and at t is x_t.

• Net current at s: −1 = (x_s − x_t) + (x_s − x_u)

• Net current at u: 0 = (x_u − x_s) + (x_u − x_t)

• Net current at t: 1 = (x_t − x_s) + (x_t − x_u)

The following is a solution: x_s = 0, x_u = 1/3, x_t = 2/3. And as before, we can shift all the
voltages by some constant x and get another solution x_s = x + 0, x_u = x + 1/3, x_t = x + 2/3.
You might want to convince yourself that these are the only solutions.

Electrical flows in general graphs. Do we know enough to calculate the electrical flow
in some other network of resistors? To answer this, let us think about the network as a
graph. Consider an undirected graph G = (V, E) with |V| = n vertices and |E| = m edges,
and let us assume G is connected. Let’s associate a resistance r (e) > 0 with every edge
e ∈ E.

To keep track of the direction of the flow on each edge, it will be useful to assign an arbitrary
direction to every edge. So let’s do that, but remember that this is just a bookkeeping tool
that helps us track where flow is going.
A flow in the graph is a vector f ∈ R^E. The net flow of f at a vertex u ∈ V is defined as
∑_{v→u} f(v, u) − ∑_{u→v} f(u, v).

We say a flow routes the demands d ∈ R^V if the net flow at every vertex v is d(v).
We can assign a voltage to every vertex, giving a vector x ∈ R^V. Ohm's Law says that the
electrical flow induced by these voltages will be f(u, v) = (1/r(u, v)) (x(u) − x(v)).
Say we want to route one unit of current from vertex s ∈ V to vertex t ∈ V. As before, we
can write an equation for every vertex saying that the voltage differences must produce the
desired net current:

• Net current at s: −1 = ∑_{(s,v)} (1/r(s, v)) (x(s) − x(v))

• Net current at u ∈ V \ {s, t}: 0 = ∑_{(u,v)} (1/r(u, v)) (x(u) − x(v))

• Net current at t: 1 = ∑_{(t,v)} (1/r(t, v)) (x(t) − x(v))

This gives us n constraints, exactly as many as we have voltage variables. However, we have
to be a little careful when trying to conclude that a solution exists, yielding voltages x that
induce an electrical flow routing the desired demand.
You will prove in the exercises (Week 1, Exercise 3) that a solution x exists. The proof
requires two important observations: Firstly that the graph is connected, and secondly that
summed over all vertices, the net demand is zero, i.e. as much flow is coming into the network
as is leaving it.

The incidence matrix and the Laplacian matrix. To have a more compact notation
for net flow constraints, we also introduce the edge-vertex incidence matrix of the graph,
B ∈ R^{V×E}, defined by

B(v, e) = 1 if e = (u, v),  −1 if e = (v, u),  and 0 otherwise.

Now we can express the net flow constraint that f routes d by

Bf = d .

This is also called a conservation constraint. In our examples so far, we have d (s) = −1,
d (t) = 1 and d (u) = 0 for all u ∈ V \ {s, t}.
If we let R = diag_{e∈E} r(e), then Ohm's Law tells us that f = R^{-1} B^⊤ x. Putting these
observations together, we have B R^{-1} B^⊤ x = d. The voltages x that induce f must solve
this system of linear equations, and we can use that to compute both x and f. It is exactly
the same linear equation as the one we considered earlier. We can show that for a connected
graph, a solution x exists if and only if the net flow into the graph equals the net flow out,
which we can express as ∑_v d(v) = 0, i.e. 1^⊤ d = 0. You will show this as part of Exercise 3.
This also implies that an electrical flow routing d exists if and only if 1^⊤ d = 0.
The matrix B R^{-1} B^⊤ is called the Laplacian of the graph and is usually denoted by L.

An optimization problem in disguise. So far, we have looked at electrical voltages


and flows as arising from a set of linear equations – and it might not be apparent that this
has anything to do with optimization. But transporting current through a resistor requires
energy, which will be dissipated as heat by the resistor (i.e. it will get hot!). If we send a
current of f across a resistor with a potential drop of x, then the amount of energy spent
per unit of time by the resistor will be f · x. This is called Joule’s Law. Applying Ohm’s
law to a resistor with resistance r, we can also express this energy per unit of time as
f · x = x²/r = r · f². Since we aren't bothering with units, we will even forget about time,
and refer to these quantities as “energy”, even though a physicist would call them “power”.

Figure 1.7: Energy as a function of flow in a resistor with resistance r = 1.

Now, another interesting question would seem to be: If we want to find a flow routing a
certain demand d, how should the flow behave in order to minimize the electrical energy
spent routing the flow? The electrical energy of a flow vector f is defined as E(f) = ∑_e r(e) f(e)².
We can phrase this as an optimization problem:

min_{f ∈ R^E}  E(f)
s.t.  Bf = d.

We call this problem electrical energy-minimizing flow. As we will prove later, the flow f ∗
that minimizes the electrical energy among all flows that satisfy Bf = d is precisely the
electrical flow.
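We won't prove this here, but a quick numerical experiment (our own sketch, reusing the triangle example above) is consistent with the claim: perturbing the electrical flow by a circulation keeps Bf = d but strictly increases the energy.

import numpy as np

def energy(f, r):
    # Electrical energy E(f) = sum_e r(e) f(e)^2
    return np.sum(r * f ** 2)

r = np.ones(3)
f_star = np.array([1/3, 1/3, 2/3])   # electrical flow on edges (s,u), (u,t), (s,t)

# A circulation: one unit around the cycle s -> u -> t -> s has zero net
# flow at every vertex, so f_star + eps * circ still routes the demand d.
circ = np.array([1., 1., -1.])
for eps in [0.5, 0.1, -0.2]:
    extra = energy(f_star + eps * circ, r) - energy(f_star, r)
    print(extra)   # equals 3 * eps^2 > 0: every perturbation costs energy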

A pair of problems. What about our voltages, can we also get them from some opti-
mization problem? Well, we can work backwards from the fact that our voltages solve the
equation Lx = d. Consider the function c(x) = ½ x^⊤ L x − x^⊤ d. We should ask ourselves
some questions about this function c : R^V → R. Is it continuous and continuously differ-
entiable? The answer to this is yes, and that is not hard to see. Does the function have a
minimum? This is maybe not immediately clear, but the minimum does indeed exist.
When c is minimized, the derivative of c(x) with respect to each coordinate of x must be
zero. This condition yields exactly the system of linear equations Lx = d. You will confirm
this in Exercise 4 of the first exercise sheet.
Based on our derivative condition for the optimum, we can also express the electrical voltages
as the solution to an optimization problem, namely

min_{x ∈ R^V}  c(x)

As you are probably aware, having the derivative of each coordinate equal zero is not a
sufficient condition for being at the optimum of a function[2].
It is also interesting to know whether all solutions to Lx = d are in fact minimizers of c.
The answer is yes, and we will see some very general tools for proving statements like this in
Chapter 2.
Altogether, we can see that routing electrical current through a network of resistors leads to
a pair of optimization problems, and that their solutions, let's call them f* and x*, are
related, in our case through the equation f* = R^{-1} B^⊤ x* (Ohm's Law). But why and
how are these two optimization problems related?
Instead of minimizing c(x), we can equivalently think about maximizing −c(x), which gives
the following optimization problem: max_{x ∈ R^V} −c(x). In fact, as you will show in the exercises
for Week 1, we have E(f*) = −c(x*), so the minimum electrical energy is exactly the
maximum value of −c(x). More generally, for any flow f that routes d and any voltages x,
we have E(f) ≥ −c(x). So, for any x, the value of −c(x) is a lower bound on the minimum
energy E(f*).
This turns out to be an instance of a much broader phenomenon, known as Lagrangian
duality, which allows us to learn a lot about many optimization problems by studying a
pair of related problems: a minimization problem, and a maximization problem that
gives lower bounds on the optimal value of the minimization problem.

[2] Consider the function in one variable c(x) = x³.

Solving Lx = d . Given a graph G with resistances for the edges, and some net flow vector
d , how quickly can we compute x ? Broadly speaking, there are two very different families
of algorithms we could use to try to solve this problem.
Either, we could solve the linear equation using something like Gaussian Elimination to
compute an exact solution.
Alternatively, we could start with a guess at a solution, e.g. x₀ = 0, and then we could try
to make a change to x₀ to reach a new point x₁ with a lower value of c(x) = ½ x^⊤ L x − x^⊤ d,
i.e. c(x₁) < c(x₀). If we repeat a process like that for enough steps, say t, hopefully we
eventually reach x_t with c(x_t) close to c(x*), where x* is a minimizer of c(x) and hence
Lx* = d. Now, we also need to make sure that c(x_t) ≈ c(x*) implies that Lx_t ≈ d in some
useful sense.
One of the most basic algorithms in this framework of "guess and adjust" is called Gradient
Descent, which we will study in Week 2. The rough idea is the following: if we make
a very small step from x to x + δ, then a multivariate Taylor expansion suggests that

c(x + δ) − c(x) ≈ ∑_{v∈V} δ(v) · ∂c(x)/∂x(v).

If we are dealing with a smooth convex function, this quantity is negative if we let
δ(v) = −ε · ∂c(x)/∂x(v) for some small enough ε > 0 so that the approximation holds well. So we should be able to
make progress by taking a small step in this direction. That’s Gradient Descent! The name
comes from the vector of partial derivatives, which is called the gradient.
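As a toy illustration (our own sketch, not an algorithm from the notes), here is this gradient step applied to c(x) = ½ x^⊤ L x − x^⊤ d on the triangle example, with a small fixed step size ε:

import numpy as np

L = np.array([[ 2., -1., -1.],      # Laplacian of the triangle, unit resistances
              [-1.,  2., -1.],
              [-1., -1.,  2.]])
d = np.array([-1., 0., 1.])

x = np.zeros(3)                     # initial guess x_0 = 0
eps = 0.1                           # small fixed step size
for _ in range(200):
    grad = L @ x - d                # gradient of c(x) = 0.5 x^T L x - x^T d
    x = x - eps * grad              # step in the direction -grad

print(x)          # close to a minimizer, here (-1/3, 0, 1/3)
print(L @ x - d)  # residual Lx - d is nearly zero, so x nearly solves Lx = d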
As we will see later in this course, understanding electrical problems from an optimization
perspective is crucial to develop fast algorithms for computing electrical flows and voltages,
but to do very well, we also need to borrow some ideas from Gaussian Elimination.
What running times do different approaches get?

1. Using Gaussian Elimination, we can find x s.t. Lx = d in O(n³) time, and with
asymptotically faster algorithms based on matrix multiplication, we can bring this
down to roughly O(n^{2.372}).

2. Meanwhile, Gradient Descent will get a running time of O(n³m) or so – at least this is
what a simple analysis suggests.

3. However, we can do much better: By combining ideas from both algorithms, and a bit
more, we can get x up to very high accuracy in time O(m log^c n) where c is some small
constant.

1.3 Convex Optimization

Recall our plot in Figure 1.7 of the energy required to route a flow f across a resistor with
resistance r, which was E(f) = r · f². We see that the function has a special structure: the
graph of the function sits below the line joining any two points (f, E(f)) and (g, E(g)). A
function E : R → R that has this property is said to be convex.
Figure 1.8 shows the energy as a function of flow, along with two points (f, E(f )) and
(g, E(g)). We see the function sits below the line segment between these points.

Figure 1.8: Energy as a function of flow in a resistor with resistance r = 1. The function
is convex.

We can also interpret this condition as saying that for all θ ∈ [0, 1]

E(θf + (1 − θ)g) ≤ θE(f ) + (1 − θ)E(g).

This immediately generalizes to functions E : Rm → R.


A convex set is a set S ⊆ R^m s.t. if f, g ∈ S then for all θ ∈ [0, 1] we have θf + (1 − θ)g ∈ S.
Figure 1.9 shows some examples of sets that are and aren’t convex.
Convex functions and convex sets are central to optimization, because for most problems of
minimizing a convex function over a convex set, we can develop fast algorithms[3].
So why convex functions and convex sets? One important reason is that for a convex function
defined over a convex feasible set, any local minimum is also a global minimum, and this
fact makes searching for an optimal solution computationally easier. In fact, this is closely
related to why Gradient Descent works well on many convex functions.
Notice that the set {f : Bf = d} is convex, i.e. the set of all flows that route a fixed demand
d is convex. It is also easy to verify that E(f) = ∑_e r(e) f(e)² is a convex function, and
hence finding an electrical flow is an instance of convex minimization.

[3] There are some convex optimization problems that are NP-hard. That said, polynomial time algorithms exist for almost any convex problem you can come up with. The most general polynomial time algorithm for convex optimization is probably the Ellipsoid Method.

Figure 1.9: A depiction of convex and non-convex sets. The sets A and B are convex since
the straight line between any two points inside them is also in the set. The set C is not
convex.

1.4 More Graph Optimization Problems


Maximum flow. Again, let G = (V, E) be an undirected, connected graph with n vertices
and m edges. Suppose we want to find a flow f ∈ R^E that routes d, but instead of trying
to minimize electrical energy, we try to pick an f that minimizes the largest amount of flow
on any edge, i.e. max_e |f(e)| – which we also denote by ‖f‖∞. We can write this problem as

min_{f ∈ R^E}  ‖f‖∞
s.t.  Bf = d

This problem is known as the Minimum Congested Flow Problem[4]. It is equivalent to the
more famous Maximum Flow Problem.
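To see the optimization formulation concretely, here is a hedged sketch (ours; it assumes scipy is available) that solves the Minimum Congested Flow Problem on the triangle example as a linear program, introducing an auxiliary congestion variable t with −t ≤ f(e) ≤ t:

import numpy as np
from scipy.optimize import linprog

B = np.array([[-1.,  0., -1.],     # triangle again: edges (s,u), (u,t), (s,t)
              [ 1., -1.,  0.],
              [ 0.,  1.,  1.]])
d = np.array([-1., 0., 1.])
m = B.shape[1]

obj = np.r_[np.zeros(m), 1.0]                       # variables (f, t); minimize t
A_eq = np.c_[B, np.zeros(3)]                        # Bf = d
A_ub = np.r_[np.c_[ np.eye(m), -np.ones((m, 1))],   #  f(e) - t <= 0
             np.c_[-np.eye(m), -np.ones((m, 1))]]   # -f(e) - t <= 0
res = linprog(obj, A_ub=A_ub, b_ub=np.zeros(2 * m), A_eq=A_eq, b_eq=d,
              bounds=[(None, None)] * m + [(0, None)])

print(res.x[:m], res.x[m])   # e.g. f = (1/2, 1/2, 1/2) with congestion 1/2,
                             # beating the electrical flow's congestion of 2/3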
The behavior of this kind of flow is very different from electrical flow. Consider the question
of whether a certain demand can be routed with ‖f‖∞ ≤ 1. Imagine sending goods from a source
s to a destination t using a network of train lines that all have the same capacity, and asking
whether the network is able to route the goods at the rate you want: This boils down to
whether a routing exists with ‖f‖∞ ≤ 1, if we set it up right.
We have a very fast, convex optimization-based algorithm for Minimum Congested Flow: In
ε^{-1} m log^{O(1)} n time, we can find a flow f̃ s.t. B f̃ = d and ‖f̃‖∞ ≤ (1 + ε) ‖f*‖∞, where f*
is an optimal solution, i.e. an actual minimum congestion flow routing d.
But what if we want ε to be very small, e.g. 1/m? Then this running time isn't so good any-
more. But, in this case, we can use other algorithms that find the flow f* exactly. Unfortunately,
these algorithms take time roughly m^{4/3+o(1)}.

[4] This version is called undirected, because the graph is undirected, and uncapacitated because we are aiming for the same bound on the flow on all edges.
Just as the electrical flow problem had a dual voltage problem, so maximum flow has a dual
problem, which is known as the s-t minimum cut problem.

Maximum flow, with directions and capacities. We can make the maximum flow
problem harder by introducing directed edges: To do so, we allow edges to exist in both
directions between vertices, and we require that the flow on a directed edge is always non-
negative. So now G = (V, E) is a directed graph. We can also make the problem harder
by introducing capacities. We define a capacity vector c ∈ R^E_{≥0} and try to minimize
‖C^{-1} f‖∞, where C = diag_{e∈E} c(e). Then our problem becomes

min_{f ∈ R^E}  ‖C^{-1} f‖∞
s.t.  Bf = d
      f ≥ 0.

For this capacitated, directed maximum flow problem, our best algorithms run in about
O(m√n) time in sparse graphs and O(m^{1.483}) in dense graphs[5], even if we are willing to
accept a fairly low accuracy solution. If the capacities are allowed to be exponentially large,
the best running time we can get is O(mn). For this problem, we do not yet know how to
improve over classical combinatorial algorithms using convex optimization.

Multi-commodity flow. We can make the problem even harder still, by simultaneously trying to
route two types of flow (imagine pipes with Coke and Pepsi). Our problem now looks like

min_{f₁, f₂ ∈ R^E}  ‖C^{-1}(f₁ + f₂)‖∞
s.t.  Bf₁ = d₁
      Bf₂ = d₂
      f₁, f₂ ≥ 0.

Solving this problem to high accuracy is essentially as hard as solving a general linear pro-
gram! We will see later in the course how to make this statement precise.
If we additionally require in the above problem that our flows must be integer valued, i.e.
f₁, f₂ ∈ N₀^E, then the problem becomes NP-complete.

[5] Provided the capacities are integers satisfying a condition like c ≤ n^{100} · 1.

Random walks in a graph. Google famously uses[6] the PageRank problem to help decide
how to rank their search results. This problem essentially boils down to computing the stable
distribution of a random walk on a graph. Suppose G = (V, E) is a directed graph where each
outgoing edge (v, u), which we will define as going from u to v, has a transition probability
p(v,u) > 0 s.t. ∑_{z←u} p(z,u) = 1. We can take a step of a random walk on the vertex set by
starting at some vertex u₀ = u, and then randomly picking one of the outgoing edges (v, u) with
probability p(v,u) and moving to the chosen vertex u₁ = v. Repeating this procedure, to take
a step from the next vertex u₁, gives us a random walk in the graph, a sequence of vertices
u₀, u₁, u₂, ..., u_k.
We let P ∈ R^{V×V} be the matrix of transition probabilities, given by

P(v, u) = p(v,u) if (u, v) ∈ E, and 0 otherwise.

Any probability distribution over the vertices can be specified by a vector p ∈ R^V where
p ≥ 0 and ∑_v p(v) = 1. We say that a probability distribution π on the vertices is a stable
distribution of the random walk if π = Pπ. A strongly connected graph always has exactly
one stable distribution.
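For intuition, here is a tiny sketch (ours, with made-up transition probabilities on a strongly connected 3-vertex graph) that approximates π by simply iterating p ← Pp, known as power iteration:

import numpy as np

# Column u holds the probabilities of stepping from u to each v, so each
# column sums to one and one step of the walk maps p to P p.
P = np.array([[0.0, 0.5, 0.3],
              [0.6, 0.0, 0.7],
              [0.4, 0.5, 0.0]])

p = np.ones(3) / 3            # start from the uniform distribution
for _ in range(200):
    p = P @ p                 # take one step of the random walk

print(p)          # approximates the stable distribution pi
print(P @ p - p)  # residual of pi = P pi is nearly zero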
How quickly can we compute the stable distribution of a general random walk? Under some
mild conditions on the stable distribution[7], we can find a high accuracy approximation of π
in time O(m log^c n) for some constant c.
This problem does not easily fit in a framework of convex optimization, but nonetheless, our
fastest algorithms for it use ideas from convex optimization.

Topics in this Course


In this course, we will try to address the following questions.

1. What are the fundamental tools of fast convex optimization?

2. What are some problems we can solve quickly on graphs using optimization?

3. What can graphs teach us about convex optimization?

4. What algorithm design techniques are good for getting algorithms that quickly find
a crude approximate solution? And what techniques are best when we need to get a
highly accurate answer?

5. What is special about flow problems?

[6] At least they did at some point.
[7] Roughly something like max_v 1/π(v) ≤ n^{100}.

Part I

Introduction to Convex Optimization

Chapter 2

Some Basic Optimization, Convex Geometry, and Linear Algebra

2.1 Overview
In this chapter, we will

1. Start with an overview (i.e. this list).

2. Learn some basic terminology and facts about optimization.

3. Recall our definition of convex functions and see how convex functions can also be
understood in terms of a characterization based on first derivatives.

4. See how the first derivatives of a convex function can certify that we are at a global
minimum.

2.2 Optimization Problems


Focusing for now on optimization over x ∈ Rn , we usually write optimization problems as:

min (or max)_{x ∈ R^n}  f(x)
s.t.  g₁(x) ≤ b₁
      ⋮
      g_m(x) ≤ b_m

where {g_i(x)}_{i=1}^m encode the constraints. For example, in the following optimization problem
from the previous chapter

min_{f ∈ R^E}  ∑_e r(e) f(e)²
s.t.  Bf = d

we have the constraint Bf = d . Notice that we can rewrite this constraint as Bf ≤ d and
−Bf ≤ −d to match the above setting. The set of points which respect the constraints is
called the feasible set.
Definition 2.2.1. For a given optimization problem the set F = {x ∈ R^n : g_i(x) ≤
b_i, ∀i ∈ [m]} is called the feasible set. A point x ∈ F is called a feasible point, and
a point x′ ∉ F is called an infeasible point.
Ideally, we would like to find optimal solutions for the optimization problems we consider.
Let’s define what we mean exactly.
Definition 2.2.2. For a maximization problem, x* is called an optimal solution if
f(x*) ≥ f(x), ∀x ∈ F. Similarly, for a minimization problem, x* is an optimal solution
if f(x*) ≤ f(x), ∀x ∈ F.
What happens if there are no feasible points? In this case, an optimal solution cannot exist,
and we say the problem is infeasible.
Definition 2.2.3. If F = ∅ we say that the optimization problem is infeasible. If
F ≠ ∅ we say the optimization problem is feasible.

Figure 2.1

Consider three examples depicted in Figure 2.1:

(i) F = [a, b]

(ii) F = [a, b)

(iii) F = [a, ∞)

In the first example, the minimum of the function is attained at b. In the second case, the
region is open and therefore there is no minimum function value: for every point we
choose, there will always be another point with a smaller function value. Lastly, in the
third example, the region is unbounded and the function is decreasing, thus again there will
always be another point with a smaller function value.

Sufficient Condition for Optimality. The following theorem, which is a fundamental


theorem in real analysis, gives us a sufficient (though not necessary) condition for optimality.

Theorem (Extreme Value Theorem). Let f : Rn → R be a continuous function and F ⊆ Rn


be nonempty, bounded, and closed. Then, the optimization problem min {f(x) : x ∈ F} has
an optimal solution.

2.3 A Characterization of Convex Functions

Recall the definitions of convex sets and convex functions that we introduced in Chapter 1:
Definition 2.3.1. A set S ⊆ R^n is called a convex set if it contains the line segment
between any two of its points, i.e. for any x, y ∈ S we have that θx + (1 − θ)y ∈ S for any θ ∈ [0, 1].

Definition 2.3.2. For a convex set S ⊆ R^n, we say that a function f : S → R is
convex on S if for any two points x, y ∈ S and any θ ∈ [0, 1] we have that:

f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y).

Figure 2.2: This plot shows the function f(x, y) = xy. For any fixed y₀, the function
h(x) = f(x, y₀) = x y₀ is linear in x, and so is a convex function in x. But is f convex?

We will first give an important characterization of convex functions. To do so, we need to
characterize multivariate functions via their Taylor expansion.

Notation for this section. In the rest of this section, we frequently consider a multivari-
ate function f whose domain is a set S ⊆ Rn , which we will require to be open. When we
additionally require that S is convex, we will specify this. Note that S = Rn is both open and
convex and it suffices to keep this case in mind. Things sometimes get more complicated if
S is not open, e.g. when the domain of f has a boundary. We will leave those complications
for another time.

2.3.1 First-order Taylor Approximation

Definition 2.3.3. The gradient of a function f : S → R at a point x ∈ S, denoted
∇f(x), is:

∇f(x) = ( ∂f(x)/∂x(1), ..., ∂f(x)/∂x(n) )^⊤

First-order Taylor expansion. For a function f : R → R of a single variable, differen-
tiable at x ∈ R,

f(x + δ) = f(x) + f′(x)δ + o(|δ|)

where by definition:

lim_{δ→0} o(|δ|)/|δ| = 0.

Similarly, a multivariate function f : S → R is said to be (Fréchet) differentiable at x ∈ S
when there exists ∇f(x) ∈ R^n s.t.

lim_{δ→0} |f(x + δ) − f(x) − ∇f(x)^⊤ δ| / ‖δ‖₂ = 0.

Note that this is equivalent to saying that f(x + δ) = f(x) + ∇f(x)^⊤ δ + o(‖δ‖₂).
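As a quick numerical sanity check (our own toy function, not from the notes), the error of the first-order approximation shrinks faster than ‖δ‖₂ as δ → 0:

import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad_f(x):
    # gradient computed by hand from the definition of f above
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, 2.0])
rng = np.random.default_rng(0)
direction = rng.standard_normal(2)
for scale in [1e-1, 1e-2, 1e-3]:
    delta = scale * direction
    err = f(x + delta) - f(x) - grad_f(x) @ delta
    # err / ||delta|| tends to zero, as the o(||delta||_2) term requires
    print(np.linalg.norm(delta), abs(err) / np.linalg.norm(delta))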


We say that f is continuously differentiable on a set S ⊆ R^n if it is differentiable and in
addition the gradient is continuous on S. A differentiable convex function whose domain is
an open convex set S ⊆ R^n is always continuously differentiable[1].
Remark. In this course, we will generally err on the side of being informal about functional
analysis when we can afford to, and we will not worry too much about the details of different
notions of differentiability (e.g. Fréchet and Gateaux differentiability), except when it turns
out to be important.
Theorem 2.3.4 (Taylor's Theorem, multivariate first-order remainder form). If f : S → R
is continuously differentiable over [x, y], then for some z ∈ [x, y],

f(y) = f(x) + ∇f(z)^⊤ (y − x).

[1] See p. 248, Corollary 25.5.1 in Convex Analysis by Rockafellar (my version is the Second print, 1972). Rockafellar's corollary concerns finite convex functions, because he otherwise allows convex functions that may take on the values ±∞.

This theorem is useful for showing that the function f can be approximated by the affine
function y ↦ f(x) + ∇f(x)^⊤(y − x) when y is "close to" x in some sense.

Figure 2.3: The convex function f(y) sits above the linear function in y given by
f(x) + ∇f(x)^⊤(y − x).

2.3.2 Directional Derivatives


Definition 2.3.5. Let f : S → R be a function differentiable at x ∈ S and let us
consider d ∈ R^n. We define the derivative of f at x in direction d as:

Df(x)[d] = lim_{λ→0} (f(x + λd) − f(x)) / λ

Proposition 2.3.6. Df(x)[d] = ∇f(x)^⊤ d.

Proof. Using the first-order expansion of f at x:

f(x + λd) = f(x) + ∇f(x)^⊤ (λd) + o(‖λd‖₂)

hence, dividing by λ (and noticing that ‖λd‖₂ = λ ‖d‖₂ for λ > 0):

(f(x + λd) − f(x)) / λ = ∇f(x)^⊤ d + o(λ ‖d‖₂)/λ

Letting λ go to 0 concludes the proof.

2.3.3 Lower Bounding Convex Functions with Affine Functions

In order to prove the characterization of convex functions in the next section we will need
the following lemma. This lemma says that any differentiable convex function can be lower
bounded by an affine function.

Theorem 2.3.7. Let S be an open convex subset of R^n, and let f : S → R be a differentiable
function. Then, f is convex if and only if for any x, y ∈ S we have that f(y) ≥ f(x) +
∇f(x)^⊤(y − x).

Figure 2.4: The convex function f(y) sits above the linear function in y given by
f(x) + ∇f(x)^⊤(y − x).

Proof. [⟹] Assume f is convex. Then for all x, y ∈ S and θ ∈ [0, 1], if we let z =
θy + (1 − θ)x, we have that

f(z) = f((1 − θ)x + θy) ≤ (1 − θ)f(x) + θf(y)

and therefore by subtracting f(x) from both sides we get:

f(x + θ(y − x)) − f(x) ≤ θf(y) + (1 − θ)f(x) − f(x)
                       = θf(y) − θf(x).

Thus we get that (for θ > 0):

(f(x + θ(y − x)) − f(x)) / θ ≤ f(y) − f(x)

Applying Proposition 2.3.6 with d = y − x, we have that:

∇f(x)^⊤(y − x) = lim_{θ→0⁺} (f(x + θ(y − x)) − f(x)) / θ ≤ f(y) − f(x).

[⟸] Assume that f(y) ≥ f(x) + ∇f(x)^⊤(y − x) for all x, y ∈ S; we will show that f is
convex. Let x, y ∈ S and z = θy + (1 − θ)x. By our assumption we have that:

f(y) ≥ f(z) + ∇f(z)^⊤(y − z)   (2.1)
f(x) ≥ f(z) + ∇f(z)^⊤(x − z)   (2.2)

Observe that y − z = (1 − θ)(y − x) and x − z = θ(x − y). Thus adding θ times (2.1) to
(1 − θ) times (2.2) gives cancellation of the vectors multiplying the gradient, yielding

θf(y) + (1 − θ)f(x) ≥ f(z) + ∇f(z)^⊤ 0 = f(θy + (1 − θ)x)

This is exactly the definition of convexity.
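A short numerical illustration of Theorem 2.3.7 (our own toy example, not from the notes): for the convex function f(x) = ‖x‖₂², the affine function f(x) + ∇f(x)^⊤(y − x) never exceeds f(y); here the gap works out to exactly ‖y − x‖₂².

import numpy as np

def f(x):
    return np.dot(x, x)          # f(x) = ||x||_2^2, a convex function

def grad_f(x):
    return 2 * x

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
for _ in range(5):
    y = rng.standard_normal(3)
    affine_lower_bound = f(x) + grad_f(x) @ (y - x)
    print(f(y) - affine_lower_bound)   # always >= 0; equals ||y - x||_2^2 here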

2.4 Conditions for Optimality
We now want to find necessary and sufficient conditions for local optimality.
Definition 2.4.1. Consider a differentiable function f : S → R. A point x ∈ S at
which ∇f (x ) = 0 is called a stationary point.

Proposition 2.4.2. If x is a local extremum of a differentiable function f : S → R, then
∇f(x) = 0.

Proof. Let us assume that x is a local minimum for f. Then for all d ∈ R^n, f(x) ≤ f(x + λd)
for λ small enough. Hence:

0 ≤ f(x + λd) − f(x) = λ∇f(x)^⊤ d + o(‖λd‖₂)

Dividing by λ > 0 and letting λ → 0⁺, we obtain 0 ≤ ∇f(x)^⊤ d. But, taking d = −∇f(x),
we get 0 ≤ −‖∇f(x)‖₂². This implies that ∇f(x) = 0.
The case where x is a local maximum can be dealt with similarly.

Remark 2.4.3. For this proposition to hold, it is important that S is open.

For convex functions, however, it turns out that a stationary point is necessarily a global
minimum of the function. Together with the proposition above, this says that for a
convex function on R^n, a point is optimal if and only if it is stationary.

Proposition 2.4.4. Let S ⊆ Rn be an open convex set and let f : S → R be a differentiable


and convex function. If x is a stationary point then x is a global minimum.

Proof. From Theorem 2.3.7 we know that for all x, y ∈ S: f(y) ≥ f(x) + ∇f(x)^⊤(y − x).
Since ∇f(x) = 0, this implies that f(y) ≥ f(x). As this holds for any y ∈ S, x is a global
minimum.

Chapter 3

Convexity and Second Derivatives, Gradient Descent and Acceleration

Notation for this chapter. In this chapter, we sometimes consider a multivariate func-
tion f whose domain is a set S ⊆ Rn , which we will require to be open. When we additionally
require that S is convex, we will specify this. Note that S = Rn is both open and convex
and it suffices to keep this case in mind. Things sometimes get more complicated if S is
not open, e.g. when the domain of f has a boundary. We will leave those complications for
another time.

3.1 A Review of Linear Algebra


Semi-definiteness of a matrix. The following classification of symmetric matrices will
be useful.
Definition 3.1.1. Let A be a symmetric matrix in R^{n×n}. We say that A is:

1. positive definite iff x^⊤ A x > 0 for all x ∈ R^n \ {0};

2. positive semidefinite iff x^⊤ A x ≥ 0 for all x ∈ R^n;

3. indefinite if neither A nor −A is positive semidefinite.

Example: indefinite matrix. Consider the following matrix A:

A := [  4  −1 ]
     [ −1  −2 ]

For x = (1, 0)^⊤, we have x^⊤ A x = 4 > 0. For x = (0, 1)^⊤ we have x^⊤ A x = −2 < 0. A is
therefore indefinite.
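A quick numpy check of this example (our own sketch), anticipating the eigenvalue characterization in the theorem below:

import numpy as np

A = np.array([[ 4., -1.],
              [-1., -2.]])
print(np.linalg.eigvalsh(A))   # approximately [-2.16, 4.16]: one negative and
                               # one positive eigenvalue, so A is indefinite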

The following theorem gives a useful characterization of (semi)definite matrices.

Theorem 3.1.2. Let A be a symmetric matrix in R^{n×n}.

1. A is positive definite iff all its eigenvalues are positive;

2. A is positive semidefinite iff all its eigenvalues are non-negative.

In order to prove this theorem, let us first recall the Spectral Theorem for symmetric matrices.

Theorem 3.1.3 (The Spectral Theorem for Symmetric Matrices). For all symmetric A ∈
R^{n×n}, there exist V ∈ R^{n×n} and a diagonal matrix Λ ∈ R^{n×n} s.t.

1. A = V Λ V^⊤.

2. V^⊤ V = I (the n × n identity matrix), i.e. the columns of V form an orthonormal
basis. Furthermore, the ith column v_i is an eigenvector of A with eigenvalue λ_i(A), the
ith eigenvalue of A.

3. Λ(i, i) = λ_i(A).

Using the Spectral Theorem, we can show the following result:

Theorem 3.1.4 (The Courant-Fischer Theorem). Let A be a symmetric matrix in R^{n×n},
with eigenvalues λ₁ ≤ λ₂ ≤ ... ≤ λ_n. Then

1. λ_i = min_{subspace W ⊆ R^n, dim(W) = i}  max_{x ∈ W, x ≠ 0}  (x^⊤ A x)/(x^⊤ x)

2. λ_i = max_{subspace W ⊆ R^n, dim(W) = n+1−i}  min_{x ∈ W, x ≠ 0}  (x^⊤ A x)/(x^⊤ x)

Theorem 3.1.2 is an immediate corollary of Theorem 3.1.4, since taking W = R^n shows that
the minimum value of the quadratic form x^⊤ A x over x with ‖x‖₂ = 1 is λ₁(A), i.e.
x^⊤ A x ≥ λ₁(A) ‖x‖₂² for all x.

Proof of Theorem 3.1.4. We start by showing Part 1.


Consider letting W = span{v₁, ..., v_i}, and normalize x ∈ W so that ‖x‖₂ = 1. Then
x = ∑_{j=1}^i c(j) v_j for some vector c ∈ R^i with ‖c‖₂ = 1.
Using the decomposition from Theorem 3.1.3, A = V Λ V^⊤ where Λ is a diagonal ma-
trix of eigenvalues of A, which we take to be sorted in increasing order. Then x^⊤ A x =
(V^⊤ x)^⊤ Λ (V^⊤ x) = ∑_{j=1}^i λ_j c(j)² ≤ λ_i ‖c‖₂² = λ_i. So this choice of W ensures
the maximizer cannot achieve a value above λ_i.

But is it possible that the "minimizer" can do better by choosing a different W? Let
T = span{v_i, ..., v_n}. As dim(T) = n+1−i and dim(W) = i, we must have dim(W ∩ T) ≥ 1,
by a standard property of subspaces. Hence for any W of this dimension,

max_{x ∈ W, x ≠ 0} (x^⊤ A x)/(x^⊤ x) ≥ max_{x ∈ W ∩ T, x ≠ 0} (x^⊤ A x)/(x^⊤ x)
                                     ≥ min_{subspace V ⊆ T, dim(V) = 1}  max_{x ∈ V, x ≠ 0} (x^⊤ A x)/(x^⊤ x) = λ_i,

where the last equality follows from a similar calculation to our first one. Thus, the value λ_i can always
be achieved by the "maximizer" for all W of this dimension.
Part 2 can be dealt with similarly.

Example: a positive semidefinite matrix. Consider the following matrix A:

A := [  1  −1 ]
     [ −1   1 ]

For x = (1, 1)^⊤, we have Ax = 0, so λ = 0 is an eigenvalue of A. For x = (1, −1)^⊤, we have
Ax = (2, −2)^⊤ = 2x, so λ = 2 is the other eigenvalue of A. As both are non-negative, by the
theorem above, A is positive semidefinite.
Since we are learning about symmetric matrices, there is one more fact that everyone should
know about them. We'll use λ_max(A) to denote the maximum eigenvalue of a matrix A, and
λ_min(A) the minimum.

Claim 3.1.5. For a symmetric matrix A ∈ R^{n×n}, ‖A‖ = max(|λ_max(A)|, |λ_min(A)|).
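Here ‖A‖ denotes the spectral (operator) norm. A quick random spot check of the claim in numpy (our own sketch):

import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                        # a random symmetric matrix
eigs = np.linalg.eigvalsh(A)             # eigenvalues in ascending order
print(np.linalg.norm(A, 2))              # spectral norm ||A||
print(max(abs(eigs[0]), abs(eigs[-1])))  # max(|lambda_min|, |lambda_max|); equal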

3.2 Characterizations of Convexity and Optimality via Second Derivatives

We will now use the second derivatives of a function to obtain characterizations of convexity
and optimality. We will begin by introducing the Hessian, the matrix of pairwise second
derivatives of a function. We will see that it plays a role in approximating a function via a
second-order Taylor expansion. We will then use semi-definiteness of the Hessian matrix to
characterize both conditions of optimality as well as the convexity of a function.

Definition 3.2.1. Given a function f : S → R, its Hessian matrix at a point x ∈ S,
denoted H_f(x) (also sometimes denoted ∇²f(x)), is:

H_f(x) := [ ∂²f(x)/∂x(1)²       ∂²f(x)/∂x(1)∂x(2)   ...   ∂²f(x)/∂x(1)∂x(n) ]
          [ ∂²f(x)/∂x(2)∂x(1)   ∂²f(x)/∂x(2)²       ...   ∂²f(x)/∂x(2)∂x(n) ]
          [ ...                 ...                 ...   ...               ]
          [ ∂²f(x)/∂x(n)∂x(1)   ∂²f(x)/∂x(n)∂x(2)   ...   ∂²f(x)/∂x(n)²     ]

Second-order Taylor expansion. When f is twice differentiable, it is possible to obtain
an approximation of f by quadratic functions. Our definition of f : S → R being twice
(Fréchet) differentiable at x ∈ S is that there exist ∇f(x) ∈ R^n and H_f(x) ∈ R^{n×n} s.t.

lim_{δ→0} |f(x + δ) − f(x) − (∇f(x)^⊤ δ + ½ δ^⊤ H_f(x) δ)| / ‖δ‖₂² = 0.

This is equivalent to saying that for all δ,

f(x + δ) = f(x) + ∇f(x)^⊤ δ + ½ δ^⊤ H_f(x) δ + o(‖δ‖₂²)

where by definition:

lim_{δ→0} o(‖δ‖₂²)/‖δ‖₂² = 0

We say that f is twice continuously differentiable on a set S ⊆ R^n if it is twice differentiable and
in addition the gradient and Hessian are continuous on S.
As for first-order expansions, we have a Taylor's Theorem, which we state in the so-called
remainder form.

Theorem 3.2.2 (Taylor's Theorem, multivariate second-order remainder form). If f : S →
R is twice continuously differentiable over [x, y], then for some z ∈ [x, y],

f(y) = f(x) + ∇f(x)^⊤(y − x) + ½ (y − x)^⊤ H_f(z) (y − x)

3.2.1 A Necessary Condition for Local Extrema

Recall that in the previous chapter, we showed the following proposition.

Proposition 3.2.3. If x is a local extremum of a differentiable function f : S → R then


∇f (x ) = 0.

We can now give the second-order necessary conditions for local extrema via the Hessian.

Theorem 3.2.4. Let f : S → R be a function twice differentiable at x ∈ S. If x is a local
minimum, then H_f(x) is positive semidefinite.

Proof. Let us assume that x is a local minimum. We know from Proposition 3.2.3 that
∇f(x) = 0, hence the second-order expansion at x takes the form:

f(x + λd) = f(x) + ½ λ² d^⊤ H_f(x) d + o(λ² ‖d‖₂²)

Because x is a local minimum, we can then derive

0 ≤ lim_{λ→0⁺} (f(x + λd) − f(x)) / λ² = ½ d^⊤ H_f(x) d

This is true for any d, hence H_f(x) is positive semidefinite.
Remark 3.2.5. Again, for this proposition to hold, it is important that S is open.

3.2.2 A sufficient condition for local extrema

A local minimum thus is a stationary point and has a positive semi-definite Hessian. The
converse is almost true, but we need to strengthen the Hessian condition slightly.

Theorem 3.2.6. Let f : S → R be a function twice differentiable at a stationary point
x ∈ S. If H_f(x) is positive definite, then x is a local minimum.

Proof. Let us assume that H_f(x) is positive definite. We know that x is a stationary point.
We can write the second-order expansion at x:

f(x + δ) = f(x) + ½ δ^⊤ H_f(x) δ + o(‖δ‖₂²)

Because the Hessian is positive definite, it has a strictly positive minimum eigenvalue λ_min,
and we can conclude that δ^⊤ H_f(x) δ ≥ λ_min ‖δ‖₂². From this, we conclude that when ‖δ‖₂ is
small enough, f(x + δ) − f(x) ≥ ¼ λ_min ‖δ‖₂² > 0. This proves that x is a local minimum.
Remark 3.2.7. When H f (x ) is indefinite at a stationary point x , we have what is known
as a saddle point: x will be a minimum along the eigenvectors of H f (x ) for which the
eigenvalues are positive and a maximum along the eigenvectors of H f (x ) for which the
eigenvalues are negative.

3.2.3 Characterization of convexity

Definition 3.2.8. For a convex set S ⊆ R^n, we say that a function f : S → R is
strictly convex on S if for any two points x₁, x₂ ∈ S and any θ ∈ (0, 1) we have that:

f(θx₁ + (1 − θ)x₂) < θ f(x₁) + (1 − θ) f(x₂).

Theorem 3.2.9. Let S ⊆ R^n be open and convex, and let f : S → R be twice continuously
differentiable.

1. If H_f(x) is positive semi-definite for every x ∈ S, then f is convex on S.

2. If H_f(x) is positive definite for every x ∈ S, then f is strictly convex on S.

3. If f is convex, then H_f(x) is positive semi-definite ∀x ∈ S.

Proof.

1. By applying Theorem 3.2.2, we find that for some z ∈ [x, y]:

f(y) = f(x) + ∇f(x)^⊤(y − x) + ½ (y − x)^⊤ H_f(z) (y − x)

If H_f(z) is positive semi-definite, this necessarily implies that:

f(y) ≥ f(x) + ∇f(x)^⊤(y − x)

and from Theorem 3.3.5 we get that f is convex.

2. If H_f(x) is positive definite, we have that:

f(y) > f(x) + ∇f(x)^⊤(y − x).

Applying the same idea as in Theorem 3.3.5, we can show that in this case f is strictly
convex.

3. Let f be a convex function. For x ∈ S and some small λ > 0, for any d ∈ R^n we have
that x + λd ∈ S. From the Taylor expansion of f we get:

f(x + λd) = f(x) + λ∇f(x)^⊤ d + (λ²/2) d^⊤ H_f(x) d + o(λ² ‖d‖₂²).

From Theorem 3.3.5 we get that if f is convex then:

f(x + λd) ≥ f(x) + λ∇f(x)^⊤ d.

Therefore, we have that for any d ∈ R^n:

(λ²/2) d^⊤ H_f(x) d + o(λ² ‖d‖₂²) ≥ 0

Dividing by λ² and taking λ → 0⁺ gives us that for any d ∈ R^n: d^⊤ H_f(x) d ≥ 0.
Remark 3.2.10. It is important to note that if S is open and f is strictly convex, then
H_f(x) may still (only) be positive semi-definite ∀x ∈ S. Consider f(x) = x⁴, which is
strictly convex; its Hessian is H_f(x) = 12x², which equals 0 at x = 0.

3.3 Gradient Descent - An Approach to Optimization?
We have begun to develop an understanding of convex functions, and what we have learned
already suggests a way for us to try to find an approximate minimizer of a given convex
function.
Suppose f : R^n → R is convex and differentiable, and we want to solve

min_{x ∈ R^n}  f(x)

We would like to find x ∗ , a global minimizer of f . Suppose we start with some initial guess
x 0 , and we want to update it to x 1 with f (x 1 ) < f (x 0 ). If we can repeatedly make updates
like this, maybe we eventually find a point with nearly minimum function value, i.e. some
x̃ with f (x̃ ) ≈ f (x ∗ )?
Recall that f(x₀ + δ) = f(x₀) + ∇f(x₀)^⊤ δ + o(‖δ‖₂). This means that if we choose
x₁ = x₀ − λ∇f(x₀), we get

f(x₀ − λ∇f(x₀)) = f(x₀) − λ ‖∇f(x₀)‖₂² + o(λ ‖∇f(x₀)‖₂)

And because f is convex, we know that ∇f(x₀) ≠ 0 unless we are already at a global

minimum. So, for some small enough λ > 0, we should get f (x 1 ) < f (x 0 ) unless we’re
already at a global minimizer. This idea of taking a step in the direction of −∇f (x 0 ) is
what is called Gradient Descent. But how do we choose λ each time? And does this lead
to an algorithm that quickly reaches a point with close to minimal function value? To get
good answers to these questions, we need to assume more about the function f that we are
trying to minimize.
In the following subsection, we will see some conditions that suffice. But there are also many
other settings where one can show that some form of gradient descent converges.

3.3.1 A Quantitative Bound on Changes in the Gradient

Definition 3.3.1. Let f : S → R be a differentiable function, where S ⊆ R^n is convex
and open. We say that f is β-gradient Lipschitz iff for all x, y ∈ S

‖∇f(x) − ∇f(y)‖₂ ≤ β ‖x − y‖₂.

We also refer to this as f being β-smooth.

Proposition 3.3.2. Consider a twice continuously differentiable f : S → R. Then f is
β-gradient Lipschitz if and only if for all x ∈ S, ‖H_f(x)‖ ≤ β.

You will prove this in Exercise 13 (Week 2) of the first exercise set.

Proposition 3.3.3. Let f : S → R be a β-gradient Lipschitz function. Then for all x, y ∈ S,

f(y) ≤ f(x) + ∇f(x)^⊤(y − x) + (β/2) ‖x − y‖₂²

To prove this proposition, we need the following result from multi-variate calculus. This is
a restricted form of the fundamental theorem of calculus for line integrals.

Proposition 3.3.4. Let f : S → R be a differentiable function, and consider x, y such that
[x, y] ⊆ S. Let x_θ = x + θ(y − x). Then

f(y) = f(x) + ∫_{θ=0}^{1} ∇f(x_θ)^⊤(y − x) dθ

Now, we’re in a position to show Proposition 3.3.3

Proof of Proposition 3.3.3. Let f : S → R be a β-gradient Lipschitz function. Consider
arbitrary x, y ∈ S such that [x, y] ⊆ S. Then

f(y) = f(x) + ∫_{θ=0}^{1} ∇f(x_θ)^⊤(y − x) dθ
     = f(x) + ∫_{θ=0}^{1} ∇f(x)^⊤(y − x) dθ + ∫_{θ=0}^{1} (∇f(x_θ) − ∇f(x))^⊤(y − x) dθ

Next we use Cauchy-Schwarz pointwise, and evaluate the first integral:

f(y) ≤ f(x) + ∇f(x)^⊤(y − x) + ∫_{θ=0}^{1} ‖∇f(x_θ) − ∇f(x)‖₂ ‖y − x‖₂ dθ

Then we apply the β-gradient Lipschitz condition and note x_θ − x = θ(y − x):

f(y) ≤ f(x) + ∇f(x)^⊤(y − x) + ∫_{θ=0}^{1} βθ ‖y − x‖₂² dθ
     = f(x) + ∇f(x)^⊤(y − x) + (β/2) ‖y − x‖₂².

3.3.2 Analyzing Gradient Descent

It turns out that just knowing a function f : Rn → R is convex and β-gradient Lipschitz is
enough to let us figure out a reasonable step size for Gradient Descent and let us analyze its
convergence.

We start at a point x₀ ∈ R^n, and we try to find a point x₁ = x₀ + δ with lower function
value. We will let our upper bound from Proposition 3.3.3 guide us. In fact, we could ask:
what is the best update for minimizing this upper bound, i.e. a δ solving

min_{δ ∈ R^n}  f(x₀) + ∇f(x₀)^⊤ δ + (β/2) ‖δ‖₂²

We can compute the best δ according to this upper bound by noting first that this function is
convex and continuously differentiable, and hence will be minimized at any point where the
gradient is zero. Thus we want 0 = ∇_δ (f(x₀) + ∇f(x₀)^⊤ δ + (β/2) ‖δ‖₂²) = ∇f(x₀) + βδ,
which occurs at δ = −(1/β) ∇f(x₀).
Plugging this value into the upper bound, we get that f(x₁) ≤ f(x₀) − ‖∇f(x₀)‖₂²/(2β).
Now, as our algorithm, we will start with some guess x₀, and then at every step we will
update our guess using the best step based on our Proposition 3.3.3 upper bound on f at
x_i, and so we get

x_{i+1} = x_i − (1/β) ∇f(x_i)   and   f(x_{i+1}) ≤ f(x_i) − ‖∇f(x_i)‖₂²/(2β).   (3.1)

Let us try to prove that our algorithm converges toward an x with low function value.
We will measure this by looking at

gapi = f (x i ) − f (x ∗ )

where x ∗ is a global minimizer of f (note that there may not be a unique minimizer of f ).
We will try to show that this function value gap grows small. Using f (x i+1 ) − f (x i ) = gapi+1 − gapi , we get

gapi+1 − gapi ≤ − ‖∇f (x i )‖2²/(2β)   (3.2)

If the gapi value is never too much bigger than ‖∇f (x i )‖2²/(2β), then this should help us show we are making progress. But how much can they differ? We will now try to show a limit on this.
Recall that in the previous chapter we showed the following theorem.
Theorem 3.3.5. Let S be an open convex subset of Rn , and let f : S → R be a differentiable
function. Then, f is convex if and only if for any x , y ∈ S we have that f (y ) ≥ f (x ) +
∇f (x )> (y − x ).

Using the convexity of f and the lower bound on convex functions given by Theorem 3.3.5,
we have that

f (x ∗ ) ≥ f (x i ) + ∇f (x i )> (x ∗ − x i ) (3.3)

Rearranging gets us

gapi ≤ ∇f (x i )> (x i − x ∗ ) (3.4)


≤ ‖∇f (x i )‖2 ‖x i − x ∗ ‖2  by Cauchy-Schwarz.

At this point, we are essentially ready to connect Equation (3.2) with Equation (3.4) and
analyze the convergence rate of our algorithm.
However, at the moment, we see that the change gapi+1 − gapi in how close we are to
the optimum function value is governed by the norm of the gradient ‖∇f (x i )‖2 , while the
size of the gap is related to both this quantity and the distance ‖x i − x ∗ ‖2 between the
current solution x i and an optimum x ∗ . Do we need both or can we get rid of, say, the
distance? Unfortunately, with this algorithm and for this class of functions, a dependence
on the distance is necessary. However, we can simplify things considerably using the following
observation, which you will prove in the exercises (Exercise 2):

Claim 3.3.6. When running Gradient Descent as given by the step in Equation (3.1), for all i, ‖x i − x ∗ ‖2 ≤ ‖x 0 − x ∗ ‖2 .

Combining this Claim with Equation (3.2) and Equation (3.4),

gapi+1 − gapi ≤ − (1/(2β)) · ( gapi / ‖x 0 − x ∗ ‖2 )²   (3.5)

At this point, a simple induction will complete the proof of the following result.

Theorem 3.3.7. Let f : Rⁿ → R be a β-gradient Lipschitz, convex function. Let x 0 be a given starting point, and let x ∗ ∈ arg min_{x ∈Rⁿ} f (x ) be a minimizer of f . The Gradient Descent algorithm given by

x i+1 = x i − (1/β) ∇f (x i )

ensures that the kth iterate satisfies

f (x k ) − f (x ∗ ) ≤ 2β ‖x 0 − x ∗ ‖2² / (k + 1).

Carrying out this induction is one of the Week 2 exercises (Exercise 15 in the first exercise
set).
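To make the scheme concrete, here is a minimal numerical sketch of this Gradient Descent step rule in Python with NumPy. This snippet is an illustration added to these notes, not part of the analysis; the quadratic example function and its smoothness constant are made-up inputs — any convex, β-gradient Lipschitz f would do.

    import numpy as np

    def gradient_descent(grad_f, x0, beta, num_steps):
        # Iterate x_{i+1} = x_i - (1/beta) * grad f(x_i), the step size
        # suggested by the upper bound from Proposition 3.3.3.
        x = x0.copy()
        for _ in range(num_steps):
            x = x - grad_f(x) / beta
        return x

    # Example: f(x) = 0.5 x^T A x - b^T x is convex for A positive
    # semi-definite, and beta-gradient Lipschitz with beta = ||A||_2.
    A = np.array([[2.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, -1.0])
    x_min = gradient_descent(lambda x: A @ x - b, np.zeros(2),
                             np.linalg.norm(A, 2), 1000)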

3.4 Accelerated Gradient Descent


It turns out that we can get an algorithm that converges substantially faster than Gradient
Descent, using an approach known as Accelerated Gradient Descent, which was developed
by Nesterov [Nes83]. This algorithm in turn improved on some earlier results by Nemirovski

and Yudin [NY83]. The phenomenon of acceleration was perhaps first understood in the
context of quadratic functions, minimizing x > Ax − x > b when A is positive definite – for
this case, the Conjugate Gradient algorithm was developed independently by Hestenes and
Stiefel [HS+ 52] (here at ETH!), and by Lanczos [Lan52]. In the past few years, providing more
intuitive explanations of acceleration has been a popular research topic. This presentation is
based on an analysis of Nesterov’s algorithm developed by Diakonikolas and Orecchia [DO19].
We will adopt a slightly different approach to analyzing this algorithm than what we used
in the previous section for Gradient Descent.
We will use x 0 to denote the starting point of our algorithm, and we will produce a sequence
of iterates x 0 , x 1 , x 2 , . . . , x k . At each iterate x i , we will compute the gradient ∇f (x i ).
However, the way we choose x i+1 based on what we know so far will now be a little more
involved than what we did for Gradient Descent.
To help us understand the algorithm, we are going to introduce two more sequences of
iterates y 0 , y 1 , y 2 , . . . , y k ∈ Rn and v 0 , v 1 , v 2 , . . . , v k ∈ Rn .
The sequence of y i 's will be constructed to help us get as low a function value as possible at f (y i ), which we will consider our current solution; the last iterate y k will be the output solution of our algorithm.
The sequence of v i ’s will be constructed to help us get a lower bound on f (x ∗ ).
By combining the upper bound on the function value of our current solution f (y i ) with a
lower bound on the function value at an optimal solution f (x ∗ ), we get an upper bound on
the gap f (y i ) − f (x ∗ ) between the value of our solution and the optimal one. Finally, each
iterate x i , which will be where we evaluate gradient ∇f (x i ), is chosen through a trade-off
between wanting to reduce the upper bound and wanting to increase the lower bound.

The upper bound sequence: y i 's. The point y i will be chosen from x i to minimize an upper bound on f based at x i . This is similar to what we did in the previous section. We let y i = x i + δi and choose δi to minimize the upper bound f (y i ) ≤ f (x i ) + ∇f (x i )> δi + (β/2) ‖δi ‖2², which gives us

y i = x i − (1/β) ∇f (x i )  and  f (y i ) ≤ f (x i ) − ‖∇f (x i )‖2²/(2β).

We will introduce a notation for this upper bound

Ui = f (y i ) ≤ f (x i ) − ‖∇f (x i )‖2²/(2β).   (3.6)

Philosophizing about lower bounds1 . A crucial ingredient to establishing an upper
bound on gapi was a lower bound on f (x ∗ ).
In our analysis of Gradient Descent, in Equation (3.4), we used the lower bound
f (x ∗ ) ≥ f (x i ) − ‖∇f (x i )‖2 ‖x i − x ∗ ‖2 . We can think of the Gradient Descent analysis as
being based on a tension between two statements: Firstly that “a large gradient implies we
quickly approach the optimum” and secondly “the function value gap to optimum cannot
exceed the magnitude of the current gradient (scaled by distance to opt)”.
This analysis does not use that we have seen many different function values and gradients,
and each of these can be used to construct a lower bound on the optimum value f (x ∗ ), and,
in particular, it is not clear that the last gradient provides the best bound. To do better, we
will try to use lower bounds that take advantage of all the gradients we have seen.
Definition 3.4.1. We will adopt a new notation for inner products that sometimes is more convenient when dealing with large expressions: ⟨a , b⟩ def= a > b.

The lower bound sequence: v i 's. We can introduce weights ai > 0 for each step and combine the gradients we have observed into one lower bound based on a weighted average. Let us use Ai = Σ_{j≤i} aj to denote the sum of the weights. Now a general lower bound on the function value at any v ∈ Rⁿ is:

f (v ) ≥ (1/Ai ) Σ_{j≤i} aj ( f (x j ) + ⟨∇f (x j ), v − x j ⟩ )

However, to use Cauchy-Schwarz on each individual term here to instantiate this bound at x ∗ does not give us anything useful. Instead, we will employ a somewhat magical trick: we introduce a regularization term

φ(v ) def= (σ/2) ‖v − x 0 ‖2².

We will choose the value σ > 0 later. Now we derive our lower bound Li :

f (x ∗ ) ≥ (1/Ai ) ( φ(x ∗ ) + Σ_{j≤i} [ aj f (x j ) + ⟨aj ∇f (x j ), x ∗ − x j ⟩ ] ) − φ(x ∗ )/Ai
       ≥ min_{v ∈Rⁿ} { (1/Ai ) ( φ(v ) + Σ_{j≤i} [ aj f (x j ) + ⟨aj ∇f (x j ), v − x j ⟩ ] ) − φ(x ∗ )/Ai }
       = Li

We will let v i be the v obtaining the minimum in the optimization problem appearing in the definition of Li , so that

Li = (1/Ai ) ( φ(v i ) + Σ_{j≤i} [ aj f (x j ) + ⟨aj ∇f (x j ), v i − x j ⟩ ] ) − φ(x ∗ )/Ai
1
YMMV. People have a lot of different opinions about how to understand acceleration, and you should
take my thoughts with a grain of salt.

How we will measure convergence. We have designed the upper bound Ui and the
lower bound Li such that gapi = f (y i ) − f (x ∗ ) ≤ Ui − Li .
As you will show in Exercise 3, we can prove the convergence of Gradient Descent directly by an induction that establishes 1/gapi ≥ C · i for some constant C depending on the Lipschitz gradient parameter β and the distance ‖x 0 − x ∗ ‖2 .
To analyze Accelerated Gradient Descent, we will adopt a similar, but slightly different
strategy, namely trying to show that (Ui − Li )r(i) is non-increasing for some positive “rate
function” r(i). Ideally r(i) should grow quickly, which would imply that gapi quickly gets
small. We will also need to show that (U0 −L0 )r(0) ≤ C for some constant C again depending
on β and kx 0 − x ∗ k2 . Then, we’ll be able to conclude that

gapi · r(i) ≤ (Ui − Li )r(i) ≤ (Ui−1 − Li−1 )r(i − 1) ≤ · · · ≤ (U0 − L0 )r(0) ≤ C,

and hence gapi ≤ C/r(i).


This framework is fairly general. We could have also used it to analyze Gradient Descent,
and it works for many other optimization algorithms too.
We are going to choose our rate function r(i) to be exactly Ai , which of course is no accident!
As we will see, this interacts nicely with our lower bound Li . Hence, our goals are to

1. provide an upper bound on A0 (U0 − L0 ),

2. and show that Ai+1 (Ui+1 − Li+1 ) ≤ Ai (Ui − Li ).

Establishing the convergence rate. Let's start by looking at the change in the upper bound scaled by our rate function:

Ai+1 Ui+1 − Ai Ui = Ai+1 ( f (y i+1 ) − f (x i+1 ) ) − Ai ( f (y i ) − f (x i+1 ) ) + (Ai+1 − Ai ) f (x i+1 )   (3.7)
   ≤ Ai+1 ( − ‖∇f (x i+1 )‖2²/(2β) )        First term controlled by Equation (3.6).
   − Ai ⟨∇f (x i+1 ), y i − x i+1 ⟩          Second term bounded by Theorem 3.3.5.
   + ai+1 f (x i+1 )                        Third term uses ai+1 = Ai+1 − Ai .

The solution v i to the minimization in the lower bound Li turns out to be relatively simple
to characterize. By using derivatives to find the optimum, we first analyze the initial value
of the lower bound L0 .

Claim 3.4.2.

1. v 0 = x 0 − (a0 /σ) ∇f (x 0 )

2. L0 = f (x 0 ) − (a0 /(2σ)) ‖∇f (x 0 )‖2² − (σ/(2a0 )) ‖x ∗ − x 0 ‖2².

You will prove Claim 3.4.2 in Exercise 17 (Week 2) of the first exercise sheet. Noting A0 = a0 ,
we see from Equation (3.6) and Part 2 of Claim 3.4.2, that

A0 (U0 − L0 ) ≤ ( a0²/(2σ) − a0 /(2β) ) ‖∇f (x 0 )‖2² + (σ/2) ‖x ∗ − x 0 ‖2²   (3.8)
It will be convenient to introduce notation for the rescaled lower bound Ai Li without optimizing over v :

mi (v ) = φ(v ) − φ(x ∗ ) + Σ_{j≤i} ( aj f (x j ) + ⟨aj ∇f (x j ), v − x j ⟩ )

Thus Ai Li − Ai+1 Li+1 = mi (v i ) − mi+1 (v i+1 ). Now, it is not too hard to show the following relationships.
Claim 3.4.3.

1. mi (v ) = mi (v i ) + (σ/2) ‖v − v i ‖2²

2. mi+1 (v ) = mi (v ) + ai+1 f (x i+1 ) + ⟨ai+1 ∇f (x i+1 ), v − x i+1 ⟩

3. v i+1 = v i − (ai+1 /σ) ∇f (x i+1 )

And again, you will prove Claim 3.4.3 in Exercise 18 (Week 2) of the first exercise sheet.
Hint for Part 1: note that mi (v ) is a quadratic function, minimized at v i , and its Hessian equals σI at all v .
Given Claim 3.4.3, we see that

Ai Li − Ai+1 Li+1
= mi (v i ) − mi+1 (v i+1 )   (3.9)
= −ai+1 f (x i+1 ) − ⟨ai+1 ∇f (x i+1 ), v i+1 − x i+1 ⟩ − (σ/2) ‖v i+1 − v i ‖2²   (3.10)
= −ai+1 f (x i+1 ) − ⟨ai+1 ∇f (x i+1 ), v i − x i+1 ⟩ + (ai+1²/(2σ)) ‖∇f (x i+1 )‖2²   (3.11)

This means that by combining Equation (3.7) and (3.11) we get

Ai+1 (Ui+1 − Li+1 ) − Ai (Ui − Li ) ≤ ( −Ai+1 /(2β) + ai+1²/(2σ) ) ‖∇f (x i+1 )‖2²
                                    + ⟨∇f (x i+1 ), Ai+1 x i+1 − ai+1 v i − Ai y i ⟩ .

Now, this means that Ai+1 (Ui+1 − Li+1 ) − Ai (Ui − Li ) ≤ 0 if

Ai+1 x i+1 − ai+1 v i − Ai y i = 0  and  Ai+1 /β ≥ ai+1²/σ .

We can get this by letting x i+1 = (Ai y i + ai+1 v i )/Ai+1 , σ = β, and ai = (i + 1)/2, which implies that Ai = (i + 1)(i + 2)/4 > ai².
By Equation (3.8), these parameter choices also imply that

A0 (U0 − L0 ) ≤ (β/2) ‖x ∗ − x 0 ‖2².

Finally, by induction, we get Ai (Ui − Li ) ≤ (β/2) ‖x ∗ − x 0 ‖2². Dividing through by Ai and using gapi ≤ Ui − Li results in the following theorem.

Theorem 3.4.4. Let f : Rⁿ → R be a β-gradient Lipschitz, convex function. Let x 0 be a given starting point, and let x ∗ ∈ arg min_{x ∈Rⁿ} f (x ) be a minimizer of f .
The Accelerated Gradient Descent algorithm given by

ai = (i + 1)/2 ,  Ai = (i + 1)(i + 2)/4
v 0 = x 0 − (a0 /β) ∇f (x 0 )
y i = x i − (1/β) ∇f (x i )
x i+1 = (Ai y i + ai+1 v i )/Ai+1
v i+1 = v i − (ai+1 /β) ∇f (x i+1 )

ensures that the kth iterate satisfies

f (y k ) − f (x ∗ ) ≤ 2β ‖x 0 − x ∗ ‖2² / ( (k + 1)(k + 2) ).
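The update rules of Theorem 3.4.4 translate directly into code. The following Python/NumPy sketch is an illustration under the same assumptions as the Gradient Descent sketch earlier (grad_f and beta are example inputs); it returns the iterate y k :

    def accelerated_gradient_descent(grad_f, x0, beta, num_steps):
        a = lambda i: (i + 1) / 2            # a_i = (i+1)/2
        A = lambda i: (i + 1) * (i + 2) / 4  # A_i = (i+1)(i+2)/4
        x = x0.copy()
        v = x0 - (a(0) / beta) * grad_f(x0)
        y = x0
        for i in range(num_steps):
            y = x - grad_f(x) / beta                  # upper bound step
            x = (A(i) * y + a(i + 1) * v) / A(i + 1)  # trade-off point
            v = v - (a(i + 1) / beta) * grad_f(x)     # lower bound step
        return y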

Part II

Spectral Graph Theory

Chapter 4

Introduction to Spectral Graph Theory

In this chapter, we will study graphs through linear algebra. This approach is known as
Spectral Graph Theory and turns out to be surprisingly powerful. An in-depth treatment of
many topics in this area can be found in [Spi19].

4.1 Recap: Incidence and Adjacency Matrices, the Laplacian Matrix and Electrical Energy

In Chapter 1, we looked at undirected graphs and we introduced the incidence matrix and the Laplacian of the graph. Let us recall these.
We consider an undirected weighted graph G = (V, E, w ), with n = |V | vertices and m = |E| edges, where w ∈ R^E_+ assigns a positive weight to every edge. Let's assume G is connected.

To introduce the edge-vertex incidence matrix of the graph, we first have to associate an arbitrary direction to every edge. We then let B ∈ R^{V ×E} be given by

B(v, e) = 1 if e = (u, v),  −1 if e = (v, u),  and 0 otherwise.

The edge directions are only there to help us track the meaning of signs of quantities defined
on edges: The math we do should not depend on the choice of sign.
Let W ∈ RE×E be the diagonal matrix given by W = diag(w ), i.e W (e, e) = w (e). We
define the Laplacian of the graph as L = BW B > . Note that in the first chapter, we defined
the Laplacian as BR−1 B > , where R is the diagonal matrix with edge resistances on the
diagonal. We want to think of high weight on an edge as expressing that two vertices are

highly connected, whereas we think of high resistance on an edge as expressing that the two
vertices are poorly connected, so we let w (e) = 1/R(e, e).
The weighted adjacency matrix A ∈ R^{V ×V} of a graph is given by

A(u, v) = w (u, v) if {u, v} ∈ E,  and 0 otherwise.
Note that we treat the edges as undirected here, so A > = A. The weighted degree of a vertex is defined as d (v) = Σ_{{u,v}∈E} w (u, v). Again we treat the edges as undirected. Let D = diag(d ) be the diagonal matrix in R^{V ×V} with weighted degrees on the diagonal.
In Problem Set 1, you showed that L = D − A, and that for x ∈ R^V,

x > Lx = Σ_{{a,b}∈E} w (a, b)(x (a) − x (b))² .
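As a quick concrete check, here is a small Python/NumPy sketch (an illustration added to these notes, with a made-up toy graph) that builds B, W and L and verifies both L = BW B > = D − A and the quadratic form identity:

    import numpy as np

    edges = [(0, 1), (1, 2), (0, 2)]     # arbitrary orientations per edge
    w = np.array([1.0, 2.0, 3.0])
    n, m = 3, len(edges)
    B = np.zeros((n, m))
    for e, (u, v) in enumerate(edges):
        B[v, e], B[u, e] = 1.0, -1.0     # B(v,e)=1 if e=(u,v), B(u,e)=-1
    L = B @ np.diag(w) @ B.T

    A = np.zeros((n, n))
    for (u, v), wt in zip(edges, w):
        A[u, v] = A[v, u] = wt
    D = np.diag(A.sum(axis=1))
    assert np.allclose(L, D - A)

    x = np.random.randn(n)
    assert np.isclose(x @ L @ x, sum(wt * (x[u] - x[v])**2
                                     for (u, v), wt in zip(edges, w)))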

Now we can express the net flow constraint that f routes d by

Bf = d .

This is also called a conservation constraint. In our examples so far, we have d (s) = −1,
d (t) = 1 and d (u) = 0 for all u ∈ V \ {s, t}.
If we let R = diag_{e∈E} (r (e)) then Ohm's law tells us that electrical voltages x will induce an electrical flow f = R^{−1} B > x . We defined the electrical energy of a flow f ∈ R^E to be

E(f ) = Σ_e r (e)f (e)² = f > Rf .

And, from Ohm’s Law, we can then see that

E(f ) = f > Rf = x > Lx .

Hence, define the electrical energy associated with a set of voltages to be

E(x ) = x > Lx .

The Courant-Fischer Theorem. Let us also recall the Courant-Fischer theorem, which we proved in Chapter 3 and restate here as Theorem 4.1.1.

Theorem 4.1.1 (The Courant-Fischer Theorem). Let A be a symmetric matrix in R^{n×n} , with eigenvalues λ1 ≤ λ2 ≤ . . . ≤ λn . Then

1. λi = min_{subspace W ⊆Rⁿ, dim(W )=i}  max_{x ∈W, x ≠0}  x > Ax / x > x

2. λi = max_{subspace W ⊆Rⁿ, dim(W )=n+1−i}  min_{x ∈W, x ≠0}  x > Ax / x > x

In fact, from our proof of the Courant-Fischer theorem in Chapter 3, we can also extract a
slightly different statement:
Theorem 4.1.2 (The Courant-Fischer Theorem, eigenbasis version). Let A be a symmetric matrix in R^{n×n} , with eigenvalues λ1 ≤ λ2 ≤ . . . ≤ λn , and corresponding eigenvectors x 1 , x 2 , . . . , x n which form an orthonormal basis. Then

1. λi = min_{x ⊥x 1 ,...,x i−1 , x ≠0}  x > Ax / x > x

2. λi = max_{x ⊥x i+1 ,...,x n , x ≠0}  x > Ax / x > x

Of course, we also have λi (A) = x i> Ax i / x i> x i .

4.2 Understanding Eigenvalues of the Laplacian


We would like to understand the eigenvalues of the Laplacian matrix of a graph.
But first, why should we care? It turns out that Laplacian eigenvalues can help us understand many properties of a graph. But we are going to start off with a simple motivating observation: Electrical voltages x ∈ R^V consume electrical energy E(x ) = x > Lx . This means that by the Courant-Fischer Theorem

E(x ) = x > Lx ≤ λn (L) x > x

And, for any voltages x ⊥ 1,

E(x ) = x > Lx ≥ λ2 (L) x > x .

Thus, we can use the eigenvalues to give upper and lower bounds on how much electrical energy will be consumed by the flow induced by x , compared to x > x = ‖x ‖2².
In a couple of chapters, we will also prove the following claim, which shows that the Laplacian
eigenvalues can directly tell us about the electrical energy that is required to route a given
demand.
Claim 4.2.1. Given a demand vector d ∈ R^V such that d ⊥ 1, the electrical voltages x that route d satisfy Lx = d and the electrical energy of these voltages satisfies

‖d ‖2²/λn ≤ E(x ) ≤ ‖d ‖2²/λ2

Eigenvalues of the Laplacian of a Complete Graph. To get a sense of how Laplacian
eigenvalues behave, let us start by considering the n vertex complete graph with unit weights,
which we denote by Kn . The adjacency matrix of Kn is A = 11> − I , since it has ones
everywhere, except for the diagonal, where entries are zero. The degree matrix D = (n−1)I .
Thus the Laplacian is L = D − A = nI − 11> .
Thus for any y ⊥ 1, we have y > Ly = ny > y − (1> y )² = ny > y .
From this, we can conclude that any y ⊥ 1 is an eigenvector with eigenvalue n, and that λ2 = λ3 = . . . = λn = n.
Next, let us try to understand λ2 and λn for Pn , the n vertex path graph with unit weight
edges. I.e. the graph has edges E = {{i, i + 1} for i = 1 to (n − 1)}.
This is in a sense the least well-connected unit weight graph on n vertices, whereas Kn is
the most well-connected.

4.2.1 Test Vector Bounds on λ2 and λn

We can use the eigenbasis version of the Courant-Fischer theorem to observe that the second-smallest eigenvalue of the Laplacian is given by

λ2 (L) = min_{x ≠0, x > 1=0}  x > Lx / x > x .   (4.1)

We can get a better understanding of this particular case through a couple of simple observations. Suppose x = y + α1, where y ⊥ 1. Then x > Lx = y > Ly , and ‖x ‖2² = ‖y ‖2² + α² ‖1‖2². So for any given vector, you can only increase the value of x > Lx / x > x by replacing x with its component orthogonal to 1, which we denoted by y .
We can conclude from Equation (4.1) that for any vector y ⊥ 1,

λ2 ≤ y > Ly / y > y

When we use a vector y in this way to prove a bound on an eigenvalue, we call it a test
vector.
Now, we’ll use a test vector to give an upper bound on λ2 (LPn ). Let x ∈ RV be given by
x (i) = (n + 1) − 2i, for i ∈ [n]. This vector satisfies x ⊥1. We picked this because we wanted
a sequence of values growing linearly along the path, while also making sure that the vector

is orthogonal to 1. Now

λ2 (LPn ) ≤ Σ_{i∈[n−1]} (x (i) − x (i + 1))² / Σ_{i=1}^{n} x (i)²
         = Σ_{i=1}^{n−1} 2² / Σ_{i=1}^{n} (n + 1 − 2i)²
         = 4(n − 1) / ( (n + 1)n(n − 1)/3 )
         = 12/(n(n + 1)) ≤ 12/n².

Later, we will prove a lower bound that shows this value is right up to a constant factor. But
the test vector approach based on the Courant-Fischer theorem doesn’t immediately work
when we want to prove lower bounds on λ2 (L).
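If you want to sanity-check this bound numerically, the following short Python/NumPy snippet (illustrative, not from the original notes) compares λ2 (LPn ) against the test vector bound 12/n²:

    import numpy as np

    def path_laplacian(n):
        # Unit-weight path P_n as L = D - A.
        A = np.zeros((n, n))
        for i in range(n - 1):
            A[i, i + 1] = A[i + 1, i] = 1.0
        return np.diag(A.sum(axis=1)) - A

    n = 50
    lam = np.linalg.eigvalsh(path_laplacian(n))  # sorted ascending
    print(lam[1], 12 / n**2)                     # lambda_2 <= 12/n^2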
We can see from either version of the Courant-Fischer theorem that

λn (L) = max_{v ≠0}  v > Lv / v > v .   (4.2)

Thus for any vector y ≠ 0,

λn ≥ y > Ly / y > y .

This means we can get a test vector-based lower bound on λn . Let us apply this to the Laplacian of Pn . We'll try the vector x ∈ R^V given by x (1) = −1, x (n) = 1, and x (i) = 0 for i ≠ 1, n.
Here we get

λn (LPn ) ≥ x > Lx / x > x = 2/2 = 1.

Again, it’s not clear how to use the Courant-Fischer theorem to prove an upper bound on
λn (L). But, later we’ll see how to prove an upper that shows that for Pn , the lower bound
we obtained is right up to constant factors.

4.2.2 Eigenvalue Bounds Beyond Test Vectors

In the previous sections, we first saw a complete characterization of the eigenvalues and
eigenvectors of the unit weight complete graph on n vertices, Kn . Namely, LKn = nI − 11> ,
and this means that every vector y ⊥ 1 is an eigenvector of eigenvalue n.

We then looked at eigenvalues of Pn , the unit weight path on n vertices, and we showed
using test vector bounds that

λ2 (LPn ) ≤ 12/n²  and  1 ≤ λn (LPn ).   (4.3)
Ideally we would like to prove an almost matching lower bound on λ2 and an almost matching
upper bound on λn , but it is not clear how to get that from the Courant-Fischer theorem.
To get there, we first need to introduce some more tools.

4.2.3 The Loewner Order, aka. the Positive Semi-Definite Order

We'll now introduce an ordering on symmetric matrices called the Loewner order, which I also like to just call the positive semi-definite order. As we will see in a moment, it is a partial order on symmetric matrices; we denote it by "⪯". For convenience, we allow ourselves to both write A ⪯ B and equivalently B ⪰ A.
For a symmetric matrix A ∈ R^{n×n} we define that

A ⪰ 0

if and only if A is positive semi-definite.
More generally, when we have two symmetric matrices A, B ∈ R^{n×n}, we will write

A ⪯ B if and only if for all x ∈ Rⁿ we have x > Ax ≤ x > Bx   (4.4)

This is a partial order, because it satisfies the three requirements of

1. Reflexivity: A ⪯ A.

2. Anti-symmetry: A ⪯ B and B ⪯ A implies A = B.

3. Transitivity: A ⪯ B and B ⪯ C implies A ⪯ C.

Check for yourself that these properties hold!

The PSD order has other very useful properties: A ⪯ B implies A + C ⪯ B + C for any symmetric matrix C . Convince yourself of this too!
And, combining this observation with transitivity, we can see that A ⪯ B and C ⪯ D implies A + C ⪯ B + D.
Here is another useful property: If 0 ⪯ A, then for all α ≥ 1

(1/α) A ⪯ A ⪯ αA.

Here is another one:
Claim 4.2.2. If A ⪯ B, then for all i,

λi (A) ≤ λi (B).

Proof. We can prove this Claim by applying the subspace version of the Courant-Fischer theorem:

λi (A) = min_{subspace W ⊆Rⁿ, dim(W )=i} max_{x ∈W, x ≠0} x > Ax / x > x ≤ min_{subspace W ⊆Rⁿ, dim(W )=i} max_{x ∈W, x ≠0} x > Bx / x > x = λi (B).

Note that the converse of Claim 4.2.2 is very much false: for example, the matrices

A = ( 2 0 ; 0 1 )  and  B = ( 1 0 ; 0 2 )

have equal eigenvalues, but both A ⋠ B and B ⋠ A.
Remark 4.2.3. It’s useful to get used to and remember some of the properties of the
Loewner order, but all the things we have established so far are almost immediate from the
basic characterization in Equation (4.4). So, ideally, don’t memorize all these facts, instead,
try to see that they are simple consequences of the definition.

4.2.4 Upper Bounding a Laplacian’s λn Using Degrees

In an earlier chapter, we observed that for any graph G = (V, E, w ), L = D − A ⪰ 0. We can see this from x > (D − A)x = Σ_{(u,v)∈E} w (u, v)(x (u) − x (v))² ≥ 0. Similarly D + A ⪰ 0, because x > (D + A)x = Σ_{(u,v)∈E} w (u, v)(x (u) + x (v))² ≥ 0. But this means that −A ⪯ D and hence L = D − A ⪯ 2D.
So, for the path graph Pn , we have LPn = D − A ⪯ 2D ⪯ 4I . So by Claim 4.2.2
λn (LPn ) ≤ 4. (4.5)
We can see that our test vector-based lower bound on λn (LPn ) from Equation (4.3) is tight
up to a factor 4.
Since this type of argument works for any unit weight graph, it proves the following claim.
Claim 4.2.4. For any unit weight graph G, λn (LG ) ≤ 2 maxv∈V degree(v).

This is tight on a graph consisting of a single edge.

4.2.5 The Loewner Order and Laplacians of Graphs.

It's sometimes convenient to overload the notation for the PSD order to also apply to graphs. We will write

G ⪯ H

if LG ⪯ LH .
For example, given two unit weight graphs G = (V, E) and H = (V, F ), if H = (V, F ) is a subgraph of G, then

LH ⪯ LG .

We can see this from the Laplacian quadratic form:

x > LG x = Σ_{(u,v)∈E} w (u, v)(x (u) − x (v))².

Dropping edges will only decrease the value of the quadratic form. The same holds for decreasing the weights of edges. The graph order notation is especially useful when we allow for scaling a graph by a constant, say c > 0,

c · H ⪯ G

What is c · H? It is the same graph as H, but the weight of every edge is multiplied by c. Now we can make statements like (1/2)H ⪯ G ⪯ 2H, which turns out to be a useful notion of the two graphs approximating each other.

4.2.6 The Path Inequality

Now, we’ll see a general tool for comparing two graphs G and H to prove inequalities like
cH  G for some constant c. Our tools won’t necessarily work well for all cases, but we’ll
see some examples where they do.
In the rest of the chapter, we will often need to compare two graphs defined on the same
vertex set V = {1, . . . , n} = [n].
We use Gi,j to denote the unit weight graph on vertex set [n] consisting of a single edge
between vertices i and j.

Lemma 4.2.5 (The Path Inequality).

(n − 1) · Pn ⪰ G1,n .

Proof. We want to show that for every x ∈ Rⁿ,

(n − 1) · Σ_{i=1}^{n−1} (x (i + 1) − x (i))² ≥ (x (n) − x (1))².

For i ∈ [n − 1], set

∆(i) = x (i + 1) − x (i).

The inequality we want to prove then becomes

(n − 1) Σ_{i=1}^{n−1} (∆(i))² ≥ ( Σ_{i=1}^{n−1} ∆(i) )².

But, this is immediate from the Cauchy-Schwarz inequality a > b ≤ ‖a ‖2 ‖b‖2 :

(n − 1) Σ_{i=1}^{n−1} (∆(i))² = ‖1_{n−1} ‖2² · ‖∆‖2²
                             = ( ‖1_{n−1} ‖2 · ‖∆‖2 )²
                             ≥ ( 1_{n−1}> ∆ )²
                             = ( Σ_{i=1}^{n−1} ∆(i) )²

4.2.7 Lower Bounding λ2 of a Path Graph

We will now use Lemma 4.2.5 to prove a lower bound on λ2 (LPn ). Our strategy will be to prove that the path Pn is at least some multiple of the complete graph Kn , measured by the Loewner order, i.e. Kn ⪯ f (n)Pn for some function f : N → R. We can combine this with our observation earlier that λ2 (LKn ) = n to show that

f (n) λ2 (LPn ) ≥ λ2 (LKn ) = n,   (4.6)

and this will give our lower bound on λ2 (LPn ). When establishing the inequality between Pn and Kn , we can treat each edge of the complete graph separately, by first noting that

LKn = Σ_{i<j} LG i,j

For every edge (i, j) in the complete graph, we apply the Path Inequality, Lemma 4.2.5:

G i,j ⪯ (j − i) Σ_{k=i}^{j−1} G k,k+1 ⪯ (j − i) Pn

This inequality says that G i,j is at most (j − i) times the part of the path connecting i to j, and that this part of the path is less than the whole.
Summing this inequality over all edges (i, j) ∈ Kn gives

Kn = Σ_{i<j} G i,j ⪯ Σ_{i<j} (j − i) Pn .

To finish the proof, we compute

Σ_{i<j} (j − i) ≤ Σ_{i<j} n ≤ n³

So

LKn ⪯ n³ · LPn .

Plugging this into Equation (4.6) we obtain

1/n² ≤ λ2 (Pn ).

This only differs from our test vector-based upper bound in Equation (4.3) by a factor 12.
We could make this considerably tighter by being more careful about the sums.

4.2.8 Laplacian Eigenvalues of the Complete Binary Tree

Let’s do the same analysis with the complete binary tree with unit weight edges on n =
2d+1 − 1 vertices, which we denote by Td .
Td is the balanced binary tree on this many vertices, i.e. it consists of a root node, which has
two children, each of those children have two children and so on until we reach a depth of d
from the root, at which point the child vertices have no more children. A simple induction
shows that indeed n = 2d+1 − 1.
We can also describe the edge set by saying that each node i has edges to its children 2i and
2i + 1 whenever the node labels do not exceed n. We emphasize that we still think of the
graph as undirected.

The largest eigenvalue. We'll start by lower bounding λn (LTd ) using a test vector.
We let x (i) = 0 for all nodes that have a child node, and x (i) = −1 for even-numbered leaf nodes and x (i) = +1 for odd-numbered leaf nodes. Note that there are (n + 1)/2 leaf nodes, and every leaf node has a single edge, connecting it to a parent with value 0. Thus

λn (L) = max_{v ≠0} v > Lv / v > v ≥ x > Lx / x > x = ((n + 1)/2) / ((n + 1)/2) = 1.   (4.7)

Meanwhile, every vertex has degree at most 3, so by Claim 4.2.4, λn (L) ≤ 6. So we can
bound the largest eigenvalue above and below by a constant.

λ2 and diameter in any graph. The following lemma gives a simple lower bound on λ2 for any graph.

Lemma 4.2.6. For any unit weight graph G with diameter D,

λ2 (LG ) ≥ 1/(nD).

Proof. We will again prove a lower bound comparing G to the complete graph. For each edge (i, j) ∈ Kn , let G i,j denote a shortest path in G from i to j. This path will have length at most D. So, applying the Path Inequality along each such path, we have

Kn = Σ_{i<j} G i,j ⪯ Σ_{i<j} D · G i,j ⪯ Σ_{i<j} D · G ⪯ n² · D · G.

So, we obtain the bound

n² D λ2 (G) ≥ n,

which implies our desired statement.

λ2 in a tree. Since a complete binary tree Td has diameter 2d ≤ 2 log2 (n), by Lemma 4.2.6, λ2 (LTd ) ≥ 1/(2n log2 (n)).
Let us give an upper bound on λ2 of the tree using a test vector. Let x ∈ R^V have x (1) = 0 and x (i) = −1 for i in the left subtree and x (i) = +1 in the right subtree. Then

λ2 (LTd ) = min_{v ≠0, v > 1=0} v > Lv / v > v ≤ x > Lx / x > x = 2/(n − 1).

So, we have shown 1/(2n log2 (n)) ≤ λ2 (LTd ) ≤ 2/(n − 1), and unlike the previous examples, the gap is more than a constant.
In the exercises for Week 3, I will ask you to improve the lower bound to 1/(cn) for some constant c.

Chapter 5

Conductance, Expanders and Cheeger's Inequality

A common algorithmic problem that arises is the problem of partitioning the vertex set V
of a graph G into clusters X1 , X2 , . . . , Xk such that

• for each i, the induced graph G[Xi ] = (Xi , E ∩ (Xi × Xi )) is ”well-connected”, and

• only an ε-fraction of edges e are not contained in any induced graph G[Xi ] (where ε is a very small constant).

Figure 5.1: After removing the red edges (of which there are few in relation to the total
number of edges), each connected component in G is ”well-connected”.

In this lecture, we make precise what ”well-connected” means by introducing the notions of
conductance and expanders.
Building on the last two lectures, we show that the second eigenvalue of the Laplacian L
associated with graph G can be used to certify that a graph is ”well-connected” (more
precisely the second eigenvalue of a normalized version of the Laplacian). This result, called
Cheeger’s inequality, is one of the key tools in Spectral Graph Theory. Moreover, it can be
turned into an algorithm that computes the partition efficiently!

5.1 Conductance and Expanders


Graph Definitions. In this lecture, we let G = (V, E) be unweighted¹ and always be connected, and let d (v) be the degree of a vertex v in G. We define the volume vol(S) for any vertex subset S ⊆ V to be the sum of degrees, i.e. vol(S) = Σ_{v∈S} d (v).
For any A, B ⊆ V , we define E(A, B) to be the set of edges in E ∩ (A × B), i.e. with one
endpoint in A and one endpoint in B. We let G[A] be the induced graph G by A ⊆ V , which
is the graph G restricted to the vertices A, i.e. an edge e in G is in G[A] iff both endpoints
are in A.

Conductance. Given a set ∅ ⊂ S ⊂ V , we define the conductance φ(S) of S by

φ(S) = |E(S, V \ S)| / min{vol(S), vol(V \ S)}.

It can be seen that φ(·) is symmetric in the sense that φ(S) = φ(V \ S). We define the conductance of the graph G, denoted φ(G), by

φ(G) = min_{∅⊂S⊂V} φ(S).
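For a given cut, conductance is straightforward to compute. Here is a minimal Python sketch (an illustration, not part of the original notes) for an unweighted graph given as an adjacency list:

    def conductance(adj, S):
        # adj: dict mapping each vertex to its set of neighbors;
        # S: a vertex set with 0 < |S| < |V|.
        S = set(S)
        cut = sum(1 for u in S for v in adj[u] if v not in S)  # |E(S, V\S)|
        vol_S = sum(len(adj[u]) for u in S)
        vol_rest = sum(len(adj[u]) for u in adj if u not in S)
        return cut / min(vol_S, vol_rest)

Finding the minimizing set, i.e. φ(G) itself, is the hard part, as noted next.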

We note that finding the conductance of a graph G is NP-hard. However, good approxima-
tions can be found as we will see today (and in a later lecture).

Expander and Expander Decomposition. For any φ ∈ (0, 1], we say that a graph G
is a φ-expander if φ(G) ≥ φ. We say that the partition X1 , X2 , . . . , Xk of the vertex set V is
a φ-expander decomposition of quality q if

• each induced graph G[Xi ] is a φ-expander, and

• the number of edges not contained in any G[Xi ] is at most q · φ · m.


1
Everything we present here also works for weighted graphs, however, we focus on unweighted graphs for
simplicity.

Today, we obtain a φ-expander decomposition of quality q = O(φ^{−1/2} · log n). In a few lectures, we revisit the problem and obtain quality q = O(log^c n) for some small constant c. In practice, we mostly care about values φ ≈ 1.

An Algorithm to Compute Conductance and Expander Decomposition. In this


lecture, the main focus is not to obtain an algorithm to compute conductance but rather
only to show that the conductance of a graph can be approximated using the eigenvalues of
the ”normalized” Laplacian.
However, this proof then gives rise to an algorithm CertifyOrCut(G, φ) that, given a graph G and a parameter φ, either:

• Certifies that G is a φ-expander, or



• Presents a cut S such that φ(S) ≤ 2√φ.

In the graded homework, we ask you to make the procedure CertifyOrCut(G, φ) explicit,
and then to show how to use it to compute a φ-expander decomposition.

5.2 A Lower Bound for Conductance via Eigenvalues


An Alternative Characterization of Conductance. Let us now take a closer look at the definition of conductance and observe that if a set S has vol(S) ≤ vol(V )/2 then

φ(S) = |E(S, V \ S)| / min{vol(S), vol(V \ S)} = |E(S, V \ S)| / vol(S) = 1S> L1S / 1S> D1S .

To see this, observe that we can rewrite the numerator above using the Laplacian of G as

|E(S, V \ S)| = Σ_{(u,v)∈E} (1S (u) − 1S (v))² = 1S> L1S

where 1S is the characteristic vector of S. Further, we can rewrite the denominator as

vol(S) = 1S> d = 1S> D1S

where D = diag(d ) is the degree matrix. We can now alternatively define the graph conductance of G by

φ(G) = min_{∅⊂S⊂V, vol(S)≤vol(V )/2}  1S> L1S / 1S> D1S   (5.1)

where we use that φ(S) = φ(V \ S), so that the objective value is unchanged as long as for each set ∅ ⊂ S ⊂ V either S or V \ S is in the set that we minimize over.

The Normalized Laplacian. Let us next define the normalized Laplacian

N = D^{−1/2} L D^{−1/2} = I − D^{−1/2} A D^{−1/2} .

To learn a bit about this new matrix, let us first look at the first eigenvalue, where we use the test vector y = D^{1/2} 1 to get, by Courant-Fischer (see Theorem 4.1.2), that

λ1 (N ) = min_{x ≠0} x > N x / x > x ≤ y > D^{−1/2} L D^{−1/2} y / y > y = 1> L1 / y > y = 0   (5.2)

because D^{−1/2} D^{1/2} = I and L1 = 0 (for the former we use the assumption that G is connected). Since N is PSD (as you will show in the exercises), we also know λ1 (N ) ≥ 0, so λ1 (N ) = 0.
Let us use Courant-Fischer again to reason a bit about the second eigenvalue of N :

λ2 (N ) = min_{x ⊥D^{1/2}1, x ≠0} x > N x / x > x = min_{z ⊥d , z ≠0} z > D^{1/2} N D^{1/2} z / z > D^{1/2} D^{1/2} z = min_{z ⊥d , z ≠0} z > Lz / z > Dz .   (5.3)

Relating Conductance to the Normalized Laplacian. At this point, it might become clearer why N is a natural matrix to consider when arguing about conductance: if we could argue that for every ∅ ⊂ S ⊂ V with vol(S) ≤ vol(V )/2 we have 1S ⊥ d , then it would be easy to see that taking the second eigenvalue of N in Equation (5.3) is a relaxation of the minimization problem (5.1) defining φ(G).
While this is clearly not true, we can still argue along these lines.

Theorem 5.2.1 (Cheeger's Inequality, Lower Bound). We have λ2 (N )/2 ≤ φ(G).

Proof. Instead of using 1S directly, we shift 1S by a multiple of 1 such that it is orthogonal to d : we define z S = 1S − α1 where α is the scalar that solves

0 = d > z S ⟺ 0 = d > (1S − α1) ⟺ 0 = d > 1S − α d > 1 ⟺ α = d > 1S / d > 1 = vol(S)/vol(V ).

To conclude the proof, it remains to argue that 1S> L1S / 1S> D1S ≥ (1/2) · z S> Lz S / z S> Dz S :

• Numerator: since L1 = 0, we have that 1S> L1S = z S> Lz S .

• Denominator: observe by straightforward calculations that

z S> Dz S = vol(S) · (1 − α)² + vol(V \ S) · (−α)²
         = vol(S) − 2 vol(S) · α + vol(V ) · α²
         = vol(S) − vol(S)²/vol(V )
         = vol(S) − vol(S) · (vol(S)/vol(V ))
         ≥ (1/2) vol(S) = (1/2) 1S> D1S

where we use the assumption that vol(S) ≤ vol(V )/2.

5.3 An Upper Bound for Conductance via Eigenvalues


Slightly more surprisingly, we can also show that the second eigenvalue λ2 (N ) can be used to upper bound the conductance.

Theorem 5.3.1 (Cheeger's Inequality, Upper Bound). We have φ(G) ≤ √(2 · λ2 (N )).

Proof. To prove the theorem, we want to show that for any z ⊥ d , we can find a set ∅ ⊂ S ⊂ V with vol(S) ≤ vol(V )/2, such that

1S> L1S / 1S> D1S ≤ √( 2 · z > Lz / z > Dz ).   (5.4)

As a first step, we would like to change z slightly to make it more convenient to work with:

• we renumber the vertices in V such that we have

z (1) ≤ z (2) ≤ · · · ≤ z (n).

• we center z , that is we let z c = z − α1 where α is chosen such that

Σ_{z c (i)<0} d (i) < vol(V )/2  and  Σ_{z c (i)≤0} d (i) ≥ vol(V )/2,

i.e. Σ_{z c (i)>0} d (i) ≤ vol(V )/2.

• we scale: let z sc = βz c for some scalar β such that z sc (1)² + z sc (n)² = 1.

In the exercises, you will show that changing z to z sc can only make the ratio we are interested in smaller, i.e. that z > Lz / z > Dz ≥ z sc> Lz sc / z sc> Dz sc . Thus, if we can show that Equation (5.4) holds for z sc in place of z , then it also follows for z itself.
We now arrive at the main idea of the proof: we define the set Sτ = {i ∈ V | z sc (i) < τ } for some random variable τ with probability density function

p(t) = 2 · |t| for t ∈ [z sc (1), z sc (n)], and p(t) = 0 otherwise.   (5.5)

So, we have probability P[a < τ < b] = ∫_{t=a}^{b} p(t) dt.
Since the volume incident to Sτ might be quite large, let us define S for convenience by

S = Sτ if vol(Sτ ) < vol(V )/2, and S = V \ Sτ otherwise.

Claim 5.3.2. We have  Eτ [1S> L1S ] / Eτ [1S> D1S ]  ≤  √( 2 · z sc> Lz sc / z sc> Dz sc ).

Proof. Recall 1S> L1S = |E(Sτ , V \ Sτ )|, and by choice of τ , we have for any edge e = {i, j} ∈ E where z sc (i) ≤ z sc (j),

Pτ [e ∈ E(Sτ , V \ Sτ )] = Pτ [z sc (i) < τ ≤ z sc (j)] = ∫_{t=z sc (i)}^{z sc (j)} 2|t| dt = sgn(z sc (j)) · z sc (j)² − sgn(z sc (i)) · z sc (i)².

Distinguishing by cases, we get

sgn(z sc (j)) · z sc (j)² − sgn(z sc (i)) · z sc (i)² = |z sc (i)² − z sc (j)²| if sgn(z sc (i)) = sgn(z sc (j)), and z sc (i)² + z sc (j)² otherwise.

We can further upper bound either case by |z sc (i) − z sc (j)| · (|z sc (i)| + |z sc (j)|) (we leave this as an exercise).
Using our new upper bound, we can sum over all edges e ∈ E to conclude that

Eτ [|E(Sτ , V \ Sτ )|] ≤ Σ_{i∼j} |z sc (i) − z sc (j)| · (|z sc (i)| + |z sc (j)|)
                      ≤ √( Σ_{i∼j} (z sc (i) − z sc (j))² ) · √( Σ_{i∼j} (|z sc (i)| + |z sc (j)|)² )

where the last line follows from ⟨x , y ⟩² ≤ ⟨x , x ⟩ · ⟨y , y ⟩ (i.e. Cauchy-Schwarz).

The first sum should look familiar by now: it is simply the Laplacian quadratic form Σ_{i∼j} (z sc (i) − z sc (j))² = z sc> Lz sc .
It is not hard to reason about the second term either:

Σ_{i∼j} (|z sc (i)| + |z sc (j)|)² ≤ 2 Σ_{i∼j} ( z sc (i)² + z sc (j)² ) = 2 Σ_{i∈V} d (i) z sc (i)² = 2 z sc> Dz sc .

Putting everything together, we obtain

Eτ [|E(Sτ , V \ Sτ )|] ≤ √( z sc> Lz sc · 2 z sc> Dz sc ) = √( 2 · z sc> Lz sc / z sc> Dz sc ) · z sc> Dz sc   (5.6)

While this almost looks like what we want, we still have to argue that z sc> Dz sc = Eτ [1S> D1S ] to finish the proof.
To this end, when unrolling the expectation, we use a simple trick that splits by cases:

Eτ [1S> D1S ] = Σ_{i∈V} d (i) · P[i ∈ S]
= Σ_{i∈V, z sc (i)<0} d (i) · P[i ∈ S ∧ S = Sτ ] + Σ_{i∈V, z sc (i)≥0} d (i) · P[i ∈ S ∧ S ≠ Sτ ]
= Σ_{i∈V, z sc (i)<0} d (i) · P[z sc (i) < τ ∧ τ < 0] + Σ_{i∈V, z sc (i)≥0} d (i) · P[z sc (i) ≥ τ ∧ τ ≥ 0]

where we use the centering of z sc , the definition of S, and the fact that the event {i ∈ S ∧ S = Sτ } can be rewritten as the event {z sc (i) < τ ∧ τ < 0} (the other case is analogous).
Let i be a vertex with z sc (i) < 0; then the probability P[i ∈ S ∧ S = Sτ ] is exactly z sc (i)² by choice of the density function of τ (again, the case for i with z sc (i) non-negative is analogous). Thus, summing over all vertices, we obtain

Eτ [1S> D1S ] = Σ_{i∈V, z sc (i)<0} d (i) · P[z sc (i) < τ ∧ τ < 0] + Σ_{i∈V, z sc (i)≥0} d (i) · P[z sc (i) ≥ τ ∧ τ ≥ 0]
             = Σ_{i∈V} d (i) · z sc (i)² = z sc> Dz sc .

Therefore, we can plug our result directly into Equation (5.6), and the proof is completed by dividing both sides by Eτ [1S> D1S ].

While Claim 5.3.2 only ensures our claim in expectation, this is already sufficient to conclude that there exists some set S that satisfies the same guarantees deterministically, as you will prove in Problem Set 4. This is often called the probabilistic method, and it can be seen from the definition of expectation. We have thus proven the upper bound of Cheeger's inequality.

5.4 Conclusion
Today, we have introduced the concepts of conductance and formalized expanders and ex-
pander decompositions. These are crucial concepts that you will encounter often in literature
and also again in this course. They are a key tool in many recent breakthroughs in Theo-
retical Computer Science.
In the second part of the lecture (the main part), we discussed Cheeger's inequality, which allows us to relate the second eigenvalue of the normalized Laplacian to a graph's conductance. We summarize the full statement here.

Theorem 5.4.1 (Cheeger's Inequality). We have λ2 (N )/2 ≤ φ(G) ≤ √(2 · λ2 (N )).

We point out that this Theorem is tight, as you will show in the exercises. The proof of Cheeger's inequality is probably the most advanced proof we have seen so far in the course. The many tricks that make the proof work might sometimes seem a bit magical, but it is important to remember that they are the result of many people polishing this proof over and over. The proof techniques used are extremely useful and can be re-used in various contexts. We therefore strongly encourage you to really understand the proof yourself!
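The proof of the upper bound is constructive and suggests the classic "sweep cut" algorithm: compute the second eigenvector of N , rescale it by D^{−1/2} as in Equation (5.3), sort the vertices by their entries, and return the best prefix cut. The following Python/NumPy sketch is only an illustration of this idea (making it rigorous, and turning it into CertifyOrCut, is part of the graded homework):

    import numpy as np

    def sweep_cut(A):
        # A: symmetric adjacency matrix of a connected unweighted graph.
        d = A.sum(axis=1)
        Dis = np.diag(1.0 / np.sqrt(d))
        N = np.eye(len(d)) - Dis @ A @ Dis
        _, vecs = np.linalg.eigh(N)        # eigenvalues in ascending order
        z = Dis @ vecs[:, 1]               # z achieving z'Lz / z'Dz = nu_2
        order = np.argsort(z)
        best_S, best_phi = None, float("inf")
        for k in range(1, len(d)):         # sweep over all prefix cuts
            S = set(order[:k].tolist())
            cut = sum(A[u, v] for u in S
                      for v in range(len(d)) if v not in S)
            volS = d[order[:k]].sum()
            phi = cut / min(volS, d.sum() - volS)
            if phi < best_phi:
                best_S, best_phi = S, phi
        return best_S, best_phi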

Chapter 6

Random Walks

Today, we talk about random walks on graphs and how the spectrum of the Laplacian
guides convergence of random walks. We start by giving the definition of a random walk on
a weighted graph G = (V, E, w).

6.1 A Primer on Random Walks


Random Walk Basics. We call a random sequence of vertices v0 , v1 , . . . a random walk on G if v0 is a vertex in G chosen according to some probability distribution p 0 ∈ R^V , and for any t ≥ 0, we have

P[vt+1 = v | vt = u] = w (u, v)/d (u) if {u, v} ∈ E, and 0 otherwise.

To gain some intuition for the definition, assume first that the graph G is unweighted. Consider a particle that is placed at a random vertex v0 initially. Then at each step the particle
is moved to a neighbor of the current vertex it is resting at, where the neighbor is chosen
uniformly at random.
If the graph is weighted, then instead of choosing a neighbor vt+1 of a vertex vt at each step
uniformly at random, one chooses a neighbor v of vt with probability w (v, vt ) divided by the
degree d (vt ).

The Random Walk Matrix. We now define the random walk matrix W by

W = AD^{−1}

and observe that for all vertices u, v ∈ V (and any t), we have that

W vu = w (u, v)/d (u) if {u, v} ∈ E, and 0 otherwise.

Figure 6.1: A (possibly random) walk where the red edges indicate the edges that the particle
moves along. Here the walk visits the vertices v0 = 1, v1 = 2, v2 = 3, v3 = 2, v4 = 3, v5 = 4.

Thus, W vu = P[vt+1 = v |vt = u] (for any t).


Therefore, W 1u is the distribution of the vertex that the random walk visits at the next time step, given that it is currently at u. More generally, we can compute the distribution p 1 over the vertices visited at time 1 by W p 0 , the distribution p 2 by W p 1 = W (W p 0 ), and so on. Another way of writing this is p t = W t p 0 .
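Propagating a distribution under W is a single matrix-vector product per step, as in this small illustrative Python/NumPy sketch (not from the original notes):

    import numpy as np

    def walk_distribution(A, p0, t):
        # A: weighted symmetric adjacency matrix. Column u of W = A D^{-1}
        # is the transition distribution out of vertex u.
        W = A / A.sum(axis=0)
        p = p0.copy()
        for _ in range(t):
            p = W @ p
        return p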

6.2 Convergence Results for Random Walks


In this first part of the chapter, we are mostly interested in convergence of random walks, that is, in the two questions:

• How does a random walk behave after a large number of steps are taken?

• How many steps does it take asymptotically until the random walk behaves as if an
infinite number of steps were taken?

To start shedding some light on these questions, we introduce stationary distributions.

Stationary Distribution. We call a distribution π ∈ R^V a stationary distribution if W π = π. That is, π is an eigenvector of W associated with eigenvalue 1. It turns out such a stationary distribution always exists.

Lemma 6.2.1. Every graph G has a stationary distribution.

Proof. Let π = d /(1> d ). Clearly, we have that ‖π‖1 = Σ_{v∈V} d (v)/(1> d ) = (1/(1> d )) Σ_{v∈V} d (v) = 1, so π is indeed a distribution. Further note that

W π = AD^{−1} · d /(1> d ) = A1/(1> d ) = d /(1> d ) = π.
1 d 1 d

For many graphs one can show that for t → ∞, we have that p t → π, i.e. that independent
of the starting distribution p 0 , the random walk always converges to distribution π.
Unfortunately, this is not true for all graphs: take the graph of two vertices connected by a
single edge with p 0 being 1 at one vertex and 0 at the other.

6.2.1 Making Random Walks Lazy

Lazy Random Walks. Luckily, we can overcome this issue by using a lazy random walk.
A lazy random walk behaves just like a random walk; however, at each time step, with probability 1/2, instead of transitioning to a neighbor, it simply stays put. We give the lazy random walk matrix by

W̃ = (1/2) I + (1/2) W = (1/2) I + (1/2) AD^{−1} .

It is not hard to see that the stationary distribution π for W is also a stationary distribution for W̃.

Figure 6.2: A lazy random walk where the red edges indicate the edges that the particle moves
along. Here the lazy walk visits the vertices v0 = 1, v1 = 2, v2 = 2, v3 = 3, v4 = 3, v5 = 2.

Lazy Random Walks and the Normalized Laplacian. Recall that we defined N = D^{−1/2} LD^{−1/2} = I − D^{−1/2} AD^{−1/2} ⟺ D^{−1/2} AD^{−1/2} = I − N . We can therefore derive

W̃ = (1/2) I + (1/2) AD^{−1}
   = (1/2) I + (1/2) D^{1/2} D^{−1/2} AD^{−1/2} D^{−1/2}
   = (1/2) I + (1/2) D^{1/2} (I − N ) D^{−1/2}
   = (1/2) I + (1/2) D^{1/2} I D^{−1/2} − (1/2) D^{1/2} N D^{−1/2}
   = I − (1/2) D^{1/2} N D^{−1/2}

We will now start to reason about the eigenvalues and eigenvectors of W̃ in terms of the normalized Laplacian N that we are already familiar with.
For the rest of the lecture, we let ν1 ≤ ν2 ≤ · · · ≤ νn be the eigenvalues of N associated with the orthogonal eigenvectors ψ 1 , ψ 2 , . . . , ψ n , where we know that such eigenvectors exist by the Spectral Theorem. We note in particular that from the last lecture, we have that ψ 1 = d ^{1/2}/(1> d )^{1/2} (see Equation (5.2), where we added a normalization such that ψ 1> ψ 1 = 1).

Lemma 6.2.2. For the ith eigenvalue νi of N associated with eigenvector ψ i , we have that W̃ has an eigenvalue (1 − νi /2) associated with eigenvector D^{1/2} ψ i .

Proof. The proof is by straightforward calculations:

W̃ D^{1/2} ψ i = ( I − (1/2) D^{1/2} N D^{−1/2} ) D^{1/2} ψ i
             = D^{1/2} ψ i − (1/2) D^{1/2} N ψ i
             = D^{1/2} ψ i − (1/2) D^{1/2} ψ i νi = D^{1/2} ψ i (1 − νi /2).

Corollary 6.2.3. Every eigenvalue of W̃ is in [0, 1].

Proof. Recall that L ⪯ 2D, which implies that N ⪯ 2I . But this implies that every eigenvalue of N is in [0, 2]. Thus, using Lemma 6.2.2, the corollary follows.

6.2.2 Convergence of Lazy Random Walks

We have now done enough work to obtain an interesting result. We can derive an alternative
characterization of p t by expanding p 0 along an orthogonal eigenvector basis, and then we
can repeatedly apply W̃ by taking powers of the eigenvalues.

Unfortunately, W̃ is not symmetric, so its eigenvectors are not necessarily orthogonal. Instead, we use a simple trick that allows us to expand along the eigenvectors of N :

∀i, ψ i> D^{−1/2} p 0 = αi  ⟺  D^{−1/2} p 0 = Σ_{i=1}^{n} αi ψ i  ⟺  p 0 = Σ_{i=1}^{n} αi D^{1/2} ψ i .   (6.1)

The above equivalences are best understood if you start from the middle. To get to the
left side, you need to observe that multiplying both sides by ψ > i cancels all terms ψ j with
j 6= i in the sum by orthogonality. To get the right hand side expression, one can simply
left-multiply by D 1/2 . Technically, we have to show that D −1/2 p 0 lives in the eigenspace of
N but we leave this as an exercise.
This allows us to express a right multiplication by W̃ as

p 1 = W̃ p 0 = Σ_{i=1}^{n} αi W̃ D^{1/2} ψ i = Σ_{i=1}^{n} αi (1 − νi /2) D^{1/2} ψ i .

And as promised, if we apply W̃, the lazy random walk operator, t times, we now obtain

p t = Σ_{i=1}^{n} αi (1 − νi /2)^t D^{1/2} ψ i = α1 D^{1/2} ψ 1 + Σ_{i=2}^{n} αi (1 − νi /2)^t D^{1/2} ψ i .   (6.2)

where we use in the last equality that ν1 = 0. Using this simple characterization, we immediately get that p t → π if νi > 0 for all i ≥ 2 (which is exactly when the graph is connected, as you will prove in an exercise). To see this, observe that as t grows, the sum vanishes. We have that

lim_{t→∞} p t = α1 D^{1/2} ψ 1 = π,

where we used in the equality that D^{1/2} ψ 1 = d /(1> d )^{1/2} and the value of α1 (from Equation (6.1)).

Theorem 6.2.4. For any connected graph G, we have that the lazy random walk converges
to the stationary distribution of G.

6.2.3 The Rate of Convergence

Let us now come to the main result that we want to prove in this lecture.
Theorem 6.2.5. In any unweighted (a.k.a. unit weight) connected graph G, for any p 0 , at any time step t, we have for p t = W̃^t p 0 that

‖p t − π‖∞ ≤ e^{−ν2 ·t/2} · n

Instead of proving the theorem above, we prove the lemma below which gives point-wise
convergence. This makes it more convenient to derive a proof and it is not hard to deduce
the theorem above as a corollary.

Lemma 6.2.6. In any weighted connected graph G, for all a, b ∈ V , and any time step t, we have for p 0 = 1a and p t = W̃^t p 0 that

|p t (b) − π(b)| ≤ e^{−ν2 ·t/2} √(d b /d a )

From Equation (6.2), we obtain that

p t (b) − π(b) = 1b> (p t − π) = 1b> ( Σ_{i=2}^{n} αi (1 − νi /2)^t D^{1/2} ψ i )   (6.3)
= Σ_{i=2}^{n} αi (1 − νi /2)^t 1b> D^{1/2} ψ i ≤ (1 − ν2 /2)^t · Σ_{i=2}^{n} αi 1b> D^{1/2} ψ i   (6.4)

Taking the absolute value on both sides, we obtain that

|p t (b) − π(b)| ≤ (1 − ν2 /2)^t Σ_{i=2}^{n} | αi 1b> D^{1/2} ψ i | ≤ (1 − ν2 /2)^t √( Σ_{i=2}^{n} αi² ) · √( Σ_{i=2}^{n} ( 1b> D^{1/2} ψ i )² )

where we use Cauchy-Schwarz in the last inequality, i.e. |⟨u, v ⟩|² ≤ ⟨u, u⟩ · ⟨v , v ⟩. Let us finally bound the two sums:

• By Equation (6.1), Σ_{i=2}^{n} αi² = Σ_{i=2}^{n} ( ψ i> D^{−1/2} p 0 )² ≤ ‖D^{−1/2} p 0 ‖2² = ‖D^{−1/2} 1a ‖2² = 1/d a .

• Finally, we show that Σ_{i=2}^{n} ( 1b> D^{1/2} ψ i )² ≤ Σ_{i=1}^{n} ( 1b> D^{1/2} ψ i )² = ‖D^{1/2} 1b ‖2² = d b (we only show the first equality; the other steps are straightforward). We first expand the vector D^{1/2} 1b along the eigenvectors using some values βi defined by

D^{1/2} 1b = Σ_{i=1}^{n} βi ψ i  ⟺  ψ i> D^{1/2} 1b = βi  ⟺  1b> D^{1/2} ψ i = βi

We used orthogonality to get the first equivalence, and then just take the transpose to get the second. We can now write

‖D^{1/2} 1b ‖2² = (D^{1/2} 1b )> (D^{1/2} 1b ) = ( Σ_{i=1}^{n} βi ψ i )> ( Σ_{i=1}^{n} βi ψ i ) = Σ_{i=1}^{n} βi²

where we again used orthogonality of the ψ i . The equality then follows by definition of βi .

Putting everything together (and using 1 + x ≤ e^x , ∀x ∈ R), we obtain

|p t (b) − π(b)| ≤ (1 − ν2 /2)^t √(d b /d a ) ≤ e^{−ν2 ·t/2} √(d b /d a )

6.3 Properties of Random Walks
We now shift our focus away from convergence of random walks and consider some interest-
ing properties of random walks1 . Here, we are no longer interested in lazy random walks,
although all proofs can be straight-forwardly adapted. While in the previous section, we
relied on computing the second eigenvalue of the Normalized Laplacian efficiently, here, we
will discover that solving Laplacian systems, that is finding an x such that Lx = b can solve
a host of problems in random walks.

6.3.1 Hitting Times

One of the most natural questions one can ask about a random walk starting in a vertex a
(i.e. p 0 = 1a ) is how many steps it takes to get to a special vertex s. This quantity is called
the hitting time from a to s and we denote it by Ha,s = inf{t | v t = s}. For the rest of this
section, we are concerned with computing the expected hitting time, i.e. E[Ha,s ].
It turns out, that it is more convenient to compute all expected hitting times Ha,s for vertices
a ∈ V to a fixed s. We denote by h ∈ RV , the vector with h(a) = E[Ha,s ]. We now show
that we can compute h by solving a Laplacian system Lh = b. We will see later in the
course that such systems (spoiler alert!) can be solved in time Õ(m), so this will imply a
near-linear time algorithm to compute the hitting times.

Hitting Time and the Random Walk Matrix. Let us first observe that if s = a, then
the answer becomes trivially 0, i.e. h(s) = 0.
We compute the rest of the vector by writing down a system of equations that recursively
characterizes h. Observe therefore first that for any a 6= s, we have that the random walks
starting at a will next visit a neighbor b of a. If the selected neighbor b = s, the random
walks stops; otherwise, the random walks needs in expectation E[Hb,s ] time to move to s.
We can express this algebraically by
X X w (a, b)
>
h a = 1+ P[vt+1 = b |vt = a]·h(b) = 1+ ·h(b) = 1+(W 1a )> h = 1+1>
a W h.
a∼b a∼b
d (a)

Using that h(a) = 1a> h = 1a> I h, we can rewrite this as

1 = 1a> (I − W > )h.

This gives a system of (linear) equations that can be neatly summarized by

1 − α · 1s = (I − W > )h
1
Note that this part of the script is in large part inspired by Aaron Sidford’s lecture notes on the same
subject, in his really interesting course Discrete Mathematics and Algorithms.

where we have an extra degree of freedom in choosing α when formulating the constraint 1 − α = 1s> (I − W > )h. This extra degree of freedom stems from the fact that n − 1 equations suffice for us to enforce that the returned vector x from the system is indeed the hitting times (possibly shifted by the value assigned to coordinate s).

Finding Hitting Times via Laplacian System Solve. Since we assume G is connected, multiplying with D = D > preserves equality. Further, since W = AD^{−1} , we obtain

d − α · d (s) · 1s = (D − A)h.

Defining b = d − α · d (s) · 1s , and observing L = D − A, we have Lh = b.


Finally, we observe that we only have a solution to the above system if and only if b ∈
ker(L)⊥ = span(1)⊥ . We thus have to set α such that

1> (d − α · d (s) · 1s ) = kd k1 − α · d (s) ⇐⇒ α = kd k1 /d (s).

We have now formalized L and b completely. A last detail that we should not forget about is that any solution x to such a system Lx = b is not necessarily equal to h, but has the property that it is shifted from h by a multiple of the all-ones vector. Since we require h(s) = 0, we can reconstruct h from x straightforwardly by subtracting x (s)1.

Theorem 6.3.1. Given a connected graph G and a special vertex s ∈ V , we can formulate a Laplacian system Lx = b (where L is the Laplacian of G) such that the expected hitting times to s are given by h = x − x (s)1. We can reconstruct h from x in time O(n).
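A direct (dense) Python/NumPy sketch of this recipe is below; it is only an illustration — the point of the theorem is that a fast Laplacian solver can replace the dense solve. We use a least-squares solve to pick some solution of the singular system:

    import numpy as np

    def hitting_times(A, s):
        # A: weighted symmetric adjacency matrix of a connected graph.
        d = A.sum(axis=1)
        L = np.diag(d) - A
        b = d.copy()
        b[s] -= d.sum()     # b = d - alpha * d(s) * 1_s with alpha = ||d||_1 / d(s)
        x = np.linalg.lstsq(L, b, rcond=None)[0]
        return x - x[s]     # normalize so that h(s) = 0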

Hitting Times and Electrical Networks. We have seen that hitting times can be computed by formulating a Laplacian system Lx = b. You might remember that in the first lecture, we argued that a system Lx = b also solves the problem of routing a demand b via an electrical flow with voltages x .
Indeed, we can interpret computing expected hitting times h to a special vertex s as the problem of computing the electrical voltages x where we insert (or, more technically correct, apply) d (a) units of current at every vertex a ≠ s and where we remove 1> d − d (s) units of current at the vertex s. Then, we can express the expected hitting time from any vertex a as the voltage difference to s: E[Ha,s ] = h(a) = x (a) − x (s).

6.3.2 Commute Time

A topic closely related to hitting times is commute times. That is, for two vertices a, b, the commute time is the time a random walk starting at a takes to visit b and then to return to a again. Thus, it can be defined as Ca,b = Ha,b + Hb,a .

Commute Times via Electric Flows. Recall that expected hitting times have an electric flow interpretation.
Now, let us denote by x a solution to the Laplacian system Lx = b b where the demand is b b = d − (d > 1) · 1b ⊥ 1. Recall that we have E[Hz,b ] = x (z) − x (b) for all z. Similarly, we can compute voltages y for the Laplacian system Ly = b a where b a = d − (d > 1) · 1a ⊥ 1. Again, E[Hz,a ] = y (z) − y (a) for all z.
But observe that this allows us to argue by linearity that E[Ca,b ] = E[Ha,b + Hb,a ] = E[Ha,b ] + E[Hb,a ] = x (a) − x (b) + y (b) − y (a) = (1a − 1b )> (x − y ). But using linearity again, we can also argue that we obtain the vector z = x − y as a solution to the problem Lz = b b − b a = (d > 1)(1a − 1b ). That is, z gives the voltages of the electrical flow that routes ‖d ‖1 units of flow from b to a.

Theorem 6.3.2. Given a graph G = (V, E), for any two fixed vertices a, b ∈ V , the expected commute time Ca,b is given by the voltage difference between a and b for any solution z to the Laplacian system Lz = ‖d ‖1 · (1b − 1a ).

We note that the voltage difference between a and b in an electrical flow routing demand 1b − 1a is also called the effective resistance Reff (a, b). This quantity will play a crucial role in the coming lectures. In the next lecture, we introduce Reff (a, b) slightly differently, as the energy required by the electrical flow that routes 1b − 1a ; however, it is not hard to show that these two definitions are equivalent.
Our theorem can now be restated as saying that the expected commute time E[Ca,b ] = ‖d ‖1 · Reff (a, b). This is a classic result.

Chapter 7

Pseudo-inverses and Effective Resistance

7.1 What is a (Moore-Penrose) Pseudoinverse?


Recall that for a connected graph G with Laplacian L, we have ker(L) = span{1}, which
means L is not invertible. However, we still want some matrix which behaves like a real
inverse. To be more specific, given a Laplacian L ∈ RV ×V , we want some matrix L+ ∈ RV ×V
s.t.

1) (L+ )> = L+ (symmetric)


2) L+ 1 = 0, or more generally, L+ v = 0 for v ∈ ker(L)
3) L+ Lv = LL+ v = v for v ⊥ 1, or more generally, for v ∈ ker(L)⊥

Under the above conditions, L+ is uniquely defined and we call it the pseudoinverse of L.
Note that there are many other equivalent definitions of the pseudoinverse of some matrix
A, and we can also generalize the concept to matrices that aren’t symmetric or even square.
Let λi , v i be the i-th pair of eigenvalue and eigenvector of L, with {v i }_{i=1}^{n} forming an orthonormal basis. Then by the spectral theorem,

L = V ΛV > = Σ_i λi v i v i> ,

where V = ( v 1 · · · v n ) and Λ = diag{λ1 , . . . , λn }. And we can show that its pseudoinverse is exactly

L+ = Σ_{i : λi ≠0} λi^{−1} v i v i> .

Checking conditions 1), 2), 3) is immediate. We can also prove uniqueness, but this takes
slightly more work.
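For example, one can build L^+ from the spectral decomposition and check the three conditions numerically. The following Python/NumPy sketch (with an illustrative example graph) does this:

import numpy as np

W = np.array([[0., 1., 0.],
              [1., 0., 2.],
              [0., 2., 0.]])           # small connected example graph
L = np.diag(W.sum(axis=1)) - W

lam, V = np.linalg.eigh(L)             # spectral decomposition, L = V diag(lam) V^T
inv = np.array([1.0 / l if abs(l) > 1e-9 else 0.0 for l in lam])
Lplus = V @ np.diag(inv) @ V.T         # L^+ = sum over nonzero eigenvalues

one = np.ones(3)
print(np.allclose(Lplus, Lplus.T))     # condition 1: symmetric
print(np.allclose(Lplus @ one, 0))     # condition 2: L^+ 1 = 0
v = np.array([1., -2., 1.])            # a vector orthogonal to 1
print(np.allclose(Lplus @ L @ v, v))   # condition 3: L^+ L v = v for v in ker(L)^perp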

7.2 Electrical Flows Again
Recall the incidence matrix B ∈ RV ×E of a graph G = (V, E).

Figure 7.1: An example of a graph and its incidence matrix B.

In Chapter 1, we introduced the electrical flow routing demand d ∈ R^V. Let's call the electrical flow f̃ ∈ R^E. The net flow constraint requires B f̃ = d. By Ohm's Law, f̃ = R^{-1} B^⊤ x for some voltage x ∈ R^V, where R = diag(r) and r(e) is the resistance of edge e. We showed (in the exercises) that when d ⊥ 1, there exists a voltage x̃ ⊥ 1 s.t. f̃ = R^{-1} B^⊤ x̃ and B f̃ = d. This x̃ solves Lx = d where L = B R^{-1} B^⊤.
And we also made the following claim.

Claim 7.2.1.

    f̃ = argmin_{Bf = d} f^⊤ R f,   where f^⊤ R f = Σ_e r(e) f(e)².    (7.1)

You proved this in the exercises for Week 1. Let’s recap the proof briefly, just to get back
into thinking about electrical flows.

Proof. Consider any f ∈ R^E s.t. Bf = d. For any x ∈ R^V, we have

    (1/2) f^⊤ R f = (1/2) f^⊤ R f − x^⊤(Bf − d)                          [the subtracted term is 0]
                  ≥ min_{f ∈ R^E} (1/2) f^⊤ R f − x^⊤ Bf + d^⊤ x         [call the minimand g(f)]
                  = d^⊤ x − (1/2) x^⊤ L x,

since ∇_f g(f) = 0 gives us f = R^{-1} B^⊤ x. Thus, for all f ∈ R^E s.t. Bf = d and all x ∈ R^V,

    (1/2) f^⊤ R f ≥ d^⊤ x − (1/2) x^⊤ L x.    (7.2)

But for the electrical flow f̃ and electrical voltage x̃, we have f̃ = R^{-1} B^⊤ x̃ and L x̃ = d. So

    f̃^⊤ R f̃ = (R^{-1} B^⊤ x̃)^⊤ R (R^{-1} B^⊤ x̃) = x̃^⊤ B R^{-1} B^⊤ x̃ = x̃^⊤ L x̃ = x̃^⊤ d.

Therefore,

    (1/2) f̃^⊤ R f̃ = d^⊤ x̃ − (1/2) x̃^⊤ L x̃.    (7.3)

By combining Equation (7.2) and Equation (7.3), we see that for all f s.t. Bf = d,

    (1/2) f^⊤ R f ≥ d^⊤ x̃ − (1/2) x̃^⊤ L x̃ = (1/2) f̃^⊤ R f̃.

Thus f̃ is the minimum electrical energy flow among all flows that route demand d, proving that Equation (7.1) holds.
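The following Python/NumPy sketch (with an illustrative example graph) checks this numerically: it computes the electrical flow and compares its energy against random feasible flows, which differ from f̃ by circulations in ker(B):

import numpy as np

rng = np.random.default_rng(1)
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]   # example graph on 4 vertices
r = np.array([1., 2., 1., 3., 1.])                 # illustrative resistances
n, m = 4, len(edges)

B = np.zeros((n, m))
for k, (u, v) in enumerate(edges):                 # column b_e = e_u - e_v
    B[u, k], B[v, k] = 1.0, -1.0
R = np.diag(r)
L = B @ np.linalg.inv(R) @ B.T

d = np.array([1., 0., 0., -1.])                    # route one unit between vertices 3 and 0
x = np.linalg.pinv(L) @ d                          # electrical voltages
f_el = np.linalg.inv(R) @ B.T @ x                  # electrical flow, f = R^{-1} B^T x
assert np.allclose(B @ f_el, d)

# Any other feasible flow differs by a circulation in ker(B);
# rows of Vt beyond rank(B) = n - 1 = 3 span that kernel.
U, S, Vt = np.linalg.svd(B)
null = Vt[3:].T
for _ in range(5):
    f = f_el + null @ rng.standard_normal(2)
    assert f @ R @ f >= f_el @ R @ f_el - 1e-9     # never less energy
print("electrical flow energy:", f_el @ R @ f_el)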

7.3 Effective Resistance


Given a graph G = (V, E), for any pair of vertices (a, b) ∈ V, we want to compute the cost (or energy) of routing 1 unit of current from a to b. We call this cost the effective resistance between a and b, denoted by R_eff(a, b). Recall that for a single resistor r(a, b) carrying one unit of flow,

    energy = r(a, b) f(a, b)² = r(a, b).

So when we have a graph consisting of just one edge (a, b), the effective resistance is just R_eff(a, b) = r(a, b).
In a general graph, we can also consider the energy required to route one unit of current between two vertices. For any pair a, b ∈ V, we have

    R_eff(a, b) = min_{Bf = e_b − e_a} f^⊤ R f,

where e_v ∈ R^V is the indicator vector of v. Note that the cost of routing F units of flow from a to b will be R_eff(a, b) · F².

Since (e_b − e_a)^⊤ 1 = 0, we know from the previous section that R_eff(a, b) = f̃^⊤ R f̃ where f̃ is the electrical flow. Now we can write L x̃ = e_b − e_a and x̃ = L^+(e_b − e_a) for the electrical voltages routing 1 unit of current from a to b. The energy of routing 1 unit of current from a to b is then

    R_eff(a, b) = f̃^⊤ R f̃ = x̃^⊤ L x̃ = (e_b − e_a)^⊤ L^+ L L^+ (e_b − e_a) = (e_b − e_a)^⊤ L^+ (e_b − e_a),

where the last equality is due to L^+ L L^+ = L^+.

Remark 7.3.1. We have now seen several different expressions that all take on the same value: the energy of the electrical flow. It's useful to remind yourself what these are. Consider an electrical flow f̃ routing demand d, with associated electrical voltages x̃. We know that B f̃ = d, and f̃ = R^{-1} B^⊤ x̃, and L x̃ = d, where L = B R^{-1} B^⊤. And we have seen how to express the electrical energy using many different quantities:

    f̃^⊤ R f̃ = x̃^⊤ L x̃ = d^⊤ L^+ d = d^⊤ x̃ = f̃^⊤ B^⊤ x̃.

Claim 7.3.2. Any PSD matrix A has a PSD square root A^{1/2} s.t. A^{1/2} A^{1/2} = A.

Proof. By the spectral theorem, A = Σ_i λ_i v_i v_i^⊤ where {v_i} are orthonormal. Let A^{1/2} = Σ_i λ_i^{1/2} v_i v_i^⊤. Then

    A^{1/2} A^{1/2} = ( Σ_i λ_i^{1/2} v_i v_i^⊤ )²
                    = Σ_i λ_i v_i v_i^⊤ v_i v_i^⊤ + Σ_{i≠j} λ_i^{1/2} λ_j^{1/2} v_i v_i^⊤ v_j v_j^⊤
                    = Σ_i λ_i v_i v_i^⊤,

where the last equality is due to v_i^⊤ v_j = δ_{ij}. It's easy to see that A^{1/2} is also PSD.

Let L^{+/2} denote the square root of L^+. So

    R_eff(a, b) = (e_b − e_a)^⊤ L^+ (e_b − e_a) = ‖L^{+/2}(e_b − e_a)‖².

Example: Effective resistance in a path. Consider a path graph on vertices V = {1, 2, 3, . . . , k + 1}, with resistances r(1), r(2), . . . , r(k) on the edges of the path.

Figure 7.2: A path graph with k edges.

The effective resistance between the endpoints is

    R_eff(1, k + 1) = Σ_{i=1}^k r(i).

To see this, observe that to have 1 unit of flow going from vertex 1 to vertex k + 1, we must have one unit flowing across each edge i. Let ∆(i) be the voltage difference across edge i, and f(i) the flow on the edge. Then 1 = f(i) = ∆(i)/r(i), so that ∆(i) = r(i). The electrical voltages are then x̃ ∈ R^V where x̃(i) = x̃(1) + Σ_{j<i} ∆(j). Hence the effective resistance is

    R_eff(1, k + 1) = d^⊤ x̃ = (e_{k+1} − e_1)^⊤ x̃ = x̃(k + 1) − x̃(1) = Σ_{i=1}^k r(i).

This behavior is sometimes known as the fact that the resistance of resistors adds up when
they are connected in series.

Example: Effective resistance of parallel edges. So far, we have only considered graphs with at most one edge between any two vertices. But the math also works if we allow a pair of vertices to have multiple distinct edges connecting them. We refer to these as multi-edges. Suppose we have a graph on just two vertices, V = {1, 2}, and these are connected by k parallel multi-edges with resistances r(1), r(2), . . . , r(k).

Figure 7.3: A graph on just two vertices with k parallel multiedges.

The effective resistance between the endpoints is

    R_eff(1, 2) = 1 / ( Σ_{i=1}^k 1/r(i) ).

Let’s see why. Our electrical voltages x̃ ∈ RV can be described by just the voltage difference
∆ ∈ R between vertex 1 and vertex 2, i.e. x̃ (2) − x̃ (1) = ∆. which creates
P a flow on edge
i of f̃ (i) = ∆/r (i). Thus the total flow from vertex 1 to vertex 2 is 1 = i ∆/r (i), so that
∆ = Pk 11/r (i) . Meanwhile, the effective resistance is also
i=1

1
Reff (1, 2) = (e 2 − e 1 )> x̃ = ∆ = Pk
i=1 1/r (i)
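Both examples are easy to verify numerically from the identity R_eff(a, b) = (e_b − e_a)^⊤ L^+ (e_b − e_a). The following Python/NumPy sketch (with illustrative resistances) checks the series and parallel formulas:

import numpy as np

def reff(L, a, b):
    e = np.zeros(L.shape[0]); e[b], e[a] = 1.0, -1.0
    return e @ np.linalg.pinv(L) @ e

# Series: a path with resistances r(1), ..., r(k); edge weights are 1/r.
r = np.array([1., 2., 4.])
k = len(r)
L_path = np.zeros((k + 1, k + 1))
for i in range(k):
    w = 1.0 / r[i]
    L_path[i, i] += w; L_path[i + 1, i + 1] += w
    L_path[i, i + 1] -= w; L_path[i + 1, i] -= w
print(reff(L_path, 0, k), r.sum())                 # both equal 7

# Parallel: two vertices joined by k multi-edges; conductances add,
# so the multigraph collapses to a single edge of weight sum(1/r).
c = (1.0 / r).sum()
L_par = np.array([[ c, -c],
                  [-c,  c]])
print(reff(L_par, 0, 1), 1.0 / c)                  # both equal 4/7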

7.3.1 Effective Resistance is a Distance


Definition 7.3.3. Consider a weighted undirected graph G with vertex set V. We say a function d : V × V → R, which takes a pair of vertices and returns a real number, is a distance if it satisfies

1. d(a, a) = 0 for all a ∈ V

2. d(a, b) ≥ 0 for all a, b ∈ V .

3. d(a, b) = d(b, a) for all a, b ∈ V .

4. d(a, b) ≤ d(a, c) + d(c, b) for all a, b, c ∈ V .

Lemma 7.3.4. Reff is a distance.

Before proving this lemma, let’s see a claim that will help us finish the proof.
Claim 7.3.5. Let Lx̃ = e b − e a . Then for all c ∈ V , we have x̃ (b) ≥ x̃ (c) ≥ x̃ (a).

We only sketch a proof of this claim:

Proof sketch. Consider any c ∈ V, where c ≠ a, b. Now (Lx̃)(c) = 0, i.e.

    ( Σ_{(u,c)} w(u, c) ) x̃(c) − Σ_{(u,c)} w(u, c) x̃(u) = 0.

Rearranging,

    x̃(c) = Σ_{(u,c)} w(u, c) x̃(u) / Σ_{(u,c)} w(u, c).

This tells us that x̃(c) is a weighted average of the voltages of its neighbors. From this, we can show that x̃(a) and x̃(b) are the extreme values.

Proof. It is easy to check that conditions 1, 2, and 3 of Definition 7.3.3 are satisfied by Reff .
Let us confirm condition 4.
For any u, v, let x̃ u,v = L+ (−e u + e v ). Then

x̃ a,b = L+ (−e a + e b ) = L+ (−e a + e c − e c + e b ) = x̃ a,c + x̃ c,b .

Thus,

    R_eff(a, b) = (−e_a + e_b)^⊤ x̃_{a,b} = (−e_a + e_b)^⊤ (x̃_{a,c} + x̃_{c,b})
                = −x̃_{a,c}(a) + x̃_{a,c}(b) − x̃_{c,b}(a) + x̃_{c,b}(b)
                ≤ −x̃_{a,c}(a) + x̃_{a,c}(c) − x̃_{c,b}(c) + x̃_{c,b}(b)
                = R_eff(a, c) + R_eff(c, b),

where in the third line we applied Claim 7.3.5 to show that x̃_{a,c}(b) ≤ x̃_{a,c}(c) and −x̃_{c,b}(a) ≤ −x̃_{c,b}(c).

Chapter 8

Different Perspectives on Gaussian Elimination

8.1 An Optimization View of Gaussian Elimination for Laplacians
In this section, we will explore how to exactly minimize a Laplacian quadratic form by
minimizing over one variable at a time. It turns out that this is in fact Gaussian Elimination
in disguise – or, more precisely, the variant of Gaussian elimination that we tend to use on
symmetric matrices, which is called Cholesky factorization.
Consider a Laplacian L of a connected graph G = (V, E, w ), where w ∈ RE is a vector of
positive edge weights. Let W ∈ RE×E be the diagonal matrix with the edge weights on the
diagonal, i.e. W = diag(w ) and L = BW B > . Let d ∈ RV be a demand vector s.t. d ⊥ 1.
Let us define an energy

    E(x) = −d^⊤ x + (1/2) x^⊤ L x.

Note that this function is convex and is minimized at the x satisfying Lx = d.
We will now explore an approach to solving the minimization problem

    min_{x ∈ R^V} E(x).

Let x = (y, z) where y ∈ R is the coordinate of vertex 1 and z ∈ R^{V\{1}} collects the rest.
We will now explore how to minimize over y, given any z. Once we find an expression for y in terms of z, we will be able to reduce the problem to a new quadratic minimization problem in z,

    E′(z) = −d′^⊤ z + (1/2) z^⊤ L′ z,

where d′ is a demand vector on the remaining vertices, with d′ ⊥ 1, and L′ is a Laplacian of a graph on the remaining vertices V′ = V \ {1}. We can then repeat the procedure to eliminate another variable, and so on. Eventually, we can find the full solution to our original minimization problem.
To help us understand how to minimize over the first variable, we introduce some notation for the first row and column of the Laplacian:

    L = [ W    −a^⊤              ]
        [ −a   diag(a) + L_{−1}  ].    (8.1)

Note that W is the weighted degree of vertex 1, and that

    [ W    −a^⊤    ]
    [ −a   diag(a) ]    (8.2)

is the Laplacian of the subgraph of G containing only the edges incident on vertex 1, while L_{−1} is the Laplacian of the subgraph of G containing all edges not incident on vertex 1.

Let us also write d = (b, c) where b ∈ R and c ∈ R^{V\{1}}.
c
Now,

    E(x) = −d^⊤ x + (1/2) x^⊤ L x
         = −(b, c)^⊤ (y, z) + (1/2) (y, z)^⊤ [ W  −a^⊤ ; −a  diag(a) + L_{−1} ] (y, z)
         = −by − c^⊤ z + (1/2) ( y²W − 2y a^⊤ z + z^⊤ diag(a) z + z^⊤ L_{−1} z ).

Now, to minimize over y, we set ∂E(x)/∂y = 0 and get

    −b + yW − a^⊤ z = 0.

Solving for y, we get that the minimizing y is

    y = (1/W)(b + a^⊤ z).    (8.3)

Observe that

    E(x) = −by − c^⊤ z + (1/2) ( y²W − 2y a^⊤ z + z^⊤ diag(a) z + z^⊤ L_{−1} z )
         = −by − c^⊤ z + (1/2) ( (1/W)(yW − a^⊤ z)² − (1/W) z^⊤ a a^⊤ z + z^⊤ diag(a) z + z^⊤ L_{−1} z )
         = −by − c^⊤ z + (1/2) ( (1/W)(yW − a^⊤ z)² + z^⊤ S z ),

where we simplified the expression by defining S = diag(a) − (1/W) a a^⊤ + L_{−1}. Plugging in y = (1/W)(b + a^⊤ z), we get

    min_y E((y, z)) = −( c + b (1/W) a )^⊤ z − b²/(2W) + (1/2) z^⊤ S z.

Now, we define d′ = c + b (1/W) a and E′(z) = −d′^⊤ z + (1/2) z^⊤ S z. And we can see that

    argmin_z min_y E((y, z)) = argmin_z E′(z),

since dropping the constant term −b²/(2W) does not change what the minimizing z values are.

Claim 8.1.1.

1. d′ ⊥ 1

2. S = diag(a) − (1/W) a a^⊤ + L_{−1} is a Laplacian of a graph on the vertex set V \ {1}.

We will prove Claim 8.1.1 in a moment. From the Claim, we see that the problem of finding argmin_z E′(z) is exactly of the same form as finding argmin_x E(x), but with one fewer variable.
We can get a minimizing x that solves argmin_x E(x) by repeating the variable elimination procedure until we get down to a single variable and finding its value. We then have to work back up to getting a solution for z, and then substitute that into Equation (8.3) to get the value for y.
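The following Python/NumPy sketch (on an illustrative example graph) carries out one elimination step and checks both parts of Claim 8.1.1 numerically:

import numpy as np

W = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 3.],
              [1., 0., 0., 1.],
              [0., 3., 1., 0.]])        # example weighted graph
L = np.diag(W.sum(axis=1)) - W

Wdeg = L[0, 0]                           # weighted degree of vertex 1 (index 0)
a = -L[1:, 0]                            # the vector a from Equation (8.1)
Lm1 = L[1:, 1:] - np.diag(a)             # Laplacian of edges not incident on vertex 1
S = np.diag(a) - np.outer(a, a) / Wdeg + Lm1   # the new Laplacian S

# Part 2 of Claim 8.1.1: S is again a graph Laplacian.
print(np.allclose(S, S.T), np.allclose(S @ np.ones(3), 0))
print((np.diag(S) >= 0).all(), (S - np.diag(np.diag(S)) <= 1e-12).all())

# Part 1: the new demand d' = c + (b/W) a is orthogonal to 1.
d = np.array([1., -1., 2., -2.])         # any demand with d orthogonal to 1
dprime = d[1:] + (d[0] / Wdeg) * a
print(abs(dprime.sum()) < 1e-12)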

Remark 8.1.2. In fact, this perspective on Gaussian elimination also makes sense for any
positive definite matrix. In this setting, minimizing over one variable will leave us with
another positive definite quadratic minimization problem.

Proof of Claim 8.1.1. To establish the first part, we note that 1^⊤ d′ = 1^⊤ c + b (1^⊤ a)/W = 1^⊤ c + b = 1^⊤ d = 0, using that 1^⊤ a = W. To establish the second part, we notice that L_{−1} is a graph Laplacian by definition. Since the sum of two graph Laplacians is another graph Laplacian, it now suffices to show that diag(a) − (1/W) a a^⊤ is a graph Laplacian.
Claim 8.1.3. A matrix M is a graph Laplacian if and only it satisfies the following condi-
tions:

• M> = M.

• The diagonal entries of M are non-negative, and the off-diagonal entries of M are
non-positive.

• M 1 = 0.

Let’s see that Claim 8.1.3 is true. Firstly, when the conditions hold we can write M = D −A
where D is diagonal and non-negative, and A P is non-negative, symmetric, and zero on the
diagonal, and from the last condition D(i, i) = j6=i A(i, j). Thus we can view A as a graph
adjacency matrix and D as the corresponding diagonal matrix of weighted degrees. Secondly,
it is easy to check that the conditions hold for any graph Laplacian, so the conditions indeed
hold if and only if. Now we have to check that the claim applies to S . We leave this as an
exercise for the reader.
Finally, we want to argue that the graph corresponding to S is connected. Consider any
i, j ∈ V \ {1}. Since G, the graph of L, is connected, there exists a simple path in G
connecting i and j. If this path does not use vertex 1, it is a path in the graph of L−1 and
hence in the graph of S . If the path does use vertex 1, it must do so by reaching the vertex
on some edge (v, 1) and leaving on a different edge (1, u). Replace this pair of edges with
edge (u, v), which appears in the graph of S because S (u, v) < 0. Now we have a path in
the graph of S .

8.2 An Additive View of Gaussian Elimination

Cholesky decomposition basics. Again we consider a graph Laplacian L ∈ R^{n×n} of a connected graph G = (V, E, w), where as usual |V| = n and |E| = m.

In this Section, we’ll study how to decompose a graph Laplacian as L = LL> , where
L ∈ Rn×n is a lower triangular matrix, i.e. L(i, j) = 0 for i < j. Such a factorization is
called a Cholesky decomposition. It is essentially the result of Gaussian elimination with a
slight twist to ensure the matrices maintained at intermediate steps of the algorithm remain
symmetric.
We use nnz(A) to denote the number of non-zero entries of matrix A.
Lemma 8.2.1. Given an invertible square lower triangular matrix L, we can solve the linear
equation Ly = b in time O(nnz(L)). Similarly, given an upper triangular matrix U , we can
solve linear equations U z = c in time O(nnz(U )).

We omit the proof, which is a straight-forward exercise. The algorithms for solving linear
equations in upper and lower triangular matrices are known as forward and back substitution
respectively.
Remark 8.2.2. Strictly speaking, the lemma requires us to have access to an adjacency list representation of L so that we can quickly tell where the non-zero entries are.

Using forward and back substitution, if we have a decomposition of an invertible matrix M


into M = LL> , we can now solve linear equations in M in time O(nnz(L)).
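As an illustration, here is a minimal Python sketch of forward substitution (back substitution for an upper triangular U z = c is the mirror image, proceeding from the last row upward). It is written densely for clarity; with a sparse row representation, the total work is O(nnz(L)):

def forward_substitute(Lmat, b):
    # Solve L y = b for lower triangular L with nonzero diagonal,
    # computing each y[i] from the already-known y[0..i-1].
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(Lmat[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / Lmat[i][i]
    return y

print(forward_substitute([[2., 0.], [1., 3.]], [4., 7.]))  # [2.0, 5/3]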
Remark 8.2.3. We have learned about decompositions using a lower triangular matrix,
and later we will see an algorithm for computing these. In fact, we can have more flexibility
than that. From an algorithmic perspective, it is sufficient that there exists a permutation
matrix P s.t. PLP > is lower triangular. If we know the ordering under which the matrix
becomes lower triangular, we can perform substitution according to that order to solve linear
equations in the matrix without having to explicitly apply a permutation to the matrix.

Dealing with pseudoinverses. But how can we solve a linear equation in L = LL> ,
where L is not invertible? For graph Laplacians we have a simple characterization of the
kernel, and because of this, dealing with the lack of invertibility turns out to be fairly easy.
We can use the following lemma which you will prove in an exercise next week.
Lemma 8.2.4. Consider a real symmetric matrix M = X Y X > , where X is real and
invertible and Y is real symmetric. Let ΠM denote the orthogonal projection to the image
of M . Then M + = ΠM (X > )−1 Y + X −1 ΠM .

The factorizations L = LL^⊤ that we produce will have the property that all diagonal entries of L are strictly non-zero, except that L(n, n) = 0. Let L̂ be the matrix whose entries agree with L, except that L̂(n, n) = 1. Let D be the diagonal matrix with D(i, i) = 1 for i < n and D(n, n) = 0. Then LL^⊤ = L̂ D L̂^⊤, and L̂ is invertible, and D^+ = D. Finally, Π_L = I − (1/n) 11^⊤, because this matrix acts like the identity on vectors orthogonal to 1 and satisfies Π_L 1 = 0, and this matrix can be applied to a vector in O(n) time. Thus L^+ = Π_L (L̂^⊤)^{-1} D L̂^{-1} Π_L, and this matrix can be applied in time O(nnz(L)).

An additive view of Gaussian Elimination. The following theorem describes Gaussian Elimination / Cholesky decomposition of a graph Laplacian.

Theorem 8.2.5 (Cholesky Decomposition on graph Laplacians). Let L ∈ Rn×n be a graph


Laplacian of a connected graph G = (V, E, w ), where |V | = n. Using Gaussian Elimination,
we can compute in O(n3 ) time a factorization L = LL> where L is lower triangular, and
has positive diagonal entries except L(n, n) = 0.

Proof. Let L^{(0)} = L. We will use A(:, i) to denote the ith column of a matrix A. Now, for i = 1 to i = n − 1 we define

    l_i = (1/√(L^{(i−1)}(i, i))) L^{(i−1)}(:, i)   and   L^{(i)} = L^{(i−1)} − l_i l_i^⊤.

Finally, we let l_n = 0_{n×1}. We will show later that

    L^{(n−1)} = 0_{n×n}.    (8.4)

It follows that L = Σ_i l_i l_i^⊤, provided this procedure is well-defined, i.e. L^{(i−1)}(i, i) ≠ 0 for all i < n. We will sketch a proof of this later, while also establishing several other properties of the procedure.
of the procedure.
Given a matrix A ∈ Rn×n and U ⊆ [n], we will use A(U, U ) to denote the principal submatrix
of A obtained by restricting to the rows and columns with index in U , i.e. all entries A(i, j)
where i, j ∈ U .
Claim 8.2.6. Fix some i < n. Let U = {i + 1, . . . , n}. Then L^{(i)}(j, k) = 0 if j ∉ U or k ∉ U. And L^{(i)}(U, U) is a graph Laplacian of a connected graph on the vertex set U.

From this claim, it follows that L^{(i−1)}(i, i) ≠ 0 for i < n, since a connected graph Laplacian on a graph with |U| > 1 vertices cannot have a zero on the diagonal. It also follows that L^{(n−1)} = 0_{n×n}: by the claim, all entries outside L^{(n−1)}(n, n) are zero, and L^{(n−1)}(n, n) = 0 because the only graph we allow on one vertex is the empty graph. This shows Equation (8.4) holds.

Sketch of proof of Claim 8.2.6. We will focus on the first elimination, as the remaining ones are similar. Adopting the same notation as in Equation (8.1), we write

    L^{(0)} = L = [ W    −a^⊤              ]
                  [ −a   diag(a) + L_{−1}  ]

and, noting that

    l_1 l_1^⊤ = [ W    −a^⊤           ]
                [ −a   (1/W) a a^⊤    ],

we see that

    L^{(1)} = L^{(0)} − l_1 l_1^⊤ = [ 0    0                               ]
                                    [ 0    diag(a) − (1/W) a a^⊤ + L_{−1}  ].
Thus the first row and column of L^{(1)} are zero as claimed. It also follows by Claim 8.1.1 that L^{(1)}({2, . . . , n}, {2, . . . , n}) is the Laplacian of a connected graph. This proves Claim 8.2.6 for the case i = 1. An induction following the same pattern can be used to prove the claim for all i < n.
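The procedure from the proof of Theorem 8.2.5 is short enough to state as code. The following Python/NumPy sketch (with an illustrative example graph) computes the factorization and verifies L = LL^⊤:

import numpy as np

def laplacian_cholesky(L):
    # Gaussian elimination as in Theorem 8.2.5: returns a lower
    # triangular Lo with L = Lo @ Lo.T and Lo[n-1, n-1] == 0.
    n = L.shape[0]
    S = L.astype(float).copy()
    Lo = np.zeros((n, n))
    for i in range(n - 1):
        Lo[:, i] = S[:, i] / np.sqrt(S[i, i])
        S = S - np.outer(Lo[:, i], Lo[:, i])
    return Lo

W = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W
Lo = laplacian_cholesky(L)
print(np.allclose(Lo @ Lo.T, L))   # True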

Chapter 9

Random Matrix Concentration and Spectral Graph Sparsification

9.1 Matrix Sampling and Approximation


We want to begin understanding how sums of random matrices behave, in particular, whether they exhibit a tendency to concentrate in the same way that sums of scalar random variables do under various conditions.
First, let’s recall a scalar Chernoff bound, which shows that a sum of bounded, non-negative
random variables tend to concentrate around their mean.
Theorem 9.1.1 (A Chernoff Concentration Bound). Suppose X1 , . . P. , Xk ∈ R are indepen-
dent, non-negative, random variables with Xi ≤ R always. Let X = i Xi , and µ = E [X],
then for 0 <  ≤ 1
 2   2 
− µ − µ
Pr[X ≥ (1 + )µ] ≤ exp and Pr[X ≤ (1 − )µ] ≤ exp .
4R 4R

The Chernoff bound should be familiar to most of you, but you may not have seen the
following very similar bound. The Bernstein bound, which we will state in terms of zero-
mean variables, is much like the Chernoff bound. It also requires bounded variables. But,
when the variables have small variance, the Bernstein bound is sometimes stronger.
Theorem 9.1.2 (A Bernstein Concentration Bound). Suppose X_1, . . . , X_k ∈ R are independent, zero-mean random variables with |X_i| ≤ R always. Let X = Σ_i X_i and σ² = Var[X] = Σ_i E[X_i²]. Then for t > 0,

    Pr[|X| ≥ t] ≤ 2 exp( −t² / (2Rt + 4σ²) ).

We will now prove the Bernstein concentration bound for scalar random variables, as a warm-
up to the next section, where we will prove a version of it for matrix-valued random variables.

To help us prove Bernstein's bound, first let's recall Markov's inequality. This is a very weak concentration inequality, but also very versatile, because it requires few assumptions.

Lemma 9.1.3 (Markov's Inequality). Suppose X ∈ R is a non-negative random variable with a finite expectation. Then for any t > 0,

    Pr[X ≥ t] ≤ E[X] / t.

Proof.

    E[X] = Pr[X ≥ t] E[X | X ≥ t] + Pr[X < t] E[X | X < t]
         ≥ Pr[X ≥ t] E[X | X ≥ t]
         ≥ Pr[X ≥ t] · t.

We can rearrange this to get the desired statement.

Now, we are ready to prove Bernstein’s bound.

Proof of Theorem 9.1.2. We will focus on bounding the probability Pr[X ≥ t]. The proof that Pr[−X ≥ t] is small proceeds in the same way.
First we observe that, for any θ > 0,

    Pr[X ≥ t] = Pr[exp(θX) ≥ exp(θt)]    because x ↦ exp(θx) is strictly increasing
              ≤ exp(−θt) E[exp(θX)]       by Lemma 9.1.3 (Markov's Inequality).

Now, let's require that θ ≤ 1/R. This will allow us to use the following bound: for all |z| ≤ 1,

    exp(z) ≤ 1 + z + z².    (9.1)

We omit a proof of this, but the plots in Figure 9.1 suggest that this upper bound holds. The reader should consider how to prove it. With this in mind, we see that

    E[exp(θX)] = E[exp(θ Σ_i X_i)]
               = E[Π_i exp(θX_i)]
               = Π_i E[exp(θX_i)]             because E[Y Z] = E[Y] E[Z] for independent Y and Z
               ≤ Π_i E[1 + θX_i + (θX_i)²]    by Equation (9.1), since |θX_i| ≤ 1
               = Π_i (1 + θ² E[X_i²])         because the X_i are zero-mean
               ≤ Π_i exp(θ² E[X_i²])          because 1 + z ≤ exp(z) for all z ∈ R
               = exp( θ² Σ_i E[X_i²] ) = exp(θ²σ²).
Figure 9.1: Plotting exp(z) compared to 1 + z + z².

Thus Pr[X ≥ t] ≤ exp(−θt) E[exp(θX)] ≤ exp(−θt + θ²σ²). Now, to get the best possible bound, we'd like to minimize −θt + θ²σ² subject to the constraint 0 < θ ≤ 1/R. We compute

    ∂/∂θ ( −θt + θ²σ² ) = −t + 2θσ².

Setting this derivative to zero gives θ = t/(2σ²), and plugging that in gives

    −θt + θ²σ² = −t²/(4σ²).

This choice only satisfies our constraints on θ if t/(2σ²) ≤ 1/R. Otherwise, we let θ = 1/R and note that in this case

    −θt + θ²σ² = −t/R + σ²/R² ≤ −t/R + t/(2R) = −t/(2R),
where we got the inequality from t > 2σ²/R. Altogether, we can conclude that there always is a choice of θ s.t.

    −θt + θ²σ² ≤ −min( t/(2R), t²/(4σ²) ) ≤ −t²/(2Rt + 4σ²).

In fact, with the benefit of hindsight, and a little algebra, we arrive at the same conclusion in another way: one can check that the following choice of θ is always valid and achieves the same bound:

    θ = (1/(2σ²)) ( t − √R · t^{3/2} / √(2σ² + Rt) ).

We use ‖·‖ to denote the spectral norm on matrices. Let's take a look at a version of Bernstein's bound that applies to sums of random matrices.

Theorem 9.1.4 (A Bernstein Matrix Concentration Bound (Tropp 2011)). Suppose X_1, . . . , X_k ∈ R^{n×n} are independent, symmetric matrix-valued random variables. Assume each X_i is zero-mean, i.e. E[X_i] = 0_{n×n}, and that ‖X_i‖ ≤ R always. Let X = Σ_i X_i and σ² = ‖Var[X]‖ = ‖Σ_i E[X_i²]‖. Then for t > 0,

    Pr[‖X‖ ≥ t] ≤ 2n exp( −t² / (2Rt + 4σ²) ).

This basically says that the probability of X being large in spectral norm behaves like in the scalar case, except the bound is larger by a factor n, where the matrices are n × n. We can get a feeling for why this might be a reasonable bound by considering the case of random diagonal matrices. Then ‖X‖ = max_j |X(j, j)| = max_j |Σ_i X_i(j, j)|. In this case, we need to bound the largest of the n diagonal entries: we can do this by a union bound over n instances of the scalar problem, and this also turns out to be essentially tight in some cases, meaning we can't expect a better bound in general.

9.2 Matrix Concentration

In this section we will prove the Bernstein matrix concentration bound (Tropp 2011) that
we saw in the previous section.

Theorem 9.2.1. Suppose X_1, . . . , X_k ∈ R^{n×n} are independent, symmetric matrix-valued random variables. Assume each X_i is zero-mean, i.e. E[X_i] = 0_{n×n}, and that ‖X_i‖ ≤ R always. Let X = Σ_i X_i and σ² = ‖Var[X]‖ = ‖Σ_i E[X_i²]‖. Then for t > 0,

    Pr[‖X‖ ≥ t] ≤ 2n exp( −t² / (2Rt + 4σ²) ).

But let’s collect some useful tools for the proof first.

Definition 9.2.2 (trace). The trace of a square matrix A is defined as

    Tr(A) := Σ_i A(i, i).

Claim 9.2.3 (cyclic property of trace). Tr(AB) = Tr(BA).

Let S^n denote the set of all n × n real symmetric matrices, S^n_+ the set of all n × n positive semidefinite matrices, and S^n_{++} the set of all n × n positive definite matrices. Their relation is clear: S^n_{++} ⊂ S^n_+ ⊂ S^n. For any A ∈ S^n with eigenvalues λ_1(A) ≤ · · · ≤ λ_n(A), by the spectral theorem, A = V Λ V^⊤ where Λ = diag_i(λ_i(A)) and V^⊤V = V V^⊤ = I; we'll use this property without mention in the sequel.

Claim 9.2.4. Given a symmetric real matrix A, Tr(A) = Σ_i λ_i, where {λ_i} are the eigenvalues of A.

Proof.

    Tr(A) = Tr(V Λ V^⊤) = Tr(Λ V^⊤ V) = Tr(Λ) = Σ_i λ_i,

using the cyclic property of the trace and V^⊤V = I.

9.2.1 Matrix Functions

Definition 9.2.5 (Matrix function). Given a real-valued function f : R → R, we extend it to a matrix function f : S^n → S^n. For A ∈ S^n with spectral decomposition A = V Λ V^⊤, let

    f(A) = V diag_i( f(λ_i) ) V^⊤.

Example. Recall that every PSD matrix A has a square root A^{1/2}. If f(x) = x^{1/2} for x ∈ R_+, then f(A) = A^{1/2} for A ∈ S^n_+.
Example. If f(x) = exp(x) for x ∈ R, then f(A) = exp(A) = V exp(Λ) V^⊤ for A ∈ S^n. Note that exp(A) is positive definite for any A ∈ S^n.

9.2.2 Monotonicity and Operator Monotonicity

Consider a function f : D → C. If we have a partial order ≤_D defined on D and a partial order ≤_C defined on C, then we say that the function is monotone increasing (resp. decreasing) w.r.t. this pair of orderings if for all d_1, d_2 ∈ D s.t. d_1 ≤_D d_2 we have f(d_1) ≤_C f(d_2) (resp. f(d_2) ≤_C f(d_1)).
Let's introduce some terminology for important special cases of this idea. We say that a function f : S → R, where S ⊆ S^n, is monotone increasing if A ⪯ B implies f(A) ≤ f(B).
Meanwhile, a function f : S → T where S, T ⊆ S^n is said to be operator monotone increasing if A ⪯ B implies f(A) ⪯ f(B).

Lemma 9.2.6. Let T ⊆ R. If the scalar function f : T → R is monotone increasing, the matrix function X ↦ Tr(f(X)) is monotone increasing.

Proof. From previous chapters, we know if A ⪯ B then λ_i(A) ≤ λ_i(B) for all i. As x ↦ f(x) is monotone, λ_i(f(A)) ≤ λ_i(f(B)) for all i. By Claim 9.2.4, Tr(f(A)) ≤ Tr(f(B)).

From this, and the fact that x ↦ exp(x) is a monotone function on the reals, we get the following corollary.

Corollary 9.2.7. If A ⪯ B, then Tr(exp(A)) ≤ Tr(exp(B)), i.e. X ↦ Tr(exp(X)) is monotone increasing.

Lemma 9.2.8. If 0 ≺ A ⪯ B, then B^{-1} ⪯ A^{-1}, i.e. X ↦ X^{-1} is operator monotone decreasing on S^n_{++}.

You will prove the above lemma in this week's exercises.

Lemma 9.2.9. If 0 ≺ A ⪯ B, then log(A) ⪯ log(B).

To prove this lemma, we first recall an integral representation of the logarithm.


Lemma 9.2.10.

    log a = ∫_0^∞ ( 1/(1+t) − 1/(a+t) ) dt

Proof.

    ∫_0^∞ ( 1/(1+t) − 1/(a+t) ) dt = lim_{T→∞} ∫_0^T ( 1/(1+t) − 1/(a+t) ) dt
                                   = lim_{T→∞} [ log(1+t) − log(a+t) ]_0^T
                                   = log(a) + lim_{T→∞} log( (1+T)/(a+T) )
                                   = log(a).

Proof sketch of Lemma 9.2.9. Because all the matrices involved are diagonalized by the same orthogonal transformation, we can conclude from Lemma 9.2.10 that for a matrix A ≻ 0,

    log(A) = ∫_0^∞ ( (1/(1+t)) I − (tI + A)^{-1} ) dt.

This integral can be expressed as the limit of a sum with positive coefficients, and since the integrand is operator monotone increasing in A by Lemma 9.2.8, the result of the integral, i.e. log(A), must also be operator monotone increasing.

The following is a more general version of Lemma 9.2.6.

Lemma 9.2.11. Let T ⊂ R. If the scalar function f : T → R is monotone, the matrix function X ↦ Tr(f(X)) is monotone.

Remark 9.2.12. It is not always true that when f : R → R is monotone, f : S^n → S^n is operator monotone. For example, X ↦ X² and X ↦ exp(X) are not operator monotone.

9.2.3 Some Useful Facts

Lemma 9.2.13. exp(A) ⪯ I + A + A² for ‖A‖ ≤ 1.

Proof.

    I + A + A² − exp(A) = V I V^⊤ + V Λ V^⊤ + V Λ² V^⊤ − V exp(Λ) V^⊤
                        = V ( I + Λ + Λ² − exp(Λ) ) V^⊤
                        = V diag_i( 1 + λ_i + λ_i² − exp(λ_i) ) V^⊤.

Recall exp(x) ≤ 1 + x + x² for all |x| ≤ 1. Since ‖A‖ ≤ 1, i.e. |λ_i| ≤ 1 for all i, we have 1 + λ_i + λ_i² − exp(λ_i) ≥ 0 for all i, meaning I + A + A² − exp(A) ⪰ 0.
Lemma 9.2.14. log(I + A) ⪯ A for A ≻ −I.

Proof.

    A − log(I + A) = V Λ V^⊤ − V log(Λ + I) V^⊤
                   = V ( Λ − log(Λ + I) ) V^⊤
                   = V diag_i( λ_i − log(1 + λ_i) ) V^⊤.

Recall x ≥ log(1 + x) for all x > −1. Since A ≻ −I, i.e. λ_i > −1 for all i, we have λ_i − log(1 + λ_i) ≥ 0 for all i, meaning A − log(I + A) ⪰ 0.
Theorem 9.2.15 (Lieb). Let f : S^n_{++} → R be a matrix function given by

    f(A) = Tr( exp(H + log(A)) )

for some H ∈ S^n. Then −f is convex (i.e. f is concave).

Lieb's theorem will be crucial in our proof of Theorem 9.2.1, but it is also highly non-trivial and we will omit its proof here. The interested reader can find a proof in Chapter 8 of [T+15].

Lemma 9.2.16 (Jensen's inequality). E[f(X)] ≥ f(E[X]) when f is convex; E[f(X)] ≤ f(E[X]) when f is concave.

9.2.4 Proof of Matrix Bernstein Concentration Bound

Now, we are ready to prove the Bernstein matrix concentration bound.

Proof of Theorem 9.2.1. For any A ∈ S^n, its spectral norm is ‖A‖ = max{|λ_n(A)|, |λ_1(A)|} = max{λ_n(A), −λ_1(A)}. Let λ_1 ≤ · · · ≤ λ_n be the eigenvalues of X. Then,

    Pr[‖X‖ ≥ t] = Pr[ (λ_n ≥ t) ∨ (−λ_1 ≥ t) ] ≤ Pr[λ_n ≥ t] + Pr[−λ_1 ≥ t].

Let Y := Σ_i (−X_i). It's easy to see that −λ_n ≤ · · · ≤ −λ_1 are the eigenvalues of Y, implying λ_n(Y) = −λ_1(X). Since E[−X_i] = E[X_i] = 0 and ‖−X_i‖ = ‖X_i‖ ≤ R for all i, any bound we prove on Pr[λ_n(X) ≥ t] also applies to Pr[λ_n(Y) ≥ t]. As

    Pr[−λ_1(X) ≥ t] = Pr[λ_n(Y) ≥ t],

it suffices to bound Pr[λ_n ≥ t].


For any θ > 0, λ_n ≥ t ⟺ exp(θλ_n) ≥ exp(θt), and Tr(exp(θX)) = Σ_i exp(θλ_i) by Claim 9.2.4; thus λ_n ≥ t ⟹ Tr(exp(θX)) ≥ exp(θt). Then, using Markov's inequality,

    Pr[λ_n ≥ t] ≤ Pr[ Tr(exp(θX)) ≥ exp(θt) ]
               ≤ exp(−θt) E[ Tr(exp(θX)) ].

For two independent random variables U and V, we have

    E_{U,V}[f(U, V)] = E_U[ E_V[f(U, V) | U] ] = E_U[ E_V[f(U, V)] ].

Define X_{<i} = Σ_{j<i} X_j, and let 0 < θ ≤ 1/R. Since the X_i are independent,

    E[Tr(exp(θX))]
      = E_{X_1,...,X_{k−1}} E_{X_k}[ Tr(exp(θX_{<k} + θX_k)) ]
      ≤ E_{X_1,...,X_{k−1}} Tr(exp( θX_{<k} + log E_{X_k}[exp(θX_k)] ))    by Theorem 9.2.15 (with H = θX_{<k}) and Lemma 9.2.16
      ≤ E_{X_1,...,X_{k−1}} Tr(exp( θX_{<k} + log E[I + θX_k + θ²X_k²] ))  by Lemmas 9.2.13 and 9.2.9, and Corollary 9.2.7
      = E_{X_1,...,X_{k−1}} Tr(exp( θX_{<k} + log(I + θ² E[X_k²]) ))       since E[X_k] = 0
      ≤ E_{X_1,...,X_{k−1}} Tr(exp( θX_{<k} + θ² E[X_k²] ))                by Lemma 9.2.14 and Corollary 9.2.7
      = E_{X_1,...,X_{k−2}} E_{X_{k−1}}[ Tr(exp( θ² E[X_k²] + θX_{<k−1} + θX_{k−1} )) ]
      ⋮
      ≤ Tr(exp( θ² Σ_i E[X_i²] ))
      ≤ Tr(exp( θ²σ² I ))                                                  by Corollary 9.2.7 and Σ_i E[X_i²] ⪯ σ² I
      = n · exp(θ²σ²).

Then,

    Pr[λ_n ≥ t] ≤ n · exp(−θt + θ²σ²),

and

    Pr[‖X‖ ≥ t] ≤ 2n · exp(−θt + θ²σ²).

As in the proof of the Bernstein concentration bound for scalar random variables, minimizing the RHS over 0 < θ ≤ 1/R yields

    Pr[‖X‖ ≥ t] ≤ 2n · exp( −t² / (2Rt + 4σ²) ).

9.3 Spectral Graph Sparsification

In this section, we will see that for any dense graph, we can find another sparser graph whose
graph Laplacian is approximately the same as measured by their quadratic forms. This turns
out to be a very useful tool for designing algorithms.

Definition 9.3.1. Given A, B ∈ S^n_+ and ε > 0, we say

    A ≈_ε B   if and only if   (1/(1+ε)) A ⪯ B ⪯ (1 + ε) A.

Suppose we start with a connected graph G = (V, E, w), where as usual we say that |V| = n and |E| = m. We want to produce another graph G̃ = (V, Ẽ, w̃) s.t. |Ẽ| ≪ |E| and at the same time L_G ≈_ε L_G̃. We call G̃ a spectral sparsifier of G. Our construction will also ensure that Ẽ ⊆ E, although this is not important in most applications. Figure 9.2 shows an example of a graph G and a spectral sparsifier G̃.

Figure 9.2: A graph G and a spectral sparsifier G̃, satisfying L_G ≈_ε L_G̃ for ε = 2.42.

We are going to construct G̃ by sampling some of the edges of G according to a suitable


probability distribution and scaling up their weight to make up for the fact that we pick
fewer of them.
To get a better understanding of what the notion of approximation given in Definition 9.3.1 means, let's observe a simple consequence of it.
Given a vertex subset T ⊆ V, we say that (T, V \ T) is a cut in G and that the value of the cut is

    c_G(T) = Σ_{e ∈ E ∩ (T × (V\T))} w(e).

Figure 9.3 shows the cut c_G(T) in a graph G.

Theorem 9.3.2. If L_G ≈_ε L_G̃, then for all T ⊆ V,

    (1/(1+ε)) c_G(T) ≤ c_G̃(T) ≤ (1 + ε) c_G(T).

Proof. Let 1_T ∈ R^V be the indicator vector of the set T, i.e. 1_T(u) = 1 for u ∈ T and 1_T(u) = 0 otherwise. We can see that 1_T^⊤ L_G 1_T = c_G(T), and hence the theorem follows by comparing the quadratic forms.

Figure 9.3: The cut cG (T ) in G.

But how well can we spectrally approximate a graph with a sparse graph? The next theorem gives us a nearly optimal answer to this question.

Theorem 9.3.3 (Spectral Graph Approximation by Sampling, (Spielman-Srivastava 2008)). Consider a connected graph G = (V, E, w), with n = |V|. For any 0 < ε < 1 and 0 < δ < 1, there exist sampling probabilities p_e for each edge e ∈ E s.t. if we include each edge e in Ẽ independently with probability p_e and set its weight w̃(e) = (1/p_e) w(e), then with probability at least 1 − δ the graph G̃ = (V, Ẽ, w̃) satisfies

    L_G ≈_ε L_G̃   and   |Ẽ| ≤ O( n ε^{-2} log(n/δ) ).

The original proof can be found in [SS11].


Remark 9.3.4. For convenience, we will abbreviate LG as L and LG̃ as L̃ in the rest of this
section.

We are going to analyze a sampling procedure by turning our goal into a problem of matrix concentration. Recall that

Fact 9.3.5. A ⪯ B implies C A C^⊤ ⪯ C B C^⊤ for any C ∈ R^{n×n}.

By letting C = L^{+/2}, we can see that

    L ≈_ε L̃   implies   Π_L ≈_ε L^{+/2} L̃ L^{+/2},    (9.2)

where Π_L = L^{+/2} L L^{+/2} is the orthogonal projection to the complement of the kernel of L.
Definition 9.3.6. Given a matrix A, we define Π_A to be the orthogonal projection to the complement of the kernel of A, i.e. Π_A v = 0 for v ∈ ker(A) and Π_A v = v for v ∈ ker(A)^⊥. Recall that ker(A)^⊥ = im(A^⊤).

Claim 9.3.7. For a matrix A ∈ S^n with spectral decomposition A = V Λ V^⊤ = Σ_i λ_i v_i v_i^⊤ s.t. V^⊤V = I, we have Π_A = Σ_{i: λ_i ≠ 0} v_i v_i^⊤, and Π_A = A^{+/2} A A^{+/2} = A A^+ = A^+ A.

From the definition, we can see that Π_L = I − (1/n) 11^⊤.
Now that we understand the projection Π_L, it is not hard to show the following claim.

Claim 9.3.8.

1. Π_L ≈_ε L^{+/2} L̃ L^{+/2} implies L ≈_ε L̃.

2. For ε ≤ 1, we have that ‖Π_L − L^{+/2} L̃ L^{+/2}‖ ≤ ε/2 implies Π_L ≈_ε L^{+/2} L̃ L^{+/2}.

Really, the only idea needed here is that when comparing quadratic forms in matrices with
the same kernel, we necessarily can’t have the quadratic forms disagree on vectors in the
kernel. Simple! But we are going to write it out carefully, since we’re still getting used to
these types of calculations.

Proof of Claim 9.3.8. To prove Part 1, we assume Π_L ≈_ε L^{+/2} L̃ L^{+/2}. Recall that G is a connected graph, so ker(L) = span{1}, while L̃ is the Laplacian of a graph which may or may not be connected, so ker(L̃) ⊇ ker(L), and equivalently im(L̃) ⊆ im(L). Now, for any v ∈ ker(L) we have v^⊤ L̃ v = 0 = v^⊤ L v. For any v ∈ ker(L)^⊥ we have v = L^{+/2} z for some z, as ker(L)^⊥ = im(L) = im(L^{+/2}). Hence

    v^⊤ L̃ v = z^⊤ L^{+/2} L̃ L^{+/2} z ≥ (1/(1+ε)) z^⊤ L^{+/2} L L^{+/2} z = (1/(1+ε)) v^⊤ L v,

and similarly

    v^⊤ L̃ v = z^⊤ L^{+/2} L̃ L^{+/2} z ≤ (1 + ε) z^⊤ L^{+/2} L L^{+/2} z = (1 + ε) v^⊤ L v.

Thus we have established L ≈_ε L̃.



To prove Part 2, we assume ‖Π_L − L^{+/2} L̃ L^{+/2}‖ ≤ ε/2. This is equivalent to

    −(ε/2) I ⪯ L^{+/2} L̃ L^{+/2} − Π_L ⪯ (ε/2) I.

But since

    1^⊤ ( L^{+/2} L̃ L^{+/2} − Π_L ) 1 = 0,

we can in fact sharpen this to

    −(ε/2) Π_L ⪯ L^{+/2} L̃ L^{+/2} − Π_L ⪯ (ε/2) Π_L.

Rearranging, we then conclude

    (1 − ε/2) Π_L ⪯ L^{+/2} L̃ L^{+/2} ⪯ (1 + ε/2) Π_L.

Finally, we note that 1/(1 + ε) ≤ 1 − ε/2 to reach our conclusion, Π_L ≈_ε L^{+/2} L̃ L^{+/2}.

We now have most of the tools to prove Theorem 9.3.3, but to help us, we are going to establish one small piece of helpful notation: we define a matrix function Φ : R^{n×n} → R^{n×n} by

    Φ(A) = L^{+/2} A L^{+/2}.

We sometimes call this a "normalizing map", because it transforms a matrix to the space where spectral norm bounds can be translated into relative error guarantees compared to the L quadratic form.

Proof of Theorem 9.3.3. By Claim 9.3.8, it suffices to show

    ‖Π_L − L^{+/2} L̃ L^{+/2}‖ ≤ ε/2.    (9.3)

We introduce a set of independent random variables, one for each edge e, with a probability p_e associated with the edge, which we will fix later. We then let

    Y_e = (w(e)/p_e) b_e b_e^⊤   with probability p_e,
    Y_e = 0                      otherwise.

This way, L̃ = Σ_e Y_e. Note that E[Y_e] = p_e · (w(e)/p_e) · b_e b_e^⊤ = w(e) b_e b_e^⊤, and so

    E[L̃] = Σ_e E[Y_e] = L.

By linearity of Φ,

    E[Φ(L̃)] = Φ(E[L̃]) = Π_L.

Let us also define

    X_e = Φ(Y_e) − E[Φ(Y_e)]   and   X = Σ_e X_e.

Note that this ensures E[X_e] = 0. We are now going to fix the edge sampling probabilities, in a way that depends on some overall scaling parameter α > 0. We let

    p_e = min( α ‖Φ(w(e) b_e b_e^⊤)‖, 1 ).

Then we see from the definition of Y_e that whenever p_e < 1,

    ‖Φ(Y_e)‖ ≤ 1/α.

From this, we can conclude, with a bit of work, that for all e,

    ‖X_e‖ ≤ 1/α.    (9.4)

We can also show that

    ‖ Σ_e E[X_e²] ‖ ≤ 1/α.    (9.5)

In the exercises for this chapter, we will ask you to show that Equations (9.4) and (9.5) hold. This means that we can apply Theorem 9.2.1 to our X = Σ_e X_e, with R = 1/α and σ² = 1/α, to get

    Pr[ ‖Π_L − L^{+/2} L̃ L^{+/2}‖ ≥ ε/2 ] ≤ 2n exp( −0.25 ε² / ((ε + 4)/α) ).

Since 0 < ε < 1, this means that if α = 40 ε^{-2} log(n/δ), then

    Pr[ ‖Π_L − L^{+/2} L̃ L^{+/2}‖ ≥ ε/2 ] ≤ 2n δ²/n² ≤ δ/2.

In the last step, we assumed n ≥ 4.
Lastly, we’d like to know that the graph G̃ is sparse. The number of edges
in G̃ is equal to
the number of Y e that come out nonzero. Thus, the expected value of Ẽ is

h i X X
E Ẽ = pe ≤ α w (e) L+/2 b e b > L+/2

e
e e

We can bound the sum of the norms with a neat trick relating it to the trace of Π_L. Note that in general for a vector a ∈ R^n, we have ‖a a^⊤‖ = a^⊤ a = Tr(a a^⊤). And hence

    Σ_e w(e) ‖L^{+/2} b_e b_e^⊤ L^{+/2}‖ = Σ_e w(e) Tr( L^{+/2} b_e b_e^⊤ L^{+/2} )
                                         = Tr( L^{+/2} ( Σ_e w(e) b_e b_e^⊤ ) L^{+/2} )
                                         = Tr(Π_L) = n − 1.

Thus with our choice of α,

    E[|Ẽ|] ≤ 40 ε^{-2} log(n/δ) · n.

With a scalar Chernoff bound, we can show that |Ẽ| ≤ O(ε^{-2} log(n/δ) n) with probability at least 1 − δ/2. Thus by a union bound, this condition and Equation (9.3) are both satisfied with probability at least 1 − δ.

Remark 9.3.9. Note that

    ‖Φ(w(e) b_e b_e^⊤)‖ = w(e) ‖L^{+/2} b_e b_e^⊤ L^{+/2}‖ = w(e) ‖L^{+/2} b_e‖².

Recall that in Chapter 7, we saw that the effective resistance between vertex v and vertex u is given by ‖L^{+/2}(e_u − e_v)‖², and for an edge e connecting vertices u and v, we have b_e = e_u − e_v. That means the norm of the "baby Laplacian" w(e) b_e b_e^⊤ of a single edge with weight w(e) is exactly w(e) times the effective resistance between the two endpoints of the edge.

We haven’t shown how to compute the sampling probabilities efficiently, so right now, it isn’t
clear whether we can efficiently find G̃. It turns out that if we have access to a fast algorithm
for solving Laplacian linear equations, then we can find sufficiently good approximations to
the effective resistances quickly, and use these to compute G̃. An algorithm for this is
described in [SS11].
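For intuition, here is an illustrative Python/NumPy sketch of the sampling procedure of Theorem 9.3.3. It uses exact effective resistances computed via the pseudoinverse (which is slow, unlike the fast approximations just mentioned) and a smaller oversampling constant than the theorem's, just to produce a visibly sparse output on a small graph:

import numpy as np

rng = np.random.default_rng(0)
n = 100
edges = [(u, v) for u in range(n) for v in range(u + 1, n)]   # complete graph
m = len(edges)
w = np.ones(m)

B = np.zeros((n, m))
for k, (u, v) in enumerate(edges):
    B[u, k], B[v, k] = 1.0, -1.0
L = B @ np.diag(w) @ B.T

# Exact effective resistances via the pseudoinverse (slow but simple).
Lplus = np.linalg.pinv(L)
reff = np.einsum('ik,ij,jk->k', B, Lplus, B)

# Oversampling parameter; the theorem uses alpha = 40 eps^-2 log(n/delta),
# here we take a smaller constant purely for illustration.
alpha = 4 * np.log(n)
p = np.minimum(alpha * w * reff, 1.0)
keep = rng.random(m) < p
Bt, wt = B[:, keep], w[keep] / p[keep]     # keep edge e w.p. p_e, reweight by 1/p_e
Lt = Bt @ np.diag(wt) @ Bt.T
print("edges:", m, "->", keep.sum())

# Measure the quality: the nonzero spectrum of L^{+/2} Ltilde L^{+/2}
# should be clustered around 1 for a good sparsifier.
lam, V = np.linalg.eigh(L)
Lp2 = V @ np.diag([1/np.sqrt(l) if l > 1e-9 else 0.0 for l in lam]) @ V.T
ev = np.linalg.eigvalsh(Lp2 @ Lt @ Lp2)
print("nonzero eigenvalue range:", ev[1], ev[-1])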

Chapter 10

Solving Laplacian Linear Equations

10.1 Solving Linear Equations Approximately

Given a Laplacian L of a connected graph and a demand vector d ⊥ 1, we want to find x ∗


solving the linear equation Lx ∗ = d . We are going to focus on fast algorithms for finding
approximate (but highly accurate) solutions.
This means we need a notion of an approximate solution. Since our definition is not special
to Laplacians, we state it more generally for positive semi-definite matrices.

Definition 10.1.1. Given a PSD matrix M and d ∈ ker(M)^⊥, let M x* = d. We say that x̃ is an ε-approximate solution to the linear equation M x = d if

    ‖x̃ − x*‖²_M ≤ ε ‖x*‖²_M.

Remark 10.1.2. The requirement d ∈ ker(M )⊥ can be removed, but this is not important
for us.

Theorem 10.1.3 (Spielman and Teng (2004) [ST04]). Given a Laplacian L of a weighted undirected graph G = (V, E, w) with |E| = m and |V| = n, and a demand vector d ∈ R^V, we can find x̃ that is an ε-approximate solution to Lx = d, using an algorithm that takes time O(m log^c n log(1/ε)) for some fixed constant c and succeeds with probability 1 − 1/n^{10}.

In the original algorithm of Spielman and Teng, the exponent on the log in the running time
was c ≈ 70.
Today, we are going to see a simpler algorithm. But first, we’ll look at one of the key tools
behind all algorithms for solving Laplacian linear equations quickly.

10.2 Preconditioning and Approximate Gaussian Elimination
Recall our definition of two positive semi-definite matrices being approximately equal.

Definition 10.2.1 (Spectral approximation). Given A, B ∈ S^n_+, we say that

    A ≈_K B   if and only if   (1/(1+K)) A ⪯ B ⪯ (1 + K) A.

Suppose we have a positive definite matrix M ∈ S^n_{++} and want to solve a linear equation M x = d. We can do this using gradient descent or accelerated gradient descent, as we covered in Graded Homework 1. But if we have access to an easy-to-invert matrix that happens to also be a good spectral approximation of M, then we can use this to speed up the (accelerated) gradient descent algorithm. An example of this would be that we have a factorization LL^⊤ ≈_K M, where L is lower triangular and sparse, which means we can invert it quickly.
The following lemma, which you will prove in Problem Set 6, makes this preconditioning precise.

Lemma 10.2.2. Given a matrix M ∈ S^n_{++}, a vector d, and a decomposition M ≈_K LL^⊤, we can find x̃ that ε-approximately solves M x = d, using O((1+K) log(K/ε) (T_matvec + T_sol + n)) time.

• T_matvec denotes the time required to compute M z given a vector z, i.e. a "matrix-vector multiplication".

• T_sol denotes the time required to compute L^{-1} z or (L^⊤)^{-1} z given a vector z.
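To give a flavor of how such a guarantee arises, the sketch below implements the simplest preconditioned iteration, x ← x + (LL^⊤)^{-1}(d − Mx) (preconditioned Richardson). The lemma itself is proved via the accelerated methods mentioned above, so this Python/NumPy code, with an artificial M and preconditioner, is only an illustration:

import numpy as np

def preconditioned_richardson(M, solveP, d, iters):
    # Iterate x <- x + P^{-1}(d - M x), where P approximates M.
    # Here P is chosen with P >= M, so the plain step converges;
    # in general one first scales P up by the approximation factor.
    x = np.zeros_like(d)
    for _ in range(iters):
        x = x + solveP(d - M @ x)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A @ A.T + np.eye(5)                 # a positive definite matrix
Lfac = np.linalg.cholesky(M * 1.1)      # P = 1.1 M, a crude spectral approximation
solveP = lambda r: np.linalg.solve(Lfac.T, np.linalg.solve(Lfac, r))
d = rng.standard_normal(5)
x = preconditioned_richardson(M, solveP, d, 40)
print(np.linalg.norm(M @ x - d))        # tiny residual

With P = 1.1 M the error shrinks by a factor 1 − 1/1.1 per iteration, which is the source of the O((1+K) log(1/ε))-style iteration counts (acceleration improves the dependence on K).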

Dealing with pseudo-inverses. When our matrices have a null space, preconditioning
becomes slightly more complicated, but as long as it is easy to project to the complement
of the null space, there’s no real issue. The following describes precisely what we need (but
you can ignore the null-space issue when first reading these notes without losing anything
significant).
Lemma 10.2.3. Given a matrix M ∈ S^n_+, a vector d ∈ ker(M)^⊥, and a decomposition M ≈_K LDL^⊤, where L is invertible, we can find x̃ that ε-approximately solves M x = d, using O((1 + K) log(K/ε)(T_matvec + T_sol + T_proj + n)) time.

• T_matvec denotes the time required to compute M z given a vector z, i.e. a "matrix-vector multiplication".

• T_sol denotes the time required to compute L^{-1} z and (L^⊤)^{-1} z and D^+ z given a vector z.

• T_proj denotes the time required to compute Π_M z given a vector z.
Theorem 10.2.4 (Kyng and Sachdeva (2015) [KS16]). Given a Laplacian L of a weighted undirected graph G = (V, E, w) with |E| = m and |V| = n, we can find a decomposition LL^⊤ ≈_{0.5} L, such that L has number of non-zeroes nnz(L) = O(m log³ n), in time O(m log³ n), with probability at least 1 − 3/n⁵.

We can combine Theorem 10.2.4 with Lemma 10.2.3 to get a fast algorithm for solving
Laplacian linear equations.
Corollary 10.2.5. Given a Laplacian L of a weighted undirected graph G = (V, E, w) with |E| = m and |V| = n, and a demand vector d ∈ R^V, we can find x̃ that is an ε-approximate solution to Lx = d, using an algorithm that takes time O(m log³ n log(1/ε)) and succeeds with probability 1 − 1/n^{10}.

Proof sketch. First we need to get a factorization that conforms to Lemma 10.2.3. The decomposition LL^⊤ provided by Theorem 10.2.4 can be rewritten as LL^⊤ = L̃ D L̃^⊤, where L̃ is equal to L except L̃(n, n) = 1, and we let D be the identity matrix except D(n, n) = 0. This ensures D^+ = D and that L̃ is invertible and lower triangular with O(m log³ n) non-zeros. We note that the inverse of an invertible lower or upper triangular matrix with N non-zeros can be applied in time O(N) given an adjacency list representation of the matrix. Finally, as ker(LL^⊤) = span{1}, we have Π_{L̃ D L̃^⊤} = I − (1/n) 11^⊤, and this projection matrix can be applied in O(n) time. Altogether, this means that T_matvec + T_sol + T_proj = O(m log³ n), which suffices to complete the proof.

10.3 Approximate Gaussian Elimination Algorithm


Recall Gaussian Elimination / Cholesky decomposition of a graph Laplacian L. We will use A(:, i) to denote the ith column of a matrix A. We can write the algorithm as

Algorithm 1: Gaussian Elimination / Cholesky Decomposition
Input: Graph Laplacian L
Output: Lower triangular L s.t. LL^⊤ = L
Let S_0 = L;
for i = 1 to i = n − 1 do
    l_i = (1/√(S_{i−1}(i, i))) S_{i−1}(:, i);
    S_i = S_{i−1} − l_i l_i^⊤;
l_n = 0_{n×1};
return L = [l_1 · · · l_n];

We want to introduce some notation that will help us describe and analyze a faster version
of Gaussian elimination – one that uses sampling to create a sparse approximation of the
decomposition.

Consider a Laplacian S of a graph H and a vertex v of H. We define Star(v, S) to be the Laplacian of the subgraph of H consisting of the edges incident on v. We define

    Clique(v, S) = Star(v, S) − (1/S(v, v)) S(:, v) S(:, v)^⊤.

For example, suppose

    L = [ W    −a^⊤              ]
        [ −a   diag(a) + L_{−1}  ].

Then

    Star(1, L) = [ W    −a^⊤    ]    and    Clique(1, L) = [ 0    0                        ]
                 [ −a   diag(a) ]                          [ 0    diag(a) − (1/W) a a^⊤    ],

which is illustrated in Figure 10.1.
which is illustrated in Figure 10.1.

Figure 10.1: Gaussian Elimination: Clique(1, L) = Star(1, L) − (1/L(1, 1)) L(:, 1) L(:, 1)^⊤.

In Chapter 8, we proved that Clique(v, S) is a graph Laplacian, as follows from the proof of Claim 8.1.1 in that chapter. Thus we have the following.

Claim 10.3.1. If S is the Laplacian of a connected graph, then Clique(v, S) is a graph Laplacian.
Note that in Algorithm 1, we have l_i l_i^⊤ = Star(v_i, S_{i−1}) − Clique(v_i, S_{i−1}). The update rule can be rewritten as

    S_i = S_{i−1} − Star(v_i, S_{i−1}) + Clique(v_i, S_{i−1}).
This also provides a way to understand why Gaussian Elimination is slow in some cases. At each step, one vertex is eliminated, but a clique is added to the subgraph on the remaining vertices, making the graph denser. And at the ith step, computing Star(v_i, S_{i−1}) takes around deg(v_i) time, but computing Clique(v_i, S_{i−1}) requires around deg(v_i)² time. In order to speed up Gaussian Elimination, the algorithmic idea of [KS16] is to plug in a sparse approximation of the intended clique instead of the entire one.
The following procedure CliqueSample(v, S) produces a sparse approximation of Clique(v, S). Let V be the vertex set of the graph associated with S and E the edge set. We define b_{i,j} ∈ R^V to be the vector with

    b_{i,j}(i) = 1,   b_{i,j}(j) = −1,   and   b_{i,j}(k) = 0 for k ≠ i, j.

Given weights w ∈ R^E and a vertex v ∈ V, we let

    w_v = Σ_{(u,v) ∈ E} w(u, v).

Algorithm 2: CliqueSample(v, S)
Input: Graph Laplacian S ∈ R^{V×V} of a graph with edge weights w, and vertex v ∈ V
Output: Y_v ∈ R^{V×V}, a sparse approximation of Clique(v, S)
Y_v ← 0_{n×n};
foreach multiedge e = (v, i) from v to a neighbor i do
    Randomly pick a neighbor j of v with probability w(j, v)/w_v;
    If i ≠ j, let Y_v ← Y_v + ( w(i, v) w(j, v)/(w(i, v) + w(j, v)) ) b_{i,j} b_{i,j}^⊤;
return Y_v;

Remark 10.3.2. We can implement each sampling of a neighbor j in O(1) time using a
classical algorithm known as Walker’s method (also known as the Alias method or Vose’s
method). This algorithm requires an additional O(degS (v)) time to initialize a data struc-
ture used for sampling. Overall, this means the total time for O(degS (v)) samples is still
O(degS (v)).
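The following Python/NumPy sketch (with an illustrative star around one vertex) implements Algorithm 2 and empirically checks Lemma 10.3.3 below, i.e. that the sample equals Clique(v, S) in expectation:

import numpy as np

def clique_sample(n, v, nbrs, w, rng):
    # Algorithm 2: nbrs lists the (multi-)neighbors of v, and w[i] is the
    # weight of multiedge (v, nbrs[i]). Returns the sampled Laplacian Y_v.
    # (We sample with rng.choice for clarity; Walker's method from Remark
    # 10.3.2 would make each sample O(1).)
    Y = np.zeros((n, n))
    wv = w.sum()
    for i in range(len(nbrs)):
        j = rng.choice(len(nbrs), p=w / wv)    # neighbor j w.p. w(j,v)/w_v
        if nbrs[i] != nbrs[j]:
            wij = w[i] * w[j] / (w[i] + w[j])
            b = np.zeros(n)
            b[nbrs[i]], b[nbrs[j]] = 1.0, -1.0
            Y += wij * np.outer(b, b)
    return Y

rng = np.random.default_rng(0)
n, v = 4, 0
nbrs, w = [1, 2, 3], np.array([1., 2., 3.])
avg = sum(clique_sample(n, v, nbrs, w, rng) for _ in range(20000)) / 20000

# For a pure star, Clique(v, S) is the complete graph on the neighbors
# with edge weights w(i,v) w(j,v) / w_v.
C = np.zeros((n, n))
for i in range(3):
    for j in range(i + 1, 3):
        b = np.zeros(n); b[nbrs[i]], b[nbrs[j]] = 1.0, -1.0
        C += (w[i] * w[j] / w.sum()) * np.outer(b, b)
print(np.abs(avg - C).max())   # small (Monte Carlo error, roughly 0.01)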

Lemma 10.3.3. E[Y_v] = Clique(v, S).

Proof. Let C = Clique(v, S). Observe that both E[Y_v] and C are Laplacians. Thus it suffices to verify E[Y_v(i, j)] = C(i, j) for i ≠ j. We have

    C(i, j) = −w(i, v) w(j, v) / w_v,

and

    E[Y_v(i, j)] = −( w(i, v) w(j, v)/(w(i, v) + w(j, v)) ) · ( w(j, v)/w_v + w(i, v)/w_v ) = −w(i, v) w(j, v)/w_v = C(i, j).

Remark 10.3.4. Lemma 10.3.3 shows that CliqueSample(v, L) produces the original
Clique(v, L) in expectation.

Now, we define Approximate Gaussian Elimination.

Algorithm 3: Approximate Gaussian Elimination / Cholesky Decomposition
Input: Graph Laplacian L
Output: Lower triangular* L as given in Theorem 10.2.4
Let S_0 = L;
Generate a random permutation π on [n];
for i = 1 to i = n − 1 do
    l_i = (1/√(S_{i−1}(π(i), π(i)))) S_{i−1}(:, π(i));
    S_i = S_{i−1} − Star(π(i), S_{i−1}) + CliqueSample(π(i), S_{i−1});
l_n = 0_{n×1};
return L = [l_1 · · · l_n] and π;

* L is not actually lower triangular. However, if we let P_π be the permutation matrix corresponding to π, then P_π L is lower triangular. Knowing the ordering that achieves this is enough to let us implement forward and backward substitution for solving linear equations in L and L^⊤.

Note that if we replace CliqueSample(π(i), S i−1 ) by Clique(π(i), S i−1 ) at each step, then
we can recover Gaussian Elimination, but with a random elimination order.

10.4 Analyzing Approximate Gaussian Elimination


In this Section, we’re going to analyze Approximate Gaussian Elimination, and see why it
works.
Ultimately, the main challenge in proving Theorem 10.2.4 will be to prove for the output L
of Algorithm 3 that with high probability
0.5L  LL>  1.5L. (10.1)
We can reduce this to proving that with high probability

L (LL> − L)L+/2 ≤ 0.5
+/2
(10.2)

Ultimately, the proof is going to have a lot in common with our proof of Matrix Bernstein
in Chapter 9. Overall, the lesson there was that when we have a sum of independent, zero-
mean random matrices, we can show that the sum is likely to have small spectral norm if the
spectral norm of each random matrix is small, and the matrix-valued variance is also small.
Thus, to replicate the proof, we need control over

1. The sample norms.

2. The sample variance.

But there is seemingly another major obstacle: we are trying to analyze a process where the samples are far from independent. Each time we sample edges, we add new edges to the remaining graph, which we will later sample again. This creates a lot of dependencies between the samples, which we have to handle.
However, it turns out that independence is more than what is needed to prove concentration. Instead, it suffices to have a sequence of random variables such that each is mean-zero in expectation, conditional on the previous ones. This is called a martingale difference sequence. We'll now learn about those.

10.4.1 Normalization, a.k.a. Isotropic Position

Since our analysis requires frequently measuring matrices after left- and right-multiplication by L^{+/2}, we reintroduce the "normalizing map" Φ : R^{n×n} → R^{n×n} defined by

    Φ(A) = L^{+/2} A L^{+/2}.

We previously saw this in Chapter 9.

10.4.2 Martingales

A scalar martingale is a sequence of random variables Z0 , . . . , Zk , such that

E [Zi | Z0 , . . . , Zi−1 ] = Zi−1 . (10.3)

That is, conditional on the outcome of all the previous random variables, the expectation of
Zi equals Zi−1 . If we unravel the sequence of conditional expectations, we get that without
conditioning E [Zk ] = E [Z0 ].
Typically, we use martingales to show a statement along the lines of "Z_k is concentrated around E[Z_k]".
We can also think of a martingale in terms of the sequence of changes in the Zi variables.
Let Xi = Zi − Zi−1 . The sequence of Xi s is called a martingale difference sequence. We can
now state the martingale condition as

E [Xi | Z0 , . . . , Zi−1 ] = 0.

And because Z0 and X1 , . . . , Xi−1 completely determine Z1 , . . . , Zi−1 , we could also write
the martingale condition equivalently as

E [Xi | Z0 , X1 , . . . , Xi−1 ] = 0.

Crucially, we can write

    Z_k = Z_0 + Σ_{i=1}^k (Z_i − Z_{i−1}) = Z_0 + Σ_{i=1}^k X_i,

and when we are trying to prove concentration, the martingale difference property of the X_i's is often "as good as" independence, meaning that Σ_{i=1}^k X_i concentrates similarly to a sum of independent random variables.

Matrix-valued martingales. We can also define matrix-valued martingales. In this case, we replace the martingale condition of Equation (10.3) with the condition that the whole matrix stays the same in expectation. For example, we could have a sequence of random matrices Z_0, . . . , Z_k ∈ R^{n×n} such that

    E[Z_i | Z_0, . . . , Z_{i−1}] = Z_{i−1}.    (10.4)


Lemma 10.4.1. Let L_i = S_i + Σ_{j=1}^i l_j l_j^⊤ for i = 1, ..., n and L_0 = S_0 = L. Then

    E[ L_i | all random variables before CliqueSample(π(i), S_{i−1}) ] = L_{i−1}.

Proof. Let’s only consider i = 1 here as other cases are similar.

L0 = L = l 1 l >
1 + Clique(v, L) + L−1

L1 = l 1 l >
1 + CliqueSample(v, L) + L−1

E [L1 |π(1)] = l 1 l >


1 + E [CliqueSample(v, L) |π(1)] + L−1

= l 1l >
1 + Clique(v, L) + L−1
= L0

where we used Lemma 10.3.3 to get E [CliqueSample(v, L) |π(1)] = Clique(v, L).


Remark 10.4.2. The sum Σ_{j=1}^i l_j l_j^⊤ can be treated as what has already been eliminated by (Approximate) Gaussian Elimination, while S_i is what is still left to be eliminated. In Approximate Gaussian Elimination, L_n = Σ_{i=1}^n l_i l_i^⊤ and our goal is to show that L_n ≈_K L. Note that L_i is always equal to the original Laplacian L for all i in (exact) Gaussian Elimination. Lemma 10.4.1 demonstrates that L_0, L_1, ..., L_n forms a matrix martingale.

Ultimately, our plan is to use this matrix martingale structure to show that "L_n is concentrated around L" in some appropriate sense. More precisely, the spectral approximation we would like to show can be established by showing that "Φ(L_n) is concentrated around Φ(L)".

10.4.3 Martingale Difference Sequence as Edge-Samples

We start by taking a slightly different view of the observations we used to prove Lemma 10.4.1. Recall that L_i = S_i + Σ_{j=1}^i l_j l_j^⊤ and L_{i−1} = S_{i−1} + Σ_{j=1}^{i−1} l_j l_j^⊤, and

    S_i = S_{i−1} − Star(π(i), S_{i−1}) + CliqueSample(π(i), S_{i−1}).

Putting these together, we get

    L_i − L_{i−1} = l_i l_i^⊤ + CliqueSample(π(i), S_{i−1}) − Star(π(i), S_{i−1})
                  = CliqueSample(π(i), S_{i−1}) − Clique(π(i), S_{i−1})    (10.5)
                  = CliqueSample(π(i), S_{i−1}) − E[CliqueSample(π(i), S_{i−1}) | preceding samples],

by Lemma 10.3.3. In particular, recall that by Lemma 10.3.3, conditional on the randomness before the call to CliqueSample(π(i), S_{i−1}), we have

    E[CliqueSample(π(i), S_{i−1}) | preceding samples] = Clique(π(i), S_{i−1}).

Adopting the notation of Lemma 10.3.3, we write

    Y_{π(i)} = CliqueSample(π(i), S_{i−1}),

and we further introduce notation for each multi-edge sample: for e ∈ Star(π(i), S_{i−1}), we write Y_{π(i),e} for the random edge Laplacian sampled when the algorithm is processing multi-edge e. Thus, conditional on preceding samples, we have

    Y_{π(i)} = Σ_{e ∈ Star(π(i), S_{i−1})} Y_{π(i),e}.    (10.6)

Note that even the number of multi-edges in Star(π(i), S_{i−1}) depends on the preceding samples. We also want to associate zero-mean variables with each edge. Conditional on preceding samples, we also define

    X_{i,e} = Φ( Y_{π(i),e} − E[Y_{π(i),e}] )   and   X_i = Σ_{e ∈ Star(π(i), S_{i−1})} X_{i,e},

and combining this with Equations (10.5) and (10.6),

    X_i = Φ( Y_{π(i)} − E[Y_{π(i)}] ) = Φ(L_i − L_{i−1}).

Altogether, we can write

    Φ(L_n − L) = Σ_{i=1}^n Φ(L_i − L_{i−1}) = Σ_{i=1}^n X_i = Σ_{i=1}^n Σ_{e ∈ Star(π(i), S_{i−1})} X_{i,e}.

Note that the X_{i,e} variables form a martingale difference sequence, because the linearity of Φ ensures they are zero-mean conditional on preceding randomness.

10.4.4 Stopped Martingales

Unfortunately, directly analyzing the concentration properties of the Li martingale that we


just introduced turns out to be difficult. The reason is that we’re trying to prove some very
delicate multiplicative error guarantees. And, if we analyze Li , we find that the multiplicative
error is not easy to control, after it’s already gotten big. But that’s not really what we care
about anyway: We want to say it never gets big in the first place, with high probability. So
we need to introduce another martingale, that lets us ignore the bad case when the error has
already gotten too big. At the same time, we also need to make sure that statements about
our new martingale can help us prove guarantees about Li . Fortunately, we can achieve both
at once. The technique we use is related to the much broader topic of martingale stopping
times, which we only scratch the surface of here. We’re also going to be quite informal about
it, in the interest of brevity. Lecture notes by Tropp [Tro19] give a more formal introduction
for those who are interested.
We define the stopped martingale sequence L̃_i by

    L̃_i = L_i      if for all j ≤ i we have L_j ⪯ 1.5 L,
    L̃_i = L_{j*}   for j* being the least j such that L_j ⋠ 1.5 L, otherwise.    (10.7)
Figure 10.2 shows the {L̃_i} martingale getting stuck at the first time L_{j*} ⋠ 1.5 L.

Figure 10.2: The stopped martingale {L̃_i} getting stuck at the first time L_{j*} ⋠ 1.5 L.

We state the following without proof:

Claim 10.4.3.

1. The sequence {L̃_i} for i = 0, . . . , n is a martingale.

2. ‖L^{+/2}(L̃_i − L)L^{+/2}‖ ≤ 0.5 implies ‖L^{+/2}(L_i − L)L^{+/2}‖ ≤ 0.5.

The martingale property also implies that the unconditional expectation satisfies E[L̃_n] = L. The proof of the claim is easy to sketch: For Part 1, each difference is zero-mean if the condition has not been violated, and is identically zero (and hence zero-mean) if it has been violated. For Part 2, if the martingale {L̃_i} has stopped, then ‖L^{+/2}(L̃_i − L)L^{+/2}‖ ≤ 0.5 is false, and the implication is vacuously true. If, on the other hand, the martingale has not stopped, the quantities are equal, because L̃_i = L_i, and again it's easy to see the implication holds.

Thus, ultimately, our strategy is going to be to show that ‖L^{+/2}(L̃_n − L)L^{+/2}‖ ≤ 0.5 with high probability. Expressed using the normalizing map Φ(·), our goal is to show that with high probability

‖Φ(L̃_n − L)‖ ≤ 0.5.

Stopped martingale difference sequence. In order to prove the spectral norm bound, we want to express the {L̃_i} martingale in terms of a sequence of martingale differences. To this end, we define X̃_i = Φ(L̃_i − L̃_{i−1}). This ensures that

X̃_i = { X_i   if for all j < i we have L_j ⪯ 1.5L                                       (10.8)
       { 0     otherwise

Whenever the modified martingale X̃_i has not yet stopped, we also introduce individual modified edge samples X̃_{i,e} = X_{i,e}. If the martingale has stopped, i.e. X̃_i = 0, then we take these edge samples X̃_{i,e} to be zero. We can now write

Φ(L̃_n − L) = Σ_{i=1}^{n} Φ(L̃_i − L̃_{i−1}) = Σ_{i=1}^{n} X̃_i = Σ_{i=1}^{n} Σ_{e∈Star(π(i),S_{i−1})} X̃_{i,e}.

Thus, we can see that Equation (10.2) is implied by

‖ Σ_{i=1}^{n} X̃_i ‖ ≤ 0.5.                                                              (10.9)

10.4.5 Sample Norm Control

In this subsection, we're going to see that the norm of each multi-edge sample is controlled throughout the algorithm.

Lemma 10.4.4. Let L and S be two Laplacians on the same vertex set.¹ If each multiedge e of Star(v, S) has bounded norm in the following sense,

‖L^{+/2} w_S(e) b_e b_e^⊤ L^{+/2}‖ ≤ R,

then each possible sampled multiedge e′ of CliqueSample(v, S) also satisfies

‖L^{+/2} w_new(e′) b_{e′} b_{e′}^⊤ L^{+/2}‖ ≤ R.

¹ L can be regarded as the original Laplacian we care about, while S can be regarded as some intermediate Laplacian appearing during Approximate Gaussian Elimination.

Proof. Let w = w_S for simplicity. Consider a sampled edge between i and j with weight w_new(i, j) = w(i, v)·w(j, v)/(w(i, v) + w(j, v)). Then

‖L^{+/2} w_new(i, j) b_{ij} b_{ij}^⊤ L^{+/2}‖
   = w_new(i, j)·‖L^{+/2} b_{ij} b_{ij}^⊤ L^{+/2}‖
   = w_new(i, j)·‖L^{+/2} b_{ij}‖²
   ≤ w_new(i, j)·( ‖L^{+/2} b_{iv}‖² + ‖L^{+/2} b_{jv}‖² )
   = (w(j, v)/(w(i, v) + w(j, v)))·‖L^{+/2} w(i, v) b_{iv} b_{iv}^⊤ L^{+/2}‖
     + (w(i, v)/(w(i, v) + w(j, v)))·‖L^{+/2} w(j, v) b_{jv} b_{jv}^⊤ L^{+/2}‖
   ≤ (w(j, v)/(w(i, v) + w(j, v)))·R + (w(i, v)/(w(i, v) + w(j, v)))·R
   = R.

The first inequality uses the triangle inequality for effective resistances in L: the quantity ‖L^{+/2} b_{ij}‖² is the effective resistance between i and j, and effective resistance is a distance, as we proved in Chapter 7. The second inequality just uses the conditions of this lemma.
Remark 10.4.5. Lemma 10.4.4 only requires that each single multiedge has small norm, rather than that the sum of all multiedges between a pair of vertices has small norm. The lemma also tells us that, after sampling, each multiedge in the new graph still satisfies the bounded norm condition.

From the Lemma, we can conclude that each edge sample Y_{π(i),e} satisfies ‖Φ(Y_{π(i),e})‖ ≤ R, provided the assumptions of the Lemma hold. Let's record this observation as a Lemma.

Lemma 10.4.6. If for all e ∈ Star(v, S_i),

‖Φ(w_{S_i}(e) b_e b_e^⊤)‖ ≤ R,

then for all e ∈ Star(π(i), S_i),

‖Φ(Y_{π(i),e})‖ ≤ R.

Preprocessing by multi-edge splitting. In the original Laplacian L of the graph G = (V, E, w), we have for each edge ê that

w(ê) b_ê b_ê^⊤ ⪯ Σ_e w(e) b_e b_e^⊤ = L.

This also implies that

‖L^{+/2} w(ê) b_ê b_ê^⊤ L^{+/2}‖ ≤ 1.
Now, that means that if we split every original edge e of the graph into K multi-edges e_1, . . . , e_K, each with a fraction 1/K of the weight, we get a new graph G′ = (V, E′, w′) such that

Claim 10.4.7.

1. G′ and G have the same graph Laplacian.

2. |E′| = K|E|.

3. For every multi-edge e in G′,

   ‖L^{+/2} w′(e) b_e b_e^⊤ L^{+/2}‖ ≤ 1/K.

Before we run Approximate Gaussian Elimination, we are going to do this multi-edge splitting to ensure we have control over multi-edge sample norms. Combined with Lemma 10.4.4, this immediately establishes the next lemma, because we start off with all multi-edges having bounded norm and only produce multi-edges with bounded norm.

Lemma 10.4.8. When Algorithm 3 is run on the (multi-edge) Laplacian of G′, arising from splitting edges of G into K multi-edges, every edge sample Y_{π(i),e} satisfies

‖Φ(Y_{π(i),e})‖ ≤ 1/K.

As we will see later, K = 200 log² n suffices.
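For concreteness, here is a minimal Python sketch of the splitting step; the edge-list representation of the weighted multi-graph and the helper name are our own choices for illustration, not part of the formal development.

# A minimal sketch of multi-edge splitting (Claim 10.4.7).
# We represent a weighted multi-graph as a list of (u, v, weight) tuples.

def split_edges(edges, K):
    """Split every edge into K parallel multi-edges, each carrying 1/K of the weight."""
    multi_edges = []
    for (u, v, w) in edges:
        multi_edges.extend((u, v, w / K) for _ in range(K))
    return multi_edges

edges = [(0, 1, 2.0), (1, 2, 3.0)]
split = split_edges(edges, K=4)
# The Laplacian is unchanged (total weight per vertex pair is preserved),
# while each individual multi-edge now has 1/K of the original edge's norm bound.
assert len(split) == 4 * len(edges)
assert abs(sum(w for (_, _, w) in split) - sum(w for (_, _, w) in edges)) < 1e-12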

10.4.6 Random Matrix Concentration from Trace Exponentials

Let us recall how matrix-valued variances come into the picture when proving concentration
following the strategy from Matrix Bernstein in Chapter 9.
For some matrix-valued random variable X ∈ S^n, we'd like to bound Pr[‖X‖ ≥ 0.5]. Using Markov's inequality, and some observations about matrix exponentials and traces, we saw that for all θ > 0,

Pr[‖X‖ ≥ 0.5] ≤ exp(−0.5θ)·( E[Tr(exp(θX))] + E[Tr(exp(−θX))] ).                        (10.10)

We then want to bound E[Tr(exp(θX))] using Lieb's theorem. We can handle E[Tr(exp(−θX))] similarly.
Theorem 10.4.9 (Lieb). Let f : S^n_{++} → R be a matrix function given by

f(A) = Tr(exp(H + log(A)))

for some H ∈ S^n. Then −f is convex (i.e. f is concave).

As observed by Tropp, this is useful for proving matrix concentration statements. Combined
with Jensen’s inequality, it gives that for a random matrix X ∈ S n and a fixed H ∈ S n

E [Tr (exp (H + X ))] ≤ Tr (exp (H + log(E [exp(X )]))) .

The next crucial step was to show that it suffices to obtain an upper bound on the matrix E[exp(X)] w.r.t. the Loewner order. Using the following three lemmas, this conclusion is an immediate corollary.

Lemma 10.4.10. If A ⪯ B, then Tr(exp(A)) ≤ Tr(exp(B)).

Lemma 10.4.11. If 0 ≺ A ⪯ B, then log(A) ⪯ log(B).

Lemma 10.4.12. log(I + A) ⪯ A for A ≻ −I.

Corollary 10.4.13. For a random matrix X ∈ S^n and a fixed H ∈ S^n, if E[exp(X)] ⪯ I + U where U ≻ −I, then

E[Tr(exp(H + X))] ≤ Tr(exp(H + U)).

10.4.7 Mean-Exponential Bounds from Variance Bounds

To use Corollary 10.4.13, we need to construct useful upper bounds on E [exp(X )]. This can
be done, starting from the following lemma.

Lemma 10.4.14. exp(A) ⪯ I + A + A² for ‖A‖ ≤ 1.

If X is zero-mean and ‖X‖ ≤ 1, this means that E[exp(X)] ⪯ I + E[X²], which is how we end up wanting to bound the matrix-valued variance E[X²]. In the rest of this subsection, we're going to see that the matrix-valued variance of the stopped martingale is bounded throughout the algorithm.
Firstly, we note that for a single edge sample X̃_{i,e}, by Lemma 10.4.8, we have that

‖X̃_{i,e}‖ = ‖Φ(Y_{π(i),e} − E[Y_{π(i),e}])‖ ≤ 1/K,

using that ‖A − B‖ ≤ max(‖A‖, ‖B‖) for A, B ⪰ 0, and ‖E[A]‖ ≤ E[‖A‖] by Jensen's inequality.
Thus, if 0 < θ ≤ K, we have that

E[exp(θX̃_{i,e}) | preceding samples] ⪯ I + E[(θX̃_{i,e})² | preceding samples]          (10.11)
                                      ⪯ I + θ²·(1/K)·E[Φ(Y_{π(i),e}) | preceding samples].
10.4.8 The Overall Mean-Trace-Exponential Bound

We will use E_{(<i)} to denote expectation over variables preceding the ith elimination step. We are going to refrain from explicitly writing out conditioning in our expectations, but any inner expectation that appears inside another outer expectation should be taken as conditional on the outer expectation. We are going to use d_i to denote the multi-edge degree of vertex π(i) in S_{i−1}. This is exactly the number of edge samples in the ith elimination. Note that there is no elimination at step n (the algorithm is already finished). As a notational convenience, let's write n̂ = n − 1. With all that in mind, we bound the mean-trace-exponential for some parameter 0 < θ ≤ 0.5·√K:

E Tr exp( θ Σ_{i=1}^{n̂} X̃_i )                                                           (10.12)

  = E_{(<n̂)} E_{π(n̂)} E_{X̃_{n̂,1}} · · · E_{X̃_{n̂,d_{n̂}−1}} E_{X̃_{n̂,d_{n̂}}} Tr exp( θ Σ_{i=1}^{n̂−1} X̃_i + θ Σ_{e=1}^{d_{n̂}−1} X̃_{n̂,e} + θ X̃_{n̂,d_{n̂}} )

    (the first two sums play the role of H in Corollary 10.4.13, and X̃_{n̂,1}, . . . , X̃_{n̂,d_{n̂}} are independent conditional on (< n̂), π(n̂))

  ≤ E_{(<n̂)} E_{π(n̂)} E_{X̃_{n̂,1}} · · · E_{X̃_{n̂,d_{n̂}−1}} Tr exp( θ Σ_{i=1}^{n̂−1} X̃_i + θ Σ_{e=1}^{d_{n̂}−1} X̃_{n̂,e} + (1/K)·θ²·E_{X̃_{n̂,d_{n̂}}} Φ(Y_{π(n̂),d_{n̂}}) )

    by Equation (10.11) and Corollary 10.4.13
  ⋮   repeating the same step for each multi-edge sample X̃_{n̂,1}, . . . , X̃_{n̂,d_{n̂}−1}

  ≤ E_{(<n̂)} E_{π(n̂)} Tr exp( θ Σ_{i=1}^{n̂−1} X̃_i + (1/K)·θ²·Σ_{e=1}^{d_{n̂}} E_{X̃_{n̂,e}} Φ(Y_{π(n̂),e}) )

  = E_{(<n̂)} E_{π(n̂)} Tr exp( θ Σ_{i=1}^{n̂−1} X̃_i + (1/K)·θ²·Φ(Clique(π(n̂), S_{n̂−1})) )

To further bound this quantity, we now need to deal with the random choice of π(n̂). We'll be able to use this to bound the trace-exponential in a very strong way. From a random matrix perspective, it's the following few steps that give the analysis its surprising strength. We can treat (1/K)·θ²·Φ(Clique(π(n̂), S_{n̂−1})) as a random matrix. It is not zero-mean, but we can still bound the trace-exponential using Corollary 10.4.13. We can also bound the expected matrix exponential in that case, using a simple corollary of Lemma 10.4.14.
Corollary 10.4.15. exp(A) ⪯ I + (1 + R)A for 0 ⪯ A with ‖A‖ ≤ R ≤ 1.

Proof. The conclusion follows after observing that for 0 ⪯ A with ‖A‖ ≤ R, we have A² ⪯ RA. We can see this by considering the spectral decomposition of A and dealing with each eigenvalue separately.

Next, we need a simple structural observation about the cliques created by elimination:
Claim 10.4.16.

Clique(π(i), S_i) ⪯ Star(π(i), S_i) ⪯ S_i

Proof. The first inequality is immediate from Clique(π(i), S_i) ⪯ Clique(π(i), S_i) + l_i l_i^⊤ = Star(π(i), S_i). The latter inequality Star(π(i), S_i) ⪯ S_i follows from the star being a subgraph of the whole Laplacian S_i.

Next we make use of the fact that X̃_i is from the difference sequence of the stopped martingale. This means we can assume

S_i ⪯ 1.5L,

since otherwise X̃_i = 0 and we get an even better bound on the trace-exponential. To make this formal, in Equation (10.12), we ought to do a case analysis that also includes the case X̃_i = 0 when the martingale has stopped, but we omit this.

Thus we can conclude by Claim 10.4.16 that

‖Φ(Clique(π(i), S_i))‖ ≤ 1.5.

By our assumption 0 < θ ≤ 0.5·√K, we have ‖(1/K)·θ²·Φ(Clique(π(i), S_{i−1}))‖ ≤ 1, so that by Corollary 10.4.15,

E_{π(i)} exp( (1/K)·θ²·Φ(Clique(π(i), S_{i−1})) ) ⪯ I + (2/K)·θ²·E_{π(i)} Φ(Clique(π(i), S_{i−1}))       (10.13)
                                                  ⪯ I + (2/K)·θ²·E_{π(i)} Φ(Star(π(i), S_{i−1})),

by Claim 10.4.16.
Next we observe that, because every multi-edge appears in exactly two stars, and π(i) is chosen uniformly at random among the n + 1 − i vertices that S_{i−1} is supported on, we have

E_{π(i)} Star(π(i), S_{i−1}) = (2/(n + 1 − i))·S_{i−1}.
And, since we assume S_{i−1} ⪯ 1.5L, we further get

E_{π(i)} exp( (1/K)·θ²·Φ(Clique(π(i), S_{i−1})) ) ⪯ I + (6θ²/(K(n + 1 − i)))·I.
We can combine this with Equation (10.12) and Corollary 10.4.13 to get

E Tr exp( θ Σ_{i=1}^{n̂} X̃_i )
  ≤ E_{(<n̂)} E_{π(n̂)} Tr exp( θ Σ_{i=1}^{n̂−1} X̃_i + (1/K)·θ²·Φ(Clique(π(n̂), S_{n̂−1})) )
  ≤ E_{(<n̂)} Tr exp( θ Σ_{i=1}^{n̂−1} X̃_i + (6θ²/(K(n + 1 − n̂)))·I ).
And by repeating this analysis for each term X̃_i, we get

E Tr exp( θ Σ_{i=1}^{n̂} X̃_i ) ≤ Tr exp( Σ_{i=1}^{n̂} (6θ²/(K(n + 1 − i)))·I )
                               ≤ Tr exp( (7θ²·log(n)/K)·I )
                               = n·exp( 7θ²·log(n)/K ).

Then, by choosing K = 200 log² n and θ = 0.5·√K, we get

exp(−0.5θ)·E Tr exp( θ Σ_{i=1}^{n̂} X̃_i ) ≤ exp(−0.5θ)·n·exp( 7θ²·log(n)/K ) ≤ 1/n⁵.
E Tr exp( −θ Σ_{i=1}^{n̂} X̃_i ) can be bounded by an identical argument, so that Equation (10.10) gives

Pr[ ‖ Σ_{i=1}^{n̂} X̃_i ‖ ≥ 0.5 ] ≤ 2/n⁵.
Thus we have established ‖ Σ_{i=1}^{n̂} X̃_i ‖ ≤ 0.5 with high probability (Equation (10.9)), and this in turn implies Equation (10.2), and finally Equation (10.1):

0.5L ⪯ LL^⊤ ⪯ 1.5L.

Now, all that's left to note is that the running time is linear in the multi-edge degree of the vertex being eliminated in each iteration (and this also bounds the number of non-zero entries being created in L). The total number of multi-edges left in the remaining graph stays constant at Km = O(m log² n). Thus the expected degree in the ith elimination is Km/(n + 1 − i), because the remaining number of vertices is n + 1 − i. Hence the total running time and total number of non-zero entries created can both be bounded as

Σ_i Km·(1/(n + 1 − i)) = O(m log³ n).

We can further prove that the bound O(m log³ n) on running time and number of non-zeros in L holds with high probability (e.g. 1 − 1/n⁵). To show this, we essentially need a scalar Chernoff bound, except the degrees are in fact not independent, and so we need a scalar martingale concentration result, e.g. Azuma's Inequality. This completes the proof of Theorem 10.2.4.

Part III

Combinatorial Graph Algorithms

Chapter 11

Classical Algorithms for Maximum


Flow I

11.1 Maximum Flow


In this chapter, we will study the Maximum Flow Problem and discuss some classical algo-
rithms for finding solutions to it.

Setup. Consider a directed graph G = (V, E, c), where V denotes vertices, E denotes edges
and c ∈ R^E, c ≥ 0 denotes edge capacities. In contrast to earlier chapters, the direction of each edge will be important, and not just as a bookkeeping tool (which is how we previously used it in undirected graphs). We consider an edge (u, v) ∈ E to be directed from u to v. Edge
capacities are associated with directed edges, and we allow both edge (u, v) and (v, u) to
exist, and they may have different capacities.
A flow is any vector f ∈ RE .
We say that a flow is feasible when 0 ≤ f ≤ c. The constraint 0 ≤ f ensures that the flow
respects edge directions, while the constraint f ≤ c ensures that the flow does not exceed
capacity on any edge.
We still define the edge-vertex incidence matrix B ∈ R^{V×E} of G by

B(v, e) = {  1   if e = (u, v)
          { −1   if e = (v, u)
          {  0   otherwise.

As in the undirected case, a demand vector is a vector d ∈ RV . And as in the undirected


case, we can express the net flow constraint that f routes the demand d by

Bf = d .

We will focus on the case:

Bf = F(−e_s + e_t) = F·b_st.

Flows that satisfy the above equation for some scalar F are called s-t flows, where s, t ∈ V. The vertex s is called the source, t is called the sink, and e_s, e_t are indicator vectors for the source and sink nodes respectively. The vector b_st has −1 at the source and 1 at the sink. The maximum flow problem can be expressed as a linear program as follows:

max_{f∈R^E, F}   F
s.t.   Bf = F·b_st                                                   (11.1)
       0 ≤ f ≤ c

We use val(f) to denote F when Bf = F·b_st.
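To make the notation concrete, here is a small numerical sketch (the graph and flow values are made up for illustration) checking that Bf = F·b_st holds for an s-t flow:

# A small check of Bf = F·b_st on a toy graph; graph and flow are illustrative only.
import numpy as np

# Graph on vertices {0 = s, 1, 2 = t} with edges (s,1), (1,t), (s,t).
edges = [(0, 1), (1, 2), (0, 2)]
n, m = 3, len(edges)

B = np.zeros((n, m))
for j, (u, v) in enumerate(edges):
    B[u, j] = -1.0   # edge leaves u
    B[v, j] = 1.0    # edge enters v

f = np.array([1.0, 1.0, 2.0])    # one unit along s->1->t, two units along s->t
b_st = np.zeros(n); b_st[0] = -1.0; b_st[2] = 1.0

# Bf should equal F·b_st with F = val(f) = 3.
assert np.allclose(B @ f, 3.0 * b_st)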

11.2 Flow Decomposition

We now look at a way of simplifying flows.

Definition 11.2.1. An s-t path flow is a flow f ≥ 0 that can be written as

f = α·Σ_{e∈p} e_e,

where p is a simple path from s to t and α ≥ 0.

Definition 11.2.2. A cycle flow is a flow f ≥ 0 that can be written as

f = α·Σ_{e∈c} e_e,

where c is a simple cycle and α ≥ 0.

Lemma 11.2.3 (The path-cycle decomposition lemma). Any s-t flow f can be decomposed
into a sum of s-t path flows and cycle flows such that the sum contains at most nnz(f ) terms.
Note nnz(f ) ≤ |E|.

Proof. We perform induction on the number of edges with non-zero flow, which we denote by nnz(f). Note that by "the support of f", we mean the set {e ∈ E : f(e) ≠ 0}.

Base case: f = 0: nothing to do.

Inductive step: Try to find a path from s to t OR a cycle in the support of f.

"Path" case. If there exists such an s-t path p, let α be the minimum flow value along the edges of the path, i.e.

α = min_{(a,b)∈p} f(a, b),    and set    f′ = α·Σ_{e∈p} e_e.

Update the flow f by

f ← f − f′.

The value of the flow will still be non-negative after this update, as we subtracted the minimum entry along any positive edge on the path. The number of non-zeros, nnz(f), went down by at least one. Note that the updated f must again be an s-t flow, as it is the difference of two s-t flows.
"Cycle" case. Suppose we find a simple cycle c in the support of f. Let α be the minimum flow value along the edges of the cycle, i.e.

α = min_{(a,b)∈c} f(a, b),    and set    f′ = α·Σ_{e∈c} e_e.

Update the flow f by

f ← f − f′.

As in the path case, f stays non-negative, and the number of non-zeros, nnz(f), goes down by at least one. Note that the updated f must again be an s-t flow, as it is the difference of two s-t flows.
"No path or cycle" case. Suppose we can find neither a path nor a cycle, and f ≠ 0. Then there must be an edge (u, v) with non-zero flow leading into a vertex v ≠ s, t and with no outgoing edge from v in the support of f. In that case, we must have (Bf)(v) > 0. But since v ≠ t, this contradicts Bf = F·b_st. So this case cannot occur.
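The inductive argument above translates into a short procedure. Below is a minimal Python sketch; the dict-of-edges flow representation and helper names are our own choices, and for simplicity we peel off all cycle flows first and then peel s-t path flows from the remaining acyclic flow (a slight reordering of the proof's case analysis).

# A sketch of path-cycle decomposition (Lemma 11.2.3). A flow is a dict mapping
# directed edges (u, v) to positive values. Requires Python 3.8+ (walrus operator).

def _find_cycle(flow):
    """Return the vertex sequence of a directed cycle in the support, or None."""
    succ = {}
    for (u, v) in flow:
        succ.setdefault(u, []).append(v)
    done = set()
    for root in list(succ):
        if root in done:
            continue
        walk, pos = [root], {root: 0}
        while walk:
            u = walk[-1]
            nxt = next((v for v in succ.get(u, []) if v not in done), None)
            if nxt is None:
                done.add(u); pos.pop(u); walk.pop()    # retreat: u leads nowhere new
            elif nxt in pos:
                return walk[pos[nxt]:] + [nxt]          # closed a directed cycle
            else:
                pos[nxt] = len(walk); walk.append(nxt)
    return None

def decompose(flow, s, t):
    """Return (paths, cycles): lists of (vertex_list, value) pairs summing to flow."""
    flow = {e: x for e, x in flow.items() if x > 0}
    def peel(vertices):
        edges = list(zip(vertices, vertices[1:]))
        alpha = min(flow[e] for e in edges)             # bottleneck value along the walk
        for e in edges:
            flow[e] -= alpha
            if flow[e] == 0:
                del flow[e]                             # nnz(f) drops by at least one
        return alpha
    cycles = []
    while (c := _find_cycle(flow)) is not None:
        cycles.append((c, peel(c)))
    paths = []
    while flow:                                         # remaining flow is acyclic
        walk = [s]
        while (outs := [v for (u, v) in flow if u == walk[-1]]):
            walk.append(outs[0])
        # By flow conservation, a walk in a non-empty acyclic s-t flow ends at t.
        paths.append((walk, peel(walk)))
    return paths, cycles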

Lemma 11.2.4. In any s-t max flow problem instance, there is an optimal flow f ∗ with a
path-cycle decomposition that has only paths and no cycles.

Proof. Let f̃ be an optimal flow. Let f ∗ be the sum of path flows in the path cycle decom-
position of f̃ . They route the same flow (as cycles contribute to no net flow from s to t).
Thus

Bf ∗ = B f̃

and hence val(f ∗ ) = val(f̃ ). Furthermore

0 ≤ f ∗ ≤ f̃ ≤ c

The first inequality follows from f∗ being a sum of non-negative path flows. The second inequality holds as f∗ is upper bounded in every single entry by f̃, because we obtained it by removing cycle flows with non-negative entries. The third inequality holds because f̃ is a feasible flow, so it is upper bounded by the capacities.

An optimal flow solving the maximum flow problem may not be unique. For example,
consider the graph below with source s and sink t:

There are two optimal paths in this example. Maximum flow is a convex optimization
problem but not a strongly convex problem as the solutions are not unique, and this is part
of what makes it hard to solve.

11.3 Cuts and Minimum Cuts

The decomposition shown earlier provides a way to show that the maximum flow in a graph
is upper bounded by constructing graph cuts.
Given a vertex subset S ⊆ V, we say that (S, V \ S) is a cut in G and that the value of the cut is

c_G(S) = Σ_{e∈E∩(S×(V\S))} c(e).

Note that in a directed graph, only edges crossing the cut going from S to V \ S count toward the cut.

Figure 11.1: Example of a cut: No edges go from S to V \ S, and so the value of this cut is zero.

Definition. (s-t cuts). We define an s-t cut to be a subset S ⊂ V , where s ∈ S and t ∈ V \S.

A decision problem: “Given an instance of the Maximum Flow problem, is there a flow
from s to t such that f 6= 0?”

If YES: We can decompose this flow into s-t path flows and cycle flows.

If NO: There is no flow path from s to t. Let S be the set of vertices reachable from source
s. Then (S, V \ S) is a cut in the graph, with no edges crossing from S to V \ S.
Figure 11.1 gives an example.

Upper bounding the maximum possible flow value. How can we recognize a maxi-
mum flow? Is there a way to confirm that a flow is of maximum value?
We can now introduce the Minimum Cut problem.

min_{S⊆V}   c_G(S)
s.t.   s ∈ S and t ∉ S                                               (11.2)

The Minimum Cut problem can also be phrased as a linear program, although we won’t see
that today.
We’d like to obtain a tight connection between flows and cuts in the graph. As a first step,
we won’t get that, but we can at least observe that the value of any s-t cut provides an
upper bound to the maximum possible s-t flow value.

Theorem 11.3.1 (Max Flow ≤ Min Cut). The maximum s-t flow value in a directed graph G (Program (11.1)) is upper bounded by the minimum value of any s-t cut (Program (11.2)). I.e. if S is an s-t cut, and f a feasible s-t flow, then

val(f) ≤ c_G(S).

In particular this holds for any minimum cut S∗ and maximum flow f∗, i.e. val(f∗) ≤ c_G(S∗).

Proof. Consider any feasible flow 0 ≤ f ≤ c and an s-t cut S, with T = V \ S, i.e. with the source on one side and the sink on the other, as shown in Figure 11.2. Consider a path-cycle decomposition of f; each s-t path must cross the cut going forward from S to T at least once. Every time a path flow passes through the cut, it has to use one of the edges going from S to T. The total amount of flow crossing the cut is bounded above by the total capacity of the cut, as otherwise the capacities would be violated. Thus

val(f) ≤ c_G(S, T) = Σ_{e∈E∩(S×T)} c(e).

Figure 11.2: s-t Cut

Note that the edges from T to S do not count towards the cut. The above inequality holds for every feasible flow and every s-t cut. This implies that max flow ≤ min cut.

This theorem is an instance of a general pattern, known as weak duality. Weak duality is a
relationship between two optimization programs, a maximization problem and a minimiza-
tion problem, where any solution to the former has its value upper bounded by the value of
any solution to the latter.

11.4 Algorithms for Max flow

How can we find a good flow?

Algorithm 4: A first attempt – bad idea?


f ← 0;
repeat
Find an s-t path flow f̃ that is feasible with respect to c − f .
f ← f + f̃

Does Algorithm 4 work? Consider the graph below with directed edges with capacities 1 at every edge. If we make a single path update as shown by the orange lines in Figure 11.3, then afterwards, using the remaining capacity, there's no path flow we can route, as shown in Figure 11.4. But the max flow is 2, as shown by the flow on orange edges in Figure 11.5. So, the above algorithm does not always find a maximum flow: It can get stuck at a point where it cannot make progress despite not having reached the maximum value.

Figure 11.3: Sending a unit s-t path flow through the graph.

Figure 11.4: Remaining edge capacities after sending a path flow through the graph as
depicted in Figure 11.3.

Figure 11.5: The flow depicted routes two units from s to t.

A better approach. It turns out we can fix the problem with the previous algorithm
using a simple fix. This idea is known as residual graphs.

Algorithm 5: Better Idea (Residual Graph)


f ← 0;
repeat
Find an s-t path flow f̃ that is feasible with respect to −f ≤ f̃ ≤ c − f .
f ← f + f̃

The lower bound −f(e) can be treated as an edge going in the other direction with capacity f(e). By convention, an edge in G with f(e) = c(e) is called saturated, and we do not include it in Gf. The graph defined above with such capacities is called the residual graph of f, denoted Gf. Gf is only defined for feasible f, since otherwise the constraint f̃ ≤ c − f gives trouble.
Suppose we start with a graph having a single edge with capacity 2 and we introduce a flow
of 1 unit.

The residual graph Gf has an edge with capacity 1 going forward and −1 capacity going
forward, but we can treat the latter as +1 capacity going backwards. So it is an edge that
allows you to undo the choice made to send flow along that direction.
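For illustration, the residual graph can be computed as follows; the edge-dict representation is a choice we make for these sketches, not part of the formal development.

# A sketch of building the residual graph G_f. Each original edge (u, v) with
# capacity c and flow x contributes c - x forward and x backward.

def residual_graph(capacities, flow):
    """capacities, flow: dicts over directed edges (u, v). Returns residual capacities."""
    residual = {}
    for (u, v), c in capacities.items():
        x = flow.get((u, v), 0)
        if c - x > 0:
            residual[(u, v)] = residual.get((u, v), 0) + (c - x)   # leftover forward capacity
        if x > 0:
            residual[(v, u)] = residual.get((v, u), 0) + x         # "undo" capacity backwards
    return residual

caps = {('s', 'a'): 2}
assert residual_graph(caps, {('s', 'a'): 1}) == {('s', 'a'): 1, ('a', 's'): 1}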

Let us consider the same example with its residual graph. The original graph is shown in
Figure 11.6.

Figure 11.6: Original graph G and an s-t path flow in G shown in orange.

The residual graph for the same is shown in Figure 11.7.

Figure 11.7: The residual graph w.r.t. the flow from Figure 11.6, and new s-t path flow which
is feasible in the residual graph.

Adding both flows together, we get the paths as shown in Figure 11.8 with value 2, which is
the optimum.

Figure 11.8: Maximum flow in the graph.

Let’s prove some important properties of residual graphs.

Lemma 11.4.1. Suppose f, f̂ are feasible in G. Then this implies that f̂ − f (where negative entries count as flow in the opposite direction) is feasible in Gf.

Proof. Since 0 ≤ f ≤ c and 0 ≤ f̂ ≤ c, we get −f ≤ f̂ − f ≤ c − f.

Lemma 11.4.2. Suppose that f is feasible in G and f̂ is feasible in Gf. Then, f + f̂ is feasible in G.

Proof. Since 0 ≤ f ≤ c and −f ≤ f̂ ≤ c − f, we get 0 ≤ f + f̂ ≤ c.

Lemma 11.4.3. A feasible f is optimal if and only if t is not reachable from s in Gf .

Proof. Let f be optimal, and suppose t is reachable from s in Gf. Then we can find an s-t path flow f̃ that is feasible in Gf, and val(f + f̃) > val(f). f + f̃ is feasible in G by Lemma 11.4.2. This is a contradiction, as we assumed f was optimal.

Suppose t is not reachable from s in Gf, and f is feasible, but not optimal. Let f∗ be optimal. Then by Lemma 11.4.1, the flow f∗ − f is feasible in Gf and val(f∗ − f) > 0. So there exists an s-t path flow in Gf (as we can do a path decomposition of f∗ − f). But this is a contradiction, as t is not reachable from s in Gf.

Theorem 11.4.4 (Max Flow = Min Cut theorem). The maximum flow in a directed graph
G equals the minimum cut.

Proof. Let f be a maximum flow, and consider the set S = {vertices reachable from s in Gf}. By Lemma 11.4.3, t ∉ S, so S is an s-t cut. Note that f saturates the edge capacities in the cut (S, V \ S) in G: Consider any edge e from S to T := V \ S in G. Since this edge does not exist in the residual graph, we must have f(e) = c(e). Moreover, any edge from T to S must carry zero flow, since otherwise its reversal would appear in Gf and extend S.

Figure 11.9: The cut between vertices reachable from S and everything else in Gf must have
all outgoing edges saturated by f .

This means that


val(f ) ≥ cG (S, V \ S).
Since we already know the opposite inequality by weak duality, we have shown that

val(f ) = cG (S, V \ S)

which proves the Max Flow = Min Cut theorem; also called strong duality.

Remark 11.4.5. Note that this proof also gives an algorithm to compute an s, t-min cut
given an s, t maximum flow f ∗ : Take one side of the cut to be the set S of nodes reachable
from s in the residual graph w.r.t. f ∗ .

Ford-Fulkerson Algorithm.
Algorithm 6: Ford-Fulkerson Algorithm
repeat
Add update by arbitrary s-t path flow in Gf (augment the flow f by the path flow)

Convergence properties and analysis of runtime of Ford-Fulkerson algorithm

• Does this algorithm terminate?


The algorithm terminates if the capacities are integers. However for irrational capaci-
ties the algorithm may not terminate.

• Does it converge towards the max flow value?
  Not necessarily: it may fail to converge to the max flow value if the augmenting paths are chosen poorly and the capacities are irrational.

Lemma 11.4.6. Consider the Ford-Fulkerson algorithm with integer capacities. The algorithm terminates in at most val(f∗) augmentations, i.e. in O(m·val(f∗)) time.

Proof. Each iteration increases the flow value by at least one unit, as the capacities in Gf are integral, and each iteration can be computed in O(m) time.
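For illustration, here is a minimal Python sketch of Ford-Fulkerson for integer capacities; we happen to find augmenting paths with BFS, which is one valid choice of augmenting-path rule. The edge-dict representation matches the earlier residual-graph sketch.

# A minimal Ford-Fulkerson sketch for integer capacities (illustration only).
from collections import deque

def max_flow(capacities, s, t):
    residual = dict(capacities)           # residual capacities, updated in place
    value = 0
    while True:
        # BFS for an s-t path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for (a, b), r in residual.items():
                if a == u and r > 0 and b not in parent:
                    parent[b] = u
                    queue.append(b)
        if t not in parent:
            return value                   # no augmenting path: flow is maximum
        # Recover the path and its bottleneck capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v)); v = parent[v]
        alpha = min(residual[e] for e in path)
        for (u, v) in path:
            residual[(u, v)] -= alpha
            residual[(v, u)] = residual.get((v, u), 0) + alpha
        value += alpha

caps = {('s','a'): 1, ('a','t'): 1, ('s','b'): 1, ('b','t'): 1, ('a','b'): 1}
assert max_flow(caps, 's', 't') == 2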

Can we do better than this? Suppose we pick the maximum bottleneck capacity (min-
imum capacity along path) augmenting path. This gives an algorithm that is better in some
regimes.
How to pick the maximum bottleneck capacity augmenting path? We are going to perform
a binary search on the capacities in Gf , to find a path with maximum bottleneck capacity.
Each time our binary search has picked a threshold capacity, we then try to find an s-t path
flow in Gf using only edges with capacity above that threshold. If we find a path, the binary
search tries a higher capacity next. If we don’t find a path, the binary search tries a lower
capacity next.
Using this approach, the time to find a single path is O(m log(n)), where m is the number of edges in the graph. This path must carry at least a 1/m fraction of the total amount of flow left in Gf. That is, if F̂ is the amount of flow left in Gf, then the path must carry at least F̂/m flow (from the path decomposition lemma, we know that there exists a decomposition into at most m paths, and the one carrying the most flow must carry at least the average amount).
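A sketch of this binary search in Python, reusing the edge-dict representation (the helper names are our own):

# Maximum-bottleneck s-t path by binary search over the distinct residual capacities.

def reachable_path(residual, s, t, threshold):
    """DFS for an s-t path using only edges of residual capacity >= threshold."""
    stack, parent = [s], {s: None}
    while stack:
        u = stack.pop()
        if u == t:
            path, v = [], t
            while parent[v] is not None:
                path.append((parent[v], v)); v = parent[v]
            return path[::-1]
        for (a, b), r in residual.items():
            if a == u and r >= threshold and b not in parent:
                parent[b] = u; stack.append(b)
    return None

def max_bottleneck_path(residual, s, t):
    values = sorted({r for r in residual.values() if r > 0})
    best = None
    lo, hi = 0, len(values) - 1
    while lo <= hi:                        # binary search over candidate thresholds
        mid = (lo + hi) // 2
        path = reachable_path(residual, s, t, values[mid])
        if path is not None:
            best = path; lo = mid + 1      # a path exists: try a larger threshold
        else:
            hi = mid - 1
    return best

caps = {('s','a'): 3, ('a','t'): 2, ('s','t'): 1}
assert max_bottleneck_path(caps, 's', 't') == [('s','a'), ('a','t')]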
So if the flow is integral, the algorithm completes when

(1 − 1/m)^T · val(f∗) < 1,

where T is the number of augmentations. This means it suffices to take

T = O(m log F),   where F = val(f∗),

and hence

Total time = O(m log(n)·T) = O(m² log n log F) ≤ O(m² log n log(mU)),

as F ≤ mU, where U is the maximum capacity.

Current state-of-the-art approaches for Max Flow. For the interested reader, we’ll
briefly mention the current state-of-the-art for Maximum Flow algorithms, which is a very
active area of research.

• Strongly polynomial time algorithms:

  – These have to work with real-valued capacities; here the best known running time is O(mn), by Orlin.

• General integer capacities: For a long time, the state-of-the-art was a result by Goldberg and Rao achieving a runtime of O(m·min(√m, n^{2/3})·log(n)·log(U)), where U is the maximum capacity. Very recently, m^{1+o(1)} log(U) has been claimed (not yet published in a peer-reviewed venue), see https://arxiv.org/abs/2203.00671. There are many other recent results prior to this work that we don't have time to cover here.

Chapter 12

Classical Algorithms for Maximum


Flow II

12.1 Overview
In this chapter we continue looking at classical max flow algorithms. We derive Dinic’s
algorithm which (unlike Ford-Fulkerson) converges in a polynomial number of iterations.
We will also show faster convergence on unit capacity graphs. The setting is the same as
last chapter: we have a directed graph G = (V, E, c) with positive capacities on the edges.
Our goal is to solve the maximum flow problem on this graph for a given source s and sink
t 6= s:

max_{F>0, f}   F
s.t.   Bf = F·b_{s,t}                                                (12.1)
       0 ≤ f ≤ c

12.2 Blocking Flows


Let f be any feasible flow in G and let Gf be its residual graph. Observe that we can partition the vertices of Gf into 'layers': first s itself, then all vertices reachable from s in one hop, then all vertices reachable in two hops, etc. For each vertex v ∈ V define its level in Gf as the length of the shortest path in Gf from s to v, denoted by ℓ_{Gf}(v) (or just ℓ(v) if the graph is clear from context). An edge (u, v) ∈ E can only take you up 'one level': if ℓ(v) ≥ 2 + ℓ(u) this would imply we can find a shorter s-v path by appending (u, v) to the shortest s-u path. However, edges can 'drop down' multiple levels (or be contained in the same level).

A key strategy in Dinic's algorithm will be to focus on 'short' augmenting paths. We can use the levels we defined above to isolate a subgraph of Gf containing all the information we need to find shortest paths.
Figure 12.1: A blocking flow is not a maximum flow.

Definition 12.2.1. Call an edge (u, v) admissible if ℓ(u) + 1 = ℓ(v). Let L be the set of admissible edges in Gf and define the level graph of Gf to be the subgraph induced by these edges.

Definition 12.2.2. Define a blocking flow in a residual graph Gf to be a flow f̂ feasible


in Gf , such that f̂ only uses admissible edges. Furthermore we require that for any s-t path
in the level graph of Gf , f̂ saturates at least one edge on this path.

The last condition makes a blocking flow ‘blocking’: it blocks any shortest augmenting path
in Gf by saturating one of its edges. Note however that a blocking flow is not necessarily a
maximum flow in the level graph.
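As a concrete illustration, the level graph of Definition 12.2.1 can be computed with a single BFS; this Python sketch assumes the edge-dict representation of residual capacities used in the earlier sketches.

# Computing levels ℓ(v) and the admissible edge set L of a residual graph.
from collections import deque

def level_graph(residual, s):
    """residual: dict over edges (u, v) -> capacity. Returns (levels, admissible edges)."""
    adj = {}
    for (u, v), r in residual.items():
        if r > 0:
            adj.setdefault(u, []).append(v)
    level = {s: 0}
    queue = deque([s])
    while queue:                                   # BFS computes ℓ(v) for reachable v
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    admissible = [(u, v) for (u, v), r in residual.items()
                  if r > 0 and u in level and v in level and level[u] + 1 == level[v]]
    return level, admissible

caps = {('s','a'): 1, ('a','t'): 1}
lvl, adm = level_graph(caps, 's')
assert lvl == {'s': 0, 'a': 1, 't': 2} and set(adm) == {('s','a'), ('a','t')}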

12.3 Dinic’s Algorithm


We can now formulate Dinic's¹ algorithm: start with f = 0, and then repeatedly add to f a blocking flow f̂ in Gf, until no more s-t paths exist in Gf.

Note that by doing a path decomposition on f̂ and adding these paths to f one by one, we see that our algorithm is a 'special case' of Ford-Fulkerson (with a particular augmenting path strategy) and hence inherits all the behaviors/bounds we proved in the last chapter.

Lemma 12.3.1. Let f be a feasible flow, f̂ a blocking flow in Gf, and define f′ = f + f̂ (think of f and f′ as the flows at some steps k and k + 1 of the algorithm). Then ℓ_{Gf′}(t) ≥ ℓ_{Gf}(t) + 1.

Proof. Let L, L′ be the edge sets of the level graphs of Gf, Gf′ respectively. We would now like to show that d_{L′}(s, t) ≥ 1 + d_L(s, t). Let's first assume that in fact d_{L′}(s, t) < d_L(s, t). Now take a shortest s-t path in the level graph of Gf′, say s = v_0, v_1, . . . , v_{d_{L′}(s,t)} = t. Let v_j be the first vertex along the path such that d_{L′}(s, v_j) < d_L(s, v_j). As d_{L′}(s, t) < d_L(s, t), such a v_j must exist.

We'd like to understand the edges in the level graph L′. Let E(Gf) denote the edges of Gf, and let rev(L) denote the set of reversed edges of L. The level graph edges of Gf′ must satisfy L′ ⊆ L ∪ rev(L) ∪ (E(Gf) \ L).

In our s-t path in L′, the vertex v_j is reached by an edge from v_{j−1}, and the levels of v_{j−1} in Gf and Gf′ are the same, so the edge (v_{j−1}, v_j) ∈ L′ skips at least one level forward in Gf. But edges in L do not skip a level in Gf, and edges in rev(L) or E(Gf) \ L do not move from a lower level to a higher level in Gf. So this edge cannot exist and we have reached a contradiction.

Now suppose that d_{L′}(s, t) = d_L(s, t). This means there is an s-t path in L′ using only edges in L – but such a path must contain an edge saturated by the blocking flow f̂, and saturated edges do not appear in Gf′, a contradiction. (Note: the reason this path must use only edges in L is similar to the previous case: the other possible types of edges do not move to higher levels in Gf at all, making the path too long if we use any of them.)

So we must have d_{L′}(s, t) ≥ 1 + d_L(s, t).

¹ Sometimes also transliterated as Dinitz.

An immediate corollary of this lemma is the convergence of Dinic’s algorithm:


Theorem 12.3.2. Dinic’s algorithm terminates in O(n) iterations.

Proof. By the previous lemma, the level of t increases by at least one each iteration. A
shortest path can only contain each vertex once (otherwise it would contain a cycle) so the
level of any vertex is never more than n.

For graphs with unit capacities (c = 1) we can prove even better bounds on the number of iterations.

Theorem 12.3.3. On unit capacity graphs, Dinic's algorithm terminates in

O(min{m^{1/2}, n^{2/3}})

iterations.

Proof. We prove the two bounds separately.

1. Suppose we run Dinic’s algorithm for k iterations, obtaining a flow f that is feasible
but not necessarily yet optimal. Let f̃ be an optimal flow in Gf (i.e. f + f̃ is a
maximum flow for the whole graph), and consider a path decomposition of f̃ . Since
after k iterations any s-t path has length k or more, we use up a total capacity of at
least val(f̃ )k across all edges. But the edges in Gf are either edges from the original
graph G or their reversals (but never both) meaning the total capacity of Gf is at most
m, hence val(f̃ ) ≤ m/k.
Recalling our earlier observation that our algorithm is a special case of Ford-Fulkerson,
this implies that our algorithm will terminate after at most another m/k iterations.

Hence the number of iterations is bounded by k + m/k for any k > 0. Substituting k = √m gives the first desired bound.

2. Suppose again that we run Dinic's algorithm for k iterations, obtaining a flow f. The level graph of Gf partitions the vertices into sets D_i = {u | ℓ(u) = i} for i ≥ 0. As shown before, the sink t must be at a level of at least k, meaning we have at least this many non-empty levels starting from D_0 = {s}. To simplify, discard all vertices and levels beyond D_{ℓ(t)}.

Now, consider choosing a level I uniformly at random from {1, . . . , k}. Since there are at most n − 1 vertices in total across these levels, E[|D_I|] ≤ (n − 1)/k, and by Markov's inequality

Pr[|D_I| ≥ 2n/k] ≤ ((n − 1)/k)/(2n/k) < 1/2.

Thus strictly more than k/2 of the levels i ∈ {1, . . . , k} have |D_i| < 2n/k, and so there must be two adjacent levels j, j + 1 for which this upper bound on the size holds. There can be at most |D_j|·|D_{j+1}| ≤ 4n²/k² edges between these levels, and by the max-flow min-cut theorem we saw in the previous chapter, this is an upper bound on the remaining flow in Gf, and hence on the number of iterations still needed for our algorithm to terminate.

This means the number of iterations is bounded by k + 4n²/k² for any k > 0, which is O(n^{2/3}) at k = 2n^{2/3}.

12.4 Finding Blocking Flows

What has been missing from our discussion so far is the process of actually finding a blocking
flow. We can achieve this using repeated depth-first search. We repeatedly do a search in the
level graph (so only using edges L) for s-t paths and augment these. We erase edges whose
subtrees have been exhausted and do not contain any augmenting paths to t. Pseudocode

for the algorithm is given below.

Algorithm 7: FindBlockingFlow
f ← 0;
H ← L;
repeat
    P ← Dfs(s, H, t);
    if P ≠ ∅ then
        Let f̂ be a flow that saturates path P;
        f ← f + f̂;
        Remove from H all edges saturated by f̂;
    else
        return f;
    end

Algorithm 8: Dfs(u, H, t)
if u = t then
    return the path P on the dfs-stack;
end
for (u, v) ∈ H do
    P ← Dfs(v, H, t);
    if P ≠ ∅ then
        return P;
    else
        Erase (u, v) from H;
    end
end
return ∅;
Lemma 12.4.1. For general graphs, FindBlockingFlow returns a blocking flow in O(nm) time. Hence Dinic's algorithm runs in O(n²m) time on general capacity graphs.

Proof. First, consider the amount of work spent pushing edges onto the stack which even-
tually results in augmentation along an s-t path consisting of those edges (i.e. adding flow
along that path). Since each augmenting path saturates at least one edge, we do at most m
augmentations. Each path has length at most n. Thus the total amount of work pushing
these edges to the stack, and removing them from the stack upon augmentation, and deleting
saturated edges, can be bounded by O(mn). Now consider the work spent pushing edges onto the stack which are later deleted because we "retreat" after not finding an s-t path. An edge can only be pushed onto the stack once in this way, since it is then deleted. So the total amount of work spent pushing edges to the stack and deleting them this way is O(m).
Lemma 12.4.2. For unit capacity graphs, FindBlockingFlow returns a blocking flow in O(m) time. Hence Dinic's algorithm runs in O(min{m^{3/2}, m·n^{2/3}}) time on unit capacity graphs.
Proof. When our depth-first search traverses some edge (u, v), one of two things will happen:
either we find no augmenting path in the subtree of v, leading to the erasure of this edge,
or we find an augmenting path which will necessarily saturate (u, v), again leading to its
erasure. This means each edge will be traversed at most once by the depth-first search.

Another interesting bound on the runtime, which we will not prove here, is that Dinic's algorithm will run in O(|E|·√|V|) time on bipartite matching graphs.

12.5 Minimum Cut as a Linear Program

Finally we show that minimum cut may be formulated as a linear program. For a subset S ⊆ V write c_G(S) for the sum of c(e) over all edges e = (u, v) that cross the cut, i.e. u ∈ S, v ∉ S. Note that 'reverse' edges are not counted in this cut. The minimum cut problem asks us to find some S ⊆ V with s ∈ S and t ∉ S such that c_G(S) is minimal:

min_{S⊆V}   c_G(S)
s.t.   s ∈ S                                                         (12.2)
       t ∉ S

We claim this is equivalent to the following minimization problem:

min_{x∈R^V}   Σ_{e∈E} c(e)·max{b_e^⊤ x, 0}
s.t.   x(s) = 0                                                      (12.3)
       x(t) = 1
       0 ≤ x ≤ 1

Recall b_e^⊤ x = x(v) − x(u) for e = (u, v). We can rewrite this as a proper linear program by introducing extra variables for the maximum:

min_{x∈R^V, u∈R^E}   c^⊤ u
s.t.   b_{s,t}^⊤ x = 1
       0 ≤ x ≤ 1                                                     (12.4)
       u ≥ 0
       u ≥ B^⊤ x

And, in fact, we'll also see that we don't need the constraint 0 ≤ x ≤ 1, leading to the following simpler program:

min_{x∈R^V, u∈R^E}   c^⊤ u
s.t.   b_{s,t}^⊤ x = 1                                               (12.5)
       u ≥ 0
       u ≥ B^⊤ x

Lemma 12.5.1. Programs (12.2), (12.4), and (12.5) have equal optimal values.

Proof. We start by considering equality between the optimal values of Program (12.2) and Program (12.4). Let S be an optimal solution to Program (12.2) and take x = 1_{V\S}. Then x is a feasible solution to Program (12.4) with cost equal to c_G(S).

Conversely, suppose x is an optimal solution to Program (12.4). Let α be uniform in [0, 1] and define S = {v ∈ V | x(v) ≤ α}. This is called a threshold cut. We can verify that

Pr[e is cut by S] = max{b_e^⊤ x, 0},

and hence E_α[c_G(S)] is exactly the objective function of (12.3) (and hence of (12.4)). Since at least one outcome must do as well as the average, there is a subset S ⊆ V achieving this value (or less).

We can use the same threshold cut to show that we can round a solution to Program (12.5) to a cut of equal or smaller value feasible for Program (12.2). The only difference is that in this case, we get

Pr[e is cut by S] ≤ max{b_e^⊤ x, 0},

which is still sufficient.
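The threshold-cut rounding in this proof is easy to carry out explicitly: since only the values that x actually takes matter, it suffices to try each of them as a threshold. Below is a small Python sketch; the graph, embedding, and names are made up for illustration.

# Threshold-cut rounding: try every distinct value of x as a threshold alpha and
# keep the cheapest resulting s-t cut.

def threshold_cut(capacities, x, s, t):
    assert x[s] == 0 and x[t] == 1
    best_S, best_val = None, float('inf')
    for alpha in sorted(set(x.values())):
        if alpha >= x[t]:
            continue                              # keep t outside S
        S = {v for v in x if x[v] <= alpha}
        val = sum(c for (u, v), c in capacities.items() if u in S and v not in S)
        if val < best_val:
            best_S, best_val = S, val
    return best_S, best_val

caps = {('s','a'): 1, ('a','t'): 1, ('s','b'): 1, ('b','t'): 1, ('a','b'): 1}
x = {'s': 0.0, 'a': 0.5, 'b': 0.5, 't': 1.0}
S, val = threshold_cut(caps, x, s='s', t='t')
assert 's' in S and 't' not in S and val == 2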

Chapter 13

Link-Cut Trees

In this chapter, we will learn about a dynamic data structure that allows us to speed up Dinic's algorithm even more: Link-Cut Trees. This chapter is inspired by lecture notes of Richard Peng¹ and the presentation in a very nice book on data structures and network flows by Robert Tarjan [Tar83].

13.1 Overview
Model. We consider a directed graph G = (V, E) that is undergoing updates in the
form of edge insertions and/or deletions. We number the graph in its different versions
G0 , G1 , G2 , . . . such that G0 is the initial input graph, and Gi is the initial graph after the
first i updates were applied. Such a graph is called a dynamic graph.
In this lecture, we restrict ourselves to dynamic rooted forests, that is, we assume that every Gi forms a directed forest where in each tree a single root vertex can be reached from every other vertex of that tree. For simplicity, we assume that G0 = (V, ∅) is an empty graph.

The Interface. Let us now describe the interface of our data structure that we call a
link-cut tree. We want to support the following operations:

• Initialize(G): Creates the data structure initially and returns a pointer to it. Each
vertex is initialized to have an associated cost cost(v) equal to 0.

• FindRoot(v): Returns the root of vertex v.

• AddCost(v, ∆): Add ∆ to the cost of every vertex on the path from v to the root
vertex FindRoot(v).
¹ CS7510 Graph Algorithms Fall 2019, Lecture 17 at https://faculty.cc.gatech.edu/~rpeng/CS7510_F19/

Figure 13.1: The ith version of G is a rooted forest. Here, red vertices are roots and there
are two trees. The (i + 1)th version of G differs from Gi by a single edge that was inserted
(the blue edge). Note that edge insertions are only valid if the tail of the edge was a root.
In the right picture, the former root on the right side is turned into a normal node in the
tree by the update.

• FindMin(v): Returns tuple (w, cost(w)) where w is the (first) vertex on the path from
v to FindRoot(v) of lowest cost.

• Link(u, v): Links two trees that contain u and v into a single tree by inserting the edge
(u, v). This assumes that u, v are initially in different trees and u was a root vertex.

• Cut(u, v): Cuts the edge (u, v) from the graph which causes the subtree rooted at u to
become a tree and u to become a root vertex. Assumes (u, v) is in the current graph.

Main Result. The following theorem is the main result of today’s lecture.

Theorem 13.1.1. We can implement a link-cut tree such that any sequence of m operations
takes total expected time O(m log2 n + |V |).

13.2 Balanced Binary Search Trees: A Recap


In Chapter 2 of the course Algorithms, Probability and Computing, we introduced (balanced)
binary search trees, and studied treaps, which are a simple randomized approach to building
a binary search tree with good performance, at least in expectation. If you are not familiar
with balanced binary search trees, we encourage you to read this chapter from the APC
script. That said, we will quickly recap the important properties of binary search trees and
treaps. If you’re already familiar with treaps, you can skip ahead to the next section to see
how we use them for building Link-Cut trees. A binary search tree T is a data structure
for keeping track of a set of items where each item has a key associated with it. The data
structure stores the items in a rooted binary tree with a node for each item, and maintains

the property that for each node v, all items in the left subtree of v have keys less than the key of v, and all items in the right subtree of v have keys greater than it. This is called the search property
of the tree, because it makes it simple to search through the tree to determine if it contains
an item with a given key.
The binary trees we studied in APC supported the following operations, while maintaining
the search property:

insert: We can add an item to the tree with a specified key.

delete: We can remove an item from the tree with a specified key.

find: Determine if the tree contains an item with a specified key.

split: Suppose the tree contains two items v1 and v2 with keys k1 and k2 where k1 < k2 , and
suppose no item in tree has a key k in the interval (k1 , k2 ). Then the split operation
applied to these two items should split the current tree into two: One tree containing
all items with keys in (−∞, k1 ] and the other with all items in [k2 , ∞).

join: Given two binary search trees, one with keys in the interval (−∞, k], the other with
keys in the interval (k, ∞), form a single binary search tree containing the items of
both.

Treaps allow us to implement all of these operations, each with expected time O(log n) for
each operation when working over n items.

Tree rotations. A crucial operation when implementing binary search trees in general
and treaps in particular is tree rotation. A tree rotation makes a local change to the tree by
changing only a constant number of pointers around, while preserving the search property.
Given two items v1 and v2 such that v1 is a child of v2 , a tree rotation applied to these two
nodes will make v2 the child of v1 , while moving their subtrees around to preserve the search
property. This is shown in Figure 13.2. When the rotation makes v2 the right child of v1
(because v2 has the larger key), this is called a right rotation, and when it makes v2 the left
child of v1 (because v2 has a smaller key), it is called a left rotation.

13.3 A Data Structure for Path Graphs


Before we prove Theorem 13.1.1 in its full generality, let us reason about implement-
ing a link-cut tree data structure in a weaker setting: we assume that every ver-
sion Gi of G is just a collection of rooted vertex-disjoint paths. We will later build
on the routines we develop for the path case to construct the data structure for the
general tree case. To distinguish the routines we develop for paths from those we
develop for the general case, we will prefix the path-case routines with a “P”, i.e.

Figure 13.2: Given two items v1 and v2 such that v1 is a child of v2 , a tree rotation applied to
these two nodes will make v2 the child of v1 , while moving their subtrees around to preserve
the search property. This figure shows a right rotation.

we denote our implementations of FindRoot, AddCost, FindMin, Link, and Cut by


PFindRoot, PAddCost, PFindMin, PLink, and PCut.

Representing Paths via Balanced Binary Search Trees. It turns out that paths can be represented rather straightforwardly via Balanced Binary Search Trees, with a node/item for each vertex of the graph. For the sake of concreteness, we here use treaps to represent paths.² Most other balanced binary search trees would also work.
Note that we now represent each rooted path with a tree, and this tree has its own root, which is usually not the root of the path. To minimize confusion, we will refer to the root of the path as the path-root and the root of the associated tree as the tree-root (or subtree-root for the root of a subtree).
In our earlier discussion of binary trees, we always assumed each item has a key: Instead we
will now let the key of each vertex correspond to its distance to the path-root in the path
that contains it. We will make sure the treap respects the search property w.r.t. this set of
keys/ordering, but we will not actually explicitly compute these keys or store them. Note
one important difference to the scenario of treaps-with-keys: When we have two paths, with
path-roots r1 and r2 , we will allow ourselves to join these paths in two different ways: either
so that r1 or r2 becomes the overall path-root.
Let us describe how to represent a path P in G. First, for each vertex v, we assign a priority denoted prio(v), which we choose to be a uniformly random integer sampled from a large universe, say [1, n¹⁰⁰). We assume henceforth that prio(v) ≠ prio(w) for all v ≠ w ∈ V.
Then, for each path P in G, we store the vertices in P in a binary tree, in particular a treap,
TP and enforce the following invariants:

• Search-Property: for all v, left_{TP}(v) precedes v on P and right_{TP}(v) appears later on P than v. See Figure 13.3 below for an illustration.

² If you have not seen treaps before, don't worry, they are simple enough to understand from our application here.

• Heap-Order: for each vertex v ∈ P, its parent w = parent_{TP}(v) in TP is either NULL (if v is the tree-root) or has prio(v) > prio(w).

Figure 13.3: In the upper half of the picture, the original path P in G is shown, along with
the random numbers prio(v) for each v. The lower half depicts the resulting treap with
vertices on the same vertical line as before.

Depth of a Vertex in a Treap. Let us next analyze the expected depth of a vertex v in a treap TP representing a path P. Let P = ⟨x_1, x_2, . . . , x_k = v, . . . , x_{|P|}⟩, i.e. v is the kth vertex on the path P. Observe that a vertex x_i with i < k is an ancestor of v in TP if and only if no vertex in {x_{i+1}, x_{i+2}, . . . , x_k} has received a smaller random number than prio(x_i). Since we sample prio(w) uniformly at random for each w, we have that P[x_i is an ancestor of v] = 1/(k − i + 1). The case where i > k is analogous and has P[x_i is an ancestor of v] = 1/(i − k + 1). Letting X_i be the indicator variable for the event that x_i is an ancestor of v, it is straightforward to calculate the expected depth of v in TP:

E[depth(v)] = Σ_{i≠k} E[X_i] = Σ_{i=1}^{k−1} 1/(k − i + 1) + Σ_{i=k+1}^{|P|} 1/(i − k + 1) = H_k + H_{|P|−k+1} − 2 = O(log |P|).

It is straightforward to see that the operation PFindRoot(v) can thus be implemented to run in expected O(log n) time by just iteratively following the parent pointers starting in v.

Implementing PFindRoot(v). From any vertex v, we can simply follow the parent pointers in TP until we are at the tree-root of TP. Then, we find the right-most child of TP by following the right_{TP} pointers. Finally, we return the right-most child, which is the path-root.

Extra fields to help with PAddCost(v, ∆) and PFindMin(v). The key trick for implementing the Link-Cut tree operations efficiently is to store changes to whole subtrees instead of updating cost(v) for each affected vertex.

To this end, we store two fields ∆cost(v) and ∆min(v) for every vertex v. We let cost(v) be the cost of each vertex, and mincost(v) denote the minimum cost of any descendant of v in TP (where we let v be a descendant of itself). A warning: the mincost(v) value is the minimum cost in the treap-subtree, not the minimum cost on the path between v and the path-root, which could be different. Note also that we do not explicitly maintain the fields cost(v) and mincost(v).

Then, we maintain for each v

∆cost(v) = cost(v) − mincost(v)

∆min(v) = { mincost(v)                             if parent_{TP}(v) = NULL,
          { mincost(v) − mincost(parent_{TP}(v))   otherwise.

With a bit of work, we can see that with these definitions, we obtain mincost(v) = Σ_{w∈TP[v]} ∆min(w), where TP[v] is the v-to-tree-root path in TP. We can then recover cost(v) = mincost(v) + ∆cost(v). We say that ∆min(v) and ∆cost(v) are field-preserving if we can compute mincost(v) and cost(v) in the above way.
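A minimal sketch of how cost(v) and mincost(v) are recovered from the stored ∆ fields, under a made-up node class (this shows only the read-side of the scheme; maintaining the fields under rotations is discussed below):

# Recovering mincost(v) and cost(v) from the ∆ fields by walking to the tree-root.

class Node:
    def __init__(self, dmin, dcost, parent=None):
        self.dmin = dmin      # ∆min(v)
        self.dcost = dcost    # ∆cost(v)
        self.parent = parent

def mincost(v):
    """Sum ∆min along the v-to-tree-root path."""
    total = 0
    while v is not None:
        total += v.dmin
        v = v.parent
    return total

def cost(v):
    return mincost(v) + v.dcost

# Tiny example: tree-root with subtree minimum 2; a child whose subtree minimum
# is 5 and whose own cost is 7.
root = Node(dmin=2, dcost=0)
child = Node(dmin=3, dcost=2, parent=root)    # mincost(child) = 2 + 3 = 5
assert mincost(child) == 5 and cost(child) == 7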

Implementing PAddCost(v, ∆) and PFindMin(v) – the easy way. First, we will discuss ways to implement PAddCost and PFindMin that rely on the fact that PLink and PCut preserve that ∆cost(v) and ∆min(v) still satisfy the relations we described above.

Now, PAddCost(v, ∆) is very easy: First, consider the case when v has no predecessor on the path. If this is the case, we can simply find the tree-root of the tree TP containing v and at this root increase the value of the field ∆min (which at the tree-root equals mincost) by ∆. This implicitly increases the cost of all nodes in the tree by ∆. Second, consider the case when v has a predecessor u on the path (we can store predecessors explicitly, or we can search for u in the tree). Call PCut(u, v) – now v no longer has a predecessor and we can proceed as before. Finally, call PLink(u, v). That's it.

Implementing PFindMin(v) is also very easy: First, consider the case when v has no predecessor on the path. Again, we can find the tree-root r of the tree TP containing v. By definition mincost(r) is the minimum cost across the whole tree, and hence across the path between v and the path-root. We can also find the node attaining the minimum cost: Search in the tree, starting from the tree-root, and, if possible, follow a child with ∆min = 0 (going left if both children are eligible). Eventually, we get to a node where no child has ∆min = 0. This must be a minimizer. Now, to deal with the case where v has a predecessor, we do the same as for PAddCost: Find the predecessor u, call PCut(u, v), then find the minimizer, and finally call PLink(u, v).

Implementing PAddCost(v, ∆) and PFindMin(v) – the hard way. We can also implement PAddCost(v, ∆) and PFindMin(v) without relying on PLink and PCut, which is more efficient, but also more unpleasant.³
To implement the operation PFindMin(v) more efficiently, consider the sequence of nodes on the path in the tree TP (not the path P) between v and the path-root. First let v = x_1, x_2, . . . , x_k = r be the nodes leading up to the tree-root, and then let y_1, . . . , y_l be the remaining nodes on the tree path down to the path-root. Observe that either the minimizer is one of these nodes (which we can check by inspecting their ∆min fields), or it is in a right subtree of some x_i, or in some subtree of a y_j. So we just search through these subtrees for their minimizers (looking for ∆min = 0) and pick the best one. There will only be O(log n) such subtrees in expectation.
Next, let us discuss the implementation of the operation PAddCost(v, ∆), which is given in pseudo-code below. The first for-loop of this algorithm ensures that the tree is again field-preserving. This is ensured by walking down the tree path from the path-root to v and, whenever we walk to the left, increasing all costs in the right subtree by adding ∆ to the subtree-root r of that right subtree, in the form of adding it to ∆min(r). On the vertices along the path it updates the costs by using the ∆cost(·) fields. It is easy to prove from this that thereafter the tree is field-preserving.

Algorithm 9: PAddCost(v, ∆)
Store the vertices on the path in the tree TP between v and the path-root, i.e. ⟨v = x_1, x_2, . . . , x_k = PFindRoot(v)⟩.
for i = k down to 1 do
    ∆cost(x_i) ← ∆cost(x_i) + ∆;
    if (r ← right_{TP}(x_i)) ≠ NULL AND (i = 1 OR x_{i−1} ≠ r) then
        ∆min(r) ← ∆min(r) + ∆;
    end
end
for i = 1 up to k do
    minval ← ∆cost(x_i);
    foreach w ∈ {left_{TP}(x_i), right_{TP}(x_i)}, w ≠ NULL do
        minval ← min{minval, ∆min(w)};
    end
    ∆min(x_i) ← ∆min(x_i) + minval;
    ∆cost(x_i) ← ∆cost(x_i) − minval;
    foreach w ∈ {left_{TP}(x_i), right_{TP}(x_i)}, w ≠ NULL do
        ∆min(w) ← ∆min(w) − minval;
    end
end

While after the first for-loop the values can be computed efficiently, we might now be in the situation that ∆min(w) and/or ∆cost(w) are negative for some vertices w. We may also violate the invariants we want to preserve on the relationship between the ∆min(w) and/or ∆cost(w) fields and the true cost and mincost values at each node. We recover through the second for-loop, which at each vertex on the tree path from v to its path-root first computes the correct minimum in the subtree again (using the helper variable minval) and then adjusts all values in its left/right subtree and its own fields. Since we argued that v is at expected depth O(log(n)), we can see that this can be done in expected O(log n) time.

³ You don't have to know these more complicated approaches for the exam, but we include them for the interested reader.

Implementing PCut(u, v). Let us first assume that we have a dummy node d₀ with prio(d₀) = 0 in the vertex set. The trick is to first treat the operation as splitting the edge (u, v) into (u, d₀) and (d₀, v) by inserting the vertex d₀ in between u and v in the tree TP as a leaf (this is always possible). Then, we can do standard binary tree rotations to re-establish the heap-order invariant (see Figure 13.4). It is not hard to see that after O(log n) tree rotations in expectation, the heap-order is re-established and d₀ is the new tree-root of TP. It remains to remove d₀ and make left_{TP}(d₀) and right_{TP}(d₀) tree-roots.

Figure 13.4: For PCut(u, v), we insert a vertex d₀ as a leaf of either u or v to formally split (u, v) into (u, d₀) and (d₀, v) (this is shown on the left). While this preserves the Search-Property, it will violate the Heap-Order. Thus, we need to use tree rotations (shown on the right) to push d₀ to the top of the tree TP (the arrow between the tree rotations should point in both ways).

To ensure that the fields are correctly adjusted, we can make the ∆min(·) values of the two
vertices that are rotated equal to zero while ensuring that the tree is still field-preserving by
changing the ∆min(·) fields of the (at most 3) subtree-roots of the subtrees to be rotated,
and adapting the ∆cost(·) fields of the two vertices to be rotated. Then, after the rotation,
it is not hard to see that the tree is still field-preserving and that the procedure applied

146
in the second for-loop of the operation PAddCost(v, ∆) can then just be applied to the
two rotated vertices and the subtree-roots of their subtrees. This ensures that the fields
are maintained correctly. All of these operations can be implemented in O(1) time per
tree rotation. Note that PCut is basically just the treap split operation. In the exercises
for this week, we ask you to come up with the pseudo-code for maintaining that the tree is
field-preserving during tree rotations.

Implementing PLink(u, v). Implementing PLink(u, v) could be done by reversing the
process described above. We choose, however, a slightly different strategy: we insert a vertex
d∞ with prio(d∞) = n^100 and make it a right child of u. We then find the tree-root r of
the tree over the path containing v (as a tail). Finally, we make r a right child of d∞ . It
remains to use tree rotations to make d∞ a leaf in the tree. Once this is accomplished, one
can simply remove d∞ from the tree entirely (which can be seen as un-splitting two edges
into (u, v)). Note that PLink is essentially the treap join operation.

Notes. The operations PLink/PCut can be implemented using almost all Balanced Bi-
nary Search Trees (especially the ones you have seen in your first courses on data structures).
Thus, it is not hard to get an O(log n) worst-case time bound for all operations discussed
above.

13.4 Implementing Trees via Paths


We now use the result from the last section as a black box to obtain Theorem 13.1.1.

Path Decomposition. For each rooted tree T , the idea is to decompose T into paths.
In particular, we decompose each T into a collection of vertex-disjoint paths P1 , P2 , . . . , Pk
such that each internal vertex v in T has exactly one incoming edge in some Pi . We call the
edges on some Pi solid edges and say that the other edges are dashed.
We maintain the paths P1 , P2 , . . . , Pk using the data structure described in the last section.
To avoid confusion, we use the prefix P when we invoke operations of the path data structure,
for example PFindRoot(v) finds the root of v in the path graph Pi where v ∈ Pi . We no
longer need to think about the balanced binary trees that are used internally to represent
each Pi . Instead, in this section, we use path-root to refer to the root of a path Pi and we
use tree root (or just root) to refer to a root of one of the trees in our given collection of
rooted trees.

The Expose(v) Operation. We start by discussing the most important operation of the
data structure that will be used by all other operations internally: the operation Expose(v).
This operation flips solid/dashed edges such that after the procedure the path from v to its
tree root in G is solid (possibly as a subpath in the path collection). Below you can find an
implementation of the procedure Expose(v).

Figure 13.5: The dashed edges are the edges not on any path Pi. The collection of (non-
empty) paths P1, P2, . . . , Pk can be seen to be the maximal path segments of solid edges.

Algorithm 10: Expose(v)
w ← v.
while (w′ ← parentG(PFindRoot(w))) ≠ NULL do
    Invoke PCut(z, w′) for the solid edge (z, w′) incoming to w′ (if one exists).
    PLink(PFindRoot(w), w′).
    w ← w′.
end

Implementing Operations via Expose(v). We can now implement link-cut tree opera-
tions by invoking Expose(v) and then forwarding the operation to the path data structure.

Algorithm 11: AddCost(v, ∆)
Expose(v); PAddCost(v, ∆)

Algorithm 12: FindMin(v)
Expose(v); return PFindMin(v)

Algorithm 13: Link(u, v)
parentG(u) ← v;
if v has an incoming solid edge (z, v) then PCut(z, v);
PLink(u, v)

Algorithm 14: Cut(u, v)
Expose(u); parentG(u) ← NULL; PCut(u, v);
if v has another incoming edge (z, v) then PLink(z, v);

Analysis. All of the operations above can be implemented using a single Expose(·) op-
eration plus O(1) operations on paths. Since path operations can be executed efficiently,
our main task is to bound the run-time of Expose(·). More precisely, since each iteration
of Expose(·) runs in time O(log n), it suffices to bound the total number of while-loop
iterations in Expose(·).
To this end, we introduce a dichotomy over the edges in G. We let parentG(v) denote the
unique parent of v in G and let sizeG(v) denote the number of vertices in the subtree rooted
at v (including v).

Definition 13.4.1. We say that an edge (u, v), where v = parentG(u), is heavy if
sizeG(u) > sizeG(v)/2. Otherwise, we say (u, v) is light.

It can now be seen that the number of light edges on the v-to-root path for any v is at most
lg n: every time we follow a light edge (w, w′), i.e. when sizeG(w′) > 2 · sizeG(w), the subtree
size at least doubles, so after taking more than lg n such edges, we would have > 2^{lg n} = n
vertices in the graph (which is a contradiction).
Thus, when Expose(·) runs for many iterations, it must turn many heavy edges solid (note
that each vertex has at most one incoming heavy edge, so making one heavy edge solid never
turns another heavy edge dashed).
Claim 13.4.2. Each update can only increase the number of dashed, heavy edges by O(log n).

Proof. First observe that every time Expose(·) is invoked, it turns at most lg n heavy edges
from solid to dashed (since it has to make a light edge solid to do so, and there are at most
lg n light edges on the exposed path).
The only two operations that can cause additional dashed, heavy edges are Link(u, v) and
Cut(u, v) by toggling heavy/light. For Link(u, v), we observe that only the vertices on the
v-to-root path increase their sizes. Since there are at most lg n light edges on this path that
can turn heavy, this increases the number of dashed, heavy edges by at most lg n.
The case for Cut(u, v) is almost analogous: only vertices on the v-to-root path decrease
their sizes, which can cause heavy edges on this path to become light, while the incoming
edge of a sibling of such a vertex might become heavy instead. But there can be at most
lg n such new heavy edges; otherwise, the total size of the tree would again exceed n, which
leads to a contradiction.

Thus, after m updates, we have created at most O(m log n) dashed heavy edges. Each iter-
ation of Expose(·) either visits a dashed light edge (at most lg n of them per invocation) or
consumes a dashed heavy edge, i.e. turns it solid. We conclude that after m updates, the
while-loop in Expose(·) runs for at most O(m log n) iterations in total, i.e. summed across
the updates so far. Each iteration can be implemented in O(log n) expected time. This
dominates the total running time and proves Theorem 13.1.1.

13.5 Fast Blocking Flow via Dynamic Trees

Recall from Section 12.4 that computing blocking flows in a level graph L from a vertex s
to t can be done by successively running DFS(s) and routing flow along the s-t path found,
if such a path exists; otherwise, we know that we have found a blocking flow.
We can now speed up this procedure by storing the DFS-tree explicitly as a dynamic tree. To
simplify exposition, we transform L to obtain a graph Transform(L) that has capacities
on vertices instead of edges. To obtain Transform(L), we simply split each edge in L and
assign the edge capacity to the mid-point vertex while assigning capacity ∞ to all vertices
that were already in L. This creates an identical flow problem with at most O(m) vertices
and edges.

Figure 13.6: Each edge (u, v) with capacity c(u, v) is split into two edges (u, m) and (m, v).
The capacity is then on the vertex m.
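The edge-splitting step is easy to make concrete. The following short Python sketch (our own illustration; the graph representation and names are assumptions, not from the notes) performs the transformation:

import math

def transform(vertices, edges, cap):
    # vertices: iterable of vertex names; edges: list of (u, v) pairs;
    # cap: dict mapping (u, v) -> capacity of that edge.
    vertex_cap = {v: math.inf for v in vertices}   # original vertices: capacity infinity
    new_edges = []
    for (u, v) in edges:
        m = ("mid", u, v)                          # fresh mid-point vertex per edge
        vertex_cap[m] = cap[(u, v)]                # the edge capacity moves onto m
        new_edges.append((u, m))
        new_edges.append((m, v))
    return new_edges, vertex_cap

# Example: one edge of capacity 3 becomes two edges through a mid-point.
E, vcap = transform(["s", "t"], [("s", "t")], {("s", "t"): 3})
print(E, vcap[("mid", "s", "t")])

Note how the number of vertices and edges at most triples, so the resulting flow problem indeed still has O(m) size.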

Finally, we give the new pseudo-code for the blocking flow procedure below.

Algorithm 15: FindBlockingFlow(s, t, L)
H ← Transform(L);
LC-Tree ← Initialize(H);
while s ∈ H do
u ← LC-Tree.FindRoot(s);
if there is an edge (u, v) ∈ H then
LC-Tree.Link(u, v);
if v = t then
(w, c) ← LC-Tree.FindMin(s);
LC-Tree.AddCost(s, −c);
Remove w and all its incident edges from H and LC-Tree (via Cut(·)).
end
else
Remove u and all its incident edges from H and LC-Tree (via Cut(·)).
end
end
Construct f by setting for each edge (u, v) of L, with mid-point m in Transform(L),
the flow equal to c(m) minus the cost on m just before it was removed from H.

Claim 13.5.1. The running time of FindBlockingFlow(s, t, L) is O(m log² n + |V |).

Proof. Each edge (u, v) in the graph Transform(L) enters the link-cut tree at most once
(we only invoke Cut(u, v) when we delete (u, v) from H).
Next, observe that the first if-case requires O(1 + #edgesDeletedFromH) many tree oper-
ations. The else-case requires O(#edgesDeletedFromH) many tree operations.
But each edge is only deleted once from H; thus we have a total of O(m) tree operations over
all iterations. Since each link-cut tree operation takes amortized expected time O(log² n),
we obtain the bound on the total running time.

The correctness of this algorithm follows almost immediately using that the level graph L
(and therefore Transform(L)) is an acyclic graph.

Chapter 14

The Cut-Matching Game: Expanders via Max Flow

In this chapter, we learn about a new algorithm to compute expanders that employs max
flow as a subroutine.

14.1 Introduction

We start with a review of expanders, where we make a subtle change to the notion of an
expander in comparison to Chapter 5 to ease the exposition.

Definitions. We let G = (V, E) be an unweighted, connected graph in this chapter, let d
be the degree vector, and let E(S, V \ S) denote the set of edges crossing the cut (S, V \ S).
Given a set ∅ ⊂ S ⊂ V, we define the sparsity ψ(S) of S by

ψ(S) = |E(S, V \ S)| / min{|S|, |V \ S|}.

Note that sparsity ψ(S) differs from conductance φ(S) = |E(S, V \ S)| / min{vol(S), vol(V \ S)},
as defined in Chapter 5, in the denominator. It is straightforward to see that in a connected
graph ψ(S) ≥ φ(S) for all S.
Clearly, we again have ψ(S) = ψ(V \ S). We define the sparsity of a graph G by ψ(G) =
min_{∅⊂S⊂V} ψ(S). For any ψ ∈ (0, n], we say a graph G is a ψ-expander with regard to sparsity
if ψ(G) ≥ ψ. When the context is clear, we simply say that G is a ψ-expander.
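For concreteness, here is a small Python helper (our own illustration, not part of the lecture material) that evaluates ψ(S) for a given cut:

def sparsity(n, edges, S):
    # n: number of vertices labelled 0..n-1; edges: list of pairs;
    # S: set of vertices with 0 < |S| < n.
    crossing = sum(1 for (u, v) in edges if (u in S) != (v in S))
    return crossing / min(len(S), n - len(S))

# On the 4-cycle 0-1-2-3-0, the cut S = {0, 1} is crossed by two edges, so psi(S) = 1.
print(sparsity(4, [(0, 1), (1, 2), (2, 3), (3, 0)], {0, 1}))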

The Main Result. The main result of this chapter is the following theorem.

Theorem 14.1.1. There is an algorithm SparsityCertifyOrCut(G, ψ) that given a
graph G and a parameter 0 < ψ ≤ 1 either:

• Certifies that G is a Ω(ψ/ log² n)-expander with regard to sparsity, or

• Presents a cut S such that ψ(S) ≤ O(ψ).

The algorithm runs in time O(log² n) · T_maxflow(G) + Õ(m), where T_maxflow(G) is the time
it takes to solve a Max Flow problem on G.¹

The bounds above can further be extended to compute φ-expanders (with regard to conduc-
tance). Using current state-of-the-art Max Flow results, the above problem can be solved in
Õ(m + n^{3/2+o(1)}) time [vdBLL+ 21].

14.2 Embedding Graphs into Expanders


Let us start by exploring the first key idea behind the algorithm. We therefore need a
definition of what it means to embed one graph into another.

Definition of Embedding. Given graphs H and G that are defined over the same vertex
set, we say that a function Embed_{H→G} is an embedding if it maps each edge (u, v) ∈ H
to a u-to-v path P_{u,v} = Embed_{H→G}(u, v) in G.

Figure 14.1: In this example the red edge (u, v) in H is mapped to the red u-to-v path in G.
¹Technically, we will solve problems with two additional vertices and n additional edges, but this will not
change the run-time of any known max-flow algorithm asymptotically.

We say that the congestion of Embed_{H→G} is the maximum number of times that any edge
e ∈ E(G) appears on any embedding path:

cong(Embed_{H→G}) = max_{e∈E(G)} |{e′ ∈ E(H) | e ∈ Embed_{H→G}(e′)}|.
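The congestion is equally simple to compute given an explicit embedding. A minimal Python sketch (illustrative; the dictionary representation is our own assumption):

from collections import Counter

def congestion(embedding):
    # embedding: dict mapping each edge of H to a list of edges of G (its path).
    load = Counter()
    for path in embedding.values():
        for e in path:
            load[e] += 1
    return max(load.values())

# Two H-edges whose paths share the G-edge (1, 2) give congestion 2.
emb = {("a", "b"): [(0, 1), (1, 2)], ("c", "d"): [(1, 2), (2, 3)]}
print(congestion(emb))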

Certifying Expander Graphs via Embeddings. Let us next prove the following lemma,
which is often considered folklore.

Lemma 14.2.1. Given a 1/2-expander graph H and an embedding of H into G with congestion
C, then G must be an Ω(1/C)-expander.

Proof. Consider any cut (S, V \ S) with |S| ≤ |V \ S|. Since H is a 1/2-expander, we have that
|E_H(S, V \ S)| ≥ |S|/2. We also know by the embedding of H into G that for each edge
(u, v) ∈ E_H(S, V \ S), we can find a path P_{u,v} in G that also has to cross the cut (S, V \ S)
at least once. But since each edge in G is on at most C such paths, we can conclude that at
least |E_H(S, V \ S)|/C ≥ |S|/(2C) edges in G cross the cut (S, V \ S).

Unfortunately, the reverse of the above lemma is not true: even if there exists no
embedding from a 1/2-expander H into G of congestion C, the graph G might still be an
Ω(1/C)-expander.

14.3 The Cut-Matching Algorithm


Although there are still some missing pieces, let us next discuss Algorithm 16. The
algorithm runs for T iterations, where we will later find that the right value for T is
Θ(log² n).

Algorithm 16: SparsityCertifyOrCut(G, ψ)
for i = 0, 1, 2, . . . , T do
    (S, S̄) ← FindBiPartition(G, {M1, M2, . . . , Mi}) ; // Assume |S| = |S̄| = n/2
    Solve the flow problem on G where each edge e ∈ E receives capacity c(e) = 1/ψ
    and the demand at each vertex v ∈ S is +1 and at each v ∈ S̄ is −1, by introducing
    a super-source s and super-sink t;
    if the flow procedure returns a flow f with val(f) = n/2 then
        Remove s, t to derive an S-S̄ flow; then decompose the flow f into flow paths
        P1, P2, . . . , Pn/2;
        Create a matching Mi+1 where for each flow path Pj from u ∈ S to v ∈ S̄,
        we add (u, v) to Mi+1.
    else
        return the minimum cut (XS, X̄S) in the flow problem after removing the
        super-source s and the super-sink t from the two sets.
return H = ∪i Mi

Figure 14.2: Illustration of the steps of the algorithm. In a), a bi-partition of V is found.
In b), the bi-partition is used to obtain a flow problem where we inject one unit of flow into
each vertex in S via the super-source s and extract one unit from each vertex in S̄ via the
super-sink t. c) A path-flow decomposition. For each path, the first vertex is in S and the
last vertex is in S̄. d) We find Mi to be the one-to-one matching between endpoints in S
and S̄ defined by the path flows.

Beginning of an Iteration. In each iteration of the algorithm, we first invoke a sub-
procedure FindBiPartition(·) that returns a cut (S, S̄) (where S̄ = V \ S) that partitions
the vertex set into two equal-sized sides. Here, we implicitly assume that the number of
vertices n is even, which is w.l.o.g.
Next, the algorithm creates a flow problem where each vertex u ∈ S has to send exactly one
unit of flow and each vertex v ∈ S̄ has to receive exactly one unit of flow. We therefore
introduce dummy nodes s and t, add for each vertex v ∈ S an edge of capacity 1 between
s and v, and for each vertex v ∈ S̄ an edge of capacity 1 between t and v. We set the
capacity of edges in the original graph G to 1/ψ (which we assume w.l.o.g. to be integral).

The If Statement. If the flow problem can be solved exactly, then we can find a path
decomposition of the flow f in Õ(m) time (for example using a DFS), where each path starts
in S, ends in S̄, and carries one unit of flow². This defines a one-to-one correspondence
between the vertices in S and the vertices in S̄. We capture this correspondence in the
matching Mi. We will later prove the following lemma.

Lemma 14.3.1. If the algorithm returns after constructing T matchings, for an appropri-
ately chosen T = Θ(log² n), then the graph H returned by the algorithm is a 1/2-expander and
H can be embedded into G with congestion O(log² n/ψ).

²For simplicity, assume that the returned flow is integral.

The Else Statement. On the other hand, if the flow problem on G could not be solved,
then we return the min-cut of the flow problem. Such a cut can be found in O(m) time:
using the above reduction to an s-t flow problem, one can compute a maximum flow f, from
which the s-t min-cut can be constructed following the construction in the proof of Theorem
11.4.4.
It turns out that this min-cut is already a sparse cut, by the way our flow problem is defined.

Lemma 14.3.2. If the algorithm finds a cut (XS, X̄S), then the returned cut satisfies
ψ(XS \ {s}) ≤ O(ψ).

Proof. First observe that since a min-cut was returned, the total capacity of the cut (XS, X̄S)
is less than n/2 (otherwise we could have routed the demands).
Let ns be the number of edges incident to the super-source s that cross the cut (XS , XS ).
Let nt be the number of edges incident to t that cross the cut (XS , XS ).

Figure 14.3: Set XS is enclosed by the orange circle. The thick orange edges are in the cut
and incident to super-source s. Thus they count towards ns . Here ns = 2, nt = 0. Note that
all remaining edges in the cut are black, i.e. were originally in G and therefore have capacity
1/ψ.

Observe that after taking away the vertices s and t, the cut (XS \ {s}, X̄S \ {t}) has less than
n/2 − ns − nt capacity. But each remaining edge has capacity 1/ψ, so the total number of
edges in the cut can be at most ψ · (n/2 − ns − nt). Since XS \ {s} is of size at least n/2 − ns,
and X̄S \ {t} is of size at least n/2 − nt, we have that the induced cut in G has

ψ(XS \ {s}) < ψ · (n/2 − ns − nt) / min{n/2 − ns, n/2 − nt} ≤ ψ.

14.4 Constructing an Expander via Random Walks
Next, we give the implementation and analysis of the procedure FindBiPartition(·). We
start, however, by giving some more preliminaries.

Random Walk on Matchings. Let {M1, M2, . . . , MT+1} be the set of matchings we
compute (if we never find a cut). In the i-th step of the lazy random walk, we let the mass
at each vertex j stay put with probability 1/2, and otherwise traverse the edge in matching
Mi incident to j with probability 1/2.
We let p^t_{j→i} denote the probability that a particle that started at vertex j is at vertex i
after a t-step lazy random walk. We let p^t_i = [p^t_{1→i}, p^t_{2→i}, . . . , p^t_{n→i}]. Note that for each edge
(i, j) ∈ M_{t+1}, we have that

p^{(t+1)}_i = (1/2) p^t_i + (1/2) p^t_j = p^{(t+1)}_j.

We define the projection matrix Π_t = [p^t_1, p^t_2, . . . , p^t_n]^T that maps an initial probability
distribution d to the probability distribution over the vertices that the random walk visits
at step t. You will prove in the exercises that Π_t is doubly stochastic.
We say that a lazy random walk is mixing at step t if, for each i, j, we have p^t_{j→i} ≥ 1/(2n).
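This walk is easy to simulate. The sketch below (our own illustration, assuming numpy; not part of the lecture material) builds the matrices Π_t step by step, where entry (i, j) of the t-th matrix is p^t_{j→i}:

import numpy as np

def walk_matrices(n, matchings):
    # matchings: list of rounds, each round a list of disjoint vertex pairs (i, j).
    Pi = np.eye(n)                        # Pi_0: the walk starts where it begins
    history = [Pi.copy()]
    for M in matchings:
        step = np.eye(n)
        for (i, j) in M:                  # averaging matched endpoints realizes
            step[i, i] = step[j, j] = 0.5 # p_i^{(t+1)} = (p_i^t + p_j^t)/2
            step[i, j] = step[j, i] = 0.5
        Pi = step @ Pi
        history.append(Pi.copy())
    return history

hist = walk_matrices(4, [[(0, 1), (2, 3)], [(0, 2), (1, 3)]])
print(hist[-1])   # every entry is 1/4: the walk is mixing

Note how each step matrix is symmetric and doubly stochastic, so every Π_t is doubly stochastic as well.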
Lemma 14.4.1. If the t-step lazy random walk is mixing, then H = ∪_{i≤t} Mi is a 1/2-expander.

Proof. Consider any cut (S, S̄) with |S| ≤ |S̄|. It is convenient to think about the random
walks in terms of probability mass that is moved around. Observe that each vertex j ∈ S̄
has to push at least 1/(2n) units of probability to each vertex i ∈ S (by definition of mixing).

Figure 14.4: Each vertex j ∈ S̄ sends at least 1/(2n) probability mass to i ∈ S (red arrow).
But in order to transport it, it has to push the mass through edges in the matchings
M1, M2, . . . , Mt that cross the cut.

Clearly, to move the mass from S̄ to S it has to use the matching edges that also cross the
cut.

Now observe that since there are ≥ n/2 vertices in S̄, and each of them has to push ≥ 1/(2n)
mass to i, the total amount of probability mass pushed through the cut for i is ≥ 1/4. Since
there are |S| such vertices i, the total amount of mass that has to cross the cut is ≥ |S|/4.
But note that after each step of the random walk, the total probability mass at each vertex
is exactly 1. Thus, at each step t′, each edge in Mt′ crossing the cut can push at most 1/2
units of probability mass over the cut (and thereafter the edge is gone).
It follows that there must be at least |S|/2 edges in the matchings M1, M2, . . . , Mt that
cross the cut. But this implies that H = ∪i Mi is a 1/2-expander.

Implementing FindBiPartition(·). We can now give away the implementation of
FindBiPartition(·), which you can find below.

Algorithm 17: FindBiPartition(G, {M1, M2, . . . , Mt})
Choose a random n-dimensional vector r orthogonal to 1;
Compute the vector u = Π_t r, i.e. each u(i) = p^t_i · r;
Let S be the n/2 smallest vertices w.r.t. u, and S̄ be the n/2 largest w.r.t. u (ties
broken arbitrarily but consistently);
return (S, S̄)
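A direct Python rendering of this procedure (our own sketch, assuming numpy and a walk matrix Π_t as in the simulation above) could look as follows:

import numpy as np

def find_bipartition(Pi, rng):
    n = Pi.shape[0]
    r = rng.standard_normal(n)
    r -= r.mean()                  # make r orthogonal to the all-ones vector
    u = Pi @ r                     # u(i) = p_i^t . r
    order = np.argsort(u)          # a fixed sort gives consistent tie-breaking
    return set(order[: n // 2]), set(order[n // 2:])

S, S_bar = find_bipartition(np.eye(4), np.random.default_rng(0))
print(S, S_bar)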

The central claim we want to prove is the following. Define a potential function for the
random walk at step t by

Φ^t = Σ_{i,j} (p^t_{j→i} − 1/n)² = Σ_i ‖p^t_i − 1/n‖_2².

Claim 14.4.2. In the algorithm SparsityCertifyOrCut(·), we have E[Φ^t − Φ^{(t+1)}] =
Ω(Φ^t / log n) − O(n^{−5}). Further, we have that Φ^t − Φ^{(t+1)} is always non-negative. The
expectation is over the random vector r chosen in the current round.

Corollary 14.4.3. For appropriate T = Θ(log² n), the algorithm
SparsityCertifyOrCut(·) has Φ^{(T+1)} ≤ 4/n².

To obtain the corollary, one can simply set up a sequence of random 0-1 variables
X^1, X^2, . . . , X^{(T+1)}, where each X^{(t+1)} is 1 if and only if the decrease in potential is at
least an Ω(1/ log n)-fraction of the previous potential. Since the expectation is only over the
current r in each round and we choose these independently at random, one can then use
a Chernoff bound to argue that after T rounds (for an appropriate hidden constant), one
has at least Ω(T) rounds during which the potential is decreased substantially (unless it is
already tiny and the O(n^{−5}) term dominates).
We further observe that this implies that {M1, M2, . . . , MT+1} is mixing (you can straight-
forwardly prove this by contradiction), and thereby we conclude the proof of our main
theorem.

Let us now give the proof of Claim 14.4.2.

Interpreting the Potential Drop. Let us start by writing out the amount by which the
potential decreases:

Φ^t − Φ^{(t+1)} = Σ_i ‖p^t_i − 1/n‖_2² − Σ_i ‖p^{(t+1)}_i − 1/n‖_2².

Consider now the matching M_{t+1} and an edge (i, j) ∈ M_{t+1}. We can re-write the former sum
as Σ_i ‖p^t_i − 1/n‖_2² = Σ_{(i,j)∈M_{t+1}} (‖p^t_i − 1/n‖_2² + ‖p^t_j − 1/n‖_2²), as each vertex occurs as exactly
one endpoint of a matching edge. We can do the same for the (t + 1)-step walk probabilities.
Further, recall that for (i, j) ∈ M_{t+1}, we have p^{(t+1)}_i = p^{(t+1)}_j = (p^t_i + p^t_j)/2. Thus,

Φ^t − Φ^{(t+1)} = Σ_{(i,j)∈M_{t+1}} ( ‖p^t_i − 1/n‖_2² + ‖p^t_j − 1/n‖_2² − ‖p^{(t+1)}_i − 1/n‖_2² − ‖p^{(t+1)}_j − 1/n‖_2² )
              = Σ_{(i,j)∈M_{t+1}} ( ‖p^t_i − 1/n‖_2² + ‖p^t_j − 1/n‖_2² − 2 ‖(p^t_i + p^t_j)/2 − 1/n‖_2² ).

Finally, we can use the formula ‖x‖² + ‖y‖² − 2‖(x + y)/2‖² = (1/2)‖x − y‖² term-wise to
derive

Φ^t − Φ^{(t+1)} = (1/2) Σ_{(i,j)∈M_{t+1}} ‖(p^t_i − 1/n) − (p^t_j − 1/n)‖_2² = (1/2) Σ_{(i,j)∈M_{t+1}} ‖p^t_i − p^t_j‖_2².

The potential thus drops by a lot if vertices i and j are matched where p^t_i and p^t_j differ
starkly. Note that this equality directly implies the remark in our claim that Φ^t − Φ^{(t+1)} is
non-negative.

Understanding the Random Projection. Next, we want to further lower bound the
potential drop using the random vector u. This intuitively helps a lot in our analysis since
we are matching vertices i, j with a high value u(i) and a low value u(j) (or vice versa). We
will show that (w.p. ≥ 1 − n^{−3})

Φ^t − Φ^{(t+1)} = (1/2) Σ_{(i,j)∈M_{t+1}} ‖p^t_i − p^t_j‖_2² ≥ (n − 1)/(64 · log n) · Σ_{(i,j)∈M_{t+1}} |u(i) − u(j)|².   (14.1)

To prove this claim, we argue term-wise, showing that for each pair of vertices i, j ∈ V, we
have ‖p^t_i − p^t_j‖_2² ≥ (n − 1)/(32 · log n) · |u(i) − u(j)|² w.h.p. It will then suffice to take a
union bound over all pairs i, j.
To this end, let us make the following observations: since u(i) = p^t_i · r, we have that
u(i) − u(j) = (p^t_i − p^t_j) · r by linearity. Also note that since Σ_j p^t_{j→i} = 1 for all i (since Π_t
is doubly stochastic), we further have that the vector (p^t_i − p^t_j) is orthogonal to 1.
We can now use the following statement about the random vector r to argue about the effect
of projecting (p^t_i − p^t_j) onto r. Below, we note that we have d = n − 1 since r is chosen from
the (n − 1)-dimensional space orthogonal to 1.

Theorem 14.4.4. If y is a vector of length ℓ in R^d, and r is a unit random vector in R^d, then

• E[(y^T r)²] = ℓ²/d, and

• for x ≤ d/16, we have P[(y^T r)² ≥ x ℓ²/d] ≤ e^{−x/4}.

This allows us to pick x = 32 · log n, and we then obtain that

P[ ((p^t_i − p^t_j) · r)² ≥ (32 log n)/(n − 1) · ‖p^t_i − p^t_j‖_2² ] ≤ e^{−8 log n} = n^{−8}.   (14.2)

Multiplying both sides of the event by (n − 1)/(64 log n) (accounting for the factor 1/2 in
front of the sum), we derive the claimed Inequality (14.1). We can further union bound over
the n/2 matching pairs to all satisfy this bound with probability ≥ 1 − n^{−7}, as desired.

Relating to the Lengths of the Projections. Let µ = max_{i∈S} u(i); then we have by
definition that u(i) ≤ µ ≤ u(j) for all i ∈ S, j ∈ S̄.
Now we can write

Φ^t − Φ^{(t+1)} ≥ (n − 1)/(64 · log n) · Σ_{(i,j)∈M_{t+1}} |u(i) − u(j)|²
             ≥ (n − 1)/(64 · log n) · Σ_{(i,j)∈M_{t+1}} ( (u(i) − µ)² + (u(j) − µ)² )
             = (n − 1)/(64 · log n) · Σ_{i∈V} (u(i) − µ)²
             = (n − 1)/(64 · log n) · ( Σ_{i∈V} u(i)² − 2µ · Σ_{i∈V} u(i) + nµ² )

by standard calculations. We then observe that Σ_i u(i) = Σ_i p^t_i · r = 1 · r = 0 by the
fact that Π_t is doubly stochastic and since r is orthogonal to the all-ones vector. We can
therefore conclude

(n − 1)/(64 · log n) · ( Σ_{i∈V} u(i)² − 2µ · Σ_i u(i) + nµ² ) ≥ (n − 1)/(64 · log n) · Σ_{i∈V} u(i)².   (14.3)

Taking the Expectation. Then, from the first fact of Theorem 14.4.4, we obtain that

E[ Σ_{i∈V} u(i)² ] = Σ_{i∈V} E[u(i)²] = Σ_{i∈V} E[(p^t_i · r)²] = Σ_{i∈V} E[((p^t_i − 1/n) · r)²]   (14.4)
                  = Σ_{i∈V} ‖p^t_i − 1/n‖_2² / (n − 1) = Φ^t / (n − 1),   (14.5)

where we used again that r is orthogonal to 1.
Unfortunately, we cannot directly use this expectation since we already conditioned on the
high-probability events in Equation (14.2). But a simple trick allows us to recover. Let E
denote the union of all of these events. We have by the law of total expectation

E[ Σ_{i∈V} u(i)² ] = P[¬E] · E[ Σ_{i∈V} u(i)² | ¬E ] + P[E] · E[ Σ_{i∈V} u(i)² | E ].

But note that E[ Σ_{i∈V} u(i)² | E ] has to be smaller than n, because Σ_{i∈V} u(i)² = Σ_{i∈V} (p^t_i · r)² ≤ n
with probability 1 (because ‖p^t_i‖_2 ≤ 1 and r is a unit vector). Recall that we calculated
P[E] ≤ n^{−7}. Thus, we can conclude that E[ Σ_{i∈V} u(i)² | ¬E ] ≥ Φ^t/(n − 1) − n^{−6}.

It remains to combine our insights to conclude

E[Φ^t − Φ^{(t+1)} | ¬E] ≥ (n − 1)/(64 · log n) · ( Φ^t/(n − 1) − n^{−6} ) = Ω(Φ^t / log n) − O(n^{−5}).

Since the event E occurs with very low probability, and since Φ^t − Φ^{(t+1)} is always non-negative,
we can conclude that unconditionally, in expectation, the potential decreases
by Ω(Φ^t / log n) − O(n^{−5}).

Part IV

Further Topics

Chapter 15

Separating Hyperplanes, Lagrange Multipliers, KKT Conditions, and Convex Duality

15.1 Overview

The first part of this chapter introduces the concept of a separating hyperplane of two sets,
followed by a proof that for two closed, bounded, convex, and disjoint sets a strictly separating
hyperplane always exists. This is a variant of the more general separating hyperplane
theorem¹ due to Minkowski. Then Lagrange multipliers x, s of a convex optimization problem

min_y E(y)
s.t. Ay = b
     c(y) ≤ 0

are introduced, and with that, the Lagrangian

L(y, x, s) = E(y) + x^T (b − Ay) + s^T c(y)

is defined. Finally, we deal with the dual problem

max_{x,s: s≥0} L(x, s),

where L(x, s) = min_y L(y, x, s). We show weak duality, i.e. L(y, x, s) ≤ E(y) for primal-dual
feasible points, and that, assuming Slater's condition, the optimal values of the primal and
the dual are equal, which is referred to as strong duality.
¹Wikipedia is good on this: https://en.wikipedia.org/wiki/Hyperplane_separation_theorem

15.2 Separating Hyperplane Theorem
Suppose we have two convex subsets A, B ⊆ Rn that are disjoint (A ∩ B = ∅). We wish to
show that there will always be a (hyper-)plane H that separates these two sets, i.e. A lies
on one side, and B on the other side of H.
So what exactly do we mean by Hyperplane? Let’s define it.

Definition 15.2.1 (Hyperplane). A hyperplane H in Rn is a subset of the form H := {x ∈
Rn : ⟨n, x⟩ = µ}. We say H has normal n ∈ Rn and threshold µ ∈ R. It is required that n ≠ 0.

Every hyperplane divides Rn into two halfspaces {x : ⟨n, x⟩ ≥ µ} and {x : ⟨n, x⟩ ≤
µ}. It separates two sets if they lie in different halfspaces. We formally define separating
hyperplanes as follows.

Definition 15.2.2 (Separating Hyperplane). We say a hyperplane H separates two sets
A, B iff

∀a ∈ A : ⟨n, a⟩ ≥ µ
∀b ∈ B : ⟨n, b⟩ ≤ µ.

If we replace ≥ with > and ≤ with <, we say H strictly separates A and B.

It is easy to see that there exist disjoint non-convex sets that cannot be separated by a
hyperplane (e.g. a point cannot be separated from a ring around it). But can two disjoint
convex sets always be strictly separated by a hyperplane? The answer is no: consider the
two-dimensional case depicted in Figure 15.1 with A = {(x, y) : x ≤ 0} and B = {(x, y) :
x > 0 and y ≥ 1/x}. Clearly they are disjoint; however, the only separating hyperplane is
H = {(x, y) : x = 0}, but it intersects A.

Figure 15.1: The sets A = {(x, y) : x ≤ 0} and B = {(x, y) : x > 0 and y ≥ 1/x} only permit
a non-strictly separating hyperplane.

One can prove that there exists a non-strictly separating hyperplane for any two disjoint
convex sets. We will prove that if we further require A, B to be closed and bounded, then a
strictly separating hyperplane always exists. (Note in the example above how our choice of
B is not bounded.)

Theorem 15.2.3 (Separating Hyperplane Theorem; closed, bounded sets). For two closed,
bounded, and disjoint convex sets A, B ⊆ Rn, there exists a strictly separating hyperplane H.
One such hyperplane is given by normal n = d − c and threshold µ = (1/2)(‖d‖_2² − ‖c‖_2²),
where c ∈ A, d ∈ B are the minimizers of the distance between A and B,

dist(A, B) = min_{a∈A, b∈B} ‖a − b‖_2 > 0.

Proof. We omit the proof that dist(A, B) = min_{a∈A,b∈B} ‖a − b‖_2 > 0 is attained, which
follows from A, B being disjoint, closed, and bounded. Now, we want to show that ⟨n, b⟩ > µ
for all b ∈ B; then ⟨n, a⟩ < µ for all a ∈ A follows by symmetry. Observe that

⟨n, d⟩ − µ = ⟨d − c, d⟩ − (1/2)(‖d‖_2² − ‖c‖_2²)
           = ‖d‖_2² − d^T c − (1/2)‖d‖_2² + (1/2)‖c‖_2²
           = (1/2)‖d − c‖_2² > 0.

So suppose there exists u ∈ B such that ⟨n, u⟩ − µ ≤ 0. We now look at the line defined by
the distance minimizer d and the point on the “wrong side” u. Define b(λ) = d + λ(u − d),
and take the derivative of the squared distance between b(λ) and c. Evaluated at λ = 0
(which is when b(λ) = d), this yields

(d/dλ) ‖b(λ) − c‖_2² |_{λ=0} = 2⟨d − λd + λu − c, u − d⟩|_{λ=0} = 2⟨d − c, u − d⟩.

However, this derivative is strictly negative, since

⟨n, u⟩ − µ = ⟨d − c, u⟩ − ⟨d − c, d⟩ + ⟨d − c, d⟩ − µ
           = ⟨d − c, u − d⟩ + ‖d‖_2² − ⟨c, d⟩ − (1/2)‖d‖_2² + (1/2)‖c‖_2²
           = ⟨d − c, u − d⟩ + (1/2)‖d − c‖_2² ≤ 0

implies ⟨d − c, u − d⟩ ≤ −(1/2)‖d − c‖_2² < 0. Since B is convex, b(λ) ∈ B for all λ ∈ [0, 1],
so moving from d slightly towards u strictly decreases the distance to c. This contradicts
the minimality of d and thus concludes this proof.
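As a quick numeric illustration of this construction (our own sketch, assuming numpy; not from the notes), take the two disjoint discs A = B((0,0), 1) and B = B((4,0), 1). Their closest points are c = (1, 0) and d = (3, 0), so n = (2, 0) and µ = (9 − 1)/2 = 4:

import numpy as np

c, d = np.array([1.0, 0.0]), np.array([3.0, 0.0])
n = d - c
mu = (d @ d - c @ c) / 2

rng = np.random.default_rng(1)

def sample_disc(center, k=1000):
    # uniform samples from a unit disc around the given center
    ang = rng.uniform(0, 2 * np.pi, k)
    rad = np.sqrt(rng.uniform(0, 1, k))
    return center + np.c_[rad * np.cos(ang), rad * np.sin(ang)]

A = sample_disc(np.array([0.0, 0.0]))
B = sample_disc(np.array([4.0, 0.0]))
print((A @ n < mu).all(), (B @ n > mu).all())   # True True: strict separation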

A more general separating hyperplane theorem holds even when the sets are not closed and
bounded:

Theorem 15.2.4 (Separating Hyperplane Theorem). Given two disjoint convex sets A, B ⊆
Rn, there exists a hyperplane H separating them.

15.3 Lagrange Multipliers and Duality of Convex Problems

In this section, we'll learn about Lagrange multipliers and how they lead to convex duality.
But first, let's see an example to help illustrate where these ideas come from.
Imagine you were to prove that for all x ∈ Rn we have ‖x‖_p ≤ n^{1/p − 1/2} ‖x‖_2 for some 1 ≤ p ≤ 2.
We can look at this as optimizing max_x ‖x‖_p subject to ‖x‖_2 being constant, e.g. simply
‖x‖_2 = 1. Then the statement above follows from a scaling argument.

Figure 15.2: Looking at fixed ‖x‖_p = α and ‖x‖_2 = 1. (Here, p = 1.5.)

If we move from x to x + δ with δ ⊥ ∇x ‖x‖_2 but δ not orthogonal to ∇x ‖x‖_p, then for
infinitesimally small δ the 2-norm stays constant but the p-norm changes. That means for
either x − δ or x + δ the p-norm increases while the 2-norm stays constant. Hence at the
maximum of ‖x‖_p the gradients of both norms have to be parallel, i.e.

∇x ( ‖x‖_p − λ ‖x‖_2 ) = 0.

This insight is the core idea of Lagrange multipliers (in this case λ).
Note that here the problem is not convex, because {x : ‖x‖_2² = 1} is not convex and because
we are asking to maximize a norm. In the following we will study Lagrange multipliers for
general convex problems.

15.3.1 Karush-Kuhn-Tucker Optimality Conditions for Convex
Problems

A full formal treatment of convex duality would require us to be more careful about using
inf and sup in place of min and max, as well as considering problems that have no feasible
solutions. Today, we'll ignore these concerns.
Let us consider a general convex optimization problem with convex objective, linear equality
constraints and convex inequality constraints

min_{y∈S} E(y)     (15.1)
s.t. Ay = b
     c(y) ≤ 0,

where E : S → R is defined on a convex subset S ⊆ Rn, A ∈ Rm×n, and c(y) is a
vector of constraints c(y) = (ci(y))_{i∈[k]}. For every i ∈ [k] the function ci : S → R should
be convex, which implies the sublevel set {y : ci(y) ≤ 0} is convex. Given a solution y, we
say an inequality constraint is tight at y if ci(y) = 0. In the following, we will denote by
α∗ = E(y∗) the optimal value of this program, where y∗ is a minimizer.
We will call this the primal program. Later, we will see that we can associate another related
convex program with any such convex program; we will call this second program the dual
program.

Definition 15.3.1 (Primal feasibility). We say that y ∈ S is primal feasible if all constraints
are satisfied, i.e. Ay = b and c(y ) ≤ 0.

Now, as we did in our example with the 2- and p-norms, we will try to understand the
relationship between the gradient of the objective function and of the constraint functions
at an optimal solution y ∗ .

A (not quite true!) intuition. Suppose y∗ is an optimal solution to the convex program
above. Let us additionally suppose that y∗ is not on the boundary of S. Then, generally
speaking, because we are at a constrained minimum of E(y∗), we must have that for any
infinitesimal δ s.t. y∗ + δ is also feasible, δ^T ∇E(y∗) ≥ 0, i.e. the infinitesimal does not
decrease the objective. We can also view this as saying that if δ^T ∇E(y∗) < 0, the update
must be infeasible. But what kind of updates will make y∗ + δ infeasible? This will be true
if a_j^T δ ≠ 0 for some linear constraint j or, roughly speaking, ∇ci(y∗)^T δ ≠ 0 for some tight
inequality constraint. But, if this is true for all directions that have a negative inner product
with ∇E(y∗), then we must have that −∇E(y∗) can be written as a linear combination of the a_j,
i.e. gradients of the linear constraints, and of gradients ∇ci(y∗) of tight inequality constraints;
furthermore, the coefficients of the gradients of these tight inequality constraints must be
positive, so that moving along such a direction will increase the function value and violate the
constraint.
To recap, given coefficients x ∈ Rm and s ∈ Rk with s(i) ≥ 0 if ci(y∗) = 0, and s(i) = 0
otherwise, we should be able to write

−∇y E(y∗) = Σ_j x(j) a_j + Σ_i s(i) ∇ci(y∗).   (15.2)

Note that since y∗ is feasible, and hence c(y∗) ≤ 0, we can write the condition that s(i) ≥ 0
if ci(y∗) = 0, and s(i) = 0 otherwise, in a very slick way: namely as s ≥ 0 and s^T c(y∗) = 0.
Traditionally, the variables in s are called slack variables because of this, i.e. they are
non-zero only if there is no slack in the constraint. This condition has a fancy name: when
s^T c(y) = 0 for some feasible y and s ≥ 0, we say that y and s satisfy complementary
slackness. We will think of the vectors s and x as variables that help us prove optimality of
a current solution y, and we call them dual variables.
Definition 15.3.2 (Dual feasibility). We say (x , s) is dual feasible if s ≥ 0. If additionally
y is primal feasible, we say (y , x , s) is primal-dual feasible.

Now, we have essentially argued that at any optimal solution y ∗ , we must have that com-
plementary slackness holds, and that Equation (15.2) holds for some x and some s ≥ 0.
However, while this intuitive explanation is largely correct, it turns out that it can fail for
technical reasons in some weird situations2 . Nonetheless, under some mild conditions, it
is indeed true that the conditions we argued for above must hold at any optimal solution.
These conditions have a name: The Karush-Kuhn-Tucker Conditions. For convenience, we
will state the conditions using ∇c(y ) to denote the matrix whose ith column is given by
∇ci (y ). This is sometimes called the Jacobian of c.
Definition 15.3.3 (The Karush-Kuhn-Tucker (KKT) Conditions). Given a convex opti-
mization problem of the form (15.1) where the domain S is open, suppose y, x, s satisfy the
following conditions:

• Ay = b and c(y) ≤ 0 (primal feasibility)

• s ≥ 0 (dual feasibility)

• ∇y E(y) + A^T x + ∇c(y)s = 0 (KKT gradient condition, i.e. Eq. (15.2) restated)

• s(i) · ci(y) = 0 for all i (complementary slackness)

Then we say that y, x, s satisfy the Karush-Kuhn-Tucker (KKT) conditions.


²Consider the following single-variable optimization problem:

min_{x∈R} x   s.t. x² = 0.

This has only a single feasible point x = 0, which must then be optimal. But at this point, the gradient of
the constraint function is zero, while the gradient of the objective is non-zero. Thus our informal reasoning
breaks down, because there exists an infeasible direction δ we can move along where the constraint function
grows, but only at a rate of O(δ²).
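To make the conditions concrete, here is a tiny numeric check in Python (our own illustration, not from the notes) for the one-dimensional problem min y² s.t. c(y) = 1 − y ≤ 0, whose optimum y∗ = 1 has multiplier s∗ = 2:

def grad_E(y):  return 2 * y      # objective E(y) = y^2
def c(y):       return 1 - y      # constraint c(y) <= 0
def grad_c(y):  return -1.0

y_star, s_star = 1.0, 2.0
assert c(y_star) <= 0                                   # primal feasibility
assert s_star >= 0                                      # dual feasibility
assert grad_E(y_star) + s_star * grad_c(y_star) == 0    # KKT gradient condition
assert s_star * c(y_star) == 0                          # complementary slackness
print("KKT conditions hold at (y*, s*) = (1, 2)")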

15.3.2 Slater’s Condition

There exists many different mild technical conditions under which the KKT conditions do
indeed hold at any optimal solution y . The simplest and most useful is probably Slater’s
condition.

Definition 15.3.4 (Slater’s condition with full domain). A (primal) problem as defined in
(15.1) with S = Rn fulfills Slater’s condition if there exists a strictly feasible point, i.e. there
exists ỹ s.t. Aỹ = b and c(ỹ ) < 0. This means that the strictly feasible point ỹ lies strictly
inside the set {y : c(y ) ≤ 0} defined by the inequality constraints.

One way to think about Slater's condition is that your inequality constraints should not
restrict the solution space to be lower-dimensional. This would be a degenerate case, as the
sublevel sets of the inequality constraints are generically full-dimensional, and you want to
avoid this degenerate case.
We can also extend Slater’s condition to the case when the domain S is an open set. To
extend Slater’s condition to this case, we need the notion of a “relative interior”.

Definition 15.3.5 (Relative interior). Given a convex set S ⊆ Rn, the relative interior of
S is

relint(S) = {x ∈ S : for all y ∈ S there exists ε > 0 such that x − ε(y − x) ∈ S}.

In other words, x ∈ relint(S) if starting at x ∈ S we can move “away” from any y ∈ S by a
little and still be in S. As an example, suppose S = {(s, t) ∈ R² such that s ≥ 0 and t = 0}.
Then (0, 0) ∈ S but (0, 0) ∉ relint(S), while (1, 0) ∈ relint(S).
Now, we can state a more general version of Slater’s condition.

Definition 15.3.6 (Slater's condition). A (primal) problem as defined in (15.1) fulfills
Slater's condition if there exists a strictly feasible point ỹ ∈ relint(S), i.e. a point with
Aỹ = b and c(ỹ) < 0. This means that the strictly feasible point ỹ lies strictly inside the
set {y : c(y) ≤ 0} defined by the inequality constraints.

Finally, we end with a proposition that tells us that given Slater’s condition, the KKT are
indeed necessary for optimality of our convex programs.

Proposition 15.3.7 (KKT necessary for optimality when Slater's condition holds). Con-
sider a convex program of the form (15.1) that satisfies Slater's condition and has an open set
S as its domain. Suppose y is a primal optimal (feasible) solution. Then there exist x, s
such that y, x, s satisfy the KKT conditions.

We will prove this lemma later, after developing some tools we will use in the proof. In fact,
we will also see later that assuming Slater’s condition, the KKT conditions are sufficient for
optimality.

15.3.3 The Lagrangian and The Dual Program

Notice that we can also rewrite the KKT gradient condition as

∇y [ E(y) + x^T (b − Ay) + s^T c(y) ] = 0.

That is, we can write this condition as saying that the gradient of the bracketed quantity (?)
is zero. But what is this quantity (?)? We call it the Lagrangian of the program.

Definition 15.3.8. Given a convex program (15.1), we define the Lagrangian of the program
as

L(y, x, s) = E(y) + x^T (b − Ay) + s^T c(y).

We also define a Lagrangian only in terms of the dual variables by minimizing over y as

L(x, s) = min_y L(y, x, s).

When s ≥ 0, we have that L(y , x , s) is a convex function of y . For each y , the Lagrangian
L(y , x , s) is linear in (x , s) and hence also concave in them. Hence L(x , s) is a concave
function, because it is the pointwise minimum (over y ), of a collection of concave functions
in (x , s).
We can think of x as assigning a price to violating the linear constraints, and of s as assigning
a price to violating the inequality constraints. The KKT gradient condition tells us that at
the given prices, there is no benefit to be gained from locally violating the constraints, i.e.
changing the primal solution y would not improve the cost.
Notice that if y, x, s are primal-dual feasible, then

E(y) = E(y) + x^T (b − Ay)                (as b − Ay = 0)
     ≥ E(y) + x^T (b − Ay) + s^T c(y)     (as c(y) ≤ 0 and s ≥ 0)
     = L(y, x, s).                        (15.3)

Thus, for primal-dual feasible variables, the Lagrangian is always a lower bound on the
objective value.
L(x, s) is defined by minimizing L(y, x, s) over y, i.e. it is the worst-case value of the
lower bound L(y, x, s) across all y. We can think of this as computing how good the
given “prices” are at approximately enforcing the constraints. This naturally leads to a new
optimization problem: how can we choose our prices x, s to get the best (highest) possible
lower bound?

Definition 15.3.9 (Dual problem). We define the dual problem as

max_{x,s: s≥0} min_y L(y, x, s) = max_{x,s: s≥0} L(x, s)   (15.4)

and denote the optimal dual value by β∗.

The dual problem is really a convex optimization problem in disguise: −L(x, s) is a convex
function, and minimizing it is equivalent to maximizing L(x, s),

max_{x,s: s≥0} L(x, s) = − min_{x,s: s≥0} −L(x, s).

When x and s are optimal for the dual program, we say they are dual optimal. And for
convenience, when we also have a primal optimal y, altogether, we will say that (y, x, s)
are primal-dual optimal.
The primal problem can also be written in terms of the Lagrangian:

α∗ = min_y max_{x; s≥0} L(y, x, s).   (15.5)

This is because for a minimizing y all constraints have to be satisfied, in which case the
Lagrangian simplifies to L(y, x, s) = E(y). If Ay − b = 0 were violated, making x large
would send L(y, x, s) → ∞. And if c(y) ≤ 0 is violated, we can make L(y, x, s) → ∞ by
choosing large s.
Note that we require s ≥ 0, as we only want to penalize the violation of the inequality
constraints in one direction, i.e. when c(y) > 0.
For any primal-dual feasible y, x, s we have L(y, x, s) ≤ E(y) (see Equation (15.3)) and
hence also L(x, s) = min_y L(y, x, s) ≤ E(y).
In other words, max_{x; s≥0} L(x, s) = β∗ ≤ α∗. This is referred to as weak duality.
Using the forms in Equations (15.4) and (15.5), we can also state this as

Theorem 15.3.10 (The Weak Duality Theorem). For any convex program (15.1) and its
dual, we have

α∗ = min_y max_{x; s≥0} L(y, x, s) ≥ max_{x; s≥0} min_y L(y, x, s) = β∗.
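For linear programs, weak duality (and in fact strong duality) is easy to observe numerically. The following sketch (our own illustration, assuming scipy is available) solves a small primal LP min c^T y s.t. Ay = b, y ≥ 0 and its dual max b^T x s.t. A^T x ≤ c:

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0, 3.0])

primal = linprog(c, A_eq=A, b_eq=b)                          # default bounds: y >= 0
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)])  # x free; negate to maximize

print(primal.fun, -dual.fun)   # both print 1.0: the dual bound is tight here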

15.3.4 Strong Duality

So now that we have proved weak duality β∗ ≤ α∗, what is strong duality? Is it β∗ = α∗? The
answer is yes, but strong duality only holds under some conditions. Again, a simple sufficient
condition is Slater's condition (Definition 15.3.6).
Theorem 15.3.11. For a program (15.1) satisfying Slater’s condition, strong duality holds,
i.e. α∗ = β ∗ . In other words, the optimal value of the primal problem α∗ is equal to the
optimal value of the dual.

Note that at a primal optimal y∗ and dual optimal x∗, s∗, we have

α∗ = E(y∗) ≥ L(y∗, x∗, s∗) ≥ β∗ = α∗.

Thus, we can conclude that L(y∗, x∗, s∗) = α∗ = β∗.

How are we going to prove this? Before we prove the theorem, let's make a few
observations to get us warmed up. If you get bored, skip ahead to the proof.
It is sufficient to prove that α∗ ≤ β∗, as the statement then follows in conjunction with weak
duality. We define the set

G = {(E(y), Ay − b, c(y)) : y ∈ S},

where S ⊆ Rn is the domain of E.
Immediately, we observe that we can write the optimal primal value as

α∗ = min{t : (t, v, u) ∈ G, v = 0, u ≤ 0}.

Similarly, we can write the Lagrangian (after minimizing over y) as

L(x, s) = min_{(t,v,u)∈G} (1, x, s)^T (t, v, u).

This is equivalent to the inequality, for (t, v, u) ∈ G,

(1, x, s)^T (t, v, u) ≥ L(x, s),

which defines a hyperplane with n = (1, x, s) and µ = L(x, s) such that G is on one side.
To establish strong duality, we would like to show the existence of a hyperplane such that
for (t, v, u) ∈ G,

n^T (t, v, u) ≥ α∗ and n = (1, x̂, ŝ) with ŝ ≥ 0.

Then we would immediately get

β∗ ≥ L(x̂, ŝ) = min_{(t,v,u)∈G} (1, x̂, ŝ)^T (t, v, u) ≥ α∗.

Perhaps not surprisingly, we will use the Separating Hyperplane Theorem. What are the
challenges we need to deal with?

• We need to replace G with a convex set (which we will call A) and separate A from
some other convex set (which we will call B).

• We need to make sure the hyperplane normal n has 1 in the first coordinate and s ≥ 0,
and the hyperplane threshold is α∗ .

Proof of Theorem 15.3.11. For simplicity, our proof will assume that S = Rn, but only a
little extra work is required to handle the general case.
Let's move on to finding two disjoint convex sets A, B to enable the use of the separating
hyperplane Theorem 15.2.4.

First, we define A, roughly speaking, as a multi-dimensional epigraph of G. More precisely,

A = {(t, v, u) : ∃y ∈ S, t ≥ E(y), v = Ay − b, u ≥ c(y)}.

Note that A is a convex set. The proof is similar to the proof that the epigraph of a convex
function is a convex set. The optimal value of the primal program can now be written as

α∗ = min_{(t,0,0)∈A} t.

And we define another set B of the same dimensionality as A by

B := {(r, 0, 0) ∈ R × Rm × Rk : r < α∗}.

This set B is convex, as it is a ray. An example of two such sets A, B is illustrated in
Figure 15.3.
We show that A ∩ B = ∅ by contradiction. Suppose A, B are not disjoint; then there exists
y such that

(E(y), Ay − b, c(y)) = (r, 0, u)

with u ≤ 0 and r < α∗. But this means that y is feasible and E(y) = r < α∗, contradicting
the optimality of α∗.
To make things simpler, we assume that our linear constraint matrix A ∈ Rm×n , has full
row rank and m < n (but very little extra work is required to deal with the remaining cases,
which we omit).
As we just proved, A and B are convex and disjoint sets, and hence the separating hyper-
plane theorem (Theorem 15.2.4) we introduced earlier in this chapter implies the existence
of a separating hyperplane. This means there exists a normal n = (ρ̃, x̃, s̃) and threshold µ
with A on one side, i.e.

(t, v, u) ∈ A =⇒ (t, v, u)^T (ρ̃, x̃, s̃) ≥ µ,   (15.6)

and the set B on the other side:

(t, v, u) ∈ B =⇒ (t, v, u)^T (ρ̃, x̃, s̃) ≤ µ.   (15.7)

Now, we claim that s̃ ≥ 0. Suppose s̃(i) < 0; then letting u(i) → ∞ in (15.6) would force
µ → −∞, contradicting that the threshold µ is finite by the separating hyperplane theorem.
Similarly, we claim ρ̃ ≥ 0: if this were not the case, letting t → ∞ would imply µ → −∞,
again contradicting the finiteness of µ.
From Equation (15.7) it follows that tρ̃ ≤ µ for all t < α∗, which implies that tρ̃ ≤ µ for
t = α∗ by taking the limit. Hence we have α∗ρ̃ ≤ µ. For (t, v, u) ∈ A we get from
Equation (15.6)

(ρ̃, x̃, s̃)^T (t, v, u) ≥ µ ≥ α∗ρ̃

and thus

(ρ̃, x̃, s̃)^T (E(y), Ay − b, c(y)) ≥ α∗ρ̃.   (15.8)

Now we consider two cases, starting with the “good” case where ρ̃ > 0. Dividing Equa-
tion (15.8) by ρ̃ gives

E(y) + (x̃/ρ̃)^T (Ay − b) + (s̃/ρ̃)^T c(y) ≥ α∗.

Noting that the left hand side above is L(y, x̃/ρ̃, s̃/ρ̃) and that the inequality holds for arbitrary
y, and therefore also for the minimum, we get

min_y L(y, x̃/ρ̃, s̃/ρ̃) ≥ α∗

and hence via the definition of β∗ finally

β∗ ≥ L(x̃/ρ̃, s̃/ρ̃) ≥ α∗.

Next consider the “bad” case ρ̃ = 0. As α∗ρ̃ ≤ µ, we have 0 ≤ µ. From Equation (15.6) we
get

c(y)^T s̃ + x̃^T (b − Ay) ≥ µ ≥ 0.

As Slater's condition holds, there is a strictly feasible point ỹ, i.e. it satisfies b − Aỹ = 0 and
c(ỹ) < 0. Together with the equation above this yields

c(ỹ)^T s̃ + x̃^T 0 ≥ 0,

which implies c(ỹ)^T s̃ ≥ 0, and as c(ỹ) < 0 and s̃ ≥ 0, this means s̃ = 0.

As the normal (ρ̃, x̃, s̃) of the hyperplane cannot be all zeroes, this means the last “compo-
nent” x̃ must contain a non-zero entry, i.e. x̃ ≠ 0. Furthermore x̃^T (b − Aỹ) = 0, c(ỹ) < 0,
and A has full row rank; hence there exists δ such that

x̃^T (b − A(ỹ + δ)) < 0 and c(ỹ + δ) < 0.

This, however, means that there is a point in A on the wrong side of the hyperplane, as

(ρ̃, x̃, s̃)^T (E(ỹ + δ), b − A(ỹ + δ), c(ỹ + δ)) < 0

but the threshold is µ ≥ 0.

Remark. Note that our reasoning about why s̃ ≥ 0 in the proof above is very similar to
our reasoning for why the primal program can be written as Problem (15.5).

Example. As an example of A and B as they appear in the above proof, consider

min y²  s.t. y ∈ (0, ∞), 1/y − 1 ≤ 0.

This leads to α∗ = 1, y∗ = 1, and A = {(t, u) : y ∈ (0, ∞) and t ≥ y² and u ≥ 1/y − 1}, and
B = {(t, 0) : t < 1}, and the separating hyperplane normal is n = (1, 2). These two sets A, B
are illustrated in Figure 15.3.

Figure 15.3: Example of the convex sets A and B we wish to separate by hyperplane.

15.3.5 KKT Revisited

Let's come back to our earlier discussion of parallel gradients and the KKT conditions.
Consider a convex program (15.1) with an open set S as its domain. Suppose y∗ is an
optimizer of the primal problem and x∗, s∗ are optimizers of the dual, and suppose that
strong duality holds. We thus have

L(y∗, x∗, s∗) = α∗ = β∗.

In particular, L(y∗, x∗, s∗) = β∗ = L(x∗, s∗) = min_y L(y, x∗, s∗), so y∗ minimizes the
convex function L(y, x∗, s∗) over the open set S. Since E : S → R and c are differentiable,
we must then have that the gradient w.r.t. y is zero, i.e.

∇y L(y, x∗, s∗)|_{y=y∗} = 0.   (15.9)

This says exactly that the KKT gradient condition holds at (y∗, x∗, s∗).

Complementary slackness. We can also see that

E(y∗) = α∗ = E(y∗) + (x∗)^T (b − Ay∗) + (s∗)^T c(y∗) = E(y∗) + (s∗)^T c(y∗),

and hence when the i-th convex constraint is not active, i.e. ci(y∗) < 0, the slack must be
zero, i.e. s∗(i) = 0. Conversely, a non-zero slack s∗(i) ≠ 0 implies that the constraint is
active, i.e. ci(y∗) = 0. This says precisely that the complementary slackness condition
holds at primal-dual optimal (y∗, x∗, s∗). Combined with our previous observation, Equation
(15.9), we get the following result.

Theorem 15.3.12. Consider a convex program (15.1) with an open domain set S whose
dual satisfies strong duality. Then the KKT conditions necessarily hold at any primal-dual
optimal (y∗, x∗, s∗).

Theorem 15.3.12 combined with Theorem 15.3.11 immediately imply Proposition 15.3.7.

KKT is sufficient. We’ve seen that when strong duality holds, the KKT conditions are
necessary for optimality. In fact, they’re also sufficient for optimality, as the next theorem
shows. And in this case, we do not need to assume strong duality, as now it is implied by
the KKT conditions.

Theorem 15.3.13. Consider a convex program (15.1) with an open domain set S. If
the KKT conditions hold at (ỹ, x̃, s̃), then these points are primal-dual optimal, and strong
duality holds.

Proof. ỹ is a global minimizer of y ↦ L(y, x̃, s̃), since this function is convex with vanishing
gradient at ỹ. Hence,

L(ỹ, x̃, s̃) = inf_y L(y, x̃, s̃) = L(x̃, s̃) ≤ β∗.

On the other hand, due to primal feasibility and complementary slackness,

L(ỹ, x̃, s̃) = E(ỹ) + x̃^T (b − Aỹ) + s̃^T c(ỹ) = E(ỹ) ≥ α∗.

Thus, β∗ ≥ α∗. But also β∗ ≤ α∗ by weak duality. Therefore, β∗ = α∗ and ỹ, x̃, s̃ are
primal-dual optimal.

A good reference for basic convex duality theory is Boyd’s free online book “Convex opti-
mization” (linked to on the course website). It provides a number of different interpretations
of duality. One particularly interesting one comes from economics: economists see the slack
variables s as prices for violating the constraints.

Chapter 16

Fenchel Conjugates and Newton's Method

16.1 Lagrange Multipliers and Convex Duality Recap


Recall the convex optimization program we studied last chapter,

min E(y)
s.t. Ay = b     (16.1)
     c(y) ≤ 0,

where E : S → R is defined on a subset S ⊆ Rn, A ∈ Rm×n, and c(y) is a vector of
constraints c(y) = (ci(y))_{i∈[k]}. For every i ∈ [k] the function ci : S → R is convex. We call
(16.1) the primal (program) and denote its optimal value by α∗.
The associated Lagrangian is defined by

L(y, x, s) = E(y) + x^T (b − Ay) + s^T c(y),

where x ∈ Rm, s ∈ Rk are dual variables. The dual (program) is given by

max_{x,s: s≥0} L(x, s)   (16.2)

whose optimal value is denoted by β∗. The dual is always a convex optimization program,
even when the primal is non-convex. The optimal value of the primal (16.1) can also be
written as

α∗ = inf_y sup_{x; s≥0} L(y, x, s),   (16.3)

where no constraint is imposed on the primal variable y. The optimal value of the dual
(16.2) is

β∗ = sup_{x; s≥0} inf_y L(y, x, s).   (16.4)

Note the only difference between (16.3) and (16.4) is that the positions of “inf” and “sup”
are swapped. The weak duality theorem states that the dual optimal value is a lower bound
on the primal optimal value, i.e. β∗ ≤ α∗.
Slater's condition for (16.1) with open domain S requires the existence of a strictly feasible
point, i.e. there exists ỹ ∈ S s.t. Aỹ = b and c(ỹ) < 0. This means that the strictly feasible
point ỹ lies inside the interior of the set {y : c(y) ≤ 0} defined by the inequality constraints.
If the domain S is not open, Slater's condition also requires that a strictly feasible point is
in the relative interior of S (NB: when S is open, S is equal to its relative interior). The
strong duality theorem says that Slater's condition implies strong duality, β∗ = α∗.
We were also introduced to the KKT conditions, and we saw that for our convex programs
with continuously differentiable objective and constraint functions, when the domain is
open, the conditions are sufficient to imply strong duality and primal-dual optimality of the
points that satisfy them. If Slater's condition holds, we also have that KKT necessarily holds
at any optimal solution. In summary: Slater's condition =⇒ strong duality ⇐⇒ KKT

Example. In Chapter 12, we gave a combinatorial proof of the min-cut max-flow theorem,
and showed that the min-cut program can be expressed as a linear program. Now, we will
use the strong duality theorem to give an alternative proof, and directly find that the min-cut
linear program is the dual program to our maximum flow linear program.
We will assume that Slater's condition holds for our primal program. Since scaling the flow
down enough will always ensure that capacity constraints are strictly satisfied, i.e. f < c, the
only concern is to make sure that the non-negativity constraints are satisfied. This means that
there should be an s-t flow that sends a non-zero flow on every edge. In fact, this may not
always be possible, but it is easy to detect the problematic edges and remove them without
changing the value of the program: an edge (u, v) should be removed if there is no path from
s to u or no path from v to t. We can identify all such edges using a BFS from s along the
directed edges and a BFS along reversed directed edges from t.
Slater's condition holds whenever there is a directed path from s to t with non-zero capacity
(and if there is not, the maximum flow and minimum cut are both zero).

min_{F∈R; Bf = F b_{s,t}; 0≤f≤c} −F = min_{F; f≥0} max_{x; s≥0} −F + x^T (F b_{s,t} − Bf) + (f − c)^T s

(Slater's condition =⇒ strong duality)

−max_{F∈R; Bf = F b_{s,t}; 0≤f≤c} F = max_{x; s≥0} min_{F; f≥0} F (b_{s,t}^T x − 1) + f^T (s − B^T x) − c^T s

                                    = max_{x; s≥0; b_{s,t}^T x = 1; s ≥ B^T x} −c^T s

Thus switching signs gives us

max_{F∈R; Bf = F b_{s,t}; 0≤f≤c} F = min_{x; s≥0; b_{s,t}^T x = 1; s ≥ B^T x} c^T s.   (16.5)

The LHS of (16.5) is exactly the LP formulation of max-flow, while the RHS is exactly the
LP formulation of min-cut. Note that we treated the “constraint” 0 ≤ f as a restriction on
the domain of f rather than a constraint with a dual variable associated with it. We always
have this kind of flexibility when deciding how to compute a dual, and some choices may
lead to a simpler dual program than others.
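One can verify the equality of the two optimal values in (16.5) numerically, e.g. with the following sketch (our own illustration, assuming the networkx library is available):

import networkx as nx

G = nx.DiGraph()
G.add_edge("s", "a", capacity=3)
G.add_edge("s", "b", capacity=2)
G.add_edge("a", "t", capacity=2)
G.add_edge("b", "t", capacity=3)
G.add_edge("a", "b", capacity=1)

flow_value, _ = nx.maximum_flow(G, "s", "t")   # LHS of (16.5)
cut_value, _ = nx.minimum_cut(G, "s", "t")     # RHS of (16.5)
print(flow_value, cut_value)                   # 5 5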

16.2 Fenchel Conjugates


In this section we will learn about Fenchel conjugates. This is a notion of dual function
that is closely related to duality of convex programs, and we will learn more about how dual
programs behave by studying these functions.

Definition 16.2.1 (Fenchel conjugate). Given a (convex) function E : S ⊆ Rn → R, its
Fenchel conjugate is the function E∗ : Rn → R defined as

E∗(z) = sup_{y∈S} ⟨z, y⟩ − E(y).

Remark 16.2.2. E∗ is a convex function whether E is convex or not, since E∗(z) is a pointwise
supremum of a family of convex (here, affine) functions of z.

In this course, we have only considered convex functions that are real-valued and continuous
and defined on a convex domain. For any such E, we have E ∗∗ = E, i.e. the Fenchel conjugate
of the Fenchel conjugate is the original function. This is a consequence of the Fenchel-Moreau
theorem, which establishes this under slightly more general conditions. We will not prove
this generally, but as part of Theorem 16.2.3 below, we sketch a proof under more restrictive
assumptions.

Example. Let $E(y) = \frac{1}{p}\|y\|_p^p$ with $p > 1$. We want to evaluate its Fenchel conjugate $E^*$ at any given point $z \in \mathbb{R}^n$. Since $E$ is convex and differentiable, the supremum must be achieved at some $y^*$ with vanishing gradient:
$$\nabla_y \left( \langle z, y \rangle - E(y) \right)\big|_{y = y^*} = z - \nabla E(y^*) = 0 \iff z = \nabla E(y^*).$$
It is not difficult to see that, for all $i$,
$$z(i) = \mathrm{sgn}(y^*(i)) |y^*(i)|^{p-1}.$$

Then,
$$E^*(z) = \langle z, y^* \rangle - E(y^*) = \sum_i |z(i)|^{\frac{1}{p-1}+1} - \frac{1}{p}|z(i)|^{\frac{p}{p-1}} = \frac{1}{q} \|z\|_q^q,$$
where $q$ is defined by $\frac{1}{q} + \frac{1}{p} = 1$, so that $\frac{p}{p-1} = q$.
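As a quick numeric check of this conjugate pair, the following sketch compares the closed form $\frac{1}{q}\|z\|_q^q$ against a direct maximization of $\langle z, y \rangle - E(y)$; the choice $p = 3$ and the test vector $z$ are arbitrary.

import numpy as np
from scipy.optimize import minimize

p = 3.0
q = p / (p - 1.0)
z = np.array([0.5, -1.2, 2.0, 0.1])

E = lambda y: np.sum(np.abs(y) ** p) / p
neg = lambda y: -(z @ y - E(y))               # we maximize <z, y> - E(y)

numeric = -minimize(neg, x0=np.zeros_like(z)).fun
closed_form = np.sum(np.abs(z) ** q) / q      # the claimed value (1/q)||z||_q^q
print(numeric, closed_form)                   # the two values should agree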

More generally, given a convex and differentiable function $E : S \to \mathbb{R}$, if there exists $y^* \in S$ s.t. $z = \nabla E(y^*)$, then $E^*(z) = (y^*)^\top \nabla E(y^*) - E(y^*)$. The Fenchel conjugate and Lagrange duality are closely related, as the following example demonstrates.

Example. Consider a convex optimization program with only linear constraints,
$$\min_{y \in \mathbb{R}^n} E(y) \quad \text{s.t.} \quad Ay = b,$$
where $E : \mathbb{R}^n \to \mathbb{R}$ is a convex function and $A \in \mathbb{R}^{m \times n}$. Then the corresponding dual program is
$$\sup_{x \in \mathbb{R}^m} \inf_{y \in \mathbb{R}^n} E(y) + x^\top (b - Ay) = \sup_{x \in \mathbb{R}^m} b^\top x - \sup_{y \in \mathbb{R}^n} \left( x^\top A y - E(y) \right) = \sup_{x \in \mathbb{R}^m} b^\top x - E^*(A^\top x).$$

Theorem 16.2.3 (Properties of the Fenchel conjugate). Consider a strictly convex function
E : S → R where S ⊆ Rn is an open convex set. When E is differentiable with a Hessian
that is positive definite everywhere and its gradient ∇E is surjective onto Rn , we have the
following three properties:

1. ∇E(∇E ∗ (z )) = z and ∇E ∗ (∇E(y )) = y

2. (E ∗ )∗ = E, i.e. the Fenchel conjugate of the Fenchel conjugate is the original function.

3. $H_{E^*}(\nabla E(y)) = H_E(y)^{-1}$

Figure 16.1: Properties of the Fenchel conjugate. The map $\nabla_y E$ sends a primal point $y$ to the dual point $z$, and $\nabla_z E^*$ sends $z$ back to $y$: $\nabla_y E(y) = z$ and $\nabla_z E^*(z) = y$. The Hessians are inverses of one another: $H_E(y) = H_{E^*}(z)^{-1}$ and $H_{E^*}(z) = H_E(y)^{-1}$.

Proof sketch. Part 1. Because the gradient ∇y E is surjective onto Rn , given any z ∈ Rn ,
there exists a y such that ∇E(y ) = z . Let y (z ) be a y s.t. ∇E(y ) = z . It can be shown
that because E is strictly convex, y (z ) is unique.
The function $y \mapsto \langle z, y \rangle - E(y)$ is concave in $y$, has gradient $z - \nabla E(y)$, and is hence
maximized at y = y (z ). This follows because we know a differentiable convex function is
minimized when its gradient is zero and so a differentiable concave function is maximized
when its gradient is zero.
Then, using the product rule and the chain rule (writing $\nabla_z y(z)$ for the Jacobian of the map $z \mapsto y(z)$),
$$\nabla E^*(z) = \nabla_z \left( \langle z, y(z) \rangle - E(y(z)) \right) = y(z) + (\nabla_z y(z))^\top z - (\nabla_z y(z))^\top \underbrace{\nabla_y E(y(z))}_{=z} = y(z).$$

Thus we have ∇y E(y (z )) = z and ∇z E ∗ (z ) = y (z ). Combining the two, we have


∇E(∇E ∗ (z )) = z .
We can also see that for any y , there exists a z such that ∇z E ∗ (z ) = y , namely, this is
attained by z = ∇y E(y ). Thus, ∇E ∗ (∇E(y )) = y .
Part 2. Observe that
$$E^{**}(u) = \sup_{z \in \mathbb{R}^n} \langle u, z \rangle - E^*(z),$$
and let $z(u)$ denote the $z$ attaining the supremum in the above program. We then have $u = \nabla E^*(z(u))$. Letting $y(z)$ be defined as in Part 1, we get $y(z(u)) = \nabla_z E^*(z(u)) = u$, and hence
$$E^{**}(u) = \langle u, z(u) \rangle - \left( \langle z(u), y(z(u)) \rangle - E(y(z(u))) \right) = E(u).$$

Part 3. Now we add two infinitesimals $\tau$ and $\delta$ to $z$ and $y$ respectively s.t.
$$\nabla_z E^*(z + \tau) = y + \delta, \qquad \nabla_y E(y + \delta) = z + \tau.$$
Then,
$$\nabla_y E(y + \delta) - \nabla_y E(y) = \tau, \qquad \nabla_z E^*(z + \tau) - \nabla_z E^*(z) = \delta.$$

Since $H_E(y)$ measures the change of $\nabla_y E(y)$ when $y$ changes by an infinitesimal $\delta$,
$$\nabla_y E(y + \delta) - \nabla_y E(y) \approx H_E(y)\delta \iff H_E(y)^{-1}\left( \nabla_y E(y + \delta) - \nabla_y E(y) \right) \approx \delta \iff H_E(y)^{-1}\tau \approx \delta = \nabla E^*(z + \tau) - \nabla E^*(z). \qquad (16.6)$$
Similarly,
$$H_{E^*}(z)\tau \approx \nabla_z E^*(z + \tau) - \nabla_z E^*(z). \qquad (16.7)$$
Comparing (16.6) and (16.7), it is easy to see
$$H_{E^*}(z) = H_E(y)^{-1} \iff H_{E^*}(\nabla E(y)) = H_E(y)^{-1}.$$

Remark 16.2.4. Theorem 16.2.3 can be generalized to show that the Fenchel conjugate has
similar nice properties under much more general conditions, e.g. see [BV04].

16.3 Newton’s Method

16.3.1 Warm-up: Quadratic Optimization

First, let us play with a toy example: minimizing a quadratic function
$$E(y) = \frac{1}{2} y^\top A y + b^\top y + c,$$
where $A \in \mathbb{R}^{n \times n}$ is positive definite. By setting the gradient w.r.t. $y$ to zero,
$$\nabla E(y) = Ay + b = 0,$$
we obtain the global minimizer
$$y^* = -A^{-1} b.$$
To make it more like gradient descent, let us start at some “guess” point y and take a step
δ to move to the new point y + δ. Then we try to minimize E(y + δ) by setting the gradient
w.r.t. δ to zero,

$$\nabla_\delta E(y + \delta) = A(y + \delta) + b = 0 \implies \delta = -y - A^{-1}b \implies y + \delta = -A^{-1}b.$$

This gives us exactly the global minimizer in just one step. However, the situation changes when the function is no longer quadratic, and thus the Hessian is no longer constant. But taking a step which tries to set the gradient to zero might still be a good idea.
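The following minimal sketch (with an arbitrary random positive definite $A$) illustrates the computation: a single full Newton step from any starting point lands exactly on the minimizer of the quadratic.

import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)                  # positive definite
b = rng.normal(size=4)

y = rng.normal(size=4)                   # an arbitrary "guess" point
grad = A @ y + b                         # gradient of (1/2) y^T A y + b^T y + c
delta = np.linalg.solve(A, -grad)        # Newton step: solve A delta = -grad
print(np.allclose(y + delta, np.linalg.solve(A, -b)))   # True: reached -A^{-1} b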

16.3.2 K-stable Hessian

Next, consider a convex function $E : \mathbb{R}^n \to \mathbb{R}$ whose Hessian is "nearly constant". Recall that the Hessian $H_E(y)$, aka $\nabla^2 E(y)$, at a point $y$ is just the matrix of pairwise 2nd order partial derivatives $\frac{\partial^2 E(y)}{\partial y_i \partial y_j}$. We say $E$ has a $K$-stable Hessian if there exists a constant matrix $A$ s.t. for all $y$
$$H_E(y) \approx_K A \iff \frac{1}{1+K} A \preceq H_E(y) \preceq (1+K) A.$$
Note that we just require the existence of $A$ and do not assume we know $A$. A natural question to ask is what convergence rate can be achieved if we take a gradient step "guided" by the Hessian, which is called a Newton step. Such a method is also known as a second-order method. Note that this is very similar to preconditioning.
Now, let us make our setting precise. We want to minimize a convex function $E$ with a $K$-stable Hessian, where $A \succ 0$, and $y^*$ is a global minimizer of $E$. Starting from some initial point $y_0$, the update rule is
$$y_{i+1} = y_i - \alpha \cdot H_E(y_i)^{-1} \nabla E(y_i),$$
where $\alpha$ is a step size to be decided later.
Theorem 16.3.1. $E(y_k) - E(y^*) \le \epsilon \left( E(y_0) - E(y^*) \right)$ when $k > (K+1)^6 \log(1/\epsilon)$.

Proof. By Taylor’s theorem, there exists ỹ ∈ [y , y + δ] s.t.


1
E(y + δ) = E(y ) + ∇E(y )> δ + δ > H E (ỹ )δ
2
(K + 1)2 >
≤ E(y ) + ∇E(y )> δ + δ H E (y )δ (16.8)
| {z 2 }
=:f (δ)

where the inequality comes from the K-stability of the Hessian,


H E (ỹ )  (1 + K)A  (1 + K)2 H E (y ).
Observe that $f(\delta)$ is a convex quadratic function in $\delta$. By minimizing it, or equivalently setting $\nabla_\delta f(\delta^*) = 0$, we get
$$\delta^* = -\frac{1}{(K+1)^2} H_E(y)^{-1} \nabla_y E(y). \qquad (16.9)$$
Here, the step size $\alpha$ is equal to $(K+1)^{-2}$. Then, plugging (16.9) into (16.8),
$$E(y + \delta^*) \le E(y) - \frac{1}{2(K+1)^2} \nabla_y E(y)^\top H_E(y)^{-1} \nabla_y E(y) \le E(y) - \frac{1}{2(K+1)^3} \nabla_y E(y)^\top A^{-1} \nabla_y E(y).$$
Subtracting $E(y^*)$ on both sides,
$$E(y + \delta^*) - E(y^*) \le E(y) - E(y^*) - \frac{1}{2(K+1)^3} \underbrace{\nabla_y E(y)^\top A^{-1} \nabla_y E(y)}_{=: \sigma},$$

where the second inequality is due to $K$-stability of the inverse Hessian,
$$\frac{1}{1+K} A^{-1} \preceq H_E(y)^{-1} \preceq (1+K) A^{-1}.$$
Meanwhile, using Taylor's theorem and $K$-stability, for some $\hat{y}$ between $y$ and $y^*$, and noting $\nabla E(y^*) = 0$, we have
$$E(y) = E(y^*) + \nabla E(y^*)^\top (y - y^*) + \frac{1}{2} (y - y^*)^\top H_E(\hat{y}) (y - y^*),$$
$$E(y) - E(y^*) \le \frac{K+1}{2} \underbrace{(y - y^*)^\top A (y - y^*)}_{=: \gamma}.$$

Next, our task is reduced to comparing $\sigma$ and $\gamma$. For $t \in [0, 1]$, $y_t := y^* + t(y - y^*)$ is a point on the segment connecting $y^*$ and $y$. Since
$$\nabla E(y) = \nabla E(y) - \nabla E(y^*) = \int_0^1 H(y_t)(y - y^*)\, dt,$$
we get
$$(y - y^*)^\top \nabla E(y) = \int_0^1 (y - y^*)^\top H(y_t)(y - y^*)\, dt \ge \frac{1}{K+1} \int_0^1 (y - y^*)^\top A (y - y^*)\, dt = \frac{\gamma}{K+1}. \qquad (16.10)$$
K +1
On the other hand, define $z_s = \nabla E(y^*) + s(\nabla E(y) - \nabla E(y^*))$; then $dz_s = \nabla E(y)\, ds$, since $\nabla E(y^*) = 0$. Using Theorem 16.2.3, we have
$$y - y^* = \int_0^1 H_{E^*}(z_s) \nabla E(y)\, ds.$$
Then,
$$\nabla E(y)^\top (y - y^*) = \int_0^1 \nabla E(y)^\top H_{E^*}(z_s) \nabla E(y)\, ds \le (K+1) \int_0^1 \nabla E(y)^\top A^{-1} \nabla E(y)\, ds \le (K+1)\sigma. \qquad (16.11)$$
Combining (16.10) and (16.11) yields $\gamma \le (K+1)^2 \sigma$. Therefore,
$$E(y + \delta^*) - E(y^*) \le \left( E(y) - E(y^*) \right) \left( 1 - \frac{1}{(K+1)^6} \right).$$

Remark 16.3.2. The basic idea of relating $\sigma$ and $\gamma$ in the above proof is writing the same quantity, $\nabla E(y)^\top (y - y^*)$, as two integrals along different lines. The factor $(K+1)^6$ can be reduced to $(K+1)^2$, and even to $(K+1)$, with more care. In some settings, Newton's method converges in $\log\log(1/\epsilon)$ steps.
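A minimal sketch of the damped Newton iteration from Theorem 16.3.1, on an assumed toy objective $E(y) = \frac{1}{2}y^\top A y + \sum_i \log(1 + e^{y_i})$ with $A = MM^\top + I$. Since $0 \preceq H_E(y) - A \preceq \frac{1}{4}I \preceq \frac{1}{4}A$, this $E$ has a $K$-stable Hessian with $K = 0.25$, so the analysis suggests the step size $\alpha = (K+1)^{-2}$.

import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)

sigmoid = lambda y: 1.0 / (1.0 + np.exp(-y))
grad = lambda y: A @ y + sigmoid(y)      # gradient of (1/2)y^T A y + sum_i log(1+e^{y_i})
hess = lambda y: A + np.diag(sigmoid(y) * (1.0 - sigmoid(y)))

K = 0.25
y = rng.normal(size=5)
for _ in range(50):                      # y <- y - (K+1)^{-2} H(y)^{-1} grad(y)
    y -= np.linalg.solve(hess(y), grad(y)) / (K + 1) ** 2
print(np.linalg.norm(grad(y)))           # small: y is close to the global minimizer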

16.3.3 Linearly Constrained Newton’s Method

Let us apply Newton’s method to convex optimization programs with only linear constraints,

min E(f )
f ∈Rm

s.t. Bf = d

where E : Rm → R is a convex function and B ∈ Rn×m . Wlog, let d = 0, since otherwise


we can equivalently deal the following program with Bf 0 = d ,

min E(f 0 + ρ)
ρ∈Rm

s.t. Bρ = 0

It is useful to think of the variable $f \in \mathbb{R}^m$ as a flow in a graph. Define $C := \{f : Bf = 0\}$, the kernel of $B$. $C$ is also called the "cycle space", as it is the set of cycle flows when treating $f$ as a flow.

Analyzing the convergence of Newton’s method with linear constraints. It is not


immediately obvious, but our analysis of Newton steps and their convergence on objectives
with a K-stable Hessian carries over directly to linearly constrained convex optimization
problems. We will only sketch a proof of this. Firstly, we should notice that instead of
thinking of our objective function E as defined on Rm and then constrained to inputs f ∈ C,
we can think of a new function Ê : C → R defined such that for f ∈ C we have Ê(f ) = E(f ).
But, $C$ is a linear subspace and is isomorphic to $\mathbb{R}^{\dim(C)}$. (You don't need to know the formal definition of isomorphism of vector spaces; in this context, it means equivalent up to a transformation by an invertible matrix, and in our case the isomorphism is in fact given by an orthonormal matrix.) This means that our previous
analysis can be directly applied, if we can compute the gradient and Hessian of the function
viewed as an unconstrained function on C (or equivalently Rdim(C) ). We now have two
important questions to answer:

1. What does the gradient and Hessian, and hence Newton steps, of Ê(f ) look like?
2. Does the K-stability of the Hessian of E carry over to the function Ê?

The gradient and Hessian of $\hat{E}$ should live in $\mathbb{R}^{\dim(C)}$ and $\mathbb{R}^{\dim(C) \times \dim(C)}$ respectively. Let $\Pi_C$ be the orthogonal projection matrix onto $C$, meaning $\Pi_C$ is symmetric and $\Pi_C \delta = \delta$ for any $\delta \in C$. Given any $f \in C$, add to it an infinitesimal $\delta \in C$; then

$$\hat{E}(f + \delta) = E(f + \delta) \approx E(f) + \langle \nabla E(f), \delta \rangle = E(f) + \langle \nabla E(f), \Pi_C \delta \rangle = E(f) + \langle \Pi_C \nabla E(f), \delta \rangle.$$

From this, we can deduce that the gradient of $\hat{E}$ at a point $f \in C$ is essentially equal (up to a fixed linear transformation independent of $f$) to the projection of the gradient of $E$ at $f$ onto the subspace $C$. Similarly,
$$\hat{E}(f + \delta) = E(f + \delta) \approx E(f) + \langle \Pi_C \nabla E(f), \delta \rangle + \frac{1}{2} \langle \delta, \Pi_C H_E(f) \Pi_C \delta \rangle.$$
Again from this, we can deduce that the Hessian of Ê at a point f ∈ C is essentially equal
(again up to a fixed linear transformation independent of f ) to the matrix ΠC H E (f )ΠC .
Note that $X \preceq Y$ implies $\Pi_C X \Pi_C \preceq \Pi_C Y \Pi_C$, and from this we can see that the Hessian
of $\hat{E}$ is $K$-stable if the Hessian of $E$ is. Also note that we were not terribly formal in the
discussion above. We can be more precise by replacing ΠC with a linear map from Rdim(C)
to Rm which maps any vector in Rdim(C) to a vector in C and then going through a similar
chain of reasoning.
What is a Newton step δ ∗ w.r.t. Ê? It turns out that for actually computing the Newton
step, it is easier to think again of E with a constraint that the Newton step must lie in the
subspace C. One can show that this is equivalent to the Newton step of Ê, but we omit this.
In the constrained view, $\delta^*$ should be a minimizer of
$$\min_{\substack{\delta \in \mathbb{R}^m \\ B\delta = 0}} \langle \underbrace{\nabla E(f)}_{=: g}, \delta \rangle + \frac{1}{2} \langle \delta, \underbrace{H_E(f)}_{=: H} \delta \rangle,$$
which by Lagrange duality is equivalent to
$$\max_{x \in \mathbb{R}^n} \min_{\delta \in \mathbb{R}^m} \underbrace{\langle g, \delta \rangle + \frac{1}{2} \langle \delta, H\delta \rangle - x^\top B\delta}_{\text{Lagrangian } L(\delta, x)}. \qquad (16.12)$$

Applying the KKT optimality conditions, one has
$$B\delta = 0, \qquad \nabla_\delta L(\delta, x) = g + H\delta - B^\top x = 0,$$
from which we get
$$\delta + H^{-1}g = H^{-1}B^\top x \implies \underbrace{B\delta}_{=0} + BH^{-1}g = BH^{-1}B^\top x \implies BH^{-1}g = \underbrace{BH^{-1}B^\top}_{=: L}\, x.$$

Finally, the solutions to (16.12) are
$$x^* = L^{-1}BH^{-1}g, \qquad \delta^* = -H^{-1}g + H^{-1}B^\top x^*.$$

It is easy to verify that Bδ ∗ = 0. Thus, our update rule is f i+1 = f i + δ ∗ . And we have the
following convergence result.

Theorem 16.3.3. $\hat{E}(f_k) - \hat{E}(f^*) \le \epsilon \cdot \left( \hat{E}(f_0) - \hat{E}(f^*) \right)$ when $k > (K+1)^6 \log(1/\epsilon)$.


Remark 16.3.4. Note that if $E(f) = \sum_{i=1}^m E_i(f(i))$, then $H_E(f)$ is diagonal. Thus, $L = BH^{-1}B^\top$ is indeed a Laplacian provided that $B$ is an incidence matrix. Therefore, the linear equations we need to solve to apply Newton's method in a network flow setting are Laplacian systems, which means we can solve them very quickly.
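The following sketch assembles one constrained Newton step for an assumed toy instance: $B$ is the incidence matrix of a small graph and $H$ a random diagonal Hessian, so $L = BH^{-1}B^\top$ is an honest weighted Laplacian. The final check confirms $B\delta^* = 0$.

import numpy as np

edges = [(0, 1), (1, 2), (2, 0), (1, 3)]
n, m = 4, len(edges)
B = np.zeros((n, m))
for j, (a, b) in enumerate(edges):
    B[a, j], B[b, j] = 1.0, -1.0              # incidence matrix of the toy graph

rng = np.random.default_rng(3)
Hinv = np.diag(1.0 / rng.uniform(1.0, 2.0, size=m))   # inverse of a diagonal Hessian
g = rng.normal(size=m)

L = B @ Hinv @ B.T                             # a weighted graph Laplacian
x_star = np.linalg.lstsq(L, B @ Hinv @ g, rcond=None)[0]   # L is singular: use lstsq
delta = -Hinv @ g + Hinv @ B.T @ x_star
print(np.allclose(B @ delta, 0.0))             # True: the Newton step lies in ker(B)

(In the actual algorithm, the system $Lx = BH^{-1}g$ would of course be solved with a fast Laplacian solver rather than dense linear algebra.)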

Chapter 17

Interior Point Methods for Maximum Flow

Background and Notation

In this chapter, we’ll learn about interior point methods for solving maximum flow, which is
a rich and active area of research [DS08, Mad13, LS20b, LS20a].
We’re going to frequently need to refer to vectors arising from elementwise operations com-
bining other vectors.
−−−−−−→
To that end, given two vector a ∈ Rm , and b ∈ Rm , we will use (aa (i)b(i)) to denote the
vector z with z (i) = a (i)b(i) and so on.
Throughout this chapter, when we are working in the context of some given graph G with
vertices V and edges E, we will let m = |E| and n = |V |.
The plots in this chapter were made using Mathematica, which is available to ETH students
for download through the ETH IT Shop.

17.1 An Interior Point Method

The Maximum Flow problem in undirected graphs.

$$\max_{f \in \mathbb{R}^E} F \qquad (17.1)$$
$$\text{s.t.} \quad Bf = F b_{st} \qquad \text{``The Undirected Maximum Flow Problem''}$$
$$-c \le f \le c$$

We use $\mathrm{val}(f)$ to denote $F$ when $Bf = F b_{st}$.

As we develop algorithms for this problem, we will assume that we know the maximum flow value $F^*$. Let $f^*$ denote some maximum flow, i.e. a flow with $-c \le f^* \le c$ and $\mathrm{val}(f^*) = F^*$. In general, a lower bound $F \le F^*$ will allow us to find a flow with value $F$, and because of this, we can use a binary search to approximate $F^*$.

17.1.1 A Barrier Function and an Algorithm


$$V(f) = \sum_e -\log(c(e) - f(e)) - \log(c(e) + f(e))$$

We assume the optimal value of Program (17.1) is F ∗ . Then for a given 0 ≤ α < 1 we define
a program

$$\min_{f \in \mathbb{R}^E} V(f) \qquad (17.2)$$
$$\text{s.t.} \quad Bf = \alpha F^* b_{st} \qquad \text{``The Barrier Problem''}$$

This problem makes sense for any 0 ≤ α < 1. When α = 0, we are not routing any flow
yet. This will be our starting point. For any 0 ≤ α < 1, the scaled-down maximum flow
αf ∗ strictly satisfies the capacities −c < αf ∗ < c, and Bαf ∗ = αF ∗ b st . Hence αf ∗ is a
feasible flow for this value of α and hence V (αf ∗ ) < ∞ and so the optimal flow for the Barrier
Problem at this α must also have objective value strictly below ∞, and hence in turn strictly
satisfy the capacity constraints. Thus, if we can find the optimal flow for Program (17.2)
for $\alpha = 1 - \epsilon$, we will have a feasible flow for Program (17.1), the Undirected Maximum Flow Problem, routing $(1 - \epsilon)F^*$. This is how we will develop an algorithm for computing
the maximum flow.
Program (17.2) has the Lagrangian

$$L(f, x) = V(f) + x^\top (\alpha F^* b_{st} - Bf)$$

And we have optimality when

$$Bf = \alpha F^* b_{st} \quad \text{and} \quad -c \le f \le c \qquad (17.3) \quad \text{``Barrier feasibility''}$$
and $\nabla_f L(f, x) = 0$, i.e.
$$\nabla V(f) = B^\top x \qquad (17.4) \quad \text{``Barrier Lagrangian gradient optimality''}$$

Let $f^*_\alpha$ denote the optimal solution to Problem (17.2) for a given $0 \le \alpha < 1$, and let $x^*_\alpha$ be optimal dual voltages such that $\nabla V(f^*_\alpha) = B^\top x^*_\alpha$.

It turns out that, if we have a solution $f^*_\alpha$ to this problem for some $\alpha < 1$, then we can find a solution $f^*_{\alpha + \alpha'}$ for some $\alpha' < 1 - \alpha$. And, we can compute $f^*_{\alpha + \alpha'}$ using a small number of Newton steps, each of which will only require a Laplacian linear equation solve, and hence is computable in $\tilde{O}(m)$ time. Concretely, for any $0 \le \alpha < 1$, given the optimal flow at this $\alpha$, we will be able to compute the optimal flow at $\alpha_{\mathrm{new}} = \alpha + (1 - \alpha)\frac{1}{150\sqrt{m}}$. This means that after $T = 150\sqrt{m}\log(1/\epsilon)$ updates, we have a solution for $\alpha \ge 1 - \epsilon$.
We can state the update problem as

min V (δ + f ) (17.5)
δ∈RE
0 ∗
s.t. Bδ = α F b st “The Update Problem”

17.1.2 Updates using Divergence

It turns out that for the purposes of analysis, it will be useful to ensure that our “Update
Problem” uses an objective function that is minimized at δ = 0.
This leads to a variant of the Update Problem, which we call the "Divergence Update Problem". We obtain our new problem by switching the objective from $V(\delta + f)$ to $V(\delta + f) - (V(f) + \langle \nabla V(f), \delta \rangle)$, which is called the divergence of $V$ w.r.t. $\delta$ based at $f$.
$$\min_{\delta \in \mathbb{R}^E} V(\delta + f) - \left( V(f) + \langle \nabla V(f), \delta \rangle \right) \qquad (17.6)$$
$$\text{s.t.} \quad B\delta = \alpha' F^* b_{st} \qquad \text{``The Divergence Update Problem''}$$

Now, for any flow $\delta$ such that $B\delta = \alpha' F^* b_{st}$, using the Lagrangian gradient condition (17.4), we have $\langle \nabla V(f^*_\alpha), \delta \rangle = \langle x^*_\alpha, \alpha' F^* b_{st} \rangle$. Hence, for such $\delta$, we have
$$V(\delta + f^*_\alpha) - \left( V(f^*_\alpha) + \langle \nabla V(f^*_\alpha), \delta \rangle \right) = V(\delta + f^*_\alpha) - \left( V(f^*_\alpha) + \langle x^*_\alpha, \alpha' F^* b_{st} \rangle \right).$$
We conclude that the objectives of the Update Problem (17.5) and the Divergence Update Problem (17.6) have the same minimizer, which we denote $\delta^*_{\alpha'}$ (although, to be precise, it is also a function of $\alpha$). Thus $f^*_\alpha + \delta^*_{\alpha'}$ is optimal for the optimization problem
$$\min_{f \in \mathbb{R}^E} V(f) \quad \text{s.t.} \quad Bf = (\alpha + \alpha') F^* b_{st}. \qquad (17.7)$$

Lemma 17.1.1. Suppose $f^*_\alpha$ is the minimizer of Problem (17.2) (the Barrier Problem with parameter $\alpha$) and $\delta^*_{\alpha'}$ is the minimizer of Problem (17.6) (the Divergence Update Problem with parameters $f^*_\alpha$ and $\alpha'$), then $f^*_\alpha + \delta^*_{\alpha'}$ is optimal for Problem (17.2) with parameter $\alpha + \alpha'$ (i.e. a new instance of the Barrier Problem).

Algorithm 18: Interior Point Method
f ← 0;
α ← 0;
while α < 1 − ε do
    α′ ← (1 − α)/(20√m);
    Compute δ, the minimizer of Problem (17.6);
    Let f ← f + δ and α ← α + α′;
end
return f

Pseudotheorem 17.1.1. Let $f$ be the minimizer of Problem (17.2). Then, when $\alpha' \le \frac{1-\alpha}{20\sqrt{m}}$, the minimizer $\delta$ of Problem (17.6) can be computed in $\tilde{O}(m)$ time.

The key insight in this type of interior point method is that when the update $\alpha'$ is small enough, the minimizer of the update problem is small relative to the residual capacities, and hence can be computed quickly with Newton's method.

Theorem 17.1.2. Algorithm 18 returns a flow $f$ that is feasible for Problem (17.1) in time $\tilde{O}(m^{1.5}\log(1/\epsilon))$.

Proof Sketch. First note that for $\alpha = 0$, the minimizer of Problem (17.2) is $f = 0$. The proof now essentially follows by Lemma 17.1.1 and Pseudotheorem 17.1.1. Note that $1 - \alpha$ shrinks by a factor $\left( 1 - \frac{1}{20\sqrt{m}} \right)$ in each iteration of the while-loop, and so after $20\sqrt{m}\log(1/\epsilon)$ iterations, we have $1 - \alpha \le \epsilon$, at which point the loop terminates. To turn this into a formal proof, we need to take care of the fact that the proper theorem corresponding to Pseudotheorem 17.1.1 only gives a highly accurate but not exact solution $\delta$ to the "Update Problem". But it is possible to show that this is good enough (even though both $f$ and $\delta$ end up not being exactly optimal in each iteration).

Remark 17.1.3. For the maximum flow problem, when capacities are integral and polynomially bounded, if we choose $\epsilon = m^{-c}$ for some large enough constant $c$, then given a feasible flow with $\mathrm{val}(f) = (1 - \epsilon)F^*$, it is possible to compute an exact maximum flow in nearly linear time. Thus Theorem 17.1.2 can also be used to compute an exact maximum flow in $\tilde{O}(m^{1.5})$ time, but we omit the proof. The idea is to first round to an almost optimal, feasible integral flow (which requires a non-trivial combinatorial algorithm), and then to recover the exact flow using Ford-Fulkerson. See [Mad13] for details.

Remark 17.1.4. It is possible to reduce an instance of directed maximum flow to an instance


of undirected maximum flow in nearly-linear time, in such a way that if we can exactly solve
the undirected instance, then in nearly-linear time we can recover an exact solution to the
directed maximum flow problem. Thus Theorem (17.1.2) can also be used to solve directed
maximum flow. We will ask you to develop this reduction in Graded Homework 2.

Remark 17.1.5. For sparse graphs with $m = \tilde{O}(n)$ and large capacities, this running time is the best known, and improving it is a major open problem.

17.1.3 Understanding the Divergence Objective

Note that if $V(x) = -\log(1 - x)$, then $D(x) = V(x) - (V(0) + V'(0)x)$.

Figure 17.1: Plot showing $V(x) = -\log(1 - x)$ and the linear approximation $V(0) + V'(0)x$.

Figure 17.2: Plot showing $D(x) = V(x) - (V(0) + V'(0)x)$.

We let $c_+(e) = c(e) - f(e)$ and $c_-(e) = c(e) + f(e)$.

So then
$$D_V(\delta) = V(\delta + f) - \left( V(f) + \langle \nabla V(f), \delta \rangle \right)$$
$$= \sum_e \left[ -\log\left( \frac{c(e) - (\delta(e) + f(e))}{c(e) - f(e)} \right) - \frac{\delta(e)}{c(e) - f(e)} \right] + \left[ -\log\left( \frac{c(e) + (\delta(e) + f(e))}{c(e) + f(e)} \right) + \frac{\delta(e)}{c(e) + f(e)} \right]$$
$$= \sum_e D\left( \frac{\delta(e)}{c(e) - f(e)} \right) + D\left( -\frac{\delta(e)}{c(e) + f(e)} \right) = \sum_e D\left( \frac{\delta(e)}{c_+(e)} \right) + D\left( -\frac{\delta(e)}{c_-(e)} \right).$$

Note that we can express Problem (17.6) as
$$\min_{\delta \in \mathbb{R}^E} D_V(\delta) \qquad (17.8)$$
$$\text{s.t.} \quad B\delta = \alpha' F^* b_{st} \qquad \text{``The Update Problem, restated''}$$
Note that $D_V(\delta)$ is strictly convex over the feasible set, so the argmin is unique.
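A small sketch of the per-edge divergence and of $D_V$ itself; the capacities and flow below are made-up values, chosen so that all arguments of $D$ stay below $1$.

import numpy as np

D = lambda x: -np.log1p(-x) - x               # D(x) = -log(1 - x) - x, for x < 1

def D_V(delta, f, c):
    c_plus, c_minus = c - f, c + f            # residual capacities c_+(e), c_-(e)
    return np.sum(D(delta / c_plus) + D(-delta / c_minus))

c = np.array([2.0, 3.0])
f = np.array([0.5, -1.0])
print(D_V(np.zeros(2), f, c))                 # 0.0: the divergence is minimized at 0
print(D_V(np.array([0.3, 0.2]), f, c))        # positive for any non-zero delta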

17.1.4 Quadratically Smoothing Divergence and Local Agreement



$$\tilde{D}_\epsilon(x) = \begin{cases} -\log(1 - x) - x & \text{if } |x| \le \epsilon \\ D(\epsilon) + D'(\epsilon)(x - \epsilon) + \frac{D''(\epsilon)}{2}(x - \epsilon)^2 & \text{if } x > \epsilon \\ D(-\epsilon) + D'(-\epsilon)(x + \epsilon) + \frac{D''(-\epsilon)}{2}(x + \epsilon)^2 & \text{if } x < -\epsilon \end{cases}$$

For brevity, we define $\tilde{D}(x) = \tilde{D}_{0.1}(x)$.

Lemma 17.1.6.
1. $1/2 \le \tilde{D}''(x) \le 2$.
2. For $x \ge 0$, we have $x/2 \le \tilde{D}'(x) \le 2x$ and $-2x \le \tilde{D}'(-x) \le -x/2$.
3. $x^2/4 \le \tilde{D}(x) \le x^2$.

What’s happening here? We glue together D(x) for small x with its quadratic approximation
for |x| > . For x > , we “glue in” a Taylor series expansion based at x = .
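A direct sketch of the gluing, using $D'(x) = x/(1-x)$ and $D''(x) = 1/(1-x)^2$. Unlike $D$, the smoothed function is defined (and simply quadratic) even for $x \ge 1$.

import numpy as np

D = lambda x: -np.log1p(-x) - x
D1 = lambda x: x / (1.0 - x)                  # D'(x)
D2 = lambda x: 1.0 / (1.0 - x) ** 2           # D''(x)

def D_smooth(x, eps=0.1):
    if abs(x) <= eps:
        return D(x)
    s = eps if x > eps else -eps              # nearest gluing point
    return D(s) + D1(s) * (x - s) + 0.5 * D2(s) * (x - s) ** 2

print(D_smooth(0.05), D_smooth(5.0), D_smooth(-5.0))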

Figure 17.3: Plot showing D(x) = − log(1 − x) and the quadratic approximation based at
x = 0.1.

We also define
$$\tilde{D}_V(\delta) = \sum_e \tilde{D}\left( \frac{\delta(e)}{c_+(e)} \right) + \tilde{D}\left( -\frac{\delta(e)}{c_-(e)} \right).$$

We can now introduce the smoothed optimization problem


$$\min_{\delta \in \mathbb{R}^E} \tilde{D}_V(\delta) \qquad (17.9)$$
$$\text{s.t.} \quad B\delta = \alpha' F^* b_{st} \qquad \text{``The Smoothed Update Problem''}$$

Note that $\tilde{D}_V(\delta)$ is strictly convex over the feasible set, so the argmin is unique.

Pseudoclaim 17.1.7. We can compute the argmin $\delta^*$ of Problem (17.9), the Smoothed Update Problem, using the Newton steps for convex functions with $K$-stable Hessians that we saw in the previous chapter, in $\tilde{O}(m)$ time.

Sketch of proof. Problem (17.9) fits the class of problems for which we showed in the previous chapter that (appropriately scaled) Newton steps converge. This is true because the Hessian at any point is a 2-spectral approximation of the Hessian of $\tilde{D}_V$ at $\delta^*$, as can be shown from Lemma 17.1.6. Because the Hessian of $\tilde{D}_V(\delta)$ is diagonal, and the constraints are flow constraints, each Newton step boils down to solving a Laplacian linear system, which can be done to high accuracy in $\tilde{O}(m)$ time.

Remark 17.1.8. There are three things we need to modify to turn the pseudoclaim into a true claim, addressing the errors arising from both Laplacian solvers and Newton steps:
1. We need to rephrase the claim so that we only claim $\delta^*$ has been computed to high accuracy, rather than exactly.
2. We need to show that we can construct an initial guess $\delta_0$ to start off Newton's method for which the value $\tilde{D}_V(\delta_0)$ is not too large. (This is easy.)
3. We need to show that Newton steps converge despite using a Laplacian solver that doesn't give exact solutions, only high accuracy solutions. (This takes a bit of work, but is ultimately not too difficult.)
Importantly, to ensure our overall interior point method still works, we also need to show that it converges, even if we're using approximate solutions everywhere. This also takes some work to show, but again is not too difficult.

Local Agreement Implies Same Optimum.

Lemma 17.1.9. Suppose S ⊆ Rn is a convex set, and let f, g : S → R be convex functions.


Let x ∗ = arg minx ∈S f (x ). Suppose f, g agree on a neighborhood of x ∗ in S (i.e. an open
set containing x ∗ ). Then x ∗ = arg minx ∈S g(x ).

Proof Sketch. We sketch the proof in the case when both f, g are differentiable: Observe
that 0 = ∇f (x ∗ ) = ∇g(x ∗ ), and hence g(x ) is also minimized at x ∗ .

We define
$$\hat{c}(e) = \min(c_+(e), c_-(e)). \qquad (17.10)$$

Lemma 17.1.10. Suppose $\delta^*$ is the argmin of Problem (17.9), the Smoothed Update Problem, and $\left\| \overrightarrow{(\delta^*(e)/\hat{c}(e))} \right\|_\infty < 0.1$. Then $\delta^*$ is the argmin of Problem (17.8).

Proof. We observe that if $\left\| \overrightarrow{(\delta^*(e)/\hat{c}(e))} \right\|_\infty < 0.1$, then $\tilde{D}_V(\delta^*) = D_V(\delta^*)$, and, for all $\tau \in \mathbb{R}^m$ with norm
$$\left\| \overrightarrow{(\tau(e)/\hat{c}(e))} \right\|_\infty < 0.1 - \left\| \overrightarrow{(\delta^*(e)/\hat{c}(e))} \right\|_\infty,$$
we have that $\tilde{D}_V(\delta^* + \tau) = D_V(\delta^* + \tau)$. Thus $\tilde{D}_V$ and $D_V$ agree on a neighborhood around $\delta^*$, and hence by Lemma 17.1.9, we have that $\delta^*$ is the argmin of Problem (17.8).

17.1.5 Step size for divergence update

Definition 17.1.11 (s-t well-conditioned graph). An undirected, capacitated multi-graph $G = (V, E, c)$ with source $s$ and sink $t$ is s-t well-conditioned if, letting $U$ denote the maximum edge capacity $U = \|c\|_\infty$, we have at least $\frac{2}{3}m$ multi-edges of capacity $U$ going directly from $s$ to $t$.
Remark 17.1.12. It is straightforward to make a graph s-t well-conditioned. We just add
2m new edges of capacity U directly between s and t. Given an exact maximum flow in the
new graph, it is trivial to get one in the original graph: Just remove the flow on the new
edges.
Definition 17.1.13. Given a directed graph $G = (V, E, c)$, the symmetrization of $G$ is the undirected graph $\hat{G} = (V, \hat{E}, \hat{c})$ given by
$$\{a, b\} \in \hat{E} \text{ if } (a, b) \in E \text{ AND } (b, a) \in E,$$
and
$$\hat{c}(\{a, b\}) = \min(c(a, b), c(b, a)).$$

Note that when $\hat{G}_f$ is the symmetrization of the residual graph $G_f$ (which we defined in Chapter 11), then $\hat{c}$ matches exactly the definition of $\hat{c}$ in Equation (17.10).
Lemma 17.1.14. Let $G = (V, E, c)$ be an undirected, capacitated multi-graph which is s-t well-conditioned. Let $f$ be the minimizer of Program (17.2). Let $\hat{G}_f$ be the symmetrization of the residual graph $G_f$. Then there exists a flow $\hat{\delta}$ which satisfies $B\hat{\delta} = \frac{1-\alpha}{5} F^* b_{st}$ and is feasible in $\hat{G}_f$. Note that we can also state the feasibility in $\hat{G}_f$ as
$$\left\| \overrightarrow{(\hat{\delta}(e)/\hat{c}(e))} \right\|_\infty \le 1.$$

Proof. We recall that since $f$ is the minimizer of Program (17.2), there exist dual-optimal voltages $x$ such that
$$B^\top x = \nabla V(f) = \overrightarrow{\left( \frac{1}{c(e) - f(e)} - \frac{1}{c(e) + f(e)} \right)}.$$
From Chapter 11, we know that there is a flow $\bar{\delta}$ that is feasible with respect to the residual graph capacities of the graph $G_f$ such that $B\bar{\delta} = (1 - \alpha)F^* b_{st}$. Note that when treating $\bar{\delta}$ as an undirected flow, feasibility in the residual graph means that $\bar{\delta}(e) < c(e) - f(e)$ and $-\bar{\delta}(e) < c(e) + f(e)$. Thus,
$$(1 - \alpha)F^* b_{st}^\top x = \bar{\delta}^\top B^\top x = \sum_e \frac{\bar{\delta}(e)}{c(e) - f(e)} - \frac{\bar{\delta}(e)}{c(e) + f(e)} \le m,$$
where the last inequality holds because, for each edge, one of the two terms is less than $1$ while the other is non-positive.

Now, because the graph is s-t well-conditioned, there are at least $\frac{2}{3}m$ edges directly from $s$ to $t$ with capacity $U$, and each of these $e$ satisfies, by the Lagrangian gradient optimality condition (17.4),
$$b_{st}^\top x = \frac{1}{U - f(e)} - \frac{1}{U + f(e)}.$$

Note that $\frac{2}{3}mU \le F^* \le mU$ because the graph is s-t well-conditioned. To complete the analysis, we consider three cases.
Case 1: $|f(e)| \le \frac{2}{3}U$. Then the capacity of each of these edges in the symmetrized residual graph $\hat{G}_f$ is at least $U/3$. As there are $\frac{2}{3}m$ of them, we get that there is a feasible flow in $\hat{G}_f$ of value at least $\frac{2}{9}mU \ge \frac{1}{10}F^*$.
Case 2: $f(e) < -\frac{2}{3}U$. By the gradient condition, we have the same flow on all of the $\frac{2}{3}m$ s-t edges, adding up to more than $\frac{2}{3}m \cdot \frac{2}{3}U = \frac{4}{9}mU$ going from $t$ to $s$. This means that we must have more than $\frac{4}{9}mU$ flow going from $s$ to $t$ via the remaining edges. But their combined capacity is at most $\frac{1}{3}mU$, so that cannot happen. Thus we can rule out this case entirely.
Case 3: $f(e) > \frac{2}{3}U$. Then
$$\frac{m}{(1 - \alpha)F^*} \ge b_{st}^\top x \ge \frac{1}{U - f(e)} - \frac{1}{U + f(e)} \ge \frac{4/5}{U - f(e)}.$$
So
$$U - f(e) \ge \frac{4}{5} \cdot \frac{(1 - \alpha)F^*}{m} \ge \frac{1}{2}(1 - \alpha)U.$$
In this case, each of the $\frac{2}{3}m$ s-t edges with capacity $U$ in $G$ will have capacity at least $(1 - \alpha)U/2$ in $\hat{G}_f$. This guarantees that there is a feasible flow in $\hat{G}_f$ of value at least $\frac{1}{3}(1 - \alpha)mU \ge \frac{1}{3}(1 - \alpha)F^*$.

Lemma 17.1.15. Let $0 < \alpha' \le \frac{1-\alpha}{150\sqrt{m}}$. Then the minimizer $\delta^*$ of Problem (17.9) satisfies
$$\left\| \overrightarrow{(\delta^*(e)/\hat{c}(e))} \right\|_\infty < 0.1.$$

Proof. By Lemma 17.1.14, there exists a flow $\hat{\delta}$ which satisfies $B\hat{\delta} = \frac{1-\alpha}{5} F^* b_{st}$ and $\left\| \overrightarrow{(\hat{\delta}(e)/\hat{c}(e))} \right\|_\infty \le 1$. Hence for any $0 < \alpha' \le \frac{1-\alpha}{150\sqrt{m}}$, the flow $\tilde{\delta} = \alpha' \frac{5}{1-\alpha} \hat{\delta}$ satisfies
$$B\tilde{\delta} = \alpha' F^* b_{st} \quad \text{and} \quad \left\| \overrightarrow{(\tilde{\delta}(e)/\hat{c}(e))} \right\|_\infty \le \frac{1}{30\sqrt{m}}.$$
This means that, using $\tilde{D}(x) \le x^2$ from Lemma 17.1.6,
$$\tilde{D}_V(\tilde{\delta}) = \sum_e \tilde{D}\left( \frac{\tilde{\delta}(e)}{c_+(e)} \right) + \tilde{D}\left( -\frac{\tilde{\delta}(e)}{c_-(e)} \right) \le \sum_e \left( \frac{\tilde{\delta}(e)}{c_+(e)} \right)^2 + \left( \frac{\tilde{\delta}(e)}{c_-(e)} \right)^2 \le 2 \sum_e \left( \frac{\tilde{\delta}(e)}{\hat{c}(e)} \right)^2 \le \frac{2}{900} < \frac{1}{400}.$$
This then means that the minimizer $\delta^*$ of Problem (17.9) also satisfies $\tilde{D}_V(\delta^*) < 1/400$. Then, using $x^2 \le 4\tilde{D}(x)$, again by Lemma 17.1.6,
$$\left\| \overrightarrow{(\delta^*(e)/\hat{c}(e))} \right\|_\infty^2 \le \sum_e \left( \frac{\delta^*(e)}{c_+(e)} \right)^2 + \left( \frac{\delta^*(e)}{c_-(e)} \right)^2 \le 4 \sum_e \tilde{D}\left( \frac{\delta^*(e)}{c_+(e)} \right) + \tilde{D}\left( -\frac{\delta^*(e)}{c_-(e)} \right) = 4\tilde{D}_V(\delta^*) < \frac{1}{100}.$$
Hence $\left\| \overrightarrow{(\delta^*(e)/\hat{c}(e))} \right\|_\infty < 0.1$.

Chapter 18

Distance Oracles

In this chapter, we learn about distance oracles as presented in the seminal paper [TZ05].
Distance oracles are data structures that allow any undirected graph G = (V, E) to be stored compactly in a format that allows querying for the (approximate) distance between any two vertices u, v in the graph. The main result of this chapter is the following data structure.

Theorem 18.0.1. There is an algorithm that, for any integer $k \ge 1$ and undirected graph $G = (V, E)$, computes a data structure that can be stored using $\tilde{O}(kn^{1+1/k})$ bits such that, on querying any two vertices $u, v \in V$, it returns in $O(k)$ time a distance estimate $\widetilde{\mathrm{dist}}(u, v)$ such that
$$\mathrm{dist}(u, v) \le \widetilde{\mathrm{dist}}(u, v) \le (2k - 1) \cdot \mathrm{dist}(u, v).$$
The algorithm computes the data structure in expected time $\tilde{O}(kmn^{1/k})$.

Remark 18.0.2. Note that for k = 1, the theorem above is trivial: it can be solved by
computing APSP and storing the distance matrix of G.

Remark 18.0.3. We point out that, given space $O(n^{1+1/k})$, approximation $(2k - 1)$ is the best that we can hope for, according to a popular and widely believed conjecture that essentially says that there are unweighted graphs that have no cycle of length at most $2k + 1$ but have $\tilde{\Omega}(n^{1+1/k})$ edges. A more careful analysis than we will carry out allows one to shave all logarithmic factors from Theorem 18.0.1, and therefore the data structure is only a factor $k$ off in space from optimal while also answering queries extremely efficiently. It turns out that the factor $k$ can also be removed from the space and query time (although currently the preprocessing is quite expensive); see the (rather involved) articles [Che14, Che15].

Remark 18.0.4. Also note that in directed graphs no such distance oracle is possible. Even the transitive closure (the information of who reaches whom) can only be preserved if one stores $\tilde{\Omega}(n^2)$ bits.

18.1 Warm-up: A Distance Oracle for k = 2
Let us first describe the data structure for the case where $k = 2$; see the pseudo-code below. Here we use the convention that $\mathrm{dist}(x, X)$, for some vertex $x \in V$ and some subset $X \subseteq V$, is the minimum distance from $x$ to any $y \in X$, formally $\mathrm{dist}(x, X) = \min_{y \in X} \mathrm{dist}(x, y)$.

Algorithm 19: Preprocess(G)


Obtain S by sampling every vertex v ∈ V i.i.d. with probability n−1/2 ;
foreach s ∈ S do
Compute all distances from s to any other vertex v ∈ V ;
In a hash table Hs store for each v ∈ V an entry with key v and value distG (s, v).
end
foreach u ∈ V \ S do
Find the pivot p(u) of u to be some vertex in S that minimizes the distance to u;
Store p(u) along with distG (u, p(u)) = distG (u, S);
Find the bunch B(u) = {v ∈ V | distG (u, v) < distG (u, S)};
In a hash table Hu store for each v ∈ B(u) an entry with key v and value distG (u, v).
end

The key to the algorithm is the definition of pivots and bunches; the example below illustrates their definitions. Without further ado, let us discuss the query procedure, which is depicted below. It essentially consists of checking whether the vertex v is already in the bunch of u, in which case we have stored the distance distG (u, v) explicitly. Otherwise, it uses a detour via the pivot of u.

Algorithm 20: Query(u,v)


if v ∈ Hu then return the value distG (u, v) stored in Hu ;
return distG (u, p(u)) + distG (p(u), v) (the latter from Hp(u) )

The second case is illustrated below.

Approximation Analysis. It is straight-forward to see that if we return in line 1 of the


query algorithm, then we return the exact distance.
If we return in the second line, then we know that $v \notin B(u)$. By definition of a bunch this implies that
$$\mathrm{dist}(u, v) \ge \mathrm{dist}(u, S) = \mathrm{dist}(u, p(u)).$$
We can further use the triangle inequality to conclude that
$$\mathrm{dist}_G(p(u), v) \le \mathrm{dist}(u, p(u)) + \mathrm{dist}(u, v).$$
Figure 18.1: Graph G (without the edges) where vertices are drawn according to their distance from u. The blue and red vertices are in S. The red vertex is chosen to be the pivot p(u). Note that another vertex in S could have been chosen to become the pivot. The bunch of u consists of the vertices that are strictly within the red circle. In particular, both blue vertices, the red vertex, and also the white vertex on the boundary of the red circle are not in the bunch B(u).

Combining these two inequalities, we obtain
$$\mathrm{dist}_G(u, p(u)) + \mathrm{dist}_G(p(u), v) \le 2 \cdot \mathrm{dist}(u, p(u)) + \mathrm{dist}(u, v) \le 3 \cdot \mathrm{dist}(u, v).$$
We conclude that we obtain a 3-approximation.
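As a small illustration, the query logic can be phrased in a few lines, assuming the preprocessing above has filled dictionaries H[u] (the stored bunch and sample distances), p[u] (pivots), and pd[u] = distG(u, p(u)); these names are hypothetical.

def query(u, v, H, p, pd):
    if v in H[u]:                  # v lies in the bunch of u: exact distance stored
        return H[u][v]
    return pd[u] + H[p[u]][v]      # detour via the pivot: at most 3 * dist(u, v)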

Space Analysis. For each vertex $s \in S$, we store a hash-table with one entry for each vertex $v$. This can be stored with space $O(|S|n)$. We have by a standard Chernoff bound that $|S| = \tilde{O}(\sqrt{n})$ w.h.p., so this becomes $\tilde{O}(n^{3/2})$.
Next, fix some $u \in V \setminus S$, and let us argue about the size of $B(u)$ (which asymptotically matches $|H_u|$). We order the vertices $v_1, v_2, \ldots, v_n$ in $V$ by their distance from $u$. Since we sample uniformly at random, by a simple Chernoff bound, we obtain that the first vertex $v_i$ in $v_1, v_2, \ldots, v_n$ that is in $S$ has $i = \tilde{O}(\sqrt{n})$ w.h.p. But note that since only vertices that are strictly closer to $u$ than $v_i$ are in $B(u)$, this implies that $|B(u)| = \tilde{O}(\sqrt{n})$.
It remains to take a careful union bound over all bad events at every vertex $u \in V \setminus S$ to conclude that with high probability, the hash tables $H_u$ for all $u \in V \setminus S$ combined take total space $\tilde{O}(|V \setminus S| \cdot \sqrt{n}) = \tilde{O}(n^{3/2})$. For the rest of the section, we condition on the event that each $|B(v)| = \tilde{O}(\sqrt{n})$, and use it as if it were a deterministic guarantee.

Preprocessing and Query Time. In order to find the distances stored in the hash tables $H_s$ for $s \in S$, we can simply run Dijkstra from each $s \in S$ on the graph, in total time $\tilde{O}(m|S|) = \tilde{O}(mn^{1/2})$.
To compute the pivots for each vertex $u$, we can insert a super-vertex $s_0$ and add an edge from $s_0$ to each $s \in S$ of weight $0$. We can then run Dijkstra from $s_0$ on $G \cup \{s_0\}$. Note that for each $u$, we have $\mathrm{dist}_{G \cup \{s_0\}}(s_0, u) = \mathrm{dist}_G(p(u), u)$, and that $p(u)$ can be chosen to be the first vertex after $s_0$ on the shortest path from $s_0$ to $u$ that Dijkstra outputs along with the distances (recall that Dijkstra can output a shortest path tree). This takes $\tilde{O}(m)$ time.
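Equivalently, instead of materializing $s_0$, one can run a multi-source Dijkstra that carries a source label. The sketch below assumes the graph is given as an adjacency dict g[u] = [(v, w), ...] with comparable (e.g. integer) vertex ids, and returns dist(u, S) and a pivot for every vertex reachable from S.

import heapq

def pivots(g, S):
    dist, piv = {}, {}
    pq = [(0.0, s, s) for s in S]            # entries (distance, vertex, pivot label)
    heapq.heapify(pq)
    while pq:
        d, u, s = heapq.heappop(pq)
        if u in dist:
            continue
        dist[u], piv[u] = d, s               # first pop fixes dist(u, S) and p(u)
        for v, w in g.get(u, []):
            if v not in dist:
                heapq.heappush(pq, (d + w, v, s))
    return dist, piv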
It remains to compute the bunches $B(u)$. Here, we use duality: we define the cluster $C(w)$ for every vertex $w \in V \setminus S$ to be the set
$$C(w) = \{ v \in V \mid \mathrm{dist}_G(v, w) < \mathrm{dist}_G(v, p(v)) \}.$$
Note the subtle difference to the bunches: membership of $v$ now depends on $p(v)$ and not on $p(u)$! It is not hard to see that $u \in C(w) \iff w \in B(u)$. And it is straight-forward to compute the bunches from the clusters in time $O(\sum_v |B(v)|) = O(\sum_w |C(w)|) = \tilde{O}(n^{3/2})$.
Finally, it turns out that we can compute each C(w) by running Dijkstra with a small
modification.
Lemma 18.1.1 (Lemma 4.2 in [TZ05]). Consider running Dijkstra from a vertex $w$, but only relaxing edges incident to vertices $v$ that satisfy $\mathrm{dist}_G(v, w) < \mathrm{dist}_G(v, p(v))$. Then the algorithm computes $C(w)$ and all distances $\mathrm{dist}(v, w)$ for $v \in C(w)$ in time $\tilde{O}(|E(C(w))|)$, where $E(C(w))$ are the edges that touch a vertex in $C(w)$.

It remains to observe that the total time required to compute all clusters is
$$\tilde{O}\left( \sum_w |E(C(w))| \right) = \tilde{O}\left( \sum_{w,\, v \in C(w)} |E(v)| \right) = \tilde{O}\left( \sum_{v,\, w \in B(v)} |E(v)| \right) = \tilde{O}\left( \sum_v |E(v)||B(v)| \right).$$
But we have upper bounded $|B(v)| = \tilde{O}(\sqrt{n})$ for all $v$, thus each vertex just pays its degree $\tilde{O}(\sqrt{n})$ times and we get running time $\tilde{O}(m\sqrt{n})$.

Monte Carlo vs. Las Vegas. Note that the analysis above only guarantees that the algorithm works well with high probability. However, it is not hard to see that the algorithm can be transformed into a Las Vegas algorithm: whenever we find a bunch $B(v)$ whose size exceeds our $\tilde{O}(\sqrt{n})$ bound, we simply re-run the algorithm. This guarantees that the final data structure that we output indeed satisfies the guarantees stipulated in the theorem.

18.2 Distance Oracles for any k ≥ 2


The generalization to all $k \ge 2$ is rather straight-forward, except for the query operation, which works a bit magically. Let us first define the data structure by giving our new pre-processing algorithm.

Algorithm 21: Preprocess(G)


S1 = V ; Sk+1 = ∅;
foreach i ∈ [1, k] do Obtain Si+1 by sampling every v ∈ Si i.i.d. with prob. n−1/k ;
foreach u ∈ V do
foreach i ∈ [1, k] do
Let pi (u) be some vertex in Si that minimizes the distance to u;
Store pi (u) along with distG (u, pi (u)).
end
Let bunch B(u) = ∪i Bi(u) where Bi(u) = {v ∈ Si | distG(u, v) < distG(u, Si+1)};
In a hash table Hu store for each v ∈ B(u) an entry with key v and value
distG (u, v).
end

Note that in a sense this algorithm is almost simpler than the one for $k = 2$, since it treats each level in the same fashion. Here we ensure that the last set $S_{k+1}$ is empty (otherwise this would only hold with constant probability).
We make the implicit assumption throughout that $S_k \ne \emptyset$, so that $p_k(u)$ is well-defined for each $u$. We also define $\mathrm{dist}(x, X) = \infty$ if $X$ is the empty set.

The drawing below illustrates the new definition of a bunch where we have chosen k = 3 to
keep things simple.

Figure 18.2: Graph G (without the edges) where vertices are drawn according to their
distance from u. All vertices are in S1 . The blue and red vertices are in S2 . The dark blue
and dark red vertices are also in S3 . Finally, we have S4 = ∅. The light red vertex is chosen
as pivot p2 (u); the dark red vertex is chosen as pivot p3 (u). The bunch B(u) includes all
vertices that have a black edge in this graph to u; in particular these are the white vertices
within the circle drawn in light red (B1 (u)); the light blue vertices encircled by the dark red
circle (B2 (u)); and all dark blue vertices (B3 (u)).

Before we explain the query procedure, let us give a (rather informal) analysis of the space
required by our new data structure.

Space Analysis. We have for each vertex u ∈ V , and each 1 ≤ i ≤ k that the bunch
Bi (u) consists of all vertices in Si that are closer to u than the closest vertex in Si+1 .

Now, order the vertices $x_1, x_2, \ldots$ in $S_i$ by their distance from $u$. Since each vertex in $S_i$ is sampled into $S_{i+1}$ with probability $n^{-1/k}$, we have that with high probability some vertex $x_j$ with $j = O(n^{1/k}\log n)$ is sampled into $S_{i+1}$. This ensures that $|B_i(u)| = \tilde{O}(n^{1/k})$ with high probability.
probability.
Applying this argument for all i, we have that |B(u)| = Õ(k · n1/k ) for each u w.h.p. and
therefore our space bound follows.

Preprocessing Time. Much like in the k = 2 construction, we can define for each u ∈
Si \ Si+1 the cluster C(u) = {v ∈ V | distG (u, v) < distG (u, Si+1 )}. Extending our analysis
from before using this new definition, we get construction time Õ(kmn1/k ).

Query Operation. A straight-forward way to query our new data structure for a tuple $(u, v)$ would be to search for the smallest $i$ such that $v \in B(p_i(u))$ and then return $\mathrm{dist}_G(u, p_i(u)) + \mathrm{dist}_G(p_i(u), v)$. This can be analyzed in the same way as we did for $k = 2$ to obtain stretch $4k - 3$. (One can actually prove that this strategy gives a $4k - 5$ stretch with a little trick.)
However, we aim for stretch approximation $2k - 1$. We state below the pseudo-code to achieve this guarantee.

Algorithm 22: Query(u,v)


w ← u; i ← 1;
while w 6∈ B(v) do
i ← i + 1;
(u, v) ← (v, u);
w ← pi (u)
end
return distG (u, w) + distG (w, v)
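The same procedure in Python, as a sketch: it assumes dicts H[u] (bunch distances) and pivot tables p[i][u] and pd[i][u] = distG(u, p_i(u)) filled by the preprocessing, with the convention p[1][u] = u (since S_1 = V), so that pd[1][u] = 0.

def query(u, v, H, p, pd):
    w, i = u, 1
    while w not in H[v]:
        i += 1
        u, v = v, u                # swap the roles of u and v
        w = p[i][u]                # hop to the next pivot
    return pd[i][u] + H[v][w]      # dist(u, w) + dist(w, v)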

Our main tool in the analysis is the claim below where we define ∆ = distG (u, v).
Claim 18.2.1. After the ith iteration of the while-loop, we have distG (u, w) ≤ i∆.

This implies our theorem: since we have at most $k - 1$ iterations, the final $w$ is a vertex in $S_k$, and $S_k \subseteq B(x)$ for all vertices $x \in V$. Therefore, for the final $w$, we have $\mathrm{dist}_G(u, w) \le (k - 1)\Delta$. It remains to conclude by the triangle inequality that
$$\mathrm{dist}_G(u, w) + \mathrm{dist}_G(w, v) \le 2\,\mathrm{dist}_G(u, w) + \Delta \le (2k - 1)\,\mathrm{dist}_G(u, v).$$

Proof of Claim 18.2.1. Let $w_i, u_i, v_i$ denote the variables $w, u, v$ after the $i$th while-loop iteration (or right before the loop for $w_0, u_0, v_0$).
For $i = 0$, we have that $w_0 = u_0$; thus, $\mathrm{dist}_G(u_0, w_0) = 0$.

For $i \ge 1$, we want to prove that if the $i$th while-loop iteration is executed, then $\mathrm{dist}_G(u_i, w_i) \le \mathrm{dist}_G(u_{i-1}, w_{i-1}) + \Delta$ (if it is not executed, then the statement follows trivially).
In order to prove this, observe that by the while-loop condition, we must have had $w_{i-1} \notin B(v_{i-1})$, thus $\mathrm{dist}_G(v_{i-1}, w_{i-1}) \ge \mathrm{dist}_G(v_{i-1}, p_i(v_{i-1}))$.
But the while-iteration sets $u_i = v_{i-1}$ and $w_i = p_i(v_{i-1})$, and therefore we have
$$\mathrm{dist}_G(u_i, w_i) = \mathrm{dist}_G(v_{i-1}, p_i(v_{i-1})) \le \mathrm{dist}_G(v_{i-1}, w_{i-1}) = \mathrm{dist}_G(v_{i-1}, p_{i-1}(u_{i-1}))$$
$$\le \mathrm{dist}_G(u_{i-1}, p_{i-1}(u_{i-1})) + \mathrm{dist}_G(v_{i-1}, u_{i-1}) = \mathrm{dist}_G(u_{i-1}, w_{i-1}) + \Delta.$$

Bibliography

[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge


university press, 2004.

[Che14] Shiri Chechik. Approximate distance oracles with constant query time. In
Proceedings of the forty-sixth annual ACM symposium on Theory of computing,
pages 654–663, 2014.

[Che15] Shiri Chechik. Approximate distance oracles with improved bounds. In Pro-
ceedings of the forty-seventh annual ACM symposium on Theory of Computing,
pages 1–10, 2015.

[DO19] Jelena Diakonikolas and Lorenzo Orecchia. Conjugate gradients and accel-
erated methods unified: The approximate duality gap view. arXiv preprint
arXiv:1907.00289, 2019.

[DS08] Samuel I. Daitch and Daniel A. Spielman. Faster approximate lossy generalized
flow via interior point algorithms. In Proceedings of the Fortieth Annual ACM
Symposium on Theory of Computing, STOC ’08, pages 451–460, New York,
NY, USA, May 2008. Association for Computing Machinery.

[HS+ 52] Magnus R Hestenes, Eduard Stiefel, et al. Methods of conjugate gradients for
solving linear systems. Journal of research of the National Bureau of Standards,
49(6):409–436, 1952.

[KS16] R. Kyng and S. Sachdeva. Approximate gaussian elimination for laplacians -


fast, sparse, and simple. In 2016 IEEE 57th Annual Symposium on Foundations
of Computer Science (FOCS), pages 573–582, 2016.

[Lan52] Cornelius Lanczos. Solution of systems of linear equations by minimized itera-


tions. J. Res. Nat. Bur. Standards, 49(1):33–53, 1952.

[LS20a] Yang P. Liu and Aaron Sidford. Faster Divergence Maximization for Faster
Maximum Flow. arXiv:2003.08929 [cs, math], April 2020.

[LS20b] Yang P. Liu and Aaron Sidford. Faster energy maximization for faster maximum
flow. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory
of Computing, pages 803–814, 2020.

[Mad13] Aleksander Madry. Navigating central path with electrical flows: From flows to
matchings, and back. In 2013 IEEE 54th Annual Symposium on Foundations
of Computer Science, pages 253–262. IEEE, 2013.

[Nes83] Y. E. Nesterov. A method for solving the convex programming problem with
convergence rate o(1/k 2 ). Dokl. Akad. Nauk SSSR, 269:543–547, 1983.

[NY83] A Nemirovski and D Yudin. Information-based complexity of mathematical


programming. Izvestia AN SSSR, Ser. Tekhnicheskaya Kibernetika (the jour-
nal is translated to English as Engineering Cybernetics. Soviet J. Computer &
Systems Sci.), 1, 1983.

[Spi19] Daniel A Spielman. Spectral and Algebraic Graph Theory, 2019.

[SS11] Daniel A Spielman and Nikhil Srivastava. Graph sparsification by effective


resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.

[ST04] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for
graph partitioning, graph sparsification, and solving linear systems. In Pro-
ceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing,
STOC ’04, page 81–90, New York, NY, USA, 2004. Association for Computing
Machinery.

[T+ 15] Joel A Tropp et al. An introduction to matrix concentration inequalities. Foun-
dations and Trends® in Machine Learning, 8(1-2):1–230, 2015.

[Tar83] Robert Endre Tarjan. Data structures and network algorithms. SIAM, 1983.

[Tro19] Joel A Tropp. Matrix concentration & computational linear algebra. 2019.

[TZ05] Mikkel Thorup and Uri Zwick. Approximate distance oracles. Journal of the
ACM (JACM), 52(1):1–24, 2005.

[vdBLL+ 21] Jan van den Brand, Yin Tat Lee, Yang P. Liu, Thatchaphol Saranurak,
Aaron Sidford, Zhao Song, and Di Wang. Minimum Cost Flows, MDPs, and
$$\backslash$ell 1 $-Regression in Nearly Linear Time for Dense Instances.
arXiv preprint arXiv:2101.05719, 2021.

