Notes on Randomized Algorithms
James Aspnes
2024-07-27 23:04
Table of contents
List of figures
Preface
1 Randomized algorithms
1.1 Searching an array
1.2 Verifying polynomial identities
1.3 Randomized QuickSort
1.3.1 Brute force method: solve the recurrence
1.3.2 Clever method: use linearity of expectation
1.4 Where does the randomness come from?
1.5 Classifying randomized algorithms
1.5.1 Las Vegas vs Monte Carlo
1.5.2 Randomized complexity classes
1.6 Classifying randomized algorithms by their methods
2 Probability theory
2.1 Probability spaces and events
2.1.1 General probability spaces
2.2 Boolean combinations of events
2.3 Conditional probability
2.3.1 Conditional probability and independence
2.3.2 Conditional probability and the law of total probability
2.3.3 Examples
3 Random variables
3.1 Operations on random variables
3.2 Random variables and events
3.3 Measurability
3.4 Expectation
3.4.1 Linearity of expectation
3.4.1.1 Linearity of expectation for infinite sequences
3.4.2 Expectation and inequalities
3.4.3 Expectation of a product
3.4.3.1 Wald's equation (simple version)
3.5 Conditional expectation
3.5.1 Expectation conditioned on an event
3.5.2 Expectation conditioned on a random variable
3.5.2.1 Calculating conditional expectations
3.5.2.2 The law of iterated expectation
3.5.2.3 Conditional expectation as orthogonal projection
3.5.3 Expectation conditioned on a σ-algebra
3.5.4 Examples
3.6 Applications
3.6.1 Yao's lemma
3.6.2 Geometric random variables
3.6.3 Coupon collector
3.6.4 Hoare's FIND
5 Concentration bounds
5.1 Chebyshev's inequality
5.1.1 Computing variance
5.1.1.1 Alternative formula
5.1.1.2 Variance of a Bernoulli random variable
5.1.1.3 Variance of a sum
5.1.1.4 Variance of a geometric random variable
5.1.2 More examples
5.1.2.1 Flipping coins
5.1.2.2 Balls in bins
5.1.2.3 Lazy select
5.2 Chernoff bounds
5.2.1 The classic Chernoff bound
5.2.2 Easier variants
5.2.3 Lower bound version
5.2.4 Two-sided version
5.2.5 What if we only have a bound on E[S]?
5.2.6 Almost-independent variables
5.2.7 Other tail bounds for the binomial distribution
5.2.8 Applications
5.2.8.1 Flipping coins
5.2.8.2 Balls in bins again
5.2.8.3 Flipping coins, central behavior
5.2.8.4 Permutation routing on a hypercube
5.3 The Azuma-Hoeffding inequality
5.3.1 Hoeffding's inequality
5.3.1.1 Hoeffding vs Chernoff
5.3.1.2 Asymmetric version
5.3.2 Azuma's inequality
5.3.3 The method of bounded differences
5.3.4 Applications
5.3.4.1 Sprinkling points on a hypercube
5.3.4.2 Chromatic number of a random graph
5.3.4.3 Balls in bins
5.3.4.4 Probabilistic recurrence relations
5.3.4.5 Multi-armed bandits
The UCB1 algorithm
Analysis of UCB1
5.4 Relation to limit theorems
5.5 Anti-concentration bounds
5.5.1 The Berry-Esseen theorem
5.5.2 The Littlewood-Offord problem
7 Hashing
7.1 Hash tables
7.2 Universal hash families
7.2.1 Linear congruential hashing
7.2.2 Tabulation hashing
7.3 FKS hashing
7.4 Cuckoo hashing
7.4.1 Structure
7.4.2 Analysis
7.5 Practical issues
7.6 Bloom filters
7.6.1 Construction
7.6.2 False positives
7.6.3 Comparison to optimal space
7.6.4 Applications
7.6.5 Counting Bloom filters
7.7 Data stream computation
7.7.1 Cardinality estimation
7.7.2 Count-min sketches
7.7.2.1 Initialization and updates
7.7.2.2 Queries
7.7.2.3 Finding heavy hitters
14 Derandomization
14.1 Deterministic vs. randomized algorithms
14.2 Adleman's theorem
14.3 Limited independence
14.3.1 MAX CUT
14.4 The method of conditional probabilities
14.4.1 MAX CUT using conditional probabilities
14.4.2 Deterministic construction of Ramsey graphs
14.4.3 Derandomized set balancing
Bibliography
Index
Preface
These are notes for the Yale course CPSC 469/569 Randomized Algorithms.
This document also incorporates the lecture schedule and assignments, as
well as some sample assignments from previous semesters. Because this
is a work in progress, it will be updated frequently over the course of the
semester.
Much of the structure of the course follows Mitzenmacher and Upfal's
Probability and Computing [MU17], with some material from Motwani and
Raghavan’s Randomized Algorithms [MR95]. In most cases you’ll find these
textbooks contain much more detail than what is presented here, so it is
probably better to consider this document a supplement to them than to
treat it as your primary source of information.
The most recent version of these notes will be available at
https://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf. More stable
archival versions may be found at https://arxiv.org/abs/2003.01902.
I would like to thank my many students and teaching fellows over the
years for their help in pointing out errors and omissions in earlier drafts of
these notes.
Chapter 1
Randomized algorithms
In this case, the bad input distribution is simple: put the 1 in each
position A[i] with equal probability 1/n. For a deterministic algorithm, there
will be some fixed sequence of positions i_1, i_2, . . . that it examines as long as
it only sees zeros. A smart deterministic algorithm will not examine the same
position twice, so the 1 is equally likely to be found after 1, 2, 3, . . . , n probes.
This gives the same expected (n + 1)/2 probes as for the simple randomized
algorithm, which shows that that algorithm is optimal.
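Both claims are easy to check empirically. Here is a minimal Python sketch (an illustration, not part of the notes; the names are invented) that places the 1 uniformly at random and counts probes for a left-to-right scan; the average comes out near (n + 1)/2:

    import random

    def probes_to_find(n):
        """Hide the single 1 uniformly at random and scan left to right,
        returning the number of probes until it is found."""
        target = random.randrange(n)
        for probes, i in enumerate(range(n), start=1):
            if i == target:
                return probes

    n, trials = 100, 100_000
    avg = sum(probes_to_find(n) for _ in range(trials)) / trials
    print(avg, (n + 1) / 2)   # both close to 50.5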
We’ve been talking about searching an array, because that fits best in
our model of an input supplied to the algorithm, but essentially the same
analysis applies to brute-force inverting a black-box function. Here we have
a function f and target output y, and we want to find an input x such that
f (x) = y. The same analysis as for the array case shows that this takes n+1 2
expected evaluations of f assuming that exactly one x works and we can’t
do anything clever.
Curiously, in this case it may be possible to improve this bound to O(√n)
evaluations if somehow we get our hands on a working quantum computer.
We’ll come back to this when we discuss quantum computing in general and
Grover’s algorithm in particular in Chapter 16.
get a root. Indeed, evaluating p(11) = 112320 and q(11) = 120306 quickly
shows that p and q are not in fact the same.
This is an example of a Monte Carlo algorithm, which is an algorithm
that runs in a fixed amount of time but only gives the right answer some of
the time. (In this case, with probability 1 − d/r, where r is the size of the
range of random integers we choose x from.) Monte Carlo algorithms have
the unnerving property of not indicating when their results are incorrect,
but we can make the probability of error as small as we like by running
the algorithm repeatedly. For this particular algorithm, the probability of
error after k trials is only (d/r)^k, which means that for fixed d/r we need
O(log(1/ε)) iterations to get the error bound down to any given ε. If we are
really paranoid, we could get the error down to 0 by testing d + 1 distinct
values, but now the cost is as high as multiplying out p again.
The error for this algorithm is one-sided: if we find a witness to the fact
that p ≠ q, we are done, but if we don't, then all we know is that we haven't
found a witness yet. We also have the property that if we check enough
possible witnesses, we are guaranteed to find one.
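As a concrete illustration (this sketch and its parameter choices, such as the sample range r = 100d, are assumptions for the example, not part of the notes), here is the test in Python with polynomials given as coefficient lists:

    import random

    def evaluate(coeffs, x):
        """Evaluate a polynomial with coefficients [a0, a1, ...] by Horner's rule."""
        result = 0
        for c in reversed(coeffs):
            result = result * x + c
        return result

    def probably_equal(p, q, trials=10):
        """One-sided Monte Carlo test for p = q; errs with probability
        at most (d/r)**trials when the polynomials differ."""
        d = max(len(p), len(q)) - 1     # bound on the degree
        r = 100 * max(d, 1)             # sample range (an arbitrary choice)
        for _ in range(trials):
            x = random.randrange(1, r + 1)
            if evaluate(p, x) != evaluate(q, x):
                return False            # witness found: definitely unequal
        return True                     # no witness found: probably equal

    print(probably_equal([1, 2, 1], [1, 2, 1]))   # True
    print(probably_equal([1, 2, 1], [1, 2, 2]))   # False almost surely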
A similar property holds in the classic Miller-Rabin primality test,
a randomized algorithm for determining whether a large integer is prime
or not.4 The original version, due to Gary Miller [Mil76], showed that, as
in polynomial identity testing, it might be sufficient to pick a particular
set of deterministic candidate witnesses. Unfortunately, this result depends
on the truth of the extended Riemann hypothesis, a notoriously difficult
open problem in number theory. Michael Rabin [Rab80] demonstrated that
choosing random witnesses was enough, if we were willing to accept a small
probability of incorrectly identifying a composite number as prime.
For many years it was open whether it was possible to test primality
deterministically in polynomial time without unproven number-theoretic
assumptions, and the randomized Miller-Rabin algorithm was one of the
most widely-used randomized algorithms for which no good deterministic
alternative was known. Eventually, Agrawal et al. [AKS04] demonstrated
how to test primality deterministically using a different technique, although
the cost of their algorithm is high enough that Miller-Rabin is still used in
practice.
4 We will not describe this algorithm here.
The expected number of comparisons T(n) for randomized QuickSort
satisfies the recurrence

T(n) = (n − 1) + (1/n) ∑_{k=0}^{n−1} (T(k) + T(n − 1 − k)).    (1.3.1)

Guessing that T(k) ≤ ak log k for some constant a, we can expand:

T(n) = (n − 1) + (1/n) ∑_{k=0}^{n−1} (T(k) + T(n − 1 − k))
     = (n − 1) + (2/n) ∑_{k=0}^{n−1} T(k)
     = (n − 1) + (2/n) ∑_{k=1}^{n−1} T(k)
     ≤ (n − 1) + (2/n) ∑_{k=1}^{n−1} ak log k
     ≤ (n − 1) + (2/n) ∫_1^n ak log k dk
     = (n − 1) + (2a/n) (n² log n/2 − n²/4 + 1/4)
     = (n − 1) + an log n − an/2 + a/(2n).

If we squint carefully at this recurrence for a while we notice that setting
a = 2 makes this less than or equal to an log n, since the remaining terms
become (n − 1) − n + 1/n = 1/n − 1, which is nonpositive for n ≥ 1. We can
thus confidently conclude that T(n) ≤ 2n log n (for n ≥ 1).
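For reference, here is a minimal Python sketch of randomized QuickSort itself (an illustrative rendering; the notes do not fix a particular implementation), whose expected comparison count is the T(n) analyzed above:

    import random

    def quicksort(a):
        """Randomized QuickSort: uniform random pivot, three-way partition."""
        if len(a) <= 1:
            return a
        pivot = random.choice(a)
        less = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        greater = [x for x in a if x > pivot]
        return quicksort(less) + equal + quicksort(greater)

    print(quicksort([5, 3, 1, 4, 2]))   # [1, 2, 3, 4, 5]

Choosing the pivot with random.choice is what makes the analysis go through regardless of the input ordering.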
recurrence in §1.3.1. Given that we now know the exact answer, we could in
principle go back and use it to solve the recurrence exactly.5
Which way is better? Solving the recurrence requires less probabilistic
handwaving (a more polite term might be “insight”) but more grinding
out inequalities, which is a pretty common trade-off. Since I am personally
not very clever I would try the brute-force approach first. But it’s worth
knowing about better methods so you can try them in other situations.
produce the wrong answer. These are examples of two classes of randomized
algorithms, which were originally named by László Babai [Bab79]:7
• A Las Vegas algorithm fails with some probability, but we can tell
when it fails. In particular, we can run it again until it succeeds, which
means that we can eventually succeed with probability 1 (but with a
potentially unbounded running time). Alternatively, we can think of a
Las Vegas algorithm as an algorithm that runs for an unpredictable
amount of time but always succeeds (we can convert such an algorithm
back into one that runs in bounded time by declaring that it fails if it
runs too long—a condition we can detect). QuickSort is an example of
a Las Vegas algorithm.
• A Monte Carlo algorithm fails with some probability, but we can’t
tell when it fails. If the algorithm produces a yes/no answer and the
failure probability is significantly less than 1/2, we can reduce the
probability of failure by running it many times and taking a majority of
the answers. The polynomial equality-testing algorithm is an example
of a Monte Carlo algorithm.
The heuristic for remembering which class is which is that the names
were chosen to appeal to English speakers: in Las Vegas, the dealer can tell
you whether you've won or lost, but in Monte Carlo, le croupier ne parle que
français (the croupier speaks only French), so you have no idea what he's saying.
Generally, we prefer Las Vegas algorithms, because we like knowing
when we have succeeded. But sometimes we have to settle for Monte Carlo
algorithms, which can still be useful if we can get the probability of failure
small enough. For example, any time we try to estimate an average by
sampling (say, inputs to a function we are trying to integrate or political
views of voters we are trying to win over) we are running a Monte Carlo
algorithm: there is always some possibility that our sample is badly non-
representative, but we can’t tell if we got a bad sample unless we already
know the answer we are looking for.
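The amplification trick for yes/no Monte Carlo algorithms is easy to express in code. The following Python sketch assumes a toy base test that is correct with probability 2/3 (an invented stand-in for a real Monte Carlo algorithm) and takes a majority vote over independent runs:

    import random
    from collections import Counter

    def amplify(algorithm, x, runs=101):
        """Take the majority answer of many independent runs of a yes/no
        Monte Carlo algorithm; if each run is correct with probability
        p > 1/2, the majority errs with probability exponentially small
        in the number of runs."""
        votes = Counter(algorithm(x) for _ in range(runs))
        return votes.most_common(1)[0][0]

    def noisy_is_even(x):
        # Toy stand-in: correct with probability 2/3 (an assumption).
        truth = (x % 2 == 0)
        return truth if random.random() < 2 / 3 else not truth

    print(amplify(noisy_is_even, 42))   # True, with overwhelming probability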
these algorithms, we never get a bogus “yes” answer but may get a bogus
“no” answer (or vice versa). This gives us several complexity classes that act
like randomized versions of NP, co-NP, etc.:
• The class co-R consists of all languages L for which a poly-time Turing
machine M exists such that if x ∉ L, then Pr [M (x, r) = 1] ≥ 1/2 and
if x ∈ L, then Pr [M (x, r) = 1] = 0. This is the randomized analog of
co-NP.
Chapter 2

Probability theory
2. Pr [Ω] = 1.
It’s not hard to see that the discrete probability spaces defined in the
preceding section satisfy these axioms.
General probability spaces arise in randomized algorithms when we have
an algorithm that might consume an unbounded number of random bits.
The problem now is that an outcome consists of a countable sequence of bits,
and there are uncountably many such outcomes. The solution is to consider
as measurable events only those sets with the property that membership
in them can be determined after a finite amount of time. Formally, the
probability space Ω is the set {0, 1}N of all countably infinite sequences of 0
and 1 values indexed by the natural numbers, and the measurable sets F
are all sets that can be generated by countable unions1 of cylinder sets,
where a cylinder set consists of all extensions xy of some finite prefix x. The
probability measure itself is obtained by assigning the set of all points that
start with x the probability 2−|x| , and computing the probabilities of other
sets from the axioms.2
An oddity that arises in general probability spaces is that it may be that every
particular outcome has probability zero but their union has probability 1.
For example, the probability of any particular infinite string of bits is 0, but
the set containing all such strings is the entire space and has probability
1. This is where the fact that probabilities only add over countable unions
comes in.
Most randomized algorithms books gloss over general probability spaces,
with three good reasons. The first is that if we truncate an algorithm after
a finite number of steps, we usually get back to a discrete probability
space, which avoids a lot of worrying about measurability and convergence.
The second is that we are often implicitly working in a probability space that
is either discrete or well-understood (like the space of bit-vectors described
above). The last is that the Kolmogorov extension theorem says that
1 As well as complements and countable intersections. However, it is not hard to show
that sets defined using these operations can be reduced to countable unions of cylinder
sets.
2 This turns out to give the same probabilities as if we consider each outcome as a
real number in the interval [0, 1] and use Lebesgue measure to compute the probability of
events. For some applications, thinking of our random values as real numbers (or even
sequences of real numbers) can make things easier: consider for example what happens
when we want to choose one of three outcomes with equal probability.
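The footnote's example of choosing one of three outcomes with equal probability shows why unbounded bit sequences come up. A minimal Python sketch (an illustration of rejection sampling, not from the notes):

    import random

    def uniform3():
        """Choose one of three outcomes with equal probability from fair bits:
        draw two bits to get a value in {0, 1, 2, 3} and retry on 3.  The
        number of bits consumed is unbounded but finite with probability 1."""
        while True:
            value = 2 * random.getrandbits(1) + random.getrandbits(1)
            if value < 3:
                return value

    print([uniform3() for _ in range(10)])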
Lemma 2.2.1.

Pr[Ā] = 1 − Pr[A].

For example, if our probability space consists of the six outcomes of a fair
die roll, and A = [outcome is 3] with Pr[A] = 1/6, then Pr[outcome is not 3] =
Pr[Ā] = 1 − 1/6 = 5/6. Though this example is trivial, using the formula
does save us from having to add up the five cases where we don't get 3.
If we want to know the probability of A ∩ B, we need to know more
about the relationship between A and B. For example, it could be that
A and B are both events representing a fair coin coming up heads, with
Pr [A] = Pr [B] = 1/2. The probability of A ∩ B could be anywhere between
1/2 and 0:
• For ordinary fair coins, we’d expect that half the time that A happens,
B also happens. This gives Pr [A ∩ B] = (1/2) · (1/2) = 1/4. To
make this formal, we might define our probability space Ω as having
four outcomes HH, HT, TH, and TT, each of which occurs with equal
probability.
• But maybe A and B represent the same fair coin: then A ∩ B = A and
Pr [A ∩ B] = Pr [A] = 1/2.
3 Countable need not be infinite, so 2 is countable.
• At the other extreme, maybe A and B represent two fair coins welded
together so that if one comes up heads the other comes up tails. Now
Pr [A ∩ B] = 0.
The difference between the nice case where Pr [A ∩ B] equals 1/4 and
the other, more annoying cases where it doesn’t is that in the first case we
have assumed that A and B are independent, which is defined to mean
that Pr [A ∩ B] = Pr [A] Pr [B].
In the real world, we expect events to be independent if they refer to
parts of the universe that are not causally related: if we flip two coins that
aren’t glued together somehow, then we assume that the outcomes of the
coins are independent. But we can also get independence from events that
are not causally disconnected in this way. An example would be if we rolled
a fair four-sided die labeled HH, HT, TH, TT, where we take the first letter
as representing A and the second as B.
There’s no simple formula for Pr [A ∪ B] when A and B are not disjoint,
even for independent events, but we can compute the probability by splitting
up into smaller, disjoint events and using countable additivity:
Pr[A ∪ B] = Pr[(A ∩ B) ∪ (A ∩ B̄) ∪ (Ā ∩ B)]
          = Pr[A ∩ B] + Pr[A ∩ B̄] + Pr[Ā ∩ B]
          = Pr[A ∩ B] + Pr[A ∩ B̄] + Pr[Ā ∩ B] + Pr[A ∩ B] − Pr[A ∩ B]
          = Pr[A] + Pr[B] − Pr[A ∩ B].
B_T with T ≠ ∅.
That the right-hand side gives the probability of this event is a sneaky
consequence of the binomial theorem, and in particular the fact that
∑_{i=1}^n (−1)^i (n choose i) = ∑_{i=0}^n (−1)^i (n choose i) − 1 = (1 − 1)^n − 1
is −1 if n > 0 and 0 if n = 0. Using this fact
after rewriting the right-hand side using the B_T events gives
" #
|S|+1
(−1)|S|+1
X \ X X
(−1) Pr Ai = Pr [BT ]
S⊆{1...n},S6=∅ i∈S S⊆{1...n},S6=∅ T ⊇S
(−1)|S|+1
X X
= Pr [BT ]
T ⊆{1...n} S⊆T,S6=∅
n
!!
X X
i |T |
= − Pr [BT ] (−1)
T ⊆{1...n} i=1
i
− Pr [BT ] ((1 − 1)|T | − 1)
X
=
T ⊆{1...n}
Pr [BT ] (1 − 0|T | )
X
=
T ⊆{1...n}
X
= Pr [BT ]
T ⊆{1...n},T 6=∅
" n #
[
= Pr Ai .
i=1
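For a sanity check, inclusion-exclusion is easy to verify by brute force on a small discrete probability space; this Python sketch (an illustration, with an arbitrary choice of events) compares both sides of the formula:

    from itertools import combinations

    # A small discrete probability space: outcomes 0..5 with the uniform
    # measure, and three arbitrary events.
    omega = set(range(6))
    events = [{0, 1, 2}, {1, 3}, {2, 3, 4}]
    pr = lambda e: len(e) / len(omega)

    lhs = pr(set().union(*events))
    rhs = 0.0
    for k in range(1, len(events) + 1):
        for subset in combinations(events, k):
            rhs += (-1) ** (k + 1) * pr(set.intersection(*subset))
    print(lhs, rhs)   # both 5/6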
Pr[A | B] = Pr[A ∩ B]/Pr[B],    (2.3.1)
likely, the first random thing the algorithm does is flip a coin, giving two
possible outcomes B and B̄. Countable additivity tells us that Pr[A] =
Pr[A ∩ B] + Pr[A ∩ B̄], which we can rewrite using conditional probability
as

Pr[A] = Pr[A | B] Pr[B] + Pr[A | B̄] Pr[B̄],    (2.3.2)
If the events B_1, B_2, . . . partition the probability space, the same reasoning
gives

Pr[A] = ∑_i Pr[A ∩ B_i] = ∑_i Pr[A | B_i] Pr[B_i],    (2.3.3)

which is the law of total probability. Note that the last step works for
each term only if Pr[A | B_i] is well-defined, meaning that Pr[B_i] ≠ 0. But
any case where Pr[B_i] = 0 also has Pr[A ∩ B_i] = 0, so we get the correct
answer if we simply omit these terms from both sums.
A special case arises when Pr[A | B̄] = 0, which occurs, for example,
if A ⊆ B. Then we just have Pr[A] = Pr[A | B] Pr[B]. If we consider an
2.3.3 Examples
Here we have some examples of applying conditional probability to algorithm
analysis. Mostly we will be using some form of the law of total probability.
We can compute

Pr[W | A_i] = Pr[B_0 ∪ B_1 ∪ · · · ∪ B_i | A_i]
            = Pr[B_0 ∪ B_1 ∪ · · · ∪ B_i]
            = Pr[B_0] + Pr[B_1] + · · · + Pr[B_i]
            = ∑_{j=0}^{i} (1/2)^{j+1}
            = (1/2) · (1 − (1/2)^{i+1})/(1 − 1/2)
            = 1 − (1/2)^{i+1}.    (2.3.5)
The clean form of this expression suggests strongly that there is a better
way to get it, and that this way involves taking the negation of the intersection
of i + 1 independent events that occur with probability 1/2 each. With a
little reflection, we can see that the probability that your objects don’t fit in
my buffer is exactly (1/2)^{i+1}.
From the law of total probability (2.3.3),

Pr[W] = ∑_{i=0}^∞ (1 − (1/2)^{i+1})(1/2)^{i+1}
      = 1 − ∑_{i=0}^∞ (1/4)^{i+1}
      = 1 − (1/4) · 1/(1 − 1/4)
      = 2/3.
Pr[W] = ∑_{i=0}^∞ (2/3) Pr[C_i] = 2/3, since ∑ Pr[C_i] = 1, because the union
of these events includes the entire space except for the probability-zero event
C_∞.
Still another approach is to compute the probability that our runs have
exactly the same length (∑_{i=1}^∞ 2^{−i} · 2^{−i} = 1/3), and argue by symmetry that
the remaining 2/3 probability is equally split between my run being longer
(1/3) and your run being longer (1/3). Since W occurs if my run is just as
long or longer, Pr[W] = 1/3 + 1/3 = 2/3. A nice property of this approach is
that the only summation involved is over disjoint events, so we get to avoid
using conditional probability entirely.
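A quick simulation (an illustration based on the setup implied above, where W is the event that my run of heads is at least as long as yours) agrees with the 2/3:

    import random

    def run_length():
        """Number of heads before the first tails (each length k has
        probability 2**-(k+1))."""
        count = 0
        while random.random() < 0.5:
            count += 1
        return count

    trials = 200_000
    wins = sum(run_length() >= run_length() for _ in range(trials))
    print(wins / trials)   # close to 2/3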
Proof. Let (S, T ) be a min cut of size k. Then the degree of each vertex v
is at least k (otherwise (v, G − v) would be a smaller cut), and G contains
at least kn/2 edges. The probability that we contract an S–T edge is thus
at most k/(kn/2) = 2/n, and the probability that we don’t contract one is
5 Unlike ordinary graphs, multigraphs can have more than one edge between two vertices.
[Figure 2.1 (diagrams omitted in this text version).]
Figure 2.1: Karger's min-cut algorithm. Initial graph (at top) has min cut
⟨{a, b, c}, {d, e, f}⟩. We find this cut by getting lucky and contracting edges
ab, df, de, and ac in that order. The final graph (at bottom) gives the cut.
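For concreteness, here is a compact Python sketch of the contraction algorithm (an illustrative rendering; the edge list is an assumption loosely based on Figure 2.1). Each trial contracts random edges until two super-vertices remain; repeating the trial many times finds a min cut with high probability:

    import random

    def karger_once(n, edges):
        """One run of Karger's contraction algorithm on a multigraph:
        repeatedly contract a uniform random surviving edge until two
        super-vertices remain, then return the number of crossing edges."""
        parent = list(range(n))
        def find(v):
            # union-find representative with path halving
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        components, pool = n, list(edges)
        while components > 2:
            u, v = random.choice(pool)
            parent[find(u)] = find(v)     # contract the chosen edge
            components -= 1
            pool = [(a, b) for (a, b) in pool
                    if find(a) != find(b)]  # drop self-loops
        return len(pool)

    # Vertices a..f as 0..5; edge list assumed for illustration.
    edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
    print(min(karger_once(6, edges) for _ in range(100)))  # size of a min cut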
This tells us what happens when we are considering a particular min cut.
If the graph has more than one min cut, this only makes our life easier. Note
that since each min cut turns up with probability at least 1/(n choose 2), there can't
be more than (n choose 2) of them.6 But even if there is only one, we have a good

6 The suspiciously combinatorial appearance of the 1/(n choose 2) suggests that there should
be some way of associating minimum cuts with particular pairs of vertices, but I'm not
aware of any natural way to do this. Sometimes the appearance of a simple expression
in a surprising context may just stem from the fact that there aren't very many distinct
simple expressions.
Chapter 3

Random variables
S Probability
2 1/36
3 2/36
4 3/36
5 4/36
6 5/36
7 6/36
8 5/36
9 4/36
10 3/36
11 2/36
12 1/36
Table 3.1: Probability mass function for the sum of two independent fair
six-sided dice
probability mass function shown in Table 3.1. For a discrete random variable
X, the probability mass function gives enough information to calculate the
probability of any event involving X, since we can just sum up cases using
countable additivity. This gives us another way to compute Pr[S < 5] =
Pr[S = 2] + Pr[S = 3] + Pr[S = 4] = (1 + 2 + 3)/36 = 1/6.
For two random variables, the joint probability mass function gives
Pr [X = x ∧ Y = y] for each pair of values x and y. This generalizes in the
obvious way for more than two variables.
We will often refer to the probability mass function as giving the distri-
bution or joint distribution of a random variable or collection of random
variables, even though distribution (for real-valued variables) technically
refers to the cumulative distribution function F (x) = Pr [X ≤ x], which
is generally not directly computable from the probability mass function for
continuous random variables that take on uncountably many values. To
the extent that we can, we will try to avoid continuous random variables,
and the rather messy integration theory needed to handle them.
Two or more random variables are independent if all sets of events
involving different random variables are independent. In terms of proba-
3.3 Measurability
For discrete probability spaces, any function on outcomes can be a random
variable. The reason is that any event in a discrete probability space has
a well-defined probability. For more general spaces, in order to be useful,
events involving a random variable should have well-defined probabilities.
For discrete random variables that take on only countably many values
(e.g., integers or rationals), it’s enough for the event [X = x] (that is, the
set {ω | X(ω) = x}) to be in F for all x. For real-valued random variables,
we ask that the event [X ≤ x] be in F. In these cases, we say that X is
measurable with respect to F, or just measurable F. More exotic random
variables use a definition of measurability that generalizes the real-valued
version, which we probably won’t need.3 Since we usually just assume that
all of our random variables are measurable unless we are doing something
funny with F to represent ignorance, this issue won’t come up much.
3.4 Expectation
The expectation or expected value of a random variable X is given by
E[X] = ∑_x x Pr[X = x]. This is essentially an average value of X weighted
by probability, and it only makes sense if X takes on values that can be
summed in this way (e.g., real or complex values, or vectors in a real- or
3 The general version is that if X takes on values on another measure space (Ω′, F′),
then the inverse image X^{−1}(A) = {ω ∈ Ω | X(ω) ∈ A} of any set A in F′ is in F. This
means in particular that Pr_Ω maps through X to give a probability measure on Ω′ by
Pr_{Ω′}[A] = Pr_Ω[X^{−1}(A)], and the condition on X^{−1}(A) being in F makes this work.
= a E [X] + b E [Y ] .
The claim continues to hold even in the general case, but the proof is
more work.
One special case of this that comes up often is that X ≥ 0 implies
E [X] ≥ 0.
6 The trick here is that we are trading a probability-1 gain of 1 against a probability-0
loss of ∞. So we could declare that E[∑_{i=0}^∞ X_i] involves 0 · (−∞) and is undefined.
But this would lose the useful property that expectation isn't affected by probability-0
outcomes. As often happens in mathematics, we are forced to choose between candidate
definitions based on which bad consequences we most want to avoid, with no way to avoid
all of them. So the standard definition of expectation allows the St. Petersburg paradox
because the alternatives are worse.
= E [X] E [Y ] .
Here we can’t use the sum formula directly, because N is a random variable,
and we can’t use the product formula, because the Xi are all different random
variables.
If N and the Xi are all independent (which may or may not be the case
for the loop example), and N is bounded by some fixed maximum n, then
we can apply the product rule to get the value of (3.4.2) by throwing in
a few indicator variables. The idea is that the contribution of Xi to the
sum is given by X_i 1_{[N≥i]}, and because we assume that N is independent
of the X_i, if we need to compute E[X_i 1_{[N≥i]}], we can do so by computing
E[X_i] E[1_{[N≥i]}].
So we get

E[∑_{i=1}^N X_i] = E[∑_{i=1}^n X_i 1_{[N≥i]}]
               = ∑_{i=1}^n E[X_i 1_{[N≥i]}]
               = ∑_{i=1}^n E[X_i] E[1_{[N≥i]}].
For general Xi we have to stop here. But if we also know that the Xi all
have the same expectation µ, then E [Xi ] doesn’t depend on i and we can
bring it out of the sum. This gives
∑_{i=1}^n E[X_i] E[1_{[N≥i]}] = µ ∑_{i=1}^n E[1_{[N≥i]}]
                             = µ E[N].    (3.4.3)
This equation is a special case of Wald’s equation, which we will see
again in §9.4.2. The main difference between this version and the general
version is that here we had to assume that N was independent of the Xi ,
which may not be true if our loop is a while loop, and termination after a
particular iteration is correlated with the time taken by that iteration.
But for simple cases, (3.4.3) can still be useful. For example, if we throw
one six-sided die to get N, and then throw N six-sided dice and add them
up, we get the same expected total (7/2) · (7/2) = 49/4 as if we just multiply two
six-sided dice. This is true even though the actual distribution of the values
is very different in the two cases.
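A simulation of the dice example (illustrative code, not from the notes) confirms the value 49/4 given by this special case of Wald's equation:

    import random

    trials = 200_000
    total = 0
    for _ in range(trials):
        n = random.randint(1, 6)                     # N: how many dice to throw
        total += sum(random.randint(1, 6) for _ in range(n))
    print(total / trials, 49 / 4)                    # both close to 12.25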
E [aX + bY | A] = a E [X | A] + b E [Y | A] (3.5.2)
This is just another way of saying what we said already: if you want to know
what the expectation of X is conditioned on Y when you get outcome ω,
find the value of Y at ω and condition on seeing that.
Here is a simple example. Suppose that X and Y are independent fair
coin-flips that take on the values 0 and 1 with equal probability. Then our
probability space Ω has four elements, and looks like this:
⟨0, 0⟩ ⟨0, 1⟩
⟨1, 0⟩ ⟨1, 1⟩
But now suppose we only know X, and want to compute E[Z | X]. When
X = 0, E[Z | X = 0] = (1/2) · 0 + (1/2) · 1 = 1/2; and when X = 1,
E[Z | X = 1] = (1/2) · 1 + (1/2) · 2 = 3/2. So drawing E[Z | X] over our
probability space gives

1/2 1/2
3/2 3/2

We've averaged the value of Z across each row, since each row corresponds
to one of the possible values of X.
If instead we compute E [Z | Y ], we get this picture instead:
1/2 3/2
1/2 3/2
= A(z) E [X | Z = z] + B(z) E [Y | Z = z] ,
which is the value when Z = z of A E [X | Z] + B E [Y | Z].
This means that we can quickly simplify many conditional expectations.
If we go back to the example of the previous section, where Z = X + Y is
the sum of two independent fair coin-flips X and Y , then we can compute
E[Z | X] = E[X + Y | X]
         = E[X | X] + E[Y | X]
         = X + E[Y]
         = X + 1/2.

Similarly, E[Z | Y] = E[X + Y | Y] = 1/2 + Y.
In some cases we have enough additional information to run this in
reverse. If we know Z and want to estimate X, we can use the fact that
X and Y are symmetric to argue that E [X | Z] = E [Y | Z]. But then
Z = E [X + Y | Z] = 2 E [X | Z], so E [X | Z] = Z/2. Note that this works
in general only if the events [X = a, Y = b] and [X = b, Y = a] have the
same probabilities for all a and b even if we condition on Z, which in this
case follows from the fact that X and Y are independent and identically
distributed and that addition is commutative. Other cases may be messier.8
Other facts, like X ≥ Y implies E [X | Z] ≥ E [Y | Z], can be proved
using similar techniques.
8 An example with X and Y identically distributed but not independent is to imagine
that we roll a six-sided die to get X, and let Y = X + 1 if X < 6 and Y = 1 if X = 6. Now
knowing Z = X + Y = 3 tells me that X = 1 and Y = 2 exactly, neither of which is Z/2.
E [X] = E [E [X | Y ]] . (3.5.6)
= E [X] .
The trick here is that we use (3.5.3) to expand out the original expression in
terms of the events Ay , then notice that E [X | Y ] is equal to E [X | Y = y]
whenever Y = y.
So as claimed, conditioning on a variable gives a way to write averaging
over cases very compactly.
It’s also not too hard to show that iterated expectation works with partial
conditioning:
E [E [X | Y, Z] | Y ] = E [X | Y ] . (3.5.7)
and the fact that E[X] is an orthogonal projection of X means that E[X]
is precisely the constant value µ that minimizes the distance E[(X − µ)²].
We won't actually use any of these facts in the following, but having another
way to look at conditional expectation may be helpful in understanding how
it works.
3.5.4 Examples
• Let X be the value of a six-sided die. Let A be the event that X is
even. Then
E[X | A] = ∑_x x Pr[X = x | A]
         = (2 + 4 + 6) · (1/3)
         = 4.
3.6 Applications
3.6.1 Yao’s lemma
In Section 1.1, we considered a special case of the unordered search problem,
where we have an unordered array A[1..n] and want to find the location
of a specific element x. For deterministic algorithms, this requires probing
n array locations in the worst case, because the adversary can place x in
the last place we look. Using a randomized algorithm, we can reduce this
to (n + 1)/2 probes on average, either by probing according to a uniform
random permutation or just by probing from left-to-right or right-to-left
with equal probability.
Can we do better? Proving lower bounds is a nuisance even for determin-
istic algorithms, and for randomized algorithms we have even more to keep
track of. But there is a sneaky trick that allows us to reduce randomized
lower bounds to deterministic lower bounds in many cases.
The idea is that if we have a randomized algorithm that runs in time
T (x, r) on input x with random bits r, then for any fixed choice of r we
have a deterministic algorithm. So for each n, we find some random X
with |X| = n and show that, for any deterministic algorithm that runs in
time T′(x), E[T′(X)] ≥ f(n). But then E[T(X, R)] = E[E[T(X, R) | R]] =
E[E[T_R(X) | R]] ≥ f(n).
This gives us Yao’s lemma:
Lemma 3.6.1 (Yao's lemma, informal version [Yao77]). Fix some problem.
Suppose there is a random distribution on inputs X of size n such that every
deterministic algorithm for the problem has expected cost T (n).
Then the worst-case expected cost of any randomized algorithm is at least
T (n).
ω X Y Z = X + Y E[X] E[X | Y = 3] E[X | Y] E[Z | X] E[X | Z] E[X | X]
(1, 1) 1 1 2 7/2 7/2 7/2 1 + 7/2 2/2 1
(1, 2) 1 2 3 7/2 7/2 7/2 1 + 7/2 3/2 1
(1, 3) 1 3 4 7/2 7/2 7/2 1 + 7/2 4/2 1
(1, 4) 1 4 5 7/2 7/2 7/2 1 + 7/2 5/2 1
(1, 5) 1 5 6 7/2 7/2 7/2 1 + 7/2 6/2 1
(1, 6) 1 6 7 7/2 7/2 7/2 1 + 7/2 7/2 1
(2, 1) 2 1 3 7/2 7/2 7/2 2 + 7/2 3/2 2
(2, 2) 2 2 4 7/2 7/2 7/2 2 + 7/2 4/2 2
(2, 3) 2 3 5 7/2 7/2 7/2 2 + 7/2 5/2 2
(2, 4) 2 4 6 7/2 7/2 7/2 2 + 7/2 6/2 2
(2, 5) 2 5 7 7/2 7/2 7/2 2 + 7/2 7/2 2
(2, 6) 2 6 8 7/2 7/2 7/2 2 + 7/2 8/2 2
(3, 1) 3 1 4 7/2 7/2 7/2 3 + 7/2 4/2 3
(3, 2) 3 2 5 7/2 7/2 7/2 3 + 7/2 5/2 3
(3, 3) 3 3 6 7/2 7/2 7/2 3 + 7/2 6/2 3
(3, 4) 3 4 7 7/2 7/2 7/2 3 + 7/2 7/2 3
(3, 5) 3 5 8 7/2 7/2 7/2 3 + 7/2 8/2 3
(3, 6) 3 6 9 7/2 7/2 7/2 3 + 7/2 9/2 3
(4, 1) 4 1 5 7/2 7/2 7/2 4 + 7/2 5/2 4
(4, 2) 4 2 6 7/2 7/2 7/2 4 + 7/2 6/2 4
(4, 3) 4 3 7 7/2 7/2 7/2 4 + 7/2 7/2 4
(4, 4) 4 4 8 7/2 7/2 7/2 4 + 7/2 8/2 4
(4, 5) 4 5 9 7/2 7/2 7/2 4 + 7/2 9/2 4
(4, 6) 4 6 10 7/2 7/2 7/2 4 + 7/2 10/2 4
(5, 1) 5 1 6 7/2 7/2 7/2 5 + 7/2 6/2 5
(5, 2) 5 2 7 7/2 7/2 7/2 5 + 7/2 7/2 5
(5, 3) 5 3 8 7/2 7/2 7/2 5 + 7/2 8/2 5
(5, 4) 5 4 9 7/2 7/2 7/2 5 + 7/2 9/2 5
(5, 5) 5 5 10 7/2 7/2 7/2 5 + 7/2 10/2 5
(5, 6) 5 6 11 7/2 7/2 7/2 5 + 7/2 11/2 5
(6, 1) 6 1 7 7/2 7/2 7/2 6 + 7/2 7/2 6
(6, 2) 6 2 8 7/2 7/2 7/2 6 + 7/2 8/2 6
(6, 3) 6 3 9 7/2 7/2 7/2 6 + 7/2 9/2 6
(6, 4) 6 4 10 7/2 7/2 7/2 6 + 7/2 10/2 6
(6, 5) 6 5 11 7/2 7/2 7/2 6 + 7/2 11/2 6
(6, 6) 6 6 12 7/2 7/2 7/2 6 + 7/2 12/2 6
we can calculate this out formally (no sensible person would ever do this):

E[X | Ā] = ∑_{n=1}^∞ n Pr[X = n | X ≠ 1]
        = ∑_{n=2}^∞ n · Pr[X = n]/Pr[X ≠ 1]
        = ∑_{n=2}^∞ n · (1 − p)^{n−1} p/(1 − p)
        = ∑_{n=2}^∞ n (1 − p)^{n−2} p
        = ∑_{n=1}^∞ (n + 1)(1 − p)^{n−1} p
        = 1 + E[X].
integer values.
means that X_i has a geometric distribution with success probability (n − i + 1)/n,
giving E[X_i] = n/(n − i + 1) from the analysis in §3.6.2.
To get the total expected number of balls, take the sum

E[∑_{i=1}^n X_i] = ∑_{i=1}^n E[X_i]
               = ∑_{i=1}^n n/(n − i + 1)
               = n ∑_{i=1}^n 1/i
               = n H_n.
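A quick simulation (illustrative Python, not from the notes) matches the nH_n prediction:

    import random

    def coupon_collector(n):
        """Throw balls into n bins uniformly until no bin is empty;
        return the number of balls thrown."""
        seen, balls = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            balls += 1
        return balls

    n, trials = 50, 10_000
    avg = sum(coupon_collector(n) for _ in range(trials)) / trials
    print(avg, n * sum(1 / i for i in range(1, n + 1)))   # both close to n*H_n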
E[Y] = E[X | X > n − X + 1] Pr[X > n − X + 1]
     + E[n − X + 1 | n − X + 1 ≥ X] Pr[n − X + 1 ≥ X].

E[Y] ≤ ((n/2 + n − 1)/2) Pr[X > n − X + 1] + ((n/2 + n)/2) Pr[n − X + 1 ≥ X]
     = ((3/4)n − 1/2) Pr[X > n − X + 1] + (3/4)n Pr[n − X + 1 ≥ X]
     ≤ (3/4)n.
Now let X_i be the number of survivors after i pivot steps. Note that
max(0, X_i − 1) gives the number of comparisons at the following pivot step,
so that ∑_{i=0}^∞ X_i is an upper bound on the number of comparisons.
We have X0 = n, and from the preceding argument E [X1 ] ≤ (3/4)n. But
more generally, we can use the same argument to show that E [Xi+1 | Xi ] ≤
(3/4)Xi , and by induction E [Xi ] ≤ (3/4)i n. We also have that Xj = 0 for
all j ≥ n, because we lose at least one element (the pivot) at each pivoting
step. This saves us from having to deal with an infinite sum.
Using linearity of expectation,

E[∑_{i=0}^∞ X_i] = E[∑_{i=0}^n X_i]
               = ∑_{i=0}^n E[X_i]
               ≤ ∑_{i=0}^n (3/4)^i n
               ≤ 4n.
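For reference, here is a minimal Python sketch of Hoare's FIND (QuickSelect); this is an illustrative rendering, not the notes' own pseudocode:

    import random

    def find(a, k):
        """Hoare's FIND: the k-th smallest element of a (k counted from 1).
        The expected number of comparisons is O(n) by the survivor
        argument above."""
        pivot = random.choice(a)
        less = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        if k <= len(less):
            return find(less, k)
        if k <= len(less) + len(equal):
            return pivot
        greater = [x for x in a if x > pivot]
        return find(greater, k - len(less) - len(equal))

    print(find([7, 1, 5, 3, 9], 2))   # 3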
Chapter 4
Basic probabilistic inequalities
Here we’re going to look at some inequalities useful for proving properties
of randomized algorithms. These come in two flavors: inequalities involving
probabilities, which are useful for bounding the probability that something
bad happens, and inequalities involving expectations, which are used to
bound expected running times. Later, in Chapter 5, we’ll be doing both,
by looking at inequalities that show that a random variable is close to its
expectation with high probability.1
Pr[X ≥ α] ≤ E[X]/α.
The proof is immediate from the law of total probability (2.3.3). We have
4.1.1 Applications
4.1.1.1 Sum of fair coins
Flip n independent fair coins, and let S be the number of heads we get. Since
E[S] = n/2, we get Pr[S = n] ≤ 1/2. This is much larger than the actual
value 2^{−n}, but it's the best we can hope for if we only know E[S]: if we let
S be 0 or n with equal probability, we also get E[S] = n/2.
Note that for this to work for infinitely many events we need to use the fact
that 1_{A_i} is non-negative.
If we prefer to avoid any issues with infinite sums of expectations, the
direct way to prove this is to replace A_i with B_i = A_i \ ⋃_{j=1}^{i−1} A_j. Then
⋃ A_i = ⋃ B_i, but since the B_i are disjoint and each B_i is a subset of the
corresponding A_i, we get Pr[⋃ A_i] = Pr[⋃ B_i] = ∑ Pr[B_i] ≤ ∑ Pr[A_i].
The typical use of the union bound is to show that if an algorithm can
fail only if various improbable events occur, then the probability of failure is
no greater than the sum of the probabilities of these events. This reduces
the problem of showing that an algorithm works with probability 1 − to
constructing an error budget that divides the probability of failure among
all the bad outcomes.
then one of these sets must all land in the same bin. Call the event that all
balls in S choose the same bin A_S. The probability that A_S occurs is exactly
n^{−k+1}.
Using the union bound, we get

Pr[some bin gets at least k balls] = Pr[⋃_S A_S]
                                  ≤ ∑_S Pr[A_S]
                                  = (n choose k) n^{−k+1}
                                  ≤ (n^k/k!) n^{−k+1}
                                  = n/k!.
If we want this probability to be low, we should choose k so that k! ≫ n.
Stirling's formula says that k! ≥ √(2πk) (k/e)^k ≥ (k/e)^k, which gives ln(k!) ≥
k(ln k − 1). If we set k = c ln n/ln ln n, we get

ln(k!) ≥ (c ln n/ln ln n) · (ln c + ln ln n − ln ln ln n − 1)
       ≥ c ln n.
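An empirical check (illustrative Python; the single run and parameters are arbitrary) shows the maximum load staying within a small constant multiple of ln n/ln ln n:

    import math
    import random
    from collections import Counter

    def max_load(n):
        """Throw n balls into n bins uniformly; return the fullest bin's count."""
        return max(Counter(random.randrange(n) for _ in range(n)).values())

    n = 10_000
    print(max_load(n), math.log(n) / math.log(math.log(n)))
    # The max load is within a small constant multiple of ln n / ln ln n.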
4.3.1 Proof
Here is a proof for the case that f is convex and differentiable. The idea is
that if f is convex, then it lies above the tangent line at E [X]. So we can
define a linear function g that represents this tangent line, and get, for all x:

g(x) = f(E[X]) + (x − E[X]) f′(E[X]) ≤ f(x).

But then

E[f(X)] ≥ E[g(X)]
        = E[f(E[X]) + (X − E[X]) f′(E[X])]
        = f(E[X]) + E[X − E[X]] f′(E[X])
        = f(E[X]).
This is pretty much all linearity of expectation in action: E [X], f (E [X]),
and f 0 (E [X]) are all constants, so we can pull them out of expectations
whenever we see them.
The proof of the general case is similar, but for a non-differentiable convex
function it takes a bit more work to show that the bounding linear function
g exists.
[Figure 4.1: a convex function f lying above its tangent line g at E[X], illustrating f(E[X]) ≤ E[f(X)].]
4.3.2 Applications
4.3.2.1 Fair coins: lower bound
Suppose we flip n fair coins, and we want to get a lower bound on E[X²],
worse bound than the 1/2 we can get from applying Markov’s inequality to
X directly.
4.3.2.3 Sifters
Here’s an example of Jensen’s inequality in action in the analysis of an actual
distributed algorithm. For some problems in distributed computing, it’s
useful to reduce coordinating a large number of processes to coordinating
a smaller number. A sifter [AA11] is a randomized mechanism for an
asynchronous shared-memory system that sorts the processes into “winners”
and “losers,” guaranteeing that there is at least one winner. The goal is to
make the expected number of winners as small as possible. The problem
is tricky, because processes can only communicate by reading and writing
shared variables, and an adversary gets to choose which processes participate
and fix the schedule of when each of these processes perform their operations.
The current best known sifter is due to Giakkoupis and Woelfel [GW12].
For n processes, it uses an array A of ⌈lg n⌉ bits, each of which can be read
or written by any of the processes. When a process executes the sifter, it
chooses a random index r ∈ 1 . . . ⌈lg n⌉ with probability 2^{−r} (this doesn't
exactly sum to 1, so the excess probability gets added to r = ⌈lg n⌉). The
process then writes a 1 to A[r] and reads A[r + 1]. If it sees a 0 in its read
(or chooses r = ⌈lg n⌉), it wins; otherwise it loses.
This works as a sifter, because no matter how many processes participate,
some process chooses a value of r at least as large as any other process’s
value, and this process wins. To bound the expected number of winners, take
the sum over all r of the random variable W_r representing the winners
who chose this particular value r. A process that chooses r wins if it carries
out its read operation before any process writes r + 1. If the adversary wants
to maximize the number of winners, it should let each process read as soon
as possible; this effectively means that a process that chooses r wins if no
process previously chooses r + 1. Since r is twice as likely to be chosen as
r + 1, conditioning on a process picking r or r + 1, there is only a 1/3 chance
that it chooses r + 1. So at most 1/(1/3) − 1 = 2 = O(1) processes on average
choose r before some process chooses r + 1. (A simpler argument shows that
the expected number of processes that win because they choose r = ⌈lg n⌉ is
at most 2 as well.)
Summing E[W_r] ≤ 2 over all r gives at most 2⌈lg n⌉ winners on average.
Furthermore, if k < n processes participate, essentially the same analysis
shows that only 2⌈lg k⌉ processes win on average. So this is a pretty effective
tool for getting rid of excess processes.
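The following Python sketch is a simplified illustration of the sifter, with processes running one at a time to completion (a stand-in for the adversary's schedule; the real model is asynchronous, and the analysis above argues the adversary should let each process read as soon as possible). The winner count stays near the 2⌈lg n⌉ bound:

    import math
    import random

    def sift(n):
        """One sifter round with processes running sequentially to
        completion (a simplifying assumption). Returns the number
        of winners."""
        m = max(1, math.ceil(math.log2(n)))
        A = [0] * (m + 2)             # bits A[1..m], 1-indexed
        winners = 0
        for _ in range(n):
            r = 1                     # Pr[r] = 2**-r, excess probability on r = m
            while r < m and random.random() < 0.5:
                r += 1
            A[r] = 1
            if r == m or A[r + 1] == 0:
                winners += 1
        return winners

    n = 100_000
    print(sift(n), 2 * math.ceil(math.log2(n)))   # winners vs the bound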
But it gets better. Suppose that we take the winners of one sifter and
feed them into a second sifter. Let X_k be the number of processes left after
k sifters. We have that X_0 = n and E[X_1] ≤ 2⌈lg n⌉, but what can we
say about E[X_2]? We can calculate E[X_2] = E[E[X_2 | X_1]] ≤ E[2⌈lg X_1⌉].
Unfortunately, the ceiling means that 2⌈lg x⌉ is not a concave function,
but f(x) = 2(lg x + 1) ≥ 2⌈lg x⌉ is. So by Jensen's inequality E[X_2] ≤
E[f(X_1)] ≤ f(E[X_1]) ≤ f(f(n)), and in general E[X_i] ≤ f^{(i)}(n), where
f^{(i)} is the i-fold composition of f. All the extra constants obscure what is
going on a bit, but with a little bit of algebra it is not too hard to show
that f^{(i)}(n) = O(1) for i = O(log* n).4 So this gets rid of all but a constant
number of processes very quickly.
4 The log* function counts how many times you need to hit n with lg to reduce it to
one or less. So log* 1 = 0, log* 2 = 1, log* 4 = 2, log* 16 = 3, log* 65536 = 4,
log* 2^65536 = 5, and after that it starts getting silly.
Chapter 5
Concentration bounds
If we really want to get tight bounds on a random variable X, the trick will
turn out to be picking some non-negative function f (X) where (a) we can
calculate E [f (X)], and (b) f grows fast enough that merely large values of
X produce huge values of f (X), allowing us to get small probability bounds
by applying Markov’s inequality to f (X). This approach is often used to
show that X lies close to E [X] with reasonably high probability, what is
known as a concentration bound.
Typically concentration bounds are applied to sums of random variables,
which may or may not be fully independent. Which bound you may want to
use often depends on the structure of your sum. A quick summary of the
bounds in this chapter is given in Table 5.1. The rule of thumb is to use
Chernoff bounds (§5.2) if you have a sum of independent 0–1 random variables;
the Azuma-Hoeffding inequality (§5.3) if you have bounded variables with a
more complicated distribution that may be less independent; and Chebyshev’s
inequality (§5.1) if nothing else works but you can somehow compute the
variance of your sum (e.g., if the Xi are independent or have easily computed
covariance). In the case of Chernoff bounds, you will almost always end up
using one of the weaker but cleaner versions in §5.2.2 rather than the general
version in §5.2.1.
If none of these bounds work for your particular application, there are
many more out there. See for example the textbook by Dubhashi and
Panconesi [DP09].
Chernoff          X_i ∈ {0, 1}, independent   Pr[S ≥ (1 + δ) E[S]] ≤ (e^δ/(1 + δ)^{1+δ})^{E[S]}
Azuma-Hoeffding   |X_i| ≤ c_i, martingale     Pr[S ≥ t] ≤ exp(−t²/(2 ∑ c_i²))
Chebyshev                                     Pr[|S − E[S]| ≥ α] ≤ Var[S]/α²

Table 5.1: Concentration bounds for S = ∑ X_i (strongest to weakest)
Expand

E[(X − E[X])²] = E[X² − 2X · E[X] + (E[X])²]
             = E[X²] − 2 E[X] · E[X] + (E[X])²
             = E[X²] − (E[X])².    (5.1.2)
This formula is easier to use if you are estimating the variance from a
sequence of samples; by tracking ∑ x_i² and ∑ x_i, you can estimate E[X²]
and E[X] in a single pass, without having to estimate E[X] first and then go
back for a second pass to calculate (x_i − E[X])² for each sample. We won't
use this particular application much, but this explains why the formula is
popular with statisticians.
For any two random variables X and Y , the quantity E [XY ]−E [X] E [Y ]
is called the covariance of X and Y , written Cov [X, Y ]. If we take the
covariance of a variable and itself, covariance becomes variance: Cov [X, X] =
Var [X].
We can use Cov [X, Y ] to rewrite the above expansion as
" #
X X
Var Xi = Cov [Xi , Xj ] (5.1.3)
i i,j
X X
= Var [Xi ] + Cov [Xi , Xj ] (5.1.4)
i i6=j
X X
= Var [Xi ] + 2 Cov [Xi , Xj ] (5.1.5)
i i<j
Note that Cov [X, Y ] = 0 when X and Y are independent; this makes
Chebyshev’s inequality particularly useful for pairwise-independent ran-
dom variables, because then we can just sum up the variances of the individual
variables.
A typical application is when we have a sum S = ∑ X_i of non-negative
random variables with small covariance; here applying Chebyshev’s inequality
to S can often be used to show that S is not likely to be much smaller than
E [S], which can be handy if we want to show that some lower bound holds
on S with some probability. This complements Markov’s inequality, which
can only be used to get upper bounds.
For example, suppose S = ∑_{i=1}^n X_i, where the X_i are independent
Bernoulli random variables with E[X_i] = p for all i. Then E[S] = np, and
Var[S] = ∑_i Var[X_i] = npq (because the X_i are independent), where
q = 1 − p. Chebyshev's inequality then says

Pr[|S − E[S]| ≥ α] ≤ npq/α².

The highest variance is when p = 1/2. In this case, the probability that
S is more than β√n away from its expected value n/2 is bounded by 1/(4β²).
We'll see better bounds on this problem later, but this may already be good
enough for many purposes.
More generally, the approach of bounding S from below by estimating
E[S] and either E[S²] or Var[S] is known as the second-moment method.
In some cases, tighter bounds can be obtained by more careful analysis.
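A simulation (illustrative, with arbitrary parameters) shows how loose Chebyshev's bound is for this coin-flipping example, previewing the sharper Chernoff bounds of §5.2:

    import math
    import random

    n, beta, trials = 1_000, 2.0, 5_000
    deviation = beta * math.sqrt(n)
    hits = 0
    for _ in range(trials):
        s = sum(random.getrandbits(1) for _ in range(n))  # n fair coin-flips
        if abs(s - n / 2) >= deviation:
            hits += 1
    # Chebyshev's bound 1/(4 beta**2) is valid but very loose here.
    print(hits / trials, 1 / (4 * beta ** 2))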
that once we have flipped one coin the wrong way, we are back where
we started, except that now we have to add that extra coin to X. More
formally, we have, for n > 1,

Pr[X = n | X > 1] = Pr[X = n]/Pr[X > 1] = q^{n−1} p/q = q^{n−2} p = Pr[X = n − 1],

and

Var[∑_{i=1}^n X_i] = ∑_{i=1}^n Var[X_i]
                  = ∑_{i=1}^n n(i − 1)/(n − i + 1)².
or more balls in a particular bin is at most (m/n − m/n²)/k² < m/(nk²), and
applying the union bound, the probability that we get k + m/n or more balls
in any of the n bins is less than m/k². Setting this equal to ε and solving for
k gives a probability of at most ε of getting more than m/n + √(m/ε) balls
in any of the bins. This is not as good a bound as we will be able to prove
later, but it's at least non-trivial.
2. Use our sample to find an interval that is likely to contain S_(k). The
idea is to pick indices ℓ = (k − n^{3/4}) n^{−1/4} and r = (k + n^{3/4}) n^{−1/4} and
use R_(ℓ) and R_(r) as endpoints (we are omitting some floors and maxes
here to simplify the notation; for a more rigorous presentation see
[MR95]). The hope is that the interval P = [R_(ℓ), R_(r)] in S will both
contain S_(k), and be small, with |P| ≤ 4n^{3/4} + 2. We can compute the
elements of P in 2n comparisons exactly by comparing every element
with both R_(ℓ) and R_(r).
(with the caveat that we are being sloppy about round-off errors).
Similarly,

Pr[R_(h) < S_(k)] = Pr[X > kn^{−1/4} + √n]
                 ≤ n^{−1/4}/4.
= ∏_i E[e^{αX_i}]
= ∏_i (p_i e^α + (1 − p_i) e^0)
= ∏_i (p_i e^α + 1 − p_i)
= ∏_i (1 + (e^α − 1) p_i)
≤ ∏_i e^{(e^α − 1) p_i}
= e^{(e^α − 1) ∑_i p_i}
= e^{(e^α − 1) µ}.
The sneaky inequality step in the middle uses the fact that (1 + x) ≤ e^x
for all x, which itself is one of the most useful inequalities you can memorize.5
What's nice about this derivation is that at the end, the p_i have vanished.
We don't care what random variables we started with or how many of them
there were, but only about their expected sum µ.
Now that we have an upper bound on E[e^{αS}], we can throw it into
5 For a proof of this inequality, observe that the function f(x) = e^x − (1 + x) has the
derivative e^x − 1, which is positive for x > 0 and negative for x < 0. It follows that x = 0
is the unique minimum of f, at which f(0) = 0.
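A numeric comparison (illustrative Python) of the resulting Chernoff bound (5.2.1), namely (e^δ/(1 + δ)^{1+δ})^µ, against the exact binomial tail shows the bound is valid but not tight:

    import math

    def exact_upper_tail(n, p, k):
        """Pr[Binomial(n, p) >= k], computed exactly."""
        return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
                   for i in range(k, n + 1))

    n, p, delta = 1_000, 0.5, 0.2
    mu = n * p
    k = math.ceil((1 + delta) * mu)
    chernoff = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
    print(exact_upper_tail(n, p, k), chernoff)  # exact tail is far below the bound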
Pr[S ≥ (1 + δ)µ] ≤ (e^δ/(1 + δ)^{1+δ})^µ
Pr[S ≥ (1 + δ)µ] ≤ e^{−δ²µ/4}   for 0 ≤ δ ≤ 4.11
Pr[S ≥ R] ≤ 2^{−R}   for R/µ ≥ 4.32
Pr[S ≥ (1 + δ)µ] ≤ e^{−δ²µ/3}   for 0 ≤ δ ≤ 1.81.
Suppose that we want this bound to be less than ε. Then we need
2e^{−δ²µ/3} ≤ ε, or δ ≥ √(3 ln(2/ε)/µ). Setting δ to exactly this quantity, (5.2.7)
becomes

Pr[|S − µ| ≥ √(3µ ln(2/ε))] ≤ ε,    (5.2.8)

provided ε ≥ 2e^{−µ/3}.
For asymptotic purposes, we can omit the constants, giving
Pr[S ≥ (1 + δ)µ] ≤ Pr[S + T ≥ (1 + δ)µ]
                ≤ (e^δ/(1 + δ)^{1+δ})^µ.
This also works for any of the bounds in §5.2.2 that are derived from
(5.2.1).
In the other direction, if we know E[S] ≥ µ and want to apply the lower
tail bound (5.2.5), we can apply a slightly different construction. Suppose
that E[S] = µ′ ≥ µ. For each X_i, construct a 0-1 random variable Y_i such
that (a) all the Y_i are independent of each other, (b) Y_i ≤ X_i always, and (c)
E[Y_i] = E[X_i] (µ/µ′). The easiest way to do this is to set Y_i = X_i Z_i where
each Z_i is an independent biased coin with expectation µ/µ′.
Let T = ∑ Y_i. Then T ≤ S and E[T] = ∑ E[Y_i] = ∑ E[X_i] (µ/µ′) = µ.
Pr[S ≤ (1 − δ)µ] ≤ Pr[T ≤ (1 − δ)µ]
                ≤ (e^{−δ}/(1 − δ)^{1−δ})^µ.
As with the upper tail bound, this approach also works for simplified
versions of the lower tail bound like (5.2.6).
For the two-sided variants, we are out of luck. The best we can do if we
know a ≤ E [S] ≤ b is to apply each of the one-sided bounds separately.
Proof. Rather than repeat the argument for independent variables, we will
employ a coupling, where we replace the X_i with independent Y_i so that
∑_{i=1}^n Y_i gives a bound on ∑_{i=1}^n X_i.
For the upper bound, let each Y_i = 1 with independent probability
p_i. Use the following process to generate a new X′_i in increasing order
of i: if Y_i = 0, set X′_i = 0. Otherwise set X′_i = 1 with probability
Pr[X_i = 1 | X_1 = X′_1, . . . , X_{i−1} = X′_{i−1}]/p_i. Then X′_i ≤ Y_i, and
it follows that the X′_i have the same joint distribution as the X_i, and so
" n # " n #
Xi0
X X
Pr ≥ µ(1 + δ) = Pr Xi ≥ µ(1 + δ)
i=1 i=1
" n #
X
≤ Pr Yi ≥ µ(1 + δ)
i=1
!µ
eδ
≤ .
(1 + δ)1+δ
For the other direction, generate the Xi first and generate the Yi using
the same rejection sampling trick. Now the Y_i are independent (because
their joint distribution is a product distribution) and each Y_i is a lower
bound on the corresponding X_i.
The lemma is stated for the general Chernoff bounds (5.2.1) and (5.2.5),
but the easier versions follow from these, so they hold as well, as long as we
are careful to remember that the µ in the upper bound is not necessarily the
same µ as in the lower bound.
5.2.8 Applications
5.2.8.1 Flipping coins
Suppose S is the sum of n independent fair coin-flips. Then E[S] = n/2 and
Pr[S = n] = Pr[S ≥ 2 E[S]] is bounded using (5.2.1) by setting µ = n/2,
δ = 1 to get Pr[S = n] ≤ (e/4)^{n/2} = (2/√e)^{−n}. This is not quite as good as
the real answer 2^{−n} (the quantity 2/√e is about 1.213. . . ), but it's at least
exponentially small.
[Figure: a three-dimensional hypercube with nodes labeled 000 through 111.]
The actual excess over the mean is δ(n/2) = (n/2)√(6c ln n/n) = √((3/2) c n ln n).
the n edges leaving a processor.6 How do we route the packets so that all of
them arrive in the minimum amount of time?
We could try to be smart about this, or we could use randomization.
Valiant’s idea is to first route each process i’s packet to some random
intermediate destination σ(i), then in the second phase, we route it from σ(i)
to its ultimate destination π(i). Unlike π, σ is not necessarily a permutation;
instead, σ(i) is chosen uniformly at random independently of all the other
σ(j). This makes the choice of paths for different packets independent of
each other, which we will need later to apply Chernoff bounds.
Routing is done by bit-fixing: if a packet is currently at node x and
heading for node y, find the leftmost bit j where x_j ≠ y_j and fix it, by
sending the packet on to x[x_j/y_j]. In the absence of contention, bit-fixing
routes a packet to its destination in at most n steps. The hope is that the
randomization will tend to spread the packets evenly across the network,
reducing the contention for edges enough that the actual time will not be
much more than this.
The first step is to argue that, during the first phase, any particular
packet is delayed at most one time unit by any other packet whose path
overlaps with it. Suppose packet i is delayed by contention on some edge uv.
Then there must be some other packet j that crosses uv during this phase.
From this point on, j remains one step ahead of i (until its path diverges),
so it can’t block i again unless both are blocked by some third packet k (in
which case we charge $i$'s further delay to $k$). This means that we can bound
the delays for packet $i$ by counting how many other packets cross its path.
(A much more formal version of this argument is given as [MR95, Lemma 4.5].)
So now we just need a high-probability bound on the number of packets that
get in a particular packet's way.
Following the presentation in [MR95], define $H_{ij}$ to be the indicator
variable for the event that packets $i$ and $j$ cross paths during the first phase.
Because each $j$ chooses its destination independently, once we fix $i$'s path,
the $H_{ij}$ are all independent. So we can bound $S = \sum_{j \ne i} H_{ij}$ using Chernoff
bounds. To do so, we must first calculate an upper bound on $\mu = E[S]$.
The trick here is to observe that any path that crosses $i$'s path must cross
one of its edges, and we can bound the number of such paths by bounding
how many paths cross each edge. For each edge $e$, let $T_e$ be the number
of paths that cross edge $e$, and for each $j$, let $X_j$ be the number of edges
that path $j$ crosses. Counting two ways, we have $\sum_e T_e = \sum_j X_j$, and so
\[
E\left[\sum_e T_e\right] = E\left[\sum_j X_j\right] \le N(n/2).
\]
By symmetry, all the $T_e$ have the same expectation, so we get $E[T_e] \le \frac{N(n/2)}{Nn} = 1/2$.
Now fix $\sigma(i)$. This determines some path $e_1 e_2 \dots e_k$ for packet $i$. In
general we do not expect $E[T_{e_\ell} \mid \sigma(i)]$ to equal $E[T_{e_\ell}]$, because conditioning
on $i$'s path crossing $e_\ell$ guarantees that at least one path crosses this edge that
might not have. However, if we let $T'_e$ be the number of packets $j \ne i$ that
cross $e$, then we have $T'_e \le T_e$ always, giving $E[T'_e] \le E[T_e]$, and because $T'_e$
does not depend on $i$'s path, $E[T'_e \mid \sigma(i)] = E[T'_e] \le E[T_e] \le 1/2$. Summing
this bound over all $k \le n$ edges on $i$'s path gives $E\left[\sum_{j \ne i} H_{ij} \mid \sigma(i)\right] \le n/2$,
which implies $E\left[\sum_{j \ne i} H_{ij}\right] \le n/2$ after removing the conditioning on $\sigma(i)$.
Inequality (5.2.4) says that $\Pr[X \ge R] \le 2^{-R}$ when $R \ge 2e\mu$. Letting
$X = \sum_{j \ne i} H_{ij}$ and setting $R = 3n$ gives $R = 6(n/2) \ge 6\mu > 2e\mu$, so
$\Pr\left[\sum_{j \ne i} H_{ij} \ge 3n\right] \le 2^{-3n} = N^{-3}$. This says that any one packet reaches its
random destination with at most $3n$ added delay (thus, in at most $4n$ time
units) with probability at least $1 - N^{-3}$. If we consider all $N$ packets, the
total probability that any of them fail to reach their random destinations in
4n time units is at most N · N −3 = N −2 . Note that because we are using
the union bound, we don’t need independence for this step—which is good,
because we don’t have it.
What about the second phase? Here, routing the packets from the random
destinations back to the real destinations is just the reverse of routing them
from the real destinations to the random destinations. So the same bound
applies, and with probability at most N −2 some packet takes more than
4n time units to get back (this assumes that we hold all the packets before
sending them back out, so there are no collisions between packets from
different phases).
Adding up the failure probabilities and costs for both stages gives a
probability of at most $2/N^2$ that any packet takes more than $8n$ time units
to reach its destination.
The structure of this argument is pretty typical for applications of Cher-
noff bounds: we get a very small bound on the probability that something
bad happens by applying Chernoff bounds to a part of the problem where
we have independence, then use the union bound to extend this to the full
problem where we don’t.
Proof. The basic idea is that, for any $\alpha$, $e^{\alpha x}$ is a convex function. Since we
want an upper bound, we can't use Jensen's inequality (4.3.1), but we can
use the fact that $X$ is bounded and we know its expectation. Convexity of
$e^{\alpha x}$ means that, for any $x$ with $-c \le x \le c$, $e^{\alpha x} \le \lambda e^{-\alpha c} + (1-\lambda) e^{\alpha c}$, where
$x = \lambda(-c) + (1-\lambda)c$. Solving for $\lambda$ in terms of $x$ gives $\lambda = \frac{1}{2}\left(1 - \frac{x}{c}\right)$ and
$1 - \lambda = \frac{1}{2}\left(1 + \frac{x}{c}\right)$. So
\[
E\left[e^{\alpha X}\right] \le E\left[\frac{1}{2}\left(1 - \frac{X}{c}\right) e^{-\alpha c} + \frac{1}{2}\left(1 + \frac{X}{c}\right) e^{\alpha c}\right]
= \frac{e^{-\alpha c} + e^{\alpha c}}{2} - \frac{e^{-\alpha c}}{2c} E[X] + \frac{e^{\alpha c}}{2c} E[X]
= \frac{e^{-\alpha c} + e^{\alpha c}}{2}
= \cosh(\alpha c),
\]
using $E[X] = 0$ in the second-to-last step.
Footnote 8: Note that the requirement that $E[X_i] = 0$ can always be satisfied by considering instead $Y_i = X_i - E[X_i]$.

Footnote 9: The history of this is that Hoeffding [Hoe63] proved it for independent random variables, and observed that the proof was easily extended to martingales, while Azuma [Azu67] actually went and did the work of proving it for martingales.
In other words, the worst possible X is a fair choice between ±c, and
in this case we get the hyperbolic cosine of αc as its moment generating
function.
We don’t like hyperbolic cosines much, because we are going to want
to take products of our bounds, and hyperbolic cosines don’t multiply very
nicely. As before with 1 + x, we’d be much happier if we could replace the
cosh with a nice exponential. The Taylor series expansion of $\cosh x$ starts
with $1 + x^2/2 + \dots$, suggesting that we should approximate it with $\exp(x^2/2)$,
and indeed it is the case that for all $x$, $\cosh x \le e^{x^2/2}$. This can be shown by
comparing the rest of the Taylor series expansions:
\[
\cosh x = \frac{e^x + e^{-x}}{2}
= \frac{1}{2}\left(\sum_{n=0}^{\infty} \frac{x^n}{n!} + \sum_{n=0}^{\infty} \frac{(-x)^n}{n!}\right)
= \sum_{n=0}^{\infty} \frac{x^{2n}}{(2n)!}
\le \sum_{n=0}^{\infty} \frac{x^{2n}}{2^n n!}
= \sum_{n=0}^{\infty} \frac{(x^2/2)^n}{n!}
= e^{x^2/2}.
\]
Theorem 5.3.2. Let $X_1, \dots, X_n$ be independent random variables with $E[X_i] = 0$ and $|X_i| \le c_i$ for all $i$. Then for all $t$,
\[
\Pr\left[\sum_{i=1}^{n} X_i \ge t\right] \le \exp\left(-\frac{t^2}{2 \sum_{i=1}^{n} c_i^2}\right). \tag{5.3.2}
\]
To make the problem fit the theorem, we replace each $X_i$ by a rescaled version
$Y_i = 2X_i - 1 = \pm 1$ with equal probability; this makes $E[Y_i] = 0$ as needed,
2. $E[S_{t+1} \mid \mathcal{F}_t] = S_t$.
This means that even if we include any extra information we might have
at time t, we still can’t predict St+1 any better than by guessing the current
value St . This alternative definition will be important in some special cases,
as when St is a function of some other collection of random variables that
we use to define the Ft . Because Ft includes at least as much information as
S0 , . . . , St , it will always be the case that any sequence {(St , Ft )} that is a
martingale in the general sense gives a sequence {St } that is a martingale in
the more specialized E [St+1 | S0 , . . . , St ] = St sense.
Martingales were invented to analyze fair gambling games, where your
return over some time interval is not independent of previous outcomes (for
example, you may change your bet or what game you are playing depending
on how things have been going for you), but it is always zero on average
given previous information.10 The nice thing about martingales is they allow
Footnote 10: Real casinos give negative expected return, so your winnings in a real casino form a supermartingale with $S_t \ge E[S_{t+1} \mid S_0, \dots, S_t]$. On the other hand, the casino's take, in a well-run casino, is a submartingale, a process with $S_t \le E[S_{t+1} \mid S_0, \dots, S_t]$. These definitions also generalize in the obvious way to the $\{(S_t, \mathcal{F}_t)\}$ case.
for a bit of dependence while still acting very much like sums of independent
random variables.
Where this comes up with Hoeffding’s inequality is that we might have a
process that is reasonably well-behaved, but its increments are not technically
independent. For example, suppose that a gambler plays a game where
they bet $x$ units ($0 \le x \le 1$) at each round, and receive $\pm x$ with equal
probability. Suppose also that their bet at each round may depend on the
outcome of previous rounds (for example, they might stop betting entirely
if they lose too much money). If $X_i$ is their take at round $i$, we have that
$E[X_i \mid X_1, \dots, X_{i-1}] = 0$ and that $|X_i| \le 1$. This is enough to apply the
martingale version of Hoeffding's inequality, often called Azuma's inequality.
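
As a sanity check, here is a quick Python simulation sketch (mine; the particular betting rule is an arbitrary stand-in for any history-dependent strategy with bets in $[0,1]$), comparing the empirical tail to the bound $\exp(-t^2/2n)$ that Theorem 5.3.3 below gives with all $c_i = 1$:

    import math, random

    def winnings(n):
        s = 0.0
        for _ in range(n):
            bet = 0.5 if s > -10 else 0.0   # quit after losing 10 units
            s += bet if random.random() < 0.5 else -bet
        return s

    n, t = 10000, 150
    bound = math.exp(-t * t / (2 * n))      # exp(-t^2 / (2 sum c_i^2)) with c_i = 1
    empirical = sum(winnings(n) >= t for _ in range(1000)) / 1000
    print(bound, empirical)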
Theorem 5.3.3. Let $\{S_k\}$ be a martingale with $S_k = \sum_{i=1}^{k} X_i$ and $|X_i| \le c_i$ for all $i$. Then for all $n$ and all $t \ge 0$:
\[
\Pr[S_n \ge t] \le \exp\left(\frac{-t^2}{2 \sum_{i=1}^{n} c_i^2}\right). \tag{5.3.5}
\]
Proof. Basically, we just show that $E\left[e^{\alpha S_n}\right] \le \exp\left(\frac{\alpha^2}{2} \sum_{i=1}^{n} c_i^2\right)$, just like
in the proof of Theorem 5.3.2, and the rest follows using the same argument.
The only tricky part is that we can no longer use independence to transform
$E\left[\prod_{i=1}^{n} e^{\alpha X_i}\right]$ into $\prod_{i=1}^{n} E\left[e^{\alpha X_i}\right]$.
Some extensions:
• The asymmetric version of Hoeffding's inequality (5.3.4) also holds for
martingales. So if each increment $X_i$ satisfies $a_i \le X_i \le b_i$ always,
\[
\Pr\left[\sum_{i=1}^{n} X_i \ge t\right] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right). \tag{5.3.6}
\]
Footnote 11: The corresponding notion in the other direction is a submartingale. See §9.2.

Footnote 12: This is known as a Doob decomposition and can be used to extract a martingale $\{Z_i\}$ from any stochastic process $\{X_i\}$. For general processes, $Y_i = X_i - Z_i$ will still be predictable, but may not satisfy $E[Y_i \mid X_1, \dots, X_{i-1}] \le 0$.
• Suppose that we stop the process after the first time $\tau$ with $S_\tau = \sum_{i=1}^{\tau} X_i \ge t$. This is equivalent to making a new variable $Y_i$ that is
zero whenever $S_{i-1} \ge t$ and equal to $X_i$ otherwise. This doesn't affect
the conditions $E[Y_i \mid Y_1, \dots, Y_{i-1}] = 0$ or $|Y_i| \le c_i$, but it makes it so that
$\sum_{i=1}^{n} Y_i \ge t$ if and only if $\max_{k \le n} \sum_{i=1}^{k} X_i \ge t$. Applying (5.3.5) to
$\sum Y_i$ then gives
\[
\Pr\left[\max_{k \le n} \sum_{i=1}^{k} X_i \ge t\right] \le \exp\left(-\frac{t^2}{2 \sum_{i=1}^{n} c_i^2}\right). \tag{5.3.8}
\]
There are also cases where the asymmetric version works with ai ≤
Xi ≤ bi where a bound on bi − ai is fixed but the precise values of
ai and bi may vary depending on X1 , . . . , Xi−1 . This shows up in the
proof of McDiarmid’s inequality [McD89], which is described below in
§5.3.3.
that $f$ has the bounded difference property, which says that there are
bounds $c_t$ such that for any $x_1, \dots, x_n$ and any $x'_t$, we have
\[
\left|f(x_1, \dots, x_t, \dots, x_n) - f(x_1, \dots, x'_t, \dots, x_n)\right| \le c_t,
\]
so when $x_{t+1}$ happens to be the value of $X_{t+1}$, the first conditional expectation
in (5.3.12) is just $Y_{t+1}$, giving
\[
|Y_{t+1} - Y_t| \le c_{t+1}.
\]
This turns out to overestimate the possible range of Yt+1 . With a more
sophisticated argument, it can be shown that for any fixed x1 , . . . , xt , there
exist bounds at+1 ≤ Yt+1 − Yt ≤ bt+1 such that bt+1 − at+1 = ct+1 . We
would like to use this to apply the asymmetric version of Azuma-Hoeffding
given in (5.3.6). A complication is that the specific values of at+1 and bt+1
may depend on the previous values x1 , . . . , xt , even if the bound ct+1 on
their maximum difference does not. Fortunately, McDiarmid shows that the
inequality works anyway, giving:
5.3.4 Applications
Here are some applications of the preceding inequalities. Most of these are
examples of the method of bounded differences.
location, how likely is it that your distance to the nearest mailbox deviates
substantially from the average distance?
We can describe your position as a bit vector X1 , . . . , Xn , where each
Xi is an independent random bit. Let f (X1 , . . . , Xn ) be the distance from
X1 , . . . , Xn to the nearest element of A. Then changing one of the bits
changes this function by at most 1. So we have Pr [|f − E [f ]| ≥ −2t2 /n
√ t] ≤ 2e
by (5.3.13), giving a range of possible distances that is O( n log n) with
probability at least 1 − n−c for any fixed c > 0.14 Of course, without knowing
what A is, we don’t know what E[f ] is; but at least we can be assured that
(unless A is very big) the distance we have to walk to send our mail will be
pretty much the same pretty much wherever we start.
of n and p than are given by this crude result. For a more recent paper that
cites many of these see [Hec21].
\[
T(n) = (n-1) + \frac{1}{n} \sum_{k=0}^{n-1} \left(T(k) + T(n-1-k)\right).
\]
so far and the summation gives an upper bound on the expected number of
comparisons remaining.
To show that this is in fact a supermartingale, observe that if we partition
a block of size n we add n − 1 to Ct but replace the cost bound 2n ln n by
an expected
\[
2 \cdot \frac{1}{n} \sum_{k=0}^{n-1} 2k \ln k \le \frac{4}{n} \int_2^n k \ln k \, dk
\le \frac{4}{n}\left(\frac{n^2 \ln n}{2} - \frac{n^2}{4} - \ln 2 + 1\right)
\le 2n \ln n - n - \ln 2 + 1
< 2n \ln n - n.
\]
The net change is less than − ln 2. The fact that it’s not zero suggests
that we could improve the 2n ln n bound slightly, but since it’s going down,
we have a supermartingale.
Let's try to get a bound on how much $X_t$ changes at each step. The $C_t$
part goes up by at most $n-1$. The summation can only go down; if we split
a block of size $n_i$, the biggest drop we get is if we split it evenly.¹⁵ This gives
a drop of
\[
2n \ln n - 2\left(2 \cdot \frac{n-1}{2} \ln \frac{n-1}{2}\right)
= 2n \ln n - 2(n-1) \ln \frac{n-1}{2}
= 2n \ln n - 2(n-1)\left(\ln n - \ln \frac{2n}{n-1}\right)
= 2n \ln n - 2n \ln n + 2n \ln \frac{2n}{n-1} + 2 \ln n - 2 \ln \frac{2n}{n-1}
= 2n \cdot O(1) + O(\log n)
= O(n).
\]
analysis of Hoare's FIND (§3.6.4). For each element, the number of elements
in the same block is multiplied by a factor of at most 3/4 on average each
time the element is compared, so the chance that the element is not by itself is
at most $(3/4)^k n$ after $k$ comparisons. Setting $k = \log_{4/3}(n^2/\epsilon)$ gives that any
particular element is compared $k$ or more times with probability at most $\epsilon/n$. The
union bound then gives a probability of at most $\epsilon$ that the most-compared
element is compared $k$ or more times. So the total number of comparisons is
$O(\log(n/\epsilon))$ with probability $1 - \epsilon$, which becomes $O(\log n)$ with probability
$1 - n^{-c}$ if we set $\epsilon = n^{-c}$ for a fixed $c$.
We can formalize this argument using a supermartingale. Let $C_i^t$
be the number of times $i$ has been compared as a non-pivot in the first $t$
pivot steps and $N_i^t$ be the size of the block containing $i$ after $t$ pivot steps.
Let $Y_i^t = (4/3)^{C_i^t} N_i^t$. Then if we pick $i$'s block at step $t+1$, the exponent
goes up by at most 1 and $N_i^t$ drops to at most $3/4$ of its previous value on
average, canceling out the increase. If we don't pick $i$'s block, $Y_i^t$ is unchanged.
In either case we get $Y_i^t \ge E\left[Y_i^{t+1} \mid \mathcal{F}_t\right]$, and $Y_i^t$ is a supermartingale.
Now let $Z^t = \sum_i Y_i^t$. Since this is greater than or equal to $\sum_i E\left[Y_i^{t+1} \mid \mathcal{F}_t\right] = E\left[Z^{t+1} \mid \mathcal{F}_t\right]$, the $Z^t$ also form a supermartingale, with $E[Z^n] \le Z^0 = n \cdot n = n^2$. If some element $i$ is compared $a \log_{4/3} n$ or more times, then $Y_i^n \ge (4/3)^{a \log_{4/3} n} = n^a$, so Markov's inequality gives
\[
\Pr\left[\max_i C_i^n \ge a \log_{4/3} n\right] \le \frac{E[Z^n]}{n^a} \le \frac{Z^0}{n^a} = n^{2-a}.
\]
Choosing $a = c+2$ gives an $n^{-c}$ bound on $\Pr\left[\max_i C_i^n \ge (c+2) \log_{4/3} n\right]$
and thus the same bound on $\Pr\left[\sum_i C_i^n \ge (c+2) n \log_{4/3} n\right]$. This shows that
the total number of comparisons is $O(n \log n)$ with high probability.
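
The bound is easy to test empirically; the following Python sketch (mine) counts each element's comparisons as a non-pivot during randomized QuickSort and compares the maximum against $(c+2)\log_{4/3} n$:

    import math, random

    def comparison_counts(n):
        counts = [0] * n
        stack = [list(range(n))]
        while stack:
            block = stack.pop()
            if len(block) <= 1:
                continue
            pivot = random.choice(block)
            for x in block:
                if x != pivot:
                    counts[x] += 1          # x is compared as a non-pivot
            stack.append([x for x in block if x < pivot])
            stack.append([x for x in block if x > pivot])
        return counts

    n, c = 100000, 1
    print(max(comparison_counts(n)), (c + 2) * math.log(n) / math.log(4 / 3))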
\[
\sum_i T_i \cdot (\mu^* - \mu_i), \tag{5.3.15}
\]
where $T_i$ counts the number of times we pull arm $i$, $\mu^*$ is the expected
payoff of the best arm, and $\mu_i = E[X_i]$ is the expected payoff of arm $i$.
The tricky part here is that when we pull an arm and get a bad return,
we don’t know if we were just unlucky this time or it’s actually a bad arm.
So we have an incentive to try lots of different arms. On the other hand, the
more we pull a genuinely inferior arm, the worse our overall return. We’d like
to adopt a strategy that trades off between exploration (trying new arms)
and exploitation (collecting on the best arm so far) to do as well as we can in
comparison to a strategy that always pulls the best arm.
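
The strategy analyzed in the rest of this section (commonly known as UCB1) adds a bonus of $c_{t,s} = \sqrt{2 \ln t / s}$ to the observed average of an arm that has been pulled $s$ times out of $t$ total pulls, and always pulls the arm with the best adjusted score. A minimal Python sketch (mine, assuming payoffs in $[0,1]$):

    import math, random

    def ucb1(arms, n):
        # arms: list of zero-argument functions returning payoffs in [0, 1].
        k = len(arms)
        pulls = [1] * k
        totals = [float(arm()) for arm in arms]    # pull each arm once to start
        for t in range(k + 1, n + 1):
            def score(i):
                return totals[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i])
            i = max(range(k), key=score)
            totals[i] += arms[i]()
            pulls[i] += 1
        return pulls

    # Two biased coins; the pull counts should heavily favor the second arm.
    print(ucb1([lambda: random.random() < 0.4, lambda: random.random() < 0.6], 10000))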
Now consider $\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}$, where each $X_{ij}$ lies between 0 and 1. This
is equivalent to having $\bar{x}_i = \sum_{j=1}^{n_i} Y_j$ where each $Y_j = X_{ij}/n_i$ lies between 0 and
$1/n_i$. So (5.3.17) says that
\[
\Pr\left[\bar{x}_i - E[\bar{x}_i] \ge \sqrt{\frac{2 \ln n}{n_i}}\right]
\le e^{-2\left(\sqrt{(2\ln n)/n_i}\right)^2 / \left(n_i (1/n_i)^2\right)}
= e^{-4 \ln n}
= n^{-4}. \tag{5.3.19}
\]
the bonus given to an arm that has been pulled $s$ times in the first $t$ pulls.
Fix some optimal arm. Let $\bar{X}_{i,s}$ be the average return on arm $i$ after $s$
pulls and $\bar{X}^*_{s}$ be the average return on the optimal arm after $s$ pulls.
If we pull arm $i$ after $t$ total pulls, when arm $i$ has previously been pulled
$s_i$ times and our optimal arm has been pulled $s^*$ times, then we must have
\[
\bar{X}_{i,s_i} + c_{t,s_i} \ge \bar{X}^*_{s^*} + c_{t,s^*}. \tag{5.3.22}
\]
This just says that arm i with its bonus looks better than the optimal
arm with its bonus.
To show that this bad event doesn’t happen, we need three things:
1. The value of $\bar{X}^*_{s^*} + c_{t,s^*}$ should be at least $\mu^*$. Were it to be smaller, the
observed value $\bar{X}^*_{s^*}$ would be more than $c_{t,s^*}$ away from its expectation.
Hoeffding's inequality implies this doesn't happen too often.

2. The value of $\bar{X}_{i,s_i} + c_{t,s_i}$ should not be too much bigger than $\mu_i$. We'll
again use Hoeffding's inequality to show that $\bar{X}_{i,s_i}$ is likely to be at
most $\mu_i + c_{t,s_i}$, making $\bar{X}_{i,s_i} + c_{t,s_i}$ at most $\mu_i + 2c_{t,s_i}$.
3. The bonus $c_{t,s_i}$ should be small enough that even adding $2c_{t,s_i}$ to $\mu_i$
is not enough to beat $\mu^*$. This means that we need to pick $s_i$ large
enough that $2c_{t,s_i} \le \Delta_i$. For smaller values of $s_i$, we will just accept
that we need to pull arm $i$ a few more times before we wise up.
More formally, if none of the following events hold:
\begin{align*}
\bar{X}^*_{s^*} + c_{t,s^*} &\le \mu^*, \tag{5.3.23}\\
\bar{X}_{i,s_i} &\ge \mu_i + c_{t,s_i}, \tag{5.3.24}\\
\mu^* - \mu_i &< 2c_{t,s_i}, \tag{5.3.25}
\end{align*}
then $\bar{X}^*_{s^*} + c_{t,s^*} > \mu^* > \mu_i + 2c_{t,s_i} > \bar{X}_{i,s_i} + c_{t,s_i}$, and we don't pull arm $i$
because the optimal arm is better. (We don't necessarily pull the optimal
arm, but if we don't, it's because we pull some other arm that still isn't arm
$i$.)
For (5.3.23) and (5.3.24), we repeat the argument in (5.3.19), plugging
in $t$ for $n$ and $s_i$ or $s^*$ for $n_i$. This gives a probability of at most $2t^{-4}$ that
either or both of these bad events occur.
For (5.3.25), we need to do something a little sneaky, because the statement
is not actually true when $s_i$ is small. So we will give $\ell_i$ free pulls to
arm $i$, and only start comparing arm $i$ to the optimal arm after we have
done at least this many pulls. The value of $\ell_i$ is chosen so that, when $t \le n$
and $s_i > \ell_i$,
\[
2c_{t,s_i} \le \mu^* - \mu_i,
\]
which expands to
\[
2\sqrt{\frac{2 \ln t}{s_i}} \le \Delta_i,
\]
giving
\[
s_i \ge \frac{8 \ln t}{\Delta_i^2}.
\]
So we must set $\ell_i$ to be at least
\[
\frac{8 \ln n}{\Delta_i^2} \ge \frac{8 \ln t}{\Delta_i^2}.
\]
Because $\ell_i$ must be an integer, we actually get
\[
\ell_i = \left\lceil \frac{8 \ln n}{\Delta_i^2} \right\rceil \le 1 + \frac{8 \ln n}{\Delta_i^2}.
\]
This explains (after multiplying by the regret $\Delta_i$) the first term in (5.3.20).

For the other sources of bad pulls, apply the union bound to the $2t^{-4}$
error probabilities we previously computed, over all choices of $t \le n$, $s^* \ge 1$,
and $s_i > \ell_i$. This gives
\[
\sum_{t=1}^{n} \sum_{s^*=1}^{t-1} \sum_{s_i=\ell_i+1}^{t-1} 2t^{-4} < 2 \sum_{t=1}^{\infty} t^2 \cdot t^{-4}
= 2 \cdot \frac{\pi^2}{6}
= \frac{\pi^2}{3}.
\]
Again we have to multiply by the regret $\Delta_i$ for pulling the $i$-th arm, which
gives the second term in (5.3.20).
\[
\Pr\left[\lim_{n\to\infty} S_n/n = \mu\right] = 1 \tag{5.4.1}
\]
\[
\lim_{n\to\infty} \Pr\left[\frac{S_n - \mu n}{\sigma\sqrt{n}} \le t\right] = \Phi(t), \tag{5.4.2}
\]
and for any fixed constant $t$, we don't know when the limit behavior actually
starts working.

But there are variants of these theorems that do bound the rate of convergence,
and these can be useful in some cases. An example is given in §5.5.1.
where the $x_i$ are constants with $|x_i| \ge 1$ and the $\epsilon_i$ are independent $\pm 1$ fair
coin-flips, then $\Pr\left[\left|\sum_{i=1}^{n} \epsilon_i x_i\right| \le r\right]$ is maximized by making all the $x_i$ equal
to 1. This shows that any distribution where the $x_i$ are all reasonably large
will not be any more concentrated than a binomial distribution.
There has been a lot of more recent work on variants of the Littlewood-
Offord problem, much of it by Terry Tao and Van Vu. See http://terrytao.
wordpress.com/2009/02/16/a-sharp-inverse-littlewood-offord-theorem/
for a summary of much of this work.
Chapter 6

Randomized search trees

These are data structures that are either trees or equivalent to trees, and use
randomization to maintain balance. We'll start by reviewing deterministic
binary search trees and then add in the randomization.
[Figure: a rotation in a binary search tree, exchanging nodes x and y while preserving the order of the subtrees A, B, and C.]

[Figure: a balanced binary search tree on the keys 1 through 7, next to a badly balanced tree.]
randomized QuickSort (see §1.3.1), so the structure of the tree will exactly
mirror the structure of an execution of QuickSort. So, for example, we
can immediately observe from our previous analysis of QuickSort that the
total path length—the sum of the depths of the nodes—is Θ(n log n),
since the depth of each node is equal to 1 plus the number of comparisons it
participates in as a non-pivot, and (using the same argument as for Hoare’s
FIND in §3.6.4) that the height of the tree is O(log n) with high probability.2
When n is small, randomized binary search trees can look pretty scraggly.
Figure 6.3 shows a typical example.
The problem with this approach in general is that we don’t have any
guarantees that the input will be supplied in random order, and in the
worst case we end up with a linked list, giving O(n) worst-case cost for all
operations.
Footnote 2: The argument for Hoare's FIND is that any node has at most 3/4 of the descendants
of its parent on average; this gives for any node $x$ that $\Pr[\operatorname{depth}(x) > d] \le (3/4)^{d-1} n$, or
a probability of at most $n^{-c}$ that $\operatorname{depth}(x) > 1 + (c+1)\log(n)/\log(4/3) \approx 1 + 6.952 \ln n$
for $c = 1$, which we need to apply the union bound. The right answer for the actual height
of a randomly-generated search tree in the limit is $4.31107 \ln n$ [Dev88], so this bound
is actually pretty close. The real height is still nearly a factor of three worse than for a
completely balanced tree, which has max depth bounded by $1 + \lg n \approx 1 + 1.44269 \ln n$.
[Figure 6.3: a typical scraggly randomized binary search tree on a small set of keys.]
6.3 Treaps
The solution to bad inputs is the same as for QuickSort: instead of assuming
that the input is permuted randomly, we assign random priorities to each
element and organize the tree so that elements with higher priorities rise
to the top. The resulting structure is known as a treap [SA96], because it
satisfies the binary search tree property with respect to keys and the heap
property with respect to priorities.3
There’s an extensive page of information on treaps at http://faculty.
washington.edu/aragon/treaps.html, maintained by Cecilia Aragon, the
co-inventor of treaps; they are also discussed at length in [MR95, §8.2]. We’ll
give a brief description here.
To insert a new node in a treap, first walk down the tree according to
the key and insert the node as a new leaf. Then go back up fixing the heap
property by rotating the new element up until it reaches an ancestor with
the same or higher priority. (See Figure 6.4 for an example.) Deletion is the
reverse of insertion: rotate a node until it has 0 or 1 children (by swapping
with its higher-priority child at each step), and then prune it out, connecting
its child, if any, directly to its parent.
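
A compact Python sketch (mine) of this insertion procedure, with the random priority drawn once when a node is created:

    import random

    class Node:
        def __init__(self, key):
            self.key, self.pri = key, random.random()
            self.left = self.right = None

    def insert(root, key):
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
            if root.left.pri > root.pri:        # heap property violated: rotate right
                top, root.left = root.left, root.left.right
                top.right = root
                return top
        else:
            root.right = insert(root.right, key)
            if root.right.pri > root.pri:       # rotate left
                top, root.right = root.right, root.right.left
                top.left = root
                return top
        return root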
Because of the heap property, the root of each subtree is always the
element in that subtree with the highest priority. This means that the
structure of a treap is completely determined by the priorities and the keys,
no matter what order the elements arrive in. We can imagine in retrospect
Footnote 3: The name “treap” for this data structure is now standard but the history is a little
tricky. According to Seidel and Aragon, essentially the same data structure (though with
non-random priorities) was previously called a cartesian tree by Vuillemin [Vui80], and
the word “treap” was initially applied by McCreight to a different data structure, designed
for storing two-dimensional data, that was called a priority search tree in its published
form [McC85].
1,60 1,60
\ \
2,3 --> 3,26
\ /
(3,26) 2,3
5,78 5,78
/ \ / \
1,60 6,18 --> 1,60 7,41
\ \ \ /
3,26 (7,41) 3,26 6,18
/ \ / \
2,3 4,24 2,3 4,24
Figure 6.4: Inserting values into a treap. Each node is labeled with $k, p$ where
$k$ is the key and $p$ the priority. Insertions of values not requiring rotations
are not shown.
6.3.2 Analysis
The analysis of treaps as carried out by Seidel and Aragon [SA96] is a nice
example of how to decompose a messy process into simple variables, much
like the linearity-of-expectation argument for QuickSort (§1.3.2). The key
observation is that it’s possible to bound both the expected depth of any node
and the number of rotations needed for an insert or delete operation directly
from information about the ancestor-descendant relationship between nodes.
Define two classes of indicator variables. For simplicity, we assume that
the elements have keys 1 through $n$, which we also use as indices. $A_{i,j}$
indicates the event that $i$ is an ancestor of $j$, where $i$ is an ancestor of $j$ if it
appears on the path from the root to $j$; by this convention, every node
is an ancestor of itself. $C_{i;\ell,m}$ indicates the event that $i$ is a common
ancestor of both $\ell$ and $m$.
The nice thing about these indicator variables is that it’s easy to compute
their expectations.
For Ai,j , i will be the ancestor of j if and only if i has a higher priority
than j and there is no k between i and j that has an even higher prior-
ity: in other words, if i has the highest priority of all keys in the interval
[min(i, j), max(i, j)]. To see this, imagine that we are constructing the treap
recursively, by starting with all elements in a single interval and partitioning
each interval by its highest-priority element. Consider the last interval in
this process that contains both i and j, and suppose i < j (the j > i case is
symmetric). If the highest-priority element is some k with i < k < j, then i
and j are separated into distinct intervals and neither is the ancestor of the
other. If the highest-priority element is j, then j becomes the ancestor of i.
The highest-priority element can’t be less than i or greater than j, because
then we get a smaller interval that contains both i and j. So the only case
where i becomes an ancestor of j is when i has the highest priority.
It follows that $E[A_{i,j}] = \frac{1}{|i-j|+1}$, where the denominator is just the
number of elements in the range $[\min(i,j), \max(i,j)]$.
For $C_{i;\ell,m}$, $i$ is the common ancestor of both $\ell$ and $m$ if and only if it has
the highest priority in both $[\min(i,\ell), \max(i,\ell)]$ and $[\min(i,m), \max(i,m)]$.
It turns out that no matter what order $i$, $\ell$, and $m$ come in, these intervals
overlap so that $i$ must have the highest priority in $[\min(i,\ell,m), \max(i,\ell,m)]$.
This gives $E[C_{i;\ell,m}] = \frac{1}{\max(i,\ell,m) - \min(i,\ell,m) + 1}$.
6.3.2.1 Searches

From the $A_{i,j}$ variables we can compute $\operatorname{depth}(j) = \sum_i A_{i,j} - 1$.⁴ So
\[
E[\operatorname{depth}(j)] = \sum_{i=1}^{n} \frac{1}{|i-j|+1} - 1
= \sum_{i=1}^{j} \frac{1}{j-i+1} + \sum_{i=j+1}^{n} \frac{1}{i-j+1} - 1
= \sum_{k=1}^{j} \frac{1}{k} + \sum_{k=2}^{n-j+1} \frac{1}{k} - 1
= H_j + H_{n-j+1} - 2.
\]
4 2 6
/ \ / \ / \
*2 6* => 1 4 or 4 7
/ \ / \ / \ / \
1 *3 5* 7 *3 6* *2 5*
/ \ / \
5* 7 1 *3
Figure 6.5: Rotating 4 right shortens the right spine of its left subtree by
removing 2; rotating left shortens the left spine of the right subtree by
removing 6.
the subtree will be carried out from under the target without ever appearing
as a child or parent of the target. Because each rotation removes exactly one
element from one or the other of the two spines, and we finish when both
are empty, the sum of the length of the spines gives the number of rotations.
To calculate the length of the right spine of the left subtree of some
element $\ell$, start with the predecessor $\ell - 1$ of $\ell$. Because there is no element
between them, either $\ell - 1$ is a descendant of $\ell$ or an ancestor of $\ell$. In the
former case (for example, when $\ell$ is 4 in Figure 6.5), we want to include all
ancestors of $\ell - 1$ up to $\ell$ itself. Starting with $\sum_i A_{i,\ell-1}$ gets all the ancestors
of $\ell - 1$, and subtracting $\sum_i C_{i;\ell-1,\ell}$ removes $\ell$ and its ancestors; in the latter
case, the same difference evaluates to zero.
It follows that the expected length of the right spine of the left subtree is
exactly
\[
E\left[\sum_{i=1}^{n} A_{i,\ell-1} - \sum_{i=1}^{n} C_{i;\ell-1,\ell}\right]
= \sum_{i=1}^{n} \frac{1}{|i-(\ell-1)|+1} - \sum_{i=1}^{n} \frac{1}{\max(i,\ell) - \min(i,\ell-1) + 1}
\]
\[
= \sum_{i=1}^{\ell-1} \frac{1}{\ell-i} + \sum_{i=\ell}^{n} \frac{1}{i-\ell+2} - \sum_{i=1}^{\ell-1} \frac{1}{\ell-i+1} - \sum_{i=\ell}^{n} \frac{1}{i-(\ell-1)+1}
= \sum_{j=1}^{\ell-1} \frac{1}{j} + \sum_{j=2}^{n-\ell+2} \frac{1}{j} - \sum_{j=2}^{\ell} \frac{1}{j} - \sum_{j=2}^{n-\ell+2} \frac{1}{j}
= 1 - \frac{1}{\ell}.
\]
By symmetry, the expected length of the left spine of the right subtree is
$1 - \frac{1}{n-\ell+1}$. So the total expected number of rotations needed to delete the
HEAD TAIL
33 LEVEL 2
13 33 48 LEVEL 1
13 21 33 48 75 99 LEVEL 0
Figure 6.6: A skip list. The blue search path for 99 is superimposed on an
original image from [AS07].
$\ell$-th element is
\[
2 - \frac{1}{\ell} - \frac{1}{n-\ell+1} \le 2.
\]
until it reaches the first element that is also in a higher level; it then jumps
to the next level up and repeats the process. The nice thing about this
reversed process is that it has a simple recursive structure: if we restrict a
skip list to only those nodes to the left of and at the same level or higher
of a particular node, we again get a skip list. Furthermore, the structure of
this restricted skip list depends only on coin-flips taken at nodes within it,
so it’s independent of anything that happens elsewhere in the full skip list.
We can analyze this process by tracking the number of nodes in the
restricted skip list described above, which is just the number of nodes in the
current level that are earlier than the current node. If we move left, this
drops by 1; if up, this drops to p times its previous value on average. So the
number of such nodes Xk after k steps satisfies the probabilistic recurrence
Skip lists are even easier to split or merge than treaps. It’s enough to
cut (or recreate) all the pointers crossing the boundary, without changing
the structure of the rest of the list.
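
For concreteness, here is a Python sketch (mine) of a linked skip-list representation with search and insertion; a node's height is chosen by flipping fair coins, matching $p = 1/2$:

    import random

    class SkipNode:
        def __init__(self, key, height):
            self.key = key
            self.next = [None] * height

    def search(head, key):
        node = head                              # head is a full-height sentinel
        for level in reversed(range(len(head.next))):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
        node = node.next[0]
        return node is not None and node.key == key

    def insert(head, key):
        h = 1
        while h < len(head.next) and random.random() < 0.5:
            h += 1                               # geometric height, capped at the sentinel's
        new = SkipNode(key, h)
        node = head
        for level in reversed(range(len(head.next))):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            if level < h:                        # splice in at the levels the new node spans
                new.next[level] = node.next[level]
                node.next[level] = new

    head = SkipNode(None, 16)
    for k in (13, 21, 33, 48, 75, 99):           # the keys from Figure 6.6
        insert(head, k)
    print(search(head, 99), search(head, 50))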
Chapter 7
Hashing
The basic idea of hashing is that we have keys from a large set U , and we’d
like to pack them in a small set M by passing them through some function
h : U → M , without getting too many collisions, pairs of distinct keys x
and y with h(x) = h(y). Where randomization comes in is that we want this
to be true even if the adversary picks the keys to hash. At one extreme, we
could use a random function, but this will take a lot of space to store.1 So
our goal will be to find functions with succinct descriptions that are still
random enough to do what we want.
The presentation in this chapter is based largely on [MR95, §§8.4-8.5]
(which is in turn based on work of Carter and Wegman [CW77] on universal
hashing and Fredman, Komlós, and Szemerédi [FKS84] on O(1) worst-case
hashing); on [PR04] and [Pag06] for cuckoo hashing; and [MU17, §5.5.3] for
Bloom filters.
Proof. We'll count collisions in the inverse image of each element $z$. Since
all distinct pairs of elements of $h^{-1}(z)$ collide with each other, we have
\[
\delta\left(h^{-1}(z), h^{-1}(z), h\right) = \left|h^{-1}(z)\right| \cdot \left(\left|h^{-1}(z)\right| - 1\right).
\]
Since $1 - \frac{m-1}{|U|-1}$ is likely to be very close to 1, we are happy if we get the
2-universal upper bound of $|H|/m$.
Why we care about this: With a 2-universal hash family, chaining using
linked lists costs O(1 + n/m) expected time per operation. The reason is
that the expected cost of an operation on some key x is proportional to the
size of the linked list at h(x) (plus O(1) for the cost of hashing itself). But
the expected size of this linked list is just the expected number of keys y in
the dictionary that collide with x, which is exactly δ(x, S, H)/|H| ≤ n/m.
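
As a concrete instance, a sketch (mine) of the standard Carter-Wegman construction $h_{a,b}(x) = ((ax + b) \bmod p) \bmod m$, which is 2-universal (up to small rounding effects from the final mod) for integer keys less than the prime $p$:

    import random

    def make_hash(m, p=2**61 - 1):
        a = random.randrange(1, p)
        b = random.randrange(p)
        return lambda x: ((a * x + b) % p) % m

    h = make_hash(1024)
    table = [[] for _ in range(1024)]
    table[h(42)].append(42)                      # chaining: one linked list per bucket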
value; it’s enough if we can order the strings so that each string gets a value
that isn’t represented among its predecessors.
More formally, suppose we can order the strings $x^1, x^2, \dots, x^n$ that we
are hashing so that each has a position $i_j$ such that $x^j_{i_j} \ne x^{j'}_{i_j}$ for any $j' < j$.
Then we have, for each value $v$, $\Pr\left[h(x^j) = v \mid h(x^{j'}) = v_{j'}\ \forall j' < j\right] = 1/m$.
It follows that the hash values are independent:
\[
\Pr\left[h(x^1) = v_1, h(x^2) = v_2, \dots, h(x^n) = v_n\right]
= \prod_{j=1}^{n} \Pr\left[h(x^j) = v_j \mid h(x^1) = v_1, \dots, h(x^{j-1}) = v_{j-1}\right]
= \frac{1}{m^n}
= \prod_{j=1}^{n} \Pr\left[h(x^j) = v_j\right].
\]
Now we want to show that when $n = 3$, this actually works for all possible
distinct strings $x$, $y$, and $z$. Let $S$ be the set of indices $i$ such that $y_i \ne x_i$,
and similarly let $T$ be the set of indices $i$ such that $z_i \ne x_i$; note that both
sets must be non-empty, since $y \ne x$ and $z \ne x$. If $S \setminus T$ is nonempty, then
(a) there is some index $i$ in $T$ where $z_i \ne x_i$, and (b) there is some index $j$ in
$S \setminus T$ where $y_j \ne x_j = z_j$; in this case, ordering the strings as $x$, $z$, $y$ gives
the independence property above. If $T \setminus S$ is nonempty, order them as $x$, $y$,
$z$ instead. Alternatively, if $S = T$, then $y_i \ne z_i$ for some $i$ in $S$ (otherwise
$y = z$, since they both equal $x$ on all positions outside $S$). In this case, $x_i$,
$y_i$, and $z_i$ are all distinct.
For n = 4, we can have strings aa, ab, ba, and bb. If we take the
bitwise exclusive OR of all four hash values, we get zero, because each
character is included exactly twice in each position. So the hash values are
not independent, and we do not get 4-independence in general.
However, even though the outputs of tabulation hashing are not 4-
independent, most reasonably small sets of inputs do give independence.
This can be used to show various miraculous properties like working well for
the cuckoo hashing algorithm described in §7.4.
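
A sketch (mine) of tabulation hashing itself, for keys split into four 8-bit characters:

    import random

    def make_tabulation_hash(chars=4, char_bits=8, out_bits=32):
        tables = [[random.getrandbits(out_bits) for _ in range(1 << char_bits)]
                  for _ in range(chars)]
        mask = (1 << char_bits) - 1
        def h(x):
            v = 0
            for i in range(chars):
                v ^= tables[i][(x >> (char_bits * i)) & mask]   # XOR one entry per character
            return v
        return h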
not consuming too much space. The assumption that S is static is critical,
because FKS chooses hash functions based on the elements of S.
If we were lucky in our choice of S, we might be able to do this with
standard hashing. A perfect hash function for a set $S \subseteq U$ is a hash
function $h : U \to M$ that is injective on $S$ (that is, $x \ne y$ implies $h(x) \ne h(y)$
when $x, y \in S$). Unfortunately, we can only count on finding a perfect hash
function if $m$ is large:
Lemma 7.3.1. If $H$ is 2-universal and $|S| = n$ with $\binom{n}{2} \le m$, then there is
a perfect $h \in H$ for $S$.
Proof. Let $h$ be chosen uniformly at random from $H$. For each unordered pair
$x \ne y$ in $S$, let $X_{xy}$ be the indicator variable for the event that $h(x) = h(y)$,
and let $C = \sum_{x \ne y} X_{xy}$ be the total number of collisions in $S$. Each $X_{xy}$
has expectation at most $1/m$, so $E[C] \le \binom{n}{2}/m < 1$. But we can write
$E[C]$ as $E[C \mid C = 0] \Pr[C = 0] + E[C \mid C \ge 1] \Pr[C \ne 0] \ge \Pr[C \ne 0]$. So
$\Pr[C \ne 0] \le \binom{n}{2}/m < 1$, giving $\Pr[C = 0] > 0$. But if $C$ is zero with nonzero
probability, then some specific $h \in H$ must have no collisions on $S$, and this
$h$ is perfect for $S$.
The time to do a search is O(1) in the worst case: O(1) for the outer hash
plus O(1) for the inner hash.
The last equality holds because each ordered pair of distinct values in $S$ that
map to the same bucket $i$ corresponds to exactly one collision in $\delta(S, S, h)$.
Since $H$ is 2-universal, we have $\delta(S, S, H) \le |H| \frac{|S|(|S|-1)}{n} = |H| \frac{n(n-1)}{n} = |H|(n-1)$. But then the pigeonhole principle says there exists some $h \in H$
with $\delta(S, S, h) \le \frac{1}{|H|} \delta(S, S, H) = n - 1$. Choosing this $h$ gives $\sum_{i=1}^{n} n_i^2 \le n + (n-1) = 2n - 1 = O(n)$.
The main difference is that it uses just one table instead of the two tables—one
for each hash function—in [PR04].
7.4.1 Structure
We have a table T of size n, with two separate, independent hash functions h1
and h2 . These functions are assumed to be k-universal for some sufficiently
large value k; as long as we never look at more than k values at once, this
means we can treat them effectively as random functions. In practice, using
crummy hash functions seems to work just fine, a common property of hash
tables. There are also specific hash functions that have been shown to work
with particular variants of cuckoo hashing [PR04, PT12]. We will avoid these
issues by assuming that our hash functions are actually random.
Every key x is stored either in T [h1 (x)] or T [h2 (x)]. So the search
procedure just looks at both of these locations and returns whichever one
contains x (or fails if neither contains x).
To insert a value x1 = x, we must put it in T [h1 (x1 )] or T [h2 (x1 )]. If one
or both of these locations is empty, we put it there. Otherwise we have to
kick out some value that is in the way (this is the “cuckoo” part of cuckoo
hashing, named after the bird that leaves its eggs in other birds’ nests). We
do this by letting x2 = T [h1 (x1 )] and writing x1 to T [h1 (x1 )]. We now have
a new “nestless” value x2 , which we swap with whatever is in T [h2 (x2 )]. If
that location was empty, we are done; otherwise, we get a new value x3 that
we have to put in T [h1 (x3 )] and so on. The procedure terminates when we
find an empty spot or if enough iterations have passed that we don’t expect
to find an empty spot, in which case we rehash the entire table. This process
can be implemented succinctly as shown in Algorithm 7.1.
A detail not included in the above code is that we always rehash (in
theory) after m2 insertions; this avoids potential problems with the hash
functions used in the paper not being universal enough. We will avoid this
issue by assuming that our hash functions are actually random (instead of
being approximately n-universal with reasonably high probability). For a
more principled analysis of where the hash functions come from, see [PR04].
An alternative hash family that is known to work for a slightly different
variant of cuckoo hashing is tabulation hashing, as described in §7.2.2; the
proof that this works is found in [PT12].
1 procedure insert(x)
2     if T[h1(x)] = x or T[h2(x)] = x then
3         return
4     pos ← h1(x)
5     for i ← 1 . . . n do
6         if T[pos] = ⊥ then
7             T[pos] ← x
8             return
9         x ↔ T[pos]
10        if pos = h1(x) then
11            pos ← h2(x)
12        else
13            pos ← h1(x)
14    rehash the entire table and reinsert x

Algorithm 7.1: Insertion into a cuckoo hash table. Line 9 swaps x with the value stored in T[pos]; line 14 covers the case where the loop runs out of iterations.
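
A direct Python rendering (mine) of Algorithm 7.1, leaving the rehash to the caller:

    def cuckoo_insert(T, h1, h2, x, limit):
        if T[h1(x)] == x or T[h2(x)] == x:
            return True                          # already present
        pos = h1(x)
        for _ in range(limit):
            if T[pos] is None:
                T[pos] = x
                return True
            x, T[pos] = T[pos], x                # kick out whatever is in the way
            pos = h2(x) if pos == h1(x) else h1(x)
        return False                             # caller should rehash and retry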
7.4.2 Analysis
The main question is how long it takes the insertion procedure to terminate,
assuming the table is not too full.
First let’s look at what happens during an insert if we have many nestless
values. We have a sequence of values x1 , x2 , . . . , where each pair of values
xi , xi+1 collides in h1 or h2 . Assuming we don’t reach the loop limit, there
are three main possibilities (the leaves of the tree of cases below):
2. Eventually we see the same key twice; there is some i and j > i such
that xj = xi . Since xi was already moved once, when we reach it
the second time we will try to move it back, displacing xi−1 . This
process continues until we have restored x2 to T [h1 (x1 )], displacing x1
to T [h2 (x1 )] and possibly creating a new sequence of nestless values.
Two outcomes are now possible:
Let’s look at the probability that we get the last case, a closed loop.
Following the argument of Pagh and Rodler, we let v be the number of
distinct nestless keys in the loop. Since v includes x1 and at least one other
element blocking x1 from being inserted at T [h1 (x1 )], v is at least 2. We can
now count how many different ways such a loop can form, and argue that in
each case we include enough information to reconstruct h1 (ui ) and h2 (ui )
for each of a specific set of unique elements u1 , . . . uv .
Formally, this means that we are expressing the closed-loop case as a
union of many specific closed loops, and then bounding the probability of
each of these specific closed-loop events by the probability of the event that
$h_1$ and $h_2$ select the right values to make this particular closed loop possible.
Then we apply the union bound.
To describe each of the specific events, we’ll provide this information:
$h_1(u_i)$ or $h_2(u_i)$ from the value of $u_i$ and its first location, and the other
hash value for $u_i$ given the next location in the list.⁴
The series converges if 2n/m < 1, so for any fixed α < 1/2, the probability
of any closed loop forming is O(m−2 ).
If we do hit a closed loop, then we pay O(m) time to scan the existing
table and create a new empty table, and O(n) = O(m) time on average to
reinsert all the elements into the new table, assuming that this reinsertion
process doesn’t generate any more closed loops and that the average cost of an
insertion that doesn’t produce a closed loop is O(1), which we will show below.
But the rehashing step only fails with probability O(nm−2 ) = O(m−1 ), so if
it does fail we can just try again until it works, and the expected total cost
is still O(m). Since we pay this O(m) for each insertion with probability
O(m−2 ), this adds only O(m−1 ) to the expected cost of a single insertion.
Now we look at what happens if we don’t get a closed loop. This doesn’t
force us to rehash, but if the path is long enough, we may still pay a lot to
do an insertion.
It's a little messy to analyze the behavior of keys that appear more than
once in the sequence, so the trick used in the paper is to observe that for any
sequence of nestless keys $x_1 \dots x_p$, there is a subsequence of size $p/3$ with no
repetitions that starts with $x_1$. This will be either the sequence $S_1$ given by
$x_1 \dots x_{j-1}$, the sequence starting with the first place we try to insert $x_1$; or
$S_2$ given by $x_1 = x_{i+j-1} \dots x_p$, the sequence starting with the second place
we try to insert $x_1$. Between these we have a third sequence $R$ where we
Footnote 4: The original analysis in [PR04] avoids this by alternating between two tables, so that
we can determine which of $h_1$ or $h_2$ is used at each step by parity.
revert some of the moves made in S1 . Because |S1 | + |R| + |S2 | ≥ p, at least
one of these three subsequences has size p/3. But |R| ≤ |S1 |, so it must be
either S1 or S2 .
We can then argue that the probability that we get a sequence of $v$ distinct
keys in either $S_1$ or $S_2$ is at most $2(n/m)^{v-1}$. The $(n/m)^{v-1}$ is because we
need to hit a nonempty spot (which happens with probability at most $n/m$)
for each of the first $v-1$ elements in the path, and since we assume that our hash
functions are random, the choices of these $v-1$ spots are all independent.
The 2 is from the union bound over $S_1$ and $S_2$. If $T$ is the length of the longer
of $S_1$ or $S_2$, we get $E[T] = \sum_{v=1}^{\infty} \Pr[T \ge v] \le \sum_{v=1}^{\infty} 2(n/m)^{v-1} = O(1)$,
assuming $n/m$ is bounded by a constant less than 1. Since we already need
$n/m \le 1/2$ to avoid the bad closed-loop case, we can use this here as well.
We have to multiply $E[T]$ by 3 to get the bound on the actual path, but this
disappears into the $O(1)$.
An annoyance with cuckoo hashing is that it has high space overhead
compared to more traditional hash tables: in order for the first part of the
analysis above to work, the table must be at least half empty. This can be
avoided at the cost of increasing the time complexity by choosing among
$d$ locations instead of 2. This technique, due to Fotakis et al. [FPSS03], is
known as d-ary cuckoo hashing. For a suitable choice of $d$, it uses $(1+\epsilon)n$
space and guarantees that a lookup takes $O(1/\epsilon)$ probes, while insertion
takes $(1/\epsilon)^{O(\log\log(1/\epsilon))}$ steps in theory and appears to take $O(1/\epsilon)$ steps in
practice according to experiments done by the authors.
7.6.1 Construction
Bloom filters are a highly space-efficient randomized data structure invented
by Burton H. Bloom [Blo70] for storing sets of keys, with a small probability
for each key not in the set that it will be erroneously reported as being in
the set.
Suppose we have k independent hash functions h1 , h2 , . . . , hk . Our mem-
ory store A is a vector of m bits, all initially zero. To store a key x, set
A[hi (x)] = 1 for all i. To test membership for x, see if A[hi (x)] = 1 for all
i. The membership test always gives the right answer if x is in fact in the
Bloom filter. If not, we might decide that x is in the Bloom filter anyway,
just because we got lucky.
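
A sketch (mine) of this construction; Python's built-in hash applied to a salted pair stands in for the $k$ independent random hash functions:

    import random

    class BloomFilter:
        def __init__(self, m, k):
            self.m = m
            self.bits = bytearray(m)
            self.salts = [random.getrandbits(64) for _ in range(k)]

        def _positions(self, x):
            return (hash((salt, x)) % self.m for salt in self.salts)

        def add(self, x):
            for i in self._positions(x):
                self.bits[i] = 1

        def __contains__(self, x):
            return all(self.bits[i] for i in self._positions(x))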
the kn bits that we set while inserting the n values has one chance in m of
hitting position i.
We'd like to simplify this using the inequality $1 + x \le e^x$, but it goes
in the wrong direction; instead, we'll use $1 - x \ge e^{-x-x^2}$, which holds for
$0 \le x \le 0.683803$ and in our application holds for $m \ge 2$. This gives
\[
\Pr[A[i] = 1] \le 1 - (1 - 1/m)^{kn}
\le 1 - e^{-kn(1/m)(1+1/m)}
= 1 - e^{-k\alpha(1+1/m)}
= 1 - e^{-k\alpha'},
\]
where $\alpha = n/m$ is the load factor and $\alpha' = \alpha(1 + 1/m)$ is the load factor
fudged upward by a factor of $1 + 1/m$ to make the inequality work.
Suppose now that we check to see if some value x that we never inserted
in the Bloom filter appears to be present anyway. This occurs if A[hi (x)] = 1
for all i. Since each A[hi (x)] is an independent sample from A, the probability
that they all come up 1 conditioned on A is
\[
\left(\frac{\sum_i A[i]}{m}\right)^k. \tag{7.6.1}
\]
We have an upper bound $E\left[\sum_i A[i]\right] \le m\left(1 - e^{-k\alpha'}\right)$, which bounds the
false positive probability by roughly $\left(1 - e^{-k\alpha'}\right)^k$. We can minimize this as
a function of $k$ by
doing the usual trick of taking a derivative and setting it to zero; to avoid
weirdness with the $k$ in the exponent, it helps to take the logarithm first
(which doesn't affect the location of the minimum), and it further helps to
take the derivative with respect to $x = e^{-\alpha' k}$ instead of $k$ itself. Note that
when we do this, $k = -\frac{1}{\alpha'} \ln x$ still depends on $x$, and we will deal with this
by applying this substitution at an appropriate point.
Compute
\[
\frac{d}{dx} \ln\left((1-x)^k\right) = \frac{d}{dx}\left(k \ln(1-x)\right)
= \frac{d}{dx}\left(-\frac{1}{\alpha'} \ln x \ln(1-x)\right)
= -\frac{1}{\alpha'}\left(\frac{\ln(1-x)}{x} - \frac{\ln x}{1-x}\right).
\]
the false positive rate)⁶ and for any set $A \subseteq U$ of size $n$, $A \subseteq S_i$ for at least
one $S_i$ (allowing us to store $A$).

Let $N = |U|$. Then each set $S_i$ covers $\binom{\epsilon N}{n}$ of the $\binom{N}{n}$ subsets of size $n$. If
we could get them to overlap optimally (we can't), we'd still need a minimum
of $\binom{N}{n} / \binom{\epsilon N}{n} = (N)_n / (\epsilon N)_n \approx (1/\epsilon)^n$ sets to cover everybody, where the
approximation assumes $N \gg n$. Taking the log gives $\lg M \approx n \lg(1/\epsilon)$,
meaning we need about $\lg(1/\epsilon)$ bits per key for the data structure. Bloom
filters use $1/\ln 2$ times this.
There are known data structures that approach this bound asymptotically.
The first of these, due to Pagh et al. [PPR05] also has other desirable
properties, like supporting deletions and faster lookups if we can’t look up
bits in parallel.
More recently, Fan et al. [FAKM14] have described a variant of cuckoo
hashing (see §7.4) called a cuckoo filter. This is a cuckoo hash table that,
instead of storing full keys $x$, stores fingerprints $f(x)$, where $f$ is a hash
function with $\ell$-bit outputs. False positives now arise if we happen to hash a
value $x'$ with $f(x') = f(x)$ to the same location as $x$. If $f$ is drawn from a
2-universal family, this occurs with probability at most $2^{-\ell}$. So the idea is
that by accepting a small rate $\epsilon$ of false positives, we can shrink the space
needed to store each key from the full key length to $\lg(1/\epsilon) = \ln(1/\epsilon)/\ln 2$,
the asymptotic minimum.
One complication is that, since we are throwing away the original key x,
when we displace a key from h1 (x) to h2 (x) or vice versa, we can’t recompute
h1 (x) and h2 (x) for arbitrary h1 and h2 . The solution proposed by Fan et al.
is to let h2 (x) = h1 (x) ⊕ g(f (x)), where g is a hash function that depends
only on the fingerprint. This means that when looking at a fingerprint f (x)
stored in position i, we don’t need to know whether i is h1 (x) or h2 (x), since
whichever it is, the other location will be i ⊕ g(f (x)). Unfortunately, this
technique and some other techniques used in the paper to crunch out excess
empty space break the standard analysis of cuckoo hashing, so the authors
can only point to experimental evidence that their data structure actually
Footnote 6: Technically, this gives a weaker bound on false positives. For standard Bloom filters,
assuming random hash functions, each key individually has at most an $\epsilon$ probability
of appearing as a false positive. The hypothetical data structure we are considering
here (which is effectively deterministic) allows the set of false positives to depend directly
on the set of keys actually inserted in the data structure, meaning that the adversary
could arrange for a specific key to appear as a false positive with probability 1 by choosing
appropriate keys to insert. So this argument may underestimate the space needed to
make the false positives less predictable. On the other hand, we aren't charging the Bloom
filter for the space needed to store the hash functions, which could be quite a bit if they
are genuine random functions.
works. However, a variant of this data structure has been shown to work by
Eppstein [Epp16].
7.6.4 Applications
Historically, Bloom filters were invented to act as a way of filtering queries
to a database table through fast but expensive7 RAM before looking up the
actual values on a slow but cheap tape drive. Nowadays the cost of RAM is
low enough that this is less of an issue in most cases, but Bloom filters are
still popular in networking and in distributed databases.
In networking, Bloom filters are useful in building network switches, where
incoming packets need to be matched against routing tables in fractions of a
nanosecond. Bloom filters work particularly well for this when implemented
in hardware, since the k hash functions can be computed in parallel. False
positives, if infrequent enough, can be handled by some slower backup
mechanism.
In distributed databases, Bloom filters are used in the Bloomjoin algo-
rithm [ML86]. Here we want to do a join on two tables stored on different
machines (a join is an operation where we find all pairs of rows, one in each
table, that match on some common key). A straightforward but expensive
way to do this is to send the list of keys from the smaller table across the
network, then match them against the corresponding keys from the larger
table. If there are $n_s$ rows in the smaller table, $n_b$ rows in the larger table,
and $j$ matching rows in the larger table, this requires sending $n_s$ keys plus $j$
rows. If instead we send a Bloom filter representing the set of keys in the
smaller table, we only need to send $n_s \lg(1/\epsilon)/\ln 2$ bits for the Bloom filter
plus an extra $\epsilon n_b$ rows on average for the false positives. This can be cheaper
than sending full keys across if the number of false positives is reasonably
small.
once it reaches this value, further increments have no effect. The resulting
structure is called a counting Bloom filter, due to Fan et al. [FCAB00].
We can only expect this to work if our chance of hitting the cap is small.
Fan et al. observe that the probability that the m table entries include one
that is at least c after n insertions is bounded by
\[
m \binom{nk}{c} \frac{1}{m^c} \le m \left(\frac{enk}{c}\right)^c \frac{1}{m^c}
= m \left(\frac{enk}{cm}\right)^c
= m \left(ek\alpha/c\right)^c.
\]
(This uses the bound $\binom{n}{k} \le \left(\frac{en}{k}\right)^k$, which follows from Stirling's formula.)

For $k = \frac{1}{\alpha} \ln 2$, this is $m(e \ln 2 / c)^c$. For the specific value of $c = 16$
(corresponding to 4 bits per entry), they compute a bound of $1.37 \times 10^{-15} m$,
which they argue is minuscule for all reasonable values of $m$ (it's a systems
paper).
The possibility that a long chain of alternating insertions and deletions
might produce a false negative due to overflow is considered in the paper,
but the authors state that “the probability of such a chain of events is so
low that it is much more likely that the proxy server would be rebooted in
the meantime and the entire structure reconstructed.” An alternative way of
dealing with this problem is to never decrement a maxed-out register. This
never produces a false negative, but may cause the filter to slowly fill up
with maxed-out registers, producing a higher false-positive rate.
A fancier variant of this idea is the spectral Bloom filter of Cohen
and Matias [CM03], which uses larger counters to track multiplicities of
items. The essential idea here is that we can guess that the number of times
a particular value $x$ was inserted is equal to $\min_{i=1}^{k} A[h_i(x)]$, with some
extra tinkering to detect errors based on deviations from the typical joint
distribution of the $A[h_i(x)]$ values. An even more sophisticated approach
gives the count-min sketches of the next section.
of data sets that are too large to store at all (network traffic statistics), or
too large to store in fast memory (very large database tables). By building
an appropriate small data structure using a single pass through the data,
we can still answer queries about the data with some loss of accuracy. Examples we
will consider include estimating the size of a set presented over time with
possible duplicate elements (§7.7.1) or more general statistical queries based
on aggregate counts of some sort (§7.7.2).

In each of these cases, the answers we get will be approximate. We
will measure the quality of the approximation in terms of parameters $(\delta, \epsilon)$,
where we demand a relative error of at most $\epsilon$ with probability at least $1 - \delta$.
We'd also like our data structure to have size at most polylogarithmic in the
number of samples $n$ and polynomial in $1/\delta$ and $1/\epsilon$.
\[
\frac{1}{\frac{1}{m} \sum_i \frac{1}{\hat{n}_i}},
\]
and thus
\[
\frac{1}{p_s} = \frac{1}{\frac{1}{m} \sum_i 2^{-r_i - 1}},
\]
which looks suspiciously like the harmonic mean used on the final estimates
in HyperLogLog. As with the original HyperLogLog, it is possible to show
that the typical relative error for this sketch is $O(1/\sqrt{m})$. See [PWY20] for
more details and some further improvements.
If we don’t care about practical engineering issues, there is a known asymp-
totically optimal solution to the cardinality estimation problem [KNW10],
which doesn’t even require assuming a random oracle, but the constants give
worse performance than the systems that people actually use.
used for more complex tasks like finding heavy hitters—indices with high
weight. The easiest case is approximating ai when all the ct are non-negative,
so we’ll start with that.
7.7.2.2 Queries

Let's start with point queries. Here we want to estimate $a_i$ for some
fixed $i$. There are two cases; the first handles non-negative increments only,
while the second handles arbitrary increments. In both cases we will get an
estimate whose error is linear in both the error parameter $\epsilon$ and the $\ell_1$-norm
$\|a\|_1 = \sum_i |a_i|$ of $a$. It follows that the relative error will be low for heavy
points, but we may get a large relative error for light points (and especially
large for points that don't appear in the data set at all).

For the non-negative case, to estimate $a_i$, compute $\hat{a}_i = \min_j c[j, h_j(i)]$.
(This is the min part of count-min.) Then:
Lemma 7.7.1. When all $c_t$ are non-negative, for $\hat{a}_i$ as defined above:
\[
\hat{a}_i \ge a_i, \tag{7.7.1}
\]
and
\[
\Pr\left[\hat{a}_i > a_i + \epsilon \|a\|_1\right] \le \delta. \tag{7.7.2}
\]
Proof. The lower bound is easy. Since for each pair $(i, c_t)$ we increment each
$c[j, h_j(i)]$ by $c_t$, we have an invariant that $a_i \le c[j, h_j(i)]$ for all $j$ throughout
the computation, which gives $a_i \le \hat{a}_i = \min_j c[j, h_j(i)]$.

For the upper bound, let $I_{ijk}$ be the indicator for the event that $(i \ne k) \wedge (h_j(i) = h_j(k))$, i.e., that we get a collision between $i$ and $k$ using $h_j$.
The 2-universality property of the $h_j$ gives $E[I_{ijk}] \le 1/w \le \epsilon/e$.

Now let $X_{ij} = \sum_{k=1}^{n} I_{ijk} a_k$. Then $c[j, h_j(i)] = a_i + X_{ij}$. (The fact that
$X_{ij} \ge 0$ gives an alternate proof of the lower bound.) Now use linearity of
expectation to get
\[
E[X_{ij}] = E\left[\sum_{k=1}^{n} I_{ijk} a_k\right]
= \sum_{k=1}^{n} a_k E[I_{ijk}]
\le \sum_{k=1}^{n} a_k (\epsilon/e)
= (\epsilon/e) \|a\|_1.
\]
So $\Pr\left[c[j, h_j(i)] > a_i + \epsilon\|a\|_1\right] = \Pr\left[X_{ij} > \epsilon\|a\|_1\right] \le \Pr\left[X_{ij} > e \cdot E[X_{ij}]\right] < 1/e$, by Markov's
inequality. With $d$ choices for $j$, and each $h_j$ chosen independently, the probability that every count is too big is at most $(1/e)^d = e^{-d} \le \exp(-\ln(1/\delta)) = \delta$.
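
A sketch (mine) of the whole structure for the non-negative case, with salted built-in hashes standing in for the 2-universal $h_j$ and the width and depth chosen as in the analysis ($w = \lceil e/\epsilon \rceil$, $d = \lceil \ln(1/\delta) \rceil$):

    import math, random

    class CountMin:
        def __init__(self, epsilon, delta):
            self.w = math.ceil(math.e / epsilon)
            self.d = math.ceil(math.log(1 / delta))
            self.salts = [random.getrandbits(64) for _ in range(self.d)]
            self.table = [[0] * self.w for _ in range(self.d)]

        def update(self, i, c=1):
            for j, salt in enumerate(self.salts):
                self.table[j][hash((salt, i)) % self.w] += c

        def point_query(self, i):
            return min(self.table[j][hash((salt, i)) % self.w]
                       for j, salt in enumerate(self.salts))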
Now let's consider the general case, where the increments $c_t$ might be
negative. We still initialize and update the data structure as described in
§7.7.2.1, but now when computing $\hat{a}_i$, we use the median count instead of
the minimum count: $\hat{a}_i = \operatorname{median}\{c[j, h_j(i)] \mid j = 1 \dots d\}$. Now we get:

Lemma 7.7.2. For $\hat{a}_i$ as defined above,
\[
\Pr\left[|\hat{a}_i - a_i| > 3\epsilon\|a\|_1\right] < \delta^{1/4}. \tag{7.7.3}
\]
Proof. The basic idea is that for the median to be off by $t$, at least $d/2$
rows must give values that are off by $t$. We'll show that for $t = 3\epsilon\|a\|_1$, the
expected number of rows that are off by $t$ is at most $d/8$. Since the hash
functions for the rows are chosen independently, we can use Chernoff bounds
to show that with a mean of $d/8$, the chances of getting all the way to $d/2$
are small.
In detail, we again define the error term $X_{ij}$ as above, and observe that
\[
E[|X_{ij}|] = E\left[\left|\sum_k I_{ijk} a_k\right|\right]
\le \sum_{k=1}^{n} |a_k| E[I_{ijk}]
\le \sum_{k=1}^{n} |a_k| (\epsilon/e)
= (\epsilon/e) \|a\|_1.
\]
Using Markov's inequality, we get $\Pr\left[|X_{ij}| > 3\epsilon\|a\|_1\right] \le \Pr\left[|X_{ij}| > 3e \cdot E[|X_{ij}|]\right] < \frac{1}{3e} < 1/8$. In order for the median to be off by more than $3\epsilon\|a\|_1$, we need
$d/2$ of these low-probability events to occur. The expected number that
occur is $\mu = d/8$, so applying the standard Chernoff bound (5.2.1) with $\delta = 3$
we are looking at
\[
\Pr[S \ge d/2] = \Pr[S \ge (1+3)\mu]
\le \left(\frac{e^3}{4^4}\right)^{d/8}
\le \left(e^{3/8}/2\right)^{\ln(1/\delta)}
= \delta^{\ln 2 - 3/8}
< \delta^{1/4}.
\]
(The actual exponent is about 0.31, but 1/4 is easier to deal with.) This
immediately gives (7.7.3).
One way to think about this is that getting an estimate within kak1 of
the right value with probability at least 1 − δ requires 3 times the width and
4 times the depth—or 12 times the space and 4 times the time—when we
aren’t assuming increments are non-negative.
Next, we consider inner products. Here we want to estimate $a \cdot b$, where
$a$ and $b$ are both stored as count-min sketches using the same hash functions.
The paper concentrates on the case where $a$ and $b$ are both non-negative,
which has applications in estimating the size of a join in a database. The
method is to estimate $a \cdot b$ as $\min_j \sum_{k=1}^{w} c_a[j,k] \cdot c_b[j,k]$.
For a single $j$, the sum consists of both good values and bad collisions; we have
\[
\sum_{k=1}^{w} c_a[j,k] \cdot c_b[j,k] = \sum_{i=1}^{n} a_i b_i + \sum_{\substack{p \ne q \\ h_j(p) = h_j(q)}} a_p b_q.
\]
The second term has expectation
\[
\sum_{p \ne q} \Pr\left[h_j(p) = h_j(q)\right] a_p b_q \le \sum_{p \ne q} (\epsilon/e) a_p b_q
\le \sum_{p,q} (\epsilon/e) a_p b_q
= (\epsilon/e) \|a\|_1 \|b\|_1.
\]
be at most 1/φ heavy hitters. But the tricky part is figuring out which
elements they are.
The output at any stage will be approximate in the following sense:
it is guaranteed that any $i$ such that $a_i \ge \phi\|a\|_1$ is included, and each $i$
with $a_i < (\phi - \epsilon)\|a\|_1$ that previously appeared in the stream is included with
probability at most $\delta$. This is similar to what we would get if we just
ran a point query on all possible $i$, but (a) there are many possible $i$ and (b)
we won't ever output an $i$ we've never seen.
The trick is to extend the data structure and update procedure to track
all the heavy elements found so far (stored in a heap, with the minimum
estimate at the top), as well as $\|a\|_1 = \sum_t c_t$. When a new increment $(i, c)$
comes in, we first update the count-min structure and then do a point query
on $a_i$; if $\hat{a}_i \ge \phi\|a\|_1$, we insert $i$ into the heap. We also delete any elements
at the top of the heap that have a point-query estimate below threshold.
Because âi ≥ ai , every heavy hitter is correctly identified. However, it’s
possible that an index stops being a heavy hitter at some point (because the
threshold φkak1 rose since we included it). In this case it may get removed
from the heap, but if it becomes a heavy hitter again, we’ll put it back.
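A sketch of this tracker, reusing the hypothetical CountMinSketch class from the earlier sketch; phi is the heavy-hitter threshold. Heap entries are keyed by their estimate at insertion time, so the eviction loop re-checks current estimates rather than trusting the stale heap order.

    from heapq import heappush, heappop

    class HeavyHitters:
        """Heap-on-top-of-count-min heavy-hitters tracker (illustrative sketch)."""

        def __init__(self, width, depth, phi):
            self.cm = CountMinSketch(width, depth)
            self.phi = phi
            self.total = 0                  # running ||a||_1
            self.heap = []                  # (estimate at insertion time, item)
            self.in_heap = set()

        def update(self, i, c=1):
            self.cm.update(i, c)
            self.total += c
            est = self.cm.query_min(i)
            if est >= self.phi * self.total and i not in self.in_heap:
                heappush(self.heap, (est, i))
                self.in_heap.add(i)
            # evict top items whose current estimate fell below threshold
            # (best-effort: heap keys may be stale, so we re-query)
            while self.heap:
                _, j = self.heap[0]
                if self.cm.query_min(j) >= self.phi * self.total:
                    break
                heappop(self.heap)
                self.in_heap.discard(j)

        def heavy_hitters(self):
            return [j for _, j in self.heap]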
which an $\varepsilon$-PLEB data structure with radius $\ell$ and centers $P$ returns a point $p$ with $d(q,p) \le (1+\varepsilon)\ell$; then return this point as the approximate nearest neighbor.
This requires $O(\log_{1+\varepsilon} R)$ instances of the $\varepsilon$-PLEB data structure and $O(\log\log_{1+\varepsilon} R)$ queries. The blowup as a function of $R$ can be avoided using a more sophisticated data structure called a ring-cover tree, defined in the paper. We won't talk about ring-cover trees because they are (a) complicated and (b) not randomized. Instead, we'll move directly to the question of how we solve $\varepsilon$-PLEB.
These are useful if $p_1 > p_2$ and $r_1 < r_2$; that is, we are more likely to hash inputs together if they are closer. Ideally, we can choose $r_1$ and $r_2$ to build $\varepsilon$-PLEB data structures for a range of radii sufficient to do binary search as described above (or build a ring-cover tree if we are doing it right). For the moment, we will aim for an $(r_1, r_2)$-PLEB data structure, which returns a point within $r_1$ with high probability if one exists, and never returns a point farther away than $r_2$.
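For intuition, here is a minimal sketch of the bit-sampling family for Hamming distance, where each basic hash just reads one random coordinate, so a point at distance $r$ from the query agrees on a sampled bit with probability $1 - r/d$; the amplified function $g$ concatenates $k$ sampled bits, and we keep $\ell$ independent tables. Parameter choices below are placeholders, not the tuned values from the analysis.

    import random

    def make_bit_sampler(d, k):
        """One amplified hash g: sample k random coordinates of a d-bit vector.
        Pr[g(x) = g(y)] = (1 - ham(x, y)/d)^k under Hamming distance."""
        coords = [random.randrange(d) for _ in range(k)]
        return lambda x: tuple(x[c] for c in coords)

    def build_tables(points, d, k, ell):
        """ell independent amplified hashes, each with its own bucket table."""
        hashes = [make_bit_sampler(d, k) for _ in range(ell)]
        tables = []
        for g in hashes:
            buckets = {}
            for p in points:
                buckets.setdefault(g(p), []).append(p)
            tables.append(buckets)
        return hashes, tables

    def query(q, hashes, tables, r2):
        """Return any stored point within Hamming distance r2 of q, if found."""
        for g, buckets in zip(hashes, tables):
            for p in buckets.get(g(q), []):
                if sum(a != b for a, b in zip(p, q)) <= r2:
                    return p
        return None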
There is some similarity between locality-sensitive hashing and a more general dimension-reduction technique known as the Johnson-Lindenstrauss lemma [JL84]; this says that projecting $n$ points in a high-dimensional space to $O(\varepsilon^{-2}\log n)$ dimensions using an appropriate random matrix preserves $\ell_2$ distances between the points to within $\varepsilon$ relative error (in fact, even a random matrix with entries in $\{-1, 0, +1\}$ is enough [Ach03]). Unfortunately, dimension reduction by itself is not enough to solve approximate nearest neighbors in sublinear time, because we may still need to search a number of boxes exponential in $O(\varepsilon^{-2}\log n)$, which will be polynomial in $n$. But we'll look at the Johnson-Lindenstrauss lemma and its many other applications more closely in Chapter 8.
$g_j(p)$ for each $j$; these buckets are themselves stored in a hash table (by hashing the value of $g_j(p)$ down further) so that they fit in $O(n)$ space.
Suppose now that $d(p, q) \le r_1$ for some $p$. Then for each $j$, the probability that $g_j(p) = g_j(q)$ is at least $p_1^k = n^{-\rho} = 1/\ell$.
These are not particularly clever hash functions, so the heavy lifting will be done by the $(r_1, r_2)$-PLEB construction. Our goal is to build an $\varepsilon$-PLEB for any fixed $r$, which will correspond to an $(r, r(1+\varepsilon))$-PLEB. The main thing we need to do, following [IM98] as always, is compute a reasonable bound on $\rho = \frac{\log p_1}{\log p_2} = \frac{\ln(1 - r/d)}{\ln(1 - (1+\varepsilon)r/d)}$. This is essentially just a matter of hitting it with enough inequalities, although there are a couple of tricks in the middle.
Compute
\begin{align*}
\rho &= \frac{\ln(1 - r/d)}{\ln(1 - (1+\varepsilon)r/d)} \\
&= \frac{(d/r)\ln(1 - r/d)}{(d/r)\ln(1 - (1+\varepsilon)r/d)} \\
&= \frac{\ln\left((1 - r/d)^{d/r}\right)}{\ln\left((1 - (1+\varepsilon)r/d)^{d/r}\right)} \\
&\le \frac{\ln\left(e^{-1}(1 - r/d)\right)}{\ln e^{-(1+\varepsilon)}} \\
&= \frac{-1 + \ln(1 - r/d)}{-(1+\varepsilon)} \\
&= \frac{1}{1+\varepsilon} - \frac{\ln(1 - r/d)}{1+\varepsilon}. \qquad (7.8.1)
\end{align*}
Note that we used the fact that $1 + x \le e^x$ for all $x$ in the denominator and $(1-x)^{1/x} \ge e^{-1}(1-x)$ for $x \in [0,1]$ in the numerator. The first fact is our usual favorite inequality.
The second can be proved in a number of ways. The most visually intuitive is that $(1-x)^{1/x}$ and $e^{-1}(1-x)$ are equal at $x = 1$ and equal in the limit as $x$ goes to 0, while $(1-x)^{1/x}$ is concave in between 0 and 1 and $e^{-1}(1-x)$ is linear. Unfortunately it is rather painful to show that $(1-x)^{1/x}$ is in fact concave. An alternative is to rewrite the inequality $(1-x)^{1/x} \ge e^{-1}(1-x)$ as $(1-x)^{1/x - 1} \ge e^{-1}$, apply a change of variables $y = 1/x$ to get $(1 - 1/y)^{y-1} \ge e^{-1}$ for $y \in [1, \infty)$, and then argue that (a) equality holds in the limit as $y$ goes to infinity, and (b) the left-hand side is a decreasing function of $y$, so it never drops below its limiting value.
We now return to (7.8.1). We'd really like the second term to be small enough that we can just write $n^\rho$ as $n^{1/(1+\varepsilon)}$. (Note that even though it looks negative, it isn't, because $\ln(1 - r/d)$ is negative.) So we pull a rabbit out of a hat by assuming that $r/d < 1/\ln n$.⁸ This assumption can be justified by modifying the algorithm so that $d$ is padded out with up to $d \ln n$ unused junk bits if necessary. Using this assumption, the second term in (7.8.1) is $O(1/\ln n)$, so $n^\rho = O(n^{1/(1+\varepsilon)})$.
Plugging into the formula for $(r_1, r_2)$-PLEB gives $O(n^{1/(1+\varepsilon)} \log n \log(1/\delta))$ hash function evaluations per query, each of which costs $O(1)$ time, plus $O(n^{1/(1+\varepsilon)} \log(1/\delta))$ distance computations, which will take $O(d)$ time each. If we add in the cost of the binary search, we have to multiply this by $O(\log\log_{1+\varepsilon} R \cdot \log\log\log_{1+\varepsilon} R)$, where the log-log-log comes from having to adjust $\delta$ so that the error doesn't accumulate too much over all $O(\log\log R)$ steps. The end result is that we can do approximate nearest-neighbor queries in
$$O\left(n^{1/(1+\varepsilon)} \log(1/\delta)(\log n + d) \log\log_{1+\varepsilon} R \cdot \log\log\log_{1+\varepsilon} R\right)$$
time. For $\varepsilon$ reasonably large, this is much better than naively testing against all points in our database, which takes $O(nd)$ time (although it does produce an exact result).
⁸Indyk and Motwani pull this rabbit out of a hat a few steps earlier, but it's pretty much the same rabbit either way.
Chapter 8
Dimension reduction
and then using the union bound to show that this same property holds for all $n^2$ vectors $u - v$ with nonzero probability. This shows the existence of a good matrix, and we can generate matrices and test them until we find one that actually works.
¹Radial symmetry is immediate from the density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ of the univariate normal distribution. If we consider a vector $\langle X_1, \ldots, X_d \rangle$ of independent $N(0,1)$ random variables, then the joint density is given by the product $\prod_{i=1}^d \frac{1}{\sqrt{2\pi}} e^{-x_i^2/2} = (2\pi)^{-d/2} e^{-\sum x_i^2/2}$. But $\sum x_i^2 = r^2$ where $r$ is the distance from the origin, meaning this distribution has the same density at all points at the same distance.
$\|Z\|^2 \le \beta(k/d)$, which expands to
$$\frac{\sum_{i=1}^k X_i^2}{\sum_{i=1}^d X_i^2} \le \beta(k/d).$$
Having a ratio between two sums is a nuisance, but we can multiply out the denominators to turn it into something we can apply a Chernoff-style argument to.
\begin{align*}
\Pr\left[\frac{\sum_{i=1}^k X_i^2}{\sum_{i=1}^d X_i^2} \le \beta(k/d)\right]
&= \Pr\left[d \sum_{i=1}^k X_i^2 \le \beta k \sum_{i=1}^d X_i^2\right] \\
&= \Pr\left[\beta k (X_1^2 + \cdots + X_d^2) - d(X_1^2 + \cdots + X_k^2) \ge 0\right] \\
&= \Pr\left[\exp\left(t\left(\beta k (X_1^2 + \cdots + X_d^2) - d(X_1^2 + \cdots + X_k^2)\right)\right) \ge 1\right] \\
&\le E\left[\exp\left(t\left(\beta k (X_1^2 + \cdots + X_d^2) - d(X_1^2 + \cdots + X_k^2)\right)\right)\right] \\
&= \left(E\left[\exp(t\beta k X^2)\right]\right)^{d-k} \left(E\left[\exp(t(\beta k - d)X^2)\right]\right)^k,
\end{align*}
where $t$ can be any value greater than 0, the shift from probability to expectation uses Markov's inequality, and in the last step we replace each independent occurrence of $X_i$ with a standard normal random variable $X$.
Now we just need to be able to compute the moment generating function $E\left[e^{sX^2}\right]$ for $X^2$. The quick way to do this is to notice that $X^2$ has a chi-squared distribution with one degree of freedom (since the chi-squared distribution with $k$ degrees of freedom is just the distribution of the sum of squares of $k$ independent normal random variables), and look up its m.g.f., $(1-2s)^{-1/2}$ (for $s < 1/2$).
The same argument applied to $g(-t)$ gives essentially the same bound for $\beta = 1 + \varepsilon > 1$:
$$\Pr\left[\|Z\|^2 \ge \beta(k/d)\right] \le e^{(k/2)(1 - \beta + \ln\beta)}.$$
Lemma 8.1.2 ([JL84]). For every $d$, $0 < \varepsilon < 1$, and $0 < \delta < 1$, there exists a distribution over linear functions $f : \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\varepsilon^{-2}\log(1/\delta))$ such that for every $x \in \mathbb{R}^d$,
$$\Pr\left[(1-\varepsilon)\|x\|^2 \le \|f(x)\|^2 \le (1+\varepsilon)\|x\|^2\right] \ge 1 - \delta.$$
This can be handy for applications where we don’t know the vectors we
will be working with in advance.
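As a sanity check, here is a minimal NumPy sketch of the distributional version: project with a $k \times d$ matrix of independent $N(0, 1/k)$ entries (one standard instantiation, not the only one) and observe that norms are roughly preserved. The constant 8 in the choice of $k$ is a rough illustrative choice, not a tight one.

    import numpy as np

    rng = np.random.default_rng(1)
    d, eps, delta = 10_000, 0.2, 0.01
    k = int(np.ceil(8 * np.log(1 / delta) / eps**2))   # rough constant

    # random projection f(x) = Ax with independent N(0, 1/k) entries,
    # so E[||Ax||^2] = ||x||^2
    A = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))

    x = rng.normal(size=d)
    ratio = np.linalg.norm(A @ x)**2 / np.linalg.norm(x)**2
    print(k, ratio)   # ratio lands in [1 - eps, 1 + eps] with prob >= 1 - delta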
8.2 Applications
The intuition is that the Johnson-Lindenstrauss lemma lets us reduce the
dimension of some problem involving distances between n points, where we
are willing to tolerate a small constant relative error, from some arbitrary d
to O(log n) (or O(log(1/δ)) if we just care about the error probability per
pair of points). So applications tend to fall into one of two categories:
If we are lucky, we'll get both payoffs, winning on both time and space. For example, suppose we have a set of $n$ points $x_1, x_2, \ldots, x_n$ representing the centers of various clusters in a $d$-dimensional space, and we want to rapidly classify incoming points $y$ into one of these clusters by finding the $x_i$ that $y$ is closest to. If we do this naively, this is a $\Theta(nd)$ operation, since it takes $\Theta(d)$ time to compute each distance between $y$ and some $x_i$. If instead we are willing to accept the inaccuracy associated with Johnson-Lindenstrauss, we can fix a matrix $A$ in advance, replace each $x_i$ with $Ax_i$, and find the $x_i$ that is (approximately) closest to $y$ using $O(d \log n)$ time to reduce $y$ to $Ay$ and $O(n \log n)$ time to compute the distance between $Ay$ and each $Ax_i$. In this case we are reducing both the time complexity of our classification algorithm (at least if we don't count the pre-processing time to generate the $Ax_i$) and the amount of data we need to store.
An example of space savings is the use of the Johnson-Lindenstrauss transform in streaming algorithms (see §7.7). Freksen [Fre21] gives a simple example of estimating $\|x\|_2$ where $x$ is a vector of counts of items from a set of size $n$ presented one at a time. If we don't charge for the space to store the JLT function $f$, we can simply add the $i$-th column of the matrix to our running total whenever we see item $i$, and we need only store $O(\varepsilon^{-2}\log(1/\delta))$ distinct numerical values of an appropriate precision to estimate $\|x\|_2$ to within $\varepsilon$ relative error with probability at least $1 - \delta$. The problem is that in reality we do need to represent $f$ somehow, and even for a $\pm 1$ matrix this will take $\Theta(n\varepsilon^{-2}\log(1/\delta))$ space. Fortunately it can be shown that generating $f$ using a 4-independent hash function reduces the space for $f$ to $O(\log n)$, giving the Tug-of-War sketch of Alon et al. [AMS96], one of the first compact streaming data structures. Though this is a nice application of the JLT, it's worth mentioning that Cormode and Muthukrishnan [CM05] observe that this is still significantly more costly for most queries than their own count-min sketch.
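A toy version of the tug-of-war estimator is easy to write down: keep one counter per repetition, add a random ±1 sign per item, and average the squared counters. The dictionary of memoized signs below stands in for the 4-independent hash function that makes the real [AMS96] data structure small; everything else is the same.

    import random

    class TugOfWar:
        """Toy tug-of-war sketch for estimating ||x||_2^2 of a count vector.
        Signs are memoized at random here; the space-efficient version derives
        them from a 4-independent hash function instead."""

        def __init__(self, reps):
            self.counters = [0] * reps
            self.signs = [{} for _ in range(reps)]   # stand-in for 4-wise hashing

        def _sign(self, j, i):
            return self.signs[j].setdefault(i, random.choice((-1, 1)))

        def update(self, i, c=1):
            for j in range(len(self.counters)):
                self.counters[j] += self._sign(j, i) * c

        def estimate(self):
            # each counter Z_j satisfies E[Z_j^2] = ||x||_2^2; average them
            return sum(z * z for z in self.counters) / len(self.counters)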
Chapter 9
Martingales and stopping times
9.1 Definitions
The general form of a martingale {Xt , Ft } consists of:
$$E[X_{t+1} \mid \mathcal{F}_t] = X_t \qquad (9.1.1)$$
$$X_t \le E[X_{t+1} \mid \mathcal{F}_t] \qquad (9.2.1)$$
$$X_t \ge E[X_{t+1} \mid \mathcal{F}_t] \qquad (9.2.2)$$
In each case, what is “sub” or “super” is the value at the current time
compared to the expected value at the next time. Intuitively, a submartingale
corresponds to a process where you win on average, while a supermartingale
is a process where you lose on average. Casino games (in profitable casinos)
are submartingales for the house and supermartingales for the player.
Sub- and supermartingales can be reduced to martingales by subtracting
off the expected change at each step. For example, if {Xt } is a submartingale
with respect to {Ft }, then the process {Yt } defined recursively by
$$Y_0 = X_0, \qquad Y_{t+1} = Y_t + X_{t+1} - E[X_{t+1} \mid \mathcal{F}_t]$$
¹Different authors impose different conditions on the range of $\tau$; for example, Mitzenmacher and Upfal [MU17] exclude the case $\tau = \infty$. We allow $\tau = \infty$ to represent the outcome where we never stop. This can be handy for modeling processes where this outcome is possible, although in practice we will typically insist that it occurs only with probability zero.
is a martingale, since $E[Y_{t+1} \mid \mathcal{F}_t] = Y_t + E[X_{t+1} \mid \mathcal{F}_t] - E[X_{t+1} \mid \mathcal{F}_t] = Y_t$.
(a) $\Pr[\tau < \infty] = 1$,
(b) $E[|X_\tau|] < \infty$, and
(c) $\lim_{t\to\infty} E\left[X_t \cdot 1_{[\tau > t]}\right] = 0$.
Lemma 9.3.2. Let $(X_t, \mathcal{F}_t)$ be a martingale and $\tau$ a stopping time for $\{\mathcal{F}_t\}$. Then for any $n \in \mathbb{N}$, $E\left[X_{\min(\tau,n)}\right] = E[X_0]$.

Proof. Define $Y_t = X_0 + \sum_{i=1}^t (X_i - X_{i-1}) 1_{[\tau > i-1]}$. Then $(Y_t, \mathcal{F}_t)$ is a martingale, because we can calculate $E[Y_{t+1} \mid \mathcal{F}_t] = E\left[Y_t + (X_{t+1} - X_t) 1_{[\tau > t]} \mid \mathcal{F}_t\right] = Y_t + 1_{[\tau > t]} \cdot E[X_{t+1} - X_t \mid \mathcal{F}_t] = Y_t$; effectively, we are treating $1_{[\tau > t-1]}$ as a sequence of bets, and we know that adjusting our bets doesn't change the martingale property. But then $E\left[X_{\min(\tau,n)}\right] = E[Y_n] = E[Y_0] = E[X_0]$.
3. Conclude that E [X0 ] = E [Xτ ], since they are both limits of the same
sequence.
This holds because of the identity $X_\tau = X_{\min(\tau,n)} + 1_{[\tau > n]}(X_\tau - X_n)$: either $\tau \le n$, and we just get $X_\tau$, or $\tau > n$, and we get $X_n + (X_\tau - X_n) = X_\tau$.
Taking the expectation of both sides gives
\begin{align*}
E[X_\tau] &= E\left[X_{\min(\tau,n)}\right] + E\left[1_{[\tau > n]}(X_\tau - X_n)\right] \\
&= E[X_0] + E\left[1_{[\tau > n]}(X_\tau - X_n)\right].
\end{align*}
So if we can show that the right-hand term goes to zero in the limit, we are done.
For the bounded-range case, we have $|X_\tau - X_n| \le 2M$, so $E\left[1_{[\tau > n]}(X_\tau - X_n)\right] \le 2M \cdot \Pr[\tau > n]$. Since in this case we assume $\Pr[\tau < \infty] = 1$, $\lim_{n\to\infty}\Pr[\tau > n] = 0$, and the theorem holds.
For bounded increments, we have
\begin{align*}
E\left[(X_\tau - X_n)1_{[\tau > n]}\right] &= E\left[\sum_{t \ge n} (X_{t+1} - X_t) 1_{[\tau > t]}\right] \\
&\le E\left[\sum_{t \ge n} |X_{t+1} - X_t| \cdot 1_{[\tau > t]}\right] \\
&\le E\left[\sum_{t \ge n} c \cdot 1_{[\tau > t]}\right] \\
&= c \sum_{t \ge n} E\left[1_{[\tau > t]}\right].
\end{align*}
But $E[\tau] = \sum_{t=0}^\infty E\left[1_{[\tau > t]}\right]$. Under the assumption that this series converges, its tail goes to zero, and again the theorem holds.
For the general case, we can expand
$$E[X_\tau] = E\left[X_{\min(\tau,n)}\right] + E\left[1_{[\tau > n]} X_\tau\right] - E\left[1_{[\tau > n]} X_n\right],$$
which implies
$$\lim_{n\to\infty} E[X_\tau] = \lim_{n\to\infty} E\left[X_{\min(\tau,n)}\right] + \lim_{n\to\infty} E\left[1_{[\tau > n]} X_\tau\right] - \lim_{n\to\infty} E\left[1_{[\tau > n]} X_n\right],$$
assuming all these limits exist and are finite. We've already established that the first limit is $E[X_0]$, which is exactly what we want. So we just need to show that the other two limits both converge to zero. For the last limit, we just use condition (4c), which gives $\lim_{n\to\infty} E\left[1_{[\tau > n]} X_n\right] = 0$; no further argument is needed. But we still need to show that the middle limit also vanishes.
Here we use condition (4b). Observe that $E\left[1_{[\tau > n]} X_\tau\right] = \sum_{t=n+1}^\infty E\left[1_{[\tau = t]} X_t\right]$. Compare this with $E[X_\tau] = \sum_{t=0}^\infty E\left[1_{[\tau = t]} X_t\right]$; this is an absolutely convergent series (this is why we need condition (4b)), so in the limit the sum of the terms for $t = 0 \ldots n$ converges to $E[X_\tau]$. But this means that the sum of the remaining terms for $t = n+1 \ldots \infty$ converges to zero. So the middle term goes to zero as $n$ goes to infinity. This completes the proof.
9.4 Applications
Here we give some examples of the Optional Stopping Theorem in action. In each case, the trick is to find an appropriate martingale and stopping time, and let the theorem do all the work.
$= Y_{t-1}$.
³For the ±1 random walk case, we have $V_t = 1$ always, giving $\sum_{i=1}^t V_i = t$ and $E[X_\tau^2] = E[X_0^2] + E[\tau]$ when $\tau$ is a stopping time satisfying the conditions of the Optional Stopping Theorem. For the general case, the same argument gives $E[X_\tau^2] = E[X_0^2] + E\left[\sum_{t=1}^\tau V_t\right]$ instead: the expected square position of $X_t$ is incremented by the conditional variance at each step.
⁴This would be a random walk with one absorbing barrier.
⁵In fact, we always reach $b$. An easy way to see this is to imagine a sequence of intervals of length $n_1, n_2, \ldots$, where $n_{i+1} = \left(b + \sum_{j=1}^i n_j\right)^2$. At the end of the $i$-th interval, we are no lower than $-\sum_{j=0}^i n_j$, so we only need to go up $\sqrt{n_{i+1}}$ positions to reach $b$ by the
This is the same formula as in §3.4.3.1, but we've eliminated the bound on $N$ and allowed for much more dependence between $N$ and the $X_i$.⁷
Lemma 9.4.1. Let $\{X_i\}$ be a martingale with $X_i \ge 0$. Then for any fixed $n$,
$$\Pr\left[\max_{i \le n} X_i \ge \alpha\right] \le \frac{E[X_0]}{\alpha}. \qquad (9.4.2)$$
Proof. The idea is to pick a stopping time $\tau$ such that $\max_{i \le n} X_i \ge \alpha$ if and only if $X_\tau \ge \alpha$.
Let $\tau$ be the first time such that $X_\tau \ge \alpha$ or $\tau \ge n$. Then $\tau$ is a stopping time for $\{X_i\}$, since we can determine from $X_0, \ldots, X_t$ whether $\tau \le t$ or not. We also have that $\tau \le n$ always, which is equivalent to $\tau = \min(\tau, n)$. Finally, $X_\tau \ge \alpha$ means that $\max_{i \le n} X_i \ge X_\tau \ge \alpha$, and conversely if there is some $t \le n$ with $X_t = \max_{i \le n} X_i \ge \alpha$, then $\tau$ is the first such $t$, giving $X_\tau \ge \alpha$.
Lemma 9.3.2 says $E[X_\tau] = E[X_0]$. So Markov's inequality gives $\Pr[\max_{i \le n} X_i \ge \alpha] = \Pr[X_\tau \ge \alpha] \le \frac{E[X_\tau]}{\alpha} = \frac{E[X_0]}{\alpha}$, as claimed.
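A quick Monte Carlo check of (9.4.2), using a simple nonnegative martingale chosen purely for illustration: start at $X_0 = 1$ and repeatedly multiply by a factor that is $1/2$ or $3/2$ with equal probability, so $E[X_{t+1} \mid X_t] = X_t$.

    import random

    def run_max(n):
        """Multiplicative martingale X_0 = 1, X_{t+1} = X_t * U with
        U uniform on {0.5, 1.5}; return max_{i <= n} X_i."""
        x, best = 1.0, 1.0
        for _ in range(n):
            x *= random.choice((0.5, 1.5))
            best = max(best, x)
        return best

    alpha, n, trials = 4.0, 50, 100_000
    hits = sum(run_max(n) >= alpha for _ in range(trials))
    print(hits / trials, "<=", 1.0 / alpha)  # Doob: Pr[max >= alpha] <= E[X_0]/alpha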
$$\Pr\left[\max_{i \le n} X_i \ge \alpha\right] \le \frac{E[X_n]}{\alpha}. \qquad (9.4.3)$$
The proof is similar, but requires showing first that $E[X_\tau] \le E[X_n]$ when $\tau \le n$ is a stopping time and $\{X_i\}$ is a submartingale.
Doob's martingale inequality is what you get if you generalize Markov's inequality to martingales. The analogous generalization of Chebyshev's inequality is Kolmogorov's inequality, which says:

Lemma 9.4.2. For sums $S_i = \sum_{j=1}^i X_j$ of independent random variables $X_1, X_2, \ldots, X_n$ with $E[X_i] = 0$,
$$\Pr\left[\max_{i \le n} |S_i| \ge \alpha\right] \le \frac{\operatorname{Var}[S_n]}{\alpha^2}. \qquad (9.4.4)$$
Proof. Let $Y_i = S_i^2 - \operatorname{Var}[S_i]$. Then $\{Y_i\}$ is a martingale. This implies that $E[Y_n] = E[S_n^2] - \operatorname{Var}[S_n] = Y_0 = 0$ and thus that $E[S_n^2] = \operatorname{Var}[S_n]$. It's easy to see that $S_i^2$ is a submartingale, since $S_i^2 = Y_i + \operatorname{Var}[S_i]$ and the partial sums of variance $\operatorname{Var}[S_i] = \sum_{j=1}^i \operatorname{Var}[X_j]$ can only increase over time. Now apply (9.4.3).
all be at −1.
Let $\chi_i = 1$ if $x_1 \ldots x_i = x_{k-i+1} \ldots x_k$, and 0 otherwise. Then the number of losers is given by $\tau - \sum_{i=1}^k \chi_i$ and the total expected payoff is
\begin{align*}
E[X_\tau] &= E\left[-\left(\tau - \sum_{i=1}^k \chi_i\right) + \sum_{i=1}^k \chi_i (2^i - 1)\right] \\
&= E\left[-\tau + \sum_{i=1}^k \chi_i 2^i\right] \\
&= 0.
\end{align*}
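The conclusion $E[\tau] = \sum_{i : \chi_i = 1} 2^i$ turns into a two-line formula for the expected waiting time of a pattern in fair coin flips; the Monte Carlo loop below is just a sanity check on an illustrative pattern.

    import random

    def expected_wait(pattern):
        """E[tau] = sum of 2^i over prefix lengths i where the length-i prefix
        equals the length-i suffix (from the martingale betting argument)."""
        k = len(pattern)
        return sum(2**i for i in range(1, k + 1)
                   if pattern[:i] == pattern[k - i:])

    def simulate(pattern, trials=20_000):
        total = 0
        for _ in range(trials):
            window, flips = "", 0
            while not window.endswith(pattern):
                window += random.choice("HT")
                flips += 1
            total += flips
        return total / trials

    print(expected_wait("HTH"), simulate("HTH"))   # 10 and approximately 10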
Chapter 10
Markov chains
If you want to learn more about Markov chains than presented here, they
are usually covered in general probability textbooks (for example, in [Fel68]
or [GS01]), mentioned in many linear algebra textbooks [Str03], covered
in some detail in stochastic processes textbooks [KT75], and covered in
exquisite detail in many books dedicated specifically to the subject [KS76,
KSK76]. Good sources for mixing times for Markov chains are the textbook
of Levin, Peres, and Wilmer [LPW09] and the survey paper by Montenegro
and Tetali [MT05]. An early reference on the mixing times for random
walks on graphs that helped inspire much subsequent work is the Aldous-
Fill manuscript [AF01], which can be found on-line at http://www.stat.berkeley.edu/~aldous/RWG/book.html.
In both cases we want to have a bound on how long it takes the Markov
chain to converge, either because it tells us when our algorithm terminates,
or because it tells us how long to mix it up before looking at the current
state.
10.1.1 Examples
• A fair ±1 random walk. The state space is Z, the transition probabilities
are pij = 1/2 if |i − j| = 1, 0 otherwise. This is an example of a Markov
chain that is also a martingale.
• A fair ±1 random walk on a cycle. As above, but now the state space
is Z/m, the integers mod m. This is a finite Markov chain. It is also in
some sense a martingale, although we usually don’t define martingales
over finite groups.
chain as a random walk on a graph in this way: the states become vertices,
and the transitions become edges, each labeled with its transition
probability. It’s conventional in this representation to exclude edges
with probability 0 and include self-loops for any transitions i → i.
If the resulting graph is small enough or has a nice structure, this can
be a convenient way to draw a Markov chain.
where the supremum is taken over all sets $A$ for which $\Pr[X \in A]$ and $\Pr[Y \in A]$ are both defined.⁶
An equivalent definition is
$$d_{TV}(X, Y) = \sup_A |\Pr[X \in A] - \Pr[Y \in A]|.$$
So $d_{TV}(x, y) = \frac{1}{2}\|x - y\|_1$.
Proof. Compute
\begin{align*}
|E_x(Z) - E_y(Z)| &= \left|\sum_z z\left(\Pr_x(Z = z) - \Pr_y(Z = z)\right)\right| \\
&\le \sum_z |z| \cdot \left|\Pr_x(Z = z) - \Pr_y(Z = z)\right| \\
&\le M \sum_z \left|\Pr_x(Z = z) - \Pr_y(Z = z)\right| \\
&\le M \|x - y\|_1 \\
&= 2M \cdot d_{TV}(x, y).
\end{align*}
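For discrete distributions, total variation distance is one line of code. This helper assumes the distributions are given as dense probability vectors over the same support, and checks the $\frac{1}{2}\|x - y\|_1$ identity against the event-based definition (whose supremum is achieved by $A = \{i : x_i > y_i\}$).

    def d_tv(x, y):
        """Total variation distance between two discrete distributions,
        given as probability vectors over the same support."""
        return 0.5 * sum(abs(p - q) for p, q in zip(x, y))

    x = [0.5, 0.3, 0.2]
    y = [0.2, 0.4, 0.4]
    best_event = sum(p - q for p, q in zip(x, y) if p > q)
    print(d_tv(x, y), best_event)   # both 0.3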
means that for any $\varepsilon > 0$, there is a mixing time $t_{mix}(\varepsilon)$ such that for any initial distribution $x$ and any $t \ge t_{mix}(\varepsilon)$,
$$d_{TV}(xP^t, \pi) \le \varepsilon.$$
$$d_{TV}(X, Y) \le \Pr[X \ne Y].$$
⁷It turns out that the bound in the Coupling Lemma is tight in the following sense: for any given distributions on $X$ and $Y$, there exists a joint distribution giving these distributions such that $d_{TV}(X,Y)$ is exactly equal to $\Pr[X \ne Y]$ when $X$ and $Y$ are sampled from the joint distribution. For discrete distributions, the easiest way to construct the joint distribution is first to let $Y = X = i$ for each $i$ with probability $\min(\Pr[X = i], \Pr[Y = i])$, and then distribute the remaining probability for $X$ over all the cases where $\Pr[X = i] > \Pr[Y = i]$ and similarly for $Y$ over all the cases where $\Pr[Y = i] > \Pr[X = i]$. Looking at the unmatched values for $X$ gives $\Pr[X \ne Y] \le \sum_{\{i \mid \Pr[X=i] > \Pr[Y=i]\}} (\Pr[X = i] - \Pr[Y = i]) \le d_{TV}(X, Y)$. So in this case $\Pr[X \ne Y] = d_{TV}(X, Y)$.
Unfortunately, the fact that there always exists a perfect coupling in this sense does not mean that we can express it in any convenient way, or that even if we could, it would arise from the kind of causal, step-by-step construction that we will use for couplings between Markov processes.
\begin{align*}
\Pr[X \in A] &= \Pr[X \in A \wedge Y \in A] + \Pr[X \in A \wedge Y \notin A], \\
\Pr[Y \in A] &= \Pr[X \in A \wedge Y \in A] + \Pr[X \notin A \wedge Y \in A],
\end{align*}
and thus
\begin{align*}
\Pr[X \in A] - \Pr[Y \in A] &= \Pr[X \in A \wedge Y \notin A] - \Pr[X \notin A \wedge Y \in A] \\
&\le \Pr[X \in A \wedge Y \notin A] \\
&\le \Pr[X \ne Y].
\end{align*}
Since this holds for any particular set $A$, it also holds when we take the maximum over all $A$ to get $d_{TV}(X, Y)$.
For Markov chains, our goal will be to find a useful coupling between a sequence of random variables $X_0, X_1, X_2, \ldots$, corresponding to the Markov chain starting in an arbitrary distribution, with a second sequence $Y_0, Y_1, Y_2, \ldots$, corresponding to the same chain starting in a stationary distribution. What will make a coupling useful is if $\Pr[X_t \ne Y_t]$ is small for reasonably large $t$: since $Y_t$ has the stationary distribution, this will show that $d_{TV}(xP^t, \pi)$ is also small.
Our first use of this technique will be to show, using a rather generic
coupling, that Markov chains with certain nice properties converge to their
stationary distribution in the limit. Later we will construct specialized
couplings for particular Markov chains to show that they converge quickly.
But first we will consider what properties a Markov chain must have to
converge at all.
the period of i is m, then starting from i we can only return to i at times that
are multiples of m. If m = 1, state i is said to be aperiodic. A Markov chain
as a whole is aperiodic if all of its states are aperiodic. In graph-theoretic
terms, this means that the graph of the chain is not k-partite for any k > 1.
Reversible chains are also an interesting special case: if a chain is reversible,
it can’t have a period greater than 2, since we can always step off a node
and step back.
If our Markov chain is not aperiodic, we can make it aperiodic by flipping a coin at each step to decide whether to move or not. This gives a lazy Markov chain whose transition probabilities are given by $\frac{1}{2}p_{ij}$ when $i \ne j$ and $\frac{1}{2} + \frac{1}{2}p_{ij}$ when $i = j$. This doesn't affect the stationary distribution: if we replace our transition matrix $P$ with a new transition matrix $\frac{P+I}{2}$, and $\pi P = \pi$, then $\pi\left(\frac{P+I}{2}\right) = \frac{1}{2}\pi P + \frac{1}{2}\pi I = \frac{1}{2}\pi + \frac{1}{2}\pi = \pi$.
Unfortunately there is no quick fix for reducible Markov chains. But
since we will often be designing the Markov chains we will be working with,
we can just take care to make sure they are not reducible.
We will later need the following lemma about aperiodic Markov chains,
which is related to the Frobenius problem of finding the minimum value
that cannot be constructed using coins of given denominations:
Proof. Let $S = \{t \mid p_{ii}(t) \ne 0\}$. Since $\gcd(S) = 1$, there is a finite subset $S'$ of $S$ such that $\gcd S' = 1$. Write the elements of $S'$ as $m_1, m_2, \ldots, m_k$ and let $M = \prod_{j=1}^k m_j$. From the extended Euclidean algorithm, there exist integer coefficients $a_1, \ldots, a_k$ with $\sum_{j=1}^k a_j m_j = 1$. We would like to use each $a_j$ as the number of times to go around the length-$m_j$ loop from $i$ to $i$. Unfortunately many of these $a_j$ will be negative.
To solve this problem, we replace $a_j$ with $b_j = a_j + M/m_j$. This makes all the coefficients non-negative, and gives $\sum_{j=1}^k b_j m_j = kM + 1$. This implies
Proof. Consider two copies of the chain $\{X_t\}$ and $\{Y_t\}$, where $X_0$ starts in some arbitrary distribution $x$ and $Y_0$ starts in a stationary distribution $\pi$. Define a coupling between $\{X_t\}$ and $\{Y_t\}$ by the rule: (a) if $X_t \ne Y_t$, then $\Pr[X_{t+1} = j \wedge Y_{t+1} = j' \mid X_t = i \wedge Y_t = i'] = p_{ij}p_{i'j'}$; and (b) if $X_t = Y_t$, then $\Pr[X_{t+1} = Y_{t+1} = j \mid X_t = Y_t = i] = p_{ij}$. Intuitively, we let both chains run independently until they collide, after which we run them together. Since each chain individually moves from state $i$ to state $j$ with probability $p_{ij}$ in either case, we have that $X_t$ evolves normally and $Y_t$ remains in the stationary distribution.
Now let us show that $d_{TV}(xP^t, \pi) \le \Pr[X_t \ne Y_t]$ goes to zero in the limit. Pick some state $i$. Let $r$ be the maximum over all states $j$ of the first passage time $f_{ji}$, where $f_{ji}$ is the minimum time $t$ such that $p^t_{ji} \ne 0$. Let $s$ be a time such that $p^t_{ii} \ne 0$ for all $t \ge s$ (the existence of such an $s$ is given by Lemma 10.2.4).
Suppose that at time $\ell(r+s)$, where $\ell \in \mathbb{N}$, $X_{\ell(r+s)} = j \ne j' = Y_{\ell(r+s)}$. Then there are times $\ell(r+s) + u$ and $\ell(r+s) + u'$, where $u, u' \le r$, such that $X$ reaches $i$ at time $\ell(r+s)+u$ and $Y$ reaches $i$ at time $\ell(r+s)+u'$ with nonzero probability. Since $r + s - u \ge s$, then having reached $i$ at these times, $X$ and $Y$ both return to $i$ at time $\ell(r+s) + (r+s) = (\ell+1)(r+s)$ with nonzero probability. Let $\varepsilon > 0$ be the product of these nonzero probabilities; then $\Pr\left[X_{(\ell+1)(r+s)} \ne Y_{(\ell+1)(r+s)}\right] \le (1-\varepsilon)\Pr\left[X_{\ell(r+s)} \ne Y_{\ell(r+s)}\right]$, and in general we have $\Pr[X_t \ne Y_t] \le (1-\varepsilon)^{\lfloor t/(r+s) \rfloor}$, which goes to zero in the limit. This implies that $d_{TV}(xP^t, \pi)$ also goes to zero in the limit (using the Coupling Lemma), and since any initial distribution (including a stationary distribution) converges to $\pi$, $\pi$ is the unique stationary distribution as claimed.
These are called the detailed balance equations—they say that in the stationary distribution, the probability of seeing a transition from $i$ to $j$ is equal to the probability of seeing a transition from $j$ to $i$. If this is the case, then $\sum_i \pi_i p_{ij} = \sum_i \pi_j p_{ji} = \pi_j$, which means that $\pi$ is stationary.
It's worth noting that this works for countable chains even if they are not finite, because the sums always converge since each term is non-negative and $\sum_i \pi_i p_{ij}$ is dominated by $\sum_i \pi_i = 1$. However, it may not be the case for any particular $p$ that there exists a corresponding stationary distribution $\pi$ satisfying detailed balance. If this happens, the chain is not reversible.
$$\pi_u = \frac{d(u)}{\sum_u d(u)} = \frac{d(u)}{2|E|},$$
which satisfies
$$\pi_u p_{uv} = \frac{d(u)}{2|E|} \cdot \frac{1}{d(u)} = \frac{d(v)}{2|E|} \cdot \frac{1}{d(v)} = \pi_v p_{vu}.$$
If we don't know $\pi$ in advance, we can often guess it by observing that $\pi_i p_{ij} = \pi_j p_{ji}$ implies
$$\pi_j = \pi_i \frac{p_{ij}}{p_{ji}}, \qquad (10.3.2)$$
provided $p_{ji} \ne 0$. This gives us the ability to calculate $\pi_k$ starting from any initial state $i$ as long as there is some chain of transitions $i = i_0 \to i_1 \to i_2 \to \cdots \to i_\ell = k$ where each step $i_m \to i_{m+1}$ has $p_{i_m, i_{m+1}} \ne 0$. For a random walk on a graph, this implies that $\pi$ is unique as long as the graph is connected. This of course only works for reversible chains; if we try to do this with a non-reversible chain, we are likely to get a contradiction.
For example, if we consider a biased random walk on the $n$-cycle, which moves from $i$ to $(i+1) \bmod n$ with probability $p$ and in the other direction with probability $q = 1 - p$, then applying (10.3.2) repeatedly would give $\pi_i = \pi_0 (p/q)^i$. This is not a problem when $p = q = 1/2$, since we get $\pi_i = \pi_0$ for all $i$ and can deduce that $\pi_i = 1/n$ is the unique stationary distribution. But if we try it for $p = 2/3$, then we get $\pi_i = \pi_0 2^i$, which is fine up until we hit $\pi_0 = \pi_n = \pi_0 2^n$. So for $p \ne q$, this process is not reversible, which is not surprising if we realize that the $n = 60$, $p = 1$ case describes precisely the movement of the second hand on a clock.⁸
⁸It happens to be the case that $\pi_i = 1/n$ is a stationary distribution for any value of $p$, we just can't prove this using (10.3.1).
10.3.2 Examples
Random walk on a weighted graph Here each edge has a weight $w_{uv}$ where $0 < w_{uv} = w_{vu} < \infty$, with self-loops permitted. A step of the random walk goes from $u$ to $v$ with probability $w_{uv}/\sum_{v'} w_{uv'}$. It is easy to show that this random walk has stationary distribution $\pi_u = \sum_v w_{uv} / \sum_{u'} \sum_v w_{u'v}$, generalizing the previous case, and that the resulting Markov chain satisfies the detailed balance equations.
1. The reversed chain is a Markov chain:
$$\sum_j p^*_{ij} = \sum_j p_{ji}\pi_j/\pi_i = \pi_i/\pi_i = 1.$$
2. The reversed chain has the same stationary distribution as the original chain:
$$\sum_j \pi_j p^*_{ji} = \sum_j \pi_i p_{ij} = \pi_i.$$
• Given a biased random walk on a cycle that moves right with probability
p and left with probability q, its time-reversal is the walk that moves
left with probability p and right with probability q. (Here the fact
that the stationary distribution is uniform makes things simple.) The
average of this chain with its time-reversal is an unbiased random walk.
These examples work because the original chains are simple and have
clean stationary distributions. Reversed versions of chains with messier
stationary distributions are usually messier. In practice, building reversible
chains using time-reversal is often painful precisely because we don’t have
a good characterization of the stationary distribution of the original non-
reversible chain. So we will often design our chains to be reversible from the
start rather than relying on after-the-fact flipping.
and let $q_{ii}$ be whatever probability is left over. Now consider two states $i$ and $j$, and suppose that $\pi_i f(j) \ge \pi_j f(i)$. Then
$$q_{ij} = p_{ij},$$
which gives
$$\mu_i q_{ij} = \mu_i p_{ij},$$
while
whatever distribution our real process starts with, and $Y_0$ has the stationary distribution. Our goal is to structure the combined process so that $X_t = Y_t$ as soon as possible.
Let $Z_t = X_t - Y_t \pmod m$. If $Z_t = 0$, then $X_t$ and $Y_t$ have collided and we will move both together. If $Z_t \ne 0$, then flip a coin to decide whether to move $X_t$ or $Y_t$; whichever one moves then moves up or down with equal probability. It's not hard to see that this gives a probability of exactly $1/2$ that $X_{t+1} = X_t$, $1/4$ that $X_{t+1} = X_t + 1$, and $1/4$ that $X_{t+1} = X_t - 1$, and similarly for $Y_t$. So the transition functions for $X$ and $Y$ individually are the same as for the original process.
Whichever way the first flip goes, we get $Z_{t+1} = Z_t \pm 1$ with equal probability. So $Z$ acts as an unbiased random walk on $\mathbb{Z}_m$ with an absorbing barrier at 0; this is equivalent to a random walk on $0 \ldots m$ with absorbing barriers at both endpoints. The expected time for this random walk to reach a barrier starting from an arbitrary initial state is at most $m^2/4$, so if $\tau$ is the first time at which $X_\tau = Y_\tau$, we have $E[\tau] \le m^2/4$.⁹
Using Markov's inequality, after $t = 2(m^2/4) = m^2/2$ steps we have $\Pr[X_t \ne Y_t] = \Pr[\tau > m^2/2] \le \frac{E[\tau]}{m^2/2} \le 1/2$. We can also iterate the whole argument, starting over in whatever state we are in at time $t$ if we don't converge. This gives at most a $1/2$ chance of not converging for each interval of $m^2/2$ steps. So after $\alpha m^2/2$ steps we will have $\Pr[X_t \ne Y_t] \le 2^{-\alpha}$. This gives $t_{mix}(\varepsilon) \le \frac{1}{2}m^2\lceil \lg(1/\varepsilon) \rceil$, where as before $t_{mix}(\varepsilon)$ is the time needed to make $d_{TV}(X_t, \pi) \le \varepsilon$ (see §10.2.3).
The choice of 2 for the constant in Markov's inequality could be improved. The following lemma gives an optimized version of this argument:

Lemma 10.4.1. Let the expected coupling time, at which two coupled processes $\{X_t\}$ and $\{Y_t\}$ starting from an arbitrary state are first equal, be $T$. Then $d_{TV}(X_t, \pi) \le \varepsilon$ for $t \ge Te\lceil \ln(1/\varepsilon) \rceil$.

Proof. Essentially the same argument as above, but replacing 2 with a constant $c$ to be determined. Suppose we restart the process every $cT$ steps. Then at time $t$ we have a total variation bounded by $c^{-\lfloor t/cT \rfloor}$. The expression $c^{-t/cT}$ is minimized by minimizing $c^{-1/c}$, or equivalently $-\ln c/c$, which occurs at $c = e$. This gives $t_{mix}(\varepsilon) \le Te\lceil \ln(1/\varepsilon) \rceil$.
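The lazy-cycle coupling is easy to simulate. The sketch below estimates the expected coupling time from a worst-case start (diametrically opposite points) and compares it against the $m^2/4$ bound used above, which is tight for this start.

    import random

    def coupling_time(m):
        """Coupled lazy walks on Z_m: while unequal, a coin flip picks which
        of the two walks moves, and it moves one step up or down; after they
        meet they would move together, so we stop."""
        x, y, t = 0, m // 2, 0
        while x != y:
            step = random.choice((-1, 1))
            if random.random() < 0.5:
                x = (x + step) % m
            else:
                y = (y + step) % m
            t += 1
        return t

    m, trials = 30, 2000
    avg = sum(coupling_time(m) for _ in range(trials)) / trials
    print(avg, "vs bound", m * m / 4)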
It's worth noting that the random walk example was very carefully rigged to make the coupling argument clean. A similar argument still works (perhaps with a change in the bound) for other irreducible aperiodic walks on the ring, but the details are messier.

⁹If we know that $Y_0$ is uniform, then $Z_0$ is also uniform, and we can use this fact to get a slightly smaller bound on $E[\tau]$, around $m^2/6$. But this will cause problems if we want to re-run the coupling starting from a state where $X_t$ and $Y_t$ have not yet converged.
10.4.3.1 Move-to-top
This is a variant of card shuffling that is interesting mostly because it gives about the easiest possible coupling argument. At each step, we choose one of the cards uniformly at random (including the top card) and move it to the top of the deck. How long until the deck is fully shuffled, i.e., until the total variation distance between the actual distribution and the stationary distribution is bounded by $\varepsilon$?
Here the trick is that when we choose a card to move to the top in the
X process, we choose the same card in the Y process. It’s not hard to see
that this links the two cards together so that they are always in the same
position in the deck in all future states. So to keep track of how well the
coupling is working, we just keep track of how many cards are linked in this
way, and observe that as soon as n − 1 are, the two decks are identical.
Note: Unlike some of the examples below, we don’t consider two cards to
be linked just because they are in the same position. We are only considering
cards that have gone through the top position in the deck (which corresponds
to some initial segment of the deck, viewed from above). The reason is that
these cards never become unlinked: if we pick two cards from the initial
segment, the cards above them move down together. But deeper cards that
happen to match might become separated if we pull a card from one deck
that is above the matched pair while its counterpart in the other deck is
below the matched pair.
Given k cards linked in this way, the probability that the next step links
another pair of cards is exactly (n − k)/n. So the expected time until we get
k + 1 cards is n/(n − k), and if we sum these waiting times for k = 0 . . . n − 1,
we get nHn , the waiting time for the coupon collector problem. So the bound
on the mixing time is the same as for the random walk on a hypercube.
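The coupled shuffle below makes the coupon-collector behavior visible: both decks move the same card (by identity, not position) to the top, and we count steps until the decks agree; $nH_n$ is the predicted expected coupling time.

    import random

    def mtt_coupling_time(n):
        """Move-to-top coupling: pick a card identity uniformly and move it to
        the top of both decks; count steps until the decks are identical."""
        x = list(range(n))
        y = list(range(n)); random.shuffle(y)
        t = 0
        while x != y:
            c = random.randrange(n)          # same card identity in both decks
            x.insert(0, x.pop(x.index(c)))
            y.insert(0, y.pop(y.index(c)))
            t += 1
        return t

    n, trials = 20, 1000
    avg = sum(mtt_coupling_time(n) for _ in range(trials)) / trials
    h_n = sum(1 / k for k in range(1, n + 1))
    print(avg, "vs", n * h_n)   # coupon-collector prediction n * H_n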
possible swap occurs with probability $\frac{1}{2n}$ on both sides, but somehow we correlate things so that like cards are pushed together but never pulled apart. The trick is that we will use the same position $i$ on both sides, but be sneaky about when we swap. In particular, we will aim to arrange things so that once some card is in the same position in both decks, both copies move together, but otherwise one copy changes its position by $\pm 1$ relative to the other with a fixed probability $\frac{1}{2n}$.
The coupled process works like this. Let $D$ be the set of indices $i$ where the same card appears in both decks at position $i$ or at position $i+1$. Then we do:
1. For $i \in D$, swap $(i, i+1)$ in both decks with probability $\frac{1}{2n}$.
2. For $i \notin D$, swap $(i, i+1)$ in the $X$ deck only with probability $\frac{1}{2n}$.
3. For $i \notin D$, swap $(i, i+1)$ in the $Y$ deck only with probability $\frac{1}{2n}$.
4. Do nothing with probability $\frac{|D|}{2n}$.
It's worth checking that the total probability of all these events is $|D|/2n + 2(n - |D|)/2n + |D|/2n = 1$. More important is that if we consider only one of the decks, the probability of doing a swap at $(i, i+1)$ is exactly $\frac{1}{2n}$ (since we catch either case 1 or 2 for the $X$ deck, or 1 or 3 for the $Y$ deck).
Now suppose that some card $c$ is at position $x$ in $X$ and $y$ in $Y$. If $x = y$, then both $x$ and $x-1$ are in $D$, so the only way the card can move is if it moves in both decks: linked cards stay linked. If $x \ne y$, then $c$ moves in deck $X$ or deck $Y$, but not both. (The only way it can move in both is in case 1, where $i = x$ and $i+1 = y$ or vice versa; but in this case $i$ can't be in $D$, since the copy of $c$ at position $x$ doesn't match whatever is in deck $Y$, and the copy at position $y$ doesn't match what's in deck $X$.) In this case the distance $x - y$ goes up or down by 1 with equal probability $\frac{1}{2n}$. Considering $x - y \pmod n$, we have a "lazy" random walk that moves with probability $1/n$, with absorbing barriers at 0 and $n$. The worst-case expected time to converge is $n(n/2)^2 = n^3/4$, giving $\Pr\left[\text{time for } c \text{ to become linked} \ge \alpha n^3/2\right] \le 2^{-\alpha}$ using the usual argument. Now apply the union bound to get $\Pr\left[\text{time for every } c \text{ to become linked} \ge \alpha n^3/2\right] \le n2^{-\alpha}$, giving an expected time of $O(n^3 \log n)$ until all cards are linked. A result of David Bruce Wilson [Wil04] shows both that this upper bound holds and that the bound is optimal up to a constant factor.
We can extract from this a conditional distribution on $Z^{t+1}_{i+1}$ given the other three variables:
$$\Pr\left[Z^{t+1}_{i+1} = z'_{i+1} \,\middle|\, Z^{t+1}_i = z'_i,\ Z^t_i = z_i,\ Z^t_{i+1} = z_{i+1}\right].$$
means that there is exactly one index j at which these two bit-vectors differ.
We apply the following coupling (which looks suspiciously like the more
generic coupling in §10.4.2):
1. Pick a random index r.
our coupling works anyway. This also allows us to start with an improper coloring for $X^0$ if we are particularly lazy. The stationary distribution is not affected, because if $i$ is a proper coloring and $j$ is an improper coloring that differs from $i$ in exactly one place, we have $p_{ij} = 0$ and $p_{ji} \ne 0$, so the detailed balance equations hold with $\pi_j = 0$.
The natural coupling to consider given adjacent $X^t$ and $Y^t$ is to pick the same node and the same new color for both, provided we can do so. If we pick the one node $v$ on which they differ, and choose a color that is not used by any neighbor (which will be the same for both copies of the process, since all the neighbors have the same colors), then we get $X^{t+1} = Y^{t+1}$; this event occurs with probability at least $1/n$. If we pick a node that is neither $v$ nor adjacent to it, then the distance between $X$ and $Y$ doesn't change; either both get a new identical color or both don't.
Things get a little messier when we pick some node $u$ adjacent to $v$, an event that occurs with probability at most $\Delta/n$. Let $c$ be the color of $v$ in $X^t$, $c'$ the color of $v$ in $Y^t$, and $T$ the set of colors that do not appear among the other neighbors of $u$. Let $\ell = |T| \ge k - (\Delta - 1)$.
Conditioning on choosing $u$ to recolor, $X^{t+1}$ picks a color uniformly from $T \setminus \{c\}$ and $Y^{t+1}$ picks a color uniformly from $T \setminus \{c'\}$. We'd like these colors to be the same if possible, but these are not the same sets, and they aren't even necessarily the same size.
There are three cases:
1. Neither $c$ nor $c'$ is in $T$. Then $X^{t+1}$ and $Y^{t+1}$ are choosing a new color from the same set, and we can make both choose the same color: the distance between $X$ and $Y$ is unchanged.
2. Exactly one of $c$ and $c'$ is in $T$. Suppose that it's $c$. Then $|T \setminus \{c\}| = \ell - 1$ and $|T \setminus \{c'\}| = |T| = \ell$. Let $X^{t+1}$ choose a new color $c''$ first. Then let $Y^{t+1}_u = c''$ with probability $\frac{\ell-1}{\ell}$ (this gives a probability of $\frac{1}{\ell}$ of picking each color in $T \setminus \{c\}$, which is what we want), and let $Y^{t+1}_u = c$ with probability $\frac{1}{\ell}$. Now the distance between $X$ and $Y$ increases with probability $\frac{1}{\ell}$.
3. Both $c$ and $c'$ are in $T$. For each $c''$ in $T \setminus \{c, c'\}$, let $X^{t+1}_u = Y^{t+1}_u = c''$ with probability $\frac{1}{\ell-1}$; since there are $\ell - 2$ such $c''$, this accounts for $\frac{\ell-2}{\ell-1}$ of the probability. Assign the remaining $\frac{1}{\ell-1}$ to $X^{t+1}_u = c'$, $Y^{t+1}_u = c$. In this case the distance between $X$ and $Y$ increases with probability $\frac{1}{\ell-1}$, making this the worst case.
Putting everything together, we have a $1/n$ chance of picking a node that guarantees to reduce $d(X,Y)$ by 1, and at most a $\Delta/n$ chance of picking a node that may increase $d(X,Y)$, by at most $\frac{1}{\ell-1}$ on average, where $\ell \ge k - \Delta + 1$, giving a maximum expected increase of $\frac{\Delta}{n} \cdot \frac{1}{k-\Delta}$. So
\begin{align*}
E\left[d(X^{t+1}, Y^{t+1}) - d(X^t, Y^t) \,\middle|\, d(X^t, Y^t) = 1\right]
&\le \frac{-1}{n} + \frac{\Delta}{n} \cdot \frac{1}{k-\Delta} \\
&= \frac{1}{n}\left(-1 + \frac{\Delta}{k-\Delta}\right) \\
&= \frac{1}{n} \cdot \frac{-(k-\Delta)+\Delta}{k-\Delta} \\
&= -\frac{1}{n} \cdot \frac{k-2\Delta}{k-\Delta}.
\end{align*}
So we get
\begin{align*}
d_{TV}(X^t, Y^t) &\le \Pr\left[X^t \ne Y^t\right] \\
&\le \left(1 - \frac{1}{n} \cdot \frac{k-2\Delta}{k-\Delta}\right)^t \cdot E\left[d(X^0, Y^0)\right] \\
&\le \exp\left(-\frac{t}{n} \cdot \frac{k-2\Delta}{k-\Delta}\right) \cdot n.
\end{align*}
For fixed $k$ and $\Delta$ with $k > 2\Delta$, this is $e^{-\Theta(t/n)} n$, which will be less than $\varepsilon$ for $t = \Omega(n(\log n + \log(1/\varepsilon)))$.
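The underlying sampler is simple enough to write out. This sketch implements a heat-bath variant of the single-site recoloring chain (pick a uniform node, recolor it uniformly among colors not used by its neighbors) on an adjacency-list graph; the notes' chain may differ in whether the current color is excluded, but the analysis is in the same spirit. The demo graph and parameters are illustrative.

    import random

    def sample_coloring(adj, k, steps, rng=random):
        """Single-site dynamics for proper k-colorings of a graph given as an
        adjacency list {v: [neighbors]}. Starts from a greedy coloring
        (assumes k is at least the maximum degree plus one)."""
        color = {}
        for v in adj:
            used = {color[u] for u in adj[v] if u in color}
            color[v] = min(c for c in range(k) if c not in used)
        nodes = list(adj)
        for _ in range(steps):
            v = rng.choice(nodes)
            used = {color[u] for u in adj[v]}
            free = [c for c in range(k) if c not in used]
            color[v] = rng.choice(free)   # uniform among non-conflicting colors
        return color

    # tiny demo: a 5-cycle with k = 5 colors (Delta = 2, so k > 2*Delta)
    cycle = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
    print(sample_coloring(cycle, 5, steps=200))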
independence throughout the process. (This argument also shows that the Markov chain is irreducible.) It may be hard to find the exact minimum path length, so we'll use this distance instead for our path coupling.
We can easily show that the stationary distribution of this process is uniform. The essential idea is that if we can transform one independent set $S$ into another $S'$ by flipping a bit, then we can go back by flipping the bit the other way. Since each transition happens with the same probability $1/n$, we get $\pi_S \cdot (1/n) = \pi_{S'} \cdot (1/n)$ and $\pi_S = \pi_{S'}$. Since we can apply this equation along a path between any two states, all states must have the same probability in the unique stationary distribution.
To prove convergence, it's tempting to start with the obvious coupling, even though it doesn't actually work. Pick the same position and value for both copies of the chain. If $x$ and $y$ are adjacent, then they coalesce with probability $1/n$ (both probability-$1/2n$ transitions are feasible for both copies, since the neighboring nodes always have the same state). What is the probability that they diverge? We can only be prevented from picking a value if the value is 1 and some neighbor is 1. So the bad case is when $x_i = 1$, $y_i = 0$, and we attempt to set some neighbor of $i$ to 1; in the worst case, this happens $\Delta/2n$ of the time, which is at least $1/n$ when $\Delta \ge 2$. No coalescence here!
This could be a sign that our random walk is no good, or it could be a
sign that our coupling is no good. But we can avoid figuring out which is
the case by using a sneakier random walk. We’ll adopt the approach used
in [MU17, §12.6].
Here the idea is that we pick a random edge uv, and then try to do one
of the following operations, all with equal probability:
1. Set u = v = 0.
2. Set u = 0 and v = 1.
3. Set u = 1 and v = 0.
know what the actual stationary probabilities are, since we don’t know how
many independent sets our graph has.
So now what happens if we run two coupled copies of this process, where
the copies differ on exactly one vertex i?
First, every neighbor of i is 0 in both processes. A transition that doesn’t
involve any neighbors of i will have the same effect on both processes. So we
need to consider all choices of edges where one of the endpoints is either i or
a neighbor j of i. In the case where the other endpoint isn’t i, we’ll call it k;
there may be several such k.
If we choose $ij$ and don't try to set $j$ to one, we always coalesce the states. This occurs with probability $\frac{2}{3m}$. If we try to set $i$ to zero and $j$ to one, we may fail in both processes, because $j$ may have a neighbor $k$ that is already one; this will preserve the distance between the two processes. Similarly, if we try to set $j$ to one as part of a change to some $jk$, we will also get a divergence between the two processes: in this case, the distance will actually increase. This can only happen if $j$ has at most one neighbor $k$ (other than $i$) that is already in the independent set; if there are two such $k$, then we can't set $j$ to one no matter what the state of $i$ is.
This argument suggests that we need to consider three cases for each j,
depending on the number s of nodes k 6= i that are adjacent to j and have
xk = yk = 1. In each case, we assume xi = 0 and yi = 1, and that all other
nodes have the same value in both x and y. (Note that these assumptions
mean that any such k can’t be adjacent to i, because we have yk = yi = 1.)
Considering all three cases, in the worst case we have $E[d(X_{t+1}, Y_{t+1}) \mid X_t, Y_t] = d(X_t, Y_t) + \frac{\Delta - 4}{3m}$. For $\Delta \le 3$ (a pretty restrictive case), the expected change per step is at most $-\frac{1}{3m}$, giving a decent enough expected coupling time of $O(m \log n)$.
Here, we've considered the case where all independent sets have the same probability. One can also bias the random walk in favor of larger independent sets by accepting increases with higher probability than decreases (as in Metropolis-Hastings); this samples independent sets of size $s$ with probability proportional to $\lambda^s$. Some early examples of this approach are given in [LV97, LV99, DG00]. The question of exactly which values of $\lambda$ give polynomial convergence times is still open; see [MWW07] for some more recent bounds.
and
$$\begin{bmatrix} 1 & -1 \end{bmatrix}\begin{bmatrix} p & q \\ q & p \end{bmatrix} = \begin{bmatrix} p - q & q - p \end{bmatrix} = (p - q)\cdot\begin{bmatrix} 1 & -1 \end{bmatrix}.$$
this as
\begin{align*}
xP &= \left(\frac{1}{2}u^1 + \frac{1}{2}u^2\right)P \\
&= \frac{1}{2}u^1 P + \frac{1}{2}u^2 P \\
&= \frac{1}{2}\lambda_1 u^1 + \frac{1}{2}\lambda_2 u^2.
\end{align*}
This uses the defining property of eigenvectors, that $u^i P = \lambda_i u^i$.
In general, if $x = \sum_i a_i u^i$, then $xP = \sum_i a_i \lambda_i u^i$ and $xP^t = \sum_i a_i \lambda_i^t u^i$. For any eigenvalue $\lambda_i$ with $|\lambda_i| < 1$, $\lambda_i^t$ goes to zero in the limit. So only those eigenvectors with $\lambda_i = 1$ survive. For irreducible aperiodic chains, these consist only of the stationary distribution $\pi = u^i/\|u^i\|_1$. For reducible or periodic chains, there may be some additional eigenvalues with $|\lambda_i| = 1$, but the only possibility that arises for an irreducible reversible chain is $\lambda_n = -1$, corresponding to a chain with period 2.¹⁴
Assuming that $|\lambda_2| \ge |\lambda_n|$, as $t$ grows large $\lambda_2^t$ will dominate the other smaller eigenvalues, and so the size of $\lambda_2$ will control the rate of convergence of the underlying Markov process.
This assumption is always true for lazy walks that stay put with probability $1/2$, because all eigenvalues of a lazy walk are non-negative. The reason is that any such walk has transition matrix $\frac{1}{2}(P + I)$, where $I$ is the identity matrix and $P$ is the transition matrix of the unlazy version of the walk. If $xP = \lambda x$ for some $x$ and $\lambda$, then $x \cdot \frac{1}{2}(P + I) = \frac{1}{2}(\lambda + 1)x$. This means that $x$ is still an eigenvector of the lazy walk, and its corresponding eigenvalue is $\frac{1}{2}(\lambda + 1) \ge \frac{1}{2}((-1) + 1) = 0$.
But what does having small $\lambda_2$ mean for total variation distance? If $x^t$ is a vector representing the distribution of our position at time $t$, then $d_{TV} = \frac{1}{2}\sum_{i=1}^n |x^t_i - \pi_i| = \frac{1}{2}\|x^t - \pi\|_1$. But we know that $x^0 = \pi + \sum_{i=2}^n c_i u^i$ for some coefficients $c_i$, giving $\|x^t - \pi\|_2^2 = \sum_{i=2}^n \lambda_i^{2t} c_i^2$ if we normalize each $u^i$ so that $\|u^i\|_2^2 = 1$. But then
\begin{align*}
\|x^t - \pi\|_2^2 &= \sum_{i=2}^n \lambda_i^{2t} c_i^2 \\
&\le \sum_{i=2}^n \lambda_2^{2t} c_i^2 \\
&= \lambda_2^{2t} \sum_{i=2}^n c_i^2 \\
&= \lambda_2^{2t} \|x^0 - \pi\|_2^2.
\end{align*}
Taking square roots gives
$$\|x^t - \pi\|_2 \le \lambda_2^t \|x^0 - \pi\|_2.$$
So now we just need a tool for bounding λ2 . For a small chain with a
known transition matrix, we can just feed it to our favorite linear algebra
library, but most of the time we will not be able to construct the matrix
explicitly. So we need a way to bound λ2 indirectly, in terms of other
structural properties of our Markov chain.
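For chains small enough to write down, the favorite-linear-algebra-library route is a few lines. Here we compute the eigenvalues of the lazy ±1 walk on a 6-cycle (a chain chosen purely for illustration) and read off $\lambda_2$, comparing against its known closed form for the cycle.

    import numpy as np

    n = 6
    P = np.zeros((n, n))
    for i in range(n):
        P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.25   # lazy +-1 walk on Z_n
        P[i, i] = 0.5

    eigs = sorted(np.linalg.eigvals(P).real, reverse=True)
    lambda2 = eigs[1]
    print(lambda2, 0.5 + 0.5 * np.cos(2 * np.pi / n))  # known value for the cycle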
10.6 Conductance
The conductance or Cheeger constant $\Phi(S)$ of a set $S$ of states in a Markov chain is
$$\Phi(S) = \frac{\sum_{i \in S, j \notin S} \pi_i p_{ij}}{\pi(S)}. \qquad (10.6.1)$$
This is the probability of leaving $S$ on the next step, starting from the stationary distribution conditioned on being in $S$. The conductance is a measure of how easy it is to escape from a set. It can also be thought of as a weighted version of edge expansion.
The conductance of a Markov chain as a whole is obtained by taking the minimum of $\Phi(S)$ over all $S$ that occur with probability at most $1/2$:
$$\Phi = \min_{S : \pi(S) \le 1/2} \Phi(S). \qquad (10.6.2)$$
$$1 - 2\Phi \le \lambda_2 \le 1 - \Phi^2/2. \qquad (10.6.3)$$
assigning a unique path γxy from each state x to each state y in a way that
doesn’t send too many paths across any one edge. So if we have a partition
of the state space into sets S and T , then there are |S| · |T | paths from states
in S to states in T , and since (a) every one of these paths crosses an S–T
edge, and (b) each S–T edge carries at most ρ paths, there must be at least
|S| · |T |/ρ edges from S to T . Note that because there is no guarantee we
chose good canonical paths, this is only useful for getting lower bounds on
conductance—and thus upper bounds on mixing time—but this is usually
what we want.
Let's start with a small example. Let $G = C_n \times C_m$, the $n \times m$ torus. A lazy random walk on this graph moves north, east, south, or west with probability $1/8$ each, wrapping around when the coordinates reach $n$ or $m$. Since this is a random walk on a regular graph, the stationary distribution is uniform. What is the relaxation time?
Intuitively, we expect it to be O(max(n, m)2 ), because we can think of this
two-dimensional random walk as consisting of two one-dimensional random
walks, one in the horizontal direction and one in the vertical direction, and
we know that a random walk on a cycle mixes in O(n2 ) time. Unfortunately,
the two random walks are not independent: every time I take a horizontal
step is a time I am not taking a vertical step. We can show that the expected
coupling time is O(n2 + m2 ) by running two sequential instances of the
coupling argument for the cycle, where we first link the two copies in the
horizontal direction and then in the vertical direction. So this gives us one
bound on the mixing time. But what happens if we try to use conductance?
Here it is probably easiest to start with just a cycle. Given points $x$ and $y$ on $C_n$, let the canonical path $\gamma_{xy}$ be a shortest path between them, breaking ties so that half the paths go one way and half the other. Then each edge is crossed by exactly $k$ paths of length $k$ for each $k = 1 \ldots (n/2 - 1)$, and at most $n/4$ paths of length $n/2$ (0 if $n$ is odd), giving a total of $\rho \le \frac{(n/2-1)(n/2)}{2} + \frac{n}{4} = n^2/8$ paths across the edge.
If we now take an $S$–$T$ partition where $|S| = m$, we get at least $m(n-m)/\rho = 8m(n-m)/n^2$ $S$–$T$ edges. This peaks at $m = n/2$, where we get 2 edges—exactly the right number—and in general when $m \le n/2$ we get at least $8m(n/2)/n^2 = 4m/n$ outgoing edges, giving a conductance $\Phi(S) \ge \frac{(1/4n)(4m/n)}{m/n} = 1/n$.
This is essentially what we got before, except we have to divide by 2
because we are doing a lazy walk. Note that for small m, the bound is a
gross underestimate, since we know that every nonempty proper subset has
at least 2 outgoing edges.
$$\Phi(S) \ge \frac{1}{8nm} \cdot \frac{4|S|/n}{|S|/nm} = \frac{1}{2n}.$$
10.6.3 Congestion
For less symmetric chains, we weight paths by the probabilities of their endpoints when counting how many cross each edge, and treat the flow across the edge as a capacity. This gives the congestion of a collection of canonical paths $\Gamma = \{\gamma_{xy}\}$, which is computed as
$$\rho(\Gamma) = \max_{uv \in E} \frac{1}{\pi_u p_{uv}} \sum_{\gamma_{xy} \ni uv} \pi_x \pi_y,$$
Since each $S$–$T$ path crosses at least one $S$–$T$ edge, we have
\begin{align*}
\pi(S)\pi(T) &= \sum_{x \in S, y \in T} \pi_x \pi_y \\
&\le \sum_{u \in S, v \in T, uv \in E} \;\sum_{\gamma_{xy} \ni uv} \pi_x \pi_y \\
&\le \sum_{u \in S, v \in T, uv \in E} \rho\, \pi_u p_{uv} \\
&= \rho \sum_{u \in S, v \in T, uv \in E} \pi_u p_{uv}.
\end{align*}
But then
\begin{align*}
\Phi(S) &= \frac{\sum_{u \in S, v \in T, uv \in E} \pi_u p_{uv}}{\pi(S)} \\
&\ge \frac{\pi(S)\pi(T)/\rho}{\pi(S)} \\
&= \frac{\pi(T)}{\rho} \\
&\ge \frac{1}{2\rho}.
\end{align*}
To get the bound on $\tau_2$, use (10.6.4) to compute $\tau_2 \le 2/\Phi^2 \le 8\rho^2$.
10.6.4 Examples
Here are some more examples of applying canonical paths.
This gives a relaxation time $\tau_2 \le 8\rho^2 = 8n^2$, which when we account for the large state space gives $t_{mix}(\varepsilon) \le 8n^2\left(\frac{1}{2}\ln 2^n + \ln(1/\varepsilon)\right) = O(n^3)$. In this case the bound is substantially worse than what we previously proved using coupling.
The fact that the number of canonical paths that cross a particular edge
is exactly one half the number of nodes in the hypercube is not an accident:
if we look at what information we need to fill in to compute x and y from u
and v, we need (a) the part of x we’ve already gotten rid of, plus (b) the
part of y we haven’t filled in yet. If we stitch these two pieces together, we
get all but one of the n bits we need to specify a node in the hypercube, the
missing bit being the bit we flip to get from u to v. This sort of thing shows
up often in conductance arguments where we build our canonical paths by
fixing a structure one piece at a time.
vertex in each), then for each component replace the $X$ edges with $Y$ edges. If we do this cleverly enough, we can guarantee that for any transition $ST$, the set of edges $(X \cup Y) \setminus (S \cup T)$ always consists of a matching plus at most two extra edges, and that $S$, $T$, and $(X \cup Y) \setminus (S \cup T)$ are enough to reconstruct $X$ and $Y$. Since there are at most $Nm^2$ choices of $(X \cup Y) \setminus (S \cup T)$, this will give at most $Nm^2$ canonical paths across each transition.¹⁵
Here is how we do the replacement within a component. If $C$ is a cycle with $k$ vertices, order the vertices $v_0, \ldots, v_{k-1}$ such that $v_0$ has the smallest index in $C$ and $v_0 v_1 \in X$. Then $X \cap C$ consists of the even-numbered edges $e_{2i} = v_{2i}v_{2i+1}$ and $Y \cap C$ consists of the odd-numbered edges $e_{2i+1} = v_{2i+1}v_{2i+2}$, where we take $v_k = v_0$. We now delete $e_0$, then alternate between deleting $e_{2i}$ and adding $e_{2i-1}$ for $i = 1 \ldots k/2 - 1$. Finally we add $e_{k-1}$.
The number of edges in $C$ at each step in this process is always one of $k/2$, $k/2 - 1$, or $k/2 - 2$, and since we always add or delete an edge, the number of edges in $C \cap (S \cup T)$ for any transition will be at least $k/2 - 1$, which means that the total degree of the vertices in $C \cap (S \cup T)$ is at least $k - 2$. We also know that $S \cup T$ is a matching, since it's either equal to $S$ or equal to $T$. So there are at least $k - 2$ vertices in $C \cap (S \cup T)$ with degree 1, and none with degree 2, leaving at most 2 vertices with degree 0. In the complement $C \setminus (S \cup T)$, this becomes at most two vertices with degree 2, with the rest having degree 1. If these vertices are adjacent, removing the edge between them leaves a matching; if not, removing one edge adjacent to each does so. So two extra edges are enough.
A similar argument works for paths, which we will leave as an exercise for the reader.
Now suppose we know $S$, $T$, and $(X \cup Y) \setminus (S \cup T)$. We can compute $X \cup Y = ((X \cup Y) \setminus (S \cup T)) \cup (S \cup T)$; this means we can reconstruct the graph $X \cup Y$, identify its components, and so on. We know which component $C$ we are working on because we know which edge changes between $S$ and $T$. For any earlier component $C'$, we have $C' \cap S = C' \cap Y$ (since we finished already), and similarly $C' \setminus S = C' \cap X$. For components $C'$ we haven't reached yet, the reverse holds: $C' \cap S = C' \cap X$ and $C' \setminus S = C' \cap Y$. This determines the edges in both $X$ and $Y$ for all edges not in $C$.
For $C$, we know from $C \cap (X \cup Y)$ which vertex is $v_0$. In a cycle, whenever we remove an edge, its lower-numbered endpoint is always an even distance from $v_0$, and similarly when we add an edge, its lower-numbered endpoint is always an odd distance from $v_0$. So we can orient the cycle and tell

¹⁵We could improve the constant a bit by using $N\binom{m}{2}$, but we won't.
which edges in $S \cup T$ have already been added (and are thus part of $Y$) and which have been removed (and are thus part of $X$). Since we know that $C \cap Y = C \setminus X$, this is enough to reconstruct both $C \cap X$ and $C \cap Y$. (The same general idea works for paths, although for paths we do not need to be as careful to figure out orientation, since we can only leave $v_0$ in one direction.)
The result is that given $S$ and $T$, there are at most $Nm^2$ choices of $X$ and $Y$ such that the canonical path $\gamma_{XY}$ crosses $ST$. This gives $\rho \le Nm^2 \cdot \frac{N^{-2}}{\frac{1}{2m}N^{-1}} = 2m^3$, which gives $\tau_2 \le 32m^6$. This is not great, but it is at least polynomial in the size of the graph. For actual sampling, this translates to $O\left(m^6\left(\log N + \log\frac{1}{\varepsilon}\right)\right) = O\left(m^6\left(m + \log\frac{1}{\varepsilon}\right)\right)$ steps to get a total variation distance of $\varepsilon$.
This may or may not give rapid mixing, depending on how big $N$ is relative to $m$. For many graphs, the number of matchings $N$ will be exponential in $m$, and so the $m^6$ term will be polylogarithmic in $N$. But $N$ can be much smaller in dense graphs. For example, on a star we have $N = m + 1$, since no matching can contain more than one edge, making the bound $\tau_2 \le 32m^6 = \Theta(N^6)$ polynomial in $N$ but not polynomial in $\log N$, which is what we want. The actual mixing time in this case is $\Theta(m)$ (for the upper bound, wait to delete the only edge, and do so on both sides of the coupling; for the lower bound, observe that until we delete the only edge we are still in our initial state). This is much better than the upper bound from canonical paths, but it still doesn't give rapid mixing.
Chapter 11
Approximate counting
$i$, we can cut this time to $O(\log n)$ such computations using binary search. The second step depends on whatever mechanism we use to unrank within $U_i$.
An example would be generating one of the $\binom{n}{k}$ subsets of a set of size $n$, where we partition the subsets into classes $S_i$ according to which element $i$ is their smallest element in some fixed ordering. But then we can easily compute $|S_i| = \binom{n-i}{k-1}$ for each $i$, apply the above technique to pick a particular $S_i$ to start with, and then recurse within $S_i$ to get the rest of the elements.
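Here is a minimal sketch of this unranking scheme for $k$-subsets of $\{1, \ldots, n\}$: choose the smallest element by scanning block sizes (binary search would tighten this to $O(\log n)$ comparisons per level, as noted above), then recurse on the remainder of the rank.

    from math import comb

    def unrank_subset(n, k, r):
        """Return the r-th (0-indexed, 0 <= r < comb(n, k)) k-subset of
        {1, ..., n}, ordering subsets by smallest element first, then
        recursively by the rest."""
        subset, lo = [], 1
        while k > 0:
            for i in range(lo, n + 1):
                block = comb(n - i, k - 1)  # subsets whose smallest element is i
                if r < block:
                    subset.append(i)
                    lo, k = i + 1, k - 1
                    break
                r -= block
        return subset

    # sanity check: ranks 0 .. C(5,3)-1 enumerate all 10 subsets exactly once
    print([unrank_subset(5, 3, r) for r in range(comb(5, 3))])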
A more general case is when we can’t easily sample from U but we can
sample from some T ⊇ U . Here rejection sampling comes to our rescue
erty that prevents us from setting all the $x_i$ to zero). We'll assume that the $a_i$ and $b$ are all integers.
For #KNAPSACK, we want to compute $|S|$, where $S$ is the set of all assignments to the $x_i$ that make $\sum_{i=1}^n a_i x_i \le b$.
\begin{align*}
\sum_{i=1}^n a'_i x_i &= \sum_{i=1}^n \lfloor n^2 a_i / b \rfloor x_i \\
&\le \sum_{i=1}^n (n^2/b) a_i x_i \\
&= (n^2/b) \sum_{i=1}^n a_i x_i \\
&\le n^2,
\end{align*}
which shows $\vec{x} \in S'$.
The converse does not hold. However, we can argue that any $\vec{x} \in S'$ can be shoehorned into $S$ by setting at most one of the $x_i$ to 0. Consider the set of all positions $i$ such that $x_i = 1$ and $a_i > b/n$. If this set is empty, then $\sum_{i=1}^n a_i x_i \le \sum_{i=1}^n b/n = b$, and $\vec{x}$ is already in $S$. Otherwise, pick any position $i$ with $x_i = 1$ and $a_i > b/n$, and let $y_j = 0$ when $j = i$ and $y_j = x_j$ otherwise. If we recall that the total error from the floors in our approximation was at most $(b/n^2) \cdot n = b/n$, the intuition is that deleting $y_j$
To make this concrete, let's look at the specific problem studied by Karp and Luby of approximating the number of satisfying assignments to a DNF formula. A DNF formula is a formula that is in disjunctive normal form: it is an OR of zero or more clauses, each of which is an AND of variables or their negations. An example would be $(x_1 \wedge x_2 \wedge x_3) \vee (\neg x_1 \wedge x_4) \vee x_2$. The #DNF problem is to count the number of satisfying assignments of a formula presented in disjunctive normal form.
Solving #DNF exactly is #P-complete, so we don't expect to be able to do it. Instead, we'll get an FPRAS by cleverly sampling solutions. The need for cleverness arises because just sampling solutions directly by generating one of the $2^n$ possible assignments to the $n$ variables may find no satisfying assignments at all, since the size of any individual clause might be big enough that getting a satisfying assignment for that clause at random is exponentially unlikely.
So instead we will sample pairs (x, i), where x is an assignment that satisfies clause $C_i$; these are easier to find, because if we know which clause $C_i$ we are trying to satisfy, we can read off the satisfying assignment from its variables. Let S′ be the set of such pairs. For each pair (x, i), define
$f(x,i) = 1$ if and only if $C_j(x) = 0$ for all $j < i$. Then $\sum_{(x,i)\in S'} f(x,i)$ counts every satisfying assignment x, because (a) there exists some i such that x satisfies $C_i$, and (b) only the smallest such i will have $f(x,i) = 1$. In effect, f is picking out a single canonical satisfied clause from each satisfying assignment. Note that we can compute f efficiently by testing x against all clauses $C_j$ with $j < i$.
Our goal is to estimate the proportion ρ of "good" pairs with $f(x,i) = 1$ out of all pairs in $S' = \biguplus S_i$, and then use this to estimate $|S| = \sum_{(x,i)\in S'} f(x,i) = \rho|S'|$. If we can sample from $S'$ uniformly, the proportion
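Here is a minimal Python sketch of the resulting Karp-Luby estimator, under the assumption that each clause is represented as a dict mapping variable indices to required truth values (and contains no contradictory literals):

    import random

    def karp_luby(clauses, n, samples):
        # |S_i| = 2^(n - #literals in C_i); |S'| = sum of the |S_i|.
        sizes = [2 ** (n - len(c)) for c in clauses]
        total = sum(sizes)
        good = 0
        for _ in range(samples):
            # Sample (x, i) uniformly from S': pick i proportional to
            # |S_i|, then a uniform assignment satisfying C_i.
            i = random.choices(range(len(clauses)), weights=sizes)[0]
            x = [random.randrange(2) for _ in range(n)]
            for v, val in clauses[i].items():
                x[v] = val
            # f(x, i) = 1 iff no earlier clause is satisfied by x.
            if not any(all(x[v] == val for v, val in clauses[j].items())
                       for j in range(i)):
                good += 1
        return total * good / samples   # estimate of rho * |S'|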
11.5.1 Matchings
We saw in §10.6.4.3 that a random walk on matchings on a graph with m edges has mixing time $\tau_2 \le 32m^6$, where the walk is defined by selecting an edge uniformly at random and flipping whether it is in the matching or not, while rejecting any steps that produce a non-matching. This allows us to sample matchings of a graph with δ total variation distance from the uniform distribution in $O\left(m^6\left(\log N + \log\frac{1}{\delta}\right)\right)$ time, where N is the number of matchings. Since every matching is a subset of the edges, we can crudely bound $N \le 2^m$, which lets us rewrite the sampling time as $O\left(m^6\left(m + \log\frac{1}{\delta}\right)\right)$.
Suppose now that we want to count matchings instead of sampling them. It's easy to show that for any particular edge $uv \in G$, at least half of all matchings in G don't include uv: the reason is that if M is a matching in G, then $M' = M \setminus \{uv\}$ is also a matching, and at most two matchings $M'$ and $M' \cup \{uv\}$ are mapped to any one $M'$ by this mapping.
Order the edges of G arbitrarily as $e_1, e_2, \ldots, e_m$. Let $S_i$ be the set of matchings in $G \setminus \{e_1 \ldots e_i\}$. Then $S_0$ is the set of all matchings, and we've just argued that $\rho_{i+1} = |S_{i+1}|/|S_i| \ge 1/2$. We also know that $|S_m|$ counts the number of matchings in a graph with no edges, so it's exactly one. So we can use the product-of-ratios trick to compute $|S_0| = \prod_{i=0}^{m-1} \frac{|S_i|}{|S_{i+1}|}$.
A random walk of length $O\left(m^6\left(m + \log\frac{1}{\eta}\right)\right)$ can sample matchings from $S_i$ with a probability $\rho'$ of getting a matching in $S_{i+1}$ that is between $(1-\eta)\rho_{i+1}$ and $(1+\eta)\rho_{i+1}$. From Lemma 11.2.1, we can estimate $\rho'$ within relative error γ with probability at least $1-\zeta$ using $O\left(\frac{1}{\gamma^2\rho'}\log\frac{1}{\zeta}\right) = O\left(\frac{1}{\gamma^2}\log\frac{1}{\zeta}\right)$ samples. Combined with the error on $\rho'$, this gives relative error at most $\gamma + \eta + \gamma\eta$ in $O\left(m^6\left(m+\log\frac{1}{\eta}\right)\frac{1}{\gamma^2}\log\frac{1}{\zeta}\right)$ operations.¹ If we then multiply
1
This is the point where sensible people start hauling out the $\tilde{O}$ notation, where a function is $\tilde{O}(f(n))$ if it is $O(f(n)\cdot g)$ where g is polylogarithmic in n and any other parameters that may be running around ($\frac{1}{\epsilon}$, $\frac{1}{\eta}$, etc.).
out all the estimates for $|S_i|/|S_{i+1}|$, we get an estimate of $|S_0|$ that is at most $(1+\gamma+\eta+\gamma\eta)^m$ times the correct value with probability at least $1-m\zeta$ (with a similar bound on the other side), in total time $O\left(m^7\left(m+\log\frac{1}{\eta}\right)\frac{1}{\gamma^2}\log\frac{1}{\zeta}\right)$.
To turn this into a fully polynomial-time approximation scheme, given $\epsilon$, δ, and m, we need to select η, γ, and ζ to get relative error $\epsilon$ with probability at least 1 − δ. Letting ζ = δ/m gets the δ part. For $\epsilon$, we need $(1+\gamma+\eta+\gamma\eta)^m \le 1+\epsilon$. Suppose that $\epsilon < 1$ and let $\gamma = \eta = \epsilon/6m$. Then
$$(1+\gamma+\eta+\gamma\eta)^m \le \left(1+\frac{\epsilon}{2m}\right)^m \le e^{\epsilon/2} \le 1+\epsilon.$$
Plugging these values into our cost formula gives $O(m^8)$ times a bunch of factors that are polynomial in $\log m$ and $\frac{1}{\epsilon}$, which we can abbreviate as $\tilde{O}(m^8)$.
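As a sketch of how the pieces fit together, here is the product-of-ratios estimator in Python, treating the sampler as an assumed black box sample_matching(edge_list) that returns an approximately uniform matching (say, via the random walk above); the error and sample-count bookkeeping from the analysis is omitted:

    def estimate_matchings(edges, sample_matching, k):
        # |S_0| = prod_i |S_i|/|S_{i+1}|, where S_i is the set of
        # matchings of the graph with edges e_1..e_i deleted. The ratio
        # |S_{i+1}|/|S_i| >= 1/2 is estimated as the fraction of k
        # sampled matchings of G \ {e_1..e_i} that avoid e_{i+1}.
        estimate = 1.0
        for i in range(len(edges)):
            hits = sum(edges[i] not in sample_matching(edges[i:])
                       for _ in range(k))
            estimate *= k / max(hits, 1)   # contributes |S_i|/|S_{i+1}|
        return estimate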
Chapter 12

Hitting times
In addition to using Markov chains for sampling, we can also use Markov
chains to model processes that we hope will eventually reach some terminating
state. These Markov chains will generally not be irreducible, since often
the terminating states are inescapable, and may or may not have the other
desirable properties we needed for convergence analysis. So instead of looking
at convergence to a stationary distribution that might not even exist, we
will be mostly interested in the hitting time for some subset of A of the
states, defined as the minimum τ such that X τ ∈ A starting from some given
initial distribution on X 0 . As suggested by the notation, hitting times will a
special case of stopping times (see Chapter 9), and if we are very lucky we
may be able to use the optional stopping theorem to bound them. If we are
less lucky, we will use whatever tools we can.
For example, if each step of a process independently succeeds with probability p, and τ is the time until the first success, then
$$E[\tau] = p\cdot 1 + (1-p)(1+E[\tau]) = 1 + (1-p)E[\tau],$$
which gives $E[\tau] = 1/p$.
$$p_{k,k+1} = \frac{k(n-k)}{\binom{n}{2}}, \qquad p_{k,k} = 1 - p_{k,k+1},$$
and all other transition probabilities are 0. So any trajectory of this process
involves starting in 1, moving to 2 after some time, then 3, and so on until
all n agents are infected.
The hitting time τn for n will be the sum of the waiting times for each of
these steps. This gives
$$E[\tau_n] = \sum_{k=1}^{n-1}\frac{1}{p_{k,k+1}} = \sum_{k=1}^{n-1}\frac{\binom{n}{2}}{k(n-k)} = \binom{n}{2}\sum_{k=1}^{n-1}\left(\frac{1/n}{k} + \frac{1/n}{n-k}\right) = \frac{n-1}{2}\sum_{k=1}^{n-1}\left(\frac{1}{k}+\frac{1}{n-k}\right) = (n-1)H_{n-1}.$$
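A quick way to sanity-check this formula is to simulate the chain directly; the following sketch compares the empirical mean hitting time with $(n-1)H_{n-1}$:

    import random

    def epidemic_time(n):
        # From state k, move to k+1 with probability k(n-k)/C(n,2).
        pairs = n * (n - 1) / 2
        k, t = 1, 0
        while k < n:
            t += 1
            if random.random() < k * (n - k) / pairs:
                k += 1
        return t

    n, trials = 50, 2000
    sim = sum(epidemic_time(n) for _ in range(trials)) / trials
    exact = (n - 1) * sum(1 / k for k in range(1, n))
    print(sim, exact)   # the two values should be close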
$$r' = rs - pr, \qquad p' = pr - sp, \qquad s' = sp - rs,$$
where r, p, and s represent the densities of rocks, papers, and scissors in the population at some time and primes are used to indicate derivatives with respect to time.
This system of differential equations does not have simple closed-form solutions, but it does have some interesting invariants. If we compute $(r+p+s)' = r' + p' + s' = (rs-pr) + (pr-sp) + (sp-rs) = 0$, we see that the total concentration of agents $r+p+s$ doesn't change over time. This is
1
This is also a special case of the Lotka-Volterra model in population dynamics.
The translation is to convert rocks to wolves, paper to grass, and scissors to sheep, and
adopt the rules “sheep eats grass”, “wolf eats sheep”, and “grass starves wolf.”
This shows that the product of the three concentrations also doesn’t change
over time, giving a family of closed-loop trajectories in the continuous case.
This makes the rock-paper-scissors process a useful way to build an oscillator
in a dynamical system.
In the discrete case, we don’t have infinitesimal rocks breaking infinites-
imal scissors over infinitesimal time intervals. Instead, we have a Markov
chain. The states of this Markov chain are triples of counts hRt , Pt , St i, and
at each step we see transitions
$$\langle R,P,S\rangle \to \langle R-1,P+1,S\rangle \text{ with probability } RP/\binom{n}{2},$$
$$\langle R,P,S\rangle \to \langle R,P-1,S+1\rangle \text{ with probability } PS/\binom{n}{2},$$
$$\langle R,P,S\rangle \to \langle R+1,P,S-1\rangle \text{ with probability } SR/\binom{n}{2},$$
$$\Phi = RPS,$$
Suppose paper covers rock, which occurs with probability $PR/\binom{n}{2}$. Then the change in Φ is
$$\Delta\Phi = (R-1)(P+1)S - RPS = RPS + RS - PS - S - RPS = (R-P-1)S.$$
$$E[\Delta\Phi] = \frac{RPS}{\binom{n}{2}}\left((R-P-1) + (P-S-1) + (S-R-1)\right) = RPS\cdot\frac{-3}{\binom{n}{2}}.$$
is a martingale.
For proving an upper bound on the hitting time, the first formulation (12.2.1) is more useful. Using our old friend $1+x \le e^x$, we can rewrite the
$$\Pr[\Phi_t \ne 0] = \Pr[\Phi_t \ge n-2] \le \frac{E[\Phi_t]}{n-2} \le \frac{(n/3)^3 e^{-6t/n(n-1)}}{n-2} = O\left(n^2\exp\left(-\Theta(t/n^2)\right)\right).$$
We can immediately see that there is some value $t = O(n^2\log n)$ that knocks $\Pr[\Phi_t \ne 0]$ down to at most 1/2. As when converging to a stationary distribution, if we lose this coin-flip, we can restart the argument and try again. This gives an expected waiting time of $O(n^2\log n)$ to reach Φ = 0.
We are not quite done. Having Φ = 0 only means that one of our three
species has disappeared; for full convergence, we need to lose two. But once
we are down to two remaining species (say rock and paper), we have an
epidemic process. From our previous analysis, we know that only O(n log n)
additional steps are needed to get down to one.
It’s worth noting that hitting times are generally stopping times. So
perhaps there is a cleaner argument using (12.2.2) and the optional stopping
theorem (see Theorem 9.3.1). Let τ be the first time at which Φτ = 0.
We've already shown that $E[\tau]$ is finite, so the bounded-increments case of Theorem 9.3.1 is tempting. But if we get unlucky and $\Phi_{t+1} = \Phi_t \ne 0$ for some large t, $Z_t$ can increase by an arbitrarily large amount. The bounded time and bounded range cases also don't apply. It's tempting to see if the general case works, but since we know that $Z_\tau = 0 \ne Z_0$ and the optional stopping theorem implies $E[Z_\tau] = Z_0$, we can rule out applying the theorem to this particular martingale and stopping time no matter how clever we get. What we can do is use a fixed time t and compute $E[Z_t] = Z_0 = \Phi_0$, but at this point we are just reinventing (12.2.1).
To salve our disappointment, let's get a lower bound on the expected fixation time. Use $1 - x \ge e^{-2x}$, which holds for $0 \le x \le 1/2$ at least, to get
$$E[\Phi_t] = \Phi_0\left(1 - \Theta(1/n^2)\right)^t \ge \Phi_0\exp\left(-\Theta(t/n^2)\right).$$
still $\Theta(n^3)$, just with a smaller constant. But then we can argue
$$\Theta(n^3) = E[\Phi_t] = E[\Phi_t \mid \Phi_t \ne 0]\Pr[\Phi_t \ne 0] \le O(n^3)\Pr[\Phi_t \ne 0],$$
1. If there exists δ > 0 such that for all $s \in S\setminus\{0\}$ and for all $t \ge 0$, $\Delta_t(s) = E[X_t - X_{t+1} \mid X_t = s] \ge \delta$, then
$$E[T] \le \frac{E[X_0]}{\delta}.$$
2. If there exists δ > 0 such that for all $s \in S\setminus\{0\}$ and for all $t \ge 0$, $\Delta_t(s) = E[X_t - X_{t+1} \mid X_t = s] \le \delta$, then
$$E[T] \ge \frac{E[X_0]}{\delta}.$$
The proof of Theorem 12.3.1 is a fairly simple application of the optional stopping theorem. Depending on which case we are in, $X_t + t\delta$ is either a supermartingale or submartingale; we have a bounded range because of the assumption that S is finite; and we have finite time in the supermartingale case because the downward drift and the finite state space imply that there is always a path to 0 that we can take with nonzero probability, and in the submartingale case because if we don't have finite time, $E[T]$ is infinite and the lower bound holds trivially. So in either case we are looking at arguments that we've done before.
Where the theorem is helpful is that it saves us from having to repeat these arguments. For example, in the case of a biased random walk with reflecting barriers at 0 and n and a probability p > 1/2 of dropping at each step, we can compute $E[X_t - X_{t+1} \mid X_t = s] \ge p - q$, where $q = 1-p$ (note that the convention here is that we are tracking the expected drop, which is the negative of the expected change), and instantly get $E[T] \le n/(p-q)$ for any starting point $X_0 \le n$. In the other direction we are not so lucky: the reflecting barrier at n means that $E[X_t - X_{t+1} \mid X_t = n] = 1$, so the best bound we get starting at $X_0 = n$ is $E[T] \ge n$, which we don't really need the theorem to get.
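A small simulation, assuming the convention above that the walk always drops from n, illustrates how the drift bound compares to the truth:

    import random

    def hitting_time(n, p, x0):
        # Walk on {0,...,n}: always step down from n (reflecting
        # barrier); otherwise down with probability p, up with 1-p.
        x, t = x0, 0
        while x > 0:
            t += 1
            if x == n or random.random() < p:
                x -= 1
            else:
                x += 1
        return t

    n, p, x0, trials = 100, 0.6, 100, 1000
    mean = sum(hitting_time(n, p, x0) for _ in range(trials)) / trials
    print(mean, x0 / (p - (1 - p)))   # empirical mean vs bound n/(p-q)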
In some cases, it can be hard to find a Lyapunov function with constant
additive drift. Rescaling may help in this case: the idea is to stretch the
Lyapunov function locally by multiplying by the inverse of the expected
drop, so that the new stretched function has expected drop at least 1. This
trick has been reinvented several times in different contexts, dating back at
least to the probabilistic recurrence relations of Karp et al. [KUW88].
We’ll quote the version from the evolutionary analysis literature given by
Lengler [Len20], which is known as the variable drift theorem:
For the full proof, see [Len20, Theorem 2.3.3] or the original [Joh10,
Theorem 4.6]. The intuition (using the notation of [Len20]) is that we can
apply Theorem 12.3.1 to a function
$$g(s) = \begin{cases} \dfrac{s_{\min}}{h(s_{\min})} + \displaystyle\int_{s_{\min}}^{s}\frac{1}{h(\sigma)}\,d\sigma & \text{when } s \ge s_{\min}, \\[1ex] \dfrac{s}{h(s_{\min})} & \text{when } s \le s_{\min}. \end{cases}$$
The slope of this function for large s is 1/h(s), which means that the
expected change in g(s) when we drop by h(s) is roughly h(s) · (1/h(s)) = 1.
The exact bound requires observing that the integral is concave and applying
Jensen’s inequality. For s close to smin , the linear term covers any drop that
goes below smin (where we don’t care about h(s), because we are never going
to land there).
The actual proof does a case analysis on all possible changes in s depending on whether they involve going up, going down but staying above $s_{\min}$, or going down and crossing to 0. Adding up all of these cases shows an expected drift for g(s) of at least 1, reducing to Theorem 12.3.1. The somewhat messy bound is just what we get when we run the bound from the theorem back through g.
A special case of variable drift is multiplicative drift, where $\Delta_t(s) \ge \delta s$ for some constant δ > 0. In this case the bound in Theorem 12.3.2 simplifies to
$$E[T] \le \frac{1 + E[\ln(X_0/s_{\min})]}{\delta}. \tag{12.3.1}$$
For example, we showed in §12.2 that the function $\Phi = RPS$ for the rock-paper-scissors process has multiplicative drift with $\delta = 3/\binom{n}{2}$, $s_{\min} = n-2$,
Chapter 13

The probabilistic method

The probabilistic method is a tool for proving the existence of objects with particular combinatorial properties, by showing that some process generates these objects with nonzero probability.
The relevance of this to randomized algorithms is that in some cases we can make the probability large enough that we can actually produce such objects.
We'll mostly be following Chapter 5 of [MR95] with some updates for more recent results. If you'd like to read more about these techniques, a classic reference on the probabilistic method in combinatorics is the text of Alon and Spencer [AS92].
In each case, the probability that we get a good outcome is actually pretty
high, so we could in principle generate a good outcome by retrying our
random process until it works. There are some more complicated examples
of the method for which this doesn’t work, either because the probability of
success is vanishingly small, or because we can’t efficiently test whether what
we did succeeded (the last example below may fall into this category). This
means that we often end up with objects whose existence we can demonstrate
even though we can’t actually point to any examples of them. For example,
it is known that there exist sorting networks (a special class of circuits
for sorting numbers in parallel) that sort in time O(log n), where n is the
number of values being sorted [AKS83], and these can be generated
randomly with nonzero probability. But the best explicit constructions of
such networks take time Θ(log2 n), and the question of how to find an explicit
network that achieves O(log n) time has been open for decades despite many
efforts to solve it.
Proof. Suppose each pair of schoolchildren flip a fair coin to decide whether they like each other or not. Then the probability that any particular set of k schoolchildren all like each other is $2^{-\binom{k}{2}}$ and the probability that they all dislike each other is the same. Summing over both possibilities and all subsets gives a bound of $\binom{n}{k}2^{1-\binom{k}{2}}$ on the probability that there is at least one
The last step in the proof uses the fact that $2^{1+k/2} < k!$ for $k \ge 3$, which can be tested explicitly for k = 3 and proved by induction for larger k. The resulting bound is a little bit weaker than just saying that n must be large enough that $\binom{n}{k}2^{1-\binom{k}{2}} \ge 1$, but it's easier to use.
$$m/2 = E[X] = (1-p)E[X \mid X < m/2] + pE[X \mid X \ge m/2] \le (1-p)\frac{m-1}{2} + pm.$$
Solving this for p gives the claimed bound.4
By running this enough times to get a good cut, we get a polynomial-time
randomized algorithm for approximating the maximum cut within a factor
of 2, which is pretty good considering that MAX CUT is NP-hard.
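The algorithm itself is as simple as the analysis suggests; a sketch (with my own function names):

    import random

    def random_cut(n, edges):
        # Each vertex joins S or T by a fair coin; each edge crosses
        # with probability 1/2, so the expected cut size is m/2.
        side = [random.randrange(2) for _ in range(n)]
        return sum(side[u] != side[v] for u, v in edges)

    def approx_max_cut(n, edges, tries=100):
        # Retry until we see a cut of size at least m/2.
        best = 0
        for _ in range(tries):
            best = max(best, random_cut(n, edges))
            if 2 * best >= len(edges):
                break
        return best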
There exist better approximation algorithms. Goemans and Williamson [GW95]
give a 0.87856-approximation algorithm for MAX CUT based on randomized
rounding of a semidefinite program.5 The analysis of this algorithm is a
little involved, so we won't attempt to show this here, but we will describe (in §13.2.2) an earlier result, also due to Goemans and Williamson [GW94], that gives a $\frac{3}{4}$-approximation to MAX SAT using a similar technique.
4
This is tight for m = 1, but I suspect it’s an underestimate for larger m. The
main source of slop in the analysis seems to be the step E [X | X ≥ m/2] ≤ m; using a
concentration bound, we should be able to show a much stronger inequality here and thus
a much larger lower bound on p.
5
Semidefinite programs are like linear programs except that the variables are vectors instead of scalars, and the objective function and constraints apply to linear combinations of dot-products of these variables. The Goemans-Williamson MAX CUT algorithm is based on a relaxation of the integer optimization problem of maximizing $\sum_{ij\in E}\frac{1-x_ix_j}{2}$, where each $x_i \in \{-1,+1\}$ encodes membership in S or T. They instead allow $x_i$ to be any unit vector in an n-dimensional space, and then take the sign of the dot-product with a random unit vector r to map each optimized $x_i$ to one side of the cut or the other.
subject to
$$\sum_{i\in C_j^+} y_i + \sum_{i\in C_j^-}(1-y_i) \ge z_j$$
for all j.
The main trick here is to encode OR in the constraints; there is no
requirement that zj is the OR of the yi and (1 − yi ) values, but we maximize
the objective function by setting it that way.
Sadly, solving integer programs like the above is NP-hard (which is not
surprising, since if we could solve this particular one, we could solve SAT).
But if we drop the requirements that yi , zj ∈ {0, 1} and replace them with
0 ≤ yi ≤ 1 and 0 ≤ zj ≤ 1, we get a linear program—solvable in polynomial
time—with an optimal value at least as good as the value for the integer
program, for the simple reason that any solution to the integer program is
also a solution to the linear program.
The problem now is that the solution to the linear program is likely to
be fractional: instead of getting useful 0–1 values, we might find out we are
supposed to make xi only 2/3 true. So we need one more trick to turn the
Linear programming has an interesting history. The basic ideas were developed indepen-
dently by Leonid Kantorovich in the Soviet Union and George Dantzig in the United States
around the start of the Second World War. Kantorovich’s work had direct relevance to
Soviet planning problems, but wasn’t pursued seriously because it threatened the political
status of the planners, required computational resources that weren’t available at the
time, and looked suspiciously like trying to sneak a capitalist-style price system into the
planning process; for a fictionalized account of this tragedy, see [Spu12]. Dantzig’s work,
which included the development of the simplex method for solving linear programs, had
a higher impact, although its publication was delayed until 1947 by wartime secrecy.
8
Randomized rounding was invented by Raghavan and Thompson [RT87]; the particular
application here is due to Goemans and Williamson [GW94].
fractional values back into integers. This is the randomized rounding step:
given a fractional assignment ŷi , we set xi to true with probability ŷi .
So what does randomized rounding do to clauses? In our fractional
solution, a clause might have value ẑj , obtained by summing up bits and
pieces of partially-true variables. We’d like to argue that the rounded version
gives a similar probability that Cj is satisfied.
Suppose $C_j$ has k variables; to make things simpler, we'll pretend that $C_j$ is exactly $x_1 \vee x_2 \vee \ldots \vee x_k$. Then the probability that $C_j$ is satisfied is exactly $1 - \prod_{i=1}^k(1-\hat{y}_i)$. This quantity is minimized subject to $\sum_{i=1}^k \hat{y}_i \ge \hat{z}_j$
gives $E\left[\sum_j X_j\right] + E\left[\sum_j Y_j\right] \ge (3/2)\sum_j\hat{z}_j$. But then one of the two expected sums must beat $(3/4)\sum_j\hat{z}_j$, giving us a (3/4)-approximation algorithm.
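A sketch of the rounding step and the best-of-two combination in Python, assuming an LP solver has already produced the fractional optimum ŷ and that each clause is given as a pair of (positive indices, negative indices):

    import random

    def round_and_count(y_hat, clauses):
        # Set x_i true with probability y_hat[i], then count clauses
        # satisfied by the rounded assignment.
        x = [random.random() < y for y in y_hat]
        return sum(any(x[i] for i in pos) or any(not x[i] for i in neg)
                   for pos, neg in clauses)

    def best_of_two(y_hat, clauses, n):
        # Take the better of a uniform random assignment and the LP
        # rounding; in expectation one satisfies >= (3/4) of the LP value.
        return max(round_and_count([0.5] * n, clauses),
                   round_and_count(y_hat, clauses))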
we’ll find that the events we care about aren’t independent, so this won’t
work either.
The Lovász Local Lemma [EL75] handles a situation intermediate
between these two extremes, where events are generally not independent of
each other, but each collection of events that are not independent of some
particular event A has low total probability. In the original version, it’s
non-constructive: the lemma shows a nonzero probability that none of the
events occur, but this probability may be very small if we try to sample
the events at random, and there is no guidance for how to find a particular
outcome that makes all the events false.
Subsequent work [Bec91, Alo91, MR98, CS00, Sri08, Mos09, MT10]
showed how, when the events A are determined by some underlying set
of independent variables and independence between two events is detected
by having non-overlapping sets of underlying variables, an actual solution
could be found in polynomial expected time. The final result in this series,
due to Moser and Tardos [MT10], gives the same bounds as in the original
non-constructive lemma, using the simplest algorithm imaginable: whenever
some bad event A occurs, try to get rid of it by resampling all of its variables,
and continue until no bad events are left.
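The algorithm is short enough to state as code; here is a generic sketch with uniform [0,1) variables standing in for whatever independent variables the events are defined over (real applications would resample from the appropriate distributions):

    import random

    def moser_tardos(num_vars, events):
        # events: list of (vbls, occurs) pairs, where occurs(x) reports
        # whether the bad event happens under assignment x.
        x = [random.random() for _ in range(num_vars)]
        while True:
            for vbls, occurs in events:
                if occurs(x):
                    for i in vbls:        # resample the event's variables
                        x[i] = random.random()
                    break
            else:
                return x                  # no bad event occurs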
then
$$\Pr\left[\bigcap_{A\in\mathcal{A}}\bar{A}\right] \ge \prod_{A\in\mathcal{A}}(1-x_A). \tag{13.3.2}$$
In particular, this means that the probability that none of the $A_i$ occur is not zero, since we have assumed $x_{A_i} < 1$ holds for all i.
The role of xA in the original proof is to act as an upper bound on the
probability that A occurs given that some collection of other events doesn’t
9
This version is adapted from [MT10].
occur. For the constructive proof, the xA are used to show a bound on the
number of resampling steps needed until none of the A occur.
Proof. Basically, we are going to pick a single value x such that $x_A = x$ for all A in $\mathcal{A}$, and (13.3.1) is satisfied. This works as long as $p \le x(1-x)^d$, as in this case we have, for all A, $\Pr[A] \le p \le x(1-x)^d \le x(1-x)^{|\Gamma(A)|} = x_A\prod_{B\in\Gamma(A)}(1-x_B)$.
For fixed d, $x(1-x)^d$ is maximized using the usual trick: $\frac{d}{dx}x(1-x)^d = (1-x)^d - xd(1-x)^{d-1} = 0$ gives $(1-x) - xd = 0$ or $x = \frac{1}{d+1}$. So now we need $p \le \frac{1}{d+1}\left(1-\frac{1}{d+1}\right)^d$. It is possible to show that $1/e < \left(1-\frac{1}{d+1}\right)^d$ for all $d \ge 0$.¹⁰ So $ep(d+1) \le 1$ implies $p \le \frac{1}{e(d+1)} \le \frac{1}{d+1}\left(1-\frac{1}{d+1}\right)^d \le x(1-x)^{|\Gamma(A)|}$ as required by Lemma 13.3.1.
13.3.3 Applications
Here we give some simple applications of the lemma.
that none of them occurred, but they aren’t, so we can’t. Instead, we’ll use
the local lemma.
The set of bad events A is just the set of events Ai = [edge i is monochromatic].
We’ve already computed p = 1/c. To get d, notice that each edge only shares
a vertex with two other edges, so |Γ(Ai )| ≤ 2. Corollary 13.3.2 then says that
there is a good coloring as long as $ep(d+1) = 3e/c \le 1$, which holds as long as $c \ge 9$. We've just shown we can 9-color a cycle. If we use the asymmetric version, we can set all $x_A$ to 1/3 and show that $p \le \frac{1}{3}\left(1-\frac{1}{3}\right)^2 = \frac{4}{27}$ would also work; with this we can 7-color a cycle. This is still not as good as what we can do if we are paying attention, but not bad for a procedure that doesn't use the structure of the problem much.
When |S| = 0, this just says $\Pr[A] \le x_A$, which follows immediately from (13.3.1).
For larger S, split S into $S_1 = S\cap\Gamma(A)$, the events in S that might not be independent of A; and $S_2 = S\setminus\Gamma(A)$, the events in S that we know to be independent of A. If $S_2 = S$, then A is independent of all events in S, and (13.3.3) follows immediately from $\Pr\left[A \,\middle|\, \bigcap_{B\in S}\bar{B}\right] = \Pr[A] \le x_A\prod_{B\in\Gamma(A)}(1-x_B) \le x_A$. Otherwise $|S_2| < |S|$, which means that we can
$$\Pr[A\cap C_1 \mid C_2] \le \Pr[A\mid C_2] = \Pr[A] \le x_A\prod_{B\in\Gamma(A)}(1-x_B), \tag{13.3.5}$$
from (13.3.1) and the fact that A is independent of all B in $S_2$ and thus also independent of $C_2$.
For the denominator, we expand $C_1$ back out to $\bigcap_{B\in S_1}\bar{B}$ and break out the induction hypothesis. To bound $\Pr\left[\bigcap_{B\in S_1}\bar{B} \,\middle|\, C_2\right]$, we order $S_1$ arbitrarily as $\{B_1,\ldots,B_r\}$, for some r, and show by induction on ℓ as ℓ goes from 1 to r that
$$\Pr\left[\bigcap_{i=1}^{\ell}\bar{B}_i \,\middle|\, C_2\right] \ge \prod_{i=1}^{\ell}(1-x_{B_i}). \tag{13.3.6}$$
using the outer induction hypothesis (13.3.3), and for larger ℓ, we can compute
$$\Pr\left[\bigcap_{i=1}^{\ell}\bar{B}_i \,\middle|\, C_2\right] = \Pr\left[\bar{B}_\ell \,\middle|\, \left(\bigcap_{i=1}^{\ell-1}\bar{B}_i\right)\cap C_2\right]\cdot\Pr\left[\bigcap_{i=1}^{\ell-1}\bar{B}_i \,\middle|\, C_2\right] \ge (1-x_{B_\ell})\prod_{i=1}^{\ell-1}(1-x_{B_i}) = \prod_{i=1}^{\ell}(1-x_{B_i}),$$
where the second-to-last step uses the outer induction hypothesis (13.3.3)
for the first term and the inner induction hypothesis (13.3.6) for the rest.
This completes the proof of the inner induction.
When ℓ = r, we get
$$\Pr[C_1\mid C_2] = \Pr\left[\bigcap_{i=1}^{r}\bar{B}_i \,\middle|\, C_2\right] \ge \prod_{B\in S_1}(1-x_B). \tag{13.3.7}$$
Dividing (13.3.5) by (13.3.7) then gives
$$\Pr[A\mid C_1\cap C_2] \le \frac{x_A\prod_{B\in\Gamma(A)}(1-x_B)}{\prod_{B\in S_1}(1-x_B)} \le x_A,$$
since $S_1\subseteq\Gamma(A)$ and each omitted factor $1-x_B$ is at most 1.
This completes the proof of the outer induction.
To get the bound (13.3.2), we reach back inside the proof and repeat the argument for (13.3.7) with $\bigcap_{A\in\mathcal{A}}\bar{A}$ in place of $C_1$ and without the conditioning on $C_2$. We order $\mathcal{A}$ arbitrarily as $\{A_1, A_2, \ldots, A_m\}$ and show by induction on k that
$$\Pr\left[\bigcap_{i=1}^{k}\bar{A}_i\right] \ge \prod_{i=1}^{k}(1-x_{A_i}). \tag{13.3.8}$$
For the base case we have k = 0 and $\Pr[\Omega] \ge 1$, using the usual conventions on empty products. For larger k, we have
$$\Pr\left[\bigcap_{i=1}^{k}\bar{A}_i\right] = \Pr\left[\bar{A}_k \,\middle|\, \bigcap_{i=1}^{k-1}\bar{A}_i\right]\Pr\left[\bigcap_{i=1}^{k-1}\bar{A}_i\right] \ge (1-x_{A_k})\prod_{i=1}^{k-1}(1-x_{A_i}) = \prod_{i=1}^{k}(1-x_{A_i}),$$
where in the second-to-last step we use (13.3.3) for the first term and the induction hypothesis (13.3.8) for the big product.
Setting k = m finishes the proof.
$$\sum_{A\in\mathcal{A}}\frac{x_A}{1-x_A}.$$
How this expected m/d bound translates into actual time depends on
the cost of each resampling step. The expensive part at each step is likely to
be the cost of finding an A that occurs and thus needs to be resampled.
12
Even though there is a lot of resampling going on here, Moser-Tardos is not actually a sampling algorithm, in the sense that some solutions to the original problem may be much more likely to be produced than others. There has been some more recent work on the sampling Lovász local lemma, where the goal is to obtain a close-to-uniform sample from the space of possible solutions. This is much harder than just finding one solution and usually requires stronger constraints on the problem. An example of this work is the paper of He et al. [HWY22].
trees T we expect to see rooted at A, given that each occurs with probability $\prod_{v\in T}\Pr[A_v]$. This is done by constructing a branching process using the $x_B$ values from Lemma 13.3.1 as probabilities of a node with label A having a child labeled B for each B in $\Gamma^+(A)$, and doing algebraic manipulations on the resulting probabilities until $\prod_{v\in T}\Pr[A_v]$ shows up. We can then sum over the expected number of copies of trees to get a bound on the expected number of events in the execution log (since each such event is the root of some tree), which is equal to the expected number of resamplings.
Consider the process where we start with a root labeled A, and for each vertex v with label $A_v$, give it a child labeled B for each $B\in\Gamma^+(A_v)$ with independent probability $x_B$. We'll now calculate the probability $p_T$ that this process generates a particular tree T in the set $\mathcal{T}_A$ of trees with root A.
Let $x'_B = x_B\prod_{C\in\Gamma(B)}(1-x_C)$. Note that (13.3.1) says precisely that $\Pr[B] \le x'_B$.
For each vertex v in T, let $W_v \subseteq \Gamma^+(A_v)$ be the set of events $B \in \Gamma^+(A_v)$ that don't occur as labels of children of v. The probability of getting T is equal to the product of the probabilities at each v of getting all of its children and none of its non-children. The non-children of v collectively contribute $\prod_{B\in W_v}(1-x_B)$ to the product, and v itself contributes $x_{A_v}$ (via the product for its parent), unless v is the root node. So we can express the giant product as
$$p_T = \frac{1}{x_A}\prod_{v\in T}\left(x_{A_v}\prod_{B\in W_v}(1-x_B)\right).$$
We don't like the $W_v$ very much, so we get rid of them by taking the product over all B in $\Gamma^+(A_v)$, then dividing out the ones that aren't in $W_v$. This gives
$$p_T = \frac{1}{x_A}\prod_{v\in T}\left(x_{A_v}\prod_{B\in\Gamma^+(A_v)}(1-x_B)\prod_{B\in\Gamma^+(A_v)\setminus W_v}\frac{1}{1-x_B}\right).$$
This seems like we exchanged one annoying index set for another, but each element of $\Gamma^+(A_v)\setminus W_v$ is $A_{v'}$ for some child $v'$ of v in T. So we can push these factors down to the children, and since we are multiplying over all vertices in T, they will each show up exactly once except at the root. To keep the products clean, we'll throw in $\frac{1}{1-x_A}$ for the root as well, but compensate for
$$= \frac{1-x_A}{x_A}\prod_{v\in T}x'_{A_v}.$$
Now we can bound the expected number of trees rooted at A that appear in C, assuming (13.3.1) holds. Letting $\mathcal{T}_A$ as before be the set of all such trees and $N_A$ the number that appear in C, we have
$$E[N_A] = \sum_{T\in\mathcal{T}_A}\Pr[T\text{ appears in }C] \le \sum_{T\in\mathcal{T}_A}\prod_{v\in T}\Pr[A_v] \le \sum_{T\in\mathcal{T}_A}\prod_{v\in T}x'_{A_v} = \sum_{T\in\mathcal{T}_A}\frac{x_A}{1-x_A}p_T = \frac{x_A}{1-x_A}\sum_{T\in\mathcal{T}_A}p_T \le \frac{x_A}{1-x_A}.$$
The last sum is bounded by one because occurrences of particular trees
T in TA are all disjoint events.
Now sum over all A, and we’re done.
Chapter 14
Derandomization
1. Reduce the number of random bits used down to O(log n), and then
search through all choices of random bits exhaustively. For example, if
we only need pairwise independence, we could use the XOR technique
from §5.1.2.1 to replace a large collection of variables with a small
collection of random bits.
Except for the exhaustive search part, this is how randomized algo-
rithms are implemented in practice: rather than burning random bits
continuously, a pseudorandom generator is initialized from a seed
consisting of a small number of random bits. For pretty much all of
the randomized algorithms we know about, we don’t even need to use
a particularly strong pseudorandom generator. This is largely because
1
The class RP consists of all languages L for which there is a polynomial-time random-
ized algorithm that correctly outputs “yes” given an input x in L with probability at least
1/2, and never answers “yes” given an input x not in L. See §1.5.2 for a more extended
description of RP and other randomized complexity classes.
Proof. The intuition is that if any one random string has a constant probability of making M happy, then by choosing enough random strings we can make the probability that M fails on every random string for any given input so small that even after we sum over all inputs of a particular size, the probability of failure is still small using the union bound (4.2.1). This is an example of probability amplification, where we repeat a randomized algorithm many times to reduce its failure probability.
Formally, consider any fixed input x of size n, and imagine running M repeatedly on this input with n + 1 independent sequences of random bits $r_1, r_2, \ldots, r_{n+1}$. If $x\notin L$, then $M(x, r_i)$ never outputs 1. If $x\in L$, then for each $r_i$, there is an independent probability of at least 1/2 that $M(x,r_i) = 1$. So $\Pr[M(x,r_i) = 0] \le 1/2$, and $\Pr[\forall i\, M(x,r_i) = 0] \le 2^{-(n+1)}$. If we sum this probability of failure for each individual $x\in L$ of length n over the at most $2^n$ such elements, we get a probability that any of them fail of at most $2^n\cdot 2^{-(n+1)} = 1/2$.
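The amplification step itself is trivial to code; a sketch, where M is assumed to take an input and a random seed, and the failure probability for x in L falls to $2^{-k}$:

    import random

    def amplify(M, x, k):
        # One-sided error: answer yes if any of k independent runs accepts.
        return any(M(x, random.getrandbits(64)) for _ in range(k))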
The classic version of this theorem shows that anything you can do with
a polynomial-size randomized circuit (a circuit made up of AND, OR, and
NOT gates where some of the inputs are random bits, corresponding to the r
input to M ) can be done with a polynomial-size deterministic circuit (where
now the pn input is baked into the circuit, since we need a different circuit
for each size n anyway).
A limitation of this result is that ordinary algorithms seem to be better
described by uniform families of circuits, where there exists a polynomial-
time algorithm that, given input n, outputs the circuit Cn for processing size-n
inputs. In contrast, the class of circuits generated by Adleman’s theorem
is most likely non-uniform: the process of finding the good witnesses ri is
not something we know how to do in polynomial time (with the usual caveat
that we can’t prove much about what we can’t do in polynomial time).
random to one side of the cut or the other, each edge appears in the cut with
probability 1/2, giving a total of m/2 edges in the cut in expectation.
Suppose that we replace these n independent random bits with n pairwise-independent bits generated by taking XORs of subsets of $\lceil\lg(n+1)\rceil$ independent random bits as described in §5.1.2.1. Because the bits are pairwise-independent, the probability that the two endpoints of an edge are assigned to different sides of the cut is still exactly 1/2. So on average we get m/2 edges in the cut as before, and there is at least one sequence of random bits that guarantees a cut at least this big.
But with only $\lceil\lg(n+1)\rceil$ random bits, there are only $2^{\lceil\lg(n+1)\rceil} < 2(n+1)$ possible sequences of random bits. If we try all of them, then we are guaranteed to find a cut of size at least m/2. The total cost is O(n(n+m)) if we include the O(n+m) cost of testing each cut. Note that this algorithm does not generate all $2^n$ possible cuts, but among those it does generate, there must be a large one.
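A compact sketch of the whole derandomized algorithm, using the standard inner-product construction for the pairwise-independent bits (vertex i gets the parity of the bitwise AND of i+1 with the seed):

    def derandomized_cut(n, edges):
        # Try all seeds of ceil(lg(n+1)) bits; the average cut over seeds
        # is exactly m/2, so the best seed gives a cut of size >= m/2.
        k = n.bit_length()
        best = 0
        for seed in range(2 ** k):
            side = [bin((i + 1) & seed).count("1") % 2 for i in range(n)]
            best = max(best, sum(side[u] != side[v] for u, v in edges))
        return best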
In this particular case, we’ll see below how to get the same result at a
much lower cost, using more knowledge of the problem. So we have the
typical trade-off between algorithm performance and algorithm designer
effort.
2. For any edge with only one endpoint assigned, there’s a 1/2 probability
that the other endpoint gets assigned to the other side (in the original
randomized algorithm). Add 1/2 to the conditional expectation for
these edges.
3. For any edge with neither endpoint assigned, we again have a 1/2
probability that it crosses the cut. Add 1/2 for these as well.
So now let us ask how assigning a particular previously unassigned vertex v to S or T affects the conditional expectation. For any neighbor w of v that is not already assigned, adding v to S or T doesn't affect the 1/2 contribution of vw. So we can ignore these. The only effects we see are that if some neighbor w is in S, assigning v to S decreases the conditional expectation by 1/2 and assigning v to T increases the expectation by 1/2. So to maximize the conditional expectation, we should assign v to whichever side currently holds fewer of v's neighbors—the obvious greedy algorithm, which runs in O(n + m) time if we are reasonably clever about keeping track of how many neighbors of each vertex are assigned to each side.
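The resulting greedy algorithm, as a sketch (adj is assumed to be an adjacency-list representation of the graph):

    def greedy_cut(adj):
        # Method of conditional expectations: place each vertex on the
        # side with fewer already-placed neighbors, so the conditional
        # expected cut size never decreases and ends at >= m/2.
        n = len(adj)
        side = [None] * n
        for v in range(n):
            placed = [side[w] for w in adj[v] if side[w] is not None]
            side[v] = 0 if placed.count(0) <= placed.count(1) else 1
        return side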
This last expression involves a linear number of terms, each of which we can
calculate using a linear number of operations on rational numbers that fit
in a linear number of bits, so we can calculate the probability exactly in
polynomial time by just adding them up.
For our pessimistic estimator, we take
$$U(\epsilon_1,\ldots,\epsilon_k) = \sum_{j}\Pr\left[|X_j| > \sqrt{2n\ln 2m} \,\middle|\, \epsilon_1,\ldots,\epsilon_k\right].$$
Since each term in the sum is a Doob martingale, the sum is a martingale as well, so $E[U(\epsilon_1,\ldots,\epsilon_{k+1}) \mid \epsilon_1,\ldots,\epsilon_k] = U(\epsilon_1,\ldots,\epsilon_k)$. It follows that for any choice of $\epsilon_1,\ldots,\epsilon_k$ there exists some $\epsilon_{k+1}$ such that $U(\epsilon_1,\ldots,\epsilon_k) \ge U(\epsilon_1,\ldots,\epsilon_{k+1})$, and we can determine this winning $\epsilon_{k+1}$ explicitly. Our previous argument shows that $U(\langle\rangle) < 1$, which implies that our final value $U(\epsilon_1,\ldots,\epsilon_n)$ will also be less than 1. Since $U(\epsilon_1,\ldots,\epsilon_n)$ is an integer (once all of the $\epsilon_i$ are fixed, each conditional probability is either 0 or 1), this means it must be 0, and we find an assignment in which $|X_j| < \sqrt{2n\ln 2m}$ for all j.
Chapter 15

Probabilistically-checkable proofs
• $g(x, \pi_i)$ takes as input the same x as f and a sequence $\pi_i = \pi_{i_1}\pi_{i_2}\ldots\pi_{i_{q(n)}}$, and outputs either 1 (for accept) or 0 (for reject); and
15.3 NP ⊆ PCP(poly(n), 1)
Here we give a weak version of the PCP theorem, adapted from [AB07, §18.4],
showing that any problem in NP has a probabilistically-checkable proof
where the verifier uses polynomially-many random bits but only needs to
look at a constant number of bits of the proof, which we can state succinctly
as NP ⊆ PCP(poly(n), 1).2 The proof itself will be exponentially long.
The central step is to construct a $\langle\mathrm{poly}(n), 1\rangle$-PCP for a particular NP-complete problem; we can then take any other problem in NP, reduce it to this problem, and use the construction to get a PCP for that problem as well.
15.3.1 QUADEQ
The particular problem we will look at is QUADEQ, the language of systems
of quadratic equations over Z2 that have solutions.
This is in NP because we can guess and verify a solution; it’s NP-hard
because we can use quadratic equations over Z2 to encode instances of SAT,
using the representation 0 for false, 1 for true, 1 − x for ¬x, xy for x ∧ y, and
1 − (1 − x)(1 − y) = x + y + xy for x ∨ y. We may also need to introduce
2
This is a rather weak result, since (a) the full PCP theorem gives NP using only
O(log n) random bits, and (b) PCP(poly(n), 1) is known to be equal to NEXP [BFL91].
But the construction is still useful for illustrating many of the ideas behind probabilistically-
checkable proofs.
auxiliary variables to keep the degree from going up: for example, to encode
the clause x ∨ y ∨ z, we introduce an auxiliary variable q representing x ∨ y
and enforce the constraints q = x ∨ y and 1 = q ∨ z = x ∨ y ∨ z using two
equations
x + y + xy = q,
q + z + qz = 1.
It will be helpful later to rewrite these in a standard form with only zeros
on the right:
q + x + y + xy = 0
q + z + qz + 1 = 0.
This works because we can move summands freely from one side of an
equation to the other since all addition is mod 2.
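A quick brute-force check (in Python, with hypothetical names) confirms that this pair of equations is solvable for some q exactly when the clause x ∨ y ∨ z is satisfied:

    from itertools import product

    def satisfies(x, y, z, q):
        # The two standard-form equations above, evaluated mod 2.
        return (q + x + y + x * y) % 2 == 0 and (q + z + q * z + 1) % 2 == 0

    for x, y, z in product((0, 1), repeat=3):
        ok = any(satisfies(x, y, z, q) for q in (0, 1))
        assert ok == bool(x or y or z)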
2. That for some random r, s, $f(r)f(s) = g(r\otimes s)$. This may let us know if w is inconsistent with u. Define W as the n × n matrix with $W_{ij} = w_{ij}$ and U as the n × n matrix $U = u\otimes u$ (so $U_{ij} = u_iu_j$). Then $g(r\otimes s) = w\cdot(r\otimes s) = \sum_{ij}w_{ij}r_is_j = rWs$ and $f(r)f(s) = (u\cdot r)(u\cdot s) = \left(\sum_i u_ir_i\right)\left(\sum_j u_js_j\right) = \sum_{ij}r_iU_{ij}s_j = rUs$, where we are treating r as a row vector and s as a column vector. Now apply the random subset principle to argue that if $U \ne W$, then $rU \ne rW$ at least half the time, and if $rU \ne rW$, then $rUs \ne rWs$ at least half the time. This gives a probability of at least 1/4 that we catch $U \ne W$, and we can repeat the test a few times to amplify this to whatever constant we want.
Since we can make each step fail with only a small constant probability,
we can make the entire process fail with the sum of these probabilities, also
a small constant.
But we can do better than this. Suppose that we can approximate the number of $\varphi_S$ that can be satisfied to within a factor of $2-\epsilon$. Then if φ has an assignment that makes all the $\varphi_S$ true (which follows from completeness if $x\in L$), our approximation algorithm will give us an assignment that makes at least a $\frac{1}{2-\epsilon} > \frac{1}{2}$ fraction of the $\varphi_S$ true. But we can never make more than $\frac{1}{2}$ of the $\varphi_S$ true if $x\notin L$. So we can run our hypothetical approximation algorithm, and if it gives us an assignment that satisfies more than half of the $\varphi_S$, we know $x\in L$. If the approximation runs in P, we just solved SAT in P and showed P = NP.
case we observe that a partial solution to the target problem maps back to a
partial solution to the original SAT problem.
In some cases we can do better, by applying a gap-amplification step. For
example, suppose that no polynomial-time algorithm for INDEPENDENT
SET can guarantee an approximation ratio better than ρ, assuming P 6= NP.
Given a graph G, construct the graph $G^k$ on $n^k$ vertices where each vertex in $G^k$ represents a set of k vertices in G, and ST is an edge in $G^k$ if $S\cup T$ is not an independent set in G. Let I be an independent set for G. Then the set $I^k$ of all k-subsets of I is an independent set in $G^k$ ($S\cup T\subseteq I$ is an independent set for any S and T in $I^k$). Conversely, given any independent set $J\subseteq G^k$, its union $\bigcup J$ is an independent set in G (because otherwise
1. We first assume that our input graph is k-regular (all vertices have the
same degree k) and an expander (every subset S with |S| ≤ m/2 has
δ|S| external neighbors for some constant δ > 0). Dinur shows that
even when restricted to graphs satisfying these assumptions, GRAPH
3-COLORABILITY is still NP-hard.
vertex in the overlap between the two neighborhoods, and (b) assigns
colors to the endpoints of any edge in either neighborhood that are
permitted by the constraint on that edge. Intuitively, this means that
a bad edge in a coloring of G will turn into many bad edges in G0 , and
the expander assumption means that many bad edges in G will also
turn into many bad edges in G0 . In particular, Dinur shows that with
appropriate tuning this process amplifies the UNSAT value of G by a
constant. Unfortunately, we also blow up the size of the alphabet by
Θ(k d ).
So the second part of the amplification knocks the size of the alphabet
back down to 2. This requires replacing each node in G0 with a set of
nodes in a new constraint graph G00 , where the state of the nodes in
the set encodes the state of the original node, and some coding-theory
magic is used to preserve the increased gap from the first stage (we
lose a little bit, but not as much as we gained).
The net effect of both stages is to take a constraint graph G of size n with $\mathrm{UNSAT}(G) \ge \epsilon$ and turn it into a constraint graph G″ of size cn, for some constant c, with $\mathrm{UNSAT}(G'') \ge 2\epsilon$.
6
Note that this is not a decision problem, in that the machine M considering G does
not need to do anything sensible if G is in the gap; instead, it is an example of a promise
problem where we have two sets L0 and L1 , M (x) must output i when x ∈ Li , but L0 ∪ L1
does not necessarily cover all of {0, 1}∗ .
Chapter 16
Quantum computing
Any such state vector x for the system must consist of non-negative
real values that sum to 1; this is just the usual requirement for a discrete
probability space. Operations on the state consist of taking the old values of
one or both bits and replacing them with new values, possibly involving both
randomness and dependence on the old values. The law of total probability
applies here, so we can calculate the new state vector x′ from the old state vector x by the rule
$$x'_{ij} = \sum_{k\ell}\Pr[\text{new state} = ij \mid \text{old state} = k\ell]\,x_{k\ell}.$$
These are linear functions of the previous state vector, so we can summarize the effect of our operation using a transition matrix A, where x′ = Ax.¹
We imagine that these operations are carried out by feeding the initial state into some circuit that generates the new state. This justifies calling this model of computation a random circuit. But the actual implementation might be an ordinary computer that is just flipping coins. If we can interpret each step of the computation as applying a transition matrix, the actual implementation doesn't matter.
For example, if we negate the second bit 2/3 of the time while leaving the first bit alone, we get the matrix
$$A = \begin{pmatrix} 1/3 & 2/3 & 0 & 0 \\ 2/3 & 1/3 & 0 & 0 \\ 0 & 0 & 1/3 & 2/3 \\ 0 & 0 & 2/3 & 1/3 \end{pmatrix}.$$
One way to derive this matrix other than computing each entry directly is that it is the tensor product of the matrices that represent the operations on the individual bits. The idea here is that the tensor product of A and B, written $A\otimes B$, is the matrix C with $C_{ij,k\ell} = A_{ik}B_{j\ell}$. We're cheating a little bit by allowing the C matrix to have indices consisting of pairs of indices, one for each of A and B; there are more formal definitions that justify this at the cost of being harder to understand.
In this particular case, we have
$$\begin{pmatrix} 1/3 & 2/3 & 0 & 0 \\ 2/3 & 1/3 & 0 & 0 \\ 0 & 0 & 1/3 & 2/3 \\ 0 & 0 & 2/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \otimes \begin{pmatrix} 1/3 & 2/3 \\ 2/3 & 1/3 \end{pmatrix}.$$
1
Note that this is the reverse of the convention we adopted for Markov chains in Chapter 10. There it was convenient to have $P_{ij} = p_{ij} = \Pr[X_{t+1} = j \mid X_t = i]$. Here we defer to the physicists and make the update operator come in front of its argument, like any other function.
The first matrix in the tensor product gives the update rule for the first bit
(the identity matrix—do nothing), while the second gives the update rule for
the second.
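In code, the tensor product is just the Kronecker product; a quick check with numpy:

    import numpy as np

    I = np.eye(2)                      # do nothing to the first bit
    noise = np.array([[1/3, 2/3],
                      [2/3, 1/3]])     # negate the second bit 2/3 of the time
    A = np.kron(I, noise)
    print(A)                           # reproduces the 4x4 matrix above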
Some operations are not decomposable in this way. If we swap the values of the two bits, we get the matrix
$$S = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$
$$x_{\mathrm{out}} = ASAx_{\mathrm{in}} = \begin{pmatrix} 1/3 & 2/3 & 0 & 0 \\ 2/3 & 1/3 & 0 & 0 \\ 0 & 0 & 1/3 & 2/3 \\ 0 & 0 & 2/3 & 1/3 \end{pmatrix}\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1/3 & 2/3 & 0 & 0 \\ 2/3 & 1/3 & 0 & 0 \\ 0 & 0 & 1/3 & 2/3 \\ 0 & 0 & 2/3 & 1/3 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \end{pmatrix}$$
maps $|10\rangle$ to $|01\rangle$ (Proof: $|01\rangle\langle 10|\,|10\rangle = |01\rangle\langle 10|10\rangle = |01\rangle$) and sends all other states to 0. Add up four of these mappings to get
$$S = |00\rangle\langle 00| + |10\rangle\langle 01| + |01\rangle\langle 10| + |11\rangle\langle 11| = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
Here the bra-ket notation both labels what we are doing and saves writing a lot of zeros.
The intuition is that just like a ket represents a state, a bra represents a test for being in that state. So something like $|01\rangle\langle 10|$ tests if we are in the 10 state, and if so, sends us to the 01 state.
where $a_0$ and $a_1$ are complex numbers with $|a_0|^2 + |a_1|^2 = 1$.² The reason for this restriction on amplitudes is that if we measure a qubit, we will see state 0 with probability $|a_0|^2$ and state 1 with probability $|a_1|^2$. Unlike with random bits, these probabilities are not mere expressions of our ignorance but arise through a still somewhat mysterious process from the more fundamental amplitudes.³
2
The absolute value, norm, or magnitude |a + bi| of a complex number is given by $\sqrt{a^2+b^2}$. When b = 0, this is the same as the absolute value for the corresponding real number. For any complex number x, the norm can also be written as $\sqrt{\bar{x}x}$, where $\bar{x}$ is the complex conjugate of x. This is because $\sqrt{(a+bi)(a-bi)} = \sqrt{a^2-(bi)^2} = \sqrt{a^2+b^2}$. The appearance of the complex conjugate here explains why we define $\langle x|y\rangle = x^*y$; the conjugate transpose means that for $\langle x|x\rangle$, when we multiply $x_i^*$ by $x_i$ we are computing a squared norm.
3
In the old days of “shut up and calculate,” this process was thought to involve the
unexplained power of a conscious observer to collapse a superposition into a classical
state. Nowadays the most favored explanation involves decoherence, the difficulty of
maintaining superpositions in systems that are interacting with large, warm objects with lots
of thermodynamic degrees of freedom (measuring instruments, brains). The decoherence
explanation is particularly useful for explaining why real-world quantum computers have a
hard time keeping their qubits mixed even when nobody is looking at them. Decoherence
by itself does not explain which basis states a system collapses to. Since bases in linear
algebra are pretty much arbitrary, it would seem that we could end up running into a
physics version of Goodman’s grue-bleen paradox [Goo83], but there are apparently ways of
dealing with this too using a mechanism called einselection [Zur03] that favors classical
states over weird ones. Since all of this is (a) well beyond my own limited comprehension
of quantum mechanics and (b) irrelevant to the theoretical model we are using, these issues
will not be discussed further.
With multiple bits, we get amplitudes for all combinations of the bits, e.g.
$$\frac{1}{2}\left(|00\rangle + |01\rangle + |10\rangle + |11\rangle\right),$$
which gives a state vector in which each possible measurement will be observed with equal probability $\left(\frac{1}{2}\right)^2 = \frac{1}{4}$. We could also write this state vector as
$$\begin{pmatrix} 1/2 \\ 1/2 \\ 1/2 \\ 1/2 \end{pmatrix}.$$
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix},$$
which is clearly unitary (the rows are just the standard basis vectors). We could also write this more compactly as $|00\rangle\langle 00| + |01\rangle\langle 01| + |11\rangle\langle 10| + |10\rangle\langle 11|$.
The CNOT operator gives us XOR, but for more destructive operations we need to use more qubits, possibly including junk qubits that we won't look at again but that are necessary to preserve reversibility. The Toffoli gate or controlled controlled NOT gate (CCNOT) is a 3-qubit gate that was originally designed to show that classical computation could be performed reversibly [Tof80]. It implements the mapping $(x,y,z)\mapsto(x,y,(x\wedge y)\oplus z)$, which corresponds to the 8 × 8 matrix
$$\begin{pmatrix} 1&0&0&0&0&0&0&0 \\ 0&1&0&0&0&0&0&0 \\ 0&0&1&0&0&0&0&0 \\ 0&0&0&1&0&0&0&0 \\ 0&0&0&0&1&0&0&0 \\ 0&0&0&0&0&1&0&0 \\ 0&0&0&0&0&0&0&1 \\ 0&0&0&0&0&0&1&0 \end{pmatrix}.$$
This has the effect of mapping each $|x\rangle$ to $-|x\rangle$ when f(x) = 1, and passing it through intact when f(x) = 0. The result is a diagonal matrix $U_f$ whose diagonal looks like a truth table for f expressed as ±1 values, as in this matrix for the XOR function $f(x_0,x_1) = x_0\oplus x_1$:
$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
$$f|x\rangle|y\rangle = \frac{1}{\sqrt{2}}\left(f|x0\rangle - f|x1\rangle\right) = \frac{1}{\sqrt{2}}\left(|x,f(x)\rangle - |x,\neg f(x)\rangle\right) = \frac{(-1)^{f(x)}}{\sqrt{2}}\left(|x0\rangle - |x1\rangle\right) = (-1)^{f(x)}|x\rangle\cdot\frac{1}{\sqrt{2}}\left(|0\rangle - |1\rangle\right) = (U_f|x\rangle)|y\rangle.$$
Our goal is for this final state to tell us what we want to know, with
reasonably high probability.
Suppose now that f(0) = f(1) = b. Then the $|1\rangle$ terms cancel out and we are left with
$$2\cdot\frac{1}{2}(-1)^b|0\rangle = (-1)^b|0\rangle.$$
This puts all the weight on $|0\rangle$, so when we take our measurement at the end, we'll see 0.
Alternatively, if $f(0) = b \ne f(1)$, it's the $|0\rangle$ terms that cancel out, leaving $(-1)^b|1\rangle$. The phase depends on b, but we don't care about the phase. The important thing is that if we measure the qubit, we always see 1.
The result in either case is that with probability 1, we determine the value of $f(0)\oplus f(1)$, after evaluating f once (albeit on a superposition of quantum states).
This is kind of a silly example, because the huge costs involved in building our quantum computer almost certainly swamp the factor-of-2 improvement we got in the number of calls to f. But a generalization of this trick, known as the Deutsch-Jozsa algorithm [DJ92], solves the much harder (although still a bit contrived-looking) problem of distinguishing a constant Boolean function on n bits from a function that outputs one for exactly half of its inputs. No deterministic algorithm can solve this problem without computing at least $2^n/2 + 1$ values of f, giving an exponential speed-up.
The speed-up compared to a randomized algorithm that works with probability $1-\epsilon$ is less impressive. With randomization, we only need to look at $O(\log(1/\epsilon))$ values of f to see both a 0 and a 1 in the non-constant case. But even here, the Deutsch-Jozsa algorithm does have the advantage of giving the correct answer with probability 1. If we make the same demand of a randomized algorithm, it does no better than a deterministic algorithm, at least in the constant-function case.
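Simulating the one-bit algorithm with explicit matrices takes only a few lines; this sketch uses the ±1 phase form of $U_f$ directly:

    import numpy as np

    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

    def deutsch(f):
        # H, then the phase oracle U_f = diag((-1)^f(0), (-1)^f(1)),
        # then H again; measurement is deterministic.
        Uf = np.diag([(-1.0) ** f(x) for x in (0, 1)])
        state = H @ Uf @ H @ np.array([1.0, 0.0])
        return int(np.argmax(state ** 2))   # 0 iff f(0) == f(1)

    print(deutsch(lambda x: 0))       # constant f: outputs 0
    print(deutsch(lambda x: 1 - x))   # balanced f: outputs 1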
Making this work requires showing that (a) we can generate the original superposition $|s\rangle$, (b) we can implement D efficiently using unitary operations on a constant number of qubits each, and (c) we actually get w at the end of this process.
Here we use the fact that $|s\rangle\langle s|\cdot|s\rangle\langle s| = |s\rangle\langle s|s\rangle\langle s| = |s\rangle(1)\langle s| = |s\rangle\langle s|$.
Now let's look at implementation. Recall that $|s\rangle = H^{\otimes n}|0^n\rangle$, where $H^{\otimes n}$ is the result of applying H to each of the n bits individually. We also have that $H^* = H$ and $HH^* = I$, from which $H^{\otimes n}H^{\otimes n} = I$ as well.
So we can expand
$$D = 2|s\rangle\langle s| - I = 2H^{\otimes n}|0^n\rangle\left(H^{\otimes n}|0^n\rangle\right)^* - I = 2H^{\otimes n}|0^n\rangle\langle 0^n|H^{\otimes n} - I = H^{\otimes n}\left(2|0^n\rangle\langle 0^n| - I\right)H^{\otimes n}.$$
$$D = 2|s\rangle\langle s| - I = 2\left((\sin\theta)|w\rangle + (\cos\theta)|u\rangle\right)\left((\sin\theta)\langle w| + (\cos\theta)\langle u|\right) - I = (2\sin^2\theta - 1)|w\rangle\langle w| + (2\sin\theta\cos\theta)|w\rangle\langle u| + (2\sin\theta\cos\theta)|u\rangle\langle w| + (2\cos^2\theta - 1)|u\rangle\langle u| = (-\cos 2\theta)|w\rangle\langle w| + (\sin 2\theta)|w\rangle\langle u| + (\sin 2\theta)|u\rangle\langle w| + (\cos 2\theta)|u\rangle\langle u| = \begin{pmatrix} -\cos 2\theta & \sin 2\theta \\ \sin 2\theta & \cos 2\theta \end{pmatrix},$$
Ideally, we pick t so that $(2t+1)\theta = \pi/2$, which would put all of the amplitude on $|w\rangle$. Because t is an integer, we can't do this exactly, but setting $t = \left\lfloor\frac{\pi/(2\theta)-1}{2}\right\rfloor$ will get us somewhere between $\pi/2 - 2\theta$ and $\pi/2$. Since $\theta \approx \sqrt{\frac{1}{N}}$, this gives us a probability of seeing $|w\rangle$ in our final measurement of $1 - O\left(\sqrt{1/N}\right)$ after $O\left(\sqrt{N}\right)$ iterations of $U_w D$.
Sadly, this is as good as it gets. A lower bound of Bennett et al. [BBBV97] shows that any quantum algorithm using $U_w$ as the representation for f must apply $U_w$ at least $\Omega(\sqrt{N})$ times to find w. So we get a quadratic speedup but not the exponential speedup we'd need to solve NP-complete problems directly.
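A state-vector simulation of the whole iteration is a useful sanity check; here $N = 2^n$ and the diffusion operator is applied directly to the amplitude vector:

    import numpy as np

    def grover(n, w):
        N = 2 ** n
        s = np.full(N, 1 / np.sqrt(N))            # uniform superposition |s>
        theta = np.arcsin(1 / np.sqrt(N))
        t = int((np.pi / (2 * theta) - 1) / 2)
        state = s.copy()
        for _ in range(t):
            state[w] = -state[w]                  # U_w: phase-flip |w>
            state = 2 * s * (s @ state) - state   # D = 2|s><s| - I
        return int(np.argmax(state ** 2)), state[w] ** 2

    print(grover(10, 123))   # finds 123 with probability close to 1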
Chapter 17

Randomized distributed algorithms
17.1 Consensus
The consensus problem is to get a collection of n processes to agree on a
value. The requirements are that all the processes that don’t crash finish the
protocol (with probability 1 for randomized protocols) (termination), that
they all output the same value (agreement), and that this value was an input
to one of the processes (validity)—this last condition excludes protocols that
always output the same value no matter what happens during the execution,
and makes consensus useful for choosing among values generated by some
other process.
There are many versions of consensus. The original problem as proposed
by Pease, Shostak, and Lamport [PSL80] assumes Byzantine processes in a
synchronous message-passing system. Here scheduling is entirely predictable,
and the obstacle to agreement is dissension sown by the lying Byzantine
processes. We will instead consider wait-free shared-memory consensus, where scheduling is unpredictable but the processes and memory are trustworthy. Even in this case, the unpredictable scheduling makes solving the problem deterministically impossible.
might continue generating many local coins after the total count crosses the
threshold, and each other process might see a different prefix of these coins
depending on when it reads p’s register.
Bracha and Rachman showed that this wasn’t too much of a problem
using Azuma’s inequality (this is where the O(log n) factor comes in). But
later work by Attiya and Censor [AC08] allowed for a simpler analysis of a
slightly different algorithm, which we will describe here.
Pseudocode for the Attiya-Censor coin is given in Algorithm 17.1. The algorithm only checks the total count once for every n coin-flips, giving an amortized cost of 1 read per coin-flip. But to keep the processes from running too long, each process checks a multi-writer stop bit after every coin-flip.
6     stop ← true
7 return $\mathrm{sgn}\left(\sum_{j=1}^{n} r_j.\mathrm{sum}\right)$
Each process may see a different total sum at the end, but our hope is
that if T is large enough, there is at least a δ probability that all these total
sums are positive (or, by symmetry, negative). We can represent the total
sum seen by any particular process i as a sum S + D + Xi − Hi , where:
2. D is the sum of all coin-flips after the first T that are generated before some process sets the stop bit. There are at most $n^2 + n$ such coin-flips, and they form a martingale with bounded increments. So Azuma's inequality gives us that $|D| \le 2n$ with at least constant probability, independent of S.
process j between some process setting the stop bit and process i
reading rj . Since each process can generate at most one extra coin-flip
before checking stop, |Xi | ≤ n always.
4. $H_i = \sum_{j=1}^{n} Z_{ij}$, where $Z_{ij}$ is the sum of all votes that are generated by process j before i reads $r_j$, but that are not included in $r_j.\mathrm{sum}$ because they haven't been written yet. Again, each process can contribute only one coin-flip to $H_i$. So $|H_i| \le n$ always.
The cost of executing the sifter is two operations. Each process chooses
an index r according to a geometric distribution, writes A[r], and then checks
if any other process has written A[r + 1]. The process stays if it sees nothing.
Because any process with the maximum value of r always stays, at least
one process stays. To show that not too many processes stay, let Xi be the
number of survivors with r = i. This will be bounded by the number of
processes that write to A[i] before any process writes to A[i + 1]. We can
immediately see that $E[X_i] \le n\cdot 2^{-i}$, since each process has probability $2^{-i}$ of writing to A[i]. But we can also argue that $E[X_i] \le 2$ for any value of n.
The reason is that because the adversary is oblivious, the choice of which
process writes next is independent of the location it writes to. If we condition
on some particular write being to either A[i] or A[i + 1], there is a 1/3
chance that it writes to A[i + 1]. So we can think of the subsequence of
writes that either land in A[i + 1] or A[i] as a sequence of biased coin-flips,
and we are counting how many probability-2/3 tails we get before the first
probability-1/3 heads. This will be 2 on average, or at most 2 if we take into
account that we will stop after n writes.
We thus have $E[X_i] \le \min(2, n\cdot 2^{-i})$. So the expected total number of winners is bounded by $\sum_{i=1}^{\infty} E[X_i] \le \sum_{i=1}^{\infty}\min(2, n\cdot 2^{-i}) = 2\lg n + O(1)$.
Now comes the fun part. Take all the survivors of our sifter, and run them through more sifters. Let $S_i$ be the number of survivors of the first i sifters. We've shown that $E[S_{i+1} \mid S_i] = O(\log S_i)$. Since log is concave, Jensen's inequality then gives $E[S_{i+1}] = O(\log E[S_i])$. Iterating this gives $E[S_i] = O(\log^{(i)} S_0) = O(\log^{(i)} n)$. So there is some $i = O(\log^* n)$ at which $E[S_i]$ is a constant.
This doesn’t quite give us leader election because the constant might not
be 1. But there are known leader election algorithms that run in O(1) time
with O(1) expected participants [AAG+ 10], so we can use these to clean up
any excess processes that make it through all the sifters. The total cost is
O(log∗ n) operations for each process.
1 procedure sifter(v)
2    Choose random ranks r_1 . . . r_ℓ
3    for i ← 1 . . . ℓ do
4        writeMax(M_i, ⟨r_i, . . . , r_ℓ, v⟩)
5        ⟨r_i, . . . , r_ℓ, v⟩ ← readMax(M_i)
6    return v
Algorithm 17.3: Sifter using max registers [AE19]
Appendix A  Sample assignments from Spring 2024
decide whether it will send or receive. Each sender then picks one of
its d neighbors independently and uniformly at random. An edge uv is
included in the matching if u is a sender, v is a receiver, and u is the
only sender that picks v; or if the same conditions hold with u and v
reversed.
Solution
In each case we’ll use linearity of expectation.
1. Let Xuv be the indicator for the event that uv is included in the
matching. Then
2. Again let Xuv be the indicator variable for the event that uv is included.
Now
3. For this version it’s convenient to break symmetry and make Xuv be
the indicator variable for the event that u is a sender, v is a receiver,
and u is matched with v. This means that we will have two variables
Xuv and Xvu for each edge, but we can deal with this when we need
to.
Compute
2. Since we can’t rely on a random input, let’s move the randomness into
the algorithm. Suppose that a fixed permutation π on all k-bit vectors
is chosen uniformly at random and each vi (supplied by an adversary)
is mapped to π(vi ) before being written. Now what is the expected
number of cells used in the worst case? (As usual, assume that π is
chosen after v1 , . . . , vn are fixed.)
Solution
1. Let v and v′ be consecutive values. Then v′ requires a new cell if there
is some position j such that v′_j = 0 but v_j = 1. For each position j,
this occurs with probability 1/4 and does not occur with probability
3/4. Because the positions are independent, the probability that v′
does not require a new cell is (3/4)^k and the probability that it does
require a new cell is 1 − (3/4)^k.
For each i > 1, let X_i be the indicator variable for the event that
v_i requires a new cell. If we let X_1 = 1, then S = ∑_{i=1}^n X_i gives
the number of cells used, with E[S] = 1 + (n−1)(1 − (3/4)^k).
Conditioning on whether π(v) = π(v′) (in which case X = 0), we get

    1 − (3/4)^k = E[X] = E[X | π(v) ≠ π(v′)]·Pr[π(v) ≠ π(v′)],

and so

    E[X | π(v) ≠ π(v′)] = (1 − (3/4)^k)/(1 − 2^{-k}).
Since the only choice the adversary can make that affects E[X_i] is
whether to set v_i ≠ v_{i−1}, this gives a worst case of

    E[∑ X_i] = 1 + (n−1)·(1 − (3/4)^k)/(1 − 2^{-k}),
which is only a little bit worse than the average case.
An alternative approach to computing E[X] is to just count the number
of pairs π(v), π(v′) that require a new cell and divide by the total
number of possibilities 2^k(2^k − 1). Here there are 3 choices for each bit
position that don't require a new cell, giving 3^k choices overall, but 2^k
of these choices make π(v) = π(v′). So the number of distinct pairs that
don't require a new cell is 3^k − 2^k. We can get the number of pairs that
do require a new cell by subtracting from the total 2^k(2^k − 1) = 4^k − 2^k.
This gives

    E[X] = (4^k − 3^k)/(4^k − 2^k) = (1 − (3/4)^k)/(1 − 2^{-k}),

as computed above.
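Since the counting argument is finite, it can be checked by brute force for small k. A quick sketch (function name ours):

    from itertools import product

    def requires_new_cell(v, w):
        # Writing w after v needs a new cell iff some bit would go 1 -> 0.
        return any(a == 1 and b == 0 for a, b in zip(v, w))

    for k in range(1, 6):
        vecs = list(product((0, 1), repeat=k))
        pairs = [(v, w) for v in vecs for w in vecs if v != w]
        observed = sum(requires_new_cell(v, w) for v, w in pairs) / len(pairs)
        predicted = (4**k - 3**k) / (4**k - 2**k)
        assert abs(observed - predicted) < 1e-12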
All of these bounds are terrible, but they are not equally terrible and
are modestly distinct for small k. For k = 2, for example, the coefficient on
(n − 1) goes from 7/16 in the average case, to 7/12 in the permutation case,
and all the way up to 3/4 in the bitwise XOR case.
Solution
1. We’ll give a proof using Chebyshev’s inequality.
For each edge uv let Z_uv = X_u ⊕ X_v be the indicator variable for the
event that uv is in the cut. Let S = ∑_{uv} Z_uv be the size of the cut.
We have E[S] = ∑_{uv} E[Z_uv] = m/2.
To compute Var[S], we will first show that the variables Z_uv are
pairwise-independent. This holds trivially for any variables corresponding
to non-incident edges. For the case of two incident edges Z_uv and
Z_vw, observe that E[Z_vw | Z_uv] = 1/2, because whatever value X_v has,
adding X_w yields 0 or 1 with equal probability. So Z_uv and Z_vw are
independent as well.
Since the Z_uv are pairwise-independent, Var[S] = ∑ Var[Z_uv] = m/4.
Chebyshev then says

    Pr[|S − m/2| ≥ √m] ≤ (m/4)/(√m)² = 1/4.
This gives a 3/4 chance of getting S in the range m/2 ± √m = m/2 ± o(m).
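A quick simulation illustrates the bound; the complete graph and trial count below are arbitrary choices of ours, not from the problem.

    import random
    from itertools import combinations

    n = 40
    edges = list(combinations(range(n), 2))   # complete graph: m = n(n-1)/2
    m = len(edges)

    def cut_size():
        x = [random.getrandbits(1) for _ in range(n)]   # fair coin-flips X_u
        return sum(x[u] != x[v] for u, v in edges)      # S = sum of Z_uv

    trials = 2000
    hits = sum(abs(cut_size() - m / 2) <= m ** 0.5 for _ in range(trials))
    print(hits / trials)   # Chebyshev guarantees at least about 3/4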
Solution
1. First we need to figure out the worst-case choice of n_1, . . . , n_k.
For any particular choice of the block sizes, the expected cost is

    ∑_{i=1}^k n_i / (∑_{j=1}^i n_j).
Let X_i be the training cost for the i-th block. Then each X_i is an
independent Bernoulli random variable and S = ∑ X_i has known
expectation µ = H_n = Θ(log n), so we can use the two-sided Chernoff
bound (5.2.7) to get
Solution
We’ll have a sequence of p = O(log n) phases of d2n/me rounds each, and
in phase use a hash function to assign each package to one of the ` =
md2n/me ≥ 2n slots corresponding to a particular combination of pad and
round within the phase.
A complication is that we have limited randomness, which is going to
constrain what hash functions we can use. We'll pick an independent linear
congruential hash function h_j for each phase j, which requires O(log t) bits
of randomness per phase, or O(log t · log n) = O(log² t) bits of randomness
across all phases. Tabulation hashing also works (we want the version that
adds table elements mod ℓ rather than using XOR on bit vectors), but the
bits per phase goes up to O(log² t), giving O(log³ t) overall.
In each phase, a remaining robot or drone assigned tracking number
t_i computes h_j(t_i) ∈ [ℓ] and goes to pad h_j(t_i) mod m during round
⌊h_j(t_i)/m⌋ (numbering from 0 within the phase). Each pair of tracking
numbers t_i, t_{i′} produces a collision with probability Pr[h_j(t_i) = h_j(t_{i′})] ≤ 1/ℓ.
Let X_j be the number of robots left after j phases; then X_0 = n and

    E[X_{j+1} | X_j] ≤ 2·(X_j choose 2)·(1/ℓ) ≤ X_j²/(2n) ≤ X_j/2.

Here the 2 at the start accounts for the fact that each collision may send up
to two robots to the next phase, the (X_j choose 2) counts all the pairs of robots, and
the last step uses the bound X_j ≤ n.
Iterating this inequality gives E[X_p] ≤ n·2^{-p}. Set p = (c+1) lg n to get
Pr[X_p > 0] ≤ E[X_p] ≤ n·2^{-(c+1) lg n} = n^{-c}. This gives the desired error probability
in (c+1) lg n·⌈2n/m⌉ = O(n log n/m) rounds.
(n+1)/2 = Θ(n), which is (up to constants) no better than the deterministic
worst case. But for more skewed distributions we may hope that more
probable elements get inserted early, meaning that the cost of searching for
them will be less than for more improbable elements.
For each of the following distributions, compute a tight (big-Θ) asymptotic
bound on the expected cost to search for a random element given by the
distribution, assuming that we constructed the list as described above by
repeatedly sampling from the same distribution.
Solution
Let’s do what we can for a generic distribution before getting into the details
of each distribution individually.
Let A_ij be the indicator variable for the event that i appears in the list
at or before the same position as j (the "at" part covers the case A_ii, which
we take to be 1). Let D_i be the position of i in the list. Then D_i = ∑_j A_ji.
For i ≠ j, we can compute E[A_ij] by looking at the probability of seeing i
conditioned on seeing i or j at a particular step. This gives

    E[A_ij] = p_i/(p_i + p_j),

and thus

    E[D_i] = 1 + ∑_{j≠i} p_j/(p_i + p_j)

and

    E[search cost] = ∑_i p_i E[D_i] = 1 + ∑_i p_i ∑_{j≠i} p_j/(p_i + p_j).
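The generic formula is easy to evaluate directly; here is a small sketch (function name ours) that also sanity-checks the uniform case mentioned above.

    def expected_search_cost(p):
        # E[search cost] = 1 + sum_i sum_{j != i} p_i p_j / (p_i + p_j)
        n = len(p)
        return 1 + sum(p[i] * p[j] / (p[i] + p[j])
                       for i in range(n) for j in range(n) if i != j)

    # The uniform distribution recovers (n+1)/2: for n = 100 this prints 50.5.
    print(expected_search_cost([1 / 100] * 100))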
Let’s see what happens to this for each of our given distributions.
p_i = 2^{-i}/(1 − 2^{-n-1}) = Θ(2^{-i}).
    E[D_i] = Θ(i) + Θ(1) = Θ(i),

so

    ∑_i p_i E[D_i] = 1 + ∑_i Θ(2^{-i}·i) = Θ(1).
    p_i = (1/i)/H_n = 1/(iH_n).

But the H_n's cancel out when we compute

    ∑_{j≠i} p_j/(p_i + p_j) = ∑_{j≠i} (1/j)/((1/i) + (1/j)) = ∑_{j≠i} i/(i+j).
We can use this to get an upper bound on the expected search cost:

    ∑_{i=1}^n p_i E[D_i] = 1 + ∑_{i=1}^n (1/(iH_n)) ∑_{j≠i} i/(i+j)
      < 1 + (1/H_n) ∑_{i=1}^n ∑_{j=1}^n 1/(i+j)
      < 1 + (1/H_n) ∑_{k=1}^{2n} (k−1)/k
      < 1 + (1/H_n)·2n
      = O(n/log n).
To get a matching lower bound, notice that the best possible ordering
places each i at position i, giving an expected search cost of exactly
∑_i (1/(iH_n))·i = n/H_n = Ω(n/log n). Since whatever random ordering we
end up with is at least this bad, this gives us the Θ(n/log n) bound
we are looking for.
Solution
This is a job for the Johnson-Lindenstrauss lemma (distributional version).
First normalize the rows of A so that each ‖A_i‖ = 1. This takes O(n²)
time and doesn't affect the angles between rows.
Next, observe that for any unit vectors x and y, ‖x − y‖₂ is an increasing
function of θ = θ_xy. We don't actually need anything more than this, but
if we want to we can compute the distance exactly by constructing an
isosceles triangle with x and y as the unit edges and chopping it in half:
this gives ‖x − y‖₂ = 2 sin(θ/2). So if θ ≤ ε, ‖x − y‖₂ ≤ 2 sin(ε/2), and if θ ≥ 2ε,
‖x − y‖₂ > 2 sin ε.
Pick a threshold halfway between these two bounds. By choosing a
small enough error term in Lemma 8.1.2, we can guarantee that any pair
x, y that is nearly orthogonal to within ε has ‖f(x − y)‖₂ close enough to
‖x − y‖₂ to be within this threshold, and any pair x, y that is not nearly
orthogonal to within 2ε doesn't. (We could calculate the exact error
bound we need for this, but since it's a constant, we don't care.) If we set
the probability of exceeding the error bound for a single vector to δ/(n choose 2),
this gives a probability that ‖A_i − A_j‖₂ exceeds the relative error bound
for any i, j of at most δ. To obtain δ/(n choose 2) probability of error, we need
k = O(log(n²/δ)) = O(log n + log(1/δ)).
So now we need to feed all of our rows to the Johnson-Lindenstrauss
transform whose existence is given in the lemma and then check the threshold
for each pair of rows in the output. Expanding out the steps, we must:
1. Generate a random n × k projection matrix B; this takes O(nk) time.
2. Compute AB. This takes O(n²k) time using ordinary matrix multiplication.
3. For each pair of rows i and j, compute ‖(AB)_i − (AB)_j‖₂ and check
it against the thresholds. This takes O(k) time per pair of rows, for
O(n²k) time total.
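A hedged sketch of the whole pipeline in Python with NumPy. The constant 24 inside k and the function name are arbitrary stand-ins for the constants hidden in Lemma 8.1.2, not values from the text.

    import numpy as np

    def flagged_pairs(A, eps, seed=0):
        # Flag pairs of rows whose angle falls below the threshold,
        # using the 2 sin(theta/2) distance-angle correspondence above.
        n, d = A.shape
        A = A / np.linalg.norm(A, axis=1, keepdims=True)   # normalize rows
        k = max(1, int(24 * np.log(n)))                    # k = O(log n)
        B = np.random.default_rng(seed).standard_normal((d, k)) / np.sqrt(k)
        P = A @ B                                          # compute AB
        thresh = (2 * np.sin(eps / 2) + 2 * np.sin(eps)) / 2
        return [(i, j) for i in range(n) for j in range(i + 1, n)
                if np.linalg.norm(P[i] - P[j]) < thresh]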
Only once during the walk can you push a "boost" button that replaces
the normal transition rule X_{t+1} = X_t ± 1 with a rule that doubles the current
value: X_{t+1} = 2X_t. Naturally, your choice to use your boost can't depend on
knowledge of the future: when choosing to set X_{t+1} = 2X_t, you only know
the outcomes X_1, . . . , X_t.
1. Suppose that your goal is to maximize the probability that you reach
a state Xt ≥ n before you reach Xt = 0. What strategy should you
use and what probability of reaching Xt ≥ n (as a function of k and n)
does it give you?
2. Suppose instead that you want to minimize the probability that you
reach a state X_t ≥ n before you reach 0, but you want to do this
in a way that always uses the boost before you reach X_t ≥ n or X_t = 0,
so that suspicious onlookers won't think you aren't trying. Now what
strategy should you use, and what probability of reaching X_t ≥ n does
it give you?
Solution
Let’s formalize things a bit. Let {∆t } be the set of independent fair ±1
increments, so that Xt+1 = Xt + ∆t+1 if we don’t use the boost. Then we
can let Ft = h∆1 , . . . , ∆t i and make the time σ at which we use the boost a
stopping time with respect to {Ft }. This makes [σ ≤ t] measurable Ft for
each t and gives E [∆t+1 | Ft ] = 0.
We can now write the transition rule compactly as X_{t+1} = 2^{[σ=t]} X_t + [σ ≠ t]∆_{t+1}.
Defining Y_t = 2^{[σ≥t]} X_t, we can compute

    E[Y_{t+1} | F_t] = E[2^{[σ≥t+1]} X_{t+1} | F_t]
    = E[2^{[σ≥t+1]} (2^{[σ=t]} X_t + [σ ≠ t]∆_{t+1}) | F_t]
    = 2^{[σ≥t]−[σ=t]}·2^{[σ=t]} X_t + 2^{[σ≥t+1]}[σ ≠ t] E[∆_{t+1} | F_t]
    = 2^{[σ≥t]} X_t
    = Y_t.
Since σ ≥ 0 always, Y_0 = 2X_0 = 2k, and optional stopping gives

    2k = E[Y_0] = E[Y_τ]
    = E[Y_τ | X_τ ≥ n] Pr[X_τ ≥ n] + E[Y_τ | X_τ = 0] Pr[X_τ = 0]
    = E[Y_τ | X_τ ≥ n] Pr[X_τ ≥ n],

so

    Pr[X_τ ≥ n] = 2k / E[Y_τ | X_τ ≥ n].    (A.4.1)
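For part 1, the strategy suggested by (A.4.1) is to hold the boost until X_t = n − 1, where doubling lands at 2(n−1) ≥ n; the gambler's-ruin probability of reaching n − 1 from k before 0 is k/(n−1). A quick simulation sketch (names and trial count ours) agrees:

    import random

    def win_probability(k, n, trials=100_000):
        # Strategy: walk until hitting 0 or n-1; at n-1, boost to 2(n-1) >= n.
        wins = 0
        for _ in range(trials):
            x = k
            while 0 < x < n - 1:
                x += random.choice((-1, 1))
            wins += (x == n - 1)
        return wins / trials

    print(win_probability(3, 10))   # about 3/9 = k/(n-1)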
best we can hope to get is X_τ = 2(n − 1), by using the boost when
X_t = n − 1. But what if we don't reach n − 1 and are forced to use
the boost anyway?
Analyzing this using the Y_t martingale turns out to be tricky, and an
easier approach is just to guess the worst strategy and show that it
is in fact the worst. Since we must use the boost before τ, we are
forced to use the boost if we reach 1 or n − 1, since otherwise there
is a possibility the process finishes before we use it. Let's suppose we
adopt a maximum-delay strategy and set σ = min{t | X_t ∈ {1, n − 1}}.
Starting from any position x, this σ gives Pr[X_τ = n] of 2/n if we
reach 1 first and 1 if we reach n − 1 first. Which we reach first is just
the probability of hitting one or the other absorbing barrier in a simple
random walk on [1, n − 1]. So we can compute
    Pr[X_τ = n | X_t = x, t < τ] = ((x−1)/(n−2))·1 + (1 − (x−1)/(n−2))·(2/n)
    = ((x−1)/(n−2))·(1 − 2/n) + 2/n
    = ((x−1)/(n−2))·((n−2)/n) + 2/n
    = (x−1)/n + 2/n
    = (x+1)/n.
relaxation time τ₂. Fix some function f with 1 ≤ f(x) ≤ r for all states x
and apply Metropolis-Hastings to get a new Markov chain with transition
probabilities p′_ij = p_ij·min(1, f(j)/f(i)) and stationary distribution π′_i ∝ f(i).
Let τ′₂ be the relaxation time of this new chain.
Prove or disprove: Under these assumptions, there is an upper bound on
τ′₂ that is polynomial in τ₂ and r.
Solution
We will prove this by showing that π′_u and p′_uv are both bounded relative to
their unmodified versions by polynomial functions of r, then show this leads
to at most polynomial blowup going from τ₂ to the conductance Φ of the
original chain to the conductance Φ′ of the modified chain and finally to τ′₂.
Let n be the number of states. For π′_u, we have

    π′_u = f(u)/∑_v f(v) ≥ 1/((n−1)r + 1) > 1/(nr) = π_u/r,

and symmetrically π′_u ≤ r/n = r·π_u.
For p′_uv = p_uv·min(1, f(v)/f(u)), we get (1/r)·p_uv ≤ p′_uv ≤ p_uv.
So now let’s look at some set of states S and try to bound Φ0 (S).
P 0 0
0 u∈S,u6∈S πu puv
Φ (S) =
π 0 (S)
1
· 1r puv
P
u∈S,u6∈S r πu
>
rπ(S)
P
1 u∈S,u6∈S πu puv
=
r3 π(S)
1
= 3 Φ(S). (A.5.1)
r
We’d like to use this to bound Φ0 = min0<π0 (S)≤1/2 Φ0 (S) in terms of r
and Φ = min0<π(S)≤1/2 Φ(S), but there is a complication: there may be sets
S with π 0 (S) ≤ 1/2 but π(S) > 1/2 that are included in the computation of
Φ0 but not Φ. For these sets we will need to take advantage of reversibility.
Suppose π′(S) ≤ 1/2 but π(S) > 1/2. Let T be the complement of S. Then π(T) = 1 − π(S) <
1/2, so Φ(T) ≥ Φ. Now we can argue

    Φ′(S) = (∑_{u∈S, v∈T} π′_u p′_uv) / π′(S)
    = (∑_{v∈T, u∈S} π′_v p′_vu) / π′(S)
    = Φ′(T)·(π′(T)/π′(S))
    > (1/r³)·Φ(T)·(π′(T)/π′(S))
    > (1/r³)·Φ·((1/2)/(1/2))
    = (1/r³)·Φ,    (A.5.2)

where the second step uses reversibility of the modified chain.
r
So given S with 0 < π′(S) ≤ 1/2, (A.5.1) shows Φ′(S) ≥ r^{-3}Φ when
π(S) ≤ 1/2 and (A.5.2) shows Φ′(S) ≥ r^{-3}Φ when π(S) > 1/2. In either
case we get Φ′ = min_{0<π′(S)≤1/2} Φ′(S) ≥ r^{-3}Φ.
From (10.6.4), we have 1/(2Φ) ≤ τ₂, which gives Φ ≥ 1/(2τ₂). But then we can
apply the other direction of (10.6.4) to get

    τ′₂ ≤ 2/(Φ′)² < 2/(r^{-3}Φ)² ≤ 2r⁶/(1/(2τ₂))² = 8r⁶τ₂²,

which is polynomial in r and τ₂, as required.
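The relationship can be checked numerically on a small reversible chain. The sketch below is entirely our own construction (a lazy walk on a cycle with a step-function f); it computes τ₂ = 1/(1 − λ₂) by symmetrizing the transition matrix.

    import numpy as np

    def relaxation_time(P, pi):
        # Reversible chain: D^{1/2} P D^{-1/2} is symmetric, same eigenvalues.
        d = np.sqrt(pi)
        S = P * d[:, None] / d[None, :]
        lam = np.sort(np.linalg.eigvalsh(S))
        return 1.0 / (1.0 - lam[-2])

    n, r = 20, 10.0
    P = np.zeros((n, n))
    for i in range(n):                     # lazy random walk on a cycle
        P[i, i] = 0.5
        P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.25
    f = 1 + (r - 1) * (np.arange(n) >= n // 2)           # 1 <= f(x) <= r
    Pf = P * np.minimum(1.0, f[None, :] / f[:, None])    # Metropolis-Hastings
    np.fill_diagonal(Pf, 0.0)
    np.fill_diagonal(Pf, 1.0 - Pf.sum(axis=1))           # rejected mass stays
    print(relaxation_time(P, np.full(n, 1 / n)),
          relaxation_time(Pf, f / f.sum()))              # compare tau_2, tau_2'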
Solution
First observe that the number of possible states N is at least 2^{⌈n/2⌉}, the
number of strings with zeros in all odd-numbered positions. So if we can
get a mixing time that is polynomial in n, it will be polylogarithmic in N.
Coupling and canonical paths are both options for showing such a bound.
One possible coupling is to apply the same update rule to both the X
and Y processes: pick an index i and a bit b, and update each of X_i and Y_i
to equal b if permitted by the no-consecutive-ones rule. We can then track
how the Hamming distance between X and Y evolves over time.
Let S^t = {i | X_i^t ≠ Y_i^t} and let H^t = |S^t|. Then H^t can change in two
ways:
We thus get τ₂ ≤ 8ρ² = O(n⁴), with a hideous constant swept behind the
big O for decency's sake. This again is polylogarithmic in N, so we're done.
Solution
We’ll show c = 4.
For the lower bound, we just need one bad family of graphs and starting
configurations that gives Ω(n4 ) expected convergence time. One choice is a
lollipop graph, consisting of a handle that is a path of n/2 vertices and a
sucker that is Kn/2 , plus an edge linking one end of the path to one vertex
in the clique.
Let’s make the stubborn node be a member of the clique with initial
value 0, and start in a configuration where the most distant n/4 nodes in the
path have 1 and everybody else has 0. Let Xit be the value of node i at time
t and let Zt = i X t . Then as long as all the clique nodes have Xit = 0 and
P
some path node has Xit = 1, we have a configuration where the 1 nodes are
exactly the Zt most distant nodes on the path. So Z t+1 = Zt only if we pick
1
the unique 0 − 1 edge, and we have Z t+1 = Zt ± 1 with probability 2m each.
1
This give an unbiased random walk that takes steps with probability m .
2
It takes (n/4) steps for this to reach Zt = 0 or Zt = n/2, which is a lower
bound on the time for the process as a whole to converge. The waiting time
for each random walk step is m, so the total expected time to reach Zt = 0
or Zt = n/2 is m(n/4)2 by Wald’s Equation. The clique makes m = Ω(n2 ),
so the expected time to converge is Ω(n4 ).
In the other direction, we will show that any graph reaches an all-equal
state from any initial state in time O(n4 ). For this argument it’s convenient
to assume that the stubborn node starts with 1, so we finish when Zt = n.
Solution
Suppose we label each vertex independently and uniformly at random. For
each vertex v, let N(v) be the set of all nodes adjacent to v, let L_v =
(1/d)·∑_{u∈N(v)} ℓ(u) be the probability that taking a step starting at v lands on
a node labeled with a 1, and let A_v be the event |L_v − 1/2| > ε.
Observe that if none of the events A_v occurs, we get the almost-Markov
property, because the bound on E[ℓ(V_{t+1})] will hold conditioned on any
previous vertex V_t. We will use the Lovász Local Lemma to show that a
labeling exists that makes none of these occur, and then use Moser-Tardos
to find one.
With a bit of tinkering, we can write L_v − 1/2 as the sum of d independent
±1/(2d) random variables, so Hoeffding's inequality says

    p = Pr[A_v] ≤ Pr[|L_v − 1/2| ≥ ε] ≤ 2e^{−ε²/(2d(1/d)²)} = 2e^{−dε²/2}.
Appendix B  Sample assignments from Spring 2023

2. Compute the best high-probability asymptotic bound you can for the
maximum number of elements in any table position as a function of
m. This means that you should find a function f(m) such that for
all c > 0, the maximum number of elements is at most O(f(m)) with
probability at least 1 − m^{-c}, where the constant in the asymptotic
bound may depend on c.
Solution
1. Let X be the number of elements inserted into the table. Then X
counts the number of insertions up to and including the first insertion
to position 0, making it a geometric random variable with parameter
p = 1/m (see §3.6.2). This gives E[X] = m.
The expected load factor is E[X/m] = 1.
The first branch is 0 when k > 0. For the second branch, observe
that Pr [Xi ≥ k | insert i] = Pr [Xi ≥ k − 1], since we already have one
element and now we are asking if we can get k − 1 more using the same
process that generates Xi . This gives a recurrence:
    Pr[X_i ≥ k] = 1 when k = 0, and Pr[X_i ≥ k] = (1/2)·Pr[X_i ≥ k−1] when k > 0.

Solving the recurrence gives Pr[X_i ≥ k] = 2^{-k}.
Since this makes max X_i = O(log m) with probability 1 − m^{-c} for any
fixed c > 0, we get a high-probability bound of O(log m).
Though this is not required by the problem, we can see that it is not
possible to do better than O(log m) by looking at the distribution
for a single bin. We have Pr[max X_i ≥ k] ≥ Pr[X_1 ≥ k] = 2^{-k}, so
Pr[max X_i ≥ c lg m] ≥ m^{-c} for any fixed c.
Solution
1. Let X_i be the indicator variable for the event that sender i is successful,
so that S = ∑_{i=1}^n X_i. Each sender i is successful if and only if i is the
label of one of i's d neighbors. This occurs with probability d/n = 1/2,
so E[X_i] = 1/2 and E[S] = n/2.
2. We have Var[S] = ∑_i Var[X_i] + ∑_{i≠j} Cov[X_i, X_j]. Since the X_i are
fair coin-flips we get Var[X_i] = 1/4.
To bound Cov[X_i, X_j], let δ_i and δ_j be the neighborhoods of senders i
and j, and let c_ij = |δ_i ∩ δ_j| be the number of neighbors these senders
have in common. Then
    E[X_i X_j] = Pr[i ∈ δ_i ∧ j ∈ δ_j]
    = Pr[i ∈ δ_i ∩ δ_j]·Pr[j ∈ δ_j | i ∈ δ_i ∩ δ_j] + Pr[i ∈ δ_i \ δ_j]·Pr[j ∈ δ_j | i ∈ δ_i \ δ_j]
    = (c_ij/n)·((d−1)/(n−1)) + ((d−c_ij)/n)·(d/(n−1))
    = (c_ij(d−1) + (d−c_ij)d)/(n(n−1))
    = (c_ij·d − c_ij + d² − c_ij·d)/(n(n−1))
    = (d² − c_ij)/(n(n−1)).
This gives Var[S] = n/4 + ∑_{i≠j} ((d² − c_ij)/(n(n−1)) − 1/4).
To minimize the variance, we want to make ∑_{i≠j} c_ij as large as possible.
We can do this by routing all n senders to the same d = n/2 receivers,
giving c_ij = n/2 for all i ≠ j and thus Var[S] = 0. This is not
surprising because in this case, all and only those senders whose ids
appear among the d favored receivers will be successful, making S a
constant.
For the upper bound, we can get a crude bound by observing that c_ij ≥ 0
gives Var[S] ≤ n/2. But in general there is no graph that actually
produces c_ij = 0 for all i ≠ j. Instead, we need to look at minimizing
∑_{i≠j} c_ij subject to the constraint that there are nd edges.
Let e_ijk be 1 if senders i and j share an edge to receiver k and 0
otherwise. Then c_ij = ∑_k e_ijk. But if we write d_k for the degree of
receiver k, we also have ∑_{i≠j} e_ijk = d_k(d_k − 1), since each ordered
pair of distinct senders adjacent to receiver k contributes one such term. It follows
that

    ∑_{i≠j} c_ij = ∑_{i≠j} ∑_k e_ijk = ∑_k d_k(d_k − 1).
In either case the absolute value is bounded by 1/(n−t−1). There are
n − t − 1 values of i in this range, so the sum for all of them is at
most ±1.
2. Give a Las Vegas algorithm that returns a value that occurs at least
twice in the stream with probability Ω(1). If it doesn’t succeed, it
should return ⊥ to indicate failure. (This algorithm should never return
a value that occurs less than twice.)
In both cases, prove the correctness of your algorithm. You may assume
that n is known to the algorithm.
Solution
1. There are two straightforward ways to do this using O(log n) bits:
(a) The easiest is to sample the position of the desired element ahead
of time, then count down until we hit it. This requires 2n possible
states for the countdown timer plus n possible states for the stored
value, for a total of ⌈lg 3n⌉ = O(log n) bits.
(b) Alternatively, we could store a value while counting how many
values we've seen, and replace the stored value with the k-th value
with probability 1/k. A straightforward induction argument
shows that this makes the stored value uniform across all the
values seen so far. We need ⌈lg 2n⌉ bits for the counter and ⌈lg n⌉
bits to store the current value, but the total is still O(log n).
(This rule is sketched in code below.)
2. For this part, the intuition is that we will sample a value as we go,
then record a winner if we happen to see it again. How easy this is to
analyze depends on which sampling algorithm we use:
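One natural choice is the replace-with-probability-1/k rule from (b). A minimal Python sketch (function name ours, assuming the stream is any iterable):

    import random

    def reservoir_sample(stream):
        # After k values, the stored value is uniform over the first k.
        value, k = None, 0
        for x in stream:
            k += 1
            if random.randrange(k) == 0:   # replace with probability 1/k
                value = x
        return value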
Solution
1. We'll show that {D_t} is a martingale, giving E[D_t] = E[D_0] = m/(n choose 2)
for all t.
Let Gt be the graph after t steps. Let Ft be the σ-algebra generated
by hG0 , . . . , Gt i.
    E[D_{t+1} | F_t] = m_t·(1 + 2/(n_t − 1)) / (n_t + 1 choose 2)
    = (m_t / (n_t + 1 choose 2))·((n_t + 1)/(n_t − 1))
    = m_t / (n_t choose 2)
    = D_t.

It follows that E[D_t] = E[D_0] = m/(n choose 2).
m_{t−1}, giving

    D_t − D_{t−1} ≥ m_{t−1}/(n_{t−1} + 1 choose 2) − m_{t−1}/(n_{t−1} choose 2)
    = m_{t−1}·((n_{t−1} choose 2) − (n_{t−1} + 1 choose 2)) / ((n_{t−1} + 1 choose 2)·(n_{t−1} choose 2))
    = m_{t−1}·(−n_{t−1}) / ((n_{t−1} + 1 choose 2)·(n_{t−1} choose 2))
    ≥ −n_{t−1}/(n_{t−1} + 1 choose 2)
    = −2/(n_{t−1} + 1)
    = −2/(n + t).

So in either direction we have |X_t| = |D_{t+1} − D_t| ≤ 2/(n + t − 2).
Now let us apply Azuma's inequality:

    Pr[|D_t − E[D_t]| ≥ α] ≤ 2 exp(−α² / ∑_{i=1}^t (2/(n+i−2))²)
    ≤ 2 exp(−α² / (4 ∑_{k=n−1}^∞ 1/k²))
    ≤ 2 exp(−α² / (4 ∫_{x=n−2}^∞ dx/x²))
    = 2 exp(−α²(n−2)/4).

For any fixed c, setting α = √(4(c+1) ln n/(n−2)) = O(√(log n / n))
gives a probability of at most 2n^{-(c+1)} ≤ n^{-c} (when n ≥ 2) that we are
more than α away from the expectation.
its endpoints. This process yields a Markov chain that is aperiodic and
irreducible, so it has a unique stationary distribution.
Solution
1. Suppose there is a labeling i and a labeling j such that j is reachable
from i in one step. Then these labelings differ by swapping the labels across one edge,
so p_ij = p_ji = 1/(2m). For each labeling i, let π_i = 1/n!. Then ∑_i π_i = 1
and, for all i and j, π_i p_ij = π_j p_ji = 1/(2m·n!). This shows that the chain is
reversible and that this particular π is its stationary distribution.
process. So the question then is how long it takes for two unlinked
copies of a label to become linked.
Define a new process Z^t = (U^t, V^t), where U^t represents the position
of X_ℓ^t and V^t represents Y_ℓ^t. Unlike the (X_ℓ^t, Y_ℓ^t) process, we will assume
that at each step of Z^t we always move U^t across an incident edge with
probability 1/(2n), and similarly for V^t, without regard to whether U^t = V^t.
This gives a Markov chain with the same transition probabilities as
(X_ℓ^t, Y_ℓ^t) when X_ℓ^t ≠ Y_ℓ^t, but unlike the (X_ℓ^t, Y_ℓ^t) process, the Z^t process
is both irreducible and reversible, with a uniform stationary distribution.
We can also argue that it is aperiodic under our assumption that m > 1,
because there exists at least one state where U^t is not adjacent to some
edge, so there is a nonzero probability that Z^{t+1} = Z^t in this state.
We will now show that X_ℓ^t and Y_ℓ^t collide with high probability after
polynomially many steps by showing convergence of Z^t to its stationary
distribution using Cheeger's inequality. Observe that the Z chain has
exactly m² states, and that each non-empty subset S of the chain has at
least one transition that leaves it with probability π_i p_ij = (1/m²)·(1/(2m)) = 1/(2m³).
This gives Φ(S) ≥ (1/(2m³))/(|S|/m²), which is at least 1/(2m³) in the worst case. It follows
that τ₂ ≤ 2/Φ² ≤ 8m⁶, and that the total variation distance between Z^t
and the uniform distribution is at most 1/(2m) after O(m⁶ log m) steps. In
particular this gives a probability of at least 1/m − 1/(2m) = 1/(2m) of being
in a state (U^t, V^t) with U^t = V^t at the end of each such interval.
Repeating for O(m log m) intervals with a suitable constant gives a
probability of at most m^{-c}, for any fixed c, that we never have U^t = V^t at the end
of one of these intervals, which shows that X_ℓ and Y_ℓ collide within
O(m⁷ log² m) steps with probability at least 1 − m^{-c}.
Since m ≥ n − 1, we can choose c so that the probability that X_ℓ and
Y_ℓ don't collide in O(m⁷ log² m) steps is O(n^{-2}). By the union bound,
this gives that every pair X_ℓ and Y_ℓ collides in time O(m⁷ log² m) with probability
at least 1 − O(n^{-1}). Repeat as needed at most O(log(1/ε)) times to
drive the probability of failure down below ε, and use m = O(n²) to
express the convergence time in terms of n as O(n^{14} log² n · log(1/ε)).
This is polynomial in n and log(1/ε), even though the exponents are terrible.
Solution
Let A be the set of all assignments that produce d ≥ t. For each i, let
X_i = ∑_{x∈S_i} x be the random variable representing the sum of the assignments
of elements of S_i. We can write A as the union of B_ij for all i ∈ [m] and all
j with |j| ≥ t, where B_ij is the event that X_i = j. Since this union is not
disjoint, we will use Karp-Luby (see §11.4) to approximate its size.
As with #DNF, the idea is that we can both count and sample triples
⟨i, j, x⟩ where x is an assignment that makes X_i = j. Let n_i = |S_i|. Let Y_i be
the number of elements of S_i that are assigned +1; then X_i = Y_i − (n_i − Y_i) =
2Y_i − n_i, and solving for Y_i gives Y_i = (n_i + X_i)/2, assuming the numerator is even.
Given i and j, there are exactly 2^{n−n_i}·(n_i choose (n_i+j)/2) assignments with X_i = j when
n_i + j is even, and none when n_i + j is odd. For the even case, we can sample
these assignments in polynomial time by choosing (n_i+j)/2 elements without
replacement from S_i to have value +1, making the remaining elements of S_i
have value −1, and assigning ±1 values independently to any x ∈ S \ S_i. We
will abuse notation slightly by writing B_ij for the set of all possible triples
⟨i, j, x⟩ resulting from this process for a particular pair of values i and j. Let
B = ∪_{i=1}^m ∪_{|j|≥t, j+n_i even} B_ij; note that this union is disjoint.
S S
cost of O(n) each, for O(n2 m) time total, which dominates the cost of
generating the sample.
Appendix C  Sample assignments from Fall 2019

1. Your name.
3. Whether you are taking the course as CPSC 469 or CPSC 569.
(You will not be graded on the bureaucratic part, but you should do it
anyway.)
Unfortunately, the candy company owner is notorious for trickery and lies,
so the probability that the golden ticket actually exists is only 1/2.
Consider the last child to open their candy bar, and what estimate they
should make of their probability of winning after seeing the first k children
open their candy bars to find no ticket.
An optimist reasons: If the ticket does exist, then the last child’s chances
went from 1/n to 1/(n − k). So the fact that none of the first k children got
the ticket is good news!
A pessimist reasons: The more candy bars are opened without seeing
the ticket, the more likely it is that the ticket doesn’t exist. So the fact that
none of the first k children got the ticket is bad news!
Which of them is right? Compute the probability that the last child
receives the ticket given the first k candy bars come up empty, and com-
pare this to the probability that the last child receives the ticket given no
information other than the initial setup of the problem.
Solution
Let W be the event that the last child wins, C the event that the candy bar
exists, and Lk the event that the first k children lose.
Without conditioning, we have Pr[W] = Pr[C]·(1/n) = 1/(2n). Conditioning on L_k, we have

    Pr[W | L_k] = Pr[W ∧ L_k] / Pr[L_k]
    = (1/(2n)) / (Pr[L_k | C]·Pr[C] + Pr[L_k | ¬C]·Pr[¬C])
    = (1/(2n)) / ((n−k)/n·(1/2) + 1·(1/2))
    = (1/(2n)) / ((n−k)/(2n) + 1/2)
    = (1/(2n)) / ((2n−k)/(2n))
    = 1/(2n−k).

Since 1/(2n−k) > 1/(2n) when k > 0, the optimist is correct.
Solution
1. Let A_i be the event that machine m_i is active, and let X_i be the
indicator variable for m_i being active and not exploding. We are trying
to maximize E[∑ X_i] = ∑ E[X_i] = 3n·E[X_1], where the last equation
holds by symmetry.
Observe that X_1 = 1 if A_1 occurs and it is not the case that both
A_0 and A_2 occur. This gives E[X_1] = Pr[X_1 = 1] = p(1 − p²) =
p − p³. Setting the derivative 1 − 3p² to zero, we find a unique
extremum at p = √(1/3); this is a maximum since p(1 − p²) = 0 at
the two corner solutions p = 0 and p = 1. Setting p = √(1/3) gives
3n·E[X_i] = 3n·√(1/3)·(2/3) = n·(2/√3) ≈ n·1.154701 . . . .
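A direct simulation of the ring (our own model of the setup, with machine counts and trial numbers chosen arbitrarily) agrees with the n·2/√3 figure:

    import random

    def active_count(n3, p):
        # n3 = 3n machines on a ring; machine i counts if it is on
        # and not both of its neighbors are on.
        on = [random.random() < p for _ in range(n3)]
        return sum(on[i] and not (on[i - 1] and on[(i + 1) % n3])
                   for i in range(n3))

    n3, trials = 3000, 200
    p = (1 / 3) ** 0.5
    avg = sum(active_count(n3, p) for _ in range(trials)) / trials
    print(avg / (n3 / 3))   # about 2/sqrt(3) = 1.1547 per n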
Solution
For each t, let X_t be the size of the log entry at time t. Then

    E[S] = ∑_t E[X_t] = ∑_t p_t·(a+b)/2 = ((a+b)/2)·∑_t p_t = ((a+b)/2)·n.
We can’t use Chernoff bounds directly on the Xt because they aren’t
Bernoulli random variables, and we can’t use Hoeffding’s inequality because
we don’t actually know how many of them there are. So we will use Chernoff
first to bound the number of nonzero Xt and then use Hoeffding to bound
their sum.
The easiest way to think about this is to imagine we generate the nonzero
Xt by generating a sequence of independent values Y1 , Y2 , . . . , with each Yi
uniform in [a, b], then select the i-th nonzero Xt to have value Yi .
Let Z_t be the indicator variable for the event that a log entry is written
at time t. Then N = ∑_t Z_t gives the total number of log entries, and
E[N] = E[∑_t Z_t] = ∑_t p_t = n.
Pick some δ < 1; then Chernoff's inequality (5.2.2) gives

    Pr[N ≥ (1+δ)n] ≤ e^{−nδ²/3}.

Suppose N < (1+δ)n. Then ∑_t X_t = ∑_{i=1}^N Y_i ≤ ∑_{i=1}^{(1+δ)n} Y_i. This last
quantity is a sum of independent bounded random variables, so we can apply
Hoeffding's inequality. For s = δn·(a+b)/2 this gives a bound of

    exp(−(δn·(a+b)/2)² / (2(1+δ)n·((b−a)/2)²))
    ≤ exp(−(δn·(a+b)/2)² / (2(1+δ)n·((b+a)/2)²))
    = exp(−δ²n/(2(1+δ))).
Solution
We can use the method of bounded differences. Let X_i be the indicator for
the event that machine i is turned on, and let S = f(X_1, . . . , X_{3n}) compute
the number of active machines. Then changing one input to f from 0 to 1
increases f by at most 1, and decreases it by at most 2 (since it may be that
the new machine explodes, producing no increase, and it also explodes both
neighbors, producing a net −2 change). So McDiarmid's inequality (5.3.13)
applies with c_i = 2 for all i, giving

    Pr[f(X_1, . . . , X_{3n}) − E[f(X_1, . . . , X_{3n})] ≤ −t] ≤ exp(−2t²/(3n·2²)) = e^{−t²/(6n)}.
APPENDIX C. SAMPLE ASSIGNMENTS FROM FALL 2019 346
• Analyze the dependence and use Lemma 5.2.2. This gives similar
results to McDiarmid’s inequality with a bit more work. The basic idea
is that E [Xi |X1 . . . Xi−1 ] ≥ p(1 − p) for i < 3n whether or not Xi−1 is
1 (we don’t care about the other variables). So we can let pi = p(1 − p)
for these values of i. To avoid wrapping around at the end, either omit
X3n entirely or let p3n = 0.
• Even simpler: Let Y_i be the indicator for the event that machine i is
active. Then Y_3, Y_6, . . . , Y_{3n} are all independent, and we can bound
∑_{i=1}^n Y_{3i} ≤ ∑_{i=1}^{3n} Y_i using either Chernoff or Hoeffding.
1. Suppose that you expect to live forever (and keep whatever plan you
pick forever as well). If your goal is to minimize your expected total
cost over eternity, which plan should you pick?
2. Which plan is cheaper on average if you don’t plan to live for more
than, say, the next 1000 years?
Solution
Let A_i be the event that we pay a penalty in month i with plan A, and let
B_i for even i be the event that we pay a penalty for months i and i+1 in
plan B. Then for i > 0,

    Pr[A_i] = Pr[X_i > max_{j<i} X_j] = Pr[X_i = max_{j≤i} X_j] = i!/(i+1)! = 1/(i+1).

Similarly,

    Pr[B_i] = Pr[X_i > max_{j<i} X_j and X_{i+1} > max_{j<i} X_j] = (i!·2)/(i+2)! = 2/((i+2)(i+1)).
(In the first case, we count all orderings of the i + 1 elements 0 . . . i that
make Xi largest, and divide by the total number of orderings. In the second
case, we count all the orderings that make Xi and Xi+1 both larger than all
the other values.)
Now let us do some sums.
Plan B has a 2/(3·4) = 1/6 chance of costing $100 after the first four months,
giving an expected cost of at least $16. As long as you keep both for
at least four months, for n ≤ e^{16} ≈ 8886110 months ≈ 740509 years,
plan A is better.
1. What are the corresponding asymptotic values for the expected pointers
per element and the expected cost of a search for this new data structure,
as a function of p and n?
2. Suppose you are given n in advance. Is it possible to choose p based
on n for the tree-based data structure to get lower expected space
overhead (expressed in terms of pointers per element) than with a skip
list, while still getting O(log n) expected search time?
Solution
1. This is actually easier than the analysis for a skip list, because there is
no recursion to deal with.
A balanced binary search tree stores exactly 2 pointers per element,
so with an expected pn elements in the tree, we get n + 2pn pointers
total, or 1 + 2p pointers per element.
For the search cost, work backwards from the target x to argue that
we traverse an expected O(1/p) edges in the linked list before hitting
a tree node. Searching the tree takes O(log n) time independent of p.
This gives a cost of O(1/p + log n).
2. Yes. Let p = Θ(1/log n). Then we use an expected 1 + O(1/log n)
pointers per element, while the search cost is O(1/p + log n) =
O(log n). In contrast, getting 1 + O(1/log n) expected space with a
skip list would also require setting p = Θ(1/log n), but this would give
an expected search cost Θ(log² n).
Solution
Disproof: We will construct a 2-universal family H such that h(x) = x for
all h ∈ H whenever 1 ≤ x ≤ m. We can then insert the values x_i = i with
corresponding hash values h(x_i) = i and get a tree of depth exactly n.
Start with a strongly 2-universal family H′. For each h′ in H′, construct
a corresponding h according to the rule h(x) = x when 1 ≤ x ≤ m and
h(x) = h′(x) otherwise. We claim that this gives a 2-universal family of hash
functions.
Recall that H is 2-universal if, for any x ≠ y, Pr[h(x) = h(y)] ≤ 1/m,
when h is chosen uniformly from H. We consider three cases:
1. If 1 ≤ x ≤ m and 1 ≤ y ≤ m, then h(x) = x ≠ y = h(y), so Pr[h(x) = h(y)] = 0.
2. If 1 ≤ x ≤ m < y, then Pr[h(x) = h(y)] = Pr[h′(y) = x] = 1/m, by strong 2-universality of H′.
3. If m < x and m < y, then Pr[h(x) = h(y)] = Pr[h′(x) = h′(y)] = 1/m.
Solution
We’ll use a variant of the Xt2 − t martingale for random walks.
For each i and t, let Yit = Xi+1,t − Xit be the size of the gap between
Xi+1 and Xi (wrapping around in the obvious way).
Suppose at some step t the adversary chooses robot j to move. Let i and
k be the smallest and largest robot ids such that Xit = Xjt = Xkt (again
wrapping around in the obvious way).
Then Yi−1,t and Ykt are the only gapsh that change,i and each rises or
2
drops by 1 with equal probability. So E Yi−1,t+1 2
Ft = Yi−1,t + 1 and
h i
2 2 + 1. If we define Z = n 2
Ft = Yj,t i=1 Yit − 2t, then {Zt } is a
P
E Yj,t+1 t
martingale with respect to {Ft }.
For any adversary strategy, starting from any configurations, there is a
sequence of at most n2 m coin-flips that causes all robots to coalesce. So
2
E [τ ] ≤ n2 m2n m < ∞. We also have that |Zt+1 − Zt | ≤ 2, so we can apply
the finite-expectation/bounded-increments case of Theorem 9.3.1 to show
that E [Zτ ] = E [Z0 ] = nm2 . But at time τ , all but one interval Yit is zero, and
2 − 2τ ,
the remaining interval has length mn. This gives E [Z τ ] = E (mn)
giving E [τ ] = 12 n2 m2 − nm2 = n2 m2 .
The nice thing about this argument is that it instantly shows that the
adversary’s strategy doesn’t affect the expected time until all robots coalesce,
but the price is that the argument is somewhat indirect. For specific adversary
strategies, it may be possible to show the coalescence time more directly.
For example, consider an adversary that always picks robot 0. Then we
can break down the resulting process into a sequence of phases delimited by
collisions between 0 and robots that have not yet been added to its entourage.
The first such collision happens after robot 0 hits one of the two absorbing
barriers at −m and +m, which occurs in m² steps on average. At this point,
we now have a new random walk with absorbing barriers at −m and +2m
relative to the position of robot 0 (possibly after flipping the directions to
get the signs right). This second random walk takes 2m² steps on average to
hit one of the barriers. Continuing in this way gives us a sequence of random
walks with absorbing barriers at −m and +km for each k ∈ {1 . . . n − 1}.
The total time is thus ∑_{k=1}^{n−1} km² = ((n−1)n/2)·m² = (n choose 2)·m², as shown above for
the general case.
1. Show that when 0 < p < 1, the sequence of values S t forms a Markov
chain with states consisting of all permissible sets, and that this chain
has a unique stationary distribution π in which π(S) = π(T ) whenever
|S| = |T |, where |S| is the number of 1 bits in S.
2. Let τ be the first time at which |S τ | ≥ n/2. Show that for any n, there
is a choice of p ≥ 0 such that E [τ ] = O(n log n).
Solution
1. First observe that the update rule for generating S^{t+1} depends only on
the state S^t; this gives that {S^t} is a Markov chain. It is irreducible,
because there is always a nonzero-probability path from any permissible
configuration S to any permissible configuration T that goes through
the empty configuration 0^n. It is aperiodic, because for any nonempty
configuration S^t, there is a nonzero probability that S^{t+1} = S^t, since
we can pick a position i with S_i^t = 1 and then choose not to set S_i^{t+1} to 0.
So a unique stationary distribution π exists.
We can show that π(S) depends only on |S| by using the fact that the
Markov chain is reversible. Let S and T be reachable states that differ
only in position i. Suppose that S_i^t = 0 and T_i^t = 1. Then p_ST = 1/n
and p_TS = p/n. Let c = ∑_S p^{−|S|}, where S ranges over all admissible
states, and let π_S = p^{−|S|}/c. Pick some S and T as above and let
k = |S|. Then

    π_S·p_ST = (p^{−k}/c)·(1/n) = (p^{−k−1}/c)·(p/n) = π_T·p_TS.

(Footnote: The marketing department convinced the company to change the number of computers
in the ring from 3n to n in response to consumer complaints. If the engineers' scheme
works, the next meeting of the marketing department will consider choosing a new name
for the company.)
To show that this is an injection, observe that the only way for f (i) =
f (j) to occur is if f (i) = i + 1 and f (j) = j − 1 (or vice versa). But if
f (i) = i + 1, then Sj = Si+2 = 1, and Sj is not an unswitchable zero.
So out of n − k zeroes we have at most k unswitchable zeroes, giving at
least n − 2k switchable zeroes. The expected waiting time to hit one of
n
these at least n − 2k switchable zeroes is at most n−2k , giving a total
Pb(n−1)/2c n
expected waiting time of at most k=0 n−2k ≤ nHn = O(n log n).
Solution
We can simplify things a bit by tracking X^t = |S^t|. Since changes in the
length of S don't depend on the characters in S, X^t is also a Markov chain.
Solution
1. The obvious way to generate n pairwise-independent random values
in {0, . . . , k − 1} is to apply the subset-sum construction described in
§5.1.2.1. Let ℓ = ⌈lg(n + 1)⌉, assign a distinct nonempty subset S_v of
{1, . . . , ℓ} to each vertex v, and let X_v = ∑_{i∈S_v} r_i (mod k), where r_1, . . . , r_ℓ are
independent and uniform over {0, . . . , k − 1}.
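A hedged sketch of this construction in Python (the mod-k sum completes the truncated text above; the function name is ours):

    import random

    def pairwise_independent_values(n, k):
        # Vertex v in {1..n} gets the subset given by its binary expansion,
        # which is distinct and nonempty; l = n.bit_length() = ceil(lg(n+1)).
        l = n.bit_length()
        r = [random.randrange(k) for _ in range(l)]
        return [sum(r[i] for i in range(l) if (v >> i) & 1) % k
                for v in range(1, n + 1)]

Each X_v is uniform, and for u ≠ v some index lies in the symmetric difference of S_u and S_v, so conditioning on one value still leaves the other uniform; this is the standard pairwise-independence argument for such sums.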
Solution
We’ll put each vertex v to G0 with independent probability p = 1/(d + 1).
We now have two possible bad outcomes:
We’ll start by showing that the sum of the probabilities of these bad
outcomes is not too big.
Let ` = d2d ln ne. Observe that we can describe a path of length ` in G
by specifying its starting vertex (n choices) and a sequence of ` edges each
leaving the last vertex added so far (d choices each). This gives at most
nd` paths of length ` in G. Each such path appears in G0 with probability
exactly p`+1 . Using the usual probabilistic-method argument, we get
≤ nd` p`+1
= np(pd)` .
    = n·(1/(d+1))·(d/(d+1))^ℓ
    = n·(1/(d+1))·(1 − 1/(d+1))^ℓ
    = n·(1/(d+1))·(1 − 1/(d+1))^{⌈2d ln n⌉}
    ≤ n·(1/(d+1))·(1 − 1/(d+1))^{2d ln n}
    ≤ n·e^{−ln n}·(1/(d+1))
    = 1/(d+1)
    ≤ 1/2.
For the second-to-last inequality, we use the inequality (1 − 1/(d+1))^{2d} ≤ e^{−1}
for d ≥ 1. This is easiest to demonstrate numerically, but if we had to prove
it we could argue that it holds when d = 1 (since 1/4 < 1/e) and that
(1 − 1/(d+1))^{2d} is a decreasing function of d.
For getting too few vertices, use Chernoff bounds. Let X be the number
of vertices in G′. Let µ = E[X] = n/(d + 1).
Solution
First observe that any transition preserves the property that our state looks
like a string of the form 0^k 1^{n−k}, where both the zeroes and ones all appear
in consecutive positions around the ring.
Let Xt be the number of zeroes after t steps. We have X0 = n/2, and
our possible transitions when 0 < Xt < n are:
1. With probability 1/n, we choose the leftmost zero and replace it with
a one. This makes Xt+1 = Xt − 1.
2. With probability 1/n, we choose the leftmost one and replace it with a
zero. This makes Xt+1 = Xt + 1.
Solution
Let i range from 1 to n, and let Xi be the i-th coin-flip in the sequence. Let
Ri be the indicator variable for the event that there is a run of length k or
more starting at position i. Then the expected number of runs of length
k or more is given by ni=1 E [Ri ] by linearity of expectation, and because
P
each such run adds one to the length of the encoded sequence, the expected
encoded length is n + ni=1 E [Ri ].
P
i=1 i=2
−k
=2·2 + (n − k) · 2−k
= (n − k + 2) · 2−k .
For the expected encoded sequence length, we must add back n to get
n + (n − k + 2) · 2−k .
Solution
Start by factoring

    ∑_{i=1}^n ∑_{j=1}^n B_ij = ∑_{i=1}^n ∑_{j=1}^n X_i Y_j = (∑_{i=1}^n X_i)·(∑_{j=1}^n Y_j).
Hoeffding's inequality gives

    Pr[|∑_{i=1}^n X_i| ≥ s] ≤ 2 exp(−s²/(2n))

and

    Pr[|∑_{j=1}^n Y_j| ≥ s] ≤ 2 exp(−s²/(2n)).

If neither of these events holds, then |∑_{i=1}^n ∑_{j=1}^n B_ij| = |∑_{i=1}^n X_i|·|∑_{j=1}^n Y_j| < s².
So by the union bound,

    Pr[∑_{i=1}^n ∑_{j=1}^n B_ij ≥ s²] ≤ Pr[|∑_{i=1}^n X_i| ≥ s] + Pr[|∑_{j=1}^n Y_j| ≥ s] ≤ 4 exp(−s²/(2n)).
Appendix D  Sample assignments from Fall 2016

1. Your name.
3. Whether you are taking the course as CPSC 469 or CPSC 569.
(You will not be graded on the bureaucratic part, but you should do it
anyway.)
1 procedure BubbleSortOnePass(A, n)
2 for i ← 1 to n − 1 do
3 if A[i] > A[i + 1] then
4 Swap A[i] and A[i + 1]
The usual algorithm repeats this loop until the array is sorted, but here we just do it
once.
Suppose that A starts out as a uniform random permutation of distinct
elements. As a function of n, what is the exact expected number of swaps
performed by Algorithm D.1?
Solution
The answer is n − H_n, where H_n = ∑_{i=1}^n 1/i is the n-th harmonic number.
There are a couple of ways to prove this. Below, we let A_i represent the
original contents of A[i], before doing any swaps. In each case, we will use the
fact that after i iterations of the loop, A[i] contains the largest of A_1, . . . , A_i;
this is easily proved by induction on i.
• Let Xij be the indicator variable for the event that Ai is eventually
swapped with Aj . For this to occur, Ai must be bigger than Aj , and
must be present in A[j −1] after j −1 passes through the loop. This hap-
pens if and only if Ai is the largest value in A1 , . . . , Aj . Because these
values are drawn from a uniform random permutation, by symmetry
Ai is largest with probability exactly 1/j. So E [Xij ] = 1/j.
Now sum X_ij over all pairs i < j. It is easiest to do this by summing
over j first:

    E[∑_{i<j} X_ij] = ∑_{i<j} E[X_ij]
    = ∑_{j=2}^n ∑_{i=1}^{j−1} E[X_ij]
    = ∑_{j=2}^n ∑_{i=1}^{j−1} 1/j
    = ∑_{j=2}^n (j−1)/j
    = ∑_{j=1}^n (j−1)/j
    = ∑_{j=1}^n (1 − 1/j)
    = n − ∑_{j=1}^n 1/j
    = n − H_n.
• Alternatively, let’s count how many values are not swapped from A[i]
to A[i − 1]. We can then subtract from n to get the number that are.
Let Yi be the indicator variable for the event that Ai is not swapped
into A[i − 1]. This occurs if, when testing A[i − 1] against A[i], A[i] is
larger. Since we know that at this point A[i − 1] is the largest value
among A1 , . . . , Ai−1 , Yi = 1 if and only if Ai is greater than all of
A1 , . . . , Ai−1 , or equivalently if Ai is the largest value in A1 , . . . , Ai .
Again by symmetry we have E[Y_i] = 1/i, and summing over all i gives
an expected H_n values that are not swapped down. So there are n − H_n
values on average that are swapped down, which also gives the expected
number of swaps.
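The exact answer is easy to sanity-check by simulation (names and trial counts ours):

    import random

    def one_pass_swaps(n):
        a = list(range(n))
        random.shuffle(a)                       # uniform random permutation
        swaps = 0
        for i in range(n - 1):                  # one pass of bubble sort
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swaps += 1
        return swaps

    n, trials = 50, 10_000
    avg = sum(one_pass_swaps(n) for _ in range(trials)) / trials
    print(avg, n - sum(1 / j for j in range(1, n + 1)))   # both about 45.5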
1. The students enter the room one at a time. No student enters until
the previous student has found a seat.
Solution
Define the i-th student to be the student who takes their seat after i students
have already sat (note that we are counting from zero here, which makes
things a little easier). Let X_i be the indicator variable for the event that the
i-th student does not find a seat on their own and needs help from a USA.
Each attempt to find a seat fails with probability i/m. Since each attempt is
independent of the others, the probability that all k attempts fail is (i/m)^k.
The number of students who need help is ∑_{i=0}^{n−1} X_i, so the expected
number is

    E[∑_{i=0}^{n−1} X_i] = ∑_{i=0}^{n−1} E[X_i] = ∑_{i=0}^{n−1} (i/m)^k = m^{−k} ∑_{i=0}^{n−1} i^k.
days.
Our automated financial reporter will declare a dead cat bounce if a
stock falls in price for two days in a row, followed by rising in price, followed
by falling in price again. Formally, a dead cat bounce occurs on day i if
i ≥ 4 and Si−4 > Si−3 > Si−2 < Si−1 > Si . Let D be the number of dead
cat bounces over the n days.
1. What is E [D]?
3. Our investors will shut down our system if it doesn't declare at least
one dead cat bounce during the first n days. What upper bound can you
get on Pr[D = 0] using Chebyshev's inequality?
Note added 2016-09-28: It’s OK if your solutions only work for sufficiently
large n. This should save some time dealing with weird corner cases when n
is small.
Solution
1. Let D_i be the indicator variable for the event that a dead cat bounce
occurs on day i. Let p = E[D_i] = Pr[D_i = 1]. Then

    p = Pr[X_{i−3} = −1 ∧ X_{i−2} = −1 ∧ X_{i−1} = +1 ∧ X_i = −1] = 1/16,

since the X_i are independent.
Then

    E[D] = E[∑_{i=4}^n D_i] = ∑_{i=4}^n E[D_i] = ∑_{i=4}^n 1/16 = (n−3)/16,

assuming n is at least 3.
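A simulation sketch (our names and counts) matches (n−3)/16:

    import random

    def dead_cat_bounces(n):
        x = [random.choice((-1, 1)) for _ in range(n)]   # daily increments
        # bounce on day i: increments -1, -1, +1, -1 ending at day i
        return sum(x[i - 3] == x[i - 2] == x[i] == -1 and x[i - 1] == +1
                   for i in range(3, n))

    n, trials = 1000, 2000
    avg = sum(dead_cat_bounces(n) for _ in range(trials)) / trials
    print(avg, (n - 3) / 16)   # both about 62.3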
2. For variance, we can't just sum up Var[D_i] = p(1−p) = 15/256, because
the D_i are not independent. Instead, we have to look at covariance.
Each D_i depends on getting a particular sequence of four values for
X_{i−3}, X_{i−2}, X_{i−1}, and X_i. If we consider how D_i overlaps with D_j for
j > i, we get these cases:
    Variable    Pattern       Correlation
    D_i         --+-
    D_{i+1}      --+-         inconsistent
    D_{i+2}       --+-        inconsistent
    D_{i+3}        --+-       overlap in one place
For larger j, there is no overlap, so D_i and D_j are independent for
j ≥ i + 4, giving Cov[D_i, D_j] = 0 for these cases.
Because D_i can't occur with D_{i+1} or D_{i+2}, when j ∈ {i+1, i+2} we
have Cov[D_i, D_j] = 0 − E[D_i]E[D_j] = −1/256. For j = i + 3, the
overlapping position is consistent for both patterns, so E[D_i D_{i+3}] = 2^{−7} = 1/128 and

    Cov[D_i, D_{i+3}] = E[D_i D_{i+3}] − E[D_i]E[D_{i+3}] = 1/128 − (1/16)² = 1/256.
4. Here is an upper bound that avoids the mess of computing the exact
probability, but is still exponentially small in n. Consider the events
[D_{4i} = 1] for i ∈ {1, 2, . . . , ⌊n/4⌋}. These events are independent
since they are functions of non-overlapping sequences of increments.
So we can compute Pr[D = 0] ≤ Pr[D_{4i} = 0 for all i ∈ {1, . . . , ⌊n/4⌋}] =
(15/16)^{⌊n/4⌋}.
This expression is a little awkward, so if we want to get an asymptotic
estimate we can simplify it using 1 + x ≤ e^x to get (15/16)^{⌊n/4⌋} ≤
exp(−(1/16)⌊n/4⌋) = O(exp(−n/64)).
With a better analysis, we should be able to improve the constant in
the exponent; or, if we are real fanatics, calculate Pr[D = 0] exactly.
But this answer is good enough given what the problem asks for.
Solution
We’ll adapt QuickSort. If the bad comparisons occurred randomly, we could
just take the majority of Θ(log n) faulty comparisons to simulate a non-faulty
comparison with high probability. But the adversary is watching, so if we do
these comparisons at predictable times, it could hit all Θ(log n) of them and
stay well within its fault budget. So we are going to have to be more sneaky.
Here’s this idea: Suppose we have a list of n comparisons hx1 , y1 i , hx2 , y2 i , . . . , hxn , yn i
that we want to perform. We will use a subroutine that carries out these
comparisons with high probability by doing kn ln n+n possibly-faulty compar-
isons, with k a constant to be chosen below, where each of the possibly-faulty
comparisons looks at xr and yr where each r is chosen independently and uni-
formly at random. The subroutine collects the results of these comparisons
for each pair hxi , yi i and takes the majority value.
At most n of these comparisons are faulty, and we get an error only if
some pair ⟨x_i, y_i⟩ gets more faulty comparisons than non-faulty ones. Let B_i
be the number of bad comparisons of the pair ⟨x_i, y_i⟩ and G_i the number of
good comparisons. We want to bound the probability of the event B_i > G_i.
The probability that any particular comparison lands on a particular
pair is exactly 1/n; so E[B_i] ≤ 1 and E[G_i] = k ln n. Now apply Chernoff
bounds. From (5.2.4), Pr[B_i ≥ (k/2) ln n] ≤ 2^{−(k/2) ln n} = n^{−(k ln 2)/2},
provided (k/2) ln n is at least 6. In the other direction, (5.2.6) says that
Pr[G_i ≤ (k/2) ln n] ≤ e^{−(1/2)²(k ln n)/2} = n^{−k/8}. So we have a probability of
at most n^{−(k ln 2)/2} + n^{−k/8} that we get the wrong result for this particular pair,
and from the union bound we have a probability of at most n^{1−(k ln 2)/2} + n^{1−k/8}
that we get the wrong result for any pair. We can simplify this a bit by
observing that n must be at least 2 (or we have nothing to sort), so we can put
a bound of n^{2−k/8} on the probability of any error among our n simulated
comparisons, provided k ≥ 12/ln 2. We'll choose k = max(12/ln 2, 8(c + 3))
to get a probability of error of at most n^{−c−1}.
So now let’s implement QuickSort. The first round of QuickSort picks a
pivot and performs n − 1 comparisons. We perform these comparisons using
our subroutine (note we can always throw in extra comparisons to bring
the total up to n). Now we have two piles on which to run the algorithm
recursively. Comparing all nodes in each pile to the pivot for that pile
requires n − 3 comparisons, which can again be done by our subroutine.
At the next state, we have four piles, and we can again perform the n − 7
comparisons we need using the subroutine. Continue until all piles have size
1 or less; this takes O(log n) rounds with high probability. Since each round
does O(n log n) comparisons and fails with probability at most n−c−1 , the
entire process takes O(n log2 n) comparisons and fails with probability at
most n−c−1 O(log n) ≤ n−c when n is sufficiently large.
Figure D.1: Filling a screen with Space Invaders. The left-hand image places
four copies of a 46-pixel sprite in random positions on a 40 × 40 screen. The
right-hand image does the same thing with 24 copies.
For this problem, we will imagine that we have a graphics device that can
only display one sprite. This is a bitmap consisting of m ones, at distinct relative
positions ⟨y_1, x_1⟩, ⟨y_2, x_2⟩, . . . , ⟨y_m, x_m⟩. Displaying a sprite at position
⟨y, x⟩ on our n×n screen sets the pixels at positions ⟨(y + y_i) mod n, (x + x_i) mod n⟩
for all i ∈ {1, . . . , m}. The screen is initially blank (all pixels 0) and setting
a pixel at some position ⟨y, x⟩ changes it to 1. Setting the same pixel more
than once has no effect.
We would like to use these sprites to simulate white noise on the screen,
by placing them at independent uniform random locations with the goal
of setting roughly half of the pixels. Unfortunately, because the contents
of the screen are not actually stored anywhere, we can’t detect when this
event occurs. Instead, we want to fix the number of sprites ` so that we
get 1/2 ± o(1) of the total number of pixels set to 1 with high probability,
by which we mean that (1/2 ± o(1))n2 total pixels are set with probability
1 − n−c for any fixed c > 0 and sufficiently large n, assuming m is fixed.
An example of this process, using Taito Corporation’s classic Space
Invader bitmap, is shown in Figure D.1.
Compute the value of ` that we need as a function of n and m, and show
that this choice of ` does in fact get 1/2 ± o(1) of the pixels set with high
probability.
Solution
We will apply the following strategy. First, we’ll choose ` so that each
individual pixel is set with probability close to 1/2. Linearity of expectation
then gives roughly n2 /2 total pixels set on average. To show that we get
close to this number with high probability, we’ll use the method of bounded
differences.
The probability that pasting in a single copy of the sprite sets a particular
pixel is exactly m/n². So the probability that the pixel is not set after ℓ sprites
is (1 − m/n²)^ℓ. This will be exactly 1/2 if ℓ = log_{1−m/n²}(1/2) = ln(1/2)/ln(1 − m/n²).
Since ℓ must be an integer, we can just round this quantity to the nearest
integer to pick our actual ℓ. For example, when n = 40 and m = 46,
ln(1/2)/ln(1 − 46/40²) = 23.7612..., which rounds to 24.
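The arithmetic for the example is a one-liner, given here only as a check:

    import math

    n, m = 40, 46
    l = math.log(1 / 2) / math.log(1 - m / n**2)
    print(l, round(l))   # 23.7612..., 24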
For large n, we can use the approximation ln(1 + x) ≈ x (which holds for
small x) to approximate ℓ as (n² ln 2/m) + o(1). We will use this to get a
concentration bound using (5.3.13).
We have ℓ independent random variables corresponding to the positions
of the ℓ sprites. Moving a single sprite changes the number of set pixels by
at most m. So the probability that the total number of set pixels deviates
from its expectation by more than a·n·√(ln n) is bounded by

    2 exp(−(a·n·√(ln n))² / (ℓm²)) = 2 exp(−a²n² ln n / ((n² ln 2/m + o(1))·m²))
    = 2 exp(−a² ln n / (m ln 2 + o(m²/n²)))
    ≤ exp(−(a²/m) ln n)
    = n^{−a²/m},

where the inequality holds when n is sufficiently large. Now choose a = √(cm)
to get a probability of deviating by more than n·√(cm ln n) of at most n^{−c}.
Since n·√(cm ln n) = o(1)·n², this gives us our desired bound.
Solution
Let X_ij be the indicator variable for the event that machine i gets job j.
Then E[X_ij] = 1/j for all i ≤ j, and E[X_ij] = 0 when i > j.
Let Y_i = ∑_{j=1}^n X_ij be the load on machine i. Then

    E[Y_i] = E[∑_{j=1}^n X_ij] = ∑_{j=i}^n 1/j ≤ ∑_{j=1}^n 1/j = H_n ≤ ln n + 1.
From (5.2.4), we have Pr[Y_i ≥ R] ≤ 2^{−R} as long as R > 2e·E[Y_i]. So if
we let R = (c + 1) lg n, we get a probability of at most n^{−c−1} that we get
more than R jobs on machine i. Taking the union bound over all i gives
a probability of at most n^{−c} that any machine gets a load greater than
(c + 1) lg n. This works as long as (c + 1) lg n ≥ 2e(ln n + 1), which holds for
sufficiently large c. For smaller c, we can just choose a larger value c′ that
does work, and get that Pr[max Y_i ≥ (c′ + 1) lg n] ≤ n^{−c′} ≤ n^{−c}.
So for any fixed c, we get that with probability at least 1 − n^{−c} the maximum load is O(log n).
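The model here is explicit enough to simulate: job j goes to a machine chosen uniformly from 1 . . . j. A minimal sketch (the constants are illustrative, not tuned):

    import math, random

    def max_load(n):
        load = [0] * (n + 1)                  # load[0] unused
        for j in range(1, n + 1):
            load[random.randint(1, j)] += 1   # job j -> uniform machine in 1..j
        return max(load)

    n = 10000
    print(max(max_load(n) for _ in range(20)),
          "vs ln n + 1 =", round(math.log(n) + 1, 2))

The observed maximum load stays within a small constant factor of ln n, consistent with the O(log n) bound.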
Solution
The best possible bound is at most 2 rotations on average in the worst case.
There is an easy but incorrect argument that 2 is an upper bound, which
says that if we rotate up, we do at most the same number of rotations as
when we insert a new element, and if we rotate down, we do at most the same
number of rotations as when we delete an existing element. This gives the
right answer, but for the wrong reasons: the cost of deleting x, conditioned
on the event that re-rolling its priority gives a lower priority, is likely to be
greater than 2, since the conditioning means that x is likely to be higher up
in the tree than average; the same thing happens in the other direction when
x moves up. Fortunately, it turns out that the fact that we don’t necessarily
rotate x all the way to or from the bottom compensates for this issue.
This can be formalized using the following argument, for which I am
indebted to Adit Singha and Stanislaw Swidinski. Fix some element x, and
suppose its old and new priorities are p and p′. If p < p′, we rotate up, and the sequence of rotations is exactly the same as we get if we remove all elements of the original treap with priority p or less and then insert a new element x with priority p′. But now if we condition on the number k of elements with priority greater than p, their priorities together with p′ are all independent and identically distributed, since they are all obtained by taking their original distribution and conditioning on being greater than p. So all (k + 1)! orderings of these priorities are equally likely, and this means that we have the same expected cost as an insertion into a treap with k elements, which is at most 2. Averaging over all k shows that the expected cost of rotating up is at most 2, and, since rotating down is just the reverse of this process with a reversed distribution on priorities (since we get it by choosing p′ as our old priority and p as the new one), the expected cost of rotating down is also at most 2. Finally, averaging the up and down cases gives that the expected number of rotations without conditioning on anything is at most 2.
We now give an exact analysis of the expected number of rotations, which
will show that 2 is in fact the best bound we can hope for.
The idea is to notice that whenever we do a rotation involving x, we
change the number of ancestors of x by exactly one. This will always be
a decrease in the number of ancestors if the priority of x went up, or an
increase if the priority of x went down, so the total number of rotations will
be equal to the change in the number of ancestors.
Letting A_i be the indicator for the event that i is an ancestor of x before the re-roll, and A′_i the indicator for the event that i is an ancestor of x after the re-roll, the number of rotations is just |∑_i A_i − ∑_i A′_i|, which is equal to ∑_i |A_i − A′_i| since we know that all changes A_i − A′_i have the same sign. So the expected number of rotations is just E[∑_i |A_i − A′_i|] = ∑_i E[|A_i − A′_i|], by linearity of expectation.
So we have to compute E[|A_i − A′_i|]. Using the same argument as in §6.3.2, we have that A_i = 1 if and only if i has the highest initial priority of all elements in the range [min(i, x), max(i, x)], and the same holds for A′_i if we consider updated priorities. So we want to know the probability that changing only the priority of x to a new random value changes whether i has the highest priority.
Let k = max(i, x) − min(i, x) + 1 be the number of elements in the
range under consideration. To avoid writing a lot of mins and maxes, let’s
renumber these elements as 1 through k, with i = 1 and x = k (this may
involve flipping the sequence if i > x). Let X_1, . . . , X_k be the priorities of these elements, and let X′_k be the new priority of x. These k + 1 random variables are independent and identically distributed, so conditioning on the event that no two are equal, all (k + 1)! orderings of their values are equally likely.
So now let us consider how many of these orderings result in |A_i − A′_i| = 1. For A_i to be 1, X_1 must exceed all of X_2, . . . , X_k. For A′_i to be 0, X_1 must not exceed all of X_2, . . . , X_{k−1}, X′_k. The intersection of these events is when X′_k > X_1 > max(X_2, . . . , X_k). Since X_2, . . . , X_k can be ordered in any of (k − 1)! ways, this gives

\[ \Pr\left[A_i = 1 \wedge A'_i = 0\right] = \frac{(k-1)!}{(k+1)!} = \frac{1}{k(k+1)}. \]
1. Suppose we insert a set S of n = |S| items into this hash table, using
a hash function h chosen at random from a strongly 2-universal hash
family H. Show that there is a constant c such that, for any t > 0, the
probability that at least n/2 + t items are lost is at most cn/t2 .
2. Suppose instead that we insert a set S of n = |S| items into this hash
table, using a hash function h chosen at random from a hash family
Solution
1. Let Xi be the indicator variable for the event that the i-th element
of S is hashed to an odd-numbered bucket. Since H is strongly 2-
universal, E [Xi ] ≤ 1/2 (with equality when m is even), from which
it follows that Var [Xi ] = E [Xi ] (1 − E [Xi ]) ≤ 1/4; and the Xi are
pairwise independent. Letting Y = ∑_{i=1}^n X_i be the total number of items lost, we get E[Y] ≤ n/2 and Var[Y] ≤ n/4. But then we can apply Chebyshev's inequality to show

\[ \Pr[Y \ge n/2 + t] \le \Pr[Y \ge E[Y] + t] \le \frac{\operatorname{Var}[Y]}{t^2} \le \frac{n/4}{t^2} = \frac{1}{4}\left(n/t^2\right). \]
So the desired bound holds with c = 1/4.
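The bound can also be checked empirically. The sketch below uses the standard family h(x) = ((ax + b) mod p) mod m as a stand-in for a strongly 2-universal family (an assumption; the problem does not fix a particular family), and counts items that land in odd-numbered buckets:

    import random

    p = 2**31 - 1                       # prime much larger than the universe

    def make_hash(m):
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda x: ((a * x + b) % p) % m

    n, m, t, trials = 1000, 64, 100, 2000
    bad = 0
    for _ in range(trials):
        h = make_hash(m)
        lost = sum(1 for x in range(n) if h(x) % 2 == 1)  # items in odd buckets
        bad += (lost >= n / 2 + t)
    print(bad / trials, "vs bound", (n / 4) / t**2)

The observed frequency sits well below the Chebyshev bound (1/4)(n/t^2), as expected, since Chebyshev is far from tight here.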
Solution
To paraphrase an often-misquoted line of Trotsky’s, young Karl Marx may
not recognize the Optional Stopping Theorem, but the Optional Stopping
Theorem does not permit him to escape its net. No strategy, no matter how
clever, can produce an expected return better or worse than simply waiting
for the last candy.
Let Xt be the expected return if Karl declares the revolution after seeing
t cards. We will show that {Xt , Ft } is a martingale where each Ft is the
σ-algebra generated by the random variables vπ(1) through vπ(t) .
\[
\begin{aligned}
E[X_{t+1} \mid F_t] &= E\left[\frac{1}{n-t-1}\left(\sum_{i=1}^{n} v_i - \sum_{i=1}^{t+1} v_{\pi(i)}\right) \,\middle|\, F_t\right] \\
&= \frac{1}{n-t-1}\left(\sum_{i=1}^{n} v_i - \sum_{i=1}^{t} v_{\pi(i)}\right) - \frac{1}{n-t-1}\, E\left[v_{\pi(t+1)} \mid F_t\right] \\
&= \frac{n-t}{n-t-1}\, X_t - \frac{1}{n-t-1}\, X_t \\
&= X_t.
\end{aligned}
\]
Now fix some strategy for Karl, and let τ be the time at which he
launches the revolution. Then τ < n is a stopping time with respect to the
Ft , and the Optional Stopping Theorem (bounded time version) says that
E [Xτ ] = E [X0 ]. So any strategy is equivalent (in expectation) to launching
the revolution immediately.
2. Show that the mixing time tmix for this process is polynomial in n.
Solution
1. The stationary distribution is a uniform distribution on all \binom{n}{k} placements of the robots. To prove this, observe that two increasing vectors x and y are adjacent if and only if there is some i such that x_j = y_j for all j ≠ i and x_i + d = y_i for some d ∈ {−1, +1}. In this case, the transition probability p_{xy} is 1/(2n), since there is a 1/n chance that we choose i and a 1/2 chance that we choose d. But this is the same as the probability that starting from y we choose i and −d. So we have p_{xy} = p_{yx} for all adjacent x and y, which means that a uniform distribution π satisfies π_x p_{xy} = π_y p_{yx} for all x and y.
To show that this stationary distribution is unique, we must show that
there is at least one path between any two states x and y. One way to
do this is to show that there is a path from any state x to the state ⟨1, . . . , k⟩, where at each step we move the lowest-index robot i that
is not already at position i. Since we can reverse this process to get
to y, this gives a path between any x and y that occurs with nonzero
probability.
2. This one could be done in a lot of ways. Below I’ll give sketches of
three possible approaches, ordered by increasing difficulty. The first
reduces to card shuffling by adjacent swaps, the second uses an explicit
coupling, and the third uses conductance.
Of these approaches, I am personally only confident of the coupling
argument, since it’s the one I did before handing out the problem, and
indeed this is the only one I have written up in enough detail below to
be even remotely convincing. But the reduction to card shuffling is also
pretty straightforward and was used in several student solutions, so I
am convinced that it can be made to work as well. The conductance
idea I am not sure works at all, but it seems like it could be made to
work with enough effort.
(a) Let’s start with the easy method. Suppose that instead of colliding
robots, we have a deck of n cards, of which k are specially marked.
Now run a shuffling algorithm that swaps adjacent cards at each
step. If we place a robot at the position of each marked card,
the trajectories of the robots follow pretty much the same distribution as in the colliding-robots process. This is trivially
the case when we swap a marked card with an unmarked card (a
robot moves), but it also works when we swap two marked cards
(no robot moves, since the positions of the set of marked cards
stays the same; this corresponds to a robot being stuck).
Unfortunately we can’t use exactly the same process we used in
§10.4.3.3, because this (a) allows swapping the cards in the first
and last positions of the deck, and (b) doesn’t include any moves
corresponding to a robot at position 1 or n trying to move off the
end of the line.
The first objection is easily dealt with, and indeed the cited result
of Wilson [Wil04] doesn’t allow such swaps either. The second can
be dealt with by adding extra no-op moves to the card-shuffling
process that occur with probability 1/n, scaling the probabilities
of the other operations to keep the sum to 1. This doesn’t affect
the card-shuffling convergence argument much, but it is probably
a good idea to check that everything still works.
Finally, even after fixing the card-shuffling argument, we still have
to argue that convergence in the card-shuffling process implies
convergence in the corresponding colliding-robot process. Here is
where the definition of total variation distance helps. Let C^t be the permutation of the cards after t steps, and let f : C^t ↦ X^t map permutations of cards to positions of robots. Let π and π′ be the stationary distributions of the card and robot processes, respectively. Then

\[
d_{TV}(f(C^t), f(\pi)) = \max_A \left|\Pr[X^t \in A] - \pi'(A)\right|
= \max_A \left|\Pr[C^t \in f^{-1}(A)] - \pi(f^{-1}(A))\right|
\le \max_B \left|\Pr[C^t \in B] - \pi(B)\right|
= d_{TV}(C^t, \pi).
\]
doesn't change X_j^t for any j ≠ i, the only change in Z^t will occur if one of X_i^t and Y_i^t can move and the other can't. There are several cases:

i. If i = k, d = +1, and exactly one of X_k^t and Y_k^t is n, then the copy of the robot not at n moves toward the copy at n, giving Z_{t+1} − Z_t = −1. The same thing occurs if i = 1, d = −1, and exactly one of X_1^t and Y_1^t is 1.
It is interesting to note that these two cases will account for the entire nonzero part of E[Z_{t+1} − Z_t | F_t], although we will not use this fact.

ii. If i < k, d = +1, X_i^t + 1 = X_{i+1}^t, but Y_i^t + 1 < Y_{i+1}^t, and X_i^t ≤ Y_i^t, then robot i can move right in Y^t but not in X^t.
and attempts to move to it. If the robot chooses not to move, it makes no noise. If it chooses to move, but (i′, j′) is off the grid or occupied by one of the m crates, then the robot stays at (i, j) and emits the noise. Otherwise it moves to (i′, j′) and remains silent. The robot's position at time 0 can be
any unoccupied location on the grid.
To keep the robot from getting walled in somewhere, whatever adversary
placed the crates was kind enough to ensure that if a crate was placed at
position (i, j), then all of the eight positions
(i − 1, j + 1) (i, j + 1) (i + 1, j + 1)
(i − 1, j) (i + 1, j)
(i − 1, j − 1) (i, j − 1) (i + 1, j − 1)
reachable by a king’s move from (i, j) are unoccupied grid locations. This
means that they are not off the grid and not occupied by another crate, so
that the robot can move to any of these eight positions.
Your job is to devise an algorithm for estimating m to within relative error ε with probability at least 1 − δ, based on the noises emitted by the robot. The input to your algorithm is the sequence of bits x_1, x_2, x_3, . . . , where x_i is 1 if the robot makes a noise on its i-th step. Your algorithm should run in time polynomial in n, 1/ε, and log(1/δ).
Solution
It turns out that ε is a bit of a red herring: we can in fact compute the exact number of crates with probability 1 − δ in time polynomial in n and log(1/δ).
The placement restrictions and laziness make this an irreducible aperiodic
chain, so it has a unique stationary distribution π. It is easy to argue from reversibility that this is uniform, so each of the N = n^2 − m unoccupied positions occurs with probability exactly 1/N.
It will be useful to observe that we can assign three unique unoccupied positions to each crate, to the east, south, and southeast, and this implies m ≤ n^2/4.
The idea now is to run the robot until the distribution on its position is
close to π, and then see if it hits an obstacle on the next step. We can easily
count the number of possible transitions that hit an obstacle, since there are 4m incoming edges to the crates, plus 4n incoming edges to the walls. Since each edge uv has probability π_u p_{uv} = 1/(8N) of being selected in the stationary distribution, the probability q that we hit an obstacle starting from π is exactly (4m + 4n)/(8N) = (n + m)/(2(n^2 − m)). This function is not trivial to invert, but we don't have to invert it: if we can compute its value (to some reasonable precision),
1. We cross the edge as part of the left-to-right portion of the path (this
includes left-to-right moves that are part of detours). In this case we
have |j − y| ≤ 1. This gives at most 3 choices for j, giving at most 3n^3 possible paths. (The constants can be improved here.)

This gives at most 6n^3 possible paths across each edge, giving a congestion

\[
\rho \le \frac{1}{\pi_{ij}\, p_{ij,i'j'}}\, (6n^3)\, \pi_{ij}\pi_{i'j'}
= \frac{1}{N^{-1}(1/8)}\, (6n^3)\, N^{-2}
= 24 n^3 N^{-1}
\le 24 n^3 \left(\tfrac{3}{4} n^2\right)^{-1}
= 32n,
\]

since we can argue from the placement restrictions that m ≤ n^2/4 and hence N ≥ 3n^2/4. This immediately gives a bound τ_2 ≤ 8ρ^2 = O(n^2), using Lemma 10.6.3.
So let's run for ⌈51 τ_2 ln n⌉ = O(n^2 ln n) steps for each sample. Starting from any initial location, we will reach some distribution σ with d_{TV}(σ, π) = O(n^{−50}). Let X be the number of obstacles (walls or crates) adjacent to the current position; then we can apply Lemma 10.2.2 to get |E_σ(X) − E_π(X)| ≤ 4 d_{TV}(σ, π) = O(n^{−50}). The same bound (up to constants) also applies to the probability ρ = X/8 of hitting an obstacle, giving ρ = 4(n + m)/(n^2 − m) ± O(n^{−50}). Note that ρ is Θ(n^{−1}) for all values of m.
Now take n^{10} ln(1/δ) samples, with a gap of ⌈51 τ_2 ln n⌉ steps between consecutive samples.
Figure D.2: Two hidden Space Invaders. On the left, the Space Invaders hide
behind random pixels. On the right, their positions are revealed by turning
the other pixels gray.
the bitmap wrap around as in Problem D.3.1.) Your algorithm should make every bitmap that contains at least two copies of the sprite exactly equally likely, and should run in expected time polynomial in n and m.
Solution
Rejection sampling doesn’t work here because if the sprites are large, the
chances of getting two sprites out of a random bitmap are exponentially
small. Generating a random bitmap and slapping two sprites on top of it
also doesn't work, because it gives a non-uniform distribution: if the sprites overlap in k places, there are 2^{n^2−2m+k} choices for the remaining bits, which means that we would have to adjust the probabilities to account for the effect of the overlap. But even if we do this, we still have issues with bitmaps that contain more than two sprites: a bitmap with three sprites can be generated in three different ways, and it gets worse quickly as we generate more. It
may be possible to work around these issues, but a simpler approach is to
use the sampling mechanism from Karp-Luby [KL85] (see also §11.4).
Order all n^2 positions for the two planted sprites lexicographically. For each pair of positions u < v, let A_{uv} be the set of all bitmaps with sprites at u and v. Then we can easily calculate |A_{uv}| = 2^{n^2−2m+k}, where k is the number of positions where sprites at u and v overlap, and so we can sample a particular A_{uv} with probability |A_{uv}| / ∑_{st} |A_{st}| and then choose an
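Presumably the step that follows is the standard Karp-Luby one: choose a uniform element of the sampled A_uv, and accept it only if (u, v) is the lexicographically first pair covering it, so that every bitmap in the union is output with probability exactly 1/∑_{st}|A_st| per attempt. A minimal Python sketch under that assumption (the sprite is given as a set of offsets, and the grid wraps around):

    import random

    def karp_luby_two_sprites(n, sprite):
        cells = [(y, x) for y in range(n) for x in range(n)]
        def pixels(u):       # pixels set by pasting the sprite at position u
            return {((u[0] + dy) % n, (u[1] + dx) % n) for dy, dx in sprite}
        pairs = [(u, v) for i, u in enumerate(cells) for v in cells[i + 1:]]
        # |A_uv| is proportional to 2^k, where k is the overlap of the copies
        weights = [2 ** len(pixels(u) & pixels(v)) for u, v in pairs]
        while True:
            u, v = random.choices(pairs, weights=weights)[0]
            forced = pixels(u) | pixels(v)
            bitmap = {c: 1 if c in forced else random.randrange(2) for c in cells}
            covered = [(s, t) for s, t in pairs
                       if all(bitmap[c] for c in pixels(s) | pixels(t))]
            if (u, v) == min(covered):  # accept only for the first covering pair
                return bitmap

    bm = karp_luby_two_sprites(3, [(0, 0), (0, 1)])

Each attempt accepts with probability |U|/∑_{st}|A_st|, which is at least one over the number of pairs, so the expected running time is polynomial in n and m.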
Solution
Let Y_t = X_t + (1/2) min(t, τ). We will show that {Y_t} is a supermartingale. Suppose we disinfect node v at time t, and that v has d′ ≤ d(v) uninfected neighbors. Then E[X_{t+1} | X_t] = X_t − 1 + d′/(2 d(v)) ≤ X_t − 1/2, because we always
1. What is E [Y0 ]?
2. What is E [Y1 ]?
Solution
1. This is a straightforward application of linearity of expectation. Let Z_i^t, for each i ∈ {1, . . . , n − 1}, be the indicator for the event that X_i^t = 1 and X_{i+1}^t = 0. For t = 0, X_i^0 and X_{i+1}^0 are independent, so this event occurs with probability 1/4. So

\[ E[Y_0] = E\left[\sum_{i=1}^{n-1} Z_i^0\right] = \sum_{i=1}^{n-1} E\left[Z_i^0\right] = (n-1)\cdot\frac{1}{4} = \frac{n-1}{4}. \]
2. Show that any correct algorithm for this problem will use more than n coin-flips with probability at least 2^{−O(n)}.
Solution
1. Use rejection sampling: generate 3 bits, and if the resulting binary number is not in the range 1, . . . , 6, try again. Each attempt consumes 3 bits and succeeds with probability 3/4, so we need to generate (4/3) · 3 = 4 bits on average.
It is possible to improve on this by reusing the last bit of a discarded triple of bits as the first bit in the next triple. This requires a more complicated argument to show uniformity, but requires only two bits
2. Suppose that we have generated n bits so far. Since 6 does not evenly divide 2^n for any n, we cannot assign an output from 1, . . . , 6 to all possible 2^n sequences of bits without giving two outputs different probabilities. So we must keep going in at least one case, giving a probability of at least 2^{−n} = 2^{−O(n)} that we continue.
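A minimal sketch of the rejection sampler from part 1:

    import random

    def die_roll():
        bits = 0
        while True:
            x = (random.getrandbits(1) << 2) | (random.getrandbits(1) << 1) \
                | random.getrandbits(1)
            bits += 3
            if 1 <= x <= 6:
                return x, bits

    rolls = [die_roll() for _ in range(100000)]
    print(sum(b for _, b in rolls) / len(rolls))  # about 4 bits per roll

The empirical average matches the (4/3) · 3 = 4 bits computed above.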
Appendix E
1. Your name.
(You will not be graded on the bureaucratic part, but you should do it
anyway.)
Solution
1. Suppose we have already inserted k elements. Then the next element
is equally likely to land in any of the positions A[1] through A[k + 1].
The number of displaced elements is then uniformly distributed in 0 through k, giving an expected cost for this insertion of k/2.
Summing over all insertions gives

\[ \sum_{k=0}^{n-1} \frac{k}{2} = \frac{1}{2}\sum_{k=0}^{n-1} k = \frac{n(n-1)}{4}. \]
It’s always nice when we get the same answer in situations like this.
2. Now we need to count how far the pointer moves between any two
consecutive elements. Suppose that we have already inserted k − 1 > 0
elements, and let Xk be the cost of inserting the k-th element. Let
i and j be the indices in the sorted list of the new and old pointer
positions after the k-th insertion. By symmetry, all pairs of distinct
This is such a simple result that we might reasonably expect that there
is a faster way to get it, and we’d be right. A standard trick is to
observe that we can simulate choosing k points uniformly at random
from a line of n points by instead choosing k + 1 points uniformly at
random from a cycle of n + 1 points, and deleting the first point chosen
to turn the cycle back into a line. In the cycle, symmetry implies that
the expected distance between each point and its successor is the same
as for any other point; there are k + 1 such distances, and they add up to n + 1, so each expected distance is exactly (n + 1)/(k + 1).
In our particular case, n (in the formula) is k and k (in the formula) is 2, so we get (k + 1)/3. Note we are sweeping the whole absolute value thing under the carpet here, so maybe the more explicit derivation is safer.
However we arrive at E[X_k] = (k + 1)/3 (for k > 1), we can sum these expectations to get our total expected cost:

\[
E\left[\sum_{k=2}^{n} X_k\right] = \sum_{k=2}^{n} E[X_k]
= \sum_{k=2}^{n} \frac{k+1}{3}
= \frac{1}{3}\sum_{\ell=3}^{n+1} \ell
= \frac{1}{3}\left(\frac{(n+1)(n+2)}{2} - 3\right)
= \frac{(n+1)(n+2)}{6} - 1.
\]
It’s probably worth checking a few small cases to see that this answer
actually makes sense.
For large n, this shows that the doubly-linked list wins, but not by much: we get roughly n^2/6 instead of n^2/4. This is a small enough difference that
in practice it is probably dominated by other constant-factor differences that
we have neglected.
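The identity E[X_k] = (k + 1)/3 is easy to spot-check by Monte Carlo, modeling the old and new pointer positions as a uniform random pair of distinct positions among k (a sketch consistent with the derivation above):

    import random

    k, trials = 100, 100000
    total = sum(abs(i - j)
                for i, j in (random.sample(range(1, k + 1), 2)
                             for _ in range(trials)))
    print(total / trials, "vs", (k + 1) / 3)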
Solution
We can use linearity of expectation to compute the probability that any
particular edge is monochromatic, and then multiply by m to get the total.
Fix some edge uv. If either of u or v is recolored in step 2, then the probability that c′_u = c′_v is exactly 1/k. If neither is recolored, the probability that c′_u = c′_v is zero (otherwise c_u = c_v, forcing both to be recolored). So we can calculate the probability that c′_u = c′_v by conditioning on the event A that neither vertex is recolored.
This event occurs if both u and v have no neighbors with the same color.
The probability that cu = cv is 1/k. The probability that any particular
neighbor w of u has cw = cu is also 1/k; similarly for any neighbor w of
v. These events are all independent on the assumption that the graph is
triangle-free (which implies that no neighbor of u is also a neighbor of v).
So the probability that none of these 2r − 1 events occur is (1 − 1/k)^{2r−1}. We then have

\[
\Pr[c'_u = c'_v] = \Pr\left[c'_u = c'_v \mid \bar{A}\right]\Pr\left[\bar{A}\right] + \Pr\left[c'_u = c'_v \mid A\right]\Pr[A]
= \frac{1}{k}\cdot\left(1 - \left(1 - \frac{1}{k}\right)^{2r-1}\right).
\]

Multiply by m to get

\[ \frac{m}{k}\cdot\left(1 - \left(1 - \frac{1}{k}\right)^{2r-1}\right). \]

For large k, this is approximately (m/k)(1 − e^{−(2r−1)/k}), which is a little bit better than the m/k expected monochromatic edges from just running step 1.
Repeated application of step 2 may give better results, particularly if k is
large relative to r. We will see this technique applied to a more general class
of problems in §13.3.5.
³Both endpoints have the same color.
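For a concrete check, an n-cycle is triangle-free and 2-regular, so r = 2 and m = n, and the independence argument above applies exactly. A simulation sketch:

    import random

    def monochromatic_after_recoloring(n, k):
        c = [random.randrange(k) for _ in range(n)]
        bad = [c[i] == c[(i - 1) % n] or c[i] == c[(i + 1) % n]
               for i in range(n)]
        c2 = [random.randrange(k) if bad[i] else c[i] for i in range(n)]
        return sum(c2[i] == c2[(i + 1) % n] for i in range(n))

    n, k, r, trials = 1000, 5, 2, 200
    avg = sum(monochromatic_after_recoloring(n, k) for _ in range(trials)) / trials
    print(avg, "vs", (n / k) * (1 - (1 - 1 / k) ** (2 * r - 1)))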
Solution
Let Cij be the communication cost between machines i and j. This is just
an indicator variable for the event that i and j are assigned to different machines, which occurs with probability 1 − 1/m. We have C = ∑_{1≤i<j≤n} C_{ij}.

1. Expectation is a straightforward application of linearity of expectation. There are \binom{n}{2} pairs of processes, and E[C_{ij}] = 1 − 1/m for each pair, so

\[ E[C] = \binom{n}{2}\left(1 - \frac{1}{m}\right). \]
2. Variance is a little trickier because the C_{ij} are not independent. But they are pairwise independent: even if we fix the location of i and j, the expectation of C_{jk} is still 1 − 1/m, so Cov[C_{ij}, C_{jk}] = E[C_{ij} C_{jk}] − E[C_{ij}] · E[C_{jk}] = 0. So we can compute

\[ \operatorname{Var}[C] = \sum_{1\le i<j\le n} \operatorname{Var}[C_{ij}] = \binom{n}{2}\cdot\frac{1}{m}\left(1 - \frac{1}{m}\right). \]
Solution
The expected cost of searching bucket i is E [dlg(Xi + 1)e]. This is the
expectation of a function of Xi , so we would like to bound it using Jensen’s
inequality (§4.3).
Unfortunately the function f (n) = dlg(n + 1)e is not concave (because
of the ceiling), but 1 + lg(n + 1) > dlg(n + 1)e is. So the expected cost of
searching bucket i is bounded by 1 + lg(E [Xi ] + 1) = 1 + lg(k + 1).
Assuming we search the buckets in some fixed order until we find x, we will search Y buckets, where E[Y] = (n + 1)/2. Because Y is determined by the position of x, which is independent of the X_i, Y is also independent of the X_i. So Wald's equation (3.4.3) applies, and the total cost is bounded by

\[ \frac{n+1}{2}\left(1 + \lg(k+1)\right). \]
[Figure: two example trees on the keys 0 through 6; the drawing did not survive extraction.]
Solution
For any node u, let depth(u) be the depth of u in T and depth′(u) be the depth of u in T′. Note that depth′(u) is a random variable. We will start by computing E[depth′(u)] as a function of depth(u), by solving an appropriate recurrence.

Let S(k) = E[depth′(u)] when depth(u) = k. The base cases are S(0) = 0 (the depth of the root never changes) and S(1) = 1 (same for the root's children). For larger k, we have

\[ E[\mathrm{depth}'(u)] = \frac{1}{2} E\left[1 + \mathrm{depth}'(\mathrm{parent}(u))\right] + \frac{1}{2} E\left[1 + \mathrm{depth}'(\mathrm{parent}(\mathrm{parent}(u)))\right], \]

or

\[ S(k) = 1 + \frac{1}{2} S(k-1) + \frac{1}{2} S(k-2). \]

There are various ways to solve this recurrence.⁴ The most direct may be to define a generating function F(z) = ∑_{k=0}^∞ S(k) z^k. Then the recurrence becomes

\[ F = \frac{z}{1-z} + \frac{1}{2} z F + \frac{1}{2} z^2 F. \]
and similarly

\[ \Pr\left[\mathrm{depth}'(u) \le E[\mathrm{depth}'(u)] - t\right] \le e^{-2t^2/(\mathrm{depth}(u)-1)}. \tag{E.3.3} \]

Let t = √((1/2) D ln(1/ε)). Then the right-hand side of (E.3.2) and (E.3.3) becomes e^{−D ln(1/ε)/(depth(u)−1)} < e^{−ln(1/ε)} = ε. For ε = (1/2) n^{−c−1}, we get t = √((1/2) D ln(2n^{c+1})) = √(((c + 1)/2) D (ln n + ln 2)) = O(√(D log n)) when c is constant.

For the lower bound on D′, we can apply (E.3.3) to a single node u with depth(u) = D; this node by itself will give D′ ≥ (2/3)D − O(√(D log n)) with

⁴A less direct but still effective approach is to guess that S(k) grows linearly, and find a and b such that S(k) ≤ ak + b. For this we need ak + b ≥ 1 + (1/2)(a(k − 1) + b) + (1/2)(a(k − 2) + b). The b's cancel, leaving ak ≥ 1 + ak − (3/2)a. Now the ak's cancel, leaving us with 0 ≥ 1 − (3/2)a, or a ≥ 2/3. We then go back and make b = 1/3 to get the right bound on S(1), giving the bound S(k) ≤ (2/3)k + (1/3). We can then repeat the argument for S(k) ≥ a′k + b′ to get a full bound (2/3)k ≤ S(k) ≤ (2/3)k + (1/3).
probability at least 1 − (1/2) n^{−c−1}. For the upper bound, we need to take the maximum over all nodes. In general, an upper bound on the maximum of a bunch of random variables is likely to be larger than an upper bound on any one of the random variables individually, because there is a lot of room for one of the variables to get unlucky, but we can apply the union bound to get around this. For each individual u, we have Pr[depth′(u) ≥ (2/3)D + O(√(D log n))] ≤ (1/2) n^{−c−1}, so

\[ \Pr\left[D' \ge \tfrac{2}{3}D + O\left(\sqrt{D\log n}\right)\right] \le \sum_u \tfrac{1}{2} n^{-c-1} = \tfrac{1}{2} n^{-c}. \]

This completes the proof.
At each step, the part inspectors apply the following rules to decide which
test to apply:
• For the first part, a fair coin-flip decides between the tests.
• For subsequent parts, if the previous part passed, the inspectors become
suspicious and apply the rigorous test; if it failed, they relax and apply
the normal test.
For example, writing N+ for a part that passes the normal test, N- for
one that fails the normal test, R+ for a part that passes the rigorous test,
and R- for one that fails the rigorous test, a typical execution of the testing
procedure might look like N- N+ R- N- N+ R- N+ R- N- N- N- N+ R+ R-
N+ R+. This execution tests 16 parts and passes 7 of them.
Suppose that we test n parts. Let S be the number that pass.
1. Compute E [S].
2. Show that there is a constant c > 0 such that, for any t > 0,

\[ \Pr\left[|S - E[S]| \ge t\right] \le 2e^{-ct^2/n}. \tag{E.3.4} \]
Solution
Using McDiarmid's inequality and some cleverness. Let X_i be the indicator variable for the event that part i passes, so that S = ∑_{i=1}^n X_i.

1. We can show by induction that E[X_i] = 1/2 for all i. The base case is X_1, where Pr[part 1 passes] = (1/2) Pr[part 1 passes rigorous test] + (1/2) Pr[part 1 passes normal test] = (1/2)(1/3) + (1/2)(2/3) = 1/2. For i > 1, E[X_{i−1}] = 1/2 implies that part i is tested with the normal and rigorous tests with equal probability, so the analysis for X_1 carries through and gives E[X_i] = 1/2 as well. Summing over all X_i gives E[S] = n/2. (A simulation sketch of this chain appears after the solution.)
2. We can’t use Chernoff, Hoeffding, or Azuma here, because the Xi are
not independent, and do not form a martingale difference sequence even
after centralizing them by subtracting off their expectations. So we
are left with McDiarmid’s inequality unless we want to do something
clever and new (we don’t). Applying McDiarmid to the Xi directly
doesn’t work so well, but we can make it work with a different set of
variables that generate the same outcomes.
Let Y_i ∈ {A, B, C} be the grade of part i, where A means that it passes both the rigorous and the normal test, B means that it fails the rigorous test but passes the normal test, and C means that it fails both tests. In terms of the X_i, Y_i = A means X_i = 1, Y_i = C means X_i = 0, and Y_i = B means X_i = 1 − X_{i−1} (when i > 1). We get the right probabilities for passing each test by making the three grades equally likely. We can either handle the coin-flip at the beginning by including an extra variable Y_0, or we can combine the coin-flip with Y_1 by assuming that Y_1 is either A or C with equal probability. The latter approach improves our bound a little bit, since then we only have n variables and not n + 1.

Now suppose that we fix all Y_j for j ≠ i and ask what happens if Y_i changes.
Solution
Let X be the random variable representing the final table size. Our goal is
to bound E [X].
First let’s look at the probability that we get at least one collision between
consecutive elements when inserting n elements into a table with m locations.
Because the pairs of consecutive elements overlap, computing the exact
probability that we get a collision is complicated, but we only need an upper
bound.
We have n − 1 consecutive pairs, and each produces a collision with probability at most 1/m. This gives a total probability of a collision of at most (n − 1)/m.
Let k = ⌈lg n⌉, so that 2^k ≥ n. Then the probability of a consecutive collision in a table with 2^{k+ℓ} locations is at most (n − 1)/2^{k+ℓ} < 2^k/2^{k+ℓ} = 2^{−ℓ}.
Since the events that collisions occur at each table size are independent, we can compute, for ℓ > 0,

\[ \Pr\left[X = 2^{k+\ell}\right] \le \Pr\left[X \ge 2^{k+\ell}\right] \le \prod_{i=0}^{\ell-1} 2^{-i} = 2^{-\ell(\ell-1)/2}. \]

Then E[X] ≤ ∑_{ℓ≥0} 2^{k+ℓ} · 2^{−ℓ(ℓ−1)/2} = 2^k ∑_{ℓ≥0} 2^{ℓ−ℓ(ℓ−1)/2} = O(2^k), since the series converges to some constant that does not depend on k. But we chose k so that 2^k = O(n), so this gives us our desired bound.
Solution
Insert the sequence 1 . . . n.
Let us first argue by induction on i that all elements x with h(x) = m
appear as the uppermost elements of the right spine of the treap. Suppose
that this holds for i − 1. If h(i) = m, then after insertion i is rotated up
until it has a parent that also has heap key m; this extends the sequence
of elements with heap key m in the spine by 1. Alternatively, if h(i) < m,
then i is never rotated above an element x with h(x) = m, so the sequence
of elements with heap key m is unaffected.
Because each new element has a larger tree key than all previous elements,
inserting a new element i requires moving past any elements in the right
spine, and in particular requires moving past any elements j < i with
h(j) = m. So the expected cost of inserting i is at least the expected
number of such elements j. Because h is chosen from a strongly 2-universal
hash family, Pr [h(j) = m] = 1/m for any j, and by linearity of expectation,
E[|{j < i | h(j) = m}|] = (i − 1)/m. Summing this quantity over all i gives a total expected insertion cost of at least n(n − 1)/(2m) = Ω(n^2/m).
Solution
This is a job for the optional stopping theorem. Essentially we are going to
follow the same analysis from §9.4.1 for a random walk with two absorbing
barriers, applied to Xt .
1. For the first part, we show that (Xt , Ft ) is a martingale. The intuition
is that however the bits in At are arranged, there are always exactly
the same number of positions where a zero can be replaced by a one as
there are where a one can be replaced by a zero.
Let R_t be the random location chosen in state A_t; conditioned on F_t, R_t is uniform over the n positions. But then

\[ E[X_{t+1} \mid F_t] = \sum_{i=0}^{n-1} \frac{1}{n}\left(X_t + A[i] - A[(i+1) \bmod n]\right) = X_t + \frac{1}{n} X_t - \frac{1}{n} X_t = X_t, \]

so {X_t, F_t} is a martingale, and the optional stopping theorem gives E[X_τ] = E[X_0] = k. But then, since X_τ ∈ {0, n}, we have n · Pr[X_τ = n] = E[X_τ] = k, so Pr[X_τ = n] = k/n.
\[ k^2 \le kn - \frac{2\, E[\tau]}{n}, \]

which gives

\[ E[\tau] \le \frac{kn - k^2}{2/n} = \frac{kn^2 - k^2 n}{2}. \]

This is maximized at k = ⌊n/2⌋, giving

\[ E[\tau] \le \frac{\lfloor n/2\rfloor \cdot n^2 - (\lfloor n/2\rfloor)^2 \cdot n}{2}. \]
To show that this bound applies in the worst case, observe that if we start with contiguous regions of k ones and n − k zeros in A_t, then (a) Y_t = 1, and (b) the two-region property is preserved in A_{t+1}. In this case, for t < τ it holds that E[Z_{t+1} | F_t] = Z_t + 2Y_t/n − 2/n = Z_t, so Z_t is a martingale, and thus E[τ] = (kn^2 − k^2 n)/2. This shows that the initial state with ⌊n/2⌋ consecutive zeros and ⌈n/2⌉ consecutive ones (or vice versa) gives the claimed worst-case time.
Solution
Consider the n diagonal elements in positions Aii . For each such element,
there is a 1/n chance at each step that its row or column is chosen. The
⁶I've been getting some questions about what this means, so here is an attempt to translate it into English.
Recall that f(n) is ω(g(n)) if lim_{n→∞} f(n)/g(n) goes to infinity, and f(n) is o(g(n)) if lim_{n→∞} f(n)/g(n) goes to zero.
The problem is asking you to show that there is some f(n) that is more than a constant times n, such that the total variation distance between A_{f(n)} and B becomes arbitrarily close to 1 for sufficiently large n.
So for example, if you showed that at t = n^4, d_TV(A_t, B) ≥ 1 − 1/log^2 n, that would demonstrate the claim, because lim_{n→∞} n^4/n goes to infinity and lim_{n→∞} 1/log^2 n = 0. These functions are, of course, for illustration only. The actual process might or might not converge by time n^4.
time until every diagonal node is picked at least once maps to the coupon
collector problem, which means that it is Ω(n log n) with high probability
using standard concentration bounds.
Let C be the event that there is at least one diagonal element that is in its original position. If there is some diagonal node that has not moved, C holds; so with probability 1 − o(1), C holds in A_t for some t that is Θ(n log n) = ω(n). But by the union bound, C holds in B with probability at most n · n^{−2} = 1/n. So the difference between the probability of C in A_t and in B is at least 1 − o(1) − 1/n = 1 − o(1).
Solution
Since it’s a Las Vegas algorithm, Markov chain Monte Carlo is not going to
help us here. So we can set aside couplings and conductance and just go
straight for generating a solution.
First, let’s show that we can generate uniform colorings of an n-node
line in linear time. Let Xi be the color of node i, where 0 ≤ i < n. Choose
X0 uniformly at random from all three colors; then for each i > 0, choose
Xi uniformly at random from the two colors not chosen for Xi−1 . Given a
coloring, we can show by induction on i that it can be generated by this process, and because each choice is uniform and each coloring is generated only once, we get all 3 · 2^{n−1} colorings of the line with equal probability.
Now we try hooking X_{n−1} to X_0. If X_{n−1} = X_0, then we don't have a cycle coloring, and have to start over. The probability that this event occurs is at most 1/2, because for every path coloring with X_{n−1} = X_0, there is another coloring where we replace X_{n−1} with a color not equal to X_{n−2} or X_0. So after at most 2 attempts on average we get a good cycle coloring.
This gives a total expected cost of O(n).
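A direct sketch of this sampler:

    import random

    def random_cycle_coloring(n, colors=(0, 1, 2)):
        while True:
            x = [random.choice(colors)]
            for _ in range(n - 1):
                x.append(random.choice([c for c in colors if c != x[-1]]))
            if x[0] != x[-1]:            # hook X_{n-1} back to X_0
                return x

    print(random_cycle_coloring(10))

Each attempt costs O(n) and succeeds with probability at least 1/2, giving O(n) expected time overall.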
too low at any time t, all your worldly goods will be repossessed, and you
will have nothing but your talent for randomized algorithms to fall back on.
Note: an earlier version of this problem demanded a tighter
bound.
Show that when n and m are sufficiently large, it is always possible to choose a subset S of size n/2 so that w_t ≥ −m√(n ln nm) for all 0 < t ≤ m, and give an algorithm that finds such a subset in time polynomial in n and m on average.
Solution
Suppose that we flip a coin independently to choose whether to include each
investment i. There are two bad things that can happen:

1. We lose too much at some time t from the investments the coin chooses.

2. The coin chooses a number of investments different from n/2.

If we can show that the sum of the probabilities of these bad events is less than 1, we get the existence proof we need. If we can show that it is enough less than 1, we also get an algorithm, because we can test in time O(nm) if a particular choice works.
Let X_i be the indicator variable for the event that we include investment i. Then
\[
\begin{aligned}
w_t &= \sum_{i=1}^{n} X_i \sum_{j=1}^{t} a_{ij} \\
&= \sum_{i=1}^{n} \left(\frac{1}{2} + \left(X_i - \frac{1}{2}\right)\right) \sum_{j=1}^{t} a_{ij} \\
&= \frac{1}{2}\sum_{j=1}^{t}\sum_{i=1}^{n} a_{ij} + \sum_{i=1}^{n}\left(X_i - \frac{1}{2}\right)\sum_{j=1}^{t} a_{ij} \\
&= \sum_{i=1}^{n}\left(X_i - \frac{1}{2}\right)\sum_{j=1}^{t} a_{ij}.
\end{aligned}
\]

Because X_i − 1/2 is always ±1/2, E[X_i − 1/2] = 0, and each a_ij is ±1, each term in the outermost sum is a zero-mean random variable that satisfies |(X_i − 1/2) ∑_{j=1}^t a_ij| ≤ t/2 ≤ m/2. So Hoeffding's inequality says

\[
\Pr\left[w_t - E[w_t] < -m\sqrt{n\ln nm}\right] \le e^{-m^2 n \ln nm/(2n(m/2)^2)} = e^{-2\ln nm} = (nm)^{-2}.
\]
Summing over all t, the probability that this bound is violated for any t is at most m(nm)^{−2} = 1/(n^2 m).

For the second source of error, we have Pr[∑_{i=1}^n X_i ≠ n/2] = 1 − \binom{n}{n/2} 2^{−n} = 1 − Θ(1/√n). So the total probability that the random assignment fails is bounded by 1 − Θ(1/√n) + 1/(n^2 m), giving a probability that it succeeds of at least Θ(1/√n) − 1/(n^2 m) = Θ(1/√n). It follows that generating and testing random assignments gives an assignment with the desired characteristics after Θ(√n) trials on average, giving a total expected cost of Θ(n^{3/2} m).
Solution
Suppose we condition on a particular set of k elements appearing in positions
A[1] through A[k]. By symmetry, all k! permutations of these elements are
equally likely. Putting the largest two elements in A[k − 1] and A[k] leaves
(k − 2)! choices for the remaining elements, giving a probability of a double record at k of exactly (k − 2)!/k! = 1/(k(k − 1)).
Applying linearity of expectation gives a total expected number of double
records of
\[
\sum_{k=2}^{n} \frac{1}{k(k-1)}
\le \sum_{k=2}^{n} \frac{1}{(k-1)^2}
\le \sum_{k=2}^{\infty} \frac{1}{(k-1)^2}
= \sum_{k=1}^{\infty} \frac{1}{k^2}
= \frac{\pi^2}{6}
= O(1).
\]
Since the expected number of double records is at least 1/2 = Ω(1) for
n ≥ 2, this gives a tight asymptotic bound of Θ(1).
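In fact the sum telescopes: ∑_{k=2}^{n} 1/(k(k − 1)) = ∑_{k=2}^{n} (1/(k − 1) − 1/k) = 1 − 1/n, which a short simulation sketch confirms:

    import random

    def double_records(n):
        best, prev_rec, count = float("-inf"), False, 0
        for _ in range(n):
            x = random.random()
            rec = x > best
            count += rec and prev_rec    # A[k-1] and A[k] both records
            best, prev_rec = max(best, x), rec
        return count

    n, trials = 100, 20000
    print(sum(double_records(n) for _ in range(trials)) / trials, "vs", 1 - 1 / n)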
I liked seeing our old friend π 2 /6 so much that I didn’t notice an easier
Solution
This problem produced the widest range of solutions, including several very
clever deterministic algorithms. Here are some examples.
time we get a labeling in which the number of happy nodes is at least the
n/2 expected happy nodes we started with.
With a bit of care, the cost can be reduced to linear: because each new
labeled node only affects its own probability of happiness and those of its
three neighbors, we can update the conditional expectation by just updating
the values for those four nodes. This gives O(1) cost per step or O(n) total.
Solution
We need two ideas here. First, we'll show that X_t^2 − t is a martingale, despite the fact that X_t by itself isn't. Second, we'll use the idea of stopping X_t when it hits s + 1, creating a new martingale Y_t = X_{t∧τ}^2 − (t ∧ τ), where τ is the first time where X_t = s (and t ∧ τ is shorthand for min(t, τ)). We can then apply Markov's inequality to X_{n∧τ}.
To save time, we’ll skip directly to showing that Yt is a martingale. There
are two cases:
1. If t < τ, then

\[
E[Y_{t+1} \mid Y_t, t < \tau] = \frac{1}{2}\left((X_t+1)^2 - (t+1)\right) + \frac{1}{2}\left((X_t-1)^2 - (t+1)\right)
= X_t^2 + 1 - (t+1)
= X_t^2 - t
= Y_t.
\]
The engineer reasons that since a typical monitor covers 5 grid locations,
using n^2/4 monitors on average should cover (5/4)n^2 locations, with the
extra monitors adding a little bit of safety to deal with bad random choices.
So few if any processing units should escape.
1. Compute the exact expected number of processing units that are not
within range of any monitor, as a function of n. You may assume
n > 0.
Solution
1. This part is actually more annoying, because we have to deal with
nodes on the edges. There are three classes of nodes:
2. Let Xij be the indicator for a monitor at position (i, j). Recall that
we have assumed that these variables are independent.
Appendix F

1. Your name.
(You will not be graded on the bureaucratic part, but you should do it anyway.)
Solution
If we can figure out the probability that bins i and i + 1 are both empty for
some particular i, then by symmetry and linearity of expectation we can just
multiply by n − 1 to get the full answer.
For bins i and i + 1 to be empty, every ball must choose another bin. This occurs with probability (1 − 2/n)^m. The full answer is thus (n − 1)(1 − 2/n)^m, or approximately ne^{−2m/n} when n is large.
Solution
We’ll use the law of total probability. First observe that the probability that
a random labeling yields a zero sum for any single vertex and its neighbors is
exactly 1/(n + 1); the easiest way to see this is that after conditioning on the values of the neighbors, there is only one of the n + 1 possible values that can be assigned to the vertex itself to cause a failure. Now sum this probability over all n
vertices to get a probability of failure of at most n/(n + 1). It follows that
after n + 1 attempts on average (each of which takes O(n2 ) time to check all
the neighborhood sums), the algorithm will find a good labeling, giving a
total expected time of O(n3 ).
E [T | T ≥ n] = 2n + 1. (F.1.1)
Solution
Expand (F.1.1) using the definition of conditional expectation to get

\[
2n + 1 = \sum_{x=0}^{\infty} x \Pr[T = x \mid T \ge n]
= \sum_{x=0}^{\infty} x\, \frac{\Pr[T = x \wedge T \ge n]}{\Pr[T \ge n]}
= \frac{1}{\Pr[T \ge n]} \sum_{x=n}^{\infty} x \Pr[T = x],
\]

which we can rearrange to get

\[ \sum_{x=n}^{\infty} x \Pr[T = x] = (2n + 1) \Pr[T \ge n]. \tag{F.1.2} \]
Subtracting (F.1.2) at n + 1 from (F.1.2) at n gives

\[
\begin{aligned}
n \Pr[T = n] &= (2n + 1) \Pr[T \ge n] - (2n + 3)\Pr[T \ge n+1] \\
&= (2n + 1) \Pr[T = n] - 2 \Pr[T \ge n + 1] \\
&= (2n + 3) \Pr[T = n] - 2 \Pr[T = n] - 2 \Pr[T \ge n + 1] \\
&= (2n + 3) \Pr[T = n] - 2 \Pr[T \ge n].
\end{aligned}
\]
A little bit of algebra turns this into

\[ \Pr[T = n \mid T \ge n] = \frac{\Pr[T = n]}{\Pr[T \ge n]} = \frac{2}{n+3}. \]
that choose the left machine and the number that choose the right. By
symmetry, the expectation of S is zero. What is the variance of S as a
function of p and n?
Solution
To compute the variance, we'll use (5.1.5), which says that Var[∑_i X_i] = ∑_i Var[X_i] + 2 ∑_{i<j} Cov[X_i, X_j].
Recall that Cov[X_i, X_j] = E[X_i X_j] − E[X_i] E[X_j]. Since the last term is 0 (symmetry again), we just need to figure out E[X_i X_j] for all i ≤ j (the i = j case gets us Var[X_i]).
First, let's compute Pr[X_j = 1 | X_i = 1]. It's easiest to do this starting with the j = i case: Pr[X_i = 1 | X_i = 1] = 1. For larger j, compute
It follows that
We next have
as calculated in (F.2.1).
So now we just need to evaluate the horrible sum.

\[
\begin{aligned}
\sum_i \operatorname{Var}[X_i] + 2\sum_{i<j} \operatorname{Cov}[X_i, X_j]
&= n + 2\sum_{i=1}^{n}\sum_{j=i+1}^{n} (2p-1)^{j-i} \\
&= n + 2\sum_{i=1}^{n}\sum_{k=1}^{n-i} (2p-1)^{k} \\
&= n + 2\sum_{i=1}^{n} \frac{(2p-1) - (2p-1)^{n-i+1}}{1 - (2p-1)} \\
&= n + \frac{n(2p-1)}{1-p} - \frac{1}{1-p}\sum_{m=1}^{n} (2p-1)^{m} \\
&= n + \frac{n(2p-1)}{1-p} - \frac{(2p-1) - (2p-1)^{n+1}}{2(1-p)^2}.
\end{aligned}
\tag{F.2.2}
\]
This covers all but the p = 1 case, for which the geometric series formula fails. Here we can compute directly that Var[S] = n^2, since S will be ±n with equal probability.
For smaller values of p, plotting (F.2.2) shows the variance increasing smoothly starting at 0 (for even n) or 1 (for odd n) at p = 0 to n^2 in the limit as p goes to 1, with an interesting intermediate case of n at p = 1/2, where all terms but the first vanish. This makes a certain intuitive sense: when p = 0, the processes alternate which machine they take, which gives an even split for even n and a discrepancy of ±1 for odd n; when p = 1/2, the processes choose machines independently, giving variance n; and for p = 1, the processes all choose the same machine, giving n^2.
2. Let c > 0. Show that the absolute value of the difference between the
actual number of scan operations and the expected number is at most O(√(cn log n)) with probability at least 1 − n^{−c}.
Solution
1. Number the balls 1 to n. For ball i, there are i − 1 bins already occupied, giving a probability of ((i − 1)/n)((i − 2)/(n − 1)) that we choose an occupied bin on both attempts and incur a scan. Summing over all i gives us that the expected number of scans is

\[
\sum_{i=1}^{n} \frac{i-1}{n}\cdot\frac{i-2}{n-1}
= \frac{1}{n(n-1)} \sum_{i=1}^{n-1} (i^2 - i)
= \frac{1}{n(n-1)}\left(\frac{(n-1)n(2n-1)}{6} - \frac{(n-1)n}{2}\right)
= \frac{2n-1}{6} - \frac{1}{2}
= \frac{n-2}{3},
\]

provided n ≥ 2. For n < 2, we incur no scans. (A simulation sketch appears after part 2.)
2. It's tempting to go after this using Chernoff's inequality, but in this case Hoeffding gives a better bound. Let S be the number of scans. Then S is the sum of n independent Bernoulli random variables, so (5.3.4) says that Pr[|S − E[S]| ≥ t] ≤ 2e^{−2t^2/n}. Now let t = √(cn ln n) = O(√(cn log n)) to make the right-hand side 2n^{−2c} ≤ n^{−c} for sufficiently large n.
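A simulation sketch of part 1, under one reading of the unspecified placement details (each ball probes two distinct uniform bins; where the ball lands after a scan does not matter, since only the number of occupied bins affects the count):

    import random

    def scans(n):
        occupied, count = set(), 0
        for _ in range(n):
            a, b = random.sample(range(n), 2)
            if a in occupied and b in occupied:
                count += 1
            free = [x for x in (a, b) if x not in occupied]
            occupied.add(free[0] if free else
                         random.choice([x for x in range(n)
                                        if x not in occupied]))
        return count

    n, trials = 100, 5000
    print(sum(scans(n) for _ in range(trials)) / trials,
          "vs (n-2)/3 =", (n - 2) / 3)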
Solution
Let S be the total vote. The intuition here is that if there is no conspiracy,
S is concentrated around 0, and if there is a conspiracy, S is concentrated
around ±m. So if m is sufficiently large and |S| ≥ m/2, we can reasonably
guess that there is a conspiracy.
We need to prove two bounds: first, that the probability that we see
|S| ≥ m/2 when there is no conspiracy is small, and second, that the
probability that we see |S| < m/2 when there is a conspiracy is large.
For the first case, let X_i be the total vote cast by the i-th group. This will be ±n_i with equal probability, where n_i ≤ k is the size of the group. This gives E[X_i] = 0. We also have that ∑ n_i = n.
Because the X_i are all bounded, we can use Hoeffding's inequality (5.3.2), so long as we can compute an upper bound on ∑ n_i^2. Here we use the fact that ∑ n_i^2 is maximized subject to 0 ≤ n_i ≤ k and ∑ n_i = n by setting as many of the n_i as possible equal to k, which gives ∑ n_i^2 ≤ nk.
We want to set m so that the right-hand side is less than n^{−c}. Taking logs as usual gives

\[ \ln 2 - m^2/8nk \le -c \ln n. \]

For the second case, repeat the above analysis on the n − m votes other than the ±m from the conspiracy. Again we get that if m = Ω(√(ckn log n)), the probability that these votes exceed m/2 is bounded by n^{−c}. So in both cases m = Ω(√(ckn log n)) is enough.
To ensure that Bi+2 makes sense even for i = n, assume that there exist
random variables Xn+1 and Xn+2 and the corresponding Bn+1 and Bn+2 .
Solution
This is a job for McDiarmid’s inequality (5.3.13). Observe that S is a function
of X1 . . . Xn+2 . We need to show that changing any one of the Xi won’t
change this function by too much.
From the description of the Y_i, we have that X_i can affect any of Y_{i−2} (if X_{i−2} = X), Y_{i−1} (if X_{i−1} = /), and Y_i. We can get a crude bound by observing that each Y_i ranges from 0 to 30, so changing X_i can change ∑ Y_i by at most ±90, giving c_i ≤ 90. A better bound can be obtained by observing that X_i contributes only B_i to each of Y_{i−2} and Y_{i−1}, so changing X_i can only change these values by up to 10; this gives c_i ≤ 50. An even more pedantic bound can be obtained by observing that X_1, X_2, X_{n+1}, and X_{n+2} are all special cases, with c_1 = 30, c_2 = 40, c_{n+1} = 20, and c_{n+2} = 10, respectively; these values can be obtained by detailed meditation on the rules above.
We thus have ∑_{i=1}^{n+2} c_i^2 = (n − 2) · 50^2 + 30^2 + 40^2 + 20^2 + 10^2 = 2500(n − 2) + 3000.
Solution
An easy way to do this is to produce a tree that consists of a single path,
which we can do by arranging that the remaining keys have priorities that
are ordered the same as their key values.
Here's a simple strategy that works. Divide the keys into √n ranges of √n keys each (1 . . . √n, √n + 1 . . . 2√n, etc.).³ Rank the priorities from 1 to n. From each range of keys (i − 1)√n . . . i√n, choose a key to keep whose priority is also ranked in the range (i − 1)√n . . . i√n (if there is one), or choose no key (if there isn't). Delete all the other keys.
For a particular range, we are drawing √n samples without replacement from the n priorities, and there are √n possible choices that cause us to keep a key in that range. The probability that every draw misses is ∏_{i=1}^{√n} (1 − √n/(n − i + 1)) ≤ (1 − 1/√n)^{√n} ≤ e^{−1}. So each range contributes at least 1 − e^{−1} keys on average. Summing over all √n ranges gives a sequence of keys with increasing priorities with expected length at least (1 − e^{−1})√n = Ω(√n).
An alternative solution is to apply the Erdős-Szekeres theorem [ES35], which says that every sequence of length k^2 + 1 has either an increasing subsequence of length k + 1 or a decreasing subsequence of length k + 1. Consider the sequence of priorities corresponding to the keys 1 . . . n; letting k = ⌊√(n − 1)⌋ gives a subsequence of length at least √(n − 1) that is either increasing or decreasing. If we delete all other elements of the treap, the elements corresponding to this subsequence will form a path, giving the desired bound. Note that this does not require any probabilistic reasoning at all.
Though not required for the problem, it's possible to show that Θ(√n) is the best possible bound here. The idea is that the number of possible sequences of keys that correspond to a path of length k in a binary search tree is exactly \binom{n}{k} 2^{k−1}; the \binom{n}{k} corresponds to choosing the keys in the path, and the 2^{k−1} is because for each node except the last, it must contain either the smallest or the largest of the remaining keys because of the binary search tree property.
Since each such sequence will be a treap path only if the priorities are decreasing (with probability 1/k!), the union bound says that the probability

³To make our life easier, we'll assume that n is a square. This doesn't affect the asymptotic result.
*
/ \
a / \ b
/ \
* *
/ \ \
a / \ b \ b
/ \ \
aa ab bb
Figure F.1: A radix tree, storing the strings aa, ab, and bb.
of having any length-k paths is at most \binom{n}{k} 2^{k−1}/k!. But

\[
\binom{n}{k} 2^{k-1}/k! \le \frac{(2n)^k}{2(k!)^2}
\le \frac{(2n)^k}{2(k/e)^{2k}}
= \frac{1}{2}\left(\frac{2e^2 n}{k^2}\right)^{k}.
\]

This is exponentially small for k ≫ √(2e^2 n), showing that with high probability all possible paths have length O(√n).
Solution
We need to create a new node for each prefix of the new string that is not
already represented in the tree.
For a prefix of length ℓ, the chance that none of the n strings have this prefix is exactly (1 − m^{−ℓ})^n. Summing over all ℓ gives that the expected number of new nodes is ∑_{ℓ=0}^{k} (1 − m^{−ℓ})^n.
⁴If you actually need to do this, there exist better data structures for this problem. See [KNW10].
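A quick numeric sketch of this sum, with illustrative parameters (k is the length of the new string, m the alphabet size, n the number of strings already stored):

    def expected_new_nodes(k, m, n):
        return sum((1 - m**-l)**n for l in range(k + 1))

    print(expected_new_nodes(10, 2, 100))  # mostly shared prefixes: few new nodes
    print(expected_new_nodes(10, 2, 1))    # one prior string: about k - 1 new nodes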
Solution
We'll apply the usual error budget approach and show that the probability that \widehat{n−d} is too big and the probability that \widehat{n−d} is too small are both small. For the moment, we will leave c and ε as variables, and find values that work at the end.
Let's start with the too-big side. To get A[k] = 1, we need h(x_i) = k for some x_i that is inserted but not subsequently deleted. There are n − d such x_i, and each gives h(x_i) = k with probability 2^{−k}. So Pr[A[k] = 1] ≤ (n − d) 2^{−k}.
This gives

\[
\Pr\left[\widehat{n-d} \ge (n-d)c\right] = \Pr\left[\hat{k} \ge \lceil \lg((n-d)c)\rceil\right]
\le \sum_{k=\lceil \lg((n-d)c)\rceil}^{\infty} (n-d)2^{-k}
= 2(n-d)\, 2^{-\lceil \lg((n-d)c)\rceil}
\le \frac{2}{c}.
\]
On the too-small side, fix k = ⌈lg((n − d)/c)⌉. Since A[k] = 1 gives k̂ ≥ k ≥ ⌈lg((n − d)/c)⌉, we have Pr[\widehat{n−d} < (n − d)/c] = Pr[k̂ < lg((n − d)/c)] ≤ Pr[A[k] = 0]. (We might be able to get a better bound by looking at larger indices, but to solve the problem this one k will turn out to be enough.)
Let x_1 . . . x_{n−d} be the values that are inserted and not later deleted, and x_{n−d+1} . . . x_n the values that are inserted and then deleted. For A[k] to be zero, either (a) no x_i for i in 1 . . . n − d has h(x_i) = k; or (b) some x_i for i in n − d + 1 . . . n has h(x_i) = k. The probability of the first event is
(1 − 2^{−k})^{n−d}; the probability of the second is 1 − (1 − 2^{−k})^{d}. So we have

\[
\begin{aligned}
\Pr[A[k] = 0] &\le \left(1 - 2^{-k}\right)^{n-d} + 1 - \left(1 - 2^{-k}\right)^{d} \\
&\le \exp\left(-2^{-k}(n-d)\right) + 1 - \exp\left(-\left(2^{-k} + 2^{-2k}\right)d\right) \\
&\le \exp\left(-2^{-\lceil\lg((n-d)/c)\rceil}(n-d)\right) + 1 - \exp\left(-2\cdot 2^{-\lceil\lg((n-d)/c)\rceil}\, d\right) \\
&\le \exp\left(-2^{-\lg((n-d)/c)}(n-d)\right) + 1 - \exp\left(-2\cdot 2^{-\lg((n-d)/c)+1}\, d\right) \\
&= e^{-c} + 1 - \exp\left(-\frac{4cd}{n-d}\right) \\
&\le e^{-c} + \frac{4cd}{n-d} \\
&\le e^{-c} + \frac{4c\epsilon}{1-\epsilon}.
\end{aligned}
\]

So our total probability of error is bounded by 2/c + e^{−c} + 4cε/(1 − ε). Let c = 8 and ε = 1/128 to make this less than 1/4 + e^{−8} + (128/127) · (1/16) ≈ 0.313328 < 1/3, giving the desired bound.
procedure insert(x)
    for i ← 0 to ∞ do
        for j ← 1 to k do
            if T_i[h_{ij}(x)] = ⊥ then
                T_i[h_{ij}(x)] ← x
                return
Solution
The idea is that we use T_{i+1} only if we get a collision in T_i. Let X_i be the indicator for the event that there is a collision in T_i. Then

\[ E[\text{steps}] \le 1 + \sum_{i=0}^{\infty} E[X_i] \tag{F.4.1} \]

and

\[ E[\text{space}] \le m_0 + \sum_{i=0}^{\infty} E[X_i]\, m_{i+1}. \tag{F.4.2} \]
Let ℓ be the largest value such that m_ℓ ≤ n^{2+ε}. We will show that, for an appropriate choice of k, we are sufficiently unlikely to get a collision in round ℓ that the right-hand sides of (F.4.1) and (F.4.2) end up being not much more than the corresponding sums up to ℓ − 1.
From our choice of ℓ, it follows that (a) ℓ ≤ lg lg n^{2+ε} = lg lg n + lg(2 + ε) = O(log log n); and (b) m_{ℓ+1} > n^{2+ε}, giving m_ℓ = √(m_{ℓ+1}) > n^{1+ε/2}. From this we get E[X_ℓ] ≤ n(n/m_ℓ)^k < n^{1−kε/2}.
For E[steps], compute the same sum without all the m_{i+1} factors. This makes the tail terms even smaller, so they are still bounded by a constant, and the head becomes just ∑_{i=0}^{ℓ} 1 = O(log log n).
2. Use this to compute the exact probability that h_i(x) ≠ h_i(y) when m = 0, m = n/2, and m = n.
Hint: You may find it helpful to use the identity (a mod 2) = (1/2)(1 − (−1)^a).
Solution
1. Observe that h_i(x) ≠ h_i(y) if and only if i chooses an odd number of indices where x and y differ. Let p = m/n be the probability that each index in i hits a position where x and y differ, and let q = 1 − p. Then the probability that we get an odd number of differences is

\[
\begin{aligned}
\sum_{j=0}^{k} (j \bmod 2)\binom{k}{j} p^j q^{k-j}
&= \sum_{j=0}^{k} \frac{1}{2}\left(1 - (-1)^j\right)\binom{k}{j} p^j q^{k-j} \\
&= \frac{1}{2}\sum_{j=0}^{k}\binom{k}{j} p^j q^{k-j} - \frac{1}{2}\sum_{j=0}^{k}\binom{k}{j} (-p)^j q^{k-j} \\
&= \frac{1}{2}(p+q)^k - \frac{1}{2}(-p+q)^k \\
&= \frac{1 - (1 - 2(m/n))^k}{2}.
\end{aligned}
\]
2. • For m = 0, this is (1 − 1^k)/2 = 0.
• For m = n/2, it's (1 − 0^k)/2 = 1/2 (assuming k > 0).
• For m = n, it's (1 − (−1)^k)/2 = (k mod 2).
In fact, the chances of not colliding as a function of m are symmetric
around m = n/2 if k is even and increasing if k is odd. So we can only hope to use this as a locality-sensitive hash function in the odd case.
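A sketch verifying the closed form against the direct binomial sum:

    from math import comb

    def odd_prob_direct(k, m, n):
        p = m / n
        return sum(comb(k, j) * p**j * (1 - p)**(k - j)
                   for j in range(1, k + 1, 2))      # odd j only

    def odd_prob_closed(k, m, n):
        return (1 - (1 - 2 * m / n) ** k) / 2

    for k in (1, 2, 5):
        for m in (0, 25, 50, 100):
            assert abs(odd_prob_direct(k, m, 100)
                       - odd_prob_closed(k, m, 100)) < 1e-12
    print("closed form matches the direct sum")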
Solution
The trick is to observe that X_t^2 + Y_t^2 + Z_t^2 − t is a martingale, essentially following the same analysis as for X_t^2 − t for a one-dimensional random walk.
Suppose we pick X_t to change. Then

\[ E\left[X_{t+1}^2 \mid X_t\right] = \frac{1}{2}\left((X_t+1)^2 + (X_t-1)^2\right) = X_t^2 + 1. \]

So

\[ E\left[X_{t+1}^2 + Y_{t+1}^2 + Z_{t+1}^2 - (t+1) \mid X_t, Y_t, Z_t,\ X \text{ changes}\right] = X_t^2 + Y_t^2 + Z_t^2 - t. \]
immediately gives E[τ] ≥ k^2.
To get an upper bound, observe that X_{τ−1}^2 + Y_{τ−1}^2 + Z_{τ−1}^2 < k^2, and that exactly one of these three terms increases between τ − 1 and τ. Suppose it's X (the other cases are symmetric). Increasing X by 1 sets X_τ^2 = X_{τ−1}^2 + 2X_{τ−1} + 1. So we get

\[ X_\tau^2 + Y_\tau^2 + Z_\tau^2 = X_{\tau-1}^2 + Y_{\tau-1}^2 + Z_{\tau-1}^2 + 2X_{\tau-1} + 1 < k^2 + 2k + 1. \]
So we have k^2 ≤ E[τ] < k^2 + 2k + 1, giving E[τ] = Θ(k^2). (Or k^2 + O(k) if we are feeling really precise.)
Solution
As usual, let Xt be a copy of the chain starting in an arbitrary initial state
and Yt be a copy starting in the stationary distribution.
From the Metropolis-Hastings algorithm, the probability that the walk moves to a particular child is 1/(3α), so the probability that the depth increases after one step is at most 2/(3α). The probability that the walk moves to the parent (if we are not already at the root) is 1/3.
We’ll use the same choice (left, right, or parent) in both the X and
Y processes, but it may be that only one of the particles moves (because
the target node doesn’t exist). To show convergence, we’ll track Zt =
max(depth(Xt ), depth(Yt )). When Zt = 0, both Xt and Yt are the root
node.
There are two ways that Zt can change:
1. Both processes choose “parent”; if Zt is not already 0, Zt+1 = Zt − 1.
This case occurs with probability 1/3.
Solution
We’ll use rejection sampling. The idea is to choose a node in the infinite
binary tree with probability proportional to α^{−d}, and then repeat the process if we picked a node that doesn't actually exist. Conditioned on finding a node i that exists, its probability will be α^{−depth(i)} / ∑_j α^{−depth(j)}.
If we think of a node in the infinite tree as indexed by a binary string of length equal to its depth, we can generate it by first choosing the length X and then choosing the bits in the string. We want Pr[X = n] to be proportional to 2^n α^{−n} = (2/α)^n. Summing the geometric series gives

\[ \Pr[X = n] = \frac{(2/\alpha)^n}{\sum_{j=0}^{\infty}(2/\alpha)^j} = (2/\alpha)^n\left(1 - \frac{2}{\alpha}\right). \]
Solution
This can be solved using a fairly straightforward application of Karp-
Luby [KL85] (see §11.4). Recall that for Karp-Luby we need to be able to
express our target set U as the union of a polynomial number of covering sets
U_j, where we can both compute the size of each U_j and sample uniformly from it. We can then estimate |U| = ∑_{j, x∈U_j} f(j, x) = ∑_j |U_j| · Pr[f(j, x) = 1], where f(j, x) is the indicator for the event that x ∉ U_{j′} for any j′ < j, and, in the probability, the pair (j, x) is chosen uniformly at random from {(j, x) | x ∈ U_j}.
In this case, let Uj be the set of all permutations that are increasing on
Sj . We can specify each such permutation by specifying the choice of which
kj = |Sj | elements are in positions i1 . . . ikj (the order of these elements
is determined by the requirement that the permutation be increasing on
$S_j$) and specifying the order of the remaining $n - k_j$ elements. This gives
$\binom{n}{k_j}(n-k_j)! = (n)_{n-k_j}$ such permutations. Begin by computing these counts
for all $S_j$, as well as their sum.
We now wish to sample uniformly from pairs $(j, \pi)$ where each $\pi$ is an
element of $U_j$. First sample each $j$ with probability $|U_j| / \sum_\ell |U_\ell|$, using the
counts computed above.
    bacab        bacab
    ccaac        ccaac
    bbbac        babac
    bbaaa        bbaaa
    acbab        acbab

Figure F.2: Non-futile (left) and futile (right) word search grids for the
lexicon {aabc, ccca}
word search puzzles with no solution for a given lexicon. That is, given a set
of words S over some alphabet and a grid size n, the output should be an
n × n grid of letters such that no word in S appears as a contiguous sequence
of letters in one of the eight directions anywhere in the grid. We will refer to
such puzzles as futile word search puzzles.
Solution
1. We’ll apply the symmetric version of the Lovász local lemma. Suppose
the grid is filled in independently and uniformly at random with
characters from Σ. Given a position ij in the grid, let Aij be the
event that there exists a word in S whose first character is at position
$ij$; observe that $\Pr[A_{ij}] \le 8p_S$ by the union bound (this may be an
overestimate, both because we might run off the grid in some directions
and because the choice of initial character is not independent). Observe
also that $A_{ij}$ is independent of any event $A_{i'j'}$ where $|i - i'| \ge 2k-1$ or
$|j - j'| \ge 2k-1$, because no two words starting at these positions can
overlap. So we can build a dependency graph with $p \le 8p_S$ and $d \le (4k-3)^2$.
The Lovász local lemma shows that there exists an assignment
where no $A_{ij}$ occurs provided $ep(d+1) < 1$, or $8ep_S((4k-3)^2 + 1) < 1$.
This easily holds if $p_S < \frac{1}{8e(4k-2)^2} = \frac{1}{128e}k^{-2}$.
2. For this part, we can just use Moser-Tardos [MT10], particularly the
symmetric version described in Corollary 13.3.4. We have a collection
of $m = O(n^2)$ bad events, with $d = \Theta(k^2)$, so the expected number of
resamplings is bounded by $m/d = O(n^2/k^2)$. Each resampling requires
checking every position in the new grid for an occurrence of some string
in $S$; this takes $O(n^2 k \cdot |S|)$ time per resampling even if we are not
very clever about it. So the total expected cost is $O(n^4 \cdot |S|/k)$.
With some more intelligence, this can be improved. We don't need to
recheck any position at distance greater than $k$ from any of the at most
$k$ letters we resample, and if we are sensible, we can store $S$ using a
radix tree or some similar data structure that allows us to look up all
words that appear as a prefix of a given length-$k$ string in time $O(k)$.
This reduces the cost of each resampling to $O(k^3)$, with an additive
cost of $O(k \cdot |S|)$ to initialize the data structure. So the total expected
cost is now $O(n^2 k + |S|)$.
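The resampling loop itself is simple; the following Python sketch (ours,
with a hypothetical helper find_occurrence that returns the cells of some
word of $S$ appearing in the grid in one of the eight directions, or None)
shows the Moser-Tardos structure without the radix-tree optimization.

    import random

    def futile_grid(n, alphabet, S, find_occurrence):
        # Fill the grid uniformly at random, then repeatedly resample the
        # cells of any occurrence of a word of S until none remains.
        grid = [[random.choice(alphabet) for _ in range(n)] for _ in range(n)]
        while True:
            cells = find_occurrence(grid, S)
            if cells is None:
                return grid                 # the grid is now futile
            for i, j in cells:              # resample the bad event's cells
                grid[i][j] = random.choice(alphabet)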
Solution
Assign each player randomly to a faction. Let Xij be the number of players
in $S_i$ that are assigned to faction $j$. Then $\operatorname{E}[X_{ij}] = |S_i|/3$. Applying the
Chernoff bound gives
\begin{align*}
\Pr\left[X_{ij} \ge |S_i|/2\right] &= \Pr\left[X_{ij} \ge \left(1 + \frac{1}{2}\right)\operatorname{E}[X_{ij}]\right] \\
&\le \exp\left(-\frac{(1/2)^2 (|S_i|/3)}{3}\right) \\
&= e^{-|S_i|/36}.
\end{align*}
Solution
Suppose we have state $S_t$ at time $t$. We will use a random walk where
we choose a vertex uniformly at random to add to or remove from St , and
carry out the action only if the resulting set is still a dominating set.
In more detail: For each vertex $v$, with probability $1/n$, $S_{t+1} = S_t \cup \{v\}$
if $v \notin S_t$; $S_{t+1} = S_t \setminus \{v\}$ if $v \in S_t$ and $S_t \setminus \{v\}$ is a dominating set; and
$S_{t+1} = S_t$ otherwise. To implement this transition rule, we need to be able
to choose a vertex v uniformly at random (easy) and test in the case where
v ∈ St if St \ {v} is a dominating set (also polynomial: for each vertex, check
if it or one of its neighbors is in St \ {v}, which takes time O(|V | + |E|)).
Note that we do not need to check if St ∪ {v} is a dominating set.
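In Python, one step of this chain might look like the following sketch (our
own rendering; the graph is assumed to be an adjacency dictionary mapping
each vertex to its neighbors, and s is the current dominating set):

    import random

    def is_dominating(graph, s):
        # Every vertex is in s or has a neighbor in s.
        return all(v in s or any(u in s for u in graph[v]) for v in graph)

    def step(graph, s):
        v = random.choice(list(graph))    # uniform random vertex
        if v not in s:
            return s | {v}                # adding a vertex is always safe
        t = s - {v}
        return t if is_dominating(graph, t) else s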
For any pair of adjacent states S and S 0 = S \ {v} the probability of
moving from S to S 0 and the probability of moving from S 0 to S are both 1/n.
So the Markov chain is reversible with a uniform stationary distribution.
This is an aperiodic chain, because there exist minimal dominating sets
for which there is a nonzero chance that St+1 = St .
It is irreducible, because for any dominating set S, there is a path to
the complete set of vertices V by adding each vertex in V \ S one at a time.
Conversely, removing these vertices from V gives a path to S. This gives a
path $S \rightsquigarrow V \rightsquigarrow T$ between any two dominating sets $S$ and $T$.
Solution
Though it is possible, and tempting, to go after this using Karp-Luby (see
§11.4), naive sampling is enough.
If a graph has at least one triangle (which can be checked in O(n3 )
time just by enumerating all possible triangles), then the probability that
that particular triangle is tricolored when colors are chosen uniformly and
independently at random is $6/27 = 2/9$. This gives a constant hit rate $\rho$, so
by Lemma 11.2.1, we can get relative error $\epsilon$ with probability $1-\delta$ using
$O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ samples. Each sample costs $O(n^3)$ time to evaluate (again, brute-force
checking of all possible triangles), for a total cost of $O\left(n^3 \epsilon^{-2} \log\frac{1}{\delta}\right)$.
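A direct Python sketch of this naive sampler (our own; we assume the goal
is to estimate the fraction of uniform 3-colorings under which some triangle
of the graph gets all three colors, with vertices given as a sequence and
edges as pairs):

    import random
    from itertools import combinations

    def estimate_tricolored(vertices, edges, samples):
        # Each sample draws a uniform 3-coloring and checks all triangles
        # by brute force, as in the O(n^3) cost analysis above.
        edgeset = {frozenset(e) for e in edges}
        hits = 0
        for _ in range(samples):
            c = {v: random.randrange(3) for v in vertices}
            hits += any(
                frozenset((u, v)) in edgeset
                and frozenset((v, w)) in edgeset
                and frozenset((u, w)) in edgeset
                and {c[u], c[v], c[w]} == {0, 1, 2}
                for u, v, w in combinations(vertices, 3))
        return hits / samples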
2. Mark any square whose label is larger than any other label in its row
and column.
Solution
Each square is marked if it is the largest of the $2n-1$ total squares in its
row and column. By symmetry, each of these $2n-1$ squares is equally likely
to be the largest, so the probability that a particular square is marked is
exactly $\frac{1}{2n-1}$. By linearity of expectation, the total expected number of
marked squares is then $\frac{n^2}{2n-1}$.
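A quick empirical check of the $\frac{n^2}{2n-1}$ formula (our own sketch, assuming
$n \ge 2$ and distinct labels):

    import random

    def marked_squares(n):
        # Label an n x n grid with a random permutation of 0..n^2-1 and
        # count squares larger than everything else in their row and column.
        labels = random.sample(range(n * n), n * n)
        g = [labels[i * n:(i + 1) * n] for i in range(n)]
        return sum(
            g[i][j] > max(g[i][k] for k in range(n) if k != j)
            and g[i][j] > max(g[k][j] for k in range(n) if k != i)
            for i in range(n) for j in range(n))

    # Averaging marked_squares(10) over many trials should approach
    # 100/19 ~ 5.26.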
Solution
The basic idea is to just go through positions 0, 1, 2, . . . until we encounter
the target, but we have to be a little careful about parity to make sure we
don’t pass it by accident.8
Let $X_i = \pm 1$ be the increment of the target's $i$-th move, and let
$S_i = \sum_{j=1}^{i} X_j$, so that its position after $i$ steps is $(n + S_i) \bmod 2n$.
Let Yi be the position of the pursuer after i steps.
First move: stay at 0 if n is odd, move to 1 if n is even. The purpose of
this is to establish the invariant that n + Si − Yi is even starting at i = 1.
For subsequent moves, let Yi+1 = Yi + 1. Observe that this maintains the
invariant.
We assume that n is at least 2. This is necessary to ensure that at time
1, Y1 ≤ n + S1 .
Claim: if at time 2n, Y2n ≥ n+S2n , then at some time i ≤ 2n, Yi = n+Si .
Proof: Let i be the first time at which Yi ≥ n + Si ; under the assumption
that n ≥ 2, i ≥ 1. So from the invariant, we can’t have Yi = n + Si + 1,
and if Yi ≥ n + Si + 2, we have Yi−1 ≥ Yi − 1 ≥ n + Si + 1 ≥ n + Si−1 ,
contradicting our assumption that i is minimal. The remaining alternative
is that Yi = n + Si , giving a collision at time i.
We now argue that $Y_{2n} \ge 2n-1$ is very likely to be at least $n + S_{2n}$. Since
$S_{2n}$ is a sum of $2n$ independent $\pm 1$ variables, from Hoeffding's inequality we
have $\Pr[Y_{2n} < n + S_{2n}] \le \Pr[S_{2n} \ge n] \le e^{-n^2/4n} = e^{-n/4}$. For sufficiently
large $n$, this is much smaller than $n^{-c}$ for any fixed $c$.
8 This is not the only possible algorithm, but there are a lot of plausible-looking
algorithms that turn out not to work. One particularly tempting approach is to run to
position n using the first n steps and then spend the next n steps trying to hit the target in
the immediate neighborhood of n, either by staying put (a sensible strategy when lost in the
woods in real life, assuming somebody is looking for you), or moving in a random walk of
some sort starting at n. This doesn't work if we want a high-probability bound. To see this,
observe that the target has a small but nonzero constant probability in the limit of being
at some position greater than or equal to $n + 4\sqrt{n}$ after exactly $n/2$ steps. Conditioned on
starting at $n + 4\sqrt{n}$ or above, its chances of moving below $n + 4\sqrt{n} - 2\sqrt{n} = n + 2\sqrt{n}$
at any time in the next $3n/2$ steps is bounded by $e^{-4n/(2(3n/2))} = e^{-4/3}$ (Azuma), and a
similar bound holds independently for our chances of getting up to $n + 2\sqrt{n}$ or above.
Multiplying out all these constants gives a constant probability of failure. A similar but
bigger disaster occurs if we don't rush to n first.
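For concreteness, here is a Python simulation of the pursuit strategy above
(our own sketch; we assume target and pursuer move in the same round, with
a collision whenever their positions coincide after a round):

    import random

    def pursue(n):
        # Target starts at n on the cycle Z_2n and moves +-1 each round;
        # the pursuer fixes parity on round 1, then walks forward.
        target, pursuer = n, 0
        for i in range(1, 2 * n + 1):
            target = (target + random.choice((-1, 1))) % (2 * n)
            if i == 1:
                pursuer = 0 if n % 2 else 1   # make n + S_i - Y_i even
            else:
                pursuer += 1
            if pursuer % (2 * n) == target:
                return i                      # caught at round i
        return None                           # failure: probability <= e^{-n/4}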
Appendix G

Sample assignments from Spring 2011

1. Your name.
(You will not be graded on the bureaucratic part, but you should do it
anyway.)
1. Show that both rejection sampling and range coding produce a uniform
value 0 ≤ s < n using an expected O(log n) random bits.
Solution
1. For rejection sampling, each sample requires $\lceil \lg n \rceil$ bits and is accepted
with probability $n/2^{\lceil \lg n \rceil} \ge 1/2$. So rejection sampling returns a value
after at most 2 samples on average, using an expected
$2\lceil \lg n \rceil < 2(\lg n + 1)$ bits for the worst $n$.
For range coding, we keep going as long as one of the $n-1$ nonzero end-
points $s/n$ lies inside the current interval. After $k$ bits, the probability
that one of the $2^k$ intervals contains an endpoint is at most $(n-1)2^{-k}$;
in particular, it drops below 1 as soon as $k = \lceil \lg n \rceil$ and continues to
drop by $1/2$ for each additional bit, requiring 2 more bits on average.
So the expected cost of range coding is at most $\lceil \lg n \rceil + 2 < \lg n + 3$
bits.
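The rejection-sampling half of this is a few lines of Python (our sketch;
random.getrandbits stands in for the stream of random bits):

    import random

    def uniform_below(n):
        # Draw ceil(lg n) bits per attempt; accept values below n.
        k = max(1, (n - 1).bit_length())
        while True:
            s = random.getrandbits(k)
            if s < n:
                return s   # accepted with probability n / 2**k >= 1/2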
Solution
In principle, it is possible to compute this value exactly, but we are lazy.
For a lower bound, observe that after $m$ rolls, each of the $\binom{m}{2}$ pairs of rolls
has probability $1/n$ of being equal, for an expected total of $\binom{m}{2}/n$ duplicates.
For $m = \sqrt{n}/2$, this is less than $1/8$, which shows that the expected number
of rolls is $\Omega(\sqrt{n})$.
For the upper bound, suppose we have already rolled the die $\sqrt{n}$ times.
If we haven't gotten a duplicate already, each new roll has probability at
least $\sqrt{n}/n = 1/\sqrt{n}$ of matching a previous roll. So after an additional $\sqrt{n}$
rolls on average, we get a repeat. This shows that the expected number of
rolls is $O(\sqrt{n})$.
Combining these bounds shows that we need $\Theta(\sqrt{n})$ rolls on average.
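As a check on the $\Theta(\sqrt{n})$ bound, a Python sketch (ours):

    import random

    def rolls_until_duplicate(n):
        # Roll an n-sided die until some value repeats; return roll count.
        seen = set()
        while True:
            r = random.randrange(n)
            if r in seen:
                return len(seen) + 1
            seen.add(r)

    # The mean of rolls_until_duplicate(10**6) over many trials is about
    # sqrt(pi * 10**6 / 2) ~ 1253, consistent with Theta(sqrt(n)).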
Solution
Let $T(n)$ be the expected number of rounds remaining given we are starting
with $n$ candies. We can set up a probabilistic recurrence relation $T(n) =
1 + T(n - X_n)$ where $X_n$ is the number of candies chosen by exactly one
child. It is easy to compute $\operatorname{E}[X_n]$, since the probability that any candy gets
chosen exactly once is $n(1/n)(1-1/n)^{n-1} = (1-1/n)^{n-1}$. Summing over
all candies gives $\operatorname{E}[X_n] = n(1-1/n)^{n-1}$.
The term $(1-1/n)^{n-1}$ approaches $e^{-1}$ in the limit, so for any fixed $\epsilon > 0$,
we have $n(1-1/n)^{n-1} \ge n(e^{-1} - \epsilon)$ for sufficiently large $n$. We can get a
quick bound by choosing $\epsilon$ so that $e^{-1} - \epsilon \ge 1/4$ (for example) and then
applying (I.3.1) with $\mu(n) = n/4$ to get $\operatorname{E}[T(n)] \le 4\ln n$.
There is a sneaky trick here, which is that we stop if we get down to 1 candy
instead of 0. This avoids the usual problem with KUW and ln 0, by observing
that we can’t ever get down to exactly one candy: if there were exactly one
candy that gets grabbed twice or not at all, then there must be some other
candy that also gets grabbed twice or not at all.
This analysis is sufficient for an asymptotic estimate: the last candy gets
grabbed in O(log n) rounds on average. For most computer-sciency purposes,
we’d be done here.
We can improve the constant slightly by observing that $(1-1/n)^{n-1}$ is
in fact always greater than or equal to $e^{-1}$. The easiest way to see this is
to plot the function, but if we want to prove it formally we can show that
$(1-1/n)^{n-1}$ is a decreasing function by taking the derivative of its logarithm:
\[
\frac{d}{dn}\ln(1-1/n)^{n-1} = \frac{d}{dn}\left[(n-1)\ln(1-1/n)\right] = \ln(1-1/n) + \frac{n-1}{1-1/n}\cdot\frac{1}{n^2} = \ln(1-1/n) + \frac{1}{n},
\]
and observing that it is negative for $n > 1$, since $\ln(1-x) < -x$ for $0 < x < 1$
(we could also take the derivative of the original function, but it's less
obvious that it's negative). So if it approaches $e^{-1}$ in the limit, it must do
so from above, implying $(1-1/n)^{n-1} \ge e^{-1}$.
This lets us apply (I.3.1) with µ(n) = n/e, giving E [T (n)] ≤ e ln n.
If we skip the KUW bound and use the analysis in §I.4.2 instead, we get
that $\Pr[T(n) \ge \ln n + \ln(1/\epsilon)] \le \epsilon$. This suggests that the actual expected
value should be $(1 + o(1))\ln n$.
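A simulation of the process (our own sketch; we assume the number of
children always equals the number of remaining candies, as in the
$n(1/n)(1-1/n)^{n-1}$ computation above) gives means close to $\ln n$:

    import random
    from collections import Counter

    def candy_rounds(n):
        # Each of n children grabs a uniform candy; candies grabbed exactly
        # once are removed (with their children); count rounds until done.
        rounds = 0
        while n > 0:
            grabs = Counter(random.randrange(n) for _ in range(n))
            n -= sum(1 for c in grabs.values() if c == 1)
            rounds += 1
        return rounds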
2. For your choice of p above, what bound can you get on Pr [|D| − E [|D|] ≥ t]?
Solution
1. First let’s compute E [|D|]. Let Xv be the indicator for the event
that v ∈ D. Then Xv = 1 if either (a) v is in D1 , which occurs
with probability p; or (b) v and all d of its neighbors are not in D1 ,
which occurs with probability (1 − p)d+1 . Adding these two cases gives
E [Xv ] = p + (1 − p)d+1 and thus
\[
\operatorname{E}[|D|] = \sum_v \operatorname{E}[X_v] = n\left(p + (1-p)^{d+1}\right). \tag{G.2.1}
\]
\begin{align*}
\lim_{d\to\infty} \frac{1 - e^{-\ln d/d}}{\ln d/d}
&= \lim_{d\to\infty} \frac{1 - \left(1 - \ln d/d + O(\ln^2 d/d^2)\right)}{\ln d/d} \\
&= \lim_{d\to\infty} \frac{\ln d/d + O(\ln^2 d/d^2)}{\ln d/d} \\
&= \lim_{d\to\infty} \left(1 + O(\ln d/d)\right) \\
&= 1.
\end{align*}
all $\delta \ge 0$,
\[
\Pr[S \ge (1+\delta)\mu] \le \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu.
\]
Solution
Let $S_t = \sum_{i=1}^{t} X_i$, so that $S = S_n$, and let $\mu_t = \sum_{i=1}^{t} p_i$. We'll show by
induction on $t$ that the moment-generating-function bound
$\operatorname{E}\left[e^{\alpha S_t}\right] \le e^{(e^\alpha - 1)\mu_t}$ from the proof of (5.2.1) still holds.
Now apply the rest of the proof of (5.2.1) to get the full result.
Solution
1. We'll compute the probability that any particular position $i = 1 \ldots n$ is
the start of a run of length $k$ or more, then sum over all $i$. For a run of
length $k$ to start at position $i$, either (a) $i = 1$ and $W_i \ldots W_{i+k-1}$ are
all 1, or (b) $i > 1$, $W_{i-1} = 0$, and $W_i \ldots W_{i+k-1}$ are all 1. Assuming
$n \ge k$, case (a) adds $2^{-k}$ to $\operatorname{E}[X_k]$ and case (b) adds $(n-k)2^{-k-1}$, for
a total of $2^{-k} + (n-k)2^{-k-1} = (n-k+2)2^{-k-1}$.
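A quick Python check of this expectation (our own sketch):

    import random

    def count_runs(bits, k):
        # Count positions where a maximal run of >= k ones starts.
        n = len(bits)
        return sum(
            all(bits[i + j] for j in range(k)) and (i == 0 or not bits[i - 1])
            for i in range(n - k + 1))

    # Averaging count_runs([random.random() < 0.5 for _ in range(n)], k)
    # over many trials should approach (n - k + 2) * 2**(-k-1).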
2. We can get an easy bound without too much cleverness using McDi-
armid's inequality (5.3.13). Observe that $X_k$ is a function of the inde-
pendent random variables $W_1 \ldots W_n$ and that changing one of these bits
changes $X_k$ by at most 1 (this can happen in several ways: a previous
run of length $k-1$ can become a run of length $k$ or vice versa, or two runs
of length $k$ or more separated by a single zero may become a single run,
or vice versa). So (5.3.13) gives $\Pr[|X - \operatorname{E}[X]| \ge t] \le 2\exp\left(-\frac{t^2}{2n}\right)$.
We can improve on this a bit by grouping the $W_i$ together into blocks
of length $\ell$. If we are given control over a block of $\ell$ consecutive bits
and want to minimize the number of runs, we can either (a) make
all the bits zero, causing no runs to appear within the block and
preventing adjacent runs from extending to length $k$ using bits from
the block, or (b) make all the bits one, possibly creating a new run but
possibly also causing two existing runs on either side of the block to
merge into one. At worst, changing all the bits to one except
for a zero after every $k$ consecutive ones creates at most $\left\lfloor\frac{\ell+2k-1}{k+1}\right\rfloor$
new runs. Treating each of the $\lceil n/\ell \rceil$ blocks as a single variable then
gives $\Pr[|X - \operatorname{E}[X]| \ge t] \le 2\exp\left(-\frac{t^2}{2\lceil n/\ell\rceil \left(\lfloor(\ell+2k-1)/(k+1)\rfloor\right)^2}\right)$. Staring
at plots of the denominator for a while suggests that it is minimized at
$\ell = k+3$, the largest value with $\lfloor(\ell+2k-1)/(k+1)\rfloor \le 2$. This gives
$\Pr[|X - \operatorname{E}[X]| \ge t] \le 2\exp\left(-\frac{t^2}{8\lceil n/(k+3)\rceil}\right)$, improving the bound on $t$
from $\Theta\left(\sqrt{n\log(1/\epsilon)}\right)$ to $\Theta\left(\sqrt{(n/k)\log(1/\epsilon)}\right)$.
For large $k$, the expectation of any individual $X_k$ becomes small, so we'd
expect that Chernoff bounds would work better on the upper bound
side than the method of bounded differences. Unfortunately, we don't
have independence. But from Problem G.2.2, we know that the usual
Chernoff bound works as long as we can show $\operatorname{E}[X_i \mid X_1, \ldots, X_{i-1}] \le p_i$
for some sequence of fixed bounds $p_i$.
For $X_1$, there are no previous $X_i$, and we have $\operatorname{E}[X_1] = 2^{-k}$ exactly.
For $X_i$ with $i > 1$, fix $X_1, \ldots, X_{i-1}$; that is, condition on the event
$X_j = x_j$ for all $j < i$ with some fixed sequence $x_1, \ldots, x_{i-1}$. Let's call
this event $A$. Depending on the particular values of the $x_j$, it's not
clear how conditioning on $A$ will affect $X_i$; but we can split on the
value of $W_{i-1}$ to show that either it has no effect or $X_i = 0$:
\begin{align*}
\operatorname{E}[X_i \mid A] &= \operatorname{E}[X_i \mid A, W_{i-1} = 0]\Pr[W_{i-1} = 0 \mid A] + \operatorname{E}[X_i \mid A, W_{i-1} = 1]\Pr[W_{i-1} = 1 \mid A] \\
&\le 2^{-k}\Pr[W_{i-1} = 0 \mid A] + 0 \\
&\le 2^{-k}.
\end{align*}
Solution
Let's count the expectation of the number $X_k$ of common subsequences of
length $k$. We have $\binom{n}{k}$ choices of positions in $v$, and $\binom{n}{k}$ choices of positions
in $w$; for each such choice, there is a probability of exactly $n^{-k}$ that the
two subsequences are in fact identical.
Solution
This is a job for the Lovász Local Lemma. And it’s even symmetric, so we
can use the symmetric version (Corollary 13.3.2).
Suppose we assign the non-blank symbols to each string uniformly and
independently at random. For each A ⊆ S with |A| = k, let XA be the string
that has non-blank symbols in all positions in A. For each pair of subsets
A, B with |A| = |B| = k and |A ∩ B| = k − 1, let CA,B be the event that XA
and $X_B$ are identical on all positions in $A \cap B$. Then $\Pr[C_{A,B}] = m^{-k+1}$.
We now need to figure out how many events are in each neighborhood
$\Gamma(C_{A,B})$. Since $C_{A,B}$ depends only on the choices of values for $A$ and $B$, it
is independent of any events $C_{A',B'}$ where neither of $A'$ or $B'$ is equal to $A$
or $B$. So we can make $\Gamma(C_{A,B})$ consist of all events $C_{A,B'}$ and $C_{A',B}$ where
$B' \ne B$ and $A' \ne A$.
For each fixed $A$, there are exactly $(n-k)k$ events $B$ that overlap it in
$k-1$ places, because we can specify $B$ by choosing the elements in $B \setminus A$
and $A \setminus B$. This gives $(n-k)k - 1$ events $C_{A,B'}$ where $B' \ne B$. Applying
the same argument for $A'$ gives a total of $d = 2(n-k)k - 2$ events in
$\Gamma(C_{A,B})$. Corollary 13.3.2 applies if $ep(d+1) \le 1$, which in this case means
$em^{-k+1}\left(2(n-k)k - 1\right) \le 1$.
1. Show that any graph with m edges has a 3-way cut with at least 2m/3
edges.
Solution
1. Assign each vertex independently to S, T , or U with probability 1/3
each. Then the probability that any edge uv is contained in the cut is
exactly 2/3. Summing over all edges gives an expected 2m/3 edges.
• Any payoff resulting from a bet must be a nonzero integer in the range
−Xt to Xt , inclusive, where Xt is your current wealth.
• The expected payoff must be exactly 0. (In other words, your assets
Xt should form a martingale sequence.)
For example, if you have 2 dollars, you may make a bet that pays off −2
with probability 2/5, +1 with probability 2/5 and +2 with probability 1/5;
but you may not make a bet that pays off −3, +3/2, or +4 under any
circumstances, or a bet that pays off −1 with probability 2/3 and +1 with
probability 1/3.
3. What strategy should you use to maximize the number of bets you
make before leaving?
Solution
1. Let Xt be your wealth at time t, and let τ be the stopping time when
you leave. Because {Xt } is a martingale, E [X0 ] = a = E [Xτ ] =
Pr [Xτ ≥ b] E [Xτ | Xτ ≥ b]. So Pr [Xτ ≥ b] is maximized by making
E [Xτ | Xτ ≥ b] as small as possible. It can’t be any smaller than b,
\[
T(a) = a(b-a-1) + \frac{a}{b-1}\left(1 + \frac{b-1}{b}T(b-2)\right). \tag{G.4.1}
\]
When $a = b-2$, we get
\begin{align*}
T(b-2) &= (b-2) + \frac{b-2}{b-1}\left(1 + \frac{b-1}{b}T(b-2)\right) \\
&= (b-2)\left(1 + \frac{1}{b-1}\right) + \frac{b-2}{b}T(b-2),
\end{align*}
which gives
\begin{align*}
T(b-2) &= \frac{(b-2)\frac{2b-1}{b-1}}{2/b} \\
&= \frac{b(b-2)(2b-1)}{2(b-1)}.
\end{align*}
Plugging this back into (G.4.1) gives
\begin{align*}
T(a) &= a(b-a-1) + \frac{a}{b-1}\left(1 + \frac{b-1}{b}\cdot\frac{b(b-2)(2b-1)}{2(b-1)}\right) \\
&= ab - a^2 + a + \frac{a}{b-1} + \frac{a(b-2)(2b-1)}{2(b-1)} \\
&= \frac{3}{2}ab + O(b). \tag{G.4.2}
\end{align*}
This is much better than the a(b − a) value for the straight ±1 strategy,
especially when a is also large.
I don’t know if this particular strategy is in fact optimal, but that’s
what I’d be tempted to bet.
Solution
1. For $n > 0$, we have $\pi_n = \frac{1}{2}\pi_{n-1} + \frac{3}{8}\pi_{n+1}$, with a base case
$\pi_0 = \frac{1}{8} + \frac{3}{8}\pi_0 + \frac{3}{8}\pi_1$.
The $\pi_n$ expression is a linear homogeneous recurrence, so its solution
consists of linear combinations of terms $b^n$, where $b$ satisfies
$1 = \frac{1}{2}b^{-1} + \frac{3}{8}b$. The solutions to this equation are $b = 2/3$ and $b = 2$; we
can exclude the $b = 2$ case because it would make our probabilities
blow up for large $n$. So we can reasonably guess $\pi_n = a(2/3)^n$ when
$n > 0$.
For $n = 0$, substitute $\pi_0 = \frac{1}{8} + \frac{3}{8}\pi_0 + \frac{3}{8}a(2/3)$ to get $\pi_0 = \frac{1}{5} + \frac{2}{5}a$. Now
substitute:
\begin{align*}
\pi_1 = (2/3)a &= \frac{1}{2}\pi_0 + \frac{3}{8}a(2/3)^2 \\
&= \frac{1}{2}\left(\frac{1}{5} + \frac{2}{5}a\right) + \frac{3}{8}a(2/3)^2 \\
&= \frac{1}{10} + \frac{11}{30}a,
\end{align*}
which we can solve to get $a = 1/3$.
So our candidate $\pi$ is $\pi_0 = 1/3$, $\pi_n = (1/3)(2/3)^n$, and in fact we can
drop the special case for $\pi_0$.
As a check, $\sum_{n=0}^{\infty} \pi_n = (1/3)\sum_{n=0}^{\infty}(2/3)^n = \frac{1/3}{1-2/3} = 1$.
Solution
It’s tempting to use the same coupling as for move-to-top (see §10.4.3). This
would be that at each step we choose the same card to swap to the top
position, which increases by at least one the number of cards that are in the
same position in both decks. The problem is that at the next step, these two
cards are most likely separated again, by being swapped with other cards in
two different positions.
Instead, we will do something slightly more clever. Let Zt be the number
of cards in the same position at time t. If the top cards of both decks are
equal, we swap both to the same position chosen uniformly at random. This
has no effect on Zt . If the top cards of both decks are not equal, we pick
a card uniformly at random and swap it to the top in both decks. This
increases Zt by at least 1, unless we happen to pick cards that are already in
the same position; so Zt increases by at least 1 with probability 1 − Zt /n.
Let's summarize a state by an ordered pair $(k, b)$ where $k = Z_t$ and $b$ is
0 if the top cards are equal and 1 if they are not equal. Then we have a
Markov chain where $(k, 0)$ goes to $(k, 1)$ with probability $\frac{n-k}{n}$ (and otherwise
stays put); and $(k, 1)$ goes to $(k+1, 0)$ (or higher) with probability $\frac{n-k}{n}$ and
to $(k, 0)$ with probability $\frac{k}{n}$.
Starting from $(k, 0)$, we expect to wait $\frac{n}{n-k}$ steps on average to reach
$(k, 1)$, at which point we move to $(k+1, 0)$ or back to $(k, 0)$ in one more step;
we iterate through this process $\frac{n}{n-k}$ times on average before we are successful.
This gives an expected number of steps to get from $(k, 0)$ to $(k+1, 0)$ (or
possibly a higher value) of $\frac{n}{n-k}\left(\frac{n}{n-k} + 1\right)$. Summing over $k$ up to $n-2$
(since once $k > n-2$, we will in fact have $k = n$, since $k$ can't be $n-1$), we
get
\begin{align*}
\operatorname{E}[\tau] &\le \sum_{k=0}^{n-2} \frac{n}{n-k}\left(\frac{n}{n-k} + 1\right) \\
&= \sum_{m=2}^{n} \left(\frac{n^2}{m^2} + \frac{n}{m}\right) \\
&\le n^2\left(\frac{\pi^2}{6} - 1\right) + n\ln n \\
&= O(n^2).
\end{align*}
So we expect the deck to mix in $O(n^2\log(1/\epsilon))$ steps. (I don't know if
this is the real bound; my guess is that it should be closer to $O(n\log n)$ as
in all the other shuffling procedures.)
Solution
Suppose we can make this chain reversible, and let $\pi$ be the resulting
stationary distribution. From the detailed balance equations, we have
$(2/3)\pi_i = (1/3)\pi_{i+1}$, or $\pi_{i+1} = 2\pi_i$ for $i = 0 \ldots m-2$. The solution to
this recurrence is $\pi_i = 2^i\pi_0$, which gives $\pi_i = \frac{2^i}{2^m-1}$ when we set $\pi_0$ to get
$\sum_i \pi_i = 1$.
Now solve $\pi_0 p_{0,m-1} = \pi_{m-1}p_{m-1,0}$ to get
\begin{align*}
p_{0,m-1} &= \frac{\pi_{m-1}p_{m-1,0}}{\pi_0} \\
&= 2^{m-1}(2/3) \\
&= 2^m/3.
\end{align*}
This is greater than 1 for $m > 1$, so except for the degenerate cases of
$m = 1$ and $m = 2$, it's not possible to make the chain reversible.
Solution
1. First let’s show irreducibility. Starting from an arbitrary configuration,
repeatedly switch the leftmost 0 to a 1 (this is always permitted
by the transition rules); after at most n steps, we reach the all-1
configuration. Since we can repeat this process in reverse to get to any
other configuration, we get that every configuration is reachable from
every other configuration in at most 2n steps (2n − 1 if we are careful
about handling the all-0 configuration separately).
We also have that for any two adjacent configurations $x$ and $y$,
$p_{xy} = p_{yx} = \frac{1}{2n}$. So we have a reversible, irreducible, aperiodic (because
there exists at least one self-loop) chain with a uniform stationary
distribution $\pi_x = 2^{-n}$.
2. Here is a bound using the obvious coupling, where we choose the same
position in $X$ and $Y$ and attempt to set it to the same value. To
show this coalesces, given $X_t$ and $Y_t$ define $Z_t$ to be the position of the
rightmost 1 in the common prefix of $X_t$ and $Y_t$, or 0 if there is no 1 in
the common prefix of $X_t$ and $Y_t$. Then $Z_t$ increases by at least 1 if we
attempt to set position $Z_t + 1$ to 1, which occurs with probability $\frac{1}{2n}$,
and decreases by at most 1 if we attempt to set $Z_t$ to 0, again with
probability $\frac{1}{2n}$.
It follows that Zt reaches n no later than a ±1 random walk on 0 . . . n
with reflecting barriers that takes a step every 1/n time units on
average. The expected number of steps to reach n from the worst-case
3 Motivation: Imagine each bit represents whether a node in some distributed system
is inactive (0) or active (1), and you can only change your state if you have an active
left neighbor to notify. Also imagine that there is an always-active base station at −1
(alternatively, imagine that this assumption makes the problem easier than the other
natural arrangement where we put all the nodes in a ring).
The bound on τ2 that follows from this is 8n6 , which is pretty bad
(although the constant could be improved by counting the (a) and
(b) bits more carefully). As with the coupling argument, it may be
that there is a less congested set of canonical paths that gives a better
bound.
Solution
1. Since every transition has a matching reverse transition with the same
transition probability, the chain is reversible with a uniform stationary
distribution πH = 1/N .
(a) If Xt and Yt are both trees, then send them to the same tree
with probability 1/2; else let them both add edges independently
(or we could have them add the same edge—it doesn’t make any
difference to the final result).
(b) If only one of $X_t$ and $Y_t$ is a tree, with probability $1/2$ scramble
the tree while attempting to remove an edge from the non-tree,
and the rest of the time scramble the non-tree (which has no
effect) while attempting to add an edge to the tree. Since the
non-tree has at least three edges that can be removed, this puts
$(X_{t+1}, Y_{t+1})$ in the two-tree case with probability at least $\frac{3}{2m}$.
(c) If neither $X_t$ nor $Y_t$ is a tree, attempt to remove an edge from
both. Let $S$ and $T$ be the sets of edges that we can remove from
$X_t$ and $Y_t$, respectively, and let $k = \min(|S|, |T|) \ge 3$. Choose $k$
edges from each of $S$ and $T$ and match them, so that if we remove
one edge from each pair, we also remove the other edge. As in
the previous case, this puts $(X_{t+1}, Y_{t+1})$ in the two-tree case with
probability at least $\frac{3}{2m}$.
Solution
Essentially, we’re going to do the Karp-Luby covering trick [KL85] described
in §11.4, but will tweak the probability distribution when we generate our
samples so that we only get samples with weight w.
Let $U$ be the set of assignments with weight $w$ (there are exactly $\binom{n}{w}$ such
assignments, where $n$ is the number of variables). For each clause $C_i$, let
$U_i = \{x \in U \mid C_i(x) = 1\}$. Now observe that:
1. We can compute $|U_i|$. Let $k_i = |C_i|$ be the number of variables in $C_i$
and $k_i^+ = |C_i^+|$ the number of variables that appear in positive form
in $C_i$. Then $|U_i| = \binom{n-k_i}{w-k_i^+}$ is the number of ways to make a total of $w$
variables true using the remaining $n - k_i$ variables.
1 Initialize A[1 . . . n] to ⊥
2 Choose a hash function h
3 for i ← 1 . . . n do
4 x ← S[i]
5 if A[h(x)] = x then
6 return true
7 else
8 A[h(x)] ← x
9 return false
Algorithm G.1: Dubious duplicate detector
It’s easy to see that Algorithm G.1 never returns true unless some value
appears twice in S. But maybe it misses some duplicates it should find.
1. Suppose h is a random function. What is the worst-case probability
that Algorithm G.1 returns false if S contains two copies of some
value?
2. Is this worst-case probability affected if h is drawn instead from a
2-universal family of hash functions?
Solution
1. Suppose that S[i] = S[j] = x for i < j. Then the algorithm will see x in
A[h(x)] on iteration j and return true, unless it is overwritten by some
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 472
value $S[k]$ with $i < k < j$. This occurs if $h(S[k]) = h(x)$ for some such $k$,
which occurs with probability exactly $1 - (1-1/n)^{j-i-1}$ if we consider all possible $k$.
This quantity is maximized when $i = 1$ and $j = n$, at
$1 - (1-1/n)^{n-2} \approx 1 - e^{-1}(1-1/n)^{-2} \approx 1 - 1/e$.
2. As it happens, the algorithm can fail pretty badly if all we know is that
$h$ is 2-universal. What we can show is that the probability that some
$S[k]$ with $i < k < j$ gets hashed to the same place as $x = S[i] = S[j]$
in the analysis above is at most $(j-i-1)/n \le (n-2)/n = 1 - 2/n$,
since each $S[k]$ has at most a $1/n$ chance of colliding with $x$ and the
union bound applies. But it is possible to construct a 2-universal family
for which we get exactly this probability in the worst case.
Let $U = \{0 \ldots n\}$, and define for each $a$ in $\{0 \ldots n-1\}$ a function
$h_a : U \to \{0 \ldots n-1\}$ by $h_a(n) = 0$ and $h_a(x) = (x + a) \bmod n$ for $x \ne n$.
Then $H = \{h_a\}$ is 2-universal, since if $x \ne y$ and neither $x$ nor $y$ is $n$,
$\Pr[h_a(x) = h_a(y)] = 0$, and if one of $x$ or $y$ is $n$, $\Pr[h_a(x) = h_a(y)] = 1/n$.
But if we use this family in Algorithm G.1 with $S[1] = S[n] = n$ and
$S[k] = k$ for $1 < k < n$, then there are $n - 2$ choices of $a$ that put one
of the middle values in $A[0]$.
2. Suppose that we insert x at some point, and then follow this insertion
with a sequence of insertions of new, distinct values $y_1, y_2, \ldots$. Assum-
ing a worst-case state before inserting $x$, give asymptotic upper and
lower bounds on the expected number of insertions until $x$ is no longer
visible.
Solution
1. The probability of a false positive is maximized when exactly half the
bits in A are one. If x has never been inserted, each A[hi (x)] is equally
likely to be zero or one. So Pr [false positive for x] = Pr [Sk ≥ (3/4)k]
when Sk is a binomial random variable with parameters 1/2 and k.
Chernoff bounds then give a probability of a false positive that is
exponentially small in $k$.
2. For false negatives, we need to look at how quickly the bits for x are
eroded away. A minor complication is that the erosion may start even
as we are setting A[h1 (x)] . . . A[hk (x)].
Let's consider a single bit $A[i]$ and look at how it changes after (a)
setting $A[i] = 1$, and (b) setting some random $A[r] = 1$.
In the first case, $A[i]$ will be 1 after the assignment unless it is set back
to zero, which occurs with probability $\frac{1}{m/2+1}$. This distribution does
not depend on the prior value of $A[i]$.
In the second case, if $A[i]$ was previously 0, it becomes 1 with probability
\[
\frac{1}{m}\left(1 - \frac{1}{m/2+1}\right) = \frac{1}{m}\cdot\frac{m/2}{m/2+1} = \frac{1}{m+2}.
\]
So after the initial assignment, $A[i]$ just flips its value with probability
$\frac{1}{m+2}$.
It is convenient to represent $A[i]$ as $\pm 1$; let $X_i^t = -1$ if $A[i] = 0$ at time
$t$, and 1 otherwise. Then $X_i^t$ satisfies the recurrence
\begin{align*}
\operatorname{E}\left[X_i^{t+1} \mid X_i^t\right] &= \frac{m+1}{m+2}X_i^t - \frac{1}{m+2}X_i^t \\
&= \frac{m}{m+2}X_i^t \\
&= \left(1 - \frac{2}{m+2}\right)X_i^t.
\end{align*}
We can extend this to $\operatorname{E}\left[X_i^t \mid X_i^0\right] = \left(1 - \frac{2}{m+2}\right)^t X_i^0 \approx e^{-2t/(m+2)}X_i^0$.
Similarly, after setting $A[i] = 1$, we get
$\operatorname{E}[X_i] = 1 - \frac{2}{m/2+1} = 1 - \frac{4}{m+2} = 1 - o(1)$.
Let $S^t = \sum_{i=1}^{k} X_{h_i(x)}^t$. Let time 0 be the time at which we finish inserting $x$.
So for any fixed $0 < \epsilon < 1$ and sufficiently large $m$, we will have
$\operatorname{E}\left[S^{t_0}\right] = \epsilon k$ for some $t_0$ where $t \le t_0 \le k + t$ and $t = \Theta(m\ln(1/\epsilon))$.
We are now looking for the time at which $S^t$ drops below $k/2$ (the $k/2$
is because we are working with $\pm 1$ variables). We will bound when
this time occurs using Markov's inequality.
Let's look for the largest time $t$ with $\operatorname{E}\left[S^t\right] \ge (3/4)k$. Then
$\operatorname{E}\left[k - S^t\right] \le k/4$ and $\Pr\left[k - S^t \ge k/2\right] \le 1/2$. It follows that after $\Theta(m) - k$ op-
erations, $x$ is still visible with probability $1/2$, which implies that the
expected time at which it stops being visible is at least $(\Omega(m) - k)/2$.
To get the expected number of insert operations, we divide by $k$, to
get $\Omega(m/k)$.
For the upper bound, apply the same reasoning to the first time at which
$\operatorname{E}\left[S^t\right] \le k/4$. This occurs at time $O(m)$ at the latest (with a different
constant), so after $O(m)$ steps there is at most a $1/2$ probability that
$S^t \ge k/2$. If $S^t$ is still greater than $k/2$ at this point, try again using
the same analysis; this gives us the usual geometric series argument
that $\operatorname{E}[t] = O(m)$. Again, we have to divide by $k$ to get the number of
insert operations, so we get $O(m/k)$ in this case.
Combining these bounds, we have that x disappears after Θ(m/k)
insertions on average. This seems like about what we would expect.
Solution
Let’s suppose that there is some such c. We will necessarily have c ≤ 1 = T (1),
so the induction hypothesis will hold in the base case n = 1.
For $n \ge 2$, compute
\begin{align*}
T(n) &= 2^{-n}\sum_{k=0}^{n}\binom{n}{k}T(k) \\
&= 2^{-n}T(n) + 2^{-n}nT(1) + 2^{-n}\sum_{k=2}^{n-1}\binom{n}{k}T(k) \\
&\ge 2^{-n}T(n) + 2^{-n}n + 2^{-n}(2^n - n - 2)c.
\end{align*}
Solving for $T(n)$ gives
\begin{align*}
T(n) &\ge \frac{n + (2^n - n - 2)c}{2^n - 1} \\
&= c\,\frac{2^n - n - 2 + n/c}{2^n - 1}.
\end{align*}
Solution
Suppose we have a monochromatic edge surrounded by non-monochromatic
edges, e.g., RBRRBR. If we pick one of the endpoints of the edge (say
the left endpoint in this case), then the monochromatic edge shifts in the
direction of that endpoint: RBBRBR. Picking any node not incident to a
monochromatic edge has no effect, so in this case there is no way to increase
the number of monochromatic edges.
It may also be that we have two adjacent monochromatic edges: BRRRB.
Now if we happen to pick the node in the middle, we end up with no
monochromatic edges (BRBRB) and the process terminates. If on the other
hand we pick one of the nodes on the outside, then the monochromatic edges
move away from each other.
We can thus model the process with 2 monochromatic edges as a random
walk, where the difference between the leftmost nodes of the edges (mod
1 Randomly permute A
2 m ← −∞
3 for i ← 1 . . . n do
4 if A[i] > m then
5 m ← A[i]
6 return m
Algorithm G.2: Randomized max-finding algorithm
Solution
Let $X_i$ be the indicator variable for the event that Line 5 is executed on
the $i$-th pass through the loop. This will occur if $A[i]$ is greater than $A[j]$
for all $j < i$, which occurs with probability exactly $1/i$ (given that $A$ has
been permuted randomly). So the expected number of calls to Line 5 is
$\sum_{i=1}^{n}\operatorname{E}[X_i] = \sum_{i=1}^{n}\frac{1}{i} = H_n$.
Solution
The fact that G is itself a random graph is a red herring; all we really need
to know is that it’s d-regular.
1. Because G has exactly dn/2 edges, and each edge has probability 1/2
of being monochromatic, the expected number of monochromatic edges
is dn/4.
Solution
Color the elements in the final merged list red or blue based on which sublist
they came from. The only elements that do not require a comparison to
insert into the main list are those that are followed only by elements of the
same color; the expected number of such elements is equal to the expected
length of the longest monochromatic suffix. By symmetry, this is the same as
the expected longest monochromatic prefix, which is equal to the expected
length of the longest sequence of identical coin-flips.
The probability of getting $k$ identical coin-flips in a row followed by a
different coin-flip is exactly $2^{-k}$: the first coin-flip sets the color, the next $k-1$
must follow it (giving a factor of $2^{-k+1}$), and the last must be the opposite
color (giving an additional factor of $2^{-1}$). For $n$ identical coin-flips, there is
a probability of $2^{-n+1}$, since we don't need an extra coin-flip of the opposite
color. So the expected length is $\sum_{k=1}^{n-1}k2^{-k} + n2^{-n+1} = \sum_{k=0}^{n}k2^{-k} + n2^{-n}$.
We can simplify the sum using generating functions. The sum $\sum_{k=0}^{n}2^{-k}z^k$
is given by $\frac{1-(z/2)^{n+1}}{1-z/2}$. Taking the derivative with respect to $z$ gives
\[
\sum_{k=0}^{n}2^{-k}kz^{k-1} = \frac{(1/2)\left(1-(z/2)^{n+1}\right)}{(1-z/2)^2} - \frac{(1/2)(n+1)(z/2)^n}{1-z/2}.
\]
At $z = 1$ this is $2(1 - 2^{-n-1}) - (n+1)2^{-n} = 2 - (n+2)2^{-n}$. Adding the
second term gives $\operatorname{E}[X] = 2 - (n+2)2^{-n} + n2^{-n} = 2 - 2\cdot 2^{-n} = 2 - 2^{-n+1}$.
Note that this counts the expected number of elements for which we do
not have to do a comparison; with $n$ elements total, this leaves $n - 2 + 2^{-n+1}$
comparisons on average.
Solution
We can show that the suggested sequence is a martingale, by computing
\begin{align*}
\operatorname{E}\left[\frac{X_{t+1}}{X_{t+1}+Y_{t+1}} \,\middle|\, X_t, Y_t\right]
&= \frac{X_t}{X_t+Y_t}\cdot\frac{X_t+1}{X_t+Y_t+1} + \frac{Y_t}{X_t+Y_t}\cdot\frac{X_t}{X_t+Y_t+1} \\
&= \frac{X_t(X_t+1) + Y_tX_t}{(X_t+Y_t)(X_t+Y_t+1)} \\
&= \frac{X_t(X_t+Y_t+1)}{(X_t+Y_t)(X_t+Y_t+1)} \\
&= \frac{X_t}{X_t+Y_t}.
\end{align*}
From the martingale property we have $\operatorname{E}\left[\frac{X_n}{X_n+Y_n}\right] = \frac{X_0}{X_0+Y_0}$. But
$X_n + Y_n = X_0 + Y_0 + n$, a constant, so we can multiply both sides by this value to
get $\operatorname{E}[X_n] = X_0\,\frac{X_0+Y_0+n}{X_0+Y_0}$.
Solution
1. Any deterministic algorithm essentially just chooses some fixed number
m of tickets to collect before buying the placard. Let n be the actual
number of tickets issued. For m = 0, the competitive ratio is infinite
when n = 0. For m = 1, the competitive ratio is 3 when n = 1. For
m > 2, the competitive ratio is (m + 2)/2 > 2 when n = m. So m = 2
is the optimal choice.
Solution
This is a job for Chernoff bounds. For any particular machine, the load $S$ is
a sum of independent indicator variables and the mean load is $\mu = n$. So we
have
\[
\Pr[S \ge (1+\delta)\mu] \le \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^n.
\]
Observe that $e^\delta/(1+\delta)^{1+\delta} < 1$ for $\delta > 0$. One proof of this fact is to take
the log to get $\delta - (1+\delta)\log(1+\delta)$, which equals 0 at $\delta = 0$, and then show that
the logarithm is decreasing by showing that
$\frac{d}{d\delta}\left[\delta - (1+\delta)\log(1+\delta)\right] = 1 - \frac{1+\delta}{1+\delta} - \log(1+\delta) = -\log(1+\delta) < 0$
for all $\delta > 0$.
So we can let $c = e^\delta/(1+\delta)^{1+\delta}$ to get a bound of $c^n$ on the probability
that any particular machine is overloaded and a bound of $nc^n$ (from the
union bound) on the probability that any of the machines is overloaded.
Appendix I
Probabilistic recurrences
I.2 Examples
• How long does it take to get our first heads if we repeatedly flip
a coin that comes up heads with probability p? Even though we
probably already know the answer to this, we can solve it by solving
the recurrence $T(1) = 1 + T(1 - X_1)$, $T(0) = 0$, where $\operatorname{E}[X_1] = p$.
• Suppose we start with n biased coins that each come up heads with
probability p. In each round, we flip all the coins and throw away the
ones that come up tails. How many rounds does it take to get rid of
all of the coins? (This essentially tells us how tall a skip list [Pug90]
can get.) Here we have E [Xn ] = (1 − p)n.
• Let’s play Chutes and Ladders without the chutes and ladders. We
start at location n, and whenever it’s our turn, we roll a fair six-sided
die X and move to n − X unless this value is negative, in which case
we stay put until the next turn. How many turns does it take to get to
0?
we pass. From the point of view of the interval $[k, k+1]$, we don't know
which $n$ we are going to start from before we cross it, but we do know that
for any $n \ge k+1$ we start from, our speed will be at least $\mu(n) \ge \mu(k+1)$
on average. So the time it takes will be at most $\int_k^{k+1}\frac{1}{\mu(t)}\,dt$ on average, and
the total time is obtained by summing all of these intervals.
Of course, this intuition is not even close to a real proof (among other
things, there may be a very dangerous confusion in there between 1/ E [Xn ]
and E [1/Xn ]), so we will give a real proof as well.
Proof of Lemma I.3.1. This is essentially the same proof as in Motwani and
Raghavan [MR95], but we add some extra detail to allow for the possibility
that Xn = 0.
Let $p = \Pr[X_n = 0]$, $q = 1 - p = \Pr[X_n \ne 0]$. Note we have $q > 0$,
because otherwise $\operatorname{E}[X_n] = 0 < \mu(n)$. Then we have
\begin{align*}
\operatorname{E}[T(n)] &= 1 + \operatorname{E}[T(n - X_n)] \\
&= 1 + p\operatorname{E}[T(n - X_n) \mid X_n = 0] + q\operatorname{E}[T(n - X_n) \mid X_n \ne 0] \\
&= 1 + p\operatorname{E}[T(n)] + q\operatorname{E}[T(n - X_n) \mid X_n \ne 0].
\end{align*}
Now we have $\operatorname{E}[T(n)]$ on both sides, which we don't like very much. So
we collect it on the left-hand side:
\begin{align*}
(1-p)\operatorname{E}[T(n)] &= 1 + q\operatorname{E}[T(n - X_n) \mid X_n \ne 0], \\
\operatorname{E}[T(n)] &= 1/q + \operatorname{E}[T(n - X_n) \mid X_n \ne 0] \\
&= 1/q + \operatorname{E}\left[\operatorname{E}[T(n - X_n) \mid X_n] \mid X_n \ne 0\right] \\
&\le 1/q + \operatorname{E}\left[\int_a^{n-X_n}\frac{1}{\mu(t)}\,dt \,\middle|\, X_n \ne 0\right] \\
&= 1/q + \operatorname{E}\left[\int_a^{n}\frac{1}{\mu(t)}\,dt - \int_{n-X_n}^{n}\frac{1}{\mu(t)}\,dt \,\middle|\, X_n \ne 0\right] \\
&\le 1/q + \int_a^{n}\frac{1}{\mu(t)}\,dt - \operatorname{E}\left[\frac{X_n}{\mu(n)} \,\middle|\, X_n \ne 0\right] \\
&\le 1/q + \int_a^{n}\frac{1}{\mu(t)}\,dt - \frac{\operatorname{E}[X_n \mid X_n \ne 0]}{\mu(n)}.
\end{align*}
The second-to-last step uses the fact that $\mu(t) \le \mu(n)$ for $t \le n$.
It may seem like we don't know what $\operatorname{E}[X_n \mid X_n \ne 0]$ is. But we know
that $X_n \ge 0$, so we have
$\operatorname{E}[X_n] = p\operatorname{E}[X_n \mid X_n = 0] + q\operatorname{E}[X_n \mid X_n \ne 0] = q\operatorname{E}[X_n \mid X_n \ne 0]$,
giving $\operatorname{E}[X_n \mid X_n \ne 0] = \operatorname{E}[X_n]/q \ge \mu(n)/q$. Substituting this back into
the last bound gives
$\operatorname{E}[T(n)] \le 1/q + \int_a^{n}\frac{1}{\mu(t)}\,dt - 1/q = \int_a^{n}\frac{1}{\mu(t)}\,dt$, as claimed.
I.3.2 Quickselect
In Quickselect, we pick a random pivot and split the original array of size
$n$ into three piles of size $m$ (less than the pivot), 1 (the pivot itself), and
$n-m-1$ (greater than the pivot). We then figure out which of the three piles
contains the $k$-th smallest element (depending on how $k$ compares to $m+1$) and
recurse, stopping when we hit a pile with 1 element. It's easiest to analyze
this by assuming that we recurse in the largest of the three piles, i.e., that
our recurrence is $T(n) = 1 + \max(T(m), T(n-m-1))$, where $m$ is uniform
in $0 \ldots n-1$. The exact value of $\operatorname{E}[\max(m, n-m-1)]$ is a little messy to
compute (among other things, it depends on whether $n$ is odd or even), but
it's not hard to see that it's always less than $(3/4)n$. So letting $\mu(n) = n/4$,
we get
\[
\operatorname{E}[T(n)] \le \int_1^n \frac{1}{t/4}\,dt = 4\ln n.
\]
As it happens, this is the exact answer for this case. This will happen
whenever $X$ is always a 0–1 variable and we define $\mu(x) = \operatorname{E}[X \mid n = \lceil x\rceil]$,
which can be seen by spending far too much time thinking about the precise
sources of error in the inequalities in the proof.
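A tiny Python sketch (ours) of the pessimistic recurrence confirms the
$4\ln n$ bound:

    import random

    def pessimistic_quickselect_depth(n):
        # Iterate T(n) = 1 + max(T(m), T(n-m-1)) with m uniform in 0..n-1,
        # always recursing into the larger pile.
        t = 0
        while n > 1:
            m = random.randrange(n)
            n = max(m, n - m - 1)
            t += 1
        return t

    # The mean of pessimistic_quickselect_depth(10**6) over many runs
    # should be below 4 * ln(10**6) ~ 55.3.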
Then we have
\[
\operatorname{E}[T(n)] \le \int_{0^+}^{n}\frac{1}{\mu(t)}\,dt = \sum_{k=1}^{n}\frac{1}{\mu(k)}.
\]

n     µ(n)   1/µ(n)   Σ_{k≤n} 1/µ(k)
1     1/6    6        6
2     1/2    2        8
3     1      1        9
4     5/3    3/5      48/5
5     5/2    2/5      10
≥6    7/2    2/7      10 + (2/7)(n-5) = (2/7)n + 65/7
i.e., a statement of the form $\Pr[T(n) \ge t] \le \epsilon$. There are two natural ways
to do this: we can repeatedly apply Markov's inequality to the expectation
bound, or we can attempt to analyze the recurrence in more detail. The first
method tends to give weaker bounds, but it's easier.
[AAG+ 10] Dan Alistarh, Hagit Attiya, Seth Gilbert, Andrei Giurgiu, and
Rachid Guerraoui. Fast randomized test-and-set and renaming.
In Nancy A. Lynch and Alexander A. Shvartsman, editors,
Distributed Computing, 24th International Symposium, DISC
2010, Cambridge, MA, USA, September 13-15, 2010. Proceed-
ings, volume 6343 of Lecture Notes in Computer Science, pages
94–108. Springer, 2010.
[ABMRT96] Arne Andersson, Peter Bro Miltersen, Søren Riis, and Mikkel
Thorup. Static dictionaries on AC⁰ RAMs: Query time
$\Theta(\sqrt{\log n/\log\log n})$ is necessary and sufficient. In FOCS, pages
441–450, 1996.
[AC08] Hagit Attiya and Keren Censor. Tight bounds for asynchronous
randomized consensus. Journal of the ACM, 55(5):20, October
2008.
[AE19] James Aspnes and He Yang Er. Consensus with max registers.
In Jukka Suomela, editor, 33rd International Symposium on
Distributed Computing (DISC 2019), volume 146 of Leibniz In-
ternational Proceedings in Informatics (LIPIcs), pages 1:1–1:9,
Dagstuhl, Germany, 2019. Schloss Dagstuhl–Leibniz-Zentrum
fuer Informatik.
[ALM+ 98] Sanjeev Arora, Carsten Lund, Rajeev Motwani, Madhu Sudan,
and Mario Szegedy. Proof verification and the hardness of
approximation problems. Journal of the ACM, 45(3):501–555,
1998.
[BD92] Dave Bayer and Persi Diaconis. Trailing the dovetail shuffle to
its lair. Annals of Applied Probability, 2(2):294–313, 1992.
[BD97] Russ Bubley and Martin Dyer. Path coupling: A technique for
proving rapid mixing in Markov chains. In Proceedings 38th
Annual Symposium on Foundations of Computer Science, pages
223–231. IEEE, 1997.
[CM03] Saar Cohen and Yossi Matias. Spectral Bloom filters. In Alon Y.
Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceed-
ings of the 2003 ACM SIGMOD International Conference on
Management of Data, San Diego, California, USA, June 9-12,
2003, pages 241–252, 2003.
[DGH+ 87] Alan Demers, Dan Greene, Carl Hauser, Wes Irish, John Larson,
Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry.
Epidemic algorithms for replicated database maintenance. In
Proceedings of the Sixth Annual ACM Symposium on Principles
of Distributed Computing, PODC ’87, page 1–12, New York,
NY, USA, 1987. Association for Computing Machinery.
[GW12] George Giakkoupis and Philipp Woelfel. On the time and space
complexity of randomized test-and-set. In Darek Kowalski and
Alessandro Panconesi, editors, ACM Symposium on Princi-
ples of Distributed Computing, PODC ’12, Funchal, Madeira,
Portugal, July 16-18, 2012, pages 19–28. ACM, 2012.
[HH80] P. Hall and C.C. Heyde. Martingale Limit Theory and Its
Application. Academic Press, 1980.
[HWY22] Kun He, Chunyang Wang, and Yitong Yin. Sampling Lovász
local lemma for general constraint satisfaction solutions in
near-linear time. In 2022 IEEE 63rd Annual Symposium on
Foundations of Computer Science (FOCS), pages 147–158, 2022.
[HY04] Jun He and Xin Yao. A study of drift analysis for estimating
computation time of evolutionary algorithms. Natural Comput-
ing, 3:21–35, 2004.
[KS76] J.G. Kemeny and J.L. Snell. Finite Markov Chains: With
a New Appendix “Generalization of a Fundamental Matrix”.
Undergraduate Texts in Mathematics. Springer, 1976.
[KUW88] Richard M. Karp, Eli Upfal, and Avi Wigderson. The complexity
of parallel search. Journal of Computer and System Sciences,
36(2):225–253, 1988.
[LV99] Michael Luby and Eric Vigoda. Fast convergence of the Glauber
dynamics for sampling independent sets. Random Structures &
Algorithms, 15(3–4):229–241, 1999.
    with two absorbing barriers, 158
random-access machine, 1
randomized computation, 276
randomized rounding, 220, 240
randomness
    physical, 9
range coding, 450
rapid mixing, 185
record, 416
    double, 416
red-black tree, 103
reducible, 170
reduction, 214
    gap-amplifying, 269
    gap-preserving, 269
    gap-reducing, 269
register, 289
    max, 297
regret, 95
rejection sampling, 182, 216, 450
relax, 239
relaxation, 239
relaxation time, 201
reservoir sampling, 332
reversibility, 171
reversible, 178, 281
right spine, 109
ring-cover tree, 141
rock-paper-scissors, 226
root, 4, 102
rotation, 103
    tree, 103
routing
    bit-fixing, 78, 208
    permutation, 77
run, 361, 455
sampling, 11, 13
    rejection, 182, 216, 450
    reservoir, 332
sampling Lovász local lemma, 249
SAT, 476
satisfiability, 245
satisfiability problem, 239
satisfy, 239
scapegoat tree, 103
search tree
    balanced, 102
    binary, 102
    priority, 105
second-moment method, 62
seed, 9, 253
separate chaining, 115
set balancing, 235
SHA, 115
shared coin
    weak, 292
shared memory, 289
sharp P, 214
sifter, 57, 292, 295
simple path, 357
simplex method, 240
simulated annealing, 196
simulation
    Monte Carlo, 11
sketch, 133
    count-min, 136
    Flajolet-Martin, 134, 433
    Tug-of-War, 152
sort
    bubble, 364
sorting network, 235
soundness, 262
Space Invaders, 374
spanning tree, 469
spare, 429
spectral Bloom filter, 133
spectral theorem, 199
spine
    left, 109
UCB1 algorithm, 95
unary, 146
unbiased random walk, 158
uniform, 256
union bound, 53
unique game, 272
Unique Games Conjecture, 272
unitary matrix, 280
unitary transformation, 280
universal hash family, 115
unranking, 216
upper confidence bound, 95
wait-free, 290
wait-free shared memory, 289
Wald’s equation, 36, 162
Walsh-Hadamard code, 265
weak shared coin, 292
with high probability, 51
witness, 5, 256
witness tree, 250
word search puzzle, 442
futile, 443
worst-case analysis, 2
xxHash, 115
Yao’s lemma, 3, 45