
Notes on Randomized Algorithms

James Aspnes

2024-07-27 23:04

Copyright © 2009–2024 by James Aspnes. Distributed under a Creative
Commons Attribution-ShareAlike 4.0 International license:
https://creativecommons.org/licenses/by-sa/4.0/.
Contents

Table of contents

List of figures

List of tables

List of algorithms

Preface

1 Randomized algorithms
1.1 Searching an array
1.2 Verifying polynomial identities
1.3 Randomized QuickSort
1.3.1 Brute force method: solve the recurrence
1.3.2 Clever method: use linearity of expectation
1.4 Where does the randomness come from?
1.5 Classifying randomized algorithms
1.5.1 Las Vegas vs Monte Carlo
1.5.2 Randomized complexity classes
1.6 Classifying randomized algorithms by their methods

2 Probability theory
2.1 Probability spaces and events
2.1.1 General probability spaces
2.2 Boolean combinations of events
2.3 Conditional probability
2.3.1 Conditional probability and independence
2.3.2 Conditional probability and the law of total probability
2.3.3 Examples
2.3.3.1 Racing coin-flips
2.3.3.2 Karger’s min-cut algorithm

3 Random variables
3.1 Operations on random variables
3.2 Random variables and events
3.3 Measurability
3.4 Expectation
3.4.1 Linearity of expectation
3.4.1.1 Linearity of expectation for infinite sequences
3.4.2 Expectation and inequalities
3.4.3 Expectation of a product
3.4.3.1 Wald’s equation (simple version)
3.5 Conditional expectation
3.5.1 Expectation conditioned on an event
3.5.2 Expectation conditioned on a random variable
3.5.2.1 Calculating conditional expectations
3.5.2.2 The law of iterated expectation
3.5.2.3 Conditional expectation as orthogonal projection
3.5.3 Expectation conditioned on a σ-algebra
3.5.4 Examples
3.6 Applications
3.6.1 Yao’s lemma
3.6.2 Geometric random variables
3.6.3 Coupon collector
3.6.4 Hoare’s FIND

4 Basic probabilistic inequalities
4.1 Markov’s inequality
4.1.1 Applications
4.1.1.1 Sum of fair coins
4.1.1.2 Randomized QuickSort
4.1.1.3 Balls in bins
4.2 Union bound (Boole’s inequality)
4.2.1 Example: Balls in bins
4.3 Jensen’s inequality
4.3.1 Proof
4.3.2 Applications
4.3.2.1 Fair coins: lower bound
4.3.2.2 Fair coins: upper bound
4.3.2.3 Sifters

5 Concentration bounds
5.1 Chebyshev’s inequality
5.1.1 Computing variance
5.1.1.1 Alternative formula
5.1.1.2 Variance of a Bernoulli random variable
5.1.1.3 Variance of a sum
5.1.1.4 Variance of a geometric random variable
5.1.2 More examples
5.1.2.1 Flipping coins
5.1.2.2 Balls in bins
5.1.2.3 Lazy select
5.2 Chernoff bounds
5.2.1 The classic Chernoff bound
5.2.2 Easier variants
5.2.3 Lower bound version
5.2.4 Two-sided version
5.2.5 What if we only have a bound on E[S]?
5.2.6 Almost-independent variables
5.2.7 Other tail bounds for the binomial distribution
5.2.8 Applications
5.2.8.1 Flipping coins
5.2.8.2 Balls in bins again
5.2.8.3 Flipping coins, central behavior
5.2.8.4 Permutation routing on a hypercube
5.3 The Azuma-Hoeffding inequality
5.3.1 Hoeffding’s inequality
5.3.1.1 Hoeffding vs Chernoff
5.3.1.2 Asymmetric version
5.3.2 Azuma’s inequality
5.3.3 The method of bounded differences
5.3.4 Applications
5.3.4.1 Sprinkling points on a hypercube
5.3.4.2 Chromatic number of a random graph
5.3.4.3 Balls in bins
5.3.4.4 Probabilistic recurrence relations
5.3.4.5 Multi-armed bandits
The UCB1 algorithm
Analysis of UCB1
5.4 Relation to limit theorems
5.5 Anti-concentration bounds
5.5.1 The Berry-Esseen theorem
5.5.2 The Littlewood-Offord problem

6 Randomized search trees
6.1 Binary search trees
6.1.1 Rebalancing and rotations
6.2 Random insertions
6.3 Treaps
6.3.1 Assumption of an oblivious adversary
6.3.2 Analysis
6.3.2.1 Searches
6.3.2.2 Insertions and deletions
6.3.2.3 Other operations
6.4 Skip lists

7 Hashing
7.1 Hash tables
7.2 Universal hash families
7.2.1 Linear congruential hashing
7.2.2 Tabulation hashing
7.3 FKS hashing
7.4 Cuckoo hashing
7.4.1 Structure
7.4.2 Analysis
7.5 Practical issues
7.6 Bloom filters
7.6.1 Construction
7.6.2 False positives
7.6.3 Comparison to optimal space
7.6.4 Applications
7.6.5 Counting Bloom filters
7.7 Data stream computation
7.7.1 Cardinality estimation
7.7.2 Count-min sketches
7.7.2.1 Initialization and updates
7.7.2.2 Queries
7.7.2.3 Finding heavy hitters
7.8 Locality-sensitive hashing
7.8.1 Approximate nearest neighbor search
7.8.2 Locality-sensitive hash functions
7.8.3 Constructing an (r1, r2)-PLEB
7.8.4 Hash functions for Hamming distance
7.8.5 Hash functions for ℓ1 distance

8 Dimension reduction
8.1 The Johnson-Lindenstrauss lemma
8.1.1 Reduction to single-vector case
8.1.2 A relatively simple proof of the lemma
8.1.3 Distributional version
8.2 Applications

9 Martingales and stopping times
9.1 Definitions
9.2 Submartingales and supermartingales
9.3 The optional stopping theorem
9.4 Applications
9.4.1 Random walks
9.4.2 Wald’s equation
9.4.3 Maximal inequalities
9.4.4 Waiting times for patterns

10 Markov chains
10.1 Basic definitions and properties
10.1.1 Examples
10.2 Convergence of Markov chains
10.2.1 Stationary distributions
10.2.2 Total variation distance
10.2.2.1 Total variation distance and expectation
10.2.3 Mixing time
10.2.4 Coupling of Markov chains
10.2.5 Irreducible and aperiodic chains
10.2.6 Convergence of finite irreducible aperiodic Markov chains
10.3 Reversible chains
10.3.1 Stationary distributions
10.3.2 Examples
10.3.3 Time-reversed chains
10.3.4 Adjusting stationary distributions with the Metropolis-Hastings algorithm
10.4 The coupling method
10.4.1 Random walk on a cycle
10.4.2 Random walk on a hypercube
10.4.3 Various shuffling algorithms
10.4.3.1 Move-to-top
10.4.3.2 Random exchange of arbitrary cards
10.4.3.3 Random exchange of adjacent cards
10.4.3.4 Real-world shuffling
10.4.4 Path coupling
10.4.4.1 Random walk on a hypercube
10.4.4.2 Sampling graph colorings
10.4.4.3 Sampling independent sets
10.4.4.4 Simulated annealing
Single peak
Somewhat smooth functions
10.5 Spectral methods for reversible chains
10.5.1 Spectral properties of a reversible chain
10.5.2 Analysis of symmetric chains
10.5.3 Analysis of asymmetric chains
10.6 Conductance
10.6.1 Easy cases for conductance
10.6.2 Edge expansion using canonical paths
10.6.3 Congestion
10.6.4 Examples
10.6.4.1 Lazy random walk on a line
10.6.4.2 Random walk on a hypercube
10.6.4.3 Matchings in a graph
10.6.4.4 Perfect matchings in dense bipartite graphs

11 Approximate counting
11.1 Exact counting
11.2 Counting by sampling
11.2.1 Generating samples
11.3 Approximating #KNAPSACK
11.4 Approximating #DNF
11.5 Approximating exponentially improbable events
11.5.1 Matchings
11.5.2 Other problems

12 Hitting times
12.1 Waiting times
12.2 Lyapunov functions
12.3 Drift analysis

13 The probabilistic method
13.1 Randomized constructions and existence proofs
13.1.1 Set balancing
13.1.2 Ramsey numbers
13.2 Approximation algorithms
13.2.1 MAX CUT
13.2.2 MAX SAT
13.3 The Lovász Local Lemma
13.3.1 General version
13.3.2 Symmetric version
13.3.3 Applications
13.3.3.1 Graph coloring
13.3.3.2 Satisfiability of k-CNF formulas
13.3.3.3 Hypergraph 2-colorability
13.3.4 Non-constructive proof
13.3.5 Constructive proof

14 Derandomization
14.1 Deterministic vs. randomized algorithms
14.2 Adleman’s theorem
14.3 Limited independence
14.3.1 MAX CUT
14.4 The method of conditional probabilities
14.4.1 MAX CUT using conditional probabilities
14.4.2 Deterministic construction of Ramsey graphs
14.4.3 Derandomized set balancing

15 Probabilistically-checkable proofs
15.1 Probabilistically-checkable proofs
15.2 A PCP for GRAPH NON-ISOMORPHISM
15.2.1 GRAPH NON-ISOMORPHISM with private coins
15.2.2 A probabilistically-checkable proof for GRAPH NON-ISOMORPHISM
15.3 NP ⊆ PCP(poly(n), 1)
15.3.1 QUADEQ
15.3.2 The Walsh-Hadamard Code
15.3.3 A PCP for QUADEQ
15.4 PCP and approximability
15.4.1 Approximating the number of satisfied verifier queries
15.4.2 Gap-preserving reduction to MAX SAT
15.4.3 Other inapproximable problems
15.5 Dinur’s proof of the PCP theorem
15.6 The Unique Games Conjecture

16 Quantum computing
16.1 Random circuits
16.2 Bra-ket notation
16.2.1 States as kets
16.2.2 Composition of kets
16.2.3 Operators as sums of kets times bras
16.3 Quantum circuits
16.3.1 Quantum operations
16.3.2 Quantum implementations of classical operations
16.3.3 Phase representation of a function
16.3.4 Practical issues (which we will ignore)
16.3.5 Quantum computations
16.4 Deutsch’s algorithm
16.5 Grover’s algorithm
16.5.1 Initial superposition
16.5.2 The Grover diffusion operator
16.5.3 Effect of the iteration

17 Randomized distributed algorithms
17.1 Consensus
17.1.1 Impossibility of deterministic algorithms
17.2 Leader election
17.3 How randomness helps
17.4 Building a weak shared coin
17.5 Leader election with sifters
17.6 Consensus with sifters

A Sample assignments from Spring 2024
A.1 Assignment 1, due Thursday 2024-02-01 at 23:59
A.1.1 Matchings
A.1.2 Non-volatile memory
A.2 Assignment 2, due Thursday 2024-02-15 at 23:59
A.2.1 Mediocre cuts
A.2.2 Training costs
A.3 Assignment 3, due Thursday 2024-02-29 at 23:59
A.3.1 A robot rendezvous problem
A.3.2 A linked list
A.4 Assignment 4, due Thursday 2024-03-28 at 23:59
A.4.1 Nearly orthogonal vectors
A.4.2 Boosting a random walk
A.5 Assignment 5, due Thursday 2024-04-11 at 23:59
A.5.1 Relaxation time for Metropolis-Hastings
A.5.2 A constrained random walk
A.6 Assignment 6, due Thursday 2024-04-25 at 23:59
A.6.1 The power of intransigence
Potential function argument
Coupling argument
A.6.2 Almost Markov

B Sample assignments from Spring 2023
B.1 Assignment 1, due Thursday 2023-02-16 at 23:59
B.1.1 Hashing without counting
B.1.2 Permutation routing on an incomplete network
B.2 Assignment 2, due Thursday 2023-03-30 at 23:59
B.2.1 Some streaming data structures
B.2.2 A dense network
B.3 Assignment 3, due Thursday 2023-04-27 at 23:59
B.3.1 Shuffling a graph
B.3.2 Counting unbalanced sets

C Sample assignments from Fall 2019
C.1 Assignment 1: due Thursday, 2019-09-12, at 23:00
C.1.1 The golden ticket
C.1.2 Exploding computers
C.2 Assignment 2: due Thursday, 2019-09-26, at 23:00
C.2.1 A logging problem
C.2.2 Return of the exploding computers
C.3 Assignment 3: due Thursday, 2019-10-10, at 23:00
C.3.1 Two data plans
C.3.2 A randomly-indexed list
C.4 Assignment 4: due Thursday, 2019-10-31, at 23:00
C.4.1 A hash tree
C.4.2 Randomized robot rendezvous on a ring
C.5 Assignment 5: due Thursday, 2019-11-14, at 23:00
C.5.1 Non-exploding computers
C.5.2 A wordy walk
C.6 Assignment 6: due Monday, 2019-12-09, at 23:00
C.6.1 Randomized colorings
C.6.2 No long paths
C.7 Final exam
C.7.1 A uniform ring
C.7.2 Forbidden runs
C.7.3 A derandomized balancing scheme

D Sample assignments from Fall 2016
D.1 Assignment 1: due Sunday, 2016-09-18, at 17:00
D.1.1 Bubble sort
D.1.2 Finding seats
D.2 Assignment 2: due Thursday, 2016-09-29, at 23:00
D.2.1 Technical analysis
D.2.2 Faulty comparisons
D.3 Assignment 3: due Thursday, 2016-10-13, at 23:00
D.3.1 Painting with sprites
D.3.2 Dynamic load balancing
D.4 Assignment 4: due Thursday, 2016-11-03, at 23:00
D.4.1 Re-rolling a random treap
D.4.2 A defective hash table
D.5 Assignment 5: due Thursday, 2016-11-17, at 23:00
D.5.1 A spectre is haunting Halloween
D.5.2 Colliding robots on a line
D.6 Assignment 6: due Thursday, 2016-12-08, at 23:00
D.6.1 Another colliding robot
D.6.2 Return of the sprites
D.7 Final exam
D.7.1 Virus eradication (20 points)
D.7.2 Parallel bubblesort (20 points)
D.7.3 Rolling a die (20 points)

E Sample assignments from Spring 2014
E.1 Assignment 1: due Wednesday, 2014-09-10, at 17:00
E.1.1 Bureaucratic part
E.1.2 Two terrible data structures
E.1.3 Parallel graph coloring
E.2 Assignment 2: due Wednesday, 2014-09-24, at 17:00
E.2.1 Load balancing
E.2.2 A missing hash function
E.3 Assignment 3: due Wednesday, 2014-10-08, at 17:00
E.3.1 Tree contraction
E.3.2 Part testing
Using McDiarmid’s inequality and some cleverness
E.4 Assignment 4: due Wednesday, 2014-10-29, at 17:00
E.4.1 A doubling strategy
E.4.2 Hash treaps
E.5 Assignment 5: due Wednesday, 2014-11-12, at 17:00
E.5.1 Agreement on a ring
E.5.2 Shuffling a two-dimensional array
E.6 Assignment 6: due Wednesday, 2014-12-03, at 17:00
E.6.1 Sampling colorings on a cycle
E.6.2 A hedging problem
E.7 Final exam
E.7.1 Double records (20 points)
E.7.2 Hipster graphs (20 points)
Using the method of conditional probabilities
Using hill climbing
E.7.3 Storage allocation (20 points)
E.7.4 Fault detectors in a grid (20 points)

F Sample assignments from Spring 2013
F.1 Assignment 1: due Wednesday, 2013-01-30, at 17:00
F.1.1 Bureaucratic part
F.1.2 Balls in bins
F.1.3 A labeled graph
F.1.4 Negative progress
F.2 Assignment 2: due Thursday, 2013-02-14, at 17:00
F.2.1 A local load-balancing algorithm
F.2.2 An assignment problem
F.2.3 Detecting excessive collusion
F.3 Assignment 3: due Wednesday, 2013-02-27, at 17:00
F.3.1 Going bowling
F.3.2 Unbalanced treaps
F.3.3 Random radix trees
F.4 Assignment 4: due Wednesday, 2013-03-27, at 17:00
F.4.1 Flajolet-Martin sketches with deletion
F.4.2 An adaptive hash table
F.4.3 An odd locality-sensitive hash function
F.5 Assignment 5: due Friday, 2013-04-12, at 17:00
F.5.1 Choosing a random direction
F.5.2 Random walk on a tree
F.5.3 Sampling from a tree
F.6 Assignment 6: due Friday, 2013-04-26, at 17:00
F.6.1 Increasing subsequences
F.6.2 Futile word searches
F.6.3 Balance of power
F.7 Final exam
F.7.1 Dominating sets
F.7.2 Tricolor triangles
F.7.3 The n rooks problem
F.7.4 Pursuing an invisible target on a ring

G Sample assignments from Spring 2011
G.1 Assignment 1: due Wednesday, 2011-01-26, at 17:00
G.1.1 Bureaucratic part
G.1.2 Rolling a die
G.1.3 Rolling many dice
G.1.4 All must have candy
G.2 Assignment 2: due Wednesday, 2011-02-09, at 17:00
G.2.1 Randomized dominating set
G.2.2 Chernoff bounds with variable probabilities
G.2.3 Long runs
G.3 Assignment 3: due Wednesday, 2011-02-23, at 17:00
G.3.1 Longest common subsequence
G.3.2 A strange error-correcting code
G.3.3 A multiway cut
G.4 Assignment 4: due Wednesday, 2011-03-23, at 17:00
G.4.1 Sometimes successful betting strategies are possible
G.4.2 Random walk with reset
G.4.3 Yet another shuffling algorithm
G.5 Assignment 5: due Thursday, 2011-04-07, at 23:59
G.5.1 A reversible chain
G.5.2 Toggling bits
G.5.3 Spanning trees
G.6 Assignment 6: due Monday, 2011-04-25, at 17:00
G.6.1 Sparse satisfying assignments to DNFs
G.6.2 Detecting duplicates
G.6.3 Balanced Bloom filters
G.7 Final exam
G.7.1 Leader election
G.7.2 Two-coloring an even cycle
G.7.3 Finding the maximum
G.7.4 Random graph coloring

H Sample assignments from Spring 2009
H.1 Final exam, Spring 2009
H.1.1 Randomized mergesort (20 points)
H.1.2 A search problem (20 points)
H.1.3 Support your local police (20 points)
H.1.4 Overloaded machines (20 points)

I Probabilistic recurrences
I.1 Recurrences with constant cost functions
I.2 Examples
I.3 The Karp-Upfal-Wigderson bound
I.3.1 Waiting for heads
I.3.2 Quickselect
I.3.3 Tossing coins
I.3.4 Coupon collector
I.3.5 Chutes and ladders
I.4 High-probability bounds
I.4.1 High-probability bounds from expectation bounds
I.4.2 Detailed analysis of the recurrence
I.5 More general recurrences

Bibliography

Index

List of Figures

2.1 Karger’s min-cut algorithm

4.1 Proof of Jensen’s inequality

5.1 Comparison of Chernoff bound variants
5.2 Hypercube network with n = 3

6.1 Tree rotations
6.2 Balanced and unbalanced binary search trees
6.3 Binary search tree after inserting 5 1 7 3 4 6 2
6.4 Inserting values into a treap
6.5 Tree rotation shortens spines
6.6 Skip list

10.1 Drawing a Markov chain as a digraph

13.1 Tricky step in MAX SAT argument

A.1 Non-volatile memory in action

B.1 An incomplete network with n = 2d = 4
B.2 Graph density evolution

D.1 Filling a screen with Space Invaders
D.2 Hidden Space Invaders

E.1 Example of tree contraction for Problem E.3.1

F.1 Radix tree
F.2 Word searches

List of Tables

3.1 Sum of two dice
3.2 Various conditional expectations on two independent dice

5.1 Concentration bounds

7.1 Hash table parameters

List of Algorithms

7.1 Insertion procedure for cuckoo hashing

17.1 Attiya-Censor weak shared coin [AC08]
17.2 Giakkoupis-Woelfel sifter [GW12]
17.3 Sifter using max registers [AE19]

D.1 One pass of bubble sort

F.1 Adaptive hash table insertion

G.1 Dubious duplicate detector
G.2 Randomized max-finding algorithm

Preface

These are notes for the Yale course CPSC 469/569 Randomized Algorithms.
This document also incorporates the lecture schedule and assignments, as
well as some sample assignments from previous semesters. Because this
is a work in progress, it will be updated frequently over the course of the
semester.
Much of the structure of the course follows Mitzenmacher and Upfal’s
Probability and Computing [MU17], with some material from Motwani and
Raghavan’s Randomized Algorithms [MR95]. In most cases you’ll find these
textbooks contain much more detail than what is presented here, so it is
probably better to consider this document a supplement to them than to
treat it as your primary source of information.
The most recent version of these notes will be available at
https://www.cs.yale.edu/homes/aspnes/classes/469/notes.pdf. More stable
archival versions may be found at https://arxiv.org/abs/2003.01902.
I would like to thank my many students and teaching fellows over the
years for their help in pointing out errors and omissions in earlier drafts of
these notes.

Chapter 1

Randomized algorithms

A randomized algorithm flips coins during its execution to determine what
to do next. When considering a randomized algorithm, we usually care
about its expected worst-case performance, which is the average amount
of time it takes on the worst input of a given size. This average is computed
over all the possible outcomes of the coin flips during the execution of the
algorithm. We may also ask for a high-probability bound, showing that
the algorithm doesn’t consume too many resources most of the time.
In studying randomized algorithms, we consider pretty much the same
issues as for deterministic algorithms: how to design a good randomized
algorithm, and how to prove that it works within given time or error bounds.
The main difference is that it is often easier to design a randomized algo-
rithm—randomness turns out to be a good substitute for cleverness more
often than one might expect—but harder to analyze it. So much of what
one does is develop good techniques for analyzing the often very complex
random processes that arise in the execution of an algorithm. Fortunately,
in doing so we can often use techniques already developed by probabilists
and statisticians for analyzing less overtly algorithmic processes.
Formally, we think of a randomized algorithm as a machine M that
computes M (x, r), where x is the problem input and r is the sequence of
random bits. Our machine model is the usual random-access machine or
RAM model, where we have a memory space that is typically polynomial in
the size of the input n, and in constant time we can read a memory location,
write a memory location, or perform arithmetic operations on integers of
up to O(log n) bits.[1] In this model, we may find it easier to think of the
random bits as supplied as needed by some subroutine, where generating
a random integer of size O(log n) takes constant time; the justification for
this assumption is that it takes constant time to read the next O(log n)-sized
value from the random input.

[1] This model is unrealistic in several ways: the assumption that we can perform
arithmetic on O(log n)-bit quantities in constant time omits at least a factor of
Ω(log log n) for addition and probably more for multiplication in any realistic
implementation, while the assumption that we can address n^c distinct locations
in memory in anything less than n^{c/3} time in the worst case requires exceeding
the speed of light. But for reasonably small n, this gives a pretty good
approximation of the performance of real computers, which do in fact perform
arithmetic and access memory in a fixed amount of time, although with fixed
bounds on the size of both arithmetic operands and memory.

Because the number of these various constant-time operations, and thus
the running time for the algorithm as a whole, may depend on the random
bits, it is now a random variable—a function on points in some probability
space. The probability space Ω consists of all possible sequences r, each
of which is assigned a probability Pr[r] (typically 2^{-|r|}), and the running
time for M on some input x is generally given as an expected value[2]
E_r[time(M(x, r))], where for any X,

    E_r[X] = \sum_{r \in \Omega} X(r) \Pr[r].        (1.0.1)

[2] We’ll see more details of these and other concepts from probability theory in
Chapters 2 and 3.

We can now quote the performance of M in terms of this expected value:
where we would say that a deterministic algorithm runs in time O(f(n)),
where n = |x| is the size of the input, we instead say that our randomized
algorithm runs in expected time O(f(n)), which means that
E_r[time(M(x, r))] = O(f(|x|)) for all inputs x.
This is distinct from traditional worst-case analysis, where there is no
r and no expectation, and average-case analysis, where there is again no
r and the value reported is not a maximum but an expectation over some
distribution on x. The following trivial example shows the distinction.

1.1 Searching an array


Suppose we have an array A with n elements, and we are told that one of
the positions in the array contains a nonzero value (the “prize”) while every
other position contains zero. How quickly can we find the prize in the worst
case?
Any deterministic algorithm will probe the array positions in some
predictable order. An adversary can simulate the algorithm running in an
execution where it always sees 0 until it has checked every location. By
putting the prize in the last place the algorithm looks, we get a worst-case
input that requires at least n probes to find it.
In the average case, we might assume that instead of placing the prize in
the worst possible position, it is placed uniformly at random. Now if we just
scan the array from left to right, the expected number of probes will be
    \sum_{i=1}^{n} i \cdot \Pr[\text{algorithm does } i \text{ probes}] = \sum_{i=1}^{n} i \cdot \frac{1}{n}
        = \frac{1}{n} \cdot \frac{n(n+1)}{2}
        = \frac{n+1}{2}.
This is still Θ(n), but we’ve improved the constant factor by almost a factor
of 2. The cost is that we trust the adversary to place the prize randomly—but
there is no reason for the adversary to do this.
The trick to randomized algorithms is that we can obtain the same
expected payoff even in the worst case by supplying the randomness ourselves.
Let’s suppose that instead of scanning the array predictably from left to
right, we flip a coin and either scan left-to-right or scan right-to-left with
probability 1/2 each. If the prize is in position i, then scanning left-to-right
finds it in i probes, while scanning right-to-left finds it in n − i + 1 probes.
The expected cost is thus
    \frac{1}{2} \cdot i + \frac{1}{2} \cdot (n - i + 1) = \frac{n+1}{2},
the same as in the average case. But now we don’t make any assumptions
about the input—by using just one bit of randomness we get this expected
time no matter where the adversary puts the prize.
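
For concreteness, here is a minimal Python sketch of this strategy (our
illustration, not from the original notes; names and the 0-indexed array
convention are ours):

    import random

    def find_prize(a):
        """Scan a left-to-right or right-to-left with probability 1/2 each;
        return the index of the nonzero "prize" and the number of probes."""
        idxs = range(len(a))
        if random.random() < 0.5:      # one fair coin flip
            idxs = reversed(idxs)      # scan right-to-left instead
        probes = 0
        for i in idxs:
            probes += 1
            if a[i] != 0:
                return i, probes

Averaged over the coin flip, each prize position costs (n + 1)/2 probes,
matching the calculation above.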
A natural question is whether a more clever algorithm can do even better.
Usually it’s pretty hard to prove lower bounds on algorithms, but in this
case we can use a classic result called Yao’s lemma [Yao77] to show a
worst-case lower bound on the expected cost of any randomized algorithm
by constructing a probability distribution on inputs that gives a bad average-
case lower bound for any deterministic algorithm. We’ll give a proof of this
in §3.6.1, but the intuition is that any randomized algorithm acts like picking
a deterministic algorithm at random, so if there is an input distribution that
is bad for all deterministic algorithms, it is still bad even if we pick one of
those algorithms randomly.

In this case, the bad input distribution is simple: put the 1 in each
position A[i] with equal probability 1/n. For a deterministic algorithm, there
will be some fixed sequence of positions i_1, i_2, . . . that it examines as long as
it only sees zeros. A smart deterministic algorithm will not examine the same
position twice, so the 1 is equally likely to be found in 1, 2, 3, . . . , n probes. This
gives the same expected (n + 1)/2 probes as for the simple randomized algorithm,
which shows that that algorithm is optimal.
We’ve been talking about searching an array, because that fits best in
our model of an input supplied to the algorithm, but essentially the same
analysis applies to brute-force inverting a black-box function. Here we have
a function f and target output y, and we want to find an input x such that
f (x) = y. The same analysis as for the array case shows that this takes (n + 1)/2
expected evaluations of f assuming that exactly one x works and we can’t
do anything clever.

Curiously, in this case it may be possible to improve this bound to O(√n)
evaluations if somehow we get our hands on a working quantum computer.
We’ll come back to this when we discuss quantum computing in general and
Grover’s algorithm in particular in Chapter 16.

1.2 Verifying polynomial identities


This classic problem is described in [MU17, §1.1]. Here we are given two
products of polynomials and we want to determine if they compute the same
function. For example, we might have

p(x) = (x − 7)(x − 3)(x − 1)(x + 2)(2x + 5)

q(x) = 2x^5 − 13x^4 − 21x^3 + 127x^2 + 121x − 210

These expressions both represent degree-5 polynomials, and it is not
obvious without multiplying out the factors of p whether they are equal or
not. Multiplying out all the factors of p may take as much as O(d^2) time if we
assume integer multiplication takes unit time and do it the straightforward
way.[3] We can do better than this using randomization.

[3] It can be faster if we do something sneaky like use fast Fourier transforms
[SS71].

The trick is that evaluating p(x) and q(x) takes only O(d) integer opera-
tions, and we will find p(x) = q(x) only if either (a) p(x) and q(x) are really
the same polynomial, or (b) x is a root of p(x) − q(x). Since p(x) − q(x)
has degree at most d, it can’t have more than d roots. So if we choose x
uniformly at random from some much larger space, it’s likely that we will not
get a root. Indeed, evaluating p(11) = 112320 and q(11) = 120306 quickly
shows that p and q are not in fact the same.
This is an example of a Monte Carlo algorithm, which is an algorithm
that runs in a fixed amount of time but only gives the right answer some of
the time. (In this case, with probability 1 − d/r, where r is the size of the
range of random integers we choose x from.) Monte Carlo algorithms have
the unnerving property of not indicating when their results are incorrect,
but we can make the probability of error as small as we like by running
the algorithm repeatedly. For this particular algorithm, the probability of
error after k trials is only (d/r)^k, which means that for fixed d/r we need
O(log(1/ε)) iterations to get the error bound down to any given ε. If we are
really paranoid, we could get the error down to 0 by testing d + 1 distinct
values, but now the cost is as high as multiplying out p again.
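
A sketch of the test in Python (our illustration; p and q are passed as
callables, and d bounds the degree of p − q):

    import random

    def probably_equal(p, q, d, r=10**6, trials=20):
        """Monte Carlo identity test: False is always correct; True is
        wrong with probability at most (d/r)**trials."""
        for _ in range(trials):
            x = random.randrange(r)    # r chosen much larger than d
            if p(x) != q(x):
                return False           # x witnesses that p and q differ
        return True                    # no witness found: probably equal

    p = lambda x: (x - 7)*(x - 3)*(x - 1)*(x + 2)*(2*x + 5)
    q = lambda x: 2*x**5 - 13*x**4 - 21*x**3 + 127*x**2 + 121*x - 210
    print(probably_equal(p, q, d=5))   # almost certainly False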
The error for this algorithm is one-sided: if we find a witness to the fact
that p ≠ q, we are done, but if we don’t, then all we know is that we haven’t
found a witness yet. We also have the property that if we check enough
possible witnesses, we are guaranteed to find one.
A similar property holds in the classic Miller-Rabin primality test,
a randomized algorithm for determining whether a large integer is prime
or not.[4] The original version, due to Gary Miller [Mil76], showed that, as
in polynomial identity testing, it might be sufficient to pick a particular
set of deterministic candidate witnesses. Unfortunately, this result depends
on the truth of the extended Riemann hypothesis, a notoriously difficult
open problem in number theory. Michael Rabin [Rab80] demonstrated that
choosing random witnesses was enough, if we were willing to accept a small
probability of incorrectly identifying a composite number as prime.
For many years it was open whether it was possible to test primality
deterministically in polynomial time without unproven number-theoretic
assumptions, and the randomized Miller-Rabin algorithm was one of the
most widely-used randomized algorithms for which no good deterministic
alternative was known. Eventually, Agrawal et al. [AKS04] demonstrated
how to test primality deterministically using a different technique, although
the cost of their algorithm is high enough that Miller-Rabin is still used in
practice.

[4] We will not describe this algorithm here.

1.3 Randomized QuickSort


The QuickSort algorithm [Hoa61a] works as follows. For simplicity, we
assume that no two elements of the array being sorted are equal.

• If the array has > 1 elements,

  – Pick a pivot p uniformly at random from the elements of the
    array.
  – Split the array into A1 and A2, where A1 contains all elements
    < p and A2 contains all elements > p.
  – Sort A1 and A2 recursively and return the sequence A1, p, A2.

• Otherwise return the array.
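
Here is a direct Python sketch of this procedure (illustrative only; it
returns a new sorted list rather than sorting in place):

    import random

    def quicksort(a):
        if len(a) <= 1:
            return list(a)
        p = random.choice(a)                # pivot chosen uniformly at random
        a1 = [x for x in a if x < p]        # elements less than the pivot
        a2 = [x for x in a if x > p]        # elements greater than the pivot
        return quicksort(a1) + [p] + quicksort(a2)

The two list comprehensions together examine each non-pivot element
against the pivot, which is the splitting cost discussed next.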

The splitting step takes exactly n − 1 comparisons, since we have to check
each non-pivot against the pivot. We assume all other costs are dominated by
the cost of comparisons. How many comparisons does randomized QuickSort
do on average?
There are two ways to solve this problem: the dumb way and the smart
way. We’ll do it the dumb way now and save the smart way for §1.3.2.

1.3.1 Brute force method: solve the recurrence


Let T (n) be the expected number of comparisons done on an array of n
elements. We have T (0) = T (1) = 0 and for larger n,

    T(n) = (n-1) + \frac{1}{n} \sum_{k=0}^{n-1} (T(k) + T(n-1-k)).        (1.3.1)

Why? Because we do (n − 1) comparisons to split the piles, there are n
equally-likely choices for our pivot (hence the 1/n), and for each choice the
expected cost of the recursive sorts is T(k) + T(n − 1 − k), where k is the
number of elements that land in A1. Formally, we are using here the law of
total probability, which says that for any random variable X and partition
of the probability space into events B_1, . . . , B_n,

    E[X] = \sum_i \Pr[B_i] \, E[X | B_i],

where

    E[X | B_i] = \sum_{\omega \in B_i} X(\omega) \frac{\Pr[\omega]}{\Pr[B_i]}

is the conditional expectation of X conditioned on B_i, which we can
think of as just the average value of X if we know that B_i occurred. (See
§2.3.2 for more details.)
So now we just have to solve this ugly recurrence. We can reasonably
guess that when n ≥ 1, T (n) ≤ an log n for some constant a. Clearly this
holds for n = 1. Now apply induction on larger n to get

    T(n) = (n-1) + \frac{1}{n} \sum_{k=0}^{n-1} (T(k) + T(n-1-k))
         = (n-1) + \frac{2}{n} \sum_{k=0}^{n-1} T(k)
         = (n-1) + \frac{2}{n} \sum_{k=1}^{n-1} T(k)
         \le (n-1) + \frac{2}{n} \sum_{k=1}^{n-1} a k \log k
         \le (n-1) + \frac{2}{n} \int_{1}^{n} a k \log k \, dk
         = (n-1) + \frac{2a}{n} \left( \frac{n^2 \log n}{2} - \frac{n^2}{4} + \frac{1}{4} \right)
         = (n-1) + a n \log n - \frac{an}{2} + \frac{a}{2n}.
If we squint carefully at this recurrence for a while we notice that setting
a = 2 makes this less than or equal to an log n, since the remaining terms
become (n − 1) − n + 1/n = 1/n − 1, which is non-positive for n ≥ 1. We can
thus confidently conclude that T (n) ≤ 2n log n (for n ≥ 1).
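
Here is a quick numerical check of this bound (an illustration of ours,
not part of the proof): tabulate T(n) directly from recurrence (1.3.1) and
compare it with 2n ln n:

    import math

    def comparisons(nmax):
        """Tabulate T(n) = (n-1) + (2/n) * sum(T(k) for k < n)."""
        t = [0.0] * (nmax + 1)              # T(0) = T(1) = 0
        for n in range(2, nmax + 1):
            t[n] = (n - 1) + (2 / n) * sum(t[:n])
        return t

    t = comparisons(1000)
    for n in (10, 100, 1000):
        print(n, t[n], 2 * n * math.log(n))  # T(n) stays below 2n ln n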

1.3.2 Clever method: use linearity of expectation


Alternatively, we can use linearity of expectation (which we’ll discuss
further in §3.4.1) to compute the expected number of comparisons used by
randomized QuickSort.
Imagine we use the following method for choosing pivots: we generate a
random permutation of all the elements in the array, and when asked to sort
some subarray A′, we use as pivot the first element of A′ that appears in our
list. Since each element is equally likely to be first, this is equivalent to the
actual algorithm. Pretend that we are always sorting the numbers 1 . . . n and
define for each pair of elements i < j the indicator variable X_ij to be 1 if
i is compared to j at some point during the execution of the algorithm and 0
otherwise. Amazingly, we can actually compute the probability of this event
(and thus E[X_ij]): the only time i and j are compared is if one of them is
chosen as a pivot before they are split up into different arrays. How do they
get split up into different arrays? This happens if some intermediate element
k is chosen as pivot first, that is, if some k with i < k < j appears in the
permutation before both i and j. Occurrences of other elements don’t affect
the outcome, so we can concentrate on the restriction of the permutations
to just the numbers i through j, and we win if this restricted permutation
starts with either i or j. This event occurs with probability 2/(j − i + 1), so
we have E[X_ij] = 2/(j − i + 1). Summing over all pairs i < j gives:

    E\left[ \sum_{i<j} X_{ij} \right] = \sum_{i<j} E[X_{ij}]
        = \sum_{i<j} \frac{2}{j-i+1}
        = \sum_{i=1}^{n-1} \sum_{k=2}^{n-i+1} \frac{2}{k}
        = \sum_{i=2}^{n} \sum_{k=2}^{i} \frac{2}{k}
        = \sum_{k=2}^{n} \frac{2(n-k+1)}{k}
        = \sum_{k=2}^{n} \left( \frac{2(n+1)}{k} - 2 \right)
        = \sum_{k=2}^{n} \frac{2(n+1)}{k} - 2(n-1)
        = 2(n+1)(H_n - 1) - 2(n-1)
        = 2(n+1)H_n - 4n.

Here H_n = \sum_{i=1}^{n} 1/i is the n-th harmonic number, equal to ln n + γ +
O(n^{-1}), where γ ≈ 0.5772 is the Euler-Mascheroni constant (whose exact
value is unknown!). For asymptotic purposes we only need H_n = Θ(log n).
For the first step we are taking advantage of the fact that linearity of
expectation doesn’t care about the variables not being independent. The
rest is just algebra.
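
As a sanity check (ours, not from the notes), we can compare the exact
formula 2(n + 1)H_n − 4n with an empirical average of comparison counts,
counting one comparison per non-pivot element at each split as in the text:

    import random

    def count_comparisons(a):
        """Comparisons made by one run of randomized QuickSort on a."""
        if len(a) <= 1:
            return 0
        p = random.choice(a)
        a1 = [x for x in a if x < p]
        a2 = [x for x in a if x > p]
        return (len(a) - 1) + count_comparisons(a1) + count_comparisons(a2)

    n = 200
    h_n = sum(1 / i for i in range(1, n + 1))
    print(2 * (n + 1) * h_n - 4 * n)                # exact expectation
    print(sum(count_comparisons(list(range(n)))
              for _ in range(1000)) / 1000)         # empirical average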
This is pretty close to the bound of 2n log n we computed using the
recurrence in §1.3.1. Given that we now know the exact answer, we could in
principle go back and use it to solve the recurrence exactly.[5]

[5] We won’t.
Which way is better? Solving the recurrence requires less probabilistic
handwaving (a more polite term might be “insight”) but more grinding
out inequalities, which is a pretty common trade-off. Since I am personally
not very clever I would try the brute-force approach first. But it’s worth
knowing about better methods so you can try them in other situations.

1.4 Where does the randomness come from?


Typically we assume in analyzing a randomized algorithm that we have
access to genuinely random bits r, and don’t ask too carefully how we can
get these random bits. For practical applications, there are basically three
choices, from strongest (and most expensive) to weakest (and cheapest):

1. Physical randomness. Lock a cat in a box with a radioactive source
and a Geiger counter attached to a solenoid aimed at a vial of prussic
acid [Sch35]. Come back in an hour and check if the cat is still breathing.
With an appropriately-tuned radioactive source, this generates one
very unpredictable bit of randomness at the cost of one half of a cat
on average.[6]

[6] Expected cost may be higher if the observer forgets their gas mask.

There are cheaper variants, mostly involving amplified quantum noise
at the high end, or monitoring physical processes that are expected
to be somewhat random (like intervals between keyboard presses or
seek times of hard drive heads) at the low end. In each case you get a
sequence of random bits that, under plausible physical assumptions,
are effectively unpredictable.
The /dev/random device in Linux systems gives you access to the cheap
kind of physical randomness, and will block waiting for you to wave
your mouse around if it runs out.

2. Cryptographically-secure pseudorandomness. Find some function
that spits out a sequence of random-looking values given a seed,
such that if the seed is chosen uniformly at random (say using a physical
random number generator), no polynomial-time program can distin-
guish the sequence from an actual random sequence with non-trivial
probability. Usually expensive, but if your polynomial-time randomized
algorithm fails using a CPRNG, you’ve succeeded in breaking its
cryptographic assumptions.
The /dev/urandom device in Linux systems gives you this, based on a
seed derived from the same sources as /dev/random.

3. Statistical pseudorandomness. As above, but choose a function
that is not cryptographically secure but merely passes common statis-
tical tests for things like k-wise independence of consecutive outputs.
Which function to choose is largely a matter of convenience and fashion
(although PRNGs in older standard libraries can be very, very bad).
The advantage is that many PRNGs are very fast and will not slow
your program down. The disadvantage is that you are relying on your
program not doing anything that exposes the weakness of the PRNG.
The random function in the standard C library is an example of this,
although maybe not a good example. The cool kids mostly use Mersenne
Twister [MN98].

One advantage of pseudorandom generators is that they allow for debugging:
if you run your program twice with the same seed, it will do the same
thing. This was in fact an argument made by von Neumann against using
physical randomness back in the old days [VN63].
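
In Python, for example, both flavors are a few lines (the standard random
module happens to be a Mersenne Twister, while SystemRandom draws from the
operating system’s entropy pool, /dev/urandom on Linux):

    import random

    rng = random.Random(12345)           # statistical PRNG (Mersenne Twister);
    print(rng.random(), rng.random())    # rerunning with seed 12345 replays
                                         # exactly the same sequence

    srng = random.SystemRandom()         # OS entropy source; takes no seed,
    print(srng.random())                 # so runs are not reproducible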
In practice, most people just use whatever PRNG is ready to hand, and
hope it works. As theorists, we will ignore this issue, and assume that our
random inputs are actually random.

1.5 Classifying randomized algorithms


Different randomized algorithms make different guarantees about the likelihood
of getting a correct output or even the possibility of recognizing a correct
output. These different guarantees have established names in the randomized
algorithms literature and correspond to various complexity classes in
complexity theory.

1.5.1 Las Vegas vs Monte Carlo


One difference between QuickSort and polynomial equality testing is that
QuickSort always succeeds, but may run for longer than you expect; while
the polynomial equality tester always runs in a fixed amount of time, but may
produce the wrong answer. These are examples of two classes of randomized
algorithms, which were originally named by László Babai [Bab79]:[7]
• A Las Vegas algorithm fails with some probability, but we can tell
when it fails. In particular, we can run it again until it succeeds, which
means that we can eventually succeed with probability 1 (but with a
potentially unbounded running time). Alternatively, we can think of a
Las Vegas algorithm as an algorithm that runs for an unpredictable
amount of time but always succeeds (we can convert such an algorithm
back into one that runs in bounded time by declaring that it fails if it
runs too long—a condition we can detect). QuickSort is an example of
a Las Vegas algorithm.
• A Monte Carlo algorithm fails with some probability, but we can’t
tell when it fails. If the algorithm produces a yes/no answer and the
failure probability is significantly less than 1/2, we can reduce the
probability of failure by running it many times and taking a majority of
the answers, as in the sketch below. The polynomial equality-testing
algorithm is an example of a Monte Carlo algorithm.
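Here is a minimal sketch of that majority-vote amplification in Python; the
predicate noisy_is_even is a made-up stand-in for a Monte Carlo algorithm
whose error probability is below 1/2:

    import random

    def amplify(monte_carlo, x, trials=101):
        # Run a yes/no Monte Carlo algorithm many times and take a
        # majority vote; if each run errs with probability p < 1/2, the
        # majority errs with probability exponentially small in trials.
        yes_votes = sum(monte_carlo(x) for _ in range(trials))
        return yes_votes > trials // 2

    def noisy_is_even(n):
        # Toy stand-in: answers correctly with probability only 0.6.
        correct = (n % 2 == 0)
        return correct if random.random() < 0.6 else not correct

    print(amplify(noisy_is_even, 10))   # True except with tiny probability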
The heuristic for remembering which class is which is that the names
were chosen to appeal to English speakers: in Las Vegas, the dealer can tell
you whether you’ve won or lost, but in Monte Carlo, le croupier ne parle que
Français, so you have no idea what he’s saying.
Generally, we prefer Las Vegas algorithms, because we like knowing
when we have succeeded. But sometimes we have to settle for Monte Carlo
algorithms, which can still be useful if we can get the probability of failure
small enough. For example, any time we try to estimate an average by
sampling (say, inputs to a function we are trying to integrate or political
views of voters we are trying to win over) we are running a Monte Carlo
algorithm: there is always some possibility that our sample is badly non-
representative, but we can’t tell if we got a bad sample unless we already
know the answer we are looking for.

1.5.2 Randomized complexity classes


Las Vegas vs Monte Carlo is the typical distinction made by algorithm
designers, but complexity theorists have developed more elaborate classifi-
cations. These include algorithms with “one-sided” failure properties. For
these algorithms, we never get a bogus “yes” answer but may get a bogus
“no” answer (or vice versa). This gives us several complexity classes that act
like randomized versions of NP, co-NP, etc.:
7
To be more precise, Babai defined Monte Carlo algorithms based on the properties of
Monte Carlo simulation, a technique dating back to the Manhattan project. The term
Las Vegas algorithm was new.

• The class R or RP (randomized P) consists of all languages L for which
a polynomial-time Turing machine M exists such that if x ∈ L, then
Pr [M (x, r) = 1] ≥ 1/2 and if x ∉ L, then Pr [M (x, r) = 1] = 0. In
other words, we can find a witness that x ∈ L with constant probability.
This is the randomized analog of NP (but it’s much more practical,
since with NP the probability of finding a winning witness may be
exponentially small).

• The class co-R consists of all languages L for which a poly-time Turing
machine M exists such that if x ∉ L, then Pr [M (x, r) = 1] ≥ 1/2 and
if x ∈ L, then Pr [M (x, r) = 1] = 0. This is the randomized analog of
co-NP.

• The class ZPP (zero-error probabilistic P ) is defined as RP ∩ co-RP.
If we run both our RP and co-RP machines for polynomial time, we
learn the correct classification of x with probability at least 1/2. The
rest of the time we learn only that we’ve failed (because both machines
return 0, telling us nothing). This is the class of (polynomial-time)
Las Vegas algorithms. The reason it is called “zero-error” is that we
can equivalently define it as the problems solvable by machines that
always output the correct answer eventually, but only run in expected
polynomial time.

• The class BPP (bounded-error probabilistic P) consists of all languages
L for which a poly-time Turing machine exists such that if x ∉ L,
then Pr [M (x, r) = 1] ≤ 1/3, and if x ∈ L, then Pr [M (x, r) = 1] ≥
2/3. These are the (polynomial-time) Monte Carlo algorithms: if our
machine answers 0 or 1, we can guess whether x ∈ L or not, but we
can’t be sure.

• The class PP (probabilistic P) consists of all languages L for which a
poly-time Turing machine exists such that if x ∈ L, then Pr [M (x, r) = 1] >
1/2, and if x ∉ L, then Pr [M (x, r) = 1] ≤ 1/2. Since there is only an
exponentially small gap between the two probabilities, such algorithms
are not really useful in practice; PP is mostly of interest to complexity
theorists.
Assuming we have a source of random bits, any algorithm in RP, co-RP,
ZPP, or BPP is good enough for practical use. We can usually even get
away with using a pseudorandom number generator, and there are plausible
reasons to suspect that in fact every one of these classes is equal to P.

1.6 Classifying randomized algorithms by their methods
We can also classify randomized algorithms by how they use their randomness
to solve a problem. Some very broad categories:8

• Avoiding worst-case inputs, by hiding the details of the algorithm
from the adversary. Typically we assume that an adversary supplies
our input. If the adversary can see what our algorithm is going to
do (for example, he knows which door we will open first), he can use
this information against us. By using randomness, we can replace our
predictable deterministic algorithm by what is effectively a random
choice of many different deterministic algorithms. Since the adversary
doesn’t know which algorithm we are using, he can’t (we hope) pick
an input that is bad for all of them.

• Sampling. Here we use randomness to find an example or examples
of objects that are likely to be typical of the population they are drawn
from, either to estimate some average value (pretty much the basis of
all of statistics) or because a typical element is useful in our algorithm
(for example, when picking the pivot in QuickSort). Randomization
means that the adversary can’t direct us to non-representative samples.

• Hashing. Hashing is the process of assigning a large object x a small
name h(x) by feeding it to a hash function h. Because the names are
small, the Pigeonhole Principle implies that many large objects hash to
the same name (a collision). If we have few objects that we actually
care about, we can avoid collisions by choosing a hash function that
happens to map them to different places. Randomization helps here
by keeping the adversary from choosing the objects after seeing what
our hash function is.
Hashing techniques are used both in load balancing (e.g., ensuring
that most cells in a hash table hold only a few objects) and in finger-
printing (e.g., using a cryptographic hash function to record a
fingerprint of a file, so that we can detect when it has been modified).
8
These are largely adapted from the introduction to [MR95].

• Building random structures. The probabilistic method shows
the existence of structures with some desired property (often graphs
with interesting properties, but there are other places where it can be
used) by showing that a randomly-generated structure in some class
has a nonzero probability of having the property we want. If we can
beef the probability up to something substantial, we get a randomized
algorithm for generating these structures.

• Symmetry breaking. In distributed algorithms involving multi-
ple processes, progress may be stymied by all the processes trying to
do the same thing at the same time (this is an obstacle, for example,
in leader election, where we want only one process to declare itself
the leader). Randomization can break these deadlocks.
Chapter 2

Probability theory

In this chapter, we summarize the parts of probability theory that we
will use in the course. This is not really a substitute for reading an actual
probability theory book like Feller [Fel68] or Grimmett and Stirzaker [GS01],
but the hope is that it’s enough to get by.
The basic idea of probability theory is that we want to model all possible
outcomes of whatever process we are studying simultaneously. This gives
the notion of a probability space, which is the set of all possible outcomes;
for example, if we roll two dice, the probability space would consist of all 36
possible combinations of values. Subsets of this space are called events; an
example in the two-dice space would be the event that the sum of the two
dice is 11, given by the set A = {⟨5, 6⟩ , ⟨6, 5⟩}. The probability of an event
A is given by a probability measure Pr [A]; for simple probability spaces,
this is just the sum of the probabilities of the individual outcomes contained
in A, while for more general spaces, we define the measure on events first
and the probabilities of individual outcomes are derived from this. Formal
definitions of all of these concepts are given later in this chapter.
When analyzing a randomized algorithm, the probability space describes
all choices of the random bits used by the algorithm, and we can think of the
possible executions of an algorithm as living within this probability space.
More formally, the sequence of operations carried out by an algorithm and the
output it ultimately produces are examples of random variables—functions
from a probability space to some other set—which we will discuss in detail
in Chapter 3.


2.1 Probability spaces and events


A discrete probability space is a countable set Ω of points or outcomes
ω. Each ω in Ω has a probability Pr [ω], which is a real value with 0 ≤
Pr [ω] ≤ 1. It is required that Σ_{ω∈Ω} Pr [ω] = 1.
An event A is a subset of Ω; its probability is Pr [A] = Σ_{ω∈A} Pr [ω].
We require that Pr [Ω] = 1, and it is immediate from the definition that
Pr [∅] = 0.
The complement Ā or ¬A of an event A is the event Ω − A. It is always
the case that Pr [¬A] = 1 − Pr [A].
This fact is a special case of the general principle that if A1 , A2 , . . . forms
a partition of Ω—that is, if Ai ∩ Aj = ∅ when i ≠ j and ∪ Ai = Ω—then
Σ Pr [Ai ] = Pr [Ω] = 1. It happens that ¬A and A form a partition of Ω
consisting of exactly two elements.
Even more generally, if A1 , A2 , . . . are disjoint events (that is, if Ai ∩ Aj =
∅ whenever i ≠ j), then Pr [∪ Ai ] = Σ Pr [Ai ]. This fact does not hold in
general for events that are not disjoint.


For discrete probability spaces, all of these facts can be proven directly
from the definition of probabilities for events. For more general probability
spaces, it’s no longer possible to express the probability of an event as the
sum of the probabilities of its elements, and we adopt an axiomatic approach
instead.

2.1.1 General probability spaces


More general probability spaces consist of a triple (Ω, F, Pr) where Ω is
a set of points, F is a σ-algebra (a family of subsets of Ω that contains
Ω and is closed under complement and countable unions) of measurable
sets, and Pr is a function from F to [0, 1] that gives Pr [Ω] = 1 and satisfies
countable additivity: when A1 , . . . are disjoint, Pr [∪ Ai ] = Σ Pr [Ai ].
This definition is needed for uncountable spaces, because (under certain set-
theoretic assumptions) we may not be able to assign a meaningful probability
to all subsets of Ω.
Formally, this definition is often presented as three axioms of proba-
bility, due to Kolmogorov [Kol33]:

1. Pr [A] ≥ 0 for all A ∈ F.

2. Pr [Ω] = 1.

3. For any countable collection of disjoint events A1 , A2 , . . . ,

Pr [∪_i Ai ] = Σ_i Pr [Ai ] .

It’s not hard to see that the discrete probability spaces defined in the
preceding section satisfy these axioms.
General probability spaces arise in randomized algorithms when we have
an algorithm that might consume an unbounded number of random bits.
The problem now is that an outcome consists of countable sequence of bits,
and there are uncountably many such outcomes. The solution is to consider
as measurable events only those sets with the property that membership
in them can be determined after a finite amount of time. Formally, the
probability space Ω is the set {0, 1}^ℕ of all countably infinite sequences of 0
and 1 values indexed by the natural numbers, and the measurable sets F
are all sets that can be generated by countable unions1 of cylinder sets,
where a cylinder set consists of all extensions xy of some finite prefix x. The
probability measure itself is obtained by assigning the set of all points that
start with x the probability 2^{−|x|} , and computing the probabilities of other
sets from the axioms.2
An oddity that arises in general probability spaces is that it may be that every
particular outcome has probability zero but their union has probability 1.
For example, the probability of any particular infinite string of bits is 0, but
the set containing all such strings is the entire space and has probability
1. This is where the fact that probabilities only add over countable unions
comes in.
Most randomized algorithms books gloss over general probability spaces,
with three good reasons. The first is that if we truncate an algorithm after
a finite number of steps, we usually get back to a discrete probability
space, which avoids a lot of worrying about measurability and convergence.
The second is that we are often implicitly working in a probability space that
is either discrete or well-understood (like the space of bit-vectors described
above). The last is that the Kolmogorov extension theorem says that
1
As well as complements and countable intersections. However, it is not hard to show
that sets defined using these operations can be reduced to countable unions of cylinder
sets.
2
This turns out to give the same probabilities as if we consider each outcome as a
real number in the interval [0, 1] and use Lebesgue measure to compute the probability of
events. For some applications, thinking of our random values as real numbers (or even
sequences of real numbers) can make things easier: consider for example what happens
when we want to choose one of three outcomes with equal probability.

if we specify Pr [A1 ∩ A2 ∩ · · · ∩ Ak ] consistently for all finite sets of events
{A1 . . . Ak }, then there exists some probability space that makes these prob-
abilities work, even if we have uncountably many such events. So it’s usually
enough to specify how the events we care about interact, without worrying
about the details of the underlying space.

2.2 Boolean combinations of events


Even though events are defined as sets, we often think of them as representing
propositions that we can combine using the usual Boolean operations of NOT
(¬), AND (∧), and OR (∨). In terms of sets, these correspond to taking a
complement Ā = Ω \ A, an intersection A ∩ B, or a union A ∪ B.
We can use the axioms of probability to calculate the probability of Ā:

Lemma 2.2.1.
Pr [Ā] = 1 − Pr [A] .

Proof. First note that A ∩ Ā = ∅, so A ∪ Ā = Ω is a disjoint union of
countably many3 events. This gives Pr [A] + Pr [Ā] = Pr [Ω] = 1.

For example, if our probability space consists of the six outcomes of a fair
die roll, and A = [outcome is 3] with Pr [A] = 1/6, then Pr [outcome is not 3] =
Pr [Ā] = 1 − 1/6 = 5/6. Though this example is trivial, using the formula
does save us from having to add up the five cases where we don’t get 3.
If we want to know the probability of A ∩ B, we need to know more
about the relationship between A and B. For example, it could be that
A and B are both events representing a fair coin coming up heads, with
Pr [A] = Pr [B] = 1/2. The probability of A ∩ B could be anywhere between
1/2 and 0:

• For ordinary fair coins, we’d expect that half the time that A happens,
B also happens. This gives Pr [A ∩ B] = (1/2) · (1/2) = 1/4. To
make this formal, we might define our probability space Ω as having
four outcomes HH, HT, TH, and TT, each of which occurs with equal
probability.

• But maybe A and B represent the same fair coin: then A ∩ B = A and
Pr [A ∩ B] = Pr [A] = 1/2.
3
Countable need not be infinite, so 2 is countable.

• At the other extreme, maybe A and B represent two fair coins welded
together so that if one comes up heads the other comes up tails. Now
Pr [A ∩ B] = 0.

• With a little bit of tinkering, we could also find probabilities for
the outcomes in our four-outcome space to make Pr [A] = Pr [HH] +
Pr [HT] = 1/2 and Pr [B] = Pr [HH] + Pr [TH] = 1/2 while setting
Pr [A ∩ B] = Pr [HH] to any value between 0 and 1/2.

The difference between the nice case where Pr [A ∩ B] equals 1/4 and
the other, more annoying cases where it doesn’t is that in the first case we
have assumed that A and B are independent, which is defined to mean
that Pr [A ∩ B] = Pr [A] Pr [B].
In the real world, we expect events to be independent if they refer to
parts of the universe that are not causally related: if we flip two coins that
aren’t glued together somehow, then we assume that the outcomes of the
coins are independent. But we can also get independence from events that
are not causally disconnected in this way. An example would be if we rolled
a fair four-sided die labeled HH, HT, TH, TT, where we take the first letter
as representing A and the second as B.
There’s no simple formula for Pr [A ∪ B] when A and B are not disjoint,
even for independent events, but we can compute the probability by splitting
up into smaller, disjoint events and using countable additivity:
Pr [A ∪ B] = Pr [(A ∩ B) ∪ (A ∩ B̄) ∪ (Ā ∩ B)]
    = Pr [A ∩ B] + Pr [A ∩ B̄] + Pr [Ā ∩ B]
    = (Pr [A ∩ B] + Pr [A ∩ B̄]) + (Pr [Ā ∩ B] + Pr [A ∩ B]) − Pr [A ∩ B]
    = Pr [A] + Pr [B] − Pr [A ∩ B] .

The idea is that we can compute Pr [A ∪ B] by adding up the individual
probabilities and then subtracting off the part where we counted the event
twice.
This is a special case of the general inclusion-exclusion formula, which
says:

Lemma 2.2.2. For any finite sequence of events A1 . . . An ,

Pr [∪_{i=1}^{n} Ai ] = Σ_i Pr [Ai ] − Σ_{i<j} Pr [Ai ∩ Aj ] + Σ_{i<j<k} Pr [Ai ∩ Aj ∩ Ak ] − . . .
    = Σ_{S⊆{1...n},S≠∅} (−1)^{|S|+1} Pr [∩_{i∈S} Ai ] .        (2.2.1)

Proof. Partition Ω into 2^n disjoint events BT , where BT = (∩_{i∈T} Ai ) ∩
(∩_{i∉T} Āi ) is the event that all Ai occur for i in T and no Ai occurs for i not
in T . Then Ai is the union of all BT with T ∋ i and ∪ Ai is the union of all
BT with T ≠ ∅.
That the right-hand side gives the probability of this event is a sneaky con-
sequence of the binomial theorem, and in particular the fact that Σ_{i=1}^{n} (−1)^i =
Σ_{i=0}^{n} (−1)^i − 1 = (1 − 1)^n − 1 is −1 if n > 0 and 0 if n = 0. Using this fact
after rewriting the right-hand side using the BT events gives

Σ_{S⊆{1...n},S≠∅} (−1)^{|S|+1} Pr [∩_{i∈S} Ai ] = Σ_{S⊆{1...n},S≠∅} (−1)^{|S|+1} Σ_{T⊇S} Pr [BT ]
    = Σ_{T⊆{1...n}} (Σ_{S⊆T,S≠∅} (−1)^{|S|+1} ) Pr [BT ]
    = Σ_{T⊆{1...n}} (− Σ_{i=1}^{|T|} (|T| choose i) (−1)^i ) Pr [BT ]
    = Σ_{T⊆{1...n}} (− ((1 − 1)^{|T|} − 1)) Pr [BT ]
    = Σ_{T⊆{1...n}} (1 − 0^{|T|}) Pr [BT ]
    = Σ_{T⊆{1...n},T≠∅} Pr [BT ]
    = Pr [∪_{i=1}^{n} Ai ] .
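For readers who like to check such identities mechanically, here is a small
Python computation verifying (2.2.1) on a made-up example (twelve equally
likely outcomes and three overlapping events):

    from itertools import combinations

    events = [set(range(0, 8)), set(range(4, 10)), set(range(6, 12))]

    def pr(A):
        # Probability of an event over twelve equally likely outcomes.
        return len(A) / 12

    lhs = pr(set.union(*events))
    rhs = sum((-1) ** (k + 1) * pr(set.intersection(*S))
              for k in range(1, len(events) + 1)
              for S in combinations(events, k))
    print(lhs, rhs)   # both print 1.0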

2.3 Conditional probability


The probability of A conditioned on B or probability of A given B,
written Pr [A | B], is defined by

Pr [A | B] = Pr [A ∩ B] / Pr [B] ,        (2.3.1)

provided Pr [B] ≠ 0. If Pr [B] = 0, we can’t condition on B.


Such conditional probabilities represent the effect of restricting our prob-
ability space to just B, which we can think of as computing the probability of
each event if we know that B occurs. The intersection in the numerator
limits A to circumstances where B occurs, while the denominator normalizes
the probabilities so that, for example, Pr [Ω | B] = Pr [B | B] = 1.

2.3.1 Conditional probability and independence


Rearranging (2.3.1) gives Pr [A ∩ B] = Pr [B] Pr [A | B] = Pr [A] Pr [B | A].
In many cases, knowing that B occurs tells us nothing about whether A
occurs; if so, we have Pr [A | B] = Pr [A], which implies that Pr [A ∩ B] =
Pr [A | B] Pr [B] = Pr [A] Pr [B]—events A and B are independent. So
Pr [A | B] = Pr [A] gives an alternative criterion for independence when
Pr [B] is nonzero.4
A set of events A1 , A2 , . . . is independent if Ai is independent of B when
B is any Boolean formula of the Aj for j ≠ i. The idea is that you can’t
predict Ai by knowing anything about the rest of the events.
A set of events A1 , A2 , . . . is pairwise independent if each Ai and Aj
are independent when i ≠ j. It is possible for a set of events to be pairwise
independent but not independent; a simple example is when A1 and A2 are
the events that two independent coins come up heads and A3 is the event
that both coins come up with the same value. The general version of pairwise
independence is k-wise independence, which means that any subset of k
(or fewer) events are independent.

2.3.2 Conditional probability and the law of total probability


The reason we like conditional probability in algorithm analysis is that it
gives us a natural way to model the kind of case analysis that we are used
to applying to deterministic algorithms. Suppose we are trying to prove that
a randomized algorithm works (event A) with a certain probability. Most
4
If Pr [B] is zero, then A and B are always independent.

likely, the first random thing the algorithm does is flip a coin, giving two
possible outcomes B and B̄. Countable additivity tells us that Pr [A] =
Pr [A ∩ B] + Pr [A ∩ B̄], which we can rewrite using conditional probability
as

Pr [A] = Pr [A | B] Pr [B] + Pr [A | B̄] Pr [B̄] ,        (2.3.2)

a special case of the law of total probability.


What’s nice about this expression is that we can often compute Pr [A | B]
and Pr [A | B̄] by looking at what the algorithm does starting from the
point where it has just gotten heads (B) or tails (B̄), and use the formula to
combine these values to get the overall probability of success.
For example, if

Pr [class occurs | snow] = 3/5,
Pr [class occurs | no snow] = 99/100, and
Pr [snow] = 1/10,

then

Pr [class occurs] = (3/5) · (1/10) + (99/100) · (1 − 1/10) = 0.951.

More generally, we can do the same computation for any partition of Ω
into countably many disjoint events Bi :

Pr [A] = Pr [∪_i (A ∩ Bi )]
    = Σ_i Pr [A ∩ Bi ]
    = Σ_{i, Pr[Bi ]≠0} Pr [A | Bi ] Pr [Bi ] ,        (2.3.3)

which is the law of total probability. Note that the last step works for
each term only if Pr [A | Bi ] is well-defined, meaning that Pr [Bi ] ≠ 0. But
any case where Pr [Bi ] = 0 also has Pr [A ∩ Bi ] = 0, so we get the correct
answer if we simply omit these terms from both sums.
A special case arises when Pr [A | B̄] = 0, which occurs, for example,
if A ⊆ B. Then we just have Pr [A] = Pr [A | B] Pr [B]. If we consider an

event A = A1 ∩ A2 ∩ · · · ∩ Ak , then we can iterate this expansion to get

Pr [A1 ∩ A2 ∩ · · · ∩ Ak ] = Pr [A1 ∩ · · · ∩ Ak−1 ] Pr [Ak | A1 , . . . , Ak−1 ]
    = Pr [A1 ∩ · · · ∩ Ak−2 ] Pr [Ak−1 | A1 , . . . , Ak−2 ] Pr [Ak | A1 , . . . , Ak−1 ]
    = ...
    = Π_{i=1}^{k} Pr [Ai | A1 , . . . , Ai−1 ] .        (2.3.4)

Here Pr [A | B, C, . . .] is short-hand for Pr [A | B ∩ C ∩ . . .], the probability
that A occurs given that all of B, C, etc., occur.

2.3.3 Examples
Here we have some examples of applying conditional probability to algorithm
analysis. Mostly we will be using some form of the law of total probability.

2.3.3.1 Racing coin-flips


Suppose that I flip coins and allocate a space for each heads that I get before
the coin comes up tails. Suppose that you then supply me with objects (each
of which takes up one unit of space), one for each heads that you get before
you get tails. What are my chances of allocating enough space?
Let’s start by solving this directly using the law of total probability. Let
Ai be the event that I allocate i spaces. The event Ai is the intersection of
i independent events that I get heads in the first i positions and the event
that I get tails in position i + 1; this multiplies out to (1/2)^{i+1} . Let Bi be
the similar event that you supply i objects. Let W be the event that I win.
To make the Ai partition the space, we must also add an extra event A∞
equal to the singleton set {HHHHHHH . . .} consisting of the all-H sequence;
this has probability 0 (so it won’t have much of an effect), but we need to
include it since HHHHHHH . . . is not contained in any of the other Ai .

We can compute

Pr [W | Ai ] = Pr [B0 ∪ B1 ∪ · · · ∪ Bi | Ai ]
    = Pr [B0 ∪ B1 ∪ · · · ∪ Bi ]
    = Pr [B0 ] + Pr [B1 ] + · · · + Pr [Bi ]
    = Σ_{j=0}^{i} (1/2)^{j+1}
    = (1/2) · (1 − (1/2)^{i+1}) / (1 − 1/2)
    = 1 − (1/2)^{i+1} .        (2.3.5)

The clean form of this expression suggests strongly that there is a better
way to get it, and that this way involves taking the negation of the intersection
of i + 1 independent events that occur with probability 1/2 each. With a
little reflection, we can see that the probability that your objects don’t fit in
my buffer is exactly (1/2)^{i+1} .
From the law of total probability (2.3.3),

Pr [W ] = Σ_{i=0}^{∞} (1 − (1/2)^{i+1} ) (1/2)^{i+1}
    = 1 − Σ_{i=0}^{∞} (1/4)^{i+1}
    = 1 − (1/4) · (1 / (1 − 1/4))
    = 2/3.

This gives us our answer. However, we again see an answer that is
suspiciously simple, which suggests looking for another way to find it. We
can do this using conditional probability by defining new events Ci , where
Ci contains all sequences of coin-flips for both players where both get i heads
in a row but at least one gets tails on the (i + 1)-th coin. These events plus the
probability-zero event C∞ = {⟨HHHHHHH . . . , HHHHHHH . . .⟩} partition the
space, so Pr [W ] = Σ_{i=0}^{∞} Pr [W | Ci ] Pr [Ci ].

Now we ask, what is Pr [W | Ci ]? Here we only need to consider three
cases, depending on the outcomes of our (i + 1)-th coin-flips. The cases
⟨H, T⟩ and ⟨T, T⟩ cause me to win, while the case ⟨T, H⟩ causes me to
lose, and each case occurs with equal probability conditioned on Ci (which
excludes ⟨H, H⟩). So I win 2/3 of the time conditioned on Ci , and summing

Pr [W ] = Σ_{i=0}^{∞} (2/3) Pr [Ci ] = 2/3 since Pr [Ci ] sums to 1, because the union
of these events includes the entire space except for the probability-zero event
C∞ .
Still another approach is to compute the probability that our runs have
exactly the same length (Σ_{i=1}^{∞} 2^{−i} · 2^{−i} = 1/3), and argue by symmetry that
the remaining 2/3 probability is equally split between my run being longer
(1/3) and your run being longer (1/3). Since W occurs if my run is just as
long or longer, Pr[W ] = 1/3 + 1/3 = 2/3. A nice property of this approach is
that the only summation involved is over disjoint events, so we get to avoid
using conditional probability entirely.
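A quick simulation sketch in Python (a sanity check only, not part of the
argument) agrees with this answer:

    import random

    def run_length():
        # Number of heads a player flips before the first tails.
        count = 0
        while random.random() < 0.5:
            count += 1
        return count

    # W is the event that my run is at least as long as yours;
    # the estimate should come out close to 2/3.
    trials = 100_000
    wins = sum(run_length() >= run_length() for _ in range(trials))
    print(wins / trials)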

2.3.3.2 Karger’s min-cut algorithm


Here we’ll give a simple algorithm for finding a global min-cut in a multi-
graph,5 due to David Karger [Kar93].
The idea is that we are given a multigraph G, and we want to partition
the vertices into nonempty sets S and T such that the number of edges with
one endpoint in S and one endpoint in T is as small as possible. There are
many efficient ways to do this, most of which are quite sophisticated. There
is also the algorithm we will now present, which solves the problem with
reasonable efficiency using almost no sophistication at all (at least in the
algorithm itself).
The main idea is that given an edge uv, we can construct a new multigraph
G1 by contracting the edge: in G1 , u and v are replaced by a single vertex,
and any edge that used to have either vertex as an endpoint now goes to the
combined vertex (edges with both endpoints in {u, v} are deleted). Karger’s
algorithm is to contract edges chosen uniformly at random until only two
vertices remain. All the vertices that got packed into one of these become
S, the others become T . It turns out that this finds a minimum cut with
probability at least 1/(n choose 2).
An example of the algorithm in action is given in Figure 2.1.
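Here is a minimal Python sketch of the contraction process (a toy imple-
mentation for illustration, not tuned for efficiency; edges whose endpoints
have already been merged are skipped by rejection, which leaves the choice
uniform over the surviving edges):

    import random

    def karger_cut_size(n, edges):
        # One run of Karger's algorithm on a multigraph given as a list
        # of (u, v) pairs over vertices 0..n-1; returns the size of the
        # cut separating the two blobs that remain.
        label = list(range(n))    # label[v] leads to v's merged blob

        def find(v):
            while label[v] != v:
                label[v] = label[label[v]]
                v = label[v]
            return v

        remaining = n
        while remaining > 2:
            u, v = random.choice(edges)
            u, v = find(u), find(v)
            if u != v:            # contract the edge uv
                label[u] = v
                remaining -= 1
        # Edges whose endpoints ended up in different blobs cross the cut.
        return sum(1 for u, v in edges if find(u) != find(v))

    # Two triangles joined by one bridge: the min cut has size 1, and
    # re-running substantially more than (n choose 2) times makes
    # missing it unlikely.
    edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
    print(min(karger_cut_size(6, edges) for _ in range(200)))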

Theorem 2.3.1. Given any min cut (S, T ) of a graph G on n vertices,
Karger’s algorithm outputs (S, T ) with probability at least 1/(n choose 2).

Proof. Let (S, T ) be a min cut of size k. Then the degree of each vertex v
is at least k (otherwise (v, G − v) would be a smaller cut), and G contains
at least kn/2 edges. The probability that we contract an S–T edge is thus
at most k/(kn/2) = 2/n, and the probability that we don’t contract one is
5
Unlike ordinary graphs, multigraphs can have more than one edge between two vertices.

[Figure 2.1 appears here: a sequence of five graph drawings showing the
successive contractions described in the caption.]

Figure 2.1: Karger’s min-cut algorithm. Initial graph (at top) has min cut
⟨{a, b, c} , {d, e, f }⟩. We find this cut by getting lucky and contracting edges
ab, df , de, and ac in that order. The final graph (at bottom) gives the cut.

1 − 2/n = (n − 2)/n. Assuming we missed collapsing (S, T ) the first time,
we now have a new graph G1 with n − 1 vertices in which the min cut is still
of size k. So now the chance that we miss (S, T ) is (n − 3)/(n − 1). We stop
when we have two vertices left, so the last step succeeds with probability
1/3.
We can compute the probability that the S–T cut is never contracted by
applying (2.3.4), which just tells us to multiply all the conditional probabili-
ties together:
Π_{i=3}^{n} (i − 2)/i = 2 / (n(n − 1)) .

This tells us what happens when we are considering a particular min cut.
If the graph has more than one min cut, this only makes our life easier. Note
that since each min cut turns up with probability at least 1/(n choose 2), there can’t
be more than (n choose 2) of them.6 But even if there is only one, we have a good

6
The suspiciously combinatorial appearance of the 1/(n choose 2) suggests that there should
be some way of associating minimum cuts with particular pairs of vertices, but I’m not
aware of any natural way to do this. Sometimes the appearance of a simple expression
in a surprising context may just stem from the fact that there aren’t very many distinct
simple expressions.

chance of finding it if we simply re-run the algorithm substantially more
than (n choose 2) times.
Chapter 3

Random variables

Probabilities are fine when all we want to do is ask whether an algorithm
worked or not. But for many randomized algorithms (particularly Las Vegas
algorithms), we can structure the algorithm so that it works eventually
with probability 1. This makes the important question that of how long is
eventually. To measure a quantity like running time that depends on the
random choices of the algorithm, we need a random variable, which is just
a function whose domain is some probability space Ω.1
Even though random variables are just functions, rather than writing a
random variable as f (ω) everywhere, the convention is to write a random
variable as a capital letter (X, Y , S, etc.) and make the argument implicit:
X is really X(ω). Variables that aren’t random (or aren’t variable) are
written in lowercase.
Most of the random variables we will consider will be discrete random
variables. A discrete random variable takes on only countably many values,
each with some nonzero probability.
For example, consider the probability space corresponding to rolling two
independent fair six-sided dice. There are 36 possible outcomes in this space,
corresponding to the 6 × 6 pairs of values ⟨x, y⟩ we might see on the two dice.
We could represent the value of each die as a random variable X or Y given
by X(⟨x, y⟩) = x or Y (⟨x, y⟩) = y, but for many applications, we don’t care
so much about the specific values on each die. Instead, we want to know
the sum S = X + Y of the dice. This value S is also a random variable; as a
function on Ω, it’s defined by S(⟨x, y⟩) = x + y.
1
Technically, this only works for discrete spaces. In general, a random variable is a
measurable function from a probability space (Ω, F) to some other set S equipped with
its own σ-algebra F′. What makes a function measurable in this sense is that for any
set A in F′, the inverse image f^{−1}(A) must be in F. See §3.3 for more details.


Random variables need not be real-valued. There’s no reason why we
can’t think of the pair ⟨x, y⟩ itself as a random variable, whose range is the
set [1 . . . 6] × [1 . . . 6]. Similarly, if we imagine choosing a point uniformly at
random in the unit square [0, 1]2 , its coordinates are a random variable. For
a more exotic example, the random graph Gn,p obtained by starting with
n vertices and including each possible edge with independent probability p
is a random variable whose range is the set of all graphs on n vertices.

3.1 Operations on random variables


Random variables may be combined using standard arithmetic operators,
have functions applied to them, etc., to get new random variables. For
example, the random variable X/Y is a function from Ω that takes on the
value X(ω)/Y (ω) on each point ω.

3.2 Random variables and events


Any random variable X allows us to define events based on its possible values.
Typically these are expressed by writing a predicate involving the random
variable in square brackets. An example would be the probability that the
sum of two dice is exactly 11: [S = 11]; or that the sum of the dice is less than
5: [S < 5]. These are both sets of outcomes; we could expand [S = 11] =
{⟨5, 6⟩ , ⟨6, 5⟩} or [S < 5] = {⟨1, 1⟩ , ⟨1, 2⟩ , ⟨1, 3⟩ , ⟨2, 1⟩ , ⟨2, 2⟩ , ⟨3, 1⟩}. This
allows us to calculate the probability that a random variable has particular
properties: Pr [S = 11] = 2/36 = 1/18 and Pr [S < 5] = 6/36 = 1/6.
Conversely, given any event A, we can define an indicator random
variable 1A that is 1 when A occurs and 0 when it doesn’t.2 Formally,
1A (ω) = 1 for ω in A and 1A (ω) = 0 for ω not in A.
Indicator variables are mostly useful when combined with other random
variables. For example, if you roll two dice and normally collect the sum of
the values but get nothing if it is 7, we could write your payoff as S · 1[S≠7] .
The probability mass function of a random variable gives Pr [X = x]
for each possible value x. For example, our random variable S has the
2
Some people like writing χA for these.
You may also see [P ] where P is some predicate, a convention known as Iverson
notation or the Iverson bracket that was invented by Iverson for the programming
language APL, appears in later languages like C where the convention is that true predicates
evaluate to 1 and false ones to 0, and ultimately popularized for use in mathematics—with
the specific choice of square brackets to set off the predicate—by Graham et al. [GKP88].
Out of these alternatives, I personally find 1A to be the least confusing.

S Probability
2 1/36
3 2/36
4 3/36
5 4/36
6 5/36
7 6/36
8 5/36
9 4/36
10 3/36
11 2/36
12 1/36

Table 3.1: Probability mass function for the sum of two independent fair
six-sided dice

probability mass function shown in Table 3.1. For a discrete random variable
X, the probability mass function gives enough information to calculate the
probability of any event involving X, since we can just sum up cases using
countable additivity. This gives us another way to compute Pr [S < 5] =
Pr [S = 2] + Pr [S = 3] + Pr [S = 4] = (1 + 2 + 3)/36 = 1/6.
For two random variables, the joint probability mass function gives
Pr [X = x ∧ Y = y] for each pair of values x and y. This generalizes in the
obvious way for more than two variables.
We will often refer to the probability mass function as giving the distri-
bution or joint distribution of a random variable or collection of random
variables, even though distribution (for real-valued variables) technically
refers to the cumulative distribution function F (x) = Pr [X ≤ x], which
is generally not directly computable from the probability mass function for
continuous random variables that take on uncountably many values. To
the extent that we can, we will try to avoid continuous random variables,
and the rather messy integration theory needed to handle them.
Two or more random variables are independent if all sets of events
involving different random variables are independent. In terms of proba-
bility mass functions, X and Y are independent if Pr [X = x ∧ Y = y] =


Pr [X = x] · Pr [Y = y] for any constants x and y. In terms of cumulative
distribution functions, X and Y are independent if Pr [X ≤ x ∧ Y ≤ y] =
Pr [X ≤ x] · Pr [Y ≤ y] for any constants x and y. As with events, we gen-
erally assume that random variables associated with causally disconnected
processes are independent, but this is not the only way we might have
independence.
It’s not hard to see that the individual die values X and Y in our
two-dice example are independent, because every possible combination of
values x and y has the same probability 1/36 = Pr [X = x] Pr [Y = y]. If we
chose a different probability distribution on the space, we might not have
independence.

3.3 Measurability
For discrete probability spaces, any function on outcomes can be a random
variable. The reason is that any event in a discrete probability space has
a well-defined probability. For more general spaces, in order to be useful,
events involving a random variable should have well-defined probabilities.
For discrete random variables that take on only countably many values
(e.g., integers or rationals), it’s enough for the event [X = x] (that is, the
set {ω | X(ω) = x}) to be in F for all x. For real-valued random variables,
we ask that the event [X ≤ x] be in F. In these cases, we say that X is
measurable with respect to F, or just measurable F. More exotic random
variables use a definition of measurability that generalizes the real-valued
version, which we probably won’t need.3 Since we usually just assume that
all of our random variables are measurable unless we are doing something
funny with F to represent ignorance, this issue won’t come up much.

3.4 Expectation
The expectation or expected value of a random variable X is given by
E [X] = Σ_x x Pr [X = x]. This is essentially an average value of X weighted
by probability, and it only makes sense if X takes on values that can be
summed in this way (e.g., real or complex values, or vectors in a real- or
3
The general version is that if X takes on values on another measure space (Ω′, F′),
then the inverse image X^{−1}(A) = {ω ∈ Ω | X(ω) ∈ A} of any set A in F′ is in F. This
means in particular that Pr_Ω maps through X to give a probability measure on Ω′ by
Pr_{Ω′} [A] = Pr_Ω [X^{−1}(A)], and the condition on X^{−1}(A) being in F makes this work.

complex-valued vector space). Even if the expectation makes sense, it may


be that a particular random variable X doesn’t have an expectation, because
the sum fails to converge.4
For an example that does work, if X and Y are independent fair six-sided
dice, then E [X] = E [Y ] = Σ_{k=1}^{6} k · (1/6) = 21/6 = 7/2, while E [X + Y ] is the
rather horrific

Σ_{k=2}^{12} k · Pr [X + Y = k] = (2 · 1 + 3 · 2 + 4 · 3 + 5 · 4 + · · · + 9 · 4 + 10 · 3 + 11 · 2 + 12 · 1)/36
    = 252/36 = 7.
The fact that 7 = 7/2 + 7/2 here is not a coincidence, but a consequence of
linearity of expectation, which is the subject of the next section.

3.4.1 Linearity of expectation


The main reason we like expressing the run times of algorithms in terms of
expectation is linearity of expectation: E [aX + bY ] = E [aX] + E [bY ]
for all random variables X and Y for which E [X] and E [Y ] are defined, and
all constants a and b. This means that we can compute the running time for
different parts of our algorithm separately and then add them together, even
if the costs of different parts of the algorithm are not independent.
The general version is E [Σ ai Xi ] = Σ ai E [Xi ] for any finite collection of
random variables Xi and constants ai , which follows by applying induction to
the two-variable case. A special case is E [cX] = c E [X] when c is constant.
For discrete random variables, linearity of expectation follows immediately
from the definition of expectation and the fact that the event [X = x] is the
4
Example: Let X be the number of times you flip a fair coin until it comes up heads.
We’ll see later that E [X] = 1/(1/2) = 2. But E [2^X ] = Σ_{n=1}^{∞} 2^n 2^{−n} = Σ_{n=1}^{∞} 1, which
diverges. With some tinkering it is possible to come up with even uglier cases, like an
array that contains 1 element on average but requires infinite expected time to sort using
a Θ(n log n) algorithm.

disjoint union of the events [X = x, Y = y] for all y:


E [aX + bY ] = Σ_{x,y} (ax + by) Pr [X = x ∧ Y = y]
    = a Σ_{x,y} x Pr [X = x, Y = y] + b Σ_{x,y} y Pr [X = x, Y = y]
    = a Σ_x x Σ_y Pr [X = x, Y = y] + b Σ_y y Σ_x Pr [X = x, Y = y]
    = a Σ_x x Pr [X = x] + b Σ_y y Pr [Y = y]
    = a E [X] + b E [Y ] .

A technical note: we are assuming that E [X], E [Y ], and E [X + Y ] all
exist.
This proof does not require that X and Y be independent. The sum
of two fair six-sided dice always has expectation 7/2 + 7/2 = 7, whether they
are independent dice, the same die counted twice, or one die X and its
complement 7 − X.
Linearity of expectation makes it easy to compute the expectations of
random variables that can be expressed as sums of other random variables.
One example that will come up a lot is a binomial random variable, which
is a sum S = Σ_{i=1}^{n} Xi of n independent Bernoulli random variables, each
of which is 1 with probability p and 0 with probability q = 1 − p. These are
called binomial random variables because the probability that S is equal to
k is given by

Pr [S = k] = (n choose k) p^k q^{n−k} ,        (3.4.1)
which is the k-th term in the binomial expansion of (p + q)^n . In this case
each Xi has E [Xi ] = p, so E [S] is just np. It is possible to calculate this
fact directly from (3.4.1), but it’s much more work.5
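A quick Python simulation sketch makes E [S] = np easy to check empiri-
cally (the parameters n = 20 and p = 0.3 are arbitrary choices for the
example):

    import random

    n, p = 20, 0.3

    def binomial_sample():
        # Sum of n independent Bernoulli(p) random variables.
        return sum(random.random() < p for _ in range(n))

    trials = 100_000
    estimate = sum(binomial_sample() for _ in range(trials)) / trials
    print(estimate, n * p)   # the estimate should be close to np = 6.0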

3.4.1.1 Linearity of expectation for infinite sequences


For infinite sequences of random variables, linearity of expectation may
break down. This is true even if the sequence is countable. An example
5
One way is to use the probability generating function F (z) = Σ_{k=0}^{∞} Pr [S = k] z^k =
Σ_{k=0}^{n} (n choose k) p^k q^{n−k} z^k = (pz + q)^n . Then take the derivative F′(z) =
Σ_{k=0}^{∞} Pr [S = k] k z^{k−1} and observe F′(1) = Σ_{k=0}^{∞} Pr [S = k] k = E [S]. Or we
can write F′(z) as n(pz + q)^{n−1} p, which gives F′(1) = n(p + q)^{n−1} p = np.

is the St. Petersburg paradox, in which a gambler bets $1 on a double-


or-nothing game, then bets $2 if they lose, then $4, and so on, until they
eventually win and stop, up $1. If we represent the gambler’s gain or loss
at stage i as a random variable Xi , it’s easy to show that E [Xi ] = 0, because
the gambler either wins ±2^i with equal probability, or doesn’t play at all.
So Σ_{i=0}^{∞} E [Xi ] = 0. But E [Σ_{i=0}^{∞} Xi ] = 1, because the probability that the
gambler doesn’t eventually win is zero.6
Fortunately, these pathological cases don’t come up often in algorithm
analysis, and with some additional side constraints we can apply linearity of
expectation even to infinite sums of random variables. The simplest is when
Xi ≥ 0 for all i; then E [Σ_{i=0}^{∞} Xi ] exists and is equal to Σ_{i=0}^{∞} E [Xi ] whenever
the sum of the expectations converges (this is a consequence of the monotone
convergence theorem). Another condition that works is if |Σ_{i=0}^{n} Xi | ≤ Y for
all n, where Y is a random variable with finite expectation; the simplest
version of this is when Y is constant. See [GS92, §5.6.12] or [Fel71, §IV.2]
for more details.

3.4.2 Expectation and inequalities


If X ≤ Y (that is, if the event [X ≤ Y ] holds with probability 1), then
E [X] ≤ E [Y ]. For finite discrete spaces the proof is trivial:
E [X] = Σ_{ω∈Ω} Pr [ω] X(ω)
    ≤ Σ_{ω∈Ω} Pr [ω] Y (ω)
    = E [Y ] .

The claim continues to hold even in the general case, but the proof is
more work.
One special case of this that comes up often is that X ≥ 0 implies
E [X] ≥ 0.
6
The trick here is that we are trading a probability-1 gain of 1 against a probability-0
loss of ∞. So we could declare that E [Σ_{i=0}^{∞} Xi ] involves 0 · (−∞) and is undefined.
Xi involves 0 · (−∞) and is undefined.
But this would lose the useful property that expectation isn’t affected by probability-0
outcomes. As often happens in mathematics, we are forced to choose between candidate
definitions based on which bad consequences we most want to avoid, with no way to avoid
all of them. So the standard definition of expectation allows the St. Petersburg paradox
because the alternatives are worse.

3.4.3 Expectation of a product


When two random variables X and Y are independent, it also holds that
E [XY ] = E [X] E [Y ]. The proof (at least for discrete random variables) is
straightforward:
E [XY ] = Σ_x Σ_y xy Pr [X = x ∧ Y = y]
    = Σ_x Σ_y xy Pr [X = x] Pr [Y = y]
    = (Σ_x x Pr [X = x]) (Σ_y y Pr [Y = y])
    = E [X] E [Y ] .

For example, the expectation of the product of two independent fair
six-sided dice is (7/2)^2 = 49/4.
This is not true for arbitrary random variables. If we compute the
expectation of the product of a single fair six-sided die with itself, we get
(1 · 1 + 2 · 2 + 3 · 3 + 4 · 4 + 5 · 5 + 6 · 6)/6 = 91/6, which is much larger.
One measure of the dependence between two random variables is the
difference E [XY ] − E [X] · E [Y ]. This is called the covariance of X and Y ,
written Cov [X, Y ], and it is 0 whenever X and Y are independent (although
the converse does not hold in general). Covariance will come back later
when we look at concentration bounds in Chapter 5.

3.4.3.1 Wald’s equation (simple version)


Computing the expectation of a product does not often come up directly in
the analysis of a randomized algorithm. Where we might expect to do it is
when we have a loop: one random variable N tells us the number of times
we execute the loop, while another random variable X tells us the cost of
each iteration. The problem is that if each iteration is randomized, then we
really have a sequence of random variables X1 , X2 , . . . , and what we want
to calculate is
E [Σ_{i=1}^{N} Xi ] .        (3.4.2)

Here we can’t use the sum formula directly, because N is a random variable,
and we can’t use the product formula, because the Xi are all different random
variables.

If N and the Xi are all independent (which may or may not be the case
for the loop example), and N is bounded by some fixed maximum n, then
we can apply the product rule to get the value of (3.4.2) by throwing in
a few indicator variables. The idea is that the contribution of Xi to the
sum is given by Xi 1[N ≥i] , and because we assume that N is independent
of the Xi , if we need to compute E [Xi 1[N ≥i] ], we can do so by computing
E [Xi ] E [1[N ≥i] ].
So we get

E [Σ_{i=1}^{N} Xi ] = E [Σ_{i=1}^{n} Xi 1[N ≥i] ]
    = Σ_{i=1}^{n} E [Xi 1[N ≥i] ]
    = Σ_{i=1}^{n} E [Xi ] E [1[N ≥i] ] .

For general Xi we have to stop here. But if we also know that the Xi all
have the same expectation µ, then E [Xi ] doesn’t depend on i and we can
bring it out of the sum. This gives
Σ_{i=1}^{n} E [Xi ] E [1[N ≥i] ] = µ Σ_{i=1}^{n} E [1[N ≥i] ]
    = µ E [N ] .        (3.4.3)
This equation is a special case of Wald’s equation, which we will see
again in §9.4.2. The main difference between this version and the general
version is that here we had to assume that N was independent of the Xi ,
which may not be true if our loop is a while loop, and termination after a
particular iteration is correlated with the time taken by that iteration.
But for simple cases, (3.4.3) can still be useful. For example, if we throw
one six-sided die to get N , and then throw N six-sided dice and add them
up, we get the same expected total (7/2) · (7/2) = 49/4 as if we just multiply two
six-sided dice. This is true even though the actual distribution of the values
is very different in the two cases.
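A tiny Python simulation sketch of the dice example; the average should
come out close to (7/2) · (7/2) = 12.25:

    import random

    def die():
        return random.randint(1, 6)

    def total():
        # Roll one die to get N, then sum N further independent dice.
        return sum(die() for _ in range(die()))

    trials = 200_000
    print(sum(total() for _ in range(trials)) / trials)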

3.5 Conditional expectation


We can also define a notion of conditional expectation, analogous to
conditional probability. There are three versions of this, depending on how
fancy we want to get about specifying what information we are conditioning
on.

3.5.1 Expectation conditioned on an event


The expectation of X conditioned on an event A is written E [X | A] and
defined by
E [X | A] = Σ_x x Pr [X = x | A] = Σ_x x Pr [X = x ∧ A] / Pr [A] .        (3.5.1)

This is essentially the weighted average value of X if we know that A occurs.
Most of the properties that we see with ordinary expectations continue
to hold for conditional expectation. For example, linearity of expectation

E [aX + bY | A] = a E [X | A] + b E [Y | A] (3.5.2)

holds whenever a and b are constant on A.


Similarly if Pr [X ≥ Y | A] = 1, then E [X | A] ≥ E [Y | A].
Conditional expectation is handy because we can use it to compute
expectations by case analysis the same way we use conditional probabilities
using the law of total probability (see §2.3.2). If A1 , A2 , . . . are a countable
partition of Ω, then
Σ_i Pr [Ai ] E [X | Ai ] = Σ_i Pr [Ai ] (Σ_x x Pr [X = x | Ai ])
    = Σ_i Pr [Ai ] (Σ_x x Pr [X = x ∧ Ai ] / Pr [Ai ])
    = Σ_i Σ_x x Pr [X = x ∧ Ai ]
    = Σ_x x (Σ_i Pr [X = x ∧ Ai ])
    = Σ_x x Pr [X = x]
    = E [X] .        (3.5.3)

This is actually a special case of the law of iterated expectation, which
we will see in the next section.

3.5.2 Expectation conditioned on a random variable


In the previous section, we considered computing E [X] by breaking it up
into disjoint cases E [X | A1 ], E [X | A2 ], etc. But keeping track of all the
events in our partition of Ω is a lot of book-keeping. Conditioning on a
random variable lets us combine all these conditional probabilities into a
single expression
E [X | Y ] ,
the expectation of X conditioned on Y , which is defined to have the
value E [X | Y = y] whenever Y = y.7
Note that E [X | Y ] is generally a function of Y , unlike E [X] which is a
constant. This also means that E [X | Y ] is a random variable, and its value
can depend on which outcome ω we picked from our probability space Ω.
The intuition behind the definition is that E [X | Y ] is the best estimate we
can make of X given that we know the value of Y but nothing else.
If we want to be formal about the definition, we can specific the value of
E [X | Y ] explicitly for each point ω ∈ Ω:

E [X | Y ] (ω) = E [X | Y = Y (ω)] . (3.5.4)

This is just another way of saying what we said already: if you want to know
what the expectation of X is conditioned on Y when you get outcome ω,
find the value of Y at ω and condition on seeing that.
Here is a simple example. Suppose that X and Y are independent fair
coin-flips that take on the values 0 and 1 with equal probability. Then our
probability space Ω has four elements, and looks like this:

⟨0, 0⟩ ⟨0, 1⟩
⟨1, 0⟩ ⟨1, 1⟩

where each tuple ⟨x, y⟩ gives the values of X and Y .


We can also define the total number of heads as Z = X + Y . If we label
all the points ω in our probability space with Z(ω), we get a picture that
looks like this:
0 1
1 2
This is what we see if we know the exact value of both coin-flips (or at
least the exact value of Z).
7
If Y is not discrete, the situation is more complicated. See [Fel71, §§III.2 and V.9–V.11]
or [GS92, §7.9].

But now suppose we only know X, and want to compute E [Z | X]. When
X = 0, E [Z | X = 0] = (1/2) · 0 + (1/2) · 1 = 1/2; and when X = 1, E [Z | X = 1] =
(1/2) · 1 + (1/2) · 2 = 3/2. So drawing E [Z | X] over our probability space gives

1/2 1/2
3/2 3/2

We’ve averaged the value of Z across each row, since each row corresponds
to one of the possible values of X.
If instead we compute E [Z | Y ], we get this picture instead:
1/2 3/2
1/2 3/2

Now instead of averaging across rows (values of X) we average across
columns (values of Y ). So the left column shows E [Z | Y = 0] = 1/2 and the
right column shows E [Z | Y = 1] = 3/2, which is pretty much what we’d
expect.
Nothing says that we can only condition on X and Y . What happens if
we condition on Z?
Now we are going to get fixed values for each possible value of Z. If we
compute E [X | Z], then when Z = 0 this will be 0 (because Z = 0 implies
X = 0), and when Z = 2 this will be 1 (because Z = 2 implies X = 1). The
middle case is E [X | Z = 1] = (1/2) · 0 + (1/2) · 1 = 1/2, because the two outcomes
⟨0, 1⟩ and ⟨1, 0⟩ that give Z = 1 are equally likely. The picture is

0    1/2
1/2  1
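These pictures can be reproduced mechanically: the following Python sketch
recomputes each conditional expectation as an average over the matching
outcomes of the four-point space:

    from itertools import product

    # The four equally likely outcomes (x, y) of two fair coin-flips.
    omega = list(product([0, 1], repeat=2))

    def cond_exp(f, g, value):
        # E[f | g = value]: average f over the outcomes where g matches.
        pts = [w for w in omega if g(w) == value]
        return sum(f(w) for w in pts) / len(pts)

    def X(w): return w[0]
    def Z(w): return w[0] + w[1]

    print(cond_exp(Z, X, 0), cond_exp(Z, X, 1))                     # 0.5 1.5
    print(cond_exp(X, Z, 0), cond_exp(X, Z, 1), cond_exp(X, Z, 2))  # 0.0 0.5 1.0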

3.5.2.1 Calculating conditional expectations


Usually we will not try to compute E [X | Y ] individually for each possible ω or even
each possible value y of Y . Instead, we can use various basic facts to compute
E [X | Y ] by applying arithmetic to random variables.
The two basic facts to start with are:

1. If X is a function of Y , then E [X | Y ] = X. Proof: Suppose X =
f (Y ). From (3.5.4), for each outcome ω, we have E [X | Y ] (ω) =
E [X | Y = Y (ω)] = E [f (Y (ω)) | Y = Y (ω)] = f (Y (ω)) = X(ω).

2. If X is independent of Y , then E [X | Y ] = E [X]. Proof: Now for each
ω, we have E [X | Y ] (ω) = E [X | Y = Y (ω)] = E [X].

We also have a rather strong version of linearity of expectation. If A and
B are both functions of Z, then
E [AX + BY | Z] = A E [X | Z] + B E [Y | Z] . (3.5.5)
Here is a proof for discrete probability spaces. For each value z of Z, we
have
E [A(Z)X + B(Z)Y | Z = z] = Σ_{ω∈Z^{−1}(z)} (A(z)X(ω) + B(z)Y (ω)) Pr [ω] / Pr [Z = z]
    = A(z) Σ_{ω∈Z^{−1}(z)} X(ω) Pr [ω] / Pr [Z = z]
      + B(z) Σ_{ω∈Z^{−1}(z)} Y (ω) Pr [ω] / Pr [Z = z]
    = A(z) E [X | Z = z] + B(z) E [Y | Z = z] ,
which is the value when Z = z of A E [X | Z] + B E [Y | Z].
This means that we can quickly simplify many conditional expectations.
If we go back to the example of the previous section, where Z = X + Y is
the sum of two independent fair coin-flips X and Y , then we can compute
E [Z | X] = E [X + Y | X]
    = E [X | X] + E [Y | X]
    = X + E [Y ]
    = X + 1/2 .
Similarly, E [Z | Y ] = E [X + Y | Y ] = 1/2 + Y .
In some cases we have enough additional information to run this in
reverse. If we know Z and want to estimate X, we can use the fact that
X and Y are symmetric to argue that E [X | Z] = E [Y | Z]. But then
Z = E [X + Y | Z] = 2 E [X | Z], so E [X | Z] = Z/2. Note that this works
in general only if the events [X = a, Y = b] and [X = b, Y = a] have the
same probabilities for all a and b even if we condition on Z, which in this
case follows from the fact that X and Y are independent and identically
distributed and that addition is commutative. Other cases may be messier.8
Other facts, like X ≥ Y implies E [X | Z] ≥ E [Y | Z], can be proved
using similar techniques.
8
An example with X and Y identically distributed but not independent is to imagine
that we roll a six-sided die to get X, and let Y = X + 1 if X < 6 and Y = 1 if X = 6. Now
knowing Z = X + Y = 3 tells me that X = 1 and Y = 2 exactly, neither of which is Z/2.

3.5.2.2 The law of iterated expectation


The law of iterated expectation says that

E [X] = E [E [X | Y ]] . (3.5.6)

When Y is discrete, this is just (3.5.3) in disguise. For each of the
(countably many) values of Y that occurs with nonzero probability, let Ay
be the event [Y = y]. Then these events are a countable partition of Ω, and
E [E [X | Y ]] = Σ_y Pr [Y = y] E [E [X | Y ] | Y = y]
    = Σ_y Pr [Y = y] E [X | Y = y]
    = E [X] .

The trick here is that we use (3.5.3) to expand out the original expression in
terms of the events Ay , then notice that E [X | Y ] is equal to E [X | Y = y]
whenever Y = y.
So as claimed, conditioning on a variable gives a way to write averaging
over cases very compactly.
It’s also not too hard to show that iterated expectation works with partial
conditioning:
E [E [X | Y, Z] | Y ] = E [X | Y ] . (3.5.7)

3.5.2.3 Conditional expectation as orthogonal projection


If you are comfortable with linear algebra, it may be helpful to think about
expectation conditioned on a random variable as a form of projection onto
a subspace. In this section, we’ll give a brief description of how this works
for a finite, discrete probability space. For a more general version, see for
example [GS92, §7.9].
Consider the set of all real-valued random variables on the probability
space {TT, TH, HT, HH} corresponding to flipping two independent fair coins.
We can think of each such random variable as a vector in R4 , where the
four coordinates give the value of the variable on the four possible outcomes.
For example, the indicator variable for the event that the first coin is heads
would look like X = ⟨0, 0, 1, 1⟩ and the indicator variable for the event that
the second coin is heads would look like Y = ⟨0, 1, 0, 1⟩.
When we add two random variables together, we get a new random vari-
able. This corresponds to vector addition: X + Y = ⟨0, 0, 1, 1⟩ + ⟨0, 1, 0, 1⟩ =
⟨0, 1, 1, 2⟩. Multiplying a random variable by a constant looks like scalar
multiplication 2X = 2 · ⟨0, 0, 1, 1⟩ = ⟨0, 0, 2, 2⟩. Because random variables
support both addition and scalar multiplication, and because these operations
obey the axioms of a vector space, we can treat the set of all real-valued
random variables defined on a given probability space as a vector space, and
apply all the usual tools from linear algebra to this vector space.
One thing in particular we can look at is subspaces of this vector space.
Consider the set of all random variables that are functions of X. These are
vectors of the form ⟨a, a, b, b⟩, and adding any two such vectors or multiplying
any such vector by a scalar yields another vector of this form. So the functions
of X form a two-dimensional subspace of the four-dimensional space of all
random variables. An even lower-dimensional subspace is the one-dimensional
subspace of constants: vectors of the form ⟨a, a, a, a⟩. As with functions of
X, this set is closed under addition and multiplication by a constant.
When we take the expectation of X, we are looking for a constant that gives us the average value of X. In vector terms, this means that E [⟨0, 0, 1, 1⟩] = ⟨1/2, 1/2, 1/2, 1/2⟩. This expectation vector is in fact the orthogonal projection of ⟨0, 0, 1, 1⟩ onto the subspace generated by 1 = ⟨1, 1, 1, 1⟩; we can tell this because the dot-product of X − E [X] with 1 is ⟨−1/2, −1/2, 1/2, 1/2⟩ · ⟨1, 1, 1, 1⟩ = 0. If instead we take a conditional expectation, we are again doing an orthogonal projection, but now onto a higher-dimensional subspace. So E [X + Y | X] = ⟨1/2, 1/2, 3/2, 3/2⟩ is the orthogonal projection of X + Y = ⟨0, 1, 1, 2⟩ onto the space of all functions of X, which is generated by the basis vectors ⟨0, 0, 1, 1⟩ and ⟨1, 1, 0, 0⟩. As in the simple expectation, the dot-product of (X + Y ) − E [X + Y | X] with either of these basis vectors is 0.
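If you want to see the projection computed rather than checked by hand, here is a sketch using NumPy least squares (assuming NumPy is available; this is our own illustration of the picture above, not anything from the original development):

    import numpy as np

    # Four-point space {TT, TH, HT, HH} with uniform weights.
    X = np.array([0.0, 0.0, 1.0, 1.0])   # indicator: first coin heads
    Y = np.array([0.0, 1.0, 0.0, 1.0])   # indicator: second coin heads

    # Basis for the functions of X: constant on {TT,TH} and on {HT,HH}.
    B = np.column_stack([[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]])

    # Least squares computes the orthogonal projection onto span(B).
    coef, *_ = np.linalg.lstsq(B, X + Y, rcond=None)
    print(B @ coef)   # [0.5 0.5 1.5 1.5], matching E[X + Y | X]

Least squares with uniform weights computes exactly the orthogonal projection with respect to the E [XY ] inner product, since all four outcomes are equally likely.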
Many facts about conditional expectation translate in a straightforward
way to facts about projections. Linearity of expectation is equivalent to
linearity of projection onto a subspace. The law of iterated expectation
E [E [X | Y ]] = E [X] says that projecting onto the subspace of functions of
Y and then onto the subspace of constants is equivalent to projection directly
down to the subspace of constants; this is true in general for projection
operations. It’s also possible to represent other features of probability spaces
in terms of expectations; for example, E [XY ] acts like an inner product
for random variables, E X 2 acts like the square of the Euclidean distance,
 

and the fact that E [X] is an orthogonal projection of X means that E [X]
is precisely the constant value µ that minimizes the distance E (X − µ)2 .
We won’t actually use any of these facts in the following, but having another
way to look at conditional expectation may be helpful in understanding how
it works.
3.5.3 Expectation conditioned on a σ-algebra
Expectation conditioned on a random variable is actually a special case of
the expectation of X conditioned on a σ-algebra F. Recall that a σ-algebra is
a family of subsets of Ω that includes Ω and is closed under complement and
countable union; for discrete probability spaces, this turns out to be the set
of all unions of equivalence classes for some equivalence relation on Ω,9 and
we think of F as representing knowledge of which equivalence class we are in,
but not which point in the equivalence class we land on. An example would
be if Ω consists of all values (X1 , X2 ) obtained from two die rolls, and F consists of all sets A such that whenever one point ω with X1 (ω) + X2 (ω) = s is in A, so is every other point ω′ with X1 (ω′) + X2 (ω′) = s. (This is the σ-algebra generated by the random variable X1 + X2 .)
A discrete random variable X is measurable with respect to F, or
F-measurable, if every event [X = x] is contained in F; in other words,
knowing only where we are in F, we can compute exactly the value of X.
This gives a formal way to define σ(X): it is the smallest σ-algebra F such
that X is F-measurable.
If X is not F-measurable, the best approximation we can make to it given
that we only know where we are in F is E [X | F], which is defined as a random
variable Q that is (a) F-measurable; and (b) satisfies E [Q | A] = E [X | A]
for any event A ∈ F with Pr [A] ≠ 0.
For discrete probability spaces, this just means that we replace X with
its average value across each equivalence class: property (a) is satisfied
because E [X | F] is constant across each equivalence class, meaning that
[E [X | F] = x] is a union of equivalence classes, and property (b) is satisfied
because we define E [E [X | F] | A] = E [X | A] for each equivalence class A,
and the same holds for unions of equivalence classes by a simple calculation.
This gives the same result as E [X | Y ] if F is generated by Y , or more
generally as E [X | Y1 , Y2 , . . .] if F is generated by Y1 , Y2 , . . . . In each case
the intuition is that we are getting the best estimate we can for X given the
information we have. It is also possible to define E [X | F] as a projection
onto the subspace of all random variables that are F-measurable, analogously
to the special case for E [X | Y ] described in §3.5.2.3.
9 Proof: Let F be a σ-algebra over a countable set Ω. Let ω ∼ ω′ if, for all A in F, ω ∈ A if and only if ω′ ∈ A; this is an equivalence relation on Ω. To show that the equivalence classes of ∼ are elements of F, for each ω″ ≁ ω, let A_{ω″} be some element of F that contains ω but not ω″. Then ∩_{ω″} A_{ω″} (a countable intersection of elements of F) contains ω and all points ω′ ∼ ω but no points ω″ ≁ ω; in other words, it's the equivalence class of ω. Since there are only countably many such equivalence classes, we can construct all the elements of F by taking all possible unions of them.

Sometimes it is convenient to use more than one σ-algebra to represent increasing knowledge over time. A filtration is a sequence of σ-algebras
F0 ⊆ F1 ⊆ F2 ⊆ . . . , where each Ft represents the information we have
available at time t. That each Ft is a subset of Ft+1 means that any event
we can determine at time t we can also determine at all future times t0 > t:
though we may learn more information over time, we never forget what
we already know. A common example of a filtration is when we have a
sequence of random variables X1 , X2 , . . . , and define Ft as the σ-algebra ⟨X1 , X2 , . . . , Xt ⟩ generated by X1 , X2 , . . . , Xt .
When one σ-algebra is a subset of another, a version of the law of iterated expectation applies: F ⊆ F′ implies E [E [X | F′] | F] = E [X | F]. One way to think about this is that if we forget everything about X we can't predict from F′ and then forget everything that's left that we can't predict from F, we get to the same place as if we just forgot everything except F to begin with. The simplest version E [E [X | F′]] = E [X] is just what happens when F is the trivial σ-algebra {∅, Ω}, where all we know is that something happened, but we don't know what.

3.5.4 Examples

• Let X be the value of a six-sided die. Let A be the event that X is even. Then

    E [X | A] = Σ_x x Pr [X = x | A]
              = (2 + 4 + 6) · (1/3)
              = 4 .

• Let X and Y be independent six-sided dice, and let Z = X + Y . Then E [Z | X] is a random variable whose value is 1 + 7/2 when X = 1, 2 + 7/2 when X = 2, etc. We can write this succinctly by writing E [Z | X] = X + 7/2.

• Conversely, if X, Y , and Z are as above, we can also compute E [X | Z]. Here we are told what Z is and must make an estimate of X.
  For some values of Z, this nails down X completely: E [X | Z = 2] = 1, because the only way to make 2 in this model is as 1 + 1. For other values, we don't know much about X, but can still compute the expectation. For example, to compute E [X | Z = 5], we have to average X over all the pairs (X, Y ) that sum to 5. This gives E [X | Z = 5] = (1/4)(1 + 2 + 3 + 4) = 5/2. (This is not terribly surprising, since by symmetry E [Y | Z = 5] should equal E [X | Z = 5], and since conditional expectations add just like regular expectations, we'd expect that the sum of these two expectations would be 5.)
The actual random variable E [X | Z] summarizes these conditional expectations for all events of the form [Z = z]. Because of the symmetry argument above, we can write it succinctly as E [X | Z] = Z/2. Or we could list its value for every ω in our underlying probability space, as is done in Table 3.2 for this and various other conditional expectations on the two-independent-dice space.

3.6 Applications
3.6.1 Yao’s lemma
In Section 1.1, we considered a special case of the unordered search problem,
where we have an unordered array A[1..n] and want to find the location
of a specific element x. For deterministic algorithms, this requires probing
n array locations in the worst case, because the adversary can place x in
the last place we look. Using a randomized algorithm, we can reduce this
to (n + 1)/2 probes on average, either by probing according to a uniform
random permutation or just by probing from left-to-right or right-to-left
with equal probability.
Can we do better? Proving lower bounds is a nuisance even for determin-
istic algorithms, and for randomized algorithms we have even more to keep
track of. But there is a sneaky trick that allows us to reduce randomized
lower bounds to deterministic lower bounds in many cases.
The idea is that if we have a randomized algorithm that runs in time T (x, r) on input x with random bits r, then for any fixed choice of r we have a deterministic algorithm. So for each n, we find some random X with |X| = n and show that, for any deterministic algorithm that runs in time T′(x), E [T′(X)] ≥ f (n). But then E [T (X, R)] = E [E [T (X, R) | R]] = E [E [T_R (X) | R]] ≥ f (n).
This gives us Yao’s lemma:
Lemma 3.6.1 (Yao’s lemma (informal version)[Yao77]). Fix some problem.
Suppose there is a random distribution on inputs X of size n such that every
deterministic algorithm for the problem has expected cost T (n).
Then the worst-case expected cost of any randomized algorithm is at least
T (n).

ω X Y Z =X +Y E [X] E [X | Y = 3] E [X | Y ] E [Z | X] E [X | Z] E [X | X]
(1, 1) 1 1 2 7/2 7/2 7/2 1 + 7/2 2/2 1
(1, 2) 1 2 3 7/2 7/2 7/2 1 + 7/2 3/2 1
(1, 3) 1 3 4 7/2 7/2 7/2 1 + 7/2 4/2 1
(1, 4) 1 4 5 7/2 7/2 7/2 1 + 7/2 5/2 1
(1, 5) 1 5 6 7/2 7/2 7/2 1 + 7/2 6/2 1
(1, 6) 1 6 7 7/2 7/2 7/2 1 + 7/2 7/2 1
(2, 1) 2 1 3 7/2 7/2 7/2 2 + 7/2 3/2 2
(2, 2) 2 2 4 7/2 7/2 7/2 2 + 7/2 4/2 2
(2, 3) 2 3 5 7/2 7/2 7/2 2 + 7/2 5/2 2
(2, 4) 2 4 6 7/2 7/2 7/2 2 + 7/2 6/2 2
(2, 5) 2 5 7 7/2 7/2 7/2 2 + 7/2 7/2 2
(2, 6) 2 6 8 7/2 7/2 7/2 2 + 7/2 8/2 2
(3, 1) 3 1 4 7/2 7/2 7/2 3 + 7/2 4/2 3
(3, 2) 3 2 5 7/2 7/2 7/2 3 + 7/2 5/2 3
(3, 3) 3 3 6 7/2 7/2 7/2 3 + 7/2 6/2 3
(3, 4) 3 4 7 7/2 7/2 7/2 3 + 7/2 7/2 3
(3, 5) 3 5 8 7/2 7/2 7/2 3 + 7/2 8/2 3
(3, 6) 3 6 9 7/2 7/2 7/2 3 + 7/2 9/2 3
(4, 1) 4 1 5 7/2 7/2 7/2 4 + 7/2 5/2 4
(4, 2) 4 2 6 7/2 7/2 7/2 4 + 7/2 6/2 4
(4, 3) 4 3 7 7/2 7/2 7/2 4 + 7/2 7/2 4
(4, 4) 4 4 8 7/2 7/2 7/2 4 + 7/2 8/2 4
(4, 5) 4 5 9 7/2 7/2 7/2 4 + 7/2 9/2 4
(4, 6) 4 6 10 7/2 7/2 7/2 4 + 7/2 10/2 4
(5, 1) 5 1 6 7/2 7/2 7/2 5 + 7/2 6/2 5
(5, 2) 5 2 7 7/2 7/2 7/2 5 + 7/2 7/2 5
(5, 3) 5 3 8 7/2 7/2 7/2 5 + 7/2 8/2 5
(5, 4) 5 4 9 7/2 7/2 7/2 5 + 7/2 9/2 5
(5, 5) 5 5 10 7/2 7/2 7/2 5 + 7/2 10/2 5
(5, 6) 5 6 11 7/2 7/2 7/2 5 + 7/2 11/2 5
(6, 1) 6 1 7 7/2 7/2 7/2 6 + 7/2 7/2 6
(6, 2) 6 2 8 7/2 7/2 7/2 6 + 7/2 8/2 6
(6, 3) 6 3 9 7/2 7/2 7/2 6 + 7/2 9/2 6
(6, 4) 6 4 10 7/2 7/2 7/2 6 + 7/2 10/2 6
(6, 5) 6 5 11 7/2 7/2 7/2 6 + 7/2 11/2 6
(6, 6) 6 6 12 7/2 7/2 7/2 6 + 7/2 12/2 6

Table 3.2: Various conditional expectations on two independent dice


For unordered search, putting x in a uniform random array location makes any deterministic algorithm take at least (n + 1)/2 probes on average. So randomized algorithms take at least (n + 1)/2 probes as well.

3.6.2 Geometric random variables
Suppose that we are running a Las Vegas algorithm that takes a fixed
amount of time T , but succeeds only with probability p (which we take to
be independent of the outcome of any other run of the algorithm). If the
algorithm fails, we run it again. How long does it take on average to get the
algorithm to work?
We can reduce the problem to computing E [T X] = T E [X], where X is the number of times the algorithm runs. The probability that X = n is exactly (1 − p)^{n−1} p, because we need to get n − 1 failures with probability 1 − p each followed by a single success with probability p, and by assumption all of these probabilities are independent. A variable with this kind of distribution is called a geometric random variable. We saw a special case of this distribution earlier (§2.3.3.1) when we were looking at how many flips it would take on average to get the first tails from a fair coin (in that case,
p was 1/2).
Using conditional expectation, it's straightforward to compute E [X]. Let A be the event that the algorithm succeeds on the first run, i.e., the event [X = 1]. Then

    E [X] = E [X | A] Pr [A] + E [X | Ā] Pr [Ā]
          = 1 · p + E [X | Ā] · (1 − p) .

The tricky part here is to evaluate E [X | Ā]. Intuitively, if we don't succeed the first time, we've wasted one step and are back where we started, so it should be the case that E [X | Ā] = 1 + E [X]. If we want to be really careful,
we can calculate this out formally (no sensible person would ever do this):

    E [X | Ā] = Σ_{n=1}^∞ n Pr [X = n | X ≠ 1]
              = Σ_{n=2}^∞ n Pr [X = n] / Pr [X ≠ 1]
              = Σ_{n=2}^∞ n (1 − p)^{n−1} p / (1 − p)
              = Σ_{n=2}^∞ n (1 − p)^{n−2} p
              = Σ_{n=1}^∞ (n + 1)(1 − p)^{n−1} p
              = 1 + Σ_{n=1}^∞ n (1 − p)^{n−1} p
              = 1 + E [X] .

Since we know that E [X] = p + (1 + E [X])(1 − p), a bit of algebra gives E [X] = 1/p, which is about what we'd expect.
There are more direct ways to get the same result. If we don't have conditional expectation to work with, we can try computing the sum E [X] = Σ_{n=1}^∞ n (1 − p)^{n−1} p directly. The easiest way to do this is probably to use generating functions (see, for example, [GKP88, Chapter 7] or [Wil06]). An alternative argument is given in [MU17, §2.4]; this uses the fact that E [X] = Σ_{n=1}^∞ Pr [X ≥ n], which holds when X takes on only non-negative integer values.
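A quick simulation makes the 1/p answer easy to believe; the following Python sketch (ours, with made-up parameter values) restarts a success-probability-p trial until it works:

    import random

    # Count how many runs a restarting Las Vegas algorithm needs.
    def runs_until_success(p):
        n = 1
        while random.random() >= p:   # each run fails with probability 1 - p
            n += 1
        return n

    p, trials = 0.2, 100_000
    avg = sum(runs_until_success(p) for _ in range(trials)) / trials
    print(avg, 1 / p)   # both close to 5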

3.6.3 Coupon collector
In the coupon collector problem, we throw balls uniformly and indepen-
dently into n bins until every bin has at least one ball. When this happens,
how many balls have we used on average?10
Let Xi be the number of balls needed to go from i − 1 nonempty bins to i nonempty bins. It's easy to see that X1 = 1 always. For larger i, each time we throw a ball, it lands in an empty bin with probability (n − i + 1)/n. This
10
The name comes from the problem of collecting coupons at random until you have all
of them. A typical algorithmic application is having a cluster of machines choose jobs to
finish at random from some list until all are done. The expected number of job executions
to complete n jobs is given exactly by the solution to the coupon collector problem.
means that Xi has a geometric distribution with probability (n − i + 1)/n, giving E [Xi ] = n/(n − i + 1) from the analysis in §3.6.2.
To get the total expected number of balls, take the sum

    E [Σ_{i=1}^n Xi] = Σ_{i=1}^n E [Xi]
                     = Σ_{i=1}^n n/(n − i + 1)
                     = n Σ_{i=1}^n 1/i
                     = n H_n .

In asymptotic terms, this is Θ(n log n).
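Here is a matching simulation sketch (our own illustration; the parameters are arbitrary) comparing the empirical average against n H_n:

    import random

    # Throw balls into n bins until none is empty; count the balls.
    def balls_needed(n):
        seen, count = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            count += 1
        return count

    n, trials = 50, 2_000
    avg = sum(balls_needed(n) for _ in range(trials)) / trials
    Hn = sum(1 / i for i in range(1, n + 1))
    print(avg, n * Hn)   # both close to about 225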

3.6.4 Hoare’s FIND


Hoare’s FIND [Hoa61b], often called QuickSelect, is an algorithm for
finding the k-th smallest element of an unsorted array that works like
QuickSort, only after partitioning the array around a random pivot we
throw away the part that doesn’t contain our target and recurse only on
the surviving piece. As with QuickSort, we’d like to compute the expected
number of comparisons used by this algorithm, on the assumption that the
cost of the comparisons dominates the rest of the costs of the algorithm.
Here the indicator-variable trick gets painful fast. It turns out to be
easier to get an upper bound by computing the expected number of elements
that are left after each split.
First, let’s analyze the pivot step. If the pivot is chosen uniformly, the
number of elements X smaller than the pivot is uniformly distributed in the
range 0 to n−1. The number of elements larger than the pivot will be n−X−1.
In the worst case, we find ourselves recursing on the large pile always, giving
a bound on the number of survivors Y of Y ≤ max(X, n − X + 1).
What is the expected value of Y ? By considering both ways the max can go, we get

    E [Y ] = E [X | X > n − X + 1] Pr [X > n − X + 1]
           + E [n − X + 1 | n − X + 1 ≥ X] Pr [n − X + 1 ≥ X] .

For both conditional expectations, we are choosing a value uniformly in either the range ⌈(n−1)/2⌉ to n − 1 or ⌈(n−1)/2⌉ + 1 to n − 1, and in either case the expectation will be equal to the average of the two endpoints by symmetry. So we get

    E [Y ] ≤ ((n/2 + n − 1)/2) Pr [X > n − X + 1] + ((n/2 + n)/2) Pr [n − X + 1 ≥ X]
          = ((3/4)n − 1/2) Pr [X > n − X + 1] + (3/4)n Pr [n − X + 1 ≥ X]
          ≤ (3/4)n .
4
Now let Xi be the number of survivors after i pivot steps. Note that max(0, Xi − 1) gives the number of comparisons at the following pivot step, so that Σ_{i=0}^∞ Xi is an upper bound on the number of comparisons.
We have X0 = n, and from the preceding argument E [X1 ] ≤ (3/4)n. But more generally, we can use the same argument to show that E [Xi+1 | Xi ] ≤ (3/4)Xi , and by induction E [Xi ] ≤ (3/4)^i n. We also have that Xj = 0 for all j ≥ n, because we lose at least one element (the pivot) at each pivoting step. This saves us from having to deal with an infinite sum.
Using linearity of expectation,

    E [Σ_{i=0}^∞ Xi] = E [Σ_{i=0}^n Xi]
                     = Σ_{i=0}^n E [Xi]
                     ≤ Σ_{i=0}^n (3/4)^i n
                     ≤ 4n .
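To see the 4n bound in action, here is a minimal QuickSelect sketch (our own Python; it assumes distinct elements, takes a 0-indexed rank k, and charges one three-way comparison per non-pivot element at each pivot step):

    import random

    def quickselect(a, k):
        comparisons = 0
        while len(a) > 1:
            pivot = random.choice(a)
            less = [x for x in a if x < pivot]
            greater = [x for x in a if x > pivot]
            comparisons += len(a) - 1   # each non-pivot element vs the pivot
            if k < len(less):
                a = less
            elif k == len(less):
                return pivot, comparisons
            else:
                k -= len(less) + 1
                a = greater
        return a[0], comparisons

    n, trials = 1000, 200
    total = sum(quickselect(random.sample(range(10 * n), n), n // 2)[1]
                for _ in range(trials))
    print(total / trials, 4 * n)   # the empirical average stays below 4n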
Chapter 4

Basic probabilistic inequalities

Here we’re going to look at some inequalities useful for proving properties
of randomized algorithms. These come in two flavors: inequalities involving
probabilities, which are useful for bounding the probability that something
bad happens, and inequalities involving expectations, which are used to
bound expected running times. Later, in Chapter 5, we’ll be doing both,
by looking at inequalities that show that a random variable is close to its
expectation with high probability.1

4.1 Markov’s inequality


This is the key tool for turning expectations of non-negative random variables
into (upper) bounds on probabilities. Used directly, it generally doesn’t give
very good bounds, but it can work well if we apply it to E [f (X)] for a
fast-growing function f ; for some examples, see Chebyshev’s inequality (§5.1)
or Chernoff bounds (§5.2).
1 Often, the phrase with high probability is used in algorithm analysis to mean specifically with probability at least 1 − n^{−c} for any fixed c. The idea is that if an algorithm works with high probability in this sense, then the probability that it fails each time you run it is at most n^{−c}, which means that if you run it as a subroutine in a polynomial-time algorithm that calls it at most n^{c′} times, the total probability of failure is at most n^{c′} · n^{−c} = n^{c′−c} by the union bound. Assuming we can pick c to be much larger than c′, this makes the outer algorithm also work with high probability.

Markov's inequality says that if X ≥ 0 and α > 0, then

    Pr [X ≥ α] ≤ E [X] / α .

The proof is immediate from the law of total probability (2.3.3). We have

    E [X] = E [X | X ≥ α] Pr [X ≥ α] + E [X | X < α] Pr [X < α]
          ≥ α · Pr [X ≥ α] + 0 · Pr [X < α]
          = α · Pr [X ≥ α] ;

now solve for Pr [X ≥ α].
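Because the proof throws away everything about the distribution except E [X], the bound is usually far from tight; the following throwaway Python sketch (ours; an exponential variable with mean 1 stands in for an arbitrary non-negative X) shows the gap:

    import random

    samples = [random.expovariate(1.0) for _ in range(100_000)]
    mean = sum(samples) / len(samples)
    for alpha in (1, 2, 4, 8):
        tail = sum(x >= alpha for x in samples) / len(samples)
        # actual tail probability vs the Markov bound E[X]/alpha
        print(alpha, tail, mean / alpha)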


Markov’s inequality doesn’t work in reverse. For example, consider the
following game: for each integer k > 0, with probability 2−k , I hgive youi 2k
dollars. Letting X be your payoff from the game, we have Pr X ≥ 2k =
P∞
2−k = 2−k ∞ −` = 2 . The right-hand side here is exactly what we
P
j=k `=0 2 2k
would get from Markov’s inequality if E [X] = 2. But in this case, E [X] 6= 2;
in fact, the expectation of X is given by ∞ k −k
P
k=1 2 2 , which diverges.

4.1.1 Applications
4.1.1.1 Sum of fair coins
Flip n independent fair coins, and let S be the number of heads we get. Since E [S] = n/2, Markov's inequality gives Pr [S = n] ≤ Pr [S ≥ n] ≤ 1/2. This is much larger than the actual value 2^{−n}, but it's the best we can hope for if we only know E [S]: if we let S be 0 or n with equal probability, we also get E [S] = n/2.

4.1.1.2 Randomized QuickSort

The expected running time for randomized QuickSort is O(n log n). It follows from Markov's inequality that the probability that randomized QuickSort takes more than f (n) time is O(n log n/f (n)). For example, the probability that it performs the maximum \binom{n}{2} = O(n²) comparisons is O(log n/n). (It's possible to do much better than this.)

4.1.1.3 Balls in bins

Suppose we toss m balls in n bins, uniformly and independently. What is the probability that some particular bin contains at least k balls? The probability that a particular ball lands in a particular bin is 1/n, so the expected number of balls in the bin is m/n. By Markov's inequality, this gives a bound of m/(nk) on the probability that a particular bin contains k or more balls. Unfortunately this is not a very good bound.

4.2 Union bound (Boole’s inequality)


The union bound or Boole’s inequality says that for any countable
collection of events {Ai },
h[ i X
Pr Ai ≤ Pr [Ai ] . (4.2.1)

Combining Markov’s inequality with linearity of expectation and indicator


variables gives a succinct proof of the union bound:
h[ i hX i
Pr Ai = Pr 1Ai ≥ 1
hX i
≤E 1Ai
X
= E [1Ai ]
X
= Pr [Ai ] .

Note that for this to work for infinitely many events we need to use the fact that 1_{Ai} is non-negative.
If we prefer to avoid any issues with infinite sums of expectations, the direct way to prove this is to replace Ai with Bi = Ai \ ∪_{j=1}^{i−1} Aj. Then ∪ Ai = ∪ Bi, but since the Bi are disjoint and each Bi is a subset of the corresponding Ai, we get Pr [∪ Ai] = Pr [∪ Bi] = Σ Pr [Bi] ≤ Σ Pr [Ai].

The typical use of the union bound is to show that if an algorithm can
fail only if various improbable events occur, then the probability of failure is
no greater than the sum of the probabilities of these events. This reduces
the problem of showing that an algorithm works with probability 1 − ε to constructing an error budget that divides the ε probability of failure among all the bad outcomes.

4.2.1 Example: Balls in bins
Suppose we toss n balls uniformly and independently into n bins. What
high-probability bound can we get on the maximum number of balls in any
one bin?2
2
Algorithmic version: we insert n elements into a hash table with n positions using a
random hash function. What is the maximum number of elements in any one position?
Consider all \binom{n}{k} sets S of k balls. If we get at least k balls in some bin, then one of these sets must all land in the same bin. Call the event that all balls in S choose the same bin AS. The probability that AS occurs is exactly n^{−k+1}.
Using the union bound, we get

    Pr [some bin gets at least k balls] = Pr [∪_S AS]
                                        ≤ Σ_S Pr [AS]
                                        = \binom{n}{k} n^{−k+1}
                                        ≤ (n^k/k!) n^{−k+1}
                                        = n/k! .
If we want this probability to be low, we should choose k so that k! ≫ n. Stirling's formula says that k! ≥ √(2πk) (k/e)^k ≥ (k/e)^k, which gives ln(k!) ≥ k(ln k − 1). If we set k = c ln n/ ln ln n, we get

    ln(k!) ≥ (c ln n/ ln ln n)(ln c + ln ln n − ln ln ln n − 1)
           ≥ c ln n

when n is sufficiently large.

It follows that the bound n/k! in this case is less than n/ exp(c ln n) =
n · n−c = n1−c . For suitable choice of c we get a high probability that every
bin gets at most O(log n/ log log n) balls.
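A simulation sketch (ours; the parameters are arbitrary) shows the maximum load sitting at this scale:

    import random
    from collections import Counter
    from math import log

    # n balls into n bins; compare the max load with ln(n)/ln(ln(n)).
    n = 100_000
    loads = Counter(random.randrange(n) for _ in range(n))
    print(max(loads.values()), log(n) / log(log(n)))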

4.3 Jensen’s inequality


This is mostly useful if we can calculate E [X] easily for some X, but what
we really care about is some other random variable Y = f (X).
Jensen’s inequality applies when f is a convex function, which means
that for any x, y, and 0 ≤ µ ≤ 1, f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y).
Geometrically, this means that the line segment between any two points on the
graph of f never goes below f ; i.e., that the set of points {(x, y) | y ≥ f (x)}
is convex. If we want to show that a continuous function f is convex, it’s
enough to show that that f 2 ≤ f (x)+f
x+y
2
(y)
for all x and y (in effect, we
only need to prove it for the case λ = 1/2). If f is twice-differentiable, an even easier way is to show that f ′′(x) ≥ 0 for all x.
The inequality says that if X is a random variable and f is convex then

f (E [X]) ≤ E [f (X)] . (4.3.1)

Alternatively, if f is concave (which means that f (λx + (1 − λ)y) ≥ λf (x) + (1 − λ)f (y), or equivalently that −f is convex), the reverse inequality holds:

    f (E [X]) ≥ E [f (X)] .    (4.3.2)

In both cases, the direction of Jensen's inequality matches the direction of the inequality in the definition of convexity or concavity. This is not surprising because convexity or concavity is just Jensen's inequality for the random variable X that equals x with probability λ and y with probability 1 − λ. Jensen's inequality just says that this continues to work for any random variable for which the expectations exist.

4.3.1 Proof
Here is a proof for the case that f is convex and differentiable. The idea is
that if f is convex, then it lies above the tangent line at E [X]. So we can
define a linear function g that represents this tangent line, and get, for all x:

    f (x) ≥ g(x) = f (E [X]) + (x − E [X])f ′(E [X]) .    (4.3.3)

But then

    E [f (X)] ≥ E [g(X)]
             = E [f (E [X]) + (X − E [X])f ′(E [X])]
             = f (E [X]) + E [X − E [X]] f ′(E [X])
             = f (E [X]) + (E [X] − E [X])f ′(E [X])
             = f (E [X]) .

Figure 4.1 shows what this looks like for a particular convex f .
This is pretty much all linearity of expectation in action: E [X], f (E [X]), and f ′(E [X]) are all constants, so we can pull them out of expectations whenever we see them.
The proof of the general case is similar, but for a non-differentiable convex
function it takes a bit more work to show that the bounding linear function
g exists.
Figure 4.1: Proof of Jensen's inequality

4.3.2 Applications
4.3.2.1 Fair coins: lower bound
Suppose we flip n fair coins, and we want to get a lower bound on E [X²], where X is the number of heads. The function f : x ↦ x² is convex (take its second derivative), so (4.3.1) gives E [X²] ≥ (E [X])² = n²/4.
The actual value for E [X²] is n²/4 + n/4, which can be found directly using generating functions³ or less directly using variance, which we will encounter in §5.1. This is pretty close to the lower bound we got out of Jensen's inequality, but we can't count on this happening in general.

4.3.2.2 Fair coins: upper bound

For an upper bound, we can choose a concave f . For example, if X is as in the previous example, E [lg X] ≤ lg E [X] = lg(n/2) = lg n − 1. This is probably pretty close to the exact value, as we will see later that X will almost always be within a factor of 1 + o(1) of n/2. It's not a terribly useful upper bound, because if we use it with (say) Markov's inequality, the most we can prove is that Pr [X = n] = Pr [lg X = lg n] ≤ (lg n − 1)/lg n = 1 − 1/lg n, which is an even worse bound than the 1/2 we can get from applying Markov's inequality to X directly.

3 Here's how: The probability generating function for X is F (z) = E [z^X] = Σ_k z^k Pr [X = k] = 2^{−n}(1 + z)^n. Then zF ′(z) = 2^{−n} nz(1 + z)^{n−1} = Σ_k k z^k Pr [X = k]. Taking the derivative of this a second time gives 2^{−n} n(1 + z)^{n−1} + 2^{−n} n(n − 1)z(1 + z)^{n−2} = Σ_k k² z^{k−1} Pr [X = k]. Evaluate this monstrosity at z = 1 to get E [X²] = Σ_k k² Pr [X = k] = 2^{−n} n 2^{n−1} + 2^{−n} n(n − 1)2^{n−2} = n/2 + n(n − 1)/4 = (2n + n² − n)/4 = n²/4 + n/4.

4.3.2.3 Sifters
Here’s an example of Jensen’s inequality in action in the analysis of an actual
distributed algorithm. For some problems in distributed computing, it’s
useful to reduce coordinating a large number of processes to coordinating
a smaller number. A sifter [AA11] is a randomized mechanism for an
asynchronous shared-memory system that sorts the processes into “winners”
and “losers,” guaranteeing that there is at least one winner. The goal is to
make the expected number of winners as small as possible. The problem
is tricky, because processes can only communicate by reading and writing
shared variables, and an adversary gets to choose which processes participate
and fix the schedule of when each of these processes perform their operations.
The current best known sifter is due to Giakkoupis and Woelfel [GW12].
For n processes, it uses an array A of ⌈lg n⌉ bits, each of which can be read or written by any of the processes. When a process executes the sifter, it chooses a random index r ∈ 1 . . . ⌈lg n⌉ with probability 2^{−r} (this doesn't exactly sum to 1, so the excess probability gets added to r = ⌈lg n⌉). The process then writes a 1 to A[r] and reads A[r + 1]. If it sees a 0 in its read (or chooses r = ⌈lg n⌉), it wins; otherwise it loses.
This works as a sifter, because no matter how many processes participate,
some process chooses a value of r at least as large as any other process’s
value, and this process wins. To bound the expected number of winners, take
the sum over all r over the random variable Wr representing the winners
who chose this particular value r. A process that chooses r wins if it carries
out its read operation before any process writes r + 1. If the adversary wants
to maximize the number of winners, it should let each process read as soon as possible; this effectively means that a process that chooses r wins if no process previously chooses r + 1. Since r is twice as likely to be chosen as r + 1, conditioning on a process picking r or r + 1, there is only a 1/3 chance that it chooses r + 1. So at most 1/(1/3) − 1 = 2 = O(1) processes on average choose r before some process chooses r + 1. (A simpler argument shows that the expected number of processes that win because they choose r = ⌈lg n⌉ is at most 2 as well.)
Summing E [Wr ] ≤ 2 over all r gives at most 2⌈lg n⌉ winners on average. Furthermore, if k < n processes participate, essentially the same analysis shows that only 2⌈lg k⌉ processes win on average. So this is a pretty effective tool for getting rid of excess processes.
But it gets better. Suppose that we take the winners of one sifter and feed them into a second sifter. Let Xk be the number of processes left after k sifters. We have that X0 = n and E [X1 ] ≤ 2⌈lg n⌉, but what can we say about E [X2 ]? We can calculate E [X2 ] = E [E [X2 | X1 ]] ≤ E [2⌈lg X1 ⌉]. Unfortunately, the ceiling means that 2⌈lg x⌉ is not a concave function, but f (x) = 2(lg x + 1) ≥ 2⌈lg x⌉ is. So Jensen's inequality gives E [X2 ] ≤ f (E [X1 ]) ≤ f (f (n)), and in general E [Xi ] ≤ f^{(i)}(n), where f^{(i)} is the i-fold composition of f . All the extra constants obscure what is going on a bit, but with a little bit of algebra it is not too hard to show that f^{(i)}(n) = O(1) for i = O(log* n).⁴ So this gets rid of all but a constant number of processes very quickly.

4 The log* function counts how many times you need to hit n with lg to reduce it to one or less. So log* 1 = 0, log* 2 = 1, log* 4 = 2, log* 16 = 3, log* 65536 = 4, log* 2^65536 = 5, and after that it starts getting silly.
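For readers who want to play with the sifter, here is a simulation sketch (our own Python, not the algorithm's original pseudocode) under the read-as-soon-as-possible schedule that the analysis above treats as maximizing the number of winners:

    import random
    from math import ceil, log2

    def sift(k, n):
        m = ceil(log2(n))
        written, winners = set(), 0
        for _ in range(k):
            r = 1
            while r < m and random.random() < 0.5:  # Pr[r] = 2^-r, excess on m
                r += 1
            # win if r is the top index or nobody has yet written r + 1
            if r == m or (r + 1) not in written:
                winners += 1
            written.add(r)
        return winners

    n = 2 ** 16
    print(sift(n, n), 2 * ceil(log2(n)))   # winners vs the 2*ceil(lg n) bound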
Chapter 5

Concentration bounds

If we really want to get tight bounds on a random variable X, the trick will
turn out to be picking some non-negative function f (X) where (a) we can
calculate E [f (X)], and (b) f grows fast enough that merely large values of
X produce huge values of f (X), allowing us to get small probability bounds
by applying Markov’s inequality to f (X). This approach is often used to
show that X lies close to E [X] with reasonably high probability, what is
known as a concentration bound.
Typically concentration bounds are applied to sums of random variables,
which may or may not be fully independent. Which bound you may want to
use often depends on the structure of your sum. A quick summary of the
bounds in this chapter is given in Table 5.1. The rule of thumb is to use
Chernoff bounds (§5.2) if you have a sum of independent 0–1 random variables;
the Azuma-Hoeffding inequality (§5.3) if you have bounded variables with a
more complicated distribution that may be less independent; and Chebyshev’s
inequality (§5.1) if nothing else works but you can somehow compute the
variance of your sum (e.g., if the Xi are independent or have easily computed
covariance). In the case of Chernoff bounds, you will almost always end up
using one of the weaker but cleaner versions in §5.2.2 rather than the general
version in §5.2.1.
If none of these bounds work for your particular application, there are
many more out there. See for example the textbook by Dubhashi and
Panconesi [DP09].

    Chernoff          Xi ∈ {0, 1}, independent    Pr [S ≥ (1 + δ) E [S]] ≤ (e^δ/(1 + δ)^{1+δ})^{E[S]}
    Azuma-Hoeffding   |Xi| ≤ ci, martingale       Pr [S ≥ t] ≤ exp(−t²/(2 Σ ci²))
    Chebyshev                                     Pr [|S − E [S]| ≥ α] ≤ Var [S]/α²

Table 5.1: Concentration bounds for S = Σ Xi (strongest to weakest)

5.1 Chebyshev’s inequality


Chebyshev’s inequality allows us to show that a random variable is close
to its mean, byapplying Markov’s inequality to the variance of X, defined
2

as Var [X] = E (X − E [X]) . It’s a fairly weak concentration bound, that is
most useful when X is a sum of random variables with limited independence.
Using Markov’s inequality, calculate
h i
Pr [|X − E [X]| ≥ α] = Pr (X − E [X])2 ≥ α2
E (X − E [X])2
 

α2
Var [X]
= . (5.1.1)
α2

5.1.1 Computing variance

At this point it would be reasonable to ask why we are going through Var [X] = E [(X − E [X])²] rather than just using E [|X − E [X]|]. The
reason is that Var [X] is usually easier to compute, especially if X is a sum.
In this section, we’ll give some examples of computing variance, including
for various standard random variables that come up often in randomized
algorithms.

5.1.1.1 Alternative formula

The first step is to give an alternative formula for the variance that is more convenient in some cases.
Expand

    E [(X − E [X])²] = E [X² − 2X · E [X] + (E [X])²]
                     = E [X²] − 2 E [X] · E [X] + (E [X])²
                     = E [X²] − (E [X])² .    (5.1.2)

This formula is easier to use if you are estimating the variance from a sequence of samples; by tracking Σ xi² and Σ xi, you can estimate E [X²] and E [X] in a single pass, without having to estimate E [X] first and then go back for a second pass to calculate (xi − E [X])² for each sample. We won't use this particular application much, but this explains why the formula is popular with statisticians.

5.1.1.2 Variance of a Bernoulli random variable
Recall that a Bernoulli random variable is 1 with probability p and 0 with
probability q = 1 − p; in particular, any indicator variable is a Bernoulli
random variable.
The variance of a Bernoulli random variable is easily calculated from (5.1.2):

    Var [X] = E [X²] − (E [X])²
            = p − p²
            = p(1 − p)
            = pq .

5.1.1.3 Variance of a sum

If S = Σ_i Xi, then we can calculate

    Var [S] = E [(Σ_i Xi)²] − (E [Σ_i Xi])²
            = E [Σ_i Σ_j Xi Xj] − Σ_i Σ_j E [Xi] E [Xj]
            = Σ_i Σ_j (E [Xi Xj] − E [Xi] E [Xj]) .

For any two random variables X and Y , the quantity E [XY ]−E [X] E [Y ]
is called the covariance of X and Y , written Cov [X, Y ]. If we take the
covariance of a variable and itself, covariance becomes variance: Cov [X, X] =
Var [X].
We can use Cov [X, Y ] to rewrite the above expansion as

    Var [Σ_i Xi] = Σ_{i,j} Cov [Xi , Xj]    (5.1.3)
                 = Σ_i Var [Xi] + Σ_{i≠j} Cov [Xi , Xj]    (5.1.4)
                 = Σ_i Var [Xi] + 2 Σ_{i<j} Cov [Xi , Xj]    (5.1.5)

Note that Cov [X, Y ] = 0 when X and Y are independent; this makes
Chebyshev’s inequality particularly useful for pairwise-independent ran-
dom variables, because then we can just sum up the variances of the individual
variables.
A typical application is when we have a sum S = Σ Xi of non-negative random variables with small covariance; here applying Chebyshev's inequality to S can often be used to show that S is not likely to be much smaller than E [S], which can be handy if we want to show that some lower bound holds on S with some probability. This complements Markov's inequality, which can only be used to get upper bounds.
For example, suppose S = Σ_{i=1}^n Xi, where the Xi are independent Bernoulli random variables with E [Xi ] = p for all i. Then E [S] = np, and Var [S] = Σ_i Var [Xi ] = npq (because the Xi are independent). Chebyshev's inequality then says

    Pr [|S − E [S]| ≥ α] ≤ npq/α² .
The highest variance is when p = 1/2. In this case, the probability that S is more than β√n away from its expected value n/2 is bounded by 1/(4β²). We'll see better bounds on this problem later, but this may already be good enough for many purposes.
More generally, the approach of bounding S from below by estimating E [S] and either E [S²] or Var [S] is known as the second-moment method. In some cases, tighter bounds can be obtained by more careful analysis.
5.1.1.4 Variance of a geometric random variable

Let X be a geometric random variable with parameter p as defined in §3.6.2, so that X takes on the values 1, 2, . . . and Pr [X = n] = q^{n−1} p, where q = 1 − p as usual. What is Var [X]?
We know that E [X] = 1/p, so (E [X])² = 1/p². Computing E [X²] is trickier. Rather than do this directly from the definition of expectation, we can exploit the memorylessness of geometric random variables to get it using conditional expectations, just like we did for E [X] in §3.6.2.
Conditioning on the two possible outcomes of the first trial, we have

    E [X²] = p + q E [X² | X > 1] .    (5.1.6)

We now argue that E [X² | X > 1] = E [(X + 1)²]. The intuition is that once we have flipped one coin the wrong way, we are back where we started, except that now we have to add that extra coin to X. More formally, we have, for n > 1, Pr [X = n | X > 1] = Pr [X = n] / Pr [X > 1] = q^{n−1} p/q = q^{n−2} p = Pr [X = n − 1] = Pr [X + 1 = n]. So we get the same probability mass function for X conditioned on X > 1 as for X + 1 with no conditioning.
Applying this observation to the right-hand side of (5.1.6) gives

    E [X²] = p + q E [(X + 1)²]
           = p + q (E [X²] + 2 E [X] + 1)
           = p + q E [X²] + 2q/p + q
           = 1 + q E [X²] + 2q/p .

A bit of algebra turns this into

    E [X²] = (1 + 2q/p)/(1 − q)
           = (1 + 2q/p)/p
           = (p + 2q)/p²
           = (2 − p)/p² .
Now subtract (E [X])² = 1/p² to get

    Var [X] = (1 − p)/p² = q/p² .    (5.1.7)
By itself, this doesn’t give very good bounds on X. For example, if we
want to bound the probability that X = 1, we get
1
 
Pr [X = 1] = Pr X − E [X] = 1 −
p
1
 
≤ Pr |X − E [X]| ≥ − 1
p
Var [X]
≤ 2
1
p − 1
q/p2
= 2
1
−1
p
q
=
(1 − p)2
1
= .
q
Since 1q ≥ 1, we could have gotten this bound with much less work.
The other direction is not much better. We can easily calculate that Pr [X ≥ n] is exactly q^{n−1} (because this corresponds to flipping the first n − 1 coins the wrong way, no matter what happens with subsequent coins). Using Chebyshev's inequality gives

    Pr [X ≥ n] ≤ Pr [X − 1/p ≥ n − 1/p]
               ≤ (q/p²) / (n − 1/p)²
               = q/(np − 1)² .

This at least has the advantage of dropping below 1 when n gets large enough, but it's only polynomial in n while the true value is exponential.
Where this might be useful is in analyzing the sum of a bunch of geometric random variables, as occurs in the Coupon Collector problem discussed in §3.6.3.¹ Letting Xi be the number of balls to take us from i − 1 to i nonempty

1 We are following here a similar analysis in [MU17, §3.3.1].
bins, we have previously argued that Xi has a geometric distribution with p = (n − i + 1)/n, so

    Var [Xi] = ((i − 1)/n) / ((n − i + 1)/n)²
             = n(i − 1)/(n − i + 1)² ,

and

    Var [Σ_{i=1}^n Xi] = Σ_{i=1}^n Var [Xi]
                       = Σ_{i=1}^n n(i − 1)/(n − i + 1)² .

Having the numerator go up while the denominator goes down makes this a rather unpleasant sum to try to solve directly. So we will follow the lead of Mitzenmacher and Upfal and bound the numerator by n, giving

    Var [Σ_{i=1}^n Xi] ≤ Σ_{i=1}^n n · n/(n − i + 1)²
                       = n² Σ_{i=1}^n 1/i²
                       ≤ n² Σ_{i=1}^∞ 1/i²
                       = n² π²/6 .
The fact that Σ_{i=1}^∞ 1/i² converges to π²/6 is not trivial to prove, and was first shown by Leonhard Euler in 1735 some ninety years after the question was first proposed.² But it's easy to show that the series converges to something, so even if we didn't have Euler's help, we'd know that the variance is O(n²).
Since the expected value of the sum is Θ(n log n), this tells us that we
are likely to see a total waiting time reasonably close to this; with at least
2
See http://en.wikipedia.org/wiki/Basel_Problem for a history, or Euler’s original
paper [Eul68], available at http://eulerarchive.maa.org/docs/originals/E352.pdf,
for the actual proof in the full glory of its original 18th-century typesetting. Curiously,
though Euler announced his result in 1735, he didn’t submit the journal version until 1749,
and it didn’t see print until 1768. Things moved more slowly in those days.
a constant probability, it will be within Θ(n) of the expectation. In fact, the distribution is much more sharply concentrated (see [MU17, §5.4.1] or [MR95, §3.6.3]), but this bound at least gives us something.

5.1.2 More examples
Here are some more examples of Chebyshev’s inequality in action. Some
of these repeat examples for which we previously got crummier bounds in
§4.1.1.

5.1.2.1 Flipping coins
Let X be the sum of n independent fair coins. Let Xi be the indicator variable for the event that the i-th coin comes up heads. Then Var [Xi ] = 1/4 and Var [X] = Σ Var [Xi ] = n/4. Chebyshev's inequality gives Pr [X = n] ≤ Pr [|X − n/2| ≥ n/2] ≤ (n/4)/(n/2)² = 1/n. This is still not very good, but it's getting better. It's also about the best we can do given only the assumption of pairwise independence.
To see this, let n = 2^m − 1 for some m, and let Y1 . . . Ym be independent, fair 0–1 random variables. For each non-empty subset S of {1 . . . m}, let XS be the exclusive OR of all Yi for i ∈ S. Then (a) the XS are pairwise independent; (b) each XS has variance 1/4; and thus (c) the same Chebyshev's inequality analysis for independent coin flips above applies to X = Σ_S XS, giving Pr [|X − n/2| ≥ n/2] ≤ (n/4)/(n²/4) = 1/n. In this case it is not actually possible for X to equal n, but we can have X = 0 if all the Yi are 0, which occurs with probability 2^{−m} = 1/(n + 1). So Chebyshev's inequality is almost tight in this case.
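The XOR construction is easy to simulate; here is a sketch (ours, with small parameters) confirming that X = 0 happens with probability about 1/(n + 1):

    import random

    # m seed bits give n = 2^m - 1 pairwise-independent fair bits.
    def xor_sum(m):
        y = [random.randrange(2) for _ in range(m)]
        total = 0
        for s in range(1, 2 ** m):   # each nonempty subset of the seed bits
            bit = 0
            for i in range(m):
                if (s >> i) & 1:
                    bit ^= y[i]
            total += bit
        return total

    m, trials = 4, 100_000   # n = 15
    zeros = sum(xor_sum(m) == 0 for _ in range(trials))
    print(zeros / trials, 1 / 2 ** m)   # both about 0.0625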

5.1.2.2 Balls in bins
Let Xi be the indicator that the i-th of m balls lands in a particular bin. Then E [Xi ] = 1/n, giving E [Σ Xi ] = m/n, and Var [Xi ] = 1/n − 1/n², giving Var [Σ Xi ] = m/n − m/n². So the probability that we get k + m/n or more balls in a particular bin is at most (m/n − m/n²)/k² < m/(nk²), and applying the union bound, the probability that we get k + m/n or more balls in any of the n bins is less than m/k². Setting this equal to ε and solving for k gives a probability of at most ε of getting more than m/n + √(m/ε) balls in any of the bins. This is not as good a bound as we will be able to prove later, but it's at least non-trivial.
5.1.2.3 Lazy select
This example comes from [MR95, §3.3]; essentially the same example, specialized to finding the median, also appears in [MU17, §3.5].³
We want to find the k-th smallest element S(k) of a set S of size n. (The parentheses around the index indicate that we are considering the sorted version of the set S(1) < S(2) < · · · < S(n) .) The idea is to:

1. Sample a multiset R of n^{3/4} elements of S with replacement and sort them. This takes O(n^{3/4} log n^{3/4}) = o(n) comparisons so far.

2. Use our sample to find an interval that is likely to contain S(k). The idea is to pick indices ℓ = (k − n^{3/4})n^{−1/4} and r = (k + n^{3/4})n^{−1/4} and use R(ℓ) and R(r) as endpoints (we are omitting some floors and maxes here to simplify the notation; for a more rigorous presentation see [MR95]). The hope is that the interval P = [R(ℓ), R(r)] in S will both contain S(k), and be small, with |P| ≤ 4n^{3/4} + 2. We can compute the elements of P in 2n comparisons exactly by comparing every element with both R(ℓ) and R(r).

3. If both these conditions hold, sort P (o(n) comparisons) and return S(k). If not, try again.
We want to get a bound on how likely it is that P either misses S(k) or
is too big.
For any fixed k, the probability that one sample in R is less than or equal to S(k) is exactly k/n, so the expected number X of samples ≤ S(k) is exactly kn^{−1/4}. The variance on X can be computed by summing the variances of the indicator variables that each sample is ≤ S(k), which gives a bound Var [X] = n^{3/4}((k/n)(1 − k/n)) ≤ n^{3/4}/4. Applying Chebyshev's inequality gives Pr [|X − kn^{−1/4}| ≥ √n] ≤ (n^{3/4}/4)/n = n^{−1/4}/4.
Now let's look at the probability that P misses S(k) because R(ℓ) is too big, where ℓ = kn^{−1/4} − √n. This is

    Pr [R(ℓ) > S(k)] = Pr [X < kn^{−1/4} − √n]
                     ≤ n^{−1/4}/4 .
3 The history is that Motwani and Raghavan adapted this algorithm from a similar algorithm by Floyd and Rivest [FR75]. Mitzenmacher and Upfal give a version that also includes the adaptations appearing in Motwani and Raghavan, although they don't say where they got it from, and it may be that both textbook versions come from a common folklore source.
(with the caveat that we are being sloppy about round-off errors).
Similarly,

    Pr [R(r) < S(k)] = Pr [X > kn^{−1/4} + √n]
                     ≤ n^{−1/4}/4 .

So the total probability that P misses S(k) is at most n^{−1/4}/2.


Now we want to show that |P | is small. We will do so by showing that it is likely that R(ℓ) ≥ S(k−2n^{3/4}) and R(r) ≤ S(k+2n^{3/4}). Let Xℓ be the number of samples in R that are ≤ S(k−2n^{3/4}) and Xr be the number of samples in R that are ≤ S(k+2n^{3/4}). Then we have E [Xℓ] = kn^{−1/4} − 2√n and E [Xr] = kn^{−1/4} + 2√n, and Var [Xℓ] and Var [Xr] are both bounded by n^{3/4}/4.
We can now compute

    Pr [R(ℓ) < S(k−2n^{3/4})] = Pr [Xℓ > kn^{−1/4} − √n] < n^{−1/4}/4

by the same Chebyshev's inequality argument as before, and get the symmetric bound on the other side for Pr [R(r) > S(k+2n^{3/4})]. This gives a total bound of n^{−1/4}/2 that P is too big, for a bound of n^{−1/4} = o(1) that the algorithm fails on its first attempt.
The total expected number of comparisons is thus given by T (n) = 2n + o(n) + O(n^{−1/4} T (n)) = 2n + o(n).
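Here is a simplified, non-rigorous Python sketch of the algorithm (our own; it handles the floors and clamps crudely, does not count comparisons, and just retries on failure):

    import random
    from math import sqrt

    def lazy_select(S, k):   # k is a 0-indexed rank into sorted(S)
        n = len(S)
        while True:
            m = int(n ** 0.75)
            R = sorted(random.choice(S) for _ in range(m))
            lo = max(int(k * n ** -0.25 - sqrt(n)), 0)
            hi = min(int(k * n ** -0.25 + sqrt(n)), m - 1)
            a, b = R[lo], R[hi]
            P = [x for x in S if a <= x <= b]
            below = sum(x < a for x in S)   # ranks strictly below the window
            if below <= k < below + len(P) and len(P) <= 4 * n ** 0.75 + 2:
                return sorted(P)[k - below]

    S = random.sample(range(10 ** 6), 10 ** 4)
    print(lazy_select(S, 5000), sorted(S)[5000])   # same element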

5.2 Chernoff bounds
To get really tight bounds, we apply Markov's inequality to exp(αS), where S = Σ_i Xi. This works best when the Xi are independent: if this is the case, so are the variables exp(αXi), and so we can easily calculate E [exp(αS)] = E [∏_i exp(αXi)] = ∏_i E [exp(αXi)].
The quantity E [exp(αS)], treated as a function of α, is called the moment generating function of S, because it expands formally into Σ_{k=0}^∞ E [S^k] α^k/k!, the exponential generating function for the series of k-th moments E [S^k]. Note that it may not converge for all S and α;⁴ we will be careful to choose α for which it does converge and for which Markov's inequality gives us good bounds.
4 For example, the moment generating function for our earlier bad X with Pr [X = 2^k] = 2^{−k} is equal to Σ_k 2^{−k} e^{α 2^k}, which diverges for any α > 0.
5.2.1 The classic Chernoff bound

The basic Chernoff bound applies to sums of independent 0–1 random variables, which need not be identically distributed. For identically distributed random variables, the sum has a binomial distribution, which we can either compute exactly or bound more tightly using approximations specific to binomial tails; for sums of bounded random variables that aren't necessarily 0–1, we can use Hoeffding's inequality instead (see §5.3).
Let each Xi for i = 1 . . . n be a 0–1 random variable with expectation pi, so that E [S] = µ = Σ_i pi. The plan is to show Pr [S ≥ (1 + δ)µ] is small when δ and µ are large, by applying Markov's inequality to E [e^{αS}], where α will be chosen to make the bound as tight as possible for some specific δ. The first step is to get an upper bound on E [e^{αS}].
Compute

    E [e^{αS}] = E [e^{α Σ Xi}]
               = ∏_i E [e^{αXi}]
               = ∏_i (pi e^α + (1 − pi)e^0)
               = ∏_i (pi e^α + 1 − pi)
               = ∏_i (1 + (e^α − 1) pi)
               ≤ ∏_i e^{(e^α − 1) pi}
               = e^{(e^α − 1) Σ_i pi}
               = e^{(e^α − 1)µ} .

The sneaky inequality step in the middle uses the fact that (1 + x) ≤ e^x for all x, which itself is one of the most useful inequalities you can memorize.⁵ What's nice about this derivation is that at the end, the pi have vanished. We don't care what random variables we started with or how many of them there were, but only about their expected sum µ.
Now that we have an upper bound on E [e^{αS}], we can throw it into
5 For a proof of this inequality, observe that the function f (x) = e^x − (1 + x) has the derivative e^x − 1, which is positive for x > 0 and negative for x < 0. It follows that x = 0 is the unique minimum of f , at which f (0) = 0.
Markov's inequality to get the bound we really want:

    Pr [S ≥ (1 + δ)µ] = Pr [e^{αS} ≥ e^{α(1+δ)µ}]
                      ≤ E [e^{αS}] / e^{α(1+δ)µ}
                      ≤ e^{(e^α−1)µ} / e^{α(1+δ)µ}
                      = (e^{e^α−1} / e^{α(1+δ)})^µ
                      = (e^{e^α − 1 − α(1+δ)})^µ .

We now choose α to minimize the base in the last expression, by minimizing its exponent e^α − 1 − α(1 + δ). Setting the derivative of this expression with respect to α to zero gives e^α = (1 + δ) or α = ln(1 + δ); luckily, this value of α is indeed greater than 0 as we have been assuming. Plugging this value in gives

    Pr [S ≥ (1 + δ)µ] ≤ (e^{(1+δ)−1−(1+δ) ln(1+δ)})^µ
                      = (e^δ / (1 + δ)^{1+δ})^µ .    (5.2.1)

The base of this rather atrocious quantity is e^0/1^1 = 1 at δ = 0, and its derivative is negative for δ ≥ 0 (the easiest way to show this is to substitute δ = x − 1 first). So the bound is never greater than 1 and is both decreasing and less than 1 as soon as δ > 0. We also have that the bound is exponential in µ for any fixed δ.
If we look at the shape of the base as a function of δ, we can observe that when δ is very large, we can replace (1 + δ)^{1+δ} with δ^δ without changing the bound much (and to the extent that we change it, it's an increase, so it still works as a bound). This turns the base into e^δ/δ^δ = (e/δ)^δ = 1/(δ/e)^δ. This is pretty close to Stirling's formula for 1/δ! (there is a √(2πδ) factor missing).
For very small δ, we have that 1 + δ ≈ e^δ, so the base becomes approximately e^δ/e^{δ(1+δ)} = e^{−δ²}. This approximation goes in the wrong direction (it's smaller than the actual value) but with some fudging we can show bounds of the form e^{−µδ²/c} for various constants c, as long as δ is not too big.
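As a reality check on (5.2.1), here is a small Monte Carlo sketch (ours; n, p, and δ are arbitrary) comparing the bound against the empirical upper tail of a binomial sum:

    import random
    from math import exp, log

    n, p, delta, trials = 1000, 0.1, 0.3, 10_000
    mu = n * p
    threshold = (1 + delta) * mu
    hits = sum(sum(random.random() < p for _ in range(n)) >= threshold
               for _ in range(trials))
    # (e^d / (1+d)^(1+d))^mu, evaluated via its logarithm
    bound = exp(mu * (delta - (1 + delta) * log(1 + delta)))
    print(hits / trials, bound)   # the empirical tail sits below the bound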
5.2.2 Easier variants

The full Chernoff bound can be difficult to work with, especially since it's hard to invert (5.2.1) to find a good δ that gives a particular ε bound. Fortunately, there are approximate variants that substitute a weaker but less intimidating bound. Some of the more useful are:
• For 0 ≤ δ ≤ 1.81,

      Pr [X ≥ (1 + δ)µ] ≤ e^{−µδ²/3} .    (5.2.2)

  (The actual upper limit is slightly higher.) Useful for small values of δ, especially because the bound can be inverted: if we want Pr [X ≥ (1 + δ)µ] ≤ exp(−µδ²/3) ≤ ε, we can use any δ with √(3 ln(1/ε)/µ) ≤ δ ≤ 1.81.
  The proof of the approximate bound is to show that, in the given range, e^δ/(1 + δ)^{1+δ} ≤ exp(−δ²/3). This is easiest to do numerically; a somewhat more formal argument that the bound holds in the range 0 ≤ δ ≤ 1 can be found in [MU17, Theorem 4.4].
• For 0 ≤ δ ≤ 4.11,

      Pr [X ≥ (1 + δ)µ] ≤ e^{−µδ²/4} .    (5.2.3)

  This is a slightly weaker bound than the previous one that holds over a larger range. It gives Pr [X ≥ (1 + δ)µ] ≤ ε if √(4 ln(1/ε)/µ) ≤ δ ≤ 4.11. Note that the version given on page 72 of [MR95] is not correct; it claims that the bound holds up to δ = 2e − 1 ≈ 4.44, but it fails somewhat short of this value.
• For R ≥ 2eµ,

      Pr [X ≥ R] ≤ 2^{−R} .    (5.2.4)

  Sometimes the assumption is replaced with the stronger R ≥ 6µ (this is the version given in [MU17, Theorem 4.4], for example); one can also verify numerically that R ≥ 5µ (i.e., δ ≥ 4) is enough. The proof of the 2eµ bound is that e^δ/(1 + δ)^{1+δ} < e^{1+δ}/(1 + δ)^{1+δ} = (e/(1 + δ))^{1+δ} ≤ 2^{−(1+δ)} when e/(1 + δ) ≤ 1/2 or δ ≥ 2e − 1. Raising this to the µ-th power gives Pr [X ≥ (1 + δ)µ] ≤ 2^{−(1+δ)µ} for δ ≥ 2e − 1. Now substitute R for (1 + δ)µ (giving R ≥ 2eµ) to get the full result. Inverting this one gives Pr [X ≥ R] ≤ ε when R ≥ max(2eµ, lg(1/ε)).
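In practice one mostly uses these bounds backwards, picking δ from a target failure probability; here is a hypothetical helper sketch (ours; the name delta_for is made up) that inverts (5.2.2):

    from math import sqrt, log

    def delta_for(mu, eps):
        # smallest delta with exp(-mu * delta^2 / 3) <= eps, per (5.2.2)
        delta = sqrt(3 * log(1 / eps) / mu)
        if delta > 1.81:
            raise ValueError("outside the range where (5.2.2) applies")
        return delta

    print(delta_for(mu=1000, eps=1e-6))   # about 0.20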
Figure 5.1 shows the relation between the various bounds, in the region
where they cross each other.
Figure 5.1: Comparison of Chernoff bound variants (e^{−δ²/3} for δ ≤ 1.81, e^{−δ²/4} for δ ≤ 4.11, and 2^{−R/µ} for R/µ ≥ 4.32) with exponent µ omitted, plotted in logarithmic scale relative to the standard bound. Each bound is valid only in the region where it exceeds e^δ/(1 + δ)^{1+δ}.

5.2.3 Lower bound version

We can also use Chernoff bounds to show that a sum of independent 0–1 random variables isn't too small. The essential idea is to repeat the upper bound argument with a negative value of α, which makes e^{α(1−δ)µ} an increasing function in δ. The resulting bound is:

    Pr [S ≤ (1 − δ)µ] ≤ (e^{−δ} / (1 − δ)^{1−δ})^µ .    (5.2.5)

A simpler but weaker version of this bound is

    Pr [S ≤ (1 − δ)µ] ≤ e^{−µδ²/2} .    (5.2.6)

Both bounds hold for all δ with 0 ≤ δ ≤ 1.

5.2.4 Two-sided version

If we combine (5.2.2) with (5.2.6), we get

    Pr [|S − µ| ≥ δµ] ≤ 2e^{−µδ²/3} ,    (5.2.7)

for 0 ≤ δ ≤ 1.81.
Suppose that we want this bound to be less than ε. Then we need 2e^{−δ²µ/3} ≤ ε, or δ ≥ √(3 ln(2/ε)/µ). Setting δ to exactly this quantity, (5.2.7)
becomes

    Pr [|S − µ| ≥ √(3µ ln(2/ε))] ≤ ε ,    (5.2.8)

provided ε ≥ 2e^{−µ/3}.
For asymptotic purposes, we can omit the constants, giving

Lemma 5.2.1. Let S be a sum of independent 0–1 variables with E [S] = µ. Then for any ε with 2e^{−µ/3} ≤ ε < 1, S lies within O(√(µ log(1/ε))) of µ, with probability at least 1 − ε.

5.2.5 What if we only have a bound on E [S]?

For some applications, we may not know E [S] = Σ E [Xi] exactly, but have only an upper bound E [S] ≤ µ. It is not immediately obvious that we can use this upper bound in place of the actual value of E [S] when computing an upper tail bound on S, because µ appears in the exponent of all of our bounds, and it is not clear that the corresponding reduction in δ fully compensates for this. However, there is a simple argument that shows that all of these bounds hold even if we overestimate E [S].
Consider a sum S of independent 0–1 random variables with E [S] = µ′ ≤ µ. We can turn this into a sum of independent 0–1 random variables with expectation exactly µ by adding enough new extra variables to make a second sum T with E [T ] = µ − µ′. Now S + T satisfies the conditions for (5.2.1), and S ≤ S + T always, so

    Pr [S ≥ (1 + δ)µ] ≤ Pr [S + T ≥ (1 + δ)µ]
                      ≤ (e^δ / (1 + δ)^{1+δ})^µ .

This also works for any of the bounds in §5.2.2 that are derived from (5.2.1).
In the other direction, if we know E [S] ≥ µ and want to apply the lower tail bound (5.2.5), we can apply a slightly different construction. Suppose that E [S] = µ′ ≥ µ. For each Xi, construct a 0–1 random variable Yi such that (a) all the Yi are independent of each other, (b) Yi ≤ Xi always, and (c) E [Yi ] = E [Xi ] (µ/µ′). The easiest way to do this is to set Yi = Xi Zi where each Zi is an independent biased coin with expectation µ/µ′.
Let T = Σ Yi. Then T ≤ S and E [T ] = Σ E [Yi ] = Σ E [Xi ] (µ/µ′) = µ.
Since T satisfies the requirements of (5.2.5), we can argue

    Pr [S ≤ (1 − δ)µ] ≤ Pr [T ≤ (1 − δ)µ]
                      ≤ (e^{−δ} / (1 − δ)^{1−δ})^µ .

As with the upper tail bound, this approach also works for simplified versions of the lower tail bound like (5.2.6).
For the two-sided variants, we are out of luck. The best we can do if we know a ≤ E [S] ≤ b is to apply each of the one-sided bounds separately.

5.2.6 Almost-independent variables
Chernoff bounds generally don’t work very well for variables that are not
independent, and in most such cases we must use Chebyshev’s inequality
(§5.1) or the Azuma-Hoeffding inequality (§5.3) instead. But there is one
special case that comes up occasionally where it helps to be able to apply
the Chernoff bound to variables that are almost independent in a particular
technical sense.

Lemma 5.2.2. Let X1, . . . , Xn be 0–1 random variables with the property that E [Xi | X1, . . . , Xi−1] ≤ pi ≤ 1 for all i. Let µ = Σ_{i=1}^n pi and S = Σ_{i=1}^n Xi. Then (5.2.1) holds.
Alternatively, let E [Xi | X1, . . . , Xi−1] ≥ pi ≥ 0 for all i, and let µ = Σ_{i=1}^n pi and S = Σ_{i=1}^n Xi as before. Then (5.2.5) holds.

Proof. Rather than repeat the argument for independent variables, we will employ a coupling, where we replace the Xi with independent Yi so that Σ_{i=1}^n Yi gives a bound on Σ_{i=1}^n Xi.
For the upper bound, let each Yi = 1 with independent probability pi. Use the following process to generate a new X′i in increasing order of i: if Yi = 0, set X′i = 0. Otherwise set X′i = 1 with probability Pr [Xi = 1 | X1 = X′1, . . . , Xi−1 = X′i−1] / pi. Then X′i ≤ Yi, and

    Pr [X′i = 1 | X′1, . . . , X′i−1]
        = (Pr [Xi = 1 | X1 = X′1, . . . , Xi−1 = X′i−1] / pi) Pr [Yi = 1]
        = Pr [Xi = 1 | X1 = X′1, . . . , Xi−1 = X′i−1] .
It follows that the X′i have the same joint distribution as the Xi, and so

    Pr [Σ_{i=1}^n X′i ≥ µ(1 + δ)] = Pr [Σ_{i=1}^n Xi ≥ µ(1 + δ)]
                                  ≤ Pr [Σ_{i=1}^n Yi ≥ µ(1 + δ)]
                                  ≤ (e^δ / (1 + δ)^{1+δ})^µ .

For the other direction, generate the Xi first and generate the Yi using the same rejection sampling trick. Now the Yi are independent (because their joint distribution is the product of their marginals) and each Yi is a lower bound on the corresponding Xi.

The lemma is stated for the general Chernoff bounds (5.2.1) and (5.2.5),
but the easier versions follow from these, so they hold as well, as long as we
are careful to remember that the µ in the upper bound is not necessarily the
same µ as in the lower bound.

5.2.7 Other tail bounds for the binomial distribution


The random graph literature can be a good source for bounds on the binomial
distribution. See for example [Bol01, §1.3], which uses normal approximation
to get bounds that are slightly tighter than Chernoff bounds in some cases,
and [JLR00, Chapter 2], which describes several variants of Chernoff bounds
as well as tools for dealing with sums of random variables that aren’t fully
independent.

5.2.8 Applications
5.2.8.1 Flipping coins
Suppose $S$ is the sum of $n$ independent fair coin-flips. Then $\mathrm{E}[S] = n/2$ and $\Pr[S = n] = \Pr[S \ge 2\,\mathrm{E}[S]]$ is bounded using (5.2.1) by setting $\mu = n/2$, $\delta = 1$ to get $\Pr[S = n] \le (e/4)^{n/2} = (2/\sqrt{e})^{-n}$. This is not quite as good as the real answer $2^{-n}$ (the quantity $2/\sqrt{e}$ is about 1.213. . . ), but it's at least exponentially small.
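For a sense of how loose this is numerically, a throwaway Python check (mine, not from the text) comparing the bound with the exact answer:

import math

# Probability that n fair coins all come up heads: Chernoff gives (2/sqrt(e))^(-n),
# versus the exact answer 2^(-n).
for n in (10, 50, 100):
    bound = (2 / math.sqrt(math.e)) ** (-n)
    exact = 2.0 ** (-n)
    print(f"n={n:3d}  bound={bound:.3e}  exact={exact:.3e}  ratio={bound/exact:.3e}")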

5.2.8.2 Balls in bins again


Let’s try applying the Chernoff bound to the balls-in-bins problem. Here
we let S = m
P
i=1 Xi be the number of balls in a particular bin, with Xi
the indicator that the i-th ball lands in the bin, E [Xi ] = pi = 1/n, and
E [S] = µ = m/n. To get a bound on Pr [S ≥ m/n + k], apply the Chernoff
bound with δ = kn/m to get

Pr [S ≥ m/n + k] = Pr [S ≥ (m/n)(1 + kn/m)]


ek
≤ .
(1 + kn/m)1+kn/m
For m = n, this collapses to the somewhat nicer but still pretty horrifying
ek /(k + 1)k+1 .
Staying with m = n, if we are bounding the probability of having large
bins, we can use the 2−R variant to show that the probability that any
particular bin has more than 2 lg n balls (for example), is at most n−2 , giving
the probability that there exists such a bin of at most 1/n. This is not as
strong as what we can get out of the full Chernoff bound. If we take the
logarithm of ek /(k + 1)k+1 , we get k − (k + 1) ln(k + 1); if we then substitute
k = lnc lnlnnn − 1, we get
c ln n c ln n c ln n
−1− ln
ln ln n  ln ln n ln ln n
c 1 c

= (ln n) − − (ln c + ln ln n − ln ln ln n)
ln ln n ln n ln ln n
c 1 c ln c c ln ln ln n
 
= (ln n) − − −c+
ln ln n ln n ln ln n ln ln n
= (ln n)(−c + o(1)).

So the probability of getting more than c ln n/ ln ln n balls in any one bin


is bounded by exp((ln n)(−c + o(1))) = n−c+o(1) . This gives a maximum bin
size of O(log n/ log log n) with any fixed probability bound n−a for sufficiently
large n.
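A quick simulation can make the $O(\log n/\log\log n)$ scale concrete. The following Python sketch (mine; the choice of sample sizes is arbitrary) tosses $n$ balls into $n$ bins and reports the maximum load next to $\ln n/\ln\ln n$:

import random
from collections import Counter
from math import log

def max_load(n):
    # Toss n balls into n bins and return the fullest bin's count.
    return max(Counter(random.randrange(n) for _ in range(n)).values())

for n in (10**3, 10**4, 10**5):
    print(f"n={n:6d}  max load={max_load(n):2d}  ln n/ln ln n={log(n)/log(log(n)):5.2f}")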

5.2.8.3 Flipping coins, central behavior


Suppose we flip $n$ fair coins, and let $S$ be the number that come up heads. We expect $\mu = n/2$ heads on average. How many extra heads can we get, if we want to stay within a probability bound of $n^{-c}$?
Here we use the small-$\delta$ approximation, which gives $\Pr[S \ge (1+\delta)(n/2)] \le \exp(-\delta^2 n/6)$. Setting $\exp(-\delta^2 n/6) = n^{-c}$ gives $\delta = \sqrt{6\ln n^c/n} = \sqrt{6c\ln n/n}$.

[Figure 5.2: Hypercube network with n = 3. The eight nodes 000, 001, . . . , 111, with an edge between each pair of addresses that differ in exactly one bit.]

The actual excess over the mean is $\delta(n/2) = (n/2)\sqrt{6c\ln n/n} = \sqrt{\frac{3}{2}cn\ln n}$. By symmetry, the same bound applies to extra tails. So if we flip 1000 coins and see more than 676 heads (roughly the bound when $c = 3$), we can reasonably conclude that either (a) our coin is biased, or (b) we just hit a rare one-in-a-billion jackpot.
In algorithm analysis, the $\sqrt{(3/2)c}$ part usually gets absorbed into the asymptotic notation, and we just say that with probability at least $1 - n^{-c}$, the sum of $n$ random bits is within $O(\sqrt{n\log n})$ of $n/2$.
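The 676 figure is easy to reproduce; a short Python computation (mine) of the threshold $n/2 + \sqrt{(3/2)cn\ln n}$:

from math import log, sqrt

n, c = 1000, 3
delta = sqrt(6 * c * log(n) / n)   # from setting exp(-delta^2 n/6) = n^(-c)
print("threshold:", n / 2 + (n / 2) * delta)   # prints roughly 676.3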

5.2.8.4 Permutation routing on a hypercube


Here we use Chernoff bounds to show bounds on a classic permutation-
routing algorithm for hypercube networks due to Valiant [Val82]. The
presentation here is based on §4.2 of [MR95], which in turn is based on an
improved version of Valiant’s original analysis that appeared in a follow-up
paper with Brebner [VB81]. There’s also a write-up of this in [MU17, §4.6.1].
The basic idea of a hypercube architecture is that we have a collection of $N = 2^n$ processors, each with an $n$-bit address. Two nodes are adjacent if their addresses differ by one bit (see Figure 5.2 for an example). Though
now mostly of theoretical interest, these things were the cat’s pajamas back
in the 1980s: see http://en.wikipedia.org/wiki/Connection_Machine.
Suppose that at some point in a computation, each processor i wants to
send a packet of data to some processor π(i), where π is a permutation of
the addresses. But we can only send one packet per time unit along each of

the n edges leaving a processor.6 How do we route the packets so that all of
them arrive in the minimum amount of time?
We could try to be smart about this, or we could use randomization.
Valiant’s idea is to first route each process i’s packet to some random
intermediate destination σ(i), then in the second phase, we route it from σ(i)
to its ultimate destination π(i). Unlike π, σ is not necessarily a permutation;
instead, σ(i) is chosen uniformly at random independently of all the other
σ(j). This makes the choice of paths for different packets independent of each other, which we will need later to apply Chernoff bounds.
Routing is done by bit-fixing: if a packet is currently at node $x$ and heading for node $y$, find the leftmost bit $j$ where $x_j \ne y_j$ and fix it, by sending the packet on to $x[x_j/y_j]$. In the absence of contention, bit-fixing
routes a packet to its destination in at most n steps. The hope is that the
randomization will tend to spread the packets evenly across the network,
reducing the contention for edges enough that the actual time will not be
much more than this.
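Here is a minimal sketch of bit-fixing in Python (the function name and the representation of nodes as integers are my own choices, not from [Val82]); it computes the path a single packet follows in the absence of contention:

def bit_fixing_path(x, y, n):
    # Nodes are n-bit integers; repeatedly fix the leftmost (most significant)
    # bit where the current node x disagrees with the destination y.
    path = [x]
    for j in range(n - 1, -1, -1):
        if (x >> j) & 1 != (y >> j) & 1:
            x ^= 1 << j          # cross the edge that flips bit j
            path.append(x)
    return path

# Example on the 3-cube: 000 -> 101 goes 000, 100, 101.
print([format(v, "03b") for v in bit_fixing_path(0b000, 0b101, 3)])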
The first step is to argue that, during the first phase, any particular
packet is delayed at most one time unit by any other packet whose path
overlaps with it. Suppose packet i is delayed by contention on some edge uv.
Then there must be some other packet j that crosses uv during this phase.
From this point on, j remains one step ahead of i (until its path diverges),
so it can’t block i again unless both are blocked by some third packet k (in
which case we charge i’s further delay to k). This means that we can bound
the delays for packet i by counting how many other packets cross its path.7
So now we just need a high-probability bound on the number of packets that
get in a particular packet’s way.
Following the presentation in [MR95], define $H_{ij}$ to be the indicator variable for the event that packets $i$ and $j$ cross paths during the first phase. Because each $j$ chooses its destination independently, once we fix $i$'s path, the $H_{ij}$ are all independent. So we can bound $S = \sum_{j\ne i} H_{ij}$ using Chernoff bounds. To do so, we must first calculate an upper bound on $\mu = \mathrm{E}[S]$. The trick here is to observe that any path that crosses $i$'s path must cross one of its edges, and we can bound the number of such paths by bounding how many paths cross each edge. For each edge $e$, let $T_e$ be the number of paths that cross edge $e$, and for each $j$, let $X_j$ be the number of edges that path $j$ crosses. Counting two ways, we have $\sum_e T_e = \sum_j X_j$, and so
6
Formally, we have a synchronous routing model with unbounded buffers at each node,
with a maximum capacity of one packet per edge per round.
7
A much more formal version of this argument is given as [MR95, Lemma 4.5].
$\mathrm{E}\left[\sum_e T_e\right] = \mathrm{E}\left[\sum_j X_j\right] \le N(n/2)$. By symmetry, all the $T_e$ have the same expectation, so we get $\mathrm{E}[T_e] \le \frac{N(n/2)}{Nn} = 1/2$.
Now fix $\sigma(i)$. This determines some path $e_1 e_2 \dots e_k$ for packet $i$. In general we do not expect $\mathrm{E}[T_{e_\ell} \mid \sigma(i)]$ to equal $\mathrm{E}[T_{e_\ell}]$, because conditioning on $i$'s path crossing $e_\ell$ guarantees that at least one path crosses this edge that might not have. However, if we let $T'_e$ be the number of packets $j \ne i$ that cross $e$, then we have $T'_e \le T_e$ always, giving $\mathrm{E}[T'_e] \le \mathrm{E}[T_e]$, and because $T'_e$ does not depend on $i$'s path, $\mathrm{E}[T'_e \mid \sigma(i)] = \mathrm{E}[T'_e] \le \mathrm{E}[T_e] \le 1/2$. Summing this bound over all $k \le n$ edges on $i$'s path gives $\mathrm{E}\left[\sum_{j\ne i} H_{ij} \,\middle|\, \sigma(i)\right] \le n/2$, which implies $\mathrm{E}\left[\sum_{j\ne i} H_{ij}\right] \le n/2$ after removing the conditioning on $\sigma(i)$.
Inequality (5.2.4) says that $\Pr[X \ge R] \le 2^{-R}$ when $R \ge 2e\mu$. Letting $X = \sum_{j\ne i} H_{ij}$ and setting $R = 3n$ gives $R = 6(n/2) \ge 6\mu > 2e\mu$, so $\Pr\left[\sum_{j\ne i} H_{ij} \ge 3n\right] \le 2^{-3n} = N^{-3}$. This says that any one packet reaches its random destination with at most $3n$ added delay (thus, in at most $4n$ time units) with probability at least $1 - N^{-3}$. If we consider all $N$ packets, the total probability that any of them fail to reach their random destinations in $4n$ time units is at most $N \cdot N^{-3} = N^{-2}$. Note that because we are using the union bound, we don't need independence for this step—which is good, because we don't have it.
What about the second phase? Here, routing the packets from the random destinations back to the real destinations is just the reverse of routing them from the real destinations to the random destinations. So the same bound applies, and with probability at most $N^{-2}$ some packet takes more than $4n$ time units to get back (this assumes that we hold all the packets before sending them back out, so there are no collisions between packets from different phases).
Adding up the failure probabilities and costs for both stages gives a probability of at most $2/N^2$ that any packet takes more than $8n$ time units to reach its destination.
The structure of this argument is pretty typical for applications of Cher-
noff bounds: we get a very small bound on the probability that something
bad happens by applying Chernoff bounds to a part of the problem where
we have independence, then use the union bound to extend this to the full
problem where we don’t.

5.3 The Azuma-Hoeffding inequality


The problem with Chernoff bounds is that they only work for sums of
Bernoulli random variables. Hoeffding’s inequality [Hoe63] is another
concentration bound based on the moment generating function that applies
to any sum of bounded independent random variables with mean 0.8 It has
the additional useful feature that it generalizes nicely to some collections of
random variables that are not mutually independent, as we will see in §5.3.2.
This more general version is known as Azuma’s inequality [Azu67] or the
Azuma-Hoeffding inequality.9

5.3.1 Hoeffding’s inequality


This is the version for sums of bounded independent random variables. We
will consider the symmetric case, where each variable Xi satisfies |Xi | ≤ ci
for some constant ci . Hoeffding’s original result considered bounds of the
form ai ≤ Xi ≤ bi , and is equivalent when ai = −bi .
The main tool is Hoeffding’s lemma, which states
Lemma 5.3.1. Let $\mathrm{E}[X] = 0$ and $|X| \le c$ with probability 1. Then
\[
\mathrm{E}\left[e^{\alpha X}\right] \le e^{(\alpha c)^2/2}. \tag{5.3.1}
\]

Proof. The basic idea is that, for any $\alpha$, $e^{\alpha x}$ is a convex function. Since we want an upper bound, we can't use Jensen's inequality (4.3.1), but we can use the fact that $X$ is bounded and we know its expectation. Convexity of $e^{\alpha x}$ means that, for any $x$ with $-c \le x \le c$, $e^{\alpha x} \le \lambda e^{-\alpha c} + (1-\lambda)e^{\alpha c}$, where $x = \lambda(-c) + (1-\lambda)c$. Solving for $\lambda$ in terms of $x$ gives $\lambda = \frac12\left(1 - \frac{x}{c}\right)$ and $1 - \lambda = \frac12\left(1 + \frac{x}{c}\right)$. So
\begin{align*}
\mathrm{E}\left[e^{\alpha X}\right] &\le \mathrm{E}\left[\frac12\left(1 - \frac{X}{c}\right)e^{-\alpha c} + \frac12\left(1 + \frac{X}{c}\right)e^{\alpha c}\right]\\
&= \frac{e^{-\alpha c} + e^{\alpha c}}{2} - \frac{e^{-\alpha c}}{2c}\,\mathrm{E}[X] + \frac{e^{\alpha c}}{2c}\,\mathrm{E}[X]\\
&= \frac{e^{-\alpha c} + e^{\alpha c}}{2}\\
&= \cosh(\alpha c).
\end{align*}
8
Note that the requirement that E [Xi ] = 0 can always be satisfied by considering
instead Yi = Xi − E [Xi ].
9
The history of this is that Hoeffding [Hoe63] proved it for independent random variables,
and observed that the proof was easily extended to martingales, while Azuma [Azu67]
actually went and did the work of proving it for martingales.

In other words, the worst possible $X$ is a fair choice between $\pm c$, and in this case we get the hyperbolic cosine of $\alpha c$ as its moment generating function.
We don't like hyperbolic cosines much, because we are going to want to take products of our bounds, and hyperbolic cosines don't multiply very nicely. As before with $1 + x$, we'd be much happier if we could replace the cosh with a nice exponential. The Taylor series expansion of $\cosh x$ starts with $1 + x^2/2 + \dots$, suggesting that we should approximate it with $\exp(x^2/2)$, and indeed it is the case that for all $x$, $\cosh x \le e^{x^2/2}$. This can be shown by comparing the rest of the Taylor series expansions:
\begin{align*}
\cosh x &= \frac{e^x + e^{-x}}{2}\\
&= \frac12\left(\sum_{n=0}^\infty \frac{x^n}{n!} + \sum_{n=0}^\infty \frac{(-x)^n}{n!}\right)\\
&= \sum_{n=0}^\infty \frac{x^{2n}}{(2n)!}\\
&\le \sum_{n=0}^\infty \frac{x^{2n}}{2^n n!}\\
&= \sum_{n=0}^\infty \frac{(x^2/2)^n}{n!}\\
&= e^{x^2/2}.
\end{align*}

This gives the claimed bound
\[
\mathrm{E}\left[e^{\alpha X}\right] \le \cosh(\alpha c) \le e^{(\alpha c)^2/2}.
\]
Theorem 5.3.2. Let $X_1 \dots X_n$ be independent random variables with $\mathrm{E}[X_i] = 0$ and $|X_i| \le c_i$ for all $i$. Then for all $t$,
\[
\Pr\left[\sum_{i=1}^n X_i \ge t\right] \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\right). \tag{5.3.2}
\]

Proof. Let $S = \sum_{i=1}^n X_i$. As with Chernoff bounds, we'll first calculate a bound on the moment generating function $\mathrm{E}\left[e^{\alpha S}\right]$ and then apply Markov's inequality with a carefully-chosen $\alpha$.
From (5.3.1), we have $\mathrm{E}\left[e^{\alpha X_i}\right] \le e^{(\alpha c_i)^2/2}$ for all $i$. Using this bound and the independence of the $X_i$, we compute
\begin{align*}
\mathrm{E}\left[e^{\alpha S}\right] &= \mathrm{E}\left[\exp\left(\alpha \sum_{i=1}^n X_i\right)\right]\\
&= \mathrm{E}\left[\prod_{i=1}^n e^{\alpha X_i}\right]\\
&= \prod_{i=1}^n \mathrm{E}\left[e^{\alpha X_i}\right]\\
&\le \prod_{i=1}^n e^{(\alpha c_i)^2/2}\\
&= \exp\left(\sum_{i=1}^n \frac{\alpha^2 c_i^2}{2}\right)\\
&= \exp\left(\frac{\alpha^2}{2}\sum_{i=1}^n c_i^2\right).
\end{align*}

Applying Markov’s inequality then gives (when α > 0):


h i
Pr [S ≥ t] = Pr eαS ≥ eαt
n
!
α2 X
≤ exp c2 − αt . (5.3.3)
2 i=1 i

Now we do the same trick as in Chernoff bounds and choose $\alpha$ to minimize the bound. If we write $C$ for $\sum_{i=1}^n c_i^2$, this is done by minimizing the exponent $\frac{\alpha^2}{2}C - \alpha t$, which we do by taking the derivative with respect to $\alpha$ and setting it to zero: $\alpha C - t = 0$, or $\alpha = t/C$. At this point, the exponent becomes $\frac{(t/C)^2}{2}C - (t/C)t = -\frac{t^2}{2C}$.
Plugging this into (5.3.3) gives the bound (5.3.2) claimed in the theorem.

5.3.1.1 Hoeffding vs Chernoff


Let’s see how good a bound this gets us for our usual test problem of bounding
Pr [S = n] where S = ni=1 Xi is the sum of n independent fair coin-flips. To
P

make the problem fit the theorem, we replace each Xi by a rescaled version
Yi = 2Xi − 1 = ±1 with equal probability; this makes E [Yi ] = 0 as needed,

with $|Y_i| \le c_i = 1$. Hoeffding's inequality (5.3.2) then gives
\[
\Pr\left[\sum_{i=1}^n Y_i \ge n\right] \le \exp\left(-\frac{n^2}{2n}\right) = e^{-n/2} = (\sqrt{e})^{-n}.
\]
Since $\sqrt{e} \approx 1.649\dots$, this is actually slightly better than the $(2/\sqrt{e})^{-n}$ bound we get using Chernoff bounds.
On the other hand, Chernoff bounds work better if we have a more skewed distribution on the $X_i$; for example, in the balls-in-bins case, each $X_i$ is a Bernoulli random variable with $\mathrm{E}[X_i] = 1/n$. Using Hoeffding's inequality, we get a bound $c_i$ on $|X_i - \mathrm{E}[X_i]|$ of only $1 - 1/n$, which puts $\sum_{i=1}^n c_i^2$ very close to $n$, requiring $t = \Omega(\sqrt{n})$ before we get any non-trivial bound out of (5.3.2), pretty much the same as in the fair-coin case (which is not surprising, since Hoeffding's inequality doesn't know anything about the distribution of the $X_i$). But we've already seen that Chernoff gives us that $\sum X_i = O(\log n/\log\log n)$ with high probability in this case.

5.3.1.2 Asymmetric version


The original version of Hoeffding's inequality [Hoe63] assumes $a_i \le X_i \le b_i$, but $\mathrm{E}[X_i]$ is still zero for all $X_i$. In this version, the bound is
\[
\Pr\left[\sum_{i=1}^n X_i \ge t\right] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right). \tag{5.3.4}
\]

This reduces to (5.3.2) when $a_i = -c_i$ and $b_i = c_i$. The proof is essentially the same, but a little more analytic sneakery is required to show that $\mathrm{E}\left[e^{\alpha X_i}\right] \le e^{\alpha^2(b_i - a_i)^2/8}$; see [McD89] for a proof of this that is a little more approachable than Hoeffding's original paper. For most applications, the only difference between the symmetric version (5.3.2) and the asymmetric version (5.3.4) is a small constant factor on the resulting bound on $t$.
Hoeffding’s inequality is not the tightest possible inequality that can be
obtained from the conditions under which it applies, but is relatively simple
and easy to work with. For a particularly strong version of Hoeffding’s
inequality and a discussion of some variants, see [FGQ12].

5.3.2 Azuma’s inequality


A general rule of thumb is that most things that work for sums of independent
random variables also work for martingales, which are sequences of random
variables that have similar behavior but allow for more dependence.

Formally, a martingale is a sequence of random variables S0 , S1 , S2 , . . . ,


where E [St | S1 , . . . , St−1 ] = St−1 . In other words, given everything you
know up until time t − 1, your best guess of the expected value at time t is
just wherever you are now.
Another way to describe a martingale is to take the partial sums $S_t = \sum_{i=1}^t X_i$ of a martingale difference sequence, which is a sequence of random variables $X_1, X_2, \dots$ where $\mathrm{E}[X_t \mid X_1 \dots X_{t-1}] = 0$. So in this version, your expected change from time $t - 1$ to $t$ averages out to zero, even if you try to predict it using all the information you have at time $t - 1$.
In some cases it makes sense to allow extra information to sneak in. We
can represent this using σ-algebras, in particular by using a filtration of the
form F0 ⊆ F1 ⊆ F2 ⊆ . . . , where each Ft is a σ-algebra (see §3.5.3). A
sequence S0 , S1 , S2 , . . . is adapted to a filtration F0 ⊆ F1 ⊆ F2 ⊆ . . . if each
St is Ft -measurable. This means that at time t the sum of our knowledge
(Ft ) is enough to predict exactly the value of St . The subset relations also
mean that we remember everything there is to know about St0 for t0 < t.
The general definition of a martingale is a collection {(St , Ft ) | t ∈ N}
where

1. Each St is Ft -measurable; and

2. E [St+1 | Ft ] = St .

This means that even if we include any extra information we might have
at time t, we still can’t predict St+1 any better than by guessing the current
value St . This alternative definition will be important in some special cases,
as when St is a function of some other collection of random variables that
we use to define the Ft . Because Ft includes at least as much information as
S0 , . . . , St , it will always be the case that any sequence {(St , Ft )} that is a
martingale in the general sense gives a sequence {St } that is a martingale in
the more specialized E [St+1 | S0 , . . . , St ] = St sense.
Martingales were invented to analyze fair gambling games, where your
return over some time interval is not independent of previous outcomes (for
example, you may change your bet or what game you are playing depending
on how things have been going for you), but it is always zero on average
given previous information.10 The nice thing about martingales is they allow
10
Real casinos give negative expected return, so your winnings in a real casino form
a supermartingale with St ≥ E [St+1 | S0 . . . St ]. On the other hand, the casino’s take,
in a well-run casino, is a submartingale, a process with St ≤ E [St+1 | S0 . . . St ]. These
definitions also generalize in the obvious way to the {(St , Ft )} case.

for a bit of dependence while still acting very much like sums of independent
random variables.
Where this comes up with Hoeffding’s inequality is that we might have a
process that is reasonably well-behaved, but its increments are not technically
independent. For example, suppose that a gambler plays a game where
they bet $x$ units, $0 \le x \le 1$, at each round, and receive $\pm x$ with equal probability. Suppose also that their bet at each round may depend on the outcome of previous rounds (for example, they might stop betting entirely if they lose too much money). If $X_i$ is their take at round $i$, we have that $\mathrm{E}[X_i \mid X_1 \dots X_{i-1}] = 0$ and that $|X_i| \le 1$. This is enough to apply the martingale version of Hoeffding's inequality, often called Azuma's inequality.
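A quick Monte Carlo sketch (Python; the particular history-dependent betting rule is made up for illustration) of such a gambler, compared against the bound $\exp(-t^2/2n)$ that (5.3.5) below gives with all $c_i = 1$:

import math
import random

def play(n):
    # One gambler: the bet size in [0,1] may depend on history; they win or
    # lose the bet with equal probability, so each increment has conditional
    # mean 0 and absolute value at most 1.
    s = 0.0
    for _ in range(n):
        bet = 0.5 if s >= 0 else 1.0
        s += bet if random.random() < 0.5 else -bet
    return s

n, t, trials = 100, 20.0, 100_000
hits = sum(play(n) >= t for _ in range(trials))
print("empirical:", hits / trials, " Azuma bound:", math.exp(-t * t / (2 * n)))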
Theorem 5.3.3. Let $\{S_k\}$ be a martingale with $S_k = \sum_{i=1}^k X_i$ and $|X_i| \le c_i$ for all $i$. Then for all $n$ and all $t \ge 0$:
\[
\Pr[S_n \ge t] \le \exp\left(\frac{-t^2}{2\sum_{i=1}^n c_i^2}\right). \tag{5.3.5}
\]
Proof. Basically, we just show that $\mathrm{E}\left[e^{\alpha S_n}\right] \le \exp\left(\frac{\alpha^2}{2}\sum_{i=1}^n c_i^2\right)$—just like in the proof of Theorem 5.3.2—and the rest follows using the same argument. The only tricky part is we can no longer use independence to transform $\mathrm{E}\left[\prod_{i=1}^n e^{\alpha X_i}\right]$ into $\prod_{i=1}^n \mathrm{E}\left[e^{\alpha X_i}\right]$.
Instead, we use the martingale property. For each $X_i$, we have $\mathrm{E}[X_i \mid X_1 \dots X_{i-1}] = 0$ and $|X_i| \le c_i$ always. Recall that $\mathrm{E}\left[e^{\alpha X_i} \,\middle|\, X_1 \dots X_{i-1}\right]$ is a random variable that takes on the average value of $e^{\alpha X_i}$ for each setting of $X_1 \dots X_{i-1}$. We can apply the same analysis as in the proof of 5.3.2 to show that this means that $\mathrm{E}\left[e^{\alpha X_i} \,\middle|\, X_1 \dots X_{i-1}\right] \le e^{(\alpha c_i)^2/2}$ always.
The trick is to use the fact that, for any random variables $X$ and $Y$, $\mathrm{E}[XY] = \mathrm{E}[\mathrm{E}[XY \mid X]] = \mathrm{E}[X\,\mathrm{E}[Y \mid X]]$.
We argue by induction on $n$ that $\mathrm{E}\left[\prod_{i=1}^n e^{\alpha X_i}\right] \le \prod_{i=1}^n e^{(\alpha c_i)^2/2}$. The

base case is when $n = 0$. For the induction step, compute
\begin{align*}
\mathrm{E}\left[\prod_{i=1}^n e^{\alpha X_i}\right] &= \mathrm{E}\left[\mathrm{E}\left[\prod_{i=1}^n e^{\alpha X_i} \,\middle|\, X_1 \dots X_{n-1}\right]\right]\\
&= \mathrm{E}\left[\left(\prod_{i=1}^{n-1} e^{\alpha X_i}\right)\mathrm{E}\left[e^{\alpha X_n} \,\middle|\, X_1 \dots X_{n-1}\right]\right]\\
&\le \mathrm{E}\left[\left(\prod_{i=1}^{n-1} e^{\alpha X_i}\right)e^{(\alpha c_n)^2/2}\right]\\
&= \mathrm{E}\left[\prod_{i=1}^{n-1} e^{\alpha X_i}\right]e^{(\alpha c_n)^2/2}\\
&\le \left(\prod_{i=1}^{n-1} e^{(\alpha c_i)^2/2}\right)e^{(\alpha c_n)^2/2}\\
&= \prod_{i=1}^n e^{(\alpha c_i)^2/2}\\
&= \exp\left(\frac{\alpha^2}{2}\sum_{i=1}^n c_i^2\right).
\end{align*}

The rest of the proof goes through as before.

Some extensions:
• The asymmetric version of Hoeffding's inequality (5.3.4) also holds for martingales. So if each increment $X_i$ satisfies $a_i \le X_i \le b_i$ always,
\[
\Pr\left[\sum_{i=1}^n X_i \ge t\right] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right). \tag{5.3.6}
\]

• The same bound works for bounded-difference supermartingales. A supermartingale is a process where $\mathrm{E}[X_i \mid X_1 \dots X_{i-1}] \le 0$; the idea is that my expected gain at any step is non-positive, so my present wealth is always superior to my future wealth.11 If $\mathrm{E}[X_i \mid X_1 \dots X_{i-1}] \le 0$ and $|X_i| \le c_i$, then we can write $X_i = Y_i + Z_i$ where $Y_i = \mathrm{E}[X_i \mid X_1 \dots X_{i-1}] \le 0$ is predictable from $X_1 \dots X_{i-1}$ and $\mathrm{E}[Z_i \mid X_1 \dots X_{i-1}] = 0$.12 Then we can bound $\sum_{i=1}^n X_i$ by observing that it is no greater than $\sum_{i=1}^n Z_i$.

11
The corresponding notion in the other direction is a submartingale. See §9.2.
12
This is known as a Doob decomposition and can be used to extract a martingale
{Zi } from any stochastic process {Xi }. For general processes, Yi = Xi − Zi will still be
predictable, but may not satisfy E [Yi | X1 , . . . , Xi−1 ] ≤ 0.

A complication is that we no longer have $|Z_i| \le c_i$; instead, $|Z_i| \le 2c_i$ (since leaving out $Y_i$ may shift $Z_i$ up). But with this revised bound, (5.3.5) gives
\[
\Pr\left[\sum_{i=1}^n X_i \ge t\right] \le \Pr\left[\sum_{i=1}^n Z_i \ge t\right] \le \exp\left(-\frac{t^2}{8\sum_{i=1}^n c_i^2}\right). \tag{5.3.7}
\]

• Suppose that we stop the process after the first time $\tau$ with $S_\tau = \sum_{i=1}^\tau X_i \ge t$. This is equivalent to making a new variable $Y_i$ that is zero whenever $S_{i-1} \ge t$ and equal to $X_i$ otherwise. This doesn't affect the conditions $\mathrm{E}[Y_i \mid Y_1 \dots Y_{i-1}] = 0$ or $|Y_i| \le c_i$, but it makes it so $\sum_{i=1}^n Y_i \ge t$ if and only if $\max_{k\le n} \sum_{i=1}^k X_i \ge t$. Applying (5.3.5) to $\sum Y_i$ then gives
\[
\Pr\left[\max_{k\le n} \sum_{i=1}^k X_i \ge t\right] \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\right). \tag{5.3.8}
\]

• Since the conditions on $X_i$ in Theorem 5.3.3 apply equally well to $-X_i$, we have
\[
\Pr\left[\sum_{i=1}^n X_i \le -t\right] \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\right), \tag{5.3.9}
\]
which we can combine with (5.3.5) to get the two-sided bound
\[
\Pr\left[\left|\sum_{i=1}^n X_i\right| \ge t\right] \le 2\exp\left(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\right). \tag{5.3.10}
\]

• The extension of Hoeffding’s inequality to the case ai ≤ Xi ≤ bi works


equally well for Azuma’s inequality, giving the same bound as in (5.3.4).

• Finally, one can replace the requirement that each $c_i$ be a constant with a requirement that $c_i$ be predictable from $X_1 \dots X_{i-1}$ and that $\sum_{i=1}^n c_i^2 \le C$ always and get $\Pr\left[\sum_{i=1}^n X_i \ge t\right] \le e^{-t^2/2C}$. This generally doesn't come up unless you have an algorithm that explicitly cuts off the process if $\sum c_i^2$ gets too big, but there is at least one example of this in the literature [AW96].



There are also cases where the asymmetric version works with ai ≤
Xi ≤ bi where a bound on bi − ai is fixed but the precise values of
ai and bi may vary depending on X1 , . . . , Xi−1 . This shows up in the
proof of McDiarmid’s inequality [McD89], which is described below in
§5.3.3.

5.3.3 The method of bounded differences


To use Azuma’s inequality, we need a bounded-difference martingale. The
easiest way to get such martingales is through the method of bounded dif-
ferences, which was popularized by a survey paper by McDiarmid [McD89].
For this reason the key result is often referred to as McDiarmid’s inequal-
ity.
The basic idea of the method is to structure a problem so that we are
computing a function f (X1 , . . . , Xn ) of a sequence of independent random
variables X1 , . . . , Xn . To get our martingale, we’ll imagine we reveal the Xi
one at a time, and compute at each step the expectation of the final value of
f based on just the inputs we’ve seen so far.
Formally, let $\mathcal{F}_t = \langle X_1, \dots, X_t\rangle$, the $\sigma$-algebra generated by $X_1$ through $X_t$. This represents all the information we have at time $t$. Let $Y_t = \mathrm{E}[f \mid \mathcal{F}_t]$, the expected value of $f$ given the values of the first $t$ variables. Then $\{\langle Y_t, \mathcal{F}_t\rangle\}$ forms a martingale, with $Y_0 = \mathrm{E}[f]$ and $Y_n = \mathrm{E}[f \mid X_1, \dots, X_n] = f$. So if we can find a bound $c_t$ on $|Y_t - Y_{t-1}|$, we can apply Azuma's inequality to get bounds on $Y_n - Y_0 = f - \mathrm{E}[f]$.
A sequence of random variables of the form $Y_t = \mathrm{E}[Z \mid \mathcal{F}_t]$, where $Z$ is some fixed random variable and $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \dots$ is a filtration, is called a Doob martingale, and this is one of the most common ways to construct a martingale. The proof that a Doob martingale is in fact a martingale is immediate from the general version of the law of iterated expectation: $\mathrm{E}[Y_{t+1} \mid \mathcal{F}_t] = \mathrm{E}[\mathrm{E}[Z \mid \mathcal{F}_{t+1}] \mid \mathcal{F}_t] = \mathrm{E}[Z \mid \mathcal{F}_t] = Y_t$. Not all martingales are Doob martingales: for example, the martingale whose difference sequence consists of fair $\pm 1$ coin-flips doesn't converge to any random variable in the limit.13
To show that Yt meets the conditions for Azuma’s inequality, we require
13 A question came up in class whether martingales that converge at some finite time $n$ are technically Doob martingales with respect to $\mathcal{F}_t = \langle Y_0, Y_1, \dots, Y_t\rangle$. This is sort of true, since we have that $\mathrm{E}[Y_n \mid \mathcal{F}_t] = Y_t$ for all $t$. However this doesn't help much for constructing a particular martingale. If we let $Y_t = 0$ for all $t < k$ and $Y_t = Y_k = \pm 1$ for all $t \ge k$, then for any $n \ge k$, $\{\langle Y_t, \langle Y_0, \dots, Y_t\rangle\rangle\}$ is a Doob martingale generated by $Y_n$. But this doesn't tell us anything about which $Y_k$ does the coin-flip.

that $f$ has the bounded difference property, which says that there are bounds $c_t$ such that for any $x_1 \dots x_n$ and any $x'_t$, we have
\[
\left|f(x_1 \dots x_t \dots x_n) - f(x_1 \dots x'_t \dots x_n)\right| \le c_t. \tag{5.3.11}
\]

We want to bound $|Y_{t+1} - Y_t| = |\mathrm{E}[f \mid \mathcal{F}_{t+1}] - \mathrm{E}[f \mid \mathcal{F}_t]|$. We can do this by showing $Y_{t+1} - Y_t \le c_{t+1}$. Because $f$ can be replaced by $-f$ without changing the bounded difference property, essentially the same argument will show $Y_{t+1} - Y_t \ge -c_{t+1}$.
Fix some possible value $x_{t+1}$ for $X_{t+1}$. The bounded difference property says that
\[
|f(X_1, \dots, X_t, x_{t+1}, X_{t+2}, \dots, X_n) - f(X_1, \dots, X_t, X_{t+1}, X_{t+2}, \dots, X_n)| \le c_{t+1},
\]
so
\[
|\mathrm{E}[f(X_1, \dots, X_t, x_{t+1}, X_{t+2}, \dots, X_n) \mid X_1, \dots, X_t] - \mathrm{E}[f(X_1, \dots, X_t, X_{t+1}, X_{t+2}, \dots, X_n) \mid X_1, \dots, X_t]| \le c_{t+1}. \tag{5.3.12}
\]

The second conditional expectation is just $Y_t$. What is the first one? Recall
\begin{align*}
Y_{t+1} &= \mathrm{E}[f(X_1, \dots, X_n) \mid X_1, \dots, X_{t+1}]\\
&= \mathrm{E}[f(x_1, \dots, x_{t+1}, X_{t+2}, \dots, X_n)]
\end{align*}
when $X_i = x_i$ for all $i \le t+1$, since conditioning on $X_1, \dots, X_{t+1}$ just replaces the random variables with their actual values and then averages over the rest. This is the same as
\[
\mathrm{E}[f(X_1, \dots, X_t, x_{t+1}, X_{t+2}, \dots, X_n) \mid X_1, \dots, X_t] = \mathrm{E}[f(x_1, \dots, x_{t+1}, X_{t+2}, \dots, X_n)]
\]
when $x_{t+1}$ happens to be the value of $X_{t+1}$. So the first conditional expectation in (5.3.12) is just $Y_{t+1}$, giving
\[
|Y_{t+1} - Y_t| \le c_{t+1}.
\]

Now we can apply Azuma-Hoeffding to get
\[
\Pr[Y_n - Y_0 \ge t] \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n c_i^2}\right).
\]

This turns out to overestimate the possible range of Yt+1 . With a more
sophisticated argument, it can be shown that for any fixed x1 , . . . , xt , there
exist bounds at+1 ≤ Yt+1 − Yt ≤ bt+1 such that bt+1 − at+1 = ct+1 . We
would like to use this to apply the asymmetric version of Azuma-Hoeffding
given in (5.3.6). A complication is that the specific values of at+1 and bt+1
may depend on the previous values x1 , . . . , xt , even if the bound ct+1 on
their maximum difference does not. Fortunately, McDiarmid shows that the
inequality works anyway, giving:

Theorem 5.3.4 (McDiarmid's inequality [McD89]). Let $X_1, \dots, X_n$ be independent random variables and let $f(X_1, \dots, X_n)$ have the bounded difference property with bounds $c_i$. Then
\[
\Pr[f(X_1, \dots, X_n) - \mathrm{E}[f(X_1, \dots, X_n)] \ge t] \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right). \tag{5.3.13}
\]

The main difference between this and a direct application of Azuma-


Hoeffding is that the constant factor in the exponent is better by a factor of
4.
Since $-f$ satisfies the same constraints as $f$, the same bound holds for $\Pr[f(X_1, \dots, X_n) - \mathrm{E}[f(X_1, \dots, X_n)] \le -t]$. For some applications it may make sense to apply the union bound to get a two-sided version
\[
\Pr[|f(X_1, \dots, X_n) - \mathrm{E}[f(X_1, \dots, X_n)]| \ge t] \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right). \tag{5.3.14}
\]

5.3.4 Applications
Here are some applications of the preceding inequalities. Most of these are
examples of the method of bounded differences.

5.3.4.1 Sprinkling points on a hypercube


Suppose you live in a hypercube, and the local government has conveniently
placed mailboxes on some subset A of the nodes. If you start at a random

location, how likely is it that your distance to the nearest mailbox deviates
substantially from the average distance?
We can describe your position as a bit vector X1 , . . . , Xn , where each
Xi is an independent random bit. Let f (X1 , . . . , Xn ) be the distance from
$X_1, \dots, X_n$ to the nearest element of $A$. Then changing one of the bits changes this function by at most 1. So we have $\Pr[|f - \mathrm{E}[f]| \ge t] \le 2e^{-2t^2/n}$ by (5.3.13), giving a range of possible distances that is $O(\sqrt{n\log n})$ with probability at least $1 - n^{-c}$ for any fixed $c > 0$.14 Of course, without knowing what $A$ is, we don't know what $\mathrm{E}[f]$ is; but at least we can be assured that (unless $A$ is very big) the distance we have to walk to send our mail will be pretty much the same pretty much wherever we start.
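For small $n$ this is easy to check by brute force. The following Python sketch (mine; the mailbox set A is chosen arbitrarily) samples random starting points and measures the spread of $f$:

import random

def distance_to_A(x, A):
    # Hamming distance from x to the nearest mailbox in A.
    return min(sum(a != b for a, b in zip(x, m)) for m in A)

n = 10
A = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(5)]
samples = [
    distance_to_A(tuple(random.randint(0, 1) for _ in range(n)), A)
    for _ in range(10_000)
]
mean = sum(samples) / len(samples)
print("mean:", mean, " worst deviation:", max(abs(s - mean) for s in samples))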

5.3.4.2 Chromatic number of a random graph


Consider a random graph G(n, p) consisting of n vertices, where each pos-
sible edge appears with independent probability p. Let χ be the chromatic
number of this graph, the minimum number of colors necessary if we want
to assign a color to each vertex that is distinct for the colors of all of its
neighbors. The vertex exposure martingale shows us the vertices of the
graph one at a time, along with all the edges between vertices that have
been exposed so far. We define Xt to be the expected value of χ given this
information for vertices 1 . . . t.
If Zi is a random variable describing which edges are present between i
and vertices less than i, then the Zi are all independent, and we can write
χ = f (Z1 , . . . , Zn ) for some function f (this function may not be very easy to
compute, but it exists). Then Xt as defined above is just E [f | Z1 , . . . , Zt ].
Now observe that f has the bounded difference property with ct = 1: if I
change the edges for some vertex vt , I can’t increase the number of colors
I need by more than 1, since in the worst case I can always take whatever
coloring I previously had for all the other vertices and add a new color for
vt . This implies that the difference between any two graphs G and G0 that
differ only in the value of some Zt is at most one, because going between
them is an increase in one direction or the other.
McDiarmid’s inequality (5.3.13) then says that Pr [|χ − E [χ]| ≥ t] ≤
2
2e−2t /n ; in other words, the chromatic number of a random graph is tightly
concentrated around its mean, even if we don’t know what that mean is.
This proof is due to Shamir and Spencer [SS87]. Much better bounds are
known on the expected value and distribution of χ(G(n, p)) for many values
14 Proof: Let $t = \sqrt{\frac12(c+1)n\ln n} = O(\sqrt{n\log n})$. Then $2e^{-2t^2/n} = 2e^{-(c+1)\ln n} = 2n^{-c-1} < n^{-c}$ when $n$ is sufficiently large.

of n and p than are given by this crude result. For a more recent paper that
cites many of these see [Hec21].

5.3.4.3 Balls in bins


Suppose we toss $m$ balls into $n$ bins. How many empty bins do we get? The probability that each bin individually is empty is exactly $(1 - 1/n)^m$, which is approximately $e^{-m/n}$ when $n$ is large. So the expected number of empty bins is exactly $n(1-1/n)^m$. If we let $X_i$ be the bin that ball $i$ gets tossed into, and let $Y = f(X_1, \dots, X_m)$ be the number of empty bins, then changing a single $X_i$ can change $f$ by at most 1. So from (5.3.13) we have
\[
\Pr\left[Y \ge n(1-1/n)^m + t\right] \le e^{-2t^2/m}.
\]
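A short simulation (Python; parameters mine) comparing the sample mean of $Y$ with the exact expectation, along with the bound evaluated at a sample value of $t$:

import math
import random

def empty_bins(m, n):
    # Number of bins left empty after tossing m balls into n bins.
    return n - len({random.randrange(n) for _ in range(m)})

m = n = 1000
expected = n * (1 - 1 / n) ** m
samples = [empty_bins(m, n) for _ in range(1000)]
print("exact expectation:", expected, " sample mean:", sum(samples) / len(samples))
print("bound on Pr[Y >= E[Y] + 50]:", math.exp(-2 * 50**2 / m))   # about 0.0067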

5.3.4.4 Probabilistic recurrence relations


Most probabilistic recurrence arguments (as in Appendix I) can be interpreted
as supermartingales: the current estimate of T (n) is always greater than or
equal to the expected estimate after doing one stage of the recurrence. This
fact can be used to get concentration bounds using (5.3.7).
For example, let’s take the recurrence (1.3.1) for the expected number of
comparisons for QuickSort:

1 n−1
X
T (n) = (n − 1) + (T (k) + T (n − 1 − k)) .
n k=0

We showed in §1.3.1 that the solution to this recurrence satisfies T (n) ≤


2n ln n.
To turn this into a supermartingale, imagine that we carry out a process where we keep around at each step $t$ a set of unsorted blocks of size $n_1^t, n_2^t, \dots, n_{k_t}^t$ for some $k_t$ (note that the superscripts on $n_i^t$ are not exponents). One step of the process involves choosing one of the blocks (we can do this arbitrarily without affecting the argument) and then splitting that block around a uniformly-chosen pivot. We will track a random variable $X_t$ equal to $C_t + \sum_{i=1}^{k_t} 2n_i^t \ln n_i^t$, where $C_t$ is the number of comparisons done so far and the summation gives an upper bound on the expected number of comparisons remaining.
To show that this is in fact a supermartingale, observe that if we partition
a block of size n we add n − 1 to Ct but replace the cost bound 2n ln n by

an expected
\begin{align*}
2\cdot\frac{1}{n}\sum_{k=0}^{n-1} 2k\ln k &\le \frac{4}{n}\int_2^n k\ln k\,dk\\
&= \frac{4}{n}\left(\frac{n^2\ln n}{2} - \frac{n^2}{4}\right) - \ln 2 + 1\\
&= 2n\ln n - n - \ln 2 + 1\\
&< 2n\ln n - n.
\end{align*}

The net change is less than − ln 2. The fact that it’s not zero suggests
that we could improve the 2n ln n bound slightly, but since it’s going down,
we have a supermartingale.
Let’s try to get a bound on how much Xt changes at each step. The Ct
part goes up by at most n − 1. The summation can only go down; if we split
a block of size ni , the biggest drop we get is if we split it evenly,15 This gives
a drop of
n−1 n−1 n−1
   
2n ln n − 2 2 ln = 2n ln n − 2(n − 1) ln n
2 2 2n
2n
= 2n ln n − 2(n − 1)(ln n − ln )
n−1
2n 2n
= 2n ln n − 2n ln n + 2n ln + 2 ln n − 2 ln
n−1 n−1
= 2n · O(1) + O(log n)
= O(n).

(with a constant tending to 2 in the limit).


So we can apply (5.3.7) with $c_t = O(n)$ to the at most $n$ steps of the algorithm, and get
\[
\Pr[C_n - 2n\ln n \ge t] \le e^{-t^2/O(n^3)}.
\]
This gives $C_n = O(n^{3/2})$ with constant probability or $O(n^{3/2}\log n)$ with all but polynomial probability. This is a rather terrible bound, but it's much better than $O(n^2)$.
Much tighter bounds are known: QuickSort in fact uses Θ(n log n) com-
parisons with high probability [MH92]. If we aren’t too worried about
constants, an easy way to see the upper bound side of this is to adapt the
15
This can be proven most easily using convexity of n ln n.

analysis of Hoare’s FIND (§3.6.4). For each element, the number of elements
in the same block is multiplied by a factor of at most 3/4 on average each
time the element is compared, so the chance that the element is not by itself is
at most (3/4)k n after k comparisons. Setting k = log4/3 (n2 /) gives that any
particular element is compared k or times with probability at most /n. The
union bound then gives a probability of at most  that the most-compared
element is compared k or more times. So the total number of comparisons is
O(log(n/)) with probability 1 − , which becomes O(log n) with probability
1 − n−c if we set  = n−c for a fixed c.
We can formalize this argument itself using a supermartingale. Let $C_i^t$ be the number of times $i$ has been compared as a non-pivot in the first $t$ pivot steps and $N_i^t$ be the size of the block containing $i$ after $t$ pivot steps. Let $Y_i^t = (4/3)^{C_i^t} N_i^t$. Then if we pick $i$'s block at step $t + 1$, the exponent goes up by at most 1 and $N_i^t$ drops by a factor of 3/4 on average, canceling out the increase. If we don't pick $i$'s block, $Y_i^t$ is unchanged. In either case we get $Y_i^t \ge \mathrm{E}\left[Y_i^{t+1} \,\middle|\, \mathcal{F}_t\right]$, and $Y_i^t$ is a supermartingale.
Now let $Z^t = \sum_i Y_i^t$. Since this is greater than or equal to $\sum_i \mathrm{E}\left[Y_i^{t+1} \,\middle|\, \mathcal{F}_t\right] = \mathrm{E}\left[Z^{t+1} \,\middle|\, \mathcal{F}_t\right]$, $Z^t$ is also a supermartingale. For $t = 0$, $Z^0 = n \cdot (4/3)^0 \cdot n = n^2$. For $t = n$, $Z^n = \sum_i (4/3)^{C_i^n}$. But then
\begin{align*}
\Pr\left[\max_i C_i^n \ge a\log_{4/3} n\right] &= \Pr\left[\max_i (4/3)^{C_i^n} \ge n^a\right]\\
&\le \Pr\left[\sum_i (4/3)^{C_i^n} \ge n^a\right]\\
&= \Pr\left[Z^n \ge n^a\right]\\
&\le \frac{\mathrm{E}\left[Z^n\right]}{n^a}\\
&\le \frac{Z^0}{n^a}\\
&= n^{a-2}.
\end{align*}
Choosing $a = c + 2$ gives an $n^{-c}$ bound on $\Pr\left[\max_i C_i^n \ge (c+2)\log_{4/3} n\right]$ and thus the same bound on $\Pr\left[\sum_i C_i^n \ge (c+2)\,n\log_{4/3} n\right]$. This shows that the total number of comparisons is $O(n\log n)$ with high probability.

5.3.4.5 Multi-armed bandits


In the multi-armed bandit problem, we must choose at each time step
one of a fixed set of k arms to pull. Pulling arm i at time t yields a return

of $X_i^t$, a random payoff typically assumed to be between 0 and 1. Suppose that all the $X_i^t$ are independent, and that for each fixed $i$, all $X_i^t$ have the same distribution, and thus the same expected payoff. Suppose also that we initially know nothing about these distributions. What strategy can we use to maximize our expected payoff over a large number of pulls?
More specifically, we want to minimize our regret, defined as
\[
\sum_i T_i \cdot (\mu^* - \mu_i), \tag{5.3.15}
\]
where $T_i$ counts the number of times we pull arm $i$, $\mu^*$ is the expected payoff of the best arm, and $\mu_i = \mathrm{E}[X_i]$ is the expected payoff of arm $i$.
The tricky part here is that when we pull an arm and get a bad return,
we don’t know if we were just unlucky this time or it’s actually a bad arm.
So we have an incentive to try lots of different arms. On the other hand, the
more we pull a genuinely inferior arm, the worse our overall return. We’d like
to adopt a strategy that trades off between exploration (trying new arms)
and exploitation (collecting on the best arm so far) to do as best we can in
comparison to a strategy that always pulls the best arm.

The UCB1 algorithm Fortunately, there is a simple algorithm due to Auer et al. [ACBF02] that solves this problem for us.16 To start with, we pull each arm once. For any subsequent pull, suppose that for each $i$, we have pulled the $i$-th arm $n_i$ times so far. Let $n = \sum_i n_i$, and let $\bar{x}_i$ be the average payoff from arm $i$ so far. Then the UCB1 algorithm pulls the arm that maximizes
\[
\bar{x}_i + \sqrt{\frac{2\ln n}{n_i}}. \tag{5.3.16}
\]
UCB stands for upper confidence bound. (The “1” is because it is
the first, and simplest, of several algorithms of this general structure given in
the paper.) The idea is that we give arms that we haven’t tried very much
the benefit of the doubt, and assume that their actual average payout lies at
the upper end of some plausible range of values.17
16
This is not the only algorithm for solving multi-armed bandit problems, and it’s not
even the only algorithm in the Auer et al. paper. But it has the advantage of being
relatively straightforward to analyze. For a more general survey of multi-armed bandit
algorithms, see [BCB12] or [LS20].
17
The “bound” part is because we don’t attempt to compute the exact upper end of
this confidence interval, which may be difficult, but instead use an upper bound derived
from Hoeffding’s inequality. This distinguishes the UCB1 algorithm of [ACBF02] from the
upper confidence interval approach of Lai and Robbins [LR85] that it builds on.
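Here is a compact rendering of the rule (5.3.16) in Python (my own sketch, not code from [ACBF02]), with Bernoulli arms standing in for the unknown reward distributions:

import math
import random

def ucb1(true_means, rounds):
    k = len(true_means)
    counts, sums = [0] * k, [0.0] * k
    for t in range(rounds):
        if t < k:
            i = t                        # pull each arm once to start
        else:
            n = sum(counts)
            # Pull the arm maximizing xbar_i + sqrt(2 ln n / n_i).
            i = max(range(k), key=lambda j:
                    sums[j] / counts[j] + math.sqrt(2 * math.log(n) / counts[j]))
        reward = 1.0 if random.random() < true_means[i] else 0.0
        counts[i] += 1
        sums[i] += reward
    return counts

print(ucb1([0.9, 0.6, 0.5], 10_000))    # most pulls should go to the first arm

On this instance most pulls should go to the 0.9 arm, with the inferior arms still being sampled a number of times growing logarithmically in $n$, as the analysis below predicts.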
The quantity $\sqrt{\frac{2\ln n}{n_i}}$ is a bit mysterious, but it arises in a fairly natural way from the asymmetric version of Hoeffding's inequality. With a small adjustment to deal with non-zero-mean variables, (5.3.4) says that, if $S$ is a sum of $n$ random variables bounded between $a_i$ and $b_i$, then
\[
\Pr\left[\sum_{i=1}^n (X_i - \mathrm{E}[X_i]) \ge t\right] \le e^{-2t^2/\sum_{i=1}^n (b_i - a_i)^2}. \tag{5.3.17}
\]

By applying (5.3.4) to $-X_i$, we also get
\[
\Pr\left[\sum_{i=1}^n (X_i - \mathrm{E}[X_i]) \le -t\right] \le e^{-2t^2/\sum_{i=1}^n (b_i - a_i)^2}. \tag{5.3.18}
\]

Now consider $\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}$ where each $X_{ij}$ lies between 0 and 1. This is equivalent to having $\bar{x}_i = \sum_{j=1}^{n_i} Y_j$ where $Y_j = X_{ij}/n_i$ lies between 0 and $1/n_i$. So (5.3.17) says that
\begin{align*}
\Pr\left[\bar{x}_i - \mathrm{E}[\bar{x}_i] \ge \sqrt{\frac{2\ln n}{n_i}}\right] &\le e^{-2\left(\sqrt{(2\ln n)/n_i}\right)^2/\left(n_i(1/n_i)^2\right)}\\
&= e^{-4\ln n}\\
&= n^{-4}. \tag{5.3.19}
\end{align*}

We also get a similar lower bound using (5.3.18).
So the bonus applied to $\bar{x}_i$ is really a high-probability bound on how big the difference between the observed payoff and expected payoff might be. The $\sqrt{\ln n}$ part is there to make the error probability be small as a function of $n$, since we will be summing over a number of bad cases polynomial in $n$ and not a particular $n_i$. Applying the bonus to all arms makes it likely that the observed payoff of the best arm stays above its actual payoff, so we won't forget to pull it. The hope is that over time the same bonus applied to other arms will not boost them up so much that we pull them any more than we have to.
However, in an infinite execution of the algorithm, even a bad arm will
be pulled infinitely often, as ln n rises enough to compensate for the last
increase in ni . This accounts for an O(log n) term in the regret, as we will
see below. It also prevents us from getting stuck refusing to give a good arm
a second chance just because we had an initial run of bad luck.

Analysis of UCB1 The following theorem is a direct quote of [ACBF02,


Theorem 1]:

Theorem 5.3.5 ([ACBF02]). For all $K > 1$, if policy UCB1 is run on $K$ machines having arbitrary reward distributions $P_1, \dots, P_K$ with support in $[0,1]$, then its expected regret after any number $n$ of plays is at most
\[
\left[8\sum_{i:\mu_i<\mu^*} \frac{\ln n}{\Delta_i}\right] + \left(1 + \frac{\pi^2}{3}\right)\left(\sum_{j=1}^K \Delta_j\right). \tag{5.3.20}
\]

Here $\mu_i = \mathrm{E}[P_i]$ is the expected payoff for arm $i$, $\mu^*$ as before is $\max_i \mu_i$, and $\Delta_i = \mu^* - \mu_i$ is the regret for pulling arm $i$. The theorem states that our expected regret is bounded by a term for each arm worse than $\mu^*$ that grows logarithmically in $n$ and is inversely proportional to how close the arm is to optimal, plus a constant additional loss corresponding to pulling every arm a little more than 4 times on average. The logarithmic regret in $n$ is a bit of a nuisance, but an earlier lower bound of Lai and Robbins [LR85] shows that something like this is necessary in the limit.
To prove Theorem 5.3.5, we need to get an upper bound on the number
of times each suboptimal arm is pulled during the first n pulls. Define
\[
c_{t,s} = \sqrt{\frac{2\ln t}{s}}, \tag{5.3.21}
\]
the bonus given to an arm that has been pulled $s$ times in the first $t$ pulls.
Fix some optimal arm. Let $\overline{X}_{i,s}$ be the average return on arm $i$ after $s$ pulls and $\overline{X}^*_s$ be the average return on the optimal arm after $s$ pulls.
If we pull arm $i$ after $t$ total pulls, when arm $i$ has previously been pulled $s_i$ times and our optimal arm has been pulled $s^*$ times, then we must have
\[
\overline{X}_{i,s_i} + c_{t,s_i} \ge \overline{X}^*_{s^*} + c_{t,s^*}. \tag{5.3.22}
\]
This just says that arm $i$ with its bonus looks better than the optimal arm with its bonus.
To show that this bad event doesn't happen, we need three things:

1. The value of $\overline{X}^*_{s^*} + c_{t,s^*}$ should be at least $\mu^*$. Were it to be smaller, the observed value $\overline{X}^*_{s^*}$ would be more than $c_{t,s^*}$ away from its expectation. Hoeffding's inequality implies this doesn't happen too often.

2. The value of $\overline{X}_{i,s_i} + c_{t,s_i}$ should not be too much bigger than $\mu_i$. We'll again use Hoeffding's inequality to show that $\overline{X}_{i,s_i}$ is likely to be at most $\mu_i + c_{t,s_i}$, making $\overline{X}_{i,s_i} + c_{t,s_i}$ at most $\mu_i + 2c_{t,s_i}$.

3. The bonus $c_{t,s_i}$ should be small enough that even adding $2c_{t,s_i}$ to $\mu_i$ is not enough to beat $\mu^*$. This means that we need to pick $s_i$ large enough that $2c_{t,s_i} \le \Delta_i$. For smaller values of $s_i$, we will just accept that we need to pull arm $i$ a few more times before we wise up.

More formally, if none of the following events hold:
\begin{align}
\overline{X}^*_{s^*} + c_{t,s^*} &\le \mu^*, \tag{5.3.23}\\
\overline{X}_{i,s_i} &\ge \mu_i + c_{t,s_i}, \tag{5.3.24}\\
\mu^* - \mu_i &< 2c_{t,s_i}, \tag{5.3.25}
\end{align}
then $\overline{X}^*_{s^*} + c_{t,s^*} > \mu^* > \mu_i + 2c_{t,s_i} > \overline{X}_{i,s_i} + c_{t,s_i}$, and we don't pull arm $i$ because the optimal arm is better. (We don't necessarily pull the optimal arm, but if we don't, it's because we pull some other arm that still isn't arm $i$.)
For (5.3.23) and (5.3.24), we repeat the argument in (5.3.19), plugging in $t$ for $n$ and $s_i$ or $s^*$ for $n_i$. This gives a probability of at most $2t^{-4}$ that either or both of these bad events occur.
For (5.3.25), we need to do something a little sneaky, because the statement is not actually true when $s_i$ is small. So we will give $\ell_i$ free pulls to arm $i$, and only start comparing arm $i$ to the optimal arm after we have done at least this many pulls. The value of $\ell_i$ is chosen so that, when $t \le n$ and $s_i > \ell_i$,
\[
2c_{t,s_i} \le \mu^* - \mu_i,
\]
which expands to
\[
2\sqrt{\frac{2\ln t}{s_i}} \le \Delta_i,
\]
giving
\[
s_i \ge \frac{8\ln t}{\Delta_i^2}.
\]
So we must set $\ell_i$ to be at least
\[
\frac{8\ln n}{\Delta_i^2} \ge \frac{8\ln t}{\Delta_i^2}.
\]
Because $\ell_i$ must be an integer, we actually get
\[
\ell_i = \left\lceil\frac{8\ln n}{\Delta_i^2}\right\rceil \le 1 + \frac{8\ln n}{\Delta_i^2}.
\]

This explains (after multiplying by the regret ∆i ) the first term in (5.3.20).
For the other sources of bad pulls, apply the union bound to the $2t^{-4}$ error probabilities we previously computed for all choices of $t \le n$, $s^* \ge 1$, and $s_i > \ell_i$. This gives
\[
\sum_{t=1}^n \sum_{s^*=1}^{t-1} \sum_{s_i=\ell_i+1}^{t-1} 2t^{-4} < 2\sum_{t=1}^\infty t^2 \cdot t^{-4} = 2\cdot\frac{\pi^2}{6} = \frac{\pi^2}{3}.
\]
Again we have to multiply by the regret ∆i for pulling the i-th arm, which
gives the second term in (5.3.20).

5.4 Relation to limit theorems


Since in many cases we are working with sums $S_n = \sum_{i=1}^n X_i$ of independent, identically distributed random variables $X_i$, classical limit theorems apply. These relate the behavior of $S_n$ in the limit to the common expectation $\mu = \mathrm{E}[X_i]$ and variance $\sigma^2 = \operatorname{Var}[X_i]$.18
These include the strong law of large numbers
\[
\Pr\left[\lim_{n\to\infty} S_n/n = \mu\right] = 1 \tag{5.4.1}
\]
and the central limit theorem (CLT)
\[
\lim_{n\to\infty} \Pr\left[\frac{S_n - \mu n}{\sigma\sqrt{n}} \le t\right] = \Phi(t), \tag{5.4.2}
\]

where $\Phi$ is the normal distribution function.
These can sometimes be useful in the analysis of randomized algorithms, but often are not strong enough to get the results we want. The main problem is that the standard versions of both the strong law and the central limit theorem say nothing about rate of convergence. So if we want (for example) to use the CLT to show that $S_n$ is exponentially unlikely to be $n\sigma$ away from the mean, we can't do it directly, because $n\sigma/(\sigma\sqrt{n}) = \sqrt{n}$ is not a fixed constant $t$,
18 The quantity $\sigma = \sqrt{\operatorname{Var}[X]}$ is called the standard deviation of $X$, and informally gives a measure of the typical distance of $X$ from its mean, measured in the same units as $X$.

and for any fixed constant t, we don’t know when the limit behavior actually
starts working.
But there are variants of these theorems that do bound rate of convergence
that can be useful in some cases. An example is given in §5.5.1.

5.5 Anti-concentration bounds


It may be that for some problem you want to show that a sum of random
variables is far from its mean at least some of the time: this would be
an anti-concentration bound. Anti-concentration bounds are much less
well-understood than concentration bounds, but there are known results that
can help in some cases.
For variables where we know the distribution of the sum exactly (e.g.,
sums with binomial distributions, or sums we can attack with generating
functions), we don’t need these. But they may be useful if computing the
distribution of the sum directly is hard.

5.5.1 The Berry-Esseen theorem


The Berry-Esseen theorem19 is an extension of the central limit theorem
that characterizes how quickly a sum of independent identically-distributed
random variables converges to a normal distribution, as a function of the
third moment of the random variables. Its simplest version says that if
we have $n$ independent, identically-distributed random variables $X_1 \dots X_n$, with $\mathrm{E}[X_i] = 0$, $\operatorname{Var}[X_i] = \mathrm{E}\left[X_i^2\right] = \sigma^2$, and $\mathrm{E}\left[|X_i|^3\right] \le \rho$, then
\[
\sup_{-\infty<x<\infty}\left|\Pr\left[\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i \le x\right] - \Phi(x)\right| \le \frac{C\rho}{\sigma^3\sqrt{n}}, \tag{5.5.1}
\]
where $C$ is an absolute constant and $\Phi$ is the normal distribution function. Note that the $\sigma^3$ in the denominator is really $\operatorname{Var}[X_i]^{3/2}$. Since the probability bound doesn't depend on $x$, it's more useful toward the middle of the distribution than in the tails.
A classic proof of this result with C = 3 can be found in [Fel71, §XVI.5].
Shevtsova [She11] shows a stronger bound of C < 0.4784.
As with most results involving sums of random variables, there are
generalizations to martingales. These are too involved to describe here, but
see [HH80, §3.6].
19
Sometimes written Berry-Esséen theorem to help with the pronunciation of Esseen’s
last name.

5.5.2 The Littlewood-Offord problem


The Littlewood-Offord problem asks, given a set of $n$ complex numbers $x_1 \dots x_n$ with $|x_i| \ge 1$, for how many assignments of $\pm 1$ to coefficients $\epsilon_1 \dots \epsilon_n$ it holds that $\left|\sum_{i=1}^n \epsilon_i x_i\right| \le r$. Paul Erdős showed [Erd45] that this quantity was at most $cr2^n/\sqrt{n}$, where $c$ is a constant. The quantity $c2^n/\sqrt{n}$ here is really $\binom{n}{\lfloor n/2\rfloor}$: Erdős's proof shows that for each interval of length $2r$, the number of assignments that give a sum in the interior of the interval is bounded by at most the sum of the $r$ largest binomial coefficients.
In random-variable terms, this means that if we are looking at $\sum_{i=1}^n \epsilon_i x_i$, where the $x_i$ are constants with $|x_i| \ge 1$ and the $\epsilon_i$ are independent $\pm 1$ fair coin-flips, then $\Pr\left[\left|\sum_{i=1}^n \epsilon_i x_i\right| \le r\right]$ is maximized by making all the $x_i$ equal to 1. This shows that any distribution where the $x_i$ are all reasonably large will not be any more concentrated than a binomial distribution.
There has been a lot of more recent work on variants of the Littlewood-
Offord problem, much of it by Terry Tao and Van Vu. See http://terrytao.
wordpress.com/2009/02/16/a-sharp-inverse-littlewood-offord-theorem/
for a summary of much of this work.
Chapter 6

Randomized search trees

These are data structures that are either trees or equivalent to trees, and use
randomization to maintain balance. We’ll start by reviewing deterministic
binary search trees and then add in the randomization.

6.1 Binary search trees


A binary search tree is a standard data structure for holding sorted data.
A binary tree is either empty, or it consists of a root node containing a
key and pointers to left and right subtrees. What makes a binary tree a
binary search tree is the invariant that, both for the tree as a whole and
any subtree, all keys in the left subtree are less than the key in the root,
while all keys in the right subtree are greater than the key in the root. This
ordering property means that we can search for a particular key by doing
binary search: if the key is not at the root, we can recurse into the left or
right subtree depending on whether it is smaller or bigger than the key at
the root.
The efficiency of this operation depends on the tree being balanced. If
each subtree always holds a constant fraction of the nodes in the tree, then
each recursive step throws away a constant fraction of the remaining nodes.
So after O(log n) steps, we find the key we are looking for (or find that
the key is not in the tree). But the definition of a binary search tree does
not by itself guarantee balance, and in the worst case a binary search tree
degenerates into a linked list with O(n) cost for all operations (see Figure 6.2).
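For concreteness, a bare-bones binary search tree with no rebalancing might look like the following Python sketch (the class and function names are mine):

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    # Walk down as in a search; replace the empty subtree where the search ends.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def contains(root, key):
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in [4, 2, 6, 1, 3, 5, 7]:          # the balanced tree of Figure 6.2
    root = insert(root, k)
print(contains(root, 5), contains(root, 8))   # True False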


[Figure 6.1: Tree rotations. A right rotation replaces the tree with root y, left child x (with subtrees A, B), and right subtree C by the tree with root x, left subtree A, and right child y (with subtrees B, C); a left rotation reverses this. The left-to-right order A, x, B, y, C is preserved.]

6.1.1 Rebalancing and rotations


Deterministic binary search tree implementations include sophisticated re-
balancing mechanisms to adjust the structure of the tree to preserve balance
as nodes are inserted or deleted. Typically this is done using rotations,
which are operations that change the position of a parent and a child while
preserving the left-to-right ordering of keys (see Figure 6.1).
Examples include AVL trees [AVL62], where the left and right subtrees
of any node have heights that differ by at most 1; red-black trees [GS78],
where a coloring scheme is used to maintain balance; and scapegoat
trees [GR93], where no information is stored at a node but part of the
tree is rebuilt from scratch whenever an operation takes too long. These all
give O(log n) cost per operation (amortized in the case of scapegoat trees),
and vary in how much work is needed in rebalancing. Both AVL trees and
red-black trees perform more rotations than randomized rebalancing does on
average.
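The rotations of Figure 6.1 are just a few pointer updates. A sketch in Python (mine), assuming nodes with key/left/right fields as in the earlier binary search tree sketch:

def rotate_right(y):
    # y's left child x becomes the subtree root; subtree B moves across.
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(x):
    # Inverse of rotate_right.
    y = x.right
    x.right, y.left = y.left, x
    return y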

6.2 Random insertions


Suppose we insert n keys into an initially-empty binary search tree in random
order with no rebalancing. This means that for each insertion, we follow the
same path that we would when searching for the key, and when we reach an
empty tree, we replace it with a tree consisting solely of the key at the root.1
Since we chose a random order, each element is equally likely to be the
root, and all the elements less than the root end up in the left subtree, while
all the elements greater than the root end up in the right subtree, where
they are further partitioned recursively. This is exactly what happens in
1 This is not the only way to generate a binary search tree at random. For example, we could instead choose uniformly from the set of all $C_n$ binary search trees with $n$ nodes, where $C_n = \frac{1}{n+1}\binom{2n}{n}$ is the $n$-th Catalan number. For $n \ge 3$, this gives a different distribution that we don't care about.

[Figure 6.2: Balanced and unbalanced binary search trees. Left: a balanced tree with root 4, children 2 and 6, and leaves 1, 3, 5, 7. Right: an unbalanced tree consisting of the chain 1, 2, 3.]

randomized QuickSort (see §1.3.1), so the structure of the tree will exactly
mirror the structure of an execution of QuickSort. So, for example, we
can immediately observe from our previous analysis of QuickSort that the
total path length—the sum of the depths of the nodes—is Θ(n log n),
since the depth of each node is equal to 1 plus the number of comparisons it
participates in as a non-pivot, and (using the same argument as for Hoare’s
FIND in §3.6.4) that the height of the tree is O(log n) with high probability.2
When n is small, randomized binary search trees can look pretty scraggly.
Figure 6.3 shows a typical example.
The problem with this approach in general is that we don’t have any
guarantees that the input will be supplied in random order, and in the
worst case we end up with a linked list, giving O(n) worst-case cost for all
operations.
2 The argument for Hoare's FIND is that any node has at most 3/4 of the descendants of its parent on average; this gives for any node $x$ that $\Pr[\operatorname{depth}(x) > d] \le (3/4)^{d-1} n$, or a probability of at most $n^{-c}$ that $\operatorname{depth}(x) > 1 + (c+1)\log(n)/\log(4/3) \approx 1 + 6.952\ln n$ for $c = 1$, which is what we need to apply the union bound. The right answer for the actual height of a randomly-generated search tree in the limit is $\approx 4.31107\ln n$ [Dev88], so this bound is actually pretty close. The real height is still nearly a factor of three worse than for a completely balanced tree, which has max depth bounded by $1 + \lg n \approx 1 + 1.44269\ln n$.

[Figure 6.3: Binary search tree after inserting 5 1 7 3 4 6 2. The root is 5; its left subtree has root 1 with right child 3 (children 2 and 4), and its right subtree has root 7 with left child 6.]

6.3 Treaps
The solution to bad inputs is the same as for QuickSort: instead of assuming
that the input is permuted randomly, we assign random priorities to each
element and organize the tree so that elements with higher priorities rise
to the top. The resulting structure is known as a treap [SA96], because it
satisfies the binary search tree property with respect to keys and the heap
property with respect to priorities.3
There’s an extensive page of information on treaps at http://faculty.
washington.edu/aragon/treaps.html, maintained by Cecilia Aragon, the
co-inventor of treaps; they are also discussed at length in [MR95, §8.2]. We’ll
give a brief description here.
To insert a new node in a treap, first walk down the tree according to
the key and insert the node as a new leaf. Then go back up fixing the heap
property by rotating the new element up until it reaches an ancestor with
the same or higher priority. (See Figure 6.4 for an example.) Deletion is the
reverse of insertion: rotate a node until it has 0 or 1 children (by swapping
with its higher-priority child at each step), and then prune it out, connecting
its child, if any, directly to its parent.
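Here is a minimal Python sketch of treap insertion under these rules, assuming priorities drawn uniformly from [0, 1); deletion runs the same rotations in reverse:

    import random

    class Node:
        def __init__(self, key):
            self.key = key
            self.priority = random.random()   # higher priorities rise
            self.left = self.right = None

    def rotate_right(p):
        c = p.left                # left child rises; p becomes its right child
        p.left, c.right = c.right, p
        return c

    def rotate_left(p):
        c = p.right               # right child rises; p becomes its left child
        p.right, c.left = c.left, p
        return c

    def insert(root, key):
        # Insert as a leaf by key, then restore the heap property by
        # rotating the new node up past lower-priority ancestors.
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
            if root.left.priority > root.priority:
                root = rotate_right(root)
        elif key > root.key:
            root.right = insert(root.right, key)
            if root.right.priority > root.priority:
                root = rotate_left(root)
        return root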
Because of the heap property, the root of each subtree is always the
element in that subtree with the highest priority. This means that the
structure of a treap is completely determined by the priorities and the keys,
no matter what order the elements arrive in. We can imagine in retrospect
³The name "treap" for this data structure is now standard but the history is a little tricky. According to Seidel and Aragon, essentially the same data structure (though with non-random priorities) was previously called a cartesian tree by Vuillemin [Vui80], and the word "treap" was initially applied by McCreight to a different data structure—designed for storing two-dimensional data—that was called a priority search tree in its published form [McC85].
  1,60               1,60
      \                  \
       2,3       -->      3,26
          \              /
          (3,26)      2,3

  1,60            1,60              1,60                5,78
      \               \                 \              /
       3,26    -->     3,26      -->     (5,78)  -->  1,60
      /    \          /    \            /                 \
   2,3      4,24   2,3      (5,78)   3,26                  3,26
                \          /        /    \                /    \
                (5,78)  4,24     2,3      4,24         2,3      4,24

    5,78                     5,78
   /    \                   /    \
  1,60   6,18      -->     1,60   7,41
      \      \                 \      /
       3,26   (7,41)            3,26 6,18
      /    \                   /    \
   2,3      4,24            2,3      4,24

Figure 6.4: Inserting values into a treap. Each node is labeled with k,p,
where k is the key and p the priority. Insertions of values not requiring
rotations are not shown.
that the treap is constructed recursively by choosing the highest-priority element as the root, then organizing all smaller-index and all larger-index nodes into the left and right subtrees by the same rule.
If we assign the priorities independently and uniformly at random from a sufficiently large set (ω(n²) is enough in the limit), then we get no duplicates, and by symmetry all n! orderings are equally likely. So the analysis of the depth of a treap with random priorities is identical to the analysis of a binary search tree with random insertion order. It's not hard to see that the costs of search, insertion, and deletion operations are all linear in the depth of the tree, so the expected cost of each of these operations is O(log n).

6.3.1 Assumption of an oblivious adversary


One caveat is that this only works if the priorities of the elements of the tree
are in fact independent. If operations on the tree are chosen by an adaptive
adversary, this assumption may not work. An adaptive adversary is one
that can observe the choices made by the algorithm and react to them: in this
case, a simple strategy would be to insert elements 1, 2, 3, 4, etc., in order,
deleting each one and reinserting it until it has a lower priority value than all
the smaller elements. This might take a while for the later elements, but the
end result is the linked list again. For this reason it is standard to assume in
randomized data structures that the adversary is oblivious, meaning that
it has to specify a sequence of operations without knowing what choices are
made by the algorithm. Under this assumption, whatever insert or delete
operations the adversary chooses, at the end of any particular sequence of
operations we still have independent priorities on all the remaining elements,
and the O(log n) analysis goes through.

6.3.2 Analysis
The analysis of treaps as carried out by Seidel and Aragon [SA96] is a nice
example of how to decompose a messy process into simple variables, much
like the linearity-of-expectation argument for QuickSort (§1.3.2). The key
observation is that it’s possible to bound both the expected depth of any node
and the number of rotations needed for an insert or delete operation directly
from information about the ancestor-descendant relationship between nodes.
Define two classes of indicator variables. For simplicity, we assume that
the elements have keys 1 through n, which we also use as indices.

1. Ai,j indicates the event that i is an ancestor of j, where i is an ancestor of j if it appears on the path from the root to j. Note that every node is an ancestor of itself.

2. Ci;ℓ,m indicates the event that i is a common ancestor of both ℓ and m; formally, Ci;ℓ,m = Ai,ℓ Ai,m.

The nice thing about these indicator variables is that it’s easy to compute
their expectations.
For Ai,j , i will be the ancestor of j if and only if i has a higher priority
than j and there is no k between i and j that has an even higher prior-
ity: in other words, if i has the highest priority of all keys in the interval
[min(i, j), max(i, j)]. To see this, imagine that we are constructing the treap
recursively, by starting with all elements in a single interval and partitioning
each interval by its highest-priority element. Consider the last interval in
this process that contains both i and j, and suppose i < j (the j > i case is
symmetric). If the highest-priority element is some k with i < k < j, then i
and j are separated into distinct intervals and neither is the ancestor of the
other. If the highest-priority element is j, then j becomes the ancestor of i.
The highest-priority element can’t be less than i or greater than j, because
then we get a smaller interval that contains both i and j. So the only case
where i becomes an ancestor of j is when i has the highest priority.
It follows that E[Ai,j] = 1/(|i − j| + 1), where the denominator is just the number of elements in the range [min(i, j), max(i, j)].
For Ci;ℓ,m, i is the common ancestor of both ℓ and m if and only if it has the highest priority in both [min(i, ℓ), max(i, ℓ)] and [min(i, m), max(i, m)]. It turns out that no matter what order i, ℓ, and m come in, these intervals overlap so that i must have the highest priority in [min(i, ℓ, m), max(i, ℓ, m)]. This gives E[Ci;ℓ,m] = 1/(max(i, ℓ, m) − min(i, ℓ, m) + 1).
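Both formulas are easy to sanity-check by simulation, since each event depends only on which key gets the highest priority in an interval. A small Python sketch (function name and parameters are illustrative only):

    import random

    def ancestor_prob(i, j, trials=100000):
        # i is an ancestor of j iff i has the highest priority in
        # [min(i,j), max(i,j)]; keys outside that interval are irrelevant.
        lo, hi = min(i, j), max(i, j)
        hits = 0
        for _ in range(trials):
            priorities = {key: random.random() for key in range(lo, hi + 1)}
            if max(priorities, key=priorities.get) == i:
                hits += 1
        return hits / trials

    print(ancestor_prob(3, 7), 1 / (abs(3 - 7) + 1))  # both close to 0.2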

6.3.2.1 Searches
From the Ai,j variables we can compute depth(j) = (Σ_i Ai,j) − 1.⁴ So

    E[depth(j)] = (Σ_{i=1}^{n} 1/(|i − j| + 1)) − 1
                = (Σ_{i=1}^{j} 1/(j − i + 1)) + (Σ_{i=j+1}^{n} 1/(i − j + 1)) − 1
                = (Σ_{k=1}^{j} 1/k) + (Σ_{k=2}^{n−j+1} 1/k) − 1
                = H_j + H_{n−j+1} − 2.

This is maximized at j = (n + 1)/2, giving 2H_{(n+1)/2} − 2 = 2 ln n + O(1).


So we get the same 2 ln n + O(1) bound on the expected depth of any one
node that we got for QuickSort. We can also sum over all j to get the exact
value of the expected total path length (but we won’t). These quantities
bound the expected cost of searches.

6.3.2.2 Insertions and deletions


For insertions and deletions, the question is how many rotations we have to
perform to float a new leaf up to its proper location (after an insertion) or
to float a deleted node down to a leaf (before a deletion). Since insertion is
just the reverse of deletion, we can get a bound on both by concentrating on
deletion. The trick is to find some metric for each node that (a) bounds the
number of rotations needed to move a node to the bottom of the tree and
(b) is easy to compute based on the A and C variables.
The left spine of a subtree is the set of all nodes obtained by starting
at the root and following left pointers; similarly the right spine is what we
get if we follow the right pointers instead.
When we rotate an element down, we are rotating either its left or right
child up. This removes one element from either the right spine of the left
subtree or the left spine of the right subtree, but the rest of the spines are
left intact (see Figure 6.5). Subsequent rotations will eventually remove all
these elements by rotating them above the target, while other elements in
⁴We need the −1 because of the convention that the root has depth 0, making the depth of a node one less than the number of its ancestors. Equivalently, we could exclude j from the sum and count only proper ancestors.
        4                    2                       6
       /   \                /   \                   /   \
     *2     6*      =>     1     4        or       4     7
     / \    / \                 / \               / \
    1  *3  5*  7              *3   6*           *2   5*
                                  / \           / \
                                5*   7         1  *3

Figure 6.5: Rotating 4 right shortens the right spine of its left subtree by
removing 2; rotating left shortens the left spine of the right subtree by
removing 6.

the subtree will be carried out from under the target without ever appearing
as a child or parent of the target. Because each rotation removes exactly one
element from one or the other of the two spines, and we finish when both
are empty, the sum of the lengths of the spines gives the number of rotations.
To calculate the length of the right spine of the left subtree of some element ℓ, start with the predecessor ℓ − 1 of ℓ. Because there is no element between them, either ℓ − 1 is a descendant of ℓ or an ancestor of ℓ. In the former case (for example, when ℓ is 4 in Figure 6.5), we want to include all ancestors of ℓ − 1 up to ℓ itself. Starting with Σ_i Ai,ℓ−1 gets all the ancestors of ℓ − 1, and subtracting off Σ_i Ci;ℓ−1,ℓ removes any common ancestors of ℓ − 1 and ℓ. Alternatively, if ℓ − 1 is an ancestor of ℓ, every ancestor of ℓ − 1 is also an ancestor of ℓ, so the same expression Σ_i Ai,ℓ−1 − Σ_i Ci;ℓ−1,ℓ evaluates to zero.
It follows that the expected length of the right spine of the left subtree is exactly

    E[Σ_{i=1}^{n} Ai,ℓ−1 − Σ_{i=1}^{n} Ci;ℓ−1,ℓ]
        = Σ_{i=1}^{n} 1/(|i − (ℓ−1)| + 1) − Σ_{i=1}^{n} 1/(max(i, ℓ) − min(i, ℓ−1) + 1)
        = Σ_{i=1}^{ℓ−1} 1/(ℓ−i) + Σ_{i=ℓ}^{n} 1/(i−ℓ+2) − Σ_{i=1}^{ℓ−1} 1/(ℓ−i+1) − Σ_{i=ℓ}^{n} 1/(i−(ℓ−1)+1)
        = Σ_{j=1}^{ℓ−1} 1/j + Σ_{j=2}^{n−ℓ+2} 1/j − Σ_{j=2}^{ℓ} 1/j − Σ_{j=2}^{n−ℓ+2} 1/j
        = 1 − 1/ℓ.

By symmetry, the expected length of the left spine of the right subtree is 1 − 1/(n − ℓ + 1). So the total expected number of rotations needed to delete the
ℓ-th element is

    2 − 1/ℓ − 1/(n − ℓ + 1) ≤ 2.

HEAD → 33 → TAIL                                LEVEL 2
HEAD → 13 → 33 → 48 → TAIL                      LEVEL 1
HEAD → 13 → 21 → 33 → 48 → 75 → 99 → TAIL       LEVEL 0

Figure 6.6: A skip list. The blue search path for 99 is superimposed on an
original image from [AS07].

6.3.2.3 Other operations


Treaps support some other useful operations: for example, we can split a
treap into two treaps consisting of all elements less than and all elements
greater than a chosen pivot by rotating the pivot to the root (O(log n)
rotations on average, equal to the pivot’s expected depth as calculated in
§6.3.2.1) and splitting off the left and right subtrees. Merging two treaps with
non-overlapping keys is the reverse of this and so it has the same expected
complexity.

6.4 Skip lists


A skip list [Pug90] is a randomized tree-like data structure based on linked
lists. It consists of a level 0 list that is an ordinary sorted linked list, together
with higher-level lists that contain a random sampling of the elements at
lower levels. When inserted into the level i list, an element flips a coin that
tells it with probability p to insert itself in the level i + 1 list as well. The
result is that the element is represented by a tower of nodes, one in each
of the bottom 1 + X many layers, where X is a geometrically-distributed
random variable. An example of a small skip list is shown in Figure 6.6.
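A minimal Python sketch of this construction (names and the max_level cap are illustrative choices, not part of any standard interface):

    import random

    class SkipNode:
        def __init__(self, key, level):
            self.key = key
            self.next = [None] * (level + 1)   # one forward pointer per level

    class SkipList:
        def __init__(self, p=0.5, max_level=32):
            self.p = p
            self.head = SkipNode(None, max_level)   # sentinel head node
            self.level = 0                          # highest level in use

        def _random_level(self):
            # Flip coins: with probability p, join the next level up too.
            lvl = 0
            while random.random() < self.p and lvl < len(self.head.next) - 1:
                lvl += 1
            return lvl

        def insert(self, key):
            update = [self.head] * len(self.head.next)
            x = self.head
            for i in range(self.level, -1, -1):   # last node before key
                while x.next[i] is not None and x.next[i].key < key:
                    x = x.next[i]                 # at each level
                update[i] = x
            lvl = self._random_level()
            self.level = max(self.level, lvl)
            node = SkipNode(key, lvl)
            for i in range(lvl + 1):              # splice the tower in
                node.next[i] = update[i].next[i]
                update[i].next[i] = node

        def search(self, key):
            x = self.head
            for i in range(self.level, -1, -1):
                while x.next[i] is not None and x.next[i].key < key:
                    x = x.next[i]
            x = x.next[0]
            return x is not None and x.key == key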
Searches in a skip list are done by starting in the highest-level list and
searching forward for the last node whose key is smaller than the target; the
search then continues in the same way on the next level down. To bound the
expected running time of a search, it helps to look at this process backwards;
the reversed search path starts at level 0 and continues going backwards

until it reaches the first element that is also in a higher level; it then jumps
to the next level up and repeats the process. The nice thing about this
reversed process is that it has a simple recursive structure: if we restrict a
skip list to only those nodes to the left of and at the same level or higher
of a particular node, we again get a skip list. Furthermore, the structure of
this restricted skip list depends only on coin-flips taken at nodes within it,
so it’s independent of anything that happens elsewhere in the full skip list.
We can analyze this process by tracking the number of nodes in the
restricted skip list described above, which is just the number of nodes in the
current level that are earlier than the current node. If we move left, this
drops by 1; if up, this drops to p times its previous value on average. So the
number of such nodes Xk after k steps satisfies the probabilistic recurrence

    E[Xk+1 | Xk] = (1 − p)(Xk − 1) + p(pXk)
                 = (1 − p + p²)Xk − (1 − p)
                 ≤ (1 − p + p²)Xk,

with X0 = n − 1 (since the last node is not included). Wrapping expectations around both sides gives E[Xk+1] = E[E[Xk+1 | Xk]] ≤ (1 − p + p²) E[Xk], and in general we get E[Xk] ≤ (1 − p + p²)^k X0 < (1 − p + p²)^k n. This is minimized at p = 1/2, giving E[Xk] ≤ (3/4)^k n, suspiciously similar to the bound we computed before for random binary search trees.
When Xk = 0, our search is done, so if T is the time to search, we have Pr[T ≥ k] = Pr[Xk ≥ 1] ≤ (3/4)^k n, by Markov's inequality. In particular, if we want to guarantee that we finish with probability 1 − ε, we need to run for log_{4/3}(n/ε) steps. This translates into an O(log n) bound on the expected search time, and the constant is even the same as our (somewhat loose) bound for treaps.
The space per element of a skip list also depends on p. Every element needs one pointer for each level it appears in. The number of levels each element appears in is a geometric random variable where we are waiting for an event of probability 1 − p, so the expected number of pointers is 1/(1 − p). For constant p this is O(1). However, the space cost can be reduced (at the cost of increasing search time) by adjusting p. For example, if space is at a premium, setting p = 1/10 produces 10/9 pointers per node on average—not much more than in a linked list—but still gives O(log n) search time.
In general the trade-off is between n · 1/(1 − p) total expected pointers and log_{1/(1−p+p²)}(n/ε) search time. For small p, the number of pointers scales as 1 + O(p), while the constant factor in the search time is −1/log(1 − p + p²) = O(1/p).

Skip lists are even easier to split or merge than treaps. It’s enough to
cut (or recreate) all the pointers crossing the boundary, without changing
the structure of the rest of the list.
Chapter 7

Hashing

The basic idea of hashing is that we have keys from a large set U , and we’d
like to pack them in a small set M by passing them through some function
h : U → M , without getting too many collisions, pairs of distinct keys x
and y with h(x) = h(y). Where randomization comes in is that we want this
to be true even if the adversary picks the keys to hash. At one extreme, we
could use a random function, but this will take a lot of space to store.1 So
our goal will be to find functions with succinct descriptions that are still
random enough to do what we want.
The presentation in this chapter is based largely on [MR95, §§8.4-8.5]
(which is in turn based on work of Carter and Wegman [CW77] on universal
hashing and Fredman, Komlós, and Szemerédi [FKS84] on O(1) worst-case
hashing); on [PR04] and [Pag06] for cuckoo hashing; and [MU17, §5.5.3] for
Bloom filters.

7.1 Hash tables


Here we review hash tables, which are implementations of the dictionary
data type mapping keys to values. The central idea is to use a random-
looking function to map keys to small numeric indices. The first published
version of this is due to Dumey [Dum56].
Suppose we want to store n elements from a universe U in a table with keys or indices drawn from an index space M of size m. Typically we assume U = [|U|] = {0 . . . |U| − 1} and M = [m] = {0 . . . m − 1}.
¹An easy counting argument shows that almost all functions from U to M take |U| log|M| bits to represent, no matter how clever you are about choosing your representation. This forms the basis for algorithmic information theory, which defines an object as random if there is no way to reduce the number of bits used to express it.


If |U | ≤ m, we can just use an array. Otherwise, we can map keys to
positions in the array using a hash function h : U → M . This necessarily
produces collisions: pairs (x, y) with h(x) = h(y), and any design of a
hash table must include some mechanism for handling keys that hash to
the same place. Typically this is a secondary data structure in each bin,
but we may also place excess values in some other place. Typical choices
for data structures are linked lists (separate chaining or just chaining)
or secondary hash tables (see §7.3 below). Alternatively, we can push
excess values into other positions in the same hash table according to some
deterministic rule (open addressing or probing) or a second hash function
(see §7.4).
For most of these techniques, the costs of insertions and searches will
depend on how likely it is that we get collisions. An adversary that knows
our hash function can always choose keys with the same hash value, but we
can avoid that by choosing our hash function randomly.2 Our ultimate goal
is to do each search in O(1 + n/m) expected time, which for n ≤ m will be
much better than the Θ(log n) time for pointer-based data structures like
balanced trees or skip lists. The quantity n/m is called the load factor of
the hash table and is often abbreviated as α.
All of this only works if we are working in a RAM (random-access machine
model), where we can access arbitrary memory locations in time O(1) and
similarly compute arithmetic operations on O(log|U |)-bit values in time O(1).
There is an argument that in reality any actual RAM machine requires either
Ω(log m) time to read one of m memory locations (routing costs) or, if one is
particularly pedantic, Ω(m1/3 ) time (speed of light + finite volume for each
location). We will ignore this argument.
We will try to be consistent in our use of variables to refer to the different
parameters of a hash table. Table 7.1 summarizes the meaning of these
variable names.

7.2 Universal hash families


A family of hash functions H is 2-universal if for any x ≠ y, Pr[h(x) = h(y)] ≤ 1/m for a uniform random h ∈ H. It's strongly 2-universal if for any
²In practice, hardly anybody ever does this. Hash functions are instead chosen based on fashion and occasional experiments, often with additional goals like cryptographic security or speed. For cryptographic security, the SHA family is standard. For speed, xxHash (see https://cyan4973.github.io/xxHash/) is probably the fastest widely-used hash function with decent statistical properties.

U            Universe of all keys
S ⊆ U        Set of keys stored in the table
n = |S|      Number of keys stored in the table
M            Set of table positions
m = |M|      Number of table positions
α = n/m      Load factor

Table 7.1: Hash table parameters

x1 ≠ x2 ∈ U, y1, y2 ∈ M, Pr[h(x1) = y1 ∧ h(x2) = y2] = 1/m² for a uniform random h ∈ H. Another way to describe strong 2-universality is that the values of the hash function are uniformly distributed and pairwise-independent.
For k > 2, k-universal usually means strongly k-universal: given distinct x1 . . . xk, and any y1 . . . yk, Pr[∀i : h(xi) = yi] = m^{−k}. This is equivalent to the h(xi) values for distinct xi and randomly-chosen h having a uniform distribution and k-wise independence. It is possible to generalize the weak version of 2-universality to get a weak version of k-universality (Pr[h(xi) are all equal] ≤ m^{−(k−1)}), but this generalization is not as useful as strong k-universality.
To analyze universal hash families, it is helpful to have some notation for
counting collisions. We’ll mostly be doing counting rather than probabilities
because it saves carrying around a lot of denominators. Since we are assuming
uniform choices of h we can always get back probabilities by dividing by |H|.
The particular notation we will be using follows §8.4.1 of [MR95]; the original
paper of Carter and Wegman [CW77] uses essentially the same notation with
slightly different formatting.
Let δ(x, y, h) = 1 if x ≠ y and h(x) = h(y), and 0 otherwise. Abusing notation, we also define, for sets X, Y, and H, δ(X, Y, H) = Σ_{x∈X, y∈Y, h∈H} δ(x, y, h); and allow lowercase variables to stand in for singleton sets, as in δ(x, Y, h) = δ({x}, Y, {h}). Now the statement that H is 2-universal becomes ∀x, y : δ(x, y, H) ≤ |H|/m; this says that only 1/m of the functions in H cause any particular distinct x and y to collide.
If H includes all functions U → M , we get equality: a random function
gives h(x) = h(y) with probability exactly 1/m. But we might do better if
each h tends to map distinct values to distinct places. The following lemma
shows we can’t do too much better:
Lemma 7.2.1. For any family H, there exist x, y such that

    δ(x, y, H) ≥ (|H|/m)(1 − (m−1)/(|U|−1)).

Proof. We'll count collisions in the inverse image of each element z. Since all distinct pairs of elements of h⁻¹(z) collide with each other, we have

    δ(h⁻¹(z), h⁻¹(z), h) = |h⁻¹(z)| · (|h⁻¹(z)| − 1).

Summing over all z ∈ M gets all collisions, giving

    δ(U, U, h) = Σ_{z∈M} |h⁻¹(z)| · (|h⁻¹(z)| − 1).

Use convexity or Lagrange multipliers to argue that the right-hand side is minimized subject to Σ_z |h⁻¹(z)| = |U| when all pre-images are the same size |U|/m. It follows that

    δ(U, U, h) ≥ Σ_{z∈M} (|U|/m)(|U|/m − 1)
              = m · (|U|/m)(|U|/m − 1)
              = (|U|/m)(|U| − m).

If we now sum over all h, we get

    δ(U, U, H) ≥ (|H|/m)|U|(|U| − m).

There are exactly |U|(|U| − 1) ordered pairs x, y for which δ(x, y, H) might not be zero; so the Pigeonhole principle says some pair x, y has

    δ(x, y, H) ≥ (|H|/m) · |U|(|U| − m)/(|U|(|U| − 1))
              = (|H|/m)(1 − (m−1)/(|U|−1)).

Since 1 − (m−1)/(|U|−1) is likely to be very close to 1, we are happy if we get the 2-universal upper bound of |H|/m.
Why we care about this: With a 2-universal hash family, chaining using
linked lists costs O(1 + n/m) expected time per operation. The reason is
that the expected cost of an operation on some key x is proportional to the
size of the linked list at h(x) (plus O(1) for the cost of hashing itself). But
the expected size of this linked list is just the expected number of keys y in
the dictionary that collide with x, which is exactly δ(x, S, H)/|H| ≤ n/m.

7.2.1 Linear congruential hashing


Universal hash families often look suspiciously like classic pseudorandom
number generators. Here is a 2-universal hash family based on taking
remainders. It is assumed that the universe U is a subset of Zp , the integers
mod p; effectively, this just means that every element x of U satisfies 0 ≤
x ≤ p − 1.
Lemma 7.2.2. Let hab (x) = (ax + b mod p) mod m, where a ∈ Zp \ {0} , b ∈
Zp , and p is a prime ≥ m. Then {hab } is 2-universal.
Proof. Again, we count collisions. Split hab (x) as g(fab (x)) where fab (x) =
ax + b mod p and g(x) = x mod m.
The intuition is that if we fix x and y and consider all p(p − 1) pairs a, b with a ≠ 0, all p(p − 1) distinct pairs of values r = fab(x) and s = fab(y) are equally likely. We then show that feeding these values to g produces no more collisions than expected.
The formal statement of the intuition is that for any 0 ≤ x, y ≤ p − 1 with x ≠ y, δ(x, y, H) = δ(Zp, Zp, g).
To prove this, fix x and y, and consider some pair r ≠ s ∈ Zp. Then the equations ax + b = r and ay + b = s have a unique solution for a and b mod p (because Zp is a finite field). Furthermore this solution has a ≠ 0, since otherwise fab(x) = fab(y) = b. So the function q(a, b) = ⟨fab(x), fab(y)⟩ is a bijection between pairs a, b and pairs r, s. Any collisions will arise from applying g, giving δ(x, y, H) = Σ_{a,b} δ(x, y, hab) = Σ_{r≠s} δ(r, s, g) = δ(Zp, Zp, g).
Now we just need to count how many distinct r and s collide. There are p choices for r. For each r, the possible s that map to the same remainder mod m are in a set of the form {r0, r0 + m, r0 + 2m, . . . , r0 + km} where r0 = r mod m and r0 + km ≤ p − 1, which gives k ≤ (p − 1)/m. There are k + 1 elements of this set, but s ≠ r means we can only use k of them. This gives at most k ≤ (p − 1)/m choices for s for each of the p choices for r. Multiplying out these bounds gives δ(x, y, H) = δ(Zp, Zp, g) ≤ p(p − 1)/m. Since each choice of a ≠ 0 and b occurs with probability 1/(p(p − 1)), this gives a probability of collision of at most 1/m.
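A minimal Python sketch of sampling from this family (the prime below is one arbitrary valid choice; any prime p ≥ m with all keys in [0, p) works):

    import random

    def make_linear_hash(p, m):
        # Sample h_ab(x) = ((a*x + b) mod p) mod m with a != 0,
        # as in Lemma 7.2.2.
        a = random.randrange(1, p)
        b = random.randrange(p)
        return lambda x: ((a * x + b) % p) % m

    h = make_linear_hash(2**61 - 1, 1024)   # 61-bit keys into 1024 bins
    print(h(42), h(43))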

A difficulty with this hash family is that it requires doing modular arithmetic. A faster hash is given by Dietzfelbinger et al. [DHKP97], although it requires a slight weakening of the notion of 2-universality. For each k and ℓ they define a class Hk,ℓ of functions from [2^k] to [2^ℓ] by defining

    ha(x) = (ax mod 2^k) div 2^{k−ℓ},

where x div y = ⌊x/y⌋. They prove [DHKP97, Lemma 2.4] that if a is a random odd integer with 0 < a < 2^k, and x ≠ y, then Pr[ha(x) = ha(y)] ≤ 2^{−ℓ+1}. This increases by a factor of 2 the likelihood of a collision, but any extra costs from this can often be justified in practice by the reduction in costs from working with powers of 2.
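In code this family is just one multiplication and one shift, which is the point; a minimal sketch (parameter names follow the text, the example values are arbitrary):

    import random

    def make_multiply_shift(k, ell):
        # h_a(x) = (a*x mod 2**k) div 2**(k - ell), with a a random odd
        # k-bit multiplier; keys are k-bit integers.
        a = random.randrange(1, 1 << k, 2)
        mask = (1 << k) - 1
        return lambda x: ((a * x) & mask) >> (k - ell)

    h = make_multiply_shift(64, 10)   # 64-bit keys into 2**10 buckets
    print(h(12345), h(12346))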
If we are willing to use more randomness (and more space), a method
called tabulation hashing (§7.2.2) gives a simpler alternative that is 3-
universal.

7.2.2 Tabulation hashing


Tabulation hashing [CW77] is a method for hashing fixed-length strings
(or things that can be represented as fixed-length strings) into bit-vectors.
The description here follows Patrascu and Thorup [PT12].
Let c be the length of each string in characters, and let s be the size of
the alphabet. Initialize the hash function by constructing tables T1 . . . Tc
mapping characters to independent random bit-vectors of size lg m. Define

h(x) = T1[x1] ⊕ T2[x2] ⊕ · · · ⊕ Tc[xc],

where ⊕ represents bitwise exclusive OR (what ^ does in C-like languages).3


This gives a family of hash functions that is 3-wise independent but not
4-wise independent.
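A minimal Python sketch of the construction (the byte-string example at the end is arbitrary):

    import random

    def make_tabulation_hash(c, s, m):
        # One table per character position, each mapping a character to an
        # independent random value in [0, m); m should be a power of 2.
        tables = [[random.randrange(m) for _ in range(s)] for _ in range(c)]
        def h(x):   # x is a sequence of c characters drawn from [0, s)
            out = 0
            for i, xi in enumerate(x):
                out ^= tables[i][xi]
            return out
        return h

    h = make_tabulation_hash(4, 256, 2**16)   # 4-byte strings into 16 bits
    print(h(b"abcd"), h(b"abce"))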
Like many hash algorithms, tabulation hashing was already in use before it was formalized in general. Zobrist hashing [Zob70] is a special case of tabulation hashing used to represent positions in board games like Chess and Go, where Ti[xi] gives the contribution of having a piece xi in position i. This is useful for games whose state changes slowly because of a homomorphic property of tabulation hashing: replacing xi by x′i while leaving all other xj unchanged does not require recomputing the entire hash function, since h(x′) = h(x) ⊕ Ti[xi] ⊕ Ti[x′i] in this case.
The intuition for why the hash values might be independent is that if we
have a collection of strings, and each string brings in an element of T that
doesn’t appear in the other strings, then that element is independent of the
hash values for the other strings and XORing it with the rest of the hash
value gives a random bit string that is independent of the hash values of the
other strings. In fact, we don’t even need each string to include a unique
³Letting m be a power of 2 and using exclusive OR is convenient on real computers. If for some reason we don't like this approach, the same technique, with essentially the same analysis, works for arbitrary m if we replace bitwise XOR with addition mod m.

value; it’s enough if we can order the strings so that each string gets a value
that isn’t represented among its predecessors.
More formally, suppose we can order the strings x^1, x^2, . . . , x^n that we are hashing so that each has a position ij such that x^j_{ij} ≠ x^{j′}_{ij} for any j′ < j; then we have, for each value v, Pr[h(x^j) = v | h(x^{j′}) = v_{j′}, ∀j′ < j] = 1/m. It follows that the hash values are independent:

    Pr[h(x^1) = v1, h(x^2) = v2, . . . , h(x^n) = vn]
        = Π_{j=1}^{n} Pr[h(x^j) = vj | h(x^1) = v1, . . . , h(x^{j−1}) = v_{j−1}]
        = 1/m^n
        = Π_{j=1}^{n} Pr[h(x^j) = vj].

Now we want to show that when n = 3, this actually works for all possible distinct strings x, y, and z. Let S be the set of indices i such that yi ≠ xi, and similarly let T be the set of indices i such that zi ≠ xi; note that both sets must be non-empty, since y ≠ x and z ≠ x. If S \ T is nonempty, then (a) there is some index i in T where zi ≠ xi, and (b) there is some index j in S \ T where yj ≠ xj = zj; in this case, ordering the strings as x, z, y gives the independence property above. If T \ S is nonempty, order them as x, y, z instead. Alternatively, if S = T, then yi ≠ zi for some i in S (otherwise y = z, since they both equal x on all positions outside S). In this case, xi, yi, and zi are all distinct.
For n = 4, we can have strings aa, ab, ba, and bb. If we take the
bitwise exclusive OR of all four hash values, we get zero, because each
character is included exactly twice in each position. So the hash values are
not independent, and we do not get 4-independence in general.
However, even though the outputs of tabulation hashing are not 4-
independent, most reasonably small sets of inputs do give independence.
This can be used to show various miraculous properties like working well for
the cuckoo hashing algorithm described in §7.4.

7.3 FKS hashing


The FKS hash table, named for Fredman, Komlós, and Szemerédi [FKS84],
is a method for storing a static set S so that we never pay more than
constant time for search (not just in expectation), while at the same time

not consuming too much space. The assumption that S is static is critical,
because FKS chooses hash functions based on the elements of S.
If we were lucky in our choice of S, we might be able to do this with
standard hashing. A perfect hash function for a set S ⊆ U is a hash
function h : U → M that is injective on S (that is, x ≠ y → h(x) ≠ h(y)
when x, y ∈ S). Unfortunately, we can only count on finding a perfect hash
function if m is large:
Lemma 7.3.1. If H is 2-universal and |S| = n with (n choose 2) ≤ m, then there is a perfect h ∈ H for S.

Proof. Let h be chosen uniformly at random from H. For each unordered pair x ≠ y in S, let Xxy be the indicator variable for the event that h(x) = h(y), and let C = Σ_{x≠y} Xxy be the total number of collisions in S. Each Xxy has expectation at most 1/m, so E[C] ≤ (n choose 2)/m < 1. But we can write E[C] as E[C | C = 0] Pr[C = 0] + E[C | C ≥ 1] Pr[C ≠ 0] ≥ Pr[C ≠ 0]. So Pr[C ≠ 0] ≤ (n choose 2)/m < 1, giving Pr[C = 0] > 0. But if C is zero with nonzero probability, there must be some h that makes it 0. That h is perfect for S.

Essentially the same argument shows that if (1/α)(n choose 2) ≤ m, then Pr[h is perfect for S] ≥ 1 − α. This can be handy if we want to find a perfect hash function and not just demonstrate that it exists.
Using a perfect hash function, we get O(1) search time using O(n²) space. But we can do better by using perfect hash functions only at the second level of our data structure, which at top level will just be an ordinary hash table. This is the idea behind the Fredman-Komlós-Szemerédi (FKS) hash table [FKS84].
The short version is that we hash to n = |S| bins, then rehash perfectly within each bin. The top-level hash table stores a pointer to a header for each bin, which gives the size of the bin and the hash function used within it. The i-th bin, containing ni elements, uses O(ni²) space to allow perfect hashing. The total size is O(n) as long as we can show that Σ_{i=1}^{n} ni² = O(n). The time to do a search is O(1) in the worst case: O(1) for the outer hash plus O(1) for the inner hash.

Theorem 7.3.2. The FKS hash table uses O(n) space.

Proof. Suppose we choose h ∈ H as the outer hash function, where H is some 2-universal family of hash functions. Compute:

    Σ_{i=1}^{n} ni² = Σ_{i=1}^{n} (ni + ni(ni − 1))
                    = n + δ(S, S, h).

The last equality holds because each ordered pair of distinct values in S that map to the same bucket i corresponds to exactly one collision in δ(S, S, h).
Since H is 2-universal, we have δ(S, S, H) ≤ |H| |S|(|S| − 1)/n = |H| n(n − 1)/n = |H|(n − 1). But then the Pigeonhole principle says there exists some h ∈ H with δ(S, S, h) ≤ (1/|H|) δ(S, S, H) = n − 1. Choosing this h gives Σ_{i=1}^{n} ni² ≤ n + (n − 1) = 2n − 1 = O(n).

If we want to find a good h quickly, increasing the size of the outer table to n/α gives us a probability of at least 1 − α of getting a good one, using essentially the same argument as for perfect hash functions. More generally, it's possible to adapt FKS hashing to work with dynamic data sets by growing the main table and each subtable as needed to keep the 2-lookup property while maintaining O(1) expected amortized cost per insertion. This is known as dynamic perfect hashing and was studied by Dietzfelbinger et al. [DKM+94]. In practice the overhead of this approach is higher than other 2-lookup schemes, such as the one described in the next section.
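To make the two-level structure concrete, here is a minimal Python sketch of the static construction, reusing the linear congruential family of §7.2.1; the retry loops and the prime P are illustrative choices, not the tuned version from [FKS84]:

    import random

    P = 2**61 - 1   # example prime; must exceed every key

    def rand_hash(m):
        a, b = random.randrange(1, P), random.randrange(P)
        return lambda x: ((a * x + b) % P) % m

    def fks_build(S):
        n = len(S)
        while True:                 # O(1) expected retries by Theorem 7.3.2
            h = rand_hash(n)
            bins = [[] for _ in range(n)]
            for x in S:
                bins[h(x)].append(x)
            if sum(len(b) ** 2 for b in bins) <= 2 * n - 1:
                break
        inner = []
        for b in bins:              # perfect-hash each bin into |b|**2 slots
            m2 = max(1, len(b) ** 2)
            while True:
                g = rand_hash(m2)
                table = [None] * m2
                ok = True
                for x in b:
                    if table[g(x)] is not None:
                        ok = False  # collision: resample g
                        break
                    table[g(x)] = x
                if ok:
                    break
            inner.append((g, table))
        return h, inner

    def fks_lookup(fks, x):
        h, inner = fks
        g, table = inner[h(x)]
        return table[g(x)] == x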

7.4 Cuckoo hashing


Cuckoo hashing [PR04] is a hash table implementation that uses a single
table and guarantees at most 2 probes per lookup, with O(1) expected
amortized cost per insertion.
The name comes from the cuckoo, a bird notorious for stealing space for its own eggs in other birds' nests. In cuckoo hashing, newly-inserted elements may steal slots from other elements, forcing those elements to find an alternate nest.
The formal mechanism is to use two hash functions h1 and h2 , and store
each element x in one of the two positions h1 (x) or h2 (x). This may require
moving other elements to their alternate locations to make room. But the
payoff is that each search takes only two reads, which can even be done in
parallel. This is optimal by a lower bound of Pagh [Pag01], which also shows
a matching upper bound for static dictionaries using a different technique.
Cuckoo hashing was invented by Pagh and Rodler [PR04]. The version
described here is based on a simplified version from notes of Pagh [Pag06].

The main difference is that it uses just one table instead of the two tables—one
for each hash function—in [PR04].

7.4.1 Structure
We have a table T of size m, with two separate, independent hash functions h1
and h2 . These functions are assumed to be k-universal for some sufficiently
large value k; as long as we never look at more than k values at once, this
means we can treat them effectively as random functions. In practice, using
crummy hash functions seems to work just fine, a common property of hash
tables. There are also specific hash functions that have been shown to work
with particular variants of cuckoo hashing [PR04, PT12]. We will avoid these
issues by assuming that our hash functions are actually random.
Every key x is stored either in T [h1 (x)] or T [h2 (x)]. So the search
procedure just looks at both of these locations and returns whichever one
contains x (or fails if neither contains x).
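In code, a search is just two probes (T, h1, h2 are hypothetical names for the table and hash functions):

    def lookup(T, h1, h2, x):
        # Two reads, which could even be issued in parallel.
        return T[h1(x)] == x or T[h2(x)] == x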
To insert a value x1 = x, we must put it in T [h1 (x1 )] or T [h2 (x1 )]. If one
or both of these locations is empty, we put it there. Otherwise we have to
kick out some value that is in the way (this is the “cuckoo” part of cuckoo
hashing, named after the bird that leaves its eggs in other birds’ nests). We
do this by letting x2 = T [h1 (x1 )] and writing x1 to T [h1 (x1 )]. We now have
a new “nestless” value x2 , which we swap with whatever is in T [h2 (x2 )]. If
that location was empty, we are done; otherwise, we get a new value x3 that
we have to put in T [h1 (x3 )] and so on. The procedure terminates when we
find an empty spot or if enough iterations have passed that we don’t expect
to find an empty spot, in which case we rehash the entire table. This process
can be implemented succinctly as shown in Algorithm 7.1.
A detail not included in the above code is that we always rehash (in
theory) after m² insertions; this avoids potential problems with the hash
functions used in the paper not being universal enough. We will avoid this
issue by assuming that our hash functions are actually random (instead of
being approximately n-universal with reasonably high probability). For a
more principled analysis of where the hash functions come from, see [PR04].
An alternative hash family that is known to work for a slightly different
variant of cuckoo hashing is tabulation hashing, as described in §7.2.2; the
proof that this works is found in [PT12].
procedure insert(x)
    if T[h1(x)] = x or T[h2(x)] = x then
        return
    pos ← h1(x)
    for i ← 1 . . . n do
        if T[pos] = ⊥ then
            T[pos] ← x
            return
        x ↔ T[pos]    // swap the nestless value with the current occupant
        if pos = h1(x) then
            pos ← h2(x)
        else
            pos ← h1(x)
    If we got here, rehash the table and reinsert x.

Algorithm 7.1: Insertion procedure for cuckoo hashing. Adapted from [Pag06]

7.4.2 Analysis
The main question is how long it takes the insertion procedure to terminate,
assuming the table is not too full.
First let’s look at what happens during an insert if we have many nestless
values. We have a sequence of values x1 , x2 , . . . , where each pair of values
xi , xi+1 collides in h1 or h2 . Assuming we don’t reach the loop limit, there
are three main possibilities (the leaves of the tree of cases below):

1. Eventually we reach an empty position without seeing the same key twice.

2. Eventually we see the same key twice; there is some i and j > i such
that xj = xi . Since xi was already moved once, when we reach it
the second time we will try to move it back, displacing xi−1 . This
process continues until we have restored x2 to T [h1 (x1 )], displacing x1
to T [h2 (x1 )] and possibly creating a new sequence of nestless values.
Two outcomes are now possible:

(a) Some xℓ is moved to an empty location. We win!

(b) Some xℓ is moved to a location we've already looked at. We lose! We find we are playing musical chairs with more players than chairs, and have to rehash.

Let’s look at the probability that we get the last case, a closed loop.
Following the argument of Pagh and Rodler, we let v be the number of
distinct nestless keys in the loop. Since v includes x1 and at least one other
element blocking x1 from being inserted at T [h1 (x1 )], v is at least 2. We can
now count how many different ways such a loop can form, and argue that in
each case we include enough information to reconstruct h1 (ui ) and h2 (ui )
for each of a specific set of unique elements u1 , . . . uv .
Formally, this means that we are expressing the closed-loop case as a union of many specific closed loops, and then bounding the probability of each of these specific closed-loop events by the probability of the event that h1 and h2 select the right values to make this particular closed loop possible. Then we apply the union bound.
To describe each of the specific events, we’ll provide this information:

• The v elements u1, . . . , uv. Since we can fix u1 = x1, this leaves v − 1 choices from S, giving n^{v−1} possibilities (we are overcounting by allowing duplicates, but that's not a problem for an upper bound). We'll require that the other ui for i > 1 appear in the list in the same order they first appear in the sequence x1, x2, . . . .

• The v − 1 locations we are trying to fit these v elements into. There are at most m^{v−1} choices for these. Again we order these locations by order of first appearance.

• The values of i, j, and ℓ. These allow us to identify which segments of the sequence x1, x2, . . . correspond to new values ui and which are old values repeated (possibly in reverse order). There are at most v choices for i and j (because we are still in the initial segment with no repeats), and at most 2v choices for ℓ if we count carefully (because we land on either the initial no-duplicate sequence starting with x1 or the second no-duplicate sequence starting with the second occurrence of x1). All together, these give 2v³ choices.

• For each i ≠ 1, whether the first occurrence of ui appears at h1(ui) or h2(ui). This gives 2^{v−1} choices, and allows us to correctly identify h1(ui) or h2(ui) from the value of ui and its first location and the other hash value for ui given the next location in the list.⁴

Multiplying everything out gives at most 2v³(2nm)^{v−1} choices of closed loops with v unique elements. Since each particular loop allows us to determine both h1 and h2 for all v of its elements, the probability that we get exactly these hash values (so that the loop occurs) is m^{−2v}. Summing over all closed loops with v elements gives a total probability of

    2v³(2nm)^{v−1} · m^{−2v} = 2v³(2n)^{v−1} m^{−v−1}
                             = 2v³(2n/m)^{v−1} m^{−2}.

Now sum over all v ≥ 2. We get

    Σ_{v=2}^{n} 2v³(2n/m)^{v−1} m^{−2} < m^{−2} Σ_{v=2}^{∞} 2v³(2n/m)^{v−1}.

The series converges if 2n/m < 1, so for any fixed α < 1/2, the probability of any closed loop forming is O(m^{−2}).
If we do hit a closed loop, then we pay O(m) time to scan the existing table and create a new empty table, and O(n) = O(m) time on average to reinsert all the elements into the new table, assuming that this reinsertion process doesn't generate any more closed loops and that the average cost of an insertion that doesn't produce a closed loop is O(1), which we will show below. But the rehashing step only fails with probability O(nm^{−2}) = O(m^{−1}), so if it does fail we can just try again until it works, and the expected total cost is still O(m). Since we pay this O(m) for each insertion with probability O(m^{−2}), this adds only O(m^{−1}) to the expected cost of a single insertion.
Now we look at what happens if we don’t get a closed loop. This doesn’t
force us to rehash, but if the path is long enough, we may still pay a lot to
do an insertion.
It’s a little messy to analyze the behavior of keys that appear more than
once in the sequence, so the trick used in the paper is to observe that for any
sequence of nestless keys x1 . . . xp , there is a subsequence of size p/3 with no
repetitions that starts with x1 . This will be either the sequence S1 given by
x1 . . . xj−1 —the sequence starting with the first place we try to insert x1 –or
S2 given by x1 = xi+j−1 . . . xp , the sequence starting with the second place
we try to insert x1 . Between these we have a third sequence R where we
4
The original analysis in [PR04] avoids this by alternating between two tables, so that
we can determine which of h1 or h2 is used at each step by parity.
CHAPTER 7. HASHING 127

revert some of the moves made in S1 . Because |S1 | + |R| + |S2 | ≥ p, at least
one of these three subsequences has size p/3. But |R| ≤ |S1 |, so it must be
either S1 or S2 .
We can then argue that the probability that we get a sequence of v distinct keys in either S1 or S2 is at most 2(n/m)^{v−1}. The (n/m)^{v−1} is because we need to hit a nonempty spot (which happens with probability at most n/m) for the first v − 1 elements in the path, and since we assume that our hash functions are random, the choices of these v − 1 spots are all independent. The 2 is from the union bound over S1 and S2. If T is the length of the longer of S1 or S2, we get E[T] = Σ_{v=1}^{∞} Pr[T ≥ v] ≤ Σ_{v=1}^{∞} 2(n/m)^{v−1} = O(1), assuming n/m is bounded by a constant less than 1. Since we already need n/m ≤ 1/2 to avoid the bad closed-loop case, we can use this here as well. We have to multiply E[T] by 3 to get the bound on the actual path, but this disappears into O(1).
An annoyance with cuckoo hashing is that it has high space overhead compared to more traditional hash tables: in order for the first part of the analysis above to work, the table must be at least half empty. This can be avoided at the cost of increasing the time complexity by choosing between d locations instead of 2. This technique, due to Fotakis et al. [FPSS03], is known as d-ary cuckoo hashing. For a suitable choice of d, it uses (1 + ε)n space and guarantees that a lookup takes O(1/ε) probes, while insertion takes (1/ε)^{O(log log(1/ε))} steps in theory and appears to take O(1/ε) steps in practice according to experiments done by the authors.

7.5 Practical issues


Most hash functions used in practice do not have very good theoretical
guarantees, and indeed we have assumed in several places in this chapter that
we are using genuinely random hash functions when we would expect our
actual hash functions to be at most 2-universal. There is some justification
for doing this if there is enough entropy in the set of keys S. A proof of
this for many common applications of hash functions is given by Chung et
al. [CMV13].
Even taking into account these results, hash tables that depend on
strong properties of the hash function may behave badly if the user supplies a
crummy hash function. For this reason, many library implementations of hash
tables are written defensively, using algorithms that respond better in bad
cases. See https://svn.python.org/projects/python/trunk/Objects/dictobject.c for an example of a widely-used hash table implementation chosen specifically because of its poor theoretical characteristics.


For large hash tables, local probing schemes are often faster than chaining
or cuckoo hashing, because it is likely that all of the locations probed to find
a particular value will be on the same virtual memory page. This means
that a search for a new value usually requires one cache miss instead of
two. Hopscotch hashing [HST08] combines ideas from linear probing and
cuckoo hashing to get better performance than both in practice.

7.6 Bloom filters


See [MU17, §5.5.3] for basics and a formal analysis or http://en.wikipedia.org/wiki/Bloom_filter for many variations and the collective wisdom of the unwashed masses. The presentation here mostly follows [MU17].

7.6.1 Construction
Bloom filters are a highly space-efficient randomized data structure invented
by Burton H. Bloom [Blo70] for storing sets of keys, with a small probability
for each key not in the set that it will be erroneously reported as being in
the set.
Suppose we have k independent hash functions h1 , h2 , . . . , hk . Our mem-
ory store A is a vector of m bits, all initially zero. To store a key x, set
A[hi (x)] = 1 for all i. To test membership for x, see if A[hi (x)] = 1 for all
i. The membership test always gives the right answer if x is in fact in the
Bloom filter. If not, we might decide that x is in the Bloom filter anyway,
just because we got lucky.
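A minimal Python sketch of this construction; the salted calls to Python's built-in hash stand in for the k independent random hash functions, which is an assumption of this sketch, not part of the original design:

    import random

    class BloomFilter:
        def __init__(self, m, k):
            self.m, self.k = m, k
            self.bits = [False] * m
            self.salts = [random.randrange(2**64) for _ in range(k)]

        def _positions(self, x):
            return [hash((s, x)) % self.m for s in self.salts]

        def insert(self, x):
            for i in self._positions(x):
                self.bits[i] = True

        def might_contain(self, x):
            # Never a false negative; false positives occur with the
            # probability analyzed below.
            return all(self.bits[i] for i in self._positions(x))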

7.6.2 False positives


The probability of such false positives can be computed in two steps: first,
we estimate how many of the bits in the Bloom filter are set after inserting
n values, and then we use this estimate to compute a probability that any
fixed x shows up when it shouldn’t.
If the hi are close to being independent random functions,⁵ then with n entries in the filter we have Pr[A[i] = 1] = 1 − (1 − 1/m)^{kn}, since each of the kn bits that we set while inserting the n values has one chance in m of hitting position i.
⁵We are going to sidestep the rather deep swamp of how plausible this assumption is and what assumption we should be making instead. However, it is known [KM08] that starting with two sufficiently random-looking hash functions h and h′ and setting hi(x) = h(x) + ih′(x) works.
We'd like to simplify this using the inequality 1 + x ≤ e^x, but it goes in the wrong direction; instead, we'll use 1 − x ≥ e^{−x−x²}, which holds for 0 ≤ x ≤ 0.683803 and in our application holds for m ≥ 2. This gives

    Pr[A[i] = 1] ≤ 1 − (1 − 1/m)^{kn}
                 ≤ 1 − e^{−kn(1/m)(1+1/m)}
                 = 1 − e^{−kα(1+1/m)}
                 = 1 − e^{−kα′}

where α = n/m is the load factor and α′ = α(1 + 1/m) is the load factor fudged upward by a factor of 1 + 1/m to make the inequality work.
Suppose now that we check to see if some value x that we never inserted in the Bloom filter appears to be present anyway. This occurs if A[hi(x)] = 1 for all i. Since each A[hi(x)] is an independent sample from A, the probability that they all come up 1 conditioned on A is

    ((Σ_i A[i])/m)^k.    (7.6.1)

We have an upper bound E[Σ_i A[i]] ≤ m(1 − e^{−kα′}), and if we were born luckier, we might be able to get an upper bound on the expectation of (7.6.1) by applying Jensen's inequality to the function f(x) = x^k. But sadly this inequality also goes in the wrong direction, because f is convex for k > 1. So instead we will prove a concentration bound on S = Σ_i A[i].
Because the A[i] are not independent, we can't use off-the-shelf Chernoff bounds. Instead, we rely on McDiarmid's inequality. Our assumption is that the locations of the kn ones that get written to A are independent. Furthermore, changing the location of one of these writes changes S by at most 1. So McDiarmid's inequality (5.3.13) gives Pr[S ≥ E[S] + t] ≤ e^{−2t²/kn}, which is bounded by n^{−c} for t ≥ √((1/2)·c·kn·log n). So as long as a reasonably large fraction of the array is likely to be full, the relative error from assuming S = E[S] is likely to be small. Alternatively, if the array is mostly empty, then we don't care about the relative error so much because the probability of getting a false positive will already be exponentially small as a function of k.
So let's assume for simplicity that our false positive probability is exactly (1 − e^{−kα′})^k. We can choose k to minimize this quantity for fixed α′ by doing the usual trick of taking a derivative and setting it to zero; to avoid weirdness with the k in the exponent, it helps to take the logarithm first (which doesn't affect the location of the minimum), and it further helps to take the derivative with respect to x = e^{−α′k} instead of k itself. Note that when we do this, k = −(1/α′) ln x still depends on x, and we will deal with this by applying this substitution at an appropriate point.
Compute

    (d/dx) ln((1 − x)^k) = (d/dx) k ln(1 − x)
                        = (d/dx) [−(1/α′)(ln x) ln(1 − x)]
                        = −(1/α′) (ln(1 − x)/x − (ln x)/(1 − x)).

Setting this to zero gives (1 − x) ln(1 − x) = x ln x, which by symmetry has the unique solution x = 1/2, giving k = (1/α′) ln 2.
In other words, to minimize the false positive rate for a known load factor α, we want to choose k = (1/α′) ln 2 = (1/(α(1 + 1/m))) ln 2, which makes each bit one with probability approximately 1 − e^{−ln 2} = 1/2. This makes intuitive sense, since having each bit be one or zero with equal probability maximizes the entropy of the data.
The probability of a false positive for a given key is then 2^{−k} = 2^{−ln 2/α′}. For a given maximum false positive rate ε, and assuming optimal choice of k, we need to keep α′ ≤ ln²2 / ln(1/ε) or α ≤ ln²2 / ((1 + 1/m) ln(1/ε)).
Another way to look at this is that if we fix ε and n, we need m/(1 + 1/m) ≥ n · ln(1/ε)/ln²2 ≈ 1.442695 · n lg(1/ε), which works out to m ≥ 1.442695 · n lg(1/ε) + O(1). This is very good for constant ε.
Note that for this choice of m, we have α = O(1/ln(1/ε)), giving k = O(log(1/ε)). So for polynomial ε, we get k = O(log n). This is closer to the complexity of tree lookups than hash table lookups, so the main payoff for a sequential implementation is that we don't have to store full keys.
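These sizing formulas are easy to apply numerically; a small sketch (ignoring the 1 + 1/m correction, which is negligible for large m):

    import math

    def bloom_parameters(n, eps):
        # m >= n lg(1/eps)/ln 2 bits and k = (m/n) ln 2 hash functions,
        # the approximately optimal choices derived above.
        m = math.ceil(n * math.log2(1 / eps) / math.log(2))
        k = max(1, round((m / n) * math.log(2)))
        return m, k

    # For n = 10**6 keys and eps = 1%: about 9.59 million bits (~1.2 MB)
    # and k = 7 hash functions.
    print(bloom_parameters(10**6, 0.01))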

7.6.3 Comparison to optimal space


If we wanted to design a Bloom-filter-like data structure from scratch and
had no constraints on processing power, we’d be looking for something that
stored an index of size lg M into a family of subsets S1, S2, . . . , SM of our universe of keys U, where |Si| ≤ ε|U| for each i (giving the upper bound on the false positive rate)⁶ and for any set A ⊆ U of size n, A ⊆ Si for at least one Si (allowing us to store A).
Let N = |U|. Then each set Si covers (εN choose n) of the (N choose n) subsets of size n. If we could get them to overlap optimally (we can't), we'd still need a minimum of (N choose n)/(εN choose n) = (N)_n/(εN)_n ≈ (1/ε)^n sets to cover everybody, where the approximation assumes N ≫ n. Taking the log gives lg M ≈ n lg(1/ε), meaning we need about lg(1/ε) bits per key for the data structure. Bloom filters use 1/ln 2 times this.
There are known data structures that approach this bound asymptotically.
The first of these, due to Pagh et al. [PPR05] also has other desirable
properties, like supporting deletions and faster lookups if we can’t look up
bits in parallel.
More recently, Fan et al. [FAKM14] have described a variant of cuckoo hashing (see §7.4) called a cuckoo filter. This is a cuckoo hash table that, instead of storing full keys x, stores fingerprints f(x), where f is a hash function with ℓ-bit outputs. False positives now arise if we happen to hash a value x′ with f(x′) = f(x) to the same location as x. If f is drawn from a 2-universal family, this occurs with probability at most 2^{−ℓ}. So the idea is that by accepting a small rate ε of false positives, we can shrink the space needed to store each key from the full key length to lg(1/ε) = ln(1/ε)/ln 2, the asymptotic minimum.
One complication is that, since we are throwing away the original key x,
when we displace a key from h1 (x) to h2 (x) or vice versa, we can’t recompute
h1 (x) and h2 (x) for arbitrary h1 and h2 . The solution proposed by Fan et al.
is to let h2 (x) = h1 (x) ⊕ g(f (x)), where g is a hash function that depends
only on the fingerprint. This means that when looking at a fingerprint f (x)
stored in position i, we don’t need to know whether i is h1 (x) or h2 (x), since
whichever it is, the other location will be i ⊕ g(f (x)). Unfortunately, this
technique and some other techniques used in the paper to crunch out excess
empty space break the standard analysis of cuckoo hashing, so the authors
can only point to experimental evidence that their data structure actually
⁶Technically, this gives a weaker bound on false positives. For standard Bloom filters, assuming random hash functions, each key individually has at most an ε probability of appearing as a false positive. The hypothetical data structure we are considering here—which is effectively deterministic—allows the set of false positives to depend directly on the set of keys actually inserted in the data structure, meaning that the adversary could arrange for a specific key to appear as a false positive with probability 1 by choosing appropriate keys to insert. So this argument may underestimate the space needed to make the false positives less predictable. On the other hand, we aren't charging the Bloom filter for the space needed to store the hash functions, which could be quite a bit if they are genuine random functions.

works. However, a variant of this data structure has been shown to work by
Eppstein [Epp16].

7.6.4 Applications
Historically, Bloom filters were invented to act as a way of filtering queries
to a database table through fast but expensive7 RAM before looking up the
actual values on a slow but cheap tape drive. Nowadays the cost of RAM is
low enough that this is less of an issue in most cases, but Bloom filters are
still popular in networking and in distributed databases.
In networking, Bloom filters are useful in building network switches, where
incoming packets need to be matched against routing tables in fractions of a
nanosecond. Bloom filters work particularly well for this when implemented
in hardware, since the k hash functions can be computed in parallel. False
positives, if infrequent enough, can be handled by some slower backup
mechanism.
In distributed databases, Bloom filters are used in the Bloomjoin algo-
rithm [ML86]. Here we want to do a join on two tables stored on different
machines (a join is an operation where we find all pairs of rows, one in each
table, that match on some common key). A straightforward but expensive
way to do this is to send the list of keys from the smaller table across the
network, then match them against the corresponding keys from the larger
table. If there are ns rows in the smaller table, nb rows in the larger table,
and j matching rows in the larger table, this requires sending ns keys plus j
rows. If instead we send a Bloom filter representing the set of keys in the
smaller table, we only need to send ns lg(1/ε)/ln 2 bits for the Bloom filter plus an extra εnb rows on average for the false positives. This can be cheaper
than sending full keys across if the number of false positives is reasonably
small.

7.6.5 Counting Bloom filters


It’s not hard to modify a Bloom filter to support deletion. The basic trick is
to replace each bit with a counter, so that whenever a value x is inserted, we
increment A[hi (x)] for all i and when it is deleted, we decrement the same
locations. The search procedure now returns min_i A[hi(x)] (which means that in principle it can even report back multiplicities, though with some
probability of reporting a value that is too high). To avoid too much space
overhead, each array location is capped at some small maximum value c;
⁷As much as $0.10/bit in 1970.
once it reaches this value, further increments have no effect. The resulting
structure is called a counting Bloom filter, due to Fan et al. [FCAB00].
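A minimal sketch of the counting variant, with saturating counters and the never-decrement-a-maxed-register rule discussed at the end of this section; the salted built-in hash is the same stand-in assumption as before, and cap=15 corresponds to 4-bit registers:

    class CountingBloomFilter:
        def __init__(self, m, k, cap=15):
            self.m, self.k, self.cap = m, k, cap
            self.counts = [0] * m
            self.salts = list(range(k))   # stand-in for independent hashes

        def _positions(self, x):
            return [hash((s, x)) % self.m for s in self.salts]

        def insert(self, x):
            for i in self._positions(x):
                if self.counts[i] < self.cap:   # saturate at cap
                    self.counts[i] += 1

        def delete(self, x):
            # Assumes x was previously inserted; never decrements a
            # maxed-out register, avoiding false negatives at the cost
            # of a slowly increasing false-positive rate.
            for i in self._positions(x):
                if 0 < self.counts[i] < self.cap:
                    self.counts[i] -= 1

        def count(self, x):
            # May overestimate the true multiplicity, never underestimates
            # (as long as no register has saturated).
            return min(self.counts[i] for i in self._positions(x))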
We can only expect this to work if our chance of hitting the cap is small. Fan et al. observe that the probability that the m table entries include one that is at least c after n insertions is bounded by

    m (nk choose c) (1/m^c) ≤ m (enk/c)^c (1/m^c)
                            = m (enk/(cm))^c
                            = m (ekα/c)^c.

(This uses the bound (n choose k) ≤ (en/k)^k, which follows from Stirling's formula.)
For k = α1 ln 2, this is m(e ln 2/c)c . For the specific value of c = 16
(corresponding to 4 bits per entry), they compute a bound of 1.37 × 10−15 m,
which they argue is minuscule for all reasonable values of m (it’s a systems
paper).
The possibility that a long chain of alternating insertions and deletions
might produce a false negative due to overflow is considered in the paper,
but the authors state that “the probability of such a chain of events is so
low that it is much more likely that the proxy server would be rebooted in
the meantime and the entire structure reconstructed.” An alternative way of
dealing with this problem is to never decrement a maxed-out register. This
never produces a false negative, but may cause the filter to slowly fill up
with maxed-out registers, producing a higher false-positive rate.
A fancier variant of this idea is the spectral Bloom filter of Cohen
and Matias [CM03], which uses larger counters to track multiplicities of
items. The essential idea here is that we can guess that the number of times
a particular value x was inserted is equal to min_{1≤i≤k} A[hi(x)], with some
extra tinkering to detect errors based on deviations from the typical joint
distribution of the A[hi (x)] values. An even more sophisticated approach
gives the count-min sketches of the next section.

7.7 Data stream computation


In the data stream model, we are given a huge flood of data—far too big
to store—in a single pass, and want to incrementally build a small data
structure, called a sketch, that will allow us to answer statistical questions
about the data after we’ve processed it all. The motivation is the existence

of data sets that are too large to store at all (network traffic statistics), or
too large to store in fast memory (very large database tables). By building
an appropriate small data structure using a single pass through the data,
we can still answer queries about it with some loss of accuracy. Examples we
will consider include estimating the size of a set presented over time with
possible duplicate elements (§7.7.1) or more general statistical queries based
on aggregate counts of some sort (§7.7.2).
In each of these cases, the answers we get will be approximate. We
will measure the quality of the approximation in terms of parameters (δ, ε),
where we demand a relative error of at most ε with probability at least 1 − δ.
We'd also like our data structure to have size at most polylogarithmic in the
number of samples n and polynomial in δ and ε.

7.7.1 Cardinality estimation


The cardinality estimation or count-duplicates problem involves seeing
a sequence of values x1 , x2 , . . . , xn and asking to compute the number of
unique values in this sequence.
Without the uniqueness constraint, this is trivial: just keep a counter.
With the uniqueness constraint, exact counting is much harder, since any
data structure that lets us detect if we see a new element also lets us test for
membership. But if we are willing to accept an approximation, we can get
around this by using hashing and then tracking statistical properties of the
hashed values. To keep the analysis simple, we will assume that our hashing
function is a random function, and not charge for storing its parameters.
Many algorithms of this type are based on a tool called a Flajolet-
Martin sketch [FNM85]. The simplest version of this is that we pick
a random hash function h, and use it to generate a geometric random
variable Ri for each xi by counting the number of trailing zeroes in the
binary representation of h(xi ). We then track R = maxi Ri and estimate the
number n of unique xi using 2R .
The intuition for why this counts unique xi is that sending in the same
xi twice produces the same h(xi), and thus the same Ri, both times, so the
maximum R is unaffected.
It’s easy to show that this gives a reasonably good approximation to
n. For the upper bound, given n samples, the expected number of samples
with Ri ≥ k is n · 2^−k, so for k = lg n + ℓ, the probability that 2^R ≥ 2^k
is at most 2^−ℓ by Markov's inequality. For the lower bound, we have
Pr [Ri < k] = 1 − 2^−k, so Pr [∀i : Ri < k] = (1 − 2^−k)^n ≤ e^(−n2^−k), which gives
Pr [max Ri < lg n − ℓ] ≤ e^(−2^ℓ). The lower bound is a bit stronger than the

upper bound, but in both directions we get at least a constant probability of


being within a (large) constant factor of the correct answer. The expected
value can also be shown to converge to the correct value for large enough
n, after multiplying by a small correction factor φ that compensates for
systematic round-off error caused by quantizing to powers of 2.
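As an illustration of the basic sketch (not Flajolet and Martin's own code;
a salted SHA-256 hash stands in for the random function h), in Python:

    import hashlib

    def trailing_zeros(v, width=64):
        # Number of trailing zeroes in the binary representation of v.
        if v == 0:
            return width
        return (v & -v).bit_length() - 1

    def fm_estimate(stream, salt="fm"):
        R = 0
        for x in stream:
            h = hashlib.sha256(f"{salt}:{x}".encode()).digest()
            R = max(R, trailing_zeros(int.from_bytes(h[:8], "big")))
        return 2 ** R   # crude estimate of the number of distinct elements

Duplicates hash to the same value, so fm_estimate([1, 2, 3, 1, 2, 3]) returns
the same answer as fm_estimate([1, 2, 3]).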
To improve the constants, Flajolet and Martin proposed a technique they
called Probabilistic Counting with Stochastic Averaging or PCSA.
This splits the incoming stream into m buckets using a second hash function
(or, in practice, the leading bits of the first one). Each bucket gives its own
estimate n̂i = mφ2^{Ri}; these estimates are then averaged to produce the
final estimate. Flajolet and Martin show that, with an appropriate multiplier, this
estimate has a typical error of O(1/√m). This analysis is pretty involved so
we will not repeat it here.
The original PCSA algorithm is not used much in practice. More popular
is a descendant called HyperLogLog [FFGM07] that replaces the arithmetic
mean (1/m) Σi n̂i with a harmonic mean

    1 / ((1/m) Σi 1/n̂i).

As with PCSA, HyperLogLog requires using some carefully calculated


corrections to get an unbiased estimator. This can be avoided using an
auxiliary counter that is updated whenever the main data structure changes,
which also gives some improvement in the accuracy of the estimate. This
mechanism was originally proposed independently by Cohen [Coh15] and
Ting [Tin14], although the description we give here is largely drawn from a
more recent paper by Pettie et al. [PWY20], which refers to this approach
as the martingale transformation of the original data structure.
What this transformation does is observe that for each state s of the
HyperLogLog (or whatever) data structure, there is a probability ps that
the next unique element will send it to a new state s′. If we can calculate
this probability, then we can update our estimated count λ̂ by increasing it
by 1/ps when the change occurs. Because each new unique element gives an
expected increase of exactly ps · (1/ps) = 1, this makes the expected value in the
counter exactly equal to the actual count. (The connection between this and
martingales is that we just showed that λ̂t − λt is a martingale where λ̂t is
the value of the counter after t steps and λt is the actual number of unique
elements.)
The only tricky part here is computing ps . For HyperLogLog, there is a
1/m chance that our new unique element lands in each of the m buckets, and
if it lands in a bucket that currently stores ri, the probability that we increase
ri is exactly 2^(−ri−1). This immediately gives

    ps = (1/m) Σi 2^(−ri−1)

and thus

    1/ps = 1 / ((1/m) Σi 2^(−ri−1)),

which looks suspiciously like the harmonic mean used on the final estimates
in HyperLogLog. As with the original HyperLogLog, it is possible to show
that the typical relative error for this sketch is O(1/√m). See [PWY20] for
more details and some further improvements.
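Here is a toy Python version of the martingale transformation applied to
Flajolet-Martin-style registers (an illustration of the update rule above,
not the code from [PWY20]; registers start at −1 to mean that no element
has landed in that bucket yet, so that p_change is 1 on an empty structure):

    import hashlib

    class MartingaleEstimator:
        def __init__(self, m):
            self.m = m
            self.r = [-1] * m       # per-bucket maximum rho seen so far
            self.estimate = 0.0     # unbiased estimate of the distinct count

        def _rho(self, w):
            # number of trailing zeroes in w
            return (w & -w).bit_length() - 1 if w else 64

        def p_change(self):
            # Probability that the next unique element changes the state:
            # (1/m) * sum_i 2^(-r_i - 1).
            return sum(2.0 ** -(ri + 1) for ri in self.r) / self.m

        def add(self, x):
            h = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")
            i, w = h % self.m, h // self.m
            rho = self._rho(w)
            if rho > self.r[i]:                    # state changes
                self.estimate += 1.0 / self.p_change()
                self.r[i] = rho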
If we don’t care about practical engineering issues, there is a known asymp-
totically optimal solution to the cardinality estimation problem [KNW10],
which doesn’t even require assuming a random oracle, but the constants give
worse performance than the systems that people actually use.

7.7.2 Count-min sketches


A count-min sketch is used for the case where we are presented with a
sequence of pairs (it , ct ) where 1 ≤ it ≤ n is an index and ct is a count, and
we want to construct a sketch that will allow us to approximately answer
statistical queries about the vector a given by ai = Σ_{t : it=i} ct. These were
developed by Cormode and Muthukrishnan [CM05], although some of the
presentation here is based on [MU17, §15.4]. Structurally, they are related
to counting Bloom filters (§7.6.5).
Note that we are no longer interested in detecting unique values, but
instead want to avoid the cost of storing the entire vector a when most of
its components are small or zero. The goal is that the size of the sketch
should be polylogarithmic in the size of a and the length of the stream, and
polynomial in the error bounds. We also want updating the sketch for each
new data point to be cheap.
The Cormode-Muthukrishnan count-min sketch is fairly versatile, giving
approximations of ai, Σ_{i=ℓ}^{r} ai, and a · b (for any fixed b), and it can also be
used for more complex tasks like finding heavy hitters—indices with high
weight. The easiest case is approximating ai when all the ct are non-negative,
so we’ll start with that.

7.7.2.1 Initialization and updates


To construct a count-min sketch, build a two-dimensional array c with depth
d = ⌈ln(1/δ)⌉ and width w = ⌈e/ε⌉, where ε is the error bound and δ is
the probability of exceeding the error bound. Choose d independent hash
functions from some 2-universal hash family; we'll use one of these hash
functions for each row of the array. Initialize c to all zeros.
The update rule: Given an update (it , ct ), increment c[j, hj (it )] by ct for
j = 1 . . . d. (This is the count part of count-min.)
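A minimal Python sketch of this structure (the pairwise-independent hash
family here, random linear functions modulo a Mersenne prime, is one
standard choice, not necessarily the one intended in [CM05]; the point_query
method anticipates §7.7.2.2):

    import math, random

    class CountMinSketch:
        def __init__(self, eps, delta):
            self.w = math.ceil(math.e / eps)          # width  = ceil(e/eps)
            self.d = math.ceil(math.log(1 / delta))   # depth  = ceil(ln(1/delta))
            self.c = [[0] * self.w for _ in range(self.d)]
            self.P = (1 << 61) - 1                    # prime for 2-universal hashing
            self.ab = [(random.randrange(1, self.P), random.randrange(self.P))
                       for _ in range(self.d)]

        def _h(self, j, i):
            a, b = self.ab[j]
            return ((a * i + b) % self.P) % self.w

        def update(self, i, ct):
            for j in range(self.d):
                self.c[j][self._h(j, i)] += ct

        def point_query(self, i):
            # For non-negative updates: a_i <= estimate, and the overshoot
            # is at most eps*||a||_1 with probability at least 1 - delta.
            return min(self.c[j][self._h(j, i)] for j in range(self.d))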

7.7.2.2 Queries
Let’s start with point queries. Here we want to estimate ai for some
fixed i. There are two cases; the first handles non-negative increments only,
while the second handles arbitrary increments. In both cases we will get an
estimate whose error is linear in both the error parameter ε and the ℓ1-norm
‖a‖₁ = Σi |ai| of a. It follows that the relative error will be low for heavy
points, but we may get a large relative error for light points (and especially
large for points that don't appear in the data set at all).
For the non-negative case, to estimate ai, compute âi = minj c[j, hj(i)].
(This is the min part of count-min.) Then:

Lemma 7.7.1. When all ct are non-negative, for âi as defined above:

âi ≥ ai , (7.7.1)

and

Pr [âi ≤ ai + ε‖a‖₁] ≥ 1 − δ. (7.7.2)

Proof. The lower bound is easy. Since for each pair (i, ct ) we increment each
c[j, hj (i)] by ct , we have an invariant that ai ≤ c[j, hj (i)] for all j throughout
the computation, which gives ai ≤ âi = minj c[j, hj (i)].
For the upper bound, let Iijk be the indicator for the event that (i 6=
k) ∧ (hj (i) = hj (k)), i.e., that we get a collision between i and k using hj .
The 2-universality property of the hj gives E [Iijk] ≤ 1/w ≤ ε/e.
Now let Xij = Σ_{k=1}^{n} Iijk ak. Then c[j, hj(i)] = ai + Xij. (The fact that
Xij ≥ 0 gives an alternate proof of the lower bound.) Now use linearity of

expectation to get
" n #
X
E [Xij ] = E Iijk ak
k=1
n
X
= ak E [Iijk ]
k=1
Xn
≤ ak (/e)
k=1
= (/e)kak1 .

So Pr [c[j, hj(i)] > ai + ε‖a‖₁] = Pr [Xij > e E [Xij]] < 1/e, by Markov's
inequality. With d choices for j, and each hj chosen independently, the prob-
ability that every count is too big is at most (1/e)^d = e^−d ≤ exp(−ln(1/δ)) =
δ.

Now let's consider the general case, where the increments ct might be
negative. We still initialize and update the data structure as described in
§7.7.2.1, but now when computing âi, we use the median count instead of
the minimum count: âi = median {c[j, hj(i)] | j = 1 . . . d}. Now we get:
Lemma 7.7.2. For âi as defined above,

Pr [ai − 3ε‖a‖₁ ≤ âi ≤ ai + 3ε‖a‖₁] > 1 − δ^{1/4}. (7.7.3)

Proof. The basic idea is that for the median to be off by t, at least d/2
rows must give values that are off by t. We'll show that for t = 3ε‖a‖₁, the
expected number of rows that are off by t is at most d/8. Since the hash
functions for the rows are chosen independently, we can use Chernoff bounds
to show that with a mean of d/8, the chances of getting all the way to d/2
are small.
In detail, we again define the error term Xij as above, and observe that

    E [|Xij|] = E [|Σk Iijk ak|]
              ≤ Σ_{k=1}^{n} |ak| E [Iijk]
              ≤ Σ_{k=1}^{n} |ak| (ε/e)
              = (ε/e)‖a‖₁.

Using Markov's inequality, we get Pr [|Xij| > 3ε‖a‖₁] = Pr [|Xij| > 3e E [|Xij|]] <
1/(3e) < 1/8. In order for the median to be off by more than 3ε‖a‖₁, we need
d/2 of these low-probability events to occur. The expected number that
occur is µ = d/8, so applying the standard Chernoff bound (5.2.1) with δ = 3
we are looking at

    Pr [S ≥ d/2] = Pr [S ≥ (1 + 3)µ]
                 ≤ (e³/4⁴)^{d/8}
                 ≤ (e^{3/8}/2)^{ln(1/δ)}
                 = δ^{ln 2 − 3/8}
                 < δ^{1/4}.

(The actual exponent is about 0.31, but 1/4 is easier to deal with.) This
immediately gives (7.7.3).

One way to think about this is that getting an estimate within ε‖a‖₁ of
the right value with probability at least 1 − δ requires 3 times the width and
4 times the depth—or 12 times the space and 4 times the time—when we
aren't assuming increments are non-negative.
Next, we consider inner products. Here we want to estimate a · b, where
a and b are both stored as count-min sketches using the same hash functions.
The paper concentrates on the case where a and b are both non-negative,
which has applications in estimating the size of a join in a database. The
method is to estimate a · b as minj Σ_{k=1}^{w} ca[j, k] · cb[j, k].
For a single j, the sum consists of both good values and bad collisions; we
have Σ_{k=1}^{w} ca[j, k] · cb[j, k] = Σ_{i=1}^{n} ai bi + Σ_{p≠q, hj(p)=hj(q)} ap bq. The second
term has expectation

    Σ_{p≠q} Pr [hj(p) = hj(q)] ap bq ≤ Σ_{p≠q} (ε/e) ap bq
                                    ≤ Σ_{p,q} (ε/e) ap bq
                                    ≤ (ε/e)‖a‖₁‖b‖₁.

As in the point-query case, we get probability at most 1/e that a single j
gives a value that is too high by more than ε‖a‖₁‖b‖₁, so the probability
that the minimum value is too high is at most e^−d ≤ δ.
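Assuming two sketches built with the same width, depth, and hash functions
(for instance, two instances of the hypothetical CountMinSketch sketched
earlier, sharing their ab parameters), the inner-product estimate is a few
lines of Python:

    def inner_product_estimate(sa, sb):
        # Minimum over rows of the row-wise dot product of the count arrays;
        # assumes sa and sb share width, depth, and hash functions.
        return min(sum(ca * cb for ca, cb in zip(sa.c[j], sb.c[j]))
                   for j in range(sa.d))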

7.7.2.3 Finding heavy hitters


Here we want to find the heaviest elements in the set: those indices i for
which ai exceeds φ‖a‖₁ for some constant threshold 0 < φ ≤ 1. We assume
that all ct are non-negative. Because ‖a‖₁ = Σi ai, we know that there will
be at most 1/φ heavy hitters. But the tricky part is figuring out which
elements they are.
The output at any stage will be approximate in the following sense:
it is guaranteed that any i such that ai ≥ φ‖a‖₁ is included, and each i
with ai < (φ − ε)‖a‖₁ that previously appeared in the stream is included with
probability at most 1 − δ. This is similar to what we would get if we just
ran a point query on all possible i, but (a) there are many possible i and (b)
we won't ever output an i we've never seen.
The trick is to extend the data structure and update procedure to track
all the heavy elements found so far (stored in a heap, with the minimum
estimate at the top), as well as ‖a‖₁ = Σt ct. When a new increment (i, c)
comes in, we first update the count-min structure and then do a point query
on ai; if âi ≥ φ‖a‖₁, we insert i into the heap. We also delete any elements
at the top of the heap that have a point-query estimate below threshold.
Because âi ≥ ai, every heavy hitter is correctly identified. However, it's
possible that an index stops being a heavy hitter at some point (because the
threshold φ‖a‖₁ rose since we included it). In this case it may get removed
from the heap, but if it becomes a heavy hitter again, we'll put it back.
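A sketch of this bookkeeping in Python (illustration only; it assumes the
hypothetical CountMinSketch above and uses a lazily-cleaned heap, one of
several reasonable ways to implement the deletion rule):

    import heapq

    class HeavyHitters:
        def __init__(self, sketch, phi):
            self.sketch, self.phi = sketch, phi
            self.total = 0                    # running ||a||_1 = sum of all c_t
            self.heap, self.in_heap = [], set()

        def update(self, i, ct):
            self.total += ct
            self.sketch.update(i, ct)
            est = self.sketch.point_query(i)
            if est >= self.phi * self.total and i not in self.in_heap:
                heapq.heappush(self.heap, (est, i))
                self.in_heap.add(i)
            # Evict top elements whose current estimate fell below threshold.
            while self.heap and \
                    self.sketch.point_query(self.heap[0][1]) < self.phi * self.total:
                _, j = heapq.heappop(self.heap)
                self.in_heap.discard(j)

        def hitters(self):
            return [j for _, j in self.heap]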

7.8 Locality-sensitive hashing


Locality-sensitive hashing was invented by Indyk and Motwani [IM98] to
solve the problem of designing a data structure that finds approximate nearest
neighbors to query points in high dimension. We’ll mostly be following this
paper in this section, concentrating on the hashing parts.

7.8.1 Approximate nearest neighbor search


In the nearest neighbor search problem (NNS for short), we are given a
set of n points P in a metric space with distance function d, and we want
to construct a data structure that allows us to quickly find the closest point
p in P to any given query point q. We could always compute the distance
between q and each possible p, but this takes time O(n), and we’d like to
get lookups to be sublinear in n.

Indyk and Motwani were particularly interested in what happens in Rd


for high dimension d under various natural metrics. Because the volume of
a ball in a high-dimensional space grows exponentially with the dimension,
this problem suffers from the curse of dimensionality [Bel57]: simple
techniques based on, for example, assigning points in P to nearby locations
in a grid may require searching exponentially many grid locations. Indyk
and Motwani deal with this through a combination of randomization and
solving the weaker problem of ε-nearest neighbor search (ε-NNS), where
it's acceptable to return a different point p′ as long as d(q, p′) ≤ (1 + ε) minp∈P d(q, p).
This problem can be solved by reduction to a simpler problem called
ε-point location in equal balls or ε-PLEB. In this problem, we are given
n radius-r balls centered on points c in a set C, and we want a data structure
that returns a point c′ ∈ C with d(q, c′) ≤ (1 + ε)r if there is at least one
point c with d(q, c) ≤ r. If there is no such point, the data structure may or
may not return a point (it might say no, or it might just return a point that
is too far away, which we can discard). The difference between an ε-PLEB
and NNS is that an ε-PLEB isn't picky about returning the closest point to
q if there are multiple points that are all good enough. Still, we can reduce
NNS to ε-PLEB.
The easy reduction is to use binary search. Let R = (max_{x,y∈P} d(x, y)) / (min_{x,y∈P, x≠y} d(x, y)).
Given a point q, look for the minimum ℓ ∈ {(1 + ε)^0, (1 + ε)^1, . . . , R} for
which an ε-PLEB data structure with radius ℓ and centers P returns a point
p with d(q, p) ≤ (1 + ε)ℓ; then return this point as the approximate nearest
neighbor.
This requires O(log_{1+ε} R) instances of the ε-PLEB data structure and
O(log log_{1+ε} R) queries. The blowup as a function of R can be avoided using
a more sophisticated data structure called a ring-cover tree, defined in the
paper. We won’t talk about ring-cover trees because they are (a) complicated
and (b) not randomized. Instead, we’ll move directly to the question of how
we solve -PLEB.

7.8.2 Locality-sensitive hash functions


Definition 7.8.1 ([IM98]). A family of hash functions H is (r1 , r2 , p1 , p2 )-
sensitive for d if, for any points p and q, if h is chosen uniformly from
H,
1. If d(p, q) ≤ r1 , then Pr [h(p) = h(q)] ≥ p1 , and
2. If d(p, q) > r2 , then Pr [h(p) = h(q)] ≤ p2 .

These are useful if p1 > p2 and r1 < r2 ; that is, we are more likely to hash
inputs together if they are closer. Ideally, we can choose r1 and r2 to build
ε-PLEB data structures for a range of radii sufficient to do binary search as
described above (or build a ring-cover tree if we are doing it right). For the
moment, we will aim for an (r1 , r2 )-PLEB data structure, which returns a
point within r1 with high probability if one exists, and never returns a point
farther away than r2 .
There is some similarity between locality-sensitive hashing and a more gen-
eral dimension-reduction technique known as the Johnson-Lindenstrauss
lemma [JL84]; this says that projecting n points in a high-dimensional space
to O(ε^−2 log n) dimensions using an appropriate random matrix preserves
ℓ2 distances between the points to within relative error ε (in fact, even a
random matrix with entries in {−1, 0, +1} is enough [Ach03]). Unfortunately,
dimension reduction by itself is not enough to solve approximate nearest
neighbors in sublinear time, because we may still need to search a number of
boxes exponential in O(ε^−2 log n), which will be polynomial in n. But we'll
look at the Johnson-Lindenstrauss lemma and its many other applications
more closely in Chapter 8.

7.8.3 Constructing an (r1 , r2 )-PLEB


The first trick is to amplify the difference between p1 and p2 so that we can
find a point within r1 of our query point q if one exists. This is done in three
stages: First, we concatenate multiple hash functions to drive the probability
that distant points hash together down until we get few collisions: the idea
here is that we are taking the AND of the events that we get collisions in the
original hash function. Second, we hash our query point and target points
multiple times to bring the probability that nearby points hash together up:
this is an OR. Finally, we iterate the procedure to drive down any remaining
probability of failure below a target probability δ: another AND.
For the first stage, let k = log_{1/p2} n and define a composite hash function
g(p) = (h1(p) . . . hk(p)). If d(p, q) > r2, Pr [g(p) = g(q)] ≤ p2^k = p2^(log_{1/p2} n) =
1/n. Adding this up over all n points in our data structure gives us one false
match for q on average.
However, we may not be able to find the correct match for q, since p1
may not be all that much larger than p2 . For this, we do a second round of
amplification, where now we are taking the OR of events we want instead of
the AND of events we don’t want.
Let ℓ = n^ρ, where ρ = log(1/p1)/log(1/p2) = log p1/log p2 < 1, and choose hash functions
g1 . . . gℓ independently as above. To store a point p, put it in a bucket for

gj (p) for each j; these buckets are themselves stored in a hash table (by
hashing the value of gj (p) down further) so that they fit in O(n) space.
Suppose now that d(p, q) ≤ r1 for some p. Then

    Pr [gj(p) = gj(q)] ≥ p1^k
                       = p1^(log_{1/p2} n)
                       = n^(−log(1/p1)/log(1/p2))
                       = n^(−ρ)
                       = 1/ℓ.

So by searching through ℓ independent buckets we find p with probability
at least 1 − (1 − 1/ℓ)^ℓ = 1 − 1/e + o(1). We'd like to guarantee that we
only have to look at O(n^ρ) points (most of which we may reject) during this
process; but we can do this by stopping if we see more than 2ℓ points. Since
we only expect to see ℓ bad points in all ℓ buckets, this event only happens
with probability at most 1/2. So even adding it to the probability of failure
from the hash functions not working right we still have only a constant
probability of failure 1/e + 1/2 + o(1).
Iterating the entire process O(log(1/δ)) times then gives the desired
bound δ on the probability that this process fails to find a good point if one
exists.
Multiplying out all the costs gives a cost of a query of O(kℓ log(1/δ)) =
O(n^ρ (log_{1/p2} n) log(1/δ)) hash function evaluations and O(n^ρ log(1/δ))
distance computations. The cost to insert a point is just O(kℓ log(1/δ)) =
O(n^ρ (log_{1/p2} n) log(1/δ)) hash function evaluations, the same number as for
a query.
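Schematically, the two amplification stages look like the following Python
sketch (a toy version under the stated assumptions, not Indyk and Motwani's
code; hash_family is assumed to be a zero-argument sampler returning one
locality-sensitive function, and limit would be set to 2ℓ):

    from collections import defaultdict

    class LSHIndex:
        def __init__(self, hash_family, k, ell):
            # k-fold concatenation (AND), ell independent tables (OR).
            self.g = [[hash_family() for _ in range(k)] for _ in range(ell)]
            self.tables = [defaultdict(list) for _ in range(ell)]

        def _key(self, j, p):
            return tuple(h(p) for h in self.g[j])

        def insert(self, p):
            for j, table in enumerate(self.tables):
                table[self._key(j, p)].append(p)

        def query(self, q, dist, r2, limit):
            seen = 0
            for j, table in enumerate(self.tables):
                for p in table[self._key(j, q)]:
                    if dist(q, p) <= r2:
                        return p          # close enough: success
                    seen += 1
                    if seen >= limit:     # give up after too many candidates
                        return None
            return None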

7.8.4 Hash functions for Hamming distance


Suppose that our points are d-bit vectors and that we use Hamming dis-
tance for our metric. In this case, using the family of one-bit projections
{hi | hi(x) = xi} gives a locality-sensitive hash family [ABMRT96].
Specifically, we can show this family is (r, r(1 + ε), 1 − r/d, 1 − r(1 + ε)/d)-
sensitive. The argument is trivial: if two points p and q are at distance r
or less, they differ in at most r places, so the probability that they hash
together is just the probability that we don't pick one of these places, which
is at least 1 − r/d. Essentially the same argument works when p and q are far
away.
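Sampling from this family is just picking a random coordinate; in Python,
as a toy illustration that plugs into the hypothetical LSHIndex sketch above:

    import random

    def bit_sampling_family(d):
        # Sampler for the family {h_i | h_i(x) = x_i} on d-bit vectors.
        def sample():
            i = random.randrange(d)
            return lambda x: x[i]
        return sample

    # index = LSHIndex(bit_sampling_family(d), k, ell)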

These are not particularly clever hash functions, so the heavy lifting will
be done by the (r1, r2)-PLEB construction. Our goal is to build an ε-PLEB
for any fixed r, which will correspond to an (r, r(1 + ε))-PLEB. The main
thing we need to do, following [IM98] as always, is compute a reasonable
bound on ρ = log p1/log p2 = ln(1 − r/d)/ln(1 − (1 + ε)r/d). This is
essentially just a matter of hitting it with enough inequalities, although
there are a couple of tricks in the middle.
Compute

    ρ = ln(1 − r/d) / ln(1 − (1 + ε)r/d)
      = ((d/r) ln(1 − r/d)) / ((d/r) ln(1 − (1 + ε)r/d))
      = ln((1 − r/d)^{d/r}) / ln((1 − (1 + ε)r/d)^{d/r})
      ≤ ln(e^−1 (1 − r/d)) / ln(e^−(1+ε))
      = (−1 + ln(1 − r/d)) / (−(1 + ε))
      = 1/(1 + ε) − ln(1 − r/d)/(1 + ε).          (7.8.1)
Note that we used the fact that 1 + x ≤ e^x for all x in the denominator
and (1 − x)^{1/x} ≥ e^−1 (1 − x) for x ∈ [0, 1] in the numerator. The first fact is
our usual favorite inequality.
The second can be proved in a number of ways. The most visually
intuitive is that (1 − x)^{1/x} and e^−1 (1 − x) are equal at x = 1 and equal
in the limit as x goes to 0, while (1 − x)^{1/x} is concave in between 0 and
1 and e^−1 (1 − x) is linear. Unfortunately it is rather painful to show that
(1 − x)^{1/x} is in fact concave. An alternative is to rewrite the inequality
(1 − x)^{1/x} ≥ e^−1 (1 − x) as (1 − x)^{1/x−1} ≥ e^−1, apply a change of variables
y = 1/x to get (1 − 1/y)^{y−1} ≥ e^−1 for y ∈ [1, ∞), and then argue that (a)
equality holds in the limit as y goes to infinity, and (b) the left-hand-side is
a nonincreasing function, since

    (d/dy) ln((1 − 1/y)^{y−1}) = (d/dy) [(y − 1)(ln(y − 1) − ln y)]
                               = ln(1 − 1/y) + (y − 1)(1/(y − 1) − 1/y)
                               = ln(1 − 1/y) + 1 − (1 − 1/y)
                               = ln(1 − 1/y) + 1/y
                               ≤ −1/y + 1/y
                               = 0.

We now return to (7.8.1). We'd really like the second term to be small
enough that we can just write n^ρ as n^{1/(1+ε)}. (Note that even though it looks
negative, it isn't, because ln(1 − r/d) is negative.) So we pull a rabbit out of
a hat by assuming that r/d < 1/ln n. (Indyk and Motwani pull this rabbit
out of a hat a few steps earlier, but it's pretty much the same rabbit either
way.) This assumption can be justified by modifying the algorithm so that d
is padded out with up to d ln n unused junk bits if necessary. Using this
assumption, we get

    n^ρ < n^{1/(1+ε)} n^{−ln(1−1/ln n)/(1+ε)}
        ≤ n^{1/(1+ε)} (1 − 1/ln n)^{−ln n}
        ≤ e n^{1/(1+ε)}.

Plugging into the formula for (r1, r2)-PLEB gives O(n^{1/(1+ε)} log n log(1/δ))
hash function evaluations per query, each of which costs O(1) time, plus
O(n^{1/(1+ε)} log(1/δ)) distance computations, which will take O(d) time each.
If we add in the cost of the binary search, we have to multiply this by
O(log log_{1+ε} R log log log_{1+ε} R), where the log-log-log comes from having to
adjust δ so that the error doesn't accumulate too much over all O(log log R)
steps. The end result is that we can do approximate nearest-neighbor queries
in

    O(n^{1/(1+ε)} log(1/δ)(log n + d) log log_{1+ε} R log log log_{1+ε} R)

time. For ε reasonably large, this is much better than naively testing against
all points in our database, which takes O(nd) time (although it does produce
an exact result).

7.8.5 Hash functions for `1 distance


Essentially the same approach works for (bounded) ℓ1 distance, using dis-
cretization, where we replace a continuous variable over some range with
a discrete variable. Suppose we are working in [0, 1]^d with the ℓ1 metric.
Represent each coordinate xi as a sequence of d/ε values xij in unary, for
j = 1 . . . d/ε, with xij = 1 if jε/d < xi. Then the Hamming distance between
the bit-vectors representing x and y is proportional to the ℓ1 distance between
the original vectors, plus an error term that is bounded by ε. We can then
use the hash functions for Hamming distance to get a locality-sensitive hash
family.
A nice bit about this construction is that we don’t actually have to build
the bit-vectors; instead, we can specify a coordinate xi and a threshold c
and get the same effect by recording whether xi > c or not.
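In Python, a sampled hash function for this metric is then just a random
coordinate paired with a random threshold (a toy sketch of this observation,
with hypothetical thresholds spaced ε/d apart on [0, 1]):

    import random

    def l1_threshold_family(d, eps):
        def sample():
            i = random.randrange(d)
            j = random.randrange(int(d / eps)) + 1
            c = j * eps / d            # one of the d/eps unary thresholds
            return lambda x: int(x[i] > c)
        return sample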
Note that this does increase the cost slightly: we are converting d-
dimensional vectors into (d/ε)-long bit vectors, so the log(n + d) term becomes
log(n + d/ε). When n is small, this effectively multiplies the cost of a query by
an extra log(1/ε). More significant is that we have to cut ε in half to obtain
the same error bounds, because we now pay ε error for the data structure
itself and an additional ε error for the discretization. So our revised cost for
the ℓ1 case is

    O(n^{1/(1+ε/2)} log(1/δ)(log n + d/ε) log log_{1+ε/2} R log log log_{1+ε/2} R).
Chapter 8

Dimension reduction

In this chapter, we will discuss how randomization can be used to reduce


the dimension of a set of points in a way that approximately preserves the
distance between points. The main tool for doing this is a family of closely-
related results that collectively are known as the Johnson-Lindenstrauss
lemma.
We can’t really do full justice to the Johnson-Lindenstrauss lemma and
its applications here, although we will try to hit the high points. There is an
excellent survey by Freksen [Fre21] if you’d like to learn more.

8.1 The Johnson-Lindenstrauss lemma


The Johnson-Lindenstrauss lemma [JL84] says that it is possible to
project a set of n vectors in a space of arbitrarily high dimension onto an
O(log n)-dimensional subspace, such that the distances between the vectors
are approximately preserved. There are several versions of the lemma,
depending on how the projection is done, but in each case we want to
show that for sufficiently large k as a function of n there is some k × d
matrix A such that for any two d-dimensional vectors u and v in our set,
(1 − ε)‖u − v‖² ≤ ‖Au − Av‖² ≤ (1 + ε)‖u − v‖². Typical choices for A are:

• A projection matrix for a uniform random k-dimensional subspace.


(Original Johnson-Lindenstrauss paper [JL84], also used in the Dasgupta-
Gupta [DG03] proof described below in §8.1.2.)

• A matrix whose elements are i.i.d. normal random variables. (Indyk-


Motwani [IM98].)


• A matrix whose elements are i.i.d. variables with a particular distribu-


tion on {−1, 0, +1}. (Achlioptas [Ach03].)

• A matrix whose elements are ±1 coin-flips [Ach03]. This is perhaps


the easiest to keep track of, but the constants in the error bounds are
a bit worse than the {−1, 0, +1} version.

In each case, we multiply the matrix by an appropriate fixed scale factor
c so that E [‖cAu‖²] = ‖u‖². The resulting linear projection is known as a
Johnson-Lindenstrauss transformation (JLT for short).

8.1.1 Reduction to single-vector case


Most proofs of the theorem reduce to the case of a single vector, by finding
a value of k such that Pr [(1 − ε)‖u‖² ≤ ‖Au‖² ≤ (1 + ε)‖u‖²] ≥ 1 − 2/n²,
and then using the union bound to show that this same property holds for
all C(n, 2) vectors u − v with nonzero probability. This shows the existence of a
good matrix, and we can generate matrices and test them until we find one
that actually works.

8.1.2 A relatively simple proof of the lemma


This proof is due to Dasgupta and Gupta [DG03], and is somewhat simpler
than many proofs of the theorem. The basic idea is that if A projects onto
a uniform random k-dimensional subspace, then its effect on the length of
an arbitrary nonzero vector u is the same as its effect on the length of a
unit vector u/‖u‖, which is the same as the effect of some specific fixed
k × d matrix B applied to a random unit vector Y drawn uniformly from
the surface of the d-dimensional sphere S^d. An easy choice for B is just the
matrix that extracts the first k coordinates from Y.
So how do we get a random unit vector Y? The usual trick is to use the
fact that the multivariate normal distribution is radially symmetric: if we
generate d independent normally-distributed random variables X1, . . . , Xd,
then the vector Y = X/‖X‖ is uniformly distributed over S^d.¹
This gives Z = BY = (X1, . . . , Xk)/‖X‖, and ‖Z‖² = (X1² + X2² + · · · +
Xk²)/(X1² + · · · + Xd²). If each value Xi² were exactly equal to its expectation
¹Radial symmetry is immediate from the density (1/√(2π)) e^{−x²/2} of the univariate
normal distribution. If we consider a vector ⟨X1, . . . , Xd⟩ of independent N(0, 1) random
variables, then the joint density is given by the product Π_{i=1}^{d} (1/√(2π)) e^{−xi²/2} =
(2π)^{−d/2} e^{−Σ xi²/2}. But Σ xi² = r² where r is the distance from the origin, meaning this
distribution has the same density at all points at the same distance.

1 (the variance of a standard N(0, 1) normal random variable), then ‖Z‖²
would be exactly equal to k/d, and we could adjust the length of Z to equal
1 by multiplying Z by a correction factor of √(d/k). But we generally won't
have each Xi² equal to 1; instead, we will use a Chernoff bound argument
to show that ‖Z‖² lies between (1 − ε)(k/d) and (1 + ε)(k/d), for suitable ε
and k, with high probability.
Following the approach in [DG03], we'll write β for (1 − ε) or (1 + ε)
as appropriate. Starting with the lower bound, we want to show that for
β = 1 − ε, it is unlikely that

    ‖Z‖² ≤ β(k/d),

which expands to

    (Σ_{i=1}^{k} Xi²) / (Σ_{i=1}^{d} Xi²) ≤ β(k/d).

Having a ratio between two sums is a nuisance, but we can multiply out
the denominators to turn it into something we can apply a Chernoff-style
argument to.

    Pr [(Σ_{i=1}^{k} Xi²)/(Σ_{i=1}^{d} Xi²) ≤ β(k/d)]
      = Pr [d Σ_{i=1}^{k} Xi² ≤ βk Σ_{i=1}^{d} Xi²]
      = Pr [βk(X1² + · · · + Xd²) − d(X1² + · · · + Xk²) ≥ 0]
      = Pr [exp(t(βk(X1² + · · · + Xd²) − d(X1² + · · · + Xk²))) ≥ 1]
      ≤ E [exp(t(βk(X1² + · · · + Xd²) − d(X1² + · · · + Xk²)))]
      = (E [exp(tβkX²)])^{d−k} (E [exp(t(βk − d)X²)])^{k},

where t can be any value greater than 0, the shift from probability to
expectation uses Markov’s inequality, and in the last step we replace each
independent occurrence of Xi with a standard normal random variable X.
Now we just need to be able to compute the moment generating function
E [e^{sX²}] for X². The quick way to do this is to notice that X² has a
chi-squared distribution with one degree of freedom (since the chi-squared
distribution with k degrees of freedom is just the distribution of the sum
of squares of k independent normal random variables), and look up its
m.g.f. (1 − 2s)^{−1/2} (for s < 1/2).

We can substitute this m.g.f. in the formula above to get

    Pr [(X1² + · · · + Xk²)/(X1² + · · · + Xd²) ≤ β(k/d)]
      ≤ (1/(1 − 2βkt))^{(d−k)/2} (1/(1 − 2t(βk − d)))^{k/2}.
The rest is just the usual trick of finding the minimum value of this
expression over all values t (that don't produce negative denominators in the
m.g.f.!) by differentiating and setting to 0. This turns out to be easiest if we
maximize 1/g(t)² where g(t) is the above expression; see the Dasgupta-Gupta
paper for details. For β = 1 − ε < 1, they show that the optimal t is
(1 − β)/(2β(d − kβ)), which gives, after a few intermediate steps,

    Pr [‖Z‖² ≤ β(k/d)] ≤ e^{(k/2)(1−β+ln β)}.

The same argument applied to g(−t) gives essentially the same bound
for β = 1 + ε > 1:

    Pr [‖Z‖² ≥ β(k/d)] ≤ e^{(k/2)(1−β+ln β)}.

The 1 − β + ln β factors are very close to 0. For β = 1 − ε, we have
1 − β + ln β = ε + ln(1 − ε) = −Θ(ε²), and for β = 1 + ε, we similarly
have 1 − β + ln β = −ε + ln(1 + ε) = −Θ(ε²). So in either case we need
k = Θ(ε^−2 ln n) to make the probability bound polynomially small in n.
A more precise calculation (see [DG03]) includes the next term in the
Taylor series expansion of ln(1 ± ε) to get an exact inequality. This gives a
good map at k ≥ 4(ε²/2 − ε³/3)^−1 ln n. Note that both our crummy asymptotic
bound and this more precise bound depend on n but not d.
Constructing a projection matrix for a random k-dimensional space is
mildly painful. The easiest way to do it may be to generate a random
k × d matrix of independent N(0, 1) variables and then apply Gram-Schmidt
orthogonalization, which takes O(k²d) time. A faster approach that gives
similar results is to use a matrix of independent random variables that are ±1
with probability 1/6 each and 0 the rest of the time. This doesn't produce
exactly the same distribution on projections, but it can be shown (with much
more effort) to still work pretty well [Ach03].
If we leave out enough details, we can summarize all of these results as a
single lemma:
Lemma 8.1.1 ([JL84]). For every set of n points X in R^d, and any ε with
0 < ε < 1, there is a linear projection f : R^d → R^k with k = O(ε^−2 log n)
such that for any u and v in X,

    (1 − ε)‖u − v‖ ≤ ‖f(u) − f(v)‖ ≤ (1 + ε)‖u − v‖.

8.1.3 Distributional version


It's worth noting that the argument for Lemma 8.1.1 doesn't use any
property of X except its size. This means that the same argument works
(with nonzero probability) without needing to know X. This gives the
“distributional” version of the lemma, which says

Lemma 8.1.2 ([JL84]). For every d, 0 < ε < 1, and 0 < δ < 1, there exists
a distribution over linear functions f : R^d → R^k with k = O(ε^−2 log(1/δ))
such that for every x ∈ R^d,

    Pr [(1 − ε)‖x‖² ≤ ‖f(x)‖² ≤ (1 + ε)‖x‖²] ≥ 1 − δ.

This can be handy for applications where we don’t know the vectors we
will be working with in advance.
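As a concrete illustration of the distributional version (a sketch, not a
tuned implementation; the constant 8 in the choice of k is an arbitrary
illustrative value), here is a ±1 Johnson-Lindenstrauss transform in Python:

    import math, random

    def jl_transform(d, eps, delta):
        k = math.ceil(8 * math.log(1 / delta) / eps ** 2)  # k = O(eps^-2 log(1/delta))
        A = [[random.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(k)]
        scale = 1.0 / math.sqrt(k)      # makes E||f(x)||^2 = ||x||^2
        def f(x):
            return [scale * sum(a * xi for a, xi in zip(row, x)) for row in A]
        return f

Each output coordinate is (1/√k)(±x1 ± x2 ± · · ·), whose expected square is
‖x‖²/k; summing over the k coordinates gives E [‖f(x)‖²] = ‖x‖².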

8.2 Applications
The intuition is that the Johnson-Lindenstrauss lemma lets us reduce the
dimension of some problem involving distances between n points, where we
are willing to tolerate a small constant relative error, from some arbitrary d
to O(log n) (or O(log(1/δ)) if we just care about the error probability per
pair of points). So applications tend to fall into one of two categories:

1. We want to run some algorithm on a set of points in a high-dimensional


space, but the cost of the algorithm is high (perhaps exponential!) as
a function of the dimension.

2. We want to run some algorithm on a set of points in a high-dimensional


space, but we don’t want to pay the linear-in-the-dimension space costs.

If we are lucky, we’ll get both payoffs, winning on both time and space.
For example, suppose we have a set of n points x1 , x2 , . . . , xn representing
the centers of various clusters in a d-dimensional space, and we want to
rapidly classify incoming points y into one of these clusters by finding the
xi that y is closest to. If we do this naively, this is a Θ(nd) operation, since it takes
Θ(d) time to compute each distance between y and some xi . If instead we
are willing to accept the inaccuracy associated with Johnson-Lindenstrauss,
we can fix a matrix A in advance, replace each xi with Axi , and find the
xi that is (approximately) closest to y using O(d log n) time to reduce y to
Ay and O(n log n) time to compute the distance between Ay and each Axi .
In this case we are reducing both the time complexity of our classification
CHAPTER 8. DIMENSION REDUCTION 152

algorithm (at least if we don’t count the pre-processing time to generate the
Axi ) and the amount of data we need to store.
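A sketch of this classification scheme in Python (hypothetical code, reusing
the jl_transform function sketched in §8.1.3):

    def nearest_center(f, reduced_centers, y):
        # reduced_centers = [f(x_i) for each center x_i], precomputed once.
        fy = f(y)
        best, best_d2 = None, float("inf")
        for idx, fx in enumerate(reduced_centers):
            d2 = sum((a - b) ** 2 for a, b in zip(fx, fy))
            if d2 < best_d2:
                best, best_d2 = idx, d2
        return best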
An example of space savings is the use of the Johnson-Lindenstrauss
transform in streaming algorithms (see §7.7). Freksen [Fre21] gives a simple
example of estimating ‖x‖₂ where x is a vector of counts of items from a set
of size n presented one at a time. If we don't charge for the space to store
the JLT function f, we can simply add the i-th column of the matrix to our
running total whenever we see item i, and we need only store O(ε^−2 log(1/δ))
distinct numerical values of an appropriate precision to estimate ‖x‖₂ to
within ε relative error with probability at least 1 − δ. The problem is that
in reality we do need to represent f somehow, and even for a ±1 matrix
this will take Θ(nε^−2 log(1/δ)) space. Fortunately it can be shown that
generating f using a 4-independent hash function reduces the space for f to
O(log n), giving the Tug-of-War sketch of Alon et al. [AMS96], one of the
first compact streaming data structures. Though this is a nice application of
the JLT, it's worth mentioning that Cormode and Muthukrishnan [CM05]
observe that this is still significantly more costly for most queries than their
own count-min sketch.
Chapter 9

Martingales and stopping


times

In §5.3.2, we used martingales to show that the outcome of some process


was tightly concentrated. Here we will show how martingales interact with
stopping times, which are random variables that control when we stop
carrying out some task. This will require a few new definitions.

9.1 Definitions
The general form of a martingale {Xt , Ft } consists of:

• A sequence of random variables X0 , X1 , X2 , . . . ; and

• A filtration F0 ⊆ F1 ⊆ F2 . . . , where each σ-algebra Ft represents


our knowledge at time t;

subject to the requirements that:

1. The sequence of random variables is adapted to the filtration, which
just means that each Xt is Ft-measurable or equivalently that Ft (and
thus all subsequent Ft′ for t′ ≥ t) includes all knowledge of Xt; and

2. The martingale property

E [Xt+1 | Ft ] = Xt (9.1.1)

holds for all t.


We will also need the following definition of a stopping time.
Given a filtration {Ft}, a random variable τ is a stopping time for {Ft} if
τ ∈ N ∪ {∞} and the event [τ ≤ t] is Ft-measurable for all t ∈ N. (Different
authors impose different conditions on the range of τ; for example, Mitzen-
macher and Upfal [MU17] exclude the case τ = ∞. We allow τ = ∞ to
represent the outcome where we never stop. This can be handy for modeling
processes where this outcome is possible, although in practice we will typically
insist that it occurs only with probability zero.) In simple terms, τ is a
stopping time if you know at time t whether to stop there or not.
What we like about martingales is that iterating the martingale property
shows that E [Xt ] = E [X0 ] for all fixed t. We will show that, under reasonable
conditions, the same holds for Xτ when τ is a stopping time. (The random
variable Xτ is defined in the obvious way, as a random variable that takes
on the value of Xt when τ = t.)

9.2 Submartingales and supermartingales


In some cases we have a process where instead of getting equality in (9.1.1),
we get an inequality. A submartingale replaces (9.1.1) with

Xt ≤ E [Xt+1 | Ft ] (9.2.1)

while a supermartingale satisfies

Xt ≥ E [Xt+1 | Ft ] . (9.2.2)

In each case, what is “sub” or “super” is the value at the current time
compared to the expected value at the next time. Intuitively, a submartingale
corresponds to a process where you win on average, while a supermartingale
is a process where you lose on average. Casino games (in profitable casinos)
are submartingales for the house and supermartingales for the player.
Sub- and supermartingales can be reduced to martingales by subtracting
off the expected change at each step. For example, if {Xt } is a submartingale
with respect to {Ft }, then the process {Yt } defined recursively by

Y0 = X0
Yt+1 = Yt + Xt+1 − E [Xt+1 | Ft ]

is a martingale, since

E [Yt+1 | Ft ] = E [Yt + Xt+1 − E [Xt+1 | Ft ] | Ft ]


= Yt + E [Xt+1 | Ft ] − E [Xt+1 | Ft ]
= Yt .

One way to think of this is that Xt = Yt + ∆t, where ∆t is a predictable,
non-decreasing drift process that starts at 0. For supermartingales, the
same result holds, but now ∆t is non-increasing. This ability to decompose
an adapted stochastic process into the sum of a martingale and a predictable
drift process is known as the Doob decomposition theorem.

9.3 The optional stopping theorem


If (Xt , Ft ) is a martingale, then applying induction to the martingale property
shows that E [Xt ] = E [X0 ] for any fixed time t. The optional stopping
theorem shows that this also happens for Xτ when τ is a stopping time,
under various choices of additional conditions:

Theorem 9.3.1. Let (Xt , Ft ) be a martingale and τ a stopping time for


{Ft }. Then E [Xτ ] = E[X0 ] if at least one of the following conditions holds:

1. Bounded time. There is a fixed n such that τ ≤ n always.

2. Finite time and bounded range. Pr [τ < ∞] = 1, and there is a


fixed M such that for all t ≤ τ , |Xt | ≤ M .

3. Finite expected time and bounded increments. E [τ ] < ∞, and


there is a fixed c such that |Xt+1 − Xt | ≤ c for all t < τ .

4. General case. All three of the following conditions hold:

(a) Pr [τ < ∞] = 1,
(b) E [|Xτ|] < ∞, and
(c) lim_{t→∞} E [Xt · 1[τ>t]] = 0.

It would be nice if we could show E [Xτ] = E [X0] without the side
conditions, but in general this isn't true. For example, the double-after-losing
martingale strategy in the St. Petersburg paradox (see §3.4.1.1) eventually
yields +1 with probability 1, so if τ is the time we stop playing, we have
Pr [τ < ∞] = 1, E [|Xτ|] < ∞, but E [Xτ] = 1 ≠ E [X0] = 0. To make this
happen, we have to violate all of bounded time (τ is not bounded), bounded
range (|Xt| roughly doubles every step until we stop), bounded increments
(|Xt+1 − Xt| doubles every step as well), and at least one of the three
conditions of the general case (the last one: lim_{t→∞} E [Xt · 1[τ>t]] = −1 ≠ 0).
To prove Theorem 9.3.1, we'll use a truncation argument. The intuition
is that for any fixed n, we can truncate Xτ to Xmin(τ,n) and show that
E [Xmin(τ,n)] = E [X0]. This will immediately give the bounded time case.
For the other cases, the argument is that lim_{n→∞} E [Xmin(τ,n)] converges to
E [Xτ] provided the missing part E [Xτ − Xmin(τ,n)] converges to zero. How
we do this depends on which assumptions we are making.
We’ll start by formalizing the core truncation argument:

Lemma 9.3.2. Let (Xt, Ft) be a martingale and τ a stopping time for {Ft}.
Then for any n ∈ N, E [Xmin(τ,n)] = E [X0].

Proof. Define Yt = X0 + Σ_{i=1}^{t} (Xi − Xi−1) 1[τ>i−1]. Then (Yt, Ft) is a martin-
gale, because we can calculate E [Yt+1 | Ft] = E [Yt + (Xt+1 − Xt) 1[τ>t] | Ft] =
Yt + 1[τ>t] · E [Xt+1 − Xt | Ft] = Yt; effectively, we are treating the indicators
1[τ>t] as a sequence of bets, and we know that adjusting our bets doesn't
change the martingale property. But then E [Xmin(τ,n)] = E [Yn] = E [Y0] =
E [X0].

As claimed, this gives us the bounded-time variant for free. If τ ≤ n
always, then Xτ = Xmin(τ,n), and E [Xτ] = E [Xmin(τ,n)] = E [X0].
For each of the unbounded-time variants, we will apply some version of
the following strategy:

1. Observe that since E [Xmin(τ,n)] = E [X0] is a constant for any fixed n,
lim_{n→∞} E [Xmin(τ,n)] converges to E [X0].

2. Argue using whatever assumptions we are making that lim_{n→∞} E [Xmin(τ,n)]
also converges to E [Xτ].

3. Conclude that E [X0] = E [Xτ], since they are both limits of the same
sequence.

For the middle step, start with

    Xτ = Xmin(τ,n) + 1[τ>n] (Xτ − Xn).



This holds because either τ ≤ n, and we just get Xτ , or τ > n, and we get
Xn + (Xτ − Xn ) = Xτ .
Taking the expectation of both sides gives

    E [Xτ] = E [Xmin(τ,n)] + E [1[τ>n] (Xτ − Xn)]
           = E [X0] + E [1[τ>n] (Xτ − Xn)].

So if we can show that the right-hand term goes to zero in the limit, we are
done.
For the bounded-range case, we have |Xτ − Xn| ≤ 2M, so E [1[τ>n] (Xτ − Xn)] ≤
2M · Pr [τ > n]. Since in this case we assume Pr [τ < ∞] = 1, lim_{n→∞} Pr [τ > n] =
0, and the theorem holds.
For bounded increments, we have

    E [(Xτ − Xn) 1[τ>n]] = E [Σ_{t≥n} (Xt+1 − Xt) 1[τ>t]]
                         ≤ E [Σ_{t≥n} |Xt+1 − Xt| · 1[τ>t]]
                         ≤ E [Σ_{t≥n} c · 1[τ>t]]
                         ≤ c E [Σ_{t≥n} 1[τ>t]].

But E [τ] = E [Σ_{t=0}^{∞} 1[τ>t]]. Under the assumption that this series converges,
its tail goes to zero, and again the theorem holds.
For the general case, we can expand

    E [Xτ] = E [Xmin(τ,n)] + E [1[τ>n] Xτ] − E [1[τ>n] Xn]

which implies

    lim_{n→∞} E [Xτ] = lim_{n→∞} E [Xmin(τ,n)] + lim_{n→∞} E [1[τ>n] Xτ] − lim_{n→∞} E [1[τ>n] Xn],

assuming all these limits exist and are finite. We've already established that
the first limit is E [X0], which is exactly what we want. So we just need to
show that the other two limits both converge to zero. For the last limit, we
just use condition (4c), which gives lim_{n→∞} E [1[τ>n] Xn] = 0; no further
argument is needed. But we still need to show that the middle limit also
vanishes.
Here we use condition (4b). Observe that E [1[τ>n] Xτ] = Σ_{t=n+1}^{∞} E [1[τ=t] Xt].
Compare this with E [Xτ] = Σ_{t=0}^{∞} E [1[τ=t] Xt]; this is an absolutely conver-
gent series (this is why we need condition (4b)), so in the limit the sum of
the terms for t = 0 . . . n converges to E [Xτ]. But this means that the sum
of the remaining terms for t = n + 1 . . . ∞ converges to zero. So the middle
term goes to zero as n goes to infinity. This completes the proof.

9.4 Applications
Here we give some examples of the Optional Stopping Theorem in action. In
each case, the trick is to find an appropriate martingale and stopping time,
and let the theorem do all the work.

9.4.1 Random walks


Let Xt be an unbiased ±1 random walk that starts at 0, adds ±1 to its
current position with equal probability at each step, and stops if it reaches
−a or +b (this is called a random walk with two absorbing barriers). We'd
like to calculate the probability of reaching +b before −a.
Let τ be the time at which the process stops.
We can easily show that Pr [τ < ∞] = 1 and E [τ] < ∞ by observing that
from any state of the random walk, there is a probability of at least 2^−(a+b)
that it stops within a + b steps (by flipping heads a + b times in a row),
so that if we consider a sequence of intervals of length a + b, the expected
number of such intervals we can have before we stop is at most 2^{a+b}, giving
E [τ] ≤ (a + b) 2^{a+b} (we can do better than this).
We also have bounded increments by the definition of the process
(bounded range also works, at least up until time τ). So E [Xτ] = E [X0] = 0
and the probability p of landing on +b instead of −a must satisfy pb − (1 −
p)a = 0, giving p = a/(a + b).
Now suppose we want to find E [τ ]. Let Yt = Xt2 − t. Then Yt+1 =
(Xt ± 1)2 − (t + 1) = Xt2 ± 2Xt + 1 − (t + 1) = (Xt2 − t) ± 2Xt = Yt ± 2Xt . Since
the plus and minus cases are equally likely, they cancel out in expectation
and E [Yt+1 | Ft ] = Yt : we just showed Yt is a martingale.3 We can also show

it has bounded increments (at least up until time τ), because |Yt+1 − Yt| =
2|Xt| ≤ 2 max(a, b).
From Theorem 9.3.1, E [Yτ] = 0, which gives E [τ] = E [Xτ²]. But we
can calculate E [Xτ²]: it is a² Pr [Xτ = −a] + b² Pr [Xτ = b] = a²(b/(a + b)) +
b²(a/(a + b)) = (a²b + b²a)/(a + b) = ab.
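A quick simulation confirms both answers (illustrative Python, not part of
the proof):

    import random

    def walk(a, b):
        x, t = 0, 0
        while -a < x < b:
            x += random.choice((-1, 1))
            t += 1
        return x == b, t

    # With a = 3, b = 5: Pr[hit +b] should be about 3/8 and E[tau] about ab = 15.
    N = 100_000
    results = [walk(3, 5) for _ in range(N)]
    print(sum(hit for hit, _ in results) / N, sum(t for _, t in results) / N)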


If we have a random walk that only stops at +b,4 then if τ is the
first time at which Xτ = b, τ is a stopping time. However, in this case
E [Xτ ] = b 6= E [X0 ] = 0. So the optional stopping theorem doesn’t apply in
this case. But we have bounded increments, so Theorem 9.3.1 would apply if
E [τ ] < ∞. It follows that the expected time until we reach b is unbounded,
either because sometimes we never reach b, or because we always reach b but
sometimes it takes a very long time. 5
3 This construction generalizes in a nice way to arbitrary martingales. Suppose {Xt} is
a martingale with respect to {Ft}. Let ∆t = Xt − Xt−1, and let Vt = Var [∆t | Ft−1] be
the conditional variance of the t-th increment (note that this is a random variable that
may depend on previous outcomes). We can easily show that Yt = Xt² − Σ_{i=1}^{t} Vi is a
martingale. The proof is that

    E [Yt | Ft−1] = E [Xt² − Σ_{i=1}^{t} Vi | Ft−1]
                  = E [(Xt−1 + ∆t)² | Ft−1] − Σ_{i=1}^{t} Vi
                  = E [Xt−1² + 2Xt−1 ∆t + ∆t² | Ft−1] − Σ_{i=1}^{t} Vi
                  = Xt−1² + 2Xt−1 E [∆t | Ft−1] + E [∆t² | Ft−1] − Σ_{i=1}^{t} Vi
                  = Xt−1² + 0 + Vt − Σ_{i=1}^{t} Vi
                  = Xt−1² − Σ_{i=1}^{t−1} Vi
                  = Yt−1.

For the ±1 random walk case, we have Vt = 1 always, giving Σ_{i=1}^{t} Vi = t and E [Xτ²] =
E [X0²] + E [τ] when τ is a stopping time satisfying the conditions of the Optional Stopping
Theorem. For the general case, the same argument gives E [Xτ²] = E [X0²] + E [Σ_{t=1}^{τ} Vt]
instead: the expected square position of Xt is incremented by the conditional variance at
each step.
4 This would be a random walk with one absorbing barrier.
5 In fact, we always reach b. An easy way to see this is to imagine a sequence of intervals
of length n1, n2, . . . , where ni+1 = (b + Σ_{j=1}^{i} nj)². At the end of the i-th interval, we
are no lower than −Σ_{j=0}^{i} nj, so we only need to go up √(ni+1) positions to reach b by the
end of the (i + 1)-th interval. Since this is just one standard deviation, it occurs with
constant probability, so after a finite expected number of intervals, we will reach +b. Since
there are infinitely many intervals, we reach +b with probability 1.

We can also consider a biased random walk where +1 occurs with


probability p and −1 with probability q = 1 − p. If Xt is the position of the
random walk at time t, and Ft is the associated σ-algebra, then Xt isn’t a
martingale with respect to Ft . But there are at least two ways to turn it
into one:
1. Define Yt = Xt − (p − q)t. Then

    E [Yt+1 | Ft] = E [Xt+1 − (p − q)(t + 1) | Ft]
                  = p(Xt + 1) + q(Xt − 1) − (p − q)(t + 1)
                  = (p + q)Xt + (p − q) − (p − q)(t + 1)
                  = Xt − (p − q)t
                  = Yt,

and Yt is a martingale with respect to Ft.

2. Define Zt = (q/p)^{Xt}. Then

    E [Zt+1 | Ft] = p(q/p)^{Xt+1} + q(q/p)^{Xt−1}
                  = (q/p)^{Xt} (p(q/p) + q(p/q))
                  = (q/p)^{Xt} (q + p)
                  = (q/p)^{Xt}
                  = Zt.

Again we have a martingale with respect to Ft.


Now let’s see what we can do with the Optional Stopping Theorem.

1. Suppose we start with X0 = 0 and we want to know the probability pb
that we will reach +b before we reach −a.
Let τ be the first time at which Xτ ∈ {−a, b}. We can use the same
argument as in the unbiased case to show that Pr [τ < ∞] = 1 and
E [τ] < ∞, because from any position Xt there is at least a p^{a+b} > 0
chance that the next a + b steps will all be +1 and we will reach b.
Since we can flip this p^{a+b}-probability coin independently every a + b
steps, eventually we reach b if we haven't already reached −a. The
expected time is bounded by (a + b)/p^{a+b}.
 
We also have that 0 < Zt ≤ max((q/p)^{−a}, (q/p)^{b}) for all t ≤ τ. This
gives us bounded range, so the finite-time/bounded-range case of OST
applies.
We thus have E [Zτ] = pb (q/p)^{b} + (1 − pb)(q/p)^{−a} = E [Z0] = (q/p)^0 = 1.
Solving for pb gives

    pb = (1 − (q/p)^{−a}) / ((q/p)^{b} − (q/p)^{−a}).

(Note that this only makes sense if q ≠ p.)
As a test, if −a = −1 and +b = +1, then we get

    pb = (1 − p/q) / (q/p − p/q)
       = (pq − p²) / (q − p)
       = p(q − p) / (q − p)
       = p,

which is what we would expect since we hit −a or +b on the first step.
A more interesting case is if we set p < 1/2. Then q/p > 1, and

    pb = (1 − (q/p)^{−a}) / ((q/p)^{b} − (q/p)^{−a})
       = (q/p)^{−b} · (1 − (q/p)^{−a}) / (1 − (q/p)^{−a−b})
       < (q/p)^{−b}.

Any walk that is biased against us will be an exponentially improbable


hill to climb.
2. Now suppose we want to know E [τ], the average time at which we
first hit −a or b. We already argued E [τ] is finite, and it's easy to see
that {Yt} has bounded increments, so we can use the finite-expected-
time/bounded-increment case of OST to get E [Yτ] = E [Y0] = 0, or
E [Xτ − (p − q)τ] = 0. It follows that E [τ] = E [Xτ] /(p − q).
But we can compute E [Xτ], since it is just pb·b − (1 − pb)a. So E [τ] =
(pb·b − (1 − pb)a)/(p − q). If p > q, and a and b are both large enough
to make pb very close to 1, this will be approximately b/(p − q), the
time to climb to b using our average return of p − q per step.

9.4.2 Wald’s equation


Suppose we run a Las Vegas algorithm until it succeeds, and the i-th attempt
costs Xi , where all the Xi are independent, satisfy 0 ≤ Xi ≤ c for some c,
and have a common mean E [Xi ] = µ.
Let N be the number of times we run the algorithm. Since we can tell
when we are done, N is a stopping time with respect to some filtration {Fi}
to which the Xi are adapted (a stochastic process {Xt} is adapted to a
filtration {Ft} if each Xt is Ft-measurable). Suppose also that E [N] exists.
What is E [Σ_{i=1}^{N} Xi]?
If N were not a stopping time, this might be a very messy problem
indeed. But when N is a stopping time, we can apply it to the martingale
Yt = Σ_{i=1}^{t} (Xi − µ). This has bounded increments (0 ≤ Xi ≤ c, so −c ≤
Xi − E [Xi] ≤ c), and we've already said E [N] is finite (which implies
Pr [N < ∞] = 1), so Theorem 9.3.1 applies. We thus have

    0 = E [YN]
      = E [Σ_{i=1}^{N} (Xi − µ)]
      = E [Σ_{i=1}^{N} Xi] − E [Σ_{i=1}^{N} µ]
      = E [Σ_{i=1}^{N} Xi] − E [N] µ.

Rearranging this gives Wald's equation:

    E [Σ_{i=1}^{N} Xi] = E [N] µ.          (9.4.1)

This is the same formula as in §3.4.3.1, but we've eliminated the bound
on N and allowed for much more dependence between N and the Xi.7

7 In fact, looking closely at the proof reveals that we don't even need the Xi to be
independent of each other. We just need that E [Xi+1 | Fi] = µ for all i to make (Yt, Ft) a
martingale. But if we don't carry any information from one iteration of our Las Vegas
algorithm to the next, we'll get independence anyway. So the big payoff is not having to
worry about whether N has some devious dependence on the Xi.
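As a sanity check of (9.4.1), here is a toy Las Vegas loop in Python, where
each attempt costs Uniform{1, . . . , 6} and succeeds with probability 1/5, so
E [N] µ = 5 · 3.5 = 17.5:

    import random

    def total_cost():
        total = 0
        while True:
            total += random.randint(1, 6)    # cost X_i of this attempt
            if random.random() < 0.2:        # attempt succeeded: stop
                return total

    N = 100_000
    print(sum(total_cost() for _ in range(N)) / N)   # should be close to 17.5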

9.4.3 Maximal inequalities


Suppose we have a martingale {Xi } with Xi ≥ 0 always, and we want to
bound maxi≤n Xi . We can do this using the Optional Stopping Theorem:

Lemma 9.4.1. Let {Xi} be a martingale with Xi ≥ 0. Then for any fixed
n,

    Pr [max_{i≤n} Xi ≥ α] ≤ E [X0] / α.          (9.4.2)
Proof. The idea is to pick a stopping time τ such that max_{i≤n} Xi ≥ α if and
only if Xτ ≥ α.
Let τ be the first time t such that Xt ≥ α or t ≥ n. Then τ is a stopping
time for {Xi}, since we can determine from X0, . . . , Xt whether τ ≤ t or
not. We also have that τ ≤ n always, which is equivalent to τ = min(τ, n).
Finally, Xτ ≥ α means that max_{i≤n} Xi ≥ Xτ ≥ α, and conversely if there
is some t ≤ n with Xt = max_{i≤n} Xi ≥ α, then τ is the first such t, giving
Xτ ≥ α.
Lemma 9.3.2 says E [Xτ] = E [X0]. So Markov's inequality gives
Pr [max_{i≤n} Xi ≥ α] = Pr [Xτ ≥ α] ≤ E [Xτ] / α = E [X0] / α, as claimed.

Lemma 9.4.1 is a special case of Doob's martingale inequality, which says that for a non-negative submartingale {X_i},
\[
\Pr\left[\max_{i \le n} X_i \ge \alpha\right] \le \frac{E[X_n]}{\alpha}. \tag{9.4.3}
\]
The proof is similar, but requires showing first that E[X_τ] ≤ E[X_n] when τ ≤ n is a stopping time and {X_i} is a submartingale.
Doob’s martingale inequality is what you get if you generalize Markov’s
inequality to martingales. The analogous generalization of Chebyshev’s
inequality is Kolmogorov’s inequality, which says:
Lemma 9.4.2. For sums S_i = ∑_{j=1}^i X_j of independent random variables X_1, X_2, ..., X_n with E[X_i] = 0,
\[
\Pr\left[\max_{i \le n} |S_i| \ge \alpha\right] \le \frac{\operatorname{Var}[S_n]}{\alpha^2}. \tag{9.4.4}
\]

Proof. Let Y_i = S_i² − Var[S_i]. Then {Y_i} is a martingale. This implies that E[Y_n] = E[S_n² − Var[S_n]] = Y_0 = 0, and thus that E[S_n²] = Var[S_n]. It's also easy to see that S_i² is a submartingale, since E[S_{i+1}² | F_i] = S_i² + 2S_i E[X_{i+1}] + E[X_{i+1}²] = S_i² + Var[X_{i+1}] ≥ S_i². Now apply (9.4.3) to S_i² with threshold α²: Pr[max_{i≤n} |S_i| ≥ α] = Pr[max_{i≤n} S_i² ≥ α²] ≤ E[S_n²]/α² = Var[S_n]/α².

In general, because we can always stop updating a martingale once we hit


a particular threshold, other martingale concentration bounds like Azuma-
Hoeffding will also apply to the maximum or minimum of a martingale over
a given interval.
9.4.4 Waiting times for patterns


Let's suppose we flip coins until we see some pattern appear: for example, we might flip coins until we see HTHH. What is the expected number of coin-flips until this happens?
A very clever trick due to Li [Li80] solves this problem exactly using the Optional Stopping Theorem. Suppose our pattern is x_1 x_2 ... x_k. We imagine an army of gamblers, one of which shows up before each coin-flip. Each gambler starts by betting $1 that the next coin-flip will be x_1. If they win, they bet $2 that the next coin-flip will be x_2, continuing to play double-or-nothing until either they lose (and are down $1) or win their last bet on x_k (and are up $2^k − 1). Because each gambler's winnings form a martingale, so does their sum, and so the expected total return of all gamblers up to the stopping time τ at which our pattern first occurs is 0.
We can now use this fact to compute E[τ]. When we stop at time τ, we have one gambler who has won $2^k − 1. We may also have other gamblers who are still in play. For each i with x_1 ... x_i = x_{k−i+1} ... x_k, there will be a gambler with net winnings ∑_{j=1}^i 2^{j−1} = 2^i − 1. The remaining gamblers will all be at −1.
Let χ_i = 1 if x_1 ... x_i = x_{k−i+1} ... x_k, and 0 otherwise. Then the number of losers is given by τ − ∑_{i=1}^k χ_i and the total expected payoff is
\begin{align*}
E[X_\tau] &= E\left[-\left(\tau - \sum_{i=1}^k \chi_i\right) + \sum_{i=1}^k \chi_i \left(2^i - 1\right)\right] \\
&= E\left[-\tau + \sum_{i=1}^k \chi_i 2^i\right] \\
&= 0.
\end{align*}
It follows that E[τ] = ∑_{i=1}^k χ_i 2^i.

As a quick test, the pattern H has E[τ] = 2^1 = 2. This is consistent with what we know about geometric distributions.
For a longer example, the pattern HTHH only overlaps with its prefix H, so in this case we have E[τ] = ∑ χ_i 2^i = 16 + 2 = 18. But HHHH overlaps with all of its prefixes, giving E[τ] = 16 + 8 + 4 + 2 = 30. At the other extreme, THHH has no overlap at all and gives E[τ] = 16.
In general, for a pattern of length k, we expect a waiting time somewhere between 2^k and 2^{k+1} − 2, almost a factor of 2 difference depending on how much overlap we get.
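The formula is easy to mechanize. Here is a short Python sketch (an illustration; expected_wait and simulate_wait are invented names) that computes E[τ] = ∑ χ_i 2^i from the overlap structure of the pattern and checks it against direct simulation of fair coin-flips:

    import random

    def expected_wait(pattern):
        # E[tau] = sum of 2^i over those i where the length-i prefix of the
        # pattern equals its length-i suffix (the chi_i = 1 cases).
        k = len(pattern)
        return sum(2**i for i in range(1, k + 1)
                   if pattern[:i] == pattern[k - i:])

    def simulate_wait(pattern, trials=10_000):
        total = 0
        for _ in range(trials):
            flips = ""
            while not flips.endswith(pattern):
                flips += random.choice("HT")
            total += len(flips)
        return total / trials

    for pat in ["HTHH", "HHHH", "THHH"]:
        print(pat, expected_wait(pat), simulate_wait(pat))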
This analysis generalizes in the obvious way to biased coins and larger
alphabets. See the paper [Li80] for details.
Chapter 10

Markov chains

A (discrete time) Markov chain is a sequence of random variables X0 , X1 , X2 , . . . ,


which we think of as the position of some particle at increasing times in N,
where the distribution of Xt+1 depends only on the value of Xt . A typical
example of a Markov chain is a random walk on a graph: each Xt is a node
in the graph, and a step moves to one of the neighbors of the current node
chosen at random, each with equal probability.
Markov chains come up in randomized algorithms both because the
execution of any randomized algorithm is, in effect, a Markov chain (the
random variables are the states of the algorithm); and because we can
often sample from distributions that are difficult to sample from directly
by designing a Markov chain that converges to the distribution we want.
Algorithms that use this latter technique are known as Markov chain
Monte Carlo algorithms, and rely on the fundamental fact that a Markov
chain that satisfies a few straightforward conditions will always converge in
the limit to a stationary distribution, no matter what state it starts in.
An example of this technique that predates randomized algorithms is
card shuffling: each permutation of the deck is a state, and the shuffling
operation sends the deck to a new state each time it is applied. Assuming
the shuffling operation is not too deterministic, it is possible to show that
enough shuffling will eventually produce a state that is close to being a
uniform random permutation. The big algorithmic question for this and
similar Markov chains is how quickly this happens: what is the mixing time
of the Markov chain, a measure of how long we have to run it to get close to
its limit distribution (this notion is defined formally in §10.2.3). Many of
the techniques in this chapter will be aimed at finding bounds on the mixing
time for particular Markov processes.


If you want to learn more about Markov chains than presented here, they are usually covered in general probability textbooks (for example, in [Fel68] or [GS01]), mentioned in many linear algebra textbooks [Str03], covered in some detail in stochastic processes textbooks [KT75], and covered in exquisite detail in many books dedicated specifically to the subject [KS76, KSK76]. Good sources for mixing times for Markov chains are the textbook of Levin, Peres, and Wilmer [LPW09] and the survey paper by Montenegro and Tetali [MT05]. An early reference on the mixing times for random walks on graphs that helped inspire much subsequent work is the Aldous-Fill manuscript [AF01], which can be found on-line at http://www.stat.berkeley.edu/~aldous/RWG/book.html.

10.1 Basic definitions and properties


A Markov chain or Markov process is a stochastic process¹ where the distribution of X_{t+1} depends only on the value of X_t and not any previous history. Formally, this means that
\[
\Pr[X_{t+1} = j \mid X_t = i_t, X_{t-1} = i_{t-1}, \dots, X_0 = i_0] = \Pr[X_{t+1} = j \mid X_t = i_t]. \tag{10.1.1}
\]
A stochastic process with this property is called memoryless: at any time, you know where you are, and you can figure out where you are going, but you don't know where you were before.
The state space of the chain is just the set of all values that each X_t can have. A Markov chain is finite or countable if it has a finite or countable state space, respectively. We'll mostly be interested in finite Markov chains (since we have to be able to fit them inside our computer), but countable Markov chains will come up in some contexts.²
We'll also assume that our Markov chains are homogeneous, which means that Pr[X_{t+1} = j | X_t = i] doesn't depend on t.

¹A stochastic process is just a sequence of random variables {S_t}, where we usually think of t as representing time and the sequence as representing the evolution of some system over time. Here we are considering discrete-time processes, where t will typically be a non-negative integer.
²If the state space is not countable, we run into the same measure-theoretic issues as with continuous random variables, and have to replace (10.1.1) with the more general condition that
\[
E\left[\mathbb{1}_{[X_{t+1} \in A]} \mid X_t, X_{t-1}, \dots, X_0\right] = E\left[\mathbb{1}_{[X_{t+1} \in A]} \mid X_t\right],
\]
provided A is measurable with respect to some appropriate σ-algebra. We don't really want to deal with this, and for the most part we don't have to, so we won't.
Figure 10.1: Drawing a Markov chain as a directed graph. Nodes represent states. Edges, labeled with probabilities, represent possible transitions. Zero-probability transitions are usually omitted. [The drawing shows four states 1, 2, 3, 4 connected in a path, with self-loops at 1 and 4.] The corresponding transition matrix is
\[
P = \begin{pmatrix}
p_{11} & p_{12} & 0 & 0 \\
p_{21} & 0 & p_{23} & 0 \\
0 & p_{32} & 0 & p_{34} \\
0 & 0 & p_{43} & p_{44}
\end{pmatrix}.
\]

For a homogeneous countable Markov chain, we can describe its behavior completely by giving the state space and the one-step transition probabilities p_{ij} = Pr[X_{t+1} = j | X_t = i]. Given p_{ij}, we can calculate the two-step transition probabilities
\begin{align*}
p^{(2)}_{ij} = \Pr[X_{t+2} = j \mid X_t = i]
&= \sum_k \Pr[X_{t+2} = j \mid X_{t+1} = k] \Pr[X_{t+1} = k \mid X_t = i] \\
&= \sum_k p_{ik} p_{kj}.
\end{align*}
This is identical to the formula for matrix multiplication. For a Markov chain with n states, we can specify the transition probabilities p_{ij} using an n × n transition matrix P with P_{ij} = p_{ij}, and the two-step transition probabilities are given by p^{(2)}_{ij} = (P²)_{ij}. More generally, the t-step transition probabilities are given by p^{(t)}_{ij} = (P^t)_{ij}.
Conversely, given any matrix with non-negative entries where the rows sum to 1 (∑_j P_{ij} = 1, or P·1 = 1, where 1 in the second equation stands for the all-ones vector), there is a corresponding Markov chain given by p_{ij} = P_{ij}. Such a matrix is called a stochastic matrix; for every stochastic matrix there is a corresponding finite Markov chain and vice versa.
The general formula for (s + t)-step transition probabilities is
\[
p^{(s+t)}_{ij} = \sum_k p^{(s)}_{ik} p^{(t)}_{kj}.
\]
This is known as the Chapman-Kolmogorov equation and is equivalent to the matrix identity P^{s+t} = P^s P^t. It is also similar to the formula for counting paths in a directed graph, and we can use the correspondence between matrices and (labeled) directed graphs to give visual depictions of particular Markov chains, as shown in Figure 10.1.
A distribution over states of a finite Markov chain at some time t can be given by a row vector x, where x_i = Pr[X_t = i]. To compute the distribution at time t + 1, we use the law of total probability:
\[
\Pr[X_{t+1} = j] = \sum_i \Pr[X_t = i] \Pr[X_{t+1} = j \mid X_t = i] = \sum_i x_i p_{ij}.
\]
Again we have the formula for matrix multiplication (where we treat x as a 1 × n matrix); so the distribution vector at time t + 1 is just xP, and at time t + n is xP^n.
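In code, this is just iterated vector–matrix multiplication. A quick sketch using numpy (my example; the particular 3-state matrix is arbitrary, not one discussed in the text):

    import numpy as np

    # An arbitrary 3-state stochastic matrix: non-negative rows summing to 1.
    P = np.array([[0.5,  0.5,  0.0],
                  [0.25, 0.5,  0.25],
                  [0.0,  0.5,  0.5]])

    x = np.array([1.0, 0.0, 0.0])              # start deterministically in state 0
    print(x @ P)                                # distribution at time 1
    print(x @ np.linalg.matrix_power(P, 2))     # two-step probabilities, x P^2
    print(x @ np.linalg.matrix_power(P, 100))   # nearly the stationary distribution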
We like Markov chains for two reasons:

1. They describe what happens in a randomized algorithm; the state


space is just the set of all states of the algorithm, and the Markov
property holds because the algorithm can’t remember anything that
isn’t part of its state. So if we want to analyze randomized algorithms,
we will need to get good at analyzing Markov chains.

2. They can be used to do sampling over interesting distributions. Under


appropriate conditions (see below), the state of a Markov chain con-
verges to a stationary distribution. If we build the right Markov
chain, we can control what this stationary distribution looks like, run
the chain for a while, and get a sample close to the stationary distribu-
tion.

In both cases we want to have a bound on how long it takes the Markov
chain to converge, either because it tells us when our algorithm terminates,
or because it tells us how long to mix it up before looking at the current
state.

10.1.1 Examples
• A fair ±1 random walk. The state space is Z, the transition probabilities
are pij = 1/2 if |i − j| = 1, 0 otherwise. This is an example of a Markov
chain that is also a martingale.

• A fair ±1 random walk on a cycle. As above, but now the state space
is Z/m, the integers mod m. This is a finite Markov chain. It is also in
some sense a martingale, although we usually don’t define martingales
over finite groups.

• Random walks with absorbing and/or reflecting barriers.

• Random walk on a graph G = (V, E). The state space is V , the


transition probabilities are puv = 1/d(u) if uv ∈ E.
One can also have more general transition probabilities, where the
probability of traversing a particular edge is a property of the edge and
not the degree of its source. In principle we can represent any Markov
chain as a random walk on a graph in this way: the states become vertices,
and the transitions become edges, each labeled with its transition
probability. It’s conventional in this representation to exclude edges
with probability 0 and include self-loops for any transitions i → i.
If the resulting graph is small enough or has a nice structure, this can
be a convenient way to draw a Markov chain.

• The Markov chain given by Xt+1 = Xt + 1 with probability 1/2, and 0


with probability 1/2. The state space is N.

• A finite-state machine running on a random input. The sequence


of states acts as a Markov chain, assuming each input symbol is
independent of the rest.

• A classic randomized algorithm for 2-SAT, due to Papadimitriou [Pap91]. Each state is a truth-assignment. The transition probabilities are messy but arise from the following process: pick an unsatisfied clause, pick one of its two variables uniformly at random, and invert it. (A code sketch of this walk appears after this list.) Then there is an absorbing state at any satisfying assignment. With a bit of work, it can be shown that the Hamming distance between the current assignment and some satisfying assignment follows a random walk that is either unbiased or biased toward 0, giving a satisfying assignment after O(n²) steps on average.³ This algorithm is not necessarily all that good, because there is a clever deterministic algorithm that solves 2-SAT in time linear in the size of the formula [APT79], but it has the nice property of not being so clever.

• A similar process works for 2-colorability, 3-SAT, 3-colorability, etc.,


although for NP-hard problems, it may take a while to reach an
absorbing state. The constructive Lovász Local Lemma proof from
§13.3.5 also follows this pattern.
³The proof of this is not too hard: given an unsatisfying assignment x and a satisfying assignment y, any clause that is not satisfied by x includes at least one variable on which x and y disagree. If we get lucky and flip this variable, we reduce the distance by one. If not, we increase it by at most 1 (depending on whether y satisfies only one or both variables). So we get a process bounded by an unbiased random walk with a reflecting barrier at n and an absorbing barrier at 0, assuming we don't hit some other satisfying assignment y′ first.
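As promised, here is a sketch in Python of the 2-SAT random walk described above (an illustration only: the encoding of literals as signed integers and the cutoff of 100n² steps are my choices, not Papadimitriou's).

    import random

    def random_walk_2sat(clauses, n, max_steps=None):
        # clauses: list of pairs of nonzero ints; literal v means
        # "x_|v| is True" if v > 0 and "x_|v| is False" if v < 0.
        # Returns a satisfying assignment (dict) or None after max_steps.
        if max_steps is None:
            max_steps = 100 * n * n        # comfortably above the O(n^2) average
        assign = {i: random.random() < 0.5 for i in range(1, n + 1)}
        for _ in range(max_steps):
            unsat = [c for c in clauses
                     if not any(assign[abs(v)] == (v > 0) for v in c)]
            if not unsat:
                return assign
            clause = random.choice(unsat)  # pick an unsatisfied clause
            v = random.choice(clause)      # pick one of its two variables
            assign[abs(v)] = not assign[abs(v)]   # and invert it
        return None

    # (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
    print(random_walk_2sat([(1, 2), (-1, 3), (-2, -3)], n=3))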
10.2 Convergence of Markov chains


We want to use Markov chains for sampling. Typically this means that we
have some subset S of the state space, and we want to know what proportion
of the states are in S. If we can't sample states from the state space a priori, we may be able to get a good approximation by running the Markov chain for a while and hoping that it converges to something predictable.
To show this, we will proceed through several steps:

1. We will define a class of distributions, known as stationary distribu-


tions, that we hope to converge to (§10.2.1).

2. We will define a distance between distributions, the total variation


distance (§10.2.2).

3. We will define the mixing time of a Markov chain as the minimum


time for the distribution of the position of a particle to get within a
given total variation distance of the stationary distribution (§10.2.3).

4. We will describe a technique called coupling that can be used to


bound total variation distance (§10.2.4), in terms of the probability
that two dependent variables X and Y with the distributions we are
looking at are or are not equal to each other.

5. We will define reducible and periodic Markov chains, which have


structural properties that prevent convergence to a unique stationary
distribution (§10.2.5).

6. We will use a coupling between two copies of a Markov chain to


show that any Markov chain that does not have these properties does
converge in total variation distance to a unique stationary distribution
(§10.2.6).

7. Finally, we will show that if we can construct a coupling between two


copies of a particular chain that causes both copies to reach the same
state quickly on average, then we can use this expected coupling time
to bound the mixing time of the chain that we defined previously. This
will give us a practical tool for showing that many Markov chains not
only converge eventually but converge at a predictable rate, so we can
tell when it is safe to stop running the chain and take a sample (§10.4).

Much of this section follows the approach of Chapter 4 of Levin et


al. [LPW09].
10.2.1 Stationary distributions


For a finite Markov chain, there is a transition matrix P in which each row sums to 1. We can write this fact compactly as P·1 = 1, where 1 is the all-ones column vector. This means that 1 is a right eigenvector of P with eigenvalue 1.⁴ Because the left eigenvalues of a matrix are equal to the right eigenvalues, this means that there will be at least one left eigenvector π such that πP = π, and in fact it is possible to show that there is at least one such π that represents a probability distribution, in that each π_i ≥ 0 and ∑ π_i = 1. Such a distribution is called a stationary distribution⁵ of the Markov chain, and if π is unique, the probability π_i is called the stationary probability for i.
Every finite Markov chain has at least one stationary distribution, but it may not be unique. For example, if the transition matrix is the identity matrix (meaning that the particle never moves), then all distributions are stationary.
If a Markov chain does have a unique stationary distribution, we can calculate it from the transition matrix, by observing that the equations πP = π and π·1 = 1 together give n + 1 equations in n unknowns, which we can solve for π. (We need an extra equation because the stochastic property of P means that P − I has rank at most n − 1.)
Often this will be impractical, especially if our state space is large enough or messy enough that we can't write down the entire matrix. In these cases we may be able to take advantage of a special property of some Markov chains, called reversibility. We'll discuss this in §10.3. For the moment we will content ourselves with showing that we do in fact converge to some unique stationary distribution if our Markov chain has the right properties.

⁴Given a square matrix A, a vector x is a right eigenvector of A with eigenvalue λ if Ax = λx. Similarly, a vector y is a left eigenvector of A with eigenvalue λ if yA = λy.
⁵Spelling is important here. A stationery distribution would involve handing out office supplies.
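For a chain small enough to write down, this computation is a few lines of numpy. A sketch (my example, reusing an arbitrary 3-state stochastic matrix):

    import numpy as np

    P = np.array([[0.5,  0.5,  0.0],
                  [0.25, 0.5,  0.25],
                  [0.0,  0.5,  0.5]])
    n = P.shape[0]

    # Solve pi P = pi together with sum(pi) = 1: stack the n equations
    # (P^T - I) pi^T = 0 (one of which is redundant) on top of a row of ones.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    print(pi)    # [0.25, 0.5, 0.25] for this particular chain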

10.2.2 Total variation distance


Definition 10.2.1. Let X and Y be random variables defined on the same
probability space. Then the total variation distance between X and Y ,
written d_{TV}(X, Y), is given by
\[
d_{TV}(X, Y) = \sup_A \left(\Pr[X \in A] - \Pr[Y \in A]\right), \tag{10.2.1}
\]
where the supremum is taken over all sets A for which Pr[X ∈ A] and Pr[Y ∈ A] are both defined.⁶

⁶For discrete random variables, this just means all A, since we can write Pr[X ∈ A] as ∑_{x∈A} Pr[X = x]; we can also replace sup with max for this case. For continuous random variables, we want that X^{−1}(A) and Y^{−1}(A) are both measurable. If our X and Y range over the states of a countable Markov chain, we will be working with discrete random variables, so we can just consider all A.
An equivalent definition is
\[
d_{TV}(X, Y) = \sup_A \bigl|\Pr[X \in A] - \Pr[Y \in A]\bigr|.
\]

The reason this is equivalent is that if Pr [X ∈ A] − Pr [Y ∈ A] is negative,


we can replace A by its complement.
Less formally, given any test set A, X and Y show up in A with proba-
bilities that differ by at most dT V (X, Y ). This is usually what we want for
sampling, since this says that if we are testing some property (represented by
A) of the states we are sampling, the answer (Pr [X ∈ A]) that we get for how
likely this property is to occur is close to the correct answer (Pr [Y ∈ A]).
Total variation distance is a property of distributions, and is not affected by dependence between X and Y. For finite Markov chains, we can define the total variation distance between two distributions x and y as
\[
d_{TV}(x, y) = \max_A \sum_{i \in A} (x_i - y_i) = \max_A \left|\sum_{i \in A} (x_i - y_i)\right|.
\]

A useful fact is that d_{TV}(x, y) is directly connected to the ℓ₁ distance between x and y. If we let B = {i | x_i ≥ y_i}, then
\[
d_{TV}(x, y) = \max_A \sum_{i \in A} (x_i - y_i) \le \sum_{i \in B} (x_i - y_i),
\]
because if A leaves out an element of B, or includes an element of B̄, this can only reduce the sum. But if we consider B̄ instead, we get
\[
d_{TV}(y, x) = \max_A \sum_{i \in A} (y_i - x_i) \le \sum_{i \in \bar{B}} (y_i - x_i).
\]
Now observe that
\begin{align*}
\|x - y\|_1 &= \sum_i |x_i - y_i| \\
&= \sum_{i \in B} (x_i - y_i) + \sum_{i \in \bar{B}} (y_i - x_i) \\
&= d_{TV}(x, y) + d_{TV}(y, x) \\
&= 2\,d_{TV}(x, y),
\end{align*}
where the inequalities above are in fact equalities (taking A = B and A = B̄ attains the maxima), and d_{TV}(x, y) = d_{TV}(y, x) by symmetry. So d_{TV}(x, y) = ½‖x − y‖₁.
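This gives a one-line implementation for distributions over a finite state space (a sketch; d_tv is an invented name):

    def d_tv(x, y):
        # Total variation distance between two distributions given as
        # equal-length sequences of probabilities: half the L1 distance.
        return 0.5 * sum(abs(xi - yi) for xi, yi in zip(x, y))

    print(d_tv([0.5, 0.5, 0.0], [0.25, 0.5, 0.25]))   # 0.25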

10.2.2.1 Total variation distance and expectation


Sometimes it’s useful to translate a bound on total variation to a bound on
the error when getting the expectation of a random variable. The following
lemma may be handy:

Lemma 10.2.2. Let x and y be two distributions of some discrete random variable Z. Let E_x(Z) and E_y(Z) be the expectations of Z with respect to each of these distributions. Suppose that |Z| ≤ M always. Then
\[
\bigl|E_x(Z) - E_y(Z)\bigr| \le 2M \cdot d_{TV}(x, y). \tag{10.2.2}
\]

Proof. Compute
\begin{align*}
|E_x(Z) - E_y(Z)| &= \left|\sum_z z \left(\Pr_x(Z = z) - \Pr_y(Z = z)\right)\right| \\
&\le \sum_z |z| \cdot \left|\Pr_x(Z = z) - \Pr_y(Z = z)\right| \\
&\le M \sum_z \left|\Pr_x(Z = z) - \Pr_y(Z = z)\right| \\
&\le M \|x - y\|_1 \\
&= 2M \cdot d_{TV}(x, y).
\end{align*}

10.2.3 Mixing time


We are going to show that well-behaved finite Markov chains eventually converge to some stationary distribution π in total variation distance. This means that for any ε > 0, there is a mixing time t_mix(ε) such that for any initial distribution x and any t ≥ t_mix(ε),
\[
d_{TV}(xP^t, \pi) \le \epsilon.
\]

It is common to standardize ε as 1/4: if we write just t_mix, this means t_mix(1/4). The choice of 1/4 is somewhat arbitrary, but has some nice technical properties that we will see below.

10.2.4 Coupling of Markov chains


A coupling of two random variables X and Y is a joint distribution on ⟨X, Y⟩ that gives the correct marginal distribution for each of X and Y while creating a dependence between them with some desirable property (for example, minimizing total variation distance or maximizing Pr[X = Y]).
We will use couplings between Markov chains to prove convergence. Here we take two copies of the chain, one of which starts in an arbitrary distribution, and one of which starts in the stationary distribution, and show that we can force them to converge to each other by carefully correlating their transitions. Since the second chain is always in the stationary distribution, this will show that the first chain converges to the stationary distribution as well.
The tool that makes this work is the Coupling Lemma:⁷

Lemma 10.2.3. For any discrete random variables X and Y,
\[
d_{TV}(X, Y) \le \Pr[X \ne Y].
\]

⁷It turns out that the bound in the Coupling Lemma is tight in the following sense: for any given distributions on X and Y, there exists a joint distribution giving these distributions such that d_{TV}(X, Y) is exactly equal to Pr[X ≠ Y] when X and Y are sampled from the joint distribution. For discrete distributions, the easiest way to construct the joint distribution is first to let Y = X = i for each i with probability min(Pr[X = i], Pr[Y = i]), and then distribute the remaining probability for X over all the cases where Pr[X = i] > Pr[Y = i], and similarly for Y over all the cases where Pr[Y = i] > Pr[X = i]. Looking at the unmatched values for X gives Pr[X ≠ Y] ≤ ∑_{i : Pr[X=i] > Pr[Y=i]} (Pr[X = i] − Pr[Y = i]) ≤ d_{TV}(X, Y). So in this case Pr[X ≠ Y] = d_{TV}(X, Y).
Unfortunately, the fact that there always exists a perfect coupling in this sense does not mean that we can express it in any convenient way, or that even if we could, it would arise from the kind of causal, step-by-step construction that we will use for couplings between Markov processes.
Proof. Let A be any set for which Pr[X ∈ A] and Pr[Y ∈ A] are defined. Then
\begin{align*}
\Pr[X \in A] &= \Pr[X \in A \wedge Y \in A] + \Pr[X \in A \wedge Y \notin A], \\
\Pr[Y \in A] &= \Pr[X \in A \wedge Y \in A] + \Pr[X \notin A \wedge Y \in A],
\end{align*}
and thus
\begin{align*}
\Pr[X \in A] - \Pr[Y \in A] &= \Pr[X \in A \wedge Y \notin A] - \Pr[X \notin A \wedge Y \in A] \\
&\le \Pr[X \in A \wedge Y \notin A] \\
&\le \Pr[X \ne Y].
\end{align*}
Since this holds for any particular set A, it also holds when we take the maximum over all A to get d_{TV}(X, Y).

For Markov chains, our goal will be to find a useful coupling between a sequence of random variables X_0, X_1, X_2, ... corresponding to the Markov chain starting in an arbitrary distribution with a second sequence Y_0, Y_1, Y_2, ... corresponding to the same chain starting in a stationary distribution. What will make a coupling useful is if Pr[X_t ≠ Y_t] is small for reasonably large t: since Y_t has the stationary distribution, this will show that d_{TV}(xP^t, π) is also small.
Our first use of this technique will be to show, using a rather generic coupling, that Markov chains with certain nice properties converge to their stationary distribution in the limit. Later we will construct specialized couplings for particular Markov chains to show that they converge quickly. But first we will consider what properties a Markov chain must have to converge at all.

10.2.5 Irreducible and aperiodic chains


Not all chains are guaranteed to converge to their stationary distribution.
If some states are not reachable from other states, it may be that starting
in one part of the chain will keep us from ever reaching another part of the
chain. Even if the chain is not disconnected in this way, we might still not
converge if the distribution oscillates back and forth due to some periodicity
in the structure of the chain. But if these conditions do not occur, then we
will be able to show convergence.
Let p^t_{ij} be Pr[X_t = j | X_0 = i].
A Markov chain is irreducible if, for all states i and j, there exists some t such that p^t_{ij} ≠ 0. This says that we can reach any state from any other
state if we wait long enough. If we think of a directed graph of the Markov


chain where the states are vertices and each edge represents a transition that
occurs with nonzero probability, the Markov chain is irreducible if its graph
is strongly connected.
The period of a state i of a Markov chain is gcd{t > 0 | p^t_{ii} ≠ 0}. If the period of i is m, then starting from i we can only return to i at times that are multiples of m. If m = 1, state i is said to be aperiodic. A Markov chain as a whole is aperiodic if all of its states are aperiodic. In graph-theoretic terms, this means that the graph of the chain is not k-partite for any k > 1. Reversible chains are also an interesting special case: if a chain is reversible, it can't have a period greater than 2, since we can always step off a node and step back.
If our Markov chain is not aperiodic, we can make it aperiodic by flipping a coin at each step to decide whether to move or not. This gives a lazy Markov chain whose transition probabilities are given by ½p_{ij} when i ≠ j and ½ + ½p_{ii} when i = j. This doesn't affect the stationary distribution: if we replace our transition matrix P with a new transition matrix (P + I)/2, and πP = π, then
\[
\pi \left(\frac{P+I}{2}\right) = \frac{1}{2}\pi P + \frac{1}{2}\pi I = \frac{1}{2}\pi + \frac{1}{2}\pi = \pi.
\]
Unfortunately there is no quick fix for reducible Markov chains. But
since we will often be designing the Markov chains we will be working with,
we can just take care to make sure they are not reducible.
We will later need the following lemma about aperiodic Markov chains,
which is related to the Frobenius problem of finding the minimum value
that cannot be constructed using coins of given denominations:

Lemma 10.2.4. Let i be an aperiodic state of some Markov chain. Then there is a time t₀ such that p^t_{ii} ≠ 0 for all t ≥ t₀.

Proof. Let S = {t | p^t_{ii} ≠ 0}. Since gcd(S) = 1, there is a finite subset S′ of S such that gcd(S′) = 1. Write the elements of S′ as m_1, m_2, ..., m_k and let M = ∏_{j=1}^k m_j. From the extended Euclidean algorithm, there exist integer coefficients a_j with |a_j| ≤ M/m_j such that ∑_{j=1}^k a_j m_j = 1. We would like to use each a_j as the number of times to go around the length-m_j loop from i to i. Unfortunately many of these a_j will be negative.
To solve this problem, we replace a_j with b_j = a_j + M/m_j. This makes all the coefficients non-negative, and gives ∑_{j=1}^k b_j m_j = kM + 1. This implies that there is a sequence of loops that gets us from i back to i in kM + 1 steps, or in other words that p^{kM+1}_{ii} ≠ 0. By repeating this sequence ℓ times, we can similarly show that p^{ℓkM+ℓ}_{ii} ≠ 0 for any ℓ.
We can also pad any of these sequences out by as many copies of M as we like. In particular, given ℓkM + ℓ, where ℓ ∈ {0, ..., M − 1}, we can add (kM − ℓ)·kM to it to get (kM)² + ℓ. This means that we can express any t ∈ {(kM)², ..., (kM)² + M − 1} as a sum of elements of S, or equivalently that p^t_{ii} ≠ 0 for any such t. But for larger t, we can just add in more copies of M. So in fact p^t_{ii} ≠ 0 for all t ≥ t₀ = (kM)².

10.2.6 Convergence of finite irreducible aperiodic Markov chains

We can now show:

Theorem 10.2.5. Any finite irreducible aperiodic Markov chain converges to a unique stationary distribution in the limit.

Proof. Consider two copies of the chain {X_t} and {Y_t}, where X_0 starts in some arbitrary distribution x and Y_0 starts in a stationary distribution π. Define a coupling between {X_t} and {Y_t} by the rule: (a) if X_t ≠ Y_t, then Pr[X_{t+1} = j ∧ Y_{t+1} = j′ | X_t = i ∧ Y_t = i′] = p_{ij} p_{i′j′}; and (b) if X_t = Y_t, then Pr[X_{t+1} = Y_{t+1} = j | X_t = Y_t = i] = p_{ij}. Intuitively, we let both chains run independently until they collide, after which we run them together. Since each chain individually moves from state i to state j with probability p_{ij} in either case, we have that X_t evolves normally and Y_t remains in the stationary distribution.
Now let us show that d_{TV}(xP^t, π) ≤ Pr[X_t ≠ Y_t] goes to zero in the limit. Pick some state i. Let r be the maximum over all states j of the first passage time f_{ji}, where f_{ji} is the minimum time t such that p^t_{ji} ≠ 0. Let s be a time such that p^t_{ii} ≠ 0 for all t ≥ s (the existence of such an s is given by Lemma 10.2.4).
Suppose that at time ℓ(r + s), where ℓ ∈ N, we have X_{ℓ(r+s)} = j ≠ j′ = Y_{ℓ(r+s)}. Then there are times ℓ(r + s) + u and ℓ(r + s) + u′, where u, u′ ≤ r, such that X reaches i at time ℓ(r + s) + u and Y reaches i at time ℓ(r + s) + u′ with nonzero probability. Since r + s − u ≥ s, having reached i at these times, X and Y both return to i at time ℓ(r + s) + (r + s) = (ℓ + 1)(r + s) with nonzero probability. Let ε > 0 be the product of these nonzero probabilities; then Pr[X_{(ℓ+1)(r+s)} ≠ Y_{(ℓ+1)(r+s)}] ≤ (1 − ε) Pr[X_{ℓ(r+s)} ≠ Y_{ℓ(r+s)}], and in general we have Pr[X_t ≠ Y_t] ≤ (1 − ε)^{⌊t/(r+s)⌋}, which goes to zero in the limit. This implies that d_{TV}(xP^t, π) also goes to zero in the limit (using the Coupling Lemma), and since any initial distribution (including a stationary distribution) converges to π, π is the unique stationary distribution as claimed.
This argument requires that the chain be finite, because otherwise we


cannot take the maximum over all the first passage times. For infinite Markov
chains, it is not always enough to be irreducible and aperiodic to converge
to a stationary distribution (or even to have a stationary distribution at all).
However, with some additional conditions a similar result can be shown: see
for example [GS01, §6.4].

10.3 Reversible chains


A Markov chain with transition probabilities p_{ij} is reversible if there is a distribution π such that, for all i and j,
\[
\pi_i p_{ij} = \pi_j p_{ji}. \tag{10.3.1}
\]
These are called the detailed balance equations. They say that in the stationary distribution, the probability of seeing a transition from i to j is equal to the probability of seeing a transition from j to i. If this is the case, then ∑_i π_i p_{ij} = ∑_i π_j p_{ji} = π_j, which means that π is stationary.
It's worth noting that this works for countable chains even if they are not finite, because the sums always converge: each term is non-negative and ∑_i π_i p_{ij} is dominated by ∑_i π_i = 1. However, it may not be the case for any particular p that there exists a corresponding stationary distribution π. If this happens, the chain is not reversible.

10.3.1 Stationary distributions


The detailed balance equations often give a very quick way to compute the stationary distribution, since if we know π_i, and p_{ij} ≠ 0, then π_j = π_i p_{ij}/p_{ji}. If the transition probabilities are reasonably well-behaved (for example, if p_{ij} = p_{ji} for all i, j), we may even be able to characterize the stationary distribution up to a constant multiple even if we have no way to efficiently enumerate all the states of the process.
What (10.3.1) says is that if we start in the stationary distribution and observe either a forward transition ⟨X_0, X_1⟩ or a backward transition ⟨X_1, X_0⟩, we can't tell which is which; Pr[X_0 = i ∧ X_1 = j] = Pr[X_1 = i ∧ X_0 = j]. This extends to longer sequences. The probability that X_0 = i, X_1 = j, and X_2 = k is given by π_i p_{ij} p_{jk} = p_{ji} π_j p_{jk} = p_{ji} p_{kj} π_k, which is the probability that X_0 = k, X_1 = j, and X_2 = i. (A similar argument works for finite sequences of any length.) So a reversible Markov chain is one with no arrow of time in the stationary distribution.
A typical reversible Markov chain is a random walk on a graph, where a step starting from a vertex u goes to one of its neighbors v, with each neighbor chosen with probability 1/d(u). This has a stationary distribution
\[
\pi_u = \frac{d(u)}{\sum_v d(v)} = \frac{d(u)}{2|E|},
\]
which satisfies
\[
\pi_u p_{uv} = \frac{d(u)}{2|E|} \cdot \frac{1}{d(u)} = \frac{1}{2|E|} = \frac{d(v)}{2|E|} \cdot \frac{1}{d(v)} = \pi_v p_{vu}.
\]
If we don't know π in advance, we can often guess it by observing that π_i p_{ij} = π_j p_{ji} implies
\[
\pi_j = \pi_i \frac{p_{ij}}{p_{ji}}, \tag{10.3.2}
\]
provided p_{ij} ≠ 0. This gives us the ability to calculate π_k starting from any initial state i as long as there is some chain of transitions i = i_0 → i_1 → i_2 → ... → i_ℓ = k where each step i_m → i_{m+1} has p_{i_m,i_{m+1}} ≠ 0. For a random walk on a graph, this implies that π is unique as long as the graph is connected. This of course only works for reversible chains; if we try to do this with a non-reversible chain, we are likely to get a contradiction.
For example, if we consider a biased random walk on the n-cycle, which moves from i to (i + 1) mod n with probability p and in the other direction with probability q = 1 − p, then applying (10.3.2) repeatedly would give π_i = π_0 (p/q)^i. This is not a problem when p = q = 1/2, since we get π_i = π_0 for all i and can deduce that π_i = 1/n is the unique stationary distribution. But if we try it for p = 2/3, then we get π_i = π_0 2^i, which is fine up until we hit π_0 = π_n = π_0 2^n. So for p ≠ q, this process is not reversible, which is not surprising if we realize that the n = 60, p = 1 case describes precisely the movement of the second hand on a clock.⁸

⁸It happens to be the case that π_i = 1/n is a stationary distribution for any value of p; we just can't prove this using (10.3.1).
Reversible chains have a special role in Markov chain theory, because


some techniques for proving convergence only apply to reversible chains (see
§10.6). They are also handy for sampling from large, irregular state spaces,
because by tuning the transition probabilities locally we can often adjust
the relative likelihoods of states in the stationary distribution to be closer to
what we want (see §10.3.4).

10.3.2 Examples

Random walk on a weighted graph Here each edge has a weight w_{uv}, where 0 < w_{uv} = w_{vu} < ∞, with self-loops permitted. A step of the random walk goes from u to v with probability w_{uv}/∑_{v′} w_{uv′}. It is easy to show that this random walk has stationary distribution
\[
\pi_u = \frac{\sum_v w_{uv}}{\sum_u \sum_v w_{uv}},
\]
generalizing the previous case, and that the resulting Markov chain satisfies the detailed balance equations.

Random walk with uniform stationary distribution Now let Δ be the maximum degree of the graph, and traverse each edge with probability 1/Δ, staying put on each vertex u with probability 1 − d(u)/Δ. The stationary distribution is uniform, since for each pair of vertices u and v we have p_{uv} = p_{vu} = 1/Δ if u and v are adjacent and 0 otherwise. This is a special case of the Metropolis-Hastings algorithm (see §10.3.4).

10.3.3 Time-reversed chains


Another way to get a reversible chain is to take an arbitrary chain with a
stationary distribution and rearrange it so that it can run both forwards
and backwards in time. This is not necessarily useful, but it illustrates the
connection between reversible and irreversible chains.
Given a finite Markov chain with transition matrix P and stationary
distribution π, define the corresponding time-reversed chain with matrix
P ∗ where πi pij = πj p∗ji .
To make sure that this actually works, we need to verify that:

1. The matrix P* is stochastic:
\[
\sum_j p^*_{ij} = \sum_j p_{ji} \pi_j / \pi_i = \pi_i/\pi_i = 1.
\]
2. The reversed chain has the same stationary distribution as the original chain:
\[
\sum_j \pi_j p^*_{ji} = \sum_j \pi_i p_{ij} = \pi_i.
\]

3. And that in general P*'s paths starting from the stationary distribution are a reverse of P's paths starting from the same distribution. For length-1 paths, this is just π_j p*_{ji} = π_i p_{ij}. For longer paths, this follows from an argument similar to that for reversible chains.

This gives an alternate definition of a reversible chain as a chain for which


P = P ∗.
We can also use time-reversal to generate reversible chains from arbitrary
chains. The chain with transition matrix (P + P ∗ )/2 (corresponding to
moving 1 step forward or back with equal probability at each step) is always
a reversible chain.
Examples:

• Given a biased random walk on a cycle that moves right with probability p and left with probability q, its time-reversal is the walk that moves left with probability p and right with probability q. (Here the fact that the stationary distribution is uniform makes things simple.) The average of this chain with its time-reversal is an unbiased random walk.

• Given the random walk defined by X_{t+1} = X_t + 1 with probability 1/2 and 0 with probability 1/2, we have π_i = 2^{−i−1}. This is not reversible (there is a transition from 1 to 2 but none from 2 to 1), but we can reverse it by setting p*_{ij} = 1 for i = j + 1 and p*_{0i} = 2^{−i−1}. (Check: π_i p_{i,i+1} = 2^{−i−1}(1/2) = π_{i+1} p*_{i+1,i} = 2^{−i−2}(1); π_i p_{i0} = 2^{−i−1}(1/2) = π_0 p*_{0i} = (1/2)2^{−i−1}.)

These examples work because the original chains are simple and have
clean stationary distributions. Reversed versions of chains with messier
stationary distributions are usually messier. In practice, building reversible
chains using time-reversal is often painful precisely because we don’t have
a good characterization of the stationary distribution of the original non-
reversible chain. So we will often design our chains to be reversible from the
start rather than relying on after-the-fact flipping.
10.3.4 Adjusting stationary distributions with the Metropolis-Hastings algorithm
Sometimes we have a reversible Markov chain, but it doesn't have the stationary distribution we want. The Metropolis-Hastings algorithm [MRR+53, Has70] (sometimes just called Metropolis) lets us start with a reversible Markov chain P with a known stationary distribution π and convert it to a chain Q on the same states with a different stationary distribution μ, where μ_i = f(i)/∑_j f(j) is proportional to some function f ≥ 0 on states that we can compute easily.


A typical application is that we want to sample according to Pr [i | A],
but A is highly improbable (so we can’t just use rejection sampling, where
we sample random points from the original distribution until we find one for
which A holds), and Pr [i | A] is easy to compute for any fixed i but tricky
to compute for arbitrary events (so we can’t use divide-and-conquer). If we
let f (i) ∝ Pr [i | A], then Metropolis-Hastings will do exactly what we want,
assuming it converges in a reasonable amount of time.
Let q be the transition probability for Q. Define, for i ≠ j,
\[
q_{ij} = p_{ij} \min\left(1, \frac{\pi_i f(j)}{\pi_j f(i)}\right) = p_{ij} \min\left(1, \frac{\pi_i \mu_j}{\pi_j \mu_i}\right),
\]
and let q_{ii} be whatever probability is left over. Now consider two states i and j, and suppose that π_i f(j) ≥ π_j f(i). Then
\[
q_{ij} = p_{ij},
\]
which gives
\[
\mu_i q_{ij} = \mu_i p_{ij},
\]
while
\begin{align*}
\mu_j q_{ji} &= \mu_j p_{ji} (\pi_j \mu_i / \pi_i \mu_j) \\
&= p_{ji} (\pi_j \mu_i / \pi_i) \\
&= \mu_i (p_{ji} \pi_j) / \pi_i \\
&= \mu_i (p_{ij} \pi_i) / \pi_i \\
&= \mu_i p_{ij}
\end{align*}
(note the use of reversibility of P in the second-to-last step). So we have μ_j q_{ji} = μ_i p_{ij} = μ_i q_{ij}, and Q is a reversible Markov chain with stationary distribution μ.
We can simplify this when our underlying chain P has a uniform stationary distribution (for example, when it's the random walk on a graph with maximum degree Δ, where we traverse each edge with probability 1/Δ). Then we have π_i = π_j for all i, j, so the new transition probabilities q_{ij} are just (1/Δ) min(1, f(j)/f(i)). Most of our examples of reversible chains will be instances of this case.
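Here is a sketch of this uniform-base-chain case in Python (an illustration; the target f, the cycle example, and all names are my own choices). Taking Δ strictly larger than the actual maximum degree just adds laziness, which keeps the base chain aperiodic without changing its uniform stationary distribution.

    import random

    def metropolis_step(state, neighbors, f, delta):
        # One step of Metropolis-Hastings over the lazy max-degree walk:
        # propose each neighbor with probability 1/delta, otherwise stay;
        # accept a proposed j with probability min(1, f(j)/f(i)).
        nbrs = neighbors(state)
        if random.random() >= len(nbrs) / delta:
            return state                  # lazy self-loop
        j = random.choice(nbrs)           # each neighbor with probability 1/delta
        if random.random() < min(1.0, f(j) / f(state)):
            return j
        return state

    def cycle_neighbors(i, m=10):
        return [(i - 1) % m, (i + 1) % m]

    f = lambda i: i + 1                   # target: pi_i proportional to i + 1
    x, counts = 0, [0] * 10
    for _ in range(200_000):
        x = metropolis_step(x, cycle_neighbors, f, delta=4)
        counts[x] += 1
    print([c / 200_000 for c in counts])  # should approach (i+1)/55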

10.4 The coupling method


In order to use the stationary distribution of a Markov chain to do sampling,
we need to have a bound on the rate of convergence to tell us when it is safe
to take a sample. There are two standard techniques for doing this: coupling,
where we show that a copy of the process starting in an arbitrary state can
be made to converge to a copy starting in the stationary distribution; and
spectral methods, where we bound the rate of convergence by looking at the
second-largest eigenvalue of the transition matrix. We’ll start with coupling
because it requires less development.
(See also [Gur00] for a survey of the relationship between the various
methods.)
Note: these notes will be somewhat sketchy. If you want to read more about coupling, a good place to start might be Chapter 12 of [MU17]; Chapter 4-3 (http://www.stat.berkeley.edu/~aldous/RWG/Chap4-3.pdf) of the unpublished but nonetheless famous Aldous-Fill manuscript (http://www.stat.berkeley.edu/~aldous/RWG/book.html, [AF01]), which is a good place to learn about Markov chains and Markov chain Monte Carlo methods in general; or even an entire book [Lin92]. We'll mostly be using examples from the Aldous-Fill text.

10.4.1 Random walk on a cycle


Let’s suppose we do a random walk on Zm , where to avoid periodicity at
each step we stay put with probability 1/2, move counterclockwise with
probability 1/4, and move clockwise with probability 1/4: in other words, we
are doing a lazy unbiased random walk. What’s a good choice for a coupling
to show this process converges quickly?
Specifically, we need to create a joint process (Xt , Yt ), where each of the
marginal processes X and Y looks like a lazy random walk on Zm , X0 has
whatever distribution our real process starts with, and Y0 has the stationary
distribution. Our goal is to structure the combined process so that Xt = Yt
as soon as possible.
Let Z_t = X_t − Y_t (mod m). If Z_t = 0, then X_t and Y_t have collided and we will move both together. If Z_t ≠ 0, then flip a coin to decide whether to move X_t or Y_t; whichever one moves then moves up or down with equal probability. It's not hard to see that this gives a probability of exactly 1/2 that X_{t+1} = X_t, 1/4 that X_{t+1} = X_t + 1, and 1/4 that X_{t+1} = X_t − 1, and similarly for Y_t. So the transition functions for X and Y individually are the same as for the original process.
Whichever way the first flip goes, we get Z_{t+1} = Z_t ± 1 with equal probability. So Z acts as an unbiased random walk on Z_m with an absorbing barrier at 0; this is equivalent to a random walk on 0 ... m with absorbing barriers at both endpoints. The expected time for this random walk to reach a barrier starting from an arbitrary initial state is at most m²/4, so if τ is the first time at which X_τ = Y_τ, we have E[τ] ≤ m²/4.⁹
Using Markov's inequality, after t = 2(m²/4) = m²/2 steps we have
\[
\Pr[X_t \ne Y_t] = \Pr[\tau > m^2/2] \le \frac{E[\tau]}{m^2/2} \le 1/2.
\]
We can also iterate the whole argument, starting over in whatever state we are in at time t if we don't converge. This gives at most a 1/2 chance of not converging for each interval of m²/2 steps. So after αm²/2 steps we will have Pr[X_t ≠ Y_t] ≤ 2^{−α}. This gives t_mix(ε) ≤ ½m²⌈lg(1/ε)⌉, where as before t_mix(ε) is the time needed to make d_{TV}(X_t, π) ≤ ε (see §10.2.3).

⁹If we know that Y_0 is uniform, then Z_0 is also uniform, and we can use this fact to get a slightly smaller bound on E[τ], around m²/6. But this will cause problems if we want to re-run the coupling starting from a state where X_t and Y_t have not yet converged.
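The coupling is easy to run in code. A Python sketch (an illustration; coupling_time is an invented name) that starts Y in the uniform (stationary) distribution and measures how long the two walks take to collide:

    import random

    def coupling_time(m, trials=2_000):
        # Coupled lazy walks on Z_m: while X != Y, a fair coin picks which
        # of the two walks moves, and the mover steps +1 or -1 uniformly.
        # Each walk individually stays put w.p. 1/2 and moves w.p. 1/4 each way.
        total = 0
        for _ in range(trials):
            x, y = 0, random.randrange(m)    # Y_0 is uniform, hence stationary
            t = 0
            while x != y:
                step = random.choice((-1, 1))
                if random.random() < 0.5:
                    x = (x + step) % m
                else:
                    y = (y + step) % m
                t += 1
            total += t
        return total / trials

    print(coupling_time(20), 20**2 / 4)      # observed mean vs. the m^2/4 bound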
The choice of 2 for the constant in Markov's inequality could be improved. The following lemma gives an optimized version of this argument:

Lemma 10.4.1. Let the expected coupling time, at which two coupled processes {X_t} and {Y_t} starting from an arbitrary state are first equal, be T. Then d_{TV}(X_t, Y_t) ≤ ε for t ≥ Te⌈ln(1/ε)⌉.

Proof. Essentially the same argument as above, but replacing 2 with a constant c to be determined. Suppose we restart the process every cT steps. Then at time t we have a total variation bounded by c^{−⌊t/cT⌋}. The expression c^{−t/cT} is minimized by minimizing c^{−1/c}, or equivalently −ln c/c, which occurs at c = e. This gives t_mix(ε) ≤ Te⌈ln(1/ε)⌉.

It's worth noting that the random walk example was very carefully rigged to make the coupling argument clean. A similar argument still works (perhaps with a change in the bound) for other irreducible aperiodic walks on the ring, but the details are messier.

10.4.2 Random walk on a hypercube


Start with a bit-vector of length n. At each step, choose an index uniformly at random, and set the value of the bit-vector at that index to 0 or 1 with equal probability. How long until we get a nearly-uniform distribution over all 2^n possible bit-vectors?
Here we apply the same transformation to both the X and Y vectors. It's easy to see that the two vectors will be equal once every index has been selected once. The waiting time for this to occur is just the waiting time nH_n for the coupon collector problem. We can either use this expected time directly to show that the process mixes in time O(n log n log(1/ε)) as above, or apply a sharp concentration bound to the coupon collector process. It is known (see [MU17, §5.4.1] or [MR95, §3.6.3]) that lim_{n→∞} Pr[T ≥ n(ln n + c)] = 1 − exp(−exp(−c)), so in the limit n ln n + n ln ln(1/(1 − ε)) = n ln n + O(n log(1/ε)) would seem to be enough. But this is a little tricky: we don't know from this bound alone how fast the probability converges as a function of n, so to do this right we need to look into the bound in more detail.
Fortunately, we are looking at large enough values of c that we can get a bound that is just as good using a simple argument. We have
\[
\Pr[\text{there is an empty bin after } t \text{ insertions}] \le n\left(1 - \frac{1}{n}\right)^t \le ne^{-t/n},
\]
and setting t = n ln n + cn gives a bound of e^{−c}. We can then set c = ln(1/ε) to get Pr[X_t ≠ Y_t] ≤ ε at time n ln n + n ln(1/ε).
We can improve the bound slightly by observing that, on average, half the bits in X_0 and Y_0 are already equal; doing this right involves summing over many cases, so we won't do it.
This is an example of a Markov chain with the rapid mixing property: the mixing time is polylogarithmic in the number of states (2^n in this case) and 1/ε. For comparison, the random walk on the ring is not rapid mixing, because the coupling time is polynomial in n = m rather than log n.
10.4.3 Various shuffling algorithms


Here we have a deck of n cards, and we repeatedly apply some random
transformation to the deck to converge to a stationary distribution that
is uniform over all permutations of the cards (usually this is obvious by
symmetry, so we won’t bother proving it). Our goal is to show that the
expected coupling time at which our deck ends up in the same permutation
as an initially-stationary deck is small. Typically we do this by linking
identical cards on each side of the coupling, so that each linked pair moves
through the same positions with each step. When all n cards are linked, we
will have achieved X = Y .
Lest somebody try implementing one of these shuffling algorithms, it’s
probably worth mentioning that they are all terrible. If you actually want to
shuffle an array of values, the usual approach is to swap an element chosen
uniformly at random into the first position, then shuffle the remaining n − 1
positions recursively. This gives a uniform shuffle in O(n) steps.
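For reference, this O(n) shuffle (usually attributed to Fisher and Yates) looks like the following in Python, with the recursion written as a loop (my sketch):

    import random

    def shuffle_in_place(a):
        # Swap a uniformly chosen remaining element into position i,
        # then continue with positions i+1 .. n-1.
        n = len(a)
        for i in range(n - 1):
            j = random.randrange(i, n)   # uniform over positions i..n-1
            a[i], a[j] = a[j], a[i]

    deck = list(range(10))
    shuffle_in_place(deck)
    print(deck)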

10.4.3.1 Move-to-top
This is a variant of card shuffling that is interesting mostly because it gives
about the easiest possible coupling argument. At each step, we choose one
of the cards uniformly at random (including the top card) and move it to
the top of the deck. How long until the deck is fully shuffled, i.e., until the
total variation distance between the actual distribution and the stationary
distribution is bounded by ε?
Here the trick is that when we choose a card to move to the top in the
X process, we choose the same card in the Y process. It’s not hard to see
that this links the two cards together so that they are always in the same
position in the deck in all future states. So to keep track of how well the
coupling is working, we just keep track of how many cards are linked in this
way, and observe that as soon as n − 1 are, the two decks are identical.
Note: Unlike some of the examples below, we don’t consider two cards to
be linked just because they are in the same position. We are only considering
cards that have gone through the top position in the deck (which corresponds
to some initial segment of the deck, viewed from above). The reason is that
these cards never become unlinked: if we pick two cards from the initial
segment, the cards above them move down together. But deeper cards that
happen to match might become separated if we pull a card from one deck
that is above the matched pair while its counterpart in the other deck is
below the matched pair.
Given k cards linked in this way, the probability that the next step links
another pair of cards is exactly (n − k)/n. So the expected time until we get
k + 1 cards is n/(n − k), and if we sum these waiting times for k = 0 . . . n − 1,
we get nHn , the waiting time for the coupon collector problem. So the bound
on the mixing time is the same as for the random walk on a hypercube.

10.4.3.2 Random exchange of arbitrary cards


Here we pick two cards uniformly and independently at random and swap
them. (Note there is a 1/n chance they are the same card; if we exclude this
case, the Markov chain has period 2.) To get a coupling, we reformulate this
process as picking a random card and a random location, and swapping the
chosen card with whatever is in the chosen location in both the X and Y
processes.
First let's observe that the number of linked cards never decreases. Let x_i, y_i be the position of card i in each process, and suppose x_i = y_i. If neither card i nor position x_i is picked, i doesn't move, and so it stays linked. If card i is picked, then both copies are moved to the same location; it stays linked. If position x_i is picked, then it may be that i becomes unlinked; but this only happens if the card j that is picked has x_j ≠ y_j. In this case j becomes linked, and the number of linked cards doesn't drop.
Now we need to know how likely it is that we go from k to k + 1 linked cards. We've already seen a case where the number of linked cards increases: we pick two cards that aren't linked and a location that contains cards that aren't linked. The probability of doing this is ((n − k)/n)², so our total expected waiting time is ∑_k n²(n − k)^{−2} = n² ∑_k k^{−2} ≤ n²π²/6. The final bound is O(n² log(1/ε)).


This bound is much worse that the bound for move-to-top, which is sur-
prising. In fact, the real bound is O(n log n) with high probability, although
the proof uses very different methods (see http://www.stat.berkeley.edu/
~aldous/RWG/Chap7.pdf). This shows that the coupling method doesn’t
always give tight bounds (or perhaps we need a better coupling?).

10.4.3.3 Random exchange of adjacent cards


Suppose now that we only swap adjacent cards. Specifically, we choose one of the n positions i in the deck uniformly at random, and then swap the cards at positions i and i + 1 (mod n) with probability 1/2. (The 1/2 is there for the usual reason of avoiding periodicity.)
So now we want a coupling between the X and Y processes where each possible swap occurs with probability 1/(2n) on both sides, but somehow we correlate things so that like cards are pushed together but never pulled apart. The trick is that we will use the same position i on both sides, but be sneaky about when we swap. In particular, we will aim to arrange things so that once some card is in the same position in both decks, both copies move together, but otherwise one copy changes its position by ±1 relative to the other with a fixed probability 1/(2n).
The coupled process works like this. Let D be the set of indices i where the same card appears in both decks at position i or at position i + 1. Then we do:

1. For i ∈ D, swap (i, i + 1) in both decks with probability 1/(2n).

2. For i ∉ D, swap (i, i + 1) in the X deck only with probability 1/(2n).

3. For i ∉ D, swap (i, i + 1) in the Y deck only with probability 1/(2n).

4. Do nothing with probability |D|/(2n).

It's worth checking that the total probability of all these events is |D|/2n + 2(n − |D|)/2n + |D|/2n = 1. More important is that if we consider only one of the decks, the probability of doing a swap at (i, i + 1) is exactly 1/(2n) (since we catch either case 1 or 2 for the X deck, or 1 or 3 for the Y deck).
Now suppose that some card c is at position x in X and y in Y. If x = y, then both x and x − 1 are in D, so the only way the card can move is if it moves in both decks: linked cards stay linked. If x ≠ y, then c moves in deck X or deck Y, but not both. (The only way it can move in both is in case 1, where i = x and i + 1 = y or vice versa; but in this case i can't be in D, since the copy of c at position x doesn't match whatever is in deck Y, and the copy at position y doesn't match what's in deck X.) In this case the distance x − y goes up or down by 1 with equal probability 1/(2n). Considering x − y (mod n), we have a "lazy" random walk that moves with probability 1/n, with absorbing barriers at 0 and n. The worst-case expected time to converge is n(n/2)² = n³/4, giving Pr[time for c to become linked ≥ αn³/2] ≤ 2^{−α} using the usual argument. Now apply the union bound to get Pr[time for every c to become linked ≥ αn³/2] ≤ n·2^{−α}, which gives an expected coupling time of O(n³ log n).


To simplify the argument, we assumed that the deck wraps around, so
that we can swap the first and last card in addition to swapping physically
adjacent cards. If we restrict to swapping i and i + 1 for i ∈ {0, . . . , n − 2},
we get a slightly different process that converges in essentially the same time
CHAPTER 10. MARKOV CHAINS 189

O(n3 log n). A result of David Bruce Wilson [Wil04]) shows both this upper
bound holds and that the bound is optimal up to a constant factor.

10.4.3.4 Real-world shuffling


In real life, the handful of people who still use physical playing cards tend to use a dovetail shuffle, which is closely approximated by the reverse of a process where each card in a deck is independently assigned to a left or right pile and the left pile is placed on top of the right pile. Coupling doesn't really help much here. Instead, the process can be analyzed using more sophisticated techniques due to Bayer and Diaconis [BD92]. The short version of the result is that Θ(log n) shuffles are needed to randomize a deck of size n.

10.4.4 Path coupling


If the states of our Markov process are the vertices of a graph, we may be
able to construct a coupling by considering a path between two vertices and
showing how to shorten this path on average at each step. This technique
is known as path coupling [BD97]. Typically, the graph we use will be the
graph of possible transitions of the underlying Markov chain (possibly after
making all edges undirected).
There are two ideas at work here. The first is that the expected distance
E [d(Xt , Yt )] between Xt and Yt in the graph gives an upper bound on
Pr [Xt ≠ Yt ] (by Markov's inequality, since if Xt ≠ Yt then d(Xt , Yt ) ≥ 1).
The second is that to show that E [d(Xt+1 , Yt+1 ) | Ft ] ≤ α · d(Xt , Yt ) for some
α < 1, it is enough to show how to contract a single edge, that is, to show
that E [d(Xt+1 , Yt+1 ) | d(Xt , Yt ) = 1] ≤ α. The reason is that if we have a
coupling that contracts one edge, we can apply this inductively along each
edge in the path to get a coupling between all the vertices in the path that
still leaves each pair with expected distance at most α. The result for the
whole path then follows from linearity of expectation.
Formally, instead of just looking at X^t and Y^t, consider a path of
intermediate states X^t = Z_0^t, Z_1^t, Z_2^t, . . . , Z_m^t = Y^t, where d(Z_i^t, Z_{i+1}^t) = 1 for
each i (the vertices are adjacent in the graph). We now construct a coupling
only for adjacent nodes that reduces their distance on average. The idea
is that d(X^t, Y^t) ≤ Σ_i d(Z_i^t, Z_{i+1}^t), so if the distance between each adjacent
pair shrinks on average, so does the total length of the path.
The coupling on each edge gives a joint conditional probability

    Pr[Z_i^{t+1} = z_i′, Z_{i+1}^{t+1} = z_{i+1}′ | Z_i^t = z_i, Z_{i+1}^t = z_{i+1}].

We can extract from this a conditional distribution on Z_{i+1}^{t+1} given the other
three variables:

    Pr[Z_{i+1}^{t+1} = z_{i+1}′ | Z_i^{t+1} = z_i′, Z_i^t = z_i, Z_{i+1}^t = z_{i+1}].
Multiplying these conditional probabilities together lets us compute a joint
distribution on (X^{t+1}, Y^{t+1}) conditioned on (X^t, Y^t), which is the ordinary
coupling we really want.
It’s worth noting that the path is entirely notional, and we don’t actually
keep it around between steps of the coupled Markov chains. The only
purpose of Z0 , Z1 , . . . , Zm is to show that X and Y move closer together.
Even though we could imagine that we are coalescing these nodes together
to create a new path at each step (or throwing in a few extra nodes if some
Zi , Zi+1 move away from each other), we are really computing a fresh path
between X and Y at each step and throwing it away as soon as it has done
its job.
10.4.4.1 Random walk on a hypercube
As a warm-up, let’s redo the argument about lazy random walks on a
hypercube from §10.4.2 using path coupling. Each state X t or Y t is an n-bit
vector, and with probability 1/2 a step flips one of the bits chosen uniformly
at random. The obvious metric d is Hamming distance: d(x, y) is the number
of indices i for which xi 6= yi .
For path coupling, we only need to push adjacent Z_i^t and Z_{i+1}^t. Adjacency
means that there is exactly one index j at which these two bit-vectors differ.
We apply the following coupling (which looks suspiciously like the more
generic coupling in §10.4.2):
1. Pick a random index r.

2. If r ≠ j, which occurs with probability 1 − 1/n, flip position r in both
Z_i^t and Z_{i+1}^t with probability 1/2. Whether we flip or not, we get
d(Z_i^{t+1}, Z_{i+1}^{t+1}) = 1.

3. If r = j, pick a new bit value b uniformly at random and set position
j in both Z_i^{t+1} and Z_{i+1}^{t+1} to b. From either side alone, this looks like
flipping bit j with probability 1/2, exactly as in the original process.
But now d(Z_i^{t+1}, Z_{i+1}^{t+1}) = 0.
Averaging over both cases gives E[d(Z_i^{t+1}, Z_{i+1}^{t+1}) | F_t] = 1 − 1/n. It follows
that E[d(X^t, Y^t) | F_0] ≤ (1 − 1/n)^t d(X^0, Y^0) ≤ ne^{−t/n}. Since d(X^t, Y^t)
is always at least 1 if it's not zero, this gives Pr[X^t ≠ Y^t] ≤ ne^{−t/n} by
Markov's inequality, so d_TV(X^t, Y^t) ≤ ε after O(n log(n/ε)) steps.
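The contraction is easy to check empirically. Below is a quick Python sketch (names ours) that applies the coupling above to one adjacent pair and estimates the expected Hamming distance after one step; it should print something close to 1 − 1/n.

    import random

    def coupled_step(z0, z1, n):
        """One step of the path coupling on two states at Hamming distance 1."""
        j = next(i for i in range(n) if z0[i] != z1[i])  # the differing index
        r = random.randrange(n)
        b = random.randrange(2)
        if r != j:
            if b:                      # flip position r in both, prob 1/2
                z0[r] ^= 1
                z1[r] ^= 1
        else:                          # set position j to the same random bit b
            z0[j] = b
            z1[j] = b

    def contraction_estimate(n=10, trials=100000):
        total = 0
        for _ in range(trials):
            z0 = [random.randrange(2) for _ in range(n)]
            z1 = z0[:]
            z1[random.randrange(n)] ^= 1
            coupled_step(z0, z1, n)
            total += sum(a != b for a, b in zip(z0, z1))
        return total / trials          # close to 1 - 1/n = 0.9 for n = 10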
10.4.4.2 Sampling graph colorings
For this example, we’ll look at sampling k-colorings of a graph with maximum
degree ∆. We will assume that k ≥ 2∆ + 1.10 For smaller values of k, it
might still be the case that the chain converges reasonably quickly for some
graphs, but our analysis will not show this.
Unlike the hypercube case, the path coupling we will construct might
allow the distance between X t and Y t to rise with some probability. But the
expected distance will always decrease, which is enough.
Consider the following chain on proper k-colorings of a graph with
maximum degree ∆. At each step, we choose one of the n nodes v, compute
the set S of colors not found on any of v’s neighbors, and recolor v with a
color chosen uniformly from S (which may be its original color).
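One step of this chain is only a few lines of code. Here is a Python sketch, with the graph given as an adjacency list (all names are ours); note that S is nonempty whenever k ≥ ∆ + 1.

    import random

    def recolor_step(coloring, graph, k):
        """One step of the recoloring chain: graph maps each node to its
        neighbor list, coloring maps each node to a color in range(k)."""
        v = random.choice(list(graph))                # uniform random node
        forbidden = {coloring[u] for u in graph[v]}   # colors on v's neighbors
        S = [c for c in range(k) if c not in forbidden]
        coloring[v] = random.choice(S)                # may equal the old color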
Suppose p_ij ≠ 0. Then i and j differ in at most one place v, and so the set
S of permitted colors for each process—those not found on v's neighbors—is
the same. This gives p_ij = p_ji = 1/(n|S|), and the detailed balance equations
(10.3.1) hold when π_i is constant for all i. So we have a reversible Markov
chain with a uniform stationary distribution. Now we will apply a path
coupling to show that we converge to this stationary distribution reasonably
quickly when k is large enough.
We’ll think of colorings as vectors. Given two colorings x and y, let d(x, y)
be the Hamming distance between them, which is the number of nodes u
for which xu 6= yu . To show convergence, we will construct a coupling that
shows that d(X t , Y t ) converges to 0 over time starting from arbitrary initial
points X 0 and Y 0 .
A complication is that it’s not immediately evident that the length of
the shortest path from X t to Y t in the transition graph of our Markov chain
is d(X t , Y t ). The problem is that it may not be possible to transform X t
into Y t one node at a time without producing improper colorings. With
enough colors, we can explicitly construct a short path between X t and Y t
that uses only proper colorings; but for this particular process it is easier to
simply extend the Markov chain to allow improper colorings, and show that
10 Jerrum [Jer95] provided the random walk we'll be using, and Bubley and Dyer [BD97]
used it as one of their examples. Mitzenmacher and Upfal [MU17, §12.5] give an analysis
based on a standard coupling for a different version of the Markov chain that works for the
same bound on k. More sophisticated results and a history of the problem can be found in
[DGM02].
our coupling works anyway. This also allows us to start with an improper
coloring for X 0 if we are particularly lazy. The stationary distribution is
not affected, because if i is a proper coloring and j is an improper coloring
that differs from i in exactly one place, we have p_ij = 0 and p_ji ≠ 0, so the
detailed balance equations hold with πj = 0.
The natural coupling to consider given adjacent X t and Y t is to pick the
same node and the same new color for both, provided we can do so. If we
pick the one node v on which they differ, and choose a color that is not used
by any neighbor (which will be the same for both copies of the process, since
all the neighbors have the same colors), then we get X t+1 = Y t+1 ; this event
occurs with probability at least 1/n. If we pick a node that is neither v nor
adjacent to it, then the distance between X and Y doesn’t change; either
both get a new identical color or both don’t.
Things get a little messier when we pick some node u adjacent to v, an
event that occurs with probability at most ∆/n. Let c be the color of v in
X^t, c′ the color of v in Y^t, and T the set of colors that do not appear among
the other neighbors of u. Let ℓ = |T| ≥ k − (∆ − 1).
Conditioning on choosing u to recolor, X^{t+1} picks a color uniformly from
T \ {c} and Y^{t+1} picks a color uniformly from T \ {c′}. We'd like these colors
to be the same if possible, but these are not the same sets, and they aren't
even necessarily the same size.
There are three cases:

1. Neither c nor c′ is in T. Then X^{t+1} and Y^{t+1} are choosing a new
color from the same set, and we can make both choose the same color:
the distance between X and Y is unchanged.

2. Exactly one of c and c′ is in T. Suppose that it's c. Then |T \ {c}| = ℓ − 1
and |T \ {c′}| = |T| = ℓ. Let X^{t+1} choose a new color c″ first. Then
let Y_u^{t+1} = c″ with probability (ℓ−1)/ℓ (this gives a probability of 1/ℓ of
picking each color in T \ {c}, which is what we want), and let Y_u^{t+1} = c
with probability 1/ℓ. Now the distance between X and Y increases with
probability 1/ℓ.

3. Both c and c′ are in T. For each c″ in T \ {c, c′}, let X_u^{t+1} = Y_u^{t+1} = c″
with probability 1/(ℓ−1); since there are ℓ − 2 such c″, this accounts for
(ℓ−2)/(ℓ−1) of the probability. Assign the remaining 1/(ℓ−1) to X_u^{t+1} = c′,
Y_u^{t+1} = c. In this case the distance between X and Y increases with
probability 1/(ℓ−1), making this the worst case.
Putting everything together, we have a 1/n chance of picking a node
that guarantees to reduce d(X, Y) by 1, and at most a ∆/n chance of
picking a node that may increase d(X, Y) by at most 1/(ℓ−1) on average, where
ℓ ≥ k − ∆ + 1, giving a maximum expected increase of (∆/n) · 1/(k−∆). So

    E[d(X^{t+1}, Y^{t+1}) − d(X^t, Y^t) | d(X^t, Y^t) = 1]
        ≤ −1/n + (∆/n) · 1/(k − ∆)
        = (1/n) (−1 + ∆/(k − ∆))
        = (1/n) · (−(k − ∆) + ∆)/(k − ∆)
        = −(1/n) · (k − 2∆)/(k − ∆).
So we get

    d_TV(X^t, Y^t) ≤ Pr[X^t ≠ Y^t]
                  ≤ (1 − (1/n) · (k − 2∆)/(k − ∆))^t · E[d(X^0, Y^0)]
                  ≤ exp(−(t/n) · (k − 2∆)/(k − ∆)) · n.

For fixed k and ∆ with k > 2∆, this is e^{−Θ(t/n)} · n, which will be less than
ε for t = Ω(n(log n + log(1/ε))).
10.4.4.3 Sampling independent sets
For a more complicated example of path coupling, let’s try sampling inde-
pendent sets of vertices on a graph G = (V, E) with n vertices and m edges.
If we can bias in favor of larger sets, we might even get a good independent
set approximation! The fact that this is hard will console us when we find
that it doesn’t work.
A natural way to set up the random walk is to represent each potentially
independent set as a bit vector, where 1 indicates membership in the set, and
at each step we pick one of the n bits uniformly at random and set it to 0 or
1 with probability 1/2 each, provided that the resulting set is independent.
If the resulting set is not independent, we stay put.11
It's easy to see that d(x, y) = ‖x − y‖_1 is a bound on the minimum
number of transitions to get from x to y, since we can always remove
all the extra ones from x and put back the extra ones in y while preserving
11 In statistical physics, this process of making a local change with probability proportional
to how much we like the result goes by the name of Glauber dynamics.
independence throughout the process. (This argument also shows that the
Markov chain is irreducible.) It may be hard to find the exact minimum
path length, so we’ll use this distance instead for our path coupling.
We can easily show that the stationary distribution of this process is
uniform. The essential idea is that if we can transform one independent set
S into another S′ by flipping a bit, then we can go back by flipping the bit
the other way. Since each transition happens with the same probability
1/n, we get π_S · (1/n) = π_{S′} · (1/n) and π_S = π_{S′}. Since we can apply this
equation along a path between any two states, all states must have the same
probability in the unique stationary distribution.
To prove convergence, it’s tempting to start with the obvious coupling,
even though it doesn’t actually work. Pick the same position and value
for both copies of the chain. If x and y are adjacent, then they coalesce
with probability 1/n (both probability 1/2n transitions are feasible for both
copies, since the neighboring nodes always have the same state). What is
the probability that they diverge? We can only be prevented from picking
a value if the value is 1 and some neighbor is 1. So the bad case is when
xi = 1, yi = 0, and we attempt to set some neighbor of i to 1; in the worst
case, this happens ∆/2n of the time, which is at least 1/n when ∆ ≥ 2. No
coalescence here!
This could be a sign that our random walk is no good, or it could be a
sign that our coupling is no good. But we can avoid figuring out which is
the case by using a sneakier random walk. We’ll adopt the approach used
in [MU17, §12.6].
Here the idea is that we pick a random edge uv, and then try to do one
of the following operations, all with equal probability:

1. Set u = v = 0.

2. Set u = 0 and v = 1.

3. Set u = 1 and v = 0.

In each case, if the result would be a non-independent set, we instead do
nothing.
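A sketch of one step of this edge-based walk in Python (representation and names ours); the final check undoes the move whenever it would put two adjacent vertices in the set.

    import random

    def edge_walk_step(in_set, edges, adj):
        """in_set maps each vertex to 0/1, edges is the edge list,
        adj maps each vertex to its neighbor list."""
        u, v = random.choice(edges)                 # uniform edge, prob 1/m
        new_u, new_v = [(0, 0), (0, 1), (1, 0)][random.randrange(3)]
        old_u, old_v = in_set[u], in_set[v]
        in_set[u], in_set[v] = new_u, new_v
        for w in (u, v):                            # revert non-independent results
            if in_set[w] and any(in_set[x] for x in adj[w]):
                in_set[u], in_set[v] = old_u, old_v
                return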
Verifying that this has a uniform stationary distribution is mildly painful
if we are not careful, since there may be several different transitions that move
from some state x to the same state y. But for each transition (occurring
with probability 1/3m), we can see that there is a reverse transition that occurs
with equal probability; so the detailed balance equations (10.3.1) hold with
uniform probabilities. Note that we can argue this even though we don’t
know what the actual stationary probabilities are, since we don’t know how
many independent sets our graph has.
So now what happens if we run two coupled copies of this process, where
the copies differ on exactly one vertex i?
First, every neighbor of i is 0 in both processes. A transition that doesn’t
involve any neighbors of i will have the same effect on both processes. So we
need to consider all choices of edges where one of the endpoints is either i or
a neighbor j of i. In the case where the other endpoint isn’t i, we’ll call it k;
there may be several such k.
If we choose ij and don't try to set j to one, we always coalesce the states.
This occurs with probability 2/3m. If we try to set i to zero and j to one, we
may fail in both processes, because j may have a neighbor k that is already
may fail in both processes, because j may have a neighbor k that is already
one; this will preserve the distance between the two processes. Similarly,
if we try to set j to one as part of a change to some jk, we will also get a
divergence between the two processes: in this case, the distance will actually
increase. This can only happen if j has at most one neighbor k (other than
i) that is already in the independent set; if there are two such k, then we
can’t set j to one no matter what the state of i is.
This argument suggests that we need to consider three cases for each j,
depending on the number s of nodes k ≠ i that are adjacent to j and have
xk = yk = 1. In each case, we assume xi = 0 and yi = 1, and that all other
nodes have the same value in both x and y. (Note that these assumptions
mean that any such k can’t be adjacent to i, because we have yk = yi = 1.)
• s = 0. Then if we choose ij, we can always set i and j however we
like, giving a net −1/m expected change to the distance. However, this
is compensated for by up to d − 1 attempts to set j = 1 and k = 0 for
some k, all of which fail in one copy of the process but succeed in the
other. Since k doesn't change, each of these failures adds only 1 to
the distance, which becomes at most (d − 1)/3m total. So our total expected
change in this case is at most (d − 4)/3m.

• s = 1. Here attempts to set i = 0 and j = 1 fail in both processes,
giving only a −2/3m expected change after picking ij. Any change to jk
fails only if we set j = 1, which we can only do in the x process and
only if we also set k = 0 for the unique k that is currently one. This
produces an increase in the distance of 2 with probability 1/3m, exactly
canceling out the decrease from picking ij. Total expected change is 0.

• s = 2. Now we can never set j = 1. So we get a −2/3m expected change
from choices of ij and no change in distance from updates to jk for any k ≠ i.
Considering all three cases, in the worst case we have E[d(X_{t+1}, Y_{t+1}) | X_t, Y_t] =
d(X_t, Y_t) + (∆ − 4)/3m. For ∆ ≤ 3 (a pretty restrictive case), the expected change
per step is at most −1/3m, giving a decent enough expected coupling time of O(m log n).
Here, we’ve considered the case where all independent sets have the
same probability. One can also bias the random walk in favor of larger
independent sets by accepting increases with higher probability than decreases
(as in Metropolis-Hastings); this samples independent sets of size s with
probability proportional to λs . Some early examples of this approach are
given in [LV97, LV99, DG00]. The question of exactly which values of λ give
polynomial convergence times is still open; see [MWW07] for some more
recent bounds.
10.4.4.4 Simulated annealing
Recall that the Metropolis-Hastings algorithm constructs a reversible Markov
chain with a desired stationary distribution from any reversible Markov chain
on the same states (see §10.3.4 for details.)
A variant, which generally involves tinkering with the chain while it’s
running, is the global optimization heuristic known as simulated anneal-
ing [KGV83]. Here we have some function g that we are trying to minimize.
So we set f (i) = exp(−αg(i)) for some α > 0. Running Metropolis-Hastings
gives a stationary distribution that is exponentially weighted to small values
of g; if i is the global minimum and j is some state with high g(j), then
π(i) = π(j) exp(α(g(j) − g(i))), which for large enough α goes a long way
towards compensating for the fact that in most problems there are likely to
be exponentially more bad j’s than good i’s. The problem is that the same
analysis applies if i is a local minimum and j is on the boundary of some
depression around i; large α means that it is exponentially unlikely that we
escape this depression and find the global minimum.
The simulated annealing hack is to vary α over time; initially, we set α
small, so that we get mixing time close to that of the original Markov chain.
This gives us a sample that is roughly uniform, with a small bias towards
states with smaller g(i). After some time we increase α to force the process
into better states. The hope is that by increasing α slowly, by the time we
are stuck in some depression, it’s a deep one—optimal or close to it. If it
doesn’t work, we can randomly restart and/or decrease α repeatedly to jog
the chain out of whatever depression it is stuck in. How to do this effectively
is deep voodoo that depends on the structure of the underlying chain and the
shape of g(i), so most of the time people who use simulated annealing just
try it out with some generic annealing schedule and hope it gives some
useful result. (This is what makes it a heuristic rather than an algorithm.
Its continued survival is a sign that it does work at least sometimes.)
Here are toy examples of simulated annealing with provable convergence
times.
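For concreteness, here is what a bare-bones annealing loop might look like as a Python sketch. The neighbor function, the starting value of α, and the simple geometric schedule are all arbitrary choices on our part, not recommendations.

    import math
    import random

    def anneal(g, start, neighbors, steps, alpha=0.1, growth=1.001):
        """Minimize g by Metropolis sampling from f(i) = exp(-alpha * g(i)),
        slowly increasing alpha as we go."""
        x = start
        best = x
        for _ in range(steps):
            y = random.choice(neighbors(x))
            dg = g(y) - g(x)
            # Metropolis rule: accept downhill always, uphill w.p. e^{-alpha*dg}
            if dg <= 0 or random.random() < math.exp(-alpha * dg):
                x = y
            if g(x) < g(best):
                best = x
            alpha *= growth            # the annealing schedule
        return best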
Single peak Let's suppose x is a random walk on an n-dimensional hypercube
(i.e., n-bit vectors where we set 1 bit at a time), g(x) = |x|, and we
want to maximize g. Now a transition that increases |x| is accepted always
and a transition that decreases |x| is accepted only with probability e^{−α}.
For large enough α, this puts a constant fraction of π on the single peak at
x = 1; the observation is that there are only (n choose k) ≤ n^k points with k
zeros, so the total weight of all points is at most π(1) Σ_{k≥0} n^k exp(−αk) =
π(1) Σ_k exp((ln n − α)k) = π(1)/(1 − exp(ln n − α)) = π(1) · O(1) when α > ln n,
giving π(1) = Ω(1) in this case.
So what happens with convergence? Let p = exp(−α). Let’s try doing
a path coupling between two adjacent copies x and y of the Metropolis-
Hastings process, where we first pick a bit to change, then pick a value to
assign to it, accepting the change in both processes if we can. The expected
change in |x − y| is then (1/2n)(−1 − p), since if we pick the bit where x
and y differ, we have probability 1/2n of setting both to 1 and probability
p/2n of setting both to 0, and if we pick any other bit, we get the same
distribution of outcomes in both processes. This gives a general bound of
E [|Xt+1 − Yt+1 | | |Xt − Yt |] ≤ (1 − (1 + p)/2n)|Xt − Yt |, from which we have
E [|Xt − Yt |] ≤ exp(−t(1 + p)/2n) E [|X0 − Y0 |] ≤ n exp(−t(1 + p)/2n). So
after t = (2n/(1 + p)) ln(n/ε) steps, we expect to converge to within ε of the
stationary distribution in total variation distance. This gives an O(n log n)
algorithm for finding the peak.
This is kind of a silly example, but if we suppose that g is better disguised
(for example, g(x) could be |x ⊕ r| where r is a random bit vector), then we
wouldn’t really expect to do much better than O(n). So O(n log n) is not
bad for an algorithm with no brains at all.
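Here is the walk as a Python sketch, with the peak disguised by a random mask r as suggested; the function names and the particular choice α = 2 ln n are ours, not anything prescribed by the analysis.

    import math
    import random

    def climb(n, steps, alpha):
        """Metropolis walk on n-bit strings maximizing g(x) = |x XOR r|."""
        r = random.getrandbits(n)
        x = random.getrandbits(n)
        p = math.exp(-alpha)
        g = lambda z: bin(z ^ r).count("1")
        for _ in range(steps):
            y = x ^ (1 << random.randrange(n))   # flip one random bit
            if g(y) > g(x) or random.random() < p:
                x = y                            # accept increases always
        return g(x)                              # n once we reach the peak

    # e.g. climb(64, steps=10000, alpha=2 * math.log(64)) usually returns 64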

Somewhat smooth functions Now we'll let g : 2^n → ℕ be some arbitrary
Lipschitz function, that is, a function with the property that |g(x) − g(y)| ≤
|x − y|, and ask for what values of p = e^{−α} the Metropolis-Hastings walk
with f(i) = e^{−αg(i)} can be shown to converge quickly. Given adjacent states
x and y, with x_i ≠ y_i but x_j = y_j for all j ≠ i, we still have a probability
of at least (1 + p)/2n of coalescing the states by setting x_i = y_i. But now
there is a possibility that if we try to move to (x[j/b], y[j/b]) for some j and
b, that x rejects while y does not or vice versa (note if x_j = y_j = b, we don't
move in either copy of the process). Conditioning on j and b, this occurs
with probability 1 − p precisely when x[j/b] < x and y[j/b] ≥ y or vice versa,
giving an expected increase in |x − y| of (1 − p)/2n. We still get an expected
net change of −2p/2n = −p/n provided there is only one choice of j and b
for which this occurs. So we converge in time τ(ε) ≤ (n/p) log(n/ε) in this
case.12
One way to think of this is that the shape of the neighborhoods of nearby
points is similar. If I go up in a particular direction from point x, it’s very
likely that I go up in the same direction from some neighbor y of x.
If there are more bad choices for j and b, then we need a much larger
value of p: the expected net change is now (k(1 − p) − (1 + p))/2n =
(k − 1 − (k + 1)p)/2n, which is only negative if p > (k − 1)/(k + 1). This
gives much weaker pressure towards large values of g, which still tends to
put us in high neighborhoods but creates the temptation to fiddle with α to
try to push us even higher once we think we are close to a peak.

10.5 Spectral methods for reversible chains
(See also http://www.stat.berkeley.edu/~aldous/RWG/Chap3.pdf, from
which many of the details in the notes below are taken.)
The problem with coupling is that (a) it requires cleverness to come up
with a good coupling; and (b) in many cases, even that doesn’t work—there
are problems for which no coupling that only depends on current and past
transitions coalesces in a reasonable amount of time.13 When we run into
these problems, we may be able to show convergence instead using a linear-
algebraic approach, where we look at the eigenvalues of the transition matrix
of our Markov chain. This approach works best for reversible Markov chains,
where π_i p_ij = π_j p_ji for all states i and j and some distribution π.

10.5.1 Spectral properties of a reversible chain
Suppose that P is the transition matrix of an irreducible, reversible Markov
chain. Then it has a unique stationary distribution π that is a left eigen-
vector corresponding to the eigenvalue 1, which just means that πP = 1π.
12 You might reasonably ask if such functions g exist. One example is g(x) = (x_1 ⊕ x_2) + Σ_{i>2} x_i.
13 A coupling that doesn't require looking into the future is called a causal coupling.
An example of a Markov chain for which causal couplings are known not to work is the one
used for sampling perfect matchings in bipartite graphs as described in §10.6.4.4 [KR99].
For irreducible aperiodic chains, P will have a total of n real eigenvalues
λ_1 > λ_2 ≥ λ_3 ≥ . . . ≥ λ_n, where λ_1 = 1 and λ_n > −1. (This follows for
symmetric chains from the Perron-Frobenius theorem, which we will not
attempt to prove here; for general irreducible aperiodic reversible chains, we
will show below that we can reduce to the symmetric case.) Each eigenvalue λi
has a corresponding eigenvector ui , a nonzero vector that satisfies ui P = λi ui .
In Markov chain terms, these eigenvectors correspond to deviations from the
stationary distribution that shrink over time.
For example, the transition matrix

    S = [ p  q ]
        [ q  p ]

corresponding to a Markov chain on two states that stays in the same
state with probability p and switches to the other state with probability
q = 1 − p has eigenvectors u^1 = (1, 1) and u^2 = (1, −1), with corresponding
eigenvalues λ_1 = 1 and λ_2 = p − q, as shown by computing

    (1, 1) S = (p + q, q + p) = 1 · (1, 1)

and

    (1, −1) S = (p − q, q − p) = (p − q) · (1, −1).
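Such small examples are easy to check numerically; for instance, a sketch using numpy, taking p = 0.7:

    import numpy as np

    p, q = 0.7, 0.3
    S = np.array([[p, q],
                  [q, p]])

    # left eigenvectors of S are right eigenvectors of S.T
    values, vectors = np.linalg.eig(S.T)
    print(values)     # 1.0 and 0.4 = p - q, in some order
    print(vectors)    # columns proportional to (1, 1) and (1, -1)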
10.5.2 Analysis of symmetric chains
To make our life easier, we will assume that in addition to being reversible,
our Markov chain has a uniform stationary distribution. Then πi = πj for all
i and j, and so reversibility implies pij = pji for all i and j as well, meaning
that the transition matrix P is symmetric. Symmetric matrices have the
very nice property that their eigenvectors are all orthogonal (this is the
spectral theorem), which allows for a very straightforward decomposition
of the distribution on the initial state (represented as a vector x0 ) as a linear
combination of the eigenvectors. For example, if initially we put all our
weight on state 1, we get x^0 = (1, 0) = ½ u^1 + ½ u^2.
If we now take a step in the chain, we multiply x by P . We can express
this as

    xP = (½ u^1 + ½ u^2) P
       = ½ u^1 P + ½ u^2 P
       = ½ λ_1 u^1 + ½ λ_2 u^2.

This uses the defining property of eigenvectors, that u^i P = λ_i u^i.
In general, if x = Σ_i a_i u^i, then xP = Σ_i a_i λ_i u^i and xP^t = Σ_i a_i λ_i^t u^i. For
any eigenvalue λ_i with |λ_i| < 1, λ_i^t goes to zero in the limit. So only those
eigenvectors with λ_i = 1 survive. For irreducible aperiodic chains, these
consist only of the stationary distribution π = u^1/‖u^1‖_1. For reducible or
periodic chains, there may be some additional eigenvalues with |λ_i| = 1, but
the only possibility that arises for an irreducible reversible chain is λ_n = −1,
corresponding to a chain with period 2.14
Assuming that |λ2 | ≥ |λn |, as t grows large λt2 will dominate the other
smaller eigenvalues, and so the size of λ2 will control the rate of convergence
of the underlying Markov process.
This assumption is always true for lazy walks that stay put with prob-
ability 1/2, because all eigenvalues of a lazy walk are non-negative. The
reason is that any such walk has transition matrix ½(P + I), where I is the
identity matrix and P is the transition matrix of the unlazy version of the
walk. If xP = λx for some x and λ, then x(½(P + I)) = ½(λ + 1)x. This
means that x is still an eigenvector of the lazy walk, and its corresponding
eigenvalue is ½(λ + 1) ≥ ½((−1) + 1) ≥ 0.
But what does having small λ_2 mean for total variation distance? If
x^t is a vector representing the distribution of our position at time t, then
d_TV(x^t, π) = ½ Σ_{i=1}^n |x_i^t − π_i| = ½ ‖x^t − π‖_1. But we know that x^0 = π + Σ_{i=2}^n c_i u^i
for some coefficients c_i, and x^t = π + Σ_{i=2}^n λ_i^t c_i u^i. So we are looking for a
bound on ‖Σ_{i=2}^n λ_i^t c_i u^i‖_1.
It turns out that it is easier to get a bound on the ℓ_2 norm ‖Σ_{i=2}^n λ_i^t c_i u^i‖_2.
Here we use the fact that the eigenvectors are orthogonal. This means that
the Pythagorean theorem holds and ‖Σ_{i=2}^n λ_i^t c_i u^i‖_2^2 = Σ_{i=2}^n λ_i^{2t} c_i^2 ‖u^i‖_2^2 =
Σ_{i=2}^n λ_i^{2t} c_i^2, if we normalize each u^i so that ‖u^i‖_2^2 = 1.
14 Chains with periods greater than 2 (which are never reversible) have pairs of complex-valued
eigenvalues that are roots of unity, which happen to cancel out to only produce
real probabilities in vP^t. Chains that aren't irreducible will have one eigenvector with
eigenvalue 1 for each final component; the stationary distributions of these chains are linear
combinations of these eigenvectors (which are just the stationary distributions on each
component).
But then
    ‖x^t − π‖_2^2 = Σ_{i=2}^n λ_i^{2t} c_i^2
                 ≤ Σ_{i=2}^n λ_2^{2t} c_i^2
                 = λ_2^{2t} Σ_{i=2}^n c_i^2
                 = λ_2^{2t} ‖x^0 − π‖_2^2.

Now take the square root of both sides to get

    ‖x^t − π‖_2 ≤ λ_2^t ‖x^0 − π‖_2.
To translate this back into d_TV, we use the inequality

    ‖x‖_2 ≤ ‖x‖_1 ≤ √n · ‖x‖_2,

which holds for any n-dimensional vector x. Because ‖x^0‖_1 = ‖π‖_1 = 1,
‖x^0 − π‖_1 ≤ 2 by the triangle inequality, which also gives ‖x^0 − π‖_2 ≤ 2. So

    d_TV(x^t, π) = ½ ‖x^t − π‖_1
                ≤ (√n/2) · λ_2^t · ‖x^0 − π‖_2
                ≤ λ_2^t · √n.

If we want to get d_TV(x^t, π) ≤ ε, we will need t ln(λ_2) + ln √n ≤ ln ε, or
t ≥ (1/ln(1/λ_2)) (½ ln n + ln(1/ε)).
The factor 1/ln(1/λ_2) = −1/ln λ_2 can be approximated by τ_2 = 1/(1 − λ_2), which is
called the mixing rate or relaxation time of the Markov chain. Indeed,
our old friend 1 + x ≤ e^x implies ln(1 + x) ≤ x, which gives ln λ_2 ≤ λ_2 − 1
and thus −1/ln λ_2 ≤ 1/(1 − λ_2) = τ_2. So τ_2 (½ ln n + ln(1/ε)) gives a conservative
estimate on the time needed to achieve d_TV(x^t, π) ≤ ε starting from an
arbitrary distribution.
It's worth comparing this to the mixing time t_mix = arg min_t (d_TV(x^t, π) ≤
1/4) that we used with coupling. With τ_2, we have to throw in an extra factor
of ½ ln n to get the bound down to 1, but after that the bound continues to
drop by a factor of e every τ2 steps. With tmix , we have to go through another
tmix steps for each factor of 2 improvement in total variation distance, even
if the initial drop avoids the log factor. Since we can’t always tell whether
coupling arguments or spectral arguments will do better for a particular
chain, and since convergence bounds are often hard to obtain no matter how
many techniques we throw at them, we will generally be happy with any
reasonable bound on either τ2 or tmix .
10.5.3 Analysis of asymmetric chains
If the stationary distribution is not uniform, then in general the transition
matrix will not be symmetric. We can make it symmetric by scaling the
probability of being in the i-th state by π_i^{−1/2}. The idea is to decompose
our transition matrix P as ΠAΠ^{−1}, where Π is a diagonal matrix with
Π_ii = π_i^{−1/2}, and A_ij = √(π_i/π_j) · P_ij. Then A is symmetric, because

    A_ij = √(π_i/π_j) · P_ij
         = √(π_i/π_j) · (π_j/π_i) · P_ji
         = √(π_j/π_i) · P_ji
         = A_ji.

This means in particular that A has orthogonal eigenvectors u^1, u^2, . . . , u^n
with corresponding eigenvalues λ_1 ≥ λ_2 ≥ . . . ≥ λ_n. These eigenvalues carry
over to P = ΠAΠ^{−1}, since (u^i Π^{−1})P = u^i Π^{−1}ΠAΠ^{−1} = u^i AΠ^{−1} = λ_i u^i Π^{−1}.
So now given an initial distribution x^0, we can first pass it through Π and
then apply the same reasoning as before to show that ‖πΠ − xΠ‖_2 shrinks
by λ_2 with every step of the Markov chain. The difference now is that the
initial distance is affected by the scaling we did in Π^{−1}; so instead of getting
d_TV(x^t, π) ≤ λ_2^t √n, we get d_TV(x^t, π) ≤ λ_2^t (π_min)^{−1/2}, where π_min is the
smallest probability of any single node in the stationary distribution π. The
bad initial x^0 in this case is the one that puts all its weight on this node,
since this maximizes its distance from π after scaling.
(A uniform π is a special case of this, since when π_i = 1/n for all i,
π_min^{−1/2} = (1/n)^{−1/2} = √n.)
For more details on this, see [AF01, §3.4].
So now we just need a tool for bounding λ2 . For a small chain with a
known transition matrix, we can just feed it to our favorite linear algebra
library, but most of the time we will not be able to construct the matrix
explicitly. So we need a way to bound λ2 indirectly, in terms of other
structural properties of our Markov chain.

10.6 Conductance
The conductance or Cheeger constant Φ(S) of a set S of states in a
Markov chain is
    Φ(S) = (Σ_{i∈S, j∉S} π_i p_ij) / π(S).        (10.6.1)
This is the probability of leaving S on the next step starting from the
stationary distribution conditioned on being in S. The conductance is a
measure of how easy it is to escape from a set. It can also be thought of as a
weighted version of edge expansion.
The conductance of a Markov chain as a whole is obtained by taking the
minimum of Φ(S) over all S that occur with probability at most 1/2:
    Φ = min_{0 < π(S) ≤ 1/2} Φ(S).        (10.6.2)
The usefulness of conductance is that it bounds λ2 :
Theorem 10.6.1. In a reversible Markov chain,

    1 − 2Φ ≤ λ_2 ≤ 1 − Φ²/2.        (10.6.3)

The bound (10.6.3) is known as the Cheeger inequality. We won’t
attempt to prove it here, but the intuition is that in order for a reversible
chain not to mix, it has to get stuck in some subset of the states. Having
high conductance prevents this disaster, while having low conductance causes
it.
For lazy walks we always have λ2 = λmax , and so we can convert (10.6.3)
to a bound on the relaxation time:
Corollary 10.6.2.
    1/2Φ ≤ τ_2 ≤ 2/Φ².        (10.6.4)
In other words, high conductance implies low relaxation time and vice
versa, up to squaring.
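For a chain small enough to enumerate, we can compute Φ directly from (10.6.1) and (10.6.2) by brute force and compare with τ_2. A Python sketch (exponential in the number of states, so for toy chains only; names ours):

    from itertools import combinations

    def conductance(P, pi):
        """Brute-force Phi: P is the transition matrix (list of rows),
        pi the stationary distribution."""
        n = len(pi)
        best = float("inf")
        for size in range(1, n):
            for S in combinations(range(n), size):
                pS = sum(pi[i] for i in S)
                if pS > 0.5:
                    continue
                out = sum(pi[i] * P[i][j]
                          for i in S for j in range(n) if j not in S)
                best = min(best, out / pS)
        return best

    # lazy walk on a 6-cycle: stay w.p. 1/2, else move to a random neighbor
    n = 6
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = 0.5
        P[i][(i - 1) % n] += 0.25
        P[i][(i + 1) % n] += 0.25
    print(conductance(P, [1 / n] * n))   # 0.1666... = 1/n for the 6-cycle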
CHAPTER 10. MARKOV CHAINS 204

10.6.1 Easy cases for conductance
For very simple Markov chains we can compute the conductance directly.
Consider a lazy random walk on a cycle. Any proper subset S has at least two
outgoing edges, each of which carries a flow of 1/4n, giving Φ(S) ≥ (1/2n)/π(S).
If we now take the minimum of Φ(S) over all S with π(S) ≤ 1/2, we get Φ ≥ 1/n,
which gives τ_2 ≤ 2n². This is essentially the same bound as we got from
coupling.
Here’s a slightly more complicated chain. Take two copies of Kn , where
n is odd, and join them by a path with n edges. Now consider ΦS for a
lazy random walk on this graph where S consists of half the graph, split
at the edge in the middle of the path. There is a single outgoing edge uv,
with π(u) = d(u)/2|E| = 2/(2(n(n − 1) + n)) = n^{−2} and p_uv = 1/4, for
π(u)p_uv = n^{−2}/4. By symmetry, we have π(S) → 1/2 as n → ∞, giving
Φ(S) → (n^{−2}/4)/(1/2) = n^{−2}/2. So we have n² ≤ τ_2 ≤ 8n⁴.
How does this compare to the actual mixing time? In the stationary
distribution, we have a constant probability of being in each of the copies of
Kn . Suppose we start in the left copy. At each step there is a 1/n chance
that we are sitting on the path endpoint. We then step onto the path with
probability 1/n, and reach the other end before coming back with probability
1/n. So (assuming we can make this extremely sloppy handwaving argument
rigorous) it takes at least n3 steps on average before we reach the other copy
of Kn , which gives us a rough estimate of the mixing time of Θ(n3 ). In
this case the exponent is exactly in the middle of the bounds derived from
conductance.
Curiously, it's not hard to argue that this is in fact the worst possible
conductance we can get from a lazy random walk. Consider an arbitrary
connected graph G and any subset S of V(G). Then S has at least one
outgoing edge uv, and we take it with probability π_u p_uv = (d(u)/Σ_w d(w)) · (1/2d(u)) =
1/(2 Σ_w d(w)) = 1/(4|E|) ≥ 1/(2n(n − 1)). Assuming π(S) ≤ 1/2, this gives Φ(S) ≥
2 · 1/(2n(n − 1)) = 1/(n(n − 1)) = Θ(1/n²). This immediately gives that a lazy
random walk on any connected graph converges in time Õ(n⁴).

10.6.2 Edge expansion using canonical paths
(Here and below we are mostly following the presentation of Guruswami [Gur00],
but with slightly different examples.)
For more complicated Markov chains, it is helpful to have a tool for
bounding conductance that doesn’t depend on intuiting what sets have the
smallest boundary. The canonical paths [JS89] technique does this by
assigning a unique path γxy from each state x to each state y in a way that
doesn’t send too many paths across any one edge. So if we have a partition
of the state space into sets S and T , then there are |S| · |T | paths from states
in S to states in T , and since (a) every one of these paths crosses an S–T
edge, and (b) each S–T edge carries at most ρ paths, there must be at least
|S| · |T |/ρ edges from S to T . Note that because there is no guarantee we
chose good canonical paths, this is only useful for getting lower bounds on
conductance—and thus upper bounds on mixing time—but this is usually
what we want.
Let's start with a small example. Let G = C_n □ C_m, the n × m torus.
A lazy random walk on this graph moves north, east, south, or west with
probability 1/8 each, wrapping around when the coordinates reach n or m.
Since this is a random walk on a regular graph, the stationary distribution
is uniform. What is the relaxation time?
Intuitively, we expect it to be O(max(n, m)2 ), because we can think of this
two-dimensional random walk as consisting of two one-dimensional random
walks, one in the horizontal direction and one in the vertical direction, and
we know that a random walk on a cycle mixes in O(n2 ) time. Unfortunately,
the two random walks are not independent: every time I take a horizontal
step is a time I am not taking a vertical step. We can show that the expected
coupling time is O(n2 + m2 ) by running two sequential instances of the
coupling argument for the cycle, where we first link the two copies in the
horizontal direction and then in the vertical direction. So this gives us one
bound on the mixing time. But what happens if we try to use conductance?
Here it is probably easiest to start with just a cycle. Given points x
and y on C_n, let the canonical path γ_xy be a shortest path between them,
breaking ties so that half the paths go one way and half the other. Then each
edge is crossed by exactly k paths of length k for each k = 1 . . . (n/2 − 1),
and at most n/4 paths of length n/2 (0 if n is odd), giving a total of
ρ ≤ (n/2 − 1)(n/2)/2 + n/4 = n²/8 paths across the edge.
If we now take an S–T partition where |S| = m, we get at least m(n −
m)/ρ = 8m(n − m)/n² S–T edges. This peaks at m = n/2, where we
get 2 edges—exactly the right number—and in general when m ≤ n/2 we
get at least 8m(n/2)/n² = 4m/n outgoing edges, giving a conductance
Φ(S) ≥ (1/4n)(4m/n)/(m/n) = 1/n.
This is essentially what we got before, except we have to divide by 2
because we are doing a lazy walk. Note that for small m, the bound is a
gross underestimate, since we know that every nonempty proper subset has
at least 2 outgoing edges.
Now let's go back to the torus C_n □ C_m. Given x and y, define γ_xy to be


the L-shaped path that first changes x1 to y1 by moving the shortest distance
vertically, then changes x2 to y2 by moving the shortest distance horizontally.
For a vertical edge, the number of such paths that cross it is bounded by
n2 m/8, since we get at most n2 /8 possible vertical path segments and for
each such vertical path segment there are m possible horizontal destinations
to attach to it. For a horizontal edge, n and m are reversed, giving m2 n/8
paths. To make our life easier, let’s assume n ≥ m, giving a maximum of
ρ = n2 m/8 paths over any edge.
For |S| ≤ nm/2, this gives at least

    |S| · |G \ S| / ρ ≥ |S|(nm/2) / (n²m/8) = 4|S|/n

outgoing edges. We thus have

    Φ(S) ≥ (1/8nm) · (4|S|/n) / (|S|/nm) = 1/2n.
This gives τ_2 ≤ 2/Φ² ≤ 8n². Given that we assumed n ≥ m, this is essentially
the same O(n² + m²) bound that we would get from coupling.

10.6.3 Congestion
For less symmetric chains, we weight paths by the probabilities of their
endpoints when counting how many cross each edge, and treat the flow
across the edge as a capacity. This gives the congestion of a collection of
canonical paths Γ = {γxy }, which is computed as
    ρ(Γ) = max_{uv∈E} (1/(π_u p_uv)) Σ_{γ_xy ∋ uv} π_x π_y,

and we define ρ = min_Γ ρ(Γ).
The intuition here is that the congestion bounds the ratio between the
canonical path flow across an edge and the Markov chain flow across the
edge. If the congestion is not too big, this means that any cut that has a lot
of canonical path flow across it also has a lot of Markov chain flow. When
π(T) ≥ 1/2, the total canonical path flow π(S)π(T) is at least π(S)/2. This
means that when π(S) is large but less than 1/2, the Markov chain flow leaving
S is also large. This gives a lower bound on conductance.
Formally, we have the following lemma:
Lemma 10.6.3. For any reversible aperiodic irreducible Markov chain,

    Φ ≥ 1/2ρ,        (10.6.5)

from which it follows that

    τ_2 ≤ 8ρ².        (10.6.6)
Proof. Pick some set of canonical paths Γ with ρ(Γ) = ρ. Now pick some
S–T partition with π(S) ≤ 1/2 and Φ(S) = Φ. Consider a flow where we
route π(x)π(y) units of flow along each path γ_xy with x ∈ S and y ∈ T. This
gives a total flow of π(S)π(T) ≥ π(S)/2. We are going to show that we need
a lot of capacity across the S–T cut to carry this flow, which will give the
lower bound on conductance.
For any S–T edge uv, we have

    (1/(π_u p_uv)) Σ_{γ_xy ∋ uv} π_x π_y ≤ ρ

or

    Σ_{γ_xy ∋ uv} π_x π_y ≤ ρ · π_u p_uv.

Since each S–T path crosses at least one S–T edge, we have

    π(S)π(T) = Σ_{x∈S, y∈T} π_x π_y
             ≤ Σ_{u∈S, v∈T, uv∈E} Σ_{γ_xy ∋ uv} π_x π_y
             ≤ Σ_{u∈S, v∈T, uv∈E} ρ π_u p_uv
             = ρ Σ_{u∈S, v∈T, uv∈E} π_u p_uv.

But then

    Φ(S) = Σ_{u∈S, v∈T, uv∈E} π_u p_uv / π(S)
         ≥ (π(S)π(T)/ρ) / π(S)
         = π(T)/ρ
         ≥ 1/2ρ.

To get the bound on τ_2, use (10.6.4) to compute τ_2 ≤ 2/Φ² ≤ 8ρ².
10.6.4 Examples
Here are some more examples of applying canonical paths.

10.6.4.1 Lazy random walk on a line
Consider a lazy random walk on a line with reflecting barriers at 0 and n − 1.
Here π_x = 1/n for all x. There aren't a whole lot of choices for canonical paths;
the obvious choice for γ_xy with x < y is x, x+1, x+2, . . . , y. This puts (n/2)²
paths across the middle edge (which is the most heavily loaded), each of
which has weight π_x π_y = n^{−2}. So the congestion is (1/((1/n)(1/4))) · (n/2)² · n^{−2} = n,
giving a mixing time of at most 8n². This is a pretty good estimate.

10.6.4.2 Random walk on a hypercube
Let's try a more complicated example: the random walk on a hypercube from
§10.4.2. Here at each step we pick some coordinate uniformly at random
and set it to 0 or 1 with equal probability; this gives a transition probability
p_uv = 1/2n whenever u and v differ in exactly one bit. A plausible choice for
canonical paths is to let γ_xy use bit-fixing routing, where we change bits
in x to the corresponding bits in y from left to right. To compute congestion,
pick some edge uv, and let k be the bit position in which u and v differ.
A path γ_xy will cross uv if u_k . . . u_n = x_k . . . x_n (because when we are at u
we haven't fixed those bits yet) and v_1 . . . v_k = y_1 . . . y_k (because at v we
have fixed all of those bits). There are 2^{k−1} choices of x_1 . . . x_{k−1} consistent
with the first condition and 2^{n−k} choices of y_{k+1} . . . y_n consistent with the
second, giving exactly 2^{n−1} total paths γ_xy crossing uv. Since each path
occurs with weight π_x π_y = 2^{−2n}, and the flow across uv is π_u p_uv = 2^{−n} · (1/2n),
we can calculate the congestion

    ρ(Γ) = max_{uv∈E} (1/(π_u p_uv)) Σ_{γ_xy ∋ uv} π_x π_y
         = (1/(2^{−n}/2n)) · 2^{n−1} · 2^{−2n}
         = n.

This gives a relaxation time τ_2 ≤ 8ρ² = 8n², which when we account for the
large state space gives t_mix(ε) ≤ 8n²(½ ln 2^n + ln(1/ε)) = O(n³). In
this case the bound is substantially worse than what we previously proved using
coupling.
The fact that the number of canonical paths that cross a particular edge
is exactly one half the number of nodes in the hypercube is not an accident:
if we look at what information we need to fill in to compute x and y from u
and v, we need (a) the part of x we’ve already gotten rid of, plus (b) the
part of y we haven’t filled in yet. If we stitch these two pieces together, we
get all but one of the n bits we need to specify a node in the hypercube, the
missing bit being the bit we flip to get from u to v. This sort of thing shows
up often in conductance arguments where we build our canonical paths by
fixing a structure one piece at a time.
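The 2^{n−1} count is easy to verify by brute force for small n; a Python sketch (names ours):

    from itertools import product

    def bitfix_path(x, y):
        """Vertices of the bit-fixing canonical path from x to y (bit tuples)."""
        path = [x]
        cur = list(x)
        for k in range(len(x)):
            if cur[k] != y[k]:
                cur[k] = y[k]
                path.append(tuple(cur))
        return path

    def edge_loads(n):
        load = {}
        for x in product((0, 1), repeat=n):
            for y in product((0, 1), repeat=n):
                p = bitfix_path(x, y)
                for u, v in zip(p, p[1:]):
                    load[u, v] = load.get((u, v), 0) + 1
        return load

    print(set(edge_loads(4).values()))   # {8}: every edge carries 2^(n-1) paths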

10.6.4.3 Matchings in a graph
A matching in a graph G = (V, E) is a subset of the edges with no two
elements adjacent to each other; equivalently, it’s a subgraph that includes
all the vertices in which each vertex has degree at most 1. We can use a
random walk to sample matchings from an arbitrary graph uniformly.
Here is the random walk. Let St be the matching at time t. At each
step, we choose an edge e ∈ E uniformly at random, and flip a coin to decide
whether to include it or not. If the coin comes up tails, set St+1 = St \{e} (this
may leave St+1 = St if St already omitted e); otherwise, set St+1 = St ∪ {e}
unless one of the endpoints of e is already incident to an edge in St , in which
case set St+1 = St .
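One step of this walk, as a Python sketch with edges stored as tuples (names ours):

    import random

    def matching_step(S, edges):
        """S is the current matching, a set of edges; edges is the edge list."""
        e = random.choice(edges)
        if random.randrange(2):        # tails: remove e (no-op if absent)
            S.discard(e)
        else:                          # heads: add e if both endpoints are free
            u, v = e
            if not any(u in f or v in f for f in S):
                S.add(e)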
Because this chain contains many self-loops, it's aperiodic. It's also
straightforward to show that any transition between two adjacent matchings
occurs with probability exactly 1/2m, where m = |E|, and thus that the chain
is reversible with a uniform stationary distribution. We'd like to bound the
congestion of the chain to show that it converges in a reasonable amount of
time.
Let N be the number of matchings in G, and let n = |V| and m = |E| as
usual. Then π_S = 1/N for all S and π_S p_ST = 1/2Nm. Our congestion for any
transition ST will then be 2Nm · N^{−2} = 2m/N times the number of paths
that cross ST; ideally this number of paths will be at most N times some
small polynomial in n and/or m.
Suppose we are trying to get from some matching X to another matching
Y . The graph X ∪ Y has maximum degree 2, so each of its connected
components is either a path or a cycle; in addition, we know that the edges
in each of these paths or cycles alternate whether they come from X or Y ,
which among other things implies that the cycles are all even cycles.
We can transform X to Y by processing these components in a methodical
way: first order the components (say, by the increasing identity of the smallest
vertex in each), then for each component replace the X edges with Y edges.
If we do this cleverly enough, we can guarantee that for any transition ST ,
the set of edges (X ∪ Y ) \ (S ∪ T ) always consists of a matching plus at most
two extra edges, and that S, T , (X ∪ Y ) \ (S ∪ T ) are enough to reconstruct
X and Y . Since there are at most N m2 choices of (X ∪ Y ) \ (S ∪ T ), this
will give at most N m2 canonical paths across each transition.15
Here is how we do the replacement within a component. If C is a
cycle with k vertices, order the vertices v_0, . . . , v_{k−1} such that v_0 has the
smallest index in C and v_0v_1 ∈ X. Then X ∩ C consists of the even-numbered
edges e_{2i} = v_{2i}v_{2i+1} and Y ∩ C consists of the odd-numbered edges
e_{2i+1} = v_{2i+1}v_{2i+2}, where we take v_k = v_0. We now delete e_0, then alternate
between deleting e_{2i} and adding e_{2i−1} for i = 1 . . . k/2 − 1. Finally we add
e_{k−1}.
The number of edges in C at each step in this process is always one of
k/2, k/2 − 1, or k/2 − 2, and since we always add or delete an edge, the
number of edges in C ∩ (S ∪ T ) for any transition will be at least k/2 − 1,
which means that the total degree of the vertices in C ∩ (S ∪ T ) is at least
k − 2. We also know that S ∪ T is a matching, since it's either equal to S or
equal to T . So there are at least k − 2 vertices in C ∩ (S ∪ T ) with degree
1, and none with degree 2, leaving at most 2 vertices with degree 0. In the
complement C \ (S ∪ T ), this becomes at most two vertices with degree 2,
with the rest having degree 1. If these vertices are adjacent, removing the
edge between them leaves a matching; if not, removing one edge adjacent to
each does so. So two extra edges are enough.
A similar argument works for paths, which we will leave as an exercise
for the reader.
Now suppose we know S, T , and (X ∪ Y ) \ (S ∪ T ). We can compute
X ∪ Y = ((X ∪ Y ) \ (S ∪ T )) ∪ (S ∪ T ); this means we can reconstruct the
graph X ∪ Y , identify its components, and so on. We know which component
C we are working on because we know which edge changes between S and T .
For any earlier component C′, we have C′ ∩ S = C′ ∩ Y (since we finished
already), and similarly C′ \ S = C′ ∩ X. For components C′ we haven't
reached yet, the reverse holds, C′ ∩ S = C′ ∩ X and C′ \ S = C′ ∩ Y. This
determines the edges in both X and Y for all edges not in C.
For C, we know from C ∩ (X ∪ Y ) which vertex is v0 . In a cycle, whenever
we remove an edge, its lower-numbered endpoint is always an even distance
from v0 , and similarly when we add an edge, its lower-numbered endpoint
is always an odd distance from v0 . So we can orient the cycle and tell
15 We could improve the constant a bit by using N · (m choose 2), but we won't.
, but we won’t.
CHAPTER 10. MARKOV CHAINS 211

which edges in S ∪ T have already been added (and are thus part of Y ) and
which have been removed (and are thus part of X). Since we know that
C ∩ Y = C \ X, this is enough to reconstruct both C ∩ X and C ∩ Y . (The
same general idea works for paths, although for paths we do not need to be as
careful to figure out orientation since we can only leave v0 in one direction.)
The result is that given S and T, there are at most Nm² choices of
X and Y such that the canonical path γ_XY crosses ST. This gives ρ ≤
Nm² · N^{−2} / (N^{−1}/2m) = 2m³, which gives τ_2 ≤ 32m⁶. This is not great but it is at least
polynomial in the size of the graph. For actual sampling, this translates to
O(m⁶(log N + log(1/ε))) = O(m⁶(m + log(1/ε))) steps to get a total variation
distance of ε.
This may or may not give rapid mixing, depending on how big N is relative
to m. For many graphs, the number of matchings N will be exponential in
m, and so the m⁶ term will be polylogarithmic in N. But N can be much
smaller in some graphs. For example, on a star we have N = m + 1,
since no matching can contain more than one edge, making the bound
τ_2 ≤ 32m⁶ = Θ(N⁶) polynomial in N but not polynomial in log N, which is
what we want. The actual mixing time in this case is Θ(m) (for the upper
bound, wait to delete the only edge, and do so on both sides of the coupling;
for the lower bound, observe that until we delete the only edge we are still in
our initial state). This is much better than the upper bound from canonical
paths, but it still doesn't give rapid mixing.

10.6.4.4 Perfect matchings in dense bipartite graphs
(Basically doing [MR95, §11.3].)
A similar chain can be used to sample perfect matchings in dense bipartite
graphs, as shown by Jerrum and Sinclair [JS89] based on an algorithm by
Broder [Bro86] that turned out to have a bug in the analysis [Bro88].
A perfect matching of a graph G is a subgraph that includes all the
vertices and gives each of them exactly one incident edge. We’ll be looking
for perfect matchings in a bipartite graph consisting of n left vertices
u1 , . . . , un and n right vertices v1 , . . . , vn , where every edge goes between a
left vertex and a right vertex. We’ll also assume that the graph is dense,
which in this case means that every vertex has at least n/2 neighbors.
This density assumption was used in the original Broder and Jerrum-Sinclair
papers but removed in a later paper by Jerrum, Sinclair, and Vigoda [JSV04].
The random walk is similar to the random walk from the previous section
restricted to the set of matchings with either n − 1 or n edges. At each
step, we first flip a coin (the usual lazy walk trick); if it comes up heads,
we choose an edge uv uniformly at random and apply one of the following
transformations, depending on how uv fits in the current matching m_t (a
Python sketch of one step follows the list):
1. If uv ∈ m_t , and |m_t | = n, set m_{t+1} = m_t \ {uv}.

2. If u and v are both unmatched in m_t , and |m_t | = n − 1, set m_{t+1} = m_t ∪ {uv}.

3. If exactly one of u and v is matched to some other node w, and
|m_t | = n − 1, perform a rotation that deletes the w edge and adds uv.

4. If none of these conditions hold, set m_{t+1} = m_t .
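Here is the promised sketch of one step in Python, storing the matching as a dict that maps each matched vertex to its partner in both directions (representation and names ours):

    import random

    def step(match, edges, n):
        """One step of the chain on matchings of size n or n - 1."""
        if random.randrange(2):        # lazy coin: stay put half the time
            return
        u, v = random.choice(edges)
        size = len(match) // 2
        if match.get(u) == v and size == n:            # case 1: delete uv
            del match[u], match[v]
        elif u not in match and v not in match and size == n - 1:
            match[u], match[v] = v, u                  # case 2: add uv
        elif size == n - 1 and (u in match) != (v in match):
            x, y = (u, v) if u in match else (v, u)    # x matched, y free
            w = match[x]
            del match[w]                               # case 3: rotate: drop xw,
            match[x], match[y] = y, x                  # then add uv
        # case 4: otherwise do nothing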
The walk can be started from any perfect matching, which can be found
in O(n5/2 ) time using a classic algorithm of Hopcroft and Karp [HK73] that
repeatedly searches for an augmenting path, which is a path in G between
two unmatched vertices that alternates between edges not in the matching
and edges in the matching. (We’ll see augmenting paths again below when we
show that any near-perfect matching can be turned into a perfect matching
using at most two transitions.)
We can show that this walk converges in polynomial time using a canonical
path argument. This is done in two stages: first, we define canonical paths
between all perfect matchings. Next, we define a short path from any
matching of size n − 1 to some nearby perfect matching, and build paths
between arbitrary matchings by pasting one of these short paths on one or
both ends of a long path between perfect matchings. This gives a collection
of canonical paths that we can show to have low congestion.
To go between perfect matchings X and Y , consider X ∪ Y as a collection
of paths and even cycles as in §10.6.4.3. In some standard order, fix each path
or cycle by first deleting one edge to make room, then using rotate operations
to move the rest of the edges, then putting the last edge back in. For any
transition S–T along the way, we can use the same argument that we can
compute X ∪ Y from S ∪ T by supplying the missing edges, which will consist
of a matching of size n plus at most one extra edge. So if N is the number
of matchings of size n or n − 1, the same argument used previously shows
that at most N m of these long paths cross any S–T transition.
For the short paths, we must use the density property. The idea is that for
any matching that is not perfect, we can find an augmenting path of length at
most 3 between two unmatched nodes on either side. Pick some unmatched
nodes u and v. Each of these nodes is adjacent to at least n/2 neighbors; if
any of these neighbors are unmatched, we just found an augmenting path
of length 1. Otherwise the n/2 neighbors of u and the n/2 nodes matched
to neighbors of v overlap (because v is unmatched, leaving at most n − 1
matched nodes and thus at most n/2−1 nodes that are matched to something
that’s not a neighbor of v). So for each matching of size n − 1, in at most
two steps (rotate and then add an edge) we can reach some specific perfect
matching. There are at most m2 ways that we can undo this, so each perfect
matching is associated with at most m2 smaller matchings. This blows up
the number of canonical paths crossing any transition by roughly m4 ; by
counting carefully we can thus show congestion that is O(m6 ) (O(m4 ) from
the blow-up, m from the m in N m, and m from 1/pST ).
It follows that for this process, τ_2 = O(m^{12}). (I think a better analysis is
possible.)
As noted earlier, this is an example of a process for which causal coupling
doesn’t work in less than exponential time [KR99], a common problem with
Markov chains that don’t have much symmetry. So it’s not surprising that
stronger techniques were developed specifically to attack this problem.
An issue we need to deal with is that when we sample a matching with
this process, we don’t necessarily get a perfect matching, since many states
are going to have only n − 1 edges. But we have already shown that the
perfect matchings make up at least an m−2 fraction of the total. So if
we don’t get a perfect matching after a polynomial number of steps, we
can simply start over and try again. This adds a polynomial factor to the
already-bad running time, but it is still polynomial. Alternatively we can
argue that the Markov chain sampled only at times when it yields a perfect
matching is (a) still a Markov chain, and (b) has a total variation distance
from its stationary distribution that we can bound using the bounds on the
unsampled chain. The details of this are a little messy but it also works.
Chapter 11

Approximate counting

(See also [MR95, Chapter 11].)
The basic idea: we have some class of objects, and we want to know how
many of them there are. Ideally we can build an algorithm that just prints
out the exact number, but for many problems this is hard.
A fully polynomial-time randomized approximation scheme or
FPRAS for a numerical problem outputs a number that is between 1 − ε
and 1 + ε times the correct answer, with probability at least 3/4 (or some
constant bounded away from 1/2—we can amplify to improve it), in time
polynomial in the input size n and 1/ε. In this chapter, we'll be hunting for
FPRASs. But first we will discuss briefly why we can’t just count directly in
many cases.

11.1 Exact counting
A typical application is to a problem in the complexity class #P, problems
that involve counting the number of accepting computation paths in a
nondeterministic polynomial-time Turing machine. Equivalently, these are
problems that involve counting for some input x the number of values r such
that M (x, r) = 1, where M is some machine in P. An example would be the
problem #SAT of counting the number of satisfying assignments of some
CNF formula.
The class #P (which is usually pronounced sharp P or number P)
was defined by Leslie Valiant in a classic paper [Val79]. The central result
in this paper is Valiant’s theorem. This shows that any problem in #P
can be reduced (by Cook reductions, meaning that we are allowed to
use the target problem as a subroutine instead of just calling it once) to the

problem of computing the permanent of a square 0–1 matrix A, where the
permanent is given by the formula Σ_π Π_i A_{i,π(i)}, where the sum ranges over
all n! permutations π of the indices of the matrix. An equivalent problem
is counting the number of perfect matchings (subgraphs including all
vertices in which every vertex has degree exactly 1) of a bipartite graph.
Other examples of #P-complete problems are #SAT (defined above) and
#DNF (like #SAT, but the input is in DNF form; #SAT reduces to #DNF by negating the formula and then subtracting the result from 2ⁿ).
Exact counting of #P-hard problems is likely to be very difficult: Toda’s
theorem [Tod91] says that being able to make even a single query to a #P-
oracle is enough to solve any problem in the polynomial-time hierarchy,
which contains most of the complexity classes you have probably heard
of. Nonetheless, it is often possible to obtain good approximations to such
problems.

11.2 Counting by sampling


If many of the things we are looking for are in the target set, we can count
by sampling; this is what poll-takers do for a living. Let U be the universe
we are sampling from and G be the “good” set of points we want to count.
Let ρ = |G|/|U |. If we can sample uniformly from U , then we can estimate ρ
by taking N independent samples and dividing the number of samples in G
by N. This will give us an answer ρ̂ whose expectation is ρ, but the accuracy may be off. Since the variance of each sample is ρ(1 − ρ) ≈ ρ (when ρ is small), we get $\operatorname{Var}\left[\sum X_i\right] \approx N\rho$, giving a standard deviation of $\sqrt{N\rho}$. For this to be less than our allowable error εNρ, we need $1 \le \epsilon\sqrt{N\rho}$ or $N \ge \frac{1}{\epsilon^2\rho}$.
This gets bad if ρ is exponentially small. So if we are to use sampling, we
need to make sure that we only use it when ρ is large.
On the positive side, if ρ is large enough we can easily compute how
many samples we need using Chernoff bounds. The following lemma gives
a convenient estimate; it is based on [MR95, Theorem 11.1] with a slight
improvement on the constant:

Lemma 11.2.1. Sampling N times gives relative error ε with probability at least 1 − δ provided ε ≤ 1.81 and
\[
N \ge \frac{3}{\epsilon^2 \rho} \ln\frac{2}{\delta}. \tag{11.2.1}
\]
Proof. Suppose we take N samples, and let X be the total count for these samples. Then E [X] = ρN, and (5.2.7) gives (for ε ≤ 1.81):
\[
\Pr[|X - \rho N| \ge \epsilon\rho N] \le 2e^{-\rho N \epsilon^2/3}.
\]
Now set $2e^{-\rho N \epsilon^2/3} \le \delta$ and solve for N to get (11.2.1).

11.2.1 Generating samples


An issue that sometimes comes up in this process is that it is not always
obvious how to generate a sample from a given universe U . What we want
is a function that takes random bits as input and produces an element of U
as output.
In the simplest case, we know |U| and have a function f from {0, . . . , |U| − 1} to U (the application of such a function is called unranking). Here we can do rejection sampling: we choose a bit-vector of length ⌈lg|U|⌉, interpret it as an integer x expressed in binary, and try again until x < |U|. This requires less than 2⌈lg|U|⌉ bits on average in the worst case, since we get a good value at least half the time.
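As a concrete illustration, here is a minimal Python sketch of this rejection-sampling loop (the function name sample_uniform is invented for the example; it is not part of the original discussion):

import random

def sample_uniform(size):
    """Return a uniform integer in {0, ..., size-1} by rejection sampling.

    We draw ceil(lg size) random bits, read them as an integer x, and
    retry until x < size; each attempt succeeds with probability at
    least 1/2, so fewer than 2*ceil(lg size) bits are used on average.
    """
    bits = max(1, (size - 1).bit_length())  # ceil(lg size) for size >= 2
    while True:
        x = random.getrandbits(bits)
        if x < size:
            return x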
Unranking (and the converse operation of ranking) can be a non-trivial
problem in combinatorics, and there are entire textbooks on the subject [SW13].
But some general principles apply. If U is a union of disjoint sets U₁ ∪ U₂ ∪ . . . ∪ Uₙ, we can compute the sizes |Uᵢ|, and we can unrank each Uᵢ; then we can map some x ∈ {0 . . . |U| − 1} to a specific element of U by choosing the maximum i such that $\sum_{j=1}^{i-1} |U_j| \le x$ and choosing the k-th element of Uᵢ, where $k = x - \sum_{j=1}^{i-1} |U_j|$. This is essentially what happens if we want to compute the x-th day of the year. The first step of this process requires computing the sizes of O(n) sets if we generate the sum one element at a time, although if we have a fast way of computing $\sum_{i=1}^{j} |U_i|$ for each j, we can cut this down to O(log n) such computations using binary search. The second step depends on whatever mechanism we use to unrank within Uᵢ.
An example would be generating one of the $\binom{n}{k}$ k-element subsets of a set of size n uniformly at random. If we let S be the set of all such k-subsets, we can express S as $\bigcup_{i=1}^{n} S_i$, where Sᵢ is the set of all k-subsets that have the i-th element as their smallest element in some fixed ordering. But then we can easily compute $|S_i| = \binom{n-i}{k-1}$ for each i, apply the above technique to pick a particular Sᵢ to start with, and then recurse within Sᵢ to get the rest of the elements.
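To make the recursion concrete, here is a hedged Python sketch (the function unrank_ksubset and its calling conventions are invented for this example):

from math import comb

def unrank_ksubset(x, n, k, base=1):
    """Map a rank x in {0, ..., C(n,k)-1} to the x-th k-subset of
    {base, ..., base+n-1}, using the decomposition by smallest element:
    the subsets whose smallest element is the i-th one form a block of
    size C(n-i, k-1), so we scan blocks and recurse inside the right one."""
    if k == 0:
        return []
    for i in range(1, n - k + 2):   # smallest element is at most the (n-k+1)-th
        block = comb(n - i, k - 1)  # |S_i| from the text
        if x < block:
            return [base + i - 1] + unrank_ksubset(x, n - i, k - 1, base + i)
        x -= block
    raise ValueError("rank out of range")

# Ranks 0..C(4,2)-1 enumerate the 2-subsets of {1,2,3,4} in order.
assert [unrank_ksubset(r, 4, 2) for r in range(comb(4, 2))] == [
    [1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4]]

Feeding this a rank produced by the sample_uniform sketch above gives a uniform k-subset.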
A more general case is when we can't easily sample from U but we can sample from some T ⊇ U. Here rejection sampling comes to our rescue as long as |U|/|T| is large enough. For example, if we want to generate a k-subset of n elements when k ≪ n, we can choose a list of k elements with replacement and discard any lists that contain duplicates. But this doesn't work so well for larger k. In this particular case, there is an easy out: we can sample k elements without replacement and forget their order. This corresponds to sampling from a universe T of ordered k-subsets, then mapping down to U. As long as the inverse image of each element of U has the same size in T, this will give us a uniform sample from U.
When rejection sampling fails, we may need to come up with a more clever
approach to concentrate on the particular values we want. One approach that
can work well if we can express U as a union of sets that are not necessarily
disjoint is the Karp-Luby technique [KL85], discussed in §11.4 below.
All of this assumes that we are interested in getting uniform samples. For non-uniform samples, we replace the assumption that we are trying to generate a uniform X ∈ {0 . . . m − 1} with some X ∈ {0 . . . m − 1} for which we can efficiently calculate Pr [X ≤ i] for each i. In this case, we can (in principle) generate a continuous random variable Y ∈ [0, 1] and choose the maximum i such that Pr [X < i] ≤ Y, which we can find using binary search.
The complication is that we can't actually generate Y in finite time. Instead, we generate Y one bit at a time; this gives a sequence of rational values Yₜ where each Yₜ is of the form k/2ᵗ. We can stop when every value in the interval [Yₜ, Yₜ + 2⁻ᵗ) lies within the range corresponding to some particular value of X. This technique is closely related to arithmetic coding [WNC87], and requires generating H(X) + O(1) bits on average, where $H(X) = -\sum_{i=0}^{m-1} \Pr[X=i] \lg \Pr[X=i]$ is the (base 2) entropy of X.
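Here is a rough Python sketch of this bit-at-a-time scheme (names invented; for clarity the candidate set is recomputed by a linear scan where a serious implementation would use binary search, and the CDF is given as exact Fractions so that comparisons are exact):

import random
from fractions import Fraction

def sample_from_cdf(cdf):
    """Sample X with Pr[X <= i] = cdf[i], revealing a uniform Y one bit
    at a time.  We refine the dyadic interval [lo, lo + width) known to
    contain Y until it lies inside a single step of the CDF, at which
    point the value of X is determined."""
    lo, width = Fraction(0), Fraction(1)
    while True:
        # Values of X still consistent with Y landing in [lo, lo + width).
        candidates = [i for i in range(len(cdf))
                      if (cdf[i - 1] if i > 0 else 0) < lo + width
                      and cdf[i] > lo]
        if len(candidates) == 1:
            return candidates[0]
        width /= 2
        if random.getrandbits(1):   # reveal the next bit of Y
            lo += width

For example, sample_from_cdf([Fraction(1, 2), Fraction(1)]) consumes exactly one bit, matching the entropy bound.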

11.3 Approximating #KNAPSACK


Here is an algorithm for approximating the number of solutions to a KNAP-
SACK problem, due to Dyer [Dye03]. We’ll concentrate on the simplest
version, 0–1 KNAPSACK, following the analysis in Section 2.1 of [Dye03].
For the 0–1 KNAPSACK problem, we are given a set of n objects of weight 0 ≤ a₁ ≤ a₂ ≤ . . . ≤ aₙ ≤ b, and we want to find a 0–1 assignment x₁, x₂, . . . , xₙ such that $\sum_{i=1}^n a_i x_i \le b$ (usually while optimizing some property that prevents us from setting all the xᵢ to zero). We'll assume that the aᵢ and b are all integers.
For #KNAPSACK, we want to compute |S|, where S is the set of all assignments to the xᵢ that make $\sum_{i=1}^n a_i x_i \le b$.

There is a well-known fully polynomial-time approximation scheme


for optimizing KNAPSACK, based on dynamic programming. The idea is
that a maximum-weight solution can be found exactly in time polynomial in
b, and if b is too large, we can reduce it by rescaling all the ai and b at the
cost of a small amount of error. A similar idea is used in Dyer's algorithm: the KNAPSACK problem is rescaled so that the size of the solution set S′ of the rescaled version can be computed in polynomial time. Sampling is then used to determine what proportion of the solutions in S′ correspond to solutions of the original problem.
Scaling step: Let a′ᵢ = ⌊n²aᵢ/b⌋. Then 0 ≤ a′ᵢ ≤ n² for all i. Taking the floor creates some error: if we try to reconstruct aᵢ from a′ᵢ, the best we can do is argue that a′ᵢ ≤ n²aᵢ/b < a′ᵢ + 1 implies (b/n²)a′ᵢ ≤ aᵢ < (b/n²)a′ᵢ + (b/n²). The reason for using n² as our rescaled bound is that the total error in the upper bound on aᵢ, summed over all i, is bounded by n(b/n²) = b/n, a fact that will become important soon.
Let $S' = \left\{\vec{x} \,\middle|\, \sum_{i=1}^n a'_i x_i \le n^2\right\}$ be the set of solutions to the rescaled knapsack problem, where we substitute a′ᵢ for aᵢ and n² = (n²/b)b for b.
Claim: S ⊆ S′. Proof: $\vec{x} \in S$ if and only if $\sum_{i=1}^n a_i x_i \le b$. But then
\[
\begin{aligned}
\sum_{i=1}^n a'_i x_i &= \sum_{i=1}^n \lfloor n^2 a_i / b \rfloor x_i \\
&\le \sum_{i=1}^n (n^2/b) a_i x_i \\
&= (n^2/b) \sum_{i=1}^n a_i x_i \\
&\le n^2,
\end{aligned}
\]
which shows $\vec{x} \in S'$.
The converse does not hold. However, we can argue that any $\vec{x} \in S'$ can be shoehorned into S by setting at most one of the xᵢ to 0. Consider the set of all positions i such that xᵢ = 1 and aᵢ > b/n. If this set is empty, then $\sum_{i=1}^n a_i x_i \le \sum_{i=1}^n b/n = b$, and $\vec{x}$ is already in S. Otherwise, pick any position i with xᵢ = 1 and aᵢ > b/n, and let yⱼ = 0 when j = i and yⱼ = xⱼ otherwise. If we recall that the total error from the floors in our approximation was at most (b/n²)n = b/n, the intuition is that deleting this one element, whose weight exceeds b/n, more than compensates for that error. Formally, we can write


\[
\begin{aligned}
\sum_{j=1}^n a_j y_j &= \sum_{j=1}^n a_j x_j - a_i \\
&< \sum_{j=1}^n \left((b/n^2)a'_j + b/n^2\right) x_j - b/n \\
&\le (b/n^2)\sum_{j=1}^n a'_j x_j + b/n - b/n \\
&\le (b/n^2)\, n^2 \\
&= b.
\end{aligned}
\]

Applying this mapping to all elements $\vec{x}$ of S′ maps at most n + 1 of them to each $\vec{y}$ in S; it follows that |S′| ≤ (n + 1)|S|, which means that if we can sample elements of S′ uniformly, each sample will hit S with probability at least 1/(n + 1).
To compute |S′|, let $C(k, m) = \left|\left\{\vec{x} \,\middle|\, \sum_{i=1}^k a'_i x_i \le m\right\}\right|$ be the number of subsets of {a′₁, . . . , a′ₖ} that sum to m or less. Then C(k, m) satisfies the recurrence
\[
\begin{aligned}
C(k, m) &= C(k-1, m-a'_k) + C(k-1, m) \\
C(0, m) &= 1
\end{aligned}
\]
where k ranges from 0 to n and m ranges from 0 to n², and we treat C(k − 1, m − a′ₖ) = 0 if m − a′ₖ < 0. The idea is that C(k − 1, m − a′ₖ) counts all the ways to make m if we include a′ₖ, and C(k − 1, m) counts all the ways to make m if we exclude it. The base case corresponds to the empty set (which sums to ≤ m no matter what m is).
We can compute a table of all values of C(k, m) by iterating through m in increasing order; this takes O(n³) time. At the end of this process, we can read off |S′| = C(n, n²).
But we can do more than this: we can also use the table of counts to sample uniformly from S′. The probability that xₙ = 1 for a uniform random element of S′ is exactly C(n − 1, n² − a′ₙ)/C(n, n²); having chosen xₙ = 1 (say), the probability that xₙ₋₁ = 1 is then C(n − 2, n² − a′ₙ − a′ₙ₋₁)/C(n − 1, n² − a′ₙ), and so on. So after making O(n) random choices (with O(1) arithmetic operations for each choice to compute the probabilities) we get a uniform element of S′, which we can test for membership in S in an additional O(n) operations.
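Putting the pieces together, here is a rough Python sketch of the whole algorithm (the name count_and_sample is invented, and the number of samples is left as a parameter rather than derived from Lemma 11.2.1):

import random

def count_and_sample(a, b, trials=10000, rng=random):
    """Estimate the number of 0-1 knapsack solutions, following Dyer's
    outline above: rescale to a'_i = floor(n^2 a_i / b), count rescaled
    solutions exactly by dynamic programming, sample them uniformly, and
    scale by the fraction of samples satisfying the original constraint."""
    n = len(a)
    ap = [n * n * ai // b for ai in a]
    cap = n * n
    # C[k][m] = number of ways to pick x_1..x_k with sum a'_i x_i <= m.
    C = [[1] * (cap + 1)] + [[0] * (cap + 1) for _ in range(n)]
    for k in range(1, n + 1):
        for m in range(cap + 1):
            C[k][m] = C[k - 1][m]                    # exclude a'_k
            if m >= ap[k - 1]:
                C[k][m] += C[k - 1][m - ap[k - 1]]   # include a'_k
    hits = 0
    for _ in range(trials):
        m, chosen = cap, []
        for k in range(n, 0, -1):    # choose x_n, then x_{n-1}, ...
            p_one = C[k - 1][m - ap[k - 1]] / C[k][m] if m >= ap[k - 1] else 0.0
            if rng.random() < p_one:
                chosen.append(k - 1)
                m -= ap[k - 1]
        if sum(a[i] for i in chosen) <= b:   # did we land in S?
            hits += 1
    return C[n][cap] * hits / trials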
We've already established that |S|/|S′| ≥ 1/(n + 1), so we can apply Lemma 11.2.1 to get ε relative error with probability at least 1 − δ using $\frac{3(n+1)}{\epsilon^2}\ln\frac{2}{\delta}$ samples. This gives a cost of O(n² log(1/δ)/ε²) for the sampling step, or a total cost of O(n³ + n² log(1/δ)/ε²) after including the cost of building the table (in practice, the second term will dominate unless we are willing to accept $\epsilon = \omega(1/\sqrt{n})$).
It is possible to improve this bound. Dyer [Dye03] shows that using randomized rounding on the a′ᵢ instead of just truncating them gives a FPRAS that runs in $O\left(n^{5/2}\sqrt{\log(1/\epsilon)} + n^2/\epsilon^2\right)$ time. We won't do this here, but we will see randomized rounding used in a different context in §13.2.2.

11.4 Approximating #DNF


A classical algorithm of Karp and Luby [KL85] gives a FPRAS for #DNF.
We’ll describe how this works, mostly following the presentation in [MR95,
§11.2]. The key idea of the Karp-Luby technique is to express the set S whose size we want to know as a union of a polynomial number of simpler sets S₁, . . . , Sₖ. If for each i we can sample uniformly from Sᵢ, compute |Sᵢ|, and determine membership in Sᵢ, then some clever sampling will let us approximate |S| by sampling from a disjoint union of the Sᵢ to determine how much $\sum_i |S_i|$ overestimates |S|.
To make this concrete, let's look at the specific problem studied by Karp and Luby of approximating the number of satisfying assignments of a DNF formula. A DNF formula is a formula that is in disjunctive normal form: it is an OR of zero or more clauses, each of which is an AND of variables or their negations. An example would be (x₁ ∧ x₂ ∧ x₃) ∨ (¬x₁ ∧ x₄) ∨ x₂. The #DNF problem is to count the number of satisfying assignments of a formula presented in disjunctive normal form.
Solving #DNF exactly is #P-complete, so we don’t expect to be able to
do it. Instead, we’ll get a FPRAS by cleverly sampling solutions. The need
for cleverness arises because just sampling solutions directly by generating
one of the 2n possible assignments to the n variables may find no satisfying
assignments at all, since the size of any individual clause might be big enough
that getting a satisfying assignment for that clause at random is exponentially
unlikely.
So instead we will sample pairs (x, i), where x is an assignment that
satisfies clause Ci ; these are easier to find, because if we know which clause
Ci we are trying to satisfy, we can read off the satisfying assignment from
its variables. Let S be the set of such pairs. For each pair (x, i), define f(x, i) = 1 if and only if Cⱼ(x) = 0 for all j < i. Then $\sum_{(x,i)\in S} f(x, i)$ counts every satisfying assignment x, because (a) there exists some i such that x satisfies Cᵢ, and (b) only the smallest such i will have f(x, i) = 1. In effect, f is picking out a single canonical satisfied clause from each satisfying assignment. Note that we can compute f efficiently by testing x against all clauses Cⱼ with j < i.
Our goal is to estimate the proportion ρ of “good” pairs with f(x, i) = 1 out of all pairs in $S' = \biguplus_i S_i$, and then use this to estimate $|S| = \sum_{(x,i)\in S'} f(x, i) = \rho|S'|$. If we can sample from S′ uniformly, the proportion ρ of “good” pairs with f(x, i) = 1 is at least 1/m, because every satisfying assignment x contributes at most m pairs total, and one of them is good.
The only tricky part is figuring out how to sample pairs (x, i) with Cᵢ(x) = 1 so that all pairs occur with the same probability. Let Sᵢ = {(x, i) | Cᵢ(x) = 1}. Then we can compute $|S_i| = 2^{n-k_i}$ where kᵢ is the number of literals in Cᵢ. Using this information, we can sample i first with probability $|S_i|/\sum_j |S_j|$, then sample x from Sᵢ just by picking independent uniform random values for the n − kᵢ variables not fixed by Cᵢ.


With $N \ge \frac{4}{\epsilon^2(1/m)}\ln\frac{2}{\delta} = \frac{4m}{\epsilon^2}\ln\frac{2}{\delta}$, we obtain an estimate ρ̂ for the proportion of pairs (x, i) with f(x, i) = 1 that is within ε relative error of ρ with probability at least 1 − δ. Multiplying this by $\sum_i |S_i|$ then gives the desired count of satisfying assignments.


It’s worth noting that there’s nothing special about DNF formulas in
this method. Essentially the same trick will work for estimating the size of
the union of any collection of sets Si where we can (a) compute the size of
each Si ; (b) sample from each Si individually; and (c) test membership of
our sample x in Sj for j < i.
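Here is a compact Python sketch of the resulting estimator for #DNF (the clause representation, a dict from variable index to the required truth value, and the name count_dnf are choices made for this example):

import random

def count_dnf(clauses, n, trials=10000, rng=random):
    """Karp-Luby estimate of the number of satisfying assignments of a
    DNF formula over n variables.  We sample pairs (x, i) with x
    satisfying clause i, uniformly over the disjoint union of the S_i,
    and count a pair as good when i is the first clause x satisfies."""
    sizes = [2 ** (n - len(c)) for c in clauses]   # |S_i| = 2^(n - k_i)
    total = sum(sizes)
    good = 0
    for _ in range(trials):
        # Pick clause i with probability |S_i| / sum_j |S_j| ...
        i = rng.choices(range(len(clauses)), weights=sizes)[0]
        # ... then a uniform assignment satisfying C_i.
        x = [rng.random() < 0.5 for _ in range(n)]
        for v, val in clauses[i].items():
            x[v] = val
        # f(x, i) = 1 iff no earlier clause is satisfied by x.
        if not any(all(x[v] == val for v, val in clauses[j].items())
                   for j in range(i)):
            good += 1
    return total * good / trials

# (x0 AND x1) OR (NOT x0) has three satisfying assignments:
# print(count_dnf([{0: True, 1: True}, {0: False}], n=2))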

11.5 Approximating exponentially improbable events


For #KNAPSACK and #DNF, we saw how restricting our sampling to
a cleverly chosen sample space could boost the hit ratio ρ to something
that gave a reasonable number of samples using Lemma 11.2.1. For other
problems, it is often not clear how to do this directly, and the best sample
spaces we can come up with make our target points an exponentially small
fraction of the whole.
In these cases, it is sometimes possible to approximate this exponentially small fraction as a product of many more reasonable ratios. The idea is to express our target set as the last of a sequence of sets S₀, S₁, . . . , Sₖ, where we can compute the size of S₀ and can estimate |Sᵢ₊₁|/|Sᵢ| accurately for each i. This gives $|S_k| = |S_0| \cdot \prod_{i=0}^{k-1} \frac{|S_{i+1}|}{|S_i|}$, with a relative error that grows roughly linearly with k. Specifically, if we can approximate each |Sᵢ₊₁|/|Sᵢ| ratio to between 1 − ε and 1 + ε of the correct value, then the product of these ratios will be between (1 − ε)ᵏ and (1 + ε)ᵏ of the correct value; these bounds approach 1 − kε and 1 + kε in the limit as ε goes to zero, using the binomial theorem, although to get a real bound we will need to do more careful error analysis.

11.5.1 Matchings
We saw in §10.6.4.3 that a random walk on matchings on a graph with m edges has mixing time τ₂ ≤ 32m⁶, where the walk is defined by selecting an edge uniformly at random and flipping whether it is in the matching or not, while rejecting any steps that produce a non-matching. This allows us to sample matchings of a graph with δ total variation distance from the uniform distribution in $O\left(m^6\left(\log N + \log\frac{1}{\delta}\right)\right)$ time, where N is the number of matchings. Since every matching is a subset of the edges, we can crudely bound N ≤ 2ᵐ, which lets us rewrite the sampling time as $O\left(m^6\left(m + \log\frac{1}{\delta}\right)\right)$.
Suppose now that we want to count matchings instead of sampling them.
It’s easy to show that for any particular edge uv ∈ G, at least half of all
matchings in G don’t include uv: the reason is that if M is a matching in G,
then M 0 = M \ {uv} is also a matching, and at most two matchings M 0 and
M 0 ∪ {uv} are mapped to any one M 0 by this mapping.
Order the edges of G arbitrarily as e₁, e₂, . . . , eₘ. Let Sᵢ be the set of matchings in G \ {e₁ . . . eᵢ}. Then S₀ is the set of all matchings, and we've just argued that ρᵢ₊₁ = |Sᵢ₊₁|/|Sᵢ| ≥ 1/2. We also know that |Sₘ| counts the number of matchings in a graph with no edges, so it's exactly one. So we can use the product-of-ratios trick to compute $|S_0| = \prod_{i=0}^{m-1} \frac{|S_i|}{|S_{i+1}|}$.
A random walk of length $O\left(m^6\left(m + \log\frac{1}{\eta}\right)\right)$ can sample matchings from Sᵢ with a probability ρ′ of getting a matching in Sᵢ₊₁ that is between (1 − η)ρᵢ₊₁ and (1 + η)ρᵢ₊₁. From Lemma 11.2.1, we can estimate ρ′ within relative error γ with probability at least 1 − ζ using $O\left(\frac{1}{\gamma^2\rho'}\log\frac{1}{\zeta}\right) = O\left(\frac{1}{\gamma^2}\log\frac{1}{\zeta}\right)$ samples. Combined with the error on ρ′, this gives relative error at most γ + η + γη in $O\left(m^6\left(m + \log\frac{1}{\eta}\right)\frac{1}{\gamma^2}\log\frac{1}{\zeta}\right)$ operations.¹ If we then multiply
¹This is the point where sensible people start hauling out the Õ notation, where a function is Õ(f(n)) if it is O(f(n)g) where g is polylogarithmic in n and any other parameters that may be running around (1/ε, 1/η, etc.).
out all the estimates for |Sᵢ|/|Sᵢ₊₁|, we get an estimate of S₀ that is at most (1 + γ + η + γη)ᵐ times the correct value with probability at least 1 − mζ (with a similar bound on the other side), in total time $O\left(m^7\left(m + \log\frac{1}{\eta}\right)\frac{1}{\gamma^2}\log\frac{1}{\zeta}\right)$.
To turn this into a fully polynomial-time approximation scheme, given ε, δ, and m, we need to select η, γ, and ζ to get relative error ε with probability at least 1 − δ. Letting ζ = δ/m gets the δ part. For ε, we need (1 + γ + η + γη)ᵐ ≤ 1 + ε. Suppose that ε < 1 and let γ = η = ε/6m. Then
\[
(1 + \gamma + \eta + \gamma\eta)^m \le \left(1 + \frac{\epsilon}{2m}\right)^m \le e^{\epsilon/2} \le 1 + \epsilon.
\]
Plugging these values into our cost formula gives O(m⁸) times a bunch of factors that are polynomial in log m and 1/ε, which we can abbreviate as Õ(m⁸).

11.5.2 Other problems


Similar methods work for other problems that self-reduce by restricting
particular features of the solution. Examples include colorings (fix the color
of some vertex), independent sets (remove a vertex), and approximating the
volume of a convex body (take the intersection with a sphere of appropriate
radius; see [MR95, §11.4] for a slightly less sketchy description). We will not
go into details on any of these applications here.
Chapter 12

Hitting times

In addition to using Markov chains for sampling, we can also use Markov chains to model processes that we hope will eventually reach some terminating state. These Markov chains will generally not be irreducible, since often the terminating states are inescapable, and may or may not have the other desirable properties we needed for convergence analysis. So instead of looking at convergence to a stationary distribution that might not even exist, we will be mostly interested in the hitting time for some subset A of the states, defined as the minimum τ such that X_τ ∈ A, starting from some given initial distribution on X₀. As suggested by the notation, hitting times will be a special case of stopping times (see Chapter 9), and if we are very lucky we may be able to use the optional stopping theorem to bound them. If we are less lucky, we will use whatever tools we can.

12.1 Waiting times


The simplest case is when our Markov chain consists of two states: one where
something hasn’t happened yet, and one where it has. If the something
happens with probability p at each step, then τ has a geometric distribution,
and we can calculate E [τ ] using the usual conditional-probability argument
for a geometric random variable (see §3.6.2):

E [τ ] = p · 1 + (1 − p) · (1 + E [τ ])
= 1 + (1 − p) E [τ ]

which has the solution

E [τ ] = 1/p.


A more interesting case is when we have a Markov chain that we can


describe using a sequence of waiting times. For example, an epidemic
process involves a population of n agents of which initially only one is
infected, and at each step we pair two agents chosen uniformly at random,
and if one is infected and the other not, the uninfected agent becomes
infected. Such processes are used as the basis for gossip or epidemic
algorithms [DGH+ 87], where the infection corresponds to knowing some
piece of information we want to broadcast through the population.
If we let Xₜ be the number of agents infected after t steps, we have a Markov process where
\[
p_{k,k+1} = \frac{k(n-k)}{\binom{n}{2}}, \qquad p_{k,k} = 1 - p_{k,k+1},
\]
and all other transition probabilities are 0. So any trajectory of this process involves starting in 1, moving to 2 after some time, then 3, and so on until all n agents are infected.
The hitting time τₙ for n will be the sum of the waiting times for each of these steps. This gives
\[
\begin{aligned}
\operatorname{E}[\tau_n] &= \sum_{k=1}^{n-1} \frac{1}{p_{k,k+1}} \\
&= \sum_{k=1}^{n-1} \frac{\binom{n}{2}}{k(n-k)} \\
&= \binom{n}{2} \sum_{k=1}^{n-1} \left(\frac{1/n}{k} + \frac{1/n}{n-k}\right) \\
&= \frac{n-1}{2} \sum_{k=1}^{n-1} \left(\frac{1}{k} + \frac{1}{n-k}\right) \\
&= (n-1) H_{n-1}.
\end{aligned}
\]
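A quick Python simulation (invented names, not part of the analysis) makes it easy to check the (n − 1)H_{n−1} formula empirically:

import random

def epidemic_time(n, rng=random):
    """Simulate the epidemic process until all n agents are infected
    and return the number of steps taken.  By symmetry we can treat
    agents 0..infected-1 as the infected ones."""
    infected, steps = 1, 0
    while infected < n:
        steps += 1
        u, v = rng.sample(range(n), 2)        # pair two distinct agents
        if (u < infected) != (v < infected):  # one infected, one not
            infected += 1
    return steps

n = 50
avg = sum(epidemic_time(n) for _ in range(200)) / 200
print(avg, (n - 1) * sum(1 / k for k in range(1, n)))  # should be close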

12.2 Lyapunov functions


Waiting time analysis works well for processes like the epidemic in §12.1 that only go in one direction. But what if sometimes we go backwards, or if it's not even possible to arrange the states of our Markov chain neatly in one dimension? Here it can be helpful to construct a Lyapunov function, which is often called a potential function in the computer science literature. This is a function that reduces each state of a complex process to a single value in R, ideally producing some sort of semimartingale that drops in a predictable way, so that we can use it to get a bound on the time to reach our target set.
We've seen examples of this already in §9.4.1. For an unbiased random walk, we saw that the function Xₜ² grows by 1 on average, making Xₜ² − t a martingale. A similar result holds with Xₜ − (p − q)t for biased random walks. In both cases, we replace the original process with a new process whose behavior involves time as a term.
Let’s try this for a more complicated process. The rock-paper-scissors
process is similar to the epidemic process, except that instead of having two
states infected and uninfected, where infected overwrites uninfected when
agents in these states meet, we have three states with a cyclic dominance
relation. This means that when a rock agent meets a paper agent, we get
two paper agents (“paper covers rock”), and similarly paper gets replaced
by scissors (“scissors cuts paper”) and scissors gets replaced by rock (“rock
breaks scissors”). Meetings between agents in the same state have no effect.
As in the epidemic case, we choose pairs of agents to interact uniformly at
random from some population of size n.1
Processes like this have been extensively studied in a variety of models;
see [SMJ+ 14] for a survey. The rock-paper-scissors process in particular is
often modeled in its continuous limit as a system of differential equations:

r0 = rs − pr
p0 = pr − sp
s0 = sp − rs,

where r, p, and s represent the densities of rocks, papers, and scissors in the
population at some time and primes are used to indicate derivatives with
respect to time.
This system of differential equations does not have simple closed-form
solutions, but it does have some interesting invariants. If we compute
(r + p + s)0 = r0 + p0 + s0 = (rs − pr) + (pr − sp) + (sp − rs) = 0, we see that
the total concentration of agents r + p + s doesn’t change over time. This is
1
This is also a special case of the Lotka-Volterra model in population dynamics.
The translation is to convert rocks to wolves, paper to grass, and scissors to sheep, and
adopt the rules “sheep eats grass”, “wolf eats sheep”, and “grass starves wolf.”

not especially surprising. But we can also compute

(rps)0 = r0 ps + rp0 s + rps0


= (rs − pr)ps + rs(pr − sp) + rp(sp − rs)
= rps(s − p) + rps(r − s) + rps(p − r)
= 0.

This shows that the product of the three concentrations also doesn’t change
over time, giving a family of closed-loop trajectories in the continuous case.
This makes the rock-paper-scissors process a useful way to build an oscillator
in a dynamical system.
In the discrete case, we don't have infinitesimal rocks breaking infinitesimal scissors over infinitesimal time intervals. Instead, we have a Markov chain. The states of this Markov chain are triples of counts ⟨Rₜ, Pₜ, Sₜ⟩, and at each step we see transitions
\[
\begin{aligned}
\langle R, P, S\rangle &\to \langle R-1, P+1, S\rangle &&\text{with probability } RP\big/\tbinom{n}{2} \\
\langle R, P, S\rangle &\to \langle R, P-1, S+1\rangle &&\text{with probability } PS\big/\tbinom{n}{2} \\
\langle R, P, S\rangle &\to \langle R+1, P, S-1\rangle &&\text{with probability } SR\big/\tbinom{n}{2},
\end{aligned}
\]
with the remaining probability causing no change.


This is a finite Markov chain that is aperiodic because of the no-op transitions, but it's not irreducible. The three states ⟨n, 0, 0⟩, ⟨0, n, 0⟩, and ⟨0, 0, n⟩ all have no outgoing transitions: once we're all rocks, we all stay rocks forever. Furthermore there is a path from every configuration with n agents that isn't one of these three terminal configurations to one of these configurations. This means that eventually this process will converge to a uniform configuration, which is trouble if we want to use the process as an oscillator, but maybe not so bad if we are trying to achieve agreement. But whatever our goal is, we can ask how long this will take.
The rps invariant for the continuous process suggests looking for a similar
invariant or near-invariant for the discrete process. Let’s define a Lyapunov
function

Φ = RPS,

and look at the expected change in this function at each step.


Suppose paper covers rock, which occurs with probability $PR\big/\tbinom{n}{2}$. Then the change in Φ is
\[
\begin{aligned}
\Delta\Phi &= (R-1)(P+1)S - RPS \\
&= RPS + RS - PS - S - RPS \\
&= (R - P - 1)S.
\end{aligned}
\]
If we multiply this by the probability that the event occurs, we get a contribution to E [∆Φ] of $(R-P-1)\cdot RPS\big/\tbinom{n}{2}$. By symmetry, the other two transitions yield contributions of $(P-S-1)\cdot RPS\big/\tbinom{n}{2}$ and $(S-R-1)\cdot RPS\big/\tbinom{n}{2}$. Adding these up gives
\[
\operatorname{E}[\Delta\Phi] = \frac{RPS}{\binom{n}{2}}\bigl((R-P-1) + (P-S-1) + (S-R-1)\bigr),
\]
which very nicely cancels down to
\[
\operatorname{E}[\Delta\Phi] = RPS \cdot \frac{-3}{\binom{n}{2}}.
\]

What this tells us is that Φ drops, on average, by a Θ(n⁻²) fraction of its current value. We can express this as
\[
\operatorname{E}[\Phi_{t+1} \mid \Phi_t] = \Phi_t \left(1 - \frac{3}{\binom{n}{2}}\right),
\]
which, iterated, gives for any t
\[
\operatorname{E}[\Phi_t] = \Phi_0 \left(1 - \frac{3}{\binom{n}{2}}\right)^t. \tag{12.2.1}
\]
Alternatively we can use this to argue that
\[
Z_t = \Phi_t \left(1 - \frac{3}{\binom{n}{2}}\right)^{-t} \tag{12.2.2}
\]

is a martingale.
For proving an upper bound on the hitting time, the first formulation (12.2.1) is more useful. Using our old friend 1 + x ≤ eˣ, we can rewrite the bound as $\operatorname{E}[\Phi_t] \le \Phi_0 e^{-6t/n(n-1)}$. Now observe that Φ is never greater than (n/3)³ and never less than n − 2 as long as it is not zero. So we can compute
\[
\begin{aligned}
\Pr[\Phi_t \ne 0] &= \Pr[\Phi_t \ge n-2] \\
&\le \frac{\operatorname{E}[\Phi_t]}{n-2} \\
&\le \frac{(n/3)^3 e^{-6t/n(n-1)}}{n-2} \\
&= O\left(n^2 \exp\left(-\Theta(t/n^2)\right)\right).
\end{aligned}
\]

We can immediately see that there is some value t = O(n² log n) that knocks Pr [Φₜ ≠ 0] down to at most 1/2. As when converging to a stationary distribution, if we lose this coin-flip, we can restart the argument and try again. This gives an expected waiting time of O(n² log n) to reach Φ = 0.
We are not quite done. Having Φ = 0 only means that one of our three
species has disappeared; for full convergence, we need to lose two. But once
we are down to two remaining species (say rock and paper), we have an
epidemic process. From our previous analysis, we know that only O(n log n)
additional steps are needed to get down to one.
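For intuition, here is a small Python simulation sketch of the chain (the roughly balanced starting split and the function name are arbitrary choices for the example); by the analysis above, full fixation should take O(n² log n) steps on average:

import random

def rps_fixation_time(n, rng=random):
    """Simulate rock-paper-scissors until only one species remains,
    returning the number of steps.  Each step picks a uniform pair,
    which changes the state with probability (RP + PS + SR)/C(n,2)."""
    r, p, s = n - 2 * (n // 3), n // 3, n // 3
    steps = 0
    while sum(c > 0 for c in (r, p, s)) > 1:
        steps += 1
        x = rng.random() * (n * (n - 1) / 2)     # point in [0, C(n,2))
        if x < r * p:                            # paper covers rock
            r, p = r - 1, p + 1
        elif x < r * p + p * s:                  # scissors cuts paper
            p, s = p - 1, s + 1
        elif x < r * p + p * s + s * r:          # rock breaks scissors
            s, r = s - 1, r + 1
        # otherwise the chosen pair doesn't interact (no-op step)
    return steps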
It's worth noting that hitting times are generally stopping times. So perhaps there is a cleaner argument using (12.2.2) and the optional stopping theorem (see Theorem 9.3.1). Let τ be the first time at which Φ_τ = 0. We've already shown that E [τ] is finite, so the bounded-increments case of Theorem 9.3.1 is tempting. But if we get unlucky and Φₜ₊₁ = Φₜ ≠ 0 for some large t, Zₜ can increase by an arbitrarily large amount. The bounded time and bounded range cases also don't apply. It's tempting to see if the general case works, but since we know that Z_τ = 0 ≠ Z₀ and the OST implies E [Z_τ] = Z₀, we can rule out applying the theorem to this particular martingale and stopping time no matter how clever we get. What we can do is use a fixed time t and compute E [Zₜ] = Z₀ = Φ₀, but at this point we are just reinventing (12.2.1).
To salve our disappointment, let's get a lower bound on the expected fixation time. Use 1 − x ≥ e⁻²ˣ, which holds for 0 ≤ x ≤ 1/2 at least, to get
\[
\operatorname{E}[\Phi_t] = \Phi_0 \left(1 - \Theta(1/n^2)\right)^t \ge \Phi_0 \exp\left(-\Theta(t/n^2)\right).
\]
If we start with Φ₀ = Θ(n³), there is some t = Θ(n²) such that E [Φₜ] is


still Θ(n³), just with a smaller constant. But then we can argue
\[
\begin{aligned}
\Theta(n^3) &= \operatorname{E}[\Phi_t] \\
&= \operatorname{E}[\Phi_t \mid \Phi_t \ne 0] \Pr[\Phi_t \ne 0] \\
&\le O(n^3) \Pr[\Phi_t \ne 0],
\end{aligned}
\]
from which Pr [Φₜ ≠ 0] = Ω(1). So there is a constant probability of not reaching Φₜ = 0 after Θ(n²) steps, which gives an expected hitting time for Φ = 0 of Ω(n²). This bound is missing a log factor compared to our upper bound. I suspect that this could be improved with a better analysis.

12.3 Drift analysis


The Lyapunov/potential function approach has two main steps: (a) finding a good Lyapunov function and (b) using it to get a bound on hitting time. The first step can be frustrating because we know that a good Lyapunov function always exists, since we can in principle just take the expected hitting time from the current state. But finding a good, clean Lyapunov function that we can write down and reason about is tougher. The second step can involve a lot of awkward futzing with supermartingales and such, and we can avoid some of this awkwardness by letting other people package it up in a few convenient theorems. For this purpose, it's helpful to look at the literature on drift analysis, which mostly shows up in the context of evolutionary algorithms, a class of optimization algorithms modeled on natural evolutionary processes. In this section, we'll describe some of the results in this area, largely following a survey by Lengler [Len20]. Many of these results are at their core repackagings of the optional stopping theorem, but it can be convenient not to have to go through the details of the OST every time we want to use them.
As before, the idea is that we will come up with some one-dimensional stochastic process {Xₜ} on the non-negative reals $\mathbb{R}^+_0$ that is related to the process we are trying to bound. Our hope is that Xₜ either drops by a consistent additive amount or a consistent multiplicative factor on average at each step.
For additive drift, we have the following:

Theorem 12.3.1 ([Len20, Theorem 2.3.1], adapted from [HY04]). Let (Xₜ)ₜ≥₀ be a sequence of non-negative random variables with a finite state space $S \subseteq \mathbb{R}^+_0$ such that 0 ∈ S. Let T = inf {t ≥ 0 | Xₜ = 0}.
1. If there exists δ > 0 such that for all s ∈ S \ {0} and for all t ≥ 0, ∆ₜ(s) = E [Xₜ − Xₜ₊₁ | Xₜ = s] ≥ δ, then
\[
\operatorname{E}[T] \le \frac{\operatorname{E}[X_0]}{\delta}.
\]
2. If there exists δ > 0 such that for all s ∈ S \ {0} and for all t ≥ 0, ∆ₜ(s) = E [Xₜ − Xₜ₊₁ | Xₜ = s] ≤ δ, then
\[
\operatorname{E}[T] \ge \frac{\operatorname{E}[X_0]}{\delta}.
\]
The proof of Theorem 12.3.1 is a fairly simple application of the optional stopping theorem. Depending on which case we are in, Xₜ + tδ is either a supermartingale or submartingale; we have a bounded range because of the assumption that S is finite; and we have finite time in the supermartingale case because the downward drift and the finite state space imply that there is always a path to 0 that we can take with nonzero probability, and in the submartingale case because if E [T] is infinite, the bound holds trivially. So in either case we are looking at arguments that we've done before.
Where the theorem is helpful is that it saves us from having to repeat
these arguments. For example, in the case of a biased random walk with reflecting barriers at 0 and n and a probability p > 1/2 of dropping at each step, we can compute E [Xₜ − Xₜ₊₁ | Xₜ = s] ≥ p − q, where q = 1 − p (note that the convention here is that we are tracking the expected drop, which is the negative of the expected change), and instantly get E [T] ≤ n/(p − q) for any starting
point X0 ≤ n. In the other direction we are not so lucky: the reflecting
barrier at n means that E [Xt − Xt+1 | Xt = n] = 1, so the best bound we
get starting at X0 = n is E [T ] ≥ n, which we don’t really need the theorem
to get.
In some cases, it can be hard to find a Lyapunov function with constant
additive drift. Rescaling may help in this case: the idea is to stretch the
Lyapunov function locally by multiplying by the inverse of the expected
drop, so that the new stretched function has expected drop at least 1. This
trick has been reinvented several times in different contexts, dating back at
least to the probabilistic recurrence relations of Karp et al. [KUW88].
We’ll quote the version from the evolutionary analysis literature given by
Lengler [Len20], which is known as the variable drift theorem:
Theorem 12.3.2 ([Len20, Theorem 2.3.3], adapted from [Joh10, RS14]). Let (Xₜ)ₜ≥₀ be a sequence of non-negative random variables with a finite state space $S \subseteq \mathbb{R}^+_0$ such that 0 ∈ S. Let smin = min(S \ {0}), let T = inf {t ≥ 0 | Xₜ = 0}, and for t ≥ 0 and s ∈ S let ∆ₜ(s) = E [Xₜ − Xₜ₊₁ | Xₜ = s]. If there is an increasing function $h : \mathbb{R}^+ \to \mathbb{R}^+$ such that for all s ∈ S \ {0} and all t ≥ 0,
\[
\Delta_t(s) \ge h(s),
\]
then
\[
\operatorname{E}[T] \le \frac{s_{\min}}{h(s_{\min})} + \operatorname{E}\left[\int_{s_{\min}}^{X_0} \frac{1}{h(\sigma)}\,d\sigma\right],
\]
where the expectation in the latter term is over the random choice of X₀.

For the full proof, see [Len20, Theorem 2.3.3] or the original [Joh10, Theorem 4.6]. The intuition (using the notation of [Len20]) is that we can apply Theorem 12.3.1 to a function
\[
g(s) = \begin{cases}
\dfrac{s_{\min}}{h(s_{\min})} + \displaystyle\int_{s_{\min}}^{s} \frac{1}{h(\sigma)}\,d\sigma & \text{when } s \ge s_{\min}, \\[1ex]
\dfrac{s}{h(s_{\min})} & \text{when } s \le s_{\min}.
\end{cases}
\]
The slope of this function for large s is 1/h(s), which means that the expected change in g(s) when we drop by h(s) is roughly h(s) · (1/h(s)) = 1. The exact bound requires observing that the integral is concave and applying Jensen's inequality. For s close to smin, the linear term covers any drop that goes below smin (where we don't care about h(s), because we are never going to land there).
The actual proof does a case analysis on all possible changes in s depending on whether they involve going up, going down but staying above smin, or going down and crossing to 0. Adding up all of these cases shows an expected drift for g(s) of at least 1, reducing to Theorem 12.3.1. The somewhat messy bound is just what we get when we run the bound from the theorem back through g.
A special case of variable drift is multiplicative drift, where ∆ₜ(s) ≥ δs for some constant δ > 0. In this case the bound in Theorem 12.3.2 simplifies to
\[
\operatorname{E}[T] \le \frac{1 + \operatorname{E}[\ln(X_0/s_{\min})]}{\delta}. \tag{12.3.1}
\]
For example, we showed in §12.2 that the function RPS for the rock-paper-scissors process has multiplicative drift with $\delta = 3\big/\tbinom{n}{2}$, smin = n − 2,
and X₀ ≤ (n/3)³. Plugging these values into (12.3.1) gives a bound on the time to reach RPS = 0 of $\frac{1 + \ln\left((n/3)^3/(n-2)\right)}{3/\binom{n}{2}} \approx \frac{1}{3} n^2 \ln n$. So we can get the same bound as before without having to go through all the intermediate steps.
There are many drift theorem results in the literature, including lower
bounds and tail inequalities. The Lengler survey [Len20] summarizes many
of these results. Another good source for Lyapunov-style bounds in general
is the textbook of Menshikov et al. [MPW16].
Chapter 13

The probabilistic method

The probabilistic method is a tool for proving the existence of objects with
particular combinatorial properties, by showing that some process generates
these objects with nonzero probability.
The relevance of this to randomized algorithms is that in some cases we can make the probability large enough that we can actually produce such objects.
We'll mostly be following Chapter 5 of [MR95] with some updates for more recent results. If you'd like to read more about these techniques, a classic reference on the probabilistic method in combinatorics is the text of Alon and Spencer [AS92].

13.1 Randomized constructions and existence proofs


Suppose we want to show that some object exists, but we don’t know how to
construct it explicitly. One way to do this is to devise some random process for
generating objects, and show that the probability that it generates the object
we want is greater than zero. This implies that something we want exists,
because otherwise it would be impossible to generate; and it works even if the
nonzero probability is very, very small. The systematic development of the
method is generally attributed to the notoriously productive mathematician
Paul Erdős and his frequent collaborator Alfréd Rényi.
From an algorithmic perspective, the probabilistic method is useful
mainly when we can make the nonzero probability substantially larger than
zero—and especially if we can recognize when we’ve won. But sometimes
just demonstrating the existence of an object is a start.
We give a couple of examples of the probabilistic method in action below.

In each case, the probability that we get a good outcome is actually pretty
high, so we could in principle generate a good outcome by retrying our
random process until it works. There are some more complicated examples
of the method for which this doesn’t work, either because the probability of
success is vanishingly small, or because we can’t efficiently test whether what
we did succeeded (the last example below may fall into this category). This
means that we often end up with objects whose existence we can demonstrate
even though we can’t actually point to any examples of them. For example,
it is known that there exist sorting networks (a special class of circuits
for sorting numbers in parallel) that sort in time O(log n), where n is the number of values being sorted [AKS83], and these can be generated randomly with nonzero probability. But the best explicit constructions of
such networks take time Θ(log2 n), and the question of how to find an explicit
network that achieves O(log n) time has been open for decades despite many
efforts to solve it.

13.1.1 Set balancing


Here we have a collection of vectors v₁, v₂, . . . , vₙ in {0, 1}ᵐ. We'd like to find ±1 coefficients ε₁, ε₂, . . . , εₙ that minimize the discrepancy maxⱼ |Xⱼ|, where $X_j = \sum_{i=1}^n \epsilon_i v_{ij}$.
This is called set balancing because we are trying to balance attributes (represented by the vectors) between two sets (represented by the coefficients). We can also take the transpose and imagine that we have a collection of sets (represented by the columns of the matrix made up of the vᵢ) that we are trying to divide in half as evenly as possible by assigning their common elements to the −1 and +1 piles.
If we choose the εᵢ randomly, Hoeffding's inequality (5.3.2) says for each fixed j that Pr [|Xⱼ| > t] < 2 exp(−t²/2n) (since there are at most n non-zero values vᵢⱼ). Setting 2 exp(−t²/2n) ≤ 1/m gives $t \ge \sqrt{2n \ln 2m}$. So by the union bound, we have $\Pr\left[\max_j |X_j| > \sqrt{2n \ln 2m}\right] < 1$: a solution exists.
Since we only proved that the probability of winning is nonzero, this doesn't necessarily give us a good solution. In §14.4.3 we will revisit this problem and show how to remove the randomness while still getting the same bound.
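A small Python experiment (invented names; no part of the proof) shows random coefficients typically landing under the √(2n ln 2m) bound:

import math
import random

def random_discrepancy(vectors):
    """Choose random +-1 coefficients and return the discrepancy
    max_j |X_j| that they produce."""
    n, m = len(vectors), len(vectors[0])
    eps = [random.choice((-1, 1)) for _ in range(n)]
    return max(abs(sum(eps[i] * vectors[i][j] for i in range(n)))
               for j in range(m))

n = m = 100
vecs = [[random.randint(0, 1) for _ in range(m)] for _ in range(n)]
print(random_discrepancy(vecs), math.sqrt(2 * n * math.log(2 * m)))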

13.1.2 Ramsey numbers


Consider a collection of n schoolchildren, and imagine that each pair of
schoolchildren either like each other or dislike each other. We assume that

these preferences are symmetric: if x likes y, then y likes x, and similarly


if x dislikes y, y dislikes x. Let R(k, h) be the smallest value for n that
ensures that among any group of n schoolchildren, there is either a subset of
k children that all like each other or a subset of h children that all dislike
each other.1
It is not hard to show that R(k, h) is finite for all k and h.2 The exact
value of R(k, h) is known only for small values of k and h.3 But we can use
the probabilistic method to show that for k = h, it is reasonably large. The
following theorem is due to Erdős, and was the first known lower bound on
R(k, k).

Theorem 13.1.1 ([Erd47]). If k ≥ 3, then R(k, k) > 2^{k/2}.

Proof. Suppose each pair of schoolchildren flip a fair coin to decide whether they like each other or not. Then the probability that any particular set of k schoolchildren all like each other is $2^{-\binom{k}{2}}$ and the probability that they all dislike each other is the same. Summing over both possibilities and all subsets gives a bound of $\binom{n}{k} 2^{1-\binom{k}{2}}$ on the probability that there is at least one subset in which everybody likes everybody or everybody dislikes everybody.


1
In terms of graphs, any graph G with at least R(k, h) nodes contains either a clique
of size k or an independent set of size h.
2
A simple proof, due to Erdős and Szekeres [ES35], proceeds by showing that R(k, h) ≤
R(k − 1, h) + R(k, h − 1). Given a graph G of size at least R(k − 1, h) + R(k, h − 1), choose
a vertex v and partition the graph into two induced subgraphs G1 and G2 , where G1
contains all the vertices adjacent to v and G2 contains all the vertices not adjacent to v.
Either |G1 | ≥ R(k − 1, h) or |G2 | ≥ R(k, h − 1). If |G1 | ≥ R(k − 1, h), then G1 contains
either a clique of size k − 1 (which makes a clique of size k in G when we add v to it)
or an independent set of size h (which is also in G). Alternatively, if |G2 | ≥ R(k, h − 1),
then G2 either gives us a clique of size k by itself or an independent set of size h after
adding v. Together with the fact that R(1, h) = R(k, 1) = 1, this recurrence gives $R(k, h) \le \binom{(k-1)+(h-1)}{k-1}$.
3
There is a fairly current table at http://en.wikipedia.org/wiki/Ramsey’s_Theorem.
Some noteworthy values are R(3, 3) = 6, R(4, 4) = 18, and 43 ≤ R(5, 5) ≤ 48. One
problem with computing exact values is that as R(k, h) grows, the number of graphs one needs to consider gets very big. There are $2^{\binom{n}{2}}$ graphs with n vertices, and even detecting
the presence or absence of a moderately-large clique or independent set in such a graph
can be expensive. This pretty much rules out any sort of brute-force approach based on
simply enumerating candidate graphs. But even smarter approaches have yielded only slow
progress: for example, the upper bound on R(5, 5) was only improved from 49 [MR97] to
48 [AM18] after roughly two decades of work.
For n = 2^{k/2}, we have
\[
\begin{aligned}
\binom{n}{k} 2^{1-\binom{k}{2}} &\le \frac{n^k}{k!} 2^{1-\binom{k}{2}} \\
&= \frac{2^{k^2/2 + 1 - k(k-1)/2}}{k!} \\
&= \frac{2^{k^2/2 + 1 - k^2/2 + k/2}}{k!} \\
&= \frac{2^{1 + k/2}}{k!} \\
&< 1.
\end{aligned}
\]

Because the probability that there is an all-liking or all-hating subset is less than 1, there must be some chance that we get a collection that doesn't have one. So such a collection exists. It follows that R(k, k) > 2^{k/2}, because we have shown that not all collections at n = 2^{k/2} have the Ramsey property.
The last step in the proof uses the fact that 2^{1+k/2} < k! for k ≥ 3, which can be tested explicitly for k = 3 and proved by induction for larger k. The resulting bound is a little bit weaker than just saying that n must be large enough that $\binom{n}{k} 2^{1-\binom{k}{2}} \ge 1$, but it's easier to use.


The proof can be generalized to the case where k ≠ h by tweaking the bounds and probabilities appropriately. Note that even though this process generates a graph with no large cliques or independent sets with reasonably high probability, we don't have any good way of testing the result, since testing for the existence of a clique is NP-hard.

13.2 Approximation algorithms


One use of a randomized construction is to approximate the solution to
an otherwise difficult problem. In this section, we start with a trivial
approximation algorithm for the largest cut in a graph, and then show a
more powerful randomized approximation algorithm, due to Goemans and
Williamson [GW94], that gets a better approximation ratio for a much larger
class of problems.

13.2.1 MAX CUT


We’ve previously seen (§2.3.3.2) a randomized algorithm for finding small
cuts in a graph. What if we want to find a large cut?
Here is a particularly brainless algorithm that finds a large cut. For each vertex, flip a coin: if the coin comes up heads, put the vertex in S; otherwise, put it in T. For each edge, there is a probability of exactly 1/2 that it is included in the S–T cut. It follows that the expected size of the cut is exactly m/2.
One consequence of this is that every graph has a cut that includes at least half the edges. Another is that this algorithm finds such a cut, with probability at least 1/(m+1). To prove this, let X be the random variable representing the number of edges in the cut, and let p be the probability that X ≥ m/2. Then
\[
\begin{aligned}
m/2 &= \operatorname{E}[X] \\
&= (1-p)\operatorname{E}[X \mid X < m/2] + p\operatorname{E}[X \mid X \ge m/2] \\
&\le (1-p)\frac{m-1}{2} + pm.
\end{aligned}
\]
Solving this for p gives the claimed bound.4
By running this enough times to get a good cut, we get a polynomial-time
randomized algorithm for approximating the maximum cut within a factor
of 2, which is pretty good considering that MAX CUT is NP-hard.
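The algorithm is simple enough that a Python sketch fits in a few lines (names invented for the example; the retry loop implements the amplification just described):

import random

def approx_max_cut(n, edges, rng=random):
    """Assign each vertex a random side and retry until at least m/2
    edges cross the cut; each attempt succeeds with probability at
    least 1/(m+1), so O(m) attempts suffice in expectation."""
    m = len(edges)
    while True:
        side = [rng.getrandbits(1) for _ in range(n)]
        if sum(side[u] != side[v] for u, v in edges) >= m / 2:
            return side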
There exist better approximation algorithms. Goemans and Williamson [GW95]
give a 0.87856-approximation algorithm for MAX CUT based on randomized
rounding of a semidefinite program.5 The analysis of this algorithm is a
little involved, so we won’t attempt to show this here, but we will describe
(in §13.2.2) an earlier result, also due to Goemans and Williamson [GW94],
that gives a 43 -approximation to MAX SAT using a similar technique.
4
This is tight for m = 1, but I suspect it’s an underestimate for larger m. The
main source of slop in the analysis seems to be the step E [X | X ≥ m/2] ≤ m; using a
concentration bound, we should be able to show a much stronger inequality here and thus
a much larger lower bound on p.
5
Semidefinite programs are like linear programs except that the variables are vectors instead of scalars, and the objective function and constraints apply to linear combinations of dot-products of these variables. The Goemans-Williamson MAX CUT algorithm is based on a relaxation of the integer optimization problem of maximizing $\sum_{ij \in E} \frac{1 - x_i x_j}{2}$, where each xᵢ ∈ {−1, +1} encodes membership in S or T. They instead allow xᵢ to be any unit vector in an n-dimensional space, and then take the sign of the dot-product with a random unit vector r to map each optimized xᵢ to one side of the cut or the other.
CHAPTER 13. THE PROBABILISTIC METHOD 239

13.2.2 MAX SAT


Like MAX CUT, MAX SAT is an NP-hard optimization problem that sounds
like a very short story about Max. We are given a satisfiability problem
in conjunctive normal form: as a conjunction (AND) of m clauses, each
of which is the OR of a bunch of variables or their negations. We want
to choose values for the n variables that satisfy as many of the clauses
as possible, where a clause is satisfied if it contains a true variable or the
negation of a false variable.6
We can instantly satisfy at least m/2 clauses on average by assigning
values to the variables independently and uniformly at random; the analysis
is the same as in §13.2.1 for large cuts, since a random assignment makes
the first variable in a clause true with probability 1/2. Since this approach
doesn’t require thinking and doesn’t use the fact that many of our clauses
may have more than one variable, we can probably do better. Except we
can’t do better in the worst case, because it might be that our clauses consist
entirely of x and ¬x for some variable x; clearly, we can only satisfy half of
these. We could do better if we knew all of our clauses consisted of at least
k distinct literals (satisfied with probability 1 − 2−k ), but we can’t count on
this. We also can’t count on clauses not being duplicated, so it may turn
out that skipping a few hard-to-satisfy small clauses hurts us if they occur
many times.
Our goal will be to get a good approximation ratio, defined as the
ratio between the number of clauses we manage to satisfy and the actual
maximum that can be satisfied. The tricky part in designing approximation
algorithms is showing that the denominator here won’t be too big. We
can do this using a standard trick, of expressing our original problem as
an integer program and then relaxing it to a linear program7 whose
6
The presentation here follows [MR95, §5.2], which in turn is mostly based on a classic
paper of Goemans and Williamson [GW94].
7
A linear program is an optimization problem where we want to maximize (or
minimize) some linear objective function of the variables subject to linear-inequality
constraints. A simple example would be to maximize x + y subject to 2x + y ≤ 1 and x + 3y ≤ 1; here the optimal solution is the assignment x = 2/5, y = 1/5, which sets the objective function to its maximum possible value 3/5. An integer program is a linear
program where some of the variables are restricted to be integers. Determining if an integer
program even has a solution is NP-complete; in contrast, linear programs can be solved in
polynomial time. We can relax an integer program to a linear program by dropping the
requirements that variables be integers; this will let us find a fractional solution that is
at least as good as the best integer solution, but might be undesirable because it tells us
to do something that is ludicrous in the context of our original problem, like only putting
half a passenger on the next plane.
CHAPTER 13. THE PROBABILISTIC METHOD 240

solution doesn’t have to consist of integers. We then convert the fractional


solution back to an integer solution by rounding some of the variables
to integer values randomly in a way that preserves their expectations, a
technique known as randomized rounding.8
Here is the integer program (taken from [MR95, §5.2]). We let zⱼ ∈ {0, 1} represent whether clause Cⱼ is satisfied, and let yᵢ ∈ {0, 1} be the value of variable xᵢ. We also let Cⱼ⁺ and Cⱼ⁻ be the sets of variables that appear in Cⱼ with and without negation. The problem is to maximize
\[
\sum_{j=1}^m z_j
\]
subject to
\[
\sum_{i \in C_j^+} y_i + \sum_{i \in C_j^-} (1 - y_i) \ge z_j
\]
for all j.
The main trick here is to encode OR in the constraints; there is no
requirement that zj is the OR of the yi and (1 − yi ) values, but we maximize
the objective function by setting it that way.
Sadly, solving integer programs like the above is NP-hard (which is not
surprising, since if we could solve this particular one, we could solve SAT).
But if we drop the requirements that yi , zj ∈ {0, 1} and replace them with
0 ≤ yi ≤ 1 and 0 ≤ zj ≤ 1, we get a linear program—solvable in polynomial
time—with an optimal value at least as good as the value for the integer
program, for the simple reason that any solution to the integer program is
also a solution to the linear program.
The problem now is that the solution to the linear program is likely to
be fractional: instead of getting useful 0–1 values, we might find out we are
supposed to make xi only 2/3 true. So we need one more trick to turn the
Linear programming has an interesting history. The basic ideas were developed indepen-
dently by Leonid Kantorovich in the Soviet Union and George Dantzig in the United States
around the start of the Second World War. Kantorovich’s work had direct relevance to
Soviet planning problems, but wasn’t pursued seriously because it threatened the political
status of the planners, required computational resources that weren’t available at the
time, and looked suspiciously like trying to sneak a capitalist-style price system into the
planning process; for a fictionalized account of this tragedy, see [Spu12]. Dantzig’s work,
which included the development of the simplex method for solving linear programs, had
a higher impact, although its publication was delayed until 1947 by wartime secrecy.
8
Randomized rounding was invented by Raghavan and Thompson [RT87]; the particular
application here is due to Goemans and Williamson [GW94].

fractional values back into integers. This is the randomized rounding step:
given a fractional assignment ŷi , we set xi to true with probability ŷi .
So what does randomized rounding do to clauses? In our fractional
solution, a clause might have value ẑj , obtained by summing up bits and
pieces of partially-true variables. We’d like to argue that the rounded version
gives a similar probability that Cj is satisfied.
Suppose Cⱼ has k variables; to make things simpler, we'll pretend that Cⱼ is exactly x₁ ∨ x₂ ∨ . . . ∨ xₖ. Then the probability that Cⱼ is satisfied is exactly $1 - \prod_{i=1}^k (1 - \hat{y}_i)$. This quantity is minimized subject to $\sum_{i=1}^k \hat{y}_i \ge \hat{z}_j$ by setting all ŷᵢ equal to ẑⱼ/k (an easy application of Lagrange multipliers, or it can be shown using a convexity argument). Writing z for ẑⱼ, this gives
\[
\begin{aligned}
\Pr[C_j \text{ is satisfied}] &= 1 - \prod_{i=1}^k (1 - \hat{y}_i) \\
&\ge 1 - \prod_{i=1}^k (1 - z/k) \\
&= 1 - (1 - z/k)^k \\
&\ge z(1 - (1 - 1/k)^k) \\
&\ge z(1 - 1/e).
\end{aligned}
\]
The second-to-last step looks like a typo, but it actually works. The idea is to observe that the function f(z) = 1 − (1 − z/k)ᵏ is concave (Proof: $\frac{d^2}{dz^2} f(z) = -\frac{k-1}{k}(1 - z/k)^{k-2} < 0$), while g(z) = z(1 − (1 − 1/k)ᵏ) is linear, so since f(0) = 0 = g(0) and f(1) = 1 − (1 − 1/k)ᵏ = g(1), any point in between must have f(z) ≥ g(z). An example of this argument for k = 3 is depicted in Figure 13.1.
Since each clause is satisfied with probability at least ẑⱼ(1 − 1/e), the expected number of satisfied clauses is at least $(1 - 1/e)\sum_j \hat{z}_j$, which is at least (1 − 1/e) times the optimum. This gives an approximation ratio of slightly more than 0.632, which is better than 1/2, but still kind of weak.
So now we apply the second trick from [GW94]: we’ll observe that, on a
per-clause basis, we have a randomized rounding algorithm that is good at
satisfying small clauses (the coefficient (1 − (1 − 1/k)k ) goes all the way up
to 1 when k = 1), and our earlier dumb algorithm that is good at satisfying
big clauses. We can’t combine these directly (the two algorithms demand
different assignments), but we can run both in parallel and take whichever
gives a better result.
Figure 13.1: Tricky step in MAX SAT argument, showing (1 − (1 − z/k)ᵏ) ≥ z(1 − (1 − 1/k)ᵏ) when 0 ≤ z ≤ 1. The case k = 3 is depicted.

To show that this works, let Xⱼ be the indicator for the event that clause j is satisfied by the randomized-rounding algorithm and Yⱼ the indicator for the

event that it is satisfied by the simple algorithm. Then if Cⱼ has k literals,
\[
\begin{aligned}
\operatorname{E}[X_j] + \operatorname{E}[Y_j] &\ge (1 - 2^{-k}) + (1 - (1 - 1/k)^k)\hat{z}_j \\
&\ge \left((1 - 2^{-k}) + (1 - (1 - 1/k)^k)\right)\hat{z}_j \\
&= \left(2 - 2^{-k} - (1 - 1/k)^k\right)\hat{z}_j.
\end{aligned}
\]
The coefficient here is exactly 3/2 when k = 1 or k = 2, and rises thereafter, so for integer k we have E [Xⱼ] + E [Yⱼ] ≥ (3/2)ẑⱼ. Summing over all j then gives $\operatorname{E}\left[\sum_j X_j\right] + \operatorname{E}\left[\sum_j Y_j\right] \ge (3/2)\sum_j \hat{z}_j$. But then one of the two expected sums must beat $(3/4)\sum_j \hat{z}_j$, giving us a (3/4)-approximation algorithm.
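As a sketch of what the combined algorithm might look like in Python (we assume the fractional LP solution ŷ has already been computed by some LP solver, which is outside this sketch; all names here are invented):

import random

def satisfied(clauses, assignment):
    """Count satisfied clauses; a clause is a list of (variable index,
    polarity) literals, satisfied if any literal matches."""
    return sum(any(assignment[v] == pol for v, pol in c) for c in clauses)

def best_of_two(clauses, y_hat, rng=random):
    """Run the uniform assignment (good for big clauses) and the
    LP-guided randomized rounding (good for small ones) in parallel,
    and keep whichever satisfies more clauses."""
    n = len(y_hat)
    uniform = [rng.random() < 0.5 for _ in range(n)]
    rounded = [rng.random() < y_hat[i] for i in range(n)]
    return max(uniform, rounded, key=lambda a: satisfied(clauses, a))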

13.3 The Lovász Local Lemma


Suppose we have a finite set of bad events A, and we want to show that with nonzero probability, none of these events occur. Formally, we want to show $\Pr\left[\bigcap_{A \in \mathcal{A}} \bar{A}\right] > 0$.
Our usual trick so far has been to use the union bound (4.2.1) to show that $\sum_{A \in \mathcal{A}} \Pr[A] < 1$. But this only works if the events are actually improbable. If the union bound doesn't work, we might be lucky enough to have the events be independent; in this case, $\Pr\left[\bigcap_{A \in \mathcal{A}} \bar{A}\right] = \prod_{A \in \mathcal{A}} \Pr\left[\bar{A}\right] > 0$, as long as each event Ā occurs with positive probability. But most of the time,
we’ll find that the events we care about aren’t independent, so this won’t
work either.
The Lovász Local Lemma [EL75] handles a situation intermediate
between these two extremes, where events are generally not independent of
each other, but each collection of events that are not independent of some
particular event A has low total probability. In the original version, it’s
non-constructive: the lemma shows a nonzero probability that none of the
events occur, but this probability may be very small if we try to sample
the events at random, and there is no guidance for how to find a particular
outcome that makes all the events false.
Subsequent work [Bec91, Alo91, MR98, CS00, Sri08, Mos09, MT10]
showed how, when the events A are determined by some underlying set
of independent variables and independence between two events is detected
by having non-overlapping sets of underlying variables, an actual solution
could be found in polynomial expected time. The final result in this series,
due to Moser and Tardos [MT10], gives the same bounds as in the original
non-constructive lemma, using the simplest algorithm imaginable: whenever
some bad event A occurs, try to get rid of it by resampling all of its variables,
and continue until no bad events are left.

13.3.1 General version


A formal statement of the general lemma is:9

Lemma 13.3.1. Let $\mathcal{A} = \{A_1, \ldots, A_m\}$ be a finite collection of events on some probability space, and for each $A \in \mathcal{A}$, let $\Gamma(A)$ be a set of events such that $A$ is independent of all events not in $\Gamma^+(A) = \{A\} \cup \Gamma(A)$. If there exist real numbers $0 < x_A < 1$ such that, for all events $A \in \mathcal{A}$,

\[ \Pr[A] \le x_A \prod_{B\in\Gamma(A)} (1-x_B), \tag{13.3.1} \]

then

\[ \Pr\left[\bigcap_{A\in\mathcal{A}} \bar{A}\right] \ge \prod_{A\in\mathcal{A}} (1-x_A). \tag{13.3.2} \]

In particular, this means that the probability that none of the $A_i$ occur is not zero, since we have assumed $x_{A_i} < 1$ holds for all $i$.
The role of $x_A$ in the original proof is to act as an upper bound on the probability that $A$ occurs given that some collection of other events doesn't occur. For the constructive proof, the $x_A$ are used to show a bound on the number of resampling steps needed until none of the $\mathcal{A}$ occur.

9 This version is adapted from [MT10].

13.3.2 Symmetric version


For many applications, having to come up with the $x_A$ values can be awkward. The following symmetric version is often used instead:

Corollary 13.3.2. Let $\mathcal{A}$ and $\Gamma$ be as in Lemma 13.3.1. Suppose that there are constants $p$ and $d$, such that for all $A \in \mathcal{A}$, we have $\Pr[A] \le p$ and $|\Gamma(A)| \le d$. Then if $ep(d+1) < 1$, $\Pr\left[\bigcap_{A\in\mathcal{A}} \bar{A}\right] \ne 0$.

Proof. Basically, we are going to pick a single value $x$ such that $x_A = x$ for all $A$ in $\mathcal{A}$, and (13.3.1) is satisfied. This works as long as $p \le x(1-x)^d$, as in this case we have, for all $A$, $\Pr[A] \le p \le x(1-x)^d \le x(1-x)^{|\Gamma(A)|} = x_A \prod_{B\in\Gamma(A)} (1-x_B)$.
For fixed $d$, $x(1-x)^d$ is maximized using the usual trick: $\frac{d}{dx}\, x(1-x)^d = (1-x)^d - xd(1-x)^{d-1} = 0$ gives $(1-x) - xd = 0$ or $x = \frac{1}{d+1}$. So now we need $p \le \frac{1}{d+1}\left(1-\frac{1}{d+1}\right)^d$. It is possible to show that $1/e < \left(1-\frac{1}{d+1}\right)^d$ for all $d \ge 0$.10 So $ep(d+1) \le 1$ implies $p \le \frac{1}{e(d+1)} \le \left(1-\frac{1}{d+1}\right)^d \frac{1}{d+1} \le x(1-x)^{|\Gamma(A)|}$ as required by Lemma 13.3.1.

10 Observe that $\lim_{d\to\infty} \left(1-\frac{1}{d+1}\right)^d = e^{-1}$ and that $f(d) = \left(1-\frac{1}{d+1}\right)^d$ is a decreasing function. The last part can be shown by taking the derivative of $\log f(d)$. See the solution to Problem G.1.4 for the gory details.

13.3.3 Applications
Here we give some simple applications of the lemma.

13.3.3.1 Graph coloring


Let's start with a problem where we know what the right answer should be. Suppose we want to color the vertices of a cycle with $c$ colors, so that no edge has two endpoints with the same color. How many colors do we need?
Using brains, we can quickly figure out that $c = 3$ is enough. Without brains, we could try coloring the vertices randomly: but in a cycle with $n$ vertices and $n$ edges, on average $n/c$ of the edges will be monochromatic, since each edge is monochromatic with probability $1/c$. If these bad events were independent, we could argue that there was a $(1-1/c)^n > 0$ probability that none of them occurred, but they aren't, so we can't. Instead, we'll use the local lemma.
The set of bad events $\mathcal{A}$ is just the set of events $A_i = [\text{edge } i \text{ is monochromatic}]$. We've already computed $p = 1/c$. To get $d$, notice that each edge only shares a vertex with two other edges, so $|\Gamma(A_i)| \le 2$. Corollary 13.3.2 then says that there is a good coloring as long as $ep(d+1) = 3e/c \le 1$, which holds as long as $c \ge 9$. We've just shown we can 9-color a cycle. If we use the asymmetric version, we can set all $x_A$ to $1/3$ and show that $p \le \frac{1}{3}\left(1-\frac{1}{3}\right)^2 = \frac{4}{27}$ would also work; with this we can 7-color a cycle. This is still not as good as what we can do if we are paying attention, but not bad for a procedure that doesn't use the structure of the problem much.

13.3.3.2 Satisfiability of k-CNF formulas

A more sophisticated application is demonstrating satisfiability for $k$-CNF formulas where each variable appears in a bounded number of clauses. Recall that a $k$-CNF formula consists of $m$ clauses, each of which consists of exactly $k$ variables or their negations (collectively called literals). It is satisfied if at least one literal in every clause is assigned the value true.
Suppose that each variable appears in at most $\ell$ clauses. Let $\mathcal{A}$ consist of all the events $A_i = [\text{clause } i \text{ is not satisfied}]$. Then, for all $i$, $\Pr[A_i] = 2^{-k}$ exactly, and $|\Gamma(A_i)| \le d = k(\ell-1)$, since each variable in $A_i$ is shared with at most $\ell-1$ other clauses, and $A_i$ will be independent of all events $A_j$ with which it doesn't share a variable. So if $ep(d+1) = e2^{-k}(k(\ell-1)+1) < 1$, which holds if $\ell < \frac{2^k + e(k-1)}{ek}$, Corollary 13.3.2 tells us that a satisfying assignment exists.11 (A quick numeric check of this bound appears below.)
Corollary 13.3.2 doesn't let us actually find a satisfying assignment, but it turns out we can do that too. We'll return to this when we talk about the constructive version of the local lemma in §13.3.5.

11 To avoid over-selling this claim, it's worth noting that the bound on $\ell$ only reaches 2 at $k = 4$, although it starts growing pretty fast after that.
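To get a feel for this bound (and for footnote 11), here is a tiny Python calculation, added here as a sanity check rather than as part of the original argument, tabulating the largest integer $\ell$ allowed for small $k$:

    import math

    # Largest l with l < (2**k + e*(k-1)) / (e*k), i.e., with
    # e * 2**-k * (k*(l-1) + 1) < 1.
    for k in range(2, 9):
        bound = (2**k + math.e * (k - 1)) / (math.e * k)
        print(k, math.ceil(bound) - 1)

This prints $\ell = 1$ for $k = 3$ and $\ell = 2$ for $k = 4$, and the bound grows roughly like $2^k/(ek)$ thereafter.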

13.3.3.3 Hypergraph 2-colorability

This was the original motivation for the lemma. A $k$-uniform hypergraph $G = \langle V, E \rangle$ has a set of hyperedges $E$, each of which contains exactly $k$ vertices from $V$. A 2-coloring of the hypergraph assigns one of two colors to each vertex in $V$, so that no edge is monochromatic.
Suppose we color vertices at random. Pick some edge $S$, and let $A_S$ be the event that it is monochromatic. Then $\Pr[A_S] = 2^{1-k}$ exactly. We also have that $A_S$ depends only on events $A_T$ where $S \cap T \ne \emptyset$. Suppose that there is a bound $d$ on the number of hyperedges $T$ that overlap with any hyperedge $S$. Then Corollary 13.3.2 says that there exists a good coloring as long as $e2^{1-k}(d+1) < 1$. This holds for any hypergraph with $d < 2^{k-1}/e - 1$.
Unlike graph coloring, it's not so obvious how to solve hypergraph coloring with a simple greedy algorithm. So here the local lemma seems to buy us something we didn't already have.

13.3.4 Non-constructive proof

This is essentially the same argument presented in [MR95, §5.5], but we adapt the notation to match the statement in terms of neighborhoods $\Gamma(A)$ instead of edges in a dependency graph, where an event is independent of all events that are not its successors in the graph. The two formulations are identical, since we can always represent the neighborhoods $\Gamma(A)$ by creating an edge from $A$ to $B$ when $B$ is in $\Gamma(A)$; and conversely we can convert a dependency graph into a neighborhood structure by making $\Gamma(A)$ the set of successors of $A$ in the dependency graph.
The proof is a bit easier than the constructive version, while also being a bit more general because the neighborhood structure doesn't depend on identifying a collection of underlying independent variables.
The idea is to order the events in $\mathcal{A}$ as $A_1, A_2, \ldots, A_n$ and expand $\Pr\left[\bigcap_{i=1}^n \bar{A}_i\right]$ as $\prod_{i=1}^n \Pr\left[\bar{A}_i \,\middle|\, \bigcap_{j=1}^{i-1} \bar{A}_j\right]$. If we can show that every factor in the product is nonzero, we are done.
To do this, we will show by induction on $|S|$ the more general statement that for any $A$ and any $S \subseteq \mathcal{A}$ with $A \notin S$,

\[ \Pr\left[A \,\middle|\, \bigcap_{B\in S} \bar{B}\right] \le x_A. \tag{13.3.3} \]

When $|S| = 0$, this just says $\Pr[A] \le x_A$, which follows immediately from (13.3.1).
For larger $S$, split $S$ into $S_1 = S \cap \Gamma(A)$, the events in $S$ that might not be independent of $A$; and $S_2 = S \setminus \Gamma(A)$, the events in $S$ that we know to be independent of $A$. If $S_2 = S$, then $A$ is independent of all events in $S$, and (13.3.3) follows immediately from $\Pr\left[A \,\middle|\, \bigcap_{B\in S} \bar{B}\right] = \Pr[A] \le x_A \prod_{B\in\Gamma(A)} (1-x_B) \le x_A$. Otherwise $|S_2| < |S|$, which means that we can assume the induction hypothesis holds for $S_2$.
Write $C_1$ for the event $\bigcap_{B\in S_1} \bar{B}$ and $C_2$ for the event $\bigcap_{B\in S_2} \bar{B}$. Then $\bigcap_{B\in S} \bar{B} = C_1 \cap C_2$, and we can expand
\[
\begin{aligned}
\Pr\left[A \,\middle|\, \bigcap_{B\in S} \bar{B}\right] &= \frac{\Pr[A \cap C_1 \cap C_2]}{\Pr[C_1 \cap C_2]} \\
&= \frac{\Pr[A \cap C_1 \mid C_2]\,\Pr[C_2]}{\Pr[C_1 \mid C_2]\,\Pr[C_2]} \\
&= \frac{\Pr[A \cap C_1 \mid C_2]}{\Pr[C_1 \mid C_2]}. \tag{13.3.4}
\end{aligned}
\]
We don't need to do anything particularly clever with the numerator:

\[
\begin{aligned}
\Pr[A \cap C_1 \mid C_2] &\le \Pr[A \mid C_2] \\
&= \Pr[A] \\
&\le x_A \prod_{B\in\Gamma(A)} (1-x_B), \tag{13.3.5}
\end{aligned}
\]

from (13.3.1) and the fact that $A$ is independent of all $B$ in $S_2$ and thus also independent of $C_2$.
T
For the denominator, we expand C1 backh out to B∈Si1 B̄ and break
T
out the induction hypothesis. To bound Pr B∈S1 B̄ C2 , we order S1
arbitrarily as {B1 , . . . , Br }, for some r, and show by induction on ` as ` goes
from 1 to r that
" ` # `
\ Y
Pr B̄i C2 ≥ (1 − xBi ). (13.3.6)
i=1 i=1

The proof is that, for ` = 1,


h i
Pr B̄1 C2 = 1 − Pr [B1 | C2 ]
≥ 1 − xB1

using the outer induction hypothesis (13.3.3), and for larger `, we can compute
" ` # " `−1
! # "`−1 #
\ \ \
Pr B̄i C2 = Pr B̄` B̄i ∩ C2 · Pr B̄i C2
i=1 i=1 i=1
`−1
Y
≥ (1 − xB` ) (1 − xBi )
i=1
`
Y
= (1 − xBi ),
i=1
CHAPTER 13. THE PROBABILISTIC METHOD 248

where the second-to-last step uses the outer induction hypothesis (13.3.3)
for the first term and the inner induction hypothesis (13.3.6) for the rest.
This completes the proof of the inner induction.
When $\ell = r$, we get

\[
\begin{aligned}
\Pr[C_1 \mid C_2] &= \Pr\left[\bigcap_{i=1}^{r} \bar{B}_i \,\middle|\, C_2\right] \\
&\ge \prod_{B\in S_1} (1-x_B). \tag{13.3.7}
\end{aligned}
\]

Substituting (13.3.5) and (13.3.7) into (13.3.4) gives

\[
\begin{aligned}
\Pr\left[A \,\middle|\, \bigcap_{B\in S} \bar{B}\right] &\le \frac{x_A \prod_{B\in\Gamma(A)} (1-x_B)}{\prod_{B\in S_1} (1-x_B)} \\
&= x_A \prod_{B\in\Gamma(A)\setminus S_1} (1-x_B) \\
&\le x_A.
\end{aligned}
\]
This completes the proof of the outer induction.
To get the bound (13.3.2), we reach back inside the proof and repeat the argument for (13.3.7) with $\bigcap_{A\in\mathcal{A}} \bar{A}$ in place of $C_1$ and without the conditioning on $C_2$. We order $\mathcal{A}$ arbitrarily as $\{A_1, A_2, \ldots, A_m\}$ and show by induction on $k$ that

\[ \Pr\left[\bigcap_{i=1}^{k} \bar{A}_i\right] \ge \prod_{i=1}^{k} (1-x_{A_i}). \tag{13.3.8} \]

For the base case we have $k = 0$ and $\Pr[\Omega] \ge 1$, using the usual conventions on empty intersections and products. For larger $k$, we have

\[
\begin{aligned}
\Pr\left[\bigcap_{i=1}^{k} \bar{A}_i\right] &= \Pr\left[\bar{A}_k \,\middle|\, \bigcap_{i=1}^{k-1} \bar{A}_i\right] \Pr\left[\bigcap_{i=1}^{k-1} \bar{A}_i\right] \\
&\ge (1-x_{A_k}) \prod_{i=1}^{k-1} (1-x_{A_i}) \\
&= \prod_{i=1}^{k} (1-x_{A_i}),
\end{aligned}
\]

where in the second-to-last step we use (13.3.3) for the first term and the induction hypothesis (13.3.8) for the big product.
Setting $k = m$ finishes the proof.

13.3.5 Constructive proof

We now describe the constructive proof of the Lovász local lemma due to Moser and Tardos [MT10], which is based on a slightly more specialized construction of Moser alone [Mos09]. This version applies when our set of bad events $\mathcal{A}$ is defined in terms of a set of independent variables $\mathcal{P}$, where each $A \in \mathcal{A}$ is determined by some set of variables $\operatorname{vbl}(A) \subseteq \mathcal{P}$, and $\Gamma(A)$ is defined to be the set of all events $B \ne A$ that share at least one variable with $A$; i.e., $\Gamma(A) = \{B \in \mathcal{A} \setminus \{A\} \mid \operatorname{vbl}(B) \cap \operatorname{vbl}(A) \ne \emptyset\}$.
In this case, we can attempt to find an assignment to the variables that makes none of the $A$ occur using the obvious algorithm of sampling an initial state randomly, then resampling all variables in $\operatorname{vbl}(A)$ whenever we see some bad $A$ occur. Astonishingly, this actually works in a reasonable amount of time, without even any cleverness in deciding which $A$ to resample at each step, if the conditions for Lemma 13.3.1 hold for $x$ that are not too large. In particular, we will show that a good assignment is found after each $A$ is resampled at most $\frac{x_A}{1-x_A}$ times on average.12

Lemma 13.3.3. Under the conditions of Lemma 13.3.1, the Moser-Tardos procedure does at most

\[ \sum_{A} \frac{x_A}{1-x_A} \]

resampling steps on average.


We defer the proof of Lemma 13.3.3 for the moment. For most applica-
tions, the following symmetric version is easier to work with:
Corollary 13.3.4. Under the conditions of Corollary 13.3.2, the Moser-
Tardos procedure does at most m/d resampling steps on average.
1
Proof. Follows from Lemma 13.3.3 and the choice of xA = d+1 in the proof
of Corollary 13.3.2.

How this expected m/d bound translates into actual time depends on
the cost of each resampling step. The expensive part at each step is likely to
be the cost of finding an A that occurs and thus needs to be resampled.
12
Even though there is a lot of resampling going on here, Moser-Rardos is not actually a
sampling algorithm, in the sense that some solutions to the original problem may be much
more likely to be produced than others. There has been some more recent work on the
sampling Lovász local lemma, where the goal is to obtain a close-to-uniform sample
from the space of possible solutions. This is much harder than just finding one solution
and usually requires stronger constraints on the problem. An example of this work is the
paper of He et al. [HWY22].
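Here is a minimal Python sketch of the Moser-Tardos procedure specialized to $k$-CNF formulas (an illustration of the resampling loop, not the authors' code); clauses are represented, by assumption, as lists of signed 1-based variable indices:

    import random

    def moser_tardos(clauses, n):
        # Initial sample: each variable independently true or false.
        assignment = {i: random.random() < 0.5 for i in range(1, n + 1)}

        def find_unsatisfied():
            # Return some clause with no true literal, or None; any rule
            # for choosing among bad clauses is allowed by the analysis.
            for clause in clauses:
                if not any((assignment[abs(v)] if v > 0
                            else not assignment[abs(v)]) for v in clause):
                    return clause
            return None

        # Resample all variables of a bad event until none are left.
        while (bad := find_unsatisfied()) is not None:
            for v in bad:
                assignment[abs(v)] = random.random() < 0.5
        return assignment

When the conditions of Corollary 13.3.2 hold (here, $\ell < \frac{2^k + e(k-1)}{ek}$), the loop terminates after $m/d$ resamplings on average.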

Intuitively, we might expect the resampling to work because if each A ∈ A


has a small enough neighborhood Γ(A) and a low enough probability, then
whenever we resample A’s variables, it’s likely that we fix A and unlikely
that we break too many B events in A’s neighborhood. It turns out to be
tricky to quantify how this process propagates outward, so the actual proof
uses a different idea that essentially looks at this process in reverse, looking
for each resampled event A at a set of previous events whose resampling we
can blame for making A occur, and then showing that this tree (which will
include every resampling operation as one of its vertices) can’t be too big.
The first step is to fix some strategy for choosing which event A to
resample at each step. We don’t care what this strategy is; we just want
it to be the case that the sequence of events depends only on the random
choices made by the algorithm in its initial sampling and each resampling.
We can then define an execution log C that lists the sequence of events
C1 , C2 , C3 , . . . that are resampled at each step of the execution.
From C we construct a witness tree Tt for each resampling step t whose
nodes are labeled with events, with the property that the children of a node
v labeled with event Av are labeled with events in Γ+ (Av ) = {Av } ∪ Γ(Av ).
The root of Tt is labeled with Ct . To construct the rest of the tree, we work
backwards through Ct−1 , Ct−2 , . . . , C1 , and for each event Ci we encounter,
we attach Ci as a child of the deepest v we can find with Ci ∈ Γ+ (Av ),
choosing arbitrarily if there is more than one such v, and discarding Ci if
there is no such v.
Now we can ask what the probability is that we see some particular
witness tree T in the execution log. Each vertex of T corresponds to some
event Av that we resample because it occurs; in order for it to occur, the
previous assignments of each variable in vbl(Av ) must have made Av true,
which occurs with probability $\Pr[A_v]$. But since we resample all the variables in $\operatorname{vbl}(A_v)$, any subsequent assignments to these variables are independent of the ones that contributed to $v$; with sufficient handwaving (or a rather detailed coupling argument as found in [MT10]) this gives that each event $A_v$ occurs with independent probability $\Pr[A_v]$, giving $\Pr[T] = \prod_{v\in T} \Pr[A_v]$.
Why do we care about this? Because every event we resample is the root
of some witness tree, and we can argue that every event we resample is the
root of a distinct witness tree. The proof is that since we only discard events
B that have vbl(B) disjoint from all nodes already in the tree, once we put
A at the root, any other instance of A gets included. So the witness tree
rooted at the i-th occurrence of A in C will include exactly i copies of A,
unlike the witness tree rooted at the j-th copy for j 6= i.
Now comes the sneaky trick: we'll bound how many distinct witness trees $T$ we expect to see rooted at $A$, given that each occurs with probability $\prod_{v\in T} \Pr[A_v]$. This is done by constructing a branching process using the $x_B$ values from Lemma 13.3.1 as probabilities of a node with label $A$ having a child labeled $B$ for each $B$ in $\Gamma^+(A)$, and doing algebraic manipulations on the resulting probabilities until $\prod_{v\in T} \Pr[A_v]$ shows up. We can then sum over the expected number of copies of trees to get a bound on the expected number of events in the execution log (since each such event is the root of some tree), which is equal to the expected number of resamplings.
Consider the process where we start with a root labeled $A$, and for each vertex $v$ with label $A_v$, give it a child labeled $B$ for each $B \in \Gamma^+(A_v)$ with independent probability $x_B$. We'll now calculate the probability $p_T$ that this process generates a particular tree $T$ in the set $\mathcal{T}_A$ of trees with root $A$.
Let $x'_B = x_B \prod_{C\in\Gamma(B)} (1-x_C)$. Note that (13.3.1) says precisely that $\Pr[B] \le x'_B$.
For each vertex $v$ in $T$, let $W_v \subseteq \Gamma^+(A_v)$ be the set of events $B \in \Gamma^+(A_v)$ that don't occur as labels of children of $v$. The probability of getting $T$ is equal to the product of the probabilities at each $v$ of getting all of its children and none of its non-children. The non-children of $v$ collectively contribute $\prod_{B\in W_v} (1-x_B)$ to the product, and $v$ itself contributes $x_{A_v}$ (via the product for its parent), unless $v$ is the root node. So we can express the giant product as

\[ p_T = \frac{1}{x_A} \prod_{v\in T} \left( x_{A_v} \prod_{B\in W_v} (1-x_B) \right). \]

We don't like the $W_v$ very much, so we get rid of them by taking the product over all of $\Gamma^+(A_v)$, then dividing out the ones that aren't in $W_v$. This gives

\[
\begin{aligned}
p_T &= \frac{1}{x_A} \prod_{v\in T} \left( x_{A_v} \prod_{B\in W_v} (1-x_B) \right) \\
&= \frac{1}{x_A} \prod_{v\in T} \left( x_{A_v} \left( \prod_{B\in\Gamma^+(A_v)} (1-x_B) \right) \left( \prod_{B\in\Gamma^+(A_v)\setminus W_v} \frac{1}{1-x_B} \right) \right).
\end{aligned}
\]

This seems like we exchanged one annoying index set for another, but each element of $\Gamma^+(A_v) \setminus W_v$ is $A_{v'}$ for some child $v'$ of $v$ in $T$. So we can push these factors down to the children, and since we are multiplying over all vertices in $T$, they will each show up exactly once except at the root. To keep the products clean, we'll throw in $\frac{1}{1-x_A}$ for the root as well, but compensate for this by multiplying by $1-x_A$ on the outside of the product. This gives

\[
\begin{aligned}
p_T &= \frac{1-x_A}{x_A} \prod_{v\in T} \left( \frac{x_{A_v}}{1-x_{A_v}} \prod_{B\in\Gamma^+(A_v)} (1-x_B) \right) \\
&= \frac{1-x_A}{x_A} \prod_{v\in T} \left( x_{A_v} \prod_{B\in\Gamma(A_v)} (1-x_B) \right) \\
&= \frac{1-x_A}{x_A} \prod_{v\in T} x'_{A_v}.
\end{aligned}
\]

Now we can bound the expected number of trees rooted at $A$ that appear in $C$, assuming (13.3.1) holds. Letting $\mathcal{T}_A$ as before be the set of all such trees and $N_A$ the number that appear in $C$, we have

\[
\begin{aligned}
\operatorname{E}[N_A] &= \sum_{T\in\mathcal{T}_A} \Pr[T \text{ appears in } C] \\
&\le \sum_{T\in\mathcal{T}_A} \prod_{v\in T} \Pr[A_v] \\
&\le \sum_{T\in\mathcal{T}_A} \prod_{v\in T} x'_{A_v} \\
&= \sum_{T\in\mathcal{T}_A} \frac{x_A}{1-x_A}\, p_T \\
&= \frac{x_A}{1-x_A} \sum_{T\in\mathcal{T}_A} p_T \\
&\le \frac{x_A}{1-x_A}.
\end{aligned}
\]

The last sum is bounded by one because occurrences of particular trees $T$ in $\mathcal{T}_A$ are all disjoint events.
Now sum over all $A$, and we're done.
Chapter 14

Derandomization

Derandomization is the process of taking a randomized algorithm and


turning it into a deterministic algorithm. This is useful both for practical
reasons (deterministic algorithms are more predictable, which makes them
easier to debug and gives hard guarantees on running time) and theoretical
reasons (if we can derandomize any randomized algorithm we could show
results like P = RP,1 which would reduce the number of complexity classes
that complexity theorists otherwise have to deal with). It may also be the
case that derandomizing a randomized algorithm can be used for probability
amplification, where we replace a low probability of success with a higher
probability, in this case 1.
There are basically two approaches to derandomization:

1. Reduce the number of random bits used down to O(log n), and then
search through all choices of random bits exhaustively. For example, if
we only need pairwise independence, we could use the XOR technique
from §5.1.2.1 to replace a large collection of variables with a small
collection of random bits.
Except for the exhaustive search part, this is how randomized algo-
rithms are implemented in practice: rather than burning random bits
continuously, a pseudorandom generator is initialized from a seed
consisting of a small number of random bits. For pretty much all of
the randomized algorithms we know about, we don’t even need to use
a particularly strong pseudorandom generator. This is largely because current popular generators are the products of a process of evolution: pseudorandom generators that cause wonky behavior or fail to pass tests that approximate the assumptions made about them by typical randomized algorithms are abandoned in favor of better generators.2

1 The class RP consists of all languages L for which there is a polynomial-time randomized algorithm that correctly outputs "yes" given an input x in L with probability at least 1/2, and never answers "yes" given an input x not in L. See §1.5.2 for a more extended description of RP and other randomized complexity classes.
2 Having cheaper computers helps here as well. Nobody would have been willing to spend 2496 bytes on the state vector for Mersenne Twister [MN98] back in 1975, but by 1998 this amount of memory was trivial for pretty much any computing device except the tiniest microcontrollers.
From a theoretical perspective, pseudorandom generators offer the
possibility of eliminating randomization from all randomized algorithms,
except there is a complication. While (under reasonable cryptographic
assumptions) there exist cryptographically secure pseudorandom
generators whose output is indistinguishable from a genuinely random
source by polynomial-time algorithms (including algorithms originally
intended for other purposes), such generators are inherently incapable
of reducing the number of random bits down to the O(log n) needed
for exhaustive search. The reason is that any pseudorandom generator
with only polynomially-many seeds can’t be cryptographically secure,
because we can distinguish it from a random source by just checking
its output against the output for all possible seeds. Whether there
is some other method for transforming an arbitrary algorithm in RP
or BPP into a deterministic algorithm remains an open problem in
complexity theory (and beyond the scope of this course).

2. Start with a specific randomized protocol and analyze its behavior


enough that we can replace the random bits it uses with specific,
deterministically-chosen bits we can compute. This is the main ap-
proach we will describe below. A non-constructive variant of this shows
that we can always replace the random bits used by all inputs of a given
size with a few carefully-selected fixed sequences (Adleman’s Theorem,
described in §14.2). More practical is the method of conditional
probabilities, which chooses random bits sequentially based on which
value is more likely to give a good outcome (see §14.4).

14.1 Deterministic vs. randomized algorithms


In thinking about derandomization, it can be helpful to have more than one way to look at a randomized algorithm. So far, we've described randomized algorithms as random choices of deterministic algorithms ($M_r(x)$) or, equivalently, as deterministic algorithms that happen to have random inputs ($M(r, x)$). This gives a very static view of how randomness is used in the
algorithm. A more dynamic view is obtained by thinking of the computation
of a randomized algorithm as a computation tree, where each path through
the tree corresponds to a computation with a fixed set of random bits and a
branch in the tree corresponds to a random decision. In either case we want
an execution to give us the right answer with reasonably high probability,
whether that probability measures our chance of getting a good deterministic
machine for our particular input or landing on a good computation path.

14.2 Adleman’s theorem


The idea of picking a good deterministic machine is the basis for Adleman's theorem [Adl78], a classic result in complexity theory. Adleman's theorem says that we can always replace randomness by an oracle that presents us with a fixed string of advice $p_n$ that depends only on the size of the input $n$ and has size polynomial in $n$. The formal statement of the theorem relates the class RP, which is the class of problems for which there exists a polynomial-time Turing machine $M(x, r)$ that outputs 1 at least half the time when $x \in L$ and never when $x \notin L$; and the class P/poly, which is the class of problems for which there is a polynomial-sized string $p_n$ for each input size $n$ and a polynomial-time Turing machine $M'$ such that $M'(x, p_{|x|})$ outputs 1 if and only if $x \in L$.

Theorem 14.2.1. RP ⊆ P/poly.

Proof. The intuition is that if any one random string has a constant probability of making $M$ happy, then by choosing enough random strings we can make the probability that $M$ fails on every random string for any given input so small that even after we sum over all inputs of a particular size, the probability of failure is still small using the union bound (4.2.1). This is an example of probability amplification, where we repeat a randomized algorithm many times to reduce its failure probability.
Formally, consider any fixed input $x$ of size $n$, and imagine running $M$ repeatedly on this input with $n+1$ independent sequences of random bits $r_1, r_2, \ldots, r_{n+1}$. If $x \notin L$, then $M(x, r_i)$ never outputs 1. If $x \in L$, then for each $r_i$, there is an independent probability of at least $1/2$ that $M(x, r_i) = 1$. So $\Pr[M(x, r_i) = 0] \le 1/2$, and $\Pr[\forall i\, M(x, r_i) = 0] \le 2^{-(n+1)}$. If we sum this probability of failure for each individual $x \in L$ of length $n$ over the at most $2^n$ such elements, we get a probability that any of them fail of at most $2^n 2^{-(n+1)} = 1/2$. Turning this upside down, any sequence of $n+1$ random inputs includes a witness that $x \in L$ for every $x \in L$ of length $n$ with probability at least $1/2$. It follows that a good sequence $r_1, \ldots, r_{n+1}$ exists.
Our advice $p_n$ is now some good sequence $p_n = \langle r_1 \ldots r_{n+1} \rangle$, and the deterministic advice-taking algorithm that uses it runs $M(x, r_i)$ for each $r_i$ and returns true if and only if at least one of these executions returns true.

The classic version of this theorem shows that anything you can do with
a polynomial-size randomized circuit (a circuit made up of AND, OR, and
NOT gates where some of the inputs are random bits, corresponding to the r
input to M ) can be done with a polynomial-size deterministic circuit (where
now the pn input is baked into the circuit, since we need a different circuit
for each size n anyway).
A limitation of this result is that ordinary algorithms seem to be better
described by uniform families of circuits, where there exists a polynomial-
time algorithm that, given input n, outputs the circuit Cn for processing size-n
inputs. In contrast, the class of circuits generated by Adleman’s theorem
is most likely non-uniform: the process of finding the good witnesses ri is
not something we know how to do in polynomial time (with the usual caveat
that we can’t prove much about what we can’t do in polynomial time).

14.3 Limited independence


For some algorithms, it may be that full independence is not needed for all
of the random bits. If the amount of independence needed is small enough,
this may allow us to reduce the actual number of random bits consumed
down to O(log n), at which point we can try all possible sequences of random
bits in polynomial time.
Variants of this technique have been used heavily in cryptography and complexity theory; see [LW05] for a survey of some early work in this area. We'll do a quick example of the method before moving on to more direct approaches.

14.3.1 MAX CUT


Let’s look at the randomized MAX CUT algorithm from §13.2.1. In the
original algorithm, we use n independent random bits to assign the n vertices
to S or T . The idea is that by assigning each vertex independently at
random to one side of the cut or the other, each edge appears in the cut with
probability 1/2, giving a total of m/2 edges in the cut in expectation.
Suppose that we replace these $n$ independent random bits with $n$ pairwise-independent bits generated by taking XORs of subsets of $\lceil \lg(n+1) \rceil$ independent random bits as described in §5.1.2.1. Because the bits are pairwise-independent, the probability that the two endpoints of an edge are assigned to different sides of the cut is still exactly $1/2$. So on average we get $m/2$ edges in the cut as before, and there is at least one sequence of random bits that guarantees a cut at least this big.
But with only $\lceil \lg(n+1) \rceil$ random bits, there are only $2^{\lceil \lg(n+1) \rceil} < 2(n+1)$ possible sequences of random bits. If we try all of them, then we find a cut of size $m/2$ always. The total cost is $O(n(n+m))$ if we include the $O(n+m)$ cost of testing each cut. Note that this algorithm does not generate all $2^n$ possible cuts, but among those it does generate, there must be a large one.
In this particular case, we’ll see below how to get the same result at a
much lower cost, using more knowledge of the problem. So we have the
typical trade-off between algorithm performance and algorithm designer
effort.
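Here is a minimal Python sketch of this exhaustive search (an illustration; the encoding where vertex $u$ XORs together the seed bits selected by the binary expansion of $u+1$ is one convenient choice of distinct nonempty subsets):

    def max_cut_pairwise(n, edges):
        # Number of seed bits: ceil(lg(n+1)).
        b = (n + 1).bit_length() - 1
        if 2**b < n + 1:
            b += 1
        best_side, best_size = None, -1
        for seed in range(2**b):
            # Vertex u's bit is the parity of the seed bits in the
            # nonempty subset given by u+1; distinct nonempty subsets
            # give uniform, pairwise-independent bits.
            side = [bin(seed & (u + 1)).count("1") % 2 for u in range(n)]
            size = sum(side[u] != side[v] for u, v in edges)
            if size > best_size:
                best_side, best_size = side, size
        return best_side, best_size  # best_size >= len(edges) / 2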

14.4 The method of conditional probabilities


The method of conditional probabilities [Rag88] follows an execution
of the randomized algorithm, but at each point where we would otherwise
make a random decision, it makes a decision that minimizes the conditional
probability of losing.
Structurally, this is similar to the method of bounded differences (see
§5.3.3). Suppose our randomized algorithm generates m random values
X1 , X2 , . . . , Xm . Let f (X1 , . . . , Xm ) be the indicator variable for our ran-
domized algorithm failing (more generally, we can make it an expected cost
or some other performance measure). Extend f to shorter sequences of val-
ues by defining f (x1 , . . . , xk ) = E [f (x1 , . . . , xk , Xk+1 , . . . , Xm )]. Then Yk =
f (X1 , . . . , Xk ) is a Doob martingale, just as in the method of bounded differ-
ences. This implies that, for any partial sequence of values $x_1, \ldots, x_k$, there exists some next value $x_{k+1}$ such that $f(x_1, \ldots, x_k) \ge f(x_1, \ldots, x_k, x_{k+1})$. If we can find this value, we can follow a path on which $f$ never increases, and obtain an outcome of the algorithm $f(x_1, \ldots, x_m)$ less than or equal to the initial value $f(\langle\rangle)$. If our outcomes are 0–1 (as in failure probabilities), and our initial value for $f$ is less than 1, this means that we reach an outcome with $f = 0$.

The tricky part here is that it may be hard to compute f (x1 , . . . , xk ).


(It’s always possible to do so in principle by enumerating all assignments of
the remaining variables, but if we have time to do this, we can just search
for a winning assignment directly.) What makes the method of conditional
probabilities practical in many cases is that it’s not necessary for f to compute
the actual probability of failure, as long as (a) f gives an upper bound on the
real probability of failure, at least in terminal configurations, and (b) f has the
property used in the argument that for any partial sequence x1 , . . . , xk there
exists an extension x1 , . . . , xk , xk+1 with f (x1 , . . . , xk ) ≥ f (x1 , . . . , xk , xk+1 ).
Such an f is called a pessimistic estimator. If we can find a pessimistic
estimator that is easy to compute and starts out less than 1, then we can
just follow it down the tree to a leaf that doesn’t fail.

14.4.1 MAX CUT using conditional probabilities


Again we consider the algorithm from §13.2.1 that assigns vertices to S
and T at random. To derandomize this algorithm, at each step we pick a
vertex and assign it to the side of the cut that maximizes the conditional
expectation of the number of edges that cross the cut. We can compute this
conditional expectation as follows:
1. For any edge that already has both endpoints assigned to S or T , it’s
either in the cut or it isn’t: add 0 or 1 as appropriate to the conditional
expectation.

2. For any edge with only one endpoint assigned, there’s a 1/2 probability
that the other endpoint gets assigned to the other side (in the original
randomized algorithm). Add 1/2 to the conditional expectation for
these edges.

3. For any edge with neither endpoint assigned, we again have a 1/2
probability that it crosses the cut. Add 1/2 for these as well.
So now let us ask how assigning a particular previously unassigned vertex $v$ to $S$ or $T$ affects the conditional expectation. For any neighbor $w$ of $v$ that is not already assigned, adding $v$ to $S$ or $T$ doesn't affect the $1/2$ contribution of $vw$. So we can ignore these. The only effects we see are that if some neighbor $w$ is in $S$, assigning $v$ to $S$ decreases the conditional expectation by $1/2$ and assigning $v$ to $T$ increases the expectation by $1/2$. So to maximize the conditional expectation, we should assign $v$ to whichever side currently holds fewer of $v$'s neighbors: the obvious greedy algorithm, which runs in $O(n+m)$ time if we are reasonably clever about keeping track of how many neighbors each unassigned node has in $S$ and $T$. The advantage of the method of conditional probabilities here is that we immediately get that the greedy algorithm achieves a cut of size $m/2$, which might require actual intelligence to prove otherwise.
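A minimal Python sketch of this greedy derandomization (illustrative only):

    def greedy_max_cut(n, edges):
        adj = [[] for _ in range(n)]
        for u, v in edges:
            adj[u].append(v)
            adj[v].append(u)
        side = [None] * n  # None = unassigned, 0 = S, 1 = T
        for v in range(n):
            # Count already-assigned neighbors on each side and put v on
            # the side holding fewer of them; this choice never decreases
            # the conditional expectation of the cut size.
            in_s = sum(1 for w in adj[v] if side[w] == 0)
            in_t = sum(1 for w in adj[v] if side[w] == 1)
            side[v] = 0 if in_s <= in_t else 1
        cut = sum(side[u] != side[v] for u, v in edges)
        return side, cut  # cut >= len(edges) / 2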

14.4.2 Deterministic construction of Ramsey graphs


Here is an example that is slightly less trivial. For this example we let $f$ be a count of bad events rather than a failure probability, but the same method applies.
Recall from §13.1.2 that if $k \ge 3$, for $n \le 2^{k/2}$ there exists a graph with $n$ nodes and no cliques or independent sets of size $k$. The proof of this fact is to observe that each subset of $k$ vertices is bad with probability $2^{1-\binom{k}{2}}$, and when $\binom{n}{k} 2^{1-\binom{k}{2}} < 1$, the expected number of bad subsets in $G_{n,1/2}$ is less than 1, showing that some good graph exists.
We can turn this into a deterministic $n^{O(\log n)}$ algorithm for finding a $k$-Ramsey graph in the worst case when $n = 2^{k/2}$. The trick is to set the edges to be present or absent one at a time, and for each edge, take the value that minimizes the expected number of bad subsets conditioned on the choices so far. We can easily calculate the conditional probability that a subset is bad in $O(k)$ time: if it already has both a present and a missing edge, it's not bad. Otherwise, if we've already set $\ell$ of its edges to the same value, it's bad with probability exactly $2^{-\binom{k}{2}+\ell}$. Summing this over all $O(n^k)$ subsets takes $O(n^k k)$ time per edge, and we have $O(n^2)$ edges, giving $O(n^{k+2} k) = O\left(n^{2\lg n + 2 + \lg\lg n}\right) = n^{O(\log n)}$ time total.
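A direct (and deliberately slow) Python sketch of this construction, written here for illustration using the conditional probabilities computed above:

    from itertools import combinations
    from math import comb

    def ramsey_graph(n, k):
        K = comb(k, 2)  # number of edges inside a k-subset
        color = {}      # edge -> 1 (present) or 0 (absent)

        def expected_bad():
            # Expected number of bad k-subsets given the edges set so far.
            total = 0.0
            for s in combinations(range(n), k):
                inner = [color[e] for e in combinations(s, 2) if e in color]
                if 0 in inner and 1 in inner:
                    continue  # both a present and a missing edge: never bad
                l = len(inner)
                total += 2.0 ** (l - K) if l > 0 else 2.0 ** (1 - K)
            return total

        for e in combinations(range(n), 2):
            # Take whichever value minimizes the conditional expectation;
            # the Doob martingale argument guarantees the better choice
            # never increases it.
            color[e] = 0
            bad_absent = expected_bad()
            color[e] = 1
            if bad_absent < expected_bad():
                color[e] = 0
        return color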
It's worth mentioning that there are faster deterministic constructions in the literature. The best construction that I am aware of is given by Cohen [Coh21], which constructs a graph with $n$ vertices and no clique or independent set of size $2^{(\log\log n)^c}$ for a particular constant $c$, in polylogarithmic time per edge.

14.4.3 Derandomized set balancing


Recall the set balancing problem from §13.1.1: given vectors $v_1, v_2, \ldots, v_n$ in $\{0,1\}^m$, we'd like to find $\pm 1$ coefficients $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ that minimize $\max_j |X_j|$, where $X_j = \sum_{i=1}^n \epsilon_i v_{ij}$. Hoeffding's inequality tells us that choosing the $\epsilon_i$ independently at random gives a nonzero probability that this quantity is at most $\sqrt{2n\ln 2m}$.
Because there may be very complicated dependence between the $X_j$, it is difficult to calculate the probability of the event $\bigcup_j [|X_j| \ge t]$, whether conditioned on some of the $\epsilon_i$ or not. However, we can calculate the probability of the individual events $[|X_j| \ge t]$ exactly. Conditioning on $\epsilon_1, \ldots, \epsilon_k$, the expected value of $X_j$ is just $\ell = \sum_{i=1}^k \epsilon_i v_{ij}$, and the distribution of $Y = X_j - \operatorname{E}[X_j]$ is the sum of $r \le n-k$ independent $\pm 1$ random variables. So

\[
\begin{aligned}
\Pr[|X_j| > t \mid \epsilon_1, \ldots, \epsilon_k] &= 1 - \Pr[-t-\ell \le Y \le t-\ell] \\
&= 1 - \sum_{i=\lceil (-t-\ell+r)/2 \rceil}^{\lfloor (t-\ell+r)/2 \rfloor} \binom{r}{i} 2^{-r}.
\end{aligned}
\]

This last expression involves a linear number of terms, each of which we can calculate using a linear number of operations on rational numbers that fit in a linear number of bits, so we can calculate the probability exactly in polynomial time by just adding them up.
For our pessimistic estimator, we take

\[ U(\epsilon_1, \ldots, \epsilon_k) = \sum_{j=1}^{m} \Pr\left[|X_j| > \sqrt{2n\ln 2m} \,\middle|\, \epsilon_1, \ldots, \epsilon_k\right]. \]

Since each term in the sum is a Doob martingale, the sum is a martingale as well, so $\operatorname{E}[U(\epsilon_1, \ldots, \epsilon_{k+1}) \mid \epsilon_1, \ldots, \epsilon_k] = U(\epsilon_1, \ldots, \epsilon_k)$. It follows that for any choice of $\epsilon_1, \ldots, \epsilon_k$ there exists some $\epsilon_{k+1}$ such that $U(\epsilon_1, \ldots, \epsilon_k) \ge U(\epsilon_1, \ldots, \epsilon_{k+1})$, and we can determine this winning $\epsilon_{k+1}$ explicitly. Our previous argument shows that $U(\langle\rangle) < 1$, which implies that our final value $U(\epsilon_1, \ldots, \epsilon_n)$ will also be less than 1. Since $U(\epsilon_1, \ldots, \epsilon_n)$ is an integer, this means it must be 0, and we find an assignment in which $|X_j| < \sqrt{2n\ln 2m}$ for all $j$.
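A Python sketch of this derandomization (an illustration of the argument above), using exact rational arithmetic so the pessimistic estimator is computed exactly:

    from fractions import Fraction
    from math import comb, ceil, floor, sqrt, log

    def tail_prob(l, r, t):
        # Pr[|l + Y| > t], where Y is a sum of r independent +-1 variables;
        # writing Y = 2B - r with B binomial(r, 1/2) gives a binomial tail.
        lo, hi = ceil((-t - l + r) / 2), floor((t - l + r) / 2)
        inside = sum(comb(r, i) for i in range(max(lo, 0), min(hi, r) + 1))
        return 1 - Fraction(inside, 2**r)

    def balance(vectors):
        # vectors: n lists of m entries in {0, 1}.
        n, m = len(vectors), len(vectors[0])
        t = sqrt(2 * n * log(2 * m))
        eps = []
        for k in range(n):
            best = None
            for e in (+1, -1):
                # Pessimistic estimator U(eps_1, ..., eps_k, e).
                u = 0
                for j in range(m):
                    l = sum(eps[i] * vectors[i][j] for i in range(k))
                    l += e * vectors[k][j]
                    r = sum(vectors[i][j] for i in range(k + 1, n))
                    u += tail_prob(l, r, t)
                if best is None or u < best[0]:
                    best = (u, e)
            eps.append(best[1])
        return eps  # achieves max_j |X_j| < sqrt(2 n ln 2m)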
Chapter 15

Probabilistically-checkable
proofs

In this chapter, we discuss the PCP theorem [ALM+98], which is an alternative characterization of NP that can be used to show hardness for various approximation problems assuming P ≠ NP. What connects the PCP theorem to randomized algorithms is that the class of probabilistically-checkable proofs that it describes is what you get when you replace the deterministic verifier that checks a witness in the usual definition of NP with a randomized algorithm. This allows for very efficient verification, since a randomized verifier can demand a proof that includes sufficient internal consistency checks that a spot-check of only a few randomly-chosen bits of the proof is enough to detect an incorrect proof with constant probability. Formally, this means that we consider a verifier that is a probabilistic Turing machine that uses $r$ random bits to select $q$ bits of the proof to look at, and accepts bad proofs with less than some constant probability $\rho$ while accepting all good proofs.
This turns out to have strong consequences for approximation algorithms. We can think of a probabilistically-checkable proof as a kind of constraint satisfaction problem, where the constraints apply to the tuples of $q$ bits that the verifier might look at, the number of constraints is bounded by the number of possible random choices $2^r$, and each constraint enforces some condition on those $q$ bits corresponding to the verifier's response to seeing them. If we can satisfy at least $\rho \cdot 2^r$ of the constraints, we've constructed a proof that is not bad. This means that there is a winning certificate for our original NP machine, and that whatever input $x$ we started with is in the language accepted by that machine. So a polynomial-time approximation algorithm
for this constraint satisfaction problem will let us generate probabilistically-


checkable proofs for the corresponding problem in NP, which means there
can’t be such an approximation algorithm unless P = NP.
The above is only a sketchy, high-level overview of where we are going.
Below we fill in some of the details. We are mostly following the approach
of [AB07, §§18.1–18.4], with much of the actual text of this chapter adapted
from [Asp20].

15.1 Probabilistically-checkable proofs


A $\langle r(n), q(n), \rho \rangle$-PCP verifier for a language $L$ consists of two polynomial-time computable functions $f$ and $g$, where:

• $f(x, r)$ takes an input $x$ of length $n$ and a string $r$ of length $r(n)$ and outputs a sequence $i$ of $q(n)$ indices $i_1, i_2, \ldots, i_{q(n)}$, each in the range $0 \ldots q(n) \cdot 2^{r(n)}$; and

• $g(x, \pi_i)$ takes as input the same $x$ as $f$ and a sequence $\pi_i = \pi_{i_1} \pi_{i_2} \ldots \pi_{i_{q(n)}}$, and outputs either 1 (for accept) or 0 (for reject); and

• if $x \in L$, then there exists a sequence $\pi$ that causes $g(x, \pi_{f(x,r)})$ to output 1 always (completeness); and

• if $x \notin L$, then for any sequence $\pi$, $g(x, \pi_{f(x,r)})$ outputs 1 with probability at most $\rho$ (soundness).

We call the string $\pi$ a probabilistically-checkable proof or PCP for short.
Typically, $\rho$ is set to $1/2$, and we just write $\langle r(n), q(n) \rangle$-PCP verifier for $\langle r(n), q(n), 1/2 \rangle$-PCP verifier.
The class PCP($r(n), q(n)$) is the class of languages $L$ for which there exists an $\langle O(r(n)), O(q(n)), 1/2 \rangle$-PCP verifier.
The PCP theorem says that NP = PCP(log n, 1). That is, any
language in NP can be recognized by a PCP-verifier that is allowed to look
at only a constant number of bits selected using O(log n) random bits from
a proof of polynomial length, which is fooled at most half the time by bad
proofs. In fact, 3 bits is enough [Hås01]. We won’t actually prove this here,
but we will describe some consequences of the theorem, and give some hints
about how the proof works.

15.2 A PCP for GRAPH NON-ISOMORPHISM


The GRAPH NON-ISOMORPHISM (GNI) problem takes as input two graphs $G$ and $H$, and asks if $G$ is not isomorphic to $H$ (written $G \not\cong H$). Recall that $G$ is isomorphic to $H$ (written $G \cong H$) if there is a permutation of the vertices of $G$ that makes it equal to $H$. An NP-machine can easily solve the GRAPH ISOMORPHISM problem by guessing the right permutation, which puts the complementary GRAPH NON-ISOMORPHISM problem in coNP.1 But in general we don't know if problems in coNP have NP-style proofs, and there is no obvious general method that a prover could use to convince a deterministic verifier that two graphs are not isomorphic.
Instead, we’ll describe two protocols that allow an omniscient prover to
demonstrate that two graphs are not isomorphic to a suspicious, randomized
verifier. The first, described in §15.2.1, is an interactive proof , meaning
that the verifier can ask questions that the prover must answer. The second,
described in §15.2.2, is not interactive: instead, the prover supplies a (very
large) proof that the verifier can check by looking at a single randomly-chosen
bit.

15.2.1 GRAPH NON-ISOMORPHISM with private coins


Here is how the prover can convince a verifier that two graphs are not
isomorphic, assuming that the verifier can flip coins that are not visible to
the prover.
Given graphs G and H, the verifier picks one or the other with equal
probability, then randomly permutes its vertices to get a test graph T . It
then asks the prover which of G or H it picked.
If the graphs are not isomorphic, the prover can distinguish $T \cong G$ from $T \cong H$ and answer the question correctly with probability 1. If they are isomorphic, then the prover can only guess: $T$ gives no information at all about which of $G$ or $H$ was used to generate it. So in this case it answers correctly only with probability $1/2$. This gives us the completeness and soundness properties needed for a probabilistically-checkable proof. Unfortunately we need to do more work to make this non-interactive.

1 Neither problem is known not to be in P, even under the assumption that P ≠ NP. The exact complexity of GRAPH ISOMORPHISM is still open.

15.2.2 A probabilistically-checkable proof for GRAPH NON-ISOMORPHISM

To turn the interactive proof into a probabilistically-checkable proof, have the prover build a bit-vector $\pi$ indexed by every possible graph on $n$ vertices, writing a 1 for each graph that is isomorphic to $H$. Now the verifier can use $\Theta(n^2)$ random bits to construct $T$ as above, look it up in this gigantic table, and accept if and only if (a) it chose $G$ and $\pi[T] = 0$, or (b) it chose $H$ and $\pi[T] = 1$. If $G$ and $H$ are non-isomorphic, the verifier accepts every time. But if they are isomorphic, no matter what proof $\pi'$ is supplied, there is at least a $1/2$ chance that $\pi'[T]$ is wrong. This puts GRAPH NON-ISOMORPHISM in PCP($n^2$, 1).
(A minor technical issue here is that the verifier only needs $O(n \log n)$ bits to construct $T$. But we give it $\Theta(n^2)$ bits, so that the table can be big enough to hold all graphs and still fit in the maximum proof size $q(n)2^{r(n)}$.)

15.3 NP ⊆ PCP(poly(n), 1)

Here we give a weak version of the PCP theorem, adapted from [AB07, §18.4], showing that any problem in NP has a probabilistically-checkable proof where the verifier uses polynomially-many random bits but only needs to look at a constant number of bits of the proof, which we can state succinctly as NP ⊆ PCP(poly(n), 1).2 The proof itself will be exponentially long.
The central step is to construct a $\langle \operatorname{poly}(n), 1 \rangle$-PCP for a particular NP-complete problem; we can then take any other problem in NP, reduce it to this problem, and use the construction to get a PCP for that problem as well.

15.3.1 QUADEQ

The particular problem we will look at is QUADEQ, the language of systems of quadratic equations over $\mathbb{Z}_2$ that have solutions.
This is in NP because we can guess and verify a solution; it's NP-hard because we can use quadratic equations over $\mathbb{Z}_2$ to encode instances of SAT, using the representation 0 for false, 1 for true, $1-x$ for $\neg x$, $xy$ for $x \wedge y$, and $1-(1-x)(1-y) = x+y+xy$ for $x \vee y$. We may also need to introduce auxiliary variables to keep the degree from going up: for example, to encode the clause $x \vee y \vee z$, we introduce an auxiliary variable $q$ representing $x \vee y$ and enforce the constraints $q = x \vee y$ and $1 = q \vee z = x \vee y \vee z$ using two equations

\[
\begin{aligned}
x + y + xy &= q, \\
q + z + qz &= 1.
\end{aligned}
\]

It will be helpful later to rewrite these in a standard form with only zeros on the right:

\[
\begin{aligned}
q + x + y + xy &= 0, \\
q + z + qz + 1 &= 0.
\end{aligned}
\]

This works because we can move summands freely from one side of an equation to the other, since all addition is mod 2.

2 This is a rather weak result, since (a) the full PCP theorem gives NP using only $O(\log n)$ random bits, and (b) PCP(poly(n), 1) is known to be equal to NEXP [BFL91]. But the construction is still useful for illustrating many of the ideas behind probabilistically-checkable proofs.
15.3.2 The Walsh-Hadamard Code

An NP proof for QUADEQ just gives an assignment to the variables that makes all the equations true. Unfortunately, this requires looking at the entire proof to check it. To turn this into a PCP(poly(n), 1) proof, we will make heavy use of a rather magical error-correcting code called the Walsh-Hadamard code.
This code expands an $n$-bit string $x$ into a $2^n$-bit codeword $H(x)$, where $H(x)_i = x \cdot i$ when $x$ and the index $i$ are both interpreted as $n$-dimensional vectors over $\mathbb{Z}_2$ and $\cdot$ is the usual vector dot-product $\sum_{j=1}^n x_j i_j$. This encoding has several very nice properties, all of which we will need; a short illustrative sketch of these operations appears after the list:

1. It is a linear code: $H(x+y) = H(x) + H(y)$ when all strings are interpreted as vectors of the appropriate dimension over $\mathbb{Z}_2$.

2. It is an error-correcting code with distance $2^{n-1}$. If $x \ne y$, then exactly half of all $i$ will give $x \cdot i \ne y \cdot i$. This follows from the subset sum principle, which says that a random subset of a non-empty set $S$ is equally likely to have an even or odd number of elements.3 So for any particular nonzero $x$, exactly half of the $x \cdot i$ values will be 1, since $i$ includes each position in $x$ with independent probability $1/2$. This makes $d(H(0), H(x)) = 2^{n-1}$. But then the linearity of $H$ gives $d(H(x), H(y)) = d(H(0), H(x+y)) = 2^{n-1}$ whenever $x \ne y$.

3 Proof: If $S$ is non-empty, pick some element $x$, and then the function $f$ that removes $x$ if present and adds $x$ if missing is a bijection between even and odd subsets. If you like generating functions, an alternative proof uses the Binomial Theorem to show $\sum_{\text{even } i} \binom{n}{i} - \sum_{\text{odd } i} \binom{n}{i} = \sum_{i=0}^n (-1)^i \binom{n}{i} = (1+(-1))^n = 0^n = 0$ when $n \ne 0$.

3. It is locally testable: Given an alleged codeword $w$, we can check if $w$ is close to being a legitimate codeword $H(x)$ by sampling a constant number of bits from $w$. (We do not need to know what $x$ is to do this.) Our test is: Pick two indices $i$ and $j$ uniformly at random, and check if $w_i + w_j = w_{i+j}$. A legitimate codeword will pass this test always. It is also possible to show using Fourier analysis (see [AB07, Theorem 19.9]) that if $w$ passes this test with probability $\rho \ge 1/2$, then there is some $x$ such that $\Pr_i[H(x)_i = w_i] \ge \rho$ (equivalently, $d(H(x), w) \le 2^n(1-\rho)$), in which case we say $w$ is $\rho$-close to $H(x)$.

4. It is locally decodable: If $w$ is $\rho$-close to $H(x)$, then we can compute $H(x)_i$ by choosing a random index $r$ and computing $w_r + w_{i+r}$. This will be equal to $H(x)_i$ if both bits are correct (by linearity of $H$). The probability that both bits are correct is at least $1-2\delta$ if $\rho = 1-\delta$.

5. It allows us to check an unlimited number of linear equations in $x$ by looking up a single bit of $H(x)$. This again uses the subset sum principle. Given a system of linear equations $x \cdot y_1 = 0$, $x \cdot y_2 = 0$, \ldots, $x \cdot y_m = 0$, choose a random subset $S$ of $\{1, \ldots, m\}$, let $i = \sum_{j\in S} y_j$, and query $H(x)_i = x \cdot \sum_{j\in S} y_j$. This will be 0 always if the equations hold, and 1 with probability $1/2$ if at least one is violated.

This gets a little more complicated if we have any ones on the right-hand side. But we can handle an equation of the form $x \cdot y = 1$ by rewriting it as $x \cdot y + 1 = 0$, and then extending $x$ to include an extra constant 1 bit (which we can test is really one by looking up $H(x)_i$ for an appropriate index $i$).
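Here is a minimal Python sketch of the encoding, the linearity test, and local decoding (illustrative; indices are packed into integers, so addition of indices over $\mathbb{Z}_2^n$ is bitwise XOR):

    import random

    def dot(x, i):
        # Dot product over Z_2 of two n-bit vectors packed into ints.
        return bin(x & i).count("1") % 2

    def hadamard(x, n):
        # The 2^n-bit codeword H(x), with H(x)_i = x . i.
        return [dot(x, i) for i in range(2**n)]

    def linearity_test(w, n, trials=100):
        # Check w_i + w_j = w_{i+j} for random i, j (index addition over
        # Z_2^n is XOR); legitimate codewords always pass.
        for _ in range(trials):
            i, j = random.randrange(2**n), random.randrange(2**n)
            if (w[i] + w[j]) % 2 != w[i ^ j]:
                return False
        return True

    def local_decode(w, i, n):
        # Recover H(x)_i from a mostly-correct w as w_r + w_{i+r} for a
        # random r; correct with probability at least 1 - 2*delta.
        r = random.randrange(2**n)
        return (w[r] + w[r ^ i]) % 2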

15.3.3 A PCP for QUADEQ

So now to construct a PCP for QUADEQ, we build:

1. An $n$-bit solution $u$ to the system of quadratic equations, which we think of as a function $f: \{0,1\}^n \to \{0,1\}$ and encode as $f = H(u)$.

2. An $n^2$-bit vector $w = u \otimes u$, where $(u \otimes u)_{ij} = u_i u_j$, which we encode as $g = H(u \otimes u)$.
To simplify our life, we will assume that one of the equations is $x_1 = 1$, so that we can use this constant 1 later (note that we can trivially reduce the unrestricted version of QUADEQ to the version that includes this assumption by adding an extra variable and equation). A different approach that does not require this assumption is given in [AB07, §18.4.2, Step 3].
To test this PCP, the verifier checks:

1. That $f$ and $g$ are $(1-\delta)$-close to real codewords for some suitably small $\delta$.

2. That for some random $r$, $s$, $f(r)f(s) = g(r \otimes s)$. This may let us know if $w$ is inconsistent with $u$. Define $W$ as the $n \times n$ matrix with $W_{ij} = w_{ij}$ and $U$ as the $n \times n$ matrix $U = u \otimes u$ (so $U_{ij} = u_i u_j$). Then $g(r \otimes s) = w \cdot (r \otimes s) = \sum_{ij} w_{ij} r_i s_j = rWs$ and $f(r)f(s) = (u \cdot r)(u \cdot s) = \left(\sum_i u_i r_i\right)\left(\sum_j u_j s_j\right) = \sum_{ij} r_i U_{ij} s_j = rUs$, where we are treating $r$ as a row vector and $s$ as a column vector. Now apply the random subset principle to argue that if $U \ne W$, then $rU \ne rW$ at least half the time, and if $rU \ne rW$, then $rUs \ne rWs$ at least half the time. This gives a probability of at least $1/4$ that we catch $U \ne W$, and we can repeat the test a few times to amplify this to whatever constant we want.

3. That our extra constant-1 variable is in fact 1 (lookup on $u$).

4. That $w$ encodes a satisfying assignment for the original problem. This just involves checking a system of linear equations using $w$.

Since we can make each step fail with only a small constant probability, we can make the entire process fail with the sum of these probabilities, also a small constant.

15.4 PCP and approximability


Suppose we want to use the full PCP theorem NP = PCP(log n, 1) to
actually decide some language L in NP. What do we need to do?

15.4.1 Approximating the number of satisfied verifier queries


If we can somehow find a PCP for $x \in L$ and verify it, then we know $x \in L$. So the obvious thing is to try to build an algorithm for generating PCPs. But actually generating a PCP may be hard. Fortunately, even getting a good approximation will be enough. We illustrate the basic idea using MAX SAT, the problem of finding an assignment that maximizes the number of satisfied clauses in a 3CNF formula.
Suppose that we have some language $L$ with a PCP verifier $V$. If $x \in L$, there exists a proof of polynomial length such that every choice of $q$ bits that $V$ might make from the proof will be accepted by $V$. We can encode this verification step as a Boolean formula: for each set of bits $S = \{i_1, \ldots, i_q\}$, write a polynomial-size formula $\phi_S$ with variables in $\pi$ that checks if $V$ will accept $\pi_{i_1}, \ldots, \pi_{i_q}$ for our given input $x$. Then we can test if $x \in L$ by testing if $\phi = \bigwedge_S \phi_S$ is satisfiable or not.
But we can do better than this. Suppose that we can approximate the number of $\phi_S$ that can be satisfied to within a factor of $2-\epsilon$. Then if $\phi$ has an assignment that makes all the $\phi_S$ true (which follows from completeness if $x \in L$), our approximation algorithm will give us an assignment that makes at least a $\frac{1}{2-\epsilon} > \frac{1}{2}$ fraction of the $\phi_S$ true. But we can never make more than $\frac{1}{2}$ of the $\phi_S$ true if $x \notin L$. So we can run our hypothetical approximation algorithm, and if it gives us an assignment that satisfies more than half of the $\phi_S$, we know $x \in L$. If the approximation runs in P, we just solved SAT in P and showed P = NP.

15.4.2 Gap-preserving reduction to MAX SAT

Maximizing the number of subformulas $\phi_S$ that are satisfied is a strange problem, and we'd like to state this result in terms of a more traditional problem like MAX SAT. We can do this by converting each $\phi_S$ into a 3CNF formula (which makes $\phi$ also 3CNF), but the cost is that we reduce the gap between negative instances $x \notin L$ and positive instances $x \in L$.
The PCP theorem gives us a gap of $(1/2, 1)$ between negative and positive instances. If each $\phi_S$ is represented by $k$ 3CNF clauses, then it may be that violating a single $\phi_S$ only maps to violating one of those $k$ clauses. So where previously we either satisfied at most $1/2$ of the $\phi_S$ or all of them, now we might have a negative instance where we can still satisfy a $1-\frac{1}{2k}$ fraction of the clauses. So we only get P = NP if we are given a poly-time approximation algorithm for MAX SAT that is at least this good; or, conversely, we only show that P ≠ NP implies that there is no MAX SAT approximation that gets more than $1-\frac{1}{2k}$ of optimal.
This suggests that we want to find a version of the PCP theorem that makes $k$ as small as possible. Fortunately, Håstad [Hås01] showed that it is possible to construct a PCP-verifier for 3SAT with the miraculous property that (a) $q$ is only 3, and (b) the verification step involves testing only if $\pi_{i_1} \oplus \pi_{i_2} \oplus \pi_{i_3} = b$, where $i_1$, $i_2$, $i_3$, and $b$ are generated from the random bits. There is a slight cost: the completeness parameter of this verifier is only $1-\epsilon$ for any fixed $\epsilon > 0$, meaning that it doesn't always recognize a valid proof, and the soundness parameter is $1/2+\epsilon$. But checking $\pi_{i_1} \oplus \pi_{i_2} \oplus \pi_{i_3} = b$ requires a 3CNF formula of only 4 clauses. So this means that there is no approximation algorithm for MAX SAT that does better than $7/8+\delta$ of optimal in all cases, unless P = NP. This matches the $7/8$ upper bound given by just picking an assignment at random.4
This is an example of a reduction argument, since we reduced 3SAT first to a problem of finding a proof that would make a particular PCP-verifier happy and then to MAX SAT. The second reduction is an example of a gap-preserving reduction, in that it takes an instance of a problem with a non-trivial gap $(1/2+\epsilon, 1-\epsilon)$ and turns it into an instance of a problem with a non-trivial gap $(7/8+\epsilon, 1-\epsilon)$. Note that to be gap-preserving, a reduction doesn't have to preserve the value of the gap, it just has to preserve the existence of a gap. So a gap-reducing reduction like this is still gap-preserving. We can also consider gap-amplifying reductions: in a sense, Håstad's verifier gives a reduction from 3SAT to 3SAT that amplifies the gap from the trivial $(1-1/m, 1)$ that follows from only being able to satisfy $m-1$ of the $m$ clauses in a negative instance to the much more useful $(1/2+\epsilon, 1-\epsilon)$.

15.4.3 Other inapproximable problems


Using inapproximability of MAX SAT, we can find similar inapproximability
results for other NP-complete optimization problems by looking for gap-
preserving reductions from MAX SAT. In many cases, we can just use
whatever reduction we already had for showing that the target problem
was NP-hard. This gives constant-factor inapproximability bounds for
problems like GRAPH 3-COLORABILITY (where the value of a solution is
the proportion of two-colored edges) and MAXIMUM INDEPENDENT SET
(where the value of a solution is the size of the independent set). In each
4
It’s common in the approximation-algorithm literature to quote approximation ratios
for maximization problems as the fraction of the best solution that we can achieve, as in a
7/8-approximation for MAX SAT satisfying 7/8 of the maximum possible clauses. This
leads to rather odd statements when we start talking about lower bounds (“you can’t do
better than 7/8 + δ”) and upper bounds (“you can get at least 7/8”), since the naming of
the bounds is reversed from what they actually say. For this reason complexity theorists
have generally standardized on always treating approximation ratios as greater than 1,
which for maximization problems means reporting the inverse ratio, 8/7 −  in this case. I
like 7/8 better than 8/7, and there is no real possibility of confusion, so I will stick with
7/8.
CHAPTER 15. PCP AND HARDNESS OF APPROXIMATION 270

case we observe that a partial solution to the target problem maps back to a
partial solution to the original SAT problem.
In some cases we can do better, by applying a gap-amplification step. For
example, suppose that no polynomial-time algorithm for INDEPENDENT
SET can guarantee an approximation ratio better than ρ, assuming P 6= NP.
Given a graph G, construct the graph Gk on nk vertices where each vertex
in Gk represents a set of k vertices in G, and ST is an edge in Gk if S ∪ T
is not an independent set in G. Let I be an independent set for G. Then
the set I k of all k-subsets of I is an independent set in Gk (S ∪ T ⊆ I is an
independent set for any S and T in I k ). Conversely, given any independent
set J ⊆ Gk , its union J is an independent set in G (because otherwise
S

there is an edge either within some element of J or between two elements of


J). So any maximum independent set in Gk will be I k for some maximum
independent set in G.
This amplifies approximation ratios: given an independent set I such
that |I|/|OP T | = ρ, then |I k |/|OP T k | = |I|
 |OP T |
k / k ≈ ρk . If k is constant,
we can compute G in polynomial time. If we can then compute a ρk -
k

approximation to the maximum independent set in Gk , we can take the


union of its elements to get a ρ-approximation to the maximum independent
set in G. By making k sufficiently large, this shows that approximating the
maximum independent set to within any constant  > 0 is NP-hard.
There is a stronger version of this argument that uses expander graphs
to get better amplification, which shows that n−δ approximations are also
NP-hard for any δ < 1. See [AB07, §18.3] for a description of this argument.

15.5 Dinur’s proof of the PCP theorem


Here we give a very brief outline of Dinur’s proof of the PCP theorem [Din07].
This is currently the simplest known proof of the theorem, although it is still
too involved to present in detail here. For a more complete description, see
§18.5 of [AB07], or Dinur’s paper, which is pretty accessible.
A constraint graph is a graph G = (V, E), where the vertices in V are
interpreted as variables, and each edge uv in E carries a constraint cuv ⊆ Σ2
that specifies what assignments of values in some alphabet Σ are permitted
for the endpoints of uv. A constraint satisfaction problem asks for an
assignment σ : V → Σ that minimizes UNSATσ (G) = Pruv∈E [hσ(u), σ(v)i 6∈
cuv ], the probability that a randomly-chosen constraint is unsatisfied. The
quantity UNSAT(G) is defined as minimum value of UNSATσ (G): this is
the smallest proportion of constraints that we must leave unsatisfied. In
CHAPTER 15. PCP AND HARDNESS OF APPROXIMATION 271

the other direction, the value val(G) of a constraint satisfaction problem


is 1 − UNSAT(G): this is the largest proportion of constraints that we can
satisfy. 5
An example of a constraint satisfaction problem is GRAPH 3-COLORABILITY:
here Σ = {r, g, b}, and the constraints just require that each edge on the
graph we are trying to color (which will be the same as the constraint graph
G!) has two different colors on its endpoints. If a graph G with m edges
has a 3-coloring, then UNSAT(G) = 0 and val(G) = 1; if G does not, then
UNSAT(G) ≥ 1/m and val(G) ≤ 1 − 1/m. This gives a (1 − 1/m, 1) gap
between the best value we can get for non-3-colorable vs. 3-colorable graphs.
Dinur’s proof works by amplifying this gap.
Here is the basic idea:

1. We first assume that our input graph is k-regular (all vertices have the
same degree k) and an expander (every subset S with |S| ≤ m/2 has
δ|S| external neighbors for some constant δ > 0). Dinur shows that
even when restricted to graphs satisfying these assumptions, GRAPH
3-COLORABILITY is still NP-hard.

2. We then observe that coloring our original graph G has a gap of


(1−1/m, 1), or that UNSAT(G) ≥ 1/m. This follows immediately from
the fact that a bad coloring must include at least one monochromatic
edge.

3. To amplify this gap, we apply a two-stage process.


First, we construct a new constraint graph G0 (that is no longer a
graph coloring problem) with n vertices, where the constraint graph
has an edge between any two vertices at distance 2d + 1 or less in G,
the label on each vertex v is a “neighborhood map” assigning a color
of every vertex within distance d of v, and the constraint on each edge
uv says that the maps for u and v (a) assign the same color to each
5
Though Dinur’s proof doesn’t need this, we can also consider a constraint hyper-
graph, where each q-hyperedge is a q-tuple e = v1 v2 . . . vq relating q vertices, and a
constraint ce is a subset of Σq describing what assignments are permitted to the ver-
tices in e. As in the graph case, our goal is to minimize UNSATσ (G), which is now the
proportion of hyperedges whose constraints are violated, or equivalently to maximize
valσ (G) = 1 − UNSATσ (G). This gives us the q-CSP problem: given a q-ary constraint
hypergraph, find an assignment σ that maximizes valσ (G). An example of a constraint
hypergraph arises from 3SAT. Given a formula φ, construct a graph Gφ in which each
vertices represents a variable, and the constraint on each 3-hyperedge enforces least one of
the literals in some clause is true. The MAX 3SAT problem asks to find an assignment σ
that maximizes the proportion of satisfied clauses, or valσ (Gφ ).
CHAPTER 15. PCP AND HARDNESS OF APPROXIMATION 272

vertex in the overlap between the two neighborhoods, and (b) assigns
colors to the endpoints of any edge in either neighborhood that are
permitted by the constraint on that edge. Intuitively, this means that
a bad edge in a coloring of G will turn into many bad edges in G0 , and
the expander assumption means that many bad edges in G will also
turn into many bad edges in G0 . In particular, Dinur shows that with
appropriate tuning this process amplifies the UNSAT value of G by a
constant. Unfortunately, we also blow up the size of the alphabet by
Θ(k d ).
So the second part of the amplification knocks the size of the alphabet
back down to 2. This requires replacing each node in G0 with a set of
nodes in a new constraint graph G00 , where the state of the nodes in
the set encodes the state of the original node, and some coding-theory
magic is used to preserve the increased gap from the first stage (we
lose a little bit, but not as much as we gained).
The net effect of both stages is to take a constraint graph G of size n
with UNSAT(G) ≥  and turn it into a constraint graph G00 of size cn,
for some constant c, with UNSAT(G00 ) ≥ 2.

4. Finally, we repeat this process Θ(log m) = Θ(log n) times to construct


a constraint graph with size cO(log n) n = poly(n) and gap (1/2, 1).
Any solution to this constraint graph gives a PCP for GRAPH 3-
COLORABILITY for the original graph G.

15.6 The Unique Games Conjecture


The PCP theorem, assuming P 6= NP, gives us fairly strong inapproxima-
bility results for many classic optimization problems, but in many cases these
are not tight: there is a still a gap between the lower bound and the best
known upper bound. The Unique Games Conjecture of Khot [Kho02],
if true, makes many of these bounds tight.
The Unique Games Conjecture was originally formulated in terms of an
interactive proof game with two provers. In this game, the verifier V picks a
query q1 to ask of prover P1 and a query q2 to ask of prover P2 , and then
checks consistency of the prover’s answers a1 and a2 . (The provers cannot
communicate, and so answer the queries independently, although they can
coordinate their strategy before seeing the queries.) This gives a unique
game if, for each answer a1 there is exactly one answer a2 that will cause V
to accept.
CHAPTER 15. PCP AND HARDNESS OF APPROXIMATION 273

Equivalently, we can model a unique game as a restricted 2-CSP: the


labels on the vertices are the answers, and the consistency condition on
each edge is a bijection between the possible labels on each endpoint. This
corresponds to the two-prover game the same way PCPs correspond to
single-prover games, in that a labeling just encodes the answers given by
each prover.
A nice feature of unique games is that they are easy to solve in polynomial
time: pick a label for some vertex, and then use the unique constraints to
deduce the labels for all the other vertices. So the problem becomes interesting
mostly when we have games for which there is no exact solution.
For the Unique Games Conjecture, we consider the set of all unique
games with gap (δ, 1 − ); that is, the set consisting of the union of all unique
games with approximations with ratio 1 −  or better and all unique games
with no approximations with ratio better than δ. The conjecture states that
for any δ and , there exists some alphabet size k such that it is NP-hard to
determine of these two piles a unique game G with this alphabet size lands
in.6
Unfortunately, even though the Unique Games Conjecture has many
consequences that are easy to state (for example, the usual 2-approximation
to MINIMUM VERTEX COVER is optimal, as is the 0.878-approximation
to MAX CUT of Goemans and Williamson [GW95]), actually proving these
consequences requires fairly sophisticated arguments. So we won’t attempt
to do any of them here, and instead will point the interested reader to Khot’s
2010 survey paper [Kho10], which gives a table of bounds known at that
time and citations to where they came from.
There is no particular consensus among complexity theorists as to whether
the Unique Games Conjecture is true or not, but it would be nifty if it were.

6
Note that this is not a decision problem, in that the machine M considering G does
not need to do anything sensible if G is in the gap; instead, it is an example of a promise
problem where we have two sets L0 and L1 , M (x) must output i when x ∈ Li , but L0 ∪ L1
does not necessarily cover all of {0, 1}∗ .
Chapter 16

Quantum computing

Quantum computing is a currently almost-entirely-theoretical branch of


randomized algorithms that attempts to exploit the fact that probabilities
at a microscopic scale arise in a mysterious way from more fundamental
probability amplitudes, which are complex-valued and can cancel each
other out where probabilities can only add. In a quantum computation, we
replace random bits with quantum bits—qubits for short—and replace
random updates to the bits with quantum operations.
To explain how this works, we’ll start by re-casting our usual model of a
randomized computation to make it look more like the standard quantum
circuits of Deutsch [Deu89]. We’ll then get quantum computation by
replacing all the real-valued probabilities with complex-valued amplitudes.

16.1 Random circuits


Let’s consider a very simple randomized computer whose memory consists of
two bits. We can describe our knowledge of the state of this machine using a
vector of length 4, with the coordinates in the vector giving the probability
of states 00, 01, 10, and 11. For example, if we know that the initial state is
always 00, we would have the (column) vector
 
1
0
 .
 
0
0

Any such state vector x for the system must consist of non-negative
real values that sum to 1; this is just the usual requirement for a discrete

274
CHAPTER 16. QUANTUM COMPUTING 275

probability space. Operations on the state consist of taking the old values of
one or both bits and replacing them with new values, possibly involving both
randomness and dependence on the old values. The law of total probability
applies here, so we can calculate the new state vector x0 from the old state
vector x by the rule

x0b0 b0 = xb1 b2 Pr Xt+1 = b01 b02 Xt = b1 b2 .


X  
1 2
b1 b2

These are linear functions of the previous state vector, so we can summarize
the effect of our operation using a transition matrix A, where x0 = Ax.1
We imagine that these operations are carried out by feeding the initial
state into some circuit that generates the new state. This justifies the calling
this model of computation a random circuit. But the actual implementa-
tion might be an ordinary computer than is just flipping coins. If we can
interpret each step of the computation as applying a transition matrix, the
actual implementation doesn’t matter.
For example, if we negate the second bit 2/3 of the time while leaving
the first bit alone, we get the matrix
 
1/3 2/3 0 0
2/3 1/3 0 0 
A= .
 
 0 0 1/3 2/3
0 0 2/3 1/3

One way to derive this matrix other than computing each entry directly is
that it is the tensor product of the matrices that represent the operations
on the individual bits. The idea here is that the tensor product of A and B,
written A ⊗ B, is the matrix C with Cij,k` = Aik Bj` . We’re cheating a little
bit by allowing the C matrix to have indices consisting of pairs of indices,
one for each of A and B; there are more formal definitions that justify this
at the cost of being harder to understand.
In this particular case, we have
 
1/3 2/3 0 0 " # " #
2/3 1/3 0 0  1 0 1/3 2/3
= ⊗ .
 
 0 0 1/3 2/3 0 1 2/3 1/3

0 0 2/3 1/3
1
Note that this is the reverse of the convention we adopted for Markov chains in
Chapter 10. There it was convenient to have Pij = pij = Pr [Xt+1 = j | Xt = i]. Here we
defer to the physicists and make the update operator come in front of its argument, like
any other function.
CHAPTER 16. QUANTUM COMPUTING 276

The first matrix in the tensor product gives the update rule for the first bit
(the identity matrix—do nothing), while the second gives the update rule for
the second.
Some operations are not decomposable in this way. If we swap the values
of the two bits, we get the matrix
 
1 0 0 0
0 0 1 0
S= ,
 
0 1 0 0
0 0 0 1

which maps 00 and 11 to themselves but maps 01 to 10 and vice versa.


The requirement for all of these matrices is that they be stochastic. This
means that each column has to sum to 1, or equivalently that 1A = 1, where
1 is the all-ones vector. This just says that our operations map probability
distributions to probability distributions; we don’t apply an operation and
find that the sum of our probabilities is now 3/2 or something. (Proof: If
1A = 1, then |Ax|1 = 1(Ax) = (1A)x = 1x = |x|1 .)
A randomized computation in this model now consists of a sequence
of these stochastic updates to our random bits, and at the end performing a
measurement by looking at what the values of the bits actually are. If we
want to be mystical about it, we could claim that this measurement collapses
a probability distribution over states into a single unique state, but really
we are just opening the box to see what we got.
For example, we could generate two bias-2/3 coin-flips by starting with
00 and using the algorithm flip second, swap bits, flip second, or in matrix
terms:

xout = ASAxin
    
1/3 2/3 0 0 1 0 0 0 1/3 2/3 0 0 1
2/3 1/3 0 0  0 0 1 0 2/3 1/3 0 0  0
=
    
 0 0 1/3 2/3 0 1 0 0  0 0 1/3 2/3 0
   

0 0 2/3 1/3 0 0 0 1 0 0 2/3 1/3 0


 
1/9
2/9
= .
 
2/9
4/9

When we look at the output, we find 11 with probability 4/9, as we’d


expect, and similarly for the other values.
CHAPTER 16. QUANTUM COMPUTING 277

16.2 Bra-ket notation


A notational oddity that scares many people away from quantum mechanics in
general and quantum computing in particular is the habit of practitioners in
these fields of writing basis vectors and their duals using bra-ket notation,
a kind of typographical pun invented by the physicist Paul Dirac [Dir39].
This is based on a traditional way of writing an inner product of two
vectors x and y in bracket form as hx|yi. The interpretation of this is
hx|yi = x∗ y, where x∗ is the conjugate transpose of x.
For real-valued x the conjugate transpose is the same as the transpose.
For complex-valued x, each coordinate xi = a + bi is replaced by its complex
conjugate x̄i = a − bi.) Using the conjugate transpose makes hx|xi equal
|x|22 when x is complex-valued.
For example, for our vector xin above that puts all of its probability on
00, we have
 
1
h i 0
hxin |xin i = 1 0 0 0   = 1. (16.2.1)
 
0
0
The typographic trick is to split in half both hx|yi and its expansion. For
example, we could split (16.2.1) as
 
1
h i 0
hxin | = 1 0 0 0 |xin i =   .
 
0
0
In general, wherever we used to have a bracket hx|yi, we now have a bra
hx| and a ket |yi. These are just row vectors and column vectors, and hx| is
always the conjugate transpose of |xi.
The second trick in bra-ket notation is to make the contents of the bra or
ket an arbitrary name. For kets, this will usually describe some state. As an
example, we might write xin as |00i to indicate that it’s the basis vector that
puts all of its weight on the state 00. For bras, this is the linear operator
that returns 1 when applied to the given state and 0 when applied to any
orthogonal state. So h00| |00i = h00|00i = 1 but h00| |01i = h00|01i = 0.

16.2.1 States as kets


This notation is useful for the same reason that variable names are useful.
It’s much easier to remember that |01i refers to the distribution assigning
CHAPTER 16. QUANTUM COMPUTING 278
h i>
probability 1 to state 01 than it is to remember that 0 1 0 0 does.
Other vectors can be expressed using a linear combination of kets. For
example, we can write
1 2 2 4
xout = |00i + |01i + |10i + |11i .
9 9 9 9
This is not as compact as just writing out the vector as a matrix, but it has
the advantage of clearly labeling what states the probabilities apply to.

16.2.2 Composition of kets


We’ll interpret multiplication of kets as tensor product. For example:
 
" # " #0
1 0 1
|0i |1i = ⊗ =   = |01i .
 
0 1 0
0
This gives a way of composing states together. The general rule is
|xi |yi = |xyi. Because this is tensor product underneath, this is a linear
operation, and it associates with other products in the obvious way.

16.2.3 Operators as sums of kets times bras


A similar trick can be used to express operators, like the swap operator S.
We can represent S as a combination of maps from specific states to specific
other states. For example, the operator
   
0 0 0 0 0
1 h i 0 0 1 0
|01i h10| =   0 0 1 0 = 
   
0 0 0 0 0

0 0 0 0 0

maps |10i to |01i (Proof: |01i h10| |10i = |01i h10|10i = |01i) and sends all
other states to 0. Add up four of these mappings to get
 
1 0 0 0
0 0 1 0
S = |00i h00| + |10i h01| + |01i h10| + |11i h11| =  .
 
0 1 0 0
0 0 0 1
Here the bra-ket notation both labels what we are doing and saves writing a
lot of zeros.
CHAPTER 16. QUANTUM COMPUTING 279

The intuition is that just like a ket represents a state, a bra represents a
test for being in that state. So something like |01i h10| tests if we are in the
10 state, and if so, sends us to the 01 state.

16.3 Quantum circuits


So how do we turn our random circuits into quantum circuits?
The first step is to replace our random bits with quantum bits (qubits).
For a single random bit, the state vector represents a probability distri-
bution
p0 |0i + p1 |1i ,
where p0 and p1 are non-negative real numbers with p0 + p1 = 1.
For a single qubit, the state vector represents amplitudes
a0 |0i + a1 |1i ,

where a0 and a1 are complex numbers with |a0 |2 + |a1 |2 = 1.2 The reason for
this restriction on amplitudes is that if we measure a qubit, we will see state 0
with probability |a0 |2 and state 1 with probability |a1 |2 . Unlike with random
bits, these probabilities are not mere expressions of our ignorance but arise
through a still somewhat mysterious process from the more fundamental
amplitudes.3
2
√ The absolute value, norm, or magnitude |a + bi| of a complex number is given by
a2 + b2 . When b = 0, this is the same as the absolute value for the corresponding
√ real
number. For any complex number x, the norm p can also be written
p as x̄x, where
√ x̄ is
the complex conjugate of x. This is because (a + bi)(a − bi) = a2 − (bi)2 = a2 + b2 .
The appearance of the complex conjugate here explains why we define hx|yi = x∗ y; the
conjugate transpose means that for hx|xi, when we multiply x∗i by xi we are computing a
squared norm.
3
In the old days of “shut up and calculate,” this process was thought to involve the
unexplained power of a conscious observer to collapse a superposition into a classical
state. Nowadays the most favored explanation involves decoherence, the difficulty of
maintaining superpositions in systems that are interacting with large, warm objects with lots
of thermodynamic degrees of freedom (measuring instruments, brains). The decoherence
explanation is particularly useful for explaining why real-world quantum computers have a
hard time keeping their qubits mixed even when nobody is looking at them. Decoherence
by itself does not explain which basis states a system collapses to. Since bases in linear
algebra are pretty much arbitrary, it would seem that we could end up running into a
physics version of Goodman’s grue-bleen paradox [Goo83], but there are apparently ways of
dealing with this too using a mechanism called einselection [Zur03] that favors classical
states over weird ones. Since all of this is (a) well beyond my own limited comprehension
of quantum mechanics and (b) irrelevant to the theoretical model we are using, these issues
will not be discussed further.
CHAPTER 16. QUANTUM COMPUTING 280

With multiple bits, we get amplitudes for all combinations of the bits,
e.g.
1
(|00i + |01i + |10i + |11i)
2
gives a state vector in which each possible measurement will be observed
 2
1
with equal probability 2 = 14 . We could also write this state vector as
 
1/2
1/2
.
 
1/2

1/2

16.3.1 Quantum operations


In the random circuit model, at each step we pick a small number of random
bits, apply a stochastic transformation to them, and replace their values
with the results. In the quantum circuit model, we do the same thing, but
now our transformations must have the property of being unitary. Just as
a stochastic matrix preserves the property that the probabilities in a state
vector sum to 1, a unitary matrix preserves the property that the squared
norms of the amplitudes in a state vector sum to 1.
Formally, a square, complex matrix A is unitary if it preserves inner
products: hAx|Ayi = hx|yi for all x and y. Alternatively, A is unitary if
A∗ A = AA∗ = I, where A∗ is the conjugate transpose of A, because if this
holds, hAx|Ayi = (Ax)∗ (Ay) = x∗ A∗ Ay = x∗ Iy = x∗ y = hx|yi. Yet another
way to state this is that the columns of A form an orthonormal basis: this
means that hAi |Aj i = 0 if i 6= j and 1 if i = j, which is just a more verbose
way to say A∗ A = I. The same thing also works if we consider rows instead
of columns.
The rule then is: at each step, we can operate on some constant number
of qubits by applying a unitary transformation to them. In principle, this
could be any unitary transformation, but some particular transformations
show up often in actual quantum algorithms.4
The simplest unitary transformations are permutations on states (like
the operator that swaps two qubits), and rotations of a single state. One
4
Deutsch’s original paper [Deu89] shows that repeated applications of single-qubit
rotations and the CNOT operation (described in §16.3.2) are enough to approximate any
unitary transformation.
CHAPTER 16. QUANTUM COMPUTING 281

particularly important rotation is the Hadamard operator


r " #
1 1 1
H= .
2 1 −1
q q
This maps |0i to the superposition 12 |0i + 12 |1i; since this superposition
collapses to either |0i or |1i with probability 1/2, the state resulting from
H |0i is q
the quantum-computing
q equivalent of a fair coin-flip. Note that
H |1i = 12 |0i − 12 |1i 6= H |0i. Even though both yield the same prob-
abilities, these two superpositions have different phases and may behave
differently when operated on further. That H |0i and H |1i are different is
necessary, and indeed a similar outcome occurs for any quantum operation:
all quantum operations are reversible, because any unitary matrix U has
an inverse U ∗ .
If we apply H in parallel to all the qubits in our system, we get the n-fold
tensor product H ⊗n , which (if we take our bit-vector indices
q as integers
−1
0 . . . N − 1 = 2n − 1 represented in binary) maps |0i to N1 N i=0 |ii. So
P

n applications of H effectively scramble a deterministic initial state into a


uniform distribution across all states. We’ll see this scrambling operation
again when we look at Grover’s algorithm in §16.5.

16.3.2 Quantum implementations of classical operations


One issue that comes up with trying to implement classical algorithms in the
quantum-circuit model is that classical operation are generally not reversible:
if I execution x ← x ∧ y, it may not be possible to reconstruct the old state
of x. So I can’t implement AND directly as a quantum operation.
The solution is to use more sophisticated reversible operations from which
standard classical operations can be extracted as a special case. A simple
example is the controlled NOT or CNOT operator, which computes the
mapping (x, y) 7→ (x, x ⊕ y). This corresponds to the matrix (over the basis
|00i , |01i , |10i , |11i)
 
1 0 0 0
0 1 0 0
,
 
0 0 0 1

0 0 1 0
which is clearly unitary (the rows are just the standard basis vectors). We
could also write this more compactly as |00i h00| + |01i h01| + |11i h10| +
|10i h11|.
CHAPTER 16. QUANTUM COMPUTING 282

The CNOT operator gives us XOR, but for more destructive operations
we need to use more qubits, possibly including junk qubits that we won’t look
at again but that are necessary to preserve reversibility. The Toffoli gate or
controlled controlled NOT gate (CCNOT) is a 3-qubit gate that was
originally designed to show that classical computation could be performed
reversibly [Tof80]. It implements the mapping (x, y, z) 7→ (x, y, (x ∧ y) ⊕ z),
which corresponds to the 8 × 8 matrix
 
1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
 
0 0 1 0 0 0 0 0
 
 
0 0 0 1 0 0 0 0
 .
0 0 0 0 1 0 0 0
 
0 0 0 0 0 1 0 0


0 0 0 0 0 0 0 1
 

0 0 0 0 0 0 1 0

By throwing in some extra qubits we don’t care about, Toffoli gates


can implement basic operations like NAND ((x, y, 1) 7→ (x, y, ¬(x ∧ y))),
NOT ((x, 1, 1) 7→ (x, 1, ¬x)), and fan-out ((x, 1, 0) 7→ (x, 1, x)).5 This gives
a sufficient basis for implementing all classical circuits.
What we get at the end of applying this process to a classical circuit
implementing some Boolean function f is a quantum circuit implementing
a unitary operator f that maps |x, yi to |x, y ⊕ f (x)i (possibly with some
additional auxiliary qubits getting passed through as well). If we want to
treat this operator as an input to a later computation (say, one that tests
some condition on all of f ), there is a further simplification step that we
can do that will make f much easier to work with. This replaces the XOR
y ⊕ f (x) with a change in phase on |xi itself.
5
In the case of fan-out, this only works with perfect accuracy for classical bits and not
superpositions, which run into something called the no-cloning theorem. For example,
applying CCNOT to √12 |010i + √12 |110i yields √12 |010i + √12 |111i. This works, sort of,
but the problem is that the first and last bits are still entangled, meaning we can’t operate
on them independently. This is actually not all that different from what happens in the
probabilistic case (if I make a copy of a random variable X, it’s correlated with the original
X), but it has good or bad consequences depending on whether you want to prevent people
from stealing your information undetected or run two copies of a quantum circuit on
independent replicas of a single superposition.
CHAPTER 16. QUANTUM COMPUTING 283

16.3.3 Phase representation of a function


The idea of a phase representation of a Boolean function f is to use an
operator Uf = x (−1)f (x) |xi hx|. This has the effect of mapping each |xi
P

to − |xi when f (x) = 1, and passing it through intact when f (x) = 0. The
result is a diagonal matrix whose diagonal looks like a truth table for f
expressed as ±1 values. This has the effect of mapping each |xi to − |xi
when f (x) = 1, and passing it through intact when f (x) = 0. The result is a
diagonal matrix Uf whose diagonal looks like a truth table for f expressed
as ±1 values, as in this matrix for the XOR function f (x0 , x1 ) = x0 ⊕ x1 :
 
1 0 0 0
0 −1 0 0
.
 
0 0 −1 0

0 0 0 1

Because the amplitude of each |xi is unchanged, this representation is


not so useful for observing the value of f (x) directly. So we will generally
be using this representation of f in the context of some larger quantum
algorithm that exploits the phase changes to get useful cancellations for some
values of x.
The question remains how we get the phase representation Uf , given
that our standard construction of a classical circuit gives the controlled-not
representation f : |xyi 7→ |x, y ⊕ f (x)i. The trick here is to feed this f the
particular value y = H |1i = √12 (|0i − |1i) This gives

1
f |xyi = √ (f |x0i − f |x1i)
2
1
= √ (|x, f (x)i − |x, ¬f (x)i)
2
(−1)f (x)
= √ (|x0i − |x1i).
2
1
= (−1)f (x) |xi √ (|0i − |1i).
2
= (Uf |xi) |yi .

By leaving the auxiliary y alone in subsequent computation, we effectively


end up with just Uf |xi.
CHAPTER 16. QUANTUM COMPUTING 284

16.3.4 Practical issues (which we will ignore)


The big practical question is whether any of these operations—or even non-
trivial numbers of independently manipulable qubits—can be implemented
in a real, physical system. As theorists, we can ignore these issues, but in
real life they are what would make quantum computing science instead of
science fiction.6

16.3.5 Quantum computations


Putting everything together, a quantum computation consists of three stages:

1. We start with a collection of qubits in some known state x0 (e.g.,


|000 . . . 0i).

2. We apply a sequence of unitary operators A1 , A2 , . . . Am to our qubits.


Some of these unitary operators may be oracles Uf or Uw representing
the input we actually care about.

3. We take a measurement of the final superposition Am Am−1 . . . A1 x0


that collapses it into a single state, with probability equal to the square
of the amplitude of that state.

Our goal is for this final state to tell us what we want to know, with
reasonably high probability.

16.4 Deutsch’s algorithm


We now have enough machinery to describe a real quantum algorithm.
Known as Deutsch’s algorithm, this computes f (0) ⊕ f (1) while evaluating
f once [Deu89]. The trick, of course, is that f is applied to a superposition.
Assumption: f is implemented as a unitary operator Uf that that maps
|xi to (−1)f (x) |xi. To compute f (0) ⊕ f (1), evaluate
r
1
HUf H |0i = HUf (|0i + |1i)
r2 
1 
= H (−1)f (0) |0i + (−1)f (1) |1i
2
1     
= (−1)f (0) + (−1)f (1) |0i + (−1)f (0) − (−1)f (1) |1i .
2
6
Current technology, sadly, still puts quantum computing mostly in the science fiction
category.
CHAPTER 16. QUANTUM COMPUTING 285

Suppose now that f (0) = f (1) = b. Then the |1i terms cancel out and
we are left with
1 
2 · (−1)b |0i = (−1)b |0i .
2
This puts all the weight on |0i, so when we take our measurement at the
end, we’ll see 0.
Alternatively, if f (0) = b 6= f (1), it’s the |0i terms that cancel out,
leaving (−1)b |1i. The phase depends on b, but we don’t care about the
phase. The important thing is that if we measure the qubit, we always see 1.
The result in either case is that with probability 1, we determine the
value of f (0) ⊕ f (1), after evaluating f once (albeit on a superposition of
quantum states).
This is kind of a silly example, because the huge costs involved in building
our quantum computer almost certainly swamp the factor-of-2 improvement
we got in the number of calls to f . But a generalization of this trick, known
as the Deutsch-Josza algorithm [DJ92], solves the much harder (although
still a bit contrived-looking) problem of distinguishing a constant Boolean
function on n bits from a function that outputs one for exactly half of its
inputs. No deterministic algorithm can solve this problem without computing
at least 2n /2 + 1 values of f , giving an exponential speed-up.
The speed-up compared to a randomized algorithm that works with
probability 1 −  is less impressive. With randomization, we only need to
look at O(log 1/) values of f to see both a 0 and a 1 in the non-constant
case. But even here, the Deutsch-Josza algorithm does have the advantage
of giving the correct answer with probability 1. If we make the same demand
of a randomized algorithm, it does no better than a deterministic algorithm,
at least in the constant-function case.

16.5 Grover’s algorithm


Grover’s algorithm [Gro96] is one of two main exemplars for the astonishing
power of quantum computers.7 The idea of Grover’s algorithm is that if we
have a function f on N = 2n possible inputs whose value is 1 for exactly

one possible input w, we can find this w with high probability using O( N )
quantum evaluations of f . As with Deutsch’s algorithm, we assume that f
7
The other is Shor’s algorithm [Sho97], which allows a quantum computer to factor
n-bit integers in time polynomial in n. Sadly, Shor’s algorithm is a bit too complicated to
talk about here.
CHAPTER 16. QUANTUM COMPUTING 286

is encoded as an operator (conventionally written Uw ) that maps each |xi to


(−1)f (x) |xi.
The basic outline of the algorithm:
q
1. Start in the superposition |si = 1
x |xi = H ⊗n |0i.
P
N

2. Alternate between applying the Grover diffusion operator D =



2 |si hs| − I and the f operator Uw = 2 |wi hw| − I. Do this O( n)
times (the exact number of iterations is important and will be calculated
below).

3. Take a measurement of the state. It will be w with high probability.

Making this work requires showing that (a) we can generate the original
superposition |si, (b) we can implement D efficiently using unitary operations
on a constant number of qubits each, and (c) we actually get w at the end
of this process.

16.5.1 Initial superposition


To get the initial superposition, start with |0qn i and apply the Hadamard
transform to each bit individually; this gives N1 x |xi as claimed.
P

16.5.2 The Grover diffusion operator


We have the definition D = 2 |si hs| − I.
Before we try to implement this, let’s start by checking that it is in fact
unitary. Compute

DD∗ = (2 |si hs| − I)2


= 4 |si hs| |si hs| − 4 |si hs| + I 2
= 4 |si hs| − 4 |si hs| + I
= I.

Here we use the fact that |si hs| |si hs| = |si hs|si hs| = |si (1) hs| = |si hs|.
Now let’s look at implementation. Recall that |si = H ⊗n |0n i, where
H ⊗n is the result of applying H to each of the n bits individually. We also
have that H ∗ = H and HH ∗ = I, from which H ⊗n H ⊗n = I as well.
CHAPTER 16. QUANTUM COMPUTING 287

So we can expand

D = 2 |si hs| − I
= 2H ⊗n |0n i (H ⊗n |0n i)∗ − I
= 2H ⊗n |0n i h0n | H ⊗n − I
= H ⊗n (2 |0n i h0n | − I) H ⊗n .

The two copies of H ⊗n involve applying H to each of the n bits, which


we can do. The operator in the middle, 2 |0n i h0n | − I, maps |0n i to |0n i and
maps all other basis vectors |xi to − |xi. This can be implemented as a NOR
of all the qubits, which we can do using our tools for classical computations.
So the entire operator D can be implemented using O(n) qubit operations,
most of which can be done in parallel.

16.5.3 Effect of the iteration


To see what happens when we apply Uw D, it helps to represent the state in
terms of a particular two-dimensional basis. The idea here is that the initial
state |si and the operation Uw D are symmetric with respect to any basis
vectors |xi that aren’t |wi, so instead of tracking all of these non-|wi vectors
separately, we will represent all of them by a single composite vector
s
1 X
|ui = |xi .
N − 1 x6=w
q
The coefficient N 1−1 is chosen to make hu|ui = 1. As always, we like
our vectors to have length 1.
Using |ui, we can represent
r s
1 N −1
|si = |wi + |ui . (16.5.1)
N N
q
1
A straightforward calculation shows that this indeed puts N amplitude
on each |xi. q
Now we’re going to bring in some trigonometry. Let θ = sin−1 N1 , so
q √ q
that sin θ = N1 and cos θ = 1 − sin2 θ = NN−1 . We can then rewrite
(16.5.1) as

|si = (sin θ) |wi + (cos θ) |ui . (16.5.2)


CHAPTER 16. QUANTUM COMPUTING 288

Let’s look at what happens if we expand D using (16.5.2):

D = 2 |si hs| − I
= 2 ((sin θ) |wi + (cos θ) |ui) ((sin θ) hw| + (cos θ) hu|) − I
= (2 sin2 θ − 1) |wi hw| + (2 sin θ cos θ) |wi hu| + (2 sin θ cos θ) |ui hw| + (2 cos2 θ − 1) |ui hu|
= (− cos 2θ) |wi hw| + (sin 2θ) |wi hu| + (sin 2θ) |ui hw| + (cos 2θ) |ui hu|
" #
− cos 2θ sin 2θ
= ,
sin 2θ cos 2θ

where the matrix is over the basis (|wi , |ui).


Multiplying by Uw negates all the |wi coefficients. So we get
" #" #
−1 0 − cos 2θ sin 2θ
Uw D =
0 1 sin 2θ cos 2θ
" #
cos 2θ − sin 2θ
= . (16.5.3)
sin 2θ cos 2θ

Aficionados of computer graphics, robotics, or just matrix algebra in


general may recognize (16.5.3) as the matrix that rotates two-dimensional
vectors by 2θ. Since we started with |si at an angle of θ, after t applications
of this matrix we will be at an angle of (2t + 1)θ, or in state

(sin(2t + 1)θ) |wi + (cos(2t + 1)θ) |ui .

Ideally, we pick t so that (2t + 1)θ = π/2, which would put all of the
amplitude on |wi. Because t is an integer, we can’t do this exactly, but
setting t = b π/2θ−1
2 c will get us somewhere between π/2 − 2θ and π/2. Since
q
1
θ≈ N,this gives us a probability of seeing |wi in our final measurement
p √
of 1 − O( 1/N ) after O( N ) iterations of Uw D.
Sadly, this is as good as it gets. A lower bound of Bennet et al. [BBBV97]
shows that any quantum √ algorithm using Uw as the representation for f must
apply Uw at least Ω( N ) times to find w. So we get a quadratic speedup
but not the exponential speedup we’d need to solve NP-complete problems
directly.
Chapter 17

Randomized distributed
algorithms

A distributed algorithm is one that runs on multiple machines that com-


municate with each other in some way, typically via message-passing (an
abstraction of packet-based computer networks) or shared memory (an
abstraction of multi-core CPUs and systems with a common memory bus).
What generally distinguishes distributed computing from the closely-
related idea of parallel computing is that we expect a lot of nondetermin-
ism in a distributed algorithm: events take place at unpredictable times,
processes may crash, and in particular bad cases we may have Byzantine
processes that work deliberately against the algorithm. We are not going to
assume Byzantine processes; instead, we’ll look at a particular model called
wait-free shared memory.
This hostile nondeterminism is modeled by an adversary that controls
scheduling and failures. For a shared-memory model, we have a collection
of processes that each have a pending operation that is either a read
or write to some register. The adversary chooses which of these pending
operations happens next (so that concurrency between processes is modeled
by interleaving their operations). Unlike the adversary that supplies the
worst-case input to a traditional algorithm before it executes, an adversary
scheduler might be able to observe the execution of a distributed algorithm
in progress and adjust its choices in response. If it has full knowledge of
the system (including internal states of processes), we call it an adaptive
adversary; at the other extreme is an oblivious adversary that chooses
the schedule of which processes execute operations at which times in advance.
This is similar to the distinction between an adaptive and oblivious adversary

289
CHAPTER 17. RANDOMIZED DISTRIBUTED ALGORITHMS 290

required for the analysis of randomized data structures (see §6.3.1.


For either adversary, what makes the system wait-free is that any process
that gets to take an unbounded number of steps has to finish whatever it is
doing no matter how the other processes are scheduled. In particular, this
means that no process can wait for any of the others, because they might
never be scheduled. The processes that do run all run the algorithm correctly,
so the model is equivalent to assuming up to n − 1 of the processes may
crash but that none are Byzantine.
As with traditional algorithms, distributed algorithms can use random-
ization to make themselves less predictable. The assumption is that even if
the adversary is adaptive, it can’t predict the outcome of future coin-flips.
For some problems, avoiding such predictability is provably necessary.
Distributed computing is a huge field, and we aren’t going to be able
to cover much of it in the limited space we have here. So we are going
to concentrate on a few simple problems that give some of the flavor or
randomized distributed algorithms (and that lead to problems that are still
open).

17.1 Consensus
The consensus problem is to get a collection of n processes to agree on a
value. The requirements are that all the processes that don’t crash finish the
protocol (with probability 1 for randomized protocols) (termination), that
they all output the same value (agreement), and that this value was an input
to one of the processes (validity)—this last condition excludes protocols that
always output the same value no matter what happens during the execution,
and makes consensus useful for choosing among values generated by some
other process.
There are many versions of consensus. The original problem as proposed
by Pease, Shostak, and Lamport [PSL80] assumes Byzantine processes in a
synchronous message-passing system. Here scheduling is entirely predictable,
and the obstacle to agreement is dissension sown by the lying Byzantine
processes. We will instead consider wait-free shared-memory consensus,
where scheduling is unpredictable but the the processes and memory are
trustworthy. Even in this case, the unpredictable scheduling makes solving
the problem deterministically impossible.
CHAPTER 17. RANDOMIZED DISTRIBUTED ALGORITHMS 291

17.1.1 Impossibility of deterministic algorithms


The FLP impossibility result [FLP85] and its extension to shared mem-
ory [LAA87] show that consensus is impossible to solve in an asynchronus
model with even one crash failure. The general proof is a bit involved, but
there is a simple intuitive proof of this result for the wait-free case where up
to n − 1 of the processes may crash.
Crash all but two processes, one with input 0 and one with input 1.
Define the preference of a process as the value it will decided in a solo
execution (this is well-defined because the processes are deterministic). In the
initial state, the process with input b has preference b because of validity—it
doesn’t know the other process exists. But before it returns b, it has to cause
the other process to change its preference (once it leaves the other process is
running alone). It can do so only by sending a message or writing a register.
When it is about to do this, stop it and run the other process until it does
the same thing.
Either the other process somehow neutralizes the effect of the delayed
message/write during this time, or it doesn’t. In the first case, restart
the first process (which still has its original preference and now must do
something else to make the other process change). In the second case, deliver
both operations and let the processes exchange preferences. The resulting
execution looks very much like when two people are trying to pass each other
in a hallway and oscillate back-and-forth—but since our process’s have their
timing controlled by an adversary, the natural damping or randomness that
occurs in humans doesn’t ever resolve the situation.
We can avoid this impossibility result with randomness. If either of the
process decides to choose a new preference at random, then there is a 50%
chance it will end up agreeing with the other one, no matter what the other
one is doing. Assuming we can detect this agreement, this solves consensus
in constant expected time for two processes.
Unfortunately, this flip-if-confused approach does not generalize very
well to n processes. The first known randomized shared-memory algorithm
for consensus, due to Abrahamson [Abr88], used essentially this idea. But
because it requiredwaiting  for a long run of identical coin-flips across all
n 2
processes, it took O 2 steps on average to finish. In a little while, we will
see how to do better.
CHAPTER 17. RANDOMIZED DISTRIBUTED ALGORITHMS 292

17.2 Leader election


In a shared-memory system, leader election is a very similar problem to
consensus. Here we want one of the processes to decide that it is the leader,
and the rest to decide that they are followers. The difference from consensus
is that we don’t require that the followers learn the identity of the leader.
This means that followers can drop out before the algorithm determines the
leader, which may allow a faster algorithm.
Note however that if there are only two processes, the losing process
knows who the winner is, because there is only one possibility. So the same
argument that shows that we can’t do consensus deterministically with two
processes also works for leader election.

17.3 How randomness helps


We’ve already hinted at the idea of using randomness to shake processes into
agreement. In the current literature on shared-memory consensus and leader
election, this idea shows up in two main forms:

1. We can replace the n separate coin-flips of stalled processes with a


single weak shared coin protocol. This is a protocol that has for each
outcome b ∈ {0, 1} a minimum probability δ > 0 that every processes
receives b as the value of the coin. Because δ will typically be less than
1/2, it is possible that we get disagreements, or that the adversary can
bias the coin toward a particular value that it likes. But with a bit of
extra machinery that we will not describe here, we can get agreement
after O(1/δ) calls to the shared coin on average [AH90]. An example
of a weak shared coin protocol that assumes an adaptive adversary is
given in §17.4.

2. We can eliminate processes quickly using a sifter, which solves a


weak version of leader election that allows for multiple winners but
guarantees that the expected number of winners is small relative to
n. Repeated application of a sifter can quickly knock the number of
repeated winners down to 1, which solves leader election (§17.5. With
some tinkering, it is possible to use some sifters to also solve consensus
(§17.6. In both cases we assume an oblivious adversary.
CHAPTER 17. RANDOMIZED DISTRIBUTED ALGORITHMS 293

17.4 Building a weak shared coin


The goal of a weak shared coin is to minimize the influence of the adversary
scheduler. An adaptive adversary can bias the outcome of the coin by
looking at the random values generated by the individual processes, and
withholding values that it doesn’t like by delaying any write operations by
those processes. So we would like to find a way to combine the local coins at
the processes together into a single global coin that minimizes the influence
of any particular local coin. The easiest way to do this is by applying the
majority function: each process repeatedly generates ±1 values with equal
probability and adds them to a common pile. When the pile includes enough
local coins, we take the sign of the total to get the return value of the global
coin.
The nice thing about this approach is that the behavior of the processes
is symmetric. It doesn’t matter what order the adversary runs them in,
the next coin-flip is always another fair coin-flip. So we can analyze the
sequence of partial sums of generated coin-flips using the tools we have
developed for sums of independent variables, random walks, and martingales.
Unfortunately things get a little more complicated when we look at the totals
actually observed by any particular processes.
The problem is that we need some mechanism for gathering up the local
coin values. Using read-write registers, we can have each process write
the count and sum of its own local coins to a register writable by it alone
(this avoids processes overwriting each other). Because the adversary can
only delay one coin from each process, both total count and the total sum
are always within n of the correct value. If we could read all the registers
instantaneously and stop as soon as the count reached some threshold T ,
then for T = Θ(n2 ) the distribution on the generated coins would be spread
out enough that the total would be at least n + 1 away from 0 with constant
probability. But the actual situation is more complicated.
The problem is that each process must do n sequential reads to read all
the registers. This is not only expensive—it’s not something we want to do
after every local flip—but it also means that more coins may be generated
while I’m reading the registers, meaning any sum I compute at the end of
the algorithm might have only a tenuous connection with the actual sum of
the generated coins at any point in the execution.
The solution to the first part of this problem was proposed by Bracha
and Rachman [BR91]: only check the total after writing out Θ(n/ log n)
local coin-flips, giving an amortized cost per coin-flip of Θ(log n) register
operations. Unfortunately this makes the second problem worse: a process p
CHAPTER 17. RANDOMIZED DISTRIBUTED ALGORITHMS 294

might continue generating many local coins after the total count crosses the
threshold, and each other process might see a different prefix of these coins
depending on when it reads p’s register.
Bracha and Rachman showed that this wasn’t too much of a problem
using Azuma’s inequality (this is where the O(log n) factor comes in). But
later work by Attiya and Censor [AC08] allowed for a simpler analysis of a
slightly different algorithm, which we will describe here.
Pseudocode for the Attiya-Censor coin is given in Algorithm 17.1. The
algorithm only checks the total count once for every n coin-flips, giving an
amortized cost of 1 read per coin-flip. But to keep the processes for running
too long, each process checks a multi-writer stop bit after ever coin-flip.

while nj=1 rj .count < T do


P
1
2 for i ← 1 . . . n do
3 hri .count, ri .sumi ← hri .count + 1, ri .sum ± 1i
4 if stop then
5 break;

6 stop ← true
P 
n
7 return sgn j=1 rj .sum

Algorithm 17.1: Attiya-Censor weak shared coin [AC08]

Each process may see a different total sum at the end, but our hope is
that if T is large enough, there is at least a δ probability that all these total
sums are positive (or, by symmetry, negative). We can represent the total
sum seen by any particular process i as a sum S + D + Xi − Hi , where:

1. S is the sum of the first T coin-flips. This has a binomial distribution,


equal to the sum of T independent ±1 random variables. Letting
T = cn2 for some reasonably large c gives us a constant probability
that |S| > 4n.

2. D is the sum of all coin-flips after the first T that are generated before
some process sets the stop bit. There are at most n2 + n such coin-flips,
and they form a martingale with bounded increments. So Azuma’s
inequality gives us that |D| ≤ 2n with at least constant probability,
independent of S.

3. Xi = nj=1 Yij , where Yij is the sum of all coin-flips generated by


P

process j between some process setting the stop bit and process i
CHAPTER 17. RANDOMIZED DISTRIBUTED ALGORITHMS 295

reading rj . Since each process can generate at most one extra coin-flip
before checking stop, |Xi | ≤ n always.

4. Hi = nj=1 Zij , where Zij is the sum of all votes that are generated by
P

process j before i reads rj , but that are not included in rj .sum because
they haven’t been written yet. Again, each process can contribute only
one coin-flip to Hi . So |Hi | ≤ n always.

Adding up the error terms D + Xi − Hi gives a total that is bounded by


4n with at least constant probability. This probability is independent of the
constant probability that |S| > 4n. So if both of these events occur, we win.
The total cost of the coin is O(T + n2 ) = O(n2 ). This also gives a cost
of O(n2 ) for consensus.
This turns out to be optimal, because of a lower bound on the expected
total steps for consensus appearing in the same paper [AC08]. This is a bit
disappointing, because an Ω(n2 ) lower bound translates to Ω(n) steps for
each process even if we can distribute the load evenly across all processes
(which Algorithm 17.1 might not). We’d like to get something that scales
better, but to get around the lower bound we will need to switch to an
oblivious adversary.

17.5 Leader election with sifters


The idea of a sifter [AA11] is to use randomization to quickly knock out pro-
cesses while leaving at least one process always. The current best known sifter
for standard read-write registers, due to Giakkoupis and Woelfel [GW12], is
shown in Algorithm 17.2.

1 Choose r ∈ Z+ such that Pr [r = i] = 2−i


2 A[r] ← 1
3 if A[r + 1] = 0 then
4 stay
5 else
6 leave
Algorithm 17.2: Giakkoupis-Woelfel sifter [GW12]

The cost of executing the sifter is two operations. Each process chooses
an index r according to a geometric distribution, writes A[r], and then checks
if any other process has written A[r + 1]. The process stays if it sees nothing.
CHAPTER 17. RANDOMIZED DISTRIBUTED ALGORITHMS 296

Because any process with the maximum value of r always stays, at least
one process stays. To show that not too many processes stay, let Xi be the
number of survivors with r = i. This will be bounded by the number of
processes that write to A[i] before any process writes to A[i + 1]. We can
immediately see that E [Xi ] ≤ n · 2−i , since each process has probability 2−i
of writing to A[i]. But we can also argue that E [Xi ] ≤ 2 for any value of n.
The reason is that because the adversary is oblivious, the choice of which
process writes next is independent of the location it writes to. If we condition
on some particular write being to either A[i] or A[i + 1], there is a 1/3
chance that it writes to A[i + 1]. So we can think of the subsequence of
writes that either land in A[i + 1] or A[i] as a sequence of biased coin-flips,
and we are counting how many probability-2/3 tails we get before the first
probability-1/3 heads. This will be 2 on average, or at most 2 if we take into
account that we will stop after n writes.
We thus have E [Xi ] ≤ min(2, n · 2−i ). So the expected total number of
winners is bounded by ∞ −i
i=1 E [Xi ] min(2, n · 2 ) = 2 lg n + O(1).
P

Now comes the fun part. Take all the survivors of our sifter, and run
them through more sifters. Let Si be the number of survivors of the first
i sifters. We’ve shown that E [Si+1 | Si ] = O(log Si ). Since log is concave,
Jensen’s inequality then gives E [Si+1 ] = O(log E [Si ]). Iterating this gives
E [Si ] = O(log(i) S0 ) = O(log(i) n). So there is some i = O(log∗ n at which
E [Si ] is a constant.
This doesn’t quite give us leader election because the constant might not
be 1. But there are known leader election algorithms that run in O(1) time
with O(1) expected participants [AAG+ 10], so we can use these to clean up
any excess processes that make it through all the sifters. The total cost is
O(log∗ n) operations for each process.

17.6 Consensus with sifters


We’ve seen that we can speed up leader election by discarding potential
leaders quickly. For consensus, we want to discard potential output values
quickly. This is a little more complicated, because processes carrying losing
output values can’t just drop out if we haven’t determined the winning values
yet.
We are going to look at an algorithm by Aspnes and Er [AE19] that
solves consensus in O(log∗ n) operations per process, using sifters built on
top of a stronger primitive than normal registers. (This algorithm uses
some ideas from an earlier paper of Aspnes [Asp15], which gives a more
CHAPTER 17. RANDOMIZED DISTRIBUTED ALGORITHMS 297

complicated O(log log n)-time algorithm for standard registers.) As with
the Giakkoupis-Woelfel leader election algorithm, we assume an oblivious
adversary scheduler.
The stronger primitive we need is a max register [AACH12]. Unlike
a standard register, reading a max register returns the largest value ever
written to it; equivalently, trying to write a smaller value to a max register
than it already contains has no effect.
What this means for randomized algorithms is that we can knock out
values using a max register by a simple priority scheme. If each process
writes a tuple ⟨r, v⟩ to the max register, where r is a unique random priority
and v is the process’s preferred value, then the i-th value written appears
in the max register only if it is a left-to-right maximum of the sequence of
priorities, which occurs with probability roughly 1/i. (This follows from
symmetry and the fact that the oblivious adversary can’t selectively schedule
smaller priorities first.) So on average only Hn = O(log n) of these values
will ever appear in the max register. By having each process read and return
a value from the max register, we’ve gone from n possible values to O(log n)
possible values on average.
We now run into two technical obstacles. The first is that we can’t
necessarily guarantee that all the random priorities are unique, which we
would need to get an exact 1/i probability for survival. This can be dealt with
by choosing priorities over a sufficiently large range that the rare collisions
don't add much to the expected survivors (say, O(n^3)). The second issue is
that if we try to iterate the sifter, in the second round we have many copies
of each input value. So even if one copy drops out because of a low priority,
some other copy might get lucky and survive. To go from O(log n) expected
values to O(log log n) expected values, we need to make sure that no value
gets more than one chance at surviving.
Here is where we use an idea that first appeared in [Asp15]. Instead of
generating a new priority for each value in each round, we generate a sequence
r_1, r_2, . . . , r_ℓ of priorities for all ℓ rounds all at once at the beginning. We
can then store ⟨r_i, . . . , r_ℓ, v⟩ in the max register for round i, and the leading
priority r_i controls survival. Now we don't care about having multiple copies
of a value, because they will all have the same initial r_i, and if the first copy
to be written doesn't survive, none of them will.
The result of applying this idea is Algorithm 17.3. Because each iteration
of the loop reduces S_i survivors to S_{i+1} = O(log S_i) survivors on average,
the same application of Jensen's inequality that we used for Algorithm 17.2
works here as well, so for an appropriate ℓ = O(log∗ n) we can get down to a
single surviving value with reasonably high probability.

1 procedure sifter(v)
2     Choose random ranks r_1 . . . r_ℓ
3     for i ← 1 . . . ℓ do
4         writeMax(M_i, ⟨r_i, . . . , r_ℓ, v⟩)
5         ⟨r_i, . . . , r_ℓ, v⟩ ← readMax(M_i)
6     return v
Algorithm 17.3: Sifter using max registers [AE19]

This is not quite enough to get consensus, which requires agreement
always, but with some additional machinery it is possible to detect when
the protocol fails and re-run it if necessary. The final result is consensus in
expected O(log∗ n) max register operations.
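Python tuples compare lexicographically, which makes it easy to sketch this process sequentially. The following simulation is ours, not part of [AE19]; it schedules each process's write immediately before its read, one process at a time in random order, which is one legal oblivious schedule.

    import random

    def sift_round(states):
        # One round: the running maximum plays the role of the max
        # register; each process does writeMax then readMax in turn.
        order = list(range(len(states)))
        random.shuffle(order)
        best, new = None, list(states)
        for p in order:
            if best is None or states[p] > best:
                best = states[p]          # writeMax(M_i, <r_i,...,r_l,v>)
            new[p] = best                 # readMax(M_i)
        return [s[1:] for s in new]       # drop this round's rank r_i

    def sifter(n, rounds):
        R = n ** 3                        # rank range; collisions are rare
        states = [tuple(random.randrange(R) for _ in range(rounds)) + (v,)
                  for v in range(n)]      # v doubles as the preferred value
        for _ in range(rounds):
            states = sift_round(states)
        return {s[0] for s in states}     # distinct surviving values

    print(len(sifter(10000, 4)))          # typically 1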
Appendix A

Sample assignments from Spring 2024

A.1 Assignment 1, due Thursday 2024-02-01 at 23:59
A.1.1 Matchings
A matching in a graph is a subgraph where each vertex has degree at
most 1, which means that some vertices are paired with other vertices. In
this problem, we will consider several randomized algorithms for quickly
constructing a large matching in a d-regular graph. For each algorithm, your
task is to compute an exact closed-form expression for the expected number
of edges M (n, d) in the matching as a function of n and d.

1. In the simplest algorithm, each vertex u chooses one of its d neighbors


v independently and uniformly at random. Each edge uv is included
in the matching if u chooses v and v chooses u.

2. The second algorithm assumes a bipartite d-regular graph where n is


even and the nodes are partitioned into two subsets S and T of n/2
nodes each, with all edges going between S and T . In this graph, have
each node u in S choose one of its d neighbors v in T independently
and uniformly at random. Each edge uv is included in the matching if
u chooses v and no other u′ chooses v.

3. The third algorithm attempts to extend the bipartite-graph algorithm


to a general graph. Each node u flips an independent fair coin to


decide whether it will send or receive. Each sender then picks one of
its d neighbors independently and uniformly at random. An edge uv is
included in the matching if u is a sender, v is a receiver, and u is the
only sender that picks v; or if the same conditions hold with u and v
reversed.

Solution
In each case we’ll use linearity of expectation.

1. Let Xuv be the indicator for the event that uv is included in the
matching. Then

    E[X_uv] = Pr[u and v are matched]
            = Pr[u chooses v and v chooses u]
            = Pr[u chooses v] · Pr[v chooses u]
            = (1/d)(1/d)
            = 1/d^2.

Now sum over all nd/2 edges to get

    M(n, d) = (nd/2)(1/d^2) = n/(2d).

(A quick simulation check of this formula appears after this solution.)

2. Again let Xuv be the indicator variable for the event that uv is included.
Now

    E[X_uv] = Pr[only u picks v]
            = (1/d)(1 − 1/d)^{d−1}.

Sum over all nd/2 edges to get

    M(n, d) = (nd/2)(1/d)(1 − 1/d)^{d−1} = (n/2)(1 − 1/d)^{d−1}.

Using 1 − 1/d ≤ e^{−1/d}, we can show that this is at least n/(2e) for any d,
meaning we match a constant fraction of the nodes on average even
when d is large.

3. For this version it’s convenient to break symmetry and make Xuv be
the indicator variable for the event that u is a sender, v is a receiver,
and u is matched with v. This means that we will have two variables

Xuv and Xvu for each edge, but we can deal with this when we need
to.
Compute

    E[X_uv] = Pr[u is sender and v is receiver and only u picks v]
            = (1/2)(1/2)(1/d)(1 − 1/(2d))^{d−1}
            = (1/(4d))(1 − 1/(2d))^{d−1}.

Sum over all nd directed edges uv to get

    M(n, d) = (n/4)(1 − 1/(2d))^{d−1}.

It can be shown that this is also Θ(n) regardless of d, although the
constant is not quite as good as in the bipartite case.
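As promised in part 1, here is a quick Monte Carlo check (ours, not part of the assignment) of M(n, d) = n/(2d), using a ring, which is 2-regular:

    import random

    def mutual_choice_matching(neighbors):
        # Algorithm 1: each vertex picks a uniform random neighbor;
        # edge uv is matched iff u picks v and v picks u.
        pick = {u: random.choice(vs) for u, vs in neighbors.items()}
        return sum(1 for u, vs in neighbors.items() for v in vs
                   if u < v and pick[u] == v and pick[v] == u)

    n, d = 1000, 2
    ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
    trials = 2000
    avg = sum(mutual_choice_matching(ring) for _ in range(trials)) / trials
    print(avg, n / (2 * d))    # both should be close to 250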

A.1.2 Non-volatile memory


A manufacturer has designed a non-volatile memory chip based on blow-
ing tiny fuses. Each fuse starts out representing a 0, but by running an
excess voltage across the fuse it can be permanently changed to a new state
representing a 1.
To allow storing multiple values in the chip, it is arranged as a sequence
of cells of k bits each, with an extra bit used to mark the cell as discarded.
Initially a cell will hold all zeros (0000). When writing a new value v, if it is
possible to overwrite the previous value without having to turn any 1 into a
0, the same cell will be re-used. Otherwise it will be marked as discarded
and a new cell will be used for v. An example of this process is given in
Figure A.1.
In the worst case, every new value requires a new cell. But the manufac-
turer hopes that things will go better with enough randomness.

1. Suppose that a sequence v1 , v2 , . . . , vn of n ≥ 1 values is generated


uniformly at random, so that each v_i is equally likely to be each of
the 2^k possible bit-vectors of length k. Give an exact closed-form
expression for the expected number of cells used to write these values,
as a function of n and k.

    Value       New memory contents
    (Initial)   0000
    0101        0101
    1011        0101 1011
    0010        0101 1011 0010
    0110        0101 1011 0110
    1100        0101 1011 0110 1100
    1101        0101 1011 0110 1101

Figure A.1: Non-volatile memory in action. The example shows n = 6 writes
with k = 4 bits per cell. A total of 4 cells are used.

2. Since we can’t rely on a random input, let’s move the randomness into
the algorithm. Suppose that a fixed permutation π on all k-bit vectors
is chosen uniformly at random and each vi (supplied by an adversary)
is mapped to π(vi ) before being written. Now what is the expected
number of cells used in the worst case? (As usual, assume that π is
chosen after v1 , . . . , vn are fixed.)

3. The manufacturer complains that storing π is too expensive, and


suggests storing a random k-bit vector x instead. Each vi is now
mapped to vi ⊕ x before being stored, where ⊕ represents bitwise XOR.
Now what is the expected number of cells used in the worst case?

Solution
1. Let v and v′ be consecutive values. Then v′ requires a new cell if there
is some position j such that v′_j = 0 but v_j = 1. For each position j,
this occurs with probability 1/4 and does not occur with probability
3/4. Because the positions are independent, the probability that v′
does not require a new cell is (3/4)^k and the probability that it does
require a new cell is 1 − (3/4)^k.
For each i > 1, let X_i be the indicator variable for the event that
v_i requires a new cell. If we let X_1 = 1, then S = Σ_{i=1}^n X_i gives
the total number of cells needed. We have calculated above that
E[X_i] = 1 − (3/4)^k for i > 1, so by linearity of expectation

    E[S] = Σ_{i=1}^n E[X_i]
         = 1 + (n − 1)(1 − (3/4)^k).

(A simulation check of this formula appears at the end of this solution.)

2. Same setup as before but we need to recompute E[X_i]. The problem is
that π(v) and π(v′) are no longer independent, because the adversary
can (and should) pick v ≠ v′, which removes one of the possible cases
for π(v′), since π can't send unequal v and v′ to the same value.

We can compensate for this using conditional expectations. Let X be
the indicator for the event that π(v′) requires a new cell if π(v) and
π(v′) are chosen independently at random. Then

    1 − (3/4)^k = E[X]
                = E[X | π(v) = π(v′)] Pr[π(v) = π(v′)]
                  + E[X | π(v) ≠ π(v′)] Pr[π(v) ≠ π(v′)]
                = 0 · 2^{−k} + E[X | π(v) ≠ π(v′)] (1 − 2^{−k}),

which we can solve to get

    E[X | π(v) ≠ π(v′)] = (1 − (3/4)^k) / (1 − 2^{−k}).
Since the only choice the adversary can make that affects E[X_i] is
whether to set v_i ≠ v_{i−1}, this gives a worst case of

    E[Σ X_i] = 1 + (n − 1) · (1 − (3/4)^k) / (1 − 2^{−k}),

which is only a little bit worse than the average case.
which is only a little bit worse than the average case.
An alternative approach to computing E[X] is to just count the number
of pairs π(v), π(v′) that require a new cell and divide by the total
number of possibilities 2^k(2^k − 1). Here there are 3 choices for each bit
position that don't require a new cell, giving 3^k choices overall, but 2^k
of these choices make π(v) = π(v′). So the number of distinct pairs that
don't require a new cell is 3^k − 2^k. We can get the number of pairs that
do require a new cell by subtracting from the total 2^k(2^k − 1) = 4^k − 2^k.
This gives

    E[X] = (4^k − 3^k) / (4^k − 2^k) = (1 − (3/4)^k) / (1 − 2^{−k})

as computed above.

3. Here things get substantially worse. Suppose that v = 0^k and v′ = 1^k.
Now v′ ⊕ x requires a new cell as long as x ≠ 0^k, which occurs with
probability 1 − 2^{−k}. A similar argument holds when v = 1^k and v′ = 0^k.
So by alternating between all-0 and all-1 values, the adversary can force
us to use an expected 1 + (n − 1)(1 − 2^{−k}) cells. It can't do worse since
any choice other than switching all bits would give a lower expected
cost for each write.

All of these bounds are terrible, but they are not equally terrible and
are modestly distinct for small k. For k = 2, for example, the coefficient on
(n − 1) goes from 7/16 in the average case, to 7/12 in the permutation case,
and all the way up to 3/4 in the bitwise XOR case.
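As promised in part 1, a direct simulation (ours) of the fuse rule agrees with the formula 1 + (n − 1)(1 − (3/4)^k):

    import random

    def cells_used(values, k):
        # A re-used cell always holds exactly the last value written,
        # since re-use requires the new value to keep every existing 1.
        cells, current = 1, 0
        for v in values:
            if current & ~v:   # some 1 bit would have to become a 0
                cells += 1
            current = v
        return cells

    n, k, trials = 50, 4, 10000
    avg = sum(cells_used([random.randrange(2 ** k) for _ in range(n)], k)
              for _ in range(trials)) / trials
    print(avg, 1 + (n - 1) * (1 - (3 / 4) ** k))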

A.2 Assignment 2, due Thursday 2024-02-15 at 23:59
A.2.1 Mediocre cuts
Recall that a cut in a graph G is a partition of the vertices into sets S and T ,
where the size of the cut is the number of edges with one endpoint in each
set.
Given a graph G with n vertices and m edges, we’d like to construct a
cut with size as close as possible to m/2. We will consider two algorithms
that generate 0-1 random variables Xv for each vertex v, such that v is in S
if Xv = 0 and v is in T if Xv = 1.

1. In the first algorithm, Xv is simply an independent fair coin-flip for


each v.
Prove or disprove: This algorithm produces a cut of size m/2 ± o(m)
with constant nonzero probability.

2. In this second algorithm, the independent fair coins are replaced by


pairwise independent coins, constructed as in §5.1.2.1: Generate k =
⌈lg(n + 1)⌉ independent fair random bits Y_1, . . . , Y_k, assign each vertex
v a unique non-empty subset A_v of these bits, and let X_v = ⊕_{i∈A_v} Y_i.
Prove or disprove: This algorithm produces a cut of size m/2 ± o(m)
with constant nonzero probability.

Solution
1. We’ll give a proof using Chebyshev’s inequality.
For each edge uv let Z_uv = X_u ⊕ X_v be the indicator variable for the
event that uv is in the cut. Let S = Σ_{uv} Z_uv be the size of the cut.
We have E[S] = Σ E[Z_uv] = m/2.
To compute Var[S], we will first show that the variables Z_uv are
pairwise-independent. This holds trivially for any variables correspond-
ing to non-incident edges. For the case of two incident edges Z_uv and
To compute Var [S], we will first show that the variables Zuv are
pairwise-independent. This holds trivially for any variables correspond-
ing to non-incident edges. For the case of two incident edges Zuv and

Zvw , observe that E [Zvw | Zuv ] = 1/2, because whatever value Xv has,
adding Xw yields 0 or 1 with equal probability. So Zuv and Zvw are
independent as well.
Since the Z_uv are pairwise-independent, Var[S] = Σ Var[Z_uv] = m/4.
Chebyshev then says

    Pr[|S − m/2| ≥ √m] ≤ (m/4)/(√m)^2 = 1/4.

This gives a 3/4 chance of getting S in the range m/2 ± √m = m/2 ±
o(m).

2. For this version we will give a disproof, by constructing a family of


arbitrarily large graphs where the algorithm generates a cut of size 0
or m with equal probability.
Fix some n, and create an edge uv for each pair of vertices u and v
with A_u = A_v △ {1}. This will give a total of m = n/2 edges. Each
such edge will have X_u = X_v ⊕ Y_1; this will put uv in the cut precisely
when Y1 = 1. So the cut will have 0 edges with probability 1/2 and m
edges with probability 1/2.
In neither case is the size of the cut within o(m) of m/2, so the algorithm
does not guarantee this with nonzero probability.
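Here is a concrete version of this counterexample (our sketch), using the XOR-of-subsets construction from the problem statement; the index sets differ in element 0 rather than "bit 1", but the effect is the same:

    import random
    from itertools import combinations

    k = 4
    # each vertex gets a distinct non-empty subset A_v of {0,...,k-1}
    subsets = [frozenset(A) for size in range(1, k + 1)
               for A in combinations(range(k), size)]
    # edge uv whenever A_u and A_v differ exactly in index 0, so that
    # X_u = X_v XOR Y_0 and the whole cut is controlled by Y_0
    edges = [(u, v) for u in range(len(subsets)) for v in range(len(subsets))
             if u < v and subsets[u] ^ subsets[v] == {0}]

    for _ in range(5):
        Y = [random.randrange(2) for _ in range(k)]
        X = [sum(Y[i] for i in A) % 2 for A in subsets]
        print(sum(1 for u, v in edges if X[u] != X[v]), "of", len(edges))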

A.2.2 Training costs


A total of n workers arrive at a site in blocks of size n_1, n_2, . . . , n_k, where
Σ_i n_i = n. After the i-th block arrives, one of the n_1 + n_2 + · · · + n_i workers

now on site is chosen uniformly at random to be the leader. If the leader


arrived previously (was one of the n1 + n2 + · · · + ni−1 workers already at
the site), they are said to be experienced and require no training. If the
leader was one of the most recent ni arrivals, they are inexperienced and
must be trained. We’d like to get an estimate of the worst-case expected
cost of training inexperienced leaders, assuming that an adversary chooses
the sizes and number of the blocks n1 , . . . , nk , and show that the actual cost
is likely to be close to the expected cost.
We consider two different models for the cost of training an inexperienced
leader.

1. In the first model, an inexperienced leader remedies their ignorance by


calling the help line 1-800-LEADER, at a cost of one unit.

Show an upper bound f(n) on the worst-case expected total training
cost for this model, and show that there are constants c > 0 and
c′ < 1 such that the actual cost in this worst case is within f(n) ± c′f(n)
for sufficiently large n with probability at least 1 − O(n^{−c}).

2. This part of the problem contained an error and has been


withdrawn. You do not need to supply a solution to this part
of the problem and will be given full credit for it.

Solution
1. First we need to figure out the worst-case choice of n1 , . . . , nk .
For any particular choice of the block sizes, the expected cost is

    Σ_{i=1}^k n_i / (Σ_{j=1}^i n_j).

It is not hard to see that this quantity is maximized when k = n and
n_i = 1 for all i. The proof is to observe that if any n_i > 1, it can be
replaced by consecutive blocks of size n_i − 1 and 1, giving an expected
cost for these two new blocks of

    (n_i − 1)/(Σ_{j=1}^i n_j − 1) + 1/(Σ_{j=1}^i n_j) > (n_i − 1)/(Σ_{j=1}^i n_j) + 1/(Σ_{j=1}^i n_j) = n_i/(Σ_{j=1}^i n_j),

which increases the expected total cost.
Setting each n_i = 1, the expected total cost is

    Σ_{i=1}^n 1/i = H_n = Θ(log n).

Let X_i be the training cost for the i-th block. Then each X_i is an
independent Bernoulli random variable and S = Σ X_i has known
expectation µ = H_n = Θ(log n), so we can use the two-sided Chernoff
bound (5.2.7) to get

    Pr[|S − µ| ≥ (1/2)µ] ≤ 2e^{−µ/12} = 2e^{−Θ(log n)} = O(n^{−c}),

for some c > 0.


But then S lies between (1/2)µ and (3/2)µ with probability 1 − O(n^{−c}),
giving the desired bound.
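The worst case is easy to simulate (our sketch): with all blocks of size 1, the total training cost concentrates around H_n.

    import random
    from math import log

    def training_cost(n):
        # blocks of size 1: after arrival i there are i workers on site,
        # and the uniform leader needs training iff it is the newcomer
        return sum(1 for i in range(1, n + 1) if random.randrange(i) == 0)

    n, trials = 10000, 500
    avg = sum(training_cost(n) for _ in range(trials)) / trials
    print(avg, sum(1 / i for i in range(1, n + 1)), log(n))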

A.3 Assignment 3, due Thursday 2024-02-29 at 23:59
A.3.1 A robot rendezvous problem
A warehouse for a large online retailer has n packages, each of which has
a distinct tracking number ti ∈ {0, . . . , t − 1}. Each package starts out in
the possession of one of n warehouse robots and needs to be transferred to
one of n corresponding delivery drones to be delivered. We will refer to the
robots and drones collectively as workers.
Unfortunately there are only m < n landing pads for the drones, and the
scheduling system that assigns packages to the workers does not know about
the landing pads.
Instead, each worker knows only n, m, t, the unique tracking number ti
for its assigned package, whether it is a robot or a drone, and up to b bits of
randomness that is shared between all the workers, where b is polylogarithmic
in t (meaning that b = O(log^c t) for some c).1 Using this information, it
must choose one of the m landing pads 0, . . . , m − 1 in one of k rounds
0, . . . , k − 1. If exactly one robot and its corresponding drone arrive at a
particular pad during a round, the package is transferred successfully. If a
robot meets a drone expecting a different package, or if two robots or two
drones attempt to use the same pad during the same round, the transfer
fails and any workers choosing that pad in that round must try again at a
later round. Each worker has no ability to communicate with other workers
beyond being able to detect if a transfer it attempted failed or not.
It is clear that k must be at least ⌈n/m⌉ for it to be possible to deliver
all the packages even with perfect coordination. Show that this is almost
sufficient, by giving an algorithm that, for any fixed c > 0, delivers all
packages in k = O((n/m) log n) rounds with probability at least 1 − O(n^{−c}).
1 There is a very technical issue here involving how the bound on the number of bits is
interpreted. One extreme is that we have b bits of randomness and all 2^b possible bit
strings are equally likely. The other extreme is that some algorithm generates a random
variable or collection of random variables that collectively take on at most 2^b distinct
possible values (and thus can be encoded in b bits), but there is no requirement that each
possible value is equally likely. While the problem is solvable in either model, there are
fewer details to worry about in the second model, so you should feel free to assume, for
example, that you can encode a six-sided die in 3 bits by assigning 0 probability to 000
and 111 and 1/6 probability to each of the remaining six bit vectors.

Solution
We'll have a sequence of p = O(log n) phases of ⌈2n/m⌉ rounds each, and
in each phase use a hash function to assign each package to one of the ℓ =
m⌈2n/m⌉ ≥ 2n slots corresponding to a particular combination of pad and
round within the phase.
A complication is that we have limited randomness, which is going to
constrain what hash functions we can use. We'll pick an independent linear
congruential hash function h_j for each phase j, which requires O(log t) bits
of randomness per phase or O(log t log n) = O(log^2 t) bits of randomness
across all phases. Tabulation hashing also works (we want the version that
adds table elements mod ℓ rather than using XOR on bit vectors), but the
bits per phase goes up to O(log^2 t), giving O(log^3 t) overall.
In each phase, a remaining robot or drone assigned tracking number
t_i computes h_j(t_i) ∈ [ℓ] and goes to pad h_j(t_i) mod m during round
⌊h_j(t_i)/m⌋ (numbering from 0 within the phase). Each pair of tracking
numbers t_i, t_{i′} produces a collision with probability Pr[h_j(t_i) = h_j(t_{i′})] ≤ 1/ℓ.
Let X_j be the number of robots left after j phases; then X_0 = n and

    E[X_{j+1} | X_j] ≤ 2 · (X_j choose 2) · (1/ℓ) ≤ X_j^2/(2n) ≤ X_j/2.

Here the 2 at the start accounts for the fact that each collision may send up
to two robots to the next phase, the (X_j choose 2) counts all the pairs of robots, and
the last step uses the bound X_j ≤ n.
Iterating this inequality gives E[X_p] ≤ n · 2^{−p}. Set p = (c + 1) lg n to get
Pr[X_p > 0] ≤ E[X_p] ≤ n · 2^{−(c+1) lg n} = n^{−c}. This gives the desired error probability
in (c + 1) lg n · ⌈2n/m⌉ = O((n/m) log n) rounds.
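A sketch (ours) of the phase structure; the specific prime and the ((ax + b) mod p) mod ℓ construction are our own simplifications, standing in for whatever hash family a careful implementation would fix:

    import random

    def one_phase(tracking, m):
        # hash each remaining tracking number into l = m * ceil(2n/m)
        # slots (slot s = pad s mod m in round s // m of the phase);
        # a package survives to the next phase iff its slot collides
        n = len(tracking)
        l = m * -(-2 * n // m)
        p = (1 << 61) - 1                      # a Mersenne prime > t
        a, b = random.randrange(1, p), random.randrange(p)
        slot = {x: ((a * x + b) % p) % l for x in tracking}
        counts = {}
        for x in tracking:
            counts[slot[x]] = counts.get(slot[x], 0) + 1
        return [x for x in tracking if counts[slot[x]] > 1]

    remaining = random.sample(range(10 ** 9), 1000)
    phases = 0
    while remaining:
        remaining = one_phase(remaining, m=100)
        phases += 1
    print(phases)    # typically about lg 1000, i.e. around 10 phases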

A.3.2 A linked list


It is well-known that linked lists are terrible data structures, with only O(n)
guarantees on search time in the worst case. However, for some distributions
on elements, the average case might be better.
Suppose we are given a set of n elements 1, . . . , n, where at each step
element i is supplied with probability pi . Suppose that we insert each new
element we see at the end of the list, until eventually all n elements appear
in the list. We then sample a new element according to the given probability
distribution and ask what its expected position in the list is.
For a uniform distribution, this won't help much: all n! orderings will
be equally likely, and the expected position of a random element is just
(n + 1)/2 = Θ(n), which is (up to constants) no better than the deterministic
worst case. But for more skewed distributions we may hope that more
probable elements get inserted early, meaning that the cost of searching for
them will be less than for more improbable elements.
For each of the following distributions, compute a tight (big-Θ) asymptotic
bound on the expected cost to search for a random element given by the
distribution, assuming that we constructed the list as described above by
repeatedly sampling from the same distribution.

1. A geometric distribution with p_i ∝ 2^{−i}.

2. A Zipf's law distribution with p_i ∝ 1/i.

Solution
Let’s do what we can for a generic distribution before getting into the details
of each distribution individually.
Let A_ij be the indicator variable for the event that i appears in the list
at or before the same position as j (the "at" part covers the case A_ii, which
we take to be 1). Let D_i be the position of i in the list. Then D_i = Σ_j A_ji.
For i ≠ j, we can compute E[A_ij] by looking at the probability of seeing i
conditioned on seeing i or j at a particular step. This gives

    E[A_ij] = p_i / (p_i + p_j),
    E[D_i] = 1 + Σ_{j≠i} p_j / (p_i + p_j),

and thus

    E[search cost] = Σ_i p_i E[D_i] = 1 + Σ_i Σ_{j≠i} p_i p_j / (p_i + p_j).

Let’s see what happens to this for each of our given distributions.

1. When p_i ∝ 2^{−i}, we have

    p_i = 2^{−i} / (1 − 2^{−n}) = Θ(2^{−i}).

Even better, the denominators cancel while computing

    Σ_{j≠i} p_j/(p_i + p_j) = Σ_{j≠i} 2^{−j}/(2^{−i} + 2^{−j})
                            = Σ_{j<i} 2^{−j}/(2^{−i} + 2^{−j}) + Σ_{j>i} 2^{−j}/(2^{−i} + 2^{−j})
                            = Σ_{j<i} 2^{−j}/Θ(2^{−j}) + Σ_{j>i} 2^{−j}/Θ(2^{−i})
                            = Σ_{j<i} Θ(1) + Σ_{j>i} Θ(2^{i−j})
                            = Θ(i) + Θ(1)
                            = Θ(i).

We can now compute the expected search cost as

    Σ_i p_i E[D_i] = 1 + Σ_i Θ(2^{−i} i) = Θ(1).

2. In this case, we have

    p_i = (1/i)/H_n = 1/(iH_n).

But the H_n's cancel out when we compute

    Σ_{j≠i} p_j/(p_i + p_j) = Σ_{j≠i} (1/j)/((1/i) + (1/j))
                            = Σ_{j≠i} i/(i + j).

We can use this to get an upper bound on the expected search cost

    Σ_{i=1}^n p_i E[D_i] = 1 + Σ_{i=1}^n (1/(iH_n)) Σ_{j≠i} i/(i + j)
                         < 1 + (1/H_n) Σ_{i=1}^n Σ_{j=1}^n 1/(i + j)
                         < 1 + (1/H_n) Σ_{k=1}^{2n} (k − 1)/k
                         < 1 + (1/H_n) · 2n
                         = O(n/log n).

To get a matching lower bound, notice that the best possible ordering
places each i at position i, giving an expected search cost of exactly
Σ_i (1/(iH_n)) · i = n/H_n = Ω(n/log n). Since whatever random ordering we
end up with is at least this bad, this gives us the Θ(n/log n) bound
we are looking for.
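We can also evaluate the exact expression from the generic analysis numerically (our sketch) to watch the two regimes separate:

    from math import log

    def expected_search_cost(p):
        # exact expectation from the analysis above:
        # 1 + sum over ordered pairs i != j of p_i p_j / (p_i + p_j)
        s = sum(p)
        p = [x / s for x in p]
        n = len(p)
        return 1 + sum(p[i] * p[j] / (p[i] + p[j])
                       for i in range(n) for j in range(n) if i != j)

    for n in (100, 200, 400):
        geom = [2.0 ** -(i + 1) for i in range(n)]
        zipf = [1.0 / (i + 1) for i in range(n)]
        print(n, expected_search_cost(geom),    # stays Theta(1)
              expected_search_cost(zipf), n / log(n))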

A.4 Assignment 4, due Thursday 2024-03-28 at 23:59
A.4.1 Nearly orthogonal vectors
Given two vectors x and y, the angle θ_xy between them can be computed
by the formula θ_xy = cos^{−1}(x · y / (‖x‖‖y‖)). If x and y are orthogonal, then x · y = 0,
and θ_xy = π/2. Let's call two vectors x and y nearly orthogonal to within
ε if π/2 − ε ≤ θ_xy ≤ π/2 + ε.
Given an n × n matrix A, we can find all the pairs of rows A_i, A_j that
are nearly orthogonal by computing θ_{A_i,A_j} for each pair and comparing it to
π/2 ± ε. Unfortunately, this takes Θ(n^3) time if done in the obvious way. So
let's allow a randomized approximation algorithm.
Claim: For any fixed ε with 0 < ε ≤ π/4, there is a randomized algorithm
that takes as input an n × n matrix A and an error parameter δ > 0, and
returns in time O(n^2 log^c n log^c(1/δ)), for some constant c, a list of nearly-
orthogonal pairs of rows of A, such that with probability at least 1 − δ, this
list includes every pair of rows that are nearly orthogonal to within ε, and
only pairs of rows that are nearly orthogonal to within 2ε.
Prove this claim.

Solution
This is a job for the Johnson-Lindenstrauss lemma (distributional version).
First normalize the rows of A so that each ‖A_i‖ = 1. This takes O(n^2)
time and doesn't affect the angles between rows.
Next, observe that for any unit vectors x and y, ‖x − y‖_2 is an increasing
function of θ = θ_xy. We don't actually need anything more than this, but
if we want to we can compute the distance exactly by constructing an
isosceles triangle with x and y as the unit edges and chopping it in half:
this gives ‖x − y‖_2 = 2 sin(θ/2). So if θ ≤ ε, then ‖x − y‖_2 ≤ 2 sin(ε/2), and if
θ ≥ 2ε, then ‖x − y‖_2 > 2 sin ε.
Pick a threshold halfway between these two bounds. By choosing a
small enough error term in Lemma 8.1.2, we can guarantee that any pair
x, y that is nearly orthogonal to within ε has ‖f(x − y)‖_2 close enough to
‖x − y‖_2 to be within this threshold, and any pair x, y that is not nearly
orthogonal to within 2ε doesn't. (We could calculate the exact error
bound we need for this, but since it's a constant, we don't care.) If we set
the probability of exceeding the error bound for a single vector to δ/(n choose 2),
this gives a probability that ‖A_i − A_j‖_2 exceeds the relative error bound
for any i, j of at most δ. To obtain δ/(n choose 2) probability of error, we need
k = O(log(n^2/δ)) = O(log n + log(1/δ)).
So now we need to feed all of our rows to the Johnson-Lindenstrauss
transform whose existence is given in the lemma and then check the threshold
for each pair of rows in the output. Expanding out the steps, we must:

1. Compute an n × k transform matrix B implementing f. This takes
O(nk) time.

2. Compute AB. This takes O(n^2 k) time using ordinary matrix multipli-
cation.

3. For each pair of rows i and j, compute ‖(AB)_i − (AB)_j‖_2 and check
it against the thresholds. This takes O(k) time per pair of rows for
O(n^2 k) time total.

Adding up all of these costs gives O(n^2 k) = O(n^2 (log n + log(1/δ))) time,
which is within the claimed bounds.
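A sketch (ours) of the whole pipeline, using a random Gaussian projection as the Johnson-Lindenstrauss transform; the 0.95/1.05 slack factors are crude stand-ins for the constant-size error bound argued above:

    import numpy as np

    def nearly_orthogonal_pairs(A, eps, k):
        # normalize rows, project to k dimensions, threshold distances;
        # for unit vectors |x - y| = 2 sin(theta/2) increases with theta
        A = A / np.linalg.norm(A, axis=1, keepdims=True)
        P = A @ (np.random.randn(A.shape[1], k) / np.sqrt(k))
        lo = 2 * np.sin((np.pi / 2 - eps) / 2)
        hi = 2 * np.sin((np.pi / 2 + eps) / 2)
        out = []
        for i in range(len(P)):
            d = np.linalg.norm(P[i + 1:] - P[i], axis=1)
            for j in np.nonzero((d >= 0.95 * lo) & (d <= 1.05 * hi))[0]:
                out.append((i, i + 1 + int(j)))
        return out

    np.random.seed(0)
    A = np.random.randn(300, 300)
    print(len(nearly_orthogonal_pairs(A, 0.05, 64)))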

A.4.2 Boosting a random walk


Suppose you are made to participate in a fair ±1 random walk, starting with
X_0 = k ≤ n/2, that ends if X_t ≥ n or X_t ≤ 0. The twist is that once and
only once during the walk you can push a "boost" button that replaces
the normal transition rule X_{t+1} = X_t ± 1 with a rule that doubles the current
value: X_{t+1} = 2X_t. Naturally, your choice to use your boost can't depend on
knowledge of the future: when choosing to set X_{t+1} = 2X_t, you only know
the outcomes X_1, . . . , X_t.

1. Suppose that your goal is to maximize the probability that you reach
a state Xt ≥ n before you reach Xt = 0. What strategy should you
use and what probability of reaching Xt ≥ n (as a function of k and n)
does it give you?

2. Suppose instead that you want to minimize the probability that you
reach a state X_t ≥ n before you reach 0, but you want to do this
in a way that always uses the boost before you reach X_t ≥ n or X_t = 0,
so that suspicious onlookers won’t think you aren’t trying. Now what
strategy should you use, and what probability of reaching Xt ≥ n does
it give you?

Solution
Let's formalize things a bit. Let {∆_t} be the sequence of independent fair ±1
increments, so that X_{t+1} = X_t + ∆_{t+1} if we don't use the boost. Then we
can let F_t = ⟨∆_1, . . . , ∆_t⟩ and make the time σ at which we use the boost a
stopping time with respect to {F_t}. This makes [σ ≤ t] measurable F_t for
each t and gives E[∆_{t+1} | F_t] = 0.
We can now write the transition rule compactly as

    X_{t+1} = 2^{[σ=t]} X_t + [σ ≠ t] ∆_{t+1},

where [A] denotes the indicator variable for the event A.


Because of the boost, {Xt } is not a martingale with respect to {Ft }, but
we can define a process that is.
Let Y_t = 2^{[σ≥t]} X_t. The idea is that the first factor represents whether we
still have a boost available to use or not.
Note that [σ ≥ t] starts at 1 when t = 0 and drops to 0 when we use
the boost. We can write this as [σ ≥ t + 1] = [σ ≥ t] − [σ = t]. This lets us

compute

    E[Y_{t+1} | F_t] = E[2^{[σ≥t+1]} X_{t+1} | F_t]
                     = E[2^{[σ≥t+1]} (2^{[σ=t]} X_t + [σ ≠ t] ∆_{t+1}) | F_t]
                     = 2^{[σ≥t]−[σ=t]} · 2^{[σ=t]} X_t + 2^{[σ≥t+1]} [σ ≠ t] E[∆_{t+1} | F_t]
                     = 2^{[σ≥t]} X_t
                     = Y_t.

So {Y_t} is a martingale with respect to {F_t}.
(Here we could pull X_t, 2^{[σ≥t]}, 2^{[σ≥t+1]} = 2^{[σ>t]}, and [σ ≠ t] out of the
conditional expectation because all these variables are measurable F_t.)
Let τ = min{t | X_t ≥ n ∨ X_t = 0}. Then τ is finite with probability 1 (if
nothing else, we eventually take n positive steps in a row), and |Y_t| ≤ 2n, so
we can apply the finite probability / bounded range version of the Optional
Stopping Theorem to get

    E[Y_0] = E[Y_τ]
           = E[Y_τ | X_τ ≥ n] Pr[X_τ ≥ n] + E[Y_τ | X_τ = 0] Pr[X_τ = 0]
           = E[Y_τ | X_τ ≥ n] Pr[X_τ ≥ n].

Since we know E[Y_0] = 2k, this gives

    Pr[X_τ ≥ n] = 2k / E[Y_τ | X_τ ≥ n].    (A.4.1)

We should thus try to minimize or maximize E [Yτ | Xτ ≥ n] depending


on whether we want to maximize or minimize Pr [Xτ ≥ n].

1. To minimize E[Y_τ | X_τ ≥ n], observe that the smallest possible value
for Y_τ given X_τ ≥ n is Y_τ = X_τ = n. We can get this by using the
boost immediately (σ = 0). Then we have a standard fair random walk
starting from X_1 = 2k, which hits n before 0 with probability exactly
2k/n.
This is not necessarily the only good strategy, since any strategy that
guarantees σ < τ and X_σ ≤ n/2 will also get 2k/n, but (A.4.1) tells us
that no strategy can do better.
that no strategy can do better.

2. To maximize E[Y_τ | X_τ ≥ n] subject to the requirement that we must
use the boost, we want E[X_τ | X_τ ≥ n] to be as large as possible. The
best we can hope to get is X_τ = 2(n − 1), by using the boost when
X_t = n − 1. But what if we don't reach n − 1 and are forced to use
the boost anyway?
Analyzing this using the Y_t martingale turns out to be tricky, and an
easier approach is just to guess the worst strategy and show that it
is in fact the worst. Since we must use the boost before τ, we are
forced to use the boost if we reach 1 or n − 1, since otherwise there
is a possibility the process finishes before we use it. Let's suppose we
adopt a maximum-delay strategy and set σ = min{t | X_t ∈ {1, n − 1}}.
Starting from any position x, this σ gives Pr[X_τ = n] of 2/n if we
reach 1 first and 1 if we reach n − 1 first. Which we reach first is just
the probability of hitting one or the other absorbing barrier in a simple
random walk on [1, n − 1]. So we can compute

    Pr[X_τ = n | X_t = x, t < τ] = ((x − 1)/(n − 2)) · 1 + (1 − (x − 1)/(n − 2)) · (2/n)
                                 = ((x − 1)/(n − 2)) (1 − 2/n) + 2/n
                                 = ((x − 1)/(n − 2)) ((n − 2)/n) + 2/n
                                 = (x − 1)/n + 2/n
                                 = (x + 1)/n.

In particular, starting at X_0 = k gives a probability of hitting the
upper barrier of (k + 1)/n.
Now suppose we consider some other strategy σ′. Like σ, σ′ must
use the boost at X_t = 1 or X_t = n − 1. If X_{σ′} = x for some x with
1 < x < n − 1, then Pr[X_τ ≥ n | X_{σ′} = x] = 2x/n > (x + 1)/n since x > 1. So
any other σ′ gives a higher probability of X_τ ≥ n.
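A simulation (ours) of both strategies matches 2k/n and (k + 1)/n:

    import random

    def walk(k, n, boost_at):
        # fair +-1 walk from k; the first time the state lies in
        # boost_at, the step is replaced by the doubling boost X <- 2X
        x, boosted = k, False
        while 0 < x < n:
            if not boosted and x in boost_at:
                x, boosted = 2 * x, True
            else:
                x += random.choice((-1, 1))
        return x >= n

    k, n, trials = 3, 20, 100000
    greedy = sum(walk(k, n, {k}) for _ in range(trials)) / trials
    lazy = sum(walk(k, n, {1, n - 1}) for _ in range(trials)) / trials
    print(greedy, 2 * k / n)    # boost immediately: 0.3
    print(lazy, (k + 1) / n)    # delay to 1 or n-1: 0.2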

A.5 Assignment 5, due Thursday 2024-04-11 at 23:59
A.5.1 Relaxation time for Metropolis-Hastings
Suppose we have a lazy irreducible reversible Markov chain on n states with
transition probabilities p_ij, uniform stationary distribution π_i = 1/n, and
relaxation time τ_2. Fix some function f with 1 ≤ f(x) ≤ r for all states x
and apply Metropolis-Hastings to get a new Markov chain with transition
probabilities p′_ij = p_ij min(1, f(j)/f(i)) and stationary distribution π′_i ∝ f(i).
Let τ′_2 be the relaxation time of this new chain.
Prove or disprove: Under these assumptions, there is an upper bound on
τ′_2 that is polynomial in τ_2 and r.

Solution
We will prove this by showing that π′_u and p′_uv are both bounded relative to
their unmodified versions by polynomial functions of r, then show this leads
to at most polynomial blowup going from τ_2 to the conductance Φ of the
original chain to the conductance Φ′ of the modified chain and finally to τ′_2.
Let n be the number of states. For π′_u, we have

    π′_u = f(u)/Σ_v f(v) ≥ 1/((n − 1)r + 1) > 1/(nr) = π_u/r

and in the other direction

    π′_u = f(u)/Σ_v f(v) ≤ r/((n − 1) + r) < r/n = rπ_u.

For p′_uv = p_uv min(1, f(v)/f(u)), we get (1/r)p_uv ≤ p′_uv ≤ p_uv.
So now let's look at some set of states S and try to bound Φ′(S).

    Φ′(S) = (Σ_{u∈S,v∉S} π′_u p′_uv) / π′(S)
          > (Σ_{u∈S,v∉S} (1/r)π_u · (1/r)p_uv) / (rπ(S))
          = (1/r^3) (Σ_{u∈S,v∉S} π_u p_uv) / π(S)
          = (1/r^3) Φ(S).    (A.5.1)
We'd like to use this to bound Φ′ = min_{0<π′(S)≤1/2} Φ′(S) in terms of r
and Φ = min_{0<π(S)≤1/2} Φ(S), but there is a complication: there may be sets
S with π′(S) ≤ 1/2 but π(S) > 1/2 that are included in the computation of
Φ′ but not Φ. For these sets we will need to take advantage of reversibility.

Suppose π′(S) ≤ 1/2, π(S) > 1/2. Let T be the complement of S. Then π(T) = 1 − π(S) <
1/2, so Φ(T) ≥ Φ. Now we can argue

    Φ′(S) = (Σ_{u∈S,v∈T} π′_u p′_uv) / π′(S)
          = (Σ_{v∈T,u∈S} π′_v p′_vu) / π′(S)
          = Φ′(T) · π′(T)/π′(S)
          > (1/r^3) Φ(T) · π′(T)/π′(S)
          > (1/r^3) Φ · (1/2)/(1/2)
          = (1/r^3) Φ.    (A.5.2)
So given S with 0 < π′(S) ≤ 1/2, (A.5.1) shows Φ′(S) ≥ r^{−3}Φ when
π(S) ≤ 1/2 and (A.5.2) shows Φ′(S) ≥ r^{−3}Φ when π(S) > 1/2. In either
case we get Φ′ = min_{0<π′(S)≤1/2} Φ′(S) ≥ r^{−3}Φ.
From (10.6.4), we have 1/(2Φ) ≤ τ_2, which gives Φ ≥ 1/(2τ_2). But then we can
apply the other direction of (10.6.4) to get

    τ′_2 ≤ 2/(Φ′)^2 < 2/(r^{−3}Φ)^2 = 2r^6/Φ^2 ≤ 2r^6/(1/(2τ_2))^2 = 8r^6 τ_2^2,

which is polynomial in τ_2 and r.


Though it doesn't affect the solution to the problem, it is probably
worth noting here that there may be a much more efficient way to sample
proportional to f, which is to sample a point uniformly (in time O(τ_2 log(n/ε)))
and then apply rejection sampling to adjust the probabilities. This yields
a sample in O(rτ_2 log(n/ε)) in the worst case where f is 1 for most points
and so we only return a sample with probability 1/r. So we may wish to
reserve Metropolis-Hastings for situations where f has a much larger range
(making rejection sampling too costly) and we can show a tighter bound on
convergence time in some other way.

A.5.2 A constrained random walk


Let’s imagine we want to sample n-bit strings with no consecutive pairs of
ones. So 00101 is acceptable but 01101 is not.

It’s possible to do this efficiently and exactly by taking advantage of the


connection of these strings to Fibonacci numbers, but for some unexplained
reason we decide to do this using Markov Chain Monte Carlo instead. Given
an initial string x, pick an index i and bit b uniformly at random, and set xi
to b unless this would produce two consecutive ones. It is not hard to show
that this gives an irreducible aperiodic Markov chain that converges to a
uniform stationary distribution.
Show that this chain has the rapid mixing property, which means that
t_mix(ε) is polylogarithmic in the number of states N and in the total variation
distance bound ε.

Solution
First observe that the number of possible states N is at least 2^{⌈n/2⌉}, the
number of strings with zeros in all odd-numbered positions.2 So if we can
get a mixing time that is polynomial in n, it will be polylogarithmic in N.
Coupling and canonical paths are both options for showing such a bound.
One possible coupling is to apply the same update rule to both the X
and Y processes: Pick an index i and a bit i and update each of Xi and Yi
to equal b if permitted by the no consecutive ones rule. We can then track
how the Hamming distance between X and Y evolves over time.
Let S t = i Xit 6= Yit and let H t = S t . Then H t can change in two


ways:

• If we pick some i ∈ S, then H always drops by 1 if b = 0, and drops


by 1 if b = 1 and neither i − 1 nor i + 1 is in S.

• If we pick some i ∉ S, then H stays the same if i − 1 and i + 1 are
both not in S, and can go up by 1 if either is in S.

Partition S into segments S_1, . . . , S_k, where each segment is a maximal run
of consecutive indices.
2 This is a rather crude estimate that is convenient mostly because it is instantly
recognizable as exponential in n. As hinted at in the problem statement, we can compute
N exactly using Fibonacci numbers. Each acceptable string of length 2 or more either
starts with 0 followed by an acceptable string of length n − 1, or starts with 10 followed by
an acceptable string of length n − 2. This gives a recurrence N(n) = N(n − 1) + N(n − 2)
with N(0) = 1 and N(1) = 2. This is exactly the recurrence for Fibonacci numbers
shifted so that N(n) = F_{n+2}, which is not surprising when we recall that Pingala invented
Fibonacci numbers precisely to count sequences of syllables of length 1 or 2 adding up to a
given total.

If |S_i| = 1, then the index in S_i contributes a 1/n chance of lowering H,
and the adjacent indices together contribute at most a 1/n chance of increasing
H.
If |S_i| = k > 1, then each index in S_i contributes a 1/(2n) chance of lowering
H, for a total of k/(2n) ≥ 1/n, and the adjacent indices together contribute at
most a 1/n chance of increasing H.
In either case we have E[H_{t+1} | X^t, Y^t] ≤ H_t, with at least a 2/n chance
that H changes if H > 0. We also have that H never exceeds n and never
changes once it reaches 0. So we can bound H by a fair random walk H′
with a reflecting barrier at n and absorbing barrier at 0, that only takes a
step when H does. This shows that H reaches 0 in O(n^3) steps on average.
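A quick simulation (ours) of this coupling shows the Hamming distance hitting 0 comfortably within the O(n^3) bound:

    import random

    def try_set(x, i, b):
        # the chain's update: set x[i] <- b unless that would create
        # two consecutive ones
        if b == 1 and ((i > 0 and x[i - 1] == 1)
                       or (i < len(x) - 1 and x[i + 1] == 1)):
            return
        x[i] = b

    def coupling_time(n):
        X = [0] * n                           # two legal starting states
        Y = [1 - (i % 2) for i in range(n)]   # 1010...
        t = 0
        while X != Y:
            i, b = random.randrange(n), random.randrange(2)
            try_set(X, i, b)                  # same (i, b) in both copies
            try_set(Y, i, b)
            t += 1
        return t

    n = 50
    print(max(coupling_time(n) for _ in range(20)), n ** 3)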
For canonical paths, given two states x and y that differ in k places, we’ll
define a path γxy of length k that fixes these bits one at a time. We can’t
do strict left-to-right bit fixing, because this may create an intermediate
state that has two consecutive ones. But we can adjust the order a little to
get around this: at each step, change the leftmost unfixed bit that can be
changed without violating the no-consecutive ones rule.
A step in such a path changes some string of the form y_1 . . . y_ℓ r s x_{ℓ+3} . . . x_n
to y_1 . . . y_ℓ r′ s′ x_{ℓ+3} . . . x_n, where the rs → r′s′ change is one of 00 → 10,
01 → 00, or 10 → 00. To reconstruct x and y from this information, it is
enough to provide ℓ (n − 2 choices), a string x_1 . . . x_{ℓ+1} 0 y_{ℓ+3} . . . y_n with no
consecutive ones (< N choices), and the values of x_{ℓ+2} and y_{ℓ+2} (4 choices).
This gives < 4nN paths crossing each edge, so the congestion ρ is

    ρ = max_{uv∈E} (1/(π_u p_uv)) Σ_{γ_xy ∋ uv} π_x π_y < (1/(N^{−1} · (1/(2n)))) · 4nN · N^{−2} = 8n^2.

We thus get τ_2 ≤ 8ρ^2 = O(n^4), with a hideous constant swept behind the
big O for decency's sake. This again is polylogarithmic in N, so we're done.

A.6 Assignment 6, due Thursday 2024-04-25 at 23:59
A.6.1 The power of intransigence
Let G be a connected graph on n vertices labeled 1, . . . , n, and imagine each
vertex v in G represents a person who has an opinion X_v^t ∈ {0, 1} at each
time t. This opinion is initially arbitrary, but may evolve over time to match
the opinions of v's neighbors.

Specifically, at each step, we pick one of the m edges in the graph
uniformly at random, and then choose a random orientation of that edge to
get an ordered pair of adjacent vertices ij, with each possible pair chosen
with probability 1/(2m). Most of the time, this results in j adopting i's opinion,
so that X_j^{t+1} = X_i^t, while X_k^{t+1} = X_k^t for all k ≠ j.

But not if j = 1. Vertex 1 is stubborn: headstrong, willful, convinced in


its bones that it is right, holding a graduate degree in obstinacy in one hand
and a thesaurus opened to “stubborn” in the other to back up its unyielding
pig-headed beliefs. If we are unlucky enough to pick j = 1, then nothing
changes and X t+1 = X t .
Running this process long enough will eventually converge to a state
where every other vertex gives up and agrees with vertex 1. There is even a
constant c such that it reaches this state in Θ(nc ) steps on average with the
worst possible G and the worst possible X 0 .
Find c, and prove that this bound holds.

Solution
We’ll show c = 4.
For the lower bound, we just need one bad family of graphs and starting
configurations that gives Ω(n4 ) expected convergence time. One choice is a
lollipop graph, consisting of a handle that is a path of n/2 vertices and a
sucker that is Kn/2 , plus an edge linking one end of the path to one vertex
in the clique.
Let’s make the stubborn node be a member of the clique with initial
value 0, and start in a configuration where the most distant n/4 nodes in the
path have 1 and everybody else has 0. Let X_i^t be the value of node i at time
t and let Z_t = Σ_i X_i^t. Then as long as all the clique nodes have X_i^t = 0 and
some path node has X_i^t = 1, we have a configuration where the 1 nodes are
exactly the Z_t most distant nodes on the path. So Z_{t+1} ≠ Z_t only if we pick
the unique 0–1 edge, and we have Z_{t+1} = Z_t ± 1 with probability 1/(2m) each.
This gives an unbiased random walk that takes steps with probability 1/m.
It takes (n/4)^2 steps on average for this walk to reach Z_t = 0 or Z_t = n/2, which is a lower
bound on the time for the process as a whole to converge. The waiting time
for each random walk step is m, so the total expected time to reach Z_t = 0
or Z_t = n/2 is m(n/4)^2 by Wald's Equation. The clique makes m = Ω(n^2),
so the expected time to converge is Ω(n^4).
In the other direction, we will show that any graph reaches an all-equal
state from any initial state in time O(n^4). For this argument it's convenient
to assume that the stubborn node starts with 1, so we finish when Z_t = n.

Fix some configuration X^t, and let ℓ be the number of 0–1 edges in
X^t. Let F_t = ⟨X^0, X^1, . . . , X^t⟩ be the σ-algebra generated by the trajec-
tory of the system up until time t. Then Pr[Z_{t+1} − Z_t = 1 | F_t] = ℓ/(2m)
and Pr[Z_{t+1} − Z_t = −1 | F_t] ≤ ℓ/(2m) (because edges incident to the stubborn vertex
don't contribute). The rest of the time Z_{t+1} = Z_t.
This gives a ±1 random walk that may be biased toward +1 and that
takes steps only some of the time. There are two mostly equivalent ways to
bound the convergence time of this process: one uses an explicit potential
function, and the other reduces to known results about random walks via a
coupling argument. We’ll describe both below, since either gives a reasonable
solution to this part of the problem.

Potential function argument Consider the function Z_t^2. The expected
change in this quantity is

    E[Z_{t+1}^2 − Z_t^2 | F_t] ≥ (2Z_t + 1) Pr[Z_{t+1} = Z_t + 1 | F_t]
                                + (−2Z_t + 1) Pr[Z_{t+1} = Z_t − 1 | F_t]
                              ≥ (2Z_t + 1) ℓ/(2m) + (−2Z_t + 1) ℓ/(2m)
                              = ℓ/m
                              ≥ 1/m.
(The second step uses Zt ≥ 1, as enforced by the stubborn vertex.)
If we define Y_t = Z_t^2 − t/m, then Y_t is a submartingale, since the expected
increase in Z_t^2 always pays for the increase in t/m. So at any stopping time
τ that satisfies the conditions of the Optional Stopping Theorem for {Y_t},
E[Y_τ] ≥ E[Y_0].
Let τ be the first time at which Z_τ = n. Then Y_τ = Z_τ^2 − τ/m = n^2 − τ/m.
For t ≤ τ, Y_t and τ satisfy the bounded increments and finite expected time
conditions, because Y_t can't change by more than 2n + 1/m in this range and τ
has finite expectation by the usual argument for Markov chain hitting times.
So E[Y_τ] = n^2 − (1/m) E[τ] ≥ Y_0 = Z_0^2 ≥ 0. Solve to get E[τ] ≤ mn^2 = O(n^4).

Coupling argument An alternative argument couples Z_t with an unbi-
ased random walk W_t with a reflecting barrier at 0, that takes steps at
random times T_1, T_2, . . . , where E[T_{i+1} − T_i] ≤ m. The idea is to move
W_t only when we pick a 0–1 edge in the X^t process, and argue that we
can always keep Z_t ≥ W_t by incrementing W_t whenever we increment Z_t
and decrementing W_t even if Z_t doesn't go down because of stubbornness.
So when W_t reaches n after at most n^2 m expected steps (Wald's Equation
again!), Z_t must already have reached n.
(This is pretty much the same argument as the potential function ar-
gument, except that we hide the potential function inside the known n2
bound for the unbiased random walk. I think the coupling argument is more
satisfying in some ways because it more intuitively describes what the Zt
process is doing. However, at least as written above, it skips over justifying
some details that don’t come up with the potential function, so there is also
more room for error.)

A.6.2 Almost Markov


Let G be a d-regular graph on n vertices. Assign a label ℓ(v) ∈ {0, 1} to
each vertex. Let V_0 V_1 V_2 . . . be a random walk on G starting at some initial
fixed vertex v_0, where each V_{t+1} is chosen uniformly among the d neighbors
of V_t, and let ℓ(V_0), ℓ(V_1), ℓ(V_2), . . . be the sequence of vertex labels observed
during this random walk.
Call the sequence ℓ(V_0), ℓ(V_1), ℓ(V_2), . . . almost Markov with error ε if,
for all t ≥ 0, 1/2 − ε ≤ E[ℓ(V_{t+1}) | ℓ(V_0), . . . , ℓ(V_t)] ≤ 1/2 + ε.
Show that, for any fixed ε > 0, there is some minimum degree d_ε, such that
for any d-regular graph G on n vertices, it is possible to compute a labeling
on G in expected time polynomial in n and d, such that a random walk on G
starting from any vertex produces a sequence of labels ℓ(V_0), ℓ(V_1), ℓ(V_2), . . .
that is almost Markov with error ε.

Solution
Suppose we label each vertex independently and uniformly at random. For
each vertex v, let N(v) be the set of all nodes adjacent to v, let L_v =
(1/d) Σ_{u∈N(v)} ℓ(u) be the probability that taking a step starting at v lands on
a node labeled with a 1, and let A_v be the event |L_v − 1/2| > ε.
Observe that if none of the events A_v occurs, we get the almost-Markov
property, because the bound on E[ℓ(V_{t+1})] will hold conditioned on any
previous vertex V_t. We will use the Lovász Local Lemma to show that a
labeling exists that makes none of these occur, and then use Moser-Tardos
to find one.
With a bit of tinkering, we can write L_v − 1/2 as the sum of d independent
±1/(2d) random variables, so Hoeffding's inequality says

    p = Pr[A_v] ≤ Pr[|L_v − 1/2| ≥ ε] ≤ 2e^{−ε^2/(2d(1/d)^2)} = 2e^{−dε^2/2}.

Let Γ(A_v) = {A_u | d(v, u) = 2}. Then A_v is independent of any event
A_u ∉ Γ(A_v) ∪ {A_v}, because only events in this set depend on vertices that
are adjacent to v, which are the only ones that affect A_v. Because the graph
is d-regular, there are at most d(d − 1) vertices at distance 2 from v, and
thus at most d(d − 1) events in Γ(A_v).
The Lovász Local Lemma applies if ep(d(d − 1) + 1) < 1, which holds if
e · 2e^{−dε^2/2} · (d(d − 1) + 1) < 1. For any fixed ε, the expression on the left goes
to zero in the limit as d increases, so there exists some value d_ε such that
the bound holds for all d ≥ d_ε. If the bound does hold, LLL says that there
exists a labeling that makes none of the A_v happen.
Using the Moser-Tardos algorithm [MT10], we can find such a labeling in
n/(d(d − 1)) resamplings on average. The efficient way to manage the resamplings
is to compute all the L_v values initially in O(nd) time, and then pay O(d^2)
time for each resampling to update the values and check any that have
changed. If we do this, the initial setup cost dominates on average, giving
an overall expected O(nd) total time, which is about the best we could
reasonably hope for given the size of the input, and which is polynomial in n
and d as required.
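A sketch (ours) of the resampling loop; the simple scan-everything implementation below ignores the incremental-update optimization described above:

    import random

    def moser_tardos_labels(neighbors, eps):
        # resample the labels in N(v) whenever the bad event A_v holds,
        # i.e. whenever the neighborhood average L_v strays from 1/2
        label = {v: random.randrange(2) for v in neighbors}
        def bad(v):
            d = len(neighbors[v])
            return abs(sum(label[u] for u in neighbors[v]) / d - 0.5) > eps
        while True:
            bad_vs = [v for v in neighbors if bad(v)]
            if not bad_vs:
                return label
            for u in neighbors[random.choice(bad_vs)]:
                label[u] = random.randrange(2)   # resample vars of A_v

    # d-regular test graph: a circulant where i is adjacent to i +- 1..d/2
    n, d = 200, 40
    nbrs = {i: [(i + s) % n for s in range(1, d // 2 + 1)]
            + [(i - s) % n for s in range(1, d // 2 + 1)] for i in range(n)}
    lab = moser_tardos_labels(nbrs, eps=0.3)
    print(all(abs(sum(lab[u] for u in nbrs[v]) / d - 0.5) <= 0.3 for v in nbrs))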
Appendix B

Sample assignments from Spring 2023

B.1 Assignment 1, due Thursday 2023-02-16 at 23:59
B.1.1 Hashing without counting
Suppose we are building a hash table with separate chaining, which is
implemented as an array of m positions A[0], A[1], . . . , A[m − 1], each of
which may hold zero or more elements in some secondary data structure like
a variable-length array or linked list (see §7.1). We’ll abstract away from the
actual hash function and assume that when a new element is inserted, its
position is chosen independently and uniformly at random.
Normally we would keep track of the number of elements n inserted into
the table and grow the table by some constant factor when the load factor
α = n/m gets too big. But this is work and we are lazy. So instead we adopt
the following heuristic: grow the table as soon as some element is inserted
into position 0. We’d like to make sure that this heuristic doesn’t produce
any bad outcomes, by looking at how crowded the positions in the table are
at the moment after this element is placed in A[0], just before expanding the
table.
Specifically, for this configuration:

1. Compute the exact expected load factor α = n/m as a function of m.

2. Compute the best high-probability asymptotic bound you can for the
maximum number of elements in any table position as a function of


m. This means that you should find a function f (m) such that for
all c > 0, the maximum number of elements is at most O(f (m)) with
probability at least 1 − m−c , where the constant in the asymptotic
bound may depend on c.

Solution
1. Let X be the number of elements inserted into the table. Then X
counts the number of insertions up to and including the first insertion
to position 0, making it a geometric random variable with parameter
p = 1/m (see §3.6.2). This gives E[X] = m.
The expected load factor is E[X/m] = 1.

2. Pick some i ≠ 0, and let X_i be the number of elements in position i.
Let's calculate Pr[X_i ≥ k]. If k = 0, this is 1. Otherwise, when we
insert the first element, there is a 1/m chance it lands in position i, and
a 1/m chance it lands in position 0. The rest of the time it lands in a
position j ∉ {0, i} and has no effect on X_i.
So we have, for k > 0,

    Pr[X_i ≥ k] = (1/m) Pr[X_i ≥ k | insert 0]
                + (1/m) Pr[X_i ≥ k | insert i]
                + ((m − 2)/m) Pr[X_i ≥ k | insert elsewhere],

which gives

    Pr[X_i ≥ k] = (1/2) Pr[X_i ≥ k | insert 0]
                + (1/2) Pr[X_i ≥ k | insert i].

The first branch is 0 when k > 0. For the second branch, observe
that Pr[X_i ≥ k | insert i] = Pr[X_i ≥ k − 1], since we already have one
element and now we are asking if we can get k − 1 more using the same
process that generates X_i. This gives a recurrence:

    Pr[X_i ≥ k] = 1                          if k = 0,
    Pr[X_i ≥ k] = (1/2) Pr[X_i ≥ k − 1]      if k > 0,

which has the solution Pr[X_i ≥ k] = 2^{−k}.
Now use the union bound to compute, for k > 1:

    Pr[max X_i ≥ k] = Pr[∃i : X_i ≥ k]
                    ≤ Σ_{i=1}^{m−1} Pr[X_i ≥ k]
                    = (m − 1) 2^{−k}.

Pick any c > 0. Let k = (c + 1) lg m. Then

    Pr[max X_i ≥ k] ≤ (m − 1) 2^{−(c+1) lg m}
                    = (m − 1) m^{−c−1}
                    < m^{−c},

from which it follows that

    Pr[max X_i < (c + 1) lg m] ≥ 1 − m^{−c}.

Since this makes max X_i = O(log m) with probability 1 − m^{−c} for any
fixed c > 0, we get a high-probability bound of O(log m).
Though this is not required by the problem, we can see that it is not
possible to do better than O(log m) by looking at the distribution
for a single bin. We have Pr[max X_i ≥ k] ≥ Pr[X_1 ≥ k] = 2^{−k}, so
Pr[max X_i ≥ c lg m] ≥ m^{−c} for any fixed c.
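Both parts are easy to check by simulation (our sketch):

    import random

    def run(m):
        # insert uniform random positions until position 0 is hit;
        # report the total inserts X and the fullest nonzero position
        load = [0] * m
        while True:
            pos = random.randrange(m)
            load[pos] += 1
            if pos == 0:
                return sum(load), max(load[1:])

    m, trials = 256, 2000
    results = [run(m) for _ in range(trials)]
    print(sum(r[0] for r in results) / trials, m)   # E[X] = m
    print(max(r[1] for r in results))               # O(log m) whp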

B.1.2 Permutation routing on an incomplete network


Suppose we have a network of n = 2d senders and n = 2d receivers, where
each sender has outgoing links to exactly d distinct receivers. Other than
this information, the structure of the graph is unconstrained.
Let the senders be labeled 1 . . . n and let the receivers be labeled π1 , . . . , πn
where π is a permutation of 1 . . . n chosen uniformly at random. Call a sender
i “successful” if there is a link from sender i to the receiver who is assigned
label i. Let S be the number of successful senders. Then S is a random
variable that depends on the choice of π. (Figure B.1 shows an example.)
1. What is the exact value of E [S] as a function of n?
2. What are the smallest and largest possible exact values of Var [S] as a
function of n?
3. Show that for any choice of graph that satisfies the constraints above,
at least Ω(n) senders are successful with probability at least 1 − o(1).

[Two network diagrams, not reproducible in text: each shows senders 1–4
on the left connected to receivers on the right, with the receivers' random
labels differing between the two copies.]

Figure B.1: An incomplete network with n = 2d = 4. In the left copy, the
random permutation makes all four senders successful, giving S = 4. In the
right copy, a different random permutation makes only sender 1 successful,
giving S = 1.

Solution
1. Let X_i be the indicator variable for the event that sender i is successful,
so that S = Σ_{i=1}^n X_i. Each sender i is successful if and only if i is the
label of one of i's d neighbors. This occurs with probability d/n = 1/2,
so E[X_i] = 1/2 and E[S] = n/2.
P P
2. We have Var[S] = Σ_i Var[X_i] + Σ_{i≠j} Cov[X_i, X_j]. Since the X_i are
fair coin-flips we get Var[X_i] = 1/4.
To bound Cov[X_i, X_j], let δ_i and δ_j be the neighborhoods of senders i
and j, and let c_ij = |δ_i ∩ δ_j| be the number of neighbors these senders
have in common. Then

    E[X_i X_j] = Pr[i ∈ δ_i ∧ j ∈ δ_j]
               = Pr[i ∈ δ_i ∩ δ_j] Pr[j ∈ δ_j | i ∈ δ_i ∩ δ_j]
                 + Pr[i ∈ δ_i \ δ_j] Pr[j ∈ δ_j | i ∈ δ_i \ δ_j]
               = (c_ij/n)((d − 1)/(n − 1)) + ((d − c_ij)/n)(d/(n − 1))
               = (c_ij(d − 1) + (d − c_ij)d) / (n(n − 1))
               = (c_ij d − c_ij + d^2 − c_ij d) / (n(n − 1))
               = (d^2 − c_ij) / (n(n − 1)).

This gives

    Cov[X_i, X_j] = E[X_i X_j] − E[X_i] E[X_j]
                  = (d^2 − c_ij)/(n(n − 1)) − d^2/n^2
                  = (d^2 n − c_ij n − d^2(n − 1)) / (n^2(n − 1))
                  = (d^2 − c_ij n) / (n^2(n − 1))
                  = (1/4 − c_ij/n) / (n − 1).

Summing over all n(n − 1) pairs i ≠ j gives

    Var[S] = Σ_i Var[X_i] + Σ_{i≠j} Cov[X_i, X_j]
           = n/4 + n/4 − Σ_{i≠j} c_ij / (n(n − 1))
           = n/2 − (1/(n(n − 1))) Σ_{i≠j} c_ij.

To minimize the variance, we want to make Σ_{i≠j} c_ij as large as possible.
We can do this by routing all n senders to the same d = n/2 receivers,
giving c_ij = n/2 for all i ≠ j and thus Var[S] = 0. This is not
surprising because in this case, all and only those senders whose ids
appear among the d favored receivers will be successful, making S a
constant.
For the upper bound, we can get a crude bound by observing that c_ij ≥ 0
gives Var[S] ≤ n/2. But in general there is no graph that actually
produces c_ij = 0 for all i ≠ j. Instead, we need to look at minimizing
Σ c_ij subject to the constraint that there are nd edges.
Let e_ijk be 1 if senders i and j share an edge to receiver k and 0
otherwise. Then c_ij = Σ_k e_ijk. But if we write d_k for the degree of
receiver k, we also have that Σ_{i≠j} e_ijk = d_k(d_k − 1), since each ordered
pair of distinct neighbors of k contributes one nonzero e_ijk. It follows

that

    Σ_{i≠j} c_ij = Σ_{i≠j} Σ_k e_ijk = Σ_k d_k(d_k − 1).

We have the constraint Σ_k d_k = dn; subject to this constraint, Σ_k d_k(d_k −
1) is minimized when all d_k = d. This can be achieved by routing each
sender i to receivers i, i + 1, . . . , i + (d − 1) (mod n) or by routing all of
the first d senders to all of the first d receivers and similarly for the
second d senders and receivers.
This gives

    Var[S] = n/2 − (1/(n(n − 1))) Σ_{i≠j} c_ij
           ≤ n/2 − (1/(n(n − 1))) · nd(d − 1)
           = n/2 − d(d − 1)/(n − 1)
           = (n/2)(1 − (n/2 − 1)/(n − 1))
           = (n/2) · (2n − 2 − n + 2)/(2(n − 1))
           = n^2 / (4(n − 1)).

As a check, we can observe that when n = 2, this gives Var[S] = 1 =
n/2, matching the crude bound, because in this case we can in fact
guarantee no overlap between senders' neighborhoods.

3. Since Var[S] ≤ n/2, we can use Chebyshev's inequality to compute

    Pr[S ≤ n/4] = Pr[S − E[S] ≤ −n/4]
                ≤ Pr[|S − E[S]| ≥ n/4]
                ≤ (n/2)/(n/4)^2
                = 8/n.

So our probability of getting at least n/4 = Ω(n) successful senders is
at least 1 − 8/n = 1 − o(1).
A tighter bound can be obtained using Azuma's inequality with an
appropriate exposure martingale. Let F_t be the σ-algebra generated
by knowing which receiver is assigned label j for all j ≤ t. Let
Y_t = E[S | F_t].
To apply Azuma's inequality, we need a bound on |Y_{t+1} − Y_t|. Let's
look at E[X_i | F_t] and how it changes with t.

   (a) If $i \leq t$, then the target for sender $i$ is already known, and $X_i$ is either 0 or 1. The change $E[X_i \mid \mathcal{F}_{t+1}] - E[X_i \mid \mathcal{F}_t]$ is exactly 0 in this case.

   (b) If $i = t+1$, then $\mathcal{F}_{t+1}$ determines $X_i$. Whatever the previous conditional expectation, the new conditional expectation is either 0 or 1. This gives a change of at most $\pm 1$.
   (c) If $i > t+1$, then there are $n - t$ possible places to put label $i$, of which $d_i$ are connected to sender $i$ for some $0 \leq d_i \leq d$. The value of $d_i$ depends on how many receivers connected to sender $i$ have known labels in $\mathcal{F}_t$.

       In this case we have
       \begin{align*}
       E[X_i \mid \mathcal{F}_{t+1}] - E[X_i \mid \mathcal{F}_t] &\leq \frac{d_i}{n-t-1} - \frac{d_i}{n-t} \\
       &= \frac{d_i}{(n-t)(n-t-1)} \\
       &\leq \frac{n-t}{(n-t)(n-t-1)} \\
       &= \frac{1}{n-t-1}.
       \end{align*}
       In the other direction,
       \begin{align*}
       E[X_i \mid \mathcal{F}_{t+1}] - E[X_i \mid \mathcal{F}_t] &\geq \frac{d_i - 1}{n-t-1} - \frac{d_i}{n-t} \\
       &= -\frac{n - t - d_i}{(n-t)(n-t-1)} \\
       &\geq -\frac{n-t}{(n-t)(n-t-1)} \\
       &= -\frac{1}{n-t-1}.
       \end{align*}

       In either case the absolute value is bounded by $\frac{1}{n-t-1}$. There are $n-t-1$ values of $i$ in this range, so the sum over all of them is at most $\pm 1$.

   Combining the last two cases, we have $|E[S \mid \mathcal{F}_{t+1}] - E[S \mid \mathcal{F}_t]| \leq 2$. So Azuma's inequality gives $\Pr[S - n/2 \geq t] \leq e^{-t^2/8n}$ and similarly $\Pr[S - n/2 \leq -t] \leq e^{-t^2/8n}$. The constant in the denominator could possibly be improved a bit with a better analysis, but even as is we get that $S = n/2 \pm \Theta(\sqrt{n \log n})$ with high probability and $S \geq n/4 = \Omega(n)$ with all but exponentially small probability.

B.2 Assignment 2, due Thursday 2023-03-30 at 23:59
B.2.1 Some streaming data structures

Suppose you are given a stream $x_1, x_2, \dots, x_{2n}$ of $2n$ integer values in the range $0 \dots n-1$. You are presented with these values one at a time. Upon receiving each value, you may update a data structure of your choosing that fits in $O(\log n)$ bits of space. You have no other storage.

1. Give an algorithm for sampling a single value with probability proportional to the number of times it occurs in the stream, returning this value once all elements of the stream have been processed.

   For example, given a stream 1, 0, 1, 2, 1, 1, your algorithm should return 0 or 2 with probability 1/6 each, and return 1 with probability 2/3.

2. Give a Las Vegas algorithm that returns a value that occurs at least twice in the stream with probability $\Omega(1)$. If it doesn't succeed, it should return $\bot$ to indicate failure. (This algorithm should never return a value that occurs less than twice.)

In both cases, prove the correctness of your algorithm. You may assume that $n$ is known to the algorithm.

Solution

1. There are two straightforward ways to do this using $O(\log n)$ bits:

   (a) The easiest is to sample the position of the desired element ahead of time, then count down until we hit it. This requires $2n$ possible states for the countdown timer plus $n$ possible states for the stored value, for a total of $\lceil \lg 3n \rceil = O(\log n)$ bits.

   (b) Alternatively, we could store a value while counting how many values we've seen, and replace the stored value with the $k$-th value with probability $1/k$.¹ A straightforward induction argument shows that this makes the stored value uniform across all the values seen so far. We need $\lceil \lg 2n \rceil$ bits for the counter and $\lceil \lg n \rceil$ bits to store the current value, but the total is still $O(\log n)$. (A short code sketch of this approach follows.)
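Here is a minimal Python sketch of approach (b); the function name is mine, and it uses an ordinary machine integer for the counter rather than the $\lceil \lg 2n \rceil$-bit register counted above:

import random

def sample_from_stream(stream):
    # Keep one stored value; replace it with the k-th value seen
    # with probability 1/k, so the result is uniform over stream
    # positions, i.e., proportional to each value's frequency.
    stored = None
    for k, x in enumerate(stream, start=1):
        if random.randrange(k) == 0:  # true with probability 1/k
            stored = x
    return stored

On the stream 1, 0, 1, 2, 1, 1 from the example, this returns 1 with probability $4/6 = 2/3$, as required.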

2. For this part, the intuition is that we will sample a value as we go, then record a winner if we happen to see it again. How easy this is to analyze depends on which sampling algorithm we use:

   (a) If we pick a particular position $i$ and then sample $x_i$, we win as long as $x_i$ is not the last occurrence of this particular value in the sequence. Since there are at most $n$ positions that are the last occurrence of a value, we have at least a $1 - n/2n = 1/2$ chance of winning. This requires an extra bit to record whether or not the sampled value appeared again, for a total of $\lceil \lg 3n \rceil + 1 = O(\log n)$ bits.

   (b) Alternatively, we could run the continuous sampling algorithm above and fix the stored value if we see it again. The analysis in this case is a bit more messy. Here is one possible approach.

       Consider the original sampling process, and let $A_i$ be the event that (a) there exists some $j > i$ such that $x_j = x_i$, and (b) value $x_i$ is stored at time $i$ and never overwritten. There are at least $n$ positions for which (a) holds, and for each of these positions we have $\Pr[A_i] = 1/2n$.

       Unfortunately these events are not independent. But for any $i$, we have that $\Pr[\neg A_i \mid \neg A_1 \wedge \dots \wedge \neg A_{i-1}]$ is either 1 (if $A_i$ never occurs) or at most $1 - 1/2n$ (since previous events $A_j$ not occurring for $j < i$ can only increase the likelihood of $A_i$ occurring). The latter case holds for at least $n$ choices of $i$, giving
       \begin{align*}
       \Pr\left[\bigwedge \neg A_i\right] &= \prod \Pr[\neg A_i \mid \neg A_1 \wedge \dots \wedge \neg A_{i-1}] \\
       &\leq (1 - 1/2n)^n \\
       &\leq e^{-1/2},
       \end{align*}
¹This is a special case of a more general algorithm known as reservoir sampling, due to Vitter [Vit85].
[Figure B.2: Example of graph density evolution. The initial graph has density $D_0 = 1/\binom{3}{2} = 1/3$. After duplicating $b$, the new density is $D_1 = 2/\binom{4}{2} = 1/3$. After duplicating $a$ as well, the final density is $D_2 = 4/\binom{5}{2} = 2/5$.]


       which is a constant strictly less than 1. So with constant probability, at least one event $A_i$ occurs. But then the algorithm observes a duplicate and returns it.

B.2.2 A dense network

Suppose we construct a network as follows. Initially, we have $n_0 = n$ nodes, of which one is an inert base station with no incident edges. We also have $m_0 = m$ edges between the $n-1$ remaining nodes. At each step, we pick one of the $n_t - 1$ non-base-station nodes uniformly at random and duplicate it, creating a new node connected to the same nodes as the old one.

The density $D_t$ of this network after $t$ steps is the ratio between the number of edges $m_t$ and the number of possible edges $\binom{n_t}{2}$, where $n_t = n_0 + t$ is the number of nodes (including the base station, even though it doesn't otherwise participate in the network). We'd like to examine how the density evolves over time as new nodes are added to the network.

An example of this process is given in Figure B.2.
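To make the process concrete, here is a minimal Python sketch (function name and representation are mine) that runs the duplication process and records each $D_t$; averaging the results over many runs should stay near $m/\binom{n}{2}$, consistent with the martingale property established in the solution:

import random

def density_evolution(n, edges, steps):
    # Node 0 is the inert base station; `edges` is a list of pairs
    # over nodes 1..n-1.  Returns the list of densities D_t.
    adj = {v: set() for v in range(1, n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    duplicable = list(adj)           # non-base-station nodes
    m, n_t = len(edges), n
    densities = []
    for _ in range(steps):
        v = random.choice(duplicable)
        w = n_t                      # fresh id for the duplicate
        adj[w] = set(adj[v])         # same neighbors as v
        for u in adj[v]:
            adj[u].add(w)
        duplicable.append(w)
        m += len(adj[v])
        n_t += 1
        densities.append(m / (n_t * (n_t - 1) // 2))
    return densities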

1. Compute the exact value of $E[D_t]$ as a function of $n$, $m$, and $t$.

2. Show that, for any $t$, $D_t$ is within $O\left(\sqrt{\log n / n}\right)$ of $E[D_t]$ with high probability.

Solution

1. We'll show that $\{D_t\}$ is a martingale, giving $E[D_t] = E[D_0] = m/\binom{n}{2}$ for all $t$.

   Let $G_t$ be the graph after $t$ steps. Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by $\langle G_0, \dots, G_t \rangle$.

   Each edge in $G_t$ goes between two non-base-station nodes, so there is a chance of exactly $\frac{2}{n_t - 1}$ that one of its endpoints is duplicated. Summing over all edges gives $E[m_{t+1} \mid \mathcal{F}_t] = m_t\left(1 + \frac{2}{n_t - 1}\right) = m_t \cdot \frac{n_t + 1}{n_t - 1}$. But then
   \begin{align*}
   E[D_{t+1} \mid \mathcal{F}_t] &= \frac{m_t \cdot \frac{n_t+1}{n_t-1}}{\binom{n_t+1}{2}} \\
   &= \frac{m_t \cdot \frac{n_t+1}{n_t-1}}{\binom{n_t}{2} \cdot \frac{n_t+1}{n_t-1}} \\
   &= \frac{m_t}{\binom{n_t}{2}} \\
   &= D_t.
   \end{align*}
   It follows that $E[D_t] = E[D_0] = m/\binom{n}{2}$.

2. We will apply Azuma's inequality. This requires getting bounds on the martingale differences $X_t = D_t - D_{t-1}$.

   Since the maximum possible degree of any node in $G_{t-1}$ is $n_{t-1} - 2$, we have $m_t \leq m_{t-1} + n_{t-1} - 2$ and thus
   \begin{align*}
   D_t - D_{t-1} &\leq \frac{m_{t-1} + n_{t-1} - 2}{\binom{n_{t-1}+1}{2}} - \frac{m_{t-1}}{\binom{n_{t-1}}{2}} \\
   &\leq \frac{m_{t-1} + n_{t-1} - 2}{\binom{n_{t-1}}{2}} - \frac{m_{t-1}}{\binom{n_{t-1}}{2}} \\
   &\leq \frac{n_{t-1}}{\binom{n_{t-1}}{2}} \\
   &= \frac{2}{n_{t-1} - 1} \\
   &= \frac{2}{n + t - 2}.
   \end{align*}

   In the other direction, duplicating a node destroys no edges, so $m_t \geq m_{t-1}$, giving
   \begin{align*}
   D_t - D_{t-1} &\geq \frac{m_{t-1}}{\binom{n_{t-1}+1}{2}} - \frac{m_{t-1}}{\binom{n_{t-1}}{2}} \\
   &= m_{t-1} \cdot \frac{\binom{n_{t-1}}{2} - \binom{n_{t-1}+1}{2}}{\binom{n_{t-1}+1}{2}\binom{n_{t-1}}{2}} \\
   &= m_{t-1} \cdot \frac{-n_{t-1}}{\binom{n_{t-1}+1}{2}\binom{n_{t-1}}{2}} \\
   &\geq \frac{-n_{t-1}}{\binom{n_{t-1}+1}{2}} \\
   &= -\frac{2}{n+t}.
   \end{align*}
   So in either direction we have $|X_t| = |D_t - D_{t-1}| \leq \frac{2}{n+t-2}$.
Now let us apply Azuma’s inequality:
t  2 !
2
X 2
Pr [|Dt − E [Dt ]| ≥ α] ≤ 2 exp −α /
i=1
n+i−2
  

1 
X
≤ 2 exp−α2 /4
k=n−1
k2
 Z ∞
1
 
≤ 2 exp −α2 / 4 dx
x=n−2 x2
 
2
≤ 2 exp −α (n − 2)/4 .

p p 
For any fixed c, setting α = 4(c + 1) ln n/(n − 2) = O log n/n
gives a probability of at most 2n−c−1 ≤ n−c (when n ≥ 2) that we are
more than α away from the expectation.

B.3 Assignment 3, due Thursday 2023-04-27 at 23:59
B.3.1 Shuffling a graph

Consider the following shuffling algorithm: Given a connected graph $G$ with $n$ vertices labeled 1 through $n$, and $m$ edges, at each step, with probability 1/2, choose one of the edges uniformly at random and swap the labels of its endpoints. This process yields a Markov chain that is aperiodic and irreducible,² so it has a unique stationary distribution.

1. What is the stationary distribution of this process?

2. Show that this process converges to within a total variation distance of $\epsilon$ of its stationary distribution in time polynomial in $n$ and $\log \frac{1}{\epsilon}$.

Solution

1. Suppose there is a labeling $i$ and a labeling $j$ such that $j$ is reachable from $i$ in one step. Then these labelings differ by swapping the labels across one edge, so $p_{ij} = p_{ji} = \frac{1}{2m}$. For each labeling $i$, let $\pi_i = 1/n!$. Then $\sum_i \pi_i = 1$ and, for all $i$ and $j$, $\pi_i p_{ij} = \pi_j p_{ji} = \frac{1}{2m \cdot n!}$. This shows that the chain is reversible and that this particular $\pi$ is its stationary distribution.

2. We'll do a coupling argument, similar to the argument for adjacent-swap shuffling on a line. For technical reasons that we will see later, it will be helpful to assume that $G$ has at least two edges. In the case where $G$ has only one edge, we can argue that the process reaches the stationary distribution in one step, as there are only two labelings and the first step has a 1/2 chance each of switching or staying put.

   Let $X$ and $Y$ be two copies of the process, with $X$ starting in an arbitrary distribution and $Y$ starting in the uniform stationary distribution. At each step, let $D$ be the set of edges with the same label on at least one endpoint. For each edge $uv$:

   (a) If $uv \in D$, swap $uv$ in both copies with probability $\frac{1}{2m}$.

   (b) If $uv \not\in D$, swap $uv$ in $X$ only with probability $\frac{1}{2m}$ and in $Y$ only with probability $\frac{1}{2m}$.

   Do nothing with the remaining probability $\frac{|D|}{2m}$.
   For each label $\ell$, write $X_\ell$ for the vertex labeled with $\ell$ in $X$ and $Y_\ell$ for the vertex labeled with $\ell$ in $Y$. As in the adjacent-swap case, we can argue that once $X_\ell = Y_\ell$, any edge $uv$ with an endpoint labeled $\ell$ is in $D$, so $X_\ell$ continues to equal $Y_\ell$ in all subsequent states of the coupled process. So the question then is how long it takes for two unlinked copies of a label to become linked.

²Aperiodicity is immediate from having steps that do nothing. For irreducibility, if $G$ is a tree, there is at least one vertex $v$ with degree 1; use a sequence of swaps to move the desired label to $v$, then recurse on $G \setminus \{v\}$ to fix the rest of the labeling. For general $G$, apply this procedure to a spanning tree of $G$.

   Define a new process $Z^t = (U^t, V^t)$, where $U^t$ represents the position $X_\ell^t$ and $V^t$ represents $Y_\ell^t$. Unlike the $(X_\ell^t, Y_\ell^t)$ process, we will assume that at each step of $Z^t$ we always move $U^t$ across each incident edge with probability $\frac{1}{2m}$, and similarly for $V^t$, without regard to whether $U^t = V^t$. This gives a Markov chain with the same transition probabilities as $(X_\ell^t, Y_\ell^t)$ when $X_\ell^t \neq Y_\ell^t$; but unlike the $(X_\ell^t, Y_\ell^t)$ process, the $Z^t$ process is both irreducible and reversible, with a uniform stationary distribution. We can also argue that it is aperiodic under our assumption that $m > 1$, because there exists at least one state where some edge is incident to neither $U^t$ nor $V^t$, so there is a nonzero probability that $Z^{t+1} = Z^t$ in this state.
   We will now show that $X_\ell^t$ and $Y_\ell^t$ collide with high probability after polynomially-many steps, by showing convergence of $Z^t$ to its stationary distribution using Cheeger's inequality. Observe that the $Z$ chain has exactly $m^2$ states, and that each non-empty subset $S$ of the states has at least one transition that leaves it with probability $\pi_i p_{ij} = \frac{1}{m^2} \cdot \frac{1}{2m} = \frac{1}{2m^3}$. This gives $\Phi_S \geq \frac{1/2m^3}{|S|/m^2}$, which is at least $\frac{1}{2m^3}$ in the worst case. It follows that $\tau_2 \leq \frac{2}{\Phi^2} \leq 8m^6$, and that the total variation distance between $Z^t$ and the uniform distribution is at most $\frac{1}{2m}$ after $O(m^6 \log m)$ steps. In particular this gives a probability of at least $\frac{1}{m} - \frac{1}{2m} = \frac{1}{2m}$ of being in a state $(U^t, V^t)$ with $U^t = V^t$ at the end of each such interval.

   Repeating for $O(m \log m)$ intervals with a suitable constant gives a probability of at most $m^{-c}$, for any fixed $c$, that $U^t \neq V^t$ at the end of every one of these intervals, which shows that $X_\ell$ and $Y_\ell$ collide within $O(m^7 \log^2 m)$ steps with probability at least $1 - m^{-c}$.
   Since $m \geq n-1$, we can choose $c$ so that the probability that $X_\ell$ and $Y_\ell$ don't collide in $O(m^7 \log^2 m)$ steps is $O(n^{-2})$. By the union bound over all $n$ labels, this gives that every $X_\ell$ and $Y_\ell$ collide in time $O(m^7 \log^2 m)$ with probability at least $1 - O(n^{-1})$. Repeat as needed at most $O(\log(1/\epsilon))$ times to drive the probability of failure down below $\epsilon$, and use $m = O(n^2)$ to express the convergence time in terms of $n$ as $O(n^{14} \log^2 n \log(1/\epsilon))$. This is polynomial in $n$ and $\log \frac{1}{\epsilon}$, even though the exponents are terrible.

B.3.2 Counting unbalanced sets

Suppose you are given the usual set balancing set-up, where you have a collection of sets $S_1, S_2, \dots, S_m$ with $\left|\bigcup S_i\right| = n$, and you are asked to assign a value $\pm 1$ to each element $x$ of $S = \bigcup S_i$ with the goal of minimizing the discrepancy $d = \max_i \left|\sum_{x \in S_i} x\right|$.

We know that applying a random assignment gives $d = O(\sqrt{n \log m})$ with high probability. But this bound doesn't take into account any structure the $S_i$ might have. Give an FPRAS that takes the $S_i$ as input along with a bound $t$ and computes an approximation for $\Pr[d \geq t]$, assuming each $x$ is assigned $\pm 1$ using an independent fair coin-flip.

Solution

Let $A$ be the set of all assignments that produce $d \geq t$. For each $i$, let $X_i = \sum_{x \in S_i} x$ be the random variable representing the sum of the assignments of elements of $S_i$. We can write $A$ as the union of $B_{ij}$ for all $i \in [m]$ and all $j$ with $|j| \geq t$, where $B_{ij}$ is the event that $X_i = j$. Since this union is not disjoint, we will use Karp-Luby (see §11.4) to approximate its size.

As with #DNF, the idea is that we can both count and sample triples $\langle i, j, x \rangle$ where $x$ is an assignment that makes $X_i = j$. Let $n_i = |S_i|$. Let $Y_i$ be the number of elements of $S_i$ that are assigned $+1$; then $X_i = Y_i - (n_i - Y_i) = 2Y_i - n_i$, and solving for $Y_i$ gives $Y_i = \frac{n_i + X_i}{2}$, assuming the numerator is even. Given $i$ and $j$, there are exactly $2^{n - n_i}\binom{n_i}{(n_i+j)/2}$ assignments with $X_i = j$ when $n_i + j$ is even, and none when $n_i + j$ is odd. For the even case, we can sample these assignments in polynomial time by choosing $(n_i+j)/2$ elements without replacement from $S_i$ to have value $+1$, making the remaining elements of $S_i$ have value $-1$, and assigning $\pm 1$ values independently to any $x \in S \setminus S_i$. We will abuse notation slightly by writing $B_{ij}$ for the set of all possible triples $\langle i, j, x \rangle$ resulting from this process for a particular pair of values $i$ and $j$. Let $B = \bigcup_{i=1}^m \bigcup_{|j| \geq t,\ j + n_i \text{ even}} B_{ij}$; note that this union is disjoint.

Our full sampling procedure looks like this:

1. Sample $i$ and $j$ with probability $|B_{ij}|/|B|$. Assuming that we precompute the sizes of the $B_{ij}$ sets and that arithmetic operations take $O(1)$ time each, this takes time at most $O(nm)$ if we are naive about the sampling and $O(\log nm)$ if we are a little bit more clever and use binary search.

2. Sample an element $\langle i, j, x \rangle$ of $B_{ij}$ uniformly as described above. This takes time $O(n)$.

3. Return 1 if $x$ does not appear in $B_{i'j'}$ for any $i'j'$ that precedes $ij$ in lexicographic order. This requires checking $O(nm)$ previous sets at a cost of $O(n)$ each, for $O(n^2 m)$ time total, which dominates the cost of generating the sample.
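Here is a minimal Python sketch of this procedure (function and variable names are mine); it checks lexicographically earlier pairs by brute force, as in step 3, and returns the resulting estimate of $\Pr[d \geq t]$:

import math
import random

def estimate_pr_d_ge_t(sets, t, num_samples):
    # Karp-Luby-style estimator for Pr[d >= t], where
    # d = max_i |sum_{x in S_i} x| under uniform +/-1 assignments.
    S = sorted(set().union(*sets))
    n = len(S)
    # Pairs (i, j) with |j| >= t and n_i + j even, in lexicographic
    # order, together with the weights |B_ij|.
    pairs, weights = [], []
    for i, Si in enumerate(sets):
        ni = len(Si)
        for j in range(-ni, ni + 1):
            if abs(j) >= t and (ni + j) % 2 == 0:
                pairs.append((i, j))
                weights.append(2 ** (n - ni)
                               * math.comb(ni, (ni + j) // 2))
    B = sum(weights)
    if B == 0:
        return 0.0
    hits = 0
    for _ in range(num_samples):
        i, j = random.choices(pairs, weights)[0]
        plus = set(random.sample(sets[i], (len(sets[i]) + j) // 2))
        assign = {x: (1 if x in plus else -1) for x in sets[i]}
        for x in S:
            if x not in assign:
                assign[x] = random.choice((1, -1))
        # Credit the sample only if (i, j) is the lexicographically
        # first pair whose event X_i = j this assignment satisfies.
        for i2, j2 in pairs:
            if sum(assign[x] for x in sets[i2]) == j2:
                hits += ((i2, j2) == (i, j))
                break
    return (hits / num_samples) * B / 2 ** n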

Let $\rho$ be the probability that we obtain 1 in an iteration of this process. Then $|A| = \rho|B|$, and we can approximate $|A|$ to within relative error $\epsilon$ by approximating $\rho$ within the same bound. Because each element of $A$ occurs in at least one and at most $O(nm)$ of the $B_{ij}$, we have that $\rho = \Omega\left(\frac{1}{nm}\right)$. Applying Lemma 11.2.1, approximating $\rho$ using $N$ samples, where $N = \frac{3}{\epsilon^2 \rho} \ln \frac{2}{\delta} = O\left(\frac{nm \log(2/\delta)}{\epsilon^2}\right)$, gives us our desired error bounds in time polynomial in $nm$, $1/\epsilon$, and $\log \frac{1}{\delta}$, as required.
Appendix C

Sample assignments from Fall 2019

C.1 Assignment 1: due Thursday, 2019-09-12, at 23:00
Bureaucratic part
Send me email! My address is james.aspnes@gmail.com.
In your message, include:

1. Your name.

2. Your status: whether you are an undergraduate, grad student, auditor,


etc.

3. Whether you are taking the course as CPSC 469 or CPSC 569.

4. Anything else you’d like to say.

(You will not be graded on the bureaucratic part, but you should do it
anyway.)

C.1.1 The golden ticket


An infamous candy company announces it is placing a golden ticket uniformly
at random in one of n candy bars that are distributed to n children, with
unimaginable wealth to be bestowed on the child who receives the ticket.


Unfortunately, the candy company owner is notorious for trickery and lies,
so the probability that the golden ticket actually exists is only 1/2.
Consider the last child to open their candy bar, and what estimate they
should make of their probability of winning after seeing the first k children
open their candy bars to find no ticket.
An optimist reasons: If the ticket does exist, then the last child’s chances
went from 1/n to 1/(n − k). So the fact that none of the first k children got
the ticket is good news!
A pessimist reasons: The more candy bars are opened without seeing
the ticket, the more likely it is that the ticket doesn’t exist. So the fact that
none of the first k children got the ticket is bad news!
Which of them is right? Compute the probability that the last child
receives the ticket given the first k candy bars come up empty, and com-
pare this to the probability that the last child receives the ticket given no
information other than the initial setup of the problem.

Solution

Let $W$ be the event that the last child wins, $C$ the event that the ticket exists, and $L_k$ the event that the first $k$ children lose.

Without conditioning, we have
\begin{align*}
\Pr[W] &= \Pr[W \mid C]\Pr[C] + \Pr[W \mid \neg C]\Pr[\neg C] \\
&= \frac{1}{n} \cdot \frac{1}{2} + 0 \cdot \frac{1}{2} \\
&= \frac{1}{2n}.
\end{align*}
With conditioning, we have
\begin{align*}
\Pr[W \mid L_k] &= \frac{\Pr[W \wedge L_k]}{\Pr[L_k]} \\
&= \frac{1/2n}{\Pr[L_k \mid C]\Pr[C] + \Pr[L_k \mid \neg C]\Pr[\neg C]} \\
&= \frac{1/2n}{\frac{n-k}{n} \cdot \frac{1}{2} + 1 \cdot \frac{1}{2}} \\
&= \frac{1/2n}{\frac{n-k}{2n} + \frac{1}{2}} \\
&= \frac{1/2n}{\frac{2n-k}{2n}} \\
&= \frac{1}{2n-k}.
\end{align*}
Since $\frac{1}{2n-k} > \frac{1}{2n}$ when $k > 0$, the optimist is correct.
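As a sanity check, here is a quick Monte Carlo sketch (names and parameters are mine); the returned frequency should approach $\frac{1}{2n-k}$:

import random

def last_child_win_prob(n, k, trials=10**6):
    # Estimate Pr[last child wins | first k bars are empty].
    wins = losses = 0
    for _ in range(trials):
        exists = random.random() < 1/2    # ticket exists w.p. 1/2
        pos = random.randrange(n) if exists else None
        if exists and pos < k:
            continue                      # conditioning event L_k fails
        if exists and pos == n - 1:
            wins += 1
        else:
            losses += 1
    return wins / (wins + losses)

For example, last_child_win_prob(10, 5) should return roughly $1/15 \approx 0.0667$, versus $1/20$ with no conditioning.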

C.1.2 Exploding computers

Suppose we have a system of $3n$ machines $m_0, m_1, \dots, m_{3n-1}$ in a ring, and we turn each machine on at random with independent probability $p$. Over time, the machines may overheat: if we turn on three consecutive machines $m_i$, $m_{i+1}$, and $m_{i+2}$ (wrapping indices around mod $3n$), the middle machine $m_{i+1}$ overheats and explodes. Explosions happen simultaneously and are purely a function of which machines are initially turned on; so if $m_1$, $m_2$, $m_3$, $m_4$, and $m_5$ are all turned on, $m_2$, $m_3$, and $m_4$ will all explode. Call a machine active if it is turned on and does not explode.

We want to choose $p$ to maximize the number of active machines $X$. But there are two ways to interpret this goal.

1. Suppose our goal is to maximize the expected number of active machines $E[X]$. What value of $p$ should we choose, and what does this give us for $E[X]$?

2. Suppose instead our goal is to maximize the probability of having the maximum possible number of active machines. More formally, letting $M$ be the number of survivors we could get if we chose which machines to turn on optimally, we want to choose $p$ to maximize $\Pr[X = M]$. What value of $p$ should we choose, and what does this give us for $\Pr[X = M]$?
Solution

1. Let $A_i$ be the event that machine $m_i$ is turned on, and let $X_i$ be the indicator variable for $m_i$ being turned on and not exploding. We are trying to maximize $E\left[\sum X_i\right] = \sum E[X_i] = 3n\,E[X_1]$, where the last equation holds by symmetry.

   Observe that $X_1 = 1$ if $A_1$ occurs and it is not the case that both $A_0$ and $A_2$ occur. This gives $E[X_1] = \Pr[X_1 = 1] = p(1-p^2) = p - p^3$. By setting the derivative $1 - 3p^2$ to zero we find a unique extremum at $p = \sqrt{1/3}$; this is a maximum since $p(1-p^2) = 0$ at the two corner solutions $p = 0$ and $p = 1$. Setting $p = \sqrt{1/3}$ gives $3n\,E[X_1] = 3n \cdot \sqrt{1/3} \cdot (2/3) = n \cdot \frac{2}{\sqrt{3}} \approx 1.154701n$. (A simulation sketch follows.)
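Here is a quick simulation sketch (names are mine) that can be used to check this numerically; averaging active_count(n, (1/3)**0.5) over many trials should give about $1.1547n$:

import random

def active_count(n, p):
    # One sample of X: turn on 3n machines in a ring independently
    # w.p. p; a machine is active if it is on and its two neighbors
    # are not both on (otherwise it is the middle of three on machines
    # and explodes).
    N = 3 * n
    on = [random.random() < p for _ in range(N)]
    return sum(on[i] and not (on[i - 1] and on[(i + 1) % N])
               for i in range(N))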

2. First we need to figure out $M$.

   We can assume that we set up the optimal configuration so that no machines explode, because any initial configuration that produces explosions can be replaced by one where the dead machines are just turned off to start with, with no reduction in the number of active machines.

   Conveniently, we can divide the machines into consecutive groups of three, and observe that (a) turning on the first two machines in each group causes no explosions, and (b) turning on three machines in any single group causes the middle one to explode. This shows $M = 2n$ is the maximum achievable number of machines, because (a) lets us get $M$, and (b) says we can't do more, since turning on more than $2n$ machines would give some group three by the Pigeonhole Principle.

   We still need to characterize how many possible patterns of $M = 2n$ machines produce no explosions. Since we can only have two consecutive active machines, each pair of active machines must be followed by an inactive machine. Putting two inactive machines in a row would allow a partitioning into groups of three with some group getting only one active machine, leaving some other group with three, which is trouble. So the pattern must consist of pairs of active machines alternating with single inactive machines. There are exactly three ways to do this.

   Each of these three patterns occurs with probability $p^{2n}q^n$, where $q = 1 - p$. Maximizing this probability can again be done by taking the derivative $2np^{2n-1}q^n - np^{2n}q^{n-1}$ and setting it to 0. Removing common factors gives $2q - p = 0$, or $2 - 3p = 0$, giving $p = 2/3$, which is about what one would expect.
   We then get $\Pr[X = 2n] = 3 \cdot (2/3)^{2n}(1/3)^n$. This is pretty small for reasonably large $n$, which suggests that a randomized algorithm may not be the best in this case.

C.2 Assignment 2: due Thursday, 2019-09-26, at 23:00
C.2.1 A logging problem
An operating system generates a new log entry at time t with independent
probability pt , and the size of each log entry for any t is an integer X chosen
uniformly in the range a ≤ X ≤ b, where the choice of X at a particular time
t is independent of all the other log entry sizes and all choices of whether to
write a log entry.
Suppose that over some fixed period of time, the operating system
generates n log entries on average. Compute the expectation of the total size
S of all the log entries, and show that the probability that S is more than a
small constant fraction larger than E [S] is exponentially small as a function
of n.

Solution

For each $t$, let $X_t$ be the size of the log entry at time $t$ (taking $X_t = 0$ if no entry is written). Then
\[
E[S] = \sum_t E[X_t] = \sum_t p_t \cdot \frac{a+b}{2} = \frac{a+b}{2} \sum_t p_t = \frac{a+b}{2} \cdot n.
\]
We can’t use Chernoff bounds directly on the Xt because they aren’t
Bernoulli random variables, and we can’t use Hoeffding’s inequality because
we don’t actually know how many of them there are. So we will use Chernoff
first to bound the number of nonzero Xt and then use Hoeffding to bound
their sum.
The easiest way to think about this is to imagine we generate the nonzero $X_t$ by generating a sequence of independent values $Y_1, Y_2, \dots$, with each $Y_i$ uniform in $[a, b]$, then select the $i$-th nonzero $X_t$ to have value $Y_i$.

Let $Z_t$ be the indicator variable for the event that a log entry is written at time $t$. Then $N = \sum_t Z_t$ gives the total number of log entries, and $E[N] = \sum_t p_t = n$.
Pick some $\delta < 1$; then Chernoff's inequality (5.2.2) gives
\[
\Pr[N \geq (1+\delta)n] \leq e^{-n\delta^2/3}.
\]
Suppose $N < (1+\delta)n$. Then $\sum_t X_t = \sum_{i=1}^N Y_i \leq \sum_{i=1}^{(1+\delta)n} Y_i$. This last quantity is a sum of independent bounded random variables, so we can apply the asymmetric version of Hoeffding's inequality (5.3.17) to get
\[
\Pr\left[\sum_{i=1}^{(1+\delta)n} Y_i - (1+\delta)n \cdot \frac{a+b}{2} \geq s\right] \leq \exp\left(-\frac{s^2}{2(1+\delta)n\left(\frac{b-a}{2}\right)^2}\right).
\]
For $s = \delta n \cdot \frac{a+b}{2}$ this gives a bound of
\begin{align*}
\exp\left(-\frac{\delta^2 n^2 \left(\frac{a+b}{2}\right)^2}{2(1+\delta)n\left(\frac{b-a}{2}\right)^2}\right) &\leq \exp\left(-\frac{\delta^2 n \left(\frac{a+b}{2}\right)^2}{2(1+\delta)\left(\frac{b+a}{2}\right)^2}\right) \\
&= \exp\left(-\frac{\delta^2 n}{2(1+\delta)}\right).
\end{align*}

If we put this all together, we get
\[
\Pr\left[S \geq (1+2\delta)n \cdot \frac{a+b}{2}\right] \leq e^{-n\delta^2/3} + e^{-n\delta^2/2(1+\delta)} = e^{-\Theta(n)}.
\]

C.2.2 Return of the exploding computers

Suppose we are working in the model described in §C.1.2, where $3n$ machines in a ring are each turned on independently at random with probability $p$, and a machine remains active only if it is turned on and at least one of its neighbors is not turned on. We have seen that an optimal choice of $p$ gives $cn$ active machines on average for a constant $c$ independent of $n$.

Unfortunately, averages are not good enough for our customers. Show that there is a choice of $p$ and constants $c' > 0$ and $\delta > 0$ such that at least $c'n$ machines are active with probability at least $1 - e^{-\delta n}$ for any $n$.
Solution

We can use the method of bounded differences. Let $X_i$ be the indicator for the event that machine $i$ is turned on, and let $S = f(X_1, \dots, X_{3n})$ compute the number of active machines. Then changing one input to $f$ from 0 to 1 increases $f$ by at most 1, and decreases it by at most 2 (since it may be that the new machine explodes, producing no increase, while it also explodes both its neighbors, producing a net $-2$ change). So McDiarmid's inequality (5.3.13) applies with $c_i = 2$ for all $i$, giving
\[
\Pr[f(X_1, \dots, X_{3n}) - E[f(X_1, \dots, X_{3n})] \leq -t] \leq \exp\left(-\frac{2t^2}{3n \cdot 2^2}\right) = e^{-t^2/6n}.
\]
Choose $p$ to maximize the expected number of active machines, and let $E[S] = c'n$. Choose $c$ so that $0 < c < c'$. Then $\Pr[S \leq cn] = \Pr[S - E[S] \leq cn - c'n] \leq e^{-(c'-c)^2 n^2/6n} = e^{-(c'-c)^2 n/6} = e^{-\delta n}$, where $\delta = (c'-c)^2/6 > 0$.
The above is my original sample solution; below are some quick sketches of alternative approaches that showed up in submissions:

• Analyze the dependence and use Lemma 5.2.2. This gives similar results to McDiarmid's inequality with a bit more work. The basic idea is that $E[X_i \mid X_1 \dots X_{i-1}] \geq p(1-p)$ for $i < 3n$ whether or not $X_{i-1}$ is 1 (we don't care about the other variables). So we can let $p_i = p(1-p)$ for these values of $i$. To avoid wrapping around at the end, either omit $X_{3n}$ entirely or let $p_{3n} = 0$.

• Even simpler: Let $Y_i$ be the indicator for the event that machine $i$ is active. Then $Y_3, Y_6, \dots, Y_{3n}$ are all independent, and we can bound $\sum_{i=1}^n Y_{3i} \leq \sum_{i=1}^{3n} Y_i$ using either Chernoff or Hoeffding.

C.3 Assignment 3: due Thursday, 2019-10-10, at 23:00
C.3.1 Two data plans
A cellphone company offers a choice of two data plans, both based on
penalizing the user whenever they use more data than in the past. For both
plans the assumption is that your data usage in each month i ∈ {0 . . . ∞}
is an independent continuous random variable that is uniform in the range
[0, c] for some fixed constant c.
With Plan A, you pay a penalty of $1 for every month i in which your
usage Xi exceeds maxj<i Xj . This excludes month 0, for which the max of
the previous empty set is taken to be +∞.
With Plan B, you pay a penalty of $100 for every pair of months 2i and
2i + 1 in which X2i and X2i+1 both exceed maxj<2i Xj . As in Plan A, you
never pay a penalty for the pair of months 0 and 1.

1. Suppose that you expect to live forever (and keep whatever plan you
pick forever as well). If your goal is to minimize your expected total
cost over eternity, which plan should you pick?

2. Which plan is cheaper on average if you don’t plan to live for more
than, say, the next 1000 years?

(Assume no discounting of future payments in either case.)

Solution

Let $A_i$ be the event that we pay a penalty in month $i$ with Plan A, and let $B_i$, for even $i$, be the event that we pay a penalty for months $i$ and $i+1$ in Plan B. Then for $i > 0$,
\begin{align*}
\Pr[A_i] &= \Pr\left[X_i > \max_{j<i} X_j\right] \\
&= \Pr\left[X_i = \max_{j \leq i} X_j\right] \\
&= \frac{i!}{(i+1)!} \\
&= \frac{1}{i+1}.
\end{align*}
Similarly,
\begin{align*}
\Pr[B_i] &= \Pr\left[X_i > \max_{j<i} X_j \text{ and } X_{i+1} > \max_{j<i} X_j\right] \\
&= \frac{i! \cdot 2}{(i+2)!} \\
&= \frac{2}{(i+2)(i+1)}.
\end{align*}
(In the first case, we count all orderings of the $i+1$ elements $0 \dots i$ that make $X_i$ largest, and divide by the total number of orderings. In the second case, we count all the orderings that make $X_i$ and $X_{i+1}$ both larger than all the other values.)
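These probabilities are easy to check numerically; here is a small Monte Carlo sketch (function name is mine):

import random

def record_probs(i, trials=10**5):
    # Estimate Pr[A_i] (month i beats all previous months) and
    # Pr[B_i] (months i and i+1 both beat all months before i).
    a = b = 0
    for _ in range(trials):
        xs = [random.random() for _ in range(i + 2)]
        prev_max = max(xs[:i])      # max over months 0..i-1
        a += xs[i] > prev_max
        b += xs[i] > prev_max and xs[i + 1] > prev_max
    return a / trials, b / trials

For i = 3 this should give roughly $1/4$ and $1/10$, matching the formulas above.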
Now let us do some sums.

1. For the infinite-lifetime version, Plan A costs
   \[
   \sum_{i=1}^\infty \frac{1}{i+1},
   \]
   which diverges. Plan B costs
   \begin{align*}
   100 \cdot \sum_{i=1}^\infty \frac{2}{(2i+1)(2i+2)} &\leq 100 \cdot \sum_{i=1}^\infty \frac{2}{(2i)^2} \\
   &= 100 \cdot \sum_{i=1}^\infty \frac{1}{2i^2} \\
   &= \frac{100}{2} \cdot \frac{\pi^2}{6},
   \end{align*}
which is finite. So Plan B is better for immortals.
2. For the finite-lifetime version, over $n$ months, Plan A costs roughly
   \[
   \sum_{i=1}^n \frac{1}{i} \approx \ln n + \gamma.
   \]
   Plan B has a $\frac{2}{3 \cdot 4} = \frac{1}{6}$ chance of costing \$100 after the first four months, giving an expected cost of at least \$16. As long as you keep both plans for at least four months, then for $n \ll e^{16} \approx 8886110$ months $\approx 740509$ years, Plan A is better.

C.3.2 A randomly-indexed list

Consider the following data structure: We store elements in a sorted linked list, and as each element is inserted, with probability $p$ we also insert it in a balanced binary search tree data structure with guaranteed $O(\log n)$-time search and insert operations (say, a red-black tree). We can now search for an element $x$ by searching for the largest element $y$ in the tree less than or equal to $x$, then walking along the list starting from $y$ until we find $x$ (or some value bigger than $x$).

Recall that for a skip list with (small) parameter $p$, the expected number of pointers stored per element is $1 + O(p)$ and the expected cost of a search operation is $O\left(\frac{1}{p} \log n\right)$.

1. What are the corresponding asymptotic values for the expected pointers per element and the expected cost of a search for this new data structure, as a function of $p$ and $n$?

2. Suppose you are given $n$ in advance. Is it possible to choose $p$ based on $n$ for the tree-based data structure to get lower expected space overhead (expressed in terms of pointers per element) than with a skip list, while still getting $O(\log n)$ expected search time?

Solution

1. This is actually easier than the analysis for a skip list, because there is no recursion to deal with.

   A balanced binary search tree stores exactly 2 pointers per element, so with an expected $pn$ elements in the tree, we get $n + 2pn$ pointers total, or $1 + 2p$ pointers per element.

   For the search cost, work backwards from the target $x$ to argue that we traverse an expected $O(1/p)$ edges in the linked list before hitting a tree node. Searching the tree takes $O(\log n)$ time independent of $p$. This gives a cost of $O\left(\frac{1}{p} + \log n\right)$. (A code sketch of the structure follows.)
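Here is a minimal Python sketch of the structure (class and field names are mine); a sorted array searched with bisect stands in for the balanced BST, and the naive $O(n)$ list insertion is beside the point, since the problem is about search cost:

import bisect
import random

class HybridList:
    def __init__(self, p):
        self.p = p
        self.head = None     # list cell = [key, next_cell_or_None]
        self.keys = []       # keys promoted to the tree index
        self.cells = []      # cells[i] is the list cell for keys[i]

    def insert(self, key):
        # Insert into the sorted linked list.
        prev, cur = None, self.head
        while cur is not None and cur[0] < key:
            prev, cur = cur, cur[1]
        cell = [key, cur]
        if prev is None:
            self.head = cell
        else:
            prev[1] = cell
        # With probability p, also index the key in the "tree."
        if random.random() < self.p:
            i = bisect.bisect(self.keys, key)
            self.keys.insert(i, key)
            self.cells.insert(i, cell)

    def search(self, key):
        # Start from the largest indexed key <= key, or the head.
        i = bisect.bisect(self.keys, key)
        cur = self.cells[i - 1] if i > 0 else self.head
        while cur is not None and cur[0] < key:
            cur = cur[1]
        return cur is not None and cur[0] == key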

2. Yes. Let $p = \Theta(1/\log n)$. Then we use an expected $1 + O(1/\log n)$ pointers per element, while the search cost is $O\left(\frac{1}{1/\log n} + \log n\right) = O(\log n)$. In contrast, getting $1 + O(1/\log n)$ expected space with a skip list would also require setting $p = \Theta(1/\log n)$, but this would give an expected search cost of $\Theta(\log^2 n)$.

C.4 Assignment 4: due Thursday, 2019-10-31, at 23:00
C.4.1 A hash tree

Suppose we attempt to balance a binary search tree by inserting $n$ elements one at a time, using hashes of the values instead of the values themselves.

Specifically, let $M = \{1 \dots n^2\}$ be the set of possible hash values; let $U = \{1 \dots u\}$ be the universe of possible values, where $u \gg n^2$; and let $H$ be a 2-universal family of hash functions $h : U \to M$. After the adversary chooses $n$ distinct values $x_1, \dots, x_n$, the algorithm chooses a hash function $h \in H$ uniformly at random, and inserts $x_1$ through $x_n$ in order into a binary search tree with no rebalancing, where the key for each value $x_i$ is $h(x_i)$. In the case of duplicate keys (which shouldn't happen very often), assume that if we try to insert $x$ into a subtree with root $r$, and $h(x) = h(r)$, we just adopt some fixed rule like always inserting $x$ into the right subtree of $r$.

For any $n$ and $u \gg n^2$, prove or disprove: for any 2-universal family $H$ of hash functions, and any sequence of values $x_1, \dots, x_n$, the expected depth of a tree constructed by the above process is $O(\log n)$.

Solution

Disproof: We will construct a 2-universal family $H$ such that $h(x) = x$ for all $h \in H$ whenever $1 \leq x \leq m$, where $m = n^2$ is the size of $M$. We can then insert the values $x_i = i$ with corresponding hash values $h(x_i) = i$ and get a tree of depth exactly $n$.

Start with a strongly 2-universal family $H'$. For each $h'$ in $H'$, construct a corresponding $h$ according to the rule $h(x) = x$ when $1 \leq x \leq m$ and $h(x) = h'(x)$ otherwise. We claim that this gives a 2-universal family of hash functions.

Recall that $H$ is 2-universal if, for any $x \neq y$, $\Pr[h(x) = h(y)] \leq 1/m$ when $h$ is chosen uniformly from $H$. We consider three cases:

1. If $1 \leq x \leq m$ and $1 \leq y \leq m$, then $h(x) = x \neq y = h(y)$ always.

2. If $1 \leq x \leq m$ and $m < y$, then $\Pr[h(y) = h(x)] = \Pr[h'(y) = x] = 1/m$, since $h'(y)$ is equally likely to be any value by strong 2-universality. A symmetric argument works if $m < x$ and $1 \leq y \leq m$.

3. If $m < x$ and $m < y$, then $\Pr[h(x) = h(y)] = \Pr[h'(x) = h'(y)] = 1/m$.

In each case we have $\Pr[h(x) = h(y)] \leq 1/m$. So $H$ gives a 2-universal hash family that produces trees of depth $n$, much greater than $O(\log n)$.
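To make the construction concrete, here is a sketch in Python (names are mine). For simplicity it assumes the universe and the range are both $\mathbb{Z}_p$ for a prime $p \geq m$, so that $h'(x) = (ax + b) \bmod p$ with $a, b$ uniform is strongly 2-universal; this changes the range size but not the structure of the argument:

import random

def pinned_hash(m, p):
    # Draw h' from the strongly 2-universal family
    # h'(x) = (a*x + b) mod p, then pin h(x) = x on 1 <= x <= m.
    a, b = random.randrange(p), random.randrange(p)
    def h(x):
        return x if 1 <= x <= m else (a * x + b) % p
    return h

Inserting the keys $h(1), h(2), \dots, h(n)$ in order into an unbalanced BST then produces a path of depth $n$, for every $h$ in the family.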

C.4.2 Randomized robot rendezvous on a ring

Suppose we have a ring of size $nm$, consisting of positions $0, \dots, nm-1$, with $n$ robots starting at positions $X_{i0} = im$.

At each time step $t$, an adversary chooses one of the $n$ robots $i$, and this robot moves from position $X_{it}$ to $X_{i,t+1} = (X_{it} \pm 1) \bmod mn$, moving clockwise or counterclockwise with equal probability. If any other robot or robots $i'$ are at the same position $X_{i't} = X_{it}$, they also move to the same new position as robot $i$. In this way, the robots slowly coalesce until all robots are in the same position.

The process finishes at the first time $\tau$ where $X_{i\tau} = X_{j\tau}$ for all $i$ and $j$. What is $E[\tau]$?

(Assume that the adversary's choice of robot to move at each time $t$ doesn't involve predicting the future. Formally, this choice must be measurable with respect to $\mathcal{F}_t$, where $\mathcal{F}_t$ is the $\sigma$-algebra generated by all the positions of the robots at times up to and including $t$.)

Solution

We'll use a variant of the $X_t^2 - t$ martingale for random walks.

For each $i$ and $t$, let $Y_{it} = X_{i+1,t} - X_{it}$ be the size of the gap between robots $i+1$ and $i$ (wrapping around in the obvious way).

Suppose at some step $t$ the adversary chooses robot $j$ to move. Let $i$ and $k$ be the smallest and largest robot ids such that $X_{it} = X_{jt} = X_{kt}$ (again wrapping around in the obvious way). Then $Y_{i-1,t}$ and $Y_{kt}$ are the only gaps that change, and each rises or drops by 1 with equal probability. So $E\left[Y_{i-1,t+1}^2 \mid \mathcal{F}_t\right] = Y_{i-1,t}^2 + 1$ and $E\left[Y_{k,t+1}^2 \mid \mathcal{F}_t\right] = Y_{k,t}^2 + 1$. If we define $Z_t = \sum_{i=1}^n Y_{it}^2 - 2t$, then $\{Z_t\}$ is a martingale with respect to $\{\mathcal{F}_t\}$.

For any adversary strategy, starting from any configuration, there is a sequence of at most $n^2 m$ coin-flips that causes all robots to coalesce. So $E[\tau] \leq n^2 m \cdot 2^{n^2 m} < \infty$. We also have that $|Z_{t+1} - Z_t|$ is bounded (each step changes two squared gap lengths by at most $2nm + 1$ each, plus 2 for the time term), so we can apply the finite-expectation/bounded-increments case of Theorem 9.3.1 to show that $E[Z_\tau] = E[Z_0] = nm^2$. But at time $\tau$, all but one gap $Y_{i\tau}$ is zero, and the remaining gap has length $mn$. This gives $E[Z_\tau] = E\left[(mn)^2 - 2\tau\right]$, giving $E[\tau] = \frac{1}{2}\left(n^2m^2 - nm^2\right) = \binom{n}{2}m^2$.
The nice thing about this argument is that it instantly shows that the adversary's strategy doesn't affect the expected time until all robots coalesce, but the price is that the argument is somewhat indirect. For specific adversary strategies, it may be possible to show the coalescence time more directly.

For example, consider an adversary that always picks robot 0. Then we can break down the resulting process into a sequence of phases delimited by collisions between robot 0 and robots that have not yet been added to its entourage. The first such collision happens after robot 0 hits one of the two absorbing barriers at $-m$ and $+m$, which occurs in $m^2$ steps on average. At this point, we now have a new random walk with absorbing barriers at $-m$ and $+2m$ relative to the position of robot 0 (possibly after flipping the directions to get the signs right). This second random walk takes $2m^2$ steps on average to hit one of the barriers. Continuing in this way gives us a sequence of random walks with absorbing barriers at $-m$ and $+km$ for each $k \in \{1 \dots n-1\}$. The total time is thus $\sum_{k=1}^{n-1} km^2 = \frac{(n-1)n}{2} m^2 = \binom{n}{2} m^2$, as shown above for the general case.
C.5 Assignment 5: due Thursday, 2019-11-14, at 23:00
C.5.1 Non-exploding computers

Engineers from the Exploding Computer Company have devised a new scheme for managing the ring of overheating computers from Problem C.1.2. In the new scheme, we start with an $n$-bit vector $S^0 = 0^n$. Each 1 represents a computer that is turned on, and each 0 represents a computer that is turned off.¹

Call a state $S$ permissible if there is no $j$ such that $S_j = S_{j+1} = S_{j+2} = 1$, where the indices wrap around mod $n$.

At each step, some position $i$ is selected uniformly at random. If $S_i^t = 1$, $S_i^{t+1}$ is set to 0 with probability $p$. If $S_i^t = 0$, $S_i^{t+1}$ is set to 1 if this creates a permissible configuration, and stays 0 otherwise. For any position $j \neq i$, $S_j^{t+1} = S_j^t$.

¹The marketing department convinced the company to change the number of computers in the ring from $3n$ to $n$ in response to consumer complaints. If the engineers' scheme works, the next meeting of the marketing department will consider choosing a new name for the company.

1. Show that when $0 < p < 1$, the sequence of values $S^t$ forms a Markov chain with states consisting of all permissible sets, and that this chain has a unique stationary distribution $\pi$ in which $\pi(S) = \pi(T)$ whenever $|S| = |T|$, where $|S|$ is the number of 1 bits in $S$.

2. Let $\tau$ be the first time at which $|S^\tau| \geq n/2$. Show that for any $n$, there is a choice of $p \geq 0$ such that $E[\tau] = O(n \log n)$.

Solution

1. First observe that the update rule for generating $S^{t+1}$ depends only on the state $S^t$; this gives that $\{S^t\}$ is a Markov chain. It is irreducible, because there is always a nonzero-probability path from any permissible configuration $S$ to any permissible configuration $T$ that goes through the empty configuration $0^n$. It is aperiodic, because for any nonempty configuration $S^t$, there is a nonzero probability that $S^{t+1} = S^t$, since we can pick a position $i$ with $S_i^t = 1$ and then choose not to set $S_i^{t+1}$ to 0. So a unique stationary distribution $\pi$ exists.

   We can show that $\pi(S)$ depends only on $|S|$ by using the fact that the Markov chain is reversible. Let $S$ and $T$ be reachable states that differ only in position $i$. Suppose that $S_i = 0$ and $T_i = 1$. Then $p_{ST} = 1/n$ and $p_{TS} = p/n$. Let $c = \sum_S p^{-|S|}$, where $S$ ranges over all permissible states, and let $\pi_S = p^{-|S|}/c$. Pick some $S$ and $T$ as above and let $k = |S|$. Then
   \begin{align*}
   \pi_S p_{ST} &= \frac{p^{-k}}{c} \cdot \frac{1}{n} \\
   &= \frac{p^{-k-1}}{c} \cdot \frac{p}{n} \\
   &= \pi_T p_{TS}.
   \end{align*}
   So this particular choice of $\pi$ satisfies the detailed balance equations and is the stationary distribution of the chain. Since $\pi(S)$ only depends on $|S|$, it is immediate that $\pi(S) = \pi(T)$ when $|S| = |T|$.
2. For this part, we'll set $p = 0$, then walk directly up to a configuration with $|S^t| \geq n/2$ in $O(n \log n)$ steps on average. Setting $p = 0$ means that the chain is no longer irreducible, but we won't care about this since we are not worried about convergence.

   Let $S$ be a permissible configuration with $|S| = k$. We want to get a lower bound on the number of positions $i$ such that $S[i/1]$ (the configuration obtained by setting $S_i = 1$) is also permissible. There are $n - k$ zeroes in $S$; each such 0 can be switched to 1 unless it would create three ones in a row. We will show that at most $k$ of the $n-k$ zeroes cannot be switched, by constructing an injection $f$ from the set of unswitchable 0 positions to the set of 1 positions.

   Let $i$ be a position such that $S_i = 0$ but $S[i/1]$ is not permissible. Then one of the following cases holds:

   (a) $S_{i-2} = S_{i-1} = 1$. Let $f(i) = i - 1$.

   (b) $S_{i-1} = S_{i+1} = 1$. Let $f(i) = i - 1$.

   (c) $S_{i+1} = S_{i+2} = 1$. Let $f(i) = i + 1$.

   To show that this is an injection, observe that the only way for $f(i) = f(j)$ to occur with $i \neq j$ is if $f(i) = i + 1$ and $f(j) = j - 1$ (or vice versa). But if $f(i) = i + 1$, then $S_j = S_{i+2} = 1$, and so $j$ is not an unswitchable zero.

   So out of $n - k$ zeroes we have at most $k$ unswitchable zeroes, giving at least $n - 2k$ switchable zeroes. The expected waiting time to hit one of these at least $n - 2k$ switchable zeroes is at most $\frac{n}{n-2k}$, giving a total expected waiting time of at most $\sum_{k=0}^{\lfloor (n-1)/2 \rfloor} \frac{n}{n-2k} \leq nH_n = O(n \log n)$.

   I suspect the correct expected waiting time is in fact $O(n)$, because (a) the pattern 11011 includes at least one 1 that is not the image of any unswitchable zero; (b) after, say, $n/8$ steps it is likely (via linearity of expectation and McDiarmid's inequality) that there are a linear number $\ell$ of such patterns in $S$; so that (c) even as $k$ gets close to $n/2$, the number of switchable zeroes is at least $n - 2k + \ell = \Theta(n)$, giving constant expected waiting times to find the next switchable zero. But actually proving (b) given all the dependencies flying around looks at least mildly painful, so we may want to just hope that an $O(n \log n)$ bound is good enough.

C.5.2 A wordy walk

Consider the following random walk on strings over some alphabet $\Sigma$. At each step:

1. With probability $p$, if $|X^t| > 1$, delete a character from a position chosen uniformly at random from all positions in $X^t$. If $|X^t| = 1$, do nothing.

2. With probability $q$, insert a new character chosen uniformly at random from $\Sigma$ into $X^t$. The new character will be inserted after the first $i$ characters in $X^t$, where $i$ is chosen uniformly at random from $\{0, \dots, |X^t|\}$.

3. With probability $r$, choose one of the character positions in $X^t$ uniformly at random and replace the character in that position with a character chosen uniformly at random from $\Sigma$.
character chosen uniformly at random from Σ.
For example, here is the result of running the above process for a few
steps with Σ the lowercase Latin alphabet, p = 1/2, q = 1/4, and r =
1/4, starting from the string markov: markov markzv marzv carzv caurzv
cauzv cauav cauav cavav avav.
Assume $p + q + r = 1$, $p > q > 0$, and $r > 0$.

Suppose that we start with $|X^0| = n$. Show that for any constants $p, q, r$ satisfying the above constraints, and any constant $\epsilon > 0$, there is a time $t = O(n)$ such that $d_{TV}(X^t, \pi) \leq \epsilon$, where $\pi$ is a stationary distribution of this process.
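Here is a one-step simulation sketch of the process (function name is mine; $r = 1 - p - q$ is implicit), which can be used to watch the length settle into the geometric stationary distribution derived below:

import random
import string

def step(x, p, q, sigma=string.ascii_lowercase):
    # One step of the walk: delete w.p. p (no-op at length 1),
    # insert w.p. q, otherwise replace a uniform position.
    r = random.random()
    if r < p:
        if len(x) > 1:
            i = random.randrange(len(x))
            x = x[:i] + x[i + 1:]
    elif r < p + q:
        i = random.randrange(len(x) + 1)
        x = x[:i] + random.choice(sigma) + x[i:]
    else:
        i = random.randrange(len(x))
        x = x[:i] + random.choice(sigma) + x[i + 1:]
    return x

Iterating step starting from "markov" with $p = 1/2$ and $q = 1/4$ reproduces walks like the example above.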

Solution

We can simplify things a bit by writing $S^t$ for the string and tracking only its length $X^t = |S^t|$. Since changes in the length of $S^t$ don't depend on the characters in $S^t$, $\{X^t\}$ is also a Markov chain, with transition probabilities
\[
p_{ij} = \begin{cases}
p & j = i - 1 \\
q & j = i + 1 \\
r & j = i,
\end{cases}
\]
except at the boundary, where $p_{11} = p + r$ since deletions do nothing at length 1.

Applying the detailed balance equations gives a stationary distribution $\rho$ for $X^t$ of $\rho_i = (q/p)^{i-1}(1 - q/p)$. By symmetry, this gives a stationary distribution $\pi_S = (q/p)^{|S|-1}(1 - q/p)\,|\Sigma|^{-|S|}$ over strings. For $X$ with distribution $\rho$, we have $E[X]$ finite when $p > q$, since $E[X] = (1 - q/p)\sum_{n=1}^\infty n(q/p)^{n-1}$ converges.
We'll show convergence for $S^t$ starting from a particular $S^0$ using a coupling with a stationary copy of the chain $T^t$. Our strategy will go as follows:

1. Let $m = |T^0|$. We will argue that a coupling that applies the same operation (delete, insert, change) to both copies of the chain reaches a configuration where $|S^t| = |T^t| = 1$ in $O(\max(n, m))$ expected steps.

2. From a state with $|S^t| = |T^t| = 1$, with probability $r > 0$, both chains replace their single character with a new one. Make this the same character in both copies, and we get $S^t = T^t$.

3. If this doesn't work, repeat the argument from the start. Note that we now have $|S^t| = |T^t| \leq 2$, so after $O(1)$ expected steps we are back at $|S^t| = |T^t| = 1$ again. The waiting time for convergence is now geometric and has expectation $O(1)$.
Let us now justify the claim that the waiting time to reach $|S^t| = |T^t| = 1$ is $O(\max(n, m))$. This is just the usual argument for a biased random walk with one absorbing barrier. Let $Z^t = \max\left(|S^t|, |T^t|\right)$. Let $\tau$ be the stopping time at which $Z^t$ is first equal to 1. Observe that for $t < \tau$, $E\left[Z^{t+1} \mid Z^0 \dots Z^t\right] = Z^t + q - p$, so $Q^t = Z^t + (p-q)t$ is a martingale with bounded increments, and we can apply OST to get that $E[Q^\tau] = 1 + (p-q)E[\tau] = E[Q^0] = \max(n, m)$. This gives $E[\tau] = \frac{\max(n, m) - 1}{p - q}$.

We now have a coupling that gives an expected coalescence time of $O(\max(n, m))$. If $\max(n, m) = n$, we are done. But it could be that $\max(n, m) = m$. Conditioning on $m > n$, we are truncating the distribution of $m = |T^0|$ to a geometric distribution starting at $n+1$; this has expected value $n + O(1)$, giving $E[\max(n, m)] = n + O(1)$ in either case. So the coalescence time is $O(n + O(1)) = O(n)$.
Since $\epsilon$ is a constant, applying the Coupling Lemma and Markov's inequality to drive the total variation distance below $\epsilon$ costs only a constant factor, which is absorbed by the $O$.
C.6 Assignment 6: due Monday, 2019-12-09, at 23:00
C.6.1 Randomized colorings

Suppose we want to use a constant number $k$ of colors to color the vertices of a graph $G$ with $n$ vertices and $m$ edges so that we minimize the number of monochromatic edges. A particularly simple approach is to color the vertices independently and uniformly at random. This gives $m/k$ monochromatic edges on average. But we'd like to derandomize the algorithm to get at most $m/k$ monochromatic edges always.

There are two standard ways to do this:

1. Replace the independent random colors with pairwise-independent random colors and then consider all possible random choices. Show that we can do this by generating $n$ pairwise-independent colors from a small number of random values, and determine the asymptotic time complexity, as a function of $n$, $m$, and $k$, of generating and testing all the resulting colorings to find the best one.

2. Apply the method of conditional probabilities and assign colors one at a time to minimize the expected number of monochromatic edges conditioned on the assignment so far. Show that we can do so in this case, and compute the asymptotic time complexity, as a function of $n$, $m$, and $k$, of the resulting deterministic algorithm.

Solution

1. The obvious way to generate $n$ pairwise-independent random values in $\{0, \dots, k-1\}$ is to apply the subset-sum construction described in §5.1.2.1. Let $\ell = \lceil \lg(n+1) \rceil$, assign a distinct nonempty subset $S_v$ of $\{1, \dots, \ell\}$ to each vertex $v$, and let $X_v = \sum_{i \in S_v} r_i$, where $r_1, \dots, r_\ell$ are independent random values chosen uniformly from $\mathbb{Z}_k$. Then the $X_v$ are pairwise independent, giving a probability of $1/k$ that any particular edge is monochromatic. Since this gives $m/k$ monochromatic edges on average, enumerating all choices of $r_1, \dots, r_\ell$ will find a particular coloring that gets at most $m/k$ monochromatic edges.

   There are $k^\ell = k^{\lceil \lg(n+1) \rceil}$ possible choices. This is bounded by $k^{2+\lg n} = O\left(k^2 n^{\lg k}\right) = O\left(n^{\lg k}\right)$, which is polynomial in $n$.

   Generating each coloring takes $O(n \log n)$ time, and testing the edges takes $O(m)$ time, for a total time of $O\left(n^{\lg k}(m + n \log n)\right)$.

   With some data structure tinkering we can reduce the complexity a bit. We can enumerate all choices of $r_1, \dots, r_\ell$ with an amortized $O(1)$ values changing at each iteration. Each change to some $r_i$ still affects about half the colors, but we can update each affected $X_v$ dynamically by subtracting off the old value of $r_i$ and adding the new one, reducing the amortized cost per iteration from $O(n \log n)$ to $O(n)$. We still end up checking most of the edges, so the cost will now be $O\left(n^{\lg k}(m + n)\right)$. I don't see an obvious way to do better.

2. For this part we let $X_1, \dots, X_n$ be the colors assigned to the vertices (ordering them arbitrarily) and let $f(X_1, \dots, X_n)$ be the number of monochromatic edges given a particular coloring. Having fixed $X_1, \dots, X_i$, we want to choose $X_{i+1}$ to minimize $E[f \mid X_1, \dots, X_{i+1}]$.

   Observe that $f(X_1, \dots, X_n) = \sum_{uv \in E} [X_u = X_v]$, so $E[f \mid X_1, \dots, X_i] = \sum_{uv \in E} \Pr[X_u = X_v \mid X_1, \dots, X_i]$. For each $uv$, if $i+1 \not\in \{u, v\}$, then $[X_u = X_v]$ is independent of $X_{i+1}$, giving $\Pr[X_u = X_v \mid X_1, \dots, X_{i+1}] = \Pr[X_u = X_v \mid X_1, \dots, X_i]$. So we only care about the cases where $i+1 \in \{u, v\}$. Assume without loss of generality that $i+1 = u$. If $v \leq i$, then $[X_u = X_v]$ is either 1 or 0 depending on whether we set $X_u = X_v$ or not. If $v > i+1$, then $X_v$ is independent of $X_{i+1}$, so $\Pr[X_u = X_v \mid X_1, \dots, X_{i+1}] = \Pr[X_u = X_v \mid X_1, \dots, X_i]$.

   This means that to minimize $E[f \mid X_1, \dots, X_{i+1}]$, the only edges we need to consider are those edges $uv$ where one endpoint is $i+1$ and the other endpoint is in $\{1, \dots, i\}$. Minimizing the conditional expectation of $f$ turns into the obvious greedy algorithm of picking a color for $v_{i+1}$ that is least represented among its already-colored neighbors, which we can do in $O(1 + d(v_{i+1}))$ time. Adding up over all vertices $v$ gives $O\left(\sum_{v \in V} (1 + d(v))\right) = O(n + m)$ time.
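Here is a short Python sketch of the resulting deterministic algorithm (names are mine); by the argument above, the coloring it returns has at most $m/k$ monochromatic edges:

def derandomized_coloring(n, edges, k):
    # Method of conditional probabilities: color vertices 0..n-1 in
    # order, giving each a color least used among its already-colored
    # neighbors.
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * n
    for v in range(n):
        counts = [0] * k
        for u in adj[v]:
            if u < v:                  # u is already colored
                counts[color[u]] += 1
        color[v] = counts.index(min(counts))
    return color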

C.6.2 No long paths

Given a directed graph $G$ with $n$ vertices and maximum out-degree $d$, we would like to find an induced subgraph $G'$ with at least $(1-\delta)n/(d+1)$ vertices, such that $G'$ has no simple paths² of length $\lceil 2d \ln n \rceil$ or greater.

Find an algorithm that does this in polynomial expected time for any constant $d > 0$ and $\delta > 0$, and sufficiently large $n$.

²For the purposes of this problem, a simple path of length $\ell$ is a directed path with $\ell$ edges and no repeated vertices.

Solution

We'll put each vertex $v$ into $G'$ with independent probability $p = 1/(d+1)$. We now have two possible bad outcomes:

1. We may get a path of length $\lceil 2d \ln n \rceil$ in $G'$.

2. We may get fewer than $(1-\delta)n/(d+1)$ vertices in $G'$.

We'll start by showing that the sum of the probabilities of these bad outcomes is not too big.

Let $\ell = \lceil 2d \ln n \rceil$. Observe that we can describe a path of length $\ell$ in $G$ by specifying its starting vertex ($n$ choices) and a sequence of $\ell$ edges each leaving the last vertex added so far ($d$ choices each). This gives at most $nd^\ell$ paths of length $\ell$ in $G$. Each such path appears in $G'$ with probability exactly $p^{\ell+1}$. Using the usual probabilistic-method argument, we get
\begin{align*}
\Pr\left[G' \text{ includes at least one path of length } \ell\right] &\leq E\left[\text{number of paths of length } \ell \text{ in } G'\right] \\
&\leq nd^\ell p^{\ell+1} \\
&= np(pd)^\ell \\
&= n \cdot \frac{1}{d+1}\left(\frac{d}{d+1}\right)^\ell \\
&= n \cdot \frac{1}{d+1}\left(1 - \frac{1}{d+1}\right)^{\lceil 2d \ln n \rceil} \\
&\leq n \cdot \frac{1}{d+1}\left(1 - \frac{1}{d+1}\right)^{2d \ln n} \\
&\leq \frac{1}{d+1} \cdot n e^{-\ln n} \\
&= \frac{1}{d+1} \\
&\leq \frac{1}{2}.
\end{align*}
For the second-to-last inequality, we use the inequality $\left(1 - \frac{1}{d+1}\right)^{2d} \leq e^{-1}$ for $d \geq 1$. This is easiest to demonstrate numerically, but if we had to prove it we could argue that it holds when $d = 1$ (since $1/4 < 1/e$) and that $\left(1 - \frac{1}{d+1}\right)^{2d}$ is a decreasing function of $d$.
For getting too few vertices, use Chernoff bounds. Let $X$ be the number of vertices in $G'$. Let $\mu = E[X] = \frac{n}{d+1}$. Then
\begin{align*}
\Pr[X < (1-\delta)n/(d+1)] &= \Pr[X < (1-\delta)\mu] \\
&\leq e^{-\mu\delta^2/2} \\
&= e^{-n\delta^2/(2(d+1))} \\
&< 1/4,
\end{align*}
for any fixed $d > 0$, $\delta > 0$, and sufficiently large $n$.


Adding these error probabilities together gives a total probability of
failure of at most 3/4. We can select the vertices in G0 in time linear in n.
So if we can detect failure in polynomial time and try again, we only have to
run at most 4 poly-time attempts on average until we win, giving polynomial
expected cost.
So now let us detect failure. There are at most $nd^\ell$ paths of length $\ell$ in $G$. We can enumerate all of them and check whether each is in $G'$ in time
\begin{align*}
O\left(nd^\ell \ell\right) &= O\left(n\ell d^{2d\ln n + 1}\right) \\
&= O\left(n\ell d\, e^{2d \ln n \ln d}\right) \\
&= O\left(n^{2d\ln d + 1} d^2 \log n\right),
\end{align*}
which is polynomial in $n$ for fixed $d$.

C.7 Final exam

Write your answers in the blue book(s). Justify your answers. Work alone. Do not use any notes or books.

There are three problems on this exam, each worth 20 points, for a total of 60 points. You have approximately three hours to complete this exam.

C.7.1 A uniform ring

Suppose we have a ring of $n$ nodes, each of which holds a bit. At each step, we pick a node uniformly at random and have it copy the bit of its left neighbor. For example, if we start in state 010, then picking the rightmost node will send us to state 011, and if we then pick the leftmost node we will reach state 111. From this last state no further changes are possible.

Suppose $n$ is even and we start with a state where positions 1 through $n/2$ are all 0 and positions $n/2 + 1$ through $n$ are all 1. What is the expected number of steps, as an asymptotic function of $n$, to reach a state where all bits are the same?

Solution

First observe that any transition preserves the property that our state looks like a string of the form $0^k 1^{n-k}$ (up to rotation), where the zeroes and the ones each appear in consecutive positions around the ring.

Let $X_t$ be the number of zeroes after $t$ steps. We have $X_0 = n/2$, and our possible transitions when $0 < X_t < n$ are:

1. With probability $1/n$, we choose the leftmost zero and replace it with a one. This makes $X_{t+1} = X_t - 1$.

2. With probability $1/n$, we choose the leftmost one and replace it with a zero. This makes $X_{t+1} = X_t + 1$.

3. With the remaining probability $1 - 2/n$, we choose some node whose bit is equal to that of its left neighbor. This makes $X_{t+1} = X_t$.

When $X_t = 0$ or $X_t = n$, we have converged: no further changes are possible.

Looking at the transitions for $X_t$, we see that it is doing a very lazy random walk with absorbing barriers at 0 and $n$. If we remember that a non-lazy random walk converges in $O(n^2)$ steps on average, we can argue that the lazy version converges in $O(n^3)$ steps on average, because the expected waiting time between changes in $X_t$ is $n/2 = O(n)$, and we can multiply this by the number of changes because of Wald's equation.
If we want to be very formal about this, we can also calculate the exact expected convergence time using the Optional Stopping Theorem. Let's guess that $Y_t = X_t^2 - ct$ is a martingale for some coefficient $c$. To find $c$, observe that we need
\begin{align*}
E\left[\left(X_{t+1}^2 - c(t+1)\right) - \left(X_t^2 - ct\right) \,\middle|\, X_t\right] &= E\left[X_{t+1}^2 - X_t^2 \,\middle|\, X_t\right] - c \\
&= \frac{1}{n}(2X_t + 1) + \frac{1}{n}(-2X_t + 1) - c \\
&= \frac{2}{n} - c \\
&= 0.
\end{align*}
So $\{Y_t\}$ is a martingale when $c = \frac{2}{n}$.

Let $\tau$ be the stopping time at which $X_\tau$ first equals 0 or $n$. The usual argument shows that $E[\tau]$ is finite, and $\{Y_t\}$ has bounded increments because $Y_t$ never changes by more than $2X_t + 1 \leq 2n + 1$. So OST applies and $E[Y_\tau] = E[Y_0] = (n/2)^2$. But we can expand
\begin{align*}
E[Y_\tau] &= \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot n^2 - \frac{2}{n} \cdot E[\tau] \\
&= \frac{n^2}{2} - \frac{2}{n} \cdot E[\tau].
\end{align*}
This gives $\frac{n^2}{4} = \frac{n^2}{2} - \frac{2}{n} E[\tau]$. Solving for $E[\tau]$ gives $E[\tau] = \frac{n^3}{8} = O(n^3)$.
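A quick simulation sketch (names are mine) agrees: the empirical mean of $\tau$ approaches $n^3/8$ (e.g., about 1000 for $n = 20$):

import random

def coalescence_time(n):
    # Track only the number of zeroes X_t, using the transition
    # probabilities above: -1 w.p. 1/n, +1 w.p. 1/n, else no change.
    x, t = n // 2, 0
    while 0 < x < n:
        r = random.randrange(n)
        if r == 0:
            x -= 1
        elif r == 1:
            x += 1
        t += 1
    return t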

C.7.2 Forbidden runs


Given a sequence of coin-flips, a run is a maximal subsequence of either all
heads or all tails. The length of a run is the number of flips in the run. For
example, the sequence HHHTTHTHH has five runs: a run HHH of length 3,
a run TT of length 2, a run H of length 1, a run T of length 1, and a run HH
of length 2.
Suppose we are attempting to transmit a sequence of n fair, independent
coin-flips across a communication channel that uses a run of length k to signal
an emergency condition. To avoid false alarms, we must encode our sequence
to include no runs of length k. We do this by encoding any run of length
` ≥ k as a run of length ` + 1 of the same value. For example, when k = 2, we
would encode the previous example HHHTTHTHH as HHHHTTTHTHHH.
As a function of n and k, what is the expected length of the encoded
sequence?

Solution

Let $i$ range from 1 to $n$, and let $X_i$ be the $i$-th coin-flip in the sequence. Let $R_i$ be the indicator variable for the event that there is a run of length $k$ or more starting at position $i$. Then the expected number of runs of length $k$ or more is given by $\sum_{i=1}^n E[R_i]$, by linearity of expectation, and because each such run adds one to the length of the encoded sequence, the expected encoded length is $n + \sum_{i=1}^n E[R_i]$.

To compute $E[R_i]$, we consider three cases:
To compute E [Ri ], we consider three cases:

1. If i = 1, then Ri = 1 when X1 , . . . , Xk are all H or all T. This gives


E [Ri ] = 2 · 2−k = 2−k+1 .
APPENDIX C. SAMPLE ASSIGNMENTS FROM FALL 2019 362

2. If 1 < i ≤ n − k + 1, then Ri = 1 when X1 , . . . , Xk are all H or all T,


and Xi−1 is not this common value (because otherwise the run starts
earlier). This gives E [Ri ] = 2 · 2−k−1 = 2−k .

3. If i > n − k + 1, then Ri = 0, because there aren’t enough coins left to


get a run of length k.

Summing over all the cases gives
\begin{align*}
\sum_{i=1}^n E[R_i] &= 2^{-k+1} + \sum_{i=2}^{n-k+1} 2^{-k} \\
&= 2 \cdot 2^{-k} + (n-k) \cdot 2^{-k} \\
&= (n - k + 2) \cdot 2^{-k}.
\end{align*}
For the expected encoded sequence length, we must add back $n$ to get $n + (n - k + 2) \cdot 2^{-k}$.

C.7.3 A derandomized balancing scheme


Suppose we construct an n × n matrix A where the elements Aij are n2
independent fair ±1 random variables. Then Hoeffding’s inequality tells us
that  
n X n
!
X t2
Pr  Aij ≥ t ≤ 2 exp − 2 .
i=1 j=1
2n

Unfortunately this requires using n2 random bits.


As an alternative, suppose we use 2n random bits to generate two se-
quences X and Y of n independent fair ±1 random variables, and define
Bij = Xi Yj .
Show that
 
n X
n
t
X  
Pr  Bij ≥ t ≤ 4 exp −
 .
i=1 j=1
2n
APPENDIX C. SAMPLE ASSIGNMENTS FROM FALL 2019 363

Solution
Start by factoring
n X
X n n X
X n
Bij = Xi Yj
i=1 j=1 i=1 j=1
n
! n 
X X
= Xi  Yj .
i=1 j=1

Each factor is a sum of n independent ±1 random variables with expec-


tation 0, so Hoeffding’s inequality applies, giving
" n # !
X s2
Pr Xi ≥ s ≤ 2 exp −
i=1
2n

and
 
n
!
X s2
Pr  Yj ≥ s ≤ 2 exp − .
j=1
2n

Pn Pn Pn Pn
If neither of these events holds, then i=1 j=1 Bij =| i=1 Xi |· j=1 Yj <
s2 .
So by the union bound,
   
n X
n
" n # n
X X X
Pr  Bij ≥ s2  ≤ Pr Xi ≥ s + Pr  Yj ≥ s 
i=1 j=1 i=1 j=1
!
s2
≤ 4 exp − .
2n

Now substitute t for s2 .


Appendix D

Sample assignments from


Fall 2016

D.1 Assignment 1: due Sunday, 2016-09-18, at


17:00
Bureaucratic part
Send me email! My address is james.aspnes@gmail.com.
In your message, include:

1. Your name.

2. Your status: whether you are an undergraduate, grad student, auditor,


etc.

3. Whether you are taking the course as CPSC 469 or CPSC 569.

4. Anything else you’d like to say.

(You will not be graded on the bureaucratic part, but you should do it
anyway.)

D.1.1 Bubble sort


Algorithm D.1 implements one pass of the (generally pretty slow) bubble
sort algorithm. This involves comparing each array location with its succes-
sor, and swapping the elements if they are out of order. The full bubble sort

364
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 365

1 procedure BubbleSortOnePass(A, n)
2 for i ← 1 to n − 1 do
3 if A[i] > A[i + 1] then
4 Swap A[i] and A[i + 1]

Algorithm D.1: One pass of bubble sort

algorithm repeats this loop until the array is sorted, but here we just do it
once.
Suppose that A starts out as a uniform random permutation of distinct
elements. As a function of n, what is the exact expected number of swaps
performed by Algorithm D.1?

Solution
The answer is n − Hn , where Hn = ni=1 1i is the n-th harmonic number.
P

There are a couple of ways to prove this. Below, we let Ai represent the
original contents of A[i], before doing any swaps. In each case, we will use the
fact that after i iterations of the loop, A[i] contains the largest of Ai , . . . , Ai ;
this is easily proved by induction on i.

• Let Xij be the indicator variable for the event that Ai is eventually
swapped with Aj . For this to occur, Ai must be bigger than Aj , and
must be present in A[j −1] after j −1 passes through the loop. This hap-
pens if and only if Ai is the largest value in A1 , . . . , Aj . Because these
values are drawn from a uniform random permutation, by symmetry
Ai is largest with probability exactly 1/j. So E [Xij ] = 1/j.
Now sum Xij over all pairs i < j. It is easiest to do this by summing
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 366

over j first:
 
X X
E  Xij  = E [Xij ]
i<j i<j
n j−1
X X
= E [Xij ]
j=2 i=1
n j−1
X X1
=
j=2 i=1
j
n
X j−1
=
j=2
j
n
X j−1
=
j=1
j
n
X 1
= (1 − )
j=1
j
n
X 1
=n−
j=1
j
= n − Hn .

• Alternatively, let’s count how many values are not swapped from A[i]
to A[i − 1]. We can then subtract from n to get the number that are.
Let Yi be the indicator variable for the event that Ai is not swapped
into A[i − 1]. This occurs if, when testing A[i − 1] against A[i], A[i] is
larger. Since we know that at this point A[i − 1] is the largest value
among A1 , . . . , Ai−1 , Yi = 1 if and only if Ai is greater than all of
A1 , . . . , Ai−1 , or equivalently if Ai is the largest value in A1 , . . . , Ai .
Again by symmetry we have E [Yi ] = 1i , and summing over all i gives
an expected Hn values that are not swapped down. So there are n − Hn
values on average that are swapped down, which also gives the expected
number of swaps.

D.1.2 Finding seats


A huge lecture class with n students meets in a room with m ≥ n seats.
Because the room is dark, loud, and confusing, the students adopt the
following seat-finding algorithm:
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 367

1. The students enter the room one at a time. No student enters until
the previous student has found a seat.

2. To find a seat, a student chooses one of the m seats uniformly at random.


If the seat is occupied, the student again picks a seat uniformly at
random, continuing until they either find a seat or have made k attempts
that led them to an occupied seat.

3. If a student fails to find a seat in k attempts, they contact an Under-


graduate Seating Assistant, who leads them to an unoccupied seat.

Give an asymptotic (big-Θ) expression for the number of students who


are helped by Undergraduate Seating Assistants, as a function of n, m, and
k, where k is a constant independent of n and m. As with any asymptotic
formula, this expression should work for large n and m; and should be as
simple as possible.1

Solution
Define the i-th student to be the student who takes their seat after i students
have already sat (note that we are counting from zero here, which makes
things a little easier). Let Xi be the indicator variable for the even that the
i-th student does not find a seat on their own and needs help from a USA.
Each attempt to find a seat fails with probability i/m. Since each attempt is
independent of the others, the probability that all k attempts fail is (i/m)k .
The number of students who need help is n−1
P
i=0 Xi , so the expected
number is
"n−1 # n−1
X X
E Xi = E [Xi ]
i=0 i=0
n−1
X
= (i/m)k
i=0
n−1
= m−k
X
ik .
i=0

Unfortunately there is no simple expression for ik that works for all


P

k. However, we can approximate it from above and below using integrals.


1
An earlier version of this problem allowed k to grow with n, which causes some trouble
with the asymptotics.
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 368

From below, we have


n−1
X n−1
X
ik = ik
i=0 i=1
Z n−1
≥ xk dx
0
n−1
1
 
= xk+1
k+1 0
1
= (n − 1)k+1
k+1
1

=Ω nk+1 .
k+1
From above, an almost-identical computation gives
n−1
X Z n
ik ≤ xk dx
i=0 0
n
1

= xk+1
k+1 0
= f rac1k + 1n( k + 1)
1
= O( nk+1 ).
k+1
 
1 k+1
Combining the two cases gives that the sum is Θ k+1 n . We must
also include m−k , giving the full answer
1
 
k+1 −k
Θ n m .
k+1
k
 
I like writing this as Θ n (n/m)
k+1 to emphasize that the cost per student
(n/m)k
 
is Θ k+1, but there are many different ways to write the same thing.
Note that since we are dealing with asymptotics, we could also replace the
1 1
k+1 with just k and still be correct, but that seems less informative somehow.

D.2 Assignment 2: due Thursday, 2016-09-29, at


23:00
D.2.1 Technical analysis
It is well known that many human endeavors—including sports reporting,
financial analysis, and Dungeons and Dragons—involve building narra-
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 369

tives on top of the output of a weighted random number generator [Mun11],


a process that is now often automated [Hol16]. For this problem, we are
going to consider a simplified model of automated financial analysis.
Let us model the changes in a stock price over n days using a sequence of
independent ±1 random variables X1 , X2 , . . . , Xn , where each Xi is ±1 with
probability 1/2. Let Si = ij=1 Xj , where i ∈ {0, . . . , n}, be the price after i
P

days.
Our automated financial reporter will declare a dead cat bounce if a
stock falls in price for two days in a row, followed by rising in price, followed
by falling in price again. Formally, a dead cat bounce occurs on day i if
i ≥ 4 and Si−4 > Si−3 > Si−2 < Si−1 > Si . Let D be the number of dead
cat bounces over the n days.

1. What is E [D]?

2. What is Var [D]?

3. Our investors will shut down our system if it doesn’t declare at least
one dead cat bounce during the first n days. What upper bound you
can get on Pr [D = 0] using Chebyshev’s inequality?

4. Show Pr [D = 0] is in fact exponentially small in n.

Note added 2016-09-28: It’s OK if your solutions only work for sufficiently
large n. This should save some time dealing with weird corner cases when n
is small.

Solution
1. Let Di be the indicator variable for the event that a dead cat bounce
occurs on day i. Let p = E [Di ] = Pr [Di = 1]. Then
1
p = Pr [Xi−3 = −1 ∧ Xi−2 = −1 ∧ Xi−1=+1 ∧ Xi = −1] = ,
16
since the Xi are independent.
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 370

Then
" n #
X
E [D] = E Di
i=4
n
X
= E [Di ]
i=4
n
X 1
=
i=4
16
n−3
= ,
16
assuming n is at least 3.
15
2. For variance, we can’t just sum up Var [Di ] = p(1 − p) = 256 , because
the Di are not independent. Instead, we have to look at covariance.
Each Di depends on getting a particular sequence of four values for
Xi−3 , Xi−2 , Xi−1 , and Xi . If we consider how Di overlaps with Dj for
j > i, we get these cases:
Variable Pattern Correlation
Di --+-
Di+1 --+- inconsistent
Di+2 --+- inconsistent
Di+3 --+- overlap in one place
For larger j, there is no overlap, so Di and Dj are independent for
j ≥ 4, giving Cov [Di , Dj ] = 0 for these cases.
Because Di can’t occur with Di+1 or Di+2 , when j ∈ {i + 1, i + 2} we
have

Cov [Di , Dj ] = E [Di Dj ] − E [Di ] E [Dj ] = 0 − (1/16)2 = −1/256.

When j = i + 3, the probability that both Di and Dj occur is 1/27 ,


since we only need to get 7 coin-flips right. This gives

Cov [Di , Di+3 ] = E [Di Di+3 ]−E [Di ] E [Di+3 ] = 1/128−(1/16)2 = 1/256.
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 371

Now apply (5.1.5):


" n #
X
Var [D] = Var Di
i=4
n
X n−1
X n−2
X n−3
X
= Var [Di ] + 2 Cov [Di , Di+1 ] + 2 Cov [Di , Di+2 ] + 2 Cov [Di , Di+3 ]
i=4 i=4 i=4 i=4
15 −1 −1 1
= (n − 3) + 2(n − 4) + 2(n − 5) + 2(n − 6)
256 256 256 256
13n − 39
= ,
256
where to avoid special cases, we assume n is at least 6.

3. At this point we are just doing algebra. For n ≥ 6, we have:

Pr [D = 0] ≤ Pr [|D − E [D]| ≥ E [D]]


n−3
 
= Pr |D − E [D]| ≥
16
(13n − 39)/256

((n − 3)/16)2
13n − 39
=
(n − 3)2
13
= .
n−3
This is Θ(1/n), which is not an especially strong bound.

4. Here is an upper bound that avoids the mess of computing the exact
probability, but is still exponentially small in n. Consider the events
[D4i = 1] for i ∈ {1, 2, . . . , bn/4c}. These events are independent
since they are functions of non-overlapping sequences of increments.
So we can compute Pr [D = 0] ≤ Pr [D4i = 0, ∀i ∈ {1, . . . , bn/4c}] =
(15/16)bn/4c .
This expression is a little awkward, so if we want to get an asymptotic
estimate we can simplify it using 1 + x ≤ ex to get (15/16)bn/4c ≤
exp(−(1/16)bn/4c) = O(exp(−n/64)).
With a better analysis, we should be able to improve the constant in
the exponent; or, if we are real fanatics, calculate Pr [D = 0] exactly.
But this answer is good enough given what the problems asks for.
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 372

D.2.2 Faulty comparisons


Suppose we want to do a comparison-based sort of an array of n distinct
elements, but an adversary has sabotaged our comparison subroutine. Specif-
ically, the adversary chooses a sequence of times t1 , t2 , . . . , tn , such that for
each i, the ti -th comparison we do returns the wrong answer. Fortunately,
this can only happen n times. In addition, the adversary is oblivious: thought
it can examine the algorithm before choosing the input and the times ti , it
cannot predict any random choices made by the algorithm.
A simple deterministic work-around is to replace each comparison in our
favorite Θ(n log n)-comparison sorting algorithm with 2n + 1 comparisons,
and take the majority value. But this gives us a Θ(n2 log n)-comparison
algorithm. We would like to do better using randomization.
Show that in this model, for every fixed c > 0 and sufficiently large n, it
is possible to sort correctly with probability at least 1 − n−c using O(n log2 n)
comparisons.

Solution
We’ll adapt QuickSort. If the bad comparisons occurred randomly, we could
just take the majority of Θ(log n) faulty comparisons to simulate a non-faulty
comparison with high probability. But the adversary is watching, so if we do
these comparisons at predictable times, it could hit all Θ(log n) of them and
stay well within its fault budget. So we are going to have to be more sneaky.
Here’s this idea: Suppose we have a list of n comparisons hx1 , y1 i , hx2 , y2 i , . . . , hxn , yn i
that we want to perform. We will use a subroutine that carries out these
comparisons with high probability by doing kn ln n+n possibly-faulty compar-
isons, with k a constant to be chosen below, where each of the possibly-faulty
comparisons looks at xr and yr where each r is chosen independently and uni-
formly at random. The subroutine collects the results of these comparisons
for each pair hxi , yi i and takes the majority value.
At most n of these comparisons are faulty, and we get an error only if
some pair hxi , yi i gets more faulty comparisons than non-faulty ones. Let Bi
be the number of bad comparisons of the pair hxi , yi i and Gi the number of
good comparisons. We want to bound the probability of the event Bi > Gi .
The probability that any particular comparison lands on a particular
pair is exactly 1/n; so E [Bi ] = 1 and E [Gi ] = k ln n. Now apply Cher-
noff bounds. From (5.2.4), Pr [Bi ≥ (k/2) ln n] ≤ 2−(k/2) ln n = n−(k ln 2)/2)
provided (k/2) ln n is at least 6. In the other direction, (5.2.6) says that
2
Pr [Gi ≤ (k/2) log n] ≤ e−(1/2) k log n/2 = n−k/8 . So we have a probability of
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 373

at most n−k/2 + n−k/8 that we get the wrong result for this particular pair,
and from the union bound we have a probability of at most n1−k/2 + n1−k/8
that we get the wrong result for any pair. We can simplify this a bit by
observe that n must be at least 2 (or we have nothing to sort), so we can put
a bound of n2−k/8 on the probability of any error among our n simulated
comparisons, provided k ≥ 12/ ln 2. We’ll choose k = max(12/ ln 2, 8(c + 3))
to get a probability of error of at most n−c−1 .
So now let’s implement QuickSort. The first round of QuickSort picks a
pivot and performs n − 1 comparisons. We perform these comparisons using
our subroutine (note we can always throw in extra comparisons to bring
the total up to n). Now we have two piles on which to run the algorithm
recursively. Comparing all nodes in each pile to the pivot for that pile
requires n − 3 comparisons, which can again be done by our subroutine.
At the next state, we have four piles, and we can again perform the n − 7
comparisons we need using the subroutine. Continue until all piles have size
1 or less; this takes O(log n) rounds with high probability. Since each round
does O(n log n) comparisons and fails with probability at most n−c−1 , the
entire process takes O(n log2 n) comparisons and fails with probability at
most n−c−1 O(log n) ≤ n−c when n is sufficiently large.

D.3 Assignment 3: due Thursday, 2016-10-13, at


23:00
D.3.1 Painting with sprites
In the early days of computing, memory was scarce, and primitive home
game consoles like the Atari 2600 did not possess enough RAM to store
the image they displayed on your television. Instead, these machines used a
graphics chip that would composite together fixed sprites, bitmaps stored
in ROM that could be displayed at locations specified by the CPU.2 One
common programming trick of this era was to repurpose existing data in
memory as sprites, for example by searching through a program’s machine
code instructions to find sequences of bytes that looked like explosions when
displayed.
2
In the actual Atari 2600 graphics hardware, the CPU did this not by writing down the
locations in RAM somewhere (which would take memory), but by counting instruction
cycles since the graphics chip started refreshing the screen and screaming “display this
sprite NOW!” when the CRT beam got to the right position. For this problem we will
assume a more modern approach, where we can just give a list of locations to the graphics
hardware.
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 374

Figure D.1: Filling a screen with Space Invaders. The left-hand image places
four copies of a 46-pixel sprite in random positions on a 40 × 40 screen. The
right-hand image does the same thing with 24 copies.

For this problem, we will imagine that we have a graphics device that can
only display one sprite. This is a bitmap consisting of m ones, at distinct rel-
ative positions hy1 , x1 i , hy2 , x2 i , . . . hym , xm i. Displaying a sprite at position
y, x on our n×n screen sets the pixels at positions h(y + yi ) mod n, (x + xi ) mod ni
for all i ∈ {1, . . . , m}. The screen is initially blank (all pixels 0) and setting
a pixel at some position hy, xi changes it to 1. Setting the same pixel more
than once has no effect.
We would like to use these sprites to simulate white noise on the screen,
by placing them at independent uniform random locations with the goal
of setting roughly half of the pixels. Unfortunately, because the contents
of the screen are not actually stored anywhere, we can’t detect when this
event occurs. Instead, we want to fix the number of sprites ` so that we
get 1/2 ± o(1) of the total number of pixels set to 1 with high probability,
by which we mean that (1/2 ± o(1))n2 total pixels are set with probability
1 − n−c for any fixed c > 0 and sufficiently large n, assuming m is fixed.
An example of this process, using Taito Corporation’s classic Space
Invader bitmap, is shown in Figure D.1.
Compute the value of ` that we need as a function of n and m, and show
that this choice of ` does in fact get 1/2 ± o(1) of the pixels set with high
probability.
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 375

Solution
We will apply the following strategy. First, we’ll choose ` so that each
individual pixel is set with probability close to 1/2. Linearity of expectation
then gives roughly n2 /2 total pixels set on average. To show that we get
close to this number with high probability, we’ll use the method of bounded
differences.
The probability that pasting in a single copy of the sprite sets a particular
pixel is exactly m/n2 . So the probability that the pixel is not set after ` sprites
ln(1/2)
is (1 − m/n2 )` . This will be exactly 1/2 if ` = log1−m/n2 1/2 = ln(1−m/n 2) .

Since ` must be an integer, we can just round this quantity to the nearest
integer to pick our actual `. For example, when n = 40 and m = 46,
ln(1/2)
ln(1−46/402 )
= 23.7612... which rounds to 24.
For large n, we can use the approximation ln(1 + x) ≈ x (which holds for
small x) to approximate ` as (n2 ln 2/m) + o(1). We will use this to get a
concentration bound using (5.3.13).
We have ` independent random variables corresponding to the positions
of the ` sprites. Moving a single sprite changes the number of set pixels by
at most m. So the probability that the √ total number of set pixels deviates
from its expectation by more than an ln n is bounded by
  √ 2 
an ln n 
!
a2 n2 ln n
2 exp−  = 2 exp − 2

`m2 (n ln 2/m + o(1))m2
!
a2 ln n
= 2 exp −
m ln 2 + o(m2 /n2 )
 
≤ exp −(a2 /m) ln n
2 /m
= n−a ,

where the inequality holds when n is sufficiently large.
√ Now choose a = cm
to get a√probability of deviating by more than n cm ln n of at most n−c .
Since n cm ln n = o(1)n2 , this give us our desired bound.

D.3.2 Dynamic load balancing


A low-cost competitor to better-managed data center companies offers a
scalable dynamic load balancing system that supplies a new machine whenever
a new job shows up, so that there are always n machines available to run
n jobs. Unfortunately, one of the ways the company cuts cost is by not
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 376

hiring programmers to replace the code from their previous randomized


load balancing mechanism, so when the i-th job arrives, it is still assigned
uniformly at random to one of the i available machines. This means that job
1 is always assigned to machine 1; job 2 is assigned with equal probability to
machine 1 or 2; job 3 is assigned with equal probability to machine 1, 2, or
3; and so on. These choices are all independent. The company claims that
the maximum load is still not too bad, and that the reduced programming
costs that they pass on to their customers make up for their poor quality
control and terrible service.
Justify the first part of this claim by showing that, with high probability,
the most loaded machine after n jobs has arrived has load O(log n).

Solution
Let Xij be the indicator variable for the event that machine i gets job j.
Then E [Xij ] = 1/j for all i ≤ j, and E [Xij ] = 0 when i > j.
Let Yi = nj=1 Xij be the load on machine i.
P
hP i
n Pn Pn
Then E [Yi ] = E j=1 Xij j=i 1/j ≤
= j=1 1/j = Hn ≤ ln n + 1.
From (5.2.4), we have Pr [Yi ≥ R] ≤ 2−R as long as R > 2e E [Yi ]. So if
we let R = (c + 1) lg n, we get a probability of at most n−c−1 that we get
more than R jobs on machine i. Taking the union bound over all i gives
a probability of at most n−c that any machine gets a load greater than
(c + 1) lg n. This works as long as (c + 1) lg n ≥ 2e(ln n + 1), which holds for
sufficiently large c. For smaller c, we can just choose a larger value c0 that
0
does work, and get that Pr [max Yi ≥ (c0 + 1) lg n] ≤ n−c ≤ n−c .
So for any fixed c, we get that with probability at least 1 − n−c the
maximum load is O(log n).

D.4 Assignment 4: due Thursday, 2016-11-03, at


23:00
D.4.1 Re-rolling a random treap
Suppose that we add to a random treap (see §6.3) an operation that re-rolls
the priority of the node with a given key x. This replaces the old priority for
x with a new priority generated uniformly at random, independently of the
old priority any other priorities in the treap. We assume that the range of
priorities is large enough that the probability that this produces a duplicate
is negligible, and that the choice of which node to re-roll is done obliviously,
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 377

without regard to the current priorities of nodes or the resulting shape of


the tree.
Changing the priority of x may break the heap property. To fix the
heap property, we either rotate x up (if its priority now exceeds its parent’s
priority) or down (if its priority is now less than that of one or both of its
children). In the latter case, we rotate up the child with the higher priority.
We repeat this process until the heap property is restored.
Compute the best constant upper bound you can on the expected number
of rotations resulting from executing a re-roll operation.

Solution
The best possible bound is at most 2 rotations on average in the worst case.
There is an easy but incorrect argument that 2 is an upper bound, which
says that if we rotate up, we do at most the same number of rotations as
when we insert a new element, and if we rotate down, we do at most the same
number of rotations as when we delete an existing element. This gives the
right answer, but for the wrong reasons: the cost of deleting x, conditioned
on the event that re-rolling its priority gives a lower priority, is likely to be
greater than 2, since the conditioning means that x is likely to be higher up
in the tree than average; the same thing happens in the other direction when
x moves up. Fortunately, it turns out that the fact that we don’t necessarily
rotate x all the way to or from the bottom compensates for this issue.
This can be formalized using the following argument, for which I am
indebted to Adit Singha and Stanislaw Swidinski. Fix some element x, and
suppose its old and new priorities are p and p0 . If p < p0 , we rotate up,
and the sequence of rotations is exactly the same as we get if we remove
all elements of the original treap with priority p or less and then insert a
new element x with priority p0 . But now if we condition on the number k of
elements with priority greater than p, their priorities together with p0 are all
independent and identically distributed, since they are all obtained by taking
their original distribution and conditioning on being greater than p. So all
(k + 1)! orderings of these priorities are equally likely, and this means that
we have the same expected cost as an insertion into a treap with k elements,
which is at most 2. Averaging over all k shows that the expected cost of
rotating up is at most 2, and, since rotating down is just the reverse of this
process with a reversed distribution on priorities (since we get it by choosing
p0 as our old priority and p as the new one), the expected cost of rotating
down is also at most 2. Finally, averaging the up and down cases gives that
the expected number of rotations without conditioning on anything is at
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 378

most 2.
We now give an exact analysis of the expected number of rotations, which
will show that 2 is in fact the best bound we can hope for.
The idea is to notice that whenever we do a rotation involving x, we
change the number of ancestors of x by exactly one. This will always be
a decrease in the number of ancestors if the priority of x went up, or an
increase if the priority of x went down, so the total number of rotations will
be equal to the change in the number of ancestors.
Letting Ai be the indicator for the event that i is an ancestor of x before
the re-roll, and A0i the indicator for the even that i is an ancestor of x after the
re-roll, then the number of rotations is just | i Ai − i A0i |, which is equal
P P

to i |Ai − A0i | since we know that all changes Ai − A0i have the same sign. So
P

the expected number of rotations is just E [ i |Ai − A0i |] = i E [|Ai − A0i |],
P P

by linearity of expectation.
So we have to compute E [|Ai − A0i |]. Using the same argument as in
§6.3.2, we have that Ai = 1 if and only if i has the highest initial priority
of all elements in the range [min(i, x), max(i, x)], and the same holds for A0i
if we consider updated priorities. So we want to know the probability that
changing only the priority of x to a new random value changes whether i
has the highest priority.
Let k = max(i, x) − min(i, x) + 1 be the number of elements in the
range under consideration. To avoid writing a lot of mins and maxes, let’s
renumber these elements as 1 through k, with i = 1 and x = k (this may
involve flipping the sequence if i > x). Let X1 , . . . , Xk be the priorities of
these elements, and let Xk0 be the new priority of x. These k + 1 random
variables are independent and identically distributed, so conditioning on the
even that no two are equal, all (k + 1)! orderings of their values are equally
likely.
So now let us consider how many of these orderings result in |Ai − A0i | = 1.
For Ai to be 1, X1 must exceed all of X2 , . . . , Xk . For A0i to be 0, X1 must
not exceed all of X2 , . . . , Xk1 , Xk0 . The intersection of these events is when
Xk0 > X1 > max(X2 , . . . , Xk ). Since X2 , . . . , Xk can be ordered in any of
(k − 1)! ways, this gives

(k − 1)! 1
Pr Ai = 1 ∧ A0i = 0 =
 
= .
(k + 1)! k(k + 1)

In the other direction, for Ai to be 0 and A0i to be 1, we must have Xk >


APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 379

X1 > max(X1 , . . . , Xk−1 , Xk ). This again gives


1
Pr Ai = 0 ∧ A0i = 1 =
 
k(k + 1)
and combining the two cases gives
2
E Ai − A0i
 
= .
k(k + 1)
Now sum over all i:
n
E A0i − Ai
X  
E [number of rotations] =
i=1
x−1 n
X 2 X 2
= +
i=1
(x − i + 1)(x − i + 2) i=x+1 (i − x + 1)(i − x + 2
x n−x+1
X 2 X 2
= +
k=2
k(k + 1) k=2
k(k + 1)
x  n−x+1
2 2 2 2
X  X  
= − + −
k=2
k k+1 k=2
k k+1
2 2
=2− − .
x+1 n−x+2
In the limit as n goes to infinity, choosing x = bn/2c makes both fractional
terms converge to 0, making the expected number of rotations arbitrarily
close to 2. So 2 is the best possible bound that doesn’t depend on n.

D.4.2 A defective hash table


A foolish programmer implements a hash table with m buckets, numbered
0, . . . , m − 1, with the property that any value hashed to an even-numbered
bucket is stored in a linked list as usual, but any value hashed to an odd-
numbered bucket is lost.

1. Suppose we insert a set S of n = |S| items into this hash table, using
a hash function h chosen at random from a strongly 2-universal hash
family H. Show that there is a constant c such that, for any t > 0, the
probability that at least n/2 + t items are lost is at most cn/t2 .

2. Suppose instead that we insert a set S of n = |S| items into this hash
table, using a hash function h chosen at random from a hash family
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 380

H 0 that is 2-universal, but that may not be strongly 2-universal. Show


that if n ≤ m/2, it is possible for an adversary that knows S to design
a 2-universal hash family H 0 that ensures that all n items are lost with
probability 1.

Solution
1. Let Xi be the indicator variable for the event that the i-th element
of S is hashed to an odd-numbered bucket. Since H is strongly 2-
universal, E [Xi ] ≤ 1/2 (with equality when m is even), from which
it follows that Var [Xi ] = E [Xi ] (1 − E [Xi ]) ≤ 1/4; and the Xi are
pairwise independent. Letting Y = ni=1 Xi be the total number of
P

items lost, we get E [Y ] ≤ n/2 and Var [Y ] ≤ n/4. But then we can
apply Chebyshev’s inequality to show

Pr [Y ≥ n/2 + t] ≤ Pr [Y ≥ E [Y ] + t]
Var [Y ]

t2
n/4
≤ 2
t
1
= (n/t2 ).
4
So the desired bound holds with c = 1/4.

2. Write [m] for the set {0, . . . , m − 1}. Let S = {x1 , x2 , . . . , xn }.


Consider the set H 0 of all functions h : U → [m] with the property
that h(xi ) = 2i + 1 for each i. Choosing a function h uniformly from
this set corresponds to choosing a random function conditioned on this
constraint; and this conditioning guarantees that every element of S is
sent to an odd-numbered bucket and lost. This gives us half of what
we want.
The other half is that H 0 is 2-universal. Observe that for any y 6∈ S,
the conditioning does not constrain h(y), and so h(y) is equally likely
to be any element of [m]; in addition, h(y) is independent of h(z) for
any z 6= y. So for any y 6= z, Pr [h(y) = h(z)] = 1/m if one or both of
y and z is not an element of S, and Pr [h(y) = h(z)] = 0 if both are
elements of S. In either case, Pr [h(y) = h(z)] ≤ 1/m, and so H 0 is
2-universal.
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 381

D.5 Assignment 5: due Thursday, 2016-11-17, at


23:00
D.5.1 A spectre is haunting Halloween
An adult in possession of n pieces of leftover Halloween candy is distributing
them to n children. The candies have known integer values v1 , v2 , . . . , vn ,
and the adult starts by permuting the candies according to a uniform random
permutation π. They then give candy π(1) to child 1, candy π(2) to child 2,
and so on. Or at least, that is their plan.
Unbeknownst to the adult, child n is a very young Karl Marx, and after
observing the first k candies π(1), . . . , π(k), he can launch a communist
revolution and seize the means of distribution. This causes the remaining
n − k candies π(k + 1) through π(n) to be distributed evenly among the
remaining n − k children, so that each receives a value equal to exactly
1 Pn
n−k i=k+1 vπ(i) . Karl may declare the revolution at any time before the last
candy is distributed (even at time 0, when no candies have been distributed).
However, his youthful understanding of the mechanisms of historical de-
terminism are not detailed enough to predict the future, so his decision to
declare a revolution after seeing k candies can only depend on those k candies
and his knowledge of the values of all of the candies but not their order.
Help young Karl optimize his expected take by devising an algorithm that
takes as input v1 , . . . , vn and vπ(1) , . . . , vπ(k) , where 0 ≤ k < n, and outputs
whether or not to launch a revolution at this time. Compute Karl’s exact
expected return as a function of v1 , . . . , vn when running your algorithm,
and show that no algorithm can do better.

Solution
To paraphrase an often-misquoted line of Trotsky’s, young Karl Marx may
not recognize the Optional Stopping Theorem, but the Optional Stopping
Theorem does not permit him to escape its net. No strategy, no matter how
clever, can produce an expected return better or worse than simply waiting
for the last candy.
Let Xt be the expected return if Karl declares the revolution after seeing
t cards. We will show that {Xt , Ft } is a martingale where each Ft is the
σ-algebra generated by the random variables vπ(1) through vπ(t) .
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 382

The value of Xt is exactly


 
n n t
1 X 1  X X
Xt = vπ(i) = vn − vπ(i) .
n − t i=t+1 n − t i=t+1 i=1
h i
This is also E vπ(t+1) Ft , since the next undistributed candy is equally
likely to be any of the remaining candies, and Ft determines the value of
vπ(i) for all i ≤ t. So

n t+1
" ! #
1 X X
E [Xt+1 | Ft ] = E vi − vπ(i) Ft
n − t − 1 i=1 i=1
n t
!
1 X X 1 h i
= vi − vπ(i) − E vπ(t+1) FT
n − t − 1 i=1 i=1
n−t−1
n−t 1
= Xt − Xt
n−t−1 n−t−1
= Xt .

Now fix some strategy for Karl, and let τ be the time at which he
launches the revolution. Then τ < n is a stopping time with respect to the
Ft , and the Optional Stopping Theorem (bounded time version) says that
E [Xτ ] = E [X0 ]. So any strategy is equivalent (in expectation) to launching
the revolution immediately.

D.5.2 Colliding robots on a line


Suppose we have k robots on a line of n positions, numbered 1 through
n. No two robots can pass each other or occupy the same position, so we
can specify the positions of all the robots at time t as an increasing vector
X t = X1t , X2t , . . . , Xkt . At each time t, we pick one of the robots i uniformly
at random, and also choose a direction d = ±1 uniformly and independently
at random. Robot i moves to position X t+1 = Xit + d if (a) 0 ≤ Xit + d < n,
and (b) no other robot already occupies position Xit + d. If these conditions
do not hold, robot i stays put, and Xit+1 = Xit .

1. Given k and n, determine the stationary distribution of this process.

2. Show that the mixing time tmix for this process is polynomial in n.
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 383

Solution
1. The stationary distribution is a uniform distribution on all nk place-


ments of the robots. To prove this, observe that two vectors increasing
vectors x and y are adjacent if and only if there is some i such that
xj = yj for all j 6= i and xj + d = yj for some d ∈ {−1, +1}. In this
1
case, the transition probability pxy is 2n , since there is a n1 chance that
1
we choose i and a 2 chance that we choose d. But this is the same
as the probability that starting from y we choose i and −d. So we
have pxy = pyx for all adjacent x and y, which means that a uniform
distribution π satisfies πx pxy = πy pyx for all x and y.
To show that this stationary distribution is unique, we must show that
there is at least one path between any two states x and y. One way to
do this is to show that there is a path from any state x to the state
h1, . . . , ki, where at each step we move the lowest-index robot i that
is not already at position i. Since we can reverse this process to get
to y, this gives a path between any x and y that occurs with nonzero
probability.

2. This one could be done in a lot of ways. Below I’ll give sketches of
three possible approaches, ordered by increasing difficulty. The first
reduces to card shuffling by adjacent swaps, the second uses an explicit
coupling, and the third uses conductance.
Of these approaches, I am personally only confident of the coupling
argument, since it’s the one I did before handing out the problem, and
indeed this is the only one I have written up in enough detail below to
be even remotely convincing. But the reduction to card shuffling is also
pretty straightforward and was used in several student solutions, so I
am convinced that it can be made to work as well. The conductance
idea I am not sure works at all, but it seems like it could be made to
work with enough effort.

(a) Let’s start with the easy method. Suppose that instead of colliding
robots, we have a deck of n cards, of which k are specially marked.
Now run a shuffling algorithm that swaps adjacent cards at each
step. If we place a robot at the position of each marked card,
the trajectories of the robots follow the pretty much the same
distribution as in the colliding-robots process. This is trivially
the case when we swap a marked card with an unmarked card (a
robot moves), but it also works when we swap two marked cards
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 384

(no robot moves, since the positions of the set of marked cards
stays the same; this corresponds to a robot being stuck).
Unfortunately we can’t use exactly the same process we used in
§10.4.3.3, because this (a) allows swapping the cards in the first
and last positions of the deck, and (b) doesn’t include any moves
corresponding to a robot at position 1 or n trying to move off the
end of the line.
The first objection is easily dealt with, and indeed the cited result
of Wilson [Wil04] doesn’t allow such swaps either. The second can
be dealt with by adding extra no-op moves to the card-shuffling
process that occur with probability 1/n, scaling the probabilities
of the other operations to keep the sum to 1. This doesn’t affect
the card-shuffling convergence argument much, but it is probably
a good idea to check that everything still works.
Finally, even after fixing the card-shuffling argument, we still have
to argue that convergence in the card-shuffling process implies
convergence in the corresponding colliding-robot process. Here is
where the definition of total variation distance helps. Let C t be the
permutation of the cards after t steps, and let f : C t 7→ X t map
permutations of cards to positions of robots. Let π and π 0 be the
stationary distributions of the card and robot processes, respec-
t 0
 t 
tively. Then dT V (f (C ), f (π)) = maxA Pr X ∈ A − π (A) =
maxA Pr C t ∈ f −1 (A) − π(f −1 (A)) ≤ maxB Pr C t ∈ B − π(B) =
 

dT V (C t , π). So convergence of C implies convergence of X = f (C).


(b) Alternatively, we can construct a coupling between two copies of
the process X t and Y t , where as usual we start X 0 in our initial
distribution (whatever it is) and Y 0 in the uniform stationary
distribution. At each step we generate a pair (i, d) uniformly at
random from {1, . . . , n} × {−1, +1}. Having generated this pair,
we move robot i in direction d in both processes if possible. It is
0 0
clear that once X t = Y t , it will continue to hold that X t = Y t
for all t0 > t.
To show convergence we must show that X t will in fact eventually
hit Y t . To do so, let us track Z t = kX t − Y t k1 = ki=1 |Xit − Yit |.
P

We will show that Zt is a supermartingale with respect to the


history Ft generated by the random variables hX s , Y s i for s ≤ t.
Specifically, we will show that at each step, the expected change
in Z t conditioning on X t and Y t is non-positive.
Fix X t , and consider all 2n possible moves (i, d). Since (i, d)
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 385

doesn’t change Xjt for any j 6= i, the only change in Z t will occur
if one of Xit and Yit can move and the other can’t. There are
several cases:
i. If i = k, d = +1, and exactly one of Xkt and Ykt is n, then the
copy of the robot not at n moves toward the copy at n, giving
Zt+1 − Zt = −1. The same thing occurs if i = 1, d = −1, and
exactly one of X1t and Y1t is 1.
It is interesting to note that these two cases will account for
the entire nonzero part of E [Zt+1 − Zt | Ft ], although we will
not use this fact.
ii. If i < k, d = +1, Xit + 1 = Xi+1 t , but Y t + 1 < X t , and
i i+1
Xi ≤ Yi , then robot i can move right in Y t but not in X t .
t t

This gives Z t+1 − Z t = +1.


However, in this case the move (i+1, −1) moves robot i+1 left
t > Y t + 1 > X t = X t+1 ,
in Y t but not in X t , and since Yi+1 i i i
this move gives Z t+1 t
− Z = −1. Since the moves (i, +1) and
(i + 1, −1) occur with equal probability, these two changes in
Z t cancel out on average.
Essentially the same analysis applies if we swap X t with Y t ,
and if we consider moves (i, −1) where i > 1 in which one copy
of the robot is blocked but not the other. If we enumerate all
such cases, we find that for every move of these types that
increase Z t , there is a corresponding move that decreases Z t ;
it follows that these changes sum to zero in expectation.
Because each Xit and Yit lies in the range [1, n], it holds trivially
that Z t ≤ k(n − 1) < n2 . Consider an unbiased ±1 random walk
W t with a reflecting barrier at n2 (meaning that at n2 we always
move down) and an absorbing barrier at 0. Then the expected
time to reach 0 starting from W t = x is given by x(2n2 − x) ≤ n4 .
Since this unbiased random walk dominates the biased random
walk corresponding to changes in Z t , this implies that Z t can
change at most n4 times on average before reaching 0.
Now let us ask how often Z t changes. The analysis above implies
that there is a chance of at least 1/n that Z t changes in any state
where some robot i with Xit 6= Yit is blocked from moving freely
in one or both directions in either the X t or Y t process. To show
that this occurs after polynomially many steps, choose some i with
Xit 6= Yit that is not blocked in either process. Then i is selected
every n1 steps on average, and it moves according to a ±1 random
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 386

walk as long as it is not blocked. But a ±1 random walk will


reach position 1 or n in O(n2 ) steps on average (corresponding to
O(n3 ) steps of the original Markov chain), and i will be blocked
by the left or right end of the line if it is not blocked before
then. It follows that we reach a state in which Z t changes with
probability at least n1 after O(n3 ) expected steps, which means
that Z t changes every O(n4 ) on average. Since we have previously
established that Z t reaches 0 after O(n4 ) expected changes, this
gives a total of O(n8 ) expected steps of the Markov process before
Z t = 0, giving X t = Y t . Markov’s inequality and the Coupling
Lemma then give tmix = O(n8 ).
(c) I believe it may also be possible to show convergence using a con-
ductance argument. The idea is to represent a state hx1 , . . . , xk i
as the differences h∆1 , . . . , ∆n i, where ∆1 = x1 and ∆i+1 =
ni+1 − xi . This
x representation
o makes the set of possible states
∆ ∆i ≥ 1, ki=1 ∆i ≤ n look like a simplex, a pretty well-
P

behaved geometric object. In principle it should be possible


to show an isoperimetric inequality that the lowest-conductance
sets in this simplex are the N points closest to any given corner, or
possibly get a lower bound on conductance using canonical paths.
But the ∆ version of the process is messy (two coordinates change
every time a robot moves), and there are some complication with
this not being a lazy random walk, so I didn’t pursue this myself.

D.6 Assignment 6: due Thursday, 2016-12-08, at


23:00
D.6.1 Another colliding robot
A warehouse consists of an n × n grid of locations, which we can think of
as indexed by pairs (i, j) where 1 ≤ i, j ≤ n. At night, the warehouse is
patrolled by a robot executing a lazy random walk. Unfortunately for the
robot, there are also m crates scattered about the warehouse, and if the
robot attempts to walk onto one of the m grid locations occupied by a crate,
it fails to do so, and instead emits a loud screeching noise. We would like to
use the noises coming from inside the warehouse to estimate m.
Formally, if the robot is at position (i, j) at time t, then at time t + 1
it either (a) stays at (i, j) with probability 1/2; or (b) chooses a position
(i0 , j 0 ) uniformly at random from {(i − 1, j), (i + 1, j), (i, j − 1), (i, j + 1)}
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 387

and attempts to move to it. If the robot chooses not to move, it makes no
noise. If it chooses to move, but (i0 , j 0 ) is off the grid or occupied by one of
the m crates, then the robot stays at (i, j) and emits the noise. Otherwise it
moves to (i0 , j 0 ) and remains silent. The robot’s position at time 0 can be
any unoccupied location on the grid.
To keep the robot from getting walled in somewhere, whatever adversary
placed the crates was kind enough to ensure that if a crate was placed at
position (i, j), then all of the eight positions

(i − 1, j + 1) (i, j + 1) (i + 1, j + 1)
(i − 1, j) (i + 1, j)
(i − 1, j − 1) (i, j − 1) (i + 1, j − 1)

reachable by a king’s move from (i, j) are unoccupied grid locations. This
means that they are not off the grid and not occupied by another crate, so
that the robot can move to any of these eight positions.
Your job is to devise an algorithm for estimating m to within  relative
error with probability at least 1 − δ, based on the noises emitted by the robot.
The input to your algorithm is the sequence of bits x1 , x2 , x3 , . . . , where xi
is 1 if the robot makes a noise on its i-th step. Your algorithm should run in
time polynomial in n, 1/, and log(1/δ).

Solution
It turns out that  is a bit of a red herring: we can in fact compute the exact
number of crates with probability 1 − δ in time polynomial in n and log(1/δ).
The placement restrictions and laziness make this an irreducible aperiodic
chain, so it has a unique stationary distribution π. It is easy to argue from
reversibility that this is uniform, so each of the N = n2 − m unoccupied
positions occurs with probability exactly 1/N .
It will be useful to observe that we can assign three unique unoccupied
positions to each crate, to the east, south, and southeast, and this implies
m ≤ n2 /4.
The idea now is to run the robot until the distribution on its position is
close to π, and then see if it hits an obstacle on the next step. We can easily
count the number of possible transitions that hit an obstacle, since there are
4m incoming edges to the crates, plus 4n incoming edges to the walls. Since
1
each edge uv has probability πu puv = 8N of being selected in the stationary
distribution, the probability q that we hit an obstacle starting from π is
exactly n+m n+m
2N = n2 −m . This function is not trivial to invert, but we don’t
have to invert it: if we can compute its value (to some reasonable precision),
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 388

we can compare it to all n2 /4 + 1 possible values of m, and see which it


matches. But we still have to deal with both getting enough samples and
the difference between our actual distribution when we take a sample and
the stationary distribution. For the latter, we need a convergence bound.
We will get this bound by constructing a family of canonical paths
between each distinct pair of locations (i, j) and (i0 , j 0 ). In the absence of
crates, we can use the L-shaped paths that first change i to i0 and then j to
j 0 , one step at a time in the obvious way. If we encounter a crate at some
position along this path, we replace that position with a detour that uses
up to three locations off the original path to get around the obstacle. The
general rule is that if we are traveling left-to-right or right-to-left, we shift
up one row to move past the obstacle, and if we are moving up-to-down or
down-to-up, we shift left one column. An annoying special case is when the
obstacle appears exactly at the corner (i0 , j); here we replace the blocked
position with the unique diagonally adjacent position that still gives us a
path.
Consider the number of paths (i, j) → (i0 , j 0 ) crossing a particular left-
right edge (x, y) → (x + 1, y). There are two cases we have to consider:

1. We cross the edge as part of the left-to-right portion of the path (this
includes left-to-right moves that are part of detours). In this case we
have |j − y| ≤ 1. This gives at most 3 choices for j, giving at most 3n3
possible paths. (The constants can be improved here.)

2. We cross the edge as part of a detour on the vertical portion of the


path. Then |i − x| ≤ 1, and so we again have at most 3n3 possible
paths.

This gives at most 6n3 possible paths across each edge, giving a congestion
1
ρ≤ (6n3 )πij πi0 j 0
πij pij,i0 j 0
1
= −1 (6n3 )N −2
N (1/8)
= 24n3 N − 1
3
≤ 24n3 ( n2 )−1
4
= 32n,

since we can argue from the placement restrictions that N ≤ n2 /4. This
immediately gives a bound τ2 ≤ 8ρ2 ≤ O(n2 ). using Lemma 10.6.3.
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 389

So let’s run for d51τ2 ln ne = O(n2 ln n) steps for each sample. Starting
from any initial location, we will reach some distribution σ with dT V σ, π =
O(n−50 ). Let X be the number of obstacles (walls or crates) adjacent to the
current position, then we can apply Lemma 10.2.2 to get |Eσ (X) − Eπ (X)| ≤
4dT V (σ, π) = O(n−50 ). The same bound (up to constants) also applies to
the probability ρ = X/8 of hitting an obstacle, giving ρ = 4(n + m)/(n2 −
m) ± O(n−50 ). Note that ρ is Θ(n−1 for all values of m.
Now take n10 ln(1/δ) samples, with a gap of d51τ2 ln ne steps between
p

The expected number of positive samples is µ = (ρ+O(n−50 ))n10 ln(1/δ) =


p
each. p
Θ(n9 ln(1/δ)), and from Lemma 5.2.2, the probability that the num-
ber of positive samples exceeds µ(1 + n−4 ) is at most exp(−µn−8 /3) =
exp(−Θ(n ln(1/δ))) = δ Θ(n) ≤ δ/2 for sufficiently large n. A similar bound
holds on the other side. So with probability at least 1 − δ we get an estimate
ρ̂ of ρ = 4(n + m)/(n2 − m) that is accurate to within a relative error of n−4 .

Since dm = O(n−2 ) throughout the interval spanned by m, and since ρ itself
is Θ(n ), any relative error that is o(n−1 ) will give us an exact answer for
−1

m when we round to the nearest value of ρ. We are done.


We’ve chosen big exponents because all we care about is getting a polyno-
mial bound. With a cleaner analysis we could probably get better exponents.
It is also worth observing that the problem is ambiguous about whether we
have random access to the data stream, and whether the time taken by the
robot to move around counts toward our time bound. If we are processing
the stream after the fact, and are allow to skip to specific places in the stream
at no cost, we can get each sample in O(1) time. This may further reduce
the exponents, and demonstrates why it is important to nail down all the
details of a model.

D.6.2 Return of the sprites


The Space Invaders from Problem D.3.1 are back, and now they are trying
to hide in random n × n bitmaps, where each pixel is set with probability
1/2. Figure D.2 shows an example of this, with two sprites embedded in an
otherwise random bitmap.
The sprites would like your assistance in making their hiding places
convincing. Obviously it’s impossible to disguise their presence completely,
but to make the camouflage as realistic as possible, they would like you to
provide them with an algorithm that will generate an n × n bitmap uniformly
at random, conditioned on containing at least two copies of a given m-pixel
sprite at distinct offsets. (It is OK for the sprites to overlap, but they cannot
be directly on top of each other, and sprites that extend beyond the edges of
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 390

Figure D.2: Two hidden Space Invaders. On the left, the Space Invaders hide
behind random pixels. On the right, their positions are revealed by turning
the other pixels gray.

the bitmap wrap around as in Problem D.3.1.) Your algorithm should make
the probability of each bitmap that contains at least two copies of the sprite
be exactly equally likely, and should run in expected time polynomial in n
and m.

Solution
Rejection sampling doesn’t work here because if the sprites are large, the
chances of getting two sprites out of a random bitmap are exponentially
small. Generating a random bitmap and slapping two sprites on top of it
also doesn’t work, because it gives a non-uniform distribution: if the sprites
2
overlap in k places, there are 2n −2m+k choices for the remaining bits, which
means that we would have to adjust the probabilities to account for the effect
of the overlap. But even if we do this, we still have issues with bitmaps that
contain more than two sprites: a bitmap with three sprites can be generated
in three different ways, and it gets worse quickly as we generate more. It
may be possible to work around these issues, but a simpler approach is to
use the sampling mechanism from Karp-Luby [KL85] (see also §11.4).
2
Order all n2 positions u < v for the two planted sprites in lexicographic
order. For each pair of positions u < v, let Auv be the set of all bitmaps with
2
sprites at u and v. Then we can easily calculate |Auv | = 2n −2m+k where k
is the number of positions where sprites at u and v overlap, and so we can
sample a particular Auv with probability |Auv |/ st |Ast | and then choose an
P
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 391

element of Auv uniformly at random by filling in the positions not covered


by the sprites with independent random bits. To avoid overcounting, we
discard any bitmap that contains a sprite at a position less than max(u, v);
this corresponds to discarding any bitmap that appears in Ast for some
st < uv. By the same argument as in Karp-Luby, we expect to discard at
2
most n2 bitmaps before we get one that works. This gives an expected O(n4 )
attempts, each of which requires about O(n2 ) work to generate a bitmap
and O(n2 m) work to test it, for a nicely polynomial O(n6 m) expected time
in total.
If we don’t want to go through all this analysis, we could also reduce
to Karp-Luby directly by encoding the existence of two sprites as a large
2
DNF formula consisting of n2 clauses, and then argue that the sampling
mechanism inside Karp-Luby gives us what we want.

D.7 Final exam


Write your answers in the blue book(s). Justify your answers. Work alone.
Do not use any notes or books.
There are three problems on this exam, each worth 20 points, for a total
of 60 points. You have approximately three hours to complete this exam.

D.7.1 Virus eradication (20 points)


Viruses have infected a network structured as an undirected graph G = (V, E).
At time 0, all n = |V | nodes are infected. At each step, as long as there is at
least one infected node, we can choose a node v to disinfect; this causes v
1
to become uninfected, but there is an independent 2d(v) chance that each of
v’s neighbors becomes infected if it is not infected already, where d(v) is the
degree of v.
Let Xt be the number of nodes that are infected after t disinfection steps.
Let τ be the first time at which Xτ = 0. Show that, no matter what strategy
is used to select nodes to disinfect, E [τ ] = O(n).
Clarification added during exam: When choosing a node to disinfect, the
algorithm can only choose nodes that are currently infected.

Solution
Let Yt = Xt + 12 min(t, τ ). We will show that {Yt } is a supermartingale.
Suppose we disinfect node v at time t, and that v has d0 ≤ d(v) uninfected
d0
neighbors. Then E [Xt+1 | Xt ] = Xt − 1 + 2d(v) ≤ Xt − 12 , because we always
APPENDIX D. SAMPLE ASSIGNMENTS FROM FALL 2016 392

disinfect v and each of its d0 uninfected neighbors contributes 2d(v) 1


new
infections on average. We also have min(τ, t) increasing by 1 in this case.
This gives E [Yt+1 | Yt ] ≤ Yt − 21 + 12 = Yt when t < τ ; it also holds trivially
that Yt+1 = Yt when t ≥ τ . So E [Yt+1 | Yt ] ≤ Yt always.
Now apply the optional stopping theorem. We have bounded incre-
ments, because |Yt+1 − Yt | ≤ n + 12 . We also have bounded expected
time, because there is a nonzero probability  that each sequence of n
consecutive disinfections produces no new infections, giving E [τ ] ≤ 1/. So
n = Y0 ≥ E [Yτ ] = 12 E [τ ], giving E [τ ] ≤ 2n = O(n).

D.7.2 Parallel bubblesort (20 points)


Suppose we are given a random bit-vector and choose to sort it using parallel
bubblesort. The initial state consists of n independent fair random bits
X10 , X20 , . . . , Xn0 , and in each round, for each pair of positions Xit and Xi+1
t ,
t+1 t+1 t t
we set new values Xi = 0 and Xi+1 = 1 if Xi = 1 and Xi+1 = 0. When it
holds that Xit = 1 and Xi+1 t = 0, we say that there is a swap at position i in
round t. Denote the total number of swaps in round t by Yt .

1. What is E [Y0 ]?

2. What is E [Y1 ]?

3. Show that both Y0 and Y1 are concentrated around their expectations,


by showing, for any fixed c > 0 and each t ∈ {0, 1}, that there exists a
function ft (n) = o(n) such that Pr [|Yt − E [Yt ]| ≥ ft (n)] ≤ n−c .

Solution
1. This is a straightforward application of linearity of expectation. Let Z_i^t,
for each i ∈ {1, . . . , n − 1}, be the indicator for the event that X_i^t = 1
and X_{i+1}^t = 0. For t = 0, X_i^0 and X_{i+1}^0 are independent, so this event
occurs with probability 1/4. So E[Y_0] = E[Σ_{i=1}^{n−1} Z_i^0] = Σ_{i=1}^{n−1} E[Z_i^0] =
(n − 1) · 1/4 = (n − 1)/4.

2. Again we want to use linearity of expectation, but we have to do a little
more work to calculate E[Z_i^1]. For Z_i^1 to be 1, we need X_i^1 = 1 and
X_{i+1}^1 = 0. If we ask where that 1 in X_i^1 came from, either it started in
X_{i−1}^0 and moved to position i because X_i^0 = 0, or it started in X_i^0 and
didn't move because X_{i+1}^0 = 1. Similarly, the 0 in X_{i+1}^1 either started
in X_{i+2}^0 and moved down to replace X_{i+1}^0 = 1, or started in X_{i+1}^0 and
didn't move because X_i^0 = 0. Enumerating all the possible assignments
of X_{i−1}^0, X_i^0, X_{i+1}^0, and X_{i+2}^0 consistent with these conditions gives
0110, 1000, 1001, 1010, and 1110, making E[Z_i^1] = 5 · 2^{−4} = 5/16 when
2 ≤ i ≤ n − 2. This accounts for n − 3 of the positions.
There are two annoying special cases: when i = 1 and when i = n − 1.
We can handle these with the above analysis by pretending that X_0^t = 0
and X_{n+1}^t = 1 for all t; this leaves the initial patterns 0110 for i = 1
and 1001 for i = n − 1, each of which occurs with probability 1/8.
Summing over all cases then gives E[Y_1] = 1/8 + Σ_{i=2}^{n−2} 5/16 + 1/8 =
(5n − 11)/16. (A quick simulation, sketched after this solution, checks
both this value and the answer to part 1.)

3. The last part is just McDiarmid's inequality. Because the algorithm
is deterministic aside from the initial random choice of input, we
can express each Y_t as a function g_t(X_1^0, . . . , X_n^0) of the independent
random variables X_1^0, . . . , X_n^0. Changing some X_i^0 can change Z_{i−1}^0 and
Z_i^0, for a total change to Y_0 of at most 2; similarly, changing X_i^0 can
change at most Z_{i−2}^1, Z_{i−1}^1, Z_i^1, and Z_{i+1}^1, for a total change to
Y_1 of at most 4. So McDiarmid says that, for each t ∈ {0, 1}, Pr[|Y_t − E[Y_t]| ≥ s] ≤
2 exp(−Θ(s²/n)). For any fixed c > 0, there is some s = Θ(√(n log n)) =
o(n) such that the right-hand side is less than n^{−c}.
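Here is the hypothetical simulation mentioned in part 2 (all names and parameters are illustrative, not from the original exam); it estimates E[Y_0] and E[Y_1] and compares them with (n − 1)/4 and (5n − 11)/16:

    import random

    def swaps(x):
        return sum(1 for i in range(len(x) - 1) if x[i] == 1 and x[i + 1] == 0)

    def one_round(x):
        # apply all swaps simultaneously; no two swaps touch the same position
        y = list(x)
        for i in range(len(x) - 1):
            if x[i] == 1 and x[i + 1] == 0:
                y[i], y[i + 1] = 0, 1
        return y

    n, trials = 20, 100000
    y0 = y1 = 0
    for _ in range(trials):
        x = [random.randrange(2) for _ in range(n)]
        y0 += swaps(x)
        y1 += swaps(one_round(x))
    print(y0 / trials, "vs", (n - 1) / 4)
    print(y1 / trials, "vs", (5 * n - 11) / 16)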

D.7.3 Rolling a die (20 points)


Suppose we have a device that generates a sequence of independent fair
random coin-flips, but what we want is a six-sided die that generates the
values 1, 2, 3, 4, 5, 6 with equal probability.

1. Give an algorithm that does so using O(1) coin-flips on average.

2. Show that any correct algorithm for this problem will use more than n
coin-flips with probability at least 2−O(n) .

Solution
1. Use rejection sampling: generate 3 bits, and if the resulting binary
number is not in the range 1, . . . , 6, try again. Each attempt consumes
3 bits and succeeds with probability 3/4, so we need to generate (4/3) · 3 = 4
bits on average. (A code sketch of this approach appears after this solution.)
It is possible to improve on this by reusing the last bit of a discarded
triple of bits as the first bit in the next triple. This requires a more
complicated argument to show uniformity, but requires only two bits
for each attempt after the first, for a total of 3 + (1/4)(4/3) · 2 = 11/3 bits on
average. This is still O(1), so unless that 1/3-bit improvement is really
important, it's probably easiest just to do rejection sampling.

2. Suppose that we have generated n bits so far. Since 6 does not evenly
divide 2^n for any n, we cannot assign an output from 1, . . . , 6 to all
possible 2^n sequences of bits without giving two outputs different
probabilities. So we must keep going in at least one case, giving a
probability of at least 2^{−n} = 2^{−O(n)} that we continue.
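Here is a minimal sketch of the rejection-sampling die from part 1, using Python's random.getrandbits as a stand-in for the fair-coin device:

    import random
    from collections import Counter

    def roll():
        while True:
            value = random.getrandbits(3)      # three fair bits: uniform on 0..7
            if 1 <= value <= 6:                # reject 0 (000) and 7 (111)
                return value

    # each of 1..6 should appear with probability about 1/6
    print(Counter(roll() for _ in range(60000)))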
Appendix E

Sample assignments from


Spring 2014

E.1 Assignment 1: due Wednesday, 2014-09-10, at


17:00
E.1.1 Bureaucratic part
Send me email! My address is james.aspnes@gmail.com.
In your message, include:

1. Your name.

2. Your status: whether you are an undergraduate, grad student, auditor,


etc.

3. Anything else you’d like to say.

(You will not be graded on the bureaucratic part, but you should do it
anyway.)

E.1.2 Two terrible data structures


Consider the following data structures for maintaining a sorted list:

1. A sorted array A. Inserting a new element at position A[i] into an


array that currently contains k elements requires moving the previous
values in A[i] . . . A[k] to A[i + 1] . . . A[k + 1]. Each element moved costs
one unit.


For example, if the array currently contains 1 3 5, and we insert 2,


the resulting array will contain 1 2 3 5 and the cost of the operation
will be 2, because we had to move 3 and 5.

2. A sorted doubly-linked list. Inserting a new element at a new position


requires moving the head pointer from its previous position to a neigh-
bor of the new node, and then to the new node; each pointer move
costs one unit.
For example, if the linked list currently contains 1 3 5, with the head
pointer pointing to 5, and we insert 2, the resulting linked list will
contain 1 2 3 5, with the head pointing to 2, and the cost will be 2,
because we had to move the pointer from 5 to 3 and then from 3 to 2.
Note that we do not charge for updating the pointers between the new
element and its neighbors. We will also assume that inserting the first
element is free.

Suppose that we insert the elements 1 through n, in random order, into


both data structures. Give an exact closed-form expression for the expected
cost of each sequence of insertions.

Solution
1. Suppose we have already inserted k elements. Then the next element
is equally likely to land in any of the positions A[1] through A[k + 1].
The number of displaced elements is then uniformly distributed in 0
through k, giving an expected cost for this insertion of k/2.
Summing over all insertions gives

    Σ_{k=0}^{n−1} k/2 = (1/2) Σ_{k=0}^{n−1} k = n(n − 1)/4.

An alternative proof, which also uses linearity of expectation, is to


define Xij as the indicator variable for the event that element j moves
when element i < j is inserted. This is 1 if and only if j is inserted
before i, which by symmetry occurs with probability exactly 1/2. So
the expected total number of moves is

    Σ_{1≤i<j≤n} E[X_ij] = Σ_{1≤i<j≤n} 1/2 = (1/2)(n choose 2) = n(n − 1)/4.

It’s always nice when we get the same answer in situations like this.

2. Now we need to count how far the pointer moves between any two
consecutive elements. Suppose that we have already inserted k − 1 > 0
elements, and let Xk be the cost of inserting the k-th element. Let
i and j be the indices in the sorted list of the new and old pointer
positions after the k-th insertion. By symmetry, all pairs of distinct
positions i ≠ j are equally likely. So we have

    E[X_k] = (1/(k(k−1))) Σ_{i≠j} |i − j|
           = (1/(k(k−1))) (Σ_{1≤i<j≤k} (j − i) + Σ_{1≤j<i≤k} (i − j))
           = (2/(k(k−1))) Σ_{1≤i<j≤k} (j − i)
           = (2/(k(k−1))) Σ_{j=1}^{k} Σ_{ℓ=1}^{j−1} ℓ
           = (2/(k(k−1))) Σ_{j=1}^{k} j(j − 1)/2
           = (1/(k(k−1))) Σ_{j=1}^{k} (j² − j)
           = (1/(k(k−1))) ((2k+1)k(k+1)/6 − k(k+1)/2)
           = (1/(k(k−1))) · (2k−2)k(k+1)/6
           = (k + 1)/3.

This is such a simple result that we might reasonably expect that there
is a faster way to get it, and we’d be right. A standard trick is to
observe that we can simulate choosing k points uniformly at random
from a line of n points by instead choosing k + 1 points uniformly at
random from a cycle of n + 1 points, and deleting the first point chosen
to turn the cycle back into a line. In the cycle, symmetry implies that
the expected distance between each point and its successor is the same
as for any other point; there are k + 1 such distances, and they add up
to n + 1, so each expected distance is exactly (n + 1)/(k + 1).
In our particular case, n (in the formula) is k and k (in the formula) is
2, so we get (k + 1)/3. Note that we are sweeping the whole absolute value thing
under the carpet here, so maybe the more explicit derivation is safer.

However we arrive at E[X_k] = (k + 1)/3 (for k > 1), we can sum these
expectations to get our total expected cost:

    E[Σ_{k=2}^{n} X_k] = Σ_{k=2}^{n} E[X_k]
                       = Σ_{k=2}^{n} (k + 1)/3
                       = (1/3) Σ_{ℓ=3}^{n+1} ℓ
                       = (1/3) ((n + 1)(n + 2)/2 − 3)
                       = (n + 1)(n + 2)/6 − 1.
It’s probably worth checking a few small cases to see that this answer
actually makes sense.

For large n, this shows that the doubly-linked list wins, but not by much:
we get roughly n²/6 instead of n²/4. This is a small enough difference that
in practice it is probably dominated by other constant-factor differences that
we have neglected.
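A hypothetical simulation (illustrative only) that inserts a random permutation into both structures and compares the average costs with n(n − 1)/4 and (n + 1)(n + 2)/6 − 1:

    import bisect, random

    def costs(n):
        order = list(range(1, n + 1))
        random.shuffle(order)
        lst, head = [], None
        array_cost = list_cost = 0
        for x in order:
            j = bisect.bisect_left(lst, x)
            array_cost += len(lst) - j        # elements shifted in the array
            lst.insert(j, x)
            if head is not None:              # pointer walks from old head to x
                list_cost += abs(j - lst.index(head))
            head = x
        return array_cost, list_cost

    n, trials = 50, 2000
    a = l = 0
    for _ in range(trials):
        ca, cl = costs(n)
        a, l = a + ca, l + cl
    print(a / trials, "vs", n * (n - 1) / 4)
    print(l / trials, "vs", (n + 1) * (n + 2) / 6 - 1)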

E.1.3 Parallel graph coloring


Consider the following algorithm for assigning one of k colors to each node
in a graph with m edges:
1. Assign each vertex u a color cu , chosen uniformly at random from all
k possible colors.
2. For each vertex u, if u has a neighbor v with c_u = c_v, assign u a new
color c′_u, again chosen uniformly at random from all k possible colors.
Otherwise, let c′_u = c_u.
Note that any new colors c′ do not affect the test c_u = c_v. A node changes
its color only if it has the same original color as the original color of one or
more of its neighbors.
Suppose that we run this algorithm on an r-regular¹ triangle-free² graph.
As a function of k, r, and m, give an exact closed-form expression for the
expected number of monochromatic³ edges after running both steps of the
algorithm.

¹Every vertex has exactly r neighbors.
²There are no vertices u, v, and w such that all are neighbors of each other.
³Both endpoints have the same color.

Solution
We can use linearity of expectation to compute the probability that any
particular edge is monochromatic, and then multiply by m to get the total.
Fix some edge uv. If either of u or v is recolored in step 2, then the
probability that c′_u = c′_v is exactly 1/k. If neither is recolored, the probability
that c′_u = c′_v is zero (otherwise c_u = c_v, forcing both to be recolored). So we
can calculate the probability that c′_u = c′_v by conditioning on the event A
that neither vertex is recolored.
This event occurs if both u and v have no neighbors with the same color.
The probability that c_u = c_v is 1/k. The probability that any particular
neighbor w of u has c_w = c_u is also 1/k; similarly for any neighbor w of
v. These events are all independent on the assumption that the graph is
triangle-free (which implies that no neighbor of u is also a neighbor of v).
So the probability that none of these 2r − 1 events occur is (1 − 1/k)^{2r−1}.
We then have

    Pr[c′_u = c′_v] = Pr[c′_u = c′_v | ¬A] Pr[¬A] + Pr[c′_u = c′_v | A] Pr[A]
                    = (1/k) · (1 − (1 − 1/k)^{2r−1}).

Multiply by m to get

    (m/k) · (1 − (1 − 1/k)^{2r−1}).

For large k, this is approximately (m/k) · (1 − e^{−(2r−1)/k}), which is a little
bit better than the m/k expected monochromatic edges from just running
step 1.
Repeated application of step 2 may give better results, particularly if k is
large relative to r. We will see this technique applied to a more general class
of problems in §13.3.5.
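As a hypothetical check (not part of the original solution), we can run the two-step algorithm on a cycle, which is 2-regular and triangle-free for n ≥ 4, and compare against the formula; all parameter choices below are illustrative:

    import random

    def monochromatic_edges(n, k):
        c = [random.randrange(k) for _ in range(n)]
        # recolor exactly the nodes whose original color matches a neighbor's
        bad = [u for u in range(n)
               if c[u] == c[(u - 1) % n] or c[u] == c[(u + 1) % n]]
        c2 = list(c)
        for u in bad:
            c2[u] = random.randrange(k)
        return sum(1 for u in range(n) if c2[u] == c2[(u + 1) % n])

    n, k, r = 60, 5, 2          # a cycle: m = n edges, r = 2
    trials = 5000
    avg = sum(monochromatic_edges(n, k) for _ in range(trials)) / trials
    print(avg, "vs", (n / k) * (1 - (1 - 1 / k) ** (2 * r - 1)))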

E.2 Assignment 2: due Wednesday, 2014-09-24, at


17:00
E.2.1 Load balancing
Suppose we distribute n processes independently and uniformly at random
among m machines, and pay a communication cost of 1 for each pair of
processes assigned to different machines. Let C be the total communication
cost, that is, the number of pairs of processes assigned to different machines.
What is the expectation and variance of C?

Solution
Let C_ij be the communication cost between processes i and j. This is just
an indicator variable for the event that i and j are assigned to different
machines, which occurs with probability 1 − 1/m. We have C = Σ_{1≤i<j≤n} C_ij.

1. Expectation is a straightforward application of linearity of expectation.
There are (n choose 2) pairs of processes, and E[C_ij] = 1 − 1/m for each pair, so

    E[C] = (n choose 2) (1 − 1/m).

2. Variance is a little trickier, because the C_ij are not independent. But
they are pairwise independent: even if we fix the locations of i and j,
the expectation of C_jk is still 1 − 1/m, so Cov[C_ij, C_jk] = E[C_ij C_jk] −
E[C_ij] · E[C_jk] = 0. So we can compute

    Var[C] = Σ_{1≤i<j≤n} Var[C_ij] = (n choose 2) (1 − 1/m)(1/m).
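A short hypothetical check of both formulas by simulation (parameters are illustrative):

    import random
    from math import comb

    def cost(n, m):
        a = [random.randrange(m) for _ in range(n)]
        return sum(1 for i in range(n) for j in range(i + 1, n) if a[i] != a[j])

    n, m, trials = 10, 4, 20000
    samples = [cost(n, m) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
    print(mean, "vs", comb(n, 2) * (1 - 1 / m))
    print(var, "vs", comb(n, 2) * (1 - 1 / m) / m)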

E.2.2 A missing hash function


A clever programmer inserts X_i elements in each of m buckets in a hash table,
where each bucket i is implemented as a balanced binary search tree with
search cost at most ⌈lg(X_i + 1)⌉. We are interested in finding a particular
target element x, which is equally likely to be in any of the buckets, but we
don't know what the hash function is.
Suppose that we know that the X_i are independent and identically
distributed with E[X_i] = k, and that the location of x is independent of the
values of the X_i. What is the best upper bound we can put on the expected cost
of finding x?

Solution
The expected cost of searching bucket i is E[⌈lg(X_i + 1)⌉]. This is the
expectation of a function of X_i, so we would like to bound it using Jensen's
inequality (§4.3).
Unfortunately, the function f(n) = ⌈lg(n + 1)⌉ is not concave (because
of the ceiling), but 1 + lg(n + 1) > ⌈lg(n + 1)⌉ is. So the expected cost of
searching bucket i is bounded by 1 + lg(E[X_i] + 1) = 1 + lg(k + 1).
Assuming we search the buckets in some fixed order until we find x, we
will search Y buckets, where E[Y] = (m + 1)/2. Because Y is determined by the
position of x, which is independent of the X_i, Y is also independent of the
X_i. So Wald's equation (3.4.3) applies, and the total cost is bounded by

    ((m + 1)/2) (1 + lg(k + 1)).

E.3 Assignment 3: due Wednesday, 2014-10-08, at


17:00
E.3.1 Tree contraction
Suppose that you have a tree data structure with n nodes, in which each
node u has a pointer parent(u) to its parent (or itself, in the case of the root
node root).
Consider the following randomized algorithm for shrinking paths in the
tree: in the first phase, each node u first determines its parent v = parent(u)
and its grandparent w = parent(parent(u)). In the second phase, it sets
parent0 (u) to v or w according to a fair coin-flip independent of the coin-flips
of all the other nodes.
Let T be the tree generated by the parent pointers and T′ the tree
generated by the parent′ pointers. An example of two such trees is given in
Figure E.1.
Recall that the depth of a node is defined by depth(root) = 0 and
depth(u) = 1 + depth(parent(u)) when u ≠ root. The depth of a tree is equal
to the maximum depth of any node.
Let D be the depth of T, and D′ be the depth of T′. Show that there is a
constant a, such that for any fixed c > 0, with probability at least 1 − n^{−c} it
holds that

    a · D − O(√(D log n)) ≤ D′ ≤ a · D + O(√(D log n)).        (E.3.1)

[Figure E.1: Example of tree contraction for Problem E.3.1. Tree T is on
the left, T′ on the right. Nodes 1, 3, 5, and 6 (in boldface) switch to their
grandparents. The other nodes retain their original parents.]

Solution
For any node u, let depth(u) be the depth of u in T and depth0 (u) be the
depth of u in T 0 . Note that depth0 (u) is a random variable. We will start by
0 
computing E depth (u) as a function of depth(u), by solving an appropriate
recurrence.
Let S(k) = depth0 (u) when depth(u) = k. The base cases are S(0) = 0
(the depth of the root never changes) and S(1) = 1 (same for the root’s
children). For larger k, we have
1   1 
E depth0 (u) = E 1 + depth0 (parent(u)) + E 1 + depth0 (parent(parent(u)))
  
2 2
or
1 1
S(k) = 1 + S(k − 1) + S(k − 2).
2 2
There are various ways to solve this recurrence. The most direct may be
to define a generating function F(z) = Σ_{k=0}^{∞} S(k)z^k. Then the recurrence
becomes

    F = z/(1 − z) + (1/2)zF + (1/2)z²F.

Solving for F gives

    F = (z/(1 − z)) / (1 − (1/2)z − (1/2)z²)
      = 2z / ((1 − z)(2 − z − z²))
      = 2z / ((1 − z)²(2 + z))
      = 2z ((1/3)/(1 − z)² + (1/9)/(1 − z) + (1/18)/(1 + (1/2)z))
      = (2/3) · z/(1 − z)² + (2/9) · z/(1 − z) + (1/9) · z/(1 + (1/2)z),

from which we can read off the exact solution

    S(k) = (2/3)k + 2/9 + (1/9)(−1/2)^{k−1}

when k ≥ 1.⁴
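A quick hypothetical check (illustrative only) that this closed form satisfies the recurrence:

    def closed(k):
        return 2 * k / 3 + 2 / 9 + (1 / 9) * (-1 / 2) ** (k - 1) if k else 0.0

    s = [0.0, 1.0]                      # S(0) = 0, S(1) = 1
    for k in range(2, 12):
        s.append(1 + s[k - 1] / 2 + s[k - 2] / 2)
    print(all(abs(s[k] - closed(k)) < 1e-9 for k in range(12)))   # True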
We can easily show that depth′(u) is tightly concentrated around E[depth′(u)]
using McDiarmid's inequality (5.3.13). Let X_i, for i = 2 . . . depth(u), be the
choice made by u's depth-i ancestor. Then changing one X_i changes depth′(u)
by at most 1. So we get

    Pr[depth′(u) ≥ E[depth′(u)] + t] ≤ e^{−2t²/(depth(u)−1)}        (E.3.2)

and similarly

    Pr[depth′(u) ≤ E[depth′(u)] − t] ≤ e^{−2t²/(depth(u)−1)}.       (E.3.3)
Let t = √((1/2) D ln(1/ε)). Then the right-hand side of (E.3.2) and (E.3.3)
becomes e^{−D ln(1/ε)/(depth(u)−1)} ≤ e^{−ln(1/ε)} = ε. For ε = (1/2)n^{−c−1}, we get
t = √((1/2) D ln(2n^{c+1})) ≤ √(((c+1)/2) D (ln n + ln 2)) = O(√(D log n)) when c
is constant.
For the lower bound on D′, we can apply (E.3.3) to a single node u
with depth(u) = D; this node by itself will give D′ ≥ (2/3)D − O(√(D log n)) with
probability at least 1 − (1/2)n^{−c−1}. For the upper bound, we need to take the
maximum over all nodes. In general, an upper bound on the maximum of a bunch
of random variables is likely to be larger than an upper bound on any one of
the random variables individually, because there is a lot of room for one of
the variables to get unlucky, but we can apply the union bound to get around
this. For each individual u, we have Pr[depth′(u) ≥ (2/3)D + O(√(D log n))] ≤
(1/2)n^{−c−1}, so Pr[D′ ≥ (2/3)D + O(√(D log n))] ≤ Σ_u (1/2)n^{−c−1} = (1/2)n^{−c}. This
completes the proof.

⁴A less direct but still effective approach is to guess that S(k) grows linearly, and find a
and b such that S(k) ≤ ak + b. For this we need ak + b ≥ 1 + (1/2)(a(k−1) + b) + (1/2)(a(k−2) + b).
The b's cancel, leaving ak ≥ 1 + ak − (3/2)a. Now the ak's cancel, leaving us with 0 ≥ 1 − (3/2)a,
or a ≥ 2/3. We then go back and make b = 1/3 to get the right bound on S(1), giving the
bound S(k) ≤ (2/3)k + 1/3. We can then repeat the argument for S(k) ≥ a′k + b′ to get a full
bound (2/3)k ≤ S(k) ≤ (2/3)k + 1/3.

E.3.2 Part testing


You are running a factory that produces parts for some important piece of
machinery. Many of these parts are defective, and must be discarded. There
are two levels of tests that can be performed:

• A normal test, which passes a part with probability 2/3.

• A rigorous test, which passes a part with probability 1/3.

At each step, the part inspectors apply the following rules to decide which
test to apply:

• For the first part, a fair coin-flip decides between the tests.

• For subsequent parts, if the previous part passed, the inspectors become
suspicious and apply the rigorous test; if it failed, they relax and apply
the normal test.

For example, writing N+ for a part that passes the normal test, N- for
one that fails the normal test, R+ for a part that passes the rigorous test,
and R- for one that fails the rigorous test, a typical execution of the testing
procedure might look like N- N+ R- N- N+ R- N+ R- N- N- N- N+ R+ R-
N+ R+. This execution tests 16 parts and passes 7 of them.
Suppose that we test n parts. Let S be the number that pass.

1. Compute E [S].

2. Show that there is a constant c > 0 such that, for any t > 0,

    Pr[|S − E[S]| ≥ t] ≤ 2e^{−ct²/n}.        (E.3.4)

Solution
Using McDiarmid's inequality and some cleverness. Let X_i be the
indicator variable for the event that part i passes, so that S = Σ_{i=1}^{n} X_i.

1. We can show by induction that E[X_i] = 1/2 for all i. The base case
is X_1, where Pr[part 1 passes] = (1/2) Pr[part 1 passes rigorous test] +
(1/2) Pr[part 1 passes normal test] = (1/2)(1/3) + (1/2)(2/3) = 1/2. For i > 1,
E[X_{i−1}] = 1/2 implies that part i is tested with the normal and rigorous tests
with equal probability, so the analysis for X_1 carries through and gives
E[X_i] = 1/2 as well. Summing over all X_i gives E[S] = n/2.
2. We can't use Chernoff, Hoeffding, or Azuma here, because the X_i are
not independent, and do not form a martingale difference sequence even
after centralizing them by subtracting off their expectations. So we
are left with McDiarmid's inequality, unless we want to do something
clever and new (we don't). Applying McDiarmid to the X_i directly
doesn't work so well, but we can make it work with a different set of
variables that generate the same outcomes.
Let Y_i ∈ {A, B, C} be the grade of part i, where A means that it passes
both the rigorous and the normal test, B means that it fails the rigorous
test but passes the normal test, and C means that it fails both tests.
In terms of the X_i, Y_i = A means X_i = 1, Y_i = C means X_i = 0,
and Y_i = B means X_i = 1 − X_{i−1} (when i > 1). We get the right
probabilities for passing each test by making each grade equally likely.
We can either handle the coin-flip at the beginning by including an
extra variable Y_0, or we can combine the coin-flip with Y_1 by assuming
that Y_1 is either A or C with equal probability. The latter approach
improves our bound a little bit, since then we only have n variables and
not n + 1.
Now suppose that we fix all Y_j for j ≠ i and ask what happens if Y_i
changes.

(a) If j < i, then Xj is not affected by Yi .


(b) Let k > i be such that Yk ∈ {A, C}. Then Xk is not affected by
Yi , and neither is Xj for any j > k.

It follows that changing Y_i can only change X_i, . . . , X_{i+ℓ}, where ℓ is
the number of B grades that follow position i.
There are two cases for the sequence X_i . . . X_{i+ℓ}:

(a) If X_i = 0, then X_{i+1} = 1, X_{i+2} = 0, etc.

(b) If X_i = 1, then X_{i+1} = 0, X_{i+2} = 1, etc.

If ℓ is odd, changing Y_i thus has no effect on Σ_{j=i}^{i+ℓ} X_j, while if ℓ is
even, changing Y_i changes the sum by 1. In either case, the effect of
changing Y_i is bounded by 1, and McDiarmid's inequality applies with
c_i = 1, giving

    Pr[|S − E[S]| ≥ t] ≤ 2e^{−2t²/n}.

E.4 Assignment 4: due Wednesday, 2014-10-29, at


17:00
E.4.1 A doubling strategy
A common strategy for keeping the load factor of a hash table down is to
double its size whenever it gets too full. Suppose that we start with a hash
table of size 1 and double its size whenever two items in a row hash to the
same location.
Effectively, this means that we attempt to insert all elements into a table
of size 1; if two consecutive items hash to the same location, we start over
and try to do the same to a table of size 2, and in general move to a table of
size 2^{k+1} whenever any two consecutive elements hash to the same location
in our table of size 2^k.
Assume that we are using an independently chosen 2-universal hash
function for each table size. Show that the expected final table size is O(n).

Solution
Let X be the random variable representing the final table size. Our goal is
to bound E [X].
First let’s look at the probability that we get at least one collision between
consecutive elements when inserting n elements into a table with m locations.
Because the pairs of consecutive elements overlap, computing the exact
probability that we get a collision is complicated, but we only need an upper
bound.
We have n − 1 consecutive pairs, and each produces a collision with
probability at most 1/m. This gives a total probability of a collision of at
most (n − 1)/m.
Let k = ⌈lg n⌉, so that 2^k ≥ n. Then the probability of a consecutive
collision in a table with 2^{k+ℓ} locations is at most (n − 1)/2^{k+ℓ} < 2^k/2^{k+ℓ} = 2^{−ℓ}.

Since the events that collisions occur at each table size are independent, we
can compute, for ℓ > 0,

    Pr[X = 2^{k+ℓ}] ≤ Pr[X ≥ 2^{k+ℓ}] ≤ Π_{i=0}^{ℓ−1} 2^{−i} = 2^{−ℓ(ℓ−1)/2}.

From this it follows that

    E[X] = Σ_{i=0}^{∞} 2^i Pr[X = 2^i]
         ≤ 2^k Pr[X ≤ 2^k] + Σ_{ℓ=1}^{∞} 2^{k+ℓ} Pr[X = 2^{k+ℓ}]
         ≤ 2^k + Σ_{ℓ=1}^{∞} 2^{k+ℓ} · 2^{−ℓ(ℓ−1)/2}
         = 2^k + 2^k Σ_{ℓ=1}^{∞} 2^{ℓ−ℓ(ℓ−1)/2}
         = 2^k + 2^k Σ_{ℓ=1}^{∞} 2^{−(ℓ²−3ℓ)/2}
         = O(2^k),

since the series converges to some constant that does not depend on k. But
we chose k so that 2k = O(n), so this gives us our desired bound.
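Here is a hypothetical simulation of the doubling strategy. It uses the standard construction h(x) = ((ax + b) mod p) mod m with a large prime p as the 2-universal family; the prime, key range, and trial counts are illustrative assumptions:

    import random

    P = (1 << 61) - 1          # a prime much larger than any key below

    def fresh_hash(m):
        a, b = random.randrange(1, P), random.randrange(P)
        return lambda x: ((a * x + b) % P) % m

    def final_size(keys):
        m = 1
        while True:
            h = fresh_hash(m)
            codes = [h(x) for x in keys]
            if all(codes[i] != codes[i + 1] for i in range(len(codes) - 1)):
                return m
            m *= 2             # consecutive collision: double and rehash

    n = 1000
    keys = random.sample(range(10 ** 6), n)
    sizes = [final_size(keys) for _ in range(50)]
    print(sum(sizes) / len(sizes), "for n =", n)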

E.4.2 Hash treaps


Consider the following variant on a treap: instead of choosing the heap
keys randomly, we choose a hash function h : U → [1 . . . m] from some
strongly 2-universal hash family, and use h(x) as the heap key for tree key x.
Otherwise the treap operates as usual.5
Suppose that |U | ≥ n. Show that there is a sequence of n distinct tree
keys such that the total expected time to insert them into an initially empty
hash treap is Ω(n2 /m).
5
If m is small, we may get collisions in the heap key values. Assume in this case that a
node will stop rising if its parent has the same key.

Solution
Insert the sequence 1 . . . n.
Let us first argue by induction on i that all elements x with h(x) = m
appear as the uppermost elements of the right spine of the treap. Suppose
that this holds for i − 1. If h(i) = m, then after insertion i is rotated up
until it has a parent that also has heap key m; this extends the sequence
of elements with heap key m in the spine by 1. Alternatively, if h(i) < m,
then i is never rotated above an element x with h(x) = m, so the sequence
of elements with heap key m is unaffected.
Because each new element has a larger tree key than all previous elements,
inserting a new element i requires moving past any elements in the right
spine, and in particular requires moving past any elements j < i with
h(j) = m. So the expected cost of inserting i is at least the expected
number of such elements j. Because h is chosen from a strongly 2-universal
hash family, Pr [h(j) = m] = 1/m for any j, and by linearity of expectation,
E [|{j < i | h(j) = m}|] = (i − 1)/m. Summing this quantity over all i gives
a total expected insertion cost of at least n(n − 1)/2m = Ω(n²/m).

E.5 Assignment 5: due Wednesday, 2014-11-12, at


17:00
E.5.1 Agreement on a ring
Consider a ring of n processes at positions 0, 1, . . . n − 1. Each process i
has a value A[i] that is initially 0 or 1. At each step, we choose a process
r uniformly at random, and copy A[r] to A[(r + 1) mod n]. Eventually, all
A[i] will have the same value.
Let A_t[i] be the value of A[i] at time t, and let X_t = Σ_{i=0}^{n−1} A_t[i] be the
total number of ones in A_t. Let τ be the first time at which X_τ ∈ {0, n}.

1. Suppose that we start with X0 = k. What is the probability that we


eventually reach a state that is all ones?

2. What is E [τ ], assuming we start from the worst possible initial state?

Solution
This is a job for the optional stopping theorem. Essentially we are going to
follow the same analysis from §9.4.1 for a random walk with two absorbing
barriers, applied to Xt .

Let Ft be the σ-algebra generated by A0 , . . . , At . Then {Ft } forms a


filtration, and each Xt is Ft -measurable.

1. For the first part, we show that (Xt , Ft ) is a martingale. The intuition
is that however the bits in At are arranged, there are always exactly
the same number of positions where a zero can be replaced by a one as
there are where a one can be replaced by a zero.
Let Rt be the random location chosen in state At . Observe that

E [Xt+1 | Ft , Rt = i] = Xt + A[i] − A[(i + 1) mod n].

But then
    E[X_{t+1} | F_t] = Σ_{i=0}^{n−1} (1/n)(X_t + A[i] − A[(i+1) mod n])
                     = X_t + (1/n)X_t − (1/n)X_t
                     = X_t,

which establishes the martingale property.


We also have that (a) τ is a stopping time with respect to the Fi ; (b)
Pr [τ < ∞] = 1, because from any state there is a nonzero chance that
all A[i] are equal n steps later; and (c) Xt is bounded between 0 and
n. So the optional stopping theorem applies, giving

E [Xτ ] = E [X0 ] = k.

But then

    E[X_τ] = n · Pr[X_τ = n] + 0 · Pr[X_τ = 0] = k,

so Pr[X_τ = n] = k/n.

2. For the second part, we use a variant on the X_t² − t martingale.
Let Y_t count the number of positions i for which A_t[i] = 1 and A_t[(i +
1) mod n] = 0. Then, conditioning on F_t, we have

    X_{t+1}² = X_t² + 2X_t + 1   with probability Y_t/n,
    X_{t+1}² = X_t² − 2X_t + 1   with probability Y_t/n, and
    X_{t+1}² = X_t²              otherwise.

(The number of positions where a one can be copied onto a zero equals
the number where a zero can be copied onto a one, since the two kinds
of boundaries alternate around the ring.) The conditional expectation
sums to

    E[X_{t+1}² | F_t] = X_t² + 2Y_t/n.

Let Z_t = X_t² − 2t/n when t ≤ τ, and Z_t = X_τ² − 2τ/n otherwise. Then, for
t < τ, we have

    E[Z_{t+1} | F_t] = E[X_{t+1}² − 2(t + 1)/n | F_t]
                     = X_t² + 2Y_t/n − 2(t + 1)/n
                     = X_t² − 2t/n + 2Y_t/n − 2/n
                     = Z_t + 2Y_t/n − 2/n
                     ≥ Z_t,

because Y_t ≥ 1 whenever 0 < X_t < n.

For t ≥ τ, Z_{t+1} = Z_t, so in either case Z_t has the submartingale
property.
We have previously established that τ has bounded expectation, and
it’s easy to see that Zt has bounded step size. So the optional stopping
theorem applies to Zt , and E [Z0 ] ≤ E [Zτ ].
Let X_0 = k. Then E[Z_0] = k², and E[Z_τ] = (k/n) · n² − 2E[τ]/n =
kn − 2E[τ]/n. But then

    k² ≤ kn − 2E[τ]/n,

which gives

    E[τ] ≤ (kn − k²)/(2/n) = (kn² − k²n)/2.
This is maximized at k = ⌊n/2⌋, giving

    E[τ] ≤ (⌊n/2⌋ · n² − ⌊n/2⌋² · n)/2.

For even n, this is just n³/4 − n³/8 = n³/8.
For odd n, this is (n − 1)n²/4 − (n − 1)²n/8 = n³/8 − n/8. So there
is a slight win for odd n, from not being able to start with an exact
half-and-half split.

To show that this bound is achieved in the worst case, observe that if we
start with contiguous regions of k ones and n − k zeros in A_t, then
(a) Y_t = 1, and (b) the two-region property is preserved in A_{t+1}. In
this case, for t < τ it holds that E[Z_{t+1} | F_t] = Z_t + 2Y_t/n − 2/n = Z_t,
so Z_t is a martingale, and thus E[τ] = (kn² − k²n)/2. This shows that the
initial state with ⌊n/2⌋ consecutive zeros and ⌈n/2⌉ consecutive ones
(or vice versa) gives the claimed worst-case time.
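A hypothetical simulation from the worst-case initial state (one contiguous block of ⌊n/2⌋ ones), whose average absorption time should match (kn² − k²n)/2 = n³/8 for even n; the parameters are illustrative:

    import random

    def absorb_time(n):
        a = [1] * (n // 2) + [0] * (n - n // 2)
        ones, t = n // 2, 0
        while 0 < ones < n:
            r = random.randrange(n)
            s = (r + 1) % n
            ones += a[r] - a[s]     # copying A[r] over A[s] changes the count
            a[s] = a[r]
            t += 1
        return t

    n, trials = 16, 300
    print(sum(absorb_time(n) for _ in range(trials)) / trials, "vs", n ** 3 / 8)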

E.5.2 Shuffling a two-dimensional array


A programmer working under tight time constraints needs to write a procedure
to shuffle the elements of an n × n array, so that all (n²)! permutations
are equally likely. Some quick Googling suggests that this can be reduced
to shuffling a one-dimensional array, for which the programmer’s favorite
language provides a convenient library routine that runs in time linear in
the size of the array. Unfortunately, the programmer doesn’t read the next
paragraph about converting the 2-d array to a 1-d array first, and instead
decides to pick one of the 2n rows or columns uniformly at random at each
step, and call the 1-d shuffler on this row or column.
Let A_t be the state of the array after t steps, where each step shuffles
one row or column, and let B be a random variable over permutations of
the original array state that has a uniform distribution. Show that the 2-d
shuffling procedure above is asymptotically worse than the direct approach,
by showing that there is some f(n) = ω(n) such that after f(n) steps,
d_{TV}(A_{f(n)}, B) = 1 − o(1).⁶

⁶I've been getting some questions about what this means, so here is an attempt to
translate it into English. Recall that f(n) is ω(g(n)) if lim_{n→∞} f(n)/g(n) goes to infinity,
and f(n) is o(g(n)) if lim_{n→∞} f(n)/g(n) goes to zero. The problem is asking you to show
that there is some f(n) that is more than a constant times n, such that the total variation
distance between A_{f(n)} and B becomes arbitrarily close to 1 for sufficiently large n. So,
for example, if you showed that at t = n⁴, d_{TV}(A_t, B) ≥ 1 − 1/log² n, that would
demonstrate the claim, because lim_{n→∞} n⁴/n goes to infinity and lim_{n→∞} (1/log² n)/1
goes to zero. (These functions are, of course, for illustration only. The actual process
might or might not converge by time n⁴.)

Solution
Consider the n diagonal elements in positions A_{ii}. For each such element,
there is a 1/n chance at each step that its row or column is chosen. The
time until every diagonal node is picked at least once maps to the coupon
collector problem, which means that it is Ω(n log n) with high probability
using standard concentration bounds.
Let C be the event that there is at least one diagonal element that is
in its original position. If there is some diagonal node that has not moved,
C holds; so with probability 1 − o(1), C holds in A_t at some time t that is
Θ(n log n) = ω(n). But by the union bound, C holds in B with probability
at most n · n^{−2} = 1/n. So the difference between the probability of C in A_t
and in B is at least 1 − o(1) − 1/n = 1 − o(1).

E.6 Assignment 6: due Wednesday, 2014-12-03, at


17:00
E.6.1 Sampling colorings on a cycle
Devise a Las Vegas algorithm for sampling 3-colorings of a cycle.
Given n > 1 nodes in a cycle, your algorithm should return a random
coloring X of the nodes with the usual constraint that no edge should have
the same color on both endpoints. Your algorithm should generate all possible
colorings with equal probability, and run in time polynomial in n on average.

Solution
Since it’s a Las Vegas algorithm, Markov chain Monte Carlo is not going to
help us here. So we can set aside couplings and conductance and just go
straight for generating a solution.
First, let’s show that we can generate uniform colorings of an n-node
line in linear time. Let Xi be the color of node i, where 0 ≤ i < n. Choose
X0 uniformly at random from all three colors; then for each i > 0, choose
Xi uniformly at random from the two colors not chosen for Xi−1 . Given a
coloring, we can show by induction on i that it can be generated by this
process, and because each choice is uniform and each coloring is generated
only once, we get all 3 · 2^{n−1} colorings of the line with equal probability.
Now we try hooking X_{n−1} to X_0. If X_{n−1} = X_0, then we don't have a
cycle coloring, and have to start over. The probability that this event occurs
is at most 1/2, because for every path coloring with X_{n−1} = X_0, there is
another coloring where we replace X_{n−1} with a color not equal to X_{n−2} or
X_0. So after at most 2 attempts on average we get a good cycle coloring.
This gives a total expected cost of O(n).
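A minimal sketch of this sampler in Python, with a frequency check that the colorings of a small cycle come out uniform (trial counts are illustrative):

    import random
    from collections import Counter

    def random_cycle_coloring(n):
        while True:
            x = [random.randrange(3)]
            for i in range(1, n):
                x.append(random.choice([c for c in range(3) if c != x[i - 1]]))
            if x[-1] != x[0]:       # endpoints collide: reject and retry
                return x

    counts = Counter(tuple(random_cycle_coloring(4)) for _ in range(90000))
    print(len(counts), "colorings; counts range from",
          min(counts.values()), "to", max(counts.values()))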

E.6.2 A hedging problem


Suppose that you own a portfolio of n investment vehicles numbered 1, . . . n,
where investment i pays out aij ∈ {−1, +1} at time j, where 0 < j ≤ m. You
have carefully chosen these investments so that your total payout Σ_{i=1}^{n} a_ij
for any time j is zero, eliminating all risk.


Unfortunately, in doing so you have run afoul of securities regulators, who
demand that you sell off half of your holdings—a demand that, fortunately,
you anticipated by making n even.
This will leave you with a subset S consisting of n/2 of the a_i, and your
net worth at time t will now be w_t = Σ_{j=1}^{t} Σ_{i∈S} a_ij. If your net worth drops
too low at any time t, all your worldly goods will be repossessed, and you
will have nothing but your talent for randomized algorithms to fall back on.
Note: an earlier version of this problem demanded a tighter
bound.
Show that when n and m are sufficiently large, it is always possible to
choose a subset S of size n/2 so that w_t ≥ −m√(n ln nm) for all 0 < t ≤ m,
and give an algorithm that finds such a subset in time polynomial in n and
m on average.

Solution
Suppose that we flip a coin independently to choose whether to include each
investment i. There are two bad things that can happen:

1. We lose too much at some time t from the investments the coin chooses.

2. We don’t get exactly n/2 heads.

If we can show that the sum of the probabilities of these bad events is
less than 1, we get the existence proof we need. If we can show that it is
enough less than 1, we also get an algorithm, because we can test in time
O(nm) if a particular choice works.
Let X_i be the indicator variable for the event that we include investment
i. Then

    w_t = Σ_{i=1}^{n} X_i (Σ_{j=1}^{t} a_ij)
        = Σ_{i=1}^{n} (1/2 + (X_i − 1/2)) (Σ_{j=1}^{t} a_ij)
        = (1/2) Σ_{j=1}^{t} Σ_{i=1}^{n} a_ij + Σ_{i=1}^{n} (X_i − 1/2) (Σ_{j=1}^{t} a_ij)
        = Σ_{i=1}^{n} (X_i − 1/2) (Σ_{j=1}^{t} a_ij),

where the last step uses the fact that Σ_{i=1}^{n} a_ij = 0 for every j.
Because X_i − 1/2 is always ±1/2, E[X_i − 1/2] = 0, and each a_ij is ±1, each
term in the outermost sum is a zero-mean random variable that satisfies
|(X_i − 1/2) Σ_{j=1}^{t} a_ij| ≤ t/2 ≤ m/2. So Hoeffding's inequality says

    Pr[w_t − E[w_t] < −m√(n ln nm)] ≤ e^{−m²n ln nm/(2n(m/2)²)}
                                    = e^{−2 ln nm}
                                    = (nm)^{−2}.

Summing over all t, the probability that this bound is violated for any t
is at most 1/(n²m).
For the second source of error, we have Pr[Σ_{i=1}^{n} X_i ≠ n/2] = 1 − (n choose n/2)/2^n =
1 − Θ(1/√n). So the total probability that the random assignment fails is
bounded by 1 − Θ(1/√n) + 1/(n²m), giving a probability that it succeeds of at
least Θ(1/√n) − 1/(n²m) = Θ(1/√n). It follows that generating and testing
random assignments gives an assignment with the desired characteristics
after Θ(√n) trials on average, giving a total expected cost of Θ(n^{3/2}m).
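A hypothetical sketch of the generate-and-test algorithm; the payout matrix and all parameters below are illustrative assumptions, not from the original problem:

    import math, random

    def find_subset(a, n, m):
        bound = -m * math.sqrt(n * math.log(n * m))
        while True:
            s = [i for i in range(n) if random.random() < 0.5]
            if len(s) != n // 2:
                continue                 # wrong number of heads: retry
            w, ok = 0, True
            for j in range(m):
                w += sum(a[i][j] for i in s)
                if w < bound:            # net worth dropped too low: retry
                    ok = False
                    break
            if ok:
                return s

    # an example zero-sum payout matrix: pair up +1 rows and -1 rows
    n, m = 20, 50
    a = [[+1] * m for _ in range(n // 2)] + [[-1] * m for _ in range(n // 2)]
    print(sorted(find_subset(a, n, m)))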

E.7 Final exam


Write your answers in the blue book(s). Justify your answers. Work alone.
Do not use any notes or books.
There are four problems on this exam, each worth 20 points, for a total
of 80 points. You have approximately three hours to complete this exam.
APPENDIX E. SAMPLE ASSIGNMENTS FROM SPRING 2014 416

E.7.1 Double records (20 points)


Let A[1 . . . n] be an array holding the integers 1 . . . n in random order, with
all n! permutations of the elements of A equally likely.
Call A[k], where 1 ≤ k ≤ n, a record if A[j] < A[k] for all j < k.
Call A[k], where 2 ≤ k ≤ n, a double record if both A[k − 1] and A[k]
are records.
Give an asymptotic (big-Θ) bound on the expected number of double
records in A.

Solution
Suppose we condition on a particular set of k elements appearing in positions
A[1] through A[k]. By symmetry, all k! permutations of these elements are
equally likely. Putting the largest two elements in A[k − 1] and A[k] leaves
(k − 2)! choices for the remaining elements, giving a probability of a double
record at k of exactly (k − 2)!/k! = 1/(k(k − 1)).
Applying linearity of expectation gives a total expected number of double
records of

    Σ_{k=2}^{n} 1/(k(k − 1)) ≤ Σ_{k=2}^{n} 1/(k − 1)²
                            ≤ Σ_{k=2}^{∞} 1/(k − 1)²
                            = Σ_{k=1}^{∞} 1/k²
                            = π²/6
                            = O(1).

Since the expected number of double records is at least 1/2 = Ω(1) for
n ≥ 2, this gives a tight asymptotic bound of Θ(1).
I liked seeing our old friend π²/6 so much that I didn't notice an easier
exact bound, which several people supplied in their solutions:

    Σ_{k=2}^{n} 1/(k(k − 1)) = Σ_{k=2}^{n} (1/(k − 1) − 1/k)
                             = Σ_{k=1}^{n−1} 1/k − Σ_{k=2}^{n} 1/k
                             = 1 − 1/n
                             = O(1).

E.7.2 Hipster graphs (20 points)


Consider the problem of labeling the vertices of a 3-regular graph with labels
from {0, 1} to maximize the number of happy nodes, where a node is happy
if its label is the opposite of the majority of its three neighbors.
Give a deterministic algorithm that takes a 3-regular graph as input and
computes, in time polynomial in the size of the graph, a labeling that makes
at least half of the nodes happy.

Solution
This problem produced the widest range of solutions, including several very
clever deterministic algorithms. Here are some examples.

Using the method of conditional probabilities. If we are allowed a
randomized algorithm, it's easy to make exactly half of the nodes happy on
average: simply label each node independently and uniformly at random,
observe that each node individually has a 1/2 chance at happiness, and
apply linearity of expectation.
To turn this into a deterministic algorithm, we’ll apply the method of
conditional expectations. Start with an unlabeled graph. At each step, pick
a node and assign it a label that maximizes the expected number of happy
nodes conditioned on the labeling so far, and assuming that all later nodes
are labeled independently and uniformly at random. We can compute this
conditional expectation in linear time by computing the value for each node
(there are only 3⁴ possible partial labelings of the node and its immediate
neighbors, so computing the expected happiness of any particular node can
be done in constant time by table lookup; to compute the table, we just
enumerate all possible assignments to the unlabeled nodes). So in O(n²)
time we get a labeling in which the number of happy nodes is at least the
n/2 expected happy nodes we started with.
With a bit of care, the cost can be reduced to linear: because each new
labeled node only affects its own probability of happiness and those of its
three neighbors, we can update the conditional expectation by just updating
the values for those four nodes. This gives O(1) cost per step or O(n) total.

Using hill climbing. The following algorithm is an adaptation of the
solution of Rose Sloan, and demonstrates that it is in fact possible to make
all of the nodes happy in linear time.
Start with an arbitrary labeling (say, all 0). At each step, choose an
unhappy node and flip its label. This reduces the number of monochromatic
edges by at least 1. Because we have only 3n/2 edges, we can repeat this
process at most 3n/2 times before it terminates. But it only terminates
when there are no unhappy nodes.
To implement this in linear time, maintain a queue of all unhappy nodes.
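Here is a minimal sketch of this hill-climbing algorithm; the adjacency-list representation and the queue-based bookkeeping are illustrative assumptions:

    from collections import deque

    def make_all_happy(adj):
        label = {v: 0 for v in adj}
        def unhappy(v):
            # unhappy: label agrees with the majority of the 3 neighbors
            return sum(1 for w in adj[v] if label[w] == label[v]) >= 2
        queue = deque(v for v in adj if unhappy(v))
        while queue:
            v = queue.popleft()
            if unhappy(v):              # may have been fixed by a neighbor's flip
                label[v] ^= 1           # each flip removes >= 1 monochromatic edge
                queue.extend(w for w in adj[v] if unhappy(w))
        return label

    # Example: K_4 is 3-regular.
    adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
    print(make_all_happy(adj))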

E.7.3 Storage allocation (20 points)


Suppose that you are implementing a stack in an array, and you need to
figure out how much space to allocate for the array. At time 0, the size of
the stack is X0 = 0. At each subsequent time t, the user flips an independent
fair coin to choose whether to push onto the stack (Xt = Xt−1 + 1) or pop
from the stack (Xt = Xt−1 − 1). The exception is that when the stack is
empty, the user always pushes.
You are told in advance that the stack will only be used for n time units.
Your job is to choose a size s for the array so that it will overflow at most
half the time: Pr[max_t X_t > s] < 1/2. As an asymptotic (big-O) function of
n, what is the smallest value of s you can choose?

Solution
We need two ideas here. First, we'll show that X_t² − t is a martingale, despite
the fact that X_t by itself isn't. Second, we'll use the idea of stopping X_t
when it hits s + 1, creating a new martingale Y_t = X_{t∧τ}² − (t ∧ τ), where τ is
the first time at which X_t = s + 1 (and t ∧ τ is shorthand for min(t, τ)). We can
then apply Markov's inequality to X_{n∧τ}.
To save time, we'll skip directly to showing that Y_t is a martingale. There
are two cases:

1. If t < τ and X_t > 0, then

    E[Y_{t+1} | Y_t, t < τ] = (1/2)((X_t + 1)² − (t + 1)) + (1/2)((X_t − 1)² − (t + 1))
                            = X_t² + 1 − (t + 1)
                            = X_t² − t
                            = Y_t.

If t < τ and X_t = 0, the user always pushes, giving Y_{t+1} = 1 − (t + 1) =
−t = Y_t as well.

2. If t ≥ τ, then Y_{t+1} = Y_t, so E[Y_{t+1} | Y_t, t ≥ τ] = Y_t.

In either case the martingale property holds.
It follows that E[Y_n] = E[Y_0] = 0, or E[X_{n∧τ}² − (n ∧ τ)] = 0, giving
E[X_{n∧τ}²] = E[n ∧ τ] ≤ n. Now apply Markov's inequality:

    Pr[max_t X_t > s] = Pr[X_{n∧τ} ≥ s + 1]
                      = Pr[X_{n∧τ}² ≥ (s + 1)²]
                      ≤ E[X_{n∧τ}²] / (s + 1)²
                      ≤ n / (√(2n) + 1)²
                      < n / (2n)
                      = 1/2.

So s = √(2n) = O(√n) is enough.
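A hypothetical simulation of the reflecting walk, estimating Pr[max_t X_t > √(2n)] to compare with the 1/2 bound (parameters illustrative):

    import math, random

    def max_height(n):
        x = best = 0
        for _ in range(n):
            # always push when empty; otherwise push or pop with equal odds
            x = x + 1 if (x == 0 or random.random() < 0.5) else x - 1
            best = max(best, x)
        return best

    n, trials = 10000, 2000
    s = math.isqrt(2 * n)
    rate = sum(max_height(n) > s for _ in range(trials)) / trials
    print(rate, "should be below 1/2")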

E.7.4 Fault detectors in a grid (20 points)


A processing plant for rendering discarded final exam problems harmless
consists of n² processing units arranged in an n × n grid with coordinates (i, j)
each in the range 1 through n. We would like to monitor these processing
units to make sure that they are working correctly, and have access to a
supply of monitors that will detect failures. Each monitor is placed at some
position (i, j), and will detect failures at any of the five positions (i, j),
(i − 1, j), (i + 1, j), (i, j − 1), and (i, j + 1) that are within the bounds of
the grid. This plus-shaped detection range is awkward enough that the
engineer designing the system has given up on figuring out how to pack the
detectors properly, and instead places a detector at each grid location with
independent probability 1/4.

The engineer reasons that since a typical monitor covers 5 grid locations,
using n²/4 monitors on average should cover (5/4)n² locations, with the
extra monitors adding a little bit of safety to deal with bad random choices.
So few if any processing units should escape.

1. Compute the exact expected number of processing units that are not
within range of any monitor, as a function of n. You may assume
n > 0.

2. Show that for any fixed c, the actual number of unmonitored processing
units is within O(n√(log n)) of the expected number with probability at
least 1 − n^{−c}.

Solution
1. This part is actually more annoying, because we have to deal with
nodes on the edges. There are three classes of nodes:

(a) The four corner nodes. To be unmonitored, each corner node
needs to have no monitor in its own location or either of the two
adjacent locations. This event occurs with probability (3/4)³.
(b) The 4(n − 2) edge nodes. These have three neighbors in the grid,
so they are unmonitored with probability (3/4)⁴.
(c) The (n − 2)² interior nodes. These are each unmonitored with
probability (3/4)⁵.

Adding these cases up gives a total expected number of unmonitored
nodes of

    4 · (3/4)³ + 4(n − 2) · (3/4)⁴ + (n − 2)² · (3/4)⁵
        = (243/1024)n² + (81/256)n + 27/256.        (E.7.1)
For n = 1, this analysis breaks down; instead, we can calculate directly
that the sole node in the grid has a 3/4 chance of being unmonitored.
Leaving the left-hand side of (E.7.1) in the original form is probably a
good idea for understanding how the result works, but the right-hand
side demonstrates that this strategy leaves slightly less than a quarter
of the processing units uncovered on average.

2. Let Xij be the indicator for a monitor at position (i, j). Recall that
we have assumed that these variables are independent.

Let S = f(X_11, . . . , X_nn) compute the number of uncovered processing
units. Then changing a single X_ij changes the value of f by at most 5.
So we can apply McDiarmid's inequality to get

    Pr[|S − E[S]| ≥ t] ≤ 2e^{−2t²/(Σ c_ij²)} = 2e^{−2t²/(n²·5²)} = 2e^{−2t²/(25n²)}.

(The actual bound is slightly better, because we are overestimating c_ij
for boundary nodes.)
Set this to n^{−c} and solve for t = n√((25/2)(c ln n + ln 2)) = O(n√(log n)).
Then the bound becomes 2e^{−c ln n − ln 2} = n^{−c}, as desired.
Then the bound becomes 2e−c ln n−ln 2 = n−c , as desired.


Appendix F

Sample assignments from


Spring 2013

F.1 Assignment 1: due Wednesday, 2013-01-30, at


17:00
F.1.1 Bureaucratic part
Send me email! My address is james.aspnes@gmail.com.
In your message, include:

1. Your name.

2. Your status: whether you are an undergraduate, grad student, auditor,


etc.

3. Anything else you’d like to say.

(You will not be graded on the bureaucratic part, but you should do it
anyway.)

F.1.2 Balls in bins


Throw m balls independently and uniformly at random into n bins labeled
1 . . . n. What is the expected number of positions i < n such that bin i and
bin i + 1 are both empty?


Solution
If we can figure out the probability that bins i and i + 1 are both empty for
some particular i, then by symmetry and linearity of expectation we can just
multiply by n − 1 to get the full answer.
For bins i and i + 1 to be empty, every ball must choose another bin.
This occurs with probability (1 − 2/n)^m. The full answer is thus
(n − 1)(1 − 2/n)^m, or approximately ne^{−2m/n} when n is large.
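A quick hypothetical check of this formula by simulation (parameters illustrative):

    import random

    def empty_adjacent_pairs(m, n):
        load = [0] * n
        for _ in range(m):
            load[random.randrange(n)] += 1
        return sum(1 for i in range(n - 1) if load[i] == 0 and load[i + 1] == 0)

    m, n, trials = 30, 20, 20000
    avg = sum(empty_adjacent_pairs(m, n) for _ in range(trials)) / trials
    print(avg, "vs", (n - 1) * (1 - 2 / n) ** m)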

F.1.3 A labeled graph


Suppose you are given a graph G = (V, E), where |V | = n, and you want to
assign labels to each vertex such that the sum of the labels of each vertex
and its neighbors modulo n + 1 is nonzero. Consider the naïve randomized
algorithm that assigns a label in the range 0 . . . n to each vertex independently
and uniformly at random and then tries again if the resulting labeling doesn’t
work. Show that this algorithm will find a correct labeling in time polynomial
in n on average.

Solution
We’ll use the law of total probability. First observe that the probability that
a random labeling yields a zero sum for any single vertex and its neighbors is
exactly 1/(n + 1); the easiest way to see this is that after conditioning on the
values of the neighbors, there is only one value in n + 1 that can be assigned
to the vertex itself to cause a failure. Now sum this probability over all n
vertices to get a probability of failure of at most n/(n + 1). It follows that
after n + 1 attempts on average (each of which takes O(n2 ) time to check all
the neighborhood sums), the algorithm will find a good labeling, giving a
total expected time of O(n3 ).

F.1.4 Negative progress


An algorithm has the property that if it has already run for n steps, it runs
for an additional n + 1 steps on average. Formally, let T ≥ 0 be the random
variable representing the running time of the algorithm, then

E [T | T ≥ n] = 2n + 1. (F.1.1)

For each n ≥ 0, what is the conditional probability Pr [T = n | T ≥ n]


that the algorithm stops just after its n-th step?

Solution
Expand (F.1.1) using the definition of conditional expectation to get

    2n + 1 = Σ_{x=0}^{∞} x Pr[T = x | T ≥ n]
           = Σ_{x=0}^{∞} x Pr[T = x ∧ T ≥ n] / Pr[T ≥ n]
           = (1/Pr[T ≥ n]) Σ_{x=n}^{∞} x Pr[T = x],

which we can rearrange to get

    Σ_{x=n}^{∞} x Pr[T = x] = (2n + 1) Pr[T ≥ n],        (F.1.2)

provided Pr[T ≥ n] is nonzero. We can justify this assumption by observing
that (a) it holds for n = 0, because T ≥ 0 always; and (b) if there is some
n > 0 such that Pr[T ≥ n] = 0, then E[T | T ≥ n − 1] = n − 1, contradicting
(F.1.1).
Substituting n + 1 into (F.1.2) and subtracting from the original gives
the equation

    n Pr[T = n] = Σ_{x=n}^{∞} x Pr[T = x] − Σ_{x=n+1}^{∞} x Pr[T = x]
                = (2n + 1) Pr[T ≥ n] − (2n + 3) Pr[T ≥ n + 1]
                = (2n + 1) Pr[T = n] + (2n + 1) Pr[T ≥ n + 1] − (2n + 3) Pr[T ≥ n + 1]
                = (2n + 1) Pr[T = n] − 2 Pr[T ≥ n + 1].

Since we are looking for Pr[T = n | T ≥ n] = Pr[T = n]/Pr[T ≥ n],
having an equation involving Pr[T ≥ n + 1] is a bit annoying. But we can
borrow a bit of Pr[T = n] from the other terms to make it work:

    n Pr[T = n] = (2n + 1) Pr[T = n] − 2 Pr[T ≥ n + 1]
                = (2n + 3) Pr[T = n] − 2 Pr[T = n] − 2 Pr[T ≥ n + 1]
                = (2n + 3) Pr[T = n] − 2 Pr[T ≥ n].

A little bit of algebra turns this into

    Pr[T = n | T ≥ n] = Pr[T = n]/Pr[T ≥ n] = 2/(n + 3).
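A hypothetical numerical check: construct the distribution whose hazard rate is Pr[T = n | T ≥ n] = 2/(n + 3) and verify (F.1.1) for a few values of n (the truncation horizon is an illustrative choice; the tail contribution it drops is small):

    def cond_expectation(n, horizon=300000):
        surv = 1.0            # Pr[T >= x], starting at x = 0
        num = den = 0.0
        for x in range(horizon):
            h = 2 / (x + 3)
            p = surv * h      # Pr[T = x]
            surv *= 1 - h
            if x >= n:
                num += x * p
                den += p
        return num / den      # approximates E[T | T >= n]

    for n in (0, 1, 5, 10):
        print(n, round(cond_expectation(n), 3), "vs", 2 * n + 1)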

F.2 Assignment 2: due Thursday, 2013-02-14, at


17:00
F.2.1 A local load-balancing algorithm
Suppose that we are trying to balance n jobs evenly between two machines.
Job 1 chooses the left or right machine with equal probability. For i > 1, job
i chooses the same machine as job i − 1 with probability p, and chooses the
other machine with probability 1 − p. This process continues until every job
chooses a machine. We want to estimate how imbalanced the jobs are at the
end of this process.
Let X_i be +1 if the i-th job chooses the left machine and −1 if it chooses
the right. Then S = Σ_{i=1}^{n} X_i is the difference between the number of jobs
that choose the left machine and the number that choose the right. By
symmetry, the expectation of S is zero. What is the variance of S as a
function of p and n?

Solution
To compute the variance, we'll use (5.1.5), which says that Var[Σ_i X_i] =
Σ_i Var[X_i] + 2 Σ_{i<j} Cov[X_i, X_j].
Recall that Cov [Xi , Xj ] = E [Xi Xj ] − E [Xi ] E [Xj ]. Since the last term
is 0 (symmetry again), we just need to figure out E [Xi Xj ] for all i ≤ j (the
i = j case gets us Var [Xi ]).
First, let's compute E[X_j | X_i = 1]. It's easiest to do this starting
with the j = i case: E[X_i | X_i = 1] = 1. For larger j, compute

    E[X_j | X_{j−1}] = pX_{j−1} + (1 − p)(−X_{j−1})
                     = (2p − 1)X_{j−1}.

It follows that

    E[X_j | X_i = 1] = E[(2p − 1)X_{j−1} | X_i = 1]
                     = (2p − 1) E[X_{j−1} | X_i = 1].        (F.2.1)

The solution to this recurrence is E[X_j | X_i = 1] = (2p − 1)^{j−i}.



We next have

    Cov[X_i, X_j] = E[X_i X_j]
                  = E[X_i X_j | X_i = 1] Pr[X_i = 1] + E[X_i X_j | X_i = −1] Pr[X_i = −1]
                  = (1/2) E[X_j | X_i = 1] + (1/2) E[−X_j | X_i = −1]
                  = (1/2) E[X_j | X_i = 1] + (1/2) E[X_j | X_i = 1]
                  = E[X_j | X_i = 1]
                  = (2p − 1)^{j−i},

as calculated in (F.2.1).
So now we just need to evaluate the horrible sum:

    Σ_i Var[X_i] + 2 Σ_{i<j} Cov[X_i, X_j]
        = n + 2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} (2p − 1)^{j−i}
        = n + 2 Σ_{i=1}^{n} Σ_{k=1}^{n−i} (2p − 1)^{k}
        = n + 2 Σ_{i=1}^{n} ((2p − 1) − (2p − 1)^{n−i+1})/(1 − (2p − 1))
        = n + n(2p − 1)/(1 − p) − (1/(1 − p)) Σ_{m=1}^{n} (2p − 1)^{m}
        = n + n(2p − 1)/(1 − p) − ((2p − 1) − (2p − 1)^{n+1})/(2(1 − p)²).   (F.2.2)

This covers all but the p = 1 case, for which the geometric series formula
fails. Here we can compute directly that Var[S] = n², since S will be ±n
with equal probability.
For smaller values of p, plotting (F.2.2) shows the variance increasing
smoothly starting at 0 (for even n) or 1 (for odd n) at p = 0 to n² in the limit
as p goes to 1, with an interesting intermediate case of n at p = 1/2, where
all terms but the first vanish. This makes a certain intuitive sense: when
p = 0, the processes alternate which machine they take, which gives an
even split for even n and a discrepancy of ±1 for odd n; when p = 1/2, the
processes choose machines independently, giving variance n; and for p = 1,
the processes all choose the same machine, giving n².
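A hypothetical check of (F.2.2) against a direct simulation (parameters illustrative):

    import random

    def formula(n, p):
        q = 2 * p - 1
        return n + n * q / (1 - p) - (q - q ** (n + 1)) / (2 * (1 - p) ** 2)

    def sample_variance(n, p, trials=100000):
        total = total2 = 0.0
        for _ in range(trials):
            x = 1 if random.random() < 0.5 else -1
            s = x
            for _ in range(n - 1):
                x = x if random.random() < p else -x
                s += x
            total += s
            total2 += s * s
        mean = total / trials
        return total2 / trials - mean * mean

    n, p = 10, 0.7
    print(sample_variance(n, p), "vs", formula(n, p))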

F.2.2 An assignment problem


Here is an algorithm for placing n balls into n bins. For each ball, we first
select two bins uniformly at random without replacement. If at least one
of the chosen bins is unoccupied, the ball is placed in the empty bin at no
cost. If both chosen bins are occupied, we execute an expensive parallel scan
operation to find an empty bin and place the ball there.

1. Compute the exact value of the expected number of scan operations.

2. Let c > 0. Show that the absolute value of the difference between the
actual
√ number of scan operations and the expected number is at most
O( cn log n) with probability at least 1 − n−c .

Solution
1. Number the balls 1 to n. For ball i, there are i − 1 bins already occupied,
giving a probability of ((i − 1)/n)((i − 2)/(n − 1)) that we choose an occupied bin on
both attempts and incur a scan. Summing over all i gives us that the
expected number of scans is

    Σ_{i=1}^{n} ((i − 1)/n)((i − 2)/(n − 1)) = (1/(n(n − 1))) Σ_{i=1}^{n−1} (i² − i)
        = (1/(n(n − 1))) ((n − 1)n(2n − 1)/6 − (n − 1)n/2)
        = (2n − 1)/6 − 1/2
        = (n − 2)/3,

provided n ≥ 2. For n < 2, we incur no scans.

2. It's tempting to go after this using Chernoff's inequality, but in this case
Hoeffding gives a better bound. Let S be the number of scans. Then S is
the sum of n independent Bernoulli random variables, so (5.3.4) says that
Pr[|S − E[S]| ≥ t] ≤ 2e^{−2t²/n}. Now let t = √(cn ln n) = O(√(cn log n))
to make the right-hand side 2n^{−2c} ≤ n^{−c} for sufficiently large n.
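A hypothetical simulation checking part 1 (parameters illustrative):

    import random

    def scans(n):
        occupied = [False] * n
        count = 0
        for _ in range(n):
            a, b = random.sample(range(n), 2)   # two bins, without replacement
            if occupied[a] and occupied[b]:
                count += 1                      # expensive scan for a free bin
                target = occupied.index(False)
            else:
                target = a if not occupied[a] else b
            occupied[target] = True
        return count

    n, trials = 30, 10000
    print(sum(scans(n) for _ in range(trials)) / trials, "vs", (n - 2) / 3)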

F.2.3 Detecting excessive collusion


Suppose you have n students in some unnamed Ivy League university’s
Introduction to Congress class, and each generates a random ±1 vote, with

both outcomes having equal probability. It is expected that members of the


same final club will vote together, so it may be that many groups of up to k
students each will all vote the same way (flipping a single coin to determine
the vote of all students in the group, with the coins for different groups
being independent). However, there may also be a much larger conspiracy
of exactly m students who all vote the same way (again based on a single
independent fair coin), in violation of academic honesty regulations.
Let c > 0. How large must m be in asymptotic terms as a function of
n, k, and c so that the existence of a conspiracy can be detected solely by
looking at the total vote, where the probability of error (either incorrectly
claiming a conspiracy when none exists or incorrectly claiming no conspiracy
when one exists) is at most n−c ?

Solution
Let S be the total vote. The intuition here is that if there is no conspiracy,
S is concentrated around 0, and if there is a conspiracy, S is concentrated
around ±m. So if m is sufficiently large and |S| ≥ m/2, we can reasonably
guess that there is a conspiracy.
We need to prove two bounds: first, that the probability that we see
|S| ≥ m/2 when there is no conspiracy is small, and second, that the
probability that we see |S| < m/2 when there is a conspiracy is large.
For the first case, let X_i be the total vote cast by the i-th group. This
will be ±n_i with equal probability, where n_i ≤ k is the size of the group.
This gives E[X_i] = 0. We also have that Σ n_i = n.
Because the X_i are all bounded, we can use Hoeffding's inequality (5.3.2),
so long as we can compute an upper bound on Σ n_i². Here we use the fact
that Σ n_i² is maximized subject to 0 ≤ n_i ≤ k and Σ n_i = n by setting as
many n_i as possible to k; this follows from convexity of x ↦ x².¹ We thus
have

    Σ n_i² ≤ ⌊n/k⌋k² + (n mod k)²
           ≤ ⌊n/k⌋k² + (n mod k)k
           ≤ (n/k)k²
           = nk.
¹The easy way to see this is that if f is strictly convex, then f′ is increasing. So if
0 < n_i ≤ n_j < k, increasing n_j by ε while decreasing n_i by ε leaves Σ n_i unchanged while
increasing Σ f(n_i) by ε(f′(n_j) − f′(n_i)) + O(ε²), which will be positive when ε is small
enough. So at any global maximum, we must have that at least one of n_i or n_j equals 0 or
k for any i ≠ j.

Now apply Hoeffding's inequality to get

    Pr[|S| ≥ m/2] ≤ 2e^{−(m/2)²/(2nk)}.

We want to set m so that the right-hand side is less than n^{−c}. Taking
logs as usual gives

    ln 2 − m²/(8nk) ≤ −c ln n,

so the desired bound holds when

    m ≥ √(8nk(c ln n + ln 2)) = Ω(√(ckn log n)).

For the second case, repeat the above analysis on the n −√m votes except
the ±m from the conspiracy. Again we get that if m = Ω( ckn log n), the
−c
√ that these votes exceed m/2 is bounded by n . So in both cases
probability
m = Ω( ckn log n) is enough.
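As an illustration, here is a hedged Monte Carlo sketch of the detection rule
|S| ≥ m/2; the specific group sizes, m, and trial counts are made-up test
parameters, not part of the problem:

    import random

    def total_vote(group_sizes, conspiracy=0):
        # each group flips one fair coin; an optional conspiracy of the
        # given size flips one more
        s = sum(size * random.choice([-1, 1]) for size in group_sizes)
        if conspiracy:
            s += conspiracy * random.choice([-1, 1])
        return s

    n, k, m, trials = 10000, 10, 2000, 1000
    false_pos = sum(abs(total_vote([k] * (n // k))) >= m / 2
                    for _ in range(trials))
    true_pos = sum(abs(total_vote([k] * ((n - m) // k), conspiracy=m)) >= m / 2
                   for _ in range(trials))
    print(false_pos, true_pos)   # expect false_pos small, true_pos near trials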

F.3 Assignment 3: due Wednesday, 2013-02-27, at 17:00
F.3.1 Going bowling
For your crimes, you are sentenced to play n frames of bowling, a game
that involves knocking down pins with a heavy ball, which we are mostly
interested in because of its complicated scoring system.
In each frame, your result may be any of the integers 0 through 9, a
spare (marked as /), or a strike (marked as X). We can think of your result
on the i-th frame as a random variable X_i. Each result gives a base score
B_i, which is equal to X_i when X_i ∈ {0 . . . 9} and 10 when X_i ∈ {/, X}. The
actual score Y_i for frame i is the sum of the base scores for the frame and up
to two subsequent frames, according to the rule:

    Y_i = B_i                      when X_i ∈ {0 . . . 9},
          B_i + B_{i+1}            when X_i = /, and
          B_i + B_{i+1} + B_{i+2}  when X_i = X.

To ensure that B_{i+2} makes sense even for i = n, assume that there exist
random variables X_{n+1} and X_{n+2} and the corresponding B_{n+1} and B_{n+2}.
Suppose that the X_i are independent (but not necessarily identically
distributed). Show that your final score S = Σ_{i=1}^{n} Y_i is exponentially
concentrated² around its expected value.

²This means “more concentrated than you can show using Chebyshev’s inequality.”

Solution
This is a job for McDiarmid’s inequality (5.3.13). Observe that S is a function
of X_1 . . . X_{n+2}. We need to show that changing any one of the X_i won’t
change this function by too much.
From the description of the Y_i, we have that X_i can affect any of Y_{i−2} (if
X_{i−2} = X), Y_{i−1} (if X_{i−1} = /) and Y_i. We can get a crude bound by observing
that each Y_i ranges from 0 to 30, so changing X_i can change Σ Y_i by at most
±90, giving c_i ≤ 90. A better bound can be obtained by observing that X_i
contributes only B_i to each of Y_{i−2} and Y_{i−1}, so changing X_i can only change
these values by up to 10; this gives c_i ≤ 50. An even more pedantic bound
can be obtained by observing that X_1, X_2, X_{n+1}, and X_{n+2} are all special
cases, with c_1 = 30, c_2 = 40, c_{n+1} = 20, and c_{n+2} = 10, respectively; these
values can be obtained by detailed meditation on the rules above.
We thus have Σ_{i=1}^{n+2} c_i² = (n − 2)·50² + 30² + 40² + 20² + 10² =
2500(n − 2) + 3000 = 2500n − 2000, assuming n ≥ 2. This gives
Pr[|S − E[S]| ≥ t] ≤ exp(−2t²/(2500n − 2000)), with the symmetric bound
holding on the other side as well.
For the standard game of bowling, with n = 10, this bound starts to bite
at t = √11500 ≈ 107, which is more than a third of the range between the
minimum and maximum possible scores. There’s a lot of variance in bowling,
but this looks like a pretty loose bound for players who don’t throw a lot
of strikes. For large n, we get the usual bound of O(√(n log n)) with high
probability: the averaging power of endless repetition eventually overcomes
any slop in the constants.
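For intuition, here is a small Monte Carlo sketch of the score distribution.
The uniform choice over the twelve frame results is a made-up stand-in (the
problem allows arbitrary independent distributions), so the numbers are only
illustrative:

    import random

    def game_score(n):
        # results: 0..9 pins, 10 = spare, 11 = strike
        results = [random.randrange(12) for _ in range(n + 2)]
        base = [min(r, 10) for r in results]
        score = 0
        for i in range(n):
            score += base[i]
            if results[i] >= 10:      # spare or strike: add next base score
                score += base[i + 1]
            if results[i] == 11:      # strike: add the one after as well
                score += base[i + 2]
        return score

    samples = [game_score(10) for _ in range(10000)]
    mean = sum(samples) / len(samples)
    print(mean, max(abs(s - mean) for s in samples))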

F.3.2 Unbalanced treaps

Recall that a treap (§6.3) is only likely to be balanced if the sequence of
insert and delete operations applied to it is independent of the priorities
chosen by the algorithm.
Suppose that we insert the keys 1 through n into a treap with random
priorities as usual, but then allow the adversary to selectively delete whichever
keys it wants to after observing the priorities assigned to each key.
Show that there is an adversary strategy that produces a path in the
treap after deletions that has expected length Ω(√n).

Solution
An easy way to do this is to produce a tree that consists of a single path,
which we can do by arranging that the remaining keys have priorities that
are ordered the same as their key values.
Here’s a simple strategy that works; a code sketch appears below. Divide
the keys into √n ranges of √n keys each (1 . . . √n, √n + 1 . . . 2√n, etc.).³
Rank the priorities from 1 to n. From each range of keys (i − 1)√n . . . i√n,
choose a key to keep whose priority is also ranked in the range
(i − 1)√n . . . i√n (if there is one), or choose no key (if there isn’t). Delete
all the other keys.
For a particular range, we are drawing √n samples without replacement
from the n priorities, and there are √n possible choices that cause us
to keep a key in that range. The probability that every draw misses is
Π_{i=1}^{√n} (1 − √n/(n − i + 1)) ≤ (1 − 1/√n)^√n ≤ e^−1. So each range
contributes at least 1 − e^−1 keys on average. Summing over all √n ranges
gives a sequence of keys with increasing priorities with expected length at
least (1 − e^−1)·√n = Ω(√n).
An alternative solution is to apply the Erdős-Szekeres theorem [ES35],
which says that every sequence of length k² + 1 has either an increasing
subsequence of length k + 1 or a decreasing subsequence of length k + 1.
Consider the sequence of priorities corresponding to the keys 1 . . . n; letting
k = ⌊√(n − 1)⌋ gives a subsequence of length at least √n − 1 that is either
increasing or decreasing. If we delete all other elements of the treap, the
elements corresponding to this subsequence will form a path, giving the
desired bound. Note that this does not require any probabilistic reasoning
at all.

³To make our life easier, we’ll assume that n is a square. This doesn’t affect the
asymptotic result.
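Here is a minimal Python sketch of the range-based adversary (assuming, as
in the solution, that n is a perfect square; the helper name is ours):

    import math, random

    def surviving_keys(n):
        r = math.isqrt(n)
        rank = list(range(1, n + 1))
        random.shuffle(rank)              # rank[i] = priority rank of key i+1
        keep = []
        for block in range(r):
            lo, hi = block * r + 1, (block + 1) * r
            for key in range(lo, hi + 1):
                if lo <= rank[key - 1] <= hi:
                    keep.append(key)      # one kept key per range suffices
                    break
        return keep

    n = 10000
    print(len(surviving_keys(n)), (1 - math.exp(-1)) * math.isqrt(n))

The kept keys have priority ranks that increase with their key values, so
after deleting everything else they form a single path in the treap.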

Though not required for the problem, it’s possible to show that Θ( n)
is the best possible bound here. The idea is that the number of possible
sequences of keys that correspond to a path of length k in a binary search
tree is exactly nk 2k−1 ; the nk corresponds to choosing the keys in the path,


and the 2k−1 is because for each node except the last, it must contain either
the smallest or the largest of the remaining keys because of the binary search
tree property.
Since each such sequence will be a treap path only if the priorities are
decreasing (with probability 1/k!), the union bound says that the probability
3
To make our life easer, we’ll assume that n is a square. This doesn’t affect the
asymptotic result.
APPENDIX F. SAMPLE ASSIGNMENTS FROM SPRING 2013 432

*
/ \
a / \ b
/ \
* *
/ \ \
a / \ b \ b
/ \ \
aa ab bb

Figure F.1: A radix tree, storing the strings aa, ab, and ba.

n k−1
of having any length-k paths is at most k 2 /k!. But
!
n k−1 (2n)k
2 /k! ≤
k 2(k!)2
(2n)k

2(k/e)2k
1
= (2e2 n/k 2 )k .
2

This is exponentially small for k  2e2 n, showing that with high probability

all possible paths have length O( n).

F.3.3 Random radix trees
A radix tree over an alphabet of size m is a tree data structure where each
node has up to m children, each corresponding to one of the letters in the
alphabet. A string is represented by a node at the end of a path whose edges
are labeled with the letters in the string in order. For example, in Figure F.1,
the string ab is stored at the node reached by following the a edge out of
the root, then the b edge out of this child.
The only nodes created in the radix tree are those corresponding to stored
keys or ancestors of stored keys.
Suppose you have a radix tree into which you have already inserted
n strings of length k from an alphabet of size m, generated uniformly at
random with replacement. What is the expected number of new nodes you
need to create to insert a new string of length k?
Solution
We need to create a new node for each prefix of the new string that is not
already represented in the tree.
For a prefix of length ℓ, the chance that none of the n strings have this
prefix is exactly (1 − m^−ℓ)^n. Summing over all ℓ gives that the expected
number of new nodes is Σ_{ℓ=0}^{k} (1 − m^−ℓ)^n.
There is no particularly clean expression for this, but we can observe
that (1 − m^−ℓ)^n ≤ exp(−nm^−ℓ) is close to zero for ℓ < log_m n and close
to 1 for ℓ > log_m n. This suggests that the expected value is
k − log_m n + O(1).
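This heuristic is easy to check numerically. A quick sketch (the parameters
are made-up):

    import math

    def expected_new_nodes(n, k, m):
        return sum((1 - m ** -l) ** n for l in range(k + 1))

    n, k, m = 10000, 40, 2
    print(expected_new_nodes(n, k, m), k - math.log(n, m))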

F.4 Assignment 4: due Wednesday, 2013-03-27, at 17:00
F.4.1 Flajolet-Martin sketches with deletion
A Flajolet-Martin sketch [FNM85] is a streaming data structure for
approximately counting the number of distinct items n in a large data
stream using only m = O(log n) bits of storage.⁴ The idea is to use a hash
function h that generates each value i ∈ {1, . . . , m} with probability 2^−i,
and for each element x that arrives in the data stream, we write a 1 to
A[h(x)]. (With probability 2^−m we get a value outside this range and write
nowhere.) After inserting n distinct elements, we estimate n as n̂ = 2^k̂,
where k̂ = max{k | A[k] = 1}, and argue that this is likely to be reasonably
close to n.
Suppose that we modify the Flajolet-Martin sketch to allow an element
x to be deleted by writing 0 to A[h(x)]. After n insertions and d deletions
(of distinct elements in both cases), we estimate the number of remaining
elements n − d as before by \widehat{n−d} = 2^k̂, where k̂ = max{k | A[k] = 1}.
Assume that we never delete an element that has not previously been
inserted, and that the values of h for different inputs are independent of
each other and of the sequence of insertions and deletions.
Show that there exist constants c > 1 and ε > 0, such that for sufficiently
large n, after inserting n distinct elements and then deleting d ≤ εn of them,
Pr[(n − d)/c ≤ \widehat{n−d} ≤ (n − d)c] ≥ 2/3.

⁴If you actually need to do this, there exist better data structures for this problem.
See [KNW10].
Solution
We’ll apply the usual error budget approach and show that the probability
that \widehat{n−d} is too big and the probability that \widehat{n−d} is too
small are both small. For the moment, we will leave c and ε as variables,
and find values that work at the end.
Let’s start with the too-big side. To get A[k] = 1, we need h(x_i) = k for
some x_i that is inserted but not subsequently deleted. There are n − d such
x_i, and each gives h(x_i) = k with probability 2^−k. So Pr[A[k] = 1] ≤
(n − d)·2^−k. This gives

    Pr[\widehat{n−d} ≥ (n − d)c] = Pr[k̂ ≥ ⌈lg((n − d)c)⌉]
                                 ≤ Σ_{k=⌈lg((n−d)c)⌉}^{∞} (n − d)·2^−k
                                 = 2(n − d)·2^−⌈lg((n−d)c)⌉
                                 ≤ 2/c.

On the too-small side, fix k = ⌈lg((n − d)/c)⌉. Since A[k] = 1 gives
k̂ ≥ k ≥ lg((n − d)/c), we have Pr[\widehat{n−d} < (n − d)/c] =
Pr[k̂ < lg((n − d)/c)] ≤ Pr[A[k] = 0]. (We might be able to get a better
bound by looking at larger indices, but to solve the problem this one k will
turn out to be enough.)
Let x_1 . . . x_{n−d} be the values that are inserted and not later deleted, and
x_{n−d+1} . . . x_n the values that are inserted and then deleted. For A[k] to be
zero, either (a) no x_i for i in 1 . . . n − d has h(x_i) = k; or (b) some x_i for
i in n − d + 1 . . . n has h(x_i) = k. The probability of the first event is
(1 − 2^−k)^(n−d); the probability of the second is 1 − (1 − 2^−k)^d. So we have

    Pr[A[k] = 0] ≤ (1 − 2^−k)^(n−d) + (1 − (1 − 2^−k)^d)
                 ≤ exp(−2^−k·(n − d)) + (1 − exp(−(2^−k + 2^−2k)·d))
                 ≤ exp(−2^−⌈lg((n−d)/c)⌉·(n − d)) + (1 − exp(−2·2^−⌈lg((n−d)/c)⌉·d))
                 ≤ exp(−2^−lg((n−d)/c)·(n − d)) + (1 − exp(−2·2^(−lg((n−d)/c)+1)·d))
                 = e^−c + (1 − exp(−4cd/(n − d)))
                 ≤ e^−c + 4cd/(n − d)
                 ≤ e^−c + 4cε/(1 − ε).

So our total probability of error is bounded by 2/c + e^−c + 4cε/(1 − ε).
Let c = 8 and ε = 1/512 to make this less than 1/4 + e^−8 + (512/511)·(1/16)
≈ 0.313 < 1/3, giving the desired bound.

F.4.2 An adaptive hash table

Suppose we want to build a hash table, but we don’t know how many
elements we are going to put in it, and because we allow undisciplined C
programmers to obtain pointers directly to our hash table entries, we can’t
move an element once we assign a position to it. Here we will consider a
data structure that attempts to solve this problem.
Construct a sequence of tables T_0, T_1, . . . , where each T_i has m_i = 2^(2^i)
slots. For each table T_i, choose k independent strongly 2-universal hash
functions h_{i1}, h_{i2}, . . . h_{ik}.
The insertion procedure is given in Algorithm F.1. The essential idea
is that we make k attempts (each with a different hash function) to fit x into
T_0, then k attempts to fit it in T_1, and so on.
If the tables T_i are allocated only when needed, the space complexity of
this data structure is given by the sum of m_i for all tables that have at least
one element.
Show that for any fixed ε > 0, there is a constant k such that after
inserting n elements:

1. The expected cost of an additional insertion is O(log log n), and

2. The expected space complexity is O(n^(2+ε)).

    1 procedure insert(x)
    2   for i ← 0 to ∞ do
    3     for j ← 1 to k do
    4       if T_i[h_ij(x)] = ⊥ then
    5         T_i[h_ij(x)] ← x
    6         return

Algorithm F.1: Adaptive hash table insertion
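For concreteness, here is a direct Python rendering of Algorithm F.1 (a
sketch: tables are dictionaries allocated lazily, and hash_fn(i, j, x) is a
hypothetical stand-in for h_ij):

    def insert(x, tables, k, hash_fn):
        i = 0
        while True:
            if i == len(tables):
                tables.append({})            # allocate T_i lazily
            m_i = 2 ** (2 ** i)              # T_i has m_i slots
            for j in range(k):
                slot = hash_fn(i, j, x) % m_i
                if slot not in tables[i]:    # T_i[h_ij(x)] = ⊥
                    tables[i][slot] = x
                    return
            i += 1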

Solution
The idea is that we use T_{i+1} only if we get a collision in T_i. Let X_i be the
indicator for the event that there is a collision in T_i. Then

    E[steps] ≤ 1 + Σ_{i=0}^{∞} E[X_i]                        (F.4.1)

and

    E[space] ≤ m_0 + Σ_{i=0}^{∞} E[X_i]·m_{i+1}.             (F.4.2)

To bound E[X_i], let’s calculate an upper bound on the probability that
a newly-inserted element x_{n+1} collides with any of the previous n elements
x_1 . . . x_n in table T_i. This occurs if, for every location h_{ij}(x_{n+1}), there is some
x_r and some j′ such that h_{ij′}(x_r) = h_{ij}(x_{n+1}). The chance that this occurs
for any particular j, j′, and r is at most 1/m_i (if j = j′, use 2-universality
of h_{ij}, and if j ≠ j′, use independence and uniformity), giving a chance
that it occurs for fixed j that is at most n/m_i. The chance that it occurs
for all j is at most (n/m_i)^k, and the expected number of such collisions
summed over any n + 1 elements that we insert is bounded by n(n/m_i)^k
(the first element can’t collide with any previous elements). So we have
E[X_i] ≤ min(1, n(n/m_i)^k).
Let ℓ be the largest value such that m_ℓ ≤ n^(2+ε). We will show that, for
an appropriate choice of k, we are sufficiently unlikely to get a collision in
round ℓ that the right-hand sides of (F.4.1) and (F.4.2) end up being not
much more than the corresponding sums up to ℓ − 1.
From our choice of ℓ, it follows that (a) ℓ ≤ lg lg n^(2+ε) = lg lg n + lg(2 + ε)
= O(log log n); and (b) m_{ℓ+1} > n^(2+ε), giving m_ℓ = √(m_{ℓ+1}) > n^(1+ε/2).
From this we get E[X_ℓ] ≤ n(n/m_ℓ)^k < n^(1−kε/2).
By choosing k large enough, we can make this an arbitrarily small
polynomial in n. Our goal is to wipe out the E[X_ℓ]·m_{ℓ+1} and subsequent
terms in (F.4.2).
Observe that E[X_i]·m_{i+1} ≤ n(n/m_i)^k·m_i² = n^(k+1)·m_i^(2−k). Let’s choose
k so that this is at most 1/m_i when i ≥ ℓ, so we get a nice convergent
series.⁵ This requires n^(k+1)·m_i^(3−k) ≤ 1, or k + 1 + (3 − k)·log_n m_i ≤ 0. If
i ≥ ℓ, we have log_n m_i > 1 + ε/2, so we win if k + 1 + (3 − k)(1 + ε/2) ≤ 0.
Setting k ≥ 8/ε + 3 works. (Since we want k to be an integer, we probably
want k = ⌈8/ε⌉ + 3 instead.)
So now we have

    E[space] ≤ m_0 + Σ_{i=0}^{∞} E[X_i]·m_{i+1}
             ≤ Σ_{i=0}^{ℓ} m_i + Σ_{i=ℓ}^{∞} 1/m_i
             ≤ 2m_ℓ + 2/m_ℓ
             = O(n^(2+ε)).

For E[steps], compute the same sum without the m_{i+1} factors. This
makes the tail terms even smaller, so they are still bounded by a constant,
and the head becomes just Σ_{i=0}^{ℓ} 1 = O(log log n).

⁵We could pick a smaller k, but why not make things easier for ourselves?

F.4.3 An odd locality-sensitive hash function

A deranged computer scientist decides that if taking one bit from a random
index in a bit vector is a good way to do locality-sensitive hashing (see
§7.8.4), then taking the exclusive OR of k independently chosen indices must
be even better.
Formally, given a bit-vector x_1 x_2 . . . x_n and a sequence of indices
i_1 i_2 . . . i_k, define h_i(x) = ⊕_{j=1}^{k} x_{i_j}. For example, if x = 00101 and
i = 3, 5, 2, then h_i(x) = 1 ⊕ 1 ⊕ 0 = 0.
Suppose x and y are bit-vectors of length n that differ in m places.

1. Give a closed-form expression for the probability that h_i(x) ≠ h_i(y),
assuming i consists of k indices chosen uniformly and independently at
random from 1 . . . n.

2. Use this to compute the exact probability that h_i(x) ≠ h_i(y) when
m = 0, m = n/2, and m = n.

Hint: You may find it helpful to use the identity (a mod 2) = (1/2)(1 − (−1)^a).

Solution
1. Observe that h_i(x) ≠ h_i(y) if and only if i chooses an odd number of
indices where x and y differ. Let p = m/n be the probability that each
index in i hits a position where x and y differ, and let q = 1 − p. Then
the probability that we get an odd number of differences is

    Σ_{j=0}^{k} (j mod 2)·C(k,j)·p^j·q^(k−j)
        = Σ_{j=0}^{k} (1/2)·(1 − (−1)^j)·C(k,j)·p^j·q^(k−j)
        = (1/2)·Σ_{j=0}^{k} C(k,j)·p^j·q^(k−j) − (1/2)·Σ_{j=0}^{k} C(k,j)·(−p)^j·q^(k−j)
        = (1/2)·(p + q)^k − (1/2)·(−p + q)^k
        = (1 − (1 − 2m/n)^k)/2.

2.  • For m = 0, this is (1 − 1^k)/2 = 0.
    • For m = n/2, it’s (1 − 0^k)/2 = 1/2 (assuming k > 0).
    • For m = n, it’s (1 − (−1)^k)/2 = (k mod 2).

In fact, the chances of not colliding as a function of m are symmetric
around m = n/2 if k is even and increasing if k is odd. So we can only hope
to use this as a locality-sensitive hash function in the odd case.
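The closed form is easy to test empirically; a quick sketch with made-up
parameters:

    import random

    def differ_prob(n, m, k, trials=100000):
        x = [0] * n
        y = [1] * m + [0] * (n - m)      # differ in exactly m places
        hits = 0
        for _ in range(trials):
            idx = [random.randrange(n) for _ in range(k)]
            hx = sum(x[i] for i in idx) % 2
            hy = sum(y[i] for i in idx) % 2
            hits += hx != hy
        return hits / trials

    n, m, k = 20, 5, 3
    print(differ_prob(n, m, k), (1 - (1 - 2 * m / n) ** k) / 2)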

F.5 Assignment 5: due Friday, 2013-04-12, at 17:00

F.5.1 Choosing a random direction
Consider the following algorithm for choosing a random direction in three
dimensions. Start at the point (X_0, Y_0, Z_0) = (0, 0, 0). At each step, pick one
of the three coordinates uniformly at random and add ±1 to it with equal
probability. Continue until the resulting vector has length at least k, i.e.,
until X_t² + Y_t² + Z_t² ≥ k². Return this vector.
What is the expected running time of this algorithm, as an asymptotic
function of k?
Solution
The trick is to observe that X_t² + Y_t² + Z_t² − t is a martingale, essentially
following the same analysis as for X_t² − t for a one-dimensional random walk.
Suppose we pick X_t to change. Then

    E[X_{t+1}² | X_t] = (1/2)·((X_t + 1)² + (X_t − 1)²)
                      = X_t² + 1.

So

    E[X_{t+1}² + Y_{t+1}² + Z_{t+1}² − (t + 1) | X_t, Y_t, Z_t, X changes]
        = X_t² + Y_t² + Z_t² − t.

But by symmetry, the same equation holds if we condition on Y or Z
changing. It follows that E[X_{t+1}² + Y_{t+1}² + Z_{t+1}² − (t + 1) | X_t, Y_t, Z_t]
= X_t² + Y_t² + Z_t² − t, and that we have a martingale as claimed.
Let τ be the first time at which X_t² + Y_t² + Z_t² ≥ k². From the optional
stopping theorem (specifically, the bounded-increments case of Theorem 9.3.1),
E[X_τ² + Y_τ² + Z_τ² − τ] = 0, or equivalently E[τ] = E[X_τ² + Y_τ² + Z_τ²].
This immediately gives E[τ] ≥ k².
To get an upper bound, observe that X_{τ−1}² + Y_{τ−1}² + Z_{τ−1}² < k², and
that exactly one of these three terms increases between τ − 1 and τ. Suppose
it’s X (the other cases are symmetric). Increasing X by 1 sets X_τ² =
X_{τ−1}² + 2X_{τ−1} + 1. So we get

    X_τ² + Y_τ² + Z_τ² = (X_{τ−1}² + Y_{τ−1}² + Z_{τ−1}²) + 2X_{τ−1} + 1
                       < k² + 2k + 1.

So we have k² ≤ E[τ] < k² + 2k + 1, giving E[τ] = Θ(k²). (Or k² + O(k)
if we are feeling really precise.)
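A quick simulation confirms the Θ(k²) behavior (function and parameter
names are ours):

    import random

    def steps_until_radius(k):
        x = [0, 0, 0]
        t = 0
        while x[0] ** 2 + x[1] ** 2 + x[2] ** 2 < k * k:
            x[random.randrange(3)] += random.choice([-1, 1])
            t += 1
        return t

    k = 20
    mean = sum(steps_until_radius(k) for _ in range(200)) / 200
    print(mean, k * k)   # E[tau] lies in [k^2, k^2 + 2k + 1)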

F.5.2 Random walk on a tree


Consider the following random walk on a (possibly unbalanced) binary search
tree: At each step, with probability 1/3 each, move to the current node’s
parent, left child, or right child. If the target node does not exist, stay put.
Suppose we adapt this random walk using Metropolis-Hastings (see
§10.3.4) so that the probability of each node at depth d in the stationary
distribution is proportional to α^−d.
Use a coupling argument to show that, for any constant α > 2, this
adapted random walk converges in O(D) steps, where D is the depth of the
tree.
Solution
As usual, let X_t be a copy of the chain starting in an arbitrary initial state
and Y_t be a copy starting in the stationary distribution.
From the Metropolis-Hastings algorithm, the probability that the walk
moves to a particular child is 1/(3α), so the probability that the depth
increases after one step is at most 2/(3α). The probability that the walk
moves to the parent (if we are not already at the root) is 1/3.
We’ll use the same choice (left, right, or parent) in both the X and
Y processes, but it may be that only one of the particles moves (because
the target node doesn’t exist). To show convergence, we’ll track Z_t =
max(depth(X_t), depth(Y_t)). When Z_t = 0, both X_t and Y_t are the root
node.
There are two ways that Z_t can change:

1. Both processes choose “parent”; if Z_t is not already 0, Z_{t+1} = Z_t − 1.
This case occurs with probability 1/3.

2. Both processes choose one of the child directions. If the appropriate
child exists for the deeper process (or for either process if they are
at the same depth), we get Z_{t+1} = Z_t + 1. This event occurs with
probability at most 2/(3α) < 1/3.

So the expected change in Z_{t+1} conditioned on Z_t > 0 is at most
−1/3 + 2/(3α) = −(1/3)(1 − 2/α). Let τ be the first time at which Z_t = 0.
Then the process Z′_t = Z_t + (1/3)(1 − 2/α)·t for t ≤ τ and 0 for t > τ is a
supermartingale, so E[Z′_τ] ≤ E[Z′_0] = E[max(depth(X_0), depth(Y_0))] ≤ D.
This gives E[τ] ≤ 3D/(1 − 2/α).

F.5.3 Sampling from a tree


Suppose we want to sample from the stationary distribution of the Metropolis-
Hastings walk in the previous problem, but we don’t want to do an actual
random walk. Assuming α > 2 is a constant, give an algorithm for sampling
exactly from the stationary distribution that runs in constant expected time.
Your algorithm should not require knowledge of the structure of the tree.
Its only input should be a pointer to the root node.
Clarification added 2013-04-09: Your algorithm can determine the children
of any node that it has already found. The idea of not knowing the
structure of the tree is that it can’t, for example, assume a known bound on
the depth, counts of the number of nodes in subtrees, etc., without searching
through the tree to find this information directly.
Solution
We’ll use rejection sampling. The idea is to choose a node in the infinite
binary tree with probability proportional to α^−d, and then repeat the process
if we picked a node that doesn’t actually exist. Conditioned on finding a
node i that exists, its probability will be α^−depth(i) / Σ_j α^−depth(j).
If we think of a node in the infinite tree as indexed by a binary string
of length equal to its depth, we can generate it by first choosing the length
X and then choosing the bits in the string. We want Pr[X = n] to be
proportional to 2^n·α^−n = (2/α)^n. Summing the geometric series to
normalize gives

    Pr[X = n] = (2/α)^n · (1 − 2/α).

This is a geometric distribution, so we can sample it by repeatedly flipping
a biased coin. The number of such coin-flips is O(X), as is the number of
random bits we need to generate and the number of tree nodes we need to
check. The expected value of X is given by the infinite series

    Σ_{n=0}^{∞} n·(2/α)^n·(1 − 2/α);

this series converges to some constant by the ratio test.
So each probe costs O(1) time on average, and has at least a constant
probability of success, since we choose the root with Pr[X = 0] = 1 − 2/α.
Using Wald’s equation (9.4.1), the total expected time to run the algorithm
is O(1).
This is a little surprising, since the output of the algorithm may have
more than constant length. But we are interested in expectation, and when
α > 2 most of the weight lands near the top of the tree.
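Here is a minimal sketch of the sampler, assuming each node exposes left
and right child pointers (None when the child is absent), per the
clarification above:

    import random

    def sample(root, alpha):
        while True:
            node = root
            # descend with probability 2/alpha, choosing a uniform child;
            # stopping at a fixed node of depth d then has probability
            # (1/alpha)^d * (1 - 2/alpha), proportional to alpha^-d
            while random.random() < 2 / alpha:
                node = node.left if random.random() < 0.5 else node.right
                if node is None:
                    break               # chosen node doesn't exist; retry
            else:
                return node             # stopped at an existing node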

F.6 Assignment 6: due Friday, 2013-04-26, at 17:00

F.6.1 Increasing subsequences
Let S_1, . . . , S_m be sets of indices in the range 1 . . . n. Say that a permutation
π of 1 . . . n is increasing on S_j if π(i_1) < π(i_2) < · · · < π(i_{k_j}) where
i_1 < i_2 < · · · < i_{k_j} are the elements of S_j.
Give a fully polynomial-time randomized approximation scheme that
takes as input n and a sequence of sets S_1, . . . , S_m, and approximates the
number of permutations π that are increasing on at least one of the S_j.
Solution
This can be solved using a fairly straightforward application of Karp-
Luby [KL85] (see §11.4). Recall that for Karp-Luby we need to be able to
express our target set U as the union of a polynomial number of covering
sets U_j, where we can both compute the size of each U_j and sample uniformly
from it. We can then estimate |U| = Σ_{j,x∈U_j} f(j, x) = (Σ_j |U_j|)·Pr[f(j, x) = 1],
where f(j, x) is the indicator for the event that x ∉ U_{j′} for any j′ < j,
and in the probability, the pair (j, x) is chosen uniformly at random from
{(j, x) | x ∈ U_j}.
In this case, let U_j be the set of all permutations that are increasing on
S_j. We can specify each such permutation by specifying the choice of which
k_j = |S_j| elements are in positions i_1 . . . i_{k_j} (the order of these elements
is determined by the requirement that the permutation be increasing on
S_j) and specifying the order of the remaining n − k_j elements. This gives
C(n, k_j)·(n − k_j)! = (n)_{n−k_j} such permutations. Begin by computing these
counts for all S_j, as well as their sum.
We now wish to sample uniformly from pairs (j, π) where each π is an
element of U_j. First sample each j with probability |U_j| / Σ_ℓ |U_ℓ|, using the
counts we’ve already computed. Sampling a permutation uniformly from U_j
mirrors the counting argument: choose a k_j-subset for the positions in S_j,
then order the remaining elements randomly. The entire sampling step can
easily be done in O(n + m) time.
Computing f(j, π) requires testing π to see if it is increasing on any S_{j′}
for j′ < j; without doing anything particularly intelligent, this takes O(nm)
time. So we can construct and test one sample in O(nm) time. Since each
sample has at least a ρ = 1/m chance of having f(j, π) = 1, from Lemma 11.2.1
we need O(ε^−2·ρ^−1·log(1/δ)) = O(m·ε^−2·log(1/δ)) samples to get ε relative
error with probability at least 1 − δ, for a total cost of O(m²n·ε^−2·log(1/δ)).

F.6.2 Futile word searches

A word search puzzle consists of an n×n grid of letters from some alphabet
Σ, where the goal is to find contiguous sequences of letters in one of the eight
orthogonal or diagonal directions that form words from some lexicon. For
example, in Figure F.2, the left grid contains an instance of aabc (running
up and left from the rightmost a character on the last line), while the right
grid contains no instances of this word.

    bacab     bacab
    ccaac     ccaac
    bbbac     babac
    bbaaa     bbaaa
    acbab     acbab

Figure F.2: Non-futile (left) and futile (right) word search grids for the
lexicon {aabc, ccca}

For this problem, you are asked to build an algorithm for constructing

word search puzzles with no solution for a given lexicon. That is, given a set
of words S over some alphabet and a grid size n, the output should be an
n × n grid of letters such that no word in S appears as a contiguous sequence
of letters in one of the eight directions anywhere in the grid. We will refer to
such puzzles as futile word search puzzles.

1. Suppose the maximum length of any word in S is k. Let p_S be the
probability that some word in S is a prefix of an infinite string generated
by picking letters uniformly and independently from Σ. Show that
there is a constant c > 0 such that for any k, Σ, and S, p_S < ck^−2
implies that there exists, for all n, an n × n futile word search puzzle
for S using only letters from Σ.

2. Give an algorithm that constructs a futile word search puzzle given S
and n in expected time polynomial in |S|, k, and n, provided p_S < ck^−2
as above.

Solution
1. We’ll apply the symmetric version of the Lovász local lemma. Suppose
the grid is filled in independently and uniformly at random with
characters from Σ. Given a position ij in the grid, let A_ij be the
event that there exists a word in S whose first character is at position
ij; observe that Pr[A_ij] ≤ 8p_S by the union bound (this may be an
overestimate, both because we might run off the grid in some directions
and because the choice of initial character is not independent). Observe
also that A_ij is independent of any event A_{i′j′} where |i − i′| ≥ 2k − 1 or
|j − j′| ≥ 2k − 1, because no two words starting at these positions can
overlap. So we can build a dependency graph with p ≤ 8p_S and
d ≤ (4k − 3)². The Lovász local lemma shows that there exists an
assignment where no A_ij occurs provided ep(d + 1) < 1, or
8ep_S((4k − 3)² + 1) < 1. This easily holds if p_S < 1/(8e(4k)²) =
(1/(128e))·k^−2.

2. For this part, we can just use Moser-Tardos [MT10], particularly the
symmetric version described in Corollary 13.3.4. We have a collection
of m = O(n²) bad events, with d = Θ(k²), so the expected number of
resamplings is bounded by m/d = O(n²/k²). Each resampling requires
checking every position in the new grid for an occurrence of some string
in S; this takes O(n²k · |S|) time per resampling even if we are not
very clever about it. So the total expected cost is O(n⁴ · |S|/k).
With some more intelligence, this can be improved. We don’t need to
recheck any position at distance greater than k from any of the at most
k letters we resample, and if we are sensible, we can store S using a
radix tree or some similar data structure that allows us to look up all
words that appear as a prefix of a given length-k string in time O(k).
This reduces the cost of each resampling to O(k³), with an additive
cost of O(k · |S|) to initialize the data structure. So the total expected
cost is now O(n²k + |S|).

F.6.3 Balance of power

Suppose you are given a set of n MMORPG players, and a sequence of
subsets S_1, S_2, . . . , S_m of this set, where each subset S_i gives the players who
will participate in some raid. Before any of the raids take place, you are to
assign each player permanently to one of three factions. If for any i, |S_i|/2
or more of the players in S_i are in the same faction, then instead of carrying
out the raid they will overwhelm and rob the other participants.
Give a randomized algorithm for computing a faction assignment that
prevents this tragedy from occurring (for all i) and thus allows all m raids to
be completed without incident, assuming that m > 1 and min_i |S_i| ≥ c ln m
for some constant c > 0 that does not depend on n or m. Your algorithm
should run in expected time polynomial in n and m.

Solution
Assign each player randomly to a faction. Let X_ij be the number of players
in S_i that are assigned to faction j. Then E[X_ij] = |S_i|/3. Applying the
Chernoff bound (5.2.2), we have

    Pr[X_ij ≥ |S_i|/2] = Pr[X_ij ≥ (1 + 1/2)·E[X_ij]]
                      ≤ exp(−(1/2)²·(|S_i|/3)/3)
                      = e^(−|S_i|/36).

Let c = 3 · 36 = 108. Then if min_i |S_i| ≥ c ln m, for each i, j, it holds
that Pr[X_ij ≥ |S_i|/2] ≤ e^(−3 ln m) = m^−3. So the probability that this
bound is exceeded for any i and j is at most (3m)·m^−3 = 3/m². So a random
assignment works with probability at least 1/4 for m > 1.
We can generate and test each assignment in O(nm) time. So our
expected time is O(nm).
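The generate-and-test loop is short enough to sketch directly (a sketch:
players are numbered 0 . . . n − 1 and raids is a list of lists of player
indices; the names are ours):

    import random

    def assign_factions(n, raids):
        while True:
            faction = [random.randrange(3) for _ in range(n)]
            if all(max(sum(faction[p] == f for p in raid) for f in range(3))
                   < len(raid) / 2
                   for raid in raids):
                return faction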

F.7 Final exam

Write your answers in the blue book(s). Justify your answers. Work alone.
Do not use any notes or books.
There are four problems on this exam, each worth 20 points, for a total
of 80 points. You have approximately three hours to complete this exam.

F.7.1 Dominating sets


A dominating set in a graph is a subset S of the vertices for which every
vertex v is either in S or adjacent to a vertex in S.
Show that for any graph G, there is an aperiodic, irreducible Markov
chain on the dominating sets of G, such that (a) the transition rule for
the chain can be implemented in polynomial time; and (b) the stationary
distribution of the chain is uniform. (You do not need to say anything about
the convergence rate of this chain.)

Solution
Suppose we have state S_t at time t. We will use a random walk where
we choose a vertex uniformly at random to add to or remove from S_t, and
carry out the action only if the resulting set is still a dominating set.
In more detail: For each vertex v, with probability 1/n, S_{t+1} = S_t ∪ {v}
if v ∉ S_t; S_{t+1} = S_t \ {v} if v ∈ S_t and S_t \ {v} is a dominating set; and
S_{t+1} = S_t otherwise. To implement this transition rule, we need to be able
to choose a vertex v uniformly at random (easy) and test in the case where
v ∈ S_t if S_t \ {v} is a dominating set (also polynomial: for each vertex, check
if it or one of its neighbors is in S_t \ {v}, which takes time O(|V| + |E|)).
Note that we do not need to check if S_t ∪ {v} is a dominating set.
For any pair of adjacent states S and S′ = S \ {v}, the probability of
moving from S to S′ and the probability of moving from S′ to S are both 1/n.
So the Markov chain is reversible with a uniform stationary distribution.
This is an aperiodic chain, because there exist minimal dominating sets
for which there is a nonzero chance that S_{t+1} = S_t.
It is irreducible, because for any dominating set S, there is a path to
the complete set of vertices V by adding each vertex in V \ S one at a time.
Conversely, removing these vertices from V gives a path to S. This gives a
path S ⇝ V ⇝ T between any two dominating sets S and T.
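One step of this chain is easy to express in code (a sketch: adj maps each
vertex to its set of neighbors, and S is the current dominating set):

    import random

    def is_dominating(S, adj):
        return all(v in S or adj[v] & S for v in adj)

    def step(S, adj):
        v = random.choice(list(adj))
        if v not in S:
            return S | {v}              # adding a vertex preserves domination
        if is_dominating(S - {v}, adj):
            return S - {v}
        return S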

F.7.2 Tricolor triangles

Suppose you wish to count the number of assignments of colors {r, g, b} to
nodes of a graph G that have the property that some triangle in G contains
all three colors. For example, the four-vertex graph shown below is labeled
with one of 18 such colorings (6 permutations of the colors of the triangle
nodes times 3 unconstrained choices for the degree-1 node).

    [Figure: a four-vertex graph, a triangle colored g, r, b with a
    degree-1 neighbor colored r.]

Give a fully polynomial-time randomized approximation scheme for this
problem.

Solution
Though it is possible, and tempting, to go after this using Karp-Luby (see
§11.4), naive sampling is enough.
If a graph has at least one triangle (which can be checked in O(n³)
time just by enumerating all possible triangles), then the probability that
that particular triangle is tricolored when colors are chosen uniformly and
independently at random is 6/27 = 2/9. This gives a constant hit rate ρ, so
by Lemma 11.2.1, we can get ε relative error with 1 − δ probability using
O(ε^−2·log(1/δ)) samples. Each sample costs O(n³) time to evaluate (again,
brute-force checking of all possible triangles), for a total cost of
O(n³·ε^−2·log(1/δ)).

F.7.3 The n rooks problem

The n rooks problem requires marking as large a subset as possible of the
squares in an n × n grid, so that no two squares in the same row or column
are marked.⁶

⁶This is not actually a hard problem.

Consider the following randomized algorithm that attempts to solve this
problem:

1. Give each of the n² squares a distinct label using a uniformly chosen
random permutation of the integers 1 . . . n².

2. Mark any square whose label is larger than any other label in its row
and column.

What is the expected number of marked squares?

Solution
Each square is marked if it is the largest of the 2n − 1 total squares in its
row and column. By symmetry, each of these 2n − 1 squares is equally likely
to be the largest, so the probability that a particular square is marked is
exactly 1/(2n − 1). By linearity of expectation, the total expected number of
marked squares is then n²/(2n − 1).
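A short simulation agrees with this formula (the parameters are made-up):

    import random

    def marked_squares(n):
        labels = random.sample(range(n * n), n * n)    # random permutation
        grid = [labels[i * n:(i + 1) * n] for i in range(n)]
        count = 0
        for i in range(n):
            for j in range(n):
                if grid[i][j] == max(grid[i]) and \
                   grid[i][j] == max(grid[r][j] for r in range(n)):
                    count += 1
        return count

    n = 10
    avg = sum(marked_squares(n) for _ in range(2000)) / 2000
    print(avg, n * n / (2 * n - 1))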

F.7.4 Pursuing an invisible target on a ring

Suppose that you start at position 0 on a ring of size 2n, while a target
particle starts at position n. At each step, starting at position i, you can
choose whether to move to any of positions i − 1, i, or i + 1. At the same
time, the target moves from its position j to either j − 1 or j + 1 with equal
probability, independent of its previous moves or your moves. Aside from
knowing that the target starts at n at time 0, you cannot tell where the
target moves.
Your goal is to end up on the same node as the target after some step.⁷
Give an algorithm for choosing your moves such that, for any c > 0, you
encounter the target in at most 2n steps with probability at least 1 − n^−c
for sufficiently large n.

⁷Note that you must be in the same place at the end of the step: if you move from 1 to
2 while the target moves from 2 to 1, that doesn’t count.

Solution
The basic idea is to just go through positions 0, 1, 2, . . . until we encounter
the target, but we have to be a little careful about parity to make sure we
don’t pass it by accident.⁸
Let X_i = ±1 be the increment of the target’s i-th move, and let S_i =
Σ_{j=1}^{i} X_j, so that its position after i steps is n + S_i mod 2n.
Let Y_i be the position of the pursuer after i steps.
First move: stay at 0 if n is odd, move to 1 if n is even. The purpose of
this is to establish the invariant that n + S_i − Y_i is even starting at i = 1.
For subsequent moves, let Y_{i+1} = Y_i + 1. Observe that this maintains the
invariant.
We assume that n is at least 2. This is necessary to ensure that at time
1, Y_1 ≤ n + S_1.
Claim: if at time 2n, Y_{2n} ≥ n + S_{2n}, then at some time i ≤ 2n, Y_i = n + S_i.
Proof: Let i be the first time at which Y_i ≥ n + S_i; under the assumption
that n ≥ 2, i ≥ 1. So from the invariant, we can’t have Y_i = n + S_i + 1,
and if Y_i ≥ n + S_i + 2, we have Y_{i−1} ≥ Y_i − 1 ≥ n + S_i + 1 ≥ n + S_{i−1},
contradicting our assumption that i is minimal. The remaining alternative
is that Y_i = n + S_i, giving a collision at time i.
We now argue that Y_{2n}, which is at least 2n − 1, is very likely to be at
least n + S_{2n}. Since S_{2n} is a sum of 2n independent ±1 variables, from
Hoeffding’s inequality we have Pr[Y_{2n} < n + S_{2n}] ≤ Pr[S_{2n} ≥ n] ≤
e^(−n²/4n) = e^(−n/4). For sufficiently large n, this is much smaller than n^−c
for any fixed c.

⁸This is not the only possible algorithm, but there are a lot of plausible-looking
algorithms that turn out not to work. One particularly tempting approach is to run to
position n using the first n steps and then spend the next n steps trying to hit the target in
the immediate neighborhood of n, either by staying put (a sensible strategy when lost in the
woods in real life, assuming somebody is looking for you), or moving in a random walk of
some sort starting at n. This doesn’t work if we want a high-probability bound. To see this,
observe that the target has a small but nonzero constant probability in the limit of being
at some position greater than or equal to n + 4√n after exactly n/2 steps. Conditioned on
starting at n + 4√n or above, its chances of moving below n + 4√n − 2√n = n + 2√n
at any time in the next 3n/2 steps is bounded by e^(−4n/(2·(3n/2))) = e^(−4/3) (Azuma), and a
similar bound holds independently for our chances of getting up to n + 2√n or above.
Multiplying out all these constants gives a constant probability of failure. A similar but
bigger disaster occurs if we don’t rush to n first.
Appendix G

Sample assignments from Spring 2011

G.1 Assignment 1: due Wednesday, 2011-01-26, at 17:00

G.1.1 Bureaucratic part
Send me email! My address is james.aspnes@gmail.com.
In your message, include:

1. Your name.

2. Your status: whether you are an undergraduate, grad student, auditor,
etc.

3. Anything else you’d like to say.

(You will not be graded on the bureaucratic part, but you should do it
anyway.)

G.1.2 Rolling a die

The usual model of a randomized algorithm assumes a source of fair,
independent random bits. This makes it easy to generate uniform numbers
in the range 0 . . . 2^n − 1, but not so easy for other ranges. Here are two
algorithms for generating a uniform random integer 0 ≤ s < n:

• Rejection sampling generates a uniform random integer
0 ≤ s < 2^⌈lg n⌉. If s < n, return s; otherwise keep trying until you get
some s < n.

• Arithmetic coding or range coding generates a sequence of bits
r_1, r_2, . . . r_k until the half-open interval
[Σ_{i=1}^{k} 2^−i·r_i, Σ_{i=1}^{k} 2^−i·r_i + 2^−k)
is a subset of [s/n, (s + 1)/n) for some s; it then returns s.

1. Show that both rejection sampling and range coding produce a uniform
value 0 ≤ s < n using an expected O(log n) random bits.

2. Which algorithm has a better constant?

3. Does there exist a function f and an algorithm that produces a uniform


value 0 ≤ s < n for any n using f (n) random bits with probability 1?

Solution
1. For rejection sampling, each sample requires ⌈lg n⌉ bits and is accepted
with probability n/2^⌈lg n⌉ ≥ 1/2. So rejection sampling returns a value
after at most 2 samples on average, using an expected 2⌈lg n⌉ <
2(lg n + 1) bits for the worst n.
For range coding, we keep going as long as one of the n − 1 nonzero
endpoints s/n lies inside the current interval. After k bits, the probability
that the current interval contains an endpoint is at most (n − 1)·2^−k;
in particular, it drops below 1 as soon as k = ⌈lg n⌉ and continues to
drop by 1/2 for each additional bit, requiring 2 more bits on average.
So the expected cost of range coding is at most ⌈lg n⌉ + 2 < lg n + 3
bits.

2. We’ve just shown that range coding beats rejection sampling by a
factor of 2 in the limit, for worst-case n. It’s worth noting that other
factors might be more important if random bits are cheap: rejection
sampling is much easier to code and avoids the need for division.

3. There is no algorithm that produces a uniform value 0 ≤ s < n for all
n using any fixed number of bits. Suppose such an algorithm M existed.
Fix some n. For all n values s to be equally likely, the sets of random
bits M^−1(s) = {r | M(r) = s} must have the same size. But this can
only happen if n divides 2^f(n), which works only for n a power of 2.
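Here is a sketch of rejection sampling, where random_bit() stands in for
the assumed source of fair random bits:

    import random

    def random_bit():
        return random.randrange(2)

    def uniform(n):
        k = max(1, (n - 1).bit_length())    # ceil(lg n) bits per sample
        while True:
            s = 0
            for _ in range(k):
                s = 2 * s + random_bit()
            if s < n:
                return s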
G.1.3 Rolling many dice

Suppose you repeatedly roll an n-sided die. Give an asymptotic (big-Θ)
bound on the expected number of rolls until you roll some number you have
already rolled before.

Solution
In principle, it is possible to compute this value exactly, but we are lazy.
For a lower bound, observe that after m rolls, each of the C(m,2) pairs of
rolls has probability 1/n of being equal, for an expected total of C(m,2)/n
duplicates. For m = √n/2, this is less than 1/8, which shows that the
expected number of rolls is Ω(√n).
For the upper bound, suppose we have already rolled the die √n times.
If we haven’t gotten a duplicate already, each new roll has probability at
least √n/n = 1/√n of matching a previous roll. So after an additional √n
rolls on average, we get a repeat. This shows that the expected number of
rolls is O(√n).
Combining these bounds shows that we need Θ(√n) rolls on average.

G.1.4 All must have candy

A set of n_0 children each reach for one of n_0 candies, with each child choosing
a candy independently and uniformly at random. If a candy is chosen by
exactly one child, the candy and child drop out. The remaining n_1 children
and candies then repeat the process for another round, leaving n_2 remaining
children and candies, etc. The process continues until every child has a candy.
Give the best bound you can on the expected number of rounds until
every child has a candy.

Solution
Let T(n) be the expected number of rounds remaining given we are starting
with n candies. We can set up a probabilistic recurrence relation T(n) =
1 + T(n − X_n), where X_n is the number of candies chosen by exactly one
child. It is easy to compute E[X_n], since the probability that any candy gets
chosen exactly once is n·(1/n)·(1 − 1/n)^(n−1) = (1 − 1/n)^(n−1). Summing
over all candies gives E[X_n] = n(1 − 1/n)^(n−1).
The term (1 − 1/n)^(n−1) approaches e^−1 in the limit, so for any fixed
ε > 0, we have n(1 − 1/n)^(n−1) ≥ n(e^−1 − ε) for sufficiently large n. We can
get a quick bound by choosing ε so that e^−1 − ε ≥ 1/4 (for example) and then
applying the Karp-Upfal-Wigderson inequality (I.3.1) with µ(n) = n/4 to
get

    E[T(n)] ≤ ∫_1^n (1/(t/4)) dt = 4 ln n.

There is a sneaky trick here, which is that we stop if we get down to 1 candy
instead of 0. This avoids the usual problem with KUW and ln 0, by observing
that we can’t ever get down to exactly one candy: if there were exactly one
candy that gets grabbed twice or not at all, then there must be some other
candy that also gets grabbed twice or not at all.
This analysis is sufficient for an asymptotic estimate: the last candy gets
grabbed in O(log n) rounds on average. For most computer-sciency purposes,
we’d be done here.
We can improve the constant slightly by observing that (1 − 1/n)^(n−1) is
in fact always greater than or equal to e^−1. The easiest way to see this is
to plot the function, but if we want to prove it formally we can show that
(1 − 1/n)^(n−1) is a decreasing function by taking the derivative of its
logarithm:

    (d/dn) ln (1 − 1/n)^(n−1) = (d/dn) (n − 1) ln(1 − 1/n)
                              = ln(1 − 1/n) + ((n − 1)/(1 − 1/n))·(1/n²),

and observing that it is negative for n > 1 (we could also take the derivative
of the original function, but it’s less obvious that it’s negative). So if it
approaches e^−1 in the limit, it must do so from above, implying
(1 − 1/n)^(n−1) ≥ e^−1.
This lets us apply (I.3.1) with µ(n) = n/e, giving E[T(n)] ≤ e ln n.
If we skip the KUW bound and use the analysis in §I.4.2 instead, we get
that Pr[T(n) ≥ ln n + ln(1/ε)] ≤ ε. This suggests that the actual expected
value should be (1 + o(1)) ln n.
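A simulation of the process illustrates the (1 + o(1)) ln n behavior (the
parameters are made-up):

    import math, random
    from collections import Counter

    def rounds(n):
        t = 0
        while n > 0:
            counts = Counter(random.randrange(n) for _ in range(n))
            n -= sum(1 for c in counts.values() if c == 1)  # matched pairs drop out
            t += 1
        return t

    n = 1000
    print(sum(rounds(n) for _ in range(100)) / 100, math.log(n))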

G.2 Assignment 2: due Wednesday, 2011-02-09, at 17:00

G.2.1 Randomized dominating set
A dominating set in a graph G = (V, E) is a set of vertices D such that
each of the n vertices in V is either in D or adjacent to a vertex in D.
Suppose we have a d-regular graph, in which every vertex has exactly d
neighbors. Let D_1 be a random subset of V in which each vertex appears
with independent probability p. Let D be the union of D_1 and the set of
all vertices that are not adjacent to any vertex in D_1. (This construction is
related to a classic maximal independent set algorithm of Luby [Lub85], and
has the desirable property in a distributed system of finding a dominating
set in only one round of communication.)

1. What would be a good value of p if our goal is to minimize E[|D|], and
what bound on E[|D|] does this value give?

2. For your choice of p above, what bound can you get on
Pr[|D| − E[|D|] ≥ t]?

Solution
1. First let’s compute E[|D|]. Let X_v be the indicator for the event
that v ∈ D. Then X_v = 1 if either (a) v is in D_1, which occurs
with probability p; or (b) v and all d of its neighbors are not in D_1,
which occurs with probability (1 − p)^(d+1). Adding these two cases gives
E[X_v] = p + (1 − p)^(d+1) and thus

    E[|D|] = Σ_v E[X_v] = n(p + (1 − p)^(d+1)).           (G.2.1)

We optimize E[|D|] in the usual way, by seeking a minimum for E[X_v].
Differentiating with respect to p and setting to 0 gives
1 − (d + 1)(1 − p)^d = 0, which we can solve to get p = 1 − (d + 1)^(−1/d).
(We can easily observe that this must be a minimum, because setting p to
either 0 or 1 gives E[X_v] = 1.)
The value of E[|D|] for this value of p is the rather nasty expression
n(1 − (d + 1)^(−1/d) + (d + 1)^(−1−1/d)).
Plotting the factor multiplying n suggests that it goes to ln d/d in the
limit, and both Maxima and www.wolframalpha.com agree with this. Knowing
the answer, we can prove it by showing

    lim_{d→∞} (1 − (d + 1)^(−1/d) + (d + 1)^(−1−1/d)) / (ln d/d)
        = lim_{d→∞} (1 − d^(−1/d) + d^(−1−1/d)) / (ln d/d)
        = lim_{d→∞} (1 − d^(−1/d)) / (ln d/d)
        = lim_{d→∞} (1 − e^(−ln d/d)) / (ln d/d)
        = lim_{d→∞} (1 − (1 − ln d/d + O(ln² d/d²))) / (ln d/d)
        = lim_{d→∞} (ln d/d + O(ln² d/d²)) / (ln d/d)
        = lim_{d→∞} (1 + O(ln d/d))
        = 1.

This lets us write E[|D|] = (1 + o(1))·n ln d/d, where we bear in mind
that the o(1) term depends on d but not n.

2. Suppose we fix the random choices for all vertices but one vertex u.
Moving u from outside D_1 to inside D_1 can increase |D| by at most one
(if u wasn’t already in D) and can decrease it by at most d − 1 (if u
wasn’t already in D and adding u to D_1 lets all d of u’s neighbors drop
out). So we can apply the method of bounded differences with
c_i = d − 1 to get

    Pr[|D| − E[|D|] ≥ t] ≤ exp(−t² / (2n(d − 1)²)).

A curious feature of this bound is that it doesn’t depend on p at all. It
may be possible to get a tighter bound using a better analysis, which
might pay off for very large d (say, d ≫ √n).
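The ln d/d limit is easy to eyeball numerically:

    import math

    for d in [4, 16, 64, 256, 1024]:
        p = 1 - (d + 1) ** (-1 / d)
        factor = p + (1 - p) ** (d + 1)     # E[|D|]/n at the optimal p
        print(d, factor, math.log(d) / d)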

G.2.2 Chernoff bounds with variable probabilities

Let X_1 . . . X_n be a sequence of Bernoulli random variables, where for all i,
E[X_i | X_1 . . . X_{i−1}] ≤ p_i. Let S = Σ_{i=1}^{n} X_i and µ = Σ_{i=1}^{n} p_i. Show
that, for all δ ≥ 0,

    Pr[S ≥ (1 + δ)µ] ≤ (e^δ / (1 + δ)^(1+δ))^µ.

Solution
Let S_t = Σ_{i=1}^{t} X_i, so that S = S_n, and let µ_t = Σ_{i=1}^{t} p_i. We’ll show
by induction on t that E[e^(αS_t)] ≤ exp((e^α − 1)µ_t), when α > 0.
Compute

    E[e^(αS_n)] = E[e^(αS_{n−1})·e^(αX_n)]
              = E[e^(αS_{n−1})·E[e^(αX_n) | X_1, . . . , X_{n−1}]]
              = E[e^(αS_{n−1})·(Pr[X_n = 0 | X_1, . . . , X_{n−1}] + e^α·Pr[X_n = 1 | X_1, . . . , X_{n−1}])]
              = E[e^(αS_{n−1})·(1 + (e^α − 1)·Pr[X_n = 1 | X_1, . . . , X_{n−1}])]
              ≤ E[e^(αS_{n−1})·(1 + (e^α − 1)·p_n)]
              ≤ E[e^(αS_{n−1})]·exp((e^α − 1)·p_n)
              ≤ exp((e^α − 1)·µ_{n−1})·exp((e^α − 1)·p_n)
              = exp((e^α − 1)·µ_n).

Now apply the rest of the proof of (5.2.1) to get the full result.

G.2.3 Long runs

Let W be a binary string of length n, generated uniformly at random. Define
a run of ones as a maximal sequence of contiguous ones; for example, the
string 11100110011111101011 contains 5 runs of ones, of length 3, 2, 6, 1,
and 2.
Let X_k be the number of runs in W of length k or more.

1. Compute the exact value of E[X_k] as a function of n and k.

2. Give the best concentration bound you can for |X_k − E[X_k]|.

Solution
1. We’ll compute the probability that any particular position i = 1 . . . n is
the start of a run of length k or more, then sum over all i. For a run of
length k to start at position i, either (a) i = 1 and W_i . . . W_{i+k−1} are
all 1, or (b) i > 1, W_{i−1} = 0, and W_i . . . W_{i+k−1} are all 1. Assuming
n ≥ k, case (a) adds 2^−k to E[X_k] and case (b) adds (n − k)·2^(−k−1),
for a total of 2^−k + (n − k)·2^(−k−1) = (n − k + 2)·2^(−k−1).

2. We can get an easy bound without too much cleverness using McDiarmid’s
inequality (5.3.13). Observe that X_k is a function of the independent
random variables W_1 . . . W_n and that changing one of these bits
changes X_k by at most 1 (this can happen in several ways: a previous
run of length k − 1 can become a run of length k or vice versa, or two
runs of length k or more separated by a single zero may become a single
run, or vice versa). So (5.3.13) gives Pr[|X − E[X]| ≥ t] ≤ 2 exp(−t²/2n).
We can improve on this a bit by grouping the W_i together into blocks
of length ℓ. If we are given control over a block of ℓ consecutive bits
and want to minimize the number of runs, we can either (a) make
all the bits zero, causing no runs to appear within the block and
preventing adjacent runs from extending to length k using bits from
the block, or (b) make all the bits one, possibly creating a new run but
possibly also causing two existing runs on either side of the block to
merge into one. In the first case, changing all the bits to one except
for a zero after every k consecutive ones creates at most
⌊(ℓ + 2k − 1)/(k + 1)⌋ new runs. Treating each of the ⌈n/ℓ⌉ blocks as
a single variable then gives

    Pr[|X − E[X]| ≥ t] ≤ 2 exp(−t²/(2⌈n/ℓ⌉·(⌊(ℓ + 2k − 1)/(k + 1)⌋)²)).

Staring at plots of the denominator for a while suggests that it is
minimized at ℓ = k + 3, the largest value with ⌊(ℓ + 2k − 1)/(k + 1)⌋ ≤ 2.
This gives Pr[|X − E[X]| ≥ t] ≤ 2 exp(−t²/(8⌈n/(k + 3)⌉)), improving the
bound on t from Θ(√(n log(1/ε))) to Θ(√((n/k) log(1/ε))).
For large k, the expectation of any individual X_k becomes small, so we’d
expect that Chernoff bounds would work better on the upper bound
side than the method of bounded differences. Unfortunately, we don’t
have independence. But from Problem G.2.2, we know that the usual
Chernoff bound works as long as we can show E[X_i | X_1, . . . , X_{i−1}] ≤ p_i
for some sequence of fixed bounds p_i.
For X_1, there are no previous X_i, and we have E[X_1] = 2^−k exactly.
For X_i with i > 1, fix X_1, . . . , X_{i−1}; that is, condition on the event
X_j = x_j for all j < i with some fixed sequence x_1, . . . , x_{i−1}. Let’s call
this event A. Depending on the particular values of the x_j, it’s not
clear how conditioning on A will affect X_i; but we can split on the
value of W_{i−1} to show that either it has no effect or X_i = 0:

    E[X_i | A] = E[X_i | A, W_{i−1} = 0]·Pr[W_{i−1} = 0 | A]
                   + E[X_i | A, W_{i−1} = 1]·Pr[W_{i−1} = 1 | A]
               ≤ 2^−k·Pr[W_{i−1} = 0 | A] + 0
               ≤ 2^−k.

So we have p_i ≤ 2^−k for all 1 ≤ i ≤ n − k + 1. This gives
µ = 2^−k·(n − k + 1), and Pr[X ≥ (1 + δ)µ] ≤ (e^δ/(1 + δ)^(1+δ))^µ.
If we want a two-sided bound, we can set δ = 1 (since X can’t
drop below 0 anyway), and get Pr[|X − E[X]| > 2^−k·(n − k + 1)] ≤
(e/4)^(2^−k·(n−k+1)). This is exponentially small for k = o(lg n). If k is
much bigger than lg n, then we have E[X] ≪ 1, so Markov’s inequality
alone gives us a strong concentration bound.
However, in both cases, the bounds are competitive with the previous
bounds from McDiarmid’s inequality only if E[X] = O(√(n log(1/ε))).
So McDiarmid’s inequality wins for k = o(log n), Markov’s inequality
wins for k = ω(log n), and Chernoff bounds may be useful for a small
interval in the middle.

G.3 Assignment 3: due Wednesday, 2011-02-23, at 17:00

G.3.1 Longest common subsequence
A common subsequence of two sequences v and w is a sequence u of length
k such that there exist indices i_1 < i_2 < · · · < i_k and j_1 < j_2 < · · · < j_k
with u_ℓ = v_{i_ℓ} = w_{j_ℓ} for all ℓ. For example, ardab is a common
subsequence of abracadabra and cardtable.
Let v and w be words of length n over an alphabet of size n drawn
independently and uniformly at random. Give the best upper bound you
can on the expected length of the longest common subsequence of v and w.

Solution
Let’s count the expectation of the number X_k of common subsequences of
length k. We have C(n,k) choices of positions in v, and C(n,k) choices of
positions in w; for each such choice, there is a probability of exactly n^−k
that the corresponding positions match. This gives

    E[X_k] = C(n,k)²·n^−k
           < n^k / (k!)²
           < n^k / (k/e)^(2k)
           = (ne²/k²)^k.

We’d like this bound to be substantially less than 1. We can’t reasonably
expect this to happen unless the base of the exponent is less than 1, so we
need k > e√n.
If k = (1 + ε)e√n for any ε > 0, then E[X_k] < (1 + ε)^(−2e√n) < 1/n for
sufficiently large n. It follows that the expected length of the longest common
subsequence is at most (1 + ε)e√n for sufficiently large n (because if there
are no length-k subsequences, the longest subsequence has length at most
k − 1, and if there is at least one, the longest has length at most n; this gives
a bound of at most (1 − 1/n)(k − 1) + (1/n)·n < k). So in general we have
that the length of the longest common subsequence is at most (1 + o(1))e√n.
Though it is not required by the problem, here is a quick argument that
the expected length of the longest common subsequence is Ω(√n), based
on the Erdős-Szekeres theorem [ES35].¹ The Erdős-Szekeres theorem
says that any permutation of n² + 1 elements contains either an increasing
sequence of n + 1 elements or a decreasing sequence of n + 1 elements. Given
two random sequences of length n, let S be the set of all elements that appear
in both, and consider two permutations ρ and σ of S corresponding to the
order in which the elements appear in v and w, respectively (if an element
appears multiple times, pick one of the occurrences at random). Then the
Erdős-Szekeres theorem says that ρ contains a sequence of length at least
⌊√(|ρ| − 1)⌋ that is either increasing or decreasing with respect to the order
given by σ; by symmetry, the probability that it is increasing is at least 1/2.
This gives an expected value for the longest common subsequence that is at
least E[√(|ρ| − 1)]/2.
Let X = |ρ|. We can compute a lower bound on E[X] easily; each possible
element fails to occur in v with probability (1 − 1/n)^n ≤ e^−1, and similarly
for w. So the chance that an element appears in both sequences is at least
(1 − e^−1)², and thus E[X] ≥ n(1 − e^−1)². What we want is a lower bound
on E[√(X − 1)]/2; here, since √x is concave with √0 = 0 and X − 1 ≤ n, we
have √(X − 1) ≥ (X − 1)/√n, so E[√(X − 1)]/2 ≥ (E[X] − 1)/(2√n) ≥
(n(1 − e^−1)² − 1)/(2√n) ≈ ((1 − e^−1)²/2)·√n.
This is not a very good bound (empirically, the real bound seems to be
in the neighborhood of 1.9√n when n = 10000), but it shows that the upper
bound of (1 + o(1))e√n is tight up to constant factors.

¹As suggested by Benjamin Kunsberg.
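The empirical figure quoted above can be reproduced with the standard
dynamic program for LCS (here at a smaller, made-up n to keep the O(n²)
cost manageable):

    import math, random

    def lcs(v, w):
        prev = [0] * (len(w) + 1)
        for a in v:
            cur = [0]
            for j, b in enumerate(w, 1):
                cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[-1]))
            prev = cur
        return prev[-1]

    n = 1000
    v = [random.randrange(n) for _ in range(n)]
    w = [random.randrange(n) for _ in range(n)]
    print(lcs(v, w), math.e * math.sqrt(n))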

G.3.2 A strange error-correcting code

Let Σ be an alphabet of size m + 1 that includes m non-blank symbols
and a special blank symbol. Let S be a set of C(n,k) strings of length n,
with non-blank symbols in exactly k positions each, such that no two strings
in S have non-blank symbols in the same k positions.
For what value of m can you show S exists such that no two strings in S
have the same non-blank symbols in k − 1 positions?

Solution
This is a job for the Lovász Local Lemma. And it’s even symmetric, so we
can use the symmetric version (Corollary 13.3.2).
Suppose we assign the non-blank symbols to each string uniformly and
independently at random. For each A ⊆ S with |A| = k, let XA be the string
that has non-blank symbols in all positions in A. For each pair of subsets
A, B with |A| = |B| = k and |A ∩ B| = k − 1, let CA,B be the event that XA
and XB are identical on all positions in A ∩ B. Then Pr [CA,B ] = m−k+1 .
We now now need to figure out how many events are in each neighborhood
Γ(CA,B ). Since CA,B depends only on the choices of values for A and B, it
is independent of any events CA0 ,B 0 where neither of A0 or B 0 is equal to A
or B. So we can make Γ(CA,B ) consist of all events CA,B 0 and CA0 ,B where
B 0 6= B and A0 6= A.
For each fixed A, there are exactly (n − k)k events B that overlap it in
k − 1 places, because we can specify B by choosing the elements in B \ A
and A \ B. This gives (n − k)k − 1 events CA,B 0 where B 0 6= B. Applying
the same argument for A0 gives a total of d = 2(n − k)k − 2 events in
Γ(CA,B ). Corollary 13.3.2 applies if ep(d + 1) ≤ 1, which in this case means
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 460

em−(k−1) (2(n − k)k − 1) ≤ 1. Solving for m gives

m ≥ (2e(n − k)k − 1)1/(k−1) . (G.3.1)

For k  n, the (n − k)1/(k−1) ≈ n1/(k−1) term dominates the shape of


the right-hand side asymptotically as k gets large, since everything else goes
to 1. This suggests we need k = Ω(log n) to get m down to a constant.
Note that (G.3.1) doesn’t work very well when k = 1.2 For the k = 1
case, there is no overlap in non-blank positions between different strings, so
m = 1 is enough.

G.3.3 A multiway cut


Given a graph G = (V, E), a 3-way cut is a set of edges whose endpoints
lie in different parts of a partition of the vertices V into three disjoint parts
S∪T ∪U =V.

1. Show that any graph with m edges has a 3-way cut with at least 2m/3
edges.

2. Give an efficient deterministic algorithm for finding such a cut.

Solution
1. Assign each vertex independently to S, T , or U with probability 1/3
each. Then the probability that any edge uv is contained in the cut is
exactly 2/3. Summing over all edges gives an expected 2m/3 edges.

2. We’ll derandomize the random vertex assignment using the method of


conditional probabilities. Given a partial assignment of the vertices,
we can compute the conditional expectation of the size of the cut
assuming all other vertices are assigned randomly: each edge with
matching assigned endpoints contributes 0 to the total, each edge
with non-matching assigned endpoints contributes 1, and each edge
with zero or one assigned endpoints contributes 2/3. We’ll pick values
for the vertices in some arbitrary order to maximize this conditional
expectation (since our goal is to get a large cut). At each step, we need
only consider the effect on edges incident to the vertex we are assigning
whose other endpoints are already assigned, because the contribution
of any other edge is not changed by the assignment. Then maximizing
2
Thanks to Brad Hayes for pointing this out.
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 461

the conditional probability is done by choosing an assignment that


matches the assignment of the fewest previously-assigned neighbors:
in other words, the natural greedy algorithm works. The cost of this
algorithm is O(n + m), since we loop over all vertices and have to check
each edge at most once for each of its endpoints.

G.4 Assignment 4: due Wednesday, 2011-03-23, at


17:00
G.4.1 Sometimes successful betting strategies are possible
You enter a casino with X0 = a dollars, and leave if you reach 0 dollars or
b or more dollars, where a, b ∈ N. The casino is unusual in that it offers
arbitrary fair games subject to the requirements that:

• Any payoff resulting from a bet must be a nonzero integer in the range
−Xt to Xt , inclusive, where Xt is your current wealth.

• The expected payoff must be exactly 0. (In other words, your assets
Xt should form a martingale sequence.)

For example, if you have 2 dollars, you may make a bet that pays off −2
with probability 2/5, +1 with probability 2/5 and +2 with probability 1/5;
but you may not make a bet that pays off −3, +3/2, or +4 under any
circumstances, or a bet that pays off −1 with probability 2/3 and +1 with
probability 1/3.

1. What strategy should you use to maximize your chances of leaving


with at least b dollars?

2. What strategy should you use to maximize your changes of leaving


with nothing?

3. What strategy should you use to maximize the number of bets you
make before leaving?

Solution
1. Let Xt be your wealth at time t, and let τ be the stopping time when
you leave. Because {Xt } is a martingale, E [X0 ] = a = E [Xτ ] =
Pr [Xτ ≥ b] E [Xτ | Xτ ≥ b]. So Pr [Xτ ≥ b] is maximized by making
E [Xτ | Xτ ≥ b] as small as possible. It can’t be any smaller than b,
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 462

which can be obtained exactly by making only ±1 bets. The gives a


probability of leaving with b of exactly a/b.

2. Here our goal is to minimize Pr [Xτ ≥ b], so we want to make E [Xτ | Xτ ≥ b]


as large as possible. The largest value of Xτ we can possibly reach
is 2(b − 1); we can obtain this value by betting ±1 until we reach
b − 1, then making any fair bet with positive payoff b − 1 (for example,
±(b − 1) with equal probability works, as does a bet that pays off b − 1
with probability 1/b and −1 with probability (b − 1)/b). In this case
a
we get a probability of leaving with 0 of 1 − 2(b−1) .

3. For each t, let δt = Xt −Xt−1 and Vt = Var[δt | Ft ]. We have previously


shown (see the footnote to §9.4.1) that E Xτ = E X0 + E [ τt=1 Vt ]
2
 2 P

where τ is an appropriate stopping time. When we stop, we know


that Xτ2 ≤ (2(b − 1))2 , which puts an upper bound on E [ τi=1 Vi ]. We
P

can spend this bound most parsimoniously by minimizing Vi as much


as possible. If we make each δt = ±1, we get the smallest possible
value for Vt (since any change contributes at least 1 to the variance).
However, in this case we don’t get all the way out to 2(b − 1) at the
high end; instead, we stop at b, giving an expected number of steps
equal to a(b − a).
We can do a bit better than this by changing our strategy at b − 1.
Instead of betting ±1, let’s pick some x and place a bet that pays off
b−1 with probability 1b and −1 with probability b−1 1
b = 1− b . (The idea
here is to minimize the conditional variance while still allowing ourselves
to reach 2(b − 1).) Each ordinary random walk step has Vt = 1; a “big”
2 2
bet starting at b − 1 has Vt = 1 − 1b + (b−1)
b = b−1+b b−2b+1 = b − 1b .
To analyze this process, observe that starting from a, we first spending
a
a(b−a−1) steps on average to reach either 0 (with probability 1− b−1 or
a
b−1 (with probability b−1 . In the first case, we are done. Otherwise, we
take one more step, then with probability 1b we lose and with probability
b−1
b we continue starting from b − 2. We can write a recurrence for our
expected number of steps T (a) starting from a, as:

a b−1
 
T (a) = a(b − a − 1) + 1+ T (b − 2) . (G.4.1)
b−1 b
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 463

When a = b − 2, we get
b−2 b−1
 
T (b − 2) = (b − 2) + 1+ T (b − 2)
b−1 b
1 b−2
 
= (b − 2) 1 + + T (b − 2),
b−1 b
which gives

(b − 2) 2b−1
b−1
T (b − 2) =
2/b
b(b − 2)(2b − 1)
= .
2(b − 1)
Plugging this back into (G.4.1) gives
a b − 1 b(b − 2)(2b − 1)
 
T (a) = a(b − a − 1) + 1+
b−1 b 2(b − 1)
a a(b − 2)(2b − 1)
= ab − a2 + a + +
b−1 2(b − 1)
3
= ab + O(b). (G.4.2)
2
This is much better than the a(b − a) value for the straight ±1 strategy,
especially when a is also large.
I don’t know if this particular strategy is in fact optimal, but that’s
what I’d be tempted to bet.

G.4.2 Random walk with reset


Consider a random walk on N that goes up with probability 1/2, down with
probability 3/8, and resets to 0 with probability 1/8. When Xt > 0, this
gives:

Xt + 1with probability 1/2,


Xt+1 = Xt − 1 with probability 3/8, and


0 with probability 1/8.

When Xt = 0, we let Xt+1 = 1 with probability 1/2 and 0 with probability


1/2.

1. What is the stationary distribution of this process?


APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 464

2. What is the mean recurrence time µn for some state n?

3. Use µn to get a tight asymptotic (i.e., big-Θ) bound on µ0,n , the


expected time to reach n starting from 0.

Solution
1
1. For n > 0, we have πn = 2 πn−1 + 38 πn+1 , with a base case π0 =
1 3 3
8 + 8 π0 + 8 π1 .
The πn expression is a linear homogeneous recurrence, so its solution
consists of linear combinations of terms bn , where b satisfies 1 =
1 −1
2b + 38 b. The solutions to this equation are b = 2/3 and b = 2; we
can exclude the b = 2 case because it would make our probabilities
blow up for large n. So we can reasonably guess πn = a(2/3)n when
n > 0.
1
For n = 0, substitute π0 = 8 + 38 π0 + 38 a(2/3) to get π0 = 1
5 + 25 a. Now
substitute

π1 = (2/3)a
1 3
= π0 + a(2/3)2
2 8 
1 1 2 3
= + a + a(2/3)2
2 5 5 8
1 11
= + a,
10 30
which we can solve to get a = 1/3.
So our candidate π is π0 = 1/3, πn = (1/3)(2/3)n , and in fact we can
drop the special case for π0 .
Pn Pn n 1/3
As a check, i=0 πn = (1/3) i=0 (2/3) = 1−2/3 = 1.

2. Since µn = 1/πn , we have µn = 3(3/2)n .

3. In general, let µk,n be the expected time to reach n starting at k. Then


µn = µn,n = 1 + 18 µ0,n + 12 µn+1,n + 38 µn−1,n ≥ 1 + µ0,n /8. It follows
that µ0,n ≤ 8µn + 1 = 24(3/2)n + 1 = O((3/2)n ).
For the lower bound, observe that µn ≤ µn,0 + µ0,n . Since there is a
1/8 chance of reaching 0 from any state, we have µn,0 ≤ 8. It follows
that µn ≤ 8 + µ0,n or µ0,n ≥ µn − 8 = Ω((3/2)n ).
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 465

G.4.3 Yet another shuffling algorithm


Suppose we attempt to shuffle a deck of n cards by picking a card uniformly
at random, and swapping it with the top card. Give the best bound you can
on the mixing time for this process to reach a total variation distance of 
from the uniform distribution.

Solution
It’s tempting to use the same coupling as for move-to-top (see §10.4.3). This
would be that at each step we choose the same card to swap to the top
position, which increases by at least one the number of cards that are in the
same position in both decks. The problem is that at the next step, these two
cards are most likely separated again, by being swapped with other cards in
two different positions.
Instead, we will do something slightly more clever. Let Zt be the number
of cards in the same position at time t. If the top cards of both decks are
equal, we swap both to the same position chosen uniformly at random. This
has no effect on Zt . If the top cards of both decks are not equal, we pick
a card uniformly at random and swap it to the top in both decks. This
increases Zt by at least 1, unless we happen to pick cards that are already in
the same position; so Zt increases by at least 1 with probability 1 − Zt /n.
Let’s summarize a state by an ordered pair (k, b) where k = Zt and b is
0 if the top cards are equal and 1 if they are not equal. Then we have a
Markov chain where (k, 0) goes to (k, 1) with probability n−k n (and otherwise
stays put); and (k, 1) goes to (k + 1, 0) (or higher) with probability n−k
n and
k
to (k, 0) with probability n .
n
Starting from (k, 0), we expect to wait n−k steps on average to reach
(k, 1), at which point we move to (k + 1, 0) or back to (k, 0) in one more step;
n
we iterate through this process n−k times on average before we are successful.
This gives an expected number  of steps to get from (k, 0) to (k + 1, 0) (or
n n
possibly a higher value) of n−k n−k + 1) . Summing over k up to n − 2
(since once k > n − 2, we will in fact have k = n, since k can’t be n − 1), we
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 466

get
n−2
n n
X  
E [τ ] ≤ +1
k=0
n−k n−k
n
!
X n2 n
= +
m=2
m2 m
!
2 π2
≤n − 1 + n ln n.
6
= O(n2 ).
So we expect the deck to mix in O(n2 log(1/)) steps. (I don’t know if
this is the real bound; my guess is that it should be closer to O(n log n) as
in all the other shuffling procedures.)

G.5 Assignment 5: due Thursday, 2011-04-07, at


23:59
G.5.1 A reversible chain
Consider a random walk on Zm , where pi,i+1 = 2/3 for all i and pi,i−1 = 1/3
for all i except i = 0. Is it possible to assign values to p0,m−1 and p0,0 to
make this chain reversible, and if so, what stationary distribution do you
get?

Solution
Suppose we can make this chain reversible, and let π be the resulting
stationary distribution. From the detailed balance equations, we have
(2/3)πi = (1/3)πi+1 or πi+1 = 2πi for i = 0 . . . m − 2. The solution to
i
this recurrence is πi = 2i π0 , which gives πi = 2m2−1 when we set π0 to get
P
i πi = 1.
Now solve π0 p0,m−1 = πm−1 pm−1,0 to get
πm−1 pm−1,0
p0,m−1 =
π0
= 2m−1 (2/3)
= 2m /3.
This is greater than 1 for m > 1, so except for the degenerate cases of
m = 1 and m = 2, it’s not possible to make the chain reversible.
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 467

G.5.2 Toggling bits


Consider the following Markov chain on an array of n bits a[1], a[2], . . . a[n].
At each step, we choose a position i uniformly at random. We then change
A[i] to ¬A[i] with probability 1/2, provided i = 1 or A[i − 1] = 1 (if neither
condition holds hold, do nothing).3

1. What is the stationary distribution?

2. How quickly does it converge?

Solution
1. First let’s show irreducibility. Starting from an arbitrary configuration,
repeatedly switch the leftmost 0 to a 1 (this is always permitted
by the transition rules); after at most n steps, we reach the all-1
configuration. Since we can repeat this process in reverse to get to any
other configuration, we get that every configuration is reachable from
every other configuration in at most 2n steps (2n − 1 if we are careful
about handling the all-0 configuration separately).
We also have that for any two adjacent configurations x and y, pxy =
1
pyx = 2n . So we have a reversible, irreducible, aperiodic (because
there exists at least one self-loop) chain with a uniform stationary
distribution πx = 2−n .

2. Here is a bound using the obvious coupling, where we choose the same
position in X and Y and attempt to set it to the same value. To
show this coalesces, given Xt and Yt define Zt to be the position of the
rightmost 1 in the common prefix of Xt and Yt , or 0 if there is no 1 in
the common prefix of Xt and Yt . Then Zt increases by at least 1 if we
1
attempt to set position Zt + 1 to 1, which occurs with probability 2n ,
and decreases by at most 1 if we attempt to set Zt to 0, again with
1
probability 2n .
It follows that Zt reaches n no later than a ±1 random walk on 0 . . . n
with reflecting barriers that takes a step every 1/n time units on
average. The expected number of steps to reach n from the worst-case
3
Motivation: Imagine each bit represents whether a node in some distributed system
is inactive (0) or active (1), and you can only change your state if you have an active
left neighbor to notify. Also imagine that there is an always-active base station at −1
(alternatively, imagine that this assumption makes the problem easier than the other
natural arrangement where we put all the nodes in a ring).
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 468

starting position of 0 is exactly n2 . (Proof: model the random walk


with a reflecting barrier at 0 by folding a random walk with absorbing
barriers at ±n in half, then use the bound from §9.4.1.) We must then
multiply this by n to get an expected n3 steps in the original process.
So the two copies coalesce in at most n3 expected steps. My suspicion
is one could improve this bound with a better analysis by using the
bias toward increasing Zt to get the expected time to coalesce down to
O(n2 ), but I don’t know any clean way to do this.
The path coupling version of this is that we look at two adjacent
configurations Xt and Yt , use the obvious coupling again, and see
what happens to E [d(Xt+1 , Yt+1 ) | Xt , Yt ], where the distance is the
number of transitions needed to convert Xt to Yt or vice versa. If
we pick the position i where Xt and Yt differ, then we coalesce; this
occurs with probability 1/n. If we change the 1 to the left of i to
a 0, then d(Xt+1 , Yt+1 ) rises to 3 (because to get from Xt+1 to Yt+1 ,
we have to change position i − 1 to 1, change position i, and then
change position i − 1 back to 0); this occurs with probability 1/2n if
i > 1. But we can also get into trouble if we try to change position
i + 1; we can only make the change in one of Xt and Yt , so we get
d(Xt+1 , Yt+1 ) = 2 in this case, which occurs with probability 1/2n
when i < n. Adding up all three cases gives a worst-case expected
change of −1/n + 2/2n + 1/2n = 1/2n > 0. So unless we can do
something more clever, path coupling won’t help us here.
However, it is possible to get a bound using canonical paths, but the
best bound I could get was not as good as the coupling bound. The
basic idea is that we will change x into y one bit at a time (from left
to right, say), so that we will go through a sequence of intermediate
states of the form y[1]y[2] . . . y[i]x[i + 1]x[i + 2] . . . x[n]. But to change
x[i + 1] to y[i + 1], we may also need to reach out with a tentacle of 1
bits from from the last 1 in the current prefix of y (and then retract it
afterwards). Given a particular transition where we change a 0 to a 1,
we can reconstruct the original x and y by specifying (a) which bit i at
or after our current position we are trying to change; (b) which 1 bit
before our current position is the last “real” 1 bit in y as opposed to
something we are creating to reach out to position i; and (c) the values
of x[1] . . . x[i − 1] and y[i + 1] . . . y[i]. A similar argument applies to
1 → 0 transitions. So we are routing at most n2 2n−1 paths across each
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 469

transition, giving a bound on the congestion


1
 
ρ≤ n2 2n−1 2−2n
2−n /2n
= n3 .

The bound on τ2 that follows from this is 8n6 , which is pretty bad
(although the constant could be improved by counting the (a) and
(b) bits more carefully). As with the coupling argument, it may be
that there is a less congested set of canonical paths that gives a better
bound.

G.5.3 Spanning trees


Suppose you have a connected graph G = (V, E) with n nodes and m edges.
Consider the following Markov process. Each state Ht is a subgraph of G
that is either a spanning tree or a spanning tree plus an additional edge. At
each step, flip a fair coin. If it comes up heads, choose an edge e uniformly
at random from E and let Ht+1 = Ht ∪ {e} if Ht is a spanning tree and let
Ht+1 = Ht \ {e} if Ht is not a spanning tree and Ht \ {e} is connected. If it
comes up tails and Ht is a spanning tree, let Ht+1 be some other spanning
tree, sampled uniformly at random. In all other cases, let Ht+1 = Ht .
Let N be the number of states in this Markov chain.

1. What is the stationary distribution?

2. How quickly does it converge?

Solution
1. Since every transition has a matching reverse transition with the same
transition probability, the chain is reversible with a uniform stationary
distribution πH = 1/N .

2. Here’s a coupling that coalesces in at most 4m/3 + 2 expected steps:

(a) If Xt and Yt are both trees, then send them to the same tree
with probability 1/2; else let them both add edges independently
(or we could have them add the same edge—it doesn’t make any
difference to the final result).
(b) If only one of Xt and Yt is a tree, with probability 1/2 scramble
the tree while attempting to remove an edge from the non-tree,
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 470

and the rest of the time scramble the non-tree (which has no
effect) while attempting to add an edge to the tree. Since the
non-tree has at least three edges that can be removed, this puts
(Xt+1 , Yt+1 in the two-tree case with probability at least 3/2m.
(c) If neither Xt nor Yt is a tree, attempt to remove an edge from
both. Let S and T be the sets of edges that we can remove from
Xt and Yt , respectively, and let k = min(|S|, |T |) ≥ 3. Choose k
edges from each of S and T and match them, so that if we remove
one edge from each pair, we also remove the other edge. As in
the previous case, this puts (Xt+1 , Yt+1 in the two-tree case with
probability at least 3/2m.

To show this coalesces, starting from an arbitrary state, we reach


a two-tree state in at most 2m/3 expected steps. After one more
step, we either coalesce (with probability 1/2) or restart from a new
arbitrary state. This gives an expected coupling time of at most
2(2m/3 + 1) = 4m/3 + 2 as claimed.

G.6 Assignment 6: due Monday, 2011-04-25, at


17:00
G.6.1 Sparse satisfying assignments to DNFs
Given a formula in disjunctive normal form, we’d like to estimate the number
of satisfying assignments in which exactly w of the variables are true. Give
a fully polynomial-time randomized approximation scheme for this problem.

Solution
Essentially, we’re going to do the Karp-Luby covering trick [KL85] described
in §11.4, but will tweak the probability distribution when we generate our
samples so that we only get samples with weight w.
n
Let U be the set of assignment with weight w (there are exactly w such
assignments, where n is the number of variables). For each clause Ci , let
Ui = {x ∈ U | Ci (x) = 1}. Now observe that:
1. We can compute |Ui |. Let ki = |Ci | be the number of variables in Ci
and ki+ = Ci+ the number of variables that appear in positive form
n−k 
in Ci . Then |Ui | = w−k+i is the number of ways to make a total of w
i
variables true using the remaining n − ki variables.
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 471

2. We can sample uniformly from Ui , by sampling a set of w − ki+ true


variables not in Ci uniformly from all variables not in Ci .
3. We can use the values computed for |Ui | to sample i proportionally to
the size of |Ui |.
So now we sample pairs (i, x) with x ∈ Ui uniformly at random by
sampling i first, then sampling x ∈ Ui . As in the original algorithm, we then
count (i, x) if and only if Ci is the leftmost clause for which Ci (x) = 1. The
same argument that at least 1/m of the (i, x) pairs count applies, and so we
get the same bounds as in the original algorithm.

G.6.2 Detecting duplicates


Algorithm G.1 attempts to detect duplicate values in an input array S of
length n.

1 Initialize A[1 . . . n] to ⊥
2 Choose a hash function h
3 for i ← 1 . . . n do
4 x ← S[i]
5 if A[h(x)] = x then
6 return true
7 else
8 A[h(x)] ← x

9 return false
Algorithm G.1: Dubious duplicate detector

It’s easy to see that Algorithm G.1 never returns true unless some value
appears twice in S. But maybe it misses some duplicates it should find.
1. Suppose h is a random function. What is the worst-case probability
that Algorithm G.1 returns false if S contains two copies of some
value?
2. Is this worst-case probability affected if h is drawn instead from a
2-universal family of hash functions?

Solution
1. Suppose that S[i] = S[j] = x for i < j. Then the algorithm will see x in
A[h(x)] on iteration j and return true, unless it is overwritten by some
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 472

value S[k] with i < k < j. This occurs if h(S[k]) = h(x), which occurs
with probability exactly 1 − (1 − 1/n)j−i−1 if we consider all possible k.
This quantity is maximized at 1 − (1 − 1/n)n−2 ≈ 1 − (1 − 1/n)2 /e ≈
1 − (1 − 1/2n)/e when i = 1 and j = n.

2. As it happens, the algorithm can fail pretty badly if all we know is that
h is 2-universal. What we can show is that the probability that some
S[k] with i < j < k gets hashed to the same place as x = S[i] = S[j]
in the analysis above is at most (j − i − 1)/n ≤ (n − 2)/n = (1 − 2/n),
since each S[k] has at most a 1/n chance of colliding with x and the
union bound applies. But it is possible to construct a 2-universal family
for which we get exactly this probability in the worst case.
Let U = {0 . . . n}, and define for each a in {0 . . . n − 1} ha : U → n by
ha (n) = 0 and ha (x) = (x + a) mod n for x 6= n. Then H = {ha } is 2-
universal, since if x 6= y and neither x nor y is n, Pr [ha (x) = ha (y)] = 0,
and if one of x or y is n, Pr [ha (x) = ha (y)] = 1/n. But if we use this
family in Algorithm G.1 with S[1] = S[n] = n and S[k] = k for
1 < k < n, then there are n − 2 choices of a that put one of the middle
values in A[0].

G.6.3 Balanced Bloom filters


A clever algorithmist decides to solve the problem of Bloom filters filling up
with ones by capping the number of ones at m/2. As in a standard Bloom fil-
ter, an element is inserted by writing ones to A[h1 (x)], A[h2 (x)], . . . , A[hk (x)];
but after writing each one, if the number of one bits in the bit-vector is more
than m/2, one of the ones in the vector (chosen uniformly at random) is
changed back to a zero.
Because some of the ones associated with a particular element might
be deleted, the membership test answers yes if at least 3/4 of the bits
A[h1 (x)] . . . A[hk (x)] are ones.
To simplify the analysis, you may assume that the hi are independent
random functions. You may also assume that (3/4)k is an integer.

1. Give an upper bound on the probability of a false positive when testing


for a value x that has never been inserted.

2. Suppose that we insert x at some point, and then follow this insertion
with a sequence of insertions of new, distinct values y1 , y2 , . . . . Assum-
ing a worst-case state before inserting x, give asymptotic upper and
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 473

lower bounds on the expected number of insertions until a test for x


fails.

Solution
1. The probability of a false positive is maximized when exactly half the
bits in A are one. If x has never been inserted, each A[hi (x)] is equally
likely to be zero or one. So Pr [false positive for x] = Pr [Sk ≥ (3/4)k]
when Sk is a binomial random variable with parameters 1/2 and k.
Chernoff bounds give

Pr [Sk ≥ (3/4)k] = Pr [Sk ≥ (3/2) E [Sk ]]


!k/2
e1/2

(3/2)3/2
≤ (0.94734)k .

We can make this less than any fixed  by setting k ≥ 20 ln(1/) or


thereabouts.

2. For false negatives, we need to look at how quickly the bits for x are
eroded away. A minor complication is that the erosion may start even
as we are setting A[h1 (x)] . . . A[hk (x)].
Let’s consider a single bit A[i] and look at how it changes after (a)
setting A[i] = 1, and (b) setting some random A[r] = 1.
In the first case, A[i] will be 1 after the assignment unless it is set back
1
to zero, which occurs with probability m/2+1 . This distribution does
not depend on the prior value of A[i].
In the second case, if A[i] was previously 0, it becomes 1 with probability

1 1 1 m/2
 
1− = ·
m m/2 + 1 m m/2 + 1
1
= .
m+2

If it was previously 1, it becomes 0 with probability


1 1 1
· = .
2 m/2 + 1 m+2
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 474

So after the initial assignment, A[i] just flips its value with probability
1
m+2 .
It is convenient to represent A[i] as ±1; let Xit = −1 if A[i] = 0 at time
t, and 1 otherwise. Then Xit satisfies the recurrence
h i m+1 t 1
E Xit+1 Xit = X − Xt
m+2 i m+2 i
m
= Xit .
m + 2
2
 
= 1− Xit .
m+2
 t
We can extend this to E Xit Xi0 = 1 − 2
Xi0 ≈ e−2t/(m+2) Xi0 .
 
m+2
1
Similarly, after setting A[i] = 1, we get E [Xi ] = 1 − 2 m/2+1 = 1−
4
2m+1 = 1 − o(1).

Let S t = ki=1 Xht i (x) . Let 0 be the time at which we finish inserting x.
P

Then each for each i we have


h i
1 − o(1)e−2k/(m+2) ≤ E Xh0i (x) ≤ 1 − o(1),

from which it follows that


h i
k(1 − o(1))e−2k/(m+2) ≤ E S 0 ≤ 1 − o(1)

and in general that


h i
k(1 − o(1))e−2(k+t)/(m+2) ≤ E S t ≤ 1 − o(1)e−2t/(m+2) .

So for any fixed 0 <  < 1 and sufficiently large m, we will have
E S = k for some t0 where t ≤ t0 ≤ k + t and t = Θ(m ln(1/)).
t


We are now looking for the time at which S t drops below k/2 (the k/2
is because we are working with ±1 variables). We will bound when
this time occurs using Markov’s inequality.
t with E S t ≥ (3/4)k. Then E k − S t ≤
   
Let’s look for the largest time
k/4 and Pr k − S t ≥ k/2 ≤ 1/2. It follows that after Θ(m) − k op-


erations, x is still visible with probability 1/2, which implies that the
expected time at which it stops being visible is at least (Ω(m) − k)/2.
To get the expected number of insert operations, we divide by k, to
get Ω(m/k).
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 475

For the upper bound, apply the same reasoning to the first time at which
 t
E S ≤ k/4. This occurs at time O(m) at the latest (with a different
constant), so after O(m) steps there is at most a 1/2 probability that
S t ≥ k/2. If S t is still greater than k/2 at this point, try again using
the same analysis; this gives us the usual geometric series argument
that E [t] = O(m). Again, we have to divide by k to get the number of
insert operations, so we get O(m/k) in this case.
Combining these bounds, we have that x disappears after Θ(m/k)
insertions on average. This seems like about what we would expect.

G.7 Final exam


Write your answers in the blue book(s). Justify your answers. Work alone.
Do not use any notes or books.
There are four problems on this exam, each worth 20 points, for a total
of 80 points. You have approximately three hours to complete this exam.

G.7.1 Leader election


Suppose we have n processes and we want to elect a leader. At each round,
each process flips a coin, and drops out if the coin comes up tails. We win if
in some round there is exactly one process left.
Let T (n) be the probability that this event eventually occurs starting
with n processes. For small n, we have T (0) = 0 and T (1) = 1. Show that
there is a constant c > 0 such that T (n) ≥ c for all n > 1.

Solution
Let’s suppose that there is some such c. We will necessarily have c ≤ 1 = T (1),
so the induction hypothesis will hold in the base case n = 1.
For n ≥ 2, compute
n
!
n
2−n
X
T (n) = T (k)
k=0
k
n−1
!
n
= 2−n T (n) + 2−n nT (1) + 2−n
X
T (k)
k=2
k
≥ 2−n T (n) + 2−n n + 2−n (2n − n − 2)c.
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 476

Solve for T (n) to get

n + (2n − n − 2)c
T (n) ≥
2n − 1
 n
2 − n − 2 + n/c

=c .
2n − 1

This will be greater than or equal to c if 2n − n − 2 + n/c ≥ 2n − 1 or


n
n/c ≥ n + 1, which holds if c ≤ n+1 . The worst case is n = 2 giving c = 2/3.
Valiant and Vazirani [VV86] used this approach to reduce solving general
instances of SAT to solving instances of SAT with unique solutions; they
prove essentially the result given above (which shows that fixing variables in
a SAT formula is likely to produce a SAT formula with a unique solution at
some point) with a slightly worse constant.

G.7.2 Two-coloring an even cycle


Here is a not-very-efficient algorithm for 2-coloring an even cycle. Every node
starts out red or blue. At each step, we pick one of the n nodes uniformly
at random, and change its color if it has the same color as at least one of
its neighbors. We continue until no node has the same color as either of its
neighbors.
Suppose that in the initial state there are exactly two monochromatic
edges. What is the worst-case expected number of steps until there are no
monochromatic edges?

Solution
Suppose we have a monochromatic edge surrounded by non-monochrome
edges, e.g. RBRRBR. If we pick one of the endpoints of the edge (say
the left endpoint in this case), then the monochromatic edge shifts in the
direction of that endpoint: RBBRBRB. Picking any node not incident to a
monochromatic edge has no effect, so in this case there is no way to increase
the number of monochromatic edges.
It may also be that we have two adjacent monochromatic edges: BRRRB.
Now if we happen to pick the node in the middle, we end up with no
monochromatic edges (BRBRB) and the process terminates. If on the other
hand we pick one of the nodes on the outside, then the monochromatic edges
move away from each other.
We can thus model the process with 2 monochromatic edges as a random
walk, where the difference between the leftmost nodes of the edges (mod
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 477

n) increases or decreases with equal probability 2/n except if the distance


is 1 or −1; in this last case, the distance increases (going to 2 or −2) with
probability 2/n, but decreases only with probability 1/n. We want to know
when this process hits 0 (or n).
Imagine a random walk starting from position k with absorbing barriers
at 1 and n − 1. This reaches 1 or n − 1 after (k − 1)(n − 1 − k) steps on
average, which translates into (n/4)(k − 1)(n − 1 − k) steps of our original
process if we take into account that we only move with probability 4/n
per time unit. This time is maximized by setting k = n/2, which gives
(n/4)(n/2 − 1)2 = n3 /16 − n2 /4 + n/4 expected time units to reach 1 or n − 1
for the first time.
At 1 or n − 1, we wait an addition n/3 steps on average; then with
probability 1/3 the process finishes and with probability 2/3 we start over
from position 2 or n − 2; in the latter case, we run (n/4)(n − 3) + n/3 time
units on average before we may finish again. On average, it takes 3 attempts
to finish. Each attempt incurs the expected n/3 cost before taking a step,
and all but the last attempt incur the expected (n/4)(n − 3) additional steps
for the random walk. So the last phase of the process the process adds
(1/2)n(n − 3) + n = (1/2)n2 − (5/4)n steps.
Adding up all of the costs gives n3 /16 − n2 /4 + n/4 + n/3 + (1/2)n2 −
1 3
(5/4)n = 16 n + 41 n2 − 23 n steps.

G.7.3 Finding the maximum

1 Randomly permute A
2 m ← −∞
3 for i ← 1 . . . n do
4 if A[i] > m then
5 m ← A[i]

6 return m
Algorithm G.2: Randomized max-finding algorithm

Suppose that we run Algorithm G.2 on an array with n elements, all of


which are distinct. What is the expected number of times Line 5 is executed
as a function of n?
APPENDIX G. SAMPLE ASSIGNMENTS FROM SPRING 2011 478

Solution
Let Xi be the indicator variable for the event that Line 5 is executed on
the i-th pass through the loop. This will occur if A[i] is greater than A[j]
for all j < i, which occurs with probability exactly 1/i (given that A has
been permuted randomly). So the expected number of calls to Line 5 is
i = 1n E [Xi ] = ni=1 1i = Hn .
P P

G.7.4 Random graph coloring


Let G be a random d-regular graph on n vertices, that is, a graph drawn
uniformly from the family of all n-vertex graph in which each vertex has
exactly d neighbors. Color the vertices of G red or blue independently at
random.

1. What is the expected number of monochromatic edges in G?

2. Show that the actual number of monochromatic edges is tightly con-


centrated around its expectation.

Solution
The fact that G is itself a random graph is a red herring; all we really need
to know is that it’s d-regular.

1. Because G has exactly dn/2 edges, and each edge has probability 1/2
of being monochromatic, the expected number of monochromatic edges
is dn/4.

2. This is a job for Azuma’s inequality. Consider the vertex exposure


martingale. Changing the color of any one vertex changes the number of
monochromatic edges by at most d. So we have Pr [|X − E [X]| ≥ t] ≤
2 2
2 exp −t2 /2 ni=1 d2 = 2e−t /2nd , which tells us that the deviation is
P

likely to be not much more than O(d n).
Appendix H

Sample assignments from


Spring 2009

H.1 Final exam, Spring 2009


Write your answers in the blue book(s). Justify your answers. Work alone.
Do not use any notes or books.
There are four problems on this exam, each worth 20 points, for a total
of 80 points. You have approximately three hours to complete this exam.

H.1.1 Randomized mergesort (20 points)


Consider the following randomized version of the mergesort algorithm. We
take an unsorted list of n elements and split it into two lists by flipping an
independent fair coin for each element to decide which list to put it in. We
then recursively sort the two lists, and merge the resulting sorted lists. The
merge procedure involves repeatedly comparing the smallest element in each
of the two lists and removing the smaller element found, until one of the lists
is empty.
Compute the expected number of comparisons needed to perform this
final merge. (You do not need to consider the cost of performing the recursive
sorts.)

Solution
Color the elements in the final merged list red or blue based on which sublist
they came from. The only elements that do not require a comparison to
insert into the main list are those that are followed only by elements of the

479
APPENDIX H. SAMPLE ASSIGNMENTS FROM SPRING 2009 480

same color; the expected number of such elements is equal to the expected
length of the longest monochromatic suffix. By symmetry, this is the same as
the expected longest monochromatic prefix, which is equal to the expected
length of the longest sequence of identical coin-flips.
The probability of getting k identical coin-flips in a row followed by a
different coin-flip is exactly 2−k ; the first coin-flip sets the color, the next k −1
must follow it (giving a factor of 2−k+1 , and the last must be the opposite
color (giving an additional factor of 2−1 ). For n identical coin-flips, there is
a probability of 2−n+1 , since we don’t need an extra coin-flip of the opposite
color. So the expected length is n−1 −k + n2−n+1 = Pn −k + n2−n .
P
k=1 k2 k=0 k2Pn
We can simplify the sum using generating functions. The sum k=0 2−k z k
n+1
is given by 1−(z/2)
Pn −k k−1 =
1−z/2 . Taking the derivative with respect to z gives k=0 2 kz
n+1 2 n
(1/2) 1−(z/2)
1−z/2 + (1/2) (n+1)(z/2)
1−z/2 . At z = 1 this is 2(1 − 2−n−1 ) − 2(n +
1)2−n = 2 − (n + 2)2−n . Adding the second term gives E [X] = 2 − 2 · 2−n =
2 − 2−n+1 .
Note that this counts the expected number of elements for which we do
not have to do a comparison; with n elements total, this leaves n − 2 + 2−n+1
comparisons on average.

H.1.2 A search problem (20 points)


Suppose you are searching a space by generating new instances of some
problem from old ones. Each instance is either good or bad; if you generate a
new instance from a good instance, the new instance is also good, and if you
generate a new instance from a bad instance, the new instance is also bad.
Suppose that your start with X0 good instances and Y0 bad instances,
and that at each step you choose one of the instances you already have
uniformly at random to generate a new instance. What is the expected
number of good instances you have after n steps?
Hint: Consider the sequence of values {Xt /(Xt + Yt )}.
APPENDIX H. SAMPLE ASSIGNMENTS FROM SPRING 2009 481

Solution
We can show that the suggested sequence is a martingale, by computing
Xt+1 Xt Xt + 1 Yt Xt
 
E Xt , Yt = · + ·
Xt+1 + Yt+1 Xt + Yt Xt + Yt + 1 Xt + Yt Xt + Yt + 1
Xt (Xt + 1) + Yt Xt
=
(Xt + Yt )(Xt + Yt + 1)
Xt (Xt + Yt + 1)
=
(Xt + Yt )(Xt + Yt + 1)
Xt
= .
Xt + Yt
h i
From the martingale property we have E XnX+Y n
n
= X0X+Y
0
0
. But Xn +
Yn = X0 + Y0 + n,
 a constant,
 so we can multiply both sides by this value to
X0 +Y0 +n
get E [Xn ] = X0 X0 +Y0 .

H.1.3 Support your local police (20 points)


At one point I lived in a city whose local police department supported
themselves in part by collecting fines for speeding tickets. A speeding ticket
would cost 1 unit (approximately $100), and it was unpredictable how often
one would get a speeding ticket. For a price of 2 units, it was possible
to purchase a metal placard to go on your vehicle identifying yourself as
a supporter of the police union, which (at least according to local legend)
would eliminate any fines for subsequent speeding tickets, but which would
not eliminate the cost of any previous speeding tickets.
Let us consider the question of when to purchase a placard as a problem
in on-line algorithms. It is possible to achieve a strict1 competitive ratio of 2
by purchasing a placard after the second ticket. If one receives fewer than 2
tickets, both the on-line and off-line algorithms pay the same amount, and at
2 or more tickets the on-line algorithm pays 4 while the off-line pays 2 (the
off-line algorithm purchased the placard before receiving any tickets at all).

1. Show that no deterministic algorithm can achieve a lower (strict)


competitive ratio.

2. Show that a randomized algorithm can do so, against an oblivious


adversary.
1
I.e., with no additive constant.
APPENDIX H. SAMPLE ASSIGNMENTS FROM SPRING 2009 482

Solution
1. Any deterministic algorithm essentially just chooses some fixed number
m of tickets to collect before buying the placard. Let n be the actual
number of tickets issued. For m = 0, the competitive ratio is infinite
when n = 0. For m = 1, the competitive ratio is 3 when n = 1. For
m > 2, the competitive ratio is (m + 2)/2 > 2 when n = m. So m = 2
is the optimal choice.

2. Consider the following algorithm: with probability p, we purchase a


placard after 1 ticket, and with probability q = 1 − p, we purchase a
placard after 2 tickets. This gives a competitive ratio of 1 for n = 0,
1 + 2p for n = 1, and (3p + 4q)/2 = (4 − p)/2 = 2 − p/2 for n ≥ 2.
There is a clearly a trade-off between the two ratios 1 + 2p and 2 − p/2.
The break-even point is when they are equal, at p = 2/5. This gives a
competitive ratio of 1 + 2p = 9/5, which is less than 2.

H.1.4 Overloaded machines (20 points)


Suppose n2 jobs are assigned to n machines with each job choosing a machine
independently and uniformly at random. Let the load on a machine be the
number of jobs assigned to it. Show that for any fixed δ > 0 and sufficiently
large n, there is a constant c < 1 such that the maximum load exceeds
(1 + δ)n with probability at most ncn .

Solution
This is a job for Chernoff bounds. For any particular machine, the load S is
a sum of independent indicator variables and the mean load is µ = n. So we
have
!n

Pr [S ≥ (1 + δ)µ] ≤ .
(1 + δ)1+δ

Observe that eδ /(1 + δ)1+δ < 1 for δ > 0. One proof of this fact is to take
the log to get δ −(1+δ) log(1+δ), which equals 0 at δ = 0, and then show that
d
the logarithm is decreasing by showing that dδ · · · = 1 − 1+δ
1+δ − log(1 + δ) =
− log(1 + δ) < 0 for all δ > 0.
So we can let c = eδ /(1 + δ)1+δ to get a bound of cn on the probability
that any particular machine is overloaded and a bound of ncn (from the
union bound) on the probability that any of the machines is overloaded.
Appendix I

Probabilistic recurrences

Randomized algorithms often produce recurrences with ugly sums embedded


inside them (see, for example, (1.3.1)). We’d like to have tools for pulling
at least asymptotic bounds out of these recurrences without having to deal
with the sums. This isn’t always possible, but for certain simple recurrences
we can make it work.

I.1 Recurrences with constant cost functions


Let us consider probabilistic recurrences of the form T (n) = 1 + T (n − Xn ),
where Xn is a random variable with 0 < Xn ≤ n and T (0) = 0. We assume
we can compute a lower bound on E [Xn ] for each n, and we want to translate
this lower bound into an upper bound on E [T (n)].

I.2 Examples
• How long does it take to get our first heads if we repeatedly flip
a coin that comes up heads with probability p? Even though we
probably already know the answer to this, We can solve it by solving
the recurrence T (1) = 1 + T (1 − X1 ), T (0) = 0, where E [X1 ] = p.

• Hoare’s FIND [Hoa61b], often called QuickSelect, is an algorithm


for finding the k-th smallest element of an unsorted array. It works like
QuickSort, only after partitioning the array around a random pivot
we throw away the part that doesn’t contain our target and recurse
only on the surviving piece. How many rounds of this must we do?
Here E [Xn ] is more complicated, since after splitting our array of size

483
APPENDIX I. PROBABILISTIC RECURRENCES 484

n into piles of size n0 and n − n0 − 1, we have to pick one or the other


(or possibly just the pivot alone) based on the value of k.

• Suppose we start with n biased coins that each come up heads with
probability p. In each round, we flip all the coins and throw away the
ones that come up tails. How many rounds does it take to get rid of
all of the coins? (This essentially tells us how tall a skip list [Pug90]
can get.) Here we have E [Xn ] = (1 − p)n.

• In the coupon collector problem, we sample from 1 . . . n with re-


placement until we see ever value at least once. We can model this
by a recurrence in which T (k) is the time to get all the coupons given
there are k left that we haven’t seen. Here Xn is 1 with probability
k/n and 0 with probability (n − k)/n, giving E [Xn ] = k/n.

• Let’s play Chutes and Ladders without the chutes and ladders. We
start at location n, and whenever it’s our turn, we roll a fair six-sided
die X and move to n − X unless this value is negative, in which case
we stay put until the next turn. How many turns does it take to get to
0?

I.3 The Karp-Upfal-Wigderson bound


This is a bound on the expected number of rounds to finish a process where
we start with a problem instance of size n, and after one round of work we get
a new problem instance of size n − Xn , where Xn is a random variable whose
distribution depends on n. It was original described in a paper by Karp,
Upfal, and Wigderson on analyzing parallel search algorithms [KUW88]. The
bound applies when E [Xn ] is bounded below by a non-decreasing function
µ(n).
Lemma I.3.1. Let a be a constant, let T (n) = 1 + T (n − Xn ), where for
each n, Xn is an integer-valued random variable satisfying 0 ≤ Xn ≤ n − a
and let T (a) = 0. Let E [Xn ] ≥ µ(n) for all n > a, where µ is a positive
non-decreasing function of n. Then
Z n
1
E [T (n)] ≤ dt. (I.3.1)
a µ(t)
To get an intuition for why this works, imagine that Xn is the speed
at which we drop from n, expressed in units per round. Traveling at this
speed, it takes 1/Xn rounds to cross from k + 1 to k for any such interval
APPENDIX I. PROBABILISTIC RECURRENCES 485

we pass. From the point of view of the interval [k, k + 1], we don’t know
which n we are going to start from before we cross it, but we do know that
for any n ≥ k + 1 we start from, our speed will Rbe at least µ(n) ≥ µ(k + 1)
on average. So the time it takes will be at most kk+1 µ(t)
1
dt on average, and
the total time is obtained by summing all of these intervals.
Of course, this intuition is not even close to a real proof (among other
things, there may be a very dangerous confusion in there between 1/ E [Xn ]
and E [1/Xn ]), so we will give a real proof as well.

Proof of Lemma I.3.1. This is essentially the same proof as in Motwani and
Raghavan [MR95], but we add some extra detail to allow for the possibility
that Xn = 0.
Let p = Pr [Xn = 0], q = 1 − p = Pr [Xn 6= 0]. Note we have q > 0
because otherwise E [Xn ] = 0 < µ(n). Then we have

E [T (n)] = 1 + E [T (n − Xn )]
= 1 + p E [T (n − Xn ) | Xn = 0] + q E [T (n − Xn ) | Xn 6= 0]
= 1 + p E [T (n)] + q E [T (n − Xn ) | Xn 6= 0] .

Now we have E [T (n)] on both sides, which we don’t like very much. So
we collect it on the left-hand side:

(1 − p) E [T (n)] = 1 + q E [T (n − Xn ) | Xn 6= 0] ,

divide both sides by q = 1 − p, and apply the induction hypothesis:

E [T (n)] = 1/q + E [T (n − Xn ) | Xn 6= 0]
= 1/q + E [E [T (n − Xn ) | Xn ] | Xn 6= 0]
"Z #
n−Xn1
≤ 1/q + E dt Xn 6= 0
a µ(t)
Z n Z n
1 1

= 1/q + E dt − dt Xn 6= 0
a µ(t) n−Xn µ(t)
Z n
1 Xn
 
≤ 1/q + dt − E Xn 6= 0
a µ(t) µ(n)
Z n
1 E [Xn | Xn 6= 0]
≤ 1/q + dt − .
a µ(t) µ(n)

The second-to-last step uses the fact that µ(t) ≤ µ(n) for t ≤ n.
It may seem like we don’t know what E [Xn | Xn 6= 0] is. But we know
that Xn ≥ 0, so we have E [Xn ] = p E [Xn | Xn = 0] + q E [Xn | Xn 6= 0] =
APPENDIX I. PROBABILISTIC RECURRENCES 486

q E [Xn | Xn 6= 0]. So we can solve for E [Xn | Xn 6= 0] = E[Xn ]/q. So let’s


plug this in:
Z n
1 E [Xn ] /q
E [T (n)] ≤ 1/q + dt −
a µ(t) µ(n)
Z n
1
≤ 1/q + dt − 1/q
a µ(t)
Z n
1
= dt.
a µ(t)
This concludes the proof.

Now we just need to find some applications.

I.3.1 Waiting for heads


For the recurrence T (1) = 1+T (1−X1 ) with E [X1 ] = p, we set µ(n) = p and
get E [T (1)] ≤ 01 p1 dt = p1 , which happens to be exactly the right answer.
R

I.3.2 Quickselect
In Quickselect, we pick a random pivot and split the original array of size
n into three piles of size m (less than the pivot), 1 (the pivot itself), and
n−m−1 (greater than the pivot). We then figure out which of the three piles
contains the k-th smallest element (depend on how k compares to m − 1) and
recurse, stopping when we hit a pile with 1 element. It’s easiest to analyze
this by assuming that we recurse in the largest of the three piles, i.e., that
our recurrence is T (n) = 1 + max(T (m), T (n − m − 1)), where m is uniform
in 0 . . . n − 1. The exact value of E [max(m, n − m − 1)] is a little messy to
compute (among other things, it depends on whether n is odd or even), but
it’s not hard to see that it’s always less than (3/4)n. So letting µ(n) = n/4,
we get
Z n
1
E [T (n)] ≤ dt = 4 ln n.
1 t/4

I.3.3 Tossing coins


Here we have E [Xn ] = (1 − p)n. If we let µ(n) = (1 − p)n and plug into the
formula without thinking about it too much, we get
Z n
1 1
E [T (n)] ≤ dt = (ln n − ln 0).
0 (1 − p)t 1−p
APPENDIX I. PROBABILISTIC RECURRENCES 487

That ln 0 is trouble. We can fix it by making µ(n) = (1 − p)dne, to get


Z n
1
E [T (n)] ≤ dt
0+ (1 − p)dte
n
1 X 1
=
1 − p k=1 k
Hn
= .
1−p

I.3.4 Coupon collector


Now that we know how to avoid dividing by zero, this is easy and fun. Let
µ(x) = dxe/n, then we have
Z n
n
E [T (n)] ≤ dt
0+ dte
n
X n
=
k=1
k
= nHn .

As it happens, this is the exact answer for this case. This will happen
whenever X is always a 0–1 variable1 and we define µ(x) = E [X | n = dxe],
which can be seen by spending far too much time thinking about the precise
sources of error in the inequalities in the proof.

I.3.5 Chutes and ladders


Let µ(n) be the expected drop from position n. We have to be a little
bit careful about small n, but we can compute that in general µ(n) =
1 Pmin(n,6)
6 i=0 i. For fractional values x we will set µ(x) = µ(dxe) as before.
1
A 0–1 random variable is also called a Bernoulli random variable, but 0–1 is
shorter to type and more informative. Even more confusing, the underlying experiment
that gives rise to a Bernoulli random variable goes by the entirely different name of a
Poisson trial. Jacob Bernoulli and Siméon-Denis Poisson were great mathematicians,
but there are probably better ways to remember them.
APPENDIX I. PROBABILISTIC RECURRENCES 488

Then we have
Z n
1
E [T (n)] ≤ dt
0+ µ(t)
n
X 1
=
k=1
µ(k)

We can summarize the values in the following table:

P
n µ(n) 1/µ(n) 1/µ(k)
1 1/6 6 6
2 1/2 2 8
3 1 1 9
4 5/3 3/5 48/5
5 5/2 2/5 10
≥6 7/2 2/7 10 + (2/7)(n − 5) = (2/7)n + 65/7

This is a slight overestimate; for example, we can calculate by hand


that the expected waiting time for n = 2 is 6 and for n = 3 that it is
20/3 = 6 + 2/3.
We can also consider the generalized version of the game where we start
at n and drop by 1 . . . n each turn as long as the drop wouldn’t take us below
0. Now the expected drop from position k is k(k + 1)/2n, and so we can
apply the formula to get
n
X 2n
E [T (n)] ≤ .
k=1
k(k + 1)
1
The sum of k(k+1) when k goes from 1 to n happens to have a very nice
n 1 2
value; it’s exactly n+1 = 1 + n+1 . So in this case we can rewrite the bound
n 2n2
as 2n · n+1 = n+1 .

I.4 High-probability bounds


So far we have only considered bounds on the expected value of T (n).
Suppose we want to show that T (n) is in fact small with high probability,
2
Pn 1
Pn−1 1
Proof: Trivially true for n = 0; for larger n, compute k=1 k(k+1) k=1 k(k+1)
+
1 n−1 1 (n−1)(n+1)−1 n2
n(n+1)
= n
+ n(n+1)
= n(n+1)
= n(n+1)
= n/(n + 1).
APPENDIX I. PROBABILISTIC RECURRENCES 489

i.e., a statement of the form Pr [T (n) ≥ t] ≤ . There are two natural ways
to do this: we can repeatedly apply Markov’s inequality to the expectation
bound, or we can attempt to analyze the recurrence in more detail. The first
method tends to give weaker bounds but it’s easier.

I.4.1 High-probability bounds from expectation bounds


Given E [T (n)] ≤ m, we have Pr [T (n) ≥ αm] ≤ α−1 . This does not give
a very good bound on probability; if we want to show Pr [T (n) ≥ t] ≤ n−c
for some constant c (a typical high-probability bound), we need t ≥ nc m.
But we can get a better bound if m bounds the expected time starting from
any reachable state, as is the case for the class of problems we have been
considering.
The idea is that if T (n) exceeds αm, we restart the analysis and ar-
gue that Pr [T (n) ≥ 2αm | T (n) ≥ αm] ≤ α−1 , from which it follows that
Pr [T (n) ≥ 2αm] ≤ α−2 . In general, for any non-negative integer k, we have
Pr [T (n) ≥ kαm] ≤ α−k . Now we just need to figure out how to pick α to
minimize this quantity for fixed t.
Let t = kαm. Then k = t/αm and we seek to minimize α−t/αm . Taking
the logarithm gives −(t/m)(ln α)/α. The t/m factor is irrelevant to the
minimization problem, so we are left with minimizing −(ln α)/α. Taking
the derivative gives −α−2 + α−2 ln α; this is zero when ln α = 1 or α = e.
(This popular constant shows up often in problems like this.) So we get
Pr [T (n) ≥ kem] ≤ e−k , or, letting k = ln(1/), Pr [T (n) ≥ em ln(1/)] ≤ .
So, for example, we can get an n−c bound on the probability of running
too long by setting our time bound to em ln(nc ) = cem ln n = O(m log n).
We can’t hope to do better than O(m), so this bound is tight up to a log
factor.

I.4.2 Detailed analysis of the recurrence


As Lance Fortnow has explained,3 getting rid of log factors is what theoretical
computer science is all about. So we’d like to do better than an O(m log n)
bound if we can. In some cases this is not too hard.
Suppose for each n, T (n) = 1 + T (X), where E [X] ≤ αn for a fixed
constant α. Let X0 = n, and let X1 , X2 , etc., be the sequence of sizes of
the remaining problem at time 1, 2, etc. Then we have E [X1 ] ≤ αn from
our assumption. But we also have E [X2 ] = E [E [X2 | X1 ]] ≤ E [αX1 ] =
α E [X1 ] ≤ α2 n, and by induction we can show that E [Xk ] ≤ αk n for all k.
3
http://weblog.fortnow.com/2009/01/soda-and-me.html
APPENDIX I. PROBABILISTIC RECURRENCES 490

Since Xk is integer-valued, E [Xk ] is an upper bound on Pr [Xk > 0]; we thus


get Pr [T (n) ≥ k] = Pr [Xk > 0] ≤ αk n. We can solve for the value of k that
makes this less than : k = − log(n/)/ log α = log1/α n + log1/α (1/).
For comparison, the bound on the expectation of T (n) from Lemma I.3.1
is H(n)/(1 − α). This is actually pretty close to log1/α n when α is close
to 1, and is not too bad even for smaller α. But the difference is that the
dependence on log(1/) is additive with the tighter analysis, so for fixed c
we get Pr [T (n) ≥ t] ≤ n−c at t = O(log n + log nc ) = O(log n) instead of
O(log n log nc ) = O(log2 n).

I.5 More general recurrences


We didn’t do these, but if we had to, it might be worth looking at Roura’s
Improved Master Theorems [Rou01].
Bibliography

[AA11] Dan Alistarh and James Aspnes. Sub-logarithmic test-and-set


against a weak adversary. In Distributed Computing: 25th
International Symposium, DISC 2011, volume 6950 of Lecture
Notes in Computer Science, pages 97–109. Springer-Verlag,
September 2011.

[AACH12] James Aspnes, Hagit Attiya, and Keren Censor-Hillel. Poly-


logarithmic concurrent data structures from monotone circuits.
Journal of the ACM, 59(1):2:1–2:24, February 2012.

[AAG+ 10] Dan Alistarh, Hagit Attiya, Seth Gilbert, Andrei Giurgiu, and
Rachid Guerraoui. Fast randomized test-and-set and renaming.
In Nancy A. Lynch and Alexander A. Shvartsman, editors,
Distributed Computing, 24th International Symposium, DISC
2010, Cambridge, MA, USA, September 13-15, 2010. Proceed-
ings, volume 6343 of Lecture Notes in Computer Science, pages
94–108. Springer, 2010.

[AB07] Sanjeev Arora and Boaz Barak. Computational complexity:


A modern approach. Unpublished draft available at https:
//theory.cs.princeton.edu/complexity/book.pdf, 2007.

[ABMRT96] Arne Andersson, Peter Bro Miltersen, Søren Riis, and Mikkel
Thorup.
p Static dictionaries on AC0 RAMs: Query time
θ( log n/ log log n) is necessary and sufficient. In FOCS, pages
441–450, 1996.

[Abr88] Karl Abrahamson. On achieving consensus using a shared mem-


ory. In Proceedings of the 2nd annual ACM-SIAM symposium
on Discrete algorithms, pages 291–302, 1988.

491
BIBLIOGRAPHY 492

[AC08] Hagit Attiya and Keren Censor. Tight bounds for asynchronous
randomized consensus. Journal of the ACM, 55(5):20, October
2008.

[ACBF02] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time


analysis of the multiarmed bandit problem. Machine Learning,
47:235–256, 2002.

[Ach03] Dimitris Achlioptas. Database-friendly random projections:


Johnson-Lindenstrauss with binary coins. J. Comput. Syst.
Sci., 66(4):671–687, June 2003.

[Adl78] Leonard Adleman. Two theorems on random polynomial time.


In Proceedings of the 19th Annual Symposium on Foundations
of Computer Science, pages 75–83, Washington, DC, USA, 1978.
IEEE Computer Society.

[AE19] James Aspnes and He Yang Er. Consensus with max registers.
In Jukka Suomela, editor, 33rd International Symposium on
Distributed Computing (DISC 2019), volume 146 of Leibniz In-
ternational Proceedings in Informatics (LIPIcs), pages 1:1–1:9,
Dagstuhl, Germany, 2019. Schloss Dagstuhl–Leibniz-Zentrum
fuer Informatik.

[AF01] David Aldous and James Allen Fill. Reversible Markov


chains and random walks on graphs. Unpublished manuscript,
available at https://www.stat.berkeley.edu/~aldous/RWG/
book.html, 2001.

[AH90] James Aspnes and Maurice Herlihy. Fast randomized consensus


using shared memory. Journal of Algorithms, 11(3):441–461,
September 1990.

[AKS83] M. Ajtai, J. Komlós, and E. Szemerédi. An O(n log n) sorting


network. In STOC ’83: Proceedings of the fifteenth annual ACM
symposium on Theory of computing, pages 1–9, New York, NY,
USA, 1983. ACM.

[AKS04] Manindra Agrawal, Neeraj Kayal, and Nitin Saxena. PRIMES


is in P. Annals of Mathematics, 160:781–793, 2004.

[ALM+ 98] Sanjeev Arora, Carsten Lund, Rajeev Motwani, Madhu Sudan,
and Mario Szegedy. Proof verification and the hardness of
approximation problems. Journal of the ACM, 45(3):501–555,


May 1998.
[Alo91] Noga Alon. A parallel algorithmic version of the local lemma.
Random Structures & Algorithms, 2(4):367–378, 1991.
[AM18] Vigleik Angeltveit and Brendan D. McKay. r(5, 5) ≤ 48. Jour-
nal of Graph Theory, 89(1):5–13, 2018.
[AMS96] Noga Alon, Yossi Matias, and Mario Szegedy. The space com-
plexity of approximating the frequency moments. In Proceedings
of the twenty-eighth annual ACM symposium on Theory of com-
puting, pages 20–29, 1996.
[APT79] Bengt Aspvall, Michael F. Plass, and Robert Endre Tarjan. A
linear-time algorithm for testing the truth of certain quantified
boolean formulas. Information Processing Letters, 8(3):121–123,
1979.
[AS92] Noga Alon and Joel H. Spencer. The Probabilistic Method. John
Wiley & Sons, 1992.
[AS07] James Aspnes and Gauri Shah. Skip graphs. ACM Transactions
on Algorithms, 3(4):37, November 2007.
[Asp15] James Aspnes. Faster randomized consensus with an oblivious
adversary. Distributed Computing, 28(1):21–29, February 2015.
[Asp20] James Aspnes. Notes on computational complexity theory.
Unpublished manuscript, 2020.
[AVL62] G. M. Adelson-Velskii and E. M. Landis. An information
organization algorithm. Doklady Akademia Nauk SSSR, 146:263–
266, 1962.
[AW96] James Aspnes and Orli Waarts. Randomized consensus in
O(n log2 n) operations per processor. SIAM Journal on Com-
puting, 25(5):1024–1044, October 1996.
[Azu67] Kazuoki Azuma. Weighted sums of certain dependent random
variables. Tôhoku Mathematical Journal, 19(3):357–367, 1967.
[Bab79] László Babai. Monte-Carlo algorithms in graph isomorphism
testing. Technical Report D.M.S. 79-10, Université de Montréal,
1979.

[BBBV97] Charles H. Bennett, Ethan Bernstein, Gilles Brassard, and


Umesh Vazirani. Strengths and weaknesses of quantum com-
puting. SIAM J. Comput., 26(5):1510–1523, October 1997.

[BCB12] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis


of stochastic and nonstochastic multi-armed bandit problems.
Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[BD92] Dave Bayer and Persi Diaconis. Trailing the dovetail shuffle to
its lair. Annals of Applied Probability, 2(2):294–313, 1992.

[BD97] Russ Bubley and Martin Dyer. Path coupling: A technique for
proving rapid mixing in Markov chains. In Proceedings 38th
Annual Symposium on Foundations of Computer Science, pages
223–231. IEEE, 1997.

[Bec91] József Beck. An algorithmic approach to the Lovász local lemma.
I. Random Structures & Algorithms, 2(4):343–365, 1991.

[Bel57] Richard Ernest Bellman. Dynamic Programming. Princeton


University Press, 1957.

[BFL91] László Babai, Lance Fortnow, and Carsten Lund. Non-


deterministic exponential time has two-prover interactive pro-
tocols. computational complexity, 1(1):3–40, 1991.

[Blo70] Burton H. Bloom. Space/time trade-offs in hash coding with


allowable errors. Commun. ACM, 13(7):422–426, 1970.

[Bol01] Béla Bollobás. Random Graphs. Cambridge University Press,


second edition, 2001.

[BR91] Gabriel Bracha and Ophir Rachman. Randomized consensus in


expected O(n2 log n) operations. In Sam Toueg, Paul G. Spi-
rakis, and Lefteris M. Kirousis, editors, Distributed Algorithms,
5th International Workshop, volume 579 of Lecture Notes in
Computer Science, pages 143–150, Delphi, Greece, 7–9 October
1991. Springer, 1992.

[Bro86] Andrei Z. Broder. How hard is it to marry at random? (on


the approximation of the permanent). In Proceedings of the
Eighteenth Annual ACM Symposium on Theory of Computing,
28-30 May 1986, Berkeley, California, USA, pages 50–58, 1986.

[Bro88] Andrei Z. Broder. Errata to “how hard is it to marry at random?


(on the approximation of the permanent)”. In Proceedings of the
Twentieth Annual ACM Symposium on Theory of Computing,
2-4 May 1988, Chicago, Illinois, USA, page 551, 1988.

[CM03] Saar Cohen and Yossi Matias. Spectral Bloom filters. In Alon Y.
Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceed-
ings of the 2003 ACM SIGMOD International Conference on
Management of Data, San Diego, California, USA, June 9-12,
2003, pages 241–252, 2003.

[CM05] Graham Cormode and S. Muthukrishnan. An improved data


stream summary: the count-min sketch and its applications. J.
Algorithms, 55(1):58–75, 2005.

[CMV13] Kai-Min Chung, Michael Mitzenmacher, and Salil Vadhan. Why


simple hash functions work: Exploiting the entropy in a data
stream. Theory of Computing, 9(1):897–945, 2013.

[Coh15] Edith Cohen. All-distances sketches, revisited: HIP estimators


for massive graphs analysis. IEEE Trans. Knowl. Data Eng.,
27(9):2320–2334, 2015.

[Coh21] Gil Cohen. Two-source dispersers for polylogarithmic entropy


and improved Ramsey graphs. SIAM Journal on Computing,
50(3):STOC16–30–STOC16–67, 2021.

[CS00] Artur Czumaj and Christian Scheideler. Coloring non-uniform


hypergraphs: a new algorithmic approach to the general Lovász
local lemma. In Proceedings of the eleventh annual ACM-
SIAM symposium on Discrete algorithms, SODA ’00, pages
30–39, Philadelphia, PA, USA, 2000. Society for Industrial and
Applied Mathematics.

[CW77] J. Lawrence Carter and Mark N. Wegman. Universal classes of


hash functions (extended abstract). In Proceedings of the ninth
annual ACM symposium on Theory of computing, STOC ’77,
pages 106–112, New York, NY, USA, 1977. ACM.

[Deu89] David Deutsch. Quantum computational networks. Proceedings


of the Royal Society of London. A. Mathematical and Physical
Sciences, 425(1868):73–90, 1989.

[Dev88] Luc Devroye. Applications of the theory of records in the study


of random trees. Acta Informatica, 26(1-2):123–130, October
1988.

[DG00] Martin Dyer and Catherine Greenhill. On Markov chains for


independent sets. J. Algorithms, 35:17–49, April 2000.

[DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of


a theorem of Johnson and Lindenstrauss. Random Structures &
Algorithms, 22(1):60–65, 2003.

[DGH+ 87] Alan Demers, Dan Greene, Carl Hauser, Wes Irish, John Larson,
Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry.
Epidemic algorithms for replicated database maintenance. In
Proceedings of the Sixth Annual ACM Symposium on Principles
of Distributed Computing, PODC ’87, page 1–12, New York,
NY, USA, 1987. Association for Computing Machinery.

[DGM02] Martin Dyer, Catherine Greenhill, and Mike Molloy. Very


rapid mixing of the Glauber dynamics for proper colorings on
bounded-degree graphs. Random Struct. Algorithms, 20:98–114,
January 2002.

[DHKP97] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and


Martti Penttonen. A reliable randomized algorithm for the
closest-pair problem. J. Algorithms, 25(1):19–51, 1997.

[Din07] Irit Dinur. The PCP theorem by gap amplification. Journal of


the ACM (JACM), 54(3):12, 2007.

[Dir39] P. A. M. Dirac. A new notation for quantum mechanics. Math-


ematical Proceedings of the Cambridge Philosophical Society,
35:416–418, 6 1939.

[DJ92] David Deutsch and Richard Jozsa. Rapid solution of prob-


lems by quantum computation. Proceedings of the Royal Soci-
ety of London. Series A: Mathematical and Physical Sciences,
439(1907):553–558, 1992.

[DKM+ 94] Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, Friedhelm


Meyer auf der Heide, Hans Rohnert, and Robert E. Tarjan.
Dynamic perfect hashing: Upper and lower bounds. SIAM
Journal on Computing, 23(4):738–761, 1994.

[DP09] Devdatt P. Dubhashi and Alessandro Panconesi. Concentra-


tion of Measure for the Analysis of Randomized Algorithms.
Cambridge University Press, 2009.
[Dum56] A. I. Dumey. Indexing for rapid random-access memory. Com-
puters and Automation, 5(12):6–9, 1956.
[Dye03] Martin Dyer. Approximate counting by dynamic programming.
In Proceedings of the thirty-fifth annual ACM symposium on
Theory of computing, STOC ’03, pages 693–699, New York, NY,
USA, 2003. ACM.
[EL75] Paul Erdős and László Lovász. Problems and results on 3-
chromatic hypergraphs and some related questions. In A. Haj-
nal, R. Rado, and V. T. Sós, editors, Infinite and Finite Sets (to
Paul Erdős on his 60th birthday), pages 609–627. North-Holland,
1975.
[Epp16] David Eppstein. Cuckoo filter: Simplification and analysis.
In Rasmus Pagh, editor, 15th Scandinavian Symposium and
Workshops on Algorithm Theory, SWAT 2016, June 22-24,
2016, Reykjavik, Iceland, volume 53 of LIPIcs, pages 8:1–8:12.
Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016.
[Erd45] P. Erdős. On a lemma of Littlewood and Offord. Bulletin of
the American Mathematical Society, 51(12):898–902, 1945.
[Erd47] P. Erdős. Some remarks on the theory of graphs. Bulletin of
the American Mathematical Society, 53:292–294, 1947.
[ES35] P. Erdős and G. Szekeres. A combinatorial problem in geometry.
Compositio Mathematica, 2:463–470, 1935.
[Eul68] M. L. Euler. Remarques sur un beau rapport entre les séries
des puissances tant directes que réciproques. Mémoires de
l’Académie des Sciences de Berlin, 17:83–106, 1768.
[FAKM14] Bin Fan, David G. Andersen, Michael Kaminsky, and Michael
Mitzenmacher. Cuckoo filter: Practically better than Bloom.
In Aruna Seneviratne, Christophe Diot, Jim Kurose, Augustin
Chaintreau, and Luigi Rizzo, editors, Proceedings of the 10th
ACM International on Conference on emerging Networking Ex-
periments and Technologies, CoNEXT 2014, Sydney, Australia,
December 2-5, 2014, pages 75–88. ACM, 2014.

[FCAB00] Li Fan, Pei Cao, Jussara M. Almeida, and Andrei Z. Broder.


Summary cache: a scalable wide-area web cache sharing proto-
col. IEEE/ACM Trans. Netw., 8(3):281–293, 2000.
[Fel68] William Feller. An Introduction to Probability Theory and Its
Applications, volume 1. Wiley, third edition, 1968.
[Fel71] William Feller. An Introduction to Probability Theory and Its
Applications, volume 2. Wiley, second edition, 1971.
[FFGM07] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Me-
unier. HyperLogLog: the analysis of a near-optimal cardinality
estimation algorithm. In Discrete Mathematics and Theoretical
Computer Science, pages 137–156. Discrete Mathematics and
Theoretical Computer Science, 2007.
[FGQ12] Xiequan Fan, Ion Grama, and Quansheng Liu. Hoeffding’s
inequality for supermartingales. Available as https://arxiv.
org/abs/1109.4359, July 2012.
[FKS84] M.L. Fredman, J. Komlós, and E. Szemerédi. Storing a Sparse
Table with O(1) Worst Case Access Time. Journal of the ACM
(JACM), 31(3):538–544, 1984.
[FLP85] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson.
Impossibility of distributed consensus with one faulty process.
Journal of the ACM, 32(2):374–382, April 1985.
[FNM85] Philippe Flajolet and G. Nigel Martin. Probabilistic counting
algorithms for data base applications. Journal of computer and
system sciences, 31(2):182–209, 1985.
[FPSS03] Dimitris Fotakis, Rasmus Pagh, Peter Sanders, and Paul Spi-
rakis. Space efficient hash tables with worst case constant
access time. In Helmut Alt and Michel Habib, editors, STACS
2003, volume 2607 of Lecture Notes in Computer Science, pages
271–282. Springer Berlin Heidelberg, 2003.
[FR75] Robert W. Floyd and Ronald L. Rivest. Expected time bounds
for selection. Commun. ACM, 18(3):165–172, 1975.
[Fre21] Casper Benjamin Freksen. An introduction to Johnson-
Lindenstrauss transforms. arXiv preprint arXiv:2103.00564,
2021.

[GKP88] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik.


Concrete Mathematics: A Foundation for Computer Science.
Addison-Wesley, 1988.

[Goo83] Nelson Goodman. Fact, Fiction, and Forecast. Harvard Uni-


versity Press, 1983.

[GR93] Igal Galperin and Ronald L. Rivest. Scapegoat trees. In


Proceedings of the Fourth Annual ACM-SIAM Symposium on
Discrete Algorithms, pages 165–174. Society for Industrial and
Applied Mathematics, 1993.

[Gro96] Lov K. Grover. A fast quantum mechanical algorithm for


database search. In Proceedings of the twenty-eighth annual
ACM symposium on Theory of computing, pages 212–219.
ACM, 1996. Available as https://arxiv.org/abs/quant-ph/
9605043.

[GS78] Leo J. Guibas and Robert Sedgewick. A dichromatic framework


for balanced trees. In Foundations of Computer Science, 1978.,
19th Annual Symposium on, pages 8–21. IEEE, 1978.

[GS92] G. R. Grimmett and D. R. Stirzaker. Probability and Random


Processes. Oxford Science Publications, second edition, 1992.

[GS01] Geoffrey R. Grimmett and David R. Stirzaker. Probability


and Random Processes. Oxford University Press, third edition,
2001.

[Gur00] Venkatesan Guruswami. Rapidly mixing Markov chains: A


comparison of techniques. Available at ftp://theory.lcs.
mit.edu/pub/people/venkat/markov-survey.ps, 2000.

[GW94] Michel X. Goemans and David P. Williamson. New 3/4-


approximation algorithms for the maximum satisfiability prob-
lem. SIAM J. Discret. Math., 7:656–666, November 1994.

[GW95] Michel X. Goemans and David P. Williamson. Improved ap-


proximation algorithms for maximum cut and satisfiability
problems using semidefinite programming. Journal of the ACM,
42(6):1115–1145, 1995.

[GW12] George Giakkoupis and Philipp Woelfel. On the time and space
complexity of randomized test-and-set. In Darek Kowalski and
Alessandro Panconesi, editors, ACM Symposium on Princi-
ples of Distributed Computing, PODC ’12, Funchal, Madeira,
Portugal, July 16-18, 2012, pages 19–28. ACM, 2012.

[Has70] W. K. Hastings. Monte Carlo sampling methods using Markov


chains and their applications. Biometrika, 57(1):97–109, 1970.

[Hås01] Johan Håstad. Some optimal inapproximability results. Journal


of the ACM (JACM), 48(4):798–859, 2001.

[Hec21] Annika Heckel. Non-concentration of the chromatic number


of a random graph. Journal of the American Mathematical
Society, 34(1):245–260, January 2021.

[HH80] P. Hall and C.C. Heyde. Martingale Limit Theory and Its
Application. Academic Press, 1980.

[HK73] John E. Hopcroft and Richard M. Karp. An n^{5/2} algorithm


for maximum matchings in bipartite graphs. SIAM J. Comput.,
2(4):225–231, 1973.

[Hoa61a] C. A. R. Hoare. Algorithm 64: Quicksort. Commun. ACM,


4:321, July 1961.

[Hoa61b] C. A. R. Hoare. Algorithm 65: find. Commun. ACM, 4:321–322,


July 1961.

[Hoe63] Wassily Hoeffding. Probability inequalities for sums of bounded


random variables. Journal of the American Statistical Associa-
tion, 58(301):13–30, March 1963.

[Hol16] Jonathan Holmes. AI is already making in-


roads into journalism but could it win a Pulitzer?
https://www.theguardian.com/media/2016/apr/03/
artificla-intelligence-robot-reporter-pulitzer-prize,
April 2016. Accessed September 18th, 2016.

[HST08] Maurice Herlihy, Nir Shavit, and Moran Tzafrir. Hopscotch


hashing. In Proceedings of the 22nd international symposium
on Distributed Computing, DISC ’08, pages 350–364, Berlin,
Heidelberg, 2008. Springer-Verlag.

[HWY22] Kun He, Chunyang Wang, and Yitong Yin. Sampling Lovász
local lemma for general constraint satisfaction solutions in
near-linear time. In 2022 IEEE 63rd Annual Symposium on
Foundations of Computer Science (FOCS), pages 147–158, 2022.

[HY04] Jun He and Xin Yao. A study of drift analysis for estimating
computation time of evolutionary algorithms. Natural Comput-
ing, 3:21–35, 2004.

[IM98] Piotr Indyk and Rajeev Motwani. Approximate nearest neigh-


bors: Towards removing the curse of dimensionality. In STOC,
pages 604–613, 1998.

[Jer95] Mark Jerrum. A very simple algorithm for estimating the


number of k-colorings of a low-degree graph. Random Structures
& Algorithms, 7(2):157–165, 1995.

[JL84] William B. Johnson and Joram Lindenstrauss. Extensions of


Lipschitz mappings into a Hilbert space. In Conference in
Modern Analysis and Probability (New Haven, Connecticut,
1982), number 26 in Contemporary Mathematics, pages 189–
206. American Mathematical Society, 1984.

[JLR00] Svante Janson, Tomasz Łuczak, and Andrzej Ruciński. Ran-


dom Graphs. John Wiley & Sons, 2000.

[Joh10] Daniel Johannsen. Random Combinatorial Structures and Ran-


domized Search Heuristics. PhD thesis, Universität des Saar-
landes, 2010.

[JS89] Mark Jerrum and Alistair Sinclair. Approximating the perma-


nent. SIAM J. Comput., 18(6):1149–1178, 1989.

[JSV04] Mark Jerrum, Alistair Sinclair, and Eric Vigoda. A polynomial-


time approximation algorithm for the permanent of a matrix
with nonnegative entries. J. ACM, 51(4):671–697, 2004.

[Kar93] David R. Karger. Global min-cuts in RNC, and other ramifi-


cations of a simple min-cut algorithm. In Proceedings of the
fourth annual ACM-SIAM Symposium on Discrete algorithms,
SODA ’93, pages 21–30, Philadelphia, PA, USA, 1993. Society
for Industrial and Applied Mathematics.

[KGV83] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Opti-


mization by simulated annealing. Science, 220(4598):671–680,
1983.

[Kho02] Subhash Khot. On the power of unique 2-prover 1-round games.


In Proceedings of the thirty-fourth annual ACM symposium on
Theory of computing, pages 767–775. ACM, 2002.

[Kho10] Subhash Khot. On the unique games conjecture. In 2010 25th


Annual IEEE Conference on Computational Complexity, pages
99–121, 2010.

[KL85] Richard M. Karp and Michael Luby. Monte-Carlo algorithms


for the planar multiterminal network reliability problem. J.
Complexity, 1(1):45–64, 1985.

[KM08] Adam Kirsch and Michael Mitzenmacher. Less hashing, same


performance: Building a better Bloom filter. Random Struct.
Algorithms, 33(2):187–218, 2008.

[KNW10] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An opti-


mal algorithm for the distinct elements problem. In Proceedings
of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Sym-
posium on Principles of Database Systems, PODS ’10, pages
41–52, New York, NY, USA, 2010. ACM.

[Kol33] A.N. Kolmogorov. Grundbegriffe der Wahrscheinlichkeitsrech-


nung. Springer, 1933.

[KR99] V. S. Anil Kumar and H. Ramesh. Markovian coupling vs.


conductance for the Jerrum-Sinclair chain. In FOCS, pages
241–252, 1999.

[KS76] J.G. Kemeny and J.L. Snell. Finite Markov Chains: With
a New Appendix “Generalization of a Fundamental Matrix”.
Undergraduate Texts in Mathematics. Springer, 1976.

[KSK76] John G. Kemeny, J. Laurie Snell, and Anthony W. Knapp.


Denumerable Markov Chains, volume 40 of Graduate Texts in
Mathematics. Springer, 1976.

[KT75] Samuel Karlin and Howard M. Taylor. A First Course in


Stochastic Processes. Academic Press, second edition, 1975.

[KUW88] Richard M. Karp, Eli Upfal, and Avi Wigderson. The complexity
of parallel search. Journal of Computer and System Sciences,
36(2):225–253, 1988.

[LAA87] Michael C. Loui and Hosame H. Abu-Amara. Memory require-


ments for agreement among unreliable asynchronous processes.
Advances in Computing Research, pages 163–183, 1987.

[Len20] Johannes Lengler. Drift analysis. In Benjamin Doerr and


Frank Neumann, editors, Theory of Evolutionary Computation:
Recent Developments in Discrete Optimization, pages 89–131.
Springer International Publishing, Cham, 2020.

[Li80] Shuo-Yen Robert Li. A martingale approach to the study


of occurrence of sequence patterns in repeated experiments.
Annals of Probability, 8(6):1171–1176, 1980.

[Lin92] Torgny Lindvall. Lectures on the Coupling Method. Wiley, 1992.

[LPW09] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov


Chains and Mixing Times. American Mathematical Society,
2009.

[LR85] T. L. Lai and Herbert Robbins. Asymptotically efficient adap-


tive allocation rules. Advances in Applied Mathematics, 6(1):4–
22, March 1985.

[LS20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cam-


bridge University Press, 1st edition, September 2020.

[Lub85] Michael Luby. A simple parallel algorithm for the maximal


independent set problem. In Proceedings of the seventeenth
annual ACM symposium on Theory of computing, pages 1–10,
New York, NY, USA, 1985. ACM.

[LV97] Michael Luby and Eric Vigoda. Approximately counting up to


four (extended abstract). In Proceedings of the twenty-ninth
annual ACM symposium on Theory of computing, STOC ’97,
pages 682–687, New York, NY, USA, 1997. ACM.

[LV99] Michael Luby and Eric Vigoda. Fast convergence of the Glauber
dynamics for sampling independent sets. Random Structures &
Algorithms, 15(3–4):229–241, 1999.

[LW05] Michael Luby and Avi Wigderson. Pairwise independence and


derandomization. Foundations and Trends in Theoretical Com-
puter Science, 1(4):237–301, 2005.
[McC85] Edward M. McCreight. Priority search trees. SIAM J. Comput.,
14(2):257–276, 1985.
[McD89] Colin McDiarmid. On the method of bounded differences. In
Surveys in Combinatorics, 1989: Invited Papers at the Twelfth
British Combinatorial Conference, pages 148–188, 1989.
[MH92] Colin McDiarmid and Ryan Hayward. Strong concentration
for Quicksort. In Proceedings of the Third Annual ACM-SIAM
Symposium on Discrete algorithms, SODA ’92, pages 414–421,
Philadelphia, PA, USA, 1992. Society for Industrial and Applied
Mathematics.
[Mil76] Gary L. Miller. Riemann’s hypothesis and tests for primality.
Journal of Computer and System Sciences, 13(3):300–317, 1976.
[ML86] Lothar F. Mackert and Guy M. Lohman. R* optimizer vali-
dation and performance evaluation for distributed queries. In
Wesley W. Chu, Georges Gardarin, Setsuo Ohsuga, and Yahiko
Kambayashi, editors, VLDB’86 Twelfth International Confer-
ence on Very Large Data Bases, August 25-28, 1986, Kyoto,
Japan, Proceedings, pages 149–159. Morgan Kaufmann, 1986.
[MN98] Makoto Matsumoto and Takuji Nishimura. Mersenne twister:
a 623-dimensionally equidistributed uniform pseudo-random
number generator. ACM Trans. Model. Comput. Simul., 8:3–30,
January 1998.
[Mos09] Robin A. Moser. A constructive proof of the Lovász local lemma.
In Proceedings of the 41st annual ACM Symposium on Theory
of Computing, STOC ’09, pages 343–350, New York, NY, USA,
2009. ACM.
[MPW16] Mikhail Menshikov, Serguei Popov, and Andrew Wade. Non-
homogeneous random walks: Lyapunov function methods for
near-critical stochastic systems, volume 209. Cambridge Uni-
versity Press, 2016.
[MR95] Rajeev Motwani and Prabhakar Raghavan. Randomized Algo-
rithms. Cambridge University Press, 1995.

[MR97] Brendan D. McKay and Stanisław P. Radziszowski. Subgraph


counting identities and Ramsey numbers. Journal of Combina-
torial Theory, Series B, 69(2):193–209, 1997.

[MR98] Michael Molloy and Bruce Reed. Further algorithmic aspects


of the local lemma. In Proceedings of the thirtieth annual ACM
symposium on Theory of computing, STOC ’98, pages 524–529,
New York, NY, USA, 1998. ACM.

[MRR+ 53] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N.


Rosenbluth, Augusta H. Teller, and Edward Teller. Equa-
tion of state calculations by fast computing machines. J. Chem.
Phys., 21(6):1087–1092, 1953.

[MT05] Ravi Montenegro and Prasad Tetali. Mathematical aspects of


mixing times in Markov chains. Foundations and Trends in
Theoretical Computer Science, 1(3), 2005.

[MT10] Robin A. Moser and Gábor Tardos. A constructive proof of the


general Lovász local lemma. J. ACM, 57:11:1–11:15, February
2010.

[MU17] Michael Mitzenmacher and Eli Upfal. Probability and Comput-


ing: Randomization and Probabilistic Techniques in Algorithms
and Data Analysis. Cambridge University Press, second edition,
2017.

[Mun11] Randall Munroe. Sports. https://xkcd.com/904/, 2011. Ac-


cessed September 18th, 2016.

[MWW07] Elchanan Mossel, Dror Weitz, and Nicholas Wormald. On the


hardness of sampling independent sets beyond the tree threshold.
Available as https://arxiv.org/abs/math/0701471, 2007.

[Pag01] Rasmus Pagh. On the cell probe complexity of membership


and perfect hashing. In STOC, pages 425–432, 2001.

[Pag06] Rasmus Pagh. Cuckoo hashing for undergraduates.


Available at https://www.it-c.dk/people/pagh/papers/
cuckoo-undergrad.pdf, 2006.

[Pap91] Christos H. Papadimitriou. On selecting a satisfying truth


assignment (extended abstract). In 32nd Annual Symposium on
Foundations of Computer Science, San Juan, Puerto Rico, 1-4


October 1991, pages 163–169. IEEE Computer Society, 1991.
[PPR05] Anna Pagh, Rasmus Pagh, and S. Srinivasa Rao. An optimal
Bloom filter replacement. In Proceedings of the Sixteenth Annual
ACM-SIAM Symposium on Discrete Algorithms, SODA 2005,
Vancouver, British Columbia, Canada, January 23-25, 2005,
pages 823–829. SIAM, 2005.
[PR04] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing.
J. Algorithms, 51(2):122–144, 2004.
[PSL80] M. Pease, R. Shostak, and L. Lamport. Reaching agreement
in the presence of faults. Journal of the ACM, 27(2):228–234,
April 1980.
[PT12] Mihai Patrascu and Mikkel Thorup. The power of simple
tabulation hashing. J. ACM, 59(3):14, 2012.
[Pug90] William Pugh. Skip Lists: A Probabilistic Alternative to Bal-
anced Trees. Communications of the ACM, 33(6):668–676, June
1990.
[PWY20] Seth Pettie, Dingyu Wang, and Longhui Yin. Non-mergeable
sketching for cardinality estimation. Available as https://
arxiv.org/abs/2008.08739, 2020.
[Rab80] Michael O. Rabin. Probabilistic algorithm for testing primality.
Journal of Number Theory, 12(1):128–138, 1980.
[Rag88] Prabhakar Raghavan. Probabilistic construction of determin-
istic algorithms: approximating packing integer programs. J.
Comput. Syst. Sci., 37:130–143, October 1988.
[Rou01] Salvador Roura. Improved master theorems for divide-and-
conquer recurrences. J. ACM, 48:170–205, March 2001.
[RS14] Jonathan E. Rowe and Dirk Sudholt. The choice of the offspring
population size in the (1, λ) evolutionary algorithm. Theoretical
Computer Science, 545:20–38, 2014. Genetic and Evolutionary
Computation.
[RT87] Prabhakar Raghavan and Clark D. Thompson. Randomized
rounding: a technique for provably good algorithms and algo-
rithmic proofs. Combinatorica, 7:365–374, December 1987.

[SA96] Raimund Seidel and Cecilia R. Aragon. Randomized search


trees. Algorithmica, 16(4/5):464–497, 1996. Available
at https://people.ischool.berkeley.edu/~aragon/pubs/
rst96.pdf.

[Sch35] Erwin Schrödinger. Die gegenwärtige Situation in der Quan-


tenmechanik. Naturwissenschaften, 23(48):807–812, November
1935.

[She11] Irina Shevtsova. On the asymptotically exact constants in


the Berry-Esseen-Katz inequality. Theory of Probability & Its
Applications, 55(2):225–252, 2011. A preprint is available at
https://arxiv.org/pdf/1111.6554.pdf.

[Sho97] Peter W. Shor. Polynomial-time algorithms for prime factoriza-


tion and discrete logarithms on a quantum computer. SIAM
journal on computing, 26(5):1484–1509, 1997.

[SMJ+ 14] Attila Szolnoki, Mauro Mobilia, Luo-Luo Jiang, Bartosz


Szczesny, Alastair M Rucklidge, and Matjaž Perc. Cyclic domi-
nance in evolutionary games: a review. Journal of the Royal
Society Interface, 11(100):20140735, 2014.

[Spu12] Francis Spufford. Red Plenty. Graywolf Press, 2012.

[Sri08] Aravind Srinivasan. Improved algorithmic versions of the Lovász


local lemma. In Proceedings of the nineteenth annual ACM-
SIAM symposium on Discrete algorithms, SODA ’08, pages
611–620, Philadelphia, PA, USA, 2008. Society for Industrial
and Applied Mathematics.

[SS71] A. Schönhage and V. Strassen. Schnelle Multiplikation großer
Zahlen. Computing, 7(3–4):281–292, 1971.

[SS87] E. Shamir and J. Spencer. Sharp concentration of the chromatic


number on random graphs Gn,p . Combinatorica, 7:121–129,
January 1987.

[Str03] Gilbert Strang. Introduction to linear algebra. Wellesley-


Cambridge Press, 2003.

[SW13] Dennis Stanton and Dennis White. Constructive Combinatorics.


Undergraduate Texts in Mathematics. Springer, 2013.

[Tin14] Daniel Ting. Streamed approximate counting of distinct ele-


ments: beating optimal batch methods. In Sofus A. Macskassy,
Claudia Perlich, Jure Leskovec, Wei Wang, and Rayid Ghani,
editors, The 20th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’14, New York,
NY, USA - August 24 - 27, 2014, pages 442–451. ACM, 2014.
[Tod91] Seinosuke Toda. PP is as hard as the polynomial-time hierarchy.
SIAM J. Comput., 20(5):865–877, 1991.
[Tof80] Tommaso Toffoli. Reversible computing. Technical Report
MIT/LCS/TM-151, MIT Laboratory for Computer Science,
1980.
[Val79] Leslie G. Valiant. The complexity of computing the permanent.
Theor. Comput. Sci., 8:189–201, 1979.
[Val82] Leslie G. Valiant. A scheme for fast parallel communication.
SIAM J. Comput., 11(2):350–361, 1982.
[VB81] Leslie G. Valiant and Gordon J. Brebner. Universal schemes
for parallel communication. In STOC, pages 263–277. ACM,
1981.
[Vit85] Jeffrey Scott Vitter. Random sampling with a reservoir. ACM
Trans. Math. Softw., 11(1):37–57, 1985.
[VN63] John von Neumann. Various techniques used in connection
with random digits. John von Neumann, Collected Works,
5:768–770, 1963.
[Vui80] Jean Vuillemin. A unifying look at data structures. Communi-
cations of the ACM, 23(4):229–239, 1980.
[VV86] Leslie G. Valiant and Vijay V. Vazirani. NP is as easy as
detecting unique solutions. Theor. Comput. Sci., 47(3):85–93,
1986.
[Wil04] David Bruce Wilson. Mixing times of lozenge tiling and
card shuffling Markov chains. Annals of Applied Probability,
14(1):274–325, 2004.
[Wil06] Herbert S. Wilf. generatingfunctionology. A. K. Peters, third
edition, 2006. Second edition is available on-line at https:
//www.math.penn.edu/~wilf/DownldGF.html.

[WNC87] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arith-


metic coding for data compression. Communications of the
ACM, 30(6):520–540, 1987.

[Yao77] Andrew Chi-Chih Yao. Probabilistic computations: Toward a


unified measure of complexity. In 18th Annual Symposium on
Foundations of Computer Science, pages 222–227. IEEE, 1977.

[Zob70] Albert L. Zobrist. A new hashing method with application for


game playing. Technical Report 88, University of Washington,
Computer Sciences Department, April 1970.

[Zur03] Wojciech Hubert Zurek. Decoherence, einselection, and the


quantum origins of the classical. Rev. Mod. Phys., 75:715–775,
May 2003.
Index

(r1 , r2 , p1 , p2 )-sensitive, 141 adapted sequence of random variables,


2-coloring, 245 153
3-way cut, 460 adaptive adversary, 107, 289
G(n, p), 91 additive drift, 230
hr(n), q(n), ρi-PCP verifier, 262 Adleman’s theorem, 255
d-ary cuckoo hashing, 127 adversary, 289
k-CNF formula, 245 adaptive, 107, 289
k-uniform hypergraph, 245 oblivious, 107, 289
k-universal, 116 advice, 255
k-wise independence, 21 agreement, 290
tmix , 174 Aldous-Fill manuscript, 183
Φ, 203 algorithm
-NNS, 141 gossip, 225
-PLEB, 141 algorithm
-nearest neighbor search, 141 Deutsch’s, 284
-point location in equal balls, 141 distributed, 289
ρ-close, 266 epidemic, 225
σ-algebra, 16, 43 evolutionary, 230
generated by a random variable, UCB1, 95
43 algorithmic information theory, 114
τ2 , 201 almost Markov, 322
F-measurable, 43 alphabet, 270
#DNF, 215, 220 amplitude
#P, 214 probability, 274
#SAT, 214 ancestor, 107
PCP theorem, 262 common, 108
0–1 random variable, 487 annealing schedule, 196
2-universal, 115 anti-concentration bound, 100
aperiodic, 176
absolute value, 279 approximation algorithm, 239
adapted, 84, 162 approximation ratio, 239
approximation scheme


fully polynomial-time, 218 branching process, 251


fully polynomial-time randomized, bubble sort, 364
214 Byzantine processes, 289
arithmetic coding, 217, 450
arm, 94 canonical paths, 204
Atari 2600, 373 cardinality estimation, 134
augmenting path, 212 cartesian tree, 105
average-case analysis, 2 Catalan number, 103
AVL tree, 103 causal coupling, 198
avoiding worst-case inputs, 13 CCNOT, 282
axioms of probability, 16 central limit theorem, 99
Azuma’s inequality, 80, 85 chaining, 115
Azuma-Hoeffding inequality, 80 Chapman-Kolmogorov equation, 167
Chebyshev’s inequality, 60
balanced, 102 Cheeger constant, 203
balancing Cheeger inequality, 203
set, 235 chromatic number, 91
bandit Chutes and Ladders, 484
multi-armed, 94 circuit
Bernoulli random variable, 33, 487 quantum, 274, 279
Berry-Esseen theorem, 100 random, 275
biased random walk, 160 clause, 220, 239, 245
binary search tree, 102 clique, 236
balanced, 102 closed loop, 125
binary tree, 102 CLT, 99
binomial random variable, 33 CNOT, 281
bipartite graph, 211 code
bit-fixing routing, 78, 208 error-correcting, 265
Bloom filter, 128, 472 Walsh-Hadamard, 265
counting, 133 codeword, 265
spectral, 133 coin
Bloomjoin, 132 weak shared, 292
Boole’s inequality, 53 collision, 13, 114, 115
bound coloring
union, 53 hypergraph, 245
bounded difference property, 89 common ancestor, 108
bowling, 429 common subsequence, 457
bra, 277 complement, 16
bra-ket notation, 277 completeness, 262
bracket, 277 complex conjugate, 277

computation concentration bounds, 185


randomized, 276 covariance, 35, 62
computation path, 255 crash, 290
computation tree, 255 cryptographic hash function, 14
concave, 55 cryptographically secure pseudoran-
concentration bound, 59 dom generator, 254
coupon collector problem, 185 cryptographically secure pseudoran-
conditional expectation, 7, 36 dom number generator, 9
conditional probability, 21 Cryptographically-secure pseudoran-
conductance, 203 domness., 9
congestion, 206 cuckoo, 122
conjecture cuckoo filter, 131
Unique Games, 272 cuckoo hash table, 122
conjugate cuckoo hashing, 122
complex, 277 d-ary, 127
conjugate transpose, 277 cumulative distribution function, 30
conjunctive normal form, 239 curse of dimensionality, 141
consensus, 290 cut
constraint, 270 minimum, 25
constraint graph, 270 cylinder set, 17
constraint hypergraph, 271
constraint satisfaction problem, 270 data stream, 133
continuous random variable, 30 dead cat bounce, 369
contraction, 25 decodable
controlled controlled NOT, 282 locally, 266
controlled NOT, 281 decoding
convex, 54 local, 266
convex function, 54 decoherence, 279
Cook reduction, 214 decomposition
count-duplicates, 134 Doob, 86
count-min sketch, 136 dense, 211
countable additivity, 16 density, 333
countable Markov chain, 166 dependency graph, 246
counting Bloom filter, 133 depth
coupling, 174 tree, 402
causal, 198 derandomization, 253
path, 189 detailed balance equations, 178
Coupling Lemma, 174 dictionary, 114
coupling time, 184, 186 discrepancy, 235
coupon collector problem, 48, 484 discrete probability space, 16

discrete random variable, 28, 31 error budget, 53


discretization, 146 error-correcting code, 265
disjoint, 16 Euler-Mascheroni constant, 8
disjunctive normal form, 220 event, 15, 16
distance events
total variation, 171 independent, 19
distributed algorithm, 289 evolutionary algorithm, 230
distributed algorithms, 14 execution log, 250
distributed computing, 289 expectation, 31
distribution, 30 conditional, 36
geometric, 47 conditioned on a σ-algebra, 43
stationary, 171 conditioned on a random variable,
DNF formula, 220 38
dominating set, 445, 452 conditioned on an event, 37
Doob decomposition, 86 iterated
Doob decomposition theorem, 155 law of, 41
Doob martingale, 88, 257 linearity of, 32
Doob’s martingale inequality, 163 expected time, 2
double record, 416 expected value, 2, 31
dovetail shuffle, 189 expected worst-case, 1
drift exponential generating function, 68
additive, 230
multiplicative, 232 false positive, 128
variable, 231 filter
drift analysis, 230 cuckoo, 131
drift process, 155 filtration, 44, 153
Dungeons and Dragons, 368 financial analysis, 368
dynamic perfect hashing, 122 fingerprint, 14, 131
dynamics fingerprinting, 14
Glauber, 193 finite Markov chain, 166
first passage time, 177
eigenvalue, 198 Flajolet-Martin sketch, 134, 433
eigenvector, 198 FPRAS, 214
einselection, 279 fractional solution, 239, 240
entropy, 217 Frobenius problem, 176
epidemic, 225 fully polynomial-time approximation
epidemic algorithm, 225 scheme, 218
equation fully polynomial-time randomized ap-
Wald’s, 36 proximation scheme, 214
Erdős-Szekeres theorem, 431, 458 function

Lyapunov, 225 hopscotch, 128


measurable, 28 locality-sensitive, 140
potential, 226 Zobrist, 119
futile word search puzzle, 443 heap, 105
heavy hitter, 136
game high probability, 51
unique, 272 high-probability bound, 1
gap, 268 hitting time, 224
gap-amplifying reduction, 269 Hoare’s FIND, 49, 483
gap-preserving reduction, 269 Hoeffding’s inequality, 80
gap-reducing reduction, 269 Hoeffding’s lemma, 80
gate homogeneous, 166
Toffoli, 282 hopscotch hashing, 128
generated by, 43 hypercube network, 77, 90
generating function hyperedge, 245, 271
exponential, 68 hypergraph, 245
probability, 33, 56 k-uniform, 245
geometric distribution, 47 HyperLogLog, 135
geometric random variable, 47
Glauber dynamics, 193 inclusion-exclusion formula, 19
GNI, 263 independence, 19
gossip algorithm, 225 of events, 19
graph of sets of events, 21
bipartite, 211 pairwise, 21
random, 29, 91 independent, 19, 21
GRAPH NON-ISOMORPHISM, 263 independent events, 19
Grover diffusion operator, 286 independent random variables, 30
independent set, 236
Hadamard operator, 281 index, 114
handwaving, 9, 250 indicator random variable, 29
harmonic number, 8 indicator variable, 7
hash function, 13, 115 inequality
perfect, 121 Azuma’s, 80, 85
tabulation, 119 Azuma-Hoeffding, 80
universal, 115 Boole’s, 53
hash table, 13, 114 Chebyshev’s, 60
cuckoo, 122 Cheeger, 203
hashing, 13 Doob’s martingale, 163
cuckoo, 122 Hoeffding’s, 80
dynamic perfect, 122 Jensen’s, 54

Kolmogorov’s, 163 linear program, 239


Markov’s, 51 linearity of expectation, 7, 32
McDiarmid’s, 88, 90 Lipschitz function, 197
inner product, 277 literal, 245
integer program, 239 Littlewood-Offord problem, 101
interactive proof, 263 load balancing, 13
irreducible, 175 load factor, 115
iterated expectation locality-sensitive hashing, 140
law of, 41 locally decodable, 266
Iverson bracket, 29 locally testable, 266
Iverson notation, 29 loop
closed, 125
Jensen’s inequality, 54 Lotka-Volterra model, 226
JLT, 148 Lovász Local Lemma, 243
Johnson-Lindenstrauss lemma, 142, Lovász local lemma
147 sampling, 249
Johnson-Lindenstrauss transformation, Lyapunov function, 225
148
joint distribution, 30 magnitude, 279
joint probability mass function, 30 Markov chain, 165, 166
lazy, 176
Karger’s min-cut algorithm, 25 Markov chain Monte Carlo, 165
Karp-Upfal-Wigderson bound, 484 Markov process, 166
ket, 277 Markov’s inequality, 51, 52
key, 102, 114 martingale, 84, 153
KNAPSACK, 217 Doob, 257
Kolmogorov extension theorem, 17 vertex exposure, 91
Kolmogorov’s inequality, 163 martingale difference sequence, 84
Las Vegas algorithm, 11 martingale property, 153
law of iterated expectation, 41 martingale transformation, 135
law of total probability, 6, 22 matching, 209, 299
lazy Markov chain, 176 matrix
leader election, 14, 292 stochastic, 276
left spine, 109 transition, 167
lemma unitary, 280
coupling, 174 max register, 297
Johnson-Lindenstrauss, 142, 147 maximum cut, 238
Lovász local, 243 McDiarmid’s inequality, 88, 90
Yao’s, 45 measurable, 31, 43
length, 361 measurable function, 28

measurable sets, 16 bra-ket, 277


measure number
probability, 15 Catalan, 103
measurement, 276 number P, 214
memoryless, 166
message-passing, 289 objective function, 239
method of bounded differences, 88 oblivious adversary, 107, 289
method of conditional probabilities, open addressing, 115
257 operator
Metropolis, 182 Grover diffusion, 286
Metropolis-Hastings Hadamard, 281
convergence, 196 optimal solution, 239
Metropolis-Hastings algorithm, 182 optional stopping theorem, 155
Miller-Rabin primality test, 5 orthogonal
min-cut, 25 nearly, 311
minimum cut, 25 outcomes, 16
mixing
pairwise independence, 62
rapid, 185
pairwise independent, 21
mixing rate, 201
paradox
mixing time, 165, 174
St. Petersburg, 34
moment, 68
parallel computing, 289
moment generating function, 68
partition, 16
Monte Carlo algorithm, 5, 11
path
Monte Carlo simulation, 11
simple, 357
multi-armed bandit, 94
PCP, 262
multigraph, 25
PCP theorem, 261, 262
multiplicative drift, 232
PCSA, 135
nearest neighbor search, 140 pending operation, 289
nearly orthogonal, 311 perfect hash function, 121
network perfect matching, 211
sorting, 235 perfect matchings, 215
NNS, 140 period, 176
no-cloning theorem, 282 periodic, 170
non-uniform, 256 permanent, 215
norm, 279 permutation routing, 77
normal form Perron-Frobenius theorem, 199
conjunctive, 239 pessimistic estimator, 258
disjunctive, 220 phase, 281
notation physical randomness, 9

Physical randomness., 9 problem


pivot, 6 promise, 273
PLEB, 141 product
point location in equal balls, 141 inner, 277
point query, 137 tensor, 275
points, 16 promise problem, 273
Poisson trial, 487 proof
polynomial-time hierarchy, 215 interactive, 263
positive probabilistically-checkable, 261
false, 128 pseudorandom generator, 253
potential function, 226 cryptographically secure, 254
perfect hashing pseudorandom number generator, 10
dynamic, 122 puzzle
preference, 291 word search, 442
primality test futile, 443
Miller-Rabin, 5
priority search tree, 105 quantum bit, 274
Probabilistic Counting with Stochas- quantum circuit, 274, 279
tic Averaging, 135 quantum computing, 274
probabilistic method, 14, 234 qubit, 274
probabilistic recurrence, 92 query
probabilistic recurrence relations, 231 point, 137
probabilistically-checkable proof, 261, QuickSelect, 49, 483
262 QuickSort, 6, 92
probability, 16
radix tree, 432
conditional, 21
RAM, 1
high, 51
random circuit, 275
stationary, 171
random graph, 29, 91
probability amplification, 253, 255
random structure, 14
probability amplitude, 274
random variable, 2, 15, 28
probability generating function, 33,
Bernoulli, 33
56
binomial, 33
probability mass function, 29
discrete, 28, 31
probability measure, 15
geometric, 47
probability of A conditioned on B, 21
indicator, 29
probability of A given B, 21
random walk, 158
probability space, 15
biased, 160
discrete, 16
unbiased, 158
probability theory, 15
with one absorbing barrier, 159
probing, 115

with two absorbing barriers, 158 sampling Lovász local lemma, 249
random-access machine, 1 SAT, 476
randomized computation, 276 satisfiability, 245
randomized rounding, 220, 240 satisfiability problem, 239
randomness satisfy, 239
physical, 9 scapegoat tree, 103
range coding, 450 search tree
rapid mixing, 185 balanced, 102
record, 416 binary, 102
double, 416 priority, 105
red-black tree, 103 second-moment method, 62
reducible, 170 seed, 9, 253
reduction, 214 separate chaining, 115
gap-amplifying, 269 set balancing, 235
gap-preserving, 269 SHA, 115
gap-reducing, 269 shared coin
register, 289 weak, 292
max, 297 shared memory, 289
regret, 95 sharp P, 214
rejection sampling, 182, 216, 450 sifter, 57, 292, 295
relax, 239 simple path, 357
relaxation, 239 simplex method, 240
relaxation time, 201 simulated annealing, 196
reservoir sampling, 332 simulation
reversibility, 171 Monte Carlo, 11
reversible, 178, 281 sketch, 133
right spine, 109 count-min, 136
ring-cover tree, 141 Flajolet-Martin, 134, 433
rock-paper-scissors, 226 Tug-of-War, 152
root, 4, 102 sort
rotation, 103 bubble, 364
tree, 103 sorting network, 235
routing soundness, 262
bit-fixing, 78, 208 Space Invaders, 374
permutation, 77 spanning tree, 469
run, 361, 455 spare, 429
spectral Bloom filter, 133
sampling, 11, 13 spectral theorem, 199
rejection, 182, 216, 450 spine
reservoir, 332 left, 109

right, 109 Perron-Frobenius, 199


sports reporting, 368 spectral, 199
sprite, 373 Toda’s, 215
standard deviation, 99 Valiant’s, 214
state space, 166 third moment, 100
state vector, 274 time
stationary distribution, 165, 168, 171 coupling, 184, 186
stationary probability, 171 expected, 2
Statistical pseudorandomness., 10 hitting, 224
stochastic matrix, 167, 276 mixing, 174
stochastic process, 166 relaxation, 201
stopping time, 153, 154 stopping, 153, 154
stream time-reversed chain, 180
data, 133 Toda’s theorem, 215
strike, 429 Toffoli gate, 282
strong law of large numbers, 99 total path length, 104
strongly k-universal, 116 total variation distance, 171
strongly 2-universal, 115 transformation
submartingale, 84, 86, 154 Johnson-Lindenstrauss, 148
subset sum principle, 265 unitary, 280
subtree, 102 transition matrix, 167
supermartingale, 84, 86, 92, 154 transition probabilities, 167
symmetry breaking, 14 transpose
conjugate, 277
tabulation hashing, 119 treap, 105, 430
tensor product, 275 tree
termination, 290 AVL, 103
test binary, 102
local, 266 binary search, 102
testable cartesian, 105
locally, 266 priority search, 105
theorem radix, 432
Adleman’s, 255 red-black, 103
Berry-Esseen, 100 ring-cover, 141
Doob decomposition, 155 rotation, 103
Erdős-Szekeres, 431 scapegoat, 103
Kolmogorov extension, 17 witness, 250
no-cloning, 282 tree rotation, 103
optional stopping, 155 truncation, 156
PCP, 261, 262 Tug-of-War sketch, 152

UCB1 algorithm, 95
unary, 146
unbiased random walk, 158
uniform, 256
union bound, 53
unique game, 272
Unique Games Conjecture, 272
unitary matrix, 280
unitary transformation, 280
universal hash family, 115
unranking, 216
upper confidence bound, 95

Valiant’s theorem, 214


validity, 290
value, 271
variable drift, 231
variance, 60
vector
state, 274
vector space, 42
vertex exposure martingale, 91

wait-free, 290
wait-free shared memory, 289
Wald’s equation, 36, 162
Walsh-Hadamard code, 265
weak shared coin, 292
with high probability, 51
witness, 5, 256
witness tree, 250
word search puzzle, 442
futile, 443
worst-case analysis, 2

xxHash, 115

Yao’s lemma, 3, 45

Zipf’s law, 309


Zobrist hashing, 119
