Probabilistic Methods in Combinatorics
Probabilistic Methods in Combinatorics
Yufei Zhao
Massachusetts Institute of Technology
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Contents
Contents
1 Introduction 1
1.1 Lower bounds to Ramsey numbers . . . . . . . . . . . . . . . . . . . 1
1.2 Set systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 2-colorable hypergraphs . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 List chromatic number of 𝐾𝑛,𝑛 . . . . . . . . . . . . . . . . . . . . . 12
2 Linearity of Expectations 17
2.1 Hamiltonian paths in tournaments . . . . . . . . . . . . . . . . . . . 17
2.2 Sum-free subset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Turán’s theorem and independent sets . . . . . . . . . . . . . . . . . 19
2.4 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Unbalancing lights . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Crossing number inequality . . . . . . . . . . . . . . . . . . . . . . . 25
3 Alterations 29
3.1 Dominating set in graphs . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Heilbronn triangle problem . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Markov’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 High girth and high chromatic number . . . . . . . . . . . . . . . . . 32
3.5 Random greedy coloring . . . . . . . . . . . . . . . . . . . . . . . . 33
4 Second Moment 37
4.1 Does a typical random graph contain a triangle? . . . . . . . . . . . . 37
4.2 Thresholds for fixed subgraphs . . . . . . . . . . . . . . . . . . . . . 42
4.3 Thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Clique number of a random graph . . . . . . . . . . . . . . . . . . . 55
4.5 Hardy–Ramanujan theorem on the number of prime divisors . . . . . 57
4.6 Distinct sums . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7 Weierstrass approximation theorem . . . . . . . . . . . . . . . . . . . 63
5 Chernoff Bound 69
5.1 Discrepancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Contents
10 Entropy 173
10.1 Basic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
10.2 Permanent, perfect matchings, and Steiner triple systems . . . . . . . 178
10.3 Sidorenko’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . 185
10.4 Shearer’s lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
11 Containers 201
11.1 Containers for triangle-free graphs . . . . . . . . . . . . . . . . . . . 203
11.2 Graph containers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.3 Hypergraph container theorem . . . . . . . . . . . . . . . . . . . . . 208
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
1 Introduction
Proof. Let the graph by 𝐺 = (𝑉, 𝐸). Assign every vertex a color, randomly either
black or white, uniformly and independently at random.
Let 𝐸 ′ be the set of edges with one black endpoint and one white endpoint. Then
(𝑉, 𝐸 ′) is a bipartite subgraph of 𝐺.
Every edge belongs to 𝐸 ′ with probability 1/2. So by the linearity of expectation, the
expected size of 𝐸 ′ is
1
E[|𝐸 ′ |] = |𝐸 | .
2
1
Thus there is some coloring with |𝐸 ′ | ≥ 2 |𝐸 |. Then (𝑉, 𝐸 ′) is the desired subgraph.
□
1
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
1 Introduction
© Cambridge Wittgenstein archive. All rights reserved. This content is excluded from our
Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use.
Frank Ramsey (1903–1930) wrote seminal papers in philosophy, economics, and
mathematical logic, before his untimely death at the age of 26 from liver problems.
See a recent profile of him in the New Yorker.
2
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
In the proof below, we will apply the union bound: for events 𝐸 1 , . . . , 𝐸 𝑚 ,
Proof. Color edges of 𝐾𝑛 with red or blue independently and uniformly at random.
For every fixed subset 𝑆 of 𝑘 vertices, let 𝐴𝑆 denote the event that 𝑆 induces a
monochromatic 𝐾 𝑘 , so that P( 𝐴𝑆 ) = 21− ( 2) . Then, by the union bound,
𝑘
© Ø ª ∑︁ 𝑛 1− ( 𝑘)
P(there is a monochromatic 𝐾 𝑘 ) = P
𝐴𝑆 ® ≤
® P( 𝐴𝑆 ) = 2 2 < 1.
[𝑛] [𝑛]
𝑘
«𝑆∈ ( 𝑘 ) ¬ 𝑆∈ ( 𝑘 )
Thus, with positive probability, the random coloring gives no monochromatic 𝐾 𝑘 . So
there exists some coloring with no monochromatic 𝐾 𝑘 . □
Erdős’ 1947 paper actually was phrased in terms of counting: of all 2 ( 2) possible
𝑛
In this course, we mostly consider finite probability spaces. While in principle the
finite probability arguments can be rephrased as counting, some of the later more
involved arguments are impractical without a probabilistic perspective.
Remark 1.1.4 (Constructive lower bounds). The above proof only gives the existence
of a red-blue edge-coloring of 𝐾𝑛 without monochromatic cliques. Is there a way to
find algorithmically find one? With an appropriate 𝑛, even though a random coloring
achieves the goal with very high probability, there is no efficient method (in polynomial
running time) to certify that any specific edge-coloring avoids monochromatic cliques.
3
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
1 Introduction
So even though there are lots of Ramsey colorings, it is hard to find and certify an
actual one. This difficulty has been described as finding hay in a haystack.
Finding constructive lower bounds is a major open problem. There was major progress
on this problem stemming from connections to randomness extractors in computer
science (e.g., Barak et al. 2012, Chattopadhyay & Zuckerman 2016, Cohen 2017)
Remark 1.1.5 (Ramsey number upper bounds). Although Ramsey proved that Ram-
sey numbers are finite, his upper bounds are quite large. Erdős–Szekeres (1935) used
a simple and nice inductive argument to show
𝑘 +ℓ
𝑅(𝑘 + 1, ℓ + 1) ≤ .
𝑘
For diagonal Ramsey numbers 𝑅(𝑘, 𝑘), this bound has the form 𝑅(𝑘, 𝑘) ≤ (4 − 𝑜(1)) 𝑘 .
Recently, in a major and surprising breakthrough, Campos, Griffiths, Morris, and
Sahasrabudhe (2023+) show that there is some constant 𝑐 > 0 so that for all sufficiently
large 𝑘,
𝑅(𝑘, 𝑘) ≤ (4 − 𝑐) 𝑘 .
This is the first exponential improvement over the Erdős–Szekeres bound.
Alteration method
Let us give another argument that slightly improves the earlier lower bound on Ramsey
numbers.
Instead of just taking a random coloring and analyzing it, we first randomly color, and
then fix some undesirable features. This is called the alteration method (sometimes
also the deletion method).
(1) Randomly color each edge of 𝐾𝑛 with red or blue (independently and uniformly
at random);
(2) Delete a vertex from every monochromatic 𝐾 𝑘 .
The process yields a 2-edge-colored clique with no monochromatic 𝐾 𝑘 (since the
second step destroyed all monochromatic cliques).
4
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
𝑛 1− ( 𝑘)
E𝑋 = 2 2 .
𝑘
In the second step, we delete at most |𝑋 | vertices (since we delete one vertex from
every clique). Thus final graph has size ≥ 𝑛 − |𝑋 |, which has expectation 𝑛 − 𝑛𝑘 21− ( 2) .
𝑘
Thus with positive probability, the remaining graph has ≥ 𝑛 − 𝑛𝑘 21− ( 2) vertices (and
𝑘
no monochromatic 𝐾 𝑘 by construction). □
bad events.
• (Independence) If all bad events are independent, then the probability that none
Î𝑛
of 𝐸𝑖 occurs is 𝑖=1 (1 − P(𝐸𝑖 )) > 0 (provided that all P(𝐸𝑖 ) < 1).
What if we are in some intermediate situation, where the union bound is not good
enough, and the bad events are not independent, but there are only few dependencies?
The Lovász local lemma provides us a solution when each event is only independent
with all but a small number of other events.
Here is a version of the Lovász local lemma, which we will prove later in Chapter 6.
5
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
1 Introduction
Here 𝑒 = 2.71 · · · is the base of the natural logarithm. This constant turns out to be
optimal in the above theorem.
Using the Lovász local lemma, let us give one more improvement to the Ramsey
number lower bounds.
Theorem 1.1.9 (Ramsey lower bound via local lemma; Spencer 1977)
+ 1 21− ( 2) < 1/𝑒, then 𝑅(𝑘, 𝑘) > 𝑛.
𝑘 𝑛 𝑘
If 2 𝑘−2
Proof. Color the edges of 𝐾𝑛 with red/blue uniformly and independently at random.
For each 𝑘-vertex subset 𝑆, let 𝐸 𝑆 be the event that 𝑆 induces a monochromatic 𝐾 𝑘 . So
P[𝐸 𝑆 ] = 21− ( 2) .
𝑘
In the setup of the local lemma, we have one independent random variable correspond-
ing to each edge. Each event 𝐸 𝑆 depends only on the variables corresponding to the
edges in 𝑆.
If 𝑆 and 𝑆′ are both 𝑘-vertex subsets, their cliques share an edge if and only if
|𝑆 ∩ 𝑆′ | ≥ 2. So for each 𝑆, there are at most 2𝑘 𝑘−2 choices 𝑘-vertex sets 𝑆′ with
𝑛
1 1
21− ( 2) <
𝑘
𝑘 𝑛
.
𝑒
2 𝑘−2 +1
So with positive probability none of the events 𝐸 𝑆 occur, which means an edge-coloring
with no monochromatic 𝐾 𝑘 ’s. □
6
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Sperner’s theorem
We say that a set family F is an antichain if no element of F is a subset of another
element of F (i.e., the elements of F are pairwise incomparable by containment).
Question 1.2.1
What is the maximum number of sets in an antichain of subsets of [𝑛]?
Theorem 1.2.3 (LYM inequality; Bollobás 1965, Lubell 1966, Meshalkin 1963, and
Yamamoto 1954)
If F is an antichain of subsets of [𝑛], then
∑︁ 𝑛 −1
≤ 1.
| 𝐴|
𝐴∈F
𝑛 𝑛
Sperner’s theorem follows since | 𝐴| ≤ ⌊𝑛/2⌋ for all 𝐴.
7
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
1 Introduction
For each 𝐴 ⊆ {1, 2, . . . , 𝑛}, let 𝐸 𝐴 denote the event that 𝐴 appears in the above chain.
Then 𝐸 𝐴 occurs if and only if all the elements of 𝐴 appears first in the permutation
𝜎, followed by all the elements of [𝑛] \ 𝐴. The number of such permutations is
| 𝐴|!(𝑛 − | 𝐴|)!. Hence
−1
| 𝐴|!(𝑛 − | 𝐴|)! 𝑛
P(𝐸 𝐴 ) = = .
𝑛! | 𝐴|
∑︁ 𝑛 −1 ∑︁
≤ P(𝐸 𝐴 ) ≤ 1. □
| 𝐴|
𝐴∈F 𝐴∈F
This bound is sharp: let 𝐴𝑖 range over all 𝑟-element subsets of [𝑟 + 𝑠] and set 𝐵𝑖 =
[𝑟 + 𝑠] \ 𝐴𝑖 .
Let us give an application/motivation for Bollobás’ two families theorem in terms of
transversals. Given a set family F , say that 𝑇 is a transversal for F if 𝑇 ∩ 𝑆 ≠ ∅ for
all 𝑆 ∈ F (i.e., 𝑇 hits every element of F ). Let 𝝉(F), the transversal number of F ,
be the size of the smallest transversal of F . Say that F is 𝝉-critical if 𝜏(F ′) < 𝜏(F )
whenever F ′ is a proper subset of F .
Question 1.2.5
What is the maximum size of a 𝜏-critical 𝑟-uniform F with 𝜏(F ) = 𝑠 + 1?
We claim that the answer is 𝑟+𝑠𝑟 . Indeed, let F = {𝐴1 , . . . , 𝐴𝑚 }, and 𝐵𝑖 an 𝑠-element
transversal of F \ { 𝐴𝑖 } for each 𝑖. Then the hypothesis of Bollobás’ two families
theorem is satisfied. Thus 𝑚 ≤ 𝑟+𝑠 𝑟 .
Conversely, F = [𝑟+𝑠]
𝑟 is 𝜏-critcal 𝑟-uniform with 𝜏(F ) = 𝑠 + 1 (why?).
8
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Note that Sperner’s theorem and LYM inequality are also special cases, since if
{𝐴1 , . . . , 𝐴𝑚 } is an antichain, then setting 𝐵𝑖 = [𝑛] \ 𝐴𝑖 for all 𝑖 satisfies the hypothesis.
Proof. The proof is a modification of the proof of the LYM inequality earlier.
Note that the events 𝐸𝑖 are disjoint, since 𝐸𝑖 and 𝐸 𝑗 both occurring would contradict
Í
the hypothesis for 𝐴𝑖 , 𝐵𝑖 , 𝐴 𝑗 , 𝐵 𝑗 (why?). Thus 𝑖 P(𝐸𝑖 ) ≤ 1. This yields the claimed
inequality. □
Bollobas’ two families theorem has many interesting generalizations that we will not
discuss here (e.g., see Gil Kalai’s blog post). There are also beautiful linear algebraic
proofs of this theorem and its extensions.
One way to generate a large intersecting family is to include all sets that contain a fixed
element (say, the element 1). This family has size 2𝑛−1 and is clearly intersecting.
(This isn’t the only example with size 2𝑛−1 ; can you think of other intersecting families
with the same size?)
9
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
1 Introduction
It turns out that one cannot do better than 2𝑛−1 . Since we can pair up each subset of [𝑛]
with its complement. At most one of 𝐴 and [𝑛] \ 𝐴 can be in an intersecting family.
And so at most half of all sets can be in an intersecting family.
The question becomes much more interesting if we restrict to 𝑘-uniform families.
Remark 1.2.10. The assumption 𝑛 ≥ 2𝑘 is necessary since if 𝑛 < 2𝑘, then the family
of all 𝑘-element subsets of [𝑛] is automatically intersecting by pigeonhole.
10
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
We say that 𝐻 is 𝒓-colorable if the vertices can be colored using 𝑟 colors so that no
edge is monochromatic.
Let 𝒎(𝒌) denote the minimum number of edges in a 𝑘-uniform hypergraph that is
not 2-colorable (elsewhere in the literature, “2-colorable” = “property B”, named after
Bernstein who introduced the concept in 1908). Some small values:
• 𝑚(2) = 3
• 𝑚(3) = 7. Example: Fano plane (below) is not 2-colorable (the fact there are no
6-edge non-2-colorable 3-graphs can be proved by exhaustive search).
Proof. Let there be 𝑚 < 2 𝑘−1 edges. In a random 2-coloring, the probability that there
is a monochromatic edge is ≤ 2−𝑘+1 𝑚 < 1. □
Remark 1.3.2. Later in Section 3.5 we will prove an better lower bound 𝑚(𝑘) ≳
√︁
2 𝑘 𝑘/log 𝑘, which is the best known to date.
Perhaps somewhat surprisingly, the state of the art upper bound is also proved using
probabilistic method (random construction).
11
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
1 Introduction
Given a coloring 𝜒 : 𝑉 → [2], if 𝜒 colors 𝑎 vertices with one color and 𝑏 vertices
with the other color, then the probability that the (random) edge 𝑆1 is monochromatic
under the (non-random) coloring 𝜒 is
𝑎 𝑏 𝑛/2
+
𝑘
2
𝑘 𝑘 𝑘 2(𝑛/2)(𝑛/2 − 1) · · · (𝑛/2 − 𝑘 + 1) 𝑛/2 − 𝑘 + 1
𝑛 ≥ 𝑛 = ≥2
𝑘 𝑘
𝑛(𝑛 − 1) · · · (𝑛 − 𝑘 + 1) 𝑛−𝑘 +1
𝑘 𝑘
−𝑘+1 𝑘 −1 −𝑘+1 𝑘 −1
=2 1− =2 1− 2 ≥ 𝑐2−𝑘
𝑛−𝑘 +1 𝑘 −𝑘 +1
12
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
{1, 2} {1, 2}
{1, 3} {1, 3}
{2, 3} {2, 3}
Question 1.4.1
What is the asymptotic behavior of ch(𝐾𝑛,𝑛 )?
Theorem 1.4.2
If 𝑛 < 2 𝑘−1 , then 𝐾𝑛,𝑛 is 𝑘-choosable.
In other words, ch(𝐾𝑛,𝑛 ) ≤ log2 (2𝑛) + 1.
Proof. For each color, mark it either L or R independently and uniformly at random.
For any vertex of 𝐾𝑛,𝑛 on the left part, remove all its colors marked R.
For any vertex of 𝐾𝑛,𝑛 on the right part, remove all its colors marked L.
The probability that some vertex has no colors remaining is at most 2𝑛2−𝑘 < 1 by the
union bound. So with positive probability, every vertex has some color remaining.
Assign the colors arbitrarily for a valid coloring. □
The lower bound on ch(𝐾𝑛,𝑛 ) turns out to follow from the existence of non-2-colorable
𝑘-uniform hypergraph with many edges.
Theorem 1.4.3
If there exists a non-2-colorable 𝑘-uniform hypergraph with 𝑛 edges, then 𝐾𝑛,𝑛 is not
𝑘-choosable.
Recall from Theorem 1.3.3 that there exists a non-2-colorable 𝑘-uniform hypergraph
with 𝑂 (𝑘 2 2 𝑘 ) edges. Thus ch(𝐾𝑛,𝑛 ) > (1 − 𝑜(1)) log2 𝑛.
Putting these bounds together:
13
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
1 Introduction
It turns out that, unlike the chromatic number, the list chromatic number always
grows with the average degree. The following result was proved using the method of
hypergraph containers, a very important modern development in combinatorics that
we will see a glimpse of in Chapter 11. It provides the optimal asymptotic dependence
(the example of 𝐾𝑛,𝑛 shows optimality).
They also proved similar results for the list chromatic number of hypergraphs. For
graphs, a slightly weaker result, off by a factor of 2, was proved earlier by Alon (2000).
Exercises
1. Verify the following asymptotic calculations used in Ramsey number lower
bounds:
a) For each 𝑘, the largest 𝑛 satisfying 𝑛𝑘 21− ( 2) < 1 has 𝑛 = √1 + 𝑜(1) 𝑘2 𝑘/2 .
𝑘
𝑒 2
then the Ramsey number 𝑅(𝑘, 𝑡) satisfies 𝑅(𝑘, 𝑡) > 𝑛. Using this show that
3/2
𝑡
𝑅(4, 𝑡) ≥ 𝑐
log 𝑡
14
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
3. Let 𝐺 be a graph with 𝑛 vertices and 𝑚 edges. Prove that 𝐾𝑛 can be written as a
union of 𝑂 (𝑛2 (log 𝑛)/𝑚) isomorphic copies of 𝐺 (not necessarily edge-disjoint).
4. Prove that there is an absolute constant 𝐶 > 0 so that for every 𝑛 × 𝑛 matrix
with distinct real entries, one can permute its rows so that no column in the
√
permuted matrix contains an increasing subsequence of length at least 𝐶 𝑛. (A
subsequence does not need to be selected from consecutive terms. For example,
(1, 2, 3) is an increasing subsequence of (1, 5, 2, 4, 3).)
5. Generalization of Sperner’s theorem. Let F be a collection of subset of [𝑛] that
does not contain 𝑘 + 1 elements forming a chain: 𝐴1 ⊊ · · · ⊊ 𝐴 𝑘+1 . Prove that
F is no larger than taking the union of the 𝑘 levels of the Boolean lattice closest
to the middle layer.
6. Let 𝐺 be a graph on 𝑛 ≥ 10 vertices. Suppose that adding any new edge to 𝐺
would create a new clique on 10 vertices. Prove that 𝐺 has at least 8𝑛 − 36 edges.
Hint: Apply Bollobás’ two families theorem
15
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
2 Linearity of Expectations
Linearity of expectations refers to the following basic fact about the expectation: given
random variables 𝑋1 , . . . , 𝑋𝑛 and constants 𝑐 1 , . . . , 𝑐 𝑛 ,
This identity does not require any assumption of independence. On the other hand,
generally E[𝑋𝑌 ] ≠ E[𝑋]E[𝑌 ] unless 𝑋 and 𝑌 are uncorrelated (independent random
variables are always uncorrelated).
Here is a simple application (there are also much more involved solutions via enumer-
ation methods).
Solution. Let 𝑋𝑖 be the event that element 𝑖 ∈ [𝑛] is fixed. Then E[𝑋𝑖 ] = 1/𝑛. The
expected number of fixed points is
17
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
2 Linearity of Expectations
The minimization problem is easier. The transitive tournament (i.e., respecting a fixed
linear ordering of vertices) has exactly one Hamilton path. On the other hand, every
tournament has at least one Hamilton path (Exercise: prove this! Hint: consider a
longest directed path).
The maximization problem is more difficult and interesting. Here we have some
asymptotic results.
Proof. Consider a random tournament where every edge is given a random orientation
chosen uniformly and independently. Each of the 𝑛! permutations of vertices forms a
directed path with probability 2−𝑛+1 . So that expected number of Hamilton paths is
𝑛!2−𝑛+1 . Thus, there exists a tournament with at least this many Hamilton paths. □
This was considered the first use of the probabilistic method. Szele conjectured that the
maximum number of Hamilton paths in a tournament on 𝑛 players is 𝑛!/(2 − 𝑜(1)) 𝑛 .
This was proved by Alon (1990) using the Minc–Brégman theorem on permanents (we
will see this later in Chapter 10 on the entropy method).
where {·} denotes fractional part. Then 𝐴𝜃 is sum-free since (1/3, 2/3) is sum-free in
R/Z.
For 𝜃 uniformly chosen at random, {𝑎𝜃} is also uniformly random in [0, 1], so P(𝑎 ∈
𝐴𝜃 ) = 1/3. By linearity of expectations, E| 𝐴𝜃 | = 𝑛/3. □
18
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Remark 2.2.2 (Additional results). Alon and Kleitman (1990) noted that one can
improve the bound to ≥ (𝑛 + 1)/3 by noting that | 𝐴𝜃 | = 0 for 𝜃 close to zero (say,
|𝜃| < (3 max 𝑎 ∈ 𝐴 |𝑎|) −1 ), so that | 𝐴𝜃 | < 𝑛/3 with positive probability, and hence
| 𝐴𝜃 | > 𝑛/3 with positive probability. Note that since | 𝐴𝜃 | is an integer, being > 𝑛/3 is
the same as being ≥ (𝑛 + 1)/3.
Bourgain (1997) improved it to ≥ (𝑛 + 2)/3 via a difficult Fourier analytic argument.
This is currently the best lower bound known.
It remains an open problem to prove ≥ (𝑛 + 𝑓 (𝑛))/3 for some function 𝑓 (𝑛) → ∞.
In the other direction, Eberhard, Green, and Manners (2014) showed that there exist
𝑛-element sets of integers whose largest sum-free subset has size ≤ (1/3 + 𝑜(1))𝑛.
Taking the complement of a graph changes its independent sets to cliques and vice
versa. So the problem is equivalent to one about graphs without large independent
sets.
The following result, due to Caro (1979) and Wei (1981), shows that a graph with
small degrees much contain large independent sets. The probabilistic method proof
shown here is due to Alon and Spencer.
Proof. Consider a random ordering (permutation) of the vertices. Let 𝐼 be the set of
vertices that appear before all of its neighbors. Then 𝐼 is an independent set.
1
For each 𝑣 ∈ 𝑉, P(𝑣 ∈ 𝐼) = 1+𝑑 𝑣
(this is the probability that 𝑣 appears first among
{𝑣} ∪ 𝑁 (𝑣)). Thus E|𝐼 | = 𝑣∈𝑉 (𝐺) 𝑑 𝑣1+1 . Thus with positive probability, |𝐼 | is at least
Í
this expectation. □
19
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
2 Linearity of Expectations
By taking the complement of the graph, independent sets become cliques, and so we
obtain the following corollary.
Corollary 2.3.5
Every 𝑛-vertex graph 𝐺 contains a clique of size at least
∑︁ 1
.
𝑛 − 𝑑𝑣
𝑣∈𝑉 (𝐺)
𝑇10,3 = 𝐾3,3,4 =
20
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
2.4 Sampling
1 𝑛2
1− .
𝑟 2
Proof. Let 𝑚 be the number of edges. Since 𝐺 is 𝐾𝑟+1 -free, by Corollary 2.3.5, the
size 𝜔(𝐺) of the largest clique of 𝐺 satisfies
∑︁ 1 𝑛 𝑛
𝑟 ≥ 𝜔(𝐺) ≥ ≥ = .
𝑛 − 𝑑𝑣 𝑛 − 𝑛 𝑣 𝑑𝑣 𝑛 − 2𝑚
1 Í
𝑣∈𝑉 𝑛
1 𝑛2
Rearranging gives 𝑚 ≤ 1 − 𝑟 2. □
Remark 2.3.7. By a careful refinement of the above argument, we can deduce Turán’s
theorem that 𝑇𝑛,𝑟 maximizes the number of edges in a 𝑛-vertex 𝐾𝑟+1 -free graph, by
Í 1 Í
noting that 𝑣∈𝑉 𝑛−𝑑 𝑣
is minimized over fixed 𝑣 𝑑 𝑣 when the degrees are nearly equal.
Also, Theorem 2.3.6 is asymptotically tight in the sense that the Turán graph 𝑇𝑛,𝑟 , for
fixed 𝑟 and 𝑛 → ∞, as (1 − 1/𝑟 − 𝑜(1))𝑛2 /2 edges.
For more on this topic, see Chapter 1 of my textbook Graph Theory and Additive
Combinatorics and the class with the same title.
2.4 Sampling
By Turán’s theorem (actually Mantel’s theorem, in this case for triangles, the maximum
number of edges in an 𝑛-vertex triangle-free graph is 𝑛2 /4 .
How about the problem for hypergraphs? A tetrahedron, denoted 𝐾4(3) , is a complete 3-
uniform hypergraph (3-graph) on 4 vertices (think of the faces of a usual 3-dimensional
tetrahedron).
This turns out to be a notorious open problem. Turán conjectured that the answer is
5 𝑛
+ 𝑜(1) ,
9 3
21
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
2 Linearity of Expectations
𝑉1
𝑉2 𝑉3
Above, the vertices are partitioned into three nearly equal sets 𝑉1 , 𝑉2 , 𝑉3 , and all the
edges come in two types: (i) with one vertex in each of the three parts, and (ii) two
vertices in 𝑉𝑖 and one vertex in 𝑉𝑖+1 , with the indices considered mod 3.
Let us give some easy upper bounds, in order to illustrate a simple yet important
technique of bounding by sampling.
Proof. Let 𝑆 be a 4-vertex subset chosen uniformly at random. If the graph has
𝑝 𝑛3 edges, then the expected number of edges induced by 𝑆 is 4𝑝 by linearity of
expectations (why?).
Since the 3-graph is tetrahedron-free, 𝑆 induces at most 3 edges. Therefore, 4𝑝 ≤ 3.
Thus the total number of edges is 𝑝 𝑛3 ≤ 34 𝑛3 .
□
Why stop at sampling four vertices? Can we do better by sampling five vertices? To
run the above argument, we will know how many edges can there be in a 5-vertex
tetrahedron-free graph.
Lemma 2.4.3
A 5-vertex tetrahedron-free 3-graph has at most 7 edges.
We can improve Proposition 2.4.2 by sampling 5 vertices instead of 4 in its proof. This
yields (check):
22
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Proposition 2.4.4
7 𝑛
Every tetrahedron-free 3-graph on 𝑛 ≥ 4 vertices has at most 10 3 edges.
By sampling 𝑠 vertices and using brute-force search to solve the 𝑠-vertex problem,
we can improve the upper bound by taking larger values of 𝑠. In fact in principle, if
we had unlimited computational power, we can arbitrarily close to optimum by taking
sufficiently large 𝑠 (why?). However, this is not a practical method due to the cost of
the brute-force search. There are more clever ways to get better bounds (also with the
help of a computer). The best known upper bound notably via a method known as flag
algebras (using sums of squares) due to Razborov, which can give ≤ (0.561 · · · ) 𝑛3 ).
For more on the Hypergraph Turán problem, see the survey by Keevash (2011).
Theorem 2.5.1
Let 𝑎𝑖 𝑗 ∈ {−1, 1} for all 𝑖, 𝑗 ∈ [𝑛]. There exists 𝑥𝑖 , 𝑦 𝑗 ∈ {−1, 1} for all 𝑖, 𝑗 ∈ [𝑛] such
that √︂ !
𝑛
∑︁ 2
𝑎 𝑖 𝑗 𝑥𝑖 𝑦 𝑗 ≥ + 𝑜(1) 𝑛3/2 .
𝑖, 𝑗=1
𝜋
23
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
2 Linearity of Expectations
and set 𝑥𝑖 ∈ {−1, 1} to be the sign of 𝑅𝑖 (arbitrarily choose 𝑥𝑖 if 𝑅𝑖 = 0. Then the LHS
sum is
𝑛
∑︁ 𝑛
∑︁
𝑅𝑖 𝑥𝑖 = |𝑅𝑖 | .
𝑖=1 𝑖=1
For each 𝑖, 𝑅𝑖 has the same distribution as a sum of 𝑛 i.i.d. uniform {−1, 1}: 𝑆𝑛 =
𝜀1 + · · · + 𝜀 𝑛 (note that 𝑅𝑖 ’s are not independent for different 𝑖’s). Thus, for each 𝑖
√︂ !
2 √
E[|𝑅𝑖 |] = E[|𝑆𝑛 |] = + 𝑜(1) 𝑛,
𝜋
(one can also use binomial sum identities to compute exactly: E[|𝑆𝑛 |] = 𝑛21−𝑛 𝑛−1
⌊(𝑛−1)/2⌋ ,
though it is rather unnecessary to do so.) Thus
𝑛
√︂ !
∑︁ 2
E |𝑅𝑖 | = + 𝑜(1) 𝑛3/2 .
𝑖=1
𝜋
√︃
2
Thus with positive probability, the sum is ≥ 𝜋 + 𝑜(1) 𝑛3/2 . □
The next example is tricky. The proof will set up a probabilistic process where the
parameters are not given explicitly. A compactness argument will show that a good
choice of parameters exists.
Theorem 2.5.2
Let 𝑘 ≥ 2. Let 𝑉 = 𝑉1 ∪ · · · ∪ 𝑉𝑘 , where 𝑉1 , . . . , 𝑉𝑘 are disjoint sets of size 𝑛. The
edges of the complete 𝑘-uniform hypergraph on 𝑉 are colored with red/blue. Suppose
that every edge formed by taking one vertex from each 𝑉1 , . . . , 𝑉𝑘 is colored blue.
Then there exists 𝑆 ⊆ 𝑉 such that the number of red edges and blue edges in 𝑆 differ
by more than 𝑐 𝑘 𝑛 𝑘 , where 𝑐 𝑘 > 0 is a constant.
Proof. We will write this proof for 𝑘 = 3 for notational simplicity. The same proof
works for any 𝑘.
Let 𝑝 1 , 𝑝 2 , 𝑝 3 be real numbers to be decided. We are going to pick 𝑆 randomly by
24
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Then
E[#{blue edges in 𝑆} − #{red edges in 𝑆}]
equals to some polynomial
∑︁
𝑓 ( 𝑝1, 𝑝2, 𝑝3) = 𝑎𝑖, 𝑗,𝑘 𝑝𝑖 𝑝 𝑗 𝑝 𝑘 = 𝑛3 𝑝 1 𝑝 2 𝑝 3 + 𝑎 1,1,1 𝑝 31 + 𝑎 1,1,2 𝑝 21 𝑝 2 + · · · .
𝑖≤ 𝑗 ≤𝑘
Lemma 2.5.3
Let 𝑃 𝑘 denote the set of polynomials 𝑔( 𝑝 1 , . . . , 𝑝 𝑘 ) of degree 𝑘, whose coefficients
have absolute value ≤ 1, and the coefficient of 𝑝 1 𝑝 2 · · · 𝑝 𝑘 is 1. Then there is a
constant 𝑐 𝑘 > 0 such that for all 𝑔 ∈ 𝑃 𝑘 , there is some 𝑝 1 , . . . , 𝑝 𝑘 ∈ [0, 1] with
|𝑔( 𝑝 1 , . . . , 𝑝 𝑘 )| ≥ 𝑐.
Proof of Lemma. Set 𝑀 (𝑔) = sup 𝑝1 ,...,𝑝 𝑘 ∈[0,1] |𝑔( 𝑝 1 , . . . , 𝑝 𝑘 )| (note that sup is achieved
as max due to compactness). For 𝑔 ∈ 𝑃 𝑘 , since 𝑔 is nonzero (its coefficient of
𝑝 1 𝑝 2 · · · 𝑝 𝑘 is 1), we have 𝑀 (𝑔) > 0. As 𝑃 𝑘 is compact and 𝑀 : 𝑃 𝑘 → R is continuous,
𝑀 attains a minimum value 𝑐 = 𝑀 (𝑔) > 0 for some 𝑔 ∈ 𝑃 𝑘 . ■ □
25
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
2 Linearity of Expectations
(It is not too hard to show that Wagner’s theorem and Kuratowski’s theorem are
equivalent)
If a graph has a lot of edges, is it guaranteed to have a lot of crossings no matter how
it is drawn in the plane?
Question 2.6.1
What is the minimum possible number of crossings that a drawing of:
• 𝐾𝑛 ? (Hill’s conjecture)
• 𝐾𝑛,𝑛 ? (Zarankiewicz conjecture; Turán’s brick factory problem)
• a graph on 𝑛 vertices and 𝑛2 /100 edges?
|𝐸 | 3
cr(𝐺) ≳ .
|𝑉 | 2
Corollary 2.6.4
In a graph 𝐺 = (𝑉, 𝐸), if |𝐸 | ≳ |𝑉 | 2 , then cr(𝐺) ≳ |𝑉 | 4 .
Proof. The proof has three steps, starting with some basic facts on planar graphs.
26
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Thus |𝐸 | ≤ 3|𝑉 | for all planar graphs. Hence cr(𝐺) > 0 whenever |𝐸 | > 3|𝑉 |.
Step 2: From one to many.
The above argument gives us one crossing. Next, we will use it to obtain many
crossings.
By deleting one edge for each crossing, we get a planar graph, so |𝐸 | − cr(𝐺) ≤ 3|𝑉 |,
that is
cr(𝐺) ≥ |𝐸 | − 3|𝑉 |.
This is a “cheap bound.” For graphs with |𝐸 | = Θ(𝑛2 ), this gives cr(𝐺) ≳ 𝑛2 . This is
not a great bound. We next will use the probabilistic method to boost this bound.
Step 3: Bootstrapping.
Let 𝑝 ∈ [0, 1] to be decided. Let 𝐺 ′ = (𝑉 ′, 𝐸 ′) be obtained from 𝐺 by randomly
keeping each vertex with probability 𝑝. Then
cr(𝐺 ′) ≥ |𝐸 ′ | − 3|𝑉 ′ |.
So
E cr(𝐺 ′) ≥ E|𝐸 ′ | − 3E|𝑉 ′ |
We have E cr(𝐺 ′) ≤ 𝑝 4 cr(𝐺), E|𝐸 ′ | = 𝑝 2 |𝐸 | and E|𝑉 ′ | = 𝑝E|𝑉 |. So
𝑝 4 cr(𝐺) ≥ 𝑝 2 |𝐸 | − 3𝑝|𝑉 |.
Thus
cr(𝐺) ≥ 𝑝 −2 |𝐸 | − 3𝑝 −3 |𝑉 |.
Setting 𝑝 = 4 |𝑉 | /|𝐸 | ∈ [0, 1] (here is where we use the hypothesis that |𝐸 | ≥ 4 |𝑉 |)
so that 4𝑝 −3 |𝑉 | = 𝑝 −2 |𝐸 |, we obtain cr(𝐺) ≳ |𝐸 | 3 /|𝑉 | 2 . □
Remark 2.6.5. The above idea of boosting a cheap bound to a better bound is an
important one. We saw a version of this idea in Section 2.4 where we sampled a
constant number of vertices to deduce upper bounds on the hypergraph Turán num-
ber. In the above crossing number inequality application, we are also applying some
preliminary cheap bound to some sampled induced subgraph, though this time the
sampled subgraph has super-constant size.
It is tempting to modify the proof by sampling edges instead of vertices, but this does
not work.
27
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
2 Linearity of Expectations
Exercises
1. Let 𝐴 be a measurable subset of the unit sphere in R3 (centered at the origin)
containing no pair of orthogonal points.
a) Prove that 𝐴 occupies at most 1/3 of the sphere in terms of surface area.
b) ★ Prove an upper bound smaller than 1/3 (give your best bound).
2. ★ Prove that every set of 10 points in the plane can be covered by a union of
disjoint unit disks.
3. Let r = (𝑟 1 , . . . , 𝑟 𝑘 ) be a vector of nonzero integers whose sum is nonzero.
Prove that there exists a real 𝑐 > 0 (depending on r only) such that the following
holds: for every finite set 𝐴 of nonzero reals, there exists a subset 𝐵 ⊆ 𝐴 with
|𝐵| ≥ 𝑐| 𝐴| such that there do not exist 𝑏 1 , . . . , 𝑏 𝑘 ∈ 𝐵 with 𝑟 1 𝑏 1 + · · · + 𝑟 𝑘 𝑏 𝑘 = 0.
4. Prove that every set 𝐴 of 𝑛 nonzero integers contains two disjoint subsets 𝐵1 and
𝐵2 , such that both 𝐵1 and 𝐵2 are sum-free, and |𝐵1 | + |𝐵2 | > 2𝑛/3.
5. Let 𝐺 be an 𝑛-vertex graph with 𝑝𝑛2 edges, with 𝑛 ≥ 10 and 𝑝 ≥ 10/𝑛.
Prove that 𝐺 contains a pair of vertex-disjoint and isomorphic subgraphs (not
necessarily induced) each with at least 𝑐 𝑝 2 𝑛2 edges, where 𝑐 > 0 is a constant.
6. ★ Prove that for every positive integer 𝑟, there exists an integer 𝐾 such that the
following holds. Let 𝑆 be a set of 𝑟 𝑘 points evenly spaced on a circle. If we
partition 𝑆 = 𝑆1 ∪ · · · ∪ 𝑆𝑟 so that |𝑆𝑖 | = 𝑘 for each 𝑖, then, provided 𝑘 ≥ 𝐾,
there exist 𝑟 congruent triangles where the vertices of the 𝑖-th triangle lie in 𝑆𝑖 ,
for each 1 ≤ 𝑖 ≤ 𝑟.
7. ★ Prove that [𝑛] 𝑑 cannot be partitioned into fewer than 2𝑑 sets each of the form
𝐴1 × · · · × 𝐴𝑑 where 𝐴𝑖 ⊊ [𝑛].
28
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
3 Alterations
We saw the alterations method in Section 1.1 to give lower bounds to Ramsey numbers.
The basic idea is to first make a random construction, and then fix the blemishes.
Theorem 3.1.1
Every graph on 𝑛 vertices with minimum degree 𝛿 > 1 has a dominating set of size
log(𝛿 + 1) + 1
≤ 𝑛.
𝛿+1
Naive attempt: take out vertices greedily. The first vertex eliminates 1 + 𝛿 vertices, but
subsequent vertices eliminate possibly fewer vertices.
29
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
3 Alterations
Question 3.2.1
How can one place 𝑛 points in the unit square so that no three points forms a triangle
with small area?
Let
Δ(𝑛) = sup min area( 𝑝𝑞𝑟)/
𝑆⊆[0,1] 2 𝑝,𝑞,𝑟∈𝑆
distinct
|𝑆|=𝑛
Naive constructions fair poorly. E.g., 𝑛 points around a circle has a triangle of area
Θ(1/𝑛3 ) (the triangle formed by three consectutive points has side lengths ≍ 1/𝑛 and
angle 𝜃 = (1 − 1/𝑛)2𝜋). Even worse is arranging points on a grid, as you would get
triangles of zero area.
Heilbronn conjectured that Δ(𝑛) = 𝑂 (𝑛−2 ).
Komlós, Pintz, and Szemerédi (1982) disproved the conjecture, showing Δ(𝑛) ≳
𝑛−2 log 𝑛. They used an elaborate probabilistic construction. Here we show a much
simpler version probabilistic construction that gives a weaker bound Δ(𝑛) ≳ 𝑛−2 .
Remark 3.2.2 (Upper bounds). For a long time, the best upper bound known was
Δ(𝑛) ≤ 𝑛−8/7+𝑜(1) due to Komlós, Pintz, and Szemerédi (1981). This was recently
improved to Δ(𝑛) ≤ 𝑛−8/7−𝑐 by Cohen, Pohoata, and Zakharov (2023+).
Proof. Choose 2𝑛 points at random. For every three random points 𝑝, 𝑞, 𝑟, let us
estimate
P 𝑝,𝑞,𝑟 (area( 𝑝, 𝑞, 𝑟) ≤ 𝜀).
By considering the area of a circular annulus around 𝑝, with inner and outer radii 𝑥
and 𝑥 + Δ𝑥, we find
30
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
For fixed 𝑝, 𝑞
2𝜀 𝜀
P𝑟 (area( 𝑝𝑞𝑟) ≤ 𝜀) = P𝑟 dist( 𝑝𝑞, 𝑟) ≤ ≲ .
| 𝑝𝑞| | 𝑝𝑞|
Given these 2𝑛 random points, let 𝑋 be the number of triangles with area ≤ 𝜀. Then
E𝑋 = 𝑂 (𝜀𝑛3 ).
Choose 𝜀 = 𝑐/𝑛2 with 𝑐 > 0 small enough so that E𝑋 ≤ 𝑛.
Delete a point from each triangle with area ≤ 𝜀.
The expected number of remaining points is E[2𝑛 − 𝑋] ≥ 𝑛, and no triangles with area
≤ 𝜀 = 𝑐/𝑛2 .
Thus with positive probability, we end up with ≥ 𝑛 points and no triangle with area
≤ 𝑐/𝑛2 . □
E[𝑋]
P(𝑋 ≥ 𝑎) ≤ .
𝑎
31
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
3 Alterations
Proof. Let 𝐺 ∼ 𝐺 (𝑛, 𝑝) with 𝑝 = (log 𝑛) 2 /𝑛 (the proof works whenever log 𝑛/𝑛 ≪
𝑝 ≪ 𝑛−1+1/ℓ ). Here 𝐺 (𝑛, 𝑝) is Erdős–Rényi random graph (𝑛 vertices, every edge
appearing with probability 𝑝 independently).
Let 𝑋 be the number of cycles of length at most ℓ in 𝐺. By linearity of expectations,
as there are exactly 𝑛𝑖 (𝑖 − 1)!/2 cycles of length 𝑖 in 𝐾𝑛 for each 3 ≤ 𝑖 ≤ 𝑛, we have
(recall that ℓ is a constant)
ℓ ℓ
∑︁ 𝑛 (𝑖 − 1)! ∑︁
E𝑋 = 𝑖
𝑝 ≤ 𝑛𝑖 𝑝𝑖 = ℓ(log 𝑛) 2𝑖 = 𝑜(𝑛).
𝑖=3
𝑖 2 𝑖=3
By Markov’s inequality
E𝑋
P(𝑋 ≥ 𝑛/2) ≤ = 𝑜(1).
𝑛/2
(This allows us to get rid of all short cycles.)
How can we lower bound the chromatic number 𝜒(·)? Note that 𝜒(𝐺) ≥ |𝑉 (𝐺)|/𝛼(𝐺),
where 𝛼(𝐺) is the independence number (the size of the largest independent set).
With 𝑥 = (3/𝑝) log 𝑛 = 3𝑛/log 𝑛,
𝑛
(1 − 𝑝) ( 2 ) < 𝑛𝑥 𝑒 −𝑝𝑥(𝑥−1)/2 = (𝑛𝑒 −𝑝(𝑥−1)/2 ) 𝑥 = 𝑛−Θ(𝑛) = 𝑜(1).
𝑥
P(𝛼(𝐺) ≥ 𝑥) ≤
𝑥
32
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Let 𝑛 be large enough so that P(𝑋 ≥ 𝑛/2) < 1/2 and P(𝛼(𝐺) ≥ 𝑥) < 1/2. Then there
is some 𝐺 with fewer than 𝑛/2 cycles of length ≤ ℓ and with 𝛼(𝐺) ≤ 3𝑛/log 𝑛.
Remove a vertex from each cycle to get 𝐺 ′. Then |𝑉 (𝐺 ′)| ≥ 𝑛/2, girth > ℓ, and
𝛼(𝐺 ′) ≤ 𝛼(𝐺) ≤ 3𝑛/log 𝑛, so
if 𝑛 is sufficiently large. □
Remark 3.4.2. Erdős (1962) also showed that in fact one needs to see at least a linear
number of vertices to deduce high chromatic number: for all 𝑘, there exists 𝜀 = 𝜀 𝑘
such that for all sufficiently large 𝑛 there exists an 𝑛-vertex graph with chromatic
number > 𝑘 but every subgraph on ⌊𝜀𝑛⌋ vertices is 3-colorable. (In fact, one can take
𝐺 ∼ 𝐺 (𝑛, 𝐶/𝑛); see "Probabilistic Lens: Local coloring" in Alon–Spencer)
Recall from Section 1.3 that there exists a non-2-colorable 𝑘-uniform hypergraph on
𝑘 2 vertices and 𝑂 (𝑘 2 2 𝑘 ) edges, via a random construction.
Here we present a simpler proof, based on a random greedy coloring, due to Cherkashin
and Kozik (2015), following an approach of Pluhaár (2009).
33
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
3 Alterations
The resulting coloring has no blue edges. The greedy coloring succeeds if it does not
create a red edge.
Analyzing a greedy coloring is tricky, since the color of a single vertex may depend on
the entire history. Instead, we identify a specific feature that necessarily results from
a unsuccessful coloring.
If there is a red edge, then there must be two edges 𝑒, 𝑓 so that the last vertex of 𝑒
is the first vertex of 𝑓 . Call such pair (𝑒, 𝑓 ) conflicting (note that whether (𝑒, 𝑓 ) is
conflicting depends on the random ordering of the vertices, but not on how we assigned
colors).
What is the probability of seeing a conflicting pair? Here is the randomness comes
from the random ordering of vertices.
Each pair of edges with exactly one vertex in common conflicts with probability
(𝑘−1)!2 1 2𝑘−2 −1
(2𝑘−1)! = 2𝑘−1 𝑘−1 ≍ 𝑘 −1/2 2−2𝑘 . Summing over all ≤ 𝑚 2 pairs of edges that
share a unique vertex, we find that the expected number of conflicting pairs is at most
≲ 𝑚 2 𝑘 −1/2 2−2𝑘 , which is < 1 for some 𝑚 ≍ 𝑘 1/4 2 𝑘 . In this case, there is some
ordering of vertices creating no conflicting pairs, in which case the greedy coloring
always succeeds.
The above argument, due to Pluhaár (2009), 1/4 𝑘
√︃ yields 𝑚 ≲ 𝑘 2 . Next we will refine
𝑘
the argument to obtain a better bound of log 𝑘 2
𝑘 as claimed.
Instead of just considering a random permutation, let us map each vertex to [0, 1]
independently and uniformly at random. This map induces an ordering of the vertices,
but it comes with further information that we will use.
Write [0, 1] = 𝐿 ∪ 𝑀 ∪ 𝑅 where (𝑝 to be decided)
1− 𝑝 1− 𝑝 1+ 𝑝 1+ 𝑝
𝐿 := 0, , 𝑀 := , , 𝑅 := ,1 .
2 2 2 2
34
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Thus
𝑚2 𝑚2
2 𝑘−2 𝑘
= 𝑘−1 + 𝑘−1 log
4 𝑘 4 𝑘 𝑚
√︁
which is < 1 for 𝑚 = 𝑐2 𝑘 𝑘/log 𝑘 with 𝑐 > 0 being a sufficiently small constant (we
should assume that is 𝑘 large enough to ensure 𝑝 ∈ [0, 1]; smaller values of 𝑘 can be
handled in the theorem exceptionally by later reducing the constant 𝑐). □
Food for thought: what is the source of the gain in the the 𝐿 ∪ 𝑀 ∪ 𝑅 argument? The
expected number of conflicting pairs is unchanged. It must be that we are somehow
clustering the bad events by considering the event when some edge lies in 𝐿 or 𝑅.
It remains an intriguing open problem to improve this bound further.
Exercises
1. Using the alteration method, prove the Ramsey number bound
𝑅(4, 𝑘) ≥ 𝑐(𝑘/log 𝑘) 2
35
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
3 Alterations
√
𝑐𝑛3/2 / 𝑚, where 𝑐 > 0 is a constant.
3. Prove that every 𝑘-uniform hypergraph with 𝑛 vertices and 𝑚 edges has a transver-
sal (i.e., a set of vertices intersecting every edge) of size at most 𝑛(log 𝑘)/𝑘 +𝑚/𝑘.
4. Zarankiewicz problem. Prove that for every positive integers 𝑛 ≥ 𝑘 ≥ 2, there
exists an 𝑛 × 𝑛 matrix with {0, 1} entries, with at least 12 𝑛2−2/(𝑘+1) 1’s, such that
there is no 𝑘 × 𝑘 submatrix consisting of all 1’s.
5. Fix 𝑘. Prove that there exists a constant 𝑐 𝑘 > 1 so that for every sufficiently
large 𝑛 > 𝑛0 (𝑘), there exists a collection F of at least 𝑐 𝑛𝑘 subsets of [𝑛] such that
Ñ𝑘
for every 𝑘 distinct 𝐹1 , . . . , 𝐹𝑘 ∈ F , all 2 𝑘 intersections 𝑖=1 𝐺 𝑖 are nonempty,
where each 𝐺 𝑖 is either 𝐹𝑖 or [𝑛] \ 𝐹𝑖 .
6. Acute sets in R𝑛 . Prove√that, for some constant 𝑐 > 0, for every 𝑛, there exists a
family of at least 𝑐(2/ 3) 𝑛 subsets of [𝑛] containing no three distinct members
𝐴, 𝐵, 𝐶 satisfying 𝐴 ∩ 𝐵 ⊆ 𝐶 ⊆ 𝐴 ∪ 𝐵. √
Deduce that there exists a set of at least 𝑐(2/ 3) 𝑛 points in R𝑛 so that all angles
determined by three points from the set are acute.
Remark. The current best lower and upper bounds for the maximum size of
an “acute set” in R𝑛 (i.e., spanning only acute angles) are 2𝑛−1 + 1 and 2𝑛 − 1
respectively.
7. ★ Covering complements of sparse graphs by cliques
a) Prove that every graph with 𝑛 vertices and minimum degree 𝑛 − 𝑑 can be
written as a union of 𝑂 (𝑑 2 log 𝑛) cliques.
b) Prove that every bipartite graph with 𝑛 vertices on each side of the ver-
tex bipartition and minimum degree 𝑛 − 𝑑 can be written as a union of
𝑂 (𝑑 log 𝑛) complete bipartite graphs (assume 𝑑 ≥ 1).
8. ★ Let 𝐺 = (𝑉, 𝐸) be a graph with 𝑛 vertices and minimum degree 𝛿 ≥ 2. Prove
that there exists 𝐴 ⊆ 𝑉 with | 𝐴| = 𝑂 (𝑛(log 𝛿)/𝛿) so that every vertex in 𝑉 \ 𝐴
contains at least one neighbor in 𝐴 and at least one neighbor not in 𝐴.
9. ★ Prove that every graph 𝐺 without isolated vertices has an induced subgraph
𝐻 on at least 𝛼(𝐺)/2 vertices such that all vertices of 𝐻 have odd degree. Here
𝛼(𝐺) is the size of the largest independent set in 𝐺.
36
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
4 Second Moment
Question 4.1.1
For which 𝑝 = 𝑝 𝑛 does 𝐺 (𝑛, 𝑝) contain a triangle with probability 1 − 𝑜(1)?
Proposition 4.1.2
If 𝑛𝑝 → 0, then 𝐺 (𝑛, 𝑝) is triangle-free with probability 1 − 𝑜(1).
Proof. Let 𝑋 be the number of triangles in 𝐺 (𝑛, 𝑝). We know from linearity of
expectations that
𝑛 3
E𝑋 = 𝑝 ≍ 𝑛3 𝑝 3 = 𝑜(1).
3
Thus, by Markov’s inequality,
P(𝑋 ≥ 1) ≤ E𝑋 = 𝑜(1).
In other words, when 𝑝 ≪ 1/𝑛, 𝐺 (𝑛, 𝑝) is triangle-free with high probaiblity (recall
that 𝑝 ≪ 1/𝑛 means 𝑝 = 𝑜(1/𝑛); see asymptotic notation guide at the beginning of
these notes).
What about when 𝑝 ≫ 1/𝑛? Can we conclude that 𝐺 (𝑛, 𝑝) contains a triangle with
high probability? In this case E𝑋 → ∞, but this is not enough to conclude that
37
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
4 Second Moment
P(𝑋 ≥ 1) = 1 − 𝑜(1), since we have not ruled out the probability that 𝑋 is almost
always zero but extremely large with some tiny probability.
An important technique in probabilistic combinatorics is to show that some random
variable is concentrated around its mean. This would then imply that outliers are
unlikely.
We will see many methods in this course on proving concentrations of random vari-
ables. In this chapter, we begin with the simplest method. It is usually easiest to
execute and it requires not much hypotheses. The downside is that it only produces
relatively weak (though still useful enough) concentration bounds.
Second moment method: show that a random variable is concentrated near its mean
by bounding its variance.
(Exercise: check the second equality in the definitions of variance and covariance
above).
Remark 4.1.4 (Notation convention). It is standard to use the Greek letter 𝜇 for the
mean, and 𝜎 2 for the variance. Here 𝜎 ≥ 0 is the standard deviation.
The following basic result provides a concentration bound based on the variance.
E[(𝑋 − 𝜇) 2 ] 1
𝐿𝐻𝑆 = P(|𝑋 − 𝜇| 2 ≥ 𝜆2 𝜎 2 ) ≤ 2 2
= 2. □
𝜆 𝜎 𝜆
Remark 4.1.6. Concentration bounds that show small probability of deviating from
the mean are called tail bounds (more precisely: upper tail for 𝑋 ≥ 𝜇 + 𝑎 and lower tail
38
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
for P(𝑋 ≤ 𝜇 − 𝑎)). Chebyshev’s inequality gives tail bounds that decays quadratically.
Later on we will see tools that give much better decay (usually exponential) provided
additional assumptions on the random variable (e.g., independence).
We are often interested in upper bounding the probability of non-existence, i.e., P(𝑋 =
0). Chebyshev’s inequality yields the following bound.
Var 𝑋
P(𝑋 = 0) ≤ .
(E𝑋) 2
Var 𝑋
P(𝑋 = 0) ≤ P(|𝑋 − 𝜇| ≥ |𝜇|) ≤ . □
𝜇2
Corollary 4.1.8
If E𝑋 > 0 and Var 𝑋 = 𝑜(E𝑋) 2 , then 𝑋 > 0 and 𝑋 ∼ E𝑋 with probability 1 − 𝑜(1).
In many situations, it is not too hard to compute the second moment. We have
Var[𝑋] = Cov[𝑋, 𝑋]. Also, covariance is bilinear, i.e., for random variables 𝑋1 , . . .
and 𝑌1 , . . . (no assumptions needed on their independence, etc.) and constants 𝑎 1 , . . .
and 𝑏 1 , . . . , one has
" #
∑︁ ∑︁ ∑︁
Cov 𝑎𝑖 𝑋𝑖 , 𝑏 𝑗𝑌𝑗 = 𝑎𝑖 𝑏 𝑗 Cov[𝑋𝑖 , 𝑌 𝑗 ].
𝑖 𝑗 𝑖, 𝑗
We are often dealing with 𝑋 being the cardinality of some random set. We can usually
write this as a sum of indicator functions, such as 𝑋 = 𝑋1 + · · · + 𝑋𝑛 , so that
𝑛
∑︁ 𝑛
∑︁ ∑︁
Var 𝑋 = Cov[𝑋, 𝑋] = Cov[𝑋𝑖 , 𝑋 𝑗 ] = Var 𝑋𝑖 + 2 Cov[𝑋𝑖 , 𝑋 𝑗 ].
𝑖, 𝑗=1 𝑖=1 𝑖< 𝑗
39
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
4 Second Moment
We have Cov[𝑋, 𝑌 ] = 0 if 𝑋 and 𝑌 are independent. Thus in the sum we only need to
consider dependent pairs (𝑖, 𝑗).
Let us now return to the problem of determining when 𝐺 (𝑛, 𝑝) contains a triangle
whp.
Theorem 4.1.11
If 𝑛𝑝 → ∞, then 𝐺 (𝑛, 𝑝) contains a triangle with probability 1 − 𝑜(1).
Proof. Label the vertices by [𝑛]. Let 𝑋𝑖 𝑗 be the indicator random variable of the edge
𝑖 𝑗, so that 𝑋𝑖 𝑗 = 1 if the edge is present, and 𝑋𝑖 𝑗 = 0 if the edge is not present in the
random graph. Let us write
𝑋𝑖 𝑗 𝑘 := 𝑋𝑖 𝑗 𝑋𝑖𝑘 𝑋 𝑗 𝑘 .
Then the number of triangles in 𝐺 (𝑛, 𝑝) is given by
∑︁
𝑋= 𝑋𝑖 𝑗 𝑋𝑖𝑘 𝑋 𝑗 𝑘 .
𝑖< 𝑗 <𝑘
Now we compute Var 𝑋. Note that the summands of 𝑋 are not all independent.
If 𝑇1 and 𝑇2 are each 3-vertex subsets, then
Cov[𝑋𝑇1 , 𝑋𝑇2 ] = E[𝑋𝑇1 𝑋𝑇2 ] − E[𝑋𝑇1 ]E[𝑋𝑇2 ] = 𝑝 𝑒(𝑇1 ∪𝑇2 ) − 𝑝 𝑒(𝑇1 )+𝑒(𝑇2 )
0 if |𝑇1 ∩ 𝑇2 | ≤ 1,
= 𝑝 5 − 𝑝 6 if |𝑇1 ∩ 𝑇2 | = 2,
𝑝 3 − 𝑝 6 if 𝑇1 = 𝑇2 .
40
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
The number of pairs (𝑇1 , 𝑇2 ) of triangles sharing exactly on edge is 𝑂 (𝑛4 ). Thus
∑︁
Var 𝑋 = Cov[𝑋𝑇1 , 𝑋𝑇2 ] = 𝑂 (𝑛3 )( 𝑝 3 − 𝑝 6 ) + 𝑂 (𝑛4 )( 𝑝 5 − 𝑝 6 )
𝑇1 ,𝑇2
≲ 𝑛3 𝑝 3 + 𝑛4 𝑝 5 = 𝑜(𝑛6 𝑝 6 ) as 𝑛𝑝 → ∞.
We say that 1/𝑛 is a threshold for containing a triangle, in the sense that if 𝑝 grows
asymptotically faster than this threshold, i.e., 𝑝 ≫ 1/𝑛, then the event occurs with
probability 1 − 𝑜(1), while if 𝑝 ≪ 1/𝑛, then the event occurs with probability 𝑜(1).
Note that the definition of a threshold ignores leading constant factors (so that it is also
correct to say that 2/𝑛 is also a threshold for containing a triangle). Determining the
thresholds of various properties in random graphs (as well as other random settings)
is a central topic in probabilistic combinatorics. We will discuss thresholds in more
depth later in this chapter.
What else might you want to know about the probability that 𝐺 (𝑛, 𝑝) contains a
triangle?
Remark 4.1.12 (Poisson limit). What if 𝑛𝑝 → 𝑐 > 0 for some constant 𝑐 > 0? It
turns out in this case that the number of triangles of 𝐺 (𝑛, 𝑝) approaches a Poisson
distribution with constant mean. You will show this in the homework. It will be
done via the method of moments: if 𝑍 is some random variable with sufficiently
nice properties (known as “determined by moments”, which holds for many common
distributions such as the Poisson distribution and the normal distribution), and 𝑋𝑛 is
a sequence of random variables such that E𝑋𝑛𝑘 → E𝑍 𝑘 for all nonnegative integers 𝑘,
then 𝑋𝑛 converges in distribution to 𝑍.
41
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
4 Second Moment
In the homework, you will prove the asymptotic normality of 𝑋 using a later-found
method of projections. The idea is to show that that 𝑋 close to another random variable
that is already known to be asymptotically normal by checking that their difference
has negligible variance. For triangle counts, when 𝑝 ≫ 𝑛−1/2 , we can compare the
number of triangles to the number of edges after a normalization. The method can be
further modified for greater generality. See the §6.4 in the book Random Graphs by
Janson, Łuczak, and Rucinski (2000).
Remark 4.1.14 (Better tail bounds). Later on we will use more powerful tools (in-
cluding martingale methods and Azuma-Hoeffding inequalities, and also Janson in-
equalities) to prove better tail bounds on triangle (and other subgraph) counts.
Question 4.2.1
What is the threshold for containing a fixed 𝐻 as a subgraph?
42
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Remark 4.2.3. (a) For many applications with an underlying symmetry between
the events, the sum in the definition of Δ∗ does not actually depend on 𝑖.
(b) In the definition of the dependency graph (𝑖 ∼ 𝑗) above, we are only considering
pairwise dependence. Later on when we study the Lovász Local Lemma, we
will need a strong notion of a dependency graph.
(c) This method is appropriate for a collection of events with few dependencies. It
is not appropriate for where there are many weak dependencies (e.g., Section 4.5
on the Hardy–Ramanujan theorem on the number of distinct prime divisors).
(Here 𝐴𝑖 𝐴 𝑗 is the shorthand for 𝐴𝑖 ∧ 𝐴 𝑗 , meaning that both events occur.) Also
Cov[𝑋𝑖 , 𝑋 𝑗 ] = 0 if 𝑖 ≠ 𝑗 and 𝑖 ≁ 𝑗 .
Thus
𝑚
∑︁ 𝑚
∑︁ 𝑚
∑︁ ∑︁
Var 𝑋 = Cov[𝑋𝑖 , 𝑋 𝑗 ] ≤ P( 𝐴𝑖 ) + P( 𝐴𝑖 ) P( 𝐴 𝑗 | 𝐴𝑖 )
𝑖, 𝑗=1 𝑖=1 𝑖=1 𝑗: 𝑗∼𝑖
≤ E𝑋 + (E𝑋)Δ∗ .
Recall from Corollary 4.1.8 that E𝑋 > 0 and Var 𝑋 = 𝑜(E𝑋) 2 imply 𝑋 > 0 and
𝑋 ∼ E𝑋 whp. So we have the following.
Lemma 4.2.4
In the above setup, if E𝑋 → ∞ and Δ∗ = 𝑜(E𝑋), then 𝑋 > 0 and 𝑋 ∼ E𝑋 whp.
Theorem 4.2.5
The threshold for containing 𝐾4 is 𝑛−2/3 .
43
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
4 Second Moment
Now suppose 𝑝 ≫ 𝑛−2/3 , so E𝑋 → ∞. For each 4-vertex subset 𝑆, let 𝐴𝑆 be the event
that 𝑆 is a clique in 𝐺 (𝑛, 𝑝).
For each fixed 𝑆, one has 𝐴𝑆 ∼ 𝐴𝑆′ if and only if |𝑆 ∩ 𝑆′ | ≥ 2.
• The number of 𝑆′ that share exactly 2 vertices with 𝑆 is 6 𝑛−2 2
2 = 𝑂 (𝑛 ), and for
each such 𝑆′ one has P( 𝐴𝑆′ | 𝐴𝑆 ) = 𝑝 5 (as there are 5 additional edges not in the
𝑆-clique that need to appear clique to form the 𝑆′-clique).
• The number of 𝑆′ that share exactly 3 vertices with 𝑆 is 4(𝑛 − 4) = 𝑂 (𝑛), and
for each such 𝑆′ one has P( 𝐴𝑆′ | 𝐴𝑆 ) = 𝑝 3 .
Summing over all above 𝑆′, we find
∑︁
Δ∗ = P( 𝐴𝑆′ | 𝐴𝑆 ) ≲ 𝑛2 𝑝 5 + 𝑛𝑝 3 ≪ 𝑛4 𝑝 6 ≍ E𝑋.
𝑆 ′ :|𝑆 ′ ∩𝑆|∈{2,3}
For both 𝐾3 and 𝐾4 , we saw that any choice of 𝑝 = 𝑝 𝑛 with E𝑋 → ∞ one has 𝑋 > 0
whp. Is this generally true?
Thus the threshold for 𝐻 = is actually 𝑛−2/3 , and not 𝑛−5/7 as one might have
naively predicted from the first moment alone.
Why didn’t E𝑋𝐻 → ∞ give 𝑋𝐻 > 0 whp in our proof strategy? In the calculation
of Δ∗ , one of the terms is ≍ 𝑛𝑝 (from two copies of 𝐻 with a 𝐾4 -overlap), and
𝑛𝑝 3 𝑛5 𝑝 7 ≍ E𝑋𝐻 if 𝑝 ≪ 𝑛−2/3 .
The above example shows that the threshold is not always necessarily determined by
the expectation. For the property of containing 𝐻, the example suggests that we should
look at the “densest” subgraph of 𝐻 rather than containing 𝐻 itself.
44
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Definition 4.2.7
Define the edge-vertex ratio of a graph 𝐻 by
𝑒𝐻
𝝆(𝑯) := .
𝑣𝐻
𝒎(𝑯) := max
′
𝜌(𝐻 ′).
𝐻 ⊆𝐻
Example 4.2.8. Let 𝐻 = . We have 𝜌(𝐻) = 7/5 whereas 𝜌(𝐾4 ) = 3/2 > 7/5.
It is not hard to check that 𝑚(𝐻) = 𝜌(𝐾4 ) = 3/2 as 𝐾4 is the subgraph of 𝐻 with the
maximum edge-vertex ratio.
Remark 4.2.9 (Algorithm). Goldberg (1984) found a polynomial time algorithm for
computing 𝑚(𝐻) via network flow algorithms.
The next theorem completes determines the threshold for containing some fixed graph
𝐻. Basically, it is determined by the expected number of copies of 𝐻 ′, where 𝐻 ′ is the
“denest” subgraph of 𝐻 (i.e., with the maximum edge-vertex ratio).
45
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
4 Second Moment
since
′ ′ ′
𝑝 ≫ 𝑛−1/𝑚(𝐻) ≥ 𝑛−1/𝜌(𝐽∩𝐽 ) = 𝑛−|𝑉 (𝐽)∩𝑉 (𝐽 )|/|𝐸 (𝐽)∩𝐸 (𝐽 )| .
It then follows, after considering all the possible ways that 𝐽 ′ can overlap with 𝐽, that
Δ∗ ≪ 𝑛 |𝑉 (𝐽)| 𝑝 |𝐸 (𝐽)| ≍ E𝑋𝐻 . So Lemma 4.2.4 yields the result. □
Remark 4.2.11. The proof also gives that if 𝑝 ≫ 𝑛−1/𝑚(𝐻) , then the number 𝑋𝐻 of
copies of 𝐻 is concentrated near its mean, i.e., with probability 1 − 𝑜(1),
𝑛 𝑣 𝐻 ! 𝑒 𝐻 𝑛𝑣 𝐻 𝑝 𝑒 𝐻
𝑋𝐻 ∼ E𝑋𝐻 = 𝑝 ∼ .
𝑣 𝐻 aut(𝐻) aut(𝐻)
4.3 Thresholds
Previously, we computed the threshold for containing a fixed 𝐻 as a subgraph. In
this section, we take a detour from the discussion of the second moment method and
discuss thresholds in more detail.
We begin by discussing the concept more abstractly by first defining the threshold of
any monotone property on subsets. Then we show that thresholds always exist.
Thresholds form a central topic in probabilistic combinatorics. For any given property,
it is natural to ask the following questions:
1. Where is the threshold?
2. Is the transition sharp? (And more precisely, what is width of the transition
window?)
We understand thresholds well for many basic graph properties, but for many others,
it can be a difficult problem. Also, one might think that one must first understand
the location of the threshold before determining the nature of the phase transition, but
surprisingly this is actually not always the case. There are powerful results that can
sometimes show a sharp threshold without identifying the location of the threshold.
46
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
4.3 Thresholds
Here are some examples of increasing properties for subgraphs of a given set of
vertices:
• Contains some given subgraph
• Connected
• Has perfect matching
• Hamiltonian
• non-3-colorable
A family F ⊆ P (Ω) of subsets of Ω is called an up-set if whenever 𝐴 ∈ F and 𝐴 ⊆ 𝐵,
then 𝐵 ∈ F . Increasing property is the same as being an element of an up-set. We
will use these two terms interchangeably.
Remark 4.3.2. The above definition is only for increasing properties. We can similarly
define the threshold for decreasing properties by an obvious modification. An example
of a non-monotone property is sontaining some 𝐻 as an induced subgraph. Some (but
not all) non-monotone properties also have thresholds, though we will not discuss it
here.
Remark 4.3.3. From the definition, we see that if 𝑟 𝑛 and 𝑟 𝑛′ are both thresholds of
the same property, then they must be within a constant factor of each other (exercise:
check this). Thus it makes sense to say “the threshold” rather than “a threshold.”
Existence of threshold
47
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
4 Second Moment
How would a monotone property not have a threshold? Perhaps one could have
P(Ω1/𝑛 ∈ F ) and P(Ω (log 𝑛)/𝑛 ∈ F ) ∈ [1/10, 9/10] for all sufficiently large 𝑛?
Before answer this question, let us consider an even more elementary claim.
Let us give two related proofs of this basic fact. Both are quite instructive. Both are
based on coupling of random processes.
Proof 1. Let 0 ≤ 𝑝 < 𝑞 ≤ 1. Consider the following process to generate two random
subsets of Ω. For each 𝑥, generate uniform 𝑡𝑥 ∈ [0, 1] independently at random. Let
𝐴 = {𝑥 ∈ Ω : 𝑡𝑥 ≤ 𝑝} and 𝐵 = {𝑥 ∈ Ω : 𝑡𝑥 ≤ 𝑞} .
Then 𝐴 has the same distribution as Ω 𝑝 and 𝐵 has the same distribution as Ω𝑞 .
Furthermore, since 𝑝 < 𝑞, we always have 𝐴 ⊆ 𝐵. Since F is monotone, 𝐴 ∈ F
implies 𝐵 ∈ F . Thus
To see that the inequality strict, we simply have to observe that with positive probability,
one has 𝐴 ∉ F and 𝐵 ∈ F (e.g., if all 𝑡 𝑥 ∈ ( 𝑝, 𝑞], then 𝐴 = ∅ and 𝐵 = Ω). □
In the second proof, the idea is to reveal a random subset of Ω in independent random
stages.
Proof 2. (By two-round exposure) Let 0 ≤ 𝑝 < 𝑞 ≤ 1. Note that 𝐵 = Ω𝑞 has the same
distribution as the union of two independent 𝐴 = Ω 𝑝 and 𝐴′ = Ω 𝑝 ′ , where 𝑝′ is chosen
to satisfy 1 − 𝑞 = (1 − 𝑝) (1 − 𝑝′) (check that the probability that each element occurs
is the same in the two processes). Thus
P( 𝐴 ∈ F ) ≤ P( 𝐴 ∪ 𝐴′ ∈ F ) = P(𝐵 ∈ F ).
Like earlier, to observe that the inequality is strict, one observes that with positive
probability, one has 𝐴 ∉ F and 𝐴 ∪ 𝐴′ ∈ F . □
The above technique (generalized from two round exposure to multiple round ex-
posures) gives a nice proof of the following theorem (originally proved using the
Kruskal–Katona theorem).1
1 (Thresholds for random subspaces of F𝑞𝑛 ) The proof of the Bollobás–Thomason paper using the
48
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
4.3 Thresholds
Proof. Consider 𝑚 independent copies of Ω 𝑝/𝑚 , and let 𝑌 be their union. Since F is
monotone increasing, if 𝑌 ∉ F , then none of the 𝑚 copies lie in F . Hence
Kruskal–Katona theorem is still relevant. For example, there is an interesting analog of this concept
for properties of subspaces of F𝑞𝑛 , i.e., random linear codes instead of random graphs. The analogue
of the Bollobás–Thomason theorem was proved by Rossman (2020) via the the Kruskal–Katona
approach. The multiple round exposure proof does not seem to work in the random subspace setting,
as one cannot write a subspace as a union of independent copies of smaller subspaces.
As an aside, I disagree with the use of the term “sharp threshold” in Rossman’s paper for describing
all thresholds for subspaces—one really should be looking at the cardinality of the subspaces rather
than their dimensions. In a related work by Guruswami, Mosheiff, Resch, Silas, and Wootters
(2022), they determine thresholds for random linear codes for properties that seem to be analogous
to the property that a random graph contains a given fixed subgraph. Here again I disagree with
them calling it a “sharp threshold.” It is much more like a coarse threshold once you parameterize
by the cardinality of the subspace, which gives you a much better analogy to the random graph
setting.
Thresholds for random linear codes seems to an interesting topic that has only recently been
studied. I think there is more to be done here.
49
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
4 Second Moment
(here we write Ω𝑡 = Ω if 𝑡 > 1.) Indeed, applying Lemma 4.3.7 with 𝑝 = 𝑝 𝑐 , we have
Examples
We will primarily be studying monotone graph properties. In the previous notation,
Ω = [𝑛]
2 , and we are only considering properties that depend on the isomorphism
class of the graph.
Example 4.3.8 (Containing a triangle). We saw earlier in the chapter that the threshold
for containing a triangle is 1/𝑛:
0 if 𝑛𝑝 → 0,
3
P(𝐺 (𝑛, 𝑝) contains a triangle) → 1 − 𝑒 −𝑐 /6 if 𝑛𝑝 → 𝑐 ∈ (0, ∞)
1
if 𝑛𝑝 → ∞.
In this case, the threshold is determined by the expected number of triangles Θ(𝑛3 𝑝 3 ),
and whether this quantity goes to zero or infinity (in the latter case, we used a second
moment method to show that the number of triangles is positive with high probability).
What if 𝑝 = Θ(1/𝑛)? If 𝑛𝑝 → 𝑐 for some constant 𝑐 > 0, then (as you will show in the homework via the method of moments) the number of triangles is asymptotically Poisson distributed, and in particular the limit probability of containing a triangle increases from 0 to 1 as a continuous function of 𝑐 ∈ (0, ∞). So, in particular, as
𝑝 increases, it goes through a “window of transition” of width Θ(1/𝑛) in order for
P(𝐺 (𝑛, 𝑝) contains a triangle) to increase from 0.01 to 0.99. The width of this window
is on the same order as the threshold. In this case, we call it a coarse transition.
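As a sanity check on the middle case (our addition; the constant is just the limiting expected triangle count): when 𝑛𝑝 → 𝑐, the expected number of triangles is
\binom{𝑛}{3} 𝑝³ → 𝑐³/6,
so if the triangle count is asymptotically Poisson with this mean, then
P(𝐺(𝑛, 𝑝) contains a triangle) → 1 − P(Poisson(𝑐³/6) = 0) = 1 − 𝑒^{−𝑐³/6},
matching the displayed limit.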
Example 4.3.9 (Containing a subgraph). Theorem 4.2.10 tells us that the threshold for containing a fixed subgraph 𝐻 is 𝑛^{−1/𝑚(𝐻)}. Here the threshold is not always determined by the expected number of copies of 𝐻. Instead, we need to look at the “densest subgraph” 𝐻′ ⊆ 𝐻 with the largest edge–vertex ratio (equivalently, the largest average degree). The threshold is determined by whether the expected number of copies of 𝐻′ goes to zero or infinity.
Similar to the triangle case, we have a coarse threshold.
The analysis can also be generalized to containing one of several fixed subgraphs
𝐻1 , . . . , 𝐻 𝑘 .
The original Erdős–Rényi (1959) paper on random graphs already studied several
thresholds, such as the next two examples.
Example 4.3.11 (No isolated vertices). With 𝑝 = (log 𝑛 + 𝑐_𝑛)/𝑛,
P(𝐺(𝑛, 𝑝) has no isolated vertices) →
  0            if 𝑐_𝑛 → −∞,
  𝑒^{−𝑒^{−𝑐}}  if 𝑐_𝑛 → 𝑐,
  1            if 𝑐_𝑛 → ∞.
It is a good exercise (and included in the problem set) to check the above claims. The cases 𝑐_𝑛 → −∞ and 𝑐_𝑛 → ∞ can be shown using the second moment method. More precisely, when 𝑐_𝑛 → 𝑐, by comparing moments one can show that the number of isolated vertices is asymptotically Poisson (with mean 𝑒^{−𝑐}).
In this example, the threshold is at (log 𝑛)/𝑛. As we see above, the transition window is Θ(1/𝑛), much narrower than the magnitude of the threshold. In particular, the event probability goes from 𝑜(1) to 1 − 𝑜(1) when 𝑝 increases from (1 − 𝛿)(log 𝑛)/𝑛 to (1 + 𝛿)(log 𝑛)/𝑛 for any fixed 𝛿 > 0. In this case, we say that the property has a sharp threshold at (log 𝑛)/𝑛 (here the leading constant factor is relevant, unlike in the earlier example of a coarse threshold).
Example 4.3.12 (Connectivity). With 𝑝 = (log 𝑛 + 𝑐_𝑛)/𝑛,
P(𝐺(𝑛, 𝑝) is connected) →
  0            if 𝑐_𝑛 → −∞,
  𝑒^{−𝑒^{−𝑐}}  if 𝑐_𝑛 → 𝑐,
  1            if 𝑐_𝑛 → ∞.
In fact, a much stronger statement is true, connecting the above two examples: consider a process where one adds random edges one at a time; then with probability 1 − 𝑜(1), the graph becomes connected as soon as there are no more isolated vertices. Such a stronger characterization is called a hitting time result.
A similar statement is true if we replace “is connected” by “has a perfect matching”
(assuming 𝑛 even).
Sharp transition
In some of the examples, the probability that 𝐺 (𝑛, 𝑝) satisfies the property changes
quickly and dramatically as 𝑝 crosses the threshold (physical analogy: similar to how
the structure of water changes dramatically as the temperature drops below freezing).
For example, for connectivity, while 𝑝 = (log 𝑛)/𝑛 is a threshold, we see that 𝐺(𝑛, 0.99 log 𝑛/𝑛) is whp not connected and 𝐺(𝑛, 1.01 log 𝑛/𝑛) is whp connected, unlike the situation for containing a triangle earlier. We call this the sharp threshold phenomenon.
Figure 4.1: Examples of coarse and sharp thresholds. The vertical axis is the probability that 𝐺(𝑛, 𝑝) satisfies the property. [figure not reproduced]
We need “close to” since the property could be “contains a triangle and has at least
log 𝑛 edges”, which is not exactly local but it is basically the same as “contains a
triangle.”
There is some subtlety here since we can allow very different properties depending on the value of 𝑛. E.g., P could be the set of all 𝑛-vertex graphs that contain a 𝐾₃ if 𝑛 is odd and a 𝐾₄ if 𝑛 is even. Friedgut’s theorem tells us that if there is a coarse threshold, then there is a partition N = N₁ ∪ · · · ∪ N_𝑘 such that on each N_𝑖, P is approximately of the form described in the previous paragraph.
In the last section, we derived that the property of containing some fixed 𝐻 has threshold 𝑛^{−1/𝑚(𝐻)} for some rational number 𝑚(𝐻). It follows as a corollary of Friedgut’s theorem that every coarse threshold must have this form.
In particular, if (log 𝑛)/𝑛 is a threshold of some monotone graph property (e.g., this is the case for connectivity), then we automatically know that it must be a sharp threshold, even without knowing anything else about the property. Likewise if the threshold has the form 𝑛^{−𝛼} for some irrational 𝛼.
The exact statement of Friedgut’s theorem is more cumbersome. We refer those who
are interested to Friedgut’s original 1999 paper and his later survey for details and
applications. This topic is connected more generally to an area known as the analysis
of boolean functions.
Also, it is known that the transition window of every monotone graph property is (log 𝑛)^{−2+𝑜(1)} (Friedgut–Kalai (1996), Bourgain–Kalai (1997)).
Curiously, tools such as Friedgut’s theorem sometimes allow us to prove the existence of a sharp threshold without being able to identify its exact location. For example, it is an important open problem to determine where exactly the transition for a random graph to be 𝑘-colorable is located.
On the other hand, it is not known whether lim_{𝑛→∞} 𝑑_𝑘(𝑛) exists, which would imply Conjecture 4.3.16. Further bounds on 𝑑_𝑘(𝑛) are known; e.g., the landmark paper of Achlioptas and Naor (2006) shows that for each fixed 𝑑 > 0, whp 𝜒(𝐺(𝑛, 𝑑/𝑛)) ∈ {𝑘_𝑑, 𝑘_𝑑 + 1} where 𝑘_𝑑 = min{𝑘 ∈ N : 2𝑘 log 𝑘 > 𝑑}. Also see the later work of Coja-Oghlan and Vilenchik (2013).
Question 4.4.1
What is the clique number of 𝐺 (𝑛, 1/2)?
Let us first do a rough estimate to see what is the critical 𝑘 at which 𝑓(𝑛, 𝑘) transitions from large to small; here 𝑓(𝑛, 𝑘) = \binom{𝑛}{𝑘} 2^{−\binom{𝑘}{2}} is the expected number of 𝑘-cliques in 𝐺(𝑛, 1/2). Recall that (𝑛/𝑘)^𝑘 ≤ \binom{𝑛}{𝑘} ≤ (𝑒𝑛/𝑘)^𝑘. We have
log₂ 𝑓(𝑛, 𝑘) = 𝑘 (log₂ 𝑛 − log₂ 𝑘 − 𝑘/2 + 𝑂(1)).
And so the transition point is at some 𝑘 ∼ 2 log2 𝑛 in the sense that if 𝑘 ≥ (2+ 𝛿) log2 𝑛,
then 𝑓 (𝑛, 𝑘) → 0 while if 𝑘 ≤ (2 − 𝛿) log2 𝑛, then 𝑓 (𝑛, 𝑘) → ∞.
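A minimal numerical check of this transition (our own illustration; it assumes the formula 𝑓(𝑛, 𝑘) = \binom{𝑛}{𝑘}2^{−\binom{𝑘}{2}} above, and at finite 𝑛 the lower-order −2 log₂ log₂ 𝑛 correction is visible):

import math

def log2_f(n, k):
    # log2 of f(n,k) = C(n,k) * 2^(-C(k,2)), the expected number
    # of k-vertex cliques in G(n, 1/2)
    return math.log2(math.comb(n, k)) - math.comb(k, 2)

n = 10**6
print(2 * math.log2(n))  # about 39.9
for k in range(30, 40):
    # the sign of log2_f flips between k = 33 and k = 34 here,
    # somewhat below 2*log2(n) because of lower-order terms
    print(k, round(log2_f(n, k)))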
The next result gives us a lower bound on the typical clique number.
Proof sketch. The first claim follows from Markov’s inequality as P(𝑋 ≥ 1) ≤ E𝑋.
For the second claim, we bound the variance using Setup 4.2.2. For each 𝑘-element
subset 𝑆 of vertices, let 𝐴𝑆 be the event that 𝑆 is a clique. Let 𝑋𝑆 be the indicator
random variable for 𝐴𝑆 . Recall
Δ∗ := max_𝑖 Σ_{𝑗 : 𝑗 ∼ 𝑖} P(𝐴_𝑗 | 𝐴_𝑖).
It then follows from Lemma 4.2.4 that 𝑋 > 0 (i.e., 𝜔(𝐺) ≥ 𝑘) whp. □
We can now prove a two-point concentration result for the clique number of 𝐺(𝑛, 1/2). This result is due to Bollobás–Erdős (1976) and Matula (1976).
For 𝑘 ∼ 2 log₂ 𝑛,
𝑓(𝑛, 𝑘 + 1)/𝑓(𝑛, 𝑘) = ((𝑛 − 𝑘)/(𝑘 + 1)) 2^{−𝑘} = 𝑛^{−1+𝑜(1)}.
Then 𝑓 (𝑛, 𝑘 0 − 1) → ∞ and 𝑓 (𝑛, 𝑘 0 + 1) = 𝑜(1). By Theorem 4.4.2, the clique number
of 𝐺 (𝑛, 1/2) is whp in {𝑘 0 − 1, 𝑘 0 }. □
Remark 4.4.4. By a more careful analysis, one can show that outside a very sparse
subset of integers, one has 𝑓 (𝑛, 𝑘 0 ) → ∞, in which case one has one-point concentra-
tion 𝜔(𝐺 (𝑛, 1/2)) = 𝑘 0 whp.
By taking the complement of the graph, we also get a two-point concentration result
about the independence number of 𝐺 (𝑛, 1/2). Bohman and Hofstad (2024) extended
the two-point concentration result for the independence number of 𝐺 (𝑛, 𝑝) to all
𝑝 ≥ 𝑛^{−2/3+𝜀}.
Remark 4.4.5 (Chromatic number). Since the chromatic number satisfies 𝜒(𝐺) ≥ 𝑛/𝛼(𝐺), we have
𝜒(𝐺(𝑛, 1/2)) ≥ (1 + 𝑜(1)) 𝑛/(2 log₂ 𝑛) whp.
In Theorem 8.3.2, using more advanced methods, we will prove 𝜒(𝐺 (𝑛, 1/2)) ∼
𝑛/(2 log2 𝑛) whp, a theorem due to Bollobás (1987).
In Section 9.3, using martingale concentration, we will show that 𝜒(𝐺 (𝑛, 𝑝)) is tightly
concentrated around its mean without a priori needing to know where the mean is
located.
The original proof of Hardy and Ramanujan was quite involved. Here we show a
“probabilistic” proof due to Turán (1934), which played a key role in the development
of probabilistic methods in number theory.
We have
𝜈(𝑥) − 10 ≤ 𝑋(𝑥) ≤ 𝜈(𝑥)
since 𝑥 cannot have more than 10 prime factors > 𝑛^{1/10}. So it suffices to analyze 𝑋.
Since exactly ⌊𝑛/𝑝⌋ positive integers ≤ 𝑛 are divisible by 𝑝, we have
E 𝑋_𝑝 = ⌊𝑛/𝑝⌋/𝑛 = 1/𝑝 + 𝑂(1/𝑛).
We quote Mertens’ theorem from analytic number theory:
Σ_{𝑝 ≤ 𝑛} 1/𝑝 = log log 𝑛 + 𝑂(1).
(A more precise result says that the 𝑂(1) error term converges to the Meissel–Mertens constant.) So
E 𝑋 = Σ_{𝑝 ≤ 𝑀} (1/𝑝 + 𝑂(1/𝑛)) = log log 𝑛 + 𝑂(1).
Next we compute the variance. The intuition is that divisibility by distinct primes should behave somewhat independently. Indeed, if 𝑝𝑞 divides 𝑛, then 𝑋_𝑝 and 𝑋_𝑞 are independent (e.g., by the Chinese remainder theorem). If 𝑝𝑞 does not divide 𝑛, but 𝑛 is large enough, then there is some small covariance contribution. (In contrast to the earlier variance calculations in random graphs, here we have many weak dependencies.)
If 𝑝 ≠ 𝑞, then 𝑋_𝑝𝑋_𝑞 = 1 if and only if 𝑝𝑞 | 𝑥. Thus
Cov[𝑋_𝑝, 𝑋_𝑞] = ⌊𝑛/(𝑝𝑞)⌋/𝑛 − (⌊𝑛/𝑝⌋/𝑛)(⌊𝑛/𝑞⌋/𝑛) ≤ 1/(𝑝𝑞) − (1/𝑝 − 1/𝑛)(1/𝑞 − 1/𝑛) ≤ (1/𝑛)(1/𝑝 + 1/𝑞).
Thus
Σ_{𝑝 ≠ 𝑞} Cov[𝑋_𝑝, 𝑋_𝑞] ≲ 𝑀²/𝑛 ≲ 𝑛^{−4/5}.
Finally, recall that |𝑋 − 𝜈| ≤ 10. So the same asymptotic bound holds with 𝑋 replaced
by 𝜈. □
The main idea from the above proof is that the number of prime divisors 𝑋 = Σ_𝑝 𝑋_𝑝 (from the previous proof) behaves like a sum of independent random variables.
A sum of independent random variables often satisfies a central limit theorem (i.e., asymptotic normality, convergence to Gaussian), assuming some mild regularity hypotheses. In particular, we have the following corollary of the Lindeberg–Feller central limit theorem (see Durrett, Theorem 3.4.10):
The original proof of Erdős and Kac verifies the above intuition using some more
involved results in analytic number theory. Simpler proofs have been subsequently
given, and we outline one such proof below, which is based on computing the moments
of the distribution. The idea of computing moments for this problem was first used by Delange (1953), who was apparently not aware of the Erdős–Kac paper. Also see a more modern account by Granville and Soundararajan (2007).
The following tool from probability theory allows us to verify asymptotic normality
from convergence of moments.
Remark 4.5.5. The same conclusion holds for any probability distribution that is
“determined by its moments,” i.e., there are no other distributions sharing the same
moments. Many common distributions that arise in practice, e.g., the Poisson distri-
bution, satisfy this property. There are various sufficient conditions for guaranteeing
this moments property, e.g., Carleman’s condition tells us that any probability distri-
bution whose moments do not increase too quickly is determined by its moments. (See
Durrett §3.3.5).
𝜇 := E𝑌 ∼ E𝑋 ∼ log log 𝑛
and
𝜎² := Var 𝑌 ∼ Var 𝑋 ∼ log log 𝑛.
Comparing the expansions of 𝑋̃^𝑘 in terms of the 𝑋_𝑝’s (𝑛^{𝑜(1)} terms), we get that the corresponding moments agree asymptotically. It follows that 𝑋̃ is asymptotically normal. □
Question 4.6.1
For each 𝑘, what is the smallest 𝑛 so that there exists 𝑆 ⊆ [𝑛] with |𝑆| = 𝑘 and all 2^𝑘 subset sums of 𝑆 are distinct?
Let us use the second moment to give a modest improvement on the earlier pigeonhole
argument. The main idea here is that, by second moment, most of the subset sums lie
within an 𝑂 (𝜎)-interval, so that we can improve on the pigeonhole estimate ignoring
outlier subset sums.
Theorem 4.6.3
If there is a 𝑘-element subset of [𝑛] with distinct subset sums, then 𝑛 ≳ 2^𝑘/√𝑘.
P(|𝑋 − 𝜇| ≥ 2𝜎) ≤ 1/4,
and thus
P(|𝑋 − 𝜇| < 𝑛√𝑘) ≥ P(|𝑋 − 𝜇| < 2𝜎) ≥ 3/4.
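(For the record, since this step is implicit above: writing 𝑋 = 𝜀₁𝑎₁ + · · · + 𝜀_𝑘𝑎_𝑘 with 𝜀_𝑖 independent uniform in {0, 1} and 𝑆 = {𝑎₁, …, 𝑎_𝑘} ⊆ [𝑛], one has 𝜎² = Var 𝑋 = (𝑎₁² + · · · + 𝑎_𝑘²)/4 ≤ 𝑛²𝑘/4, so that 2𝜎 ≤ 𝑛√𝑘, which is what lets us pass between the two events displayed above.)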
Since 𝑋 takes distinct values for every (𝜀₁, …, 𝜀_𝑘) ∈ {0, 1}^𝑘, we have P(𝑋 = 𝑥) ≤ 2^{−𝑘} for all 𝑥. Since there are ≤ 2𝑛√𝑘 integers in the interval (𝜇 − 𝑛√𝑘, 𝜇 + 𝑛√𝑘), we have
P(|𝑋 − 𝜇| < 𝑛√𝑘) ≤ 2𝑛√𝑘 · 2^{−𝑘}.
√
Putting the upper and lowers bounds on P(|𝑋 − 𝜇| < 𝑛 𝑘) together, we get
√ 3
2𝑛 𝑘2−𝑘 ≤ .
4
√
So 𝑛 ≳ 2 𝑘 / 𝑘. □
Dubroff, Fox, and Xu (2021) gave another short proof of this result by applying Harper’s vertex-isoperimetric inequality on the cube (this is an example of “concentration of measure”, which we will explore more later in this course).
Consider the graph representing the 𝑛-dimensional boolean cube, with vertex set {0, 1}^𝑛 and an edge between every pair of 𝑛-tuples that differ in exactly one coordinate. Given 𝐴 ⊆ {0, 1}^𝑛, write 𝜕𝐴 for the set of all vertices outside 𝐴 that are adjacent to some vertex of 𝐴.
Remark 4.6.5. A stronger form of Harper’s theorem gives the precise value of
min_{𝐴 ⊆ {0,1}^𝑛 : |𝐴| = 𝑚} |𝜕𝐴|
for every (𝑛, 𝑚). Basically, the minimum is achieved when 𝐴 is a Hamming ball, or, if 𝑚 is not exactly the size of some Hamming ball, when 𝐴 consists of the lexicographically first 𝑚 elements of {0, 1}^𝑛.
Remark 4.6.7. The above bound has the currently best known leading constant factor,
matching an earlier result by Aliev (2009).
is chosen as some polynomial that peaks at 𝑥 = 𝑖/𝑛 and then decays as 𝑥 moves away from 𝑖/𝑛.
For each 𝑥 ∈ [0, 1], the binomial distribution Binomial(𝑛, 𝑥) has mean 𝑛𝑥 and variance 𝑛𝑥(1 − 𝑥) ≤ 𝑛. By Chebyshev’s inequality,
Σ_{𝑖 : |𝑖−𝑛𝑥| > 𝑛^{2/3}} 𝐸_𝑖(𝑥) = P(|Binomial(𝑛, 𝑥) − 𝑛𝑥| > 𝑛^{2/3}) ≤ 𝑛^{−1/3}.
(In the next chapter, we will see a much better tail bound.)
Since [0, 1] is compact, 𝑓 is uniformly continuous and bounded. By multiplying 𝑓 by a constant, we may assume that |𝑓(𝑥)| ≤ 1 for all 𝑥 ∈ [0, 1]. Also, there exists 𝛿 > 0 such that |𝑓(𝑥) − 𝑓(𝑦)| ≤ 𝜀/2 for all 𝑥, 𝑦 ∈ [0, 1] with |𝑥 − 𝑦| ≤ 𝛿.
Take 𝑛 > max{64𝜀^{−3}, 𝛿^{−3}}. Then for every 𝑥 ∈ [0, 1] (note that Σ_{𝑗=0}^{𝑛} 𝐸_𝑗(𝑥) = 1),
|𝑃_𝑛(𝑥) − 𝑓(𝑥)| ≤ Σ_{𝑖=0}^{𝑛} 𝐸_𝑖(𝑥)|𝑓(𝑖/𝑛) − 𝑓(𝑥)|
  ≤ Σ_{𝑖 : |𝑖/𝑛−𝑥| < 𝑛^{−1/3} < 𝛿} 𝐸_𝑖(𝑥)|𝑓(𝑖/𝑛) − 𝑓(𝑥)| + Σ_{𝑖 : |𝑖−𝑛𝑥| > 𝑛^{2/3}} 2𝐸_𝑖(𝑥)
  ≤ 𝜀/2 + 2𝑛^{−1/3} ≤ 𝜀. □
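As a small numerical illustration (our addition, not part of the proof), one can compute the Bernstein polynomial 𝑃_𝑛(𝑥) = Σ_𝑖 𝑓(𝑖/𝑛)𝐸_𝑖(𝑥) directly and watch the uniform error shrink; here 𝑓(𝑥) = |𝑥 − 1/2| is an arbitrarily chosen continuous but non-smooth test function.

import math

def bernstein(f, n, x):
    # P_n(x) = sum_i f(i/n) * C(n,i) * x^i * (1-x)^(n-i)
    return sum(f(i / n) * math.comb(n, i) * x**i * (1 - x) ** (n - i)
               for i in range(n + 1))

f = lambda x: abs(x - 0.5)   # continuous on [0,1], not differentiable at 1/2
for n in (10, 100, 1000):
    grid = [j / 200 for j in range(201)]
    err = max(abs(bernstein(f, n, x) - f(x)) for x in grid)
    print(n, err)            # the maximum error decreases as n grows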
Exercises
1. Let 𝑋 be a nonnegative real-valued random variable. Suppose P(𝑋 = 0) < 1.
Prove that
P(𝑋 = 0) ≤ Var 𝑋 / E[𝑋²].
2. Let 𝑋 be a random variable with mean 𝜇 and variance 𝜎 2 . Prove that for all
𝜆 > 0,
P(𝑋 ≥ 𝜇 + 𝜆) ≤ 𝜎²/(𝜎² + 𝜆²).
3. Threshold for 𝑘-APs. Let [𝑛] 𝑝 denote the random subset of {1, . . . , 𝑛} where
every element is included with probability 𝑝 independently. For each fixed
integer 𝑘 ≥ 3, determine the threshold for [𝑛] 𝑝 to contain a 𝑘-term arithmetic
progression.
4. What is the threshold function for 𝐺 (𝑛, 𝑝) to contain a cycle?
5. Show that, for each fixed positive integer 𝑘, there is a sequence 𝑝 𝑛 such that
Hint: Make the random graph contain some specific subgraph but not some others.
6. Poisson limit. Let 𝑋 be the number of triangles in 𝐺 (𝑛, 𝑐/𝑛) for some fixed
𝑐 > 0.
a) For every nonnegative integer 𝑘, determine the limit of E\binom{𝑋}{𝑘} as 𝑛 → ∞.
b) Let 𝑌 ∼ Binomial(𝑛, 𝜆/𝑛) for some fixed 𝜆 > 0. For every nonnegative integer 𝑘, determine the limit of E\binom{𝑌}{𝑘} as 𝑛 → ∞, and show that it agrees with the limit in (a) for some 𝜆 = 𝜆(𝑐).
We know that 𝑌 converges to the Poisson distribution with mean 𝜆. Also,
the Poisson distribution is determined by its moments.
c) Compute, for every fixed nonnegative integer 𝑡, the limit of P(𝑋 = 𝑡) as 𝑛 → ∞.
(In particular, this gives the limit probability that 𝐺 (𝑛, 𝑐/𝑛) contains a
triangle, i.e., lim𝑛→∞ P(𝑋 > 0). This limit increases from 0 to 1 continu-
ously when 𝑐 ranges from 0 to +∞, thereby showing that the property of
containing a triangle has a coarse threshold.)
7. Central limit theorem for triangle counts. Find a real (non-random) sequence 𝑎 𝑛
so that, letting 𝑋 be the number of triangles and 𝑌 be the number of edges in the
random graph 𝐺 (𝑛, 1/2), one has
or first write the edge indicator variables as 𝑋𝑖 𝑗 = (1 + 𝑌𝑖 𝑗 )/2 and then expand. The latter
approach likely yields a cleaner computation.)
11. Nearly perfect triangle factor. Prove that, with probability approaching 1 as
𝑛 → ∞,
a) 𝐺(𝑛, 𝑛^{−2/3}) has at least 𝑛/100 vertex-disjoint triangles.
b) Simple nibble. 𝐺(𝑛, 𝐶𝑛^{−2/3}) has at least 0.33𝑛 vertex-disjoint triangles, for some constant 𝐶.
Hint: view a random graph as the union of several independent random graphs and iterate (a).
13. Let 𝑣₁ = (𝑥₁, 𝑦₁), …, 𝑣_𝑛 = (𝑥_𝑛, 𝑦_𝑛) ∈ Z² with |𝑥_𝑖|, |𝑦_𝑖| ≤ 2^{𝑛/2}/(100√𝑛) for all 𝑖 ∈ [𝑛]. Show that there are two disjoint sets 𝐼, 𝐽 ⊆ [𝑛], not both empty, such that Σ_{𝑖∈𝐼} 𝑣_𝑖 = Σ_{𝑗∈𝐽} 𝑣_𝑗.
14. ★ Prove that there is an absolute constant 𝐶 > 0 so that the following holds. For every prime 𝑝 and every 𝐴 ⊆ Z/𝑝Z with |𝐴| = 𝑘, there exists an integer 𝑥 so that {𝑥𝑎 : 𝑎 ∈ 𝐴} intersects every interval of length at least 𝐶𝑝/√𝑘 in Z/𝑝Z.
15. ★ Prove that there is a constant 𝑐 > 0 so that every hyperplane containing the origin in R^𝑛 intersects at least a 𝑐-fraction of the 2^𝑛 closed unit balls centered at the points of {−1, 1}^𝑛.
5 Chernoff Bound
The Chernoff bound is an extremely useful bound on the tails of a sum of independent
random variables. It is proved by bounding the moment generating function. This
proof technique is interesting and important in its own right. We will see this proof
method come up again later on when we prove martingale concentration inequalities.
The method allows us to adapt the proof of the Chernoff bound to other distributions.
Let us give the proof in the most basic case for simplicity and clarity.
In contrast, Chebyshev’s inequality gives a weaker bound P(𝑆_𝑛 ≥ 𝜆√𝑛) ≤ 1/𝜆². On the other hand, Chebyshev’s inequality is applicable in wider settings, as it only requires pairwise independence (for the second moment) as opposed to full independence.
(𝑒^{−𝑡} + 𝑒^𝑡)/2 = Σ_{𝑘≥0} 𝑡^{2𝑘}/(2𝑘)! ≤ Σ_{𝑘≥0} 𝑡^{2𝑘}/(𝑘! 2^𝑘) = 𝑒^{𝑡²/2}.
By Markov’s inequality,
√ E 𝑒 𝑡𝑆 √
𝑛+𝑡 2 𝑛/2
P(𝑆𝑛 ≥ 𝜆 𝑛) ≤ √ ≤ 𝑒 −𝑡𝜆 .
𝑒 𝑡𝜆 𝑛
√
Setting 𝑡 = 𝜆/ 𝑛 gives the bound. □
Remark 5.0.2. The technique of considering the moment generating function can be thought of, morally, as taking an appropriately high moment. Indeed, E[𝑒^{𝑡𝑆}] = Σ_{𝑛≥0} E[𝑆^𝑛] 𝑡^𝑛/𝑛! contains all the moment data of the random variable.
The second moment method (Chebyshev + Markov) can be thought of as the first iteration of this idea. By taking fourth moments (now requiring 4-wise independence of the summands), we can obtain tail bounds of the form ≲ 𝜆^{−4}. And similarly with higher moments.
In some applications, where one cannot assume independence, but can estimate some
high moments, the above philosophy can allow us to prove good tail bounds as well.
Also, by symmetry, P(𝑆_𝑛 ≤ −𝜆√𝑛) ≤ 𝑒^{−𝜆²/2}. Thus we have the following two-sided tail bound.
Corollary 5.0.3
With 𝑆𝑛 as before, for any 𝜆 ≥ 0,
P(|𝑆_𝑛| ≥ 𝜆√𝑛) ≤ 2𝑒^{−𝜆²/2}.
Remark 5.0.4. It is easy to adapt the above proof so that each 𝑋_𝑖 is a mean-zero random variable taking values in [−1, 1], independent (but not necessarily identical) across all 𝑖. Indeed, by convexity, we have
𝑒^{𝑡𝑥} ≤ ((1 − 𝑥)/2) 𝑒^{−𝑡} + ((1 + 𝑥)/2) 𝑒^𝑡 for all 𝑥 ∈ [−1, 1],
so that E[𝑒^{𝑡𝑋}] ≤ (𝑒^{−𝑡} + 𝑒^𝑡)/2. In particular, we obtain the following tail bounds on the binomial distribution.
Corollary 5.0.6
Let 𝑋 be a sum of 𝑛 independent Bernoulli random variables (with not necessarily identical probabilities). Let 𝜇 = E𝑋 and 𝜆 > 0. Then
P(𝑋 ≥ 𝜇 + 𝜆√𝑛) ≤ 𝑒^{−𝜆²/2} and P(𝑋 ≤ 𝜇 − 𝜆√𝑛) ≤ 𝑒^{−𝜆²/2}.
The Chernoff bound compares well to that of the normal distribution. For the standard normal 𝑍 ∼ 𝑁(0, 1), one has E[𝑒^{𝑡𝑍}] = 𝑒^{𝑡²/2}, and so, for 𝑡 > 0,
P(𝑍 ≥ 𝜆) = P(𝑒^{𝑡𝑍} ≥ 𝑒^{𝑡𝜆}) ≤ 𝑒^{−𝑡𝜆} E[𝑒^{𝑡𝑍}] = 𝑒^{−𝑡𝜆 + 𝑡²/2};
taking 𝑡 = 𝜆 gives P(𝑍 ≥ 𝜆) ≤ 𝑒^{−𝜆²/2}.
The same proof method allows you to prove bounds for other sums of random variables,
which you can adjust based on the distributions. See the Alon–Spencer textbook,
Appendix A, for examples of bounds and proofs.
For example, for a sum of independent Bernoulli random variables with small means,
we can improve on the above estimates as follows.
Theorem 5.0.7
Let 𝑋 be a sum of independent Bernoulli random variables (not necessarily with the same probability). Let 𝜇 = E𝑋. For all 𝜀 > 0,
P(𝑋 ≥ (1 + 𝜀)𝜇) ≤ 𝑒^{−((1+𝜀) log(1+𝜀) − 𝜀)𝜇} ≤ 𝑒^{−𝜀²𝜇/(2+𝜀)}
and
P(𝑋 ≤ (1 − 𝜀)𝜇) ≤ 𝑒^{−𝜀²𝜇/2}.
Remark 5.0.8. The bounds for the upper and lower tails are necessarily asymmetric when the probabilities are small. Why? Think about what happens when 𝑋 ∼ Bin(𝑛, 𝑐/𝑛), which, for a constant 𝑐 > 0, converges as 𝑛 → ∞ to a Poisson distribution with mean 𝑐, whose value at 𝑘 is 𝑒^{−𝑐}𝑐^𝑘/𝑘! = 𝑒^{−Θ(𝑘 log 𝑘)}, and not the sub-Gaussian decay 𝑒^{−Ω(𝑘²)} as one might naively predict by an incorrect application of the Chernoff bound formula. Nonetheless, both formulas tell us that both tails decay exponentially in 𝜀²𝜇 for small values of 𝜀 ∈ [0, 1].
5.1 Discrepancy
Given a hypergraph (i.e., set family), can we color the vertices red/blue so that every
edge has roughly the same number of red versus blue vertices? (Contrast this problem
to 2-coloring hypergraphs from Section 1.3.)
Theorem 5.1.1
Let F be a collection of 𝑚 subsets of [𝑛]. Then there exists some assignment [𝑛] → {−1, 1} so that the sum over every set in F is 𝑂(√(𝑛 log 𝑚)) in absolute value.
Proof. Put ±1 iid uniformly at random on each vertex. On each edge, the probability that the sum exceeds 2√(𝑛 log 𝑚) in absolute value is, by the Chernoff bound, less than 2𝑒^{−2 log 𝑚} = 2/𝑚². By a union bound over all 𝑚 edges, with probability greater than 1 − 2/𝑚 ≥ 0, no edge has sum exceeding 2√(𝑛 log 𝑚). □
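As a quick empirical companion to this proof (our own sketch; the random set system below is an arbitrary choice for illustration, not anything from the text), a single round of the random assignment typically lands well within the 2√(𝑛 log 𝑚) bound:

import math
import random

def max_discrepancy(n, sets):
    # one round of the random +-1 assignment from the proof
    x = [random.choice((-1, 1)) for _ in range(n)]
    return max(abs(sum(x[i] for i in S)) for S in sets)

n = m = 1000
sets = [random.sample(range(n), n // 2) for _ in range(m)]
bound = 2 * math.sqrt(n * math.log(m))
print(max_discrepancy(n, sets), "vs bound", round(bound, 1))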
Remark 5.1.2. In a beautiful landmark paper titled Six standard deviations suffice,
Spencer (1985) showed that one can remove the logarithmic term by a more sophisti-
cated semirandom assignment algorithm.
Remark 5.1.4. More generally, Spencer proved that the same holds if the vertices have [0, 1]-valued weights.
The idea, very roughly speaking, is to first generalize from {−1, 1}-valued assignments to [−1, 1]-valued assignments. Then the all-zero vector is a trivially satisfying assignment. We then randomly, in iterations, alter the values from 0 to other values in [−1, 1], while avoiding potential violations (e.g., edges with sum close to 6√𝑛 in absolute value), finalizing the color of a vertex when its value reaches −1 or 1.
Spencer’s original proof was not algorithmic, and he suspected that it could not be
made efficiently algorithmic. In a breakthrough result, Bansal (2010) gave an efficient
algorithm for producing a coloring with small discrepancy. Lovett and Meka (2015)
provided another element algorithm with a beautiful proof.
Here is a famous conjecture on discrepancy.
𝜀₁𝑣₁ + · · · + 𝜀_𝑚𝑣_𝑚 ∈ [−𝐾, 𝐾]^𝑛.
Banaszczyk (1998) proved the bound 𝐾 = 𝑂(√(log 𝑛)) in a beautiful paper using deep ideas from convex geometry.
Spencer’s theorem’s implies the special case of Komlós conjecture where all vec-
tors 𝑣 𝑖 have the form 𝑛−1/2 (±1, . . . , ±1) (or more generally when all coordinates are
𝑂 (𝑛−1/2 )). The deduction is easy when 𝑚 ≤ 𝑛. When 𝑚 > 𝑛, we use the following
observation.
Lemma 5.1.6
Let 𝑣₁, . . . , 𝑣_𝑚 ∈ R^𝑛. Then there exist 𝑎₁, . . . , 𝑎_𝑚 ∈ [−1, 1] with |{𝑖 : 𝑎_𝑖 ∉ {−1, 1}}| ≤ 𝑛 such that
𝑎₁𝑣₁ + · · · + 𝑎_𝑚𝑣_𝑚 = 0.
Let us explain how to deduce the special case of the Komlós conjecture stated earlier. Let 𝑎₁, . . . , 𝑎_𝑚 and 𝐼 = {𝑖 : 𝑎_𝑖 ∉ {−1, 1}} be as in the lemma. Take 𝜀_𝑖 = 𝑎_𝑖 for all 𝑖 ∉ 𝐼, and apply a corollary of Spencer’s theorem to find 𝜀_𝑖 ∈ {−1, 1}, 𝑖 ∈ 𝐼, with
Σ_{𝑖∈𝐼} (𝜀_𝑖 − 𝑎_𝑖)𝑣_𝑖 ∈ [−𝐾, 𝐾]^𝑛,
which would yield the desired result. The above step can be deduced from Spencer’s theorem by first assuming that each 𝑎_𝑖 ∈ [−1, 1] has a finite binary expansion (a compactness argument), and then rounding off one digit at a time using Spencer’s theorem, starting from the least significant bit (see Corollary 8 in Spencer’s paper for details).
The Gram matrix of 𝑣₁, . . . , 𝑣_𝑚 is (𝑣_𝑖 · 𝑣_𝑗)_{𝑖,𝑗} = (1 − 𝛼)𝐼_𝑚 + 𝛼𝐽_𝑚 (here 𝐼_𝑚 and 𝐽_𝑚 are the 𝑚 × 𝑚 identity and all-ones matrices respectively). Since the eigenvalues of 𝐽_𝑚 are 𝑚 (once) and 0 (repeated 𝑚 − 1 times), the eigenvalues of (1 − 𝛼)𝐼_𝑚 + 𝛼𝐽_𝑚 are (𝑚 − 1)𝛼 + 1 (once) and 1 − 𝛼 (𝑚 − 1 times). Since the Gram matrix is positive semidefinite, all its eigenvalues are nonnegative, and so 𝛼 ≥ −1/(𝑚 − 1).
• If 𝛼 ≠ −1/(𝑚 − 1), then this 𝑚 × 𝑚 matrix is non-singular, and since its rank is
at most 𝑛 (as 𝑣 𝑖 ∈ R𝑛 ), we have 𝑚 ≤ 𝑛.
• If 𝛼 = −1/(𝑚 − 1), then this matrix has rank 𝑚 − 1, and we conclude that
𝑚 ≤ 𝑛 + 1.
It is left as an exercise to check all these bounds are tight.
Exercise: given 𝑚 unit vectors in R𝑛 whose pairwise inner products are all ≤ −𝛽, one
has 𝑚 ≤ 1 + ⌊1/𝛽⌋. (A bit more difficult: show that for 𝛽 = 0, one has 𝑚 ≤ 2𝑛).
What if, instead of asking for exactly equal angles, we ask for approximately the same angle? It turns out that we can get many more vectors.
Applying the Chernoff bound in the form of Theorem 5.0.5 (after a linear transformation on each variable to make each term take values in [−1, 1] with mean zero), we get
P(|𝑣_𝑖 · 𝑣_𝑗 − 𝑛𝛼| ≥ 𝑛𝜀) ≤ 2𝑒^{−Ω(𝑛𝜀²)}.
By the union bound, the probability that |𝑣_𝑖 · 𝑣_𝑗 − 𝑛𝛼| > 𝑛𝜀 for some 𝑖 ≠ 𝑗 is < 𝑚²𝑒^{−Ω(𝑛𝜀²)}, which is < 1 for some 𝑚 at least 2^{𝑐𝑛}. So with positive probability, no such pair occurs, and then 𝑣₁/√𝑛, . . . , 𝑣_𝑚/√𝑛 is a collection of unit vectors in R^𝑛 whose pairwise inner products all lie in [𝛼 − 𝜀, 𝛼 + 𝜀]. □
Remark 5.2.3 (Equiangular lines with a fixed angle). Given a fixed angle 𝜃, for large
𝑛, how many lines in R𝑛 through the origin can one place whose pairwise angles are all
exactly 𝜃? This problem was solved by Jiang, Tidor, Yao, Zhang, Zhao (2021). This
is the same as asking for a set of unit vectors in R𝑛 whose pairwise inner products are
±𝛼. It turns out that for fixed 𝛼, the maximum number of lines grows linearly with the
dimension 𝑛, and the rate of growth depends on properties of 𝛼 in relation to spectral
graph theory. We refer to the cited paper for details.
The condition on 𝐾₅ is clearly necessary, but what about 𝐾_{3,3}? What is the “real” reason for 4-colorability?
Hadwiger’s conjecture, below, remains a major conjecture in graph theory.
• 𝑡 = 1 trivial
• 𝑡 = 2 nearly trivial (if 𝐺 is 𝐾3 -minor-free, then it’s a tree)
• 𝑡 = 3 elementary graph theoretic arguments
• 𝑡 = 4 is equivalent to the 4-color theorem (Wagner 1937)
• 𝑡 = 5 is equivalent to the 4-color theorem (Robertson–Seymour–Thomas 1994;
this work won a Fulkerson Prize)
• 𝑡 ≥ 6 remains open
Let us explore a variation of Hadwiger’s conjecture:
Hajós conjecture (1961). Every graph without a 𝐾_{𝑡+1}-subdivision is 𝑡-colorable.
Hajós conjecture is true for 𝑡 ≤ 3. However, it turns out to be false in general. Catlin
(1979) constructed counterexamples for all 𝑡 ≥ 6 (𝑡 = 4, 5 are still open).
It turns out that Hajós conjecture is not just false, but very false.
Erdős–Fajtlowicz (1981) showed that almost every graph is a counterexample (it’s a
good idea to check for potential counterexamples among random graphs!)
Theorem 5.3.2
With probability 1 − 𝑜(1), 𝐺(𝑛, 1/2) has no 𝐾_𝑡-subdivision with 𝑡 = ⌈10√𝑛⌉.
From Theorem 4.4.3 we know that, with high probability, 𝐺(𝑛, 1/2) has independence number ∼ 2 log₂ 𝑛 and hence chromatic number ≥ (1 + 𝑜(1)) 𝑛/(2 log₂ 𝑛). Thus the above result shows that 𝐺(𝑛, 1/2) is whp a counterexample to the Hajós conjecture.
Taking a union bound over all 𝑡-vertex subsets 𝑆, and noting that
\binom{𝑛}{𝑡} 𝑒^{−𝑡²/10} < 𝑛^𝑡 𝑒^{−𝑡²/10} ≤ 𝑒^{−10𝑛 + 𝑂(√𝑛 log 𝑛)} = 𝑜(1),
we see that whp no such 𝑆 exists, so that 𝐺(𝑛, 1/2) whp has no 𝐾_𝑡-subdivision. □
Remark 5.3.3 (Quantitative question). One can ask the following quantitative question regarding Hadwiger’s conjecture:
Can every graph without a 𝐾_{𝑡+1}-minor be properly colored with a small number of colors?
Wagner (1964) showed that every graph without a 𝐾_{𝑡+1}-minor is 2^{𝑡−1}-colorable.
Here is the proof: assume that the graph is connected. Take a vertex 𝑣 and let 𝐿_𝑖 be the set of vertices at distance exactly 𝑖 from 𝑣. The subgraph induced on 𝐿_𝑖 has no 𝐾_𝑡-minor, since otherwise such a 𝐾_𝑡-minor would extend to a 𝐾_{𝑡+1}-minor with 𝑣. Then by induction each 𝐿_𝑖 is 2^{𝑡−2}-colorable (check the base cases), and using two disjoint palettes of colors, alternating between even and odd layers 𝐿_𝑖, yields a proper coloring of 𝐺.
This bound has been improved over time. Delcourt and Postle (2021+) showed that
every graph with no 𝐾𝑡 -minor is 𝑂 (𝑡 log log 𝑡)-colorable.
Exercises
1. Prove that with probability 1 − 𝑜(1) as 𝑛 → ∞, every bipartite subgraph of 𝐺(𝑛, 1/2) has at most 𝑛²/8 + 10𝑛^{3/2} edges.
2. Unbalancing lights. Prove that there is a constant 𝐶 so that for every positive integer 𝑛, one can find an 𝑛 × 𝑛 matrix 𝐴 with {−1, 1} entries, so that for all vectors 𝑥, 𝑦 ∈ {−1, 1}^𝑛, |𝑦^⊺𝐴𝑥| ≤ 𝐶𝑛^{3/2}.
3. Prove that there exists a constant 𝑐 > 1 such that for every 𝑛, there are at least 𝑐^𝑛 points in R^𝑛 so that every triple of points forms a triangle whose angles are all less than 61°.
b) ★ Show that there is some constant 𝐶 such that (a) is false if 1.99 is replaced
by 𝐶. (What is the best 𝐶 you can get?)
6 Lovász Local Lemma
The Lovász local lemma (LLL) was introduced in the paper of Erdős and Lovász
(1975). It is a powerful tool in the probabilistic method.
In many problems, we wish to avoid a certain set of “bad events.” Here are two easy-to-handle scenarios:
• (Complete independence) All the bad events are independent and have probabil-
ity less than 1.
• (Union bound) The sum of the bad event probabilities is less than 1.
The local lemma deals with an intermediate situation where there is a small amount of
local dependencies.
We saw an application of the Lovász local lemma back in Section 1.1, where we used
it to lower bound Ramsey numbers. This chapter explores the local lemma and its
applications in depth.
P(𝐴₀𝐵₁ · · · 𝐵_𝑚) = P(𝐴₀) P(𝐵₁ · · · 𝐵_𝑚), or equivalently,
P(𝐴₀ | 𝐵₁ · · · 𝐵_𝑚) = P(𝐴₀).
In practice, it is not too hard to construct a valid dependency graph, since most
applications of the Lovász local lemma use the following setup (which we saw in
Section 1.1).
It is easy to check that the canonical dependency graph above is indeed a valid depen-
dency graph.
Example 6.1.6 (Boolean satisfiability problem (SAT)). Given a CNF formula (con-
junctive normal form, i.e., and-of-or’s), e.g., (∧ = and; ∨ = or)
(𝑥 1 ∨ 𝑥 2 ∨ 𝑥3 ) ∧ (𝑥 1 ∨ 𝑥 2 ∨ 𝑥 4 ) ∧ (𝑥 2 ∨ 𝑥 4 ∨ 𝑥 5 ) ∧ · · ·
The following formulation of the local lemma is easiest to apply and is the most
commonly used. It applies to settings where the dependency graph has small maximum
degree.
𝑒 𝑝(𝑑 + 1) ≤ 1,
Remark 6.1.8. The constant 𝑒 is best possible (Shearer 1985). In most applications,
the precise value of the constant is unimportant.
then
P(none of the events 𝐴_𝑖 occur) ≥ ∏_{𝑖=1}^{𝑛} (1 − 𝑥_𝑖).
Proof that the general form implies the symmetric form. Set 𝑥_𝑖 = 1/(𝑑 + 1) < 1 for all 𝑖. Then
𝑥_𝑖 ∏_{𝑗∈𝑁(𝑖)} (1 − 𝑥_𝑗) ≥ (1/(𝑑 + 1)) (1 − 1/(𝑑 + 1))^𝑑 > 1/((𝑑 + 1)𝑒) ≥ 𝑝
Here is another corollary of the general form local lemma, which applies if the total
probability of any neighborhood in a dependency graph is small.
Corollary 6.1.10
In the setup of Theorem 6.1.9, if P(𝐴_𝑖) < 1/2 and Σ_{𝑗∈𝑁(𝑖)} P(𝐴_𝑗) ≤ 1/4 for all 𝑖, then with positive probability none of the events 𝐴_𝑖 occur.
Proof. Set 𝑥_𝑖 = 2P(𝐴_𝑖). Then
𝑥_𝑖 ∏_{𝑗∈𝑁(𝑖)} (1 − 𝑥_𝑗) ≥ 𝑥_𝑖 (1 − Σ_{𝑗∈𝑁(𝑖)} 𝑥_𝑗) = 2P(𝐴_𝑖) (1 − Σ_{𝑗∈𝑁(𝑖)} 2P(𝐴_𝑗)) ≥ P(𝐴_𝑖).
(The first inequality is by the “union bound.”) □
In some applications, one may need to apply the general form local lemma with
carefully chosen values for 𝑥𝑖 .
P(𝐴_𝑖 | ⋀_{𝑗∈𝑆} Ā_𝑗) ≤ 𝑥_𝑖 whenever 𝑖 ∉ 𝑆 ⊆ [𝑛]. (6.1)
Once (6.1) has been established, we then deduce that
P(Ā₁ · · · Ā_𝑛) = P(Ā₁) P(Ā₂ | Ā₁) P(Ā₃ | Ā₁Ā₂) · · · P(Ā_𝑛 | Ā₁ · · · Ā_{𝑛−1}) ≥ (1 − 𝑥₁)(1 − 𝑥₂) · · · (1 − 𝑥_𝑛),
numerator ≤ P(𝐴_𝑖 | ⋀_{𝑗∈𝑆₂} Ā_𝑗) = P(𝐴_𝑖) ≤ 𝑥_𝑖 ∏_{𝑗∈𝑁(𝑖)} (1 − 𝑥_𝑗), (6.3)
denominator = P(Ā_{𝑗₁} | ⋀_{𝑗∈𝑆₂} Ā_𝑗) P(Ā_{𝑗₂} | Ā_{𝑗₁} ⋀_{𝑗∈𝑆₂} Ā_𝑗) · · · P(Ā_{𝑗_𝑟} | Ā_{𝑗₁} · · · Ā_{𝑗_{𝑟−1}} ⋀_{𝑗∈𝑆₂} Ā_𝑗)
  ≥ (1 − 𝑥_{𝑗₁}) · · · (1 − 𝑥_{𝑗_𝑟}) [by the induction hypothesis]
  ≥ ∏_{𝑗∈𝑁(𝑖)} (1 − 𝑥_𝑗)
Remark 6.1.11. We used the independence assumption only at step (6.3) of the proof. Upon a closer examination, we see that we only need to know correlation inequalities of the form P(𝐴_𝑖 | ⋀_{𝑗∈𝑆₂} Ā_𝑗) ≤ P(𝐴_𝑖) for 𝑆₂ ⊆ [𝑛] \ (𝑁(𝑖) ∪ {𝑖}), rather than independence. This observation allows a strengthening of the local lemma, known as the lopsided local lemma, that we will explore later in the chapter.
Theorem 6.2.1
A 𝑘-uniform hypergraph is 2-colorable if every edge intersects at most 𝑒^{−1}2^{𝑘−1} − 1 other edges.
Proof. Color each vertex with one of two colors, uniformly and independently at random. For each edge 𝑓, let 𝐴_𝑓 be the event that 𝑓 is monochromatic. Then P(𝐴_𝑓) = 𝑝 := 2^{−𝑘+1}. Each 𝐴_𝑓 is independent from all 𝐴_{𝑓′} where 𝑓′ is disjoint from 𝑓. Since each edge intersects at most 𝑑 := 𝑒^{−1}2^{𝑘−1} − 1 other edges, and 𝑒(𝑑 + 1)𝑝 ≤ 1, the local lemma implies that with positive probability, none of the events 𝐴_𝑓 occur. □
Corollary 6.2.2
For 𝑘 ≥ 9, every 𝑘-uniform 𝑘-regular hypergraph is 2-colorable.
(Here 𝑘-regular means that every vertex lies in exactly 𝑘 edges.)
Proof. Every edge intersects ≤ 𝑑 = 𝑘(𝑘 − 1) other edges. And 𝑒(𝑘(𝑘 − 1) + 1)2^{−𝑘+1} < 1 for 𝑘 ≥ 9. □
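(As a quick numerical check, which we add here: at 𝑘 = 9 the left-hand side is 𝑒 · 73 · 2^{−8} = 73𝑒/256 ≈ 0.78 < 1, and it decreases for larger 𝑘 since the exponential decay of 2^{−𝑘+1} dominates the quadratic growth of 𝑘(𝑘 − 1).)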
Remark 6.2.3. The statement is false for 𝑘 = 2 (triangle) and 𝑘 = 3 (Fano plane) but
actually true for all 𝑘 ≥ 4 (Thomassen 1992).
Here is an example where the symmetric form of the local lemma is insufficient (why?).
Theorem 6.2.4
Let 𝐻 be a (non-uniform) hypergraph where every edge has size ≥ 3. Suppose that
Σ_{𝑓∈𝐸(𝐻)\{𝑒} : 𝑒∩𝑓≠∅} 2^{−|𝑓|} ≤ 1/8 for each edge 𝑒.
Then 𝐻 is 2-colorable.
Proof. Consider a uniform random 2-coloring of the vertices. Let 𝐴_𝑒 be the event that edge 𝑒 is monochromatic. Then P(𝐴_𝑒) = 2^{−|𝑒|+1} ≤ 1/4 since |𝑒| ≥ 3. Also,
Σ_{𝑓∈𝐸(𝐻)\{𝑒} : 𝑒∩𝑓≠∅} P(𝐴_𝑓) = Σ_{𝑓∈𝐸(𝐻)\{𝑒} : 𝑒∩𝑓≠∅} 2^{−|𝑓|+1} ≤ 1/4.
Thus by Corollary 6.1.10 one can avoid all events 𝐴𝑒 , and hence 𝐻 is 2-colorable. □
Remark 6.2.5. A sign to look beyond the symmetric local lemma is when there are
bad events of very different nature (e.g., having very different probabilities).
Compactness argument
Now we highlight an important compactness argument that allows us to deduce the
existence of an infinite object, even though the local lemma itself is only applicable to
finite systems.
Theorem 6.2.6
Let 𝐻 be a (non-uniform) hypergraph on a possibly infinite vertex set, such that each edge is finite, has at least 𝑘 vertices, and intersects at most 𝑑 other edges. If 𝑒2^{−𝑘+1}(𝑑 + 1) ≤ 1, then 𝐻 has a proper 2-coloring.
Proof. From a vanilla application of the symmetric local lemma, we deduce that for any finite subset 𝑋 of vertices, there exists a 2-coloring of 𝑋 so that no edge contained in 𝑋 is monochromatic (color each vertex iid uniformly, and consider the bad events 𝐴_𝑒 that an edge 𝑒 ⊆ 𝑋 is monochromatic).
Next we extend the coloring to the entire vertex set 𝑉 by a compactness argument. The set of all colorings is [2]^𝑉. By Tikhonov’s theorem (which says that a product of a possibly infinite collection of compact spaces is compact), [2]^𝑉 is compact in the product topology. For each finite 𝑋 ⊆ 𝑉, let 𝐶_𝑋 ⊆ [2]^𝑉 be the set of colorings in which no edge contained in 𝑋 is monochromatic; each 𝐶_𝑋 is closed and, by the previous paragraph, nonempty. Since
𝐶_{𝑋₁} ∩ · · · ∩ 𝐶_{𝑋_ℓ} ⊇ 𝐶_{𝑋₁∪···∪𝑋_ℓ},
the sets 𝐶_𝑋 have the finite intersection property, so by compactness their common intersection is nonempty; any coloring in this intersection is a proper 2-coloring of 𝐻. □
Remark 6.2.8. Note the conclusion may be false if we do not assume the random
variable model (why?).
The next application appears in the paper of Erdős and Lovász (1975) where the local lemma originally appeared.
Consider 𝑘-coloring the real numbers, i.e., a function 𝑐 : R → [𝑘]. We say that 𝑇 ⊆ R
is multicolored with respect to 𝑐 if all 𝑘 colors appear in 𝑇.
Question 6.2.9
For each 𝑘 is there an 𝑚 so that for every 𝑆 ⊆ R with |𝑆| = 𝑚, one can 𝑘-color R so
that every translate of 𝑆 is multicolored?
The following theorem shows that this can be done whenever 𝑚 > (3 + 𝜀)𝑘 log 𝑘 and
𝑘 > 𝑘 0 (𝜀) sufficiently large.
Theorem 6.2.10
The answer to the above question is yes if
𝑒(𝑚(𝑚 − 1) + 1) 𝑘 (1 − 1/𝑘)^𝑚 ≤ 1.
Proof. We pick a uniform random color for each element of Z. For each 𝑘-term arithmetic progression in Z with 𝑘 ≥ 𝑘₀ and common difference less than 2^{(1−𝜀)𝑘}, consider the event that this 𝑘-AP is monochromatic. By the compactness argument, it suffices to check that we can avoid any finite subset of events.
The event that a particular 𝑘-AP is monochromatic has probability exactly 2^{−𝑘+1}. (Since this number depends on 𝑘, we should use the asymmetric local lemma.)
Recall that in the asymmetric local lemma (Theorem 6.1.9), we need to select 𝑥_𝑖 ∈ [0, 1) for each bad event 𝐴_𝑖 so that
P(𝐴_𝑖) ≤ 𝑥_𝑖 ∏_{𝑗∈𝑁(𝑖)} (1 − 𝑥_𝑗) for all 𝑖 ∈ [𝑛].
Decomposing coverings
We say that a collection of disks in R𝑑 is a covering if their union is R𝑑 . We say that
it is a 𝒌-fold covering if every point of R𝑑 is contained in at least 𝑘 disks (so 1-fold
covering is the same as a covering).
We say that a 𝑘-fold covering is decomposable if it can be partitioned into two cover-
ings.
In R^𝑑, is every 𝑘-fold covering by unit balls decomposable if 𝑘 is sufficiently large?
A fun exercise: in R¹, every 𝑘-fold covering by intervals can be partitioned into 𝑘 coverings.
Mani-Levitska and Pach (1986) showed that every 33-fold covering of R2 is decom-
posable.
What about higher dimensions?
Surprisingly, they also showed that for every 𝑘, there exists a 𝑘-fold indecomposable covering of R³ (and similarly of R^𝑑 for every 𝑑 ≥ 3).
However, it turns out that indecomposable coverings must cover the space quite un-
evenly:
Lemma 6.2.14
A set of 𝑛 ≥ 2 spheres in R³ cuts R³ into at most 𝑛³ connected components.
Proof. Let us first consider the problem in one dimension lower. Let 𝑓 (𝑚) be the
maximum number of connected regions that 𝑚 circles on a sphere in R3 can cut the
sphere into.
We have 𝑓 (𝑚 + 1) ≤ 𝑓 (𝑚) + 2𝑚 for all 𝑚 ≥ 1 since adding a new circle to a set of 𝑚
circles creates at most 2𝑚 intersection points, so that the new circle is divided into at
most 2𝑚 arcs, and hence its addition creates at most 2𝑚 new regions.
Combined with 𝑓 (1) = 2, we deduce 𝑓 (𝑚) ≤ 𝑚(𝑚 − 1) + 2 for all 𝑚 ≥ 1.
Now let 𝑔(𝑚) be the maximum number of connected regions that 𝑚 spheres in R³ can cut R³ into. We have 𝑔(1) = 2, and 𝑔(𝑚 + 1) ≤ 𝑔(𝑚) + 𝑓(𝑚) by a similar argument as earlier. So 𝑔(𝑚) ≤ 𝑓(𝑚 − 1) + 𝑓(𝑚 − 2) + · · · + 𝑓(1) + 𝑔(1) ≤ 𝑚³. □
Proof. Suppose for contradiction that every point in R³ is covered by at most 𝑡 ≤ 𝑐2^{𝑘/3} unit balls from 𝐹 (for some sufficiently small 𝑐 that we will pick later).
Construct an infinite hypergraph 𝐻 with vertex set being the set of balls and edges
having the form 𝐸 𝑥 = {balls containing 𝑥} for some 𝑥 ∈ R3 . Note that |𝐸 𝑥 | ≥ 𝑘 since
we have a 𝑘-fold covering.
Also, note that if 𝑥, 𝑦 ∈ R3 lie in the same connected component in the complement of
the union of all the unit spheres, then 𝐸 𝑥 = 𝐸 𝑦 (i.e., the same edge).
Claim: every edge intersects at most 𝑑 = 𝑂(𝑡³) other edges.
Proof of claim: Let 𝑥 ∈ R³. If 𝐸_𝑥 ∩ 𝐸_𝑦 ≠ ∅, then |𝑥 − 𝑦| ≤ 2, so all the balls in 𝐸_𝑦 lie in the radius-4 ball centered at 𝑥. The volume of the radius-4 ball is 4³ times that of the unit ball. Since every point lies in at most 𝑡 balls, there are at most 4³𝑡 balls appearing among those 𝐸_𝑦 intersecting 𝐸_𝑥, and these balls cut the radius-2 ball centered at 𝑥 into 𝑂(𝑡³) connected regions by the earlier lemma, and two different 𝑦’s in the same region produce the same 𝐸_𝑦. So 𝐸_𝑥 intersects 𝑂(𝑡³) other edges. ■
With 𝑡 ≤ 𝑐2^{𝑘/3} and 𝑐 sufficiently small, and knowing 𝑑 = 𝑂(𝑡³) from the claim, we have 𝑒2^{−𝑘+1}(𝑑 + 1) ≤ 1. It then follows by Theorem 6.2.6 (local lemma + compactness argument) that this hypergraph is 2-colorable, which corresponds to a decomposition of the covering, a contradiction. □
Theorem 6.3.1
Let 𝐺 = (𝑉, 𝐸) be a graph with maximum degree Δ and let 𝑉 = 𝑉1 ∪ · · · ∪ 𝑉𝑟 be a
partition, where each |𝑉𝑖 | ≥ 2𝑒Δ. Then there is an independent set in 𝐺 containing
one vertex from each 𝑉𝑖 .
Proof. The first step in the proof is simple yet subtle: we may assume that |𝑉𝑖 | = 𝑘 :=
⌈2𝑒Δ⌉ for each 𝑖, or else we can remove some vertices from 𝑉𝑖 (if we do not trim the
vertex sets now, we will run into difficulties later).
Pick 𝑣 𝑖 ∈ 𝑉𝑖 uniformly at random, independently for each 𝑖.
This is an instance of the random variable model (Setup 6.1.5), where 𝑣 1 , . . . , 𝑣 𝑟 are
the random variables.
We would like to design a collection of “bad events” so that if we avoid all of them,
then {𝑣 1 , . . . , 𝑣 𝑟 } is guaranteed to be independent set.
What do we choose as bad events? It turns out that some choices work better than
others.
Attempt 1:
For each 1 ≤ 𝑖 < 𝑗 ≤ 𝑟 where there exists an edge between 𝑉𝑖 and 𝑉 𝑗 , let 𝐴𝑖, 𝑗 be the
event that 𝑣 𝑖 is adjacent to 𝑣 𝑗 .
We find that P( 𝐴𝑖, 𝑗 ) ≤ Δ/𝑘.
The canonical dependency graph has 𝐴𝑖, 𝑗 ∼ 𝐴𝑖 ′ , 𝑗 ′ if and only if the two sets {𝑖, 𝑗 } and
{𝑖′, 𝑗 ′ } intersect. This dependency graph has max degree ≤ 2Δ𝑘 (starting from (𝑖, 𝑗),
look at the neighbors of all vertices in 𝑉𝑖 ∪ 𝑉 𝑗 ). The max degree is too large compared
to the bad event probabilities.
Attempt 2:
For each edge 𝑒 ∈ 𝐸, let 𝐴𝑒 be the event that both endpoints of 𝑒 are picked.
We have P( 𝐴𝑒 ) = 1/𝑘 2 .
The canonical dependency graph has 𝐴𝑒 ∼ 𝐴 𝑓 if some 𝑉𝑖 intersects both 𝑒 and 𝑓 .
This dependency graph has max degree ≤ 2𝑘Δ (if 𝑒 is between 𝑉𝑖 and 𝑉 𝑗 , then 𝑓 must
be incident to 𝑉𝑖 ∪ 𝑉 𝑗 ).
We have 𝑒(1/𝑘²)(2𝑘Δ + 1) ≤ 1, so the local lemma implies that with positive probability no bad event occurs, in which case {𝑣₁, . . . , 𝑣_𝑟} is an independent set. □
Remark 6.3.2. Alon (1988) introduced the above result as a lemma in his near-resolution of the still-open linear arboricity conjecture (see the Alon–Spencer textbook §5.5). Alon’s approach makes heavy use of the local lemma.
Haxell (1995, 2001) relaxed the hypothesis to |𝑉𝑖 | ≥ 2Δ for each 𝑖. The statement
becomes false if 2Δ is replaced by 2Δ − 1 (Szabó and Tardos 2006).
Corollary 6.4.2
For every 𝑘 there exists 𝑑 so that every 2𝑑-regular graph has a cycle of length divisible
by 𝑘.
Proof. Every 2𝑑-regular graph can be made into a 𝑑-regular digraph by orienting its edges according to an Eulerian tour. And then we can apply the previous theorem. □
𝑘 ≤ 𝛿/(1 + log(1 + 𝛿Δ)).
Proof. By deleting edges, we may assume that every vertex has out-degree exactly 𝛿. Assign to each vertex 𝑣 an independent uniform random color 𝑥_𝑣 ∈ Z/𝑘Z, and let 𝐴_𝑣 be the event that no out-neighbor of 𝑣 receives color 𝑥_𝑣 + 1. Then
P(𝐴_𝑣) = (1 − 1/𝑘)^𝛿 ≤ 𝑒^{−𝛿/𝑘}.
Since 𝐴𝑣 depends only on {𝑥 𝑤 : 𝑤 ∈ {𝑣} ∪ 𝑁 + (𝑣)}, where 𝑁 + (𝑣) denotes the out-
neighbors of 𝑣 and 𝑁 − (𝑣) the in-neighbors of 𝑣, the canonical dependency graph
has
𝐴𝑣 ∼ 𝐴𝑤 if {𝑣} ∪ 𝑁 + (𝑣) intersects {𝑤} ∪ 𝑁 + (𝑤).
The maximum degree in the dependency graph is at most Δ + 𝛿Δ (starting from 𝑣, there
are
(1) at most Δ choices stepping backward
(2) 𝛿 choices stepping forward, and
(3) at most 𝛿(Δ − 1) choices stepping forward and then backward to land somewhere
other than 𝑣).
So an application of the local lemma shows that we are done as long as 𝑒^{1−𝛿/𝑘}(1 + Δ + 𝛿Δ) ≤ 1, i.e.,
𝑘 ≤ 𝛿/(1 + log(1 + Δ + 𝛿Δ)).
This is almost, but not quite, the result (though, for most applications, we would be perfectly happy with such a bound).
The final trick is to notice that we actually have an even smaller valid dependency
digraph:
𝐴𝑣 is independent of all 𝐴𝑤 where 𝑁 + (𝑣) is disjoint from 𝑁 + (𝑤) ∪ {𝑤}.
Indeed, even if we fix the colors of all vertices outside 𝑁⁺(𝑣), the conditional probability of 𝐴_𝑣 is still (1 − 1/𝑘)^𝛿.
The number of 𝑤 such that 𝑁⁺(𝑣) intersects 𝑁⁺(𝑤) ∪ {𝑤} is at most 𝛿Δ (we no longer need to consider (1) in the previous count), which yields the claimed bound. □
Recall from the proof of the local lemma the step (6.3):
numerator ≤ P(𝐴_𝑖 | ⋀_{𝑗∈𝑆₂} Ā_𝑗) = P(𝐴_𝑖) ≤ 𝑥_𝑖 ∏_{𝑗∈𝑁(𝑖)} (1 − 𝑥_𝑗).
If we had changed the middle = to ≤, the whole proof would remain valid. This
observation allows us to weaken the independence assumption. Therefore we have the
following theorem, which was used by Erdős and Spencer (1991) to give an application
to Latin transversals that we will see shortly.
P(𝐴_𝑖 | ⋀_{𝑗∈𝑆} Ā_𝑗) ≤ P(𝐴_𝑖) for all 𝑖 ∈ [𝑛] and 𝑆 ⊆ [𝑛] \ (𝑁(𝑖) ∪ {𝑖}). (6.1)
Suppose there exist 𝑥₁, . . . , 𝑥_𝑛 ∈ [0, 1) such that
P(𝐴_𝑖) ≤ 𝑥_𝑖 ∏_{𝑗∈𝑁(𝑖)} (1 − 𝑥_𝑗) for all 𝑖 ∈ [𝑛].
Then
P(none of the events 𝐴_𝑖 occur) ≥ ∏_{𝑖=1}^{𝑛} (1 − 𝑥_𝑖).
Like earlier, by setting 𝑥𝑖 = 1/(𝑑 + 1), we deduce a symmetric version that is easier to
apply.
The (di)graph where 𝑁 (𝑖) is the set of (out-)neighbors of 𝑖 is called a negative depen-
dency (di)graph.
Remark 6.5.3 (Important!). Just as with the usual local lemma, the negative dependency graph is not constructed by simply checking pairs of events.
The hypothesis of Theorem 6.5.1 seems annoying to check. Fortunately, many applications of the lopsided local lemma fall within a model that we will soon describe, where there is a canonical negative dependency graph that is straightforward to construct. This is analogous to the random variable model for the usual local lemma, where the canonical dependency graph makes two events adjacent if they share variables.
The following result shows that the above canonical negative dependency graph is valid for the lopsided local lemma (Theorem 6.5.1).
where the sum is taken over all |𝑌|(|𝑌| − 1) · · · (|𝑌| − |𝑋| + 1) complete matchings 𝑇 from 𝑋₀ to 𝑌 (which we denote by 𝑇 : 𝑋₀ ↩→ 𝑌), and
RHS = P(𝐴_{𝐹₀}) = 1/|{𝑇 : 𝑋₀ ↩→ 𝑌}|.
Latin transversal
A Latin square of order 𝑛 is an 𝑛 × 𝑛 array filled with 𝑛 symbols so that every symbol
appears exactly once in every row and column. Example:
1 2 3
2 3 1
3 1 2
These objects are called Latin squares because they were studied by Euler (1707–1783)
who used Latin symbols to fill the arrays.
Given an 𝑛 × 𝑛 array, a transversal is a set of 𝑛 entries with one in every row and
column. A Latin transversal is a transversal with distinct entries. Example:
1 2 3
2 3 1
3 1 2
Here is a famous open conjecture about Latin transversals.¹ (Do you see why the hypothesis on parity is necessary?)
Remark 6.5.10. Keevash, Pokrovskiy, Sudakov and Yepremyan (2022) proved that every order-𝑛 Latin square contains a transversal containing all but 𝑂(log 𝑛/log log 𝑛) symbols, improving an earlier bound of 𝑂(log² 𝑛) by Hatami and Shor (2008).
Recently, Montgomery announced a proof of the conjecture for all sufficiently large
even 𝑛. The proof uses sophisticated techniques combining the semi-random method
and the absorption method.
The next result is the original application of the lopsided local lemma.
Proof. Pick a transversal uniformly at random. This is the same as picking a permuta-
tion 𝑓 : [𝑛] → [𝑛] uniformly at random. In Setup 6.5.4, the random injection model,
transversals correspond to perfect matchings.
For each pair of equal entries in the array not both lying in the same row or column,
consider the bad event that the transversal contains both entries.
The canonical negative dependency graph is obtained by joining an edge between two
bad events if the four entries involved share some row or column.
1 Not to be confused with another conjecture also known as Ryser’s conjecture concerning an inequality
between the covering number and the matching number of multipartite hypergraphs, as a generaliza-
tion of König’s theorem. See Best and Wanless (2018) for a historical commentary and a translation
of Ryser’s 1967 paper.
Let us count neighbors in this negative dependency graph. Fix a pair of equal entries in the array. Their rows and columns span fewer than 4𝑛 entries, and for each such entry 𝑧, there are at most 𝑛/(4𝑒) − 1 choices for another entry equal to 𝑧. Thus the maximum degree in the canonical negative dependency graph is
≤ (4𝑛 − 4)(𝑛/(4𝑒) − 1) ≤ 𝑛(𝑛 − 1)/𝑒 − 1.
We can now apply the symmetric lopsided local lemma to conclude that with positive
probability, none of the bad events occur. □
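(A quick check, added here: the middle step expands as (4𝑛 − 4)(𝑛/(4𝑒) − 1) = 𝑛(𝑛 − 1)/𝑒 − (4𝑛 − 4) ≤ 𝑛(𝑛 − 1)/𝑒 − 1, and each bad event has probability 𝑝 = 1/(𝑛(𝑛 − 1)) since a uniform random permutation passes through two specified non-conflicting entries with probability (𝑛 − 2)!/𝑛!; so 𝑒𝑝(𝑑 + 1) ≤ 𝑒 · (1/(𝑛(𝑛 − 1))) · (𝑛(𝑛 − 1)/𝑒) = 1, as the symmetric lopsided local lemma requires.)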
Remark 6.6.1 (Too hard in general). The Moser–Tardos algorithm works in the random variable model (there is subsequent work concerning other models, such as the random injection model). Some assumption on the model is necessary, since the problem can be computationally hard in general.
For example, let 𝑞 = 2^𝑘, and let 𝑓 : [𝑞] → [𝑞] be some fixed bijection (with an explicit description and easy to compute). Consider the computational task of inverting 𝑓: given 𝑦 ∈ [𝑞], find 𝑥 such that 𝑓(𝑥) = 𝑦 (we would like an algorithm with running time polynomial in 𝑘).
If 𝑥 ∈ [𝑞] is chosen uniformly, then 𝑓 (𝑥) ∈ [𝑞] is also uniform. For each 𝑖 ∈ [𝑘], let
𝐴_𝑖 be the event that 𝑓(𝑥) and 𝑦 disagree on the 𝑖-th bit. Then 𝐴₁, . . . , 𝐴_𝑘 are independent
events. Also, 𝑓 (𝑥) = 𝑦 if and only if no event 𝐴𝑖 occurs. So a trivial version of the
local lemma (with empty dependency graph) implies the existence of some 𝑥 such that
𝑓 (𝑥) = 𝑦.
On the other hand, it is believed that there exist functions 𝑓 that are easy to compute but hard to invert. Such functions are called one-way functions, and they are a fundamental building block in cryptography. For example, let 𝑔 be a multiplicative generator of F_𝑞, and let 𝑓 : F_𝑞 → F_𝑞 be given by 𝑓(0) = 0 and 𝑓(𝑥) = 𝑔^𝑥 for 𝑥 ≠ 0. Then inverting
𝑓 is the discrete logarithm problem, which is believed to be computationally difficult.
The computational difficulty of this problem is the basis for the security of important
public key cryptography schemes, such as the Diffie–Hellman key exchange.
Moser–Tardos algorithm
The algorithm is surprisingly simple.
(We can make the algorithm more precise by specifying a way to pick an “arbitrary” choice, e.g., the
lexicographically first choice.)
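Since the resampling loop is easiest to see as code, here is a minimal sketch of the Moser–Tardos algorithm in the random variable model (our rendering; the interface names below are ours, not from the text):

import random

def moser_tardos(num_vars, sample, bad_events):
    # bad_events is a list of pairs (vbl, violated): vbl is the set of
    # variable indices the event depends on, and violated(x) reports
    # whether the event occurs under the assignment x
    x = [sample(i) for i in range(num_vars)]   # fresh random assignment
    while True:
        bad = [e for e in bad_events if e[1](x)]
        if not bad:
            return x                           # no bad event occurs
        vbl, _ = bad[0]                        # an arbitrary violated event
        for i in vbl:
            x[i] = sample(i)                   # resample only its variables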
We won’t prove the general theorem here. The proof in Moser and Tardos (2010) is
beautifully written and not too long. I highly recommend it reading it. In the next
subsection, we will prove the correctness of the algorithm in a special case using a
neat idea known as entropy compression.
Remark 6.6.4 (Las Vegas versus Monte Carlo). Here are some important classes of randomized algorithms:
• A Las Vegas (LV) algorithm always outputs a correct answer, but its running time is random.
• A Monte Carlo (MC) algorithm runs within a fixed time bound but is allowed to fail (or err) with small probability.
We are usually interested in randomized algorithms whose running time is small (e.g.,
at most a polynomial of the input size).
We can convert an efficient LV algorithm into an efficient MC algorithm as follows:
suppose the LV algorithm has expected running time 𝑇, and now we run the algorithm
but if it takes more than 𝐶𝑇 time, then halt and declare a failure. Markov’s inequality
then shows that the algorithm fails with probability ≤ 1/𝐶.
However, it is not always possible to convert an efficient MC algorithm into an efficient
LV algorithm. Starting with an MC algorithm, one might hope to repeatedly run it
until a correct answer has been found. However, there might not be an efficient way to
check the answer.
For example, consider the problem of finding a Ramsey coloring, specifically, 2-
edge-coloring of 𝐾𝑛 without a monochromatic clique of size ≥ 100 log2 𝑛. A uniform
random coloring works with overwhelming probability, as can be checked by a simple
union bound (see Theorem 1.1.2). However, we do not have an efficient way to check
whether the random edge-coloring indeed has the desired property. It is a major open
problem to find an LV algorithm for finding such an edge-coloring.
(𝑥 1 ∨ 𝑥 2 ∨ 𝑥 3 ) ∧ (𝑥 1 ∨ 𝑥2 ∨ 𝑥 4 ) ∧ (𝑥 2 ∨ 𝑥4 ∨ 𝑥 5 ) ∧ (𝑥 3 ∨ 𝑥5 ∨ 𝑥 6 ).
The problem is to find a satisfying assignment of the boolean variables so that the expression outputs TRUE.
(We can make the algorithm better defined by specifying a way to pick an “arbitrary” choice, e.g., the lexicographically first choice. Also, in Line 6, we allow taking 𝐷 = 𝐶.)
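Since the algorithm listing does not reproduce well here, the following is a self-contained sketch of Moser’s solve/fix procedure for 𝑘-SAT (our rendering; clauses are encoded DIMACS-style as lists of signed integers, a convention we choose purely for illustration). The comments indicate the numbered Lines referenced in the surrounding text.

import random

def satisfied(C, x):
    # literal +v means variable v, literal -v means its negation
    return any(x[abs(l)] == (l > 0) for l in C)

def shares_var(C, D):
    return bool({abs(l) for l in C} & {abs(l) for l in D})

def fix(C, x, clauses):
    for l in C:                                   # Line 5: resample the
        x[abs(l)] = random.random() < 0.5         # variables of C
    while True:                                   # Line 6: inner while loop
        D = next((D for D in clauses              # (taking D = C is allowed)
                  if shares_var(C, D) and not satisfied(D, x)), None)
        if D is None:
            return                                # while loop ends
        fix(D, x, clauses)                        # Line 7: recursive call

def solve(clauses, n):
    x = {v: random.random() < 0.5 for v in range(1, n + 1)}
    while True:
        C = next((C for C in clauses if not satisfied(C, x)), None)
        if C is None:
            return x                              # Line 2: nothing violated
        fix(C, x, clauses)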
Note that the Lovász local lemma guarantees the existence of a solution if each clause shares variables with at most 2^𝑘/𝑒 − 1 other clauses (each clause is violated with probability exactly 2^{−𝑘} under a uniform random assignment of the variables). So the theorem above is tight up to an unimportant constant factor.
Proof. Given an assignment of variables, by calling fix(𝐶) for any clause 𝐶, any
clause that was previously satisfied remains satisfied after the completion of the execu-
tion of fix(𝐶). Furthermore, 𝐶 becomes satisfied after the function call. Thus, once
fix(𝐶) is called, 𝐶 can never show up again as a violated clause in Line 2. □
It follows that the expected number of recursive calls to fix is 𝑛 + 𝑂 (1). Thus, in
the Moser algorithm (Algorithm 6.6.5), the expected total number of calls to fix is
𝑚𝑛 + 𝑂 (𝑚), where 𝑛 is the number of variables and 𝑚 is the number of clauses. This
proves the correctness of the algorithm (Theorem 6.6.6).
Proof. Let us formalize the randomness in the algorithm by first initializing a random string of bits. Specifically, let 𝑥 ∈ {0, 1}^{𝑘ℓ} be generated uniformly at random. Whenever a clause is resampled in Line 5, one replaces the variables in the clause by the next 𝑘 bits from 𝑥. Furthermore, if Line 7 is called for the ℓ-th time, we halt the algorithm and declare a failure (as we would have run out of random bits to resample had we continued).
At the same time, we keep an execution trace which keeps track of which clauses fix got called on, and also when the inner while loop in Line 6 ends. Note that the very first call to fix(𝐶₀) is not included in the execution trace, since it is already given as fixed and so we do not need to include this information. Here is an example of an execution trace, writing C7 for the 7th clause in the 𝑘-CNF:
fix(C7) called
fix(C4) called
fix(C7) called
while loop ended
fix(C2) called
while loop ended
while loop ended
...
For illustration, here is an example of how the clause variables could intersect:
C2: ****
C4: ****
C7: ****
Encoding the execution trace as a bit string. We fix at the beginning some canonical order of all clauses (e.g., lexicographic). It would be too expensive to refer to each clause by its absolute position in this order (this is an important point!). Instead, we note that every clause shares variables with at most 2^{𝑘−3} other clauses, and only these ≤ 2^{𝑘−3} clauses could be called in the inner while loop in Line 6. So we can record which one got called using a (𝑘 − 3)-bit string.
• fix(𝐷) called: suppose this was called inside an execution of fix(𝐶), and 𝐷 is the 𝑗-th clause among all clauses sharing a variable with 𝐶; then record in the execution trace bit string a 0 followed by exactly 𝑘 − 3 bits giving the binary representation of 𝑗 (prepended with zeros to get exactly 𝑘 − 3 bits).
• while loop ended: record a 1 in the execution trace bit string.
Note that one can recover the execution trace from the above bit string encoding.
Now, suppose the algorithm terminates as a failure due to fix being called the ℓ-th
time. Here is the key claim.
Key claim (recovering randomness). At the moment right before the ℓ-th recur-
sive call to fix on Line 7, we can completely recover 𝑥 from the current variable
assignments and the execution trace.
Note that all ℓ𝑘 random bits in 𝑥 have been used up at this point.
To see the key claim, note that from the execution trace, we can determine which clauses were resampled and in what order. Furthermore, if fix(𝐷) was called on Line 7, then 𝐷 must have been violated right before the call, and there is a unique possibility for the violating assignment to 𝐷 right before the call (e.g., if 𝐷 = 𝑥₁ ∨ 𝑥₂ ∨ 𝑥̄₃, then the only violating assignment is (𝑥₁, 𝑥₂, 𝑥₃) = (0, 0, 1)). We can then rewind history, and put the reassigned values to 𝐷 back into the random bit string 𝑥 to completely recover 𝑥.
How long can the execution bit string be? It has length ≤ ℓ(𝑘 − 1). Indeed, each of the
≤ ℓ recursive calls to fix produces 𝑘 − 2 bits for the call to fix and 1 bit for ending
the while loop. So the total number of possible execution strings is ≤ 2ℓ(𝑘−1)+1 (the
+1 accounts for variable lengths, though it can removed with a more careful analysis).
Thus, the key claim implies that each 𝑥 ∈ {0, 1}ℓ𝑘 that leads to a failed execution
produces a unique pair (variable assignment, execution bit string). Thus
P(≥ ℓ recursive calls to fix) 2ℓ𝑘 = |{𝑥 ∈ {0, 1}𝑛 leading to failure}| ≤ 2𝑛 2ℓ(𝑘−1)+1 .
Remark 6.6.9 (Entropy compression). Tao use the phrase “entropy compression” to
describe this argument. The intuition is that the recoverability of the random bit string
102
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
𝑥 means that we are somehow “compressing” a ℓ𝑘-bit random string into a shorter
length losslessly, but that would be impossible. Each call to fix uses up 𝑘 random bits
and converts it to 𝑘 − 1 bits to the execute trace (plus at most 𝑛 bits of extra information,
namely the current variables assignment, and this is viewed as a constant amount of
information), and this conversion is reversible. So we are “compressing entropy.” The
conservation of information tells us that we cannot losslessly compress 𝑘 random bits
to 𝑘 − 1 bits for very long.
Remark 6.6.10 (Relationship between the two proofs of the local lemma?). The
above proof, along with extensions of these ideas in Moser and Tardos (2010), seems
to give a completely different proof of the local lemma than the one we saw at the
beginning of the chapter. Is there some way to relate these seemingly completely
different proofs? Are they secretly the same proof? We do not know. This is an
interesting open-ended research problem.
Exercises
√
1. Show that it is possible to color the edges of 𝐾𝑛 with at most 3 𝑛 colors so that
there are no monochromatic triangles.
2. Prove that it is possible to color the vertices of every 𝑘-uniform 𝑘-regular hyper-
graph using at most 𝑘/log 𝑘 colors so that every color appears at most 𝑂 (log 𝑘)
times on each edge.
3. ★ Hitting thin rectangles. Prove that there is a constant 𝐶 > 0 so that for every
sufficiently small 𝜀 > 0, one can choose exactly one point inside each grid square
[𝑛, 𝑛 + 1) × [𝑚, 𝑚 + 1) ⊂ R2 , 𝑚, 𝑛 ∈ Z, so that every rectangle of dimensions
𝜀 by (𝐶/𝜀) log(1/𝜀) in the plane (not necessarily axis-aligned) contains at least
one chosen point.
4. List coloring. Prove that there is some constant 𝑐 > 0 so that given a graph and
a set of 𝑘 acceptable colors for each vertex such that every color is acceptable
for at most 𝑐𝑘 neighbors of each vertex, there is always a proper coloring where
every vertex is assigned one of its acceptable colors.
5. Prove that, for every 𝜀 > 0, there exist ℓ0 and some (𝑎 1 , 𝑎 2 , . . . ) ∈ {0, 1}N
such that for every ℓ > ℓ0 and every 𝑖 > 1, the vectors (𝑎𝑖 , 𝑎𝑖+1 , . . . , 𝑎𝑖+ℓ−1 ) and
(𝑎𝑖+ℓ , 𝑎𝑖+ℓ+1 , . . . , 𝑎𝑖+2ℓ−1 ) differ in at least ( 21 − 𝜀)ℓ coordinates.
6. Avoiding periodically colored paths. Prove that for every Δ, there exists 𝑘 so
that every graph with maximum degree at most Δ has a vertex-coloring using 𝑘
colors so that there is no path of the form 𝑣 1 𝑣 2 . . . 𝑣 2ℓ (for any positive integer
103
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
ℓ) where 𝑣 𝑖 has the same color as 𝑣 𝑖+ℓ for each 𝑖 ∈ [ℓ]. (Note that vertices on a
path must be distinct.)
7. Prove that every graph with maximum degree Δ can be properly edge-colored
using 𝑂 (Δ) colors so that every cycle contains at least three colors.
(An edge-coloring is proper if it never assigns the same color to two edges sharing a vertex.)
8. ★ Prove that for every Δ, there exists 𝑔 so that every bipartite graph with maximum
degree Δ and girth at least 𝑔 can be properly edge-colored using Δ + 1 colors so
that every cycle contains at least three colors.
9. ★ Prove that for every positive integer 𝑟, there exists 𝐶𝑟 so that every graph with
maximum degree Δ has a proper vertex coloring using at most 𝐶𝑟 Δ1+1/𝑟 colors
so that every vertex has at most 𝑟 neighbors of each color.
10. Vertex-disjoint cycles in digraphs. (Recall that a directed graph is 𝑘-regular if
all vertices have in-degree and out-degree both equal to 𝑘. Also, cycles cannot
repeat vertices.)
a) Prove that every 𝑘-regular directed graph has at least 𝑐𝑘/log 𝑘 vertex-
disjoint directed cycles, where 𝑐 > 0 is some constant.
b) ★ Prove that every 𝑘-regular directed graph has at least 𝑐𝑘 vertex-disjoint
directed cycles, where 𝑐 > 0 is some constant.
Hint: split in two and iterate
11. a) Generalization of Cayley’s formula. Using Prüfer codes, prove the identity
∑︁
𝑥1 𝑥 2 · · · 𝑥 𝑛 (𝑥 1 + · · · + 𝑥 𝑛 ) 𝑛−2 = 𝑥1𝑑𝑇 (1) 𝑥 2𝑑𝑇 (2) · · · 𝑥 𝑛𝑑𝑇 (𝑛)
𝑇
where the sum is over all trees 𝑇 on 𝑛 vertices labeled by [𝑛] and 𝑑𝑇 (𝑖) is
the degree of vertex 𝑖 in 𝑇.
b) Let 𝐹 be a forest with vertex set [𝑛], with components having 𝑓1 , . . . , 𝑓𝑠
vertices so that 𝑓1 + · · · + 𝑓𝑠 = 𝑛. Prove that the number of trees on the
( 𝑓𝑖 /𝑛 𝑓𝑖 −1 ).
Î𝑠
vertex set [𝑛] that contain 𝐹 is exactly 𝑛𝑛−2 𝑖=1
c) Independence property for uniform spanning tree of 𝐾𝑛 . Show that if 𝐻1
and 𝐻2 are vertex-disjoint subgraphs of 𝐾𝑛 , then for a uniformly random
spanning tree 𝑇 of 𝐾𝑛 , the events 𝐻1 ⊆ 𝑇 and 𝐻2 ⊆ 𝑇 are independent.
d) ★ Packing rainbow spanning trees. Prove that there is a constant 𝑐 > 0
so that for every edge-coloring of 𝐾𝑛 where each color appears at most
𝑐𝑛 times, there exist at least 𝑐𝑛 edge-disjoint spanning trees, where each
spanning tree has all its edges colored differently.
104
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
(In your submission, you may assume previous parts without proof.)
105
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
7 Correlation Inequalities
P( 𝐴𝐵) ≥ P( 𝐴)P(𝐵).
Remark 7.1.3 (Percolation). Many of such inequalities were initially introduced for
the study of percolations. A classic setting of this problem takes place in infinite
grid with vertices Z2 with edges connecting adjacent vertices at distance 1. Suppose
107
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
7 Correlation Inequalities
we keep each edge of this infinite grid with probability 𝑝 independently, what is the
probability that the origin is part of an infinite component (in which case we say
that there is “percolation”)? This is supposed to an idealized mathematical model
of how a fluid permeates through a medium. Harris showed that with probability 1,
percolation does not occur for 𝑝 ≤ 1/2. A later breakthrough of Kesten (1980) shows
that percolation occurs with probability 1 for all 𝑝 > 1/2. Thus the “bond percolation
threshold” for Z2 is exactly 1/2. Such exact results are extremely rare.
We state and prove a more general result, which says that independent random variables
possess positive association.
Let each Ω𝑖 be a linearly ordered set (i.e., {0, 1}, R) and 𝑥𝑖 ∈ Ω𝑖 with respect to some
probability distribution independent for each 𝑖. We say that a function 𝑓 (𝑥 1 , . . . , 𝑥 𝑛 )
is monotone increasing if
E[ 𝑓 𝑔] ≥ (E 𝑓 )(E𝑔).
This version of Harris inequality implies the earlier version by setting 𝑓 = 1 𝐴 and
𝑔 = 1𝐵 .
108
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
R by
Thus
Proof. For the second inequality, note that the complement 𝐵 is increasing, so
Harris
P( 𝐴𝐵) = P( 𝐴) − P( 𝐴𝐵) ≤ P( 𝐴) − P( 𝐴)P(𝐵) = P( 𝐴)P(𝐵).
The proof of the first inequality is similar. For the last inequality we apply the Harris
inequality repeatedly. □
109
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
7 Correlation Inequalities
Question 7.2.1
What’s the probability that 𝐺 (𝑛, 𝑝) is triangle-free?
Harris inequality will allow us to prove a lower bound. In the next chapter, we will use
Janson inequalities to derive upper bounds.
Theorem 7.2.2
P(𝐺 (𝑛, 𝑝) is triangle-free) ≥ (1 − 𝑝 3 ) ( 3)
𝑛
Proof. For each triple of distinct vertices 𝑖, 𝑗, 𝑘 ∈ [𝑛], the event that 𝑖 𝑗 𝑘 does not form
a triangle is a decreasing event (here the ground set is the set of edges of the complete
graph on 𝑛). So by Harris’ inequality,
©Û
P(𝐺 (𝑛, 𝑝) is triangle-free) = P {𝑖 𝑗 𝑘 not a triangle}®
ª
Ö«𝑖< 𝑗 <𝑘 ¬
P(𝑖 𝑗 𝑘 not a triangle) = (1 − 𝑝 3 ) ( 3) .
𝑛
≥ □
𝑖< 𝑗 <𝑘
3
Remark 7.2.3. How good is this bound? For 𝑝 ≤ 0.99, we have 1 − 𝑝 3 = 𝑒 −Θ( 𝑝 ) , so
the above bound gives
3 𝑝3 )
P(𝐺 (𝑛, 𝑝) is triangle-free) ≥ 𝑒 −Θ(𝑛 .
The bound from Harris is better when 𝑝 ≪ 𝑛−1/2 . Putting them together, we obtain
( 3 𝑝3 )
𝑒 −Θ(𝑛 if 𝑝 ≲ 𝑛−1/2
P(𝐺 (𝑛, 𝑝) is triangle-free) ≳ 2 𝑝)
𝑒 −Θ(𝑛 if 𝑛−1/2 ≲ 𝑝 ≤ 0.99
(note that the asymptotics agree at the boundary 𝑝 ≍ 𝑛−1/2 ). In the next chapter, we
will prove matching upper bounds using Janson inequalities.
110
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Maximum degree
Question 7.2.4
What’s the probability that the maximum degree of 𝐺 (𝑛, 1/2) is at most 𝑛/2?
For each vertex 𝑣, deg(𝑣) ≤ 𝑛/2 is a decreasing event with probability just slightly
over 1/2. So by Harris inequality, the probability that every 𝑣 has deg(𝑣) ≤ 𝑛/2 is at
least ≥ 2−𝑛 .
It turns out that the appearance of high degree vertices is much more correlated than
the independent case. The truth is exponentially more than the above bound.
Instead of giving a proof, we consider an easier continuous model of the problem that
motivates the numerical answer. Building on this intuition, Riordan and Selby (2000)
proved the result in the random graph setting, although this is beyond the scope of this
class.
In a random graphs, we assign independent Bernoulli random variables on edges of a
complete graph. Instead, let us assign independent standard normal random variables
to each edge of the complete graph.
The event 𝑊𝑣 ≤ 0 is supposed to model the event that the degree at vertex 𝑣 is less
than 𝑛/2. Of course, other than intuition, there is no justification here that these two
models should behave similarly
We have P(𝑊𝑣 ≤ 0) = 1/2. Since each {𝑊𝑣 ≤ 0} is a decreasing event of the
independent edge labels, Harris’ inequality tells us that
Proof sketch of Proposition 7.2.6. The tuple (𝑊𝑣 )𝑣∈[𝑛] has a joint normal distribution,
with each coordinate variance 𝑛 − 1 and pairwise covariance 1. So (𝑊𝑣 )𝑣∈[𝑛] has the
111
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
7 Correlation Inequalities
same distribution as
√
𝑛 − 2(𝑍1′ , 𝑍2′ , . . . , 𝑍𝑛′ ) + 𝑍0′ (1, 1, . . . , 1)
We can estimate the above integral for large 𝑛 using the Laplace method (which can be
justified rigorously by considering Taylor expansion around the maximum of 𝑓 ). We
have
𝑦2
𝑓 (𝑦) ≈ 𝑔(𝑦) := − + log Φ (𝑦)
2
and we can deduce that
∫
1 1
lim log P(max 𝑊𝑣 ≤ 0) = lim log 𝑒 𝑛 𝑓 (𝑦) 𝑑𝑦 = max 𝑔 = log 0.6102 · · · . □
𝑛→∞ 𝑛 𝑣∈[𝑛] 𝑛→∞ 𝑛
Exercises
1. Let 𝐺 = (𝑉, 𝐸) be a graph. Color every edge with red or blue independently
and uniformly at random. Let 𝐸 0 be the set of red edges and 𝐸 1 the set of blue
edges. Let 𝐺 𝑖 = (𝑉, 𝐸𝑖 ) for each 𝑖 = 0, 1. Prove that
112
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Let 𝐻𝑚,𝑛 denote the random subgraph of 𝐺 𝑚,𝑛 obtained by keeping every edge
with probability 1/2 independently.
Let RSW (𝑘) denote the following statement: there exists a constant 𝑐 𝑘 > 0 such
that for all positive integers 𝑛, P(𝐻 𝑘𝑛,𝑛 has a horizontal crossing) ≥ 𝑐 𝑘 .
a) Prove RSW (1).
b) Prove that RSW (2) implies RSW (100).
c) ★★ (Very challenging) Prove RSW (2).
4. Let 𝐴 and 𝐵 be two independent increasing events of independent random
variables. Prove that there are two disjoint subsets 𝑆 and 𝑇 of these random
variables so that 𝐴 depends only on 𝑆 and 𝐵 depends only on 𝑇.
5. Let 𝑈1 and 𝑈2 be increasing events and 𝐷 a decreasing event of independent
Boolean random variables. Suppose 𝑈1 and 𝑈2 are independent. Prove that
P(𝑈1 |𝑈2 ∩ 𝐷) ≤ P(𝑈1 |𝑈2 ).
6. Coupon collector. Let 𝑠1 , . . . , 𝑠𝑚 be independent random elements in [𝑛] (not
necessarily uniform or identically distributed; chosen with replacement) and
𝑆 = {𝑠1 , . . . , 𝑠𝑚 }. Let 𝐼 and 𝐽 be disjoint subsets of [𝑛]. Prove that P(𝐼 ∪ 𝐽 ⊆
𝑆) ≤ P(𝐼 ⊆ 𝑆)P(𝐽 ⊆ 𝑆).
7. ★ Prove that there exist 𝑐 < 1 and 𝜀 > 0 such that if 𝐴1 , . . . , 𝐴 𝑘 are increasing
events of independent Boolean random variables with P( 𝐴𝑖 ) ≤ 𝜀 for all 𝑖, then,
letting 𝑋 denote the number of events 𝐴𝑖 that occur, one has P(𝑋 = 1) ≤ 𝑐.
(Give your smallest 𝑐. It is conjectured that any 𝑐 > 1/𝑒 works.)
8. ★ Disjoint containment. Let S and T each be a collection of subsets of [𝑛].
Let 𝑅 ⊆ [𝑛] be a random subset where each element is included independently
(not necessarily with the same probability). Let 𝐴 be the event that 𝑆 ⊆ 𝑅 for
some 𝑆 ∈ S. Let 𝐵 be the event that 𝑇 ⊆ 𝑅 for some 𝑇 ∈ T . Let 𝐶 denote
the event there exist disjoint 𝑆, 𝑇 ⊆ 𝑅 with 𝑆 ∈ S and 𝑇 ∈ T . Prove that
P(𝐶) ≤ P( 𝐴)P(𝐵).
113
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
8 Janson Inequalities
115
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
8 Janson Inequalities
Remark 8.1.3. When P( 𝐴𝑖 ) = 𝑜(1) (which is the case in a typical application), Harris’
inequality gives us
𝑘
Ö
P(𝑋 = 0) = P 𝐴1 · · · 𝐴 𝑘 ≥ P 𝐴𝑖
𝑖=1
𝑘 𝑘
!
Ö ∑︁
= (1 − P( 𝐴𝑖 )) = exp −(1 + 𝑜(1)) P( 𝐴𝑖 ) = 𝑒 −(1+𝑜(1))𝜇 .
𝑖=1 𝑖=1
In the setting where Δ = 𝑜(𝜇), two bounds match to give P(𝑋 = 0) = 𝑒 −(1+𝑜(1)𝜇 .
Proof. Let
𝑟𝑖 = P( 𝐴𝑖 | 𝐴1 · · · 𝐴𝑖−1 ).
We have
P(𝑋 = 0) = P( 𝐴1 · · · 𝐴 𝑘 )
= P( 𝐴1 )P( 𝐴2 | 𝐴1 ) · · · P( 𝐴 𝑘 | 𝐴1 · · · 𝐴 𝑘−1 )
= (1 − 𝑟 1 ) · · · (1 − 𝑟 𝑘 )
≤ 𝑒 −𝑟1 −···−𝑟 𝑘
and thus !
∑︁ Δ
P(𝑋 = 0) ≤ exp − 𝑟𝑖 ≤ exp −𝜇 +
𝑖
2
116
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Then
P( 𝐴𝑖 𝐷 0 𝐷 1 ) P( 𝐴𝑖 𝐷 0 𝐷 1 )
𝑟𝑖 = P( 𝐴𝑖 | 𝐴1 · · · 𝐴𝑖−1 ) = P( 𝐴𝑖 |𝐷 0 𝐷 1 ) = ≥
P(𝐷 0 𝐷 1 ) P(𝐷 0 )
= P( 𝐴𝑖 𝐷 1 |𝐷 0 ) = P( 𝐴𝑖 |𝐷 0 ) − P( 𝐴𝑖 𝐷 1 |𝐷 0 )
= P( 𝐴𝑖 ) − P( 𝐴𝑖 𝐷 1 |𝐷 0 ) [by independence]
Since 𝐴𝑖 and 𝐷 1 are both increasing events, and 𝐷 0 is a decreasing event, by Harris’
inequality (Corollary 7.1.6),
!
Ü ∑︁
P( 𝐴𝑖 𝐷 1 |𝐷 0 ) ≤ P( 𝐴𝑖 𝐷 1 ) = P 𝐴𝑖 ∧ 𝐴𝑗 ≤ P( 𝐴𝑖 𝐴 𝑗 )
𝑗 <𝑖: 𝑗∼𝑖 𝑗 <𝑖: 𝑗∼𝑖
This concludes the proof of the claim, and thus the proof of the theorem. □
Remark 8.1.4 (History). Janson’s original proof was via analytic interpolation. The
above proof is based on Boppana and Spencer (1989) with a modification by Warnke
(personal communication). It has some similarities to the proof of Lovász local lemma
from Section 6.1. The above proof incorporates ideas from Riordan and Warnke
(2015), who extended Janson’s inequality from principal up-set to general up-sets.
Indeed, the above proof only requires that the events 𝐴𝑖 are increasing, whereas earlier
proofs of the result (e.g., the proof in Alon–Spencer) requires the full assumption of
Setup 8.1.1, namely that each 𝐴𝑖 is an event of the form 𝑆𝑖 ⊆ 𝑅𝑖 (i.e., a principal
up-set).
Question 8.1.5
What is the probability that 𝐺 (𝑛, 𝑝) is triangle-free?
In Setup 8.1.1, let [𝑁] with 𝑁 = 𝑛2 be the set of edges of 𝐾𝑛 , and let 𝑆1 , . . . , 𝑆 ( 𝑛) be
3
3-element sets where each 𝑆𝑖 is the edge-set of a triangle. As in the second moment
calculation in Section 4.2, we have
𝑛 3
𝜇= 𝑝 ≍ 𝑛3 𝑝 3 and Δ ≍ 𝑛4 𝑝 5 .
3
117
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
8 Janson Inequalities
If 𝑝 ≪ 𝑛−1/2 , then Δ = 𝑜(𝜇), in which case Janson inequality I (Theorem 8.1.2 and
Remark 8.1.3) gives the following.
Theorem 8.1.6
If 𝑝 = 𝑜(𝑛−1/2 ) , then
3 𝑝 3 /6
P(𝐺 (𝑛, 𝑝) is triangle-free) = 𝑒 −(1+𝑜(1))𝜇 = 𝑒 −(1+𝑜(1))𝑛 .
Corollary 8.1.7
For a constant 𝑐 > 0,
3 /6
lim P(𝐺 (𝑛, 𝑐/𝑛) is triangle-free) = 𝑒 −𝑐 .
𝑛→∞
(the second step assumes that 𝑝 is bounded away from 1. If 𝑝 ≫ 𝑛−1/2 , so the above
2
lower bound better than the previous one: 𝑒 −Θ(𝑛 𝑝) ≫ 𝑒 −(1+𝑜(1))𝜇 .
Nevertheless, we’ll still use Janson to bootstrap an upper bound on the triangle-free
probability. More generally, the next theorem works in the complement region of the
Janson inequality I, where now Δ ≥ 𝜇.
The proof idea is to applying the first Janson inequality on a randomly sampled subset
of events. This sampling technique might remind you of some earlier proofs, e.g.,
the proof of the crossing number inequality (Theorem 2.6.2), where we first proved a
118
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
“cheap bound” that worked in a more limited range, and then used sampling to obtain
a better bound.
Í
Proof. For each 𝑇 ⊆ [𝑘], let 𝑋𝑇 := 𝑖∈𝑇 1 𝐴𝑖 denote the number of occurring events in
𝑇. We have
P(𝑋 = 0) ≤ P(𝑋𝑇 = 0) ≤ 𝑒 −𝜇𝑇 +Δ𝑇 /2
where ∑︁
𝜇𝑇 = P( 𝐴𝑖 )
𝑖∈𝑇
and ∑︁
Δ𝑇 = P( 𝐴𝑖 𝐴 𝑗 )
(𝑖, 𝑗)∈𝑇 2 :𝑖∼ 𝑗
and so
E(−𝜇𝑇 + Δ𝑇 /2) = −𝑞𝜇 + 𝑞 2 Δ/2.
By linearity of expectations, thus there is some choice of 𝑇 ⊆ [𝑘] so that
so that
2 Δ/2
P(𝑋 = 0) ≤ 𝑒 −𝑞𝜇+𝑞
for every 𝑞 ∈ [0, 1]. Since Δ ≥ 𝜇, we can set 𝑞 = 𝜇/Δ ∈ [0, 1] to get the result. □
Let us revisit the example of estimating the probability that 𝐺 (𝑛, 𝑝) is triangle-free,
119
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
8 Janson Inequalities
𝑛3 𝑝 3 ≍ 𝜇 ≪ Δ ≍ 𝑛4 𝑝 5 .
Since
2 𝑝)
P(𝐺 (𝑛, 𝑝) is triangle-free) ≥ P(𝐺 (𝑛, 𝑝) is empty) ≥ (1 − 𝑝) ( 2) = 𝑒 −Θ(𝑛
𝑛
where the final step assumes that 𝑝 is bounded away from 1, we conclude that
2 𝑝)
P(𝐺 (𝑛, 𝑝) is triangle-free) = 𝑒 −Θ(𝑛
We summarize the results below (strictly speaking we have not yet checked the case
𝑝 ≍ 𝑛−1/2 , which we can verify by applying Janson inequalities; note that the two
regimes below match at the boundary).
Theorem 8.1.10
Suppose 𝑝 = 𝑝 𝑛 ≤ 0.99. Then
(
exp −Θ(𝑛2 𝑝) if 𝑝 ≳ 𝑛−1/2
P(𝐺 (𝑛, 𝑝) is triangle-free) =
exp −Θ(𝑛3 𝑝 3 ) if 𝑝 ≲ 𝑛−1/2
Remark 8.1.11. Sharper results are known. Here are some highlights.
2
1. The number of triangle-free graphs on 𝑛 vertices is 2 (1+𝑜(1))𝑛 /4 . In fact, an even
stronger statement is true: almost all (i.e., 1−𝑜(1) fraction) 𝑛-vertex triangle-free
graphs are bipartite (Erdős, Kleitman, and Rothschild 1976).
√︁ √
2. If 𝑚 ≥ 𝐶𝑛3/2 log 𝑛 for any constant 𝐶 > 3/4 (and this is best possible), then
almost all all 𝑛-vertex 𝑚-edge triangle-free graphs are bipartite (Osthus, Prömel,
and Taraz 2003). This result has been extended to 𝐾𝑟 -free graphs for every fixed
𝑟 (Balogh, Morris, Samotij, and Warnke 2016).
3. For 𝑛−1/2 ≪ 𝑝 ≪ 1, (Łuczak 2000)
This result was generalized to general 𝐻-free graphs using the powerful recent
method of hypergraph containers (Balogh, Morris, and Samotij 2015).
120
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Question 8.2.1
Fix a constant 0 < 𝛿 ≤ 1. Let 𝑋 be the number of triangles of 𝐺 (𝑛, 𝑝). Estimate
P(𝑋 ≤ (1 − 𝛿)E𝑋).
−𝑡 2
P(𝑋 ≤ 𝜇 − 𝑡) ≤ exp
2(𝜇 + Δ)
Note that setting 𝑡 = 𝜇 we basically recover the first two Janson inequalities (up to an
unimportant constant factor in the exponent):
−𝜇2
P(𝑋 = 0) ≤ exp . (8.1)
2(𝜇 + Δ)
(Note that this form of the inequality conveniently captures Janson inequalities I & II.)
Proof. (by Lutz Warnke1 ) We start the proof similarly to the proof of the Chernoff
bound, by applying Markov’s inequality on the moment generating function. To that
end, let 𝜆 ≥ 0 to be optimized later. Let
𝑞 = 1 − 𝑒 −𝜆 .
By Markov’s inequality,
−𝜆𝑋 −𝜆(𝜇−𝑡)
P(𝑋 ≤ 𝜇 − 𝑡) = P 𝑒 ≥𝑒
≤ 𝑒𝜆(𝜇−𝑡) E 𝑒 −𝜆𝑋
≤ 𝑒𝜆(𝜇−𝑡) E[(1 − 𝑞) 𝑋 ].
1 Personal communication
121
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
8 Janson Inequalities
For each 𝑖 ∈ [𝑘], let 𝑊𝑖 ∼ Bernoulli(𝑞) independently. Consider the random variable
𝑘
∑︁
𝑌= 1 𝐴𝑖 𝑊𝑖 .
𝑖=1
Conditioned on the value of 𝑋, the probability that 𝑌 = 0 is (1−𝑞) 𝑋 (i.e., the probability
that 𝑊𝑖 = 0 for each of the 𝑋 events 𝐴𝑖 that occurred). Taking expectation over 𝑋, we
have
P(𝑌 = 0) = E[P(𝑌 = 0|𝑋)] = E[(1 − 𝑞) 𝑋 ].
Note that 𝑌 fits within Setup 8.1.1 by introducing 𝑘 new elements to the ground set
[𝑁], where each new element is included according to 𝑊𝑖 , and enlarging each 𝑆𝑖 to
include this new element. The relevant parameters of 𝑌 are
𝜇𝑌 := E𝑌 = 𝑞𝜇
and ∑︁
Δ𝑌 := E[1 𝐴𝑖 𝑊𝑖 1 𝐴 𝑗 𝑊 𝑗 ] = 𝑞 2 Δ.
(𝑖, 𝑗):𝑖∼ 𝑗
Therefore,
2 Δ/2
E[(1 − 𝑞) 𝑋 ] = P(𝑌 = 0) ≤ 𝑒 −𝑞𝜇+𝑞 .
Continuing the moment calculation at the beginning of the proof, and using that
𝜆2
𝜆− ≤ 𝑞 ≤ 𝜆,
2
we have
Example 8.2.3 (Lower tails for triangle counts). Let 𝑋 be the number of triangles in
122
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
𝐺 (𝑛, 𝑝). We have 𝜇 ≍ 𝑛3 𝑝 3 and Δ ≍ 𝑛4 𝑝 5 . Fix a constant 𝛿 ∈ (0, 1]. Let 𝑡 = 𝛿E𝑋.
We have
(
exp −Θ𝛿 (𝑛2 𝑝) if 𝑝 ≳ 𝑛−1/2 ,
−𝛿2 𝑛6 𝑝 6
P(𝑋 ≤ (1 − 𝛿)E𝑋) ≤ exp −Θ 3 3 =
𝑛 𝑝 + 𝑛4 𝑝 5 exp −Θ𝛿 (𝑛3 𝑝 3 ) if 𝑝 ≲ 𝑛−1/2 .
Example 8.2.4 (No corresponding Janson inequality for upper tails). Continuing
with 𝑋 being the number of triangles of 𝐺 (𝑛, 𝑝), from on the above lower tail results,
we might expect P(𝑋 ≥ (1 + 𝛿)E𝑋) ≤ exp(−Θ𝛿 (𝑛2 𝑝)), but actually this is false!
By planting a clique of size Θ(𝑛𝑝), we can force 𝑋 ≥ (1 + 𝛿)E𝑋. Thus
2 𝑝2 )
P(𝑋 ≥ (1 + 𝛿)E𝑋) ≥ 𝑝 Θ 𝛿 (𝑛
which is much bigger than exp −Θ(𝑛2 𝑝) . The above is actually the truth (Kahn–
2 𝑝2 ) log 𝑛
P(𝑋 ≥ (1 + 𝛿)E𝑋) = 𝑝 Θ 𝛿 (𝑛 if 𝑝 ≳ ,
𝑛
but the proof is much more intricate. Recent results allow us to understand the exact
constant in the exponent though new developments in large deviation theory. The
current state of knowledge is summarized below.
𝛿 𝛿2/3 2 2
− log P(𝑋 ≥ (1 + 𝛿)E𝑋) ∼ min , 𝑛 𝑝 log(1/𝑝),
3 2
𝛿2/3 2 2
− log P(𝑋 ≥ (1 + 𝛿)E𝑋) ∼ 𝑛 𝑝 log(1/𝑝).
2
Remark 8.2.6. The leading constants were determined by Lubetzky and Zhao (2017)
by solving an associated variational problem. Earlier results, starting with Chatter-
jee and Varadhan (2011) and Chatterjee and Dembo (2016) prove large deviation
123
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
8 Janson Inequalities
frameworks that gave the above theorem for sufficiently slowly decaying 𝑝 ≥ 𝑛−𝑐 .
For the corresponding problem for lower tails, see Kozma and Samotij (2023) for an
approach using relative entropy that reduces the rate problem to a variational problem.
The exact leading constant is known only for sufficiently small 𝛿 > 0, where the answer
is given by “replica symmetry”, meaning that the exponential rate is given by a uniform
decrement in edge densities for the random graph. In contrast, for 𝛿 close to 1, we
expect (though cannot prove) that the typical structure of a conditioned random graph
is close to a two-block model (Zhao 2017).
Question 8.3.1
What is the chromatic number of 𝐺 (𝑛, 1/2)?
In Section 4.4, we used the second moment method to find the clique number 𝜔 of
𝐺 (𝑛, 1/2). We saw that, with probability 1 − 𝑜(1), the clique number is concentrated
on two values, and in particular,
The independence number 𝜶(𝑮) is the size of the largest independent set in 𝐺. The
independence number 𝛼(𝐺) is the equal to the clique number the complement of
𝐺. Since 𝐺 (𝑛, 1/2) and its graph complement have the same distribution, we have
𝛼(𝐺 (𝑛, 1/2)) ∼ 2 log2 𝑛 whp as well.
Using the following lower bound on the chromatic number 𝜒(𝐺):
|𝑉 (𝐺)|
𝜒(𝐺) ≥
𝛼(𝐺)
(1 + 𝑜(1))𝑛
𝜒(𝐺 (𝑛, 1/2)) ≥ whp.
log2 𝑛
The following landmark theorem shows that the above lower bound on 𝜒(𝐺 (𝑛, 1/2))
is asymptotically tight.
124
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Recall that 𝜔(𝐺 (𝑛, 1/2)) is typically concentrated around the point 𝑘 where the ex-
pected number of 𝑘-cliques 𝑛𝑘 2− ( 2) is neither too large nor too close to zero. The
𝑘
next lemma show that this probability drops very quickly when we decrease 𝑘 even by
a constant.
Lemma 8.3.3
𝑛 − ( 𝑘2 )
Let 𝑘 0 = 𝑘 0 (𝑛) be the largest possible integer 𝑘 so that 𝑘 2 ≥ 1. Then
2−𝑜(1)
P(𝛼(𝐺 (𝑛, 1/2)) < 𝑘 0 − 3) ≤ 𝑒 −𝑛
Note that there is a trivial lower bound of 2− ( 2) coming from an empty graph.
𝑛
𝑛 − ( 𝑘2 )
Let 𝜇 𝑘 := 𝑘 2 . For 𝑘 ∼ 𝑘 0 (𝑛) ∼ 2 log2 𝑛, we have
𝑛
𝜇 𝑘+1 𝑘+1 −𝑘 𝑛 −(2+𝑜(1)) log2 𝑛 1
= 𝑛 2 ∼ 2 = 1−𝑜(1) .
𝜇𝑘 𝑘
𝑘 𝑛
Let 𝑘 = 𝑘 0 − 3 and applying Setup 8.1.1 for Janson inequality with 𝑋 being the number
of 𝑘-cliques, we have
𝜇 = 𝜇 𝑘 > 𝑛3−𝑜(1)
and (details of the computation omitted)
𝑘4
Δ ∼ 𝜇2 = 𝑛4−𝑜(1) .
𝑛2
So Δ > 𝜇 for sufficiently large 𝑛, and we can apply Janson inequality II:
2−𝑜(1)
P(𝜔(𝐺 (𝑛, 1/2)) < 𝑘) = P(𝑋 = 0) ≤ 𝑒 −𝑛 . □
Proof of Theorem 8.3.2. The lower bound proof was discussed before the theorem
statement. For the upper bound we will give a strategy to properly color the random
graph with (2 + 𝑜(1)) log2 𝑛 colors. We will proceed by taking out independent sets of
125
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
8 Janson Inequalities
size ∼ 2 log2 𝑛 iteratively until 𝑜(𝑛/log 𝑛) vertices remain, at which point we can use
a different color for each remaining vertex.
Note that after taking out the first independent set of size ∼ 2 log2 𝑛, we cannot claim
that the remaining graph is still distributed as 𝐺 (𝑛, 1/2). It is not. Our selection of the
vertices was dependent on the random graph. We are not allowed to “resample” the
edges after the first selection.
The strategy is to apply the previous lemma to see that every large enough subset of
vertices has an independent set of size ∼ 2 log2 𝑛.
Let 𝐺 ∼ 𝐺 (𝑛, 1/2). Let 𝑚 = 𝑛/(log 𝑛) 2 , say. For any set 𝑆 of 𝑚 vertices, the
induced subgraph 𝐺 [𝑆] has the distribution of 𝐺 (𝑚, 1/2). By Lemma 8.3.3, for
we have
2−𝑜 (1) 2−𝑜(1)
P(𝛼(𝐺 [𝑆]) < 𝑘) = 𝑒 −𝑚 = 𝑒 −𝑛 .
Taking a union bound over all 𝑚𝑛 < 2𝑛 such sets 𝑆,
2−𝑜(1)
P(there is an 𝑚-vertex subset 𝑆 with 𝛼(𝐺 [𝑆]) < 𝑘) < 2𝑛 𝑒 −𝑛 = 𝑜(1).
Exercises
1. 3-AP-free probability. Determine, for all 0 < 𝑝 ≤ 0.99 (𝑝 is allowed to depend
on 𝑛), the probability that [𝑛] 𝑝 does not contain a 3-term arithmetic progression,
up to a constant factor in the exponent. (The form of the answer should be similar
126
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
to the conclusion in class about the probability that 𝐺 (𝑛, 𝑝) is triangle-free. See
3 for notation.)
2. Prove that with probability 1 − 𝑜(1), the size of the largest subset of vertices of
𝐺 (𝑛, 1/2) inducing a triangle-free subgraph is Θ(log 𝑛).
3. Nearly perfect triangle factor, again. Using Janson inequalities this time, give
another solution to Problem 11 in the following generality.
a) Prove that for every 𝜀 > 0, there exists 𝐶𝜀 > 0 such that such that with
probability 1 − 𝑜(1), 𝐺 (𝑛, 𝐶𝜀 𝑛−2/3 ) contains at least (1/3 − 𝜀)𝑛 vertex-
disjoint triangles.
b) (Optional) Compare the the dependence of the optimal 𝐶𝜀 on 𝜀 you obtain
using the method in Problem 11 versus this problem (don’t worry about
leading constant factors).
4. ★Threshold for extensions. Show that for every constant 𝐶 > 16/5, if 𝑛2 𝑝 5 >
𝐶 log 𝑛, then with probability 1 − 𝑜(1), every edge of 𝐺 (𝑛, 𝑝) is contained in a
𝐾4 .
(Be careful, this event is not increasing, and so it is insufficient to just prove the result for one
specific 𝑝.)
5. Lower tails of small subgraph counts. Fix graph 𝐻 and 𝛿 ∈ (0, 1]. Let 𝑋𝐻 denote
the number of copies of 𝐻 in 𝐺 (𝑛, 𝑝). Prove that for all 𝑛 and 0 < 𝑝 < 0.99,
′ ′
P(𝑋𝐻 ≤ (1 − 𝛿)E𝑋𝐻 ) = 𝑒 −Θ𝐻, 𝛿 (Φ𝐻 ) where Φ𝐻 := min ′ 𝑛𝑣(𝐻 ) 𝑝 𝑒(𝐻 ) .
𝐻 ′ ⊆𝐻:𝑒(𝐻 )>0
Here the hidden constants in Θ𝐻,𝛿 may depend on 𝐻 and 𝛿 (but not on 𝑛 and 𝑝).
6. ★ List chromatic number of a random graph. Show that the list chromatic number
of 𝐺 (𝑛, 1/2) is (1 + 𝑜(1)) 2 log𝑛 𝑛 with probability 1 − 𝑜(1).
2
The list-chromatic number (also called choosability) of a graph 𝐺 is defined to the minimum 𝑘
such that if every vertex of 𝐺 is assigned a list of 𝑘 acceptable colors, then there exists a proper
coloring of 𝐺 where every vertex is colored by one of its acceptable colors.
127
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
In this chapter, we develop tools for proving similar tail bounds for other random
variables that do not necessarily arise as a sum of independent random variables.
The next theorem says:
A Lipschitz function of many independent random variables is con-
centrated.
We will prove the following important and useful result, known by several names:
McDiarmid’s inequality, Azuma–Hoeffding inequality, and bounded differences
inequality.
Example 9.1.2 (Coupon collector). Let 𝑠1 , . . . , 𝑠𝑛 ∈ [𝑛] chosen uniformly and inde-
129
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
with 𝑛
1 𝑛−1 𝑛
E𝑍 = 𝑛 1 − ∈ , .
𝑛 𝑒 𝑒
Theorem 9.1.1 holds more generally allowing the bounded difference to depend on the
coordinate.
Definition 9.2.1
A martingale is a random real sequence 𝑍0 , 𝑍1 , . . . such that for every 𝑍𝑛 , E|𝑍𝑛 | < ∞
and
E[𝑍𝑛+1 |𝑍0 , . . . , 𝑍𝑛 ] = 𝑍𝑛 .
Example 9.2.2 (Random walks with independent steps). If (𝑋𝑖 )𝑖≥0 is a sequence of
130
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Í
independent random variables with E𝑋𝑖 = 0 for all 𝑖, then the partial sums 𝑍𝑛 = 𝑖≤𝑛 𝑋𝑖
is a Martingale.
Example 9.2.3 (Betting strategy). Betting on a sequence of fair coin tosses. After
round, you are allow to change your bet. Let 𝑍𝑛 be your balance after the 𝑛-th round.
Then 𝑍𝑛 is always a martingale regardless of your strategy.
Originally, the term “martingale” referred to the betting strategy where one doubles
the bet each time until the first win and then stop betting. Then, with probability 1,
𝑍𝑛 = 1 for all sufficiently large 𝑛. (Why does this “free money” strategy not actually
work?)
Example 9.2.5 (Edge-exposure martingale). We can reveal the random graph 𝐺 (𝑛, 𝑝)
by first fixing an order on all unordered pairs of [𝑛] and then revealing in order whether
each pair is an edge. For any graph parameter 𝑓 (𝐺) we can produce a martingale
𝑋0 , 𝑋1 , . . . , 𝑋 ( 𝑛) where 𝑍𝑖 is the conditional expectation of 𝑓 (𝐺 (𝑛, 𝑝)) after revealing
2
whether there are edges for first 𝑖 pairs of vertices. See Figure 9.1 for an example.
The main result is that a martingale with bounded differences must be concen-
trated. The following fundamental result is called Azuma’s inequality or the Azuma–
Hoeffding inequality.
131
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
3 3
2.5
2 2
2.25 2.25
2 2
2
2 2
2 2
2 2
2
2 2
1.75 1.75
2 2
1.5
1 1
Note that this is the same bound that we derived in Chapter 5 for 𝑍𝑛 = 𝑋1 + · · · 𝑋𝑛
where 𝑋𝑖 ∈ {−1, 1} uniform and iid.
More generally, allowing different bounds on different steps of the martingale, we have
the following.
The above formulations of Azuma’s inequality can be used to recover the bounded
differences inequality (Theorems 9.1.1 and 9.1.3) up to a usually unimportant constant
in the exponent. To obtain the exact statement of Theorem 9.1.3, we state the following
132
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
strengthening of Azuma’s inequality. (You are welcome to ignore the next statement
if you do not care about the constant factor in the exponent — and really, you should
not care.)
Remark 9.2.10. Applying the inequality to the martingale with terms −𝑍 𝑛 , we obtain
the following lower tail bound:
!
−2𝜆2
P(𝑍𝑛 − 𝑍0 ≤ −𝜆) ≤ exp 2 .
𝑐 1 + · · · + 𝑐2𝑛
Remark 9.2.11. Theorem 9.2.8 is a special case of Theorem 9.2.9, since we can take
(𝑋1 , . . . , 𝑋𝑛 ) = (𝑍1 . . . , 𝑍𝑛 ) and 𝑓 (𝑋1 , . . . , 𝑋𝑛 ) = 𝑋𝑛 . Note that the |𝑍𝑖 − 𝑍𝑖−1 | ≤ 𝑐𝑖
condition in Theorem 9.2.8 implies that 𝑍𝑖 lies in an interval of length 2𝑐𝑖 if we
condition on (𝑋1 , . . . , 𝑋𝑖−1 ).
133
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
Since E𝑋 = 0, we obtain
𝑏 𝑎 −𝑎 𝑏
E𝑒 𝑋 ≤ 𝑒 + 𝑒 .
𝑏−𝑎 𝑏−𝑎
Let 𝑝 = −𝑎/(𝑏 − 𝑎). Then 𝑎 = −𝑝ℓ and 𝑏 = (1 − 𝑝)ℓ. So
log E𝑒 𝑋 ≤ log (1 − 𝑝)𝑒 −𝑝ℓ + 𝑝𝑒 (1−𝑝)ℓ = −𝑝ℓ + log(1 − 𝑝 + 𝑝𝑒 ℓ ).
Iterating, we obtain h i 2 2 2
E 𝑒 𝑡 (𝑍𝑛 −𝑍0 ) ≤ 𝑒 𝑡 (𝑐1 +···𝑐 𝑛 )/8 .
By Markov,
𝑡2 2
h i 2
P(𝑍𝑛 − 𝑍0 ≥ 𝜆) ≤ 𝑒 −𝑡𝜆 E 𝑒 𝑡 (𝑍𝑛 −𝑍0 ) ≤ 𝑒 −𝑡𝜆+ 8 (𝑐1 +···𝑐 𝑛 ) .
Proof of the bounded differences inequality (Theorem 9.1.3). Consider the Doob mar-
134
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Remark 9.2.13. Azuma’s inequality (Theorem 9.2.9) is more versatile than (Theo-
rem 9.1.3). For example, while changing 𝑋𝑖 might change 𝑓 (𝑋1 , . . . , 𝑋𝑛 ) by a lot in
the worst case over all possible (𝑋1 , . . . , 𝑋𝑛 ), it might not change it by much in expec-
tation over random choices of (𝑋𝑖+1 , . . . , 𝑋𝑛 ). And so the 𝑐𝑖 in Theorem 9.2.9 could
potentially be smaller than in Theorem 9.1.3. This will be useful in some applications,
including one that we will see later in the chapter.
Proof. Let 𝑉 = [𝑛], and consider each vertex labeled graph as an element of Ω2 ×
· · · × Ω𝑛 where Ω𝑖 = {0, 1}𝑖−1 and its coordinates correspond to edges whose larger
coordinate is 𝑖 (cf. the vertex-exposure martingale Example 9.2.6). If two graphs 𝐺
and 𝐺 ′ differ only in edges incident to one vertex 𝑣, then | 𝜒(𝐺) − 𝜒(𝐺 ′)| ≤ 1 since,
given a proper coloring of 𝐺 using 𝜒(𝐺) colors, one can obtain a proper coloring
of 𝐺 ′ using 𝜒(𝐺) + 1 colors by using a new color for 𝑣. Theorem 9.1.3 implies the
result. □
135
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
2−𝑜(1)
P(𝜔(𝐺 (𝑛, 1/2)) < 𝑘 0 − 3) = 𝑒 −𝑛 .
136
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Taking expectations of both sides, and noting that E |𝑉 (𝐻 ′)| = 𝑞 |𝑉 (𝐻)| and
E |𝐸 (𝐻 ′)| = 𝑞 2 |𝐸 (𝐻)| by linearity of expectations, we have
Provided that |𝐸 (𝐻)| ≥ |𝑉 (𝐻)| /2, we can take 𝑞 = |𝑉 (𝐻)| /(2 |𝐸 (𝐻)|) ∈ [0, 1]
and obtain
|𝑉 (𝐻)| 2 1
𝛼(𝐻) ≥ if |𝐸 (𝐻)| ≥ |𝑉 (𝐻)| .
4 |𝐸 (𝐻)| 2
(This method allows us to recover Turán’s theorem up to a factor of 2, whereas
the Caro–Wei inequality recovers Turán’s theorem exactly. For the present
application, we do not care about these constant factors.)
By a second moment argument (details again omitted, like in the proofs of Theo-
rem 4.4.2 and Lemma 8.3.3), we have, with probability 1 − 𝑜(1), that the number of
𝑘-cliques in 𝐺 is
𝑛 − ( 𝑘)
|𝑉 (H )| ∼ E |𝑉 (H )| = 2 2 = 𝑛3−𝑜(1)
𝑘
and the number of unordered pairs of edge-overlapping 𝑘-cliques in 𝐺 is
E |𝐸 (H )| = 𝑛4−𝑜(1) .
Thus, with probability 1 − 𝑜(1), we can apply either of the above lower bounds on
independent sets to obtain
|𝑉 (H )| 2 𝑛6−𝑜(1) 𝑛6−𝑜(1)
E𝑌 ≳ E ≳E ≥ = 𝑛2−𝑜(1) .
|𝐸 (H )| |𝐸 (H )| E |𝐸 (H )|
2−𝑜 (1)
Together with (9.1), this completes the proof that P(𝜔(𝐺) < 𝑘) = 𝑒 −𝑛 . □
137
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
Proof. It suffices to show that for all 𝜀 > 0, there exists 𝑢 = 𝑢(𝑛, 𝑝, 𝜀) so that, provided
𝑝 < 𝑛−𝛼 and 𝑛 is sufficiently large,
−2(E𝑌 ) 2
−2𝜆2
𝑒 = 𝜀 < P( 𝜒(𝐺) ≤ 𝑢) = P(𝑌 = 0) = P(𝑌 ≤ E𝑌 − E𝑌 ) ≤ exp .
𝑛
Thus
√
E𝑌 ≤ 𝜆 𝑛.
Next, we apply the upper tail bound to show that 𝑌 is rarely large. We have
√ √ 2
P(𝑌 ≥ 2𝜆 𝑛) ≤ P(𝑌 ≥ E𝑌 + 𝜆 𝑛) ≤ 𝑒 −2𝜆 = 𝜀.
Each of the following three events occur with probability at least 1− 𝜀, for large enough
𝑛,
√
• By the above argument, there is some 𝑆 ⊆ 𝑉 (𝐺) with |𝑆| ≤ 2𝜆 𝑛 and 𝐺 − 𝑆
138
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Lemma 9.3.5
Fix 𝛼 > 5/6 and 𝐶. Let 𝑝 ≤ 𝑛−𝛼 . Then with probability 1 − 𝑜(1) every subset of at
√
most 𝐶 𝑛 vertices of 𝐺 (𝑛, 𝑝) can be properly 3-colored.
Proof. Let 𝐺 ∼ 𝐺 (𝑛, 𝑝). Assume that 𝐺 is not 3-colorable. Choose minimum size
𝑇 ⊆ 𝑉 (𝐺) so that the induced subgraph 𝐺 [𝑇] is not 3-colorable.
We see that 𝐺 [𝑇] has minimum degree at least 3, since if deg𝐺 [𝑇] (𝑥) < 3, then 𝑇 − 𝑥
cannot be 3-colorable either (if it were, then can extend coloring to 𝑥), contradicting
the minimality of 𝑇.
Thus 𝐺 [𝑇] has at least 3|𝑇 |/2 edges. The probability that 𝐺 has some induced subgraph
√
on 𝑡 ≤ 𝐶 𝑛 vertices and ≥ 3𝑡/2 edges is, by a union bound, (recall 𝑛𝑘 ≤ (𝑛𝑒/𝑘) 𝑘 )
√ √
∑︁𝑛
𝐶 𝑡 𝐶 𝑛
𝑛 2 3𝑡/2
∑︁ 𝑛𝑒 𝑡 𝑡𝑒 3𝑡/2 −3𝑡𝛼/2
≤ 𝑝 ≤ 𝑛
𝑡=4
𝑡 3𝑡/2 𝑡=4
𝑡 3
√ √
𝐶
∑︁𝑛 √ 𝑡 𝐶∑︁𝑛 𝑡
≤ 𝑂 (𝑛1−3𝛼/2 𝑡) ≤ 𝑂 (𝑛1−3𝛼/2+1/4 ) .
𝑡=4 𝑡=4
Remark 9.3.6. Theorem 9.3.4 was subsequently improved (by a refinement of the
above techniques) by Łuczak (1991) and Alon and Krivelevich (1997). We now know
that the chromatic number of 𝐺 (𝑛, 𝑛−𝛼 ) has two-point concentration for all 𝛼 > 1/2.
139
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
Euclidean space
The classic isoperimetric theorem in R𝑛 says that that among all subset of R𝑛 of given
volume, the ball has the smallest surface volume. (The word “isoperimetric” refers to
fixing the perimeter; equivalently we fix the surface area and ask to maximize volume.)
This result (at least in two-dimensions) was known to the Greeks, but rigorous proofs
were only found in towards the end of the nineteenth century.
Let (𝑋, 𝑑 𝑋 ) be a metric space. Let 𝐴 ⊆ 𝑋. For any 𝑥 ∈ 𝑋, write 𝑑 𝑋 (𝑥, 𝐴) :=
inf 𝑎∈𝐴 𝑑 𝑋 (𝑥, 𝑎) for the distance from 𝑥 to 𝐴. Denote the set of all points within
distance 𝑡 from 𝐴 by
𝐴𝑡 := {𝑥 ∈ 𝑋 : 𝑑 𝑋 (𝑥, 𝐴) ≤ 𝑡} (9.1)
This is also known as the radius-𝒕 neighborhood of 𝑨. One can visualize 𝐴𝑡 by
“expanding” 𝐴 by distance 𝑡.
Remark 9.4.2. A clean way to prove the above inequality is via the Brunn–Minkowski
theorem.
Classically, the isoperimetric inequality is stated as (here 𝜕 𝐴 is the boundary of 𝐴)
These two formulations are equivalent. Indeed, assuming Theorem 9.4.1, we have
𝑑 vol 𝐴𝑡 − vol 𝐴
vol𝑛−1 𝜕 𝐴 = vol𝑛 𝐴𝑡 = lim
𝑑𝑡 𝑡=0 𝑡→0 𝑡
vol 𝐵𝑡 − vol 𝐵
≥ lim = vol𝑛−1 𝜕𝐵.
𝑡→0 𝑡
Conversely, we can obtain the neighborhood version from the boundary version by
integrating (noting that 𝐵𝑡 is always a ball).
140
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
The cube
We have an analogous result in the {0, 1}𝑛 with respect to Hamming distance.In
Hamming cube, Harper’s theorem gives the exact result. Below, for 𝐴 ⊆ {0, 1}𝑛 , we
write 𝐴𝑡 as in (9.1) for 𝑋 = {0, 1}𝑛 and 𝑑 𝑋 being the Hamming distance.
Remark 9.4.4. The above statement is tight when 𝐴 has the same size as a Hamming
ball, i.e., when | 𝐴| = 𝑛0 + 𝑛1 + · · · + 𝑛𝑘 for some integer 𝑘. Actually, more is true.
For any value of | 𝐴| and 𝑡, the size of 𝐴𝑡 is minimized by taking 𝐴 to be an initial
segment of {0, 1}𝑛 according to the simplicial ordering: first sort by Hamming weight,
and for ties, sort by lexicographic order. For more on this topic, particularly extremal
set theory, see the book Combinatorics by Bollobás (1986).
Combined with the isoperimetic inequality on the cube, we obtain the following
surprising consequence. Suppose we start with just half of the cube, and then expand
it by a bit (recall that the diameter of the cube is 𝑛, and we will be expanding it by
𝑜(𝑛)), then resulting expansion occupies nearly all of the cube.
Proof. Let 𝐵 = {𝑥 ∈ {0, 1} 𝑛 : weight(𝑥) < 𝑛/2}, so that |𝐵| ≤ 2𝑛−1 ≤ | 𝐴|. Then by
Harper’s theorem (Theorem 9.4.3),
2 /𝑛
| 𝐴𝑡 | ≥ |𝐵𝑡 | = |{𝑥 ∈ {0, 1}𝑛 : weight(𝑥) < 𝑛/2 + 𝑡}| > (1 − 𝑒 −2𝑡 )2𝑛
In fact, using the above, we can deduce that even if we start with a small fraction (e.g.,
1%) of the cube, and expand it slightly, then we would cover most of the cube.
141
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
𝐴𝐶 √𝑛 ≥ (1 − 𝜀)2𝑛 .
√︁ 2
First proof via Harper’s isoperimetric inequality. Let 𝑡 = log(1/𝜀)𝑛/2 so that 𝑒 −2𝑡 /𝑛 =
𝜀. Applying Theorem 9.4.5 to 𝐴′ = {0, 1}𝑛 \ 𝐴𝑡 , we see that | 𝐴′ | < 2𝑛−1 (or else
𝐴′𝑡 > (1 − 𝜀)2𝑛 , so 𝐴′𝑡 would intersect 𝐴, which is impossible since the distance be-
tween 𝐴 and 𝐴′ is greater than 𝑡). Thus | 𝐴𝑡 | ≥ 2𝑛−1 , and then applying Theorem 9.4.5
yields | 𝐴2𝑡 | ≥ (1 − 𝜀)2𝑛 . □
Let us give another proof of Theorem 9.4.6 without using Harper’s exact isoperimetric
theorem in the Hamming cube, and instead use the bounded differences inequality that
we proved earlier.
Second proof via the bounded differences inequality. Pick a uniform random 𝑥 ∈
{0, 1}𝑛 and let 𝑋 = dist(𝑥, 𝐴). Note that 𝑋 changes by at most 1 if a single coor-
dinate of 𝑥 is changed. Applying the bounded differences inequality, Theorem 9.1.1,
we have the lower tail
2 /𝑛
P(𝑋 − E𝑋 ≤ −𝜆) ≤ 𝑒 −2𝜆 for all 𝜆 ≥ 0
Thus √︂ √
log(1/𝜀)𝑛 𝐶 𝑛
E𝑋 ≤ = .
2 2
Now we apply the upper tail of the bounded differences inequality
2 /𝑛
P(𝑋 − E𝑋 ≥ 𝜆) ≤ 𝑒 −2𝜆 for all 𝜆 ≥ 0
to yield √
√
𝐶 𝑛
P(𝑥 ∉ 𝐴𝐶 √𝑛 ) = P(𝑋 > 𝐶 𝑛) ≤ P 𝑋 ≥ E𝑋 + ≤ 𝜀. □
2
142
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
P( 𝐴𝑡 ) ≥ 1 − 𝜀.
P( 𝑓 > 𝑚 + 𝑡) ≤ 𝜀.
P( 𝑓 > 𝑚 + 𝑡) ≤ P( 𝐴𝑡 ) ≤ 𝜀.
143
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
P( 𝐴𝑡 ) = P( 𝑓 > 𝑚 + 𝑡) ≤ 𝜀. □
Informally, we say that a space (or rather, a sequence of spaces), has concentration
of measure if 𝜀 decays rapidly as a function of 𝑡 in the above theorem (the notion of
“Lévy family” makes this precise). Earlier we saw that the Hamming cube exhibits has
concentration of measure. Other notable spaces with concentration of measure include
the sphere, Gauss space, orthogonal and unitary groups, postively-curved manifolds,
and the symmetric group.
The sphere
We discuss analogs of the concentration of measure phenomenon in high dimensional
geometry. This is rich and beautiful subject. An excellent introductory to this topic is
the survey An Elementary Introduction to Modern Convex Geometry by Ball (1997).
Recall the isoperimetric inequality in R𝑛 says:
If 𝐴 ⊆ R𝑛 has the same measure as ball 𝐵, then vol( 𝐴𝑡 ) ≥ vol(𝐵𝑡 ) for all
𝑡 ≥ 0.
Analogous exact isoperimetric inequalities are known in several other spaces. We
already saw it for the boolean cube (Theorem 9.4.3). The case of sphere and Gaussian
space are particularly noteworthy. The following theorem is due to Lévy (∼1919).
We have the following upper bound estimate on the size of spherical caps.
The following proof (including figures) is taken from Tokz (2012), building on the
method by Ball (1997).
144
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Proof. Let 𝐶 denote the spherical cap consisting of unit vectors 𝑥 with 𝑥 1 ≥ 𝜀. Write
𝐶e for the convex hull of 𝐶 with the origin, i.e., the conical sector spanned by 𝐶. The
e in a ball of radius 𝑟 ≤ 𝑒 −𝜀2 /2 . Writing 𝐵(𝑟) for a ball of radius 𝑟
idea is to contain 𝐶
in R𝑛 so that, we have
vol𝑛−1 𝐶 vol𝑛 𝐶
e vol𝑛 𝐵(𝑟) 2
= = = 𝑟 𝑛 ≤ 𝑒 −𝜀 𝑛/2 .
vol𝑛−1 𝑆 𝑛−1 vol𝑛 𝐵𝑛 (1) vol𝑛 𝐵(1)
√
Case 1: 𝜀 ∈ [0, 1/ 2].
"2
1
P
p
Bn (0, 1)
Cone
√
As shown above, 𝐶e is contained in a ball of radius 𝑟 = 1 − 𝜀 2 ≤ 𝑒 −𝜀2 /2 .
√
Case 2: 𝜀 ∈ [1/ 2, 1].
r Q
Co
ne
Combining the above two theorems, we deduce the following concentration of measure
results.
145
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
vol𝑛−1 ( 𝐴𝑡 ) 2
≥ 1 − 𝑒 −𝑛𝑡 /4 .
vol𝑛−1 (𝑆 )
𝑛−1
Gauss space
Another related setting is the Gauss space, which is R𝑛 equipped with the the proba-
bility measure 𝛾𝑛 induced by the Gaussian random vector whose coordinates are 𝑛 iid
standard normals, i.e., the normal random vector in R𝑛 with covariance matrix 𝐼𝑛 . Its
2
probability density function of 𝛾𝑛 at 𝑥 ∈ R𝑛 is (2𝜋) −𝑛 𝑒 −|𝑥| /2 . The metric on R𝑛 is the
usual Euclidean metric.
What would an isoperimetric inequality in Gauss space look like?
Although earlier examples of isoperimetric optimizers were all balls, for the Gauss
space, the answer is actually a half-spaces, i.e., points on one side of some hyperplane.
The Gaussian isoperimetric inequality, below, was first shown independently by Borell
(1975) and Sudakov and Tsirel’son (1974).
146
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
2 /2
If 𝐻 = {𝑥 1 ≤ 0}, then 𝐻𝑡 = {𝑥 1 ≤ 𝑡}, which has Gaussian measure ≥ 1 − 𝑒 −𝑡 . Thus:
Random Gaussian vectors often yield easier calculations due to coordinate indepen-
dence, and so they often give an accessible way to analyze random unit vectors.
Note that how a half-space in the Gauss space intersect the sphere in a spherical cap,
with both italicized objects being isoperimetric optimizers in their respective spaces.
147
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
Sub-Gaussian distributions
We introduce some terminology that captures notions we have seen so far. It will also
be convenient for later discussions.
Remark 9.4.18. This definition is not standard. Some places say 𝜎 2 -subGaussian for
what we mean by 𝜎-subGaussian.
Usually we will not worry about constant factors. Thus, saying that a family of random
variables 𝑋𝑛 is 𝑂 (𝐾𝑛 )-subGaussian about its mean is the same as saying that there
exist constant 𝐶, 𝑐 > 0 such that
2 /𝐾 2
P (|𝑋𝑛 − E𝑋𝑛 | ≥ 𝑡) ≤ 𝐶𝑒 −𝑐𝑡 𝑛 for all 𝑡 ≥ 0 and 𝑛.
Also note that, up to changing the constants 𝑐, 𝐶, the definition does not change if we
replace E𝑋𝑛 by a median of 𝑋𝑛 above.
The following lemma shows that for subGaussian random variables, it does not matter
much if we define the tails around its median, mean, or root-mean-square.
148
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
|M𝑋 − 𝑚| ≤ 𝐶𝐾.
(d) For every constant 𝐴 there exists a constant 𝑐 > 0 so that if |𝑚′ − 𝑚| ≤ 𝐴𝐾, then
2 /𝐾 2
P(|𝑋 − 𝑚′ | ≥ 𝑡) ≤ 2𝑒 −𝑐𝑡 for all 𝑡 ≥ 0.
2
(c) We can make 𝑐 small enough so that 𝑅𝐻𝑆 = 2𝑒 −𝑐𝑡 ≥ 1 for 𝑡 ≤ 2𝐴. For 𝑡 > 2𝐴,
149
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
we note that
2 /4
P(|𝑋 − 𝑚′ | ≥ 𝑡) ≤ P(|𝑋 − 𝑚| ≥ 𝑡/2) ≤ 2𝑒 −𝑡 . □
Johnson–Lindenstrauss Lemma
Given a set of 𝑁 points in high-dimensional Euclidean space, the next result tells us
that one can embed them in 𝑂 (𝜀 −2 log 𝑁) dimensions without sacrificing pairwise
distances by more than 1 ± 𝜀 factor. This is known as dimension reduction. It is an
important tool in many areas, from functional analysis to algorithms.
Remark 9.4.23. Here the requirement 𝑑 > 𝐶𝜀 −2 log 𝑁 on the dimension is optimal
up to a constant factor (Larsen and Nelson 2017).
√︁
We will take 𝑓 to be 𝑚/𝑑 times an orthogonal projection onto a 𝑑-dimensional
subspace chosen uniformly at random. The theorem then follows from the following
lemma together with a union bound.
150
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
To prove the Theorem 9.4.22, for each pair of distinct points 𝑥, 𝑥 ′ ∈ 𝑋, set
𝑥 − 𝑥′ | 𝑓 (𝑥) − 𝑓 (𝑥 ′)|
√︂
𝑚
𝑧= , so that 𝑌= .
|𝑥 − 𝑥 ′ | 𝑑 |𝑥 − 𝑥 ′ |
Provided that 𝐶 > 1/𝑐, we can take a union bound over all 𝑁2 < 𝑁 2 /2 pairs of points
Proof of the lemma. We have 𝑧 21 + · · · + 𝑧 2𝑛 = 1 and each 𝑧𝑖 has the same distribution,
so E[𝑧𝑖2 ] = 1/𝑚 for each 𝑖. Thus
𝑑
E[𝑌 2 ] = E 𝑧21 + · · · + 𝑧 2𝑑 = .
𝑚
Note that 𝑃 is 1-Lipschitz on the unit sphere. By Lévy’s concentration measure
theorem on the sphere, letting M𝑌 denote the median of 𝑌 ,
2 /4
P (|𝑌 − M𝑌 | ≥ 𝑡) ≤ 2𝑒 −𝑚𝑡 .
√︁
The result then follows by Lemma 9.4.20, using that ∥𝑌 ∥ 2 = 𝑑/𝑚. □
Corollary 9.4.25
2𝑑
There is a constant 𝑐 > 0 so that for every positive integer 𝑑, there is a set of 𝑒 𝑐𝜀
points in R𝑑 whose pairwise distances are in [1 − 𝜀, 1 + 𝜀].
151
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
Proof. Applying Theorem 9.4.22 a regular simplex with unit edge lengths with 𝑁
vertices in R𝑁−1 to yield 𝑁 points in R𝑑 for 𝑑 = 𝑂 (𝜀 −2 log 𝑁) and pairwise distances
in [1 − 𝜀, 1 + 𝜀]. □
Distance to a subspace
We start with a geometrically motivated question.
Problem 9.5.1
Let 𝑉 be a fixed 𝑑-dimensional subspace. Let 𝑥 ∼ Unif{−1, 1}𝑛 . How well is dist(𝑥, 𝑉)
concentrated?
So ∑︁
E[dist(𝑥, 𝑉) 2 ] = 𝑝𝑖𝑖 = tr 𝑃 = 𝑛 − 𝑑.
𝑖
√
How well is dist(𝑥, 𝑉) concentrated around 𝑛 − 𝑑?
Some easier special cases (codimension-1):
• If 𝑉 is a coordinate subspace, then dist(𝑥, 𝑉) is a constant not depending on 𝑥.
√
• If 𝑉 = (1, 1, . . . , 1) ⊥ , then dist(𝑥, 𝑉) = |𝑥 1 + · · · + 𝑥 𝑛 |/ 𝑛 which converge to |𝑍 |
for 𝑍 ∼ 𝑁 (0, 1). In particular, it is 𝑂 (1)-subGaussian.
152
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
dist(𝑥, 𝑉) = sup |𝛼 · 𝑥| .
𝛼∈𝑉 ⊥
|𝛼|=1
It is not clear how to apply the bounded difference inequality to all such 𝛼 in the above
supremum simultaneously.
The bounded difference inequality applied to the function 𝑥 ∈ {−1, 1}𝑛 ↦→ dist(𝑥, 𝑉),
which is 2-Lipschitz (with respect to Hamming distance), gives
2 /(2𝑛)
P (|dist(𝑥, 𝑉) − E dist(𝑥, 𝑉)| ≥ 𝑡) ≤ 2𝑒 −𝑡 ,
√
showing that dist(𝑥, 𝑉) is 𝑂 ( 𝑛)-subGaussian—but this is a pretty bad result, as
√
|dist(𝑥, 𝑉)| ≤ 𝑛 (half the length of the longest diagonal of the cube).
Perhaps the reason why the above bound is so poor is that the bounded difference
inequality is measuring distance in {−1, 1}𝑛 using the Hamming distance (ℓ1 ) whereas
we really care about the Euclidean distance (ℓ2 ).
If, instead of sampling 𝑥 ∈ {−1, 1}𝑛 , we took 𝑥 to be a uniformly random point on
√
the radius 𝑛 sphere in R𝑛 (which contains {−1, 1}𝑛 ), then Lévy concentration on
the sphere (Corollary 9.4.14) implies that dist(𝑥, 𝑉) is 𝑂 (1)-subGaussian. Perhaps a
similar bound holds when 𝑥 is chosen from {−1, 1}𝑛 ?
Here is a corollary of Talagrand’s inequality, which we will state in its general form
later.
Theorem 9.5.2
Let 𝑉 be a fixed 𝑑-dimensional subspace in R𝑛 . For uniformly random 𝑥 ∈ {−1, 1}𝑛 ,
one has √ 2
P | dist(𝑥, 𝑉) − 𝑛 − 𝑑| ≥ 𝑡 ≤ 𝐶𝑒 −𝑐𝑡 ,
where 𝐶, 𝑐 > 0 are some constants.
153
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
Remark 9.5.4. (1) Note that 𝐴 is a convex body in R𝑛 and not simply a set of points
in 𝐴.
2 /𝑛
(2) The bounded differences inequality gives us an upper bound of the form 𝑒 −𝑐𝑡 ,
which is much worse than Talagrand’s bound.
Remark 9.5.7. The proof below shows that the assumption that 𝑓 is convex can be
weakened to 𝑓 being quasiconvex, i.e., { 𝑓 ≤ 𝑎} is convex for every 𝑎 ∈ R.
Proof that Theorem 9.5.3 and Corollary 9.5.6 are equivalent. Theorem 9.5.3 implies
Corollary 9.5.6: take 𝐴 = {𝑥 : 𝑓 (𝑥) ≤ 𝑟 }. We have 𝑓 (𝑥) ≤ 𝑟 + 𝑡 whenever
154
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Let us write M𝑋 to be a median for the random variable 𝑋, i.e., a non-random real so
that P(𝑋 ≥ M𝑋) ≥ 1/2 and P(𝑋 ≤ M𝑋) ≥ 1/2.
Theorem 9.5.2 then follows. Indeed, Corollary 9.5.8 shows that dist(𝑥, 𝑉) (which is
a convex 1-Lipschitz function of 𝑥 ∈ R𝑛 ) is 𝑂 (1)-subGaussian, which immediately
implies Theorem 9.5.2.
Example 9.5.9 (Operator norm of a random matrix). Let 𝐴 be a random matrix whose
2
entries are uniform iid from {−1, 1}. Viewing 𝐴 ↦→ ∥ 𝐴∥ op as a function R𝑛 → R,
we see that it is convex (since the operator norm is a norm) and 1-Lipschitz (using
that ∥·∥ op ≤ ∥·∥ HS , where the latter is the Hilbert–Schmidt norm, also known as the
Frobenius norm, i.e., the ℓ2 -norm of the matrix entries). It follows by Talagrand’s
inequality (Corollary 9.5.8) that ∥ 𝐴∥ op is 𝑂 (1)-subGaussian about its mean.
Convex distance
Talagrand’s inequality has a much more general form, which has far-reaching combi-
natorial applications. We need a define a more subtle notion of distance.
We consider Ω = Ω1 × · · · × Ω𝑛 with product probability measure (i.e., independent
random variables).
155
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
For 𝐴 ⊆ Ω,
𝒅𝜶 (𝒙, 𝑨) := inf 𝑑𝛼 (𝑥, 𝑦).
𝑦∈𝐴
® ConvexHull 𝜙𝑥 ( 𝐴)).
𝑑𝑇 (𝑥, 𝐴) = dist( 0,
The general form of Talagrand’s inequality says the following. Note that it reduces to
the earlier special case Theorem 9.5.3 if Ω = {0, 1}𝑛 .
156
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Let us see how Talagrand’s inequality recovers a more general form of our geometric
inequalities from earlier, extending from independent boolean random variables to
independent bounded random variables.
First taking the infimum over all 𝑦 ∈ 𝐴, and then taking the supremum over unit vectors
𝛼, the LHS becomes dist(𝑥, ConvexHull 𝐴) and the RHS becomes 𝑑𝑇 (𝑥, 𝐴). □
Corollary 9.5.13 (Talagrand’s inequality: convex sets and convex Lipschitz func-
tions)
Let 𝑥 = (𝑥 1 , . . . , 𝑥 𝑛 ) ∈ [0, 1] 𝑛 be independent random variables (not necessarily
identical). Let 𝑡 ≥ 0. Let 𝐴 ⊆ [0, 1] 𝑛 be a convex set. Then
2 /4
P(𝑥 ∈ 𝐴)P(dist(𝑥, 𝐴) ≥ 𝑡) ≤ 𝑒 −𝑡
157
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
2 /𝐾 2
P (| 𝑓 − M 𝑓 | ≥ 𝑡) ≤ 4𝑒 −𝑡 where 𝐾 = 2 sup |𝛼(𝑥)| .
𝑥∈Ω
Remark 9.5.16. The vector 𝛼(𝑥) measures the resilience of 𝑓 (𝑥) under changing
some coordinates of 𝑥. It is important that we can choose a different weight 𝛼(𝑥) for
each 𝑥. In fact, if we do not let 𝛼(𝑥) change with 𝑥, then Theorem 9.5.14 recovers the
bounded differences inequality Theorem 9.1.3 up to an unimportant constant factor in
the exponent of the bound.
|𝛼(𝑥)| 𝑑𝑇 (𝑥, 𝐴) ≥ 𝑡.
So
𝑡 2𝑡
𝑑𝑇 (𝑥, 𝐴) ≥ ≥ .
|𝛼(𝑥)| 𝐾
And hence by Talagrand’s inequality Theorem 9.5.11,
2𝑡 2 2
P( 𝑓 ≤ 𝑟 − 𝑡)P( 𝑓 ≥ 𝑟) ≤ P(𝑥 ∈ 𝐴)P 𝑑𝑇 (𝑥, 𝐴) ≥ ≤ 𝑒 −𝑡 /𝐾 .
𝐾
158
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
Taking 𝑟 = M 𝑓 + 𝑡 yields
2 /𝐾 2
P( 𝑓 ≥ M 𝑓 + 𝑡) ≤ 2𝑒 −𝑡 ,
Theorem 9.5.17
Let 𝐴 = (𝑎𝑖 𝑗 ) be an 𝑛 × 𝑛 symmetric random matrix with independent entries in
[−1, 1]. Let 𝜆 1 (𝑋) denote the largest eigenvalue of 𝐴. Then
2 /32
P(|𝜆 1 ( 𝐴) − M𝜆 1 ( 𝐴)| ≥ 𝑡) ≤ 4𝑒 −𝑡 .
Proof. We shall verify the hypotheses of Theorem 9.5.14. We would like to come up
with a good choice of a weight vector 𝛼( 𝐴) for each matrix 𝐴 so that for any other
symmetric matrix 𝐵 with [−1, 1] entries,
∑︁
𝜆 1 (𝐵) ≥ 𝜆 1 ( 𝐴) − 𝛼𝑖, 𝑗 . (9.1)
𝑖≤ 𝑗:𝑎 𝑖 𝑗 ≠𝑏 𝑖 𝑗
Note that in a random symmetric matrix we only have 𝑛(𝑛 + 1)/2 independent random
entries: the entries below the diagonal are obtained by reflecting the upper diagonal
entries.
Let 𝑣 = 𝑣( 𝐴) be the unit eigenvector of 𝐴 corresponding to the eigenvalue 𝜆 1 ( 𝐴).
Then, by the Courant–Fischer characterization of eigenvalues,
𝑣 ⊺ 𝐴𝑣 = 𝜆 1 ( 𝐴) and 𝑣 ⊺ 𝐵𝑣 ≤ 𝜆 1 (𝐵).
Thus
∑︁ ∑︁
𝜆 1 ( 𝐴) − 𝜆 1 (𝐵) ≤ 𝑣 ⊺ ( 𝐴 − 𝐵)𝑣 ≤ |𝑣 𝑖 ||𝑣 𝑗 | 𝑎𝑖 𝑗 − 𝑏𝑖 𝑗 ≤ 2|𝑣 𝑖 ||𝑣 𝑗 |.
𝑖, 𝑗:𝑎 𝑖 𝑗 ≠𝑏 𝑖 𝑗 𝑖, 𝑗:𝑎 𝑖 𝑗 ≠𝑏 𝑖 𝑗
159
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
We have !2
∑︁ ∑︁ ∑︁
𝛼𝑖2𝑗 ≤ 8 |𝑣 𝑖 | 2 |𝑣 𝑗 | 2 = 8 |𝑣 𝑖 | 2 = 8.
𝑖≤ 𝑗 𝑖, 𝑗 𝑖
Remark 9.5.18. If 𝐴 has mean zero entries, then a moments computation shows that
√
E𝜆 1 ( 𝐴) = 𝑂 ( 𝑛) (the constant can be computed as well). A much more advanced
fact is that, say for uniform {−1, 1} entries, the true scale of fluctuation is 𝑛−1/6 , and
when normalized, the distribution converges to something known as the Tracy–Widom
distribution. This limiting distribution is “universal” in the sense that it occurs in many
naturally occurring problems, including the next example.
Question 9.5.19
How well is the length 𝑋 of the longest increasing subsequence of uniform random
permutation concentrated?
While the entries of 𝜎 are not independent, we can generate a uniform random permu-
tation by taking iid uniform 𝑥 1 , . . . , 𝑥 𝑛 ∼ Unif [0, 1] and let 𝜎 record the ordering of
the 𝑥𝑖 ’s. This trick converts the problem into one about independent random variables.
√
We leave it as an exercise to deduce that 𝑋 is Θ( 𝑛) whp.
Changing one of the 𝑥𝑖 ’s changes LIS by at most 1, so the bounded differences inequality
√
tells us that 𝑋 is 𝑂 ( 𝑛)-subGaussian. Can we do better?
The assertion that a permutation has an increasing permutation of length 𝑠 can be
checked by verifying 𝑠 coordinates of the permutation. Talagrand’s inequality√ tells
us that in such situations the typical fluctuation should be on the order 𝑂 ( M𝑋), or
𝑂 (𝑛1/4 ) in this case.
Definition 9.5.20
Let Ω = Ω1 × · · · × Ω𝑛 . Let 𝐴 ⊆ Ω. We say that 𝐴 is 𝒔-certifiable if for every 𝑥 ∈ 𝐴,
there exists a set 𝐼 (𝑥) ⊆ [𝑛] with |𝐼 | ≤ 𝑠 such that for every 𝑦 ∈ Ω with 𝑥𝑖 = 𝑦𝑖 for all
𝑖 ∈ 𝐼 (𝑥), one has 𝑦 ∈ 𝐴.
160
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
|{𝑖 ∈ 𝐼 (𝑦) : 𝑥𝑖 ≠ 𝑦𝑖 }| 𝑡
𝑑𝛼 (𝑥, 𝑦) = √︁ ≥ √ .
|𝐼 | 𝑠
√
Thus 𝑑𝑇 (𝑦, 𝐴) ≥ 𝑡/ 𝑠. Thus
√ 2
P( 𝑓 ≤ 𝑟 − 𝑡)P( 𝑓 ≥ 𝑟) ≤ P( 𝐴)P(𝐵) ≤ P(𝑥 ∈ 𝐴)P(𝑑𝑇 (𝑥, 𝐴) ≥ 𝑡/ 𝑠) ≤ 𝑒 −𝑡 /(4𝑠)
−𝑡 2
P( 𝑓 ≤ M 𝑓 − 𝑡) ≤ 2 exp
4M 𝑓
and
−𝑡 2
P( 𝑓 ≥ M 𝑓 + 𝑡) ≤ 2 exp .
4(M 𝑓 + 𝑡)
Proof. Applying the previous theorem, we have, for every 𝑟 ∈ R and every 𝑡 ≥ 0,
−𝑡 2
P( 𝑓 ≤ 𝑟 − 𝑡)P(𝑋 ≥ 𝑟) ≤ exp .
4𝑟
161
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
9 Concentration of Measure
−𝑡 2
P( 𝑓 ≤ M 𝑓 − 𝑡) ≤ 2 exp .
4M 𝑓
−𝑡 2
P(𝑋 ≥ M 𝑓 + 𝑡) ≤ 2 exp . □
4(M 𝑓 + 𝑡)
We can apply the above corollary to [0, 1] 𝑛 with 𝑓 being the length of the longest
√
subsequence. Then 𝑓 ≥ 𝑟 is 𝑟-certifiable. It is also easy to deduce that M 𝑓 = 𝑂 ( 𝑛).
The above tail bounds give us a concentration window of width 𝑂 (𝑛1/4 ).
P(|𝑋 − M𝑋 | ≤ 𝐶𝑛1/4 ) ≥ 1 − 𝜀.
as 𝜎 ranges over all permutations of [𝑛]. This Euclidean traveling salesman problem
is NP-hard to solve exactly, although there is a (1 + 𝜀)-factor approximation algorithm
with running polynomial time for any constant 𝜀 > 0 (Arora 1998).
Let
𝐿 𝑛 = 𝐿 (𝑥 1 , . . . , 𝑥 𝑛 ) with i.i.d. 𝑥 1 , . . . , 𝑥 𝑛 ∼ Unif([0, 1] 2 )
162
MIT OCW: Probabilistic Methods in Combinatorics — Yufei Zhao
√
Exercise: E𝐿 𝑛 = Θ( 𝑛)
√
Beardwood, Halton, and Hammersley (1959) showed that whp 𝐿 𝑛 / 𝑛 converges to
some constant as 𝑛 → ∞ (the exact value of the constant is unknown).
We shall focus on the concentration of 𝐿 𝑛 .
We will present two methods that illustrate different techniques from this chapter.
Martingale methods
The following simple monotonicity property will be important for us: for any 𝑆 and
𝑥 ∈ [0, 1] 2 ,
𝐿 (𝑆) ≤ 𝐿 (𝑆 ∪ {𝑥}) ≤ 𝐿(𝑆) + 2 dist(𝑥, 𝑆).
Here is the justification for the second inequality. Let 𝑦 be the closest point in 𝑆 to 𝑥. Consider a shortest tour through 𝑆, of length 𝐿(𝑆). Modify this tour as follows: traverse it, and upon hitting 𝑦, take a detour excursion from 𝑦 to 𝑥 and then back to 𝑦. The length of this new tour, which visits all of 𝑆 ∪ {𝑥}, is 𝐿(𝑆) + 2 dist(𝑥, 𝑆), and thus the shortest tour through 𝑆 ∪ {𝑥} has length at most 𝐿(𝑆) + 2 dist(𝑥, 𝑆).
If we simply apply the bounded differences inequality, we find that changing a single 𝑥𝑖 might change 𝐿(𝑥1, . . . , 𝑥𝑛) by 𝑂(1) in the worst case, and so 𝐿𝑛 is 𝑂(√𝑛)-subGaussian about its mean. This is a trivial conclusion since 𝐿𝑛 is typically Θ(√𝑛).
To do better, we apply Azuma's inequality to the Doob martingale. The key observation is that the initially revealed points do not affect the conditional expectations by much, even in the worst case.
Theorem 9.6.1
There is a constant 𝑐 > 0 so that
P(|𝐿𝑛 − E𝐿𝑛| ≥ 𝑡) ≤ exp(−𝑐𝑡²/log 𝑛) for all 𝑡 > 0.
Lemma 9.6.2
Let 𝑆 be a set of 𝑘 random points chosen independently and uniformly in [0, 1]². For any (non-random) point 𝑦 ∈ [0, 1]², one has
E dist(𝑦, 𝑆) ≲ 1/√𝑘.
Proof. We have
E dist(𝑦, 𝑆) = ∫_0^{√2} P(dist(𝑦, 𝑆) ≥ 𝑡) 𝑑𝑡
= ∫_0^{√2} (1 − area(𝐵(𝑦, 𝑡) ∩ [0, 1]²))^𝑘 𝑑𝑡
≤ ∫_0^{√2} exp(−𝑘 area(𝐵(𝑦, 𝑡) ∩ [0, 1]²)) 𝑑𝑡
≤ ∫_0^∞ exp(−Ω(𝑘𝑡²)) 𝑑𝑡 ≲ 1/√𝑘. □
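As a quick Monte Carlo sanity check of the lemma (an illustration we add here, not part of the notes), √𝑘 · E dist(𝑦, 𝑆) should stay bounded as 𝑘 grows:

```python
import math
import random

def mean_dist_to_sample(y, k, trials=2000):
    # Monte Carlo estimate of E dist(y, S) for S = k iid uniform points in [0,1]^2.
    total = 0.0
    for _ in range(trials):
        total += min(math.hypot(random.random() - y[0], random.random() - y[1])
                     for _ in range(k))
    return total / trials

y = (0.3, 0.8)
for k in (10, 100, 1000):
    print(k, math.sqrt(k) * mean_dist_to_sample(y, k))  # stays bounded as k grows
```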
Proof of Theorem 9.6.1. Let
𝐿 𝑛,𝑖 (𝑥 1 , . . . , 𝑥𝑖 ) = E [𝐿 𝑛 (𝑥 1 , . . . , 𝑥 𝑛 ) | 𝑥 1 , . . . , 𝑥𝑖 ]
be the expectation of 𝐿 𝑛 conditional on the first 𝑖 points (and averaging over the
remaining 𝑛 − 𝑖 points).
Claim. 𝐿𝑛,𝑖 is 𝑂(1/√(𝑛 − 𝑖 + 1))-Lipschitz with respect to Hamming distance.
To see this, note that for any 𝑥𝑖, 𝑥𝑖′ ∈ [0, 1]² and 𝑖 < 𝑛 (the case 𝑖 = 𝑛 follows from the trivial 𝑂(1) bound), the monotonicity property gives
𝐿𝑛(𝑥1, . . . , 𝑥𝑖−1, 𝑥𝑖, 𝑥𝑖+1, . . . , 𝑥𝑛) ≤ 𝐿𝑛(𝑥1, . . . , 𝑥𝑖−1, 𝑥𝑖′, 𝑥𝑖+1, . . . , 𝑥𝑛) + 2 dist(𝑥𝑖, {𝑥𝑖+1, . . . , 𝑥𝑛}).
Taking expectations over 𝑥𝑖+1, . . . , 𝑥𝑛 and applying the previous lemma, we find that
𝐿𝑛,𝑖(𝑥1, . . . , 𝑥𝑖−1, 𝑥𝑖) ≤ 𝐿𝑛,𝑖(𝑥1, . . . , 𝑥𝑖−1, 𝑥𝑖′) + 𝑂(1/√(𝑛 − 𝑖 + 1)).
This proves the claim. Thus the Doob martingale
𝑍𝑖 = E [𝐿 𝑛 (𝑥 1 , . . . , 𝑥 𝑛 ) | 𝑥 1 , . . . , 𝑥𝑖 ] = 𝐿 𝑛,𝑖 (𝑥 1 , . . . , 𝑥𝑖 )
satisfies
|𝑍𝑖 − 𝑍𝑖−1| ≲ 1/√(𝑛 − 𝑖 + 1) for each 1 ≤ 𝑖 ≤ 𝑛.
Now we apply Azuma’s inequality (Theorem 9.2.8). Since
∑_{𝑖=1}^{𝑛} (1/√(𝑛 − 𝑖 + 1))² = 𝑂(log 𝑛),
we deduce that 𝑍𝑛 = 𝐿𝑛 is 𝑂(√(log 𝑛))-subGaussian about its mean. □
Talagrand’s inequality
Using Talagrand's inequality, we will prove the following stronger estimate.
Theorem 9.6.3
𝐿𝑛 is 𝑂(1)-subGaussian about its mean.
Remark 9.6.4. Rhee (1991) showed that this tail bound is sharp.
The proof below, following Steele (1997), applies the “space-filling curve heuristic.”
A space-filling curve is a continuous surjection from [0, 1] to [0, 1] 2 . Peano (1890)
constructed the first space-filling curve. Hilbert (1891) constructed another space-
filling curve known as the Hilbert curve. We will not give a precise description of the
Hilbert curve here. Intuitively, the Hilbert curve is the pointwise limit of a sequence of
piecewise linear curves illustrated in Figure 9.2. I recommend this 3Blue1Brown video
on YouTube for a beautiful animation of the Hilbert curve along with applications.
We will only need the following property of the Hilbert space filling curve.
Figure 9.2: The Hilbert space-filling curve is the limit of the discrete curves illustrated.
Definition 9.6.5
A map 𝑓 : 𝑋 → 𝑌 between metric spaces is Hölder continuous with exponent 𝛼 if there is a constant 𝐶 so that 𝑑𝑌(𝑓(𝑥), 𝑓(𝑥′)) ≤ 𝐶 𝑑𝑋(𝑥, 𝑥′)^𝛼 for all 𝑥, 𝑥′ ∈ 𝑋.
Remark 9.6.6. Hölder continuity with exponent 1 is the same as Lipschitz continuity. Often 𝑋 has bounded diameter, in which case if 𝑓 is Hölder continuous with exponent 𝛼, then it is so with any exponent 𝛼′ < 𝛼.
Theorem 9.6.7
The Hilbert curve 𝐻 : [0, 1] → [0, 1] 2 is Hölder continuous with exponent 1/2.
Proof sketch. The Hilbert space-filling curve 𝐻 sends every interval of the form [(𝑖 − 1)/4^𝑛, 𝑖/4^𝑛] to a square of the form [(𝑗 − 1)/2^𝑛, 𝑗/2^𝑛] × [(𝑘 − 1)/2^𝑛, 𝑘/2^𝑛]. Indeed, for each fixed 𝑛, the discrete curves eventually all have this property.
Let 𝑥, 𝑦 ∈ [0, 1], and let 𝑛 be the largest integer so that 𝑥, 𝑦 ∈ [(𝑖 − 1)/4^𝑛, (𝑖 + 1)/4^𝑛] for some integer 𝑖. Then |𝑥 − 𝑦| = Θ(4^{−𝑛}) and |𝐻(𝑥) − 𝐻(𝑦)| ≲ 2^{−𝑛}. Thus |𝐻(𝑥) − 𝐻(𝑦)| ≲ |𝑥 − 𝑦|^{1/2}. □
Remark 9.6.8. If a space-filling curve is Hölder continuous with exponent 𝛼, then 𝛼 ≤ 1/2. Indeed, the images of the intervals [(𝑖 − 1)/𝑘, 𝑖/𝑘], 𝑖 = 1, . . . , 𝑘, cover the unit square, and thus one of these intervals must have image of diameter ≳ 1/√𝑘.
The Hilbert curve yields the following: any points 𝑥1, . . . , 𝑥𝑛 ∈ [0, 1]² can be cyclically ordered as 𝑥_{𝜎(1)}, . . . , 𝑥_{𝜎(𝑛)} so that ∑_{𝑖∈[𝑛]} |𝑥_{𝜎(𝑖)} − 𝑥_{𝜎(𝑖+1)}|² = 𝑂(1).
Proof. Order the points as they appear on the Hilbert space-filling curve 𝐻 : [0, 1] → [0, 1]² (since 𝐻 is not injective, there may be more than one possible order). Then there exist 0 ≤ 𝑡1 ≤ 𝑡2 ≤ · · · ≤ 𝑡𝑛 ≤ 1 so that 𝐻(𝑡𝑖) = 𝑥_{𝜎(𝑖)} for each 𝑖. Using that 𝐻 is Hölder continuous with exponent 1/2, we have (with indices taken mod 𝑛)
∑_{𝑖=1}^{𝑛} |𝑥_{𝜎(𝑖)} − 𝑥_{𝜎(𝑖+1)}|² = ∑_{𝑖=1}^{𝑛} |𝐻(𝑡𝑖) − 𝐻(𝑡𝑖+1)|² ≲ ∑_{𝑖=1}^{𝑛} |𝑡𝑖 − 𝑡𝑖+1| ≤ 2. □
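Combining this with Cauchy–Schwarz gives ∑𝑖 |𝑥_{𝜎(𝑖)} − 𝑥_{𝜎(𝑖+1)}| ≤ √(𝑛 ∑𝑖 |𝑥_{𝜎(𝑖)} − 𝑥_{𝜎(𝑖+1)}|²) = 𝑂(√𝑛), so ordering the points along the Hilbert curve produces a tour of length 𝑂(√𝑛). Here is a sketch of this heuristic in code (our illustration, not from the notes), using the classic xy2d computation of the Hilbert index:

```python
import math
import random

def hilbert_index(order, x, y):
    # Classic xy2d: map a grid point (x, y) in [0, 2**order)^2 to its
    # position along the order-`order` discrete Hilbert curve.
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/reflect the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def hilbert_tour_length(points, order=16):
    n = 1 << order
    def key(p):
        gx = min(int(p[0] * n), n - 1)
        gy = min(int(p[1] * n), n - 1)
        return hilbert_index(order, gx, gy)
    tour = sorted(points, key=key)
    return sum(math.dist(tour[i], tour[(i + 1) % len(tour)])
               for i in range(len(tour)))

for m in (100, 1000, 10000):
    pts = [(random.random(), random.random()) for _ in range(m)]
    print(m, hilbert_tour_length(pts) / math.sqrt(m))  # bounded, consistent with O(sqrt(n))
```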
Using Talagrand’s inequality in the form of Theorem 9.5.14, to prove Theorem 9.6.3
that 𝐿 𝑛 is 𝑂 (1)-subGaussian, it suffices to prove the following lemma.
Lemma 9.6.11
Let Ω = ([0, 1]²)^𝑛 be the space of 𝑛-tuples of points in [0, 1]². There exists a map 𝛼 : Ω → R^𝑛_{≥0} so that for all 𝑥 ∈ Ω, 𝛼(𝑥) = (𝛼1(𝑥), . . . , 𝛼𝑛(𝑥)) ∈ R^𝑛_{≥0} satisfies
𝐿(𝑥) ≤ 𝐿(𝑦) + ∑_{𝑖 : 𝑥𝑖 ≠ 𝑦𝑖} 𝛼𝑖(𝑥) for all 𝑥, 𝑦 ∈ Ω   (9.1)
and
sup_{𝑥∈Ω} ∑_{𝑖=1}^{𝑛} 𝛼𝑖(𝑥)² = 𝑂(1).   (9.2)
Exercises
1. Sub-Gaussian tails. For each part, prove there is some constant 𝑐 > 0 so that, for all 𝜆 > 0,
P(|𝑋 − E𝑋| ≥ 𝜆√(Var 𝑋)) ≤ 2𝑒^{−𝑐𝜆²}.
a) 𝑋 is the number of triangles in 𝐺 (𝑛, 1/2).
b) 𝑋 is the number of inversions of a uniform random permutation of [𝑛] (an
inversion of 𝜎 ∈ 𝑆𝑛 is a pair (𝑖, 𝑗) with 𝑖 < 𝑗 and 𝜎(𝑖) > 𝜎( 𝑗)).
2. Prove that for every 𝜀 > 0 there exist 𝛿 > 0 and 𝑛0 such that for all 𝑛 ≥ 𝑛0 and 𝑆1, . . . , 𝑆𝑚 ⊆ [2𝑛] with 𝑚 ≤ 2^{𝛿𝑛} and |𝑆𝑖| = 𝑛 for all 𝑖 ∈ [𝑚], there exists a function 𝑓 : [2𝑛] → [𝑛] so that (1 − 𝑒^{−1} − 𝜀)𝑛 ≤ |𝑓(𝑆𝑖)| ≤ (1 − 𝑒^{−1} + 𝜀)𝑛 for all 𝑖 ∈ [𝑚].
3. Simultaneous bisections. Fix Δ. Let 𝐺1, . . . , 𝐺𝑚 with 𝑚 = 2^{𝑜(𝑛)} be connected graphs of maximum degree at most Δ on the same vertex set 𝑉 with |𝑉| = 𝑛. Prove that there exists a partition 𝑉 = 𝐴 ∪ 𝐵 so that every 𝐺𝑖 has (1 + 𝑜(1))𝑒(𝐺𝑖)/2 edges between 𝐴 and 𝐵.
5. ★ Prove that there is some constant 𝑐 > 0 so that, with probability 1 − 𝑜(1), 𝐺(𝑛, 1/2) has a bipartite subgraph with at least 𝑛²/8 + 𝑐𝑛^{3/2} edges.
6. Let 𝑘 ≤ 𝑛/2 be positive integers and 𝐺 an 𝑛-vertex graph with average degree
at most 𝑛/𝑘. Prove that a uniform random 𝑘-element subset of the vertices of 𝐺
contains an independent set of size at least 𝑐𝑘 with probability at least 1 − 𝑒 −𝑐𝑘 ,
where 𝑐 > 0 is a constant.
7. ★ Prove that there exists a constant 𝑐 > 0 so that the following holds. Let 𝐺 be a 𝑑-regular graph and 𝑣0 ∈ 𝑉(𝐺). Let 𝑚 ∈ N and consider a simple random walk 𝑣0, 𝑣1, . . . , 𝑣𝑚 where each 𝑣𝑖+1 is a uniform random neighbor of 𝑣𝑖. For each 𝑣 ∈ 𝑉(𝐺), let 𝑋𝑣 be the number of times that 𝑣 appears among 𝑣0, . . . , 𝑣𝑚. Prove that for every 𝑣 ∈ 𝑉(𝐺) and 𝜆 > 0,
P(𝑋𝑣 − (1/𝑑) ∑_{𝑤∈𝑁(𝑣)} 𝑋𝑤 ≥ 𝜆 + 1) ≤ 2𝑒^{−𝑐𝜆²/𝑚}.
Here 𝑁(𝑣) is the neighborhood of 𝑣.
8. Prove that for every 𝑘 there exists a 2^{(1+𝑜(1))𝑘/2}-vertex graph that contains every 𝑘-vertex graph as an induced subgraph.
9. ★ Tighter concentration of chromatic number
a) Prove that with probability 1 − 𝑜(1), every vertex subset of 𝐺(𝑛, 1/2) with at least 𝑛^{1/3} vertices contains an independent set of size at least 𝑐 log 𝑛, where 𝑐 > 0 is some constant.
b) Prove that there exists some function 𝑓(𝑛) and constant 𝐶 such that for all 𝑛 ≥ 2,
P(𝑓(𝑛) ≤ 𝜒(𝐺(𝑛, 1/2)) ≤ 𝑓(𝑛) + 𝐶√𝑛/log 𝑛) ≥ 0.99.
10. Show that for every 𝜀 > 0 there exists 𝐶 > 0 so that every 𝑆 ⊆ [4]^𝑛 with |𝑆| ≥ 𝜀4^𝑛 contains four elements with pairwise Hamming distance at least 𝑛 − 𝐶√𝑛.
P(𝑋 ≥ M𝑋 + 𝑡) ≤ 2 exp(−𝑐𝑡²/(M𝑋 + 𝑡)) and P(𝑋 ≤ M𝑋 − 𝑡) ≤ 2 exp(−𝑐𝑡²/M𝑋).
10 Entropy
Definition 10.1.1
Given a discrete random variable 𝑋 taking values in 𝑆, with 𝑝𝑠 := P(𝑋 = 𝑠), its entropy (or binary entropy, to emphasize the base-2 logarithm) is defined to be
𝑯(𝑿) := ∑_{𝑠∈𝑆} −𝑝𝑠 log₂ 𝑝𝑠.
Remark 10.1.2 (Base of the logarithm). It is also fine to use another base for the logarithm, e.g., the natural log, as long as we are consistent throughout. There is some combinatorial preference for base 2 due to its interpretation as counting bits. For certain results, such as Pinsker's inequality (which we will unfortunately not cover here), the choice of base does matter.
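As a small illustration (ours, not from the notes), the definition transcribes directly into code:

```python
from collections import Counter
from math import log2

def entropy(outcomes):
    # Binary entropy H(X) of the empirical distribution of `outcomes`.
    counts = Counter(outcomes)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy("aabb"))  # 1.0 bit: uniform on two values
print(entropy("abcd"))  # 2.0 bits: uniform on four values
print(entropy("aaab"))  # ~0.811 bits: non-uniform on two values, strictly below log2(2) = 1
```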
Here are some basic properties. Throughout, we only consider discrete random variables. The proofs are all routine calculations; it will be useful to understand the information-theoretic interpretations of these properties.
The first property is the uniform bound 𝐻(𝑋) ≤ log₂ |support(𝑋)|.
Proof. The function 𝑓(𝑥) = −𝑥 log₂ 𝑥 is concave for 𝑥 ∈ [0, 1]. Let 𝑆 = support(𝑋). Then, by Jensen's inequality,
𝐻(𝑋) = ∑_{𝑠∈𝑆} 𝑓(𝑝𝑠) ≤ |𝑆| 𝑓((1/|𝑆|) ∑_{𝑠∈𝑆} 𝑝𝑠) = |𝑆| 𝑓(1/|𝑆|) = log₂ |𝑆|. □
We write 𝐻(𝑋, 𝑌) for the entropy of the joint random variable (𝑋, 𝑌). In other words, letting 𝑍 = (𝑋, 𝑌),
𝑯(𝑿, 𝒀) := 𝐻(𝑍) = ∑_{(𝑥,𝑦)} −P(𝑋 = 𝑥, 𝑌 = 𝑦) log₂ P(𝑋 = 𝑥, 𝑌 = 𝑦).
If 𝑋 and 𝑌 are independent, then
𝐻(𝑋, 𝑌) = 𝐻(𝑋) + 𝐻(𝑌).
Proof. Writing 𝑝𝑥 = P(𝑋 = 𝑥) and 𝑝𝑦 = P(𝑌 = 𝑦), we have
𝐻(𝑋, 𝑌) = ∑_{(𝑥,𝑦)} −P(𝑋 = 𝑥, 𝑌 = 𝑦) log₂ P(𝑋 = 𝑥, 𝑌 = 𝑦)
= ∑_{(𝑥,𝑦)} −𝑝𝑥 𝑝𝑦 log₂(𝑝𝑥 𝑝𝑦)
= ∑_{(𝑥,𝑦)} −𝑝𝑥 𝑝𝑦 (log₂ 𝑝𝑥 + log₂ 𝑝𝑦)
= ∑_𝑥 −𝑝𝑥 log₂ 𝑝𝑥 + ∑_𝑦 −𝑝𝑦 log₂ 𝑝𝑦 = 𝐻(𝑋) + 𝐻(𝑌). □
(Each step unpacks the previous one; in the summations, 𝑥 and 𝑦 range over the supports of 𝑋 and 𝑌, respectively.)
and so
𝐻(𝑋|𝑌) := E𝑦[𝐻(𝑋|𝑌 = 𝑦)] = ∑_𝑦 ∑_𝑥 −𝑝(𝑦) 𝑝(𝑥|𝑦) log₂ 𝑝(𝑥|𝑦)
= ∑_{𝑥,𝑦} −𝑝(𝑥, 𝑦) log₂ (𝑝(𝑥, 𝑦)/𝑝(𝑦))
= ∑_{𝑥,𝑦} −𝑝(𝑥, 𝑦) log₂ 𝑝(𝑥, 𝑦) + ∑_𝑦 𝑝(𝑦) log₂ 𝑝(𝑦)
= 𝐻(𝑋, 𝑌) − 𝐻(𝑌). □
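Here is a quick numeric check of the identity 𝐻(𝑋|𝑌) = 𝐻(𝑋, 𝑌) − 𝐻(𝑌) on an arbitrary small joint distribution (an illustration we add, not from the notes):

```python
from math import log2

# Joint pmf of (X, Y) on {0,1}^2 (arbitrary example values).
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

p_y = {}
for (x, y), q in p.items():
    p_y[y] = p_y.get(y, 0.0) + q

# H(X|Y) computed from its definition E_y[H(X | Y = y)].
h_x_given_y = sum(qy * H({x: p[(x, y)] / qy for x in (0, 1)})
                  for y, qy in p_y.items())
assert abs(h_x_given_y - (H(p) - H(p_y))) < 1e-12
print(h_x_given_y, H(p) - H(p_y))
```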
More generally, by iterating the above identity, we obtain the chain rule:
𝐻(𝑋1, . . . , 𝑋𝑛) = 𝐻(𝑋1) + 𝐻(𝑋2|𝑋1) + · · · + 𝐻(𝑋𝑛|𝑋1, . . . , 𝑋𝑛−1).
Another basic property is that conditioning does not increase entropy: 𝐻(𝑋|𝑌) = 𝐻(𝑋, 𝑌) − 𝐻(𝑌) ≤ 𝐻(𝑋), and more generally
𝐻(𝑋|𝑌, 𝑍) ≤ 𝐻(𝑋|𝑍),
which follows by averaging over 𝑧 the inequality 𝐻(𝑋|𝑍 = 𝑧) ≥ 𝐻(𝑋|𝑌, 𝑍 = 𝑧).
[Plot of the binary entropy function 𝐻(𝑝) = −𝑝 log₂ 𝑝 − (1 − 𝑝) log₂(1 − 𝑝) for 𝑝 ∈ [0, 1]: it vanishes at 𝑝 ∈ {0, 1} and attains its maximum value 1 at 𝑝 = 1/2.]
(This notation 𝐻 (·) is standard but unfortunately ambiguous: 𝐻 (𝑋) versus 𝐻 ( 𝑝). It
is usually clear from context which is meant.)
Theorem 10.1.12
If 0 < 𝑘 ≤ 𝑛/2, then
∑_{0≤𝑖≤𝑘} \binom{𝑛}{𝑖} ≤ 2^{𝐻(𝑘/𝑛)𝑛} = (𝑛/𝑘)^𝑘 (𝑛/(𝑛 − 𝑘))^{𝑛−𝑘}.
This bound can also be established using our proof technique for the Chernoff bound, by applying Markov's inequality to the moment generating function:
∑_{0≤𝑖≤𝑘} \binom{𝑛}{𝑖} ≤ (1 + 𝑥)^𝑛/𝑥^𝑘 for all 𝑥 ∈ (0, 1].
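A quick numeric check of the theorem (our illustration, not from the notes):

```python
from math import comb, log2

def H(p):  # binary entropy function
    return -p * log2(p) - (1 - p) * log2(1 - p)

n = 100
for k in (1, 10, 25, 50):
    lhs = sum(comb(n, i) for i in range(k + 1))
    rhs = 2 ** (H(k / n) * n)
    assert lhs <= rhs
    print(f"k={k}: {lhs:.3e} <= {rhs:.3e}")
```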
Remark 10.1.13. One can extend the above proof to bound the tail of Binomial(𝑛, 𝑝) for any 𝑝. The result can be expressed in terms of the relative entropy (also known as the Kullback–Leibler divergence) between two Bernoulli random variables. More concretely, for 𝑋 ∼ Binomial(𝑛, 𝑝), one has
(1/𝑛) log P(𝑋 ≤ 𝑛𝑞) ≤ −𝑞 log(𝑞/𝑝) − (1 − 𝑞) log((1 − 𝑞)/(1 − 𝑝)) for all 0 ≤ 𝑞 ≤ 𝑝,
and
(1/𝑛) log P(𝑋 ≥ 𝑛𝑞) ≤ −𝑞 log(𝑞/𝑝) − (1 − 𝑞) log((1 − 𝑞)/(1 − 𝑝)) for all 𝑝 ≤ 𝑞 ≤ 1.
Given an 𝑛 × 𝑛 matrix 𝐴 = (𝑎𝑖𝑗), its permanent is
per 𝐴 := ∑_{𝜎∈𝑆𝑛} ∏_{𝑖=1}^{𝑛} 𝑎_{𝑖𝜎(𝑖)}.
The formula for the permanent is simply that of the determinant without the sign factor:
det 𝐴 := ∑_{𝜎∈𝑆𝑛} sgn(𝜎) ∏_{𝑖=1}^{𝑛} 𝑎_{𝑖𝜎(𝑖)}.
The following theorem gives an upper bound on the number of perfect matchings of a bipartite graph with a given degree distribution. It was conjectured by Minc (1963) and proved by Brégman (1973).
Theorem 10.2.1 (Brégman)
Let 𝐴 = (𝑎𝑖𝑗) ∈ {0, 1}^{𝑛×𝑛}, and let 𝑑𝑖 denote the sum of the 𝑖-th row of 𝐴. Then
per 𝐴 ≤ ∏_{𝑖=1}^{𝑛} (𝑑𝑖!)^{1/𝑑𝑖}.
Note that equality is attained when 𝐴 consists of diagonal blocks of 1's (corresponding to perfect matchings in a bipartite graph of the form 𝐾𝑑1,𝑑1 ⊔ · · · ⊔ 𝐾𝑑𝑡,𝑑𝑡).
Proof.
Let 𝜎 be a uniform random permutation of [𝑛] conditioned on 𝑎𝑖𝜎𝑖 = 1 for all 𝑖 ∈ [𝑛].
Then
log2 per 𝐴 = 𝐻 (𝜎) = 𝐻 (𝜎1 , . . . , 𝜎𝑛 ) = 𝐻 (𝜎1 ) + 𝐻 (𝜎2 |𝜎1 ) + · · · + 𝐻 (𝜎𝑛 |𝜎1 , . . . , 𝜎𝑛−1 ).
We have
𝐻(𝜎𝑖|𝜎1, . . . , 𝜎𝑖−1) ≤ 𝐻(𝜎𝑖) ≤ log₂ |support(𝜎𝑖)| ≤ log₂ 𝑑𝑖,
but this step would be too lossy. In fact, what we just did amounts to a naive worst-case counting argument.
The key new idea is to reveal the chosen entries in a uniform random order.
For any fixed 𝜎, as 𝜏 varies uniformly over all permutations of [𝑛], 𝑁𝑖 varies uniformly over [𝑑𝑖]. (Why?) Thus
𝐻(𝜎) ≤ ∑_𝑖 E log₂ 𝑁𝑖 = ∑_𝑖 (1/𝑑𝑖) ∑_{𝑗=1}^{𝑑𝑖} log₂ 𝑗 = ∑_𝑖 (log₂ 𝑑𝑖!)/𝑑𝑖,
and so per 𝐴 = 2^{𝐻(𝜎)} ≤ ∏_𝑖 (𝑑𝑖!)^{1/𝑑𝑖}. □
Applying Brégman's theorem yields a bound on the number pm(𝐺) of perfect matchings of a general graph 𝐺:
Theorem 10.2.2 (Kahn–Lovász)
Every graph 𝐺 satisfies pm(𝐺) ≤ ∏_{𝑣∈𝑉(𝐺)} (𝑑𝑣!)^{1/(2𝑑𝑣)}, where 𝑑𝑣 denotes the degree of 𝑣.
Proof. (Alon and Friedland 2008) Brégman's theorem implies the statement for bipartite graphs 𝐺 (by considering a bipartition of 𝐺 ⊔ 𝐺). For the extension to non-bipartite 𝐺, one can proceed via a combinatorial argument showing that pm(𝐺 ⊔ 𝐺) ≤ pm(𝐺 × 𝐾2), which is left as an exercise. □
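Brégman's bound is easy to test by brute force on small matrices (our illustration, not from the notes; the permanent is computed naively from its definition):

```python
from itertools import permutations
from math import factorial, prod
import random

def permanent(A):
    n = len(A)
    return sum(prod(A[i][s[i]] for i in range(n)) for s in permutations(range(n)))

random.seed(0)
for _ in range(100):
    A = [[random.randint(0, 1) for _ in range(5)] for _ in range(5)]
    # Rows with d_i = 0 force per(A) = 0, so they can be skipped in the product.
    bound = prod(factorial(d) ** (1 / d) for d in map(sum, A) if d > 0)
    assert permanent(A) <= bound + 1e-9
print("per(A) <= prod_i (d_i!)^(1/d_i) verified on 100 random 5x5 matrices")
```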
Question 10.2.3
What is the maximum possible number of directed Hamilton paths in an 𝑛-vertex
tournament?
Earlier we saw that a uniformly random tournament has 𝑛!/2^{𝑛−1} Hamilton paths in expectation, and hence there is some tournament with at least this many Hamilton paths. This result, due to Szele, is the earliest application of the probabilistic method.
Using Brégman's theorem, Alon proved a nearly matching upper bound.
Theorem 10.2.4 (Alon)
The maximum number of Hamilton paths in an 𝑛-vertex tournament is 𝑂(𝑛^{3/2} · 𝑛!/2^𝑛).
Remark 10.2.5. The upper bound has been improved to 𝑂(𝑛^{3/2−𝛾} · 𝑛!/2^𝑛) for some small constant 𝛾 > 0 (Friedgut and Kahn 2005), while the lower bound 𝑛!/2^{𝑛−1} has been improved by a constant factor (Adler, Alon, and Ross 2001; Wormald 2004). It remains open to close this 𝑛^{𝑂(1)} factor gap.
One can check (omitted) that the function 𝑔(𝑥) = (𝑥!)^{1/𝑥} is log-concave, i.e., 𝑔(𝑛 + 1)² ≥ 𝑔(𝑛)𝑔(𝑛 + 2) for all 𝑛 ≥ 1. Thus, by a smoothing argument, among sequences (𝑑1, . . . , 𝑑𝑛) with sum \binom{𝑛}{2}, the RHS above is maximized when all the 𝑑𝑖's are within 1 of each other, which, by Stirling's formula, gives 𝑂(√𝑛 · 𝑛!/2^𝑛). □
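The claimed log-concavity is easy to check numerically (our illustration, not from the notes), using log 𝑛! = log Γ(𝑛 + 1):

```python
from math import lgamma

def log_g(n):
    # log g(n) where g(n) = (n!)^(1/n), via log Gamma(n + 1) = log(n!).
    return lgamma(n + 1) / n

for n in range(1, 200):
    assert 2 * log_g(n + 1) >= log_g(n) + log_g(n + 2) - 1e-12
print("g(n+1)^2 >= g(n) g(n+2) holds for n = 1, ..., 199")
```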
Theorem 10.2.4 then follows by applying the above bound with the following lemma.
Lemma 10.2.7
Given an 𝑛-vertex tournament with 𝑃 Hamilton paths, one can add a new vertex to obtain an (𝑛 + 1)-vertex tournament with at least 𝑃/4 Hamilton cycles.
Proof. Add a new vertex and orient its incident edges uniformly at random. For every
Hamilton path in the 𝑛-vertex tournament, there is probability 1/4 that it can be closed
up into a Hamilton cycle through the new vertex. The claim then follows by linearity
of expectation. □
Question 10.2.9
How many Steiner triple systems (STS) are there on 𝑛 labeled vertices? (An STS is a collection of triples of vertices such that every pair of vertices is contained in exactly one triple.)
Theorem 10.2.10 (Upper bound on the number of STS — Linial and Luria 2013)
The number of Steiner triple systems on 𝑛 labeled vertices is at most
(𝑛/(𝑒² + 𝑜(1)))^{𝑛²/6}.
Proof. As in the earlier proof, the idea is to reveal the triples in a random order.
Let 𝑋 denote a uniformly chosen STS on 𝑛 vertices. We wish to upper bound 𝐻(𝑋). We encode 𝑋 as a tuple (𝑋𝑖𝑗)_{𝑖<𝑗} ∈ [𝑛]^{\binom{𝑛}{2}}, where 𝑋𝑖𝑗 is the label of the unique vertex that forms a triple with 𝑖 and 𝑗 in the STS. Here, whenever we write 𝑖𝑗 we mean the unordered pair {𝑖, 𝑗}, i.e., an edge of 𝐾𝑛.
Let 𝑦 = (𝑦𝑖𝑗)_{𝑖<𝑗} ∈ [0, 1]^{\binom{𝑛}{2}}, and order the edges of 𝐾𝑛 in decreasing 𝑦𝑖𝑗:
𝑘𝑙 ≺ 𝑖𝑗 if 𝑦𝑘𝑙 > 𝑦𝑖𝑗.
Let 𝑁𝑖𝑗 denote the number of possibilities for 𝑋𝑖𝑗 at the moment it is revealed, when the coordinates are revealed in the order ≺. By the chain rule and the uniform bound,
𝐻(𝑋) ≤ E𝑋 E𝑦 ∑_{𝑖𝑗} log₂ 𝑁𝑖𝑗.
Let us bound E_{𝑦_{−𝑖𝑗}} log₂ 𝑁𝑖𝑗 as a function of 𝑦𝑖𝑗.
We say that 𝑖𝑗 shows up first in its triple if 𝑖𝑗 ≺ 𝑖𝑘, 𝑗𝑘, where 𝑘 = 𝑋𝑖𝑗; for any fixed 𝑋, this event has probability 𝑦𝑖𝑗² given 𝑦𝑖𝑗. If 𝑖𝑗 does not show up first in its triple, then 𝑋𝑖𝑗 has exactly one possibility (namely 𝑘) by the time it gets revealed, and so 𝑁𝑖𝑗 = 1 and log₂ 𝑁𝑖𝑗 = 0.
Now we use linearity of expectation (over 𝑦_{−𝑖𝑗}, with 𝑋 fixed). For each 𝑠 ∈ [𝑛] \ {𝑖, 𝑗, 𝑘}, if 𝑠 is available as a possibility for 𝑋𝑖𝑗 by the time 𝑋𝑖𝑗 is revealed, then none of the six edges of 𝐾𝑛 forming the two triangles 𝑖𝑠𝑋𝑖𝑠 and 𝑗𝑠𝑋𝑗𝑠 may occur before 𝑖𝑗; given 𝑦𝑖𝑗, this has probability at most 𝑦𝑖𝑗⁶.
Thus
E𝑦 log₂ 𝑁𝑖𝑗 ≤ ∫_0^1 𝑦𝑖𝑗² log₂(1 + (𝑛 − 3)𝑦𝑖𝑗⁶) 𝑑𝑦𝑖𝑗 = (1/3) ∫_0^1 log₂(1 + (𝑛 − 3)𝑡²) 𝑑𝑡
(substituting 𝑡 = 𝑦𝑖𝑗³). Summing over all \binom{𝑛}{2} edges 𝑖𝑗 and evaluating the integral, which equals log₂ 𝑛 − 2 log₂ 𝑒 + 𝑜(1), yields 𝐻(𝑋) ≤ (𝑛²/6) log₂(𝑛/(𝑒² + 𝑜(1))). Since 𝐻(𝑋) = log₂(number of STSs), the theorem follows. □
Remark 10.2.12 (Guessing the formula). Here is perhaps how we might have guessed the formula for the number of STSs. Suppose we select (1/3)\binom{𝑛}{2} triangles in 𝐾𝑛 independently at random. What is the probability that every edge is contained in exactly one triangle? Each edge is contained in one triangle in expectation, and so by the Poisson approximation, the probability that a single fixed edge is contained in exactly one triangle is 1/𝑒 + 𝑜(1). Now let us pretend as if all the edges behave independently (!) — the probability that every edge is contained in exactly one triangle is (1/𝑒 + 𝑜(1))^{\binom{𝑛}{2}}. This would then lead us to guess that the number of STSs is roughly (number of ways to choose the triangles) × (1/𝑒 + 𝑜(1))^{\binom{𝑛}{2}}, which works out to (𝑛/(𝑒² + 𝑜(1)))^{𝑛²/6}.
Here is another heuristic for getting the formula, and this time this method can actually be turned into a proof of a matching lower bound on the number of STSs, though with a lot of work (Keevash 2018). Suppose we remove triangles from 𝐾𝑛 one at a time. After 𝑘 triangles have been removed, the number of edges remaining is \binom{𝑛}{2} − 3𝑘. Let us pretend that the remaining edges were randomly distributed. Then the number of available triangles at each step can be estimated, and multiplying these counts together recovers the same formula.
Given graphs 𝐹 and 𝐺, the 𝐹-density (homomorphism density) in 𝐺 is
𝑡(𝐹, 𝐺) = hom(𝐹, 𝐺)/𝑣(𝐺)^{𝑣(𝐹)} = P(a uniform random map 𝑉(𝐹) → 𝑉(𝐺) is a graph homomorphism 𝐹 → 𝐺).
In this section, we are interested in the regime of fixed 𝐹 and large 𝐺, in which case
almost all maps 𝑉 (𝐹) → 𝑉 (𝐺) are injective, so that there is not much difference
between homomorphisms and subgraphs. More precisely,
Question 10.3.1
Given a fixed graph 𝐹 and constant 𝑝 ∈ [0, 1], what is the minimum possible 𝐹-density
in a graph with edge density at least 𝑝?
The 𝐹-density in the random graph 𝐺(𝑛, 𝑝) is 𝑝^{𝑒(𝐹)} + 𝑜(1), where 𝑝 is fixed and 𝑛 → ∞. Can one do better?
If 𝐹 is non-bipartite, then the complete bipartite graph 𝐾𝑛/2,𝑛/2 has 𝐹-density zero.
(The problem of minimizing 𝐹-density is still interesting and not easy; it has been
solved for cliques.)
Sidorenko's conjecture (1993) (also proposed by Erdős and Simonovits (1983)) says that for any fixed bipartite 𝐹, the random graph asymptotically minimizes the 𝐹-density. This is an important and well-known conjecture in extremal graph theory.
and furthermore both quantities are equal to 𝐻(𝑌|𝑋), since 𝑋𝑌, 𝑌𝑍, 𝑍𝑊 are each distributed as a uniform random edge.
Thus
𝐻(𝑋, 𝑌, 𝑍, 𝑊) ≥ 3 log₂(2𝑒(𝐺)) − 2 log₂ 𝑣(𝐺).
In the final step we used 𝐻 (𝑋, 𝑌 ) = log2 (2𝑒(𝐺)) since 𝑋𝑌 is uniformly distributed
among edges, and 𝐻 (𝑋) ≤ log2 |support(𝑋)| = log2 𝑣(𝐺). This proves (10.1) and
hence the theorem for a path of 4 vertices. (As long as the final expression has the
“right form” and none of the steps are lossy, the proof should work out.) □
Remark 10.3.4. See this MathOverflow discussion for the history as well as alternate
proofs.
The above proof easily generalizes to all trees. We omit the details.
Theorem 10.3.5
Sidorenko’s conjecture holds if 𝐹 is a tree.
Theorem 10.3.6
Sidorenko’s conjecture holds for all complete bipartite graphs.
Proof. Following the same framework as earlier, let us demonstrate the result for
𝐹 = 𝐾2,2 . The same proof extends to all 𝐾 𝑠,𝑡 .
[Figure: 𝐾2,2 with parts {𝑥1, 𝑥2} and {𝑦1, 𝑦2}.]
We will pick a random tuple (𝑋1 , 𝑋2 , 𝑌1 , 𝑌2 ) ∈ 𝑉 (𝐺) 4 with 𝑋𝑖𝑌 𝑗 ∈ 𝐸 (𝐺) for all 𝑖, 𝑗 as
follows.
• 𝑋1𝑌1 is a uniform random edge;
• 𝑌2 is a uniform random neighbor of 𝑋1 ;
• 𝑋2 is a conditionally independent copy of 𝑋1 given (𝑌1 , 𝑌2 ).
The last point deserves more attention. Note that we are not simply uniformly randomly
choosing a common neighbor of 𝑌1 and 𝑌2 as one might naively attempt. Instead, one
can think of the first two steps as generating a distribution for (𝑋1 , 𝑌1 , 𝑌2 )—according to
this distribution, we first generate (𝑌1 , 𝑌2 ) according to its marginal, and then produce
two conditionally independent copies of 𝑋1 (the second copy is 𝑋2 ).
As in the previous proof (applied to a path with 2 edges), we see that
𝐻(𝑋1, 𝑌1, 𝑌2) ≥ 2 log₂(2𝑒(𝐺)) − log₂ 𝑣(𝐺).
So we have
𝐻(𝑋1, 𝑋2, 𝑌1, 𝑌2)
= 𝐻(𝑌1, 𝑌2) + 𝐻(𝑋1, 𝑋2|𝑌1, 𝑌2) [chain rule]
= 𝐻(𝑌1, 𝑌2) + 2𝐻(𝑋1|𝑌1, 𝑌2) [conditional independence]
= 2𝐻(𝑋1, 𝑌1, 𝑌2) − 𝐻(𝑌1, 𝑌2) [chain rule]
≥ 2(2 log₂(2𝑒(𝐺)) − log₂ 𝑣(𝐺)) − 2 log₂ 𝑣(𝐺) [prev. ineq. and uniform bound]
= 4 log₂(2𝑒(𝐺)) − 4 log₂ 𝑣(𝐺).
Hence hom(𝐾2,2, 𝐺) ≥ 2^{𝐻(𝑋1,𝑋2,𝑌1,𝑌2)} ≥ (2𝑒(𝐺))⁴/𝑣(𝐺)⁴, i.e., 𝑡(𝐾2,2, 𝐺) ≥ 𝑡(𝐾2, 𝐺)⁴. □
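As a numeric sanity check of the inequality 𝑡(𝐾2,2, 𝐺) ≥ 𝑡(𝐾2, 𝐺)⁴ just proved, here is a small script (our illustration, not from the notes), using the fact that hom(𝐾2,2, 𝐺) = ∑_{𝑢,𝑣} codeg(𝑢, 𝑣)²:

```python
import random

def t_k22_vs_t_k2(n=30, p=0.4):
    adj = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if random.random() < p:
                adj[u][v] = adj[v][u] = 1
    # codeg[u][v] = number of common neighbors of u and v.
    codeg = [[sum(adj[u][w] & adj[v][w] for w in range(n)) for v in range(n)]
             for u in range(n)]
    hom_k22 = sum(codeg[u][v] ** 2 for u in range(n) for v in range(n))
    t_k2 = sum(map(sum, adj)) / n ** 2   # = 2 e(G) / n^2
    return hom_k22 / n ** 4, t_k2 ** 4

for _ in range(5):
    lhs, rhs = t_k22_vs_t_k2()
    assert lhs >= rhs
    print(f"t(K22, G) = {lhs:.5f} >= t(K2, G)^4 = {rhs:.5f}")
```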
Theorem 10.3.7
Sidorenko's conjecture holds for every bipartite graph 𝐹 having a vertex complete to the other part.
Proof. Let us illustrate the proof for the following graph. The proof extends to the general case.
[Figure: a bipartite graph 𝐹 with parts {𝑥0, 𝑥1, 𝑥2} and {𝑦1, 𝑦2, 𝑦3}, where 𝑥0 is adjacent to 𝑦1, 𝑦2, 𝑦3, while 𝑥1 is adjacent to 𝑦1, 𝑦2 and 𝑥2 is adjacent to 𝑦2, 𝑦3.]
𝐻 (𝑋0 , 𝑋1 , 𝑋2 , 𝑌1 , 𝑌2 , 𝑌3 )
= 𝐻 (𝑋0 , 𝑋1 , 𝑋2 |𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑌1 , 𝑌2 , 𝑌3 ) [chain rule]
= 𝐻 (𝑋0 |𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑋1 |𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑋2 |𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑌1 , 𝑌2 , 𝑌3 ) [cond indep]
= 𝐻 (𝑋0 |𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑋1 |𝑌1 , 𝑌2 ) + 𝐻 (𝑋2 |𝑌2 , 𝑌3 ) + 𝐻 (𝑌1 , 𝑌2 , 𝑌3 ) [cond indep]
= 𝐻 (𝑋0 , 𝑌1 , 𝑌2 , 𝑌3 ) + 𝐻 (𝑋1 , 𝑌1 , 𝑌2 ) + 𝐻 (𝑋2 , 𝑌2 , 𝑌3 ) − 𝐻 (𝑌1 , 𝑌2 ) − 𝐻 (𝑌2 , 𝑌3 ). [chain rule]
The proof of Theorem 10.3.3 actually lower bounds the first three terms: 𝐻(𝑋0, 𝑌1, 𝑌2, 𝑌3) ≥ 3 log₂(2𝑒(𝐺)) − 2 log₂ 𝑣(𝐺), and 𝐻(𝑋1, 𝑌1, 𝑌2), 𝐻(𝑋2, 𝑌2, 𝑌3) ≥ 2 log₂(2𝑒(𝐺)) − log₂ 𝑣(𝐺). Combined with the uniform bounds 𝐻(𝑌1, 𝑌2), 𝐻(𝑌2, 𝑌3) ≤ 2 log₂ 𝑣(𝐺), this gives 𝐻(𝑋0, 𝑋1, 𝑋2, 𝑌1, 𝑌2, 𝑌3) ≥ 7 log₂(2𝑒(𝐺)) − 8 log₂ 𝑣(𝐺), which yields 𝑡(𝐹, 𝐺) ≥ 𝑡(𝐾2, 𝐺)⁷, as desired. □
To check that you understand the above proof: where did we use the assumption that 𝐹 has a vertex complete to the other part? Many other cases of the conjecture can be proved by extending this method.
Remark 10.3.8 (Möbius graph). An important open case (and the smallest in some sense) of Sidorenko's conjecture is when 𝐹 is the following graph, known as the Möbius graph. It is 𝐾5,5 with a 10-cycle removed. The name comes from it being the face–vertex incidence graph of the simplicial complex structure of the Möbius strip, built by gluing a strip of five triangles.
Question 10.4.2
What is the maximum volume of a body in R3 that has area at most 1 when projected
to each of the three coordinate planes?
The cube [0, 1]³ satisfies the above property and has volume 1. It turns out that this is the maximum.
To prove this claim, let us first use Shearer's inequality to prove a discrete version.
Theorem 10.4.3
Let 𝑆 ⊆ R³ be a finite set, and let 𝜋𝑥𝑦(𝑆) denote its projection onto the 𝑥𝑦-plane, etc. Then
|𝑆|² ≤ |𝜋𝑥𝑦(𝑆)| |𝜋𝑥𝑧(𝑆)| |𝜋𝑦𝑧(𝑆)|.
Corollary 10.4.4
Let 𝑆 be a body in R³. Then
vol(𝑆)² ≤ area(𝜋𝑥𝑦(𝑆)) area(𝜋𝑥𝑧(𝑆)) area(𝜋𝑦𝑧(𝑆)).
Let us now state the general form of Shearer's lemma (Chung, Graham, Frankl, and Shearer 1986): if 𝐴1, . . . , 𝐴𝑠 ⊆ [𝑛] are sets such that each 𝑖 ∈ [𝑛] appears in at least 𝑘 of them, then for any discrete random variables 𝑋1, . . . , 𝑋𝑛, writing 𝑋𝐴 = (𝑋𝑖)_{𝑖∈𝐴},
𝑘𝐻(𝑋1, . . . , 𝑋𝑛) ≤ ∑_{𝑗∈[𝑠]} 𝐻(𝑋_{𝐴𝑗}).
Corollary 10.4.7
Let 𝐴1, . . . , 𝐴𝑠 ⊆ Ω, where each 𝑖 ∈ Ω appears in at least 𝑘 of the sets 𝐴𝑗. Then for every family F of subsets of Ω,
|F|^𝑘 ≤ ∏_{𝑗∈[𝑠]} |F|_{𝐴𝑗}|,
where F|_𝐴 := {𝐹 ∩ 𝐴 : 𝐹 ∈ F}.
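Here is a toy verification of the corollary (our illustration, not from the notes), with Ω of size 4 and three sets covering each element 𝑘 = 2 times:

```python
from itertools import combinations
import random

Omega = frozenset(range(4))
As = [frozenset(a) for a in ({0, 1}, {1, 2, 3}, {0, 2, 3})]  # each element covered k = 2 times
k = 2

all_subsets = [frozenset(c) for r in range(5) for c in combinations(Omega, r)]
F = random.sample(all_subsets, 10)   # an arbitrary family of subsets of Omega

lhs = len(F) ** k
rhs = 1
for A in As:
    rhs *= len({f & A for f in F})   # |F restricted to A|
assert lhs <= rhs
print(lhs, "<=", rhs)
```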
Triangle-intersecting families
We say that a set G of labeled graphs on the same vertex set is triangle-intersecting if
𝐺 ∩ 𝐺 ′ contains a triangle for every 𝐺, 𝐺 ′ ∈ G.
Question 10.4.8
What is the largest triangle-intersecting family of graphs on 𝑛 labeled vertices?
The set of all graphs that contain a fixed triangle is triangle-intersecting, and they form
a 1/8 fraction of all graphs.
An easy upper bound: the graphs in a triangle-intersecting family pairwise intersect as edge sets, and an intersecting family of subsets of a given ground set contains at most half of all subsets; so a triangle-intersecting family contains at most a 1/2 fraction of all graphs.
The next theorem improves this upper bound to < 1/4. It is also in this paper that
Shearer’s lemma was introduced.
Theorem 10.4.9 (Chung, Graham, Frankl, and Shearer 1986)
Every triangle-intersecting family G of graphs on 𝑛 vertices has |G| < 2^{\binom{𝑛}{2}−2}.
Proof. For 𝑆 ⊆ [𝑛], let 𝐴𝑆 denote the set of vertex pairs lying both inside 𝑆 or both outside 𝑆, and write 𝑟 = |𝐴𝑆|. For every 𝑆, every triangle has an edge in 𝐴𝑆, and thus G restricted to 𝐴𝑆 must be an intersecting family. Hence
|G|_{𝐴𝑆}| ≤ 2^{|𝐴𝑆|−1} = 2^{𝑟−1}.
From now on, take |𝑆| = ⌊𝑛/2⌋, so that 𝑟 = \binom{⌊𝑛/2⌋}{2} + \binom{⌈𝑛/2⌉}{2}. Every pair of vertices appears in the same number 𝑘 of the 𝑠 = \binom{𝑛}{⌊𝑛/2⌋} different 𝐴𝑆 with |𝑆| = ⌊𝑛/2⌋ (by symmetry and averaging, 𝑘\binom{𝑛}{2} = 𝑠𝑟). Applying Corollary 10.4.7, we find that
|G|^𝑘 ≤ 2^{(𝑟−1)𝑠}.
Therefore
|G| ≤ 2^{(𝑟−1)𝑠/𝑘} = 2^{\binom{𝑛}{2} − \binom{𝑛}{2}/𝑟} < 2^{\binom{𝑛}{2}−2},
where the final inequality uses 𝑟 < \binom{𝑛}{2}/2. □
Remark 10.4.10. The tight bound of 2^{\binom{𝑛}{2}−3} (attained by the family of all graphs containing a fixed triangle) was conjectured by Simonovits and Sós (1976) and proved by Ellis, Filmus, and Friedgut (2012) using Fourier analytic methods. Berger and Zhao (2023) gave a tight solution for 𝐾4-intersecting families. The general conjecture for 𝐾𝑟-intersecting families is open.
Question 10.4.11
Fix 𝑑. Which 𝑑-regular graph on a given number of vertices has the largest number of independent sets? Alternatively, which graph 𝐺 maximizes 𝑖(𝐺)^{1/𝑣(𝐺)}?
(Note that the number of independent sets is multiplicative: 𝑖(𝐺 1 ⊔𝐺 2 ) = 𝑖(𝐺 1 )𝑖(𝐺 2 ).)
Alon and Kahn conjectured that for graphs on 𝑛 vertices, when 𝑛 is a multiple of 2𝑑,
a disjoint union of 𝐾 𝑑,𝑑 ’s maximizes the number of independent sets.
Alon (1991) proved an approximate version of this conjecture. Kahn (2001) proved it assuming the graph is bipartite. Zhao (2010) proved it in general.
Theorem 10.4.12
For every 𝑛-vertex 𝑑-regular graph 𝐺,
𝑖(𝐺) ≤ 𝑖(𝐾𝑑,𝑑)^{𝑛/(2𝑑)} = (2^{𝑑+1} − 1)^{𝑛/(2𝑑)}.
Proof assuming 𝐺 is bipartite (Kahn). Let us first illustrate the proof for the bipartite graph 𝐺 shown below.
[Figure: a bipartite graph 𝐺 with parts {𝑥1, 𝑥2, 𝑥3} and {𝑦1, 𝑦2, 𝑦3}.]
Among all independent sets of 𝐺, choose one uniformly at random, and let (𝑋1, 𝑋2, 𝑋3, 𝑌1, 𝑌2, 𝑌3) ∈ {0, 1}⁶ be its indicator vector. Then log₂ 𝑖(𝐺) = 𝐻(𝑋1, 𝑋2, 𝑋3, 𝑌1, 𝑌2, 𝑌3).
Here we are using that (a) 𝑌1 , 𝑌2 , 𝑌3 are conditionally independent given (𝑋1 , 𝑋2 , 𝑋3 )
and (b) 𝑌1 and (𝑋3 , 𝑌2 , 𝑌3 ) are conditionally independent given (𝑋1 , 𝑋2 ). A more
general statement is that if 𝑆 ⊆ 𝑉 (𝐺), then the restrictions to the different connected
components of 𝐺 − 𝑆 are conditionally independent given 𝑋𝑆 .
It remains to prove that
𝐻(𝑋1, 𝑋2) + 2𝐻(𝑌1|𝑋1, 𝑋2) ≤ log₂ 𝑖(𝐾2,2)
and two other analogous inequalities. Let 𝑌1′ be a conditionally independent copy of 𝑌1 given (𝑋1, 𝑋2). Then (𝑋1, 𝑋2, 𝑌1, 𝑌1′) is the indicator vector of an independent set of 𝐾2,2 (though not necessarily chosen uniformly).
[Figure: 𝐾2,2 with parts {𝑥1, 𝑥2} and {𝑦1, 𝑦1′}.]
Thus we have
𝐻(𝑋1, 𝑋2) + 2𝐻(𝑌1|𝑋1, 𝑋2) = 𝐻(𝑋1, 𝑋2) + 𝐻(𝑌1, 𝑌1′|𝑋1, 𝑋2) = 𝐻(𝑋1, 𝑋2, 𝑌1, 𝑌1′) ≤ log₂ 𝑖(𝐾2,2).
This concludes the proof for the illustrative example. The same argument works for all bipartite 𝐺; here are the details.
Let 𝑉 = 𝐴 ∪ 𝐵 be the vertex bipartition of 𝐺. Let 𝑋 = (𝑋𝑣 )𝑣∈𝑉 be the indicator function
of an independent set chosen uniformly at random. Write 𝑋𝑆 := (𝑋𝑣)_{𝑣∈𝑆}. We have, for each 𝑏 ∈ 𝐵,
𝐻(𝑋_{𝑁(𝑏)}) + 𝑑𝐻(𝑋𝑏|𝑋_{𝑁(𝑏)}) = 𝐻(𝑋_{𝑁(𝑏)}) + 𝐻(𝑋𝑏^{(1)}, . . . , 𝑋𝑏^{(𝑑)}|𝑋_{𝑁(𝑏)})
= 𝐻(𝑋𝑏^{(1)}, . . . , 𝑋𝑏^{(𝑑)}, 𝑋_{𝑁(𝑏)})
≤ log₂ 𝑖(𝐾𝑑,𝑑),
where 𝑋𝑏^{(1)}, . . . , 𝑋𝑏^{(𝑑)} are conditionally independent copies of 𝑋𝑏 given 𝑋_{𝑁(𝑏)}. Summing over all 𝑏 ∈ 𝐵 yields the result. □
Now we give the argument from Zhao (2010) that removes the bipartite hypothesis.
The following combinatorial argument reduces the problem for non-bipartite 𝐺 to that
of bipartite 𝐺.
Starting from a graph 𝐺, we construct its bipartite double cover 𝐺 × 𝐾2 (see Fig-
ure 10.1), which has vertex set 𝑉 (𝐺) × {0, 1}. The vertices of 𝐺 × 𝐾2 are labeled 𝑣 𝑖
for 𝑣 ∈ 𝑉 (𝐺) and 𝑖 ∈ {0, 1}. Its edges are 𝑢 0 𝑣 1 for all 𝑢𝑣 ∈ 𝐸 (𝐺). Note that 𝐺 × 𝐾2
is always a bipartite graph.
Lemma 10.4.13
Let 𝐺 be any graph (not necessarily regular). Then
𝑖(𝐺)² ≤ 𝑖(𝐺 × 𝐾2).
Once we have the lemma, Theorem 10.4.12 reduces to the bipartite case, which we already proved. Indeed, for a 𝑑-regular 𝐺, the graph 𝐺 × 𝐾2 is a 2𝑛-vertex 𝑑-regular bipartite graph, so the bipartite case gives 𝑖(𝐺)² ≤ 𝑖(𝐺 × 𝐾2) ≤ 𝑖(𝐾𝑑,𝑑)^{𝑛/𝑑}, and taking square roots finishes the proof.
Figure 10.1: The bipartite swapping trick in the proof of Lemma 10.4.13: swapping the circled pairs of vertices (denoted 𝐴 in the proof) fixes the bad edges (red and bolded), transforming an independent set of 2𝐺 into an independent set of 𝐺 × 𝐾2.
Proof of Lemma 10.4.13. Let 2𝐺 denote a disjoint union of two copies of 𝐺. Label
its vertices by 𝑣 𝑖 with 𝑣 ∈ 𝑉 and 𝑖 ∈ {0, 1} so that its edges are 𝑢𝑖 𝑣 𝑖 with 𝑢𝑣 ∈ 𝐸 (𝐺) and
𝑖 ∈ {0, 1}. We will give an injection 𝜙 : 𝐼 (2𝐺) → 𝐼 (𝐺 × 𝐾2 ). Recall that 𝐼 (𝐺) is the
set of independent sets of 𝐺. The injection would imply 𝑖(𝐺) 2 = 𝑖(2𝐺) ≤ 𝑖(𝐺 × 𝐾2 )
as desired.
Fix an arbitrary order on all subsets of 𝑉(𝐺). Let 𝑆 be an independent set of 2𝐺. Let
𝐸bad(𝑆) = {𝑢𝑣 ∈ 𝐸(𝐺) : 𝑢𝑖, 𝑣1−𝑖 ∈ 𝑆 for some 𝑖 ∈ {0, 1}}.
Note that 𝐸bad(𝑆) is a bipartite subgraph of 𝐺, since each edge of 𝐸bad(𝑆) has exactly one endpoint in {𝑣 ∈ 𝑉(𝐺) : 𝑣0 ∈ 𝑆} (it cannot have both, or else 𝑆 would not be independent in 2𝐺).
Let 𝐴 denote the first subset (in the previously fixed ordering) of 𝑉 (𝐺) such that all
edges in 𝐸 bad (𝑆) have one vertex in 𝐴 and the other outside 𝐴. Define 𝜙(𝑆) to be the
subset of 𝑉 (𝐺) × {0, 1} obtained by “swapping” the pairs in 𝐴, i.e., for all 𝑣 ∈ 𝐴,
𝑣 𝑖 ∈ 𝜙(𝑆) if and only if 𝑣 1−𝑖 ∈ 𝑆 for each 𝑖 ∈ {0, 1}, and for all 𝑣 ∉ 𝐴, 𝑣 𝑖 ∈ 𝜙(𝑆) if and
only if 𝑣 𝑖 ∈ 𝑆 for each 𝑖 ∈ {0, 1}. It is not hard to verify that 𝜙(𝑆) is an independent
set in 𝐺 × 𝐾2 . The swapping procedure fixes the “bad” edges.
It remains to verify that 𝜙 is an injection. For every 𝑆 ∈ 𝐼(2𝐺), once we know 𝑇 = 𝜙(𝑆), we can recover 𝑆 by first setting
𝐸′bad(𝑇) = {𝑢𝑣 ∈ 𝐸(𝐺) : 𝑢𝑖, 𝑣𝑖 ∈ 𝑇 for some 𝑖 ∈ {0, 1}},
observing that 𝐸′bad(𝑇) = 𝐸bad(𝑆), so that 𝐴 is recovered as the first subset of 𝑉(𝐺) (in the fixed ordering) such that all edges in 𝐸′bad(𝑇) have one vertex in 𝐴 and the other outside 𝐴, and then swapping the pairs in 𝐴 once more to recover 𝑆. □
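The lemma is easy to verify by brute force on small graphs (our illustration, not from the notes):

```python
from itertools import combinations
import random

def count_independent_sets(vertices, edges):
    edges = [frozenset(e) for e in edges]
    return sum(
        1
        for r in range(len(vertices) + 1)
        for sub in combinations(vertices, r)
        if not any(e <= set(sub) for e in edges)
    )

random.seed(1)
for _ in range(20):
    n = 5
    E = [(u, v) for u in range(n) for v in range(u + 1, n) if random.random() < 0.5]
    i_G = count_independent_sets(range(n), E)
    # Bipartite double cover: vertices (v, i); edges (u,0)-(v,1) and (u,1)-(v,0).
    cover_V = [(v, i) for v in range(n) for i in (0, 1)]
    cover_E = [((u, 0), (v, 1)) for u, v in E] + [((u, 1), (v, 0)) for u, v in E]
    assert i_G ** 2 <= count_independent_sets(cover_V, cover_E)
print("i(G)^2 <= i(G x K2) verified on 20 random graphs")
```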
The entropy proof of the bipartite case of Theorem 10.4.12 extends to graph homomorphisms, yielding the following result.
Theorem 10.4.14
For every 𝑛-vertex 𝑑-regular bipartite graph 𝐺 and every graph 𝐻 (possibly with loops),
hom(𝐺, 𝐻) ≤ hom(𝐾𝑑,𝑑, 𝐻)^{𝑛/(2𝑑)}.
Furthermore, it was also shown in the same paper that in Theorem 10.4.14, the bipartite hypothesis on 𝐺 can be weakened to triangle-free. Moreover, triangle-free is the weakest possible hypothesis on 𝐺 for which the claim holds for all 𝐻.
For more discussion and open problems on this topic, see the survey by Zhao (2017).
Exercises
The problems in this section should be solved using entropy arguments or results
derived from entropy arguments.
1. Submodularity. Prove that 𝐻 (𝑋, 𝑌 , 𝑍) + 𝐻 (𝑋) ≤ 𝐻 (𝑋, 𝑌 ) + 𝐻 (𝑋, 𝑍).
2. Let F be a collection of subsets of [𝑛]. Let 𝑝𝑖 denote the fraction of F that contains 𝑖. Prove that |F| ≤ 2^{∑_{𝑖=1}^{𝑛} 𝐻(𝑝𝑖)}, where 𝐻 is the binary entropy function.
3. ★ Uniquely decodable codes. Let [𝑟]* denote the set of all finite strings of elements in [𝑟]. Let 𝐴 be a finite subset of [𝑟]* and suppose no two distinct concatenations of sequences in 𝐴 can produce the same string. Let |𝑎| denote the length of 𝑎 ∈ 𝐴. Prove that
∑_{𝑎∈𝐴} 𝑟^{−|𝑎|} ≤ 1.
Prove that △ ≤ ∧.
7. ★ Box theorem. Prove that for every compact set 𝐴 ⊆ R^𝑑, there exists an axis-aligned box 𝐵 ⊆ R^𝑑 with vol 𝐵 = vol 𝐴 and vol 𝜋𝐼(𝐵) ≤ vol 𝜋𝐼(𝐴) for every 𝐼 ⊆ [𝑑], where 𝜋𝐼 denotes the orthogonal projection onto the coordinates indexed by 𝐼.
8. Let G be a family of graphs on vertices labeled by [2𝑛] such that the intersection of every pair of graphs in G contains a perfect matching. Prove that |G| ≤ 2^{\binom{2𝑛}{2}−𝑛}.
9. Loomis–Whitney for sumsets. Let 𝐴, 𝐵, 𝐶 be finite subsets of some abelian
group. Writing 𝐴 + 𝐵 := {𝑎 + 𝑏 : 𝑎 ∈ 𝐴, 𝑏 ∈ 𝐵}, etc., prove that
|𝐴 + 𝐵 + 𝐶|² ≤ |𝐴 + 𝐵| |𝐴 + 𝐶| |𝐵 + 𝐶|.
10. ★ Shearer for sums. Let 𝑋, 𝑌 , 𝑍 be independent random integers. Prove that
2𝐻 (𝑋 + 𝑌 + 𝑍) ≤ 𝐻 (𝑋 + 𝑌 ) + 𝐻 (𝑋 + 𝑍) + 𝐻 (𝑌 + 𝑍).
11 Containers
Question 11.0.1
How many triangle-free graphs are there on 𝑛 vertices?
By taking all subgraphs of 𝐾𝑛/2,𝑛/2, we obtain 2^{𝑛²/4} such graphs. It turns out this gives the correct exponential asymptotic.
Theorem 11.0.2 (Erdős, Kleitman, and Rothschild 1976)
The number of triangle-free graphs on 𝑛 vertices is 2^{(1/4+𝑜(1))𝑛²}.
Remark 11.0.3. It does not matter here whether we consider the vertices to be labeled; this affects the answer by a factor of at most 𝑛! = 𝑒^{𝑂(𝑛 log 𝑛)}.
We can convert this asymptotic enumeration problem into a problem about independent sets in a 3-uniform hypergraph 𝐻:
• 𝑉(𝐻) = \binom{[𝑛]}{2}, the edges of 𝐾𝑛
• The edges of 𝐻 are triples of the form {𝑥𝑦, 𝑥𝑧, 𝑦𝑧}, i.e., triangles
We then have the correspondence:
• A subset of 𝑉(𝐻) = a graph on vertex set [𝑛]
• An independent set of 𝐻 = a triangle-free graph
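For very small 𝑛, the correspondence can be checked by brute force (a toy illustration we add, not from the notes):

```python
from itertools import combinations

n = 5
V = list(combinations(range(n), 2))   # V(H): the edges of K_n
triangles = [frozenset([(a, b), (a, c), (b, c)])
             for a, b, c in combinations(range(n), 3)]   # edges of H

def is_independent(graph_edges):
    # graph_edges is a subset of V(H); independent iff it spans no triangle.
    s = set(graph_edges)
    return not any(t <= s for t in triangles)

count = sum(is_independent(g)
            for r in range(len(V) + 1)
            for g in combinations(V, r))
print(count)   # the number of triangle-free graphs on 5 labeled vertices
```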
The method of hypergraph containers is one of the most exciting developments in combinatorics of the past decade. Some references and further reading:
• The graph container method was developed by Kleitman and Winston (1982) (for
counting 𝐶4 -free graphs) and Sapozhenko (2001) (for bounding the number of
independent sets in a regular graph, giving an earlier version of Theorem 10.4.12)
• The hypergraph container theorem was proved independently by Balogh, Morris,
and Samotij (2015), and Saxton and Thomason (2015).
• See the 2018 ICM survey of Balogh, Morris, and Samotij for an introduction to
the topic along with many applications
• See Samotij’s survey article (2015) for an introduction to the graph container
method
• See Morris’ lecture notes (2016) for a gentle introduction to the proof and
applications of hypergraph containers.
Theorem 11.1.1 (Containers for triangle-free graphs)
For every 𝜀 > 0, there exists a constant 𝐶 > 0 so that for every 𝑛, there is a collection C of graphs on the vertex set [𝑛], with |C| ≤ 𝑛^{𝐶𝑛^{3/2}}, such that
(a) every 𝐺 ∈ C has at most (1/4 + 𝜀)𝑛² edges, and
(b) every triangle-free graph is contained in some 𝐺 ∈ C.
Proof of upper bound of Theorem 11.0.2. We want to show that the number of 𝑛-vertex triangle-free graphs is at most 2^{𝑛²/4+𝑜(𝑛²)}. Let 𝜀 > 0 be any real number (arbitrarily small). Let C be produced by Theorem 11.1.1.
Then every 𝐺 ∈ C has at most (1/4 + 𝜀)𝑛² edges, and every triangle-free graph is contained in some 𝐺 ∈ C. Hence the number of triangle-free graphs on 𝑛 vertices is at most
|C| · 2^{(1/4+𝜀)𝑛²} ≤ 𝑛^{𝐶𝑛^{3/2}} 2^{(1/4+𝜀)𝑛²} = 2^{(1/4+𝜀+𝑜(1))𝑛²}.
Since 𝜀 > 0 can be made arbitrarily small, the number of triangle-free graphs on 𝑛 vertices is 2^{(1/4+𝑜(1))𝑛²}. □
The same proof technique, with an appropriate container theorem, can be used to count
𝐻-free graphs.
We write ex(𝑛, 𝐻) for the maximum number of edges in an 𝑛-vertex graph without 𝐻
as a subgraph.
(Recall that ex(𝑛, 𝐻) = (1 − 1/(𝜒(𝐻) − 1) + 𝑜(1))\binom{𝑛}{2} by the Erdős–Stone–Simonovits theorem.) Note that for bipartite graphs 𝐻, this just says 𝑜(𝑛²), though more precise estimates are available. We do not know the asymptotics of ex(𝑛, 𝐻) for all 𝐻; e.g., it is still open for 𝐻 = 𝐾4,4 and 𝐻 = 𝐶8.
Theorem 11.1.3
Fix a non-bipartite graph 𝐻. Then the number of 𝐻-free graphs on 𝑛 vertices is
2 (1+𝑜(1)) ex(𝑛,𝐻) .
The analogous statement for bipartite graphs is false. The following conjecture remains
of great interest, and it is known for certain graphs, e.g., 𝐻 = 𝐶4 .
Conjecture 11.1.4
Fix a bipartite graph 𝐻 with a cycle. The number of 𝐻-free graphs on 𝑛 vertices is
2𝑂 (ex(𝑛,𝐻)) .
Theorem 11.1.5
If 𝑝 ≫ 1/√𝑛, then with probability 1 − 𝑜(1), every triangle-free subgraph of 𝐺(𝑛, 𝑝) has at most (1/4 + 𝑜(1))𝑝𝑛² edges.
Remark 11.1.6. In fact, a much stronger result is true: the triangle-free subgraph of 𝐺(𝑛, 𝑝) with the maximum number of edges is whp bipartite (DeMarco and Kahn 2015).
Remark 11.1.7. The statement is false for 𝑝 ≪ 1/√𝑛. Indeed, in this case the expected number of triangles is 𝑂(𝑛³𝑝³), whereas there are whp (1 + 𝑜(1))𝑛²𝑝/2 edges, and 𝑛³𝑝³ ≪ 𝑛²𝑝, so we can remove 𝑜(𝑛²𝑝) edges to make the graph triangle-free.
Proof. We prove a slightly weaker result, namely that the result is true if 𝑝 ≫ 𝑛^{−1/2} log 𝑛. The version with 𝑝 ≫ 𝑛^{−1/2} can be proved via a stronger formulation of the container lemma (using "fingerprints" as discussed later).
Let 𝜀 > 0 be arbitrarily small. Let C be a set of containers for 𝑛-vertex triangle-free graphs as in Theorem 11.1.1. For every 𝐺 ∈ C, 𝑒(𝐺) ≤ (1/4 + 𝜀)𝑛², so by an application of the Chernoff bound,
P(𝑒(𝐺 ∩ 𝐺(𝑛, 𝑝)) > (1/4 + 2𝜀)𝑛²𝑝) ≤ 𝑒^{−Ω𝜀(𝑛²𝑝)}.
Since |C| ≤ 𝑛^{𝐶𝑛^{3/2}} = 𝑒^{𝑂(𝑛^{3/2} log 𝑛)} and 𝑛²𝑝 ≫ 𝑛^{3/2} log 𝑛, a union bound shows that whp every 𝐺 ∈ C satisfies 𝑒(𝐺 ∩ 𝐺(𝑛, 𝑝)) ≤ (1/4 + 2𝜀)𝑛²𝑝. Every triangle-free subgraph of 𝐺(𝑛, 𝑝) is contained in 𝐺 ∩ 𝐺(𝑛, 𝑝) for some 𝐺 ∈ C, and the result follows. □
such that
1. Every independent set 𝐼 of 𝐺 is contained in some 𝐶 ∈ C.
2. |𝐶 | ≤ (1 − 𝛿) |𝑉 | for every 𝐶 ∈ C.
𝑆 : I → 2^𝑉 and 𝐴 : 2^𝑉 → 2^𝑉
such that
• Every independent set of 𝐻 is contained in some 𝐶 ∈ C, and
• |𝐶 | ≤ (1 − 𝛿)𝑣(𝐻) for every 𝐶 ∈ C.
Like the graph container theorem, the hypergraph container theorem is proved by
designing an algorithm to produce, from an independent set 𝐼 ⊆ 𝑉 (𝐻), a fingerprint
𝑆 ⊆ 𝐼 and a container 𝐶 ⊃ 𝐼.
The hypergraph container algorithm is more involved than the graph container algorithm. In fact, the 3-uniform hypergraph container algorithm calls the graph container algorithm.
Container algorithm for 3-uniform hypergraphs (a very rough sketch):
Throughout the algorithm, we will maintain
• A fingerprint 𝑆, initially 𝑆 = ∅
• A 3-uniform hypergraph 𝐴, initially 𝐴 = 𝐻
• A graph 𝐺 of “forbidden” pairs on 𝑉 (𝐻), initially 𝐺 = ∅
While |𝑆| ≤ 𝑣(𝐻)/√𝑑 − 1:
• Let 𝑢 be the first vertex in 𝐼 in the max-degree order on 𝐴
• Add 𝑢 to 𝑆
• Add 𝑥𝑦 to 𝐸 (𝐺) whenever 𝑢𝑥𝑦 ∈ 𝐸 (𝐻)
• Remove from 𝑉(𝐴) the vertex 𝑢 as well as all vertices preceding 𝑢 in the max-degree order on 𝐴
• Remove from 𝑉(𝐴) every vertex whose degree in 𝐺 is larger than 𝑐√𝑑.
• Remove from 𝐸 ( 𝐴) every edge that contains an edge of 𝐺.
Finally, it will be the case that either
• We have removed many vertices from 𝑉 ( 𝐴)
• Or the final graph 𝐺 has at least Ω(√𝑑 · 𝑛) edges and maximum degree 𝑂(√𝑑), so that we can apply the graph container lemma to 𝐺.
In either case, the algorithm produces a container with the desired properties. Again
see Morris’ lecture notes for details.
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.