Discrete Structures

Level 3 engineering course

Machine Translated by Google

A Course in Discrete Structures

Rafael Pass
Wei-Lung Dustin Tseng

Preface

Discrete mathematics deals with objects that come in discrete bundles, e.g., 1 or 2
babies. In contrast, continuous mathematics deals with objects that vary
continuously, e.g., 3.42 inches from a wall. Think of digital watches versus analog
watches (ones where the second hand loops around continuously without
stopping).
Why study discrete mathematics in computer science? It does not directly
help us write programs. At the same time, it is the mathematics underlying almost
all of computer science. Here are a few examples:

• Designing high-speed networks and message routing paths.


• Finding good algorithms for sorting.
• Performing web searches.
• Analyzing algorithms for correctness and efficiency.
• Formalizing security requirements.
• Designing cryptographic protocols.

Discrete mathematics uses a range of techniques, some of which are seldom
found in its continuous counterpart. This course will roughly cover the following
topics and specific applications in computer science.

1. Sets, functions and relations


2. Proof techniques and induction
3. Number theory
a) The math behind the RSA Crypto system
4. Counting and combinatorics
5. Probability
a) Spam detection
b) Formal security
6. Logic
a) Proofs of program correctness
7. Graph theory


a) Message Routing
b) Social networks
8. Finite automata and regular languages
a) Compilers

In the end, we will learn to write precise mathematical statements that


capture what we want in each application, and learn to prove things about
these statements. For example, how will we formalize the infamous zero-
knowledge property? How do we state, in mathematical terms, that a banking
protocol allows a user to prove that she knows her password, without ever
revealing the password itself?

Contents

1 Sets, Functions and Relations
  1.1 Sets
  1.2 Relations
  1.3 Functions
  1.4 Set Cardinality, revisited

2 Proofs and Induction
  2.1 Basic Proof Techniques
  2.2 Proof by Cases and Examples
  2.3 Induction
  2.4 Inductive Definitions
  2.5 Fun Tidbits

3 Number Theory
  3.1 Divisibility
  3.2 Modular Arithmetic
  3.3 Primes
  3.4 The Euler φ Function
  3.5 Public-Key Cryptosystems and RSA

4 Counting
  4.1 The Product and Sum Rules
  4.2 Permutations and Combinations
  4.3 Combinatorial Identities
  4.4 Inclusion-Exclusion Principle
  4.5 Pigeonhole Principle

5 Probability
  5.1 Probability Spaces
  5.2 Conditional Probability and Independence
  5.3 Random Variables
  5.4 Expectation
  5.5 Variance

6 Logic
  6.1 Propositional Logic
  6.2 Logical Inference
  6.3 First Order Logic
  6.4 Applications

7 Graphs
  7.1 Graph Isomorphism
  7.2 Paths and Cycles
  7.3 Graph Coloring
  7.4 Random Graphs [Optional]

8 Finite Automata
  8.1 Deterministic Finite Automata
  8.2 Non-Deterministic Finite Automata
  8.3 Regular Expressions and Kleene's Theorem

A Problem Sets
  A.1 Problem Set A

B Solutions to Problem Sets
  B.1 Problem Set A

Chapter 1

Sets, Functions and Relations

“A happy person is not a person in a certain set of circumstances, but rather a


person with a certain set of attitudes.”
–Hugh Downs

1.1 Sets
A set is one of the most fundamental objects in mathematics.

Definition 1.1 (Set, informal). A set is an unordered collection of objects.

Our definition is informal because we do not define what a “collection” is; a


deeper study of sets is out of the scope of this course.

Example 1.2. The following notations all refer to the same set:

{1, 2}, {2, 1}, {1, 2, 1, 2}, {x | x is an integer, 1 ≤ x ≤ 2}

The last example reads as “the set of all x such that x is an integer between 1
and 2 (inclusive)”.

We will encounter the following sets and notations throughout the course:

• ∅ = { }, the empty set.

• N = {0, 1, 2, 3, . . . }, the non-negative integers.

• N+ = {1, 2, 3, . . . }, the positive integers.

• Z = {. . . , −2, −1, 0, 1, 2, . . . }, the integers.

• Q = {q | q = a/b, a, b ∈ Z, b ≠ 0}, the rational numbers.

• Q+ = {q | q ∈ Q, q > 0}, the positive rationals.

• R, the real numbers.


• R+, the positive reals.

Given a collection of objects (a set), we may want to know how large the
collection is:

Definition 1.3 (Set cardinality). The cardinality of a set A is the number of
(distinct) objects in A, written as |A|. When |A| ∈ N (a finite integer), A is a finite
set; otherwise A is an infinite set. We discuss the cardinality of infinite sets later.

Example 1.4. |{1, 2, 3}| = |{1, 2, {1, 2}}| = 3.

Given two collections of objects (two sets), we may want to know if they are
equal, or if one collection contains the other. These notions are formalized as
set equality and subsets:

Definition 1.5 (Set equality). Two sets S and T are equal, written as S = T, if S
and T contain exactly the same elements, i.e., for every x, x ∈ S ⇔ x ∈ T.

Definition 1.6 (Subsets). A set S is a subset of set T, written as S ⊆ T, if every
element in S is also in T, i.e., for every x, x ∈ S ⇒ x ∈ T. Set S is a strict subset of
T, written as S ⊊ T, if S ⊆ T and there exists some element x ∈ T such that x ∉ S.

Example 1.7.

• {1, 2} ⊆ {1, 2, 3}.

• {1, 2} ⊊ {1, 2, 3}.

• {1, 2, 3} ⊆ {1, 2, 3}.

• {1, 2, 3} is not a strict subset of {1, 2, 3}.

• For any set S, ∅ ⊆ S.

• For every set S ≠ ∅, ∅ ⊊ S.

• S ⊆ T and T ⊆ S if and only if S = T.

Finally, it is time to formalize operations on sets. Given two collections of


objects, we may want to merge the collections (set union), identify the objects
in common (set intersection), or identify the objects unique to one collection
(set difference). We may also be interested in knowing all possible ways of
picking one object from each collection (Cartesian product), or all possible ways
of picking some objects from just one of the collections (power set).

Definition 1.8 (Set operations). Given sets S and T, we define the following
operations:

• Power Sets. P(S) is the set of all subsets of S.

• Cartesian Product. S × T = {(s, t) | s ∈ S, t ∈ T}.

• Union. S ∪ T = {x | x ∈ S or x ∈ T}, the set of elements in S or T.

• Intersection. S ∩ T = {x | x ∈ S, x ∈ T}, the set of elements in S and T.

• Difference. S − T = {x | x ∈ S, x ∉ T}, the set of elements in S but not T.

• Complement. S̄ = {x | x ∉ S}, the set of elements not in S. This is only
meaningful when we have an implicit universe U of objects, i.e., S̄ = {x | x ∈
U, x ∉ S}.

Example 1.9. Let S = {1, 2, 3}, T = {3, 4}, V = {a, b}. Then:

• P(T) = {∅, {3}, {4}, {3, 4}}.

• S × V = {(1, a), (1, b), (2, a), (2, b), (3, a), (3, b)}.

• S ∪ T = {1, 2, 3, 4}.

• S ∩ T = {3}.

• S − T = {1, 2}.

• If we are dealing with the set of all integers, S̄ = {. . . , −2, −1, 0, 4, 5, . . . }.
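As a concrete aside, the operations of Definition 1.8 map directly onto Python's built-in sets; the sketch below is our own illustration (the helper power_set is not from the text) and reproduces Example 1.9:

```python
# Set operations from Definition 1.8, checked against Example 1.9.
# frozenset is used inside the power set so that subsets can themselves
# be elements of a set.
from itertools import chain, combinations

def power_set(A):
    """P(A): the set of all subsets of A, each as a frozenset."""
    items = list(A)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

S, T, V = {1, 2, 3}, {3, 4}, {"a", "b"}

union = S | T                               # S ∪ T = {1, 2, 3, 4}
intersection = S & T                        # S ∩ T = {3}
difference = S - T                          # S − T = {1, 2}
product = {(s, v) for s in S for v in V}    # S × V: six ordered pairs
PT = power_set(T)                           # P(T): four subsets of {3, 4}
```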

Some set operations can be visualized using Venn diagrams. See Figure
1.1. To give an example of working with these set operations, consider the
following set identity.

Theorem 1.10. For all sets S and T, S = (S ∩ T) ∪ (S − T).

Proof. We can visualize the set identity using Venn diagrams (see Figure 1.1b
and 1.1c). To formally prove the identity, we will show both of the following:

S ⊆ (S ∩ T) ∪ (S − T) (1.1)

(S ∩ T) ∪ (S − T) ⊆ S (1.2)

To prove (1.1), consider any element x ∈ S. Either x ∈ T or x ∉ T.

• If x ∈ T, then x ∈ S ∩ T, and thus also x ∈ (S ∩ T) ∪ (S − T).

• If x ∉ T, then x ∈ (S − T), and thus again x ∈ (S ∩ T) ∪ (S − T).

To prove (1.2), consider any x ∈ (S ∩ T) ∪ (S − T). Either x ∈ S ∩ T or x ∈ S − T.

• If x ∈ S ∩ T, then x ∈ S.

[Figure: four two-set Venn diagrams under universe U, shading (a) S ∪ T,
(b) S ∩ T, (c) S − T, (d) S̄, plus (e) a Venn diagram with three sets.]

Figure 1.1: Venn diagrams of sets S, T, and V under universe U.

• If x ∈ S − T, then x ∈ S.

In computer science, we frequently use the following additional notation


(these notations can be viewed as shorthand):

Definition 1.11. Given a set S and a natural number n ∈ N,

• S^n is the set of length-n “strings” (equivalently, n-tuples) with alphabet
S. Formally, we define it as the product of n copies of S (i.e., S × S ×
· · · × S).

• S^* is the set of finite-length “strings” with alphabet S. Formally, we
define it as the union S^0 ∪ S^1 ∪ S^2 ∪ · · · , where S^0 is a set that
contains only one element: the empty string (or the empty tuple “()”).

• [n] is the set {0, 1, . . . , n − 1}.

Commonly seen sets include {0, 1}^n as the set of n-bit strings, and {0, 1}^*
as the set of finite-length bit strings. Also observe that |[n]| = n.
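To make Definition 1.11 concrete, the sets S^n and [n] are easy to enumerate for small n; the snippet below (our illustration) builds {0, 1}^3 with itertools:

```python
# S^n as the set of length-n tuples over alphabet S, and [n] = {0, ..., n-1}.
from itertools import product

def strings(S, n):
    """S^n: all length-n tuples ("strings") over the alphabet S."""
    return set(product(S, repeat=n))

bits3 = strings({0, 1}, 3)    # {0,1}^3: the 2^3 = 8 three-bit strings
interval = set(range(5))      # [5] = {0, 1, 2, 3, 4}, so |[5]| = 5
```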
Before we end this section, let us revisit our informal definition of sets: an
unordered “collection” of objects. In 1901, Russell came up with the following
“set”, known as Russell's paradox¹:

S = {x | x ∉ x}

That is, S is the set of all sets that don't contain themselves as an element.
This might seem like a natural “collection”, but is S ∈ S? It's not hard to see that
S ∈ S ⇔ S ∉ S. The conclusion today is that S is not a good “collection” of
objects; it is not a set.
So how do we know if {x | x satisfies some condition} is a set? Formally,
sets can be defined axiomatically, where only collections constructed from a
careful list of rules are considered sets. This is outside the scope of this course.
We will take a shortcut, and restrict our attention to a well-behaved universe.
Let E be all the objects that we are interested in (numbers, letters, etc.), and let
U = E ∪ P(E) ∪ P(P(E)), i.e., E, subsets of E, and subsets of subsets of E. In fact,
we can extend U with three power set operations, or indeed any finite number
of power set operations. Then S = {x | x ∈ U and some condition holds} is
always a set.

1.2 Relations

Definition 1.12 (Relations). A relation on sets S and T is a subset of S × T.
A relation on a single set S is a subset of S × S.

Example 1.13. “Taller-than” is a relation on people; (A, B) ∈ “Taller-than” if
person A is taller than person B. “≤” is a relation on R; “≤” = {(x, y) | x, y ∈ R,
x ≤ y}.

Definition 1.14 (Reflexivity, symmetry, and transitivity). A relation R on set S is:

• Reflexive if (x, x) ∈ R for all x ∈ S.

• Symmetric if whenever (x, y) ∈ R, then (y, x) ∈ R.


¹A folklore version of this paradox concerns itself with barbers. Suppose in a town, the
only barber shaves all and only those men in town who do not shave themselves. This
seems perfectly reasonable, until we ask: does the barber shave himself?

• Transitive if whenever (x, y), (y, z) ∈ R, then (x, z) ∈ R.

Example 1.15.

• “≤” is reflexive, but “<” is not.

• “sibling-of” is symmetric, but “≤” and “sister-of” are not.

• “sibling-of”, “≤”, and “<” are all transitive, but “parent-of” is not
(“ancestor-of” is transitive, however).
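For finite relations, the three properties of Definition 1.14 can be checked by brute force; below is a minimal sketch (the helper names are ours) using “≤” and “<” on {1, 2, 3}:

```python
# Brute-force checks of reflexivity, symmetry, and transitivity for a
# relation R given as a set of ordered pairs over a finite set S.
def is_reflexive(R, S):
    return all((x, x) in R for x in S)

def is_symmetric(R):
    return all((y, x) in R for (x, y) in R)

def is_transitive(R):
    return all((x, z) in R for (x, y) in R for (y2, z) in R if y == y2)

S = {1, 2, 3}
leq = {(x, y) for x in S for y in S if x <= y}   # the relation "≤" on S
lt = {(x, y) for x in S for y in S if x < y}     # the relation "<" on S
```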

Definition 1.16 (Graph of relations). The graph of a relation R over S is a
directed graph with nodes corresponding to elements of S. There is an edge
from node x to y if and only if (x, y) ∈ R. See Figure 1.2.
Theorem 1.17. Let R be a relation over S.

• R is reflexive iff its graph has a self-loop on every node.

• R is symmetric iff in its graph, every edge goes both ways.

• R is transitive iff in its graph, for any three nodes x, y and z such that there
is an edge from x to y and from y to z, there exists an edge from x to z.

• More naturally, R is transitive iff in its graph, whenever there is a path from
node x to node y, there is also a direct edge from x to y.

Proof. The proofs of the first three parts follow directly from the definitions.
The proof of the last bullet relies on induction; we will revisit it later.

Definition 1.18 (Transitive closure). The transitive closure of a relation R is the
least (i.e., smallest) transitive relation R* such that R ⊆ R*.

Pictorially, R* is the connectivity relation: if there is a path from x to y
in the graph of R, then (x, y) ∈ R*.

Example 1.19. Let R = {(1, 2), (2, 3), (1, 4)} be a relation (say on the set Z).
Then (1, 3) ∈ R* (since (1, 2), (2, 3) ∈ R), but (2, 4) ∉ R*. See Figure 1.2.
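Definition 1.18 also suggests a computation: keep adding the pairs forced by transitivity until nothing changes. The sketch below (our illustration) recovers Example 1.19:

```python
# Transitive closure by fixed point: repeatedly add (x, w) whenever
# (x, y) and (y, w) are already present; stop when no new pairs appear.
def transitive_closure(R):
    closure = set(R)
    while True:
        new = {(x, w) for (x, y) in closure
                      for (z, w) in closure if y == z}
        if new <= closure:           # nothing new is forced: we have R*
            return closure
        closure |= new

R = {(1, 2), (2, 3), (1, 4)}
Rstar = transitive_closure(R)        # adds exactly the pair (1, 3)
```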

Theorem 1.20. A relation R is transitive iff R = R*.

Definition 1.21 (Equivalence relations). A relation R on set S is an
equivalence relation if it is reflexive, symmetric, and transitive.

Equivalence relations capture the everyday notion of “being the same” or
“equal”.

[Figure: (a) the graph of the relation R = {(1, 2), (2, 3), (1, 4)} on nodes 1–4;
(b) the graph of R*, the transitive closure of R.]

Figure 1.2: The graph of a relation and its transitive closure.

Example 1.22. The following are equivalence relations:

• Equality, “=”, a relation on numbers (say N or R).

• Parity = {(x, y) | x, y are both even or both odd}, a relation on the
integers.

1.3 Functions
Definition 1.23. A function f : S → T is a “mapping” from elements in set S to elements in set
T. Formally, f is a relation on S and T such that for each s ∈ S, there exists a unique t ∈ T
such that (s, t) ∈ f. S is the domain of f, and T is the range of f. {y | y = f(x) for some x ∈ S}
is the image of f.

We often think of a function as being characterized by an algebraic formula, e.g., y = 3x
− 2 characterizes the function f(x) = 3x − 2. Not all formulas characterize a function, e.g.,
x² + y² = 1 is a relation (a circle) that is not a function (no unique y for each x). Some
functions are also not easily characterized by an algebraic expression, e.g., the function
mapping past dates to recorded weather.

Definition 1.24 (Injection). f : S → T is injective (one-to-one) if for every t ∈ T, there exists at
most one s ∈ S such that f(s) = t. Equivalently, f is injective if whenever s ≠ s′, we have
f(s) ≠ f(s′).

Example 1.25.

• f : N → N, f(x) = 2x is injective.

• f : R+ → R+, f(x) = x² is injective.

• f : R → R, f(x) = x² is not injective since (−x)² = x².

Definition 1.26 (Surjection). f : S → T is surjective (onto) if the image of f equals
its range. Equivalently, for every t ∈ T, there exists some s ∈ S such that f(s) = t.

Example 1.27.

• f : N → N, f(x) = 2x is not surjective.

• f : R+ → R+, f(x) = x² is surjective.

• f : R → R, f(x) = x² is not surjective since negative reals don't have real
square roots.

Definition 1.28 (Bijection). f : S → T is bijective, or a one-to-one correspondence,
if it is injective and surjective.

See Figure 1.3 for an illustration of injections, surjections, and bijections.
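On finite sets, Definitions 1.24 and 1.26 can be tested exhaustively. In the sketch below (our illustration), a function on a finite domain is stored as a Python dict:

```python
# A finite function is injective iff no two keys share a value, and
# surjective onto T iff every element of T is hit by some key.
def is_injective(f):
    values = list(f.values())
    return len(values) == len(set(values))

def is_surjective(f, T):
    return set(f.values()) == set(T)

double = {n: 2 * n for n in range(5)}        # f(x) = 2x on {0,...,4}
square = {n: n * n for n in range(-3, 4)}    # f(x) = x^2 on {-3,...,3}
```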

Definition 1.29 (Inverse relation). Given a function f : S → T, the inverse
relation f⁻¹ on T and S is defined by (t, s) ∈ f⁻¹ if and only if f(s) = t.

If f is bijective, then f⁻¹ is a function (unique inverse for each t). Similarly,
if f is injective, then f⁻¹ is also a function if we restrict the domain of f⁻¹ to
be the image of f. Often an easy way to show that a function is one-to-one is to
exhibit such an inverse mapping. In both these cases, f⁻¹(f(x)) = x.
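The last observation is easy to see on a finite example (our illustration): for an injective f stored as a dict, swapping keys and values gives f⁻¹ on the image of f:

```python
# Inverse of an injective function: swap each pair (s, t) to (t, s).
# The result is a function from the image of f back to the domain,
# and f_inv[f[x]] == x for every x in the domain.
f = {n: 2 * n for n in range(5)}        # injective: f(x) = 2x
f_inv = {t: s for s, t in f.items()}    # defined only on the image of f
```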

1.4 Set Cardinality, revisited


Bijections are very useful for showing that two sets have the same number of
elements. If f : S → T is a bijection and S and T are finite sets, then |S| = |T|. In
fact, we will extend this definition to infinite sets as well.

Definition 1.30 (Set cardinality). Let S and T be two potentially infinite sets. S
and T have the same cardinality, written as |S| = |T|, if there exists a bijection f :
S → T (equivalently, if there exists a bijection f : T → S). T has cardinality larger
than or equal to S, written as |S| ≤ |T|, if there exists an injection g : S → T
(equivalently, if there exists a surjection g : T → S).

To “intuitively justify” Definition 1.30, see Figure 1.3. The next theorem
shows that this definition of cardinality corresponds well with our intuition for
size: if both sets are at least as large as the other, then they have the same
cardinality.

[Figure: arrow diagrams between sets X and Y showing (a) an injective
function from X to Y, (b) a surjective function from X to Y, and (c) a
bijective function from X to Y.]

Figure 1.3: Injective, surjective and bijective functions.

Theorem 1.31 (Cantor-Bernstein-Schroeder). If |S| ≤ |T| and |T| ≤ |S|, then
|S| = |T|. In other words, given injective maps g : S → T and h : T → S, we
can construct a bijection f : S → T.
We omit the proof of Theorem 1.31; interested readers can easily
find multiple flavors of proofs online. Set cardinality is much more
interesting when the sets are infinite. The cardinality of the natural numbers
is extra special, since you can “count” the numbers. (It is also the “smallest
infinite set”, a notion that is outside the scope of this course.)
Definition 1.32. A set S is countable if it is finite or has the same cardinality
as N+. Equivalently, S is countable if |S| ≤ |N+|.
Example 1.33.

• {1, 2, 3} is countable because it is finite.


• N is countable because it has the same cardinality as N+; consider f :
N+ → N, f(x) = x − 1.

• The set of positive even numbers, S = {2, 4, . . . }, is countable; consider
f : N+ → S, f(x) = 2x.

Theorem 1.34. The set of positive rational numbers Q+ is countable.

Proof. Q+ is clearly not finite, so we need a way to count Q+. Note that double
counting, triple counting, even counting some elements infinitely many times is
okay, as long as we eventually count all of Q+. I.e., we implicitly construct a
surjection f : N+ → Q+.
Let us count in the following way. We first order the rational numbers p/q by
the value of p + q; then we break ties by ordering according to p. The ordering
then looks like this:

• First group (p + q = 2): 1/1

• Second group (p + q = 3): 1/2, 2/1

• Third group (p + q = 4): 1/3, 2/2, 3/1

Implicitly, we have f(1) = 1/1, f(2) = 1/2, f(3) = 2/1, etc. Clearly, f is a surjection.
See Figure 1.4 for an illustration of f.

1/1 1/2 1/3 1/4 1/5 . . .

2/1 2/2 2/3 2/4 2/5 . . .

3/1 3/2 3/3 3/4 3/5 . . .

4/1 4/2 4/3 4/4 4/5 . . .

5/1 5/2 5/3 5/4 5/5 . . .


⋮     ⋮     ⋮     ⋮     ⋮

Figure 1.4: An infinite table containing all positive rational numbers (with
repetition). The red arrow represents how f traverses this table—how we
count the rationals.
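The counting scheme in this proof is easy to carry out mechanically; the sketch below (our helper; Fraction normalizes repeats such as 2/2 to 1) lists positive rationals group by group:

```python
# Enumerate positive rationals p/q ordered by p + q, ties broken by p.
# Repetitions (1/1, 2/2, ...) are allowed: we only need a surjection
# from N+ onto Q+.
from fractions import Fraction

def enumerate_rationals(num_terms):
    out = []
    total = 2                       # current group: p + q = total
    while len(out) < num_terms:
        for p in range(1, total):   # q = total - p >= 1
            out.append(Fraction(p, total - p))
            if len(out) == num_terms:
                return out
        total += 1
```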

Theorem 1.35. There exist sets that are not countable.

Proof. Here we use Cantor's diagonalization argument. Let S be the set of infinite
sequences (d1, d2, . . .) over digits {0, 1}. Clearly S is infinite. To

show that there cannot be a bijection with N+, we proceed by contradiction.

Suppose f : N+ → S is a bijection. We can then enumerate these strings using f,
producing a 2-dimensional table of digits:

f(1) = s^1 = (d^1_1, d^1_2, d^1_3, . . .)
f(2) = s^2 = (d^2_1, d^2_2, d^2_3, . . .)
f(3) = s^3 = (d^3_1, d^3_2, d^3_3, . . .)

Now consider s* = (1 − d^1_1, 1 − d^2_2, 1 − d^3_3, . . .), i.e., we take the diagonal
of the above table and flip all the digits. Then for any n, s* differs from s^n
in the n-th digit. This contradicts the fact that f is a bijection.
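The diagonal construction can be played out on any finite table of 0/1 rows (our illustrative sketch); the flipped diagonal provably differs from row n in position n:

```python
# Cantor's diagonal flip: given rows of equal length, build a sequence
# whose n-th digit is 1 minus the n-th digit of row n. The result
# differs from every row, so it cannot appear in the table.
def diagonal_flip(table):
    return [1 - table[n][n] for n in range(len(table))]

table = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]
s_star = diagonal_flip(table)
```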

Theorem 1.36. The real interval [0, 1] (the set of real numbers between 0 and 1,
inclusive) is uncountable.

Proof. We will show that |[0, 1]| ≥ |S|, where S is the same set as in the proof of
Theorem 1.35. Treat each s = (d1, d2, . . .) ∈ S as the real number between 0
and 1 with the binary expansion 0.d1d2 · · · . Note that this does not establish a
bijection; some real numbers have two binary expansions, e.g., 0.1 = 0.0111 · · ·
in binary (similarly, in decimal expansion, we have 0.1 = 0.0999 · · · ).²
We can overcome this “annoyance” in two ways:

• Since each real number can have at most two binary representations
(most have only one), we can easily extend the above argument to show
that |S| ≤ |[0, 2]| (i.e., map [0, 1] to one representation, and [1, 2] to the
other). It remains to show that |[0, 1]| = |[0, 2]| (can you think of a bijection
here?).

• We may repeat Cantor's diagonalization argument as in the proof of
Theorem 1.35, in decimal expansion. When we construct s*, avoid using the
digits 9 and 0 (e.g., use only the digits 4 and 5).

A major open problem in mathematics (it was one of Hilbert's 23 famous
problems listed in 1900) was whether there exists some set whose cardinality is
between those of N and R (can you show that R has the same cardinality as [0, 1]?).
Here is a naive candidate: P(N). Unfortunately, P(N) has the same
cardinality as [0, 1]. Note that every element S ∈ P(N) corresponds to an infinitely
long sequence over digits {0, 1} (the n-th digit is 1 if and only if the number
n ∈ S). Again, we arrive at the set S in the proof of Theorem 1.35.
²For a proof, consider letting x = 0.0999 · · · , and observe that 10x − x = 0.999 · · · −
0.0999 · · · = 0.9, which solves to x = 0.1.

The Continuum Hypothesis states that no such set exists. Gödel and
Cohen together showed (in 1940 and 1963) that this can neither be proven
nor disproved using the standard axioms underlying mathematics (we will
talk more about axioms when we get to logic).

Chapter 2

Proofs and Induction

“Pics or it didn’t happen.”


– the internet

There are many forms of mathematical proofs. In this chapter we introduce several basic
types of proofs, with special emphasis on a technique called induction that is invaluable
to the study of discrete math.

2.1 Basic Proof Techniques


In this section we consider the following general task: given a premise X, how do we show
that a conclusion Y holds? One way is to give a direct proof.
Start with premise X, and directly deduce Y through a series of logical steps.
See Claim 2.1 for an example.
Claim 2.1. Let n be an integer. If n is even, then n² is even. If n is odd,
then n² is odd.

Direct proof. If n is even, then n = 2k for some integer k, and

n² = (2k)² = 4k² = 2 · (2k²), which is even.

If n is odd, then n = 2k + 1 for some integer k, and

n² = (2k + 1)² = 4k² + 4k + 1 = 2 · (2k² + 2k) + 1, which is odd.
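The direct proof above is the real argument; as a quick sanity check (our addition, not part of the proof), the parity claim can also be tested numerically over a range of integers:

```python
# Numeric spot check of Claim 2.1: n and n^2 always share parity.
def same_parity(a, b):
    return a % 2 == b % 2

checked = all(same_parity(n, n * n) for n in range(-100, 101))
```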

There are also several forms of indirect proofs. A proof by contrapos-itive starts by
assuming that the conclusion Y is false, and deduce that the premise X must also be false
through a series of logical steps. See Claim 2.2 for an example.


Claim 2.2. Let n be an integer. If n² is even, then n is even.

Proof by contrapositive. Suppose that n is not even. Then by Claim 2.1, n² is not even
as well. (Yes, the proof ends here.)

A proof by contradiction, on the other hand, assumes both that the premise X is
true and that the conclusion Y is false, and reaches a logical fallacy.
We give another proof of Claim 2.2 as an example.

Proof by contradiction. Suppose that n² is even, but n is odd. Applying
Claim 2.1, we see that n² must be odd. But n² cannot be both odd and
even!

In their simplest forms, it may seem that a direct proof, a proof by contrapositive,
and a proof by contradiction are just restatements of each other; indeed, one can
always recast a direct proof or a proof by contrapositive as a proof by contradiction
(can you see how?). In more complicated proofs, however, choosing the “right” proof
technique sometimes simplifies or improves the aesthetics of a proof. Below is an
interesting use of proof by contradiction.

Theorem 2.3. √2 is irrational.

Proof by contradiction. Assume for contradiction that √2 is rational. Then there exist
integers p and q, with no common divisors, such that √2 = p/q (i.e., the reduced
fraction). Squaring both sides, we have:

2 = p²/q²  ⟹  2q² = p²

This means p² is even, and by Claim 2.2, p is even as well. Let us replace p by
2k. The expression becomes:

2q² = (2k)² = 4k²  ⟹  q² = 2k²

This time, we conclude that q² is even, and so q is even as well. But this leads
to a contradiction, since p and q now share a common factor of 2.

We end the section with the (simplest form of the) AM-GM inequality.

Theorem 2.4 (Simple AM-GM inequality). Let x and y be non-negative reals.
Then,

(x + y)/2 ≥ √(xy)

Proof by contradiction. Assume for contradiction that

(x + y)/2 < √(xy)
⟹ (1/4)(x + y)² < xy        (squaring non-negative values)
⟹ x² + 2xy + y² < 4xy
⟹ x² − 2xy + y² < 0
⟹ (x − y)² < 0

But this is a contradiction since squares are always non-negative.

Note that the proof of Theorem 2.4 can be easily turned into a direct proof;
the proof of Theorem 2.3, on the other hand, cannot.

2.2 Proof by Cases and Examples


Sometimes the easiest way to prove a theorem is to split it into several cases.

Claim 2.5. (n + 1)² ≥ 2^n for all integers n satisfying 0 ≤ n ≤ 5.

Proof by cases. There are only 6 different values of n. Let's try them all:

n   (n + 1)²        2^n
0       1      ≥      1
1       4      ≥      2
2       9      ≥      4
3      16      ≥      8
4      25      ≥     16
5      36      ≥     32
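The case analysis above is exactly a finite loop; the sketch below (our addition) verifies all six cases and notes that the inequality first fails at n = 6, foreshadowing Claim 2.7:

```python
# Exhaustive check of Claim 2.5: (n + 1)^2 >= 2^n for n = 0, ..., 5,
# and the inequality breaks at n = 6 (49 < 64).
holds = [(n + 1) ** 2 >= 2 ** n for n in range(6)]
fails_at_6 = (6 + 1) ** 2 < 2 ** 6
```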

Claim 2.6. For all real x, |x²| = |x|².

Proof by cases. Split into two cases: x ≥ 0 and x < 0.

• If x ≥ 0, then |x²| = x² = |x|².

• If x < 0, then |x²| = x² = (−x)² = |x|².

When presenting a proof by cases, make sure that all cases are covered! For
some theorems, we only need to construct one case that satisfies the theorem
statement.

Claim 2.7. Show that there exists some n such that (n + 1)² ≤ 2^n.

Proof by example. n = 6.

Sometimes we find a counterexample to disprove a theorem.

Claim 2.8. Prove or disprove that (n + 1)² ≥ 2^n for all n ∈ N.

Proof by (counter)example. We choose to disprove the statement: check
n = 6. Done.

The next proof does not explicitly construct the example asked for by the
theorem, but proves that such an example exists anyway. These types of proofs
(among others) are non-constructive.

Theorem 2.9. There exist irrational numbers x and y such that x^y is rational.

Non-constructive proof of existence. We know √2 is irrational from Theorem
2.3. Let z = √2^√2.

• If z is rational, then we are done (x = y = √2).

• If z is irrational, then take x = z = √2^√2, and y = √2. Then:

x^y = (√2^√2)^√2 = √2^(√2·√2) = (√2)² = 2

is indeed a rational number.
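A floating-point check (our addition) confirms the exponent arithmetic in the last step; it says nothing about irrationality, only that ((√2)^√2)^√2 evaluates to 2 up to rounding:

```python
# Sanity check of the computation in Theorem 2.9:
# (sqrt(2) ** sqrt(2)) ** sqrt(2) equals sqrt(2) ** 2 = 2,
# modulo floating-point error.
import math

z = math.sqrt(2) ** math.sqrt(2)
value = z ** math.sqrt(2)
```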

Here is another non-constructive existence proof. The game of Chomp is a
2-player game played on a “chocolate bar” made up of a rectangular grid.
The players take turns choosing one block and “eating it” (removing it from the
board), together with all other blocks that are below it or to its right (the whole
lower-right quadrant). The top-left block is “poisoned” and the player who eats
it loses.

Theorem 2.10. Suppose the game of Chomp is played with a rectangular grid
strictly larger than 1 × 1. Player 1 (the first player) has a winning strategy.

Proof. Consider the following first move for player 1: eat the lower-right-most block.
We have two cases¹:

¹Here we use the well-known fact about 2-player, deterministic, finite-move games
without ties: any move is either a winning move (i.e., there is a strategy following this move
that forces a win), or allows the opponent to follow up with a winning move. See Theorem
2.14 later for a proof of this fact.

• Case 1: There is a winning strategy for player 1 starting with this move.
In this case we are done.

• Case 2: There is no winning strategy for player 1 starting with this move.
In this case there is a winning strategy for player 2 following this move.
But this winning strategy for player 2 is also a valid winning strategy for
player 1, since the next move made by player 2 can be mimicked by
player 1 (here we need the fact that the game is symmetric between the
players).

While we have just shown that Player 1 can always win in a game of Chomp,
no constructive strategy for Player 1 has been found for general rectangular
grids (i.e., you cannot buy a strategy guide in stores that tells you how to win
Chomp). For a few specific cases, though, we do know good strategies for
Player 1. E.g., given an n × n square grid, Player 1 starts by removing an
(n − 1) × (n − 1) block, leaving an L-shaped piece of chocolate with two “arms”;
thereafter, Player 1 simply mirrors Player 2's moves, i.e., whenever Player 2 takes
a bite from one of the arms, Player 1 takes the same bite on the other arm.

As our last example, consider tiling an 8 × 8 chessboard with dominoes (2 ×
1 pieces), i.e., the whole board should be covered by dominoes without any
dominoes overlapping each other or sticking out.

Q: Can we tile it?


A: Yes. Easy to give a proof by example (constructive existence proof).

Q: What if I remove one square from the chessboard?

A: No. Each domino covers 2 squares, so the number of covered squares is always
even, but the board now has 63 squares (direct proof / proof by contradiction).

Q: What if I remove the top-left and bottom-right squares?

A: No. Each domino covers one square of each color. The top-left and bottom-right
squares have the same color, however, so the remaining board has more white
squares than black (or more black squares than white) (direct proof / proof by
contradiction).
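The coloring argument can be checked by direct count (our illustration): color square (r, c) with (r + c) mod 2, remove two same-colored corners, and the two color counts no longer match:

```python
# Mutilated chessboard: removing the two opposite corners (both of
# color 0 under the (r + c) % 2 coloring) leaves 30 squares of one
# color and 32 of the other, so 31 dominoes cannot cover the board.
board = {(r, c) for r in range(8) for c in range(8)}
board -= {(0, 0), (7, 7)}               # both corners have color 0
color0 = sum(1 for (r, c) in board if (r + c) % 2 == 0)
color1 = len(board) - color0
```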

2.3 Induction
We start with the most basic form of induction: induction over the natural
numbers. Suppose we want to show that a statement is true for all natural

numbers, e.g., for all n, 1 + 2 + · · · + n = n(n + 1)/2. The basic idea is to approach
the proof in two steps:

1. First prove that the statement is true for n = 1. This is called the base
case.

2. Next prove that whenever the statement is true for case n, then it is also
true for case n + 1. This is called the inductive step.

The base case shows that the statement is true for n = 1. Then, by repeatedly
applying the inductive step, we see that the statement is true for n = 2, and then
n = 3, and then n = 4, 5, . . . ; we just covered all the natural numbers!
Think of pushing over a long line of dominoes. The induction step is just like
setting up the dominoes; we make sure that if a domino falls, so will the next one.
The base case is then analogous to pushing down the first domino. The result?
All the dominoes fall.
Follow these steps to write an inductive proof:

1. Start by formulating the inductive hypothesis (ie, what you want to prove). It
should be parametrized by a natural number. Eg, P(n): 1 + 2 + · · · + n =
n(n + 1)/2.

2. Show that P(base) is true for some appropriate base case. Usually base
is 0 or 1.

3. Show that the inductive step is true, ie, assume P(n) holds and prove that
P(n + 1) holds as well.

Voilà, we have just shown that P(n) holds for all n ≥ base. Note that the base
case does not always have to be 0 or 1; we can start by showing that
P(n) is true for n = 5; this, combined with the inductive step, shows that P(n) is
true for all n ≥ 5. Let's put our newfound power of inductive proofs to the test!

Claim 2.11. For all positive integers n, 1 + 2 + · · · + n = n(n + 1)/2.

Proof. Define our induction hypothesis P(n) to be true if

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$$

Base case: P(1) is clearly true by inspection.



Inductive Step: Assume P(n) is true; we wish to show that P(n + 1) is true as well:

$$\begin{aligned}
\sum_{i=1}^{n+1} i &= \sum_{i=1}^{n} i + (n+1)\\
&= \frac{n(n+1)}{2} + (n+1) && \text{using } P(n)\\
&= \frac{1}{2}\bigl(n(n+1) + 2(n+1)\bigr) = \frac{1}{2}(n+1)(n+2)
\end{aligned}$$

This is exactly P(n + 1).
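Induction certifies the formula for every n at once, but a quick numerical spot-check (our own addition, not part of the text) is still reassuring:

```python
# Spot-check the closed form 1 + 2 + ... + n = n(n + 1) / 2.
for n in range(1, 101):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2
print("closed form agrees for n = 1..100")
```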

Claim 2.12. For any finite set S, |P(S)| = 2^|S|.

Proof. Define our induction hypothesis P(n) to be true if for every finite set S of cardinality
|S| = n, |P(S)| = 2^n.
Base case: P(0) is true since the only finite set of size 0 is the empty set
∅, and the power set of the empty set, P(∅) = {∅}, has cardinality 1.
Inductive Step: Assume P(n) is true; we wish to show that P(n + 1) is true as well.
Consider a finite set S of cardinality n + 1. Pick an element e ∈ S, and consider S′ = S −
{e}. By the induction hypothesis, |P(S′)| = 2^n.
Now consider P(S). Observe that a set in P(S) either contains e or not; furthermore,
there is a one-to-one correspondence between the sets containing e and the sets not
containing e (can you think of the bijection?). We have just partitioned P(S) into two equal-cardinality
subsets, one of which is P(S′).
Therefore |P(S)| = 2|P(S′)| = 2^{n+1}.
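The claim is easy to confirm experimentally for small sets; the following sketch (our own, not part of the text) enumerates power sets with Python's itertools:

```python
from itertools import chain, combinations

# Enumerate the power set of S as tuples of every possible length.
def power_set(S):
    S = list(S)
    return list(chain.from_iterable(combinations(S, r) for r in range(len(S) + 1)))

# |P(S)| = 2^|S| for a few small sets.
for n in range(6):
    assert len(power_set(set(range(n)))) == 2 ** n
print("|P(S)| = 2^|S| verified for |S| = 0..5")
```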

Claim 2.13. The following two properties of graphs are equivalent (recall that these are
the definitions of transitivity on the graph of a relation):

1. For any three nodes x, y and z such that there is an edge from x to y
and from y to z, there exists an edge from x to z.
2. Whenever there is a path from node x to node y, there is also a direct
edge from x to y.

Proof. Clearly property 2 implies property 1. We use induction to show that property 1
implies property 2 as well. Let G be a graph on which property 1 holds. Define our
induction hypothesis P(n) to be true if for every path of length n in G from node x to node
y, there exists a direct edge from x to y.
Base case: P(1) is simply true (path of length 1 is already a direct edge).
Inductive Step: Assume P(n) is true; we wish to show that P(n + 1) is true as well.
Consider a path of length n + 1 from node x to node y, and let

z be the first node after x on the path. We now have a path of length n from
node z to y, and by the induction hypothesis, a direct edge from z to y. Now
that we have a direct edge from x to z and from z to y, property 1 implies
that there is a direct edge from x to y.

Theorem 2.14. In a deterministic, finite 2-player game of perfect information without ties,
either player 1 or player 2 has a winning strategy, ie, a
strategy that guarantees a win.²,³

Proof. Let P(n) be the theorem statement for n-move games.


Base case: P(1) is trivially true. Since only player 1 gets to move, if
there exists some move that makes player 1 win, then player 1 has a winning
strategy; otherwise player 2 always wins and has a winning strategy (the
strategy of doing nothing).
Inductive Step: Assume P(n) is true; we wish to show that P(n + 1) is
true as well. Consider some (n + 1)-move game. After player 1 makes the first
move, we end up in an n-move game. Each such game has a winning strategy
for either player 1 or player 2 by P(n).

• If all these games have a winning strategy for player 2,⁴ then no matter
what move player 1 plays, player 2 has a winning strategy.

• If one of these games has a winning strategy for player 1, then player 1
has a winning strategy (by making the corresponding first move).
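The backward-induction argument in this proof can be run as a program. Below is a sketch (our own toy example, not from the text) that decides the winner of a simple subtraction game by exactly this recursion:

```python
from functools import lru_cache

# Backward induction on a toy game: players alternately remove 1 or 2
# stones from a pile; the player who takes the last stone wins.
# first_player_wins(n) asks: does the player to move have a winning strategy?
@lru_cache(maxsize=None)
def first_player_wins(n):
    if n == 0:
        return False  # no move available: the player to move has lost
    # Some move leads to a position where the opponent (next mover) loses.
    return any(not first_player_wins(n - take) for take in (1, 2) if take <= n)

# Every position is decided, as Theorem 2.14 guarantees; here the losing
# positions turn out to be exactly the multiples of 3.
print([n for n in range(10) if not first_player_wins(n)])  # -> [0, 3, 6, 9]
```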

In the next example, induction is used to prove only a subset of the theorem to give
us a jump start; the theorem can then be completed using other
techniques.

Theorem 2.15 (AM-GM Inequality). Let x1, x2, . . . , xn be a sequence of non-negative
reals. Then

$$\frac{1}{n}\sum_{i} x_i \ge \left(\prod_{i} x_i\right)^{1/n}$$

³By deterministic, we mean the game has no randomness and depends only on player
moves (eg, not backgammon). By finite, we mean the game always ends in some predetermined
fixed number of moves; in chess, even though there are infinite sequences of moves
that avoid both checkmates and stalemates, many draw rules (eg, one cannot have more than
100 consecutive moves without captures or pawn moves) ensure that chess is a finite game.
By perfect information, we mean that both players know each other's past moves (eg, no
fog of war).
⁴By this we mean player 1 of the n-move game (the next player to move) has a
winning strategy.

Proof. In this proof we use the notation

$$AM(x_1, \ldots, x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad\qquad GM(x_1, \ldots, x_n) = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

Let us first prove the AM-GM inequality for values of n = 2^k. Define our
induction hypothesis P(k) to be true if AM-GM holds for n = 2^k.
Base case: P(0) (ie, n = 1) trivially holds, and P(1) (ie, n = 2) was shown in Theorem
2.4.
Inductive Step: Assume P(k) is true; we wish to show that P(k + 1) is
true as well. Given a sequence $X = (x_1, \ldots, x_{2^{k+1}})$ of length $2^{k+1}$, we split it into
two sequences $X^1 = (x_1, \ldots, x_{2^k})$ and $X^2 = (x_{2^k+1}, x_{2^k+2}, \ldots, x_{2^{k+1}})$. Then:

$$\begin{aligned}
AM(X) &= \frac{1}{2}\left(AM(X^1) + AM(X^2)\right)\\
&\ge \frac{1}{2}\left(GM(X^1) + GM(X^2)\right) && \text{by the induction hypothesis } P(k)\\
&= AM(GM(X^1), GM(X^2))\\
&\ge GM(GM(X^1), GM(X^2)) && \text{by Theorem 2.4, ie, } P(1)\\
&= \left(\left(\prod_{i=1}^{2^k} x_i\right)^{1/2^k}\left(\prod_{i=2^k+1}^{2^{k+1}} x_i\right)^{1/2^k}\right)^{1/2}\\
&= \left(\prod_{i=1}^{2^{k+1}} x_i\right)^{1/2^{k+1}} = GM(X)
\end{aligned}$$

We are now ready to show the AM-GM inequality for sequences of all lengths. Given
a sequence X = (x_1, \ldots, x_n) where n is not a power of 2, find the smallest k such that 2^k >
n. Let α = AM(X), and consider a new
sequence

$$X' = (x_1, \ldots, x_n, \; x_{n+1} = \alpha, \; x_{n+2} = \alpha, \; \ldots, \; x_{2^k} = \alpha)$$

and verify that AM(X′) = AM(X) = α. Applying P(k) (the AM-GM inequality for sequences of
length 2^k), we have:

$$\begin{aligned}
\alpha = AM(X') &\ge GM(X') = \left(\prod_{i=1}^{2^k} x'_i\right)^{1/2^k}\\
\Rightarrow \quad \alpha^{2^k} &\ge \prod_{i=1}^{2^k} x'_i = \left(\prod_{i=1}^{n} x_i\right) \cdot \alpha^{2^k - n}\\
\Rightarrow \quad \alpha^{n} &\ge \prod_{i=1}^{n} x_i\\
\Rightarrow \quad \alpha &\ge \left(\prod_{i=1}^{n} x_i\right)^{1/n} = GM(X)
\end{aligned}$$

This finishes our proof (recalling that α = AM(X)).

Note that for the inductive proof in Theorem 2.15, we needed to show both base cases
P(0) and P(1) to avoid a circular argument, since the inductive step relies on P(1) being true.
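For readers who like empirical evidence alongside proofs, here is a quick numerical spot-check of AM-GM on random inputs (our own addition, not part of the text):

```python
import random

# Numerically spot-check the AM-GM inequality on random sequences.
def am(xs):
    return sum(xs) / len(xs)

def gm(xs):
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

random.seed(0)
for _ in range(1000):
    xs = [random.uniform(0, 10) for _ in range(random.randint(1, 8))]
    assert am(xs) >= gm(xs) - 1e-9  # small tolerance for float rounding
print("AM >= GM held on 1000 random sequences")
```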

A common technique in inductive proofs is to define a stronger induction hypothesis


than is needed by the theorem. A stronger induction hypothesis P(n) sometimes makes
the induction step simpler, since we would start each induction step with a stronger
premise. As an example, consider the game of “coins on the table”. The game is played
on a round table between two players. The players take turns putting one penny at a
time onto the table, without overlapping with previous pennies; the first player who cannot
add another coin loses.

Theorem 2.16. The first player has a winning strategy in the game of “coins on the table”.

Proof. Consider the following strategy for player 1 (the first player). Start first by putting a
penny centered on the table, and in all subsequent moves, simply mirror player 2's last
move (ie, place a penny diagonally opposite of player 2's last penny). We prove by induction
that player 1 can always put down a coin, and therefore will win eventually (when the table
runs out of space).

Define the induction hypothesis P(n) to be true if on the nth move of player
1, player 1 can put down a penny according to his strategy, and leave the table symmetric
about the center (ie, looks the same if rotated 180 degrees).

Base case: P(1) holds since player 1 can always start by putting one
penny at the center of the table, leaving the table symmetric.
Inductive Step: Assume P(n) is true; we wish to show that P(n + 1) is
true as well. By the induction hypothesis, after player 1's nth move, the table
is symmetric. Therefore, if player 2 now puts down a penny, the diagonally opposite
spot must be free of pennies, allowing player 1 to set down a penny as well. Moreover,
after player 1's move, the table is back to being symmetric.

The Towers of Hanoi is a puzzle game where there are three poles, and a number of
increasingly larger rings that are originally all stacked in order of size on the first pole, largest
at the bottom. The goal of the puzzle is to move all the rings to another pole (pole 2 or pole
3), with the rules that:

• You can only move one ring at a time, and it must be the topmost ring in one of the
three potential stacks.

• At any point, no ring can be placed on top of a smaller ring.⁵


Theorem 2.17. The Towers of Hanoi with n rings can be solved in 2^n − 1
moves.

Proof. Define the induction hypothesis P(n) to be true if the theorem statement is true for n rings.
Base case: P(1) is clearly true. Just move the ring.
Inductive Step: Assume P(n) is true; we wish to show that P(n + 1) is true as well.
Number the rings 1 to n + 1, from smallest to largest (top to bottom on the original stack).
First move rings 1 to n from pole 1 to pole 2; this takes 2^n − 1 steps by the induction
hypothesis P(n). Now move ring n + 1 from pole 1 to pole 3. Finally, move rings 1 to n from
pole 2 to pole 3; again, this takes 2^n − 1 steps by the induction hypothesis P(n). In total we
have used (2^n − 1) + 1 + (2^n − 1) = 2^{n+1} − 1 moves. (Convince yourself that this recursive
definition of moves will never violate the rule that no ring can be placed on top of a smaller
ring.)
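The proof of Theorem 2.17 is really a recursive algorithm. A direct transcription (our own sketch):

```python
# The recursive solution from the proof of Theorem 2.17: move n rings
# from `src` to `dst` using `spare`, returning the list of moves.
def hanoi(n, src=1, dst=3, spare=2):
    if n == 0:
        return []
    moves = hanoi(n - 1, src, spare, dst)   # rings 1..n-1 out of the way
    moves.append((src, dst))                # move ring n
    moves += hanoi(n - 1, spare, dst, src)  # rings 1..n-1 on top of ring n
    return moves

# The move count matches 2^n - 1.
for n in range(1, 11):
    assert len(hanoi(n)) == 2 ** n - 1
print(hanoi(3))  # -> [(1, 3), (1, 2), (3, 2), (1, 3), (2, 1), (2, 3), (1, 3)]
```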

Legends say that such a puzzle was found in a temple with n = 64 rings, left for the
priests to solve. With our solution, that would require 2^64 − 1 ≈ 1.8 × 10^19 moves. Is our
solution just silly, taking too many moves?

Theorem 2.18. The Towers of Hanoi with n rings requires at least 2^n − 1 moves to solve.
Good luck, priests!

⁵Squashing small rings with large rings is bad, m'kay?



Proof. Define the induction hypothesis P(n) to be true if the theorem statement
is true for n rings.
Base case: P(1) is clearly true. You need to move the ring.
Inductive Step: Assume P(n) is true; we wish to show that P(n + 1) is true as
well. Again we number the rings 1 to n + 1, from smallest to largest (top to bottom
on the original stack). Consider ring n + 1. It needs to be moved at some point.
Without loss of generality, assume its final destination
is pole 3. Let the kth move be the first move where ring n + 1 is moved away
from pole 1 (to pole 2 or 3), and let the k′th move be the last move where ring
n + 1 is moved to pole 3 (away from pole 1 or pole 2).
Before performing move k, all n other rings must first be moved to the
remaining free pole (pole 3 or 2); by the induction hypothesis P(n), 2^n − 1 steps
are required before move k. Similarly, after performing move k′, all n other rings
must be on the remaining free pole (pole 2 or 1); by the induction hypothesis P(n),
2^n − 1 steps are required after move k′ to complete the puzzle.
In the best case where k = k′ (ie, they are the same move), we still need at least
(2^n − 1) + 1 + (2^n − 1) = 2^{n+1} − 1 moves.

Strong Induction
Taking the dominoes analogy one step further, a large domino may require the
combined weight of all the previous dominoes toppling over before it topples over as well.
The mathematical equivalent of this idea is strong induction. To prove that a
statement P(n) is true for (a subset of) the positive integers, the basic idea
is:

1. First prove that P(n) is true for some base values of n (eg, n = 1).
These are the base cases.

2. Next prove that whenever P(k) is true for 1 ≤ k ≤ n, then P(n + 1) is true.
This is called the inductive step.

How many base cases do we need? It roughly depends on the following factors:

• What is the theorem? Just like basic induction, if we only need P(n) to
be true for n ≥ 5, then we don't need base cases n < 5.

• What does the induction hypothesis need? Often to show P(n + 1), instead
of requiring that P(k) be true for 1 ≤ k ≤ n, we actually need, say, P(n) and
P(n − 1) to be true. Then having the base case P(1) isn't enough for the
induction hypothesis to prove P(3); P(2) is another required base case.

Let us illustrate both factors with an example.

Claim 2.19. Suppose we have an unlimited supply of 3 cent and 5 cent coins.
Then we can pay any amount ≥ 8 cents.

Proof. Let P(n) be true if we can indeed form n cents with 3 cent and 5
cent coins.

Base case: P(8) is true since 3 + 5 = 8.

Inductive Step: Assume P(k) is true for 8 ≤ k ≤ n; we wish to show that P(n +
1) is true as well. This seems easy; if P(n − 2) is true, then adding another 3 cent
coin gives us P(n + 1). But the induction hypothesis doesn't necessarily say P(n
− 2) is true! For (n + 1) ≥ 11, the induction hypothesis does apply (since n − 2 ≥
8). For n + 1 = 9 or 10, we have to do more work.
Additional base cases: P(9) is true since 3 + 3 + 3 = 9, and P(10) is true
since 5 + 5 = 10.
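Claim 2.19 can also be witnessed exhaustively for small amounts; the sketch below (our own, not part of the text) searches for an explicit combination of coins:

```python
# Exhaustively witness Claim 2.19 for small amounts: every n >= 8 cents
# can be paid with some combination of 3 and 5 cent coins.
def pay(n):
    for fives in range(n // 5 + 1):
        rest = n - 5 * fives
        if rest % 3 == 0:
            return (rest // 3, fives)  # (number of 3c coins, number of 5c coins)
    return None

for n in range(8, 200):
    threes, fives = pay(n)
    assert 3 * threes + 5 * fives == n
print(pay(8), pay(9), pay(10))  # -> (1, 1) (3, 0) (0, 2)
```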

With any induction, especially strong induction, it is very important to check
for sufficient base cases! Here is what might happen in a faulty strong inductive
proof.⁶ Let P(n) be true if for all groups of n women, whenever one woman is
blonde, then all of the women are blonde; since there is at least one blonde in
the world, once I am done with the proof, every woman in the world will be
blonde!

Base case: P(1) is clearly true.

Induction step: Assume P(k) is true for all 1 ≤ k ≤ n; we wish to show P(n + 1) is
true as well. Given a set W of n + 1 women in which x ∈ W is blonde, take
any two strict subsets A, B ⊊ W (in particular |A|, |B| < n + 1) such that they
both contain the blonde (x ∈ A, x ∈ B), and A ∪ B = W (no one is left out).
Applying the induction hypothesis to A and B, we conclude that all the
women in A and B are blonde, and so everyone in W is blonde.

What went wrong?⁷

⁶Another example is to revisit Claim 2.19. If we use the same proof to show that P(n) is
true for all n ≥ 3, without the additional base cases, the proof will be “seemingly correct”.
What is the obvious contradiction?
⁷Hint: Can you trace the argument when n = 2?

2.4 Inductive Definitions

In addition to being a proof technique, induction can be used to define mathematical
objects. Some basic examples include products or sums of sequences:

• The factorial function n! over non-negative integers can be formally
defined by

0! = 1;  (n + 1)! = n! · (n + 1)

• The cumulative sum of a sequence x_1, . . . , x_k, often written as $S(n) = \sum_{i=1}^{n} x_i$, can be formally defined by

S(0) = 0;  S(n + 1) = S(n) + x_{n+1}

Just like inductive proofs, inductive definitions start with a “base case” (eg,
defining 0! = 1), and have an “inductive step” to define the rest of the values
(eg, knowing 0! = 1, we can compute 1! = 1 · 1 = 1, 2! = 1 · 2 = 2, and so
on).
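Inductive definitions translate directly into recursive programs. A sketch (our own) of the two definitions above:

```python
# A sketch of the factorial definition above:
def factorial(n):
    if n == 0:                   # base case: 0! = 1
        return 1
    return factorial(n - 1) * n  # inductive step: n! = (n-1)! * n

# ...and the cumulative sum S(n) of a sequence xs (1-indexed in the text):
def cumulative_sum(xs, n):
    if n == 0:                   # base case: S(0) = 0
        return 0
    return cumulative_sum(xs, n - 1) + xs[n - 1]  # S(n) = S(n-1) + x_n

print(factorial(5))                     # -> 120
print(cumulative_sum([1, 2, 3, 4], 4))  # -> 10
```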

Recurrence Relations
When an inductive definition generates a sequence (eg, the factorial sequence
is 1, 1, 2, 6, 24, . . .), we call the definition a recurrence relation. We can generalize
inductive definitions and recurrence relations in a way much like we
generalized inductive proofs with strong induction. For example, consider a
sequence defined by:

a_0 = 1;  a_1 = 2;  a_n = 4a_{n−1} − 4a_{n−2}

According to the definition, the next few terms in the sequence will be

a_2 = 4;  a_3 = 8

At this point, the sequence looks suspiciously as if a_n = 2^n. Let's prove this
by induction!
by induction!

Proof. Define P(n) to be true if a_n = 2^n.

Base case: P(0) and P(1) are true since a_0 = 1 = 2^0, a_1 = 2 = 2^1.
Inductive Step: Assume P(k) is true for 0 ≤ k ≤ n; we wish to show
that P(n + 1) is true as well for n + 1 ≥ 2. We have

$$\begin{aligned}
a_{n+1} &= 4a_n - 4a_{n-1}\\
&= 4 \cdot 2^{n} - 4 \cdot 2^{n-1} && \text{by } P(n) \text{ and } P(n-1)\\
&= 2^{n+2} - 2^{n+1} = 2^{n+1}
\end{aligned}$$

This is exactly P(n + 1).



Remember that it is very important to check all the base cases (especially
since this proof uses strong induction). Let us consider another example:

b_0 = 1;  b_1 = 1;  b_n = 4b_{n−1} − 3b_{n−2}

From the recurrence part of the definition, it looks like the sequence (b_n)_n will
eventually outgrow the sequence (a_n)_n. Based only on this intuition, let us
conjecture that b_n = 3^n.

Possibly correct proof. Define P(n) to be true if b_n = 3^n.

Base case: P(0) is true since b_0 = 1 = 3^0.
Inductive Step: Assume P(k) is true for 0 ≤ k ≤ n; we wish to show that P(n +
1) is true as well for n + 1 ≥ 3. We have

$$\begin{aligned}
b_{n+1} &= 4b_n - 3b_{n-1}\\
&= 4 \cdot 3^{n} - 3 \cdot 3^{n-1} && \text{by } P(n) \text{ and } P(n-1)\\
&= (3^{n+1} + 3^{n}) - 3^{n}\\
&= 3^{n+1}
\end{aligned}$$

Wow! Was that a lucky guess or what? Let us actually compute a few terms
of (b_n)_n to make sure. . .

b_2 = 4b_1 − 3b_0 = 4 − 3 = 1,
b_3 = 4b_2 − 3b_1 = 4 − 3 = 1,
...

It looks like, in fact, b_n = 1 for all n (as an exercise, prove this by induction).
What went wrong with our earlier “proof”? Note that P(n − 1) is only well defined
if n ≥ 1, so the inductive step does not work when we try to show P(1) (when n =
0). As a result we need an extra base case to handle P(1); a simple check shows
that it is just not true: b_1 = 1 ≠ 3^1 = 3. (On the other hand, if we define b′_0 = 1,
b′_1 = 3, and b′_n = 4b′_{n−1} − 3b′_{n−2}, then we can recycle our “faulty
proof” and show that b′_n = 3^n.)
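Computing a few terms mechanically is a cheap sanity check on any conjectured closed form; the sketch below (our own) reproduces both behaviours:

```python
# Terms of b_n = 4*b_{n-1} - 3*b_{n-2} for given base cases b0, b1.
def terms(b0, b1, count):
    seq = [b0, b1]
    while len(seq) < count:
        seq.append(4 * seq[-1] - 3 * seq[-2])
    return seq

print(terms(1, 1, 8))  # -> [1, 1, 1, 1, 1, 1, 1, 1] (refuting b_n = 3^n)
print(terms(1, 3, 8))  # with b'_1 = 3 we do get the powers of 3
```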

In the examples so far, we guessed at a closed form formula for the sequences
(a_n)_n and (b_n)_n, and then proved that our guesses were correct using
induction. For certain recurrence relations, there are direct methods for
computing a closed form formula of the sequence.

Theorem 2.20. Consider the recurrence relation a_n = c_1 a_{n−1} + c_2 a_{n−2} with c_2
≠ 0, and arbitrary base cases for a_0 and a_1. Suppose that the polynomial x² − (c_1 x
+ c_2) has two distinct roots r_1 and r_2 (these roots are non-zero since
c_2 ≠ 0). Then there exist constants α and β such that a_n = αr_1^n + βr_2^n.

Proof. The polynomial f(x) = x² − (c_1 x + c_2) is called the characteristic
polynomial for the recurrence relation a_n = c_1 a_{n−1} + c_2 a_{n−2}. Its significance
can be explained by the sequence (r^0, r^1, . . .) where r is a root of f(x); we
claim that this sequence satisfies the recurrence relation (with base cases set
as r^0 and r^1). Let P(n) be true if a_n = r^n.
Inductive Step: Assume P(k) is true for 0 ≤ k ≤ n; we wish to show
that P(n + 1) is true as well. Observe that:

$$\begin{aligned}
a_{n+1} &= c_1 a_n + c_2 a_{n-1}\\
&= c_1 r^{n} + c_2 r^{n-1} && \text{by } P(n-1) \text{ and } P(n)\\
&= r^{n-1}(c_1 r + c_2)\\
&= r^{n-1} \cdot r^2 && \text{since } r \text{ is a root of } f(x)\\
&= r^{n+1}
\end{aligned}$$

Recall that there are two distinct roots, r_1 and r_2, so we actually have two
sequences that satisfy the recurrence relation (under proper base cases). In
fact, because the recurrence relation is linear (a_n depends linearly on a_{n−1} and
a_{n−2}) and homogeneous (there is no constant term in the recurrence relation),
any sequence of the form a_n = αr_1^n + βr_2^n will satisfy the recurrence relation
(this can be shown using a similar inductive step as above).
Finally, do sequences of the form a_n = αr_1^n + βr_2^n cover all possible base
cases? The answer is yes. Given any base case a_0 = a′_0, a_1 = a′_1, we can solve
for the unique values of α and β using the linear system:

$$a'_0 = \alpha r_1^0 + \beta r_2^0 = \alpha + \beta$$
$$a'_1 = \alpha r_1^1 + \beta r_2^1 = \alpha r_1 + \beta r_2$$

The studious reader should check that this linear system always has a unique
solution (say, by checking that the determinant of the system is non-zero).
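The recipe of Theorem 2.20 is easy to mechanize. The sketch below (our own illustration; the helper name `closed_form` is ours) solves for α and β and checks the result against the recurrence b_n = 4b_{n−1} − 3b_{n−2} discussed earlier, whose characteristic polynomial x² − 4x + 3 has roots 1 and 3:

```python
# Solve for alpha, beta in a_n = alpha*r1^n + beta*r2^n, given distinct
# roots of the characteristic polynomial and base cases a_0, a_1.
def closed_form(r1, r2, a0, a1):
    # Solve  a0 = alpha + beta,  a1 = alpha*r1 + beta*r2  by substitution.
    beta = (a1 - a0 * r1) / (r2 - r1)
    alpha = a0 - beta
    return lambda n: alpha * r1 ** n + beta * r2 ** n

f = closed_form(1, 3, a0=1, a1=1)  # base cases b_0 = b_1 = 1  ->  b_n = 1
g = closed_form(1, 3, a0=1, a1=3)  # base cases b'_0 = 1, b'_1 = 3  ->  b'_n = 3^n

seq, seq2 = [1, 1], [1, 3]
for _ in range(8):
    seq.append(4 * seq[-1] - 3 * seq[-2])
    seq2.append(4 * seq2[-1] - 3 * seq2[-2])

assert all(abs(f(n) - seq[n]) < 1e-9 for n in range(10))
assert all(abs(g(n) - seq2[n]) < 1e-9 for n in range(10))
print("closed forms match the recurrences")
```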

The technique outlined in Theorem 2.20 can be extended to any recurrence
relation of the form

a_n = c_1 a_{n−1} + c_2 a_{n−2} + · · · + c_k a_{n−k}

for some constant k; the solution is always a linear combination of k sequences
of the form (r^0, r^1, r^2, . . .), one for each distinct root r of the characteristic
polynomial

$$f(x) = x^{k} - (c_1 x^{k-1} + c_2 x^{k-2} + \cdots + c_k)$$

In the case that f(x) has duplicate roots, say when a root r has multiplicity m, in order to still
have a total of k distinct sequences, we associate the following m sequences with r:

$$\begin{aligned}
&(r^0, \; r^1, \; r^2, \; \ldots, \; r^n, \; \ldots)\\
&(0 \cdot r^0, \; 1 \cdot r^1, \; 2 \cdot r^2, \; \ldots, \; n \cdot r^n, \; \ldots)\\
&(0^2 \cdot r^0, \; 1^2 \cdot r^1, \; 2^2 \cdot r^2, \; \ldots, \; n^2 \cdot r^n, \; \ldots)\\
&\qquad\vdots\\
&(0^{m-1} \cdot r^0, \; 1^{m-1} \cdot r^1, \; 2^{m-1} \cdot r^2, \; \ldots, \; n^{m-1} \cdot r^n, \; \ldots)
\end{aligned}$$

For example, if f(x) has degree 2 and has a unique root r with multiplicity 2, then the general
form solution to the recurrence is

$$a_n = \alpha r^n + \beta n r^n$$

We omit the proof of this general construction. Interestingly, the same technique is used in
many other branches of mathematics (for example, to solve linear ordinary differential
equations).
As an example, let us derive a closed form expression for the famous Fibonacci numbers.

Theorem 2.21. Define the Fibonacci sequence inductively as

f_0 = 0;  f_1 = 1;  f_n = f_{n−1} + f_{n−2}

Then

$$f_n = \frac{1}{\sqrt{5}}\left(\frac{1+\sqrt{5}}{2}\right)^{n} - \frac{1}{\sqrt{5}}\left(\frac{1-\sqrt{5}}{2}\right)^{n} \tag{2.1}$$

Proof. It is probably hard to guess (2.1); we will derive it from scratch. The characteristic
polynomial here is f(x) = x² − (x + 1), which has roots

$$\frac{1+\sqrt{5}}{2}, \qquad \frac{1-\sqrt{5}}{2}$$

This means the Fibonacci sequence can be expressed as

$$f_n = \alpha\left(\frac{1+\sqrt{5}}{2}\right)^{n} + \beta\left(\frac{1-\sqrt{5}}{2}\right)^{n}$$

[Figure 2.1: Approximating the golden ratio with rectangles whose side lengths are
consecutive elements of the Fibonacci sequence. Do the larger rectangles look more
pleasing than the smaller rectangles to you?]

for constants α and β. Substituting f_0 = 0 and f_1 = 1 gives us

$$0 = \alpha + \beta$$
$$1 = \alpha\,\frac{1+\sqrt{5}}{2} + \beta\,\frac{1-\sqrt{5}}{2}$$

which solves to α = 1/√5, β = −1/√5.

As a consequence of (2.1), we know that for large n,

$$f_n \approx \frac{1}{\sqrt{5}}\left(\frac{1+\sqrt{5}}{2}\right)^{n}$$

because the other term approaches zero. This in turn implies that

$$\lim_{n\to\infty} \frac{f_{n+1}}{f_n} = \frac{1+\sqrt{5}}{2}$$

which is the golden ratio. It is widely believed that a rectangle whose ratio (length divided by
width) is golden is pleasing to the eye; as a result, the golden ratio can be found in many
artworks and architectural works throughout history.
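Formula (2.1) and the limit above can both be spot-checked numerically; a sketch (our own, not part of the text):

```python
from math import sqrt

# Check the closed form (2.1) against the recursive Fibonacci definition,
# and watch the ratio f_{n+1}/f_n approach the golden ratio.
def fib_closed(n):
    phi = (1 + sqrt(5)) / 2
    psi = (1 - sqrt(5)) / 2
    return (phi ** n - psi ** n) / sqrt(5)

fibs = [0, 1]
for _ in range(30):
    fibs.append(fibs[-1] + fibs[-2])

assert all(round(fib_closed(n)) == fibs[n] for n in range(len(fibs)))
print(fibs[21] / fibs[20])  # close to (1 + sqrt(5)) / 2 ~ 1.618
```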

2.5 Fun Tidbits


We end the section with a collection of fun examples and anecdotes on
induction.

Induction and Philosophy: the Sorites Paradox


The sorites paradox, stated below, seems to question the validity of inductive
arguments:

Base Case: One grain of sand is not a heap of sand.

Inductive Step: If n grains of sand is not a heap of sand, then n + 1 grains


of sand is not a heap of sand either.

We then conclude that a googol (10^100) grains of sand is not a heap of sand
(this is more than the number of atoms in the observable universe by some
estimates). What went wrong? The base case and the inductive step are perfectly
valid! There are many “solutions” to this paradox, one of which is to blame
it on the vagueness of the word “heap”; the notion of vagueness is itself a topic
of interest in philosophy.

Induction and Rationality: the Traveller's Dilemma


Two travelers, Alice and Bob, fly with identical vases; the vases get broken.
The airline company offers to reimburse Alice and Bob in the following way.
Alice and Bob, separately, are asked to quote the value of the vase at between
2 and 100 dollars. If they come up with the same value, then the airline will
reimburse both Alice and Bob at the agreed price. If they come up with different
values, m and m′ with m < m′, then the person who quoted the smaller amount m
gets m + 2 dollars in reimbursement, while the person who quoted the bigger
amount m′ gets m − 2 dollars. What should they do?
Quoting $100 seems like a good strategy. But if Alice knows Bob will quote
$100, then Alice should quote $99. In fact, quoting $99 is sometimes better and
never worse than quoting $100. We conclude that it is never “rational” to quote
$100.
But now that Alice and Bob know that the other person will never quote
$100, quoting $98 is now a better strategy than quoting $99. We conclude that
it is never “rational” to quote $99 or above.
We can continue the argument by induction (this argument is called backward
induction in the economics literature) to conclude that the only rational thing for
Alice and Bob to quote is $2. Would you quote $2 (and do you think you are
“rational”)?
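The backward-induction argument is exactly iterated elimination of weakly dominated strategies, which can be carried out by brute force. The sketch below is our own illustration, run on a scaled-down version with quotes from 2 to 20 so the search stays small (the full game behaves the same way):

```python
# Iterated elimination of weakly dominated quotes in the Traveller's Dilemma.
def payoff(mine, theirs):
    if mine == theirs:
        return mine
    return theirs - 2 if mine > theirs else mine + 2

quotes = list(range(2, 21))
changed = True
while changed:
    changed = False
    for q in list(quotes):
        # q is weakly dominated by r if r never does worse and sometimes better.
        for r in [r for r in quotes if r != q]:
            if all(payoff(r, t) >= payoff(q, t) for t in quotes) and \
               any(payoff(r, t) > payoff(q, t) for t in quotes):
                quotes.remove(q)
                changed = True
                break
print(quotes)  # -> [2]
```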

Induction and Knowledge: the Muddy Children

Suppose a group of children are in a room, and some have mud on their
foreheads. All the children can see everyone else's forehead, but not their own
(no mirrors, and no touching), and so they do not know if they themselves are
muddy. The father comes into the room and announces that some of the children
are muddy, and asks if anyone knows (for sure) that they are one of the muddy
ones. Everyone says no. The father then asks the same question again, but
everyone still says no. The father keeps asking the same question over and
over again, until all of a sudden, all the muddy children in the room
simultaneously say yes, that they do know they are muddy. How many children
said yes? How many rounds of questioning have there been?

Claim 2.22. All the muddy children say yes in the nth round of questioning
(and not earlier) if and only if there are n muddy children.

Proof Sketch. Since we have not formally defined the framework of knowledge,
we are constrained to an informal proof sketch. Let P(n) be true if the claim is
true for n muddy children.
Base case: We start by showing P(1). If there is only one child that is
muddy, the child sees that everyone else is clean, and can immediately deduce
that he/she must be muddy (in order for there to be someone muddy in the
room). On the other hand, if there are 2 or more muddy children, then all the
muddy children see at least one other muddy child, and cannot tell whether
“some kids are muddy” refers to them or to the other muddy children in the room.
Inductive Step: Assume P(k) is true for 0 ≤ k ≤ n; we wish to show P(n + 1).
Suppose there are exactly n + 1 muddy children. Since there are more than n
muddy children, it follows by the induction hypothesis that no one will speak
before round n + 1. From the view of the muddy children, they see n other
muddy kids, and know from the start that there are either n or n + 1 muddy
children in total (depending on whether they themselves are muddy). But, by
the induction hypothesis, they know that if there were n muddy children, then
someone would have said yes in round n; since no one has said anything yet,
each muddy child deduces that he/she is indeed muddy and says yes in round
n + 1. Now suppose there are strictly more than n + 1 muddy children. In this
case, everyone sees at least n + 1 muddy children already. By the induction
hypothesis, every child knows from the beginning that no one will speak up in
the first n rounds. Thus in the (n + 1)st round, they have no more information
about who is muddy than when the father first asked the question, and thus
they cannot say yes.
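The reasoning in this proof sketch can be simulated with a possible-worlds model of knowledge: a world assigns muddy/clean to each child, and each round of public “no” answers eliminates worlds. This is our own illustration, not part of the text:

```python
from itertools import product

# Possible-worlds sketch of the muddy-children puzzle. A world is a 0/1
# tuple (muddy status per child); announcements eliminate worlds.
def muddy_rounds(actual, n_children):
    # Worlds consistent with the father's announcement: someone is muddy.
    worlds = [w for w in product([0, 1], repeat=n_children) if any(w)]
    rnd = 0
    while True:
        rnd += 1
        # Child i knows their status in world w0 if every remaining world
        # matching what they see (everyone else's foreheads) agrees on w[i].
        def knows(i, w0):
            consistent = [w for w in worlds
                          if all(w[j] == w0[j] for j in range(n_children) if j != i)]
            return len({w[i] for w in consistent}) == 1
        yes = [i for i in range(n_children) if knows(i, actual)]
        if yes:
            return rnd, yes
        # Everyone said "no": eliminate worlds where someone would have known.
        worlds = [w for w in worlds
                  if not any(knows(i, w) for i in range(n_children))]

# 3 muddy children among 5: all three say yes in round 3, as Claim 2.22 predicts.
print(muddy_rounds((1, 1, 1, 0, 0), 5))  # -> (3, [0, 1, 2])
```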

Induction Beyond the Natural Numbers [Optional Material]


In this chapter we have restricted our study of induction to the natural numbers.
Our induction hypotheses (eg, P(n)) are always parametrized by a natural
number, and our inductive definitions (eg, the Fibonacci sequence) have always
produced a sequence of objects indexed by the natural numbers.
Can we do induction over some other set that is not the natural numbers?
Clearly we can do induction on, say, all the even natural numbers, but what
about something more exotic, say the rational numbers, or the set of C
programs?

Rational Numbers. Let us start with an ill-fated example of induction on the
rational numbers. We are going to prove (incorrectly) that every non-negative
rational number q, written in reduced form a/b, must be even in the numerator
(a) and odd in the denominator (b). Let P(q) be true if the claim is true for rational
number q.

Base Case: P(0) is true since 0 = 0/1 in its reduced form.

Inductive Step: Assume P(k) is true for all rationals 0 ≤ k < n. We wish to show
that P(n) is true as well. Consider the rational number n/2 and let a′/b′ be
its reduced form. By the induction hypothesis P(n/2), a′ is even and b′ is
odd. It follows that n, in its reduced form, is (2a′)/b′, and thus P(n) is true.

Looking carefully at the proof, we are not making the same mistakes as before
in our examples of strong induction: to show P(n), we rely only on P(n/2), which
always satisfies 0 ≤ n/2 < n, so we are not simply missing base cases.
The only conclusion is that induction just “does not make sense” over the rational
numbers.

C Programs. On the other hand, we can inductively define and reason about C
programs. Let us focus on a simpler example: the set of (very limited) arithmetic
expressions defined by the following context free grammar:

expr → 0 | 1 | (expr + expr) | (expr × expr)

We can interpret this context free grammar as an inductive definition of arithmetic
expressions:

Base Case: An arithmetic expression can be the digit 0 or 1.

Inductive (Recursive) Definition: An arithmetic expression can be of the form
“(expr1 + expr2)” or “(expr1 × expr2)”, where expr1 and expr2 are themselves
arithmetic expressions.

Notice that this inductive definition does not give us a sequence of arithmetic
expressions! We can also define the value of an arithmetic expression inductively:

Base Case: The arithmetic expression “0” has value 0, and the expression
“1” has value 1.
Inductive Definition: An arithmetic expression of the form “(expr1+expr2)” has
value equal to the sum of the values of expr1 and expr2. Similarly, an
arithmetic expression of the form “(expr1 × expr2)” has value equal to the
product of the values of expr1 and expr2.

We can even use induction to prove, for example, that any expression of length
n must have value ≤ 2^{2^n}.
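The inductive definition of value translates directly into a recursive evaluator; a sketch (our own, not from the text):

```python
# A recursive-descent evaluator for the toy grammar
#   expr -> 0 | 1 | (expr + expr) | (expr × expr)
# following the inductive definition of "value" directly.
def value(expr):
    expr = expr.replace(" ", "")
    def parse(s, i):
        if s[i] in "01":              # base case: a digit
            return int(s[i]), i + 1
        assert s[i] == "("            # inductive case: (e1 op e2)
        left, i = parse(s, i + 1)
        op = s[i]
        right, i = parse(s, i + 1)
        assert s[i] == ")"
        return (left + right if op == "+" else left * right), i + 1
    v, _ = parse(expr, 0)
    return v

print(value("((1+(0×1))×(1+1))"))  # -> 2
```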

So how are the natural numbers, the rational numbers and C programs different
from one another? To make it more bewildering, did we not show a mapping
between the rational numbers and the natural numbers? The answer lies in the
way we induct through these sets or, metaphorically speaking, how the dominoes
are lined up. The formal desired property for lining up the dominoes is called a well-founded
relation, and is beyond the scope of this course. Instead, here is a
thought experiment that illustrates the difference between the numerous correct inductive
proofs in this chapter and the (faulty) inductive proof on the rational numbers.
Suppose we want to verify that 1 + 2 + · · · + 10 = 10 · 11/2 = 55; this is shown
in Claim 2.11, our very first inductive argument. Here is how we can proceed,
knowing the inductive proof:

• We verify the inductive step, and conclude that if 1 + 2 + · · · + 9 = 9 · 10/2 is
true (the induction hypothesis), then 1 + 2 + · · · + 10 = 10 · 11/2 is true. It
remains to verify that 1 + 2 + · · · + 9 = 9 · 10/2.

• To verify that 1 + 2 + · · · + 9 = 9 · 10/2, we again look at the inductive step
and conclude that it remains to verify that 1 + 2 + · · · + 8 = 8 · 9/2.

• Eventually, after a finite number of steps (9 steps in this case), it remains to
verify that 1 = 1, which is shown in the base case of the induction.

Similarly, to verify that “((1+(0×1))×(1+1))” is a valid arithmetic expression, we
first verify that it is of the form “(expr1 × expr2)”, and recursively verify
that “(1 + (0×1))” and “(1 + 1)” are valid arithmetic expressions. Again, this
recursive verification will end in finite time.
Finally, let us consider our faulty example with rational numbers. To show
that the number 2/3 in reduced form is an even number over an odd number,
we need to check the claim for the number 1/3, and for that we need to check
1/6, and 1/12, and so on; this never ends, so we never have a complete proof of
the desired (faulty) fact.

Chapter 3

Number Theory

“Mathematics is the queen of sciences and number theory is the queen of
mathematics.”
– Carl Friedrich Gauss

Number theory is the study of numbers (in particular the integers), and is one of
the purest branches of mathematics. Regardless, it has many applications in
computer science, particularly in cryptography, the underlying tool that builds
modern services such as secure e-commerce. In this chapter, we will touch on the
very basics of number theory, and put an emphasis on its applications to
cryptography.

3.1 Divisibility
A fundamental relationship between two numbers is whether or not one divides the other.

Definition 3.1 (Divisibility). Let a, b ∈ Z with a ≠ 0. We say that a divides b,
denoted by a|b, if there exists some k ∈ Z such that b = ak.

Example 3.2. 3|9, 5|10, but 3 ∤ 7.

The following theorem lists a few well-known properties of divisibility.

Theorem 3.3. Let a, b, c ∈ Z.

1. If a|b and a|c, then a|(b + c).
2. If a|b, then a|bc.
3. If a|b and b|c, then a|c (i.e., transitivity).


Proof. We show only item 1; the other proofs are similar (HW). By definition,

a|b ⇔ there exists k1 ∈ Z such that b = k1a
a|c ⇔ there exists k2 ∈ Z such that c = k2a

Therefore b + c = k1a + k2a = (k1 + k2)a, so a|(b + c).

Corollary 3.4. Let a, b, c ∈ Z. If a|b and a|c, then a|(mb + nc) for any m, n ∈ Z.

We learn in elementary school that even when integers don't divide evenly,
we can compute the quotient and the remainder.

Theorem 3.5 (Division Algorithm). For any a ∈ Z and d ∈ N+, there exist unique
q, r ∈ Z such that a = dq + r and 0 ≤ r < d.
q is called the quotient and denoted by q = a div d.
r is called the remainder and denoted by r = a mod d.

For example, dividing 99 by 10 gives a quotient of q = 99 div 10 = 9 and a
remainder of r = 99 mod 10 = 9, satisfying 99 = 10(9) + 9 = dq + r. On the other
hand, dividing 99 by 9 gives a quotient of q = 99 div 9 = 11 and a remainder of
r = 99 mod 9 = 0. Again, we have 99 = 9(11) + 0 = dq + r.
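As a quick sketch, Python's built-in divmod computes exactly this quotient and remainder; its conventions match the theorem even for negative a, since the remainder always lands in [0, d) for d > 0:

```python
# Division algorithm: a = d*q + r with 0 <= r < d.
a, d = 99, 10
q, r = divmod(a, d)   # q = a // d, r = a % d
assert a == d * q + r and 0 <= r < d
print(q, r)           # 9 9
print(divmod(-7, 3))  # (-3, 2): floor division keeps 0 <= r < d
```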
Onwards to proving the theorem.

Proof. Given a ∈ Z and d ∈ N+, let q = ⌊a/d⌋ (the greatest integer ≤ a/d), and
let r = a − dq. By choice of q and r, we have a = dq + r. We also have 0 ≤ r < d,
because q is the largest integer such that dq ≤ a. It remains to show uniqueness.

Let q′, r′ ∈ Z be any other pair of integers satisfying a = dq′ + r′ and 0 ≤ r′ < d.
We would have:

dq + r = dq′ + r′ ⇒
d · (q − q′) = r′ − r

This implies that d|(r′ − r). But −(d − 1) ≤ r′ − r ≤ d − 1 (because 0 ≤ r, r′ < d),
and the only number divisible by d between −(d − 1) and d − 1 is 0. Therefore we
must have r = r′, which in turn implies that q = q′.

Greatest Common Divisor


Definition 3.6 (Greatest Common Divisor). Let a, b ∈ Z with a, b not both 0. The
greatest common divisor of a and b, denoted by gcd(a, b), is the largest integer
d such that d|a and d|b.

Example 3.7.

gcd(4, 12) = gcd(12, 4) = gcd(ÿ4, ÿ12) = gcd(ÿ12, 4) = 4

gcd(12, 15) = 3 gcd(3, 5) = 1 gcd(20, 0) = 20

Euclid designed one of the first known algorithms in history (for any problem)
to compute the greatest common divisor:

Algorithm 1 EuclidAlg(a, b), a, b ∈ N, a, b not both 0

if b = 0 then
    return a;
else
    return EuclidAlg(b, a mod b);
end if

Example 3.8. Let's trace Euclid's algorithm on inputs 414 and 662.

EuclidAlg(414, 662) → EuclidAlg(662, 414) → EuclidAlg(414, 248)
→ EuclidAlg(248, 166) → EuclidAlg(166, 82) → EuclidAlg(82, 2)
→ EuclidAlg(2, 0) → 2

The work for each step is shown below:

662 = 414(1) + 248


414 = 248(1) + 166
248 = 166(1) + 82
166 = 82(2) + 2
82 = 2(41) + 0
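EuclidAlg translates almost line for line into Python; this sketch mirrors the pseudocode above (the standard library's math.gcd does the same job):

```python
def euclid_alg(a, b):
    """gcd by Euclid's algorithm: gcd(a, b) = gcd(b, a mod b), and gcd(a, 0) = a."""
    if b == 0:
        return a
    return euclid_alg(b, a % b)

print(euclid_alg(414, 662))  # 2, matching the trace in Example 3.8
```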

We now prove that Euclid's algorithm is correct in two steps. First, we show
that if the algorithm terminates, then it does output the correct greatest common
divisor. Next, we show that Euclid's algorithm always terminates (and does so
rather quickly).

Lemma 3.9. Let a, b ∈ N, b ≠ 0. Then gcd(a, b) = gcd(b, a mod b).

Proof. It is enough to show that the common divisors of a and b are the same
as the common divisors of b and (a mod b). If so, then the two pairs of numbers
must also share the same greatest common divisor.

By the division algorithm, there exist unique q, r ∈ Z such that a = bq + r
and 0 ≤ r < b. Also recall that by definition, r = a mod b = a − bq. Let d be a
common divisor of a and b, i.e., d divides both a and b. Then d also divides
r = a − bq (by Corollary 3.4). Thus d is a common divisor of b and r. Similarly,
let d be a common divisor of b and r. Then d also divides a = bq + r. Thus d is
a common divisor of a and b.

Theorem 3.10. Euclid's algorithm (EuclidAlg) produces the correct output if it


terminates.

Proof. This can be shown by induction, using Lemma 3.9 as the inductive step.
(What would be the base case?)

We now show that Euclid's algorithm always terminates.

Claim 3.11. For every two recursive calls made by EuclidAlg, the first argument
a is halved.

Proof. Assume EuclidAlg(a, b) is called. If b ≤ a/2, then the next recursive
call EuclidAlg(b, a mod b) already has the property that the first argument is
halved. Otherwise, we have b > a/2, so a mod b ≤ a/2. Then after two recursive
calls (first EuclidAlg(b, a mod b), then EuclidAlg(a mod b, b mod (a mod b))),
we have the property that the first argument is halved.

The next theorem follows directly from Claim 3.11.

Theorem 3.12. Euclid's algorithm, on input EuclidAlg(a, b) for a, b ∈ N, a, b
not both 0, always terminates. Moreover, it terminates in time proportional to
log₂ a.

We end the section with a useful fact on greatest common divisors.

Theorem 3.13. Let a, b ∈ N, a, b not both 0. Then there exist s, t ∈ Z such
that sa + tb = gcd(a, b).

Theorem 3.13 shows that we can give a certificate for the greatest common
divisor. From Corollary 3.4, we already know that any common divisor of a and
b also divides sa + tb. Thus, if we can identify a common divisor d of a and b,
and show that d = sa + tb for some s and t, this demonstrates d is in fact the
greatest common divisor (d = gcd(a, b)). And there is more good news! This
certificate can be produced by slightly modifying Euclid's algorithm (often called
the extended Euclid's algorithm); this also constitutes

a constructive proof of Theorem 3.13. We omit the proof here and give an
example instead.

Example 3.14. Suppose we want to find s, t ∈ Z such that s(252) + t(198) =
gcd(252, 198) = 18. Run Euclid's algorithm, but write out the equation a =
bq + r for each recursive call of EuclidAlg.

EuclidAlg(252, 198)    252 = 1(198) + 54    (3.1)
→ EuclidAlg(198, 54)   198 = 3(54) + 36     (3.2)
→ EuclidAlg(54, 36)    54 = 1(36) + 18      (3.3)
→ EuclidAlg(36, 18)    36 = 2(18) + 0       (3.4)

We can construct s and t by substituting the above equations “backwards”:

18 = 1(54) − 1(36)              by (3.3)
   = 1(54) − (1(198) − 3(54))   by (3.2)
   = 4(54) − 1(198)
   = 4(252 − 1(198)) − 1(198)   by (3.1)
   = 4(252) − 5(198)

We conclude that gcd(252, 198) = 18 = 4(252) − 5(198).
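The back-substitution can be automated; here is a sketch of the extended Euclid's algorithm in Python (the function name is ours):

```python
def extended_euclid(a, b):
    """Return (g, s, t) with g = gcd(a, b) = s*a + t*b."""
    if b == 0:
        return a, 1, 0
    g, s, t = extended_euclid(b, a % b)
    # gcd(b, a mod b) = s*b + t*(a mod b) and a mod b = a - (a//b)*b,
    # so gcd(a, b) = t*a + (s - (a//b)*t)*b.
    return g, t, s - (a // b) * t

g, s, t = extended_euclid(252, 198)
print(g, s, t)  # 18 4 -5, matching Example 3.14
```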

3.2 Modular Arithmetic


Modular arithmetic, as the name implies, is arithmetic on the remainders of
integers with respect to a fixed divisor. A central idea in modular arithmetic
is congruence: two integers are considered “the same” if they have the same
remainder with respect to the fixed divisor.

Definition 3.15. Let a, b ∈ Z, m ∈ N+. We say that a and b are congruent
modulo m, denoted by a ≡ b (mod m), if m|(a − b) (i.e., if there exists k ∈ Z
such that a − b = km).

As a direct consequence, we have a ≡ a (mod m) for any m ∈ N+.

Claim 3.16. a ≡ b (mod m) if and only if a and b have the same remainder
when divided by m, i.e., a mod m = b mod m.

Proof. We start with the if direction. Assume a and b have the same remainder
when divided by m. That is, a = q1m + r and b = q2m + r. Then we have

a − b = (q1 − q2)m ⇒ m|(a − b)

For the only if direction, we start by assuming m|(a − b). Using the division
algorithm, let a = q1m + r1, b = q2m + r2 with 0 ≤ r1, r2 < m. Because m|(a − b),
we have

m|(q1m + r1 − (q2m + r2))

Since m clearly divides q1m and q2m, it follows by Corollary 3.4 that

m|(r1 − r2)

But −(m − 1) ≤ r1 − r2 ≤ m − 1, so we must have a mod m = r1 = r2 =
b mod m.

The next theorem shows that addition and multiplication “carry over” to the modular
world (specifically, addition and multiplication can be computed before or after computing
the remainder).

Theorem 3.17. If a ≡ b (mod m) and c ≡ d (mod m), then

1. a + c ≡ b + d (mod m)
2. ac ≡ bd (mod m)

Proof. For item 1, we have

a ≡ b (mod m) and c ≡ d (mod m)
⇒ m|(a − b) and m|(c − d)
⇒ m|((a − b) + (c − d))   by Corollary 3.4
⇒ m|((a + c) − (b + d))
⇒ a + c ≡ b + d (mod m)

For item 2, using Claim 3.16, we have unique integers r and r′ such that

a = q1m + r     b = q2m + r
c = q′1m + r′   d = q′2m + r′

This shows that

ac = q1q′1m^2 + q1mr′ + q′1mr + rr′
bd = q2q′2m^2 + q2mr′ + q′2mr + rr′

which clearly implies that m|(ac − bd).

Clever usage of Theorem 3.17 can simplify many modular arithmetic calculations.

Example 3.18.

11^999 ≡ 1^999 ≡ 1 (mod 10)
9^999 ≡ (−1)^999 ≡ −1 ≡ 9 (mod 10)
7^999 ≡ 49^499 · 7 ≡ (−1)^499 · 7 ≡ −7 ≡ 3 (mod 10)
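These three congruences are easy to sanity-check with Python's three-argument pow, which performs fast modular exponentiation:

```python
# Verify Example 3.18 numerically.
print(pow(11, 999, 10))  # 1
print(pow(9, 999, 10))   # 9
print(pow(7, 999, 10))   # 3
```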

Note that exponentiation was not included in Theorem 3.17. Because
multiplication does carry over, we have a^e ≡ (a mod n)^e (mod n); we have
already used this fact in the example. However, in general we cannot perform
modular operations on the exponent first, i.e., a^e ≢ a^(e mod n) (mod n).

Applications of Modular Arithmetic


In this section we list some applications of modular arithmetic, and, as we promised,
give an example of an application to cryptography.

Hashing. The age-old setting that calls for hashing is simple: how do we
efficiently retrieve (store, delete) a large number of records? Take for example
student records, where each record has a unique 10-digit student ID. We cannot
afford (or do not want) a table of size 10^10 to index all the student records
by their ID. The solution? Store the records in an array of size N, where N is a
bit bigger than the expected number of students. The record for a student ID is
then stored in position h(ID), where h is a hash function that maps IDs to
{1, . . . , N}. One very simple hash function would be

h(k) = k mod N

ISBN. Most published books today have a 10 or 13-digit ISBN number; we will focus
on the 10-digit version here. The ISBN identifies the country of publication, the
publisher, and other useful data, but all this information is stored in the first
9 digits; the 10th digit is a redundancy check for errors.
The actual technical implementation is done using modular arithmetic. Let
a1, . . . , a10 be the digits of an ISBN number. In order to be a valid ISBN
number, it must pass the check:

a1 + 2a2 + · · · + 9a9 + 10a10 ≡ 0 (mod 11)

This test would detect an error if:

• a single digit was changed, or



• a transposition occurred, ie, two digits were swapped (this is why in the
check, we multiply ai by i).

If 2 or more errors occur, the errors may cancel out and the check may still
pass; fortunately, more robust solutions exist in the study of error correcting
codes.
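As a sketch, the ISBN-10 check is a one-liner; this simplified version assumes all ten digits are numeric, ignoring the 'X' symbol that real ISBNs use when the check digit equals 10:

```python
def isbn10_ok(digits):
    """Check a1 + 2*a2 + ... + 10*a10 ≡ 0 (mod 11) for a list of 10 digits."""
    assert len(digits) == 10
    return sum(i * a for i, a in enumerate(digits, start=1)) % 11 == 0

# 0-306-40615-2 is a well-known valid ISBN-10.
print(isbn10_ok([0, 3, 0, 6, 4, 0, 6, 1, 5, 2]))  # True
```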

Cast out 9s. Observe that a number is congruent to the sum of its digits modulo
9. (Can you show this? Hint: start by showing 10^n ≡ 1 (mod 9) for any n ∈ N+.)
The same fact also holds modulo 3. This allows us to check whether a computation
is correct by quickly performing the same computation modulo 9. (Note that
incorrect computations might still pass, so this check only increases our
confidence that the computation is correct.)
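For instance, a quick sketch of checking a multiplication by casting out nines:

```python
def digit_sum_mod9(n):
    """A number is congruent mod 9 to the sum of its digits (since 10 ≡ 1 (mod 9))."""
    return sum(int(d) for d in str(n)) % 9

# Check the multiplication 123 * 456 = 56088 modulo 9.
lhs = digit_sum_mod9(123) * digit_sum_mod9(456) % 9
rhs = digit_sum_mod9(56088)
print(lhs == rhs)  # True: the computation passes the cast-out-9s check
```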

Pseudorandom sequences. Randomized algorithms require a source of random
numbers; where do they come from in a computer? Computers today take a
small random “seed” (this can be the current time, or taken from the swap
space), and expand it to a long string that “looks random”. A classic method
is the linear congruential generator (LCG):

• Choose a modulus m ∈ N+,
• a multiplier a ∈ {2, 3, . . . , m − 1}, and
• an increment c ∈ Zm = {0, 1, . . . , m − 1}

Given a seed x0 ∈ Zm, the LCG outputs a “random looking” sequence
defined inductively by

xn+1 = (a·xn + c) mod m
The LCG is good enough (i.e., random enough) for some randomized algorithms.
Cryptographic algorithms, however, have a much more stringent requirement for
“random looking”: it must be the case that any adversarial party, any hacker on
the internet, cannot tell apart a “pseudorandom” string from a truly random
string. Can you see why the LCG is not a good pseudorandom generator? (Hint:
the LCG follows a very specific pattern.)
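A minimal sketch of the LCG in Python; the parameters here are tiny and purely illustrative, chosen to make the short cycle (one symptom of the predictability discussed above) easy to see:

```python
def lcg(m, a, c, x0, n):
    """Return the first n outputs of x_{k+1} = (a*x_k + c) mod m."""
    x, out = x0, []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x)
    return out

print(lcg(m=9, a=2, c=0, x0=1, n=6))  # [2, 4, 8, 7, 5, 1]; after this the sequence repeats
```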

Encryption. Encryption solves the classical cryptographic problem of secure
communication. Suppose that Alice wants to send a private message to Bob;
however, the channel between Alice and Bob is insecure, in the sense that there
is an adversary Eve who listens in on everything sent between Alice and Bob
(later in the course we will discuss even more malicious behaviors than just
listening in). To solve this problem, Alice and Bob agree on a “secret code”

(an encryption scheme) so that Alice may “scramble” her messages to Bob (an
encryption algorithm) in a way that no one except Bob may “unscramble” it (a
decryption algorithm).

Definition 3.19 (Private-Key Encryption Scheme). A triplet of algorithms
(Gen, Enc, Dec), a message space M, and a key space K together form a
private-key encryption scheme if:

1. The key-generation algorithm Gen is a randomized algorithm that returns
a key, k ← Gen, such that k ∈ K.

2. The encryption algorithm Enc : K × M → {0, 1}* is an algorithm that takes
as input a key k ∈ K and a plain-text m ∈ M (the message), and outputs a
cipher-text c = Enck(m) ∈ {0, 1}*.

3. The decryption algorithm Dec : K × {0, 1}* → M is an algorithm that
takes as input a key k ∈ K and a cipher-text c ∈ {0, 1}*, and outputs a
plain-text m ∈ M.

4. The scheme is correct; that is, decrypting a valid cipher-text should output
the original plain-text. Formally, we require that for all m ∈ M and k ∈ K,
Deck(Enck(m)) = m.

To use a private-key encryption scheme, Alice and Bob first meet in advance
and run k ← Gen together to agree on the secret key k. The next time Alice has
a private message m for Bob, she sends c = Enck(m) over the insecure channel.
Once Bob receives the cipher-text c, he decrypts it by running m = Deck(c) to
read the original message.

Example 3.20 (Caesar Cipher). The Caesar Cipher is a private-key encryption
scheme used by Julius Caesar to communicate with his generals; encryption is
achieved by “shifting” each letter by some fixed amount (the key).
Here is the formal description of the scheme. Let M = {A, . . . , Z}* and K =
{0, . . . , 25}:

• Gen outputs a uniformly random key k from K = {0, . . . , 25}.

• Encryption shifts each letter of the plain-text forward by k:

Enck(m1m2 · · · mn) = c1c2 · · · cn, where ci = mi + k mod 26

• Decryption shifts each letter back:

Deck(c1c2 · · · cn) = m1m2 · · · mn, where mi = ci − k mod 26



For example, if k = 3, then we substitute each letter in the plain-text according to the
following table:

plain-text: ABCDEFGHIJKLMNOPQRSTUVWXYZ
cipher-text: DEFGHIJKLMNOPQRSTUVWXYZABC

The message GOODMORNING encrypts to JRRGPRUQLQJ.
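A minimal Python sketch of the Caesar Cipher on uppercase messages (the function names are ours):

```python
def caesar_enc(msg, k):
    """Shift each letter A-Z forward by k positions, wrapping around (mod 26)."""
    return "".join(chr((ord(ch) - ord("A") + k) % 26 + ord("A")) for ch in msg)

def caesar_dec(ct, k):
    """Decryption is encryption with the negated key."""
    return caesar_enc(ct, -k)

print(caesar_enc("GOODMORNING", 3))  # JRRGPRUQLQJ
print(caesar_dec("JRRGPRUQLQJ", 3))  # GOODMORNING
```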

Claim 3.21. The Caesar Cipher is a private-key encryption scheme.

Proof. Correctness is trivial, since for all letters m and all keys k,

m = ((m + k) mod 26 − k) mod 26

Nowadays, we know the Caesar Cipher is not a very good encryption scheme. There
are numerous freely available programs or applets online that can crack the
Caesar Cipher. (In fact, you can do it too! After all, there are only 26 keys to
try.) The next example is on the other extreme; it is a perfectly secure
private-key encryption scheme. We wait until a later chapter to formalize the
notion of perfect secrecy; for now, we simply point out that at the very least,
the key length of the one-time pad grows with the message length (i.e., there
are not just 26 keys).

Example 3.22 (One-Time Pad). In the one-time pad encryption scheme, the key is
required to be as long as the message. During encryption, the entire key is used
to mask the plain-text, and therefore “perfectly hides” the plain-text.
Formally, let M = {0, 1}^n, K = {0, 1}^n, and

• Gen samples a key k = k1k2 · · · kn uniformly from K = {0, 1}^n.

• Enck(m1m2 · · · mn) = c1c2 · · · cn, where ci = mi + ki mod 2 (equivalently,
ci = mi ⊕ ki, where ⊕ denotes XOR).

• Deck(c1c2 · · · cn) = m1m2 · · · mn, where mi = ci − ki mod 2 (again, it is
equivalent to say mi = ci ⊕ ki).

To encrypt the message m = 0100000100101011 under key k = 1010101001010101,
simply compute

plain-text:  0100000100101011
⊕ key:       1010101001010101
cipher-text: 1110101101111110
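The same computation in Python, as a sketch (in real use the key must be truly random and never reused, hence “one-time”):

```python
def otp(bits, key):
    """XOR a bit-string with an equally long key; Enc and Dec are the same operation."""
    assert len(bits) == len(key)
    return "".join(str(int(b) ^ int(k)) for b, k in zip(bits, key))

m = "0100000100101011"
k = "1010101001010101"
c = otp(m, k)
print(c)               # 1110101101111110
print(otp(c, k) == m)  # True: XORing with the key again recovers the plain-text
```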

Claim 3.23. The one-time pad is a private-key encryption scheme.



Proof. Again correctness is trivial, since for all mi ∈ {0, 1} and all ki ∈ {0, 1},
mi = ((mi + ki) mod 2 − ki) mod 2 (equivalently, mi = mi ⊕ ki ⊕ ki).

Private-key encryption is limited by the precondition that Alice and Bob must meet
in advance to (securely) exchange a private key. Is this an inherent cost for achieving
secure communication?
First let us ask: can parties communicate securely without having secrets?
Unfortunately, the answer is no. Alice must encrypt her message based on
some secret key known only to Bob; otherwise, everyone could run the same
decryption procedure as Bob to view the private message. Does this mean Alice
has to meet with Bob in advance?
Fortunately, the answer this time around is no. The crux observation is that maybe
we don't need the whole key to encrypt a message. Public-key cryptography, first
proposed by Diffie and Hellman in 1976, splits the key into two parts: an encryption key,
called public-key, and a decryption key, called the secret-key. In our example, this
allows Bob to generate his own public and private key-pair without meeting with Alice.
Bob can then publish his public-key for anyone to find, including Alice, while keeping
his secret-key to himself. Now when Alice has a private message for Bob, she can
encrypt it using Bob's public-key, and be safely assured that only Bob can decipher her

message.
To learn more about public-key encryption, we need more number theory;
in particular, we need the notion of prime numbers.

3.3 Primes
Primes are numbers that have the absolute minimum number of divisors; they are only
divisible by themselves and 1. Composite numbers are just numbers that are not prime.
Formally:

Definition 3.24 (Primes and Composites). Let p ∈ N and p ≥ 2. p is prime if its
only positive divisors are 1 and p. Otherwise p is composite (i.e., there exists
some a such that 1 < a < p and a|p).

Note that the definition of primes and composites excludes the numbers 0 and 1.
Also note that if n is composite, we may assume that there exists some a such
that 1 < a ≤ √n and a|n. This is because given a divisor d|n with 1 < d < n,
1 < n/d < n is also a divisor of n; moreover, one of d or n/d must be at most √n.

Example 3.25. The first few primes are: 2,3,5,7,11,13,17,19.


The first few composites are: 4,6,8,9,10,12,14,15.

How can we determine if a number n is prime or not? This is called a primality
test. Given the above observation, we can try to divide n by every positive
integer ≤ √n; this is not very efficient, considering that in today's
cryptographic applications, we use 1024 or 2048-bit primes. A deterministic
polynomial-time algorithm for primality testing was not known until Agrawal,
Kayal and Saxena constructed the first such algorithm in 2002; even after
several improvements, the algorithm is still not fast enough to be practically
feasible. Fortunately, there is a much more efficient probabilistic algorithm
for primality testing; we will discuss this more later.
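The naive trial-division test mentioned above is easy to sketch; it is fine for small numbers, hopeless for 1024-bit ones:

```python
def is_prime_trial(n):
    """Trial division: n is prime iff no integer in [2, sqrt(n)] divides it."""
    if n < 2:
        return False
    a = 2
    while a * a <= n:
        if n % a == 0:
            return False
        a += 1
    return True

print([p for p in range(2, 20) if is_prime_trial(p)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```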

Distribution of Primes
How many primes are there? Euclid first showed that there are infinitely many
primes.

Theorem 3.26 (Euclid). There are infinitely many primes.

Proof. Assume for contradiction that there are only finitely many primes
p1, . . . , pn. Consider q = p1p2 · · · pn + 1. By assumption, q is not prime.
Let a > 1 be the smallest number that divides q. Then a must be prime (or else
it could not be the smallest divisor, by transitivity of divisibility), i.e.,
a = pi for some i (since p1, . . . , pn are all the primes).
We have pi|q. Since q = p1p2 · · · pn + 1 and pi clearly divides p1p2 · · · pn,
we conclude by Corollary 3.4 that pi|1, a contradiction.

Not only are there infinitely many primes, primes are actually common
(enough).

Theorem 3.27 (Prime Number Theorem). Let π(N) be the number of primes ≤ N.
Then

lim_{N→∞} π(N) / (N / ln N) = 1
We omit the proof, as it is out of the scope of this course. We can interpret
the theorem as follows: there exist (small) constants c1 and c2 such that

c1 N/log N ≤ π(N) ≤ c2 N/log N

If we consider n-digit numbers, i.e., 0 ≤ x < 10^n, roughly 10^n / log 10^n =
10^n/n of these numbers are prime. In other words, roughly a 1/n fraction of
n-digit numbers is prime.

Given that prime numbers are dense (enough), here is a method for finding
a random n-digit prime:

• Pick a random (odd) n-digit number x.
• Efficiently check if x is prime (we discuss how later).
• If x is prime, output x.
• Otherwise restart the procedure.

Roughly order n restarts would suffice.
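A sketch of this procedure in Python; for illustration we plug in trial division as the primality check, although at cryptographic sizes one would use the fast probabilistic test discussed later:

```python
import random

def is_prime_trial(n):
    """Naive primality check by trial division (illustration only)."""
    if n < 2:
        return False
    a = 2
    while a * a <= n:
        if n % a == 0:
            return False
        a += 1
    return True

def random_prime(digits):
    """Sample random odd n-digit numbers until one passes the primality check."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    while True:
        x = random.randrange(lo | 1, hi + 1, 2)  # a random odd number in range
        if is_prime_trial(x):
            return x

print(random_prime(6))  # a random 6-digit prime (varies per run)
```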

Relative Primality
Primes are numbers that lack divisors. A related notion is relative primality, where a pair of
number lacks common divisors.

Definition 3.28 (Relative Primality). Two positive integers a, b ∈ N+ are
relatively prime if gcd(a, b) = 1.

Clearly, a (standalone) prime is relatively prime to any other number except a
multiple of itself. From Theorem 3.13 (i.e., from Euclid's algorithm), we have
an alternative characterization of relatively prime numbers:

Corollary 3.29. Two positive integers a, b ∈ N+ are relatively prime if and only
if there exist s, t ∈ Z such that sa + tb = 1.

Corollary 3.29 has important applications in modular arithmetic; it guarantees
the existence of certain multiplicative inverses (so that we can talk of modular
division).

Theorem 3.30. Let a, b ∈ N+. There exists an element a⁻¹ such that a · a⁻¹ ≡ 1
(mod b) if and only if a and b are relatively prime.

a⁻¹ is called the inverse of a modulo b; whenever such an element exists, we can
“divide by a” modulo b by multiplying by a⁻¹. For example, 3 is the inverse of
7 modulo 10, because 7 · 3 ≡ 21 ≡ 1 (mod 10). On the other hand, 5 does not have
an inverse modulo 10 (without relying on Theorem 3.30, we can establish this
fact by simply computing 5 · 1, 5 · 2, . . . , 5 · 9 modulo 10).

Proof of Theorem 3.30. If direction. If a and b are relatively prime, then there
exist s, t such that sa + tb = 1 (Corollary 3.29). Rearranging terms,

sa = 1 − tb ≡ 1 (mod b)

therefore s = a⁻¹.

Only if direction. Assume there exists an element s such that sa ≡ 1 (mod b).
By definition this means there exists t such that sa − 1 = tb.
Rearranging terms we have sa + (−t)b = 1, which implies gcd(a, b) = 1.
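The if direction of the proof is constructive: the coefficient s produced by the extended Euclid's algorithm is exactly the inverse. A Python sketch (the function names are ours):

```python
def extended_gcd(a, b):
    """Return (g, s, t) with g = gcd(a, b) = s*a + t*b."""
    if b == 0:
        return a, 1, 0
    g, s, t = extended_gcd(b, a % b)
    return g, t, s - (a // b) * t

def mod_inverse(a, b):
    """Inverse of a modulo b, which exists iff gcd(a, b) = 1."""
    g, s, _ = extended_gcd(a, b)
    if g != 1:
        raise ValueError("a and b are not relatively prime")
    return s % b

print(mod_inverse(7, 10))  # 3, since 7 * 3 = 21 ≡ 1 (mod 10)
```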

Relative primality also has consequences with regards to divisibility.

Lemma 3.31. If a and b are relatively prime and a|bc, then a|c.

Proof. Because a and b are relatively prime, there exist s and t such that
sa + tb = 1. Multiplying both sides by c gives sac + tbc = c. Since a divides
the left hand side (a divides both sac and tbc, because a|bc), a must also
divide the right hand side (i.e., c).

The Fundamental Theorem of Arithmetic


The fundamental theorem of arithmetic states that we can factor any positive
integer uniquely as a product of primes. We start with a lemma before proving
the fundamental theorem.
Lemma 3.32. If p is prime and p | ∏_{i=1}^{n} ai, then there exists some
1 ≤ j ≤ n such that p|aj.

Proof. We proceed by induction. Let P(n) be the statement: for every prime p
and every sequence a1, . . . , an, if p | ∏_{i=1}^{n} ai, then there exists some
1 ≤ j ≤ n such that p|aj.
Base case: P(1) is trivial (j = 1).
Inductive step: Assuming P(n), we wish to show P(n + 1). Consider some
prime p and sequence a1, . . . , an+1 such that p | ∏_{i=1}^{n+1} ai. We split
into two cases: either p | an+1 or gcd(p, an+1) = 1 (if the gcd were more than 1
but less than p, then p would not be prime).
In the case that p | an+1 we are immediately done (j = n + 1). Otherwise, by
Lemma 3.31, p | ∏_{i=1}^{n} ai. We can then use the induction hypothesis to
show that there exists 1 ≤ j ≤ n such that p|aj.

Theorem 3.33 (Fundamental Theorem of Arithmetic). Every natural number n >


1 can be uniquely factored into a product of a sequence of non-decreasing
primes, ie, the unique prime factorization.

For example, 300 = 2 × 2 × 3 × 5 × 5.

Proof. We proceed by induction. Let P(n) be the statement “n has a unique


prime factorization”.
Base case: P(2) is trivial, with the unique factorization 2 = 2.

Inductive Step: Assume P(k) holds for all 2 ≤ k ≤ n − 1. We will show P(n) (for
n ≥ 3). If n is prime, then we have the factorization n = n, and this is unique
(anything else would contradict that n is prime). If n is composite, we show
existence and uniqueness of the factorization separately:

Existence. If n is composite, then there exist a, b such that 2 ≤ a, b ≤ n − 1
and n = ab. Apply the induction hypotheses P(a) and P(b) to get their
respective factorizations, and “merge” them for a factorization of n.

Uniqueness. Suppose on the contrary that n has two different factorizations,

n = ∏_{i=1}^{ℓ} pi = ∏_{j=1}^{m} qj

where ℓ, m ≥ 2 (otherwise n would be prime). Because p1|n = ∏_j qj, by
Lemma 3.32 there is some j0 such that p1|qj0. Since p1 and qj0 are both
primes, we have p1 = qj0. Consider the number n′ = n/p1, which can be
factorized as

n′ = ∏_{i=2}^{ℓ} pi = ∏_{j=1, j≠j0}^{m} qj

Since 1 < n′ = n/p1 < n, the induction hypothesis shows that the two
factorizations of n′ are actually the same, and so the two factorizations of n
are also the same (adding back the terms p1 = qj0).

Open Problems
Number theory is a field of study that is rife with (very hard) open problems.
Here is a small sample of open problems regarding primes.
Goldbach's Conjecture, first formulated way back in the 1700s, states that any
positive even integer other than 2 can be expressed as the sum of two primes.
For example, 4 = 2 + 2, 6 = 3 + 3, 8 = 3 + 5 and 22 = 5 + 17.
With modern computing power, the conjecture has been verified for all even
integers up to about 10^17.
The Twin Prime Conjecture states that there are infinitely many pairs of primes
that differ by 2 (called twins). For example, 3 and 5, 5 and 7, 11 and 13, or 41
and 43 are all twin primes. A similar conjecture states that there are
infinitely many safe primes or Sophie Germain primes: pairs of primes of the
form p and 2p + 1 (p is called the Sophie Germain prime, and 2p + 1 is called
the safe prime). For example, consider 3 and 7, 11 and 23, or 23 and 47. In
cryptographic applications, the use of safe primes sometimes provides more
security guarantees.

3.4 The Euler φ Function

Definition 3.34 (The Euler φ Function). Given a positive integer n ∈ N+, define
φ(n) to be the number of integers x ∈ N+, x ≤ n, such that gcd(x, n) = 1, i.e.,
the number of integers that are relatively prime with n.

For example, φ(6) = 2 (the relatively prime numbers are 1 and 5), and
φ(7) = 6 (the relatively prime numbers are 1, 2, 3, 4, 5, and 6). By definition
φ(1) = 1 (although this is rather uninteresting). The Euler φ function can be
computed easily on any integer for which we know the unique prime factorization
(computing the unique prime factorization itself may be difficult). In fact, if
the prime factorization of n is n = p1^k1 p2^k2 · · · pm^km, then

φ(n) = n ∏_i (1 − 1/pi) = ∏_i (pi^ki − pi^(ki−1))   (3.5)

While we won't prove (3.5) here (it is an interesting counting exercise), we do state and
show the following special cases.

Claim 3.35. If p is a prime, then φ(p) = p − 1. If n = pq where p ≠ q are both
primes, then φ(n) = (p − 1)(q − 1).

Proof. If p is prime, then the numbers 1, 2, . . . , p − 1 are all relatively
prime to p. Therefore φ(p) = p − 1.
If n = pq with p ≠ q both prime, then among the numbers 1, 2, . . . , n = pq,
there are exactly q multiples of p (they are p, 2p, . . . , n = qp). Similarly,
there are exactly p multiples of q. Observe that other than multiples of p or q,
the rest of the numbers are relatively prime to n; also observe that we have
counted n = pq twice. Therefore

φ(n) = n − p − q + 1 = pq − p − q + 1 = (p − 1)(q − 1)
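A brute-force check of these values straight from the definition, using gcd (a sketch; the function name is ours):

```python
from math import gcd

def phi(n):
    """Euler phi by definition: count x in {1, ..., n} with gcd(x, n) = 1."""
    return sum(1 for x in range(1, n + 1) if gcd(x, n) == 1)

print(phi(6), phi(7), phi(21))  # 2 6 12, agreeing with the examples and Claim 3.35
```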

The Euler ÿ function is especially useful in modular exponentiation, due


to Euler's Theorem:

Theorem 3.36. Given a, n ∈ N+, if gcd(a, n) = 1, then

a^φ(n) ≡ 1 (mod n)

Proof. Let X = {x | x ∈ N+, x ≤ n, gcd(x, n) = 1}. By definition |X| = φ(n).
Let aX = {ax mod n | x ∈ X}. For example, if n = 10 and a = 3, then
X = {1, 3, 7, 9}, and

aX = {3 mod 10, 9 mod 10, 21 mod 10, 27 mod 10} = {3, 9, 1, 7}
Machine Translated by Google

3.4. THE EULER ÿ FUNCTION 53

By Theorem 3.30, X is the set of all numbers in {1, . . . , n} that have
multiplicative inverses modulo n; this is useful in the rest of the proof.
We first claim that X = aX (this does indeed hold in the example). We prove
this by showing X ⊆ aX and aX ⊆ X.

X ⊆ aX. Given x ∈ X, we will show that x ∈ aX. Consider the number a⁻¹x mod n
(recall that a⁻¹ is the multiplicative inverse of a, which exists since
gcd(a, n) = 1). We claim that a⁻¹x mod n ∈ X, since it has the multiplicative
inverse x⁻¹a. Consequently, a(a⁻¹x) ≡ x (mod n), so x ∈ aX.

aX ⊆ X. We can give a similar argument as above¹. Given y ∈ aX, we will show
that y ∈ X. This can be done by constructing the multiplicative inverse of y.
We know y is of the form ax for some x ∈ X, so the multiplicative inverse of
y is x⁻¹a⁻¹.

Knowing aX = X, we can deduce that

∏_{x∈X} x ≡ ∏_{y∈aX} y ≡ ∏_{x∈X} ax (mod n)

Since each x ∈ X has a multiplicative inverse modulo n, we can multiply both
sides by all of these inverses (i.e., divide by ∏_{x∈X} x):

1 ≡ ∏_{x∈X} a (mod n)

Since |X| = φ(n), we have just shown that a^φ(n) ≡ 1 (mod n).

We give two corollaries.

Corollary 3.37 (Fermat's Little Theorem). If p is a prime and gcd(a, p) = 1,
then a^(p−1) ≡ 1 (mod p).

Proof. This directly follows from the theorem and φ(p) = p − 1.

Corollary 3.38. If gcd(a, n) = 1, then

a^x ≡ a^(x mod φ(n)) (mod n)

¹In class we showed this differently. Observe that |aX| ≤ |X| (since elements in aX are
“spawned” from X). Knowing that X ⊆ aX, |aX| ≤ |X|, and the fact that these are finite sets
allows us to conclude that aX ⊆ X (in fact, we can directly conclude that aX = X).

Proof. Let x = qφ(n) + r be the result of dividing x by φ(n) with remainder
(recall that r is exactly x mod φ(n)). Then

a^x = a^(qφ(n)+r) = (a^φ(n))^q · a^r ≡ 1^q · a^r = a^(x mod φ(n)) (mod n)

Example 3.39. Euler's function can speed up modular exponentiation by a lot. Let n = 21 = 3 × 7. We have φ(n) = 2 × 6 = 12. Then

2^999 ≡ 2^(999 mod 12) = 2^3 = 8 (mod 21)
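This exponent reduction is easy to verify with Python's built-in three-argument pow, which performs modular exponentiation:

```python
# Check Example 3.39: 2^999 ≡ 2^(999 mod 12) (mod 21), where φ(21) = 12.
n, phi_n = 21, 12
lhs = pow(2, 999, n)           # full exponent
rhs = pow(2, 999 % phi_n, n)   # reduced exponent, 999 mod 12 = 3
print(lhs, rhs)  # both are 8
```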

Application to Probabilistic Primality Checking


We have mentioned on several occasions that there is an efficient probabilistic algorithm for
checking primality. The algorithm, on input a number n, has the following properties:

• If n is prime, then the algorithm always outputs YES

• If n is not prime, then with some probability, say 1/2, the algorithm may still output YES
incorrectly.

Looking at it from another point of view, if the algorithm ever says n is not prime, then n is definitely not prime. With such an algorithm, we can ensure that n is prime with very high probability: run the algorithm 200 times, and believe n is prime if the output is always YES. If n is prime, we always correctly conclude that it is indeed prime. If n is composite, then we would only incorrectly view it as a prime with probability (1/2)^200 (which is so small that it is more likely to encounter some sort of hardware error).

How might we design such an algorithm? A first approach, on input n, is to pick a random number 1 < a < n and output YES if and only if gcd(a, n) = 1. Certainly if n is prime, the algorithm will always output YES. But if n is not prime, this algorithm may output YES with much too high probability; in fact, it outputs YES with probability ≈ φ(n)/n (this can be too large if, say, n = pq, and φ(n) = (p − 1)(q − 1)).

We can design a similar test relying on Euler's Theorem. On input n, pick a random 1 < a < n and output YES if and only if a^(n−1) ≡ 1 (mod n). Again, if n is prime, this test will always output YES. What if n is composite? For most composite numbers, the test does indeed output YES with sufficiently small probability. However there are some composites, called Carmichael numbers or pseudo-primes, on which this test always outputs YES incorrectly (i.e., a Carmichael number n has the property that a^(n−1) ≡ 1 (mod n) for all 1 < a < n with gcd(a, n) = 1, yet n is not prime).
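A sketch of this test in Python (the function name fermat_test is ours); 561 = 3 · 11 · 17 is the smallest Carmichael number, and it indeed fools the test for every a coprime to 561:

```python
import random
from math import gcd

def fermat_test(n, trials=50):
    """Output YES (True) only if a^(n-1) ≡ 1 (mod n) for `trials` random a.
    False means n is certainly composite; True means "probably prime"."""
    for _ in range(trials):
        a = random.randrange(2, n)
        if pow(a, n - 1, n) != 1:
            return False
    return True

print(fermat_test(101))  # True: 101 is prime
print(fermat_test(100))  # False: some a with a^99 mod 100 != 1 is found
# 561 is a Carmichael number: every a coprime to 561 passes the test.
print(all(pow(a, 560, 561) == 1 for a in range(2, 561) if gcd(a, 561) == 1))  # True
```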

By adding a few tweaks to the above algorithm, we would arrive at the Miller-Rabin
primality test that performs well on all numbers (this is out of the scope of this course).
For now let us focus on computing a^(n−1). The naive way of computing a^(n−1) requires n − 1 multiplications; in that case we might as well just divide n by all numbers less than n.
A more clever algorithm is to do repeated squaring:

Algorithm 2 ExpMod(x, e, n), computing x^e mod n

if e = 0 then
    return 1;
else
    return ExpMod(x, e div 2, n)^2 · x^(e mod 2) mod n;
end if
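A direct Python transcription of Algorithm 2 (the function name exp_mod is ours):

```python
def exp_mod(x, e, n):
    """Compute x^e mod n by repeated squaring, in about log2(e) recursive calls."""
    if e == 0:
        return 1
    half = exp_mod(x, e // 2, n)       # ExpMod(x, e div 2, n)
    return half * half * x ** (e % 2) % n

print(exp_mod(4, 19, 13))   # 4, matching the worked example below
print(exp_mod(2, 999, 21))  # 8, matching Example 3.39
```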

The correctness of ExpMod is based on the fact that the exponent e can be expressed as (e div 2) · 2 + (e mod 2) by the division algorithm, and therefore

x^e = x^((e div 2)·2 + (e mod 2)) = (x^(e div 2))^2 · x^(e mod 2)

To analyze the efficiency of ExpMod, observe that x^(e mod 2) is easy to compute (it is either x^0 = 1 or x^1 = x), and that the recursion has depth ≈ log_2 e since the exponent e is halved in each recursive call. The intuition behind ExpMod is simple. By repeated squaring, it is much faster to compute exponents that are powers of two; e.g., computing x^16 requires squaring only four times: x → x^2 → x^4 → x^8 → x^16. Exponents that are not powers of two can first be split into sums of powers of two; this is the same concept as binary representations for those who are familiar with it. As an example, suppose we want to compute 4^19 mod 13. First observe that 19 = 16 + 2 + 1 (the binary representation of 19 would be 10011₂). By repeated squaring, we first compute:

4^2 mod 13 = 16 mod 13 = 3
4^4 mod 13 = 3^2 mod 13 = 9
4^8 mod 13 = 9^2 mod 13 = 3
4^16 mod 13 = 3^2 mod 13 = 9

Now we can compute

4^19 mod 13 = 4^(16+2+1) mod 13 = 4^16 · 4^2 · 4^1 mod 13 = 9 · 3 · 4 mod 13 = 4

The takeaway of this section is that primality testing can be done efficiently, in time polynomial in the length (number of digits) of the input number n (i.e., in time polynomial in log n).

3.5 Public-Key Cryptosystems and RSA


In this section we formally define public-key cryptosystems. We describe the RSA cryptosystem as an example, which relies on many of the number-theoretic results in this chapter.
Recall our earlier informal discussion on encryption: Alice would like to
send Bob a message over a public channel where Eve is eavesdropping. A
private-key encryption scheme would require Alice and Bob to meet in advance
to jointly generate and agree on a secret key. On the other hand a public-key
encryption scheme, first proposed by Diffie and Hellman, has two keys: a public-
key for encrypting, and a private-key for decrypting. Bob can now generate
both keys by himself, and leave the public-key out in the open for Alice to find;
when Alice encrypts her message using Bob's public-key, only Bob can decrypt
and read the secret message.

Definition 3.40. A triplet of algorithms (Gen, Enc, Dec), a key space K ⊆ Kpk × Ksk (each key is actually a pair of public and secret keys), and a sequence of message spaces indexed by public-keys, M = {Mpk}_{pk∈Kpk}, together is called a public-key encryption scheme if:

1. The key-generation algorithm, Gen, is a randomized algorithm that returns a pair of keys, (pk, sk) ← Gen, such that (pk, sk) ∈ K (and so pk ∈ Kpk, sk ∈ Ksk).

2. The encryption algorithm takes as input a public-key pk ∈ Kpk and a plain-text m ∈ Mpk (the message), and outputs a cipher-text c = Encpk(m) ∈ {0, 1}*.

3. The decryption algorithm takes as input a secret-key sk ∈ Ksk and a cipher-text c ∈ {0, 1}*, and outputs a plain-text m = Decsk(c).

4. The scheme is correct; that is, decrypting a valid cipher-text with the correct pair of keys should output the original plain-text. Formally, we require that for all (pk, sk) ∈ K and all m ∈ Mpk, Decsk(Encpk(m)) = m.

The RSA Cryptosystem


The RSA cryptosystem, designed by Rivest, Shamir and Adleman, uses modular exponentiation to instantiate a public-key cryptosystem. Given a security parameter n, the plain RSA public-key encryption scheme is as follows:

• Gen(n) picks two random n-bit primes p and q, sets N = pq, and picks a random e such that 1 < e < n, gcd(e, φ(N)) = 1. The public key is

pk = (N, e), while the secret key is sk = (p, q) (the factorization of N).
N is called the modulus of the scheme.

• Through the definition of Gen, we already have an implicit definition of the key-space K. The message space for a public-key pk = (N, e) is: Mpk = {m | 0 < m < N, gcd(m, N) = 1}.

• Encryption and decryption are defined as follows:

  Encpk(m) = m^e mod N
  Decsk(c) = c^d mod N, where d = e⁻¹ mod φ(N)
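A toy instantiation in Python, with the small primes p = 61, q = 53 often used for illustration (far too small to be secure; three-argument pow does the modular exponentiation, and pow(e, -1, phi) computes a modular inverse in Python 3.8+):

```python
from math import gcd

p, q = 61, 53               # the two "random n-bit primes" (toy-sized)
N = p * q                   # modulus N = 3233
phi = (p - 1) * (q - 1)     # φ(N) = 3120
e = 17
assert gcd(e, phi) == 1
d = pow(e, -1, phi)         # d = e^-1 mod φ(N)

m = 65                      # a message with 0 < m < N, gcd(m, N) = 1
c = pow(m, e, N)            # Enc_pk(m) = m^e mod N
print(pow(c, d, N))         # Dec_sk(c) = c^d mod N recovers 65
```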

Correctness of RSA
First and foremost we should verify that all three algorithms, Gen, Enc and Dec, can be efficiently computed. Gen involves picking two n-bit primes, p and q, and an exponent e relatively prime to φ(N); we covered generating random primes in Section 3.3, and choosing e is simple: pick a random e and check that gcd(e, φ(N)) = 1 using the GCD algorithm (a random e works with very high probability).

Enc and Dec are both modular exponentiations; we covered that in Section 3.4. Dec additionally requires us to compute d = e⁻¹ mod φ(N); knowing the secret-key, which contains the factorization of N, it is easy to compute φ(N) = (p − 1)(q − 1), and then compute d using the extended GCD algorithm and Theorem 3.30.
Next, let us verify that decryption is able to recover encrypted messages.
Given a message m satisfying 0 < m < N and gcd(m, N) = 1, we have:
d
Dec(Enc(m)) = ((me ) mod N) mod N
= med mod N
= med mod ÿ(N) mod N by Corollary 3.38 = m1
mod N since d = e ÿ1 mod ÿ(N)

= m mod N = m

This calculation also shows why the message space is restricted to {m | 0 < m
< N, gcd(m, N) = 1}: A message m must satisfy gcd(m, N) = 1 so that we can
apply Euler's Theorem, and m must be in the range 0 < m < N so that when we
recover m mod N, it is actually equal to the original message m.

Security of RSA
Let us informally discuss the security of RSA encryption. What stops Eve from
decrypting Alice's messages? The assumption we make is that without

knowing the secret key, it is hard for Eve to compute d = e⁻¹ mod φ(N). In particular, we need to assume the factoring conjecture: there is no efficient algorithm that factors numbers N that are products of two equal-length primes p and q (formally, efficient algorithm means any probabilistic algorithm that runs in time polynomial in the length of N, i.e., the number of digits of N). Otherwise Eve would be able to recover the secret-key and decrypt in the same way as Bob would.
There is another glaring security hole in our description of the RSA scheme: the encryption function is deterministic. What this means is that once the public-key is fixed, the encryption of each message is unique! For example, there is only one encryption for the word "YES", and one encryption for the word "NO", and anyone (including Eve) can compute these encryptions (it is a public-key scheme after all). If Alice ever sends an encrypted YES or NO answer to Bob, Eve can now completely compromise the message.

One solution to this problem is for Alice to pad each of her messages m with a (fairly long) random string; she then encrypts the resulting padded message m′ as before, outputting the cipher-text (m′)^e mod N (now the whole encryption procedure is randomized). This type of "padded RSA" is implemented in practice.

On generating secret keys. Another security concern (with any key-based


scheme) is the quality of randomness used to generate the secret-keys. Let us
briefly revisit the Linear Congruential Generator, a pseudorandom generator
that we remarked should not be used for cryptographic applications. The LCG
generates a sequence of numbers using the recurrence:

x_0 = random seed
x_i = a·x_{i−1} + c mod M

In C++, we have a = 22695477, c = 1, and M = 2^32. Never mind the fact that the sequence (x_0, x_1, x_2, . . .) has a pattern (that is not very random at all). Because there are only 2^32 starting values for x_0, we can simply try them all, and obtain the secret-key of any RSA key-pair generated using LCG and C++.²
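The recurrence is easy to implement; a sketch (the generator name lcg is ours), using the constants quoted above:

```python
def lcg(seed, a=22695477, c=1, M=2**32):
    """The LCG recurrence x_i = a*x_{i-1} + c mod M.  The security problem
    is not just the pattern: there are only 2^32 possible seeds to try."""
    x = seed
    while True:
        x = (a * x + c) % M
        yield x

gen = lcg(42)
print(next(gen))  # 953210035 = (22695477 * 42 + 1) mod 2^32
```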

Padded RSA
We have already discussed why a padded scheme for RSA is necessary for security. A padded scheme also has another useful feature; it allows us to define

² In Java, M = 2^48, so the situation improves a little bit.

a message space that does not depend on the choice of the public-key (eg, it
would be tragic if Alice could not express her love for Bob simply because Bob
chose the wrong key). In real world implementations, designing the padding
scheme is an engineering problem with many practical considerations; here we give a sample scheme just to illustrate how padding can be done. Given a security parameter n, a padded RSA public-key encryption scheme can proceed as follows:
• Gen(n) picks two random n-bit primes p, q > 2^(n−1), sets N = pq, and picks a random e such that 1 < e < n, gcd(e, φ(N)) = 1. The public key is pk = (N, e), while the secret key is sk = (p, q) (the factorization of N). N is called the modulus of the scheme.

• Through the definition of Gen, we already have an implicit definition of the key-space K. The message space is simply M = {m | 0 ≤ m < 2^n} = {0, 1}^n, the set of n-bit strings.

• Encryption is probabilistic. Given public-key pk = (N, e) and a message m ∈ M, pick a random (n − 2)-bit string r and let rm be the concatenation of r and m, interpreted as an integer 0 ≤ rm < 2^(2n−2) < N. Furthermore, if rm = 0 or if gcd(rm, N) ≠ 1, we re-sample the random string r until this is no longer the case. The output cipher-text is then

  Encpk(m) = (rm)^e mod N

• Decryption can still be deterministic. Given secret-key sk and cipher-text c, first decrypt as in plain RSA, i.e., compute rm = c^d mod N where d = e⁻¹ mod φ(N), and output the n right-most bits of rm as the plain-text.
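A toy sketch of this padded scheme in Python (the parameter choices are ours and far too small to be secure; with n = 6, the primes 61 and 53 are n-bit primes larger than 2^(n−1) = 32, so N = 3233 > 2^(2n−2) = 1024):

```python
import random
from math import gcd

p, q, e = 61, 53, 17
N, phi = p * q, (p - 1) * (q - 1)
d = pow(e, -1, phi)          # secret exponent, Python 3.8+
n_bits = 6                   # message space {0,1}^6

def enc(m):
    while True:
        r = random.randrange(2 ** (n_bits - 2))   # random (n-2)-bit string
        rm = (r << n_bits) | m                    # concatenation r || m as an integer
        if rm != 0 and gcd(rm, N) == 1:           # re-sample r otherwise
            return pow(rm, e, N)

def dec(c):
    rm = pow(c, d, N)
    return rm % (2 ** n_bits)                     # the n right-most bits

m = 0b101101
print(dec(enc(m)) == m)  # True
```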

RSA signatures
We end the section with another cryptographic application of RSA: digital
signatures. Suppose Alice wants to send Bob a message expressing her love,
“I love you, Bob,” and Alice is so bold and confident that she is not afraid of
eavesdroppers. However Eve is not just eavesdropping this time, but out to
sabotage the relationship between Alice and Bob. She sees Alice's message,
and changes it to "I hate you, Bob" before it reaches Bob. How can cryptography help with this sticky situation? A digital signature allows the sender of a message to "sign" it with a signature; when a receiver verifies the signature, he or she can be sure that the message came from the sender and has not been tampered with.

In the RSA signature scheme, the signer generates keys similar to the RSA encryption
scheme; as usual, the signer keeps the secret-key, and publishes the public-key. To sign
a message m, the signer computes:

σ_m = m^d mod N

Anyone that receives a message m along with a signature σ can perform the following check using the public-key:

σ^e mod N ?= m

The correctness and basic security guarantees of the RSA signature scheme are the same as for the RSA encryption scheme. Just as before though, there are a few security concerns with the scheme as described.

Consider this attack. By picking the signature σ first, and computing m = σ^e mod N, anyone can forge a signature, although the message m is most likely meaningless (but what if the attacker gets lucky?). Or suppose Eve collects two signatures, (m_1, σ_1) and (m_2, σ_2); now she can construct a new signature (m = m_1·m_2 mod N, σ = σ_1·σ_2 mod N) (and it is very possible that the new message m is meaningful). To prevent these two attacks, we modify the signature scheme to first transform the message using a "crazy" function H (i.e., σ_m = H(m)^d mod N).
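A sketch of this hash-then-sign idea, using SHA-256 as a stand-in for the "crazy" function H (the toy modulus and the reduction of the hash mod N are our simplifications; insecure, for illustration only):

```python
import hashlib

p, q, e = 61, 53, 17
N, phi = p * q, (p - 1) * (q - 1)
d = pow(e, -1, phi)

def H(message: bytes) -> int:
    """Stand-in for the crazy function: a hash reduced into [0, N)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes) -> int:
    return pow(H(message), d, N)          # σ_m = H(m)^d mod N

def verify(message: bytes, sigma: int) -> bool:
    return pow(sigma, e, N) == H(message)  # σ^e mod N ?= H(m)

sigma = sign(b"I love you, Bob")
print(verify(b"I love you, Bob", sigma))   # True
print(verify(b"I hate you, Bob", sigma))   # almost surely False
```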
Another important consideration is how do we sign large messages (eg, lengthy
documents)? Certainly we do not want to increase the size of N. If we apply the same
solution as we did for encryption — break the message into chunks and sign each chunk
individually — then we run into another security hole. Suppose Alice signed the sentences
“I love you, Bob” and “I hate freezing rain” by signing the individual words; then Eve can
collect and rearrange these signatures to produce a signed copy of “I hate you, Bob”. The
solution again relies on the crazy hash function H: we require H to accept arbitrarily large messages as input, and still output a hash value < N. A property that H must have is collision resistance: it should be hard to find two messages, m_1 and m_2, that hash to the same value, H(m_1) = H(m_2) (we wouldn't want "I love you, Bob" and "I hate you, Bob" to share the same signature).
Chapter 4

Counting

"How do I love thee? Let me count the ways."

–Elizabeth Barrett Browning

Counting is a basic mathematical tool that has uses in the most diverse
circumstances. How much RAM can a 64-bit register address? How many poker
hands form full houses compared to flushes? How many ways can ten coin
tosses end up with four heads? To count, we can always take the time to
enumerate all the possibilities; but even just enumerating all poker hands is
already daunting, let alone all 64-bit addresses. This chapter covers several
techniques that serve as useful short cuts for counting.

4.1 The Product and Sum Rules

The product and sum rules represent the most intuitive notions of counting.
Suppose there are n(A) ways to perform task A, and regardless of how task A
is performed, there are n(B) ways to perform task B. Then, there are n(A) · n(B)
ways to perform both task A and task B; this is the product rule. This can
generalize to multiple tasks, eg, n(A) · n(B) · n(C) ways to perform task A, B,
and C, as long as the independence condition holds, eg, the number of ways to
perform task C does not depend on how task A and B are done.

Example 4.1. On an 8 × 8 chess board, how many ways can I place a pawn and
a rook? First I can place the pawn anywhere on the board; there are 64 ways.
Then I can place the rook anywhere except where the pawn is; there are 63
ways. In total, there are 64 × 63 = 4032 ways.


Example 4.2. On an 8 × 8 chess board, how many ways can I place a pawn and a rook so that the rook does not threaten the pawn? First I can place the rook anywhere on the board; there are 64 ways. At this point, the rook takes up one square, and threatens 14 others (7 in its row and 7 in its column). Therefore I can then place the pawn on any of the 64 − 14 − 1 = 49 remaining squares. In total, there are 64 × 49 = 3136 ways.
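Both chess counts (Examples 4.1 and 4.2) can be confirmed by brute force:

```python
from itertools import product

squares = list(product(range(8), range(8)))  # the 64 board squares

# Example 4.1: pawn and rook on distinct squares (order of placement matters)
both = sum(1 for p_sq in squares for r_sq in squares if p_sq != r_sq)

# Example 4.2: rook does not threaten pawn, i.e., different row and column
safe = sum(1 for r_sq in squares for p_sq in squares
           if r_sq[0] != p_sq[0] and r_sq[1] != p_sq[1])

print(both, safe)  # 4032 3136
```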

Example 4.3. If a finite set S has n elements, then |P(S)| = 2^n. We have seen a proof of this by induction; now we will see a proof using the product rule. P(S) is the set of all subsets of S. To form a subset of S, each of the n elements can either be in the subset or not (2 ways). Therefore there are 2^n possible ways to form unique subsets, and so |P(S)| = 2^n.

Example 4.4. How many legal configurations are there in the towers of Hanoi? Each of the n rings can be on one of three poles, giving us 3^n configurations. Normally we would also need to count the height of a ring relative to other rings on the same pole, but in the case of the towers of Hanoi, the rings sharing the same pole must be ordered in a unique fashion: from small at the top to large at the bottom.

The sum rule is probably even more intuitive than the product rule.
Suppose there are n(A) ways to perform task A, and distinct from these, there
are n(B) ways to perform task B. Then, there are n(A) + n(B) ways to perform
task A or task B. This can generalize to multiple tasks, eg, n(A) + n(B) + n(C)
ways to perform task A, B, or C, as long as the distinct condition holds, eg, the
ways to perform task C are different from the ways to perform task A or B.

Example 4.5. To fly from Ithaca to Miami you must fly through New York or Philadelphia. There are 5 such flights a day through New York, and 3 such flights a day through Philadelphia. How many different flights are there in a day that can take you from Ithaca to Miami? The answer is 5 + 3 = 8.

Example 4.6. How many 4 to 6 digit pin codes are there? By the product rule, the number of distinct n digit pin codes is 10^n (each digit has 10 possibilities). By the sum rule, we have 10^4 + 10^5 + 10^6 such 4 to 6 digit pin codes (to state the obvious, we have implicitly used the fact that every 4 digit pin code is different from every 5 digit pin code).

4.2 Permutations and Combinations


Our next tools for counting are permutations and combinations. Given n distinct objects, how many ways are there to "choose" r of them? Well, it depends on whether the r chosen objects are ordered or not. For example, suppose we deal three cards out of a standard 52-card deck. If we are dealing one card each to Alice, Bob and Cathy, then the order of the cards being dealt matters; this is called a permutation of 3 cards. On the other hand, if we are dealing all three cards to Alice, then the order of the cards being dealt does not matter; this is called a combination of 3 cards.

Permutations

Definition 4.7. A permutation of a set A is an ordered arrangement of the elements in A. An ordered arrangement of just r elements from A is called an r-permutation of A. For non-negative integers r ≤ n, P(n, r) denotes the number of r-permutations of a set with n elements.

What is P(n, r)? To form an r-permutation from a set A of n elements, we can start by choosing any element of A to be the first in our permutation; there are n possibilities. The next element in the permutation can be any element of A except the one that is already taken; there are n − 1 possibilities. Continuing the argument, the final element of the permutation will have n − (r − 1) possibilities. Applying the product rule, we have

Theorem 4.8.

P(n, r) = n(n − 1)(n − 2) · · · (n − r + 1) = n!/(n − r)!¹
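A quick check of Theorem 4.8 against direct enumeration (the helper name P is ours):

```python
from itertools import permutations
from math import factorial

def P(n, r):
    """Number of r-permutations of an n-element set: n!/(n-r)!."""
    return factorial(n) // factorial(n - r)

n, r = 6, 3
print(P(n, r), len(list(permutations(range(n), r))))  # 120 120
```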

Example 4.9. How many one-to-one functions are there from a set A with m elements to a set B with n elements? If m > n we know there are no such one-to-one functions. If m ≤ n, then each one-to-one function f from A to B is an m-permutation of the elements of B: we choose m elements from B in an ordered manner (e.g., the first chosen element is the value of f on the first element in A). Therefore there are P(n, m) such functions.

Combinations
Let us turn to unordered selections.
¹ Recall that 0! = 1.

Definition 4.10. An unordered arrangement of r elements from a set A is called an r-combination of A. For non-negative integers r ≤ n, C(n, r) (also written (n choose r)) denotes the number of r-combinations of a set with n elements. C(n, r) is also called a binomial coefficient (we will soon see why).

For example, how many ways are there to put two pawns on a 8 × 8 chess board?
We can select 64 possible squares for the first pawn, and 63 possible remaining squares
for the second pawn. But now we are over counting, eg, choosing squares (b5, c8) is the
same as choosing (c8, b5) since the two pawns are identical. Therefore we divide by 2 to
get the correct count: 64 × 63/2 = 2016. More generally,

Theorem 4.11.

C(n, r) = n! / ((n − r)! r!)

Proof. Let us express P(n, r) in terms of C(n, r). It must be that P(n, r) = C(n, r)P(r, r), because to select an r-permutation from n elements, we can first select an unordered set of r elements, and then select an ordering of the r elements. Rearranging the expression gives:

C(n, r) = P(n, r)/P(r, r) = (n!/(n − r)!)/r! = n! / ((n − r)! r!)

Example 4.12. How many poker hands (ie, sets of 5 cards) can be dealt from a standard
deck of 52 cards? Exactly C(52, 5) = 52!/(47!5!).

Example 4.13. How many full houses (3 of a kind and 2 of another) can be dealt from a
standard deck of 52 cards? Recall that we have 13 denominations (ace to king), and 4
suits (spades, hearts, diamonds and clubs). To count the number of full houses, we may

• First pick a denomination for the “3 of a kind”: there are 13 choices.

• Pick 3 cards from this denomination (out of 4 suits): there are C(4, 3) =
4 choices.

• Next pick a denomination for the “2 of a kind”: there are 12 choices left (different
from the “3 of a kind”).

• Pick 2 cards from this denomination: there are C(4, 2) = 6 choices.

So in total there are 13 × 4 × 12 × 6 = 3744 possible full houses.
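The arithmetic in Examples 4.12 and 4.13 can be checked with math.comb:

```python
from math import comb

full_houses = 13 * comb(4, 3) * 12 * comb(4, 2)  # Example 4.13
print(full_houses)   # 3744
print(comb(52, 5))   # 2598960 poker hands in total (Example 4.12)
```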



Figure 4.1: Suppose there are 5 balls and 3 urns. Using the delimiter idea, the first row represents the configuration (1, 3, 1) (1 ball in the first urn, 3 balls in the second, and 1 ball in the third). The second row represents the configuration (4, 0, 1) (4 balls in the first urn, none in the second, and 1 ball in the third). In general, we need to choose 2 positions out of 7 as delimiters (the rest of the positions are the 5 balls).

Balls and Urns


How many ways are there to put n balls into k urns? This classical counting problem has many variations. For our setting, we assume that the urns are distinguishable (e.g., numbered). If the balls are also distinguishable, then this is a simple application of the product rule: each ball can be placed into k possible urns, resulting in a total of k^n possible placements.

What if the balls are indistinguishable? Basically we need to assign a number to each urn, representing the number of balls in the urn, so that the sum of the numbers is n. Suppose we line up the n balls, and put k − 1 delimiters between some of the adjacent balls. Then we would have exactly what we need: n balls split among k distinct urns (see Figure 4.1). The number of ways to place the delimiters is as simple as choosing k − 1 delimiter positions among n + k − 1 positions (the total number of positions for both balls and delimiters), i.e., C(n + k − 1, k − 1).

Example 4.14. How many solutions are there to the equation x + y + z = 100, if x, y, z ∈ N? This is just like having 3 distinguishable urns (x, y and z) and 100 indistinguishable balls, so there are C(102, 2) solutions.
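A brute-force check of the balls-and-urns formula on a small case (3 urns, 5 balls): the solutions to x + y + z = 5 should number C(5 + 3 − 1, 3 − 1) = C(7, 2) = 21.

```python
from math import comb

n, k = 5, 3
# enumerate (x, y); z is determined as n - x - y, and must be >= 0
count = sum(1 for x in range(n + 1) for y in range(n + 1 - x))
print(count, comb(n + k - 1, k - 1))  # 21 21
```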

4.3 Combinatorial Identities


There are many identities involving combinations. These identities are fun to
learn because they often represent different ways of counting the same thing;

one can also prove these identities by churning out the algebra, but that is boring. We
start with a few simple identities.

Lemma 4.15. If 0 ≤ k ≤ n, then C(n, k) = C(n, n − k).

Proof. Each unordered selection of k elements has a unique complement: an unordered selection of n − k elements. So instead of counting the number of selections of k elements from n, we can count the number of selections of n − k elements from n (e.g., to deal 5 cards from a 52 card deck is the same as to throw away 52 − 5 = 47 cards).

An algebraic proof of the same fact (without much insight) goes as follows:

C(n, k) = n!/((n − k)! k!) = n!/((n − (n − k))! (n − k)!) = C(n, n − k)

Lemma 4.16 (Pascal's Identity). If 0 < k ≤ n, then C(n + 1, k) = C(n, k − 1) + C(n, k).

Proof. Here is another way to choose k elements from n + 1 total elements. Either the (n + 1)st element is chosen or not:

• If it is, then it remains to choose k − 1 elements from the first n elements.
• If it isn't, then we need to choose all k elements from the first n elements.

By the sum rule, we have C(n + 1, k) = C(n, k − 1) + C(n, k).

Pascal's identity, along with the initial conditions C(n, 0) = C(n, n) = 1, gives a recursive way of computing the binomial coefficients C(n, k). The recursion table is often written as a triangle, called Pascal's Triangle; see Figure 4.2.
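Pascal's identity plus the initial conditions really do compute C(n, k); a sketch (memoized so the recursion stays fast), checked against math.comb:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def C(n, k):
    if k == 0 or k == n:
        return 1                          # initial conditions
    return C(n - 1, k - 1) + C(n - 1, k)  # Pascal's identity

print(all(C(n, k) == comb(n, k) for n in range(10) for k in range(n + 1)))  # True
```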

Here is another well-known identity.


Lemma 4.17. ∑_{k=0}^{n} C(n, k) = 2^n.

Proof. Let us once again count the number of possible subsets of a set of n elements. We have already seen by induction and by the product rule that there are 2^n such subsets; this is the RHS.

Another way to count is to use the sum rule:

# of subsets = ∑_{k=0}^{n} (# of subsets of size k) = ∑_{k=0}^{n} C(n, k)

This is the LHS.



     1                      C(0,0)
    1 1                  C(1,0) C(1,1)
   1 2 1              C(2,0) C(2,1) C(2,2)
  1 3 3 1          C(3,0) C(3,1) C(3,2) C(3,3)
 1 4 6 4 1      C(4,0) C(4,1) C(4,2) C(4,3) C(4,4)

Figure 4.2: Pascal's triangle contains the binomial coefficients C(n, k) ordered
as shown in the figure. Each entry in the figure is the sum of the two entries
on top of it (except the entries on the side which are always 1).

The next identity is more tricky:

Lemma 4.18. ∑_{k=0}^{n} k·C(n, k) = n·2^(n−1).

Proof. The identity actually gives two ways to count the following problem: given n people,
how many ways are there to pick a committee of any size, and then pick a chairperson of
the committee? The first way to count is:

• Use the sum rule to count committees of different sizes individually.

• For committees of size k, there are C(n, k) ways of choosing the committee, and
independently, k ways of choosing a chairperson from the committee.

• This gives a total of ∑_{k=0}^{n} k·C(n, k) possibilities; this is the LHS.

The second way to count is:

• Pick the chairman first; there are n choices.

• For the remaining n − 1 people, each person can either be part of the committee or not; there are 2^(n−1) possibilities.

• This gives a total of n·2^(n−1) possibilities; this is the RHS.
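A numerical check of Lemma 4.18 for small n:

```python
from math import comb

# sum_k k*C(n,k) = n * 2^(n-1) for every n we try
ok = all(sum(k * comb(n, k) for k in range(n + 1)) == n * 2 ** (n - 1)
         for n in range(1, 12))
print(ok)  # True
```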

A similar trick can be used to prove Vandermonde's Identity.



Lemma 4.19 (Vandermonde's Identity). If r ≤ m and r ≤ n, then

C(m + n, r) = ∑_{k=0}^{r} C(m, r − k)·C(n, k)

Proof. Let M be a set with m elements and N be a set with n elements. Then the LHS represents the number of possible ways to pick r elements from M and N together. Equivalently, we can count the same process by splitting into cases (the sum rule): let k range from 0 to r, and consider picking r − k elements from M and k elements from N.

The next theorem explains the name "binomial coefficients": the combination function C(n, k) also gives the coefficients of powers of the simplest binomial, (x + y).

Theorem 4.20 (The Binomial Theorem). For n ∈ N,

(x + y)^n = ∑_{k=0}^{n} C(n, k) x^(n−k) y^k

Proof. If we manually expand (x + y)^n, we would get 2^n terms with coefficient 1 (each term corresponds to choosing x or y from each of the n factors). If we then collect these terms, how many of them have the form x^(n−k) y^k? Terms of that form must choose n − k many x's, and k many y's. Because just choosing the k many y's specifies the rest to be x's, there are C(n, k) such terms.

Example 4.21. What is the coefficient of x^13 y^7 in the expansion of (x − 3y)^20? We write (x − 3y)^20 as (x + (−3y))^20 and apply the binomial theorem, which gives us the term C(20, 7) x^13 (−3y)^7 = −3^7 C(20, 7) x^13 y^7.
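A check of Example 4.21 with math.comb; we also evaluate both sides of the binomial theorem at x = 1, y = −3:

```python
from math import comb

coeff = (-3) ** 7 * comb(20, 7)  # coefficient of x^13 y^7 in (x - 3y)^20
print(coeff)  # -169536240

# Binomial theorem at x = 1, y = -3: both sides equal (1 - 3)^20
lhs = sum(comb(20, k) * (-3) ** k for k in range(21))
print(lhs == (1 - 3) ** 20)  # True
```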

If we substitute specific values for x and y, the binomial theorem gives us more combinatorial identities as corollaries.

Corollary 4.22. ∑_{k=0}^{n} C(n, k) = 2^n, again.

Proof. Simply write 2^n = (1 + 1)^n and expand using the binomial theorem.

Corollary 4.23. ∑_{k=1}^{n} (−1)^(k+1) C(n, k) = 1.

Proof. Expand 0 = 0^n = (1 − 1)^n using the binomial theorem:

0 = ∑_{k=0}^{n} C(n, k) 1^(n−k) (−1)^k
  = C(n, 0) + ∑_{k=1}^{n} (−1)^k C(n, k)

Rearranging terms gives us

C(n, 0) = −∑_{k=1}^{n} (−1)^k C(n, k) = ∑_{k=1}^{n} (−1)^(k+1) C(n, k)

This proves the corollary since C(n, 0) = 1.

4.4 Inclusion-Exclusion Principle


Some counting problems simply do not have a closed form solution (drats!). In this section we discuss a counting tool that also does not give a closed form solution. The inclusion-exclusion principle can be seen as a generalization of the sum rule.

Suppose there are n(A) ways to perform task A and n(B) ways to perform task B; how many ways are there to perform task A or B, if the methods to perform these tasks are not distinct? We can cast this as a set cardinality problem. Let X be the set of ways to perform A, and Y be the set of ways to perform B. Then:

|X ∪ Y| = |X| + |Y| − |X ∩ Y|

This can be observed using a Venn diagram. The counting argument goes as follows: To count the number of ways to perform A or B (|X ∪ Y|), we start by adding the number of ways to perform A (i.e., |X|) and the number of ways to perform B (i.e., |Y|). But if some of the ways to perform A and B are the same (|X ∩ Y|), they have been counted twice, so we need to subtract those.

Example 4.24. How many positive integers ≤ 100 are multiples of either 2 or 5? Let A be the set of multiples of 2 and B be the set of multiples of 5. Then |A| = 50, |B| = 20, and |A ∩ B| = 10 (since this is the number of multiples of 10). By the inclusion-exclusion principle, we have 50 + 20 − 10 = 60 multiples of either 2 or 5.

What if there are more tasks? For three sets, we can still glean from the Venn diagram that

|X ∪ Y ∪ Z| = |X| + |Y| + |Z| − |X ∩ Y| − |X ∩ Z| − |Y ∩ Z| + |X ∩ Y ∩ Z|

More generally,

Theorem 4.25. Let A_1, . . . , A_n be finite sets. Then,

|⋃_{i=1}^{n} A_i| = ∑_{k=1}^{n} (−1)^(k+1) ∑_{I⊆{1,...,n}, |I|=k} |⋂_{i∈I} A_i| = ∑_{∅≠I⊆{1,...,n}} (−1)^(|I|+1) |⋂_{i∈I} A_i|

Proof. Consider some x ∈ ⋃_i A_i. We need to show that it gets counted exactly once in the RHS. Suppose that x is contained in exactly m of the starting sets (A_1 to A_n), 1 ≤ m ≤ n. Then for each k ≤ m, x appears in C(m, k) many k-way intersections (that is, if we look at ⋂_{i∈I} A_i for all |I| = k, x appears in C(m, k) many terms). Therefore, the number of times x gets counted by the inclusion-exclusion formula is exactly

∑_{k=1}^{m} (−1)^(k+1) C(m, k)

and this is 1 by Corollary 4.23.

Example 4.26. How many onto functions are there from a set A with n elements to a set B with m ≤ n elements? We start by computing the number of functions that are not onto. Let A_i be the set of functions that miss the i-th element of B (i.e., do not have the i-th element of B in their range). ⋃_{i=1}^{m} A_i is then the set of functions that are not onto. By the inclusion-exclusion principle, we have:

|⋃_{i=1}^{m} A_i| = ∑_{k=1}^{m} (−1)^(k+1) ∑_{I⊆{1,...,m}, |I|=k} |⋂_{i∈I} A_i|

For any k and I with |I| = k, observe that ⋂_{i∈I} A_i is the set of functions that miss a particular set of k elements, therefore

|⋂_{i∈I} A_i| = (m − k)^n

Also observe that there are exactly C(m, k) many different I's of size k. Using these two
facts, we have

|∪_{i=1}^m Aᵢ| = Σ_{k=1}^m (−1)^{k+1} Σ_{I⊆{1,…,m}, |I|=k} |∩_{i∈I} Aᵢ|
             = Σ_{k=1}^m (−1)^{k+1} C(m, k)(m − k)^n

Finally, to count the number of onto functions, we take all possible functions (m^n many)
and subtract the functions that are not onto:

m^n − Σ_{k=1}^m (−1)^{k+1} C(m, k)(m − k)^n = m^n + Σ_{k=1}^m (−1)^k C(m, k)(m − k)^n
                                            = Σ_{k=0}^m (−1)^k C(m, k)(m − k)^n

where the last step relies on the fact that (−1)^k C(m, k)(m − k)^n = m^n when k = 0. This
final expression Σ_{k=0}^m (−1)^k C(m, k)(m − k)^n is closely related to the Stirling
numbers of the second kind, another counting function (similar to C(n, k)) that is out of
the scope of this course.
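The final expression is easy to evaluate, and for small parameters it can be cross-checked against direct enumeration. A short Python sketch (our own illustration; the function names are not from the course):

```python
from itertools import product
from math import comb

def count_onto(n: int, m: int) -> int:
    """Onto functions from an n-element set to an m-element set,
    by inclusion-exclusion: sum_{k=0..m} (-1)^k C(m,k) (m-k)^n."""
    return sum((-1) ** k * comb(m, k) * (m - k) ** n for k in range(m + 1))

def count_onto_brute(n: int, m: int) -> int:
    """Enumerate all m^n functions from {0..n-1} to {0..m-1}; keep the onto ones."""
    return sum(1 for f in product(range(m), repeat=n)
               if set(f) == set(range(m)))

print(count_onto(4, 3), count_onto_brute(4, 3))  # 36 36
```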

Counting the Complement

Counting the complement follows the same philosophy as the inclusion-exclusion principle:
sometimes it is easier to over-count first, and subtract some later.
This is best explained by examples.

Example 4.27. How many standard poker hands (5 cards from a 52-card deck) contain at
least a pair²? We could count the number of hands that are (strict) pairs, two-pairs, three-
of-a-kinds, full houses and four-of-a-kinds, and sum up the counts. It is easier, however,
to count all possible hands, and subtract the number of hands that do not contain at least
a pair, i.e., hands where all 5 cards have different ranks:

C(52, 5) − C(13, 5) · 4^5

where C(52, 5) counts all possible hands, C(13, 5) chooses the five ranks, and 4^5 chooses
a suit for each rank.

²By this we mean hands where at least two cards share the same rank. A slightly
more difficult question (but perhaps more interesting in a casino) is how many hands are
better than or equal to a pair? (I.e., a straight does not contain a pair, but is better than a pair.)
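Counting the complement is also easy to carry out numerically; a Python sketch of the computation above (our own illustration):

```python
from math import comb

all_hands = comb(52, 5)              # all 5-card hands
no_pair = comb(13, 5) * 4 ** 5       # 5 distinct ranks, then a suit for each
at_least_pair = all_hands - no_pair  # counting the complement

print(all_hands, at_least_pair)  # 2598960 1281072
```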

4.5 Pigeonhole Principle


In this final section, we cover the pigeonhole principle: a proof technique that
relates to counting. The principle says that if we place k + 1 or more pigeons
into k pigeon holes, then at least one pigeon hole contains 2 or more pigeons.
For example, in a group of 367 people, at least two people must have the same
birthday (since there are a total of 366 possible birthdays). More generally, we
have

Lemma 4.28 (Pigeonhole Principle). If we place n (or more) pigeons into k
pigeon holes, then at least one pigeon hole contains ⌈n/k⌉ or more pigeons.

Proof. Assume the contrary, that every pigeon hole contains at most ⌈n/k⌉ − 1 < n/k
pigeons. Then the total number of pigeons among the pigeon holes would be
strictly less than k(n/k) = n, a contradiction.

Example 4.29. In a group of 800 people, there are at least ⌈800/366⌉ = 3 people
with the same birthday.
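The bound in the lemma is just a ceiling computation; a small Python sketch (our own illustration), including a randomized placement of pigeons to see that the bound is never beaten:

```python
import random
from math import ceil

def pigeonhole_bound(n: int, k: int) -> int:
    """At least one of k pigeon holes receives at least ceil(n/k) of the n pigeons."""
    return ceil(n / k)

print(pigeonhole_bound(367, 366))  # 2: two people share a birthday
print(pigeonhole_bound(800, 366))  # 3: Example 4.29

# Any way of placing 800 pigeons into 366 holes respects the bound.
holes = [0] * 366
for _ in range(800):
    holes[random.randrange(366)] += 1
assert max(holes) >= pigeonhole_bound(800, 366)
```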

Chapter 5

Probability

“. . . the chances of survival are 725. . . to 1.”


– C-3PO

Originally motivated by gambling, the study of probability is now fundamental to a
wide variety of subjects, including social behavior (e.g., economics and game theory) and
physical laws (e.g., quantum mechanics and radioactive decay). In computer science, it
is essential for studying randomized algorithms (e.g., randomized quick sort, primality
testing), average-case problems (e.g., spam detection), and cryptography.

What is probability? What does it mean that a fair coin toss comes up heads with
probability 50%? One interpretation is Bayesian: “50%” is a statement of our beliefs,
and how much we are willing to bet on one coin toss. Another interpretation is more
experimental: “50%” means that if we toss the coin 10 million times, it will come up
heads in roughly 5 million tosses. Regardless of how we view probability, this chapter
introduces the mathematical formalization of probability, accompanied by useful
analytical tools to go with the formalization.

5.1 Probability Spaces


In this class, we focus on discrete probability spaces.¹

Definition 5.1 (Probability Space). A probability space is a pair (S, f) where S is a
countable set called the sample space, and f : S → [0, 1]² is called the probability
mass function. Additionally, f satisfies the property Σ_{x∈S} f(x) = 1.

¹Without formally defining this term, we refer to random processes whose outcomes are discrete,
such as dice rolls, as opposed to picking a uniformly random real number from zero to one.

Intuitively, the sample space S corresponds to the set of possible states that
the world could be in, and the probability mass function f assigns a probability
from 0 to 1 to each of these states. To model our conventional notion of
probability, we require that the total probability assigned by f to all possible
states should sum up to 1.

Definition 5.2 (Event). Given a probability space (S, f), an event is simply a
subset of S. The probability of an event E, denoted by Pr_{(S,f)}[E] = Pr[E], is
defined to be Σ_{x∈E} f(x). In particular, the event that includes “everything”,
E = S, has probability Pr[S] = 1.

Even though events and probabilities are not well-defined without a probability
space (e.g., see the quote of the chapter), by convention, we often omit S
and f in our statements when they are clear from context.

Example 5.3. Consider rolling a regular 6-sided die. The sample space is S = {1,
2, 3, 4, 5, 6}, and the probability mass function is constant: f(x) = 1/6 for all x ∈
S. The event of an even roll is E = {2, 4, 6}, and this occurs with probability

Pr[E] = Σ_{x∈{2,4,6}} f(x) = 1/2

The probability mass function used in the above example has a (popular)
property: it assigns equal probability to all the elements in the sample space.

Equiprobable Probability Spaces


Definition 5.4. A probability space (S, f) is equiprobable if f is constant, i.e., there
exists a constant λ such that f(x) = λ for all x ∈ S. In this case we call f an
equiprobable mass function.

The most common examples of probability spaces, such as rolling a die,
flipping a coin and dealing a deck of (well-shuffled) cards, are all equiprobable.
The next theorem reveals a common structure (and limitation) among
equiprobable probability spaces.

Theorem 5.5. Let (S, f) be an equiprobable probability space. Then S is finite, f
takes on the constant value 1/|S|, and the probability of an event E ⊆ S is |E|/|S|.

²By [0, 1] we mean the real interval {x | 0 ≤ x ≤ 1}.

In other words, calculating probabilities under an equiprobable probability space is
just counting.

Proof. Let f take on the constant value λ. First note that λ ≠ 0, because λ = 0
would force Σ_{x∈S} f(x) = 0, violating the definition of mass functions. Next
note that S cannot be infinite, because that would force Σ_{x∈S} f(x) = Σ_{x∈S} λ = ∞,
again violating the definition of mass functions.

Knowing that S is finite, we can then deduce that:

1 = Σ_{x∈S} f(x) = Σ_{x∈S} λ = |S|λ  ⟹  |S| = 1/λ

It follows that the probability of an event E is

Pr[E] = Σ_{x∈E} λ = Σ_{x∈E} 1/|S| = |E|/|S|

Example 5.6. What is the probability that a random hand of five cards in poker is a full
house? We have previously counted the number of possible five-card hands and the
number of possible full houses (Examples 4.12 and 4.13). Since each hand is equally
likely (i.e., we are dealing with an equiprobable probability space), the probability of
a full house is:

(# of possible full houses) / (# of possible hands) = C(13, 1)C(4, 3)C(12, 1)C(4, 2) / C(52, 5) ≈ 0.144%
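Since the space is equiprobable, the probability is just a ratio of two counts; in Python (our own illustration):

```python
from math import comb

# Full house: a rank for the triple, 3 of its 4 suits,
# then a rank for the pair, 2 of its 4 suits.
full_houses = comb(13, 1) * comb(4, 3) * comb(12, 1) * comb(4, 2)
all_hands = comb(52, 5)

print(full_houses, full_houses / all_hands)  # 3744, roughly 0.00144
```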

Theorem 5.5 highlights two limitations of equiprobable probability spaces.
Firstly, the sample space must be finite; as we will soon see, some natural random
processes require an infinite sample space. Secondly, the (equally assigned) probability
of any event must be rational. That is, we cannot have an event that occurs with probability
1/√2; this is required, for example, to formalize game theory in a satisfactory way.
Despite these limitations, most of the time, we will deal with equiprobable probability
spaces.

Infinite Sample Spaces


How might we construct a probability mass function for an infinite sample space?
For example, how might one pick a random positive integer (S = ℕ⁺)? We illustrate some
possibilities (and subtleties) with the following examples.

Example 5.7. We may have the probability space (ℕ⁺, f) where f(n) = 1/2^n. This
corresponds with the following experiment: how many coin tosses does it take for a head
to come up?³ We expect this to be a well-defined probability space since it corresponds
to a natural random process. But to make sure, we verify that Σ_{n∈ℕ⁺} 1/2^n = 1.⁴

Example 5.8. Perhaps at a whim, we want to pick the positive integer n with probability
proportional to 1/n². In this case we need to normalize the probability. Knowing that
Σ_{n∈ℕ⁺} 1/n² = π²/6,⁵ we can assign f(n) = (6/π²)(1/n²), so that Σ_{n∈ℕ⁺} f(n) = 1.

Example 5.9. Suppose now we wish to pick the positive integer n with probability
proportional to 1/n. This time we are bound to fail, since the series 1 + 1/2 + 1/3 + ⋯
diverges (approaches ∞), and cannot be normalized.
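All three behaviors are visible numerically; here is a Python sketch of the partial sums (the finite cutoffs are ours, chosen for illustration):

```python
from math import pi

# Example 5.7: f(n) = 1/2^n; the partial sums approach 1 quickly.
geometric = sum(1 / 2 ** n for n in range(1, 60))

# Example 5.8: normalizing 1/n^2 by 6/pi^2; convergence to 1 is slower but real.
basel = (6 / pi ** 2) * sum(1 / n ** 2 for n in range(1, 10 ** 6))

# Example 5.9: 1/n diverges; the partial sums keep growing (roughly like ln n).
h_small = sum(1 / n for n in range(1, 10 ** 4))
h_large = sum(1 / n for n in range(1, 10 ** 5))

print(geometric, basel, h_large - h_small)
```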

Probabilities
Now that probability spaces are defined, we give a few basic properties of probability:

Claim 5.10. If A and B are disjoint events (A ∩ B = ∅) then Pr[A ∪ B] = Pr[A] + Pr[B].

Proof. By definition,

Pr[A ∪ B] = Σ_{x∈A∪B} f(x)
          = Σ_{x∈A} f(x) + Σ_{x∈B} f(x)    since A and B are disjoint
          = Pr[A] + Pr[B]

Corollary 5.11. For any event E, Pr[Ē] = 1 − Pr[E].

Proof. This follows directly from Claim 5.10, E ∪ Ē = S, and E ∩ Ē = ∅.


³To see this, observe that in order for the first head to occur on the n-th toss, we must have the
unique sequence of tosses that starts with n − 1 tails and ends with a head. On the other hand,
there are 2^n equally-probable sequences of n coin tosses. This gives probability 1/2^n.
⁴One way to compute the sum is to observe that it is a converging geometric series. More
directly, let S = 1/2 + 1/4 + ⋯, and observe that S = 2S − S = (1 + 1/2 + 1/4 + ⋯) − (1/2 + 1/4 + ⋯) = 1.
⁵This is the Basel problem, first solved by Euler.

When events are not disjoint, we instead have the following generalization
of the inclusion-exclusion principle.

Claim 5.12. Given events A and B, Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B].

Proof. First observe that A ∪ B = (A − B) ∪ (B − A) ∪ (A ∩ B), and that all the
terms on the RHS are disjoint. Therefore

Pr[A ∪ B] = Pr[A − B] + Pr[B − A] + Pr[A ∩ B]    (5.1)

Similarly, we have

Pr[A] = Pr[A − B] + Pr[A ∩ B]    (5.2)

Pr[B] = Pr[B − A] + Pr[A ∩ B]    (5.3)

because, say, A is the disjoint union of A − B and A ∩ B. Substituting (5.2) and (5.3) into
(5.1) gives

Pr[A ∪ B] = Pr[A − B] + Pr[B − A] + Pr[A ∩ B]
          = (Pr[A] − Pr[A ∩ B]) + (Pr[B] − Pr[A ∩ B]) + Pr[A ∩ B]
          = Pr[A] + Pr[B] − Pr[A ∩ B]

We remark that given an equiprobable probability space, Claim 5.12 is exactly
equivalent to the inclusion-exclusion principle. An easy corollary of Claim 5.12 is the
union bound.

Corollary 5.13 (Union Bound). Given events A and B, Pr[A ∪ B] ≤ Pr[A] + Pr[B].
In general, given events A₁, …, Aₙ,

Pr[∪ᵢ Aᵢ] ≤ Σᵢ Pr[Aᵢ]

5.2 Conditional Probability and Independence


Let us continue with the interpretation that probabilities represent our beliefs on the
state of the world. How does knowing that one event has occurred affect our beliefs on
the probability of another event? E.g., if it is cloudy instead of sunny, then it is more likely
to rain. Perhaps some events are independent and do not affect each other. E.g., we
believe the result of a fair coin-toss does not depend on the result of previous tosses.
In this section we capture these notions with the study of conditional probability and
independence.

Conditional Probability
Suppose after receiving a random 5-card hand dealt from a standard 52-card deck, we
are told that the hand contains “at least a pair” (that is, at least two of the cards have the
same rank). How do we calculate the probability of a full-house given this extra
information? Consider the following thought
process:

• Start with the original probability space of containing all 5-card hands, pair or no
pair.

• To take advantage of our new information, eliminate all hands that do


not contain a pair.

• Re-normalize the probability among the remaining hands (that contain at least a
pair).

Motivated by this line of reasoning, we define conditional probability as follows:

Definition 5.14. Let A and B be events, and let Pr[B] ≠ 0. The conditional
probability of A, conditioned on B, denoted by Pr[A | B], is defined as

Pr[A | B] = Pr[A ∩ B] / Pr[B]

In the case of an equiprobable probability space, we have

Pr[A | B] = |A ∩ B| / |B|

because the probability of an event is proportional to the cardinality of the event.

Example 5.15 (Second Ace Puzzle). Suppose we have a deck of four cards: {A♠, 2♠,
A♥, 2♥}. After being dealt two random cards, facing down, the dealer tells us that we have
at least one ace in our hand. What is the probability that our hand has both aces? That is,
what is Pr[ two aces | at least one ace ]?

Because we do not care about the order in which the cards were dealt, we have an
equiprobable space with 6 outcomes:

{A♠2♠, A♠A♥, A♠2♥, 2♠A♥, 2♠2♥, A♥2♥}

If we look closely, five of the outcomes contain at least one ace, while only one outcome
has both aces. Therefore Pr[ two aces | at least one ace ] = 1/5.

Now what if the dealer tells us that we have the ace of spades (A♠) in our
hand? Now Pr[ two aces | has ace of spades ] = 1/3. It might seem strange that
the probability of two aces has gone up; why should finding out the suit of the
ace we have increase our chances? The intuition is that by finding out the suit
of our ace and knowing that the suit is spades, we can eliminate many more
hands that are not two aces.
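Both conditional probabilities can be recovered by enumerating the six equiprobable hands; a Python sketch (the card labels are ours, with 's' for spades and 'h' for hearts):

```python
from itertools import combinations

deck = ["As", "2s", "Ah", "2h"]           # ace/2 of spades, ace/2 of hearts
hands = list(combinations(deck, 2))       # the 6 equiprobable two-card hands

both_aces = [h for h in hands if "As" in h and "Ah" in h]
at_least_one_ace = [h for h in hands if "As" in h or "Ah" in h]
has_ace_of_spades = [h for h in hands if "As" in h]

# In an equiprobable space, Pr[A | B] = |A ∩ B| / |B|.
print(len(both_aces) / len(at_least_one_ace))   # 1/5 = 0.2
print(len(both_aces) / len(has_ace_of_spades))  # 1/3, about 0.333
```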

Example 5.16 (Second-Child Problem). Let us assume that a child is equally
likely to be a boy or a girl. A friend of yours has two children but you don't know
their sexes. Consider the following two cases. A boy walks in the door and your
friend says,
their sexes. Consider the following two cases. A boy walks in the door and your
friend says,
“This is my child.” (5.4)
Or, a boy walks in the door and your friend says,
“This is my older child.” (5.5)
What is the probability that both children are boys?
If we order the children by age, we again have an equiprobable probability
space⁶. The four outcomes are { boy-boy, boy-girl, girl-boy, girl-girl }.
Therefore,

Pr[ two boys | (5.4)] = 1/3


Pr[ two boys | (5.5)] = 1/2
Now suppose that we know exactly one of the children plays the cello. If a
boy walks in the door and your friend says,
“He is the one who plays the cello.”
then we are in the same case as (5.5) (instead of ordering by age, we order the
children according to who plays the cello).
One more food for thought. What if we know that at least one of the children
plays the cello? Now what is the probability that both children are boys, if a boy
walks in, and starts playing the cello? To calculate this probability, first we need
to enlarge the sample space; each child, in addition to being a boy or a girl,
either plays the cello or does not. Next we need to specify a probability mass
function on this space. This is where we get stuck, since we need additional
information to define the probability mass function.
E.g., what is the probability that both children play the cello, vs. only one child
playing the cello?
⁶If we do not order the children, then our sample space could be {two boys, one boy
and one girl, two girls}, with probability 1/4, 1/2, and 1/4 respectively. This probability
space is suitable for case (5.4), but not (5.5).

Independence
By defining conditional probability, we model how the occurrence of one event can affect
the probability of another event. An equally interesting concept is independence, where
a set of events do not affect each other.

Definition 5.17 (Independence). A sequence of events A₁, …, Aₙ are (mutually)
independent⁷ if and only if for every subset of these events, A_{i₁}, …, A_{i_k},

Pr[A_{i₁} ∩ A_{i₂} ∩ ⋯ ∩ A_{i_k}] = Pr[A_{i₁}] · Pr[A_{i₂}] ⋯ Pr[A_{i_k}]

If there are just two events, A and B, then they are independent if and only
if Pr[A ∩ B] = Pr[A] Pr[B]. The following claim gives justification to the definition
of independence.

Claim 5.18. If A and B are independent events and Pr[B] ≠ 0, then Pr[A | B] = Pr[A].
In other words, conditioning on B does not change the probability of A.

Proof.

Pr[A | B] = Pr[A ∩ B] / Pr[B] = Pr[A] Pr[B] / Pr[B] = Pr[A]

The following claim should also hold according to our intuition of independence:

Claim 5.19. If A and B are independent events, then A and B̄ are also independent events.
In other words, if A is independent of the occurrence of B, then it is also independent of
the “non-occurrence” of B.

Proof.

Pr[A ∩ B̄] = Pr[A] − Pr[A ∩ B]                     by Claim 5.10
          = Pr[A] − Pr[A] Pr[B]                    by independence
          = Pr[A](1 − Pr[B]) = Pr[A] Pr[B̄]         by Corollary 5.11

⁷A related notion that we do not cover in this class is pair-wise independence. A
sequence of events A₁, …, Aₙ are pair-wise independent if and only if for every pair of
events in the sequence, A_{i₁}, A_{i₂}, we have Pr[A_{i₁} ∩ A_{i₂}] = Pr[A_{i₁}] Pr[A_{i₂}].
Pair-wise independence is a weaker requirement than (mutual) independence, and is
therefore easier to achieve in applications.

A common use of independence is to predict the outcome of n coin tosses, or
more generally, the outcome of n independent Bernoulli trials (for now, think of a
Bernoulli trial with success probability p as a biased coin toss that comes up
“success” with probability p and “failure” with probability 1 − p).

Theorem 5.20. The probability of having exactly k successes in n independent
Bernoulli trials with success probability p is C(n, k) p^k (1 − p)^{n−k}.

Proof. If we denote success by S and failure by F, then our probability space is the
set of n-character strings containing the letters S and F (e.g., SFF⋯F denotes the
outcome that the first Bernoulli trial is successful, while all the rest failed). Using
our counting tools, we know that the number of such strings with exactly k occurrences
of S (success) is C(n, k). Each of those strings occurs with probability p^k (1 − p)^{n−k}
due to independence.
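The formula agrees with summing directly over the S/F strings; a Python sketch for small n (our own illustration):

```python
from itertools import product
from math import comb

def binomial_pmf(n: int, k: int, p: float) -> float:
    """Theorem 5.20: Pr[exactly k successes in n Bernoulli(p) trials]."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_pmf_brute(n: int, k: int, p: float) -> float:
    """Add up the probability of every S/F string with exactly k successes."""
    return sum(p ** k * (1 - p) ** (n - k)
               for s in product("SF", repeat=n) if s.count("S") == k)

print(binomial_pmf(5, 2, 0.3))        # about 0.3087
print(binomial_pmf_brute(5, 2, 0.3))  # same value
```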

Bayes' Rule
Suppose that we have a test against a rare disease that affects only 0.3% of the
population, and that the test is 99% effective (i.e., if a person has the disease the
test says YES with probability 0.99, and otherwise it says NO with probability 0.99).
If a random person in the populace tested positive, what is the probability that he
has the disease? The answer is not 0.99. Indeed, this is an exercise in conditional
probability: what are the chances that a random person has the rare disease, given
the occurrence of the event that he tested positive?

We start with some preliminaries.

Claim 5.21. Let A₁, …, Aₙ be disjoint events with non-zero probability such that
∪ᵢ Aᵢ = S (i.e., the events are exhaustive; the events partition the sample space
S). Let B be an event. Then Pr[B] = Σ_{i=1}^n Pr[B | Aᵢ] Pr[Aᵢ].

Proof. By definition Pr[B | Aᵢ] = Pr[B ∩ Aᵢ]/Pr[Aᵢ], and so the RHS evaluates
to

Σ_{i=1}^n Pr[B ∩ Aᵢ]

Since A₁, …, Aₙ are disjoint, it follows that the events B ∩ A₁, …, B ∩ Aₙ are
also disjoint. Therefore

Σ_{i=1}^n Pr[B ∩ Aᵢ] = Pr[∪_{i=1}^n (B ∩ Aᵢ)] = Pr[B ∩ ∪_{i=1}^n Aᵢ] = Pr[B ∩ S] = Pr[B]
Machine Translated by Google

82 chance

Theorem 5.22 (Bayes' Rule). Let A and B be events with non-zero probabilities. Then:

Pr[B | A] = Pr[A | B] Pr[B] / Pr[A]

Proof. Multiply both sides by Pr[A]. Now by the definition of conditional probability,
both sides equal:

Pr[B | A] Pr[A] = Pr[A ∩ B] = Pr[A | B] Pr[B]

A remark on notation: the symbol for conditioning “|” is similar to that of
division. While Pr[A | B] is definitely not Pr[A]/Pr[B], it does have the form
“stuff/Pr[B]”. In this sense, the above two proofs are basically “multiply by
the denominator”.

Sometimes we expand the statement of Bayes' Rule with Claim 5.21:

Corollary 5.23 (Bayes' Rule Expanded). Let A and B be events with non-zero
probability. Then:

Pr[B | A] = Pr[A | B] Pr[B] / (Pr[B] Pr[A | B] + Pr[B̄] Pr[A | B̄])

Proof. We apply Claim 5.21, using that B and B̄ are disjoint and B ∪ B̄ = S.

We return to our original question of testing for rare diseases. Let's consider the
sample space S = {(t, d) | t ∈ {0, 1}, d ∈ {0, 1}}, where t represents
the outcome of the test on a random person in the populace, and d represents
whether the same person carries the disease or not. Let D be the event that a
randomly drawn person has the disease (d = 1), and T be the event that a
randomly drawn person tests positive (t = 1).

We know that Pr[D] = 0.003 (because 0.3% of the population has the disease). We
also know that Pr[T | D] = 0.99 and Pr[T | D̄] = 0.01 (because the
test is 99% effective). Using Bayes' rule, we can now calculate the probability
that a random person, who tested positive, actually has the disease:

Pr[D | T] = Pr[T | D] Pr[D] / (Pr[D] Pr[T | D] + Pr[D̄] Pr[T | D̄])
          = (.99 × .003) / (.003 × .99 + .997 × .01) ≈ 0.23

Notice that 23%, while significant, is a far cry from 99% (the effectiveness
of the test). This final probability can vary if we have a different prior (initial
belief). For example, if a random patient has other medical conditions that raise the
probability of contracting the disease to 10%, then the final probability of having the
disease, given a positive test, rises to 92%.
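The computation is a one-line application of the expanded Bayes' rule; a Python sketch (the function name is ours):

```python
def posterior(prior: float, true_pos: float, false_pos: float) -> float:
    """Pr[D | T] via the expanded Bayes' rule (Corollary 5.23):
    Pr[T|D]Pr[D] / (Pr[D]Pr[T|D] + Pr[not D]Pr[T|not D])."""
    return (true_pos * prior) / (prior * true_pos + (1 - prior) * false_pos)

print(posterior(0.003, 0.99, 0.01))  # about 0.23  (0.3% prevalence)
print(posterior(0.10, 0.99, 0.01))   # about 0.92  (10% prior belief)
```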

Conditional Independence

Bayes' rule shows us how to update our beliefs when we receive new information. What
if we receive multiple signals at once? How do we compute Pr[A | B1 ÿ B2]? First we need
the notion of conditional independence.

Definition 5.24 (Conditional Independence). A sequence of events B₁, …, Bₙ are
conditionally independent given event A if and only if for every subset of the
sequence of events, B_{i₁}, …, B_{i_k},

Pr[∩_k B_{i_k} | A] = Π_k Pr[B_{i_k} | A]

In other words, given that the event A has occurred, the events B₁, …, Bₙ are
independent.

When there are only two events, B₁ and B₂, they are conditionally independent
given event A if and only if Pr[B₁ ∩ B₂ | A] = Pr[B₁ | A] Pr[B₂ | A].
The notion of conditional independence is somewhat fickle, as illustrated by the
following examples:

Independence does not imply conditional independence. Suppose we toss a fair coin
twice; let H₁ and H₂ be the events that the first and second coin tosses come up
heads, respectively. Then H₁ and H₂ are independent:

Pr[H₁] Pr[H₂] = (1/2) · (1/2) = 1/4 = Pr[H₁ ∩ H₂]

However, if we are told that at least one of the coin tosses came up tails (call
this event T), then H₁ and H₂ are no longer independent given T:

Pr[H₁ | T] Pr[H₂ | T] = (1/3) · (1/3) ≠ 0 = Pr[H₁ ∩ H₂ | T]

Conditional independence does not imply independence. Suppose we have two coins,
one heavily biased towards heads, and the other heavily biased towards
tails (say with probability 0.99). First we choose a coin at random; let BH be
the event that we choose the coin that is biased towards heads. Next we toss
the chosen coin twice; let H₁ and H₂ be the events that the first and second
coin tosses come up heads, respectively. Then, given that we chose the coin
biased towards heads (the event BH), H₁ and H₂ are independent:

Pr[H₁ | BH] Pr[H₂ | BH] = 0.99 · 0.99 = 0.99² = Pr[H₁ ∩ H₂ | BH]

However, H₁ and H₂ are not independent, since if the first toss came up heads, it
is most likely that we chose the coin that is biased towards heads, and so the
second toss will come up heads as well. The actual probabilities are:

Pr[H₁] Pr[H₂] = (1/2) · (1/2) = 1/4 ≠ Pr[H₁ ∩ H₂] = 0.5(0.99²) + 0.5(0.01²) ≈ 0.5

Independence conditioned on event A does not imply independence conditioned on the
complement event Ā. Consider twins. Let G₁ and G₂ be the events that the first and
second child is a girl, respectively. Let A be the event that the twins are fraternal
(non-identical). Then given event A, it is reasonable to assume that G₁ and G₂ are
independent. On the other hand, given event Ā (identical twins), G₁ and G₂ are most
certainly dependent, since identical twins must both be boys or both be girls.

Let us return to the question of computing Pr[A | B₁ ∩ B₂]. If we assume that the
signals B₁ and B₂ are independent when conditioned on A, and also independent when
conditioned on Ā, then:

Pr[A | B₁ ∩ B₂] = Pr[B₁ ∩ B₂ | A] Pr[A] / (Pr[A] Pr[B₁ ∩ B₂ | A] + Pr[Ā] Pr[B₁ ∩ B₂ | Ā])
                = Pr[B₁ | A] Pr[B₂ | A] Pr[A] / (Pr[A] Pr[B₁ | A] Pr[B₂ | A] + Pr[Ā] Pr[B₁ | Ā] Pr[B₂ | Ā])

In general, given signals B₁, …, Bₙ that are conditionally independent given A and
conditionally independent given Ā, we have

Pr[A | ∩ᵢ Bᵢ] = Pr[A] Πᵢ Pr[Bᵢ | A] / (Pr[A] Πᵢ Pr[Bᵢ | A] + Pr[Ā] Πᵢ Pr[Bᵢ | Ā])

Application: Spam Detection


Using “training data” (e-mails classified as spam or not by hand), we can
estimate the probability that a message contains a certain string conditioned
on being spam (or not), e.g., Pr[ “viagra” | spam ], Pr[ “viagra” | not spam ].
We can also estimate the chance that a random e-mail is spam, i.e., Pr[spam]
(this is about 80% in real life, although most spam detectors are “unbiased”
and assume Pr[spam] = 50% to make nicer calculations).

By choosing a diverse set of keywords, say W₁, …, Wₙ, and assuming
that the occurrences of these keywords are conditionally independent given a
spam message or given a non-spam e-mail, we can use Bayes' rule to estimate
the chance that an e-mail is spam based on the words it contains (we have
simplified the expression assuming Pr[spam] = Pr[not spam] = 0.5):

Pr[spam | ∩ᵢ Wᵢ] = Πᵢ Pr[Wᵢ | spam] / (Πᵢ Pr[Wᵢ | spam] + Πᵢ Pr[Wᵢ | not spam])

5.3 Random Variables


We use events to express whether a particular class of outcomes has occurred
or not. Sometimes we want to express more: for example, after 100 fair coin
tosses, we want to study how many coin tosses were heads (instead of focusing
on just one event, say, that there were 50 coin tosses). This takes us to the
definition of random variables.

Definition 5.25. A random variable X on a probability space (S, f) is a function
from the sample space to the real numbers, X : S → ℝ.

Back to the example of 100 coin tosses: given any outcome of the experiment
s ∈ S, we would define X(s) to be the number of heads that occurred in that
outcome.

Definition 5.26. Given a random variable X on a probability space (S, f), we can
consider a new probability space (S′, f_X) where the sample space S′ is the range
of X, S′ = {X(s) | s ∈ S}, and the probability mass function is extended from f:
f_X(x) = Pr_{S,f}[{s | X(s) = x}]. We call f_X the probability distribution or the
probability density function of the random variable X. Similarly, we define the
cumulative distribution or the cumulative density function of X: F_X(x) =
Pr_{S,f}[{s | X(s) ≤ x}].

Example 5.27. Suppose we toss two 6-sided dice. The sample space is the set of
pairs of outcomes, S = {(i, j) | i, j ∈ {1, …, 6}}, and the probability
mass function is equiprobable. Consider the random variables X₁(i, j) = i,
X₂(i, j) = j and X(i, j) = i + j. These random variables denote the outcome of the
first die, the outcome of the second die, and the sum of the two dice,
respectively. The probability density function of X is

f_X(1) = 0
f_X(2) = Pr[{(1, 1)}] = 1/36
f_X(3) = Pr[{(1, 2), (2, 1)}] = 2/36
⋮
f_X(6) = Pr[{(1, 5), (2, 4), …, (5, 1)}] = 5/36
f_X(7) = Pr[{(1, 6), (2, 5), …, (6, 1)}] = 6/36
f_X(8) = Pr[{(2, 6), (3, 5), …, (6, 2)}] = 5/36 = f_X(6)
⋮
f_X(12) = 1/36

And the cumulative density function of X is

F_X(2) = 1/36
F_X(3) = 1/36 + 2/36 = 1/12
⋮
F_X(12) = 1
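The listing above can be computed by enumerating the 36 outcomes; a Python sketch using exact fractions (our own illustration):

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))   # the 36 equiprobable outcomes

def pmf(x):
    """f_X(x) = Pr[X = x], where X(i, j) = i + j."""
    return Fraction(sum(1 for i, j in space if i + j == x), len(space))

def cdf(x):
    """F_X(x) = Pr[X <= x]."""
    return Fraction(sum(1 for i, j in space if i + j <= x), len(space))

print(pmf(7))   # 1/6  (i.e., 6/36)
print(cdf(3))   # 1/12
```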

Notation Regarding Random Variables


We can describe events by applying predicates to random variables (e.g., the
event that X, the number of heads, is equal to 50). We often use a short-hand
notation (in which we treat random variables as if they are real numbers), as
demonstrated in the following examples. Let X and Y be random variables:

X = 50   is the event   {s ∈ S | X(s) = 50}
Y ≤ X    is the event   {s ∈ S | Y(s) ≤ X(s)}

Using this notation, we can define the probability density function of a random
variable X as f_X(x) = Pr[X = x], and the cumulative density function as
F_X(x) = Pr[X ≤ x].
In a similar vein, we can define new random variables from existing
random variables. In Example 5.27, we can write X = X₁ + X₂, to mean that
for any s ∈ S, X(s) = X₁(s) + X₂(s) (the addition makes sense because X₁(s)
and X₂(s) are real numbers).

Independent Random Variables


The intuition behind independent random variables is just like that of events:
the value of one random variable should not affect the value of another
independent random variable.

Definition 5.28. A sequence of random variables X₁, …, Xₙ are (mutually)
independent if and only if for every subset X_{i₁}, …, X_{i_k} and for any real
numbers x₁, x₂, …, x_k, the events X_{i₁} = x₁, X_{i₂} = x₂, …, X_{i_k} = x_k
are (mutually) independent.

In the case of two random variables X and Y, they are independent if and
only if for all real values x and y, Pr[X = x ∩ Y = y] = Pr[X = x] Pr[Y = y].
As mentioned before, a common use of independence is to model the outcome
of consecutive coin tosses. This time we model it as the sum of independent
random variables. Consider a biased coin that comes up heads with probability p.
Define X = 1 if the coin comes up heads and X = 0 if the coin comes up tails;
then X is called the Bernoulli random variable (with probability p). Suppose now
we toss this biased coin n times, and let Y be the random variable that denotes
the total number of occurrences of heads.⁸ We can view Y as a sum of independent
random variables, Y = Σ_{i=1}^n Xᵢ, where Xᵢ is a Bernoulli random variable with
probability p that represents the outcome of the i-th toss. We leave it as an
exercise to show that the random variables X₁, …, Xₙ are indeed independent.

5.4 Expectation
Given a random variable defined on a probability space, what is its “average”
value? Naturally, we need to weigh things according to the probability that
the random variable takes on each value.

Definition 5.29. Given a random variable X on a probability space (S, f), we
define the expectation of X to be

E[X] = Σ_{x ∈ range(X)} Pr[X = x] · x = Σ_{x ∈ range(X)} f_X(x) · x

An alternative but equivalent definition is

E[X] = Σ_{s∈S} f(s) X(s)
⁸Just for fun, we can calculate the density function and cumulative density function of
Y. By Theorem 5.20, f_Y(k) = C(n, k) p^k (1 − p)^{n−k}, and F_Y(k) = Σ_{i=0}^k C(n, i) p^i (1 − p)^{n−i}.

These definitions are equivalent because:

Σ_{x ∈ range(X)} Pr[X = x] · x
  = Σ_{x ∈ range(X)} Σ_{s ∈ (X=x)} f(s) · x       expanding Pr[X = x]; recall that X = x is the event {s | X(s) = x}
  = Σ_{x ∈ range(X)} Σ_{s ∈ (X=x)} f(s) · X(s)    replacing x with X(s)
  = Σ_{s∈S} f(s) X(s)                             the events X = x partition S as x ranges over the range of X

The following simple fact can be shown with a similar argument:

Claim 5.30. Given a random variable X and a function g : ℝ → ℝ,

E[g(X)] = Σ_{x ∈ range(X)} Pr[X = x] g(x)

Proof.

Σ_{x ∈ range(X)} Pr[X = x] g(x)
  = Σ_{x ∈ range(X)} Σ_{s ∈ (X=x)} f(s) g(x)
  = Σ_{x ∈ range(X)} Σ_{s ∈ (X=x)} f(s) g(X(s))
  = Σ_{s∈S} f(s) g(X(s)) = E[g(X)]

Example 5.31. Suppose in a game, with probability 1/10 we are paid $10, and with
probability 9/10 we are paid $2. What is our expected payment? The answer is

(1/10) · $10 + (9/10) · $2 = $2.80

Example 5.32. Given a biased coin that ends up heads with probability p, how many
tosses does it take for the coin to show heads, in expectation?
We can consider the state space S = {H, TH, TTH, TTTH, . . . }; these are possible
results of a sequence of coin tosses that ends when we see the first

head. Because each coin toss is independent, we define the probability mass function to
be

    f(T^i H) = f(i tails followed by a head) = (1 − p)^i p

We leave it as an exercise to show that f is a valid probability mass function.⁹


Let X be the random variable that denotes the number of coin tosses needed
for heads to show up. Then X(T^i H) = i + 1. The expectation of X is

    E[X] = ∑_{i=0}^∞ (i + 1) p (1 − p)^i
         = p ∑_{i=0}^∞ (i + 1)(1 − p)^i = p · (1/p^2) = 1/p

where we used the fact that ∑_{i=0}^∞ (i + 1)x^i = 1/(1 − x)^2 whenever |x| < 1.¹⁰
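The 1/p answer can be sanity-checked by simulation. This is an illustrative sketch (the function name and the choice p = 1/4 are our own, not from the text):

```python
import random

def tosses_until_heads(p, rng):
    """Toss a p-biased coin until it shows heads; return the number of tosses."""
    n = 1
    while rng.random() >= p:  # rng.random() < p models a head
        n += 1
    return n

rng = random.Random(0)  # fixed seed for reproducibility
p = 0.25
trials = 100_000
avg = sum(tosses_until_heads(p, rng) for _ in range(trials)) / trials
print(avg)  # close to 1/p = 4
```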

Application to Game Theory


In game theory, we assign a real number, called the utility, to each outcome in the sample
space of a probabilistic game. We then assume that rational players make decisions that
maximize their expected utility. For example, should we pay $2 to participate in the game
in Example 5.31? If we assume that our utility is exactly the amount of money that we
earn, then

    with probability 1/10 we get paid $10 and get utility 8
    with probability 9/10 we get paid $2 and get utility 0

This gives a positive expected utility of (1/10) · 8 = 0.8, so we should play the game!
This reasoning of utility does not always explain human behavior though.
Suppose there is a game that costs a thousand dollars to play. With one chance in a
million, the reward is two billion dollars (!), but otherwise there is no reward. The expected
utility is
    (1/10^6) · (2 × 10^9 − 1000) + (1 − 1/10^6) · (0 − 1000) = 1000

One expects to earn a thousand dollars from the game on average. Would you play it?
Turns out many people are risk-averse and would turn down
9 Recall that an infinite geometric series with ratio |x| < 1 converges: ∑_{i=0}^∞ x^i = 1/(1 − x).

10 To see this, let S = ∑_{i=0}^∞ (i + 1)x^i, and observe that if |x| < 1, then S(1 − x) is a
converging geometric series: S(1 − x) = S − xS = (x^0 + 2x^1 + 3x^2 + · · ·) − (x^1 + 2x^2 + · · ·)
= x^0 + x^1 + x^2 + · · · = 1/(1 − x). Dividing by (1 − x) once more gives S = 1/(1 − x)^2.

the game. After all, except with one chance in a million, you simply lose a
thousand dollars. This example shows that expectation does not capture all the
important features of a random variable, such as how likely the random
variable is to end up close to its expectation (in this case, the utility is either −1000 or
two billion, not close to the expectation of 1000 at all).
In other instances, people are risk-seeking. Take yet another game that takes
a dollar to play. This time, with one chance in a billion, the reward is a million
dollars; otherwise there is no reward. The expected utility is

    (1/10^9) · (10^6 − 1) + (1 − 1/10^9) · (0 − 1) ≈ −0.999

Essentially, to play the game is to throw a dollar away. Would you play the
game? Turns out many people do; this is called a lottery. Many people think
losing a dollar will not change their life at all, but the chance of winning a million
dollars is worth it, even if the chance is tiny. One way to explain this behavior
within the utility framework is to say that perhaps earning or losing just a dollar is
not worth 1 point in utility.

Linearity of Expectation

One nice property of expectation is that the expectation of a sum of random
variables is the sum of the expectations. This can often simplify the calculation of
expectations (or in applications, the estimation of expectations). More generally,

Theorem 5.33. Let X1, . . . , Xn be random variables, and a1, . . . , an be real
constants. Then

    E[∑_{i=1}^n ai Xi] = ∑_{i=1}^n ai E[Xi]

Proof.

    E[∑_{i=1}^n ai Xi] = ∑_{s ∈ S} f(s) ∑_{i=1}^n ai Xi(s)
                       = ∑_{s ∈ S} ∑_{i=1}^n ai f(s) Xi(s)
                       = ∑_{i=1}^n ai ∑_{s ∈ S} f(s) Xi(s)
                       = ∑_{i=1}^n ai E[Xi]

Example 5.34. If we make n tosses of a biased coin that ends up heads with
probability p, what is the expected number of heads? Let Xi = 1 if the i-th toss is heads,
and Xi = 0 otherwise. Then Xi is an independent Bernoulli random variable with probability
p, and has expectation

    E[Xi] = p · 1 + (1 − p) · 0 = p

The expected number of heads is then

    E[∑_{i=1}^n Xi] = ∑_{i=1}^n E[Xi] = np

Thus if the coin was fair, we would expect (1/2)n, half of the tosses, to be
heads.
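Example 5.34 can also be verified exactly by brute force, computing the expectation directly from the definition (summing over all 2^n outcomes) rather than via linearity; both give np. A small sketch, with a helper name of our own choosing:

```python
from itertools import product
from fractions import Fraction

def expected_heads(n, p):
    """Exact E[number of heads] in n tosses of a p-biased coin, by enumeration."""
    total = Fraction(0)
    for outcome in product([0, 1], repeat=n):  # 1 = heads, 0 = tails
        prob = Fraction(1)
        for b in outcome:
            prob *= p if b else 1 - p
        total += prob * sum(outcome)
    return total

print(expected_heads(5, Fraction(1, 3)))  # 5/3, matching np
```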

Markov's Inequality
The expectation of a non-negative random variable X gives us a (relatively weak) bound
on the probability of X growing too large:

Theorem 5.35 (Markov's Inequality). If X is a non-negative random variable (i.e., X ≥ 0)
and a > 0, then Pr[X ≥ a] ≤ E[X]/a.

Proof. The expectation of X is the weighted sum of the possible values of X, where the
weights are the probabilities. Consider the random variable Y defined by

    Y = a   if X ≥ a
    Y = 0   if a > X ≥ 0

Clearly Y ≤ X at all times and so E[Y] ≤ E[X] (easy to verify from the definition
of expectation). Now observe that

    E[X] ≥ E[Y] = a · Pr[Y = a] + 0 · Pr[Y = 0] = a · Pr[X ≥ a]

Rearranging the terms gives us Markov's inequality.

Example 5.36. Let X be a non-negative random variable. Most people are comfortable with the
assumption that X would not exceed one thousand times its expectation,
because

    Pr[X ≥ 1000 E[X]] ≤ E[X] / (1000 E[X]) = 1/1000
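Markov's inequality is easy to probe empirically. The sketch below uses a setup of our own (X = number of heads in 20 fair tosses, so E[X] = 10) and compares the observed tail probability Pr[X ≥ a] against the bound E[X]/a:

```python
import random

# X = number of heads in 20 fair coin tosses; E[X] = 10.
rng = random.Random(1)
samples = [sum(rng.random() < 0.5 for _ in range(20)) for _ in range(50_000)]
mean = sum(samples) / len(samples)  # close to 10

for a in (12, 15, 18):
    tail = sum(x >= a for x in samples) / len(samples)
    print(f"Pr[X >= {a}] = {tail:.4f}  <=  E[X]/{a} = {mean / a:.4f}")
```

As expected, Markov's bound holds but is quite loose for the larger thresholds.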

5.5 Variance
Consider the following two random variables:

    1. X = 1 with probability 1.
    2. Y = 0 with probability 1 − 10^−6, and Y = 10^6 with probability 10^−6.

Both X and Y have expectation 1, but they have very different distributions.
To capture their differences, we introduce the variance of a random variable, which
captures how “spread out” the random variable is from its expectation.
Definition 5.37. The variance of a random variable X is

    Var[X] = E[(X − E[X])^2]

Intuitively, the term (X − E[X])^2 measures the distance of X to its
expectation. The term is squared to ensure that the distance is always
positive (perhaps we could use the absolute value instead, but it turns out that defining
variance with a square gives it much nicer properties).
Example 5.38. Going back to the start of the section, the random variable
X (that takes the constant value 1) has E[X] = 1 and Var[X] = 0 (it is never
different from its mean). The random variable

    Y = 0      with probability 1 − 10^−6
    Y = 10^6   with probability 10^−6

also has expectation E[Y] = 1, but variance

    Var[Y] = E[(Y − E[Y])^2] = (1 − 10^−6) · (0 − 1)^2 + (10^−6) · (10^6 − 1)^2 ≈ 10^6

Example 5.39. Let X be a Bernoulli random variable with probability p. Then

    Var[X] = E[(X − E[X])^2] = E[(X − p)^2]
           = p(1 − p)^2 + (1 − p)(−p)^2 = (1 − p)p(1 − p + p) = (1 − p)p
Sometimes it is easier to calculate the variance using the following formula.

Theorem 5.40. Let X be a random variable. Then

    Var[X] = E[X^2] − E[X]^2
Proof.

    Var[X] = E[(X − E[X])^2]
           = E[X^2 − (2 E[X])X + E[X]^2]
           = E[X^2] − (2 E[X]) E[X] + E[X]^2      by linearity of expectation
           = E[X^2] − E[X]^2
Example 5.41. Let X be a Bernoulli random variable with probability p, and let us calculate
its variance with the new formula:

    Var[X] = E[X^2] − E[X]^2
           = p · 1^2 + (1 − p) · 0^2 − p^2 = p − p^2 = p(1 − p)
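Definition 5.37 and the formula of Theorem 5.40 can be checked side by side on any finite distribution. A sketch with exact rationals (helper names are ours):

```python
from fractions import Fraction

def expectation(dist):
    """E[X] for a distribution given as {value: probability}."""
    return sum(p * x for x, p in dist.items())

def variance_def(dist):
    """Var[X] = E[(X - E[X])^2], straight from Definition 5.37."""
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

def variance_alt(dist):
    """Var[X] = E[X^2] - E[X]^2, the formula of Theorem 5.40."""
    return sum(p * x ** 2 for x, p in dist.items()) - expectation(dist) ** 2

p = Fraction(2, 5)
bernoulli = {1: p, 0: 1 - p}
print(variance_def(bernoulli), variance_alt(bernoulli))  # both p(1-p) = 6/25
```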

Chebyshev's Inequality
Knowing the variance of a random variable X lets us bound how far X strays from its expectation:

Theorem 5.42 (Chebyshev's Inequality). Let X be a random variable. Then

    Pr[|X − E[X]| ≥ k] ≤ Var[X] / k^2
2
Proof. Let Y = |X − E[X]|. Applying Markov's Inequality to Y^2, we have

    Pr[|X − E[X]| ≥ k] = Pr[Y ≥ k] = Pr[Y^2 ≥ k^2]      because Y ≥ 0
                       ≤ E[Y^2]/k^2 = E[(X − E[X])^2]/k^2 = Var[X]/k^2

Example 5.43. The variance (more specifically, the square root of the variance) can be
used as a “ruler” to measure how much a random variable deviates from its expectation.
By convention, let σ = √Var[X], so that Var[X] = σ^2. Then

    Pr[|X − E[X]| ≥ nσ] ≤ Var[X] / (n^2 σ^2) = 1/n^2
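Chebyshev's bound is usually loose, which we can see by comparing it with the exact tail of a binomial distribution (the choice n = 100, p = 1/2, giving μ = 50 and σ = 5, is our own illustration):

```python
from math import comb, sqrt

# Exact binomial distribution: X = number of heads in n = 100 fair tosses.
n, p = 100, 0.5
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
mu = sum(k * q for k, q in enumerate(pmf))                      # np = 50
sigma = sqrt(sum((k - mu)**2 * q for k, q in enumerate(pmf)))   # sqrt(np(1-p)) = 5

for m in (2, 3, 4):
    tail = sum(q for k, q in enumerate(pmf) if abs(k - mu) >= m * sigma)
    print(f"Pr[|X - 50| >= {m}*sigma] = {tail:.5f}  <=  1/{m}^2 = {1 / m**2:.5f}")
```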

Chapter 6

Logic

“Logic will get you from A to B. Imagination will take you everywhere.”
–Albert Einstein

Logic is a formal study of mathematics; it is the study of mathematical reasoning
and proofs themselves. In this chapter we cover the two most basic forms of logic. In propositional
logic, we consider basic connectives such as AND, OR, and NOT. In first-order logic,
we additionally include tools to reason, for example, about “for all prime numbers” or
“for some bijective function”.
There are many more logical systems studied by mathematicians that we do not cover
(e.g., modal logic for reasoning about knowledge, or temporal logic for reasoning about
time).

6.1 Propositional Logic


A formula in propositional logic consists of atoms and connectives. An atom is a
primitive proposition, such as “it is raining in Ithaca” or “I open my umbrella”; we
usually denote atoms by capital letters (P, Q, R). The atoms are then joined
by connectives, such as AND (∧), OR (∨), NOT (¬), implication (→), and iff (↔). An example
of a formula is

    (P ∧ Q) → R

If P is the atom “it is raining in Ithaca”, Q is the atom “I have an umbrella”, and R is the
atom “I open my umbrella”, then the formula reads as:

    If it is raining in Ithaca and I have an umbrella, then I open my umbrella.

Formally, we define formulas recursively:


• Every atom is a formula.

• If φ and ψ are formulas, then

    ¬φ,  φ ∧ ψ,  φ ∨ ψ,  φ → ψ,  φ ↔ ψ

  are all valid formulas.

What does P ∧ Q → R mean? Just like in arithmetic, where multiplication has
precedence over addition, here the order of precedence is: NOT (¬), AND (∧),
OR (∨), implication (→), equivalence (↔). The preferred way to disambiguate
a formula, of course, is to use parentheses (e.g., it is clearer, and equivalent,
to write (P ∧ Q) → R).

Semantics of Propositional Logic


Here is how we interpret propositional logic. An atom can either be true (T
or 1) or false (F or 0). This is specified by a truth assignment or, in logic
jargon, an interpretation (e.g., an interpretation would specify whether it is really
raining today, or whether I have opened my umbrella). The connectives are
functions from truth value(s) to a truth value; these functions are defined to
reflect the meaning of the connectives' English names. The formal definition
of these functions can be seen in the truth table in Figure 6.1.

    φ ψ | ¬φ | φ ∧ ψ | φ ∨ ψ | φ → ψ | φ ↔ ψ
    T T |  F |   T   |   T   |   T   |   T
    T F |  F |   F   |   T   |   F   |   F
    F T |  T |   F   |   T   |   T   |   F
    F F |  T |   F   |   F   |   T   |   T

Figure 6.1: The truth table definition of the connectives NOT (¬), AND (∧),
OR (∨), implication (→), and equivalence (↔).

Most of the definitions are straightforward. NOT flips a truth value; AND
outputs true iff both inputs are true; OR outputs true iff at least one of the
inputs is true; equivalence outputs true iff both inputs have the same truth
value. Implication (→) may seem strange at first: φ → ψ is false only when φ is
true yet ψ is false. In particular, φ → ψ is true whenever φ is false, regardless
of what ψ is. An example of this in English might be “if pigs fly, then I am the

president of the United States”; this seems like a correct statement regardless
of who says it since pigs don't fly in our world.1
Finally, we denote the truth value of a formula φ, evaluated on an interpretation I,
by φ[I]. We define φ[I] inductively:

• If φ is an atom P, then φ[I] is the truth value assigned to P in the
  interpretation I.

• If φ = ¬ψ, then φ[I] = ¬ψ[I] (using Figure 6.1).

• If φ = ψ1 ∧ ψ2, then φ[I] = ψ1[I] ∧ ψ2[I] (using Figure 6.1). The value of φ[I] is
  similarly defined if φ = ψ1 ∨ ψ2, ψ1 → ψ2 or ψ1 ↔ ψ2.

Given a formula φ, we call the mapping from interpretations to the truth value
of φ (i.e., the mapping that takes I to φ[I]) the truth table of φ.
At this point, for convenience, we add the symbols T and F as special
atoms that are always true or false, respectively. This does not add any real
substance to propositional logic, since we can always replace T by “P ∨ ¬P” (which
always evaluates to true), and F by “P ∧ ¬P” (which always evaluates to false).

Equivalence of Formulas
We say that two formulas φ and ψ are equivalent (denoted φ ≡ ψ) if for all
interpretations I, they evaluate to the same truth value (equivalently, if φ and
ψ have the same truth table). How many possible truth tables are there over n
atoms? Because each atom is either true or false, we have 2^n interpretations.
A formula can evaluate to true or false on each of these interpretations, resulting
in 2^(2^n) possible truth tables (essentially we are counting the number of functions
of the form {0, 1}^n → {0, 1}).
With such a large count of distinct (non-equivalent) formulas, we may
wonder: is our propositional language rich enough to capture all of them? The
answer is yes. The following example can be extended to show how AND,
OR and NOT (∧, ∨ and ¬) can be used to capture any truth table. Suppose we

1
A related notion, counterfactuals, is not captured by propositional implication. In the
sentence “if pigs were to fly then they would have wings”, the speaker knows that pigs do not
fly, but wish to make a logical conclusion in an imaginary world where pigs do. Formalizing
counterfactuals is still a topic of research in logic.

want to capture the truth table for implication:

    P Q | φ (= P → Q)
    T T |      T
    T F |      F
    F T |      T
    F F |      T

We find the rows where φ is true; for each such row we create an AND formula
that is true iff P and Q take on the values of that row, and then we OR these
formulas together. That is:

    (P ∧ Q)  ∨  (¬P ∧ Q)  ∨  (¬P ∧ ¬Q)
    first row   third row    fourth row

This can be simplified to the equivalent formula:

    (P ∧ Q) ∨ (¬P ∧ Q) ∨ (¬P ∧ ¬Q) ≡ (P ∧ Q) ∨ ¬P ≡ ¬P ∨ Q

The equivalence

    P → Q ≡ ¬P ∨ Q                                             (6.1)

is a very useful way to think about implication (and a very useful formula for
manipulating logic expressions).
Finally, we remark that we do not need both OR and AND (∨ and ∧) to
capture all truth tables. This follows from De Morgan's laws:

    ¬(φ ∧ ψ) ≡ ¬φ ∨ ¬ψ
    ¬(φ ∨ ψ) ≡ ¬φ ∧ ¬ψ                                         (6.2)

Coupled with the (simple) equivalence ¬¬φ ≡ φ, we can eliminate AND (∧), for
example, using

    φ ∧ ψ ≡ ¬(¬φ ∨ ¬ψ)
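Equivalences like (6.1) and De Morgan's laws can be verified by brute force over all truth assignments, exactly as the definition of equivalence suggests. A sketch, with helper names of our own (formulas are represented as Python functions of their atoms):

```python
from itertools import product

def equivalent(f, g, nvars):
    """phi ≡ psi: same truth value on every one of the 2^n truth assignments."""
    return all(f(*vals) == g(*vals) for vals in product([False, True], repeat=nvars))

def implies(p, q):
    return (not p) or q  # equivalence (6.1): P -> Q  ≡  ¬P ∨ Q

# The DNF built from the truth table of implication is equivalent to P -> Q:
dnf = lambda p, q: (p and q) or ((not p) and q) or ((not p) and (not q))
print(equivalent(dnf, implies, 2))  # True

# De Morgan's laws (6.2):
print(equivalent(lambda p, q: not (p and q), lambda p, q: (not p) or (not q), 2))  # True
print(equivalent(lambda p, q: not (p or q), lambda p, q: (not p) and (not q), 2))  # True
```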

Satisfiability and Validity

Intuitively, a formula is satisfiable if it can be made true.

Definition 6.1 (Satisfiability). We say that a truth assignment I satisfies a formula
φ if φ[I] = T; we write this as I |= φ. A formula φ is satisfiable if there exists a truth
assignment I such that I |= φ; otherwise φ is unsatisfiable.

Even better, a formula is valid if it is always true.



Definition 6.2 (Validity). A formula φ is valid (or a tautology) if for all truth
assignments I, I |= φ.

Example 6.3.
• P ∧ Q is satisfiable.

• P ∧ ¬P is unsatisfiable.

• P ∨ ¬P is valid.

• For a more complicated example, the following formula is valid:

    (P → Q) ∨ (Q → P)

  To see why, note that it is equivalent to

    (¬P ∨ Q) ∨ (¬Q ∨ P)

  by (6.1), and clearly either ¬P or P is true.

How do we check if a formula φ is valid, satisfiable or unsatisfiable? A simple
way is to go over all possible truth assignments I and evaluate φ[I].
Are there more efficient algorithms?
It turns out that simply finding out whether an efficient algorithm exists for
satisfiability is a famous open problem² (with prize money of a million dollars
set by the Clay Mathematics Institute). The good news is that once a satisfying
assignment I is found for φ, everyone can check efficiently that φ[I] = T.
Unsatisfiability, on the other hand, does not have this property: even after taking
the time to verify that φ[I] = F for all possible truth assignments I, it appears
hard to convince anyone else of this fact (e.g., how do you convince someone
that you do not have a brother?).³ Finally, checking whether a formula is valid is
as hard as checking unsatisfiability.

Claim 6.4. A formula φ is valid if and only if ¬φ is unsatisfiable.

Proof. The claim essentially follows from the definitions. If φ is valid, then φ[I] = T for
every interpretation I. This means (¬φ)[I] = F for every interpretation I, and so ¬φ
is unsatisfiable. The other direction follows similarly.
2 In complexity jargon, checking if a formula is satisfiable is “NP-complete”, and finding
an efficient algorithm to determine satisfiability would show that P = NP.

3 In complexity jargon, the unsatisfiability problem is coNP-complete. The major open
problem here is whether or not NP = coNP; that is, whether there exists an efficient way of
convincing someone that a formula is unsatisfiable.
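The brute-force procedure described above, together with Claim 6.4, can be sketched in a few lines (formulas are represented as Python functions; all helper names are ours):

```python
from itertools import product

def satisfying_assignments(formula, nvars):
    """All truth assignments I with I |= formula, by exhaustive search."""
    return [vals for vals in product([False, True], repeat=nvars) if formula(*vals)]

def satisfiable(formula, nvars):
    return len(satisfying_assignments(formula, nvars)) > 0

def valid(formula, nvars):
    # Claim 6.4: a formula is valid iff its negation is unsatisfiable.
    return not satisfiable(lambda *vals: not formula(*vals), nvars)

print(satisfiable(lambda p, q: p and q, 2))    # True
print(satisfiable(lambda p: p and not p, 1))   # False
print(valid(lambda p: p or not p, 1))          # True
print(valid(lambda p, q: ((not p) or q) or ((not q) or p), 2))  # True: (P->Q) ∨ (Q->P)
```

Of course this exhaustive search takes time 2^n; whether satisfiability can be decided efficiently is exactly the open problem mentioned above.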

6.2 Logical Inference


Now that we have established the language and semantics (meaning) of
propositional logic, let us now reason with it. Suppose we know two facts. First,

“Bob carries an umbrella if it is cloudy and the forecast calls for rain.”
Next we know that

“It’s not cloudy.”

Can we conclude that Bob is not carrying an umbrella? The answer is no.
Bob may always carry an umbrella around to feel secure (say in Ithaca).
To make sure that we make correct logical deductions in more complex
settings, let us cast the example in the language of propositional logic. Let P be
the atom “it is cloudy”, Q be the atom “the forecast calls for rain”, and R be the
atom “Bob carries an umbrella”. Then we are given two premises:

    (P ∧ Q) → R,    ¬P

Can we draw the conclusion that ¬R is true? The answer is no, because the
truth assignment P = Q = F, R = T satisfies the premises, but does not satisfy
the conclusion.⁴ The next definition formalizes proper logical deductions.

Definition 6.5. A set of formulas {φ1, . . . , φn} entails a formula ψ, denoted by
φ1, . . . , φn |= ψ, if every truth assignment I that satisfies all of φ1, . . . , φn also
satisfies ψ.

When {φ1, . . . , φn} entails ψ, we consider ψ a logical consequence of
{φ1, . . . , φn}.

Theorem 6.6. φ1, . . . , φn entails ψ if and only if (φ1 ∧ · · · ∧ φn) → ψ is valid.

Proof. Only-if direction. Assume φ1, . . . , φn entails ψ. To show that φ = (φ1 ∧ · · ·
∧ φn) → ψ is valid, we need to show that for every truth assignment I, φ[I] = T.
Consider any truth assignment I; we have two cases:

• (φ1 ∧ · · · ∧ φn)[I] = F. In this case φ[I] = T by definition of implication
  (→).

• (φ1 ∧ · · · ∧ φn)[I] = T. Because φ1, . . . , φn entails ψ, we also have
  ψ[I] = T. This in turn makes φ[I] = T.
4 If we change the first premise to (P ∧ Q) ↔ R, i.e., “Bob carries an umbrella if and
only if it is cloudy and the forecast calls for rain”, then ¬R is a valid conclusion.

If direction. Assume φ = (φ1 ∧ · · · ∧ φn) → ψ is valid. For any truth assignment I that
satisfies all of φ1, . . . , φn, we have (φ1 ∧ · · · ∧ φn)[I] = T.
We also have φ[I] = ((φ1 ∧ · · · ∧ φn) → ψ)[I] = T due to validity. Together this means
ψ[I] must be true, by observing the truth table for implication (→). This shows that
φ1, . . . , φn entails ψ.

Theorem 6.6 gives us further evidence that we have defined implication (→)
correctly: we allow arguments to be valid even if the premises are false.
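Definition 6.5 translates directly into a brute-force check. The sketch below (helper names are ours) confirms that the umbrella premises do not entail ¬R:

```python
from itertools import product

def entails(premises, conclusion, nvars):
    """phi_1, ..., phi_n |= psi: every assignment satisfying all premises satisfies psi."""
    return all(
        conclusion(*vals)
        for vals in product([False, True], repeat=nvars)
        if all(phi(*vals) for phi in premises)
    )

# The umbrella example: (P ∧ Q) -> R and ¬P do NOT entail ¬R.
premises = [lambda p, q, r: (not (p and q)) or r,  # (P ∧ Q) -> R
            lambda p, q, r: not p]                 # ¬P
print(entails(premises, lambda p, q, r: not r, 3))      # False
print(entails(premises, lambda p, q, r: r or not r, 3)) # True: a tautology is always entailed
```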

Axiom Systems
Checking the validity of a formula is difficult (as we discussed, it is a long-standing
open question whether this can be done efficiently). On the other hand, we perform
logical reasoning every day, in mathematical proofs and in English. An axiom system
formalizes the reasoning tools we use in a syntactic way (i.e., via pattern matching
and string manipulations of formulas) so that we can study and eventually automate
the reasoning process.

Definition 6.7. An axiom system H consists of a set of formulas, called the axioms,
and a set of rules of inference. A rule of inference is a way of producing a new
formula (think of it as a new logical conclusion), given several established formulas
(think of them as known facts). A rule of inference has the form:

    φ1
    φ2
    ⋮
    φn
    ───
    ψ

This means “from the formulas φ1, . . . , φn we may infer ψ”. We also use the
notation φ1, . . . , φn ⊢ ψ (note that ⊢ is different from the symbol for
satisfiability, |=).

When we define an axiom system, think of the axioms as an initial set of
tautologies (preferably a small set) that describes our world (e.g., Euclidean geometry
has the axiom that two distinct points define a unique straight line). We can then
pattern match the axioms against the rules of inference to derive new tautologies
(logical consequences) from the initial set:

Definition 6.8. A proof or a derivation in an axiom system H is a sequence of
formulas φ1, φ2, . . . , φn where each formula φk either matches⁵ an axiom in H, or
follows from previous formulas via an inference rule from H, i.e., there exists an
inference rule ψ1, . . . , ψm ⊢ ψ such that φk matches ψ, and there exist j1, . . . , jm ∈
{1, . . . , k − 1} such that each φji matches the corresponding ψi.

Definition 6.9. We say a formula φ can be inferred from an axiom system H,
denoted by ⊢_H φ, if there exists a derivation in H that ends with the formula φ
(when understood, we leave out the subscript H for convenience).
Similarly, we say a set of formulas φ1, . . . , φn infers ψ (under axiom system
H), denoted by φ1, . . . , φn ⊢_H ψ, if there exists a derivation in H that ends in ψ,
when φ1, . . . , φn are treated as additional axioms.

It is very important to understand that, a priori, derivation and inference have
nothing to do with truth and validity. If we start with false axioms or illogical rules
of inference, we may end up deriving an invalid formula. On the other hand, if
we start with an incomplete set of axioms, or if we miss a few rules of inference,
we may not be able to derive some valid formulas. What we want is an axiom
system that is both complete and sound:

Completeness: An axiom system is complete if all valid statements can be
derived.

Soundness: An axiom system is sound if only valid statements can be
derived.

For example, an axiom system that contains an invalid axiom is not sound,
while a trivial axiom system that contains no axioms or no rules of inference is
trivially incomplete.

Rules of inference. Here are some well-known (and sound) rules of inference for
propositional logic:

    Modus Ponens:            φ → ψ,  φ        ⊢  ψ
    Modus Tollens:           φ → ψ,  ¬ψ       ⊢  ¬φ
    Hypothetical Syllogism:  φ → ψ,  ψ → χ    ⊢  φ → χ
    Disjunctive Syllogism:   φ ∨ ψ,  ¬φ       ⊢  ψ

It is easy to see that all of the above inference rules preserve validity, i.e.,
the antecedents (premises) entail the conclusion. Therefore an axiom system
using these rules will at least be sound.

5 We have left out the (rather tedious) formal definition of “matching” against axioms
or inference rules. This is best explained through examples later in the section.

Example 6.10. The following derivation shows that ¬C, ¬C → (A → C) ⊢ ¬A.

    1. ¬C                an axiom
    2. ¬C → (A → C)      an axiom
    3. A → C             Modus Ponens, from lines 1 and 2
    4. ¬A                Modus Tollens, from lines 3 and 1

In our application of Modus Ponens, we have “matched” ¬C with φ and
(A → C) with ψ.

Example 6.11. The following derivation shows that A ∨ B, ¬B ∨ (C ∧ ¬C), ¬A ⊢
C ∧ ¬C. Note that the conclusion is nonsense (it can never be
true); this is because we have started with a “bad” set of axioms.

    1. A ∨ B             an axiom
    2. ¬A                an axiom
    3. B                 Disjunctive Syllogism, from lines 1 and 2
    4. ¬B ∨ (C ∧ ¬C)     an axiom
    5. C ∧ ¬C            Disjunctive Syllogism, from lines 3 and 4

Axioms. An example axiom may look like this:

    φ → (ψ → φ)

By this we mean that any formula that “matches” against the axiom is assumed
to be true. For example, let P and Q be atoms; then

    P → (Q → P)

    (P ∧ Q) → ((Q → P) → (P ∧ Q))

are both assumed to be true (in the second example, we substitute φ = P ∧ Q,
ψ = Q → P). To have a sound axiom system, we must start with axioms
that are valid (tautologies); it is not hard to see that the example axiom is
indeed valid.

A sound and complete axiomatization. We now present a sound and
complete axiom system for propositional logic. We limit the connectives in
the language to only implication (→) and negation (¬); all other connectives

that we have introduced can be re-written using only → and ¬ (e.g., P ∨ Q ≡ ¬P →
Q, P ∧ Q ≡ ¬(P → ¬Q)).
Consider the axioms

    φ → (ψ → φ)                                            (A1)
    (φ → (ψ → χ)) → ((φ → ψ) → (φ → χ))                    (A2)
    (¬ψ → ¬φ) → (φ → ψ)                                    (A3)

Theorem 6.12. The axioms (A1), (A2) and (A3), together with the inference rule
Modus Ponens, form a sound and complete axiom system for propositional logic
(restricted to the connectives → and ¬).

The proof of Theorem 6.12 is out of the scope of this course (although keep
in mind that soundness follows from the fact that our axioms are tautologies and
Modus Ponens preserves validity). We remark that the derivations guaranteed
by Theorem 6.12 (for valid formulas) are by and large so long and tedious that
they are better suited to be generated and checked by computers.

Natural deduction. Natural deduction is another logical proof system that
generates proofs that appear closer to “natural” mathematical reasoning.
This is done by having a large set of inference rules to encompass all kinds of
reasoning steps that are seen in everyday mathematical proofs. We do not
formally define natural deduction here; instead, we simply list some example
rules of inference to give a taste of natural deduction. Note that these are all
valid inference rules and can be incorporated into axiom systems as well.

    Constructive Dilemma:  φ1 → ψ1,  φ2 → ψ2,  φ1 ∨ φ2   ⊢  ψ1 ∨ ψ2
    Resolution:            φ ∨ ψ,  ¬φ ∨ χ                ⊢  ψ ∨ χ
    Conjunction:           φ,  ψ                         ⊢  φ ∧ ψ
    Simplification:        φ ∧ ψ                         ⊢  φ
    Addition:              φ                             ⊢  φ ∨ ψ

Most of the time we also add rules of “replacement” which allow us to rewrite
formulas into equivalent (and simpler) forms, e.g.,

    ¬(φ ∧ ψ) ≡ ¬φ ∨ ¬ψ       De Morgan's law
    ¬¬φ ≡ φ                  double negation

6.3 First Order Logic


First order logic is an extension of propositional logic. First order logic operates
over a set of objects (e.g., real numbers, people, etc.). It allows us to express
properties of individual objects, to define relationships between objects, and,
most important of all, to quantify over the entire set of objects.
Below is a classic argument in first order logic:

    All men are mortal.
    Socrates is a man.
    Therefore, Socrates is mortal.

In first order logic, the argument might be translated as follows:

    ∀x Man(x) → Mortal(x)
    Man(Socrates)
    ∴ Mortal(Socrates)

Several syntax features of first order logic can be seen above: ∀ is one of the
two quantifiers introduced in first order logic; x is a variable; Socrates is a
constant (a particular person); Mortal(x) and Man(x) are predicates.
Formally, an atomic expression is a predicate symbol (e.g., Man(x),
LessThan(x, y)) with the appropriate number of arguments; the arguments can
either be constants (e.g., the number 0, Socrates) or variables (e.g., x, y and z).
A first order formula, similar to propositional logic, consists of multiple atomic
expressions connected by connectives. The formal recursive definition goes as follows:

• Every atomic expression is a formula.

• If φ and ψ are formulas, then ¬φ, φ ∧ ψ, φ ∨ ψ, φ → ψ and φ ↔ ψ are
  also formulas.

• [New to first order logic.] If φ is a formula and x is a variable, then ∀xφ (for
  all x the formula φ holds) and ∃xφ (for some x the formula φ holds) are
  also formulas.

Example 6.13. The following formula says that the binary predicate P is
transitive:

    ∀x∀y∀z (P(x, y) ∧ P(y, z)) → P(x, z)

Example 6.14. The following formula shows that the constant “1” is a
multiplicative identity (the ternary predicate Mult(x, y, z) is defined to be true if
xy = z):

    ∀x (Mult(1, x, x) ∧ Mult(x, 1, x))

Can you extend the formula to enforce that “1” is the unique multiplicative
identity?

Example 6.15. The following formula shows that every number except 0 has a
multiplicative inverse:

    ∀x∃y (¬Equals(x, 0) → Mult(x, y, 1))
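Over a finite domain, quantifiers reduce to `all()` and `any()`, which lets us test sentences like those in Examples 6.14 and 6.15 directly. A sketch using arithmetic modulo 5 as the domain (the domain choice and predicate names are our own illustration):

```python
# Domain: {0, 1, 2, 3, 4}; Mult(x, y, z) holds iff x*y = z (mod 5).
D = range(5)

def Mult(x, y, z):
    return (x * y) % 5 == z

# Example 6.14: "1" is a multiplicative identity, ∀x (Mult(1,x,x) ∧ Mult(x,1,x)).
print(all(Mult(1, x, x) and Mult(x, 1, x) for x in D))  # True

# Example 6.15: every nonzero element has a multiplicative inverse,
# ∀x∃y (x ≠ 0 → Mult(x, y, 1)); true here because 5 is prime.
print(all(any(Mult(x, y, 1) for y in D) for x in D if x != 0))  # True
```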

Semantics of First Order Logic


We have already described the intuitive meaning of first order logic formulas in
English, but let us now give it a formal treatment. Just as in propositional logic,
we need an interpretation I to assign values to constants, predicates, etc.
Additionally, we need a domain D that specifies the universe of objects, in order
for quantifiers to make sense.
First we define the notion of a sentence; these are formulas without “dangling”
variables.

Definition 6.16. An occurrence of a variable x in a formula φ is bound if
some quantifier operates on x (that is, it occurs in some sub-formula ψ that is
preceded by ∀x or ∃x); otherwise the occurrence of x is free. A sentence is a formula
with no free variables.

Example 6.17. In the following formula (which is not a sentence), the first
occurrence of x is free, and the second one is bound:

    ∀y P(x, y) → ∀x R(x)

The next formula is a sentence (note that in this case, ∀x captures both
occurrences of x):

    ∀y∀x (P(x, y) → R(x))

From now on we restrict ourselves to sentences, and define their truth
values. Recall that in propositional logic, we needed a truth assignment for each
propositional atom. In first order logic, we need:

• A domain D (simply a set of elements that we are concerned with).

• An interpretation I = ID for domain D such that


– for each constant c, the interpretation assigns an element of the
  domain, c[I] ∈ D.

– for each predicate P(x1, . . . , xn), the interpretation assigns a function
  P[I] : D^n → {T, F} (equivalently, the interpretation assigns an n-ary
  relation that contains all n-tuples that evaluate to true).

For example, in the Socrates example, we could have D be the set of all people
(or the set of all living creatures, or the set of all Greeks). An interpretation I
would need to single out Socrates in D, and also specify, for each a ∈ D, whether
Man(a) and Mortal(a) hold.
Given a first-order sentence φ, a domain D and an interpretation I = ID
(together (D, I) is called a model), we can define the truth value of φ, denoted
by φ[I], recursively:

• If φ is an atomic expression (i.e., a predicate), then because φ is a
  sentence, it is of the form P(c1, . . . , cn) where the ci are constants. The value
  of φ[I] is P[I](c1[I], . . . , cn[I]).

• If φ has the form ¬ψ, ψ1 ∧ ψ2, ψ1 ∨ ψ2, ψ1 → ψ2 or ψ1 ↔ ψ2, then φ[I] = ¬ψ[I],
  ψ1[I] ∧ ψ2[I], ψ1[I] ∨ ψ2[I], ψ1[I] → ψ2[I] or ψ1[I] ↔ ψ2[I], respectively (following
  the truth tables for ¬, ∧, ∨, →, and ↔).

• If φ has the form ∀xψ, then φ[I] is true if and only if for every element a ∈ D,
  ψ, with free occurrences of x replaced by a, evaluates to true.

• If φ has the form ∃xψ, then φ[I] is true if and only if there exists some
  element a ∈ D such that ψ, with free occurrences of x replaced by a,
  evaluates to true.

For instance, if the domain D is the natural numbers N, then

    ∀x P(x) ≡ P(0) ∧ P(1) ∧ · · ·
    ∃x P(x) ≡ P(0) ∨ P(1) ∨ · · ·

A note on the truth value of first order formulas [optional]. We have
cheated in our definition above, in the case that φ = ∀xψ or ∃xψ. When we replace
free occurrences of x in ψ by a, we no longer have a formula (because strictly
speaking, “a”, an element, is not part of the language). One workaround is to
extend the language with a constant for each element in the domain (this has to
be done after the domain D is fixed). A more common approach (but slightly
more complicated) is to define truth values for all formulas, including those
that are not sentences. In this case, the interpretation

also needs to assign an element a ∈ D to each (free) variable x occurring in φ;
this is out of the scope of this course.

Satisfiability and Validity


We define satisfiability and validity of first order sentences similarly to the way
they are defined for propositional logic.

Definition 6.18. Given a domain D and an interpretation I over D, we say (D, I)
satisfies a formula φ if φ[I] = T; in this case we write D, I |= φ. A formula φ is
satisfiable if there exist D and I such that D, I |= φ, and is unsatisfiable otherwise.
A formula φ is valid (or a tautology) if for every D and I, D, I |= φ.

Logical Reasoning and Axiom Systems


We can define entailment in the same way it was defined for propositional logic.
Just as for propositional logic, we can find a complete and sound axiomatization
for first-order logic, but the axiom system is much more complex to describe and
is out of the scope of this course.

6.4 Applications
Logic has a wide range of applications in computer science, including program
verification for correctness, process verification for security policies, information
access control, formal proofs of cryptographic protocols, etc.
In a typical application, we start by specifying a “model”, a desired property
in logic; e.g., we want to check that a piece of code does not create
deadlocks. We next describe the “system” in logic, e.g., the piece of code and
the logic behind code execution. It then remains to show that our system
satisfies our desired model, using tools in logic; this process is called model
checking. Edmund Clarke received the Turing award in 2007 for his
work on hardware verification using model checking. He graduated with his
Ph.D. from Cornell in 1976 with Bob Constable as advisor.

Chapter 7

Graphs

“One graph is worth a thousand logs.”


– Michal Aharon, Gilad Barash, Ira Cohen and Eli Mordechai.

Graphs are simple but extremely useful mathematical objects; they are
ubiquitous in practical applications of computer science. For example:

• In a computer network, we can model how the computers are connected to


each other as a graph. The nodes are the individual computers and the edges
are the network connections. This graph can then be used, for example, to
route messages as quickly as possible.

• In a digitalized map, nodes are intersections (or cities), and edges are roads (or
highways). We may have directed edges to capture one-way streets, and
weighted edges to capture distance. This graph is then used for generating
directions (eg, in GPS units).

• On the internet, nodes are web pages, and (directed) edges are links from one
web page to another. This graph can be used to rank the importance of each
web page for search results (eg, the importance of a web page can be
determined by how many other web pages are pointing to it, and recursively
how important those web pages are).

• In a social network, nodes are people, and edges are friendships. Understanding
social networks is a very hot topic of research. For example, how
does a network achieve “six degrees of separation”, where everyone is
approximately 6 friendships away from anyone else? Also known as the small
world phenomenon, Watts and Strogatz (from Cornell) published the first
models of social graphs that have this property, in 1998.


In Milgram's small world experiment, random people in Omaha, Nebraska


were tasked to route a letter to “Mr. Jacobs” in Boston, Massachusetts by
passing the letter only to someone they know on a first-name basis. The
average number of times that a letter switched hands
before reaching Mr. Jacobs was approximately 6! This meant that not
only are people well connected, they can route messages efficiently given
only the information of their friends (ie, knowledge only of a small,
local part of the graph). Jon Kleinberg (also from Cornell) gave the
first models for social graphs that allow such efficient, local routing algorithms.

Definition 7.1. A directed graph G is a pair (V, E) where V is a set of
vertices (or nodes), and E ⊆ V × V is a set of edges. An undirected graph
additionally has the property that (u, v) ∈ E if and only if (v, u) ∈ E.

In directed graphs, edge (u, v) (starting from node u, ending at node v) is
different from edge (v, u). We also allow “self-loops”, ie, edges of the form
(v, v) (say, a web page may link to itself). In undirected graphs, because edges
(u, v) and (v, u) must both be present or missing, we often treat a non-self-loop
edge as an unordered set of two nodes (eg, {u, v}).
A common extension is a weighted graph, where each edge additionally
carries a weight (a real number). The weight can have a variety of meanings
in practice: distance, importance and capacity, to name a few.

Graph Representations
The way a graph is represented by a computer can affect the efficiency of
various graph algorithms. Since graph algorithms are not a focus of this course, we
instead examine the space efficiency of the different common representations.
Given a graph G = (V, E):

Adjacency Matrix. We can number the vertices v1 to vn, and represent the
edges in an n by n matrix A. Row i and column j of the matrix, aij, is 1
if and only if there is an edge from vi to vj. If the graph is undirected,
then aij = aji and the matrix A is symmetric about the diagonal; in this
case we can just store the upper right triangle of the matrix.

Adjacency Lists. We can represent a graph by listing the vertices in V,
and for each vertex v, listing the edges that originate from v (ie, the
set Ev = {u | (v, u) ∈ E}).

Edge Lists. We may simply have a list of all the edges in E, which implicitly
defines a set of “interesting” vertices (vertices that have at least one edge
entering or leaving).

If the graph is dense (ie, has lots of edges), then consider the adjacency
matrix representation. The matrix requires storing O(n²) entries, which is
comparable to the space required by adjacency lists or edge lists if the graph is
dense. In return, the matrix allows very efficient lookups of whether an edge (u,
v) exists (by comparison, if adjacency lists are used, we would need to traverse
the whole adjacency list for the vertex u). For sparse graphs, using adjacency
lists or edge lists can result in large savings in the size of the representation.¹
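As a concrete sketch, the three representations can be built in Python as follows; the small example graph here is our own choice, not one from the text:

```python
# A small directed graph on vertices 0..3 (an example of our own choosing).
n = 4
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]   # edge list: simply the list itself

# Adjacency matrix: an n-by-n table; entry [u][v] is 1 iff edge (u, v) exists.
matrix = [[0] * n for _ in range(n)]
for (u, v) in edges:
    matrix[u][v] = 1

# Adjacency lists: for each vertex v, the set Ev of vertices v points to.
adj = {v: [] for v in range(n)}
for (u, v) in edges:
    adj[u].append(v)

# The matrix gives O(1) edge lookups; the lists save space on sparse graphs.
print(matrix[2][3])   # 1: edge (2, 3) exists
print(adj[2])         # [0, 3]
```

The trade-off mentioned above is visible here: `matrix[u][v]` is a constant-time lookup, while checking an edge via `adj` requires scanning the list for u.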

Vertex Degree
The degree of a vertex corresponds to the number of edges coming out or
going into a vertex. This is defined slightly differently for directed and undirected
graphs.

Definition 7.2. In a directed graph G = (V, E), the in-degree of a vertex v ∈ V is
the number of edges coming in to it (ie, of the form (u, v), u ∈ V); the out-degree
is the number of edges going out of it (ie, of the form (v, u), u ∈ V). The degree
of v is the sum of the in-degree and the out-degree.
In an undirected graph the degree of v ∈ V is the number of edges going out
of the vertex (ie, of the form (v, u), u ∈ V), with the exception that self loops (ie,
the edge (v, v)) are counted twice.
We denote the degree of vertex v ∈ V by deg(v).

This seemingly cumbersome definition actually makes a lot of sense


pictorially: the degree of a vertex corresponds to the number of “lines” connected
to the vertex (and hence self loops in undirected graphs are counted twice).
The definition also leads to the following theorem:

Theorem 7.3. Given a (directed or undirected) graph G = (V, E), 2|E| = Σ_{v∈V} deg(v).

Proof. In a directed graph, each edge contributes once to the in-degree of some
vertex and once to the out-degree of some, possibly the same, vertex. In an undirected
graph, each non-looping edge contributes once to the degree of exactly two
vertices, and each self-loop contributes twice to the degree of one vertex. In
both cases we conclude that 2|E| = Σ_{v∈V} deg(v).

¹Since the advent of the internet, we now have graphs of unprecedented sizes (eg, the
graphs of social networks such as Facebook, or the graph of web pages). Storing and working
with these graphs is an entirely different science and a hot topic of research backed by both
academic and commercial interests.

A useful corollary is the “handshaking lemma”:²

Corollary 7.4. In a graph, the number of vertices with an odd degree is even.

Proof. Let A be the set of vertices of even degree, and B = V \ A be the set of
vertices of odd degree. Then by Theorem 7.3,

2|E| = Σ_{v∈A} deg(v) + Σ_{v∈B} deg(v)

Since the LHS and the first term of the RHS are even, we have that Σ_{v∈B} deg(v)
is even. In order for a sum of odd numbers to be even, there must be an even
number of terms.
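Both Theorem 7.3 and the corollary are easy to check computationally. The Python sketch below (on an example undirected graph of our own choosing) tallies degrees, counting a self-loop twice as in Definition 7.2:

```python
# Undirected graph given as unordered edges; a self-loop is written (v, v).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 3)]   # (3, 3) is a self-loop

deg = {}
for (u, v) in edges:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1   # a self-loop (v, v) is counted twice

# Theorem 7.3: the degrees sum to twice the number of edges.
assert sum(deg.values()) == 2 * len(edges)

# Handshaking lemma: the number of odd-degree vertices is even.
odd = [v for v in deg if deg[v] % 2 == 1]
assert len(odd) % 2 == 0
print(deg, odd)
```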

7.1 Graph Isomorphism


When do we consider two graphs “the same”? If the number of vertices or edges
differ, then clearly the graphs are different. Therefore let us focus on the case
when two graphs have the same number of vertices and edges. Consider:

G1 = (V1 = {a, b, c}, E1 = {(a, b),(b, c)})


G2 = (V2 = {1, 2, 3}, E2 = {(1, 2),(2, 3)})

The only difference between G1 and G2 are the names of the vertices; they are
clearly the same graph! On the other hand, the graphs

H1 = (V1 = {a, b, c}, E1 = {(a, b),(b, a)})


H2 = (V2 = {a, b, c}, E2 = {(a, b),(b, c)})

are clearly different (eg, in H1, there is a node without any incoming or outgoing
edges.) What about the undirected graphs shown in Figure 7.1c?
One can argue that K1 and K2 are also the same graph. One way to get K2
from K1 is to rename/permute the nodes a, b and c to b, c and a, respectively.
(Can you name another renaming scheme?)
²The name stems from the anecdote that the number of people who shake hands with an
odd number of people is even.

[Figure omitted.] Figure 7.1: Various examples of graph (non-)isomorphism.
(a) G1 and G2 are clearly isomorphic. (b) H1 and H2 are clearly not isomorphic.
(c) Are K1 and K2 isomorphic?

Definition 7.5. Two graphs G1 = (V1, E1) and G2 = (V2, E2) are
isomorphic if there exists a bijection f : V1 → V2 such that (u, v) ∈ E1 if
and only if (f(u), f(v)) ∈ E2. The bijection f is called an isomorphism from
G1 to G2, and we use the notation G2 = f(G1).

As we would expect, the definition of isomorphism, through the use of the bijective
function f, ensures, at the very least, that the two graphs have the same number of vertices
and edges. Another observation is that given an isomorphism f from G1 to G2, the inverse
function f⁻¹ is an isomorphism from G2 to G1 (we leave it to the reader to verify the details);
this makes sense since we would expect isomorphism to be symmetric.

Given two graphs, how do we check if they are isomorphic? This is a hard problem for which
no efficient algorithm is known. However, if an isomorphism f is found, it can be efficiently
stored and validated (as a proper isomorphism) by anyone. In other words, f serves as a
short and efficient proof that two graphs are isomorphic.
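Since no efficient algorithm is known, a simple (exponential-time) test just tries every bijection. The Python sketch below does exactly that; the graphs K1 and K2 are our own reconstruction of the undirected paths discussed above, not necessarily identical to Figure 7.1c:

```python
from itertools import permutations

def isomorphic(V, E1, E2):
    """Brute-force isomorphism test over a common vertex set V.

    Tries every bijection f : V -> V and checks that (u, v) is in E1
    iff (f(u), f(v)) is in E2 -- exponential in |V|, as expected."""
    E1, E2 = set(E1), set(E2)
    for perm in permutations(V):
        f = dict(zip(V, perm))
        if all(((f[u], f[v]) in E2) == ((u, v) in E1)
               for u in V for v in V):
            return f        # f is a short, efficiently checkable proof
    return None

# Undirected paths a-b-c and b-c-a, each edge stored in both directions.
K1 = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
K2 = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")]
print(isomorphic(["a", "b", "c"], K1, K2))
# {'a': 'b', 'b': 'a', 'c': 'c'} -- another valid renaming scheme
```

Note that the search finds a different renaming (swap a and b) than the one named in the text; both are valid isomorphisms.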

Can we prove that two graphs are not isomorphic in an efficient way? Sure, if the
graphs have a different number of vertices or edges. Or we may be able to find some
structure present in one graph G1 that can be checked to not be in the other graph G2; eg,
G1 contains a “triangle” (three nodes that are all connected to each other) but G2 doesn't,
or G1 has a vertex of degree 10 but G2 doesn't. Unfortunately, no general and efficient
method is known for proving that two graphs are not isomorphic. This is analogous to the
task of

Input: Graphs G0 = (V = {1, . . . , n}, E0) and G1 = (V = {1, . . . , n}, E1),
allegedly not isomorphic.

Step 1: V picks a random permutation/bijection π : {1, . . . , n} → {1, . . . , n} and a
random bit b ∈ {0, 1}, and sends H = π(Gb) to P.

Step 2: P checks if H is isomorphic to G0 or G1, and sends b′ = 0 or b′ = 1 to V,
respectively.

Step 3: V accepts (that the graphs are non-isomorphic) if and only if b′ = b.

Figure 7.2: An interactive protocol for graph non-isomorphism (the verifier should
accept when G0 and G1 are not isomorphic).

proving satisfiability of a propositional formula: if a formula is satisfiable, we can


convince others of this efficiently by presenting the satisfying assignment;
convincing others that a formula is unsatisfiable seems hard.

Interactive Proofs
In 1985, Goldwasser, Micali and Rackoff, and independently Babai, found a
workaround to prove that two graphs are not isomorphic. The magic is to add
interaction to proofs. Consider a proof system that consists of two players, a
prover P and a verifier V, where the players can communicate interactively with
each other, instead of the prover writing down a single proof.
In general the prover (who comes up with the proof) may not be efficient, but the
verifier (who checks the proof) must be. As with any proof system, we desire
completeness: on input non-isomorphic graphs, the prover P should be able to
convince V of this fact. We also require soundness, but with a slight relaxation:
on input isomorphic graphs, no matter what the prover says to V, V should reject
with very high probability. We present an interactive proof for graph non-isomorphism
in Figure 7.2.
Let us check that the interactive proof in Figure 7.2 is complete and sound.

Completeness: If the graphs are not isomorphic, then H is isomorphic to Gb, but
not to the other input graph G1−b. This allows P to determine b′ = b every
time.

Soundness: If the graphs are isomorphic, then H is isomorphic to both G0 and
G1. Therefore it is impossible to tell which input graph was used
by V to generate H; the best thing P can do is guess. With probability 1/2,
we have b′ ≠ b and the verifier rejects.

As of now, given isomorphic graphs, the verifier accepts or rejects with probability
1/2; this may not fit the description “reject with very high probability”.
Fortunately, we can amplify this probability by repeating the protocol (say) 100
times, and let the verifier accept if and only if b′ = b in all 100 repetitions.
Then by independence, the verifier would accept in the end with probability at
most 1/2^100, and reject with probability at least 1 − 1/2^100. Note that the
completeness of the protocol is unchanged even after the repetitions.
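The soundness analysis can be illustrated with a small simulation. When the graphs are isomorphic, H is distributed identically whether b = 0 or b = 1, so a cheating prover can do no better than guess b; the Python sketch below (our own illustration, not part of the text) estimates the cheating probability for 1 round and for 10 repetitions:

```python
import random

random.seed(0)

def cheating_prover_wins(repetitions):
    """When G0 and G1 are isomorphic, the best a cheating prover can do
    is guess the verifier's bit b uniformly at random in every round."""
    return all(random.randrange(2) == random.randrange(2)
               for _ in range(repetitions))

trials = 100_000
p1 = sum(cheating_prover_wins(1) for _ in range(trials)) / trials
p10 = sum(cheating_prover_wins(10) for _ in range(trials)) / trials
print(p1, p10)   # roughly 1/2 and roughly 1/2**10
```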

7.2 Paths and Cycles


Definition 7.6. A path or a walk in a graph G = (V, E) is a sequence of vertices
(v0, v1, . . . , vk) such that there exists an edge between any two consecutive
vertices, ie, (vi, vi+1) ∈ E for 0 ≤ i < k. A cycle is a walk where k ≥ 1 and v0 =
vk (ie, it starts and ends at the same vertex). The length of the walk, path or cycle
is k (ie, the number of edges).

A directed graph without cycles is called a DAG (a directed acyclic graph).
For undirected graphs, we are more interested in cycles that use each edge at
most once (otherwise an edge {u, v} would induce the “cycle” (u, v, u)). A
connected undirected graph without cycles that use each edge at most once is called a
tree. We make a few easy observations:

• On an undirected graph, every walk or path is “reversible” (ie, (vk, . . . , v0) is
also a walk/path); this is not necessarily true on a directed graph.

• We allow walks of length 0 (ie, no walking is done). However cycles must
have length at least 1, and length 1 cycles must be self loops.

• A walk can always be “trimmed” in such a way that every vertex is visited
at most once, while keeping the same starting and ending vertices.

Example 7.7. The Bacon number of an actor or actress is the length of the shortest path
from the actor or actress to Kevin Bacon on the following graph: the nodes are
actors and actresses, and edges connect people who star together in a movie.
The Erdős number is similarly defined to be the distance of a mathematician
to Paul Erdős on the co-authorship graph.

[Figure omitted.] Figure 7.3: The difference between strong and weak connectivity
in a directed graph. (a) A strongly connected graph; it would not be strongly
connected if any edge was removed. (b) A weakly (but not strongly) connected
graph; it would not be weakly connected if any edge was removed.

Connectivity
Definition 7.8. An undirected graph is connected if there exists a path between
any two nodes u, v ∈ V (note that a graph containing a single node v is
considered connected via the length 0 path (v)).

The notion of connectivity on a directed graph is more complicated, because
paths are not reversible.

Definition 7.9. A directed graph G = (V, E) is strongly connected if there exists
a path from any node u to any node v. It is called weakly connected if there
exists a path from any node u to any node v in the underlying undirected graph:
the graph G′ = (V, E′) where each edge (u, v) ∈ E in G induces an undirected
edge in G′ (ie, (u, v), (v, u) ∈ E′).

When a graph is not connected (or strongly connected), we can decompose


the graph into smaller connected components.

Definition 7.10. Given a graph G = (V, E), a subgraph of G is simply a graph
G′ = (V′, E′) with V′ ⊆ V and E′ ⊆ (V′ × V′) ∩ E; we denote subgraphs using G′ ⊆ G.

A connected component of a graph G = (V, E) is a maximal connected
subgraph. Ie, it is a subgraph H ⊆ G that is connected, and any larger subgraph
H′ (satisfying H′ ≠ H, H ⊆ H′ ⊆ G) must be disconnected.
We can similarly define a strongly connected component as a maximal
strongly connected subgraph.

[Figure omitted.] Figure 7.4: Example connected components. (a) Connected
components of the graph are circled in red; note that there are no edges between
connected components. (b) Strongly connected components of the graph are
circled in red; note that there can still be edges between strongly connected
components.

Computing Connected Components


We can visualize a connected component by starting from any node v in the
(undirected) graph, and “growing” the component as much as possible by including
any other node u that admits a path from v. Thus, first we need an
algorithm to check if there is a path from a vertex v to another vertex u ≠ v.
Breadth first search (BFS) is a basic graph search algorithm that traverses
a graph as follows: starting from a vertex v, the algorithm marks v as visited,
and traverses the neighbors of v (nodes that share an edge with v). After visiting
the neighbors of v, the algorithm recursively visits the neighbors of the
neighbors, but takes care to ignore nodes that have been visited before.
We claim that the algorithm eventually visits vertex u if and only if there is
a path from v to u.³

³We have avoided describing implementation details of BFS. It suffices to say that BFS
is very efficient, and can be implemented to run in linear time with respect to the size of the
graph. Let us also mention here that an alternative graph search algorithm, depth first search
(DFS), can also be used here (and is more efficient than BFS at computing strongly
connected components in directed graphs).

To see why, first note that if there is no path between v and u, then of
course the search algorithm will never reach u. On the other hand, assume that
there exists a path between v and u, but, for the sake of contradiction, that the
BFS algorithm does not visit u after all the reachable vertices are visited. Let w
be the first node on the path from v to u that is not visited by BFS (such a node
must exist because u is not visited). We know w ≠ v since v is visited right
away. Let w−1 be the vertex before w on the path from v to u, which must be
visited because w is the first unvisited vertex on the path.
But this gives a contradiction; after BFS visits w−1, it must also visit w since w is
an unvisited neighbor of w−1.⁴
Now let us use BFS to find the connected components of a graph. Simply
start BFS from any node v; when the graph search ends, all visited vertices form
one connected component. Repeat the BFS on remaining unvisited nodes to
recover additional connected components, until all nodes are visited.
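The procedure just described can be sketched in Python as follows (the adjacency-list example graph is our own):

```python
from collections import deque

def connected_components(adj):
    """Find connected components of an undirected graph (adjacency lists)
    by repeated breadth-first search, as described above."""
    visited, components = set(), []
    for start in adj:
        if start in visited:
            continue
        # BFS from `start`: visit neighbors, then neighbors of neighbors.
        component, queue = [], deque([start])
        visited.add(start)
        while queue:
            v = queue.popleft()
            component.append(v)
            for u in adj[v]:
                if u not in visited:
                    visited.add(u)
                    queue.append(u)
        components.append(component)
    return components

# Example: two components, {0, 1, 2} and {3, 4}.
adj = {0: [1, 2], 1: [0], 2: [0], 3: [4], 4: [3]}
print(connected_components(adj))   # [[0, 1, 2], [3, 4]]
```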

Euler and Hamiltonian Cycles


A cycle that uses every edge in a graph exactly once is called an Euler cycle.⁵ A
cycle that uses every vertex in a graph exactly once, except the starting and
ending vertex (which is visited exactly twice), is called a Hamiltonian cycle.
How can we find an Euler cycle (or determine that one does not exist)? The
following theorem cleanly characterizes when a graph has an Euler cycle.

Theorem 7.11. An undirected graph G = (V, E) has an Euler cycle if and only if G
is connected and every v ∈ V has even degree. Similarly, a directed graph G =
(V, E) has an Euler cycle if and only if G is strongly connected and every v ∈ V
has equal in-degree and out-degree.

Proof. We prove the theorem for the case of undirected graphs; it generalizes
easily to directed graphs. First observe that if G has an Euler cycle, then of course
G is connected by the cycle. Because every edge is in the cycle, and each time
the cycle visits a vertex it must enter and leave, the degree of each vertex is
even.
To show the converse, we describe an algorithm that builds the Euler
cycle, assuming connectivity and that each node has even degree. The
algorithm grows the Euler cycle in iterations. Starting from any node v, follow


4
This argument can be extended to show that in fact, BFS would traverse a shorter
path from v to u.
5Originally, Euler was searching for a continuous route that crosses all seven bridges in the
city of K¨onigsberg exactly once.

any path in the graph without reusing edges (at each node, pick some unused
edge to continue the path). We claim the path must eventually return to v; this
is because the path cannot go on forever, and cannot terminate at any other
vertex u ≠ v due to the even degree constraint: if there is an available edge into
u, there is also an available edge out of u. That is, we now have a cycle (from v
to v). If the cycle uses all edges in G then we are done.
Otherwise, find the first node on the cycle, w, that still has an unused edge;
w must exist since otherwise the cycle would be disconnected from the part of G
that still has unused edges. We repeat the algorithm starting from vertex w,
resulting in a cycle from w to w that does not have repeated edges, and does
not use edges in the cycle from v to v. We can then “stitch” these two cycles
together into a larger cycle:
together into a larger cycle:

path from v to cycle from path from w to


w in v's cycle w to w v in v's cycle
v w w v

Eventually the algorithm terminates after finitely many iterations (since we continually
use up edges in each iteration). When the algorithm terminates, there are no
more unused edges, and so we have an Euler cycle.
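The iterative cycle-growing procedure in this proof is known as Hierholzer's algorithm. Below is a Python sketch of it; the stack-based formulation is a standard way to implement the stitching, and the example graph is our own:

```python
def euler_cycle(adj):
    """Build an Euler cycle in a connected undirected graph where every
    vertex has even degree, following the stitching argument above.
    `adj` maps each vertex to a list of neighbors; each undirected edge
    appears in both lists."""
    adj = {v: list(ns) for v, ns in adj.items()}   # local, mutable copy
    start = next(iter(adj))
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:                  # follow any unused edge out of v
            u = adj[v].pop()
            adj[u].remove(v)        # use up the edge {v, u}
            stack.append(u)
        else:                       # no unused edges left at v: stitch it in
            cycle.append(stack.pop())
    return cycle

# A "bowtie": two triangles sharing vertex 0; every degree is even.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}
cycle = euler_cycle(adj)
print(cycle)   # [0, 1, 2, 0, 3, 4, 0]: each edge is used exactly once
```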

We can relax the notion of Euler cycles into Euler paths — a path that uses
every edge in the graph exactly once.

Corollary 7.12. An undirected graph G = (V, E) has an Euler path, but not an Euler
cycle, if and only if the graph is connected and exactly two nodes have an odd
degree.

Proof. Again it is easy to see that if G has an Euler path that is not a cycle, then
the graph is connected. Moreover, the starting and ending nodes of the path,
and only these two nodes, have an odd degree.
To prove the converse, we reduce the problem to finding an Euler cycle.
Let u, v ∈ V be the unique two nodes that have an odd degree. Consider
introducing an extra node w and the edges {u, w} and {v, w}. This modified graph
satisfies the requirements for having an Euler cycle! Once we find the cycle in
the modified graph, simply break the cycle at node w to get an Euler path from
u to v in the original graph.

How can we compute Hamiltonian cycles or paths? Of course we can always
do a brute force search over all possible paths and cycles. As of now, no efficient
algorithm for computing Hamiltonian cycles (or deciding whether
one exists) is known. In fact, this problem is NP-complete, ie, as hard as
deciding whether a propositional formula is satisfiable.

7.3 Graph Coloring


In this section we discuss the problem of coloring the vertices of a graph, so
that vertices sharing an edge get different colors.

Definition 7.13. A (vertex) coloring of an undirected graph G = (V, E) is a function
c : V → N (that assigns color c(v) to node v ∈ V) such that nodes that share an
edge have different colors, ie, ∀(u, v) ∈ E, c(u) ≠ c(v).

Definition 7.14. A graph is k-colorable if it can be colored with k colors, ie,
there exists a coloring c satisfying ∀v ∈ V, 0 ≤ c(v) < k. The chromatic number
χ(G) of a graph G is the smallest number such that G is χ(G)-colorable.

Here are some easy observations and special cases of graph coloring:

• A fully connected graph with n nodes (ie, every two distinct nodes share
an edge) has chromatic number n; every node must have a unique color,
and giving every node a unique color works.

• The chromatic number of a graph is bounded by the maximum degree of any
vertex plus 1 (ie, χ(G) ≤ 1 + max_{v∈V} deg(v)). With this many colors, we
can color the graph one vertex at a time, choosing the smallest color
that has not been taken by its neighbors.

• A graph is 1-colorable if and only if it does not have edges.

• It is easy to check if a graph is 2-colorable. Simply color an arbitrary vertex


v black, then color the neighbors of v white, and the neighbors of
neighbors of v black, etc. The graph is 2-colorable if and only if this
coloring scheme succeeds (ie, produces a valid coloring).
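The 2-coloring check described in the last bullet can be sketched in Python with a BFS (the example graphs are our own):

```python
from collections import deque

def two_color(adj):
    """Try to 2-color an undirected graph (adjacency lists) with the
    scheme above: color a vertex black, its neighbors white, and so on.
    Returns the coloring, or None if some edge gets one color twice."""
    color = {}
    for start in adj:                    # handle each component separately
        if start in color:
            continue
        color[start] = 0                 # 0 = black, 1 = white
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in color:
                    color[u] = 1 - color[v]
                    queue.append(u)
                elif color[u] == color[v]:
                    return None          # neighbors forced to share a color
    return color

square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}   # 4-cycle
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}            # 3-cycle
print(two_color(square))     # a valid 2-coloring
print(two_color(triangle))   # None: a triangle needs 3 colors
```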

In general, determining whether a graph is 3-colorable is NP-complete, ie, as


hard as determining whether a formula is satisfiable.

Coloring Planar Graphs. A graph is planar if all the edges can be drawn on a
plane (eg, a piece of paper) without any edges crossing. A well-known result in
mathematics states that all planar graphs are 4-colorable. Eg, the complete
graph with 5 nodes cannot be planar, since it requires 5 colors! Also
known as the “4 color map theorem”, this allows the countries (or states, provinces)
of any map to be colored with only four colors without ambiguity (no neighboring
countries will be colored the same). In general, checking whether planar graphs
are 3-colorable is still NP-complete.

Applications of Graph Coloring. In task scheduling, if we have a graph whose
nodes are tasks, and whose edges connect tasks that conflict (eg, they
both require some common resource), then given a coloring, all tasks of the
same color can be performed simultaneously. An optimal coloring would partition
the tasks into a minimal number of groups (corresponding to the chromatic
number).
To decide the number of registers needed to compute a function, consider a
graph where the nodes are variables and the edges connect variables that go
through a joint operation (ie, need to be stored in separate registers). Then
each color in a coloring corresponds to a register, and the chromatic
number is the minimum number of registers required to compute the function.
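The greedy procedure behind the 1 + max-degree bound applies directly to these scheduling and register-allocation graphs. Here is a Python sketch on a hypothetical conflict graph (the task names and conflicts are made up for illustration):

```python
def greedy_coloring(adj):
    """Color vertices one at a time with the smallest color not taken by
    an already-colored neighbor; uses at most 1 + max degree colors."""
    coloring = {}
    for v in adj:
        taken = {coloring[u] for u in adj[v] if u in coloring}
        c = 0
        while c in taken:
            c += 1
        coloring[v] = c
    return coloring

# Hypothetical conflict graph: tasks that share a resource are adjacent;
# tasks of the same color can then run simultaneously.
conflicts = {"A": ["B", "C"], "B": ["A", "C"],
             "C": ["A", "B", "D"], "D": ["C"]}
coloring = greedy_coloring(conflicts)
print(coloring)   # {'A': 0, 'B': 1, 'C': 2, 'D': 0}
```

Note that greedy coloring is only an upper bound: it need not achieve the chromatic number, and the answer can depend on the order in which vertices are visited.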

An Interactive Proof for 3-Coloring


Consider the interactive proof for graph 3-coloring presented in Figure 7.5.
Clearly the protocol in Figure 7.5 is complete; if the graph is 3-colorable, the

Input: A graph G = (V, E), supposedly 3-colorable.

Step 1: P computes a 3-coloring c : V → {0, 1, 2} of the graph, and picks a
random permutation of the three colors, π : {0, 1, 2} → {0, 1, 2}. P then
colors the graph G with the permuted colors (ie, the coloring π(c(v)))
and sends the color of each vertex to V, but covers each vertex
using a “cup”.

Step 2: V chooses a random edge (u, v) ∈ E.

Step 3: P removes the cups covering u and v to reveal their colors, π(c(u))
and π(c(v)).

Step 4: V accepts if and only if π(c(u)) ≠ π(c(v)).

Figure 7.5: An interactive protocol for graph 3-coloring (the verifier should
accept when G is 3-colorable).

prover will always convince the verifier by following the protocol. What if the
graph is not 3-colorable? With what probability can the prover cheat? The coloring π(c(v)) must be
wrong for at least one edge. Since the verifier V asks the prover P to reveal the colors along a
random edge, P will be caught with probability at least 1/|E|.

As before, even though the prover may cheat with a seemingly large probability, 1 − 1/|E|,
we can amplify the probabilities by repeating the protocol (say) 100|E| times. Due to independence,
the probability that the prover successfully cheats in all 100|E| repetitions is bounded by

(1 − 1/|E|)^(100|E|) = ((1 − 1/|E|)^|E|)^100 ≤ e^(−100)

The zero-knowledge property. It is easy to “prove” that a graph is 3-colorable: simply write down
the coloring! Why do we bother with the interactive proof in Figure 7.5? The answer is that it has
the zero-knowledge property.

Intuitively, in a zero-knowledge interactive proof, the verifier should not learn anything from the
interaction other than the fact that the statement proved is true. Eg, after the interaction, the verifier cannot better
compute a 3-coloring for the graph, or better predict the weather for tomorrow. Zero-knowledge is roughly formalized by requiring that the prover only
tells the verifier things that it already knows — that the prover's messages could
have been generated by the verifier itself.

For our 3-coloring interactive proof, it is zero-knowledge because the prover's messages consist
only of two random colors (and anyone can pick out two random colors from {0, 1, 2}).

Implementing electronic “cups”. To implement a cup, the prover P can pick
an RSA public key (N, e) and encrypt the color of each node using Padded RSA. To reveal a cup, the prover simply
provides the color and the padding (and the verifier can check the encryption). We use Padded
RSA instead of plain RSA because without the padding, the encryption of the same color would
always be the same; essentially, the encryptions themselves would give the coloring away.

7.4 Random Graphs [Optional]


A rich subject in graph theory is the study of the properties of randomly generated graphs. We give
a simple example in this section.

Consider the following random process to generate an n vertex graph: for each pair of
vertices, randomly create an edge between them with independent probability 1/2 (we will
not have self loops). What is the probability that two nodes u and v are connected with a
path of length at most 2? (This is a simple version of “six degrees of separation”.)

Taking any third node w, the probability that the path u–w–v does not exist is 3/4 (by
independence). Again by independence, ranging over all possible third nodes w, the
probability that the path u–w–v does not exist for all w ≠ u, w ≠ v is (3/4)^(n−2). Therefore, the
probability that u and v are more than distance 2 apart is at most (3/4)^(n−2).

What if we look at all pairs of nodes? By the union bound (Corollary 5.13),
the probability that some pair of nodes is more than distance 2 apart is

Pr[∃ u ≠ v such that u, v have distance > 2] ≤ Σ_{u≠v} Pr[u, v have distance > 2] ≤ (n(n − 1)/2) · (3/4)^(n−2)

This quantity decreases very quickly as the number of vertices n increases.
Therefore it is most likely that every pair of nodes is at most distance 2 apart.
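This calculation is easy to corroborate with a Monte Carlo simulation; the Python sketch below (the parameters n and the number of trials are our own choices) generates random graphs as described and checks how often every pair is within distance 2:

```python
import random

random.seed(1)

def random_graph(n):
    """G(n, 1/2): each pair of distinct vertices gets an edge
    independently with probability 1/2; no self loops."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if random.randrange(2) == 0:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def diameter_at_most_2(adj):
    """Check that every pair is adjacent or has a common neighbor."""
    n = len(adj)
    return all(v in adj[u] or adj[u] & adj[v]
               for u in range(n) for v in range(u + 1, n))

n, trials = 40, 200
hits = sum(diameter_at_most_2(random_graph(n)) for _ in range(trials))
print(hits / trials)   # close to 1, as the bound predicts for large n
```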

Chapter 8

Finite Automata

“No finite point has meaning without an infinite reference point.”

– Jean-Paul Sartre.

A finite automaton is a mathematical model for a very simple form of computing.
As such, we are most interested in what it can and cannot compute.
More precisely, a finite automaton takes as input a string of characters (taken from
some alphabet Σ, typically Σ = {0, 1}), and outputs “accept” or “reject”.
Always accepting or rejecting would be considered “easy computing”, while accepting
if and only if the input string fits a complex pattern is considered “more powerful
computing”. What power does a finite automaton hold?

8.1 Deterministic Finite Automata


Definition 8.1. A deterministic finite automaton (DFA) is a 5-tuple M = (S, Σ, f, s0, F)
where

• S is a finite set of states.

• Σ is a finite input alphabet (eg, {0, 1}).

• f is a transition function f : S × Σ → S.

• s0 ∈ S is the start state (also called the initial state).

• F ⊆ S is a set of final states (also called the accepting states).

Here is how a DFA operates on input string x. The DFA starts in state s0 (the
start state). It reads the input string x one character at a time, and transitions
into a new state by applying the transition function f to the current


state and the character read. For example, if x = x1x2⋯, the DFA would start
by transitioning through the following states:

s(0) = s0 (the starting state) → s(1) = f(s(0), x1) (after reading x1) → s(2) = f(s(1), x2) (after reading x2) → ⋯

After reading the whole input x, if the DFA ends in an accepting state s ∈ F, then x
is accepted. Otherwise x is rejected.

Definition 8.2. Given an alphabet Σ, a language L is just a set of strings over the
alphabet Σ, i.e., L ⊆ Σ*. We say a language L is accepted or recognized by a
DFA M if M accepts an input string x ∈ Σ* if and only if x ∈ L.

We can illustrate a DFA with a graph: each state s ÿ S becomes a node, and
each mapping (s, ÿ) ÿ t in the transition function becomes an edge from s to t
labeled by the character ÿ. The start state is usually represented by an extra edge
pointing to it (from empty space), while the final states are marked
with double circles.

Example 8.3. Consider the alphabet Σ = {0, 1} and the DFA M = (S, Σ, f, s0, F) defined
by

    S = {s0, s1}      F = {s0}

    f(s0, 0) = s0     f(s0, 1) = s1
    f(s1, 0) = s1     f(s1, 1) = s0

The DFA M accepts all strings that have an even number of 1s. Intuitively, state s0
corresponds to “we have seen an even number of 1s”, and state s1 corresponds
to “we have seen an odd number of 1s”. A graph of M looks like:

[Diagram: states s0 (start, accepting) and s1, each with a self-loop labeled 0, and a pair of edges labeled 1 between s0 and s1.]
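The operation of a DFA is straightforward to simulate directly from Definition 8.1. The sketch below (our own Python encoding, not part of the original text; the transition function is stored as a dictionary) implements the parity DFA of Example 8.3.

```python
def run_dfa(f, s0, final, x):
    """Run a DFA with transition function f from start state s0;
    accept iff the state reached after reading x is in `final`."""
    s = s0
    for ch in x:
        s = f[(s, ch)]
    return s in final

# Example 8.3: accept bit strings with an even number of 1s.
f = {("s0", "0"): "s0", ("s0", "1"): "s1",
     ("s1", "0"): "s1", ("s1", "1"): "s0"}

print(run_dfa(f, "s0", {"s0"}, "0110"))  # True: two 1s
```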

Automata with output. Occasionally we consider DFAs with output. We extend
the transition function to have the form f : S × Σ → S × Σ; i.e., each time the automaton
reads a character and makes a transition, it also outputs a character. Additional
extensions may allow the DFA to sometimes output a character and sometimes
not.

    s0 --reads 1^i--> s* (= s_i) --reads 1^t--> s* (= s_j) --reads 1^(c−i−t)--> s_c

    (a) State transitions of M when it accepts the string 1^c.

    s0 --reads 1^i--> s* --reads 1^t--> s* --reads 1^t--> s* --reads 1^(c−i−t)--> s_c

    (b) State transitions of M when it accepts the string 1^(c+t).

Figure 8.1: Illustration for Lemma 8.4. If a DFA M with < c states accepts the string 1^c,
then it must also accept infinitely many other strings.

Limits of Deterministic Finite Automata


Finite automata can recognize a large set of languages (see regular expressions later), but
also have some basic limitations. For example, they are not good at counting.

Lemma 8.4. Let c be a constant and L = {1^c} (the singleton language containing the string
of c many 1s). Then no DFA with < c states can accept L.

Proof. Assume for contradiction that some DFA M with < c states accepts L. Let
s0, . . . , sc be the states traversed by M to accept the string 1^c (so sc ∈ F is an
accept state). By the pigeonhole principle, some state is repeated, say
s* = s_i = s_j with j > i. Let t = j − i.

We now claim that M must also accept the strings 1^(c+t), 1^(c+2t), etc.; let us
describe the behavior of M on these inputs. Starting from s0, after reading i many 1s
of the input, M ends up in state s* = s_i. From this point on, for every t many 1s in the
input, M loops back to state s* = s_i = s_j. After sufficiently many loops, M reads the
final c − i − t many 1s, and transitions from state s* = s_j to sc, an accept state. See
Figure 8.1.

This gives a contradiction, since M accepts (infinitely) more strings than the language L.

On the other hand, see Figure 8.2 for a DFA with c + 2 states that accepts the language
{1^c}. The technique of Lemma 8.4 can be generalized to show the pumping lemma:

[Diagram: s0 --1--> s1 --1--> s2 (accepting); all 0-transitions and all transitions out of s2 lead to a dead state that loops on 0, 1.]

Figure 8.2: A DFA with 4 states that accepts the language {1^2}. This can be easily
generalized to construct a DFA with c + 2 states that accepts the language {1^c}.
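The construction behind Figure 8.2 can be written out explicitly. The following sketch (our own Python encoding) builds the (c + 2)-state DFA, with counting states 0, . . . , c plus a dead state, and checks that it accepts exactly 1^c.

```python
def dfa_ones(c):
    """Build the (c + 2)-state DFA accepting exactly {1^c}:
    counting states 0, ..., c, plus a dead state c + 1."""
    dead = c + 1
    f = {}
    for s in range(c + 2):
        f[(s, "0")] = dead                       # any 0 is fatal
        f[(s, "1")] = s + 1 if s < c else dead   # count 1s up to c
    return f, 0, {c}

def accepts(dfa, x):
    f, s, final = dfa
    for ch in x:
        s = f[(s, ch)]
    return s in final

M = dfa_ones(3)
print([accepts(M, "1" * k) for k in range(6)])  # accepted only when k == 3
```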

Lemma 8.5 (Pumping Lemma). If M is a DFA with k states and M accepts some string
x with |x| > k, then there exist strings u, v and w such that x = uvw, |uv| ≤ k, |v| ≥ 1 and
uv^i w is accepted by M for every i ∈ N.

Proof sketch. Again let s0, . . . , s_|x| be the states that M travels through to accept the
string x. By the pigeonhole principle, some state must be repeated among s0, . . . ,
sk, say s* = s_i = s_j with 0 ≤ i < j ≤ k. We can now set u to be the first i characters of x, v
to be the next j − i > 0 characters of x, and w to be the rest of x.

Example 8.6. No DFA can accept the language L = {0^n 1^n | n ∈ N} (intuitively, this is another
counting exercise). If we take any DFA with N states, and assume that it accepts the string
0^N 1^N, then the pumping lemma says that the same DFA must accept the strings
0^(N+t) 1^N, 0^(N+2t) 1^N, etc., for some 0 < t ≤ N.

Example 8.7. No DFA can accept the language L = {0^(n^2) | n ∈ N}. If we take any DFA
with N states, and assume that it accepts the string 0^(N^2), then the pumping lemma
says that the same DFA must accept the strings 0^(N^2+t), 0^(N^2+2t), etc., for some
0 < t ≤ N. (In particular, 0^(N^2+t) ∉ L because N^2 < N^2 + t < (N + 1)^2 when
0 < t ≤ N.)

A game-theory perspective. Having a computing model that cannot count may be a
good thing. Consider the repeated prisoner's dilemma from game theory. We have two
prisoners under suspicion for robbery. Each prisoner may either cooperate (C) or
defect (D) (i.e., they may keep their mouths shut, or rat each other out). The utilities of
the players (given both players' choices) are as follows (they are symmetric between
the players):

         C          D
 C    (3, 3)     (−5, 5)
 D    (5, −5)    (−3, −3)

Roughly, the utilities say the following. Both players cooperating is fine (both
prisoners get out of jail). But if one prisoner cooperates, the other should defect
(not only does the defector get out of jail, he also gets to keep the loot all to
himself, while his accomplice stays in jail for a long time). If both players
defect, then they both stay in jail.
In game theory we look for a stable state called a Nash equilibrium; we look at
a pair of strategies for the prisoners such that neither player has any incentive to
deviate. It is unfortunate (although realistic) that the only Nash equilibrium here is
for both prisoners to defect.
Now suppose we repeat this game 100 times. The total utility of a player is
Σ_{i=1}^{100} δ^i u^(i), where u^(i) is the utility of the player in round i, and 0 < δ < 1 is a discount
factor (for inflation and interest over time, etc.). Instead of prisoners, we now
have competing stores on the same street. To cooperate is to continue business
as usual, while to defect means to burn the other store down for the day.

Clearly, cooperating all the way seems best. But knowing that the first store
would cooperate all the time, the second store should defect in the last (100th)
round. Knowing this, the first store would defect the round before (the 99th round).
Continuing this argument¹, the only Nash equilibrium is again for both players
to always defect.
What happens in real life? Tit-for-tat² seems to be the most popular strategy:
cooperate or defect according to the action of the other player in the previous
round (e.g., cooperate if the other player cooperated). How can we change our
game-theoretic model to predict the use of tit-for-tat?
Suppose players use a DFA (with output) to compute their decisions; the input
is the decision of the other player in the previous round. Also assume that players
need to pay for the number of states in their DFA (intuitively, having many states
is cognitively expensive). Then tit-for-tat is a simple DFA with just one state s and
the identity transition function: f(s, C) = (s, C), f(s, D) = (s, D). Facing a player that
follows tit-for-tat, the best strategy would be to cooperate until round 99 and then
defect in round 100. But we have seen that counting with a DFA requires many
states (and therefore bears a heavy cost)! This is especially true if the game has
more rounds, or if the

¹See “Induction and Rationality” in Section 2.5.

²In a famous example in 2000, Apple removed all promotions of ATI graphics cards after
ATI prematurely leaked information on upcoming Macintosh models.

discount factor δ is harsh (i.e., δ is much smaller than 1). If we restrict ourselves to 1-state DFAs, then both
players following tit-for-tat is a Nash equilibrium.
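As an illustration of the discussion above (a Python sketch with our own encoding, not part of the original text; the payoff matrix is the one given earlier, and the discount factor 0.99 is an arbitrary choice), tit-for-tat is a one-state automaton with output, and the discounted utilities of the repeated game can be simulated directly.

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (-5, 5),
          ("D", "C"): (5, -5), ("D", "D"): (-3, -3)}

def tit_for_tat(state, opponent_last):
    # A single state; the output simply echoes the opponent's last move.
    return state, opponent_last

def play(rounds, delta=0.99):
    """Both players follow tit-for-tat; by convention, both open with C."""
    move1, move2 = "C", "C"
    u1 = u2 = 0.0
    for i in range(1, rounds + 1):
        p1, p2 = PAYOFF[(move1, move2)]
        u1 += delta ** i * p1          # discounted utility, as in the text
        u2 += delta ** i * p2
        _, next1 = tit_for_tat(None, move2)
        _, next2 = tit_for_tat(None, move1)
        move1, move2 = next1, next2
    return u1, u2

u1, u2 = play(100)
print(u1 == u2 and u1 > 0)  # mutual cooperation throughout
```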

8.2 Non-Deterministic Finite Automata


A non-deterministic finite automaton (NFA) is really just a DFA except that, for each
character in the alphabet, there may be several edges going out of each state. In
other words, after reading a character from the input string, an NFA may have a choice
of several states to transition into. We formalize this by allowing the transition function to
output a set of possible new states.

Definition 8.8. A non-deterministic finite automaton (NFA) is a 5-tuple M = (S, Σ, f, s0,
F) where

• S is a finite set of states.

• Σ is a finite input alphabet.

• f is a transition function f : S × Σ → P(S).

• s0 ∈ S is the start state (also called the initial state).

• F ⊆ S is a set of final states (also called the accepting states).

An NFA M is said to accept an input string x if it is possible to transition from the
start state s0 to some final state s ∈ F. More formally, M is said to accept x if there
exists a sequence of states s^(0), s^(1), . . . , s^(|x|) such that s^(0) = s0,
s^(|x|) ∈ F, and for each i ∈ {0, . . . , |x| − 1}, s^(i+1) ∈ f(s^(i), x_i). As before, we say M
accepts (or recognizes) a language L if for all inputs x, M accepts x if and only if x ∈ L.
Note that just as it is possible for a state to have multiple possible transitions after
reading a character, a state may have no possible transitions. An input that simply
does not have a sequence of valid state transitions (ignoring final states altogether)
is of course rejected.
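One convenient way to simulate an NFA is to track the set of states it could possibly be in after each character. The sketch below (our own Python encoding, not part of the original text) does exactly that, using as a test case an NFA that accepts strings whose second-to-last bit is a 1.

```python
def run_nfa(f, s0, final, x):
    """Track the set of states the NFA could be in after each character;
    accept iff some reachable state is final. Here f maps
    (state, char) -> set of states; a missing key means no transition."""
    cur = {s0}
    for ch in x:
        cur = set().union(*(f.get((s, ch), set()) for s in cur))
    return bool(cur & final)

# NFA for "second bit from the end is a 1": state 0 loops on 0 and 1,
# may guess "1" and move to state 1; state 2 is the only final state.
f = {(0, "0"): {0}, (0, "1"): {0, 1},
     (1, "0"): {2}, (1, "1"): {2}}

print(run_nfa(f, 0, {2}, "0110"))  # True: the second-to-last bit is 1
```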
Note that an NFA is not a realistic “physical” model of computation:
at any point in the computation where there are multiple possible states to transition
into, it is hard to find locally the “correct” transition. An alternative model of
computation is a randomized finite automaton (RFA). An RFA is much like an NFA,
with the additional property that whenever there is a choice of transitions, the RFA
specifies the probability with which the automaton transitions to each of the
allowed states. Correspondingly, an RFA does not simply accept or reject an input x, but
instead accepts each input with some probability. Compared to an NFA, an RFA is a
more realistic “physical” model of computation.

[Diagram: state s0 has a self-loop labeled 0, 1; transitions s0 --1--> s1 --0,1--> s2 --0,1--> s3 (accepting); the remaining transitions lead to a fifth garbage state that loops on 0, 1.]

Figure 8.3: An NFA with 5 states that accepts the language L3. Intuitively, given a
string x ∈ L3, the NFA would choose to remain in state s0 until it reads the third-to-last
character; it would then (magically decide to) transition to state s1, read the
final two characters (transitioning to s2 and s3), and accept.
The converse, that any x accepted by the NFA must be in the language L3, is
easy to see. This can be easily generalized to construct an NFA with n + 2 states
that accepts the language Ln.

Example 8.9. Consider the language Ln = {x ∈ {0, 1}* | |x| ≥ n, x_(|x|−n) = 1} (i.e., the
language of bit strings where the n-th bit counting from the end is a 1). Ln can be
recognized by an O(n)-state NFA, as illustrated in Figure 8.3.

On the other hand, any DFA that recognizes Ln must have at least 2^n states.
(Can you construct a DFA for recognizing this language?) Let M be a DFA with
less than 2^n states. By the pigeonhole principle, there exist two n-bit strings x and
x' such that M reaches the same state s after reading x or x' as input (because
there are a total of 2^n n-bit strings). Let x and x' differ in position i counting from the
end (1 ≤ i ≤ n), and without loss of generality assume that x_i = 1 and x'_i = 0. Now
consider the strings x̂ = x1^(n−i) and x̂' = x'1^(n−i) (i.e., appending the string of n − i
many 1s). M reaches the same state after reading x̂ or x̂' (since it reached the same
state after reading x or x'), and so M must either accept both strings or reject both
strings. Yet x̂ ∈ Ln and x̂' ∉ Ln, i.e., M does not recognize the language Ln.

Clearly, any language recognized by a DFA can be recognized by an NFA,
since any DFA is an NFA. The following result shows that the converse is true too:
any NFA can be converted into a DFA that recognizes the same language. The
conversion described below causes an exponential blow-up in the number of
states of the automaton; as seen in Example 8.9, such a blow-up is sometimes
necessary.

Theorem 8.10. Let M be an NFA and L be the language recognized by M.
Then there exists a DFA M' that recognizes the same language L.

Proof. Let M = (S, Σ, f, s0, F) be an NFA. We construct a DFA M' = (T, Σ, f', t0, F')
as follows:

• Let T = P(S); that is, each state t of M' corresponds to a subset of the
states of M.

• Upon reading the character σ ∈ Σ, we transition from state t ∈ P(S) to the state
corresponding to the union of all the possible states that M could have transitioned
into, if M is currently in any state s ∈ t. More formally, let

    f'(t, σ) = ∪_{s ∈ t} f(s, σ)

• The start state of M' is the singleton state containing the start state of
M, i.e., t0 = {s0}.

• The final states of M' are the states that contain a final state of M, i.e., F' = {t ∈ T = P(S) | t
∩ F ≠ ∅}.

Intuitively, after reading any (partial) string, the DFA M' keeps track of all possible states
that M may be in.

We now show that the DFA M' accepts an input x if and only if the NFA M accepts x.
Assume that M accepts x; that is, there exists some computation path
s^(0), s^(1), . . . , s^(|x|) such that s^(0) = s0, s^(|x|) ∈ F, and s^(i+1) ∈ f(s^(i), x_i).
Consider the (deterministic) computation path t^(0), . . . , t^(|x|) of M' on input x.
It can be shown inductively that s^(i) ∈ t^(i) for all 0 ≤ i ≤ |x|:

Base case. s^(0) ∈ t^(0) since t^(0) = {s^(0)} by definition.

Inductive step. If s^(i) ∈ t^(i), then because s^(i+1) ∈ f(s^(i), x_i), we also have

    s^(i+1) ∈ t^(i+1) = ∪_{s ∈ t^(i)} f(s, x_i)

We conclude that s^(|x|) ∈ t^(|x|). Since s^(|x|) ∈ F, we have t^(|x|) ∈ F' and so M' would
accept x.

For the converse direction, assume M' accepts x. Let t^(0) = t0, t^(1), . . . , t^(|x|) be the
deterministic computation path of M', with t^(|x|) ∈ F'. From this we can
inductively define an accepting sequence of state transitions for M on input x, starting from
the final state and working backwards.

Base case. Because t^(|x|) ∈ F', there exists some s^(|x|) ∈ t^(|x|) such that s^(|x|) ∈ F.

Inductive step. Given some s^(i+1) ∈ t^(i+1), there must exist some s^(i) ∈ t^(i)
such that s^(i+1) ∈ f(s^(i), x_i) (in order for t^(i) to transition to t^(i+1)).

It is easy to see that the sequence s^(0), . . . , s^(|x|) inductively defined above is a
valid, accepting sequence of state transitions for M on input x: s^(0) = s0
since t^(0) = t0 = {s0}, s^(|x|) ∈ F by the base case of the definition, and the
transitions are valid by the inductive step of the definition. Therefore M accepts x.
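The construction in the proof is mechanical enough to run in code. The sketch below (our own Python, not part of the original text; it generates only the subsets reachable from {s0}, rather than all of P(S)) converts an NFA into an equivalent DFA.

```python
def nfa_to_dfa(f, s0, final, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states.
    Only the subsets reachable from {s0} are generated."""
    start = frozenset({s0})
    dfa, todo = {}, [start]
    while todo:
        t = todo.pop()
        if t in dfa:
            continue
        dfa[t] = {}
        for ch in alphabet:
            # Union of all NFA states reachable from any s in t on ch.
            u = frozenset().union(*(f.get((s, ch), set()) for s in t))
            dfa[t][ch] = u
            todo.append(u)
    dfa_final = {t for t in dfa if t & final}   # t contains a final NFA state
    return dfa, start, dfa_final

def run_dfa(dfa, start, dfa_final, x):
    t = start
    for ch in x:
        t = dfa[t][ch]
    return t in dfa_final

# Determinize the NFA for "second bit from the end is a 1".
f = {(0, "0"): {0}, (0, "1"): {0, 1}, (1, "0"): {2}, (1, "1"): {2}}
D = nfa_to_dfa(f, 0, {2}, "01")
print(run_dfa(*D, "0110"), len(D[0]))
```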

8.3 Regular Expressions and Kleene's Theorem


Regular expressions provide an “algebraic” way of specifying a language:
namely, we start off with some basic alphabet and build up the language using
a fixed set of operations.

Definition 8.11. The set of regular expressions over alphabet Σ is defined
inductively as follows:

• the symbols “∅” and “ε” are regular expressions.

• the symbol “x” is a regular expression if x ∈ Σ (we use boldface to
distinguish the symbol x from the element x ∈ Σ).

• if A and B are regular expressions, then so are A|B (their alternation), AB
(their concatenation), and A* (the Kleene star of A).

Usually the Kleene star takes precedence over concatenation, which takes
precedence over alternation. In more complex expressions, we use parentheses
to disambiguate the order of operations between concatenations, alternations
and Kleene stars. Examples of regular expressions over the lower-case letters
include: ab|c*, (a|b)(c|ε), and ∅. A common extension of regular expressions is
the “+” operator; A+ is interpreted as syntactic sugar (a shortcut) for AA*.

As of now a regular expression is just a syntactic object; it is just a
sequence of symbols. Next we describe how to interpret these symbols to
specify a language.

Definition 8.12. Given a regular expression E over alphabet Σ, we inductively
define L(E), the language specified by E, as follows:

• L(∅) = ∅ (i.e., the empty set).

• L(ε) = {ε} (i.e., the set consisting only of the empty string).

• L(x) = {x} (i.e., the singleton set consisting only of the one-character
string “x”).

• L(AB) = L(A)L(B) = {ab | a ∈ L(A), b ∈ L(B)}.

• L(A|B) = L(A) ∪ L(B).

• L(A*) = L(A)* = {ε} ∪ {a1a2 · · · an | n ∈ N+, ai ∈ L(A)} (note that
this is a natural extension of the star notation defined in Definition 1.11).

Example 8.13. The parity language consisting of all strings with an even number
of 1s can be specified by the regular expression 0*(10*10*)*. The language
consisting of all finite strings {0, 1}* can be specified either by (0|1)* or by
(0*1*)*.
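Python's re module uses essentially the same operators (*, |, concatenation), so such expressions can be sanity-checked mechanically. The sketch below (our own test harness, not part of the original text) compares the even-parity expression 0*(10*10*)* against a direct count of 1s over all short bit strings.

```python
import re
from itertools import product

parity = re.compile(r"0*(10*10*)*")

def matches(s):
    # fullmatch: the whole string must be in the specified language.
    return parity.fullmatch(s) is not None

# Compare against a direct parity count over all strings of length <= 8.
ok = all(matches("".join(b)) == (b.count("1") % 2 == 0)
         for n in range(9) for b in product("01", repeat=n))
print(ok)  # True
```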

Languages that can be expressed as a regular expression are called regular.
The class of regular languages is used in a wide variety of applications, such as
pattern matching and syntax specification.

Definition 8.14. A language L over alphabet Σ is regular if there exists a regular
expression E (over Σ) such that L = L(E).

Kleene's Theorem [Optional]


It turns out that DFAs, and thus also NFAs, recognize exactly the class of regular
languages.

Theorem 8.15 (Kleene). A language is regular if and only if it is recognized by a


DFA M.

We can prove Kleene's Theorem constructively. That is, given any DFA, we
can generate an equivalent regular expression that describes the language
recognized by the DFA, and vice versa. We omit the formal proof of Kleene's
Theorem; in the rest of this section, we give an outline of how a regular expression
can be transformed into an NFA (which can then be transformed into a DFA).

Converting Regular Expressions to NFAs

We sketch how any regular language can be recognized by an NFA; since NFAs
are equivalent to DFAs, this means any regular language can be recognized by
a DFA as well. The proof proceeds by induction over regular expressions.

Base case: It is easy to show that the languages specified by the regular
expressions ∅, ε, and x for x ∈ Σ can be recognized by an NFA (also see Figure
8.4):

• L(∅) is recognized by an NFA/DFA without any final states.

• L(ε) is recognized by an NFA where the start state is also a final state, and
has no outgoing transitions.

• L(x) can be recognized by an NFA where the start state transitions to a
final state on input x, and has no other transitions.

Figure 8.4: NFAs that recognize the languages ∅, ε and x, respectively.

Inductive step: Next we show that regular languages specified by regular
expressions of the form AB, A|B and A* can be recognized by NFAs.

Case AB: Let the languages L(A) and L(B) be recognized by NFAs MA and MB,
respectively. Recall that L(AB) contains strings that can be divided into two parts
such that the first part is in L(A), recognized by MA, and the second part is in L(B),
recognized by MB. Hence intuitively, we need a combined NFA MAB that contains
the NFA MA followed by the NFA MB, sequentially. To do so, let us “link” the final
states of MA to the starting state of MB. One way to proceed is to modify all the
final states in MA, by adding to them all outgoing transitions leaving the start state
of MB (ie, each final state in MA can now function as the start state of MB and
transition to appropriate states in MB.)

The start state of MAB is the start state of MA. The final states of MAB are the final
states of MB; furthermore, if the start state of MB is final, then all of the final states
in MA are also final in MAB (because we want the final states of MA to be “linked” to
the start state of MB).
We leave it to the readers to check that this combined NFA MAB does indeed
accept strings in L(AB), and only strings in L(AB).

Before we proceed to the other cases, let us abstract the notion of a “link” from above. The
goal of a “link” from state s to t is to allow the NFA to (non-deterministically) transition
from state s to state t, without reading any input.
We have implemented this “link” above by adding the outgoing transitions of t to s (i.e., this
simulates the case where the NFA nondeterministically transitions to state t and then
follows one of t's outgoing transitions). We also make s a final state if t is a final state (i.e.,
this simulates the case where the NFA nondeterministically transitions to state t and then
halts and accepts).

Case A|B: Again, let the languages L(A) and L(B) be recognized by NFAs MA and MB,
respectively. This time, we construct a machine MA|B that contains MA, MB and a
brand new start state s0, and add “links”

from s0 to the start states of MA and MB. Intuitively, at state s0, the
machine MA|B must nondeterministically decide whether to accept
the input string as a member of L(A) or as a member of L(B). The
start state of MA|B is the new state s0, and the final states of MA|B
are all the final states of MA and MB.³

Case A*: Let the language L(A) be recognized by NFA MA. Consider the
NFA MA+ that is simply the NFA MA but with “links” from its final
states back to its start state (note that we have not constructed the
machine MA* yet; in order for a string to be accepted by MA+ as
described above, the string must be accepted by the machine MA at
least once). We can then construct the machine MA* by using the
fact that the regular expressions ε|A+ and A* are equivalent.

ε-transitions. The notion of a “link” above can be formalized as a special
type of transition, called an ε-transition. It can be shown that NFAs have the
same power (can accept exactly the same languages) whether or not they
have access to ε-transitions.
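The inductive cases above map directly onto code once “links” are represented as ε-transitions. The sketch below (our own Python encoding, not part of the original text; a fragment is a (start, accept) pair of states) mirrors the base case and the three inductive cases, and decides acceptance by following ε-links with a closure computation.

```python
class NFA:
    """NFA with epsilon-transitions, built up fragment by fragment."""
    def __init__(self):
        self.eps, self.trans, self.n = {}, {}, 0
    def state(self):
        s = self.n; self.n += 1
        self.eps[s] = set()
        return s
    def link(self, s, t):                 # an epsilon-transition ("link")
        self.eps[s].add(t)
    def edge(self, s, ch, t):
        self.trans.setdefault((s, ch), set()).add(t)

def lit(m, ch):                           # base case: L(x)
    s, t = m.state(), m.state()
    m.edge(s, ch, t)
    return s, t

def concat(m, a, b):                      # case AB: link a's accept to b's start
    m.link(a[1], b[0])
    return a[0], b[1]

def alt(m, a, b):                         # case A|B: fresh start linked to both
    s, t = m.state(), m.state()
    m.link(s, a[0]); m.link(s, b[0])
    m.link(a[1], t); m.link(b[1], t)
    return s, t

def star(m, a):                           # case A*: loop through a fresh state
    s = m.state()
    m.link(s, a[0]); m.link(a[1], s)
    return s, s

def closure(m, states):                   # follow epsilon-links exhaustively
    todo, seen = list(states), set(states)
    while todo:
        s = todo.pop()
        for t in m.eps[s] - seen:
            seen.add(t); todo.append(t)
    return seen

def accepts(m, frag, x):
    cur = closure(m, {frag[0]})
    for ch in x:
        cur = closure(m, set().union(*(m.trans.get((s, ch), set()) for s in cur)))
    return frag[1] in cur

m = NFA()
frag = concat(m, alt(m, lit(m, "a"), lit(m, "b")), star(m, lit(m, "c")))
print([accepts(m, frag, w) for w in ["a", "bcc", "c", ""]])  # the NFA for (a|b)c*
```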

³A tempting solution is to instead “merge” the start states of MA and MB (combining
their incoming and outgoing transitions), and make the newly merged state the start state
of MA|B. This does not work, because such a merger does not force the NFA MA|B to
“choose” between MA and MB; on some input, MA|B may transition between the states of
MA and MB multiple times via the merged state. Can you come up with a counterexample?

Appendix A

Problem Sets

A.1 Problem Set A


Problem 1 [6 points]
Let a0 = −1, a1 = 0, and for n ≥ 2, let an = 3a_(n−1) − 2a_(n−2).
Write a closed-form expression for an (for n ≥ 0), and prove using strong
induction that your expression is correct. (You must use strong induction for this
problem. Some clarification: “closed-form expression for an” simply means an
expression for an that depends only on n (and other constants) and is not defined
recursively, i.e., does not depend on a_(n−1) or other previous terms of the sequence.)

Problem 2 [5 points]
Compute (3^8002 · 7^201) mod 55. Show your work.

Problem 3 [2 + 4 = 6 points]
Let p ≥ 3 be any prime number. Let c, a ∈ {1, . . . , p − 1} such that a is a solution to
the equation x^2 ≡ c (mod p), i.e., a^2 ≡ c (mod p).

(a) Show that p − a is also a solution to the equation x^2 ≡ c (mod p), i.e.,
(p − a)^2 ≡ c (mod p).

(b) Show that a and p − a are the only solutions to the equation x^2 ≡ c (mod p)
modulo p, i.e., if b ∈ Z satisfies b^2 ≡ c (mod p), then b ≡ a (mod p) or b ≡ p − a
(mod p).


Problem 4 [4 points]
How many solutions are there to the equation a + b + c + d = 30, if a, b, c, d ∈ N?
(N includes the number 0. You do not need to simplify your answer.)

Problem 5 [2 + 2 + 4 = 8 points]
Let n be a positive even integer.

(a) How many functions f : {0, 1}^n → {0, 1}^n are there that do not map any element
to itself (i.e., f satisfies f(x) ≠ x for all x ∈ {0, 1}^n)?

(b) Given a string x ∈ {0, 1}^n, let x^rev denote the string in {0, 1}^n obtained from x by
reversing the ordering of the bits of x (e.g., the first bit of x^rev is the last bit of x, etc.). How
many strings x ∈ {0, 1}^n satisfy x^rev = x? Justify your answer.

(c) How many functions f : {0, 1}^n → {0, 1}^n are there that satisfy f(x) ≠ x and f(x) ≠ x^rev
for all x ∈ {0, 1}^n? Justify your answer.

Problem 6 [6 points]
Let n, r, k ∈ N+ such that k ≤ r ≤ n. Show that

    C(n, r) · C(r, k) = C(n, k) · C(n − k, r − k)

by using a combinatorial argument (i.e., show that both sides of the equation count the
same thing). (Here C(n, r) denotes the binomial coefficient “n choose r”.)

Problem 7 [3 + 3 = 6 points]
A certain candy similar to Skittles is manufactured with the following properties: 30% of the manufactured candy pieces are sweet, while 70% of the pieces
the manufactured candy pieces are sweet, while 70% of the pieces are sour. Each candy
piece is colored either red or blue (but not both). If a candy piece is sweet, then it is
colored blue with 80% probability (and colored red with 20% probability), and if a piece is
sour, then it is colored red with 80% probability. The candy pieces are mixed together
randomly before they are sold. You bought a jar containing such candy pieces.

(a) If you choose a piece at random from the jar, what is the probability that you choose
a blue piece? Show your work. (You do not need to simplify your answer.)

(b) Given that the piece you chose is blue, what is the probability that the piece is sour?
Show your work. (You do not need to simplify your answer.)

Problem 8 [3 + 3 = 6 points]
A literal is an atom (i.e., an atomic proposition) or the negation of an atom (e.g., if
P is an atom, then P is a literal, and so is ¬P). A clause is a formula of the form
l_i ∨ l_j ∨ l_k, where l_i, l_j, l_k are literals and no atom occurs in l_i ∨ l_j ∨ l_k more
than once (e.g., P ∨ ¬Q ∨ ¬P is not allowed, since the atom P occurs in P ∨ ¬Q ∨ ¬P
more than once).
Examples: P1 ∨ ¬P2 ∨ ¬P4 is a clause, and so is P2 ∨ ¬P4 ∨ ¬P5.

(a) Let C be a clause. If we choose a uniformly random interpretation by
assigning to each atom independently True with probability 1/2 and False with
probability 1/2, what is the probability that C evaluates to True under
the chosen interpretation? Justify your answer.

(b) Let {C1, C2, . . . , Cn} be a collection of n clauses. If we choose a uniformly random
interpretation as in part (a), what is the expected number of clauses in {C1, C2, . . . , Cn}
that evaluate to True under the chosen interpretation? Justify your answer.

Problem 9 [3 + 3 = 6 points]
Consider the formula (P ÿ ¬Q) ÿ (¬P ÿ Q), where P and Q are atoms.

(a) Is the formula valid? Justify your answer.

(b) Is the formula satisfiable? Justify your answer.

Problem 10 [6 points]
Let G be an undirected graph, possibly with self-loops. Suppose we have the
predicate symbols Equals(·, ·), IsVertex(·), and Edge(·, ·).
Let D be some domain that contains the set of vertices of G (D might contain
other elements as well). Let I be some interpretation that specifies functions for
Equals(·, ·), IsVertex(·), and Edge(·, ·), so that for all u, v ∈ D, we have Equals[I](u, v) = T
(True) if and only if u = v, IsVertex[I](u) = T if and only if u is a vertex
of G, and if u and v are both vertices of G, then Edge[I](u, v) = T if and only if
{u, v} is an edge of G (we do not know the value of Edge[I](u, v) when u or v is
not a vertex of G).
Write a formula in first-order logic that captures the statement “the graph G
does not contain a triangle”, i.e., the formula is T (True) under (D, I) if and only if
the graph G does not contain a triangle (a triangle is 3 distinct vertices all of
which are connected to one another via an edge).

Problem 11 [6 points]
Suppose we have an undirected graph G such that the degree of each vertex is
a multiple of 10 or 15. Show that the number of edges in G must be a multiple
of 5.

Problem 12 [3 + 2 + 5 = 10 points]
Consider the following non-deterministic finite automaton (NFA):

[State diagram of an NFA with states u, v, w, x, y, z over the alphabet {a, b}; the exact transition labels are not recoverable from the source.]

(a) Write a regular expression that defines the language recognized by the
above NFA.

(b) Construct (by drawing a state diagram) the smallest deterministic finite
automaton (DFA) that recognizes the same language as the above NFA
(smallest in terms of the number of states).

(c) Prove that your DFA from part (b) is indeed the smallest DFA that recognizes
the same language as the above NFA (smallest in terms of the number of
states).

Problem 13 [6 points]
Let S1, S2, S3, S4, . . . be an infinite sequence of countable sets. Show that
∪_{n=1}^∞ Sn is countable. (∪_{n=1}^∞ Sn is defined by ∪_{n=1}^∞ Sn := {x | x ∈ Sn for
some n ∈ N+}.)

Appendix B

Solutions to Problem Sets

B.1 Problem Set A

Problem 1 [6 points]

Let a0 = −1, a1 = 0, and for n ≥ 2, let an = 3a_(n−1) − 2a_(n−2).

Write a closed-form expression for an (for n ≥ 0), and prove using strong
induction that your expression is correct. (You must use strong induction for this
problem. Some clarification: “closed-form expression for an” simply means an
expression for an that depends only on n (and other constants) and is not
defined recursively, i.e., does not depend on a_(n−1) or other previous terms of the
sequence.)

Solution: an = 2^n − 2 for every integer n ≥ 0.

Base cases: For n = 0, we have an = a0 = −1 (by definition) and 2^n − 2 = 2^0 −
2 = −1, so an = 2^n − 2, as required. For n = 1, we have an = a1 = 0 (by
definition) and 2^n − 2 = 2^1 − 2 = 0, so an = 2^n − 2, as required.

Induction step: Let n ≥ 1, and suppose that ak = 2^k − 2 for k = 0, . . . , n. Then,

    a_(n+1) = 3an − 2a_(n−1) = 3(2^n − 2) − 2(2^(n−1) − 2) = 3(2^n) − 6 − 2^n + 4
            = 2(2^n) − 2 = 2^(n+1) − 2,

where the second equality follows from the induction hypothesis.

Thus, an = 2^n − 2 for every integer n ≥ 0.
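The closed form can also be double-checked numerically (our own Python sketch, not part of the original solution):

```python
def a_rec(n, memo={0: -1, 1: 0}):
    # a_0 = -1, a_1 = 0, a_n = 3*a_{n-1} - 2*a_{n-2}
    if n not in memo:
        memo[n] = 3 * a_rec(n - 1) - 2 * a_rec(n - 2)
    return memo[n]

print(all(a_rec(n) == 2 ** n - 2 for n in range(20)))  # True
```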


Problem 2 [5 points]
Compute (3^8002 · 7^201) mod 55. Show your work.

Solution:

Note that 55 = 5 · 11, so we have φ(55) = (4)(10) = 40. Thus, we have

    (3^8002 · 7^201) mod 55 = ((3^8002 mod 55) · (7^201 mod 55)) mod 55
        = ((3^(8002 mod φ(55)) mod 55) · (7^(201 mod φ(55)) mod 55)) mod 55
        = ((3^(8002 mod 40) mod 55) · (7^(201 mod 40) mod 55)) mod 55
        = ((3^2 mod 55) · (7^1 mod 55)) mod 55
        = (9 · 7) mod 55 = 63 mod 55 = 8,

where we have used Euler's theorem in the second equality.
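Python's three-argument pow performs modular exponentiation directly, so the computation can be verified mechanically (our own sketch, not part of the original solution):

```python
# Direct modular exponentiation agrees with the Euler-theorem shortcut.
direct = (pow(3, 8002, 55) * pow(7, 201, 55)) % 55
shortcut = (pow(3, 8002 % 40, 55) * pow(7, 201 % 40, 55)) % 55
print(direct, shortcut)  # 8 8
```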

Problem 3 [2 + 4 = 6 points]
Let p ≥ 3 be any prime number. Let c, a ∈ {1, . . . , p − 1} such that a is a
solution to the equation x^2 ≡ c (mod p), i.e., a^2 ≡ c (mod p).

(a) Show that p − a is also a solution to the equation x^2 ≡ c (mod p), i.e.,
(p − a)^2 ≡ c (mod p).

(b) Show that a and p − a are the only solutions to the equation x^2 ≡ c
(mod p) modulo p, i.e., if b ∈ Z satisfies b^2 ≡ c (mod p), then b ≡ a
(mod p) or b ≡ p − a (mod p).

Solution:

(a) Observe that (p − a)^2 = p^2 − 2pa + a^2 ≡ a^2 ≡ c (mod p).

(b) Assume b ∈ Z satisfies b^2 ≡ c (mod p). Then, b^2 ≡ a^2 (mod p), so
b^2 − a^2 ≡ 0 (mod p), so (b − a)(b + a) ≡ 0 (mod p), so p | (b − a)(b + a). Since
p is prime, we must have p | (b − a) or p | (b + a). The former case implies that
b ≡ a (mod p), and the latter case implies that b ≡ −a (mod p), so b ≡ p − a
(mod p), as required.
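A brute-force check over a small prime (our own sketch, not part of the original solution; p = 13 is an arbitrary choice) confirms that every quadratic residue has exactly the two square roots a and p − a:

```python
p = 13
roots = {c: [b for b in range(1, p) if b * b % p == c] for c in range(1, p)}

# Every quadratic residue c has exactly the two roots a and p - a
# (which sum to p); non-residues have no roots at all.
print(all(r == [] or (len(r) == 2 and sum(r) == p) for r in roots.values()))
```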

Problem 4 [4 points]
How many solutions are there to the equation a + b + c + d = 30, if a, b, c, d ∈ N?
(N includes the number 0. You do not need to simplify your answer.)

Solution:

C(33, 3), since there are 4 distinguishable urns (a, b, c, and d) and 30 indistinguishable balls. (See the lecture notes.)
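The stars-and-bars answer C(33, 3) agrees with a brute-force count (our own sketch, not part of the original solution):

```python
from math import comb

# Count (a, b, c) with a + b + c <= 30; then d = 30 - a - b - c is determined.
count = sum(1 for a in range(31)
              for b in range(31 - a)
              for c in range(31 - a - b))

print(count, comb(33, 3))  # 5456 5456
```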

Problem 5 [2 + 2 + 4 = 8 points]
Let n be a positive even integer.

(a) How many functions f : {0, 1}^n → {0, 1}^n are there that do not map any element
to itself (i.e., f satisfies f(x) ≠ x for all x ∈ {0, 1}^n)?

(b) Given a string x ∈ {0, 1}^n, let x^rev denote the string in {0, 1}^n obtained from x by
reversing the ordering of the bits of x (e.g., the first bit of x^rev is the last bit of x,
etc.). How many strings x ∈ {0, 1}^n satisfy x^rev = x?
Justify your answer.

(c) How many functions f : {0, 1}^n → {0, 1}^n are there that satisfy f(x) ≠ x and f(x) ≠ x^rev
for all x ∈ {0, 1}^n? Justify your answer.

Solution:

(a) (2^n − 1)^(2^n), since there are 2^n elements in the domain {0, 1}^n and, for each
of these elements, f can map the element to any element in {0, 1}^n except for itself,
so there are 2^n − 1 choices for the element.

(b) 2^(n/2), since to construct a string x ∈ {0, 1}^n such that x^rev = x, there are
2 choices (either 0 or 1) for each of the first n/2 bits (the first half of the n-bit
string), and then the second half is fully determined by the first half.

(c) Consider constructing a function f : {0, 1}^n → {0, 1}^n such that f(x) ≠ x and
f(x) ≠ x^rev for all x ∈ {0, 1}^n. For each x ∈ {0, 1}^n such that x^rev = x, there
are 2^n − 1 choices for f(x). For each x ∈ {0, 1}^n such that x^rev ≠ x, there are
2^n − 2 choices for f(x). Since there are 2^(n/2) strings x ∈ {0, 1}^n such that
x^rev = x, and since there are 2^n − 2^(n/2) strings x ∈ {0, 1}^n such that x^rev ≠ x,
the number of functions f : {0, 1}^n → {0, 1}^n such that f(x) ≠ x and f(x) ≠ x^rev
for all x ∈ {0, 1}^n is (2^n − 1)^(2^(n/2)) · (2^n − 2)^(2^n − 2^(n/2)).
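The three counts can be verified by exhaustive enumeration for the smallest even n; the sketch below (an illustrative addition) takes n = 2:

```python
from itertools import product

# Brute-force check of parts (a)-(c) for n = 2, the smallest even n.
n = 2
strings = [''.join(bits) for bits in product('01', repeat=n)]
N = len(strings)  # 2^n = 4

# (a) functions with f(x) != x for every x: (2^n - 1)^(2^n) = 3^4 = 81
count_a = sum(1 for f in product(strings, repeat=N)
              if all(f[i] != x for i, x in enumerate(strings)))
assert count_a == (N - 1) ** N

# (b) palindromic strings (x^rev = x): 2^(n/2) = 2
count_b = sum(1 for x in strings if x == x[::-1])
assert count_b == 2 ** (n // 2)

# (c) functions with f(x) != x and f(x) != x^rev for every x:
# (2^n - 1)^(2^(n/2)) * (2^n - 2)^(2^n - 2^(n/2)) = 3^2 * 2^2 = 36
count_c = sum(1 for f in product(strings, repeat=N)
              if all(f[i] != x and f[i] != x[::-1] for i, x in enumerate(strings)))
assert count_c == (N - 1) ** count_b * (N - 2) ** (N - count_b)
```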

Problem 6 [6 points]
Let n, r, k ∈ N+ such that k ≤ r ≤ n. Show that

    C(n, r) · C(r, k) = C(n, k) · C(n − k, r − k)

by using a combinatorial argument (i.e., show that both sides of the equation
count the same thing).



Solution:

Suppose there are n people, and we want to choose r of them to serve on a committee,
and out of the r people on the committee, we want to choose k of them to be responsible
for task A (where task A is some task). The LHS of the equation, C(n, r) · C(r, k),
counts precisely the number of possible ways to do the choosing above.

Another way to count this is the following: Out of the n people, choose k of them to
be part of the committee and be the ones responsible for task A; however, we want
exactly r people to serve on the committee, so we need to choose r − k more people
out of the remaining n − k people left to choose from.
Thus, there are C(n, k) · C(n − k, r − k) possible ways to do the choosing, which is
the RHS.
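The identity can also be spot-checked numerically; the sketch below (an illustrative addition) tests all valid triples with n up to 11:

```python
from math import comb

# Numeric check of the identity C(n, r) * C(r, k) = C(n, k) * C(n-k, r-k)
# over every valid triple (n, r, k) with 1 <= k <= r <= n <= 11.
ok = all(comb(n, r) * comb(r, k) == comb(n, k) * comb(n - k, r - k)
         for n in range(1, 12)
         for r in range(1, n + 1)
         for k in range(1, r + 1))
assert ok
```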

Problem 7 [3 + 3 = 6 points]
A certain candy similar to Skittles is manufactured with the following properties: 30% of
the manufactured candy pieces are sweet, while 70% of the pieces are sour. Each candy
piece is colored either red or blue (but not both). If a candy piece is sweet, then it is
colored blue with 80% probability (and colored red with 20% probability), and if a piece is
sour, then it is colored red with 80% probability. The candy pieces are mixed together
randomly before they are sold. You bought a jar containing such candy pieces.

(a) If you choose a piece at random from the jar, what is the probability that you choose
a blue piece? Show your work. (You do not need to simplify your answer.)

(b) Given that the piece you chose is blue, what is the probability that the piece is sour?
Show your work. (You do not need to simplify your answer.)

Solution:

Let B be the event that the piece you choose is blue, and let D be the event that the piece
you choose is sweet (so the complement Dᶜ is the event that the piece is sour).

(a) Pr[B] = Pr[B | D] Pr[D] + Pr[B | Dᶜ] Pr[Dᶜ] = (0.80)(0.30) + (0.20)(0.70) = 0.38.

(b) Pr[Dᶜ | B] = Pr[B | Dᶜ] Pr[Dᶜ] / Pr[B] = (0.20)(0.70) / ((0.80)(0.30) + (0.20)(0.70))
= 7/19 ≈ 0.368.
Pr[B] (0.80)(0.30)+(0.20)(0.70)

Problem 8 [3 + 3 = 6 points]
A literal is an atom (i.e., an atomic proposition) or the negation of an atom (e.g., if P is an
atom, then P is a literal, and so is ¬P). A clause is a formula of the form li ∨ lj ∨ lk, where
li, lj, lk are literals and no atom occurs in li ∨ lj ∨ lk more than once (e.g., P ∨ ¬Q ∨ ¬P is
not allowed, since the atom P occurs in P ∨ ¬Q ∨ ¬P more than once).

Examples: P1 ∨ ¬P2 ∨ ¬P4 is a clause, and so is P2 ∨ ¬P4 ∨ ¬P5.

(a) Let C be a clause. If we choose a uniformly random interpretation by
assigning each atom True with probability 1/2 and False with probability 1/2
independently, what is the probability that C evaluates to True under the chosen
interpretation? Justify your answer.

(b) Let {C1, C2, . . . , Cn} be a collection of n clauses. If we choose a uniformly
random interpretation as in part (a), what is the expected number of clauses in
{C1, C2, . . . , Cn} that evaluate to True under the chosen interpretation?
Justify your answer.

Solution:

(a) The probability that C evaluates to True is equal to 1 minus the probability that C
evaluates to False. Now, C evaluates to False if and only if each of the three literals in
C evaluates to False. A literal in C evaluates to False with probability 1/2, and since no
atom occurs in C more than once, the probability that all three literals in C evaluate to
False is (1/2)³ (by independence). Thus, the probability that C evaluates to True is
1 − (1/2)³ = 7/8.

(b) Let Xi = 1 if clause Ci evaluates to True, and Xi = 0 otherwise. Then, by linearity of
expectation, the expected number of clauses that evaluate to True is
E[Σ_{i=1}^{n} Xi] = Σ_{i=1}^{n} E[Xi] = Σ_{i=1}^{n} Pr[Xi = 1] = Σ_{i=1}^{n} 7/8 = 7n/8.
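Part (a) can be confirmed by enumerating all 8 interpretations of a sample clause; the clause P ∨ ¬Q ∨ ¬R below is an arbitrary choice for illustration (an addition, not part of the original solution):

```python
from itertools import product
from fractions import Fraction

# Enumerate all 2^3 = 8 interpretations of the sample clause P ∨ ¬Q ∨ ¬R
# (three distinct atoms): exactly 7 of the 8 make it True.
true_count = sum(1 for p, q, r in product([True, False], repeat=3)
                 if p or (not q) or (not r))
assert Fraction(true_count, 8) == Fraction(7, 8)
```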

Problem 9 [3 + 3 = 6 points]
Consider the formula (P ∧ ¬Q) → (¬P ∧ Q), where P and Q are atoms.

(a) Is the formula valid? Justify your answer.

(b) Is the formula satisfiable? Justify your answer.

Solution:

(a) No, since the interpretation that assigns True to P and False to Q would
make the formula evaluate to False, since (P ∧ ¬Q) evaluates to True while
(¬P ∧ Q) evaluates to False.

(b) Yes, since any interpretation that assigns False to P would make (P ∧ ¬Q)
evaluate to False, and so (P ∧ ¬Q) → (¬P ∧ Q) would evaluate to True.
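Reading the connectives as (P ∧ ¬Q) → (¬P ∧ Q), both parts can be double-checked by enumerating the four interpretations (an illustrative addition):

```python
from itertools import product

# Truth table of (P ∧ ¬Q) → (¬P ∧ Q) over all four interpretations,
# encoding the implication A → B as (not A) or B.
results = [(not (p and not q)) or ((not p) and q)
           for p, q in product([True, False], repeat=2)]
assert not all(results)  # not valid: falsified by P = True, Q = False
assert any(results)      # satisfiable: any interpretation with P = False works
```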

Problem 10 [6 points]
Let G be an undirected graph, possibly with self-loops. Suppose we have the
predicate symbols Equals(·, ·), IsVertex(·), and Edge(·, ·).
Let D be some domain that contains the set of vertices of G (D might contain
other elements as well). Let I be some interpretation that specifies functions for
Equals(·, ·), IsVertex(·), and Edge(·, ·), so that for all u, v ∈ D, we have
Equals[I](u, v) = T (True) if and only if u = v, IsVertex[I](u) = T if and only if u is a
vertex of G, and if u and v are both vertices of G, then Edge[I](u, v) = T if and only if
{u, v} is an edge of G (we do not know the value of Edge[I](u, v) when u or v is not a
vertex of G).
Write a formula in first-order logic that captures the statement “the graph G
does not contain a triangle”, i.e., the formula is T (True) under (D, I) if and only if
the graph G does not contain a triangle (a triangle is 3 distinct vertices all of
which are connected to one another via an edge).

Solution:

¬∃u∃v∃w(IsVertex(u) ∧ IsVertex(v) ∧ IsVertex(w) ∧ ¬Equals(u, v) ∧ ¬Equals(u, w) ∧
¬Equals(v, w) ∧ Edge(u, v) ∧ Edge(u, w) ∧ Edge(v, w))
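The formula translates directly into a brute-force check over triples of vertices; the helper and the two example graphs below are arbitrary illustrations, not part of the original solution:

```python
from itertools import combinations

# Python analogue of the first-order formula: a graph is triangle-free iff
# no three distinct vertices are pairwise adjacent.
def has_triangle(vertices, edges):
    edge_set = {frozenset(e) for e in edges}
    return any(frozenset({u, v}) in edge_set and
               frozenset({u, w}) in edge_set and
               frozenset({v, w}) in edge_set
               for u, v, w in combinations(vertices, 3))

cycle4 = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])  # triangle-free
triangle3 = ([1, 2, 3], [(1, 2), (2, 3), (1, 3)])          # contains a triangle
assert not has_triangle(*cycle4)
assert has_triangle(*triangle3)
```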

Problem 11 [6 points]
Suppose we have an undirected graph G such that the degree of each vertex is
a multiple of 10 or 15. Show that the number of edges in G must be a multiple of
5.

Solution:

Let G = (V, E). Recall that 2|E| = Σ_{v∈V} deg(v). The degree of each vertex is a
multiple of 10 or 15, and thus is a multiple of 5. Thus, 2|E| = Σ_{v∈V} deg(v) is a
multiple of 5, since each term of the sum is a multiple of 5. Thus, 5 divides

2|E|, and since 5 is prime, 5 must divide 2 or |E|. 5 clearly does not divide 2, so 5
divides |E|, as required.
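A concrete instance (an illustrative addition): the complete graph K_11 is 10-regular, so the claim predicts that its edge count is a multiple of 5:

```python
from itertools import combinations

# In K_11, every vertex has degree 10 (a multiple of 10), and indeed
# |E| = C(11, 2) = 55 is a multiple of 5.
vertices = list(range(11))
edges = list(combinations(vertices, 2))
degrees = [sum(1 for e in edges if v in e) for v in vertices]
assert all(d % 10 == 0 or d % 15 == 0 for d in degrees)
assert len(edges) % 5 == 0
```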

Problem 12 [3 + 2 + 5 = 10 points]
Consider the following non-deterministic finite automaton (NFA):

b
y
has

x b
has
b has

b a,b
w b
b
z
has

u v
has

(a) Write a regular expression that defines the language recognized by the
above NFA.

(b) Construct (by drawing a state diagram) the smallest deterministic finite automaton
(DFA) that recognizes the same language as the above NFA (smallest in
terms of the number of states).

(c) Prove that your DFA for part (b) is indeed the smallest DFA that recognizes
the same language as the above NFA (smallest in terms of the number of states).

Solution:

(a) (ab)*

(b) [State diagram of a 3-state DFA over {a, b}: a start state u that is also the
only accepting state, a state v, and a dead state w; the transitions are
δ(u, a) = v, δ(u, b) = w, δ(v, b) = u, δ(v, a) = w, and δ(w, a) = δ(w, b) = w.]

(c) We will show that any DFA that recognizes (ab)* must have at least 3
states. 1 state is clearly not enough, since a DFA with only 1 state recognizes
either the empty language or the language {a, b}*. Now, consider any DFA
with 2 states. We note that the start state must be an accepting state, since
the empty string needs to be accepted by the DFA. Since the string a is not
accepted by the DFA, the a-transition (the arrow labeled a) out of the start
state cannot be a self-loop, so the a-transition must lead to the other state.
Similarly, since the string b is not accepted by the DFA, the b-transition out
of the start state must lead to the other state. However, this means that bb is
accepted by the DFA, since ab is accepted by the DFA and the first letter of
the string does not affect whether the string is accepted, since both a and b
result in transitioning to the non-start state. This means that the DFA does
not recognize the language defined by (ab)*. Thus, any DFA that recognizes
(ab)* must have at least 3 states.
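The minimality argument can be complemented by simulating a 3-state DFA for (ab)*; the transition table below is the standard minimal DFA with start/accepting state u and dead state w (an illustrative addition, not part of the original solution):

```python
# Simulation of a 3-state DFA for (ab)*: start state u is the only accepting
# state, v is reached after reading an unmatched 'a', and w is a dead state.
delta = {('u', 'a'): 'v', ('u', 'b'): 'w',
         ('v', 'a'): 'w', ('v', 'b'): 'u',
         ('w', 'a'): 'w', ('w', 'b'): 'w'}

def accepts(s):
    state = 'u'
    for ch in s:
        state = delta[(state, ch)]
    return state == 'u'

assert all(accepts(s) for s in ['', 'ab', 'abab', 'ababab'])
assert not any(accepts(s) for s in ['a', 'b', 'ba', 'aba', 'bb', 'abb'])
```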
