Algorithm Lectures
Sandeep Sen
Department of Computer Science and Engineering, IIT Delhi, New Delhi 110016, India.
E-mail: ssen@cse.iitd.ernet.in
Contents
2 Warm up problems 14
2.1 Euclid’s algorithm for GCD . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 Extended Euclid’s algorithm . . . . . . . . . . . . . . . . . . . 15
2.2 Finding the k-th element . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Choosing a random splitter . . . . . . . . . . . . . . . . . . . 17
2.2.2 Median of medians . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Sorting words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Mergeable heaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Merging Binomial Heaps . . . . . . . . . . . . . . . . . . . . . 21
3 Optimization I :
Brute force and Greedy strategy 22
3.1 Heuristic search approaches . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.1 Game Trees * . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 A framework for Greedy Algorithms . . . . . . . . . . . . . . . . . . . 26
3.2.1 Maximal Spanning Tree . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 A Scheduling Problem . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Efficient data structures for MST algorithms . . . . . . . . . . . . . . 29
3.3.1 A simple data structure for union-find . . . . . . . . . . . . . 30
3.3.2 A faster scheme . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.3 The slowest growing function ? . . . . . . . . . . . . . . . . . 32
3.3.4 Putting things together . . . . . . . . . . . . . . . . . . . . . . 33
3.3.5 Path compression only . . . . . . . . . . . . . . . . . . . . . . 34
4 Optimization II :
Dynamic Programming 35
4.1 A generic dynamic programming formulation . . . . . . . . . . . . . . 36
4.2 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Context Free Parsing . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Longest monotonic subsequence . . . . . . . . . . . . . . . . . 37
4.2.3 Function approximation . . . . . . . . . . . . . . . . . . . . . 38
4.2.4 Viterbi’s algorithm for Expectation Maximization . . . . . . . 39
5 Searching 41
5.1 Skip Lists - a simple dictionary . . . . . . . . . . . . . . . . . . . . . 41
5.1.1 Construction of Skip-lists . . . . . . . . . . . . . . . . . . . . . 41
5.1.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Treaps : Randomized Search Trees . . . . . . . . . . . . . . . . . . . 44
5.3 Universal Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.1 Example of a Universal Hash function . . . . . . . . . . . . . 47
5.4 Perfect Hash function . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4.1 Converting expected bound to worst case bound . . . . . . . . 49
5.5 A log log N priority queue . . . . . . . . . . . . . . . . . . . . . . . . 49
7.4 Schönhage and Strassen’s fast multiplication . . . . . . . . . . . . . . . 69
9 Graph Algorithms 79
9.1 Applications of DFS . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.1.1 Strongly Connected Components (SCC) . . . . . . . . . . . . 79
9.1.2 Finding Biconnected Components (BCC) . . . . . . . . . . . 80
9.2 Path problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.2.1 Bellman Ford SSSP Algorithm . . . . . . . . . . . . . . . . . . 82
9.2.2 Dijkstra’s SSSP algorithm . . . . . . . . . . . . . . . . . . . . 84
9.2.3 Floyd-Warshall APSP algorithm . . . . . . . . . . . . . . . . . 85
9.3 Maximum flows in graphs . . . . . . . . . . . . . . . . . . . . . . . . 85
9.3.1 Max Flow Min Cut . . . . . . . . . . . . . . . . . . . . . . . . 87
9.3.2 Ford and Fulkerson method . . . . . . . . . . . . . . . . . . . 88
9.3.3 Edmond Karp augmentation strategy . . . . . . . . . . . . . . 88
9.3.4 Monotonicity Lemma and bounding the iterations . . . . . . . 88
9.4 Global Mincut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9.4.1 The contraction algorithm . . . . . . . . . . . . . . . . . . . . 90
9.4.2 Probability of mincut . . . . . . . . . . . . . . . . . . . . . . . 91
A Recurrences and generating functions 104
A.1 An iterative method - summation . . . . . . . . . . . . . . . . . . . . 104
A.2 Linear recurrence equations . . . . . . . . . . . . . . . . . . . . . . . 106
A.2.1 Homogeneous equations . . . . . . . . . . . . . . . . . . . . . 106
A.2.2 Inhomogeneous equations . . . . . . . . . . . . . . . . . . . . 107
A.3 Generating functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A.3.1 Binomial theorem . . . . . . . . . . . . . . . . . . . . . . . . . 109
A.4 Exponential generating functions . . . . . . . . . . . . . . . . . . . . 109
A.5 Recurrences with two variables . . . . . . . . . . . . . . . . . . . . . 110
Preface
This write-up is a rough chronological sequence of topics that I have covered in the
past in postgraduate and undergraduate courses on Design and Analysis of Algorithms
in IIT Delhi. A quick browse will reveal that these topics are covered by many
standard textbooks in Algorithms like AHU, HS, CLRS, and more recent ones like
Kleinberg-Tardos and Dasgupta-Papadimitriou-Vazirani.
What motivated me to write these notes are
(i) As a teacher, I feel that the sequence in which the topics are exposed has a
significant impact on the appreciation and understanding of the subject.
(ii) Most textbooks have far too much material for one semester and often intimidate
an average student. Even when I pick and choose topics, the ones not covered have
a distracting effect on the reader.
(iii) Not prescribing a textbook in a course (which I have done in the past) creates insecurity among many students who are not adept at taking down notes or at participating in the class discussions that are so important for a course like algorithms. (As a corollary, this may make it easier for some of the students to skip some lectures.)
(iv) Gives me some flexibility about asking students to fill up some of the gaps left
deliberately or inadvertently in my lectures.
(v) Gives my students some idea of the level of formalism I expect in the assignments
and exams - this is somewhat instructor dependent.
(vi) The students are not uniformly mature so that I can demand formal scribing.
I am sure that many of my colleagues have felt the same way about one or more of the above reasons before they took a similar initiative at some point, as is evidenced by the many excellent notes that are accessible via the internet. This is a first draft that I
am making available in the beginning of the semester and I am hoping to refine and
fill up some of the incomplete parts by the middle of this semester. The notes are
likely to contain errors, in particular, typographic. I am also relying on the power
of collective scrutiny of some of the brightest students to point these out and I will
endeavour to update this every week.
I have also promised myself that I will not get carried away in adding to the
present version beyond 1xx pages.
Sandeep Sen
July 2007
Chapter 1
When we make a claim like Algorithm A has running time O(n^2 log n), we have an
underlying computational model where this statement is valid. It may not be true if
we change the model. Before we formalize the notion of a computational model, let
us consider the example of computing Fibonacci numbers.
Exercise 1.1 How large is F_n, the n-th Fibonacci number - derive a closed form.
It is not difficult to argue that it grows exponentially with n. You can also prove that
F_n = 1 + Σ_{i=0}^{n−2} F_i
Since the closed form solution for F_n involves the golden ratio - an irrational number - we must find a way to compute it efficiently without incurring numerical errors or approximation, as it is an integer.
Method 1
Simply use the recursive formula. Unfortunately, one can easily argue that the number of operations (primarily additions) involved is proportional to the value of F_n (just unfold the recursion tree, where each internal node corresponds to an addition).
As we had noted earlier this leads to an exponential time algorithm and we can’t
afford it.
Method 2
Observe that we only need the last two terms of the series to compute the new term. So by applying the idea of dynamic programming we gradually compute F_n starting with F_0 = 0 and F_1 = 1.
This takes time that is proportional to approximately n additions, where each addition involves adding (increasingly large) numbers. The size of F_⌈n/2⌉ is about n/2 bits, so the last n/2 computations are going to take Ω(n) steps each, culminating in an O(n^2) algorithm.
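A minimal Python sketch of Method 2 (illustrative, not part of the original notes); note that Python's arbitrary-precision integers hide the bit-level cost of the additions that the analysis above accounts for.

def fib_iterative(n):
    # Compute F_n by keeping only the last two terms (Method 2).
    if n == 0:
        return 0
    prev, curr = 0, 1                      # F_0, F_1
    for _ in range(2, n + 1):
        prev, curr = curr, prev + curr     # one (big-integer) addition per step
    return curr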
Since the n-th Fibonacci number is at most n bits, it is reasonable to look for a
faster algorithm.
Method 3
[ F_i     ]   [ 1 1 ] [ F_{i−1} ]
[ F_{i−1} ] = [ 1 0 ] [ F_{i−2} ]

By iterating the above equation we obtain

[ F_n     ]   [ 1 1 ]^{n−1} [ 1 ]
[ F_{n−1} ] = [ 1 0 ]       [ 0 ]
To compute A^n, where A is a square matrix, we can extend the following strategy for computing x^n where n is an integer.

x^{2k} = (x^k)^2          for even integral powers
x^{2k+1} = x · x^{2k}     for odd integral powers
The number of multiplications taken by the above approach to compute x^n is bounded by 2 log n (convince yourself by writing a recurrence). However, the actual running time depends on the time to multiply two numbers, which in turn depends on their lengths (number of digits). Let M(n) be the number of (bit-wise) steps to multiply two n-bit numbers. Then the number of steps to implement the above approach must take into account the lengths of the numbers that are being multiplied. The following observations will be useful.
The length of x^k is bounded by k · |x|, where |x| is the length of x.
Therefore, the cost of the squaring of x^k is bounded by M(k|x|). Similarly, the cost of computing x × x^{2k} can also be bounded by M(2k|x|). The overall recurrence for computing x^n can be written as

T_B(n) ≤ T_B(⌊n/2⌋) + M(n|x|)
where T_B(n) is the number of bit operations to compute the n-th power using the previous recurrence. The solution of the above recurrence can be written as the following summation (by unfolding)

Σ_{i=1}^{log n} M(2^i |x|)

If M(2i) > 2M(i), then the above summation can be bounded by O(M(n|x|)), i.e. the cost of the last squaring operation.
In our case, A is a 2 × 2 matrix - each squaring operation involves 8 multiplications
and 4 additions involving entries of the matrix. Since multiplications are more ex-
pensive than additions, let us count the cost of multiplications only. Here, we have to
keep track of the lengths of the entries of the matrix. Observe that if the maximum
size of an entry is |x|, then the maximum size of an entry after squaring is at most
2|x| + 1 (Why ?).
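As an illustrative sketch (assumed Python helper names, not the author's code), the repeated-squaring strategy applied to the 2 × 2 matrix above computes F_n with O(log n) matrix multiplications; the true bit-level cost is governed by M() as discussed.

def mat_mult(A, B):
    # Multiply two 2x2 matrices given as ((a, b), (c, d)).
    return ((A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]),
            (A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]))

def mat_power(A, n):
    # A^(2k) = (A^k)^2 and A^(2k+1) = A * A^(2k), as in the squaring strategy above.
    if n == 0:
        return ((1, 0), (0, 1))            # identity matrix
    half = mat_power(A, n // 2)
    sq = mat_mult(half, half)
    return mat_mult(A, sq) if n % 2 else sq

def fib_matrix(n):
    # F_n is the top-right entry of [[1, 1], [1, 0]]^n.
    return 0 if n == 0 else mat_power(((1, 1), (1, 0)), n)[0][1]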
A × B = (2^{n/2} · A_1 + A_2) × (2^{n/2} · B_1 + B_2)
where A_1 (B_1) is the leading n/2 bits of A (B). Likewise A_2 is the trailing n/2 bits of A. We can expand the above product as

A × B = 2^n · A_1 B_1 + 2^{n/2} · (A_1 B_2 + A_2 B_1) + A_2 B_2
Exercise 1.3 What is the time to multiply using the above method - write and solve
an appropriate recurrence ?
Although strictly speaking, A_1 + A_2 is not n/2 bits but at most n/2 + 1 bits (Why ?), we can still view this as computing three separate products involving n/2 bit numbers recursively and subsequently subtracting appropriate terms to get the required products. Subtractions and additions are identical in modulo arithmetic (2's complement), so the cost of a subtraction can be bounded by O(n). (What is the maximum size of the numbers involved in a subtraction ?) This gives us the following recurrence

T(n) ≤ 3 · T(n/2) + O(n)

where the last term accounts for additions, subtractions and shifts.
Exercise 1.4 With appropriate terminating condition, show that the solution to the recurrence is O(n^{log_2 3}).
The running time is roughly O(n^{1.7}), which is asymptotically better than n^2, and therefore we have succeeded in designing an algorithm to compute F_n faster than n^2.
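A sketch of the three-products idea (Karatsuba's trick) on Python integers, splitting on a bit boundary; the fallback threshold and names are illustrative assumptions.

def karatsuba(a, b):
    # Multiply non-negative integers a and b using three recursive half-length products.
    if a < (1 << 32) or b < (1 << 32):          # small enough: use the built-in multiply
        return a * b
    half = max(a.bit_length(), b.bit_length()) // 2
    a1, a2 = a >> half, a & ((1 << half) - 1)   # leading and trailing bits of a
    b1, b2 = b >> half, b & ((1 << half) - 1)
    p1 = karatsuba(a1, b1)                      # A1 * B1
    p2 = karatsuba(a2, b2)                      # A2 * B2
    p3 = karatsuba(a1 + a2, b1 + b2)            # (A1 + A2) * (B1 + B2)
    # A * B = 2^(2*half) * p1 + 2^half * (p3 - p1 - p2) + p2
    return (p1 << (2 * half)) + ((p3 - p1 - p2) << half) + p2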
1.3 Model of Computation
Although there are a few thousand variations of the computer with different archi-
tectures and internal organization, it is best to think about them at the level of the
assembly language. Despite architectural variations, the assembly level language sup-
port is very similar - the major difference being in the number of registers and the word
length of the machine. But these parameters are also in a restricted range of a factor
of two, and hence asymptotically in the same ball park. In summary, think about any
computer as a machine that supports a basic instruction set consisting of arithmetic
and logical operations and memory accesses (including indirect addressing). We will
avoid cumbersome details of the exact instruction set and assume realistically that
any instruction of one machine can be simulated using a constant number of available instructions of another machine. Since analysis of algorithms involves counting the
number of operations and not the exact timings (which could differ by an order of
magnitude), the above simplification is justified.
The careful reader would have noticed that during our detailed analysis of Method
3 in the previous sections, we were not simply counting the number of arithmetic
operations but actually the number of bit-level operations. Therefore the cost of
a multiplication or addition was not unity but proportional to the length of the
input. Had we only counted the number of multiplications for computing xn , that
would only be O(log n). This would indeed be the analysis in a uniform cost model
where only the number of arithmetic (also logical) operations are counted and does
not depend on the length of the operands. A very common use of this model is for
comparison-based problems like sorting, selection, merging, and many data-structure
operations. For these problems, we often count only the number of comparisons (not
even other arithmetic operations) without bothering about the length of the operands
involved. In other words, we implicitly assume O(1) cost for any comparison. This is not considered unreasonable since the size of the numbers involved in sorting does not increase during the course of the algorithm for the majority of the commonly known sorting algorithms. On the other hand, consider the following problem of repeated squaring n times starting with 2. The result is the number 2^{2^n}, which requires 2^n bits to be represented. It will be very unreasonable to assume that a number that is exponentially long can be written out (or even stored) in O(n) time. Therefore the uniform cost model will not reflect any realistic setting for this problem.
On the other extreme is the logarithmic cost model where the cost of an operation
is proportional to length of the operands. This is very consistent with the physical
world and also has close relation with the Turing Machine model which is a favorite of
complexity theorists. Our analysis in the previous sections is actually done with this
model in mind. Not only the arithmetic operations but also the cost of a memory access is proportional to the length of the operand and the address.
The most commonly used model is something in between. We assume that for
an input of size n, any operation involving operands of size log n takes O(1) steps.
This is justified as follows. All microprocessor chips have specialized hardware circuits
for arithmetic operations like multiplication, addition, division etc. that take a fixed
number of clock cycles when the operands fit into a word. The reason that log n is a
natural choice for a word is that, even to address an input size n, you require log n
bits of address space. The present high end microprocessor chips have typically 2-4 GBytes of RAM and about 64 bit word size - clearly 2^64 exceeds 4 GBytes. We will
also use this model, popularly known as Random Access Machine (or RAM in
short) except for problems that deal with numbers as inputs like multiplication in
the previous section where we will invoke the log cost model. In the beginning, it
is desirable that for any algorithm, you get an estimate of the maximum size of the numbers to ensure that operands do not exceed O(log n) bits, so that it is safe to use the RAM model.
There are further refinements to this model that parameterizes multiple levels and
also accounts for internal computation. As the model becomes more complicated,
designing algorithms also becomes more challenging and often more laborious.
from any input gate to the output gate (each gate contributes to a unit delay). Those
familiar with circuits for addition, comparison can analyse them in this framework.
The carry-save adder is a low-depth circuit that adds two n-bit numbers in about
O(log n) steps which is much faster than a sequential circuit that adds one bit at a
time taking n steps.
One of the most fascinating developments is the Quantum Model which is in-
herently parallel but it is also fundamentally different from the previous models. A
breakthrough result in recent years is a polynomial time algorithm for factorization, whose presumed hardness forms the basis of many cryptographic protocols in the conventional model.
Biological computing models are a very active area of research where scientists
are trying to assemble a machine out of DNA strands. It has potentially many
advantages over silicon based devices and is inherently parallel.
Chapter 2
Warm up problems
One of the primary challenges in algorithm design is to come up with provably optimal
algorithms. The optimality is with respect to the underlying model. In this chapter,
we look closely at some well-known algorithms for basic problems that use basic properties of the problem domain in conjunction with elementary analytical methods.
Let c = a mod b.
If c = 0 then return b else
return Euclid GCD(b, c)
Let us now analyze the running time of Euclid’s algorithm in the bit model. Since
it depends on integer division, which is a topic in its own right, let us address the
number of iterations of Euclid’s algorithm in the worst case.
Observation 2.1 The number a mod b ≤ a/2, i.e. the size of a mod b is strictly less than |a|.
This is a simple case analysis based on b ≤ a/2 and b > a/2. As a consequence of the above observation, it follows that the number of iterations of Euclid's algorithm is bounded by |a|, or equivalently O(log a).
Exercise 2.2 Construct an input for which the number of iterations matches this bound.
So, by using the long division method to compute mod, the running time is bounded by O(n^2 log n) where n = |a| + |b|.
Claim: gcd(a, b) = min{xa + yb | x, y ∈ Z, xa + yb > 0}.
Proof: Let ℓ = min{xa + yb | xa + yb > 0}. Clearly gcd(a, b) divides ℓ and hence gcd(a, b) ≤ ℓ. We now prove that ℓ divides a (and also b). Suppose, for the sake of contradiction, that a = ℓq + r where ℓ > r > 0. Then r = a − ℓq = (1 − xq)a − (yq)b, contradicting the minimality of ℓ. □
The above result can be restated as ℓ divides a and b. For some applications,
we are interested in computing x and y corresponding to gcd(a, b). We can compute
them recursively along with the Euclid’s algorithm.
Exercise 2.3 Let (x′, y′) correspond to gcd(b, a mod b), i.e. gcd(b, a mod b) = x′ · b + y′ · (a mod b). Then show that gcd(a, b) = y′ · a + (x′ − y′ · q) · b where q is the quotient of the integer division of a by b.
One immediate application of the extended Euclid's algorithm is computing inverses in the multiplicative group of a prime field F_p where p is prime. F_p = {0, 1, 2 . . . (p − 1)} where the multiplication is performed modulo p. It is known 1 that for every number
1 Since it forms a group.
x ∈ F_p − {0}, there exists y ∈ F_p − {0} such that x · y ≡ 1 (mod p); y is called the inverse of x. To compute the inverse of a we can use the extended Euclid algorithm to find s, t such that sa + tp = 1, since a is relatively prime to p. By taking the remainder modulo p, we see that s mod p is the required inverse.
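A sketch of the extended Euclid algorithm and the inverse computation just described (illustrative Python, assuming p is prime and gcd(a, p) = 1).

def extended_gcd(a, b):
    # Return (g, x, y) such that g = gcd(a, b) = x*a + y*b.
    if b == 0:
        return a, 1, 0
    g, x1, y1 = extended_gcd(b, a % b)      # g = x1*b + y1*(a mod b)
    q = a // b
    return g, y1, x1 - q * y1               # g = y1*a + (x1 - y1*q)*b

def mod_inverse(a, p):
    # Inverse of a in F_p, computed from s*a + t*p = 1.
    g, s, _ = extended_gcd(a, p)
    assert g == 1                           # a must be relatively prime to p
    return s % p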
This is easily done by comparing x with all elements in S − {x} and finding the rank of x. Using an arbitrary element x as a filter, we can subsequently confine our search for the k-th element to either
(i) S_> = {y ∈ S − {x} | y > x} if k > R(x, S), or
(ii) S_< = {y ∈ S − {x} | y < x} if k < R(x, S).
In the fortuitous situation R(x, S) = k, x is the required element. In case (i), we must find the k′-th element in S_> where k′ = k − R(x, S).
Suppose T(n) is the worst case running time for selecting the k-th element for any k; then we can write the following recurrence

T(n) ≤ T(max{|S_<|, |S_>|}) + O(n)
A quick inspection tells us that if we can ensure max{|S_<|, |S_>|} ≤ ǫn for some 1/2 ≤ ǫ < (n−1)/n, (Why the bounds ?) T(n) is bounded by O(n/(1 − ǫ)). So it could vary between Ω(n) and O(n^2) - where a better running time is achieved by ensuring a smaller value of ǫ.
An element x used to divide the set is often called a splitter or a pivot. So, now
we will discuss methods to select a good splitter. From our previous discussion, we
would like to select a splitter that has a rank in the range [ǫ · n, (1 − ǫ) · n] for a fixed
fraction ǫ. Typically, ǫ will be chosen as 1/4.
It is easy to verify if the rank R(r, S) falls in the above range, and if it does not, then
we choose another element independently at random. This process is repeated till we
find a splitter in the above range - let us call such a splitter a good splitter.
How many times do we need to repeat the process ?
To answer this, we have to take a slightly different view. One can argue easily
that there is no guarantee that we will terminate after some fixed number of trials,
while it is also intuitively clear that it is extremely unlikely that we need to repeat
this more than say 10 times. The probability of failing 9 consecutive times, when the success probability of picking a good splitter is ≥ 1/2 independently, is ≤ 1/2^9. More precisely, the expected number of trials is bounded by 2. So, in (expected) two trials, we will find a good splitter that reduces the size of the problem to at most (3/4)n. This
argument can be repeated for the recursive calls, namely, the expected number of
splitter selection (and verification of its rank) is 2. If ni is the size of the problem
after i recursive calls with n0 = n, then the expected number of comparisons done
after the i-th recursive call is 2ni . The total expected number of comparisons X after
t calls can be written as X0 + X1 + . . . Xt where t is sufficiently large such that the
problem size nt ≤ C for some constant C (you can choose other stopping criteria)
and Xi is the number of comparisons done at stage i. By taking expectation on both
sides
E[X] = E[X_0 + X_1 + . . . + X_t] = E[X_0] + E[X_1] + . . . + E[X_t]
2 Please refer to the Appendix for a quick recap of basic measures of discrete probability.
From the previous discussion E[X_i] = 2n_i and moreover n_i ≤ (3/4) n_{i−1}. Therefore the expected number of comparisons is bounded by 4n.
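A sketch of the randomized selection procedure (illustrative; assumes distinct keys so that ranks are unambiguous). The splitter is retried until its rank lies in the middle half, i.e. ǫ = 1/4.

import random

def select(S, k):
    # Return the k-th smallest element (1-indexed) of the list S.
    if len(S) == 1:
        return S[0]
    while True:
        x = random.choice(S)                             # candidate splitter
        smaller = [y for y in S if y < x]
        larger = [y for y in S if y > x]
        if len(S) // 4 <= len(smaller) <= 3 * len(S) // 4:
            break                                        # x is a good splitter
    rank = len(smaller) + 1                              # R(x, S)
    if k == rank:
        return x
    if k < rank:
        return select(smaller, k)
    return select(larger, k - rank)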
2.2.2 Median of medians
The median-of-medians approach leads to a recurrence of the form

T(n) ≤ T(7n/10) + T(n/5) + O(n)

where the second recursive call is to find the median of medians (for finding a good splitter). After we find the splitter (by recursively applying the same algorithm), we use it to reduce the original problem size to at most 7n/10.
straightforward comparison sorting. Let us recall some basic results about integer
sorting.
Claim 2.1 n integers in the range [1..m] can be sorted in O(n + m) steps.
A sorting algorithm is considered stable if the relative order of input elements having
identical values is preserved in the sorted output.
Claim 2.2 Using stable sorting, n integers in the range [1..m^k] can be sorted in O(k(n + m)) steps.
The above is easily achieved by applying integer sorting in the range [1..m] starting
from the least significant digits - note that the maximum number of digits in radix
m representation is k. If we apply the same algorithm for sorting words, then the running time will be O(L(n + |Σ|)) where L = max{l_1, l_2 . . . l_n}. This is not satisfactory since L · n can be much larger than N (size of input).
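An illustrative sketch of Claim 2.2: least-significant-digit-first radix sort over base-m digits, using a stable counting (bucket) sort for each round.

def stable_sort_by_digit(arr, m, exp):
    # Stable O(n + m) sort of arr by the base-m digit (x // exp) % m.
    buckets = [[] for _ in range(m)]
    for x in arr:
        buckets[(x // exp) % m].append(x)
    return [x for b in buckets for x in b]

def radix_sort(arr, m, k):
    # Sort n integers in the range [0, m^k - 1] in O(k(n + m)) steps.
    exp = 1
    for _ in range(k):                      # least significant digit first
        arr = stable_sort_by_digit(arr, m, exp)
        exp *= m
    return arr

print(radix_sort([329, 457, 657, 839, 436, 720, 355], 10, 3))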
The reason that the above method is potentially inefficient is that many words
may be much shorter than L and hence by considering them to be length L words (by
hypothetical trailing blanks), we are increasing the input size asymptotically. When
we considered radix sort as a possible solution, the words have to be left-aligned, i.e.,
all words begin from the same position. To make radix sort efficient and to avoid
redundant comparison (of blanks), we should not consider a word until the radix sort
reaches the right boundary of the word. The radix sort will take a maximum of L
rounds and a word of length l will start participating from the L − l + 1 iteration.
This can be easily achieved. A bigger challenge is to reduce the range of sorting in
each iteration depending on which symbols of the alphabet participate.
Given a word wi = ai,1 ai,2 . . . ai,li , where ai,j ∈ Σ, we form the following pairs -
(1, ai,1 ), (2, ai,2) . . . . There are N such pairs from the n words and we can think of
them as length two strings where the first symbol is from the range [1..L] and the
second symbol is from Σ. We can sort them using radix sort in two rounds in time
proportional to O(N + L + |Σ|) which is O(N + |Σ|) since N > L. From the sorted
pairs we know exactly which symbols appear in a given position (between 1 and L)
- let there be mi words that have non-blank symbols in position i. We also have the
ordering of symbols in position i which is crucial to implement integer sort in O(mi )
steps.
Now we go back to sorting the given words using radix sort where we will use
the information available from the sorted pairs. When we are sorting position i from
the left, we apply integer sort in the range [1..mi ] where the ordered buckets are also
defined by the sorted pairs. We do not move the entire words into the buckets but
only the pointers (which could be the input index of the words) associated with the
words. For every round, we allocate an array of size mi where we place the pointers
according to the sorted order of the symbols involved. For same symbols, we maintain
the order of the previous round (stable sorting). We must also take care of the new
words that start participating in the radix sort - once a word participates, it will
participate in all future rounds. (Where should the new words be placed within its
symbol group ?)
The analysis of this algorithm can be done by looking at the cost of each round of radix sort, which is proportional to O(m_i); the total Σ_{i=1}^{L} O(m_i) can be bounded by N. Therefore the
overall running time of the algorithm is the sum of sorting the pairs and the radix
sort. This is given by O(N + |Σ|). If |Σ| < N, then the optimal running time is given
by O(N).
• Combine: B_{i+1} is formed by making the root node of one B_i a child of the root of the other B_i.
A binomial heap is an ordered set of binomial trees such that for any i there is at most one B_i.
Let us refer to the above property as the unique-order property. We actually maintain the list of the root nodes in increasing order of their degrees.
You may think of the above property as a binary representation of a number where the i-th bit from the right is 0 or 1 and, in the latter case, its contribution is 2^i (for the LSB, i = 0). From the above analogy, a Binomial Heap on n elements has log n Binomial
4 We are assuming min-heaps.
trees. Therefore, finding the minimum element can be done in O(log n) comparisons
by finding the minimum of the log n roots.
Claim 2.3 Two Binomial Heaps can be combined in O(log n) steps where the total number of nodes in the two heaps is n.
Every time we combine two trees, the number of binomial trees decreases by one, so
there can be at most 2 log n times where we combine trees.
Remark The reader may compare this with the method for summing two numbers
in binary representation.
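A sketch of the merging procedure, treating the two root lists like binary addition with carries. The representation of a binomial tree as a pair (key, list of child trees) is an assumption made for illustration; min-heaps are assumed.

def link(t1, t2):
    # Combine two binomial trees of equal degree: the larger root becomes a child of the smaller.
    if t1[0] > t2[0]:
        t1, t2 = t2, t1
    return (t1[0], t1[1] + [t2])            # the degree increases by one

def merge_heaps(h1, h2):
    # h1, h2: lists of trees (key, children) with distinct degrees (unique-order property).
    by_degree = {}
    for t in sorted(h1 + h2, key=lambda t: len(t[1])):   # roots in increasing degree
        d = len(t[1])
        while d in by_degree:               # two trees of the same degree: combine (a "carry")
            t = link(t, by_degree.pop(d))
            d += 1
        by_degree[d] = t
    return [by_degree[d] for d in sorted(by_degree)]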
Exercise 2.6 Show that the delete-min operation can be implemented in O(log n)
steps using merging.
Inserting a new element is easy - add a node to the root list and merge. Deletion
takes a little thought. Let us first consider an operation decrease-key. This happens
when a key value of a node x decreases. Clearly, the min-heap property of the parent
node, parent(x) may not hold. But this can be restored by exchanging the node x
with its parent. This operation may have to be repeated at the parent node. This
continues until the value of x is greater than its current parent or x doesn’t have a
parent, i.e., it is the root node. The cost is the height of a Binomial tree which is
O(log n).
Exercise 2.7 Show how to implement the insert operation in O(log n) comparisons.
Chapter 3
Optimization I :
Brute force and Greedy strategy
The knapsack problem can be stated as

Maximize Σ_{i=1}^{n} x_i · p_i subject to Σ_{i=1}^{n} x_i · w_i ≤ C, x_i ∈ {0, 1}
Note that the constraint xi ∈ {0, 1} is not linear. A simplistic approach will be to
enumerate all subsets and select the one that satisfies the constraints and maximizes
the profits. Any solution that satisfies the capacity constraint is called a feasible
solution. The obvious problem with this strategy is the running time which is at
least 2n corresponding to the power-set of n objects.
We can imagine that the solution space is generated by a binary tree where we start from the root with an empty set and then move left or right according to whether we select x_1. At the second level, we again associate the left and right branches with the choice of x_2. In this way, the 2^n leaf nodes correspond to the subsets of the power-set, each of which corresponds to an n-length 0-1 vector. For example, a vector 000 . . . 1 corresponds to the subset that only contains x_n.
Any intermediate node at level j from the root corresponds to partial choice among
the objects x1 , x2 . . . xj . As we traverse the tree, we keep track of the best feasible
solution among the nodes visited - let us denote this by T . At a node v, let S(v)
denote the subtree rooted at v. If we can estimate the maximum profit P (v) among
the leaf nodes of S(v), we may be able to prune the search. Suppose L(v) and U(v)
are the lower and upperbounds of P (v), i.e. L(v) ≤ P (v) ≤ U(v). If U(v) < T , then
there is no need to explore S(v) as we cannot improve the current best solution, viz.,
T . In fact, it is enough to work with only the upper-bound of the estimates and L(v)
is essentially the current partial solution. As we traverse the tree, we also update
U(v) and if it is less than T , we do not search the subtree any further. This method
of pruning the search is called branch and bound and although it is clear that it is advantageous to use the strategy, there may not be any provable savings in the worst case.
Exercise 3.1 Construct an instance of a knapsack problem that visits every leaf node,
even if you use branch and bound. You can choose any well defined estimation.
Example 3.1
Let the capacity of the knapsack be 15 and the volumes and profits of the objects x_1, x_2, x_3, x_4 be respectively
Profits 10 10 12 18
Volume 2 4 6 9
We will use the ratio of profit per volume as an estimation for upperbound. For
the above objects the ratios are 5, 2.5, 2 and 2. Initially, T = 0 and U = 5 × 15 = 75.
After including x1 , the residual capacity is 13 and T = 10. By proceeding this way, we
obtain T = 38 for {x1 , x2 , x4 }. By exploring further, we come to a stage when we have
included x1 and decided against including x2 so that L(v) = 10, and residual capacity
is 13. Should we explore the subtree regarding {x3 , x4 } ? Since profit per volume of
x3 is 2, we can obtain U(v) = 2 × 13 + 10 = 36 < L = 38. So we need not search this
subtree. By continuing in this fashion, we may be able to prune large portions of the
search tree. However, it is not possible to obtain any provable improvements.
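An illustrative sketch of the branch-and-bound search on this problem; the upper bound U(v) fills the residual capacity at the best remaining profit-per-volume ratio, in the spirit of the estimate used above (code and names are assumptions, not the author's).

def knapsack_branch_and_bound(profits, volumes, capacity):
    # Returns the best total profit; prunes subtrees whose upper bound cannot beat the incumbent T.
    n = len(profits)
    best = [0]                                            # the incumbent T

    def search(i, residual, profit_so_far):
        best[0] = max(best[0], profit_so_far)
        if i == n or residual == 0:
            return
        best_ratio = max(profits[j] / volumes[j] for j in range(i, n))
        if profit_so_far + best_ratio * residual <= best[0]:
            return                                        # U(v) cannot improve on T: prune
        if volumes[i] <= residual:                        # branch: include object i
            search(i + 1, residual - volumes[i], profit_so_far + profits[i])
        search(i + 1, residual, profit_so_far)            # branch: exclude object i

    search(0, capacity, 0)
    return best[0]

print(knapsack_branch_and_bound([10, 10, 12, 18], [2, 4, 6, 9], 15))   # 38, as in Example 3.1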
Consider a tree with depth 2k (i.e. 4^k leaf nodes) with alternating AND and OR levels, k levels of each type. We will show that the expected cost of evaluation is 3^k by induction on k.
Exercise 3.2 Show that for k = 1, the expected number of evaluations is 3. (You
must consider all cases of output and take the worst, since we are not assuming any
distribution on input or output).
Let us consider a root with label OR and its two AND children, say y and z, whose
children are OR nodes with 2(k − 1) depth. We have the two cases
output is 0 at the root Both y and z must evaluate to 0. Since these are AND nodes, again with probability 1/2, we will end up evaluating only one of the children (of y, z), which requires expected 1/2 · (1 + 2) · 3^{k−1} = 3/2 · 3^{k−1} steps for y as well as for z from the induction hypothesis. This adds up to a total of expected 2 · 3/2 · 3^{k−1} = 3^k steps for y and z.
Output is 1 at the root At least one of the AND nodes y, z must be 1. With probability 1/2 this will be chosen first and this can be evaluated using the expected cost of evaluating two OR nodes with output 1. By the induction hypothesis this is 2 · 3^{k−1}.
The other possibility (also with probability 1/2) is that the first AND node (say y) is 0 and the second AND node is 1. The expected cost of the first AND node with 0 output is 1/2 · 3^{k−1} + 1/2 · (3^{k−1} + 3^{k−1}) - the first term corresponds to the scenario that the first child evaluates to 0 and the second term corresponds to the case that both children of y are evaluated. The expected cost of evaluating y having value 0 is 3/2 · 3^{k−1}.
The expected number of evaluations for the second AND node z with output 1 is 2 · 3^{k−1} since both children must be evaluated.
So the total expected cost is 1/2 · 3^{k−1} · (2 + 3/2 + 2) = 2.75 · 3^{k−1} < 3^k.
In summary, for an OR root node, regardless of the output, the expected number of evaluations is bounded by 3^k.
Exercise 3.3 Establish a similar result for the AND root node.
If N is the number of leaf nodes, then the expected number of evaluations is N^{log_4 3} = N^α where α < 0.8.
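A sketch of the randomized evaluation procedure analysed above: at each internal node a randomly chosen child is evaluated first and the second child is skipped when the first already determines the output. The tuple representation ('AND' or 'OR', left, right) with 0/1 leaves is an illustrative assumption.

import random

def evaluate(node):
    # Evaluate an AND/OR tree, short-circuiting on a randomly chosen child.
    if node in (0, 1):                     # leaf
        return node
    op, left, right = node
    first, second = (left, right) if random.random() < 0.5 else (right, left)
    v = evaluate(first)
    if op == 'AND' and v == 0:             # a 0 child determines an AND node
        return 0
    if op == 'OR' and v == 1:              # a 1 child determines an OR node
        return 1
    return evaluate(second)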
3.2 A framework for Greedy Algorithms
There are very few algorithmic techniques for which the underlying theory is as precise
and clean as what we will discuss here. Let us define the framework. Let S be a set
and M be a subset of 2^S. A subset system (S, M) is called a matroid if it satisfies
the following property
Note that the empty subset Φ ∈ M. The family of subsets M is often referred to as
independent subsets and one may think of M as the feasible subsets.
Example 3.2 For the maximal spanning tree problem on a graph G = (V, E), (E, F )
is a matroid where F is the set of all subgraphs without cycles (i.e. all the forests).
The running time of the algorithm is dependent mainly on the test for independence
which depends on the specific problem. M is not given explicitly as it may be very large (even exponential). Instead, a characterization of M is used to perform the test.
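An illustrative sketch of the Gen Greedy template referred to above: scan the elements in decreasing order of weight and retain an element whenever the independence test accepts the augmented set. The independence oracle is the problem-specific characterization of M; the example instance is hypothetical.

def gen_greedy(S, weight, is_independent):
    # S: list of elements; weight(e): a number; is_independent(subset): the test for membership in M.
    T = []
    for e in sorted(S, key=weight, reverse=True):   # decreasing order of weight
        if is_independent(T + [e]):                 # problem-specific independence test
            T.append(e)
    return T

# Example: the subset system of Example 3.3, where a set of directed edges
# (tail, head, weight) is independent if no two of them share a head vertex.
def no_shared_head(edges):
    heads = [v for (_, v, _) in edges]
    return len(heads) == len(set(heads))

edges = [(1, 2, 5), (3, 2, 7), (2, 3, 4)]
print(gen_greedy(edges, lambda e: e[2], no_shared_head))   # [(3, 2, 7), (2, 3, 4)]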
What seems more important is the question - Is T the maximum weight subset ?
This is answered by the next result
1. Algorithm Gen Greedy outputs the optimal subset for any choice of the weight
function.
2. exchange property
For any s_1, s_2 ∈ M where |s_1| < |s_2|, there exists e ∈ s_2 − s_1 such that s_1 ∪ {e} ∈ M.
3. For any A ⊂ S, all maximal subsets of A have the same cardinality. (T being a maximal subset of A means that there is no element e ∈ A − T such that T ∪ {e} ∈ M.)
The obvious use of the theorem is to establish properties 2 or 3 to justify that a greedy
approach works for the problem. Conversely, if we can prove that one of the properties doesn't hold (by a suitable counterexample), then greedy cannot always return the optimum subset.
Proof: We will prove it in the following cyclic implications - Property 1 implies Prop-
erty 2. Then Property 2 implies Property 3 and finally Property 3 implies Property
1.
Property 1 implies Property 2 We will prove it by contradiction. Suppose Prop-
erty 2 doesn’t hold for some subsets s1 and s2 . That is, we cannot add any element
from s2 −s1 to s1 and keep it independent. Further, wlog, let |s2 | = p+1 and |s1 | = p.
Let us define a weight function on the elements of S as follows
w(e) = p + 2 if e ∈ s_1
w(e) = p + 1 if e ∈ s_2 − s_1
w(e) = 0 otherwise
The greedy approach will pick up all elements from s1 and then it won’t be able to
choose any element from s_2 − s_1. The greedy solution has weight (p + 2) · |s_1| = (p + 2) · p.
By choosing all elements of s2 , the solution has cost (p + 1) · (p + 1) which has a higher
cost than greedy and hence it is a contradiction of Property 1 that is assumed to be
true.
Property 2 implies Property 3 If two maximal subsets of a set A have different cardinalities, it is a violation of Property 2. Since both of these sets are independent, we should be able to augment the smaller set s_1 with an element from the larger set s_2, contradicting the maximality of s_1.
Property 3 implies Property 1 Again we will prove by contradiction. Let e_1 e_2 . . . e_i . . . e_n be the elements chosen by the greedy algorithm in decreasing order of their weights. Further, let e′_1 e′_2 . . . e′_i . . . e′_m be the elements of an optimal solution in decreasing order - (Is m = n ?). Since the weight of the greedy solution is not optimal, there must be a j ≤ n such that w(e_j) < w(e′_j). Let A = {e ∈ S | w(e) ≥ w(e′_j)}. The subset {e_1, e_2 . . . e_{j−1}} is maximal with respect to A (Why ?). All the elements in {e′_1, e′_2 . . . e′_j} form an independent subset of A that has greater cardinality. This contradicts Property 3. □
Example 3.3 Half Matching Problem Given a directed graph with non-negative
edge weights, find out the maximum weighted subset of edges such that the in-degree
of any node is at most 1.
The problem defines a subset system where S is the set of edges and M is the family of all subsets of edges such that no two incoming edges share a vertex. Let us verify Property 2 by considering two subsets S_p and S_{p+1} with p and p + 1 edges respectively. S_{p+1} must have at least p + 1 distinct vertices incident on its p + 1 incoming edges, so there must be at least one vertex that is not the head of any edge of S_p. Clearly, we can add the edge of S_{p+1} incoming to this vertex to S_p without affecting independence.
We will show that there is an edge (u′, v′) in F_1 such that u′ and v′ are in different connected components of F_2 and therefore F_2 ∪ {(u′, v′)} cannot contain a cycle.
You can easily argue that there is some component in F1 that is split up into some
components of F2 . If you imagine coloring the vertices of F1 according to the compo-
nents in F2 , at least one edge will have its end-points colored differently.
Exercise 3.4 The matroid theory is about maximizing the total weight of a subset.
How would you extend it to finding minimum weighted subset - for example Minimal
Spanning Trees ?
3 The set of end-points of F_1 may be a proper subset of F_2. Even then we can restrict our arguments to the end-points of F_2 - Why ?
3.2.2 A Scheduling Problem
We are given a set of jobs J_1, J_2 . . . J_n, their corresponding deadlines d_i for completion and the corresponding penalties p_i if a job completes after its deadline. The jobs have unit processing time on a single available machine. We want to minimize the total penalty incurred by the jobs that are not completed before their deadlines. Stated otherwise, we want to maximize the total penalty of the jobs that get completed before their deadlines.
A set A of jobs is independent if there exists a schedule to complete all jobs in
A without incurring any penalty. We will try to verify Property 2. Let A, B be two
independent sets of jobs with |B| > |A|. We would like to show that for some job
J ∈ B, {J} ∪ A is independent. Let |A| = m < n = |B|. Start with any feasible
schedules for A and B and compress them, i.e. remove any idle time between the
jobs by transforming the schedule where there is no gap between the finish time of
a job and start time of the next. This shifting to left does not affect independence.
Let us denote the (ordered) jobs in A by A1 , A2 . . . Am and the times for scheduling
of jobs in A be d1 , d2 . . . dm respectively. Likewise, let the jobs in B be B1 , B2 . . . Bn
and their scheduling times d′1 , d′2 . . . d′n .
If B_n ∉ A, then we can add B_n to A and schedule it as the last job. If B_n = A_j,
then move Aj to the same position as Bn (this doesn’t violate the deadline) creating
a gap at the j-th position in A. We can now shift-to-left the jobs in A − Aj and now
by ignoring the jobs Bn = Aj , we have one less job in A and B. We can renumber
the jobs and are in a similar position as before. By applying this strategy inductively,
either we succeed in adding a job from B to A without conflict or we are in a situation
where A is empty and B is not so that we can now add without conflict.
Algorithm Kruskal MST
Output T .
The key to an efficient implementation is the cycle test, i.e., how do we quickly de-
termine if adding an edge induces a cycle in T . We can view Kruskal’s algorithm as
a process that starts with a forest of singleton vertices and gradually connects the
graph by adding edges and growing the trees. The algorithm adds edges that connect
distinct trees (connected components) - an edge whose endpoints are within the same
tree may not be added since it induces cycles. Therefore we can maintain a data
structure that supports the following operations
Find For a vertex, find out which connected component it belongs to.
elements in set j. For obvious reasons, we would change labels of the smaller
subset.
Although the time for a single union operation can be quite large, in the context of
MST, we will analyze a sequence of union operations - there are at most n − 1 union
operations in Kruskal’s algorithm. Consider a fixed element x ∈ S. The key to the
analysis lies in the answer to the following question.
How many times can the label of x change ?
Every time there is a label change, the size of the set containing x increases by at least a factor of two (Why ?). Since the size of a set is ≤ n, this implies that the maximum number of label changes is log n. Kruskal's algorithm involves |E| finds and at most |V | − 1 unions; from the previous discussion this can be done in O(m + n log n) steps using the array data-structure.
So we are in a situation where Find takes O(log n) and Union operation takes O(1).
Seemingly, we haven’t quite gained anything so let us use the following heuristic.
Path compression
When we do a Find(x) operation, let x0 = root of x, x1 , x2 . . . x be the sequence
of nodes visited. Then we make the subtrees rooted at xi (minus the subtree rooted
at xi+1 ) the children of the root node. Clearly, the motivation is to bring more nodes
closer to the root node, so that the time for the Find operation decreases. And Path
compression does not increase the asymptotic cost of the current Find operation (it is at most a factor of two).
While it is intuitively clear that it should give us an advantage, we have to rigor-
ously analyze if it indeed leads to any asymptotic improvement.
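An illustrative sketch of the tree-based union-find with union by size and the path compression heuristic just described, together with its use in Kruskal's algorithm for the cycle test (class and function names are assumptions).

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:            # path compression: hang the visited nodes off the root
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False                         # already in the same component
        if self.size[rx] < self.size[ry]:        # attach the smaller tree below the larger one
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
        return True

def kruskal(n, edges):
    # edges: list of (weight, u, v); returns the edges of a minimum spanning forest.
    uf = UnionFind(n)
    return [(u, v) for w, u, v in sorted(edges) if uf.union(u, v)]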
Consider the tower function 2^{2^{2^{···}}}, which can be defined more formally as the function

B(i) = 2^1 for i = 0
B(i) = 2^2 for i = 1
B(i) = 2^{B(i−1)} otherwise, for i ≥ 2
Let

log^{(i)} n = n for i = 0
log^{(i)} n = log(log^{(i−1)} n) for i ≥ 1
The inverse of B(i) is defined as the log* function. In other words,

log* 2^{2^{···^2}} (a tower of n 2's) = n + 1
We will use the function B() and log∗ () to analyze the effect of path compression.
We will say that two integers x and y are in the same block if log∗ x = log∗ y.
Although log* appears to grow slower than anything we can imagine (for example, log* 2^{65536} ≤ 5), there is a closely related function called the inverse Ackermann function that is even slower !
Ackermann's function is defined as
A(1, j) = 2^j for j ≥ 1
A(i, 1) = A(i − 1, 2) for i ≥ 2
A(i, j) = A(i − 1, A(i, j − 1)) for i, j ≥ 2
Note that A(2, j) is similar to B(j) defined earlier. The inverse-Ackermann function is given by

α(m, n) = min{i ≥ 1 | A(i, ⌊m/n⌋) > log n}
To get a feel for how slowly it grows, verify that

α(n, n) = 4 for n = 2^{2^{···^2}}, a tower of 16 2's.
Block charge If the block number of the parent node is strictly greater than that of the node, then the node incurs a block charge. Clearly the maximum number of block charges for a single Find operation is O(log* n).
Path charge Any charge incurred by a Find operation that is not a block charge.
From our previous observation, we will focus on counting the path charges.
Observation 3.1 Once the rank of a node and its parent are in different blocks, they
continue to be in different blocks, i.e. once a node incurs block charges, it will never
incur any more path charges.
The parent of a node may change because of path compression preceded by one or
more union operations, but the new parent will have a rank higher than the previous
parent. Consequently, a node in block j can incur path charges at most B(j) − B(j −
1) ≤ B(j) times. Since the number of elements with rank r is at most n/2^r, the number of elements having ranks in block i is

n/2^{B(i−1)+1} + n/2^{B(i−1)+2} + . . . + n/2^{B(i)} = n (1/2^{B(i−1)+1} + 1/2^{B(i−1)+2} + . . .) ≤ 2n · 1/2^{B(i−1)+1} = n/2^{B(i−1)}
Therefore the total number of path charges for elements in block i is at most n/2^{B(i−1)} · B(i), which is O(n). For all the log* n blocks the cumulative path charges are O(n log* n), to which we have to add the O(m log* n) block charges.
Remark For some technical reason, in a Find operation, the child of the root node
always incurs a block charge (Why ?)
Exercise 3.6 Can you construct an example to show that the above analysis is tight
?
Chapter 4
Optimization II :
Dynamic Programming
Let us try to solve the knapsack problem using a somewhat different strategy. Let F_i(y) denote the optimal solution for a knapsack of capacity y using only the objects in {x_1, x_2 . . . x_i}. Under this notation, F_n(M) is the final solution to the knapsack problem with n objects and capacity M. Let us further assume that all the weights, as well as M, are integral. We can write the following equation

F_i(y) = max{F_{i−1}(y), F_{i−1}(y − w_i) + p_i}
where the two terms correspond to inclusion or exclusion of object i in the optimal
solution. Also note that, once we decide about the choice of xi , the remaining choices
must be optimal with respect to the remaining objects and the residual capacity of
the knapsack.
We can represent the above solution in a tabular form, where the rows correspond
to the residual capacity from 1 to M and the column i represents the choice of objects
restricted to the subset {1, 2 . . . i}.
The first column corresponds to the base case of the subset containing only object
{x1 } and varying the capacity from 1 to M. Since the weight of the object is w1 ,
for all i < w1 , F1 (i) = 0 and p1 otherwise. From the recurrence, it is clear that the
i-th column can be filled up from the (i − 1)-st column and therefore after having
computed the entries of column 1, we can successively fill up all the columns (till n).
The value of Fn (M) is readily obtained from the last column.
The overall time required to fill up the table is proportional to the size of the
table multiplied by the time to compute each entry. Each entry is a function of two
previously computed terms and therefore the total running time is O(n · M).
Comment The running time O(nM) should be examined carefully. M is the capacity of the knapsack, for which log M bits are necessary for its representation. For the
remaining input data about n objects, let us assume that we need b · n bits where b
is the number of bits per object. This makes the input size N = b · n + log M so if
log M = N/2 then the running time is clearly exponential (M = 2^{N/2}).
In the knapsack problem t(s) = O(1). The space bound is proportional to part of table
that must be retained to compute the remaining entries. This is where we can make
substantial savings by sequencing the computation cleverly. Dynamic programming
is often seen as a trade-off between space and running time, where we are reducing the
running time at the expense of extra space. By storing the solutions of the repeated
subproblems, we save the time for recomputation. For the knapsack problem, we only
need to store the previous column - so instead of M · n space, we can do with O(M) space.
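An illustrative sketch of the tabular computation with the space saving discussed above: only the previous column (of size M + 1) is retained. Integral weights are assumed.

def knapsack(weights, profits, M):
    # F_i(y) computed column by column; prev holds F_{i-1}(0..M).
    prev = [0] * (M + 1)                           # F_0(y) = 0 for every capacity y
    for w, p in zip(weights, profits):
        curr = prev[:]                             # start from the exclusion case F_{i-1}(y)
        for y in range(w, M + 1):
            curr[y] = max(prev[y], prev[y - w] + p)   # exclude or include object i
        prev = curr
    return prev[M]                                 # F_n(M)

print(knapsack([2, 4, 6, 9], [10, 10, 12, 18], 15))   # 38, the instance of Example 3.1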
A → BC A → a
where A, B, C are non-terminals and a is a terminal (symbol of the alphabet). All
derivations must start from a special non-terminal S which is the start symbol. We
will use the notation S ⇒* α to denote that S can derive the sentence α in a finite number of steps by applying production rules of the grammar.
The basis of our algorithm is the following observation
Observation 4.1 A ⇒* x_i x_{i+1} . . . x_k iff A ⇒ BC and there exists i ≤ j < k such that B ⇒* x_i x_{i+1} . . . x_j and C ⇒* x_{j+1} . . . x_k.
There are k − i possible partitions of the string and we must check for all partitions
if the above condition is satisfied. More generally, for the given string x1 x2 . . . xn , we
consider all substrings Xi,k = xi xi+1 . . . xk where 1 ≤ i < k ≤ n - there are O(n2 ) such
substrings. For each substring, we try to determine the set of non-terminals A that
can derive this substring. To determine this, we use the previous observation. Note that both B and C derive substrings that are strictly smaller than X_{i,k}. For
substrings of length one, it is easy to check which non-terminals derive them, so these
serve as base cases.
We define a two dimensional table T such that the entry T (s, t) corresponds to all
non-terminals that derive the substring starting at xs of length t. For a fixed t, the
possible values of s are 1, 2, . . . n − t + 1 which makes the table triangular. Each entry
in the table can be filled up in O(t) time for column t. That yields a total running time of Σ_{t=1}^{n} O((n − t) · t), which is O(n^3). The space required is the size of the table, which is O(n^2). This algorithm is known as CYK (Cocke-Younger-Kasami) after the inventors.
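An illustrative sketch of the CYK table computation; the CNF grammar is passed as two maps (a representation assumed here, not the notation of the notes).

def cyk(x, binary_rules, unary_rules, start='S'):
    # binary_rules: {(B, C): set of A with A -> BC}; unary_rules: {a: set of A with A -> a}.
    # T[s][t] is the set of non-terminals deriving the substring of length t+1 starting at position s.
    n = len(x)
    T = [[set() for _ in range(n)] for _ in range(n)]
    for s in range(n):                                    # base case: substrings of length one
        T[s][0] = set(unary_rules.get(x[s], set()))
    for t in range(1, n):                                 # lengths 2 .. n
        for s in range(n - t):
            for split in range(t):                        # B derives x[s..s+split], C the rest
                for B in T[s][split]:
                    for C in T[s + split + 1][t - split - 1]:
                        T[s][t] |= binary_rules.get((B, C), set())
    return start in T[0][n - 1]

# Tiny example: S -> AB, A -> a, B -> b recognizes the string "ab".
print(cyk("ab", {('A', 'B'): {'S'}}, {'a': {'A'}, 'b': {'B'}}))   # True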
1 We are using S_i for both the length and the sequence interchangeably.
Computing S_i takes O(i) time and therefore, we can compute the longest monotonic subsequence in O(n^2) steps. The space required is O(n).
Can we improve the running time ? For this, we will actually address a more general problem, namely, for each j we will compute a longest monotonic subsequence of length j (if it exists). For each i ≤ n, let M_{i,j}, j ≤ i, be a monotonic subsequence of length j in x_1 x_2 . . . x_i. Clearly, if M_{i,j} exists then M_{i,j−1} exists, and the length of the longest monotonic subsequence of x_1 . . . x_i is the largest j for which M_{i,j} exists.
Among all monotonic subsequences of length j in x_1 . . . x_i, we will compute an M_{i,j} which has the minimum value of the last element. For example, among the subsequences 2,4,5,9 and 1,4,5,8 (both of length 4), we will choose the second one, since 8 < 9.
Let ℓ_{i,j} be the last element of M_{i,j} - by convention we will implicitly initialise all ℓ_{i,j} = ∞. We can write a recurrence for ℓ_{i,j} as follows

ℓ_{i+1,j} = x_{i+1} if ℓ_{i,j−1} ≤ x_{i+1} < ℓ_{i,j}
ℓ_{i+1,j} = ℓ_{i,j} otherwise
This follows since M_{i+1,j} is either M_{i,j}, or else x_{i+1} must be the last element of M_{i+1,j}. If we implement this idea in a straightforward fashion, the running time will be Ω(n^2),
since there are potentially n subsequences and all these may have to be updated for
every 1 ≤ i ≤ n. To improve it further, we make an important additional observation
Observation 4.2 The ℓi,j ’s form a non-decreasing sequence in j for all i.
This paves way for updating and maintaining the information about ℓi,j in a compact
manner. We will maintain a sorted sequence of ℓi,j such that when we scan xi+1 ,
we can quickly identify ℓ_{i,k} such that ℓ_{i,k−1} ≤ x_{i+1} < ℓ_{i,k}. Moreover, S_{i+1}, the longest monotonic subsequence ending at x_{i+1}, is of length k. We can reconstruct S_{i+1} by keeping track of the predecessor. We can easily maintain the ℓ_{i,j}'s in a dynamic dictionary data structure (like an AVL tree) in O(log n) time per update. Therefore, the total running time reduces to O(n log n).
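An illustrative sketch of the O(n log n) method: tails[j] plays the role of ℓ_{i,j+1}, the smallest possible last element of a monotonic (here strictly increasing) subsequence of length j + 1 seen so far, and binary search over the sorted tails replaces the dynamic dictionary.

import bisect

def longest_increasing_subsequence_length(x):
    tails = []                              # tails[j] = smallest last element of a subsequence of length j+1
    for v in x:
        k = bisect.bisect_left(tails, v)    # first position with tails[k] >= v
        if k == len(tails):
            tails.append(v)                 # v extends the longest subsequence found so far
        else:
            tails[k] = v                    # v is a smaller last element for length k+1
    return len(tails)

print(longest_increasing_subsequence_length([2, 4, 9, 5, 1, 8]))   # 4, e.g. 2, 4, 5, 8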
Let g_j^*(i) denote the optimal j-step function for this problem restricted to the points 1, . . . , i - we are interested in computing g_k^*(n).
Exercise 4.3 Show that g_1^*(j) = (1/j) Σ_{i=1}^{j} h(i), i.e., it is a constant function equal to the mean.
g_i^*(j) = min_{s<j} { g_{i−1}^*(s) + D_{s+1,j} }

where D_{j,ℓ} denotes the sum of squares of the deviation of h() from the mean value of h in the interval [j, ℓ].
The recurrence captures the property that an optimal k step approximation can be
expressed as an optimal k − 1 step approximation till an intermediate point followed
by the best 1 step approximation of the remaining interval (which is the mean value
in this interval from our previous observation). Assuming that Dj,ℓ are precomputed
for all 1 ≤ j < ℓ ≤ n, we can compute the gi∗ (j) for all 1 ≤ i ≤ k and 1 ≤ j ≤ n in a
table of size kn. The entries can be computed in increasing order of i and thereafter
in increasing order of j’s. The base case of i = 1 can be computed directly from the
result of the previous exercise. Each entry can be computed from j − 1 previously
computed entries yielding a total time of
Σ_{i=1}^{k} Σ_{j=1}^{n} O(j) = O(k · n^2)
The space required is proportional to the previous column (i.e. we need to keep track
of the previous value of i), given that D(j, ℓ) can be stored/computed quickly.
Exercise 4.4 Complete the analysis of the above algorithm considering the compu-
tation of the D(i, j).
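An illustrative sketch of the dynamic program, with D(j, ℓ) obtained in O(1) time from prefix sums of h and h^2 (one way of addressing Exercise 4.4). The 1-based indexing of h and the assumption k ≤ n are conventions of this sketch.

def best_k_step_approximation(h, k):
    # h[1..n] are the sample values (h[0] is unused); returns the minimum total squared error
    # of a k-step (piecewise constant) approximation.
    n = len(h) - 1
    S = [0.0] * (n + 1)                     # prefix sums of h
    S2 = [0.0] * (n + 1)                    # prefix sums of h^2
    for i in range(1, n + 1):
        S[i] = S[i - 1] + h[i]
        S2[i] = S2[i - 1] + h[i] ** 2

    def D(j, l):
        # Sum of squared deviations of h[j..l] from its mean.
        s, s2, length = S[l] - S[j - 1], S2[l] - S2[j - 1], l - j + 1
        return s2 - s * s / length

    INF = float('inf')
    g = [[INF] * (n + 1) for _ in range(k + 1)]
    for j in range(1, n + 1):
        g[1][j] = D(1, j)                                     # base case: one step is the mean
    for i in range(2, k + 1):
        for j in range(i, n + 1):                             # need at least i points for i steps
            g[i][j] = min(g[i - 1][s] + D(s + 1, j) for s in range(i - 1, j))
    return g[k][n]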
log, we have changed the product of the probabilities to a sum of logs (note that the logs of probabilities are negative numbers).
We can write a recurrence based on the following observation.
The optimal least-weight path x1 , x2 . . . xn starting at vertex x1 with label σ1 σ2 . . . σn
is such that the path x2 x3 . . . xn is optimal with respect to the label σ2 , σ3 . . . σn . For
paths of lengths one, it is easy to find the optimal labelled path. Let Pi,j (v) denote
the optimal labelled path for the labels σi σi+1 . . . σj starting at vertex v. We are
interested in P1,n (vo ).
Starting from the base case of length one paths, we build length 2 paths from each
vertex and so on. Note that the length i + 1 paths from a vertex v can be built from
length i paths from w (computed for all vertices w ∈ V ). The paths that we compute
are of the form Pi,n for all 1 ≤ i ≤ n. Therefore we can compute the entries of the
table starting from i = n − 1. From the previous recurrence, we can now compute the entries of P_{n−2,n} etc. by comparing at most |V | entries (more specifically the outdegree) for each starting vertex v 2. Given that the size of the table is n · |V |, the total time required to compute all the entries is O(n · |V |^2). However, the space requirement
can be reduced to O(|V |) from the observation that only the (i − 1) length paths are
required to compute the optimal i length paths.
2 You can argue that each iteration takes O(|E|) steps where |E| is the number of edges.
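An illustrative sketch of the table computation under the assumptions above: the automaton is given as a list of weighted labelled edges, P(v) holds the least weight of a path starting at v that spells σ_i . . . σ_n, and the answer is P_{1,n}(v_0). The representation and names are assumptions of this sketch.

def least_weight_labelled_path(vertices, edges, sigma, v0):
    # edges: list of (u, w, label, weight); sigma: the label sequence (0-indexed here).
    INF = float('inf')
    P = {v: 0.0 for v in vertices}                 # empty label sequence costs nothing
    for i in range(len(sigma) - 1, -1, -1):        # fill the table backwards from the last symbol
        newP = {v: INF for v in vertices}
        for (u, w, label, weight) in edges:
            if label == sigma[i] and P[w] != INF:
                newP[u] = min(newP[u], weight + P[w])
        P = newP                                   # only the previous column is retained
    return P[v0]                                   # INF if no such labelled path exists

# Toy automaton on vertices {0, 1} with labels 'a' and 'b'.
edges = [(0, 1, 'a', 1.0), (1, 0, 'b', 2.0), (0, 0, 'a', 5.0)]
print(least_weight_labelled_path({0, 1}, edges, "ab", 0))   # 3.0 via 0 -a-> 1 -b-> 0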
Chapter 5
Searching
The search begins from the topmost level L_k where T_k can be determined in constant time. If l_k = E or r_k = E then the search is successful, else we recursively search among the elements [l_k, r_k] ∩ L_0. Here [l_k, r_k] denotes the closed interval bounded
by lk and rk . This is done by searching the elements of Lk−1 which are bounded by
lk and rk . Since both lk , rk ∈ Lk−1 , the descendence from level k to k − 1 is easily
achieved in O(1) time. In general, at any level i we determine the tuple Ti by walking
through a portion of the list Li . If li or ri equals E then we are done else we repeat
this procedure by descending to level i − 1.
In other words, we refine the search progressively until we find an element in S
equal to E or we terminate when we have determined (l0 , r0 ). This procedure can
also be viewed as searching in a tree that has variable degree (not necessarily two as
in binary tree).
Of course, to be able to analyze this algorithm, one has to specify how the lists Li
are constructed and how they are dynamically maintained under deletions and addi-
tions. Very roughly, the idea is to have elements in i-th level point to approximately
2i nodes ahead (in S) so that the number of levels T is approximately O(log n). The
time spent at each level i depends on [li+1 , ri+1 ] Li and hence the objective is to
keep this small. To achieve these conditions on-line, we use the following intuitive
method. The nodes from the bottom-most layer (level 0) are chosen with probability
p (for the purpose of our discussion we shall assume p = 0.5) to be in the first level.
Subsequently at any level i, the nodes of level i are chosen to be in level i + 1 inde-
pendently with probability p and at any level we maintain a simple linked list where
the elements are in sorted order. If p = 0.5, then it is not difficult to verify that for a list of size n, the expected number of elements in level i is approximately n/2^i and they are spaced about 2^i elements apart. The expected number of levels is clearly O(log n) (at which point we are left with just a trivial-length list), and the expected space requirement is O(n).
To insert an element, we first locate its position using the search strategy described
previously. Note that a byproduct of the search algorithm is the set of all the T_i's. At level
0, we choose it with probability p to be in level L1 . If it is selected, we insert it in
the proper position (which can be trivially done from the knowledge of T1 ), update
the pointers and repeat this process from the present level. Deletion is very similar
and it can be readily verified that deletion and insertion have the same asymptotic
run time as the search operation. So we shall focus on this operation.
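A compact illustrative sketch of a skip list supporting search and insert, where each inserted element is promoted to the next level with probability p = 0.5; deletion is analogous. The dictionary-based node representation and the level cap are assumptions of this sketch.

import random

class SkipList:
    MAX_LEVEL = 32

    def __init__(self, p=0.5):
        self.p = p
        self.head = {'key': float('-inf'), 'next': [None] * SkipList.MAX_LEVEL}
        self.level = 1                              # number of levels currently in use

    def _random_level(self):
        lvl = 1
        while random.random() < self.p and lvl < SkipList.MAX_LEVEL:
            lvl += 1                                # promoted with probability p per level
        return lvl

    def search(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):     # descend from the topmost level
            while node['next'][i] is not None and node['next'][i]['key'] < key:
                node = node['next'][i]
        node = node['next'][0]
        return node is not None and node['key'] == key

    def insert(self, key):
        update = [self.head] * SkipList.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node['next'][i] is not None and node['next'][i]['key'] < key:
                node = node['next'][i]
            update[i] = node                        # last node visited on level i
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = {'key': key, 'next': [None] * lvl}
        for i in range(lvl):
            new['next'][i] = update[i]['next'][i]   # splice the new node into level i
            update[i]['next'][i] = new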
5.1.2 Analysis
To analyze the run-time of the search procedure, we look at it backwards, i.e., retrace
the path from level 0. The search time is clearly the length of the path (number of
links) traversed over all the levels. So one can count the number of links one traverses
before climbing up a level. In other words the expected search time can be expressed
in the following recurrence
C(k) = (1 − p)(1 + C(k)) + p(1 + C(k − 1))
where C(k) is the expected cost for climbing k levels. From the boundary condition
C(0) = 0, one readily obtains C(k) = k/p. For k = O(log n), this is O(log n). The
recurrence captures the crux of the method in the following manner. At any node of
a given level, we climb up if this node has been chosen to be in the next level or else
we add one to the cost of the present level. The probability of this event (climbing up
a level) is p which we consider to be a success event. Now the entire search procedure
can be viewed in the following alternate manner. We are tossing a coin which turns up
heads with probability p - how many times should we toss to come up with O(log n)
heads ? Each head corresponds to the event of climbing up one level in the data
structure and the total number of tosses is the cost of the search algorithm. We are
done when we have climbed up O(log n) levels (there is some technicality about the
number of levels being O(log n) but that will be addressed later). The number of
heads obtained by tossing a coin N times is given by a Binomial random variable X
with parameters N and p. Using Chernoff bounds (see Appendix, equation B.1.5),
for N = 15 log n and p = 0.5, Pr[X ≤ 1.5 log n] ≤ 1/n² (using ε = 9/10 in the
equation). Using appropriate constants, we can get rapidly decreasing probabilities of the
form Pr[X ≤ c log n] ≤ 1/n^α for c, α > 0, where α increases with c. These constants can
be fine tuned although we shall not bother with such an exercise here.
We thus state the following lemma.
Lemma 5.1 The probability that access time for a fixed element in a skip-list data
structure of length n exceeds c log n steps is less than O(1/n²) for an appropriate
constant c > 1.
Proof We compute the probability of obtaining fewer than k (the number of levels
in the data-structure) heads when we toss a fair coin (p = 1/2) c log n times for some
fixed constant c > 1. That is, we compute the probability that our search procedure
exceeds c log n steps. Recall that each head is equivalent to climbing up one level
and we are done when we have climbed k levels. To bound the number of levels, it
is easy to see that the probability that any element of S appears in level i is at most
1/2^i, i.e., it has turned up i consecutive heads. So the probability that any fixed
element appears in level 3 log n is at most 1/n³. The probability that k > 3 log n is
the probability that at least one element of S appears in L_{3 log n}. This is clearly at
most n times the probability that any fixed element survives and hence the probability
of k exceeding 3 log n is less than 1/n².
Given that k ≤ 3 log n we choose a value of c, say c0 (to be plugged into equation
B.1.6 of Chernoff bounds) such that the probability of obtaining fewer than 3 log n
heads in c0 log n tosses is less than 1/n². The search algorithm for a fixed key exceeds
c0 log n steps if one of the above events fails; either the number of levels exceeds 3 log n
or we get fewer than 3 log n heads from c0 log n tosses. This is clearly bounded by the sum
of the failure probabilities of the individual events, which is O(1/n²). □
Theorem 5.1 The probability that the access time for any arbitrary element in a skip-
list exceeds O(log n) is less than 1/n^α for any fixed α > 0.
Proof: A list of n elements induces n + 1 intervals. From the previous lemma,
the probability P that the search time for a fixed element exceeds c log n is less
than 1/n². Note that all elements in a fixed interval [l_0, r_0] follow the same path in
the data-structure. It follows that the probability that the access time of some interval
exceeds O(log n) is at most (n + 1) times P. As mentioned before, the constants can be chosen
appropriately to achieve this. □
It is possible to obtain even tighter bounds on the space requirement for a skip
list of n elements. We can show that the expected space is O(n) since the expected
number of times a node survives is 2.
Exercise 5.1 Prove the following stronger bound on space using Chernoff bounds -
for any constant α > 0, the probability of the space exceeding 2n + α · n is less than
exp(−Ω(α²·n)).
priorities and then count the number of elements that Q can see during the course
of their insertions. This method (of assigning the random numbers on-line) makes
arguments easier and the reader must convince himself that it doesn’t affect the final
results. Q can see an element Ni if there are no previously inserted elements in
between.
Lemma 5.2 The tree constructed by inserting the nodes in order of their priorities
(highest priority is the root) is the same as the tree constructed on-line.
Lemma 5.3 The number of nodes Q sees is exactly the number of comparisons per-
formed for searching Q. In fact, the order in which it sees corresponds to the search
path of Q.
Theorem 5.2 The expected length of search path in RST is O(HN ) where HN is the
N-th harmonic number.
Theorem 5.3 The probability that the search time exceeds 2 log n comparisons in a
randomized treap is less than O(1/n).
A similar technique can be used for counting the number of rotations required for
RST during insertion and deletions. Backward analysis is a very elegant technique
for analyzing randomized algorithms, in particular in a paradigm called randomized
incremental construction.
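Lemma 5.2 suggests a simple off-line way of producing the same tree: assign random priorities and insert the keys into an ordinary BST in decreasing priority order. The Python sketch below (illustrative names, not from the text) does this and, for a query q, counts how many inserted elements q "sees", which by Lemma 5.3 equals the number of comparisons for searching q; averaging over random runs gives an empirical O(H_N) as in Theorem 5.2.
import random

def build_by_priority(keys):
    """Assign random priorities and insert into a plain BST in decreasing priority order."""
    prio = {k: random.random() for k in keys}
    order = sorted(keys, key=lambda k: prio[k], reverse=True)
    root = None
    for k in order:
        root = insert(root, k)
    return root, order

def insert(node, key):
    if node is None:
        return {'key': key, 'left': None, 'right': None}
    side = 'left' if key < node['key'] else 'right'
    node[side] = insert(node[side], key)
    return node

def comparisons_for(order, q):
    """Number of elements q sees (no previously inserted element strictly in between)."""
    seen = 0
    for i, x in enumerate(order):
        lo, hi = min(x, q), max(x, q)
        if not any(lo < y < hi for y in order[:i]):
            seen += 1
    return seen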
Hash by chaining: The more the collisions the worse the performance. Look at a
sequence O1(x1), O2(x2), . . . , On(xn) where Oi ∈ {Insert, Delete, Search} and xi ∈ U.
Let us make the following assumptions
Claim: Total expected cost = O((1 + β)n) where β = n/m (load factor).
Proof: Expected cost of the (k + 1)-th operation = expected number of elements in the
location ≤ 1 + k·(1/m), assuming all the previous operations were Insert.
So the total expected cost ≤ Σ_{k=1}^{n} (1 + k/m) = n + n(n+1)/2m ≈ (1 + β/2)n. This is worst case
over operations but not over elements. □
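A minimal Python sketch of hashing with chaining (illustrative; the hash function here is just Python's built-in hash reduced mod m, an assumption for demonstration). The cost of each operation is proportional to the length of the chain that is scanned, which is what the claim above bounds in expectation.
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # one chain per location

    def _chain(self, x):
        return self.table[hash(x) % self.m]

    def insert(self, x):
        chain = self._chain(x)
        if x not in chain:                    # cost proportional to len(chain)
            chain.append(x)

    def search(self, x):
        return x in self._chain(x)            # cost proportional to len(chain)

    def delete(self, x):
        chain = self._chain(x)
        if x in chain:
            chain.remove(x)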
|{h | h ∈ H and h(x) = h(y)}| ≤ c·|H|/m
for some small constant c. Roughly, Σ_h δ_h(x, y) ≤ c·|H|/m.
Claim:
(1/|H|) Σ_{h∈H} (1 + δ_h(x, S)) ≤ 1 + c·n/m
where |S| = n.
Proof: Working from the LHS, we obtain
1 + (1/|H|) Σ_h δ_h(x, S)
= 1 + (1/|H|) Σ_h Σ_{y∈S} δ_h(x, y)
= 1 + (1/|H|) Σ_y Σ_h δ_h(x, y)
≤ 1 + (1/|H|) Σ_y c·|H|/m
= 1 + c·n/m
So the expected cost of n operations = Σ_i (1 + c·i/m) ≤ (1 + cβ)n. □
If h_{a,b}(x) = h_{a,b}(y) then for some q ∈ [0 . . . m − 1] and r, s ∈ [0 . . . N/m − 1]
ax + b ≡ (q + rm) mod N
ay + b ≡ (q + sm) mod N
This has a unique solution for a, b once q, r, s are fixed. So there are a total of m·(N/m)²
= N²/m solutions. Also, since |H′| = N², therefore H′ is "1"-universal.
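A short Python sketch of a random member of this family, h_{a,b}(x) = ((a·x + b) mod N) mod m (the ranges of a and b below follow one common convention and are an assumption for illustration):
import random

def make_hash(N, m):
    """Pick h_{a,b}(x) = ((a*x + b) mod N) mod m at random; N is assumed prime and >= |U|."""
    a = random.randrange(1, N)     # a in [1, N-1]
    b = random.randrange(0, N)     # b in [0, N-1]
    return lambda x: ((a * x + b) % N) % m
For a fixed pair x ≠ y, the fraction of (a, b) pairs that collide is O(1/m), which is exactly the property used in the expected-cost calculation above.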
Σ_i n_i² = Σ_i ( 2·Σ_{x,y | h(x)=h(y)=i} δ(x, y) + n_i )
        = Σ_i n_i + 2·Σ_i Σ_{x,y | h(x)=h(y)=i} δ(x, y)
        = n + 2·Σ_{x,y | h(x)=h(y)} δ(x, y)
Taking expectation on both sides (with respect to the choice of a random hash function),
the R.H.S. is 2E[Σ_{x,y∈S} δ(x, y)] + n. The expectation term equals 2·C(n, 2)·(2/m) since
E[δ(x, y)] = Pr[h(x) = h(y)] ≤ 2/m.
√N. Otherwise, we search for PRED(x) among the elements of TOP. Note that
either we search within the subtree or in the set TOP but not both. Suitable termi-
nating conditions can be defined. The search time can be captured by the following
recurrence
T(N) = T(√N) + O(1)
which yields T(N) = O(log log N). The TOP data structure is built on the log N/2 higher-order
bits of the keys and the search structure within a subtree is built on the lower log N/2 bits.
The space complexity of the data structure satisfies the recurrence
S(m) = (√m + 1)·S(√m) + O(√m)
which yields S(N) = O(N log log N). The additive term accounts for the storage of the elements
of TOP.
Exercise 5.2 Propose a method to decrease the space bound to O(N) - you may want
to prune the lower levels of the tree.
For the actual implementation, the tree structure is not explicitly built - only the
relevant locations are stored. To insert an element x in this data-structure, we first
find its predecessor and successor. If it is the first element in the subtree then the
ancestor at level log N/2 is appropriately initialized and SUCC(PRED(x)) ← x where
PRED(x) is in a different subtree. Notice that we do not have to process within the
subtree any further. Otherwise, we find its PRED within the subtree. It may appear
that the time for insertion is O((log log N)²) since it satisfies the recurrence
I(N) = I(√N) + O(log log N), i.e., I(N) ∈ O((log log N)²).
To match the O(log log N) time bound for searching, we will actually do the search
for PRED simultaneously with the insert operation.
Exercise 5.3 Show how to implement delete operation in O(log log N) steps.
One of the most serious drawbacks of this data structure is the high space require-
ment, which is proportional to the size of the universe. Note that, as long as
log log N ∈ o(log n), this is faster than the conventional heap. For example, when
N ≤ 2^(2^(log n / log log n)), this holds an advantage, but the space is exponential. However, we
can reduce the space to O(n) by using the techniques mentioned in this chapter to
store only those nodes that are coloured black.
Chapter 6
Searching in a dictionary is one of the most primitive kinds of search problems and it is
relatively simple because of the property that the elements can be ordered. Instead,
suppose that the points are from the d-dimensional space R^d - for searching we can build
a data-structure based on lexicographic ordering. If we denote a d-dimensional point
by (x0 , x1 . . . xd ) then show the following.
Exercise 6.1 The immediate predecessor of a query point can be determined in O(d+
log n) comparisons in a set of n d-dimensional points.
Queries can be far more sophisticated than just point-queries but we will address the
more restricted kind.
and right subtree R(v) contains all the points strictly greater than xv . It is easy to
see that we can build a balanced tree using median as the splitter value.
To report the points in the range query [xℓ : xu ] we search with xℓ and xu in the
tree T . Let ℓ1 and ℓ2 be the leaves where the searches end. Then the points in the
interval [xℓ : xu ] are the points stored between the leaves ℓ1 and ℓ2 . Another way to
view the set of points is the union of the leaves of some subtrees of T .
Exercise 6.2 Show that for any query interval, the points belong to at most 2 log n
subtrees.
Proof Sketch If you examine the search path of xℓ and xu , they share a common
path from root to some vertex (may be the root itself), where the paths fork to the
left and right - let us call this the fork node. The leaf nodes correspond to the union
of the right subtrees of the left path and the left subtrees of the right path.
Complexity The tree takes O(n) space and O(n log n) time in preprocessing. Each
query takes O(log n + k) time, where k is the number of points in the interval, i.e.,
the output size. The counting query takes O(log n) time. This is clearly the best we
can hope for.
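For a static point set, the same O(log n + k) reporting and O(log n) counting bounds can also be realized with a sorted array and binary search; a minimal Python sketch (using the standard bisect module), given here only as an alternative illustration of the one-dimensional query:
import bisect

def build(points):
    return sorted(points)

def range_report(sorted_pts, x_lo, x_hi):
    """Report all points in [x_lo, x_hi]; O(log n + k)."""
    lo = bisect.bisect_left(sorted_pts, x_lo)
    hi = bisect.bisect_right(sorted_pts, x_hi)
    return sorted_pts[lo:hi]

def range_count(sorted_pts, x_lo, x_hi):
    """Counting query; O(log n)."""
    return bisect.bisect_right(sorted_pts, x_hi) - bisect.bisect_left(sorted_pts, x_lo)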
You can hypothetically associate a half-open interval with each internal node x,
(l(x) : r(x)] where l(x) (r(x)) is the value of the leftmost (rightmost) leaf node in the
subtree rooted at x.
Exercise 6.3 Show how to use threading to solve the range query in a BST without
having leaf based storage.
to the final output, we cannot afford to list out all the points of the vertical slab.
However, if we had the one dimensional data structure available for this slab, we can
quickly find out the final points by doing a range query with [yℓ : yu ]. A naive scheme
will build the data structure for all possible vertical slabs but we can do much better
using the following observation.
It follows from the same argument as any interval. Each canonical slab corresponds
to the vertical slab (the corresponding [xℓ : xu ]) spanned by an internal node. We
can therefore build a one-dimensional range tree for all the points spanned by corre-
sponding vertical slab - this time in the y-direction. So the final answer to the two
dimensional range query is the union of at most 2 log n one-dimensional range queries,
giving a total query time of Σ_{i=1}^{t} O(log n + k_i) where k_i is the number of output points
in slab i among t slabs and Σ_i k_i = k. This results in a query time of O(t log n + k)
where t is bounded by 2 log n.
The space is bounded by O(n log n) since in a given level of the tree T , a point is
stored exactly once.
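A compact Python sketch of this two-level scheme (names are illustrative): a balanced tree on x is built by splitting at the median, every node stores the y-coordinates of the points it spans in sorted order, and a query decomposes [xℓ : xu] into canonical nodes and performs a one-dimensional y-search in each.
import bisect

def build(points):
    """points must be sorted by x; returns a range-tree node."""
    if not points:
        return None
    mid = len(points) // 2
    return {'xmin': points[0][0], 'xmax': points[-1][0],
            'ys': sorted(p[1] for p in points),            # secondary structure on y
            'left': build(points[:mid]) if len(points) > 1 else None,
            'right': build(points[mid:]) if len(points) > 1 else None}

def count(node, xlo, xhi, ylo, yhi):
    if node is None or node['xmax'] < xlo or node['xmin'] > xhi:
        return 0                                           # slab disjoint from the query
    if xlo <= node['xmin'] and node['xmax'] <= xhi:        # canonical slab: 1-D query on y
        return bisect.bisect_right(node['ys'], yhi) - bisect.bisect_left(node['ys'], ylo)
    return (count(node['left'], xlo, xhi, ylo, yhi) +
            count(node['right'], xlo, xhi, ylo, yhi))

# usage: root = build(sorted(points)); count(root, xl, xu, yl, yu)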
The natural extension of this scheme leads us to d-dimensional range search trees
with the following performance parameters.
Q(d) ≤ 2 log n · Q(d − 1),  Q(1) = O(log n + k),  where Q(d) is the query time in d dimensions for n points.
Exercise 6.4 What is the space bound for d dimensional range trees ?
Exercise 6.5 How would you modify the data structure for counting range queries ?
report all the points associated with U otherwise we search recursively in U. We may
have to search in more than one region - we define a search tree where each region is
associated with a node of the tree. The leaf nodes correspond to the original point
set. In general, this strategy will work for other (than rectangular) kinds of regions
also.
For the rectangular query, we split on x-coordinate and next on y-coordinate,
then alternately on each coordinate. We partition with a vertical line at nodes whose
depth is even and we split with a horizontal line at nodes whose depth is odd. The
time to build the 2-D tree is as follows.
The region R(v) corresponding to a node v is a rectangle which is bounded by
horizontal and vertical lines and it is a subset of the parent node. The root of a tree
is associated with a (bounded) rectangle that contains all the n points. We search
a subtree rooted at v iff the query rectangle intersects the region R(v) associated with node v.
This involves testing if two rectangles (the query rectangle and R(v)) overlap that
can be done in O(1) time. We traverse the 2-D tree, but visit only nodes whose
region is intersected by the query rectangle. When a region is fully contained in the
query rectangle, we can report all the points stored in its subtree. When the traversal
reaches a leaf, we have to check whether the point stored at the leaf is contained in
the query region and, if so, report it.
Search(Q, v)
Since a point is stored exactly once and the description of a region corresponding to
a node takes O(1) space, the total space taken up by the search tree is O(n).
Query Time Let Q(i) be the number of nodes at distance i from the root that
are visited in the worst case by a rectangular query. Since a vertical segment of Q
intersects only horizontal partitioning edges, we can write a recurrence for Q(i) by
observing that the number of nodes can increase by a factor 2 by descending 2 levels.
Hence Q(i) satisfies the recurrence
Q(i + 2) ≤ 2Q(i)
which one can verify to be Q(i) ∈ O(2^⌊i/2⌋), i.e., the total number of nodes visited in the
last level is O(√n).
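The procedure Search(Q, v) referred to above is not reproduced in the text; the following Python sketch (illustrative, not the author's code) follows the strategy just described - alternate the splitting coordinate with depth and visit a node only if the query rectangle can intersect its region.
def build(points, depth=0):
    """2-D tree: split on x at even depths and on y at odd depths."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {'point': points[mid], 'axis': axis,
            'left': build(points[:mid], depth + 1),
            'right': build(points[mid + 1:], depth + 1)}

def search(node, xlo, xhi, ylo, yhi, out):
    if node is None:
        return
    x, y = node['point']
    if xlo <= x <= xhi and ylo <= y <= yhi:
        out.append(node['point'])              # the stored point lies in the query rectangle
    lo, hi = (xlo, xhi) if node['axis'] == 0 else (ylo, yhi)
    c = node['point'][node['axis']]
    if lo <= c:                                # the query can intersect the left region
        search(node['left'], xlo, xhi, ylo, yhi, out)
    if c <= hi:                                # the query can intersect the right region
        search(node['right'], xlo, xhi, ylo, yhi, out)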
6.3 Priority Search Trees
The combination of BST with heap property resulted in a simple strategy for main-
taining balanced search trees called treaps. The heap property was useful to keep a
check on the expected height of the tree within O(log n). What if we want to maintain
a heap explicitly on a set of parameters (say the y coordinates) along with a total
ordering required for binary search on the x coordinates? Such a data structure would
be useful to support a three sided range query in linear space.
A three sided query is a rectangle [xℓ : xu] × [yℓ : ∞], i.e. a half-infinite vertical slab.
If we had a data structure that is a BST on x coordinates, we can first locate the
two points xℓ and xu to determine (at most) 2 log n subtrees whose union contains
the points in the interval [xℓ : xu]. Say, these are T1, T2 . . . Tk. Within each such tree
Ti, we want to find the points whose y coordinates are larger than yℓ. If Ti forms a
max-heap on the y coordinates then we can output the points as follows.
Since v is the root of a max-heap, if v_y < yℓ, then all the descendants of v also have
y coordinates below yℓ and therefore we do not need to search any further. This establishes
the correctness of the search procedure.
Let us mark all the nodes that are visited by the procedure in the second phase. When
we visit a node in the second phase, we either output a point or terminate the search.
For the nodes that are output, we can charge it to the output size. For the nodes that
are not output, let us add a charge to its parent - the maximum charge to a node is
two because of its two children. The first phase takes O(log n) time to determine the
canonical sub-intervals and so the total search time is O(log n + k) where k is number
of output points3 .
Until now, we assumed that such a dual-purpose data structure exists. How do we
construct one ?
First we can build a leaf based BST on the x coordinates. Next, we promote the
points according to the heap ordering. If a node is empty, we inspect its two children
and pull up the larger value. We terminate when no value moves up.
Exercise 6.6 Work out the details of the construction of the second phase and also
analyse the running time.
3
This kind of analysis where we are amortizing the cost on the output points is called filtering
search.
This combo data structure is known as a priority search tree; it takes only O(n) space
and supports three sided queries in O(log n + k) time.
We know that the entire segment (x_i, x_{i+1}) should be inside the hull, so if all the points
of P (other than x_i, x_{i+1}) lie to one side, then CH(P) ⊂ the half-plane supported by x_i, x_{i+1}.
Therefore CH(P) is a sequence of extreme points and the edges joining those points,
and clearly there cannot be a smaller convex set containing P since any point in the
intersection must belong to the convex hull.
For building the hull, we divide the points by a diagonal joining the leftmost and
the rightmost point - the points above the diagonal form the upper hull and the points
below form the lower hull. We also rotate the hull so that the diagonal is parallel to
x-axis. We will describe algorithms to compute the upper hull - computing the lower
hull is analogous.
The planar convex hull is a two dimensional problem and it cannot be done using
a simple comparison model. While building the hull, we will need to test whether
three points (x0 , y0 ), (x1 , y1 ), and (x2 , y2) are clockwise (counter-clockwise) oriented.
Since the x-coordinates of all the points are ordered, all we need to do is test whether
the middle point is above or below the line segment formed by the other two. A triple
of points (p0 , p1 , p2 ) is said to form a right turn iff the determinant
| x0  y0  1 |
| x1  y1  1 |  <  0
| x2  y2  1 |
where (x1 , y1 ) are the co-ordinates of p1 . If the determinant is positive, then the triple
points form a left turn. If the determinant is 0, the points are collinear.
4
inside the triangle
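The determinant test translates directly into code; a small Python helper (hedged: sign conventions depend on the coordinate system - here, as in the text, a negative value is a right turn and a positive value a left turn):
def turn(p0, p1, p2):
    """Sign of the 3x3 determinant: -1 = right turn, +1 = left turn, 0 = collinear."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    return (det > 0) - (det < 0)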
6.4.1 Jarvis March
A very intuitive algorithm for computing convex hulls is one that simply simulates wrapping
a gift (and is hence also called gift wrapping). It starts with any extreme point and repeatedly finds the successive points
in clockwise direction by choosing the point with the least polar angle with respect to
the positive horizontal ray from the first vertex. The algorithm runs in O(nh) time
where h is the number of extreme points in CH(P ). Note that we actually never
compute angles; instead we rely on the determinant method to compare the angle
between two points, to see which is smaller. To the extent possible, we only rely on
algebraic functions when we are solving problems in R^d. Computing angles requires
inverse trigonometric functions which we avoid. Jarvis march starts by computing the
leftmost point ℓ, i.e., the point whose x-coordinate is smallest which takes linear time.
When h is o(log n), Jarvis march is asymptotically faster than Graham’s scan.
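A minimal Python sketch of Jarvis march (illustrative; at least two distinct points are assumed, and ties among collinear candidates are broken by taking the farther point, one common convention):
def jarvis_march(points):
    """Extreme points of the convex hull in clockwise order."""
    def cross(p, q, r):                       # the same determinant test as above
        return (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    start = min(points)                       # leftmost point (smallest x, then smallest y)
    hull, p = [], start
    while True:
        hull.append(p)
        q = points[0] if points[0] != p else points[1]
        for r in points:
            if r == p:
                continue
            t = cross(p, q, r)
            if t > 0 or (t == 0 and dist2(p, r) > dist2(p, q)):
                q = r                         # r lies to the left of p->q: better next vertex
        p = q
        if p == start:
            return hull
Each of the h hull vertices costs one linear scan, giving the O(nh) bound stated above.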
Algorithm Quickhull(Sa , pl , pr )
Input: Given Sa = {p1 , p2 , . . . , pn } and the leftmost extreme point pl and
the rightmost extreme point pr . All points of Sa lie above the line pl pr .
Output: Extreme points of the upper hull of Sa ∪ {pl , pr } in clockwise order.
Step 1. If Sa = {p}, then return the extreme point {p}.
Step 2. Select randomly a pair {p2i−1 , p2i } from the pairs
{p2j−1 , p2j }, j = 1, 2, . . . , ⌊n/2⌋.
Step 3. Select the point pm of Sa which supports a line with slope(p2i−1 p2i ).
(If there are two or more points on this line then choose a pm that is
distinct from pl and pr ). Assign Sa (l) = Sa (r) = ∅.
Step 4. For each pair {p2j−1 , p2j }, j = 1, 2, . . . , ⌊n/2⌋ do the following
( assuming x[p2j−1 ] < x[p2j ] )
Case 1: x[p2j ] < x[pm ]
if left-turn (pm , p2j , p2j−1 ) then Sa (l) = Sa (l) ∪ {p2j−1, p2j }
else Sa (l) = Sa (l) ∪ {p2j−1 }.
Case 2: x[pm ] < x[p2j−1 ]
if left-turn (pm , p2j−1, p2j ) then Sa (r) = Sa (r) ∪ {p2j }
else Sa (r) = Sa (r) ∪ {p2j−1 , p2j }.
Case 3: x[p2j−1 ] < x[pm ] < x[p2j ]
Sa (l) = Sa (l) ∪ {p2j−1 };
Sa (r) = Sa (r) ∪ {p2j }.
Step 5. (i) Eliminate points from Sa (l) which lie below the line joining pl and pm .
(ii) Eliminate points from Sa (r) which lie below the line joining pm and pr .
Step 6. If Sa (l) ≠ ∅ then Quickhull(Sa (l), pl , pm ).
Output pm .
If Sa (r) ≠ ∅ then Quickhull(Sa (r), pm , pr ).
Exercise 6.7 In step 3, show that if the pair {p2i−1 , p2i } satisfies the property that
the line containing p2i−1 p2i does not intersect the line segment pl pr , then it guarantees
that p2i−1 or p2i does not lie inside the triangle △pl p2i pr or △pl p2i−1 pr respectively.
This could improve the algorithm in practice.
6.5.1 Analysis
To get a feel for the convergence of the algorithm Quickhull we must argue that in
each recursive call, some progress is achieved. This is complicated by the possibility
that one of the end-points can be repeatedly chosen as pm . However, if pm is pl ,
then at least one point is eliminated from the pairs whose slopes are larger than the
Figure 6.1: Left-turn(pm , p2j−1 , p2j ) is true but slope(p2j−1 p2j ) is less than the median
slope given by L.
supporting line L through pl . If L has the largest slope, then there are no other points
on the line supporting pm (Step 3 of algorithm). Then for the pair (p2j−1 , p2j ), whose
slope equals that of L, left-turn (pm , p2j , p2j−1 ) is true, so p2j−1 will be eliminated.
Hence it follows that the number of recursive calls is O(n + h), since each call either
produces an output vertex or leads to the elimination of at least one point.
Let N represent the set of slopes slope(p2j−1 p2j ), j = 1, 2, . . . , ⌊n/2⌋. Let k be the rank
of the slope(p2i−1 p2i ), selected uniformly at random from N. Let nl and nr be the
sizes of the subproblems determined by the extreme point supporting the line with
slope(p2i−1 p2i ). We can show that
Observation 6.3 max(nl , nr ) ≤ n − min(⌊n/2⌋ − k, k).
Without loss of generality, let us bound the size of the right sub-problem. There are
⌊n/2⌋ − k pairs with slopes greater than or equal to slope(p2i−1 p2i ). At most one point
out of every such pair can be an output point to the right of pm .
If we choose the median slope, i.e., k = n/4, then nl , nr ≤ (3/4)n. Let h be the number
of extreme points of the convex hull and hl (hr ) be the extreme points of the left
(right) subproblem. We can write the following recurrence for the running time.
T (n, h) ≤ T (nl , hl ) + T (nr , hr ) + O(n)
where nl + nr ≤ n, hl + hr ≤ h − 1.
Exercise 6.8 Show that T (n, h) is O(n log h).
Therefore this achieves the right balance between Jarvis march and Graham scan as
it scales with the output size at least as well as Jarvis march and is O(n log n) in the
worst case.
6.5.2 Expected running time ∗
Let T (n, h) be the expected running time of the algorithm randomized Quickhull to
compute h extreme upper hull vertices of a set of n points, given the extreme points
pl and pr . So the h points are in addition to pl and pr , which can be identified using
(3/2) · n comparisons initially. Let p(nl , nr ) be the probability that the algorithm recurses
on two smaller size problems of sizes nl and nr containing hl and hr extreme vertices
respectively. Therefore we can write
T (n, h) ≤ Σ_{nl ,nr ≥0} p(nl , nr )·(T (nl , hl ) + T (nr , hr )) + bn        (6.5.1)
Proof: We will use the inductive hypothesis that for h′ < h and for all n′ , there
is a fixed constant c, such that T (n′ , h′ ) ≤ cn′ log h′ . For the case that pm is not pl or
pr , from Eq. 6.5.1 we get
T (n, h) ≤ Σ_{nl ,nr ≥0} p(nl , nr )·(c·nl log hl + c·nr log hr ) + bn.
Since nl + nr ≤ n and hl , hr ≤ h − 1,
Let E denote the event that max(nl , nr ) ≤ (7/8)·n and p denote the probability of E. Note
that p ≥ 1/2.
From the law of conditional expectation, we have
nl log hl + nr log hr ≤ (7/8)·n log hl + (1/8)·n log hr        (6.5.3)
The right hand side of 6.5.3 is maximized when hl = (7/8)(h − 1) and hr = (1/8)(h − 1).
Therefore,
nl log hl + nr log hr ≤ n log(h − 1) − tn
where t = log 8 − (7/8) log 7 ≥ 0.55. We get the same bounds when max(nl , nr ) ≤ (7/8)n
and hr ≥ hl . Therefore
T (n, h) ≤ p(cn log(h − 1) − tcn) + (1 − p)cn log(h − 1) + bn
= pcn log(h − 1) − ptcn + (1 − p)cn log(h − 1) + bn
≤ cn log h − ptcn + bn
Therefore from induction, T (n, h) ≤ cn log h for c ≥ b/(tp).
In case pm is an extreme point (say pl ), then we cannot apply Eq. 6.5.1 directly,
but some points will still be eliminated according to Observation 6.3. This can happen
a number of times, say r ≥ 1, at which point, Eq. 6.5.1 can be applied. We will show
that this is actually a better situation, that is, the expected time bound will be less
and hence the previous case dominates the solution of the recurrence.
The rank k of slope(p2i−1 p2i ) is uniformly distributed in [1, n/2], so the number of
points eliminated is also uniformly distributed in the range [1, n/2] from Observation
6.3. (We are ignoring the floor in n/2 to avoid special cases for odd values of n - the
same bounds can be derived even without this simplification). Let n1 , n2 . . . nr be the
random variables that represent the sizes of subproblems in the r consecutive times
that pm is an extreme point. It can be verified by induction that E[Σ_{i=1}^{r} ni ] ≤ 4n
and E[nr ] ≤ (3/4)^r · n where E[·] represents the expectation of a random variable.
Note that Σ_{i=1}^{r} b · ni is the expected work done in the r divide steps. Since cn log h ≥
4nb + c(3/4)^r · n log h for r ≥ 1 (and log h ≥ 4), the previous case dominates. □
binary search to answer a ray shooting query in O(log n) primitives of the following
kind - Is the point below/above a line segment. This strategy works since the line
segments are totally ordered within the slab (they may intersect outside).
For the planar partition, imagine a vertical line V being swept from left to right
and let V (x) represent the intersection of V with the planar partition at an X-
coordinate value x. For simplicity let us assume that no segment is vertical. Further
let us order the line-segments according to V (x) and denote it by S(x).
Observation 6.4 Between two consecutive (in X direction) end-points of the planar
partition, S(x) remains unchanged.
A crucial observation is that the two consecutive vertical slabs have almost all the
segments in common except for the one whose end-points separate the region.
Can we exploit the similarity between two ordered lists of segments and
support binary search on both list efficiently ? In particular, can we avoid
storing the duplicate segments and still support log n steps binary searches.
Here is the intuitive idea. Wlog, let us assume that an element is inserted and we
would like to maintain both versions of the tree (before and after insertion). Let us
also assume that the storage is leaf based.
path copying strategy If a node changes then make a new copy of its parent and also
copy the pointers to its children.
Once a parent is copied, it will lead to copying its parent, etc, until the entire
root-leaf path is copied. At the root, create a label for the new root. Once we know
which root node to start the binary search, we only follow pointers and the search
proceeds in the normal way that is completely oblivious to the fact that there are actually
two implicit search trees. The search time also remains unchanged at O(log n). The
same strategy works for any number of versions except that to start searching at
the correct root node, we may require an additional data structure. In the context of
planar point location, we can build a binary search tree that supports one dimensional
search.
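A minimal Python sketch of the path copying strategy (illustrative): each insertion copies only the nodes on the root-leaf path and returns a new root, so every earlier version of the tree remains searchable unchanged.
class Node:
    __slots__ = ('key', 'left', 'right')
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Return the root of a NEW version; the old version rooted at `root` is untouched."""
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    else:
        return Node(root.key, root.left, insert(root.right, key))

def search(root, key):
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

# versions[i] is the root after the first i insertions
versions = [None]
for k in [5, 2, 8, 1]:
    versions.append(insert(versions[-1], k))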
The space required is path-length · #versions + n which is O(n log n). This
is much smaller than the O(n²) scheme that stores each tree explicitly. With some
additional ideas, it is possible to improve it to O(n) space which will be optimal.
Chapter 7
Exercise 7.1 Show that Lagrange's formula can be used to compute the coeffi-
cients ai 's in O(n²) operations.
One of the consequences of the interpolation is an alternate representation of
polynomials as {(x0 , y0), (x1 , y1) . . . (xn−1 , yn−1)} from where the coefficients can
be computed. We will call this the point-value representation.
multiplication The product of two polynomials can be easily computed in O(n²)
steps by clubbing the coefficients of the powers of x. This is assuming that the
polynomials are described by their coefficients. If the polynomials are given by
their point-value, then the problem is considerably simpler since
P (x) = P1 (x) · P2 (x) where P is the product of P1 and P2
A closely related problem is that of convolution where we have to perform
computations of the kind c_i = Σ_{l+p=i} a_l · b_p for 1 ≤ i ≤ n.
The efficiency of many polynomial related problems depends on how quickly we can
perform transformations between the two representations.
this strategy of choosing points, at the j-th level of recursion, we require
x_i^(2^(j−1)) = −x_(n/2^j + i)^(2^(j−1)),   0 ≤ i ≤ n/2^j − 1
This yields x_1^(2^(log n − 1)) = −x_0^(2^(log n − 1)), i.e., if we choose ω^(n/2) = −1 then x_i = ω·x_(i−1). By
setting x_0 = 1, the points of evaluation work out to be 1, ω, ω², . . . , ω^(n/2), . . . , ω^(n−1), which
are usually referred to as the principal n-th roots of unity.
Analysis
Let P(x)^(z_1 ,z_2 ...z_i)_(a_0 ,a_1 ...a_(n−1)) denote the evaluation of P(x) with coefficients a_0 , a_1 . . . a_(n−1) at
the points z_1 , z_2 , . . . , z_i . Evaluating at the n-th roots of unity reduces to two evaluations of
half-size coefficient vectors at the (n/2)-th roots of unity plus O(n) combining operations,
i.e., T(n) = 2T(n/2) + O(n). This immediately yields O(n log n) operations for the FFT computation.
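A direct Python transcription of this recursive evaluation (a sketch using complex roots of unity; the text later uses modular roots of unity for exact arithmetic, and n is assumed to be a power of two):
import cmath

def fft(a):
    """Evaluate the polynomial with coefficients a at the n-th roots of unity."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2])                        # coefficients of even powers
    odd = fft(a[1::2])                         # coefficients of odd powers
    out = [0] * n
    for i in range(n // 2):
        w = cmath.exp(2j * cmath.pi * i / n)   # omega^i
        out[i] = even[i] + w * odd[i]
        out[i + n // 2] = even[i] - w * odd[i]
    return out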
For the inverse problem, i.e., interpolation of polynomials given the values at
1, ω, ω 2 . . . ω n−1, let us view the process of evaluation as a matrix vector product.
[ 1   1         1           ...  1                 ]   [ a_0     ]   [ y_0     ]
[ 1   ω         ω²          ...  ω^(n−1)           ]   [ a_1     ]   [ y_1     ]
[ 1   ω²        ω⁴          ...  ω^(2(n−1))        ] · [ a_2     ] = [ y_2     ]
[ ...                                              ]   [ ...     ]   [ ...     ]
[ 1   ω^(n−1)   ω^(2(n−1))  ...  ω^((n−1)(n−1))    ]   [ a_(n−1) ]   [ y_(n−1) ]
Let us denote this by the matrix equation A · a = y. In this setting, the interpolation
problem can be viewed as computing a = A^(−1) · y. Even if we had A^(−1) available,
we would still have to compute the product which could take Ω(n²) steps. However the good
news is that the inverse A^(−1) is
(1/n) ·  [ 1   1           1            ...  1                  ]
         [ 1   1/ω         1/ω²         ...  1/ω^(n−1)          ]
         [ 1   1/ω²        1/ω⁴         ...  1/ω^(2(n−1))       ]
         [ ...                                                  ]
         [ 1   1/ω^(n−1)   1/ω^(2(n−1)) ...  1/ω^((n−1)(n−1))   ]
which can be verified by multiplying with A, using the fact that for ω^i ≠ 1
1 + ω^i + ω^(2i) + ω^(3i) + . . . + ω^(i(n−1)) = 0
(use the identity Σ_j ω^(ji) = (ω^(in) − 1)/(ω^i − 1) = 0 for ω^i ≠ 1).
Moreover ω^(−1) , ω^(−2) , . . . , ω^(−(n−1)) also satisfy the properties of n-th roots of unity. This
enables us to use the same algorithm as FFT itself that runs in O(n log n) operations.
7.3 The butterfly network
If you unroll the recursion of an 8 point FFT, then it looks like the Figure 7.1.
Let us work through some successive recursive calls.
To calculate P_{0,4}(ω_4^0) and P_{0,4}(ω_4^1) we compute P_{0,4}(ω_4^0) = P_0(ω_8^0) + ω_4^0 · P_4(ω_8^0), and similarly for P_{0,4}(ω_4^1).
Since Pi denotes ai , we do not recurse any further. Notice that in the above figure
a0 and a4 are the multipliers on the left-hand side. Note that the indices of the ai
on the input side correspond to the mirror image of the binary representation of i. A
butterfly operation corresponds to the gadget ⊲⊳ that corresponds to a pair of recursive
calls. The black circles correspond to ”+” and ”-” operations and the appropriate
multipliers are indicated on the edges (to avoid cluttering only a couple of them are
indicated).
One advantage of using a network is that, the computation in each stage can
be carried out in parallel, leading to a total of log n parallel stages. Thus FFT is
inherently parallel and the butterfly network manages to capture the parallelism in a
natural manner.
Since n and m are relatively prime, n has a unique inverse in Zm (recall extended
Euclid’s algorithm). Also
ω^n = ω^(n/2) · ω^(n/2) = (2^t)^(n/2) · (2^t)^(n/2) ≡ (m−1) · (m−1) mod m ≡ (−1) · (−1) mod m ≡ 1 mod m
Claim 7.1 If the maximum size of a coefficient is b bits, the FFT and its inverse can
be computed in time proportional to O(bn log n).
Note that the addition of two b bit numbers takes O(b) steps and the multiplications
with powers of ω are multiplications by powers of two which can also be done in
O(b) steps. The basic idea of the algorithm is to extend the idea of polynomial
multiplication. Recall that in Chapter 2, we had divided each number into two parts
and subsequently computed the product recursively from products of the smaller numbers. By
extending this strategy, we divide the numbers a and b into k parts ak−1 , ak−2 , . . . a0
and bk−1 , bk−2 , . . . b0 .
where x = 2^(n/k) - for simplicity assume n is divisible by k. By multiplying the RHS,
and clubbing the coefficients of x^i, we obtain
Although in the final product, x = 2^(n/k), we can compute the coefficients using any
method and perform the necessary multiplications by an appropriate power of two
(which is just adding trailing 0's). This is polynomial multiplication and each term
is a convolution, so we can invoke FFT-based methods to compute the coefficients.
The following recurrence captures the running time
where P (k, n/k) is the time for polynomial multiplication of two degree k − 1 polyno-
mials involving coefficients of size n/k. (In a model where the coefficients are not too
large, we could have used O(k log k) as the complexity of polynomial multiplication.)
We will have to do exact computations for the FFT and for that we can use modular
arithmetic. The modulo value must be chosen carefully so that
(i) It must be larger than the maximum value of the numbers involved, so that there
is no loss of information
(ii) Should not be too large, otherwise, operations will be expensive.
Moreover, the polynomial multiplication itself consists of three distinct phases
(i) Forward FFT transform. This takes O(bk log k) using b bits.
(ii) Pairwise product of the values of the polynomials at the roots of unity.
This will be done recursively with cost 2k · T (b) where b ≥ n/k.
The factor two accounts for the number of coefficients of the product of
two polynomials of degree k − 1.
(iii) Reverse FFT, to extract the actual coefficients. This step also takes
O(bk log k) where b is the number of bits in each operand.
So the previous recurrence can be expanded to
Exercise 7.2 With an appropriate terminating condition, say the O(n^(log₂ 3)) time multi-
plication algorithm, verify that T (n) ∈ O(n log² n log log n).
An underlying assumption in writing the recurrence is that all the expressions are
integral. This can actually be ensured by choosing n = 2^ℓ and carefully choosing
√n for even and odd values of ℓ. Using the technique of wrapped convolution, one
can save a factor of two in the degree of the polynomial, yielding the best known
O(n log n log log n) algorithm for multiplication.
Chapter 8
F(X) = x mod p¹. Here X is assumed to be a binary pattern (of 0 and 1) and x is its integer value.
1
It is called a hash function.
Theorem 8.1 (Chinese Remaindering Theorem) For k numbers n1 , n2 , ..., nk , rela-
tively prime to each other,
Moreover,
y ≡ Σ_{i=1}^{k} c_i · d_i · y_i
where c_i · d_i ≡ 1 mod n_i , d_i = n_1 · n_2 · · · n_{i−1} · n_{i+1} · · · n_k and y_i = x mod n_i
Let k be such that 2^m < M = 2 × 3 × . . . × p_k , i.e., the product of the first k primes. From CRT, if
X ≠ Y (i) then for some p in {2, 3, . . . , p_k },
F_p (X) ≠ F_p (Y (i))
So expected cost =
Σ_{i=1}^{m} ( O(1) + n/t² ) = O(m) + n·m/t²
By choosing t ≥ m, it is O(m).
Example 8.1 A stack supports push, pop and empty-stack operations. Define Φ() as
the number of elements in the stack. If we begin from an empty stack, Φ(0) = 0. For
a sequence of push, pop and empty-stack operations, we can analyze the amortized
cost. Amortized cost of push is 2, for pop it is 0 and for empty stack it is negative.
Therefore the bound on amortized cost is O(1) and therefore the cost of n operations
is O(n). Note that the worst-case cost of an empty-stack operation can be very high.
Case: match The amortised cost of a match is 2 (actual cost is one and the increase
in potential is also one).
8.2.3 Pattern Analysis
The preprocessing of the pattern involves constructing the failure function f (i).
Observation 8.1 If the failure function f (i) = j, j < i, it must be true that X(j −
1) is a suffix of X(i − 1) and p_i = p_j .
This shows that the computation of the failure function is very similar to the KMP
algorithm itself and we compute the f (i) incrementally with increasing values of i.
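A standard way to compute the failure function incrementally, in the spirit described above (a Python sketch, 0-indexed so that fail[i] is the length of the longest proper prefix of the pattern that is also a suffix of pattern[:i+1]); the amortized argument asked for in the exercise below applies to the inner while loop.
def failure_function(pattern):
    n = len(pattern)
    fail = [0] * n
    j = 0                                    # length of the currently matched prefix
    for i in range(1, n):
        while j > 0 and pattern[i] != pattern[j]:
            j = fail[j - 1]                  # fall back, exactly as in the KMP scan
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail

# failure_function("abcabcd") == [0, 0, 0, 1, 2, 3, 0]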
Exercise 8.1 Using the potential function method, show that the failure function can
be computed in O(|X|) steps.
Exercise 8.2 Give an example to argue why the KMP algorithm cannot handle wild-cards.
You may want to extend the definition of failure function to handle wild-cards.
PA (x) = a_0·x^(n−1) + a_1·x^(n−2) + a_2·x^(n−3) + . . . + a_(n−1) and PB (x) = b_0 + b_1·x + b_2·x² + . . . + b_(m−1)·x^(m−1).
The product of PA and PB can be written as Σ_{i=0}^{m+n−2} c_i·x^i. Note that
The above convolution can be easily done using FFT computation in O(m log m)
steps.4 When wildcard characters are present in the pattern, we can assign them the
value 0. If there are w such characters then we can modify the previous observation
by looking for terms that have value n − w (Why). However, the same may not work
if we have wildcards in the text also - try to construct a counterexample.
Wildcard in Pattern and Text
Assume that the alphabet is {1, 2 . . . s} (zero is not included). We will reserve
zero for the wildcard. For every position i of the pattern (assume for now that there are
no wildcards), we will associate a random number r_i from the set {1, 2, . . . , N} for
a sufficiently large N that we will choose later. Let t = Σ_i r_i·X_i . Here the r_i 's are
random multipliers such that
Observation 8.3 For any vector v_1 , v_2 . . . v_n , suppose there exists some i for which
X_i ≠ v_i . Then the probability that Σ_i v_i · r_i = t is less than 1/N .
We can imagine that the random numbers are chosen sequentially, so that after fixing
the first n − 1 numbers, there is only one choice for which the equation is satisfied 5 .
By choosing N ≥ n², the probability of a false match in any of the possible positions
is n · 1/N ≤ 1/n.
Clearly, if the vector v_1 , v_2 . . . v_n is the same as X, then Σ_i v_i · r_i = t. So this forms
the basis for a randomized string matching algorithm. In the presence of wildcards
in the pattern X, we assign ri = 0 iff Xi =* (instead of a random non-zero number)
and the same result holds for positions that do not correspond to wildcards (these are
precisely the positions that are blanked out by 0). The number t acts like a fingerprint
or a hash function for the pattern.
4
The number involved are small enough so that we can do exact computation using O(log n) bit
integers.
5
We do the arithmetic modulo N
When the text has wildcards, the fingerprint cannot be fixed and will vary
according to the wildcards in the text. The fingerprint t_k at position k of the text
can be defined as
t_k = Σ_{j=1}^{n} δ_{j+k−1} · r_j · X_j
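A direct (quadratic, for clarity) Python sketch of this randomized scheme; computing all the t_k simultaneously via convolution/FFT is what gives the sub-quadratic bound, but the fingerprint logic itself is just the sums above. Here 0 encodes the wildcard, as in the text, and function names are illustrative.
import random

def wildcard_match_positions(text, pattern, N=None):
    """Monte Carlo matcher: positions k where pattern matches text, '*' matching anything."""
    n, m = len(pattern), len(text)
    N = N or n * n                                 # N >= n^2 keeps false matches below 1/n
    X = [0 if c == '*' else ord(c) for c in pattern]
    T = [0 if c == '*' else ord(c) for c in text]
    r = [0 if X[j] == 0 else random.randint(1, N) for j in range(n)]
    matches = []
    for k in range(m - n + 1):
        # positions where the text has a wildcard are blanked out (delta = 0)
        idx = [j for j in range(n) if T[k + j] != 0]
        t_k = sum(r[j] * X[j] for j in idx)        # fingerprint of the pattern at this shift
        s_k = sum(r[j] * T[k + j] for j in idx)    # fingerprint of the text window
        if t_k == s_k:
            matches.append(k)
    return matches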
Chapter 9
Graph Algorithms
A number of graph problems use Depth First Search as the starting point. Since it
runs in linear time, it is efficient as well.
Observation 9.1 Every DAG has at least one vertex with indegree 0 (source) and a
vertex with outdegree 0 (sink).
This is also called a topological sorting of the vertices. Consider a DFS numbering
pre(v), v ∈ V and also a post-order numbering post(v).
If the DFS reaches u before v, then clearly it is true. If v is reached before u, then
the DFS of v is completed before it starts at u since there is no path from v to u.
Claim 9.1 The vertices in reverse order of the post-order numbering gives a topolog-
ical sorting.
graph Ḡ = (V′, E′) as follows - V′ corresponds to the SCCs of G and (c1, c2) ∈ E′ if
there is an edge from some vertex in c1 to some vertex of c2. Here c1, c2 ∈ V′, i.e.,
they are SCCs of G.
To determine the SCCs of G, notice that if we start a DFS from a vertex of a sink
component c′ of Ḡ then precisely all the vertices of c′ can be reached. Since Ḡ is not
explicitly available, we will use the following strategy to determine a sink component
of Ḡ. First, reverse the edges of G - call it G^R. The SCCs of G^R are the same as those of G
but the sink components and source components of Ḡ are interchanged.
Exercise 9.2 If we do a DFS in G^R, then show that the vertex with the largest
postorder numbering is in a sink component of Ḡ.
This enables us to output the SCC corresponding to the sink component of Ḡ using
a simple DFS. Once this component is deleted (delete the vertices and the induced
edges), we can apply the same strategy to the remaining graph.
Exercise 9.3 Show that the SCCs can be determined using two DFS - one in G and
the other in GR .
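A Python sketch of this two-DFS strategy (this is essentially one possible answer to Exercise 9.3, so treat it as an illustration rather than the text's algorithm): the first DFS on G^R orders the vertices by post-order, the second DFS on G, taken in decreasing post-order, then peels off one sink component at a time.
def strongly_connected_components(adj):
    """adj: dict vertex -> list of out-neighbours. Returns the list of SCCs of the digraph."""
    vertices = set(adj) | {v for nbrs in adj.values() for v in nbrs}
    g = {u: list(adj.get(u, [])) for u in vertices}
    gr = {u: [] for u in vertices}              # G^R: all edges reversed
    for u in g:
        for v in g[u]:
            gr[v].append(u)

    def dfs(u, graph, seen, acc):
        seen.add(u)
        for v in graph[u]:
            if v not in seen:
                dfs(v, graph, seen, acc)
        acc.append(u)                           # post-order

    seen, post = set(), []
    for u in vertices:                          # first DFS, on G^R
        if u not in seen:
            dfs(u, gr, seen, post)

    seen, sccs = set(), []
    for u in reversed(post):                    # decreasing post-order numbering
        if u not in seen:
            comp = []
            dfs(u, g, seen, comp)               # second DFS, on G: one component
            sccs.append(comp)
    return sccs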
Definition 9.1 Two edges belong to the same BCC iff they belong to a common
(simple) cycle.
Exercise 9.4 Show that the above relation defines an equivalence relation on edges.
Moreover, the equivalence classes are precisely the BCC (as defined by the vertex
connectivity).
1
If the removal of a vertex disconnects a graph then such a vertex is called an articulation point.
The DFS on an undirected graph G = (V, E) partitions the edges into T (tree
edges) and B (back edges). Based on the DFS numbering (pre-order numbering) of
the vertices, we can direct the edges of T from a lower to a higher number and the
edges in B from a higher to a lower number. Let us denote the DFS numbering by
a function d(v) v ∈ V . Analogous to the notion of component tree in the context of
SCC, we can also define a component tree on the BCC. Here the graph G has the
biconnected components (denoted by B) and the articulation points (denoted by A)
as the set of vertices. We have an edge between a ∈ A and b ∈ B if a ∈ b.
The basic idea behind the BCC algorithm is to detect articulation points. If there
are no articulation points then the graph is biconnected. Simultaneously, we also
determine the BCC. The DFS numbering d(v) helps us in this objective based on the
following observation.
Observation 9.3 If there are no back-edges out of some subtree of the DFS tree Tu
rooted at a vertex u that leads to a vertex w with d(w) < d(u), then u is an articulation
point.
This implies that all paths from the subtree to the remaining graph must pass through
u making u an articulation point. To detect this condition, we define an additional
numbering of the vertices based on the DFS numbering. Let h(v) denote the minimum
of the d(u) where (v, u) is a back edge. Then define LOW(v) = min{ h(v), min_w LOW(w) },
where w ranges over the children of v in the DFS tree.
Exercise 9.6 How would you compute the LOW (v) v ∈ V along with the DFS num-
bering ?
Exercise 9.7 Formalize the above argument into an efficient algorithm that runs in
O(|V | + |E|) steps.
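A Python sketch that computes d(v) and LOW(v) in a single DFS and reports the articulation points (one possible answer to Exercises 9.6 and 9.7, so a hedged illustration rather than the text's algorithm):
def articulation_points(adj):
    """adj: dict vertex -> list of neighbours of an undirected (simple) graph."""
    d, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        counter[0] += 1
        d[u] = low[u] = counter[0]
        children = 0
        for v in adj[u]:
            if v not in d:                      # tree edge u -> v
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= d[u]:
                    points.add(u)               # no back edge from v's subtree above u
            elif v != parent:                   # back edge
                low[u] = min(low[u], d[v])
        if parent is None and children > 1:
            points.add(u)                       # the root is an articulation point iff it has >= 2 children

    for u in adj:
        if u not in d:
            dfs(u, None)
    return points, d, low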
9.2 Path problems
We are given a directed graph G = (V, E) and a weight function w : E → R (may
have negative weights also). The natural versions of the shortest path problem are
as follows
distance between a pair Given vertices x, y ∈ V , find the least weighted path
starting at x and ending at y.
Single source shortest path (SSSP) Given a vertex s ∈ V , find the least weighted
path from s to all vertices in V − {s}.
All pairs shortest paths (APSP) For every pair of vertices x, y ∈ V , find least
weighted paths from x to y.
Although the first problem often arises in practice, there is no specialized algo-
rithm for it. The first problem easily reduces to the SSSP problem. Intuitively, to
find the shortest path from x to y, it is difficult to avoid any vertex z since there
may be a shorter path from z to y. Indeed, one of the most basic operations used by
shortest path algorithms is the relaxation step. It is defined as follows -
Relax(u, v) : (u, v) ∈ E,
if ∆(v) > ∆(u) + w(u, v) then ∆(v) = ∆(u) + w(u, v)
For any vertex v, ∆(v) is an upperbound on the shortest path. Initially it is set to
∞ but gradually its value decreases till it becomes equal to δ(v) which is the actual
shortest path distance (from a designated source vertex).
The other property that is exploited by all algorithms is that subpaths of shortest paths
are themselves shortest paths between their endpoints.
This follows by a simple argument by contradiction, that otherwise the original path
is not the shortest path.
δ(v) = min_{u∈In(v)} {δ(u) + w(u, v)}
where In(v) denotes the set of vertices u ∈ V such that (u, v) ∈ E. The shortest
path to v must have one of the incoming edges into v as the last edge. The algorithm
actually maintains upperbounds ∆(v) on the distance from the source vertex s -
initially ∆(v) = ∞ for all v ∈ V − {s} and ∆(s) = 0 = δ(s). The previous recurrence
is recast in terms of ∆
∆(v) = min_{u∈In(v)} {∆(u) + w(u, v)}
that follows from a similar reasoning. Note that if ∆(u) = δ(u) for any u ∈ In(v),
then after applying relax(u, v), ∆(v) = δ(v). The underlying technique is dynamic
programming as many vertices may have common predecessors in the shortest path
recurrence.
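The recurrence is realized by repeatedly relaxing every edge; a Python sketch of the Bellman-Ford algorithm (illustrative, with one extra round added to detect a reachable negative cycle):
def bellman_ford(vertices, edges, s):
    """edges: list of (u, v, w). Returns (Delta, has_negative_cycle)."""
    INF = float('inf')
    Delta = {v: INF for v in vertices}
    Delta[s] = 0
    for _ in range(len(vertices) - 1):
        for u, v, w in edges:                        # Relax(u, v)
            if Delta[u] + w < Delta[v]:
                Delta[v] = Delta[u] + w
    # one more pass: any further improvement means a negative cycle is reachable from s
    negative = any(Delta[u] + w < Delta[v] for u, v, w in edges)
    return Delta, negative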
Exercise 9.9 If there is no negative cycle, show that the predecessors form a tree
(which is called the shortest-path tree).
Observation 9.6 The vertex v ∈ V − U for which ∆(v) is minimum satisfies the
property that ∆(v) = δ(v).
Suppose for some vertex v that has the minimum label after some iteration, ∆(v) >
δ(v). Consider a shortest path s ⇝ x → y ⇝ v, where y ∉ U and all the earlier
vertices in the path s ⇝ x are in U. Since x ∈ U, ∆(y) ≤ δ(x) + w(x, y) = δ(y).
Since all edge weights are non-negative, δ(y) ≤ δ(v) < ∆(v) and therefore ∆(y) = δ(y)
is strictly less than ∆(v) which contradicts the minimality of ∆(v).
A crucial property exploited by Dijkstra's algorithm is that along any shortest
path s ⇝ u, the shortest path-lengths are non-decreasing because of non-negative
edge weights. Along similar lines, we can also assert the following
Observation 9.7 Starting with s, the vertices are inserted into U in non-decreasing
order of their shortest path lengths.
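A Python sketch of Dijkstra's algorithm with a binary heap (illustrative; non-negative edge weights assumed, as required by the argument above):
import heapq

def dijkstra(adj, s):
    """adj: dict u -> list of (v, w) with w >= 0. Returns shortest path distances from s."""
    dist = {s: 0}
    heap = [(0, s)]                      # (tentative distance, vertex)
    done = set()                         # the set U of the discussion above
    while heap:
        du, u = heapq.heappop(heap)
        if u in done:
            continue                     # stale entry
        done.add(u)                      # at this point Delta(u) = delta(u)
        for v, w in adj.get(u, []):
            if v not in done and du + w < dist.get(v, float('inf')):
                dist[v] = du + w         # Relax(u, v)
                heapq.heappush(heap, (dist[v], v))
    return dist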
9.2.3 Floyd-Warshall APSP algorithm
Consider the adjacency matrix A_G of the given graph G = (V, E) where the entry
A_G(u, v) contains the weight w(u, v). Let us define the matrix product of A_G · A_G
in a way where the multiplication and addition operators are replaced with addition
and the min operator.
Claim 9.2 Define the k-th iterate of A_G , namely A_G^k , for n − 1 ≥ k ≥ 2, as min{A_G^(k−1) , A_G^(k−1) ·
A_G } where the min is the entrywise minimum. Then A_G^k (u, v) contains the shortest
path distance between u and v consisting of at most k edges.
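The claim can be implemented directly; a Python sketch of the min-plus product and its repeated application (illustrative; the diagonal-zero and infinity conventions below are assumptions, and with repeated squaring the number of products drops to O(log n)):
def min_plus_product(A, B):
    n, INF = len(A), float('inf')
    C = [[INF] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            if A[i][k] == INF:
                continue
            for j in range(n):
                if A[i][k] + B[k][j] < C[i][j]:
                    C[i][j] = A[i][k] + B[k][j]
    return C

def apsp(A):
    """A[i][j] = w(i, j) for an edge, 0 on the diagonal, infinity otherwise."""
    n = len(A)
    D = [row[:] for row in A]
    for _ in range(n - 2):            # paths of up to n-1 edges after n-1 iterates
        D = min_plus_product(D, A)
        # the entrywise minimum with the previous iterate is implicit since A[i][i] = 0
    return D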
1. Capacity constraint
f (e) ≤ C(e) ∀e ∈ E
2. Flow conservation
∀v ∈ V − {s, t},   Σ_{e∈in(v)} f(e) = Σ_{e∈out(v)} f(e)
where in(v) are the edges directed into vertex v and out(v) are the edges directed
out of v.
The vertices s and t are often called the source and the sink and the flow is directed
out of s and into t.
The outflow of a vertex v is defined as Σ_{e∈out(v)} f(e) and the inflow into v is given
by Σ_{e∈in(v)} f(e). The net flow is defined as outflow minus inflow = Σ_{e∈out(v)} f(e) −
Σ_{e∈in(v)} f(e). From the property of flow conservation, the net flow is zero for all vertices
except s, t. For the vertex s which is the source, the net flow is positive and for t, the net
flow is negative.
Observation 9.8 The net flow at s and the net flow at t are equal in magnitude.
From the flow conservation constraint
Σ_{v∈V−{s,t}} ( Σ_{e∈out(v)} f(e) − Σ_{e∈in(v)} f(e) ) = 0
Let E′ be the edges that are not incident on s, t (either incoming or outgoing). Then
Σ_{e∈E′} (f(e) − f(e)) + Σ_{e∈out(s)} f(e) − Σ_{e∈in(s)} f(e) + Σ_{e∈out(t)} f(e) − Σ_{e∈in(t)} f(e) = 0
For an edge e ∈ E′, f(e) is counted once as incoming and once as outgoing, which cancel
each other. So
Σ_{e∈out(s)} f(e) − Σ_{e∈in(s)} f(e) = Σ_{e∈in(t)} f(e) − Σ_{e∈out(t)} f(e)
9.3.1 Max Flow Min Cut
An (S, T) cut is defined as a partition of V such that s ∈ S and t ∈ T. The size of
a cut is defined as Σ_{u∈S, v∈T} C(u, v). Note that only the capacities of the forward edges
are counted.
Consider a flow f such that there is no augmenting path. Let S ∗ be the set of vertices
such that there is an augmenting path from s to u ∈ S ∗ . By definition, s ∈ S ∗ and
t∈/ S ∗ and T ∗ = V − S ∗ .
Observation 9.9 The forward edges from S∗ to T∗ are saturated and the backward
arcs from T∗ to S∗ carry zero flow.
Otherwise it would contradict our definition of S∗ and T∗. The net flow from S∗ to T∗
is
Σ_{e∈out(S∗), e∈in(T∗)} f(e) = Σ_{e∈out(S∗), e∈in(T∗)} C(e) = C(S∗, T∗)
= Σ_{e∈out(s)} f(e) − Σ_{e∈in(s)} f(e), since flow conservation holds at every other vertex
in S - this is the net flow out of s. By rewriting the first summation over two
sets of edges E and E′ corresponding to the cases when both endpoints of e are in S
or exactly one end-point is in S (the other is in T), we obtain the following
Σ_{e∈E} (f(e) − f(e)) + Σ_{e∈out(S)} f(e) − Σ_{e∈in(S)} f(e)
The first term is 0 and the second term is at most C(S, T). Since the third term is
non-positive, we can upperbound the expression by C(S, T). Therefore f ≤ C(S, T) for
any cut and in particular the mincut, i.e., the maxflow is less than or equal to the
mincut. As f is any flow, it implies
maxflow ≤ mincut        (9.3.1)
In conjunction with Equation 9.3.1 , we find that f ∗ = C(S ∗ , T ∗ ) = mincut.
Since the maxflow corresponds to the situation where no augmenting path is
possible, we have proved that
The flow is maximum iff there is no augmenting path.
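A compact Python sketch of the augmenting path method with BFS (i.e., the shortest-augmenting-path rule analysed below); capacities are kept in a nested dict and residual capacities are C(e) − f(e) forwards plus f(e) backwards. Names are illustrative.
from collections import deque

def max_flow(capacity, s, t):
    """capacity: dict u -> dict v -> C(u, v). Returns the value of a maximum s-t flow."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u in list(capacity):
        for v in capacity[u]:
            residual.setdefault(v, {}).setdefault(u, 0)    # 0-capacity reverse arc
    residual.setdefault(s, {}); residual.setdefault(t, {})
    flow = 0
    while True:
        parent = {s: None}                   # BFS for a shortest augmenting path
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow                      # no augmenting path: the flow is maximum
        bottleneck, v = float('inf'), t      # find the bottleneck along the path
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u
        v = t                                # augment: push flow, update residual arcs
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck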
We will prove it by contradiction. Suppose s_i^(k+1) < s_i^k for some k and among all
such vertices let s ⇝ v have minimum path length (after k + 1 iterations). Consider
the last edge in the path, say (u, v). Then
s_v^(k+1) = s_u^(k+1) + 1            (9.3.2)
s_u^(k+1) ≥ s_u^k                    (9.3.3)
Case 1 : f(u, v) < C(u, v). Then there is a forward edge u → v and hence
s_v^k ≤ s_u^k + 1                    (9.3.4)
From Equations 9.3.2-9.3.4, s_v^(k+1) ≥ s_v^k , which contradicts our assumption.
Case 2 : f(u, v) = C(u, v). Then (u, v) is a backward edge and hence
s_u^k = s_v^k + 1
Let us now bound the number of times an edge (u, v) can be a bottleneck for augmen-
tations passing through the edge (u, v) in either direction. If (u, v) is critical after the
k-th iteration in the forward direction then s_v^k = s_u^k + 1. From the monotonicity property
s_v^ℓ ≥ s_v^k , so
s_v^ℓ ≥ s_u^k + 1                    (9.3.5)
Let ℓ (≥ k + 1) be the next iteration when an augmenting path passes through (u, v)³.
Then (u, v) must be a backward edge and therefore
s_u^ℓ = s_v^ℓ + 1 ≥ s_u^k + 2
using inequality 9.3.5. Therefore we can conclude that the distance of u from s increases
by at least 2 every time (u, v) becomes a bottleneck and hence it can become a bottleneck
for at most |V|/2 augmentations.
3
it may not be a bottleneck edge.
9.4 Global Mincut
A cut of a given (connected) graph G = (V, E) is a set of edges which when removed
disconnects the graph. An s − t cut must have the property that the designated
vertices s and t should be in separate components. A mincut is the minimum number
of edges that disconnects a graph and is sometimes referred to as global mincut to
distinguish it from the s − t mincut. The weighted version of the mincut problem is the
natural analogue when the edges have non-negative associated weights. A cut can
also be represented by a set of vertices S where the cut-edges are the edges connecting
S and V − S.
It was believed for a long time that the mincut is a harder problem to solve than
the s − t mincut - in fact the earlier algorithms for mincuts determined the s − t
mincuts for all pairs s, t ∈ V . The s − t mincut can be determined from the s − t
maxflow algorithms and over the years, there have been improved reductions of
the global mincut problem to the s − t flow problem, such that it can now be solved
in one computation of s − t flow.
In a remarkable departure from this line of work, first Karger, followed by Karger
and Stein developed faster algorithms (than maxflow) to compute the mincut with
high probability. The algorithms produce a cut that is very likely the mincut.
Procedure Partition(2) produces a 2-partition of V which defines a cut. If it is a
mincut then we are done. There are two issues that must be examined carefully.
The second question addresses a more general question, namely, how does one verify
the correctness of a Monte Carlo randomized algorithm ? In most cases there are no
efficient verification procedures and we can only claim the correctness in a probabilis-
tic sense. In our context, we will show that the contraction algorithm will produce
a mincut with probability p, so that, if we run the algorithm 1/p times we expect to
see the mincut at least once. Among all the cuts that are output in O(1/p) runs of the
algorithm, we choose the one with the minimum cut value. If the minimum cut had
been produced in any of the independent runs, we will obtain the mincut.
where Ā denotes the complement of event A. We can use the above equation induc-
tively to obtain
Pr[E(n − t)] ≥ Π_{i=1}^{n−t} (1 − 2/n_{i−1})
            = Π_{i=1}^{n−t} (1 − 2/(n − i + 1))
            ≥ t(t−1) / (n(n−1))
Claim 9.4 The probability that a specific mincut C survives at the end of Partition(t)
is at least t(t−1)/(n(n−1)).
4
We will prove it only for the unweighted version but the proof can be extended using multiset
arguments.
Therefore Partition(2) produces a mincut with probability Ω(1/n²). Repeating the
above algorithm O(n²) times would ensure that the min cut is expected to be the
output at least once. If each contraction can be performed in t(n) time then the
expected running time is O(t(n) · n · n²).
Exercise 9.10 By using an adjacency matrix representation, show that the contrac-
tion operation can be performed in O(n) steps.
We now address the problem of choosing a random edge using the above data struc-
ture.
Claim 9.5 An edge e can be chosen uniformly at random at any stage of the algorithm
in O(n) steps.
We first present a method to pick an edge randomly in O(n) time. The selection
works as follows.
where #E(u, v) denotes the number of edges between u and v and N(v) is the set of
neighbours of v.
Hence, the probability of choosing any edge (v, w) is given by
Thus, the above method picks edges with probability that is proportional to the
number of edges between v and w. When there are no multiple edges, all edges are
equally likely to be picked. For the case of integer weights, the above derivation works
directly for weighted sampling. Using an adjacency matrix M for storing the graph,
where M_{v,w} denotes the number of edges between v and w, allows us to merge vertices
v and w in O(n) time.
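A Python sketch of one run of Partition(2) on this adjacency matrix representation (illustrative names; a connected multigraph is assumed). Both the random edge selection and the contraction are O(n) per step, as claimed: degrees are maintained incrementally.
import random

def contract_once(M):
    """M: symmetric matrix of edge multiplicities (0 on the diagonal). Returns one cut value."""
    n = len(M)
    M = [row[:] for row in M]
    deg = [sum(row) for row in M]
    alive = list(range(n))
    while len(alive) > 2:
        # pick v with probability deg(v)/2|E|, then w with probability M[v][w]/deg(v):
        # the edge (v, w) ends up chosen uniformly at random
        v = random.choices(alive, weights=[deg[u] for u in alive])[0]
        w = random.choices(alive, weights=[M[v][u] for u in alive])[0]
        # contract w into v in O(n): merge multiplicities, drop the self-loops
        deg[v] += deg[w] - 2 * M[v][w]
        for u in alive:
            if u not in (v, w):
                M[v][u] += M[w][u]
                M[u][v] += M[u][w]
        M[v][w] = M[w][v] = 0
        alive.remove(w)
    a, b = alive
    return M[a][b]                    # number of cut edges for this 2-partition
Repeating contract_once O(n²) times and keeping the smallest value recovers the mincut with the probability bound derived above.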
Chapter 10
NP Completeness and
Approximation Algorithms
Let C() be a class of problems defined by some property. We are interested in charac-
terizing the hardest problems in the class, so that if we can find an efficient algorithm
for these, it would imply fast algorithms for all the problems in C. The class that is
of great interest to computer scientists is the class P that is the set of problems for
which we can design polynomial time algorithms. A related class is N P, the class of
problems for which non-deterministic1 polynomial time algorithms can be designed.
More formally,
P = ∪_{i≥1} C(TP(n^i))
where C(TP(n^i)) denotes problems for which O(n^i) time algorithms can be designed.
10.1 Classes and reducibility
The intuitive notion of reducibility between two problems is that if we can solve one
we can also solve the other. Reducibility is actually an asymmetric relation and also
entails some details about the cost of reduction. We will use the notation P1 ≤_R P2
to denote that problem P1 is reducible to P2 using resource R (time or space as the case
may be). Note that it is not necessary that P2 ≤_R P1 .
In the context of decision problems, a problem P1 is many-one reducible
to P2 if there is a many-to-one function g() that maps an instance I1 ∈ P1
to an instance I2 ∈ P2 such that the answer to I2 is YES iff the answer to
I1 is YES.
In other words, the many-to-one reducibility maps YES instances to YES instances
and NO instances to NO instances. Note that the mapping need not be 1-1 and
therefore reducibility is not a symmetric relation.
Further, if the mapping function g() can be computed in polynomial time
then we say that P1 is polynomial-time reducible to P2 and is denoted by
P1 ≤poly P2 .
The other important kind of reduction is logspace reduction and is denoted by
P1 ≤log P2 .
Claim 10.3 If Π1 ≤poly Π2 then
(i) If there is a polynomial time algorithm for Π2 then there is a polynomial time
algorithm for Π1 .
(ii) If there is no polynomial time algorithm for Π1 , then there cannot be a polynomial
time algorithm for Π2 .
Part (ii) is easily proved by contradiction. For part (i), if p1 (n) is the running time
of the algorithm for Π2 and p2 (n) is the time to compute the reduction function, then there is an algorithm for Π1
that takes p1 (p2 (n)) steps where n is the input length for Π1 .
A problem Π is called NP-hard under polynomial reduction if for any problem
Π′ ∈ N P, Π′ ≤_poly Π.
then the problem is known as CNF Satisfiability. Further, if we restrict the number
of variables in each clause to be exactly k then it is known as the k-CNF Satisfiability
problem. A remarkable result attributed to Cook and Levin says the following
Theorem 10.1 The CNF Satisfiability problem is NP Complete under polynomial
time reductions.
To appreciate this result, one must realize that there are infinitely many problems
in the class N P, so we cannot explicitly design a reduction function for each of them.
Other than the definition of N P we have very little to rely on for a proof of the above
result. A detailed technical proof requires that we define the computing model very
precisely - this is beyond the scope of this discussion. Instead we sketch an intuition
behind the proof.
Given an arbitrary problem Π ∈ N P, we want to show that Π ≤poly CNF −SAT .
In other words, given any instance of Π, say IΠ , we would like to define a boolean
formula B(IΠ ) which has a satisfying assignment iff IΠ is a YES instance. Moreover,
B(IΠ ) should be constructible in time polynomial in the length of IΠ (so, in particular,
its length is also polynomially bounded).
A computing machine is a transition system where we have
(i) An initial configuration
(ii) A final configuration that indicates whether the input is a YES
or a NO instance
(iii) A sequence of intermediate configurations Si where Si+1 follows from Si
using a valid transition. In a non-deterministic system, there can be more
than one possible transition from a configuration. A non-deterministic
machine accepts a given input iff there is some valid sequence of configu-
rations that verifies that the input is a YES instance.
All the above properties can be expressed in propositional logic, i.e., by an unquan-
tified boolean formula in CNF. Using the fact that the number of transitions is
polynomially bounded, we can bound the size of this formula by a polynomial. The details can
be quite messy and the interested reader can consult a formal proof in the context
of the Turing Machine model. Just to give the reader a glimpse of the kind of formalism
used, consider a situation where we want to write a propositional formula to assert
that a machine is in exactly one of k states at any given time 1 ≤ i ≤ T . Let
us use boolean variables x1,i , x2,i . . . xk,i where xj,i = 1 iff the machine is in state j at
time i. We must write a formula that will be a conjunction of the following two conditions
(i) At least one variable is true at any time i:
$$x_{1,i} \vee x_{2,i} \vee \ldots \vee x_{k,i}$$
(ii) At most one variable is true:
$$(x_{1,i} \Rightarrow \bar{x}_{2,i}\wedge\bar{x}_{3,i}\ldots\wedge\bar{x}_{k,i})\wedge(x_{2,i} \Rightarrow \bar{x}_{1,i}\wedge\bar{x}_{3,i}\ldots\wedge\bar{x}_{k,i})\ldots\wedge(x_{k,i} \Rightarrow \bar{x}_{1,i}\wedge\bar{x}_{2,i}\ldots\wedge\bar{x}_{k-1,i})$$
where each implication expands into clauses of the form $(\bar{x}_{j,i} \vee \bar{x}_{l,i})$.
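To make this formalism concrete, here is a small Python sketch (our own illustration; the variable-numbering convention and function name are assumptions) that emits the "exactly one state" clauses in CNF:

def exactly_one_state_clauses(k, i, var):
    """Emit CNF clauses asserting that exactly one of x_{1,i}, ..., x_{k,i}
    is true.  var(j, i) returns the positive integer identifier of x_{j,i};
    a negative integer denotes the negated literal (DIMACS-style)."""
    clauses = []
    # (i) at least one variable is true at time i
    clauses.append([var(j, i) for j in range(1, k + 1)])
    # (ii) at most one is true: x_j => not x_l becomes (not x_j or not x_l)
    for j in range(1, k + 1):
        for l in range(j + 1, k + 1):
            clauses.append([-var(j, i), -var(l, i)])
    return clauses

# e.g. exactly_one_state_clauses(3, 0, lambda j, i: j) yields
# [[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]]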
In general, to establish that a new problem P is NPC one must show that P ∈ N P
(the first step) and that some known NPC problem reduces to P. The second step can
be accomplished by reducing any known NPC problem to P . Some of the earliest
problems that were proved NPC include (besides CNF-SAT)
• 3D Matching
• co − N P A problem whose complement is in N P belongs to this class. If the
problem is in P, then the complement of the problem is also in P and hence in
N P. In general we can't say much about the relation between N P and co − N P.
In particular, it is not known how to design an NP algorithm (an efficient verifier of
YES instances) for an arbitrary problem in co − N P. For instance, how would you
verify that a boolean formula is unsatisfiable (all assignments make it false)?
Exercise 10.3 Show that the complement of an NPC problem is complete for
the class co − N P under polynomial time reduction.
Exercise 10.4 What would it imply if an NPC problem and its complement
are polynomial time reducible to each other?
• PSPACE The problems that can be solved using polynomial space (but not necessarily
polynomial time). The satisfiability of Quantified Boolean Formulas (QBF) is a
complete problem for this class.
It is an open question whether BPP ⊂ N P.
10.4 Combating hardness with approximation
Since the discovery of NPC problems in the early 70's, algorithm designers have been
wary of spending effort on designing exact algorithms for these problems, as it is considered
a rather hopeless situation without a definite resolution of the P = N P question.
Unfortunately, a large number of interesting problems fall into this category, so
ignoring them is not an acceptable attitude either. Many researchers have
pursued non-exact methods to tackle these problems, based on
heuristics and empirical results. Some of the well known heuristics are simulated
annealing, neural network based learning methods and genetic algorithms. You will have
to be an optimist to use these techniques for any critical application.
The accepted paradigm over the last decade has been to design polynomial time
algorithms that guarantee near-optimal solutions to an optimization problem. For a
maximization problem, we would like to obtain a solution that is at least f · OPT,
where OPT is the value of the optimal solution and f ≤ 1 is the approximation factor
for the worst case input. Likewise, for a minimization problem we would like a solution
no more than a factor f ≥ 1 larger than OPT. Clearly the closer f is to 1, the better
is the algorithm. Such algorithms are referred to as approximation algorithms and
there exists a complexity theory of approximation, which is mainly about the extent of
approximation attainable for a certain problem.
For example, if f = 1 + ε where ε is any user defined constant, then we say
that the problem has a Polynomial Time Approximation Scheme (PTAS). Further,
if the running time is also polynomial in 1/ε then it is called an FPTAS (Fully PTAS). The
theory of hardness of approximation has yielded lower bounds (for minimization problems,
and upper bounds for maximization problems) on the approximation factors for many
important optimization problems. A typical kind of result is that, unless P = N P,
we cannot approximate the set cover problem better than log n in polynomial time.
In this section, we give several illustrative approximation algorithms. One of the
main challenges in the analysis is that, even without explicit knowledge of the
optimum solution, we must still be able to prove guarantees about the quality of the
solution produced by the algorithm.
Let $B = \sum_i z_i$ and consider the following generalization of the problem, namely
the subset sum problem: for a given integer K ≤ B, is there a subset R ⊆ S such
that the elements in R sum up to K?
Let S(j, r) denote a subset of {z1 , z2 . . . zj } that sums to r; if no such subset exists
then we define it as φ (the empty subset). We can write the following recurrence
$$S(j, r) = \begin{cases} S(j-1, r-z_j) \cup \{z_j\} & \text{if } z_j \text{ is included}\\ S(j-1, r) & \text{if } z_j \text{ is not included}\\ \varphi & \text{if neither of the above is possible} \end{cases}$$
Using the above dynamic programming formulation we can compute S(j, r) for
1 ≤ j ≤ n and r ≤ B. You can easily argue that the running time is O(n · B) which
may not be polynomial as B can be very large.
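The following Python sketch (an illustration of ours; the function and variable names are not from the text) implements this table-filling idea, recording for each reachable sum r a subset attaining it:

def subset_sum(z, K):
    """O(n*K) dynamic program for the subset sum problem.
    table[r] is a subset (list) of z summing to r, or absent if r is not
    reachable; returning None corresponds to the phi case."""
    table = {0: []}
    for zj in z:
        # iterate over a snapshot so each element is used at most once
        for r, subset in list(table.items()):
            if r + zj <= K and (r + zj) not in table:
                table[r + zj] = subset + [zj]
    return table.get(K)

# example: subset_sum([3, 7, 9, 13], 16) returns a subset summing to 16,
# e.g. [7, 9]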
Suppose we are given an approximation factor ε, and let A = ⌈n/ε⌉ so that 1/A ≤ ε/n.
Then we define a new scaled problem with the integers scaled as $z_i' = \left\lfloor \frac{z_i}{z/A} \right\rfloor$ and let
$r' = \left\lfloor \frac{r}{z/A} \right\rfloor$, where z is the maximum value of an integer that can participate in the
solution.
Let us solve the problem for {z1′ , z2′ . . . zn′ } and r ′ using the previous dynamic
programming strategy, and let So′ denote the optimal solution for the scaled problem
and So the optimal solution for the original problem. Further, let C and C ′ denote the
cost functions for the original and the scaled problems respectively. The running time
of the algorithm is O(n · r ′ ), which is $O(\frac{1}{\varepsilon} n^2)$. We would like to show that
$C(S_o') \ge (1 - \varepsilon)\,C(S_o)$. For any S ′′ ⊆ S,
$$C(S'') \cdot \frac{n}{\varepsilon z} \;\ge\; C'(S'') \;\ge\; C(S'') \cdot \frac{n}{\varepsilon z} - |S''|$$
So
$$C(S_o') \;\ge\; \frac{\varepsilon z}{n} C'(S_o') \;\ge\; \frac{\varepsilon z}{n} C'(S_o) \;\ge\; \frac{\varepsilon z}{n}\left( C(S_o)\cdot\frac{n}{\varepsilon z} - |S_o| \right) \;=\; C(S_o) - \frac{\varepsilon z}{n}|S_o|$$
The first and the third inequalities follow from the previous bound, and the second
inequality follows from the optimality of So′ with respect to C ′ . Since $C(S_o) \ge \frac{z|S_o|}{n}$,
it follows that $C(S_o') \ge (1 - \varepsilon)\,C(S_o)$.
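A minimal sketch of the scaling step (an illustration of ours; it assumes, as an approximation, that the largest item value plays the role of z, and the names are hypothetical):

import math

def scale_instance(z, r, eps):
    """Scale a subset-sum/knapsack instance: A = ceil(n/eps),
    z_i' = floor(z_i / (zmax/A)), r' = floor(r / (zmax/A))."""
    n = len(z)
    A = math.ceil(n / eps)
    zmax = max(z)
    unit = zmax / A                      # the scaling unit z/A
    z_scaled = [int(zi // unit) for zi in z]
    r_scaled = int(r // unit)
    return z_scaled, r_scaled

# The scaled instance can then be solved with the O(n * r') dynamic program
# sketched above, giving a solution of cost at least (1 - eps) times optimal.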
In the greedy algorithm, we pick the subset that is most cost-effective in terms
of the cost per newly covered element. The cost-effectiveness of a set U is defined as
$\frac{C(U)}{|U - V|}$, where V ⊂ S is the set of elements already covered. We do this repeatedly
till all elements are covered.
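A minimal Python sketch of this greedy rule (our own illustration; the representation of the set system as (cost, set) pairs is an assumption, not the text's notation):

def greedy_set_cover(universe, sets):
    """sets: list of (cost, frozenset) pairs.  Repeatedly pick the set
    minimizing cost per newly covered element until everything is covered.
    Assumes the sets together cover the universe."""
    covered = set()
    chosen = []
    while covered != universe:
        # most cost-effective set: minimum C(U) / |U - V|
        cost, best = min(
            ((c, s) for c, s in sets if s - covered),
            key=lambda cs: cs[0] / len(cs[1] - covered),
        )
        chosen.append((cost, best))
        covered |= best
    return chosen

# example:
# universe = set(range(1, 6))
# sets = [(1.0, frozenset({1, 2})), (1.0, frozenset({3, 4})),
#         (1.0, frozenset({1, 4, 5}))]
# greedy_set_cover(universe, sets)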
Let us number the elements of S in the order they were covered by the greedy
algorithm (wlog, we can renumber them as x1 , x2 . . .). We will apportion
the cost of covering an element e ∈ S as $w(e) = \frac{C(U)}{|U - V|}$, where e is covered for the first
time by U. The total cost of the cover is then $\sum_i w(x_i)$.
Claim 10.4
$$w(x_i) \le \frac{C_o}{n - i + 1}$$
where Co is the cost of an optimum cover.
In iteration i, the greedy choice is more cost-effective than any left-over set of the
optimal cover. Suppose the cost-effectiveness of the best set in the optimal cover is C ′ /U ′ ,
i.e.,
$$C'/U' = \min\left\{ \frac{C(S_{i_1})}{|S_{i_1} - V|},\; \frac{C(S_{i_2})}{|S_{i_2} - V|},\; \ldots,\; \frac{C(S_{i_k})}{|S_{i_k} - V|} \right\}$$
where $S_{i_1}, S_{i_2} \ldots S_{i_k}$ form a minimum set cover and V is the set of elements
covered so far. It follows that
$$C'/U' \;\le\; \frac{C(S_{i_1}) + C(S_{i_2}) + \ldots + C(S_{i_k})}{|S_{i_1} - V| + |S_{i_2} - V| + \ldots + |S_{i_k} - V|} \;\le\; \frac{C_o}{n - i + 1}$$
since the sets of the optimal cover together cover the at least n − i + 1 elements that are
still uncovered, at total cost at most Co .
So $w(x_i) \le \frac{C_o}{n-i+1}$.
Thus the cost of the greedy cover is $\sum_i \frac{C_o}{n-i+1}$, which is bounded by $C_o \cdot H_n$. Here
$H_n = \frac{1}{n} + \frac{1}{n-1} + \ldots + 1$.
Exercise 10.5 Formulate the Vertex Cover problem as an instance of the set cover
problem.
Analyze the approximation factor achieved by the following algorithm: construct a
maximal matching of the given graph and consider the union C of the end-points of
the matched edges. Prove that C is a vertex cover and that the size of the optimal cover
is at least |C|/2. So the approximation factor achieved is better than that for the general
set cover problem.
Metric TSP on graphs
Input: A graph G = (V, E) with weights on the edges that satisfy the triangle inequality.
1. Compute a Minimum Spanning Tree of G.
2. Double every edge of the spanning tree - call the resulting multigraph E ′ and construct an Euler tour T .
3. Obtain a tour of G by traversing T and short-cutting past vertices that have already been visited; by the triangle inequality this does not increase the length.
Claim 10.5 The length of this tour is no more than twice that of the optimal tour.
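A short Python sketch of this heuristic (our own illustration, not the text's code); it realizes the doubling-plus-Euler-tour step through the equivalent preorder walk of the MST:

import heapq

def metric_tsp_2approx(dist):
    """dist: symmetric n x n matrix of weights satisfying the triangle
    inequality.  Returns a tour (vertex order) of length at most twice
    optimal: the preorder walk of the MST is exactly the short-cut Euler
    tour of the doubled MST."""
    n = len(dist)
    in_tree = [False] * n
    children = [[] for _ in range(n)]
    heap = [(0, 0, 0)]                   # (weight, vertex, parent): Prim's MST
    while heap:
        w, v, p = heapq.heappop(heap)
        if in_tree[v]:
            continue
        in_tree[v] = True
        if v != 0:
            children[p].append(v)
        for u in range(n):
            if not in_tree[u] and u != v:
                heapq.heappush(heap, (dist[v][u], u, v))
    # preorder walk of the MST = short-cut Euler tour
    tour, stack = [], [0]
    while stack:
        v = stack.pop()
        tour.append(v)
        stack.extend(reversed(children[v]))
    return tour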
Observation 10.1 A graph that has maximum degree ∆ can be coloured using ∆ + 1
colours using a greedy strategy.
10.4.5 Maxcut
Problem Given a graph G = (V, E), we want to partition the vertices into sets U, V −
U such that the number of edges across U and V − U is maximized. There is a
corresponding weighted version for a weighted graph with a weight function w : E →
R.
We have designed a polynomial time algorithm for mincut, but maxcut is an
NP-hard problem. Let us explore the simple idea of randomly assigning each vertex to
one of the two partitions. For any fixed edge (u, v) ∈ E, the probability that u and v are
placed in different partitions - and hence that the edge crosses the cut - is exactly half.
Let Xe be a 0-1 random variable (also called an indicator random variable) that
is 1 iff the algorithm places the two end-points of e in different partitions. The expected
size of the cut produced by the algorithm is
$$E\Big[\sum_e w(e)\cdot X_e\Big] = \sum_e w(e)\cdot E[X_e] = \frac{1}{2}\sum_e w(e) \;\ge\; M_o/2$$
since the optimal maxcut Mo cannot exceed the total edge weight.
Therefore we have a simple randomized algorithm that attains a $\frac{1}{2}$ approximation.
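A minimal sketch of this randomized procedure (an illustration of ours; the edge-list representation is an assumption):

import random

def random_maxcut(n, edges):
    """edges: list of (u, v, w) triples.  Assign each vertex to one of two
    sides uniformly at random and return (cut_value, side).  In expectation
    the cut weight is at least half of the optimum."""
    side = [random.randint(0, 1) for _ in range(n)]
    cut_value = sum(w for u, v, w in edges if side[u] != side[v])
    return cut_value, side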
Exercise 10.6 For an unweighted graph show that a simple greedy strategy leads to
a $\frac{1}{2}$ approximation algorithm.
Appendix A
Example A.1 The number of moves required to solve the Tower of Hanoi problem
with n discs can be written as
$$a_n = 2a_{n-1} + 1$$
By substituting for an−1 this becomes
$$a_n = 2^2 a_{n-2} + 2 + 1$$
Continuing in this manner, we obtain
$$a_n = 2^{n-1} a_1 + 2^{n-2} + \ldots + 1$$
which, with a1 = 1, gives $a_n = 2^n - 1$.
For a recurrence of the form
$$a_{2n} = 2a_n + cn$$
we can use the same technique to show that $a_{2n} = \sum_{i=0}^{\log_2 n} 2^i \cdot \frac{n}{2^i}\cdot c + 2n\,a_1 = cn(\log_2 n + 1) + 2n\,a_1$.
Remark We made an assumption that n is a power of 2. In the general case, this may
present some technical complication but the nature of the answer remains unchanged.
Consider the recurrence
T (n) = 2T (⌊n/2⌋) + n
Suppose T (x) ≤ cx log2 x for some constant c > 0, for all x < n. Then
T (n) ≤ 2c⌊n/2⌋ log2 ⌊n/2⌋ + n ≤ cn log2 (n/2) + n = cn log2 n − cn + n ≤ cn log2 n
for c ≥ 1.
More generally, consider the recurrence
$$T(n) = aT(n/b) + f(n)$$
where a, b are constants and f(n) is a positive monotonic function.
Theorem A.1 For the following different cases, the above recurrence has the follow-
ing solutions
• If $f(n) = \Omega(n^{\log_b a + \epsilon})$ for some constant $\epsilon > 0$, and if $a f(n/b) \le c f(n)$ for some
constant c < 1, then T (n) is Θ(f (n)).
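For instance (an illustrative example), for $T(n) = 2T(n/2) + n^2$ we have $a = 2$, $b = 2$, so $n^{\log_b a} = n$ and $f(n) = n^2 = \Omega(n^{1+\epsilon})$ with $\epsilon = 1$; moreover $a f(n/b) = 2(n/2)^2 = \frac{1}{2} f(n)$, so this case applies and $T(n) = \Theta(n^2)$.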
Example A.3 What is the maximum number of regions induced by n lines in the
plane ? If we let Ln represent the number of regions, then we can write the following
recurrence
Ln ≤ Ln−1 + n L0 = 1
Again by the method of summation, we can arrive at the answer $L_n = \frac{n(n+1)}{2} + 1$.
Example A.4 Let us try to solve the recurrence for Fibonacci, namely
Fn = Fn−1 + Fn−2 F0 = 0, F1 = 1
If we try to expand this in the way that we have done previously, it becomes unwieldy
very quickly. Instead we "guess" the following solution
$$F_n = \frac{1}{\sqrt{5}}\left(\phi^n - \bar{\phi}^{\,n}\right)$$
where $\phi = \frac{1+\sqrt{5}}{2}$ and $\bar{\phi} = \frac{1-\sqrt{5}}{2}$. The above solution can be verified by induction. Of
course it is far from clear how one can magically guess the right solution. We shall
address this later in the chapter.
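For the inductive step it suffices to use the fact that both $\phi$ and $\bar{\phi}$ satisfy $x^2 = x + 1$:
$$F_{n-1} + F_{n-2} = \frac{1}{\sqrt{5}}\left(\phi^{n-2}(\phi + 1) - \bar{\phi}^{\,n-2}(\bar{\phi} + 1)\right) = \frac{1}{\sqrt{5}}\left(\phi^{n} - \bar{\phi}^{\,n}\right) = F_n.$$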
This observation (of unique solution) makes it somewhat easier for us to guess some
solution and verify.
Let us guess a solution of the form ar = Aαr where A is some constant. This may
be justified from the solution of Example A.1. By substituting this in the homogeneous
linear recurrence and simplifying, we obtain the following equation
$$c_0 \alpha^k + c_1 \alpha^{k-1} + \ldots + c_k = 0$$
This is called the characteristic equation of the recurrence relation, and this degree
k equation has k roots, say α1 , α2 . . . αk . If these are all distinct then the following is
a solution to the recurrence
$$a_r = A_1 \alpha_1^r + A_2 \alpha_2^r + \ldots + A_k \alpha_k^r$$
which is also called the homogeneous solution to the linear recurrence. The values of
A1 , A2 . . . Ak can be determined from the k boundary conditions (by solving k simul-
taneous equations).
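For instance (an illustrative example), for the recurrence $a_r = 5a_{r-1} - 6a_{r-2}$ the characteristic equation is $\alpha^2 - 5\alpha + 6 = 0$ with roots 2 and 3, so $a_r = A_1 2^r + A_2 3^r$; the boundary conditions $a_0 = 0$, $a_1 = 1$ give $A_1 = -1$, $A_2 = 1$, i.e. $a_r = 3^r - 2^r$.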
When the roots are not all distinct, i.e. some roots have multiplicity, then for a root
α of multiplicity m, the terms $\alpha^n, n\alpha^n, n^2\alpha^n \ldots n^{m-1}\alpha^n$ are the associated solutions.
This follows from the fact that if α is a multiple root of the characteristic equation,
then it is also a root of the derivative of the equation.
For an inhomogeneous term of one of the following forms, a particular solution can be
guessed with the corresponding form (B, B0 , B1 , B2 are constants to be determined):
    term                              trial solution
    d (a constant)                    B
    dn                                B1 n + B0
    dn^2                              B2 n^2 + B1 n + B0
    e d^n  (e, d constants)           B d^n
A.3 Generating functions
An alternative representation for a sequence a1 , a2 . . . ai is a polynomial function
a1 x+a2 x2 +. . . ai xi . Polynomials are very useful objects in mathematics, in particular
as ”placeholders.” For example if we know that two polynomials are equal (i.e. they
evaluate to the same value for all x), then all the corresponding coefficients must
be equal. This follows from the well known property that a degree d polynomial
has no more than d distinct roots (unless it is the zero polynomial). The issue of
convergence is not important at this stage but will be relevant when we use the
method of differentiation.
Example A.5 Consider the problem of changing a Rs 100 note using notes of the
following denominations - 50, 20, 10, 5 and 1. Suppose we have an infinite supply of
each denomination; then we can represent each of these using the following polynomials,
where the coefficient corresponding to x^i is non-zero if we can obtain the sum i
using the given denomination.
P1 (x) = x^0 + x^1 + x^2 + . . .
P5 (x) = x^0 + x^5 + x^10 + x^15 + . . .
P10 (x) = x^0 + x^10 + x^20 + x^30 + . . .
P20 (x) = x^0 + x^20 + x^40 + x^60 + . . .
P50 (x) = x^0 + x^50 + x^100 + x^150 + . . .
For example, using only Rs 50 notes we cannot obtain any sum between 51 and 99, so all
those coefficients in P50 (x) are zero.
By multiplying these polynomials we obtain
P (x) = E0 + E1 x + E2 x^2 + . . . + E100 x^100 + . . . + Ei x^i + . . .
where Ei is the number of ways the terms of the polynomials can combine such that
the sum of the exponents is i. Convince yourself that this is precisely what we are
looking for. However we must still obtain a formula for E100 , or more generally Ei ,
which is the number of ways of changing a sum of i.
Note that for the polynomials P1 , P5 . . . P50 , the following holds
$$P_k(x)\,(1 - x^k) = 1, \quad\text{i.e.}\quad P_k(x) = \frac{1}{1-x^k},$$
so that $P(x)\,(1-x)(1-x^5)(1-x^{10})(1-x^{20})(1-x^{50}) = 1$; equating coefficients on both
sides yields a linear recurrence for the Ei . Find the final answer by extending these observations.
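As a concrete way of extracting these coefficients numerically (an illustration of ours, not part of the text), one can multiply out the series up to degree 100:

def change_count(total=100, denominations=(1, 5, 10, 20, 50)):
    """Coefficient E_total of x^total in P1(x)*P5(x)*...*P50(x):
    the number of ways of making change for `total`."""
    # E[i] holds the coefficient of x^i of the product built so far
    E = [1] + [0] * total                # the polynomial "1"
    for d in denominations:
        # multiplying by 1/(1 - x^d) == allowing any number of notes of value d
        for i in range(d, total + 1):
            E[i] += E[i - d]
    return E[total]

# change_count() returns the number of ways of changing Rs 100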
Let us try the method of generating functions on the Fibonacci sequence.
Example A.6 Let the generating function be G(z) = F0 + F1 z + F2 z^2 + . . . + Fn z^n + . . .,
where Fi is the i-th Fibonacci number. Then G(z) − zG(z) − z^2 G(z) can be written as the
infinite series
$$F_0 + (F_1 - F_0)z + (F_2 - F_1 - F_0)z^2 + \ldots + (F_{i+2} - F_{i+1} - F_i)z^{i+2} + \ldots = z$$
for F0 = 0, F1 = 1. Therefore
$$G(z) = \frac{z}{1-z-z^2}.$$
This can be worked out to be
$$G(z) = \frac{1}{\sqrt{5}}\left(\frac{1}{1-\phi z} - \frac{1}{1-\bar{\phi} z}\right)$$
where $\bar{\phi} = 1 - \phi = \frac{1-\sqrt{5}}{2}$.
Example A.7 Let Dn denote the number of derangements of n objects. Then it can
be shown that Dn = (n − 1)(Dn−1 + Dn−2 ). This can be rewritten as Dn − nDn−1 =
−(Dn−1 − (n − 1)Dn−2 ). Iterating this, we obtain Dn − nDn−1 = (−1)^{n−2} (D2 − 2D1 ).
Using D2 = 1, D1 = 0, we obtain
$$D_n - nD_{n-1} = (-1)^n, \quad\text{i.e.}\quad D_n = nD_{n-1} + (-1)^n.$$
If we let D(x) represent the exponential generating function for derangements, then after
simplification we get
$$D(x) = \frac{e^{-x}}{1-x}.$$
where Ci,j are constants. We will use the technique of generating functions to extend
the one variable method. Let
A1 (x) = a1,0 + a1,1 x + . . . + a1,r x^r
. . .
An (x) = an,0 + an,1 x + . . . + an,r x^r
Then we can define a generating function with A0 (x), A1 (x), A2 (x), . . . as the sequence
- the new indeterminate can be chosen as y.
Appendix B
The sample space Ω may be infinite, with infinitely many elements that are called elementary
events. For example, consider the experiment where we toss a coin until a head
comes up for the first time. A probability space consists of a sample space together with a
probability measure associated with the elementary events. The probability measure
Pr is a real valued function on events of the sample space and satisfies the following:
(i) Pr[Ω] = 1,
(ii) 0 ≤ Pr[A] ≤ 1 for every event A,
(iii) Pr[A ∪ B] = Pr[A] + Pr[B] for disjoint events A and B (and, more generally, countable
additivity for pairwise disjoint events).
Sometimes we are only interested in a certain collection of events (rather than the entire
sample space), say F . If F is closed under union and complementation, then the
above properties can be stated as if F were the whole of Ω.
The principle of Inclusion-Exclusion has its counterpart in the probabilistic world,
namely
Lemma B.1
$$\Pr[\cup_i E_i] = \sum_i \Pr[E_i] - \sum_{i<j} \Pr[E_i \cap E_j] + \sum_{i<j<k} \Pr[E_i \cap E_j \cap E_k] - \ldots$$
Definition B.1 A random variable (r.v.) X is a real-valued function over the sample
space, X : Ω → R. A discrete random variable is a random variable whose range is
finite or a countably infinite subset of R.
The distribution function FX : R → [0, 1] for a random variable X is defined as
FX (x) = Pr[X ≤ x]. The probability density function of a discrete r.v. X, denoted fX , is
given by fX (x) = Pr[X = x].
The expectation of a r.v. X is denoted by $E[X] = \sum_x x \cdot \Pr[X = x]$.
A very useful property of expectation, called the linearity property can be stated as
follows
Lemma B.2 If X and Y are random variables, then
$$E[X + Y] = E[X] + E[Y]$$
The theorem of total expectation, which can be proved easily, states that
$$E[X] = \sum_y E[X \mid Y = y] \cdot \Pr[Y = y]$$
The probability generating function (PGF) of a non-negative integer valued random
variable X with $p_i = \Pr[X = i]$ is $G_X(z) = \sum_i p_i z^i$. This is also known as the z-transform
of X and it is easily seen that $G_X(1) = 1 = \sum_i p_i$. The convergence of the PGF is an
important issue for some calculations involving differentiation of the PGF. For example,
$$E[X] = \left.\frac{dG_X(z)}{dz}\right|_{z=1}$$
The notion of expectation of a random variable can be extended to a function f(X)
of the random variable X in the following way
$$E[f(X)] = \sum_i p_i f(i)$$
For independent random variables X and Y, $E[e^{\lambda(X+Y)}] = E[e^{\lambda X}]\cdot E[e^{\lambda Y}]$,
i.e., the MGF (moment generating function) of the sum of independent random variables is the
product of the individual MGFs.
If we have knowledge of the second moment, then the following gives a stronger
result
Chebychev's inequality
$$\Pr[(X - E[X])^2 \ge t] \le \frac{\sigma^2}{t} \qquad\text{(B.1.2)}$$
where σ^2 is the variance, i.e. $E[X^2] - E^2[X]$.
With knowledge of higher moments, we have the following inequality. If
$X = \sum_{i=1}^{n} x_i$ is the sum of n mutually independent random variables where each xi is
uniformly distributed in {−1, +1}, then for any ∆ > 0 and λ > 0,
Chernoff bounds
$$\Pr[X \ge \Delta] \le e^{-\lambda\Delta}\, E[e^{\lambda X}] \qquad\text{(B.1.3)}$$
If we choose λ = ∆/n, the RHS becomes $e^{-\Delta^2/2n}$, using the fact that
$\frac{e^{-\lambda}+e^{\lambda}}{2} = \cosh(\lambda) \le e^{\lambda^2/2}$.
A more useful form of the above inequality is for a situation where a random
variable X is the sum of n independent 0-1 valued Poisson trials with a success prob-
ability of pi in each trial. If $\sum_i p_i = np$, the following equations give us concentration
bounds on the deviation of X from the expected value np. The first equation is more
useful for large deviations whereas the other two are useful for small deviations from
a large expected value.
$$\Pr[X \ge m] \le \left(\frac{np}{m}\right)^m e^{m-np} \qquad\text{(B.1.4)}$$
$$\Pr[X \le (1-\epsilon)np] \le \exp(-\epsilon^2 np/2) \qquad\text{(B.1.5)}$$
$$\Pr[X \ge (1+\epsilon)np] \le \exp(-\epsilon^2 np/3) \qquad\text{(B.1.6)}$$
for all 0 < ǫ < 1.
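A quick numerical sanity check of, say, bound (B.1.6) can be carried out by simulation (an illustration of ours; the parameters are arbitrary and the identical success probability p is a simplifying assumption):

import math
import random

def tail_estimate(n=1000, p=0.3, eps=0.2, trials=20000):
    """Empirically estimate Pr[X >= (1+eps)*n*p] for X a sum of n independent
    0-1 trials with success probability p, and compare with the Chernoff
    bound exp(-eps^2 * n * p / 3)."""
    threshold = (1 + eps) * n * p
    count = 0
    for _ in range(trials):
        x = sum(1 for _ in range(n) if random.random() < p)
        if x >= threshold:
            count += 1
    empirical = count / trials
    chernoff = math.exp(-eps * eps * n * p / 3)
    return empirical, chernoff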
A special case of non-independent random variables
Consider n 0-1 random variables y1 , y2 , . . . yn such that Pr[yi = 1] ≤ pi and $\sum_i p_i = np$.
The random variables are not known to be independent. In such a case we cannot
directly invoke the previous Chernoff bounds, but we will show the following.
Lemma B.3 Let $Y = \sum_i y_i$ and let $X = \sum_i x_i$ where the xi are independent Poisson
trials with $\Pr[x_i = 1] = p_i$. Then
$$\Pr[Y \ge k] \le \Pr[X \ge k] \quad \forall k,\ 0 \le k \le n$$
Therefore we can invoke the Chernoff bounds on X to obtain a bound on Y . We
will prove the above property by induction on i (the number of variables). For i = 1 (and for
all k) this is true by definition. Suppose it is true for i < t (for all k ≤ i) and let
i = t. Let Xi = x1 + x2 + . . . + xi and Yi = y1 + y2 + . . . + yi . Then