Introduction to Algorithms (6.006), Recitation 4
Massachusetts Institute of Technology
Instructors: Erik Demaine, Jason Ku, and Justin Solomon
Operations O(·)

                      Container   Static    Dynamic      Order
Data Structure        build(X)    find(k)   insert(x)    find_min()    find_prev(k)
                                            delete(k)    find_max()    find_next(k)
-----------------------------------------------------------------------------------
Array                 n           n         n            n             n
Sorted Array          n log n     log n     n            1             log n
We’ve learned how to implement a set interface using a sorted array, whose query operations are
efficient but whose dynamic operations are slow. Recalling that Θ(log n) growth is much closer
to Θ(1) than Θ(n), a sorted array provides really good performance! But one of the most common
operations you will do in programming is to search for something you’re storing, i.e., find(k).
Is it possible to find faster than Θ(log n)? It turns out that if the only thing we can do to items is
to compare their relative order, then the answer is no!
Comparison Model
The comparison model of computation acts on a set of comparable objects. The objects can be
thought of as black boxes, supporting only a set of binary boolean operations called comparisons
(namely <, ≤, >, ≥, =, and ≠). Each operation takes as input two objects and outputs a Boolean
value, either True or False, depending on the relative ordering of the elements. A search algorithm
operating on a set of n items will return a stored item with a key equal to the input key, or return
no item if no such item exists. In this section, we assume that each item has a unique key.
If binary comparisons are the only way to distinguish between stored items and a search key, a
deterministic comparison search algorithm can be thought of as a fixed binary decision tree rep-
resenting all possible executions of the algorithm, where each node represents a comparison per-
formed by the algorithm. During execution, the algorithm walks down the tree along a path from
the root. For any given input, a comparison search algorithm makes some comparison first, the
comparison at the root of the tree. Depending on the outcome of this comparison, the computation
proceeds with a comparison at one of the root's two children. The algorithm repeatedly makes
comparisons until a leaf is reached, at which point the algorithm terminates and returns an output.
There must be a leaf for each possible output of the algorithm. For search, there are
n + 1 possible outputs, the n items and the result where no item is found, so there must be at least
n + 1 leaves in the decision tree. Then the worst-case number of comparisons that must be made
by any comparison search algorithm will be the height of the algorithm’s decision tree, i.e., the
length of any longest root to leaf path.
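For example, binary search on a sorted array walks exactly such a decision tree: each comparison
against a middle item's key selects one of two subtrees, and a leaf is reached after O(log n)
comparisons. A minimal sketch (the function name and the items' .key attribute are illustrative,
matching the set interface above):

def binary_search_find(A, k):
    # A is sorted by key; each loop iteration performs O(1) comparisons,
    # descending one level of the decision tree, so O(log n) total.
    lo, hi = 0, len(A) - 1
    while lo <= hi:
        mid = (lo + hi) // 2           # internal node: compare against A[mid]
        if A[mid].key == k:            # comparison outcome: equal
            return A[mid]              # leaf: item found
        elif A[mid].key < k:           # comparison outcome: A[mid] too small
            lo = mid + 1               # descend into right subtree
        else:
            hi = mid - 1               # descend into left subtree
    return None                        # leaf: no item with key k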
Exercise: Prove that the smallest height of any binary tree on n nodes is ⌈lg(n + 1)⌉ − 1 = Ω(log n).
Solution: We show that the maximum number of nodes in any binary tree with height h is
n ≤ T(h) = 2^(h+1) − 1, so h ≥ lg(n + 1) − 1, and since h is an integer, h ≥ ⌈lg(n + 1)⌉ − 1.
Proof by induction on h. The only tree of height zero has one node, so T(0) = 1, a base case
satisfying the claim. A height-h tree with the maximum number of nodes must also have the
maximum number of nodes in each of its two height-(h − 1) subtrees, so T(h) = 2T(h − 1) + 1.
Substituting T(h) yields 2^(h+1) − 1 = 2(2^h − 1) + 1, proving the claim.
A binary tree with n + 1 leaves has more than n nodes, so its height is Ω(log n). Thus the min-
imum number of comparisons needed to distinguish between the n + 1 possible outputs is Ω(log n),
and the worst-case running time of any deterministic comparison search algorithm is Ω(log n)!
So sorted arrays and balanced BSTs are able to support find(k) asymptotically optimally, in a
comparison model of computation.
Comparisons are very limiting because each operation performed leads to at most a constant
branching factor in the decision tree. It doesn't matter that comparisons have branching factor
two; any fixed constant branching factor will lead to a decision tree with at least Ω(log n) height.
If we are not limited to comparisons, faster-than-Θ(log n) search becomes possible. More
specifically, if we can use an operation that allows an asymptotically larger-than-constant
ω(1) branching factor, then our decision tree can be shallower, leading to a faster algorithm.
Now suppose we want to store a set of n items, each associated with a unique integer key in the
bounded range from 0 to some large number u − 1. We can store the items in a length u direct
access array, where each array slot i contains an item associated with integer key i, if it exists. To
find an item having integer key i, a search algorithm can simply look in array slot i to respond to
the search query in worst-case constant time! However, order operations on this data structure
will be very slow: we have no guarantee on where the first, last, or next element is in the direct
access array, so we may have to spend u time for order operations.
Worst-case constant time search comes at the cost of storage space: a direct access array must have
a slot available for every possible key in range. When u is very large compared to the number of
items being stored, storing a direct access array can be wasteful, or even impossible on modern
machines. For example, suppose you wanted to support the set find(k) operation on ten-letter
names using a direct access array. The space of possible names has size u = 26^10 ≈ 1.4 × 10^14;
even storing a bit array of that length would require about 17.6 terabytes of storage space. How can we
overcome this obstacle? The answer is hashing!
class DirectAccessArray:
    def __init__(self, u):  self.A = [None] * u        # O(u)
    def find(self, k):      return self.A[k]           # O(1)
    def insert(self, x):    self.A[x.key] = x          # O(1)
    def delete(self, k):    self.A[k] = None           # O(1)
    def find_next(self, k):
        for i in range(k, len(self.A)):                # O(u)
            if self.A[i] is not None:
                return self.A[i]
    def find_max(self):
        for i in range(len(self.A) - 1, -1, -1):       # O(u)
            if self.A[i] is not None:
                return self.A[i]
    def delete_max(self):
        for i in range(len(self.A) - 1, -1, -1):       # O(u)
            x = self.A[i]
            if x is not None:
                self.A[i] = None
                return x
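A quick illustration of these costs (hypothetical items; any object with an integer key in
{0, . . . , u − 1} will do):

from collections import namedtuple
Item = namedtuple('Item', ['key', 'value'])

D = DirectAccessArray(100)             # u = 100, so keys must lie in {0, ..., 99}
D.insert(Item(42, 'a'))
D.insert(Item(57, 'b'))
print(D.find(42).value)                # 'a', found in worst-case O(1) time
print(D.find_next(43).key)             # 57, but found by an O(u) scan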
Hashing
Is it possible to get the performance benefits of a direct access array while using only linear O(n)
space when n ≪ u? A possible solution could be to store the items in a smaller dynamic direct
access array, with m = O(n) slots instead of u, which grows and shrinks like a dynamic array
depending on the number of items stored. But to make this work, we need a function that maps
item keys to different slots of the direct access array, h(k) : {0, . . . , u − 1} → {0, . . . , m − 1}. We
call such a function a hash function or a hash map, while the smaller direct access array is called
a hash table, and h(k) is the hash of integer key k. If the hash function happens to be injective
over the n keys you are storing, i.e. no two keys map to the same direct access array index, then
we will be able to support worst-case constant time search, as the hash table simply acts as a direct
access array over the smaller domain m.
Unfortunately, if the space of possible keys is larger than the number of array indices, i.e. m < u,
then any hash function mapping u possible keys to m indices must map multiple keys to the same
array index, by the pigeonhole principle. If two items associated with keys k1 and k2 hash to the
same index, i.e. h(k1 ) = h(k2 ), we say that the hashes of k1 and k2 collide. If you don’t know in
advance what keys will be stored, it is extremely unlikely that your choice of hash function will
avoid collisions entirely [1]. If the smaller direct access array hash table can only store one item at
each index, where do we store the colliding items when collisions occur? Either we store them
elsewhere in the same direct access array, or we store them outside the array in some other data
structure. The first strategy is called open addressing, which is the way most hash tables are actually
implemented, but such schemes can be difficult to analyze. We will adopt the second strategy, called chaining.
Chaining
Chaining is a collision resolution strategy where colliding keys are stored separately from the orig-
inal hash table. Each hash table index holds a pointer to a chain, a separate data structure that sup-
ports the dynamic set interface, specifically operations find(k), insert(x), and delete(k).
It is common to implement a chain using a linked list or dynamic array, but any implementation
will do, as long as each operation takes no more than linear time. Then to insert item x into the
hash table, simply insert x into the chain at index h(x.key); and to find or delete a key k from
the hash table, simply find or delete k from the chain at index h(k).
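Below is a minimal sketch of such a table, using plain Python lists as chains and taking the hash
function h as a parameter (both choices are illustrative; the recitation's implementation later in
this handout uses a linked-list-backed set for each chain):

class ChainingHashTable:
    def __init__(self, m, h):
        self.m, self.h = m, h                  # table size and hash function
        self.table = [[] for _ in range(m)]    # one chain per table index

    def find(self, k):                         # O(length of chain at h(k))
        for x in self.table[self.h(k)]:
            if x.key == k:
                return x
        return None

    def insert(self, x):                       # O(length of chain at h(x.key))
        chain = self.table[self.h(x.key)]
        for i, y in enumerate(chain):
            if y.key == x.key:                 # set semantics: replace same key
                chain[i] = x
                return
        chain.append(x)

    def delete(self, k):                       # O(length of chain at h(k))
        chain = self.table[self.h(k)]
        for i, y in enumerate(chain):
            if y.key == k:
                return chain.pop(i)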
Ideally, we want chains to be small, because if our chains only hold a constant number of items,
the dynamic set operations will run in constant time. But suppose we are unlucky in our choice of
hash function, and all the keys we want to store hash to the same index, into the same chain.
Then the chain will have linear size, meaning the dynamic set operations could take
linear time. A good hash function will try to minimize the frequency of such collisions in order to
minimize the maximum size of any chain. So what’s a good hash function?
Hash Functions
Division Method (bad): The simplest mapping from an integer key domain of size u to a smaller
one of size m is simply to divide the key by m and take the remainder: h(k) = (k mod m), or in
Python, k % m. If the keys you are storing are uniformly distributed over the domain, the division
method will distribute items roughly evenly among hashed indices, so we expect chains to have
small size, providing good performance. However, if all items happen to have keys with the same
remainder when divided by m, then this hash function will be terrible. Ideally, the performance of
our data structure would be independent of the keys we choose to store.
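For example (illustrative keys), with m = 10 the division method spreads out some key sets
evenly but maps others entirely to one chain:

m = 10                                          # hash table with 10 slots
h = lambda k: k % m                             # division method
print([h(k) for k in [3, 14, 26, 58, 79]])      # [3, 4, 6, 8, 9]: evenly spread
print([h(k) for k in [2, 12, 42, 102, 992]])    # [2, 2, 2, 2, 2]: one long chain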
[1] If you know all of the keys you will want to store in advance, it is possible to design a hashing
scheme that will always avoid collisions between those keys. This idea, called perfect hashing,
follows from the Birthday Paradox.
Universal Hashing (good): For a large enough key domain u, every hash function will be bad for
some set of n inputs [2]. However, we can achieve good expected bounds on hash table performance
by choosing our hash function randomly from a large family of hash functions. Here the expecta-
tion is over our choice of hash function, which is independent of the input. This is not expectation
over the domain of possible input keys. One family of hash functions that performs well is:
H(m, p) = { h_ab(k) = (((a·k + b) mod p) mod m) | a, b ∈ {0, . . . , p − 1} and a ≠ 0 },
where p is a prime that is larger than the key domain u. A single hash function from this family
is specified by choosing concrete values for a and b. This family of hash functions is universal [3]:
for any two keys, the probability that their hashes will collide when hashed using a hash function
chosen uniformly at random from the universal family, is no greater than 1/m, i.e.
Pr_{h∈H} { h(k_i) = h(k_j) } ≤ 1/m,  for all k_i ≠ k_j ∈ {0, . . . , u − 1}.
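A short sketch of drawing a hash function at random from such a family (the helper name is
illustrative; the prime p = 2^31 − 1 matches the one used in the hash table code below):

from random import randint

def random_universal_hash(m, p=2**31 - 1):
    # h_ab(k) = ((a*k + b) % p) % m with a, b chosen uniformly at random
    a = randint(1, p - 1)                # a != 0
    b = randint(0, p - 1)
    return lambda k: ((a * k + b) % p) % m

h = random_universal_hash(100)           # maps {0, ..., p-1} to {0, ..., 99}
print(h(12345), h(12346))                # equal with probability <= 1/100
                                         # (over the random choice of h)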
If we know that a family of hash functions is universal, then we can upper bound the expected
size of any chain, in expectation over our choice of hash function from the family. Let X_ij be
the indicator random variable that takes value 1 if keys k_i and k_j collide for a chosen hash
function, and 0 otherwise. Then the random variable representing the number of items hashed to
index h(k_i) is the sum X_i = Σ_j X_ij over all keys k_j from the set of n keys {k_0, . . . , k_(n−1)}
stored in the hash table. Then the expected number of keys hashed to the chain at index h(k_i) is:

    E_{h∈H}{X_i} = E_{h∈H}{Σ_j X_ij} = Σ_j E_{h∈H}{X_ij} = 1 + Σ_{j≠i} E_{h∈H}{X_ij}
                 = 1 + Σ_{j≠i} ((1) · Pr_{h∈H}{h(k_i) = h(k_j)} + (0) · Pr_{h∈H}{h(k_i) ≠ h(k_j)})
                 ≤ 1 + Σ_{j≠i} 1/m = 1 + (n − 1)/m.
If the size of the hash table is at least linear in the number of items stored, i.e. m = Ω(n), then
the expected size of any chain will be 1 + (n − 1)/Ω(n) = O(1), a constant! Thus a hash table
where collisions are resolved using chaining, implemented using a randomly chosen hash function
from a universal family, will perform dynamic set operations in expected constant time, where
the expectation is taken over the random choice of hash function, independent from the input keys!
Note that in order to maintain m = O(n), insertion and deletion operations may require you to
rebuild the direct access array to a different size, choose a new hash function, and reinsert all the
items back into the hash table. This can be done in the same way as in dynamic arrays, leading to
amortized bounds for dynamic operations.
[2] If u > nm, every hash function from u possible keys to m indices maps some n keys to the same
hash, by the pigeonhole principle.
[3] The proof that this family is universal is beyond the scope of 6.006, though it is usually derived
in 6.046.
from random import randint

class Hash_Table_Set:
    def __init__(self, r = 200):                       # O(1)
        # chain data structure from a previous recitation
        self.chain_set = Set_from_Seq(Linked_List_Seq)
        self.A = []
        self.size = 0
        self.r = r                                     # 100/self.r = fill ratio
        self.p = 2**31 - 1                             # prime larger than key domain
        self.a = randint(1, self.p - 1)                # random hash parameter
        self._compute_bounds()
        self._resize(0)

    def __len__(self):  return self.size               # O(1)
    def __iter__(self):                                # O(n)
        for X in self.A:
            yield from X

    def build(self, X):                                # O(n)e
        for x in X: self.insert(x)

    def _hash(self, k, m):                             # O(1)
        return ((self.a * k) % self.p) % m

    def _compute_bounds(self):                         # O(1)
        self.upper = len(self.A)
        self.lower = len(self.A) * 100*100 // (self.r*self.r)

    def _resize(self, n):                              # O(n)
        if (self.lower >= n) or (n >= self.upper):
            f = self.r // 100
            if self.r % 100:    f += 1
            # f = ceil(r / 100)
            m = max(n, 1) * f
            A = [self.chain_set() for _ in range(m)]
            for x in self:
                h = self._hash(x.key, m)
                A[h].insert(x)
            self.A = A
            self._compute_bounds()
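The listing above is truncated before the set operations themselves. The following find, insert,
and delete are a sketch consistent with the code above, delegating each operation to the chain at
the hashed index; it assumes the chain's insert(x) returns whether a new item was added rather
than an existing key replaced:

    # methods of Hash_Table_Set, continued
    def find(self, k):                             # O(1)e
        h = self._hash(k, len(self.A))
        return self.A[h].find(k)

    def insert(self, x):                           # O(1)ae
        self._resize(self.size + 1)                # grow table if needed
        h = self._hash(x.key, len(self.A))
        added = self.A[h].insert(x)                # chain handles duplicate keys
        if added: self.size += 1
        return added

    def delete(self, k):                           # O(1)ae
        assert len(self) > 0
        h = self._hash(k, len(self.A))
        x = self.A[h].delete(k)
        self.size -= 1
        self._resize(self.size)                    # shrink table if needed
        return x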
Operations O(·)

                      Container   Static    Dynamic      Order
Data Structure        build(X)    find(k)   insert(x)    find_min()    find_prev(k)
                                            delete(k)    find_max()    find_next(k)
-----------------------------------------------------------------------------------
Array                 n           n         n            n             n
Sorted Array          n log n     log n     n            1             log n
Direct Access Array   u           1         1            u             u
Hash Table            n(e)        1(e)      1(a)(e)      n             n

Here (e) marks an expected bound (over the random choice of hash function) and (a) marks an
amortized bound.
Exercise
Given an unsorted array A = [a_0, . . . , a_(n−1)] containing n positive integers, the DUPLICATES
problem asks whether two integers in the array have the same value.
[4] In 6.006, you do not have to specify these details when answering problems. You may simply
quote that hash tables can achieve the expected/amortized bounds for operations described in class.
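For instance, DUPLICATES can be solved in expected O(n) time by inserting keys into a hash-table
set and stopping as soon as some key is seen twice. A minimal sketch using Python's built-in
hash-based set:

def has_duplicates(A):                   # expected O(n)
    seen = set()                         # hash-based set: O(1)e membership
    for a in A:
        if a in seen:                    # a appeared earlier in A
            return True
        seen.add(a)
    return False

print(has_duplicates([3, 1, 4, 1, 5]))   # True: 1 appears twice
print(has_duplicates([3, 1, 4, 5, 9]))   # False: all values distinct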
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms