CLRS Linked Lists

11.2 Hash tables

[Figure 11.3: Collision resolution by chaining. The figure shows the table T, the universe of
keys U, and the set K of actual keys k_1, ..., k_8. Each hash-table slot T[j] contains a linked
list of all the keys whose hash value is j. For example, h(k_1) = h(k_4) and
h(k_5) = h(k_7) = h(k_2). The linked list can be either singly or doubly linked; we show it as
doubly linked because deletion is faster that way.]
There is one hitch: two keys may hash to the same slot. We call this situation
a collision. Fortunately, we have effective techniques for resolving the conflict
created by collisions.
Of course, the ideal solution would be to avoid collisions altogether. We might
try to achieve this goal by choosing a suitable hash function h. One idea is to
make h appear to be random, thus avoiding collisions or at least minimizing
their number. The very term "to hash," evoking images of random mixing and
chopping, captures the spirit of this approach. (Of course, a hash function h must be
deterministic in that a given input k should always produce the same output h(k).)
Because |U| > m, however, there must be at least two keys that have the same hash
value; avoiding collisions altogether is therefore impossible. Thus, while a
well-designed, random-looking hash function can minimize the number of collisions,
we still need a method for resolving the collisions that do occur.
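The pigeonhole argument above can be illustrated with a small Python sketch. The helper name find_collision and the particular function h below are assumptions made for illustration only; any deterministic function into m slots would behave the same way on m + 1 distinct keys.

```python
def find_collision(h, keys):
    """Return a pair of distinct keys with equal hash values, or None."""
    seen = {}  # hash value -> first key observed with that value
    for k in keys:
        slot = h(k)
        if slot in seen and seen[slot] != k:
            return seen[slot], k
        seen.setdefault(slot, k)
    return None


m = 7

def h(k):
    # An arbitrary deterministic function into slots 0..m-1 (illustrative only).
    return (31 * k + 5) % m

# m + 1 distinct keys cannot all land in different slots.
pair = find_collision(h, range(m + 1))
```

Because the keys outnumber the slots, the search is guaranteed to return a pair no matter how h is chosen.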
The remainder of this section presents the simplest collision resolution technique,
called chaining. Section 11.4 introduces an alternative method for resolving
collisions, called open addressing.
Collision resolution by chaining
In chaining, we place all the elements that hash to the same slot into the same
linked list, as Figure 11.3 shows. Slot j contains a pointer to the head of the list of
all stored elements that hash to j; if there are no such elements, slot j contains NIL.
The dictionary operations on a hash table T are easy to implement when collisions
are resolved by chaining:
CHAINED-HASH-INSERT(T, x)
    insert x at the head of list T[h(x.key)]
CHAINED-HASH-SEARCH(T, k)
    search for an element with key k in list T[h(k)]
CHAINED-HASH-DELETE(T, x)
    delete x from the list T[h(x.key)]
The worst-case running time for insertion is O(1). The insertion procedure is fast
in part because it assumes that the element x being inserted is not already present in
the table; if necessary, we can check this assumption (at additional cost) by searching
for an element whose key is x.key before we insert. For searching, the worst-case
running time is proportional to the length of the list; we shall analyze this
operation more closely below. We can delete an element in O(1) time if the lists
are doubly linked, as Figure 11.3 depicts. (Note that CHAINED-HASH-DELETE
takes as input an element x and not its key k, so that we don't have to search for x
first. If the hash table supports deletion, then its linked lists should be doubly linked
so that we can delete an item quickly. If the lists were only singly linked, then to
delete element x, we would first have to find x in the list T[h(x.key)] so that we
could update the next attribute of x's predecessor. With singly linked lists, both
deletion and searching would have the same asymptotic running times.)
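The three chaining procedures can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the class names Node and ChainedHashTable, the default hash function based on Python's built-in hash, and the small usage example are all assumptions introduced here.

```python
class Node:
    """A list element x with attributes key, prev, and next."""

    def __init__(self, key):
        self.key = key
        self.prev = None
        self.next = None


class ChainedHashTable:
    def __init__(self, m, h=None):
        self.m = m
        # Assumed default hash function; any function into 0..m-1 works.
        self.h = h if h is not None else (lambda k: hash(k) % m)
        self.slots = [None] * m  # slot j holds the head of list T[j], or None (NIL)

    def insert(self, x):
        """Insert x at the head of list T[h(x.key)]; O(1) worst case."""
        j = self.h(x.key)
        x.prev = None
        x.next = self.slots[j]
        if x.next is not None:
            x.next.prev = x
        self.slots[j] = x

    def search(self, k):
        """Walk list T[h(k)]; time proportional to the length of that list."""
        x = self.slots[self.h(k)]
        while x is not None and x.key != k:
            x = x.next
        return x

    def delete(self, x):
        """Unlink element x (an element, not a key) in O(1) time."""
        if x.prev is not None:
            x.prev.next = x.next
        else:
            self.slots[self.h(x.key)] = x.next
        if x.next is not None:
            x.next.prev = x.prev
        x.prev = x.next = None


T = ChainedHashTable(9, h=lambda k: k % 9)
for key in (5, 28, 19):
    T.insert(Node(key))
head_of_slot_1 = T.slots[1].key      # 19 was inserted after 28, so it is at the head
T.delete(T.search(19))               # delete takes the element, so no second search
```

The prev pointer is what lets delete run without scanning the list; with only a next pointer, the predecessor of x would have to be found first.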
Analysis of hashing with chaining
How well does hashing with chaining perform? In particular, how long does it take
to search for an element with a given key?
Given a hash table T with m slots that stores n elements, we define the load
factor α for T as n/m, that is, the average number of elements stored in a chain.
Our analysis will be in terms of α, which can be less than, equal to, or greater
than 1.
The worst-case behavior of hashing with chaining is terrible: all n keys hash
to the same slot, creating a list of length n. The worst-case time for searching is
thus Θ(n) plus the time to compute the hash function, which is no better than if we used
one linked list for all the elements. Clearly, we do not use hash tables for their
worst-case performance. (Perfect hashing, described in Section 11.5, does provide
good worst-case performance when the set of keys is static, however.)
The average-case performance of hashing depends on how well the hash function h
distributes the set of keys to be stored among the m slots, on the average.
Section 11.3 discusses these issues, but for now we shall assume that any given
element is equally likely to hash into any of the m slots, independently of where
any other element has hashed to. We call this the assumption of simple uniform
hashing.
For j = 0, 1, ..., m − 1, let us denote the length of the list T[j] by n_j, so that

\[ n = n_0 + n_1 + \cdots + n_{m-1}, \tag{11.1} \]

and the expected value of n_j is E[n_j] = α = n/m.
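Equation (11.1) and the load factor can be checked numerically with a short Python sketch. It models simple uniform hashing by placing each element in a slot chosen uniformly and independently; the function name chain_lengths, the seed, and the values of n and m are assumptions chosen for illustration.

```python
import random


def chain_lengths(n, m, rng):
    """Place n elements into m slots uniformly and independently.

    Returns the list [n_0, n_1, ..., n_{m-1}] of chain lengths.
    """
    lengths = [0] * m
    for _ in range(n):
        lengths[rng.randrange(m)] += 1
    return lengths


rng = random.Random(0)   # fixed seed so the run is repeatable
n, m = 1000, 100
lengths = chain_lengths(n, m, rng)
alpha = n / m            # load factor

total = sum(lengths)             # equals n, as equation (11.1) states
average_length = total / m       # equals alpha exactly
longest = max(lengths)           # individual chains can be longer than alpha
```

The average chain length always equals α, whatever the random choices were; only the individual lengths n_j fluctuate around it.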
We assume that O(1) time suffices to compute the hash value h(k), so that
the time required to search for an element with key k depends linearly on the
length n_h(k) of the list T[h(k)]. Setting aside the O(1) time required to compute
the hash function and to access slot h(k), let us consider the expected number of
elements examined by the search algorithm, that is, the number of elements in the
list T[h(k)] that the algorithm checks to see whether any have a key equal to k. We
shall consider two cases. In the first, the search is unsuccessful: no element in the
table has key k. In the second, the search successfully finds an element with key k.
Theorem 11.1
In a hash table in which collisions are resolved by chaining, an unsuccessful search
takes average-case time Θ(1 + α), under the assumption of simple uniform hashing.
Proof Under the assumption of simple uniform hashing, any key k not already
stored in the table is equally likely to hash to any of the m slots. The expected time
to search unsuccessfully for a key k is the expected time to search to the end of
list T[h(k)], which has expected length E[n_h(k)] = α. Thus, the expected number
of elements examined in an unsuccessful search is α, and the total time required
(including the time for computing h(k)) is Θ(1 + α).
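A small exact calculation illustrates the count used in this proof. The sketch below conditions on a fixed table (a simplification of the proof's setting) and assumes only that the absent key hashes to each slot with probability 1/m; the chain lengths and the function name are hypothetical.

```python
from fractions import Fraction


def expected_unsuccessful_examined(lengths):
    """Expected number of elements examined by an unsuccessful search.

    The absent key lands in slot j with probability 1/m, and the search
    then examines all n_j elements of that chain.
    """
    m = len(lengths)
    return sum(Fraction(n_j, m) for n_j in lengths)


lengths = [4, 0, 1, 0, 2]            # hypothetical chain lengths n_0, ..., n_4
n, m = sum(lengths), len(lengths)    # n = 7 elements, m = 5 slots
alpha = Fraction(n, m)               # load factor 7/5
expected = expected_unsuccessful_examined(lengths)
```

The result equals α regardless of how unevenly the elements are spread, because the sum of the n_j is always n.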
The situation for a successful search is slightly different, since each list is not
equally likely to be searched. Instead, the probability that a list is searched is
proportional to the number of elements it contains. Nonetheless, the expected search
time still turns out to be Θ(1 + α).
Theorem 11.2
In a hash table in which collisions are resolved by chaining, a successful search
takes average-case time Θ(1 + α), under the assumption of simple uniform hashing.
Proof We assume that the element being searched for is equally likely to be any
of the n elements stored in the table. The number of elements examined during a
successful search for an element x is one more than the number of elements that
appear before x in x's list. Because new elements are placed at the front of the
list, elements before x in the list were all inserted after x was inserted. To find
the expected number of elements examined, we take the average, over the n elements x
in the table, of 1 plus the expected number of elements added to x's list
after x was added to the list. Let x_i denote the ith element inserted into the table,
for i = 1, 2, ..., n, and let k_i = x_i.key. For keys k_i and k_j, we define the
indicator random variable X_ij = I{h(k_i) = h(k_j)}. Under the assumption of simple
uniform hashing, we have Pr{h(k_i) = h(k_j)} = 1/m, and so by Lemma 5.1,
E[X_ij] = 1/m. Thus, the expected number of elements examined in a successful
search is
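\begin{align*}
E\left[\frac{1}{n}\sum_{i=1}^{n}\left(1+\sum_{j=i+1}^{n}X_{ij}\right)\right]
  &= \frac{1}{n}\sum_{i=1}^{n}\left(1+\sum_{j=i+1}^{n}E[X_{ij}]\right)
     && \text{(by linearity of expectation)} \\
  &= \frac{1}{n}\sum_{i=1}^{n}\left(1+\sum_{j=i+1}^{n}\frac{1}{m}\right) \\
  &= 1+\frac{1}{nm}\sum_{i=1}^{n}(n-i) \\
  &= 1+\frac{1}{nm}\left(\sum_{i=1}^{n}n-\sum_{i=1}^{n}i\right) \\
  &= 1+\frac{1}{nm}\left(n^{2}-\frac{n(n+1)}{2}\right)
     && \text{(by equation (A.1))} \\
  &= 1+\frac{n-1}{2m} \\
  &= 1+\frac{\alpha}{2}-\frac{\alpha}{2n}.
\end{align*}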
Thus, the total time required for a successful search (including the time for
computing the hash function) is Θ(2 + α/2 − α/(2n)) = Θ(1 + α).
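The closed form 1 + α/2 − α/(2n) can be checked exactly for small n and m by brute force. The Python sketch below enumerates every one of the m^n equally likely assignments of hash values, which is feasible only for tiny tables; the function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def exact_successful_cost(n, m):
    """Average number of elements examined in a successful search.

    Averages over all m**n assignments of hash values (simple uniform
    hashing) and over the n stored elements as equally likely targets.
    """
    total = 0
    for slots in product(range(m), repeat=n):  # slots[i] is the slot of the (i+1)th inserted key
        for i in range(n):
            # With insertion at the head, the elements ahead of x_i in its
            # list are exactly the later insertions into the same slot.
            later = sum(1 for j in range(i + 1, n) if slots[j] == slots[i])
            total += 1 + later
    return Fraction(total, n * m**n)


def closed_form(n, m):
    """The expression 1 + alpha/2 - alpha/(2n) with alpha = n/m."""
    alpha = Fraction(n, m)
    return 1 + alpha / 2 - alpha / (2 * n)


cost_4_3 = exact_successful_cost(4, 3)
cost_5_2 = exact_successful_cost(5, 2)
```

Exact rational arithmetic is used so that the comparison with the closed form is an equality rather than an approximation.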
What does this analysis mean? If the number of hash-table slots is at least
proportional to the number of elements in the table, we have n = O(m) and,
consequently, α = n/m = O(m)/m = O(1). Thus, searching takes constant time
on average. Since insertion takes O(1) worst-case time and deletion takes O(1)
worst-case time when the lists are doubly linked, we can support all dictionary
operations in O(1) time on average.
Exercises
11.2-1
Suppose we use a hash function h to hash n distinct keys into an array T of
length m. Assuming simple uniform hashing, what is the expected number of
collisions? More precisely, what is the expected cardinality of {{k, l} : k ≠ l and
h(k) = h(l)}?
11.2-2
Demonstrate what happens when we insert the keys 5, 28, 19, 15, 20, 33, 12, 17, 10
into a hash table with collisions resolved by chaining. Let the table have 9 slots,
and let the hash function be h(k) = k mod 9.
11.2-3
Professor Marley hypothesizes that he can obtain substantial performance gains by
modifying the chaining scheme to keep each list in sorted order. How does the
professor's modification affect the running time for successful searches, unsuccessful
searches, insertions, and deletions?
11.2-4
Suggest how to allocate and deallocate storage for elements within the hash table
itself by linking all unused slots into a free list. Assume that one slot can store
a flag and either one element plus a pointer or two pointers. All dictionary and
free-list operations should run in O(1) expected time. Does the free list need to be
doubly linked, or does a singly linked free list suffice?
11.2-5
Suppose that we are storing a set of n keys into a hash table of size m. Show that if
the keys are drawn from a universe U with |U| > nm, then U has a subset of size n
consisting of keys that all hash to the same slot, so that the worst-case searching
time for hashing with chaining is Θ(n).
11.2-6
Suppose we have stored n keys in a hash table of size m, with collisions resolved by
chaining, and that we know the length of each chain, including the length L of the
longest chain. Describe a procedure that selects a key uniformly at random from
among the keys in the hash table and returns it in expected time O(L · (1 + 1/α)).
11.3 Hash functions
In this section, we discuss some issues regarding the design of good hash functions
and then present three schemes for their creation. Two of the schemes, hashing by
division and hashing by multiplication, are heuristic in nature, whereas the third
scheme, universal hashing, uses randomization to provide provably good performance.
What makes a good hash function?
A good hash function satisfies (approximately) the assumption of simple uniform
hashing: each key is equally likely to hash to any of the m slots, independently of
where any other key has hashed to. Unfortunately, we typically have no way to
check this condition, since we rarely know the probability distribution from which
the keys are drawn. Moreover, the keys might not be drawn independently.
Occasionally we do know the distribution. For example, if we know that the
keys are random real numbers k independently and uniformly distributed in the
range 0 ≤ k < 1, then the hash function
h(k) = ⌊km⌋
satisfies the condition of simple uniform hashing.
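The function h(k) = ⌊km⌋ can be sketched in Python as follows. The function name hash_unit_interval and the clamp to m − 1 are assumptions added here; the clamp only guards against floating-point rounding of the product up to m and is not part of the mathematical definition.

```python
import math


def hash_unit_interval(k, m):
    """Map a real key k with 0 <= k < 1 to a slot in 0..m-1 via floor(k * m)."""
    if not 0 <= k < 1:
        raise ValueError("key must satisfy 0 <= k < 1")
    # Clamp is an added safeguard: k * m could round up to m in floating point.
    return min(math.floor(k * m), m - 1)


slot_a = hash_unit_interval(0.0, 10)    # smallest key maps to slot 0
slot_b = hash_unit_interval(0.25, 8)    # 0.25 * 8 = 2.0, so slot 2
slot_c = hash_unit_interval(0.999, 10)  # 9.99 floors to slot 9
```

Each slot j receives exactly the keys in the interval [j/m, (j + 1)/m), which has length 1/m, so uniformly distributed keys are equally likely to land in any slot.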
In practice, we can often employ heuristic techniques to create a hash function
that performs well. Qualitative information about the distribution of keys may be
useful in this design process. For example, consider a compiler's symbol table, in
which the keys are character strings representing identifiers in a program. Closely
related symbols, such as pt and pts, often occur in the same program. A good
hash function would minimize the chance that such variants hash to the same slot.
A good approach derives the hash value in a way that we expect to be independent
of any patterns that might exist in the data. For example, the division method
(discussed in Section 11.3.1) computes the hash value as the remainder when the
key is divided by a specified prime number. This method frequently gives good
results, assuming that we choose a prime number that is unrelated to any patterns
in the distribution of keys.
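The effect of the divisor on patterned keys can be seen in a short Python sketch. The key set (multiples of 8) and the two divisors are assumptions chosen purely for illustration; the function name division_hash is likewise hypothetical.

```python
def division_hash(k, p):
    """Hash value is the remainder when key k is divided by p."""
    return k % p


# Keys with a regular pattern: every key is a multiple of 8.
keys = [8 * i for i in range(16)]

# A divisor related to the pattern sends every key to the same slot.
slots_with_8 = {division_hash(k, 8) for k in keys}

# A prime divisor unrelated to the pattern spreads the keys over all slots.
slots_with_7 = {division_hash(k, 7) for k in keys}
```

With divisor 8 all sixteen keys share slot 0, while with the prime 7 they occupy all seven slots.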
Finally, we note that some applications of hash functions might require stronger
properties than are provided by simple uniform hashing. For example, we might
want keys that are "close" in some sense to yield hash values that are far apart.
(This property is especially desirable when we are using linear probing, defined in
Section 11.4.) Universal hashing, described in Section 11.3.3, often provides the
desired properties.
