
Hashing

The key takeaways are that hashing is a technique used to map keys to values in a data structure called a hash table. Hashing provides fast lookup, insertion and deletion of records with an average time complexity of O(1).

Some of the different hashing methods discussed include static hashing, dynamic hashing, linear probing, quadratic probing and random probing.

Some issues that can arise with hashing include collisions where two different keys hash to the same slot, clustering where keys are stored in adjacent slots, and the choice of hash function impacting collision rates.

CHAPTER 8 Hashing

Concept of Hashing

 In computer science, a hash table (or hash map) is a data structure that associates keys (names) with values (attributes).
 Familiar examples:
 Look-up table (e.g., tables of logarithms)
 Dictionary
 Cache
 Extended array
Example

A small phone book as a hash table.


(Figure is from Wikipedia)
Dictionaries
 Collection of pairs.
 (key, value)
 Each pair has a unique key.
 Operations.
 Get(theKey)
 Delete(theKey)
 Insert(theKey, theValue)
Just An Idea
 Hash table:
 A collection of pairs.
 A lookup function (hash function).
 Hash tables are often used to implement associative arrays.
 Worst-case time for Get, Insert, and Delete is O(size).
 Expected time is O(1).
Origins of the Term
 The term "hash" comes by way of analogy with its standard meaning in the physical world: to "chop and mix."
 D. Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953; the term "hash" came into use some ten years later.
Search vs. Hashing
 Search tree methods: key comparisons
 Time complexity: O(size) or O(log n)
 Hashing methods: hash functions
 Expected time: O(1)
 Types
 Static hashing (section 8.2)
 Dynamic hashing (section 8.3)
Static Hashing
 Key-value pairs are stored in a fixed size table
called a hash table.
 A hash table is partitioned into many buckets.
 Each bucket has many slots.
 Each slot holds one record.
 A hash function f(x) transforms the identifier (key) into an address in the hash table.
Hash table

(Figure: a hash table of b buckets, numbered 0 to b-1, each containing s slots, numbered 0 to s-1.)
Data Structure for Hash Table
#define MAX_CHAR 10
#define TABLE_SIZE 13

typedef struct {
    char key[MAX_CHAR];
    /* other fields */
} element;

element hash_table[TABLE_SIZE];
Other Extensions

Hash List and Hash Tree


(Figure is from Wikipedia)
Formal Definition
 A hash function f maps each key into an address in the range [0, b-1].
 Ideally, f is also one-to-one (no two keys share an address) and onto (no address is wasted).
Ideal Hashing
 Uses an array table[0:b-1].
 Each position of this array is a bucket.
 A bucket can normally hold only one
dictionary pair.
 Uses a hash function f that converts
each key k into an index in the range [0,
b-1].
 Every dictionary pair (key, element) is stored in its home bucket table[f(key)].
Example
 Pairs are: (22,a), (33,c), (3,d), (73,e), (85,f).
 Hash table is table[0:7], b = 8.
 Hash function is key / 11 (integer division).
What Can Go Wrong?

 Where does (26,g) go?
 Keys that have the same home bucket are synonyms.
 22 and 26 are synonyms with respect to the hash function in use.
 The home bucket for (26,g) is already occupied.
Some Issues
 Choice of hash function.
 Really tricky!
 To avoid collisions (two different pairs in the same bucket).
 Size (number of buckets) of hash table.
 Overflow handling method.
 Overflow: there is no space in the bucket for
the new pair.
Example (fig 8.1)

        Slot 0   Slot 1
 [0]    acos     atan      <- synonyms
 [1]
 [2]    char     ceil      <- synonyms: char, ceil, clock, ctime (clock, ctime overflow)
 [3]    define
 [4]    exp
 [5]    float    floor
 [6]
 ...
 [25]
Choice of Hash Function
 Requirements
 easy to compute
 minimal number of collisions
 If a hashing function groups key values
together, this is called clustering of the
keys.
 A good hashing function distributes the
key values uniformly throughout the
range.
Some hash functions
 Mid-square:
 H(x) := the middle digits of x^2
 Division:
 H(x) := x % k
 Multiplicative:
 H(x) := the first few digits of the fractional part of x*k, where k is a fraction
 advocated by D. Knuth in TAOCP vol. III
Some hash functions II
 Folding:
 Partition the identifier x into several parts, and add the
parts together to obtain the hash address
 e.g. x=12320324111220; partition x into
123,203,241,112,20; then return the address
123+203+241+112+20=699
 Digit analysis:
 If all the keys are known in advance, we can discard the digit positions with the most skewed distributions and use the remaining digits as the hash address.
Criterion of Hash Table
 The key density (or identifier density) of
a hash table is the ratio n/T
 n is the number of keys in the table
 T is the number of distinct possible keys
 The loading density (or loading factor) of a hash table is α = n/(s·b)
 s is the number of slots
 b is the number of buckets
Example
(The table is the same as fig 8.1.)
b = 26, s = 2, n = 10, α = 10/52 ≈ 0.19, f(x) = the first character of x
Overflow Handling
 An overflow occurs when the home bucket for
a new pair (key, element) is full.
 We may handle overflows by:
 Search the hash table in some systematic fashion
for a bucket that is not full.
 Linear probing (linear open addressing).
 Quadratic probing.
 Random probing.
 Eliminate overflows by permitting each bucket to
keep a list of all pairs for which it is the home
bucket.
 Array linear list.
 Chain.
Linear Probing (Linear Open Addressing)
 Open addressing stores all elements directly in the hash table itself; collisions are resolved by probing for another slot.
 Linear probing resolves a collision by placing the pair into the next open slot in the table.
Linear Probing – Get And Insert
 divisor = b (number of buckets) = 17.
 Home bucket = key % 17.
• Insert pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45 (in that order):

bucket:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
key:    34   0  45   -   -   -   6  23   7   -   -  28  12  29  11  30  33
Linear Probing – Delete

bucket:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
key:    34   0  45   -   -   -   6  23   7   -   -  28  12  29  11  30  33

 Delete(0) vacates bucket 1.
• Search the cluster for a pair (if any) to fill the vacated bucket: 45, whose home bucket is 11 (it wrapped around), moves from bucket 2 into bucket 1.

bucket:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
key:    34  45   -   -   -   -   6  23   7   -   -  28  12  29  11  30  33
Linear Probing – Delete(34)

bucket:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
key:    34   0  45   -   -   -   6  23   7   -   -  28  12  29  11  30  33

 Delete(34) vacates bucket 0.
 Search the cluster for a pair (if any) to fill the vacated bucket: 0 (home bucket 0) moves from bucket 1 to bucket 0, vacating bucket 1; then 45 moves from bucket 2 to bucket 1.

bucket:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
key:     0  45   -   -   -   -   6  23   7   -   -  28  12  29  11  30  33
Performance Of Linear Probing

bucket:  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
key:    34   0  45   -   -   -   6  23   7   -   -  28  12  29  11  30  33

 Worst-case find/insert/erase time is Θ(n), where n is the number of pairs in the table.
 This happens when all pairs are in the same cluster.
Problem of Linear Probing
 Identifiers tend to cluster together.
 Adjacent clusters tend to merge.
 Search time increases.
Quadratic Probing
 Linear probing searches buckets (H(x)+i) % b for i = 0, 1, 2, ..., which leads to clustering.
 Quadratic probing instead uses a quadratic function of i as the increment.
 Examine buckets H(x), (H(x)+i^2) % b, and (H(x)-i^2) % b, for 1 <= i <= (b-1)/2.
 This probe sequence reaches every bucket when b is a prime number of the form 4j+3, where j is an integer.
Random Probing
 Random probing uses a table of pseudorandom offsets.
 H(x) := (H'(x) + S[i]) % b
 S is a table of size b-1.
 S is a random permutation of the integers [1, b-1].
Rehashing
 Rehashing: Try H1, H2, …, Hm in
sequence if collision occurs. Here Hi is a
hash function.
 Double hashing is one of the best
methods for dealing with collisions.
 If the home slot is full, a second hash function is computed and combined with the first:
 H(k, i) = (H1(k) + i·H2(k)) % m
Data Structure for Chaining

The idea of chaining is to combine the linked list and the hash table to solve the overflow problem.

#define MAX_CHAR 10
#define TABLE_SIZE 13
#define IS_FULL(ptr) (!(ptr))

typedef struct {
    char key[MAX_CHAR];
    /* other fields */
} element;

typedef struct list *list_pointer;
typedef struct list {
    element item;
    list_pointer link;
} list;

list_pointer hash_table[TABLE_SIZE];
Figure of Chaining

Sorted chains, with bucket = key % 17; insert pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45:

[0]  -> 0 -> 34
[6]  -> 6 -> 23
[7]  -> 7
[11] -> 11 -> 28 -> 45
[12] -> 12 -> 29
[13] -> 30
[16] -> 33

(All other buckets are empty.)
Comparison : Load Factor
 If open addressing is used, each table slot holds at most one element; therefore the loading factor can never be greater than 1.
 If external chaining is used, each table slot can hold many elements; therefore the loading factor may be greater than 1.
Conclusion
 The main tradeoffs between these
methods are that linear probing has
the best cache performance but is most
sensitive to clustering, while double
hashing has poorer cache performance
but exhibits virtually no clustering;
quadratic probing falls in between the
previous two methods.
