Hashing Notes SVIMS
Having insertion, find, and removal operations that cost O(log(N)) is good, but as the size of the table
becomes larger, even this cost becomes significant. We would like to be able to use an
algorithm with a find cost of O(1). This is where hashing comes into play!
Figure 5. Array Index Computation (a hash function maps a key to an array index)
The value computed by applying the hash function to the key is often referred to as the
hashed key. The entries in the array are scattered (not necessarily sequential), as can be
seen in the figure below.
[Figure: (key, entry) pairs scattered across non-sequential array slots.]
The cost of the insert, find, and delete operations is now O(1) on average. Can you think of
why?
Hash tables are very good if you need to perform a lot of search operations on a relatively
stable table (i.e. there are a lot fewer insertion and deletion operations than search
operations).
On the other hand, if traversals (covering the entire table), insertions, and deletions are a lot
more frequent than simple search operations, then balanced binary search trees (such as AVL
trees) are the preferred implementation choice.
Hashing Performance
▪ Hash function
o should distribute the keys and entries evenly throughout the entire table
o should minimize collisions
▪ Table size
o Too large a table, will cause a wastage of memory
o Too small a table will cause increased collisions and eventually force
rehashing (creating a new hash table of larger size and copying the
contents of the current hash table into it)
o The size should be appropriate to the hash function used and should
typically be a prime number. Why? (We discussed this in class).
The hash function converts the key into a table position. It can be computed using:
▪ Modular Arithmetic: Compute the index by dividing the key by some value and
using the remainder as the index. This forms the basis of the next two techniques
(a code sketch of all three techniques follows this list).
▪ Truncation: Ignore part of the key and use the rest as the array index. The
problem with this approach is that it may not distribute the keys evenly
throughout the table.
For Example: If student ids such as 928324312 are the keys, then select just the last three
digits as the index, i.e. 312. => the table size has to be at least 1000. Why?
▪ Folding: Partition the key into several pieces and then combine them in some
convenient way.
For Example:
o For an 8-digit integer key, compute the index as follows:
Index := (Key/10000 + Key MOD 10000) MOD Table_Size
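A minimal sketch of the three techniques in Python, assuming integer keys and a small prime table size (the function names and the 8-digit folding key are illustrative, not from the notes):

TABLE_SIZE = 11   # a prime, per the table-sizing advice above

def hash_modular(key):
    # modular arithmetic: remainder of dividing the key by the table size
    return key % TABLE_SIZE

def hash_truncation(key):
    # truncation: keep only the last three digits of the key
    return key % 1000   # so the table size must be at least 1000

def hash_folding(key):
    # folding: split an 8-digit key at the last four digits, add the pieces, reduce
    return (key // 10000 + key % 10000) % TABLE_SIZE

print(hash_modular(928324312))     # 928324312 mod 11
print(hash_truncation(928324312))  # 312
print(hash_folding(62538194))      # (6253 + 8194) mod 11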
Collision
Let us consider the case where we have a single array of four records, each with two
fields: one for the key and one to hold the data (we call this a single-slot bucket). Let the
hashing function be a simple modulus operator, i.e., the array index is computed as the
remainder of dividing the key by 4.
Then the key values 9, 13, 17 will all hash to the same index (9 mod 4 = 13 mod 4 =
17 mod 4 = 1). When two (or more) keys hash to the same value, a collision is said to occur.
[Figure: hash_table with indices 0..3; the keys k = 9, 13, and 17 all hash to index 1.]
Collision Resolution
The hash table can be implemented using either:
▪ Buckets: An array is used for implementing the hash table. The array has size
m*p, where m is the number of hash values and p (p ≥ 1) is the number of slots (a
slot can hold one entry), as shown in the figure below. The bucket is said to have p
slots (a code sketch of a bucketed table follows this list).
Figure 8. Hash Table with Buckets (hash values/indices 0-3; each bucket has a 1st, 2nd,
and 3rd slot, and each slot holds one key)
▪ Chaining: An array is used to hold the key and a pointer to a linked list (either
singly or doubly linked) or a tree. Here the number of nodes is not restricted
(unlike with buckets). Each node in the chain is large enough to hold one entry, as
shown in the figure below (a code sketch appears under Separate Chaining further down).
[Figure: a hash table of head pointers (hash values 0, 1, ...), each non-empty entry
pointing to a chain of nodes.]
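Before moving on, here is a minimal Python sketch of a bucketed table as in Figure 8, assuming m = 4 hash values and p = 3 slots per bucket (names such as bucket_insert are illustrative, not from the notes):

M, P = 4, 3   # m hash values, p slots per bucket
table = [[] for _ in range(M)]   # each inner list is a bucket with up to P slots

def bucket_insert(key, entry):
    index = key % M
    if len(table[index]) >= P:
        return False   # bucket full: an overflow must be handled separately
    table[index].append((key, entry))
    return True

bucket_insert(9, "A")
bucket_insert(13, "B")   # 13 mod 4 = 1 collides with 9 but fits in another slot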
Open addressing / probing is used for insertion into fixed-size hash tables (hash
tables with 1 or more buckets). If the position given by the hash function is occupied, the
table position is incremented by some step until a free slot is found:
Index := hash(Key)
While Table(Index) is occupied do
    Index := (Index + 1) MOD Table_Size
    if Index = hash(Key) then
        return table_full
Table(Index) := Entry
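A runnable version of this linear-probing insert, sketched in Python under the assumption of integer keys and a division hash (probe_insert is an illustrative name, not from the notes):

TABLE_SIZE = 8
table = [None] * TABLE_SIZE   # None marks an empty slot

def probe_insert(key, entry):
    index = key % TABLE_SIZE
    start = index
    while table[index] is not None:      # slot occupied: probe onward
        index = (index + 1) % TABLE_SIZE
        if index == start:               # wrapped all the way around
            raise RuntimeError("table full")
    table[index] = (key, entry)

probe_insert(9, "A")    # 9 mod 8 = 1, goes to slot 1
probe_insert(17, "B")   # 17 mod 8 = 1 collides, lands in slot 2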
Chaining
In chaining, the entries are inserted as nodes in a linked list. The hash table itself is an
array of head pointers.
The prime disadvantage is the memory overhead of the chain pointers; moreover, if the
table size is too small, the chains become long and searches slow down.
What is a Collision?
Since a hash function maps a key, which may be a large integer or a string, to a small
number, it is possible for two keys to hash to the same value. The situation where a newly
inserted key maps to an already occupied slot in the hash table is called a collision, and it
must be handled using some collision-handling technique.
Separate Chaining:
The idea is to make each cell of the hash table point to a linked list of records that have the
same hash value.
Let us consider a simple hash function, "key mod 7", and the sequence of keys 50, 700, 76,
85, 92, 73, 101.
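Hashing these keys can be sketched in Python as follows, using lists as stand-ins for the linked chains (names are illustrative):

TABLE_SIZE = 7
chains = [[] for _ in range(TABLE_SIZE)]   # one chain per table cell

for key in [50, 700, 76, 85, 92, 73, 101]:
    chains[key % TABLE_SIZE].append(key)

for index, chain in enumerate(chains):
    print(index, chain)
# 0 [700]
# 1 [50, 85, 92]   <- three keys collide at index 1 and share one chain
# 2 []
# 3 [73, 101]
# 4 []
# 5 []
# 6 [76]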
Advantages:
1. Simple to implement.
2. Hash table never fills up, we can always add more elements to the chain.
3. Less sensitive to the hash function or load factors.
4. It is mostly used when it is unknown how many and how frequently keys may be
inserted or deleted.
Disadvantages:
1. Cache performance of chaining is not good as keys are stored using a linked list.
Open addressing provides better cache performance as everything is stored in the
same table.
2. Wastage of space (some parts of the hash table are never used).
3. If the chain becomes long, then search time can become O(n) in the worst case.
4. Uses extra space for links.
Performance of Chaining:
Performance of hashing can be evaluated under the assumption that each key is equally
likely to be hashed to any slot of the table (simple uniform hashing).
m = Number of slots in hash table
n = Number of keys to be inserted in hash table
Load factor α = n/m
Expected time to search = O(1 + α)
Expected time to delete = O(1 + α)
Time to insert = O(1)
The time complexity of search, insert, and delete is O(1) if α is O(1).
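For example, inserting n = 50 keys into a table with m = 100 slots gives a load factor
α = 50/100 = 0.5, so the expected search time is O(1 + 0.5) = O(1); with n = 1000 keys in
the same table, α = 10 and searches degrade toward O(α).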
Dynamic hashing is a method of hashing in which the data structure grows and shrinks
dynamically as records are added or removed.
In traditional static hashing, the hash function maps keys to a fixed number of buckets or
slots. However, this approach can lead to problems such as overflow and poor distribution
of keys, especially when the number of keys is unpredictable or changes over time.
Dynamic hashing, commonly implemented as extendible hashing, addresses these issues by
allowing the hash table to expand or contract as needed.
The key to dynamic hashing is the use of a directory that points to buckets. Each bucket can
hold a certain number of records. When a bucket becomes full upon an insertion, it is split
into two, and the directory is updated to reflect this change. The hash function in dynamic
hashing is designed to produce a binary string of digits. The system initially considers only
the first few digits but can consider more digits as the table grows, effectively increasing
the number of available buckets.
For example, if the system starts with one bucket (represented by 0) and this bucket
becomes full, it is split into two buckets represented by 0 and 1. If the bucket represented by
1 becomes full, it is split into buckets represented by 10 and 11, and so on. This way, the
hash table can grow incrementally without needing to rehash all existing keys, which can be
a costly operation.
Similarly, when a deletion causes a bucket to become empty, it can be merged with another
bucket, and the directory can be updated to reflect this change. This allows the hash table to
shrink when necessary, saving memory.
In summary, dynamic hashing provides a flexible and efficient method for managing hash
tables with a changing number of records. It avoids the problems of overflow and poor key
distribution that can occur with static hashing, and it eliminates the need for costly
rehashing operations.
Extendible Hashing
It is a dynamic hashing method wherein directories and buckets are used to hash data. It
is a highly flexible method in which the hash function also changes dynamically.
Main features of Extendible Hashing: The main features of this hashing technique are:
Directories: The directories store the addresses of the buckets in pointers. An id is
assigned to each directory entry; the ids may change each time a directory expansion
takes place.
Buckets: The buckets store the actual data.
Step 6 – Insertion and Overflow Check: Insert the element and check whether the bucket
overflows. If an overflow is encountered, go to Step 7 followed by Step 8; otherwise,
go to Step 9.
Step 7 – Tackling the Overflow Condition during Data Insertion: While inserting data
into the buckets, it may happen that a bucket overflows. In such cases, we need to
follow an appropriate procedure to avoid mishandling of data.
First, compare the local depth of the overflowing bucket with the global depth. Then choose
one of the cases below (a code sketch of both cases follows the list).
Case 1: If the local depth of the overflowing bucket is equal to the global
depth, then Directory Expansion, as well as Bucket Split, needs to be
performed. Then increment the global depth and the local depth value by 1,
and assign appropriate pointers.
Directory expansion will double the number of directories present in the
hash structure.
Case 2: If the local depth is less than the global depth, then only a
Bucket Split takes place. Then increment only the local depth value by 1,
and assign appropriate pointers.
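The two cases can be made concrete with a small sketch. The following is a minimal, illustrative Python implementation of extendible hashing, assuming distinct integer keys hashed by their least-significant bits; the class and method names are not from the notes:

class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.keys = []

class ExtendibleHash:
    def __init__(self, bucket_capacity):
        self.global_depth = 1
        # directory ids 0 and 1 each start with their own bucket
        self.directory = [Bucket(1, bucket_capacity) for _ in range(2)]

    def _index(self, key):
        # hash: the global_depth least-significant bits of the key
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < bucket.capacity:
            bucket.keys.append(key)
            return
        # overflow (Step 7)
        if bucket.local_depth == self.global_depth:
            # Case 1: directory expansion doubles the directory
            self.directory.extend(self.directory[:])
            self.global_depth += 1
        # Case 1 and Case 2: bucket split, local depth incremented by 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth, bucket.capacity)
        # entries whose newly considered bit is 1 now point to the new bucket
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (bucket.local_depth - 1)) & 1:
                self.directory[i] = new_bucket
        # rehash every key of the overflowing bucket, plus the new key
        pending, bucket.keys = bucket.keys + [key], []
        for k in pending:
            self.insert(k)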
Inserting 16:
The binary form of 16 is 10000 and the global depth is 1. The hash function returns the 1
LSB of 10000, which is 0. Hence, 16 is mapped to the directory entry with id = 0.
Inserting 4 and 6:
Both 4 (100) and 6 (110) have 0 as their LSB. Hence, they are also hashed to the bucket
pointed to by directory id 0.
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed to by
directory 0 is already full. Hence, an overflow occurs.
As directed by Step 7, Case 1: since local depth = global depth, the bucket splits
and a directory expansion takes place. The numbers present in the overflowing bucket
are rehashed after the split. Since the global depth is incremented by 1, the global
depth is now 2. Hence, 16, 4, 6, 22 are now rehashed w.r.t. their
2 LSBs. [16 (10000), 4 (100), 6 (110), 22 (10110)]
Notice that the bucket that did not overflow has remained untouched. But since the
number of directory entries has doubled, we now have two directories, 01 and 11, pointing
to the same bucket. This is because the local depth of that bucket has remained 1, and any
bucket having a local depth less than the global depth is pointed to by more than one
directory entry.
Inserting 24 and 10: 24 (11000) and 10 (1010) are hashed to the directories
with ids 00 and 10. Here, we encounter no overflow condition.
Inserting 31, 7, 9: All of these elements [31 (11111), 7 (111), 9 (1001)] have either 01
or 11 as their 2 LSBs. Hence, they are mapped to the bucket pointed to by 01 and 11.
We do not encounter any overflow condition here.
Inserting 20: Insertion of data element 20 (10100) will again cause an overflow: 20
hashes to directory 00, whose bucket is already full. Since the local depth of the bucket
equals the global depth (2 = 2), Step 7, Case 1 applies: the directory is doubled (the global
depth becomes 3) and the bucket is split, with its elements (16, 4, 24, 20) rehashed w.r.t.
their 3 LSBs. (Had the local depth been less than the global depth, Case 2 would apply
instead: the directory is not doubled; only the bucket is split and its elements rehashed.)
Finally, the output of hashing the given list of numbers is obtained.
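Running the sketch from above on the same sequence (with bucket capacity 3, as the walkthrough implies) reproduces this final layout:

h = ExtendibleHash(bucket_capacity=3)
for key in [16, 4, 6, 22, 24, 10, 31, 7, 9, 20]:
    h.insert(key)

for i, b in enumerate(h.directory):   # global depth ends at 3, so 3-bit ids
    print(f"{i:03b} -> local depth {b.local_depth}, keys {b.keys}")
# 000 -> local depth 3, keys [16, 24]
# 001 -> local depth 1, keys [31, 7, 9]   (also pointed to by 011, 101, 111)
# 010 -> local depth 2, keys [6, 22, 10]  (also pointed to by 110)
# 100 -> local depth 3, keys [4, 20]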
Key observations:
1. A bucket will have more than one pointer pointing to it if its local depth is less than
the global depth.
2. When an overflow occurs in a bucket, all the entries in the bucket are rehashed
with a new local depth.
3. If the local depth of the overflowing bucket is equal to the global depth, then a
directory expansion as well as a bucket split needs to be performed.
4. The size of a bucket cannot be changed after the data insertion process begins.
Advantages of Extendible Hashing:
1. Data retrieval is less expensive (in terms of computing).
2. There is no problem of data loss, since the storage capacity increases dynamically.
3. With dynamic changes in the hashing function, associated old values are rehashed w.r.t.
the new hash function.
Limitations Of Extendible Hashing:
1. The directory size may increase significantly if several records hash to the same
directory entry, i.e., when the key distribution is non-uniform.
2. The size of every bucket is fixed.
3. Memory is wasted in pointers when the difference between the global depth and the
local depths becomes large.
4. This method is complicated to code.