Hashing Notes SVIMS
Having insertion, find, and removal operations that cost O(log(N)) is good, but as the size of the table
becomes larger, even this cost becomes significant. We would like to be able to use an
algorithm with a find cost of O(1). This is where hashing comes into play!
Figure 5. Array Index Computation (a hash function maps a key to an array index)
The value computed by applying the hash function to the key is often referred to as the
hashed key. The entries in the array are scattered (not necessarily sequential), as can be
seen in the figure below.
[Figure: (key, entry) pairs scattered across non-sequential array slots.]
The cost of the insert, find, and delete operations is now O(1) on average. Can you think of
why?
Hash tables are very good if you need to perform a lot of search operations on a relatively
stable table (i.e. there are a lot fewer insertion and deletion operations than search
operations).
On the other hand, if traversals (covering the entire table), insertions, and deletions are a lot
more frequent than simple search operations, then balanced binary search trees (such as AVL
trees) are the preferred implementation choice.
Hashing Performance
▪ Hash function
o should distribute the keys and entries evenly throughout the entire table
o should minimize collisions
▪ Table size
o Too large a table, will cause a wastage of memory
o Too small a table will cause increased collisions and eventually force
rehashing (creating a new hash table of larger size and copying the
contents of the current hash table into it)
o The size should be appropriate to the hash function used and should
typically be a prime number. Why? (We discussed this in class).
The hash function converts the key into a table position. It can be computed using:
▪ Modular Arithmetic: Compute the index by dividing the key by some value and
using the remainder as the index. This forms the basis of the next two techniques
(a code sketch of all three techniques follows this list).
▪ Truncation: Ignore part of the key and use the rest as the array index. The
problem with this approach is that it may not distribute the keys evenly
throughout the table.
For Example: If student ids such as 928324312 are the keys, then select just the last three
digits as the index, i.e. 312. => the table size has to be at least 1000. Why?
▪ Folding: Partition the key into several pieces and then combine them in some
convenient way.
For Example:
o For an 8-digit integer key, compute the index as follows:
Index := (Key/10000 + Key MOD 10000) MOD Table_Size
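A minimal sketch of the three techniques in Python, assuming integer keys and a small prime table size (the function names and the 8-digit folding key are illustrative, not from the notes):

TABLE_SIZE = 11   # a prime, per the table-sizing advice above

def hash_modular(key):
    # modular arithmetic: remainder of dividing the key by the table size
    return key % TABLE_SIZE

def hash_truncation(key):
    # truncation: keep only the last three digits of the key
    return key % 1000   # so the table size must be at least 1000

def hash_folding(key):
    # folding: split an 8-digit key at the last four digits, add the pieces, reduce
    return (key // 10000 + key % 10000) % TABLE_SIZE

print(hash_modular(928324312))     # 928324312 mod 11
print(hash_truncation(928324312))  # 312
print(hash_folding(62538194))      # (6253 + 8194) mod 11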
Collision
Let us consider the case where we have a single array of four records, each with two
fields: one for the key and one to hold the data (we call this a single-slot bucket). Let the
hashing function be a simple modulus operator, i.e., the array index is computed as the
remainder of dividing the key by 4.
Then the key values 9, 13, 17 will all hash to the same index (9 mod 4 = 13 mod 4 =
17 mod 4 = 1). When two (or more) keys hash to the same value, a collision is said to occur.
[Figure: hash_table with indices 0..3; the keys k = 9, 13, and 17 all hash to index 1.]
Collision Resolution
The hash table can be implemented using either:
▪ Buckets: An array is used for implementing the hash table. The array has size
m*p, where m is the number of hash values and p (p ≥ 1) is the number of slots (a
slot can hold one entry), as shown in the figure below. The bucket is said to have p
slots (a code sketch of a bucketed table follows this list).
Figure 8. Hash Table with Buckets (hash values/indices 0-3; each bucket has a 1st, 2nd,
and 3rd slot, and each slot holds one key)
▪ Chaining: An array is used to hold the key and a pointer to a linked list (either
singly or doubly linked) or a tree. Here the number of nodes is not restricted
(unlike with buckets). Each node in the chain is large enough to hold one entry, as
shown in the figure below (a code sketch appears under Separate Chaining further down).
[Figure: a hash table of head pointers (hash values 0, 1, ...), each non-empty entry
pointing to a chain of nodes.]
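Before moving on, here is a minimal Python sketch of a bucketed table as in Figure 8, assuming m = 4 hash values and p = 3 slots per bucket (names such as bucket_insert are illustrative, not from the notes):

M, P = 4, 3   # m hash values, p slots per bucket
table = [[] for _ in range(M)]   # each inner list is a bucket with up to P slots

def bucket_insert(key, entry):
    index = key % M
    if len(table[index]) >= P:
        return False   # bucket full: an overflow must be handled separately
    table[index].append((key, entry))
    return True

bucket_insert(9, "A")
bucket_insert(13, "B")   # 13 mod 4 = 1 collides with 9 but fits in another slot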
Open addressing / probing is used for insertion into fixed-size hash tables (hash
tables with 1 or more buckets). If the position given by the hash function is occupied, the
table position is incremented by some step until a free slot is found:
Index := hash(Key)
While Table(Index) is occupied do
    Index := (Index + 1) MOD Table_Size
    if Index = hash(Key) then
        return table_full
Table(Index) := Entry
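A runnable version of this linear-probing insert, sketched in Python under the assumption of integer keys and a division hash (probe_insert is an illustrative name, not from the notes):

TABLE_SIZE = 8
table = [None] * TABLE_SIZE   # None marks an empty slot

def probe_insert(key, entry):
    index = key % TABLE_SIZE
    start = index
    while table[index] is not None:      # slot occupied: probe onward
        index = (index + 1) % TABLE_SIZE
        if index == start:               # wrapped all the way around
            raise RuntimeError("table full")
    table[index] = (key, entry)

probe_insert(9, "A")    # 9 mod 8 = 1, goes to slot 1
probe_insert(17, "B")   # 17 mod 8 = 1 collides, lands in slot 2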
Chaining
In chaining, the entries are inserted as nodes in a linked list. The hash table itself is an
array of head pointers.
The prime disadvantage is the memory overhead of the chain pointers; moreover, if the
table size is too small, the chains become long and searches slow down.
What is a Collision?
Since a hash function maps a key, which may be a large integer or a string, to a small
number, it is possible for two keys to hash to the same value. The situation where a newly
inserted key maps to an already occupied slot in the hash table is called a collision, and it
must be handled using some collision-handling technique.
Separate Chaining:
The idea is to make each cell of the hash table point to a linked list of records that have the
same hash value.
Let us consider a simple hash function, "key mod 7", and the sequence of keys 50, 700, 76,
85, 92, 73, 101.
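Hashing these keys can be sketched in Python as follows, using lists as stand-ins for the linked chains (names are illustrative):

TABLE_SIZE = 7
chains = [[] for _ in range(TABLE_SIZE)]   # one chain per table cell

for key in [50, 700, 76, 85, 92, 73, 101]:
    chains[key % TABLE_SIZE].append(key)

for index, chain in enumerate(chains):
    print(index, chain)
# 0 [700]
# 1 [50, 85, 92]   <- three keys collide at index 1 and share one chain
# 2 []
# 3 [73, 101]
# 4 []
# 5 []
# 6 [76]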
Advantages:
1. Simple to implement.
2. Hash table never fills up, we can always add more elements to the chain.
3. Less sensitive to the hash function or load factors.
4. It is mostly used when it is unknown how many and how frequently keys may be
inserted or deleted.
Disadvantages:
1. Cache performance of chaining is not good as keys are stored using a linked list.
Open addressing provides better cache performance as everything is stored in the
same table.
2. Wastage of space (some parts of the hash table are never used).
3. If the chain becomes long, then search time can become O(n) in the worst case.
4. Uses extra space for links.
Performance of Chaining:
Performance of hashing can be evaluated under the assumption that each key is equally
likely to be hashed to any slot of the table (simple uniform hashing).
m = Number of slots in hash table
n = Number of keys to be inserted in hash table
Load factor α = n/m
Expected time to search = O(1 + α)
Expected time to delete = O(1 + α)
Time to insert = O(1)
The time complexity of search, insert, and delete is O(1) if α is O(1).
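For example, inserting n = 50 keys into a table with m = 100 slots gives a load factor
α = 50/100 = 0.5, so the expected search time is O(1 + 0.5) = O(1); with n = 1000 keys in
the same table, α = 10 and searches degrade toward O(α).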
Dynamic hashing is a method of hashing in which the data structure grows and shrinks
dynamically as records are added or removed.
In traditional static hashing, the hash function maps keys to a fixed number of buckets or
slots. However, this approach can lead to problems such as overflow and poor distribution
of keys, especially when the number of keys is unpredictable or changes over time.
Dynamic hashing, commonly implemented as extendible hashing, addresses these issues by
allowing the hash table to expand or contract as needed.
The key to dynamic hashing is the use of a directory that points to buckets. Each bucket can
hold a certain number of records. When a bucket becomes full upon an insertion, it is split
into two, and the directory is updated to reflect this change. The hash function in dynamic
hashing is designed to produce a binary string of digits. The system initially considers only
the first few digits but can consider more digits as the table grows, effectively increasing
the number of available buckets.
For example, if the system starts with one bucket (represented by 0) and this bucket
becomes full, it is split into two buckets represented by 0 and 1. If the bucket represented by
1 becomes full, it is split into buckets represented by 10 and 11, and so on. This way, the
hash table can grow incrementally without needing to rehash all existing keys, which can be
a costly operation.
Similarly, when a deletion causes a bucket to become empty, it can be merged with another
bucket, and the directory can be updated to reflect this change. This allows the hash table to
shrink when necessary, saving memory.
In summary, dynamic hashing provides a flexible and efficient method for managing hash
tables with a changing number of records. It avoids the problems of overflow and poor key
distribution that can occur with static hashing, and it eliminates the need for costly
rehashing operations.
Extendible Hashing
It is a dynamic hashing method wherein directories and buckets are used to hash data. It
is a highly flexible method in which the hash function also changes dynamically.
Main features of Extendible Hashing: The main features of this hashing technique are:
Directories: The directories store the addresses of the buckets in pointers. An id is
assigned to each directory entry; the ids may change each time a directory expansion
takes place.
Buckets: The buckets store the actual data.
Step 6 – Insertion and Overflow Check: Insert the element and check whether the bucket
overflows. If an overflow is encountered, go to Step 7 followed by Step 8; otherwise,
go to Step 9.
Step 7 – Tackling the Overflow Condition during Data Insertion: While inserting data
into the buckets, it may happen that a bucket overflows. In such cases, we need to
follow an appropriate procedure to avoid mishandling of data.
First, compare the local depth of the overflowing bucket with the global depth. Then choose
one of the cases below (a code sketch of both cases follows the list).
Case 1: If the local depth of the overflowing bucket is equal to the global
depth, then Directory Expansion, as well as Bucket Split, needs to be
performed. Then increment the global depth and the local depth value by 1,
and assign appropriate pointers.
Directory expansion will double the number of directories present in the
hash structure.
Case 2: If the local depth is less than the global depth, then only a
Bucket Split takes place. Then increment only the local depth value by 1,
and assign appropriate pointers.
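The two cases can be made concrete with a small sketch. The following is a minimal, illustrative Python implementation of extendible hashing, assuming distinct integer keys hashed by their least-significant bits; the class and method names are not from the notes:

class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.keys = []

class ExtendibleHash:
    def __init__(self, bucket_capacity):
        self.global_depth = 1
        # directory ids 0 and 1 each start with their own bucket
        self.directory = [Bucket(1, bucket_capacity) for _ in range(2)]

    def _index(self, key):
        # hash: the global_depth least-significant bits of the key
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < bucket.capacity:
            bucket.keys.append(key)
            return
        # overflow (Step 7)
        if bucket.local_depth == self.global_depth:
            # Case 1: directory expansion doubles the directory
            self.directory.extend(self.directory[:])
            self.global_depth += 1
        # Case 1 and Case 2: bucket split, local depth incremented by 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth, bucket.capacity)
        # entries whose newly considered bit is 1 now point to the new bucket
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (bucket.local_depth - 1)) & 1:
                self.directory[i] = new_bucket
        # rehash every key of the overflowing bucket, plus the new key
        pending, bucket.keys = bucket.keys + [key], []
        for k in pending:
            self.insert(k)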
Inserting 16:
The binary form of 16 is 10000 and the global depth is 1. The hash function returns the 1
LSB of 10000, which is 0. Hence, 16 is mapped to the directory entry with id = 0.
Inserting 4 and 6:
Both 4 (100) and 6 (110) have 0 as their LSB. Hence, they are also hashed to the bucket
pointed to by directory id 0.
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed to by
directory 0 is already full. Hence, an overflow occurs.
As directed by Step 7, Case 1: since local depth = global depth, the bucket splits
and a directory expansion takes place. The numbers present in the overflowing bucket
are rehashed after the split. Since the global depth is incremented by 1, the global
depth is now 2. Hence, 16, 4, 6, 22 are now rehashed w.r.t. their
2 LSBs. [16 (10000), 4 (100), 6 (110), 22 (10110)]
Notice that the bucket that did not overflow has remained untouched. But since the
number of directory entries has doubled, we now have two directories, 01 and 11, pointing
to the same bucket. This is because the local depth of that bucket has remained 1, and any
bucket having a local depth less than the global depth is pointed to by more than one
directory entry.
Inserting 24 and 10: 24 (11000) and 10 (1010) are hashed to the directories
with ids 00 and 10. Here, we encounter no overflow condition.
Inserting 31, 7, 9: All of these elements [31 (11111), 7 (111), 9 (1001)] have either 01
or 11 as their 2 LSBs. Hence, they are mapped to the bucket pointed to by 01 and 11.
We do not encounter any overflow condition here.
Inserting 20: Insertion of data element 20 (10100) will again cause an overflow: 20
hashes to directory 00, whose bucket is already full. Since the local depth of the bucket
equals the global depth (2 = 2), Step 7, Case 1 applies: the directory is doubled (the global
depth becomes 3) and the bucket is split, with its elements (16, 4, 24, 20) rehashed w.r.t.
their 3 LSBs. (Had the local depth been less than the global depth, Case 2 would apply
instead: the directory is not doubled; only the bucket is split and its elements rehashed.)
Finally, the output of hashing the given list of numbers is obtained.
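Running the sketch from above on the same sequence (with bucket capacity 3, as the walkthrough implies) reproduces this final layout:

h = ExtendibleHash(bucket_capacity=3)
for key in [16, 4, 6, 22, 24, 10, 31, 7, 9, 20]:
    h.insert(key)

for i, b in enumerate(h.directory):   # global depth ends at 3, so 3-bit ids
    print(f"{i:03b} -> local depth {b.local_depth}, keys {b.keys}")
# 000 -> local depth 3, keys [16, 24]
# 001 -> local depth 1, keys [31, 7, 9]   (also pointed to by 011, 101, 111)
# 010 -> local depth 2, keys [6, 22, 10]  (also pointed to by 110)
# 100 -> local depth 3, keys [4, 20]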
Key observations:
1. A bucket will have more than one pointer pointing to it if its local depth is less than
the global depth.
2. When an overflow occurs in a bucket, all the entries in the bucket are rehashed
with a new local depth.
3. If the local depth of the overflowing bucket is equal to the global depth, then a
directory expansion as well as a bucket split needs to be performed.
4. The size of a bucket cannot be changed after the data insertion process begins.
Advantages of Extendible Hashing:
1. Data retrieval is less expensive (in terms of computing).
2. There is no problem of data loss, since the storage capacity increases dynamically.
3. With dynamic changes in the hashing function, associated old values are rehashed w.r.t.
the new hash function.
Limitations Of Extendible Hashing:
1. The directory size may increase significantly if several records hash to the same
directory entry, i.e., when the key distribution is non-uniform.
2. The size of every bucket is fixed.
3. Memory is wasted in pointers when the difference between the global depth and the
local depths becomes large.
4. This method is complicated to code.