Hashing

Dr. Arup Kumar Pal


Department of Computer Science & Engineering
Indian Institute of Technology (ISM), Dhanbad
Jharkhand-826004
E-mail: arupkrpal@iitism.ac.in

Data Structures and Algorithms Dept. of CSE, IIT(ISM) Dhanbad September 16, 2023 1
Outline
Introduction
Hash Functions
Collision Resolution
Conclusions

Introduction
The search time of each algorithm discussed so
far depends on the number n of elements in
the collection S of data.
Hashing, also called hash addressing, is a searching
technique whose search time is essentially independent
of the number n.
In hashing, the idea is to use a hash function
that converts a given key to a smaller number
and uses the small number as an index in a
table called a hash table.

Motivation
Suppose a company assigns a 4-digit employee number
to each of its 90 employees and uses these numbers as
the primary keys for employee records. There are two
potential approaches:
Direct Addressing (No Hashing):
Each employee's record is stored in memory at a memory
location corresponding to their 4-digit employee number.
This approach would result in 10,000 memory locations
(assuming all 4-digit combinations) to accommodate any
possible employee number.
While search operations are incredibly fast (constant time),
this method is highly inefficient in terms of memory usage.
Most of the memory locations remain empty (wasting
space), as only 90 out of 10,000 possible locations would
actually be used.
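The memory cost of direct addressing can be sketched as follows. This is a minimal illustration, not a recommended design; the employee numbers and record names are made-up placeholders.

```python
# Direct addressing: one slot for every possible 4-digit employee number.
TABLE_SIZE = 10_000              # covers 0000..9999
table = [None] * TABLE_SIZE

# Hypothetical data: 90 employees with invented numbers and names.
employees = {1000 + 37 * i: f"employee-{i}" for i in range(90)}

for number, record in employees.items():
    table[number] = record       # the key itself is the index

def lookup(number):
    """Constant-time search: a single array access."""
    return table[number]

used = sum(slot is not None for slot in table)
utilization = used / TABLE_SIZE  # fraction of slots that hold a record
print(used, utilization)
```

Only 90 of the 10,000 slots hold a record, so under 1% of the table is used.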

Contd…
Hashing:
Hashing maps the employee numbers (keys) to memory
locations (buckets) in a more efficient way.
Instead of allocating memory for all possible employee numbers, a
hash function is used to map each employee number to a smaller
range of memory locations.
This results in a significant reduction in memory usage while
maintaining relatively fast search times.
The motivation behind hashing in this case is to strike a
balance between memory efficiency and search efficiency.
Hashing reduces the memory footprint required to store data
while still providing reasonably fast search operations.
This trade-off becomes even more important as the size of
the dataset grows, making efficient use of memory resources
a crucial consideration in modern computer systems.

Contd…
A hash table, also known as a hash map, is a
data structure designed to employ a hash
function for the effective mapping of keys to
corresponding values, facilitating rapid and
efficient search and retrieval operations.
Purpose of the Hash Table Data Structure: To
facilitate average-case constant-time
operations for insertion, deletion, and search.

Working Procedure
We compute a hash value for the key using the hash
function and then use this hash value as the index
at which the element is stored in the hash table.

Hash Functions
There are two principal criteria in choosing a hash function H:
The hash function should be very easy and quick to
compute
The hash function should as far as possible give two
different indices for two different key values.
Unfortunately, such a function H may not yield distinct
values: it is possible that two different keys k1 and k2 will
yield the same hash address.
This situation is called collision, and some method must
be used to resolve it.
The topic of hashing is divided into two parts: (1) hash
functions and (2) collision resolutions.

Some popular hash functions
Different hashing functions
Division-Method
Midsquare Method
Folding Method
Digit Analysis
Algebraic Coding

Division-Method
Choose a number m larger than the number n of keys in
the set K.
Preferably, choose m as a prime number or a
number without small divisors to reduce collisions.
Define the hash function H(k) as:
H(k) = k (mod m) for hash addresses ranging from 0 to
m - 1.
H(k) = k (mod m) + 1 for hash addresses ranging from
1 to m.
Suppose k = 23 and m = 10. Then H(k) = (23 mod 10) + 1
= 3 + 1 = 4, so the key 23 is placed in the 4th location.
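A minimal sketch of the division method covering both address ranges. The function name and the one_based flag are illustrative choices, not part of the original slides.

```python
def division_hash(k, m, one_based=False):
    """Division method: k mod m (addresses 0..m-1),
    or (k mod m) + 1 when addresses run from 1 to m."""
    h = k % m
    return h + 1 if one_based else h

print(division_hash(23, 10, one_based=True))  # 4, as in the example above
print(division_hash(23, 10))                  # 3
print(division_hash(3205, 97))                # 4, since 3205 = 33 * 97 + 4
```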

Midsquare Method
Midsquare method involves:
Squaring the key k to get $k^2$.
Extracting a portion of the middle digits of $k^2$ as the hash code.
It relies on the assumption that middle digits are
uniformly distributed.
Requires careful selection of the number of digits to
extract for a balanced distribution.
For example, the square of 5678 is $5678^2$ = 32,239,684.
Extracting the middle 2 digits gives 39.
This method has been criticized because squaring is
time-consuming, but it usually gives good results
as far as the uniform distribution of the keys over the
hash table is concerned.
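The midsquare steps can be sketched as below. How the "middle" is chosen when the digits cannot be centred exactly is an assumption of this sketch (it rounds toward the left).

```python
def midsquare_hash(k, r):
    """Square k and return the middle r digits of the square as an integer."""
    digits = str(k * k)
    start = (len(digits) - r) // 2   # assumption: round toward the left
    return int(digits[start:start + r])

print(5678 * 5678)              # 32239684
print(midsquare_hash(5678, 2))  # 39
```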

Folding method
The key k is partitioned into a number of parts,
$k_1, \ldots, k_r$, where each part, except possibly the
last, has the same number of digits as the
required address.
Then the parts are added together, ignoring any
carry out of the leading digit. That is,
H(k) = $k_1 + k_2 + \cdots + k_r$, with the leading-digit carry dropped.
Sometimes, for extra “milling,” the even-numbered
parts, $k_2, k_4, \ldots$, are each reversed
before the addition.

Example
Suppose the key is 12345678 and the
required address has two digits.
Then break the key into the parts 12, 34, 56, 78.
Adding these, we get 12 + 34 + 56 + 78 = 180.
Ignoring the leading 1, we get 80 as the location.
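The folding example can be reproduced with a short sketch. The function name and the reverse_even flag (for the optional reversal of even-numbered parts) are illustrative; dropping the carry is implemented as a remainder modulo 10^width.

```python
def folding_hash(k, width, reverse_even=False):
    """Split the decimal digits of k into parts of `width` digits,
    add them, and drop any carry beyond `width` digits."""
    digits = str(k)
    parts = [digits[i:i + width] for i in range(0, len(digits), width)]
    if reverse_even:
        # Parts are numbered from 1, so list indices 1, 3, ... are the even parts.
        parts = [p[::-1] if i % 2 == 1 else p for i, p in enumerate(parts)]
    total = sum(int(p) for p in parts)
    return total % 10 ** width

print(folding_hash(12345678, 2))                     # 80  (12 + 34 + 56 + 78 = 180)
print(folding_hash(12345678, 2, reverse_even=True))  # 98  (12 + 43 + 56 + 87 = 198)
```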

Digit Analysis
This hashing function is distribution-dependent.
Here we make a statistical analysis of the digits of the
keys and select those digits (of fixed position)
which occur quite frequently.
Then we reverse or shift the selected digits to get the address.
For example, suppose the key is 9861234. If the
statistical analysis has revealed that the
third and fifth position digits occur quite
frequently, then we choose the digits in these
positions from the key.
So we get 62. Reversing it, we get 26 as the
address.
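A minimal sketch of the selection-and-reversal step. The statistical analysis itself is assumed to have been done already, so the chosen positions are passed in as a parameter; the function name is illustrative.

```python
def digit_analysis_hash(k, positions, reverse=True):
    """Pick the digits of k at the given 1-based positions,
    optionally reverse them, and return the result as the address."""
    digits = str(k)
    selected = "".join(digits[p - 1] for p in positions)
    if reverse:
        selected = selected[::-1]
    return int(selected)

print(digit_analysis_hash(9861234, (3, 5)))                 # 26
print(digit_analysis_hash(9861234, (3, 5), reverse=False))  # 62
```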

Algebraic Coding
Here an n-bit key value is represented as a
polynomial.
The divisor polynomial is then constructed based
on the address range required.
The key polynomial is divided by the divisor
polynomial, and the remainder is the address polynomial.
Let $f(x) = a_1 + a_2 x + \cdots + a_n x^{n-1}$ be the polynomial of the n-bit key
and let $d(x)$ be the divisor polynomial. Then the
required address polynomial is $f(x) \bmod d(x)$.
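One possible reading of this method is sketched below. It assumes the coefficients are bits and that arithmetic is done modulo 2 (as in CRC computations); the slides do not state the coefficient field, so this is an assumption. Bit i of each integer holds the coefficient of x^i.

```python
def poly_mod_gf2(f, d):
    """Remainder of f(x) divided by d(x), with coefficients taken modulo 2.
    Polynomials are stored as integers: bit i is the coefficient of x**i."""
    if d == 0:
        raise ValueError("divisor polynomial must be non-zero")
    d_deg = d.bit_length() - 1
    # While deg(f) >= deg(d), cancel the leading term of f.
    while f.bit_length() > d_deg:
        shift = f.bit_length() - 1 - d_deg
        f ^= d << shift          # subtraction modulo 2 is XOR
    return f

# f(x) = x^3 + x^2 + 1, d(x) = x^2 + x + 1  ->  remainder x + 1
print(bin(poly_mod_gf2(0b1101, 0b111)))  # 0b11
```

A divisor of degree r yields addresses in the range 0 to 2^r - 1.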

Collision Resolution
Suppose we want to add a new record R with key k to
our file F, but the memory location with address
H(k) is already occupied. This situation is called a collision.
Collisions are almost impossible to avoid.
For example, suppose a class has 24 students and
the table has space for 365 records.
One random hash function is to use the student’s
birthday as the hash address.
Although the load factor λ = 24/365 ≈ 7% is very small, it
can be shown that there is a better than fifty-fifty
chance that two of the students have the same birthday.

Birthday Paradox
Problem Statement: How many people are needed at a
party such that there is a reasonable chance that at least
two people have the same birthday?
Solution:
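Under the usual simplifying assumptions (365 equally likely birthdays, independent of one another, no leap years), the probability that n people all have distinct birthdays is (365/365)(364/365)...((365 - n + 1)/365), and the chance of at least one shared birthday is one minus that product. A minimal sketch of the calculation:

```python
def shared_birthday_probability(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming birthdays are uniform and independent."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct

for n in (22, 23, 24):
    print(n, round(shared_birthday_probability(n), 4))
```

The probability first exceeds one half at n = 23, so 23 people are enough for a better than even chance.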

Collision Resolution Strategies
Collision resolution is the main problem in
hashing.
There are several strategies for collision
resolution.
The most commonly used are:
Closed Hashing (also called open addressing; linear probing is its simplest form)
Open Hashing (also called chaining)

Closed Hashing
Suppose that a new record R with key k is to be
added to the memory table T, but that the
memory location with hash address H(k) = h is
already filled.
One natural way to resolve the collision is to assign
R to the first available location following T[h]. (We
assume that the table T with m locations is
circular, so that T[1] comes after T[m].)
Accordingly, with such a collision procedure, we
will search for the record R in the table T by
linearly searching the locations T[h], T[h + 1], T[h +
2], … until finding R or meeting an empty location,
which indicates an unsuccessful search.
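The insertion and search procedure can be sketched as follows. The sketch assumes 0-based locations T[0..m-1] rather than T[1..m], stores bare keys instead of full records, and takes the hash function h as a parameter; all names are illustrative.

```python
EMPTY = None

def insert(table, key, h):
    """Place key in the first free location at or after h(key), wrapping around."""
    m = len(table)
    start = h(key)
    for i in range(m):
        slot = (start + i) % m
        if table[slot] is EMPTY or table[slot] == key:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")

def search(table, key, h):
    """Return the location of key, or -1 if an empty location is met first."""
    m = len(table)
    start = h(key)
    for i in range(m):
        slot = (start + i) % m
        if table[slot] is EMPTY:
            return -1            # unsuccessful search
        if table[slot] == key:
            return slot
    return -1                    # table full and key absent

T = [EMPTY] * 11
h = lambda k: k % 11
for k in (10, 21, 32):           # all three keys hash to location 10
    print(k, "->", insert(T, k, h))
print(search(T, 32, h), search(T, 43, h))
```

Keys 10, 21 and 32 land in locations 10, 0 and 1, which shows the wrap-around from the end of the table to its start.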

Limitations
One main disadvantage of linear probing is that records
tend to cluster, that is, appear next to one another,
when the load factor is greater than 50 percent.
Such a clustering substantially increases the average
search time for a record.
Two techniques to minimize clustering are as follows:
Quadratic probing.
Suppose a record R with key k has the hash address H(k) = h.
Then, instead of searching the locations with addresses h, h
+ 1, h + 2, …, we search the locations with addresses
h, h + 1, h + 4, h + 9, h + 16, …, $h + i^2$, … (taken modulo the table size m).
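A minimal sketch of the quadratic probe sequence and an insertion that follows it. It assumes 0-based locations, the division hash k mod m, and bare keys in place of records; these choices are illustrative.

```python
def quadratic_probe_sequence(h, m, count):
    """First `count` locations probed from hash address h in a table of size m."""
    return [(h + i * i) % m for i in range(count)]

def insert(table, key):
    """Insert key using quadratic probing with the division hash key % m."""
    m = len(table)
    h = key % m
    for i in range(m):
        slot = (h + i * i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no free location found along the probe sequence")

print(quadratic_probe_sequence(3, 11, 6))  # [3, 4, 7, 1, 8, 6]

T = [None] * 11
print([insert(T, k) for k in (3, 14, 25)])  # [3, 4, 7]
```

Unlike linear probing, the sequence may not visit every location. When m is prime, the first ⌈m/2⌉ probes are guaranteed to be distinct, which is one reason to keep the table less than half full with this scheme.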

Contd…
Double hashing.
Here a second hash function H′ is used for resolving
a collision.
Suppose a record R with key k has the hash
addresses H(k) = h and H′(k) = h′. Then we
search the locations with addresses h, h + h′, h +
2h′, h + 3h′, … (taken modulo the table size m).
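A minimal sketch of the double-hashing probe sequence. Both hash functions here are assumptions chosen for illustration: H(k) = k mod m, and H′(k) = 1 + (k mod (m - 1)), which is never zero so the probe always moves.

```python
def double_hash_probe_sequence(k, m, count):
    """First `count` locations probed for key k in a table of size m."""
    h = k % m                 # assumed primary hash H
    step = 1 + (k % (m - 1))  # assumed secondary hash H', in the range 1..m-1
    return [(h + i * step) % m for i in range(count)]

print(double_hash_probe_sequence(25, 11, 5))  # [3, 9, 4, 10, 5]
```

When m is prime, every step in the range 1..m-1 is coprime to m, so the sequence visits all m locations before repeating.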

Remark
One major disadvantage in any type of open addressing
procedure is in the implementation of deletion.
Specifically, suppose a record R is deleted from the
location T[r].
Afterwards, suppose we meet the now-empty location T[r]
while searching for another record R′.
This does not necessarily mean that the
search is unsuccessful, since R′ may have been placed
beyond T[r] while R still occupied it.
Thus, when deleting the record R, we must label the
location T[r] to indicate that it previously did contain a
record.
Accordingly, open addressing may seldom be used when
a file F is constantly changing.
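One common way to label a deleted location is a special marker, often called a tombstone. The sketch below assumes linear probing, 0-based locations, the division hash k mod m, and bare keys; the DELETED marker and function names are illustrative.

```python
EMPTY = None
DELETED = object()   # marker: this location previously held a record

def probe(key, m):
    """Linear probe sequence starting at key % m."""
    start = key % m
    for i in range(m):
        yield (start + i) % m

def insert(table, key):
    """Insert key (assumed not already present), reusing DELETED locations."""
    for slot in probe(key, len(table)):
        if table[slot] is EMPTY or table[slot] is DELETED:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")

def search(table, key):
    """Stop only at a truly EMPTY location; skip over DELETED ones."""
    for slot in probe(key, len(table)):
        if table[slot] is EMPTY:
            return -1
        if table[slot] is not DELETED and table[slot] == key:
            return slot
    return -1

def delete(table, key):
    """Replace the record by the DELETED marker instead of emptying the slot."""
    slot = search(table, key)
    if slot != -1:
        table[slot] = DELETED
    return slot

T = [EMPTY] * 11
for k in (10, 21, 32):       # all hash to location 10
    insert(T, k)
delete(T, 21)
print(search(T, 32))         # still found, because the marker is skipped
```

If delete simply reset the location to EMPTY, the search for 32 would stop at the vacated location and wrongly report failure.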

Open Hashing
It involves maintaining two tables in memory.
First of all, as before, there is a table T in
memory which contains the records in F.
T now has an additional field LINK which is
used so that all records in T with the same hash
address h may be linked together to form a
linked list.
Second, there is a hash address table LIST
which contains pointers to the linked lists in T.
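The two-table layout can be sketched as below. It assumes integer keys, the division hash k mod m, indices into T used as pointers with -1 meaning "no next record", and new records linked at the front of their list; the class and method names are illustrative.

```python
class ChainedTable:
    def __init__(self, m):
        self.m = m
        self.LIST = [-1] * m   # LIST[h]: index in T of the first record with hash h
        self.T = []            # each record is [key, value, LINK]

    def _hash(self, key):
        return key % self.m

    def insert(self, key, value):
        """Append the record to T and link it at the front of its list."""
        h = self._hash(key)
        self.T.append([key, value, self.LIST[h]])
        self.LIST[h] = len(self.T) - 1

    def search(self, key):
        """Follow the LINK fields of the list for this hash address."""
        i = self.LIST[self._hash(key)]
        while i != -1:
            k, value, link = self.T[i]
            if k == key:
                return value
            i = link
        return None

    def delete(self, key):
        """Unlink the record from its list; its slot in T is simply left unused."""
        h = self._hash(key)
        prev, i = -1, self.LIST[h]
        while i != -1:
            if self.T[i][0] == key:
                if prev == -1:
                    self.LIST[h] = self.T[i][2]
                else:
                    self.T[prev][2] = self.T[i][2]
                return True
            prev, i = i, self.T[i][2]
        return False

t = ChainedTable(7)
for key, value in ((10, "a"), (17, "b"), (24, "c"), (5, "d")):
    t.insert(key, value)       # 10, 17 and 24 all hash to address 3
print(t.search(17), t.delete(17), t.search(17))
```

A fuller version would keep a list of free slots in T so that deleted records can be reused.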

Advantages
An overflow situation never arises.
Collision resolution can be achieved very
efficiently.
Insertion and deletion become quick and
easy tasks.

Conclusions
There are three factors that influence the performance
of hashing:
Hash function
It should distribute the keys and entries evenly throughout
the entire table.
It should minimize collisions.
Collision resolution strategy
An effective strategy is needed.
Table size
Too large a table wastes memory.
Too small a table causes more collisions and
eventually forces rehashing.
The size should be appropriate to the hash function used
and should typically be a prime number.
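One way to act on the table-size advice is sketched below: pick the smallest prime that keeps the load factor at or below a chosen limit. The limit of 0.7 is a hypothetical default for illustration, not a value given in these slides.

```python
import math

def is_prime(n):
    """Trial division; adequate for table sizes of practical interest."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

def choose_table_size(n_keys, max_load=0.7):
    """Smallest prime m with n_keys / m <= max_load (max_load is an assumed limit)."""
    m = max(2, math.ceil(n_keys / max_load))
    while not is_prime(m):
        m += 1
    return m

print(choose_table_size(90))       # 131
print(choose_table_size(24, 0.5))  # 53
```

For the 90-employee example, a table of 131 locations gives a load factor of about 0.69, compared with 10,000 locations under direct addressing.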

Thank You…
Any Queries...?
