Hash Tables: A Detailed Description
Introduction
We often need to store a set of values in a data structure so that they can be retrieved at a
later moment. This set of values (let’s call it ‘data’) could take any form: integers, letters
or, in some cases, even mathematical functions. In order to retrieve the data, our chosen data
structure needs an index (let’s call it the ‘key’) which uniquely identifies each data point.
Let’s also assume that our data is a subset of a larger set called the Universe, U.
What we want is for our data structure to assign a particular key to each data point, so that
we can extract one particular data point out of the whole data by referring to its
respective key. This raises many important problems, such as:
Definition:
A hash table is a data structure which allows us to store data and label it with certain ‘keys’.
The hash table then stores this data according to a shorter index. For simplicity in the
arguments that follow, we’ll consider hashing to an integer index.
These integer indices are computed by applying a certain function to the user-defined keys.
This function is known as a hash function, and the method is known as hashing.
Consider the following diagram:
Key ‘Abdul’ -> HASH FUNCTION -> index 2 -> value 8.7
HASH TABLE
Index:  0     1      2      3     4     5
Key:    Ram   Rahim  Abdul  Tom   Din   Dan
Value:  3.5   5.6    8.7    4.5   5.4   4.6
In the implementation of a hash table shown above, we have made a list which uses names as
keys and maps them to certain float values. The hash table uses the hash function to compute
the integer indices corresponding to our keys (i.e., the names); e.g., for Abdul, the hash
function computes the value 2. Thus, when we want to retrieve the data of Abdul, we just input
the key into the table; the table then computes its index value and returns the corresponding
result.
Thus we’re able to keep user-defined keys while also retaining the speed of arrays.
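To make the lookup described above concrete, here is a minimal sketch in Python. The function `toy_hash` (the sum of the character codes modulo the table size) is an assumption chosen purely for illustration and is not the hash function behind the diagram; it happens to send ‘Abdul’ to index 2, but it does not reproduce the other indices, and the sketch ignores collisions entirely.

```python
TABLE_SIZE = 6  # number of slots, matching the indices 0..5 in the example


def toy_hash(key: str, size: int = TABLE_SIZE) -> int:
    """Hypothetical hash function: sum of character codes modulo the table size."""
    return sum(ord(ch) for ch in key) % size


# The table is a plain list; each slot holds a (key, value) pair or None.
table = [None] * TABLE_SIZE


def put(key: str, value: float) -> None:
    """Store the pair at the slot computed from the key (no collision handling)."""
    table[toy_hash(key)] = (key, value)


def get(key: str):
    """Recompute the index from the key and return the stored value, if any."""
    slot = table[toy_hash(key)]
    if slot is not None and slot[0] == key:
        return slot[1]
    return None


put("Abdul", 8.7)
put("Ram", 3.5)
print(toy_hash("Abdul"), get("Abdul"))  # 2 8.7
```

The caller only ever handles the name; the integer index is an internal detail that is recomputed on every access.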
COLLISION HANDLING:
When hashing, we use a hash function to compute the index. This means that we convert a large
key into a small, preferably integer, index. It is natural that during this process two
different keys may produce the same hashed value, i.e., two different keys are hashed to the
same memory location. This is known as a collision.
To counter this problem, many techniques are available. They come under the category
of collision handling. A few of the more common types are:
1. Chaining: In this method, if a collision occurs at a particular hashed value, the colliding
records are connected via a linked list. To retrieve a record, we first go to its entry in the
hash table and then traverse the linked list if multiple records are hashed to that location.
This method, even though relatively simple, requires additional memory for the links of each list.
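e.g., let us consider the hash function ‘key mod 6’ and the sequence of keys 20, 350, 38,
45, 46, 36, 55. These keys hash to the indices 2, 2, 2, 3, 4, 0 and 1 respectively. So 20 is
stored at index 2, and 350 and 38, which collide with it, are added to the chain at index 2.
The remaining keys each land in an empty slot. The chaining would be done as follows:
Index   Chain
0       36
1       55
2       20 -> 350 -> 38
3       45
4       46
5       (empty)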
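The ‘key mod 6’ chaining example can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: ordinary Python lists stand in for the linked lists, and the names `build_chained_table` and `contains` are hypothetical.

```python
TABLE_SIZE = 6
KEYS = [20, 350, 38, 45, 46, 36, 55]


def build_chained_table(keys, size=TABLE_SIZE):
    """Insert each key into the chain (a Python list) at index key mod size."""
    table = [[] for _ in range(size)]
    for key in keys:
        table[key % size].append(key)  # a collision simply extends the chain
    return table


def contains(table, key):
    """Go to the hashed slot, then walk its chain looking for the key."""
    return key in table[key % len(table)]


chained = build_chained_table(KEYS)
for index, chain in enumerate(chained):
    print(index, chain)
```

With these keys, 20, 350 and 38 all end up in the chain at index 2, so a search for 38 has to walk past two other records first.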
2. Open Addressing/Probing: In open addressing, every element is stored in the hash table
itself. Each entry in the table holds either a record or a null value.
This clearly implies that the size of the table must be equal to or greater than the
number of entries that are to be stored in it. A null value means that no record has been
hashed to that slot. To retrieve a particular record, we start at the slot given by its hashed
index and probe the table slot by slot until the record is found or an empty slot shows that
it isn’t in the table.
The basic operations are performed as follows in open addressing:
a. Insert: Starting from the hashed index, the table is probed for an empty slot. Once such a
slot is found, the entry is stored there.
b. Search: We keep probing the table until the key in a slot matches the key of the record
that we are searching for, or until we reach an empty slot.
c. Delete: Here we encounter a problem: if we simply empty a slot, a later search could stop
at it and miss records stored further along the probe sequence. Instead, we mark the slot in a
specific way, say as “deleted”. When searching, our function just skips over these slots.
New records can later be written in their place.
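The three operations above, including the “deleted” marker, can be sketched as follows. This is a minimal illustration under several assumptions that are not in the text: keys are distinct integers, the hash function is key mod table size, the probing length is 1, and the class and marker names are hypothetical.

```python
EMPTY = object()    # marker: slot has never been used
DELETED = object()  # marker (tombstone): slot held a record that was removed


class OpenAddressTable:
    """Open addressing with a probing length of 1 for distinct integer keys."""

    def __init__(self, size):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        """Yield every slot index once, starting at the hashed index."""
        start = key % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key  # tombstones may be reused
                return i
        raise OverflowError("table is full")

    def search(self, key):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return None          # a never-used slot ends the search
            if self.slots[i] is not DELETED and self.slots[i] == key:
                return i             # tombstones are skipped, not matched
        return None

    def delete(self, key):
        i = self.search(key)
        if i is not None:
            self.slots[i] = DELETED  # mark the slot, do not empty it
        return i


t = OpenAddressTable(6)
for k in (20, 350, 38):
    t.insert(k)        # all three hash to 2 and occupy slots 2, 3 and 4
t.delete(350)          # slot 3 becomes a tombstone
print(t.search(38))    # 4: the search steps over the tombstone
```

Had `delete` reset slot 3 to `EMPTY` instead, the search for 38 would have stopped at slot 3 and wrongly reported the key as missing.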
Open addressing is done in the following ways:
a) Linear Probing: As the name suggests, in this method we probe the table linearly. We
also select a probing length, i.e., the number of slots that the algorithm moves forward each
time it checks for the next empty slot. Here is an example:
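Let us consider the example of ‘key mod 6’ with a table of 6 slots, a probing length of 1 and
the sequence of keys 20, 350, 38, 45, 46, 36, 55. The probing would be done as follows:
Insert 20:  20 mod 6 = 2, slot 2 is free, so 20 is stored at index 2.
Insert 350: 350 mod 6 = 2, collision with 20; stored at the next free slot, index 3.
Insert 38:  38 mod 6 = 2, slots 2 and 3 are taken; stored at index 4.
Insert 45:  45 mod 6 = 3, slots 3 and 4 are taken; stored at index 5.
Insert 46:  46 mod 6 = 4, slots 4 and 5 are taken; the probe wraps around to index 0.
Insert 36:  36 mod 6 = 0, slot 0 is taken; stored at index 1.
Insert 55:  55 mod 6 = 1, but all six slots are now occupied, so the table is full.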
The following is the algorithm for insertion with open addressing, using linear probing with a
probing length of 1: (Referred from Jayakanth Srinivasan @MIT)
Index := hash(key)
Start := Index
While Table(Index) Is Full do
    Index := (Index + 1) MOD Table_Size
    if (Index = Start)
        return table_full
Table(Index) := Entry
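A runnable Python sketch of the insertion pseudocode above is given here, applied to the ‘key mod 6’ example. The function name `linear_probe_insert`, the use of `None` for a free slot and the `step` parameter (standing for the probing length) are assumptions made for illustration.

```python
def linear_probe_insert(table, key, step=1):
    """Insert key into table (a list with None for free slots); return its index.

    Returns None when the probe comes back to its starting slot without
    finding a free one, which the pseudocode reports as table_full.
    """
    size = len(table)
    start = key % size          # assumed hash function: key mod table size
    index = start
    while table[index] is not None:
        index = (index + step) % size
        if index == start:
            return None         # every probed slot is occupied
    table[index] = key
    return index


table = [None] * 6
placed = {key: linear_probe_insert(table, key)
          for key in (20, 350, 38, 45, 46, 36, 55)}
print(table)   # [46, 36, 20, 350, 38, 45]
print(placed)  # 55 maps to None because the table is already full
```

Seven keys cannot fit into six slots, so the last insertion fails; a real table would be resized before reaching that point.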
b) Quadratic probing: Instead of checking for empty slots linearly, in quadratic probing the
table is traversed in a quadratic fashion, i.e., on the i-th probe the slot at offset i² from
the hashed index is checked.
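c) Double hashing: In this case, two hash functions are used instead of one. The first gives
the starting slot and the second gives the step between probes, so that on the i-th probe the
slot h1(key) + i × h2(key), taken modulo the table size, is checked.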
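The difference between the last two schemes is easiest to see by listing the slots each one visits. The sketch below assumes a table of 7 slots, h(key) = key mod 7 as the primary hash and 1 + (key mod 6) as the second hash; these choices are illustrative assumptions, not taken from the text.

```python
def quadratic_probes(key, size, attempts):
    """Slots tried by quadratic probing: (h(key) + i*i) mod size for i = 0, 1, 2, ..."""
    home = key % size
    return [(home + i * i) % size for i in range(attempts)]


def double_hash_probes(key, size, attempts):
    """Slots tried by double hashing: (h1(key) + i*h2(key)) mod size."""
    h1 = key % size
    h2 = 1 + key % (size - 1)   # assumed second hash; never 0, so the probe always moves
    return [(h1 + i * h2) % size for i in range(attempts)]


print(quadratic_probes(20, 7, 4))    # [6, 0, 3, 1]
print(double_hash_probes(20, 7, 4))  # [6, 2, 5, 1]
```

In quadratic probing every key with the same home slot follows the same path, whereas in double hashing the step depends on the key, so two keys that collide at first usually separate on the next probe.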
PERFECT HASHING:
In most cases, when we analyse hash functions theoretically, we assume that our hash
functions are ideal. This means that out of the set of hash functions available to us, each one is
selected uniformly, i.e. with about equal probability. Mathematically:
But finding functions of this type is extremely difficult, and hence the focus is on finding a
set of hash functions as close to uniform as possible.
In another case of ideal hash functions, we turn our attention to perfect hash functions.
Perfect hash functions are those in which no collision occurs. Perfect hashing is possible when
we know beforehand exactly which keys are or will be available to us. It is popularly used for
hashing keywords in compilers.
A simple example of perfect hashing is a function which maps each key to its own index, i.e., a
set of n keys is mapped to the range [0, n-1]. This is feasible for a small set of records,
but as the size increases, so does the memory requirement, and hence this type of hashing has
limited applicability.
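When the key set is fixed in advance, one simple (if brute-force) way to obtain a collision-free function is to try different parameters of a hash function until none of the known keys collide. The sketch below illustrates that idea only; the keyword list, the seeded polynomial hash and the function names are all hypothetical, and the keys are assumed to be distinct.

```python
KEYWORDS = ["if", "else", "while", "for", "return"]  # hypothetical fixed key set


def seeded_hash(key, seed, size):
    """Polynomial string hash with a tunable seed, reduced to the table size."""
    acc = 0
    for ch in key:
        acc = (acc * seed + ord(ch)) % 1_000_003
    return acc % size


def find_perfect_hash(keys, max_seed=5000):
    """Try seeds (and, if needed, larger tables) until no two keys collide."""
    size = len(keys)
    while True:
        for seed in range(2, max_seed):
            if len({seeded_hash(k, seed, size) for k in keys}) == len(keys):
                return seed, size
        size += 1  # no seed worked for this size: allow one more slot


seed, size = find_perfect_hash(KEYWORDS)
slots = {k: seeded_hash(k, seed, size) for k in KEYWORDS}
print(seed, size, slots)
```

The search is paid for once, when the table is built; afterwards every lookup is a single hash computation with no collision handling. If a new key were added later, the search would have to be repeated.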
CONCLUSION:
In this paper, we discussed the details of hash tables and how they are broadly
implemented. We also dived into some of the techniques used in hashing, while also
discussing some of the popular hash functions in use. The paper ended with a description of
some terms related to hash tables and hashing in general.
In the end, we should discuss the applications of hashing and hash tables:
Hash tables could be used to build a primitive form of search engine. This would lead
us to build two hash tables. One hash table would store a certain set of keywords and
map them to sets of URLs, and the other hash table would contain the URLs within
every set.
Hash tables are also very much in use in the building of compilers. In compiler design,
hash tables are used to store certain keywords that the compiler can quickly refer to
when doing lexical analysis.
Hashing is also widely used in cryptography. A cryptographic hash function maps a message
of any length to a short, fixed-length digest. Unlike encryption, this mapping is designed to
be one-way: the receiver does not reverse it, but recomputes the hash of the received message
and compares it with the digest that was sent, in order to check that the message was not altered.
REFERENCES:
web.mit.edu/16.070/www/lecture/hashing.pdf
https://www.geeksforgeeks.org/hashing-set-2-separate-chaining
https://www.hackerearth.com/practice/data-structures/hash-tables/basics-of-hash-tables/tutorial/
https://www.cs.cmu.edu/~fp/courses/15122-f15/lectures/12-hashtables.pdf
ee.usc.edu/~redekopp/cs104/slides/L21_Hashing.pdf