
Hashing


HASHING

PRESENTED BY
SHAKUNTLA RAVANI
ASST. PROF.
PSE
SYMBOL TABLE

• It is a standard data structure used in language processing.
• Information about the various source-language constructs can be stored in a symbol table. For example, a compiler scans a source file and stores each identifier (a string of characters) in a symbol table entry. Here the identifier is a symbol; a table of such symbols is known as a symbol table.
• Each symbol can have a list of associated attributes.
• A symbol together with its list of attributes forms one entry of the symbol table.
CONT..
• Attributes of a symbol (identifier) could be:
▫ Type of identifier
▫ Its usage (e.g. function name, variable, label)
▫ Its memory address
• A symbol table is primarily concerned with saving and retrieving symbols.

[Figure: symbol table with the columns SYMBOL and ATTRIBUTES; example rows TOTAL, X, l]
The following operations can be performed on a symbol table:
• Insert(s, t): insert a new entry for string s with associated attributes t.
• Lookup(s): return the index of the entry for string s.
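
The two operations above can be sketched in Python as follows. This is only an illustration: the class name, the list-plus-dictionary layout, the error raised on a duplicate symbol and the -1 result for a missing symbol are all assumptions, not part of the slides.

```python
class SymbolTable:
    """Toy symbol table: entries live in a list, names map to list indexes."""

    def __init__(self):
        self.entries = []     # each entry is a (name, attributes) pair
        self.index_of = {}    # name -> position in self.entries

    def insert(self, s, t):
        """Insert a new entry for string s with attributes t; return its index."""
        if s in self.index_of:
            # Assumed policy: refuse to insert the same symbol twice.
            raise KeyError(f"symbol {s!r} already present")
        self.index_of[s] = len(self.entries)
        self.entries.append((s, t))
        return self.index_of[s]

    def lookup(self, s):
        """Return the index of the entry for s, or -1 if s is absent (assumed)."""
        return self.index_of.get(s, -1)


table = SymbolTable()
table.insert("total", {"type": "int", "usage": "variable"})
table.insert("x", {"type": "float", "usage": "variable"})
print(table.lookup("total"), table.lookup("x"), table.lookup("y"))
```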
REPRESENTATION OF SYMBOL TABLE
• There are various ways to represent a symbol table:
• 1. Binary search: symbols can be stored in a sequential data structure (kept in sorted order) and located quickly using binary search. A balanced binary search tree can also be used, with each node of the tree holding a symbol. This is good when each name is searched with equal probability.
• 2. An optimal binary search tree should be preferred when different symbols are searched with different probabilities.
• 3. Hashing: hash tables have an advantage over the other representations because a symbol can be found in constant time on average, irrespective of its location in the symbol table.
STATIC TREE TABLES
[Figure: static tree table holding the keywords if, do, read, while]

• In a static symbol table the symbols are known in advance, and no insertions or deletions are allowed. Examples: Huffman tree and OBST (optimal binary search tree).
• The cost of searching for a symbol that occurs with higher frequency is small.
HASH TABLES
What is hashing ??
• Hashing is generally applied to a file F containing R records. Each record contains many fields; one particular field may uniquely identify a record in the file. Such a field is known as the primary key.
• Hashing is implemented using a hash table, which is one way to implement the symbol table described earlier.
• Whenever a key is to be inserted into the hash table, a hash function is applied to it, which gives the index for the key.
• The key is then inserted at that index in the hash table.
CONT..

[Figure: the KEY of each record is passed through the hash function F( ), which yields an ADDRESS in the hash table]
CONT..

• A hash table has a finite number of locations, while the number of possible keys is virtually infinite.
• It is therefore quite possible that two distinct keys have the same index in the hash table. The situation in which a key hashes to an index that is already occupied by another key is called a collision.
• A collision should be resolved by finding some other location for the new key. The process of finding another location is called collision resolution.
HASH TABLE DATA STRUCTURE

• There are two different forms of hashing:
• (1) Open hashing / external hashing data structure
• (2) Closed hashing / internal hashing data structure
HASH TABLE DATA STRUCTURE
• Open hashing allows records to be stored in unlimited space (which could be a hard disk). It places no limitation on the size of the table.
• Closed hashing uses a fixed space for storage and thus limits the size of the symbol table.

[Figure: open hashing. A bucket table header with entries 0, 1, ..., B-1, each pointing to a list of elements]
CONT..
• The basic idea is that records are partitioned into B classes, numbered 0, 1, 2, ..., B-1.
• The hashing function f(x) maps a record with key x to an integer value between 0 and B-1. If a record is mapped to location 1, then we say that the record is mapped to bucket 1 or class 1.
• Each bucket of the bucket table is the head of a linked list of the records mapped to that bucket.
CLOSED HASHING

• A closed hash table keeps the elements in the bucket table itself.
• Only one element can be put in a bucket. If we try to place an element x in bucket f(x) and find that it already holds an element, then a collision has occurred.
• In case of a collision the element should be rehashed to alternate locations f1(x), f2(x), ... within the bucket table until an empty one is found.
HASHING FUNCTIONS

• A hash function h is simply a mathematical formula that maps a key to some slot in the hash table T. Thus we say that the key k hashes to slot h(k).
• If the size of the hash table is N, then the index of the hash table ranges from 0 to N-1. A hash table with N slots is denoted by T[N].
• If the input keys are integers, applying the hash function to them is simple. However, if the input keys are strings, they are first converted into integers before the hash function is applied. Here the ASCII code is used to convert each character into an integer.
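
As a rough illustration of the string case, the sketch below converts a string to an integer by summing the ASCII codes of its characters and then reduces it modulo N. Summing the codes is only one simple scheme and is an assumption here; the slides do not say how the codes are combined. The function names are hypothetical.

```python
def string_to_int(s):
    """Sum the ASCII codes of the characters in s (one simple, assumed scheme)."""
    return sum(ord(ch) for ch in s)


def hash_string(s, n):
    """Map string s to a slot in 0..n-1 using the remainder after division by n."""
    return string_to_int(s) % n


# 'a' = 97, 'b' = 98, 'c' = 99, so "abc" becomes 294 and lands in slot 4 of 10.
print(string_to_int("abc"), hash_string("abc", 10))
```

Note that a plain sum gives the same value to any rearrangement of the same characters, so "abc" and "cba" collide under this scheme.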
CONT..

• Characteristics of a good hash function:
▫ (1) A good hash function avoids collisions as far as possible.
▫ (2) A good hash function tends to spread keys evenly in the array.
▫ (3) A good hash function is easy to compute.
DIVISION METHOD

• It is the simplest and most commonly used method.
• In this method the key k is divided by the number of slots N in the hash table, and the remainder obtained after division is used as the index in the hash table.
• The hash function is: h(k) = k mod N, where mod is the modulus (%) operator and the index ranges from 0 to N-1.
• If the index ranges from 1 to N, then the function will be: h(k) = (k mod N) + 1
• Consider a hash table with N = 10 whose indexes run from 1 to 10. For the key k = 23:
• L = (23 mod 10) + 1 = 3 + 1 = 4
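
A minimal Python sketch of the division method, covering both index ranges described above. The function name and the one_based flag are illustrative choices, not taken from the slides.

```python
def division_hash(k, n, one_based=False):
    """Division method: remainder of k divided by the table size n.

    With one_based=True the result is shifted into the range 1..n.
    """
    index = k % n
    return index + 1 if one_based else index


print(division_hash(23, 10))                  # index range 0..9
print(division_hash(23, 10, one_based=True))  # index range 1..10
```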
MID SQUARE METHOD

• In this method, square the value of the key and take the number of digits required to form an address from the middle of the squared value.
• Suppose the key is 16; its square is 256. If we want a 2-digit address, we select 56 (two digits starting from the middle of 256).
• If N = 1000 and key = 132437, then the hash value is calculated as follows:
• The square of 132437 is 17539558969
• The hash value is obtained by taking the 5th, 6th and 7th digits counting from the left, which gives 955
• So L = 955
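
The mid-square examples can be reproduced with the sketch below. The slides do not give a general rule for where the "middle" starts, so the start position used here, (length - r + 1) // 2, is an assumption chosen because it matches both worked examples.

```python
def mid_square_hash(k, r):
    """Mid-square method: square k and take r digits from the middle.

    The start position is an assumed convention that reproduces the
    examples in the text; other conventions are equally possible.
    """
    square = str(k * k).zfill(r)          # pad so at least r digits exist
    start = (len(square) - r + 1) // 2    # assumed definition of "middle"
    return int(square[start:start + r])


print(mid_square_hash(16, 2))       # 16^2 = 256
print(mid_square_hash(132437, 3))   # 132437^2 = 17539558969
```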
FOLDING METHOD

• This method can be implemented in two ways:
• (1) Fold shifting
• It is a two-step process. In the first step, the key k is divided into several groups starting from the leftmost digits, where each group contains n digits except the last one, which may contain fewer digits.
• In the next step these groups are added together, and the hash value is obtained by ignoring the final carry (if any).
CONT..

• Suppose the key is 12345678 and N = 100. The slots of the table range from 0 to 99, so the index will have 2 digits.
• Break the key into groups of two digits from the left: 12, 34, 56, 78
• Adding these we get 12 + 34 + 56 + 78 = 180
• Ignoring the carry 1, the address is L = 80
CONT..
(2) Fold Boundary
• It is a two-step process. In the first step, the key k is divided into several groups starting from the leftmost digits, where each group contains n digits except the last one, which may contain fewer digits.
• In the next step the digits of the outer groups (the first and the last) are reversed, all groups are added together, and the hash value is obtained by ignoring the final carry (if any).
CONT..

• Suppose the key is 12345678 and N = 100. The slots of the table range from 0 to 99, so the index will have 2 digits.
• Break the key into groups of two digits from the left: 12, 34, 56, 78
• Reverse the outer groups: 12 becomes 21 and 78 becomes 87
• Adding, we get 21 + 34 + 56 + 87 = 198
• Ignoring the carry 1, the address is L = 98
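
Both folding variants can be sketched in Python as below. The helper names are hypothetical. For fold boundary, this sketch reverses only the first and last groups, which gives 21 + 34 + 56 + 87 = 198 for the key 12345678 and therefore the address 98.

```python
def split_groups(k, width):
    """Split the decimal digits of k into groups of `width` digits from the left."""
    digits = str(k)
    return [digits[i:i + width] for i in range(0, len(digits), width)]


def fold_shift(k, width):
    """Fold shifting: add the groups and drop any carry beyond `width` digits."""
    total = sum(int(g) for g in split_groups(k, width))
    return total % 10 ** width


def fold_boundary(k, width):
    """Fold boundary: reverse the outer groups, then add and drop the carry."""
    groups = split_groups(k, width)
    groups[0] = groups[0][::-1]
    if len(groups) > 1:
        groups[-1] = groups[-1][::-1]
    total = sum(int(g) for g in groups)
    return total % 10 ** width


print(fold_shift(12345678, 2), fold_boundary(12345678, 2))
```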
MULTIPLICATIVE METHOD

• Here the address is obtained by multiplication.
• If k is a non-negative key and c is a constant with 0 < c < 1, compute kc mod 1, which is the fractional part of kc.
• Multiply this fractional part by m (the number of slots) and take the floor to get the address.
• h(k) = floor( m (kc mod 1) ), 0 <= h(k) < m
• If m = 101, k = 132437 and c = 0.6180339887 (0.6180 to four places), then
• h(132437) = floor( 101 * ((132437 * 0.6180339887) mod 1) )
= floor( 101 * (81850.56736 mod 1) )
= floor( 101 * 0.56736 )
= floor( 57.30336 )
= 57
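
A small Python sketch of the multiplicative method follows. The default constant (sqrt(5) - 1) / 2, about 0.6180339887, is an assumption: it is the value whose product with 132437 has a fractional part of about 0.5673, as in the worked example. The function name is illustrative.

```python
import math


def multiplicative_hash(k, m, c=(math.sqrt(5) - 1) / 2):
    """Multiplicative method: floor(m * fractional part of k*c), with 0 < c < 1."""
    frac = (k * c) % 1.0          # fractional part of k*c
    return math.floor(m * frac)


print(multiplicative_hash(132437, 101))
```

The result is sensitive to the digits of c: with c truncated to exactly 0.618, the product is 81846.066 and the same key lands in slot 6 instead of 57.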
DIGIT ANALYSIS

• Here the digits at fixed positions of the key k are selected.
• The selected digits are then reversed to get the address.
• For example, let the key be k = 9861234.
• If we take the fixed positions 3rd and 5th, we get 62.
• Reversing it gives 26 as the address.
• So address L = 26
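
The digit analysis steps can be sketched as follows. The function name is hypothetical, and positions are counted from 1 starting at the leftmost digit, as in the example.

```python
def digit_analysis_hash(k, positions):
    """Select the digits of k at the given 1-based positions, then reverse them."""
    digits = str(k)
    selected = "".join(digits[p - 1] for p in positions)
    return int(selected[::-1])


print(digit_analysis_hash(9861234, (3, 5)))   # digits 6 and 2, reversed
```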
LENGTH DEPENDENT METHOD

• It gives the address in two ways:
(1) Direct method
(2) Indirect method
• (1) Direct method:
• Use the length of the key along with some portion of the key to produce the address.
• Example: if k = 1234 and N = 1000 then
• Address L = 412 (where 4 is the length of the key 1234 and 12 is the first two digits of k)
CONT..

• (2) Indirect method
• The length of the key along with some portion of the key is used to obtain an intermediate address.
• Then any other method is used to obtain the final address.
• Example: if k = 12345 and N = 100 then
• Intermediate address L = 512 (where 5 is the length of the key 12345 and 12 is the first two digits of k)
• If we apply the division method then
• h(512) = 512 mod 100 = 12
• Address L = 12
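
Both length-dependent variants can be sketched as below. The function names and the prefix_digits parameter are illustrative assumptions; the indirect variant uses the division method as its second step, as in the example.

```python
def length_dependent_direct(k, prefix_digits=2):
    """Direct method: key length followed by the first `prefix_digits` digits."""
    digits = str(k)
    return int(str(len(digits)) + digits[:prefix_digits])


def length_dependent_indirect(k, n, prefix_digits=2):
    """Indirect method: build the intermediate address, then apply k mod n."""
    intermediate = length_dependent_direct(k, prefix_digits)
    return intermediate % n


print(length_dependent_direct(1234))          # length 4, prefix 12
print(length_dependent_indirect(12345, 100))  # intermediate 512, then mod 100
```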
COLLISION RESOLUTION STRATEGIES

• With any hash function it is possible that two keys give the same address, which is known as a collision.
• There are several strategies to resolve collisions:
• (a) Separate chaining: used with open hashing
• (b) Open addressing: used with closed hashing
SEPARATE CHAINING
• Separate chaining is based on collision avoidance: colliding keys are chained together instead of competing for one slot.
• If memory space is tight, separate chaining should be avoided.
• In this method a linked list of all the key values that hash to the same value is maintained.
• Each node of the linked list contains a key value and a pointer to the next node. Each index i (0 <= i < N) in the hash table contains the address of the first node of the linked list containing all the keys that hash to i.
• If no key value hashes to index i, the slot contains NULL.
• So in this method a slot of the hash table does not contain actual key values; rather, it contains the address of the first node of the linked list of elements that hash to this slot.
CONT..
• Consider the key values 20, 32, 41, 66, 72, 80, 105, 77, 56 and 53, which need to be hashed using the simple hash function h(k) = k mod 10.

Index  Chain
0      20 -> 80 -> NULL
1      41 -> NULL
2      32 -> 72 -> NULL
3      53 -> NULL
4      NULL
5      105 -> NULL
6      66 -> 56 -> NULL
7      77 -> NULL
8      NULL
9      NULL
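
The chained table for this example can be rebuilt with the short sketch below. Python lists stand in for the linked lists here, which is a simplification, and new keys are appended at the end of a chain to match the order shown in the table.

```python
def build_chained_table(keys, n):
    """Separate chaining with h(k) = k mod n; each slot holds a chain of keys."""
    table = [[] for _ in range(n)]
    for k in keys:
        table[k % n].append(k)   # append at the tail of the chain
    return table


keys = [20, 32, 41, 66, 72, 80, 105, 77, 56, 53]
table = build_chained_table(keys, 10)
for index, chain in enumerate(table):
    print(index, " -> ".join(map(str, chain + ["NULL"])))
```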
SEPARATE CHAINING

• The additional memory used for links is overhead: it stores only the addresses of linked elements.
• The hash function should ensure an even distribution of elements among the buckets.
HOMEWORK

The integers given below are to be inserted in a hash table with 5 locations using separate chaining to resolve collisions. Construct the hash table using the simplest hash function.
1, 2, 3, 4, 5, 10, 21, 22, 33, 34, 15, 32, 31, 48, 49, 50
OPEN ADDRESSING

• In open addressing, if a collision occurs, alternate cells are tried until an empty cell is found.
• Because all data elements are stored inside the table itself, a larger table is needed for open addressing.
• There are three commonly used collision resolution strategies in open addressing:
• (1) Linear probing
• (2) Quadratic probing
• (3) Double hashing
LINEAR PROBING
• The process of examining the slots of the hash table to find the location of a key value is known as probing.
• In linear probing, collisions are resolved by sequentially scanning the array (with wraparound) until an empty cell is found.
• It uses the following hash function:
h(k, i) = [h’(k) + i] mod N
where
• h’(k) is any hash function (simply k mod N)
• i is the probe number, ranging from 0 to N-1
• Example:
– Insert items with keys 89, 18, 49, 58, 9 into an empty hash table.
– Table size is 10.
– Hash function is hash(x) = x mod 10.
CONT..

• To insert a key k into the hash table, the slot T[h’(k)] is probed first. If this slot is empty, the key is inserted into it.
• Otherwise the slots T[h’(k) + 1], T[h’(k) + 2], ... and so on up to T[h’(k) + (N-1)] are probed sequentially until an empty slot is found.
• If no empty slot is found up to T[N-1], we wrap around to slots T[0], T[1] and so on until an empty cell is found.
CONT..
[Figure: linear probing hash table after each insertion]
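
The linear probing example (keys 89, 18, 49, 58, 9 with table size 10) can be traced with the sketch below. The function name and the use of None for an empty slot are illustrative assumptions.

```python
def linear_probe_insert(table, k):
    """Insert k using h(k, i) = (k mod N + i) mod N; return the slot used."""
    n = len(table)
    for i in range(n):
        slot = (k % n + i) % n    # wraps around past the end of the table
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("hash table is full")


table = [None] * 10
for key in [89, 18, 49, 58, 9]:
    slot = linear_probe_insert(table, key)
    print(f"{key} -> slot {slot}")
```

After all five insertions, 49, 58 and 9 occupy slots 0, 1 and 2, right next to each other, while 18 and 89 sit in slots 8 and 9.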
PROBLEM OF CLUSTERING

• Linear probing is easy to implement, but it suffers from primary clustering.
• When many keys are mapped to the same location (clustering), linear probing will not distribute these keys evenly in the hash table.
• These keys will be stored in the neighborhood of the location to which they are mapped. This leads to primary clustering.
HOMEWORK: Consider the insertion of the following keys into a hash table with N = 10:
126, 75, 37, 56, 29, 154, 10, 99
QUADRATIC PROBING

• In quadratic probing the probe offset is a quadratic function of the probe number i, instead of a linear function as in linear probing.
• h(k, i) = [h’(k) + i^2] mod N
where h’(k) is any simple hash function (simply k mod N)
i is the probe number, ranging from 0 to N-1
• Example:
– Insert items with keys 89, 18, 49, 58, 9 into an empty hash table.
– Table size is 10.
– Hash function is hash(x) = x mod 10.
CONT..
[Figure: quadratic probing hash table after each insertion]
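
The same keys can be inserted with quadratic probing using the sketch below, which applies h(k, i) = (k mod N + i^2) mod N. As before, the function name and the None marker for empty slots are assumptions made for illustration.

```python
def quadratic_probe_insert(table, k):
    """Insert k using h(k, i) = (k mod N + i*i) mod N; return the slot used."""
    n = len(table)
    for i in range(n):
        slot = (k % n + i * i) % n
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("no empty slot found within N probes")


table = [None] * 10
for key in [89, 18, 49, 58, 9]:
    slot = quadratic_probe_insert(table, key)
    print(f"{key} -> slot {slot}")
```

Here 58 lands in slot 2 and 9 in slot 3, so the colliding keys are spread further from slots 8 and 9 than under linear probing.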
PROBLEM WITH QUADRATIC PROBING

• Although quadratic probing eliminates primary clustering, elements that hash to the same location will probe the same alternative cells. This is known as secondary clustering.
• Example:
• Insert items with keys 126, 75, 37, 56, 29, 154, 10, 99 into an empty hash table.
• Table size is N = 11.
• Hash function is h’(x) = x mod 11.
• Inserting the key value 88 afterwards shows the effect of secondary clustering in this table.
• Techniques that eliminate secondary clustering are available.
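
The effect can be checked with a short sketch, assuming the formula h(k, i) = (k mod N + i^2) mod N with i from 0 to N-1. The function names are hypothetical, and returning None when no empty slot is reached is an assumed convention.

```python
def quadratic_probe_sequence(k, n):
    """All slots visited for key k under h(k, i) = (k mod n + i*i) mod n."""
    return [(k % n + i * i) % n for i in range(n)]


def quadratic_insert(table, k):
    """Insert k at the first empty probed slot; return None if none is reached."""
    for slot in quadratic_probe_sequence(k, len(table)):
        if table[slot] is None:
            table[slot] = k
            return slot
    return None


table = [None] * 11
for key in [126, 75, 37, 56, 29, 154, 10, 99]:
    quadratic_insert(table, key)

print(quadratic_probe_sequence(99, 11))
print(quadratic_probe_sequence(88, 11))
print(quadratic_insert(table, 88))
```

Keys 154, 99 and 88 all hash to slot 0, so they follow the identical probe sequence 0, 1, 4, 9, 5, 3, ... which touches only six distinct slots. In this sketch 99 needs six probes to reach slot 3, and 88 then finds no empty slot within N probes even though three slots of the table are still free.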

DOUBLE HASHING

• Double hashing almost eliminates the problem of primary as well as secondary clustering.
• Double hashing uses the following hash function:
h(k, i) = [h’(k) + i * h’’(k)] mod N
where
h’(k) is any hash function (for simplicity we use k mod N here)
h’’(k) is another hash function (for simplicity k mod N’, where N’ is slightly less than N, say N-1 or N-2)
i is the probe number, ranging from 0 to N-1
• To understand double hashing, consider inserting the following keys into a hash table with N = 13:
126, 75, 37, 56, 29, 152, 35, 99
• Take h’(k) = k mod N and h’’(k) = k mod (N-2)
• Key 126 hashes to slot 9 as follows:
h(126, 0) = (126 mod 13 + 0 * (126 mod 11)) mod 13 = (9 + 0) mod 13 = 9
Slot 9 is empty, so 126 is inserted at slot 9.

Slot    0    1    2    3    4    5    6    7    8    9   10   11   12
Key     -    -    -    -    -    -    -    -    -  126    -    -    -

• h(75, 0) = (75 mod 13 + 0 * (75 mod 11)) mod 13
= (10 + 0 * 9) mod 13 = 10 mod 13 = 10
Slot 10 is empty, so 75 is inserted at slot 10.

Slot    0    1    2    3    4    5    6    7    8    9   10   11   12
Key     -    -    -    -    -    -    -    -    -  126   75    -    -

• h(37, 0) = (37 mod 13 + 0 * (37 mod 11)) mod 13
= (11 + 0) mod 13 = 11
Slot 11 is empty, so 37 is inserted at slot 11.

Slot    0    1    2    3    4    5    6    7    8    9   10   11   12
Key     -    -    -    -    -    -    -    -    -  126   75   37    -
• h(56, 0) = (56 mod 13 + 0 * (56 mod 11)) mod 13
= (4 + 0) mod 13 = 4
Since slot 4 is empty, 56 is inserted at slot 4.

Slot    0    1    2    3    4    5    6    7    8    9   10   11   12
Key     -    -    -    -   56    -    -    -    -  126   75   37    -

• h(29, 0) = (29 mod 13 + 0 * (29 mod 11)) mod 13
= (3 + 0) mod 13 = 3
Since slot 3 is empty, 29 is inserted at slot 3.
• h(152, 0) = (152 mod 13 + 0 * (152 mod 11)) mod 13
= (9 + 0) mod 13 = 9
Since slot 9 is not empty, the next probe is computed as follows:
h(152, 1) = (152 mod 13 + 1 * (152 mod 11)) mod 13
= (9 + 9) mod 13 = 18 mod 13 = 5
Since slot 5 is empty, 152 is inserted at slot 5.

Slot    0    1    2    3    4    5    6    7    8    9   10   11   12
Key     -    -    -   29   56  152    -    -    -  126   75   37    -
• h(35, 0) = (35 mod 13 + 0 * (35 mod 11)) mod 13
= (9 + 0) mod 13 = 9
Since slot 9 is not empty, the next probe is computed as follows:
h(35, 1) = (35 mod 13 + 1 * (35 mod 11)) mod 13
= (9 + 2) mod 13 = 11
Since slot 11 is not empty, the next probe is computed as follows:
h(35, 2) = (35 mod 13 + 2 * (35 mod 11)) mod 13
= (9 + 4) mod 13 = 13 mod 13 = 0
Since slot 0 is empty, 35 is inserted at slot 0.

Slot    0    1    2    3    4    5    6    7    8    9   10   11   12
Key    35    -    -   29   56  152    -    -    -  126   75   37    -
• Now key 99 hashes to its slot as follows:
h(99, 0) = (99 mod 13 + 0 * (99 mod 11)) mod 13
= (8 + 0) mod 13 = 8
Since slot 8 is empty, 99 is inserted at slot 8.

Slot    0    1    2    3    4    5    6    7    8    9   10   11   12
Key    35    -    -   29   56  152    -    -   99  126   75   37    -
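
The whole walkthrough can be replayed with the sketch below, which uses h’(k) = k mod N and h’’(k) = k mod (N-2) exactly as in the example. The function name, the None marker for empty slots and the error raised after N unsuccessful probes are illustrative assumptions.

```python
def double_hash_insert(table, k):
    """Insert k using h(k, i) = (h1 + i * h2) mod N; return the slot used."""
    n = len(table)
    h1 = k % n          # h'(k)  = k mod N
    h2 = k % (n - 2)    # h''(k) = k mod (N - 2)
    for i in range(n):
        slot = (h1 + i * h2) % n
        if table[slot] is None:
            table[slot] = k
            return slot
    raise RuntimeError("no empty slot found within N probes")


table = [None] * 13
for key in [126, 75, 37, 56, 29, 152, 35, 99]:
    slot = double_hash_insert(table, key)
    print(f"{key} -> slot {slot}")
```

One caveat of this simple choice of h’’: it can evaluate to 0 (for example 99 mod 11 = 0), in which case every probe returns to the same slot. That does not matter for 99 here because its first slot is empty, but a common remedy is to use 1 + (k mod N’) so that the step is never zero.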
