DS - Module 5
GRAPHS
Definitions: A graph, G, consists of two sets V and E.
V is a finite non-empty set of vertices.
E is a set of pairs of vertices; these pairs are called edges.
V(G) and E(G) will represent the sets of vertices and edges of graph G.
We write G = (V,E) to represent a graph, where V is the set of vertices
and E is the set of edges in the graph G.
A graph may be either undirected or directed.
If the pair of vertices representing any edge is unordered, then the graph is
said to be an undirected graph. Thus, the pairs (v1, v2) and (v2, v1)
represent the same edge.
If each edge is represented by a directed pair <v1, v2>, where v1 is the tail
and v2 the head of the edge, the graph is said to be a directed
graph. Therefore <v2, v1> and <v1, v2> represent two different
edges.
Figure below shows three graphs G1, G2 and G3.
The graphs G1 and G2 are undirected. G3 is a directed graph.
V (G1) = {0,1,2,3}; E(G1) = {(0,1),(0,2),(0,3),(1,2),(1,3),(2,3)}
V (G2) = {0,1,2,3,4,5,6}; E(G2) = {(0,1),(0,2),(1,3),(1,4),(2,5),(2,6)}
V (G3) = {0,1,2}; E(G3) = {<0,1>, <1,2>, <1,0>}.
Note that the edges of a directed graph are drawn with an arrow from the tail to the head.
The graph G2 is also a tree while the graphs G1 and G3 are not.
Restriction on graphs
1. A graph may not have self edges, that is edge to itself.
2. A graph may not have multiple occurrences of the same
edge.
3. For an n-vertex undirected graph, the maximum number of
edges is n(n-1)/2.
For an n-vertex directed graph, the maximum number of edges is n(n-1).
Terminologies
Complete: An n-vertex undirected graph with n(n-1)/2 edges is said to be
complete. In directed graph maximum number of edges is n(n-1). G1 is the
complete graph on 4 vertices while G2 and G3 are not complete graphs.
If (u,v) is an edge in E(G) of an undirected graph, then vertices u and v are
adjacent vertices and edge(u,v) is incident on vertices u and v
In a directed graph, for an edge <u,v> the vertex u is adjacent to v and v is
adjacent from u. The edge <u,v> is incident to u and v.
A subgraph of G is a graph G' such that V(G') ⊆ V(G) and E(G') ⊆ E(G).
V(G') ⊆ V(G) means that every vertex of G' is also a vertex of G, so V(G') has
at most as many elements as V(G).
A path from vertex vp to vertex vq in graph G is a sequence of
vertices vp,vi1,vi2, ...,vin,vq such that (vp,vi1),(vi1,vi2), ...,(vin,vq) are
edges in E(G). If G' is directed then the path consists of
<vp,vi1>,<vi1,vi2>, ..., <vin,vq>, edges in E(G').
The length of a path is the number of edges on it.
A simple path is a path in which all vertices except possibly the
first and last are distinct.
A cycle is a simple path in which the first and last vertices are the
same.
In an undirected graph, G, two vertices v1 and v2 are said to be connected
if there is a path in G from v1 to v2 (since G is undirected, this means there
must also be a path from v2 to v1).
An undirected graph is said to be connected if for every pair of
distinct vertices u and v in V(G) there is a path from u to v in G.
A connected component or simply a component of an
undirected graph is a maximal connected subgraph. That is a
connected component is a subgraph in which every two
vertices are connected to each other by a path and which is not
connected to any additional vertices of the super graph.
Graph with two connected components shown below.
A tree is a connected acyclic (i.e., has no cycles) graph.
A directed graph G is said to be strongly connected if for every pair
of distinct vertices u, v in V(G) there is a directed path from u to v
and also from v to u.
The degree of a vertex is the number of edges incident to
that vertex.
In case G is a directed graph, we define the in-degree of a vertex v to
be the number of edges for which v is the head. The out-degree is
defined to be the number of edges for which v is the tail.
If di is the degree of vertex i in a graph G with n vertices and e edges,
then it is easy to see that e = (Σ di)/2, where the sum runs over i = 0 to n-1,
because every edge is counted once at each of its two end vertices.
We refer to a directed graph as a digraph. An undirected graph
will sometimes be referred to simply as a graph
Graph Representation
1. Adjacency Matrix
2. Adjacency list
The adjacency matrix of G is a 2-dimensional n x n boolean
array say A, with the property that A(i,j) = 1 iff there is an
edge (vi,vj) (<vi,vj> for a directed graph).
A(i,j) = 0 if there is no such edge in G.
Adjacency matrix for a given graph
The adjacency matrix of an undirected graph is symmetric,
whereas that of a directed graph need not be symmetric.
The space needed to represent a graph using its adjacency matrix is
n^2 bits. About half this space can be saved in the case of
undirected graphs by storing only the upper or lower triangle
of the matrix.
For an undirected graph the degree of any vertex i is its row
sum =∑A(i,j) for j=0 to n-1.
For a directed graph the row sum is the out-degree while the
column sum is the in-degree.
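The row-sum and column-sum rules can be sketched in C as follows. This is a minimal illustration, not a definitive implementation: the function names build_g3, out_degree and in_degree are hypothetical, and the matrix is filled with the edges of the directed graph G3 listed earlier (<0,1>, <1,2>, <1,0>).

```c
#define N 3   /* number of vertices in G3 (assumed for this sketch) */

/* Fill a with the adjacency matrix of G3: edges <0,1>, <1,2>, <1,0>. */
void build_g3(int a[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 0;
    a[0][1] = 1;
    a[1][2] = 1;
    a[1][0] = 1;
}

/* Out-degree of vertex i = sum of row i. */
int out_degree(int a[N][N], int i)
{
    int sum = 0;
    for (int j = 0; j < N; j++)
        sum += a[i][j];
    return sum;
}

/* In-degree of vertex j = sum of column j. */
int in_degree(int a[N][N], int j)
{
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i][j];
    return sum;
}
```

For an undirected graph the matrix is symmetric, so out_degree alone gives the degree of a vertex.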
Adjacency Lists representation
In this representation n rows of the adjacency matrix are represented
as n linked lists that is chains
There is one list for each vertex in G. The nodes in list i represent
the vertices that are adjacent from vertex i. Each node has at least
two fields: VERTEX and LINK. The VERTEX fields contain
the indices of the vertices adjacent from vertex i.
Each list has a head node. The head nodes are sequential
providing easy random access to the adjacency list for any
particular vertex.
In the case of an undirected graph with n vertices and e edges,
this representation requires n head nodes and 2e chain nodes.
Adjacency Lists for given graph : Examples
In a digraph the number of list nodes = number of edges. The out-degree of any vertex may be
determined by counting the number of nodes on its adjacency list.
Packed Adjacency list
It is possible to sequentially pack the nodes on the adjacency
lists and eliminate the link fields.
In this case the adjacency lists of an undirected graph with n vertices
and e edges may be packed into an integer array node[n+2e+1].
Declare the array node[n+2e+1] and set node[n] = n+2e+1.
node[0] to node[n-1] hold the starting points of the adjacency lists
of the different vertices. That is, in this sequential mapping, node[i]
gives the starting point of the list for vertex i, for 0 ≤ i < n.
The vertices adjacent from vertex i are stored in locations
node[node[i]], ..., node[node[i+1]-1] of the node array, where 0 ≤ i < n.
Example for a packed adjacency list
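A packed array of this kind could be built as in the sketch below. It assumes an undirected graph given as a list of edge pairs; the function name pack_adjacency and its parameters are illustrative and do not come from the slides.

```c
#include <stdlib.h>

/* Build the packed array for an undirected graph with n vertices and
   e edges; edges[k][0] and edges[k][1] are the endpoints of edge k.
   Returns a malloc'ed array of n + 2e + 1 ints (caller frees), or NULL. */
int *pack_adjacency(int n, int e, int edges[][2])
{
    int *node = malloc((n + 2 * e + 1) * sizeof *node);
    int *deg = calloc(n, sizeof *deg);
    if (node == NULL || deg == NULL) {
        free(node);
        free(deg);
        return NULL;
    }
    for (int k = 0; k < e; k++) {        /* count the degree of each vertex */
        deg[edges[k][0]]++;
        deg[edges[k][1]]++;
    }
    node[0] = n + 1;                     /* lists start after the n+1 index entries */
    for (int i = 0; i < n; i++)
        node[i + 1] = node[i] + deg[i];  /* ends with node[n] == n + 2e + 1 */

    for (int i = 0; i < n; i++)
        deg[i] = 0;                      /* reuse deg[] as fill counters */
    for (int k = 0; k < e; k++) {
        int u = edges[k][0], v = edges[k][1];
        node[node[u] + deg[u]++] = v;    /* v goes on the list of u */
        node[node[v] + deg[v]++] = u;    /* u goes on the list of v */
    }
    free(deg);
    return node;
}
```

The neighbours of vertex i are then node[node[i]] up to node[node[i+1]-1], so the degree of i is node[i+1] - node[i].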
Adjacency Multilists
In the adjacency list representation of an undirected graph each
edge (u,v) is represented by two entries, one on the list for u and the
other on the list for v.
In some situations it is necessary to be able to determine the second entry
for a particular edge and mark that edge as already having been
examined. This can be accomplished easily if the adjacency lists are
actually maintained as multilists (i.e., lists in which nodes may be
shared among several lists).
For each edge there will be exactly one node, but this node will be
in two lists, i.e., the adjacency lists for each of the two nodes it is
incident to.
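The node structure is
m | vertex1 | vertex2 | link1 | link2
m is a one-bit mark field that may be used to indicate whether or not
the edge has been examined. vertex1 and vertex2 hold the two end
vertices u and v of the edge.
link1 is the address of the next node for an edge incident to u that has
not been covered earlier, and link2 is the address of the next node for an
edge incident to v that has not been covered earlier.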
Example
[Figure: adjacency multilists with edge nodes N0 to N5]
Weighted Edges
The edges of a graph may be assigned weights.
These weights may represent the distance from one vertex
to another or the cost of going from one vertex to another.
The adjacency matrix entry a[i][j] holds the weight of edge (i,j). If vertices
i and j are not adjacent, then a[i][j] = ∞, and the diagonal elements are 0. The
matrix is generally called the weight matrix.
In the adjacency list representation, the weight information is kept in
the chain nodes by including an additional field for the weight.
Elementary graph operations
Graph Traversal: Given an undirected graph G = (V,E) and a
vertex v in V(G), we are interested in visiting all vertices in G that are
reachable from v (i.e., all vertices connected to v). This may be
performed by two graph traversal methods:
Depth First Search (DFS)
Breadth First Search (BFS)
DFS is similar to preorder traversal of binary tree
BFS resembles level order traversal of binary tree
Depth First Search(DFS)
Depth first search of an undirected graph proceeds as follows.
The start vertex v is visited.
Next an unvisited vertex w adjacent to v is selected
Depth first search from w initiated.
When a vertex u is reached such that all its adjacent vertices have been
visited, we back up to the last vertex visited which may have an
unvisited vertex w adjacent to it and initiate a depth first search from w .
The search terminates when no unvisited vertex can be reached from
any of the visited one
DFS- Algorithm
procedure DFS(v)
//Given an undirected graph G = (V,E) with n vertices and an array VISITED[n] initially set to zero,
// this algorithm visits all vertices reachable from v. G and VISITED are global.
VISITED(v) = 1
for each vertex w adjacent to v do
if VISITED(w) = 0 then
call DFS(w)
end
end DFS
DFS- Assuming linked adjacency list representation is used for
a given graph. DFS(v)
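One way the DFS procedure might look in C with linked adjacency lists is sketched below. The names graph, visited, add_arc, add_edge and MAX_VERTICES are assumptions made for this illustration; the slides only specify the VISITED array and the recursive DFS(v).

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_VERTICES 50

struct node {                 /* one chain node of an adjacency list */
    int vertex;
    struct node *link;
};

struct node *graph[MAX_VERTICES];   /* head pointers, one list per vertex */
int visited[MAX_VERTICES];          /* 0 = not yet visited */

/* Put w at the front of the adjacency list of v. */
void add_arc(int v, int w)
{
    struct node *p = malloc(sizeof *p);
    if (p == NULL)
        exit(EXIT_FAILURE);
    p->vertex = w;
    p->link = graph[v];
    graph[v] = p;
}

/* An undirected edge (u,v) appears on both lists. */
void add_edge(int u, int v)
{
    add_arc(u, v);
    add_arc(v, u);
}

/* Visit every vertex reachable from v. */
void dfs(int v)
{
    visited[v] = 1;
    printf("%d ", v);
    for (struct node *w = graph[v]; w != NULL; w = w->link)
        if (!visited[w->vertex])
            dfs(w->vertex);
}
```

Since each chain node is examined once, the traversal takes time proportional to n + e with this representation. The visiting order depends on the order of the nodes in each list.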
Pass  A[0]  A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]
K=1   -∞    77
K=2   -∞    33    77
K=3   -∞    33    44    77
K=4   -∞    11    33    44    77
K=5   -∞    11    33    44    77    88
K=6   -∞    11    22    33    44    77    88
K=7   -∞    11    22    33    44    66    77    88
K=8   -∞    11    22    33    44    55    66    77    88
RADIX SORT
Radix sort is the method used when alphabetizing a large list of
names. (Here the radix is 26, the 26 letters of the alphabet.)
Specifically, the list of names is first sorted according to the first
letter of each name.
That is, the names are arranged in 26 classes, where the first class
consists of those names that begin with "A," the second class
consists of those names that begin with "B," and so on.
During the second pass, each class is alphabetized according to the
second letter of the name. And so on. If no name contains, for
example, more than 12 letters, the names are alphabetized with at
most 12 passes.
The radix sort is the method used by a card sorter.
The same concept may be extended to sort cards that are punched with
numbers.
Suppose we are interested in sorting 9 cards that are punched as
follows: 348, 143, 361, 423, 538, 128, 321, 543, 366
Here the radix is 10 (the digits 0, 1, 2, ..., 9).
Sorting may be done least significant digit first: the first pass sorts the
cards on the units digit, the second on the tens digit and the third on the
hundreds digit.
After the first pass, the input for the next pass is: 361, 321, 143, 423, 543, 366, 348, 538, 128
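The least-significant-digit-first procedure can be sketched in C as below. This is an illustration under assumptions: the function name radix_sort is hypothetical, the keys are non-negative integers, and the ten pockets of the card sorter are simulated by a stable counting step per digit rather than by physical buckets.

```c
#include <stdlib.h>
#include <string.h>

/* Sort a[0..n-1] (non-negative integers with at most `digits` decimal
   digits), least significant digit first. Returns 0 on success, -1 if
   memory cannot be allocated. */
int radix_sort(int a[], int n, int digits)
{
    int *out = malloc(n * sizeof *out);
    if (out == NULL)
        return -1;
    int div = 1;                          /* 1, 10, 100, ... selects the digit */
    for (int pass = 0; pass < digits; pass++, div *= 10) {
        int count[10] = {0};
        for (int i = 0; i < n; i++)       /* how many keys fall in each pocket */
            count[(a[i] / div) % 10]++;
        int start = 0;                    /* convert counts to pocket start positions */
        for (int d = 0; d < 10; d++) {
            int c = count[d];
            count[d] = start;
            start += c;
        }
        for (int i = 0; i < n; i++)       /* stable distribution into pockets */
            out[count[(a[i] / div) % 10]++] = a[i];
        memcpy(a, out, n * sizeof *a);    /* collect pockets 0..9 in order */
    }
    free(out);
    return 0;
}
```

Calling radix_sort with digits = 1 on the nine cards performs only the first pass (units digit); digits = 3 sorts them completely. Each pass must be stable, otherwise the order established by earlier passes is lost.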
Example (contd.): hash table with addresses 1 to 11 (- marks an empty location)
Address  1   2   3   4   5   6   7   8   9   10  11
Record   X   C   Z   A   E   Y   -   B   -   -   D
Collision Resolution Methods
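The efficiency of a hash function with a collision resolution method is
measured by the average number of key comparisons (probes) needed to
find the location of the record with a given key. It depends mainly on
the load factor λ = n/m, the ratio of the number n of records to the
number m of hash addresses. We denote
S(λ) = Average number of probes for a successful search
U(λ) = Average number of probes for an unsuccessful search
For the above solved problem (n = 8 records, m = 11 addresses)
S(λ) = (1+1+1+1+2+2+2+3)/8 = 13/8 ≈ 1.6
U(λ) = (7+6+5+4+3+2+1+2+1+1+8)/11 = 40/11 ≈ 3.6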
Linear Probing
Let key x be stored in the element of the array whose
address is the array index computed using the hash
function h(x) = x % 15. If that slot is occupied, try the following
slots one by one, wrapping around from 14 to 0, until an empty
slot is found.
The keys 35, 129, 36, 47, 25, 2501 are stored in the hash table t,
and then 65, 29, 16, 14, 99 and 127 are inserted (- marks an empty slot):
Index       0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
Initially   -    -    47   -    -    35   36   -    -    129  25   2501 -    -    -
After 65    -    -    47   -    -    35   36   65   -    129  25   2501 -    -    -
After 29    -    -    47   -    -    35   36   65   -    129  25   2501 -    -    29
After 16    -    16   47   -    -    35   36   65   -    129  25   2501 -    -    29
After 14    14   16   47   -    -    35   36   65   -    129  25   2501 -    -    29
After 99    14   16   47   -    -    35   36   65   -    129  25   2501 99   -    29
After 127   14   16   47   -    -    35   36   65   127  129  25   2501 99   -    29
Where would you store 65? h(65) = 5; slots 5 and 6 are occupied, so 65 goes to slot 7 (3 attempts).
Where would you store 29? h(29) = 14, which is empty (1 attempt).
Where would you store 16? h(16) = 1, which is empty (1 attempt).
Where would you store 14? h(14) = 14 is occupied; the search wraps around to slot 0 (2 attempts).
Where would you store 99? h(99) = 9; slots 9, 10 and 11 are occupied, so 99 goes to slot 12 (4 attempts).
Where would you store 127? h(127) = 7 is occupied, so 127 goes to slot 8 (2 attempts).
• Leads to the problem of clustering. Elements tend
to cluster in dense intervals in the array.
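Insertion with linear probing could be sketched in C as follows, using the same hash function h(x) = x % 15. The names TABLE_SIZE, EMPTY, init_table and insert_linear are assumptions for this illustration, and the keys are taken to be non-negative.

```c
#define TABLE_SIZE 15
#define EMPTY (-1)        /* marker for an unused slot (keys are assumed >= 0) */

/* Mark every slot of the table as empty. */
void init_table(int t[])
{
    for (int i = 0; i < TABLE_SIZE; i++)
        t[i] = EMPTY;
}

/* Insert key using linear probing.
   Returns the slot used, or -1 if the table is full. */
int insert_linear(int t[], int key)
{
    int home = key % TABLE_SIZE;              /* h(x) = x % 15 */
    for (int i = 0; i < TABLE_SIZE; i++) {
        int slot = (home + i) % TABLE_SIZE;   /* wraps from 14 back to 0 */
        if (t[slot] == EMPTY) {
            t[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

Inserting 35, 129, 36, 47, 25, 2501 and then 65, 29, 16, 14, 99, 127 with this function places the last six keys in slots 7, 14, 1, 0, 12 and 8.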
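Quadratic Probing
Let key x be stored in element f(x) = t of the array, with f(x) = x % 15.
If slot t is occupied, try slots (t+1)%15, (t+4)%15, (t+9)%15, ... in turn.
Starting from the same table, the keys 65, 29, 16, 14, 99 and 127 are inserted
(- marks an empty slot):
Index       0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
Initially   -    -    47   -    -    35   36   -    -    129  25   2501 -    -    -
After 65    -    -    47   -    -    35   36   -    -    129  25   2501 -    -    65
After 29    29   -    47   -    -    35   36   -    -    129  25   2501 -    -    65
After 16    29   16   47   -    -    35   36   -    -    129  25   2501 -    -    65
After 14    29   16   47   14   -    35   36   -    -    129  25   2501 -    -    65
After 99    29   16   47   14   -    35   36   -    -    129  25   2501 -    99   65
After 127   29   16   47   14   -    35   36   127  -    129  25   2501 -    99   65
Where would you store 65? f(65) = t = 5; slots t = 5, t+1 = 6 and t+4 = 9 are occupied, so 65 goes to t+9 = 14 (4 attempts).
Where would you store 29? f(29) = t = 14 is occupied, so 29 goes to (t+1)%15 = 0 (2 attempts).
Where would you store 16? f(16) = t = 1, which is empty (1 attempt).
Where would you store 14? f(14) = t = 14 and (t+1)%15 = 0 are occupied, so 14 goes to (t+4)%15 = 3 (3 attempts).
Where would you store 99? f(99) = t = 9 and t+1 = 10 are occupied, so 99 goes to t+4 = 13 (3 attempts).
Where would you store 127? f(127) = t = 7, which is empty (1 attempt).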
Quadratic Probing
• Tends to distribute keys better than linear
probing
• Alleviates problem of clustering
• Runs the risk of an infinite loop on insertion,
unless precautions are taken.
• E.g., consider inserting the key 16 into a table
of size 16, with positions 0, 1, 4 and 9 already
occupied.
• Therefore, table size should be prime.
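A quadratic probing insertion with one possible precaution against the infinite loop, a bound on the number of attempts, is sketched below. The names EMPTY and insert_quadratic are assumptions made for this example, keys are assumed non-negative, and the table size n is passed as a parameter so that both the size-15 and size-16 cases can be tried.

```c
#define EMPTY (-1)        /* marker for an unused slot (keys are assumed >= 0) */

/* Insert key into table t of size n using quadratic probing:
   slots (t0 + i*i) % n for i = 0, 1, 2, ...
   The loop gives up after n attempts instead of probing forever.
   Returns the slot used, or -1 if no free slot was reached. */
int insert_quadratic(int t[], int n, int key)
{
    int home = key % n;
    for (int i = 0; i < n; i++) {
        int slot = (home + i * i) % n;
        if (t[slot] == EMPTY) {
            t[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

With a table of size 16 in which positions 0, 1, 4 and 9 are occupied, inserting 16 returns -1 even though twelve slots are free, because i*i % 16 only ever takes the values 0, 1, 4 and 9.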
Double Hashing
Let key x be stored in element f(x) = t of the array, with f(x) = x % 15 and N = 15.
A second hash function gives the step size d: let f2(x) = 11 − (x % 11) and d = f2(x).
If slot t is occupied and the hash table is not full, attempt to store the key
in array elements (t+d)%N, (t+2d)%N, ...
Starting from the same table, the keys 65, 29, 16, 14, 99 and 127 are inserted
(- marks an empty slot):
Index       0    1    2    3    4    5    6    7    8    9    10   11   12   13   14
Initially   -    -    47   -    -    35   36   -    -    129  25   2501 -    -    -
After 65    -    -    47   -    -    35   36   65   -    129  25   2501 -    -    -
After 29    -    -    47   -    -    35   36   65   -    129  25   2501 -    -    29
After 16    -    16   47   -    -    35   36   65   -    129  25   2501 -    -    29
After 14    14   16   47   -    -    35   36   65   -    129  25   2501 -    -    29
After 99    14   16   47   -    -    35   36   65   -    129  25   2501 99   -    29
Where would you store 65? t = 5 is occupied; d = f2(65) = 1; slot 6 is occupied, so 65 goes to (t+2d)%N = 7.
Where would you store 29? t = 14, which is empty.
Where would you store 16? t = 1, which is empty.
Where would you store 14? t = 14 is occupied; d = f2(14) = 8; (t+8)%N = 7 is occupied, so 14 goes to (t+16)%N = 0.
Where would you store 99? t = 9 is occupied; d = f2(99) = 11; (t+11)%N = 5 and (t+22)%N = 1 are occupied, so 99 goes to (t+33)%N = 12.
Where would you store 127? t = 7 is occupied; d = f2(127) = 5; (t+5)%N = 12 and (t+10)%N = 2 are occupied, and (t+15)%N = 7 is the starting slot again. The probe sequence 7, 12, 2 repeats because d = 5 and N = 15 share the factor 5, so 127 cannot be stored although the table is not full.
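The double hashing scheme could be sketched in C as below, with f(x) = x % 15 as the first hash function and f2(x) = 11 − (x % 11) as the step size. The names N, EMPTY and insert_double are assumptions for this illustration, keys are assumed non-negative, and the loop is bounded to N attempts as a guard against cycling.

```c
#define N 15
#define EMPTY (-1)        /* marker for an unused slot (keys are assumed >= 0) */

/* Insert key using double hashing: slots (t + i*d) % N for i = 0, 1, 2, ...
   with t = key % N and d = 11 - (key % 11).
   Returns the slot used, or -1 if no free slot was reached in N attempts. */
int insert_double(int table[], int key)
{
    int t = key % N;
    int d = 11 - (key % 11);      /* always between 1 and 11, never 0 */
    for (int i = 0; i < N; i++) {
        int slot = (t + i * d) % N;
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

Inserting 35, 129, 36, 47, 25, 2501, 65, 29, 16, 14 and 99 in that order fills slots 5, 9, 6, 2, 10, 11, 7, 14, 1, 0 and 12; a following insert of 127 returns -1 because its step d = 5 only ever reaches slots 7, 12 and 2. A prime table size avoids this, since every step then visits all slots.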
Separate Chaining
The keys are 35,129,36,47,25,2501,65
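KEY      1     4     5     7     8     10
h(KEY)   0001  0100  0101  0111  1000  1010
Example
Step 1: With a directory indexed by the last bit of h(KEY), keys 4, 1 and 5
are placed as
Bucket [0]: 4
Bucket [1]: 1, 5
Step 2: When we insert 7 there is bucket overflow in
bucket [1] (1, 5, 7). Then split the bucket, double the directory and
place the items in the proper buckets using the last two bits of h(KEY):
Bucket [01]: 1, 5
Bucket [11]: 7
After inserting 8 and 10 the buckets are
Bucket [00]: 4, 8
Bucket [01]: 1, 5
Bucket [10]: 10
Bucket [11]: 7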
Size of directory depends on number of bits h(k) is used to index the directory
When indexing is done using h(k,2) the directory size is 4.
For h(k,5) directory size is 32.
The number of bits used to index directory is called directory depth.
Example based on Extendible Hashing:
Hashing the following
elements: 16,4,6,22,24,10,31,7,9,20,26.
Bucket Size: 3 (Assume)
Hash Function: Suppose the global depth is X. Then the Hash
Function returns X LSBs.
Solution: First, calculate the binary forms of each of the given
numbers.
16- 10000
4- 00100
6- 00110
22- 10110
24- 11000
10- 01010
31- 11111
7- 00111
9- 01001
20- 10100
26- 11010
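The hash function that returns the X least significant bits of a key can be written with a bit mask. The sketch below is illustrative; the function name dir_index is hypothetical.

```c
/* Return the `depth` least significant bits of key, i.e. the id of the
   directory entry the key maps to when the global depth is `depth`.
   Assumes depth is smaller than the number of bits in an unsigned int. */
unsigned dir_index(unsigned key, unsigned depth)
{
    return key & ((1u << depth) - 1u);
}
```

For example, dir_index(16, 1) is 0 because 10000 ends in 0, and dir_index(22, 2) is 2 because 10110 ends in 10.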
Initially, the global depth and the local depth are both 1, where
Global Depth = number of bits used in the directory id.
Local Depth is associated with the buckets and not with the directories.
Inserting 16:
The binary format of 16 is 10000 and global-depth is 1. The hash function
returns 1 LSB of 10000 which is 0. Hence, 16 is mapped to the directory with
id=0.
Inserting 4 and 6:
Both 4(100) and 6(110)have 0 in their LSB. Hence, they are hashed as follows:
Initially
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket
pointed by directory 0 is already full. Hence, Over Flow occurs.
Since Local Depth = Global Depth, the bucket splits and directory
expansion takes place. Also, rehashing of numbers present in the
overflowing bucket takes place after the split.
The global depth is incremented by 1, now, the global depth is 2. Hence,
16,4,6,22 are now rehashed w.r.t 2 LSBs.[ 16(10000),4(100),6(110),
22(10110) ]
The bucket which was underflow has remained untouched. But, since the number of
directories has doubled, we now have 2 directories 01 and 11 pointing to the same
bucket. This is because the local-depth of the bucket has remained 1.
The local depth of bucket < Global depth (2<3), directories are not
doubled but, only the bucket is split and elements are rehashed.
Files and organization
Introduction
Every file contains data which can be organized in a hierarchy
to present a systematic organization.
The data hierarchy includes data items such as fields, records,
files, and database.
Data field
A data field is an elementary unit that stores a single fact. A
data field is usually characterized by its type and size.
Example: student's name is a data field that stores the name
of students.
Record
A record is a collection of related data fields which is seen as a
single unit from the application point of view.
Example:
The student's record may contain data fields such as name,
address, phone number, roll number, marks obtained, and so
on.
File
A file is a collection of related records.
Example: A file of all the employees working in an organization
Directory
A directory stores information of related files. A directory
organizes information so that users can find it easily
File Attributes
File has a list of attributes associated with it that gives the
information about the file to the operating system and the
application software and how it is intended to be used.
File name
It is a string of characters that stores the name of a file.
File naming conventions vary from one operating system to
the other
File position
It is a pointer that points to the position at which the next
read/write operation will be performed
File structure
It indicates whether the file is a text file or a binary file.
In the text file, the numbers are stored as a string of
characters.
A binary file stores numbers in the same way as they are
represented in the main memory
File access methods
It indicates whether the records in a file can be accessed
sequentially or randomly
In sequential access mode, records are read one by one
In random access, records can be accessed in any order
Attributes flag
A file can have six additional attributes attached to it.
These attributes are usually stored in a single byte, with each
bit representing a specific attribute.
If a particular bit is set to '1' then this means that the
corresponding attribute is turned on.
Read-only
A file marked as read-only cannot be deleted or modified.
Hidden
A file marked as hidden is not displayed in the directory
listing.
System
Volume label
Directory
Archive
Text files vs. binary files
Text files:
- A text file, also known as a flat file or an ASCII file, is structured as a sequence of lines of alphabets, numerals and special characters.
- It is possible for humans to read text files which contain only ASCII text.
- Text files can be manipulated by any text editor, but they do not provide efficient storage.
Binary files:
- A binary file contains any type of data encoded in binary form for computer storage and processing purposes.
- A binary file is not readable by humans.
- Binary files provide efficient storage of data, but they can be read only through an appropriate program.
Basic File Operations
Creating a File
A file is created by specifying its name and mode. The file may be
opened for writing records that are read from an input device. Once
all the records have been written into the file, the file is closed. The
file is now available for future read/write operations by any program
that has been designed to use it in some way or the other.
Updating a File
Updating a file means changing the contents of the file to reflect a
current picture of reality. A file can be updated in the following ways:
Inserting a new record in the file. For example, if a new student joins
the course, we need to add his record to the STUDENT file.
Deleting an existing record. For example, if a student quits a course in
the middle of the session, his record has to be deleted from the
STUDENT file.
Modifying an existing record. For example, if the name of a student
was spelt incorrectly, then correcting the name will be a modification
of the existing record.
Retrieving from a File
It means extracting useful data from a given file. Information can
be retrieved from a file either for an inquiry or for report
generation. An inquiry for some data retrieves low volume of data,
while report generation may retrieve a large volume of data from
the file.
Maintaining a File
It involves restructuring or re-organizing the file to improve the
performance of the programs that access this file.
Restructuring a file keeps the file organization unchanged and
changes only the structural aspects of the file.
Example: changing the field width or adding/deleting fields.
File reorganization may involve changing the entire organization of
the file
File organization
Organization of records means the logical arrangement of records
in the file and not the physical layout of the file as stored on a
storage media
Sequential Organization
A sequentially organized file stores the records in the order in
which they were entered.
Sequential files can be read only sequentially, starting with the first
record in the file.
Sequential file organization is the most
basic way to organize a large collection
of records in a file
Advantages
Simple and easy to Handle
No extra overheads involved
Sequential files can be stored on magnetic disks as well as magnetic tapes
Well suited for batch– oriented applications
Disadvantages
Records can be read only sequentially. If ith record has to be read, then all
the i–1 records must be read
Does not support update operation. A new file has to be created and the
original file has to be replaced with the new file that contains the desired
changes
Cannot be used for interactive applications
Relative File Organization
If the records are of fixed length and we know the base address of
the file and the length of the record, then any record i can be
accessed using the following formula:
Address of ith record = base_address + (i–1) * record_length
Consider the base address of a file is 1000 and each record
occupies 20 bytes, then the address of the 5th record can be given
as:
1000 + (5–1) * 20
= 1000 + 80
= 1080
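The address formula translates directly into C, and with fseek it can be used to read the ith record of a relative file. This is a sketch under assumptions: the names record_address and read_record are hypothetical, records are numbered from 1, and the file itself starts at offset 0, so the base address passed to fseek is 0.

```c
#include <stdio.h>

/* Address of the ith record = base_address + (i - 1) * record_length. */
long record_address(long base_address, long i, long record_length)
{
    return base_address + (i - 1) * record_length;
}

/* Read the ith fixed-length record of an open binary file into buf.
   Returns 1 on success, 0 if the record could not be read. */
int read_record(FILE *fp, long i, void *buf, long record_length)
{
    if (fseek(fp, record_address(0, i, record_length), SEEK_SET) != 0)
        return 0;
    return fread(buf, (size_t)record_length, 1, fp) == 1;
}
```

record_address(1000, 5, 20) gives 1080, matching the worked example. No earlier record has to be read to reach record i, which is what makes random access possible.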
Features
Provides an effective way to access individual records
The record number represents the location of the record relative to
the beginning of the file
Records in a relative file are of fixed length
Relative files can be used for both random as well as sequential access
Every location in the table either stores a record or is marked as
FREE
Advantages
Ease of processing
If the relative record number of the record that has to be accessed
is known, then the record can be accessed instantaneously
Random access of records makes access to relative files fast
Allows deletions and updations in the same file
Provides random as well as sequential access of records with low
overhead
New records can be easily added in the free locations based on the
relative record number of the record to be inserted
Well suited for interactive applications
Disadvantages
Use of relative files is restricted to disk devices
Records can be of fixed length only
For random access of records, the relative record number must be
known in advance
Indexed Sequential File Organization
Features
Index table stores the address of the records in the file
The ith entry in the index table points to the ith record of the file
Provides fast data retrieval
Records are of fixed length
While the index table is read sequentially to find the
address of the desired record, a direct access is made
to the address of the specified record in order to
access it randomly
Indexed sequential files perform well in situations
where sequential access as well as random access is made to
the data
Advantages
The key improvement is that the indices are small and can be searched
quickly, allowing the database to access only the records it needs
Supports applications that require both batch and interactive processing
Records can be accessed sequentially as well as randomly
Updates the records in the same file
Disadvantages
Indexed sequential files can be stored only on disks
Needs extra space and overhead to store indices
Handling these files is more complicated than handling sequential files
Supports only fixed length records
INDEXING
In the indexing technique the index table stores the address
of the records in the file.
There are two kinds of indices:
Ordered indices that are sorted based on one or more key values
Hash indices that are based on the values generated by applying a hash function
1. Ordered Indices
Indices are used to provide fast random access to records. An index of a
file may be a primary index or a secondary index.
Primary Index
In a sequentially ordered file, the index whose search key specifies the
sequential order of the file is defined as the primary index.
Example: suppose records of students are stored in a STUDENT file in a
sequential order starting from roll number 1 to roll number 60. Now, if
we want to search a record for, say, roll number 10, then the student's roll
number is the primary index.
INDEXING
Secondary Index
An index whose search key specifies an order different from the sequential order
of the file is called as the secondary index.
Example: If the record of a student is searched by his name, then the name is a
secondary index. Secondary indices are used to improve the performance of
queries on non-primary keys.
Dense index
In a dense index, the index table stores the address of every record in the file.
By looking at the dense index, it can be concluded directly whether the record
exists in the file or not.
Sparse index
In a sparse index, the index table stores the address of only some of the records in
the file.
Sparse indices are small, so they are easy to fit in the main memory.
In a sparse index, to locate a record, first find an entry in the index table with the
largest search key value that is either less than or equal to the search key value of
the desired record. Then, start at that record pointed to by that entry in the index
table and then proceed searching the record using the sequential pointers in the
file, until the desired record is obtained.
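The sparse index lookup described above could be sketched as follows. The file is modelled as an array of record keys in sorted order, and the names index_entry, sparse_lookup and find_record are assumptions made for this illustration.

```c
struct index_entry {
    int key;   /* search key value of the indexed record */
    int pos;   /* position of that record in the file */
};

/* Return the entry with the largest key <= target, or -1 if none
   (binary search; idx[] is sorted by key). */
int sparse_lookup(const struct index_entry idx[], int n, int target)
{
    int lo = 0, hi = n - 1, ans = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (idx[mid].key <= target) {
            ans = mid;
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return ans;
}

/* Find the position of the record with the given key, or -1.
   Starts at the record the index entry points to and scans forward. */
int find_record(const int file[], int nrec,
                const struct index_entry idx[], int nidx, int key)
{
    int e = sparse_lookup(idx, nidx, key);
    if (e < 0)
        return -1;
    for (int p = idx[e].pos; p < nrec && file[p] <= key; p++)
        if (file[p] == key)
            return p;
    return -1;
}
```

With a dense index the forward scan would be unnecessary, at the cost of one index entry per record.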
Hashed Indices
Hashing is used to compute the address of a record by using a hash
function on the search key value. If two hashed values map to the same
address, then a collision occurs and schemes to resolve these collisions are
applied to generate a new address.
Choosing a good hash function is critical to the success of this
technique. By a good hash function, we mean two things.
1. First, irrespective of the number of search keys, gives an average-case
lookup that is a small constant.
2. Second, the function distributes records uniformly and randomly
among the buckets, where a bucket is defined as a unit of one or more
records
The worst hash function is one that maps all the keys to the
same bucket.
The drawback of using hashed indices includes:
Though the number of buckets is fixed, the number of files may grow
with time.
If the number of buckets is too large, storage space is wasted.
If the number of buckets is too small, there may be too many collisions.
Hashed Indices
The following operations are performed in a hashed file
organization.
1. Insertion
To insert a record that has ki as its search value, use the hash function h(ki)
to compute the address of the bucket for that record. If the bucket is free,
store the record else use chaining to store the record.
2. Search
To search a record having the key value ki, use h(ki) to compute the
address of the bucket where the record is stored. The bucket may contain
one or several records, so check for every record in the bucket to retrieve
the desired record with the given key value.
3. Deletion
To delete a record with key value ki, use h(ki) to compute the address of
the bucket where the record is stored. The bucket may contain one or
several records so check for every record in the bucket, and then delete
the record.
File Handling in C
Console oriented I/O functions use the keyboard as the input device
and the monitor as the output device.
Examples are I/O functions like printf(), scanf(), getchar(), putchar(),
gets(), puts().
The problems with console I/O are:
1. Entire data is lost when either the program is terminated
or the computer is turned off.
2. When the volume of data to be entered is large, it takes a
lot of time to enter the data.
3. If the user makes a mistake while entering data, the whole data has
to be re-entered.
The solution is a file: A file is a place on the disk (not memory)
where a group of related data is stored. Such files are also called data files.
There are two ways to perform file operations in C.
1. Low level I/O that uses Unix system calls.
2. High level I/O operations using functions in C's standard
I/O library.
'C' provides following file management functions,
Creation of a file
Opening a file
Reading a file
Writing to a file
Closing a file
Some important file management functions available in 'C':
Function     Purpose
fopen()      Creates a file or opens an existing file
fclose()     Closes a file
fprintf()    Writes formatted data to a file
fscanf()     Reads formatted data from a file
getc()       Reads a single character from a file
putc()       Writes a single character to a file
getw()       Reads an integer from a file (not part of standard C)
putw()       Writes an integer to a file (not part of standard C)
fseek()      Sets the position of a file pointer to a specified location
ftell()      Returns the current position of a file pointer
rewind()     Sets the file pointer at the beginning of a file
Defining and Opening a file
The general format for declaring and opening a file is:
FILE *fp;
fp = fopen("filename", "mode");
Here, the first statement declares the variable fp as a "pointer to
the data type FILE".
The second statement opens the file named filename for the
purpose given by mode, and the file pointer fp refers to the buffer area
allocated for the file. fopen returns NULL if the file cannot be opened.
• Note: Several files can be opened and used at a time; the implementation
guarantees that at least FOPEN_MAX files can be open simultaneously.
File Opening Modes
File Mode   Description
r     Open an existing file for reading. The contents of the file are
      not modified. fopen returns NULL if the file does not exist.
w     Open a file for writing. A new file is created if the file
      doesn't exist. If the file is already present on the system,
      all the data inside it is truncated (discarded) and the file is
      opened for writing.
a     Open a file in append mode. The existing content of the file
      doesn't change and new data is added at the end. The file is
      created if it doesn't exist.
r+    Open an existing file for reading and writing, from the beginning
w+    Open for reading and writing, overwriting (truncating) the file
a+    Open for reading and writing, appending to the file
Closing a file
One should always close a file when the operations on the file are
over. Closing writes any buffered data to the file and breaks the link
between the file pointer and the file. This prevents accidental damage
to the file.
'C' provides the fclose function to perform the file closing operation.
The syntax of fclose is as follows,
fclose (file_pointer);
After closing the file, the same file pointer can also be used with
other files.
In 'C' programming, files are automatically closed when the program
terminates normally. Closing a file explicitly by calling fclose is
good programming practice.
Writing to a File
In C, when you write to a file, newline characters '\n' must be
explicitly added.
The stdio library offers the necessary functions to write to a file:
fputc(char, file_pointer): It writes a character to the file
pointed to by file_pointer.
fputs(str, file_pointer): It writes a string to the file pointed to
by file_pointer.
fprintf(): It is a formatted output function which is used to write
integer, float, char or string values to a file.
Syntax: fprintf(fp, "control_string", list_of_variables);
putw(num, file_pointer): It writes the integer num, in binary form,
to the file pointed to by file_pointer. Note that putw() is a
non-standard function offered by some compilers (there is no
fputw() in the C library); portable programs use fprintf() or
fwrite() instead.
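A minimal sketch that uses the three standard writing functions together is given here. The function name write_report and its parameters are assumptions made for the example only.

```c
#include <stdio.h>

/* Hypothetical example: write a three-line report to a file.
   Returns 0 on success and -1 on failure. */
int write_report(const char *filename, const char *name, int marks)
{
    FILE *fp = fopen(filename, "w");
    if (fp == NULL)
        return -1;
    fputs("Report\n", fp);                       /* string; newline added by hand */
    fprintf(fp, "%s scored %d\n", name, marks);  /* formatted output */
    fputc('#', fp);                              /* single characters */
    fputc('\n', fp);
    return (fclose(fp) == EOF) ? -1 : 0;
}
```

Calling write_report("report.txt", "Asha", 91) would produce a file with the lines "Report", "Asha scored 91" and "#".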
Reading data from a File
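/* Program to display the content of a file */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
    FILE *fp;
    char filename[20];
    int c;   /* int, not char, so that EOF can be told apart from data */
    printf("Enter filename:\t");
    /* fgets() is used because gets() cannot limit the input length */
    if(fgets(filename, sizeof filename, stdin)==NULL)
        exit(1);
    filename[strcspn(filename, "\n")]='\0';   /* remove the trailing newline */
    fp=fopen(filename, "r");
    if(fp==NULL)
    {
        printf("\n Cannot open file.");
        exit(1);
    }
    printf("\n The content of file is:\n");
    while((c=fgetc(fp))!=EOF)   /* read one character at a time until end of file */
        putchar(c);
    fclose(fp);
    return 0;
}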
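/* Program to copy content of sfile to dfile */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Read one line into name without the newline (gets() is unsafe) */
static void read_name(char *name, int size)
{
    if(fgets(name, size, stdin)==NULL)
        exit(1);
    name[strcspn(name, "\n")]='\0';
}
int main(void)
{
    FILE *sfp,*dfp;
    char sfilename[20],dfilename[20];
    int c;   /* int, so that EOF can be detected reliably */
    printf("Enter source filename:\t");
    read_name(sfilename, sizeof sfilename);
    printf("\n Enter destination filename:\t");
    read_name(dfilename, sizeof dfilename);
    sfp=fopen(sfilename,"r");
    if(sfp==NULL) {
        printf("\nSource file can't be opened.");
        exit(1);
    }
    dfp=fopen(dfilename, "w");
    if(dfp==NULL) {
        printf("\n Destination file cannot be created or opened.");
        fclose(sfp);
        exit(1);
    }
    while((c=fgetc(sfp))!=EOF)   /* copy character by character */
        fputc(c, dfp);
    printf("\n Copied........");
    fclose(dfp);
    fclose(sfp);
    return 0;
}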
Functions used in random access
1. ftell(): This function takes a file pointer as argument and returns
a number of type long that indicates the current position of
the file pointer within the file.
This function is useful for saving the current position of a file,
which can be used later in the program.
Syntax
n = ftell(fp);
Here, n gives the offset (in bytes) of the current position from the
start of the file. This means that n bytes have already been read (or
written). If an error occurs, ftell() returns -1L.
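The statement "n bytes have already been read" can be observed with a small sketch. The function name position_after_reading is hypothetical; the file is opened in binary mode so that the offset is a plain byte count.

```c
#include <stdio.h>

/* Hypothetical helper: read up to n characters from the start of a file
   and return the position reported by ftell() (-1L if the open fails). */
long position_after_reading(const char *filename, int n)
{
    FILE *fp = fopen(filename, "rb");
    long pos;
    int i;
    if (fp == NULL)
        return -1L;
    for (i = 0; i < n && fgetc(fp) != EOF; i++)
        ;                       /* each successful fgetc() advances the position by 1 */
    pos = ftell(fp);
    fclose(fp);
    return pos;
}
```

For a file that holds the 8 bytes "abcdefgh", reading 3 characters gives position 3, and asking for more characters than the file holds stops at position 8.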
Random access to files
2. rewind(): This function takes a file pointer as argument and resets
the current position of the file pointer to the start of the file.
Syntax: rewind(fp);
What do these statements do?
rewind(fp);
n=ftell(fp);
• Here, n would be assigned 0, because the file position has been set to the
start of the file by rewind().
• Note: The first byte in the file is numbered 0, the second 1, and so on.
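A short sketch of rewind() in use: a file is written, rewound, and then read from byte 0 again. The function name write_then_read_first is hypothetical. Mode "w+" is used so that the same file pointer can both write and read.

```c
#include <stdio.h>

/* Hypothetical helper: write text to a new file, go back to the start
   with rewind(), and return the first character (EOF on failure). */
int write_then_read_first(const char *filename, const char *text)
{
    FILE *fp = fopen(filename, "w+");
    int first = EOF;
    if (fp == NULL)
        return EOF;
    fputs(text, fp);
    rewind(fp);                 /* position is now byte 0 */
    if (ftell(fp) == 0L)        /* confirms what rewind() did */
        first = fgetc(fp);
    fclose(fp);
    return first;
}
```

With the text "hello", the function would return 'h', the byte numbered 0.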
3. fseek(): This function is used to move the file pointer to a desired
position within a file.
Syntax: fseek(fp, offset, position);
where fp is a file pointer, offset is a number or variable of type
long, and position is an integer.
• The offset specifies the number of positions (bytes) to be moved
from the location specified by position.
• The position can have one of the following 3 values. On common
implementations these correspond to the names SEEK_SET, SEEK_CUR and
SEEK_END defined in <stdio.h>:
Value  Name      Meaning
0      SEEK_SET  Beginning of file
1      SEEK_CUR  Current position
2      SEEK_END  End of file
The offset may be positive, meaning move forwards, or negative, meaning move
backwards.
• Examples:
Statement           Meaning
fseek(fp, 0L, 0);   Move file pointer to beginning of file. (Same as rewind.)
fseek(fp, 0L, 1);   Stay at the current position. (File pointer is not moved.)
fseek(fp, 0L, 2);   Move file pointer past the last character of the file.
                    (Go to the end of file.)
fseek(fp, m, 0);    Move file pointer to the (m+1)th byte in the file.
fseek(fp, m, 1);    Move file pointer forwards by m bytes.
fseek(fp, -m, 1);   Move file pointer backwards by m bytes from the current
                    position.
fseek(fp, -m, 2);   Move file pointer backwards by m bytes from the end.
                    (Positions the file pointer at the mth character from
                    the end.)
• When the operation is successful, fseek() returns 0 (zero).
• If the move fails, for example when we attempt to move the file pointer
before the beginning of the file, fseek() returns a non-zero value
(-1 on most systems). Moving past the end of the file is not always
reported as an error, so the next read should also be checked for EOF.
• It is good practice to check whether an error has occurred or not,
before proceeding further.
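The advice to check the return value of fseek() before proceeding can be sketched as follows. The function name byte_at is hypothetical; the symbolic names SEEK_SET, SEEK_CUR and SEEK_END from <stdio.h> are assumed to stand for the values 0, 1 and 2 listed above.

```c
#include <stdio.h>

/* Hypothetical helper: move to `offset` bytes from `origin` and return
   the byte found there. Returns EOF if the seek fails or if there is
   no byte at that position. */
int byte_at(FILE *fp, long offset, int origin)
{
    if (fseek(fp, offset, origin) != 0)   /* non-zero means the move failed */
        return EOF;
    return fgetc(fp);                     /* EOF here means nothing to read */
}
```

For a file that holds "abcdefgh" and is opened with mode "rb", byte_at(fp, 3L, SEEK_SET) would give 'd' (the 4th byte) and byte_at(fp, -1L, SEEK_END) would give 'h' (the last byte).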
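/* A program that uses the functions ftell() and fseek() */
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    FILE *fp;
    int c;        /* int, so that EOF can be detected reliably */
    long n, pos;
    fp=fopen("RANDOM","w");
    if(fp==NULL)
    {
        printf("\nCannot create file.");
        exit(1);
    }
    while((c=getchar())!=EOF)   /* copy keyboard input to the file */
        fputc(c,fp);
    printf("\nNo. of characters entered=%ld",ftell(fp));
    fclose(fp);
    fp=fopen("RANDOM","r");
    if(fp==NULL)
    {
        printf("\nCannot open file.");
        exit(1);
    }
    n=0L;
    /* Print every fifth character together with its position */
    while(fseek(fp,n,0)==0)   /* Position to (n+1)th character */
    {
        pos=ftell(fp);        /* save the position before fgetc() advances it */
        if((c=fgetc(fp))==EOF)
            break;            /* moved past the last character */
        printf("Position of %c is %ld\n",c,pos);
        n=n+5L;
    }
    putchar('\n');
    /* Print the file in reverse order */
    if(fseek(fp,-1L,2)==0)   /* Position to the last character */
    {
        do
        {
            putchar(fgetc(fp));
        }while(!fseek(fp,-2L,1));   /* fails on stepping before byte 0 */
    }
    fclose(fp);
    return 0;
}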