
Module-3

Storage System in DBMS


A database system provides an abstract view of the stored data. Physically, however, the data is stored as bits and bytes on
different storage devices.

Types of Data Storage


Different types of storage options are available for storing data. These storage types differ from one another
in speed and accessibility. The following types of storage devices are used for storing data:

● Primary Storage
● Secondary Storage
● Tertiary Storage

Primary Storage
It is the storage area that offers the quickest access to the stored data. Primary storage is also known as volatile
storage, because this type of memory does not store data permanently: as soon as the system suffers a power cut or a
crash, the data is lost. Main memory and cache are the types of primary storage.

● Main Memory: It is the memory on which the system operates on the data available to it. The main memory
handles each instruction of the computer. It can store gigabytes of data, but it is generally too small (and too
expensive) to hold the entire database. The main memory loses its entire contents if the system shuts down
because of a power failure or other reason.
● Cache: It is one of the costliest storage media, but also the fastest. A cache is a tiny storage area that is
usually maintained by the computer hardware. Designers of query processors and data structures take cache
effects into account when designing their algorithms.
Secondary Storage
Secondary storage is also called online storage. It is the storage area that allows the user to save and store data
permanently. This type of memory does not lose data on a power failure or system crash, which is why it is also
called non-volatile storage. The following secondary storage media are available in almost every type of computer
system:

● Flash Memory: Flash memory stores data in devices such as USB (Universal Serial Bus) keys, which are plugged
into the USB slots of a computer system. These keys help transfer data to a computer system, although they
vary in capacity. Unlike main memory, flash memory retains its data across a power cut or a crash. This type
of storage is commonly used in server systems for caching frequently used data, which improves performance,
and it can hold larger amounts of data than main memory.
● Magnetic Disk Storage: This type of storage medium is also known as online storage. A magnetic disk is used
for storing data for a long time and is capable of storing an entire database. It is the responsibility of the
computer system to bring data from the disk into main memory for access; if the system modifies the data,
the modified data must be written back to the disk. The great advantage of a magnetic disk is that its data
survives a system crash or power failure, but a disk failure can easily destroy the stored data.

Tertiary Storage
It is storage that is external to the computer system. It is the slowest, but it can store very large amounts of data.
It is also known as offline storage and is generally used for data backup. The following tertiary storage devices are
available:

● Optical Storage: Optical storage can store megabytes or gigabytes of data. A Compact Disk (CD) can store
700 megabytes of data with a playtime of around 80 minutes. A Digital Video Disk (DVD) can store 4.7 or 8.5
gigabytes of data on each side of the disk.
● Tape Storage: Tape is a cheaper storage medium than disk. Tapes are generally used for archiving or backing
up data. Access is slow because data is read sequentially from the start; thus, tape storage is also known as
sequential-access storage. Disk storage is known as direct-access storage because we can directly access data
at any location on the disk.

Storage Hierarchy
Besides the above, various other storage devices reside in the computer system. These storage media can be organized
on the basis of data access speed, cost per unit of data, and reliability. Thus, we can create a hierarchy of storage
media on the basis of cost and speed.

In such a hierarchy, the higher levels are expensive but fast. Moving down, the cost per bit decreases and the access
time increases. The storage media from main memory upwards are volatile, while everything below main memory is
non-volatile.

File Organization
● A file is a collection of records. Using the primary key, we can access the records. The type and frequency
of access are determined by the file organization used for a given set of records.
● File organization is a logical relationship among the various records. It defines how file records are mapped
onto disk blocks.
● File organization describes the way in which the records are stored in terms of blocks, and how the blocks
are placed on the storage medium.
● The first approach to mapping the database to files is to use several files and store only fixed-length records
in any given file. An alternative approach is to structure the files so that they can accommodate records of
varying lengths.
● Files of fixed-length records are easier to implement than files of variable-length records.

Objective of file organization

● Optimal selection of records, i.e., records can be selected as fast as possible.
● Insert, delete and update transactions on the records should be quick and easy.
● Insert, update or delete operations should not introduce duplicate records.
● Records should be stored efficiently so that the cost of storage is minimal.

Types of file organization:


File organization includes various methods, each with pros and cons in terms of access and selection. The programmer
chooses the file organization method best suited to the requirements. The types of file organization are as follows:

A. Sequential File Organization


This is the simplest method of file organization. In this method, files are stored sequentially. It can be implemented
in two ways:

1. Pile File Method:

● It is quite a simple method. We store the records in a sequence, i.e., one after another, in the order in
which they are inserted into the table.
● To update or delete a record, the record is searched in the memory blocks. When it is found, it is marked
for deletion and the new record is inserted.

Insertion of the new record:


Suppose we have records R1, R3 and so on up to R9 and R8 stored in a sequence (each record is simply a row in the
table). If we want to insert a new record R2 into the sequence, it is placed at the end of the file.
2. Sorted File Method:
● In this method, the new record is always inserted at the end of the file, and then the records are sorted in
ascending or descending order. Sorting is based on the primary key or some other key.
● When a record is modified, the record is updated, the file is sorted again, and the updated record ends up
in the right place.

Insertion of the new record:


Suppose there is a pre-existing sorted sequence of records R1, R3 and so on up to R6 and R7. If a new record R2 has
to be inserted into the sequence, it is first appended at the end of the file and then the sequence is sorted.
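
As a concrete illustration of the two strategies above, here is a minimal Python sketch; the record contents and the sort key "id" are assumptions made only for illustration:

```python
# Minimal sketch of the two sequential insertion strategies described above.
# Records are dictionaries; "id" is an assumed sort key.

def pile_insert(records, new_record):
    """Pile file method: simply append the new record at the end of the file."""
    records.append(new_record)
    return records

def sorted_insert(records, new_record, key="id"):
    """Sorted file method: append at the end, then re-sort on the key."""
    records.append(new_record)
    records.sort(key=lambda r: r[key])
    return records

file_records = [{"id": 1}, {"id": 3}, {"id": 6}, {"id": 7}]
print(pile_insert(list(file_records), {"id": 2}))    # R2 stays at the end
print(sorted_insert(list(file_records), {"id": 2}))  # R2 moves to its sorted place
```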

Pros of sequential file organization

● It is a fast and efficient method for handling huge amounts of data.
● Files can be stored on cheaper storage media such as magnetic tape.
● The design is simple; storing the data requires little effort.
● This method is suited to applications where most of the records have to be accessed, such as calculating
students' grades or generating salary slips.
● It is also used for report generation and statistical calculations.

Cons of sequential file organization


● It wastes time: we cannot jump directly to the particular record that is required, but have to move through
the records sequentially.
● The sorted file method takes extra time and space to sort the records.

B. Heap file organization


● It is the simplest and most basic type of organization. It works with data blocks. In heap file organization,
records are inserted at the end of the file, and no sorting or ordering of records is required when they are
inserted.
● When a data block is full, the new record is stored in some other block. This new data block need not be
the very next data block; the DBMS can select any data block in memory to store new records. A heap file
is also known as an unordered file.
● In the file, every record has a unique id, and every page in the file is of the same size. It is the DBMS's
responsibility to store and manage the new records.

Insertion of a new record


Suppose we have five records R1, R3, R6, R4 and R5 in a heap and we want to insert a new record R2. If data block 3
is full, R2 is inserted into any data block selected by the DBMS, say data block 1.
If we want to search, update or delete data in heap file organization, we have to traverse the data from the start of
the file until we reach the requested record.

If the database is very large, searching, updating or deleting records is time-consuming because there is no sorting
or ordering of records: we have to check all the data until we reach the requested record.

Pros of Heap file organization


● It is a very good file organization method for bulk insertion: if a large amount of data needs to be loaded
into the database at one time, this method is best suited.
● For a small database, fetching and retrieving records is faster than with sequential file organization.

Cons of Heap file organization


● This method is inefficient for large databases because searching or modifying a record requires scanning
through the file.
C. Hash File Organization
Hash file organization uses the computation of a hash function on some fields of the records. The hash function's
output determines the location of the disk block where the record is to be placed.

When a record has to be retrieved using the hash key columns, the address is generated and the whole record is
retrieved using that address. In the same way, when a new record has to be inserted, the address is generated using
the hash key and the record is inserted directly at that address. The same process applies to delete and update.
In this method, there is no need to search or sort the entire file; each record is stored at an apparently random
location in memory.
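
A minimal Python sketch of this idea follows; the number of blocks and the key column name (emp_id) are illustrative assumptions, not part of the module:

```python
# Minimal sketch of hash file organization: the hash of a key column decides
# the disk block (bucket) in which the whole record is stored and later found.

NUM_BLOCKS = 8
blocks = [[] for _ in range(NUM_BLOCKS)]   # each inner list stands for one disk block

def block_address(key):
    """Hash function output -> block address."""
    return hash(key) % NUM_BLOCKS

def insert(record, key_column="emp_id"):
    blocks[block_address(record[key_column])].append(record)

def lookup(key, key_column="emp_id"):
    # Only one block has to be examined; no search over the whole file.
    for record in blocks[block_address(key)]:
        if record[key_column] == key:
            return record
    return None

insert({"emp_id": 103, "name": "A"})
print(lookup(103))
```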

D. B+ File Organization
● B+ tree file organization is an advanced form of the indexed sequential access method. It uses a tree-like
structure to store records in the file.
● It uses the same key-index concept, where the primary key is used to sort the records. For each primary
key, an index value is generated and mapped to the record.
● The B+ tree is similar to a binary search tree (BST), but a node can have more than two children. In this
method, all the records are stored only at the leaf nodes; intermediate nodes act as pointers to the leaf
nodes and do not contain any records.

The example B+ tree shows that:


● There is one root node of the tree, i.e., 25.
● There is an intermediary layer of nodes. They do not store the actual records; they have only pointers to
the leaf nodes.
● The node to the left of the root contains values prior to the root value and the node to the right contains
values after it, i.e., 15 and 30 respectively.
● Only the leaf nodes hold the actual values, i.e., 10, 12, 17, 20, 24, 27 and 29.
● Searching for any record is easier as all the leaf nodes are balanced (at the same level).
● Any record can be reached by traversing a single path from the root, so it is accessed easily.
Pros of B+ tree file organization
● Searching becomes very easy as all the records are stored only in the leaf nodes and are linked in a sorted
sequential linked list.
● Traversing the tree structure is easy and fast.
● The size of the B+ tree is not restricted, so the number of records can increase or decrease and the B+
tree structure grows or shrinks accordingly.
● It is a balanced tree structure, and inserts, updates and deletes do not degrade its performance.

Cons of B+ tree file organization


● This method is inefficient for static tables (tables whose data rarely changes).
E. Indexed sequential access method (ISAM)
ISAM is an advanced form of sequential file organization. In this method, records are stored in the file
using the primary key. An index value is generated for each primary key and mapped to the record. The
index contains the address of the record in the file.

If any record has to be retrieved based on its index value, then the address of the data block is fetched and the
record is retrieved from memory.

Pros of ISAM:
● Since each record has the address of its data block, searching for a record in a huge database is quick and
easy.
● This method supports range retrieval and partial retrieval of records. Since the index is based on the
primary key values, we can retrieve the data for a given range of values. In the same way, a partial value
can also be searched easily, e.g., student names starting with 'JA'.

Cons of ISAM:
● This method requires extra space in the disk to store the index value.
● When the new records are inserted, then these files have to be reconstructed to maintain the sequence.
● When the record is deleted, then the space used by it needs to be released. Otherwise, the performance of
the database will slow down.
F. Cluster file organization
● When two or more related record types are stored in the same file, it is known as a cluster. Such files keep
two or more tables in the same data block, and the key attributes that are used to map these tables together
are stored only once.
● This method reduces the cost of searching for related records that would otherwise live in different files.
● Cluster file organization is used when there is a frequent need to join the tables on the same condition, and
these joins return only a few records from both tables. In the given example, we retrieve the records for
only particular departments; this method cannot be used to retrieve the records for all departments at once.

In this method, we can directly insert, update or delete any record. Data is sorted based on the key with which
searching is done. The cluster key is the key with which joining of the tables is performed.

Types of Cluster file organization:

Cluster file organization is of two types:

1. Indexed Clusters:

In an indexed cluster, records are grouped based on the cluster key and stored together. The EMPLOYEE and
DEPARTMENT relationship above is an example of an indexed cluster, where all the records are grouped based on the
cluster key DEP_ID.

2. Hash Clusters:
It is similar to the indexed cluster. In a hash cluster, instead of storing the records based on the cluster key, we
generate the value of the hash key for the cluster key and store the records with the same hash key value.
Pros of Cluster file organization
● The cluster file organization is used when there is a frequent request for joining the tables with the same
joining condition.
● It provides an efficient result when there is a 1:M mapping between the tables.

Cons of Cluster file organization

● This method has low performance for very large databases.
● If the joining condition changes, this method cannot be used efficiently: traversing the file with a different
join condition takes a lot of time.
● This method is not suitable for tables with a 1:1 relationship.

Indexing in DBMS
● Indexing is used to optimize the performance of a database by minimizing the number of disk accesses
required when a query is processed.
● The index is a type of data structure. It is used to locate and access the data in a database table quickly.

Index structure: Indexes can be created using some database columns.

● The first column of the index is the search key; it contains a copy of the primary key or candidate key of
the table. These key values are stored in sorted order so that the corresponding data can be accessed
easily.
● The second column of the index is the data reference. It contains a set of pointers holding the address of
the disk block where the value of that particular key can be found.
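
The following minimal Python sketch illustrates this two-column structure; the key values and block addresses are made up for illustration:

```python
# Minimal sketch of the two-column index structure described above:
# a sorted search-key column plus a data-reference column holding block addresses.
import bisect

search_keys = [10, 20, 30, 40, 50]                       # sorted copy of the key values
data_refs   = ["blk7", "blk2", "blk9", "blk1", "blk4"]   # pointers to disk blocks

def index_lookup(key):
    """Binary-search the sorted key column, then follow the pointer."""
    pos = bisect.bisect_left(search_keys, key)
    if pos < len(search_keys) and search_keys[pos] == key:
        return data_refs[pos]       # address of the disk block holding the record
    return None

print(index_lookup(30))   # -> "blk9"
```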

Indexing Methods

1. Ordered indices
The indices are usually sorted to make searching faster. Indices which are sorted are known as ordered indices.

Example: Suppose we have an employee table with thousands of records, each of which is 10 bytes long. If the IDs
start with 1, 2, 3, ... and so on, and we have to search for the employee with ID 543:

● In the case of a database with no index, we have to search the disk blocks from the start until we reach 543.
The DBMS will reach the record after reading 543*10 = 5430 bytes.
● In the case of an index, we search the index instead, and the DBMS reaches the record after reading
542*2 = 1084 bytes (assuming a 2-byte index entry), which is far less than in the previous case.
2. Primary Index
● If the index is created on the basis of the primary key of the table, it is known as primary indexing.
Primary keys are unique to each record, so there is a 1:1 relation between index entries and records.
● As primary keys are stored in sorted order, the performance of the search operation is quite efficient.
● The primary index can be classified into two types: dense index and sparse index.

a. Dense index
● The dense index contains an index record for every search key value in the data file. It makes searching
faster.
● In this, the number of records in the index table is the same as the number of records in the main table.
● It needs more space to store the index record itself. The index records have the search key and a pointer to
the actual record on the disk.

b. Sparse index
● In the data file, an index record appears only for some of the search key values. Each index entry points to a block.
● Instead of pointing to every record in the main table, the index points to records at intervals, leaving gaps
between indexed values.

3. Clustering Index
● A clustered index is defined over an ordered data file. Sometimes the index is created on non-primary
key columns, which may not be unique for each record.
● In this case, to identify the records faster, we group two or more columns to obtain a unique value and
create an index on them. This method is called a clustering index.
● Records which have similar characteristics are grouped together, and indexes are created for these groups.

Example: suppose a company has several employees in each department. If we use a clustering index, all
employees who belong to the same Dept_ID are considered to be within a single cluster, and the index
pointers point to the cluster as a whole. Here Dept_ID is a non-unique key.
This scheme can be a little confusing because one disk block may be shared by records which belong to different
clusters. Using separate disk blocks for separate clusters is considered the better technique.

4. Secondary Index
In sparse indexing, as the size of the table grows, the size of the mapping also grows. These mappings are
usually kept in primary memory so that address fetches are fast; the actual data is then searched in
secondary memory based on the address obtained from the mapping. If the mapping grows too large, fetching
the address itself becomes slow and the sparse index is no longer efficient. To overcome this problem,
secondary indexing is introduced.

In secondary indexing, another level of indexing is introduced to reduce the size of the mapping. Initially,
large ranges of the column values are selected so that the mapping size of the first level stays small. Each
range is then divided into smaller ranges. The first-level mapping is stored in primary memory so that
address fetches are fast, while the second-level mapping and the actual data are stored in secondary
memory (the hard disk).
For example:

● To find the record with roll number 111 in the diagram, we search the first-level index for the highest entry
that is smaller than or equal to 111 and get 100.
● In the second index level, we again take the highest entry that is smaller than or equal to 111 and get 110.
Using the address associated with 110, we go to the data block and search each record sequentially until we
find 111.
● This is how a search is performed in this method. Inserting, updating or deleting is done in the same
manner.
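
The lookup just described can be sketched in Python as follows; the index contents are assumptions arranged to mirror the roll-number example (111 → first-level entry 100 → second-level entry 110 → sequential scan):

```python
# Minimal sketch of the two-level lookup described above: each index level keeps
# only the first key of a range, so we always pick the largest entry <= the target.
import bisect

first_level  = {1: "L2-block-A", 100: "L2-block-B"}          # wide ranges
second_level = {"L2-block-B": {100: "data-blk-10", 110: "data-blk-11"}}
data_blocks  = {"data-blk-11": [110, 111, 112, 113]}          # actual records

def floor_entry(mapping, key):
    """Largest key in 'mapping' that is <= key (the 'highest entry <= 111' step)."""
    keys = sorted(mapping)
    pos = bisect.bisect_right(keys, key) - 1
    return mapping[keys[pos]]

def find(roll):
    l2 = floor_entry(first_level, roll)          # e.g. 111 -> entry 100
    blk = floor_entry(second_level[l2], roll)    # e.g. 111 -> entry 110
    return roll in data_blocks[blk]              # sequential scan inside the block

print(find(111))   # True
```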

B-Tree in DBMS

● When it comes to storing and searching large amounts of data, traditional binary search trees can become
impractical due to their poor performance and high memory usage. B-Trees (Balanced Trees) are a type of
self-balancing tree that was specifically designed to overcome these limitations.
● Unlike traditional binary search trees, B-Trees are characterized by the large number of keys that they can
store in a single node, which is why they are also sometimes called "large key" trees. Each node in a B-Tree
can contain multiple keys, which allows the tree to have a larger branching factor and thus a shallower
height. This shallow height leads to less disk I/O, which results in faster search and insertion operations.
B-Trees are particularly well suited for storage systems that have slow, bulky data access, such as hard
drives, flash memory, and CD-ROMs.

Properties of B-Tree:
● Property #1 - All leaf nodes must be at the same level.
● Property #2 - All nodes except the root must have at least ⌈m/2⌉ - 1 keys and at most m - 1 keys.
● Property #3 - All non-leaf nodes except the root (i.e., all internal nodes) must have at least ⌈m/2⌉
children.
● Property #4 - If the root node is a non-leaf node, then it must have at least 2 children.
● Property #5 - A non-leaf node with n - 1 keys must have n children.
● Property #6 - All the key values within a node must be in ascending order.

Operations on a B-Tree
The following operations are performed on a B-Tree...

1. Search
2. Insertion
3. Deletion

Search Operation in B-Tree


The search operation in a B-Tree is similar to the search operation in a binary search tree. In a binary
search tree, the search starts from the root node and we make a 2-way decision at every node
(we go to either the left subtree or the right subtree). In a B-Tree the search also starts from the root node,
but here we make an n-way decision at every node, where 'n' is the total number of children the node
has. In a B-Tree, the search operation is performed with O(log n) time complexity. The search
operation is performed as follows...

● Step 1 - Read the search element from the user.
● Step 2 - Compare the search element with the first key value of the root node in the tree.
● Step 3 - If both match, then display "Given node is found!!!" and terminate the function.
● Step 4 - If both do not match, then check whether the search element is smaller or larger than
that key value.
● Step 5 - If the search element is smaller, then continue the search process in the left subtree.
● Step 6 - If the search element is larger, then compare the search element with the next key value in
the same node and repeat steps 3, 4, 5 and 6 until we find an exact match or until the
search element has been compared with the last key value in a leaf node.
● Step 7 - If the last key value in the leaf node also does not match, then display "Element is not
found" and terminate the function.

Insertion Operation in B-Tree

In a B-Tree, a new element is always added at a leaf node. That means the new key value is
always attached to a leaf node only. The insertion operation is performed as follows...

● Step 1 - Check whether the tree is empty.
● Step 2 - If the tree is empty, then create a new node with the new key value and insert it into the
tree as the root node.
● Step 3 - If the tree is not empty, then find the suitable leaf node to which the new key value should
be added, using binary search tree logic.
● Step 4 - If that leaf node has an empty position, add the new key value to that leaf node, keeping the
key values within the node in ascending order.
● Step 5 - If that leaf node is already full, split that leaf node by sending its middle value to its
parent node. Repeat the same until the value being sent up fits into a node; a sketch of this
split step is given below the list.
● Step 6 - If the splitting is performed at the root node, then the middle value becomes the new root
node of the tree and the height of the tree increases by one.
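
The split in Step 5 can be illustrated with a small Python sketch. It shows only the split of one overfull node (order 3 assumed), not the full insertion algorithm:

```python
# Minimal sketch of the split in Step 5: a full node's keys plus the new key are
# split around the middle value, which is sent up to the parent.

def split_full_node(keys, new_key):
    """Return (middle_key_for_parent, left_keys, right_keys)."""
    all_keys = sorted(keys + [new_key])
    mid = len(all_keys) // 2
    return all_keys[mid], all_keys[:mid], all_keys[mid + 1:]

# An order-3 node can hold at most 2 keys; inserting 3 into a full [1, 2] overflows:
print(split_full_node([1, 2], 3))   # (2, [1], [3]) -> 2 moves up to the parent
```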

Example
Construct a B-Tree of Order 3 by inserting numbers from 1 to 10.
B+ Tree in DBMS

● The B+ tree is a balanced search tree (each node may have many children, not just two). It follows a
multi-level index format.
● In the B+ tree, the leaf nodes hold the actual data pointers. The B+ tree ensures that all leaf nodes remain at
the same height.
● In the B+ tree, the leaf nodes are linked using a linked list. Therefore, a B+ tree can support sequential
access as well as random access.

Structure of B+ Tree

● In the B+ tree, every leaf node is at equal distance from the root node. The B+ tree is of the order n where n
is fixed for every B+ tree.
● It contains an internal node and leaf node.

Internal node

● An internal node of the B+ tree can contain at least n/2 pointers, except the root node.
● At most, an internal node of the tree contains n pointers.

Leaf node

● The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key values.
● At most, a leaf node contains n record pointers and n key values.
● Every leaf node of the B+ tree contains one block pointer P to point to the next leaf node.

Searching a record in B+ Tree

Suppose we have to search 55 in the below B+ tree structure. First, we will fetch the intermediary node which will
direct to the leaf node that can contain a record for 55. So, in the intermediary node, we will find a branch between
50 and 75 nodes. Then at the end, we will be redirected to the third leaf node. Here DBMS will perform a sequential
search to find 55.
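
A minimal Python sketch of this lookup, with illustrative node contents arranged so that 55 lands in the third leaf, could look like this:

```python
# Minimal sketch of the B+ tree lookup just described: internal keys only route
# the search, and the record is found by a sequential scan of the chosen leaf.

internal_keys = [25, 50, 75]                             # routing keys of the intermediary node
leaves = [[10, 20], [25, 40], [50, 55, 65], [75, 80]]    # linked leaf nodes

def bplus_search(key):
    # Pick the branch whose key range contains the search value (here 50 <= 55 < 75).
    child = 0
    while child < len(internal_keys) and key >= internal_keys[child]:
        child += 1
    return key in leaves[child]               # sequential search inside the leaf

print(bplus_search(55))   # True: routed to the third leaf and found by scanning
```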
B+ Tree Insertion

Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf node after 55. It is a balanced
tree, and a leaf node of this tree is already full, so we cannot insert 60 there. In this case, we have to split the leaf
node, so that it can be inserted into the tree without affecting the fill factor, balance and order.

The 3rd leaf node has the values (50, 55, 60, 65, 70), and the key in its parent that points to it is 50. We will split
the leaf node in the middle so that the tree's balance is not altered, grouping (50, 55) and (60, 65, 70) into two leaf
nodes. If these two have to be leaf nodes, the intermediate node cannot branch only from 50; 60 must be added to it,
and then we can have a pointer to the new leaf node.

This is how we can insert an entry when there is overflow. In a normal scenario, it is very easy to find the node
where it fits and then place it in that leaf node.

B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove 60 from the intermediate
node as well as from the 4th leaf node too. If we remove it from the intermediate node, then the tree will not satisfy
the rule of the B+ tree. So we need to modify it to have a balanced tree. After deleting node 60 from above B+ tree
and re-arranging the nodes, it will show as follows:

Hashing in DBMS
In a huge database structure, it is very inefficient to search all the index values to reach the desired data. The
hashing technique is used to calculate the direct location of a data record on the disk without using an index
structure. In this technique, data is stored in data blocks whose addresses are generated by using a hashing function.
The memory locations where these records are stored are known as data buckets or data blocks.

In hashing, the hash function can use any of the column values to generate the address. Most of the time, the hash
function uses the primary key to generate the address of the data block. A hash function can be any simple or complex
mathematical function. We can even consider the primary key itself as the address of the data block, meaning each row
is stored in the data block whose address is the same as its primary key.

The diagram above shows data block addresses that are the same as the primary key values. The hash function can also
be a simple mathematical function such as exponential, mod, cos, sin, etc. Suppose we have a mod(5) hash function to
determine the address of the data block. In this case, applying mod(5) to the primary keys generates 3, 3, 1, 4 and 2
respectively, and the records are stored at those data block addresses.
Types of Hashing:

1. Static Hashing
In static hashing, the resultant data bucket address will always be the same. That means if we generate an address for
EMP_ID =103 using the hash function mod (5) then it will always result in the same bucket address 3. Here, there
will be no change in the bucket address. Hence in this static hashing, the number of data buckets in memory remains
constant throughout. In this example, we will have five data buckets in the memory used to store the data.

Operations of Static Hashing


● Searching a record
When a record needs to be searched, then the same hash function retrieves the address of the bucket where the data
is stored.
● Insert a Record
When a new record is inserted into the table, then we will generate an address for a new record based on the hash
key and the record is stored in that location.
● Delete a Record
To delete a record, we will first fetch the record which is supposed to be deleted. Then we will delete the records for
that address in memory.
● Update a Record
To update a record, we first search for it using the hash function, and then the data record is updated.

If we want to insert a new record into the file but the address generated by the hash function for the data bucket is
not empty, i.e., data already exists at that address, this situation is known in static hashing as bucket overflow.
It is a critical situation for this method.
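
The operations above can be sketched in Python using the mod(5) function from the example; the bucket capacity is an assumed value used only to show when a bucket overflow would be reported:

```python
# Minimal sketch of static-hashing operations with a fixed number of buckets.

NUM_BUCKETS, CAPACITY = 5, 2
buckets = [[] for _ in range(NUM_BUCKETS)]

def addr(emp_id):
    return emp_id % NUM_BUCKETS          # e.g. EMP_ID 103 -> bucket 3, always

def insert(record):
    b = buckets[addr(record["emp_id"])]
    if len(b) >= CAPACITY:
        raise OverflowError("bucket overflow: this address is already full")
    b.append(record)

def search(emp_id):
    return next((r for r in buckets[addr(emp_id)] if r["emp_id"] == emp_id), None)

def delete(emp_id):
    r = search(emp_id)
    if r:
        buckets[addr(emp_id)].remove(r)

insert({"emp_id": 103})
print(search(103), addr(103))   # record found in bucket 3
```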

2. Dynamic Hashing
● The dynamic hashing method is used to overcome the problems of static hashing, such as bucket overflow.
● In this method, data buckets grow or shrink as the number of records increases or decreases. This method is
also known as the extendable hashing method.
● This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting in poor
performance.
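
The following toy Python sketch illustrates only the grow-on-overflow idea; real extendable hashing maintains a directory and splits one bucket at a time, which is not shown here:

```python
# Toy illustration of dynamic growth: when a bucket overflows, the bucket count
# is doubled and all (distinct) keys are rehashed, so the file is never stuck
# with a fixed number of buckets as in static hashing.

CAPACITY = 2

def rehash(records, num_buckets):
    buckets = [[] for _ in range(num_buckets)]
    for key in records:
        buckets[key % num_buckets].append(key)
    return buckets

def insert(buckets, key):
    records = [k for b in buckets for k in b] + [key]
    num_buckets = len(buckets)
    while True:
        new_buckets = rehash(records, num_buckets)
        if all(len(b) <= CAPACITY for b in new_buckets):
            return new_buckets
        num_buckets *= 2                 # grow instead of failing with an overflow

buckets = [[] for _ in range(2)]
for k in [4, 8, 12, 5, 9]:
    buckets = insert(buckets, k)
print(len(buckets), buckets)             # more buckets than we started with
```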

Transactions
A transaction is a program unit comprising a collection of database operations, executed as a logical unit of data
processing. The operations performed in a transaction include one or more database operations such as insert, delete,
update or retrieve. It is an atomic process that is either performed to completion in its entirety or not performed at
all. A transaction involving only data retrieval without any data update is called a read-only transaction.
Each high-level operation can be divided into a number of low-level tasks or operations. For example, a data update
operation can be divided into three tasks −

● read_item() − reads the data item from storage into main memory.
● modify_item() − changes the value of the item in main memory.
● write_item() − writes the modified value from main memory to storage.
Database access is restricted to the read_item() and write_item() operations. Likewise, for all transactions, read and
write form the basic database operations.
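
A minimal Python sketch of this decomposition, with dictionaries standing in for disk storage and main memory, is shown below:

```python
# Minimal sketch of how one high-level update decomposes into the three
# low-level tasks listed above.

disk, memory = {"X": 100}, {}

def read_item(name):
    memory[name] = disk[name]            # storage -> main memory

def modify_item(name, delta):
    memory[name] += delta                # change the value in main memory

def write_item(name):
    disk[name] = memory[name]            # main memory -> storage

# A single "add 50 to X" update is really read + modify + write:
read_item("X"); modify_item("X", 50); write_item("X")
print(disk["X"])   # 150
```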

Transaction Operations
The low level operations performed in a transaction are −
● begin_transaction − A marker that specifies start of transaction execution.
● read_item or write_item − Database operations that may be interleaved with main memory operations as a
part of transaction.
● end_transaction − A marker that specifies end of transaction.
● commit − A signal to specify that the transaction has been successfully completed in its entirety and will
not be undone.
● rollback − A signal to specify that the transaction has been unsuccessful and so all temporary changes in
the database are undone. A committed transaction cannot be rolled back.

Transaction States
A transaction may go through a subset of five states, active, partially committed, committed, failed and aborted.

● Active − The initial state where the transaction enters is the active state. The transaction remains in this
state while it is executing read, write or other operations.
● Partially Committed − The transaction enters this state after the last statement of the transaction has been
executed.
● Committed − The transaction enters this state after successful completion of the transaction and system
checks have issued a commit signal.
● Failed − The transaction goes from partially committed state or active state to failed state when it is
discovered that normal execution can no longer proceed or system checks fail.
● Aborted − This is the state after the transaction has been rolled back following a failure and the database has
been restored to the state it was in before the transaction began.
The following state transition diagram depicts the states in the transaction and the low level transaction operations
that cause change in states.

Desirable Properties of Transactions


Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation, and Durability.

● Atomicity − This property states that a transaction is an atomic unit of processing, that is, either it is
performed in its entirety or not performed at all. No partial update should exist.
● Consistency − A transaction should take the database from one consistent state to another consistent state.
It should not adversely affect any data item in the database.
● Isolation − A transaction should be executed as if it is the only one in the system. There should not be any
interference from the other concurrent transactions that are simultaneously running.
● Durability − If a committed transaction brings about a change, that change should be durable in the
database and not lost in case of any failure.

Schedules and Conflicts


In a system with a number of simultaneous transactions, a schedule is the total order of execution of operations.
Given a schedule S consisting of n transactions, say T1, T2, T3, ..., Tn, for any transaction Ti the operations in
Ti must execute in the order laid down in the schedule S.
Types of Schedules
There are two types of schedules −

● Serial Schedules − In a serial schedule, at any point of time, only one transaction is active, i.e. there is no
overlapping of transactions. This is depicted in the following graph −

In other words, schedules in which the transactions are executed non-interleaved, i.e., in which no transaction starts
until the running transaction has ended, are called serial schedules.
Example: Consider the following schedule involving two transactions T1 and T2.

where R(A) denotes that a read operation is performed on some data item ‘A’

This is a serial schedule since the transactions perform serially in the order T1 —> T2
● Non-Serial Schedule − In a non-serial (parallel) schedule, more than one transaction is active simultaneously,
i.e. the transactions contain operations that overlap in time. This is depicted in the following graph −

This is a type of scheduling where the operations of multiple transactions are interleaved, which may give rise to
concurrency problems. The transactions are executed in a non-serial manner while keeping the end result correct and
the same as that of the serial schedule. Unlike the serial schedule, where one transaction must wait for another to
complete all its operations, in the non-serial schedule a transaction proceeds without waiting for the previous
transaction to complete, so the benefit of concurrent execution is obtained. It can be of two types, namely
Serializable and Non-Serializable Schedule.
The Non-Serial Schedule can be divided further into Serializable and Non-Serializable.

1. Serializable:
Serializability is used to maintain the consistency of the database. It is mainly used with non-serial
schedules to verify whether the schedule will lead to any inconsistency or not. A serial schedule, on the
other hand, does not need a serializability check because a transaction only begins when the previous
transaction is complete. A non-serial schedule is said to be serializable only when it is equivalent to a
serial schedule of the same n transactions. Since concurrency is allowed in this case, multiple transactions
can execute concurrently. A serializable schedule helps in improving both resource utilization and CPU
throughput. Serializable schedules are of two types:
1. Conflict Serializable:
A schedule is called conflict serializable if it can be transformed into a serial schedule by
swapping non-conflicting operations. Two operations are said to be conflicting if all of the
following conditions are satisfied:
■ They belong to different transactions
■ They operate on the same data item
■ At least one of them is a write operation
2. View Serializable:
A schedule is called view serializable if it is view equivalent to a serial schedule (one with no
overlapping transactions). Every conflict serializable schedule is also view serializable, but a
view serializable schedule that contains blind writes may not be conflict serializable.
2. Non-Serializable:
The non-serializable schedule is divided into two types, Recoverable and Non-Recoverable Schedules.
1. Recoverable Schedule:
Schedules in which transactions commit only after all transactions whose changes they read
have committed are called recoverable schedules.
2. Non-Recoverable Schedule:
In a non-recoverable schedule, when there is a system failure we may not be able to recover
to a consistent database state.

Conflicts in Schedules
In a schedule consisting of multiple transactions, a conflict occurs when two active transactions perform
non-compatible operations. Two operations are said to be in conflict when all of the following three conditions
exist simultaneously −

● The two operations are parts of different transactions.
● Both operations access the same data item.
● At least one of the operations is a write_item() operation, i.e. it tries to modify the data item.
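
These conditions can be applied mechanically to test a schedule for conflict serializability: build a precedence graph with an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj, then check the graph for cycles. A minimal Python sketch follows; the schedule format is an assumption made for illustration:

```python
# Build a precedence graph from the conflict conditions above and check it for a cycle.

def conflict_serializable(schedule):
    edges = set()
    for i, (ti, op1, item1) in enumerate(schedule):
        for tj, op2, item2 in schedule[i + 1:]:
            if ti != tj and item1 == item2 and "W" in (op1, op2):
                edges.add((ti, tj))               # conflicting pair, in this order
    txns = {t for t, _, _ in schedule}
    # Depth-first search over the precedence graph looking for a cycle.
    def has_cycle(node, path, seen):
        if node in path:
            return True
        if node in seen:
            return False
        seen.add(node)
        return any(has_cycle(v, path | {node}, seen) for u, v in edges if u == node)
    return not any(has_cycle(t, set(), set()) for t in txns)

# T1 and T2 both read and write A with interleaving -> not conflict serializable:
s = [("T1", "R", "A"), ("T2", "R", "A"), ("T1", "W", "A"), ("T2", "W", "A")]
print(conflict_serializable(s))   # False
```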

Serializability
A serializable schedule of 'n' transactions is a parallel (non-serial) schedule which is equivalent to a serial
schedule consisting of the same 'n' transactions. A serializable schedule combines the correctness of a serial
schedule with the better CPU utilization of a parallel schedule.

Equivalence of Schedules
Equivalence of two schedules can be of the following types −

● Result equivalence − Two schedules producing identical results are said to be result equivalent.
● View equivalence − Two schedules that perform similar action in a similar manner are said to be view
equivalent.
● Conflict equivalence − Two schedules are said to be conflict equivalent if both contain the same set of
transactions and have the same order of conflicting pairs of operations.

Concurrency control
Concurrency control is the management procedure required for controlling the concurrent execution of operations on a
database. It is the working concept for controlling and managing the concurrent execution of database operations so
as to avoid inconsistencies in the database. For maintaining the concurrency of the database, we have the concurrency
control protocols.
Concurrent Execution in DBMS
● In a multi-user system, multiple users can access and use the same database at one time, which is known as
the concurrent execution of the database. It means that the same database is executed simultaneously on a
multi-user system by different users.

● While working on the database transactions, there occurs the requirement of using the database by multiple
users for performing different operations, and in that case, concurrent execution of the database is
performed.
● The simultaneous execution should be performed in an interleaved manner, and no operation should affect the
other executing operations, so that the consistency of the database is maintained. When transaction operations
execute concurrently, several challenging problems arise that need to be solved.

Concurrency Control Protocols: The concurrency control protocols ensure the atomicity, consistency,
isolation, durability and serializability of the concurrent execution of the database transactions. Therefore, these
protocols are categorized as:

● Lock Based Concurrency Control Protocol


● Timestamp Concurrency Control Protocol
● Validation Based Concurrency Control Protocol

1. Lock-Based Protocol

In this type of protocol, any transaction cannot read or write data until it acquires an appropriate lock on it. There are
two types of lock:

1. Shared lock:
● It is also known as a Read-only lock. In a shared lock, the data item can only be read by the transaction.
● It can be shared between the transactions because when the transaction holds a lock, then it can't update the
data on the data item.

2. Exclusive lock:
● In the exclusive lock, the data item can be both read and written by the transaction.
● This lock is exclusive: while it is held, other transactions cannot modify the same data item simultaneously.
There are four types of lock protocols available:
1. Simplistic lock protocol: It is the simplest way of locking data during a transaction. Simplistic lock-based
protocols require every transaction to obtain a lock on the data before it inserts, deletes or updates it. The data
item is unlocked after the transaction completes.

2. Pre-claiming Lock Protocol:

● Pre-claiming lock protocols evaluate the transaction to list all the data items on which it needs locks.
● Before initiating execution of the transaction, it requests the DBMS for locks on all of those data items.
● If all the locks are granted, this protocol allows the transaction to begin. When the transaction is
completed, it releases all the locks.
● If all the locks are not granted, the transaction rolls back and waits until all the locks are granted.

3. Two-phase locking (2PL):

● The two-phase locking protocol divides the execution of a transaction into three parts.
● In the first part, when the execution of the transaction starts, it seeks permission for the locks it requires.
● In the second part, the transaction acquires all the locks. The third part starts as soon as the transaction
releases its first lock.
● In the third part, the transaction cannot demand any new locks; it only releases the acquired locks.

There are two phases of 2PL:

Growing phase: In the growing phase, a new lock on the data item may be acquired by the transaction, but none can
be released.
Shrinking phase: In the shrinking phase, existing locks held by the transaction may be released, but no new locks
can be acquired.
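
A minimal Python sketch of the growing/shrinking rule follows; lock compatibility between transactions is omitted, so this only illustrates the two phases within a single transaction:

```python
# Minimal sketch of two-phase locking: once a transaction releases any lock,
# it may not acquire new ones.

class TwoPhaseTransaction:
    def __init__(self, name):
        self.name, self.locks, self.shrinking = name, set(), False

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("growing phase is over: no new locks may be acquired")
        self.locks.add(item)          # growing phase

    def unlock(self, item):
        self.shrinking = True         # the first release ends the growing phase
        self.locks.discard(item)      # shrinking phase

t = TwoPhaseTransaction("T1")
t.lock("A"); t.lock("B")              # growing phase
t.unlock("A")                         # shrinking phase begins
try:
    t.lock("C")                       # violates 2PL
except RuntimeError as e:
    print(e)
```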

4. Strict Two-phase locking (Strict-2PL):

● The first phase of Strict-2PL is the same as in 2PL: after acquiring all the locks, the transaction continues
to execute normally.
● The only difference between 2PL and Strict-2PL is that Strict-2PL does not release a lock immediately after
using it.
● Strict-2PL waits until the whole transaction commits, and then it releases all the locks at once.
● Strict-2PL therefore does not have a shrinking phase of gradual lock release.

2. Timestamp Ordering Protocol


● The timestamp ordering protocol is used to order the transactions based on their timestamps. The order of the
transactions is simply the ascending order of their creation times.
● The older transaction has higher priority, which is why it executes first. To determine the timestamp of a
transaction, this protocol uses system time or a logical counter.
● The lock-based protocol manages the order between conflicting pairs of transactions at execution time, whereas
timestamp-based protocols start working as soon as a transaction is created.
● Let's assume there are two transactions T1 and T2. Suppose transaction T1 entered the system at time 007 and
transaction T2 entered at time 009. T1 has the higher priority, so it executes first, as it entered the system
first.
● The timestamp ordering protocol also maintains the timestamps of the last 'read' and 'write' operations on each
data item.

Basic Timestamp ordering protocol works as follows:

1. Check the following condition whenever a transaction Ti issues a Read (X) operation:

● If W_TS(X) >TS(Ti) then the operation is rejected.


● If W_TS(X) <= TS(Ti) then the operation is executed.
● Timestamps of all the data items are updated.

2. Check the following condition whenever a transaction Ti issues a Write(X) operation:

● If TS(Ti) < R_TS(X) then the operation is rejected.


● If TS(Ti) < W_TS(X) then the operation is rejected and Ti is rolled back otherwise the operation is
executed.

Where,
TS(Ti) denotes the timestamp of the transaction Ti.
R_TS(X) denotes the Read time-stamp of data-item X.
W_TS(X) denotes the Write time-stamp of data-item X.
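
A minimal Python sketch of these checks is given below, reusing the T1 (TS = 007) and T2 (TS = 009) example; returning False stands for rejecting the operation and rolling the transaction back:

```python
# Minimal sketch of the basic timestamp-ordering checks listed above.
# R_TS/W_TS hold the read and write timestamps of each data item.

R_TS, W_TS = {}, {}

def read_item(ts, x):
    if W_TS.get(x, 0) > ts:                 # a younger transaction already wrote X
        return False                        # reject Read(X)
    R_TS[x] = max(R_TS.get(x, 0), ts)       # execute and update the read timestamp
    return True

def write_item(ts, x):
    if R_TS.get(x, 0) > ts or W_TS.get(x, 0) > ts:
        return False                        # reject Write(X), roll back Ti
    W_TS[x] = ts                            # execute and update the write timestamp
    return True

print(write_item(7, "X"))    # T1 (TS = 007) writes X -> True
print(read_item(9, "X"))     # T2 (TS = 009) reads X  -> True
print(write_item(7, "X"))    # T1 writes X after T2 has read it -> False (rejected)
```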

3. Validation Based Protocol

The validation-based protocol is also known as the optimistic concurrency control technique. In the validation-based
protocol, the transaction is executed in the following three phases:

1. Read phase: In this phase, transaction T is read and executed. The values of the various data items are read
and stored in temporary local variables. All write operations are performed on the temporary variables, without
updating the actual database.
2. Validation phase: In this phase, the temporary variable values are validated against the actual data to see
whether they violate serializability.
3. Write phase: If the transaction passes validation, the temporary results are written to the database or system;
otherwise, the transaction is rolled back.

Here each phase has the following different timestamps:


Start(Ti): It contains the time when Ti started its execution.
Validation (Ti): It contains the time when Ti finishes its read phase and starts its validation phase.
Finish(Ti): It contains the time when Ti finishes its write phase.

● This protocol uses the timestamp of the validation phase as the timestamp of the transaction for serialization,
since it is the phase that actually determines whether the transaction will commit or roll back.
● Hence TS(T) = Validation(T).
● Serializability is determined during the validation process; it cannot be decided in advance.
● While executing transactions, this protocol ensures a greater degree of concurrency with fewer conflicts.
● Thus it results in transactions that have fewer rollbacks.

Multiversion Concurrency Control:

Multiversion schemes keep old versions of data items to increase concurrency.


Multiversion 2 phase locking:
Each successful write results in the creation of a new version of the data item written. Timestamps are used to label
the versions. When a read(X) operation is issued, select an appropriate version of X based on the timestamp of the
transaction.

Optimistic Concurrency Control Algorithm:


In systems with low conflict rates, the task of validating every transaction for serializability may lower
performance. In these cases, the test for serializability is postponed to just before commit. Since the conflict rate is
low, the probability of aborting transactions which are not serializable is also low. This approach is called
optimistic concurrency control technique.
In this approach, a transaction’s life cycle is divided into the following three phases −
● Execution Phase − A transaction fetches data items to memory and performs operations upon them.
● Validation Phase − A transaction performs checks to ensure that committing its changes to the database
passes serializability tests.
● Commit Phase − A transaction writes back modified data items in memory to the disk.
This algorithm uses three rules to enforce serializability in validation phase −
Rule 1 − Given two transactions Ti and Tj, if Ti is reading the data item which Tj is writing, then Ti’s execution
phase cannot overlap with Tj’s commit phase. Tj can commit only after Ti has finished execution.
Rule 2 − Given two transactions Ti and Tj, if Ti is writing the data item that Tj is reading, then Ti’s commit phase
cannot overlap with Tj’s execution phase. Tj can start executing only after Ti has already committed.
Rule 3 − Given two transactions Ti and Tj, if Ti is writing the data item which Tj is also writing, then Ti’s commit
phase cannot overlap with Tj’s commit phase. Tj can start to commit only after Ti has already committed.
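
As a rough illustration of Rule 1 only, the following Python sketch performs a backward validation check: a transaction is rejected if any transaction that committed while it was executing wrote an item it has read. The read/write-set bookkeeping is an assumption, not a full implementation of all three rules:

```python
# Simplified backward validation in the spirit of Rule 1.

def validate(txn, committed):
    """txn: dict with 'start', 'read_set', 'write_set';
       committed: list of (finish_time, write_set) of already-committed txns."""
    for finish_time, write_set in committed:
        overlapped = finish_time > txn["start"]          # committed during txn's run
        if overlapped and write_set & txn["read_set"]:
            return False                                 # conflict: abort and restart
    return True                                          # safe to enter commit phase

committed = [(5, {"A"}), (12, {"B"})]
t = {"start": 10, "read_set": {"B", "C"}, "write_set": {"C"}}
print(validate(t, committed))   # False: B was written by a txn that finished at time 12
```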

Database Recovery Techniques in DBMS


A database is prone to failures due to inconsistency, network failure, errors or any kind of accidental damage.
Database recovery techniques are therefore highly important for bringing a database back into a working state after a
failure. Four different recovery techniques are available:

1. Mirroring
2. Recovery using Backups
3. Recovery using Transaction Logs
4. Shadow Paging

Mirroring:
Two complete copies of the database are maintained on-line on different stable storage devices. This method is
mostly used in environments that require non-stop, fault-tolerant operations.

Recovery using Backups:


Backups are useful if there has been extensive damage to the database. Backups are mainly of two types:

● Immediate Backup: Immediate backups are kept on a floppy disk, hard disk or magnetic tape. They come in handy
when a technical fault occurs in the primary database, such as a system failure, disk crash or network failure.
Damage due to virus attacks can be repaired using the immediate backup.
● Archival Backup: Archival backups are kept on mass storage devices such as magnetic tape, CD-ROMs, Internet
servers, etc. They are very useful for recovering data after a disaster such as a fire, earthquake or flood.
An archival backup should be kept at a site other than where the system is functioning; kept at a separate
place, it remains safe from theft and intentional destruction by user staff.

Recovery using Transaction Logs:


Recovery using transaction logs involves the following steps:

Step 1: The log is searched for all transactions that have recorded a [start transaction, ' '] entry but have not
recorded a corresponding [commit, ' '] entry.
Step 2: These transactions are rolled back.
Step 3: Transactions which have recorded a [commit, ' '] entry in the log must have recorded in the log the changes
they made to the database. These logged changes are used to redo their effects on the database.
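
A minimal Python sketch of this undo/redo decision follows; the log record format is an assumption made for illustration:

```python
# Scan the log, roll back transactions that started but never committed,
# and redo the logged changes of committed ones.

log = [
    ("start", "T1"), ("write", "T1", "X", 5),
    ("start", "T2"), ("write", "T2", "Y", 9),
    ("commit", "T1"),
]   # T2 has no commit record, so the failure happened before T2 finished

started   = {rec[1] for rec in log if rec[0] == "start"}
committed = {rec[1] for rec in log if rec[0] == "commit"}

undo_list = started - committed          # Steps 1 and 2: roll these back
redo_list = committed                    # Step 3: re-apply their logged changes

print("undo:", undo_list)   # {'T2'}
print("redo:", redo_list)   # {'T1'}
```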

Shadow Paging:
This technique can be used for data recovery instead of transaction logs. In shadow paging, a database is divided into
several fixed-size disk pages, say n, and then a current directory is created. It has n entries, with each entry
pointing to a disk page in the database. The current directory is transferred to main memory.

When a transaction begins executing, the current directory is copied into a shadow directory, and the shadow directory
is saved on disk. The transaction then works with the current directory: during transaction execution, all
modifications are made through the current directory, and the shadow directory is never modified.
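
A minimal Python sketch of the current/shadow directory idea follows; page numbers and contents are made up for illustration:

```python
# Shadow paging sketch: the shadow directory is a saved copy of the page
# directory; updates go to fresh pages via the current directory only, so
# recovery amounts to discarding the current directory and reusing the shadow.

pages = {0: "old-A", 1: "old-B", 2: "new-B"}     # fixed-size disk pages
current_directory = {"A": 0, "B": 1}             # directory entry -> page number

shadow_directory = dict(current_directory)       # copied and saved when the txn begins
current_directory["B"] = 2                       # the transaction writes B to a new page

def read(directory, item):
    return pages[directory[item]]

print(read(current_directory, "B"))   # 'new-B'  (what the transaction sees)
print(read(shadow_directory, "B"))    # 'old-B'  (state recovered after a failure)
```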
