
DBMS

MODULE 3
● Storage Strategies
❖ Comparison of ordered indexing and hashing

Indexing: It is a technique that allows records to be retrieved quickly from a database file.
Hashing: It is a technique that allows the location of the desired data on the disk to be found without using an index structure.

Indexing: It is generally used to optimize or increase the performance of a database simply by minimizing the number of disk accesses required when a query is processed.
Hashing: It is generally used to index and retrieve items in a database, as it is faster to search for a specific item using a shorter hashed key than using its original value.

Indexing: It offers faster search and retrieval of data to users, helps to reduce table space, makes it possible to quickly retrieve or fetch data, and can be used for sorting.
Hashing: It is faster than searching arrays and lists, provides a more flexible and reliable method of data retrieval than other data structures, and can be used for comparing two files for equality.

Indexing: Its main purpose is to provide a basis for both rapid random lookups and efficient access of ordered records.
Hashing: Its main purpose is to use a mathematical function to organize data into easily searchable buckets.

Indexing: It is not considered best for large databases; it is good for small databases.
Hashing: It is considered best for large databases.

Indexing: Types of indexing include ordered indexing, primary indexing, secondary indexing and clustered indexing.
Hashing: Types of hashing include static and dynamic hashing.

Indexing: It uses a data reference to hold the address of the disk block.
Hashing: It uses a mathematical function known as a hash function to calculate the direct location of records on the disk.

Indexing: It is important because it protects the files and documents of large business organizations and optimizes the performance of the database.
Hashing: It is important because it ensures the data integrity of files and messages; it takes a variable-length string or message and compresses and converts it into a fixed-length value.

❖ Indices
➔ Types of index
➔ Ordered indices
➔ Hash indices
➔ Dense Index
➔ Sparse index
➔ Multilevel Index
➔ Types of Indexing
★ Single level (primary,clustering ,secondary index)
Indexing in DBMS
○ Indexing is used to optimize the performance of a database by minimizing the
number of disk accesses required when a query is processed.

○ The index is a type of data structure. It is used to locate and access the data in a
database table quickly.

Index structure:
Indexes can be created using some database columns.

○ The first column of the database is the search key that contains a copy of the
primary key or candidate key of the table. The values of the primary key are
stored in sorted order so that the corresponding data can be accessed easily.

○ The second column of the database is the data reference. It contains a set of
pointers holding the address of the disk block where the value of the particular
key can be found.

Indexing Methods
Ordered indices

The indices are usually sorted to make searching faster. The indices which are sorted
are known as ordered indices.

Example: Suppose we have an employee table with thousands of records, each of which is 10 bytes long. If the IDs start at 1, 2, 3, ... and so on, and we have to search for the employee with ID 543:

○ In the case of a database with no index, we have to search the disk blocks from the start until we reach 543. The DBMS will read the record after reading 543*10 = 5430 bytes.

○ In the case of an index, we search using the index, and the DBMS reads the record after reading 542*2 = 1084 bytes, which is far less than in the previous case.

Primary Index
○ If the index is created on the basis of the primary key of the table, then it is
known as primary indexing. These primary keys are unique to each record and
contain 1:1 relation between the records.

○ As primary keys are stored in sorted order, the performance of the searching
operation is quite efficient.

○ The primary index can be classified into two types: Dense index and Sparse
index.

Dense index

○ The dense index contains an index record for every search key value in the data
file. It makes searching faster.

○ In this, the number of records in the index table is the same as the number of records
in the main table.

○ It needs more space to store index record itself. The index records have the
search key and a pointer to the actual record on the disk.

Sparse index

○ In the data file, an index record appears only for a few items. Each index entry points to a
block.
○ In this, instead of pointing to each record in the main table, the index points to the
records in the main table with gaps, as in the sketch below.
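
The following minimal sketch (with hypothetical blocks and helper names) contrasts how a dense and a sparse index locate a record: the dense index jumps straight to the record, while the sparse index finds the block whose first key is the highest one not exceeding the search key and then scans inside that block.

```python
from bisect import bisect_right

# Hypothetical data file: records grouped into blocks, sorted by key.
blocks = [
    [(1, "A"), (2, "B"), (3, "C")],
    [(4, "D"), (5, "E"), (6, "F")],
    [(7, "G"), (8, "H"), (9, "I")],
]

# Dense index: one entry per search-key value -> (block number, offset).
dense_index = {rec[0]: (b, i)
               for b, block in enumerate(blocks)
               for i, rec in enumerate(block)}

# Sparse index: one entry per block, keyed by the block's first key.
sparse_keys = [block[0][0] for block in blocks]          # [1, 4, 7]

def dense_lookup(key):
    b, i = dense_index[key]                  # direct jump to the record
    return blocks[b][i]

def sparse_lookup(key):
    b = bisect_right(sparse_keys, key) - 1   # largest first-key <= key
    for rec in blocks[b]:                    # sequential scan inside the block
        if rec[0] == key:
            return rec
    return None

print(dense_lookup(5), sparse_lookup(5))     # both find (5, 'E')
```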

Clustering Index

○ A clustered index can be defined as an ordered data file. Sometimes the index is
created on non-primary key columns which may not be unique for each record.

○ In this case, to identify the record faster, we will group two or more columns to
get the unique value and create index out of them. This method is called a
clustering index.

○ The records which have similar characteristics are grouped together, and indexes are
created for these groups.

Example: suppose a company has several employees in each department.


Suppose we use a clustering index, where all employees who belong to the same
Dept_ID are considered to be within a single cluster, and the index pointers point to the cluster as
a whole. Here Dept_Id is a non-unique key.
This scheme can be a little confusing because one disk block may be shared by records
that belong to different clusters. Using a separate disk block for each cluster is
considered the better technique.
Secondary Index

1. In the sparse indexing, as the size of the table grows, the size of mapping also
grows. These mappings are usually kept in the primary memory so that address
fetch should be faster.
2. Then the actual data is searched in secondary memory based on the address obtained
from the mapping.
3. If the mapping size grows then fetching the address itself becomes slower. In
this case, the sparse index will not be efficient.
4. To overcome this problem, secondary indexing is introduced.
5. In secondary indexing, to reduce the size of mapping, another level of indexing is
introduced.
6. In this method, the huge range for the columns is selected initially so that the
mapping size of the first level becomes small.
7. Then each range is further divided into smaller ranges. The mapping of the first
level is stored in the primary memory, so that address fetch is faster.
8. The mapping of the second level and actual data are stored in the secondary
memory (hard disk).

For example:

★ If you want to find the record with roll number 111 in the diagram, the search
looks for the highest entry that is smaller than or equal to 111 in the
first-level index. It gets 100 at this level.

★ Then, in the second-level index, it again looks for the highest entry that is
smaller than or equal to 111 and gets 110. Using the address stored with 110, it
goes to the data block and searches each record sequentially until it finds 111.
★ This is how a search is performed in this method (a sketch of this two-level
lookup follows). Inserting, updating or deleting is also done in the same manner.
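
A minimal sketch of the two-level lookup described above, using hypothetical first-level and second-level index lists: each level is searched for the highest entry less than or equal to the target, and the data block found at the second level is then scanned sequentially.

```python
from bisect import bisect_right

# Hypothetical two-level index: the first level is kept in main memory,
# the second level and the data blocks live in secondary memory.
first_level = [(1, 0), (100, 1), (200, 2)]             # (range start, slot in second level)
second_level = [
    [(1, "blk0"), (50, "blk1")],                       # rolls 1..99
    [(100, "blk2"), (110, "blk3"), (150, "blk4")],     # rolls 100..199
    [(200, "blk5")],                                   # rolls 200..
]
data_blocks = {"blk3": [110, 111, 112, 113]}           # only the block needed here

def highest_leq(entries, key):
    """Return the entry with the largest first component <= key."""
    keys = [k for k, _ in entries]
    return entries[bisect_right(keys, key) - 1]

def find(roll):
    _, slot = highest_leq(first_level, roll)                  # 111 -> entry (100, 1)
    _, block_id = highest_leq(second_level[slot], roll)       # 111 -> (110, "blk3")
    for r in data_blocks[block_id]:                           # sequential scan in the block
        if r == roll:
            return block_id, r
    return None

print(find(111))    # ('blk3', 111)
```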

Multilevel Indexing:
1. Multilevel Indexing: As the size of the database grows, the indices also grow. If a
single-level index becomes too large to keep in main memory, searching it requires
multiple disk accesses.
2. Multilevel indexing segregates the index into smaller blocks so that the outer-level
index can be stored in a single block.
3. The outer blocks are divided into inner blocks, which in turn point to the data
blocks.
4. This can be easily stored in the main memory with fewer overheads.
Comparison of clustered and non-clustered indexes:

Use for:
Clustered: You can sort the records and store the clustered index physically in memory as per that order.
Non-clustered: A non-clustered index helps you create a logical order for data rows and uses pointers to the physical data files.

Storing method:
Clustered: Stores the data pages in the leaf nodes of the index.
Non-clustered: Never stores data pages in the leaf nodes of the index.

Size:
Clustered: The size of the clustered index is quite large.
Non-clustered: The size of the non-clustered index is small compared to the clustered index.

Data accessing:
Clustered: Faster.
Non-clustered: Slower compared to the clustered index.

Additional disk space:
Clustered: Not required.
Non-clustered: Required, to store the index separately.

Type of key:
Clustered: By default, the primary key of the table is a clustered index.
Non-clustered: It can be used with a unique constraint on the table which acts as a composite key.

Main feature:
Clustered: A clustered index can improve the performance of data retrieval.
Non-clustered: It should be created on columns which are used in joins.
Advantages of Indexing
● Improved Query Performance: Indexing enables faster data retrieval from the database.
The database may rapidly discover rows that match a specific value or collection of
values by generating an index on a column, minimizing the amount of time it takes to
perform a query.
● Efficient Data Access: Indexing can enhance data access efficiency by lowering the
amount of disk I/O required to retrieve data. The database can maintain the data pages
for frequently visited columns in memory by generating an index on those columns,
decreasing the requirement to read from disk.
● Optimized Data Sorting: Indexing can also improve the performance of sorting
operations. By creating an index on the columns used for sorting, the database can
avoid sorting the entire table and instead sort only the relevant rows.
● Consistent Data Performance: Indexing can help ensure that the database performs
consistently even as the amount of data in the database rises. Without indexing, queries
may take longer to run as the number of rows in the table grows, while indexing
maintains a roughly consistent speed.
● By ensuring that only unique values are inserted into columns that have been indexed as
unique, indexing can also be utilized to ensure the integrity of data. This avoids storing
duplicate data in the database, which might lead to issues when performing queries or
reports.
Overall, indexing in databases provides significant benefits for improving query performance,
efficient data access, optimized data sorting, consistent data performance, and enforced data
integrity.

Disadvantages of Indexing
● Indexing necessitates more storage space to hold the index data structure, which might
increase the total size of the database.
● Increased database maintenance overhead: Indexes must be maintained as data is
added, deleted, or modified in the table, which can raise database maintenance
overhead.
● Indexing can reduce insert and update performance since the index data structure must
be updated each time data is modified.
● Choosing an index can be difficult: It can be challenging to choose the right indexes for a
specific query or application and may call for a detailed examination of the data and
access patterns.

★ Multilevel (B+ trees index files,B-trees index files)


★ Index update,insertion and deletion
★ B+ trees Node structure , leaf node
structure,Non-leaf nodes
★ Queries on B+trees
★ B+trees file organization
★ B+ trees index files
★ The B+ tree is a balanced multi-way search tree. It follows a
multi-level index format.

★ In the B+ tree, leaf nodes hold the actual data pointers. The B+ tree
ensures that all leaf nodes remain at the same level.

★ In the B+ tree, the leaf nodes are linked using a linked list.
Therefore, a B+ tree can support random access as well as
sequential access.

Structure of B+ Tree

○ In the B+ tree, every leaf node is at equal distance from the root node. The B+
tree is of the order n where n is fixed for every B+ tree.

○ It contains an internal node and leaf node.

Internal node
○ An internal node of the B+ tree can contain at least n/2 child pointers, except
the root node.

○ At most, an internal node of the tree contains n pointers.

Leaf node

○ The leaf node of the B+ tree can contain at least n/2 record pointers and n/2 key
values.

○ At most, a leaf node contains n record pointer and n key values.

○ Every leaf node of the B+ tree contains one block pointer P that points to the next leaf
node.

Searching a record in B+ Tree


Suppose we have to search for 55 in the B+ tree structure below. First, we look at the
intermediary node, which will direct us to the leaf node that may contain the record for 55.

In the intermediary node, we find the branch between 50 and 75. Following it, we are
redirected to the third leaf node, where the DBMS performs a sequential search to find 55,
as in the sketch below.
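
A minimal sketch of this search, assuming a hand-built in-memory tree whose node layout and key values are purely illustrative: internal nodes are walked down by comparing keys, and the matching leaf is scanned sequentially.

```python
# Minimal sketch of searching a B+ tree; the node layout and key values are
# illustrative, not tied to any particular diagram.
class Node:
    def __init__(self, keys, children=None, records=None, next_leaf=None):
        self.keys = keys              # search keys in sorted order
        self.children = children      # internal node: child pointers
        self.records = records        # leaf node: record pointers/values
        self.next_leaf = next_leaf    # leaf node: link to the next leaf

# Leaves linked together for sequential access.
leaf3 = Node([75, 80], records=["r75", "r80"])
leaf2 = Node([50, 55, 60], records=["r50", "r55", "r60"], next_leaf=leaf3)
leaf1 = Node([10, 25, 40], records=["r10", "r25", "r40"], next_leaf=leaf2)
root = Node([50, 75], children=[leaf1, leaf2, leaf3])

def search(node, key):
    while node.children is not None:              # walk down the internal nodes
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1
        node = node.children[i]
    for k, rec in zip(node.keys, node.records):   # sequential search in the leaf
        if k == key:
            return rec
    return None

print(search(root, 55))   # 'r55'
```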

B+ Tree Insertion
Suppose we want to insert a record 60 in the below structure. It will go to the 3rd leaf
node after 55. It is a balanced tree, and a leaf node of this tree is already full, so we
cannot insert 60 there.
In this case, we have to split the leaf node, so that it can be inserted into tree without
affecting the fill factor, balance and order.

The 3rd leaf node would then hold the values (50, 55, 60, 65, 70), which is more than it can
store, and the key for this leaf in the intermediate node is 50. We split the leaf node in the
middle so that the balance of the tree is not altered, grouping (50, 55) and (60, 65, 70) into
two leaf nodes.

If these two are to be leaf nodes, the intermediate node cannot branch only on 50. It
should have 60 added to it, and then it can hold a pointer to the new leaf node.

This is how we insert an entry when there is overflow. In a normal scenario, it is very
easy to find the node where the key fits and then place it in that leaf node.

B+ Tree Deletion
Suppose we want to delete 60 from the above example. In this case, we have to remove
60 from the intermediate node as well as from the 4th leaf node. If we removed it only
from the intermediate node, the tree would no longer satisfy the rules of a B+ tree, so we
need to modify it to keep the tree balanced.
After deleting key 60 from the above B+ tree and re-arranging the nodes, it will appear as
follows:

❖ B-Trees
➔ B+trees examples
➔ Queries on B-Trees
➔ B-Trees index files
https://builtin.com/data-science/b-tree-index
❖ Hashing
➔ Static Hashing
➔ Deficiencies of static hashing
➔ Linear Probing
➔ Dynamic Hashing
➔ Hash Structure(extendable hashing)

Hashing in DBMS
In a huge database structure, it is very inefficient to search all the index
values and reach the desired data. Hashing technique is used to calculate
the direct location of a data record on the disk without using index
structure.

In this technique, data is stored in data blocks whose addresses are generated by using
the hash function. The memory locations where these records are stored are known as
data buckets or data blocks.

Here, the hash function can use any column value to generate the address. Most of the
time, the hash function uses the primary key to generate the address of the data block.
The hash function can be anything from a simple mathematical function to a complex
one. We can even consider the primary key itself as the address of the data block: in that
case, each row is stored in the data block whose address is the same as its primary key
value.

The above diagram shows data block addresses that are the same as the primary key
values. The hash function can also be a simple mathematical function such as mod,
exponential, cos, sin, etc. Suppose we use a mod(5) hash function to determine the
address of the data block. It then applies mod(5) to the primary keys, generating, say,
3, 3, 1, 4 and 2 respectively, and the records are stored at those data block addresses.
Types of Hashing:

Static Hashing
In static hashing, the resultant data bucket address will always be the same. That
means if we generate an address for EMP_ID = 103 using the hash function mod(5), it
will always result in the same bucket address, 3. There is no change in the bucket
address.

Hence, in static hashing the number of data buckets in memory remains constant
throughout. In this example, we will have five data buckets in memory to store the data
(see the sketch below).
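
A small sketch of static hashing with the mod(5) hash function used in the text; the EMP_ID values other than 103 are assumed for illustration.

```python
# Static hashing sketch: the bucket count is fixed, so every key always
# maps to the same bucket address.
NUM_BUCKETS = 5
buckets = [[] for _ in range(NUM_BUCKETS)]

def hash_fn(emp_id):
    return emp_id % NUM_BUCKETS

def insert(emp_id, record):
    buckets[hash_fn(emp_id)].append((emp_id, record))

def search(emp_id):
    for key, record in buckets[hash_fn(emp_id)]:
        if key == emp_id:
            return record
    return None

for emp_id in (103, 104, 105, 106, 107):     # hypothetical EMP_IDs
    insert(emp_id, f"employee-{emp_id}")

print(hash_fn(103))      # 3 -- always bucket 3, as described above
print(search(105))       # 'employee-105'
```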
Operations of Static Hashing

○ Searching a record

When a record needs to be searched, then the same hash function retrieves the address
of the bucket where the data is stored.

○ Insert a Record

When a new record is inserted into the table, then we will generate an address for a new
record based on the hash key and record is stored in that location.

○ Delete a Record

To delete a record, we will first fetch the record which is supposed to be deleted. Then
we will delete the records for that address in memory.

○ Update a Record
To update a record, we will first search it using a hash function, and then the data record
is updated.

If we want to insert a new record into the file but the data bucket address generated by
the hash function is not empty, i.e. data already exists at that address, the situation is
known as bucket overflow. This is a critical situation in static hashing.

To overcome this situation, there are various methods. Some commonly used methods
are as follows:

1. Open Hashing
When the hash function generates an address at which data is already stored, the
next bucket is allocated to the record. This mechanism is called Linear Probing.

For example: suppose R3 is a new record which needs to be inserted and the hash
function generates address 112 for it. But the generated address is already full, so
the system searches for the next available data bucket, 113, and assigns R3 to it.

2. Close Hashing
When a bucket is full, a new data bucket is allocated for the same hash result and
is linked after the previous one. This mechanism is known as Overflow Chaining.
For example: suppose R3 is a new record which needs to be inserted into the table and
the hash function generates address 110 for it. But this bucket is already full, so a new
bucket is inserted at the end of bucket 110 and is linked to it. Both strategies are sketched below.
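
A sketch of the two overflow strategies just described; the bucket count, the rule that each linear-probing bucket holds a single record, and the use of mod arithmetic in place of the literal addresses 110, 112 and 113 are all assumptions made for illustration.

```python
NUM_BUCKETS = 5

# Open hashing / linear probing: on collision, use the next free bucket.
probe_table = [None] * NUM_BUCKETS

def insert_linear(key, record):
    slot = key % NUM_BUCKETS
    for step in range(NUM_BUCKETS):
        probe = (slot + step) % NUM_BUCKETS
        if probe_table[probe] is None:
            probe_table[probe] = (key, record)
            return probe
    raise RuntimeError("table full")

# Close hashing / overflow chaining: chain the record after the full bucket.
chain_table = [[] for _ in range(NUM_BUCKETS)]

def insert_chained(key, record):
    chain_table[key % NUM_BUCKETS].append((key, record))

print(insert_linear(112, "R1"))   # goes to bucket 2 (112 mod 5)
print(insert_linear(117, "R3"))   # also hashes to 2, probes on to bucket 3
insert_chained(112, "R1")
insert_chained(117, "R3")
print(chain_table[2])             # both records chained under bucket 2
```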

Dynamic Hashing
○ The dynamic hashing method is used to overcome the problems of static
hashing like bucket overflow.

○ In this method, data buckets grow or shrink as the number of records increases or
decreases. This method is also known as the extendable hashing method.

○ This method makes hashing dynamic, i.e., it allows insertion or deletion without
resulting in poor performance.

How to search a key

○ First, calculate the hash address of the key.

○ Check how many bits are used in the directory; this number of bits is called i.

○ Take the least significant i bits of the hash address. This gives an index of the
directory.
○ Now using the index, go to the directory and find bucket address where the
record might be.

How to insert a new record

○ Firstly, you have to follow the same procedure for retrieval, ending up in some
bucket.

○ If there is still space in that bucket, then place the record in it.

○ If the bucket is full, then we will split the bucket and redistribute the records.

For example:

Consider the following grouping of keys into buckets, depending on the prefix of their
hash address:

The hash addresses of keys 2 and 4 end in 00, so they go into bucket B0. The hash
addresses of 5 and 6 end in 01, so they go into bucket B1. The hash addresses of 1 and 3
end in 10, so they go into bucket B2. The hash address of 7 ends in 11, so it goes into B3.
Insert key 9 with hash address 10001 into the above
structure:

○ Since key 9 has hash address 10001, whose last two bits are 01, it must go into
bucket B1. But bucket B1 is full, so it gets split.

○ The split separates 5 and 9 from 6: the last three bits of the hash addresses of 5
and 9 are 001, so they go into bucket B1, and the last three bits of the hash
address of 6 are 101, so it goes into bucket B5.

○ Keys 2 and 4 are still in B0. Bucket B0 is pointed to by the directory entries 000
and 100, because the last two bits of both entries are 00.

○ Keys 1 and 3 are still in B2. Bucket B2 is pointed to by the directory entries 010
and 110, because the last two bits of both entries are 10.

○ Key 7 is still in B3. Bucket B3 is pointed to by the directory entries 111 and 011,
because the last two bits of both entries are 11. The sketch below reproduces this split.
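
A compact sketch of extendable hashing that reproduces this split. The hash addresses below are assumed values chosen to match the grouping in the example (only key 9's address, 10001, is given in the text), bucket capacity 2 is also an assumption, and recursive splits are not handled.

```python
# Extendable hashing sketch: a directory indexed by the last i bits of the
# hash address, with bucket splitting and directory doubling on overflow.
HASH_ADDR = {2: 0b00100, 4: 0b01000, 5: 0b00001, 6: 0b00101,
             1: 0b00010, 3: 0b00110, 7: 0b00011, 9: 0b10001}
BUCKET_SIZE = 2

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendableHash:
    def __init__(self):
        self.global_depth = 2                       # directory uses the last 2 bits
        self.directory = [Bucket(2) for _ in range(4)]

    def _index(self, key):
        return HASH_ADDR[key] & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < BUCKET_SIZE:
            bucket.keys.append(key)
            return
        # Bucket overflow: split it, doubling the directory if needed.
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory        # double the directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        pending, bucket.keys = bucket.keys + [key], []
        # Entries whose new distinguishing bit is 1 now point to the new bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (bucket.local_depth - 1)) & 1:
                self.directory[i] = new_bucket
        for k in pending:                           # redistribute the keys
            self.directory[self._index(k)].keys.append(k)

h = ExtendableHash()
for k in (2, 4, 5, 6, 1, 3, 7):
    h.insert(k)
h.insert(9)                      # overflows the "01" bucket and splits it
print(h.global_depth)            # 3
print(sorted(h.directory[0b001].keys), sorted(h.directory[0b101].keys))  # [5, 9] [6]
```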
Advantages of dynamic hashing

○ In this method, the performance does not decrease as the data grows in the
system. It simply increases the size of memory to accommodate the data.

○ In this method, memory is well utilized as it grows and shrinks with the data.
There will not be any memory lying unused.

○ This method is good for the dynamic database where data grows and shrinks
frequently.

Disadvantages of dynamic hashing

● In this method, if the data size increases then the number of data buckets also
increases, and the addresses of the data have to be maintained in the bucket
address table. This is because the data addresses keep changing as buckets grow
and shrink. If there is a huge increase in data, maintaining the bucket address
table becomes tedious.
● The bucket overflow situation can also occur here, but it usually takes longer to
reach that situation than in static hashing.

RAID (Redundant Array of Independent Disks)


RAID stands for Redundant Array of Independent Disks. It is a technology which is
used to connect multiple secondary storage devices for increased performance, data
redundancy or both. It gives you the ability to survive one or more drive failures
depending upon the RAID level used.

It consists of an array of disks in which multiple disks are connected to achieve
different goals.

RAID technology
There are 7 standard RAID levels, named RAID 0, RAID 1, ..., RAID 6.

These levels contain the following characteristics:

○ It contains a set of physical disk drives.

○ In this technology, the operating system views these separate disks as a single
logical disk.

○ In this technology, data is distributed across the physical drives of the array.

○ Redundant disk capacity is used to store parity information.

○ In case of a disk failure, the parity information can be used to recover the data.

Standard RAID levels

RAID 0

○ RAID level 0 provides data striping, i.e., the data is spread across multiple disks.
Because it relies on striping alone, if one disk fails then all data in the array is lost.
○ This level doesn't provide fault tolerance but increases system performance.

Example:

Disk 0 Disk 1 Disk 2 Disk 3

20 21 22 23

24 25 26 27

28 29 30 31

32 33 34 35

In this figure, blocks 20, 21, 22 and 23 form a stripe.

In this level, instead of placing just one block on a disk at a time, we can place two
or more blocks on a disk before moving on to the next one.

Disk 0 Disk 1 Disk 2 Disk 3

20 22 24 26

21 23 25 27

28 30 32 34

29 31 33 35

In the above figure, there is no duplication of data. Hence, a block once lost cannot be
recovered.

Pros of RAID 0:
○ In this level, throughput is increased because multiple data requests are probably
not on the same disk.

○ This level fully utilizes the disk space and provides high performance.

○ It requires a minimum of 2 drives.

Cons of RAID 0:

○ It doesn't contain any error detection mechanism.

○ RAID 0 is not a true RAID because it is not fault-tolerant.

○ In this level, failure of either disk results in complete data loss in the respective array.

RAID 1

This level is called mirroring of data as it copies the data from drive 1 to drive 2. It
provides 100% redundancy in case of a failure.

Example:

Disk 0 Disk 1 Disk 2 Disk 3

A A B B

C C D D

E E F F

G G H H

Only half the space of the array is used to store data. The other half of the drives is just a
mirror of the already stored data.

Pros of RAID 1:
○ The main advantage of RAID 1 is fault tolerance. In this level, if one disk fails,
then the other automatically takes over.

○ In this level, the array will function even if any one of the drives fails.

Cons of RAID 1:

○ In this level, one extra drive is required per drive for mirroring, so the expense is
higher.

RAID 2

○ RAID 2 consists of bit-level striping using Hamming code parity. In this level, each
data bit in a word is recorded on a separate disk, and the ECC code of the data
words is stored on a different set of disks.

○ Due to its high cost and complex structure, this level is not commercially used.
This same performance can be achieved by RAID 3 at a lower cost.

Pros of RAID 2:

○ This level uses one designated drive to store parity.

○ It uses the Hamming code for error detection.

Cons of RAID 2:

○ It requires an additional drive for error detection.

RAID 3
○ RAID 3 consists of byte-level striping with dedicated parity. In this level, the parity
information is stored for each disk section and written to a dedicated parity drive.

○ In case of drive failure, the parity drive is accessed, and data is reconstructed
from the remaining devices. Once the failed drive is replaced, the missing data
can be restored on the new drive.

○ In this level, data can be transferred in bulk. Thus high-speed data transmission is
possible.

Disk 0 Disk 1 Disk 2 Disk 3

A B C P(A, B, C)

D E F P(D, E, F)

G H I P(G, H, I)

J K L P(J, K, L)

Pros of RAID 3:

○ In this level, data is regenerated using parity drive.

○ It contains high data transfer rates.

○ In this level, data is accessed in parallel.

Cons of RAID 3:

○ It requires an additional drive for parity.

○ It gives slow performance when operating on small-sized files.


RAID 4

○ RAID 4 consists of block-level striping with a parity disk. Instead of duplicating
data, RAID 4 adopts a parity-based approach.

○ This level allows recovery of at most 1 disk failure, due to the way parity works. In
this level, if more than one disk fails, then there is no way to recover the data.

○ Both level 3 and level 4 require at least three disks to implement RAID.

Disk 0 Disk 1 Disk 2 Disk 3

A B C P0

D E F P1

G H I P2

J K L P3

In this figure, we can observe one disk dedicated to parity.

In this level, parity can be calculated using the XOR function. If the data bits are 0, 1, 0, 0
then the parity bit is XOR(0, 1, 0, 0) = 1. If the data bits are 0, 0, 1, 1 then the parity bit is
XOR(0, 0, 1, 1) = 0. That means an even number of ones results in parity 0 and an odd
number of ones results in parity 1.

C1 C2 C3 C4 Parity

0 1 0 0 1

0 0 1 1 0
Suppose that in the above figure, C2 is lost due to some disk failure. Then using the
values of all the other columns and the parity bit, we can recompute the data bit stored
in C2. This level allows us to recover lost data.
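
A small sketch of the XOR parity idea: the parity value is the XOR of the data values, so any single lost value can be recomputed from the surviving ones. The byte values in the second half are assumed for illustration.

```python
from functools import reduce

def parity(values):
    # XOR of all values; works for single bits and for whole blocks alike.
    return reduce(lambda a, b: a ^ b, values)

c1, c2, c3, c4 = 0, 1, 0, 0            # the data bits from the table above
p = parity([c1, c2, c3, c4])           # 1, matching the Parity column

# Suppose C2 is lost: XOR the surviving bits with the parity bit to recover it.
recovered_c2 = parity([c1, c3, c4, p])
print(p, recovered_c2)                 # 1 1

# The same works byte- or block-wise on whole stripes.
stripe = [0b10110010, 0b01010101, 0b11110000]
p_block = parity(stripe)
print(parity([stripe[0], stripe[2], p_block]) == stripe[1])   # True
```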

RAID 5

○ RAID 5 is a slight modification of the RAID 4 system. The only difference is that in
RAID 5, the parity rotates among the drives.

○ It consists of block-level striping with DISTRIBUTED parity.

○ Same as RAID 4, this level allows recovery of at most 1 disk failure. If more than
one disk fails, then there is no way for data recovery.

Disk 0 Disk 1 Disk 2 Disk 3 Disk 4

0 1 2 3 P0

5 6 7 P1 4

10 11 P2 8 9

15 P3 12 13 14

P4 16 17 18 19

This figure shows how the parity block rotates among the drives.

This level was introduced to make the random write performance better.

Pros of RAID 5:

○ This level is cost effective and provides high performance.

○ In this level, parity is distributed across the disks in an array.


○ It is used to make the random write performance better.

Cons of RAID 5:

○ In this level, disk failure recovery takes a longer time, as parity has to be calculated
from all the available drives.

○ This level cannot survive two concurrent drive failures.

RAID 6

○ This level is an extension of RAID 5. It contains block-level striping with 2 parity
blocks.

○ In RAID 6, you can survive 2 concurrent disk failures. With RAID 5 or RAID 1, when a
disk fails you need to replace it promptly, because if another disk fails
simultaneously you won't be able to recover any of the data. This is where RAID 6
plays its part: you can survive two concurrent disk failures before you run out of
options.

Disk 1 Disk 2 Disk 3 Disk 4

A0 B0 Q0 P0

A1 Q1 P1 D1

Q2 P2 C2 D2

P3 B3 C3 Q3

Pros of RAID 6:
○ This level can survive the failure of any two drives in the array, so it is more
fault-tolerant than RAID 5.

○ Like RAID 5, the parity information is distributed across the drives of the array.

Cons of RAID 6:

● It requires a minimum of 4 drives, and the capacity of two drives is used for parity.

● Write performance is slower than RAID 5, because two parity blocks have to be
calculated for every write.

● Transaction Processing
❖ Transaction concepts
❖ Transaction state
❖ ACID properties

What does a Transaction mean in DBMS?


1. Transaction in Database Management Systems (DBMS) can be defined as a set of
logically related operations.
2. It is the result of a request made by the user to access the contents of the database and
perform operations on it.
3. It consists of various operations and has various states in its completion journey. It also
has some specific properties that must be followed to keep the database consistent.

Operations of Transaction
A user can make different types of requests to access and modify the contents of a database.
So, we have different types of operations relating to a transaction. They are discussed as
follows:

i) Read(X)
1. A read operation is used to read the value of X from the database and store it in a buffer
in the main memory for further actions such as displaying that value.
2. Such an operation is performed when a user wishes just to see any content of the
database and not make any changes to it.
3. For example, when a user wants to check his/her account’s balance, a read operation
would be performed on user’s account balance from the database.
ii) Write(X)
1. A write operation is used to write the value to the database from the buffer in the main
memory.
2. For a write operation to be performed, first a read operation is performed to bring the
value into the buffer. Then some changes are made to it, e.g. a set of arithmetic
operations is performed on it according to the user's request. Finally, to store the modified
value back in the database, a write operation is performed.
3. For example, when a user requests to withdraw some money from his account, his
account balance is fetched from the database using a read operation, then the amount
to be deducted from the account is subtracted from this value, and then the obtained
value is stored back in the database using a write operation.

iii) Commit
1. This operation in transactions is used to maintain integrity in the database. Due to some
failure of power, hardware, or software, etc., a transaction might get interrupted before all
its operations are completed.
2. This may cause ambiguity in the database, i.e. it might get inconsistent before and after
the transaction.
3. To ensure that further operations of any other transaction are performed only after the
work of the current transaction is done, a commit operation is performed to make the
changes of the current transaction permanent in the database.

iv) Rollback
1. This operation is performed to bring the database to the last saved state when any
transaction is interrupted in between due to any power, hardware, or software failure.
2. In simple words, it can be said that a rollback operation does undo the operations of
transactions that were performed before its interruption to achieve a safe state of the
database and avoid any kind of ambiguity or inconsistency.

Transaction Schedules
When multiple transaction requests are made at the same time, we need to decide their order of
execution. Thus, a transaction schedule can be defined as a chronological order of execution of
multiple transactions. There are broadly two types of transaction schedules discussed as
follows,

i) Serial Schedule
1. In this kind of schedule, when multiple transactions are to be executed, they are
executed serially, i.e. at one time only one transaction is executed while others wait for
the execution of the current transaction to be completed.
2. This ensures consistency in the database as transactions do not execute
simultaneously.
3. But, it increases the waiting time of the transactions in the queue, which in turn lowers
the throughput of the system, i.e. number of transactions executed per time.
4. To improve the throughput of the system, another kind of schedule is used, which has
some stricter rules that help the database remain consistent even when
transactions execute simultaneously.

ii) Non-Serial Schedule


1. To reduce the waiting time of transactions in the waiting queue and improve the system
efficiency, we use nonserial schedules which allow multiple transactions to start before a
transaction is completely executed.
2. This may sometimes result in inconsistency and errors in database operation. So, these
errors are handled with specific algorithms to maintain the consistency of the database
and improve CPU throughput as well.
3. Non-serial schedules are also sometimes referred to as parallel schedules, as transactions
execute in parallel in this kind of schedule.

Serializable
1. Serializability in DBMS is the property of a nonserial schedule that determines whether it
would maintain the database consistency or not.
2. The nonserial schedule which ensures that the database would be consistent after the
transactions are executed in the order determined by that schedule is said to be
Serializable Schedules.
3. The serial schedules always maintain database consistency, as a transaction starts only
when the execution of the other transaction has been completed.
4. Thus, serial schedules are always serializable.
5. A transaction is a series of operations, so various states occur in its completion journey.
They are discussed as follows:

i) Active
1. It is the first stage of any transaction when it has begun to execute. The execution of the
transaction takes place in this state.
2. Operations such as insertion, deletion, or updation are performed during this state.
3. During this state, the data records are under manipulation and they are not saved to the
database, rather they remain somewhere in a buffer in the main memory.

ii) Partially Committed


1. This state of transaction is achieved when it has completed most of the operations and is
executing its final operation.
2. It can be a signal to the commit operation, as after the final operation of the transaction
completes its execution, the data has to be saved to the database through the commit
operation.
3. If some kind of error occurs during this state, the transaction goes into a failed state,
else it goes into the Committed state.

iii) Committed
1. This state of transaction is achieved when all the transaction-related operations have
been executed successfully along with the Commit operation, i.e. data is saved into the
database after the required manipulations in this state.
2. This marks the successful completion of a transaction.

iv) Failed
1. If any of the transaction-related operations cause an error during the active or partially
committed state, further execution of the transaction is stopped and it is brought into a
failed state.
2. Here, the database recovery system makes sure that the database is in a consistent
state.

v) Aborted
1. If the error is not resolved in the failed state, then the transaction is aborted and a
rollback operation is performed to bring the database to the last saved consistent state.
2. When the transaction is aborted, the database recovery module either restarts the
transaction or kills it.
3. The illustration below shows the various states that a transaction may encounter in its
completion journey.

Figure: States of a transaction in DBMS

Properties of Transaction
1. As transactions deal with accessing and modifying the contents of the database, they
must have some basic properties which help maintain the consistency and integrity of
the database before and after the transaction.
2. Transactions follow 4 properties, namely, Atomicity, Consistency, Isolation, and
Durability.
3. Generally, these are referred to as ACID properties of transactions in DBMS. ACID is the
acronym used for transaction properties.
4. A brief description of each property of the transaction is as follows.

i) Atomicity
1. This property ensures that either all operations of a transaction are executed or it is
aborted.
2. In any case, a transaction can never be completed partially.
3. Each transaction is treated as a single unit (like an atom). Atomicity is achieved through
commit and rollback operations, i.e. changes are made to the database only if all
operations related to a transaction are completed, and if it gets interrupted, any changes
made are rolled back using rollback operation to bring the database to its last saved
state.

ii) Consistency
1. This property of a transaction keeps the database consistent before and after a
transaction is completed.
2. Execution of any transaction must ensure that after its execution, the database is either
in its prior stable state or a new stable state.
3. In other words, the result of a transaction should be the transformation of a database
from one consistent state to another consistent state.
4. Consistency, here means, that the changes made in the database are a result of logical
operations only which the user desired to perform and there is not any ambiguity.

iii) Isolation
1. This property states that two transactions must not interfere with each other, i.e. if some
data is used by a transaction for its execution, then any other transaction can not
concurrently access that data until the first transaction has completed.
2. It ensures that the integrity of the database is maintained and we don’t get any
ambiguous values. Thus, any two transactions are isolated from each other.
3. This property is enforced by the concurrency control subsystem of DBMS.

iv) Durability
1. This property ensures that the changes made to the database after a transaction is
completely executed, are durable.
2. It indicates that permanent changes are made by the successful execution of a
transaction.
3. In the event of any system failures or crashes, the consistent state achieved after the
completion of a transaction remains intact.
4. The recovery subsystem of DBMS is responsible for enforcing this property.
❖ Concurrent execution
❖ Problem with concurrent execution
❖ Concurrency Control
❖ Serializability of schedule
➔ Types of schedule
➔ Cascading rollback
➔ Conflict operation
➔ What is serializability
➔ Testing for serializability
➔ Algorithm for creation of graphs
➔ View serializability

Concurrent Execution in DBMS

○ In a multi-user system, multiple users can access and use the same database at
one time, which is known as concurrent execution of the database. It means
that the same database is accessed simultaneously by different users on a
multi-user system.

○ While working on database transactions, there may be a requirement for multiple
users to use the database to perform different operations, and in that case,
concurrent execution of the database is performed.

○ The simultaneous execution should be done in an interleaved manner, and no
operation should affect the other executing operations, thus maintaining the
consistency of the database. However, concurrent execution of transaction
operations gives rise to several challenging problems that need to be solved.

Problems with Concurrent Execution


In a database transaction, the two main operations are READ and WRITE. These two
operations need to be managed during concurrent execution of transactions, because if
they are not interleaved carefully, the data may become inconsistent. The following
problems occur with concurrent execution of the operations:

Problem 1: Lost Update Problems (W - W Conflict)

The problem occurs when two different database transactions perform the read/write
operations on the same database items in an interleaved manner (i.e., concurrent
execution) that makes the values of the items incorrect hence making the database
inconsistent.

For example:

Consider the below diagram where two transactions TX and TY, are performed on the
same account A where the balance of account A is $300.

○ At time t1, transaction TX reads the value of account A, i.e., $300 (only read).

○ At time t2, transaction TX deducts $50 from account A that becomes $250 (only
deducted and not updated/write).
○ Alternately, at time t3, transaction TY reads the value of account A that will be
$300 only because TX didn't update the value yet.

○ At time t4, transaction TY adds $100 to account A that becomes $400 (only
added but not updated/write).

○ At time t6, transaction TX writes the value of account A that will be updated as
$250 only, as TY didn't update the value yet.

○ Similarly, at time t7, transaction TY writes the values of account A, so it will write
as done at time t4 that will be $400. It means the value written by TX is lost, i.e.,
$250 is lost.

Hence the data becomes incorrect and the database becomes inconsistent. The sketch below replays this schedule.
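
This sketch uses plain variables as a stand-in for the shared database value and each transaction's local buffer; the interleaving follows the timeline above.

```python
# Replaying the interleaved schedule: each transaction works on its own
# local copy, and only the final writes reach the database.
account_A = 300          # shared database value

tx_local = account_A     # t1: TX reads A ($300)
tx_local -= 50           # t2: TX deducts $50 in its buffer only ($250)
ty_local = account_A     # t3: TY reads A, still $300 (TX has not written yet)
ty_local += 100          # t4: TY adds $100 in its buffer only ($400)
account_A = tx_local     # t6: TX writes $250
account_A = ty_local     # t7: TY writes $400, overwriting TX's update

print(account_A)         # 400 -- the $50 deduction made by TX is lost
```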

Dirty Read Problems (W-R Conflict)

The dirty read problem occurs when one transaction updates an item of the database,
and somehow the transaction fails, and before the data gets rollback, the updated
database item is accessed by another transaction. There comes the Read-Write Conflict
between both transactions.

For example:

Consider two transactions TX and TY in the below diagram performing read/write


operations on account A where the available balance in account A is $300:
○ At time t1, transaction TX reads the value of account A, i.e., $300.

○ At time t2, transaction TX adds $50 to account A that becomes $350.

○ At time t3, transaction TX writes the updated value in account A, i.e., $350.

○ Then at time t4, transaction TY reads account A that will be read as $350.

○ Then at time t5, transaction TX rolls back due to a server problem, and the value
changes back to $300 (as initially).

○ But transaction TY has already read the uncommitted value $350 of account A.
This is a dirty read, and the situation is therefore known as the Dirty Read Problem.

Unrepeatable Read Problem (R-W Conflict)

Also known as Inconsistent Retrievals Problem that occurs when in a transaction, two
different values are read for the same database item.

For example:

Consider two transactions, TX and TY, performing the read/write operations on account
A, having an available balance = $300. The diagram is shown below:
○ At time t1, transaction TX reads the value from account A, i.e., $300.

○ At time t2, transaction TY reads the value from account A, i.e., $300.

○ At time t3, transaction TY updates the value of account A by adding $100 to the
available balance, and then it becomes $400.

○ At time t4, transaction TY writes the updated value, i.e., $400.

○ After that, at time t5, transaction TX reads the available value of account A, and
that will be read as $400.

○ It means that within the same transaction TX, it reads two different values of
account A, i.e., $ 300 initially, and after updation made by transaction TY, it reads
$400. It is an unrepeatable read and is therefore known as the Unrepeatable read
problem.

Thus, in order to maintain consistency in the database and avoid such problems in
concurrent execution, management is needed, and that is where the concept of
Concurrency Control comes into play.
Concurrency Control
Concurrency Control is the working concept that is required for controlling and
managing the concurrent execution of database operations and thus avoiding the
inconsistencies in the database. Thus, for maintaining the concurrency of the database,
we have the concurrency control protocols.

Concurrency Control Protocols

The concurrency control protocols ensure the atomicity, consistency, isolation, durability
and serializability of the concurrent execution of the database transactions. Therefore,
these protocols are categorized as:

○ Lock Based Concurrency Control Protocol

○ Time Stamp Concurrency Control Protocol

○ Validation Based Concurrency Control Protocol

❖ Locking and timestamp based schedulers


❖ Multiversion and optimistic Concurrency control schemes
➔ Lock based protocols
★ Two phase locking protocols
★ Validation of two phase locking protocol
★ Strict 2pl
★ Lock conversion
★ Automatic acquisition of locks
★ Implementation of locks

Lock-Based Protocol
In this type of protocol, a transaction cannot read or write data until it acquires an
appropriate lock on it. There are two types of lock:

1. Shared lock:
○ It is also known as a Read-only lock. Under a shared lock, the data item can only be
read by the transaction.

○ It can be shared between transactions, because a transaction holding a shared
lock cannot update the data item.

2. Exclusive lock:

○ Under an exclusive lock, the data item can be both read and written by the
transaction.

○ This lock is exclusive: multiple transactions cannot hold it on the same data item
at the same time, so they cannot modify the same data simultaneously.

There are four types of lock protocols available:

1. Simplistic lock protocol

It is the simplest way of locking data during a transaction. Simplistic lock-based
protocols allow every transaction to get a lock on the data before an insert, delete or
update on it. The data item is unlocked after the transaction completes.

2. Pre-claiming Lock Protocol

○ Pre-claiming Lock Protocols evaluate the transaction to list all the data items on
which they need locks.

○ Before initiating an execution of the transaction, it requests DBMS for all the lock
on all those data items.

○ If all the locks are granted then this protocol allows the transaction to begin.
When the transaction is completed then it releases all the lock.

○ If all the locks are not granted, then this protocol rolls the transaction back and
waits until all the locks are granted.
3. Two-phase locking (2PL)

○ The two-phase locking protocol divides the execution phase of the transaction
into three parts.

○ In the first part, when the execution of the transaction starts, it seeks permission
for the lock it requires.

○ In the second part, the transaction acquires all the locks. The third phase is
started as soon as the transaction releases its first lock.

○ In the third phase, the transaction cannot demand any new locks. It only releases
the acquired locks.
There are two phases of 2PL:

Growing phase: In the growing phase, a new lock on the data item may be acquired by
the transaction, but none can be released.

Shrinking phase: In the shrinking phase, existing locks held by the transaction may be
released, but no new locks can be acquired.

In the example below, if lock conversion is allowed then the following conversions can
happen:

1. Upgrading of a lock (from S(a) to X(a)) is allowed in the growing phase.

2. Downgrading of a lock (from X(a) to S(a)) must be done in the shrinking phase.

Example:
The following way shows how unlocking and locking work with 2-PL.

Transaction T1:

○ Growing phase: from step 1-3

○ Shrinking phase: from step 5-7

○ Lock point: at 3

Transaction T2:

○ Growing phase: from step 2-6

○ Shrinking phase: from step 8-9

○ Lock point: at 6
4. Strict Two-phase locking (Strict-2PL)

○ The first phase of Strict-2PL is similar to 2PL. In the first phase, after acquiring all
the locks, the transaction continues to execute normally.

○ The only difference between 2PL and strict 2PL is that Strict-2PL does not
release a lock after using it.

○ Strict-2PL waits until the whole transaction commits, and then it releases all the
locks at once.

○ The Strict-2PL protocol therefore does not have a shrinking phase of gradual lock release.

It does not suffer from cascading aborts as 2PL does. A sketch of a simple lock manager with the 2PL rule follows.
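
A minimal sketch of a lock table with shared and exclusive locks plus the 2PL rule that a transaction which has released any lock may not acquire new ones; the class and method names are hypothetical, and blocked requests simply return False instead of waiting.

```python
class LockManager:
    def __init__(self):
        self.locks = {}        # item -> (mode, set of transaction ids)
        self.shrinking = set() # transactions that have started releasing locks

    def acquire(self, txn, item, mode):            # mode is "S" or "X"
        if txn in self.shrinking:
            raise RuntimeError(f"{txn} violates 2PL: lock requested after unlock")
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, owners = held
        if mode == "S" and held_mode == "S":
            owners.add(txn)                        # shared locks are compatible
            return True
        if owners == {txn}:                        # lock upgrade S -> X by sole owner
            self.locks[item] = (mode, owners)
            return True
        return False                               # incompatible: caller must wait

    def release(self, txn, item):
        self.shrinking.add(txn)                    # txn enters its shrinking phase
        mode, owners = self.locks[item]
        owners.discard(txn)
        if not owners:
            del self.locks[item]

lm = LockManager()
print(lm.acquire("T1", "A", "S"))   # True
print(lm.acquire("T2", "A", "S"))   # True  (shared with T1)
print(lm.acquire("T2", "A", "X"))   # False (T1 also holds a shared lock)
lm.release("T1", "A")
try:
    lm.acquire("T1", "B", "X")
except RuntimeError as e:
    print(e)                        # T1 violates 2PL: lock requested after unlock
```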

➔ Timestamp protocols

Timestamp Ordering Protocol


○ The Timestamp Ordering Protocol is used to order the transactions based on
their Timestamps. The order of transaction is nothing but the ascending order of
the transaction creation.
○ The older transaction has the higher priority, which is why it executes first. To
determine the timestamp of a transaction, this protocol uses system time or a
logical counter.

○ The lock-based protocol manages the order between conflicting pairs among
transactions at execution time, whereas timestamp-based protocols start
working as soon as a transaction is created.

○ Let's assume there are two transactions T1 and T2. Suppose transaction T1
entered the system at time 007 and transaction T2 entered the system at time
009. T1 has the higher priority, so it executes first, as it entered the system first.

○ The timestamp ordering protocol also maintains the timestamp of the last 'read'
and 'write' operation on each data item.

Basic Timestamp ordering protocol works as follows:

1. Check the following conditions whenever a transaction Ti issues a Read(X) operation:

○ If W_TS(X) > TS(Ti), then the operation is rejected and Ti is rolled back.

○ If W_TS(X) <= TS(Ti), then the operation is executed and R_TS(X) is updated to
max(R_TS(X), TS(Ti)).

2. Check the following conditions whenever a transaction Ti issues a Write(X) operation:

○ If TS(Ti) < R_TS(X), then the operation is rejected and Ti is rolled back.

○ If TS(Ti) < W_TS(X), then the operation is rejected and Ti is rolled back; otherwise
the operation is executed and W_TS(X) is set to TS(Ti).

Where,

TS(Ti) denotes the timestamp of the transaction Ti.

R_TS(X) denotes the Read time-stamp of data-item X.


W_TS(X) denotes the Write time-stamp of data-item X.
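
A sketch of the basic timestamp-ordering checks above, keeping TS, R_TS and W_TS in plain dictionaries; a rejected operation is represented by returning "rollback", and the example timestamps 7 and 9 echo the 007/009 example earlier.

```python
TS = {"T1": 7, "T2": 9}              # transaction timestamps (older = smaller)
R_TS, W_TS = {}, {}                  # per-item read/write timestamps

def read(txn, x):
    if W_TS.get(x, 0) > TS[txn]:     # X was written by a younger transaction
        return "rollback"
    R_TS[x] = max(R_TS.get(x, 0), TS[txn])
    return "ok"

def write(txn, x):
    if TS[txn] < R_TS.get(x, 0) or TS[txn] < W_TS.get(x, 0):
        return "rollback"
    W_TS[x] = TS[txn]
    return "ok"

print(write("T2", "X"))   # ok        W_TS(X) = 9
print(read("T1", "X"))    # rollback  because W_TS(X) = 9 > TS(T1) = 7
print(read("T2", "X"))    # ok        R_TS(X) = 9
print(write("T1", "X"))   # rollback  because TS(T1) = 7 < R_TS(X) = 9
```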

Advantages and Disadvantages of TO protocol:

○ The TO protocol ensures serializability, since the precedence graph only has edges
from older to younger transactions and therefore contains no cycles.

➔ The TO protocol ensures freedom from deadlock, which means no
transaction ever waits.

➔ But the schedule may not be recoverable and may not even be
cascade-free.

➔ Validation based protocols

Validation Based Protocol


The validation-based protocol is also known as the optimistic concurrency control
technique. In the validation-based protocol, a transaction is executed in the following
three phases:

1. Read phase: In this phase, the transaction T reads the values of the various data
items and stores them in temporary local variables. It performs all its write
operations on these temporary variables without updating the actual database.
2. Validation phase: In this phase, the temporary variable values are validated
against the actual data to check whether serializability would be violated.

3. Write phase: If the transaction passes validation, then the temporary results are
written to the database; otherwise, the transaction is rolled back.

Here each phase has the following different timestamps:

Start(Ti): It contains the time when Ti started its execution.

Validation (Ti): It contains the time when Ti finishes its read phase and starts its
validation phase.

Finish(Ti): It contains the time when Ti finishes its write phase.

➔ This protocol determines the timestamp used to serialize a
transaction from the timestamp of its validation phase, as that is
the phase which actually decides whether the transaction will
commit or roll back.

➔ Hence TS(T) = Validation(T).

➔ Serializability is determined during the validation process; it
cannot be decided in advance.

➔ While executing the transactions, this approach allows a greater
degree of concurrency and fewer conflicts.

➔ Thus it results in fewer transaction rollbacks. A sketch of the validation test follows.
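
A sketch of the validation test applied to a transaction Ti against every transaction Tj that validated before it, using the Start, Validation and Finish timestamps defined above; the read/write sets and timestamp values are hypothetical.

```python
def validate(Ti, committed):
    """Return True if Ti may commit, False if it must be rolled back."""
    for Tj in committed:
        if Tj["finish"] < Ti["start"]:
            continue                          # Tj finished before Ti even started
        if (Tj["finish"] < Ti["validation"]
                and not (Tj["write_set"] & Ti["read_set"])):
            continue                          # Tj wrote nothing that Ti read
        return False                          # otherwise Ti fails validation
    return True

T1 = {"start": 1, "validation": 4, "finish": 5,
      "read_set": {"A"}, "write_set": {"A"}}
T2 = {"start": 2, "validation": 6, "finish": 8,
      "read_set": {"A", "B"}, "write_set": {"B"}}

print(validate(T2, [T1]))     # False: T1 wrote A, which T2 also read
```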

➔ Deadlock handling
➔ Strategies
➔ Prevention
➔ Detection
DO IT FROM MA’AM’S PDF
➔ Thomas write rule

Thomas write Rule


Thomas Write Rule provides the guarantee of serializability order for the protocol. It
improves the Basic Timestamp Ordering Algorithm.

The basic Thomas write rules are as follows:

○ If TS(T) < R_TS(X), then transaction T is aborted and rolled back, and the operation is
rejected.

○ If TS(T) < W_TS(X), then do not execute the Write(X) operation of the transaction;
simply ignore it and continue processing.

○ If neither condition 1 nor condition 2 occurs, then transaction T is allowed to execute
the Write(X) operation, and W_TS(X) is set to TS(T).
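
A sketch contrasting the Thomas write rule with basic timestamp ordering: an outdated write is ignored rather than causing a rollback. The timestamps and item names are hypothetical.

```python
TS = {"T1": 5, "T2": 10}
R_TS, W_TS = {"X": 0}, {"X": 0}

def write_thomas(txn, x):
    if TS[txn] < R_TS[x]:
        return "rollback"          # a younger transaction already read X
    if TS[txn] < W_TS[x]:
        return "ignored"           # outdated write: skip it and continue
    W_TS[x] = TS[txn]
    return "ok"

print(write_thomas("T2", "X"))     # ok, W_TS(X) = 10
print(write_thomas("T1", "X"))     # ignored (basic TO would roll T1 back)
```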

If we use the Thomas write rule, then some serializable schedules can be permitted that
are not conflict serializable, as illustrated by the schedule in the figure below:

Figure: A Serializable Schedule that is not Conflict Serializable

In the above figure, T2's write of the data item comes after T1's read and before T1's
write of the same data item. This schedule is not conflict serializable.

The Thomas write rule notices that T2's write is never seen by any transaction. If we
delete the write operation of transaction T2, then a conflict serializable schedule is
obtained, which is shown in the figure below.
Figure: A Conflict Serializable Schedule

➔ Validation test for transactions


➔ Ordering
➔ Multiversion schemes
➔ Locking
➔ Snapshot Isolation
➔ Benefits of Snapshot Isolation
➔ Snapshot isolation and anomalies
➔ SI in postgres and Oracle
DO IT FROM MA’AM’S PDF
❖ Database recovery
DO IT FROM MA’AM’S PDF
❖ Shadow paging
DO IT FROM MA’AM’S PDF
❖ Advantage of shadow paging
DO IT FROM MA’AM’S PDF
❖ Questions from ma’am’s ppt
