DDBMS Notes

Uploaded by Awais Gujjar

DISTRIBUTED DATABASE

A distributed database is a database under the control of a central DBMS in which not all storage devices are attached to a common CPU. It may be stored on multiple computers located at the same physical site, or dispersed over a network of interconnected computers. A collection of data (e.g., a database) can thus be distributed across multiple physical locations. In short, a distributed database is a logically interrelated collection of shared data, and a description of this data, that is physically distributed over a computer network.

A distributed database system allows applications to access data items from both local and remote databases. Applications are classified into two categories depending on whether their transactions access data from the local site only or from remote sites as well.

Local applications – These applications require access to local data only and do not require data from more than one site.

Global applications – These applications require access to data at remote sites in the distributed system.

FEATURES OF DISTRIBUTED DBMS


A DDBMS has the following features:

 A collection of logically related shared data.


 The data is split into a number of fragments.
 Fragments may be replicated.
 Fragments/replicas are allocated to sites.
 The sites are linked by a communications network.
 The data at each site is under the control of a DBMS.
 The DBMS at each site can handle local applications, autonomously.
 Each DBMS participates in at least one global application.

TYPES OF DDBMS:
1. Homogeneous Database:

In a homogeneous distributed database, all sites store the database identically. The operating system, database management system, and data structures used are the same at all sites. Hence, such systems are easy to manage.

2. Heterogeneous Database:

In a heterogeneous distributed database, different sites can use different schemas and software, which can lead to problems in query processing and transaction management. A particular site might even be completely unaware of the other sites. Different computers may use different operating systems and different database applications; they may even use different data models. Hence, translations are required for different sites to communicate.

APPLICATIONS OF DISTRIBUTED DATABASES

 Corporate management information systems.
 Multimedia applications.
 Military control systems, hotel chains, etc.
 Manufacturing control systems.
ADVANTAGES OF DISTRIBUTED DATABASE SYSTEM
1. Improved availability – In a centralized DBMS, a computer failure terminates the operations of the DBMS. However, a failure at one site of a DDBMS, or a failure of a communication link that makes some sites inaccessible, does not bring down the entire system. Distributed DBMSs are designed to continue functioning despite such failures. If a single node fails, the system may be able to reroute the failed node's requests to another site.
2. Improved reliability – As data may be replicated so that it exists at more than one site, the failure of a node or a communication link does not necessarily make the data inaccessible.

3. Improved performance – As the data is located near the site of greatest demand, the speed of database access may be better than that achievable from a remote centralized database. Furthermore, since each site handles only a part of the entire database, there is not the same contention for CPU and I/O services as in a centralized DBMS.
4. Local control – The data is distributed in such a way that every portion of it is local to some site (server). The site at which a portion of the data is stored is the owner of that data.

5. Modular growth – Growth is easier. We do not need to interrupt any functioning site to add a new site, so the whole system can expand easily. Removal of a site also does not cause many problems.

6. Lower communication costs (more economical) – Data are distributed in such a way that they are available near the locations where they are needed most. This reduces communication costs considerably compared to a centralized system.

7. Faster response – Most of the data is local and in close proximity to where it is needed. Hence, requests can be answered quickly compared to a centralized system.

8. Secured management of distributed data – Various transparencies, such as network transparency, fragmentation transparency, and replication transparency, are implemented to hide the actual implementation details of the whole distributed system. In this way, a distributed database provides security for data.

9. Robustness – The system continues to work in case of failures. For example, a replicated distributed database keeps performing in spite of the failure of some sites.

10. Compliance with ACID properties – Distributed transactions demand Atomicity, Consistency, Isolation, and Durability.

DISADVANTAGES OF DISTRIBUTED DATABASE SYSTEMS


1. Complex software – Costs more in terms of software compared to a centralized system; additional software may be needed in most cases.

2. Increased processing overhead – Many messages must be exchanged between sites to complete a distributed transaction.

3. Data integrity – Maintaining data integrity becomes complex, and too many network resources may be used.
4. Deadlock handling – Deadlocks are more difficult to handle than in a centralized system.

5. Lack of experience – General-purpose distributed DBMSs have not been widely accepted, although many of the protocols and problems are well understood. Consequently, we do not yet have the same level of industry experience as we have with centralized DBMSs.

6. Security – Data shared between sites over networks is vulnerable to attack. Hence, network-oriented security protocols must be used, based on the sensitivity of the data shared.
7. More complex database design – Depending on the applications, we may need to fragment a database, replicate it, or both.
8. Difficult failure handling – In some cases, we may not be able to distinguish between a site failure, a network partition, and a link failure.
ALTERNATIVE STRATEGIES FOR DATA ALLOCATION
Four alternative strategies have been identified for data allocation.

1. CENTRALIZED

In this strategy, the system consists of a single database and DBMS stored at one site, with users distributed across the communication network. Locality of reference is lowest at all sites except the central site, where the data is stored. Communication costs are very high, since all users except those at the central site must use the network for all data accesses. Reliability and availability are very low, since a failure of the central site results in the loss of the entire database system.
2. FRAGMENTED (OR PARTITIONED)
This strategy partitions the entire database into disjoint fragments, with each fragment assigned to one site. Fragments are not replicated. If fragments are stored at the sites where they are used most frequently, locality of reference is high. As there is no replication of data, storage costs are low. Reliability and availability are also low, but still higher than in the centralized strategy, as the failure of a site results in the loss of local data only. Communication costs are incurred only for global transactions. If the data distribution is designed properly, performance should be good and communication costs low.

3. COMPLETE REPLICATION
In this strategy, each site of the system maintains a complete copy of the entire database. Since all data are available at all sites, locality of reference, availability, reliability, and retrieval performance are maximized. Storage costs are very high in this case, and although no communication costs are incurred for global read-only transactions, communication costs for updating data items are the most expensive. To overcome this problem, snapshots are sometimes used. A snapshot is a copy of the data at a given time. The copies are updated periodically, so they may not always be up to date. Snapshots are also sometimes used to implement views in a distributed database, to reduce the time taken to perform a database operation on a view.

4. SELECTIVE REPLICATION
This strategy is a combination of centralized, fragmented, and complete replication strategies. In
this approach, some of the data items are fragmented and allocated to the sites where they are
used frequently, to achieve high localization of reference. Some of the data items or fragments
of the data items that are used by many sites simultaneously but not frequently updated, are
replicated and stored at all these different sites. The data items that are not used frequently are
centralized.

Difference between a centralized database and a distributed database:

1. Definition
Centralized: A database that is stored, located, and maintained at a single location only.
Distributed: A database consisting of multiple databases that are connected with each other and spread across different physical locations.

2. Access time
Centralized: The data access time for multiple users is higher.
Distributed: The data access time for multiple users is lower.

3. Management of data
Centralized: Management, modification, and backup of the database are easier, as the entire data is present at the same location.
Distributed: Management, modification, and backup of the database are very difficult, as it is spread across different physical locations.

4. View
Centralized: Provides a uniform and complete view to the user.
Distributed: Since the data is spread across different locations, it is difficult to provide a uniform view to the user.

5. Data consistency
Centralized: Has greater data consistency than a distributed database.
Distributed: May involve data replication, so data consistency is lower.

6. Failure
Centralized: Users cannot access the database if a database failure occurs.
Distributed: If one database fails, users still have access to the other databases.

7. Cost
Centralized: Less costly.
Distributed: Very expensive.

8. Maintenance
Centralized: Easy to maintain, because all data and information is available at a single location and is thus easy to reach and access.
Distributed: Difficult to maintain, because data and information are distributed at varied places; data redundancy issues and data consistency must be managed.

9. Efficiency
Centralized: Less efficient, as data finding becomes quite complex because all data and information is stored at one particular place.
Distributed: More efficient, because splitting the data across several places makes data finding simple and less time-consuming.

10. Response speed
Centralized: The response speed is lower than that of a distributed database.
Distributed: The response speed is higher than that of a centralized database.

11. Advantages
Centralized: Integrity of data; security; easy access to all information; data is easily portable.
Distributed: High performance because of the division of workload; high availability because of the readiness of available nodes to do work; independent nodes and better control over resources.

12. Disadvantages
Centralized: Data searching takes time; if the centralized server fails, the whole database is lost; if multiple users try to access the data at the same time, it may create issues.
Distributed: Quite large and complex, so difficult to use and maintain; difficult to provide security; issues of data integrity; increased storage and infrastructure requirements; handling failures is a quite difficult task.

13. Examples
Centralized: A desktop or server CPU; a mainframe computer.
Distributed: Apache Ignite, Apache Cassandra, Apache HBase, Amazon SimpleDB, Clusterpoint, FoundationDB.

Difference between parallel computing and distributed computing:

1. Parallel: Many operations are performed simultaneously.
   Distributed: System components are located at different locations.
2. Parallel: A single computer is required.
   Distributed: Uses multiple computers.
3. Parallel: Multiple processors perform multiple operations.
   Distributed: Multiple computers perform multiple operations.
4. Parallel: May have shared or distributed memory.
   Distributed: Has only distributed memory.
5. Parallel: Processors communicate with each other through a bus.
   Distributed: Computers communicate with each other through message passing.
6. Parallel: Improves system performance.
   Distributed: Improves system scalability, fault tolerance, and resource-sharing capabilities.

Date’s Twelve Rules for Distributed Database Systems:

A discussion of DDBMSs is incomplete without Date’s twelve rules. A DBMS that follows all of these rules is a fully distributed DBMS.
The rules are as follows:
1. Local Autonomy or Local Site Independence –

Each site performs its own operations and also acts as an independent, autonomous, centralized DBMS. Each site is responsible for security, concurrency control, backup, and recovery.

2. Central Site Independence –

All sites are equal, and no site depends on a central site to perform any service; there is no single site without which the system cannot operate. Services such as transaction management, query optimization, deadlock detection, and management of the global system catalog do not require a central server.

3. Continuous Operation –

A site failure does not affect the system. The system continues its operation even in the case of a site failure or an expansion of the network.

4. Location Independence –
To retrieve any data in the system, there is no need to know where the data is stored.

5. Fragmentation Independence –

The user sees only one single logical database; data fragmentation is transparent to the user. To retrieve any fragment of the database, there is no need to know the names of the database fragments.

6. Replication Independence –

Data can be replicated and stored at different sites. The DDBMS manages all replicas transparently to the user.

7. Distributed Query Independence –

A single query may access data at multiple locations transparently. Query optimization is therefore crucial and is performed transparently by the DDBMS.

8. Distributed Transaction Independence –

A transaction can update data at different sites transparently; recovery and concurrency control are achieved by using agents.
9. Hardware Independence –

It should be possible for DDBMS to run on different hardware platforms.

10. Operating System Independence –

It should be possible for DDBMS to run on different Operating system platforms.

11. Network Independence –

The DDBMS should be able to run on any network platform.

12. Database Independence –

The system must support database products from any vendor.

FRAGMENTATION IN DISTRIBUTED DBMS


Fragmentation is the process of dividing the whole database into various sub-tables or sub-relations so that data can be stored in different systems. These small pieces of sub-relations or sub-tables are called fragments. Fragments are logical data units and are stored at various sites. It must be ensured that the fragments can be used to reconstruct the original relation (i.e., there is no loss of data).
In the fragmentation process, if a table T is fragmented into a number of fragments, say T1, T2, T3, ..., TN, the fragments must contain sufficient information to allow the restoration of the original table T. This restoration can be done by a UNION or JOIN operation on the fragments. All fragments are independent, meaning no fragment can be derived from the others. Users need not be concerned with fragmentation, i.e., they should not have to know that the data is fragmented; this is called fragmentation independence or fragmentation transparency.
Advantages :
 As the data is stored close to the usage site, the efficiency of the database system will increase
 Local query optimization methods are sufficient for some queries as the data is available locally
 In order to maintain the security and privacy of the database system, fragmentation is advantageous
Disadvantages :
 Access times may be very high if data from different fragments is needed
 If we are using recursive fragmentation, then it will be very expensive
We have three methods for data fragmenting of a table:
 Horizontal fragmentation
 Vertical fragmentation
 Mixed or Hybrid fragmentation
Let’s discuss them one by one.

Horizontal fragmentation –
Horizontal fragmentation refers to the process of dividing a table horizontally by assigning each row (or a group of rows) of the relation to one or more fragments. These fragments can then be assigned to different sites in the distributed system. Some of the rows or tuples of the table are placed in one system and the rest are placed in other systems. The rows that belong to a horizontal fragment are specified by a condition on one or more attributes of the relation. In relational algebra, a horizontal fragment of table T can be represented as follows:
σp(T)
where σ is the relational-algebra operator for selection and p is the condition satisfied by the horizontal fragment.

Note that a union operation can be performed on the fragments to construct table T. Such a fragment
containing all the rows of table T is called a complete horizontal fragment.
For example, consider an EMPLOYEE table (T) :

Eno Ename Design Salary Dep

101 A abc 3000 1

102 B abc 4000 1

103 C abc 5500 2

104 D abc 5000 2

105 E abc 2000 2

This EMPLOYEE table can be divided into different fragments like:


EMP1 = σDep=1(EMPLOYEE)
EMP2 = σDep=2(EMPLOYEE)
These two fragments are: T1, the fragment for Dep = 1:

Eno Ename Design Salary Dep

101 A abc 3000 1

102 B abc 4000 1

Similarly, the T2 fragment on the basis of Dep = 2 will be :

Eno Ename Design Salary Dep

103 C abc 5500 2

104 D abc 5000 2

105 E abc 2000 2

Now, here it is possible to get back T as T = T1 ∪ T2 ∪…. ∪ TN
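The selection-based fragmentation and union-based reconstruction above can be sketched in Python. This is an illustrative sketch, not part of the notes: tables are modeled as lists of dictionaries, and `select` is a hypothetical helper standing in for the relational-algebra σ operator.

```python
# Horizontal fragmentation sketch: rows go to fragments according to a
# selection predicate, and the original table is rebuilt with a union.
EMPLOYEE = [
    {"Eno": 101, "Ename": "A", "Design": "abc", "Salary": 3000, "Dep": 1},
    {"Eno": 102, "Ename": "B", "Design": "abc", "Salary": 4000, "Dep": 1},
    {"Eno": 103, "Ename": "C", "Design": "abc", "Salary": 5500, "Dep": 2},
    {"Eno": 104, "Ename": "D", "Design": "abc", "Salary": 5000, "Dep": 2},
    {"Eno": 105, "Ename": "E", "Design": "abc", "Salary": 2000, "Dep": 2},
]

def select(table, predicate):
    """Relational-algebra selection sigma_p(T)."""
    return [row for row in table if predicate(row)]

EMP1 = select(EMPLOYEE, lambda r: r["Dep"] == 1)  # fragment for Dep = 1
EMP2 = select(EMPLOYEE, lambda r: r["Dep"] == 2)  # fragment for Dep = 2

# Reconstruction: T = T1 UNION T2 (the fragments are disjoint).
reconstructed = EMP1 + EMP2
```

Because the predicates partition the rows, concatenating the fragments recovers every original tuple exactly once.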


Vertical Fragmentation
Vertical fragmentation refers to the process of decomposing a table vertically by attributes, i.e., columns. In this fragmentation, some of the attributes are stored in one system and the rest in other systems, because each site may not need all columns of the table. To allow restoration, each fragment must contain the primary key field(s) of the table, so that the table can be rebuilt from the fragments by a natural JOIN operation. To make this possible, a special attribute called Tuple_id can be added to the schema (any super key may serve this purpose), and the tuples or rows can then be linked back together. The projection is as follows:
πa1, a2,…, an (T)
where, π is relational algebra operator
a1…., an are the attributes of T
T is the table (relation)

For example, for the EMPLOYEE table we have T1 as :

Eno Ename Design Tuple_id

101 A abc 1

102 B abc 2

103 C abc 3

104 D abc 4

105 E abc 5

For the second sub-table of the relation after vertical fragmentation, T2 is given as follows:

Salary Dep Tuple_id

3000 1 1

4000 1 2

5500 2 3

5000 2 4

2000 2 5

This is T2, and to get back the original T we join the two fragments: T = T1 ⋈ T2 (a natural join on Tuple_id, followed by a projection that drops Tuple_id).
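The projection and natural-join reconstruction can be sketched in Python as well. The helper names (`project`, `natural_join`) are hypothetical; `Tuple_id` plays the linking role described above.

```python
# Vertical fragmentation sketch: each fragment keeps a subset of the
# columns plus Tuple_id, and the table is rebuilt with a natural join.
EMPLOYEE = [
    {"Eno": 101, "Ename": "A", "Design": "abc", "Salary": 3000, "Dep": 1, "Tuple_id": 1},
    {"Eno": 102, "Ename": "B", "Design": "abc", "Salary": 4000, "Dep": 1, "Tuple_id": 2},
    {"Eno": 103, "Ename": "C", "Design": "abc", "Salary": 5500, "Dep": 2, "Tuple_id": 3},
]

def project(table, attrs):
    """Relational-algebra projection pi_attrs(T)."""
    return [{a: row[a] for a in attrs} for row in table]

def natural_join(t1, t2, key):
    """Join two fragments on the shared linking attribute."""
    index = {row[key]: row for row in t2}
    return [{**row, **index[row[key]]} for row in t1 if row[key] in index]

T1 = project(EMPLOYEE, ["Eno", "Ename", "Design", "Tuple_id"])
T2 = project(EMPLOYEE, ["Salary", "Dep", "Tuple_id"])
rebuilt = natural_join(T1, T2, "Tuple_id")
```

Because both fragments carry Tuple_id, the join pairs each row's columns back up exactly, recovering the original relation.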
Mixed Fragmentation
The combination of vertical fragmentation of a table followed by further horizontal fragmentation of some fragments (or vice versa) is called mixed or hybrid fragmentation. This type of fragmentation is defined using the SELECT and PROJECT operations of relational algebra. In some situations, horizontal or vertical fragmentation alone is not enough to distribute data for an application, and in those conditions we need mixed fragmentation.
Mixed fragmentation can be done in two different ways:
1. The first method is to first create a set or group of horizontal fragments and then create vertical
fragments from one or more of the horizontal fragments.
2. The second method is to first create a set or group of vertical fragments and then create horizontal
fragments from one or more of the vertical fragments.

The original relation can be obtained by a combination of JOIN and UNION operations. The two forms of mixed fragmentation are:
σp(πa1, a2, ..., an(T))
πa1, a2, ..., an(σp(T))

For example, for our EMPLOYEE table, an instance of mixed fragmentation is πEname, Design(σEno<104(EMPLOYEE)).
The result of this fragmentation is:

Ename Design

A abc

B abc

C abc
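The mixed fragment above is just a selection composed with a projection. Using the same list-of-dicts model as before (an illustrative sketch, not part of the notes), πEname, Design(σEno<104(EMPLOYEE)) becomes:

```python
# Mixed (hybrid) fragmentation sketch: a selection followed by a
# projection, matching pi_{Ename,Design}(sigma_{Eno<104}(EMPLOYEE)).
EMPLOYEE = [
    {"Eno": 101, "Ename": "A", "Design": "abc", "Salary": 3000, "Dep": 1},
    {"Eno": 102, "Ename": "B", "Design": "abc", "Salary": 4000, "Dep": 1},
    {"Eno": 103, "Ename": "C", "Design": "abc", "Salary": 5500, "Dep": 2},
    {"Eno": 104, "Ename": "D", "Design": "abc", "Salary": 5000, "Dep": 2},
    {"Eno": 105, "Ename": "E", "Design": "abc", "Salary": 2000, "Dep": 2},
]

def select(table, predicate):
    """Relational-algebra selection sigma_p(T)."""
    return [row for row in table if predicate(row)]

def project(table, attrs):
    """Relational-algebra projection pi_attrs(T)."""
    return [{a: row[a] for a in attrs} for row in table]

# First horizontal (Eno < 104), then vertical (Ename, Design).
fragment = project(select(EMPLOYEE, lambda r: r["Eno"] < 104),
                   ["Ename", "Design"])
```

The result matches the table shown above: rows A, B, and C with only the Ename and Design columns.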

CORRECTNESS RULES FOR DATA FRAGMENTATION


To ensure no loss of information and no redundancy of data (i.e., to ensure the correctness of
fragmentation), there are three different rules that must be considered during fragmentation. These
correctness rules are listed below.
(a) Completeness – If a relation instance R is decomposed into fragments R1, R2, . . . , Rn, each data
item in R must appear in at least one of the fragments Ri. This property is identical to the lossless
decomposition property of normalization and it is necessary in fragmentation to ensure that there is no
loss of data during data fragmentation.
(b) Reconstruction – If a relation R is decomposed into fragments R1, R2, . . . ,Rn, it must be
possible to define a relational operation that will reconstruct the relation R from the fragments R1,
R2, . . . ,Rn. This rule ensures that constraints defined on the data in the form of functional dependencies
are preserved during data fragmentation.
(c) Disjointness – If a relation instance R is decomposed into fragments R1, R2, . . . ,Rn, and if a data
item is found in the fragment Ri, then it must not appear in any other fragment. This rule ensures
minimal data redundancy. In case of vertical fragmentation, primary key attribute must be repeated to
allow reconstruction and to preserve functional dependencies. Therefore, in case of vertical
fragmentation, disjointness is defined only on non-primary key attributes of a relation.
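For horizontal fragmentation, the three rules can be checked mechanically. A hypothetical sketch, with rows modeled as tuples so they can be placed in sets:

```python
# Sketch: checking completeness, reconstruction, and disjointness of a
# horizontal fragmentation. Rows are (Eno, Dep) pairs from the example.
R = {(101, 1), (102, 1), (103, 2), (104, 2), (105, 2)}
R1 = {row for row in R if row[1] == 1}   # fragment for Dep = 1
R2 = {row for row in R if row[1] == 2}   # fragment for Dep = 2
fragments = [R1, R2]

# (a) Completeness: every row of R appears in at least one fragment.
complete = set().union(*fragments) == R

# (b) Reconstruction: for horizontal fragments, a union rebuilds R.
reconstructed = set().union(*fragments)

# (c) Disjointness: no row appears in more than one fragment, so the
# fragment sizes sum to the size of their union.
disjoint = sum(len(f) for f in fragments) == len(set().union(*fragments))
```

A fragmentation whose predicates do not cover all rows would fail (a), and overlapping predicates would fail (c).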

DATA REPLICATION IN DBMS


Data replication is the process of storing data at more than one site or node. It is useful for improving the availability of data. It is simply copying data from one database server to another so that all users can share the same data without any inconsistency. The result is a distributed database in which users can access data relevant to their tasks without interfering with the work of others. Data replication encompasses duplication of transactions on an ongoing basis, so that the replicas are kept up to date and synchronized with the source. There can be full replication, in which the whole database is stored at every site, and partial replication, in which some frequently used fragments of the database are replicated and others are not.

Types of data replication –
1. Transactional Replication – In transactional replication, users receive a full initial copy of the database and then receive updates as the data changes. Data is copied in real time from the publisher to the receiving database (the subscriber), in the same order as the changes occur at the publisher; therefore, transactional consistency is guaranteed. Transactional replication is typically used in server-to-server environments. It does not simply copy the data changes, but consistently and accurately replicates each change.
2. Snapshot Replication – Snapshot replication distributes data exactly as it appears at a specific moment in time and does not monitor for updates to the data. The entire snapshot is generated and sent to subscribers. Snapshot replication is generally used when data changes are infrequent. It is a bit slower than transactional replication because on each attempt it moves multiple records from one end to the other. Snapshot replication is a good way to perform the initial synchronization between the publisher and the subscriber.
3. Merge Replication – Data from two or more databases is combined into a single database. Merge
replication is the most complex type of replication because it allows both publisher and subscriber
to independently make changes to the database. Merge replication is typically used in server-to-
client environments. It allows changes to be sent from one publisher to multiple subscribers.

Replication Schemes –

1. Full Replication

In full replication scheme, the database is available to almost every location or user in
communication network.

Advantages of full replication

 High availability of data, as database is available to almost every location.


 Faster execution of queries.
Disadvantages of full replication

 Concurrency control is difficult to achieve in full replication.


 Update operation is slower.

2. No Replication

No replication means, each fragment is stored exactly at one location.


Advantages of no replication

 Concurrency can be minimized.


 Easy recovery of data.
Disadvantages of no replication

 Poor availability of data.


 Slows down the query execution process, as multiple clients are accessing the same server.

3. Partial replication

Partial replication means only some fragments are replicated from the database.

Advantages of partial replication

The number of replicas created for a fragment depends upon the importance of the data in that fragment.

There are several types of data replication:

1. Master-slave replication: In this type of replication, one database server is designated as the
master, and one or more other servers are designated as slaves. The master server receives all the
write operations, and the slaves receive a copy of the data from the master.
2. Multi-master replication: In this type of replication, all the servers involved in replication can
receive write operations, and all the updates made to any server will be replicated to all the other
servers.
3. Peer-to-peer replication: In this type of replication, each server can act as both a master and a
slave, and the data is replicated between all the servers in a peer-to-peer fashion.
4. Single-source replication: In this type of replication, a single source database is replicated to
multiple target databases.

The advantages of data replication include:

1. Improved performance, as data can be read from a local copy of the data instead of a remote one.
2. Increased data availability, as copies of the data can be used in case of a failure of the primary
database.
3. Improved scalability, as the load on the primary database can be reduced by reading data from the
replicas.

The disadvantages of data replication include:

1. Increased complexity, as the replication process needs to be configured and maintained.


2. Increased risk of data inconsistencies, as data can be updated simultaneously on different replicas.
3. Increased storage and network usage, as multiple copies of the data need to be stored and
transmitted.
Data replication is widely used in various types of systems, such as online transaction processing systems, data warehousing systems, and distributed systems.

Distributed DBMS - Replication Control

As discussed earlier, replication is a technique used in distributed databases to store multiple copies of a
data table at different sites. The problem with having multiple copies in multiple sites is the overhead of
maintaining data consistency, particularly during update operations.
In order to maintain mutually consistent data in all sites, replication control techniques need to be
adopted. There are two approaches for replication control, namely −

 Synchronous Replication Control


 Asynchronous Replication Control

Synchronous Replication Control

In the synchronous replication approach, the database is synchronized so that all replicas always have the same value. A transaction requesting a data item will see the same value at all sites. To ensure this uniformity, a transaction that updates a data item is expanded so that it makes the update in all copies of the data item. Generally, the two-phase commit protocol is used for this purpose.
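A minimal sketch of the two-phase commit idea behind synchronous replication follows. The class and method names are hypothetical, and a real protocol would also handle timeouts, logging, and recovery.

```python
class Replica:
    """One copy of a data item at a site."""
    def __init__(self):
        self.value = None    # committed value
        self.staged = None   # value held during the prepare phase

    def prepare(self, value):
        self.staged = value
        return True  # vote "yes"; a real replica may vote "no" on conflict

    def commit(self):
        self.value = self.staged

    def abort(self):
        self.staged = None

def synchronous_update(replicas, value):
    """Phase 1: all replicas prepare; phase 2: commit only if all voted yes."""
    if all(r.prepare(value) for r in replicas):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()
    return False

sites = [Replica(), Replica(), Replica()]
synchronous_update(sites, 42)
```

Either every copy commits the new value or none does, which is exactly the uniformity property described above.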

Asynchronous Replication Control

In the asynchronous replication approach, the replicas do not always hold the same value. One or more replicas may store an outdated value, and different transactions may see different values. The process of bringing all the replicas to the current value is called synchronization.
A popular method of synchronization is the store-and-forward method. In this method, one site is designated as the primary site and the other sites are secondary sites. The primary site always contains the updated values. All transactions first enter the primary site and are then queued for application at the secondary sites. A secondary site is updated, using a rollout method, only when a transaction is scheduled to execute on it.
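The store-and-forward idea can be sketched as follows. The names are hypothetical; secondaries are modeled as plain dictionaries that lag behind the primary until the queue is drained.

```python
from collections import deque

class PrimarySite:
    """Primary site that queues updates for later application at secondaries."""
    def __init__(self, secondaries):
        self.data = {}
        self.secondaries = secondaries
        self.pending = deque()  # updates not yet forwarded

    def write(self, key, value):
        self.data[key] = value           # primary is always up to date
        self.pending.append((key, value))

    def synchronize(self):
        """Apply queued updates at every secondary site, in order."""
        while self.pending:
            key, value = self.pending.popleft()
            for site in self.secondaries:
                site[key] = value

s1, s2 = {}, {}
primary = PrimarySite([s1, s2])
primary.write("x", 10)
stale = ("x" not in s1)   # secondaries lag until synchronization runs
primary.synchronize()
```

Between a write and the next synchronization, the secondaries hold outdated values, which is precisely the window of inconsistency that asynchronous replication tolerates.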

Replication Control Algorithms

Some of the replication control algorithms are −

 Master-slave replication control algorithm.


 Distributed voting algorithm.
 Majority consensus algorithm.
 Circulating token algorithm.
Master-Slave Replication Control Algorithm
There is one master site and ‘N’ slave sites. A master algorithm runs at the master site to detect
conflicts. A copy of slave algorithm runs at each slave site. The overall algorithm executes in the
following two phases −
 Transaction acceptance/rejection phase − When a transaction enters the transaction monitor of
a slave site, the slave site sends a request to the master site. The master site checks for conflicts.
If there aren’t any conflicts, the master sends an “ACK+” message to the slave site which then
starts the transaction application phase. Otherwise, the master sends an “ACK-” message to the
slave which then rejects the transaction.
 Transaction application phase − Upon entering this phase, the slave site where transaction has
entered broadcasts a request to all slaves for executing the transaction. On receiving the
requests, the peer slaves execute the transaction and send an “ACK” to the requesting slave on
completion. After the requesting slave has received “ACK” messages from all its peers, it sends a
“DONE” message to the master site. The master understands that the transaction has been
completed and removes it from the pending queue.
Distributed Voting Algorithm
This comprises ‘N’ peer sites, all of which must “OK” a transaction before it starts executing.
Following are the two phases of this algorithm −
 Distributed transaction acceptance phase − When a transaction enters the transaction manager
of a site, it sends a transaction request to all other sites. On receiving a request, a peer site
resolves conflicts using priority-based voting rules. If all the peer sites are “OK” with the
transaction, the requesting site starts the application phase. If any of the peer sites does not “OK”
the transaction, the requesting site rejects it.
 Distributed transaction application phase − Upon entering this phase, the site where the
transaction entered broadcasts a request to all peer sites for executing the transaction. On
receiving the request, the peers execute the transaction and send an “ACK” message to
the requesting site on completion. After the requesting site has received “ACK” messages
from all its peers, it lets the transaction manager know that the transaction has been completed.
Majority Consensus Algorithm
This is a variation of the distributed voting algorithm, in which a transaction is allowed to execute when
a majority of the peers “OK” it. The algorithm is divided into three phases −
 Voting phase − When a transaction enters the transaction manager of a site, it sends a
transaction request to all other sites. On receiving a request, a peer site tests for conflicts using
voting rules and keeps the conflicting transactions, if any, in a pending queue. Then, it sends either
an “OK” or a “NOT OK” message.
 Transaction acceptance/rejection phase − If the requesting site receives a majority “OK” on the
transaction, it accepts the transaction and broadcasts “ACCEPT” to all the sites. Otherwise, it
broadcasts “REJECT” to all the sites and rejects the transaction.
 Transaction application phase − When a peer site receives a “REJECT” message, it removes this
transaction from its pending list and reconsiders all deferred transactions. When a peer site
receives an “ACCEPT” message, it applies the transaction and rejects all the deferred
transactions in the pending queue which are in conflict with this transaction. It sends an “ACK” to
the requesting slave on completion.
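The voting logic shared by the distributed voting and majority consensus algorithms can be sketched with a single threshold parameter: requiring all peers to vote OK models distributed voting, while requiring a simple majority models majority consensus. The function names and the representation of a peer's pending conflicts as a set of locked items are illustrative assumptions.

```python
# Quorum-based transaction acceptance, as used by the distributed voting
# (threshold == number of peers) and majority consensus (threshold == majority)
# algorithms.

def vote(peer_locked_items, txn_items):
    """A peer votes NOT OK if the transaction conflicts with a pending one."""
    return "NOT OK" if peer_locked_items & txn_items else "OK"

def run_transaction(peers, txn_items, threshold):
    """peers: one set of currently locked items per peer site."""
    votes = [vote(locked, txn_items) for locked in peers]
    if votes.count("OK") >= threshold:
        return "ACCEPT"    # broadcast ACCEPT; peers apply the transaction
    return "REJECT"        # broadcast REJECT; peers drop it from pending
```

With one dissenting peer, unanimous voting rejects a transaction that a majority quorum would still accept.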
Circulating Token Algorithm
In this approach, the transactions in the system are serialized using a circulating token and executed
accordingly against every replica of the database. Thus, all transactions are accepted, i.e. none is
rejected. The algorithm has two phases −
 Transaction serialization phase − In this phase, all transactions are scheduled to run in a
serialization order. Each transaction at each site is assigned a unique ticket from a sequential
series, indicating the order of the transaction. Once a transaction has been assigned a ticket, it is
broadcast to all the sites.
 Transaction application phase − When a site receives a transaction along with its ticket, it places
the transaction for execution according to its ticket. After the transaction has finished execution,
this site broadcasts an appropriate message. A transaction ends when it has completed
execution in all the sites.
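The ticket mechanism can be sketched as follows: a single sequential counter stands in for the circulating token, and each site applies transactions in ticket order. The class and function names are illustrative.

```python
# Sketch of the circulating token algorithm: a sequential ticket dispenser
# serializes transactions, and every site applies them in ticket order.
import heapq
import itertools

class Site:
    def __init__(self):
        self.queue = []      # (ticket, txn_id) waiting to execute
        self.applied = []    # transaction ids, in execution order

    def receive(self, ticket, txn_id):
        heapq.heappush(self.queue, (ticket, txn_id))

    def run_all(self):
        while self.queue:                      # execute in ticket order
            _, txn_id = heapq.heappop(self.queue)
            self.applied.append(txn_id)

ticket_counter = itertools.count(1)            # stands in for the token

def broadcast(sites, txn_id):
    ticket = next(ticket_counter)              # unique, sequential ticket
    for site in sites:
        site.receive(ticket, txn_id)
    return ticket
```

Because every site orders work by the same tickets, all replicas apply the transactions in the same serial order and none is rejected.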

TRANSACTIONS
A transaction is a program comprising a collection of database operations, executed as a logical
unit of data processing. The operations performed in a transaction include one or more
database operations such as insert, delete, update or retrieve. It is an atomic process that is
either performed to completion entirely or not performed at all. A transaction involving
only data retrieval, without any data update, is called a read-only transaction.
Each high-level operation can be divided into a number of low-level tasks or operations. For
example, a data update operation can be divided into three tasks −
 read_item() − reads the data item from storage into main memory.
 modify_item() − changes the value of the item in main memory.
 write_item() − writes the modified value from main memory back to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, read and write form the basic database operations.

TRANSACTION OPERATIONS
The low level operations performed in a transaction are −
 begin_transaction − A marker that specifies the start of transaction execution.
 read_item or write_item − Database operations that may be interleaved with main
memory operations as part of the transaction.
 end_transaction − A marker that specifies the end of the transaction.
 commit − A signal to specify that the transaction has completed successfully in its
entirety and will not be undone.
 rollback − A signal to specify that the transaction has been unsuccessful and so all
temporary changes to the database are undone. A committed transaction cannot be
rolled back.
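The markers above can be illustrated with a toy transaction class. Atomicity is simulated by snapshotting the database dictionary at begin_transaction and restoring it on rollback; the class is a sketch under those assumptions, not a real DBMS API.

```python
# Toy transaction showing begin_transaction, read_item/write_item,
# and the commit vs. rollback outcomes described above.

class Transaction:
    def __init__(self, db):
        self.db = db                      # database modeled as a dict
        self.snapshot = None

    def begin_transaction(self):
        self.snapshot = dict(self.db)     # remember state for possible undo

    def read_item(self, key):
        return self.db[key]

    def write_item(self, key, value):
        self.db[key] = value

    def commit(self):
        self.snapshot = None              # changes become permanent

    def rollback(self):
        self.db.clear()                   # undo all temporary changes
        self.db.update(self.snapshot)
```

A funds transfer, for example, either commits both writes or, on rollback, leaves the database exactly as it was before the transaction began.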

TRANSACTION STATES
A transaction may pass through a subset of five states: active, partially committed, committed,
failed and aborted.
 Active − The initial state where the transaction enters is the active state. The transaction
remains in this state while it is executing read, write or other operations.
 Partially Committed − The transaction enters this state after the last statement of the
transaction has been executed.
 Committed − The transaction enters this state after successful completion of the
transaction, once the system checks have issued the commit signal.
 Failed − The transaction goes from partially committed state or active state to failed
state when it is discovered that normal execution can no longer proceed or system
checks fail.
 Aborted − This is the state after the transaction has been rolled back following a failure
and the database has been restored to the state it was in before the transaction began.
The following state transition diagram depicts the states of a transaction and the low-level
transaction operations that cause the state changes.
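The state transitions can be encoded directly as a small table, following the standard diagram; the event names are illustrative labels for the low-level operations.

```python
# Minimal state machine for the five transaction states described above.

TRANSITIONS = {
    ("active", "end_transaction"): "partially committed",
    ("partially committed", "commit"): "committed",
    ("active", "fail"): "failed",
    ("partially committed", "fail"): "failed",
    ("failed", "rollback"): "aborted",
}

def step(state, event):
    """Return the next state, or raise on an illegal transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} in state {state}")
```

Note that no transition leaves the committed state, reflecting the rule that a committed transaction cannot be rolled back.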
DESIRABLE PROPERTIES OF TRANSACTIONS
Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation, and
Durability.
 Atomicity − This property states that a transaction is an atomic unit of processing, that
is, either it is performed in its entirety or not performed at all. No partial update should
exist.
 Consistency − A transaction should take the database from one consistent state to
another consistent state. It should not adversely affect any data item in the database.
 Isolation − A transaction should be executed as if it were the only one in the system. There
should not be any interference from other concurrently running transactions.
 Durability − If a committed transaction brings about a change, that change should be
durable in the database and not lost in case of any failure.

SCHEDULES AND CONFLICTS


In a system with a number of simultaneous transactions, a schedule is the total order of
execution of operations. Given a schedule S comprising n transactions, say T1, T2,
T3, …, Tn, for any transaction Ti the operations in Ti must execute as laid down in the
schedule S.
Types of Schedules
There are two types of schedules −
 Serial Schedules − In a serial schedule, at any point of time only one transaction is
active, i.e. there is no overlapping of transactions. This is depicted in the following graph
−
 Parallel Schedules − In parallel schedules, more than one transaction is active
simultaneously, i.e. the transactions contain operations that overlap in time. This is
depicted in the following graph −

Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two active
transactions perform non-compatible operations. Two operations are said to be in conflict
when all of the following three conditions hold simultaneously −
 The two operations are parts of different transactions.
 Both operations access the same data item.
 At least one of the operations is a write_item() operation, i.e. it tries to modify the data
item.
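The three conditions translate directly into code. Representing an operation as a (transaction, kind, item) tuple is an illustrative choice for the sketch.

```python
# Direct encoding of the three conflict conditions for a pair of operations.

def in_conflict(op1, op2):
    t1, kind1, item1 = op1
    t2, kind2, item2 = op2
    return (t1 != t2 and                     # parts of different transactions
            item1 == item2 and               # access the same data item
            "write" in (kind1, kind2))       # at least one is a write_item()
```

Two reads of the same item, or any two operations within one transaction, are therefore never in conflict.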

SERIALIZABILITY
A serializable schedule of ‘n’ transactions is a parallel schedule that is equivalent to a serial
schedule comprising the same ‘n’ transactions. A serializable schedule combines the
correctness of a serial schedule with the better CPU utilization of a parallel schedule.
Equivalence of Schedules
Equivalence of two schedules can be of the following types −
 Result equivalence − Two schedules producing identical results are said to be result
equivalent.
 View equivalence − Two schedules that perform similar action in a similar manner are
said to be view equivalent.
 Conflict equivalence − Two schedules are said to be conflict equivalent if both contain
the same set of transactions and have the same order of conflicting pairs of operations.

WHAT IS TRANSPARENCY?
Transparency in a DDBMS refers to hiding the details of the data distribution from the user, so
that the system's internal implementation is invisible. For example, in a normal DBMS,
data independence is a form of transparency that hides changes in the definition and
organization of the data from the user. All forms of transparency share the same overall
goal: to make using a distributed database feel the same as using a centralized database.
In Distributed Database Management System, there are four types of transparencies, which
are as follows –
 Transaction transparency
 Performance transparency
 DBMS transparency
 Distribution transparency

Transparencies in DDBMS
1. Transaction transparency-

This transparency ensures that all distributed transactions preserve distributed
database integrity and consistency. Note that a distributed transaction may access
data stored at multiple locations. The DDBMS is responsible for maintaining the
atomicity of every sub-transaction (meaning that either the whole sub-transaction takes
place or none of it does). Providing this transparency is complex because of the
fragmentation, allocation, and replication structure of the DDBMS.

2. Performance transparency-

This transparency requires a DDBMS to perform as if it were a centralized database
management system. The system should not suffer any degradation in performance
because its architecture is distributed. Likewise, a DDBMS must have a distributed query
processor that can map a data request into an ordered sequence of operations on the local
databases. An added complexity to take into consideration is the
fragmentation, replication, and allocation structure of the DDBMS.

3. DBMS transparency-

This transparency is only applicable to heterogeneous DDBMSs (databases whose sites
use different operating systems, products, and data models), as it hides the fact that
the local DBMSs may differ. It is one of the most complicated transparencies to
provide as a generalization.

4. Distribution transparency-

Distribution transparency allows the user to perceive the database as a single logical
entity. If a DDBMS exhibits distribution transparency, then the user does not need to
know that the data is fragmented.

Distribution transparency has its 5 types, which are discussed below –


 Fragmentation transparency-

With this type of transparency, the user does not need to know that the data is fragmented;
database accesses are based on the global schema. This is much like using SQL views,
where the user may not know that they are employing a view of a table rather than the
table itself.
 Location transparency-

If the DDBMS provides this type of transparency, the user needs to know how the data
has been fragmented, but knowing the location of the data is not
necessary.
 Replication transparency-

With replication transparency, the user does not know that fragments are copied.
Replication transparency is related to concurrency transparency and failure transparency:
whenever a user modifies a data item, the update is reflected in all the copies of
the table. However, this operation should not be visible to the user.
 Local Mapping transparency-

With local mapping transparency, the user needs to specify both the fragment names and
the locations of data items, while taking into account any replicas that may exist. This
makes queries more difficult and time-consuming for the user to write.
 Naming transparency-

As in a centralized database system, each item in a distributed database must have a
unique name. This means the DDBMS must ensure that no two sites create a database
object with the same name. There are two ways to solve the naming problem: create a
central name server that issues unique names for objects in the system, or prefix each
object's name with the identifier of the site that created it.

Various Failures in Distributed System

These are explained as following below.


1. Method failure:
In this type of failure, the distributed system is generally halted and unable to
perform the execution. Sometimes execution ends with an incorrect outcome.
Method failure causes the system state to deviate from its specification, and the
method may also fail to make progress.
 Behavior –
If an incorrect computation occurs (e.g. a protection violation, deadlock, timeout,
or bad user input), the method stops its execution.
 Recovery –
Method failure can be handled by aborting the method or restarting it from its
prior state.
2. System failure:
In a system failure, the processor associated with the distributed system fails to
perform the execution. This is caused by software errors and hardware issues.
Hardware issues may involve CPU/memory/bus failure. It is assumed that
whenever the system stops its execution due to some fault, the interior state is
lost.
 Behavior –
It is concerned with the physical and logical units of the processor. The system may
freeze or reboot, and it may stop performing any function, going into an
idle state.
 Recovery –
This can be cured by rebooting the system as soon as possible and correcting the
point of failure and the erroneous state.
3. Secondary storage device failure:
A storage device failure is said to have occurred when the stored information cannot
be accessed. This failure is usually caused by a parity error, head crash, or dirt
particles settled on the medium.
 Behavior –
Stored information cannot be accessed.
 Errors causing failure –
Parity error, head crash, etc.
 Recovery/Design strategies –
Reconstruct the content from an archive plus the log of activities, and design a
mirrored disk system. A system failure can additionally be classified as follows.
 An amnesia failure
 A partial amnesia failure
 A pause failure
 A halting failure
4. Communication medium failure:
A communication medium failure happens when a site cannot communicate
with another operational site in the network. It is typically caused by the failure
of the switching nodes and/or the links of the communication system.
 Behavior –
A site cannot communicate with another operational site.
 Errors/Faults –
Failure of switching nodes or communication links.
 Recovery/Design strategies –
Rerouting, and error-resistant communication protocols.

Two Phase Locking Protocol


The Two-Phase Locking Protocol, also known as the 2PL protocol, helps to
eliminate concurrency problems in a DBMS.

This locking protocol divides the execution phase of a transaction into three
different parts.

 In the first phase, when the transaction begins to execute, it requests permission for the
locks it needs.
 The second part is where the transaction obtains all the locks. When a transaction
releases its first lock, the third phase starts.
 In this third phase, the transaction cannot demand any new locks. Instead, it only
releases the acquired locks.
The Two-Phase Locking protocol allows each transaction to make a lock or
unlock request in two steps:

 Growing Phase: In this phase, the transaction may obtain locks but may not
release any locks.
 Shrinking Phase: In this phase, the transaction may release locks but may not
obtain any new locks.
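The two-phase rule can be enforced with a small wrapper class: once the transaction releases its first lock it enters the shrinking phase, and any further lock request is an error. This is a sketch; real DBMSs additionally distinguish shared and exclusive lock modes.

```python
# Sketch of a transaction that obeys the two-phase locking rule.

class TwoPhaseTxn:
    def __init__(self):
        self.held = set()
        self.shrinking = False      # False = growing phase

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violated: cannot lock after first unlock")
        self.held.add(item)

    def unlock(self, item):
        self.shrinking = True       # first unlock ends the growing phase
        self.held.discard(item)
```

Usage: acquire every needed lock first, then release; attempting to interleave a late lock request raises an error.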

QUERY OPTIMIZATION
Query optimization is used to access the database in an efficient manner. It is the
art of obtaining the desired information in a predictable, reliable and timely manner.
Formally, query optimization is defined as the process of transforming a query into an
equivalent form which can be evaluated more efficiently. The essence of query
optimization is to find an execution plan that minimizes the time needed to evaluate a
query. To achieve this optimization goal, we need to accomplish two main tasks: the first
is to find the best plan and the second is to reduce the time involved in
executing the query plan.
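The idea of transforming a query into an equivalent but cheaper form can be illustrated with a classic rewrite, pushing a selection below a join. Both plans below answer the same hypothetical query ("employees in the Sales department") over made-up data, but the optimized plan examines far fewer intermediate tuples.

```python
# Two equivalent plans for the same query; the second pushes the
# selection down before the join. Data and names are invented for the example.

employees = [{"id": i, "dept_id": i % 10} for i in range(1000)]
departments = [{"dept_id": d, "name": "Sales" if d == 3 else "Other"}
               for d in range(10)]

def plan_naive():
    # Join first (1000 intermediate pairs), then filter.
    joined = [(e, d) for e in employees for d in departments
              if e["dept_id"] == d["dept_id"]]
    return [e for e, d in joined if d["name"] == "Sales"]

def plan_optimized():
    # Select the matching departments first, then scan employees once.
    sales = {d["dept_id"] for d in departments if d["name"] == "Sales"}
    return [e for e in employees if e["dept_id"] in sales]
```

A query optimizer applies rewrites like this systematically, guided by cost estimates, rather than relying on the order in which the query was written.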
