DDBMS Notes
A distributed database is a database under the control of a central DBMS in which not all
storage devices are attached to a common CPU. It may be stored on multiple computers located
in the same physical location, or may be dispersed over a network of interconnected computers.
A collection of data (e.g., in a database) can thus be distributed across multiple physical locations.
In short, a distributed database is a logically interrelated collection of shared data (and a
description of this data) that is physically distributed over a computer network.
[Figure: distributed database vs. centralized database architecture]
A distributed database system allows applications to access data items from local and
remote databases. Applications are classified into two categories depending on whether they
require access to remote data:
Local applications – These applications require access to local data only and do not require data
from other sites.
Global applications – These applications require access to data from other remote sites in the
distributed system.
TYPES OF DDBMS:
1. Homogeneous Database:
In a homogeneous database, all sites store the database identically. The operating
system, database management system, and data structures used are all the same
at every site. Hence, such systems are easy to manage.
2. Heterogeneous Database:
In a heterogeneous distributed database, different sites can use different schemas and
software, which can lead to problems in query processing and transactions. Also, a
particular site might be completely unaware of the other sites. Different computers may
use different operating systems and different database applications. They may even use
different data models for the database. Hence, translations are required for the different
sites to communicate.
ADVANTAGES OF DDBMS
5. Modular growth – Growth is easier. We do not need to interrupt any of the functioning sites to
introduce (add) a new site, so the expansion of the whole system is easier. Removing a site also
does not cause many problems.
6. Lower communication costs (more economical) – Data are distributed so that they are
available near the locations where they are needed most. This greatly reduces communication
costs compared to a centralized system.
7. Faster response – Most of the data are local and in close proximity to where they are needed. Hence,
requests can be answered more quickly than in a centralized system.
9. Robustness – The system continues to work in case of failures. For example, a replicated distributed
database keeps performing despite the failure of individual sites.
10. Compliance with ACID properties – Distributed transactions demand Atomicity, Consistency,
Isolation, and Durability.
DISADVANTAGES OF DDBMS
2. Increased processing overhead – Many messages must be exchanged between sites to complete a
distributed transaction.
3. Data integrity – Maintaining data integrity becomes more complex, and excessive network
resources may be consumed.
4. Deadlock handling is more difficult than in a centralized system.
5. Lack of experience – General-purpose distributed DBMSs have not been widely accepted, although
many of the protocols and problems are well understood. Consequently, we do not yet have the same
level of experience in industry as we have with centralized DBMSs.
6. Security – The data shared between sites over networks is vulnerable to attack. Hence,
network-oriented security protocols must be used, based on the sensitivity of the data shared.
7. More complex database design – Depending on the applications, we may need to fragment
a database, replicate it, or do both.
8. Handling failures is a difficult task. In some cases, we may not be able to distinguish between a
site failure, a network partition, and a link failure.
ALTERNATIVE STRATEGIES FOR DATA ALLOCATION
Four alternative strategies have been identified for data allocation.
1. CENTRALIZED
In this strategy, the distributed system consists of a single database and DBMS stored at one
site, with users distributed across the communication network. In this approach, locality of
reference is lowest at all sites except the central site, where the data is stored. The
communication cost is very high, since all users except those at the central site have to use the
network for all types of data access. Reliability and availability are very low, since failure of the
central site results in the loss of the entire database system.
2. FRAGMENTED (OR PARTITIONED)
This strategy partitions the entire database into disjoint fragments, where each fragment is
assigned to one site. In this strategy, fragments are not replicated. If fragments are stored at the
site where they are used most frequently, locality of reference is high. As there is no replication
of data, storage cost is low. Reliability and availability are also low but still higher than
centralized data allocation strategies, as the failure of a site results in the loss of local data only.
In this case, communication costs are incurred only for global transactions. However, in this
approach, performance should be good, and communication costs are low if the data
distribution is designed properly.
3. COMPLETE REPLICATION
In this strategy, each site of the system maintains a complete copy of the entire database. Since
all the data are available at all sites, locality of reference, availability and reliability, and
performance are maximized in this approach. Storage costs are very high in this case. No
communication costs are incurred for retrievals by global transactions, but the communication
costs for updating data items are the most expensive. To overcome this problem, snapshots are
sometimes used. A snapshot is a copy of the data at a given time. The copies are updated
periodically, so they may not be always up-to-date. Snapshots are also sometimes used to
implement views in a distributed database, to reduce the time taken for performing a database
operation on a view.
4. SELECTIVE REPLICATION
This strategy is a combination of centralized, fragmented, and complete replication strategies. In
this approach, some of the data items are fragmented and allocated to the sites where they are
used frequently, to achieve high localization of reference. Some of the data items or fragments
of the data items that are used by many sites simultaneously but not frequently updated, are
replicated and stored at all these different sites. The data items that are not used frequently are
centralized.
COMPARISON OF CENTRALIZED AND DISTRIBUTED DATABASES (selected rows)

S.No. | Basis of Comparison | Centralized database | Distributed database
2. | Access time | The data access time in the case of multiple users is more in a centralized database. | The data access time in the case of multiple users is less in a distributed database.
5. | Communication | Processors communicate with each other through a bus. | Computers communicate with each other through message passing.
13. | Examples | A desktop or server CPU; a mainframe computer. | Apache Ignite, Apache Cassandra, Apache HBase, Amazon SimpleDB, Clusterpoint, FoundationDB.
DATE'S RULES FOR A DDBMS
1. Local Autonomy –
Each site has its own operations and also acts as an independent, autonomous, centralized DBMS. Each
site is responsible for the security, concurrency control, backup, and recovery of its own data.
2. No Reliance on a Central Site –
All sites are equal, and no site depends on a central site to perform any service; there is no single site
without which the system cannot operate. Services such as transaction management, query optimization,
deadlock detection, and management of the global system catalog do not require a central server.
3. Continuous Operation –
A site failure does not halt the system. The system continues its operation even in the case of a site
failure or an expansion of the network.
4. Location Independence –
To retrieve any data in the system, there is no need to know where the data is physically
stored.
5. Fragmentation Independence –
The user sees only a single logical database; data fragmentation is transparent to the user.
To retrieve any fragment of the database, there is no need to know the names of the database
fragments.
6. Replication Independence –
Data can be replicated and stored at different sites. The DDBMS manages all replicas
transparently to the user.
7. Distributed Query Processing –
A single query may need to access data at different locations transparently. Query
optimization is therefore crucial and is performed transparently by the DDBMS.
8. Distributed Transaction Management –
A transaction can update data at different sites transparently; recovery and
concurrency control are achieved by using agents (local sub-transactions).
9. Hardware Independence –
The DDBMS should be able to run on a variety of hardware platforms.
FRAGMENTATION
Horizontal fragmentation –
Horizontal fragmentation refers to the process of dividing a table horizontally by assigning each row (or
a group of rows) of the relation to one or more fragments. These fragments can then be assigned to
different sites in the distributed system. Some of the rows or tuples of the table are placed in one
system and the rest are placed in other systems. The rows that belong to a horizontal fragment are
specified by a condition on one or more attributes of the relation. In relational algebra, horizontal
fragmentation on a table T can be represented as follows:
σp(T)
where, σ is relational algebra operator for selection
p is the condition satisfied by a horizontal fragment
Note that a union operation can be performed on the fragments to construct table T. Such a fragment
containing all the rows of table T is called a complete horizontal fragment.
For example, consider an EMPLOYEE table (T):

Eno | Ename | Design | Salary | Dep
101 | A | abc | 3000 | 1
102 | B | abc | 4000 | 2
103 | C | abc | 5500 | 3
104 | D | abc | 5000 | 1
105 | E | abc | 2000 | 4

For instance, σDep = 1(EMPLOYEE) is a horizontal fragment containing the rows of employees 101 and 104.

Vertical fragmentation –
Vertical fragmentation divides a table column-wise: each fragment contains a subset of the attributes,
together with a Tuple_id column added so that the original table can be reconstructed by a join. In
relational algebra, a vertical fragment is expressed as a projection, πa1, a2, …, an(T). The first sub-table
(T1) of the relation after vertical fragmentation is given as follows:

Eno | Ename | Design | Tuple_id
101 | A | abc | 1
102 | B | abc | 2
103 | C | abc | 3
104 | D | abc | 4
105 | E | abc | 5
The second sub-table (T2) of the relation after vertical fragmentation is given as follows:

Salary | Dep | Tuple_id
3000 | 1 | 1
4000 | 2 | 2
5500 | 3 | 3
5000 | 1 | 4
2000 | 4 | 5
This is T2, and to get back the original T we join these two fragments on Tuple_id: T = πEno, Ename, Design, Salary, Dep(T1 ⋈ T2)
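To make the fragmentation operators concrete, here is a minimal Python sketch of selection (horizontal fragmentation), projection with a Tuple_id (vertical fragmentation), and reconstruction by join. The helper functions and the in-memory table are illustrative assumptions for this example, not part of any DBMS API; the data mirrors the EMPLOYEE table above.

```python
# Minimal sketch of horizontal and vertical fragmentation of EMPLOYEE.
# Rows are plain dicts; all helper names are illustrative only.

EMPLOYEE = [
    {"Eno": 101, "Ename": "A", "Design": "abc", "Salary": 3000, "Dep": 1},
    {"Eno": 102, "Ename": "B", "Design": "abc", "Salary": 4000, "Dep": 2},
    {"Eno": 103, "Ename": "C", "Design": "abc", "Salary": 5500, "Dep": 3},
    {"Eno": 104, "Ename": "D", "Design": "abc", "Salary": 5000, "Dep": 1},
    {"Eno": 105, "Ename": "E", "Design": "abc", "Salary": 2000, "Dep": 4},
]

def select(table, predicate):
    """Horizontal fragment sigma_p(T): whole rows satisfying p."""
    return [row for row in table if predicate(row)]

def project(table, attrs):
    """Vertical fragment pi_attrs(T), with Tuple_id added for reconstruction."""
    return [dict({a: row[a] for a in attrs}, Tuple_id=i + 1)
            for i, row in enumerate(table)]

def join_on_tuple_id(t1, t2):
    """Natural join of two vertical fragments on Tuple_id."""
    by_id = {row["Tuple_id"]: row for row in t2}
    return [{**r1, **by_id[r1["Tuple_id"]]} for r1 in t1]

# Horizontal fragment: sigma_{Dep = 1}(EMPLOYEE) -> employees 101 and 104.
h1 = select(EMPLOYEE, lambda r: r["Dep"] == 1)

# Vertical fragments T1 and T2, exactly as in the tables above.
T1 = project(EMPLOYEE, ["Eno", "Ename", "Design"])
T2 = project(EMPLOYEE, ["Salary", "Dep"])

# T = pi_{Eno, ..., Dep}(T1 join T2): drop Tuple_id to recover the original.
rebuilt = [{k: v for k, v in row.items() if k != "Tuple_id"}
           for row in join_on_tuple_id(T1, T2)]
assert rebuilt == EMPLOYEE
```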
Mixed Fragmentation
The combination of vertical fragmentation of a table followed by further horizontal fragmentation of
some fragments is called mixed or hybrid fragmentation. For defining this type of fragmentation we use
the SELECT and PROJECT operations of relational algebra. In some situations, horizontal or vertical
fragmentation alone is not enough to distribute data for certain applications, and in those conditions we
need mixed fragmentation.
Mixed fragmentation can be done in two different ways:
1. The first method is to first create a set or group of horizontal fragments and then create vertical
fragments from one or more of the horizontal fragments.
2. The second method is to first create a set or group of vertical fragments and then create horizontal
fragments from one or more of the vertical fragments.
A mixed fragment is therefore defined by applying the two operations in either order:
σp(πa1, a2, …, an(T)) or πa1, a2, …, an(σp(T))
The original relation can be reconstructed from the fragments by a combination of JOIN and UNION
operations.
For example, for our EMPLOYEE table, one mixed fragment is πEname, Design(σEno < 104(EMPLOYEE)).
The result of this fragmentation is:
Ename Design
A abc
B abc
C abc
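The same mixed fragment can be computed in code: the selection (horizontal step) is applied first and the projection (vertical step) second. A minimal self-contained Python sketch, with row values mirroring the example above and the unused columns omitted:

```python
# Sketch of the mixed fragment pi_{Ename, Design}(sigma_{Eno < 104}(EMPLOYEE)).
EMPLOYEE = [
    {"Eno": 101, "Ename": "A", "Design": "abc"},
    {"Eno": 102, "Ename": "B", "Design": "abc"},
    {"Eno": 103, "Ename": "C", "Design": "abc"},
    {"Eno": 104, "Ename": "D", "Design": "abc"},
    {"Eno": 105, "Ename": "E", "Design": "abc"},
]

selected = [row for row in EMPLOYEE if row["Eno"] < 104]      # SELECT step
fragment = [{"Ename": r["Ename"], "Design": r["Design"]}      # PROJECT step
            for r in selected]

print(fragment)
# [{'Ename': 'A', 'Design': 'abc'}, {'Ename': 'B', 'Design': 'abc'},
#  {'Ename': 'C', 'Design': 'abc'}]
```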
Replication Schemes –
1. Full Replication
In the full replication scheme, a complete copy of the entire database is stored at every site in
the communication network.
2. No Replication
In the no replication scheme, each fragment of the database is stored at exactly one site, with no copies.
3. Partial replication
Partial replication means only some fragments of the database are replicated.
The number of replicas created for a fragment depends upon the importance of the data in that
fragment.
Types of replication –
1. Master-slave replication: In this type of replication, one database server is designated as the
master, and one or more other servers are designated as slaves. The master server receives all the
write operations, and the slaves receive a copy of the data from the master (a code sketch of this
scheme follows this list).
2. Multi-master replication: In this type of replication, all the servers involved in replication can
receive write operations, and all the updates made to any server will be replicated to all the other
servers.
3. Peer-to-peer replication: In this type of replication, each server can act as both a master and a
slave, and the data is replicated between all the servers in a peer-to-peer fashion.
4. Single-source replication: In this type of replication, a single source database is replicated to
multiple target databases.
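As an illustration of the first scheme in the list above, here is a minimal Python sketch of master-slave replication. The classes and method names are invented for this example and do not correspond to any real product's API.

```python
# Sketch of master-slave replication: all writes go to the master, which
# pushes copies to every slave; reads may be served by any replica.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class MasterSlaveCluster:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        # Only the master accepts write operations ...
        self.master.data[key] = value
        # ... which it then replicates to every slave.
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key, replica=None):
        # Reads can be served locally by any replica (master by default).
        return (replica or self.master).data.get(key)

cluster = MasterSlaveCluster(Replica("master"), [Replica("s1"), Replica("s2")])
cluster.write("emp:101", {"Ename": "A", "Salary": 3000})
print(cluster.read("emp:101", cluster.slaves[0]))   # served by slave s1
```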
Advantages of replication –
1. Improved performance, as data can be read from a local copy of the data instead of a remote one.
2. Increased data availability, as copies of the data can be used in case of a failure of the primary
database.
3. Improved scalability, as the load on the primary database can be reduced by reading data from the
replicas.
As discussed earlier, replication is a technique used in distributed databases to store multiple copies of a
data table at different sites. The problem with having multiple copies in multiple sites is the overhead of
maintaining data consistency, particularly during update operations.
In order to maintain mutually consistent data at all sites, replication control techniques need to be
adopted. There are two approaches to replication control, namely synchronous replication and
asynchronous replication.
In the synchronous replication approach, the database is synchronized so that all the replicas always
have the same value. A transaction requesting a data item will have access to the same value in all the
sites. To ensure this uniformity, a transaction that updates a data item is expanded so that it makes the
update in all the copies of the data item. Generally, two-phase commit protocol is used for the purpose.
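The following is a minimal Python sketch of how two-phase commit can coordinate such a synchronous update across all copies. The participant behavior is heavily simplified (a real site would force log records to stable storage before voting), and all names are illustrative.

```python
# Sketch of two-phase commit (2PC) for a synchronous replicated update:
# commit everywhere only if every site votes "yes" in the prepare phase.

class Site:
    def __init__(self, name):
        self.name = name
        self.committed = {}
        self.pending = {}

    def prepare(self, key, value):
        # Phase 1: stage the update and vote; here we always vote "yes".
        self.pending[key] = value
        return True

    def commit(self, key):
        # Phase 2 (commit): make the staged update permanent.
        self.committed[key] = self.pending.pop(key)

    def abort(self, key):
        # Phase 2 (abort): discard the staged update.
        self.pending.pop(key, None)

def two_phase_commit(sites, key, value):
    votes = [site.prepare(key, value) for site in sites]    # phase 1
    if all(votes):
        for site in sites:
            site.commit(key)                                # phase 2: commit
        return True
    for site in sites:
        site.abort(key)                                     # phase 2: abort
    return False

sites = [Site("s1"), Site("s2"), Site("s3")]
assert two_phase_commit(sites, "x", 42)
assert all(site.committed["x"] == 42 for site in sites)
```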
In the asynchronous replication approach, the replicas do not always maintain the same value. One or
more replicas may store an outdated value, and different transactions may see different values. The
process of bringing all the replicas up to the current value is called synchronization.
A popular method of synchronization is the store and forward method. In this method, one site is
designated as the primary site and the other sites are secondary sites. The primary site always contains
updated values. All transactions first enter the primary site; they are then queued for application at the
secondary sites. A secondary site is updated, using the rollout method, only when a transaction is
scheduled to execute on it.
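A minimal Python sketch of the store and forward idea, assuming one pending-update queue per secondary site; the class names and the queue structure are assumptions made for this illustration.

```python
# Sketch of store-and-forward asynchronous replication: updates are applied
# at the primary first and queued; secondaries are brought up to date later.
from collections import deque

class SecondarySite:
    def __init__(self):
        self.data = {}

class PrimarySite:
    def __init__(self):
        self.data = {}
        self.queues = []                   # (secondary, pending queue) pairs

    def attach_secondary(self, secondary):
        self.queues.append((secondary, deque()))

    def update(self, key, value):
        self.data[key] = value             # the primary is always current
        for _, q in self.queues:
            q.append((key, value))         # forwarded later

    def synchronize(self):
        # Drain each queue, bringing every secondary up to date.
        for secondary, q in self.queues:
            while q:
                key, value = q.popleft()
                secondary.data[key] = value

primary, secondary = PrimarySite(), SecondarySite()
primary.attach_secondary(secondary)
primary.update("x", 1)
print(secondary.data)      # {}        -- the secondary is stale between syncs
primary.synchronize()
print(secondary.data)      # {'x': 1}  -- now synchronized
```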
TRANSACTIONS
A transaction is a program comprising a collection of database operations, executed as a logical
unit of data processing. The operations performed in a transaction include one or more
database operations such as insert, delete, update, or retrieve. It is an atomic process that is
either performed to completion or not performed at all. A transaction involving
only data retrieval, without any data update, is called a read-only transaction.
Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks −
read_item() − reads a data item from storage into main memory.
modify_item() − changes the value of the item in main memory.
write_item() − writes the modified value from main memory back to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, reads and writes form the basic database operations.
TRANSACTION OPERATIONS
The low level operations performed in a transaction are −
begin_transaction − A marker that specifies start of transaction execution.
read_item or write_item − Database operations that may be interleaved with main
memory operations as a part of transaction.
end_transaction − A marker that specifies end of transaction.
commit − A signal to specify that the transaction has been successfully completed in its
entirety and will not be undone.
rollback − A signal to specify that the transaction has been unsuccessful and so all
temporary changes in the database are undone. A committed transaction cannot be
rolled back.
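These operations can be modeled in a few lines of Python using an undo log, so that rollback can reverse all temporary changes. This is a toy sketch of the operations listed above, not a real recovery manager; the class and its methods are invented for the example.

```python
# Toy model of begin_transaction, read_item, write_item, commit and rollback.

class Transaction:
    def __init__(self, db):               # begin_transaction
        self.db = db
        self.undo_log = []                # (key, old value) pairs

    def read_item(self, key):
        return self.db.get(key)

    def write_item(self, key, value):
        self.undo_log.append((key, self.db.get(key)))   # remember old value
        self.db[key] = value

    def commit(self):
        self.undo_log.clear()             # changes are now permanent

    def rollback(self):
        # Undo all temporary changes, newest first.
        for key, old in reversed(self.undo_log):
            if old is None:
                self.db.pop(key, None)
            else:
                self.db[key] = old
        self.undo_log.clear()

db = {"balance": 100}
t = Transaction(db)
t.write_item("balance", t.read_item("balance") - 30)
t.rollback()                              # transaction failed: undo the change
assert db["balance"] == 100
```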
TRANSACTION STATES
A transaction may pass through a subset of five states: active, partially committed, committed,
failed, and aborted.
Active − The initial state where the transaction enters is the active state. The transaction
remains in this state while it is executing read, write or other operations.
Partially Committed − The transaction enters this state after the last statement of the
transaction has been executed.
Committed − The transaction enters this state after successful completion of the
transaction, once the system checks have issued the commit signal.
Failed − The transaction goes from partially committed state or active state to failed
state when it is discovered that normal execution can no longer proceed or system
checks fail.
Aborted − This is the state after the transaction has been rolled back following a failure
and the database has been restored to the state it was in before the transaction began.
[Figure: state transition diagram showing the transaction states and the low-level operations that cause the changes of state]
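A compact Python sketch of these five states as a transition table; the event names follow the low-level operations described earlier, and any transition not listed is treated as illegal.

```python
# The five transaction states and the legal transitions between them.
TRANSITIONS = {
    ("active", "end_transaction"): "partially committed",
    ("active", "failure"): "failed",
    ("partially committed", "commit"): "committed",
    ("partially committed", "failure"): "failed",
    ("failed", "rollback"): "aborted",
}

def step(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} in state {state!r}")

# Successful transaction: active -> partially committed -> committed.
state = "active"
for event in ("end_transaction", "commit"):
    state = step(state, event)
assert state == "committed"

# Failed transaction: active -> failed -> aborted.
assert step(step("active", "failure"), "rollback") == "aborted"
```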
DESIRABLE PROPERTIES OF TRANSACTIONS
Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation, and
Durability.
Atomicity − This property states that a transaction is an atomic unit of processing, that
is, either it is performed in its entirety or not performed at all. No partial update should
exist.
Consistency − A transaction should take the database from one consistent state to
another consistent state. It should not adversely affect any data item in the database.
Isolation − A transaction should be executed as if it is the only one in the system. There
should not be any interference from the other concurrent transactions that are
simultaneously running.
Durability − If a committed transaction brings about a change, that change should be
durable in the database and not lost in case of any failure.
SCHEDULES
Parallel Schedules − In parallel schedules, more than one transaction is active
simultaneously, i.e., the transactions contain operations that overlap in time.
Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two active
transactions perform non-compatible operations. Two operations are said to be in conflict
when all of the following three conditions hold simultaneously −
The two operations are parts of different transactions.
Both the operations access the same data item.
At least one of the operations is a write_item() operation, i.e. it tries to modify the data
item.
SERIALIZABILITY
A serializable schedule of ‘n’ transactions is a parallel schedule which is equivalent to some serial
schedule comprising the same ‘n’ transactions. A serializable schedule retains the
correctness of a serial schedule while achieving the better CPU utilization of a parallel schedule.
Equivalence of Schedules
Equivalence of two schedules can be of the following types −
Result equivalence − Two schedules producing identical results are said to be result
equivalent.
View equivalence − Two schedules that perform similar action in a similar manner are
said to be view equivalent.
Conflict equivalence − Two schedules are said to be conflict equivalent if both contain
the same set of transactions and has the same order of conflicting pairs of operations.
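A standard test for conflict serializability builds a precedence graph from the conflicting pairs of operations and checks it for cycles. The Python sketch below applies the three conflict conditions listed earlier; the (transaction, action, data item) schedule representation is an assumption made for this example.

```python
# Conflict detection and a precedence-graph test for conflict serializability.
from itertools import combinations

def conflicts(op1, op2):
    """Two ops conflict: different transactions, same item, at least one write."""
    (t1, a1, x1), (t2, a2, x2) = op1, op2
    return t1 != t2 and x1 == x2 and "write" in (a1, a2)

def is_conflict_serializable(schedule):
    # Edge Ti -> Tj for every conflicting pair where Ti's operation is first.
    edges = {(schedule[i][0], schedule[j][0])
             for i, j in combinations(range(len(schedule)), 2)
             if conflicts(schedule[i], schedule[j])}

    def has_cycle(node, visiting):
        if node in visiting:
            return True
        visiting.add(node)
        found = any(has_cycle(v, visiting) for u, v in edges if u == node)
        visiting.discard(node)
        return found

    # The schedule is conflict-serializable iff the graph is acyclic.
    nodes = {t for t, _, _ in schedule}
    return not any(has_cycle(n, set()) for n in nodes)

# T1 and T2 interleave on x so that T1 -> T2 and T2 -> T1 both arise: a cycle.
bad = [("T1", "read", "x"), ("T2", "write", "x"), ("T1", "write", "x")]
print(is_conflict_serializable(bad))   # False
```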
WHAT IS TRANSPARENCY?
Transparency in a DDBMS refers to hiding the distribution of data from the user, so that the
system presents itself as a single unit. It conceals the details of the underlying implementation
from the user. For example, in a normal DBMS, data independence is a form of transparency that
hides changes in the definition and organization of the data from the user. All forms of
transparency share the same overall target: to make using a distributed database the same as
using a centralized database.
In Distributed Database Management System, there are four types of transparencies, which
are as follows –
Transaction transparency
Performance transparency
DBMS transparency
Distribution transparency
Transparencies in DDBMS
1. Transaction transparency-
This transparency makes sure that all distributed transactions preserve distributed
database integrity and consistency. A distributed transaction may access data stored at
multiple locations. The DDBMS is responsible for maintaining the atomicity of every
sub-transaction (by this, we mean that either the whole sub-transaction takes place or
none of it does). This is very complex to achieve because of the fragmentation, allocation,
and replication structures of the DDBMS.
2. Performance transparency-
This transparency requires the DDBMS to perform as if it were a centralized DBMS, so
that users do not experience any degradation caused by the distributed architecture.
3. DBMS transparency-
This transparency hides the knowledge that the local DBMSs at different sites may be
different; it is applicable only to heterogeneous DDBMSs.
4. Distribution transparency-
Distribution transparency helps the user to perceive the database as a single logical
entity; if a DDBMS exhibits distribution transparency, the user does not need to know
that the data is fragmented.
In this type of transparency, the user does not have to know about the fragmentation of
data, which is why database accesses can be based on the global schema. This is much
like users of SQL views, who might not know that they are employing a view of a table
rather than the table itself.
Location transparency-
If this type of transparency is provided by the DDBMS, the user needs to know how the
data has been fragmented, but knowing the location of the data is not necessary.
Replication transparency-
In replication transparency, the user does not know about the copying of fragments.
Replication transparency is related to concurrency transparency and failure transparency.
Whenever a user modifies a data item, the update is reflected in all the copies of
the table. However, this operation should not be visible to the user.
Local Mapping transparency-
In local mapping transparency, the user needs to specify both the fragment names and
the locations of data items, taking into account any replication that may exist. This
makes queries more difficult and time-consuming for the user than under the other
forms of transparency.
Naming transparency-
As in a centralized DBMS, each item in a distributed database must have a unique name.
The DDBMS must therefore make sure that no two sites create a database object with
the same name. There are two ways to solve this naming problem: either create a central
name server that issues unique names for objects in the system, or prefix each object
name with the identifier of the site that created it.
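The second approach is easy to sketch in Python. The naming convention shown here (site identifier, a dot, then the local name) is just one possible choice, not a standard:

```python
# Make object names globally unique by prefixing the creator site's identifier,
# so no central name server is needed.
def global_name(site_id, local_name):
    return f"{site_id}.{local_name}"

# Two sites can each create an object called EMPLOYEE without a clash.
print(global_name("site1", "EMPLOYEE"))   # site1.EMPLOYEE
print(global_name("site2", "EMPLOYEE"))   # site2.EMPLOYEE
```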
TWO-PHASE LOCKING (2PL)
This locking protocol divides the execution phase of a transaction into two parts:
Growing Phase: In this phase, the transaction may obtain locks but may not
release any locks.
Shrinking Phase: In this phase, the transaction may release locks but may not
obtain any new locks.
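A minimal Python sketch of enforcing the two phases for a single transaction; the toy lock-manager class and its method names are invented for this example.

```python
# Two-phase locking: locks may be acquired only in the growing phase;
# the first unlock starts the shrinking phase, after which no new lock
# may be obtained.

class TwoPhaseLockingTransaction:
    def __init__(self):
        self.phase = "growing"
        self.locks = set()

    def lock(self, item):
        if self.phase != "growing":
            raise RuntimeError("2PL violation: cannot lock in shrinking phase")
        self.locks.add(item)

    def unlock(self, item):
        self.phase = "shrinking"     # the first release ends the growing phase
        self.locks.discard(item)

t = TwoPhaseLockingTransaction()
t.lock("x")
t.lock("y")                  # still growing: allowed
t.unlock("x")                # growing phase ends here
try:
    t.lock("z")              # shrinking phase: must be rejected
except RuntimeError as e:
    print(e)                 # 2PL violation: cannot lock in shrinking phase
```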
QUERY OPTIMIZATION
Query optimization is used for accessing the database in an efficient manner. It is the
art of obtaining the desired information in a predictable, reliable, and timely manner.
Formally, query optimization is defined as the process of transforming a query into an
equivalent form which can be evaluated more efficiently. The essence of query
optimization is to find an execution plan that minimizes the time needed to evaluate a
query. To achieve this optimization goal, we need to accomplish two main tasks: first,
find the best plan, and second, reduce the time involved in executing the query plan.
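The plan-selection idea can be sketched in a few lines of Python. The candidate plans and their cost figures below are invented purely for illustration; a real optimizer would derive such estimates from statistics about the data.

```python
# Essence of query optimization: enumerate equivalent execution plans
# for a query and pick the one with the lowest estimated cost.

def cheapest_plan(plans):
    return min(plans, key=lambda p: p["estimated_cost"])

# Two equivalent plans for "employees in department 1" (costs are made up).
plans = [
    {"description": "full table scan, then filter Dep = 1", "estimated_cost": 500},
    {"description": "index lookup on Dep, then fetch rows", "estimated_cost": 40},
]
print(cheapest_plan(plans)["description"])   # index lookup on Dep, then fetch rows
```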