Distributed Databases
Ahmed S. Alaidaroos
University Utara Malaysia
S806935@student.uum.edu.my
Yousef A. Fazea
University Utara Malaysia
S809106@student.uum.edu.my
INTRODUCTION
Distributed database systems were first used in mainframe environments in the
1950s and 1960s, but they have flourished since the development, in the 1980s and
1990s, of minicomputers, powerful desktop and workstation computers, and fast,
high-capacity telecommunications, which made it relatively easy and cheap to
distribute computing facilities widely.
Users have access to the portion of the database at their location, so that they can
access the data relevant to their tasks without interfering with the work of others. A
distributed database management system (DDBMS) manages the database as if it were
all stored on the same computer. The DDBMS synchronizes all the data periodically
and, in cases where multiple users must access the same data, ensures that updates and
deletes performed on the data at one location are automatically reflected in the data
stored elsewhere. A distributed database is a database under the control of a central
database management system in which the storage devices are not all attached to a
common CPU.
Distributed database technology is one of the most important developments of the
past decades. The maturation of database management system (DBMS) technology has
coincided with significant developments in distributed computing and parallel processing
technologies, and the result is the emergence of distributed DBMSs and parallel DBMSs.
These systems have started to become the dominant data management tools for data-intensive
applications. The basic motivations for distributing databases are improved
performance, increased availability, shareability, expandability, and access flexibility.
Although there have been many research studies in these areas, only some commercial
systems provide the full functionality required for distributed transaction processing.
Important issues addressed in these studies are database placement in the distributed
environment, distributed query processing, distributed concurrency control algorithms,
reliability and availability protocols, and replication strategies.
For general purposes, a database is a collection of data that is stored and
maintained at one central location. A database is controlled by a database management
system, with which the user interacts in order to utilize the database and transform data
into information. Furthermore, a database offers many advantages over a simple file
system with regard to speed, accuracy, and accessibility, such as shared access, minimal
redundancy, data consistency, data integrity, and controlled access. All of these aspects
are enforced by a database management system. With this background in mind, let us
turn to distributed databases.
DISTRIBUTED DATABASE
"A distributed [1] database is a collection of databases which are distributed and
then stored on multiple computers within a network". A distributed database is also a set of
databases stored on multiple computers that typically appears to applications as a single
database. Distributed systems are a collection of independent cooperating systems, which
enables storage of data at geographically dispersed locations, based on the frequency of
access by users local to a site. The distributed database also enables combining of data
from these dispersed sites by means of queries [3]. "Consequently [4], an application can
simultaneously access and modify the data in several databases in a network ". A database
[2], link connection allows local users to access data on a remote database ". For this
connection to occur, each database in the distributed system must have a unique global
database name in the network domain. The global database name uniquely identifies a
database server in a distributed system. Which mean users have access to the database at
their location so that they can access the data relevant to their tasks without interfering
with the work of others?
A distributed database consists of two or more data files located at different sites on a
computer network. Because the database is distributed, different users can access it
without interfering with one another. However, the DBMS must periodically synchronize
the scattered databases to make sure that they all have consistent data. Such a database
may be stored on multiple computers located in the same physical location, or may be
dispersed over a network of interconnected computers; collections of data can thus be
distributed across multiple physical locations, with the database divided into separate
partitions or fragments.
Besides replication and fragmentation, there are many other distributed database
design technologies, for example local autonomy and synchronous and asynchronous
distributed database technologies. How these technologies are implemented depends on
the needs of the business, on the sensitivity and confidentiality of the data to be stored in
the database, and hence on the price the business is willing to pay to ensure data security,
consistency, and integrity.
Databases have firmly moved from the realm of research and experimentation into
the commercial world. Here we will address issues related to distributed databases,
including transaction management, concurrency, recovery, fault tolerance, security, and
mobility; both the theory and the practice of databases will play a prominent role in these pages.
A distributed database system consists of a collection of sites, connected together
via some communication network, in which:
a. Each site is a full database system site in its own right.
b. The sites have agreed to work together so that a user at any site can access data
anywhere in the network in a transparent manner.
Reliability and availability: Because data may be replicated at more than
one site, a crash of one of the sites, or the failure of a communication line making some of
these sites inaccessible, does not necessarily make the data impossible to reach.
Furthermore, system crashes or communication failures do not render the whole system
inoperable, and the distributed DBMS can still provide limited service.
Economics: When the data used by an application reside at several sites, it may be
much more economical, in terms of communication costs, to partition the application and
do the processing at each site. Moreover, the cost of having smaller computing power at
each site is much less than the cost of an equivalent single mainframe.
Expandability: Expansion can be achieved by adding processing and storage
power to the existing network. It may not be possible to obtain a linear improvement in
power, but significant improvements are still possible.
Management of distributed data is, however, a more complex task than
management of centralized data, and this is the source of the main disadvantages of
distributed databases. Applications must recognize data location, and they must be able
to stitch together data from different sites. Database administrators have to coordinate
database activities to prevent database degradation due to data anomalies. Transaction
management, concurrency control, security, backup, recovery, query optimization,
access path selection, and so on, must all be addressed and resolved. In short, keeping
the various components of a distributed database synchronized is a daunting task.
Further disadvantages include:
1. Security: The probability of security lapses increases when data are located at
multiple sites. Different people at several sites share the responsibility of data
management, and LANs do not yet have the sophisticated security of centralized
mainframe installations.
2. Lack of standards: Few official standards exist for any of the distributed
database protocols, whether they deal with communication or with data access control.
Consequently, distributed database users must wait for the definitive emergence of
standard protocols before distributed databases can deliver their full potential.
3. Storage requirements: Replicated data occupy additional disk space. This
disadvantage is a minor one, because disk storage is relatively cheap and becoming
cheaper. However, disk access and storage management in a widely dispersed data
storage environment are more complex than they would be in a centralized database.
4. Lack of experience: Some special solutions and prototype systems have not
been tested in actual operating environments, and more theoretical work has been done
than actual implementation.
5. Cost: The trade-off between the increased profitability due to more efficient
and timely use of information and the increased personnel costs of new data processing
sites has to be analyzed carefully.
6. Network security: Security can be easily controlled in a central location, with
the DBMS enforcing the rules. In a distributed database system, however, a network is
involved, which has its own security requirements, and security control becomes much
more complicated.
7. Difficulty of change: All users have to continue to work with their legacy data
as implemented in the existing systems.
In addition to these trade-offs, a distributed DBMS must provide mechanisms for
security, concurrency control, backup, and recovery across all participating sites; these
are discussed in the sections that follow.
In 1987 one of the founders of relational database theory, C. J. Date, stated twelve
goals which, he held, designers should strive to achieve in their DDBs and the associated
DDBMSs. Among them:
Central site independence: Each site in the DDB should act independently with
respect to the central site and all other remote sites. Note: all sites should have the same
capabilities, even though some sites may not necessarily exercise all of these capabilities
at a given point in time.
1. Failure independence: The DDBMS should be unaffected by the failure of a node
or nodes; the rest of the nodes, and the DDBMS as a whole, should continue to work.
Note: in similar fashion, the DDBMS should continue to work if new nodes are added.
2. Location transparency: Users should not have to know the location of a datum
in order to retrieve it.
3. Fragmentation transparency: The user should be unaffected by, and not even
notice, any fragmentation of the DDB; the user can retrieve data without regard to
the fragmentation of the DDB.
4. Replication transparency: The user should be able to use the DDB without being
concerned in any way with the replication of the data in the DDB.
5. Distributed query processing: A query should be capable of being executed at any
node in the DDBMS that contains data relevant to the query. Many nodes may
participate in the response to the user's query without the user being aware of such
participation.
6. Distributed transaction processing: A transaction may access and modify data at
several different sites in the DDB without the user being aware that multiple sites
are participating in the transaction.
7. Hardware independence: The DDB and its associated DDBMS should be capable
of being implemented on any suitable hardware platform.
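Distributed transaction processing of the kind Date's goals describe is commonly achieved with an atomic commitment protocol such as two-phase commit: the coordinator asks every participating site to prepare, and only commits if all sites vote yes. The following minimal Python sketch illustrates the voting and decision phases; the `Participant` class and all names are invented for the illustration and are not taken from any particular DBMS.

```python
# Minimal two-phase commit sketch: the coordinator asks every participant
# site to PREPARE; only if all vote "yes" does it send COMMIT, otherwise
# it sends ABORT so that no site applies a partial update.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # simulates a site able/unable to prepare
        self.state = "idle"

    def prepare(self):
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit         # the participant's vote

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: voting.
    votes = [p.prepare() for p in participants]
    # Phase 2: decision, applied uniformly at every site.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("A"), Participant("B")]))
# -> committed (all sites voted yes)
print(two_phase_commit([Participant("A"), Participant("B", can_commit=False)]))
# -> aborted (one site could not prepare, so no site commits)
```

The essential point is that the decision is all-or-nothing across sites: a single negative vote aborts the transaction everywhere.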
Other related problems arise when one process attempts to alter the structure of a
table which another is updating, or when two processes generate duplicate index values for
a pair of records which should have unique keys. The need for "read consistency" must
also be considered - when process A is generating a report based on a series of values in
the database, its results may be falsified if process B changes some of them between the
start and end of the transaction. Oracle actually handles this last problem by way of its
rollback segments, but it can be approached in the same way as the others, using a system
of locking.
Locking
A lock can be thought of as assigning a user or process temporary ownership of a
database resource. While the lock exists, no other user or process may access the record.
So, to safeguard against the lost update described above:
Process A reads record R into memory and acquires a lock on it.
Process B tries to read record R into memory but is prevented from doing so.
Process A commits an updated version of record R and releases the lock on it.
Process B tries to read record R into memory again, this time successfully.
There are no commands for locking in standard SQL, and the syntax provided by
any particular RDBMS will vary according to how it handles the locking process. In
discussing this topic, two general pieces of terminology are used: shared and exclusive
locks. Shared locks are set when reading: several processes may hold a shared lock on
the same resource and query it concurrently, but none may change it. Exclusive locks are
set when it is intended to write or modify part of the database; while a resource is
exclusively locked, other processes may query but not change it. Oracle automatically
sets an exclusive lock on the relevant records before executing INSERT, DELETE or
UPDATE, but it sometimes proves necessary for programs to lock explicitly in complex
transactions.
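The lock protocol described above can be illustrated with a small Python sketch. An in-process `threading.Lock` stands in for a DBMS record lock; the `Record` class and the scenario are invented for the example and do not represent any actual RDBMS interface.

```python
import threading

# Sketch of the exclusive-lock protocol described above: each process must
# acquire the record's lock before a read-modify-write, so concurrent
# updates cannot be lost.

class Record:
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()     # exclusive lock on this record

def add_to_record(record, amount):
    with record.lock:                    # a second process blocks here until released
        current = record.value           # read the record into memory
        record.value = current + amount  # commit the updated version; lock released on exit

r = Record(100)
threads = [threading.Thread(target=add_to_record, args=(r, 1)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(r.value)   # 150: all 50 concurrent increments survive, none lost
```

Without the lock, two processes could both read the value 100 and both write 101, losing one update, exactly the "lost update" scenario above.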
Deadlock
As with any other situation where computer processes are in contention for
resources, database locking gives rise to the potential problem of "deadlock". Suppose
processes A and B both need to update records R1 and R2:
Process A reads record R1 into memory and acquires a lock on it.
Process B reads record R2 into memory and acquires a lock on it.
Process B tries to read record R1 into memory but is prevented from doing so,
going into a "wait" state.
Process A tries to read record R2 into memory but is prevented from doing so,
going into a "wait" state.
Neither process can now proceed: each is waiting for a lock the other holds.
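One standard way to prevent the circular wait described above is to impose a global ordering on lock acquisition, so that every process requests its locks in the same order and a cycle cannot form. The sketch below, with invented record names, illustrates the idea in Python.

```python
import threading

# The deadlock above arises because A and B acquire R1 and R2 in opposite
# orders. A common remedy is a global lock order: every transaction acquires
# record locks in ascending key order, so a circular wait cannot form.

locks = {"R1": threading.Lock(), "R2": threading.Lock()}
balances = {"R1": 10, "R2": 20}

def update_both(first, second):
    # Always lock in sorted key order, regardless of the order requested.
    ordered = sorted((first, second))
    for key in ordered:
        locks[key].acquire()
    try:
        balances[first] += 1
        balances[second] += 1
    finally:
        for key in reversed(ordered):
            locks[key].release()

# The two transactions request the records in opposite orders, as in the
# deadlock scenario above, yet both complete because locking is ordered.
ta = threading.Thread(target=update_both, args=("R1", "R2"))
tb = threading.Thread(target=update_both, args=("R2", "R1"))
ta.start(); tb.start()
ta.join(); tb.join()
print(balances)   # {'R1': 12, 'R2': 22}
```

Real DBMSs more often detect deadlock (via a waits-for graph or timeouts) and abort one transaction; lock ordering is a prevention technique available to application code.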
Transaction Failures: When a transaction fails, it aborts, and the database must
be restored to the state it was in before the transaction started. Transactions may fail
for several reasons; some failures are due to deadlock situations or to the concurrency
control algorithms themselves.
Site Failures: Site failures are usually due to software or hardware failures. These
failures result in the loss of the main-memory contents. In a distributed database, site
failures are of two types:
Total failure, where all the sites of a distributed system fail.
Partial failure, where only some of the sites of a distributed system fail.
Media Failures: Such failures refer to the failure of secondary storage devices. In
these cases, a media failure results in the inaccessibility of part or all of the database
stored on such secondary storage.
RECOVERY FROM FAILURE
As already indicated, a DBMS must provide mechanisms for recovery after
failures of various kinds, which might have corrupted the database or left it in an
inconsistent state. In the Oracle context, several levels of failure are identified:
o Statement failure: simply causes the relevant transaction to be rolled-back and the
database returned to its previous state.
o Process failure: e.g. abnormal disconnection from a SQLPLUS session. Once again
this is handled automatically by rolling back transactions and releasing resources.
o Instance failure: a crash in the DBMS software, operating system or hardware. This
requires action by the database administrator, who must issue a SHUTDOWN
ABORT command to trigger the automatic instance recovery procedures.
o Media failure: one or more database files are corrupted, for instance after a disc head
crash. This is potentially most serious as it may have destroyed some of the files, like
the "redo" log, which are needed for recovery. A previous version of the database
must be restored from another storage device.
1. Instance Recovery
The aim of instance recovery is to keep users connected and able to carry on
working in the application [5]. Oracle's handling of physical database updating with the
LRU algorithm means that the state of the database at the time of failure is quite
complex: the data files may contain uncommitted changes that were written out early,
while lacking committed changes that had not yet been flushed from the buffer cache.
2. Media Recovery
Recovering a database after a disc failure involves restoring a previously backed-up
version from tape. Oracle provides the DBA with a full database EXPORT / IMPORT
mechanism for this purpose; it is also necessary to back up the log files and control files,
which do not form part of the database proper.
Any work done on the database since the backup will be lost unless the transaction
log is archived to tape, rather than having its data overwritten when the allocated disc
area is full. The DBA may choose whether or not to keep complete log archives; this is a
trade-off between extra time, space and administrative overheads and the cost of having
to re-enter transactions manually. Log archiving provides the power to do on-line
backups without shutting down the system, and full automatic recovery. It is obviously a
necessary option for operationally critical database applications.
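The roll-forward idea behind log archiving can be sketched as follows. This is a toy model with invented structures, not Oracle's actual redo mechanism: restoring the backup alone would lose every change made since the backup was taken, while replaying the archived log on top of it recovers the committed work.

```python
# Toy model of media recovery: restore the last backup, then roll forward
# through the archived redo log to reapply committed changes made since
# the backup. Without the archived log, those changes would be lost.

backup = {"x": 1, "y": 2}            # last full backup of the database
redo_log = [("x", 5), ("z", 9)]      # committed changes archived since the backup

def media_recover(backup, redo_log):
    db = dict(backup)                # step 1: restore the backed-up version
    for key, value in redo_log:      # step 2: roll forward through the archived log
        db[key] = value
    return db

print(media_recover(backup, redo_log))   # {'x': 5, 'y': 2, 'z': 9}
```

Note that `y` survives from the backup, while `x` and `z` are recovered from the log, which is exactly why both the backup and the archive are needed.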
TRANSPARENCY
According to the definition of distributed database, one major objective is to
achieve the transparency into the distributed system. Transparency refers to the separation
of the higher-level semantics of a system from lower-level implementation issues. In a
distributed system, transparency hides the implementation details from users of the
system. In other words, the user believes that he or she is working with a centralized
database system and that all the complexities of a distributed database arc either hidden or
transparent to the user [6]. A distributed DBMS may have various levels of transparency.
In a distributed DBMS [6] the following four main categories of transparency have been
identified:
DBMS transparency; hides the knowledge that the local DBMSs may be different
and is, therefore, only applicable to heterogeneous distributed DBMSs.[6]
Distributed database technology intends to extend the concept of data
independence to environments where data are distributed and replicated over a number of
machines connected by a network. This is provided by the forms of transparency listed above.
3.
Data Replication
A relation or a fragment of a relation is replicated if it is stored redundantly at two
or more sites. Full replication of a relation is the case where the relation is stored at all
sites; a fully redundant database is one in which every site contains a copy of the entire
database [9].
Advantages of Replication
o Availability: failure of a site containing relation r does not result in unavailability of
r if replicas exist.
Disadvantages of Replication
o Increased cost of updates: each replica of relation r must be updated.
o Increased complexity of concurrency control: concurrent updates to distinct replicas
may lead to inconsistent data unless special concurrency control mechanisms are
implemented. One solution is to choose one copy as the primary copy and apply
concurrency control operations on the primary copy.
4.
Data Fragmentation
Horizontal fragmentation: each tuple of r is assigned to one or more fragments.
Vertical fragmentation: the schema for relation r is split into several smaller schemas. All
schemas must contain a common candidate key (or super key) to ensure lossless join
property.
Advantages of Fragmentation
Horizontal fragmentation allows a relation to be split so that tuples are stored close
to where they are most frequently accessed. Vertical fragmentation allows tuples to be
split so that each part of a tuple is stored where it is most frequently accessed; a tuple-id
attribute allows efficient joining of the vertical fragments.
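The two fragmentation strategies and the lossless-join property can be illustrated with a short Python sketch over an invented sample relation.

```python
# Horizontal fragmentation assigns whole tuples to fragments (here, by
# department/site); vertical fragmentation splits the schema into column
# groups that each keep the candidate key "id", so the original relation
# can be rebuilt with a lossless join. The sample data is invented.

employees = [
    {"id": 1, "name": "Ann", "dept": "HQ",     "salary": 50},
    {"id": 2, "name": "Bob", "dept": "Branch", "salary": 40},
]

# Horizontal fragments: each tuple goes to the fragment for its site.
hq_frag     = [t for t in employees if t["dept"] == "HQ"]
branch_frag = [t for t in employees if t["dept"] == "Branch"]

# Vertical fragments: split the columns, keeping the key "id" in both.
names    = [{"id": t["id"], "name": t["name"]} for t in employees]
payrolls = [{"id": t["id"], "dept": t["dept"], "salary": t["salary"]} for t in employees]

# Lossless join: recombine the vertical fragments on the shared key.
rebuilt = [{**n, **p} for n in names for p in payrolls if n["id"] == p["id"]]
print(rebuilt == employees)   # True: no information was lost by fragmenting
```

Dropping `id` from either vertical fragment would make the join impossible, which is why every vertical fragment must contain a common candidate key.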
AUTONOMY
All operations at a site are controlled by that site, which can operate independently
when connections to other nodes have failed (Date, 1995). Local autonomy is a design
goal for a distributed database: a site can independently administer and operate its
database when connections to other nodes have failed. With local autonomy, each site
has the capability to control local data, administer security, log transactions, recover
when local failures occur, and provide full access to local data to local users when any
central or coordinating site cannot operate. In this case, data are locally owned and
managed, even though they are accessible from remote sites. This implies that there is
no reliance on a central site.
Types of autonomy:
1. Design autonomy: individual DBs can use the data models and transaction
management techniques that they prefer.
2. Communication autonomy: individual DBs can decide which information they
want to make accessible to other sites.
3. Execution autonomy: individual DBs can decide how to execute transactions
submitted to them.
In a homogeneous distributed database, all sites are aware of each other and agree
to cooperate in processing user requests; each site in the network surrenders part of its
autonomy in terms of its right to change schemas or software.
In a heterogeneous distributed database, sites may not be aware of each other and
may provide only limited facilities for cooperation in transaction processing.
1) Two-Tier Architecture
A PC client and a database server interact directly, with the processing divided
between them.
2) Three-Tier Architecture
To improve performance, the three-tier architecture adds another server layer,
either a middleware server or an application server.
Alternatively, the additional server software can be distributed between the database
server and PC clients.
3) Multiple-Tier Architecture
Client-server architecture with more than three layers: a PC client, a backend
database server, an intervening middleware server, and application servers. It provides
more flexibility on division of processing. The application servers perform business logic
and manage specialized kinds of data such as images.
(Figure courtesy of www.blueportal.org; author: Michael Mannino, Tata McGraw-Hill, 2004.)
ADVANTAGES OF CLIENT-SERVER ARCHITECTURES
1. More efficient division of labor
2. Horizontal and vertical scaling of resources
3. Better price/performance on client machines
4. Ability to use familiar tools on client machines
5. Client access to remote data (via standards)
6. Full DBMS functionality provided to client workstations
7. Overall better system price/performance
Properties of Transactions
A Transaction has four properties that lead to the consistency and reliability of a
distributed database. These are Atomicity, Consistency, Isolation, and Durability.
Atomicity: This refers to the fact that a transaction is treated as a single unit of
operation. Consequently, either all the actions related to a transaction are completed or
none of them is carried out. For example, after a crash the system should either complete
the remainder of the transaction or undo all the actions pertaining to it. Recovery is split
into two types, corresponding to the two types of failure: transaction recovery, which is
needed when the system terminates one of the transactions, for instance during deadlock
handling; and crash recovery, which is performed after a system crash or a hardware failure.
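The all-or-nothing behaviour can be sketched with an invented undo-log scheme in Python; this is an illustration of the principle, not any particular DBMS's recovery mechanism. Before each write, the transaction records the old value, so that on failure every write already made can be undone in reverse order.

```python
# Sketch of atomicity via an undo log: the transaction either applies all
# of its writes or, if any step fails, undoes the ones already made,
# restoring the database to its pre-transaction state. A write of None
# is used here purely to simulate a mid-transaction failure.

def run_transaction(db, writes):
    undo_log = []                            # before-images for rollback
    try:
        for key, value in writes:
            undo_log.append((key, db.get(key)))   # remember the old value
            db[key] = value
            if value is None:                # simulated failure mid-transaction
                raise ValueError("transaction failed")
        return "committed"
    except ValueError:
        for key, old in reversed(undo_log):  # undo all actions of the transaction
            if old is None:
                db.pop(key, None)            # the key did not exist before
            else:
                db[key] = old
        return "aborted"

db = {"a": 1}
print(run_transaction(db, [("a", 2), ("b", 3)]))    # committed
print(db)                                           # {'a': 2, 'b': 3}
print(run_transaction(db, [("a", 9), ("c", None)])) # aborted
print(db)                                           # {'a': 2, 'b': 3}: state fully restored
```

The second transaction wrote to both "a" and "c" before failing, yet neither write survives, which is exactly the atomicity guarantee described above.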
Consistency: Referring to its correctness, this property deals with maintaining
consistent data in a database system. Consistency falls under the subject of concurrency
control. For example, "dirty data" is data that has been modified by a transaction that has
not yet committed. Thus, the job of concurrency control is to be able to disallow
transactions from reading or updating "dirty data."
Isolation: According to this property, each transaction should see a consistent
database at all times. Consequently, no other transaction can read or modify data that is
being modified by another transaction.
If this property is not maintained, one of two things can happen to the database,
as shown in Figure 2:
a. Lost updates: this occurs when another transaction (T2) updates the same data being
modified by the first transaction (T1), in such a manner that T2 reads the value prior
to the writing of T1, thus losing that update.
b. Cascading aborts: this problem occurs when the first transaction (T1) aborts; the
transactions that had read or modified data used by T1 must then also abort.
Durability: This property ensures that once a transaction commits, its results are
permanent and cannot be erased from the database. This means that whatever happens
after the COMMIT of a transaction, whether a system crash or aborts of other
transactions, the results already committed are not modified or undone.
CONCLUSION
Looking at the history of databases, the technology developed from graph-based
systems to relational systems. Because of its simplicity and clean concepts, many studies
have been devoted to relational database design. To provide declaration of application-specific
types beyond the relational model, a new data model was introduced based on
object-oriented programming principles. More recently, a hybrid object-relational model
has emerged, which embeds object-oriented features in a relational context. The use of
objects has also been demonstrated as a way to achieve both interoperability of
heterogeneous databases and modularity of the DBMS itself. Sophisticated and reliable
commercial distributed DBMSs are now available on the market, but there are also a
number of issues that still need to be solved satisfactorily. These include skewed data
placement in parallel DBMSs and network-scaling problems, i.e., calibrating distributed
DBMSs for the specific characteristics of communication technologies such as
broadband networks and mobile and cellular networks. Advanced transaction models,
such as workflow models in distributed environments, models for mobile computing,
and distributed object management, are among the open research issues. Additionally, in
a highly distributed environment the cost of moving data can be extremely high, so the
optimal usage of communication lines and of caches on intermediate nodes becomes an
important performance issue. With the significant developments in the Internet and the
usage of the WWW, DBMS vendors are making their products web-enabled in order to
provide better web servers. In this way, a path toward the manipulation of the huge
volume of non-standard data that exists on the web is opened.
REFERENCES
[1]
[2]
[3]
[4]
[5] J. Dyke, S. Shaw, and M. Bach, Pro Oracle Database 11g RAC on Linux. Apress, 2010.
[6]
[7] A. Silberschatz, H. F. Korth, and S. Sudarshan, Database System Concepts, 3rd ed. Tata McGraw-Hill, 2004.
[8]
[9]
[10] P. Veríssimo and L. Rodrigues, Distributed Systems for System Architects. Kluwer Academic, 2001.
[11] M. T. Özsu and P. Valduriez, Principles of Distributed Database Systems, 3rd ed. Springer, 2011.