Chapter 6: Distributed Database System
 A distributed database is a collection of multiple
interconnected databases, spread physically across various
locations, that communicate via a computer network.
 In general, distributed databases include the following
features:
 Location independent
 Distributed query processing
 Distributed transaction management
 Hardware independent
 Operating system independent
 Network independent
 Transaction transparency
Prepared by Elisaye B. @WSU-DTC
Distributed DBMS:
 A distributed DBMS (DDBMS) is the software that manages
the distributed database and makes the distributed data
available to users.
 A distributed DBMS consists of a single logical
database that is divided into a number of pieces
called fragments.
 In a DDBMS, each site is capable of independently
processing user requests.

 Users access the DDBMS through applications, which are classified as:
1. Local Applications – Applications that do not require data
from other sites.
2. Global Applications – Applications that require data from
other sites.
 Characteristics of a DDBMS:
 A DDBMS has the following characteristics:
1. A collection of logically related shared data.
2. The data is split into a number of fragments.
3. Fragments may be replicated.
4. Fragments are allocated to sites.
5. The data at each site is under the control of, and managed
by, a DBMS.
Data Fragmentation, Replication, and Allocation
Techniques for Distributed Database Design
Data Replication
 Data replication is the process of storing separate copies
of the database at two or more sites. It is a popular fault
tolerance technique of distributed databases.
 A relation or fragment of a relation is replicated if it is
stored redundantly in two or more sites.
 Full replication of a relation is the case where the
relation is stored at all sites.
 Fully redundant databases are those in which every site
contains a copy of the entire database.
Advantages of Data Replication
 Reliability − In case of failure of any site, the database
system continues to work since a copy is available at
another site(s).
 Reduction in Network Load − Since local copies of data
are available, query processing can be done with reduced
network usage, particularly during prime hours. Data
updating can be done at non-prime hours.
 Quicker Response − Availability of local copies of data
ensures quick query processing and consequently quick
response time.
 Simpler Transactions − Transactions require fewer joins
of tables located at different sites and minimal
coordination across the network, so they become simpler.
Disadvantages of Data Replication
 Increased Storage Requirements − Maintaining multiple copies
of data is associated with increased storage costs. The storage
space required is in multiples of the storage required for a
centralized system.
 Increased Cost and Complexity of Data Updating − Each time a
data item is updated, the update needs to be reflected in all the
copies of the data at the different sites. This requires complex
synchronization techniques and protocols.
 Undesirable Application-Database Coupling − If complex
update mechanisms are not used, removing data inconsistency
requires complex coordination at the application level. This
results in undesirable coupling between the application and the
database.
 Some commonly used replication techniques are −
 Snapshot replication
 Near-real-time replication
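
The trade-off described above (reads answered from a local copy, every update sent to all copies) can be sketched in a few lines of Python. This is only an illustration, not a real replication protocol: the `Site` and `ReplicatedStore` names are hypothetical, and site failures and concurrent updates are ignored.

```python
class Site:
    """One site holding a local copy of the replicated data."""
    def __init__(self, name):
        self.name = name
        self.data = {}


class ReplicatedStore:
    """Hypothetical read-one / write-all replication over several sites."""
    def __init__(self, sites):
        self.sites = sites
        self.messages_sent = 0  # number of update messages sent to sites

    def read(self, local_site, key):
        # A read is answered from the local copy: no network traffic.
        return local_site.data.get(key)

    def write(self, key, value):
        # An update must reach every copy: one message per site.
        for site in self.sites:
            site.data[key] = value
            self.messages_sent += 1


sites = [Site("S1"), Site("S2"), Site("S3")]
store = ReplicatedStore(sites)
store.write("A-305", 500)
balance = store.read(sites[1], "A-305")
print(balance, store.messages_sent)
```

With three sites, one update costs three messages while the read costs none, which is the storage and update overhead weighed against quicker local reads.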

Fragmentation
 Fragmentation is the task of dividing a table into a set of
smaller tables.
 The subsets of the table are called fragments.
 Fragmentation can be of three types: horizontal, vertical,
and hybrid (combination of horizontal and vertical).
 Horizontal fragmentation can further be classified into
two techniques: primary horizontal fragmentation and
derived horizontal fragmentation.
 Fragmentation must be done so that the original table
can be reconstructed from the fragments whenever required.
 This requirement is called “re-constructiveness.”
 Vertical Fragmentation
 In vertical fragmentation, the fields or columns of a
table are grouped into fragments.
 In order to maintain re-constructiveness, each
fragment should contain the primary key field(s) of
the table. Vertical fragmentation can be used to
enforce privacy of data.
 E.g.
branch_name   customer_name   tuple_id
Hillside      Lowman          1
Hillside      Camp            2
Valleyview    Camp            3
Valleyview    Kahn            4
Hillside      Kahn            5
Valleyview    Kahn            6
Valleyview    Green           7
deposit1 = Π branch_name, customer_name, tuple_id (employee_info)
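
A minimal Python sketch of vertical fragmentation and its reconstruction follows. The `balance` column, its values, and the second fragment `deposit2` are assumptions added for illustration; the point is only that both fragments carry the key `tuple_id`, so a join on it rebuilds the original rows.

```python
def project(rows, columns):
    """Vertical fragment: keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in rows]


def join_on_key(left, right, key):
    """Rebuild rows by matching two fragments on the shared key column."""
    right_by_key = {row[key]: row for row in right}
    return [{**l, **right_by_key[l[key]]} for l in left]


# Illustrative table; the balance values are assumed, not from the slides.
table = [
    {"tuple_id": 1, "branch_name": "Hillside", "customer_name": "Lowman", "balance": 500},
    {"tuple_id": 2, "branch_name": "Hillside", "customer_name": "Camp", "balance": 336},
    {"tuple_id": 3, "branch_name": "Valleyview", "customer_name": "Camp", "balance": 205},
]

deposit1 = project(table, ["tuple_id", "branch_name", "customer_name"])
deposit2 = project(table, ["tuple_id", "balance"])

rebuilt = join_on_key(deposit1, deposit2, "tuple_id")
print(rebuilt == table)
```

A site that stores only `deposit1` never sees the balances, which is how vertical fragmentation can support data privacy.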
Horizontal Fragmentation
 Horizontal fragmentation groups the tuples of a
table according to the values of one or more fields.
 Horizontal fragmentation must also conform to the
rule of re-constructiveness.
 Each horizontal fragment must have all columns of
the original base table.
 E.g.
branch_name   account_number   balance
Hillside      A-305            500
Hillside      A-226            336
Hillside      A-155            62
account1 = σ branch_name = “Hillside” (account)

Hybrid Fragmentation
 In hybrid fragmentation, a combination of horizontal
and vertical fragmentation techniques is used.
 This is the most flexible fragmentation technique since
it generates fragments with minimal extraneous
information.
 Hybrid fragmentation can be done in two alternative
ways:
I. First generate a set of horizontal fragments; then
generate vertical fragments from one or more of the
horizontal fragments.
II. First generate a set of vertical fragments; then
generate horizontal fragments from one or more of the
vertical fragments.
Types of Distributed Databases
Distributed databases can be broadly classified
into homogeneous and heterogeneous distributed
database environments, each with further subdivisions.

 Homogeneous Distributed Databases
 In a homogeneous distributed database, all the sites use identical
DBMS and operating systems. Its properties are:
 The sites use very similar software.
 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to
process user requests.
 The database is accessed through a single interface as if it is a single
database.
 Types of Homogeneous Distributed Database
 There are two types of homogeneous distributed database:
 Autonomous: Each database is independent and functions on its
own. The databases are integrated by a controlling application and
use message passing to share data updates.
 Non-autonomous: Data is distributed across the homogeneous
nodes, and a central or master DBMS coordinates data updates
across the sites.
Heterogeneous Distributed Databases
 In a heterogeneous distributed database, different sites have
different operating systems, DBMS products and data models. Its
properties are:
 Different sites use dissimilar schemas and software.
 The system may be composed of a variety of DBMSs like relational,
network, hierarchical or object oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites, so there is limited
cooperation in processing user requests.
 Types of Heterogeneous Distributed Databases
 Federated − The heterogeneous database systems are
independent in nature and integrated together so that they
function as a single database system.
 Un-federated − The database systems employ a central
coordinating module through which the databases are accessed.
Query Processing in Distributed Databases
 Query processing in a distributed database management
system requires the transmission of data between the
computers in a network.
 A distribution strategy for a query is the ordering of data
transmissions and local data processing in a database system.
 A distributed database query is processed in stages as
follows:
1. Query Mapping: The input query on distributed data is
specified formally using a query language.
 It is then translated into an algebraic query on global
relations.
 It is first normalized, analyzed for semantic errors, simplified,
and finally restructured into an algebraic query.
2. Localization: In a distributed database, fragmentation results in
relations being stored in separate sites, with some fragments
possibly being replicated.
 This stage maps the distributed query on the global schema to
separate queries on individual fragments using data distribution
and replication information.
3. Global Query Optimization: Optimization consists of selecting
a strategy from a list of candidates that is closest to optimal.
 A list of candidate queries can be obtained by permuting the
ordering of operations within a fragment query generated by the
previous stage.
4. Local Query Optimization: This stage is common to all sites in
the DDB.
 The techniques are similar to those used in centralized systems.
 The first three stages discussed above are performed at a central
control site, while the last stage is performed locally.
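
The localization stage can be illustrated with a small Python sketch. It assumes a hypothetical catalog in which the global relation `account` is horizontally fragmented by `branch_name` across two sites; the fragment names, site names, and the `localize` function are all illustrative.

```python
# Hypothetical catalog: fragment name -> (site, branch_name value it holds).
CATALOG = {
    "account1": ("S1", "Hillside"),
    "account2": ("S2", "Valleyview"),
}


def localize(branch_name=None):
    """Map a global query on `account` to queries on individual fragments.

    If the query restricts branch_name, fragments that cannot contain
    matching tuples are dropped; otherwise every fragment is queried.
    """
    plan = []
    for fragment, (site, held_branch) in CATALOG.items():
        if branch_name is None or branch_name == held_branch:
            plan.append((site, f"SELECT * FROM {fragment}"))
    return plan


print(localize("Hillside"))  # only the fragment at S1 is needed
print(localize())            # no restriction: both fragments are needed
```

The global optimizer would then choose among orderings of these fragment queries, and each site would optimize its own part locally.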
Costs (Transfer of Data) of Distributed Query Processing:
 In distributed query processing, the data transfer cost is the cost
of transferring intermediate files to other sites for processing plus
the cost of transferring the final result files to the site where the
result is required.
 Suppose a user sends a query to site S1 that requires data
from S1 itself and from another site S2.
 There are three strategies to process this query:
1. Transfer the data from S2 to S1 and then process the query.
2. Transfer the data from S1 to S2 and then process the query.
3. Transfer the data from S1 and S2 to a third site S3 and then
process the query.
 The choice depends on factors such as the size of the relations
and of the results, the communication cost between the sites, and
the site at which the result will be used.
 Commonly, the data transfer cost is calculated in terms of the size
of the messages.
 The data transfer cost can be calculated with the following
formula:
 Data transfer cost = C * Size
 where C is the cost per byte of data transferred and Size is
the number of bytes transmitted.
 Example: Consider the following tables EMPLOYEE and
DEPARTMENT.
 Site 1: EMPLOYEE (EID, NAME, SALARY, DID)
 EID- 10 bytes
 NAME- 20 bytes
 SALARY- 20 bytes
 DID- 10 bytes
 Total records- 1000
 Record size- 60 bytes
 Site 2: DEPARTMENT (DID, DNAME)
 DID- 10 bytes
 DNAME- 20 bytes
 Total records- 50
 Record size- 30 bytes
 Example: Find the names of employees and their department names. Also find the
amount of data transferred to execute this query when it is submitted at site 3.
 Answer: The query is submitted at site 3, and neither EMPLOYEE nor DEPARTMENT
is available there. Each result tuple holds NAME (20 bytes) and DNAME (20 bytes),
so the result is 1000 tuples of 40 bytes each. There are three strategies:
1. Transfer both EMPLOYEE and DEPARTMENT to site 3 and join the tables there.
The total transfer is 1000 * 60 + 50 * 30 = 60000 + 1500 = 61500 bytes.
2. Transfer EMPLOYEE to site 2, join the tables at site 2, and transfer the
result to site 3. The total transfer is 60 * 1000 + 40 * 1000 = 100000 bytes,
since the 1000 result tuples (NAME, DNAME) must be sent from site 2 to site 3.
3. Transfer DEPARTMENT to site 1, join the tables at site 1, and transfer the
result to site 3. The total transfer is 30 * 50 + 40 * 1000 = 41500 bytes,
since the 1000 result tuples (NAME, DNAME) must be sent from site 1 to site 3.
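
The three strategies can be checked with a short Python sketch that applies Data transfer cost = C * Size to the attribute sizes listed for the example. It assumes C = 1 (so cost is counted in bytes), that every employee matches exactly one department (1000 result tuples), and that each result tuple is sized as NAME plus DNAME. The dictionary layout and function name are illustrative.

```python
COST_PER_BYTE = 1  # C in the formula; 1 means the cost is counted in bytes

EMPLOYEE = {"sizes": {"EID": 10, "NAME": 20, "SALARY": 20, "DID": 10}, "rows": 1000}
DEPARTMENT = {"sizes": {"DID": 10, "DNAME": 20}, "rows": 50}


def transfer_cost(relation, columns=None):
    """Data transfer cost = C * Size for shipping the given columns."""
    columns = columns or relation["sizes"].keys()
    record_size = sum(relation["sizes"][c] for c in columns)
    return COST_PER_BYTE * record_size * relation["rows"]


# Join result: one (NAME, DNAME) tuple per employee (assumption stated above).
result_tuple_size = EMPLOYEE["sizes"]["NAME"] + DEPARTMENT["sizes"]["DNAME"]
result_cost = COST_PER_BYTE * result_tuple_size * EMPLOYEE["rows"]

strategy1 = transfer_cost(EMPLOYEE) + transfer_cost(DEPARTMENT)  # join at site 3
strategy2 = transfer_cost(EMPLOYEE) + result_cost                # join at site 2
strategy3 = transfer_cost(DEPARTMENT) + result_cost              # join at site 1
print(strategy1, strategy2, strategy3)
```

Changing the attribute sizes or row counts shows how the cheapest strategy depends on the relative sizes of the relations and of the result.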
Distributed Query Processing Using Semijoin
 The semi-join operation is used in distributed query processing to
reduce the number of tuples in a table before transmitting it to
another site.
 This reduction in the number of tuples reduces the number and
the total size of the transmissions, which ultimately reduces the
total cost of data transfer.
 Suppose tables R1 and R2 are stored at sites S1 and S2 respectively.
 The joining column of one table, say R1, is forwarded to
the site where the other table, R2, is located.
 This column is joined with R2 at that site. The decision whether
to reduce R1 or R2 can only be made after comparing the
advantage of reducing R1 with that of reducing R2.
 Thus, semi-join is an efficient way to reduce the transfer
of data in distributed query processing.
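
The reduction step can be sketched in Python as follows. The sample rows, names, and the `semi_join_reduce` function are hypothetical; both "sites" are simulated in one process, so only the logic of the operation is shown, not the network transfer.

```python
def semi_join_reduce(r1, r2, key):
    """Return the tuples of r2 that join with r1, shipping only r1's key column."""
    # Step 1 (at R1's site): project the joining column and send it to R2's site.
    shipped_keys = {row[key] for row in r1}
    # Step 2 (at R2's site): keep only the R2 tuples whose key appears in R1.
    return [row for row in r2 if row[key] in shipped_keys]


# Illustrative data; the values are assumptions, not from the slides.
r1 = [{"DID": 10, "DNAME": "Sales"}, {"DID": 20, "DNAME": "HR"}]
r2 = [
    {"EID": 1, "NAME": "Abel", "DID": 10},
    {"EID": 2, "NAME": "Sara", "DID": 30},
    {"EID": 3, "NAME": "Kebede", "DID": 10},
]

reduced = semi_join_reduce(r1, r2, "DID")
print([row["EID"] for row in reduced])
```

Only the distinct DID values travel in step 1, and the tuple with DID 30 is never transmitted afterwards because it cannot take part in the join.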
Example: Find the amount of data transferred to
execute the query from the previous example
using the semi-join operation.
 Answer: The following strategy can be used to execute
the query.
1. Project the NAME and DID attributes of the EMPLOYEE
table at site 1 and transfer them to site 3, that is,
Π NAME, DID (EMPLOYEE). With NAME at 20 bytes and DID at
10 bytes, the size is 30 * 1000 = 30000 bytes.
2. Transfer the DEPARTMENT table to site 3 and join the
projected attributes of EMPLOYEE with this table. The
size of the DEPARTMENT table is 30 * 50 = 1500 bytes.
 With this scheme, the amount of data transferred to
execute the query is 30000 + 1500 = 31500 bytes.
Client/Server Architecture of DDBS
• Databases built on the client-server architecture
are quite common, so it is useful to review that
architecture and how it applies to the field of
databases.
• The client/server architecture is based on the
hardware and software components that interact to
form a system.
• The system includes three main components:
Clients, Servers, and Communications
Middleware.

Client:
 The client is any computer process that requests
services from the server.
 The client is also known as the Front-end
Application.
Client Components
The client application, or front end, runs on top of
the operating system and connects with the
middleware to access services available in the
network.
Several third-generation languages (3GL) and
fourth-generation languages (4GL) can be used
to create the front-end applications.
Most front-end applications are GUI-based to hide
the complexity of the Client/Server components
from the end users.
Server
 The server is any computer process providing
services to the clients.
 The server is also known as the Back-end
Application.
Server Components
 The server application, or back end, runs on top of
the operating system and interacts with the
middleware to “listen” for client requests for
services.
 Unlike the front-end client process, the server process
need not be GUI-based.
Communication Middleware
 The communication middleware is any computer
process through which clients and servers
communicate. It is also known as the
Communications Layer.
Database Middleware Components
 The middleware software is divided into three
main components:
 Applications Programming Interface (API)
 Database Translator
 Network Translator
Application Programming Interface (API)
 API is public to the client application.
 The programmer interacts with middleware
through the API provided by the middleware
software.
 The middleware API allows the client process to
be database-server-independent.
 This means that the server can be changed without
requiring that the client applications be completely
rewritten.
Database Translator
 Translates the SQL requests into the specific
database server syntax.
 The database translator takes the generic SQL
request and maps it to the database server's SQL
syntax.
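
As a rough Python illustration of what a database translator does, the sketch below maps one generic request ("first n rows of a table") to two real SQL dialect forms: `LIMIT` (PostgreSQL, MySQL) and `TOP` (SQL Server). The function name and the dialect labels are hypothetical, and a real middleware product would handle far more than this one case.

```python
def translate_first_rows(table, n, dialect):
    """Map a generic 'first n rows of table' request to server-specific SQL."""
    if dialect in ("postgresql", "mysql"):
        return f"SELECT * FROM {table} LIMIT {n}"
    if dialect == "sqlserver":
        return f"SELECT TOP {n} * FROM {table}"
    raise ValueError(f"unsupported dialect: {dialect}")


# The client issues the same generic request whichever server is behind it.
for dialect in ("postgresql", "sqlserver"):
    print(dialect, "->", translate_first_rows("account", 5, dialect))
```

Because the client only calls the generic function, swapping the database server changes the translator's output but not the client code, which is the server independence described for the API.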
 Network Translator
 Manages the network communications protocols.
 Database servers can use any of the network
protocols, such as TCP/IP, IPX/SPX or NetBIOS.
 The network layer handles all the communications
details of each database transparently to the client
application.
 Middleware Classifications
 Database middleware software can be classified
according to the way clients and servers
communicate across the network.
 The middleware software is usually classified as:
I. Message-Oriented Middleware (MOM)
II. Remote-Procedure-Call-based (RPC-based)
Middleware
III. Object-based Middleware
I. Message-Oriented Middleware (MOM)
 Message-oriented middleware is generally more
efficient in local area networks with limited
bandwidth and in applications in which data
integrity is not quite so critical.
II. Remote-Procedure-Call-based (RPC-based)
Middleware
 RPC-based middleware is probably most suited to
highly integrated systems in which data integrity
is critical, as well as high-throughput networks.
III. Object-based Middleware
 Object-based middleware is an emerging
technology based on object-oriented concepts.
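
The difference in communication style between the first two classes can be sketched in Python. This is an in-process simulation only: a plain function call stands in for a remote procedure call, and `queue.Queue` from the standard library stands in for a message queue. The `get_balance` procedure and the message layout are assumptions for illustration.

```python
import queue


def get_balance(account_number):
    """Server-side procedure; the balances are illustrative."""
    balances = {"A-305": 500, "A-226": 336}
    return balances[account_number]


# RPC style: the client calls the procedure and waits for the result.
rpc_result = get_balance("A-305")

# Message-oriented style: the client enqueues a request and moves on;
# the server picks it up later and posts a reply message.
requests, replies = queue.Queue(), queue.Queue()
requests.put({"op": "get_balance", "account_number": "A-226"})

message = requests.get()  # server side, at some later time
replies.put(get_balance(message["account_number"]))

mom_result = replies.get()  # client collects the reply when it is ready
print(rpc_result, mom_result)
```

In the RPC style the caller is blocked until the answer returns, while in the message-oriented style the request and the reply are decoupled in time.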