Chapter 7 - Distributed Database System
Distributed Databases
and
Client-Server Architectures
1. Distributed Database Concepts
A transaction can be executed by multiple networked
computers in a unified manner.
A distributed database (DDB) processes a unit of execution (a
transaction) in a distributed manner.
A distributed database (DDB) can be defined as:
– A collection of multiple logically related databases
distributed over a computer network, together with a
distributed database management system (DDBMS), a software
system that manages the distributed database while making
the distribution transparent to the user.
– The physical placement of data (files, relations, etc.) is
not known to the user (distribution transparency).
• The EMPLOYEE, PROJECT, and WORKS_ON tables
may be fragmented horizontally and stored with possible
replication as shown below.
Remark:
• Each site has a DBMS
– Fragments (replicated or unique).
– Linked by network.
– Can handle local users.
– Participates in at least one global request.
Advantages of DDB :
i. Distribution and Network transparency:
Users do not have to worry about operational details of
the network.
– Location transparency refers to the freedom to issue
a command from any location without affecting how it
works.
– Naming transparency allows access to any named
object (files, relations, etc.) from any location.
ii. Replication transparency:
It allows copies of data to be stored at multiple sites, as
shown in the above diagram.
This is done to minimize access time to the required
data.
iii. Fragmentation transparency:
• Allows a relation to be fragmented horizontally (a
subset of the rows of the relation) or vertically (a subset
of the columns of the relation).
iv. Increased reliability and availability:
Reliability refers to system live time, that is, the system is
running efficiently most of the time. Availability is the
probability that the system is continuously available
(usable or accessible) during a time interval.
A distributed database system has multiple nodes
(computers), and if one fails, the others remain available
to do the job.
v. Improved performance:
A distributed DBMS fragments the database to keep
data closer to where it is needed most.
This reduces data management (access and
modification) time significantly.
vi. Easier expansion (scalability):
Allows new nodes (computers) to be added at any time
without changing the entire configuration.
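The horizontal and vertical fragmentation described under fragmentation transparency can be illustrated with a small Python sketch. The sample rows and the helper names `horizontal_fragment` and `vertical_fragment` are hypothetical, chosen only for illustration; keeping the key column in every vertical fragment is an assumption made so that fragments can be rejoined.

```python
# Hypothetical EMPLOYEE rows (illustrative data only).
employees = [
    {"SSN": "111", "Fname": "Ann", "Lname": "Lee", "Dno": 5},
    {"SSN": "222", "Fname": "Bob", "Lname": "Roy", "Dno": 4},
    {"SSN": "333", "Fname": "Cal", "Lname": "Kim", "Dno": 5},
]


def horizontal_fragment(rows, predicate):
    """Horizontal fragment: a subset of rows, with every column kept."""
    return [dict(r) for r in rows if predicate(r)]


def vertical_fragment(rows, columns, key="SSN"):
    """Vertical fragment: a subset of columns, plus the key for rejoining."""
    keep = [key] + [c for c in columns if c != key]
    return [{c: r[c] for c in keep} for r in rows]


# Rows for department 5 could be stored at the site that serves department 5.
dept5 = horizontal_fragment(employees, lambda r: r["Dno"] == 5)
# Name columns only could be stored at a site that never needs Dno.
names = vertical_fragment(employees, ["Fname", "Lname"])
```

With fragmentation transparency, a user would query EMPLOYEE as a whole and never refer to `dept5` or `names` directly.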
Disadvantages of Distributed Database
• Homogeneous
– All sites of the database system have identical
setup, i.e., same database system software.
– The system may have little or no local autonomy.
– The underlying operating systems can be a mixture
of Linux, Windows, Unix, etc.
[Figure: five sites, all running Oracle on a mix of Windows, Unix, and Linux, connected by a communications network.]
• Heterogeneous
– At least one of the databases must be from a different vendor. There are two variants:
– Federated: Each site may run a different database system, but data
access is managed through a single conceptual schema.
• This implies that the degree of local autonomy is minimum.
Each site must adhere to a centralized access policy. There may
be a global schema.
– Multidatabase: There is no one conceptual global schema. For data
access a schema is constructed dynamically as needed by the
application software.
[Figure: heterogeneous sites connected by a network; an object-oriented DBMS at Site 3 and a relational DBMS at Site 2, both on Linux.]
4. Query Processing in Distributed Databases
Issues
– Cost of transferring data (files and results) over the network.
• This cost is usually high, so some optimization is necessary.
• Example: suppose there are three sites, with the relation Employee
at site 1, Department at site 2, and no relation at site 3.
– Employee at site 1. 10,000 rows. Row size = 100 bytes. Table
size = 1,000,000 bytes.
– Department at site 2. 100 rows. Row size = 35 bytes. Table size
= 3,500 bytes.
– A query is initiated from site 3 to retrieve, for each employee, the
first name (15 bytes), last name (15 bytes), and department name
(10 bytes), for a total of 40 bytes per result row.
• Q: For each employee, retrieve employee Fname, Lname, and
department name
• Q: $\pi_{Fname, Lname, Dname}(Employee \bowtie_{Dno = Dnumber} Department)$
• Employee schema: Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
• Strategies : Minimizing data transfer.
1. Transfer Employee and Department to site 3.
• Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3.
• Transferring employees data from site 1 to site 2: 1,000,000 bytes
• Query result size = 40 * 10,000 = 400,000 bytes.
• Total transfer size = 1,000,000 + 400,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
• Data Transfer from site 2 to site 1: 3500 bytes
• Query result size = 40 * 10,000 = 400,000 bytes
• Total bytes transferred = 3,500 + 400,000 = 403,500 bytes.
– Preferred approach: strategy 3.
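The arithmetic behind the three strategies for Q can be checked with a few lines of Python. This is only a restatement of the figures given above; the variable names are illustrative, and the result size assumes one 40-byte row per employee.

```python
# Sizes taken from the example above.
EMP_ROWS, EMP_ROW_BYTES = 10_000, 100
DEPT_ROWS, DEPT_ROW_BYTES = 100, 35
RESULT_ROW_BYTES = 40  # Fname (15) + Lname (15) + Dname (10)

emp_size = EMP_ROWS * EMP_ROW_BYTES          # Employee table at site 1
dept_size = DEPT_ROWS * DEPT_ROW_BYTES       # Department table at site 2
result_size = EMP_ROWS * RESULT_ROW_BYTES    # one result row per employee

# Bytes sent over the network for each strategy (result needed at site 3).
strategies = {
    1: emp_size + dept_size,     # ship both relations to site 3
    2: emp_size + result_size,   # ship Employee to site 2, result to site 3
    3: dept_size + result_size,  # ship Department to site 1, result to site 3
}

best = min(strategies, key=strategies.get)
print(strategies, "preferred strategy:", best)
```

The comparison counts bytes transferred only; local processing cost at each site is ignored, as in the example.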
Example 2: Consider the query
– Q’: For each department, retrieve the department name and the Fname
and Lname of the department manager.
• Relational algebra expression:
– $\pi_{Fname, Lname, Dname}(Employee \bowtie_{Mgrssn = SSN} Department)$
• The result of this query will have 100 tuples, assuming that every
department has a manager. The execution strategies are:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
• Total bytes transferred = 1,000,000 + 3500 = 1,003,500
bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3.
• Site 1 to site 2: 1,000,000 bytes.
• Site 2 to site 3: query result size = 40 * 100 = 4,000 bytes.
• Total transfer size = 4,000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1 and
send the result to site 3.
• Site 2 to site 1: 3,500 bytes.
• Site 1 to site 3: query result size = 40 * 100 = 4,000 bytes.
• Total transfer size = 4,000 + 3,500 = 7,500 bytes.
Preferred strategy: strategy 3.
Example 3: Now suppose the result is needed at site 2. Possible strategies:
1. Transfer Employee relation to site 2, execute the query and
present the result to the user at site 2.
• Total transfer size = 1,000,000 bytes for both queries
Q and Q’.
2. Transfer Department relation to site 1, execute join at site 1
and send the result back to site 2.
• Total transfer size for
– Q = 400,000 + 3500 = 403,500 bytes
– Q’ = 4000 + 3500 = 7500 bytes.
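All of the totals in Examples 1 to 3 follow one rule: every relation not already stored at the join site is shipped there, and the result is shipped only if the join site differs from the result site. A minimal sketch of that rule, assuming the table sizes above (the function name `transfer_cost` and its parameters are hypothetical):

```python
# Relation sizes in bytes, keyed by the site that stores each relation.
RELATION_AT_SITE = {1: 1_000_000, 2: 3_500}  # Employee at site 1, Department at site 2
RESULT_ROW_BYTES = 40


def transfer_cost(join_site, result_site, result_rows):
    """Bytes moved when the join runs at join_site and the answer is needed at result_site."""
    # Ship every relation that is not already at the join site.
    cost = sum(size for site, size in RELATION_AT_SITE.items() if site != join_site)
    # Ship the result only if it was computed somewhere else.
    if join_site != result_site:
        cost += result_rows * RESULT_ROW_BYTES
    return cost


Q_ROWS, Q_PRIME_ROWS = 10_000, 100  # result cardinalities of Q and Q'

# Example 3: result needed at site 2.
print(transfer_cost(join_site=2, result_site=2, result_rows=Q_ROWS))        # strategy 1
print(transfer_cost(join_site=1, result_site=2, result_rows=Q_ROWS))        # strategy 2, Q
print(transfer_cost(join_site=1, result_site=2, result_rows=Q_PRIME_ROWS))  # strategy 2, Q'
```

Calling the same function with `result_site=3` gives the totals of the earlier examples, which makes it easy to see how the preferred strategy depends on both the result size and the result site.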
5. Concurrency Control and Recovery
Distributed Databases encounter a number of concurrency
control and recovery problems which are not present in
centralized databases. Some of them are listed below.
– Communication link failure:
• This failure may create network partition which would
affect database availability even though all database
sites may be running.
– Distributed commit:
• A transaction may be fragmented, and its fragments may be
executed by a number of sites. This requires a two-phase or
three-phase commit approach for transaction commit.
– Distributed deadlock:
• Since transactions are processed at multiple sites, two or
more sites may get involved in deadlock. This must be
resolved in a distributed manner.
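The two-phase commit mentioned under distributed commit can be sketched as follows. This is a minimal in-memory illustration of the usual voting phase followed by a decision phase; the class `Participant` and the function `two_phase_commit` are hypothetical names, and the sketch assumes no site or link fails during the protocol (no logging, time-outs, or recovery).

```python
class Participant:
    """One site executing a fragment of the transaction (hypothetical model)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: the site votes yes only if it is able to commit its part.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes yes."""
    # Phase 1 (voting): collect a vote from every participating site.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (decision): broadcast the same outcome to every site.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return "commit" if decision else "abort"


sites = [Participant("site1"), Participant("site2", can_commit=False)]
print(two_phase_commit(sites))  # one "no" vote forces a global abort
```

The point of the two phases is that no site commits its fragment until it is known that every other site is also able to commit.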
5.1 Distributed Concurrency Control
i. Primary site technique: A single site is designated as a
primary site which serves as a coordinator for transaction
management.
[Figure: Sites 1, 2, 3, and 5 connected to a designated primary site.]
• Transaction management:
– Concurrency control and commit are managed by this site.
– In two-phase locking, this site manages locking and
releasing data items. If all transactions follow the two-phase
policy at all sites, then serializability is guaranteed.
– Advantages:
• It is an extension of centralized two-phase locking, so
implementation and management are simple.
• Data items are locked only at one site but they can be
accessed at any site.
– Disadvantages:
• All transaction management activities go to primary site
which is likely to overload the site.
• If the primary site fails, the entire system is inaccessible.
– To aid recovery, a backup site is designated which behaves as
a shadow of the primary site. In case of primary site failure,
the backup site can act as the primary site.
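The primary site technique can be sketched as a single lock table that lives only at the primary site, with every other site sending its lock requests there. The class name `PrimarySiteLockManager` and its methods are hypothetical, and the sketch assumes exclusive locks only, with no wait queue and no backup site.

```python
class PrimarySiteLockManager:
    """Lock table kept at the primary site; all sites send requests here."""

    def __init__(self):
        self.owner = {}  # data item -> id of the transaction holding its lock

    def lock(self, txn, item):
        # Grant the lock if the item is free or already held by this transaction.
        holder = self.owner.setdefault(item, txn)
        return holder == txn

    def release_all(self, txn):
        # Releasing everything at once when the transaction ends is one simple
        # way to respect the two-phase rule (no lock is acquired after a release).
        for item in [i for i, t in self.owner.items() if t == txn]:
            del self.owner[item]


primary = PrimarySiteLockManager()
print(primary.lock("T1", "x"))  # T1 obtains x
print(primary.lock("T2", "x"))  # T2 is refused while T1 holds x
primary.release_all("T1")
print(primary.lock("T2", "x"))  # T2 obtains x after T1 finishes
```

Because every request passes through one object, the sketch also makes the stated disadvantages visible: the primary site handles all lock traffic, and nothing can be locked if it is down.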
ii. Primary Copy Technique:
– In this approach, instead of a site, a data item partition is
designated as the primary copy. To lock a data item, only the
primary copy of the data item is locked.
• Advantages:
– Since primary copies are distributed at various sites, a
single site is not overloaded with locking and unlocking
requests.
• Disadvantages:
– Identification of a primary copy is complex. A distributed
directory must be maintained, possibly at all sites.
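The role of the distributed directory in the primary copy technique can be sketched as a lookup from each data item to the site holding its primary copy. The directory contents, site names, and the function `lock_item` are hypothetical; the sketch assumes exclusive locks and that every site holds an up-to-date copy of the directory.

```python
# Hypothetical directory: data item -> site of its primary copy and all copies.
DIRECTORY = {
    "EMP_D5": {"primary": "site2", "copies": ["site2", "site3"]},
    "DEPT": {"primary": "site1", "copies": ["site1", "site2", "site3"]},
}

# Lock tables of all sites, modeled as one dict keyed by (site, item).
locks = {}


def lock_item(txn, item):
    """Lock only the primary copy; return the site contacted and the outcome."""
    site = DIRECTORY[item]["primary"]  # directory lookup finds the primary copy
    holder = locks.setdefault((site, item), txn)
    return site, holder == txn


print(lock_item("T1", "DEPT"))    # request goes to site1
print(lock_item("T2", "DEPT"))    # refused at site1, T1 holds the primary copy
print(lock_item("T2", "EMP_D5"))  # request goes to site2, a different site
```

Requests for different items land on different sites, which is the load-spreading advantage, while the extra directory lookup reflects the stated disadvantage.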
Recovery from a coordinator failure
• In both approaches a coordinator site or copy may become
unavailable. This will require the selection of a new
coordinator.
– Primary site approach with no backup site:
• Aborts and restarts all active transactions at all sites.
Elects a new coordinator and initiates transaction
processing.
– Primary site approach with backup site:
• Suspends all active transactions, designates the backup
site as the primary site, and identifies a new backup site.
• The new primary site receives all transaction management
information to resume processing.
– Primary and backup sites fail or no backup site:
• Use an election process to select a new coordinator site.
iii. Concurrency control based on voting:
– There is no primary copy or coordinator.
– A lock request is sent to the sites that have the data item.
– If a majority of sites grant the lock, then the requesting
transaction gets the data item.
– Locking information (granted or denied) is sent to all these sites.
– To avoid an unacceptably long wait, a time-out period is defined.
If the requesting transaction does not get any vote information
within that period, then the transaction is aborted.
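The majority rule in the voting technique can be sketched as a count over the replies received before the time-out. The function name `majority_grants` is hypothetical, and a reply of `None` is an assumed way to represent a site that did not answer in time.

```python
def majority_grants(replies):
    """Return True if a strict majority of the sites holding a copy granted the lock.

    replies maps each site to True (granted), False (denied),
    or None (no reply before the time-out period expired).
    """
    grants = sum(1 for r in replies.values() if r is True)
    return grants > len(replies) / 2


# Three sites hold a copy; one does not answer in time.
print(majority_grants({"site1": True, "site2": True, "site3": None}))
# Only one grant out of three: the transaction does not get the item.
print(majority_grants({"site1": True, "site2": False, "site3": None}))
```

A silent site counts the same as a denial here, so the transaction proceeds only when more than half of the copies are locked on its behalf.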
Client-Server Database Architecture
[Figure: Servers 1 through n and Clients 1 through n; three-tier client/server architecture.]
• Clients reach the server for a desired service, but the server
does not reach clients.
• The server software is responsible for local data management
at a site, much like centralized DBMS software.
• The client software is responsible for most of the distribution
function.
• The communication software manages communication among
clients and servers.
• The processing of an SQL query proceeds as follows:
– The client parses a user query and decomposes it into a number
of independent subqueries. Each subquery is sent to the
appropriate site for execution.
– Each server processes its subquery and sends the result to the
client.
– The client combines the results of the subqueries and produces
the final result.
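The three query processing steps above can be sketched with servers modeled as in-memory fragments. The server names, sample rows, and the functions `run_subquery` and `client_query` are hypothetical, and real network communication is replaced here by a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical servers, each holding a horizontal fragment of EMPLOYEE.
SERVERS = {
    "server1": [{"Lname": "Lee", "Salary": 40000}, {"Lname": "Roy", "Salary": 25000}],
    "server2": [{"Lname": "Kim", "Salary": 55000}],
}


def run_subquery(server, min_salary):
    """Server side: evaluate the subquery against the local fragment only."""
    return [r for r in SERVERS[server] if r["Salary"] >= min_salary]


def client_query(min_salary):
    """Client side: send one subquery per server, then combine the results."""
    with ThreadPoolExecutor() as pool:
        # Steps 1 and 2: dispatch independent subqueries and collect the answers.
        partials = list(pool.map(lambda s: run_subquery(s, min_salary), SERVERS))
    # Step 3: merge the partial results into the final result.
    merged = [row for part in partials for row in part]
    return sorted(merged, key=lambda r: r["Lname"])


print(client_query(30000))
```

The servers never contact each other or the client on their own initiative; all distribution logic sits in `client_query`, matching the division of responsibilities described above.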