0% found this document useful (0 votes)

39 views

Chapter - 7 Distributed Database System

The document discusses distributed database concepts including distributed database management systems, data fragmentation and replication, types of distributed database systems including homogeneous and heterogeneous, query processing, concurrency control and recovery, and client-server architecture. It covers advantages and disadvantages of distributed databases.

Uploaded by

Eyoab

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

39 views

Chapter - 7 Distributed Database System

Uploaded by

Eyoab

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 54

Chapter - 7

Distributed Databases system

1
Outline
1 Distributed Database Concepts

2 Data Fragmentation, Replication and Allocation

3 Types of Distributed Database Systems

4 Query Processing

5 Concurrency Control and Recovery

6 3-Tier Client-Server Architecture

Distributed Database Concepts

– Distributed database - logically interrelated collection of

shared data (and a description of this data) physically

distributed over a computer network.

– DDBMS is a software system that manages a distributed

database while making the distribution transparent to the
user.
• A DDBMS therefore has the following characteristics:

– a collection of logically related shared data;

– the data is split into a number of fragments;

– fragments may be replicated;

– fragments/replicas are allocated to sites;

– the sites are linked by a communications network;

– the data at each site is under the control of a DBMS;

– the DBMS at each site can handle local applications, autonomously;

– each DBMS participates in at least one global application.

Advantages DDS
1. Management of distributed data with different levels of
transparency:
– Distribution transparency:

• This refers to the physical placement of data (files,

relations, etc.) is not known to the user.
− Network transparency: Users do not have to worry about
operational details of the network.
− Location transparency: refers to freedom of issuing command from
any location without affecting its working.
Con…
− Naming transparency: allows access to any named object (files, relations,
etc.) from any location.
− Replication transparency: Allows to store copies of a data at multiple sites.

• This is done to minimize access time to the required data.

− Fragmentation transparency: Allows to segment a relation

horizontally (create a subset of tuples of a relation) or vertically (create a

subset of columns of a relation).

Advantages of DDS
2. Increase reliability and availability:
− Reliability refers to system live time, that is, system is running efficiently most
of the time.
− Availability is the probability that the system is continuously available (usable
or accessible) during a time interval.
− A distributed database system has multiple nodes (computers) and if one fails
then others are available to do the job.
3. Improved performance:
− A DDBMS fragments the database to keep data closer to where it is needed
most.
− This reduces data management (access and modification) time significantly.
4. Easier expansion (scalability):
− Allows new nodes (computers) to be added anytime without chaining the entire
configuration.
Disadvantages DDS
– Complexity

– Cost

– Security

– Integrity control more difficult

– Lack of standards

– Lack of experience

– Database design more complex

Database system architectures
 A Database Architecture is a representation of DBMS design.
It helps to design, develop, implement, and maintain the
database management system.

1. Centralized database architecture

2. Parallel database architectures

3. A truly distributed database architecture

Centralized database
• A centralized database is basically a type of database that is
stored, located and maintained at a single location only.
• This type of database is modified and managed from that
location itself.
Parallel database architectures

 Parallel DBMSs link multiple, smaller machines to achieve

the same throughput as a single, larger machine, often with
greater scalability and reliability
 The three main architectures for parallel DBMSs:

 Shared memory(tightly coupled)

 Shared disk (loosely coupled architecture)

 Shared nothing-(massively parallel processing (MPP))

architecture
The three main architectures for parallel DBMSs:

■ Shared memory - tightly coupled architecture in which multiple processors

share secondary (disk) storage and primary memory.
The three main architectures for parallel DBMSs:

 Shared disk -loosely coupled architecture where multiple processors

share secondary (disk) storage but each has their own primary memory.
The three main architectures for parallel DBMSs:

 Shared nothing-(massively parallel processing (MPP)) architecture.

• Multiple processor architecture in which each processor is part of a
complete system, with its own memory and disk storage.
Distributed database
• A distributed database system allows applications to access data
from local and remote databases.

Site 1
Distributed database
• There are two types of distributed database architecture:
• Homogeneous Distributed Database.
• Heterogeneous Distributed Database.
Types of Distributed Database Systems

Autonomous Database
An autonomous database is a cloud database that uses machine learning to
automate database tuning, security, backups, updates, and other routine
management tasks traditionally performed by DBAs.
Homogeneous
• All sites of the database system have identical setup, i.e., same database
system software.
• The underlying operating systems can be a mixture of Linux, Window,
Unix, etc.
• For example, all sites run Oracle or DB2, or Sybase or some other database
system.
Window
Advantages Site 5 Unix
Oracle Site 1
 Easy to use Oracle
Window
 Easy to mange Site 4 Communications
neteork
 Easy to Design
Oracle
Disadvantages Site 3 Site 2
Linux Oracle Linux Oracle
 Difficult for most organizations to
force a homogeneous environment
Heterogeneous
 Different data center may run different DBMS products, with possibly different underlying data models.

 Translations required to allow for:

o Different hardware.
o Different DBMS products.
o Different hardware and different DBMS products.

Object Unix Relational

Oriented Site 5 Unix
Site 1
Hierarchical
Window
Site 4 Communications
network

Network
Object DBMS
Oriented Site 3 Site 2 Relational
Linux Linux
Heterogeneous
 Advantages
 Huge data can be stored in one Global center from different data center
 Remote access is done using the global schema.
 Different DBMSs may be used at each node

 Disadvantages
 Difficult to mange
 Difficult to design.

.
Federated Database Management Systems

• A federated database system - collection of cooperating database systems

that are autonomous and possibly heterogeneous.
• Differences in data models

• Differences in constraints

• Differences in query language

• Multidatabase system (MDBS): A distributed DBMS in which each site

maintains complete autonomy.
Client/server architecture
• Client/server architecture refers to the way in which computers
interact to form a system.
Distributed Processing and Distributed Database
DDBMS Components
Computer workstations
 To form the network system.
Network hardware and software
 Components that reside in each workstation.
Communications media
 Carry the data from one workstation to another.
Transaction processor (TP)
 Receives and Processes the application’s data requests.
Data processor (DP)
 Stores and Retrieves data located at the site.
 Also Known as data manager (DM).
DDBMS protocol
• determines how the DDBMS will:
– Interface with the network to transport data and commands
between DPs and TPs.
– Synchronize all data received from DPs (TP side) and route
retrieved data to the appropriate TPs (DP side).
– Ensure common database functions in a distributed system --
security, concurrency control, backup, and recovery.
• Single-Site Processing, Single-Site Data (SPSD)
– All processing is done on a single CPU or host computer.

– All DBMS are stored on the host computer’s local disk.

– The DBMS is accessed by dumb terminals.

– Most mainframe and minicomputer DBMSs.

– 1st generation single-user microcomputer database.

Non distributed (Centralized) DBMS
− MPSD requires a network file server on which conventional applications
are accessed through a LAN.

− Client/Server architecture.
– Fully distributed DBMS with support for multiple DPs and
TPs at multiple sites.
– Homogeneous DDMS
 Integrate only one type of centralized DBMS over the network.

– Heterogeneous DDBMS
 Integrate different types of centralized DBMSs over a network.
Distributed DB Transparency
– Distribution transparency
– Transaction transparency
– Failure transparency
– Performance transparency
– Heterogeneity transparency
Distribution Transparency
• Distribution transparency allows us to manage a physically
dispersed database as though it were a centralized database.

• Three Levels of Distribution Transparency

– Fragmentation transparency

– Location transparency

– Local mapping transparency

Distribution Transparency
• Example :
Employee data (EMPLOYEE) are distributed over three locations: New York,
Atlanta, and Miami.
Depending on the level of distribution transparency support, three different cases of
queries are possible:
Distribution Transparency
• Case 1: DB Supports Fragmentation Transparency
SELECT * FROM EMPLOYEE WHERE EMP_DOB < '01-JAN-1940';

• Case 2: DB Supports Location Transparency

SELECT * FROM E1 WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E2 WHERE EMP_DOC < '01-JAN-1940';
UNION
SELECT * FROM E3 WHERE EMP_DOC < '01-JAN-1940';

• Case 3: DB Supports Local Mapping Transparency

SELECT * FROM E1 NODE NY WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E2 NODE ATL WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E3 NODE MIA WHERE EMP_DOB < '01-JAN-1940';
Transaction Transparency
• Transaction transparency - ensures that database transactions

will maintain the database’s integrity and consistency.

• Transaction transparency consists:

– Remote Requests

– Remote Transactions

– Distributed Transactions

– Distributed Requests
A Remote Request
 Allows us to access data to be processed by a single remote database
processor.
A Remote Transaction
 Composed of several requests, may access data at only a single
site.
 Allows a transaction to reference several different (local or remote) DP
sites.
A Distributed Request
 Reference data from several remote DP sites.
 Allows a single request to reference a physically partitioned table.

Example2:
Distributed Request
Transaction Transparency
 Two-Phase Commit Protocol
 DO performs the operation and records the “before” and “after”
values in the transaction log.
 UNDO reverses an operation, using the log entries written by the
DO portion of the sequence.
 REDO redoes an operation, using the log entries written by DO
portion of the sequence.

– The write-ahead protocol forces the log entry to be written to

permanent storage before the actual operation takes place.
Two-Phase Commit Protocol
• Two-phase commit protocol defines the operations between two
nodes;
• Coordinator and

• Subordinates or cohorts - one or more

Two-Phase Commit Protocol
• The protocol is implemented in two phases:
• Phase 1: Preparation

• The coordinator sends a PREPARE TO COMMIT message to all

subordinates.
• The subordinates receive the message, write the transaction log
using the write-ahead protocol, and send an acknowledgement
message to the coordinator.
• The coordinator makes sure that all nodes are ready to commit,
or it aborts the transaction.
Two-Phase Commit Protocol
 Phase 2: The Final Commit

– The coordinator broadcasts a COMMIT message to all

subordinates and waits for the replies.

– Each subordinate receives the COMMIT message then updates

the database, using the DO protocol.
– The subordinates reply with a COMMITTED or NOT COMMITTED
message to the coordinator.
– If one or more subordinates uncommitted, the coordinator sends
an ABORT message, thereby forcing them to UNDO all changes.
Performance Transparency and
Query Optimization

• Query optimization must provide distribution transparency as

well as replica transparency.

• Replica transparency refers to the DDBMSs ability to hide the

existence of multiple copies of data from the user.

• Query optimization algorithms are based on two principles:

• Selection of the optimum execution order

• Selection of sites to be accessed to minimize communication

costs
 Operation Modes of Query Optimization
– Automatic query optimization
DDBMS finds the most cost-effective access path without user intervention.
– Manual query optimization
Optimization is selected and scheduled by the end user or programmer.

• Timing of Query Optimization

– Static query optimization takes place at compilation time.

– Dynamic query optimization takes place at execution time.
• Optimization Techniques Information -
– Statistically based query optimization
uses statistical information about the database.
– Rule-based query optimization algorithm

based on a set of user-defined rules to determine the best query access

strategy.
Distributed Database Design

 The design of a distributed database introduces three new issues:

– How to partition the database into fragments.
– Which fragments to replicate.
– Where to locate those fragments and replicas.
Data Fragmentation
 Data fragmentation allows us to break a single object
into two or more segments or fragments.
 Three Types of Fragmentation Strategies:

 Horizontal fragmentation

 Vertical fragmentation

 Mixed fragmentation
Data Fragmentation
 Horizontal Fragmentation - Consists of a subset of the tuples
of a relation.
 Fragment represents the equivalent of a SELECT statement, with
the WHERE clause on a single attribute.
Data Fragmentation
 Vertical fragment Consists of a subset of the attributes of a
relation.
 Equivalent to the PROJECT statement.
Data Fragmentation

 Mixed fragment - Consists of a horizontal

fragment that is subsequently vertically
fragmented, or a vertical fragment that is
then horizontally fragmented.
 A mixed fragment is defined using the
Selection and Projection operations of the
relational algebra.
Data Replication

 Data replication refers to the storage of data copies at multiple

sites served by a computer network.
– Enhance data availability and response time, reducing
communication and total query costs.
Data Replication
• Mutual Consistency Rule
– All copies of data fragments be identical.
– DDBMS must ensure that a database update is performed at
all sites where replicas exist.
• Replication Conditions
– Fully Replicated database stores multiple copies of all
database fragments at multiple sites.
– Partially Replicated database stores multiple copies of some
database fragments at multiple sites.
• Factors for Data Replication Decision
– Database Size
– Usage Frequency
Data Allocation
 Data allocation describes the processing of deciding where to
locate data.
 Data Allocation Strategies
– Centralized
The entire database is stored at one site.
– Partitioned
The database is divided into several disjoint parts (fragments)
and stored at several sites.
– Replicated
Copies of one or more database fragments are stored at several
sites.
Data allocation algorithms
• Data allocation algorithm take into consideration a variety of
factors:

– Performance and data availability goals

– Size, number of rows, the number of relations that an entity

maintains with other entities.

– Types of transactions to be applied to the database, the

attributes accessed by each of those transactions.
Questions ?
1. Explain what is meant by a DDBMS and discuss the motivation in
providing such a system.