
Lecture3-Distributed Introduction


Principles of Distributed Database Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
◼ Introduction
◼ What is a distributed DBMS
◼ Centralized Database vs Distributed Database
◼ DDB Advantages/Disadvantages
◼ Distributed Computing
◼ Distributed and Parallel Database Design
◼ Parallelism types
◼ DDB types
◼ Classification of DDB
◼ Distributed DBMS promises



Current Distribution – Geographically Distributed Data Centers



What is a Distributed Database System?
◼ A collection of multiple logically related databases distributed over a computer network. This covers:
- A database whose relations reside on different sites
- A database in which some relations are replicated at different sites
- A database whose relations are split between different sites

❑ A Distributed Database Management System (DDBMS) is the software system that manages a distributed database and makes the distribution transparent to the user.

Distributed database system (DDBS) = DDB + D–DBMS

Distributed DBMS
• It is the software system that permits the management of the distributed database and makes the distribution transparent to users.

• A DDBMS consists of a single logical database that is split into a number of fragments.

• Each fragment is stored on one or more computers under the control of a separate DBMS, with the computers connected by a communication network.

• Each site is capable of independently processing user requests that require access to local data, and is also capable of processing data stored on other computers in the network.

• Users access the distributed database via applications. Applications are classified as those that do not require data from other sites (local applications) and those that do require data from other sites (global applications).

• We require a DDBMS to have at least one global application.



DDBMS Characteristics
A DDBMS therefore has the following characteristics:

◼ A collection of logically related shared data.
◼ The data is split into several fragments.
◼ Fragments may be replicated.
◼ The sites are linked by a communications network.
◼ The data at each site is under the control of a DBMS.
◼ The DBMS at each site handles local applications autonomously.
◼ Each DBMS participates in at least one global application.
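
The characteristics above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Site` class, the `CATALOG` dictionary, the site names, and the account rows are all hypothetical and not taken from the lecture. It shows one logical relation split into fragments, one fragment replicated at a second site, a local application that touches a single site, and a global application that needs every fragment.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One site: a local DBMS holding some fragments (hypothetical model)."""
    name: str
    fragments: dict = field(default_factory=dict)  # fragment name -> list of rows


# A single logical ACCOUNT relation, split horizontally by branch (made-up data).
cairo = Site("cairo", {"account_cairo": [{"id": 1, "branch": "cairo", "balance": 500}]})
alex = Site("alex", {"account_alex": [{"id": 2, "branch": "alex", "balance": 900}]})
# Replication: a copy of the Cairo fragment is also stored at a third site.
backup = Site("backup", {"account_cairo": list(cairo.fragments["account_cairo"])})

# Catalog (an assumption of this sketch): which sites hold each fragment.
CATALOG = {
    "account_cairo": [cairo, backup],
    "account_alex": [alex],
}


def local_total(site, fragment):
    """Local application: reads only data stored at `site`."""
    return sum(row["balance"] for row in site.fragments[fragment])


def global_total():
    """Global application: needs one copy of every fragment, from any site."""
    total = 0
    for fragment, sites in CATALOG.items():
        total += local_total(sites[0], fragment)  # read the first listed replica
    return total
```

Calling `local_total(alex, "account_alex")` involves one site only, while `global_total()` has to consult the catalog and read from several sites, which is the work a DDBMS hides from the user.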



Centralized Vs. Distributed Databases

In centralized databases:
• Data is located in one place (one server)
• All DBMS functionalities are done by that server

In distributed databases:
• Data is stored in multiple places (each is running a DBMS)
• DBMS functionalities are distributed over many machines

Centralized DBMS Environment



Distributed DBMS Environment



Why Might Data be Distributed

• To minimize communication costs or response time.

• To maintain control and security.

• To increase availability in the event of failure.

• The data is too large to keep at a single site.

DDBMS- Advantages

• Reflects organizational structure


• Improved shareability and local autonomy
• Improved availability
• Improved reliability
• Improved performance
• Economics
• Modular growth



DDBMS- Disadvantages

• Complexity
• Cost
• Security
• Integrity control more difficult
• Lack of standards
• Lack of experience
• Database design more complex



DDBMS- Example (Banking System)



Distributed Computing

◼ A number of autonomous processing elements (not necessarily homogeneous) that are interconnected by a computer network and that cooperate in performing their assigned tasks.
◼ What is being distributed?
❑ Processing logic
❑ Function
❑ Data
❑ Control



Distributed Computing

◼ A distributed computing system is a number of interconnected autonomous processing elements (PEs).
◼ Their capabilities may differ; they may be heterogeneous.
◼ PEs do not have access to each other’s state, which they can only learn by exchanging messages that incur a communication cost.
◼ Therefore, when data is distributed, its management and access in a logically integrated manner requires special care from the distributed DBMS software.



Parallel VS. Distributed Databases
In a Parallel Database System (goal: improve performance through parallelization)

◼ A DBMS running across multiple processors and disks that is designed to execute operations in parallel, whenever possible, in order to improve performance.
◼ Distributed processing usually implies parallel processing (not distribution of data).
◼ Parallel processing can take place on a single machine.

In a Distributed Database System (goal: increase availability)

◼ Data is physically stored across several sites, and each site is managed by a DBMS capable of running independently of the other sites.
◼ In contrast to parallel databases, sharing data is the key feature of a DDB.

Different Architectures

Three possible architectures for passing and processing data:
a) Shared memory -- processors share a common memory
b) Shared disk -- processors share a common disk
c) Shared nothing -- processors share neither a common memory nor a common disk

Parallel VS. Distributed Databases
In Parallel Databases
• Machines are physically close to each other, e.g., in the same server room
• Machines are connected by dedicated high-speed LANs and switches
• Communication cost is assumed to be small
• Can use a shared-memory, shared-disk, or shared-nothing architecture

In Distributed Databases
• Machines can be far from each other, e.g., on different continents
• Machines can be connected by a public network, e.g., the Internet
• Communication cost and communication problems cannot be ignored
• Usually a shared-nothing architecture

Types of parallelism
1. Inter-query parallelism
Queries/transactions execute in parallel with one another.
2. Intra-query parallelism
A single query is executed in parallel using multiple processors or disks in a shared-nothing architecture, to improve the query’s response time.
3. Intra-operation parallelism
▪ Execution of a single complex or large operation in parallel on multiple processors.
◼ Multiple instances of an operator execute concurrently, with each instance working on a subset of the data.
◼ Intra-operator parallelism is based primarily on partitioning the input relation into non-overlapping data segments, followed by a final merge of the results.
For example, the ORDER BY clause of a query that runs over millions of records can be parallelized across multiple processors.
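
The partition, sort, and merge pattern described for intra-operation parallelism can be sketched as follows. This is an illustration only, not how any particular DBMS implements ORDER BY: the function names and the round-robin partitioning are assumptions of this sketch. It uses a thread pool for brevity; in CPython, threads do not speed up CPU-bound sorting, so a real system would use separate processes or machines.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split the input into n_parts non-overlapping segments (round-robin)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def parallel_order_by(rows, key, n_workers=4):
    """Sort each segment in its own worker, then merge the sorted runs."""
    segments = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker runs one instance of the sort operator on its segment.
        sorted_runs = list(pool.map(lambda seg: sorted(seg, key=key), segments))
    # Final merge of the partial results into one ordered output.
    return list(heapq.merge(*sorted_runs, key=key))
```

For a list of row dictionaries, `parallel_order_by(rows, key=lambda r: r["amount"])` produces the same ordering as sorting the whole list by `amount` in one step.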
Types of parallelism

4. Inter-operation parallelism
4.1 Pipelined parallelism
Execution of different operations in pipelined fashion. For example, if we need to join three tables, one processor may join two tables and send the result records, as they are produced, to another processor. That processor joins the third table with the incoming records and produces the result.

4.2 Independent parallelism
Execution of each operation on a different processor, which is possible only if the operations can be executed independently of each other. For example, if we need to join four tables, two can be joined at one processor and the other two at another processor. The final join can be done later.
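
Both forms of inter-operation parallelism can be sketched with a small hash join. This is a simplified model under stated assumptions: `hash_join`, `pipelined_join`, and `independent_join` are hypothetical names, rows are plain dictionaries, and all joins use one shared key. The pipelined version uses a generator to show rows flowing from the first join into the second as they are produced; it models the data flow in a single thread rather than two real processors.

```python
from concurrent.futures import ThreadPoolExecutor


def hash_join(left, right, key):
    """Join two iterables of dicts on `key`, yielding rows as they are found."""
    index = {}
    for row in right:  # build phase: index the right input
        index.setdefault(row[key], []).append(row)
    for row in left:  # probe phase: stream the left input
        for match in index.get(row[key], []):
            yield {**row, **match}


def pipelined_join(r, s, t, key):
    """(R join S) join T: the second join consumes rows as the first yields them."""
    first = hash_join(r, s, key)  # a generator, nothing is materialized here
    return list(hash_join(first, t, key))


def independent_join(r, s, t, u, key):
    """(R join S) and (T join U) run in separate workers; the final join comes later."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(lambda: list(hash_join(r, s, key)))
        right = pool.submit(lambda: list(hash_join(t, u, key)))
        return list(hash_join(left.result(), right.result(), key))
```

In `independent_join` the two inner joins share no input, so neither has to wait for the other; only the final join depends on both results.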

Types of Distributed Database Systems

◼ Three main factors are used to differentiate between different types of DDBMSs.

Homogeneous Distributed Databases

All sites of the database system have identical DBMSs.
◼ For example, all sites run Oracle or DB2.
• Each site is aware of all other sites and cooperates with other sites to process user requests.
• The database is accessed through a single interface as if it is a single database.

[Figure: Sites 1 to 5, each running Oracle on Windows, Unix, or Linux, connected by a communications network]

Homogeneous Distributed Databases

There are two types of homogeneous distributed database:

• Autonomous − Each database is independent and functions on its own. The nodes operate independently, are integrated by a controlling application, and use message passing to share data updates.

• Non-autonomous − Data is distributed across the homogeneous nodes, and a central or master DBMS coordinates data updates across the sites.

Homogeneous Distributed Databases: Autonomy
◼ Design autonomy:
❑ Individual DBMSs are free to use the data models and transaction management techniques that they prefer.
◼ Communication autonomy:
❑ Each individual DBMS is free to make its own decision on providing other DBMSs with information.
◼ Execution autonomy:
❑ Each DBMS can execute the transactions that are submitted to it in any way that it wants to.

Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating systems, DBMS products, and data models. Its properties are:
• Different sites use dissimilar schemas and software.
• The system may be composed of a variety of DBMSs, such as relational, network, hierarchical, or object-oriented.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar software.
• A site may not be aware of other sites, so there is limited cooperation in processing user requests.

Heterogeneous Distributed Databases

◼ Many database applications require data from a variety of preexisting databases located in a heterogeneous collection of hardware and software platforms.

[Figure: Sites 1 to 5 running object-oriented, relational, hierarchical, and network DBMSs on Unix, Windows, and Linux, connected by a communications network]

Heterogeneous Distributed Databases

◼ Federated: Each site may run a different database system, but data access is managed through a single conceptual schema.
❑ This implies that the degree of local autonomy is minimal. Each site must adhere to a centralized access policy. There may be a global schema.

◼ Multi-database: There is no single global conceptual schema. For data access, a schema is constructed dynamically as needed by the application software.

Distributed DBMS Architectures

DDBMS architectures are generally developed depending on three parameters:

• Distribution − It states the physical distribution of data across the different sites.

• Autonomy − It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently.

• Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system components, and databases.

Classification of DDBMS

[Figure: classification of DDBMSs along three axes (distribution, autonomy, heterogeneity), locating client/server systems, peer-to-peer distributed DBSs, multi-DBSs, distributed multi-DBSs, and federated DBSs]

History – Early Distribution
Peer-to-Peer (P2P)



History – Client/Server



Distributed DBMS Promises

◼ Transparent management of distributed, fragmented, and replicated data

◼ Improved reliability/availability through distributed transactions

◼ Improved performance

◼ Easier and more economical system expansion



Scalability

◼ The issue is database scaling and workload scaling

◼ Adding processing and storage power

❑ Scale-out: add more servers

❑ Scale-up: increase the capacity of one server → has limits



Outline
◼ Introduction


❑ Distributed DBMS architecture
