Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

DDBS Lec1

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 20

CSE 607

Distributed Database

Source:
1. Principles of Distributed Database Systems
By Tanner Ozsu, Patric Valdureitz
2. Slides available
Survey of advanced topics in Database Systems

• By 1998, centralized database managers (DBMSs) would be an “antique


curiosity” and most organizations would move towards distributed database
managers. Distribution was slowly starting and “client/server” had just started.

• These systems were generally multiple client/single server systems in which the
distribution was mostly in terms of functionality, not data. If multiple servers
were used, clients were responsible for managing the connections to these
servers.

• Transparency of access was not widely supported, and each client had to “know”
the location of the required data. The distribution of data among multiple servers
was very primitive, systems did not support fragmentation or replication of data.

• Systems of the time were “homogeneous” in that each system could manage
only data that were stored in its own database, with no linkage to other
repositories.
• Today’s client/server systems provide significant transparency in accessing data
from multiple servers, support distributed transactions to facilitate transparency,
and execute queries over (horizontally) fragmented data.

• Further, new systems implement both synchronous and asynchronous


replication protocols, and many vendors have introduced gateways to access
other databases.

• Significant achievements have taken place in the development and deployment


of parallel database servers.

• Object database managers have entered the marketplace and have found a
niche market in some classes of applications which are inherently distributed.
Distributed Database System
Distributed database system (DDBS) technology:
It is one of the major recent developments in the database systems area.
It is the union of what appear to be two diametrically opposed approaches to data
processing:
1. Database system and
2. Computer network technologies.
One of the major motivations behind the use of database systems is the desire to
integrate the operational data of an enterprise and to provide centralized, thus
controlled access to that data.

The technology of computer networks promotes a mode of work that goes against all
centralization efforts.
The most important objective of the database technology is integration, not centralization.
It is possible to achieve integration without centralization, and that is exactly what the DDB
technology attempts to achieve.
Distributed Data Processing
Distributed computing system: It is a number of autonomous processing elements
(not necessarily homogeneous) that are interconnected by a computer network and
that cooperate in performing their assigned tasks. The “processing element” is a
computing device that can execute a program on its own.

What is being distributed?


1. Processing logic: Processing logic or processing elements are distributed.

2. Function: Various functions of a computer system could be delegated to various


pieces of hardware or software.

3. Data: Data used by a number of applications may be distributed to a number of


processing sites.

4. Control: The control of the execution of various tasks might be distributed


instead of being performed by one computer system.
Why do we distribute at all?

Distributed processing better corresponds to the organizational structure of today’s


widely distributed enterprises, and that such a system is more reliable and more
responsive.

Many of the current applications of computer technology are inherently distributed.


Electronic commerce over the internet, multimedia applications such as
news-on-demand, manufacturing control systems are all examples of such applications.

The fundamental reason behind distributed processing is to be better able to solve


the big and complicated problems simply by dividing them into smaller pieces
(using a variation of the divide-and-conquer rule) and assigning them to
different software groups.
Advantages
1. Distributed computing provides an economical method of harnessing more
computing power by employing multiple processing elements optimally.
2. By attacking these problems in smaller groups working more or less
autonomously, it might be possible to discipline the cost of software
development.

Distributed Database System

Distributed database: It is a collection of multiple, logically interrelated


databases distributed over a computer network.

Distributed database management system (Distributed DBMS): A distributed


DBMS is defined as the software system that permits the management of the
DDBS and makes the distribution transparent to the users.

Fig. 1.6: Central database on a network (Not a DDBS)

Fig. 1.7: DDBS environment


What is a Distributed Database System?

• A distributed database System is a collection of databases which are distributed


over different computers of a computer network.

• Each site has autonomous processing capability and can perform local applications.

• Each site also participates in the execution of at least one global application which
requires accessing data at several sites.

Database 1
Database 3
Communication Network
Server 1
Server 3

Database 2

Server 2
Promises of DDBSs

1) Transparent management of distributed and replicated data


Transparency refers to separation of the higher level semantics of a system from
lower level implementation issues. A transparent system hides the
implementation details from users.
• Consider an engineering firm that has offices in Boston, Edmonton, Paris, and
San Francisco. They run projects at each of these sites and would like to maintain
a database of their employees, the projects and other related data.
- Assuming that the database is relational, we need the following relations:
EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET)
PAY(TITLE, SAL)
ASG(ENO, PNO, DUR, RESP)

- For a DDBS, to localize each data such that data about the employees in
Edmonton office are stored in Edmonton, those in the Boston office are stored in
Boston, and so forth. The same applies to the project and salary information.
 We partition each of the relations and store each partition at a different
site. This is known as fragmentation. It may be preferable to duplicate
some of this data at other sites for performance and reliability reasons. The
result is a distributed database which is fragmented and replicated.

Fig. 1.5: A distributed application

2) Reliability through distributed transactions


 Distributed DBMS are intended to improve reliability since they have
replicated components and thereby, eliminate single points of failure.
 The failure of a single site or the failure of a communication link which
makes one or more sites unreachable, is not sufficient to bring down the
entire system.
Distributed DBMS Environment
Example
Transparent Access
3) Improved Performance
1. A DDBMS fragments the conceptual database, enabling data to be stored
in close proximity to its points of use (data localization).
• Since each site handles only a portion of the database, contention
for CPU and I/O services is not as severe as for centralized databases
• Localization reduces remote access
2. The inherent parallelism of distributed systems may be exploited for inter-
query and intra-query parallelism.
• Inter-query parallelism results from the ability to execute multiple
queries at the same time.
• On the other hand, intra-query parallelism is achieved by breaking
up a single query into a number of subqueries each of which is
executed at a different site, accessing a different part of the
distributed database.
Potential Problems
• Complexity: More complex than centralized database management ones.
• Cost: Distributed systems require additional hardware (communication
mechanisms etc.), thus have increased hardware costs.
• Distribution of control: Distribution creates problems of synchronization and
coordination.
• Security: In a distributed database system, a network is involved which is a
medium that has its own security requirements. Thus, the security problems in
distributed database systems are by nature more complicated than in
centralized ones.
Problem Areas

1. Distributed database design


2. Distributed query processing
3. Distributed concurrency control
4. Distributed directory management
5. Distributed deadlock management
6. Reliability of distributed DBMS
7. Operating system support
8. Heterogeneous databases
9. Relationship among problems
Distributed Database Design

To place the database and applications across different sites, there are two alternatives:
i) Partitioned (or non-replicated) and
ii) Replicated.

Partitioned scheme: Database is divided into a number of disjoint partitions each of which is
placed at a different site.

Replicated scheme: It can be fully replicated where the entire database is stored at each site, or
partially replicated where each partition of the database is stored at more than one site but not at
all the sites.

Two fundamental design issues are


fragmentation- the separation of the database into partitions called fragments, and distribution-
the optimum distribution of the database.
Distributed Query Processing

 Query processing deals with designing algorithms that analyze queries and
convert them into a series of data manipulation operations. The problem is
how to decide on a strategy for executing each query over the network in the
most cost-effective way.
 The objective is to optimize where the inherent parallelism is used to improve
the performance of executing the transaction.

Distributed Directory Management


• A directory contains information (such as descriptions and locations) about data
items in the database.
• A directory may be global to the entire DDBS or local to each site; it can be
centralized at one site or distributed over several sites.
Distributed Concurrency Control
• Concurrency control involves the synchronization of accesses to the distributed
database, such that the integrity of the database is maintained.
• We have to worry about both the integrity of a single database, and about the
consistency of multiple copies of the database (mutual consistency).
• Two solution classes are pessimistic and optimistic. In pessimistic, synchronizing
the execution of user requests before the execution starts. In optimistic,
executing the requests and then checking if the execution compromised the
consistency of the database.
• Locking can be used in both cases. It is based on the mutual exclusion of accesses
to data items. Timestamping ensures the execution of the transactions in some
order.

You might also like