Distributed Database Concepts

Distributed databases allow data to be shared across a computer network while being logically unified. A distributed database management system (DDBMS) manages the distributed database and makes the distribution transparent to users. Data is fragmented and fragments may be replicated across multiple database sites connected by a communications network. Each site has a local DBMS and participates in global applications while handling local applications autonomously.

Uploaded by

Joel wakhungu

Distributed Databases

Concepts
Distributed Database.
A logically interrelated collection of shared data
(and a description of this data), physically
distributed over a computer network.

Distributed DBMS.
Software system that permits the management of
the distributed database and makes the
distribution transparent to users.
Concepts
• Collection of logically-related shared data.
• Data split into fragments.
• Fragments may be replicated.
• Fragments/replicas allocated to sites.
• Sites linked by a communications network.
• Data at each site is under control of a DBMS.
• DBMSs handle local applications autonomously.
• Each DBMS participates in at least one global
application.
Reason for Data Distribution
• Centralized DBMS vs. Distributed Database System
• A distributed database is a collection of data that
belongs logically to the same system but is physically
spread over the sites of a computer network
• Several factors have led to the development of DDBS:
– Distributed nature of some database applications
– Increased reliability and availability
– Allowing data sharing while maintaining some measure of local
control
– Improved performance
Component Architecture for a DDBMS
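[Figure: component architecture for a DDBMS. Site 1 and site 2 are connected through a computer network; each site is shown with DDBMS, DC and GDD components, and the diagram also shows an LDBMS and a local DB.]
LDBMS : Local DBMS component
DC : Data communication component
GDD : Global Data Dictionary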
The Ideal Situation
• A single application should be able to operate
transparently on data that is:
– spread across a variety of different DBMS's
– running on a variety of different machines
– supported by a variety of different operating systems
– connected together by a variety of different
communication networks

• The distribution can be geographical or local


Workable definition
A distributed database system consists of a collection of sites
connected together via some kind of communications network,
in which :
– each site is a database system site in its own right;
– the sites agree to work together, so that a user at
any site can access data anywhere in the network
exactly as if the data were all stored at the user's
own site
It is a logical union of real databases
• It can be seen as a kind of partnership among individual local
DBMS's
• Difference with remote access or distributed processing systems
• Temporary assumption: strict homogeneity
Additional Functionality of DDBS
• Distribution leads to increased complexity in the system design
and implementation
• DDBMS must be able to provide additional functions to those of
a centralized DBMS. Some of these are:
– To access remote sites and transmit queries and data among the various
sites via a communication network.
– To keep track of the data distribution and replication in the DDBMS catalog.
– To devise execution strategies for queries and transactions that access
data from more than one site.
– To decide on which copy of a replicated data item to access.
– To maintain the consistency of copies of a replicated data item.
– To maintain the global conceptual schema of the distributed database
– To recover from individual site crashes and from new types of failures such
as failure of a communication link.
Types of DDBMS
• In a homogeneous distributed database
– All sites have identical software
– Are aware of each other and agree to cooperate in processing
user requests.
– Each site surrenders part of its autonomy in terms of right to
change schemas or software
– Appears to user as a single system
• In a heterogeneous distributed database
– Different sites may use different schemas and software
• Difference in schema is a major problem for query processing
• Difference in software is a major problem for transaction processing
– Sites may not be aware of each other and may provide only
limited facilities for cooperation in transaction processing
Physical Architecture of DDBS
Distributed Database Design
• Fragmentation
– Relation may be divided into a number of sub-
relations, which are then distributed.
• Allocation
– Each fragment is stored at site with "optimal"
distribution.
• Replication
– Copy of fragment may be maintained at several
sites.
Fragmentation
• Definition and allocation of fragments carried out
strategically to achieve:
– Locality of Reference
– Improved Reliability and Availability
– Improved Performance
– Balanced Storage Capacities and Costs
– Minimal Communication Costs.
• Involves analyzing most important applications,
based on quantitative/qualitative information.
Fragmentation
• Quantitative information may include:
– frequency with which an application is run;
– site from which an application is run;
– performance criteria for transactions and
applications.
• Qualitative information may include
transactions that are executed by
application, type of access (read or write),
and predicates of read operations.
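As a rough illustration of how such quantitative information might be used, the sketch below places each fragment at the site that runs applications against it most often (locality of reference). The fragment names, site names and frequencies are invented for the example, and a real allocation would also weigh update cost, storage and communication cost.

```python
from collections import defaultdict

# Hypothetical statistics: (site the application runs from, fragment, runs per day).
access_stats = [
    ("site1", "account_hillside", 120),
    ("site2", "account_hillside", 15),
    ("site1", "account_valleyview", 10),
    ("site2", "account_valleyview", 90),
]


def allocate_by_frequency(stats):
    """Return {fragment: site}, picking the site with the highest total access frequency."""
    totals = defaultdict(lambda: defaultdict(int))
    for site, fragment, freq in stats:
        totals[fragment][site] += freq
    return {frag: max(sites, key=sites.get) for frag, sites in totals.items()}


allocation = allocate_by_frequency(access_stats)
```

This is only a naive heuristic; it ignores the qualitative information (read versus write access, predicates) listed above.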
Data Allocation
• Four alternative strategies regarding
placement of data:
– Centralized
– Partitioned (or Fragmented)
– Complete Replication
– Selective Replication
Data Allocation
• Centralized
– Consists of single database and DBMS stored
at one site with users distributed across the
network.

• Partitioned
– Database partitioned into disjoint fragments,
each fragment assigned to one site.
Data Allocation
• Complete Replication
– Consists of maintaining complete copy of
database at each site.

• Selective Replication
– Combination of partitioning, replication, and
centralization.
Why Fragment?
• Usage
– Applications work with views rather than
entire relations.
• Efficiency
– Data is stored close to where it is most
frequently used.
– Data that is not needed by local applications
is not stored.
Why Fragment?
• Parallelism
– With fragments as unit of distribution, transaction
can be divided into several subqueries that operate
on fragments.
• Security
– Data not required by local applications is not stored
and so not available to unauthorized users.

• Disadvantages
– Performance
– Integrity.
Correctness of Fragmentation
• Three correctness rules:

– Completeness
– Reconstruction
– Disjointness.
Correctness of Fragmentation
• Completeness
– If relation R is decomposed into fragments
R1, R2, ... Rn, each data item that can be found in
R must appear in at least one fragment.
• Reconstruction
– Must be possible to define a relational operation
that will reconstruct R from the fragments.
– Reconstruction for horizontal fragmentation is the Union
operation; for vertical fragmentation it is the Join.
Correctness of Fragmentation
• Disjointness
– If data item di appears in fragment Ri, then it
should not appear in any other fragment.
– Exception: vertical fragmentation, where
primary key attributes must be repeated to allow
reconstruction.
– For horizontal fragmentation, data item is a tuple
– For vertical fragmentation, data item is an
attribute.
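The three rules can be checked mechanically for a horizontal fragmentation. The following is a minimal sketch that models a relation as a Python set of tuples; the function name and the sample data are assumptions made for illustration.

```python
def check_horizontal_fragmentation(relation, fragments):
    """Check completeness, reconstruction and disjointness for horizontal fragments.

    relation: set of tuples; fragments: list of sets of tuples.
    """
    union = set().union(*fragments)
    return {
        # Every tuple of R appears in at least one fragment.
        "completeness": relation <= union,
        # The union of the fragments gives back exactly R.
        "reconstruction": union == relation,
        # No tuple is counted twice, i.e. no tuple sits in two fragments.
        "disjointness": sum(len(f) for f in fragments) == len(union),
    }


R = {(1, "a"), (2, "b"), (3, "c")}
R1 = {t for t in R if t[0] <= 2}
R2 = {t for t in R if t[0] > 2}

good = check_horizontal_fragmentation(R, [R1, R2])
overlapping = check_horizontal_fragmentation(R, [R1, R1 | R2])
```

In the second call the fragments overlap, so completeness and reconstruction still hold but disjointness does not.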
Types of Fragmentation
• Four types of fragmentation:
– Horizontal
– Vertical
– Mixed
– Derived.

• Other possibility is no fragmentation:


– If relation is small and not updated frequently, may
be better not to fragment relation.
Horizontal and Vertical Fragmentation

Horizontal Fragmentation
• This strategy is determined by looking at
predicates used by transactions.
• Involves finding set of minimal (complete and
relevant) predicates.
• Set of predicates is complete, if and only if, any
two tuples in same fragment are referenced with
same probability by any application.
• Predicate is relevant if there is at least one
application that accesses fragments differently.
Horizontal Fragmentation of
account Relation
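A minimal sketch of horizontal fragmentation, assuming the account relation is split by a selection predicate on branch-name. The sample tuples and the fragment names account1 and account2 are invented for illustration.

```python
# Hypothetical contents of the account relation.
account = [
    {"branch_name": "Hillside", "account_number": "A-305", "balance": 500},
    {"branch_name": "Hillside", "account_number": "A-226", "balance": 336},
    {"branch_name": "Valleyview", "account_number": "A-177", "balance": 205},
]


def horizontal_fragment(relation, predicates):
    """Build one fragment per named selection predicate."""
    return {name: [t for t in relation if pred(t)] for name, pred in predicates.items()}


predicates = {
    "account1": lambda t: t["branch_name"] == "Hillside",
    "account2": lambda t: t["branch_name"] == "Valleyview",
}
fragments = horizontal_fragment(account, predicates)

# Reconstruction of a horizontally fragmented relation is the union of its fragments.
reconstructed = [t for frag in fragments.values() for t in frag]
```

Because each tuple satisfies exactly one of the two predicates, the fragments are disjoint and their union contains every tuple of account.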
Vertical Fragmentation
• Vertical fragmentation: the schema for relation r is split
into several smaller schemas
– All schemas must contain a common candidate key (or
superkey) to ensure lossless join property.
– A special attribute, the tuple-id attribute may be added to each
schema to serve as a candidate key.
• Example : relation account with following schema
• Account-schema = (branch-name, account-number,
balance)
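A minimal sketch of vertical fragmentation for this schema, assuming a tuple-id is added and repeated in both fragments so that a join on it reconstructs the relation. The sample tuples and the choice of which attributes go into which fragment are illustrative only.

```python
# Hypothetical tuples of account: (branch-name, account-number, balance).
account = [
    ("Hillside", "A-305", 500),
    ("Hillside", "A-226", 336),
    ("Valleyview", "A-177", 205),
]

# Add a tuple-id, then project onto two smaller schemas that both keep it.
with_tid = [(tid, *row) for tid, row in enumerate(account, start=1)]
frag1 = [(tid, branch) for tid, branch, _, _ in with_tid]                 # (tuple-id, branch-name)
frag2 = [(tid, number, balance) for tid, _, number, balance in with_tid]  # (tuple-id, account-number, balance)


def join_on_tid(left, right):
    """Join two vertical fragments on their first column, then drop the tuple-id."""
    right_by_tid = {row[0]: row[1:] for row in right}
    return [l[1:] + right_by_tid[l[0]] for l in left if l[0] in right_by_tid]


reconstructed = join_on_tid(frag1, frag2)
```

The join is lossless here because the tuple-id is a candidate key shared by both fragments.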
Vertical Fragmentation of
employee info Relation
Mixed Fragmentation
Advantages of Fragmentation
• Horizontal:
– allows parallel processing on fragments of a relation
– allows a relation to be split so that tuples are located where they
are most frequently accessed
• Vertical:
– allows tuples to be split so that each part of the tuple is stored
where it is most frequently accessed
– tuple-id attribute allows efficient joining of vertical fragments
– allows parallel processing on a relation
• Vertical and horizontal fragmentation can be mixed.
– Fragments may be successively fragmented to an arbitrary
depth.
Data Replication (1)
• A relation or fragment of a relation is
replicated if it is stored redundantly in two
or more sites.
• Full replication of a relation is the case
where the relation is stored at all sites.
• Fully redundant databases are those in
which every site contains a copy of the
entire database.
Data Replication (2)
• Advantages of Replication
– Availability: failure of site containing relation r does not result in
unavailability of r if replicas exist.
– Parallelism: queries on r may be processed by several nodes in
parallel.
– Reduced data transfer: relation r is available locally at each site
containing a replica of r.
• Disadvantages of Replication
– Increased cost of updates: each replica of relation r must be
updated.
– Increased complexity of concurrency control: concurrent updates
to distinct replicas may lead to inconsistent data unless special
concurrency control mechanisms are implemented.
• One solution: choose one copy as primary copy and apply
concurrency control operations on primary copy
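The primary-copy idea can be sketched in a single process: every update is serialized through a lock that stands in for concurrency control at the primary copy, and the new value is then propagated to all replicas. The class name, site names and the synchronous propagation are assumptions for illustration; a real DDBMS would do this over the network and handle failures.

```python
import threading


class PrimaryCopyItem:
    """A replicated data item whose updates are serialized at one primary site."""

    def __init__(self, sites, primary, value):
        self.primary = primary
        self.copies = {site: value for site in sites}
        self._lock = threading.Lock()  # stands in for the lock on the primary copy

    def read(self, site):
        return self.copies[site]

    def update(self, fn):
        # All writers queue on the primary copy, so updates cannot interleave.
        with self._lock:
            new_value = fn(self.copies[self.primary])
            for site in self.copies:
                self.copies[site] = new_value  # propagate to every replica
            return new_value


balance = PrimaryCopyItem(["site1", "site2", "site3"], primary="site1", value=100)
threads = [threading.Thread(target=balance.update, args=(lambda v: v + 10,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the five concurrent updates every replica holds the same value, because no two updates were applied to different copies at the same time.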
Possible Network Topologies
Date’s 12 Rules for Distributed
Systems
Rule 0. Fundamental Principle:
TO THE USER, A DISTRIBUTED SYSTEM SHOULD LOOK EXACTLY LIKE
A NONDISTRIBUTED SYSTEM

1. Local autonomy
2. No reliance on a central site
3. Continuous operation
4. Location independence
5. Fragmentation independence
6. Replication independence
7. Distributed query processing
8. Distributed transaction management
9. Hardware independence
10. Operating system independence
11. Network independence
12. DBMS independence
Rule 1 : Local Autonomy
Autonomy objective: Sites should be autonomous to the maximum extent possible

• Local data is locally owned and managed, with local accountability
– security considerations
– integrity considerations
• Local operations remain purely local
• All operations at a given site are controlled by that site; no site X
should depend on some other site Y for its successful functioning
• In some situations some slight loss of autonomy is inevitable
– fragmentation problem - Rule 5
– replication problem - Rule 6
– update of replicated relation - Rule 6
– multiple-site integrity constraint problem - Rule 7
– a problem of participation in a two-phase commit process - Rule 8
Rule 2 : No Reliance on a Central Site
There must not be any reliance on a central “master” site for some central service,
such as centralized query processing or centralized transaction management, such
that the entire system is dependent on that central site

• Reliance on a central site would be undesirable for at
least the following two reasons:
– that central site might be a bottleneck
– the system would be vulnerable
• In a distributed system, therefore, the following functions
(among others) must all be distributed:
– Dictionary management
– Query processing
– Concurrency control
– Recovery control
Rule 3 : Continuous Operation
There should ideally never be any need for a planned entire system
shutdown
• Incorporating a new site X into an existing distributed system should
not bring the entire system to a halt
• Incorporating a new site X into an existing distributed system should
not require any changes to existing user programs or terminal activities
• Removing an existing site X from the distributed system should not
cause any unnecessary interruptions in service
• Within the distributed system, it should be possible to create and
destroy fragments and replicas of fragments dynamically
• It should be possible to upgrade the DBMS at any given
component to a newer release without taking the entire
system down
Rule 4 : Location Independence
(Transparency)
Users should not have to know where data is physically stored, but
rather should be able to behave - at least from a logical standpoint - as if
the data was all stored at their own local site

• Simplifies user programs and terminal activities
• Allows data to migrate from site to site
• It is easier to provide location independence for simple retrieval
operations than it is for update operations
• Distributed data naming scheme and corresponding support from
the dictionary subsystem
• User naming scheme
– User U has to have a valid logon ID at each of multiple sites to operate
– User profile for each valid logon ID in the dictionary
– Granting of access privileges at each component site
Rule 5 : Fragmentation Independence
• A distributed system supports data fragmentation if a given relation can
be divided up into pieces or “fragments” for physical storage purposes
• A system that supports data fragmentation should also support
fragmentation independence (also known as fragmentation
transparency)
• Users should be able to behave (at least from a logical standpoint) as if
the data were in fact not fragmented at all
• Fragmentation is desirable for performance reasons
• Horizontal fragmentation → SELECT
• Vertical fragmentation → PROJECT
• Fragmentation must be defined within the context of a distributed
database
o Fragmentation independence (like location independence) is desirable
because it simplifies user programs and terminal activities
• Fragmentation independence implies that users should normally be
presented with a view of the data in which the fragments are logically
combined together by means of suitable joins and unions
Rule 5 : An Example of
Fragmentation
Rule 6 : Replication Independence
(Transparency)
User should be able to behave as if the data were in fact not replicated at all

• A distributed system supports data replication if a given relation
(more generally, a given fragment of a relation) can be represented
at the physical level by many distinct stored copies or replicas, at
many distinct sites.
• Replication, like fragmentation, should be “transparent to the user”
• Replication is desirable for at least two reasons:
– Performance
– Availability
• Update propagation problem
• Replication independence (like location and fragmentation
independence) is desirable because it simplifies user programs and
terminal activities
• Snapshots
Rule 6 : Examples of Data Replication
Rule 7 : Distributed Query Processing
It is crucially important for distributed database systems to choose a
good strategy for distributed query processing

• Query processing in a distributed system involves
– local CPU and I/O activity at several distinct sites
– some amount of data communication among those sites
• Amount of data communication is a major performance factor
• Query compilation ahead of time
• Views that span multiple sites
• Integrity constraints within a DDBS that span multiple sites
Rule 8 : Distributed Transaction
Management
Two major aspects of transaction management, recovery control and concurrency
control, require extended treatment in the distributed environment

• In a distributed system, a single transaction can involve the
execution of code at multiple sites and can thus involve updates at
multiple sites
• Each transaction is therefore said to consist of multiple “agents,”
where an agent is the process performed on behalf of a given
transaction at a given site
• Global deadlock: neither site can detect it using only information
that is internal to that site
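The global deadlock point can be illustrated with wait-for graphs: each site's local graph is acyclic, but their union contains a cycle. The sketch below is a plain depth-first cycle check; the transaction names T1 and T2 and the site names X and Y are hypothetical.

```python
def has_cycle(edges):
    """Return True if the wait-for graph given as (waiter, holder) edges has a cycle."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)
        graph.setdefault(holder, set())

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in graph)


site_x = [("T1", "T2")]  # at site X, the agent of T1 waits for a lock held by T2
site_y = [("T2", "T1")]  # at site Y, the agent of T2 waits for a lock held by T1

local_x = has_cycle(site_x)
local_y = has_cycle(site_y)
global_view = has_cycle(site_x + site_y)
```

Neither local check finds anything, while the combined graph does, which is why detection needs information from more than one site.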
Rule 9 : Hardware Independence
Users should be presented with the “single-system image” regardless
of any particular hardware platform

• It is desirable to be able to run the same DBMS
on different hardware systems
• It is desirable to have those different hardware
systems all participate as equal partners (where
appropriate) in a distributed system
• The strict homogeneity assumption is not
relaxed; it is still assumed that the same DBMS
is running on all those different hardware
systems
Rule 10 : Operating System
Independence
It is obviously desirable, not only to be able to run the same DBMS on
different hardware systems, but also to be able to run it on different
operating systems - even different operating systems on the same
hardware

• From a commercial point of view, the most important operating system
environments, and hence the ones that (at a minimum) the DBMS
should support, are probably MVS/XA, MVS/ESA, VM/CMS, VAX/VMS,
UNIX (various flavors), OS/2, MS-DOS, Windows
Rule 11 : Network Independence
It is obviously desirable to be able to support a variety of disparate
communication networks

• From the point of view of the distributed DBMS, the network is
merely the provider of a reliable message transmission service
• By “reliable” here is meant that, if the network accepts a message
from site X for delivery to site Y, then it will eventually deliver that
message to site Y;
• Messages will not be garbled, will not be delivered more than once,
and will be delivered in the order sent.
• The network should also be responsible for site authentication
• Ideally the system should support both local area networks and
wide-area networks
• Distributed system should support a variety of different network
architectures
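The "reliable" properties listed above (each message delivered once, and in the order sent) are what the DDBMS expects from the network. As a hedged illustration of how a transport layer might provide them, the sketch below uses per-sender sequence numbers to drop duplicates and reorder late arrivals. The class and its interface are hypothetical, and retransmission of lost messages is not modelled.

```python
class InOrderReceiver:
    """Delivers messages from one sender exactly once and in sequence order."""

    def __init__(self):
        self.next_seq = 0    # sequence number expected next
        self.pending = {}    # messages that arrived early, keyed by sequence number
        self.delivered = []  # what the DDBMS at this site actually sees

    def receive(self, seq, payload):
        if seq < self.next_seq or seq in self.pending:
            return  # duplicate: already delivered or already buffered
        self.pending[seq] = payload
        # Hand over every message that is now in order.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


receiver = InOrderReceiver()
for seq, payload in [(1, "b"), (0, "a"), (1, "b"), (2, "c")]:
    receiver.receive(seq, payload)
```

Even though message 1 arrives first and arrives twice, the site sees the messages once each and in the order they were sent.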
Rule 12 : DBMS Independence
Data Transparency
• Data transparency: Degree to which system user may
remain unaware of the details of how and where the data
items are stored in a distributed system
• Consider transparency issues in relation to:
– Fragmentation transparency
– Replication transparency
– Location transparency
• Naming of Data Items - Criteria
– Every data item must have a system-wide unique name.
– It should be possible to find the location of data items efficiently.
– It should be possible to change the location of data items
transparently.
– Each site should be able to create new data items
autonomously.
Solution to Transparency Problem
• Centralized Scheme – Name Server
– Structure:
• Name server assigns all names
• Each site maintains a record of local data items
• Site asks name server to locate non-local data items
• Use of aliases
– Alternative to centralized scheme: each site prefixes its own site
identifier to any name that it generates, e.g., site17.account.
• Fulfills having a unique identifier, and avoids problems associated
with central control.
• However, fails to achieve network transparency.
– Solution: create a set of aliases for data items; store the
mapping of aliases to the real names at each site.
– The user can be unaware of the physical location of a data item,
and is unaffected if the data item is moved from one site to another.
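A minimal sketch of the site-prefix and alias scheme described above. The Site class, its methods and the site identifiers are hypothetical; the point is only that users work with an alias while each site maps it to the real, site-prefixed name.

```python
class Site:
    """One site in the network, holding its own alias table."""

    def __init__(self, site_id):
        self.site_id = site_id
        self.aliases = {}  # alias -> system-wide real name

    def create_item(self, local_name):
        # Prefixing the site identifier makes the name unique without a name server.
        return f"{self.site_id}.{local_name}"

    def resolve(self, name):
        # Users refer to aliases; unknown names are assumed to be real names already.
        return self.aliases.get(name, name)


site17 = Site("site17")
site3 = Site("site3")

real_name = site17.create_item("account")
for site in (site17, site3):
    site.aliases["account"] = real_name  # the same mapping is stored at each site

before_move = site3.resolve("account")

# If the item moves to site3, only the alias tables change; users keep saying "account".
new_name = site3.create_item("account")
for site in (site17, site3):
    site.aliases["account"] = new_name

after_move = site3.resolve("account")
```

The user-visible name never changes, which is what makes the move transparent.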
Distributed Transaction
• Transaction may access data at several sites.
• Each site has a local transaction manager responsible
for:
– Maintaining a log for recovery purposes
– Participating in coordinating the concurrent execution of the
transactions executing at that site.
• Each site has a transaction coordinator, which is
responsible for:
– Starting the execution of transactions that originate at the site.
– Distributing subtransactions at appropriate sites for execution.
– Coordinating the termination of each transaction that originates
at the site, which may result in the transaction being committed
at all sites or aborted at all sites.
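One common way for a coordinator to reach this all-or-nothing outcome is the two-phase commit process mentioned under Rule 1. The sketch below is a single-process illustration that ignores failures, timeouts and logging; the Participant class and its method names are assumptions, not a real DBMS interface.

```python
class Participant:
    """Stands in for the local transaction manager at one site."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote yes only if the local subtransaction is able to commit.
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator: commit everywhere only if every site votes yes."""
    votes = [p.prepare() for p in participants]  # phase 1: collect all votes
    decision = all(votes)
    for p in participants:                       # phase 2: broadcast the decision
        if decision:
            p.commit()
        else:
            p.abort()
    return "committed" if decision else "aborted"


all_yes = [Participant("site1"), Participant("site2")]
one_no = [Participant("site1"), Participant("site2", can_commit=False)]
outcome_yes = two_phase_commit(all_yes)
outcome_no = two_phase_commit(one_no)
```

A single "no" vote is enough to abort the transaction at every site, including those that voted yes.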
System Failure Modes
• Failures unique to distributed systems:
– Failure of a site.
– Loss of messages
• Handled by network transmission control protocols such as TCP/IP
– Failure of a communication link
• Handled by network protocols, by routing messages via alternative
links
– Network partition
• A network is said to be partitioned when it has been split into two
or more subsystems that lack any connection between them
– Note: a subsystem may consist of a single node
• Network partitioning and site failures are generally
indistinguishable.
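A network partition can be pictured as the set of sites falling into more than one connected component once some links have failed. The sketch below computes those components from the surviving links; the site names and link lists are hypothetical.

```python
def partitions(sites, links):
    """Group sites into subsystems that can still reach each other over working links."""
    adjacency = {s: set() for s in sites}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, groups = set(), []
    for start in sites:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            site = stack.pop()
            if site in group:
                continue
            group.add(site)
            stack.extend(adjacency[site] - group)
        seen |= group
        groups.append(group)
    return groups


sites = ["A", "B", "C", "D"]
split = partitions(sites, [("A", "B"), ("C", "D")])     # the link between B and C is down
isolated = partitions(sites, [("A", "B"), ("B", "C")])  # D has lost every link
```

In the second case the subsystem {D} is a single node; from the point of view of A, B and C this looks the same as D having crashed, which is the indistinguishability noted above.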
