Distributed DBMS
I. Introduction to DDBMS
II. Architecture of DDBs
III. Storing data in DDBs
IV. Distributed catalog management
V. Distributed query processing
VI. Transaction Processing
VII. Distributed concurrency control and recovery
I. Introduction to DDBMS
Data in a distributed database system is stored across several sites.
Transparency implies that each user within the system may access all of
the data within all of the databases as if they were a single database.
DDBMS properties
Distributed data independence: Users should be able to ask queries
without specifying where the referenced relations, or copies or
fragments of the relations, are located.
DISTRIBUTED PROCESSING ARCHITECTURE
[Figure: distributed processing architecture. Clients on LANs at Delhi, Mumbai, Hyderabad, and Pune share a single DBMS.]
DISTRIBUTED DATABASE ARCHITECTURE
[Figure: distributed database architecture. Clients and a DBMS at each of the sites Delhi, Mumbai, Hyderabad, and Pune.]
Distributed database: a communication network with a DBMS and data at each node.
Types of Distributed Databases
Homogeneous distributed database system:
Data is distributed, but all servers run the same DBMS software.
2. DISTRIBUTED DBMS ARCHITECTURES
1. Client-Server Systems
2. Collaborating Server Systems
3. Middleware Systems
1. Client-Server Systems:
1. A Client-Server system has one or more client processes
and one or more server processes.
2. A client process can send a query to any one server
process.
3. Clients are responsible for user-interface issues.
4. Servers manage data and execute transactions.
5. A client process could run on a personal computer and
send queries to a server running on a mainframe.
6. The Client-Server architecture does not allow a single
query to span multiple servers.
[Figure: dumb terminals connected to a mainframe computer over a specialised network connection.]
M:N CLIENT/SERVER DBMS ARCHITECTURE
[Figure: clients #1, #2, and #3 connect to servers #1 and #2, each server with its own database. Not transparent!]
3. Middleware Systems:
The Middleware architecture is designed to allow a single
query to span multiple servers, without requiring all
database servers to be capable of managing such multisite
execution strategies.
3. Storing Data in DDBs
In a distributed DBMS, relations are stored across several
sites.
Accessing a relation that is stored at a remote site includes
message-passing costs.
A single relation may be partitioned or fragmented across
several sites.
Types of Fragmentation:
Horizontal fragmentation: The union of the
horizontal fragments must be equal to the original
relation. Fragments are usually also required to be
disjoint.
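The two conditions on horizontal fragments can be checked mechanically. The following is a minimal Python sketch, assuming a relation is modeled as a set of tuples; the function names, the sample Sailors rows, and the split on rating are illustrative assumptions, not part of the slides.

```python
def horizontal_fragments(relation, predicates):
    """Split a relation (a set of tuples) into one fragment per selection predicate."""
    return [{t for t in relation if pred(t)} for pred in predicates]


def is_valid_fragmentation(relation, fragments):
    """True if the fragments are pairwise disjoint and their union is the relation."""
    union = set().union(*fragments)
    total = sum(len(f) for f in fragments)
    # total == len(union) holds exactly when no tuple appears in two fragments
    return union == relation and total == len(union)


# Hypothetical Sailors tuples: (sid, sname, rating, age)
sailors = {(22, "Dustin", 7, 45.0), (31, "Lubber", 8, 55.5), (58, "Rusty", 3, 35.0)}
fragments = horizontal_fragments(
    sailors,
    [lambda t: t[2] < 5, lambda t: t[2] >= 5],
)
```

With these predicates every tuple lands in exactly one fragment, so the check succeeds; overlapping predicates or a predicate set that misses some rating value would make it fail.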
Replication
Replication means that we store several copies of a relation
or relation fragment.
4. Distributed Catalog Management
1. Naming Objects
• If a relation is fragmented and replicated, we must be able to
uniquely identify each replica of each fragment. A global name can be built from two fields:
  a. A local name field
  b. A birth site field
2. Catalog Structure
• A centralized system catalog can be used, but it is vulnerable to failure of
the site containing the catalog.
• An alternative is to maintain a copy of a global system catalog at every site, but this
compromises site autonomy.
A better approach:
Each site maintains a local catalog that describes all copies
of data stored at that site.
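The naming scheme and the per-site catalog can be pictured with a small data model. This is only a sketch: the class names, the fragment_id and replica_id fields, and the site names are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalName:
    local_name: str   # name given to the relation at the site where it was created
    birth_site: str   # site where the relation was created


@dataclass(frozen=True)
class ReplicaName:
    relation: GlobalName
    fragment_id: int  # hypothetical fragment number
    replica_id: int   # hypothetical replica number


class LocalCatalog:
    """Describes the copies of data stored at one site."""

    def __init__(self, site):
        self.site = site
        self.replicas = set()

    def register(self, replica):
        self.replicas.add(replica)

    def has_copy_of(self, relation):
        return any(r.relation == relation for r in self.replicas)


sailors = GlobalName("Sailors", "London")
paris_catalog = LocalCatalog("Paris")
paris_catalog.register(ReplicaName(sailors, fragment_id=1, replica_id=2))
```

Because the birth site is part of the name, two sites can each create a relation called Sailors without a global naming authority.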
5. Distributed Query Processing
Distributed query processing: Transform a high-level
query (of relational calculus/SQL) on a distributed
database (i.e., a set of global relations) into an equivalent
and efficient lower-level query (of relational algebra) on
relation fragments.
Distributed Query Processing Steps
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: date, rname: string)
Criteria for measuring the cost of a query evaluation strategy:
• For centralized DBMSs: the number of disk accesses (# blocks read / written)
• For distributed databases, additionally:
  - The cost of data transmission over the network
  - The potential gain in performance from having several sites process parts
    of the query in parallel
To estimate the cost of an evaluation strategy, in addition to counting
the number of page I/Os, we must count the number of pages that are
shipped, which is the communication cost.
Communication cost is a significant component of overall cost in a
distributed database.
1. Nonjoin Queries in a Distributed DBMS
Even simple operations such as scanning a relation, selection, and
projection are affected by fragmentation and replication.
SELECT S.age
FROM Sailors S
WHERE S.rating > 3 AND S.rating < 7
Suppose that the Sailors relation is horizontally fragmented, with all
tuples having a rating less than 5 at Mumbai and all tuples having a
rating greater than 5 at Delhi.
The DBMS must answer this query by evaluating it at both sites and
taking the union of the answers.
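The evaluation over both fragments can be sketched in Python as follows. The rows are hypothetical, and since the SQL query has no DISTINCT, the "union" is modeled here as a concatenation of the per-site answers.

```python
# Hypothetical horizontal fragments of Sailors: (sid, sname, rating, age)
mumbai = [(58, "Rusty", 3, 35.0), (29, "Brutus", 4, 33.0)]   # rating < 5
delhi = [(22, "Dustin", 6, 45.0), (31, "Lubber", 8, 55.5)]   # rating > 5


def select_ages(fragment, low=3, high=7):
    """Evaluate SELECT S.age ... WHERE S.rating > low AND S.rating < high on one fragment."""
    return [age for (_sid, _name, rating, age) in fragment if low < rating < high]


def distributed_select(fragments):
    """Run the selection at every site and combine the per-site answers."""
    result = []
    for fragment in fragments:
        result.extend(select_ages(fragment))
    return result


ages = distributed_select([mumbai, delhi])  # [33.0, 45.0]
```

Both sites contribute here because the range 3 < rating < 7 overlaps both fragments.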
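Eg 1: SELECT AVG(S.age)
FROM Sailors S
WHERE S.rating > 3 AND S.rating < 7
Here, taking the union of the answers is not enough: the DBMS must compute the
sum and count of the age values at each site and combine them to obtain the average.
Eg 2: SELECT S.age
FROM Sailors S
WHERE S.rating > 6
This query can be answered by executing it at Delhi alone.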
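For the AVG query of Eg 1, one workable approach is for each site to return a (sum, count) pair and for the query site to combine the pairs. The sketch below assumes hypothetical rows and function names; it is an illustration of the idea, not a prescribed implementation.

```python
# Hypothetical horizontal fragments of Sailors: (sid, sname, rating, age)
mumbai = [(58, "Rusty", 3, 35.0), (29, "Brutus", 4, 30.0), (64, "Horatio", 4, 40.0)]
delhi = [(22, "Dustin", 6, 50.0)]


def local_sum_count(fragment, low=3, high=7):
    """Per-site partial aggregate: sum and count of qualifying ages."""
    ages = [age for (_sid, _name, rating, age) in fragment if low < rating < high]
    return sum(ages), len(ages)


def distributed_avg(fragments):
    """Combine the per-site (sum, count) pairs into the global average."""
    total, count = 0.0, 0
    for fragment in fragments:
        s, c = local_sum_count(fragment)
        total += s
        count += c
    return total / count if count else None


avg_age = distributed_avg([mumbai, delhi])  # 40.0
```

Averaging the two per-site averages instead (35.0 at Mumbai and 50.0 at Delhi) would give 42.5, which is wrong because the sites hold different numbers of qualifying tuples.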
Eg 4: The entire Sailors relation is stored at both the Mumbai and Delhi
sites.
Where should the query be executed?
Which strategy is better for me?
1. Fetch As Needed
Page-oriented Nested Loops join: For each page of R, get each page
of S, and write out matching pairs of tuples <r, s>, where r is in R-page
and s is in S-page.
Fetch As Needed: Transferring the relation piecewise
Assume the following about the Reserves and Sailors relations:
• Each tuple of Reserves is 40 bytes long, a page can hold 100 Reserves tuples,
and there are 1,000 pages of such tuples.
• Each tuple of Sailors is 50 bytes long, a page can hold 80 Sailors tuples,
and there are 500 pages of such tuples.
• td is the cost to read/write a page; ts is the cost to ship a page.
The cost to scan Sailors is 500 td.
For each Sailors page, the cost of scanning and shipping all of Reserves is
1,000 (td + ts).
The total cost is 500 td + 500,000 (td + ts).
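The fetch-as-needed cost formula can be written as a small Python function. The function name and the sample values td = 1 and ts = 10 are assumptions chosen only to make the arithmetic concrete.

```python
def fetch_as_needed_cost(outer_pages, inner_pages, td, ts):
    """Cost of a page-oriented nested loops join with a remote inner relation.

    Scanning the outer relation costs outer_pages * td. For each outer page,
    every inner page is read at the remote site and shipped: inner_pages * (td + ts).
    """
    scan_outer = outer_pages * td
    fetch_inner = outer_pages * inner_pages * (td + ts)
    return scan_outer + fetch_inner


# Sailors (500 pages) as the outer relation, Reserves (1,000 pages) as the inner one.
cost = fetch_as_needed_cost(outer_pages=500, inner_pages=1000, td=1, ts=10)  # 5500500
```

Even with a modest shipping cost, the 500,000 (td + ts) term dominates, which is why shipping pages one at a time is rarely attractive without caching.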
Assume Reserves is stored at Paris and Sailors at London:
• Each tuple of Reserves is 40 bytes long, a page can hold 100 Reserves tuples,
and there are 1,000 pages of such tuples.
• Each tuple of Sailors is 50 bytes long, a page can hold 80 Sailors tuples,
and there are 500 pages of such tuples.
If the query site is different from the join site, consider performing the join
at London and shipping the result to the query site.
Number of tuples in the result = 100,000; each tuple is 40 + 50 = 90 bytes long.
Page size is 4,000 bytes (80 Sailors tuples of 50 bytes each fit in a page, so
80 * 50 = 4,000 bytes).
4,000 / 90 = 44 result tuples fit in a page.
Result size is 100,000 / 44 = 2273 pages (rounded up).
Cost of shipping the result to another site is 2273 ts.
This cost depends on the size of the result.
Here, the cost of shipping the result (2273 ts) is greater than the cost of shipping both
Sailors and Reserves to the query site (500 ts + 1,000 ts = 1,500 ts).
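The result-size arithmetic and the comparison with shipping both relations can be reproduced with a short Python sketch. The function name is hypothetical; costs are expressed as page counts, that is, in multiples of ts.

```python
import math


def result_pages(result_tuples, tuple_bytes, page_bytes):
    """Number of pages needed when only whole tuples are stored on a page."""
    tuples_per_page = page_bytes // tuple_bytes
    return math.ceil(result_tuples / tuples_per_page)


page_bytes = 80 * 50                                   # 4,000 bytes per page
pages = result_pages(100_000, 40 + 50, page_bytes)     # 2273 pages

ship_result_cost = pages                               # 2273 ts
ship_both_relations_cost = 500 + 1000                  # 1500 ts
cheaper_to_ship_relations = ship_both_relations_cost < ship_result_cost  # True
```

The comparison flips if the join is selective: a result much smaller than 1,500 pages would make shipping the result the better choice.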
Ship Whole vs Fetch as needed:
Fetch as needed results in a high number of messages
Ship whole results in high amounts of transferred data
Note: If some tuples in Reserves do not join with any tuple in Sailors, we
could avoid shipping them.
3. Semijoins and Bloomjoins
Semijoins: 3 steps:
1. At London, compute the projection of Sailors onto the join columns, i.e., sid,
and ship this projection to Paris.
2. At Paris, compute the natural join of the projection received from the
first site with the Reserves relation. The result of this join is called the
reduction of Reserves with respect to Sailors. Ship the reduction of Reserves
to London.
3. At London, compute the join of the reduction of Reserves with Sailors.
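The three semijoin steps can be traced in a minimal Python sketch, with each site's relation held as a list of tuples. The rows and function names are illustrative assumptions; shipping between sites is modeled simply as passing values between functions.

```python
def semijoin_reduce(reserves, sailor_sids):
    """Step 2 (Paris): keep only Reserves tuples whose sid is in the shipped projection."""
    return [r for r in reserves if r[0] in sailor_sids]


def semijoin_join(sailors, reserves):
    # Step 1 (London): project Sailors onto the join column sid and ship it to Paris.
    sids = {s[0] for s in sailors}
    # Step 2 (Paris): compute the reduction of Reserves and ship it to London.
    reduction = semijoin_reduce(reserves, sids)
    # Step 3 (London): join the reduction with Sailors on sid.
    by_sid = {}
    for s in sailors:
        by_sid.setdefault(s[0], []).append(s)
    return [s + r[1:] for r in reduction for s in by_sid[r[0]]]


# Hypothetical rows. Sailors: (sid, sname, rating, age); Reserves: (sid, bid, day, rname)
sailors = [(22, "Dustin", 7, 45.0), (31, "Lubber", 8, 55.5)]
reserves = [(22, 101, "1998-10-10", "Guppy"), (95, 103, "1998-11-12", "Pinto")]
joined = semijoin_join(sailors, reserves)
```

Only the Reserves tuple for sid 22 is shipped back to London; the tuple for sid 95 never leaves Paris, which is the saving the technique aims for.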
Semijoin: Requesting all join partners in just one step.
Bloomjoins: 3 steps:
1. At London, a bit-vector of (some chosen) size k is computed by hashing
each tuple of Sailors into the range 0 to k - 1 and setting bit i to 1 if some
tuple hashes to i, and to 0 otherwise. This bit-vector is then shipped to Paris.
Bloom join:
• Also known as bit-vector join
• Avoids transferring all join attribute values to the other node
• Instead transfers a bit-vector B[1 ... n]
Transformation:
• Choose an appropriate hash function h
• Apply h to transform attribute values to the range [1 ... n]
• Set the corresponding bits in the bit-vector B[1 ... n] to 1
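A Bloom join can be sketched in Python as follows. Several things here are assumptions rather than content from the slides: the hash function h(sid) = sid mod k, the 0-based bit positions, the sample rows, and the later steps (filtering Reserves at the remote site and the final join), which follow the usual description of the technique.

```python
def build_bit_vector(sailors, k):
    """London: set bit h(sid) for every Sailors tuple, with h(sid) = sid mod k (assumed)."""
    bits = [0] * k
    for s in sailors:
        bits[s[0] % k] = 1
    return bits


def bloom_reduce(reserves, bits):
    """Remote site: keep Reserves tuples whose sid hashes to a bit that is set."""
    k = len(bits)
    return [r for r in reserves if bits[r[0] % k] == 1]


def bloomjoin(sailors, reserves, k):
    bits = build_bit_vector(sailors, k)       # shipped instead of the sid projection
    reduction = bloom_reduce(reserves, bits)  # shipped back; may hold false positives
    return [s + r[1:] for s in sailors for r in reduction if s[0] == r[0]]


# Hypothetical rows. Sailors: (sid, sname, rating, age); Reserves: (sid, bid, day, rname)
sailors = [(22, "Dustin", 7, 45.0), (31, "Lubber", 8, 55.5)]
reserves = [
    (22, 101, "1998-10-10", "Guppy"),
    (95, 103, "1998-11-12", "Pinto"),   # 95 mod 8 == 31 mod 8, a false positive
    (64, 102, "1998-09-05", "Marine"),  # 64 mod 8 == 0, filtered out
]
bits = build_bit_vector(sailors, 8)
reduction = bloom_reduce(reserves, bits)
joined = bloomjoin(sailors, reserves, 8)
```

The bit-vector is much smaller than the projection a semijoin would ship, at the price of some non-joining tuples (such as sid 95 here) surviving the filter and being discarded only in the final join.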
Cost-Based Query Optimization
Optimizing queries in a distributed database poses the following additional
challenges:
If individual sites are run under the control of different DBMSs, the
autonomy of each site must be respected while doing global query
planning.
Cost-based approach; consider all plans, pick cheapest; similar to
centralized optimization.
Query site constructs global plan, with suggested local plans describing
processing at each site.
If a site can improve suggested local plan, free to do so.
6. DISTRIBUTED TRANSACTION PROCESSING
A given transaction is submitted at some one site, but it can
access data at other sites.
7. Distributed Concurrency Control
Lock management can be distributed across sites in many
ways:
1. Centralized: A single site is in charge of handling lock and
unlock requests for all objects.
2. Primary copy: One copy of each object is designated as the
primary copy. All requests to lock or unlock a copy of this object are
handled by the lock manager at the site where the primary copy is stored.
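The difference between the two schemes comes down to which site's lock manager receives a request. A minimal Python sketch, with a hypothetical function name, scheme labels, and site assignments:

```python
def lock_site(obj, scheme, central_site=None, primary_site=None):
    """Return the site whose lock manager handles lock/unlock requests for obj.

    central_site: the single lock-manager site (centralized scheme).
    primary_site: mapping from object name to the site holding its primary copy.
    """
    if scheme == "centralized":
        return central_site
    if scheme == "primary_copy":
        return primary_site[obj]
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical placement of primary copies.
primary = {"Sailors": "Delhi", "Reserves": "Mumbai"}
site_a = lock_site("Sailors", "centralized", central_site="Pune")
site_b = lock_site("Sailors", "primary_copy", primary_site=primary)
```

Under the centralized scheme every request goes to one site regardless of the object; under primary copy the load is spread according to where the primary copies live.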
Phantom Deadlocks: Delays in propagating local information might
cause the deadlock detection algorithm to identify "deadlocks" that do not
really exist. Such situations are called phantom deadlocks.
Distributed deadlock detection algorithms:
• Centralized
• Hierarchical
• Timeout: a transaction that waits longer than a chosen interval is aborted
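One common way to realize centralized detection is to send every site's local waits-for graph to a single site, merge the graphs, and look for a cycle. The slides do not spell this mechanism out, so the sketch below is an assumption; graph shapes and names are hypothetical.

```python
def merge_waits_for(local_graphs):
    """Union the edges of every site's local waits-for graph into a global graph."""
    merged = {}
    for graph in local_graphs:
        for txn, waits_on in graph.items():
            merged.setdefault(txn, set()).update(waits_on)
    return merged


def has_cycle(graph):
    """Depth-first search for a cycle in a directed graph {node: set(successors)}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return True           # back edge: a cycle exists
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


# T1 waits for T2 at site A; T2 waits for T1 at site B.
site_a = {"T1": {"T2"}}
site_b = {"T2": {"T1"}}
global_graph = merge_waits_for([site_a, site_b])
```

Neither local graph contains a cycle, but the merged graph does, which is why purely local detection misses global deadlocks. If one of the local graphs is stale by the time it is merged, the same procedure can report a phantom deadlock.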
7. Distributed Recovery
Recovery in a distributed DBMS is more complicated than in a
centralized DBMS for the following reasons:
Two-Phase Commit (2PC):
1. Coordinator sends prepare msg to each subordinate.
3. If coordinator gets all yes votes, it force-writes a commit log record and
sends commit msg to all subs. Else, it force-writes an abort log record and
sends abort msg.
5. Coordinator writes end log record after getting ack msg from all subs.
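The coordinator's side of the protocol can be simulated in a few lines of Python. This sketch covers the failure-free case only. The subordinate actions labeled steps 2 and 4 are not listed above and are filled in from the usual description of 2PC, so treat them as assumptions; the function name and the message and log representations are likewise hypothetical.

```python
def two_phase_commit(votes):
    """Simulate one failure-free 2PC round.

    votes maps each subordinate's name to True (yes) or False (no).
    Returns (decision, coordinator_log, messages), where each message is
    a (sender, receiver, kind) triple.
    """
    log, messages = [], []
    # Step 1: coordinator sends a prepare message to each subordinate.
    for sub in votes:
        messages.append(("coordinator", sub, "prepare"))
    # Step 2 (assumed): each subordinate replies with its vote.
    for sub, vote in votes.items():
        messages.append((sub, "coordinator", "yes" if vote else "no"))
    # Step 3: all yes votes -> commit, otherwise abort. The decision is
    # force-written to the log before any message is sent.
    decision = "commit" if all(votes.values()) else "abort"
    log.append(decision)
    for sub in votes:
        messages.append(("coordinator", sub, decision))
    # Step 4 (assumed): each subordinate acknowledges the decision.
    for sub in votes:
        messages.append((sub, "coordinator", "ack"))
    # Step 5: coordinator writes the end log record after all acks arrive.
    log.append("end")
    return decision, log, messages


decision, log, messages = two_phase_commit({"sub1": True, "sub2": True})
```

A single no vote is enough to flip the decision to abort, for example two_phase_commit({"sub1": True, "sub2": False}).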
[Figure: two-phase commit (2PC), commit case]
[Figure: two-phase commit (2PC), abort case]
7. Distributed Recovery (3PC)
Three-Phase Commit
1. A commit protocol called Three-Phase Commit (3PC) can avoid
blocking even if the coordinator site fails during recovery.
2. The basic idea is that when the coordinator sends out prepare
messages and receives yes votes from all subordinates, it sends all
sites a precommit message rather than a commit message.
Distributed Database
Advantages:
• Reliability
• Performance
• Growth (incremental)
• Local control
• Transparency
Disadvantages:
• Complexity of query optimization
• Concurrency control
• Recovery
• Catalog management