Dist DB

Distributed Database Systems, Fall 2012
Distributed Database Design

SL02
  Design problem
  Design strategies (top-down, bottom-up)
  Fragmentation (horizontal, vertical)
  Allocation and replication of fragments, optimality, heuristics

Design Problem
Design problem of distributed systems: making decisions about the placement of data and programs across the sites of a computer network, as well as possibly designing the network itself.
In a DDBMS, the distribution of applications involves:
  Distribution of the DDBMS software
  Distribution of the applications that run on the database
Distribution of applications will not be considered in the following; instead the distribution of data is studied.
DDBS12, SL02
1/60
M. Bohlen
Framework of Distribution
Dimensions for the analysis of distributed systems:
  Level of sharing: no sharing, data sharing, data + program sharing
  Behavior of access patterns: static, dynamic
  Level of knowledge on access pattern behavior: no information, partial information, complete information
Distributed database design should be considered within this general framework.

Design Strategies/1
Top-down approach
  Designing systems from scratch
  Homogeneous systems
Bottom-up approach
  The databases already exist at a number of sites
  The databases should be connected to solve common tasks

Design Strategies/2
Top-down design strategy (figure not reproduced)

Design Strategies/3
Distribution design is the central part of the design in DDBMSs (the other tasks are similar to traditional databases).
Objective: design the local conceptual schemas (LCSs) by distributing the data over the sites.
Two main aspects have to be designed carefully:
  Fragmentation: a relation may be divided into a number of sub-relations, which are then distributed
  Allocation and replication: each fragment is stored at the site(s) that give an optimal distribution
Distribution design issues:
  Why fragment at all?
  How to fragment?
  How much to fragment?
  How to test correctness?
  How to allocate?
Design Strategies/4
Bottom-up design strategy
Fragmentation/1
What is a reasonable unit of distribution? Relation or fragment of relation?
Relations as unit of distribution:
  If the relation is not replicated, we get a high volume of remote data accesses.
  If the relation is replicated, we get unnecessary replications, which cause problems in executing updates and waste disk space.
  Might be an OK solution if queries need all the data in the relation and the data stays only at the sites that use it.
Fragments of relations as unit of distribution:
  Application views are usually subsets of relations.
  Thus, locality of accesses of applications is defined on subsets of relations.
  Permits a number of transactions to execute concurrently, since they will access different portions of a relation.
  Permits parallel execution of a single query (intra-query concurrency).
  However, semantic data control (especially integrity enforcement) is more difficult.
Fragments of relations are (usually) the appropriate unit of distribution.
Fragmentation/2
Fragmentation aims to improve:
  Reliability
  Performance
  Balanced storage capacity and costs
  Communication costs
  Security
The following information is used to decide fragmentation:
  Quantitative information: cardinality of relations, frequency of queries, site where a query is run, selectivity of the queries, etc.
  Qualitative information: predicates in queries, types of access of data (read/write), etc.

Fragmentation/3
Types of fragmentation:
  Horizontal: partitions a relation along its tuples
  Vertical: partitions a relation along its attributes
  Mixed/hybrid: a combination of horizontal and vertical fragmentation
Figure panels (not reproduced): (a) Horizontal Fragmentation, (b) Vertical Fragmentation, (c) Mixed Fragmentation
Fragmentation/4
Example of database instance (figure not reproduced)

Fragmentation/5
Example (contd.): Horizontal fragmentation of the PROJ relation
  PROJ1: projects with budgets less than 200 000
  PROJ2: projects with budgets greater than or equal to 200 000

Fragmentation/6
Example (contd.): Vertical fragmentation of the PROJ relation
  PROJ1: information about project budgets
  PROJ2: information about project names and locations
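
The two example fragmentations can be sketched in a few lines of Python. This is only an illustration: the PROJ tuples are the four rows that appear in the later fragmentation example of these slides, and the helper names `select` and `project` are hypothetical stand-ins for the relational selection and projection operators.

```python
# PROJ tuples as they appear in the later slides (PNO, PNAME, BUDGET, LOC).
PROJ = [
    {"PNO": "P1", "PNAME": "Instrument.", "BUDGET": 150_000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "DB Develop.", "BUDGET": 135_000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250_000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310_000, "LOC": "Paris"},
]


def select(relation, predicate):
    """Relational selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, attributes):
    """Relational projection: keep only the listed attributes of each tuple."""
    return [{a: t[a] for a in attributes} for t in relation]


# Horizontal fragmentation: split along tuples by budget.
proj_h1 = select(PROJ, lambda t: t["BUDGET"] < 200_000)
proj_h2 = select(PROJ, lambda t: t["BUDGET"] >= 200_000)

# Vertical fragmentation: split along attributes; the key PNO is kept in both.
proj_v1 = project(PROJ, ["PNO", "BUDGET"])
proj_v2 = project(PROJ, ["PNO", "PNAME", "LOC"])

print([t["PNO"] for t in proj_h1], [t["PNO"] for t in proj_h2])
print(proj_v1[0], proj_v2[0])
```

Keeping PNO in both vertical fragments is what later allows the original relation to be rebuilt with a join.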
Correctness Rules of Fragmentation

Completeness
  Decomposition of relation $R$ into fragments $R_1, R_2, \ldots, R_n$ is complete iff each data item in $R$ can also be found in some $R_i$.
Reconstruction
  If relation $R$ is decomposed into fragments $R_1, R_2, \ldots, R_n$, then there should exist some relational operator $\nabla$ that reconstructs $R$ from its fragments, i.e., $R = R_1 \,\nabla\, \cdots \,\nabla\, R_n$
    Union to combine horizontal fragments
    Join to combine vertical fragments
Disjointness
  If relation $R$ is decomposed into fragments $R_1, R_2, \ldots, R_n$ and data item $d_i$ appears in fragment $R_j$, then $d_i$ should not appear in any other fragment $R_k$, $k \neq j$ (exception: primary key attribute for vertical fragmentation)
    For horizontal fragmentation, a data item is a tuple
    For vertical fragmentation, a data item is an attribute
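
For horizontal fragments, the three rules can be checked mechanically. The sketch below is an assumption-laden illustration, not part of the original slides: relations are modelled as Python sets of tuples, the function names are hypothetical, and the PROJ rows are the ones used in the later examples.

```python
from itertools import combinations

# PROJ as a set of (PNO, PNAME, BUDGET, LOC) tuples.
PROJ = {
    ("P1", "Instrument.", 150_000, "Montreal"),
    ("P2", "DB Develop.", 135_000, "New York"),
    ("P3", "CAD/CAM", 250_000, "New York"),
    ("P4", "Maintenance", 310_000, "Paris"),
}


def is_complete(relation, fragments):
    """Every tuple of the relation occurs in at least one fragment."""
    return all(any(t in f for f in fragments) for t in relation)


def reconstructs_by_union(relation, fragments):
    """The union of the horizontal fragments equals the original relation."""
    return set().union(*fragments) == relation


def is_disjoint(fragments):
    """No tuple occurs in two different fragments."""
    return all(a.isdisjoint(b) for a, b in combinations(fragments, 2))


good = [
    {t for t in PROJ if t[2] < 200_000},
    {t for t in PROJ if t[2] >= 200_000},
]
# Overlapping predicates: P1 satisfies both, so disjointness fails.
overlapping = [
    {t for t in PROJ if t[2] <= 200_000},
    {t for t in PROJ if t[2] >= 150_000},
]

print(is_complete(PROJ, good), reconstructs_by_union(PROJ, good), is_disjoint(good))
print(is_disjoint(overlapping))
```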
Idea of Horizontal Fragmentation
Intuition behind horizontal fragmentation:
  Every site should hold all information that is used to query the site
  The information at the site should be fragmented so the queries of the site run faster
Horizontal fragmentation is defined as a selection operation, $\sigma_p(R)$
Example:
  $\sigma_{\text{BUDGET} < 200K}(\text{PROJ})$
  $\sigma_{\text{BUDGET} \geq 200K}(\text{PROJ})$

Information Requirements/1
Database information:
  Links between relations (a link models a 1:N relationship between relations that are related to each other by an equality join)
  Figure (not reproduced): links (one to many)
  Cardinality of relations: $card(R)$
Application information: simple predicates used in queries
  Given relation $R[A_1, A_2, \ldots, A_n]$, $A_i \;\theta\; v_i$ is a simple predicate, where $\theta \in \{=, <, \leq, >, \geq, \neq\}$, $v_i \in D_i$, and $D_i$ is the domain of $A_i$
  For relation $R$ we define $Pr = \{p_1, p_2, \ldots, p_m\}$
  Example: PNAME = 'Maintenance', BUDGET < 200000
Minterm predicates
  Minterm predicates are conjunctions of simple predicates
  Given relation $R$ and $Pr = \{p_1, p_2, \ldots, p_m\}$, the set of minterm predicates is $M = \{m_1, m_2, \ldots, m_z\}$ such that
    $M = \{ m_i \mid m_i = \bigwedge_{p_j \in Pr} p_j^{*} \}$, $1 \leq j \leq m$, $1 \leq i \leq z$
  where $p_j^{*} = p_j$ or $p_j^{*} = \neg p_j$

Application information (contd.):
Minterm selectivities
  The number of tuples of the relation that would be accessed by a user query that is specified according to a given minterm predicate $m_i$.
Access frequencies
  The frequency with which a user application $q_i$ accesses data.
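
As a small illustration of minterm predicates and minterm selectivities, the following sketch enumerates all sign combinations of the two example predicates and counts the matching PROJ tuples. The function name `minterms` and the label format are assumptions made for this example; the PROJ rows are those used later in the slides.

```python
from itertools import product

PROJ = [
    {"PNO": "P1", "PNAME": "Instrument.", "BUDGET": 150_000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "DB Develop.", "BUDGET": 135_000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250_000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310_000, "LOC": "Paris"},
]

# The two simple predicates from the example above.
simple_predicates = {
    "PNAME = 'Maintenance'": lambda t: t["PNAME"] == "Maintenance",
    "BUDGET < 200000": lambda t: t["BUDGET"] < 200_000,
}


def minterms(preds):
    """Yield (label, test) for every conjunction of p_j or NOT p_j."""
    names = list(preds)
    for signs in product([True, False], repeat=len(names)):
        label = " AND ".join(n if s else f"NOT({n})" for n, s in zip(names, signs))

        def test(t, signs=signs):
            # A tuple satisfies the minterm iff each predicate has the required sign.
            return all(preds[n](t) == s for n, s in zip(names, signs))

        yield label, test


# Minterm selectivity: number of tuples that satisfy each minterm.
selectivity = {
    label: sum(1 for t in PROJ if test(t))
    for label, test in minterms(simple_predicates)
}
for label, count in selectivity.items():
    print(f"{count}  {label}")
```

Every tuple satisfies exactly one minterm, so the selectivities always add up to the cardinality of the relation.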
Horizontal Fragmentation/1
A horizontal fragment $R_j$ of relation $R$ consists of all the tuples of $R$ that satisfy a predicate $F_j$:
  $R_j = \sigma_{F_j}(R)$, $1 \leq j \leq w$
where $F_j$ is a selection formula, which is (preferably) a minterm predicate.
Computing the horizontal fragmentation:
  1. Determine $Pr$ (80/20 rule: the most active 20% of the queries account for about 80% of the data accesses)
  2. Compute $Pr'$
  3. Determine $M$
  4. Minimize $M$ (eliminating impossible minterms)
We write $Pr'$ to denote the complete and minimal set of simple predicates generated from $Pr$:
  The set of predicates is complete if and only if any two tuples in the same fragment are referenced with the same probability by any application
  The set of predicates is minimal if and only if each simple predicate is relevant for determining the fragmentation, and for each pair of fragments there is at least one query that accesses the fragments differently
Example: Fragmentation of the PROJ relation
Consider the following query: Find the name and budget of projects given their location.
The query is issued at all three locations.
Fragmentation based on LOC, using the set of predicates {LOC = 'Montreal', LOC = 'New York', LOC = 'Paris'}

$PROJ_1 = \sigma_{\text{LOC} = \text{Montreal}}(PROJ)$
  PNO  PNAME        BUDGET  LOC
  P1   Instrument.  150000  Montreal

$PROJ_2 = \sigma_{\text{LOC} = \text{New York}}(PROJ)$
  PNO  PNAME        BUDGET  LOC
  P2   DB Develop.  135000  New York
  P3   CAD/CAM      250000  New York

$PROJ_3 = \sigma_{\text{LOC} = \text{Paris}}(PROJ)$
  PNO  PNAME        BUDGET  LOC
  P4   Maintenance  310000  Paris

If access is only according to the location, the above set of predicates is complete,
  i.e., in each fragment $PROJ_i$ each tuple has the same probability of being accessed
If there is a second query/application that accesses only those project tuples where the budget is less than \$200K, the set of predicates is not complete:
  P2 in $PROJ_2$ has a higher probability of being accessed
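Example (contd.): Add BUDGET < 200K and BUDGET ≥ 200K to the set of predicates to make it complete.
  {LOC = 'Montreal', LOC = 'New York', LOC = 'Paris', BUDGET ≥ 200K, BUDGET < 200K} is a complete set
Minterms to fragment the relation are given as follows:
  $(\text{LOC} = \text{Montreal}) \wedge (\text{BUDGET} < 200K)$
  $(\text{LOC} = \text{Montreal}) \wedge (\text{BUDGET} \geq 200K)$
  $(\text{LOC} = \text{New York}) \wedge (\text{BUDGET} < 200K)$
  $(\text{LOC} = \text{New York}) \wedge (\text{BUDGET} \geq 200K)$
  $(\text{LOC} = \text{Paris}) \wedge (\text{BUDGET} < 200K)$
  $(\text{LOC} = \text{Paris}) \wedge (\text{BUDGET} \geq 200K)$

Example (contd.): Now, $PROJ_i$, $i = 1, 2, 3$, will be split in two fragments

$PROJ_1 = \sigma_{\text{LOC} = \text{Montreal} \,\wedge\, \text{BUDGET} < 200K}(PROJ)$
  PNO  PNAME        BUDGET  LOC
  P1   Instrument.  150000  Montreal

$PROJ_2 = \sigma_{\text{LOC} = \text{New York} \,\wedge\, \text{BUDGET} < 200K}(PROJ)$
  PNO  PNAME        BUDGET  LOC
  P2   DB Develop.  135000  New York

$PROJ_2' = \sigma_{\text{LOC} = \text{New York} \,\wedge\, \text{BUDGET} \geq 200K}(PROJ)$
  PNO  PNAME        BUDGET  LOC
  P3   CAD/CAM      250000  New York

$PROJ_3 = \sigma_{\text{LOC} = \text{Paris} \,\wedge\, \text{BUDGET} \geq 200K}(PROJ)$
  PNO  PNAME        BUDGET  LOC
  P4   Maintenance  310000  Paris

Note that the following fragments are empty:
  $\sigma_{\text{LOC} = \text{Paris} \,\wedge\, \text{BUDGET} < 200K}(PROJ)$
  $\sigma_{\text{LOC} = \text{Montreal} \,\wedge\, \text{BUDGET} \geq 200K}(PROJ)$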
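
The step "minimize M" can be illustrated on this example. The sketch below assumes that the domain of LOC consists of exactly the three cities and that two representative budget values (one on each side of the 200K threshold) are enough to decide satisfiability; both are assumptions made for the illustration, as are all identifier names.

```python
from itertools import product

PROJ = [
    ("P1", "Instrument.", 150_000, "Montreal"),
    ("P2", "DB Develop.", 135_000, "New York"),
    ("P3", "CAD/CAM", 250_000, "New York"),
    ("P4", "Maintenance", 310_000, "Paris"),
]
LOCATIONS = ["Montreal", "New York", "Paris"]  # assumed domain of LOC

# The five simple predicates of the complete set.
SIMPLE = [
    ("LOC = Montreal", lambda loc, budget: loc == "Montreal"),
    ("LOC = New York", lambda loc, budget: loc == "New York"),
    ("LOC = Paris", lambda loc, budget: loc == "Paris"),
    ("BUDGET < 200K", lambda loc, budget: budget < 200_000),
    ("BUDGET >= 200K", lambda loc, budget: budget >= 200_000),
]


def signature(loc, budget):
    """Truth value of every simple predicate for one (LOC, BUDGET) pair."""
    return tuple(test(loc, budget) for _, test in SIMPLE)


# All 2^5 sign combinations, i.e., all syntactically possible minterms.
all_minterms = list(product([True, False], repeat=len(SIMPLE)))

# A minterm is satisfiable iff some point of the (assumed) domain realizes it.
satisfiable = {
    signature(loc, budget) for loc in LOCATIONS for budget in (199_999, 200_000)
}

# Apply the surviving minterms to PROJ and separate empty fragments.
fragments = {
    m: [t[0] for t in PROJ if signature(t[3], t[2]) == m] for m in satisfiable
}
non_empty = sorted(pnos for pnos in fragments.values() if pnos)
empty_count = sum(1 for pnos in fragments.values() if not pnos)

print(len(all_minterms), len(satisfiable), non_empty, empty_count)
```

Of the 32 sign combinations only 6 survive the implications between the predicates, and 2 of those 6 select no tuple of the current PROJ instance.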
Derived Horizontal Fragmentation/1
Defined on a member (target) relation of a link according to a selection operation specified on its owner (source).
  Each link is an equijoin.
  Fragments of the member relation can be defined with a semijoin on fragments of the owner relation.
Given a link $L$ where $owner(L) = S$ and $member(L) = R$, the derived horizontal fragments of $R$ are defined as
  $R_i = R \ltimes S_i$, $1 \leq i \leq w$
where $w$ is the maximum number of fragments that will be defined on $R$, and $S_i = \sigma_{F_i}(S)$, where $F_i$ is the formula according to which the primary horizontal fragment $S_i$ is defined.
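
A minimal sketch of the semijoin-based definition, assuming an owner relation PAY(TITLE, SAL) and a member relation EMP(ENO, ENAME, TITLE) linked on TITLE. The sample rows, the join attribute, and the helper name `semijoin` are all hypothetical and serve only to make the definition concrete.

```python
# Hypothetical owner relation PAY and member relation EMP, linked on TITLE.
PAY = [
    {"TITLE": "Engineer", "SAL": 40_000},
    {"TITLE": "Analyst", "SAL": 34_000},
    {"TITLE": "Programmer", "SAL": 24_000},
]
EMP = [
    {"ENO": "E1", "ENAME": "Ann", "TITLE": "Engineer"},
    {"ENO": "E2", "ENAME": "Bob", "TITLE": "Programmer"},
    {"ENO": "E3", "ENAME": "Eve", "TITLE": "Analyst"},
    {"ENO": "E4", "ENAME": "Dan", "TITLE": "Programmer"},
]


def semijoin(member, owner_fragment, attr):
    """Member tuples that have a join partner in the owner fragment."""
    keys = {o[attr] for o in owner_fragment}
    return [m for m in member if m[attr] in keys]


# Primary horizontal fragments of the owner relation.
PAY1 = [p for p in PAY if p["SAL"] <= 30_000]
PAY2 = [p for p in PAY if p["SAL"] > 30_000]

# Derived horizontal fragments of the member relation.
EMP1 = semijoin(EMP, PAY1, "TITLE")
EMP2 = semijoin(EMP, PAY2, "TITLE")

print([e["ENO"] for e in EMP1], [e["ENO"] for e in EMP2])
```

Because every employee here has exactly one matching PAY tuple, the derived fragments are complete and disjoint; with dangling or multiply matched member tuples this would not hold automatically.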
Derived Horizontal Fragmentation/2
Given link $L_1$ where $owner(L_1) = PAY$ and $member(L_1) = EMP$, and
  $PAY_1 = \sigma_{SAL \leq 30000}(PAY)$
  $PAY_2 = \sigma_{SAL > 30000}(PAY)$
we get
  $EMP_1 = EMP \ltimes PAY_1$
  $EMP_2 = EMP \ltimes PAY_2$

Vertical Fragmentation/1
The objective of vertical fragmentation is to partition a relation into a set of smaller relations so that many of the applications will run on only one fragment.
Vertical fragmentation of a relation $R$ produces fragments $R_1, R_2, \ldots$, each of which contains a subset of $R$'s attributes.
Vertical fragmentation is defined using the projection operation of the relational algebra:
  $\pi_{A_1, A_2, \ldots, A_n}(R)$
Example:
  $PROJ_1 = \pi_{PNO, BUDGET}(PROJ)$
  $PROJ_2 = \pi_{PNO, PNAME, LOC}(PROJ)$
Vertical fragmentation has also been studied for (centralized) DBMSs
  Smaller relations, and hence fewer page accesses
  e.g., the MONET system
Vertical fragmentation is more complicated than horizontal fragmentation:
  In horizontal partitioning: for $n$ simple predicates, the number of possible minterms is $2^n$; some of them can be ruled out by existing implications/constraints.
  In vertical partitioning: for $m$ non-primary-key attributes, the number of possible fragmentations is equal to $B(m)$ (the $m$-th Bell number), i.e., the number of partitions of a set with $m$ members.
    For large numbers, $B(m) \approx m^m$ (e.g., $B(15) \approx 10^9$)
Two types of heuristics for vertical fragmentation exist:
  Grouping: assign each attribute to one fragment and, at each step, join some of the fragments until some criterion is satisfied.
    Bottom-up approach
  Splitting: start with the relation and decide on beneficial partitionings based on the access behaviour of applications to the attributes.
    Top-down approach
    Results in non-overlapping fragments
    The optimal solution is probably closer to the full relation than to a set of small relations with only one attribute
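
To get a feeling for how quickly the number of vertical fragmentations grows, the Bell numbers can be computed with the Bell triangle. The function name `bell_numbers` is chosen for this sketch only.

```python
def bell_numbers(m):
    """Return [B(0), ..., B(m)] computed with the Bell triangle."""
    row = [1]
    bells = [1]  # B(0) = 1
    for _ in range(m):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left in the previous row.
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells


B = bell_numbers(15)
for m in (4, 10, 15):
    # Compare with the number of minterms for the same number of predicates.
    print(f"m = {m:2d}: B(m) = {B[m]:>13,d}   2^m = {2 ** m:>6,d}")
```

For the four attributes of PROJ there are already 15 possible partitions of the attribute set, which is why heuristics such as grouping and splitting are used instead of enumeration.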
Application information: the major information required as input for vertical fragmentation is related to applications (queries).
  Since vertical fragmentation places in one fragment those attributes that are usually accessed together, there is a need for some measure that defines more precisely the notion of "togetherness", i.e., how closely related the attributes are.
  This information is obtained from queries and collected in the Attribute Usage Matrix and the Attribute Affinity Matrix.
Given are the user queries/applications $Q = (q_1, \ldots, q_q)$ that will run on relation $R(A_1, \ldots, A_n)$.
Attribute Usage Matrix: denotes which query uses which attribute:
  $use(q_i, A_j) = \begin{cases} 1 & \text{if } q_i \text{ uses } A_j \\ 0 & \text{otherwise} \end{cases}$
  The $use(q_i, \bullet)$ vectors for each application are easy to define if the designer knows the applications that will run on the DB (consider also the 80-20 rule).
Example: Consider relation PROJ(PNO, PNAME, BUDGET, LOC) and queries:
  q1 = SELECT BUDGET FROM PROJ WHERE PNO=Value
  q2 = SELECT PNAME,BUDGET FROM PROJ
  q3 = SELECT PNAME FROM PROJ WHERE LOC=Value
  q4 = SELECT SUM(BUDGET) FROM PROJ WHERE LOC=Value
Abbreviations: A1 = PNO, A2 = PNAME, A3 = BUDGET, A4 = LOC
Attribute Usage Matrix:
        A1  A2  A3  A4
  q1     1   0   1   0
  q2     0   1   1   0
  q3     0   1   0   1
  q4     0   0   1   1

Attribute Affinity Matrix: denotes the frequency with which two attributes $A_i$ and $A_j$ are accessed together with respect to a set of queries $Q = (q_1, \ldots, q_q)$:
  $aff(A_i, A_j) = \sum_{k:\, use(q_k, A_i) = 1 \,\wedge\, use(q_k, A_j) = 1} \;\; \sum_{\text{sites } l} ref_l(q_k) \cdot acc_l(q_k)$
where
  $ref_l(q_k)$ is the cost (= number of accesses to $(A_i, A_j)$) of query $q_k$ at site $l$
  $acc_l(q_k)$ is the frequency of query $q_k$ at site $l$
Example (contd.): Let the cost of each query be $ref_l(q_k) = 1$, and the frequency $acc_l(q_k)$ of the queries be as follows:
           q1   q2   q3   q4
  Site 1   15    5   25    3
  Site 2   20    0   25    0
  Site 3   10    0   25    0
Attribute affinity matrix $aff(A_i, A_j)$:
        A1  A2  A3  A4
  A1    45   0  45   0
  A2     0  80   5  75
  A3    45   5  53   3
  A4     0  75   3  78
e.g., $aff(A_1, A_3) = \sum_{k=1}^{1} \sum_{l=1}^{3} acc_l(q_k) = acc_1(q_1) + acc_2(q_1) + acc_3(q_1) = 45$ ($q_1$ is the only query to access both $A_1$ and $A_3$)

Idea: Take the attribute affinity matrix (AA) and reorganize the attribute order to form clusters where the attributes in each cluster demonstrate high affinity to one another.
The bond energy algorithm (BEA) has been suggested for that purpose for several reasons:
  It is designed specifically to determine groups of similar items as opposed to a linear ordering of the items.
  The final groupings are insensitive to the order in which items are presented.
  The computation time is reasonable ($O(n^2)$, where $n$ is the number of attributes)
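
The affinity values of the running example can be recomputed from the attribute usage of q1 to q4 and the access frequencies at the three sites. The sketch below assumes $ref_l(q_k) = 1$ as in the example; the function name `affinity` and the list-of-lists layout are choices made for this illustration.

```python
# use(q_k, A_j) for q1..q4 over A1 = PNO, A2 = PNAME, A3 = BUDGET, A4 = LOC,
# read off the SELECT/WHERE clauses of the four example queries.
USE = [
    [1, 0, 1, 0],  # q1: BUDGET, PNO
    [0, 1, 1, 0],  # q2: PNAME, BUDGET
    [0, 1, 0, 1],  # q3: PNAME, LOC
    [0, 0, 1, 1],  # q4: BUDGET, LOC
]
# acc_l(q_k): one row per query, one column per site (sites 1, 2, 3).
ACC = [
    [15, 20, 10],
    [5, 0, 0],
    [25, 25, 25],
    [3, 0, 0],
]


def affinity(use, acc, ref=1):
    """aff(A_i, A_j): sum of ref * acc over all sites and all queries using both."""
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for k, row in enumerate(use):
        weight = sum(ref * a for a in acc[k])  # total cost of q_k over all sites
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aa[i][j] += weight
    return aa


AA = affinity(USE, ACC)
for row in AA:
    print(row)
```

The diagonal entry $aff(A_i, A_i)$ is simply the total access frequency of all queries that use $A_i$.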
BEA:
  Input: AA matrix
  Output: Clustered AA matrix (CA)
  Permutation is done in such a way as to maximize the following global affinity measure (affinity of $A_i$ and $A_j$ with their neighbors):
    $AM = \sum_{i=1}^{n} \sum_{j=1}^{n} aff(A_i, A_j) \, [\, aff(A_i, A_{j-1}) + aff(A_i, A_{j+1}) + aff(A_{i-1}, A_j) + aff(A_{i+1}, A_j) \,]$

Example (contd.): Cluster Affinity Matrix CA after running BEA
        A1  A3  A2  A4
  A1    45  45   0   0
  A3    45  53   5   3
  A2     0   5  80  75
  A4     0   3  75  78
Elements with similar values are grouped together, and two clusters can be identified
An additional partitioning algorithm is needed to identify the clusters in CA
  Usually there are more clusters and more than one candidate partitioning, thus additional steps are needed to select the best clustering.
The resulting fragmentation after partitioning (PNO is added to PROJ2 explicitly as key):
  PROJ1 = {PNO, BUDGET}
  PROJ2 = {PNO, PNAME, LOC}
BEA Algorithm/1
Input: the AA matrix
Output: clustered affinity matrix CA, which is a permutation of AA
Initialization: place and fix one of the columns of AA in CA (we choose column 1 in our example)
Iteration: place the remaining $n-i$ columns in the remaining $i+1$ positions in the CA matrix. For each column, choose the placement that makes the largest contribution to the global affinity measure.
Row order: order the rows according to the column ordering.

BEA Algorithm/2
In order to determine the best placement we define the contribution of a placement.
Contribution of a placement:
  $cont(A_i, A_k, A_j) = 2\, bond(A_i, A_k) + 2\, bond(A_k, A_j) - 2\, bond(A_i, A_j)$
Bond between a pair of attributes is defined as:
  $bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x) \, aff(A_z, A_y)$
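BEA Algorithm/3
Consider the following AA matrix and the corresponding CA matrix where $A_1$ and $A_2$ have been placed.
  AA =      A1  A2  A3  A4
        A1  45   0  45   0
        A2   0  80   5  75
        A3  45   5  53   3
        A4   0  75   3  78

  CA =      A1  A2
        A1  45   0
        A2   0  80
        A3  45   5
        A4   0  75
The next step is to place $A_3$. The bonds needed for this are:
  $bond(A_3, A_1) = 45 \cdot 45 + 5 \cdot 0 + 53 \cdot 45 + 3 \cdot 0 = 4410$
  $bond(A_3, A_2) = 45 \cdot 0 + 5 \cdot 80 + 53 \cdot 5 + 3 \cdot 75 = 890$

BEA Algorithm/4
Bonds with a column that is not (yet) in CA, here $A_0$ and $A_4$, are 0.
Ordering (0-3-1):
  $cont(A_0, A_3, A_1) = 2\, bond(A_0, A_3) + 2\, bond(A_3, A_1) - 2\, bond(A_0, A_1) = 2 \cdot 0 + 2 \cdot 4410 - 2 \cdot 0 = 8820$
Ordering (1-3-2):
  $cont(A_1, A_3, A_2) = 2\, bond(A_1, A_3) + 2\, bond(A_3, A_2) - 2\, bond(A_1, A_2) = 2 \cdot 4410 + 2 \cdot 890 - 2 \cdot 225 = 10150$
Ordering (2-3-4):
  $cont(A_2, A_3, A_4) = 2\, bond(A_2, A_3) + 2\, bond(A_3, A_4) - 2\, bond(A_2, A_4) = 2 \cdot 890 + 2 \cdot 0 - 2 \cdot 0 = 1780$
The ordering (1-3-2) makes the largest contribution, so $A_3$ is placed between $A_1$ and $A_2$.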
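
The placement step can be sketched as follows. This is a simplified illustration rather than a full BEA implementation: it fixes the first two columns as in the worked example, treats bonds with a missing boundary column as 0, and uses hypothetical function names.

```python
# Attribute affinity matrix of the running example (A1..A4 as indices 0..3).
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """bond(A_x, A_y); None stands for an empty boundary column (bond 0)."""
    if x is None or y is None:
        return 0
    return sum(aa[z][x] * aa[z][y] for z in range(len(aa)))


def cont(aa, i, k, j):
    """Contribution of placing column k between columns i and j."""
    return 2 * bond(aa, i, k) + 2 * bond(aa, k, j) - 2 * bond(aa, i, j)


def bea_order(aa):
    """Column order after placing each remaining column at its best position."""
    order = [0, 1]  # the first two columns are fixed, as in the example
    for k in range(2, len(aa)):
        padded = [None] + order + [None]
        best_pos = max(
            range(len(order) + 1),
            key=lambda p: cont(aa, padded[p], k, padded[p + 1]),
        )
        order.insert(best_pos, k)
    return order


order = bea_order(AA)
# Reorder columns and rows of AA according to the computed column order.
CA = [[AA[r][c] for c in order] for r in order]

print(["A%d" % (i + 1) for i in order])
for row in CA:
    print(row)
```

On the example matrix this reproduces the contributions 8820, 10150 and 1780 for placing A3 and yields the column order A1, A3, A2, A4.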
BEA Algorithm/5
After adding $A_3$ the CA matrix has the form
  CA =      A1  A3  A2
        A1  45  45   0
        A2   0   5  80
        A3  45  53   5
        A4   0   3  75
After adding $A_4$ and ordering rows the CA matrix has the form
  CA =      A1  A3  A2  A4
        A1  45  45   0   0
        A3  45  53   5   3
        A2   0   5  80  75
        A4   0   3  75  78

BEA Algorithm/6
The final step is to divide a set of clustered attributes $\{A_1, \ldots, A_n\}$ into two sets $\{A_1, \ldots, A_i\}$ and $\{A_{i+1}, \ldots, A_n\}$ such that the costs of queries that access both sets are minimized.
The appropriate split point on the diagonal must be determined:
  CA =      A1  A3  A2  A4
        A1  45  45   0   0
        A3  45  53   5   3
        A2   0   5  80  75
        A4   0   3  75  78
This is an optimization problem.
Note: Generalizations are needed to deal with
  partitions located in the middle of the matrix
  multiple split points
BEA Algorithm/7
Let TA and BA denote the two attribute sets on either side of the split point (the top and bottom parts of the CA matrix).
  $AQ(q_i) = \{A_j \mid use(q_i, A_j) = 1\}$    attributes accessed by $q_i$
  $TQ = \{q_i \mid AQ(q_i) \subseteq TA\}$    queries that access TA only
  $BQ = \{q_i \mid AQ(q_i) \subseteq BA\}$    queries that access BA only
  $OQ = Q - \{TQ \cup BQ\}$    queries that access TA and BA
  $CQ = \sum_{q_i \in Q} \sum_{S_j} ref_j(q_i)\, acc_j(q_i)$    cost of all queries
  $CTQ = \sum_{q_i \in TQ} \sum_{S_j} ref_j(q_i)\, acc_j(q_i)$    cost of TQ queries
  $CBQ = \sum_{q_i \in BQ} \sum_{S_j} ref_j(q_i)\, acc_j(q_i)$    cost of BQ queries
  $COQ = \sum_{q_i \in OQ} \sum_{S_j} ref_j(q_i)\, acc_j(q_i)$    cost of other queries
  maximize $CTQ \cdot CBQ - COQ^2$

Correctness of Vertical Fragmentation
Relation $R$ is decomposed into fragments $R_1, R_2, \ldots, R_n$
  e.g., PROJ = {PNO, BUDGET, PNAME, LOC} into PROJ1 = {PNO, BUDGET} and PROJ2 = {PNO, PNAME, LOC}
Completeness
  Guaranteed by the partitioning algorithm, which assigns each attribute in A to one partition
Reconstruction
  Join to reconstruct vertical fragments
  $R = R_1 \bowtie \cdots \bowtie R_n = PROJ_1 \bowtie PROJ_2$
Disjointness
  Attributes have to be disjoint in VF. Two cases are distinguished:
    If tuple IDs are used, the fragments are really disjoint
    Otherwise, key attributes are replicated automatically by the system, e.g., PNO in the above example
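
The fragments PROJ1 and PROJ2 used above come out of the split-point search of BEA Algorithm/7. A small sketch of that search on the running example follows; it assumes the clustered order A1, A3, A2, A4, takes each query's total cost as its access frequency summed over the three sites (with ref = 1), and uses identifier names invented for this illustration.

```python
# Clustered attribute order and, per query, the attributes it uses.
ORDER = ["A1", "A3", "A2", "A4"]
AQ = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}
# Total cost per query: sum over sites of ref * acc, with ref = 1.
COST = {"q1": 45, "q2": 5, "q3": 75, "q4": 3}


def split_score(point):
    """z = CTQ * CBQ - COQ^2 for the split after the first `point` attributes."""
    ta, ba = set(ORDER[:point]), set(ORDER[point:])
    ctq = sum(COST[q] for q, attrs in AQ.items() if attrs <= ta)
    cbq = sum(COST[q] for q, attrs in AQ.items() if attrs <= ba)
    coq = sum(COST.values()) - ctq - cbq  # queries touching both sides
    return ctq * cbq - coq ** 2


scores = {p: split_score(p) for p in range(1, len(ORDER))}
best = max(scores, key=scores.get)

print(scores)
print("TA =", ORDER[:best], "BA =", ORDER[best:])
```

The best split separates {A1, A3} = {PNO, BUDGET} from {A2, A4} = {PNAME, LOC}; adding the key PNO to the second set gives PROJ2.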
Mixed Fragmentation
In most cases simple horizontal or vertical fragmentation of a DB schema will not be sufficient to satisfy the requirements of the applications.
Mixed fragmentation (hybrid fragmentation): consists of a horizontal fragmentation followed by a vertical fragmentation, or a vertical fragmentation followed by a horizontal fragmentation.
Fragmentation is defined using the selection and projection operations of relational algebra:
  $\sigma_p(\pi_{A_1, \ldots, A_n}(R))$
  $\pi_{A_1, \ldots, A_n}(\sigma_p(R))$

Replication and Allocation
Replication: Which fragments shall be stored as multiple copies?
  Complete replication
    A complete copy of the database is maintained at each site
  Selective replication
    Selected fragments are replicated at some sites
Allocation: On which sites to store the various fragments?
  Centralized
    Consists of a single DB and DBMS stored at one site, with users distributed across the network
  Partitioned
    The database is partitioned into disjoint fragments, each fragment assigned to one site
Replication/1
Replicated DB
  fully replicated: each fragment at each site
  partially replicated: each fragment at some of the sites
Non-replicated DB (= partitioned DB)
  partitioned: each fragment resides at only one site
Rule of thumb:
  If $\frac{\text{read-only queries}}{\text{update queries}} \geq 1$, then replication is advantageous; otherwise replication may cause problems

Replication/2
Comparison of replication alternatives (table not reproduced)
Fragment Allocation/1
Fragment allocation problem
  Given are:
    fragments $F = \{F_1, F_2, \ldots, F_n\}$
    network sites $S = \{S_1, S_2, \ldots, S_m\}$
    applications $Q = \{q_1, q_2, \ldots, q_l\}$
  Find: the optimal distribution of $F$ to $S$
Optimality
  Minimal cost
    Communication + storage + processing (read and update)
    Cost in terms of time (usually)
  Performance
    Response time and/or throughput
  Constraints
    Per-site constraints (storage and processing)

Required information
  Database information
    selectivity of fragments
    size of a fragment
  Application information
    $RR_{ij}$: number of read accesses of a query $q_i$ to a fragment $F_j$
    $UR_{ij}$: number of update accesses of a query $q_i$ to a fragment $F_j$
    $u_{ij}$: a matrix indicating which queries update which fragments
    $r_{ij}$: a similar matrix for retrievals
    originating site of each query
  Site information
    $USC_k$: unit cost of storing data at a site $S_k$
    $LPC_k$: cost of processing one unit of data at a site $S_k$
  Network information
    communication cost/frame between two sites
    frame size
We discuss an allocation model which attempts to
  minimize the total cost of processing and storage
  meet certain response time restrictions
General form: min(Total Cost)
subject to
  response time constraint
  storage constraint
  processing constraint
Decision variable $x_{jk}$:
  $x_{jk} = \begin{cases} 1 & \text{if fragment } F_j \text{ is stored at site } S_k \\ 0 & \text{otherwise} \end{cases}$

The total cost function has two components: storage and query processing.
  $TOC = \sum_{S_k \in S} \sum_{F_j \in F} STC_{jk} + \sum_{q_i \in Q} QPC_i$
Storage cost of fragment $F_j$ at site $S_k$:
  $STC_{jk} = USC_k \cdot size(F_j) \cdot x_{jk}$
  where $USC_k$ is the unit storage cost at site $k$
Query processing cost for a query $q_i$ is composed of two components, processing cost (PC) and transmission cost (TC):
  $QPC_i = PC_i + TC_i$
Processing cost is a sum of three components: access cost (AC), integrity constraint cost (IE), and concurrency control cost (CC):
  $PC_i = AC_i + IE_i + CC_i$
Access cost:
  $AC_i = \sum_{S_k \in S} \sum_{F_j \in F} (UR_{ij} + RR_{ij}) \cdot x_{jk} \cdot LPC_k$
  where $LPC_k$ is the unit processing cost at site $k$
Integrity and concurrency costs:
  Can be computed similarly, though they depend on the specific constraints
Note: $AC_i$ assumes that processing a query involves decomposing it into a set of subqueries, each of which works on a fragment, ...
  This is a very simplistic model
  It does not take into consideration different query costs depending on the operator or the different algorithms that are applied

The transmission cost is composed of two components: cost of processing updates (TCU) and cost of processing retrievals (TCR):
  $TC_i = TCU_i + TCR_i$
Cost of updates:
  Inform all the sites that have replicas + a short confirmation message back
  $TCU_i = \sum_{S_k \in S} \sum_{F_j \in F} u_{ij} \cdot (\text{update message cost} + \text{acknowledgment cost})$
Retrieval cost:
  Send the retrieval request to all sites that have a copy of the fragments that are needed + send back the results from these sites to the originating site
  $TCR_i = \sum_{F_j \in F} \min_{S_k \in S} \big( x_{jk} \cdot (\text{cost of retrieval request} + \text{cost of sending back the result}) \big)$
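
To make the cost formulas concrete, the sketch below evaluates the storage cost and the access cost for a toy allocation with two fragments, two sites and two queries. All numbers are hypothetical illustration values, and the integrity, concurrency and transmission costs are deliberately left out, so the result is only a partial total cost.

```python
# All values below are hypothetical and only serve to exercise the formulas.
SIZE = [100, 50]          # size(F_j)
USC = [2, 1]              # USC_k: unit storage cost at site S_k
LPC = [1, 2]              # LPC_k: unit processing cost at site S_k
RR = [[10, 0], [5, 5]]    # RR[i][j]: read accesses of q_i to F_j
UR = [[1, 0], [0, 2]]     # UR[i][j]: update accesses of q_i to F_j


def storage_cost(x):
    """Sum over sites k and fragments j of USC_k * size(F_j) * x_jk."""
    return sum(
        USC[k] * SIZE[j] * x[j][k]
        for k in range(len(USC))
        for j in range(len(SIZE))
    )


def access_cost(i, x):
    """AC_i: sum over k and j of (UR_ij + RR_ij) * x_jk * LPC_k."""
    return sum(
        (UR[i][j] + RR[i][j]) * x[j][k] * LPC[k]
        for k in range(len(LPC))
        for j in range(len(SIZE))
    )


def partial_total_cost(x):
    """Storage cost plus access cost of all queries (IE, CC, TC omitted)."""
    return storage_cost(x) + sum(access_cost(i, x) for i in range(len(RR)))


# x[j][k] = 1 iff fragment F_j is stored at site S_k.
partitioned = [[1, 0], [0, 1]]
fully_replicated = [[1, 1], [1, 1]]

print(partial_total_cost(partitioned), partial_total_cost(fully_replicated))
```

Under this simplified model full replication can only increase the cost, because every copy is stored and accessed; the benefit of replication shows up in the transmission and response-time terms that are omitted here.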
Modeling the constraints
  Response time constraint for a query $q_i$:
    execution time of $q_i$ $\leq$ max. allowable response time for $q_i$
  Storage constraint for a site $S_k$:
    $\sum_{F_j \in F}$ storage requirement of $F_j$ at $S_k$ $\leq$ storage capacity of $S_k$
  Processing constraint for a site $S_k$:
    $\sum_{q_i \in Q}$ processing load of $q_i$ at site $S_k$ $\leq$ processing capacity of $S_k$

Solution Methods
The allocation problem is NP-complete
Correspondence between the allocation problem and similar problems in other areas:
  Plant location problem in operations research
  Knapsack problem
  Network flow problem
Hence, solutions from these areas can be re-used
Use different heuristics to reduce the search space:
  Assume that all candidate partitionings have been determined together with their associated costs and benefits in terms of query processing.
    The problem is then reduced to finding the optimal partitioning and placement for each relation
  Ignore replication in the first step and find an optimal non-replicated solution
    Replication is then handled in a second step on top of the previous non-replicated solution.
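
For very small instances, the first heuristic step (finding an optimal non-replicated allocation) can be done by exhaustive search. The sketch below uses a simplified cost (storage plus access cost only), a storage constraint per site, and entirely hypothetical numbers; it is meant to show the shape of the search space, not a practical method.

```python
from itertools import product

# Hypothetical instance: three fragments, two sites.
SIZE = [100, 50, 80]   # size(F_j)
ACCESS = [16, 7, 9]    # total read + update accesses to F_j over all queries
USC = [2, 1]           # unit storage cost at site S_k
LPC = [1, 2]           # unit processing cost at site S_k
CAP = [150, 150]       # storage capacity of site S_k


def cost(assign):
    """Storage + access cost when fragment j is stored only at site assign[j]."""
    return sum(USC[k] * SIZE[j] + ACCESS[j] * LPC[k] for j, k in enumerate(assign))


def feasible(assign):
    """Storage constraint: fragments placed at a site must fit its capacity."""
    return all(
        sum(SIZE[j] for j, k in enumerate(assign) if k == site) <= CAP[site]
        for site in range(len(CAP))
    )


# Non-replicated allocations: every fragment goes to exactly one site.
candidates = [
    a for a in product(range(len(CAP)), repeat=len(SIZE)) if feasible(a)
]
best = min(candidates, key=cost)

print(len(candidates), best, cost(best))
```

The number of non-replicated allocations is $m^n$ for $n$ fragments and $m$ sites, which is why heuristics are needed for realistic problem sizes.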
Conclusion/1
Distributed design decides on the placement of (parts of the) data and programs across the sites of a computer network
There are two key technical questions: fragmentation and allocation/replication of data
Horizontal fragmentation is defined via the selection operation $\sigma_p(R)$
  Rewrites the queries of each site in conjunctive normal form and finds a minimal and complete set of conjunctions to determine the fragmentation
Vertical fragmentation is defined via the projection operation $\pi_A(R)$
  Computes the attribute affinity matrix and groups similar attributes together
Mixed fragmentation is a combination of both approaches

Conclusion/2
Allocation/replication of data
  Type of replication: no replication, partial replication, full replication
  Optimal allocation/replication is modelled as a cost function under a set of constraints
  The problem is NP-complete
  Use of different heuristics to reduce the complexity
