2 Distribution Design
TS. Phan Thị Hà_PTIT
◼ Completeness
❑ Decomposition of relation R into fragments R1, R2, ..., Rn is
complete if and only if each data item in R can also be found in
some Ri
◼ Reconstruction
❑ If relation R is decomposed into fragments R1, R2, ..., Rn, then
there should exist some relational operator ∇ such that
R = ∇_{1≤i≤n} Ri
◼ Disjointness
❑ If relation R is decomposed into fragments R1, R2, ..., Rn, and
data item di is in Rj, then di should not be in any other fragment
Rk (k ≠ j ).
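The three correctness rules above can be checked mechanically for a horizontal fragmentation. The sketch below is only an illustration: the relation, its tuples, and the function names are hypothetical, and set union is assumed as the reconstruction operator ∇.

```python
# Hypothetical relation: each tuple is (PNO, LOC). The data is illustrative only.
R = {("P1", "Montreal"), ("P2", "New York"), ("P3", "Paris"), ("P4", "Paris")}

def is_complete(relation, fragments):
    """Completeness: every data item of the relation is in some fragment."""
    return all(any(t in f for f in fragments) for t in relation)

def reconstructs(relation, fragments):
    """Reconstruction: here the operator is assumed to be set union."""
    return set().union(*fragments) == relation

def is_disjoint(fragments):
    """Disjointness: no data item appears in more than one fragment."""
    seen = set()
    for f in fragments:
        if seen & f:
            return False
        seen |= f
    return True

# Fragment R by location, one fragment per LOC value.
fragments = [{t for t in R if t[1] == loc} for loc in ("Montreal", "New York", "Paris")]
checks = (is_complete(R, fragments), reconstructs(R, fragments), is_disjoint(fragments))
print(checks)
```

For vertical fragments the reconstruction operator would be a join on the key rather than a union.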
◼ Non-replicated
❑ partitioned : each fragment resides at only one site
◼ Replicated
❑ fully replicated : each fragment at each site
❑ partially replicated : each fragment at some of the sites
◼ Rule of thumb:
◼ Database Information
❑ relationship
m1: PNAME="Maintenance" ∧ BUDGET≤200000
◼ Application Information
❑ minterm selectivities: sel(mi)
◼ The number of tuples of the relation that would be accessed by a
user query which is specified according to a given minterm
predicate mi.
❑ access frequencies: acc(qi)
◼ The frequency with which a user application qi accesses data.
◼ Access frequency for a minterm predicate can also be defined.
Definition :
Rj = σ_{Fj}(R), 1 ≤ j ≤ w
where Fj is a selection formula, which is (preferably) a minterm predicate.
A horizontal fragment Ri of relation R consists of all the tuples of R
which satisfy a minterm predicate mi.
Given a set of minterm predicates M, there are as many horizontal
fragments of relation R as there are minterm predicates.
Set of horizontal fragments also referred to as minterm fragments.
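As a minimal sketch of this definition, each minterm predicate can be treated as a Boolean function and applied as a selection. The PROJ tuples, the two minterms, and the helper name `horizontal_fragments` are assumptions made for illustration.

```python
# Hypothetical PROJ tuples: (PNO, PNAME, BUDGET, LOC). The data is illustrative only.
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
]

# Two assumed minterm predicates over BUDGET (index 2 of each tuple).
minterms = {
    "m1": lambda t: t[2] <= 200000,
    "m2": lambda t: t[2] > 200000,
}

def horizontal_fragments(relation, minterms):
    """One fragment per minterm predicate: the tuples that satisfy it."""
    return {name: [t for t in relation if pred(t)] for name, pred in minterms.items()}

frags = horizontal_fragments(PROJ, minterms)
for name, frag in frags.items():
    print(name, [t[0] for t in frag])
```

Because the two minterms are mutually exclusive and cover every BUDGET value, each tuple lands in exactly one fragment.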
Preliminaries :
❑ Pr should be complete
❑ Pr should be minimal
◼ Example :
applications defined on it.
❑ Find the budgets of projects at each location. (1)
❑ Find projects with budgets less than $200000. (2)
According to (1),
Pr={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”}
which is complete with respect to (1) but not with respect to (2).
Example :
Pr ={LOC=“Montreal”,LOC=“New York”, LOC=“Paris”,
Initialization :
❑ find a pi ∈ Pr such that pi partitions R according to Rule 1
❑ set Pr' = pi ; Pr ← Pr – {pi} ; F ← {fi}
Iteratively add predicates to Pr' until it is complete
❑ find a pj ∈ Pr such that pj partitions some fk defined according to
minterm predicate over Pr' according to Rule 1
❑ set Pr' = Pr' ∪ {pj}; Pr ← Pr – {pj}; F ← F ∪ {fj}
❑ if ∃ pk ∈ Pr' which is nonrelevant then
Pr' ← Pr' – {pk}
F ← F – {fk}
❑ Simple predicates
❑ For application (1)
p1 : LOC = “Montreal”
p2 : LOC = “New York”
p3 : LOC = “Paris”
❑ For application (2)
p4 : BUDGET ≤ 200000
p5 : BUDGET > 200000
❑ Pr = Pr' = {p1,p2,p3,p4,p5}
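Given Pr', the minterm predicates are all conjunctions in which each simple predicate appears either in natural or negated form, minus the contradictory ones. The sketch below enumerates them by brute force. It assumes that LOC takes only the three listed values and uses two representative budgets. The variable and function names are hypothetical.

```python
from itertools import product

# Simple predicates p1..p5 as functions of (LOC, BUDGET).
simple = {
    "p1": lambda loc, b: loc == "Montreal",
    "p2": lambda loc, b: loc == "New York",
    "p3": lambda loc, b: loc == "Paris",
    "p4": lambda loc, b: b <= 200000,
    "p5": lambda loc, b: b > 200000,
}

# Assumed domains: only three locations, and one budget on each side of the threshold.
LOCS = ["Montreal", "New York", "Paris"]
BUDGETS = [200000, 200001]

def satisfiable(signs):
    """Keep a minterm only if some (LOC, BUDGET) pair satisfies it."""
    return any(
        all(pred(loc, b) == sign for pred, sign in zip(simple.values(), signs))
        for loc, b in product(LOCS, BUDGETS)
    )

# 2^5 candidate minterms; contradictory ones are dropped.
minterms = [s for s in product([True, False], repeat=len(simple)) if satisfiable(s)]
for signs in minterms:
    print(" AND ".join(n if s else "NOT " + n for n, s in zip(simple, signs)))
```

Under these assumptions only six of the 32 candidates survive: one per combination of location and budget range.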
© 2020, M.T. Özsu & P. Valduriez: TS.
Phan Thị HÀ_PTIT
PHF – Example
◼ Completeness
❑ Since Pr' is complete and minimal, the selection predicates are complete
◼ Reconstruction
❑ If relation R is fragmented into FR = {R1,R2,…,Rr}
R = ∪_{Ri ∈ FR} Ri
◼ Disjointness
❑ Minterm predicates that form the basis of fragmentation should
be mutually exclusive.
Ri = R ⋉F Si, 1≤i≤w
where w is the maximum number of fragments that will be
defined on R and
Si = σ_{Fi}(S)
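The semijoin definition above can be sketched in a few lines. The owner relation S(TITLE, SAL), the member relation R(ENO, TITLE), the salary threshold, and the helper name `semijoin` are all assumptions chosen for illustration.

```python
# Hypothetical owner S(TITLE, SAL) and member R(ENO, TITLE); join attribute is TITLE.
S = [("Elect. Eng.", 40000), ("Syst. Anal.", 34000),
     ("Mech. Eng.", 27000), ("Programmer", 24000)]
R = [("E1", "Elect. Eng."), ("E2", "Syst. Anal."),
     ("E3", "Mech. Eng."), ("E4", "Programmer")]

# Primary horizontal fragments of the owner: Si = selection of S by formula Fi.
S1 = [s for s in S if s[1] <= 30000]
S2 = [s for s in S if s[1] > 30000]

def semijoin(member, owner_fragment):
    """Member tuples whose join attribute matches some tuple of the owner fragment."""
    keys = {s[0] for s in owner_fragment}
    return [r for r in member if r[1] in keys]

# Derived horizontal fragments of the member relation.
R1, R2 = semijoin(R, S1), semijoin(R, S2)
print([r[0] for r in R1], [r[0] for r in R2])
```

Every member tuple ends up in a fragment only because each TITLE in R also exists in S, which is the referential integrity condition required for completeness.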
◼ Completeness
❑ Referential integrity
❑ Let R be the member relation of a link whose owner is relation S
which is fragmented as FS = {S1, S2, ..., Sn}. Furthermore, let A
be the join attribute between R and S. Then, for each tuple t of
R, there should be a tuple t' of S such that
t[A] = t' [A]
◼ Reconstruction
❑ Same as primary horizontal fragmentation.
◼ Disjointness
❑ Simple join graphs between the owner and the member fragments
◼ Overlapping fragments
❑ grouping
◼ Non-overlapping fragments
❑ splitting
◼ Application Information
❑ Attribute affinities
◼ a measure that indicates how closely related the attributes are
◼ This is obtained from more primitive usage data
❑ Attribute usage values
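◼ Given a set of queries Q = {q1, q2,…, qq} that will run on the relation
R[A1, A2,…, An],
use(qi, Aj) = 1 if attribute Aj is referenced by query qi, 0 otherwise
aff(Ai, Aj) = Σ_{all queries that access Ai and Aj} (query access)
query access = Σ_{all sites} (access frequency of a query × accesses per execution)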
Assume each query in the previous example accesses the attributes once
during each execution. Also assume the access frequencies

      S1  S2  S3
q1    15  20  10
q2     5   0   0
q3    25  25  25
q4     3   0   0

Then
aff(A1, A3) = 15*1 + 20*1 + 10*1 = 45
and the attribute affinity matrix AA is
bond(Ax, Ay) = Σ_{z=1}^{n} aff(Az, Ax) · aff(Az, Ay)
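The affinity and bond computations can be reproduced with a short script. The access frequencies are those listed in this section. The attribute usage matrix `use` is not shown here, so the one below is an assumption, chosen so that q1 is the only query that accesses both A1 and A3.

```python
# Assumed attribute usage matrix use(q, A) over A1..A4 (not given in this section).
use = {
    "q1": [1, 0, 1, 0],
    "q2": [0, 1, 1, 0],
    "q3": [0, 1, 0, 1],
    "q4": [0, 0, 1, 1],
}
# Access frequencies at sites S1, S2, S3, with one access per execution.
acc = {"q1": [15, 20, 10], "q2": [5, 0, 0], "q3": [25, 25, 25], "q4": [3, 0, 0]}
n = 4

def aff(i, j):
    """Sum, over queries that use both attributes, of the accesses at all sites."""
    return sum(sum(acc[q]) for q, u in use.items() if u[i] and u[j])

# Attribute affinity matrix AA (0-based indices, so A1 is index 0).
AA = [[aff(i, j) for j in range(n)] for i in range(n)]

def bond(x, y):
    """bond(Ax, Ay) = sum over z of aff(Az, Ax) * aff(Az, Ay)."""
    return sum(AA[z][x] * AA[z][y] for z in range(n))

print(aff(0, 2))
for row in AA:
    print(row)
```

With this usage matrix, `aff(0, 2)` gives the value 45 computed by hand for aff(A1, A3).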
Ordering (0-3-1) :
cont(A0,BUDGET,PNO) = 2bond(A0, BUDGET)+2bond(BUDGET, PNO)
–2bond(A0 , PNO)
= 8820
Ordering (1-3-2) :
cont(PNO,BUDGET,PNAME) = 10150
Ordering (2-3-4) :
cont (PNAME,BUDGET,LOC) = 1780
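The three contributions can be recomputed as a sanity check. The affinity matrix is not shown in this section, so the values below are assumed. The sketch also assumes that position 0 and the position to the right of PNAME in the third ordering are empty boundary columns whose bonds are 0. That is one reading under which the listed 1780 is obtained.

```python
# Assumed affinity matrix over (PNO, PNAME, BUDGET, LOC); not shown in this section.
names = ["PNO", "PNAME", "BUDGET", "LOC"]
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]
idx = {name: i for i, name in enumerate(names)}

def bond(x, y):
    """Bond of two attributes; None stands for an empty boundary column (bond 0)."""
    if x is None or y is None:
        return 0
    return sum(row[idx[x]] * row[idx[y]] for row in AA)

def cont(left, mid, right):
    """Net contribution of placing `mid` between `left` and `right`."""
    return 2 * bond(left, mid) + 2 * bond(mid, right) - 2 * bond(left, right)

results = [
    cont(None, "BUDGET", "PNO"),     # ordering (0-3-1)
    cont("PNO", "BUDGET", "PNAME"),  # ordering (1-3-2)
    cont("PNAME", "BUDGET", None),   # ordering (2-3-4), right side treated as empty
]
print(results)
```

The largest contribution comes from ordering (1-3-2), so BUDGET is placed between PNO and PNAME.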
TQ = set of applications that access only TA
BQ = set of applications that access only BA
OQ = set of applications that access both TA and BA
CTQ = total number of accesses to attributes by applications
that access only TA
CBQ = total number of accesses to attributes by applications
that access only BA
COQ = total number of accesses to attributes by applications
that access both TA and BA
Then find the point along the diagonal that maximizes
CTQ × CBQ − COQ²
Two problems :
Cluster forming in the middle of the CA matrix
❑ Shift a row up and a column left and apply the algorithm to find
the “best” partitioning point
❑ Do this for all possible shifts
❑ Cost O(m²)
More than two clusters
❑ m-way partitioning
❑ try 1, 2, …, m–1 split points along diagonal and try to find the
best point for each of these
❑ Cost O(2^m)
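The basic two-way split search can be sketched as follows. The objective is assumed to be z = CTQ × CBQ − COQ², as in the Özsu and Valduriez textbook treatment. The clustered attribute order, the query usage sets, and the total access counts are also assumptions for illustration. The shifting and m-way refinements are not implemented.

```python
# Assumed clustered order of the attributes along the diagonal of the CA matrix.
order = ["PNO", "BUDGET", "PNAME", "LOC"]
# Assumed attribute usage per query and total accesses per query over all sites.
uses = {
    "q1": {"PNO", "BUDGET"},
    "q2": {"PNAME", "BUDGET"},
    "q3": {"PNAME", "LOC"},
    "q4": {"BUDGET", "LOC"},
}
freq = {"q1": 45, "q2": 5, "q3": 75, "q4": 3}

def z_value(split):
    """Score the split into TA = order[:split] and BA = order[split:]."""
    ta, ba = set(order[:split]), set(order[split:])
    ctq = sum(freq[q] for q, a in uses.items() if a <= ta)               # only TA
    cbq = sum(freq[q] for q, a in uses.items() if a <= ba)               # only BA
    coq = sum(freq[q] for q, a in uses.items() if a & ta and a & ba)     # both
    return ctq * cbq - coq ** 2

# Try every split point along the diagonal and keep the best one.
scores = {s: z_value(s) for s in range(1, len(order))}
best = max(scores, key=scores.get)
print(scores, order[:best], order[best:])
```

With these inputs the best split separates {PNO, BUDGET} from {PNAME, LOC}. The key PNO would then be repeated in the second fragment.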
◼ Reconstruction
❑ Reconstruction can be achieved by
R = ⋈_K Ri, ∀ Ri ∈ FR
◼ Disjointness
❑ TID's are not considered to be overlapping since they are
maintained by the system
❑ Duplicated keys are not considered to be overlapping
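A minimal sketch of reconstruction by a join on the key, assuming a hypothetical PROJ relation with key PNO that is split into two vertical fragments. The data and the helper name `join_on_key` are illustrative only.

```python
# Hypothetical PROJ tuples: (PNO, PNAME, BUDGET, LOC), with PNO as the key.
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
]

# Each vertical fragment repeats the key PNO, which does not count as overlap.
PROJ1 = [(pno, budget) for pno, _, budget, _ in PROJ]
PROJ2 = [(pno, pname, loc) for pno, pname, _, loc in PROJ]

def join_on_key(f1, f2):
    """Join PROJ1(PNO, BUDGET) with PROJ2(PNO, PNAME, LOC) on PNO."""
    by_key = {t[0]: t[1:] for t in f2}
    return [
        (pno, by_key[pno][0], budget, by_key[pno][1])
        for pno, budget in f1
        if pno in by_key
    ]

rebuilt = join_on_key(PROJ1, PROJ2)
print(rebuilt == PROJ)
```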
Hybrid Fragmentation
❑ Data distribution
◼ Database information
❑ selectivity of fragments
❑ size of a fragment
◼ Application information
❑ access types and numbers
❑ access localities
◼ Computer system information
❑ unit cost of storing data at a site
❑ unit cost of processing at a site
◼ Communication network information
❑ bandwidth
❑ latency
❑ communication overhead
General Form
min(Total Cost)
subject to
response time constraint
storage constraint
processing constraint
Decision Variable
xij = 1 if fragment Fi is stored at site Sj
xij = 0 otherwise
◼ Total Cost
Processing component
access cost + integrity enforcement cost + concurrency control cost
❑ Access cost
◼ Constraints
❑ Response Time
execution time of query ≤ max. allowable response time for that query
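To make the general form concrete, here is a toy instance solved by exhaustive search over the decision variables xij. The cost model is an assumption, not the one from the slides: only storage cost and remote read cost are counted, updates and the response time constraint are ignored, and all sizes, capacities, and costs are invented.

```python
from itertools import product

fragments, sites = ["F1", "F2"], ["S1", "S2"]
size = {"F1": 4, "F2": 6}               # hypothetical fragment sizes
capacity = {"S1": 8, "S2": 8}           # hypothetical storage constraint per site
store_cost = {"S1": 1.0, "S2": 2.0}     # hypothetical unit storage cost per site
# reads[f][s]: hypothetical number of reads of fragment f issued at site s.
reads = {"F1": {"S1": 10, "S2": 1}, "F2": {"S1": 2, "S2": 20}}
remote_cost = 5.0                       # hypothetical cost of one remote read

def total_cost(x):
    """Storage cost of every stored copy plus cost of reads with no local copy."""
    storage = sum(x[f, s] * size[f] * store_cost[s] for f in fragments for s in sites)
    remote = sum(reads[f][s] * remote_cost
                 for f in fragments for s in sites if not x[f, s])
    return storage + remote

def feasible(x):
    """Each fragment is stored somewhere and no site exceeds its capacity."""
    stored = all(any(x[f, s] for s in sites) for f in fragments)
    fits = all(sum(x[f, s] * size[f] for f in fragments) <= capacity[s] for s in sites)
    return stored and fits

keys = [(f, s) for f in fragments for s in sites]
candidates = (dict(zip(keys, bits)) for bits in product([0, 1], repeat=len(keys)))
best = min((x for x in candidates if feasible(x)), key=total_cost)
print(best, total_cost(best))
```

Exhaustive search examines 2^(fragments × sites) assignments, so it is usable only for toy inputs. This is why practical approaches rely on heuristics.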
◼ Solution Methods
❑ FAP is NP-complete
❑ DAP also NP-complete
◼ Heuristics based on
❑ single commodity warehouse location (for FAP)
❑ knapsack problem
❑ branch and bound techniques
❑ network flow
❑ Combined approaches
◼ Exemplar: Schism
❑ Graph G=(V,E) where
◼ vertex vi ∈ V represents a tuple in database,
◼ edge e=(vi,vj) ∈ E represents a query that accesses both tuples vi
and vj;
◼ each edge has a weight counting the number of queries that access both tuples
❑ Perform vertex disjoint graph partitioning
◼ Each vertex is assigned to exactly one partition
◼ If monitoring tuple-level access (E-Store), this will tell
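The graph construction described for Schism can be sketched as below. The query trace and the two candidate assignments are hypothetical. The sketch only builds the weighted co-access graph and scores a given partitioning by its cut weight. A real system would pass the graph to a graph partitioner instead.

```python
from collections import Counter
from itertools import combinations

# Hypothetical trace: each query is the set of tuple ids it accesses.
trace = [{1, 2}, {1, 2}, {2, 3}, {3, 4}, {3, 4}, {3, 4}]

def co_access_graph(trace):
    """Edge (u, v) is weighted by the number of queries accessing both tuples."""
    edges = Counter()
    for query in trace:
        for u, v in combinations(sorted(query), 2):
            edges[u, v] += 1
    return edges

def cut_weight(edges, part):
    """Total weight of edges whose endpoints fall in different partitions."""
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])

edges = co_access_graph(trace)
good = {1: 0, 2: 0, 3: 1, 4: 1}   # each tuple assigned to exactly one partition
bad = {1: 0, 2: 1, 3: 1, 4: 0}
print(dict(edges), cut_weight(edges, good), cut_weight(edges, bad))
```

A lower cut weight means fewer queries touch tuples in more than one partition, so the `good` assignment is preferable here.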