2 Distribution Design
TS. Phan Thị Hà_PTIT
◼ Completeness
❑ Decomposition of relation R into fragments R1, R2, ..., Rn is
complete if and only if each data item in R can also be found in
some Ri
◼ Reconstruction
❑ If relation R is decomposed into fragments R1, R2, ..., Rn, then
there should exist some relational operator ∇ such that
$R = \nabla_{1 \le i \le n} R_i$
◼ Disjointness
❑ If relation R is decomposed into fragments R1, R2, ..., Rn, and
data item di is in Rj, then di should not be in any other fragment
Rk (k ≠ j ).
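The three rules above can be checked mechanically for a horizontal fragmentation. The following is a minimal sketch, assuming that tuples are hashable Python tuples and that the reconstruction operator ∇ is set union; the function name and the sample PROJ rows are illustrative and not taken from the slides.

```python
def check_horizontal_fragmentation(relation, fragments):
    """Return (complete, reconstructible, disjoint) for fragments R1..Rn of R."""
    original = set(relation)
    union = set().union(*map(set, fragments))
    # Completeness: every tuple of R appears in some fragment.
    complete = original <= union
    # Reconstruction (assumed operator: union): the fragments give back exactly R.
    reconstructible = union == original
    # Disjointness: no tuple is stored in two different fragments.
    disjoint = sum(len(set(f)) for f in fragments) == len(union)
    return complete, reconstructible, disjoint


# Illustrative rows for PROJ[PNO, PNAME, BUDGET, LOC] (assumed sample data).
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
]
low = [t for t in PROJ if t[2] <= 200000]
high = [t for t in PROJ if t[2] > 200000]

print(check_horizontal_fragmentation(PROJ, [low, high]))            # (True, True, True)
print(check_horizontal_fragmentation(PROJ, [low, high + low[:1]]))  # (True, True, False)
```

The second call stores one tuple in both fragments, so only the disjointness rule fails.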
◼ Non-replicated
❑ partitioned : each fragment resides at only one site
◼ Replicated
❑ fully replicated : each fragment at each site
❑ partially replicated : each fragment at some of the sites
◼ Rule of thumb:
◼ Database Information
❑ relationship
Example
m1: PNAME="Maintenance" ∧ BUDGET≤200000
◼ Application Information
❑ minterm selectivities: sel(mi)
◼ The number of tuples of the relation that would be accessed by a
user query which is specified according to a given minterm
predicate mi.
❑ access frequencies: acc(qi)
◼ The frequency with which a user application qi accesses data.
◼ Access frequency for a minterm predicate can also be defined.
Definition :
$R_j = \sigma_{F_j}(R),\ 1 \le j \le w$
where Fj is a selection formula, which is (preferably) a minterm
predicate.
Therefore,
A horizontal fragment Ri of relation R consists of all the tuples of R
which satisfy a minterm predicate mi.
Given a set of minterm predicates M, there are as many horizontal
fragments of relation R as there are minterm predicates.
The set of horizontal fragments is also referred to as the set of minterm fragments.
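As a rough illustration of the definition above, each fragment is just a selection on R with one minterm predicate. The sketch below assumes tuples are Python dicts and minterm predicates are plain functions; the sample rows and the names `minterms` and `horizontal_fragments` are hypothetical.

```python
# Illustrative rows for PROJ[PNO, PNAME, BUDGET, LOC] (assumed sample data).
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
]

# One selection formula per fragment; here each minterm fixes the location.
minterms = {
    "m1": lambda t: t["LOC"] == "Montreal",
    "m2": lambda t: t["LOC"] == "New York",
    "m3": lambda t: t["LOC"] == "Paris",
}


def horizontal_fragments(relation, minterms):
    """Apply the selection for each minterm predicate to the relation."""
    return {name: [t for t in relation if pred(t)]
            for name, pred in minterms.items()}


fragments = horizontal_fragments(PROJ, minterms)
for name, rows in fragments.items():
    print(name, [t["PNO"] for t in rows])
```

With these three predicates there are exactly three fragments, one per minterm, as the text states.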
Preliminaries :
❑ Pr should be complete
❑ Pr should be minimal
◼ Example :
❑ Assume PROJ[PNO,PNAME,BUDGET,LOC] has two
applications defined on it.
❑ Find the budgets of projects at each location. (1)
❑ Find projects with budgets less than $200000. (2)
According to (1),
Pr={LOC=“Montreal”,LOC=“New York”,LOC=“Paris”}
which is not complete with respect to (2), because (2) also distinguishes tuples by BUDGET.
Example :
Pr ={LOC=“Montreal”,LOC=“New York”, LOC=“Paris”,
BUDGET≤200000,BUDGET>200000}
Initialization :
❑ find a pi ∈ Pr such that pi partitions R according to Rule 1
❑ set Pr' = pi ; Pr ← Pr – {pi} ; F ← {fi}
Iteratively add predicates to Pr' until it is complete
❑ find a pj ∈ Pr such that pj partitions some fk defined according to
minterm predicate over Pr' according to Rule 1
❑ set Pr' = Pr' ∪ {pj}; Pr ← Pr – {pj}; F ← F ∪ {fj}
❑ if ∃ pk ∈ Pr' which is nonrelevant then
Pr' ← Pr' – {pk}
F ← F – {fk}
❑ Simple predicates
❑ For application (1)
p1 : LOC = “Montreal”
p2 : LOC = “New York”
p3 : LOC = “Paris”
❑ For application (2)
p4 : BUDGET ≤ 200000
p5 : BUDGET > 200000
❑ Pr = Pr' = {p1,p2,p3,p4,p5}
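From the five simple predicates, the minterm predicates are all conjunctions in which each pi appears either in natural or in negated form, and most of them are contradictory. The sketch below enumerates them and keeps the satisfiable ones. It assumes the domain of LOC is limited to the three listed cities and uses a few probe tuples to test satisfiability; the helper name `satisfiable_minterms` is hypothetical.

```python
from itertools import product

# Simple predicates p1..p5 from the example, written as Python functions.
simple_predicates = {
    "p1": lambda t: t["LOC"] == "Montreal",
    "p2": lambda t: t["LOC"] == "New York",
    "p3": lambda t: t["LOC"] == "Paris",
    "p4": lambda t: t["BUDGET"] <= 200000,
    "p5": lambda t: t["BUDGET"] > 200000,
}

# Probe tuples covering every combination of the assumed LOC domain
# and both sides of the BUDGET threshold.
probes = [{"LOC": loc, "BUDGET": budget}
          for loc in ("Montreal", "New York", "Paris")
          for budget in (200000, 200001)]


def satisfiable_minterms(preds, probes):
    """Enumerate all sign patterns and keep those some probe tuple satisfies."""
    names = list(preds)
    kept = []
    for signs in product([True, False], repeat=len(names)):
        if any(all(preds[n](t) == s for n, s in zip(names, signs))
               for t in probes):
            kept.append(dict(zip(names, signs)))
    return kept


minterms = satisfiable_minterms(simple_predicates, probes)
print(len(minterms))  # 6 of the 32 candidate minterms are satisfiable
```

Each surviving minterm fixes one location and one budget range, so PROJ can have at most six minterm fragments under these assumptions.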
© 2020, M.T. Özsu & P. Valduriez: TS. Phan Thị Hà_PTIT
PHF – Example
◼ Completeness
❑ Since Pr' is complete and minimal, the selection predicates are
complete
◼ Reconstruction
❑ If relation R is fragmented into FR = {R1,R2,…,Rr}
$R = \bigcup_{\forall R_i \in F_R} R_i$
◼ Disjointness
❑ Minterm predicates that form the basis of fragmentation should
be mutually exclusive.
$R_i = R \ltimes_F S_i,\ 1 \le i \le w$
where w is the maximum number of fragments that will be
defined on R and
$S_i = \sigma_{F_i}(S)$
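The derived fragmentation above can be sketched as a semijoin of the member relation R with each owner fragment Si. This is a minimal illustration, assuming dict tuples and a single join attribute; the PAY and EMP rows and the salary threshold are hypothetical sample data, not taken from the slides.

```python
def semijoin(member, owner_fragment, attr):
    """Return the tuples of `member` that join with some tuple of `owner_fragment`."""
    keys = {s[attr] for s in owner_fragment}
    return [r for r in member if r[attr] in keys]


# Owner relation S (hypothetical PAY[TITLE, SAL]) and member relation R (EMP).
PAY = [
    {"TITLE": "Elect. Eng.", "SAL": 40000},
    {"TITLE": "Syst. Anal.", "SAL": 34000},
    {"TITLE": "Mech. Eng.", "SAL": 27000},
    {"TITLE": "Programmer", "SAL": 24000},
]
EMP = [
    {"ENO": "E1", "TITLE": "Elect. Eng."},
    {"ENO": "E2", "TITLE": "Syst. Anal."},
    {"ENO": "E3", "TITLE": "Mech. Eng."},
    {"ENO": "E4", "TITLE": "Programmer"},
]

# Primary fragmentation of the owner: Si = selection on S with formula Fi.
PAY1 = [s for s in PAY if s["SAL"] <= 30000]
PAY2 = [s for s in PAY if s["SAL"] > 30000]

# Derived fragmentation of the member: Ri = R semijoin Si.
EMP1 = semijoin(EMP, PAY1, "TITLE")
EMP2 = semijoin(EMP, PAY2, "TITLE")
print([e["ENO"] for e in EMP1], [e["ENO"] for e in EMP2])
```

Because every EMP tuple matches exactly one PAY tuple here, the derived fragments are complete and disjoint.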
◼ Completeness
❑ Referential integrity
❑ Let R be the member relation of a link whose owner is relation S
which is fragmented as FS = {S1, S2, ..., Sn}. Furthermore, let A
be the join attribute between R and S. Then, for each tuple t of
R, there should be a tuple t' of S such that
t[A] = t' [A]
◼ Reconstruction
❑ Same as primary horizontal fragmentation.
◼ Disjointness
❑ Simple join graphs between the owner and the member
fragments.
◼ Overlapping fragments
❑ grouping
◼ Non-overlapping fragments
❑ splitting
◼ Application Information
❑ Attribute affinities
◼ a measure that indicates how closely related the attributes are
◼ This is obtained from more primitive usage data
❑ Attribute usage values
◼ Given a set of queries Q = {q1, q2,…, qq} that will run on the relation
R[A1, A2,…, An],
$\text{query access} = \sum_{\text{all sites}} \left( \text{access frequency of a query} \times \dfrac{\text{access}}{\text{execution}} \right)$
Assume each query in the previous example accesses the attributes once
during each execution.
Also assume the access frequencies

      S1   S2   S3
q1    15   20   10
q2     5    0    0
q3    25   25   25
q4     3    0    0
Then
aff(A1, A3) = 15*1 + 20*1+10*1
= 45
and the attribute affinity matrix AA is
(Let A1=PNO, A2=PNAME, A3=BUDGET,
A4=LOC)
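The affinity computation can be reproduced in a few lines. The sketch below uses the access frequencies given above; the attribute usage matrix `use` is not shown in this section and is an assumption (q1 uses PNO and BUDGET, q2 uses PNAME and BUDGET, q3 uses PNAME and LOC, q4 uses BUDGET and LOC), chosen so that aff(A1, A3) comes out as 45.

```python
# use[k][i] = 1 if query q(k+1) references attribute A(i+1).  ASSUMED values.
use = [
    [1, 0, 1, 0],  # q1: PNO, BUDGET
    [0, 1, 1, 0],  # q2: PNAME, BUDGET
    [0, 1, 0, 1],  # q3: PNAME, LOC
    [0, 0, 1, 1],  # q4: BUDGET, LOC
]
# acc[k][s] = access frequency of query q(k+1) at site S(s+1), from the table.
acc = [
    [15, 20, 10],
    [5, 0, 0],
    [25, 25, 25],
    [3, 0, 0],
]


def affinity_matrix(use, acc, refs_per_execution=1):
    """aff(Ai, Aj) = sum, over queries using both, of their accesses at all sites."""
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for k, row in enumerate(use):
        query_access = refs_per_execution * sum(acc[k])
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aa[i][j] += query_access
    return aa


AA = affinity_matrix(use, acc)
print(AA[0][2])  # aff(A1, A3) = 45
for row in AA:
    print(row)
```

Only q1 uses both A1 and A3, which is why aff(A1, A3) reduces to 15 + 20 + 10.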
where
$\text{bond}(A_x, A_y) = \sum_{z=1}^{n} \text{aff}(A_z, A_x)\,\text{aff}(A_z, A_y)$
Ordering (0-3-1) :
cont(A0,BUDGET,PNO) = 2bond(A0, BUDGET)+2bond(BUDGET, PNO)
–2bond(A0 , PNO)
= 8820
Ordering (1-3-2) :
cont(PNO,BUDGET,PNAME) = 10150
Ordering (2-3-4) :
cont (PNAME,BUDGET,LOC) = 1780
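The three contributions above can be checked numerically. This sketch assumes the affinity matrix AA below, which is not printed in this section, and treats a boundary position (A0, or a column not yet placed in the clustered matrix) as having zero bond; `None` stands for such a position.

```python
# ASSUMED attribute affinity matrix, rows and columns ordered PNO, PNAME, BUDGET, LOC.
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]
PNO, PNAME, BUDGET, LOC = 0, 1, 2, 3


def bond(aa, x, y):
    """bond(Ax, Ay) = sum over z of aff(Az, Ax) * aff(Az, Ay); 0 at a boundary."""
    if x is None or y is None:
        return 0
    return sum(aa[z][x] * aa[z][y] for z in range(len(aa)))


def cont(aa, left, middle, right):
    """Net contribution of placing `middle` between `left` and `right`."""
    return (2 * bond(aa, left, middle)
            + 2 * bond(aa, middle, right)
            - 2 * bond(aa, left, right))


print(cont(AA, None, BUDGET, PNO))   # ordering (0-3-1): 8820
print(cont(AA, PNO, BUDGET, PNAME))  # ordering (1-3-2): 10150
print(cont(AA, PNAME, BUDGET, None)) # ordering (2-3-4): 1780
```

In the last call LOC has not been placed yet, so it is passed as a boundary. The largest value, 10150, selects ordering (1-3-2).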
Define
TQ = set of applications that access only TA
BQ = set of applications that access only BA
OQ = set of applications that access both TA and BA
and
CTQ = total number of accesses to attributes by applications
that access only TA
CBQ = total number of accesses to attributes by applications
that access only BA
COQ = total number of accesses to attributes by applications
that access both TA and BA
Then find the point along the diagonal that maximizes
$CTQ \times CBQ - COQ^2$
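The search for the split point can be sketched as a loop over the positions along the diagonal of the clustered matrix. The attribute order, the attribute sets used by each query, and the per-query access totals below are assumptions for illustration; the objective is the expression CTQ × CBQ − COQ² defined above.

```python
def best_split(order, query_attrs, query_access):
    """Try every split of `order` into TA (top) and BA (bottom); maximize z."""
    best = None
    for k in range(1, len(order)):
        ta, ba = set(order[:k]), set(order[k:])
        ctq = cbq = coq = 0
        for attrs, freq in zip(query_attrs, query_access):
            if attrs <= ta:      # query touches only TA
                ctq += freq
            elif attrs <= ba:    # query touches only BA
                cbq += freq
            else:                # query touches both TA and BA
                coq += freq
        z = ctq * cbq - coq ** 2
        if best is None or z > best[0]:
            best = (z, order[:k], order[k:])
    return best


# ASSUMED clustered order and query usage (q1..q4), with total accesses per query.
order = ["PNO", "BUDGET", "PNAME", "LOC"]
query_attrs = [{"PNO", "BUDGET"}, {"PNAME", "BUDGET"}, {"PNAME", "LOC"}, {"BUDGET", "LOC"}]
query_access = [45, 5, 75, 3]

z, top, bottom = best_split(order, query_attrs, query_access)
print(z, top, bottom)  # 3311 ['PNO', 'BUDGET'] ['PNAME', 'LOC']
```

In practice the key attribute would be repeated in every fragment so that the relation can be reconstructed by a join.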
Two problems :
Cluster forming in the middle of the CA matrix
❑ Shift a row up and a column left and apply the algorithm to find
the “best” partitioning point
❑ Do this for all possible shifts
❑ Cost $O(m^2)$
More than two clusters
❑ m-way partitioning
❑ try 1, 2, …, m–1 split points along diagonal and try to find the
best point for each of these
❑ Cost $O(2^m)$
◼ Reconstruction
❑ Reconstruction can be achieved by
$R = \bowtie_K R_i,\ \forall R_i \in F_R$
◼ Disjointness
❑ TID's are not considered to be overlapping since they are
maintained by the system
❑ Duplicated keys are not considered to be overlapping
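Reconstruction by a join on the key can be illustrated as follows. This is a minimal sketch, assuming dict tuples, PNO as the key K, and a split of PROJ into (PNO, BUDGET) and (PNO, PNAME, LOC); the sample rows and helper names are hypothetical.

```python
def project(relation, attrs):
    """Vertical fragment: keep only the listed attributes of each tuple."""
    return [{a: t[a] for a in attrs} for t in relation]


def join_on_key(frag_a, frag_b, key):
    """Natural join of two vertical fragments that share the key attribute."""
    index = {t[key]: t for t in frag_b}
    return [{**t, **index[t[key]]} for t in frag_a if t[key] in index]


# Illustrative rows for PROJ[PNO, PNAME, BUDGET, LOC] (assumed sample data).
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
]

PROJ1 = project(PROJ, ["PNO", "BUDGET"])
PROJ2 = project(PROJ, ["PNO", "PNAME", "LOC"])

rebuilt = join_on_key(PROJ1, PROJ2, "PNO")
print(rebuilt == PROJ)  # True
```

The key PNO appears in both fragments, which is the duplication that the disjointness rule above does not count as overlap.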
Hybrid Fragmentation
❑ Data distribution
◼ Database information
❑ selectivity of fragments
❑ size of a fragment
◼ Application information
❑ access types and numbers
❑ access localities
◼ Communication network information
❑ bandwidth
❑ latency
❑ communication overhead
◼ Computer system information
❑ unit cost of storing data at a site
❑ unit cost of processing at a site
Allocation
General Form
min(Total Cost)
subject to
response time constraint
storage constraint
processing constraint
Decision Variable
$x_{ij} = \begin{cases} 1 & \text{if fragment } F_i \text{ is stored at site } S_j \\ 0 & \text{otherwise} \end{cases}$
◼ Total Cost
Processing component
access cost + integrity enforcement cost + concurrency control cost
❑ Access cost
◼ Constraints
❑ Response Time
execution time of query ≤ max. allowable response time for that query
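To make the general form concrete, here is a toy exhaustive search over the decision variables xij. It is only a sketch under strong assumptions that are not in the slides: allocation is non-replicated, the only constraint is a storage capacity per site, and the cost counts one unit for every access issued from a site that does not hold the fragment. All names and numbers are hypothetical.

```python
from itertools import product


def allocate(sizes, capacity, access):
    """Brute-force non-replicated allocation minimizing remote accesses.

    sizes[i]     : size of fragment Fi
    capacity[j]  : storage capacity of site Sj
    access[i][s] : number of accesses to Fi issued from site Ss
    Returns (cost, x) with x[i][j] = 1 if Fi is stored at Sj, or None.
    """
    n_frag, n_site = len(sizes), len(capacity)
    best_cost, best_placement = None, None
    for placement in product(range(n_site), repeat=n_frag):
        used = [0] * n_site
        for i, j in enumerate(placement):
            used[j] += sizes[i]
        if any(u > c for u, c in zip(used, capacity)):
            continue  # storage constraint violated
        cost = sum(access[i][s]
                   for i, j in enumerate(placement)
                   for s in range(n_site) if s != j)
        if best_cost is None or cost < best_cost:
            best_cost, best_placement = cost, placement
    if best_placement is None:
        return None
    x = [[1 if best_placement[i] == j else 0 for j in range(n_site)]
         for i in range(n_frag)]
    return best_cost, x


result = allocate(sizes=[10, 20], capacity=[20, 20], access=[[5, 1], [4, 9]])
print(result)  # (5, [[1, 0], [0, 1]])
```

The search space grows as (number of sites) to the power of (number of fragments), so this approach is usable only for very small instances.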
◼ Solution Methods
❑ FAP is NP-complete
❑ DAP is also NP-complete
◼ Heuristics based on
❑ single commodity warehouse location (for FAP)
❑ knapsack problem
❑ branch and bound techniques
❑ network flow
❑ Combined approaches
◼ Exemplar: Schism
❑ Graph G=(V,E) where
◼ vertex vi ∈ V represents a tuple in database,
◼ edge e=(vi,vj) ∈ E represents a query that accesses both tuples vi
and vj;
◼ each edge has weight counting the no. of queries that access both
tuples
❑ Perform vertex disjoint graph partitioning
◼ Each vertex is assigned to exactly one partition
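The graph described above can be built from a query trace in a few lines. This sketch only constructs the weighted co-access graph and evaluates the cut weight of a given assignment; it does not implement a graph partitioner. The trace format (one set of tuple identifiers per query) and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def co_access_graph(trace):
    """Edge weight = number of queries that access both endpoint tuples."""
    edges = Counter()
    for accessed in trace:
        for u, v in combinations(sorted(set(accessed)), 2):
            edges[(u, v)] += 1
    return edges


def cut_weight(edges, assignment):
    """Total weight of edges whose endpoints fall in different partitions."""
    return sum(w for (u, v), w in edges.items()
               if assignment[u] != assignment[v])


# Hypothetical trace: each entry lists the tuples touched by one query.
trace = [{"t1", "t2"}, {"t1", "t2"}, {"t2", "t3"}, {"t3", "t4"}, {"t3", "t4"}]
edges = co_access_graph(trace)

good = {"t1": 0, "t2": 0, "t3": 1, "t4": 1}
bad = {"t1": 0, "t2": 1, "t3": 1, "t4": 0}
print(cut_weight(edges, good), cut_weight(edges, bad))  # 1 4
```

A lower cut weight means fewer queries span two partitions, which is the quantity a vertex-disjoint partitioning would try to minimize.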
SELECT PNAME FROM PROJ WHERE BUDGET>? AND LOC='?'
◼ If monitoring tuple-level access (E-Store), this will tell
you