Distributed Database Design
Distributed Database Design
Table of Contents
1.-Design Problem 1.1 What's design? 1.2 Organization of distributed systems
4.- Allocation 4.1 Allocation Problem 4.2 Optimality Measures 4.3 Alternatives 4.4 Information Requirements 4.4.1 Database
4.4.2 Application
4.4.3 Site 4.4.4 Network 5.- Bottom-Up Strategy 5.1 Data Warehouse Approach 5.2 Design Process 5.3 Approaches 5.4 Relationship Schemes
5.5.1 Matching/Integration/Mapping
What is design?
What is design?
Design is the creation of a plan or convention for the construction of an object or a system (as in architectural blueprints, engineering drawings, business processes, circuit diagrams and sewing patterns).
Design Problem
Design problem of distributed systems: Making decisions about the placement of data and programs across the sites of a computer network.
Level of sharing
Behavior of access patterns Level of knowledge (on access pattern behavior)
Level of sharing
There are three possibilities in terms of this level:
No Sharing: Each application and its data execute at one site, and there is no communication with any other program or access to any data file at other sites. Level data sharing: All the programs are replicated at all the sites, but data files are not. Data-plus-program sharing: Both data and programs may be shared, meaning that a program at 7 a given site can request a service from another
Level of knowledge
The possibilities within this level are:
Without Information: It is not known by the designer how the users will access the database.
Framework of Distribution
10
Design Strategies
Top-down approach - Designing systems from scratch - Homogeneous systems - Better for tightly integrated homogeneous DDBMS
Bottom-up approach
- The databases already exist at a number of sites
11
Requirements Analysis: Defines the environment of the system. System Requirements (Objectives): Performance, reliability and availability, economics, and expandability (flexibility). Requirements document is input to two parallel activities: - Conceptual Design: Determine entity types and relationships among these entities. Entity Analysis/Functional Analysis
12
Distribution Design: Takes as inputs the results collected from View Design (Global Conceptual Schema -GCS- and Access Patterns).
13
Design Schema
14
Distribution of Relations
We first divide them into sub-relations called fragments. Therefore, the distribution design activity splits in two steps: - Fragmentation - Allocation Physical Design: Maps the LCS to the physical storage devices available at the corresponding sites.
15
Why fragment at all? How should we fragment? How much should we fragment? Is there any way to test the correctness of decomposition?
From a distribution viewpoint, there is really no need to fragment data. The important issue is to have an appropriate unit of distribution. A relation is not a suitable unit, because: - Application views are usually subsets of relations. - If the applications that have views defined on a given relation reside at different sites, two 17 alternatives can be followed (one site or some/all of
Fragmentation Difficulties
If the applications have conflicting requirements that prevent decomposition of the relation into mutually exclusive fragments, those applications whose views are defined on more than one fragment may suffer performance degradation. As a result of fragmentation, attributes participating in a dependency may be decomposed into different fragments that might be allocated to different sites. In this case, even the simpler task of checking for dependencies would result in chasing after data in 18 a number of sites.
Fragmentation Alternatives
Relation instances are essentially tables, so the issue is one of finding alternative ways of dividing a table into smaller ones. There are two ways to do so:
Horizontally
Vertically Hybrid/Mixed (Nested)
19
Fragmentation Alternatives
20
Fragmentation Alternatives
- Horizontal
- Vertical
21
Degree of Fragmentation
The degree of fragmentation goes from one of two extremes:
Degree of Fragmentation
How to do so?
23
24
Completeness
For a relation instance R decomposed into fragments FR = {R1, R2, . . . ,Rn}, each data item that can be found in R can also be found in one or more of Ris
Ensures that the data in a global relation are mapped into fragments without any loss.
Horizontal Tuple Vertical - Attribute
25
Reconstruction
For a relation R decomposed into fragments FR = {R1, R2, . . . ,Rn}, it should be possible to define a relational operator such that:
R = Ri, Ri FR
Ensures that constraints defined on the data in the form of dependencies are preserved. ( will be different for different forms of fragmentation)
26
Disjointness
For a relation R horizontally decomposed into fragments FR = {R1, R2, . . . ,Rn} and data item di is in Rj, it is not in any other fragment Rk (k j).
Allocation Problem
Assume that there are a set of fragments:
F = {F1, F2, . . . , Fn } And a distributed system consisting of sites: S = {S1, S2 , . . . , Sm } On which a set of applications:
Optimality Measures
The optimality can be defined with respect of two measures:
Minimal Cost: The cost function consists of the cost of storing each Fi at a site Sj, the cost of querying Fi at site Sj, the cost of updating Fi at all sites where it is stored, and the cost of data communication. (Scheme that minimizes a combined cost function) Performance: The allocation strategy is designed to maintain a performance metric. Two well-known 29
Allocation
Assuming that the database is fragmented properly, one has to decide on the allocation of the fragments to various sites on the network.
30
Allocation Alternatives
The reasons for Replication are:
Reliability: With multiple copies, comes good chances for some copies to be accessible even with system failures. Efficiency of read-only queries: Queries that access the same data item can be executed in parallel (since there are copies in multiple sites). However the execution of an updated query, brings troubles, since the system has to check all the copies 31 were updated correctly.
Allocation Alternatives
Partitioned DB: Non-Replicated database contains fragments that are allocated to sites, and there is only one copy of any fragment on the network.
In case of replication:
The database exists in its entirety at each site (Fully Replicated DB). Fragments are distributed to the sites in such a way that copies of a fragment may reside in multiple sites (Partially Replicated DB).
32
Allocation Alternatives
33
Information Requirements
Too many factors contribute to an optimal design.
The logical organization of the database. The location of the applications. The access characteristics of the applications to the database.
Information Requirements
The information needed for distribution design can be divided into four categories:
Database information.
Application information.
Communication network information. (Quantitative nature) Computer system information. (Quantitative nature)
35
Database Information
We now need to define the selectivity of a fragment Fj with respect to query qi. This is the number of tuples of Fj that need to be accessed in order to process qi.
SELi (Fj). Another necessary information on the database fragments is their size. The size of a fragment Fj is given by:
36
Application Information
The two important measures are the number of read accesses that a query qi makes to a fragment Fj during its execution (RRij), and its counterpart for the update accesses (URij).
We also need to define two matrices UM and RM, with elements uij and rij, respectively. A vector O to specifies the originating site of query qi.
37
Site Information
For each computer site, we need to know its storage and processing capacity. (Elaborate functions or by estimates)
Network Information
In our model we assume the existence of a simple network where the cost of communication is defined in terms of one frame of data (cost per frame between sites Si and Sj). To enable the calculation of the number of messages, we use the size (in bytes) of one frame.
39
The process consists of integrating local databases with their (local) schemes into a global database with its global conceptual schema (GCS). Database integration can be either physical or logical.
The integration is aided by extract-transform-load (ETL) tools that enable extraction of data from
40
41
Approaches
GCS Defined first: In which case the bottom-up design involves mapping LCSs to this schema. (This is the case in data warehouses)
GCS defined as an integration of parts of LCSs: In this case, the bottom-up design involves both the generation of the GCS and the mapping of individual LCSs to this GCS.
42
Relationship of Schemes
If the GCS is defined up-front, the relationship between the GCS and the LCS can be of two fundamental types:
Local-as-View (LAV): The GCS definition exists, and each LCS is treated as a view definition over it. (The results are constrained by the objects in the local DBMSs).
Global-as-View (GAV): The GCS is defined as a set of views over the LCSs. (The query results are constrained to the set of 43 objects that are defined in the GCS).
44
Schema Matching: to determine the syntactic and semantic correspondences among the translated LCS elements. Schema Integration: of the common schema elements into a global conceptual (mediated) schema if one has not yet been defined. Schema Mapping: determines how to map the elements of each LCS to the other elements of the
45
46
Scheme Heterogeneity
Structural conflicts occur in four possible ways:
Type conflicts: when the same object is represented by an attribute in one schema and by an entity (relation) in another Dependency conflicts: when different relationship modes are used to represent the same thing in different schemes. Key conflicts: when different candidate keys are available and different primary keys are selected in different schemes.
47
Integration Methodologies
This step is only necessary if a GCS has not already been defined and matching was performed on individual LCSs.
Binary: involve the manipulation of two schemes at a time. These can occur in a stepwise (ladder) fashion or in a purely binary fashion, where each schema is integrated with one other.
Nary: offers more flexibility (more information is available) and is more general (the number of schemes can be varied depending on the integrators48 preferences).
Integration Methodologies
Binary
Nary
49
Mapping Methodologies
Once a GCS is defined, it is necessary to identify how the data from each of the local databases (source) can be mapped to GCS (target) while preserving semantic consistency (as defined by both the source and the target). Although schema matching has identified the correspondences between the LCSs and the GCS, it may not have identified explicitly how to obtain the global database from the local ones.
50
Mapping Methodologies
Mapping Creation: the process of creating explicit queries that map data from a local database to the global data.
Mapping Maintenance: the detection and correction of mapping inconsistencies resulting from schema evolution.
51
Data Cleaning
Errors in source databases can always occur, requiring cleaning in order to correctly answer user queries. Cleaning is performed as the global database is created. (Warehouse) The process needs to be performed during query processing when data are returned from the source databases. (Integration Systems)
52