CS 4221: Database Design
Physical Database Design
It is the process of transforming a logical data model into
a physical model of a database.
Unlike a logical design, a physical database design is
optimized for data-access paths, performance
requirements and other constraints of the target
environment, i.e. hardware and software.
Before you can begin the physical design, you must have:
(1) a logical database design, minimally in third normal form
(2) Transaction characterization, such as
most frequent transactions
most complex or resource-intensive transactions
distribution of transactions over time
mix of insert, update, delete and select statements
most critical transactions to the applications
(3) Performance requirements
Ref: R. Gillette, D. Muench, and J. Tabaka. Physical Database Design for SYBASE SQL Server. Prentice Hall, 1995.
Physical database design activities
Defining Tables and Columns
Defining Keys
Controlling Access
Managing Objects: sizes, placement
Note: Table and column mean relation and attribute. Also, Supertype and Subtype mean Superclass and Subclass, respectively.
1. Defining Tables and Columns – The initial transformation of the logical
model into a physical model, including naming objects, choosing data
types and lengths, and handling null values.
2. Defining Keys – Choosing primary and foreign keys, including the use of
surrogate keys.
3. Identifying Critical Transactions – Identifying business transactions
that are high-value, mission-critical, frequently performed, or costly in
terms of computing resources.
4. Adding Redundant Columns – The first of a series of denormalization
techniques: adding columns to a table that duplicate data stored in other tables.
5. Adding Derived Columns – Adding a column to a table based on the
values or existence of values in other columns in any table.
6. Collapsing Tables – Combining two or more tables into one table.
7. Splitting Tables – Partitioning a table into two or more disjoint tables.
Partitioning may be horizontal (row-wise) or vertical (column-wise).
8. Handling Supertypes and Subtypes – Deciding how to implement tables
that are involved in a supertype-subtype relationship in the logical model.
9. Duplicating Parts of Tables – Duplicating data vertically and / or
horizontally into new tables.
10. Adding Tables for Derived Data – Creating new tables that hold data
derived from columns of other tables.
11. Handling Vector Data – Deciding how to implement tables that contain
plural attributes or vector data. Row-wise and column-wise
implementations are discussed.
12. Generating Sequence Numbers – Choosing a strategy to generate
sequence numbers, and the appropriate tables and columns to support the
strategy.
13. Specifying Indexes – Specifying indexes to improve data access
performance or to enforce uniqueness.
14. Maintaining Row Uniqueness – Maintaining the uniqueness of primary-
key values.
15. Handling Domain Restriction – Defining SQL Server rules and defaults
on the columns of a table to maintain valid data values in columns.
16. Handling Referential Integrity – Deciding how to handle primary-key
updates and deletes, and foreign-key inserts and updates. Using triggers to
ensure referential integrity.
17. Maintaining Derived and Redundant Data – Specifying how data
integrity will be maintained if the data model contains derived or
redundant data.
18. Handling Complex Integrity Constraints – Deciding how to handle
complex business rules such as sequence rules, cross-domain business
rules, and complex data domain rules. Using triggers to implement
complex business rules.
19. Controlling Access to Data – Restricting access to commands and data.
20. Managing Object Sizes – Calculating the estimated size of a database and
its objects.
21. Recommending Object Placement – Allocating databases and their
objects on available hardware to achieve optimal performance.
Physical database design goals
improve system performance
reduce disk I/O
reduce joins
1. Defining Keys
If there is more than one candidate key in a table, select
the primary key as follows:
– select the key that transactions will most often already
know. This avoids additional lookups.
– select the shortest key, since it will be used in indexes.
– consider what other keys are available in other tables on
which to join.
– apply the criteria for primary-key selection mentioned in our tutorial.
3. Adding Redundant Columns
Required when an unacceptable number of joins is needed to
perform a critical transaction.
Add redundant columns in order to reduce the number of joins.
– This is a denormalization step: the tables will no longer
be in 3NF.
Example
publisher (pub-id, pubname, city, state)
Titles (title-id, title, type, pub-id, price, pubname, ...)
pubname is duplicated in Titles table
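As a sketch of the effect, the example above can be tried in SQLite through Python's sqlite3 module (SQLite standing in for the book's Sybase SQL Server; the sample data are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Denormalized schema: pubname is duplicated in titles so that the
# critical "title with publisher name" lookup needs no join.
cur.executescript("""
CREATE TABLE publisher (pub_id INTEGER PRIMARY KEY, pubname TEXT, city TEXT, state TEXT);
CREATE TABLE titles (title_id INTEGER PRIMARY KEY, title TEXT, type TEXT,
                     pub_id INTEGER REFERENCES publisher(pub_id),
                     price REAL, pubname TEXT);  -- redundant column
INSERT INTO publisher VALUES (1, 'Prentice Hall', 'Upper Saddle River', 'NJ');
INSERT INTO titles VALUES (10, 'Physical DB Design', 'tech', 1, 49.90, 'Prentice Hall');
""")

# A single-table read replaces a titles JOIN publisher query:
row = cur.execute("SELECT title, pubname FROM titles WHERE title_id = 10").fetchone()
print(row)  # ('Physical DB Design', 'Prentice Hall')
```

The price of the saved join is that every update of publisher.pubname must now also update the copies in titles.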
4. Adding Derived Columns
When you expect that the performance requirements for a
critical transaction will not be met because of a costly,
recurring calculation over relatively static data, adding a
derived column will help.
Derived data may include:
column data aggregated with an SQL aggregate function, such
as sum() or avg(), over N detail rows
column data calculated using formulas over N rows
counts of detail rows matching specific criteria
This is denormalization, similar to adding redundant columns,
done in order to get better performance.
a research area: materialized views
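One common way to keep a derived column current is with triggers. A minimal sketch in Python's sqlite3, using a hypothetical title_count column on publisher that counts detail rows in titles:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE titles (title_id INTEGER PRIMARY KEY, pub_id INTEGER, price REAL);
-- title_count is a derived column: a count of detail rows in titles.
CREATE TABLE publisher (pub_id INTEGER PRIMARY KEY, pubname TEXT,
                        title_count INTEGER DEFAULT 0);
-- Triggers keep the derived value consistent with the detail rows.
CREATE TRIGGER titles_ins AFTER INSERT ON titles BEGIN
  UPDATE publisher SET title_count = title_count + 1 WHERE pub_id = NEW.pub_id;
END;
CREATE TRIGGER titles_del AFTER DELETE ON titles BEGIN
  UPDATE publisher SET title_count = title_count - 1 WHERE pub_id = OLD.pub_id;
END;
INSERT INTO publisher (pub_id, pubname) VALUES (1, 'Prentice Hall');
INSERT INTO titles VALUES (10, 1, 49.90);
INSERT INTO titles VALUES (11, 1, 20.00);
""")
count = cur.execute("SELECT title_count FROM publisher WHERE pub_id = 1").fetchone()[0]
print(count)  # 2
```

Reads of the count become cheap; the cost moves to the (rarer) inserts and deletes on the detail table.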
6. Splitting Tables
Required when it is more advantageous to access a subset
of data, and no important transactions rely on a
consolidated view of the data.
Vertical table splits:
e.g. Emp (Eno, name, salary, tax, mgr#, dept#)
can be split to 2 tables:
Emp_bio (Eno, name, mgr#, dept#)
Emp_comp (Eno, salary, tax)
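The vertical split above can be sketched in Python's sqlite3 (SQLite stands in for the book's target server; the sample row is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Vertical split: biographical and compensation columns go to separate
# tables that share the primary key Eno.
cur.executescript("""
CREATE TABLE emp_bio  (eno INTEGER PRIMARY KEY, name TEXT, mgr INTEGER, dept INTEGER);
CREATE TABLE emp_comp (eno INTEGER PRIMARY KEY REFERENCES emp_bio(eno),
                       salary REAL, tax REAL);
INSERT INTO emp_bio  VALUES (1, 'Lee', 7, 42);
INSERT INTO emp_comp VALUES (1, 5000.0, 800.0);
""")
# Transactions that only need bio data touch one narrow table; a
# consolidated view is still available via a join when needed:
full = cur.execute("""
  SELECT b.eno, b.name, c.salary FROM emp_bio b JOIN emp_comp c ON b.eno = c.eno
""").fetchone()
print(full)  # (1, 'Lee', 5000.0)
```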
Horizontal table splits
e.g. You can form horizontal fragments of the Supplier table
based on values of the city column
Supplier (sno, sname, city, status)
Supplier_boston (sno, sname, status)
Benefits:
- A table is large and reducing its size reduces the no. of
index pages read in a query
- The table split corresponds to an actual physical separation
of the data rows, as in different geographical sites.
- Table splitting achieves specific distribution of data on the
available physical media
- To achieve domain key normal form.
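A horizontal split can likewise be sketched in sqlite3; here the city column becomes implicit in each fragment's name (fragment names and rows are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Horizontal split of Supplier on the city column: one fragment per site.
cur.executescript("""
CREATE TABLE supplier_boston (sno INTEGER PRIMARY KEY, sname TEXT, status INTEGER);
CREATE TABLE supplier_london (sno INTEGER PRIMARY KEY, sname TEXT, status INTEGER);
INSERT INTO supplier_boston VALUES (1, 'Smith', 20);
INSERT INTO supplier_london VALUES (2, 'Jones', 10);
""")
# A consolidated view can be reconstructed with UNION ALL when required:
rows = cur.execute("""
  SELECT sno, sname, 'boston' AS city, status FROM supplier_boston
  UNION ALL
  SELECT sno, sname, 'london', status FROM supplier_london
  ORDER BY sno
""").fetchall()
print(rows)  # [(1, 'Smith', 'boston', 20), (2, 'Jones', 'london', 10)]
```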
7. Handling Supertypes and Subtypes
Decide how to implement tables that are involved in a
supertype/subtype relationship in the logical data model
(i.e. a superclass/subclass ISA relationship).
There are 3 common physical design scenarios for
subtype-supertype relationships.
(1) Single supertype and multiple subtype tables
Supertype table:
employee (employee_num, name, salary, tax, manager_num, department_num, employee_type)
(2) Single supertype table only
employee (employee_num, name, salary, tax, manager_num, department_num, consulting_title, contracting_title, billing_rate, mentor, prof_soc_num, employee_type)
This technique is appropriate if the subtypes
have similar columns
are involved in similar relationships
are frequently accessed together
are infrequently accessed separately
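A sketch of the single-table implementation in Python's sqlite3, with employee_type as the discriminator and subtype-specific columns left NULL where they do not apply (column names follow the slides; the CHECK constraint and sample row are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Scenario (2): one table holds both the common and the subtype-specific
# columns; employee_type discriminates the subtypes.
cur.executescript("""
CREATE TABLE employee (
  employee_num      INTEGER PRIMARY KEY,
  name              TEXT NOT NULL,
  salary            REAL, tax REAL,
  manager_num       INTEGER, department_num INTEGER,
  consulting_title  TEXT, contracting_title TEXT,  -- consultant / contractor only
  billing_rate      REAL, mentor INTEGER,
  prof_soc_num      TEXT,                          -- regular staff only
  employee_type     TEXT NOT NULL
      CHECK (employee_type IN ('contractor', 'consultant', 'regular_staff'))
);
INSERT INTO employee (employee_num, name, employee_type, billing_rate, contracting_title)
VALUES (1, 'Ng', 'contractor', 120.0, 'Senior Contractor');
""")
etype = cur.execute(
    "SELECT employee_type FROM employee WHERE employee_num = 1").fetchone()[0]
print(etype)  # contractor
```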
(3) Multiple subtype tables only
contractor (employee_num, *billing_rate, *contracting_title, name, salary, tax, manager_num, department_num)
consultant (employee_num, *billing_rate, *consulting_title, *mentor, name, salary, tax, manager_num, department_num)
regular_staff (employee_num, *prof_soc_num, name, salary, tax, manager_num, department_num)
(* marks subtype-specific columns)
8. Adding Tables for Derived Data
A summary table can hold aggregated data: for example, a
total_sales column that stores the total sales for books of the
same type.
Triggers are required to update the summary table.
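A sketch of such a trigger-maintained summary table in Python's sqlite3 (table names, column names, and sales figures are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE titles (title_id INTEGER PRIMARY KEY, type TEXT, sales INTEGER);
-- Summary table: total_sales aggregates titles.sales per book type.
CREATE TABLE sales_summary (type TEXT PRIMARY KEY, total_sales INTEGER);
-- The trigger keeps the summary row in step with each new detail row.
CREATE TRIGGER titles_ai AFTER INSERT ON titles BEGIN
  INSERT OR IGNORE INTO sales_summary (type, total_sales) VALUES (NEW.type, 0);
  UPDATE sales_summary SET total_sales = total_sales + NEW.sales
   WHERE type = NEW.type;
END;
INSERT INTO titles VALUES (1, 'business', 100);
INSERT INTO titles VALUES (2, 'business', 250);
INSERT INTO titles VALUES (3, 'cooking',  40);
""")
total = cur.execute(
    "SELECT total_sales FROM sales_summary WHERE type = 'business'").fetchone()[0]
print(total)  # 350
```

A full implementation would also need delete and update triggers on the detail table.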
9. Specifying Indexes
Indexes can be used to improve data access performance, to
enforce uniqueness, or to control data distribution.
Indexes may be clustered or non-clustered,
unique or non-unique, single-column or concatenated (multi-column).
Testing and trial-and-error during production may indicate
other index choices.
A table’s indexes must be maintained with every insert, update,
and delete operation performed on the table.
Be careful not to over-index: incorrect index selection can
adversely affect performance.
The greatest problem will be deriving the best set of indexes for
the database when conflicting applications exist (i.e.
applications whose access needs and priorities are in conflict).
You may need to split up or duplicate a database into
another database in order to support equally critical but
opposing indexing strategies, particularly with respect to
the clustered index, where only one is allowed per table.
Index density = 1/ total no. of unique values
e.g. If there are 20 colors for cars then the index density for colors is
1/20 = 0.05.
e.g. The index density of the primary key of a table is
1/no_of_rows, since every value is unique.
Selectivity = Index density * total no. of rows
The more selective (lower selectivity value) this number is, the more
likely the SQL query optimizers will choose to use the index since it
can assume fewer rows will be required to answer the query.
E.g. If there are 200 unique values and 400 rows, then the selectivity value is
(1/200) × 400 = 2, indicating that on average only 2 rows should be returned
for each index value.
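The two formulas above can be captured in a few lines of Python (function names are ours):

```python
def index_density(unique_values: int) -> float:
    """Index density = 1 / total number of unique key values."""
    return 1.0 / unique_values

def selectivity(unique_values: int, total_rows: int) -> float:
    """Selectivity = index density * total rows,
    i.e. the average number of rows per key value."""
    return index_density(unique_values) * total_rows

# 20 car colours: density 1/20
print(index_density(20))        # 0.05
# 200 unique values over 400 rows: about 2 rows per key value
print(selectivity(200, 400))    # 2.0
# A primary-key index is maximally selective: exactly 1 row per value
print(selectivity(400, 400))    # 1.0
```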
Clustered indexes
an index in which the physical order of rows and the logical
(indexed) order are the same. The leaf level of a clustered
index represents the data pages themselves.
Only one clustered index is allowed for each table.
Usually, the primary key is the clustered index on the tables,
but not always.
Instead you may want to choose the attribute which is used to
specify a range in a where clause.
Clustered indexes are implemented as B-trees in SQL Server.
Insertions may cause the splitting of the leaf nodes of a B-tree.
Non-clustered indexes
an index in which the logical (indexed) order of rows differs
from their physical order. The leaf level contains pointers to
the data rows rather than the data pages themselves.
Many non-clustered indexes are allowed per table.
Identifying Columns for Indexes
columns used to specify range in the where clause
(clustered index)
columns used to join one or more tables, usually
primary and foreign keys
columns likely to be used as search arguments
columns used to match an equi-join query
columns used in aggregate functions
columns used in a group by clause
columns used in an order by clause
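A sketch of declaring such indexes, again via Python's sqlite3 (SQLite has no clustered indexes, so only the non-clustered cases are shown; the schema is the invented titles table from earlier examples):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE titles (title_id INTEGER PRIMARY KEY, pub_id INTEGER,
                     type TEXT, price REAL);
-- Foreign-key column used to join titles to publisher:
CREATE INDEX idx_titles_pub ON titles (pub_id);
-- Column used as a search argument and in GROUP BY, concatenated with
-- the aggregated column so such queries can be answered from the index:
CREATE INDEX idx_titles_type_price ON titles (type, price);
""")
names = [r[1] for r in cur.execute("PRAGMA index_list('titles')").fetchall()]
print(sorted(names))  # ['idx_titles_pub', 'idx_titles_type_price']
```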
In environments where deletes and inserts are frequent, such
as many real-time transaction-processing applications, you
may want to avoid a clustered index.