Unit 6 - Normalization
Unit 6 - Normalization
Unit 6 - Normalization
NORMALIZATION
At the end of this unit, the student should be able to:
Basically, to create a relational database schema, the database designer groups attributes into relations
by using his common sense or by following the guidelines of mapping a conceptual data model (such
as the ER or Enhanced-ER (EER) data model) to a relational data model. However, there is still a
need to formally analyze why one grouping of attributes into a relation schema may be better (in
terms of goodness and appropriateness) than another. In this unit, we discuss some of the theories that
have been developed with the goal of evaluating relational schemas for design quality. We can assess
formally why one set of groupings of attributes into relation schemas is better than another.
We recall that a relational database model consists of elements called relations. Each relation is a set
of attributes with values for each attribute such that:
Most practical design projects acquire existing designs of databases from previous designs, designs in
legacy models, or from existing files. Normalization is carried out in practice so that the resulting
designs are of high quality and meet the desirable properties.
The normalization process (introduced by Codd, 1972) basically takes a relation schema through a
series of tests to certify whether it satisfies a certain normal form. The normal form of a relation
refers to the highest normal form condition that it meets, and hence indicates the degree to which it
has been normalized.
Initially, Codd proposed three normal forms, which he called first (1NF), second (2NF), and third
(3NF) normal form. A stronger definition of 3NF—called Boyce-Codd normal form (BCNF)—was
proposed later by Boyce and Codd. All these normal forms are based on a single analytical tool: the
functional dependencies among the attributes of a relation. Later, a fourth normal form (4NF) and a
fifth normal form (5NF) were introduced, based on the concepts of multivalued dependencies and
join dependencies, respectively.
However, database designers need not normalize to the highest possible normal form. Relations may
be left in a lower normalization status, such as 2NF, for performance reasons. Thus there is a process
called denormalization which stores the join of higher normal form relations as a base relation,
which is in a lower normal form.
These are some of the informal guidelines that may be used as measures to determine the quality of relation
schema design.
A. Make sure that the semantics of the attributes is clear in the schema.
The semantics of a relation refers to its meaning resulting from the interpretation of attribute values
in a tuple. In grouping attributes to form a relation schema, assume that attributes belonging to one
relation have certain real-world meaning and a proper interpretation associated with them.
Guideline 1: Design a relation schema so that it is easy to explain its meaning. Do not combine
attributes from multiple entity types and relationship types into a single relation. Intuitively, if a
relation schema corresponds to one entity type or one relationship type, it is straightforward to
interpret and to explain its meaning. Otherwise, if the relation corresponds to a mixture of multiple
entities and relationships, semantic ambiguities will result and the relation cannot be easily explained.
Figure 6-1: A simplified company relational database
schema.
One goal of schema design is to minimize the storage space used by the base relations. Grouping
attributes into relation schemas has a significant effect on storage space. Storing natural joins of base
relations leads to an additional problem referred to as update anomalies. These anomalies happen
during insertion, deletion, and modification of relations.
Examples of anomalies that may happen using the schema in Figure 6-2:
• To insert a new tuple for an employee who works in department number 5, we must enter all
the attribute values of department 5 correctly so that they are consistent with the
corresponding values for department 5 in other tuples in EMP_DEPT. In the design in
Figure 6-1, we do not have to worry about this consistency problem because we enter only
the department number in the employee tuple; all other attribute values of department 5 are
recorded only once in the database, as a single tuple in the DEPARTMENT relation.
• If we delete from EMP_DEPT an employee tuple that happens to represent the last
employee working for a particular department, the information concerning that department
is lost from the database. This problem does not occur in the database in Figure 6-1 because
DEPARTMENT tuples are stored separately.
• In EMP_DEPT, if we change the value of one of the attributes of a particular department—
say, the manager of department 5—we must update the tuples of all employees who work in
that department; otherwise, the database will become inconsistent. If we fail to update some
tuples, the same department will be shown to have two different values for manager in
different employee tuples, which would be wrong.
Guideline 2: Design the base
relation schemas so that no
insertion, deletion, or
modification anomalies are
present in the relations. If any
anomalies are present, note
them clearly and make sure
that the programs that update
the database will operate
correctly.
Figure 6-2: Two relation schemas suffering from update anomalies. (a) EMP_DEPT and
(b) EMP_PROJ.
If many of the attributes do not apply to all tuples in the relation, we end up with many NULLs in
those tuples. This can waste space at the storage level and may also lead to problems with
understanding the meaning of the attributes and with specifying JOIN operations at the logical level.
Another problem with NULLs is how to account for them when aggregate operations such as
COUNT or SUM are applied. SELECT and JOIN operations involve comparisons; if NULL values
are present, the results may become unpredictable.
Take note that NULL values have multiple interpretations: a) the attribute does not apply to the tuple,
b) the attribute value for the tuple is unknown, c) the value is known but absent- that is, it has not been
recorded yet.
Guideline 3: Avoid placing attributes in a base relation whose values may frequently be NULL. If
NULLs are unavoidable, make sure that they apply in exceptional cases only and do not apply to a
majority of tuples in the relation.
For example: If only 15 percent of employees have individual offices, there is little justification for
including an attribute Office_number in the EMPLOYEE relation; rather, a relation
EMP_OFFICES(Essn, Office_number) can be created to include tuples for only the employees with
individual offices.
Guideline 4: Design relation schemas so that they can be joined with equality conditions on attributes
that are appropriately related (primary key, foreign key) pairs in a way that guarantees that no
spurious tuples are generated. Avoid relations that contain matching attributes that are not (foreign
key, primary key) combinations because joining on such attributes may produce spurious tuples.
6.3 FORMAL DESIGN G UID ELINE S
This is the most important concept in relational schema design theory. A functional dependency is a
constraint between two sets of attributes from the database. It describes a relationship
between ATTRIBUTES within a single relation. An attribute is FUNCTIONALLY
DEPENDENT on another if we can use the value of one attribute to determine the value of
another.
Suppose that our relational database schema has n attributes A1, A2, ..., An; let us think of the whole
database as being described by a single universal relation schema R = {A1, A2, ... , An}. Functional
dependency is denoted by X Y, between two sets of attributes X and Y that are subsets of R
specifies a constraint on the possible tuples that can form a relation state r of R. The constraint is that,
for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also have t1[Y] = t2[Y].
Consider the relation schema EMP_PROJ in Figure 6-2(b); from the semantics of the attributes and
the relation, we know that the following functional dependencies should hold:
a. Ssn Ename
b. Pnumber {Pname, Plocation}
c. {Ssn, Pnumber} Hours
These functional dependencies specify that (a) the value of an employee’s Social Security number
(Ssn) uniquely determines the employee name (Ename), (b) the value of a project’s number
(Pnumber) uniquely determines the project name (Pname) and location (Plocation), and (c) a
combination of Ssn and Pnumber values uniquely determines the number of hours the employee
currently works on the project per week (Hours). Alternatively,we say that Ename is functionally
determined by (or functionally dependent on) Ssn, or given a value of Ssn, we know the value of Ename,
and so on.
NORMAL FORMS
Normalization of data can be considered a process of analyzing the given relation schemas based on
their FDs and primary keys to achieve the desirable properties of (1) minimizing redundancy and
(2) minimizing the insertion, deletion, and update anomalies. The normalization procedure provides
database designers with the following:
• A formal framework for analyzing relation schemas based on their keys and on the
functional dependencies among their attributes.
• A series of normal form tests that can be carried out on individual relation schemas so that
the relational database can be normalized to any desired degree.
Although several higher normal forms have been defined, such as the 4NF and 5NF, the practical
utility of these normal forms becomes questionable when the constraints on which they are based are
rare, and hard to understand or to detect by the database designers and users who must discover
these constraints. Thus, we give particular attention to normalization only up to 3NF.
Example 1:
a.
b.
Figure 6-3: (a) An un-normalized TABLE_PRODUCT and (b) its 1st normal form - result of splitting
it into two tables.
Example 2:
Figure 6-4: An un-normalized table Student and its 1st Normal Form.
Example 3:
Transitive functional dependency exists when we have the following relationships in the table: A is
functionally dependent on B, and B is functionally dependent on C. In this case, C is transitively
dependent on A via B.
Example 1: