What Is Normalization
A relation is a set of attributes with values for each attribute such that:
1. Each attribute (column) value must be a single value only.
2. All values for a given attribute (column) must be of the same data type.
3. Each attribute (column) name must be unique.
4. The order of attributes (columns) is insignificant.
5. No two tuples (rows) in a relation can be identical.
6. The order of the tuples (rows) is insignificant.
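The checkable properties above (1, 2, 3 and 5) can be sketched in Python. This is only an illustration: the function name is_relation and the sample rows are assumptions, not part of these notes, and properties 4 and 6 (ordering) cannot be tested on data at all.

```python
def is_relation(attributes, rows):
    """Return True if (attributes, rows) satisfies the checkable relation properties.

    attributes: tuple of column names; rows: list of tuples (one per row).
    """
    # Property 3: attribute names must be unique.
    if len(set(attributes)) != len(attributes):
        return False
    for row in rows:
        # Every tuple supplies exactly one value per attribute.
        if len(row) != len(attributes):
            return False
        # Property 1: each value is a single value, not a collection.
        if any(isinstance(v, (list, tuple, set, dict)) for v in row):
            return False
    # Property 2: all values in a column share one data type.
    for column in zip(*rows):
        if len({type(v) for v in column}) > 1:
            return False
    # Property 5: no two tuples are identical.
    return len(set(rows)) == len(rows)


# Hypothetical sample data, invented for illustration.
attributes = ("CustomerID", "CustomerName")
valid_rows = [(1, "Ann"), (2, "Bob")]
duplicate_rows = valid_rows + [(1, "Ann")]
multi_valued_rows = [(1, ["Ann", "Bob"])]

print(is_relation(attributes, valid_rows))         # True
print(is_relation(attributes, duplicate_rows))     # False (identical tuples)
print(is_relation(attributes, multi_valued_rows))  # False (multi-valued attribute)
```

The sketch ignores missing (NULL) values, which a real database would have to handle separately.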
From our discussion of E-R Modeling, we know that an Entity typically corresponds to a relation
and that the Entity’s attributes become attributes of the relation.
We also discussed how, depending on the relationships between entities, copies of attributes
(the identifiers) were placed in related relations as foreign keys.
The next step is to identify functional dependencies within each relation.
Functional Dependencies
Key: One or more attributes that uniquely identify a tuple (row) in a relation.
The selection of keys will depend on the particular application being considered.
In most cases the key for a relation will already be specified during the conversion from the E-R
model to a set of relations.
Users can also offer some guidance as to what would make an appropriate key.
Recall that no two tuples in a relation can have exactly the same values; thus, in the worst
case, a candidate key would consist of all of the attributes in the relation.
A key functionally determines a tuple (row). So one functional dependency that can always be
written is:
The Key → All other attributes
Not all determinants are keys.
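Sample data can be used to test a proposed functional dependency. The sketch below is a minimal Python illustration; the function fd_holds and the EMPLOYEE-style rows are hypothetical and do not come from these notes. Keep in mind that sample data can only disprove a dependency. A dependency that holds in the sample may still fail for other data.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the dependency lhs -> rhs holds in the sample rows.

    rows: list of dicts; lhs, rhs: lists of attribute names.
    """
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        # The same determinant value must always map to the same dependent value.
        if seen.setdefault(left, right) != right:
            return False
    return True


# Hypothetical sample rows, invented for illustration.
rows = [
    {"EmpID": 1, "Name": "Ann", "Dept": "Sales", "DeptPhone": "555-0100"},
    {"EmpID": 2, "Name": "Bob", "Dept": "Sales", "DeptPhone": "555-0100"},
    {"EmpID": 3, "Name": "Ann", "Dept": "IT", "DeptPhone": "555-0199"},
]

# The key determines all other attributes.
print(fd_holds(rows, ["EmpID"], ["Name", "Dept", "DeptPhone"]))  # True
# Dept is a determinant, but it is not a key of this relation.
print(fd_holds(rows, ["Dept"], ["DeptPhone"]))                   # True
# Name does not determine Dept: "Ann" appears with two departments.
print(fd_holds(rows, ["Name"], ["Dept"]))                        # False
```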
Modification Anomalies
Once our E-R model has been converted into relations, we may find that some relations are not
properly specified. There can be a number of problems:
o Deletion Anomaly: Deleting one fact or data point from a relation results in other information
being lost.
o Insertion Anomaly: Inserting a new fact or tuple into a relation requires we have information from
two or more entities – this situation might not be feasible.
o Update Anomaly: Updating one fact in a relation requires us to update multiple tuples.
Anomaly Example 1:
Here is an example to illustrate these anomalies: Consider a very common CUSTOMER relation:
CUSTOMER(CustomerID, CustomerName, Street, City, State, PostalCode)
In the United States, the PostalCode (or ZipCode) references a specific City and State so one
might have data such as:
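As a hypothetical illustration (the rows below are invented and are not the original sample data), this short Python sketch shows how each of the three anomalies arises when City and State are repeated for every customer with the same PostalCode. The helper names postal_facts and can_insert are assumptions made for this example.

```python
# Hypothetical CUSTOMER tuples, invented for illustration.
customers = [
    {"CustomerID": 1, "CustomerName": "Ann", "Street": "1 Elm St",
     "City": "Springfield", "State": "IL", "PostalCode": "62701"},
    {"CustomerID": 2, "CustomerName": "Bob", "Street": "9 Oak Ave",
     "City": "Springfield", "State": "IL", "PostalCode": "62701"},
    {"CustomerID": 3, "CustomerName": "Cy", "Street": "4 Pine Rd",
     "City": "Albany", "State": "NY", "PostalCode": "12207"},
]


def postal_facts(rows):
    """Collect the PostalCode -> (City, State) facts stored in the rows."""
    return {r["PostalCode"]: (r["City"], r["State"]) for r in rows}


def can_insert(row):
    """A CUSTOMER tuple can only be stored if it has a key value."""
    return row.get("CustomerID") is not None


# Update anomaly: correcting the city for one postal code touches several tuples.
rows_to_update = [r for r in customers if r["PostalCode"] == "62701"]
print(len(rows_to_update))  # 2

# Deletion anomaly: deleting the only customer in 12207 also deletes
# the fact that 12207 belongs to Albany, NY.
remaining = [r for r in customers if r["CustomerID"] != 3]
lost = set(postal_facts(customers)) - set(postal_facts(remaining))
print(lost)  # {'12207'}

# Insertion anomaly: a new postal code cannot be recorded without a customer.
new_fact = {"City": "Peoria", "State": "IL", "PostalCode": "61602"}
print(can_insert(new_fact))  # False
```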
Here is another example to illustrate anomalies: A company has a Purchase Order form:
Our dutiful consultant creates the E-R Model directly matching the purchase order:
When we follow the steps to convert to a set of relations this results in two relations (keys are
underlined):
PO_HEADER (PO_Number, PODate, Vendor, Ship_To, ...)
1. What happens if we want to add the fact that Order O103 has quantity 5 of part P99?
2. What happens when we delete item I02 from Order O101?
3. What happens if we want to change the price of the Plate (P99)?
These problems occur because the relation in question contains data about two or more themes.
The typical way to solve these anomalies is to split the relation into two or more relations. This
is part of the process called normalization, which is defined formally next.
Normalization Process:
Relations can fall into one or more categories (or classes) called Normal Forms.
Normal Form: A class of relations free from a certain set of modification anomalies.
Normal forms are given names such as:
o First normal form (1NF)
o Second normal form (2NF)
o Third normal form (3NF)
o Boyce-Codd normal form (BCNF)
o Fourth normal form (4NF)
o Fifth normal form (5NF)
o Domain-Key normal form (DK/NF)
These forms are cumulative. A relation in Third normal form is also in 2NF and 1NF.
The Normalization Process for a given relation consists of:
1. Specify the Key of the relation
2. Specify the functional dependencies of the relation.
Sample data (tuples) for the relation can assist with this step.
3. Apply the definition of each normal form (starting with 1NF).
4. If a relation fails to meet the definition of a normal form, change the relation
(most often by splitting the relation into two new relations) until it meets the
definition.
5. Re-test the modified/new relations to ensure they meet the definitions of each
normal form.
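Steps 1 and 2 can be cross-checked against each other: once the functional dependencies are written down, a proposed key should determine every attribute of the relation. One standard way to test this is the attribute closure algorithm, which these notes do not describe; the Python sketch below is offered only as an aid. The relation R(A, B, C, D) and its dependencies are hypothetical.

```python
def closure(attrs, fds):
    """Return the set of attributes functionally determined by attrs.

    fds: list of (lhs, rhs) pairs, each a tuple of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already know every attribute on the left side,
            # we also know every attribute on the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_superkey(attrs, all_attrs, fds):
    """True if attrs determines every attribute (minimality is not checked)."""
    return closure(attrs, fds) == set(all_attrs)


# Hypothetical relation R(A, B, C, D) with  A, B -> C  and  C -> D.
all_attrs = ("A", "B", "C", "D")
fds = [(("A", "B"), ("C",)), (("C",), ("D",))]

print(is_superkey(("A", "B"), all_attrs, fds))  # True
print(is_superkey(("A",), all_attrs, fds))      # False
print(sorted(closure(("C",), fds)))             # ['C', 'D']
```

Note that is_superkey does not check that the key is minimal; a candidate key is a superkey from which no attribute can be removed.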
In the next set of notes, each of the normal forms will be defined along with an example of the
normalization steps.
Second Normal Form (2NF)
A relation is in second normal form (2NF) if all of its non-key attributes are dependent on all of
the key.
Another way to say this: A relation is in second normal form if it is free from partial-key
dependencies.
Relations that have a single attribute for a key are automatically in 2NF.
This is one reason why we often use artificial identifiers (non-composite keys) as keys.
In the example below, Close_Price is dependent on (Company, Date).
The following example relation is not in 2NF:
STOCKS (Company, Symbol, Headquarters, Date, Close_Price)
At this point we have two new relations in our relational model. The original “STOCKS” relation
we started with is removed from the model.
Sample data and functional dependencies for the two new relations:
COMPANY Relation:
STOCK_PRICES relation:
In checking these new relations we can confirm that they meet the definition of 1NF (each one
has well defined unique keys) and 2NF (no partial key dependencies).
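The 2NF test can also be sketched in Python. The dependencies used here are assumptions: the key is taken to be (Company, Date), as stated for Close_Price, and Company alone is assumed to determine Symbol and Headquarters. The function name partial_dependencies is hypothetical.

```python
def partial_dependencies(key, fds):
    """Return the FDs whose determinant is only part of the key
    and that determine at least one non-key attribute."""
    key = set(key)
    found = []
    for lhs, rhs in fds:
        non_key_rhs = set(rhs) - key
        # set(lhs) < key means "proper subset of the key".
        if set(lhs) < key and non_key_rhs:
            found.append((tuple(lhs), tuple(sorted(non_key_rhs))))
    return found


# STOCKS (Company, Symbol, Headquarters, Date, Close_Price)
# Assumed dependencies (see the note above).
stocks_key = ("Company", "Date")
stocks_fds = [
    (("Company", "Date"), ("Close_Price",)),
    (("Company",), ("Symbol", "Headquarters")),
]
violations = partial_dependencies(stocks_key, stocks_fds)
print(violations)  # [(('Company',), ('Headquarters', 'Symbol'))]

# After the split, neither new relation has a partial-key dependency.
company_check = partial_dependencies(
    ("Company",), [(("Company",), ("Symbol", "Headquarters"))])
prices_check = partial_dependencies(
    ("Company", "Date"), [(("Company", "Date"), ("Close_Price",))])
print(company_check, prices_check)  # [] []
```

The empty result for COMPANY also illustrates why a relation with a single-attribute key is automatically in 2NF: a one-attribute key has no non-empty proper subset to act as a determinant.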
Third Normal Form (3NF)
Consider one of the new relations we created in the STOCKS example for 2nd normal form:
This gives us the following sample data and FD for the new relations
Company     Symbol
Microsoft   MSFT
Oracle      ORCL

Company     Headquarters
Microsoft   Redmond, WA
Again, each of these new relations should be checked to ensure they meet the definition of 1NF,
2NF and now 3NF.
Boyce-Codd Normal Form (BCNF)
In this case, the combination FundID and InvestmentType form a candidate key because we can
use FundID,InvestmentType to uniquely identify a tuple in the relation.
Similarly, the combination FundID and Manager also form a candidate key because we can use
FundID, Manager to uniquely identify a tuple.
Manager by itself is not a candidate key because we cannot use Manager alone to uniquely
identify a tuple in the relation.
Is this relation FUNDS (FundID, InvestmentType, Manager) in 1NF, 2NF or 3NF?
Given we pick FundID, InvestmentType as the Primary Key: 1NF for sure.
2NF because the only non-key attribute (Manager) is dependent on all of the key.
3NF because there are no transitive dependencies.
However, consider what happens if we delete the tuple with FundID 22. We lose the fact that
Brown manages the InvestmentType “Growth Stocks.”
Therefore, while the FUNDS relation is in 1NF, 2NF and 3NF, it is not in BCNF because not all
determinants (Manager in FD3) are candidate keys.
The following are steps to normalize a relation into BCNF:
1. List all of the determinants.
2. See if each determinant can act as a key (candidate keys).
3. For any determinant that is not a candidate key, create a new relation from the
functional dependency. Retain the determinant in the original relation.
For our example: FUNDS (FundID, InvestmentType, Manager)
Each of the new relations should be checked to ensure they meet the definitions of 1NF, 2NF,
3NF and BCNF.
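The three BCNF steps can be sketched in Python for the FUNDS example. The dependency list is an assumption reconstructed from the discussion above: the two candidate keys each determine the remaining attribute, and FD3 is Manager → InvestmentType. The function names are hypothetical, and the closure helper is a standard technique rather than something defined in these notes.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(all_attrs, fds):
    """Steps 1 and 2: list the determinants that cannot act as a key."""
    return [(lhs, rhs) for lhs, rhs in fds
            if closure(lhs, fds) != set(all_attrs)]


def decompose(all_attrs, violation):
    """Step 3: create a new relation from the FD and keep the
    determinant (but not the dependent attributes) in the original."""
    lhs, rhs = violation
    new_relation = tuple(lhs) + tuple(rhs)
    remaining = tuple(a for a in all_attrs if a not in rhs)
    return remaining, new_relation


funds_attrs = ("FundID", "InvestmentType", "Manager")
funds_fds = [
    (("FundID", "InvestmentType"), ("Manager",)),  # assumed FD1
    (("FundID", "Manager"), ("InvestmentType",)),  # assumed FD2
    (("Manager",), ("InvestmentType",)),           # FD3
]

violations = bcnf_violations(funds_attrs, funds_fds)
print(violations)  # [(('Manager',), ('InvestmentType',))]

remaining, new_relation = decompose(funds_attrs, violations[0])
print(remaining)     # ('FundID', 'Manager')
print(new_relation)  # ('Manager', 'InvestmentType')
```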
Fourth Normal Form (4NF)
Resolution: Split into two tables with the common key:
Consider the following relation: CUSTOMER (CustomerID, Name, Address, City, State, Zip)
This relation is not in DK/NF because it contains a functional dependency not implied by the
key.
Zip → City, State
We can normalize this into DK/NF by splitting the CUSTOMER relation into two:
CUSTOMER (CustomerID, Name, Address, Zip)
We may pay a performance penalty – each customer address lookup requires we look in two
relations (tables).
More technically, obtaining a complete customer and address record requires us
to join CUSTOMER and ZIPCODES together.
In such cases, we may de-normalize the relations to achieve a performance improvement.
In other words, we re-assemble the original CUSTOMER relation we started with that will
contain all of the attributes.
De-normalization presents a trade-off between performance and modification anomalies / data
redundancy.
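The trade-off can be sketched in Python. The rows are hypothetical, and the attribute list of ZIPCODES (Zip, City, State) is an assumption based on the dependency Zip → City, State. The function full_record is likewise invented for this illustration.

```python
# Normalized form: CUSTOMER holds only the Zip; ZIPCODES holds City and State.
customer = [
    {"CustomerID": 1, "Name": "Ann", "Address": "1 Elm St", "Zip": "62701"},
    {"CustomerID": 2, "Name": "Bob", "Address": "9 Oak Ave", "Zip": "62701"},
]
zipcodes = {"62701": {"City": "Springfield", "State": "IL"}}


def full_record(customer_row, zip_lookup):
    """Join one CUSTOMER tuple with its matching ZIPCODES tuple."""
    return {**customer_row, **zip_lookup[customer_row["Zip"]]}


# De-normalized form: the join is done once and the result is stored,
# so later lookups read a single relation.
denormalized = [full_record(c, zipcodes) for c in customer]
print(denormalized[0]["City"])  # Springfield

# The price: the city is now stored once per customer instead of once per Zip.
copies = sum(1 for r in denormalized if r["City"] == "Springfield")
print(copies, len(zipcodes))  # 2 1
```

With the normalized relations, a change to the city name is made in one place. With the de-normalized relation, every matching customer tuple must be updated, which is the update anomaly again.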
All-in-One Database Normalization Example
Many of you asked for a “complete” example that would run through all of the normal forms
from beginning to end using the same tables. This is tough to do, but here is an attempt:
Example relation:
Example Data:
Assume the key is Name, Project, Task.
Is EMPLOYEE in 1NF ?
Second Normal Form
List all of the functional dependencies for EMPLOYEE.
Are all of the non-key attributes dependent on all of the key?
It seems if we know the employee’s name, we can figure out their office, floor and phone.
Split into two relations EMPLOYEE_PROJECT_TASK and EMPLOYEE_OFFICE_PHONE.
EMPLOYEE_PROJECT_TASK (Name, Project, Task)
Office Phone
400 1400
442 1442
588 1588
Boyce-Codd Normal Form
Name Project
Bill 100X
Bill 200Y
Sue 100X
Sue 200Y
Sue 300Z
Ed 100X
Name Task
Bill T1
Bill T2
Sue T33
Ed T2
EMPLOYEE_OFFICE (Name, Office, Floor)
Office Phone
400 1400
442 1442
588 1588
Relation Name
CUSTOMER (CustomerID, Name, Street, City, State, Zip, Phone)
Example Data
Check both CUSTOMER and ZIPCODE to ensure they are both in 1NF up to BCNF.
Here are a bunch of fun normalization exercises you can try! Answers are on the last page.
1. Choose a key and write the dependencies for the following GRADES relation:
GRADES (Student_ID, CourseNumber, SemesterNumber, Grade)
2. Choose a key and write the dependencies for the PO_LINE_ITEMS relation:
Keep in mind ItemNum is an integer that counts up from 1 for each purchase order.
PO_LINE_ITEMS (PO_Number, ItemNum, PartNum, Description, Price, Qty)
3. What normal form is the above PO_LINE_ITEMS relation in (before doing any normalizing)?
4. What normal form is the following relation in (key is underlined):
STORE_ITEM (SKU, PromotionID, Vendor, Style, Price)
5. Normalize the above STORE_ITEM relation into the next higher normal form:
6. Choose a key and write the dependencies for the following SOFTWARE relation (assume all
of the vendor’s products have the same warranty):
SOFTWARE (SoftwareVendor, Product, Release, SystemReq, Price, Warranty)
FD1: SoftwareVendor, Product, Release → SystemReq, Price, Warranty
FD1: H, I → J, K, L
FD2: J→M
FD3: K→N
FD4: L→O
D, O → N, T, C, R, Y
C, R → D
D→N
List the functional dependencies and Normalize this relation into BCNF.
Normalization Exercises – Answers
(d) SHIPPORTS is not in BCNF since it has VoyageID as a determinant but VoyageID is not a
candidate key.