Database Normalization
A Tutorial
by Fred Coulson
Table of Contents
Introduction
The Problem: Keeping Track of a Stack of Invoices
Introduction
This is meant to be a very brief tutorial aimed at beginners who want to get a
conceptual grasp on the database normalization process. I find it very difficult to
visualize these concepts using words alone, so I shall rely as much as possible
upon pictures and diagrams.
To demonstrate the main principles involved, we will take the classic example of
an Invoice and level it to the Third Normal Form. We will also construct an Entity
Relationship Diagram (ERD) of the database as we go.
Important Note: This is not a description of how you would actually design and
implement a database. The sample database screenshots are not meant to be
taken literally, but merely as visual aids to show how the raw data gets shuffled
about as the table structure becomes increasingly normalized.
Purists and academics may not be interested in this treatment. I will not cover
issues such as the benefits and drawbacks of normalization. For those who wish
to pursue the matter in greater depth, a list of references for further reading is
provided at the end.
For the most part, the first three normal forms are common sense. When people
sit down to design a database, they often already have a partially-normalized
structure in mind—normalization is a natural way of perceiving relationships
between data and no special skill in mathematics or set theory is required.
In fact, it usually takes quite a bit of work to de-normalize a database (that is,
remove the natural efficient relationships that a normalized data structure
provides). Denormalization is a fairly common task, but it is beyond the scope of
this presentation.
To begin: First, memorize the 3 normal forms so that you can recite them in your
sleep. The meaning will become clear as we go. Just memorize them for now:
The Problem:
Keeping Track of a Stack of Invoices
Consider a typical invoice (Figure A).
Figure A: Invoice
Those of us who have an ordered mind but aren't quite aware of relational
databases might try to capture the Invoice data in a spreadsheet, such as
Microsoft Excel.
Figure A-1: orders spreadsheet
This isn't a bad approach, since it records every purchase made by every
customer. But what if you started to ask complicated questions, such as:
First Normal Form:
No Repeating Elements or Groups of Elements
So, First Normal Form (NF1) wants us to get rid of repeating elements. What
are those?
Again we turn our attention to the first invoice (#125) in Figure A-1. Cells H2, H3,
and H4 contain a list of Item ID numbers. This is a column within our first
database row. Similarly, I2-I4 constitute a single column; same with J2-J4, K2-
K4, L2-L4, and M2-M4. Database columns are sometimes referred to as
attributes (rows/columns are the same as tuples/attributes).
You will notice that each of these columns contains a list of values. It is precisely
these lists that NF1 objects to: NF1 abhors lists or arrays within a single
database column. NF1 craves atomicity: the indivisibility of an attribute into
similar parts.
• H2 through M2
• H3 through M3
• H4 through M4
Similar (though not necessarily identical) data repeats within Invoice #125's row.
We can satisfy NF1's need for atomicity quite simply: by separating each item in
these lists into its own row.
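The flattening step can be sketched in a few lines of Python. This is only an illustration: the invoice values below are made up, and the helper name flatten is our own, not something from the spreadsheet or from any library.

```python
# Hypothetical invoice row: the item columns hold lists, as in Figure A-1.
# All values are invented for illustration.
invoice = {
    "order_id": 125,
    "order_date": "2002-09-13",
    "customer_id": 56,
    "item_id": [563, 851, 652],
    "item_qty": [4, 32, 5],
    "item_price": [3.5, 0.25, 12.0],
}


def flatten(invoice, list_columns):
    """Return one row per list position; scalar columns are repeated."""
    lengths = {len(invoice[c]) for c in list_columns}
    if len(lengths) != 1:
        raise ValueError("list-valued columns must all have the same length")
    n = lengths.pop()
    scalars = {k: v for k, v in invoice.items() if k not in list_columns}
    rows = []
    for i in range(n):
        row = dict(scalars)              # copy order-level values
        for c in list_columns:
            row[c] = invoice[c][i]       # take the i-th entry of each list
        rows.append(row)
    return rows


flat_rows = flatten(invoice, ["item_id", "item_qty", "item_price"])
for row in flat_rows:
    print(row)
```

Each resulting row holds exactly one value per column, at the cost of repeating the order-level values on every row.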
Figure A-2: flattened orders spreadsheet
Don't worry. The kind of duplication that we introduce at this stage will be
addressed when we get to the Third Normal Form. Please be patient; this is a
necessary step in the process.
We have actually only told half the story of NF1. Strictly speaking, NF1
addresses two issues:
We have already dealt with atomicity. But to make the point about Primary Keys,
we shall bid farewell to the spreadsheet and move our data into a relational
database management system (RDBMS). Here we shall use Microsoft Access to
create the orders table, as in Figure B:
This looks pretty much the same as the spreadsheet, but the difference is that
within an RDBMS we can identify a primary key. A primary key is a column (or
group of columns) that uniquely identifies each row.
As you can see from Figure B, there is no single column that uniquely identifies
each row. However, if we put a number of columns together, we can satisfy this
requirement.
The two columns that together uniquely identify each row are order_id and
item_id: no two rows have the same combination of order_id and item_id.
Therefore, together they qualify to be used as the table's primary key. Even
though they are in two different table columns, they are treated as a single
entity. We call them concatenated.
A value that uniquely identifies a row is called a primary key. When this value
is made up of two or more columns, it is referred to as a concatenated primary
key.
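To see a concatenated primary key at work, here is a minimal sketch. It uses SQLite through Python's sqlite3 module rather than Microsoft Access, purely for convenience; the column list is abbreviated and the data values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The primary key is the pair (order_id, item_id), not either column alone.
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER NOT NULL,
        item_id  INTEGER NOT NULL,
        item_qty INTEGER,
        PRIMARY KEY (order_id, item_id)
    )
""")

# order_id 125 repeats and item_id 563 repeats, but no pair repeats.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(125, 563, 4), (125, 851, 32), (126, 563, 500)],
)

# A second row with the same (order_id, item_id) pair is refused.
try:
    conn.execute("INSERT INTO orders VALUES (125, 563, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

The database accepts repeated values in either column on its own; only a repeated combination violates the key.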
The underlying structure of the orders table can be represented as Figure C:
Figure C: orders table structure
What's next?
Second Normal Form:
No Partial Dependencies on a Concatenated Key
Still not clear? To try and understand this, let's take apart the orders table
column by column. For each column we will ask the question,
Can this column exist without one or the other part of the
concatenated primary key?
If the answer is "yes" — even once — then the table fails Second Normal Form.
Refer to Figure C again to remind us of the orders table structure.
Figure C: orders table structure
The short answer is yes: order_date relies on order_id, not item_id. Some of you
might object, thinking that this means you could have a dated order with no items
(an empty invoice, in effect). But this is not what we are saying at all: All we are
trying to establish here is whether a particular order on a particular date relies on
a particular item. Clearly, it does not. The problem of how to prevent empty
orders falls under a discussion of "business rules" and could be resolved using
check constraints or application logic; it is not an issue for Normalization to solve.
So voilà, our table has already failed Second Normal Form. But let's continue
with testing the other columns. We have to find all the columns that fail the test,
and then we do something special with them.
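The column-by-column test can be roughed out in code. The sketch below checks, against a handful of invented sample rows, whether a column's value is fixed by just one part of the key. The function name depends_only_on is our own. Keep in mind that sample data can only disprove a dependency, never prove one; the real test is about what the columns mean.

```python
def depends_only_on(rows, determinant, column):
    """True if, in these rows, each determinant value maps to a single column value."""
    seen = {}
    for row in rows:
        key = tuple(row[d] for d in determinant)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, row[column]) != row[column]:
            return False
    return True


# Invented rows from the flattened orders table (primary key: order_id + item_id).
rows = [
    {"order_id": 125, "item_id": 563, "order_date": "2002-09-13", "item_qty": 4},
    {"order_id": 125, "item_id": 851, "order_date": "2002-09-13", "item_qty": 32},
    {"order_id": 126, "item_id": 563, "order_date": "2002-09-14", "item_qty": 500},
]

# order_date is fixed by order_id alone: a partial dependency, so NF2 fails.
date_partial = depends_only_on(rows, ["order_id"], "order_date")

# item_qty is fixed by neither part alone; it needs the whole key.
qty_by_order = depends_only_on(rows, ["order_id"], "item_qty")
qty_by_item = depends_only_on(rows, ["item_id"], "item_qty")

print(date_partial, qty_by_order, qty_by_item)
```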
customer_id is the ID number of the customer who placed the order. Does it rely
on order_id? No: a customer can exist without placing any orders. Does it rely
on item_id? No: for the same reason. This is interesting: customer_id (along with
the rest of the customer_* columns) does not rely on either member of the
primary key. What do we do with these columns?
We don't have to worry about them until we get to Third Normal Form. We mark
them as "unknown" for now.
item_description is the next column that is not itself part of the primary key. This
is the plain-language description of the inventory item. Obviously it relies on
item_id. But can it exist without an order_id?
Yes! An inventory item (together with its "description") could sit on a warehouse
shelf forever, and never be purchased... It can exist independent of an order.
item_description fails the test.
item_total_price is a derived value: it can be computed from item_qty and
item_price. In fact, this field does not belong in our database at all. It can easily be
reconstructed outside of the database proper; to include it would be redundant
(and could quite possibly introduce corruption). Therefore we will discard it and
speak of it no more.
order_total_price, the sum of all the item_total_price fields for a particular order,
is another derived value. We discard this field too.
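Discarding a derived field loses nothing, because it can be recomputed on demand. The following sketch assumes that item_total_price is item_qty multiplied by item_price, and uses SQLite with invented data to rebuild both totals in a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id   INTEGER,
        item_id    INTEGER,
        item_qty   INTEGER,
        item_price REAL,
        PRIMARY KEY (order_id, item_id)
    );
    INSERT INTO orders VALUES
        (125, 563, 4, 3.5),
        (125, 851, 32, 0.25),
        (126, 563, 500, 3.5);
""")

# item_total_price, rebuilt per line instead of being stored
line_totals = conn.execute("""
    SELECT order_id, item_id, item_qty * item_price AS item_total_price
    FROM orders
    ORDER BY order_id, item_id
""").fetchall()

# order_total_price, rebuilt per order instead of being stored
order_totals = conn.execute("""
    SELECT order_id, SUM(item_qty * item_price) AS order_total_price
    FROM orders
    GROUP BY order_id
    ORDER BY order_id
""").fetchall()

print(line_totals)
print(order_totals)
```

Because the totals are never stored, they cannot drift out of step with the quantities and prices they come from.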
Here is the markup from our NF2 analysis of the orders table:
Figure C (revised):
There are several things to notice:
1. We have brought a copy of the order_id column over into the order_items
table. This allows each order_item to "remember" which order it is a part of.
2. The orders table has fewer rows than it did before.
3. The orders table no longer has a concatenated primary key. The primary
key now consists of a single column, order_id.
4. The order_items table does have a concatenated primary key.
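The four points above can be checked with a small experiment. This sketch starts from a flattened table (called flat_orders here, a name of our own), then fills the two new tables from it. SQLite and the data values are stand-ins for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The flattened table we started with (hypothetical data).
    CREATE TABLE flat_orders (
        order_id INTEGER, order_date TEXT, customer_id INTEGER,
        item_id INTEGER, item_description TEXT, item_qty INTEGER, item_price REAL,
        PRIMARY KEY (order_id, item_id)
    );
    INSERT INTO flat_orders VALUES
        (125, '2002-09-13', 56, 563, 'Blue widget', 4, 3.5),
        (125, '2002-09-13', 56, 851, 'Red widget', 32, 0.25),
        (126, '2002-09-14', 2, 563, 'Blue widget', 500, 3.5);

    -- orders: single-column primary key.
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, order_date TEXT, customer_id INTEGER
    );
    -- order_items: keeps a copy of order_id; concatenated primary key.
    CREATE TABLE order_items (
        order_id INTEGER, item_id INTEGER,
        item_description TEXT, item_qty INTEGER, item_price REAL,
        PRIMARY KEY (order_id, item_id)
    );

    INSERT INTO orders
        SELECT DISTINCT order_id, order_date, customer_id FROM flat_orders;
    INSERT INTO order_items
        SELECT order_id, item_id, item_description, item_qty, item_price
        FROM flat_orders;
""")

n_flat = conn.execute("SELECT COUNT(*) FROM flat_orders").fetchone()[0]
n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
n_order_items = conn.execute("SELECT COUNT(*) FROM order_items").fetchone()[0]
print(n_flat, n_orders, n_order_items)
```

The orders table ends up with one row per order, while order_items keeps one row per invoice line.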
If you are new to Entity Relationship Diagrams, pay close attention to the line that
connects these two tables. This line means, in English,
Second Normal Form: Phase II
Remember, NF2 only applies to tables with a concatenated primary key. Now
that orders has a single-column primary key, it has passed Second Normal
Form. Congratulations!
Can this column exist without one or the other part of the
concatenated primary key?
item_price relies on the item_id but not on the order_id, so it does violate
Second Normal Form.
We should be getting good at this now. Here is the marked-up table diagram:
Figure F (revised):
So, we take the fields that fail NF2 and create a new table. We call this new table
items:
But wait, something's wrong. When we did our first pass through the NF2 test, we
took out all the fields that relied on item_id and put them into the new table. This
time, we are only taking the fields that failed the test: in other words, item_qty
stays where it is. Why? What's different this time?
The difference is that in the first pass, we removed the item_id key from the
orders table altogether, because of the one-to-many relationship between orders
and order-items. Therefore the item_qty field had to follow item_id into the new
table.
In the second pass, item_id was not removed from the order_items table
because of the many-to-one relationship between order-items and items.
Therefore, since item_qty does not violate NF2 this time, it is permitted to stay in
the table with the two primary key parts that it relies on.
This should be clearer with a new ERD. Here is how the items table fits into the
overall database schema:
Figure H:
The line that connects the items and order_items tables means the following:
• Each item can be associated with any number of lines on any number of
invoices, including zero;
• each order-item is associated with one item, and only one.
Each order can have many items; each item can belong to
many orders.
Notice that this time, we did not bring a copy of the order_id column into the new
table. This is because individual items do not need to have knowledge of the
orders they are part of. The order_items table takes care of remembering this
relationship via the order_id and item_id columns. Taken together these columns
comprise the primary key of order_items, but taken separately they are foreign
keys or pointers to rows in other tables. More about foreign keys when we get to
Third Normal Form.
Notice, too, that our new table does not have a concatenated primary key, so it
passes NF2. At this point, we have succeeded in attaining Second Normal Form!
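Here is a rough sketch of the items and order_items tables after this second pass, again using SQLite with invented data. The third item is deliberately never ordered, to show that an item can be associated with zero invoice lines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        item_id INTEGER PRIMARY KEY,
        item_description TEXT,
        item_price REAL
    );
    CREATE TABLE order_items (
        order_id INTEGER,
        item_id  INTEGER,
        item_qty INTEGER,
        PRIMARY KEY (order_id, item_id)
    );
    INSERT INTO items VALUES
        (563, 'Blue widget', 3.5),
        (851, 'Red widget', 0.25),
        (999, 'Never-ordered widget', 2.0);
    INSERT INTO order_items VALUES
        (125, 563, 4),
        (125, 851, 32),
        (126, 563, 500);
""")

# Count the invoice lines that point at each item; the LEFT JOIN keeps
# items that appear on no invoice at all.
lines_per_item = conn.execute("""
    SELECT i.item_id, COUNT(oi.order_id)
    FROM items AS i
    LEFT JOIN order_items AS oi ON oi.item_id = i.item_id
    GROUP BY i.item_id
    ORDER BY i.item_id
""").fetchall()

print(lines_per_item)
```

Note that item_description and item_price are now stored once per item, however many invoice lines refer to it.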
Third Normal Form:
No Dependencies on Non-Key Attributes
To better understand this concept, consider the order_date column. Can it exist
independent of the order_id column? No: an "order date" is meaningless without
an order. order_date is said to depend on a key attribute (order_id is the "key
attribute" because it is the primary key of the table).
What about customer_name — can it exist on its own, outside of the orders
table?
These fields belong in their own table, with customer_id as the primary key (see
Figure I).
Figure I:
However, you will notice in Figure I that we have severed the relationship
between the orders table and the Customer data that used to inhabit it.
We have to restore the relationship by creating an entity called a foreign key
(indicated in our diagram by (FK)) in the orders table. A foreign key is essentially
a column that points to the primary key in another table. Figure J describes this
relationship, and shows our completed ERD:
The relationship that has been established between the orders and customers
tables may be expressed in this way:
You will notice that the order_id and item_id columns in order_items perform a
dual purpose: not only do they function as the (concatenated) primary key for
order_items, they also individually serve as foreign keys to the orders table and
items table respectively.
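A foreign key is more than a naming convention: the database can enforce it. The sketch below declares the four tables in SQLite, which is our assumption rather than the tool used in the figures, and shows that an order pointing at a nonexistent customer is refused. The data is invented. Note that SQLite only enforces foreign keys once the foreign_keys pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        customer_name TEXT
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        order_date TEXT,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
    );
    CREATE TABLE items (
        item_id INTEGER PRIMARY KEY,
        item_description TEXT,
        item_price REAL
    );
    CREATE TABLE order_items (
        order_id INTEGER NOT NULL REFERENCES orders(order_id),
        item_id  INTEGER NOT NULL REFERENCES items(item_id),
        item_qty INTEGER,
        PRIMARY KEY (order_id, item_id)
    );
    INSERT INTO customers VALUES (56, 'Hypothetical Customer');
    INSERT INTO orders VALUES (125, '2002-09-13', 56);
""")

# Customer 999 does not exist, so this order has nothing to point to.
try:
    conn.execute("INSERT INTO orders VALUES (126, '2002-09-14', 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```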
Figure J.1 documents this fact, and shows our completed ERD:
Figure J: Final ERD
And finally, here is what the data in each of the four tables looks like. Notice that
NF3 removed columns from a table, rather than rows.
Figure K:
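Splitting the data across four tables does not lose anything. As a sketch, with SQLite and invented data once more, a single query that follows the foreign keys puts the flattened view back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT,
                         customer_id INTEGER);
    CREATE TABLE items (item_id INTEGER PRIMARY KEY, item_description TEXT,
                        item_price REAL);
    CREATE TABLE order_items (order_id INTEGER, item_id INTEGER, item_qty INTEGER,
                              PRIMARY KEY (order_id, item_id));

    INSERT INTO customers VALUES (56, 'Customer A'), (2, 'Customer B');
    INSERT INTO orders VALUES (125, '2002-09-13', 56), (126, '2002-09-14', 2);
    INSERT INTO items VALUES (563, 'Blue widget', 3.5), (851, 'Red widget', 0.25);
    INSERT INTO order_items VALUES (125, 563, 4), (125, 851, 32), (126, 563, 500);
""")

# Follow each foreign key back to its table to rebuild one row per invoice line.
flat = conn.execute("""
    SELECT o.order_id, o.order_date, c.customer_name,
           i.item_id, i.item_description, oi.item_qty, i.item_price
    FROM order_items AS oi
    JOIN orders    AS o ON o.order_id    = oi.order_id
    JOIN customers AS c ON c.customer_id = o.customer_id
    JOIN items     AS i ON i.item_id     = oi.item_id
    ORDER BY o.order_id, i.item_id
""").fetchall()

for row in flat:
    print(row)
```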
References for Further Reading