DATABASE NOTES Database Normalization
The ER model is used for transactional systems primarily because it minimizes data
redundancy and ensures data integrity. The approach to reduce redundancy is called
normalization, which is a formal data modeling approach to examining and validating a
model and putting it into a particular normalized form. The ER model is often called a third
normal form (3NF) or a normalized model. The advantage of this process is that each
attribute that belongs to an entity is going to be assigned a unique position within the data
model. The goal is to have minimal or no redundancy throughout the data model, and
normalization helps achieve this goal.
The downside of normalization is that it can adversely affect performance and deadlines if
strictly enforced. ERP systems can have tens of thousands of tables. This, of course, has an
impact on performance anyway based on the number of joins and tables that have to be
updated. Adding the effort of normalizing a large model can take an inordinate amount of
time to design and to code.
Normalization Levels
There are precise definitions and approaches toward normalizing a database. Edgar Frank
Codd, who invented the relational model for database management, identified the normal
forms as different states of the normalized relational data model. Those levels are defined as:
• First normal form (1NF), which has no repeating groups within it.
• Second normal form (2NF), which has no partial-key dependencies.
• Third normal form (3NF), which has no non-key interdependencies.
• Fourth normal form (4NF), which has no independent multiple relationships.
• Fifth normal form (5NF), which has no semantically-related multiple relationships.
Like cardinality, normalization is an obscure concept that requires close examination to really
understand the differences between the forms. Fortunately, this classic yet challenging modeling
technique can be applied step by step to reach whatever level of normalization is required. The
normalization rules are cumulative from 1NF to 5NF: each form assumes that the previous forms
are already satisfied.
Most operational systems or transactional systems are implemented in the third normal form.
3NF is considered the best practice in that it enables transactional integrity while balancing
complexity and performance. Although there are higher levels of normalization, they are
rarely used in transactional systems and never used in BI applications.
Normalizing an Entity
There are three steps to develop a third normal form database. They are shown at a high level
below, and then in further detail in the subsequent sections.
1. 1NF: Eliminate repeating groups. Make a separate table for each set of related attributes (in
essence, this is creating an entity). Identify a primary key for each table. If you cannot
define a primary key, then you have not split the tables into sets of related attributes that
each form an entity, and you need to repeat this step.
2. 2NF: Eliminate redundant data. If an attribute depends on only part of a composite primary
key, place it into a separate entity.
3. 3NF: Eliminate non-key interdependencies. Once you have defined the primary key, all the
attributes in that entity need to be related to that key. For example, in a customer or product
entity, you can only have attributes that are related to the customer or the product.
Otherwise, remove them and put them into a separate table, as they are most likely separate
entities. With these steps completed, you have defined a 3NF schema.
The first step to ultimately normalize a database to third normal form is to eliminate repeating
groups, creating a first normal form.
Refer to Figure 8.14 and ask the following question for each attribute: does this attribute
occur more than once for any instance?
If it does not, then you have no repeating groups, and you can move on.
The second step is to eliminate redundant data and get into second normal form (see Figure
8.16). If attributes are not dependent on the whole primary key, place them into a separate entity.
For only those entities that have a primary key that is a composite key, ask the following of
each non-key attribute: is this attribute dependent on part of the primary key?
As Figure 8.16 shows, a stepwise refinement further decomposes the entities, so they can
stand on their own.
The final step is to eliminate the columns not dependent on the key, which creates the third
normal form as shown in Figure 8.17. For each non-key attribute, ask the following: does this
attribute depend on some other non-key attribute?
After the three normalization steps, you reach the third normal form data model, which is
generally the level of normalization that is needed to support enterprise applications.
Database Normalization - 1NF, 2NF, 3NF, BCNF, 4NF and 5NF with examples
Normalization is a process of identifying the optimal grouping of attributes (the groupings end
up as relations) that satisfies the data requirements of an organization. It is a database design
technique we use after completing ER modeling. This process identifies relationships between
attributes (called functional dependencies) and applies a series of tests described as normal forms.
There are many articles written on this, and you can find examples for almost all normal
forms. However, many articles explain the theory with the same scenario, so I thought I would
write a post with a different set of examples that I use for my lectures.
As explained above, the purpose of normalization is to identify the best grouping for
attributes that ultimately forms relations. Some of the characteristics of relations formed are;
Functional Dependency
Let's try to understand functional dependency first. It describes the relationship
between attributes in a relation. For example, if EmployeeCode and FirstName are
attributes of the Employee relation, we can say that FirstName is functionally dependent on
EmployeeCode. This means each EmployeeCode is associated with exactly one value of
FirstName. We denote this as: EmployeeCode -> FirstName
Why do we need to identify functional dependencies? One reason is to identify the
best candidate for the primary key. Once the functional dependencies are identified, we can
analyze them all and select the most suitable determinant as the primary key.
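As an illustration of the idea above, here is a minimal Python sketch that tests whether a functional dependency holds in some sample rows. The helper name `holds` and the sample employees are assumptions made for this example, not part of the original notes. Sample data can only disprove a dependency; whether it truly holds is a matter of business rules.

```python
def holds(rows, determinant, dependent):
    """Return True if determinant -> dependent is not contradicted by rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependent)
        # The first value seen for a key is remembered; any different value breaks the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical sample data: two employees share a first name.
employees = [
    {"EmployeeCode": "E1", "FirstName": "Ann"},
    {"EmployeeCode": "E2", "FirstName": "Raj"},
    {"EmployeeCode": "E3", "FirstName": "Ann"},
]

code_to_name = holds(employees, ["EmployeeCode"], ["FirstName"])
name_to_code = holds(employees, ["FirstName"], ["EmployeeCode"])
```

Here EmployeeCode -> FirstName is consistent with the data, while FirstName -> EmployeeCode is not, which is why EmployeeCode is the better determinant to pick as a key.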
First Normal Form (1NF)
The definition of first normal form is: A relation in which the intersection of each row and
column contains one and only one value. Let's try to understand this with an example. The following
table shows an identified relation for Student Registration for courses. As you see, a tuple
represents a registration that is done on a date.
To make sure that the relation is in 1NF, we need to make sure that each intersection of a row
and a column holds a single value and that there are no repeating groups.
You can see that the Course attribute holds multiple values, which violates 1NF. There are multiple
ways of addressing this, but if I need to handle it without decomposing the relation, I can
organize my tuples as below.
Since the relation has no multiple values in intersections and no repeating groups, it is now a
1NF relation.
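The reorganization into one tuple per course can be sketched in Python as follows. The student names, courses and dates are hypothetical sample values, since the original table is not reproduced here.

```python
# Hypothetical unnormalized rows: Course holds several values in one intersection.
unnormalized = [
    {"StudentCode": "S1", "Name": "Ann", "Course": ["Math", "Physics"], "Date": "2024-01-10"},
    {"StudentCode": "S2", "Name": "Raj", "Course": ["Math"], "Date": "2024-01-11"},
]

# 1NF: emit one tuple per (student, course) so every intersection holds a single value.
first_normal_form = [
    {**row, "Course": course}
    for row in unnormalized
    for course in row["Course"]
]
```

The student with two courses now appears in two tuples, so Name is repeated. That redundancy is what the next normal form deals with.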
Second Normal Form (2NF)
The definition of second normal form is: A relation that is in First Normal form and every
non-primary-key attribute is fully dependent on the primary key. What it says is that no
non-primary-key attribute should depend on only part of the primary key (a partial dependency).
Let's try to set the primary key for above table. For that, let's list out some functional
dependencies;
Considering the functional dependencies identified above, I can easily pick the first one, that is
StudentCode, Course, as my primary key because the combination can be used to
identify each tuple.
Okay, now the primary key is StudentCode + Course. However, we know that the StudentCode ->
Name dependency still exists. This means that Name can be determined by part of the
primary key, which is a partial dependency. This is a violation of second normal form.
We can now decompose the relation into two so that the resulting relations do not violate
2NF.
Note that you will not see a violation of 2NF if the primary key is based on just one attribute.
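The 2NF decomposition described above can be sketched in Python like this. The relation names `students` and `registrations` and the sample rows are assumptions for illustration only.

```python
# Hypothetical 1NF rows with primary key (StudentCode, Course).
rows = [
    ("S1", "Ann", "Math", "2024-01-10"),
    ("S1", "Ann", "Physics", "2024-01-10"),
    ("S2", "Raj", "Math", "2024-01-11"),
]

# Name depends on StudentCode alone (a partial dependency), so it moves out.
students = {(code, name) for code, name, course, date in rows}
registrations = {(code, course, date) for code, name, course, date in rows}

# Rejoining on StudentCode reproduces the original rows.
names = dict(students)
rejoined = {(code, names[code], course, date) for code, course, date in registrations}
```

Each student name is now stored once, and the join on StudentCode recovers the original relation.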
Third Normal Form (3NF)
This normal form speaks about transitive dependency. The definition goes as A relation that
is in First and Second Normal form and in which no non-primary-key attribute is transitively
dependent on the primary key.
This says that we should remove transitive dependencies if they exist. What is a transitive
dependency? Consider a Student relation in which StudentCode determines the Town
(StudentCode -> Town; there is only one town associated with a given StudentCode) and
Town determines the Province (Town -> Province). Therefore StudentCode determines
Province. (Note that, in this relation, StudentCode determines Province, but the issue is that
Province can be determined by Town too.) This is a transitive dependency. In other words, if you
see that attribute A determines B (A -> B) and B determines C (B -> C), then A determines C
(A -> C).
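Removing the transitive dependency can be sketched in Python as below. The sample students and the helper `province_of` are hypothetical; only the attribute names StudentCode, Town and Province come from the text.

```python
# Hypothetical Student rows: Province is repeated for every student in the same town.
students = [
    {"StudentCode": "S1", "Town": "Kandy", "Province": "Central"},
    {"StudentCode": "S2", "Town": "Galle", "Province": "Southern"},
    {"StudentCode": "S3", "Town": "Kandy", "Province": "Central"},
]

# 3NF: keep StudentCode -> Town in one relation and Town -> Province in another.
student_town = {row["StudentCode"]: row["Town"] for row in students}
town_province = {row["Town"]: row["Province"] for row in students}

def province_of(student_code):
    """Follow StudentCode -> Town -> Province across the two relations."""
    return town_province[student_town[student_code]]
```

Each town's province is now stored once, so changing it requires updating a single tuple.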
Boyce-Codd Normal Form (BCNF)
This is an extension of 3NF and is sometimes referred to as 3.5NF. It makes 3NF
stronger by requiring that, among the identified functional dependencies, every determinant,
including those outside the primary key, is a candidate key. The definition is: A relation is in
BCNF if and only if every determinant is a candidate key.
What exactly does this mean? You have already seen that we can identify many functional
dependencies in a relation and pick one determinant as the primary key. The determinants of the
other identified functional dependencies may or may not qualify as candidate keys.
If all determinants qualify, meaning you could make any of them the primary key if needed,
then your relation (table) is in BCNF.
If you take the primary key of this table to be Course + Subject, then there is no violation of
1NF, 2NF or 3NF. Let's list out all possible functional dependencies.
Now, based on the identified functional dependencies, see whether each determinant can be a
candidate key. If you take the first one, we can clearly say that Course +
Subject is a candidate key. The second one, Course + Lecturer, is also a candidate key because
we can identify tuples uniquely using it. However, the determinant of the third one cannot be
used as a candidate key because it has duplicates: you cannot make Lecturer a primary
key. Now you have a determinant that cannot be set as a primary key, hence the relation violates
BCNF.
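The check described above can be sketched in Python. The sample rows and the helper `is_unique` are assumptions for illustration; the three determinants are the ones named in the text. Duplicates in sample data show that a determinant cannot be a key, while uniqueness in a sample is only consistent with it being one.

```python
def is_unique(rows, attrs):
    """True if no two rows share the same values for attrs."""
    keys = [tuple(row[a] for a in attrs) for row in rows]
    return len(keys) == len(set(keys))

# Hypothetical sample data: lecturer L1 teaches subject S1 in two courses.
teaching = [
    {"Course": "C1", "Subject": "S1", "Lecturer": "L1"},
    {"Course": "C1", "Subject": "S2", "Lecturer": "L2"},
    {"Course": "C2", "Subject": "S1", "Lecturer": "L1"},
    {"Course": "C2", "Subject": "S3", "Lecturer": "L3"},
]

determinants = [("Course", "Subject"), ("Course", "Lecturer"), ("Lecturer",)]

# Any determinant with duplicate values cannot be a candidate key, so it breaks BCNF.
violations = [d for d in determinants if not is_unique(teaching, d)]
```

In this sample only Lecturer has duplicates, which matches the reasoning in the text.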
Fourth Normal Form (4NF)
This normal form handles multi-valued dependencies that arise from 1NF. When we see repeating
groups or multiple values in an intersection, we add additional tuples to remove the multiple
values. That is what we do for 1NF. When there are two multi-valued attributes in a relation,
each value of one attribute has to be repeated with every value of the other
attribute. This situation is referred to as a multi-valued dependency. See the relation below;
The fourth normal form is described as: A relation that is in Boyce-Codd normal form and does
not contain nontrivial multi-valued dependencies. This definition concerns one type of
multi-valued dependency, the nontrivial kind. A multi-valued dependency A ->> B in a relation R
is trivial if B is a subset of A or A ∪ B = R; otherwise it is nontrivial. As you see,
CustomerContact contains nontrivial multi-valued dependencies, hence we need to decompose the
table as below.
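A small Python sketch of this decomposition follows. The original CustomerContact table is not reproduced here, so the attributes (customer, phone, email) and the values are hypothetical.

```python
# Hypothetical CustomerContact rows: two independent multi-valued attributes,
# so every phone number is repeated with every email address.
customer_contact = {
    ("C1", "555-0100", "a@example.com"),
    ("C1", "555-0100", "b@example.com"),
    ("C1", "555-0101", "a@example.com"),
    ("C1", "555-0101", "b@example.com"),
}

# 4NF: one relation per independent multi-valued fact.
customer_phone = {(c, p) for c, p, e in customer_contact}
customer_email = {(c, e) for c, p, e in customer_contact}

# A natural join on customer reproduces the original relation.
rejoined = {
    (c, p, e)
    for c, p in customer_phone
    for c2, e in customer_email
    if c == c2
}
```

Four tuples shrink to two plus two, and adding a third phone number now needs one new tuple instead of one per email address.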
Fifth Normal Form (5NF)
In order to normalize relations, we decompose them into multiple relations. Although
the decomposed relations optimize transactions and avoid anomalies, they add a cost to data
retrieval because the relations have to be rejoined. The biggest risk with rejoining is producing
inaccurate output in certain conditions.
When we decompose a relation into two relations, the resulting relations should have the
lossless-join property, which ensures that rejoining the two relations produces the original
relation. The definition of lossless-join is: A property of decomposition, which ensures that no
spurious tuples are generated when relations are reunited through a natural join operation.
Now let's try to understand Fifth Normal Form. Decomposing a relation into multiple
relations to minimize redundancy might introduce a join dependency, which might create
spurious tuples when the relations are reunited. The definition of join dependency is: for a
relation R with subsets of the attributes of R denoted as A, B, ..., Z, R satisfies a join
dependency if and only if every legal value of R is equal to the join of its projections on A, B, ...,
Z. Considering this, the definition of Fifth normal form is: A relation that has no join
dependency.
Since this is very rare to see in database design, let's try to understand it with an example. See
the following table, which shows how Lecturers teach Subjects related to Courses.
Assume that the tuples are formed based on the following scenario;
Whenever Lecturer L1 teaches Subject S1,
and Course C1 has Subject S1,
and Lecturer L1 teaches at least one subject in Course C1,
then Lecturer L1 will be teaching Subject S1 in Course C1.
Note that we assume this scenario only to explain 5NF; without it, the problem would not be visible.
Now if I try to decompose this relation into two relations to minimize redundant data, I
will have these two tables (sequence numbers are added only to make the joins easier to follow);
Now, if I rejoin these with a natural join, this will be the result.
See the highlighted one. It is the result of the join dependency. It is a spurious tuple, which is
not valid. To avoid it and make the relations 5NF relations, let's introduce
another relation like below;
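The spurious tuple, and how a third relation removes it, can be reproduced with a short Python sketch. The original tables are not reproduced here, so the tuples below are hypothetical values chosen to satisfy the scenario stated above.

```python
# Hypothetical (Lecturer, Subject, Course) tuples that obey the stated rule.
R = {
    ("L1", "S1", "C2"),
    ("L1", "S2", "C1"),
    ("L2", "S1", "C1"),
    ("L1", "S1", "C1"),
}

# Projections of R onto pairs of attributes.
lecturer_subject = {(l, s) for l, s, c in R}
subject_course = {(s, c) for l, s, c in R}
lecturer_course = {(l, c) for l, s, c in R}

# Natural join of only two projections on Subject.
two_way = {
    (l, s, c)
    for l, s in lecturer_subject
    for s2, c in subject_course
    if s == s2
}
spurious = two_way - R  # tuples the join invented

# Joining with the third projection filters the spurious tuple out again.
three_way = {(l, s, c) for l, s, c in two_way if (l, c) in lecturer_course}
```

With these values the two-relation join invents the tuple (L2, S1, C2), because L2 teaches S1 and S1 belongs to C2 even though L2 does not teach in C2. Joining with the third projection brings back exactly the original relation.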