DATABASE NOTES Database Normalization

Normalization

Uploaded by

oniskatsande

Normalization

The ER model is used for transactional systems primarily because it minimizes data
redundancy and ensures data integrity. The approach to reduce redundancy is called
normalization, which is a formal data modeling approach to examining and validating a
model and putting it into a particular normalized form. The ER model is often called a third
normal form (3NF) or a normalized model. The advantage of this process is that each
attribute that belongs to an entity is going to be assigned a unique position within the data
model. The goal is to have minimal or no redundancy throughout the data model, and
normalization helps achieve this goal.

The downside of normalization is that it can adversely affect performance and deadlines if
strictly enforced. ERP systems can have tens of thousands of tables. This, of course, has an
impact on performance anyway based on the number of joins and tables that have to be
updated. Adding the effort of normalizing a large model can take an inordinate amount of
time to design and to code.

Normalization Levels

There are precise definitions and approaches toward normalizing a database. Edgar Frank
Codd, who invented the relational model for database management, identified the normal
forms as different states of the normalized relational data model. Those levels are defined as:

•First normal form (1NF), which has no repeating groups within it.
•Second normal form (2NF), which has no partial-key dependencies.
•Third normal form (3NF), which has no non-key interdependencies.
•Fourth normal form (4NF), which has no independent multiple relationships.
•Fifth normal form (5NF), which has no semantically-related multiple relationships.

Like cardinality, normalization is a subtle concept, and the differences between the levels
require close examination to understand. Fortunately, this classic yet challenging modeling
technique can be applied step by step to reach whatever level of normalization is required.
The application of the normalization rules is cumulative from 1NF to 5NF: each level assumes
that the previous levels are already satisfied.

Most operational systems or transactional systems are implemented in the third normal form.
3NF is considered the best practice in that it enables transactional integrity while balancing
complexity and performance. Although there are higher levels of normalization, they are
rarely used in transactional systems and never used in BI applications.

Normalizing an Entity

There are three steps to develop a third normal form database. They are shown at a high level
below, and then in further detail in the subsequent sections.

1. 1NF: Eliminate repeating groups. Make a separate table for each set of related attributes
(in essence, this is creating an entity). Identify a primary key for each table. If you
cannot define a primary key, then you have not yet split the tables into sets of related
attributes that each form an entity, and you need to repeat this step.
2. 2NF: Eliminate redundant data stored in different entities. If an attribute depends on
only part of the primary key (which can happen when the key is a compound key), then move
it to a separate table.
3. 3NF: Eliminate non-key interdependencies. Once you have defined the primary key and the
keys within it, all the attributes in that entity need to be related to that key. For
example, if you have customer or product, you can only have attributes that are related to
the customer or the product within the entity. Otherwise, remove them and put them into a
separate table, as they are most likely separate entities. With these steps completed, you
have defined a 3NF schema.

First Normal Form—1NF

The first step to ultimately normalize a database to third normal form is to eliminate repeating
groups, creating a first normal form.

Refer to Figure 8.14 and ask the following question for each attribute: does this attribute
occur more than once for any instance?

•If it does not, then you have no repeating groups, and you can move on.
•If it does, you need to:
•Build a new entity.
•Move all the repeating attributes to the new entity.
•Set the attributes for the new entity’s primary key.
•Build an identifying relationship from the original entity (the one that had the
repeating groups to begin with) to the new entity, where you moved those repeating
attributes.

As shown in Figure 8.15, you slowly decompose the entities until you have no repeating
attributes, so each entity stands alone by itself.
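
The 1NF steps above can be sketched in a few lines of Python. This is only an illustration: the customer records, the phone_1 to phone_3 columns, and the helper name split_repeating_group are all hypothetical and do not come from Figure 8.14 or Figure 8.15.

```python
# Hypothetical customer records with a repeating group (phone_1, phone_2, phone_3).
customers = [
    {"customer_id": 1, "name": "Ann", "phone_1": "555-0101", "phone_2": "555-0102", "phone_3": None},
    {"customer_id": 2, "name": "Bob", "phone_1": "555-0201", "phone_2": None, "phone_3": None},
]


def split_repeating_group(rows, key, group_columns, new_attribute):
    """Move the repeating columns into a new child entity."""
    parent, child = [], []
    for row in rows:
        # The original entity keeps every attribute outside the repeating group.
        parent.append({k: v for k, v in row.items() if k not in group_columns})
        for column in group_columns:
            if row[column] is not None:
                # The child's primary key is the parent key plus the repeated value,
                # which is what makes the relationship an identifying one.
                child.append({key: row[key], new_attribute: row[column]})
    return parent, child


customer, customer_phone = split_repeating_group(
    customers, "customer_id", ["phone_1", "phone_2", "phone_3"], "phone"
)
print(customer)
print(customer_phone)
```

After the split, a customer can have any number of phone numbers without adding columns, and no row carries empty placeholder attributes.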

Second Normal Form—2NF

The second step is to eliminate redundant data and get into second normal form (see Figure
8.16). If attributes are not dependent on the whole primary key, place them into a separate
entity. For only those entities that have a primary key that is a composite key, ask the
following of each non-key attribute: is this attribute dependent on only part of the primary
key?

FIGURE 8.16. 2NF example.


•If it is not, then there are no partial-key dependencies.
•If it is, then you need to:
•Build a new entity.
•Move out all the attributes that have the same partial-key dependency to the new
entity.
•Use the part of the key that these attributes depend on as the new entity's primary
key, or determine a better primary key.
•Build, name, and identify a relationship from the new entity pointing back to the
original entity that had the partial-key dependencies.

As Figure 8.16 shows, a stepwise refinement further decomposes the entities, so they can
stand on their own.
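
As a rough sketch of the 2NF step, assume a hypothetical order-line entity whose composite key is (order_id, product_id) and whose product_name depends on product_id alone. The data and the helper name remove_partial_dependency are illustrative assumptions, not taken from Figure 8.16.

```python
# Hypothetical order lines; the primary key is the pair (order_id, product_id),
# but product_name depends on product_id alone (a partial-key dependency).
order_lines = [
    {"order_id": 1, "product_id": "P1", "product_name": "Widget", "quantity": 2},
    {"order_id": 1, "product_id": "P2", "product_name": "Gadget", "quantity": 1},
    {"order_id": 2, "product_id": "P1", "product_name": "Widget", "quantity": 5},
]


def remove_partial_dependency(rows, partial_key, dependent_attributes):
    """Move attributes that depend on part of the key into a new entity."""
    new_entity = {}
    remaining = []
    for row in rows:
        key_value = tuple(row[k] for k in partial_key)
        # One row per distinct value of the partial key, so the redundancy disappears.
        new_entity[key_value] = {
            **{k: row[k] for k in partial_key},
            **{a: row[a] for a in dependent_attributes},
        }
        # The original entity keeps its full key and the fully dependent attributes.
        remaining.append({k: v for k, v in row.items() if k not in dependent_attributes})
    return list(new_entity.values()), remaining


product, order_line = remove_partial_dependency(order_lines, ["product_id"], ["product_name"])
print(product)
print(order_line)
```

The product name is now stored once per product instead of once per order line, so renaming a product touches a single row.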

Third Normal Form—3NF

The final step is to eliminate the columns not dependent on the key, which creates the third
normal form as shown in Figure 8.17. For each non-key attribute, ask the following: does this
attribute depend on some other non-key attribute?

FIGURE 8.17. 3NF example.

•If it does not, then there are no non-key interdependencies.
•But if it does, once again, you need to:
•Build a new entity to contain all attributes with this same non-key dependency.
•Use the attribute that the others depend on as the primary key.
•Build and name a nonidentifying, nonmandatory (meaning optional) relationship
from the new entity back to the original entity.

So the third step, to recap, is to decompose an entity and spin off the attributes that
depend on other non-key attributes. Attributes that are related to one another get split out
into their own entity.
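
One way to picture the result of the 3NF step is as a pair of tables linked by an optional foreign key. The sketch below uses Python's built-in sqlite3 module; the employee and department tables and their columns are hypothetical and are not the entities shown in Figure 8.17.

```python
import sqlite3

# In-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- New entity: department_name used to sit in employee and depended on
    -- department_id, a non-key attribute of employee.
    CREATE TABLE department (
        department_id   INTEGER PRIMARY KEY,
        department_name TEXT NOT NULL
    );
    -- Original entity: the foreign key is not part of the primary key
    -- (nonidentifying) and may be NULL (nonmandatory).
    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        department_id INTEGER REFERENCES department(department_id)
    );
    INSERT INTO department VALUES (10, 'Finance'), (20, 'Sales');
    INSERT INTO employee VALUES (1, 'Ann', 10), (2, 'Bob', 10), (3, 'Cy', NULL);
""")

# A LEFT JOIN keeps employees that have no department assigned.
rows = conn.execute("""
    SELECT e.name, d.department_name
    FROM employee AS e
    LEFT JOIN department AS d ON d.department_id = e.department_id
    ORDER BY e.employee_id
""").fetchall()
print(rows)
```

The department name is stored once, and an employee without a department is still a valid row because the relationship is optional.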

After the three normalization steps, you reach the third normal form data model, which is
generally the level of normalization that is needed to support enterprise applications.

Database Normalization - 1NF, 2NF, 3NF, BCNF, 4NF and 5NF with examples

Normalization is a process of identifying the optimal grouping (relations at the end) for
attributes that satisfies the data requirements of an organization. It is a database design
technique we use after completing ER modeling. This process identifies relationships between
attributes (called functional dependencies) and applies a series of tests described as normal
forms.

There are many articles written on this and you can find examples for almost all normal
forms. However, many articles explain the theory with the same scenario, hence I thought to
make a post with a different set of examples that I use for my lectures.

As explained above, the purpose of normalization is to identify the best grouping for
attributes that ultimately forms relations. Some of the characteristics of the relations
formed are:

•Support of data requirements with a minimal number of attributes
•Relation holds attributes with a close logical relationship (functional dependency)
•Relation holds minimal redundancy, with each attribute represented only once (except
foreign keys)
o increasing the performance of updates
o reducing the storage consumption
o avoiding update anomalies (insertion, modification and deletion)

Functional Dependency

Let's try to understand functional dependency first. It describes the relationship between
attributes in a relation. For example, if EmployeeCode and FirstName are attributes of the
Employee relation, we can say that FirstName is functionally dependent on EmployeeCode. This
means each EmployeeCode is associated with exactly one value of FirstName. We denote this as:

EmployeeCode -> FirstName

In the above relationship, we call EmployeeCode the determinant. Basically, the left-hand
side of the arrow is considered the determinant. Each value of the determinant is associated
with exactly one value on the right-hand side, although several determinant values may share
the same right-hand value (two employees can have the same first name).

If the right-hand attribute depends on the whole left-hand side, we call this dependency a
full functional dependency. If the left-hand side is a composite one (two or more attributes)
and the right-hand side can be determined by part of the left-hand side, then the dependency
is considered a partial dependency (you will see an example of it later).

Why do we need to identify functional dependencies? One of the reasons is to identify the
best candidate for the primary key. Once the functional dependencies are identified, we can
analyze all of them and select the most suitable determinant as the primary key.
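
A functional dependency can be tested against sample data with a short Python check. This is a minimal sketch: the function name holds and the employee rows are hypothetical, and a sample can only show that a dependency is violated, never prove that it is a business rule.

```python
def holds(rows, determinant, dependent):
    """Return True if determinant -> dependent is not violated by the given rows."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in determinant)
        right = tuple(row[a] for a in dependent)
        # setdefault stores the first right-hand value seen for this left-hand value
        # and returns the stored one on later visits.
        if seen.setdefault(left, right) != right:
            return False
    return True


# Hypothetical sample of the Employee relation.
employees = [
    {"EmployeeCode": "E1", "FirstName": "Jane"},
    {"EmployeeCode": "E2", "FirstName": "John"},
    {"EmployeeCode": "E3", "FirstName": "Jane"},
]

print(holds(employees, ["EmployeeCode"], ["FirstName"]))  # True
print(holds(employees, ["FirstName"], ["EmployeeCode"]))  # False
```

The second call fails because the first name Jane is associated with two employee codes, so FirstName cannot act as a determinant of EmployeeCode.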

First Normal Form (1NF)

The definition of this goes as "A relation in which the intersection of each row and column
contains one and only one value." Let's try to understand this with an example. The following
table shows an identified relation for Student Registration for courses. As you see, a tuple
represents a registration that is done on a date.

In order to make sure that the relation is normalized for 1NF, we need to make sure that:

•No multiple values in the intersection of each row and column
•No repeatable groups in attributes (like Course1, Course2, Course3... columns)
•Order of attributes and tuples is insignificant
•No duplicate tuples in the relation.

You can see that the Course attribute has multiple values, which violates 1NF. There are
multiple ways of addressing this, but if I need to handle it without decomposing the relation,
I can organize my tuples as below.

Since the relation has no multiple values in intersections and no repeatable groups, it is now
a 1NF relation.
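
Reorganizing the tuples without decomposing the relation amounts to emitting one tuple per course. The following Python sketch shows the idea; the registration rows are hypothetical stand-ins and are not the values from the article's table.

```python
# Hypothetical registrations where Course holds several values per tuple (not 1NF).
registrations = [
    {"StudentCode": "S1", "Name": "Jane", "Town": "Town A", "Province": "Province X",
     "Course": ["C1", "C2"], "DateRegistered": "2020-01-05"},
    {"StudentCode": "S2", "Name": "John", "Town": "Town B", "Province": "Province Y",
     "Course": ["C1"], "DateRegistered": "2020-01-06"},
]


def to_1nf(rows, multi_valued):
    """Emit one tuple per value of the multi-valued attribute."""
    flat = []
    for row in rows:
        for value in row[multi_valued]:
            # Copy the row and replace the list with a single value.
            flat.append({**row, multi_valued: value})
    return flat


registrations_1nf = to_1nf(registrations, "Course")
for row in registrations_1nf:
    print(row["StudentCode"], row["Course"], row["Name"])
```

Every intersection now holds a single value, at the cost of repeating Name, Town and Province for each course. That redundancy is what the next normal forms remove.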

Second Normal Form (2NF)

The definition of second normal form is "A relation that is in First Normal Form and in which
every non-primary-key attribute is fully dependent on the primary key." What it says is that
there should not be a partial dependency between the primary key and a non-primary-key
attribute.

Let's try to set the primary key for the above table. For that, let's list out some functional
dependencies:

StudentCode, Course -> Name, Town, Province, Course, DateRegistered
StudentCode -> Name
Town -> Province

Considering the functional dependencies identified above, I can easily pick the first
determinant, StudentCode, Course, as my primary key because the combination of the two
identifies a tuple uniquely.

Okay, now the primary key is StudentCode+Course. However, we know that the StudentCode -> Name
dependency still exists. This means that Name can be determined by part of the primary key,
which is a partial dependency. This is a violation of second normal form.

We can now decompose the relation into two to make sure that the relations do not violate
2NF.
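
Spotting a partial dependency can be done mechanically once the functional dependencies are written down. The sketch below encodes the three dependencies listed above as Python tuples; the function name partial_dependencies is a hypothetical helper, not part of any library.

```python
def partial_dependencies(primary_key, fds):
    """Return the dependencies whose determinant is only part of the primary key."""
    key = set(primary_key)
    found = []
    for determinant, dependents in fds:
        non_key = set(dependents) - key
        # set(determinant) < key is True only for a proper subset of the key.
        if set(determinant) < key and non_key:
            found.append((determinant, sorted(non_key)))
    return found


# The functional dependencies listed in the text, as (determinant, dependents) pairs.
fds = [
    (("StudentCode", "Course"), ("Name", "Town", "Province", "Course", "DateRegistered")),
    (("StudentCode",), ("Name",)),
    (("Town",), ("Province",)),
]

print(partial_dependencies(("StudentCode", "Course"), fds))
```

Only StudentCode -> Name is reported. Town -> Province is not a partial dependency because Town is not part of the primary key; it is handled by third normal form instead.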

Note that you will not see a violation of 2NF if the primary key is based on just one attribute.

Third Normal Form (3NF)

This normal form speaks about transitive dependency. The definition goes as "A relation that
is in First and Second Normal Form and in which no non-primary-key attribute is transitively
dependent on the primary key."

This says that we should remove transitive dependencies if they exist. What is a transitive
dependency? It is a condition such as the following in the Student relation: StudentCode
determines the Town (StudentCode -> Town, as there is only one town associated with a given
StudentCode) and Town determines the Province (Town -> Province), therefore StudentCode
determines Province. (Note that, as per this relation, StudentCode determines Province, but
the issue is that Province can be determined by Town too.) This is a transitive dependency. In
other words, if you see that attribute A determines B (A -> B) and B determines C (B -> C),
then A determines C (A -> C).

For removing transitive dependency, we need to decompose the relation.
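
Transitive dependencies can be made visible by computing the closure of an attribute set, that is, everything it determines directly or indirectly. This is a small sketch of the standard closure algorithm; the function name closure and the encoding of the dependencies as tuples are choices made for this example.

```python
def closure(attributes, fds):
    """Return every attribute determined, directly or transitively, by `attributes`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for determinant, dependents in fds:
            # If the whole determinant is already known, its dependents follow.
            if set(determinant) <= result and not set(dependents) <= result:
                result |= set(dependents)
                changed = True
    return result


# Functional dependencies of the Student relation described in the text.
student_fds = [
    (("StudentCode",), ("Name", "Town")),
    (("Town",), ("Province",)),
]

print(sorted(closure({"StudentCode"}, student_fds)))
print(sorted(closure({"Town"}, student_fds)))
```

Province appears in the closure of StudentCode only because of Town, which is the transitive dependency. Moving Town and Province into their own relation, keyed by Town, removes it.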

Boyce-Codd Normal Form (BCNF / 3.5NF)

This is an extension of 3NF and is sometimes referred to as 3.5NF. It makes 3NF stronger by
requiring that every determinant in the identified functional dependencies, not just the
primary key, is a candidate key. The definition goes as "A relation is in BCNF if and only
if every determinant is a candidate key."

What exactly does this mean? You have already seen that we can identify many functional
dependencies in a relation and that we pick one determinant as the primary key. The
determinants of the other functional dependencies may or may not qualify as candidate keys.
If all determinants qualify, meaning you could mark any of them as the primary key if
needed, then your relation (table) is in BCNF.

Take this example.


Assume that the business rules related to this relation are as follows:

1. Course has one or more subjects.
2. Course is managed by one or more lecturers.
3. Subject is taught by one or more lecturers.
4. Lecturer teaches only one subject.

If you take the primary key of this table to be Course + Subject, then there is no violation
of 1NF, 2NF or 3NF. Let's list out all possible functional dependencies.

1. Course, Subject -> Lecturer
2. Course, Lecturer -> Subject
3. Lecturer -> Subject

Now, based on the identified functional dependencies, see whether each determinant can be a
candidate key. For the first one, we can clearly say that Course + Subject is a candidate
key. For the second one, Course + Lecturer is also a candidate key, as we can identify tuples
uniquely using it. However, the determinant of the third one, Lecturer, cannot be used as a
candidate key because its values repeat across tuples, so you cannot make Lecturer a primary
key. Now you have a determinant that is not a candidate key, hence the relation violates
BCNF.

In order to make the table a BCNF table, we need to decompose it as below.
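
The BCNF test can also be run mechanically: a determinant passes if its closure covers every attribute of the relation. The sketch below applies that test to the three dependencies listed above. The function names are hypothetical, and the test checks for superkeys, which is the usual practical form of the "every determinant is a candidate key" rule.

```python
def closure(attributes, fds):
    """Return every attribute determined, directly or transitively, by `attributes`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for determinant, dependents in fds:
            if set(determinant) <= result and not set(dependents) <= result:
                result |= set(dependents)
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Return the nontrivial dependencies whose determinant is not a superkey."""
    all_attributes = set(attributes)
    return [
        (determinant, dependents)
        for determinant, dependents in fds
        if not set(dependents) <= set(determinant)       # skip trivial dependencies
        and closure(determinant, fds) != all_attributes  # determinant is not a key
    ]


course_fds = [
    (("Course", "Subject"), ("Lecturer",)),
    (("Course", "Lecturer"), ("Subject",)),
    (("Lecturer",), ("Subject",)),
]

print(bcnf_violations(("Course", "Subject", "Lecturer"), course_fds))
```

Only Lecturer -> Subject is reported. A common way to resolve such a violation is to move the offending dependency into its own relation (Lecturer, Subject) and keep (Course, Lecturer) as the other relation.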

Fourth Normal Form (4NF)

This normal form handles multi-valued dependencies caused by 1NF. When we see repeated
groups or multiple values in an intersection, we add additional tuples to remove the multiple
values. That is what we do with 1NF. When there are two multi-valued attributes in a relation,
each value of one attribute has to be repeated with every value of the other attribute. This
situation is referred to as a multi-valued dependency. See the relation below:

If we apply 1NF to this relation;

The definition of the multi-valued dependency goes as "Represents a dependency between
attributes (for example, A, B and C) in a relation, such that for each value of A there is a
set of values for B and a set of values for C. However, the sets of values for B and C are
independent of each other." This dependency is denoted as A ->> B and A ->> C.

See the CustomerContacts table. CustomerCode determines multiple Telephone values
(CustomerCode ->> Telephone) and CustomerCode determines multiple Address values
(CustomerCode ->> Address).

The fourth normal form is described as "A relation that is in Boyce-Codd normal form and does
not contain nontrivial multi-valued dependencies." This talks about one type of multi-valued
dependency, the nontrivial one. A multi-valued dependency A ->> B in relation R is trivial if
B is a subset of A or if A ∪ B = R; otherwise it is nontrivial. As you see, CustomerContacts
contains nontrivial dependencies, hence we need to decompose the table as below.

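
The effect of a multi-valued dependency, and of the 4NF decomposition, can be shown with Python sets. The customer code, telephone and address values below are hypothetical placeholders for the CustomerContacts example.

```python
from itertools import product

# Hypothetical multi-valued facts: customer C1 has two telephones and two addresses,
# and the two sets are independent of each other.
telephones = {"C1": ["T1", "T2"]}
addresses = {"C1": ["A1", "A2"]}

# 1NF forces every telephone to be paired with every address.
customer_contacts = {
    (code, telephone, address)
    for code in telephones
    for telephone, address in product(telephones[code], addresses[code])
}

# 4NF decomposition: one relation per multi-valued dependency.
customer_telephone = {(code, telephone) for code, telephone, _ in customer_contacts}
customer_address = {(code, address) for code, _, address in customer_contacts}

# Joining the two relations on the customer code rebuilds the original tuples.
rejoined = {
    (code, telephone, address)
    for code, telephone in customer_telephone
    for other_code, address in customer_address
    if code == other_code
}

print(len(customer_contacts), len(customer_telephone), len(customer_address))
print(rejoined == customer_contacts)
```

Four tuples shrink to two plus two, and the saving grows with the number of values: a customer with 3 telephones and 4 addresses needs 12 tuples in the single relation but only 7 after decomposition.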
Fifth Normal Form (5NF)

In order to normalize relations, we decompose them into multiple relations. Although the
divided relations optimize transactions and avoid anomalies, they add a cost for data
retrieval, as the relations have to be rejoined. The biggest risk with rejoining is producing
inaccurate output under certain conditions.

When we decompose a relation into two relations, the resulting relations should have the
property called lossless-join, which makes sure that rejoining the two relations produces the
original relation. The definition of lossless-join is "A property of decomposition, which
ensures that no spurious tuples are generated when relations are reunited through a natural
join operation."

Now let's try to understand Fifth Normal Form. When a relation is decomposed into multiple
relations to minimize redundancy, a join dependency may be involved, and spurious tuples can
appear when the relations are reunited. The definition of join dependency goes as "For a
relation R with subsets of the attributes of R denoted as A, B, ..., Z, the relation R
satisfies a join dependency if and only if every legal value of R is equal to the join of its
projections on A, B, ..., Z." Considering this, the definition of Fifth Normal Form goes as "A
relation that has no join dependency."

Since this is very rare to see in database design, let's try to understand it with an example.
See the following table, which shows how Lecturers teach Subjects related to Courses.

Assume that the tuples are formed based on the following scenario:

•Whenever Lecturer L1 teaches Subject S1,
•and Course C1 has Subject S1,
•and Lecturer L1 teaches at least one subject in Course C1,
•then Lecturer L1 will be teaching Subject S1 in Course C1.

Note that we assume this scenario only to explain 5NF; without it, the join dependency would
not be visible.

Now if I try to decompose this relation into two relations to minimize redundant data, I
will have these two tables (sequences are added for understanding the joins only):

Now, if I need to rejoin these with a natural join, this will be the result.

See the highlighted one. It is the result of the join dependency. It is a spurious tuple,
which is not valid. In order to avoid it and make the relations 5NF relations, let's introduce
another relation like below:

Now, if we rejoin all three, we will not see any issue.

Therefore, in order to minimize redundant data and make sure there is no join dependency, make
sure relations are formed as 5NF relations.
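
The spurious tuple and its removal can be reproduced with Python sets. The rows below are hypothetical and chosen only so that they satisfy the scenario described above; they are not the rows of the article's tables, and the choice of which two projections to join first is also an assumption.

```python
# Hypothetical (Lecturer, Subject, Course) tuples that satisfy the stated scenario.
teaching = {
    ("L1", "S1", "C2"),
    ("L1", "S2", "C1"),
    ("L2", "S1", "C1"),
    ("L1", "S1", "C1"),
}

# Projections of the relation onto each pair of attributes.
lecturer_subject = {(lecturer, subject) for lecturer, subject, _ in teaching}
subject_course = {(subject, course) for _, subject, course in teaching}
lecturer_course = {(lecturer, course) for lecturer, _, course in teaching}

# Natural join of only two projections (on Subject).
two_way = {
    (lecturer, subject, course)
    for lecturer, subject in lecturer_subject
    for other_subject, course in subject_course
    if subject == other_subject
}
spurious = two_way - teaching

# Joining the third projection as well (on Lecturer and Course) filters the result.
three_way = {row for row in two_way if (row[0], row[2]) in lecturer_course}

print(spurious)
print(three_way == teaching)
```

With this data the two-way join invents the tuple (L2, S1, C2), which claims that L2 teaches in course C2 although no original tuple says so. The third relation records which lecturers teach in which courses, so the three-way join returns exactly the original tuples.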
