Data Normalization

Data Normalization DBMS

Uploaded by

Nitika Kumari
Copyright
© All Rights Reserved

Data Normalization

We will normalize the data step by step.

The data stores the course code, course venue, instructor name, and instructor’s
phone number. At first, this design seems fine. However, problems appear once we
need to modify information. For instance, suppose Prof. George changes his mobile
number. Because his number is stored against two courses (CS101 and CS154), we will
have to make the edit in 2 places.

What if someone edited the mobile number against CS101 but forgot to edit it for
CS154? The database would then hold stale or wrong information. This problem can be
tackled by dividing our table into 2 simpler tables:

Table 1 (Instructor):

• Instructor ID
• Instructor Name
• Instructor mobile number

Table 2 (Course):

• Course code
• Course venue
• Instructor ID

Here, we store the instructors separately, and in the course table we do not store the
entire data of the instructor. Rather, we store only the ID of the instructor. Now, if
someone wants to know the mobile number of the instructor, they can simply look it up
in the instructor table. Also, if we were to change the mobile number of Prof. George,
it can be done in exactly one place. This avoids the stale/wrong data problem.

Further, the mobile number no longer needs to be stored 2 times; it is stored in just
1 place, which also saves storage. The saving may not be obvious in the simple example
above. However, think about the case where there are hundreds of courses and
instructors and, for each instructor, we have to store not just the mobile number but
also other details like office address, email address, specialization, availability, etc.
In such a situation, replicating so much data will increase the storage requirement
unnecessarily.
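
The two-table design can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table and column names, the venues, and the phone numbers are hypothetical, and only the column lists, the course codes CS101 and CS154, and Prof. George come from the text.

```python
import sqlite3

# Hypothetical schema and sample values (names, venues, and numbers are
# assumptions made for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (
        instructor_id INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        mobile        TEXT NOT NULL
    );
    CREATE TABLE course (
        course_code   TEXT PRIMARY KEY,
        venue         TEXT NOT NULL,
        instructor_id INTEGER NOT NULL REFERENCES instructor(instructor_id)
    );
    INSERT INTO instructor VALUES (1, 'Prof. George', '555-0100');
    INSERT INTO course VALUES ('CS101', 'Lecture Hall 20', 1);
    INSERT INTO course VALUES ('CS154', 'Auditorium', 1);
""")

# The mobile number changes in exactly one row of one table.
conn.execute("UPDATE instructor SET mobile = '555-0199' WHERE instructor_id = 1")

# Every course taught by Prof. George now reports the new number.
rows = conn.execute("""
    SELECT c.course_code, i.mobile
    FROM course AS c
    JOIN instructor AS i ON i.instructor_id = c.instructor_id
    ORDER BY c.course_code
""").fetchall()
print(rows)  # [('CS101', '555-0199'), ('CS154', '555-0199')]
```

Because the course table holds only the instructor ID, there is no second copy of the number that could be left stale.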

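First Normal Form (1NF)

The first normal form simply says that each cell of a table should contain exactly one
value. Assume we are storing the courses that a particular instructor takes. Putting
all of an instructor’s course codes into a single cell would violate this rule; storing
one course code per row satisfies it.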
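
To make the one-value-per-cell rule concrete, here is a minimal Python sketch that splits a multi-valued cell into one row per value. The sample rows and the function name `to_first_normal_form` are assumptions made for illustration (Prof. Atkins and CS152 are invented; Prof. George, CS101, and CS154 appear earlier in the text).

```python
# Hypothetical data: one row per instructor, with several course codes
# packed into a single cell, which violates 1NF.
unnormalized = [
    ("Prof. George", "CS101, CS154"),
    ("Prof. Atkins", "CS152"),
]

def to_first_normal_form(rows):
    """Return one (instructor, course code) row per course code."""
    result = []
    for instructor, courses in rows:
        for code in courses.split(","):
            result.append((instructor, code.strip()))
    return result

normalized = to_first_normal_form(unnormalized)
for row in normalized:
    print(row)
# ('Prof. George', 'CS101')
# ('Prof. George', 'CS154')
# ('Prof. Atkins', 'CS152')
```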
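Second Normal Form (2NF)

For a table to be in second normal form, the following 2 conditions must be met:

• The table should be in the first normal form.
• No non-key column should depend on only a part of the primary key. A primary key
  with exactly 1 column guarantees this, and that is the simplified rule applied in
  this section.
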
Consider an enrollment table in which the first column is the student name and the
second column is the course taken by the student.

Clearly, the student name column isn’t unique, as there are 2 entries corresponding to
the name ‘Rahul’, in row 1 and row 3. Similarly, the course code column is not unique,
as there are 2 entries corresponding to course code CS101, in row 2 and row 4.

However, the tuple (student name, course code) is unique, since a student cannot
enroll in the same course more than once. So, these 2 columns combined form the
primary key for the table.
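
The uniqueness argument can be checked with a small Python sketch. The rows below are an assumed reconstruction: only the name Rahul in rows 1 and 3 and the code CS101 in rows 2 and 4 come from the text, and the remaining values and the helper `is_unique` are hypothetical.

```python
# Assumed enrollment rows: (student name, course code).
enrollment = [
    ("Rahul", "CS152"),
    ("Rajat", "CS101"),
    ("Rahul", "CS154"),
    ("Raman", "CS101"),
]

def is_unique(rows, column_indexes):
    """True if no two rows agree on all of the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[i] for i in column_indexes)
        if key in seen:
            return False
        seen.add(key)
    return True

print(is_unique(enrollment, [0]))     # False: 'Rahul' appears twice
print(is_unique(enrollment, [1]))     # False: 'CS101' appears twice
print(is_unique(enrollment, [0, 1]))  # True: the pair identifies a row
```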

By the single-column-key rule used in this section, our enrollment table above isn’t in
the second normal form. To go from 1NF to 2NF, we can instead break it into 2 tables,
each with a single-column primary key.

Third Normal Form (3NF)

Before we delve into the details of third normal form, let us understand the concept
of a functional dependency on a table.

Column A is said to be functionally dependent on column B if changing the value of B
may require a change in the value of A. In other words, the value of B determines the
value of A. As an example, consider a table that lists each course code together with
the professor’s name and the professor’s department.

Here, the department column is dependent on the professor name column. This is
because if, in a particular row, we change the name of the professor, we will also have
to change the department value. As an example, suppose MA214 is now taken by
Prof. Ronald, who happens to be from the mathematics department. Updating that row
means changing two columns, not one.

When we changed the name of the professor, we also had to change the department
column. This is not desirable, since someone who is updating the database may
remember to change the name of the professor but forget to update the department
value. This can cause inconsistency in the database.
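
A functional dependency can be tested against sample rows. The sketch below is a hedged illustration: the helper `fd_holds` and all rows except MA214 / Prof. Ronald / Mathematics are assumptions. Note that a check on sample data can only refute a dependency; it cannot prove that the dependency holds by design.

```python
def fd_holds(rows, lhs, rhs):
    """True if rows that agree on the `lhs` columns also agree on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True

courses = [
    {"course": "MA214", "professor": "Prof. Ronald", "department": "Mathematics"},
    {"course": "CS101", "professor": "Prof. George", "department": "Computer Science"},
    {"course": "CS154", "professor": "Prof. George", "department": "Computer Science"},
]
print(fd_holds(courses, ["professor"], ["department"]))  # True

# Simulate a forgetful update: one row gets a different department.
stale = [dict(row) for row in courses]
stale[2]["department"] = "Physics"
print(fd_holds(stale, ["professor"], ["department"]))    # False
```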

Third normal form avoids this inconsistency by breaking the table into separate tables.
We store the details of the professor against his/her ID. This way, whenever we want
to reference the professor somewhere, we don’t have to put the other details of the
professor in that table again. We can simply use the ID.

Therefore, in the third normal form, the following conditions are required:

• The table should be in the second normal form.
• No non-key column should be functionally dependent on another non-key column
  (that is, there should be no transitive dependency on the key).

Boyce-Codd Normal Form (BCNF)


The Boyce-Codd Normal Form is a stricter version of the third normal form.
A table is in Boyce-Codd Normal Form if and only if at least one of the following
conditions is met for each functional dependency A → B:

• A is a superkey.
• It is a trivial functional dependency.

Let us first understand what a superkey means. To understand BCNF in DBMS, consider
an example table whose columns include course code, professor name, and professor
mobile number.

Here, the first column (course code) is unique across the rows, so it is a superkey.
The combination of columns (course code, professor name) is also unique across the
rows, so it is also a superkey. A superkey is a set of columns whose combined values
are unique across rows. That is, no 2 rows have the same set of values for those
columns. Some of the superkeys for the table above are:

• Course code
• Course code, professor name
• Course code, professor mobile number

A superkey from which no column can be removed without losing uniqueness is called a
candidate key. For instance, the first superkey above has just 1 column. The second one
and the last one have 2 columns, and each still contains course code, which is unique on
its own. So, only the first superkey (Course code) is a candidate key.
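
The superkey and candidate key definitions can be illustrated by brute force on a small table. This is a sketch under assumptions: the sample rows and phone numbers are invented, and uniqueness in sample data only shows that a column set is consistent with being a key for these rows, not that it is a key by design.

```python
from itertools import combinations

# Hypothetical sample rows for the three columns named in the text.
columns = ("course_code", "professor_name", "professor_mobile")
rows = [
    ("CS101", "Prof. George", "555-0100"),
    ("CS154", "Prof. George", "555-0100"),
    ("MA214", "Prof. Ronald", "555-0101"),
]

def is_superkey(rows, indexes):
    """True if the projection onto `indexes` has no duplicate rows."""
    projected = [tuple(row[i] for i in indexes) for row in rows]
    return len(set(projected)) == len(projected)

def superkeys(rows, n_columns):
    """All column-index combinations that are unique across the rows."""
    found = []
    for size in range(1, n_columns + 1):
        for indexes in combinations(range(n_columns), size):
            if is_superkey(rows, indexes):
                found.append(indexes)
    return found

def candidate_keys(keys):
    """Keep superkeys that have no proper subset that is also a superkey."""
    return [k for k in keys if not any(set(o) < set(k) for o in keys)]

all_superkeys = superkeys(rows, len(columns))
for key in all_superkeys:
    print([columns[i] for i in key])
print("candidate keys:", [[columns[i] for i in k] for k in candidate_keys(all_superkeys)])
# candidate keys: [['course_code']]
```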

Boyce-Codd Normal Form says that if there is a functional dependency A → B, then
either A is a superkey or it is a trivial functional dependency. A trivial functional
dependency means that all columns of B are contained in the columns of A. For
instance, (course code, professor name) → (course code) is a trivial functional
dependency because when we know the value of course code and professor name,
we do know the value of course code and so, the dependency becomes trivial.

A is a superkey: this means that other columns should depend only on a superkey.
In other words, if a set of columns (B) can be determined by knowing some other set of
columns (A), then A should be a superkey. A superkey determines each row uniquely.

It is a trivial functional dependency: this means that a dependency whose left side is
not a superkey is allowed only if it is trivial; there should be no other non-trivial
dependency. For instance, we saw how the professor’s department was dependent
on the professor’s name. This may create integrity issues, since someone may edit the
professor’s name without changing the department, which may lead to an inconsistent
database.
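
The BCNF condition can be checked mechanically once the functional dependencies are written down. The following minimal sketch computes an attribute closure and reports the non-trivial dependencies whose left side is not a superkey. The attribute names and the two dependencies are assumptions chosen to mirror the professor and department example, and the function names are hypothetical.

```python
def closure(attributes, fds):
    """All attributes determined by `attributes` under the given FDs."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(all_attributes, fds):
    """Return the listed non-trivial FDs whose left side is not a superkey."""
    violations = []
    for lhs, rhs in fds:
        trivial = rhs <= lhs
        is_superkey = closure(lhs, fds) == set(all_attributes)
        if not trivial and not is_superkey:
            violations.append((lhs, rhs))
    return violations

# Assumed dependencies: a course has one professor, and a professor
# belongs to one department.
attributes = {"course_code", "professor_name", "department"}
fds = [
    (frozenset({"course_code"}), frozenset({"professor_name"})),
    (frozenset({"professor_name"}), frozenset({"department"})),
]
violations = bcnf_violations(attributes, fds)
for lhs, rhs in violations:
    print(sorted(lhs), "->", sorted(rhs))  # ['professor_name'] -> ['department']
```

Here course_code determines every attribute, so it is a superkey, while professor_name determines only the department and therefore violates the condition.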

Another example would be a company whose employees work in more than one
department. The corresponding database can be decomposed into tables in which the
functional dependencies are determined by keys such as employee ID and employee
department.

Fourth Normal Form (4NF)


Definition: 4NF is a level of database normalization that builds upon the first three
normal forms (1NF, 2NF, and 3NF) and the Boyce-Codd Normal Form (BCNF).

o Objective: It ensures that there are no non-trivial multivalued dependencies
  other than those determined by a candidate key.
o Properties:
  • The relation must already satisfy the requirements of BCNF.
  • It should not contain more than one multivalued dependency.
o Example: Consider the schema, with a case as “if a company makes a
  product and an agent is an agent for that company, then he always
  sells that product for the company”. Under these circumstances, the
  ACP table is shown as:

Table ACP

Agent   Company   Product
A1      PQR/XYZ   Nut/Bolt
A2      PQR       Nut
A3      XYZ       Bolt

Here, the multivalued dependencies are:

• Agent ->-> Company
• Agent ->-> Product

This is read as “Agent multidetermines Company” and “Agent multidetermines
Product.” For example, agent A1 is associated with a set of companies (PQR, XYZ) and,
independently, with a set of products (Nut, Bolt).
Note that a functional dependency is a special case of multivalued dependency. In a
functional dependency X -> Y, every x determines exactly one y, never more than one.

With exactly one value per cell, the same table becomes:

Agent   Company   Product
A1      PQR       Nut
A1      PQR       Bolt
A1      XYZ       Nut
A1      XYZ       Bolt
A2      PQR       Nut
A3      XYZ       Bolt

To eliminate the redundancy, we can break this information into two tables: one for
agent_company and another for agent_product.

Table R1

Agent   Company
A1      PQR
A1      XYZ
A2      PQR
A3      XYZ

Table R2

Agent   Product
A1      Nut
A1      Bolt
A2      Nut
A3      Bolt
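
That this decomposition loses no information can be checked with a short Python sketch that joins R1 and R2 on Agent and compares the result with the six-row ACP table. The rows are copied from the tables above; the variable names are illustrative only.

```python
# Rows copied from Table R1 (agent, company) and Table R2 (agent, product).
r1 = [("A1", "PQR"), ("A1", "XYZ"), ("A2", "PQR"), ("A3", "XYZ")]
r2 = [("A1", "Nut"), ("A1", "Bolt"), ("A2", "Nut"), ("A3", "Bolt")]

# The six-row ACP table with one value per cell.
acp = {
    ("A1", "PQR", "Nut"), ("A1", "PQR", "Bolt"),
    ("A1", "XYZ", "Nut"), ("A1", "XYZ", "Bolt"),
    ("A2", "PQR", "Nut"), ("A3", "XYZ", "Bolt"),
}

# Natural join of R1 and R2 over the shared Agent column.
joined = {
    (agent, company, product)
    for agent, company in r1
    for other_agent, product in r2
    if agent == other_agent
}

print(joined == acp)  # True: the join reproduces ACP exactly
```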

Fifth Normal Form (5NF)


Also known as Project-Join Normal Form (PJNF), 5NF is the highest level of
normalization covered here.

o Objective: It involves decomposing a table into smaller tables to
  remove data redundancy and improve data integrity.
o Condition:
  • A relation is in 5NF if every join dependency in that relation is
    implied by the candidate keys.
o Join Dependency (JD):
  • A join dependency is a generalization of a multivalued dependency.
  • If the join of two decomposed relations over a common attribute
    is equal to the original relation, a join dependency exists.

If R1(A, B, C) and R2(C, D) are a decomposition of a given relation R(A, B, C, D),
and the join of R1 and R2 over C is equal to R, then we can say that a join
dependency (JD) exists. Equivalently, R1 and R2 are a lossless decomposition of R.
This JD is written *((A, B, C), (C, D)). In general, a JD ⋈ {R1, R2, …, Rn}, also
written *(R1, R2, …, Rn), is said to hold over a relation R if R1, R2, …, Rn is a
lossless-join decomposition of R. Let R be a relation schema and R1, R2, …, Rn be a
decomposition of R. An instance r(R) is said to satisfy the join dependency if and
only if

    r = π_R1(r) ⋈ π_R2(r) ⋈ … ⋈ π_Rn(r)

where π_Ri(r) is the projection of r onto the attributes of Ri.

Example – Consider the above schema, with a case as “if a company makes a
product and an agent is an agent for that company, then he always sells that
product for the company”. Under these circumstances, the ACP table is shown as:

Table ACP

Agent   Company   Product
A1      PQR       Nut
A1      PQR       Bolt
A1      XYZ       Nut
A1      XYZ       Bolt
A2      PQR       Nut

The relation ACP is now decomposed into 3 relations, R1, R2, and R3, whose natural
join gives back ACP:

Table R1

Agent   Company
A1      PQR
A1      XYZ
A2      PQR

Table R2

Agent   Product
A1      Nut
A1      Bolt
A2      Nut

Table R3

Company   Product
PQR       Nut
PQR       Bolt
XYZ       Nut
XYZ       Bolt

Taking the natural join of R1 and R3 over ‘Company’, and then the natural join of
that result with R2 over ‘Agent’ and ‘Product’, reproduces the table ACP. The
decomposition is therefore a lossless join decomposition, and the decomposed
relations R1, R2, and R3 are in 5NF.
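
The three-way join can be verified with a minimal Python sketch. The rows are copied from tables R1, R2, R3, and ACP above; the variable names are illustrative. A row survives the join only if its (agent, company), (agent, product), and (company, product) pairs each appear in the corresponding relation.

```python
# Rows copied from the tables above.
r1 = {("A1", "PQR"), ("A1", "XYZ"), ("A2", "PQR")}                   # agent, company
r2 = {("A1", "Nut"), ("A1", "Bolt"), ("A2", "Nut")}                  # agent, product
r3 = {("PQR", "Nut"), ("PQR", "Bolt"), ("XYZ", "Nut"), ("XYZ", "Bolt")}  # company, product

acp = {
    ("A1", "PQR", "Nut"), ("A1", "PQR", "Bolt"),
    ("A1", "XYZ", "Nut"), ("A1", "XYZ", "Bolt"),
    ("A2", "PQR", "Nut"),
}

# Natural join of R1, R2, and R3: match on Agent, then keep only the
# (company, product) pairs that R3 allows.
joined = {
    (agent, company, product)
    for agent, company in r1
    for other_agent, product in r2
    if agent == other_agent and (company, product) in r3
}

print(joined == acp)  # True: the join of the three projections equals ACP
```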

Conclusion

• Multivalued dependencies are removed by 4NF, and join dependencies
  are removed by 5NF.
• The highest levels of database normalization discussed here, 4NF and 5NF,
  might not be required for every application.
• Normalizing to 4NF and 5NF might result in more complicated database
  structures and slower query speed, but it can also increase data accuracy,
  dependability, and simplicity.
