Data Normalization
The table stores the course code, course venue, instructor name, and instructor's
phone number. At first glance, this design seems fine. However, problems appear as
soon as we need to modify information. Suppose Prof. George changes his mobile
number: we now have to make the edit in 2 places.
What if someone edits the mobile number against CS101 but forgets to edit it for
CS154? That leaves stale/wrong information in the database. The problem can be
tackled easily by dividing our table into 2 simpler tables:
Table 1 (Instructor):
Instructor ID
Instructor Name
Instructor mobile number
Table 2 (Course):
Course code
Course venue
Instructor ID
Here, we store the instructors separately, and in the course table we do not store
the instructor's full details. Rather, we store only the ID of the instructor. Now,
anyone who wants the mobile number of an instructor can simply look it up in the
instructor table. Also, if Prof. George's mobile number changes, the update happens
in exactly one place. This avoids the stale/wrong data problem.
Further, notice that the mobile number no longer needs to be stored twice; it lives
in just one place, which also saves storage. This may not seem significant in such a
simple example. However, think of the case where there are hundreds of courses and
instructors, and for each instructor we store not just the mobile number but also
an office address, email address, specialization, availability, and so on. Replicating
that much data would increase the storage requirement unnecessarily.
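The two-table split above can be sketched in SQL. This is a minimal sketch using Python's built-in sqlite3 module; the table and column names (instructor, course, instructor_id, and the sample phone numbers) are illustrative assumptions, not taken from a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Instructor details live in exactly one table...
cur.execute("CREATE TABLE instructor (id INTEGER PRIMARY KEY, name TEXT, mobile TEXT)")
# ...and each course references the instructor only by ID.
cur.execute("""CREATE TABLE course (code TEXT PRIMARY KEY, venue TEXT,
               instructor_id INTEGER REFERENCES instructor(id))""")

cur.execute("INSERT INTO instructor VALUES (1, 'Prof. George', '9999999999')")
cur.executemany("INSERT INTO course VALUES (?, ?, ?)",
                [("CS101", "Hall A", 1), ("CS154", "Hall B", 1)])

# Changing the mobile number is a single-row update in one place...
cur.execute("UPDATE instructor SET mobile = '8888888888' WHERE id = 1")

# ...and every course that joins to the instructor sees the new number.
rows = cur.execute("""
    SELECT course.code, instructor.mobile
    FROM course JOIN instructor ON course.instructor_id = instructor.id
    ORDER BY course.code
""").fetchall()
print(rows)
```

Because the number is stored once, there is no second copy that can go stale: both CS101 and CS154 report the updated number through the join.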
Clearly, the student name column isn't unique: there are 2 entries corresponding to
the name 'Rahul', in row 1 and row 3. Similarly, the course code column is not
unique: course code CS101 appears in both row 2 and row 4.
However, the pair (student name, course code) is unique, since a student cannot
enroll in the same course more than once. So these 2 columns, when combined, form
the primary key for the table.
By the definition of second normal form, our enrollment table above isn't in the
second normal form. To achieve it (1NF to 2NF), we can break the table into 2
tables:
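The 1NF-to-2NF split can be sketched with plain Python sets. The enrollment rows themselves are not reproduced in the text above, so the rows below (including the names Rajat and Sonia and the venue column) are assumed sample data, arranged so that 'Rahul' appears in rows 1 and 3 and CS101 in rows 2 and 4, matching the description:

```python
# Hypothetical enrollment rows: (student name, course code, course venue).
# Venue depends on course code alone -- a partial dependency on the
# composite key (student name, course code), which violates 2NF.
rows = [
    ("Rahul", "MA214", "Hall B"),
    ("Rajat", "CS101", "Hall A"),
    ("Rahul", "CS154", "Hall C"),
    ("Sonia", "CS101", "Hall A"),
]

# 2NF decomposition: move the partially-dependent column into its own table.
enrollment = sorted({(student, code) for student, code, _ in rows})
course = sorted({(code, venue) for _, code, venue in rows})

print(enrollment)  # one row per (student, course) pair
print(course)      # venue stored once per course, not once per enrollment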
Third Normal Form (3NF)
Before we delve into the details of third normal form, let us understand the concept
of a functional dependency on a table.
Here, the department column depends on the professor name column: if we change the
name of the professor in a particular row, we must also change the department value.
As an example, suppose MA214 is now taken by Prof. Ronald, who happens to be from
the mathematics department; the table will then look like this:
Here, when we changed the name of the professor, we also had to change the
department column. This is undesirable: someone updating the database may remember
to change the name of the professor but forget to update the department value,
causing inconsistency in the database.
Third normal form avoids this by breaking the data into separate tables: we store
the details of the professor against his/her ID. This way, whenever we want to
reference the professor somewhere, we don't have to repeat the other details of the
professor in that table. We can simply use the ID.
Therefore, in third normal form, every functional dependency A -> B must satisfy at
least one of the following conditions:
A is a superkey, or
the dependency is trivial (B is a subset of A), or
every attribute in B that is not in A is part of some candidate key.
For the course table above, some superkeys are:
Course code
Course code, professor name
Course code, professor mobile number
A superkey from which no column can be removed while it still remains a superkey
(a minimal superkey) is called a candidate key. For instance, the first superkey
above has just 1 column, while the other two have 2 columns each, and dropping the
extra column still leaves a superkey. So the first superkey (Course code) is a
candidate key.
A is a superkey: this means that only a superkey should have other columns
depending on it. In other words, if a set of columns B can be determined by knowing
some other set of columns A, then A should be a superkey. A superkey is a set of
columns that determines each row uniquely.
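Whether a set of columns is a superkey can be checked mechanically by computing its attribute closure under the functional dependencies. Below is a small sketch of that standard closure algorithm; the attribute names (code, professor, department) and the two assumed dependencies mirror the course/professor example above:

```python
def closure(attrs, fds):
    """Closure of an attribute set under functional dependencies.

    fds is a list of (lhs, rhs) frozenset pairs. A set A is a superkey
    iff closure(A) contains every attribute of the relation.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already determined, absorb the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# Assumed dependencies: course code -> professor, professor -> department.
fds = [
    (frozenset({"code"}), frozenset({"professor"})),
    (frozenset({"professor"}), frozenset({"department"})),
]
all_attrs = {"code", "professor", "department"}

print(closure({"code"}, fds) == all_attrs)       # {code} is a superkey
print(closure({"professor"}, fds) == all_attrs)  # {professor} is not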
Another example would be a company whose employees work in more than one
department. The corresponding table can be decomposed into smaller tables keyed on
employee ID and on the employee-department relationship.
Table ACP
Agent Company Product
A1 PQR/XYZ Nut/Bolt
A2 PQR Nut
A3 XYZ Bolt
Here, Agent ->> Company and Agent ->> Product hold. This is read as "Agent
multi-determines Company" and "Agent multi-determines Product."
Note that a functional dependency is a special case of multivalued dependency. In a
functional dependency X -> Y, every x determines exactly one y, never more than one.
Written out row by row, the ACP table is:
Agent Company Product
A1 PQR Nut
A1 PQR Bolt
A1 XYZ Nut
A1 XYZ Bolt
A2 PQR Nut
A3 XYZ Bolt
To eliminate the redundancy, we can break this information into two tables: one for
agent_company and another for agent_product.
Table R1
Agent Company
A1 PQR
A1 XYZ
A2 PQR
A3 XYZ
Table R2
Agent Product
A1 Nut
A1 Bolt
A2 Nut
A3 Bolt
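That this 4NF decomposition loses nothing can be checked by taking the natural join of R1 and R2 over the shared Agent column. A minimal sketch using plain tuples (no database needed):

```python
# Decomposed relations from the text above.
r1 = [("A1", "PQR"), ("A1", "XYZ"), ("A2", "PQR"), ("A3", "XYZ")]   # agent_company
r2 = [("A1", "Nut"), ("A1", "Bolt"), ("A2", "Nut"), ("A3", "Bolt")]  # agent_product

# Natural join over the shared Agent column.
acp = sorted({(a, c, p) for a, c in r1 for a2, p in r2 if a == a2})
for row in acp:
    print(row)
```

The join produces exactly the six rows of the expanded ACP table, confirming that splitting out the two multivalued dependencies is a lossless decomposition.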
Example – Consider the same schema, now with the additional constraint: "if a
company makes a product and an agent is an agent for that company, then the agent
always sells that product for the company." Under this constraint, the ACP table
looks like this:
Table ACP
Agent Company Product
A1 PQR Nut
A1 PQR Bolt
A1 XYZ Nut
A1 XYZ Bolt
A2 PQR Nut
The relation ACP is now decomposed into 3 relations, shown below. The natural join
of all three reproduces ACP:
Table R1
Agent Company
A1 PQR
A1 XYZ
A2 PQR
Table R2
Agent Product
A1 Nut
A1 Bolt
A2 Nut
Table R3
Company Product
PQR Nut
PQR Bolt
XYZ Nut
XYZ Bolt
Taking the natural join of R1 and R3 over 'Company', and then joining the result
with R2 over 'Agent' and 'Product', reproduces ACP exactly: the decomposition is
lossless. Therefore, the relation is in 5NF, as it does not violate the lossless
join property.
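The two-step join just described can be traced with plain tuples. A minimal sketch, using the R1, R2, and R3 rows from the tables above:

```python
r1 = [("A1", "PQR"), ("A1", "XYZ"), ("A2", "PQR")]               # Agent-Company
r2 = [("A1", "Nut"), ("A1", "Bolt"), ("A2", "Nut")]              # Agent-Product
r3 = [("PQR", "Nut"), ("PQR", "Bolt"),
      ("XYZ", "Nut"), ("XYZ", "Bolt")]                           # Company-Product

# Step 1: natural join of R1 and R3 over Company.
step1 = {(a, c, p) for a, c in r1 for c2, p in r3 if c == c2}
# Step 2: natural join with R2 over (Agent, Product).
acp = sorted({(a, c, p) for a, c, p in step1 if (a, p) in set(r2)})
for row in acp:
    print(row)
```

The intermediate join contains spurious rows such as (A2, PQR, Bolt); the second join with R2 filters them out, leaving exactly the five rows of ACP. That is the lossless three-way join dependency that 5NF captures.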
Conclusion
Multivalued dependencies are removed by 4NF, and join dependencies
are removed by 5NF.
The highest degrees of database normalization, 4NF and 5NF, may not be
required for every application.
Normalizing to 4NF and 5NF can result in more complicated database
structures and slower queries, but it can also improve data accuracy,
reliability, and consistency.