Normalization
Normalization is performed to reduce or eliminate insertion, deletion and update anomalies. It does so by splitting database information across multiple tables, so retrieving complete information from a normalized database requires the JOIN operation. JOIN tends to be expensive in terms of processing time, and very large joins are very expensive. A completely normalized database may therefore not be the most efficient or effective implementation, and denormalization is sometimes used to improve efficiency.
Normalization (contd.)
Introduction
In this exercise we are looking at the optimisation of data structure. The example system we are going to use as a model is a database to keep track of employees of an organisation working on different projects.
Objectives
By the end of the exercise you should be able to:
- Show understanding of why we normalize data
- Give formal definitions of 1NF, 2NF and 3NF
- Apply the process of normalization to your own project
The Scenario
The data we would want to store could be expressed as:
Project No  Project Name            Employee No  Employee Name    Rate category  Rate
1203        Madagascar travel site  11           Jessica Brookes  A              90
                                    12           Andy Evans       B              80
                                    16           Max Fat          C              70
1506        (not given)             11           Jessica Brookes  A              90
                                    17           Alex Branton     B              80
Why Normalization?
Three problems become apparent with our current model:
- Tables in an RDBMS use a simple grid structure. Each project has a set of employees, so we can't even use this format to enter data into a table. How would you construct a query to find the employees working on each project?
- All tables in an RDBMS need a key. Each record in an RDBMS must have a unique identity. Which field should be the primary key?
- Data entry should be kept to a minimum. Our main problem is that each project contains repeating groups, which lead to redundancy and inconsistency.
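Written as a simple grid, with one row per project and employee, the data looks like this:
Project No  Employee No  Employee Name    Rate category  Rate
1203        11           Jessica Brookes  A              90
1203        12           Andy Evans       B              80
1203        16           Max Fat          C              70
1506        11           Jessica Brookes  A              90
1506        17           Alex Branton     B              80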
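The inconsistency risk mentioned above can be sketched with a small in-memory SQLite database. This is a minimal illustration, not part of the original exercise: the table name `flat` and the column names are assumptions, and the new rate of 95 is made up for the demonstration. It shows how changing the rate on one row of the flat layout leaves the same rate category with two different rates.

```python
import sqlite3

# Flat, unnormalized layout: one row per (project, employee) pair.
# Table and column names are illustrative, not taken from the slides.
ROWS = [
    (1203, 11, "Jessica Brookes", "A", 90),
    (1203, 12, "Andy Evans", "B", 80),
    (1203, 16, "Max Fat", "C", 70),
    (1506, 11, "Jessica Brookes", "A", 90),
    (1506, 17, "Alex Branton", "B", 80),
]


def build_flat_table():
    """Create an in-memory database holding the flat project/employee rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE flat (project_no INTEGER, employee_no INTEGER, "
        "employee_name TEXT, rate_category TEXT, rate INTEGER)"
    )
    conn.executemany("INSERT INTO flat VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn


def rates_for_category(conn, category):
    """Return the distinct rates currently stored for one rate category."""
    cur = conn.execute(
        "SELECT DISTINCT rate FROM flat WHERE rate_category = ? ORDER BY rate",
        (category,),
    )
    return [row[0] for row in cur]


conn = build_flat_table()
# Update anomaly: the rate is changed on one row only (95 is a made-up value).
conn.execute(
    "UPDATE flat SET rate = 95 WHERE project_no = 1203 AND employee_no = 11"
)
print(rates_for_category(conn, "A"))  # prints [90, 95]
```

Because the rate is stored once per row rather than once per category, nothing in the flat layout stops the two copies from drifting apart.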
Normalization Process
The solution is simply to take out the duplication. We do this by:
Identifying a key
In this case we can use the project no and employee no together to uniquely identify each row.
Project No  Employee No  Unique Identifier
1203        11           120311
1203        12           120312
1203        16           120316
1506        11           150611
1506        17           150617
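As a rough sketch of what a key made from project no and employee no means in practice, the snippet below declares a composite primary key in SQLite and shows that a repeated pair is rejected. The table name `project_employee` and the helper names are assumptions made for this example.

```python
import sqlite3


def make_assignment_table():
    """Create a table whose primary key is the pair (project_no, employee_no)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE project_employee ("
        "project_no INTEGER NOT NULL, "
        "employee_no INTEGER NOT NULL, "
        "PRIMARY KEY (project_no, employee_no))"
    )
    return conn


def try_insert(conn, project_no, employee_no):
    """Insert one pair; return False if the composite key rejects it."""
    try:
        conn.execute(
            "INSERT INTO project_employee VALUES (?, ?)",
            (project_no, employee_no),
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_assignment_table()
# Employee 11 may appear on two projects, but the same pair may not repeat.
results = [try_insert(conn, p, e) for p, e in [(1203, 11), (1506, 11), (1203, 11)]]
print(results)  # prints [True, True, False]
```

Neither column is unique on its own; only the combination identifies a row.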
We remove partial dependencies
The fields listed are only dependent on part of the key, so we remove them from the table.
Partially dependent on Employee No: Employee Name, Rate category, Rate
Partially dependent on Project No: Project Name
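One way to look for partial dependencies is to test, on the sample data, whether a single part of the key is already enough to fix the value of a non-key field. The sketch below does this in plain Python. The function names are hypothetical, the name used for project 1506 is a placeholder because the slides do not give one, and a check like this only shows that the data is consistent with a dependency; it cannot prove the dependency holds in general.

```python
from itertools import combinations

# Sample rows in 1NF. The name for project 1506 is a placeholder,
# since the original slides do not show it.
ROWS = [
    {"project_no": 1203, "project_name": "Madagascar travel site",
     "employee_no": 11, "employee_name": "Jessica Brookes",
     "rate_category": "A", "rate": 90},
    {"project_no": 1203, "project_name": "Madagascar travel site",
     "employee_no": 12, "employee_name": "Andy Evans",
     "rate_category": "B", "rate": 80},
    {"project_no": 1203, "project_name": "Madagascar travel site",
     "employee_no": 16, "employee_name": "Max Fat",
     "rate_category": "C", "rate": 70},
    {"project_no": 1506, "project_name": "Project 1506 (placeholder)",
     "employee_no": 11, "employee_name": "Jessica Brookes",
     "rate_category": "A", "rate": 90},
    {"project_no": 1506, "project_name": "Project 1506 (placeholder)",
     "employee_no": 17, "employee_name": "Alex Branton",
     "rate_category": "B", "rate": 80},
]

KEY = ("project_no", "employee_no")
NON_KEY = ("project_name", "employee_name", "rate_category", "rate")


def determines(rows, lhs, rhs):
    """True if equal values in the lhs columns always come with equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def partial_dependencies(rows, key, non_key):
    """Map each non-key column to the proper subsets of the key that determine it."""
    found = {}
    for col in non_key:
        for size in range(1, len(key)):
            for subset in combinations(key, size):
                if determines(rows, subset, col):
                    found.setdefault(col, []).append(subset)
    return found


for column, parts in partial_dependencies(ROWS, KEY, NON_KEY).items():
    print(column, "depends on", parts)
```

On this data the project name depends on the project no alone, and the other three fields depend on the employee no alone, which matches the split described above.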
tblProjects
Project No  Project Name
1203        Madagascar travel site
1506        (not given)

The original table keeps only the key fields:
Project No  Employee No
1203        11
1203        12
1203        16
1506        11
1506        17
tblEmployees
Employee No  Employee Name    Rate Category  Rate
11           Jessica Brookes  A              90
12           Andy Evans       B              80
16           Max Fat          C              70
17           Alex Branton     B              80
Again, we have stored redundant data: the relationship between rate category and hourly rate is stored in full for every employee, i.e. we have to key in both the rate category AND the hourly rate.
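tblProjects
Project No  Project Name
1203        Madagascar travel site
1506        (not given)

Project No  Employee No
1203        11
1203        12
1203        16
1506        11
1506        17

tblEmployees
Employee No  Employee Name    Rate Category
11           Jessica Brookes  A
12           Andy Evans       B
16           Max Fat          C
17           Alex Branton     B

tblRates
Rate Category  Rate
A              90
B              80
C              70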
(1n indicates that there are many occurrences of the field; it is a repeating group.)
To begin the normalization process, we start by moving from zero normal form to 1st normal form.
1st Normal Form
The definition of 1st normal form:
- There are no repeating groups
- All the key attributes are defined
- All attributes are dependent on the primary key
So far, we have no keys, and there are repeating groups. So we remove the repeating groups and define the keys, and are left with:
Employee Project table
- Project number (part of key)
- Project name
- Employee number (part of key)
- Employee name
- Rate category
- Hourly rate
This table is in first normal form (1NF).
The tables are now in 2nd normal form (2NF). Are they in 3rd normal form?
3rd Normal Form
A table is in 3rd normal form if:
- It's already in second normal form
- It includes no transitive dependencies (where a non-key attribute is dependent on another non-key attribute)
We can narrow our search down to the Employee table, which is the only one with more than one non-key attribute. Employee name is not dependent on either Rate category or Hourly rate, and Rate category is not dependent on either of the other two, but Hourly rate is dependent on Rate category.
So, as before, we remove it, placing it in its own table, with the attribute it was dependent on as key.
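The split just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_out_rates` and the tuple layout are invented for the example, and the rows use the employee data from the exercise.

```python
# Employee rows in 2NF: (employee no, employee name, rate category, hourly rate).
EMPLOYEES_2NF = [
    (11, "Jessica Brookes", "A", 90),
    (12, "Andy Evans", "B", 80),
    (16, "Max Fat", "C", 70),
    (17, "Alex Branton", "B", 80),
]


def split_out_rates(employees):
    """Split employee rows into slimmer employee rows and a category-to-rate lookup."""
    rates = {}
    slim = []
    for emp_no, name, category, rate in employees:
        # A category seen with two different rates would mean that
        # the rate is not really determined by the category.
        if rates.setdefault(category, rate) != rate:
            raise ValueError(f"conflicting rates for category {category}")
        slim.append((emp_no, name, category))
    return slim, rates


tbl_employees, tbl_rates = split_out_rates(EMPLOYEES_2NF)
print(tbl_employees)
print(tbl_rates)
```

Each hourly rate is now stored once, keyed by rate category, and the employee rows only carry the category.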
Project Table
- Project number (primary key)
- Project name
These tables are all now in 3rd normal form and ready to be implemented.
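A possible implementation of the 3rd normal form design, using Python's built-in sqlite3 module, is sketched below. The table names tblProjects, tblEmployees and tblRates come from the exercise; `tblProjectEmployees`, the column names and the function names are assumptions made for this example, and the name of project 1506 is stored as NULL because the slides do not give it. The query at the end uses JOIN to reassemble the complete information.

```python
import sqlite3

# tblProjectEmployees and all column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE tblRates (
    rate_category TEXT PRIMARY KEY,
    rate INTEGER NOT NULL
);
CREATE TABLE tblEmployees (
    employee_no INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL,
    rate_category TEXT NOT NULL REFERENCES tblRates(rate_category)
);
CREATE TABLE tblProjects (
    project_no INTEGER PRIMARY KEY,
    project_name TEXT
);
CREATE TABLE tblProjectEmployees (
    project_no INTEGER NOT NULL REFERENCES tblProjects(project_no),
    employee_no INTEGER NOT NULL REFERENCES tblEmployees(employee_no),
    PRIMARY KEY (project_no, employee_no)
);
"""


def build_database():
    """Create the four tables in memory and load the data from the exercise."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO tblRates VALUES (?, ?)",
        [("A", 90), ("B", 80), ("C", 70)],
    )
    conn.executemany(
        "INSERT INTO tblEmployees VALUES (?, ?, ?)",
        [(11, "Jessica Brookes", "A"), (12, "Andy Evans", "B"),
         (16, "Max Fat", "C"), (17, "Alex Branton", "B")],
    )
    conn.executemany(
        "INSERT INTO tblProjects VALUES (?, ?)",
        # The name of project 1506 is not given in the slides, so it stays NULL.
        [(1203, "Madagascar travel site"), (1506, None)],
    )
    conn.executemany(
        "INSERT INTO tblProjectEmployees VALUES (?, ?)",
        [(1203, 11), (1203, 12), (1203, 16), (1506, 11), (1506, 17)],
    )
    return conn


def project_report(conn):
    """Join the four tables back into one row per project and employee."""
    cur = conn.execute(
        "SELECT pe.project_no, p.project_name, e.employee_no, "
        "e.employee_name, e.rate_category, r.rate "
        "FROM tblProjectEmployees AS pe "
        "JOIN tblProjects AS p ON p.project_no = pe.project_no "
        "JOIN tblEmployees AS e ON e.employee_no = pe.employee_no "
        "JOIN tblRates AS r ON r.rate_category = e.rate_category "
        "ORDER BY pe.project_no, e.employee_no"
    )
    return cur.fetchall()


for row in project_report(build_database()):
    print(row)
```

The report needs three joins to rebuild what the original flat layout held in a single row, which is the processing cost that the opening slide weighs against the removal of redundant data.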
Summary
What normal form is this table in? Giving it a quick glance, we see no repeating groups and a primary key defined, so it's at least in 1st normal form. The key is a single field, so we needn't even look for partial dependencies, and it's at least in 2nd normal form. How about transitive dependencies? Well, it looks like Town might be determined by Postcode. And in most parts of the world that's usually the case.
So we should remove Town, and place it in a separate table, with Postcode as the key?
Summary (contd.)
No! Although this table is not technically in 3rd normal form, removing this information is not worth it. Creating more tables increases the load slightly, slowing processing down. This is often counteracted by the reduction in table sizes and in redundant data. But in this case, where the town would almost always be referenced as part of the address, it isn't worth it. Perhaps a company that uses the data to produce regular mailing lists of thousands of customers should normalize fully. It always comes down to how the data is going to be used. Normalization is a helpful process that usually results in the most efficient table structure, not a fixed rule for database design.