DBMS
Data is a collection of facts and statistics, stored or flowing freely over a network. It is
generally raw and unprocessed. For example, when you visit a website, it might store your IP
address; that is data. In return, it might add a cookie to your browser to mark that you visited
the website; that is data too. Your name is data, and so is your age.
Data becomes information when it is processed, turning it into something meaningful. For
example, if a website analyses the cookie data saved on its users' browsers and finds that men
aged 20-25 visit it most often, that is information, derived from the data collected.
What is a Database?
A Database is a collection of related data organised in a way that data can be easily accessed,
managed and updated. Database can be software based or hardware based, with one sole
purpose, storing data.
During the early days of computing, data was collected and stored on tapes, which allowed only
sequential access: to reach a particular record, the tape had to be read from the beginning. They
were slow and bulky, and soon computer scientists realised that they needed a better solution to
this problem.
Larry Ellison, the co-founder of Oracle, was among the first to realise the need for a
software-based Database Management System.
What is DBMS?
A DBMS is software that allows the creation, definition and manipulation of databases, allowing
users to store, process and analyse data easily. A DBMS provides us with an interface or a tool to
perform various operations like creating a database, storing data in it, updating data, creating
tables in the database and a lot more.
DBMS also provides protection and security to the databases. It also maintains data consistency
in case of multiple users.
Here are some examples of popular DBMSs:
• MySQL
• Oracle
• SQL Server
• IBM DB2
• PostgreSQL
• Amazon SimpleDB (cloud based) etc.
A database management system has the following characteristics:
1. Data stored into Tables: Data is never directly stored into the database. Data is stored
into tables, created inside the database. A DBMS also allows relationships between
tables, which makes the data more meaningful and connected. You can easily understand
what type of data is stored where by looking at all the tables created in a database.
2. Reduced Redundancy: In the modern world hard drives are very cheap, but earlier when
hard drives were too expensive, unnecessary repetition of data in database was a big
problem. But DBMS follows Normalisation which divides the data in such a way that
repetition is minimum.
3. Data Consistency: On live data, i.e. data that is being continuously updated and added,
maintaining the consistency of data can become a challenge. The DBMS handles this by
itself.
4. Support for Multiple Users and Concurrent Access: A DBMS allows multiple users to work
on it (update, insert, delete data) at the same time and still manages to maintain data
consistency.
5. Query Language: DBMS provides users with a simple Query language, using which
data can be easily fetched, inserted, deleted and updated in a database.
6. Security: The DBMS also takes care of the security of data, protecting the data from
unauthorised access. In a typical DBMS, we can create user accounts with different access
permissions, using which we can easily secure our data by restricting user access.
7. Transactions: A DBMS supports transactions, which allow us to better handle and manage
data integrity in real world applications where multi-threading is extensively used.
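To make the idea of a transaction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The account table, its CHECK rule and the transfer helper are assumptions made up for illustration; they are not part of any particular DBMS described here. The point is only that the two updates either both take effect or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
# Hypothetical table: the CHECK rule forbids negative balances.
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES (1, 100), (2, 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK rule, so the credit was undone too.
        return False


ok = transfer(conn, 1, 2, 30)       # enough money: both updates are kept
failed = transfer(conn, 1, 2, 500)  # not enough money: both updates are undone
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

In the failing call the credit is applied first and then rolled back when the debit is rejected, so the balances stay exactly as the successful transfer left them.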
Advantages of DBMS
• Segregation of application programs.
• Minimal data duplication or data redundancy.
• Easy retrieval of data using the Query Language.
• Reduced development time and maintenance need.
• With cloud datacenters, we now have Database Management Systems capable of storing
almost unlimited data.
• Seamless integration with application programming languages, which makes it much
easier to add a database to almost any application or website.
Disadvantages of DBMS
• Its complexity.
• Licensed DBMSs are generally costly, although open source options such as MySQL and
PostgreSQL exist.
• They are large in size.
Components of DBMS
The database management system can be divided into five major components, they are:
1. Hardware
2. Software
3. Data
4. Procedures
5. Database Access Language
Let's have a simple diagram to see how they all fit together to form a database management
system.
DBMS Components: Hardware
When we say Hardware, we mean computer, hard disks, I/O channels for data, and any other
physical component involved before any data is successfully stored into the memory.
When we run Oracle or MySQL on our personal computer, then our computer's Hard Disk, our
Keyboard using which we type in all the commands, our computer's RAM, ROM all become a
part of the DBMS hardware.
DBMS Components: Software
This is the main component, as this is the program which controls everything. The DBMS
software is more like a wrapper around the physical database, which provides us with an
easy-to-use interface to store, access and update data.
The DBMS software is capable of understanding the Database Access Language and interpreting it
into actual database commands to execute them on the DB.
DBMS Components: Data
Data is the resource for which the DBMS was designed. The motive behind the creation of DBMS
was to store and utilise data.
A typical database stores both the data saved by the user and metadata.
Metadata is data about the data. This is information stored by the DBMS to better understand
the data stored in it.
For example: When I store my Name in a database, the DBMS will store when the name was
stored in the database, what is the size of the name, is it stored as related data to some other data,
or is it independent, all this information is metadata.
DBMS Components: Procedures
Procedures refer to general instructions for using a database management system. This includes
procedures to set up and install a DBMS, to log in and log out of the DBMS software, to manage
databases, to take backups, to generate reports etc.
DBMS Components: Database Access Language
Database Access Language is a simple language designed to write commands to access, insert,
update and delete data stored in any database.
A user can write commands in the Database Access Language and submit them to the DBMS, which
then translates and executes them.
Users can create new databases and tables, insert data, fetch stored data, update data and delete
data using the access language.
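As an illustration of what such commands look like, the sketch below uses SQL (one widely used database access language) through Python's built-in sqlite3 module. The student table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Create a table, then insert, update, delete and finally fetch data.
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO student (id, name, age) VALUES (1, 'Adam', 17)")
conn.execute("INSERT INTO student (id, name, age) VALUES (2, 'Alex', 19)")
conn.execute("UPDATE student SET age = 18 WHERE id = 1")
conn.execute("DELETE FROM student WHERE id = 2")

rows = conn.execute("SELECT id, name, age FROM student").fetchall()
```

Each string passed to execute is a command in the access language; the DBMS parses it and carries it out against the stored data.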
Users
• Database Administrators: The Database Administrator, or DBA, is the one who manages the
complete database management system. The DBA takes care of the security of the DBMS, its
availability, managing the license keys, managing user accounts and access etc.
• Application Programmer or Software Developer: This user group is involved in developing and
designing the parts of DBMS.
• End User: These days all the modern applications, web or mobile, store user data. How do you
think they do it? Applications are programmed in such a way that they collect user data and
store it on DBMS systems running on their servers. End users are the ones who store,
retrieve, update and delete data.
A 1-tier DBMS architecture also exists: this is when the database is directly available to the
user for storing data. Generally such a setup is used for local application development, where
programmers communicate directly with the database for quick response.
Such an architecture provides the DBMS extra security as it is not exposed to the End User
directly. Also, security can be improved by adding security and authentication checks in the
Application layer too.
For the end user, the GUI layer is the Database System, and the end user has no idea about the
application layer and the DBMS system.
If you have used MySQL, then you have probably seen phpMyAdmin, which is a good example of a
3-tier DBMS architecture.
• Hierarchical Model
• Network Model
• Entity-relationship Model
• Relational Model
Hierarchical Model
This database model organises data into a tree-like structure, with a single root, to which all
the other data is linked. The hierarchy starts from the Root data, and expands like a tree, adding
child nodes to the parent nodes.
In this model, a child node will only have a single parent node.
This model efficiently describes many real-world relationships like index of a book, recipes etc.
In the hierarchical model, data is organised into a tree-like structure with a one-to-many
relationship between two different types of data. For example, one department can have many
courses, many professors and of course many students.
Network Model
This is an extension of the Hierarchical model. In this model data is organised more like a graph,
and child nodes are allowed to have more than one parent node.
In this database model data is more related, as more relationships are established. As the data
is more related, accessing it is also easier and faster. This database model was used to map
many-to-many data relationships.
This was the most widely used database model, before Relational Model was introduced.
Entity-relationship Model
In this database model, relationships are created by dividing the object of interest into an
entity and its characteristics into attributes.
E-R Models are defined to represent the relationships in pictorial form, to make them easier for
different stakeholders to understand.
This model is good for designing a database, which can then be turned into tables in the
relational model (explained below).
Let's take an example. If we have to design a School Database, then Student will be an entity
with attributes name, age, address etc. As Address is generally complex, it can be another
entity with attributes street name, pincode, city etc., and there will be a relationship between
them.
Relationships can also be of different types.
Relational Model
In this model, data is organised in two-dimensional tables and the relationship is maintained by
storing a common field.
This model was introduced by E. F. Codd in 1970, and since then it has been the most widely used
database model.
The basic structure of data in the relational model is tables. All the information related to a
particular type is stored in rows of that table.
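As a small sketch of how a common field relates two tables, the example below uses Python's built-in sqlite3 module. The department and student tables, their columns and their rows are assumptions invented for illustration; dept_id is the common field stored in both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO department VALUES (10, 'Physics'), (20, 'Maths');
    INSERT INTO student VALUES (1, 'Adam', 10), (2, 'Alex', 20), (3, 'Ross', 10);
""")

# The relationship is recovered by matching the common field dept_id.
pairs = conn.execute("""
    SELECT s.name, d.dept_name
    FROM student AS s
    JOIN department AS d ON s.dept_id = d.dept_id
    ORDER BY s.student_id
""").fetchall()
```

Neither table stores the other's details; each student row only carries the dept_id value that points at one department row.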
In the coming tutorials we will learn how to design tables, normalize them to reduce data
redundancy and how to use Structured Query language to access data from tables.
Basic Concepts of ER Model in DBMS
As we described in the tutorial Database models, Entity-relationship model is a model used for
design and representation of relationships between data.
The main data objects are termed Entities, with their details defined as attributes. Some of
these attributes are important and are used to identify the entity, and different entities are
related using relationships.
Let's take an example to explain everything. For a School Management Software, we will have
to store Student information, Teacher information, Classes, Subjects taught in each class etc.
An Entity is generally a real-world object which has characteristics and holds relationships in a
DBMS.
If a Student is an Entity, then the complete dataset of all the students will be the Entity Set.
ER Model: Attributes
If a Student is an Entity, then student's roll no., student's name, student's age, student's gender
etc will be its attributes.
An attribute can be of many types, here are different types of attributes defined in ER database
model:
1. Simple attribute: The attributes with values that are atomic and cannot be broken down
further are simple attributes. For example, student's age.
2. Composite attribute: A composite attribute is made up of more than one simple
attribute. For example, student's address will contain, house no., street name, pincode
etc.
3. Derived attribute: These are the attributes which are not present in the whole database
management system, but are derived using other attributes. For example, average age of
students in a class.
4. Single-valued attribute: As the name suggests, these have a single value.
5. Multi-valued attribute: These can have multiple values.
ER Model: Keys
If the attribute roll no. can uniquely identify a student entity, amongst all the students, then the
attribute roll no. will be said to be a key.
1. Super Key
2. Candidate Key
3. Primary Key
For example, if 2 entities are involved, it is said to be Binary relationship, if 3 entities are
involved, it is said to be Ternary relationship, and so on.
In the next tutorial, we will learn how to create ER diagrams and design databases using ER
diagrams.
For example, in the below diagram, anyone can see and understand what the diagram wants to
convey: Developer develops a website, whereas a Visitor visits a website.
Components of ER Diagram
Entities, Attributes, Relationships etc. form the components of an ER Diagram, and there are
defined symbols and shapes to represent each one of them.
Entity
Weak Entity
A weak Entity is represented using double rectangular boxes. It is generally connected to another
entity.
To represent a Key attribute, the attribute name inside the Ellipse is underlined.
Derived attributes are those which are derived based on other attributes, for example, age can be
derived from date of birth.
To represent a derived attribute, another dotted ellipse is created inside the main ellipse.
Double Ellipse, one inside another, represents the attribute which can have multiple values.
ER Diagram: Entity
An Entity can be any object, place, person or class. In an ER Diagram, an entity is represented
using rectangles. Consider the example of an Organisation: Employee, Manager, Department,
Product and many more can be taken as entities in an Organisation.
A weak entity is an entity that depends on another entity. A weak entity doesn't have any key
attribute of its own. A double rectangle is used to represent a weak entity.
ER Diagram: Attribute
A key attribute represents the main characteristic of an Entity. It is used to represent a
Primary key. An ellipse with the text underlined represents a Key Attribute.
An attribute can also have its own attributes. Such attributes are known as Composite
attributes.
ER Diagram: Relationship
1. Binary Relationship
2. Recursive Relationship
3. Ternary Relationship
Binary Relationship means relation between two Entities. This is further divided into three types.
The below example showcases this relationship, which means that 1 student can opt for many
courses, but a course can only have 1 student. Sounds weird! This is how it is.
Many to One Relationship
It reflects the business rule that many entities can be associated with just one entity. For
example, a Student enrolls for only one Course but a Course can have many Students.
The above diagram represents that one student can enroll for more than one course, and a
course can have more than one student enrolled in it.
ER Diagram: Recursive Relationship
ER Diagram: Ternary Relationship
A Ternary relationship involves three entities. In such relationships we always consider two
entities together and then look upon the third.
For example, in the diagram above, we have three related entities, Company, Product and
Sector. To understand the relationship better or to define rules around the model, we should
relate two entities and then derive the third one.
A Company produces many Products / each Product is produced by exactly one Company.
A Company operates in only one Sector / each Sector has many Companies operating in it.
Considering the above two rules or relationships, we see that although the complete relationship
involves three entities, we are looking at two entities at a time.
Hence, as part of the Enhanced ER Model, along with other improvements, three new concepts
were added to the existing ER Model, they were:
1. Generalization
2. Specialization
3. Aggregation
Let's understand what they are, and why they were added to the existing ER Model.
Generalization
Generalization is a bottom-up approach in which two lower level entities combine to form a
higher level entity. In generalization, the higher level entity can also combine with other lower
level entities to make further higher level entity.
It's more like Superclass and Subclass system, but the only difference is the approach, which is
bottom-up. Hence, entities are combined to form a more generalised entity, in other words, sub-
classes are combined to form a super-class.
For example, Saving and Current account types entities can be generalised and an entity with
name Account can be created, which covers both.
Specialization
Specialization is the opposite of Generalization. It is a top-down approach in which one higher
level entity can be broken down into two lower level entities. In specialization, it is also
possible for a higher level entity to have no lower-level entity sets.
Aggregation
Aggregation is a process in which the relation between two entities is treated as a single entity.
In the diagram above, the relationship between Center and Course together is acting as an
Entity, which is in relationship with another entity Visitor. Now in the real world, if a Visitor
or a Student visits a Coaching Center, he/she will never enquire about the center only or just
about the course; rather, he/she will enquire about both.
Rule zero
This rule states that for a system to qualify as an RDBMS, it must be able to manage database
entirely through the relational capabilities.
Rule 1: Information rule
All information, including metadata, must be represented as values stored in the cells of tables.
Rule 2: Guaranteed Access
Each unique piece of data (atomic value) should be accessible by: Table Name + Primary
Key (Row) + Attribute (column).
Rule 3: Systematic treatment of NULL
Null has several meanings: it can mean missing data, not applicable or no value. It should be
handled consistently. Also, a Primary key must never be null. An expression on NULL must give
null.
Rule 4: Active Online Catalog
The database dictionary (catalog) is the structure description of the complete Database and it
must be stored online. The Catalog must be governed by the same rules as the rest of the
database. The same query language should be used on the catalog as is used to query the database.
Rule 5: Comprehensive Data Sub-language
One well structured language must be there to provide all manners of access to the data stored in
the database. Example: SQL. If the database allows access to the data without the use of this
language, then that is a violation.
Rule 6: View Updating Rule
All the views that are theoretically updatable should be updatable by the system as well.
Rule 7: Relational Level Operation
There must be Insert, Delete and Update operations at each level of relations. Set operations like
Union, Intersection and Minus should also be supported.
Rule 8: Physical Data Independence
The physical storage of data should not matter to the system. If, say, some file supporting a
table is renamed or moved from one disk to another, it should not affect the application.
Rule 9: Logical Data Independence
If there is a change in the logical structure (table structures) of the database, the user view of
the data should not change. Say, if a table is split into two tables, a new view should give the
result as the join of the two tables. This rule is the most difficult to satisfy.
Rule 10: Integrity Independence
The database should be able to enforce its own integrity rather than using other programs. Key
and Check constraints, triggers etc. should be stored in the Data Dictionary. This also makes the
RDBMS independent of the front-end.
Rule 11: Distribution Independence
A database should work properly regardless of its distribution across a network. Even if a
database is geographically distributed, with data stored in pieces, the end user should get the
impression that it is stored at the same place. This lays the foundation of distributed databases.
Rule 12: Nonsubversion Rule
If low level access is allowed to a system, it should not be able to subvert or bypass integrity
rules to change the data. This can be achieved by some sort of locking or encryption.
In the Relational database model, a table is a collection of data elements organised in terms of
rows and columns. A table is also considered a convenient representation of relations. But a
table can have duplicate rows of data while a true relation cannot have duplicate data. A table is
the simplest form of data storage. Below is an example of an Employee table.
ID   Name     Age   Salary
1    Adam     34    13000
2    Alex     28    15000
3    Stuart   20    18000
4    Ross     42    19020
A single entry in a table is called a Tuple or Record or Row. A tuple in a table represents a set
of related data. For example, the above Employee table has 4 tuples/records/rows.
1 Adam 34 13000
A table consists of several records (rows), and each record can be broken down into several
smaller parts of data known as Attributes. The above Employee table consists of four attributes:
ID, Name, Age and Salary.
Attribute Domain
Hence, the attribute Name will hold the name of employee for every tuple. If we save employee's
address there, it will be violation of the Relational database model.
Name
Adam
Alex
Ross
A relation schema describes the structure of the relation, with the name of the relation(name of
table), its attributes and their names and type.
1. Key Constraints
2. Domain Constraints
3. Referential integrity Constraints
Key Constraints
We store data in tables, to later access it whenever required. In every table one or more than one
attributes together are used to fetch data from tables. The Key Constraint specifies that there
should be such an attribute(column) in a relation(table), which can be used to fetch data for any
tuple(row).
The Key attribute should never be NULL or the same for two different rows of data.
For example, in the Employee table we can use the attribute ID to fetch data for each of the
employee. No value of ID is null and it is unique for every row, hence it can be our Key
attribute.
Domain Constraint
Domain constraints refer to the rules defined for the values that can be stored for a certain
attribute.
As we explained above, we cannot store the Address of an employee in the column for Name.
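The key and domain constraints described above can be sketched with Python's built-in sqlite3 module. The employee table mirrors the Employee example; the age range in the CHECK rule and the try_insert helper are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE employee (
        id     INTEGER PRIMARY KEY,                    -- key constraint: unique per row
        name   TEXT    NOT NULL,
        age    INTEGER CHECK (age BETWEEN 18 AND 65),  -- domain constraint (assumed range)
        salary INTEGER
    )
""")
conn.execute("INSERT INTO employee VALUES (1, 'Adam', 34, 13000)")


def try_insert(row):
    """Attempt an insert and report whether the DBMS accepted it."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


duplicate_key = try_insert((1, "Alex", 28, 15000))  # id 1 already exists
bad_domain = try_insert((2, "Alex", 150, 15000))    # age outside the allowed range
valid = try_insert((2, "Alex", 28, 15000))
row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```

Both violations are caught by the database itself, without any checking code in the application.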
Referential Integrity Constraint
We will study this in detail later. For now remember this example: if I say Supriya is my
girlfriend, then a girl with the name Supriya should also exist for that relationship to be present.
If a table references some data from another table, then that table and that data should be
present for the referential integrity constraint to hold true.
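A minimal sketch of a referential integrity check, using Python's built-in sqlite3 module. The department and employee tables are assumptions made up for the example. Note that SQLite enforces foreign keys only after the PRAGMA shown below is turned on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "dept_id INTEGER REFERENCES department(dept_id))"
)
conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (1, 'Adam', 1)")  # department 1 exists

try:
    # Department 99 does not exist, so this reference would dangle.
    conn.execute("INSERT INTO employee VALUES (2, 'Alex', 99)")
    dangling_allowed = True
except sqlite3.IntegrityError:
    dangling_allowed = False

employee_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```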
In relational algebra, input is a relation(table from which data has to be accessed) and output is
also a relation(a temporary table holding the data asked for by the user).
Relational Algebra works on the whole table at once, so we do not have to use loops etc to iterate
over all the rows(tuples) of data one by one. All we have to do is specify the table name from
which we need the data, and in a single line of command, relational algebra will traverse the
entire given table to fetch data for you.
The primary operations that we can perform using relational algebra are:
1. Select
2. Project
3. Union
4. Set Difference
5. Cartesian product
6. Rename
Select Operation (σ)
Syntax: σp(r)
Where σ represents the Select predicate, r is the name of the relation (the table in which you
want to look for data), and p is the propositional logic in which we specify the conditions that
must be satisfied by the data. In propositional logic, one can use operators like =, <, >
etc. to specify the conditions.
Let's take an example of the Student table we specified above in the Introduction of relational
algebra, and fetch data for students with age more than 17.
σage > 17 (Student)
This will fetch the tuples (rows) from table Student for which age is greater than 17.
You can also use operators such as and, or etc. to combine two conditions, for example,
σage > 17 and gender = 'Male' (Student)
This will return tuples (rows) from table Student with information of male students of age more
than 17. (Consider that the Student table has an attribute Gender too.)
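The Select operation can be sketched in a few lines of plain Python, treating a relation as a list of rows and the predicate p as a function. The rows below are hypothetical, and the select helper is only an illustration of the idea, not how a DBMS implements it.

```python
# Hypothetical Student relation; the rows are made up for illustration.
student = [
    {"name": "Akon", "age": 17, "gender": "Male"},
    {"name": "Bkon", "age": 18, "gender": "Female"},
    {"name": "Ckon", "age": 19, "gender": "Male"},
]


def select(relation, predicate):
    """Relational-algebra Select: keep the tuples for which the predicate holds."""
    return [row for row in relation if predicate(row)]


# Students older than 17.
older = select(student, lambda row: row["age"] > 17)

# Male students older than 17 (two conditions joined with "and").
older_males = select(student, lambda row: row["age"] > 17 and row["gender"] == "Male")
```

The input is a relation and so is the output, which is why the result of one operation can be fed straight into another.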
Project Operation (∏)
The Project operation will only project or show the columns or attributes asked for, and will
also remove duplicate data from the result.
For example,
∏Name, Age(Student)
Above statement will show us only the Name and Age columns for all the rows of data in
Student table.
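Here is a small plain-Python sketch of the Project operation, including the duplicate removal mentioned above. The rows and the project helper are assumptions for illustration only.

```python
# Hypothetical Student relation; two rows share the same Name and Age.
student = [
    {"name": "Akon", "age": 17, "gender": "Male"},
    {"name": "Akon", "age": 17, "gender": "Female"},
    {"name": "Bkon", "age": 18, "gender": "Female"},
]


def project(relation, attributes):
    """Relational-algebra Project: keep only the named columns, dropping duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[attr] for attr in attributes)
        if key not in seen:  # a relation never contains the same tuple twice
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result


name_age = project(student, ["name", "age"])
```

Once the gender column is dropped, the first two rows become identical, so only one of them survives in the result.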
Union Operation (∪)
For this operation to work, the relations (tables) specified should have the same number of
attributes (columns) and the same attribute domains. Also, duplicate tuples are automatically
eliminated from the result.
Syntax: A ∪ B
For example, if we have two tables RegularClass and ExtraClass, both have a column student
to save name of student, then,
∏Student(RegularClass) ∪ ∏Student(ExtraClass)
The above operation will give us the names of students who are attending regular classes, extra
classes, or both, eliminating repetition.
Set Difference (-)
Syntax: A - B
For example, if we want to find name of students who attend the regular class but not the extra
class, then, we can use the below operation:
∏Student(RegularClass) - ∏Student(ExtraClass)
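Union and set difference map directly onto Python's set type. In the sketch below each relation is a set of one-column tuples holding student names; the names themselves are made up.

```python
# Hypothetical results of projecting the student column from each table.
regular_class = {("Adam",), ("Alex",), ("Ross",)}
extra_class = {("Alex",), ("Stuart",)}

# Union: students in either relation, with duplicates removed automatically.
either = regular_class | extra_class

# Set difference: students in the regular class but not in the extra class.
regular_only = regular_class - extra_class
```

Because both operands have the same single column, they are union-compatible, which is the requirement stated for these operations.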
Cartesian Product (X)
Syntax: A X B
For example, if we want to find the information for Regular Class and Extra Class which are
conducted during morning, then we can use the following operation:
σtime = 'morning' (RegularClass X ExtraClass)
For the above query to work, both RegularClass and ExtraClass should have the attribute time.
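A rough plain-Python sketch of the Cartesian product followed by a selection on time. The rows are hypothetical, and prefixing each column with its table name is an assumption made here to keep the two time columns apart.

```python
from itertools import product

# Hypothetical relations; both have a time attribute.
regular_class = [
    {"student": "Adam", "time": "morning"},
    {"student": "Alex", "time": "evening"},
]
extra_class = [
    {"student": "Ross", "time": "morning"},
]


def cartesian_product(a, b, a_name, b_name):
    """Pair every row of a with every row of b, prefixing columns with the table name."""
    combined = []
    for row_a, row_b in product(a, b):
        row = {f"{a_name}.{col}": val for col, val in row_a.items()}
        row.update({f"{b_name}.{col}": val for col, val in row_b.items()})
        combined.append(row)
    return combined


pairs = cartesian_product(regular_class, extra_class, "RegularClass", "ExtraClass")

# Keep only the combinations where both classes are held in the morning.
morning = [
    row for row in pairs
    if row["RegularClass.time"] == "morning" and row["ExtraClass.time"] == "morning"
]
```

The product has one row per pair (2 x 1 = 2 rows here), which is why a selection is almost always applied to it afterwards.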
Rename Operation (ρ)
This operation is used to rename the output relation of any query operation which returns a
result, like Select, Project etc., or to simply rename a relation (table).
Apart from these common operations Relational Algebra is also used for Join operations like,
• Natural Join
• Outer Join
• Theta join etc.
Syntax: { T | Condition }
In this form of relational calculus, we define a tuple variable, specify the table(relation) name in
which the tuple is to be searched for, along with a condition.
We can also specify a column name using the . (dot) operator with the tuple variable, to get only
a certain attribute (column) in the result.
To specify the name of the relation (table) in which we want to look for data, we do the
following:
Relation(T), where T is our tuple variable.
Then comes the condition part. To specify a condition applicable to a particular
attribute (column), we can use the . (dot) operator with the tuple variable. For example, in table
Student, if we want to get data for students with age greater than 17, then we can write it as,
T.age > 17
Putting it all together, if we want to use Tuple Relational Calculus to fetch names of students,
from table Student, with age greater than 17, then, for T being our tuple variable,
{ T.name | Student(T) AND T.age > 17 }
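A tuple relational calculus query reads very much like a Python set comprehension: a tuple variable, the relation it ranges over, and a condition. The sketch below assumes a small made-up Student relation.

```python
from collections import namedtuple

# Hypothetical Student relation as a set of tuples.
StudentRow = namedtuple("StudentRow", ["name", "age"])
student = {
    StudentRow("Akon", 17),
    StudentRow("Bkon", 18),
    StudentRow("Ckon", 19),
}

# "Names of tuples T in Student such that T.age > 17", written as a comprehension.
names = {t.name for t in student if t.age > 17}
```

Like the calculus expression, the comprehension states what is wanted rather than spelling out how to loop over the rows.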
Syntax: { c1, c2, c3, ..., cn | F(c1, c2, c3, ... ,cn)}
where, c1, c2... etc represents domain of attributes(columns) and F defines the formula including
the condition for fetching the data.
For example,
{ <name, age> | <name, age> ∈ Student ∧ age > 17 }
Again, the above query will return the names and ages of the students in the table Student who
are older than 17.
Not all the ER Model constraints and components can be directly transformed into relational
model, but an approximate schema can be derived.
So let's take a few examples of ER diagrams and convert it into relational model schema, hence
creating tables in RDBMS.
And the attributes of the Entity gets converted to columns of the table.
And the primary key specified for the entity in the ER model, will become the primary key for
the table in relational model.
A table with name Student will be created in relational model, which will have 4 columns, id,
name, age, address and id will be the primary key for this table.
Relationship becomes a Relationship Table
In an ER diagram, we use a diamond/rhombus to represent a relationship between two entities. In
the Relational model we create a relationship table for ER Model relationships too.
In the ER diagram below, we have two entities Teacher and Student with a relationship
between them.
As discussed above, an entity gets mapped to a table, hence we will create a table for Teacher and
a table for Student with all the attributes converted into columns.
Now, an additional table will be created for the relationship, for example StudentTeacher, or
give it any name you like. This table will hold the primary keys of both Student and Teacher in a
tuple to describe the relationship: which teacher teaches which student.
If there are additional attributes related to this relationship, then they become the columns for
this table, like subject name.
Also, proper foreign key constraints must be set for all the tables.
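The mapping described above can be sketched with Python's built-in sqlite3 module. Table and column names follow the text where it gives them; the teacher columns, the sample rows and the choice of (student_id, teacher_id) as the relationship table's primary key are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce foreign keys
conn.executescript("""
    -- Each entity becomes a table; its attributes become columns.
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, address TEXT);
    CREATE TABLE teacher (id INTEGER PRIMARY KEY, name TEXT);

    -- The relationship becomes its own table holding both primary keys.
    CREATE TABLE student_teacher (
        student_id   INTEGER NOT NULL REFERENCES student(id),
        teacher_id   INTEGER NOT NULL REFERENCES teacher(id),
        subject_name TEXT,                     -- attribute of the relationship
        PRIMARY KEY (student_id, teacher_id)   -- assumed: one row per pair
    );

    INSERT INTO student VALUES (1, 'Akon', 17, 'Delhi');
    INSERT INTO teacher VALUES (1, 'Mr. X');
    INSERT INTO student_teacher VALUES (1, 1, 'Maths');
""")

# Which teacher teaches which student, and what subject?
taught = conn.execute("""
    SELECT t.name, s.name, st.subject_name
    FROM student_teacher AS st
    JOIN student AS s ON s.id = st.student_id
    JOIN teacher AS t ON t.id = st.teacher_id
""").fetchall()
```

If one teacher could teach the same student several subjects, subject_name would have to become part of the key as well; the sketch assumes that does not happen.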
Points to Remember
Similarly we can generate relational database schema using the ER diagram. Following are some
key points to keep in mind while doing so:
1. Entity gets converted into Table, with all the attributes becoming fields(columns) in the table.
2. Relationship between entities is also converted into table with primary keys of the related
entities also stored in it as foreign keys.
3. Primary Keys should be properly set.
4. For any relationship of a Weak Entity, if the primary key of any other entity is included in a
table, a foreign key constraint must be defined.
A Key can be a single attribute or a group of attributes, where the combination may act as a key.
Also, tables store a lot of data in them. Tables generally extend to thousands of records stored
in them, unsorted and unorganised.
Now to fetch any particular record from such dataset, you will have to apply some conditions,
but what if there is duplicate data present and every time you try to fetch some data by applying
certain condition, you get the wrong data. How many trials before you get the right data?
To avoid all this, Keys are defined to easily identify any row of data in a table.
Let's try to understand about all the keys using a simple example.
student_id   name   phone        age
1            Akon   9876723452   17
2            Akon   9991165674   19
3            Bkon   7898756543   18
4            Ckon   8987867898   19
5            Dkon   9990080080   17
Let's take a simple Student table, with fields student_id, name, phone and age.
Super Key
Super Key is defined as a set of attributes within a table that can uniquely identify each record
within a table. Super Key is a superset of Candidate key.
In the table defined above super key would include student_id, (student_id, name), phone etc.
Confused? The first one is pretty simple: student_id is unique for every row of data, hence it
can be used to identify each row uniquely.
Next comes (student_id, name). The names of two students can be the same, but their
student_id can't be the same, hence this combination can also be a key.
Similarly, the phone number of every student will be unique, hence phone can also be a key.
Candidate Key
Candidate keys are defined as the minimal set of fields which can uniquely identify each record
in a table. It is an attribute or a set of attributes that can act as a Primary Key for a table to
uniquely identify each record in that table. There can be more than one candidate key.
In our example, student_id and phone both are candidate keys for table Student.
• A candidate key can never be NULL or empty, and its value should be unique.
• There can be more than one candidate key for a table.
• A candidate key can be a combination of more than one column (attribute).
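The difference between a super key and a candidate key can be checked mechanically on sample data. The sketch below, in plain Python, uses the five Student rows from the example; the helper names are made up. Keep in mind that such a check only shows which attribute sets the current rows do not rule out; the real keys are a design decision about what must always be unique.

```python
from itertools import combinations

students = [
    {"student_id": 1, "name": "Akon", "phone": "9876723452", "age": 17},
    {"student_id": 2, "name": "Akon", "phone": "9991165674", "age": 19},
    {"student_id": 3, "name": "Bkon", "phone": "7898756543", "age": 18},
    {"student_id": 4, "name": "Ckon", "phone": "8987867898", "age": 19},
    {"student_id": 5, "name": "Dkon", "phone": "9990080080", "age": 17},
]


def is_super_key(rows, attrs):
    """True if no two rows share the same values for attrs."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return len(seen) == len(rows)


def candidate_keys(rows):
    """Minimal super keys: unique attribute sets with no smaller unique subset."""
    attrs = list(rows[0])
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            contains_smaller_key = any(set(k) <= set(combo) for k in keys)
            if is_super_key(rows, combo) and not contains_smaller_key:
                keys.append(combo)
    return keys


keys = candidate_keys(students)
```

On these five rows the check also reports (name, age) as minimal and unique. That is an accident of the sample: nothing stops two students from sharing a name and an age, which is why student_id and phone are the candidate keys chosen in the text.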
Primary Key
Primary key is a candidate key that is most appropriate to become the main key for any table. It
is a key that can uniquely identify each record in a table.
For the table Student we can make the student_id column as the primary key.
Composite Key
A key that consists of two or more attributes that together uniquely identify any record in a
table is called a Composite key. The attributes which together form the Composite key are not
keys independently or individually.
In the above picture we have a Score table which stores the marks scored by a student in a
particular subject.
In this table student_id and subject_id together will form the primary key, hence it is a
composite key.
Candidate keys which are not selected as the primary key are known as secondary keys or
alternative keys.
Non-key Attributes
Non-key attributes are the attributes or fields of a table, other than candidate key
attributes/fields in a table.
Non-prime Attributes
Normalization of Database
Database Normalization is a technique of organizing the data in the database. Normalization is a
systematic approach of decomposing tables to eliminate data redundancy(repetition) and
undesirable characteristics like Insertion, Update and Deletion Anomalies. It is a multi-step
process that puts data into tabular form, removing duplicated data from the relation tables.
In the table above, we have data of 4 Computer Sci. students. As we can see, data for the fields
branch, hod(Head of Department) and office_tel is repeated for the students who are in the
same branch in the college, this is Data Redundancy.
Insertion Anomaly
Suppose for a new admission, until and unless a student opts for a branch, data of the student
cannot be inserted, or else we will have to set the branch information as NULL.
Also, if we have to insert data of 100 students of same branch, then the branch information will
be repeated for all those 100 students.
These scenarios are nothing but Insertion anomalies.
Updation Anomaly
What if Mr. X leaves the college, or is no longer the HOD of the computer science department? In that case all the student records will have to be updated, and if by mistake we miss any record, it will lead to data inconsistency. This is Updation anomaly.
Deletion Anomaly
In our Student table, two different kinds of information are kept together: Student information and Branch information. Hence, at the end of the academic year, if student records are deleted, we will also lose the branch information. This is Deletion anomaly.
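The updation and deletion anomalies can be reproduced with a small experiment. The sketch below uses Python's sqlite3 module; the rollno and name columns, the sample rows and the replacement HOD "Mr. Y" are hypothetical values chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student ("
    "rollno INTEGER PRIMARY KEY, name TEXT, branch TEXT, hod TEXT, office_tel TEXT)"
)
# Hypothetical sample rows: four students of the same branch.
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?, ?, ?)",
    [
        (401, "Akon", "CSE", "Mr. X", "53337"),
        (402, "Bkon", "CSE", "Mr. X", "53337"),
        (403, "Ckon", "CSE", "Mr. X", "53337"),
        (404, "Dkon", "CSE", "Mr. X", "53337"),
    ],
)

# Updation anomaly: the HOD changes, but one record is missed by mistake.
conn.execute("UPDATE Student SET hod = 'Mr. Y' WHERE rollno IN (401, 402, 403)")
hods = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT hod FROM Student WHERE branch = 'CSE' ORDER BY hod"
    )
]
print(hods)  # one branch now lists two different HODs

# Deletion anomaly: removing the student records also removes the branch data.
conn.execute("DELETE FROM Student")
branch_rows_left = conn.execute(
    "SELECT COUNT(*) FROM Student WHERE branch = 'CSE'"
).fetchone()[0]
print(branch_rows_left)
```

Keeping branch details in a table of their own would leave a single row to update and would let the branch survive the deletion of its students.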
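Normalization Rule
Normalization rules are divided into the following normal forms:
1. First Normal Form
2. Second Normal Form
3. Third Normal Form
4. BCNF
5. Fourth Normal Form
First Normal Form (1NF)
For a table to be in the First Normal Form, it should follow the following 4 rules:
1. It should only have single (atomic) valued attributes/columns.
2. Values stored in a column should be of the same domain.
3. All the columns in a table should have unique names.
4. The order in which data is stored does not matter.
We will discuss the First Normal Form in detail in a later tutorial.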
Second Normal Form (2NF)
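For a table to be in the Second Normal Form, it should be in the First Normal Form and it should not have Partial Dependency.
To understand what Partial Dependency is and how to normalize a table to the 2nd Normal Form, jump to the Second Normal Form tutorial.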
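Third Normal Form (3NF)
For a table to be in the Third Normal Form, it should be in the Second Normal Form and it should not have Transitive Dependency.
We suggest that you first study the Second Normal Form and then head over to the Third Normal Form tutorial.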
Boyce and Codd Normal Form (BCNF)
Boyce and Codd Normal Form is a stricter version of the Third Normal Form. This form deals with a certain type of anomaly that is not handled by 3NF. A 3NF table which does not have multiple overlapping candidate keys is said to be in BCNF. For a table to be in BCNF, the following conditions must be satisfied:
1. The table must be in the Third Normal Form.
2. For each functional dependency X → Y, X should be a super key.
To learn about BCNF in detail with a very easy to understand example, head to the Boyce-Codd Normal Form tutorial.
In our last tutorial we learned and understood how data redundancy or repetition can lead to
several issues like Insertion, Deletion and Updation anomalies and how Normalization can
reduce data redundancy and make the data more meaningful.
In this tutorial we will learn about the 1st Normal Form which is more like the Step 1 of the
Normalization process. The 1st Normal form expects you to design your table in such a way that
it can easily be extended and it is easier for you to retrieve data from it whenever required.
If tables in a database are not even in the 1st Normal Form, it is considered as bad database design.
Rule 1: Single Valued Attributes
Each column of your table should be single valued, which means it should not contain multiple values. We will explain this with the help of an example later; let's see the other rules for now.
Rule 2: Attribute Domain should not change
This is more of a "Common Sense" rule. In each column, the values stored must be of the same kind or type.
For example: If you have a column dob to save the dates of birth of a set of people, then you must not save the names of some of them in that column along with the dates of birth of others. It should hold only dates of birth for all the records/rows.
Rule 3: Unique name for Attributes/Columns
This rule expects each column in a table to have a unique name. If two or more columns have the same name, the DBMS will be left confused.
Rule 4: Order doesn't matter
This rule says that the order in which you store the data in your table doesn't matter.
Our table already satisfies 3 rules out of the 4 rules, as all our column names are unique, we have
stored data in the order we wanted to and we have not inter-mixed different type of data in
columns.
But out of the 3 different students in our table, 2 have opted for more than 1 subject, and we have stored the subject names in a single column. As per the 1st Normal Form, each column must contain atomic values.
It's very simple, because all we have to do is break the values into atomic values.
Here is our updated table and it now satisfies the First Normal Form.
roll_no name subject
101 Akon OS
101 Akon CN
102 Bkon C
By doing so, a few values get repeated, but the values in the subject column are now atomic for each record/row.
Using the First Normal Form, data redundancy increases, as the same data will be repeated in some columns across multiple rows, but each row as a whole will be unique.
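The step of breaking multi-valued cells into atomic values can be sketched in a few lines of Python. The input rows below are an assumption: they were chosen so that the output matches the rows of the updated table above, with multiple subjects stored as a comma-separated string. The function name to_first_normal_form is also made up for this example.

```python
# Assumed unnormalized rows: the subject column holds several values at once.
unnormalized = [
    (101, "Akon", "OS, CN"),
    (102, "Bkon", "C"),
]


def to_first_normal_form(rows):
    """Emit one row per subject so that every subject value is atomic."""
    atomic_rows = []
    for roll_no, name, subjects in rows:
        for subject in subjects.split(","):
            atomic_rows.append((roll_no, name, subject.strip()))
    return atomic_rows


normalized = to_first_normal_form(unnormalized)
for row in normalized:
    print(*row)
```

The roll_no and name values repeat for student 101, which is the increase in redundancy described above.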
For a table to be in the Second Normal Form, it must satisfy two conditions:
1. The table should be in the First Normal Form.
2. There should be no Partial Dependency.
What is Partial Dependency? Do not worry about it. First, let's understand what Dependency in a table is.
What is Dependency?
Let's take an example of a Student table with columns student_id, name, reg_no (registration number), branch and address (student's home address).
In this table, student_id is the primary key and will be unique for every row, hence we can use student_id to fetch any row of data from this table.
Even for a case, where student names are same, if we know the student_id we can easily fetch
the correct record.
Hence we can say a Primary Key for a table is the column or a group of columns (composite key) which can uniquely identify each record in the table.
I can ask for the branch name of the student with student_id 10, and I can get it. Similarly, if I ask for the name of the student with student_id 10 or 11, I will get it. So all I need is student_id, and every other column depends on it, or can be fetched using it. This is Dependency, also called Functional Dependency.
For a simple table like Student, a single column like student_id can uniquely identify all the records in a table.
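The idea that every other column "depends on" student_id can be checked mechanically on sample data. The sketch below is only an illustration: the helper name determines and the two sample rows (two students who share a name) are assumptions, and a check on sample rows shows what holds in that data, not what the design guarantees.

```python
def determines(rows, determinant, dependent):
    """Return True if equal determinant values always give the same dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in determinant)
        value = row[dependent]
        # Remember the first value seen for this key and compare later ones to it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical rows: two students with the same name but different ids.
students = [
    {"student_id": 10, "name": "Akon", "reg_no": "07-WY", "branch": "CSE", "address": "Kerala"},
    {"student_id": 11, "name": "Akon", "reg_no": "08-WY", "branch": "IT", "address": "Gujarat"},
]

id_gives_branch = determines(students, ["student_id"], "branch")
name_gives_branch = determines(students, ["name"], "branch")
print(id_gives_branch, name_gives_branch)
```

Here branch can be fetched from student_id, but not from name, because the same name appears with two different branches.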
But this is not true all the time. So now let's extend our example to see if more than 1 column
together can act as a primary key.
Let's create another table for Subject, which will have subject_id and subject_name fields
and subject_id will be the primary key.
subject_id subject_name
1 Java
2 C++
3 Php
Now we have a Student table with student information and another table Subject for storing
subject information.
Let's create another table Score, to store the marks obtained by students in the respective
subjects. We will also be saving name of the teacher who teaches that subject along with marks.
1 10 1 70 Java Teacher
2 10 2 75 C++ Teacher
3 11 1 80 Java Teacher
In the Score table we are saving the student_id to know which student's marks these are, and the subject_id to know which subject the marks are for.
Together, student_id + subject_id forms a Candidate Key for this table, which can be the Primary key.
See, if I ask you to get me marks of student with student_id 10, can you get it from this table?
No, because you don't know for which subject. And if I give you subject_id, you would not
know for which student. Hence we need student_id + subject_id to uniquely identify any
row.
Now if you look at the Score table, we have a column named teacher which depends only on the subject: for Java it's Java Teacher, for C++ it's C++ Teacher, and so on.
As we just discussed, the primary key for this table is a composition of two columns, student_id and subject_id, but the teacher's name depends only on the subject, hence on subject_id, and has nothing to do with student_id.
This is Partial Dependency, where an attribute in a table depends on only a part of the primary
key and not on the whole key.
The simplest solution is to remove the teacher column from the Score table and add it to the Subject table. Hence, the Subject table will become:
subject_id subject_name teacher
1 Java Java Teacher
2 C++ C++ Teacher
3 Php Php Teacher
And our Score table is now in the second normal form, with no partial dependency.
1 10 1 70
2 10 2 75
3 11 1 80
Quick Recap
1. For a table to be in the Second Normal form, it should be in the First Normal form and it should
not have Partial Dependency.
2. Partial Dependency exists, when for a composite primary key, any attribute in the table depends
only on a part of the primary key and not on the complete primary key.
3. To remove Partial dependency, we can divide the table, remove the attribute which is causing
partial dependency, and move it to some other table where it fits in well.
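The recap can be tried out on the Score rows used in this tutorial. The Python sketch below is a rough illustration under stated assumptions: the helper depends_on is a made-up name, the first (row id) column of the Score table is left out, and the check only tells us what holds in these sample rows.

```python
# Score rows before 2NF; the primary key is (student_id, subject_id).
score = [
    {"student_id": 10, "subject_id": 1, "marks": 70, "teacher": "Java Teacher"},
    {"student_id": 10, "subject_id": 2, "marks": 75, "teacher": "C++ Teacher"},
    {"student_id": 11, "subject_id": 1, "marks": 80, "teacher": "Java Teacher"},
]


def depends_on(rows, determinant, dependent):
    """True if, in these rows, each determinant value maps to one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in determinant)
        if seen.setdefault(key, row[dependent]) != row[dependent]:
            return False
    return True


# teacher is fixed by subject_id alone: a partial dependency.
teacher_is_partial = depends_on(score, ["subject_id"], "teacher")
# marks needs the whole key: neither part of the key fixes it on its own.
marks_is_partial = (
    depends_on(score, ["subject_id"], "marks")
    or depends_on(score, ["student_id"], "marks")
)

# Decompose: teacher moves next to subject_id, Score keeps the rest.
subject_teacher = {row["subject_id"]: row["teacher"] for row in score}
score_2nf = [{k: v for k, v in row.items() if k != "teacher"} for row in score]

print(teacher_is_partial, marks_is_partial)
print(subject_teacher)
```

After the split, each teacher name is stored once per subject instead of once per score row.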
In our last tutorial, we learned about the second normal form and even normalized our Score
table into the 2nd Normal Form.
So let's use the same example, where we have 3 tables, Student, Subject and Score.
Student Table
Subject Table
Score Table
1 10 1 70
2 10 2 75
3 11 1 80
In the Score table, we need to store some more information, which is the exam name and total
marks, so let's add 2 more columns to the Score table.
With exam_name and total_marks added to our Score table, it saves more data now. Primary
key for our Score table is a composite key, which means it's made up of two attributes or
columns → student_id + subject_id.
Our new column exam_name depends on both student and subject. For example, a mechanical engineering student will have a Workshop exam but a computer science student won't. And for some subjects you have Practical exams and for some you don't. So we can say that exam_name is dependent on both student_id and subject_id.
And what about our second new column total_marks? Does it depend on our Score table's
primary key?
Well, the column total_marks depends on exam_name, because the total score changes with the exam type. For example, practicals carry fewer marks while theory exams carry more marks.
But exam_name is just another column in the Score table. It is not a primary key or even a part of the primary key, and total_marks depends on it.
This is Transitive Dependency: a non-prime attribute depends on another non-prime attribute rather than on the primary key.
Again the solution is very simple. Take out the columns exam_name and total_marks from
Score table and put them in an Exam table and use the exam_id wherever required.
exam_id exam_name total_marks
1 Workshop 200
2 Mains 70
3 Practicals 30
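One possible way to lay out the two tables after this change is sketched below with Python's sqlite3 module. The Exam rows are the ones shown above; which exam_id is attached to each Score row is a made-up assignment for illustration, and the score id column is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Exam ("
    "exam_id INTEGER PRIMARY KEY, exam_name TEXT, total_marks INTEGER)"
)
conn.execute(
    """
    CREATE TABLE Score (
        student_id INTEGER,
        subject_id INTEGER,
        marks      INTEGER,
        exam_id    INTEGER REFERENCES Exam(exam_id),
        PRIMARY KEY (student_id, subject_id)
    )
    """
)
conn.executemany(
    "INSERT INTO Exam VALUES (?, ?, ?)",
    [(1, "Workshop", 200), (2, "Mains", 70), (3, "Practicals", 30)],
)
# The exam_id given to each score row is a hypothetical value.
conn.executemany(
    "INSERT INTO Score VALUES (?, ?, ?, ?)",
    [(10, 1, 70, 2), (10, 2, 75, 1), (11, 1, 80, 1)],
)

# total_marks is looked up through exam_id instead of being stored per score.
report = conn.execute(
    """
    SELECT s.student_id, s.subject_id, s.marks, e.exam_name, e.total_marks
    FROM Score AS s JOIN Exam AS e ON e.exam_id = s.exam_id
    ORDER BY s.student_id, s.subject_id
    """
).fetchall()
for row in report:
    print(row)
```

With this layout, a change to the total marks of an exam touches a single row of the Exam table.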
Advantage of removing Transitive Dependency
In our last tutorial, we learned about the third normal form and how to remove transitive dependency from a table. We suggest that you follow the last tutorial before this one.
The second point sounds a bit tricky, right? In simple words, it means, that for a dependency A
→ B, A cannot be a non-prime attribute, if B is a prime attribute.
As you can see, we have also added some sample data to the table.
• One student can enrol for multiple subjects. For example, student with student_id 101, has
opted for subjects - Java & C++
• For each subject, a professor is assigned to the student.
• And, there can be multiple professors teaching one subject like we have for Java.
Well, in the table above student_id, subject together form the primary key, because using
student_id and subject, we can find all the columns of the table.
One more important point to note here is, one professor teaches only one subject, but one subject
may have two different professors.
Hence, there is a dependency between subject and professor here, where subject depends on
the professor name.
This table satisfies the 1st Normal form because all the values are atomic, column names are
unique and all the values stored in a particular column are of same domain.
This table also satisfies the 2nd Normal Form as there is no Partial Dependency.
And, there is no Transitive Dependency, hence the table also satisfies the 3rd Normal Form.
In the table above, student_id and subject form the primary key, which means the subject column is a prime attribute.
But there is one more dependency, professor → subject. While subject is a prime attribute, professor is a non-prime attribute, which is not allowed by BCNF.
How to satisfy BCNF?
To make this relation(table) satisfy BCNF, we will decompose this table into two tables, student
table and professor table.
Student Table
student_id p_id
101 1
101 2
and so on...
Professor Table
p_id professor subject
1 P.Java Java
2 P.Cpp C++
and so on...
And now, these relations satisfy Boyce-Codd Normal Form. In the next tutorial we will learn about the Fourth Normal Form.
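The BCNF violation and the decomposition described above can be walked through in Python. This is a sketch under assumptions: the sample rows, the professor name P.Java2 and the helper names holds and is_super_key are hypothetical, and the checks only describe these sample rows.

```python
# Hypothetical sample rows: (student_id, subject, professor).
enrolment = [
    (101, "Java", "P.Java"),
    (101, "C++", "P.Cpp"),
    (102, "Java", "P.Java2"),
    (103, "C++", "P.Cpp"),
]


def holds(rows, lhs, rhs):
    """True if column index lhs determines column index rhs in these rows."""
    seen = {}
    return all(seen.setdefault(row[lhs], row[rhs]) == row[rhs] for row in rows)


def is_super_key(rows, column):
    """True if the values of this single column are unique across the rows."""
    return len({row[column] for row in rows}) == len(rows)


# professor -> subject holds, yet professor is not a super key: BCNF is violated.
violates_bcnf = holds(enrolment, 2, 1) and not is_super_key(enrolment, 2)

# Decompose: give every professor a p_id, in order of first appearance.
p_ids = {}
subject_of = {}
for _, subject, professor in enrolment:
    p_ids.setdefault(professor, len(p_ids) + 1)
    subject_of[professor] = subject

professor_table = [(p_id, prof, subject_of[prof]) for prof, p_id in p_ids.items()]
student_table = [(student_id, p_ids[prof]) for student_id, _, prof in enrolment]

print(violates_bcnf)
print(professor_table)
print(student_table)
```

In the professor table, p_id is the key and it determines both professor and subject, so the dependency that caused the problem now has a key on its left side.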
In our last tutorial, we learned about the Boyce-Codd Normal Form. We suggest that you follow the last tutorial before this one.
1. For a dependency A → B, if for a single value of A multiple values of B exist, then the table may have multi-valued dependency.
2. Also, a table should have at least 3 columns for it to have a multi-valued dependency.
3. And, for a relation R(A,B,C), if there is a multi-valued dependency between A and B, then B and C should be independent of each other.
If all these conditions are true for any relation (table), it is said to have multi-valued dependency.
s_id course hobby
1 Science Cricket
1 Maths Hockey
2 C# Cricket
2 Php Hockey
As you can see in the table above, student with s_id 1 has opted for two courses, Science and
Maths, and has two hobbies, Cricket and Hockey.
You must be thinking what problem this can lead to, right?
Well, the two records for the student with s_id 1 will give rise to two more records, as shown below, because the student has two hobbies, and each hobby has to be specified along with both courses.
s_id course hobby
1 Science Cricket
1 Maths Hockey
1 Science Hockey
1 Maths Cricket
And, in the table above, there is no relationship between the columns course and hobby. They
are independent of each other.
So there is multi-valued dependency, which leads to unnecessary repetition of data and other anomalies as well.
To make the above relation satisfy the 4th normal form, we can decompose the table into 2 tables.
CourseOpted Table
s_id course
1 Science
1 Maths
2 C#
2 Php
Hobbies Table
s_id hobby
1 Cricket
1 Hockey
2 Cricket
2 Hockey
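To see that nothing is lost by this decomposition, the two tables can be joined back on s_id. The short Python sketch below uses the rows shown above; the variable names are chosen for this example only.

```python
course_opted = [(1, "Science"), (1, "Maths"), (2, "C#"), (2, "Php")]
hobbies = [(1, "Cricket"), (1, "Hockey"), (2, "Cricket"), (2, "Hockey")]

# Natural join on s_id: every course of a student pairs with every hobby.
joined = [
    (s_id, course, hobby)
    for s_id, course in course_opted
    for hobby_s_id, hobby in hobbies
    if s_id == hobby_s_id
]

rows_for_student_1 = sorted(row for row in joined if row[0] == 1)
for row in rows_for_student_1:
    print(*row)
print(len(joined))
```

The join brings back the four course and hobby combinations for the student with s_id 1, while each course and each hobby is stored only once per student in the decomposed tables.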
If you design your database carefully, you can easily avoid these issues.