DBMS - Data Models and Relational Database Design Notes

Data Models and Relational Database Design

Evolution of Data Models


A database model defines the logical design and structure of a database, including how
data is stored, accessed and updated in a database management system. The
Relational Model is the most widely used, but several models are worth knowing:

 Hierarchical Model
 Network Model
 Entity-relationship Model
 Relational Model

Hierarchical Model
 This database model organises data into a tree-like structure with a single root, to
which all the other data is linked. The hierarchy starts from the root and
expands like a tree, adding child nodes to parent nodes.
 In this model, a child node has exactly one parent node.
 This model efficiently describes many real-world relationships, such as the index of a
book or the steps of a recipe.
 Data of two different types is linked by one-to-many relationships: for example, one
department can have many courses, many professors and, of course, many students.

Network Model
 This is an extension of the Hierarchical Model. In this model data is organised more
like a graph, and a child node is allowed to have more than one parent node.
 Because more relationships can be established, the data is more richly connected,
which can make related data easier and faster to reach. This database model was
used to map many-to-many data relationships.
 This was the most widely used database model before the Relational Model was
introduced.
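
The difference between the two models can be sketched in a few lines of Python. This is only an illustration: the node names and the `is_hierarchical` helper are assumptions made for the example, not part of either model's definition.

```python
# Hierarchical model: every node has at most one parent, so a plain
# child -> parent mapping is enough (node names are illustrative).
hierarchy_parent = {
    "Department": None,        # the root
    "Course": "Department",
    "Professor": "Department",
    "Student": "Department",
}

# Network model: a node may have several parents, so each child maps to a set.
network_parents = {
    "Department": set(),
    "Course": {"Department"},
    "Professor": {"Department"},
    "Student": {"Course", "Professor"},   # two parents: no longer a tree
}

def is_hierarchical(parents):
    """True if no node has more than one parent."""
    return all(len(p) <= 1 for p in parents.values())

# Express the tree in the same shape so both can be checked the same way.
tree_as_sets = {child: ({parent} if parent else set())
                for child, parent in hierarchy_parent.items()}

print(is_hierarchical(tree_as_sets))      # True
print(is_hierarchical(network_parents))   # False
```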

Entity-relationship Model
 In this database model, each object of interest is modelled as an entity, and its
characteristics are modelled as attributes.
 Different entities are related using relationships.
 E-R models represent these relationships in pictorial form, which makes them
easier for different stakeholders to understand.
 This model is well suited to designing a database, and the design can then be turned
into tables in the relational model (explained below).
 For example, if we have to design a School database, then Student will be
an entity with attributes such as name, age and address. As Address is generally complex,
it can be another entity with attributes such as street name, pincode and city, and there
will be a relationship between Student and Address.
 Relationships can also be of different types.

Relational Model
 In this model, data is organised in two-dimensional tables, and relationships are
maintained by storing a common field.
 This model was introduced by E. F. Codd in 1970, and since then it has been the most
widely used database model.
 The basic structure of data in the relational model is the table. All the information
related to a particular type is stored in the rows of that table.
 Hence, tables are also known as relations in the relational model.
 In the coming tutorials we will learn how to design tables, normalize them to reduce
data redundancy, and use Structured Query Language (SQL) to access data from
tables.

Entity Relationship Model
The ER Model is used to model the logical view of a system from a data perspective. It
consists of these components:

Entity, Entity Type, Entity Set –

An entity may be an object with a physical existence (a particular person, car, house, or
employee) or an object with a conceptual existence (a company, a job, or a
university course).
An entity is an instance of an entity type, and the set of all entities of that type is
called an entity set. For example, E1 is an entity of entity type Student, and the set of
all students is the entity set. In an ER diagram, an entity type is represented by a
rectangle.

Attribute(s):
 Attributes are the properties that define an entity type. For example, Roll_No,
Name, DOB, Age, Address and Mobile_No are attributes of the entity type
Student. In an ER diagram, an attribute is represented by an oval.

1. Key Attribute –
 An attribute that uniquely identifies each entity in the entity set is called a
key attribute. For example, Roll_No is unique for each student. In an ER
diagram, a key attribute is represented by an oval with the attribute name underlined.
2. Composite Attribute –
 An attribute composed of several other attributes is called a composite
attribute. For example, the Address attribute of the Student entity type consists of
Street, City, State, and Country. In an ER diagram, a composite attribute is
represented by an oval connected to the ovals of its component attributes.

3. Multivalued Attribute –
 An attribute that can hold more than one value for a given entity. For example,
Phone_No (there can be more than one for a given student). In an ER diagram, a
multivalued attribute is represented by a double oval.

4. Derived Attribute –
 An attribute that can be derived from other attributes of the entity type is
known as a derived attribute, for example Age (which can be derived from DOB). In an ER
diagram, a derived attribute is represented by a dashed oval.
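
As a rough illustration of the four attribute kinds, the Student entity type could be sketched as Python dataclasses. The field names follow the examples above; the sample values and the `age` method are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Address:                 # composite attribute: built from sub-attributes
    street: str
    city: str
    state: str
    country: str

@dataclass
class Student:
    roll_no: int               # key attribute: unique per student
    name: str
    dob: date
    address: Address           # composite attribute
    phone_nos: list = field(default_factory=list)   # multivalued attribute

    def age(self, today: date) -> int:
        """Derived attribute: computed from dob, never stored."""
        had_birthday = (today.month, today.day) >= (self.dob.month, self.dob.day)
        return today.year - self.dob.year - (0 if had_birthday else 1)

# Sample values are made up for the example.
s = Student(1, "Asha", date(2005, 6, 15),
            Address("12 Main St", "Pune", "Maharashtra", "India"),
            ["9876723452", "9991165674"])
print(s.age(date(2024, 6, 14)))   # 18 (birthday not yet reached)
print(s.age(date(2024, 6, 15)))   # 19
```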

 The complete entity type Student with its attributes can be represented as:


Relationship Type and Relationship Set:
 A relationship type represents the association between entity types. For
example, ‘Enrolled in’ is a relationship type that exists between the entity types Student
and Course. In an ER diagram, a relationship type is represented by a diamond
connected to the participating entities with lines.

 A set of relationships of the same type is known as a relationship set. For example,
a relationship set may record that S1 is enrolled in C2, S2 is enrolled in C1 and S3 is
enrolled in C3.

Degree of a relationship set:

The number of different entity sets participating in a relationship set is called the
degree of the relationship set.
1. Unary Relationship –
 When only ONE entity set participates in a relationship, the
relationship is called a unary relationship. For example, one person is married
to one other person: both participants come from the same Person entity set.

2. Binary Relationship –
 When TWO entity sets participate in a relationship, the relationship
is called a binary relationship. For example, Student is enrolled in Course.

3. n-ary Relationship –
 When n entity sets participate in a relationship, the relationship is
called an n-ary relationship.

Cardinality:
The number of times an entity of an entity set participates in a relationship set is
known as cardinality. Cardinality can be of different types:
1. One to one – When each entity in each entity set can take part only once in the
relationship, the cardinality is one to one. Let us assume that a male can marry one
female and a female can marry one male. The relationship is then one to one.

2. Many to one – When entities in one entity set can take part only once in the
relationship set and entities in the other entity set can take part more than once in
the relationship set, the cardinality is many to one. Let us assume that a student can take
only one course but one course can be taken by many students. The cardinality is then
n to 1: for one course there can be n students, but for one student
there is only one course.

3. Many to many – When entities in all entity sets can take part more than once in
the relationship, the cardinality is many to many. Let us assume that a student can take
more than one course and one course can be taken by many students. The
relationship is then many to many.

For example, if student S1 is enrolled in C1 and C3, and course C3 is taken by S1,
S3 and S4, the relationship is many to many.
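
The three cases can be told apart mechanically once a relationship set is written as a set of pairs. The sketch below is an illustration only: the `cardinality` helper is a hypothetical name, and it classifies the pairs it is given rather than the constraint declared in a schema.

```python
from collections import Counter

def cardinality(pairs):
    """Classify a binary relationship set given as (left, right) pairs."""
    pairs = set(pairs)
    left_counts = Counter(left for left, _ in pairs)
    right_counts = Counter(right for _, right in pairs)
    left_once = all(c == 1 for c in left_counts.values())
    right_once = all(c == 1 for c in right_counts.values())
    if left_once and right_once:
        return "one to one"
    if left_once:
        return "many to one"    # many left entities share one right entity
    if right_once:
        return "one to many"
    return "many to many"

# Each student takes one course; C1 is taken by two students.
print(cardinality({("S1", "C1"), ("S2", "C1"), ("S3", "C2")}))
# many to one

# S1 takes two courses; C3 is taken by three students.
print(cardinality({("S1", "C1"), ("S1", "C3"), ("S3", "C3"), ("S4", "C3")}))
# many to many
```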

Participation Constraint:
A participation constraint is applied to an entity set participating in a relationship set.
1. Total Participation – Each entity in the entity set must participate in the
relationship. If each student must enroll in a course, the participation of Student is
total. Total participation is shown by a double line in an ER diagram.
2. Partial Participation – An entity in the entity set may or may NOT participate in
the relationship. If some courses have no students enrolled, the
participation of Course is partial.
In the ‘Enrolled in’ relationship set, the Student entity set then has
total participation and the Course entity set has partial participation.

For example, every student in the Student entity set participates in the relationship,
but there may be a course C4 that takes no part in it.
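
A minimal sketch of the same check in Python, assuming the relationship set is stored as (student, course) pairs; the pairs and the `participation` helper are illustrative.

```python
def participation(entity_set, participating):
    """Return 'total' if every entity takes part in the relationship, else 'partial'."""
    return "total" if set(entity_set) <= set(participating) else "partial"

students = {"S1", "S2", "S3"}
courses = {"C1", "C2", "C3", "C4"}
enrolled_in = {("S1", "C1"), ("S2", "C2"), ("S3", "C3")}   # illustrative pairs

print(participation(students, {s for s, _ in enrolled_in}))   # total
print(participation(courses, {c for _, c in enrolled_in}))    # partial (C4 is unused)
```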

Weak Entity Type and Identifying Relationship:
An entity type normally has a key attribute that uniquely identifies each entity in the
entity set. But there are some entity types for which a key attribute can’t be defined.
These are called weak entity types.
For example, a company may store information about the dependents (parents, children,
spouse) of an employee. But the dependents have no existence without the employee. So
Dependent is a weak entity type and Employee is the identifying entity type for
Dependent.
A weak entity type is represented by a double rectangle. The participation of a weak entity
type is always total. The relationship between a weak entity type and its identifying strong
entity type is called the identifying relationship and is represented by a double diamond.

Extended Entity Relationship Model

As the complexity of data increased in the late 1980s, it became more and more difficult to
use the traditional ER Model for database modelling. Hence some
enhancements were made to the existing ER Model so that it could handle complex
applications better.
As part of this Enhanced (Extended) ER Model, along with other improvements, three new
concepts were added to the existing ER Model:

1. Generalization
2. Specialization
3. Aggregation

Let's understand what they are, and why they were added to the existing ER Model.

Generalization
Generalization is a bottom-up approach in which two or more lower-level entities combine
to form a higher-level entity. The higher-level entity can in turn combine with other
lower-level entities to form an even higher-level entity.
It resembles a superclass and subclass system; the distinguishing feature is the approach,
which is bottom-up. Entities are combined to form a more generalised entity; in
other words, subclasses are combined to form a superclass.

For example, the Saving and Current account entities can be generalised into an entity
named Account, which covers both.

Specialization
Specialization is the opposite of Generalization. It is a top-down approach in which one
higher-level entity is broken down into two or more lower-level entities. It is also
possible for a higher-level entity to have no lower-level entity sets at all.

Aggregation
Aggregation is a process in which the relationship between two entities is treated as a
single entity.

For example, the relationship between Center and Course can act as a single
entity, which is in a relationship with another entity, Visitor. In the real world, if a
Visitor or a Student visits a Coaching Center, he/she will never enquire about the center
only or just about the course; rather, he/she will enquire about both.

Relational model:

 The relational model is the theoretical basis of relational databases. It is a
technique for structuring data using relations, which are grid-like mathematical
structures consisting of columns and rows. Codd proposed the relational model while
working at IBM, and the idea proved so influential that his work became
the basis of relational databases. You are probably familiar with the physical
representation of a relation in a database, which is known as a table.

 In the relational model, all data is logically structured within relations, i.e. tables, as
mentioned above. Each relation has a name and is formed from named attributes or
columns of data. Each tuple or row holds one value per attribute. The greatest
strength of the relational model is this simple logical structure. Behind
this simple structure is a sophisticated theoretical foundation that was lacking in the
first generation of DBMSs.

Logical View of Data

A logical data model describes the data in as much detail as possible, without regard to how
it will be physically implemented in the database. Features of a logical data model include:

 Includes all entities and relationships among them.
 All attributes for each entity are specified.
 The primary key for each entity is specified.
 Foreign keys (keys identifying the relationship between different entities) are
specified.
 Normalization occurs at this level.

The steps for designing the logical data model are as follows:

1. Specify primary keys for all entities.
2. Find the relationships between different entities.
3. Find all attributes for each entity.
4. Resolve many-to-many relationships.
5. Normalization.

Figure: Logical Data Model

Comparing a logical data model with a conceptual data model,
we see the main differences between the two:

 In a logical data model, primary keys are present, whereas in a conceptual data
model, no primary key is present.
 In a logical data model, all attributes are specified within an entity. No attributes are
specified in a conceptual data model.
 Relationships between entities are specified using primary keys and foreign keys in
a logical data model. In a conceptual data model, the relationships are simply stated,
not specified, so we simply know that two entities are related, but we do not specify
what attributes are used for this relationship.

Keys
 Keys are a very important part of the relational database model. They are used to
establish and identify relationships between tables and also to uniquely identify any
record or row of data inside a table.
 A key can be a single attribute or a group of attributes, where the combination
acts as a key.

Why do we need a Key?

 In real-world applications, the number of tables required for storing the data is huge,
and the different tables are related to each other as well.
 Tables also store a lot of data. A table often holds thousands of
records, unsorted and unorganised.
 To fetch a particular record from such a dataset, you have to apply some
conditions. But if duplicate data is present, a condition may match several rows,
and you cannot tell which of them is the record you want.
 To avoid all this, keys are defined to easily identify any row of data in a table.
 Let's try to understand all the keys using a simple example.

Let's take a simple Student table, with fields student_id, name, phone and age.

student_id   name   phone        age
----------   ----   ----------   ---
1            Akon   9876723452   17
2            Akon   9991165674   19
3            Bkon   7898756543   18
4            Ckon   8987867898   19
5            Dkon   9990080080   17

Super Key
 A super key is a set of attributes within a table that can uniquely identify
each record within the table. Every candidate key is a super key, and every super key
contains a candidate key.
 In the table defined above, the super keys include student_id, (student_id,
name), phone, etc.

 Confused? The first one is simple: student_id is unique for every row of
data, hence it can be used to identify each row uniquely.
 Next comes (student_id, name). The names of two students can be the same, but
their student_id can't be the same, hence this combination can also be a key.
 Similarly, assuming the phone number of every student is unique, phone can
also be a key.
 So they are all super keys.
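
The super key test can be sketched in Python against the Student rows above. One caveat applies to this sketch: a key is a property of the schema, so checking a sample of rows can only show that a set of attributes is not a super key; passing the check on five rows proves nothing about future rows. The helper name `is_super_key` is an assumption of the example.

```python
COLUMNS = ("student_id", "name", "phone", "age")
STUDENT = [dict(zip(COLUMNS, values)) for values in [
    (1, "Akon", "9876723452", 17),
    (2, "Akon", "9991165674", 19),
    (3, "Bkon", "7898756543", 18),
    (4, "Ckon", "8987867898", 19),
    (5, "Dkon", "9990080080", 17),
]]

def is_super_key(rows, attrs):
    """True if no two rows agree on every attribute in attrs."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return len(seen) == len(rows)

print(is_super_key(STUDENT, ["student_id"]))           # True
print(is_super_key(STUDENT, ["student_id", "name"]))   # True
print(is_super_key(STUDENT, ["phone"]))                # True
print(is_super_key(STUDENT, ["name"]))                 # False: two rows share "Akon"
```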

Candidate Key
 A candidate key is a minimal set of fields that can uniquely identify
each record in a table: no field can be removed from it without losing uniqueness. It is
an attribute or a set of attributes that can act as a primary key for a table.
 In our example, student_id and phone are both candidate keys for the table Student.

 A candidate key can never be NULL or empty, and its value should be unique.
 There can be more than one candidate key for a table.
 A candidate key can be a combination of more than one column (attribute).
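
Minimality is what separates a candidate key from a super key. The following sketch checks it on the sample Student rows; as before, the helper names are illustrative and a check on sample data can only rule keys out, not confirm them for the schema.

```python
from itertools import combinations

COLUMNS = ("student_id", "name", "phone", "age")
STUDENT = [dict(zip(COLUMNS, values)) for values in [
    (1, "Akon", "9876723452", 17),
    (2, "Akon", "9991165674", 19),
    (3, "Bkon", "7898756543", 18),
    (4, "Ckon", "8987867898", 19),
    (5, "Dkon", "9990080080", 17),
]]

def is_super_key(rows, attrs):
    """True if no two rows agree on every attribute in attrs."""
    return len({tuple(row[a] for a in attrs) for row in rows}) == len(rows)

def is_candidate_key(rows, attrs):
    """A super key with no smaller (non-empty) subset that is also a super key."""
    if not is_super_key(rows, attrs):
        return False
    return not any(
        is_super_key(rows, list(subset))
        for size in range(1, len(attrs))
        for subset in combinations(attrs, size)
    )

print(is_candidate_key(STUDENT, ["student_id"]))           # True
print(is_candidate_key(STUDENT, ["phone"]))                # True
print(is_candidate_key(STUDENT, ["student_id", "name"]))   # False: not minimal
```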

Primary Key
 Primary key is a candidate key that is most appropriate to become the main key for
any table. It is a key that can uniquely identify each record in a table.

For the table Student we can make the student_id column as the primary key.

Composite Key
 A key that consists of two or more attributes that together uniquely identify any
record in a table is called a composite key. The attributes that form
the composite key are not keys independently or individually.
 Consider a Score table that stores the marks scored by a
student in a particular subject.
 In this table student_id and subject_id together form the primary key, hence it is
a composite key.
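
A composite key maps naturally onto a Python dictionary keyed by a tuple. The sketch below assumes a Score table with columns student_id, subject_id and marks; the sample values and the `add_score` helper are made up for the example.

```python
# Hypothetical Score rows: (student_id, subject_id) -> marks
scores = {}

def add_score(student_id, subject_id, marks):
    key = (student_id, subject_id)      # composite primary key
    if key in scores:
        raise ValueError(f"duplicate key {key}")
    scores[key] = marks

add_score(1, 101, 70)
add_score(1, 102, 85)    # same student, different subject: allowed
add_score(2, 101, 64)    # same subject, different student: allowed

try:
    add_score(1, 101, 90)    # same (student, subject) pair: rejected
except ValueError as err:
    print(err)               # duplicate key (1, 101)
```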

Secondary or Alternative key

 The candidate keys that are not selected as the primary key are known as secondary
keys or alternative keys.

Non-key Attributes
 Non-key attributes are the attributes or fields of a table that are not part of any
candidate key.

Non-prime Attributes
 Non-prime attributes are attributes other than the primary key attribute(s). In
normalization theory the term is stricter: a non-prime attribute is one that belongs to
no candidate key.

Integrity Rules
 Integrity Constraints
o Integrity constraints are a set of rules used to maintain the quality of
information.
o Integrity constraints ensure that data insertion, updating, and other processes
are performed in such a way that data integrity is not affected.
o Thus, integrity constraints guard against accidental damage to the
database.

Types of Integrity Constraint

1. Domain constraints
o A domain constraint defines the valid set of values for an
attribute.
o The data type of a domain can be string, character, integer, time, date, currency, etc.
The value of the attribute must belong to the corresponding domain.

Example:
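
A domain check can be sketched in Python as one predicate per attribute. The attribute names and the ranges below are assumptions chosen for illustration, not values taken from these notes.

```python
# Hypothetical domains for a Student relation: attribute -> validity predicate.
DOMAINS = {
    "student_id": lambda v: isinstance(v, int) and v > 0,
    "name": lambda v: isinstance(v, str) and v != "",
    "age": lambda v: isinstance(v, int) and 0 < v < 150,
}

def domain_violations(row):
    """Return the attributes whose values fall outside their domain."""
    return [attr for attr, ok in DOMAINS.items() if not ok(row[attr])]

print(domain_violations({"student_id": 1, "name": "Akon", "age": 17}))    # []
print(domain_violations({"student_id": 2, "name": "Bkon", "age": "A"}))   # ['age']
```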

 
2. Entity integrity constraints
o The entity integrity constraint states that a primary key value can't be null.
o This is because the primary key value is used to identify individual rows in a relation,
and if the primary key has a null value, we can't identify those rows.
o Fields other than the primary key may contain null values.

Example:
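
As a small sketch, with `None` standing in for NULL and made-up rows, the rule can be checked like this:

```python
def entity_integrity_violations(rows, primary_key):
    """Return the rows whose primary key contains a None (NULL) component."""
    return [row for row in rows if any(row[a] is None for a in primary_key)]

rows = [
    {"student_id": 1, "name": "Akon", "phone": None},       # NULL outside the key: fine
    {"student_id": None, "name": "Bkon", "phone": "789"},   # NULL in the key: violation
]

bad = entity_integrity_violations(rows, ["student_id"])
print([row["name"] for row in bad])   # ['Bkon']
```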

3. Referential Integrity Constraints


o A referential integrity constraint is specified between two tables.
o If a foreign key in Table 1 refers to the
primary key of Table 2, then every value of the foreign key in Table 1 must either be null
or be present as a primary key value in Table 2.

Example:
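
The rule can be sketched in Python with two small lists of rows. The table contents and the `referential_violations` helper are assumptions for the example; `None` stands in for NULL.

```python
def referential_violations(child_rows, fk, parent_rows, pk):
    """Return child rows whose non-null foreign key has no matching parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows
            if row[fk] is not None and row[fk] not in parent_keys]

dept = [{"dept_id": 10}, {"dept_id": 20}]
employee = [
    {"emp_id": 100, "dept_id": 10},
    {"emp_id": 101, "dept_id": None},   # NULL foreign key is allowed
    {"emp_id": 102, "dept_id": 30},     # no department 30: violation
]

print(referential_violations(employee, "dept_id", dept, "dept_id"))
# [{'emp_id': 102, 'dept_id': 30}]
```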

 
4. Key constraints
o A key is a set of attributes used to identify an entity within its entity set uniquely.
o An entity set can have multiple keys, one of which is chosen as the primary
key. A primary key value must be unique and cannot be null in the relational table.

Example:

Relational set operators

 Relational algebra defines operators that act on relations. An operator
applied to a single relation is called unary, and one applied to two relations is
called binary. The result of applying an operator to a relation is itself
a new relation. Some operations involve multiple steps, and the
intermediate results of those steps are relations too.
This becomes clearer in the operations described below.

Relational Algebra in DBMS has 6 fundamental operations. Several other
operations are defined in terms of these fundamental operations.

Select (σ)

Select (σ) - This is a unary relational operation. It returns the horizontal subset
(a subset of rows) of the relation that satisfies the given conditions. The conditions
can use comparison operators such as <, >, <=, >=, = and !=, and can be combined with the
logical operators AND, OR and NOT. The operation is written as:

σ_{p}(r)

where σ is the symbol for the select operation, r is the relation/table, and p is the
logical formula (the filtering condition) that picks the subset. For example:

σ_{STD_NAME = "James"}(STUDENT)

This expression selects the records/tuples from the STUDENT table
whose student name is ‘James’.

σ_{dept_id = 20 AND salary >= 10000}(EMPLOYEE) - Selects the records from the EMPLOYEE
table with department ID 20 and a salary of at least 10000.
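
A selection can be sketched in Python by treating a relation as a list of dictionaries and the condition p as a function. The EMPLOYEE rows below are invented for the example.

```python
def select(relation, predicate):
    """σ_predicate(relation): keep the rows for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]

# Illustrative rows; the column names follow the example above.
EMPLOYEE = [
    {"emp_id": 100, "dept_id": 20, "salary": 12000},
    {"emp_id": 101, "dept_id": 20, "salary": 9000},
    {"emp_id": 102, "dept_id": 10, "salary": 15000},
]

result = select(EMPLOYEE, lambda r: r["dept_id"] == 20 and r["salary"] >= 10000)
print(result)   # [{'emp_id': 100, 'dept_id': 20, 'salary': 12000}]
```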

Project (∏)

Project (∏) - This is a unary operator. Where select picks rows, project picks
columns: it returns only the listed columns/attributes of the relation, a vertical
subset of the relation. The select operation above, by contrast, returns a subset of
the tuples with all of their attributes. Because the result is a relation, duplicate
rows are removed. It is written as:

∏_{a1, a2, a3}(r)

where ∏ is the projection operator, r is the relation and a1, a2, a3 are the attributes
that appear in the result.

∏_{std_name, address, course}(STUDENT) - This returns all the records from the STUDENT
table but only the columns std_name, address and course. To get selected columns for
one particular student, we combine the project and select operations.

∏_{STD_ID, address, course}(σ_{STD_NAME = "James"}(STUDENT)) - This selects the record
for ‘James’ and displays only the STD_ID, address and course columns. Two unary
operators are combined here, so two operations are performed. First the tuple for
‘James’ is selected from the STUDENT table. The resulting subset of STUDENT is an
intermediate relation; it is temporary and exists only until the end of this operation.
The projection then keeps the 3 columns from this temporary relation.
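
The combination of project and select can be sketched in Python as follows. The STUDENT rows are invented, and both helpers are simplified illustrations working on lists of dictionaries.

```python
def select(relation, predicate):
    """σ: keep the rows for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]

def project(relation, attrs):
    """∏_attrs(relation): keep only attrs and drop duplicate rows."""
    result = []
    for row in relation:
        projected = {a: row[a] for a in attrs}
        if projected not in result:
            result.append(projected)
    return result

# Illustrative rows for the STUDENT relation.
STUDENT = [
    {"std_id": 1, "std_name": "James", "address": "Troy", "course": "DBMS"},
    {"std_id": 2, "std_name": "Kathy", "address": "Troy", "course": "DBMS"},
]

# Projection alone: the two rows collapse into one once std_id and std_name are gone.
print(project(STUDENT, ["address", "course"]))
# [{'address': 'Troy', 'course': 'DBMS'}]

# Select first, then project, as in the combined expression above.
james = project(select(STUDENT, lambda r: r["std_name"] == "James"),
                ["std_id", "address", "course"])
print(james)   # [{'std_id': 1, 'address': 'Troy', 'course': 'DBMS'}]
```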

Rename (ρ)

Rename (ρ) - This is a unary operator used to rename relations and their columns.
In a self join we have to tell two copies of the same table apart, and
renaming the table makes that possible. Likewise, when we join two or more tables that
share column names, it is better to rename the columns to
differentiate them. This typically comes up with the Cartesian product operation.

ρ_{R}(E)

where ρ is the rename operator, E is the existing relation name, and R is the new relation
name.

ρ_{STUDENT}(STD_TABLE) – Renames the STD_TABLE table to STUDENT.

Let us see another example, this time renaming the columns of a table. If the STUDENT
table has ID, NAME and ADDRESS columns and they have to be renamed to STD_ID, STD_NAME
and STD_ADDRESS, then we write:

ρ_{STD_ID, STD_NAME, STD_ADDRESS}(STUDENT) – Renames the columns, assigning the new
names in the order in which the columns appear in the table.

Cartesian product (X)

Cartesian product (X): - This is a binary operator. It combines every tuple of one
relation with every tuple of the other into one relation.

R X S

where R and S are two relations and X is the operator. If relation R has m tuples and
relation S has n tuples, then the resulting relation has m × n tuples. For example, the
Cartesian product of EMPLOYEE (5 tuples) and DEPT (3 tuples) is a
new relation with 15 tuples.

EMPLOYEE X DEPT

This operator simply pairs the tuples of the two tables, i.e. each employee
in the EMPLOYEE table is paired with each department in the DEPT table.
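
The tuple count is easy to confirm with a short Python sketch. The rows are invented, and prefixing attribute names with `r.` and `s.` is just one way, chosen for this example, of keeping same-named columns apart.

```python
from itertools import product

def cartesian_product(r, s):
    """R X S: pair every tuple of r with every tuple of s.

    Attribute names are prefixed so that same-named columns stay distinct."""
    return [
        {**{f"r.{k}": v for k, v in a.items()},
         **{f"s.{k}": v for k, v in b.items()}}
        for a, b in product(r, s)
    ]

EMPLOYEE = [{"emp_id": i, "dept_id": 10} for i in range(100, 105)]   # 5 tuples
DEPT = [{"dept_id": d} for d in (10, 20, 30)]                         # 3 tuples

pairs = cartesian_product(EMPLOYEE, DEPT)
print(len(pairs))   # 15
print(pairs[0])     # {'r.emp_id': 100, 'r.dept_id': 10, 's.dept_id': 10}
```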

Union (U)

Union (U) - It is a binary operator that combines the tuples of two relations. It is
denoted by

R U S

where R and S are the relations and U is the operator. For example:

DESIGN_EMPLOYEE U TESTING_EMPLOYEE

where DESIGN_EMPLOYEE and TESTING_EMPLOYEE are two relations.

It differs from the Cartesian product in the following ways:

 Cartesian product combines the attributes of two relations into one relation,
whereas union combines the tuples of two relations into one relation.
 In a union, both relations must have the same number of columns. Suppose we have to
list the employees who work for the design and testing departments. Both relations hold
employee tuples, so they have the same number of attributes. Cartesian product places
no condition on the number of attributes or rows; it simply combines the attributes.
 In a union, both relations must have the same types of attributes in the same order.
In the above example, both relations hold employee tuples, so the attributes have the
same types in the same order.

The two relations need not have the same number of tuples. If the union produces
duplicate tuples, only one copy is kept. A tuple that is present in either relation
appears in the new relation. In the above example, the number of employees in the
design department need not equal the number in the testing department.
We would not be able to take the union of these tables if the order of columns or the
number of columns were different.
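
Python sets of tuples behave the same way, which makes for a compact sketch. The tuple (102, "Mathew") is invented for the example, and the compatibility check only compares the number of columns, not their types.

```python
def union(r, s):
    """R U S for relations given as sets of tuples with the same number of columns."""
    if len({len(t) for t in r | s}) > 1:
        raise ValueError("relations are not union-compatible")
    return r | s    # a set keeps a single copy of duplicate tuples

DESIGN_EMPLOYEE = {(100, "James"), (104, "Kathy")}
TESTING_EMPLOYEE = {(100, "James"), (102, "Mathew")}   # (102, "Mathew") is made up

print(sorted(union(DESIGN_EMPLOYEE, TESTING_EMPLOYEE)))
# [(100, 'James'), (102, 'Mathew'), (104, 'Kathy')]
```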

Set-difference (-)

Set-difference (-) - This is a binary operator. It creates a new relation with the
tuples that are in one relation but not in the other. It is denoted by the ‘-’ symbol.

R - S

where R and S are the relations.

Suppose we want to retrieve the employees who are working in the Design department but
not in Testing.

DESIGN_EMPLOYEE - TESTING_EMPLOYEE
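
With relations modelled as Python sets of tuples, set difference is the built-in `-` operator. The sample tuples are illustrative.

```python
def set_difference(r, s):
    """R - S: tuples that are in r but not in s."""
    return r - s

DESIGN_EMPLOYEE = {(100, "James"), (104, "Kathy")}
TESTING_EMPLOYEE = {(100, "James"), (102, "Mathew")}   # illustrative tuples

print(set_difference(DESIGN_EMPLOYEE, TESTING_EMPLOYEE))   # {(104, 'Kathy')}
print(set_difference(TESTING_EMPLOYEE, DESIGN_EMPLOYEE))   # {(102, 'Mathew')}
```

The two results differ, which shows that set difference, unlike union, is not commutative.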

There are additional relational operations based on the above fundamental
operations. Some of them are:

Set Intersection

Set Intersection - This is a binary operation. It results in a relation with the tuples
that are in both relations. It is denoted by ‘∩’.

R ∩ S

where R and S are the relations. It picks all the tuples that are present in both R and S
and returns them as a new relation.
Suppose we have to find the employees who are working in both the design and testing
departments. If the two relations have no tuple in common, the result relation is
empty. Suppose instead that (100, James) appears in both relations while (104, Kathy)
appears only in DESIGN_EMPLOYEE.

Set intersection can also be written as a combination of set difference operations.

R ∩ S = R - (R - S)

i.e. it first evaluates R - S to get the tuples that are present only in R, and then
removes those tuples from R. What remains are the tuples of R that are also in S.

In the above example of employees,

DESIGN_EMPLOYEE - (DESIGN_EMPLOYEE - TESTING_EMPLOYEE)

It first finds the employees who are only design employees: (104, Kathy). This
result is then subtracted from DESIGN_EMPLOYEE, which leaves the design
employees who are not in the intermediate result: (100, James). This tuple
is both a designer and a tester. The fundamental set-difference
operator is used twice to obtain the intersection, so set intersection is not a
fundamental operation.
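
The identity can be checked directly with Python sets. The TESTING_EMPLOYEE tuple (102, "Mathew") is invented for the example.

```python
DESIGN_EMPLOYEE = {(100, "James"), (104, "Kathy")}
TESTING_EMPLOYEE = {(100, "James"), (102, "Mathew")}   # illustrative tuples

only_design = DESIGN_EMPLOYEE - TESTING_EMPLOYEE    # R - S
via_difference = DESIGN_EMPLOYEE - only_design      # R - (R - S)
direct = DESIGN_EMPLOYEE & TESTING_EMPLOYEE         # R ∩ S

print(only_design)                # {(104, 'Kathy')}
print(via_difference)             # {(100, 'James')}
print(via_difference == direct)   # True
```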

Assignment

Assignment - As the name indicates, the assignment operator ‘←’ is used to assign the
result of a relational operation to a temporary relational variable. This is useful when
a relational operation has multiple steps and handling everything in a single expression
is difficult. Assigning the result to a temporary relation and using that relation in
the next operation makes the task simple and easy.

T ← S – denotes that relation S is assigned to the temporary relation T

A relational operation ∏_{a1, a2}(σ_{p}(E)) with selection and projection can be divided
as below.

T ← σ_{p}(E)

S ← ∏_{a1, a2}(T)

Our projection example above, which gets STD_ID, ADDRESS and COURSE for the student
‘James’, can be rewritten as below.

∏_{STD_ID, address, course}(σ_{STD_NAME = "James"}(STUDENT))

T ← σ_{STD_NAME = "James"}(STUDENT)

S ← ∏_{STD_ID, address, course}(T)

Natural Join

Natural join - As we have seen above, the Cartesian product simply pairs every tuple of
one relation with every tuple of the other, so most of the resulting tuples are
meaningless combinations. To keep only the meaningful tuples, we apply a selection
to the Cartesian product result, keeping the pairs whose common attributes match. This
sequence of operations, Cartesian product followed by selection (and a projection that
drops the duplicated common columns), is combined into one operation called natural
join. It is denoted by ⋈.

R ⋈ S

Suppose we want to select the employees who are working for department 10. Using the
Cartesian product, we would combine EMPLOYEE and DEPT and keep the tuples in which
DEPT_ID matches in both relations and equals 10:

σ_{EMPLOYEE.DEPT_ID = DEPT.DEPT_ID AND EMPLOYEE.DEPT_ID = 10}(EMPLOYEE X DEPT)

The matching on DEPT_ID can be written using natural join as EMPLOYEE ⋈ DEPT, so the
whole query becomes σ_{DEPT_ID = 10}(EMPLOYEE ⋈ DEPT).
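
A natural join can be sketched in Python by matching rows on every attribute name the two relations share. The rows and the extra columns (emp_name, dept_name) are assumptions made for the example.

```python
def natural_join(r, s):
    """R ⋈ S: combine rows that agree on every attribute name the relations share."""
    result = []
    for a in r:
        for b in s:
            common = a.keys() & b.keys()
            if all(a[k] == b[k] for k in common):
                result.append({**a, **b})   # shared columns appear only once
    return result

# Illustrative rows.
EMPLOYEE = [
    {"emp_id": 100, "emp_name": "James", "dept_id": 10},
    {"emp_id": 101, "emp_name": "Kathy", "dept_id": 20},
]
DEPT = [
    {"dept_id": 10, "dept_name": "Design"},
    {"dept_id": 20, "dept_name": "Testing"},
    {"dept_id": 30, "dept_name": "Support"},
]

joined = natural_join(EMPLOYEE, DEPT)
in_dept_10 = [row for row in joined if row["dept_id"] == 10]
print(len(joined))   # 2: department 30 has no employee, so it drops out
print(in_dept_10)
# [{'emp_id': 100, 'emp_name': 'James', 'dept_id': 10, 'dept_name': 'Design'}]
```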


From the above example, we see that only the matching data from both relations is
retained in the final relation. Suppose we want to retain all the information from the
first relation, together with the corresponding information from the second relation
whether or not it exists. In such a case we use an outer join. Unlike the Cartesian
product, an outer join creates a combined tuple from both tables only where there is a
real match; where there is no match, null is placed in the unmatched attributes. The
types of outer join are described below.

There are three types of outer joins:

Left Outer Join

Left outer join - In this operation, all the tuples of the left-hand relation are
retained. Matching attributes from the right-hand relation are displayed with their
values, and those with no match are shown as NULL.

For example, a left outer join of the DEPT and EMPLOYEE tables combines the matching
rows for DEPT_ID = 10 with their values. But if DEPT_ID = 30 does not have any employees
yet, NULL is displayed for its employee attributes. Thus an outer join is a more
meaningful way of combining two relations than a Cartesian product.
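
The behaviour described above can be sketched in Python, with `None` standing in for NULL. The helper joins on a single named key and needs to be told which right-hand columns to pad; both are simplifications made for this example, and the rows are invented.

```python
def left_outer_join(left, right, key, right_attrs):
    """Keep every left row; pad right_attrs with None when nothing matches on key."""
    result = []
    for l_row in left:
        matches = [r_row for r_row in right if r_row[key] == l_row[key]]
        if matches:
            result.extend({**l_row, **m} for m in matches)
        else:
            result.append({**l_row, **{a: None for a in right_attrs}})
    return result

DEPT = [
    {"dept_id": 10, "dept_name": "Design"},
    {"dept_id": 30, "dept_name": "Support"},   # no employees yet
]
EMPLOYEE = [{"emp_id": 100, "dept_id": 10}]

rows = left_outer_join(DEPT, EMPLOYEE, "dept_id", ["emp_id"])
for row in rows:
    print(row)
# {'dept_id': 10, 'dept_name': 'Design', 'emp_id': 100}
# {'dept_id': 30, 'dept_name': 'Support', 'emp_id': None}
```

Swapping the two arguments gives the right outer join of the same pair of tables.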

Right outer join

Right outer join - This is the opposite of left outer join. Here all the tuples of the
right-hand relation are retained, and the matching attributes from the left-hand
relation are found and displayed. If no match is found, null is displayed. Applied to
the same DEPT and EMPLOYEE example, the result differs from the left outer join in the
order of the columns and in which rows are padded with nulls.

Full Outer Join

Full outer join - This is the combination of both left and right outer join. It displays
all the tuples from both relations. Where a matching tuple exists in the other relation,
its attributes are displayed; otherwise those attributes are shown as null.

Division

Division - This operation is used to answer queries that contain the phrase ‘for all’.
It is denoted by ‘÷’. Suppose we want to see all the employees who work in all of the
departments. What are the steps involved in finding this?

First we find all the department IDs: T1 ← ∏_{DEPT_ID}(DEPARTMENT)

Next we list all the employees and their departments: T2 ← ∏_{EMP_ID, DEPT_ID}(EMPLOYEE)

In the third step we find the employees in T2 that are paired with every department ID
in T1. This is obtained by using the division operation: T2 ÷ T1
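
The three steps can be sketched in Python with T1 as a set of department IDs and T2 as a set of (emp_id, dept_id) pairs. The sample values are invented, and `divide` handles only this two-column case.

```python
def divide(t2, t1):
    """T2 ÷ T1, where T2 holds (emp_id, dept_id) pairs and T1 holds dept_ids."""
    employees = {emp for emp, _ in t2}
    return {emp for emp in employees if all((emp, dept) in t2 for dept in t1)}

T1 = {10, 20}                               # all department IDs
T2 = {(100, 10), (100, 20), (101, 10)}      # who works in which department

print(divide(T2, T1))   # {100}: only employee 100 works in every department
```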

Data Dictionary and System Catalog

 A data dictionary consists of database metadata. It holds records about the objects
in the database.

 A data dictionary also holds the following information:

1. Names of the tables in the database
2. Constraints of a table, i.e. keys, relationships, etc.
3. Columns of the tables that are related to each other
4. Owner of the table
5. Last accessed information of the object
6. Last updated information of the object

As an example, consider a table that stores the personal details of a student:

Table: StudentPersonalDetails
Fields: Student_ID, Student_Name, Student_Address, Student_City

The data dictionary for these fields records metadata such as each field's name and
data type.

Types of Data Dictionary

Here are the two types of data dictionary:

 Active Data Dictionary

The DBMS software manages the active data dictionary automatically. It is updated
automatically whenever the database structure changes, and most RDBMSs have an active
data dictionary. It is also known as an integrated data dictionary.

 Passive Data Dictionary

The passive data dictionary is managed by the users and is modified manually when the
database structure changes. It is also known as a non-integrated data dictionary.
Codd’s Relational Database Rules
 Dr Edgar F. Codd, after his extensive research on the Relational Model of database
systems, came up with twelve rules which, according to him, a database must obey in
order to be regarded as a true relational database.
 These rules rest on a foundation rule (often numbered Rule 0): a relational database
system must be able to manage stored data using only its relational capabilities.
It acts as a base for all the other rules.

Rule 1: Information Rule


 The data stored in a database, whether it is user data or metadata, must be a value
of some table cell. Everything in a database must be stored in a table format.

Rule 2: Guaranteed Access Rule


 Every single data element (value) is guaranteed to be accessible logically with a
combination of table-name, primary-key (row value), and attribute-name (column
value). No other means, such as pointers, can be used to access data.

Rule 3: Systematic Treatment of NULL Values


 The NULL values in a database must be given a systematic and uniform treatment.
This is a very important rule because a NULL can be interpreted as one of the
following − data is missing, data is not known, or data is not applicable.

Rule 4: Active Online Catalog


 The structure description of the entire database must be stored in an online catalog,
known as data dictionary, which can be accessed by authorized users. Users can
use the same query language to access the catalog which they use to access the
database itself.
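As an illustration of Rule 4, the sketch below uses Python's built-in sqlite3 module. SQLite keeps its catalog in a table called sqlite_master, which is read with an ordinary SELECT. The choice of SQLite and the column types are assumptions made for this example; the table and column names are borrowed from the data dictionary example earlier in these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column types are assumed here; the notes only list the column names.
conn.execute(
    """
    CREATE TABLE StudentPersonalDetails (
        Student_ID      INTEGER PRIMARY KEY,
        Student_Name    TEXT,
        Student_Address TEXT,
        Student_City    TEXT
    )
    """
)

# The catalog is itself a table, so the same query language (SQL) reads it.
catalog = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
).fetchall()

tables = [name for name, _ in catalog]
print(tables)
print(catalog[0][1])  # the stored CREATE TABLE statement

conn.close()
```

Other systems expose their catalog under different names, but the idea is the same: metadata is queried like any other table.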

Rule 5: Comprehensive Data Sub-Language Rule


 A database can only be accessed using a language having linear syntax that supports
data definition, data manipulation, and transaction management operations. This
language can be used directly or by means of some application. If the database
allows access to data without any help of this language, then it is considered as a
violation.

Rule 6: View Updating Rule


 All the views of a database, which can theoretically be updated, must also be
updatable by the system.

Rule 7: High-Level Insert, Update, and Delete Rule


 A database must support high-level insertion, update, and deletion. These operations
must not be limited to a single row; that is, the system must also support union,
intersection and minus operations to yield sets of data records.
Rule 8: Physical Data Independence
 The data stored in a database must be independent of the applications that access
the database. Any change in the physical structure of a database must not have any
impact on how the data is being accessed by external applications.

Rule 9: Logical Data Independence


 The logical data in a database must be independent of its user’s view (application).
Any change in logical data must not affect the applications using it. For example, if
two tables are merged or one is split into two different tables, there should be no
impact or change on the user application. This is one of the most difficult rules to
apply.

Rule 10: Integrity Independence


 A database must be independent of the application that uses it. All its integrity
constraints can be independently modified without the need of any change in the
application. This rule makes a database independent of the front-end application
and its interface.

Rule 11: Distribution Independence


 The end-user must not be able to see that the data is distributed over various
locations. Users should always get the impression that the data is located at one site
only. This rule has been regarded as the foundation of distributed database
systems.

Rule 12: Non-Subversion Rule


 If a system has an interface that provides access to low-level records, then the
interface must not be able to subvert the system and bypass security and integrity
constraints.

Normalization of Database Tables

 Normalization is a process to eliminate the flaws of a badly designed database. A
poorly designed database is inconsistent and creates issues while adding, deleting or
updating information.

 The following make Database Normalization a crucial step in the database design
process:
Resolving the database anomalies

 The forms of Normalization, i.e. 1NF, 2NF, 3NF, BCNF, 4NF and 5NF, remove the
Insert, Update and Delete anomalies.

 Insertion Anomaly occurs when you try to insert data into a record that does not
exist yet, i.e. a fact cannot be stored until some other, unrelated data is available.

 Deletion Anomaly occurs when data is to be deleted and, due to the poor design of
the database, other unrelated data is deleted along with it.

 Eliminate Redundancy of Data

Storing the same data item multiple times is known as Data Redundancy. A normalized table
does not have the issue of redundancy of data.

 Data Dependency

Normalization ensures that each piece of data is stored in the table it logically depends on.

 Isolation of Data

A well-designed database ensures that changes in one table or field do not affect others.
This is achieved through Normalization.

 Data Consistency

If a record is missed during an update, it can lead to inconsistent data. Normalization
resolves this and ensures Data Consistency.

ADVANTAGES OF NORMALIZATION
The following are the advantages of the normalization.
• More efficient data structure.
• Avoid redundant fields or columns.
• More flexible data structure i.e. we should be able to add new rows and data values easily
• Better understanding of data.
• Ensures that distinct tables exist when necessary.
• Easier to maintain data structure i.e. it is easy to perform operations and complex queries
can be easily handled.
• Minimizes data duplication.
• Close modeling of real world entities, processes and their relationships.
DISADVANTAGES OF NORMALIZATION
The following are disadvantages of normalization. 
• You cannot start building the database before you know what the user needs.
• On normalizing the relations to higher normal forms, i.e. 4NF, 5NF, the performance
degrades.
• Normalizing relations of higher degree is a very time-consuming and difficult process.
• Careless decomposition may lead to a bad database design, which may lead to serious
problems.

Functional Dependency
A functional dependency A->B holds in a relation if any two tuples having the same value of
attribute A also have the same value for attribute B. For example, in relation STUDENT shown
in table 1, the functional dependencies
STUD_NO->STUD_NAME, STUD_NO->STUD_ADDR hold
but
STUD_NAME->STUD_ADDR does not hold
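The definition can be checked mechanically against a set of rows. The sketch below is a minimal illustration; the function name fd_holds and the sample rows are hypothetical and are not the contents of table 1. Note that a check like this can only show that an FD is not violated by the rows at hand; whether it holds in general depends on the meaning of the data.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on `lhs` but differ on `rhs`."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        # setdefault stores the first right-hand value seen for this left-hand value.
        if seen.setdefault(left, right) != right:
            return False
    return True


# Hypothetical rows: two different students share the name RAM.
STUDENT = [
    {"STUD_NO": 1, "STUD_NAME": "RAM", "STUD_ADDR": "DELHI"},
    {"STUD_NO": 2, "STUD_NAME": "RAM", "STUD_ADDR": "PUNE"},
    {"STUD_NO": 3, "STUD_NAME": "SUJIT", "STUD_ADDR": "DELHI"},
]

print(fd_holds(STUDENT, ["STUD_NO"], ["STUD_NAME"]))
print(fd_holds(STUDENT, ["STUD_NO"], ["STUD_ADDR"]))
print(fd_holds(STUDENT, ["STUD_NAME"], ["STUD_ADDR"]))
```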

How to find functional dependencies for a relation?


The functional dependencies of a relation depend on the domain of the relation, i.e. on the real-world meaning of its attributes.
Consider the STUDENT relation given in Table 1.
 We know that STUD_NO is unique for each student. So STUD_NO->STUD_NAME,
STUD_NO->STUD_PHONE, STUD_NO->STUD_STATE, STUD_NO->STUD_COUNTRY and
STUD_NO -> STUD_AGE all will be true.
 Similarly, STUD_STATE->STUD_COUNTRY will be true as if two records have same
STUD_STATE, they will have same STUD_COUNTRY as well.
 For relation STUDENT_COURSE, COURSE_NO->COURSE_NAME will be true as two
records with same COURSE_NO will have same COURSE_NAME.

Functional Dependency Set:  Functional Dependency set or FD set of a relation is the set
of all FDs present in the relation. For Example, FD set for relation STUDENT shown in table
1 is:
{ STUD_NO->STUD_NAME, STUD_NO->STUD_PHONE, STUD_NO->STUD_STATE,
STUD_NO->STUD_COUNTRY, STUD_NO->STUD_AGE, STUD_STATE->STUD_COUNTRY }
Attribute Closure:

 Attribute closure of an attribute set can be defined as set of attributes which can be
functionally determined from it.
 How to find attribute closure of an attribute set?
To find attribute closure of an attribute set:
 Add elements of attribute set to the result set.
 Recursively add elements to the result set which can be functionally determined
from the elements of the result set.
Using FD set of table 1, attribute closure can be determined as:
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_COUNTRY,
STUD_AGE}
(STUD_STATE)+ = {STUD_STATE, STUD_COUNTRY}
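The two steps above can be sketched as a small fixed-point loop. The function name closure and the representation of an FD as a (left side, right side) pair of tuples are choices made for this example.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) pairs."""
    result = set(attrs)            # step 1: start with the attribute set itself
    changed = True
    while changed:                 # step 2: repeat until nothing new is added
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


# FD set of the STUDENT relation as listed in the notes.
FDS = [
    (("STUD_NO",), ("STUD_NAME",)),
    (("STUD_NO",), ("STUD_PHONE",)),
    (("STUD_NO",), ("STUD_STATE",)),
    (("STUD_NO",), ("STUD_COUNTRY",)),
    (("STUD_NO",), ("STUD_AGE",)),
    (("STUD_STATE",), ("STUD_COUNTRY",)),
]

print(sorted(closure({"STUD_NO"}, FDS)))
print(sorted(closure({"STUD_STATE"}, FDS)))
```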

 How to find Candidate Keys and Super Keys using Attribute Closure?
 If attribute closure of an attribute set contains all attributes of relation, the attribute
set will be super key of the relation.
 If no subset of this attribute set can functionally determine all attributes of the
relation, the set will be candidate key as well. For Example, using FD set of table 1,
(STUD_NO, STUD_NAME)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE}
(STUD_NO)+ = {STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_COUNTRY,
STUD_AGE}
(STUD_NO, STUD_NAME) will be a super key but not a candidate key, because the closure of its
subset (STUD_NO) already contains all attributes of the relation. So, STUD_NO will be a candidate key.

GATE Question: Consider the relation scheme R = {E, F, G, H, I, J, K, L, M, N} and the set
of functional dependencies {{E, F} -> {G}, {F} -> {I, J}, {E, H} -> {K, L}, {K} -> {M}, {L} -> {N}}
on R. What is the key for R? (GATE-CS-2014)
A. {E, F}
B. {E, F, H}
C. {E, F, H, K, L}
D. {E}
Answer: Finding the attribute closure of all given options, we get:
{E,F}+ = {E,F,G,I,J}
{E,F,H}+ = {E,F,G,H,I,J,K,L,M,N}
{E,F,H,K,L}+ = {E,F,G,H,I,J,K,L,M,N}
{E}+ = {E}
{E,F,H}+ and {E,F,H,K,L}+ both result in the set of all attributes, but {E,F,H} is minimal.
So it will be the candidate key, and the correct option is (B).
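The option-by-option check can be reproduced with a short script. This is a sketch under the assumption that each attribute is a single letter, so attribute sets can be written as strings; the helper names is_superkey and is_candidate_key are introduced here for illustration.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_superkey(attrs, all_attrs, fds):
    return closure(attrs, fds) == set(all_attrs)


def is_candidate_key(attrs, all_attrs, fds):
    # Minimal superkey: dropping any single attribute must break the superkey property.
    if not is_superkey(attrs, all_attrs, fds):
        return False
    return not any(is_superkey(set(attrs) - {a}, all_attrs, fds) for a in attrs)


R = "EFGHIJKLMN"
FDS = [("EF", "G"), ("F", "IJ"), ("EH", "KL"), ("K", "M"), ("L", "N")]

for option in ["EF", "EFH", "EFHKL", "E"]:
    print(option, is_superkey(option, R, FDS), is_candidate_key(option, R, FDS))
```

Checking single-attribute removals is enough for minimality, because any subset of a non-superkey is also a non-superkey.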

 How to check whether an FD can be derived from a given FD set?


To check whether an FD A->B can be derived from an FD set F,
1. Find (A)+ using FD set F.
2. If B is subset of (A)+, then A->B is true else not true.
GATE Question: In a schema with attributes A, B, C, D and E following set of functional
dependencies are given
{A -> B, A -> C, CD -> E, B -> D, E -> A}
Which of the following functional dependencies is NOT implied by the above set?
A. CD -> AC
B. BD -> CD
C. BC -> CD
D. AC -> BC
Answer: Using the FD set given in the question,
(CD)+ = {C,D,E,A,B}, which means CD -> AC holds true.
(BD)+ = {B,D}, which means BD -> CD can’t hold true. So this FD is not implied by the FD set,
and (B) is the required option. The others can be checked in the same way.
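The two-step test (compute the closure of the left side, then check that the right side is contained in it) can be sketched as follows. Single-letter attributes are assumed so that attribute sets can be written as strings, and the function name implies is a name chosen for this example.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True if lhs -> rhs can be derived from `fds`."""
    return set(rhs) <= closure(lhs, fds)


FDS = [("A", "B"), ("A", "C"), ("CD", "E"), ("B", "D"), ("E", "A")]

options = {"A": ("CD", "AC"), "B": ("BD", "CD"), "C": ("BC", "CD"), "D": ("AC", "BC")}
for label, (lhs, rhs) in options.items():
    print(label, f"{lhs} -> {rhs}", implies(FDS, lhs, rhs))
```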

 Prime and non-prime attributes


Attributes which are part of any candidate key of a relation are called prime attributes;
the others are non-prime attributes. For example, STUD_NO in the STUDENT relation is a prime
attribute; the others are non-prime attributes.
GATE Question:  Consider a relation scheme R = (A, B, C, D, E, H) on which the
following functional dependencies hold: {A–>B, BC–> D, E–>C, D–>A}. What are the
candidate keys of R? [GATE 2005]
(a) AE, BE
(b) AE, BE, DE
(c) AEH, BEH, BCH
(d) AEH, BEH, DEH
Answer: (AE)+ = {A,B,C,D,E}, which is not the set of all attributes (H is missing). So AE is
not a candidate key, and options (a) and (b) are wrong.
(AEH)+ = {A,B,C,D,E,H}
(BEH)+ = {A,B,C,D,E,H}
(BCH)+ = {A,B,C,D,H}, which is not the set of all attributes. So BCH is not a candidate key,
and option (c) is wrong.
(DEH)+ = {A,B,C,D,E,H}
So the correct answer is (d).
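For small schemas, all candidate keys (and from them the prime attributes) can be found by brute force: try attribute subsets in order of increasing size and keep every subset whose closure is the whole relation and which does not contain an already found key. The sketch below assumes single-letter attributes; the function names are chosen for this example.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(all_attrs, fds):
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            subset = set(combo)
            if any(key <= subset for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(subset, fds) == set(all_attrs):
                keys.append(subset)
    return keys


R = "ABCDEH"
FDS = [("A", "B"), ("BC", "D"), ("E", "C"), ("D", "A")]

keys = candidate_keys(R, FDS)
prime = set().union(*keys)
non_prime = set(R) - prime

print(["".join(sorted(k)) for k in keys])
print(sorted(prime), sorted(non_prime))
```

The search is exponential in the number of attributes, which is acceptable for exam-sized schemas but not for large ones.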

1. First Normal Form –

If a relation contains a composite or multi-valued attribute, it violates first normal form.
Equivalently, a relation is in first normal form if it does not contain any composite or
multi-valued attribute, i.e. if every attribute in that relation is a single-valued
attribute.
 Example 1 – Relation STUDENT in table 1 is not in 1NF because of multi-valued
attribute STUD_PHONE. Its decomposition into 1NF has been shown in table 2.

 Example 2 –

ID Name Courses
------------------
1 A c1, c2
2 E c3
3 M c2, c3
In the above table Courses is a multi-valued attribute, so the table is not in 1NF.
The table below is in 1NF as there is no multi-valued attribute:
ID Name Course
------------------
1 A c1
1 A c2
2 E c3
3 M c2
3 M c3
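The conversion to 1NF amounts to emitting one row per value of the multi-valued attribute. A minimal sketch, with the first table of Example 2 represented as Python tuples (the function name to_1nf is chosen for this example):

```python
def to_1nf(rows):
    """Flatten (id, name, [courses]) rows into one (id, name, course) row per course."""
    return [(sid, name, course) for sid, name, courses in rows for course in courses]


# The non-1NF table from Example 2, with Courses held as a list.
students = [
    (1, "A", ["c1", "c2"]),
    (2, "E", ["c3"]),
    (3, "M", ["c2", "c3"]),
]

flat = to_1nf(students)
for row in flat:
    print(row)
```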

2. Second Normal Form –

To be in second normal form, a relation must be in first normal form and relation must not
contain any partial dependency. A relation is in 2NF iff it has No Partial
Dependency, i.e., no non-prime attribute (attributes which are not part of any candidate
key) is dependent on any proper subset of any candidate key of the table.
Partial Dependency – If proper subset of candidate key determines non-prime attribute, it
is called partial dependency.
 Example 1 – In relation STUDENT_COURSE given in Table 3,
 FD set: {COURSE_NO->COURSE_NAME}
 Candidate Key: {STUD_NO, COURSE_NO}
In FD COURSE_NO->COURSE_NAME, COURSE_NO (proper subset of candidate key) is
determining COURSE_NAME (non-prime attribute). Hence, it is partial dependency
and relation is not in second normal form.
To convert it to second normal form, we will decompose the relation
STUDENT_COURSE (STUD_NO, COURSE_NO, COURSE_NAME) as :
STUDENT_COURSE (STUD_NO, COURSE_NO)
COURSE (COURSE_NO, COURSE_NAME)
Note – This decomposition will be lossless join decomposition as well as dependency
preserving.
 Example 2 – Consider the following functional dependencies in relation R (A, B, C, D)
AB -> C [A and B together determine C]
BC -> D [B and C together determine D]
In the above relation, AB is the only candidate key and there is no partial dependency,
i.e., no proper subset of AB determines any non-prime attribute. Hence the relation is in
second normal form.
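Both 2NF examples can be checked with a short sketch that looks for partial dependencies among the listed FDs. It is a simplification: it inspects only the FDs that are given, not every FD that could be derived from them, and it takes the candidate keys as input. The function name partial_dependencies is chosen for this example.

```python
def partial_dependencies(fds, candidate_keys):
    """Return the listed FDs whose LHS is a proper subset of a candidate key
    and whose RHS contains at least one non-prime attribute."""
    prime = set().union(*candidate_keys)
    found = []
    for lhs, rhs in fds:
        determines_non_prime = bool(set(rhs) - prime)
        lhs_is_part_of_key = any(set(lhs) < set(key) for key in candidate_keys)
        if determines_non_prime and lhs_is_part_of_key:
            found.append((lhs, rhs))
    return found


# Example 1: STUDENT_COURSE (STUD_NO, COURSE_NO, COURSE_NAME)
ex1 = partial_dependencies(
    [({"COURSE_NO"}, {"COURSE_NAME"})],
    [{"STUD_NO", "COURSE_NO"}],
)

# Example 2: R(A, B, C, D) with AB -> C and BC -> D, candidate key AB
ex2 = partial_dependencies(
    [({"A", "B"}, {"C"}), ({"B", "C"}, {"D"})],
    [{"A", "B"}],
)

print(ex1)  # one partial dependency, so not in 2NF
print(ex2)  # empty list, so no partial dependency
```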

3. Third Normal Form –

A relation is in third normal form if it is in second normal form and there is no transitive
dependency for non-prime attributes.
A relation is in 3NF iff at least one of the following conditions holds for every non-trivial
functional dependency X –> Y
1. X is a super key.
2. Y is a prime attribute (each element of Y is part of some candidate key).
Transitive dependency – If A->B and B->C are two FDs then A->C is called transitive
dependency.
 Example 1 – In relation STUDENT given in Table 4,
FD set: {STUD_NO -> STUD_NAME, STUD_NO -> STUD_STATE, STUD_STATE ->
STUD_COUNTRY, STUD_NO -> STUD_AGE}
Candidate Key: {STUD_NO}
For this relation in table 4, STUD_NO -> STUD_STATE and STUD_STATE ->
STUD_COUNTRY are true. So STUD_COUNTRY is transitively dependent on STUD_NO.
It violates third normal form. To convert it to third normal form, we will decompose
the relation STUDENT (STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE,
STUD_COUNTRY, STUD_AGE) as:
STUDENT (STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_AGE)
STATE_COUNTRY (STATE, COUNTRY)
 Example 2 – Consider relation R(A, B, C, D, E)
A -> BC,
CD -> E,
B -> D,
E -> A
All possible candidate keys in the above relation are {A, E, CD, BC}. All attributes on the
right-hand sides of the functional dependencies are prime, so the relation is in 3NF.

4. Boyce-Codd Normal Form (BCNF) –

A relation R is in BCNF if R is in Third Normal Form and, for every FD, the LHS is a super key.
Equivalently, a relation is in BCNF iff in every non-trivial functional dependency X –> Y, X is a super key.
Example 1 – Find the highest normal form of a relation R(A,B,C,D,E) with
FD set as {BC->D, AC->BE, B->E}
 Step 1. As we can see, (AC)+ = {A,C,B,E,D} but no proper subset of AC can
determine all attributes of the relation, so AC will be a candidate key. A
and C can’t be derived from any other attribute of the relation, so
there will be only 1 candidate key, {AC}.
 Step 2. Prime attributes are those attributes which are part of a
candidate key, {A,C} in this example; the others are non-prime,
{B,D,E} in this example.
 Step 3. The relation R is in 1st normal form as a relational DBMS
does not allow multi-valued or composite attributes.
The relation is in 2nd normal form because BC->D is in 2nd normal
form (BC is not a proper subset of candidate key AC), AC->BE is
in 2nd normal form (AC is a candidate key) and B->E is in 2nd
normal form (B is not a proper subset of candidate key AC).
The relation is not in 3rd normal form because in BC->D neither is
BC a super key nor is D a prime attribute, and in B->E neither is B
a super key nor is E a prime attribute, whereas to satisfy 3rd normal
form either the LHS of an FD should be a super key or the RHS should
be a prime attribute.
So the highest normal form of the relation will be 2nd Normal Form.
 Example 2 – Consider relation R(A, B, C)
A -> BC,
B -> A
A and B both are super keys, so the above relation is in BCNF.

 BCNF is free from redundancy caused by functional dependencies.
 If a relation is in BCNF, then 3NF is also satisfied.
 If all attributes of relation are prime attribute, then the relation is always in 3NF.
 A relation in a Relational Database is always and at least in 1NF form.
 Every Binary Relation ( a Relation with only 2 attributes ) is always in BCNF.
 If a Relation has only singleton candidate keys( i.e. every candidate key consists of
only 1 attribute), then the Relation is always in 2NF( because no Partial functional
dependency possible).
 Sometimes going for BCNF form may not preserve functional dependency. In that
case go for BCNF only if the lost FD(s) is not required, else normalize till 3NF only.
 There are many more Normal forms that exist after BCNF, like 4NF and more. But in
real world database systems it’s generally not required to go beyond BCNF.

Exercise 1: Find the highest normal form in R (A, B, C, D, E) under following functional
dependencies.
ABC --> D
CD --> AE
Important Points for solving above type of question.
1) It is always a good idea to start checking from BCNF, then 3 NF and so on.
2) If a functional dependency satisfies a normal form, then there is no need to check it for
lower normal forms. For example, ABC –> D is in BCNF (note that ABC is a super key), so there
is no need to check this dependency for lower normal forms.
Candidate keys in given relation are {ABC, BCD}
BCNF: ABC -> D is in BCNF. Let us check CD -> AE, CD is not a super key so this dependency
is not in BCNF. So, R is not in BCNF.
3NF: We don’t need to check ABC -> D, as it already satisfies BCNF. Let us consider
CD -> AE. Since CD is not a super key and E is not a prime attribute, the relation is not in 3NF.
2NF: In 2NF, we need to check for partial dependency. CD is a proper subset of the
candidate key BCD and it determines E, which is a non-prime attribute. So, the given relation
is also not in 2NF, and the highest normal form is 1NF.
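The top-down procedure (check BCNF first, then 3NF, then 2NF) can be sketched in Python. This is an illustrative sketch, not a production algorithm: it assumes the relation is already in 1NF, assumes single-letter attributes, finds candidate keys by brute force, and only distinguishes the levels 1NF to BCNF. All function names are chosen for this example.

```python
from itertools import combinations


def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(all_attrs, fds):
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            subset = set(combo)
            if not any(k <= subset for k in keys) and closure(subset, fds) == set(all_attrs):
                keys.append(subset)
    return keys


def highest_normal_form(all_attrs, fds):
    keys = candidate_keys(all_attrs, fds)
    prime = set().union(*keys)

    def is_superkey(x):
        return closure(x, fds) == set(all_attrs)

    # Non-trivial part of each FD: drop RHS attributes already on the LHS.
    nontrivial = [(set(l), set(r) - set(l)) for l, r in fds if set(r) - set(l)]
    if all(is_superkey(l) for l, _ in nontrivial):
        return "BCNF"
    if all(is_superkey(l) or r <= prime for l, r in nontrivial):
        return "3NF"
    # 2NF fails if a proper subset of some key determines a non-prime attribute.
    for key in keys:
        for size in range(1, len(key)):
            for part in combinations(sorted(key), size):
                if closure(part, fds) - prime:
                    return "1NF"
    return "2NF"


print(highest_normal_form("ABCDE", [("ABC", "D"), ("CD", "AE")]))              # Exercise 1
print(highest_normal_form("ABCDE", [("BC", "D"), ("AC", "BE"), ("B", "E")]))   # BCNF Example 1
print(highest_normal_form("ABCDE", [("A", "BC"), ("CD", "E"), ("B", "D"), ("E", "A")]))
print(highest_normal_form("ABC", [("A", "BC"), ("B", "A")]))                   # BCNF Example 2
```

The four calls correspond to Exercise 1, BCNF Example 1, 3NF Example 2 and BCNF Example 2 from these notes.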

A multivalued dependency arises when two or more independent relations are kept in a single
relation. In other words, a multivalued dependency occurs when the presence of one or more
rows in a table implies the presence of one or more other rows in that same table. Put
another way, two attributes (or columns) in a table are independent of one another, but both
depend on a third attribute.
A multivalued dependency always requires at least three attributes because it consists of
at least two attributes that are dependent on a third.
For two attributes A and B, if multiple values of B exist for a single value of A, then the
table may have a multivalued dependency. The table should have at least 3 attributes, say A, B
and C, and B and C should be independent of each other for the multivalued dependency A ->> B
to hold. For example,

PERSON MOBILE FOOD_LIKES

Mahesh 9893/9424 Burger / pizza

Ramesh 9191 Pizza

Person->-> mobile,
Person ->-> food_likes
This is read as “person multidetermines mobile” and “person multidetermines food_likes.”
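A multivalued dependency X ->> Y can be tested on a set of rows with the standard "swap" condition: whenever two rows agree on X, the row that takes its Y values from the first and all remaining values from the second must also be present. The sketch below is an illustration; the PERSON table is first written in 1NF (one mobile number and one food per row), and the function names mvd_holds and project are chosen for this example.

```python
def mvd_holds(rows, x, y):
    """Check X ->> Y on a list of dict rows; x and y are lists of attribute names."""
    attrs = list(rows[0])
    z = [a for a in attrs if a not in x and a not in y]
    present = {tuple(r[a] for a in attrs) for r in rows}
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in x):
                # X and Y values from t1, the remaining values from t2.
                t3 = {**{a: t1[a] for a in x + y}, **{a: t2[a] for a in z}}
                if tuple(t3[a] for a in attrs) not in present:
                    return False
    return True


def project(rows, attrs):
    """Projection with duplicate elimination, returned as a sorted list of tuples."""
    return sorted({tuple(r[a] for a in attrs) for r in rows})


cols = ("PERSON", "MOBILE", "FOOD_LIKES")
data = [
    ("Mahesh", "9893", "Burger"), ("Mahesh", "9893", "Pizza"),
    ("Mahesh", "9424", "Burger"), ("Mahesh", "9424", "Pizza"),
    ("Ramesh", "9191", "Pizza"),
]
rows = [dict(zip(cols, t)) for t in data]

print(mvd_holds(rows, ["PERSON"], ["MOBILE"]))
print(mvd_holds(rows[:-2], ["PERSON"], ["MOBILE"]))  # a required combination is missing

# Splitting on the MVD gives two smaller tables without the repeated combinations.
person_mobile = project(rows, ["PERSON", "MOBILE"])
person_food = project(rows, ["PERSON", "FOOD_LIKES"])
print(person_mobile)
print(person_food)
```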

Note that a functional dependency is a special case of multivalued dependency. In a
functional dependency X -> Y, every x determines exactly one y, never more than one.

Fourth normal form (4NF):

Fourth normal form (4NF) is a level of database normalization where there are no non-trivial
multivalued dependencies other than those whose determinant is a candidate key. It builds on
the first three normal forms (1NF, 2NF and 3NF) and the Boyce-Codd Normal Form (BCNF). It
states that, in addition to meeting the requirements of BCNF, a relation must not contain
more than one multivalued dependency.
Properties – A relation R is in 4NF if and only if the following conditions are satisfied:
1. It should be in the Boyce-Codd Normal Form (BCNF).
2. The table should not have any non-trivial multivalued dependency.
A table with a multivalued dependency violates the normalization standard of Fourth
Normal Form (4NF) because it creates unnecessary redundancies and can contribute to
inconsistent data. To bring this up to 4NF, it is necessary to break this information into two
tables.
Example – Consider the database of a class which has two relations: R1 contains
student ID (SID) and student name (SNAME), and R2 contains course ID (CID) and course
name (CNAME).

Table – R1(SID, SNAME)


SID SNAME

S1 A

S2 B

Table – R2(CID, CNAME)

CID CNAME

C1 C

C2 D

When their cross product is taken, it results in multivalued dependencies:

Table – R1 X R2

SID SNAME CID CNAME

S1 A C1 C

S1 A C2 D

S2 B C1 C

S2 B C2 D

Multivalued dependencies (MVD) are:

SID->->{CID, CNAME}; SNAME->->{CID, CNAME}
Join dependency – A join dependency is a further generalization of multivalued
dependencies. If the join of R1 and R2 over C is equal to relation R, then we can say that a
join dependency (JD) exists, where R1 and R2 are the decomposition R1(A, B, C) and R2(C, D) of
a given relation R (A, B, C, D). Alternatively, R1 and R2 are a lossless decomposition of R. A
JD ⋈ {R1, R2, …, Rn} is said to hold over a relation R if R1, R2, …, Rn is a lossless-join
decomposition. Thus *((A, B, C), (C, D)) will be a JD of R if the join of the projections on
these attribute sets is equal to the relation R. Here, *(R1, R2, R3) is used to indicate that
relations R1, R2, R3 and so on are a JD of R.
Let R be a relation schema and R1, R2, R3, …, Rn be a decomposition of R. r(R) is said to
satisfy the join dependency if and only if
∏R1(r) ⋈ ∏R2(r) ⋈ … ⋈ ∏Rn(r) = r

Example –

Table – R1

COMPANY PRODUCT

C1 pendrive

C1 mic

C2 speaker

C1 speaker

Company->->Product

Table – R2

AGENT COMPANY

Aman C1

Aman C2

Mohan C1

Agent->->Company

Table – R3

AGENT PRODUCT

Aman pendrive

Aman mic

Aman speaker

Mohan speaker

Agent->->Product

Table – R1⋈R2⋈R3

COMPANY PRODUCT AGENT

C1 pendrive Aman

C1 mic Aman

C2 speaker Aman

C1 speaker Aman

Agent->->Product

Fifth Normal Form / Project-Join Normal Form (5NF):

A relation R is in 5NF if and only if every join dependency in R is implied by the candidate
keys of R. A relation decomposed into two relations must have the lossless-join property,
which ensures that no spurious or extra tuples are generated when the relations are reunited
through a natural join.
Properties – A relation R is in 5NF if and only if it satisfies the following conditions:
1. R should already be in 4NF.
2. It cannot be further decomposed without loss (no join dependency that is not implied by
the candidate keys).
Example – Consider the above schema, with the rule “if a company makes a product and
an agent is an agent for that company, then he always sells that product for the company”.
Under these circumstances, the ACP table is shown as:

Table – ACP

AGENT COMPANY PRODUCT

A1 PQR Nut

A1 PQR Bolt

A1 XYZ Nut

A1 XYZ Bolt

A2 PQR Nut

The relation ACP is decomposed into the following 3 relations, whose natural join is
considered below:

Table – R1

AGENT COMPANY

A1 PQR

A1 XYZ

A2 PQR

Table – R2

AGENT PRODUCT

A1 Nut

A1 Bolt

A2 Nut

Table – R3

COMPANY PRODUCT

PQR Nut

PQR Bolt

XYZ Nut

XYZ Bolt

The result of the natural join of R1 and R3 over ‘Company’, followed by the natural join of
that result (R13) and R2 over ‘Agent’ and ‘Product’, will be the table ACP.
Hence, in this example, the redundancies are eliminated, and the decomposition of ACP
is a lossless-join decomposition. Therefore, the decomposed relations R1, R2 and R3 are in
5NF.
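The lossless-join claim for this example can be verified directly: project ACP onto the three attribute pairs, join the projections back, and compare with the original table. The sketch below uses plain lists of dicts as relations; the helper names project, natural_join and as_set are chosen for this example.

```python
def project(rows, attrs):
    """Projection with duplicate elimination."""
    out = []
    for row in rows:
        small = {a: row[a] for a in attrs}
        if small not in out:
            out.append(small)
    return out


def natural_join(r, s):
    """Natural join: combine rows that agree on all attributes they share."""
    out = []
    for a in r:
        for b in s:
            if all(a[c] == b[c] for c in a.keys() & b.keys()):
                merged = {**a, **b}
                if merged not in out:
                    out.append(merged)
    return out


def as_set(rows):
    return {tuple(sorted(row.items())) for row in rows}


cols = ("AGENT", "COMPANY", "PRODUCT")
ACP = [dict(zip(cols, t)) for t in [
    ("A1", "PQR", "Nut"), ("A1", "PQR", "Bolt"),
    ("A1", "XYZ", "Nut"), ("A1", "XYZ", "Bolt"),
    ("A2", "PQR", "Nut"),
]]

R1 = project(ACP, ["AGENT", "COMPANY"])
R2 = project(ACP, ["AGENT", "PRODUCT"])
R3 = project(ACP, ["COMPANY", "PRODUCT"])

R13 = natural_join(R1, R3)        # joined over COMPANY
rejoined = natural_join(R13, R2)  # joined over AGENT and PRODUCT
lossless = as_set(rejoined) == as_set(ACP)

print(len(R13), len(rejoined), lossless)
```

The intermediate result R13 contains one tuple that is not in ACP, namely (A2, PQR, Bolt); the final join with R2 removes it, which is why all three projections are needed to state the join dependency.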
