Lecture 16 - Relational Database Design PDF
Lecture - 16
Relational Database Design
Welcome to Module 16 of Database Management Systems. The last module closed the third
week.
In the third week, we talked about certain advanced features of SQL and about the formal
query languages, namely relational algebra and the calculi. Then we talked in depth about
the entity relationship model, the first basic conceptual level representation of the real
world that we can use in designing a system.
Now, our next task is to take this to a proper, complete relational database design. This
has a lot of theory at different levels that we need to understand. We will develop it
slowly; the discussion will span 5 modules, that is, we will take the whole week to
complete it.
So, to start with the features of good relational design, let us take an example.
We have seen the instructor entity set represented as a relation. You have seen the
department relation. Now, suppose these two were not separate relations, but all the
attributes were kept in one common relation. Earlier, if you recall, your instructor
relation was this and your department relation was this much. If we keep everything
together, we are calling it inst dept, but please keep in mind this is not the same
inst dept that we discussed in terms of the ER model. This is just putting these two
together.
Now, look into this data carefully. For example, this row, this row and this row are all
rows of instructors who belong to Computer Science. Earlier, we represented the
information of the instructor only in this part, so we just knew that the department is
Computer Science, and we represented the information of the department in this part. So,
given a department name, say Computer Science, we knew where it is located, that is the
building, and what budget it has. Now that the two are combined, since Computer Science is
located in the Taylor building and has a budget of, say, 100,000, all of these records
will have this information repeated.
So, this is not a good situation. In databases, this kind of situation is known as
redundancy, that is, you have the same data in multiple places. What is the consequence of
redundancy? There could be different kinds of anomalies. What is an anomaly? An anomaly is
the possibility of certain data becoming inconsistent. For example, let us say the
Computer Science department moves from the Taylor building to the Painter building. Then I
have to change this value to Painter, and this one, and this one as well. So, for one
change, I have to update multiple entries. Think about the earlier situation where I just
had these three attributes in my department relation. There, Computer Science had only one
row, and therefore this update could be done at only one place.
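To make the update anomaly concrete, here is a minimal sketch in Python. The rows, ids,
salaries and the helper names move_department and buildings_of are made up for
illustration; only the Taylor and Painter buildings and the 100,000 budget come from the
discussion above. The max_rows argument is an artificial way to simulate forgetting a row.

```python
# Illustrative rows of the combined relation; ids, names and salaries are made up.
INST_DEPT = [
    {"id": "10101", "name": "Srinivasan", "salary": 65000,
     "dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
    {"id": "45565", "name": "Katz", "salary": 75000,
     "dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
    {"id": "83821", "name": "Brandt", "salary": 92000,
     "dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
]


def move_department(rows, dept_name, new_building, max_rows=None):
    """Set the building on rows of dept_name and return how many rows changed.

    max_rows is a hypothetical knob that simulates a careless update
    which stops before every matching row has been touched.
    """
    changed = 0
    for row in rows:
        if row["dept_name"] != dept_name:
            continue
        if max_rows is not None and changed >= max_rows:
            break
        row["building"] = new_building
        changed += 1
    return changed


def buildings_of(rows, dept_name):
    """All distinct buildings recorded for one department."""
    return {row["building"] for row in rows if row["dept_name"] == dept_name}


# Careless update on the combined relation: one of the three rows is forgotten.
combined = [dict(row) for row in INST_DEPT]
rows_touched = move_department(combined, "Comp. Sci.", "Painter", max_rows=2)
conflicting = buildings_of(combined, "Comp. Sci.")

# Separate department relation: the building is stored exactly once.
department = [{"dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000}]
move_department(department, "Comp. Sci.", "Painter")
consistent = buildings_of(department, "Comp. Sci.")

print(sorted(conflicting), sorted(consistent))
```

In the combined relation the department ends up recorded in two buildings at once, while
the separate department relation cannot get into that state because there is only one row
to change.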
So, with this redundancy, it is not only that I have to make the change multiple times.
There is also the difficulty that if I forget to update one or more of them, then I have
inconsistent data. Similarly, if I want to insert a new value, I have to repeat all this
redundant information. Then there is deletion. Say the university decides to wind up the
Physics department. Then I have to delete all the rows which have Physics as an entry,
and as a consequence I delete the whole row. So, I not only remove the department, but I
also remove the corresponding instructor who belonged to that department.
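The deletion case can be sketched the same way. This is only an illustration in Python:
the rows, the Watson building and the helper delete_department are assumptions, not
something taken from the slides.

```python
# Combined relation; the second instructor and the Watson building are made up.
inst_dept = [
    {"name": "Einstein", "dept_name": "Physics",
     "building": "Watson", "budget": 70000},
    {"name": "Katz", "dept_name": "Comp. Sci.",
     "building": "Taylor", "budget": 100000},
]


def delete_department(rows, dept_name):
    """Return the rows that remain after removing every row of dept_name."""
    return [row for row in rows if row["dept_name"] != dept_name]


# Combined design: removing Physics removes Einstein's whole row as well.
after_combined = delete_department(inst_dept, "Physics")
names_after_combined = {row["name"] for row in after_combined}

# Separate design: only the department relation is touched.
instructor = [
    {"name": "Einstein", "dept_name": "Physics"},
    {"name": "Katz", "dept_name": "Comp. Sci."},
]
department = [
    {"dept_name": "Physics", "building": "Watson", "budget": 70000},
    {"dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
]
department_after = delete_department(department, "Physics")
names_after_separate = {row["name"] for row in instructor}

print(sorted(names_after_combined), sorted(names_after_separate))
```

In the separate design Einstein's row survives, although it now names a department that
no longer exists. A real database would have to decide what to do with that reference;
this sketch does not model it.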
So, this kind of redundancy can lead to different kinds of anomalies in a database design.
On the other hand, you may ask why I am complicating the whole situation. We already had
a good design where these anomalies were not there: department was separate and
instructor was separate. In that case, the issue is that to answer some of the queries, I
may have to do a very expensive join operation. For example, if Einstein wants to know the
budget of his department, that cannot be found from the earlier instructor relation,
which had only these fields.
So, I have to pick up Einstein from here and do a join, based on the department name,
dept name, with the department relation. Only then will I be able to find out that
Einstein belongs to Physics and Physics has a budget of 70000, so Einstein's department
has a budget of 70000. So, there is a tradeoff: either you make more information
redundant, which leads to different anomalous situations, or you optimize the
representation of the data, but then possibly pay a higher cost in answering your
queries.
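The extra work of the join can be seen in a small Python sketch. The nested loop below is
a deliberately naive join, not how a database engine would evaluate it; the rows and the
function name budget_of_instructor are assumptions made for illustration.

```python
# Separate relations; ids, salaries and the Watson building are made up.
instructor = [
    {"id": "22222", "name": "Einstein", "dept_name": "Physics", "salary": 95000},
    {"id": "45565", "name": "Katz", "dept_name": "Comp. Sci.", "salary": 75000},
]
department = [
    {"dept_name": "Physics", "building": "Watson", "budget": 70000},
    {"dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
]


def budget_of_instructor(name, instructor_rows, department_rows):
    """Join instructor with department on dept_name and return matching budgets."""
    budgets = []
    for inst in instructor_rows:
        if inst["name"] != name:
            continue
        # The instructor row alone does not carry the budget,
        # so the department relation has to be scanned as well.
        for dept in department_rows:
            if dept["dept_name"] == inst["dept_name"]:
                budgets.append(dept["budget"])
    return budgets


einstein_budget = budget_of_instructor("Einstein", instructor, department)
print(einstein_budget)
```

With the combined relation the same question would be a lookup in a single row, which is
exactly the other side of the tradeoff.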
So, this is one of the core design issues that we will start with. So, let us look into some
more.
Note that combining schemas is not necessarily always bad in terms of repetition or
redundancy. So, different situations will have to be assessed.
If two records have the same department name, they must be identical, so the records are
completely distinguishable by the department name. Let us see the consequence of this. We
write it as a rule: if there is a schema department (dept name, building, budget), then
dept name would be a candidate key. We write this observation as: if two records match on
the department name, they must match on the building and budget. Very loosely, and we
will come to the formal definition, we call this a functional dependency. We say that
building and budget are functionally dependent on the department name. That is a
situation where we can split inst dept and create a smaller relation, because department
name is not a candidate key in inst dept. It does not decide the records of inst dept
uniquely.
Since it does not, when a value of the attribute department name is duplicated or
triplicated, the values of building and budget are repeated and we have redundancy. This
is a very common situation, and it indicates that we need a decomposition into smaller
relations. But let us take a different example. If we think that decomposition is the
panacea for this kind of redundancy and related problems, then consider a different
relation, employee, which has id, name, street, city, salary. Suppose we want to make it
smaller by splitting it into two relations: (id, name) and (name, street, city, salary).
If we do that, then how do we get the salary for a particular id? We will naturally have
to join these two relations on the common attribute name, as we have seen in queries. The
question is: when I do this join, do I get back the original information, or do I lose
some information?
(Refer Slide Time: 12:42)
Look at an example. Here is an instance of the combined relation. I have two different
ids, but incidentally the names of these two distinct employees are the same. When I
decompose, I get this relation, which shows id and name, and this relation, which shows
the rest against the name. But when I join them by natural join, I not only get the
combination of this with this, which is what I need, but I also get this combination. So,
the natural join also gives me rows which are really not there in the original relation.
So, you can see that in the natural join I get four rows, whereas in the original I had
only two rows. I get some entries which are actually erroneous. These are not there in
the database. We get more rows, not fewer, but we can no longer tell which id goes with
which street, city and salary. When this happens, we say that we have a loss of
information, and such joins are said to be lossy joins. So, when we decompose, we need to
make sure that our joins are lossless in nature; otherwise, it is not a good design.
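The four rows can be reproduced with a small Python sketch. The concrete values (ids, the
name Kim, streets, cities, salaries) are hypothetical stand-ins for the slide, and project
and natural_join are simple helpers written for this illustration, not library functions.

```python
# Two distinct employees who happen to share a name; all values are made up.
employee = [
    {"id": "57766", "name": "Kim", "street": "Main",
     "city": "Perryridge", "salary": 75000},
    {"id": "98776", "name": "Kim", "street": "North",
     "city": "Hampton", "salary": 67000},
]


def project(rows, attrs):
    """Keep only attrs from each row and drop duplicate rows."""
    result = []
    for row in rows:
        projected = {a: row[a] for a in attrs}
        if projected not in result:
            result.append(projected)
    return result


def natural_join(left, right):
    """Combine rows that agree on every attribute the two relations share."""
    if not left or not right:
        return []
    common = [a for a in left[0] if a in right[0]]
    result = []
    for l_row in left:
        for r_row in right:
            if all(l_row[a] == r_row[a] for a in common):
                merged = {**l_row, **r_row}
                if merged not in result:
                    result.append(merged)
    return result


id_name = project(employee, ["id", "name"])
name_rest = project(employee, ["name", "street", "city", "salary"])
rejoined = natural_join(id_name, name_rest)

# Rows produced by the join that were never in the original relation.
spurious = [row for row in rejoined if row not in employee]

print(len(employee), len(rejoined), len(spurious))
```

Each id matches both rows with the shared name, so the join pairs every id with every
address and salary, and the two extra rows are the erroneous ones.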
When I take the join and the original information is completely retrieved, that is, I get
back the same table, I say that the join is lossless. So, what we need to understand is
that, on one side, there is a need to decompose relations into smaller relations to reduce
redundancy. While we do that, we also have to keep in mind that the smaller relations
must be composable, through natural join, back to the original relation. I must get back
exactly that original relation; otherwise I have a lossy join, which is not acceptable.
Also, the decomposition has the cost of doing a natural join every time I want to answer
those queries.
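As a rough check of this requirement, the Python sketch below decomposes one instance of
inst dept into an instructor part and a department part and tests whether the natural join
gives back exactly the original rows. The rows and the helper names are assumptions.
Passing on a single instance does not prove that a decomposition is lossless in general;
it only shows what the condition means.

```python
def as_relation(dict_rows):
    """Represent a relation as a set of frozensets of (attribute, value) pairs."""
    return {frozenset(row.items()) for row in dict_rows}


def project(relation, attrs):
    """Projection on attrs; duplicates vanish because the result is a set."""
    return {frozenset((a, v) for a, v in t if a in attrs) for t in relation}


def natural_join(left, right):
    """Merge every pair of tuples that agree on all shared attributes."""
    joined = set()
    for l_tuple in left:
        for r_tuple in right:
            merged = dict(l_tuple)
            # An attribute missing on the left never blocks the match.
            if all(merged.get(a, v) == v for a, v in r_tuple):
                merged.update(dict(r_tuple))
                joined.add(frozenset(merged.items()))
    return joined


def rejoins_exactly(dict_rows, attrs1, attrs2):
    """True if projecting on attrs1 and attrs2 and joining restores the instance."""
    relation = as_relation(dict_rows)
    rejoined = natural_join(project(relation, attrs1), project(relation, attrs2))
    return rejoined == relation


# Illustrative instance of the combined relation; ids and salaries are made up.
inst_dept = [
    {"id": "22222", "name": "Einstein", "salary": 95000,
     "dept_name": "Physics", "building": "Watson", "budget": 70000},
    {"id": "10101", "name": "Srinivasan", "salary": 65000,
     "dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
    {"id": "45565", "name": "Katz", "salary": 75000,
     "dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
]

lossless_here = rejoins_exactly(
    inst_dept,
    {"id", "name", "salary", "dept_name"},
    {"dept_name", "building", "budget"},
)
print(lossless_here)
```

Here the shared attribute is the department name, and every department name appears with
only one building and budget, so no extra rows are produced on this instance.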
We consider the domain of an attribute to be atomic if its values are indivisible. So,
anything that is a number, a string and so on is considered to be atomic. We say a
relational schema is in first normal form if the domains of all attributes are atomic and
all attributes are single valued, that is, there is no multivalued attribute. If these
conditions are satisfied, then we say that the relational schema is in first normal form.
We will slowly understand the purpose of defining such normal forms, but let us initially
understand the definition. If we have attributes which are composite in nature, the
relational schema is not in first normal form. If we have attributes which are
multivalued, it is not in first normal form either.
So, if the possible values are like this and we just treat them as strings, then the
corresponding relational schema is in first normal form. But suppose we say that from
this string we can extract the first two characters, which are CS and tell me the
department, and that the next four characters give me a number, the serial number of the
particular student in the roll. Then I am not actually using an atomic domain, because
the value needs to be interpreted in parts rather than just being a value. Such domains
are not allowed in first normal form.
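A tiny Python sketch of the difference follows. The value "CS0012" and the function
split_roll_number are hypothetical; the only thing taken from the discussion is the
layout of two characters for the department followed by four characters for the serial
number.

```python
def split_roll_number(roll_number):
    """Interpret a roll number as (department code, serial number).

    Needing this function at all is the sign that the value is not being
    treated as atomic: part of the meaning lives in application code.
    """
    department_code = roll_number[:2]
    serial_number = int(roll_number[2:6])
    return department_code, serial_number


# Treated as an opaque string, the value is atomic and is only compared as a whole.
roll_number = "CS0012"  # hypothetical value
same_student = roll_number == "CS0012"

# Treated as a structured value, it is silently carrying two attributes.
department_code, serial_number = split_roll_number(roll_number)
print(same_student, department_code, serial_number)
```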
What can you do? You can separate out these phone numbers into two different attributes,
telephone number 1 and 2. Even then it is not exactly in first normal form, because you
do not know in which order they should be handled. If you have to search for a telephone
number, then you have to search multiple attributes which are conceptually the same. And
then the question is: why only two attributes? Can somebody not have 3 phone numbers, 7
phone numbers and so on? So, this is really not a good option. The other way could be
that for every telephone number you introduce a separate row. Once you do that, you
already know you have redundancy and the possibility of various kinds of anomalies.
So, one way to achieve this is to follow the principle that we had seen in ER modelling: a
multivalued attribute can be represented in terms of a separate relation where, against
the customer id, we just keep the telephone number. We can keep multiple of them, and we
take the telephone number out of the customer relation. This is a one to many relationship
between the parent and the child, between the customer and the telephone number: every
customer may have more than one telephone number. That makes it a 1NF relation, a first
normal form relation. We will see later that it is also in 2NF and 3NF, but that is a
future story.
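The separate relation for telephone numbers can be sketched in Python as follows. The
customer ids, names, phone numbers and the function to_first_normal_form are all made up
for illustration.

```python
# Not in first normal form: "phones" holds a list of values. All data is made up.
customers_nested = [
    {"customer_id": "C1", "name": "Asha", "phones": ["555-0101", "555-0102"]},
    {"customer_id": "C2", "name": "Ravi", "phones": ["555-0199"]},
    {"customer_id": "C3", "name": "Mira", "phones": []},
]


def to_first_normal_form(nested_rows):
    """Split into customer(customer_id, name) and customer_phone(customer_id, phone)."""
    customer = []
    customer_phone = []
    for row in nested_rows:
        customer.append({"customer_id": row["customer_id"], "name": row["name"]})
        # One child row per telephone number, however many there are.
        for phone in row["phones"]:
            customer_phone.append(
                {"customer_id": row["customer_id"], "phone": phone}
            )
    return customer, customer_phone


customer, customer_phone = to_first_normal_form(customers_nested)
phones_of_c1 = [p["phone"] for p in customer_phone if p["customer_id"] == "C1"]
print(len(customer), len(customer_phone), phones_of_c1)
```

The customer name is stored once per customer, and a customer with zero, one or seven
numbers needs no change to the schema, only more or fewer rows in the child relation.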
Now, finally we come to the core: the mathematical formulation which dictates much of
relational database design is known as functional dependencies.
So, naturally R will be the union of all of these R i, the total set of attributes. So,
instead of keeping all the information in one relation, in one table, we are basically
decomposing it into n different schemas.
A functional dependency is a constraint on the set of legal relations. Mind you, it is a
constraint on the schema, and once that constraint is defined, it must hold for all
relations on that schema. Here we require that the value of a certain set of attributes
uniquely determines the value of another set of attributes. So, if I know the values of
three attributes, I should be able to say that the values of the other four attributes are
fixed. You have already seen this notion in terms of key or super key. There we said a key
is a set of attributes such that if the values of two rows are identical over this set of
attributes, then the two tuples, the two rows, must be totally identical.
So, a key does a similar thing as a functional dependency, but is more specific.
Functional dependencies are a generalization.
So, let us formally define this. Let R be a relational schema, which means that it is a
set of attributes, and let alpha and beta be two subsets of R. Then we write
alpha → beta; note this notation. Alpha is a set of attributes; beta is another set of
attributes. Both are subsets of the same R. We say alpha functionally determines beta,
that is, if I know the values of a tuple over the attributes of alpha, then the values of
that tuple over the attributes of beta are fixed. In other words, if I have two tuples
t 1 and t 2 and their values over the alpha attributes are the same, then necessarily
their values over the beta attributes must be the same. Mind you, this is a design
constraint. It is not just an incidental property. It is not just the fact that a
particular instance of a schema satisfies this. When we say this is a functional
dependency, all possible past, present and future instances of the schema must satisfy it.
So, these are functional dependencies that must hold, but certainly we would not expect
department name to functionally determine salary. That would be too much, right. So,
functional dependencies are facts about the real world that we try to understand from the
real world and then, represent in terms of the functional dependency formulation in the
database.
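The two-tuple condition in the definition can be checked mechanically on a given instance.
The Python sketch below is only an illustration: the rows and the function satisfies_fd
are assumptions. As stressed above, an instance that passes the check does not establish
the functional dependency, which is a statement about all legal instances; an instance
that fails it does show the dependency cannot hold.

```python
def satisfies_fd(rows, alpha, beta):
    """True if no two rows agree on alpha but differ on beta in this instance."""
    seen = {}
    for row in rows:
        alpha_values = tuple(row[a] for a in alpha)
        beta_values = tuple(row[b] for b in beta)
        # setdefault stores the first beta seen for this alpha and returns it.
        if seen.setdefault(alpha_values, beta_values) != beta_values:
            return False
    return True


# Illustrative instance of the combined relation; ids and salaries are made up.
inst_dept = [
    {"id": "22222", "name": "Einstein", "salary": 95000,
     "dept_name": "Physics", "building": "Watson", "budget": 70000},
    {"id": "10101", "name": "Srinivasan", "salary": 65000,
     "dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
    {"id": "45565", "name": "Katz", "salary": 75000,
     "dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
]

dept_determines_location = satisfies_fd(
    inst_dept, ["dept_name"], ["building", "budget"]
)
dept_determines_salary = satisfies_fd(inst_dept, ["dept_name"], ["salary"])
id_determines_everything = satisfies_fd(
    inst_dept, ["id"], ["name", "salary", "dept_name", "building", "budget"]
)
print(dept_determines_location, dept_determines_salary, id_determines_everything)
```

The two Computer Science rows agree on the department name but not on the salary, which
is why the second check fails while the first one passes.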
Here is another example. Just go through them and try to convince yourself that these
functional dependencies are genuine real world situations that can be modeled in this
way.
So, F plus, the closure of F, that is the set of all functional dependencies implied by F,
is necessarily a superset of F. In the above example, this is F and this is F plus.
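One way to test whether a particular dependency belongs to F plus, without listing F plus
itself, is to compute which attributes a given left side can reach by repeatedly applying
the dependencies in F. This attribute closure technique is not developed in this passage,
so take the Python sketch below as an assumption-laden illustration; the set F over the
attributes A, B, C is hypothetical.

```python
def attribute_closure(attrs, fds):
    """All attributes determined by attrs under the dependencies in fds."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is already reachable, so is the right side.
            if lhs <= closure and not rhs <= closure:
                closure |= rhs
                changed = True
    return closure


def in_f_plus(fds, lhs, rhs):
    """True if lhs -> rhs is implied by fds, that is, it belongs to F plus."""
    return rhs <= attribute_closure(lhs, fds)


# Hypothetical F over attributes A, B, C: A -> B and B -> C.
F = [({"A"}, {"B"}), ({"B"}, {"C"})]

every_given_fd_is_in_f_plus = all(in_f_plus(F, lhs, rhs) for lhs, rhs in F)
a_determines_c = in_f_plus(F, {"A"}, {"C"})
c_determines_a = in_f_plus(F, {"C"}, {"A"})
print(every_given_fd_is_in_f_plus, a_determines_c, c_determines_a)
```

Every dependency of F is trivially in F plus, which is the superset statement, and F plus
also contains dependencies such as A determining C that were never written down in F.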