
Distributed Database


Introduction:

This is an advanced course that builds on a course you must have studied previously, “Database Management Systems”. It extends the concepts learnt earlier; moreover, the applications where you will apply the concepts and techniques learnt in this course are also more advanced and complex by nature. Distributed Database Management Systems (DDBMS) use the concepts of:

1) Database Management Systems
2) Networking

The key is to identify the environments in which distributed databases should be used. You may find that using a distributed database in some situations does not prove fruitful: the implementation may develop drawbacks or become inefficient. As a computer or database expert, your basic assignment will always be that you are given a system for which you have to design/develop a solution or a database system. You will have several options available, and the interesting thing is that every one of them would work. The challenge lies in selecting the most feasible solution. For this, the merits and demerits of every approach have to be analyzed. If this evaluation is not done properly in the beginning, it is very difficult to make adjustments in the later stages. To be more precise, a Distributed Database System (DDBS) will be one of many options available to you for any system that you are asked to develop. It is your job to analyze whether the environment needs a DDBS solution or some other one. A wrong decision in this regard may introduce inefficiency rather than any advantage.

There are going to be four major components in this course:

Introductory Stuff
Architectures and Design Issues
Technological Treatment
Related Topics

Introduction:

This part will cover the introduction to the basic concepts of databases and networks. Then we will look at the nature of the applications that need a DDBS, so that we are convinced that a given environment requires a DDBS, and then we think about and plan the later issues of the implementation.

Architecture and Design:

There are different architectures available for designing distributed systems, and we have to identify the right one, as different architectures are suitable for different environments. Moreover, there are different approaches to implementing DDBS designs; we will study those approaches, since each suits a different environment.

Note:
Selecting the wrong architecture for an environment results in an inefficient system.

Technological Treatment:

Different design approaches of DDBS will be implemented using the prevailing DDBMSs, like SQL Server and Oracle. That will give you an idea of what a real DDBS looks like.

Theoretical Aspects:

We will discuss the theoretical aspects related to the DDBS. The study of these issues will help you in administering a DDBS on one side, and on the other side it will help you in further studies/research in the DDBS. The database management systems available today do most of the administration automatically, but it is important for the database designer to know the background procedures so that the overall efficiency of the distributed database management system may be enhanced.

Recommended Books:

1- Distributed Database Systems (2nd Edition) by M.T. Özsu, P. Valduriez
2- Distributed Database Systems by D. Bell, J. Grimson, Addison-Wesley, 1992
3- Distributed Systems: Concepts and Design, 4th Edition, by G. Coulouris, J. Dollimore, T. Kindberg, Addison-Wesley
The book mentioned at No. 1 is the main book for this course. It is a well-known book and one of the few written on the topic. The course is mainly based on this book, so you will be holding it very frequently in the coming days; make it a bedside item.

You are already familiar with the marks split, which consists of a mid-term exam, assignments and a final exam. Good luck with your course, and now let's start reading.

History:

This part is from the first course on Databases. Computer applications are divided into two types:

1) Data Processing Applications
2) Scientific / Engineering Applications

Data Processing Applications in computer terminology are referred to as “File Processing Systems”. In those applications the data was processed with the help of different programming languages. The program that was used to process the data had the data defined in it; therefore they were dependent on each other. If we had to make any change in the program it was difficult, as we had to take care of the dependent data as well, and vice versa. You are already familiar with the following diagram.

[Figure: Typical File Processing Environment]

For example, in the above diagram we can see three systems, i.e. the Examination, Library and Registration systems. Each of them has its own data for processing; however, there might be some information which is common to all three systems but is still stored separately. This results in:

1) Data Redundancy
2) Expensive Changes/Modifications due to redundancy of the data

Database Approach:

To remove the defects of the file processing systems, the database approach was used, which eliminated the interdependency of the program and the data. Changes/modifications can be made easily, whether they relate to the programs or to the data itself.

A database is a shared collection of logically related data.

Distributed Computing Systems:

A Distributed Computing System can be defined as “A system consisting of a number of autonomous processing elements that are connected through a computer network and that cooperate in performing their assigned task”.

Three things are important here:

1) Multiple systems are involved.
2) These multiple systems are linked together through some network.
3) These multiple systems perform common tasks in which they cooperate with each other.

Distributed Computing:

We can elaborate the concept of Distributed Computing with the following example:

A computer has different components, for example RAM, hard disk, processor etc., working together to perform a single task. So is this an example of Distributed Computing? The answer is no. According to the definition there must be “different computing systems” involved; therefore we cannot say that the distributed activities within a single computer are an example of distributed computing.
The second question is: what is being distributed? A few examples are given below:

1) Processing logic can be distributed.
2) We can divide our goal/task into different functions and distribute them among various systems.
3) Data
4) Control

All these things can be divided to make our system run efficiently.

Classification of Distributed Computing Systems:

The following factors are to be addressed:

1) Degree of Coupling:
Here we have to see how closely the systems are connected. We can measure this closeness in terms of the messages exchanged between two systems and the duration for which they work independently.

Note:
If two systems are connected through a network the coupling may be weak; however, if they are sharing some hardware the coupling is strong.

2) Interconnection Structure:
We have to see how the two systems are connected to each other. The connection can be point-to-point, or the systems can share a common channel, etc.

3) Interdependence:
Interdependence does not depend entirely on the architecture; it also depends on the task and how it is distributed.

4) Not Totally Independent:

Why Distributed Computing Systems:

Some organization structures are suitable for Distributed Computing:

1) The organization is spread over a large geographic area.
2) The organization has different functioning units located in different areas.

Technological Push:

1) Hardware has become cheap.
2) Internet, i.e. communication connectivity, is easily available and cheap.

Distributed Computing Alerts:

1) Poor management leads to inefficiency.
2) We create information islands, and due to the lack of standards the system becomes inefficient.
3) Improper design, e.g. we have to travel by air to some destination. We go to an airline's booking office and get our booking done. The booking is made, but the overall process takes a considerable delay; this might occur as a result of improper design.

Note:
Multiple options are available for designing distributed systems.

In this lecture:

- Definition of a Distributed Database System (DDBS)
- The candidate applications for a DDBS
- The definition of a Distributed Database Management System (DDBMS)

Distributed Database System:

A collection of logically interrelated databases that are spread physically across multiple locations connected by a data communication link.

Main characteristics:

The DDBS, in its general meaning, is a solution that has been adopted for a single organization. That is, an organization wants to develop a database system for its requirements; the analyst(s) analyze the requirements, and out of many possible solutions, like client-server, desktop based, or distributed database, this particular architecture is proposed. Following are the major characteristics of a DDBS highlighted in the definition above:
Data management at multiple sites: Although it belongs to a single organization, the data in a DDBS is stored at multiple geographical sites. Data is not only stored at multiple sites but is also managed there; that is, tables are created, data is entered by different users and multiple applications are running. There is a complete DBMS environment at each local site.

Local requirements: Each of the sites storing data in a DDBS is called a local site. Every site is mainly concerned with the requirements and data management for the users directly associated with that site, called local users.

Global perspective: Apart from catering to the local requirements, the DDBS also fulfils the global requirements. These are the requirements generated by the central management, who want to make overall decisions about the organization and want to have the overall picture of the organization's data. The DDBS fulfils the global requirements in a transparent way; that is, the data for these requirements is fetched from all the local sites, merged together and presented to the global users. All this activity of fetching and merging is hidden from the global user, who gets the feeling that the data is being fetched from a single place.

In a DDBS environment, three types of accesses are involved:

Local access: the access by users connected to a site who access the data at the same site.

Remote access: a user connected to a site, let's say site 1, accessing the data at site 2.

Global access: no matter where the access is made from, data is displayed after being collected from all locations.

A user does not know where he is getting the data from. To the user it appears that the data is present on the machine on which he is working.

Distributed databases; where to apply:

As mentioned before, the DDBS is one of the possible solutions for a database application. We need to analyze the environment to decide whether it requires a DDBS or not. The candidate applications for a DDBS have the following two main characteristics:

1- Large number of users
2- Users are physically spread across a large geographical area

Following are some of the database applications that are strong candidates for a DDBS.

Banking Applications: Take the example of any Pakistani bank. A bank has a large number of customers, and its branches are spread across all of Pakistan (many banks also have branches around the world, which makes their candidature even stronger). In modern banking, customers do not only access/use their accounts from within their own branch; they also access the data from outside the branch, for example from ATMs/branches spread across the city or country. Every time a user operates his account from anywhere in the country/world, his account data is being accessed.

Air ticketing: We now have the facility to book a seat on any airline from any location to any destination, e.g. we can book a return ticket from Lahore to Karachi and from Karachi to Lahore at the airline's Lahore office. This system, too, has a large number of users spread across a large area. Whenever a booking is made, the data of the flights is accessed.

Business at multiple locations: A company may have offices at multiple locations, or different units, like production, warehouses and sales, operating from different locations. Each site stores its data locally; however, these units need to access each other's data, and data from all the sites is required for global access.

Distributed Database Management System:

A software system that permits the management of a distributed database and makes the distribution transparent to the users.

Just as we need a DBMS for a centralized or client-server database, we need a DDBMS for a DDBS.
A DDBMS behaves like a normal DBMS at the local site; the additional facility that it provides is the creation and maintenance of global access, where data across multiple sites is accessed against a single query. The approach that most of the current commercial DBMS vendors (like Oracle, SQL Server, DB2, Sybase) have adopted is to provide different versions for different situations. If the user needs a desktop database for single-computer usage, a smaller version is available that does not support remote access or data distribution. For a client-server database there is another version, and for the DDBS environment the Enterprise Edition of the DBMS is provided, which supports data distribution among multiple sites, the establishing of links between these sites, and finally joining/combining data from multiple sites against a single query.

Decentralized database:

A collection of independent databases on non-networked computers. In this environment the data at multiple computers is related but these computers are not linked, so whenever data has to be accessed from multiple computers, we need to apply some means of transferring data across these computers.

Summary: In today's lecture we have discussed the definition of the DDBS, the common applications where the DDBS can be applied and the reasons why the DDBS is recommended for these sites. It is extremely important to have a clear idea of what precisely a DDBS is if we want to implement a DDBS properly.

In previous lecture:

- Definition of a Distributed Database System (DDBS)
- The candidate applications for a DDBS
- The definition of a Distributed Database Management System (DDBMS)

In this lecture:

- Resembling Setups
- Why to have a DDBS
- Advantages/Promises of DDBS

Resembling setups:

Distributed files:

A collection of files stored on different computers of a network is not a DDBS. Why? This is not enough for a DDBS, as the data should be logically related.

Note:
The data in a DDBS is logically related, has a common structure among files, and is accessed via the same interface.

Multiprocessor system:

Multiple processors that share some common memory.

RAM sharing: tight coupling.
Hard disk sharing: loose coupling.
Systems simply connected: share nothing.

The following diagrams explain these architectures:

[Figure: Shared Everything Tight Coupling]
[Figure: Shared Everything Loose Coupling]
[Figure: Shared Nothing]

Centralized C/S System

Data management is carried out on a single centralized system. However, this data is accessed from different machines (clients). All machines are connected with each other through a communication link (network). This is a very common architecture. The major characteristic of this architecture is that data storage and management is mainly done on the server. As the diagram shows, the data is associated with a single site; this site is the Server, and the rest of the machines access data from the Server.

The Distributed Database System

As has been discussed in the previous lecture, the data is managed/manipulated at multiple sites in a DDBS. There are many different architectures of a DDBS; a very general one is given below. This general architecture also establishes a picture of the DDBS in the mind that further helps to understand the working of a DDBS.

Fig. 3: General Architecture of DDBS

As is clear from the diagram, there are a number of local DBMSs called local nodes. Each local node works independently, serving the multiple users that are connected to it; these users are called local users. At the same time, the local nodes are connected with each other. A layer is superimposed on top of all these connected local DBMSs, and that layer is the DDBMS. The DDBMS contains the global schema, which is basically the merger of all local schemas. There is no data underlying the global schema; rather, the data is held by the local DBMSs. The users connected to the DDBMS layer are called global users, as their queries are answered by collecting data from all the local nodes, and this distribution is transparent to the global users.

Reasons for DDBS:

Local units want control over data.

Note:
A person who maintains the data on a local site is called the local DBA.

A user on a local site is called the local user. Data is generated locally and accessed locally, but there are situations where you require certain reports for which the data must be collected from all sites, e.g. a bank wants to know how many customers it has with a balance of more than one crore. Local control may be desirable in many situations, as it
improves the efficiency in handling/managing the database. That is, local control is helpful in maintenance, backup activities, managing local user accounts etc.

Note:
We may require global access to the data.

There are two basic reasons for the DDBS environment. To better understand these reasons, we need to look at the alternative to a DDBS, which is the centralized database or client-server environment. Taking the example of our bank database, if it is a centralized one, the database is stored at a single place. Suppose that in Pakistan the bank selects a geographically central place, say Multan; then the database is stored in Multan, and whenever users from all over Pakistan want to use their accounts, the data is accessed from the central database (in Multan). If it is a distributed environment, the bank can pick two, three, four or more places, and each database stores the data of its surrounding areas. So the load is now distributed over multiple places. With this scenario in mind, let's discuss the reasons for a DDBS:

Reduce telecom cost:

With a centralized database, all the users from all over the country access it remotely, and therefore there is an increased telecommunication cost. However, if the database is distributed, then for all users the data is closer and access is local most of the time. So it reduces the overall telecommunication cost.

Reduce the risk of telecom failures:

With a centralized database, all the work throughout the country depends on the link with the central site. If, for any reason, the link fails, then work at all the places is discontinued. So a failure at a single site causes damage at all the places. On the other hand, if the database is distributed, then a failure at a single site disturbs only the users of that site; the remaining sites keep working normally. Moreover, one form of data distribution is replication, where the data is duplicated at multiple places. This particular form of data distribution further reduces the cost of a telecommunication failure.

Schema contains:

What has to be shown to the global user.
How the data for a given item is set up on each site.
The type of the data stored on each site.
How we are going to merge the data present on different sites.

Note:
Global users are attached to the Distributed DBMS layer.

Promises of DDBS:

If we adopt a DDBS as a solution for a particular application, these are the features we are going to get:

Transparency:

A transparent system hides the implementation details from the user. There are different types of transparency, but at this point the focus is on distribution transparency, which means that the global user will not have any idea that the data being provided to him is actually coming from multiple sites; rather, he will get the feeling that the data is coming from the machine he is using. It is a very useful feature, as it saves the global user from the complexity and details of distributed access.

Data Independence:

A major advantage of the database approach is data independence: the program and the data are not dependent on each other, i.e. we can change the program with very little or no change to the data, and vice versa.

In a 3-layer architecture, changes at a lower level have little or no effect on the higher level.

 Logical data independence:
If we change the conceptual schema there is little or no effect on the external level.

 Physical data independence:
If we change the physical or lower level then there is little or no effect on the conceptual level.
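The idea of a global query answered transparently from several local sites can be made concrete with a small sketch. The following is only an illustration, not how a commercial DDBMS is implemented: each local site is modelled as an independent in-memory SQLite database, and the site names, customer data and the helper functions `make_site`, `local_count` and `global_count` are all assumptions made for this example. It answers the bank's question of how many customers have a balance above one crore (10,000,000).

```python
import sqlite3

def make_site(rows):
    """Create one local site: an independent in-memory database with its own accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (customer TEXT, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", rows)
    return conn

# Hypothetical local sites and data, for illustration only.
sites = {
    "Lahore": make_site([("Ali", 12_000_000), ("Sara", 400_000)]),
    "Karachi": make_site([("Bilal", 25_000_000), ("Hina", 15_000_000)]),
    "Multan": make_site([("Omar", 90_000)]),
}

def local_count(site, threshold):
    """Local access: run the query against one site's own data only."""
    cur = sites[site].execute(
        "SELECT COUNT(*) FROM accounts WHERE balance > ?", (threshold,)
    )
    return cur.fetchone()[0]

def global_count(threshold):
    """Global access: query every local site and merge the partial results.

    The caller never names a site, which is the point of distribution transparency.
    """
    return sum(local_count(site, threshold) for site in sites)

ONE_CRORE = 10_000_000
print(global_count(ONE_CRORE))
```

The global user calls only `global_count`; which sites exist and how their partial answers are merged stays hidden behind that one function.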
Network transparency:

This is another form of transparency. The user is unaware of even the existence of the network, which frees him from the problems and complexities of the network.

Replication transparency:

Replication and fragmentation are the two ways to implement a DDBS. In replication the same data is stored on multiple sites, e.g. in the case of a bank, every branch holds the data of every other branch. Replication increases the availability of data and reduces the risk of telecom failure. The DDBS hides the replication from the end user; the advantage is that the user simply gets the benefits of the system and does not need to know or understand the technical details.

Summary

In today's lecture we continued the discussion on distributed systems. We discussed the setups that resemble a DDBS, and there we studied distributed file systems and multiprocessor systems. In the latter type, we have share-everything and share-nothing systems. We then discussed the centralized C/S system, which is also a very popular architecture for databases. Then we saw different reasons to have a DDBS and the situations where it suits; we compared it with its alternative and studied why a DDBS is useful for certain types of applications. Finally, we saw what advantages we are going to have if we adopt a DDBS solution.

In previous lecture:

- Resembling Setups
- Why to have a DDBS
- Advantages/Promises of DDBS

In this lecture:

- Promises of DDBS
- Reliability through DDBS
- Complicating factors
- Problem areas

Note:
Full replication is when all the data is stored on every site, and therefore every access is local.

Fragmentation transparency:

A file or a table is broken down into smaller parts/sections called fragments, and those fragments are stored at different locations. Fragmentation will be discussed in detail in the later lectures. Briefly, however, a table can be fragmented horizontally (row-wise) or vertically (column-wise). Hence, we have two major types of fragmentation, horizontal and vertical. Different fragments of a table are placed at different locations. The basic objective of fragmentation and placement at different places is to maximize local access and to reduce remote access, since the latter causes cost and delay.

Fragmentation transparency means that a user should not know that the database is fragmented. The concept of fragmentation should be kept hidden from the user.

Note:
The DBA designs the architecture of the fragments, whereas once implemented it is managed by the DDBMS.

Responsibility of transparency:

Transparency is very much desirable since it hides all the technical details from the users and makes use/access very easy. However, providing transparency involves a cost that has to be borne by someone. The more transparency the DDBS environment provides, the more cost has to be paid. In this section, we discuss who is going to pay the cost of transparency, that is, who is responsible for providing transparency. There are three major players in this regard: the Language/Compiler, the Operating System and the DDBMS.

Language/compiler: The language or compiler used to develop the front-end application programs in a database system, like VB, VB.Net, JSP, Perl etc., is one of the agents that can provide transparency. This front-end tool is used to perform three major tasks: first, it is used to create the user interface
(nowadays generally a graphical user interface (GUI)); second, it is used to perform calculations or any other form of processing; and third, it is used to interact with the DBMS, that is, to store and retrieve the data. These three functions are implemented by the application programmer(s) using the tool. From the end-user's point of view, it does not matter at all which particular tool has been used to build this front-end GUI. This means the language/compiler component provides a certain form of transparency: it frees the end-user from the mechanism used to create the GUI, the mechanism used to establish the link with the DBMS for data manipulation, and the mechanism used to access the data in the database. Since the users' interaction with the DBMS or DDBMS is through the application program that has been developed using a particular tool (language/compiler), in a sense we can say that all types of transparencies are delivered to the end-user through this tool. Practically this is not the case, but apparently it can be said.

Operating system: This layer is not so visible to the end-users most of the time, since it resides below the DDBMS or the language used to develop the application. However, both of them depend heavily on the features supported by the operating system (OS). For example, some OSes provide network transparency, so the DBMS or the language uses the features of the OS to provide this facility to the end-users.

DDBMS: This is the third component that provides certain forms of transparency; for example, distribution transparency is supported by the DDBMS. Although the DDBMS uses the features of the OS, the facility is finally presented by the DDBMS.

Practical Situation: Although we have studied three different components that may be responsible for providing transparencies, in practice the features of all three components are used together to provide a very cohesive and easy-to-use environment for the users. For example, a distributed OS provides network transparency, the DDBMS provides fragmentation or replication transparency, and the front-end tool provides transparency of the linking and manipulation of data. All these features combined make the DDBMS a viable and user-friendly option to use.

Note: The more transparency is provided, the higher the cost will be.

[Figure: Different layers of transparencies]

Reliability of DDBS:

Reliability through distributed transaction: The distributed nature of a DDBS environment reduces the chances of a single point of failure, and that increases the reliability of the entire system. It means that the entire system does not go down with the failure of a single system, as is the case with centralized database systems. It does mean, however, that in a DDBS the users of the site that goes down will suffer, but not the entire system.

Concurrency issues: Concurrent access means the access of data by multiple users at the same time. Concurrency issues arise even in simple (client-server) databases; however, these issues become more critical in a DDBS. Especially in the case of replication, when the same data is duplicated at multiple sites, the consistency of data across those sites is a serious issue that needs extra care.

Performance Improvement:

The DDBS provides improved performance; the major factors causing this improved performance are data localization and query parallelism.

Data Localization: One of the basic principles of data distribution in a DDBS is that the data should reside at the site closest to where it is most frequently accessed. This reduces the communication cost and the delay. However, the DDBS also involves remote accesses, and in that case delay is unavoidable, but through maximized data localization we get overall improved performance.
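The data localization principle can be illustrated with a short sketch. This is a simplified assumption of how placement might be decided, not a method taken from the course text: the access log, the item and site names, and the functions `place_items` and `local_fraction` are hypothetical. Each data item is placed at the site that accesses it most often, and the fraction of accesses that then become local is measured.

```python
from collections import defaultdict

# Hypothetical access log: (data item, site that issued the access).
access_log = [
    ("accounts_lahore", "Lahore"), ("accounts_lahore", "Lahore"),
    ("accounts_lahore", "Karachi"),
    ("accounts_karachi", "Karachi"), ("accounts_karachi", "Karachi"),
    ("accounts_karachi", "Lahore"), ("accounts_karachi", "Karachi"),
]

def place_items(log):
    """Place each item at the site that accesses it most frequently."""
    counts = defaultdict(lambda: defaultdict(int))
    for item, site in log:
        counts[item][site] += 1
    return {item: max(by_site, key=by_site.get) for item, by_site in counts.items()}

def local_fraction(log, placement):
    """Fraction of accesses served by the site that issued them (local accesses)."""
    local = sum(1 for item, site in log if placement[item] == site)
    return local / len(log)

placement = place_items(access_log)
print(placement)
print(local_fraction(access_log, placement))
```

With this log, five of the seven accesses become local; the remaining two are the unavoidable remote accesses mentioned in the text.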
Query Parallelism: This is the second major factor behind improved performance in a DDBS. Since the DDBS involves multiple systems, a query in certain situations can be executed in parallel, which improves performance. There are two types of query parallelism: inter-query and intra-query parallelism. The former means that multiple queries can be executed at the same time, whereas the latter means that the same query is split across multiple sites and the split components are executed in parallel, which increases the throughput. These topics will be discussed in detail in the later lectures.

Complicating factors

There are certain aspects that complicate a DDBS environment. Following are some of those factors.

Selection of the Copy: In the case of replication, the selection of the right copy is a complicating factor. That is, as the same data resides at multiple places, which particular site should be accessed for a particular query is an important question to resolve. One simple solution is to decide on the basis of distance or load. However, the same question arises in a different situation when a particular site goes down. In this case the queries that were originally routed to that site now have to be re-routed. Thus selection of the appropriate copy is an issue that needs extra attention in a DDBS environment.

Failure recovery: Likewise, in the case of replication, the synchronization of copies after a failure has to be dealt with carefully.

Complexity: Since the data is stored at multiple sites and has to be managed there, the overall system is more complex than a centralized database system.

Cost: A DDBS involves more cost, as the hardware and the trained manpower have to be deployed at multiple sites.

Distribution of Control: Access to the data should be allowed carefully. Rights to access data should be well defined for local sites.

The Problem Areas:

The following areas in DDBS still need more work and are considered problem areas.

Database design: All the issues of a centralized database system are applicable in a DDBS, but it introduces an additional aspect related to data placement, that is, where our sites should be located.

Query processing: Problems arise in queries executed at multiple sites, e.g. what should be done when data from one site cannot be collected.

Other critical issues include Concurrency Control, Operating System and Heterogeneity. These issues will be discussed in the later lectures.

The diagram shows the interlink between these problem areas.

The diagram shows that the DDBS design lies at the heart of all issues. It is linked with most of the issues, like Directory Management, Reliability etc. It means that the overall performance of a DDBS mainly depends on the database design. If we can do it efficiently, then most of the other issues will be handled efficiently.

Summary: This lecture continued the discussion on different forms of transparency, including fragmentation transparency. Then the issue of the responsibility for providing the transparency was discussed. Three different components may be considered as the transparency providers; however, practically all three components are used to provide different forms of transparency and to give the end-user a user-friendly environment to work with. After this, different issues that complicate the DDBS environment were discussed, and finally some problem areas were discussed.
In previous lecture:
- Promises of DDBS
- Reliability through DDBS
- Complicating factors
- Problem areas

In this lecture:
- Background topics
- Data Models
- Relational Data Model
- Normalization

So far, we have discussed the introduction to distributed
systems and distributed databases in particular. We should
have some idea in mind about what a DDBS is, the environment
where it suits, and the pros and cons of DDBSs. Before moving
to topics related to DDBS in detail, let us step back a
little and discuss some background topics. These topics
include the Relational Data Model and Networks. These are two
important underlying topics that will help us to have a
better understanding of DDBSs. We start with the Relational
Data Model; we will discuss the Networking concepts later.

Data Model: a set of tools or constructs that are used to
design a database. There are two major categories of data
models. Record based data models have relatively fewer
constructs, and the identification of records (defining the
key) is the responsibility of the user. Data models of this
type are the Network, Hierarchical and Relational Data
Models. The record based data models are also called the
Legacy data models. The Semantic data models, in contrast,
are the ones that have more constructs, so they are
semantically rich; moreover, the identification of records is
managed by the system. Examples of such data models are the
Entity-Relationship and Object-Oriented data models.

The legacy data models have been and are commercially
successful. That is, the DBMSs that are based on these data
models have mostly been used for commercial purposes. There
are different reasons behind this. The DBMSs based on the
network and hierarchical data models were used because these
data models were the initial ones. They were the ones that
replaced the traditional file processing systems. So the
success of the initial DBMSs meant the acceptance of the
database approach over the then existing file system
approach. This is the era from the late 1960s to the early
1980s. The relational data model was proposed in 1970, and by
the late 1980s it started gaining popularity and dominance.
Today it is the unchallenged, most popular and most widely
(perhaps the only) used data model. We have many different
DBMSs based on the relational data model (RDM), like Oracle,
Sybase, DB2, SQL Server and many more. The two major
strengths of the RDM that gave it so much success are:

- It is very simple, as it has only one structure, i.e. a
  relation or a table.
- It has a strong mathematical foundation.

The semantic data models, like the Object-Oriented data
model, could not gain popularity as a commercial choice
because they lack these same two features. On one side, the
OO data model was a bit difficult to understand due to its
complexity, and secondly it is not as well defined due to the
lack of mathematical support. However, semantic data models
are heavily used as a design tool, that is, for the database
design, specially for the conceptual and external schemas.
The reason for this choice is that they provide more
constructs, so a database design in a semantic data model is
more expressive and hence easier to understand.

Since the RDM is dominant in the market, our background
discussion focuses only on this data model.

Relational Data Model

A data model that is based on the concept of relation or
table. The concept of relation has been taken from the
mathematical relation. In databases, the same relation
represented in a two dimensional way is called a table, so
relation and table represent the same thing. The RDM has got
three components:
1. Structure: Support for the storage of data. The RDM
supports only a single structure, and that is a relation or
a table.

2. Manipulation language: The language to access and
manipulate data from the database. SQL (Structured Query
Language) has been accepted as the standard language for the
RDM. SQL is based on relational algebra and relational
calculus.

3. Support for integrity constraints: The RDM supports two
major integrity constraints, such that they have become a
part of the data model itself. It means that any DBMS that is
based on the RDM must support these two constraints, and
they are:
> Entity integrity constraint
> Referential integrity constraint

These two constraints help enforce database consistency. The
relational DBMSs provide support for all these three
components.

Note:
Business rules are enforced through integrity constraints.

A relation R defined over domains D1, D2, ..., Dn is a set of
n-tuples <d1, d2, ..., dn> such that d1 ∈ D1, d2 ∈ D2, ...,
dn ∈ Dn. The structure of a relation is described through a
relation scheme, which is of the form

R(A1:D1, A2:D2, ..., An:Dn) or simply R(A1, A2, ..., An)

where A1, A2, ..., An are the names of the attributes and
D1, D2, ..., Dn are their respective domains. Two example
relation schemes are given below:

- EMP (ENo:Integer, EName:Text, EAdd:Text, ESal:Number),
  or it can be simply written as EMP (ENo, EName, EAdd, ESal)
- Project (PNo:Char(8), PName:Text, bud:Number, stDate:Date),
  or simply Project (PNo, PName, bud, stDate)

Each of the attributes in these two relations has a domain,
and the domains need not be distinct. A relation is a
collection of tuples based on the relation scheme. An example
relation for the above EMP relation scheme is given below

Keys: Key is a status that is assigned to a single attribute
or a collection of attributes in a relation. There are
different types of keys for different purposes; however, the
most important role performed by keys is unique
identification. Different types of keys are mentioned below:

Super Key: An attribute or set of attributes whose value can
be used to uniquely identify each record/tuple in a relation
is a super key. For example, in the EMP table ENo is a super
key, since its value identifies each tuple uniquely. However,
ENo and EName jointly are also a super key; likewise, other
combinations of attributes with ENo are all super keys.

Candidate Key: is a minimal super key, that is, a super key
no proper subset of which is itself a super key. In the above
example, ENo is a super key and a candidate key as well.
However, ENo, EName jointly is a super key but not a
candidate key, as it has a proper subset (ENo) that is a
super key.

Primary key: The successful/selected candidate key is called
the primary key; the remaining candidate keys are called
alternate keys. If a table has got a single candidate key,
then the same will be used as the primary key; however, if
there are multiple candidate keys, then we have to choose one
of them as the primary key and the others will be called the
alternate keys. In our EMP table, we have got only one
candidate key, so the same is the primary key and there is no
alternate key.

Secondary key: is the attribute(s) that are used to access
data from the relation but whose value is not necessarily
unique, like EName.

Foreign key: is the attribute(s) in one table that is/are the
primary key in the other table. The purpose of defining a
foreign key is the enforcement of the referential integrity
constraint that has been mentioned above.

Table:
- Relation represented in a two dimensional form.
- After defining the schema, it is populated and we get the
  table.
- Tuples are rows and attributes are columns.
- Attributes have domains.
- Attributes can contain NULL values. (NULL does not mean
  zero.)
- If a primary key is based on more than one attribute, then
  none of those attributes can have a null value.

Normalization: is a step by step process to produce an
efficient and smart database design that makes it easier to
maintain the consistency of the database. Following are some
main points about normalization:

- Strongly recommended, not a must.
- Performed after the logical database design. That is, once
  we have finalized our design, we fine tune it through the
  normalization process.
- Deals with anomalies, that is, helps to convert the
  database design into an anomaly free design.
- Anomalies are conditions which may make the database
  incorrect or inconsistent. These anomalies are:
  1. Duplication
  2. Insertion anomaly
  3. Update anomaly
  4. Deletion anomaly

Normalization removes the anomalies and makes the maintenance
of the database a lot easier. Normalization is mainly based
on the decomposition of relations, that is, a relation is
analyzed for the existence of anomalies, and if some are
found then it is decomposed into smaller relations that do
not have those anomalies. The concept of the universal
relation is also used in this regard, that is, the design is
considered to consist of a single large table called the
universal table, and this table is then decomposed step by
step following the normalization process. Finally, we get a
database design which is fully normalized.

Normal Forms: There are different normal forms
- First Normal Form
- Second Normal Form
- Third Normal Form
- BCNF (Boyce-Codd Normal Form)
- Fourth Normal Form
- Fifth Normal Form

The two concepts regarding normalization are:

Lossless decomposition: When a relation is decomposed into
two relations, there should be no loss of information, i.e.
when we combine the decomposed relations together we get the
original relation. This loss of information is concerned both
with the relational schema and with the data.

Dependency Preservation: The same dependencies should be
maintained after the decomposition of a relation.

Dependency structure: Normalization is based on dependencies.
The normal forms up to BCNF are based on FDs; further normal
forms are based on multi-valued dependency and project-join
dependency.

Note:
Dependencies are identified, not designed.

Dependencies:

Functional dependency: A functional dependency exists when
the value of one or more attribute(s) can be determined from
the value(s) of one or more other attribute(s). So, if we can
get a unique value of an attribute B against a value of
attribute(s) A, then we say that B is functionally dependent
on A, or A functionally determines B. Unique value means that
if we have a value of A, then there is definitely a
particular/single/unique value of B. There is no ambiguity,
and no question of getting two or more different values of B
against one value of A.

For example, given the relation EMP(empNo, empName, sal),
attribute empName is functionally dependent on attribute
empNo. If we take any value of empNo, then we are going to
have a unique or exactly one value of empName, if it exists.
Let's say we have an empNo ‘E00142’; if this empNo exists
then definitely we are going to get exactly one name against
this empNo, let's say it is ‘Murad’. According to this FD, it
is not possible that against ‘E00142’ we get ‘Murad’ and also
‘Waheed’. To make it more clear, let's describe the same
thing inversely: if we have an empName ‘Murad’ and want to
know the empNo, will we get a single empNo? No. Why? Because
there could be a ‘Murad’ with empNo ‘E00142’, another with
‘E00065’, and yet some others. So this is what is meant by an
FD like empNo → empName. The same is the case with the other
attribute ‘sal’ if it is included in the FD, that is,
empNo → empName, sal.

Note:
Determinant is defined as the attribute(s) on the left hand
side of the arrow in a functional dependency (FD), e.g. A is
the determinant in the following dependency:
A → B

Full Functional dependency: An FD in which the dependent
attributes are determined by the whole of the determinant,
not by any subset of it. Obviously, if the determinant in an
FD (A → B) consists of a single attribute, then it is
definitely a full functional dependency, or we can say that B
is fully functionally dependent on A. If, however, A is a set
of attributes, then we have to see whether it is a full
functional dependency or not. How are we going to know
whether an FD is an FFD or not? By looking at the other FDs
of the same relation. If, for example, we have a relation
R(a, b, c, d, e, f) and the FDs

a, b → c, d, e and e → f,

then the first FD has two attributes in the determinant, and
there is no other FD in which either a or b alone determines
any of the attributes c, d, or e, so it is an FFD. There is
another FD on this relation, but it is a separate,
independent FD; it does not make the first one a non-FFD.

Partial Functional dependency: An FD in which one or more of
the dependent attributes are also functionally dependent on
part of the determinant. For example, consider the table
Staff (StaffID, Name, BranchID) and the FDs:

StaffID, Name → BranchID
StaffID → BranchID

Here BranchID can be determined by StaffID alone as well,
which is a subset of StaffID, Name. So we will say that
BranchID is partially dependent on StaffID, Name, that is,
the first FD is a partial FD.

Transitive Dependency: If for a relation R we have the FDs
a → b and b → c, then it means that a → c, where b and c are
both non-key attributes.

Normal forms

Let us now discuss the normal forms. Brief definitions of the
normal forms:

First Normal Form
o Table must contain atomic/single values in the cells

Second Normal Form
o Relation must be in 1NF
o Each non-key attribute should be fully functionally
  dependent on the key, that is, there exists no partial
  dependency

Third Normal Form
o Relation must be in 2NF
o There should be no transitive dependency

BCNF
o For every non-trivial FD X → Y in R, X is a super key; in
  other words, it is a relation in which every determinant
  is a candidate key.
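
The definitions in this section (super key, candidate key, and the BCNF condition that every determinant is a super key) can be checked mechanically from a list of FDs by computing attribute closures. The following is a minimal sketch under stated assumptions: the function names closure, is_super_key, candidate_keys and bcnf_violations are made up for this illustration, FDs are represented as (determinant, dependents) pairs of tuples, and the candidate key search is a brute-force enumeration that is only practical for small relations. It uses the relation R(a, b, c, d, e, f) with the FDs a, b → c, d, e and e → f from the discussion of full functional dependency.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole determinant is known, its dependents are known too.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def is_super_key(attrs, relation, fds):
    """A super key determines every attribute of the relation."""
    return closure(attrs, fds) >= set(relation)


def candidate_keys(relation, fds):
    """Minimal super keys, found by trying attribute sets from small to large."""
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if is_super_key(combo, relation, fds):
                keys.append(combo)
    return keys


def bcnf_violations(relation, fds):
    """Non-trivial FDs whose determinant is not a super key."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and not is_super_key(lhs, relation, fds)
    ]


R = ("a", "b", "c", "d", "e", "f")
fds = [(("a", "b"), ("c", "d", "e")), (("e",), ("f",))]

keys = candidate_keys(R, fds)         # only (a, b) determines all of R
violations = bcnf_violations(R, fds)  # e -> f: e is not a super key
```

For this relation the only candidate key is (a, b), and the FD e → f is reported because its determinant e is not a super key. This is also the transitive dependency pattern (a, b → e and e → f, with e and f non-key attributes), so decomposing R to remove it is what the Third Normal Form and BCNF definitions call for.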
