Data Management 1 - Introduction To DBMS and Relational Theory
COURSE DESCRIPTION
CSC 214 is a comprehensive introduction to theories and practices in Database
Design and Management. Database Management I concentrates on principles,
design, implementation and application of Database Management Systems.
Students are introduced to the fundamental theories, concepts and techniques
needed to properly understand and implement the relational database model
which is the bedrock of today’s mainstream database products.
COURSE JUSTIFICATION
An in-depth understanding of the principles and application of database systems
is a critical success factor for information professionals taking leadership
roles in future information systems initiatives. This course offers students the
opportunity of rigorous study of the traditional principles of database design,
implementation and usage.
COURSE OBJECTIVES
Upon successful completion, students should be able to:
§ Demonstrate good knowledge of basic database concepts, including the
structure and operation of the relational data model.
§ Understand and successfully apply logical database design principles,
including E-R diagrams and database normalization.
§ Assess the quality and ease of use of data modelling and diagramming tools.
§ Design and implement a small database project.
§ Describe and discuss selected advanced database topics, such as distributed
database systems and the data warehouse.
COURSE CONTENTS
Information storage & retrieval, information management applications. Information
capture and representation, analysis & indexing, search, retrieval, information
privacy; integrity, security; scalability, efficiency and effectiveness.
Introduction to database system:
Components of database systems, DBMS functions, database architecture and data
independence. Data modelling, entity-relationship model, database design using
entity-relationship and semantic object models, relational data model, process of
database design.
COURSE REQUIREMENT
There are no formal prerequisites for this course.
METHOD OF GRADING
S/N   GRADING                                          SCORE (%)
1.    Continuous Assessments
          C.A I                                        7
          C.A II (Mid-Semester Test)                   15
          C.A III                                      8
2.    Assignment
3.    Practical (Laboratory work)/Case Studies
4.    Final Examination                                70
5.    Total                                            100
Books
§ Database Management Systems (2Ed) by Raghu Ramakrishnan and Johannes
Gehrke
§ Database Systems: Design, Implementation, and Management (10Ed) by Carlos
Coronel, Steven Morris, and Peter Rob (2012). Cengage Learning, Boston. ISBN-
13: 978-1-111-96960-8
§ Database principles and design (3Ed) by Colin Ritchie (2008). Cengage Learning,
London. ISBN-13: 978-1-84480-540-2.
§ Database Systems: The Complete Book (2Ed) by Hector G. M., Jeffrey D. U.,
Jennifer W. (2009). Pearson Education Inc., New Jersey. ISBN 0-13-606701-8
§ Relational Theory for Computer Professionals by C. J. Date (2013). O’Reilly
Media, Inc. Sebastopol. ISBN: 978-1-449-36943-9
Online resources
§ Database Management Systems Relational, Object-Relational and Object-
Oriented Data Models. Center for Objekt Teknology. Available online:
http://www.cit.dk/COT/reports/reports/Case4/05-v1.1/cot-4-05-1.1.pdf
§ http://www.help2engg.com/dbms/dbms-languages
§ Database Management System by Tutorials Point. Available online:
https://www.tutorialspoint.com/dbms/dbms_tutorial.pdf
Information: information is data that has been given meaning by way of relational
connection. This "meaning" can be useful, but does not have to be. In computer
parlance, a relational database makes information from the data stored within it.
Information embodies understanding of some sort, e.g. the temperature dropped
to 15 degrees and then it started raining.
DATABASE
A Database is a shared, integrated computer structure that is a repository for:
- End-user data, that is, raw facts of interest to the end user.
- Metadata, or data about data, which describe the data characteristics and the set
of relationships that link the data found within the database.
Database System
Refers to an organization of components that define and regulate the collection,
storage, management and use of data. From a general management point of view,
the DB system is composed of:
Ø Hardware
Ø Software
Ø People:
§ System administrators: oversee the database system's general operations
§ DB administrators: manage the DBMS and ensure the DB is functioning properly
§ DB designers
§ System analysts and programmers: design and implement the application programs
§ End users
Ø Procedures
Ø Data
ADVANTAGES OF A DBMS
DBMS Architecture
The DBMS provides users with an abstract view of the data in it, i.e. the system hides
from users certain details of how the data is stored and maintained. A DBMS can be
viewed as divided into levels of abstraction. A common architecture generally used
is the ANSI/SPARC (American National Standards Institute - Standards Planning and
Requirements Committee) model.
The ANSI/SPARC model abstracts the DBMS into a 3-tier architecture as follows:
External level
Conceptual level
Internal level
i. External level: The external level is the user’s view of the database and is closest
to the users. It presents only the relevant part of the database to the user. E.g. a
bank database stores a lot more information, but an account holder is only
interested in his/her account details, such as the current account balance and
transaction history. An external schema describes each external view. The
external schema consists of the definition of the logical records and the
relationships in the external view. At the external level, different views may
have different representations of the same data. (A CREATE VIEW sketch after
this list illustrates such an external view.)
ii. Conceptual level: At this level of database abstraction, all the database entities
and relationships among them are included. Conceptual level provides the
community view of the database and describes what data is stored in the database
and the relationships among the data. In other words, the conceptual view
represents the entire database of an organization. It is a complete view of the data
requirements of the organization that is independent of any storage
consideration. The conceptual schema defines the conceptual view. It is also called
the logical schema. There is only one conceptual schema per database. The figure
shows the conceptual view record of a database.
iii. Internal level or physical level: The lowest level of abstraction is the
internal level. It is the one closest to the physical storage device. This level is also
termed the physical level, because it describes how data are actually stored on the
storage medium, such as a hard disk or magnetic tape. This level indicates how the
data will be stored in the database and describes the data structures, file
structures and access methods to be used by the database. The internal schema
defines the internal level. The internal schema contains the definition of the
stored record, the methods of representing the data fields and the access methods
used. The figure shows the internal view record of a database.
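For instance, the bank example under the external level could be exposed to an
account holder through a view defined on top of the conceptual schema. The
ACCOUNT table and its columns below are hypothetical:

-- External level: a view tailored to one class of users
CREATE VIEW ACCOUNT_SUMMARY AS
SELECT ACC_NUM, ACC_BALANCE   -- only the details the account holder needs
FROM ACCOUNT;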
DBMS LANGUAGES
The workings of a DBMS are controlled by four different languages (brief SQL examples of each follow the list below); they are:
Ø Data Definition Language (DDL): Used by the DBA and database designers to
specify the conceptual schema of a database. In many DBMSs, the DDL is also
used to define internal and external schemas (views). In some DBMSs, separate
storage definition language (SDL) and view definition language (VDL) are used
to define internal and external schemas. SDL is typically realized via DBMS
commands provided to the DBA and database designers. Some examples include:
Ø CREATE - to create objects in the database
Ø ALTER - alters the structure of the database
Ø DROP - delete objects from the database
Ø TRUNCATE - remove all records from a table, including all space
allocated for the records
Ø COMMENT - add comments to the data dictionary
Ø RENAME - rename an object
Ø Data Manipulation Language (DML): used to retrieve, insert, update and delete
the data held in the database; typical statements are SELECT, INSERT, UPDATE
and DELETE.
Ø Data Control Language (DCL): used for granting and revoking user access on
a database
Ø To grant access to user – GRANT
Ø To revoke access from user – REVOKE
Ø Transaction Control (TCL): Statements are used to manage the changes made
by DML statements. It allows statements to be grouped together into logical
transactions.
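As a brief, illustrative sketch of the four languages (the PRODUCT table, its columns
and the user name are invented for this example and do not belong to any schema
used later in these notes):

-- DDL: define a table
CREATE TABLE PRODUCT (
    PROD_CODE     CHAR(6)      NOT NULL,
    PROD_DESCRIPT VARCHAR(35)  NOT NULL,
    PROD_PRICE    DECIMAL(8,2),
    PRIMARY KEY (PROD_CODE)
);

-- DML: manipulate the data held in the table
INSERT INTO PRODUCT (PROD_CODE, PROD_DESCRIPT, PROD_PRICE)
VALUES ('P00001', 'Claw hammer', 1250.00);
UPDATE PRODUCT SET PROD_PRICE = 1300.00 WHERE PROD_CODE = 'P00001';
SELECT * FROM PRODUCT;

-- DCL: grant and revoke access
GRANT SELECT ON PRODUCT TO sales_clerk;
REVOKE SELECT ON PRODUCT FROM sales_clerk;

-- TCL: group DML statements into one logical transaction
BEGIN TRANSACTION;          -- START TRANSACTION in some dialects
DELETE FROM PRODUCT WHERE PROD_CODE = 'P00001';
COMMIT;                     -- or ROLLBACK to undo the changes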
Example
Write the SQL code that will create the table structure for a table named EMP_1.
This table is a subset of the EMPLOYEE table. The basic EMP_1 table structure is
summarized in the following table. EMP_NUM is the primary key and JOB_CODE is the
FK to JOB.
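Since the summary table is not reproduced in these notes, the sketch below assumes
plausible data types for the six columns used in the INSERT statements that follow,
and assumes that JOB_CODE is also the primary key of the JOB table:

CREATE TABLE EMP_1 (
    EMP_NUM      CHAR(3)      NOT NULL,
    EMP_LNAME    VARCHAR(15)  NOT NULL,
    EMP_FNAME    VARCHAR(15)  NOT NULL,
    EMP_INITIAL  CHAR(1),
    EMP_HIREDATE DATE,
    JOB_CODE     CHAR(3),
    PRIMARY KEY (EMP_NUM),
    FOREIGN KEY (JOB_CODE) REFERENCES JOB (JOB_CODE)
);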
Having created the table structure in (a), write the SQL code to enter the first two
rows for the table EMP_1 below:
INSERT INTO EMP_1
(EMP_NUM, EMP_LNAME, EMP_FNAME, EMP_INITIAL, EMP_HIREDATE, JOB_CODE)
VALUES
('101', 'News', 'John', 'G', '08-Nov-00', '502'),
('102', 'Senior', 'David', 'H', '12-Jul-89', '500');
Assuming the data shown in the EMP_1 table have been entered, write the SQL code
that will list all attributes for a job code of 502.
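Assuming JOB_CODE was defined as a character column (as the quoted values in the
INSERT above imply), the query is:

SELECT *
FROM EMP_1
WHERE JOB_CODE = '502';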
Write the SQL code that will save the changes made to the EMP_1 table.
COMMIT WORK;
NB: WORK is optional.
Traditionally, there are four data models, which also represent the historical
development of the DBMS: the hierarchical, network, relational and object-oriented models.
Example of Hierarchical data model
Network model:
It represents complex data relationships more effectively than the hierarchical
model. The major improvement is that the one-to-many limitation was removed; the
model still views data in a hierarchical one-to-many structure, but now a record may
have more than one parent. Network data models represent data in a symmetric
manner, unlike the hierarchical data model (which distinguishes between a parent and a
child). Information is organized as a collection of graphs of records that are related
with pointers. This is more flexible than a hierarchical data model and still permits
efficient navigation.
Example of network data model
The records consist of lists of fields (fixed or variable length with maximum
length), where each field contains a simple value (fixed or variable size). The
network data model also introduces the notion of indexes of fields and records, sets
of pointers, and physical placement of records. A DDL for network data models must
allow the definition of record types, field types, pointers and indexes, and the DML
must allow navigation through the graphs via the pointers and indexes.
Programs also navigate close to the physical
storage structures, implying that the network data model supports only limited data
independence; such databases are therefore difficult to maintain as the data models evolve over
time.
Developed by E.F. Codd (IBM) in 1970, the relational data model has a mathematical
foundation in relational algebra. The model is based on first-order predicate logic
and defines a table as an n-ary relation. Data is organized in relations (two-
dimensional tables). Each relation contains a set of tuples (records). Each tuple
contains a number of fields. A field may contain a simple value (fixed or variable
size) from some domain (e.g. integer, real, text, etc.).
§ Guaranteed data consistency and accuracy: Data is consistent and accurate due
to the various levels of integrity you can impose within the database. (This will
become quite clear as you work through the design process.)
§ Easy data retrieval: At the user’s command, data can be retrieved either from a
particular table or from any number of related tables within the database.
This enables a user to view information in an almost unlimited number of ways.
One commonly perceived disadvantage of the relational database was that software
programs based on it ran very slowly.
Some definitions
RELATION: a relation, as defined by E. F. Codd, is a set of tuples (d1, d2, ..., dn), where
each element dj is a member of Dj, a data domain, for each j=1, 2, ..., n. A data domain
is simply a data type. It specifies a data abstraction: the possible values for the data
and the operations available on the data. For example, a String can have zero or more
characters in it, and has operations for comparing strings, concatenating strings, and
creating strings. A relation is a truth predicate. It defines what attributes are
involved in the predicate and what the meaning of the predicate is. In the relational data
model, relations are represented in table format. This format stores the
relation among entities. A table has rows and columns, where rows represent
records and columns represent the attributes. E.g.
TUPLE: A single row of a table, which contains a single record for that relation is
called a tuple. A tuple has attribute values which match the required attributes in
the relation. The ordering of attribute values is immaterial. Every tuple in the body
of a given relation is required to conform to the heading (attribute) of that relation,
i.e. it contains exactly one value, of the applicable type, for each attribute, and
nothing else besides.
ATTRIBUTE DOMAIN: Every attribute has some predefined value scope, known as
attribute domain
SCHEMAS: The name of a relation and the set of attributes for a relation is called
the schema for that relation. The schema is depicted by the relation name followed
by a parenthesized list of its attributes. Thus, the schema for relation Movies above
is
Movies (title, year, length, genre)
In the relational model, a database consists of one or more relations. The set of
schemas for the relations of a database is called a relational database schema, or
just a database schema.
Data Types: All attributes must have a data type. The following are the primitive
data types that are supported by SQL (Structured Query Language) systems.
i. Character strings of fixed or varying length. The type CHAR(n) denotes a fixed-
length string of up to n characters. VARCHAR(n) also denotes a string of up
to n characters. The difference is implementation-dependent; typically CHAR
implies that short strings are padded to make n characters, while VARCHAR
implies that an endmarker or string-length is used. Normally, a string is
padded by trailing blanks if it becomes the value of a component that is a fixed-
length string of greater length. For example, the string ’foo’ if it became the
value of a component for an attribute of type CHAR(5), would assume the value
’foo ’ (with two blanks following the second o).
ii. Bit strings of fixed or varying length. These strings are analogous to fixed
and varying-length character strings, but their values are strings of bits
rather than characters. The type BIT (n) denotes bit strings of length n, while
BIT VARYING (n) denotes bit strings of length up to n.
iii. The type BOOLEAN denotes an attribute whose value is logical. The possible
values of such an attribute are TRUE, FALSE.
iv. The type INT or INTEGER (these names are synonyms) denotes typical integer
values. The type SHORTINT also denotes integers, but the number of bits
permitted may be less, depending on the implementation (as with the types int
and short int in C).
v. Floating-point numbers can be represented in a variety of ways. We may use the
type FLOAT or REAL (these are synonyms) for typical floating point numbers.
A higher precision can be obtained with the type DOUBLE PRECISION. We can
also specify real numbers with a fixed decimal point. For example, DECIMAL(n,d)
allows values that consist of n decimal digits, with the decimal point assumed
to be d positions from the right. Thus, 0123.45 is a possible value of type
DECIMAL(6,2). NUMERIC is almost a synonym for DECIMAL, although there are
possible implementation-dependent differences.
vi. Dates and times can be represented by the data types DATE and TIME,
respectively. These values are essentially character strings of a special form.
We may, in fact, coerce dates and times to string types, and we may do the
reverse if the string “makes sense” as a date or time.
A relational database Schema is depicted by stating both the attributes and their
datatype:
Movies (
title CHAR(100),
year INT,
length INT,
genre CHAR(10),
studioName CHAR(30),
producer INT
)
{
<Person SSN# = "123-45-6789" Name = "Art Larsson" City = "San Francisco">,
<Person SSN# = "231-45-6789" Name = "Lino Buchanan" City = "Philadelphia">,
<Person SSN# = "321-45-6789" Name = "Diego Jablonski" City = "Chicago">
}
It is more common and concise to show a relation value as a table. All ordering within
the table is artificial and meaningless.
Design theory for Relational Database
A common problem with schema design involves trying to combine too much into one
relation, thus leading to redundancy. Improvements to relational schemas therefore pay
close attention to eliminating redundancy. The theory of “dependencies” is a well-
developed theory for relational databases providing guidelines on how to develop
good schemas and eliminate flaws, if any. The first concept we need to consider is
Functional Dependency (FD).
The functional dependence definition can be generalized to cover the case in which
the determining attribute values occur more than once in a table.
RELATION KEYS: The key’s role is based on a concept known as determination. I.e.
the statement “A determines B” indicates that if you know the value of attribute A,
you can look up (determine) the value of attribute B. E.g.:
an invoice number identifies all of the invoice attributes such as invoice date and
the customer name.
if we know STU_NUM in a STUDENT table we can look up (determine) student’s last
name, grade point average, phone number, etc.
Definitions
Key Attribute(s): We say a set of one or more attributes {A1, A2, ..., An} is a key for
a relation R if:
i. Those attributes functionally determine all other attributes of the relation.
That is, it is impossible for two distinct tuples of R to agree on all of A1, A2,
..., An (uniqueness).
ii. No proper subset of {A1, A2, ..., An} functionally determines all other
attributes of R; i.e., a key must be minimal.
When a key consists of a single attribute A, we often say that A (rather than {A}) is
a key. An attribute that is part of a key is called key attribute.
Consider the Relation Movies below:
Attributes {title, year, starName} form a key for the relation Movies because it meets
the two conditions:
Condition 1:
Do they functionally determine all the other attributes? Yes
Condition 2:
Does any proper subset of {title, year, starName} functionally determine all
other attributes? No:
{title, year} does not determine starName, thus {title, year} is not a key.
{year, starName} is not a key because we could have a star in two movies
in the same year; therefore
{year, starName} → title is not an FD.
{title, starName} is not a key, because two movies with the same title, made
in different years, can have a star in common.
Therefore, no proper subset of {title, year, starName} functionally
determines all other attributes
Super Key (i.e. a superset of a key): An attribute or a combination of attributes
that is used to identify the records uniquely is known as Super Key. It is to be noted
that some superkeys are not (minimal) keys. Note that every superkey satisfies the
first condition of a key: it functionally determines all other attributes of the
relation. However, a superkey need not satisfy the second condition: minimality. A
table can have many Super Keys. E.g. of Super Key
§ ID
§ ID, Name
§ ID, Address
§ ID, Department_ID
§ ID, Salary
§ Name, Address
Candidate Key: It can be defined as minimal Super Key or irreducible Super Key. In
other words an attribute or a combination of attribute that identifies the record
uniquely but none of its proper subsets can identify the records uniquely. E.g. of
Candidate Key
Code
Name, Address
Primary Key: A Candidate Key that is used by the database designer for unique
identification of each row in a table is known as Primary Key. A Primary Key can
consist of one or more attributes of a table. E.g. of Primary Key: the database designer
can use one of the Candidate Keys as the Primary Key.
In this case we have “Code” and “Name, Address” as Candidate Key,
The designer may prefer “Code” as the Primary Key as the other key is the
combination of more than one attribute.
Null values should never be part of a primary key; they should also be avoided to
the greatest extent possible in other attributes. A null is no value at all. It does
not mean a zero or a space. There are rare cases in which nulls cannot be reasonably
avoided when you are working with non-key attributes. For example, one of an
EMPLOYEE table’s attributes is likely to be the EMP_INITIAL. However, some
employees do not have a middle initial. Therefore, some of the EMP_INITIAL values
may be null. Null can also exist because of the nature of the relationship between
two entities. Conventionally, the existence of nulls in a table is often an indication
of poor database design. Nulls, if used improperly, can create problems because they
have many different meanings. For example, a null can
represent:
An unknown attribute value.
A known, but missing, attribute value.
A “not applicable” condition.
Foreign Key: e.g. if Dept_ID is the primary key in table DeptTbl, the DeptID attribute
of table Employee (the dependent or child table) can be defined as a Foreign Key,
since it references the Dept_ID attribute of the table DeptTbl (the referenced or
parent table). A Foreign Key value must match an existing value in the parent
table or be NULL.
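Using the table and column names above, the foreign key could be declared as follows
(the constraint name is arbitrary):

ALTER TABLE Employee
ADD CONSTRAINT fk_employee_dept
FOREIGN KEY (DeptID) REFERENCES DeptTbl (Dept_ID);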
Composite Key: If we use multiple attributes to create a Primary Key then that
Primary Key is called Composite Key (also called a Compound Key or Concatenated
Key).
Alternate Key: Alternate Key can be any of the Candidate Keys except for the
Primary Key.
Secondary Key: The attributes that are not a Super Key but can still be used
for identification of records (not necessarily uniquely) are known as Secondary Keys.
E.g. of Secondary Key can be Name, Address, Salary, Department_ID etc. as they can
identify the records but they might not be unique.
Exercise
Suppose R is a relation with attributes A1, A2, ..., An. As a function of n, tell how
many superkeys R has, if:
a) The only key is A1.
b) The only keys are A1 and A2
c) The only keys are {A1, A2} and {A3, A4}
d) The only keys are {A1, A2} and {A1, A3}
These rules guide us on how we can infer a functional dependency from other given
FD’s.
E.g., given that a relation R (A, B, C) satisfies the FD’s
A —> B and B —> C,
then we can deduce that R also satisfies the FD
A —> C.
Proof:
Consider two tuples of R that agree on A
Let the tuples agreeing on attribute A be (a, b1, c1) and (a, b2, c2)
Since R satisfies A → B, and these tuples agree on A, they must also agree on B. That
is, b1 = b2
The tuples are now (a, b, c1) and (a, b, c2), where b is both b1 and b2.
Similarly, since R satisfies B → C, and the tuples agree on B, they agree also on C.
Thus, c1= c2; i.e., the tuples do agree on C.
We have proved that any two tuples of R that agree on A also agree on C, and that is
the FD
A → C.
This rule is called the transitive rule
In other words, we may split attributes on the right side so that only one attribute
appears on the right of each FD. Likewise, we can replace a collection of FD’s having
a common left side by a single FD with the same left side and all the right sides
combined into one set of attributes. In either event, the new set of FD’s is equivalent
to the old. The equivalence noted above can be used in two ways.
§ We can replace an FD
A1, A2, …, An → B1, B2, …, Bm by a set of FD’s
A1, A2, …, An → Bi for i = 1, 2, ..., m
We call this transformation the splitting rule.
§ We can replace a set of FD’s
A1, A2, …, An → Bi for i = 1, 2, ..., m by the single FD
A1, A2, …, An → B1, B2, …, Bm.
We call this transformation the combining rule.
Trivial-dependency rule.
There is an intermediate situation in which some, but not all, of the attributes on
the right side of an FD are also on the left. Such an FD is not trivial; however, the
attributes that appear on both sides add nothing, so they can be removed from the
right side to give an equivalent FD.
Given a set α = {A1, A2, ..., An} of attributes of R and a set F of functional dependencies,
we need a way to find all of the attributes of R that are functionally determined
by α. This set of attributes is called the closure of α under F and is denoted α+.
Finding α+ is useful because it lets us test whether a given FD follows from F (as
shown below). It can be computed with the following algorithm:
result := α
repeat
    temp := result
    for each functional dependency β → γ in F do
        if β ⊆ result then
            result := result ∪ γ
until temp = result
Example:
Consider a relation with attributes A, B, C, D, E, and F. Suppose that this relation
has the FD’s
AB → C, BC → AD, D → E, and CF → B.
What is the closure of {A, B}?
Solution
First, split BC → AD into BC → A and BC → D. Start with Result = {A, B}.
For AB → C: AB ⊆ Result, so Result = Result ∪ {C}, i.e. Result = {A, B, C}
For BC → A and BC → D: BC ⊆ Result, so Result = Result ∪ {A, D}, i.e. Result = {A, B, C, D}
For D → E: D ⊆ Result, so Result = Result ∪ {E}, i.e. Result = {A, B, C, D, E}
No more changes to Result are possible, thus {A, B}+ = {A, B, C, D, E}.
By computing the closure of any set of attributes, we can test whether any given FD
A1, A2, …, An → B follows from a set of FD’s S.
First compute {A1, A2, …, An}+ using the set of FD’s S. If B is in {A1, A2, …, An}+, then
A1, A2, …, An → B does follow from S, and if B is not in {A1, A2, …, An}+, then this FD
does not follow from S.
More generally, A1, A2, …, An → B1, B2, …, Bm follows from set of FD’s S if and only
if all of B1, B2, ..., Bm are in {A1, A2, …, An}+
Example:
Consider the relation and FD’s in the example above. Suppose we wish to test whether
AB → D follows from these FD’s. We compute {A, B}+, which is {A, B, C, D, E}. Since D
is a member of the closure, we conclude that AB → D does follow.
On the other hand, consider the FD
D → A. To test whether this FD follows from the given FD’s, first compute {D}+.
{D}+ = {D, E}. Since A is not a member of {D, E}, we conclude that D → A does not follow.
Armstrong's Axioms
§ Reflexivity / reflexive rule: If {B1, B2, ..., Bm} ⊆ {A1, A2, ..., An}, then
A1, A2, …, An → B1, B2, …, Bm. These are what we have called trivial FD’s.
§ Augmentation rule: If A1, A2, …, An → B1, B2, …, Bm, then
A1, A2, …, An, C1, C2, …, Ck → B1, B2, …, Bm, C1, C2, …, Ck for any set of attributes C1,
C2, ..., Ck.
Since some of the C’s may also be A’s or B’s or both, we should eliminate from
the left side duplicate attributes and do the same for the right side.
§ Transitivity rule: If A1, A2, …, An → B1, B2, …, Bm and B1, B2, …, Bm → C1, C2,
…, Ck hold in relation R, then A1, A2, …, An → C1, C2, …, Ck also holds in R.
If some of the C’s are among the A’s, we may eliminate them from the right side
by the trivial-dependency rule.
Additional rules:
Example:
Assume there are 4 attributes A, B, C, D and that F = {A → B, B → C}. To compute F+ we
first compute the closure of each attribute set:
A+ = AB+ = AC+ = ABC+ = {A, B, C}
B+ = BC+ = {B, C}
C+ = {C}
D+ = {D}
CD+ = {C, D}
AD+ = ABD+ = ACD+ = ABCD+ = {A, B, C, D}
BD+ = BCD+ = {B, C, D}
Exercise
Consider a relation with schema R (A, B, C, D) and FD’s AB → C, C → D and D → A.
i. What are all the nontrivial FD’s that follow from the given FD’s? You should
restrict yourself to FD’s with single attributes on the right side.
ii. What are all the keys of R?
iii. What are all the superkeys for R that are not keys?
Very few DBMSs are capable of supporting all eight relational operators. To be
considered minimally relational, the DBMS must support the key relational
operators SELECT, PROJECT, and JOIN. (SQL sketches of several of these operators
are given after the list below.)
1. SELECT, also known as RESTRICT, yields values for all the rows found in a
table that satisfy a given condition. SELECT yields a horizontal subset of a
table.
2. PROJECT yields all values for selected attributes. PROJECT yields a vertical
subset of a table
3. UNION: combines all rows from two or more tables, excluding duplicate rows.
In order to be used in a UNION, the tables must be UNION compatible, that is:
Ø The relations must all have the same number of attributes.
Ø Corresponding columns must all have identical data types and lengths.
When these criteria are met, the tables are said to be union compatible.
4. INTERSECT: yields only the rows that appear in both tables. As with UNION,
the tables must be union-compatible to yield valid results.
5. DIFFERENCE: yields all rows in one table that are not found in the other table.
As with the UNION, the tables must be UNION-compatible to yield valid results.
6. PRODUCT: yields all possible pairs of rows from two tables- also known as
Cartesian product. Therefore, if one table has six rows and the other table
has three, the PRODUCT yields a list composed of 6 x 3= 18 rows.
7. JOIN: Joins two tables together using a shared key usually either the primary
key or foreign key. JOIN allows the use of independent tables linked by common
attributes. Join is a fundamental concept in Relational database. A join can
either be inner join or outer join. An inner join is a join that only returns
matched records from the tables that are being joined e.g. natural Join,
equijoin, theta join. In an outer join, the matched pairs would be retained, and
any unmatched values in the other table would be left null. We look at types
of join below:
§ Natural join (Inner Join): A natural join links tables by selecting only
the rows with common values in their common attribute(s). A natural join
is the result of a three-stage process:
a. PRODUCT of the tables is created
b. SELECT is performed on the output of Step a) to yield only the rows
whose values are equal.
c. A PROJECT is performed on the results of Step b to yield a single copy
of each attribute, thereby eliminating duplicate columns. The final
outcome of a natural join yields a table that does not include
unmatched pairs and provides only copies of the matches.
Step 2: SELECT yields only the rows for which the AGENT_CODE values are
equal. The common columns are referred to as the join columns.
§ Theta join: If any other comparison operator such as (<, >, …) is used, the
join is called a theta join.
SELECT *
FROM CUSTOMER
JOIN AGENT ON (AGENT.AGENT_CODE > CUSTOMER.AGENT_CODE);
§ Outer Join: In an outer join, the matched pairs would be retained, and any
unmatched values in the other table would be left null. It is an easy
mistake to think that an outer join is the opposite of an inner join.
However, it is more accurate to think of an outer join as an “inner join
plus.” The outer join still returns all of the matched records that the
inner join returns, plus it returns the unmatched records from one of
the tables. The SQL OUTER JOIN operator (+) is used on only one side of
the join condition. The subtypes of OUTER JOIN are:
Ø Left outer join or left join
Ø Right outer join or right join
Ø Full outer join
Syntax
SELECT *
FROM table1, table2
WHERE conditions [+];
§ The LEFT JOIN (specified with the keywords LEFT JOIN and ON) joins two
tables and fetches all matching rows of two tables for which the sql-
expression is true, plus rows from the first table that do not match any
row in the second table.
E.g.
SELECT *
FROM CUSTOMER
LEFT OUTER JOIN AGENT
ON CUSTOMER.AGENT_CODE = AGENT.AGENT_CODE;
§ The RIGHT JOIN joins two tables and fetches all rows that satisfy the join
condition in both tables, plus the unmatched rows from the table written
after the JOIN clause (table2 in the syntax below).
Syntax
SELECT *
FROM table1
RIGHT [OUTER] JOIN table2
ON table1.column_name=table2.column_name;
E.g.
SELECT *
FROM CUSTOMER
RIGHT OUTER JOIN AGENT
ON CUSTOMER.AGENT_CODE = AGENT.AGENT_CODE;
Right Join
§ Full outer join: the FULL OUTER JOIN combines the results of both left
and right outer joins and returns all (matched or unmatched) rows from
the tables on both sides of the join clause.
Syntax
SELECT *
FROM table1
FULL OUTER JOIN table2
ON table1.column_name=table2.column_name;
More specifically, if an outer join is produced for tables CUSTOMER and AGENT,
two scenarios are possible:
8. The DIVIDE operation uses one single-column table (e.g., column “a”) as the
divisor and one 2-column table (i.e., columns “a” and “b”) as the dividend. The
tables must have a common column (e.g., column “a”). The output of the DIVIDE
operation is a single column with the values of column “a” from the dividend
table rows where the value of the common column (i.e., column “a”) in both
tables matches.
Divide operation
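To connect these operators to SQL, the sketches below show typical equivalents of the
operators that were not already illustrated above. CUSTOMER and AGENT are the tables
used in the join examples, while CUSTOMER_2, the column CUS_LNAME and the value
'501' are assumptions made for illustration:

-- SELECT (RESTRICT): a horizontal subset of CUSTOMER
SELECT * FROM CUSTOMER WHERE AGENT_CODE = '501';

-- PROJECT: a vertical subset of CUSTOMER
SELECT CUS_LNAME, AGENT_CODE FROM CUSTOMER;

-- UNION, INTERSECT and DIFFERENCE over union-compatible tables
SELECT * FROM CUSTOMER UNION     SELECT * FROM CUSTOMER_2;
SELECT * FROM CUSTOMER INTERSECT SELECT * FROM CUSTOMER_2;
SELECT * FROM CUSTOMER EXCEPT    SELECT * FROM CUSTOMER_2;   -- MINUS in Oracle

-- PRODUCT (Cartesian product)
SELECT * FROM CUSTOMER CROSS JOIN AGENT;

-- Natural (inner) join on the common AGENT_CODE column
SELECT * FROM CUSTOMER NATURAL JOIN AGENT;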
The 1:1 relationship: As the 1:1 label implies, in this relationship, one entity can be
related to only one other entity, and vice versa. For example, one department chair—
a professor—can chair only one department, and one department can have only one
department chair.
If we examine the PROFESSOR and DEPARTMENT tables, we note some important
features (an SQL sketch after this list shows one way to implement the 1:1 relationship):
§ Each professor is a College employee; thus, the professor identification is
through the EMP_NUM. (However, note that not all employees are professors—
there’s another optional relationship.)
§ The 1:1 PROFESSOR chairs DEPARTMENT relationship is implemented by having
the EMP_NUM as foreign key in the DEPARTMENT table. Note that the 1:1
relationship is treated as a special case of the 1:M relationship in which the
“many” side is restricted to a single occurrence. In this case, DEPARTMENT
contains the EMP_NUM as a foreign key to indicate that it is the department that
has a chair.
§ Also, note that the PROFESSOR table contains the DEPT_CODE foreign key to
implement the 1:M DEPARTMENT employs PROFESSOR relationship. This is a good
example of how two entities can participate in two (or even more) relationships
simultaneously. The preceding “PROFESSOR chairs DEPARTMENT” example
illustrates a proper 1:1 relationship. In fact, the use of a 1:1 relationship
ensures that two entity sets are not placed in the same table when they should
not be. However, the existence of a 1:1 relationship sometimes means that the
entity components were not defined properly. It could indicate that the two
entities actually belong in the same table! As rare as 1:1 relationships should
be, certain conditions absolutely require their use. One such condition is the
concept called generalization hierarchy, which is a powerful tool for improving
database designs under specific conditions to avoid a proliferation of nulls. One
of the characteristics of generalization hierarchies is that they are implemented
as 1:1 relationships.
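One way to implement the 1:1 “PROFESSOR chairs DEPARTMENT” relationship is sketched
below; the data types and the DEPT_NAME column are assumptions, EMP_NUM is assumed
to be the primary key of PROFESSOR, and the UNIQUE constraint is what restricts the
“many” side to a single occurrence:

CREATE TABLE DEPARTMENT (
    DEPT_CODE  CHAR(4)      NOT NULL,
    DEPT_NAME  VARCHAR(30),
    EMP_NUM    INTEGER,       -- the professor who chairs the department
    PRIMARY KEY (DEPT_CODE),
    FOREIGN KEY (EMP_NUM) REFERENCES PROFESSOR (EMP_NUM),
    UNIQUE (EMP_NUM)
);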
To examine the M:N relationship more closely, imagine a small college with two
students, each of whom takes three classes. The table below shows the enrollment
data for the two students.
Sample Student Enrollment Data
Given the data relationship and the sample data in the table above, it might wrongly
be assumed that an M:N relationship can be implemented by simply adding a foreign key on
the many side of the relationship that points to the primary key of the related table.
This is not correct:
§ The tables will create many redundancies. For example, note that the STU_NUM
values occur many times in the STUDENT table. In a real-world situation,
additional student attributes such as address, classification, major, and home
phone would also be contained in the STUDENT table, and each of those
attribute values would be repeated in each of the records shown here.
Similarly, the CLASS table contains many duplications: each student taking the
class generates a CLASS record. The problem would be even worse if the CLASS
table included such attributes as credit hours and course description.
§ Given the structure and contents of the two tables, the relational operations
become very complex and are likely to lead to system efficiency errors and
output errors.
The problems inherent in the many-to-many (M:N) relationship can easily be avoided
by creating a
composite entity (also referred to as a bridge entity or an associative entity).
Because such a table is used to link the tables that were originally related in an
M:N relationship, the composite entity structure includes—as foreign keys—at least
the primary keys of the tables that are to be linked. The database designer can then
define the composite table’s primary key either by using the combination of those
foreign keys or by creating a new primary key. In the example above, we can create the
composite ENROLL table to link CLASS and STUDENT. In this example, the ENROLL table’s
primary key is the combination of its foreign keys CLASS_CODE and STU_NUM. But
the designer could have decided to create a single-attribute new primary key such as
ENROLL_LINE, using a different line value to identify each ENROLL table row
uniquely. (Microsoft Access users might use the Autonumber data type to generate
such line values automatically).
Table name: STUDENT
Primary key: STU_NUM
Foreign key: none
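A minimal sketch of the linking table in SQL is shown below; the data types are
assumptions, and ENROLL_GRADE is the additional attribute discussed further below:

CREATE TABLE ENROLL (
    CLASS_CODE    CHAR(5)  NOT NULL,
    STU_NUM       CHAR(6)  NOT NULL,
    ENROLL_GRADE  CHAR(1),
    PRIMARY KEY (CLASS_CODE, STU_NUM),
    FOREIGN KEY (CLASS_CODE) REFERENCES CLASS (CLASS_CODE),
    FOREIGN KEY (STU_NUM) REFERENCES STUDENT (STU_NUM)
);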
Because the ENROLL table links two tables, STUDENT and CLASS, it is also called a
linking table.
In other words, a linking table is the implementation of a composite entity.
The ENROLL table yields the required M:N to 1:M conversion. Observe that the
composite entity represented by the ENROLL table must contain at least the primary
keys of the CLASS and STUDENT tables (CLASS_CODE and STU_NUM, respectively)
for which it serves as a connector. Also note that the STUDENT and CLASS tables
now contain only one row per entity. The ENROLL table contains multiple
occurrences of the foreign key values, but those controlled redundancies are
incapable of producing anomalies as long as referential integrity is enforced.
Additional attributes may be assigned as needed. In this case, ENROLL_GRADE is
selected to satisfy a reporting requirement. Also note that the ENROLL table’s
primary key consists of the two attributes CLASS_CODE and STU_NUM because both
the class code and the student number are needed to define a particular student’s
grade. Naturally, the conversion is reflected in the ERM, too. The revised
relationship is shown below:
note that the composite entity named ENROLL represents the linking table between
STUDENT and CLASS. We can increase the amount of available information even as
we control the database’s redundancies. Below is the expanded ERM, including the
1:M relationship between COURSE and CLASS. Note that the model is able to handle
multiple sections of a CLASS while controlling redundancies by making sure that
all of the COURSE data common to each CLASS are kept in the COURSE table.
The relationship diagram that corresponds to the ERM shown above is as below:
The ERD above contains all of the components introduced thus far. Note that
CAR_VIN is the primary key, and CAR_COLOR is a multivalued attribute of the
CAR entity.
Splitting the multivalued attribute into new attributes
Another benefit we can derive from this approach is that we are now able
to assign as many colors as necessary without having to change the table
structure.
Note that the ERM shown in Figure above reflects the components listed
in previous table. This is the preferred way to deal with multivalued
attributes. Creating a new entity in a 1:M relationship with the original
entity yields several benefits: it’s a more flexible, expandable solution,
and it is compatible with the relational model!
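A sketch of this 1:M approach in SQL follows; the data types and the COLOR
column name are assumptions for illustration:

CREATE TABLE CAR (
    CAR_VIN  CHAR(17)  NOT NULL,
    -- other CAR attributes omitted
    PRIMARY KEY (CAR_VIN)
);

CREATE TABLE CAR_COLOR (
    CAR_VIN  CHAR(17)     NOT NULL,
    COLOR    VARCHAR(15)  NOT NULL,
    PRIMARY KEY (CAR_VIN, COLOR),
    FOREIGN KEY (CAR_VIN) REFERENCES CAR (CAR_VIN)
);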
Derived Attributes: A derived attribute is an attribute whose value is
calculated (derived) from other attributes. The derived attribute need
not be physically stored within the database; instead, it can be derived
by using an algorithm. For example, an employee’s age, EMP_AGE, may be
found by computing the integer value of the difference between the
current date and the EMP_DOB. In Microsoft Access, we use:
INT((DATE() - EMP_DOB)/365)
In Microsoft SQL Server, we use
SELECT DATEDIFF(YEAR, EMP_DOB, GETDATE()),
where DATEDIFF is a function that computes the difference between
dates. The first parameter indicates the measurement, in this case, years.
In Oracle, we use SYSDATE instead of DATE().
A derived attribute is indicated in the Chen notation by a dashed line
connecting the attribute and the entity. The Crow’s Foot notation does
not have a method for distinguishing the derived attribute from other
attributes.
The left side of the ER diagram shows the Chen notation, based on Peter Chen’s
landmark paper. In this notation, the connectivities are written next to each
entity box. Relationships are represented by a diamond connected to the
related entities through a relationship line. The relationship name is written
inside the diamond. The right side illustrates the Crow’s Foot notation. The
name “Crow’s Foot” is derived from the three-pronged symbol used to represent
the “many” side of the relationship. In the basic Crow’s Foot ERD represented
above, the connectivities are represented by symbols. For example, the “1” is
represented by a short line segment, and the “M” is represented by the three-
pronged “crow’s foot.” The relationship name is written above the relationship
line. In Figure above, entities and relationships are shown in a horizontal
format, but they may also be oriented vertically. The entity location and the
order in which the entities are presented are immaterial; just remember to read
a 1:M relationship from the “1” side to the “M” side.
Knowing the minimum and maximum number of entity occurrences is very useful
at the application software level. A college might want to ensure that a class
is not taught unless it has at least 10 students enrolled. Similarly, if the
classroom can hold only 30 students, the application software should use that
cardinality to limit enrollment in the class. However, keep in mind that the
DBMS cannot handle the implementation of the cardinalities at the table
level—that capability is provided by the application software or by triggers.
Note that the Chen notation above identifies the weak entity by using a
double-walled entity rectangle. The Crow’s Foot notation generated by
Visio Professional uses the relationship line and the PK/FK designation
to indicate whether the related entity is weak.
A strong (identifying) relationship indicates that the related entity is
weak. Such a relationship means that both conditions for the weak entity
definition have been met—the related entity is existence-dependent, and
the PK of the related entity contains a PK component of the parent entity.
Remember that the weak entity inherits part of its primary key from its
strong counterpart. For example, at least part of the DEPENDENT
entity’s key shown in Figure above was inherited from the EMPLOYEE
entity:
EMPLOYEE (EMP_NUM, EMP_LNAME, EMP_FNAME, EMP_INITIAL, EMP_DOB,
EMP_HIREDATE)
DEPENDENT (EMP_NUM, DEP_NUM, DEP_FNAME, DEP_DOB)
Crow’s Foot symbols
Example
Mr Brandon, the owner of SPEED CAFÉ, has been having problems with the
management of his Café. Having learnt that you are a DB designer, he believes he has
finally found a solution. He has asked you to automate the management of his Café.
Since this will involve a database backend, you are saddled with the task of showing
him a good database model based on the following business rules:
• The café has several employees each having a unique identification number,
names and dates of birth.
• An employee is either a “Technical Officer” or a “Casual Employee”, but not
both. A Technical Officer has access to one or more computing facilities in
the Café and therefore has a login username and password. Technical
Officers have varying salary rates based on their ranks. Casual Employees,
however, do not have access to computing facilities and their salaries are
wages (i.e. based on the number of hours worked).
• All Computing facilities in the Café have names (e.g. computer, cable, printer
etc.) and date of purchase (remember names are not unique, so you will have
to choose a surrogate key).
• Access to Internet facilities in the Café (either by a staff member or a customer) is
through a ticket. Each ticket has a unique ticket number, duration, time of
production, period (number of days) of validity and price in Naira.
Solution ER diagram (summary): the employees are specialized (disjointly) into
TECHNICIAN (PK, FK: emp_id; Login_username; Login_password) and CASUAL WORKER
(PK, FK: emp_id; Hours_worked; Wage_rate); a TICKET entity (PK: ticket_number;
Date_of_production; Date_of_expiry; Price; Duration) is connected through the
“has” and “Accessed_through” relationships.
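As a complement to the diagram summary above, a minimal SQL sketch of one possible
implementation is shown below. The EMPLOYEE supertype table, all data types and the
salary_rate column are assumptions made for illustration, and the sketch does not show
how tickets are linked to employees or customers:

CREATE TABLE EMPLOYEE (
    emp_id         INTEGER      NOT NULL,   -- unique identification number
    emp_name       VARCHAR(40)  NOT NULL,
    date_of_birth  DATE,
    PRIMARY KEY (emp_id)
);

CREATE TABLE TECHNICIAN (                    -- "Technical Officer" subtype
    emp_id          INTEGER      NOT NULL,
    login_username  VARCHAR(20),
    login_password  VARCHAR(20),
    salary_rate     DECIMAL(10,2),           -- assumed: varies by rank
    PRIMARY KEY (emp_id),
    FOREIGN KEY (emp_id) REFERENCES EMPLOYEE (emp_id)
);

CREATE TABLE CASUAL_WORKER (                 -- "Casual Employee" subtype
    emp_id        INTEGER  NOT NULL,
    hours_worked  DECIMAL(6,2),
    wage_rate     DECIMAL(8,2),
    PRIMARY KEY (emp_id),
    FOREIGN KEY (emp_id) REFERENCES EMPLOYEE (emp_id)
);

CREATE TABLE TICKET (
    ticket_number       INTEGER  NOT NULL,
    duration            INTEGER,             -- assumed: minutes of access
    date_of_production  DATE,
    date_of_expiry      DATE,
    price               DECIMAL(8,2),        -- in Naira
    PRIMARY KEY (ticket_number)
);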