Fundamental and Advanced Database Tutorial
• Data models define how data elements are related to one another and how they are processed.
• They provide abstraction: hiding the details of how the data are stored and maintained.
Categories of Data Models
1. Conceptual (high-level, semantic) data models
• Provide concepts that are close to the way many users perceive data.
• Also called entity-based or object-based data models.
• Use concepts such as:
• Entities: represent real-world objects
• Attributes: represent properties of an entity
• Relationships: represent associations among entities
2. Physical (low-level, internal) data models:
• Provide concepts that describe the details of how data is stored in the computer.
• They represent information such as:
• Record formats
• Record ordering
• Access paths
3. Representational (implementation) data models:
• Represent data using record structures.
• Each external schema describes the part of the database that a particular user group is interested in
and hides the rest.
• Describes what data are stored in the database and what relationships exist among the data.
• DCL is the simplest of the SQL subsets, as it consists of only three
commands: GRANT, REVOKE, and DENY
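For illustration, a minimal sketch of these commands (the table name Student and the user clerk_user are hypothetical; DENY is not part of standard SQL but is offered by some systems such as SQL Server):

GRANT SELECT, INSERT ON Student TO clerk_user;   -- give the user read and insert rights
REVOKE INSERT ON Student FROM clerk_user;        -- take the insert right back
DENY DELETE ON Student TO clerk_user;            -- explicitly forbid deletes (non-standard)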
Classification of Database Management Systems
• Based on the data models
ü Relational, Hierarchical, Network, Object- Oriented
• Based on number of users
ü Single-user
ü Multi-user
• Based on number of Sites
üCentralized
üDistributed
• Based on Cost
üLow cost
üMedium cost
üHigh cost
Entity Relationship (ER) Model
• E-R modeling is mainly used to create the conceptual schema for the database from the
collected system specifications
• Entities
• Attributes of entities
• Relationships among entities
E/R Diagram Representation
• Strong entity: rectangle
• Weak entity: double rectangle
• Single-valued attribute: ellipse
• Multivalued attribute: double ellipse
• Derived attribute: dashed ellipse
• Composite attribute: ellipse connected to its component ellipses
• Key attribute: attribute name underlined
• Relationship: diamond
Introduction to the E-R Model
[ER diagram example: entities with attributes such as ID No., FName, Lname, CCode, CName, and Credit]
Types of Attribute
• An attribute can be:
• Simple or composite
• Single-valued or multi-valued
• Stored or derived
Simple VS Composite Attributes
1. Simple (atomic) attribute: cannot be further divided into smaller components.
• Composed of a single value with an independent existence.
• Attributes that are not divisible.
• Examples:
• Gender, SSN, FName
2. Composite attribute: can be divided into smaller subparts, where each subpart is either atomic or composite.
• Composite attributes can form a hierarchy, since a subpart can itself be divided further.
• Examples:
• Name: (First Name, Last Name)
• Address: (Street, City, State, Zip Code)
Single-Valued VS Multi-Valued Attributes
3. Single-valued : attributes have a single value for an entity instance
üThe majority of attributes are single valued attribute for a particular entity.
üExamples: Name, Date Of Birth, Reg. No
4. Multi-valued attributes: may have more than one value for an entity instance.
• May have lower and upper bounds on the number of values allowed for each individual entity.
• Denoted with a double-lined ellipse.
• For example:
• College degree: Bachelor, Master, PhD
• Languages: stores the names of the languages that a student speaks
• Phone number: mobile phone, office phone, home phone
• Hobby: {reading books, listening to music, watching TV, playing football}
Stored VS Derived Attributes
• The value of a derived attribute can be determined by analyzing other attributes, i.e., it can be derived from other attributes.
• Denoted with a dashed ellipse.
• Example:
• Age: can be derived from the current date and the attribute DateOfBirth.
• An attribute whose value cannot be derived from the values of other attributes is called a stored attribute.
• Stored attribute: an attribute from which the values of other attributes are derived.
• E.g., BirthDate of a person.
Database Design
Functional Dependency & Normalization
Functional Dependency (FD)
• Two data items A and B are said to be in a determinant or dependent relationship if certain values of data item B always appear with certain values of data item A.
• If data item A is the determinant and B the dependent data item, then the direction of the association is from A to B and not vice versa.
• We say "A determines B," "B is a function of A," or "A functionally governs B."
• "If A, then B": the value of B must be unique for a given value of A,
• i.e., any given value of A must imply one and only one value of B, in order for the relationship to qualify as a function.
• X → Y holds if whenever two tuples have the same value for X, they must have the same value for Y.
• The notation is A → B, read as: B is functionally dependent on A.
• X → Y: X functionally determines Y.
F = { studID → Name,
courseno → course_Name }
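As a minimal sketch (column types are assumed), each FD in F can be enforced by making its determinant the key of its own relation:

CREATE TABLE Student (
  studID INT PRIMARY KEY,        -- studID → Name: each studID maps to exactly one Name
  Name   VARCHAR(50)
);
CREATE TABLE Course (
  courseno    INT PRIMARY KEY,   -- courseno → course_Name
  course_Name VARCHAR(80)
);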
Functional dependency
• Example: check whether each of the following is an FD or not
• Rno → name : FD (each roll number is associated with exactly one name)
• name → Rno : not an FD (two students may share the same name)
• {Rno, name} → marks : FD
Partial Functional Dependency
• A functional dependency X → Y is a partial dependency if there is some attribute that can be removed from X and yet the dependency still holds.
• If an attribute which is not a member of the primary key is dependent on some part of the primary key (when we have a composite primary key), then that attribute is partially functionally dependent on the primary key.
Let {A,B} be the primary key and C a non-key attribute.
Then if {A,B} → C and B → C hold, C is partially functionally dependent on the primary key.
Full functional Dependency(FFD)
• If X and Y are attribute sets of a relation, Y is fully functionally dependent on X if Y is functionally dependent on X, but not on any proper subset of X.
• If an attribute which is not a member of the primary key is dependent not on some part of the primary key but on the whole key (when we have a composite primary key), then that attribute is fully functionally dependent on the primary key.
• Let {A,B} be the primary key and C a non-key attribute.
• Then if {A,B} → C holds, but neither B → C nor A → C holds,
• then C is fully functionally dependent on {A,B}.
Example
• Price is fully functionally dependent on {Supplier, ItemID} because Price is not functionally dependent on any proper subset of the determinant {Supplier, ItemID}.
Cont…’
Transitive Dependency
• In mathematics and logic, a transitive relationship is a relationship of the
following form: "If A implies B, and if also B implies C, then A implies C."
• Example: If Mr X is a Human, and if every Human is an Animal, then
Mr X must be an Animal.
• Generalized way of describing transitive dependency is that:
• If A functionally governs B, AND
• if B functionally governs C,
• THEN A functionally governs C,
• provided that neither B nor C determines A, i.e., B ↛ A and C ↛ A.
In the normal notation:
{(A → B) AND (B → C)} ==> A → C, provided that B ↛ A and C ↛ A
Normal Forms
• First Normal Form (1NF)
• Second Normal Form (2NF)
• Third Normal Form (3NF)
• Boyce-Codd Normal Form (BCNF)
Normalization towards a logical design consists of the following steps:
• Unnormalized Form: Identify all data elements
• First Normal Form: Find the key with which you can find all data
• Second Normal Form: Remove part-key dependencies. Make all data dependent on the
whole key.
• Third Normal Form:
• Remove non-key dependencies. Make all data dependent on nothing but the key.
• For most practical purposes, databases are considered normalized if they adhere to third
normal form.
First Normal Form(1NF)
• Definition: a table (relation) is in 1NF:
• There are no duplicated rows in the table (each row has a unique identifier).
• Each cell is single-valued (i.e., there are no repeating groups, no composite attributes).
• Entries in a column (attribute, field) are of the same kind.
• Determine the PK of the new entity.
• Repeat these steps until no more repeating groups remain.
Example for First Normal Form (1NF)
• Unnormalized Form: a single relation with the attributes (EmpID, ProjNo, EmpName, ProjName, ProjLoc, ProjFund, ProjMangID, Incentive).
• Business rule: whenever an employee participates in a project, he/she will be entitled to an incentive.
• This schema is in 1NF since we do not have any repeating groups or attributes with the multi-valued property.
• To convert it to 2NF we need to remove all partial dependencies of non-key attributes on part of the primary key.
• {EmpID, ProjNo} → EmpName, ProjName, ProjLoc, ProjFund, ProjMangID, Incentive
• But in addition to this we have the following dependencies:
FD1: {EmpID} → EmpName
FD2: {ProjNo} → ProjName, ProjLoc, ProjFund, ProjMangID
FD3: {EmpID, ProjNo} → Incentive
• Some non-key attributes are partially dependent on part of the primary key. This can be seen by analyzing the first two functional dependencies (FD1 and FD2).
• Thus, each functional dependency, together with its dependent attributes, should be moved to a new relation in which the determinant becomes the primary key.
Cont…’
• EMPLOYEE (EmpID, EmpName)
• PROJECT (ProjNo, ProjName, ProjLoc, ProjFund, ProjMangID)
• EMP_PROJ (EmpID, ProjNo, Incentive)
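A minimal SQL sketch of this 2NF decomposition (column types are assumed; the PROJECT relation holds the attributes of FD2):

CREATE TABLE EMPLOYEE (
  EmpID   INT PRIMARY KEY,
  EmpName VARCHAR(50)
);
CREATE TABLE PROJECT (
  ProjNo     INT PRIMARY KEY,
  ProjName   VARCHAR(50),
  ProjLoc    VARCHAR(50),
  ProjFund   DECIMAL(12,2),
  ProjMangID INT
);
CREATE TABLE EMP_PROJ (
  EmpID     INT REFERENCES EMPLOYEE(EmpID),
  ProjNo    INT REFERENCES PROJECT(ProjNo),
  Incentive DECIMAL(10,2),
  PRIMARY KEY (EmpID, ProjNo)    -- Incentive depends on the whole composite key (FD3)
);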
Third Normal Form (3NF)
• Eliminate columns that are dependent on another non-primary-key column.
• If attributes do not contribute to a description of the key,
• i.e., if they are not directly dependent on the key, remove them to a separate table.
Definition: a Table (Relation) is in 3NF:
• It is in 2NF and
• All attributes depend on nothing but the key.
• There are no transitive dependencies between a primary key and non-primary key
attributes.
• Generally, a table is said to be normalized if it reaches 3NF.
• A database with all tables in the 3NF is said to be Normalized Database.
Cont…’
Example for 3NF. Assumption: students of the same batch (same year) live in one building or dormitory.
STUDENT (StudID, Stud_FName, Stud_LName, Dept, Year, Dormitory)
Cont…’
• This schema is in 2NF since the primary key is a single attribute.
• Let's take StudID, Year and Dormitory and examine the dependencies.
• StudID → Year AND Year → Dormitory
• And Year ↛ StudID and Dormitory ↛ StudID,
• then, transitively, StudID → Dormitory.
• To convert it to 3NF we need to remove all transitive dependencies of non-key attributes on other non-key attributes.
• The non-primary-key attributes that depend on each other will be moved to another table and linked with the main table using a Candidate Key - Foreign Key relationship.
Cont….’
STUDENT
StudID   Stud_FName   Stud_LName   Dept     Year
125/97   Abebe        Mekuria      Info Sc  1
654/95   Lemma        Alemu        Geog     3
842/95   Chane        Kebede       CompSc   3
165/97   Alem         Kebede       InfoSc   1
985/95   Almaz        Belay        Geog     3

DORM
Year   Dormitory
1      401
3      403
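A minimal SQL sketch of the 3NF decomposition (column types are assumed); Year in STUDENT becomes a foreign key referencing DORM, which removes the transitive dependency StudID → Year → Dormitory:

CREATE TABLE DORM (
  Year      INT PRIMARY KEY,
  Dormitory VARCHAR(10)
);
CREATE TABLE STUDENT (
  StudID     VARCHAR(10) PRIMARY KEY,
  Stud_FName VARCHAR(30),
  Stud_LName VARCHAR(30),
  Dept       VARCHAR(20),
  Year       INT REFERENCES DORM(Year)
);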
SQL LANGUAGE
• Relational Model
• The data and the relations between them are organized in tables.
• A table is a collection of records and each record in a table contains the same
fields organized in columns.
Motivation of ODBMSs
What is Object Oriented Database? (OODB)
Object Oriented Database Management
qObject Oriented databases have evolved along two different paths:
qPersistent Object Oriented Programming Languages: (pure ODBMSs)
üStart with an OO language (e.g., C++, Java, SMALLTALK) which has a rich type
system
üAdd persistence to the objects in programming language where persistent objects
stored in databases
qObject Relational Database Management Systems (SQL Systems)
üExtend relational DBMSs with the rich type system and user-defined functions.
üProvide a convenient path for users of relational DBMSs to migrate to OO
technology
• All major vendors (e.g., Informix, Oracle) support, or will support, these object-relational features of SQL.
Object Oriented Concepts
q Object:
• An observable entity in the world being modeled.
• Similar in concept to an entity in the E/R model.
• An object consists of:
Øattributes: properties built in from primitive types
Ørelationships: properties whose type is a reference to some other object or a
collection of references
Ømethods: functions that may be applied to the object.
qClass
• Similar objects with the same set of properties, describing similar real-world concepts, are collected into a class.
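A hedged sketch of these concepts using object-relational SQL in the style of Oracle object types (exact syntax varies by DBMS; the type and attribute names are assumptions, and the method body would be defined separately in a type body):

CREATE TYPE AddressType AS OBJECT (
  street VARCHAR2(50),
  city   VARCHAR2(30)
);
CREATE TYPE PersonType AS OBJECT (
  name      VARCHAR2(50),              -- attribute built from a primitive type
  birthdate DATE,
  addr      AddressType,               -- attribute whose type is another object type
  MEMBER FUNCTION age RETURN NUMBER    -- method that may be applied to the object
);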
Class Extents
Multiple Inheritance
• A class may have more than one superclass.
• If the same property is inherited from more than one superclass, a naming conflict arises; one of the inherited definitions must be chosen (or the property renamed).
Object Identity
• Object identity is a property of data that is created in the context of an object data model,
where an object is assigned a unique internal object identifier, or OID.
• An object identifier (OID) can be stored as an attribute in an object to refer to another object.
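Continuing the hedged Oracle-style sketch above, rows of an object table receive system-generated OIDs, and a REF attribute stores a reference (the OID) to another object; the table and column names are assumptions:

CREATE TABLE person_tab OF PersonType;   -- each row is an object with its own OID
CREATE TABLE employee_tab (
  emp     PersonType,
  manager REF PersonType                 -- stores the OID of another Person object
);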
Persistence
• The signature (interface) of an operation specifies the operation name and its arguments (or parameters).
Polymorphism
Complex Objects
• Unstructured complex object:
• This facility is provided by the DBMS and permits the storage and retrieval of large objects
that are needed by the database application.
q Typical examples of such objects are bitmap images and long text strings (such as
documents); they are also known as binary large objects, or BLOBs for short.
• Structured complex object:
§ This differs from an unstructured complex object in that the object’s structure is
defined by repeated application of the type constructors provided by the OODBMS.
§ Hence, the object structure is defined and known to the OODBMS.
§ The OODBMS also defines methods or operations on it.
Introduction to Query Processing
qQuery optimization
• Query optimization techniques are used to choose an efficient execution plan that minimizes the runtime as well as the use of other resources such as disk I/O, CPU time, and memory.
Steps of query processing
Translating SQL Queries into Relational Algebra
qQuery block:
üThe basic unit that can be translated into the algebraic operators and optimized.
• A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
q Nested queries
ü within a query are identified as separate query blocks.
üAggregate operators in SQL must be included in the extended algebra.
Using Heuristics in Query Optimization
• The main heuristic is to apply first the operations that reduce the size of intermediate
results.
• E.g., apply SELECT and PROJECT operations before applying binary operations such as JOIN, so that the intermediate results become smaller.
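A small worked illustration of this heuristic, using the Employee and Department relations from the distributed query example later in these notes (the illustrative query asks for the last names of employees of department 5):

Unoptimized:  π Lname ( σ Dno=5 ( Employee ⋈ Dno=Dnumber Department ) )
Optimized:    π Lname ( ( σ Dno=5 (Employee) ) ⋈ Dno=Dnumber Department )

Pushing the SELECT below the JOIN means only the department-5 employees take part in the join, so the intermediate result is much smaller.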
Internal representation of Query Optimization
Query tree:
ØA tree data structure that corresponds to a relational algebra expression. It represents the
input relations of the query as leaf nodes of the tree, and represents the relational algebra
operations as internal nodes.
ØAn execution of the query tree consists of executing an internal node operation whenever
its operands are available and then replacing that internal node by the relation that results
from executing the operation.
Query graph:
• A graph data structure that corresponds to a relational calculus expression. It does not
indicate an order on which operations to perform first. There is only a single graph
corresponding to each query.
Using Selectivity and Cost Estimates in Query Optimization
• Cost-based query optimization:
• Estimate and compare the costs of executing a query using different execution strategies and
choose the strategy with the lowest cost estimate.
• (Compare to heuristic query optimization)
• Issues
• Cost function
• Number of execution strategies to be considered
• Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost
• Note: Different database systems may focus on different cost components
Cont’s
• Catalog Information Used in Cost Functions
• Information about the size of a file
• number of records (tuples) (r),
• record size (R),
• number of blocks (b)
• blocking factor (bfr)
• Information about indexes and indexing attributes of a file
• Number of levels (x) of each multilevel index
• Number of first-level index blocks (bI1)
• Number of distinct values (d) of an attribute
• Selectivity (sl) of an attribute
• Selection cardinality (s) of an attribute. (s = sl * r)
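A small worked example with assumed numbers (not taken from these notes): if a file has r = 10,000 records and an attribute Dno has d = 50 distinct, evenly distributed values, then for an equality condition on Dno

sl = 1/d = 1/50 = 0.02
s  = sl * r = 0.02 * 10,000 = 200 records expected to satisfy the condition.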
Overview of Query Optimization in Oracle
• Oracle DBMS V8
• Rule-based query optimization: the optimizer chooses execution plans based
on heuristically ranked operations.
• (Currently it is being phased out)
• Cost-based query optimization: the optimizer examines alternative access paths and operator algorithms and chooses the execution plan with the lowest estimated cost.
• The query cost is calculated based on the estimated usage of resources
such as I/O, CPU and memory needed.
• Application developers could specify hints to the ORACLE query optimizer.
• The idea is that an application developer might know more information about
the data.
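A hedged sketch of such a hint; the /*+ ... */ comment syntax is Oracle's, while the table alias e and the index name emp_dno_idx are assumptions:

SELECT /*+ INDEX(e emp_dno_idx) */ e.Fname, e.Lname
FROM Employee e
WHERE e.Dno = 5;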
INTRODUCTION TO TRANSACTION PROCESSING
• A Transaction:
• A logical unit of database processing that includes one or more access operations (read: retrieval; write: insert or update; delete).
• A transaction (set of operations) may be specified stand-alone in a high-level language such as SQL and submitted interactively, or it may be embedded within an application program.
• Example: transferring an amount of 100 from a checking account to a savings account.
• Transaction boundaries:
• Begin and End transaction.
• An application program may contain several transactions separated by the Begin and
End transaction boundaries
• Basic operations are read and write
• read_item(X): Reads a database item named X into a program variable. To simplify our
notation, we assume that the program variable is also named X.
• write_item(X): Writes the value of program variable X into the database item named X
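A minimal SQL sketch of the money-transfer example (table names, column names, and account numbers are assumptions; transaction-control syntax varies slightly across DBMSs):

START TRANSACTION;                                                     -- Begin transaction boundary
UPDATE Checking SET Balance = Balance - 100 WHERE AcctNo = 'C-101';   -- read and write the checking account
UPDATE Savings  SET Balance = Balance + 100 WHERE AcctNo = 'S-101';   -- read and write the savings account
COMMIT;                                                                -- End transaction boundary: make the changes permanent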
Why Concurrency Control is needed:
• The Lost Update Problem
• This occurs when two transactions that access the same database items have their operations interleaved in a way that makes the value of some database item incorrect. The update made by the first transaction is lost (overwritten) by the second transaction (an interleaving that produces this problem is sketched after this list).
• The Temporary Update (or Dirty Read) Problem
• This occurs when one transaction updates a database item and then the
transaction fails for some reason.
• The updated item is accessed by another transaction before it is changed back to
its original value.
• The Incorrect Summary Problem
• If one transaction is calculating an aggregate summary function on a number of
records while other transactions are updating some of these records, the aggregate
function may calculate some values before they are updated and others after they
are updated.
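A sketch of an interleaving that produces the lost update problem, using the read_item/write_item notation defined earlier (the item X and the amounts are assumed; suppose X is initially 20):

T1: read_item(X); X := X - 5
T2: read_item(X); X := X + 4
T1: write_item(X)      -- writes 15
T2: write_item(X)      -- writes 24, overwriting T1's update

The correct serial result is 20 - 5 + 4 = 19, but this schedule leaves X = 24, so T1's update is lost.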
What causes a Transaction to fail
1. A computer failure (system crash):
• A hardware or software error occurs in the computer system during transaction execution. If the hardware crashes, the contents of the computer's internal memory may be lost.
2. A transaction or system error:
• Some operation in the transaction may cause it to fail, such as integer overflow or division by zero.
3. Local errors or exception conditions detected by the transaction.
4. Concurrency control enforcement:
• The concurrency control method may decide to abort the transaction, to be restarted later, because it violates serializability or because several transactions are in a state of deadlock.
5. Disk failure:
• Some disk blocks may lose their data because of a read or write malfunction or because of a disk read/write head crash.
6. Physical problems and catastrophes:
• This refers to an endless list of problems that includes power or air-conditioning failure, fire, theft, sabotage, overwriting disks or tapes by mistake, and mounting of a wrong tape by the operator.
Transaction states
• Active state
• Partially committed state
• Committed state
• Failed state
• Terminated State
Desirable Properties of Transactions
ACID properties
• Atomicity: A transaction is an atomic unit of processing; it is either
performed in its entirety or not performed at all.
• Consistency preservation: A correct execution of the transaction must
take the database from one consistent state to another.
• Isolation: A transaction should not make its updates visible to other
transactions until it is committed; this property, when enforced strictly,
solves the temporary update problem and makes cascading rollbacks of
transactions unnecessary.
• Durability or permanency: Once a transaction changes the database and
the changes are committed, these changes must never be lost because of
subsequent failure.
Database Concurrency Control
• Concurrency Control: the process of managing simultaneous operations on the database without having
them interfere with one another.
• Purpose of Concurrency Control
• To enforce Isolation (through mutual exclusion) among conflicting transactions.
• To preserve database consistency through consistency preserving execution of transactions.
• To resolve read-write and write-write conflicts
Two-Phase Locking Techniques
A lock is a mechanism to control concurrent access to a data item
• Locking is an operation which secures
• (a) permission to Read
• (b) permission to Write a data item for a transaction.
• Example:
• Lock (Li(X)): data item X is locked on behalf of the requesting transaction.
• Unlocking is an operation which removes these permissions from the data item.
• Example:
• Unlock (Ui(X)): Data item X is made available to all other transactions.
• Lock and Unlock are atomic operations.
Two-Phase Locking Techniques: Essential components
• Two locks modes:
• (a) shared (read) (b) exclusive (write).
• Shared mode: shared lock (X)
• More than one transaction can hold a shared lock on X for reading its value, but no write lock can be applied on X by any other transaction.
• Exclusive mode: Write lock (X)
• Only one write lock on X can exist at any time and no shared lock can be applied by any
other transaction on X.
• Conflict matrix (rows: lock currently held; columns: lock requested)
         Read   Write
Read     Y      N
Write    N      N
Two-Phase Locking Techniques: The algorithm
• Two Phases:
(a) Locking (Growing)
(b) Unlocking (Shrinking).
• Locking (Growing) Phase:
üA transaction applies locks (read or write) on desired data items one at a time.
üTransaction may obtain locks
üTransaction may not release locks
• Unlocking (Shrinking) Phase:
üA transaction unlocks its locked data items one at a time.
üTransaction may release locks
ü Transaction may not obtain locks
• Requirement:
• For a transaction these two phases must be mutually exclusive, that is, during the locking phase the unlocking phase must not start, and during the unlocking phase the locking phase must not begin.
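A small sketch of a transaction that obeys the two-phase rule (the data items X and Y and the operations are assumed for illustration):

read_lock(Y);  read_item(Y);
write_lock(X);                  -- growing phase: every lock is acquired before any is released
unlock(Y);                      -- shrinking phase begins: no new lock may be requested after this
read_item(X);  X := X + Y;  write_item(X);
unlock(X);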
DATABASE RECOVERY TECHNIQUES
Transaction Log
• For recovery from any type of failure data values prior to modification
(BFIM - Before Image) and the new value after modification (AFIM
– After Image) are required.
• These values and other information are stored in a sequential file called the transaction log. A sample log is given below.
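An illustrative sample (the exact record format varies by DBMS; the transaction T1 and the values are assumptions), where each write record stores the item, its BFIM and its AFIM:

[start_transaction, T1]
[write_item, T1, X, 20, 15]    -- BFIM = 20, AFIM = 15
[write_item, T1, Y, 50, 55]
[commit, T1]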
Data Update
• Immediate Update: As soon as a data item is modified in cache, the disk copy is
updated.
• Deferred Update: All modified data items in the cache is written either after a
transaction ends its execution or after a fixed number of transactions have
completed their execution.
• Shadow update: The modified version of a data item does not overwrite its disk
copy but is written at a separate disk location.
• In-place update: The disk version of the data item is overwritten by the cache
version.
Checkpointing
• From time to time (randomly or under some criterion) the DBMS flushes its buffer (cache) to the database disk to minimize the task of recovery.
• Possible ways for flushing database cache to database disk:
1. Steal: Cache can be flushed before transaction commits.
q It avoids the need for a very large buffer space to store updated pages in
memory.
2. No-Steal: Cache cannot be flushed before transaction commit.
3. Force: Cache is immediately flushed (forced) to disk
q All pages updated by a transaction are immediately written to disk when the
transaction commits).
4. No-Force: flushing of the cache to disk is deferred; pages updated by a committed transaction may remain in the cache.
• This eliminates the I/O cost of reading such a page again from disk when another transaction needs it.
Different ways for handling recovery include deferred update, immediate update, and shadow paging.
Shadow Paging
Recovery in multidatabase system
• A multidatabase system is a special distributed database system in which one node may be running a relational database system under UNIX, another may be running an object-oriented system under Windows, and so on.
• Phase1: when all participating databases signal the coordinator that the part of the
MDT involving each has concluded, the coordinator sends a “prepare for commit”
message to each participant to get ready for committing the transaction.
• Phase 2: If all participating databases reply ok, the transaction is successful and the
coordinator sends a commit signal to the participating DBs
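A sketch of the two-phase commit message flow described above (the outcomes are assumed for illustration):

Phase 1: coordinator → every participant: "prepare for commit"
         each participant → coordinator: "ready" (after force-writing its local log) or "cannot commit"
Phase 2: if every participant replied "ready", coordinator → every participant: "commit";
         otherwise coordinator → every participant: "rollback"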
Distributed Databases and Client-Server Architectures
• Advantages of DDB
• Management of distributed data with different levels of transparency: the physical placement of data (files, relations, etc.) is not known to the user (distribution transparency).
CONT’S
• Distribution and Network transparency:
• Users do not have to worry about operational details of the network.
• There is location transparency, which refers to the freedom of issuing commands from any location without affecting their working.
• Then there is naming transparency, which allows access to any named object (files, relations, etc.) from any location.
• Replication transparency:
• Allows copies of data to be stored at multiple sites.
• This is done to minimize access time to the required data.
• Fragmentation transparency:
• Allows a relation to be fragmented horizontally (creating a subset of the tuples of a relation) or vertically (creating a subset of the columns of a relation).
Data Fragmentation, Replication and Allocation
Ø Data Fragmentation
• Split a relation into logically related and correct parts. A relation can be fragmented in two ways:
• Horizontal Fragmentation
• Vertical Fragmentation
• Horizontal fragmentation
• A horizontal fragment is a subset of a relation containing those tuples that satisfy a selection condition.
• Consider the Employee relation with the selection condition (DNO = 5). All tuples that satisfy this condition form a subset which is a horizontal fragment of the Employee relation.
• A selection condition may be composed of several conditions connected by AND or OR.
• Derived horizontal fragmentation: the partitioning of secondary relations based on the fragmentation of a primary relation to which they are related by foreign keys.
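A minimal SQL sketch of this horizontal fragment (the fragment name is an assumption, and CREATE TABLE ... AS SELECT syntax varies slightly by DBMS):

CREATE TABLE Employee_Dno5 AS
  SELECT * FROM Employee WHERE DNO = 5;   -- horizontal fragment: a subset of the tuples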
Vertical fragmentation
• Consider the Employee relation. A vertical fragment can be created by keeping only some of its attributes.
• Because there is no selection condition for creating a vertical fragment, each fragment must include the primary key attribute of the parent relation Employee. In this way all fragments can be joined back to reconstruct the original relation.
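Similarly, a hedged sketch of a vertical fragment; the attributes kept (Name, BDate, Address) are assumptions, and the primary key EmpID is kept so the original relation can be reconstructed by joining the fragments:

CREATE TABLE Employee_Personal AS
  SELECT EmpID, Name, BDate, Address FROM Employee;   -- vertical fragment: a subset of the columns plus the key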
• Data Replication
• In full replication the entire database is replicated at every site; in partial replication only some fragments of the database are replicated.
• Homogeneous: all sites use the same DBMS software.
• For example, all sites run Oracle or DB2, or Sybase or some other database system.
• Heterogeneous
• Federated: each site may run a different database system, but data access is managed through a single conceptual schema.
• This implies that the degree of local autonomy is minimum. Each site must
adhere to a centralized access policy. There may be a global schema.
Query Processing in Distributed Databases
• Issues
• Cost of transferring data (files and results) over the network.
• This cost is usually high so some optimization is necessary.
• Example relations: Employee at site 1 and Department at site 2
• Employee at site 1: 10,000 rows. Row size = 100 bytes. Table size = 10^6 bytes = 1,000,000 bytes.
• Department at site 2: 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
• Q: For each employee, retrieve the employee name and the name of the department where the employee works.
• Q: π Fname,Lname,Dname (Employee ⋈ Dno=Dnumber Department)
• Result
• The result of this query will have 10,000 tuples, assuming that every employee is
related to a department.
• Suppose each result tuple is 40 bytes long. The query is submitted at site 3 and
the result is sent to this site.
• Problem: Employee and Department relations are not present at site 3.
• Strategies:
1. Transfer Employee and Department to site 3.
• Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the result to site 3.
• Query result size = 40 * 10,000 = 400,000 bytes. Total transfer size =
400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site 1, and send the
result to site 3.
• Total bytes transferred = 400,000 + 3500 = 403,500 bytes.
• Optimization criteria: minimizing data transfer.
• Preferred approach: strategy 3.
Concurrency Control and Recovery
• Distributed Databases encounter a number of concurrency control and recovery problems which are not
present in centralized databases. Some of them are listed below.
• Dealing with multiple copies of data items: The concurrency control must maintain global
consistency. Likewise the recovery mechanism must recover all copies and maintain consistency after
recovery
• Failure of individual sites: Database availability must not be affected due to the failure of one or two
sites and the recovery scheme must recover them before they are available for use.
• Communication link failure :This failure may create network partition which would affect database
availability even though all database sites may be running.
• Distributed commit: a transaction may be fragmented, and its fragments may be executed by a number of sites. This requires a two-phase or three-phase commit approach for transaction commit.
• Distributed deadlock: Since transactions are processed at multiple sites, two or more sites may get
involved in deadlock. This must be resolved in a distributed manner.
Client-Server Database Architecture
• It consists of clients running client software, a set of servers which provide all
database functionalities and a reliable communication infrastructure.
[Diagram: servers (Server 1, Server 2, …, Server n) connected to clients (Client 1, Client 2, …, Client n) through a communication network]
THE END!
Q&A