DBMS Complete Notes With Addons
1: Introduction
Purpose of Database Systems
View of Data
Database Languages
Relational Databases
Database Design
Object-based and semistructured databases
Data Storage and Querying
Transaction Management
Database Architecture
Database Users and Administrators
Overall Structure
Database Management System (DBMS)
DBMS contains information about a particular enterprise
Database Applications:
Banking: all transactions
Airlines: reservations, schedules
Universities: registration, grades
Sales: customers, products, purchases
Online retailers: order tracking, customized recommendations
Manufacturing: production, inventory, orders, supply chain
Human resources: employee records, salaries, tax deductions
Purpose of Database Systems
In the early days, database applications were built directly on top of
file systems
A DBMS consists of:
A collection of interrelated data
A set of programs to access the data
An environment that is both convenient and efficient to use
Databases touch all aspects of our lives
What Is a DBMS?
A database is a very large, integrated collection of data.
It models a real-world enterprise:
Entities (e.g., students, courses)
Relationships (e.g., Madonna is taking CS564)
A Database Management System (DBMS) is a software package designed to store and
manage databases.
Why Use a DBMS?
Data independence and efficient access.
Reduced application development time.
Data integrity and security.
Uniform data administration.
Concurrent access, recovery from crashes.
Why Study Databases?
Shift from computation to information
at the low end: scramble to webspace (a mess!)
at the high end: scientific applications
Datasets increasing in diversity and volume.
Digital libraries, interactive video, Human Genome project, EOS project
... need for DBMS exploding
DBMS encompasses most of CS
OS, languages, theory, AI, multimedia, logic
Files vs. DBMS
Application must stage large datasets between main memory and secondary storage (e.g.,
buffering, page-oriented access, 32-bit addressing, etc.)
Special code for different queries
Must protect data from inconsistency due to multiple concurrent users
Crash recovery
Security and access control
Drawbacks of using file systems to store data:
Data redundancy and inconsistency
Multiple file formats, duplication of information in different files
Difficulty in accessing data
Need to write a new program to carry out each new task
Data isolation: multiple files and formats
Integrity problems
Integrity constraints (e.g. account balance > 0) become buried in program
code rather than being stated explicitly
Hard to add new constraints or change existing ones
Drawbacks of using file systems (cont.)
Atomicity of updates
Failures may leave database in an inconsistent state with partial updates
carried out
Example: Transfer of funds from one account to another should either
complete or not happen at all
Concurrent access by multiple users
Concurrent access is needed for performance
Uncontrolled concurrent accesses can lead to inconsistencies
Example: Two people reading a balance and updating it at the same time
Security problems
Hard to provide user access to some, but not all, data
Database systems offer solutions to all the above problems
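The funds-transfer example above can be sketched with a transaction. This is only a minimal illustration, assuming Python's built-in sqlite3 module; the account numbers, balances, the CHECK constraint and the helper names transfer and balances are hypothetical and not part of these notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the CHECK constraint states the integrity rule explicitly.
conn.execute("CREATE TABLE account (account_number TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A-101", 500), ("A-102", 100)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE account SET balance = balance + ? "
                         "WHERE account_number = ?", (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? "
                         "WHERE account_number = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT account_number, balance FROM account"))

ok = transfer(conn, "A-101", "A-102", 200)       # both updates are applied
failed = transfer(conn, "A-102", "A-101", 1000)  # debit violates the CHECK
```

In the second call the credit to A-101 has already been executed when the debit fails, so the rollback is what prevents a partial update from surviving.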
Levels of Abstraction
Physical level: describes how a record (e.g., customer) is stored.
Logical level: describes data stored in database, and the relationships among the data.
type customer = record
customer_id : string;
customer_name : string;
customer_street : string;
customer_city : string;
end;
View level: application programs hide details of data types. Views can also hide
information (such as an employee's salary) for security purposes.
DBMS used to maintain, query large datasets.
Benefits include recovery from system crashes, concurrent access, quick application
development, data integrity and security.
Levels of abstraction give data independence.
A DBMS typically has a layered architecture.
DBAs hold responsible jobs and are well-paid!
DBMS R&D is one of the broadest, most exciting areas in CS.
View of Data
An architecture for a database system
Instances and Schemas
Similar to types and variables in programming languages
Schema: the logical structure of the database
Example: The database consists of information about a set of customers and
accounts and the relationship between them
Analogous to type information of a variable in a program
Physical schema: database design at the physical level
Logical schema: database design at the logical level
Instances and Schemas (cont.)
Instance: the actual content of the database at a particular point in time
Analogous to the value of a variable
Physical data independence: the ability to modify the physical schema without
changing the logical schema
Applications depend on the logical schema
In general, the interfaces between the various levels and components should be
well defined so that changes in some parts do not seriously influence others.
Data Models
A collection of tools for describing
Data relationships
Data semantics
Data constraints
Relational model
Entity-Relationship data model (mainly for database design)
Object-based data models (Object-oriented and Object-relational)
Semistructured data model (XML)
Other older models:
Network model
Hierarchical model
Data Models
A data model is a collection of concepts for describing data.
A schema is a description of a particular collection of data, using a given data model.
The relational model of data is the most widely used model today.
Main concept: relation, basically a table with rows and columns.
Every relation has a schema, which describes the columns, or fields.
Example: University Database
Conceptual schema:
Students(sid: string, name: string, login: string,
age: integer, gpa:real)
Courses(cid: string, cname:string, credits:integer)
Enrolled(sid:string, cid:string, grade:string)
Physical schema:
Relations stored as unordered files.
Index on first column of Students.
External Schema (View):
Data Independence
Applications insulated from how data is structured and stored.
Logical data independence : Protection from changes in logical structure of data.
Physical data independence: Protection from changes in physical structure of data.
Data Manipulation Language (DML)
Language for accessing and manipulating the data organized by the appropriate data model
DML also known as query language
Two classes of languages
Procedural: user specifies what data is required and how to get those data
Declarative (nonprocedural): user specifies what data is required without
specifying how to get those data
SQL is the most widely used query language
Data Definition Language (DDL)
Specification notation for defining the database schema
Example: create table account (
account_number char(10),
branch_name char(10),
balance integer)
DDL compiler generates a set of tables stored in a data dictionary
Data dictionary contains metadata (i.e., data about data)
Database schema
Data storage and definition language
Specifies the storage structure and access methods used
Integrity constraints
Domain constraints
Referential integrity (e.g. branch_name must correspond to a valid branch in
the branch table)
Relational Model
Example of tabular data in the relational model
A Sample Relational Database
SQL: widely used non-procedural language
Example: Find the name of the customer with customer-id 192-83-7465
select customer.customer_name
from customer
where customer.customer_id = '192-83-7465'
Example: Find the balances of all accounts held by the customer with
customer-id 192-83-7465
select account.balance
from depositor, account
where depositor.customer_id = '192-83-7465' and
depositor.account_number = account.account_number
Application programs generally access databases through one of:
Language extensions to allow embedded SQL
Application program interfaces (e.g., ODBC/JDBC), which allow SQL queries to
be sent to a database
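As a rough illustration of the API route, the sketch below sends the two example queries to a database from a host program. It assumes Python's built-in sqlite3 module as a stand-in for an ODBC/JDBC-style interface; the column types and all sample rows (the customer name, account numbers and balances) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer  (customer_id TEXT PRIMARY KEY, customer_name TEXT);
    CREATE TABLE account   (account_number TEXT PRIMARY KEY, balance INTEGER);
    CREATE TABLE depositor (customer_id TEXT, account_number TEXT);
    -- Hypothetical sample data
    INSERT INTO customer  VALUES ('192-83-7465', 'Johnson');
    INSERT INTO account   VALUES ('A-101', 500), ('A-201', 900);
    INSERT INTO depositor VALUES ('192-83-7465', 'A-101'),
                                 ('192-83-7465', 'A-201');
""")

cid = "192-83-7465"

# Query 1: name of the customer; the ? placeholder passes the id as a string.
name = conn.execute(
    "SELECT customer.customer_name FROM customer "
    "WHERE customer.customer_id = ?", (cid,)).fetchone()[0]

# Query 2: balances of all accounts held by that customer.
balances = sorted(row[0] for row in conn.execute(
    "SELECT account.balance FROM depositor, account "
    "WHERE depositor.customer_id = ? "
    "AND depositor.account_number = account.account_number", (cid,)))
```

Passing the id as a parameter avoids the quoting problem: an unquoted 192-83-7465 would be evaluated as an arithmetic expression.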
Database Users
Users are differentiated by the way they expect to interact with
the system
Application programmers interact with system through DML calls
Sophisticated users form requests in a database query language
Specialized users write specialized database applications that do not fit into the
traditional data processing framework
Naive users invoke one of the permanent application programs that have been
written previously
Examples: people accessing a database over the web, bank tellers, clerical staff
Database Administrator
Coordinates all the activities of the database system
Has a good understanding of the enterprise's information resources and needs.
Database administrator's duties include:
Storage structure and access method definition
Schema and physical organization modification
Granting users authority to access the database
Backing up data
Monitoring performance and responding to changes
Database tuning
Data storage and Querying
Storage management
Query processing
Transaction processing
Storage Management
Storage manager is a program module that provides the interface between the low-
level data stored in the database and the application programs and queries submitted
to the system.
The storage manager is responsible for the following tasks:
Interaction with the file manager
Efficient storing, retrieving and updating of data
Storage access
File organization
Indexing and hashing
Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
Alternative ways of evaluating a given query
Equivalent expressions
Different algorithms for each operation
Cost difference between a good and a bad way of evaluating a query can be enormous
Need to estimate the cost of operations
Depends critically on statistical information about relations which the database
must maintain
Need to estimate statistics for intermediate results to compute cost of complex expressions
Transaction Management
A transaction is a collection of operations that performs a single logical function in a
database application
Transaction-management component ensures that the database remains in a consistent
(correct) state despite system failures (e.g., power failures and operating system
crashes) and transaction failures.
Concurrency-control manager controls the interaction among the concurrent
transactions, to ensure the consistency of the database.
Database Architecture
The architecture of a database system is greatly influenced by
the underlying computer system on which the database is running:
Parallel (multiple processors and disks)
Overall System Structure
Database Application Architectures
History of Database Systems
1950s and early 1960s:
Data processing using magnetic tapes for storage
Tapes provide only sequential access
Punched cards for input
Late 1960s and 1970s:
Hard disks allow direct access to data
Network and hierarchical data models in widespread use
Ted Codd defines the relational data model
Would win the ACM Turing Award for this work
IBM Research begins System R prototype
UC Berkeley begins Ingres prototype
High-performance (for the era) transaction processing
History (cont.)
1980s:
Research relational prototypes evolve into commercial systems
SQL becomes industry standard
Parallel and distributed database systems
Object-oriented database systems
1990s:
Large decision support and data-mining applications
Large multi-terabyte data warehouses
Emergence of Web commerce
2000s:
XML and XQuery standards
Automated database administration
Increasing use of highly parallel database systems
Web-scale distributed data storage systems
Database design:
Conceptual design: (ER Model is used at this stage.)
What are the entities and relationships in the enterprise?
What information about these entities and relationships should we store in the database?
What are the integrity constraints or business rules that hold?
A database schema in the ER model can be represented pictorially (ER diagrams).
Can map an ER diagram into a relational schema
A database can be modeled as:
a collection of entities,
relationship among entities.
An entity is an object that exists and is distinguishable from other objects.
Example: specific person, company, event, plant
Entities have attributes
Example: people have names and addresses
An entity set is a set of entities of the same type that share the same properties.
Example: set of all persons, companies, trees, holidays
Entity Sets customer and loan:
An entity is represented by a set of attributes, that is, descriptive properties possessed by all
members of an entity set.
Domain: the set of permitted values for each attribute
Attribute types:
Simple and composite attributes.
Single-valued and multi-valued attributes
Example: multivalued attribute: phone_numbers
Derived attributes
Can be computed from other attributes
Example: age, given date_of_birth
Mapping Cardinality Constraints:
Express the number of entities to which another entity can be associated via a relationship set.
Most useful in describing binary relationship sets.
For a binary relationship set the mapping cardinality must be one of the following types:
One to one
One to many
Many to one
Many to many
Mapping Cardinalities:
ER Model Basics:
Entity: Real-world object distinguishable from other objects. An entity is described (in DB) using a
set of attributes.
Entity Set: A collection of similar entities. E.g., all employees.
All entities in an entity set have the same set of attributes (at least until we consider ISA hierarchies).
Each entity set has a key.
Each attribute has a domain.
ER Model Basics (Contd.):
Relationship: Association among two or more entities. E.g., Attishoo works in Pharmacy
Relationship Set: Collection of similar relationships.
An n-ary relationship set R relates n entity sets $E_1, \ldots, E_n$; each relationship in R involves entities
$e_1 \in E_1, \ldots, e_n \in E_n$
Same entity set could participate in different relationship sets, or in different roles in same set.
A relationship is an association among several entities
Example: Hayes (customer entity), depositor (relationship set), A-102 (account entity)
A relationship set is a mathematical relation among $n \geq 2$ entities, each taken from entity sets:
$\{(e_1, e_2, \ldots, e_n) \mid e_1 \in E_1, e_2 \in E_2, \ldots, e_n \in E_n\}$
where $(e_1, e_2, \ldots, e_n)$ is a relationship
Example: (Hayes, A-102) $\in$ depositor
Relationship Set borrower:
Relationship Sets (Cont.):
An attribute can also be property of a relationship set.
For instance, the depositor relationship set between entity sets customer and account may have the
attribute access-date
Degree of a Relationship Set:
Refers to number of entity sets that participate in a relationship set.
Relationship sets that involve two entity sets are binary (or degree two). Generally, most
relationship sets in a database system are binary.
Relationship sets may involve more than two entity sets.
Additional features of the ER model
Participation Constraints
Does every department have a manager?
If so, this is a participation constraint: the participation of Departments in Manages is said to be
total (vs. partial).
Every Departments entity must appear in an instance of the Manages relationship.
Weak Entities:
A weak entity can be identified uniquely only by considering the primary key of another (owner) entity.
Owner entity set and weak entity set must participate in a one-to-many relationship set (one owner,
many weak entities).
Weak entity set must have total participation in this identifying relationship set.
Weak Entity Sets (Cont.):
We depict a weak entity set by double rectangles.
We underline the discriminator of a weak entity set with a dashed line.
payment_number: discriminator of the payment entity set
Primary key for payment: (loan_number, payment_number)
Note: the primary key of the strong entity set is not explicitly stored with the weak entity set, since
it is implicit in the identifying relationship.
If loan_number were explicitly stored, payment could be made a strong entity, but then the
relationship between payment and loan would be duplicated by an implicit relationship defined by
the attribute loan_number common to payment and loan
More Weak Entity Set Examples:
In a university, a course is a strong entity and a course_offering can be modeled as a weak entity
The discriminator of course_offering would be semester (including year) and section_number (if
there is more than one section)
If we model course_offering as a strong entity we would model course_number as an attribute.
Then the relationship with course would be implicit in the course_number attribute
1. Introduction to relational model
2. Enforcing integrity constraints
3. Logical Database Design
4. Logical Database Design
5. Introduction to Views
6. Relational Algebra
7. Tuple Relational Calculus
8. Domain Relational Calculus
Relational Database: Definitions
Relational database: a set of relations
Relation: made up of 2 parts:
Instance : a table, with rows and columns.
#Rows = cardinality, #fields = degree / arity.
Schema : specifies name of relation, plus name and type of each column.
E.g., Students(sid: string, name: string, login: string, age: integer, gpa: real).
Can think of a relation as a set of rows or tuples (i.e., all rows are distinct).
Example Instance of Students Relation
Cardinality = 3, degree = 5, all rows distinct
Do all columns in a relation instance have to
be distinct?
Relational Query Languages
A major strength of the relational model: supports simple, powerful querying of data.
Queries can be written intuitively, and the DBMS is responsible for efficient evaluation.
The key: precise semantics for relational queries.
Allows the optimizer to extensively re-order operations, and still ensure that the
answer does not change.
The SQL Query Language
Students instance:
sid name login age gpa
53666 Jones jones@cs 18 3.4
53688 Smith smith@eecs 18 3.2
53650 Smith smith@math 19 3.8
To find all 18-year-old students, we can write:
SELECT *
FROM Students S
WHERE S.age=18
Result:
sid name login age gpa
53666 Jones jones@cs 18 3.4
53688 Smith smith@eecs 18 3.2
To find just names and logins, replace the first line:
SELECT S.name, S.login
Querying Multiple Relations
What does the following query compute?
SELECT S.name, E.cid
FROM Students S, Enrolled E
WHERE S.sid=E.sid AND E.grade='A'
Given the following instance of Enrolled:
sid cid grade
53831 Carnatic101 C
53831 Reggae203 B
53650 Topology112 A
53666 History105 B
we get:
S.name E.cid
Smith Topology112
Creating Relations in SQL
CREATE TABLE Students (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER,
gpa REAL)
Creates the Students relation. Observe that the type of each field is specified, and
enforced by the DBMS whenever tuples are added or modified.
As another example, the Enrolled table holds information about courses that students take.
CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2))
Destroying and Altering Relations
DROP TABLE Students
Destroys the relation Students. The schema information and the tuples are deleted.
ALTER TABLE Students ADD COLUMN firstYear INTEGER
The schema of Students is altered by adding a new field; every tuple in the current instance is
extended with a null value in the new field.
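The queries in this part can be tried end to end. The following is a small sketch, assuming Python's built-in sqlite3 module as the DBMS; it loads the Students and Enrolled instances from these notes and runs the selection and the two-relation query. The variable names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (sid CHAR(20), name CHAR(20), login CHAR(10),
                           age INTEGER, gpa REAL);
    CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2));
    INSERT INTO Students VALUES
        ('53666', 'Jones', 'jones@cs',   18, 3.4),
        ('53688', 'Smith', 'smith@eecs', 18, 3.2),
        ('53650', 'Smith', 'smith@math', 19, 3.8);
    INSERT INTO Enrolled VALUES
        ('53831', 'Carnatic101', 'C'),
        ('53831', 'Reggae203',   'B'),
        ('53650', 'Topology112', 'A'),
        ('53666', 'History105',  'B');
""")

# Names and logins of all 18-year-old students.
eighteen = conn.execute(
    "SELECT S.name, S.login FROM Students S "
    "WHERE S.age = 18 ORDER BY S.sid").fetchall()

# Names of students together with the courses in which they earned an A.
top = conn.execute(
    "SELECT S.name, E.cid FROM Students S, Enrolled E "
    "WHERE S.sid = E.sid AND E.grade = 'A'").fetchall()
```

Note that sid 53831 appears in Enrolled but not in Students; nothing rejects it here because no foreign key has been declared.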
Adding and Deleting Tuples
Can insert a single tuple using:
INSERT INTO Students (sid, name, login, age, gpa)
VALUES (53688, 'Smith', 'smith@ee', 18, 3.2)
Can delete all tuples satisfying some condition (e.g., name = 'Smith'):
DELETE
FROM Students S
WHERE S.name = 'Smith'
Integrity Constraints (ICs)
IC: condition that must be true for any instance of the database; e.g., domain constraints.
ICs are specified when schema is defined.
ICs are checked when relations are modified.
A legal instance of a relation is one that satisfies all specified ICs.
DBMS should not allow illegal instances.
If the DBMS checks ICs, stored data is more faithful to real-world meaning.
Avoids data entry errors, too!
Primary Key Constraints
A set of fields is a key for a relation if :
1. No two distinct tuples can have same values in all key fields, and
2. This is not true for any subset of the key.
Part 2 false? A superkey.
If there is more than one key for a relation, one of the keys is chosen (by the DBA) to be the
primary key.
E.g., sid is a key for Students. (What about name?) The set {sid, gpa} is a superkey.
Primary and Candidate Keys in SQL
Possibly many candidate keys (specified using UNIQUE), one of which is chosen as the
primary key.
"For a given student and course, there is a single grade." vs. "Students can take only one
course, and receive a single grade for that course; further, no two students in a course
receive the same grade."
Used carelessly, an IC can prevent the storage of database instances that arise in practice!
CREATE TABLE Enrolled
(sid CHAR(20),
cid CHAR(20),
grade CHAR(2),
PRIMARY KEY (sid,cid) )
CREATE TABLE Enrolled
(sid CHAR(20),
cid CHAR(20),
grade CHAR(2),
PRIMARY KEY (sid),
UNIQUE (cid, grade) )
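To see how a careless constraint rejects realistic data, the sketch below loads the same three enrollments into both declarations. It assumes Python's built-in sqlite3 module, and it assumes the second declaration makes sid the primary key in addition to UNIQUE (cid, grade), as its description implies. The table names Enrolled_a and Enrolled_b and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Enrolled_a (sid CHAR(20), cid CHAR(20), grade CHAR(2),
                             PRIMARY KEY (sid, cid));
    CREATE TABLE Enrolled_b (sid CHAR(20), cid CHAR(20), grade CHAR(2),
                             PRIMARY KEY (sid),
                             UNIQUE (cid, grade));
""")

# One student taking two courses, and two students with the same grade
# in the same course: ordinary situations in practice.
rows = [("53666", "History105", "B"),
        ("53666", "Reggae203", "B"),
        ("53650", "History105", "B")]

def count_accepted(table):
    """Insert each row and count how many the constraints let through."""
    accepted = 0
    for row in rows:
        try:
            conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", row)
            accepted += 1
        except sqlite3.IntegrityError:
            pass  # rejected by a key or uniqueness constraint
    return accepted

accepted_a = count_accepted("Enrolled_a")
accepted_b = count_accepted("Enrolled_b")
```

Under these assumptions the first declaration accepts all three rows, while the second rejects the second row (same sid) and the third (same cid and grade).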
Foreign Keys, Referential Integrity
Foreign key: Set of fields in one relation that is used to "refer" to a tuple in another relation.
(Must correspond to primary key of the second relation.) Like a "logical pointer".
E.g. sid is a foreign key referring to Students:
Enrolled(sid: string, cid: string, grade: string)
If all foreign key constraints are enforced, referential integrity is achieved, i.e., no
dangling references.
Can you name a data model w/o referential integrity?
Links in HTML!
Foreign Keys in SQL
Only students listed in the Students relation should be allowed to enroll for courses.
CREATE TABLE Enrolled
(sid CHAR(20), cid CHAR(20), grade CHAR(2),
PRIMARY KEY (sid,cid),
FOREIGN KEY (sid) REFERENCES Students )
Enforcing Referential Integrity
Consider Students and Enrolled; sid in Enrolled is a foreign key that references Students.
What should be done if an Enrolled tuple with a non-existent student id is inserted? (Reject it!)
What should be done if a Students tuple is deleted?
Also delete all Enrolled tuples that refer to it.
Disallow deletion of a Students tuple that is referred to.
Set sid in Enrolled tuples that refer to it to a default sid.
(In SQL, also: Set sid in Enrolled tuples that refer to it to a special value null,
denoting "unknown" or "inapplicable".)
Similar if primary key of Students tuple is updated.
Referential Integrity in SQL
Enrolled:
sid cid grade
53666 Carnatic101 C
53666 Reggae203 B
53650 Topology112 A
53666 History105 B
Students:
sid name login age gpa
53666 Jones jones@cs 18 3.4
53688 Smith smith@eecs 18 3.2
53650 Smith smith@math 19 3.8
SQL/92 and SQL:1999 support all 4 options on deletes and updates.
Default is NO ACTION (delete/update is rejected)
CASCADE (also delete all tuples that refer to deleted tuple)
SET NULL / SET DEFAULT (sets foreign key value of referencing tuple)
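CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2),
PRIMARY KEY (sid,cid),
FOREIGN KEY (sid) REFERENCES Students
ON DELETE CASCADE
ON UPDATE SET DEFAULT )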
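The effect of these options can be observed directly. The sketch below is a minimal illustration, assuming Python's built-in sqlite3 module (SQLite enforces foreign keys only after PRAGMA foreign_keys = ON); the reduced Students schema and the choice of ON DELETE CASCADE are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE Students (sid CHAR(20) PRIMARY KEY, name CHAR(20));
    CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2),
        PRIMARY KEY (sid, cid),
        FOREIGN KEY (sid) REFERENCES Students ON DELETE CASCADE);
    INSERT INTO Students VALUES ('53666', 'Jones'), ('53650', 'Smith');
    INSERT INTO Enrolled VALUES ('53666', 'Carnatic101', 'C'),
                                ('53650', 'Topology112', 'A');
""")

# Inserting an Enrolled tuple with a non-existent student id is rejected.
try:
    conn.execute("INSERT INTO Enrolled VALUES ('99999', 'History105', 'B')")
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

# Deleting a student also deletes the Enrolled tuples that refer to it.
conn.execute("DELETE FROM Students WHERE sid = '53666'")
remaining = conn.execute("SELECT sid, cid FROM Enrolled").fetchall()
```

With the default NO ACTION instead of CASCADE, the DELETE itself would be rejected while Enrolled rows still refer to the student.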
Where do ICs Come From?
ICs are based upon the semantics of the real-world enterprise that is being described in the
database relations.
We can check a database instance to see if an IC is violated, but we can NEVER infer that
an IC is true by looking at an instance.
An IC is a statement about all possible instances!
From the example instance, we know name is not a key, but the assertion that sid is a key is
given to us.
Key and foreign key ICs are the most common; more general ICs supported too.
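The asymmetry between refuting and confirming a key can be sketched in a few lines. The helper violates_key below is hypothetical, written in plain Python over the Students instance from these notes; it can show that an instance violates a proposed key, but a negative answer says nothing about other instances.

```python
students = [
    {"sid": "53666", "name": "Jones", "login": "jones@cs",   "age": 18, "gpa": 3.4},
    {"sid": "53688", "name": "Smith", "login": "smith@eecs", "age": 18, "gpa": 3.2},
    {"sid": "53650", "name": "Smith", "login": "smith@math", "age": 19, "gpa": 3.8},
]

def violates_key(rows, key):
    """Return True if two rows of this instance agree on every field in key."""
    seen = set()
    for row in rows:
        value = tuple(row[field] for field in key)
        if value in seen:
            return True
        seen.add(value)
    return False

# Two students are named Smith, so this instance proves name is not a key.
name_violated = violates_key(students, ["name"])

# No violation for sid here; that is consistent with sid being a key,
# but it does not prove it, since an IC is about all possible instances.
sid_violated = violates_key(students, ["sid"])
```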
Logical DB Design: ER to Relational
CREATE TABLE Employees (ssn CHAR(11), name CHAR(20), lot INTEGER,
PRIMARY KEY (ssn))
Relationship Sets to Tables
In translating a relationship set to a relation, attributes of the relation must include:
Keys for each participating entity set (as foreign keys).
This set of attributes forms a superkey for the relation.
All descriptive attributes.
... PRIMARY KEY (ssn, did), FOREIGN KEY (ssn) REFERENCES Employees, FOREIGN KEY (did)
REFERENCES Departments)
Review: Key Constraints
Each dept has at most one manager, according to the key constraint on Manages.
Translating ER Diagrams with Key Constraints
Map relationship to a table:
Note that did is the key now!
Separate tables for Employees and Departments.
Since each department has a unique manager, we could instead combine Manages and
Departments:
CREATE TABLE Dept_Mgr( did INTEGER, dname CHAR(20), budget REAL,
ssn CHAR(11),
since DATE,
PRIMARY KEY (did), FOREIGN KEY (ssn) REFERENCES Employees)
Review: Participation Constraints
Does every department have a manager?
If so, this is a participation constraint: the participation of Departments in Manages
is said to be total (vs. partial).
Every did value in Departments table must appear in a row of the Manages
table (with a non-null ssn value!)
Participation Constraints in SQL
We can capture participation constraints involving one entity set in a binary relationship, but
little else (without resorting to CHECK constraints).
CREATE TABLE Dept_Mgr( did INTEGER, dname CHAR(20), budget REAL, ssn
CHAR(11) NOT NULL, since DATE, PRIMARY KEY (did), FOREIGN KEY (ssn) REFERENCES Employees)
Review: Weak Entities
A weak entity can be identified uniquely only by considering the primary key of another
(owner) entity.
Owner entity set and weak entity set must participate in a one-to-many relationship
set (1 owner, many weak entities).
Weak entity set must have total participation in this identifying relationship set.
Translating Weak Entity Sets
Weak entity set and identifying relationship set are translated into a single table.
When the owner entity is deleted, all owned weak entities must also be deleted.
CREATE TABLE Dep_Policy ( pname CHAR(20), age INTEGER, cost REAL,
ssn CHAR(11) NOT NULL, PRIMARY KEY (pname, ssn), FOREIGN KEY (ssn)
REFERENCES Employees ON DELETE CASCADE)
Review: ISA Hierarchies
As in C++, or other PLs, attributes are inherited.
If we declare A ISA B, every A entity is also considered to be a B entity.
Overlap constraints: Can Joe be an Hourly_Emps as well as a Contract_Emps entity?
Covering constraints: Does every Employees entity also have to be an Hourly_Emps or a
Contract_Emps entity? (Yes/no)
Translating ISA Hierarchies to Relations
General approach:
3 relations: Employees, Hourly_Emps and Contract_Emps.
Hourly_Emps: Every employee is recorded in Employees. For hourly emps,
extra info recorded in Hourly_Emps (hourly_wages, hours_worked, ssn);
must delete Hourly_Emps tuple if referenced Employees tuple is deleted.
Queries involving all employees easy, those involving just Hourly_Emps
require a join to get some attributes.
Alternative: Just Hourly_Emps and Contract_Emps.
Hourly_Emps: ssn, name, lot, hourly_wages, hours_worked.
Each employee must be in one of these two subclasses.
Review: Binary vs. Ternary Relationships
What are the additional constraints in the 2nd diagram?
The key constraints allow us to combine Purchaser with Policies and Beneficiary with
Dependents.
Participation constraints lead to NOT NULL constraints.
What if Policies is a weak entity set?
CREATE TABLE Policies ( policyid INTEGER, cost REAL, ssn CHAR(11) NOT NULL,
PRIMARY KEY (policyid), FOREIGN KEY (ssn) REFERENCES Employees
ON DELETE CASCADE)
CREATE TABLE Dependents ( pname CHAR(20), age INTEGER, policyid INTEGER,
PRIMARY KEY (pname, policyid), FOREIGN KEY (policyid) REFERENCES Policies
ON DELETE CASCADE)
A view is just a relation, but we store a definition, rather than a set of tuples.
CREATE VIEW YoungActiveStudents (name, grade)
AS SELECT S.name, E.grade
FROM Students S, Enrolled E
WHERE S.sid = E.sid and S.age<21
Views can be dropped using the DROP VIEW command.
How to handle DROP TABLE if there is a view on the table?
DROP TABLE command has options to let the user specify this.
Views and Security
Views can be used to present necessary information (or a summary), while hiding details in
underlying relation(s).
Given YoungActiveStudents, but not Students or Enrolled, we can find students who
are enrolled, but not the cids of the courses they are enrolled in.
View Definition
A relation that is not part of the conceptual model but is made visible to a user as a virtual
relation is called a view.
A view is defined using the create view statement which has the form
create view v as < query expression >
where <query expression> is any legal SQL expression. The view name is represented by v.
Once a view is defined, the view name can be used to refer to the virtual relation that the
view generates.
Example Queries
A view consisting of branches and their customers
create view all_customer as
(select branch_name, customer_name
from depositor, account
where depositor.account_number =
account.account_number )
union (select branch_name, customer_name
from borrower, loan
where borrower.loan_number = loan.loan_number )
Find all customers of the Perryridge branch
select customer_name
from all_customer
where branch_name = 'Perryridge'
Uses of Views
Hiding some information from some users
Consider a user who needs to know a customer's name, loan number and branch
name, but has no need to see the loan amount.
Define a view
create view cust_loan_data as
select customer_name, borrower.loan_number, branch_name
from borrower, loan
where borrower.loan_number = loan.loan_number
Grant the user permission to read cust_loan_data, but not borrower or loan
Predefined queries to make writing of other queries easier
Common example: Aggregate queries used for statistical analysis of data
Processing of Views
When a view is created
the query expression is stored in the database along with the view name
the expression is substituted into any query using the view
View definitions containing views
One view may be used in the expression defining another view
A view relation $v_1$ is said to depend directly on a view relation $v_2$ if $v_2$ is used in the
expression defining $v_1$
A view relation $v_1$ is said to depend on view relation $v_2$ if either $v_1$ depends directly on $v_2$
or there is a path of dependencies from $v_1$ to $v_2$
A view relation v is said to be recursive if it depends on itself.
View Expansion
A way to define the meaning of views defined in terms of other views.
Let view $v_1$ be defined by an expression $e_1$ that may itself contain uses of view relations.
View expansion of an expression repeats the following replacement step:
repeat
Find any view relation $v_i$ in $e_1$
Replace the view relation $v_i$ by the expression defining $v_i$
until no more view relations are present in $e_1$
As long as the view definitions are not recursive, this loop will terminate
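The replacement loop can be sketched as plain text substitution. This is a toy illustration in Python, not how a real DBMS expands views: it assumes view definitions are stored as strings in a dictionary, that view names never collide with other identifiers, and that the definitions are not recursive. The names expand, views, v1, v2 and r are hypothetical.

```python
import re

def expand(expression, views):
    """Repeatedly replace each view name by its parenthesized definition."""
    if not views:
        return expression
    # Match any view name as a whole word.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, views)) + r")\b")
    while True:
        match = pattern.search(expression)
        if match is None:          # no more view relations are present
            return expression
        definition = views[match.group(1)]
        expression = (expression[:match.start()]
                      + "(" + definition + ")"
                      + expression[match.end():])

# v2 is defined in terms of v1, which is defined over the base relation r.
views = {
    "v1": "select a, b from r",
    "v2": "select a from v1",
}
expanded = expand("select * from v2", views)
```

If a definition referred to itself, the loop would keep finding a view name and never stop, which is why recursion is excluded above.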
With Clause
The with clause provides a way of defining a temporary view whose definition is available
only to the query in which the with clause occurs.
Find all accounts with the maximum balance
with max_balance (value) as
(select max (balance)
from account)
select account_number
from account, max_balance
where account.balance = max_balance.value
Complex Queries using With Clause
Find all branches where the total account deposit is greater than the average of the total account
deposits at all branches.
with branch_total (branch_name, value) as
(select branch_name, sum (balance)
from account
group by branch_name),
branch_total_avg (value) as
(select avg (value)
from branch_total)
select branch_name
from branch_total, branch_total_avg
where branch_total.value > branch_total_avg.value
Note: the exact syntax supported by your database may vary slightly.
E.g. Oracle syntax is of the form
with branch_total as ( select .. ),
branch_total_avg as ( select .. )
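Both with-clause queries can be run as a quick check. The sketch below assumes Python's built-in sqlite3 module, whose SQL dialect accepts a single WITH followed by comma-separated definitions; the account rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_number TEXT, branch_name TEXT, balance INTEGER);
    -- Hypothetical sample data
    INSERT INTO account VALUES
        ('A-101', 'Downtown',   500),
        ('A-102', 'Perryridge', 400),
        ('A-201', 'Perryridge', 900),
        ('A-215', 'Mianus',     700);
""")

# Accounts holding the maximum balance.
richest = conn.execute("""
    WITH max_balance (value) AS
        (SELECT MAX(balance) FROM account)
    SELECT account_number
    FROM account, max_balance
    WHERE account.balance = max_balance.value
""").fetchall()

# Branches whose total deposits exceed the average branch total.
above_avg = conn.execute("""
    WITH branch_total (branch_name, value) AS
        (SELECT branch_name, SUM(balance) FROM account GROUP BY branch_name),
    branch_total_avg (value) AS
        (SELECT AVG(value) FROM branch_total)
    SELECT branch_name
    FROM branch_total, branch_total_avg
    WHERE branch_total.value > branch_total_avg.value
""").fetchall()
```

With this data the branch totals are 500, 1300 and 700, so only Perryridge lies above the average.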
Update of a View
Create a view of all loan data in the loan relation, hiding the amount attribute
create view loan_branch as
select loan_number, branch_name
from loan
Add a new tuple to loan_branch
insert into loan_branch
values ('L-37', 'Perryridge')
This insertion must be represented by the insertion of the tuple
('L-37', 'Perryridge', null )
into the loan relation
Formal Relational Query Languages
Two mathematical query languages form the basis for real languages (e.g., SQL), and for
implementation:
Relational Algebra : More operational, very useful for representing execution plans.
Relational Calculus : Lets users describe what they want, rather than how to
compute it. (Non-operational, declarative.)
A query is applied to relation instances, and the result of a query is also a relation instance.
Schemas of input relations for a query are fixed (but the query will run regardless of the instance).
The schema for the result of a given query is also fixed! Determined by definition of
query language constructs.
Positional vs. named-field notation:
Positional notation easier for formal definitions, named-field notation more readable.
Both used in SQL
Example Instances
Sailors and Reserves relations for our examples.
We'll use positional or named-field notation, and assume that names of fields in query results
are "inherited" from names of fields in query input relations.
Sailors, instance S1:
sid  sname   rating  age
22   dustin  7       45.0
31   lubber  8       55.5
58   rusty   10      35.0
Sailors, instance S2:
sid  sname   rating  age
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0
Reserves, instance R1:
sid  bid  day
22   101  10/10/96
58   103  11/12/96
Relational Algebra
Basic operations:
Selection ($\sigma$): Selects a subset of rows from relation.
Projection ($\pi$): Deletes unwanted columns from relation.
Cross-product ($\times$): Allows us to combine two relations.
Set-difference ($-$): Tuples in reln. 1, but not in reln. 2.
Union ($\cup$): Tuples in reln. 1 or in reln. 2 (or both).
Additional operations:
Intersection, join, division, renaming: Not essential, but (very!) useful.
Since each operation returns a relation, operations can be composed! (Algebra is closed.)
Projection
Deletes attributes that are not in projection list.
Schema of result contains exactly the fields in the projection list, with the same names that
they had in the (only) input relation.
Projection operator has to eliminate duplicates! (Why??)
Note: real systems typically don't do duplicate elimination unless the user explicitly
asks for it. (Why not?)
Example: $\pi_{sname, rating}(S2)$
sname   rating
yuppy   9
lubber  8
guppy   5
rusty   10
Selection
Selects rows that satisfy selection condition.
No duplicates in result! (Why?)
Schema of result identical to schema of (only) input relation.
Result relation can be the input for another relational algebra operation! (Operator
composition.)
Find names of sailors who've reserved boat #103
Solution 1:
$\pi_{sname}((\sigma_{bid=103}\,Reserves) \bowtie Sailors)$
Solution 2:
$\rho(Temp1, \sigma_{bid=103}\,Reserves)$
$\rho(Temp2, Temp1 \bowtie Sailors)$
$\pi_{sname}(Temp2)$
Solution 3:
$\pi_{sname}(\sigma_{bid=103}(Reserves \bowtie Sailors))$
Find names of sailors who've reserved a red boat
Information about boat color only available in Boats; so need an extra join:
$\pi_{sname}((\sigma_{color='red'}\,Boats) \bowtie Reserves \bowtie Sailors)$
A more efficient solution:
$\pi_{sname}(\pi_{sid}((\pi_{bid}\,\sigma_{color='red'}\,Boats) \bowtie Reserves) \bowtie Sailors)$
A query optimizer can find this, given the first solution!
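As a rough illustration of how these operators compose, here is a minimal Python sketch of selection, projection and natural join over relations stored as lists of dicts. The helper names (`select`, `project`, `natural_join`) are invented for this sketch. It evaluates two equivalent formulations of the boat #103 query, one selecting before the join and one after, on the small Sailors and Reserves instances.

```python
# Relations in named-field notation: each tuple is a dict.
sailors = [
    {"sid": 22, "sname": "dustin", "rating": 7, "age": 45.0},
    {"sid": 31, "sname": "lubber", "rating": 8, "age": 55.5},
    {"sid": 58, "sname": "rusty", "rating": 10, "age": 35.0},
]
reserves = [
    {"sid": 22, "bid": 101, "day": "10/10/96"},
    {"sid": 58, "bid": 103, "day": "11/12/96"},
]

def select(rel, pred):
    """Selection: keep the rows that satisfy the predicate."""
    return [t for t in rel if pred(t)]

def project(rel, fields):
    """Projection: keep only the listed fields and eliminate duplicates."""
    out = []
    for t in rel:
        row = {f: t[f] for f in fields}
        if row not in out:
            out.append(row)
    return out

def natural_join(r, s):
    """Natural join: pair tuples that agree on all shared field names."""
    out = []
    for t in r:
        for u in s:
            shared = set(t) & set(u)
            if all(t[f] == u[f] for f in shared):
                out.append({**t, **u})
    return out

# Select first, then join.
early = project(
    natural_join(select(reserves, lambda t: t["bid"] == 103), sailors), ["sname"])
# Join first, then select.
late = project(
    select(natural_join(reserves, sailors), lambda t: t["bid"] == 103), ["sname"])

print(early, late)
```

Both formulations give the same answer; the first builds a smaller intermediate result because the selection runs before the join.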
Find sailors who've reserved a red or a green boat
Can identify all red or green boats, then find sailors who've reserved one of these boats:
$\rho(Tempboats, \sigma_{color='red' \lor color='green'}\,Boats)$
$\pi_{sname}(Tempboats \bowtie Reserves \bowtie Sailors)$
Can also define Tempboats using union! (How?)
What happens if $\lor$ is replaced by $\land$ in this query?
Find sailors who've reserved a red and a green boat
Previous approach won't work! Must identify sailors who've reserved red boats, sailors
who've reserved green boats, then find the intersection (note that sid is a key for Sailors):
$\rho(Tempred, \pi_{sid}((\sigma_{color='red'}\,Boats) \bowtie Reserves))$
$\rho(Tempgreen, \pi_{sid}((\sigma_{color='green'}\,Boats) \bowtie Reserves))$
$\pi_{sname}((Tempred \cap Tempgreen) \bowtie Sailors)$
Relational Calculus
Comes in two flavors: Tuple relational calculus (TRC) and Domain relational calculus (DRC).
Calculus has variables, constants, comparison ops, logical connectives and quantifiers.
TRC : Variables range over (i.e., get bound to) tuples.
DRC : Variables range over domain elements (= field values).
Both TRC and DRC are simple subsets of first-order logic.
Expressions in the calculus are called formulas. An answer tuple is essentially an
assignment of constants to variables that make the formula evaluate to true.
Domain Relational Calculus
Query has the form:
$\{\langle x_1, x_2, \ldots, x_n \rangle \mid p(\langle x_1, x_2, \ldots, x_n \rangle)\}$
Answer includes all tuples $\langle x_1, x_2, \ldots, x_n \rangle$ that make the formula be true.
Formula is recursively defined, starting with simple atomic formulas (getting tuples from
relations or making comparisons of values), and building bigger and better formulas using
the logical connectives.
DRC Formulas
Atomic formula:
$\langle x_1, x_2, \ldots, x_n \rangle \in Rname$, or X op Y, or X op constant
op is one of $<, >, =, \le, \ge, \ne$
Formula:
an atomic formula, or
$\neg p$, $p \land q$, $p \lor q$, where p and q are formulas, or
$\exists X\,(p(X))$, where variable X is free in p(X), or
$\forall X\,(p(X))$, where variable X is free in p(X)
Free and Bound Variables
The use of quantifiers $\exists X$ and $\forall X$ in a formula is said to bind X.
A variable that is not bound is free.
Let us revisit the definition of a query:
$\{\langle x_1, x_2, \ldots, x_n \rangle \mid p(\langle x_1, x_2, \ldots, x_n \rangle)\}$
There is an important restriction: the variables x1, ..., xn that appear to the left of '|' must be
the only free variables in the formula p(...).
Find all sailors with a rating above 7
$\{\langle I, N, T, A \rangle \mid \langle I, N, T, A \rangle \in Sailors \land T > 7\}$
The condition $\langle I, N, T, A \rangle \in Sailors$ ensures that the domain variables I, N, T and A are bound to fields of the
same Sailors tuple.
The term $\langle I, N, T, A \rangle$ to the left of '|' (which should be read as "such that") says that every tuple
$\langle I, N, T, A \rangle$ that satisfies T>7 is in the answer.
Modify this query to answer:
Find sailors who are older than 18 or have a rating under 9, and are called 'Joe'.
Find sailors rated > 7 who have reserved boat #103
$\{\langle I, N, T, A \rangle \mid \langle I, N, T, A \rangle \in Sailors \land T > 7 \land \exists Ir, Br, D\,(\langle Ir, Br, D \rangle \in Reserves \land Ir = I \land Br = 103)\}$
We have used $\exists Ir, Br, D\,(\ldots)$ as a shorthand for $\exists Ir\,(\exists Br\,(\exists D\,(\ldots)))$
Note the use of $\exists$ to find a tuple in Reserves that 'joins with' the Sailors tuple under
consideration.
Find sailors rated > 7 who've reserved a red boat
$\{\langle I, N, T, A \rangle \mid \langle I, N, T, A \rangle \in Sailors \land T > 7 \land \exists Ir, Br, D\,(\langle Ir, Br, D \rangle \in Reserves \land Ir = I \land \exists B, BN, C\,(\langle B, BN, C \rangle \in Boats \land B = Br \land C = 'red'))\}$
Observe how the parentheses control the scope of each quantifier's binding.
This may look cumbersome, but with a good user interface, it is very intuitive. (MS Access, QBE)
Find sailors who've reserved all boats
$\{\langle I, N, T, A \rangle \mid \langle I, N, T, A \rangle \in Sailors \land \forall B, BN, C\,(\neg(\langle B, BN, C \rangle \in Boats) \lor \exists Ir, Br, D\,(\langle Ir, Br, D \rangle \in Reserves \land I = Ir \land Br = B))\}$
Find all sailors I such that for each 3-tuple $\langle B, BN, C \rangle$ either it is not a tuple in Boats or
there is a tuple in Reserves showing that sailor I has reserved it.
Find sailors who've reserved all boats (again!)
$\{\langle I, N, T, A \rangle \mid \langle I, N, T, A \rangle \in Sailors \land \forall \langle B, BN, C \rangle \in Boats\,(\exists \langle Ir, Br, D \rangle \in Reserves\,(I = Ir \land Br = B))\}$
Simpler notation, same query. (Much clearer!)
To find sailors who've reserved all red boats:
$\ldots\ \forall \langle B, BN, C \rangle \in Boats\,(C \ne 'red' \lor \exists \langle Ir, Br, D \rangle \in Reserves\,(I = Ir \land Br = B))$
Unsafe Queries, Expressive Power
It is possible to write syntactically correct calculus queries that have an infinite number of
answers! Such queries are called unsafe.
e.g., $\{S \mid \neg(S \in Sailors)\}$
It is known that every query that can be expressed in relational algebra can be expressed as a
safe query in DRC / TRC; the converse is also true.
Relational Completeness : Query language (e.g., SQL) can express every query that is
expressible in relational algebra/calculus.
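The declarative flavour of DRC can be imitated, as a loose sketch, with Python set comprehensions: the tuple pattern plays the role of the domain variables, `any` stands in for the existential quantifier and `all` for the universal one. The Boats rows and the helper name `reserved_all` are assumptions made for this illustration; the Sailors and Reserves rows follow the example instances.

```python
sailors = {(28, "yuppy", 9, 35.0), (31, "lubber", 8, 55.5),
           (44, "guppy", 5, 35.0), (58, "rusty", 10, 35.0)}
reserves = {(22, 101, "10/10/96"), (58, 103, "11/12/96")}
# Hypothetical Boats instance (bid, bname, color); not given in these notes.
boats = {(101, "Interlake", "blue"), (103, "Clipper", "green")}

# Sailors with a rating above 7: the tuple must be in Sailors and T > 7.
rated_above_7 = {(I, N, T, A) for (I, N, T, A) in sailors if T > 7}

# Rated > 7 and reserved boat #103: "exists a Reserves tuple" becomes any(...).
rated_and_103 = {
    (I, N, T, A) for (I, N, T, A) in sailors
    if T > 7 and any(Ir == I and Br == 103 for (Ir, Br, D) in reserves)
}

def reserved_all(boat_set):
    """Sailors who reserved every boat in boat_set: 'for all' becomes all(...)."""
    return {
        (I, N, T, A) for (I, N, T, A) in sailors
        if all(any(Ir == I and Br == B for (Ir, Br, D) in reserves)
               for (B, BN, C) in boat_set)
    }

print(sorted(n for (_, n, _, _) in rated_above_7))
print(rated_and_103)
print(reserved_all(boats))
```

Because every variable here ranges over a finite stored relation, these comprehensions correspond to safe queries only; an unsafe query such as "all S not in Sailors" has no finite enumeration to loop over.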
1. The Form of a Basic SQL Query
2. Query operations & NESTED Queries
3. NESTED Queries
4. Aggregate Operators
5. Null Values
6. Complex Integrity Constraints in SQL-92
7. Triggers and Active Databases
8. Designing Active Databases
IBM Sequel language developed as part of System R project at the IBM San Jose Research
Laboratory
Renamed Structured Query Language (SQL)
ANSI and ISO standard SQL:
SQL:1999 (language name became Y2K compliant!)
Commercial systems offer most, if not all, SQL-92 features, plus varying feature sets from
later standards and special proprietary features.
Not all examples here may work on your particular system.
Data Definition Language
Allows the specification of:
The schema for each relation, including attribute types.
Integrity constraints
Authorization information for each relation.
Non-standard SQL extensions also allow specification of
The set of indices to be maintained for each relations.
The physical storage structure of each relation on disk.
Create Table Construct
An SQL relation is defined using the create table command:
create table r (A1 D1, A2 D2, ..., An Dn)
r is the name of the relation
each Ai is an attribute name in the schema of relation r
Di is the data type of attribute Ai
create table branch
(branch_name char(15),
branch_city char(30),
assets integer)
Domain Types in SQL
char(n). Fixed length character string, with user-specified length n.
varchar(n). Variable length character strings, with user-specified maximum length n.
int. Integer (a finite subset of the integers that is machine-dependent).
smallint. Small integer (a machine-dependent subset of the integer domain type).
numeric(p,d). Fixed point number, with user-specified precision of p digits, with d digits
to the right of decimal point.
real, double precision. Floating point and double-precision floating point numbers, with
machine-dependent precision.
float(n). Floating point number, with user-specified precision of at least n digits.
More are covered in Chapter 4.
Integrity Constraints on Tables
not null
primary key (A1, ..., An)
Example: Declare branch_name as the primary key for branch
create table branch
(branch_name char(15),
branch_city char(30) not null,
assets integer,
primary key (branch_name))
primary key declaration on an attribute automatically ensures not null in SQL-92 onwards,
needs to be explicitly stated in SQL-89
Basic Insertion and Deletion of Tuples
Newly created table is empty
Add a new tuple to account
insert into account
values ('A-9732', 'Perryridge', 1200)
Insertion fails if any integrity constraint is violated
Delete all tuples from account
delete from account
Note: Will see later how to delete selected tuples
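The following sketch runs the branch definition from this section in an in-memory SQLite database through Python's sqlite3 module and shows insertions being rejected when a constraint is violated. The city names, asset values and the helper `try_insert` are invented for the illustration, and SQLite's handling of types and keys differs in some details from the SQL standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table branch
        (branch_name char(15),
         branch_city char(30) not null,
         assets integer,
         primary key (branch_name))
""")
conn.execute("insert into branch values ('Perryridge', 'Horseneck', 1700000)")

def try_insert(row):
    """Return True if the tuple was inserted, False if a constraint rejected it."""
    try:
        conn.execute("insert into branch values (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_ok = try_insert(("Perryridge", "Brooklyn", 900000))   # repeats the primary key
null_city_ok = try_insert(("Downtown", None, 900000))           # violates not null
new_branch_ok = try_insert(("Downtown", "Brooklyn", 900000))    # satisfies both constraints

conn.execute("delete from branch")                               # delete all tuples
remaining = conn.execute("select count(*) from branch").fetchone()[0]
print(duplicate_ok, null_city_ok, new_branch_ok, remaining)
```

After the delete, the table still exists with its schema and constraints; only its tuples are gone.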
Drop and Alter Table Constructs
The drop table command deletes all information about the dropped relation from the
database.
The alter table command is used to add attributes to an existing relation:
alter table r add A D
where A is the name of the attribute to be added to relation r and D is the domain of A.
All tuples in the relation are assigned null as the value for the new attribute.
The alter table command can also be used to drop attributes of a relation:
alter table r drop A
where A is the name of an attribute of relation r
Dropping of attributes not supported by many databases
Basic Query Structure
A typical SQL query has the form:
select A1, A2, ..., An
from r1, r2, ..., rm
where P
Ai represents an attribute
ri represents a relation
P is a predicate.
This query is equivalent to the relational algebra expression
$\pi_{A_1, A_2, \ldots, A_n}(\sigma_P(r_1 \times r_2 \times \cdots \times r_m))$
The result of an SQL query is a relation.
The select Clause
The select clause lists the attributes desired in the result of a query
corresponds to the projection operation of the relational algebra
Example: find the names of all branches in the loan relation:
select branch_name
from loan
In the relational algebra, the query would be:
$\pi_{branch\_name}(loan)$
NOTE: SQL names are case insensitive (i.e., you may use upper- or lower-case letters.)
E.g. Branch_Name $\equiv$ BRANCH_NAME $\equiv$ branch_name
Some people use upper case wherever we use bold font.
SQL allows duplicates in relations as well as in query results.
To force the elimination of duplicates, insert the keyword distinct after select.
Find the names of all branches in the loan relations, and remove duplicates
select distinct branch_name
from loan
The keyword all specifies that duplicates not be removed.
select all branch_name
from loan
An asterisk in the select clause denotes all attributes
select *
from loan
The select clause can contain arithmetic expressions involving the operations +, -, *, and /,
and operating on constants or attributes of tuples.
select loan_number, branch_name, amount * 100
from loan
The where Clause
The where clause specifies conditions that the result must satisfy
Corresponds to the selection predicate of the relational algebra.
Find all loan numbers for loans made at the Perryridge branch with loan amounts greater
than $1200.
select loan_number
from loan
where branch_name = 'Perryridge' and amount > 1200
Comparison results can be combined using the logical connectives and, or, and not.
The from Clause
The from clause lists the relations involved in the query
Corresponds to the Cartesian product operation of the relational algebra.
Find the Cartesian product borrower × loan
select *
from borrower, loan
Find the name, loan number and loan amount of all customers
having a loan at the Perryridge branch.
select customer_name, borrower.loan_number, amount
from borrower, loan
where borrower.loan_number = loan.loan_number and
branch_name = 'Perryridge'
The Rename Operation
SQL allows renaming relations and attributes using the as clause:
old-name as new-name
E.g. Find the name, loan number and loan amount of all customers; rename the column name
loan_number as loan_id.
select customer_name, borrower.loan_number as loan_id, amount
from borrower, loan
where borrower.loan_number = loan.loan_number
Tuple Variables
Tuple variables are defined in the from clause via the use of the as clause.
Find the customer names and their loan numbers and amount for all customers having a loan at
some branch.
select customer_name, T.loan_number, S.amount
from borrower as T, loan as S
where T.loan_number = S.loan_number
Find the names of all branches that have greater assets than
some branch located in Brooklyn.
select distinct T.branch_name
from branch as T, branch as S
where T.assets > S.assets and S.branch_city = 'Brooklyn'
Keyword as is optional and may be omitted
borrower as T $\equiv$ borrower T
Some databases, such as Oracle, require as to be omitted
Example Instances
We will use these instances of the Sailors and Reserves relations in our examples.
If the key for the Reserves relation contained only the attributes sid and bid, how would the
semantics differ?
Basic SQL Query
SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification
relation-list A list of relation names (possibly with a range-variable after each name).
target-list A list of attributes of relations in relation-list
qualification Comparisons (Attr op const or Attr1 op Attr2, where op is one of
$<, >, =, \le, \ge, \ne$) combined using AND, OR and NOT.
DISTINCT is an optional keyword indicating that the answer should not contain duplicates.
Default is that duplicates are not eliminated!
Conceptual Evaluation Strategy
Semantics of an SQL query defined in terms of the following conceptual evaluation
strategy:
Compute the cross-product of relation-list.
Discard resulting tuples if they fail qualifications.
Delete attributes that are not in target-list.
If DISTINCT is specified, eliminate duplicate rows.
This strategy is probably the least efficient way to compute a query! An optimizer will find
more efficient strategies to compute the same answers.
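The four conceptual steps can be written down almost literally in Python. This is only a sketch of the semantics, not of how a real engine evaluates queries; the function name `evaluate` and its parameters are invented here, and the data are the small Sailors and Reserves instances used in these notes.

```python
from itertools import product

sailors = [(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5), (58, "rusty", 10, 35.0)]
reserves = [(22, 101, "10/10/96"), (58, 103, "11/12/96")]

def evaluate(relations, qualification, target, distinct=False):
    rows = list(product(*relations))                 # 1. cross-product of relation-list
    rows = [r for r in rows if qualification(*r)]    # 2. discard tuples failing the qualification
    rows = [target(*r) for r in rows]                # 3. keep only the target-list attributes
    if distinct:
        rows = list(dict.fromkeys(rows))             # 4. eliminate duplicate rows
    return rows

# SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND R.bid=103
names = evaluate(
    [sailors, reserves],
    qualification=lambda s, r: s[0] == r[0] and r[1] == 103,
    target=lambda s, r: (s[1],),
)
cross_size = len(list(product(sailors, reserves)))
print(names, cross_size)
```

The cross-product alone already has 3 × 2 = 6 rows for these tiny instances, which hints at why an optimizer avoids materializing it.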
Example of Conceptual Evaluation
SELECT S.sname
FROM Sailors S, Reserves R
WHERE S.sid=R.sid AND R.bid=103
Cross-product of Sailors and Reserves:
(sid)  sname   rating  age   (sid)  bid  day
22     dustin  7       45.0  22     101  10/10/96
22     dustin  7       45.0  58     103  11/12/96
31     lubber  8       55.5  22     101  10/10/96
31     lubber  8       55.5  58     103  11/12/96
58     rusty   10      35.0  22     101  10/10/96
58     rusty   10      35.0  58     103  11/12/96
A Note on Range Variables
Really needed only if the same relation appears twice in the FROM clause. The previous query can
also be written as:
SELECT S.sname
FROM Sailors S, Reserves R
WHERE S.sid=R.sid AND bid=103
OR
SELECT sname
FROM Sailors, Reserves
WHERE Sailors.sid=Reserves.sid
AND bid=103
It is good style, however, to use range variables.
Find sailors who've reserved at least one boat
SELECT S.sid
FROM Sailors S, Reserves R
WHERE S.sid=R.sid
Would adding DISTINCT to this query make a difference?
What is the effect of replacing S.sid by S.sname in the SELECT clause? Would adding
DISTINCT to this variant of the query make a difference?
Expressions and Strings
SELECT S.age, age1=S.age-5, 2*S.age AS age2
FROM Sailors S
WHERE S.sname LIKE 'B_%B'
Illustrates use of arithmetic expressions and string pattern matching: Find triples (of ages of
sailors and two fields defined by expressions) for sailors whose names begin and end with B
and contain at least three characters.
AS and = are two ways to name fields in result.
LIKE is used for string matching. '_' stands for any one character and '%' stands for 0 or
more arbitrary characters.
String Operations
SQL includes a string-matching operator for comparisons on character strings. The operator
like uses patterns that are described using two special characters:
percent (%). The % character matches any substring.
underscore (_). The _ character matches any single character.
Find the names of all customers whose street includes the substring "Main".
select customer_name
from customer
where customer_street like '%Main%'
Match the name "Main%"
like 'Main\%' escape '\'
SQL supports a variety of string operations such as
concatenation (using ||)
converting from upper to lower case (and vice versa)
finding string length, extracting substrings, etc.
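A small sqlite3 sketch of the two like patterns above. The customer rows are invented sample data. One caveat worth knowing: SQLite's like is case-insensitive for ASCII letters by default, which other systems may not share.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table customer (customer_name text, customer_street text)")
# Hypothetical rows; the last street literally contains a percent sign.
conn.executemany(
    "insert into customer values (?, ?)",
    [("Jones", "Main"), ("Smith", "North Main St"),
     ("Hayes", "Park"), ("Curry", "Main%")],
)

# % matches any substring, so this finds every street containing "Main".
with_main = [n for (n,) in conn.execute(
    "select customer_name from customer "
    "where customer_street like '%Main%' order by customer_name")]

# With an escape character, \% stands for a literal percent sign.
literal_percent = [n for (n,) in conn.execute(
    r"select customer_name from customer "
    r"where customer_street like 'Main\%' escape '\'")]

# String concatenation with ||.
labels = [s for (s,) in conn.execute(
    "select customer_name || ': ' || customer_street from customer "
    "where customer_name = 'Hayes'")]

print(with_main, literal_percent, labels)
```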
Ordering the Display of Tuples
List in alphabetic order the names of all customers having a loan in Perryridge branch
select distinct customer_name
from borrower, loan
where borrower.loan_number = loan.loan_number and
branch_name = 'Perryridge'
order by customer_name
We may specify desc for descending order or asc for ascending order, for each attribute;
ascending order is the default.
Example: order by customer_name desc
Duplicates
In relations with duplicates, SQL can define how many copies of tuples appear in the result.
Multiset versions of some of the relational algebra operators, given multiset relations $r_1$
and $r_2$:
1. $\sigma_\theta(r_1)$: If there are $c_1$ copies of tuple $t_1$ in $r_1$, and $t_1$ satisfies selection $\sigma_\theta$, then there are $c_1$ copies of $t_1$ in $\sigma_\theta(r_1)$.
2. $\pi_A(r_1)$: For each copy of tuple $t_1$ in $r_1$, there is a copy of tuple $\pi_A(t_1)$ in $\pi_A(r_1)$, where $\pi_A(t_1)$ denotes the projection of the single tuple $t_1$.
3. $r_1 \times r_2$: If there are $c_1$ copies of tuple $t_1$ in $r_1$ and $c_2$ copies of tuple $t_2$ in $r_2$, there are $c_1 \times c_2$ copies of the tuple $t_1 . t_2$ in $r_1 \times r_2$.
Example: Suppose multiset relations $r_1(A, B)$ and $r_2(C)$ are as follows:
$r_1 = \{(1, a), (2, a)\}$, $r_2 = \{(2), (3), (3)\}$
Then $\pi_B(r_1)$ would be $\{(a), (a)\}$, while $\pi_B(r_1) \times r_2$ would be
$\{(a,2), (a,2), (a,3), (a,3), (a,3), (a,3)\}$
SQL duplicate semantics:
select A1, A2, ..., An
from r1, r2, ..., rm
where P
is equivalent to the multiset version of the expression:
$\pi_{A_1, A_2, \ldots, A_n}(\sigma_P(r_1 \times r_2 \times \cdots \times r_m))$
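The multiset example can be checked with a short sketch that stores each relation as a collections.Counter mapping a tuple to its number of copies. The helper names `project` and `cross` are made up for this illustration.

```python
from collections import Counter
from itertools import product

# r1(A, B) = {(1, a), (2, a)} and r2(C) = {(2), (3), (3)} as tuple -> copy count.
r1 = Counter({(1, "a"): 1, (2, "a"): 1})
r2 = Counter({(2,): 1, (3,): 2})

def project(rel, positions):
    """Multiset projection: every copy of a tuple yields one projected copy."""
    out = Counter()
    for t, copies in rel.items():
        out[tuple(t[i] for i in positions)] += copies
    return out

def cross(r, s):
    """Multiset cross-product: c1 copies times c2 copies of the concatenated tuple."""
    return Counter({t + u: c * d for (t, c), (u, d) in product(r.items(), s.items())})

proj_b = project(r1, [1])          # projection on B
result = cross(proj_b, r2)
print(proj_b, sorted(result.elements()))
```

Calling `result.elements()` expands the counts back into the six individual tuples listed in the example.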
Set Operations
The set operations union, intersect, and except operate on relations and correspond to the
relational algebra operations $\cup$, $\cap$, $-$.
Each of the above operations automatically eliminates duplicates; to retain all duplicates use
the corresponding multiset versions union all, intersect all and except all.
Suppose a tuple occurs m times in r and n times in s, then, it occurs:
m + n times in r union all s
min(m,n) times in r intersect all s
max(0, m - n) times in r except all s
Find all customers who have a loan, an account, or both:
(select customer_name from depositor)
union
(select customer_name from borrower)
Find all customers who have both a loan and an account
(select customer_name from depositor)
intersect
(select customer_name from borrower)
Find all customers who have an account but no loan
(select customer_name from depositor)
except
(select customer_name from borrower)
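The duplicate counts for the all variants line up with the arithmetic that Python's collections.Counter already provides, which makes for a compact check. The names and counts below are invented; this models only the counting rule, not any particular database.

```python
from collections import Counter

# Hypothetical multisets: how many times each customer_name occurs in r and in s.
r = Counter({"Jones": 3, "Smith": 1})
s = Counter({"Jones": 1, "Hayes": 2})

union_all = r + s        # m + n copies
intersect_all = r & s    # min(m, n) copies
except_all = r - s       # max(0, m - n) copies

# The plain set operations eliminate duplicates.
union = set(r) | set(s)
intersect = set(r) & set(s)
except_ = set(r) - set(s)

print(union_all, intersect_all, except_all)
print(union, intersect, except_)
```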
1. Transaction concept & State
2. Implementation of atomicity and durability
3. Serializability
4. Recoverability
5. Implementation of isolation
6. Lock based protocols
7. Lock based protocols
8. Timestamp based protocols
9. Validation based protocol
Transaction Concept
A transaction is a unit of program execution that accesses and possibly updates various
data items.
E.g. transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A - 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
Two main issues to deal with:
Failures of various kinds, such as hardware failures and system crashes
Concurrent execution of multiple transactions
Example of Fund Transfer
Transaction to transfer $50 from account A to account B:
1. read(A)
2. A := A - 50
3. write(A)
4. read(B)
5. B := B + 50
6. write(B)
Atomicity requirement
if the transaction fails after step 3 and before step 6, money will be lost leading to
an inconsistent database state
Failure could be due to software or hardware
the system should ensure that updates of a partially executed transaction are not
reflected in the database
Durability requirement: once the user has been notified that the transaction has
completed (i.e., the transfer of the $50 has taken place), the updates to the database by the
transaction must persist even if there are software or hardware failures.
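The atomicity requirement can be demonstrated with a minimal sqlite3 sketch: if a failure is simulated between the debit and the credit, rolling back leaves no trace of the partial update. The table layout, the `transfer` function and its `fail_midway` flag are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table account (name text primary key, balance integer)")
conn.executemany("insert into account values (?, ?)", [("A", 100), ("B", 200)])
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move amount from src to dst; either both updates happen or neither does."""
    try:
        conn.execute("update account set balance = balance - ? where name = ?", (amount, src))
        if fail_midway:
            raise RuntimeError("simulated failure after the first write")
        conn.execute("update account set balance = balance + ? where name = ?", (amount, dst))
        conn.commit()      # make both updates permanent together
        return True
    except RuntimeError:
        conn.rollback()    # undo the partial update
        return False

def balances(conn):
    return dict(conn.execute("select name, balance from account"))

failed = transfer(conn, "A", "B", 50, fail_midway=True)
after_failure = balances(conn)
succeeded = transfer(conn, "A", "B", 50)
after_success = balances(conn)
print(failed, after_failure, succeeded, after_success)
```

In both outcomes the sum of the two balances stays at 300, which is the consistency condition for this example.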
Consistency requirement in above example:
the sum of A and B is unchanged by the execution of the transaction
In general, consistency requirements include
Explicitly specified integrity constraints such as primary keys and foreign keys
Implicit integrity constraints
e.g. sum of balances of all accounts, minus sum of loan amounts
must equal value of cash-in-hand
A transaction must see a consistent database.
During transaction execution the database may be temporarily inconsistent.
When the transaction completes successfully the database must be consistent
Erroneous transaction logic can lead to inconsistency
Isolation requirement: if between steps 3 and 6, another transaction T2 is allowed to
access the partially updated database, it will see an inconsistent database (the sum A + B
will be less than it should be).
T1                    T2
1. read(A)
2. A := A - 50
3. write(A)
                      read(A), read(B), print(A+B)
4. read(B)
5. B := B + 50
6. write(B)
Isolation can be ensured trivially by running transactions serially
that is, one after the other.
However, executing multiple transactions concurrently has significant benefits, as we will
see later.
ACID Properties
A transaction is a unit of program execution that accesses and possibly updates various data
items. To preserve the integrity of data the database system must ensure:
Atomicity. Either all operations of the transaction are properly reflected in the database or
none are.
Consistency. Execution of a transaction in isolation preserves the consistency of the
database.
Isolation. Although multiple transactions may execute concurrently, each transaction must
be unaware of other concurrently executing transactions. Intermediate transaction results
must be hidden from other concurrently executed transactions.
That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished
execution before Ti started, or Tj started execution after Ti finished.
Durability. After a transaction completes successfully, the changes it has made to the
database persist, even if there are system failures.
Transaction State
Active: the initial state; the transaction stays in this state while it is executing
Partially committed: after the final statement has been executed.
Failed: after the discovery that normal execution can no longer proceed.
Aborted: after the transaction has been rolled back and the database restored to its state
prior to the start of the transaction. Two options after it has been aborted:
restart the transaction
can be done only if no internal logical error
kill the transaction
Committed: after successful completion.
Implementation of Atomicity and Durability
The recovery-management component of a database system implements the support for
atomicity and durability.
E.g. the shadow-database scheme:
all updates are made on a shadow copy of the database
db_pointer is made to point to the updated shadow copy after
the transaction reaches partial commit and
all updated pages have been flushed to disk.
db_pointer always points to the current consistent copy of the database.
In case transaction fails, old consistent copy pointed to by db_pointer can be used,
and the shadow copy can be deleted.
The shadow-database scheme:
Assumes that only one transaction is active at a time.
Assumes disks do not fail
Useful for text editors, but
extremely inefficient for large databases (why?)
Variant called shadow paging reduces copying of data, but is still not
practical for large databases
Does not handle concurrent transactions
Will study better schemes in Chapter 17.
Concurrent Executions
Multiple transactions are allowed to run concurrently in the system. Advantages are:
increased processor and disk utilization, leading to better transaction throughput
E.g. one transaction can be using the CPU while another is reading from or
writing to the disk
reduced average response time for transactions: short transactions need not wait
behind long ones.
Concurrency control schemes: mechanisms to achieve isolation
that is, to control the interaction among the concurrent transactions in order to
prevent them from destroying the consistency of the database
Will study in Chapter 16, after studying notion of correctness of concurrent executions
Schedule: a sequence of instructions that specifies the chronological order in which
instructions of concurrent transactions are executed
a schedule for a set of transactions must consist of all instructions of those transactions
must preserve the order in which the instructions appear in each individual transaction.
A transaction that successfully completes its execution will have a commit instruction as
the last statement
by default transaction assumed to execute commit instruction as its last step
A transaction that fails to successfully complete its execution will have an abort instruction
as the last statement
Schedule 1
Let T1 transfer $50 from A to B, and T2 transfer 10% of the balance from A to B.
A serial schedule in which T1 is followed by T2.
Schedule 2
A serial schedule where T2 is followed by T1.
Schedule 3
Let T1 and T2 be the transactions defined previously. The following schedule is not a serial
schedule, but it is equivalent to Schedule 1.
In Schedules 1, 2 and 3, the sum A + B is preserved.
Schedule 4
The following concurrent schedule does not preserve the value of (A + B).
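The schedule figures are not reproduced in these notes, so the sketch below simulates T1 and T2 under four orderings chosen for illustration: the two serial orders, one interleaving that behaves like a serial order, and one that loses an update. The starting balances, the step functions and the particular interleavings are assumptions and need not match the original Schedules 3 and 4 step for step.

```python
def read(item):
    def step(db, local):
        local[item] = db[item]      # copy the database value into the local buffer
    return step

def write(item):
    def step(db, local):
        db[item] = local[item]      # copy the local buffer back to the database
    return step

def t1_debit(db, local):
    local["A"] -= 50

def t1_credit(db, local):
    local["B"] += 50

def t2_debit(db, local):
    local["temp"] = local["A"] // 10    # 10% of the balance of A
    local["A"] -= local["temp"]

def t2_credit(db, local):
    local["B"] += local["temp"]

T1 = [read("A"), t1_debit, write("A"), read("B"), t1_credit, write("B")]
T2 = [read("A"), t2_debit, write("A"), read("B"), t2_credit, write("B")]

def run(order, a=1000, b=2000):
    """Execute the next step of the named transaction for each entry in order."""
    db = {"A": a, "B": b}
    steps = {1: iter(T1), 2: iter(T2)}
    local = {1: {}, 2: {}}
    for txn in order:
        next(steps[txn])(db, local[txn])
    return db

serial_12 = run([1] * 6 + [2] * 6)
serial_21 = run([2] * 6 + [1] * 6)
interleaved_ok = run([1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2])
interleaved_bad = run([1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2])
for name, db in [("T1;T2", serial_12), ("T2;T1", serial_21),
                 ("ok", interleaved_ok), ("bad", interleaved_bad)]:
    print(name, db, db["A"] + db["B"])
```

In the last ordering T1 overwrites the value of A that T2 has just written, so T2's debit is lost while its credit survives, and the sum A + B grows by 50.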
Basic Assumption: Each transaction preserves database consistency.
Thus serial execution of a set of transactions preserves database consistency.
A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule.
Different forms of schedule equivalence give rise to the notions of:
1. conflict serializability
2. view serializability
Simplified view of transactions
We ignore operations other than read and write instructions
We assume that transactions may perform arbitrary computations on data in local
buffers in between reads and writes.
Our simplified schedules consist of only read and write instructions.
Conflicting Instructions
Instructions li and lj of transactions Ti and Tj respectively, conflict if and only if there exists
some item Q accessed by both li and lj, and at least one of these instructions wrote Q.
1. li = read(Q), lj = read(Q). li and lj don't conflict.
2. li = read(Q), lj = write(Q). They conflict.
3. li = write(Q), lj = read(Q). They conflict.
4. li = write(Q), lj = write(Q). They conflict.
Intuitively, a conflict between li and lj forces a (logical) temporal order between them.
If li and lj are consecutive in a schedule and they do not conflict, their results would
remain the same even if they had been interchanged in the schedule.
Conflict Serializability
If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting
instructions, we say that S and S' are conflict equivalent.
We say that a schedule S is conflict serializable if it is conflict equivalent to a serial
schedule.
View Serializability
Let S and S' be two schedules with the same set of transactions. S and S' are view
equivalent if the following three conditions are met, for each data item Q,
1. If in schedule S, transaction Ti reads the initial value of Q, then in schedule S' also
transaction Ti must read the initial value of Q.
2. If in schedule S transaction Ti executes read(Q), and that value was produced by
transaction Tj (if any), then in schedule S' also transaction Ti must read the value of
Q that was produced by the same write(Q) operation of transaction Tj.
3. The transaction (if any) that performs the final write(Q) operation in schedule S
must also perform the final write(Q) operation in schedule S'.
As can be seen, view equivalence is also based purely on reads and writes alone.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable.
Below is a schedule which is view-serializable but not conflict serializable.
What serial schedule is above equivalent to?
Every view serializable schedule that is not conflict serializable has blind writes.
Other Notions of Serializability
The schedule below produces the same outcome as a serial schedule of the same transactions, yet is not
conflict equivalent or view equivalent to it.
Determining such equivalence requires analysis of operations other than read and write.
Recoverable Schedules
Need to address the effect of transaction failures on concurrently
running transactions.
Recoverable schedule: if a transaction Tj reads a data item previously written by a
transaction Ti, then the commit operation of Ti appears before the commit operation of Tj.
The following schedule (Schedule 11) is not recoverable if the reading transaction commits
immediately after the read.
If the writing transaction should then abort, the reader would have read (and possibly shown
to the user) an inconsistent database state. Hence, database must ensure that schedules are
recoverable.
Cascading Rollbacks
Cascading rollback: a single transaction failure leads to a series of transaction
rollbacks. Consider the following schedule where none of the transactions has yet
committed (so the schedule is recoverable)
If the first transaction fails, the transactions that read its writes, directly or indirectly,
must also be rolled back.
Can lead to the undoing of a significant amount of work
Cascadeless Schedules
Cascadeless schedules: cascading rollbacks cannot occur; for each pair of transactions Ti
and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti
appears before the read operation of Tj.
Every cascadeless schedule is also recoverable
It is desirable to restrict the schedules to those that are cascadeless
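The two definitions can be turned into simple checks over a schedule written as a list of (transaction, action, item) triples, where the action is "r", "w" or "c" for commit. This is a simplified sketch: it ignores aborts, the function names are invented, and the example schedules are made up rather than taken from the missing figures.

```python
def reads_from(schedule):
    """List (reader, writer, read_position) for reads of items last written by another txn."""
    last_writer = {}
    pairs = []
    for pos, (txn, action, item) in enumerate(schedule):
        if action == "w":
            last_writer[item] = txn
        elif action == "r" and last_writer.get(item, txn) != txn:
            pairs.append((txn, last_writer[item], pos))
    return pairs

def commit_positions(schedule):
    return {txn: pos for pos, (txn, action, _) in enumerate(schedule) if action == "c"}

def is_recoverable(schedule):
    """Every committed reader commits only after the writer it read from has committed."""
    commits = commit_positions(schedule)
    for reader, writer, _ in reads_from(schedule):
        if reader in commits and not (writer in commits and commits[writer] < commits[reader]):
            return False
    return True

def is_cascadeless(schedule):
    """Every read of another transaction's write happens after that writer committed."""
    commits = commit_positions(schedule)
    return all(writer in commits and commits[writer] < pos
               for _, writer, pos in reads_from(schedule))

dirty_read_commits_first = [("T1", "w", "A"), ("T2", "r", "A"), ("T2", "c", None), ("T1", "c", None)]
dirty_read_commits_last = [("T1", "w", "A"), ("T2", "r", "A"), ("T1", "c", None), ("T2", "c", None)]
read_after_commit = [("T1", "w", "A"), ("T1", "c", None), ("T2", "r", "A"), ("T2", "c", None)]

for s in (dirty_read_commits_first, dirty_read_commits_last, read_after_commit):
    print(is_recoverable(s), is_cascadeless(s))
```

The middle schedule is recoverable but not cascadeless: if T1 failed before committing, T2 would have to be rolled back as well.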
Concurrency Control
A database must provide a mechanism that will ensure that all possible schedules are
either conflict or view serializable, and
are recoverable and preferably cascadeless
A policy in which only one transaction can execute at a time generates serial schedules, but
provides a poor degree of concurrency
Are serial schedules recoverable/cascadeless?
Testing a schedule for serializability after it has executed is a little too late!
Goal to develop concurrency control protocols that will assure serializability.
Concurrency Control vs. Serializability Tests
Concurrency-control protocols allow concurrent schedules, but ensure that the schedules are
conflict/view serializable, and are recoverable and cascadeless.
Concurrency control protocols generally do not examine the precedence graph as it is being
created.
Instead a protocol imposes a discipline that avoids nonserializable schedules.
We study such protocols in Chapter 16.
Different concurrency control protocols provide different tradeoffs between the amount of
concurrency they allow and the amount of overhead that they incur.
Tests for serializability help us understand why a concurrency control protocol is correct.
Weak Levels of Consistency
Some applications are willing to live with weak levels of consistency, allowing schedules
that are not serializable
E.g. a read-only transaction that wants to get an approximate total balance of all
accounts
E.g. database statistics computed for query optimization can be approximate (why?)
Such transactions need not be serializable with respect to other transactions
Tradeoff accuracy for performance
Levels of Consistency in SQL-92
Serializable: default
Repeatable read: only committed records to be read, repeated reads of same record must
return same value. However, a transaction may not be serializable: it may find some
records inserted by a transaction but not find others.
Read committed: only committed records can be read, but successive reads of record
may return different (but committed) values.
Read uncommitted: even uncommitted records may be read.
Lower degrees of consistency useful for gathering approximate
information about the database
Warning: some database systems do not ensure serializable schedules by default
E.g. Oracle and PostgreSQL by default support a level of consistency called
snapshot isolation (not part of the SQL standard)
Transaction Definition in SQL
Data manipulation language must include a construct for specifying the set of actions
that comprise a transaction.
In SQL, a transaction begins implicitly.
A transaction in SQL ends by:
commit work: commits the current transaction and begins a new one.
rollback work: causes the current transaction to abort.
In almost all database systems, by default, every SQL statement also commits
implicitly if it executes successfully
Implicit commit can be turned off by a database directive
E.g. in JDBC, connection.setAutoCommit(false);
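The commit/rollback behaviour described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `account` table, the `transfer` helper and the `balances` helper are hypothetical names, not part of these notes. In sqlite3's default mode the first UPDATE implicitly begins a transaction, which then ends with an explicit commit or rollback.

```python
import sqlite3

# In-memory database with a hypothetical `account` table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates commit or neither does."""
    try:
        # The first UPDATE implicitly begins a transaction.
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        (balance,) = conn.execute("SELECT balance FROM account WHERE name = ?",
                                  (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.commit()      # make both updates permanent
    except Exception:
        conn.rollback()    # undo the partial update
        raise


def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM account"))
```

A failed transfer leaves the balances exactly as they were, which is the atomicity requirement from the funds-transfer example.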
Implementation of Isolation
Schedules must be conflict or view serializable, and recoverable, for the sake of
database consistency, and preferably cascadeless.
A policy in which only one transaction can execute at a time generates serial schedules,
but provides a poor degree of concurrency.
Concurrency-control schemes trade off the amount of concurrency they allow
and the amount of overhead that they incur.
Some schemes allow only conflict-serializable schedules to be generated, while others
allow view-serializable schedules that are not conflict-serializable.
Figure 15.6
Testing for Serializability
Consider some schedule of a set of transactions T1, T2, ..., Tn
Precedence graph: a directed graph where the vertices are the transactions (names).
We draw an arc from Ti to Tj if the two transactions conflict, and Ti accessed the data
item on which the conflict arose earlier.
We may label the arc by the item that was accessed.
Example 1
Example Schedule (Schedule A) + Precedence Graph
Test for Conflict Serializability
A schedule is conflict serializable if and only if its precedence graph is acyclic.
Cycle-detection algorithms exist which take order n^2 time, where n is the number of vertices
in the graph.
(Better algorithms take order n + e, where e is the number of edges.)
If precedence graph is acyclic, the serializability order can be obtained by a topological
sorting of the graph.
This is a linear order consistent with the partial order of the graph.
For example, a serializability order for Schedule A would be
Are there others?
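The test above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular system: the schedule encoding (a list of (transaction, action, item) triples) and the function names are assumptions made for the example.

```python
from collections import defaultdict, deque


def precedence_graph(schedule):
    """Build the precedence graph of a schedule.

    `schedule` lists (transaction, action, item) triples in execution order,
    with action "R" or "W". Two operations conflict when they belong to
    different transactions, access the same item, and at least one is a write.
    """
    nodes = []
    for txn, _action, _item in schedule:
        if txn not in nodes:
            nodes.append(txn)
    edges = defaultdict(set)
    for i, (t1, a1, x1) in enumerate(schedule):
        for t2, a2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (a1, a2):
                edges[t1].add(t2)  # t1 accessed the item earlier
    return nodes, edges


def serial_order(nodes, edges):
    """Topologically sort the graph; return None if it contains a cycle."""
    indegree = {n: 0 for n in nodes}
    for src in list(edges):
        for dst in edges[src]:
            indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for dst in sorted(edges[n]):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return order if len(order) == len(nodes) else None
```

A schedule is conflict serializable exactly when `serial_order` returns a list; when several vertices are ready at once, other serializability orders exist as well.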
Test for View Serializability
The precedence graph test for conflict serializability cannot be used directly to test for view
serializability.
Extension to test for view serializability has cost exponential in the size of the
precedence graph.
The problem of checking if a schedule is view serializable falls in the class of NP-complete
problems.
Thus existence of an efficient algorithm is extremely unlikely.
However practical algorithms that just check some sufficient conditions for view
serializability can still be used.
Lock-Based Protocols
A lock is a mechanism to control concurrent access to a data item
Data items can be locked in two modes :
1. exclusive (X) mode. Data item can be both read as well as
written. X-lock is requested using lock-X instruction.
2. shared (S) mode. Data item can only be read. S-lock is
requested using lock-S instruction.
Lock requests are made to concurrency-control manager. Transaction can proceed only after
request is granted.
Lock-compatibility matrix (true = the two modes are compatible):
         S       X
   S     true    false
   X     false   false
A transaction may be granted a lock on an item if the requested lock is compatible with
locks already held on the item by other transactions
Any number of transactions can hold shared locks on an item,
but if any transaction holds an exclusive lock on the item no other transaction may hold
any lock on the item.
If a lock cannot be granted, the requesting transaction is made to wait till all incompatible
locks held by other transactions have been released. The lock is then granted.
Example of a transaction performing locking:
T2: lock-S(A);
    read(A);
    unlock(A);
    lock-S(B);
    read(B);
    unlock(B);
    display(A+B)
Locking as above is not sufficient to guarantee serializability: if A and B get updated in
between the read of A and the read of B, the displayed sum would be wrong.
A locking protocol is a set of rules followed by all transactions while requesting and
releasing locks. Locking protocols restrict the set of possible schedules.
Pitfalls of Lock-Based Protocols
Consider the partial schedule
Neither T3 nor T4 can make progress: executing lock-S(B) causes T4 to wait for T3 to
release its lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock
on A.
Such a situation is called a deadlock.
To handle a deadlock one of T3 or T4 must be rolled back
and its locks released.
The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.
Starvation is also possible if concurrency control manager is badly designed. For example:
A transaction may be waiting for an X-lock on an item, while a sequence of other
transactions request and are granted an S-lock on the same item.
The same transaction is repeatedly rolled back due to deadlocks.
Concurrency control manager can be designed to prevent starvation.
The Two-Phase Locking Protocol
This is a protocol which ensures conflict-serializable schedules.
Phase 1: Growing Phase
transaction may obtain locks
transaction may not release locks
Phase 2: Shrinking Phase
transaction may release locks
transaction may not obtain locks
The protocol assures serializability. It can be proved that the transactions can be
serialized in the order of their lock points (i.e. the point where a transaction acquired
its final lock).
Two-phase locking does not ensure freedom from deadlocks
Cascading roll-back is possible under two-phase locking. To avoid this, follow a
modified protocol called strict two-phase locking. Here a transaction must hold all its
exclusive locks till it commits/aborts.
Rigorous two-phase locking is even stricter: here all locks are held till commit/abort.
In this protocol transactions can be serialized in the order in which they commit.
There can be conflict serializable schedules that cannot be obtained if two-phase
locking is used.
However, in the absence of extra information (e.g., ordering of access to data),
two-phase locking is needed for conflict serializability in the following sense:
Given a transaction Ti that does not follow two-phase locking, we can find a
transaction Tj that uses two-phase locking, and a schedule for Ti and Tj that is not
conflict serializable.
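The growing/shrinking discipline is easy to check mechanically. The sketch below assumes a hypothetical encoding of one transaction's lock activity as a list of ("lock" | "unlock", item) steps; the function names are illustrative only.

```python
def is_two_phase(ops):
    """True if the transaction never acquires a lock after its first unlock.

    `ops` is one transaction's sequence of ("lock" | "unlock", item) steps.
    """
    shrinking = False
    for kind, _item in ops:
        if kind == "unlock":
            shrinking = True          # the shrinking phase has begun
        elif kind == "lock" and shrinking:
            return False              # lock requested in the shrinking phase
    return True


def lock_point(ops):
    """Index of the step where the final lock is acquired (None if no locks)."""
    positions = [i for i, (kind, _item) in enumerate(ops) if kind == "lock"]
    return positions[-1] if positions else None
```

For example, lock A, lock B, unlock A, unlock B is two-phase with its lock point at the second step, whereas lock A, unlock A, lock B, unlock B is not two-phase.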
Lock Conversions
Two-phase locking with lock conversions:
First Phase:
can acquire a lock-S on item
can acquire a lock-X on item
can convert a lock-S to a lock-X (upgrade)
Second Phase:
can release a lock-S
can release a lock-X
can convert a lock-X to a lock-S (downgrade)
This protocol assures serializability. But still relies on the programmer to insert the
various locking instructions.
Automatic Acquisition of Locks
A transaction Ti issues the standard read/write instruction, without explicit locking
calls.
The operation read(D) is processed as:
    if Ti has a lock on D
      then
        read(D)
      else begin
        if necessary wait until no other
          transaction has a lock-X on D
        grant Ti a lock-S on D;
        read(D)
      end
write(D) is processed as:
    if Ti has a lock-X on D
      then
        write(D)
      else begin
        if necessary wait until no other trans. has any lock on D,
        if Ti has a lock-S on D
          then
            upgrade lock on D to lock-X
          else
            grant Ti a lock-X on D
        write(D)
      end;
All locks are released after commit or abort
Implementation of Locking
A lock manager can be implemented as a separate process to which transactions send
lock and unlock requests
The lock manager replies to a lock request by sending a lock grant message (or a
message asking the transaction to roll back, in case of a deadlock)
The requesting transaction waits until its request is answered
The lock manager maintains a data-structure called a lock table to record granted
locks and pending requests
The lock table is usually implemented as an in-memory hash table indexed on the
name of the data item being locked
Lock Table
Black rectangles indicate granted locks, white ones indicate waiting requests
Lock table also records the type of lock granted or requested
New request is added to the end of the queue of requests for the data item, and granted
if it is compatible with all earlier locks
Unlock requests result in the request being deleted, and later requests are checked to
see if they can now be granted
If transaction aborts, all waiting or granted requests of the transaction are deleted
The lock manager may keep a list of locks held by each transaction, to implement
this efficiently
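A lock table along these lines can be sketched as a dictionary of per-item request queues. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, only S and X modes are handled, and lock upgrades and deadlock handling are left out.

```python
from collections import defaultdict, deque

# Only shared/shared requests are compatible (S-X, X-S and X-X are not).
COMPATIBLE = {("S", "S")}


class LockTable:
    def __init__(self):
        # item name -> FIFO queue of [txn, mode, granted] requests
        self.queues = defaultdict(deque)

    def request(self, txn, item, mode):
        """Queue a request; grant it only if compatible with all earlier requests."""
        queue = self.queues[item]
        granted = all((earlier_mode, mode) in COMPATIBLE
                      for _txn, earlier_mode, _granted in queue)
        queue.append([txn, mode, granted])
        return granted

    def release(self, txn, item):
        """Delete txn's request, then grant later requests that have become compatible."""
        queue = deque(req for req in self.queues[item] if req[0] != txn)
        self.queues[item] = queue
        newly_granted = []
        for i, req in enumerate(queue):
            if req[2]:
                continue
            earlier_modes = [r[1] for r in list(queue)[:i]]
            if all((m, req[1]) in COMPATIBLE for m in earlier_modes):
                req[2] = True
                newly_granted.append(req[0])
            else:
                break  # later requests keep waiting behind this one
        return newly_granted
```

Checking a new request against every earlier request, including waiting ones, keeps the queue first-come-first-served, so a waiting X request is not starved by a stream of later S requests.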
Graph-Based Protocols
Graph-based protocols are an alternative to two-phase locking
Impose a partial ordering → on the set D = {d1, d2, ..., dh} of all data items.
If di → dj then any transaction accessing both di and dj must access di before
accessing dj.
Implies that the set D may now be viewed as a directed acyclic graph, called a
database graph.
The tree-protocol is a simple kind of graph protocol.
Tree Protocol
1. Only exclusive locks are allowed.
2. The first lock by Ti may be on any data item. Subsequently, a data item Q can be locked by
Ti only if the parent of Q is currently locked by Ti.
3. Data items may be unlocked at any time.
4. A data item that has been locked and unlocked by Ti cannot subsequently be relocked
by Ti.
Timestamp-Based Protocols
Each transaction is issued a timestamp when it enters the system. If an old transaction
Ti has time-stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that
TS(Ti) < TS(Tj).
The protocol manages concurrent execution such that the time-stamps determine the
serializability order.
In order to assure such behavior, the protocol maintains for each data Q two
timestamp values:
W-timestamp(Q) is the largest time-stamp of any transaction that executed
write(Q) successfully.
R-timestamp(Q) is the largest time-stamp of any transaction that executed
read(Q) successfully.
The timestamp ordering protocol ensures that any conflicting read and write
operations are executed in timestamp order.
Suppose a transaction Ti issues a read(Q)
1. If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was
already overwritten.
Hence, the read operation is rejected, and Ti is rolled back.
2. If TS(Ti) >= W-timestamp(Q), then the read operation is executed, and
R-timestamp(Q) is set to max(R-timestamp(Q), TS(Ti)).
Suppose that transaction Ti issues write(Q).
1. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed
previously, and the system assumed that that value would never be produced.
Hence, the write operation is rejected, and Ti is rolled back.
2. If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q.
Hence, this write operation is rejected, and Ti is rolled back.
3. Otherwise, the write operation is executed, and W-timestamp(Q) is set to TS(Ti).
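The read and write rules can be sketched as a small Python class. This is an illustrative sketch only: the class name, the `Rollback` exception, and the convention that timestamps are positive integers (so that 0 can stand for "never accessed") are assumptions made for the example.

```python
class Rollback(Exception):
    """Raised when an operation is rejected and the transaction must be rolled back."""


class TimestampOrdering:
    def __init__(self):
        self.r_ts = {}  # item -> R-timestamp(Q)
        self.w_ts = {}  # item -> W-timestamp(Q)

    def read(self, ts, item):
        """Apply the read rule for a transaction with timestamp `ts`."""
        if ts < self.w_ts.get(item, 0):
            # The value this transaction should have read was already overwritten.
            raise Rollback(f"read({item}) by TS={ts} rejected")
        self.r_ts[item] = max(self.r_ts.get(item, 0), ts)

    def write(self, ts, item):
        """Apply the write rule for a transaction with timestamp `ts`."""
        if ts < self.r_ts.get(item, 0):
            # A younger transaction already read the value this write would replace.
            raise Rollback(f"write({item}) by TS={ts} rejected (late for a reader)")
        if ts < self.w_ts.get(item, 0):
            # The write is obsolete: a younger transaction already wrote the item.
            raise Rollback(f"write({item}) by TS={ts} rejected (obsolete)")
        self.w_ts[item] = ts
```

No operation ever waits in this sketch; it either proceeds or raises `Rollback`, which is why the protocol cannot deadlock.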
Example Use of the Protocol
A partial schedule for several data items for transactions with
timestamps 1, 2, 3, 4, 5
Correctness of Timestamp-Ordering Protocol
The timestamp-ordering protocol guarantees serializability since all
the arcs in the precedence graph are of the form:
(transaction with smaller timestamp) → (transaction with larger timestamp)
Thus, there will be no cycles in the precedence graph
Timestamp protocol ensures freedom from deadlock as no transaction
ever waits.
But the schedule may not be cascade-free, and may not even be
recoverable.
Thomas Write Rule
Modified version of the timestamp-ordering protocol in which obsolete
write operations may be ignored under certain circumstances.
When Ti attempts to write data item Q, if TS(Ti) < W-timestamp(Q),
then Ti is attempting to write an obsolete value of Q.
Rather than rolling back Ti as the timestamp ordering protocol
would have done, this write operation can be ignored.
Otherwise this protocol is the same as the timestamp ordering
protocol.
Thomas' Write Rule allows greater potential concurrency.
Allows some view-serializable schedules that are not conflict-serializable.
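The only change Thomas' Write Rule makes is to the obsolete-write case. A minimal sketch, assuming positive integer timestamps and plain dictionaries for the R- and W-timestamps (the function name and the returned strings are illustrative):

```python
def thomas_write(ts, item, r_ts, w_ts):
    """Decide the outcome of write(item) by a transaction with timestamp `ts`.

    r_ts / w_ts map each item to its R-timestamp / W-timestamp (0 if absent).
    Returns "rollback", "ignored" or "written".
    """
    if ts < r_ts.get(item, 0):
        return "rollback"   # a younger transaction already read the old value
    if ts < w_ts.get(item, 0):
        return "ignored"    # obsolete write: skip it instead of rolling back
    w_ts[item] = ts
    return "written"
```

An ignored write leaves W-timestamp(Q) untouched, since the younger transaction's value is the one that must survive.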
Validation-Based Protocol
Execution of transaction Ti is done in three phases.
1. Read and execution phase: Transaction Ti writes only to
temporary local variables
2. Validation phase: Transaction Ti performs a "validation test"
to determine if local variables can be written without violating
serializability.
3. Write phase: If Ti is validated, the updates are applied to the
database; otherwise, Ti is rolled back.
The three phases of concurrently executing transactions can be
interleaved, but each transaction must go through the three phases in
that order.
Assume for simplicity that the validation and write phase occur
together, atomically and serially
I.e., only one transaction executes validation/write at a time.
Also called optimistic concurrency control since transaction
executes fully in the hope that all will go well during validation
Each transaction Ti has 3 timestamps
Start(Ti): the time when Ti started its execution
Validation(Ti): the time when Ti entered its validation phase
Finish(Ti): the time when Ti finished its write phase
Serializability order is determined by timestamp given at validation
time, to increase concurrency.
Thus TS(Ti) is given the value of Validation(Ti).
This protocol is useful and gives greater degree of concurrency if
probability of conflicts is low.
because the serializability order is not pre-decided, and
relatively few transactions will have to be rolled back.
Validation Test for Transaction Tj
If for all Ti with TS(Ti) < TS(Tj) either one of the following conditions
holds:
finish(Ti) < start(Tj)
start(Tj) < finish(Ti) < validation(Tj) and the set of data items
written by Ti does not intersect with the set of data items read
by Tj.
then validation succeeds and Tj can be committed. Otherwise,
validation fails and Tj is aborted.
Justification: Either the first condition is satisfied, and there is no
overlapped execution, or the second condition is satisfied and
the writes of Tj do not affect reads of Ti since they occur after Ti
has finished its reads.
the writes of Ti do not affect reads of Tj since Tj does not read
any item written by Ti.
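The validation test translates almost directly into code. The sketch below is an illustration under stated assumptions: the `Txn` record and its field names are hypothetical, timestamps are plain integers, and every earlier transaction is assumed to have finished already (validation and write are taken to happen atomically and serially).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Txn:
    start: int
    validation: int
    finish: Optional[int] = None          # unknown until the write phase ends
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)


def validate(tj, earlier):
    """Validation test for tj against every ti with TS(ti) < TS(tj)."""
    for ti in earlier:
        if ti.finish < tj.start:
            continue  # condition 1: ti finished before tj started
        if (tj.start < ti.finish < tj.validation
                and not (ti.write_set & tj.read_set)):
            continue  # condition 2: overlap, but tj read nothing ti wrote
        return False  # neither condition holds: tj must be aborted
    return True
```

The second condition only compares the write set of the earlier transaction with the read set of the one being validated, which is why read-only transactions rarely force an abort of others.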
Schedule Produced by Validation
Example of schedule produced using validation
read(B) read(B)
B:= B-50
A:= A+50
display (A+B)
write (B)
write (A)
Multiple Granularity
Allow data items to be of various sizes and define a hierarchy of data
granularities, where the small granularities are nested within larger
ones
Can be represented graphically as a tree (but don't confuse with
tree-locking protocol)
When a transaction locks a node in the tree explicitly, it implicitly
locks all the node's descendants in the same mode.
Granularity of locking (level in tree where locking is done):
fine granularity (lower in tree): high concurrency, high locking
overhead
coarse granularity (higher in tree): low locking overhead, low
concurrency
Example of Granularity Hierarchy
The levels, starting from the coarsest (top) level are
Intention Lock Modes
In addition to S and X lock modes, there are three additional lock
modes with multiple granularity:
intention-shared (IS): indicates explicit locking at a lower level of
the tree but only with shared locks.
intention-exclusive (IX): indicates explicit locking at a lower level
with exclusive or shared locks
shared and intention-exclusive (SIX): the subtree rooted by that
node is locked explicitly in shared mode and explicit locking is
being done at a lower level with exclusive-mode locks.
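Intention locks allow a higher level node to be locked in S or X mode
without having to check all descendant nodes.
Compatibility Matrix with Intention Lock Modes
The compatibility matrix for all lock modes is:
         IS     IX     S      SIX    X
   IS    yes    yes    yes    yes    no
   IX    yes    yes    no     no     no
   S     yes    no     yes    no     no
   SIX   yes    no     no     no     no
   X     no     no     no     no     no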
Multiple Granularity Locking Scheme
Transaction Ti can lock a node Q, using the following rules:
1. The lock compatibility matrix must be observed.
2. The root of the tree must be locked first, and may be locked in
any mode.
3. A node Q can be locked by Ti in S or IS mode only if the parent of
Q is currently locked by Ti in either IX or IS mode.
4. A node Q can be locked by Ti in X, SIX, or IX mode only if the
parent of Q is currently locked by Ti in either IX or SIX mode.
5. Ti can lock a node only if it has not previously unlocked any node
(that is, Ti is two-phase).
6. Ti can unlock a node Q only if none of the children of Q are
currently locked by Ti.
Observe that locks are acquired in root-to-leaf order, whereas they are
released in leaf-to-root order.
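Rules 1, 3 and 4 can be sketched as table lookups. The compatibility table below is the standard one for the IS, IX, S, SIX and X modes; the function name and argument conventions are assumptions made for the example, and rules 5 and 6 (two-phase behaviour and unlock order) are not modelled.

```python
# COMPAT[m] = set of modes another transaction may hold on the same node
# while mode m is granted (standard intention-lock compatibility).
COMPAT = {
    "IS":  {"IS", "IX", "S", "SIX"},
    "IX":  {"IS", "IX"},
    "S":   {"IS", "S"},
    "SIX": {"IS"},
    "X":   set(),
}

# Modes the requesting transaction must hold on the parent (rules 3 and 4).
PARENT_MODES = {
    "IS":  {"IX", "IS"},
    "S":   {"IX", "IS"},
    "IX":  {"IX", "SIX"},
    "SIX": {"IX", "SIX"},
    "X":   {"IX", "SIX"},
}


def may_lock(mode, parent_mode, held_by_others):
    """Can a transaction lock a node in `mode`?

    parent_mode: mode the same transaction holds on the parent (None for the root).
    held_by_others: modes other transactions currently hold on this node.
    """
    if parent_mode is not None and parent_mode not in PARENT_MODES[mode]:
        return False
    return all(other in COMPAT[mode] for other in held_by_others)
```

For instance, an X lock on a record requires IX or SIX on the enclosing file, and is refused while any other transaction holds any lock on that record.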
1. Data on external storage &
File organization and indexing
2. Index data structures
3. Comparison of file organizations
4. Comparison of file organizations
5. Indexes and performance tuning
6. Indexes and performance tuning
7. Intuition for tree indexes & ISAM
8. B+ tree
Data on External Storage
Disks: Can retrieve random page at fixed cost
But reading several consecutive pages is much cheaper than reading them in random
order
Tapes: Can only read pages in sequence
Cheaper than disks; used for archival storage
File organization: Method of arranging a file of records on external storage.
Record id (rid) is sufficient to physically locate record
Indexes are data structures that allow us to find the record ids of records with given
values in index search key fields
Architecture: Buffer manager stages pages from external storage to main memory buffer pool.
File and index layers make calls to the buffer manager.
Alternative File Organizations
Many alternatives exist, each ideal for some situations, and not so good in others:
Heap (random order) files: Suitable when typical access is a file scan retrieving all
records.
Sorted Files: Best if records must be retrieved in some order, or only a 'range' of
records is needed.
Indexes: Data structures to organize records via trees or hashing.
Like sorted files, they speed up searches for a subset of records, based on values
in certain (search key) fields
Updates are much faster than in sorted files.
Index Classification
Primary vs. secondary: If search key contains primary key, then called primary index.
Unique index: Search key contains a candidate key.
Clustered vs. unclustered: If order of data records is the same as, or 'close to', order of data
entries, then called clustered index.
Alternative 1 implies clustered; in practice, clustered also implies Alternative 1 (since
sorted files are rare).
A file can be clustered on at most one search key.
Cost of retrieving data records through index varies greatly based on whether index is
clustered or not!
Clustered vs. Unclustered Index
Suppose that Alternative (2) is used for data entries, and that the data records are
stored in a Heap file.
To build clustered index, first sort the Heap file (with some free space on each
page for future inserts).
Overflow pages may be needed for inserts. (Thus, order of data recs is 'close
to', but not identical to, the sort order.)
[Figure: clustered vs. unclustered index. Index entries (index file) direct the search for
data entries, which point to data records (data file).]
An index on a file speeds up selections on the search key fields for the index.
Any subset of the fields of a relation can be the search key for an index on the relation.
Search key is not the same as key (minimal set of fields that uniquely identify a record
in a relation).
An index contains a collection of data entries, and supports efficient retrieval of all data
entries k* with a given key value k.
Given data entry k*, we can find record with key k in at most one disk I/O. (Details
soon ...)
B+ Tree Indexes
Leaf pages contain data entries (sorted by search key), and are chained (prev & next)
Non-leaf pages have index entries; only used to direct searches.
Example B+ Tree
Hash-Based Indexes
Alternatives for Data Entry k* in Index
In a data entry k* we can store:
Data record with key value k, or
<k, rid of data record with search key value k>, or
<k, list of rids of data records with search key k>
Choice of alternative for data entries is orthogonal to the indexing
technique used to locate data entries with a given key value k.
Examples of indexing techniques: B+ trees, hash-based
Typically, index contains auxiliary information that directs
searches to the desired data entries
Alternatives 2 and 3:
Data entries typically much smaller than data records. So,
better than Alternative 1 with large data records, especially if
search keys are small. (Portion of index structure used to direct
search, which depends on size of data entries, is much smaller
than with Alternative 1.)
Alternative 3 more compact than Alternative 2, but leads to
variable sized data entries even if search keys are of fixed
length.
Cost Model for Our Analysis
We ignore CPU costs, for simplicity:
B: The number of data pages
R: Number of records per page
D: (Average) time to read or write disk page
Measuring number of page I/Os ignores gains of pre-fetching a
sequence of pages; thus, even I/O cost is only approximated.
Average-case analysis; based on several simplistic assumptions.
Comparing File Organizations
Heap files (random order; insert at eof)
Sorted files, sorted on <age, sal>
Clustered B+ tree file, Alternative (1), search key <age, sal>
Heap file with unclustered B+ tree index on search key <age, sal>
Heap file with unclustered hash index on search key <age, sal>
Operations to Compare
Scan: Fetch all records from disk
Equality search
Range selection
Insert a record
Delete a record
Assumptions in Our Analysis
Heap Files:
Equality selection on key; exactly one match.
Sorted Files:
Files compacted after deletions.
Alt (2), (3): data entry size = 10% size of record
Hash: No overflow buckets.
80% page occupancy => File size = 1.25 data size
Tree: 67% occupancy (this is typical).
Implies file size = 1.5 data size
Leaf levels of a tree-index are chained.
Index data-entries plus actual file scanned for unclustered
indexes.
Range searches:
We use tree indexes to restrict the set of data records fetched,
but ignore hash indexes.
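As a rough worked example of the cost model, the sketch below evaluates equality-search cost for the five file organizations being compared, using B (data pages) and D (time per page I/O) as defined above. The fan-out parameter F, the function name and the dictionary keys are assumptions introduced for illustration; the formulas themselves follow from the stated assumptions (tree file 1.5 times the data size, data entries 10% of record size, no overflow buckets for hashing).

```python
import math


def equality_search_cost(B, D, F):
    """Approximate equality-search cost (time) per file organization.

    B: number of data pages, D: time per page I/O,
    F: assumed fan-out of tree index pages (hypothetical parameter).
    """
    return {
        "heap": 0.5 * B * D,                                   # scan half the file on average
        "sorted": D * math.log2(B),                            # binary search over the pages
        "clustered": D * math.log(1.5 * B, F),                 # descend a tree over 1.5B pages
        "unclustered tree": D * (1 + math.log(0.15 * B, F)),   # descend, then fetch the record
        "unclustered hash": 2 * D,                             # bucket page plus record page
    }
```

With, say, B = 1000, D = 1 and F = 100, the heap file needs about 500 page reads while every indexed organization needs only a handful, which is the main point of the comparison.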
Cost of Operations
                          (a) Scan      (b) Equality          (c) Range                              (d) Insert     (e) Delete
(1) Heap                  BD            0.5BD                 BD                                     2D             Search + D
(2) Sorted                BD            D log_2 B             D(log_2 B + #pgs with match recs)      Search + BD    Search + BD
(3) Clustered             1.5BD         D log_F 1.5B          D(log_F 1.5B + #pgs w. match recs)     Search + D     Search + D
(4) Unclust. Tree index   BD(R+0.15)    D(1 + log_F 0.15B)    D(log_F 0.15B + #pgs w. match recs)    Search + 2D    Search + 2D
(5) Unclust. Hash index   BD(R+0.125)   2D                    BD                                     Search + 2D    Search + 2D
Understanding the Workload
For each query in the workload:
Which relations does it access?
Which attributes are retrieved?
Which attributes are involved in selection/join conditions? How
selective are these conditions likely to be?
For each update in the workload:
Which attributes are involved in selection/join conditions? How
selective are these conditions likely to be?
The type of update (INSERT/DELETE/UPDATE), and the attributes
that are affected.
Choice of Indexes
What indexes should we create?
Which relations should have indexes? What field(s) should be
the search key? Should we build several indexes?
For each index, what kind of an index should it be?
Clustered? Hash/tree?
One approach: Consider the most important queries in turn. Consider
the best plan using the current indexes, and see if a better plan is
possible with an additional index. If so, create it.
Obviously, this implies that we must understand how a DBMS
evaluates queries and creates query evaluation plans!
For now, we discuss simple 1-table queries.
Before creating an index, must also consider the impact on updates in
the workload!
Trade-off: Indexes can make queries go faster, updates slower.
Require disk space, too.
Index Selection Guidelines
Attributes in WHERE clause are candidates for index keys.
Exact match condition suggests hash index.
Range query suggests tree index.
Clustering is especially useful for range queries; can also
help on equality queries if there are many duplicates.
Multi-attribute search keys should be considered when a WHERE
clause contains several conditions.
Order of attributes is important for range queries.
Such indexes can sometimes enable index-only strategies for
important queries.
For index-only strategies, clustering is not important!
Examples of Clustered Indexes
B+ tree index on E.age can be used to get qualifying tuples.
How selective is the condition?
Is the index clustered?
Consider the GROUP BY query.
If many tuples have E.age > 10, using E.age index and sorting
the retrieved tuples may be costly.
Clustered E.dno index may be better!
Equality queries and duplicates:
Clustering on E.hobby helps!
WHERE E.age>40
WHERE E.age>10
WHERE E.hobby=Stamps
Indexes with Composite Search Keys
Composite Search Keys: Search on a combination of fields.
Equality query: Every field value is equal to a constant value.
E.g. wrt <sal,age> index:
age=20 and sal=75
Range query: Some field value is not a constant. E.g.:
age=20; or age=20 and sal > 10
Data entries in index sorted by search key to support range queries.
Lexicographic order, or
Spatial order.
[Figure: data entries in index sorted by <sal,age>; data entries sorted by <sal>]
Composite Search Keys
To retrieve Emp records with age=30 AND sal=4000, an index on
<age,sal> would be better than an index on age or an index on sal.
Choice of index key orthogonal to clustering etc.
If condition is: 20<age<30 AND 3000<sal<5000:
Clustered tree index on <age,sal> or <sal,age> is best.
If condition is: age=30 AND 3000<sal<5000:
Clustered <age,sal> index much better than <sal,age> index!
Composite indexes are larger, updated more often.
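Why attribute order matters for the condition age=30 AND 3000<sal<5000 can be seen with plain sorted lists standing in for the data entries of the two composite indexes. This is only a sketch: the sample records and function names are made up for illustration, and Python's bisect module plays the role of the tree search.

```python
from bisect import bisect_left, bisect_right

# Hypothetical (age, sal) values of six Emp records.
records = [(30, 2500), (25, 4000), (30, 3500), (30, 4800), (35, 4000), (30, 6000)]

by_age_sal = sorted(records)                            # data entries of an <age,sal> index
by_sal_age = sorted((sal, age) for age, sal in records)  # data entries of a <sal,age> index


def scan_age_sal(entries, age, lo, hi):
    """age = `age` AND lo < sal < hi on <age,sal>: one contiguous run of entries."""
    start = bisect_right(entries, (age, lo))
    end = bisect_left(entries, (age, hi))
    touched = entries[start:end]
    return touched, len(touched)


def scan_sal_age(entries, age, lo, hi):
    """Same condition on <sal,age>: scan the whole sal range, then filter on age."""
    start = bisect_right(entries, (lo, float("inf")))
    end = bisect_left(entries, (hi, float("-inf")))
    touched = entries[start:end]
    matches = [(a, s) for s, a in touched if a == age]
    return matches, len(touched)
```

On this sample the <age,sal> index touches only the two qualifying entries, while the <sal,age> index touches four entries to find the same two, because entries with the wanted age are scattered across the salary range.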
Index-Only Plans
A number of queries can be answered without retrieving any tuples
from one or more of the relations involved if a suitable index is
available.
<E. age,E.sal>
<E.sal, E.age>
WHERE E.age=25 AND
E.sal BETWEEN 3000 AND 5000
Summary
Many alternative file organizations exist, each appropriate in some
situation.
If selection queries are frequent, sorting the file or building an index is
important.
Hash-based indexes only good for equality search.
Sorted files and tree-based indexes best for range search; also
good for equality search. (Files rarely kept sorted in practice;
B+ tree index is better.)
Index is a collection of data entries plus a way to quickly find entries with
given key values.
Data entries can be actual data records, <key, rid> pairs, or
<key, rid-list> pairs.
Choice orthogonal to indexing technique used to locate data
entries with a given key value.
Can have several indexes on a given file of data records, each with a
different search key.
Indexes can be classified as clustered vs. unclustered, primary vs.
secondary, and dense vs. sparse. Differences have important
consequences for utility/performance.
As for any index, 3 alternatives for data entries k*:
Data record with key value k
<k, rid of data record with search key value k>
<k, list of rids of data records with search key k>
Choice is orthogonal to the indexing technique used to locate data
entries k*.
Tree-structured indexing techniques support both range searches and
equality searches.
ISAM : static structure; B+ tree: dynamic, adjusts gracefully under
inserts and deletes.
Range Searches
"Find all students with gpa > 3.0"
If data is in sorted file, do binary search to find first such
student, then scan to find others.
Cost of binary search can be quite high.
Simple idea: Create an 'index' file.
Index entry
Comments on ISAM
File creation: Leaf (data) pages allocated sequentially,
sorted by search key; then index pages allocated, then space for
overflow pages.
Index entries: <search key value, page id>; they 'direct' search for
data entries, which are in leaf pages.
Search: Start at root; use key comparisons to go to leaf. Cost is proportional to
log_F N; F = # entries/index pg, N = # leaf pgs
Insert: Find leaf data entry belongs to, and put it there.
Delete: Find and remove from leaf; if empty overflow page, de-allocate.
Example ISAM Tree
Each node can hold 2 entries; no need for 'next-leaf-page' pointers.
B+ Tree: Most Widely Used Index
Insert/delete at log_F N cost; keep tree height-balanced. (F = fanout,
N = # leaf pages)
Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries.
The parameter d is called the order of the tree.
Supports equality and range-searches efficiently.
Example B+ Tree
Search begins at root, and key comparisons direct it to a leaf (as in
ISAM).
Search for 5*, 15*, all data entries >= 24* ...
B+ Trees in Practice
Typical order: 100. Typical fill-factor: 67%.
average fanout = 133
Typical capacities:
Height 4: 133^4 = 312,900,721 records
Height 3: 133^3 = 2,352,637 records
Can often hold top levels in buffer pool:
Level 1 = 1 page = 8 Kbytes
Level 2 = 133 pages = 1 Mbyte
Level 3 = 17,689 pages = 133 MBytes
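The capacity and buffer-pool figures above are just powers of the fanout, so they are easy to recompute. A small sketch, assuming the fanout of 133 and the 8 KB page size quoted above (the function names are illustrative):

```python
FANOUT = 133   # average fanout quoted above
PAGE_KB = 8    # page size quoted above


def capacity(height):
    """Number of entries reachable through a tree of the given height."""
    return FANOUT ** height


def pages_at_level(level):
    """Number of pages at a level, counting the root as level 1."""
    return FANOUT ** (level - 1)


def level_size_mb(level):
    """Approximate memory needed to cache one whole level, with 1 MB = 1024 KB."""
    return pages_at_level(level) * PAGE_KB / 1024
```

Under these assumptions the top two levels fit in about 1 MB, and level 3 works out to roughly 138 MB, slightly above the rounded figure quoted above; either way, the upper levels of even a very large tree can usually stay in the buffer pool.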
Inserting a Data Entry into a B+ Tree
Find correct leaf L.
Put data entry onto L.
If L has enough space, done!
Else, must split L (into L and a new node L2)
Redistribute entries evenly, copy up middle key.
Insert index entry pointing to L2 into parent of L.
This can happen recursively
To split index node, redistribute entries evenly, but push up
middle key. (Contrast with leaf splits.)
Splits grow tree; root split increases height.
Tree growth: gets wider or one level taller at top.
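The difference between copying up (leaf split) and pushing up (index split) can be sketched with plain lists of keys. This is a simplified illustration: child pointers, page layout and the recursive propagation to the parent are left out, and the function names are hypothetical.

```python
def split_leaf(entries):
    """Split an overfull, sorted leaf.

    Returns (left, right, separator). The separator is COPIED up: it also stays
    in the right leaf, because every data entry must remain at the leaf level.
    """
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right, right[0]


def split_index(keys):
    """Split an overfull index node.

    Returns (left, right, separator). The separator is PUSHED up: it is removed
    from both halves, since index keys only direct the search.
    """
    mid = len(keys) // 2
    return keys[:mid], keys[mid + 1:], keys[mid]
```

For example, splitting the leaf [2, 3, 5, 7, 8] gives [2, 3] and [5, 7, 8] with 5 copied up, while splitting the index node [5, 13, 17, 24, 30] gives [5, 13] and [24, 30] with 17 pushed up.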
Inserting 8* into Example B+ Tree
Observe how minimum occupancy is guaranteed in both leaf and index
pg splits.
Note difference between copy-up and push-up; be sure you understand
the reasons for this.
Example B+ Tree After Inserting 8*
Notice that root was split, leading to increase in height
In this example, we can avoid split by re-distributing entries;
however, this is usually not done in practice.
Deleting a Data Entry from a B+ Tree
Start at root, find leaf L where entry belongs.
Remove the entry.
If L is at least half-full, done!
If L has only d-1 entries,
Try to re-distribute, borrowing from sibling (adjacent node
with same parent as L).
If re-distribution fails, merge L and sibling.
If merge occurred, must delete entry (pointing to L or sibling) from
parent of L.
Merge could propagate to root, decreasing height.
Example Tree After (Inserting 8*, Then) Deleting 19* and 20* ...
Deleting 19* is easy.
Deleting 20* is done with re-distribution. Notice how middle key is copied up.
... And Then Deleting 24*
Must merge.
Observe 'toss' of index entry (on right), and 'pull down' of index entry.
What's a database ?
A database is a collection of data organized in a particular way.
Databases can be of many types such as Flat File Databases, Relational Databases, Distributed
Databases etc.
What's SQL ?
In the early 1970s, IBM researchers created a simple non-procedural language called Structured
English Query Language, or SEQUEL. This was based on Dr. Edgar F. (Ted) Codd's design of a relational
model for data storage where he described a universal programming language for accessing
databases.
In the late 80's ANSI and ISO (two organizations dealing with standards for a wide variety
of things) came out with a standardized version called Structured Query Language or SQL. SQL is
pronounced as 'Sequel'. There have been several versions of SQL; at the time of writing the latest
one is SQL-99, though SQL-92 is the most widely adopted standard.
SQL is the language used to query nearly all relational databases. It's simple to learn and appears to do very little but
is the heart of a successful database application. Understanding SQL and using it efficiently is
highly imperative in designing an efficient database application. The better your understanding of
SQL the more versatile you'll be in getting information out of databases.
What's an RDBMS ?
This concept was first described around 1970 by Dr. Edgar F. Codd of IBM in the paper
"A Relational Model of Data for Large Shared Data Banks".
A relational database uses the concept of linked two-dimensional tables which consist of rows
and columns. A user can draw relationships between multiple tables and present the output as a
table again. A user of a relational database need not understand the representation of data in order to
retrieve it. Relational programming is non-procedural.
[What's procedural and non-procedural ?
Programming languages are procedural if they use programming elements such as conditional
statements (if-then-else, do-while etc.). SQL has none of these types of statements.]
In 1979, Relational Software released Oracle V.2, widely regarded as the first commercially
available relational database.
What's a DBMS ?
MySQL and mSQL are database management systems or DBMS. These software packages are used
to manipulate a database. All DBMSs use their own implementation of SQL. It may be a subset or a
superset of the instructions provided by SQL 92.
MySQL, due to its simplicity, uses a subset of SQL 92 (also known as SQL2).
What's Database Normalization ?
Normalization is the process where a database is designed in a way that removes redundancies, and
increases the clarity in organizing data in a database.
In easy English, it means take similar stuff out of a collection of data and place them into tables.
Keep doing this for each new table recursively and you'll have a Normalized database. From this
resultant database you should be able to recreate the data into its original state if there is a need to
do so.
The important thing here is to know when to Normalize and when to be practical. That will come
with experience. For now, read on...
Normalization makes it easier to modify the design later and to cope with changes required in the
database design. Normalization raises the efficiency of the database in terms of management,
data storage and scalability.
Now Normalization of a Database is achieved by following a set of rules called 'forms' in creating
the database.
These rules are 5 in number (with one extra one stuck in-between 3&4) and they are:
1st Normal Form or 1NF:
Each column holds a single (atomic) value, and there are no repeating groups of columns.
2nd Normal Form or 2NF:
The entity under consideration should already be in 1NF, and all attributes within the entity
should depend on the whole of the entity's unique identifier, not on just a part of it.
3rd Normal Form or 3NF:
The entity should already be in 2NF, and no column entry should be dependent on any other
entry (value) other than the key for the table.
If such a column exists, move it outside into a new table.
Once 3NF is achieved, the database is considered normalized. But there are three more
'extended' NFs for the elitist.
These are:
BCNF (Boyce & Codd):
The database should be in 3NF, and every column (or set of columns) that determines another
column must be a candidate key.
4th Normal Form or 4NF:
Tables cannot have multi-valued dependencies on a Primary Key.
5th Normal Form or 5NF:
There should be no cyclic dependencies in a composite key.
This is a highly simplified explanation of Database Normalization, and the process can be studied
far more extensively. After working with databases for some time you'll create Normalized
databases automatically, because it's the logical and practical thing to do.
For now, don't worry too much about Normalization. The quickest way to grasp SQL and Databases
is to plunge headlong into creating tables and start messing around with SQL statements. After you
go through the tutorial examples and also the example contacts database, look at the example
provided in creating a normalized database near the very end of this tutorial. And then try to think
how you would like to create your own database.
Much of database design depends on how YOU want to keep the data. In real life situations often
you may find it more convenient to store data in tables designed in a way that does fall a bit short of
keeping all the NFs happy. But that's what databases are all about. Making your life simpler.
Onto SQL
There are four basic commands which are the workhorses for SQL and figure in almost all queries
to a database.
INSERT - Insert Data
DELETE - Delete Data
SELECT - Pull Data
UPDATE - Change existing Data
As you can see SQL is like English.
Let's build a real world example database using MySQL and perform some SQL operations on it.
A database that practically anyone could use would be a Contacts database.
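In our example we are going to create a database with the following fields:
FirstName, LastName, BirthDate, StreetAddress, City, State, Zip, Country, TelephoneHome,
TelephoneWork, Email, CompanyName, Designation
First, let's decide how we are going to store this data in the database. For illustration purposes,
we are going to keep this data in multiple tables.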
This will let us exercise all of the SQL commands pertaining to retrieving data from multiple tables.
Also we can separate different kinds of entities into different tables. So let's say you have thousands
of friends and need to send a mass email to all of them, a SELECT statement (covered later) will
look at only one table.
Well, we can keep the FirstName, LastName and BirthDate in one table.
Address related data in another.
Company Details in another.
Emails can be separated into another.
Telephones can be separated into another.
Let's build the database in MySQL.
While building a database - you need to understand the concept of data types. Data types allow the
user to define how data is stored in fields or cells within a database. It's a way to define how your
data will actually exist. Whether it's a Date or a string consisting of 20 characters, an integer etc.
When we build tables within a database we also define the contents of each field in each row in the
table using a data type. It's imperative that you use only the data type that fits your needs and don't
use a data type that reserves more memory than the data in the field actually requires.
Let's look at various Data Types under MySQL.
| Type | Size in bytes | Description |
| TINYINT(length) | 1 | Integer with unsigned range of 0 to 255 and a signed range from -128 to 127 |
| SMALLINT(length) | 2 | Integer with unsigned range of 0 to 65535 and a signed range from -32768 to 32767 |
| MEDIUMINT(length) | 3 | Integer with unsigned range of 0 to 16777215 and a signed range from -8388608 to 8388607 |
| INT(length) | 4 | Integer with unsigned range of 0 to 4294967295 and a signed range from -2147483648 to 2147483647 |
| BIGINT(length) | 8 | Integer with unsigned range of 0 to 18446744073709551615 and a signed range from -9223372036854775808 to 9223372036854775807 |
| FLOAT(length, decimal) | 4 | Floating point number with max. value +/-3.402823466E38 and min. (non-zero) value +/-1.175494351E-38 |
| DOUBLE(length, decimal) | 8 | Floating point number with max. value +/-1.7976931348623157E308 and min. (non-zero) value +/-2.2250738585072014E-308 |
| DECIMAL(length, decimal) | length | Floating point number with the range of the DOUBLE type that is stored as a CHAR field type. |
| TIMESTAMP(length) | 4 | Accepted formats include YYYYMMDD and YYMMDD. A Timestamp value is updated each time the row changes value. A NULL value sets the field to the current time. |
| CHAR(length) | length | A fixed length text string where fields shorter than the assigned length are filled with trailing spaces. |
| VARCHAR(length) | length | A variable length text string (255 Character Max) where unused trailing spaces are removed before storing. |
| TINYTEXT | length+1 | A text field with max. length of 255 characters. |
| TINYBLOB | length+1 | A binary field with max. length of 255 characters. |
| TEXT | length+2 | 64Kb of text |
| BLOB | length+2 | 64Kb of data |
| MEDIUMTEXT | length+3 | 16Mb of text |
| MEDIUMBLOB | length+3 | 16Mb of data |
| LONGTEXT | length+4 | 4GB of text |
| LONGBLOB | length+4 | 4GB of data |
| ENUM | 1 or 2 | This field can contain one of a possible 65535 number of options. Ex: ENUM('abc','def','ghi') |
| SET | 1-8 | This type of field can contain any number of a set of predefined possible values. |
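The integer ranges follow directly from the storage size: n bytes give 8n bits, so an unsigned column holds 0 to 2^(8n) - 1 and a signed one holds -2^(8n-1) to 2^(8n-1) - 1. A small Python check of that arithmetic (the helper name `int_range` is just an illustrative choice):

```python
def int_range(size_in_bytes, signed):
    """Return (min, max) for an integer stored in the given number of bytes."""
    bits = 8 * size_in_bytes
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

# Print the unsigned and signed range of each MySQL integer type.
for name, size in [("TINYINT", 1), ("SMALLINT", 2), ("MEDIUMINT", 3),
                   ("INT", 4), ("BIGINT", 8)]:
    print(name, int_range(size, signed=False), int_range(size, signed=True))
```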
The following examples will make things quite clear on declaring Data Types within SQL
Steps in Creating the Database using MySQL
From the shell prompt (either in DOS or UNIX):
mysqladmin create contacts;
This will create an empty database called "contacts".
Now run the command line tool "mysql" and from the mysql prompt do the following:
mysql> use contacts;
(You'll get the response "Database changed")
The following commands entered into the MySQL prompt will create the tables in the database.
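mysql> CREATE TABLE names (contact_id SMALLINT NOT NULL AUTO_INCREMENT
PRIMARY KEY, FirstName CHAR(20), LastName CHAR(20), BirthDate DATE);
mysql> CREATE TABLE address (contact_id SMALLINT NOT NULL PRIMARY KEY,
StreetAddress CHAR(50), City CHAR(20), State CHAR(20), Zip CHAR(10), Country
CHAR(20));
mysql> CREATE TABLE telephones (contact_id SMALLINT NOT NULL PRIMARY KEY,
TelephoneHome CHAR(20), TelephoneWork CHAR(20));
mysql> CREATE TABLE email (contact_id SMALLINT NOT NULL PRIMARY KEY,
Email CHAR(20));
mysql> CREATE TABLE company_details (contact_id SMALLINT NOT NULL PRIMARY
KEY, CompanyName CHAR(25), Designation CHAR(20));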
Note: Here we assume that one person will have only one email address. If one person had
multiple email addresses, this design would be a problem. We'd need another field holding values
that indicate to whom each email address belongs. In this particular case email data ownership is
indicated by the primary key. The same is true for telephones. We are assuming that one person has
only one home telephone and one work telephone number. This need not be true. Similarly one
person could work for multiple companies at the same time, holding two different designations. In
all these cases an extra field will solve the issue. For now, however, let's work with this small
design.
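One way the "extra field" idea could look is sketched below: the email table gets its own key, and contact_id becomes an ordinary column saying whose address each row is. This is an assumption about the redesign, not part of the tutorial's schema; the column `email_id` and the second address are hypothetical, and SQLite (via Python's sqlite3 module) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE names (contact_id INTEGER PRIMARY KEY, FirstName TEXT);
-- email_id identifies the row; contact_id says whose address it is.
CREATE TABLE email (email_id INTEGER PRIMARY KEY,
                    contact_id INTEGER NOT NULL,
                    Email TEXT);
INSERT INTO names VALUES (1, 'Yamila');
INSERT INTO email (contact_id, Email) VALUES
    (1, 'yamila@yamila.com'),
    (1, 'yamila@example.com');
""")

# One person, two addresses: both rows carry the same contact_id.
addresses = [row[0] for row in conn.execute(
    "SELECT Email FROM email WHERE contact_id = 1 ORDER BY email_id")]
```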
The relationships between columns located in different tables are usually described through the use
of keys.
As you can see we have a PRIMARY KEY in each table. The Primary key serves as a mechanism
to refer to other fields within the same row. In this case, the Primary key is used to identify a
relationship between a row under consideration and the person whose name is located inside the
'names' table. We use the AUTO_INCREMENT statement only for the 'names' table as we need to
use the generated contact_id number in all the other tables for identification of the rows.
This type of table design where one table establishes a relationship with several other tables is
known as a 'one to many' relationship.
In a 'many to many' relationship we could have several Auto Incremented Primary Keys in various
tables with several inter-relationships.
Foreign Key:
A foreign key is a field in a table which is also the Primary Key in another table. The rule that
every foreign key value must match an existing primary key value is commonly known as
'referential integrity'.
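A minimal sketch of referential integrity being enforced, assuming SQLite (through Python's sqlite3 module) in place of MySQL. The REFERENCES clause and the PRAGMA line are SQLite-specific details of this sketch; the tutorial's own CREATE TABLE statements declare no foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked to
conn.execute("CREATE TABLE names (contact_id INTEGER PRIMARY KEY, FirstName TEXT)")
conn.execute("""CREATE TABLE email (
    contact_id INTEGER PRIMARY KEY REFERENCES names (contact_id),
    Email TEXT)""")
conn.execute("INSERT INTO names VALUES (1, 'Yamila')")
conn.execute("INSERT INTO email VALUES (1, 'yamila@yamila.com')")  # parent row exists

try:
    # There is no contact 99 in names, so this row has nothing to refer to.
    conn.execute("INSERT INTO email VALUES (99, 'nobody@example.com')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

email_rows = conn.execute("SELECT count(*) FROM email").fetchone()[0]
```

The second insert is refused, so the email table never holds an address without an owner.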
Execute the following commands to see the newly created tables and their contents.
To see the tables inside the database:
mysql> SHOW TABLES;
| Tables in contacts |
| address |
| company_details |
| email |
| names |
| telephones |
5 rows in set (0.00 sec)
To see the columns within a particular table:
mysql> SHOW COLUMNS FROM address;
| Field | Type | Null | Key | Default | Extra | Privileges |
| contact_id | smallint(6) | | PRI | 0 | | select,insert,update,references |
| StreetAddress | char(50) | YES | | NULL | | select,insert,update,references |
| City | char(20) | YES | | NULL | | select,insert,update,references |
| State | char(20) | YES | | NULL | | select,insert,update,references |
| Zip | char(10) | YES | | NULL | | select,insert,update,references |
| Country | char(20) | YES | | NULL | | select,insert,update,references |
6 rows in set (0.00 sec)
So we have the tables created and ready. Now we put in some data.
Let's start with the 'names' table as it uses a unique AUTO_INCREMENT field which in turn is
used in the other tables.
Inserting data, one row at a time:
mysql> INSERT INTO names (FirstName, LastName, BirthDate) VALUES ('Yamila','Diaz',
'1974-10-13');
Query OK, 1 row affected (0.00 sec)
Inserting multiple rows at a time:
mysql> INSERT INTO names (FirstName, LastName, BirthDate) VALUES
('Nikki','Taylor','1972-03-04'),('Tia','Carrera','1975-09-18');
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0
Let's see what the data looks like inside the table. We use the SELECT command for this.
mysql> SELECT * FROM names;
| contact_id | FirstName | LastName | BirthDate |
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
3 rows in set (0.06 sec)
Try another handy command called 'DESCRIBE'.
mysql> DESCRIBE names;
| Field | Type | Null | Key | Default | Extra | Privileges
| contact_id | smallint(6) | | PRI | NULL | auto_increment | select,insert,update,references |
| FirstName | char(20) | YES | | NULL | | select,insert,update,references |
| LastName | char(20) | YES | | NULL | | select,insert,update,references |
| BirthDate | date | YES | | NULL | | select,insert,update,references |
4 rows in set (0.00 sec)
Now let's populate the other tables. Observe the syntax used.
mysql> INSERT INTO address(contact_id, StreetAddress, City, State, Zip, Country)
VALUES ('1', '300 Yamila Ave.', 'Los Angeles', 'CA', '300012', 'USA'),('2','4000 Nikki
St.','Boca Raton','FL','500034','USA'),('3','404 Tia Blvd.','New York','NY','10011','USA');
Query OK, 3 rows affected (0.05 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM address;
| contact_id | StreetAddress | City | State | Zip | Country |
| 1 | 300 Yamila Ave. | Los Angeles | CA | 300012 | USA |
| 2 | 4000 Nikki St. | Boca Raton | FL | 500034 | USA |
| 3 | 404 Tia Blvd. | New York | NY | 10011 | USA |
3 rows in set (0.00 sec)
mysql> INSERT INTO company_details (contact_id, CompanyName, Designation) VALUES
('1','Xerox','New Business Manager'), ('2','Cabletron','Customer Support Eng'),
('3','Apple','Sales Manager');
Query OK, 3 rows affected (0.05 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM company_details;
| contact_id | CompanyName | Designation |
| 1 | Xerox | New Business Manager |
| 2 | Cabletron | Customer Support Eng |
| 3 | Apple | Sales Manager |
3 rows in set (0.06 sec)
mysql> INSERT INTO email (contact_id, Email) VALUES ('1', 'yamila@yamila.com'),
('2', 'nikki@nikki.com'),('3', 'tia@tia.com');
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM email;
| contact_id | Email |
| 1 | yamila@yamila.com |
| 2 | nikki@nikki.com |
| 3 | tia@tia.com |
3 rows in set (0.06 sec)
mysql> INSERT INTO telephones (contact_id, TelephoneHome, TelephoneWork) VALUES
('1','333-50000','333-60000'),('2','444-70000','444-80000'),('3','555-30000','555-40000');
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM telephones;
| contact_id | TelephoneHome | TelephoneWork |
| 1 | 333-50000 | 333-60000 |
| 2 | 444-70000 | 444-80000 |
| 3 | 555-30000 | 555-40000 |
3 rows in set (0.00 sec)
Okay, so we now have all our data ready for experimentation.
Before we start experimenting with manipulating the data let's look at how MySQL stores the
data.
To do this execute the following command from the shell prompt.
mysqldump contacts > contacts.sql
Note: The reverse operation for this command is:
mysql contacts < contacts.sql
The file generated is a text file that contains all the data and SQL instructions needed to recreate
the same database. As you can see, the SQL here is slightly different from what was typed in. Don't
worry about this. It's all good ! It should also be obvious that this is a good way to back up your
data.
# MySQL dump 8.2
# Host: localhost Database: contacts
# Server version 3.22.34-shareware-debug
# Table structure for table 'address'
CREATE TABLE address (
contact_id smallint(6) DEFAULT '0' NOT NULL,
StreetAddress char(50),
City char(20),
State char(20),
Zip char(10),
Country char(20),
PRIMARY KEY (contact_id)
);
# Dumping data for table 'address'
INSERT INTO address VALUES (1,'300 Yamila Ave.','Los Angeles','CA','300012','USA');
INSERT INTO address VALUES (2,'4000 Nikki St.','Boca Raton','FL','500034','USA');
INSERT INTO address VALUES (3,'404 Tia Blvd.','New York','NY','10011','USA');
# Table structure for table 'company_details'
CREATE TABLE company_details (
contact_id smallint(6) DEFAULT '0' NOT NULL,
CompanyName char(25),
Designation char(20),
PRIMARY KEY (contact_id)
);
# Dumping data for table 'company_details'
INSERT INTO company_details VALUES (1,'Xerox','New Business Manager');
INSERT INTO company_details VALUES (2,'Cabletron','Customer Support Eng');
INSERT INTO company_details VALUES (3,'Apple','Sales Manager');
# Table structure for table 'email'
CREATE TABLE email (
contact_id smallint(6) DEFAULT '0' NOT NULL,
Email char(20),
PRIMARY KEY (contact_id)
);
# Dumping data for table 'email'
INSERT INTO email VALUES (1,'yamila@yamila.com');
INSERT INTO email VALUES (2,'nikki@nikki.com');
INSERT INTO email VALUES (3,'tia@tia.com');
# Table structure for table 'names'
CREATE TABLE names (
contact_id smallint(6) DEFAULT '0' NOT NULL auto_increment,
FirstName char(20),
LastName char(20),
BirthDate date,
PRIMARY KEY (contact_id)
);
# Dumping data for table 'names'
INSERT INTO names VALUES (3,'Tia','Carrera','1975-09-18');
INSERT INTO names VALUES (2,'Nikki','Taylor','1972-03-04');
INSERT INTO names VALUES (1,'Yamila','Diaz','1974-10-13');
# Table structure for table 'telephones'
CREATE TABLE telephones (
contact_id smallint(6) DEFAULT '0' NOT NULL,
TelephoneHome char(20),
TelephoneWork char(20),
PRIMARY KEY (contact_id)
);
# Dumping data for table 'telephones'
INSERT INTO telephones VALUES (1,'333-50000','333-60000');
INSERT INTO telephones VALUES (2,'444-70000','444-80000');
INSERT INTO telephones VALUES (3,'555-30000','555-40000');
Let's try some SELECT statement variations:
To select all names whose corresponding contact_id is greater than 1.
mysql> SELECT * FROM names WHERE contact_id > 1;
| contact_id | FirstName | LastName | BirthDate |
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
2 rows in set (0.00 sec)
As a condition we can also use IS NOT NULL. This statement will return all names for which a
contact_id exists.
mysql> SELECT * FROM names WHERE contact_id IS NOT NULL;
| contact_id | FirstName | LastName | BirthDate |
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
3 rows in set (0.06 sec)
Results can be arranged in a particular order using the ORDER BY clause.
mysql> SELECT * FROM names WHERE contact_id IS NOT NULL ORDER BY
LastName;
| contact_id | FirstName | LastName | BirthDate |
| 3 | Tia | Carrera | 1975-09-18 |
| 1 | Yamila | Diaz | 1974-10-13 |
| 2 | Nikki | Taylor | 1972-03-04 |
3 rows in set (0.06 sec)
'asc' and 'desc' stand for ascending and descending respectively and can be used to arrange the
results.
mysql> SELECT * FROM names WHERE contact_id IS NOT NULL ORDER BY LastName
desc;
| contact_id | FirstName | LastName | BirthDate |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
| 3 | Tia | Carrera | 1975-09-18 |
3 rows in set (0.04 sec)
You can also place date types into conditional statements.
mysql> SELECT * FROM names WHERE BirthDate > '1973-03-06';
| contact_id | FirstName | LastName | BirthDate |
| 3 | Tia | Carrera | 1975-09-18 |
| 1 | Yamila | Diaz | 1974-10-13 |
2 rows in set (0.00 sec)
LIKE is a statement to match field values using wildcards. The % sign is used for denoting
wildcards and can represent multiple characters.
mysql> SELECT FirstName, LastName FROM names WHERE LastName LIKE 'C%';
| FirstName | LastName |
| Tia | Carrera |
1 row in set (0.06 sec)
'_' is a wildcard that represents exactly one character.
mysql> SELECT FirstName, LastName FROM names WHERE LastName LIKE '_iaz';
| FirstName | LastName |
| Yamila | Diaz |
1 row in set (0.00 sec)
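The two wildcards can be compared side by side in a short sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the extra last name 'Cruz' is a hypothetical row added so that 'C%' matches more than one value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (LastName TEXT)")
conn.executemany("INSERT INTO names VALUES (?)",
                 [("Diaz",), ("Taylor",), ("Carrera",), ("Cruz",)])

def matching(pattern):
    """Return the last names that match a LIKE pattern, in alphabetical order."""
    return [row[0] for row in conn.execute(
        "SELECT LastName FROM names WHERE LastName LIKE ? ORDER BY LastName",
        (pattern,))]

starts_with_c = matching("C%")    # % stands for any number of characters
ends_with_iaz = matching("_iaz")  # _ stands for exactly one character
four_letters = matching("____")   # four underscores: exactly four characters
```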
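SQL Logical and Comparison Operators (operators of equal precedence are evaluated from Left to Right)
1. NOT or !
2. AND or &&
3. OR or ||
4. = : Equal
5. <> or != : Not Equal
6. <= : Less than or Equal
7. >= : Greater than or Equal
8. <, > : Less than, Greater than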
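Among the logical operators, AND binds more tightly than OR, so a condition that mixes the two is not simply read left to right. The sketch below shows the effect on the three sample last names; it runs on Python's sqlite3 module as a stand-in for MySQL, and the helper `last_names` is a hypothetical convenience function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (contact_id INTEGER, LastName TEXT)")
conn.executemany("INSERT INTO names VALUES (?, ?)",
                 [(1, "Diaz"), (2, "Taylor"), (3, "Carrera")])

def last_names(where):
    """Run a SELECT with the given WHERE condition and return the last names."""
    return [row[0] for row in conn.execute(
        "SELECT LastName FROM names WHERE " + where + " ORDER BY contact_id")]

# Without parentheses the AND part is evaluated first ...
unparenthesized = last_names(
    "LastName = 'Diaz' OR LastName = 'Taylor' AND contact_id > 2")
# ... so it means the same as this:
and_first = last_names(
    "LastName = 'Diaz' OR (LastName = 'Taylor' AND contact_id > 2)")
# Parentheses are needed to make the OR apply first.
or_first = last_names(
    "(LastName = 'Diaz' OR LastName = 'Taylor') AND contact_id > 2")
```

When in doubt, adding parentheses makes the intended grouping explicit.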
Here are some more variations with Logical Operators and using the 'IN' statement.
mysql> SELECT FirstName FROM names WHERE contact_id < 3 AND LastName LIKE 'D%';
| FirstName |
| Yamila |
1 row in set (0.00 sec)
mysql> SELECT contact_id FROM names WHERE LastName IN ('Diaz','Carrera');
| contact_id |
| 3 |
| 1 |
2 rows in set (0.02 sec)
To return the number of rows in a table
mysql> SELECT count(*) FROM names;
| count(*) |
| 3 |
1 row in set (0.02 sec)
mysql> SELECT count(FirstName) FROM names;
| count(FirstName) |
| 3 |
1 row in set (0.00 sec)
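The two counts agree here only because every row has a FirstName. count(*) counts rows, while count(column) skips rows where that column is NULL. A small sketch of the difference, assuming SQLite via Python's sqlite3 module and a hypothetical third row with no first name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (contact_id INTEGER, FirstName TEXT)")
conn.executemany("INSERT INTO names VALUES (?, ?)",
                 [(1, "Yamila"), (2, "Nikki"), (3, None)])  # third FirstName is NULL

# One query returns both counts so they can be compared directly.
all_rows, named_rows = conn.execute(
    "SELECT count(*), count(FirstName) FROM names").fetchone()
```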
To do some basic arithmetic aggregate functions.
mysql> SELECT SUM(contact_id) FROM names;
| SUM(contact_id) |
| 6 |
1 row in set (0.00 sec)
To select the largest value in a column. Substitute 'MIN' and see what happens next.
mysql> SELECT MAX(contact_id) FROM names;
| MAX(contact_id) |
| 3 |
1 row in set (0.00 sec)
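Take a look at the first query, which uses WHERE, and the second, which uses HAVING. On a plain
SELECT like this one both return the same rows; HAVING is normally used together with GROUP BY
to filter groups after aggregation.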
mysql> SELECT * FROM names WHERE contact_id >=1;
| contact_id | FirstName | LastName | BirthDate |
| 1 | Yamila | Diaz | 1974-10-13 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 3 | Tia | Carrera | 1975-09-18 |
3 rows in set (0.03 sec)
mysql> SELECT * FROM names HAVING contact_id >=1;
| contact_id | FirstName | LastName | BirthDate |
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
3 rows in set (0.00 sec)
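Where the two clauses really differ is in a grouped query: WHERE filters individual rows before they are grouped, and HAVING filters the groups afterwards. The following sketch assumes SQLite (through Python's sqlite3 module) and a hypothetical address table with five rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (contact_id INTEGER, State TEXT)")
conn.executemany("INSERT INTO address VALUES (?, ?)",
                 [(1, "CA"), (2, "FL"), (3, "NY"), (4, "CA"), (5, "CA")])

# WHERE drops contact 1 first; HAVING then keeps only states
# that still have at least two contacts.
busy_states = conn.execute("""
    SELECT State, count(*) AS contacts
    FROM address
    WHERE contact_id > 1
    GROUP BY State
    HAVING count(*) >= 2
""").fetchall()
```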
Now lets work with multiple tables and see how information can be pulled out of the data.
mysql> SELECT names.contact_id, FirstName, LastName, Email FROM names, email
WHERE names.contact_id = email.contact_id;
| contact_id | FirstName | LastName | Email |
| 1 | Yamila | Diaz | yamila@yamila.com |
| 2 | Nikki | Taylor | nikki@nikki.com |
| 3 | Tia | Carrera | tia@tia.com |
3 rows in set (0.11 sec)
mysql> SELECT DISTINCT names.contact_id, FirstName, Email, TelephoneWork FROM
names, email, telephones WHERE names.contact_id = email.contact_id AND
names.contact_id = telephones.contact_id;
| contact_id | FirstName | Email | TelephoneWork |
| 1 | Yamila | yamila@yamila.com | 333-60000 |
| 2 | Nikki | nikki@nikki.com | 444-80000 |
| 3 | Tia | tia@tia.com | 555-40000 |
3 rows in set (0.05 sec)
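A pitfall worth knowing when matching three tables: a chained condition of the form a = b = c is evaluated as (a = b) = c, so the true/false result of the first comparison (1 or 0) is what gets compared with c. The sketch below, which assumes SQLite via Python's sqlite3 module in place of MySQL, shows the chained form pairing every contact with the telephone row whose contact_id is 1, while two comparisons joined by AND give each contact their own number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE names (contact_id INTEGER, FirstName TEXT);
CREATE TABLE email (contact_id INTEGER, Email TEXT);
CREATE TABLE telephones (contact_id INTEGER, TelephoneWork TEXT);
INSERT INTO names VALUES (1,'Yamila'),(2,'Nikki'),(3,'Tia');
INSERT INTO email VALUES (1,'yamila@yamila.com'),(2,'nikki@nikki.com'),(3,'tia@tia.com');
INSERT INTO telephones VALUES (1,'333-60000'),(2,'444-80000'),(3,'555-40000');
""")

select = "SELECT FirstName, TelephoneWork FROM names, email, telephones WHERE "

# Chained equality: (names.contact_id = email.contact_id) = telephones.contact_id
chained = conn.execute(
    select + "names.contact_id = email.contact_id = telephones.contact_id "
             "ORDER BY names.contact_id").fetchall()

# Two separate comparisons joined by AND: what was actually intended.
explicit = conn.execute(
    select + "names.contact_id = email.contact_id "
             "AND names.contact_id = telephones.contact_id "
             "ORDER BY names.contact_id").fetchall()
```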
So what's a JOIN ?
JOIN is the action performed on multiple tables that returns a result as a table. It's what
makes a database 'relational'.
There are several types of joins. Let's look at LEFT JOIN (OUTER JOIN) and RIGHT JOIN.
Let's first check out the contents of the tables we're going to use:
mysql> SELECT * FROM names;
| contact_id | FirstName | LastName | BirthDate |
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
3 rows in set (0.00 sec)
mysql> SELECT * FROM email;
| contact_id | Email |
| 1 | yamila@yamila.com |
| 2 | nikki@nikki.com |
| 3 | tia@tia.com |
3 rows in set (0.00 sec)
mysql> SELECT * FROM names LEFT JOIN email USING (contact_id);
| contact_id | FirstName | LastName | BirthDate | contact_id | Email |
| 3 | Tia | Carrera | 1975-09-18 | 3 | tia@tia.com |
| 2 | Nikki | Taylor | 1972-03-04 | 2 | nikki@nikki.com |
| 1 | Yamila | Diaz | 1974-10-13 | 1 | yamila@yamila.com |
3 rows in set (0.16 sec)
To find the people who have a home phone number.
mysql> SELECT names.FirstName FROM names LEFT JOIN telephones ON
names.contact_id = telephones.contact_id WHERE TelephoneHome IS NOT NULL;
| FirstName |
| Tia |
| Nikki |
| Yamila |
3 rows in set (0.02 sec)
The same query with 'names' left out (FirstName instead of names.FirstName) is equivalent and
will generate the same result, because only one of the tables has a FirstName column.
mysql> SELECT FirstName FROM names LEFT JOIN telephones ON names.contact_id =
telephones.contact_id WHERE TelephoneHome IS NOT NULL;
| FirstName |
| Tia |
| Nikki |
| Yamila |
3 rows in set (0.00 sec)
And now a RIGHT JOIN:
mysql> SELECT * FROM names RIGHT JOIN email USING(contact_id);
| contact_id | FirstName | LastName | BirthDate | contact_id | Email |
| 1 | Yamila | Diaz | 1974-10-13 | 1 | yamila@yamila.com |
| 2 | Nikki | Taylor | 1972-03-04 | 2 | nikki@nikki.com |
| 3 | Tia | Carrera | 1975-09-18 | 3 | tia@tia.com |
3 rows in set (0.03 sec)
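With the sample data both joins return the same rows, because every name has an email and every email has a name. The difference shows up only when rows are missing on one side. The sketch below assumes SQLite via Python's sqlite3 module and hypothetical data in which two contacts have no email and one email has no contact; the right join is written as a LEFT JOIN with the tables swapped, since older SQLite versions have no RIGHT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE names (contact_id INTEGER PRIMARY KEY, FirstName TEXT);
CREATE TABLE email (contact_id INTEGER PRIMARY KEY, Email TEXT);
INSERT INTO names VALUES (1,'Yamila'),(2,'Nikki'),(3,'Tia');
INSERT INTO email VALUES (1,'yamila@yamila.com'),(4,'orphan@example.com');
""")

# names LEFT JOIN email: every row of names is kept,
# with NULL (None in Python) where no email matches.
left = conn.execute("""
    SELECT FirstName, Email FROM names
    LEFT JOIN email ON names.contact_id = email.contact_id
    ORDER BY names.contact_id""").fetchall()

# Equivalent of names RIGHT JOIN email: every row of email is kept.
right = conn.execute("""
    SELECT FirstName, Email FROM email
    LEFT JOIN names ON names.contact_id = email.contact_id
    ORDER BY email.contact_id""").fetchall()
```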
The BETWEEN conditional statement is used to select data where a value falls within a certain
range of values, including both end points. The following example illustrates its use.
mysql> SELECT * FROM names;
| contact_id | FirstName | LastName | BirthDate |
| 3 | Tia | Carrera | 1975-09-18 |
| 2 | Nikki | Taylor | 1972-03-04 |
| 1 | Yamila | Diaz | 1974-10-13 |
3 rows in set (0.06 sec)
mysql> SELECT FirstName, LastName FROM names WHERE contact_id BETWEEN 2
AND 3;
| FirstName | LastName |
| Tia | Carrera |
| Nikki | Taylor |
2 rows in set (0.00 sec)
The ALTER statement is used to add a new column to an existing table or to make changes to it.
mysql> ALTER TABLE names ADD Age SMALLINT;
Query OK, 3 rows affected (0.11 sec)
Records: 3 Duplicates: 0 Warnings: 0
Now let's take a look at the 'ALTER'ed Table.
mysql> SHOW COLUMNS FROM names;
| Field | Type | Null | Key | Default | Extra |
| contact_id | smallint(6) | | PRI | 0 | auto_increment |
| FirstName | char(20) | YES | | NULL | |
| LastName | char(20) | YES | | NULL | |
| BirthDate | date | YES | | NULL | |
| Age | smallint(6) | YES | | NULL | |
5 rows in set (0.06 sec)
But we don't require Age to be a SMALLINT type when a TINYINT would suffice. So we use
another ALTER statement.
mysql> ALTER TABLE names CHANGE COLUMN Age Age TINYINT;
Query OK, 3 rows affected (0.02 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SHOW COLUMNS FROM names;
| Field | Type | Null | Key | Default | Extra |
| contact_id | smallint(6) | | PRI | NULL | auto_increment |
| FirstName | char(20) | YES | | NULL | |
| LastName | char(20) | YES | | NULL | |
| BirthDate | date | YES | | NULL | |
| Age | tinyint(4) | YES | | NULL | |
5 rows in set (0.00 sec)
You can also use the MODIFY statement to change column data types.
mysql> ALTER TABLE names MODIFY COLUMN Age SMALLINT;
Query OK, 3 rows affected (0.03 sec)
Records: 3 Duplicates: 0 Warnings: 0
mysql> SHOW COLUMNS FROM names;
| Field | Type | Null | Key | Default | Extra |
| contact_id | smallint(6) | | PRI | NULL | auto_increment |
| FirstName | char(20) | YES | | NULL | |
| LastName | char(20) | YES | | NULL | |
| BirthDate | date | YES | | NULL | |
| Age | smallint(6) | YES | | NULL | |
5 rows in set (0.00 sec)
To Rename a Table:
mysql> ALTER TABLE names RENAME AS mynames;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW TABLES;
| Tables_in_contacts |
| address |
| company_details |
| email |
| mynames |
| telephones |
5 rows in set (0.00 sec)
We rename it back to the original name.
mysql> ALTER TABLE mynames RENAME AS names;
Query OK, 0 rows affected (0.01 sec)
The UPDATE command is used to set or change the value of a field in a table.
mysql> UPDATE names SET Age ='23' WHERE FirstName='Tia';
Query OK, 1 row affected (0.06 sec)
Rows matched: 1 Changed: 1 Warnings: 0
The table before the UPDATE:
mysql> SELECT * FROM names;
| contact_id | FirstName | LastName | BirthDate | Age |
| 3 | Tia | Carrera | 1975-09-18 | NULL |
| 2 | Nikki | Taylor | 1972-03-04 | NULL |
| 1 | Yamila | Diaz | 1974-10-13 | NULL |
3 rows in set (0.05 sec)
The table after the UPDATE:
mysql> SELECT * FROM names;
| contact_id | FirstName | LastName | BirthDate | Age |
| 3 | Tia | Carrera | 1975-09-18 | 23 |
| 2 | Nikki | Taylor | 1972-03-04 | NULL |
| 1 | Yamila | Diaz | 1974-10-13 | NULL |
3 rows in set (0.00 sec)
The DELETE command removes rows from a table. Without a WHERE clause it removes every row.
mysql> DELETE FROM names WHERE Age=23;
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM names;
| contact_id | FirstName | LastName | BirthDate | Age |
| 2 | Nikki | Taylor | 1972-03-04 | NULL |
| 1 | Yamila | Diaz | 1974-10-13 | NULL |
2 rows in set (0.00 sec)
mysql> DELETE FROM names;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM names;
Empty set (0.00 sec)
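The effect of the WHERE clause on UPDATE and DELETE can be checked programmatically. This is a sketch on SQLite through Python's sqlite3 module, not MySQL itself; `rowcount` is the attribute that module uses to report how many rows a statement changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE names (contact_id INTEGER PRIMARY KEY, FirstName TEXT, Age INTEGER)")
conn.executemany("INSERT INTO names (FirstName) VALUES (?)",
                 [("Yamila",), ("Nikki",), ("Tia",)])

# WHERE limits the change to the matching row.
updated = conn.execute("UPDATE names SET Age = 23 WHERE FirstName = 'Tia'").rowcount
deleted_one = conn.execute("DELETE FROM names WHERE Age = 23").rowcount
after_first_delete = conn.execute("SELECT count(*) FROM names").fetchone()[0]

# No WHERE clause: every remaining row is removed.
conn.execute("DELETE FROM names")
remaining = conn.execute("SELECT count(*) FROM names").fetchone()[0]
```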
One more destructive tool...
mysql> DROP TABLE names;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW TABLES;
| Tables in contacts |
| address |
| company_details |
| email |
| telephones |
4 rows in set (0.05 sec)
mysql> DROP TABLE address, company_details, email, telephones;
Query OK, 0 rows affected (0.06 sec)
mysql> SHOW TABLES;
Empty set (0.00 sec)
As you can see, the tables no longer exist. MySQL does not ask for confirmation before dropping
a table, so be careful.
Since version 3.23.23, Full Text Indexing and Searching has been available in MySQL.
FULLTEXT indexes can be created from VARCHAR and TEXT columns. FULLTEXT searches
are performed with the MATCH function. The MATCH function matches a natural language query
against a text collection and returns a relevance value for each row in a table. The resultant rows
are organized in order of relevance.
Full Text searches are a very powerful way to search through text, but they are not ideal for small
tables of text and may produce inconsistent results there. Ideally they are used with large
collections of textual data.
Optimizing your Database
Databases do tend to get large at some point or other, and that is when the issue of database
optimization arises. Queries are going to take longer and longer as the database grows, and certain
things can be done to speed things up.
The easiest method is that of 'clustering'. If you run a certain kind of query often, it will be
faster if the database contents are arranged in the same order in which the data is requested. To
keep the tables in a sorted order you need a clustering index. Some databases keep stuff sorted
automatically.
Ordered Indices
These are 'lookup' tables of a sort. For each column that may be of interest to you, you can
create an ordered index.
Note that these optimization techniques add to the system load, because each index has to be
updated every time the data is re-arranged.
There are additional methods, such as B-Trees and Hashing, which you may like to read up on but
which will not be discussed here.
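A minimal sketch of creating an ordered index and checking that a query uses it. It assumes SQLite via Python's sqlite3 module; EXPLAIN QUERY PLAN is SQLite's way of describing how a query will run (MySQL has its own EXPLAIN statement with different output), and the index name `idx_names_lastname` is a hypothetical choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (contact_id INTEGER PRIMARY KEY, LastName TEXT)")
conn.executemany("INSERT INTO names (LastName) VALUES (?)",
                 [("Diaz",), ("Taylor",), ("Carrera",)])

# An ordered index on the column we search by.
conn.execute("CREATE INDEX idx_names_lastname ON names (LastName)")

# Ask the engine how it would run the query; the last column of each
# returned row is a human-readable description of one step.
plan = " ".join(row[-1] for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT contact_id FROM names WHERE LastName = 'Diaz'"))
print(plan)
```

The printed plan names the index, which indicates the lookup goes through it instead of scanning the whole table.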
Replication is the term given to the process where databases synchronize with each other. In this
process one database updates its own data with respect to another, or with reference to certain
criteria for updates specified by the programmer. Replication can be used under various
circumstances, for example for safety and backup, or to provide a closer location to the database
for certain users.
What are Transactions ?
In an RDBMS, when several people access the same data or if a server dies in the middle of an
update, there has to be a mechanism to protect the integrity of the data. Such a mechanism is called
a Transaction. A transaction groups a set of database actions into a single, indivisible event. This
event either succeeds completely or fails completely, i.e. it either gets the whole job done or
leaves the data untouched.
The definition of a transaction can be summarised by the acronym 'ACID'.
(A) Atomicity: If an action consists of multiple steps, it's still considered as one operation.
(C) Consistency: The database exists in a valid and accurate operating state before and after a
transaction.
(I) Isolation: Processes within one transaction are independent and cannot interfere with those in
another transaction.
(D) Durability: Changes effected by a transaction are permanent.
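Atomicity can be illustrated with the classic funds transfer: either both account updates happen or neither does. The sketch below assumes SQLite through Python's sqlite3 module; the `accounts` table, its CHECK constraint and the `transfer` helper are hypothetical names invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(amount):
    """Move money from A to B; either both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'B'", (amount,))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'A'", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(30)       # fits within A's balance
failed = transfer(500)  # would overdraw A, so the credit to B is undone too
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, B's balance shows no trace of the credit that was applied before the error.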
To enable transactions a mechanism called 'Logging' needs to be introduced. Logging involves a
DBMS writing details of the tables, columns and results of a particular transaction, both before and
after, onto a log file. This log file is used in the process of recovery. To protect a certain
database resource (e.g. a table) from being used and written to simultaneously, several techniques
are used. One of them is 'Locking'; another is to put a 'time stamp' onto an action. In the case of
Locking, to complete an action, the DBMS would need to acquire locks on all resources needed to
complete the action. The locks are released only when the transaction is completed.
If a large number of tables were involved in a particular action, say 50, all 50 tables
would be locked till the transaction is completed.
To improve things a bit, there is another technique called 2 Phase Locking or 2PL. In this
method of locking, locks are acquired only when needed but are released only when the transaction
is completed.
This is done to make sure that altered data can be safely restored if the transaction fails for any
reason.
This technique can also result in problems such as "deadlocks".
In a deadlock, 2 processes requiring the same resources lock each other up, each preventing the
other from completing its action. Options here are to abort one of them, or to let the programmer
handle it.
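At the time this tutorial was written, MySQL implemented transactions by building the Berkeley DB
libraries into its own code, so it was the source version you'd want for such a MySQL installation.
Later MySQL versions provide transactions through the InnoDB storage engine. Read the MySQL
manual on implementing this.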
Beyond MySQL
What are Views ?
A view allows you to give a name to a query and then use that name as if it were a table of its own.
The view is known by the name used in your VIEW statement.
MySQL did not support views when this tutorial was written (they were added in MySQL 5.0). In
standard SQL a view is defined with a CREATE VIEW statement.
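A sketch of what a CREATE VIEW statement can look like, run here on SQLite through Python's sqlite3 module rather than on MySQL. The view name `contact_emails` is a hypothetical choice, and the tables are cut down to the columns the example needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE names (contact_id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE email (contact_id INTEGER PRIMARY KEY, Email TEXT);
INSERT INTO names VALUES (1,'Yamila','Diaz'),(2,'Nikki','Taylor');
INSERT INTO email VALUES (1,'yamila@yamila.com'),(2,'nikki@nikki.com');

-- The view stores the query, not a copy of the data.
CREATE VIEW contact_emails AS
    SELECT FirstName, LastName, Email
    FROM names JOIN email ON names.contact_id = email.contact_id;
""")

# The view is queried exactly like a table.
rows = conn.execute("SELECT * FROM contact_emails ORDER BY LastName").fetchall()
```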
What are Triggers ?
A trigger is a pre-programmed notification that performs a set of actions that may be commonly
required. Triggers can be programmed to execute certain actions before or after an event occurs.
Triggers are very useful as they increase efficiency and accuracy in performing operations on
databases, and they also increase productivity by reducing the time needed for application
development. Triggers do, however, carry a price in terms of processing overhead.
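As an illustration, the sketch below defines a trigger that writes a log entry after every insert. It uses SQLite's trigger syntax through Python's sqlite3 module, which differs in detail from other database systems; the `audit_log` table and the trigger name `log_new_contact` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE names (contact_id INTEGER PRIMARY KEY, FirstName TEXT);
CREATE TABLE audit_log (message TEXT);

-- Runs automatically after every INSERT on names.
CREATE TRIGGER log_new_contact AFTER INSERT ON names
BEGIN
    INSERT INTO audit_log VALUES ('added ' || NEW.FirstName);
END;
""")

# The application only inserts the contact; the trigger writes the log row.
conn.execute("INSERT INTO names (FirstName) VALUES ('Yamila')")
log = [row[0] for row in conn.execute("SELECT message FROM audit_log")]
```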
What are Procedures ?
Like triggers, Procedures or 'Stored' Procedures are productivity enhancers. Suppose you needed to
perform an action using a programming interface to the database in, say, PERL and ASP. If a
programmed action could be stored at the database level, it would have to be written only
once and could be called by any programming language interacting with the database.
Procedures are run by calling them explicitly, and they can also be invoked from triggers.
Beyond RDBMS
Distributed Databases (DDB)
A distributed database is a collection of several logically interrelated databases located at multiple
locations of a computer network. A distributed database management system permits the
management of such a database and makes the operation transparent to the user. Good examples of
distributed databases would be those utilized by banks and multinational firms with several office
locations, where each distributed data system works only with the data that is relevant to its
operations. DDBs have the full functionality of any DBMS. It's also important to know that a
distributed database is considered to be actually one database rather than discrete files, and data
within distributed databases are logically interrelated.
Object Database Management Systems or ODBMS
When the capabilities of a database are integrated with object programming language
capabilities, the resulting product is an ODBMS. Database objects appear as programming
objects in an ODBMS. Using an ODBMS offers several advantages. The ones that can be most
readily appreciated are:
1. Efficiency
When you use an ODBMS, you use data in the same form in which you store it. You need less code
because you are not dependent on an intermediary like SQL or ODBC, and you can create highly
complex data structures directly through your programming language.
2. Speed
When data is stored the way you'd like it to be stored (i.e. natively), performance can improve
considerably because no to-and-fro translation between objects and tables is required.
A Quick Tutorial on Database Normalization
Let's start off by taking some data represented in a Table.
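Table Name: College Table
[Table body not recoverable from the source. As the discussion below indicates, each row combined
student information (including AdvisorName) with a repeated set of course columns (Course Title,
Course Professor, CourseID). Surviving cell values: Don Corleone, Daffy Duck, DJ Tiesto, Lara Croft,
Bill Clinton, Papa Smurf, "Seven of"; CS001, CS003, CS004, CS789; OpenGL, "ming 1".]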
The First Normal Form: (Each Column Type is Unique and there are no repeating groups
[types] of data)
This essentially means that you identify data that can exist as a separate table, which reduces
repetition and the width of the original table.
We can see that for every student, Course Information is repeated for each course. So if a student
has three courses, you'll need to add another set of columns for Course Title, Course Professor and
CourseID. So Student information and Course Information can be considered to be two broad
groups, each of which can be moved into its own table.
Table Name: Student
StudentID (Primary Key)
Table Name: Course
CourseID (Primary Key)
It's obvious that we have here a Many to Many relationship between Students and Courses.
Note: In a Many to Many relationship we need something called a relating table, which
contains information exclusively on which relationships exist between two tables. In a One to Many
relationship we use a foreign key.
So in this case we need another little table called: Students and Courses
Table Name: Students and Courses
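The relating table described above can be sketched as follows, using Python's built-in sqlite3 module. This is an illustration under stated assumptions, not the tutorial's own schema: the column names other than StudentID and CourseID, and all sample rows, are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, student_name TEXT);
    CREATE TABLE course (course_id TEXT PRIMARY KEY, course_title TEXT);

    -- Relating table: one row per (student, course) pair and nothing else.
    CREATE TABLE students_and_courses (
        student_id INTEGER REFERENCES student(student_id),
        course_id  TEXT    REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );

    INSERT INTO student VALUES (1, 'Student A'), (2, 'Student B');
    INSERT INTO course VALUES ('C1', 'Course One'), ('C2', 'Course Two');
    INSERT INTO students_and_courses VALUES (1, 'C1'), (1, 'C2'), (2, 'C1');
""")

# Joining through the relating table lists who takes what.
enrolments = conn.execute("""
    SELECT s.student_name, c.course_title
    FROM students_and_courses sc
    JOIN student s ON s.student_id = sc.student_id
    JOIN course  c ON c.course_id  = sc.course_id
    ORDER BY s.student_id, c.course_id
""").fetchall()
print(enrolments)
```

A student taking a third course now needs one more row in the relating table rather than another set of columns.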
The Second Normal Form: (All attributes within the entity should depend solely on the
entity's unique identifier)
The AdvisorName under Student Information does not depend on the StudentID. Therefore it can
be moved to its own table.
Table Name: Student
StudentID (Primary Key)
Table Name: Advisor
Table Name: Course
CourseID (Primary Key)
Table Name: Students and Courses
Note: Relating Tables can be created as required.
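As a rough sketch of this step, the advisor can be moved into its own table and referenced from Student through a foreign key (a One to Many relationship). The key column advisor_id, the other column names and the sample rows are assumptions made for illustration; the source does not list the Advisor table's columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE advisor (advisor_id INTEGER PRIMARY KEY, advisor_name TEXT);
    CREATE TABLE student (
        student_id   INTEGER PRIMARY KEY,
        student_name TEXT,
        advisor_id   INTEGER REFERENCES advisor(advisor_id)  -- foreign key
    );

    INSERT INTO advisor VALUES (10, 'Advisor X');
    INSERT INTO student VALUES (1, 'Student A', 10), (2, 'Student B', 10);

    -- The advisor's name is stored once, so one UPDATE corrects it for every student.
    UPDATE advisor SET advisor_name = 'Advisor Y' WHERE advisor_id = 10;
""")

advisor_names = conn.execute("""
    SELECT s.student_name, a.advisor_name
    FROM student s JOIN advisor a ON a.advisor_id = s.advisor_id
    ORDER BY s.student_id
""").fetchall()
print(advisor_names)
```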
The Third Normal Form: (no column entry should be dependent on any other entry (value)
other than the key for the table)
In simple terms - a table should contain information about only one thing.
In Course Information, we can pull CourseProfessor information out and store it in another table.
Table Name: Student
StudentID (Primary Key)
Table Name: Advisor
Table Name: Course
CourseID (Primary Key)
Table Name: Professor
Table Name: Students and Courses
Note: Relating Tables can be created as required.
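The five tables can be put together in one hedged sketch, again with Python's sqlite3 module. Only StudentID and CourseID appear as keys in the source; the remaining column names, the advisor_id and professor_id keys, and the sample rows are assumptions. The final query shows that the original wide rows can still be rebuilt with joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE advisor   (advisor_id INTEGER PRIMARY KEY, advisor_name TEXT);
    CREATE TABLE professor (professor_id INTEGER PRIMARY KEY, professor_name TEXT);
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY, student_name TEXT,
        advisor_id INTEGER REFERENCES advisor(advisor_id));
    CREATE TABLE course (
        course_id TEXT PRIMARY KEY, course_title TEXT,
        professor_id INTEGER REFERENCES professor(professor_id));
    CREATE TABLE students_and_courses (
        student_id INTEGER REFERENCES student(student_id),
        course_id  TEXT    REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id));

    INSERT INTO advisor VALUES (1, 'Advisor X');
    INSERT INTO professor VALUES (1, 'Professor P'), (2, 'Professor Q');
    INSERT INTO student VALUES (1, 'Student A', 1);
    INSERT INTO course VALUES ('C1', 'Course One', 1), ('C2', 'Course Two', 2);
    INSERT INTO students_and_courses VALUES (1, 'C1'), (1, 'C2');
""")

# Rebuild the original wide rows from the normalized tables.
wide_rows = conn.execute("""
    SELECT s.student_name, a.advisor_name, c.course_id, c.course_title, p.professor_name
    FROM students_and_courses sc
    JOIN student   s ON s.student_id   = sc.student_id
    JOIN advisor   a ON a.advisor_id   = s.advisor_id
    JOIN course    c ON c.course_id    = sc.course_id
    JOIN professor p ON p.professor_id = c.professor_id
    ORDER BY s.student_id, c.course_id
""").fetchall()
print(wide_rows)
```

Each table now describes only one thing, and no information from the original table is lost.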