
Adbms Unit 1


Established as per the Section 2(f) of the UGC Act, 1956

Approved by AICTE, COA and BCI, New Delhi

Advanced DBMS
Course code: M21DES212

School of Computer Science and


Applications
Abhay Kumar Srivastav
LECTURE -1

Agenda
 Importance of the subject  Data

 Prerequisites  Database
 Objectives  DBMS
 Course Content  RDBMS
 Course Outcome  DBMS - Storage System
IMPORTANCE OF THE COURSE

1. The DBMS enforces integrity constraints and provides a high level of
protection against unauthorized access to data.

2. A DBMS offers a variety of techniques to store and retrieve data.

3. A DBMS serves as an efficient handler that balances the needs of multiple
applications using the same data.

4. A DBMS provides powerful functions for storing and retrieving data
efficiently.
PREREQUISITES

1. Basic Knowledge of Database

2. database relationships

3. memory management in OS
OBJECTIVES

1. Introduce various methodologies used in the storage of data and indexing
2. Give students in-depth knowledge of tree-based and hash-based indexing
3. Familiarize students with query evaluation and optimization
4. Provide in-depth knowledge of parallel and distributed databases
5. Introduce new advancements in databases
COURSE CONTENT

UNIT 1:
Overview of Storage and Indexing
Memory hierarchy: RAID,Disk space management, Buffer manager: Files of records; Page
formats and record format, Structured Indexing, Data on external storage; File organizations
and Indexing, Index data structures; Comparison of file organizations; Indexes and
performance tuning. Intuition for tree indexes; Indexed sequential access method; B+trees ,
Hash-Based Indexing.
UNIT 2:
Overview of Query Evaluation, External Sorting and Relational Query Optimizer
The system catalog, Introduction to operator evaluation; Algorithm for relational operations;
Introduction to query optimization; When does a DBMS sort data? A simple two-way merge
sort; External merge sort, Evaluating Relational Operators The Selection operation; General
selection conditions; The Projection operation; The Join operation; The Set operations;
Aggregate operations; The impact of buffering..
COURSE CONTENT CONTD..
UNIT 3: Concurrency Control: Serializability and Transaction processing:
Enforcing Serializability by Locks, Locking Systems With Several Lock Modes,
Architecture for a Locking Scheduler Managing . Transaction processing: Introduction
of transaction processing, Advantages and Disadvantages of transaction processing
system, online transaction processing system, resolving deadlock, Transaction
management in multi-database system, long duration transaction, high-performance
transaction system.
UNIT 4: Parallel and Distributed Databases and XML data
Architectures for parallel databases; Parallel query evaluation; Parallelizing individual
operations; Parallel query optimizations; Introduction to distributed databases;
Distributed DBMS architectures; Storing data in a Distributed DBMS; Information
retrieval and XML data: Colliding Worlds: Databases, IR, and XML, Introduction to
Information Retrieval, Indexing for Text Search, Web Search Engines, Managing Text in
a DBMS, A Data Model for XML, XQuery: Querying XML Data. Mobile databases,
Multimedia databases, geographic databases, temporal databases, biological
databases
COURSE OUTCOME

1. Understand and appreciate the Tree-based and Hash-based Indexing.


2. Evaluate and optimize the relational query
3. Compare and contrast parallel databases, distributed databases and XML databases
4. Model and represent the real world data using object oriented database
5. Identify various emerging technologies in the database field
TEXT BOOKS AND REFERENCE
BOOKS

Text book/s:
1. Raghu Ramakrishnan and Johannes Gehrke: Database Management
Systems, 3rd Edition, McGraw-Hill,
2003[Chapters:8,9,10,11,12,13,14,22,23,27,29]
References:
1. Michael Rosenblum and Paul Dorsey: PL/SQL For Dummies, Wiley
Publishing, 2006.
2. Elmasri and Navathe: Fundamentals of Database Systems, 5th Edition,
Pearson Education, 2007.
3. Connolly and Begg: Database Systems, 4th Edition, Pearson Education,
2002.
DEFINITION

 Data
 Database
 A DATABASE MANAGEMENT SYSTEM
 Drawback of File System
DBMS - STORAGE SYSTEM

Databases are stored in file formats, which contain records. At physical level,
the actual data is stored in electromagnetic format on some device. These
storage devices can be broadly categorized into three types −
Primary Storage - The memory storage that is directly accessible to the CPU
comes under this category.
Secondary Storage − Secondary storage devices are used to store data for
future use or as backup.
DBMS - STORAGE SYSTEM

Tertiary Storage − Tertiary storage is used to store huge volumes of data.
MAGNETIC DISK STRUCTURE

Most of the secondary storage is in the form of magnetic disks.


 A magnetic disk contains several platters.
 Each platter is divided into circular shaped tracks.
 The length of the tracks near the centre is less than the length of the tracks
farther from the center. Each track is further divided into sectors
MAGNETIC DISK STRUCTURE

The speed of the disk is measured in two parts:

 Transfer rate: the rate at which data moves from the disk to the computer.
 Random access time: the sum of the seek time and the rotational latency.
 Seek time is the time taken by the arm to move to the required track.
 Rotational latency is the time taken for the required sector to rotate
under the head.
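As a rough worked example of these definitions, the Python sketch below adds up seek time, rotational latency (the wait for the required sector to rotate under the head, taken here as half a rotation on average) and transfer time. The function names and the figures in the usage note (9 ms seek, 7200 rpm, 100 MB/s) are illustrative assumptions, not values from the lecture.

```python
def average_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in ms: half of one full rotation (assumption)."""
    ms_per_rotation = 60_000.0 / rpm
    return ms_per_rotation / 2.0


def random_access_time_ms(seek_ms: float, rpm: float) -> float:
    """Random access time = seek time + rotational latency."""
    return seek_ms + average_rotational_latency_ms(rpm)


def read_time_ms(seek_ms: float, rpm: float, size_mb: float,
                 transfer_mb_per_s: float) -> float:
    """Time to reach the data plus the time to transfer size_mb megabytes."""
    transfer_ms = size_mb / transfer_mb_per_s * 1000.0
    return random_access_time_ms(seek_ms, rpm) + transfer_ms
```

For instance, random_access_time_ms(9.0, 7200) combines an assumed 9 ms seek with the latency of a 7200 rpm disk; read_time_ms then adds the transfer time for the requested amount of data.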
SUMMARY

 Data
 Database
 DBMS
 RDBMS
 DBMS - Storage System
QUIZ

1- The storage device that uses rigid, permanently installed magnetic disks
to store data is
1. Floppy
2. Permanent disk
3. Optical disk
4. Hard disk
ANSWER: Hard disk

2- Which of the following is the secondary storage device that uses a long
plastic strip coated with a magnetic material as recording medium?
1. Compact disk
2. Hard disk
3. Magnetic tape
4. None of the above
ANSWER: Magnetic tape
LECTURE -2

OBJECTIVE

 DATABASE  Disk Array


 DATABASE MANAGEMENT  Data striping
SYSTEM
 Reliability
 Storage System  RAID
 Redundancy
DISK ARRAY

 A disk array is an arrangement of several disks, organized to increase
performance and improve reliability of the resulting storage system.

 Performance is increased through data striping.
DATA STRIPING

 Data striping distributes data over several disks to give the impression of
having a single large, very fast disk.
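A minimal sketch of how striping can place data, assuming a simple round-robin scheme in which logical block i goes to disk i mod D. The function names are hypothetical, and real arrays stripe in units of bits, bytes or blocks depending on the RAID level.

```python
def locate_block(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk number, block offset on that disk)."""
    disk = logical_block % num_disks      # round-robin choice of disk
    offset = logical_block // num_disks   # position within that disk
    return disk, offset


def stripe(blocks: list, num_disks: int) -> list[list]:
    """Distribute a sequence of blocks over num_disks disks."""
    disks = [[] for _ in range(num_disks)]
    for i, block in enumerate(blocks):
        disk, _ = locate_block(i, num_disks)
        disks[disk].append(block)
    return disks
```

Because consecutive blocks land on different disks, a large sequential read can be served by several disks in parallel, which is where the "single large, very fast disk" impression comes from.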
MEMORY HIERARCHY: RAID (REDUNDANT ARRAYS OF INDEPENDENT DISKS)

 Reliability is improved through redundancy. Instead of having a single copy
of the data, redundant information is maintained.

 Disk arrays that implement a combination of data striping and redundancy are
called redundant arrays of independent disks, or RAID for short.
REDUNDANCY

 Reliability of a disk array can be increased by storing redundant information.
If a disk fails, the redundant information is used to reconstruct the data on
the failed disk. Redundancy can greatly increase the MTTF (mean time to
failure) of a disk array.

 When incorporating redundancy into a disk array design, we have to make two
choices: where to store the redundant information, and how to compute it.

 For the first choice, we can either store the redundant information on a small
number of check disks or distribute it uniformly over all disks.
LEVELS OF REDUNDANCY
Level 0: Nonredundant
 RAID level 0 provides data striping, i.e., data is placed across multiple
disks. Because it stores no redundant information, if one disk fails then
all data in the array is lost.
 This level doesn't provide fault tolerance but increases system
performance.
LEVELS OF REDUNDANCY
Advantages of RAID 0:
• Throughput is increased because multiple data requests are probably not on
the same disk.
• It fully utilizes the disk space and provides high performance.
• It requires a minimum of 2 drives.
Disadvantages of RAID 0:
• It doesn't contain any error detection mechanism.
• RAID 0 is not a true RAID because it is not fault-tolerant.
• Failure of any disk results in complete data loss in the respective array.
LEVELS OF REDUNDANCY:
Level 1: Mirrored
 This level is called mirroring of data, as it copies the data from drive 1 to
drive 2. It provides 100% redundancy in case of a failure.
 Only half of the total disk space is used to store data. The other half is
just a mirror of the data already stored.
LEVELS OF REDUNDANCY:
Level 1: Mirrored
Advantages of RAID 1:
• Fault tolerance
o if one disk fails, then the other automatically takes over.
o the array will function even if any one of the drives fails.
Disadvantages of RAID 1:
• One extra drive is required per drive for mirroring, so the expense is higher.
SUMMARY

 Disk Array
 Data striping
 Reliability
 RAID
 Redundancy
 RAID 0
 RAID 1
QUIZ

1- What is the minimum number of disks required for RAID 1?
a) 1
b) 2
c) 4
d) 5
ANSWER: b

2- Optical disk technology uses
a) Helical scanning
b) DAT
c) A laser beam
d) RAID
ANSWER: c
LECTURE -3

OBJECTIVE

 Disk Array  RAID-2


 Data striping  Advantages and Disadvantages
of RAID 2
 Reliability
 RAID-3
 RAID
 Advantages and Disadvantages
 Redundancy
of RAID 3
 RAID 0 and 1
 RAID-4
LEVELS OF REDUNDANCY:
Level 2: Error-Correcting Codes
RAID 2 consists of bit-level striping using Hamming code parity. Each
data bit in a word is recorded on a separate disk, and the ECC code of the data
words is stored on a different set of disks.
Due to its high cost and complex structure, this level is not
commercially used. The same performance can be achieved by RAID 3
at a lower cost.
MEMORY HIERARCHY: RAID

Level 2: Error-Correcting Codes

Advantages of RAID 2:
• Uses designated check disks to store the Hamming code (ECC) information.
• It uses the Hamming code for error detection and correction.
Disadvantages of RAID 2:
• It requires additional drives for error correction.
LEVELS OF REDUNDANCY:
Level 3: Bit-Interleaved Parity
 RAID 3 consists of byte-level striping with dedicated parity. The parity
information is computed for each disk section and written to a dedicated parity
drive.
 In case of a drive failure, the parity drive is accessed, and the data is
reconstructed from the remaining devices. Once the failed drive is replaced,
the missing data can be restored on the new drive.
 Data can be transferred in bulk, so high-speed data transmission is
possible.
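The reconstruction step can be sketched in a few lines of Python, assuming the usual scheme in which the parity block is the bitwise XOR of the corresponding data blocks. The function names are hypothetical and the blocks are short byte strings chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bitwise XOR of equally sized blocks, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_blocks: list[bytes]) -> bytes:
    """Parity block stored on the dedicated parity drive."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_blocks(surviving_blocks + [parity])
```

XOR-ing the parity with every surviving block cancels their contributions and leaves the lost block. The same arithmetic shows why a single parity block cannot recover two lost blocks at once.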
LEVELS OF REDUNDANCY:
Level 3: Bit-Interleaved Parity

Advantages of RAID 3:
• Data is regenerated using the parity drive.
• It provides high data transfer rates.
• Data is accessed in parallel.
Disadvantages of RAID 3:
• It requires an additional drive for parity.
• It gives slow performance when operating on small files.
LEVELS OF REDUNDANCY:
Level 4: Block Interleaved Parity
RAID 4 consists of block-level striping with a dedicated parity disk. Instead
of duplicating data, RAID 4 adopts a parity-based approach.
This level allows recovery from at most one disk failure because of the way
parity works. If more than one disk fails, there is no way to recover the
data.
Levels 3 and 4 both require at least three disks.
SUMMARY

 RAID-2
 Advantages and Disadvantages of RAID 2
 RAID-3
 Advantages and Disadvantages of RAID 3
 RAID-4
QUIZ

1- Which level of RAID refers to disk mirroring with block striping?
a) RAID level 1
b) RAID level 2
c) RAID level 0
d) RAID level 3
ANSWER: a

2- Which one of the following is a striping technique?
a) Byte level striping
b) Raid level striping
c) Disk level striping
d) Block level striping
ANSWER: d
LECTURE -4

OBJECTIVE

 RAID-2  RAID-5
 Advantages and  Advantages and
Disadvantages of RAID 2 Disadvantages of RAID 5
 RAID-3  RAID-6
 Advantages and  Advantages and
Disadvantages of RAID 3 Disadvantages of RAID 6
 RAID-4  RAID-1+0
LEVELS OF REDUNDANCY:
Level 5: Block-Interleaved Distributed Parity
 RAID 5 is a slight modification of the RAID 4 system. The only difference is
that in RAID 5, the parity rotates among the drives.
 It consists of block-level striping with distributed parity.
 As in RAID 4, this level allows recovery from at most one disk failure. If more
than one disk fails, the data cannot be recovered.
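To make "the parity rotates among the drives" concrete, here is a small Python sketch of one possible layout. The rotation rule (parity starts on the last disk and moves one disk to the left per stripe) and the function names are assumptions for illustration; actual RAID 5 implementations use several different layouts.

```python
def parity_disk(stripe: int, num_disks: int) -> int:
    """Disk holding the parity block of a stripe (hypothetical rotation rule)."""
    return (num_disks - 1 - stripe) % num_disks


def stripe_layout(stripe: int, num_disks: int) -> list[str]:
    """Label each disk's block in a stripe: 'P' for parity, 'Dn' for data block n."""
    p = parity_disk(stripe, num_disks)
    next_data = stripe * (num_disks - 1)  # each stripe holds num_disks - 1 data blocks
    layout = []
    for disk in range(num_disks):
        if disk == p:
            layout.append("P")
        else:
            layout.append(f"D{next_data}")
            next_data += 1
    return layout
```

With four disks, stripe 0 is laid out as D0 D1 D2 P and stripe 1 as D3 D4 P D5, so parity writes are spread over all disks instead of being concentrated on one.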
LEVELS OF REDUNDANCY:
Level 5: Block-Interleaved
Distributed Parity
Advantages of RAID 5:
• It is cost-effective and provides high performance.
• Parity is distributed across the disks in the array.
• Random write performance is better than in RAID 4, because parity writes
are not confined to a single disk.
Disadvantages of RAID 5:
• Recovery from a disk failure takes longer, as the lost data has to be
recomputed from all remaining drives.
• This level cannot survive concurrent drive failures.
LEVELS OF REDUNDANCY:
Level 6: P+Q Redundancy
 This level is an extension of RAID 5. It uses block-level striping with two
independent parity blocks per stripe.
 RAID 6 can survive two concurrent disk failures. With RAID 5 or RAID 1, a
failed disk must be replaced promptly, because if another disk fails at the
same time the data cannot be recovered. RAID 6 covers this case: the array
survives two concurrent disk failures before data is lost.
LEVELS OF REDUNDANCY:

RAID 10 (RAID 1+0)

Characteristics of RAID 10:
• It combines RAID 0 striping with RAID 1 mirroring.
• The number of drives required is a multiple of 2.
Disadvantages of RAID 10:
• It does not utilize 100% of the disk capacity, as half is used for mirroring.
• It offers very limited scalability.
LEVELS OF REDUNDANCY:
RAID 10 (RAID 1+0)
 RAID 10, also known as RAID 1+0, is a RAID configuration that combines disk
mirroring and disk striping to protect data.
 It requires a minimum of four disks and stripes data across mirrored pairs.
 As long as one disk in each mirrored pair is functional, data can be retrieved. If
two disks in the same mirrored pair fail, all data will be lost because there is
no parity in the striped sets.
SUMMARY

 RAID-5
 Advantages and Disadvantages of RAID 5
 RAID-6
 Advantages and Disadvantages of RAID 6
 RAID-1+0
QUIZ

1- The RAID level in which mirroring is done along with striping is
a) RAID 1+0
b) RAID 0
c) RAID 2
d) Both RAID 1+0 and RAID 0
Answer: a

2- Which one of the following is not a secondary storage?
a) magnetic disks
b) magnetic tapes
c) RAM
d) none of the mentioned
Answer: c
LECTURE -5

OBJECTIVE
 RAID-5  DISK SPACE MANAGEMENT
 Advantages and  Keeping Track of Free
Disadvantages of RAID 5 Blocks
 RAID-6  OS File Systems to Manage
Disk Space
 Advantages and
Disadvantages of RAID 6
 RAID-1+0
DISK SPACE MANAGEMENT

 The lowest level of software in the DBMS architecture, called the disk
space manager, manages space on disk.
 Abstractly, the disk space manager supports the concept of a page
as a unit of data and provides commands to allocate or deallocate a
page and to read or write a page.
DISK SPACE MANAGEMENT
KEEPING TRACK OF FREE BLOCKS:
 The disk space manager keeps track of which disk blocks are in use.
 Although blocks are likely to be allocated sequentially on disk at first,
subsequent allocations and deallocations can in general create "holes."
 One way to keep track of block usage is to maintain a list of free blocks.
 A second way is to maintain a bitmap with one bit for each disk block,
which indicates whether a block is in use or not.
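The bitmap approach can be sketched as follows. This is a minimal in-memory illustration with a hypothetical class name and a first-fit allocation rule chosen for simplicity; a real disk space manager would keep the bitmap on disk and would usually try to allocate contiguous blocks.

```python
class BlockBitmap:
    """One bit per disk block: 1 means in use, 0 means free."""

    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        self.bits = bytearray((num_blocks + 7) // 8)

    def in_use(self, block: int) -> bool:
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def allocate(self) -> int:
        """Mark the first free block as used and return its number."""
        for block in range(self.num_blocks):
            if not self.in_use(block):
                self.bits[block // 8] |= 1 << (block % 8)
                return block
        raise RuntimeError("no free blocks")

    def deallocate(self, block: int) -> None:
        """Clear the bit, leaving a hole that a later allocation can reuse."""
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF
```

Allocating three blocks and then deallocating the middle one leaves a hole at block 1, which the next call to allocate() fills before moving on to block 3.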
DISK SPACE MANAGEMENT

Using OS File Systems to Manage Disk Space

 Operating systems also manage space on disk. Typically, an operating system
supports the abstraction of a file as a sequence of bytes.

 The OS manages space on the disk and translates requests, such as
"Read byte i of file f," into corresponding low-level instructions:
"Read block m of track t of cylinder c of disk d."

 A database disk space manager could be built using OS files.


SUMMARY

 DISK SPACE MANAGEMENT


 Keeping Track of Free Blocks
OS File Systems to Manage Disk Space
QUIZ

1- Operating system is responsible for
a) disk initialization
b) booting from disk
c) bad-block recovery
d) all of the mentioned
Answer: d

2- A magneto-optic disk is:
a) primary storage
b) secondary storage
c) tertiary storage
d) none of the mentioned
Answer: c
LECTURE -6

OBJECTIVE

 DISK SPACE MANAGEMENT  Buffer manager


 Keeping Track of Free Blocks  Replacement policy
 OS File Systems to Manage  pin_count and dirty.
Disk Space
 Pinning
 Unpinning
BUFFER MANAGER

 Suppose that the database contains 1 million pages, but only 1000 pages of
main memory are available for holding data.

 Because all the data cannot be brought into main memory at one time, the
DBMS must bring pages into main memory as they are needed and, in the
process, decide what existing page in main memory to replace to make space
for the new page.

 The policy used to decide which page to replace is called the replacement
policy.
BUFFER MANAGER

 The buffer manager is the software layer responsible for bringing pages from
disk to main memory as needed.

 The buffer manager manages the available main memory by partitioning it


into a collection of pages, which we collectively refer to as the buffer pool.

 The main memory pages in the buffer pool are called frames
BUFFER MANAGER

 Higher levels of the DBMS code can be written without worrying about
whether data pages are in memory or not; they ask the buffer manager for the
page, and it is brought into a frame in the buffer pool if it is not already there.

 the higher-level code that requests a page must also release the page when it
is no longer needed, by informing the buffer manager, so that the frame
containing the page can be reused.

 The higher-level code must also inform the buffer manager if it modifies the
requested page; the buffer manager then makes sure that the change is
propagated to the copy of the page on disk.
BUFFER MANAGER

 The buffer manager maintains some bookkeeping information and two
variables for each frame in the pool: pin_count and dirty.

 The number of times that the page currently in a given frame has been
requested but not released (that is, the number of current users of the page)
is recorded in the pin_count variable for that frame.

 The Boolean variable dirty indicates whether the page has been modified
since it was brought into the buffer pool from disk.
BUFFER MANAGER

Initially, the pin_count for every frame is set to 0, and the dirty bits are turned
off. When a page is requested the buffer manager does the following:
1. Checks the buffer pool to see if some frame contains the requested page and,
if so, increments the pin_count of that frame. If the page is not in the pool, the
buffer manager brings it in as follows:
(a) Chooses a frame for replacement, using the replacement policy, and
increments its pin_count.
(b) If the dirty bit for the replacement frame is on, writes the page it contains to
disk (that is, the disk copy of the page is overwritten with the contents of the
frame).
(c) Reads the requested page into the replacement frame.
BUFFER MANAGER

2. Returns the (main memory) address of the frame containing the requested
page to the requestor.
 Incrementing pin_count is often called pinning the requested page in its frame.
 When the code that requested the page later releases it, the pin_count of the
frame containing the requested page is decremented. This is called unpinning
the page.
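The request and release steps above can be sketched in Python. This is a simplified illustration, not a definitive implementation: the class and method names are hypothetical, a dict stands in for the disk, and the replacement policy is assumed to be least recently used among frames whose pin_count is 0.

```python
from collections import OrderedDict


class BufferManager:
    def __init__(self, num_frames: int, disk: dict):
        self.num_frames = num_frames
        self.disk = disk               # page_id -> bytes, stands in for the disk
        self.frames = {}               # page_id -> frame with data, pin_count, dirty
        self.unpinned = OrderedDict()  # page_ids with pin_count 0, oldest first

    def pin(self, page_id):
        """Request a page: return its in-memory contents and pin it."""
        frame = self.frames.get(page_id)
        if frame is None:                            # page is not in the pool
            if len(self.frames) >= self.num_frames:
                self._evict()                        # choose a replacement frame
            frame = {"data": bytearray(self.disk[page_id]),
                     "pin_count": 0, "dirty": False}
            self.frames[page_id] = frame             # read the requested page in
        self.unpinned.pop(page_id, None)             # pinned pages are not candidates
        frame["pin_count"] += 1
        return frame["data"]

    def unpin(self, page_id, dirty: bool = False):
        """Release a page; the caller says whether it modified the page."""
        frame = self.frames[page_id]
        if frame["pin_count"] == 0:
            raise ValueError("page is not pinned")
        frame["dirty"] = frame["dirty"] or dirty
        frame["pin_count"] -= 1
        if frame["pin_count"] == 0:
            self.unpinned[page_id] = None            # now eligible for replacement

    def _evict(self):
        if not self.unpinned:
            raise RuntimeError("all frames are pinned")
        victim, _ = self.unpinned.popitem(last=False)
        frame = self.frames.pop(victim)
        if frame["dirty"]:                           # write back before reuse
            self.disk[victim] = bytes(frame["data"])
```

A caller pins a page, works on the returned bytes, and then unpins it with dirty=True if it changed anything. A page is only written back to the stand-in disk when its frame is chosen for replacement.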
SUMMARY

 Buffer manager
 Replacement policy
 pin_count and dirty.
 Pinning
 Unpinning
QUIZ

1- Which of the following is the oldest database model?
a) Relational
b) Hierarchical
c) Physical
d) Network
Answer: d

2- Which one of the following is not a secondary storage?
a) magnetic disks
b) magnetic tapes
c) RAM
d) none of the mentioned
Answer: c
LECTURE -7

OBJECTIVE

 Buffer manager  Buffer Replacement Policies


 Replacement policy  sequential flooding
 pin_count and dirty.  FILES OF RECORDS
 Pinning  Implementing Heap Files
 Unpinning  Linked List of Pages
 Directory of Pages
BUFFER REPLACEMENT POLICIES
 The best-known replacement policy is least recently used (LRU). This can be
implemented in the buffer manager using a queue of pointers to frames with
pin_count 0. A frame is added to the end of the queue when its pin_count
becomes 0.
 The page chosen for replacement is the one in the frame at the head of the
queue.
 A variant of LRU, called clock replacement, has similar behavior but less
overhead.
 LRU is not always the best policy. Suppose the buffer pool has fewer frames
than a file has pages, and the file is scanned repeatedly. Using LRU, every
scan of the file will result in reading every page of the file from disk.
In this situation, called sequential flooding, LRU is the worst possible
replacement strategy.
 Other replacement policies include first in first out (FIFO) and most recently
used (MRU), which also entail overhead similar to LRU, and random, among
others.
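Sequential flooding is easy to reproduce with a small simulation. The sketch below is an illustrative model only: the function name and parameters are hypothetical, pages are unpinned immediately after use, and the pool is kept as a list ordered from least to most recently used.

```python
def count_page_reads(num_frames: int, num_pages: int, num_scans: int,
                     policy: str) -> int:
    """Count disk reads for repeated sequential scans under 'LRU' or 'MRU'."""
    pool = []   # page ids, ordered from least to most recently used
    reads = 0
    for _ in range(num_scans):
        for page in range(num_pages):
            if page in pool:
                pool.remove(page)          # hit: page will be re-appended below
            else:
                reads += 1                 # miss: page must be read from disk
                if len(pool) == num_frames:
                    # LRU evicts the head of the list, MRU evicts the tail
                    pool.pop(0 if policy == "LRU" else -1)
            pool.append(page)              # page is now the most recently used
    return reads
```

With 10 frames and an 11-page file scanned 5 times, count_page_reads(10, 11, 5, "LRU") reports 55 reads, one for every page access, because LRU always evicts exactly the page that the scan will need next. The same workload under "MRU" needs far fewer reads.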
FILES OF RECORDS

 We now turn from the way pages are stored on disk and brought into main
memory to the way pages are used to store records and organized into logical
collections or files.

 Higher levels of the DBMS code treat a page as effectively being a
collection of records, ignoring the representation and storage details.
FILES OF RECORDS

Implementing Heap Files


 The data in the pages of a heap file is not ordered in any way, and the only
guarantee is that one can retrieve all records in the file by repeated requests for
the next record.

 Supported operations on a heap file include Create and destroy files, insert a
record, delete a record with a given rid, get a record with a given rid, and scan
all records in the file.

 We must keep track of the pages in each heap file to support scans, and we
must keep track of pages that contain free space to implement insertion
efficiently
FILES OF RECORDS
Implementing Heap Files
 There are two alternative ways to maintain this information.
1. Linked List of Pages:
 One possibility is to maintain a heap file as a doubly linked list of pages.
The DBMS can remember where the first page is located by maintaining a
table containing pairs of (heap_file_name, page_1_addr) in a known location
on disk. We call the first page of the file the header page.
FILES OF RECORDS

Implementing Heap Files


2. Directory of Pages
An alternative to a linked list of pages is to maintain a directory of pages.
Each directory entry identifies a page (or a sequence of pages) in the heap
file.
Free space is managed by keeping a bit per entry (whether the page has any
free space) or a count per entry (the amount of free space on the page).
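A minimal in-memory sketch of a heap file with a page directory, assuming the directory keeps a count of free slots per page. The class name, the fixed number of slots per page, and the use of None to mark an empty slot are illustrative assumptions.

```python
class HeapFile:
    def __init__(self, page_capacity: int):
        self.page_capacity = page_capacity
        self.pages = []        # each page is a list of slots (None = empty)
        self.free_count = []   # directory: number of free slots per page

    def insert(self, record):
        """Insert into the first page with free space; return the rid."""
        page_no = next((i for i, free in enumerate(self.free_count) if free > 0),
                       None)
        if page_no is None:    # no page has room, so add a new page
            self.pages.append([None] * self.page_capacity)
            self.free_count.append(self.page_capacity)
            page_no = len(self.pages) - 1
        slot_no = self.pages[page_no].index(None)
        self.pages[page_no][slot_no] = record
        self.free_count[page_no] -= 1
        return (page_no, slot_no)

    def delete(self, rid):
        page_no, slot_no = rid
        if self.pages[page_no][slot_no] is not None:
            self.pages[page_no][slot_no] = None
            self.free_count[page_no] += 1

    def get(self, rid):
        page_no, slot_no = rid
        return self.pages[page_no][slot_no]

    def scan(self):
        """Yield every record in the file, in no particular logical order."""
        for page in self.pages:
            for record in page:
                if record is not None:
                    yield record
```

Insertion consults only the directory to find a page with room, which is the point of the design: the data pages themselves do not have to be read to discover where free space is.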
SUMMARY

 Buffer Replacement Policies


 sequential flooding
 Files of records
 Implementing Heap Files
 Linked List of Pages
 Directory of Pages
QUIZ

1- The file organization which allows us to read records that would satisfy
the join condition by using one block read is
a) Heap file organization
b) Sequential file organization
c) Clustering file organization
d) Hash file organization
Answer: c

2- Large collection of files are called ____________
a) Fields
b) Records
c) Database
d) Sectors
Answer: c
LECTURE -8

OBJECTIVE

 Buffer Replacement Policies  Record Arrangement


 sequential flooding  fixed length records
 Files of records  fixed length records
Problem and solution
 Implementing Heap Files
 Variable length Records
 Linked List of Pages
 Variable length Records
 Directory of Pages
Problem and solution
PAGE FORMATS

How are collections of records arranged in a page?

Think of a page as a collection of records.
 A record in a page is identified by using the pair (pageID, slotNo) as its
RID.
 Types of records:
1) fixed length records
2) variable length records

1) fixed length records
 If all records on the page are guaranteed to be of the same length, then
records can be uniformly arranged in the page.
[Figure: a page with records R1, R2, R3, R4, ... stored in consecutive slots.]
PAGE FORMATS

fixed length records

ISSUE: how to keep track of empty slots if a record is deleted.
Soln: Alternative-1: store records in sequence, in slots 1 to N. If a record is
deleted, fill the empty slot with the Nth record.
[Figure: page contents before and after a delete, with R_N moved into the
freed slot.]
PAGE FORMATS

fixed length records

ISSUE: how to keep track of empty slots if a record is deleted.
Soln: Alternative-2: handle deletions using an array of bits, one per slot, to
keep track of free slots.
[Figure: page contents before and after a delete; the freed slot stays empty.]
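The two ways of handling deletion in a page of fixed-length records can be compared with a short Python sketch. The class names are hypothetical and the pages are plain in-memory lists; the packed page moves the last record into the freed slot, while the bitmap page marks the slot as free with one bit per slot.

```python
class PackedPage:
    """Records occupy slots 0..N-1 with no gaps."""

    def __init__(self):
        self.records = []

    def insert(self, record) -> int:
        self.records.append(record)
        return len(self.records) - 1      # slot number of the new record

    def delete(self, slot_no: int) -> None:
        if not 0 <= slot_no < len(self.records):
            raise IndexError("no such slot")
        last = self.records.pop()         # take the last record off the page
        if slot_no < len(self.records):
            self.records[slot_no] = last  # move it into the freed slot


class BitmapPage:
    """Records stay where they are; one bit per slot marks slots in use."""

    def __init__(self, num_slots: int):
        self.slots = [None] * num_slots
        self.used = [False] * num_slots   # the array of bits

    def insert(self, record) -> int:
        slot_no = self.used.index(False)  # raises ValueError if the page is full
        self.slots[slot_no] = record
        self.used[slot_no] = True
        return slot_no

    def delete(self, slot_no: int) -> None:
        self.slots[slot_no] = None
        self.used[slot_no] = False
```

One consequence worth noting: in the packed page the moved record gets a new slot number, so its (pageID, slotNo) identifier changes, whereas in the bitmap page every record keeps its slot.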
PAGE FORMATS

Variable length Records

 If records are of variable length, then we cannot divide the page into
equal slots.
 To insert a new record, we must find an empty space of exactly the right
size; otherwise space is wasted.
What is the solution?
Maintain a directory of slots for each page with a
SUMMARY

 Record Arrangement
 fixed length records
 fixed length records Problem and
solution
 Variable length Records
 Variable length Records Problem and
solution
QUIZ

1- Storing a separate copy of the database at multiple locations is?
A) Data Replication
B) Horizontal Partitioning
C) Vertical Partitioning
D) Horizontal and Vertical Partitioning
Answer: A

2- A unit of storage that can store one or more records in a hash file
organization is denoted as
a) Buckets
b) Disk pages
c) Blocks
d) Nodes
Answer: a
LECTURE -9

OBJECTIVE
 Record Arrangement  Index
 fixed length records  Primary Index
 fixed length records Problem  Secondary Index
and solution
 Clustering Index
 Variable length Records
 Variable length Records
Problem and solution
STRUCTURED INDEXING

 We know that data is stored in the form of records. Every record has a
key field, which helps it to be recognized uniquely.

 Indexing is a data structure technique to efficiently retrieve records
from the database files based on the attributes on which the indexing has
been done.

 Indexing is defined based on its indexing attributes. Indexing can be
of the following types:
STRUCTURED INDEXING

1. Primary Index
 Primary index is defined on an ordered data file. The data file is
ordered on a key field.

 The key field is generally the primary key of the relation.


STRUCTURED INDEXING

2. Secondary Index − Secondary index may be generated from a field which is a
candidate key and has a unique value in every record, or a non-key with duplicate
values.
STRUCTURED INDEXING

3. Clustering Index − Clustering index is defined on an ordered data file. The
data file is ordered on a non-key field.
SUMMARY

 Index
 Primary Index
 Secondary Index
 Clustering Index
QUIZ

1- Does index take space in the disk?
a) It stores memory as and when required
b) Yes, Indexes are stored on disk
c) Indexes are never stored on disk
d) Indexes take no space
Answer: b

2- Which string function returns the index of the first occurrence of
substring?
a) INSERT()
b) INSTR()
c) INSTRING()
d) INFSTR()
Answer: b
LECTURE -10

OBJECTIVE
 Index  Data on External Storage
 Primary Index  Magnetic tapes
 Secondary Index  Page
 Clustering Index cost of page I/O
DATA ON EXTERNAL STORAGE

A DBMS stores vast quantities of data, and the data must persist across
program executions.
[Figure: programs Prg1, Prg2 and Prg3.]

Therefore, data is stored on external storage devices such as disks and tapes,
and fetched into main memory as needed for processing.
DATA ON EXTERNAL STORAGE

POINTS TO REMEMBER
 Disks:
• Can retrieve random page at fixed cost
• But reading several consecutive pages is
much cheaper than reading them in random
order.
 Tapes (magnetic tapes):
• Can only read pages in sequence
• Cheaper than disks, used for archival storage.
(Archival storage: data that may not be actively
needed)
DATA ON EXTERNAL STORAGE

POINTS TO REMEMBER

 Page:
 The unit of information read from or written to disk is a page.

 The size of a page is a DBMS parameter, and typical values are


4KB or 8KB.

 Cost of page I/O


The cost of page I/O (input from disk to main memory and output from memory
to disk) dominates the cost of typical database operations, and
database systems are carefully optimized to minimize this cost.
SUMMARY

 Data on External Storage


 Magnetic tapes
 Page
cost of page I/O
QUIZ
1- Which of the following is a disadvantage of replication?
A) Reduced network traffic
B) If the database fails at one site, a copy can be located at another site.
C) Each site must have the same storage capacity.
D) Each transaction may proceed without coordination across the network.
Answer: C

2- A distributed database can use which of the following strategies?
A) Totally centralized at one location and accessed by many sites
B) Partially or totally replicated across sites
C) Partitioned into segments at different sites
D) All of the above
Answer: D
LECTURE -11

OBJECTIVE
 Data on External Storage  File
 Magnetic tapes  File organization
 Page  Index data structures
 cost of page I/O  Search key
 Data Reference
 Type of Index data
structures
FILE ORGANIZATION

 A file is a collection of records. Using the primary key, we can access
the records. The types of access that are efficient, and how often they can be
performed cheaply, are determined by the file organization used for a given
set of records.
FILE ORGANIZATION

 File organization is a logical relationship among various records. This


method defines how file records are mapped onto disk blocks.

 File organization is used to describe the way in which the records are stored
in terms of blocks, and the blocks are placed on the storage medium.
FILE ORGANIZATION

 The first approach to mapping the database to files is to use several
files and store records of only one fixed length in any given file.

 An alternative approach is to structure our files so that they can
accommodate records of multiple lengths.

 Files of fixed length records are easier to implement than files of
variable length records.
INDEX DATA STRUCTURES

 Indexing is a way to optimize the performance of a database by


minimizing the number of disk accesses required when a query is
processed.

 It is a data structure technique which is used to quickly locate and


access the data in a database.
INDEX DATA STRUCTURES

An index is typically organized as two columns:

1- The first column is the Search key, which contains a copy of the
primary key or candidate key of the table. These values are stored in
sorted order so that the corresponding data can be accessed quickly.
Note: The data itself may or may not be stored in sorted order.

2- The second column is the Data Reference or Pointer, which contains
a set of pointers holding the address of the disk block where that
particular key value can be found.
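The search key and data reference columns can be modelled with two parallel sorted lists and a binary search. This is only an in-memory sketch under assumed names (SortedIndex, block_addr); a real index lives on disk and is organized as a tree or hash structure.

```python
import bisect


class SortedIndex:
    def __init__(self):
        self.keys = []          # search key column, kept in sorted order
        self.block_addrs = []   # data reference column, parallel to keys

    def add(self, key, block_addr) -> None:
        """Insert an entry while keeping the search keys sorted."""
        pos = bisect.bisect_left(self.keys, key)
        self.keys.insert(pos, key)
        self.block_addrs.insert(pos, block_addr)

    def lookup(self, key):
        """Binary search on the key column; return the block address or None."""
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            return self.block_addrs[pos]
        return None
```

Entries may be added in any order, mirroring the note that the data itself need not be sorted. Only the index keeps its keys ordered, which is what makes the binary search possible.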
INDEX DATA STRUCTURES

Two types of Index Data Structures:

1) Hash based Indexing

2) Tree Based Indexing


SUMMARY

 File
 File organization
 Index data structures
 Search key
 Data Reference
 Type of Index data
structures
QUIZ

1- The file organization which allows us to read records that would satisfy
the join condition by using one block read is
a) Heap file organization
b) Sequential file organization
c) Clustering file organization
d) Hash file organization
Answer: c

2- Key value pairs is usually seen in
a) Hash tables
b) Heaps
c) Both Hash tables and Heaps
d) Skip list
Answer: a
LECTURE -12

OBJECTIVE
 File  Comparison of file
organizations
 File organization
 heap file
 Index data structures
 Clustered B+ tree
 Search key
 Heap file with an
 Data Reference
unclustered B+ tree
 Type of Index data structures
 Heap file with an
unclustered hash
COMPARISON OF FILE
ORGANIZATIONS

• The files and indexes are organized according to the composite search key
(age, sal).
• All selection operations being compared are specified on these fields.
COMPARISON OF FILE
ORGANIZATIONS

• Different File organizations


are:
1. File of randomly ordered
employee records, or heap
file.
COMPARISON OF FILE
ORGANIZATIONS

• Different File organizations


are:

2. File of employee records


sorted on (age, sal).

3. Clustered B+ tree file with


search key (age, sal).
COMPARISON OF FILE
ORGANIZATIONS

• Different File organizations


are:

4. Heap file with an


unclustered B+ tree index on
(age, sal).

5. Heap file with an


unclustered hash index on (age,
sal).
SUMMARY

 Comparison of file
organizations
 heap file
 Clustered B+ tree
 Heap file with an
unclustered B+ tree
 Heap file with an
unclustered hash
QUIZ

1- A transaction manager does which of the following?
A) Maintains a log of transactions
B) Maintains before and after database images
C) Maintains appropriate concurrency control
D) All of the above.
Answer: D

2- A distributed database has which of the following advantages over a
centralized database?
A) Software cost
B) Software complexity
C) Slow Response
D) Modular growth
Answer: D
LECTURE -13

OBJECTIVE

• Comparison of file organizations
• Heap file
• Clustered B+ tree
• Heap file with an unclustered B+ tree
• Heap file with an unclustered hash

• Tree-structured index
• ISAM (indexed sequential access method)
• B+ trees
TREE-BASED INDEXING:

• The data entries are arranged in sorted order by search key value
and a hierarchical search data structure is maintained.
TREE-BASED INDEXING:

• Tree-structured index
• Index storage techniques use three alternatives for data entries k*:
– Data record with key value k
– <k, rid of data record with search key value k>
– <k, list of rids of data records with search key k>
• Tree-structured indexing techniques support both range searches and equality searches.
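The three alternatives for a data entry k* can be written down as small Python types. This is only an illustrative sketch: the class names, the integer key, and the modelling of a rid as a (page number, slot number) pair are assumptions, not part of the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A rid (record id) is modelled here as (page number, slot number);
# this layout is an assumption made for illustration.
Rid = Tuple[int, int]


@dataclass
class Alternative1:
    """The data entry k* is the data record itself."""
    key: int
    record: dict


@dataclass
class Alternative2:
    """The data entry is <k, rid of a data record with search key value k>."""
    key: int
    rid: Rid


@dataclass
class Alternative3:
    """The data entry is <k, list of rids of data records with search key k>."""
    key: int
    rids: List[Rid]


def to_alternative3(entries: List[Alternative2]) -> List[Alternative3]:
    """Group Alternative 2 entries that share a key into Alternative 3 entries."""
    grouped = {}
    for e in entries:
        grouped.setdefault(e.key, []).append(e.rid)
    return [Alternative3(k, rids) for k, rids in sorted(grouped.items())]
```

The helper `to_alternative3` shows the relationship between the last two forms: when several records share a search key, the third alternative stores the key once with a list of rids instead of repeating it.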
TREE-BASED INDEXING:

Tree-structured index

Two techniques are available in tree-structured indexing:

1. ISAM (indexed sequential access method)

2. B+ trees

Both support efficient range searches.
TREE-BASED INDEXING:

ISAM
– It is a static index structure that is effective when the file is not frequently updated.
– It is not suitable for a file that grows and shrinks a lot.
TREE-BASED INDEXING:
B+ Trees
– A dynamic structure that adjusts gracefully to changes in the file.
– The most widely used index structure, because it adjusts well to changes.
– Supports both equality search and range search.
SUMMARY

• Tree-structured index
• ISAM (indexed sequential access method)
• B+ trees
QUIZ

1- Which of the following reasons will force you to use the XML data model in SQL Server?
A) Your data is sparse or you do not know the structure of the data
B) Your data represents a containment hierarchy
C) Order is inherent in your data
D) All of the mentioned
Answer: D

2- Which of the following is not an XML storage option?
A) Native storage as XML data type
B) Mapping between XML and relational storage
C) Small object storage
D) None of the mentioned
Answer: C
LECTURE -14

OBJECTIVE

• Tree-structured index
• ISAM (indexed sequential access method)
• B+ trees

• ISAM
• ISAM structure
• Leaf pages
• Non-leaf pages
ISAM (INDEXED SEQUENTIAL ACCESS METHOD)

The data entries of the ISAM index are in the leaf pages of the tree and in additional overflow pages chained to some leaf pages.

The ISAM structure is static, except for the overflow pages, which are expected to be few.
ISAM (INDEXED SEQUENTIAL ACCESS METHOD)

[Figure: ISAM index structure. Labels: non-leaf pages; leaf pages; primary pages; overflow page.]
ISAM (INDEXED SEQUENTIAL ACCESS METHOD)

• Each tree node is a disk page.

• When a file is created, all leaf pages are allocated sequentially and sorted on the search key value.

• The non-leaf pages are then allocated.

• If later inserts find no free space on the target leaf page, additional pages are needed because the index structure is static. These pages are called overflow pages.
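The behaviour described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a real ISAM implementation: pages are Python lists, `PAGE_CAPACITY` is an arbitrary tiny value, there is only one non-leaf level, the initial key list is assumed to be non-empty and sorted, and all class and method names are hypothetical.

```python
from bisect import bisect_right

PAGE_CAPACITY = 2  # assumed entries per page, kept tiny for illustration


class LeafPage:
    def __init__(self, keys):
        self.keys = list(keys)   # entries in the primary leaf page
        self.overflow = []       # chain of overflow pages (each a list of keys)


class IsamIndex:
    def __init__(self, sorted_keys):
        # Leaf pages are allocated sequentially, sorted on the search key.
        self.leaves = [
            LeafPage(sorted_keys[i:i + PAGE_CAPACITY])
            for i in range(0, len(sorted_keys), PAGE_CAPACITY)
        ]
        # One non-leaf level: the first key of every leaf except the first
        # acts as a separator. This level is never modified afterwards.
        self.separators = [leaf.keys[0] for leaf in self.leaves[1:]]

    def _leaf_for(self, key):
        return self.leaves[bisect_right(self.separators, key)]

    def insert(self, key):
        leaf = self._leaf_for(key)
        if len(leaf.keys) < PAGE_CAPACITY:
            leaf.keys.append(key)
            leaf.keys.sort()
            return
        # Primary page is full: extend the overflow chain instead of splitting.
        if not leaf.overflow or len(leaf.overflow[-1]) >= PAGE_CAPACITY:
            leaf.overflow.append([])
        leaf.overflow[-1].append(key)

    def search(self, key):
        leaf = self._leaf_for(key)
        if key in leaf.keys:
            return True
        # Every overflow page on the chain may have to be read.
        return any(key in page for page in leaf.overflow)
```

Repeated inserts into the same key range leave the non-leaf level untouched and only lengthen one overflow chain, so a search in that range has to read more and more pages.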
SUMMARY

 ISAM
 ISAM structure
 leaf pages
 Non leaf pages
QUIZ

1- The Oracle RDBMS uses the ____ statement to declare a new transaction start and its properties.
A) BEGIN
B) SET TRANSACTION
C) BEGIN TRANSACTION
D) COMMIT
Answer: B

2- ____ means that the data used during the execution of a transaction cannot be used by a second transaction until the first one is completed.
A) Consistency
B) Atomicity
C) Durability
D) Isolation
Answer: D
LECTURE -15

OBJECTIVE

• ISAM
• ISAM structure
• Leaf pages
• Non-leaf pages

• B+ tree
• Problem with ISAM
• Hash-based indexing
B+ TREES: DYNAMIC INDEX STRUCTURE

• A static structure such as the ISAM index suffers from the problem
that long overflow chains can develop as the file grows, leading to
poor performance.
• This problem motivated the development of more flexible, dynamic
structures that adjust gracefully to inserts and deletes.
B+ TREES: DYNAMIC INDEX STRUCTURE

• Problem with ISAM: long overflow chains lead to poor performance.
• Characteristics of B+ trees:
– Insertion and deletion: the tree remains balanced after both operations.
– Searching: requires only a traversal from the root to the appropriate leaf node.
– Height of the tree: rarely more than 3 to 4.
• Format of a tree node: a non-leaf node with m index entries contains m+1 pointers:
P0 | K1 | P1 | K2 | P2 | ... | Km | Pm
(each pair Ki, Pi is one index entry)
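The node format and the root-to-leaf search can be illustrated with a short Python sketch. The `Node` class and both function names are hypothetical, nodes are plain in-memory objects rather than disk pages, and the convention that pointer Pi covers keys k with Ki <= k < K(i+1) is an assumption of this sketch. The second function estimates how many index levels sit above the leaves for a given fan-out, which gives some intuition for why the height is rarely more than 3 to 4.

```python
from bisect import bisect_right


class Node:
    def __init__(self, keys, children=None):
        self.keys = keys            # K1..Km, in sorted order
        self.children = children    # P0..Pm for a non-leaf node, None for a leaf
        if children is not None:
            # m index entries (keys) go with m + 1 pointers.
            assert len(children) == len(keys) + 1


def search(root, key):
    """Follow one pointer per level from the root down to a leaf."""
    node = root
    while node.children is not None:
        # Assumed convention: Pi covers Ki <= k < K(i+1); P0 covers k < K1.
        node = node.children[bisect_right(node.keys, key)]
    return key in node.keys


def levels_above_leaves(leaf_pages, fanout):
    """Count the index levels needed above `leaf_pages` leaves."""
    assert fanout >= 2
    levels = 0
    pages = leaf_pages
    while pages > 1:
        pages = -(-pages // fanout)  # ceiling division
        levels += 1
    return levels
```

With an assumed fan-out of 100, `levels_above_leaves(1_000_000, 100)` returns 3: one million leaf pages need only three levels of index pages above them.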
HASH-BASED INDEXING

Hash-based indexing is used to quickly find records that have a given search key value.

Example: if the file of employee records is hashed on the name field, we can retrieve all records about Johny.

Here the records in a file are grouped in buckets.

The bucket to which a record belongs is found by applying a hash function to the search key.
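The idea of buckets and a hash function can be shown with a minimal Python sketch. Everything here is an assumption for illustration: the fixed bucket count, the toy hash function (sum of character codes), the dictionary record layout, and the function names.

```python
NUM_BUCKETS = 4  # assumed, fixed number of buckets


def bucket_of(name):
    """Toy hash function: sum of character codes modulo the bucket count."""
    return sum(ord(c) for c in name) % NUM_BUCKETS


def build_hash_file(records):
    """Group the records into buckets according to the hash of the name field."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for rec in records:
        buckets[bucket_of(rec["name"])].append(rec)
    return buckets


def lookup(buckets, name):
    """Equality search: hash the key, then scan only that one bucket."""
    return [rec for rec in buckets[bucket_of(name)] if rec["name"] == name]
```

A call such as `lookup(buckets, "Johny")` touches a single bucket, no matter how many buckets the file has, which is what makes equality searches on the hashed field fast.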


SUMMARY

• B+ tree
• Problem with ISAM
• Hash-based indexing
QUIZ

1- What are the leaf nodes in a B+ tree?
a) The topmost nodes
b) The bottommost nodes
c) The nodes in between the top and bottom nodes
d) None of the mentioned
Answer: b

2- Dynamic hashing is also called _________
a) Extended hashing
b) Extendable hashing
c) Static hashing
d) Movable hashing
Answer: b
EXERCISE

A distributed database has which of the following advantages over a centralized database?

A) Software cost

B) Software complexity

C) Slow Response

D) Modular growth

Answer: D
EXERCISE

An autonomous homogeneous environment is one in which:

A) The same DBMS is at each node and each DBMS works independently.

B) The same DBMS is at each node and a central DBMS coordinates database
access.

C) A different DBMS is at each node and each DBMS works independently.

D) A different DBMS is at each node and a central DBMS coordinates database
access.
EXERCISE

Which of the following does a transaction manager do?

A) Maintains a log of transactions

B) Maintains before and after database images

C) Maintains appropriate concurrency control

D) All of the above.

Answer: D
EXERCISE

Location transparency allows:

A) Users to treat the data as if it is at one location

B) Programmers to treat the data as if it is at one location

C) Managers to treat the data as if it is at one location

D) All of the above.

Answer: D
EXERCISE

A heterogeneous distributed database is one in which:

A) The same DBMS is used at each location and data are not distributed
across all nodes.

B) The same DBMS is used at each location and data are distributed across all
nodes.

C) A different DBMS is used at each location and data are not distributed
across all nodes.

D) A different DBMS is used at each location and data are distributed across
all nodes.
THANK YOU
