Chapter 12: Physical Storage Systems (Database System Concepts, 7th Edition)
Storage Hierarchy
Storage Hierarchy (Cont.)
Primary storage: fastest media but volatile (cache, main memory).
Secondary storage: next level in hierarchy, non-volatile, moderately fast access time
• also called on-line storage
• e.g., flash memory, magnetic disks
Tertiary storage: lowest level in hierarchy, non-volatile, slow access time
• also called off-line storage, used for archival storage
• e.g., magnetic tape, optical storage
• Magnetic tape
   Sequential access, 1 to 12 TB capacity
   A few drives with many tapes
   Jukeboxes with petabytes (1000s of TB) of storage
Storage Interfaces
Families of disk interface standards
• SATA (Serial ATA)
   SATA 3 supports data transfer speeds of up to 6 gigabits/sec
• SAS (Serial Attached SCSI)
   SAS Version 3 supports 12 gigabits/sec
• NVMe (Non-Volatile Memory Express) interface
   Works with PCIe connectors to support lower latency and higher transfer rates
   Supports data transfer rates of up to 24 gigabits/sec
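As a rough back-of-the-envelope illustration (raw line rates, an assumed 4 KB block, and no protocol or encoding overhead), the sketch below converts these interface speeds into the time to move a single block:

```python
# Illustrative transfer times for one 4 KB block at the quoted raw line
# rates; protocol and encoding overhead are ignored.
BLOCK_BYTES = 4 * 1024

def transfer_time_us(gigabits_per_sec: float, nbytes: int = BLOCK_BYTES) -> float:
    """Microseconds to move nbytes over a link with the given raw rate."""
    return nbytes * 8 / (gigabits_per_sec * 1e9) * 1e6

for name, rate in [("SATA 3", 6), ("SAS 3", 12), ("NVMe/PCIe", 24)]:
    print(f"{name:10s} {rate:3d} Gbit/s -> {transfer_time_us(rate):.2f} us per 4 KB block")
```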
Disks are usually connected directly to the computer system
In a Storage Area Network (SAN), a large number of disks are connected by a high-speed network to a number of servers
In Network Attached Storage (NAS), networked storage provides a file system interface using a network file system protocol, instead of providing a disk system interface
Magnetic Hard Disk Mechanism
Magnetic Disks
Read-write head
Surface of platter divided into circular tracks
• Over 50K-100K tracks per platter on typical hard disks
Each track is divided into sectors.
• A sector is the smallest unit of data that can be read or written.
• Sector size typically 512 bytes
• Typical sectors per track: 500 to 1000 (on inner tracks) to 1000 to 2000 (on outer tracks)
To read/write a sector
• disk arm swings to position head on right track
• platter spins continually; data is read/written as sector passes under head
Head-disk assemblies
• multiple disk platters on a single spindle (1 to 5 usually)
• one head per platter, mounted on a common arm.
Cylinder i consists of the ith track of all the platters
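As a minimal sketch of how this geometry adds up, the snippet below estimates capacity from illustrative values picked from the ranges above (not the geometry of any real drive):

```python
# Rough capacity estimate from disk geometry; all numbers are illustrative
# values drawn from the ranges quoted above, not a real drive's spec sheet.
SECTOR_BYTES = 512
TRACKS_PER_PLATTER = 75_000        # within the 50K-100K range
AVG_SECTORS_PER_TRACK = 1_000      # between the inner and outer track counts
PLATTERS = 4                       # "1 to 5 usually"

platter_bytes = TRACKS_PER_PLATTER * AVG_SECTORS_PER_TRACK * SECTOR_BYTES
drive_bytes = platter_bytes * PLATTERS
print(f"per platter: {platter_bytes / 1e9:.1f} GB, per drive: {drive_bytes / 1e9:.1f} GB")
```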
Magnetic Disks (Cont.)
Disk controller – interfaces between the computer system and the disk drive hardware
• accepts high-level commands to read or write a sector
• initiates actions such as moving the disk arm to the right track and actually reading or writing the data
• Computes and attaches checksums to each sector to verify that data is read back correctly
   If data is corrupted, with very high probability stored checksum won’t match recomputed checksum
• Ensures successful writing by reading back sector after writing it
• Performs remapping of bad sectors
Performance Measures of Disks
Access time – the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
• Seek time – time it takes to reposition the arm over the correct track.
   Average seek time is 1/2 the worst case seek time.
   • Would be 1/3 if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement
   4 to 10 milliseconds on typical disks
• Rotational latency – time it takes for the sector to be accessed to appear under the head.
   4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.)
   Average latency is 1/2 of the above latency.
• Overall latency is 5 to 20 msec depending on disk model
Data-transfer rate – the rate at which data can be retrieved from or stored to the disk.
• 25 to 200 MB per second max rate, lower for inner tracks
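Putting these figures together, one random block access costs roughly seek time + average rotational latency + transfer time. A small worked sketch with assumed mid-range values:

```python
# Approximate cost of one random 4 KB read, using assumed mid-range values
# from the figures above (not measurements of any particular disk).
SEEK_MS = 6.0                         # within the 4-10 ms seek range
ROTATION_MS = 8.0                     # one full rotation (roughly 7200 rpm)
AVG_ROT_LATENCY_MS = ROTATION_MS / 2  # on average, half a rotation
TRANSFER_MB_PER_S = 100.0             # within the 25-200 MB/s range
BLOCK_KB = 4

transfer_ms = BLOCK_KB / 1024 / TRANSFER_MB_PER_S * 1000
access_ms = SEEK_MS + AVG_ROT_LATENCY_MS + transfer_ms
print(f"~{access_ms:.2f} ms per random 4 KB read "
      f"(seek {SEEK_MS} ms + rotation {AVG_ROT_LATENCY_MS} ms + transfer {transfer_ms:.3f} ms)")
```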
Performance Measures (Cont.)
Disk block is a logical unit for storage allocation and retrieval
• 4 to 16 kilobytes typically
• Smaller blocks: more transfers from disk
• Larger blocks: more space wasted due to partially filled blocks
Sequential access pattern
• Successive requests are for successive disk blocks
• Disk seek required only for first block
Random access pattern
• Successive requests are for blocks that can be anywhere on disk
• Each access requires a seek
• Transfer rates are low since a lot of time is wasted in seeks
I/O operations per second (IOPS)
• Number of random block reads that a disk can support per second
• 50 to 200 IOPS on current generation magnetic disks
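A minimal sketch of how large the gap between the two access patterns is, using assumed values from the ranges quoted above:

```python
# Effective throughput of random vs sequential access on a magnetic disk,
# with illustrative values from the ranges quoted above.
BLOCK_KB = 4
RANDOM_IOPS = 100              # within the 50-200 IOPS range
SEQUENTIAL_MB_PER_S = 100      # within the 25-200 MB/s range

random_mb_per_s = RANDOM_IOPS * BLOCK_KB / 1024
print(f"random:     ~{random_mb_per_s:.2f} MB/s ({RANDOM_IOPS} random {BLOCK_KB} KB blocks per second)")
print(f"sequential: ~{SEQUENTIAL_MB_PER_S} MB/s (seek needed only for the first block)")
```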
Performance Measures (Cont.)
Mean time to failure (MTTF) – the average time the disk is expected to run continuously without any failure
• Typically 3 to 5 years
• Probability of failure of new disks is quite low, corresponding to a “theoretical MTTF” of 500,000 to 1,200,000 hours for a new disk
   E.g., an MTTF of 1,200,000 hours for a new disk means that given 1000 relatively new disks, on average one will fail every 1200 hours
• MTTF decreases as the disk ages
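The "one failure every 1200 hours" figure is just the MTTF divided by the number of disks; the sketch below spells out that arithmetic:

```python
# Expected failure rate of a population of new disks from the quoted MTTF.
MTTF_HOURS = 1_200_000
NUM_DISKS = 1_000

hours_per_failure = MTTF_HOURS / NUM_DISKS            # ~1200 hours between failures
failures_per_year = NUM_DISKS * 24 * 365 / MTTF_HOURS
print(f"~1 failure every {hours_per_failure:.0f} hours, "
      f"about {failures_per_year:.1f} failures per year in the population")
```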
Flash Storage
NOR flash vs NAND flash
NAND flash
• used widely for storage, cheaper than NOR flash
• requires page-at-a-time read (page: 512 bytes to 4 KB)
20 to 100 microseconds for a page read
Not much difference between sequential and random read
• Page can only be written once
Must be erased to allow rewrite
Solid state disks
• Use standard block-oriented disk interfaces, but store data on multiple flash storage devices internally
• Transfer rate of up to 500 MB/sec using SATA, and up to 3 GB/sec using NVMe PCIe
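To see why flash sustains far more random reads than a magnetic disk, the sketch below converts the page-read latency above into reads per second for a single chip (illustrative only; real SSDs read many chips in parallel and reach much higher rates):

```python
# Random page reads per second implied by flash read latency, per chip,
# compared with the 50-200 IOPS quoted earlier for magnetic disks.
PAGE_READ_US = 50                      # within the 20-100 microsecond range

flash_reads_per_sec = 1_000_000 / PAGE_READ_US
print(f"one flash chip: ~{flash_reads_per_sec:,.0f} page reads/sec")
print("magnetic disk:  ~50-200 random block reads/sec")
```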
Flash Storage (Cont.)
Erase happens in units of an erase block
• Takes 2 to 5 milliseconds
• Erase block typically 256 KB to 1 MB (128 to 256 pages)
Remapping of logical page addresses to physical page addresses avoids waiting for erase
Flash translation table tracks the mapping
• also stored in a label field of flash page
• remapping carried out by flash translation layer
After 100,000 to 1,000,000 erases, an erase block becomes unreliable and cannot be used
Page writes are remapped to fresh physical pages; wear leveling spreads writes (and hence erases) evenly across erase blocks so that no single block wears out prematurely
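A toy sketch of the remapping idea, assuming a simple in-memory translation table (names and structure invented for illustration; real flash translation layers persist the mapping, erase whole blocks, and garbage-collect stale pages):

```python
# Toy flash translation layer: a rewrite of a logical page goes to a fresh
# physical page and only the mapping is updated, so the write path never
# waits for an erase. Erase counts give a crude form of wear leveling.
# (Simplification: we pretend stale pages are reclaimed one at a time;
# real devices erase whole erase blocks and garbage-collect.)
class ToyFTL:
    def __init__(self, num_physical_pages: int):
        self.mapping = {}                               # logical -> physical page
        self.free_pages = list(range(num_physical_pages))
        self.erase_count = [0] * num_physical_pages

    def write(self, logical_page: int, data: bytes) -> int:
        # Crude wear leveling: pick the least-worn free physical page.
        self.free_pages.sort(key=lambda p: self.erase_count[p])
        physical = self.free_pages.pop(0)
        old = self.mapping.get(logical_page)            # old copy becomes stale
        self.mapping[logical_page] = physical
        # ... `data` would be programmed into page `physical` here ...
        if old is not None:
            self.erase_count[old] += 1                  # reclaimed by a later erase
            self.free_pages.append(old)
        return physical

ftl = ToyFTL(num_physical_pages=8)
ftl.write(3, b"v1")
ftl.write(3, b"v2")          # rewrite lands on a different physical page
print(ftl.mapping)           # e.g. {3: 1}
```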
Storage Class Memory
3D-XPoint memory technology pioneered by Intel
Available as Intel Optane
• SSD interface shipped from 2017
Allows lower latency than flash SSDs
• Non-volatile memory interface announced in 2018
Supports direct access to words, at speeds comparable to main memory
RAID
Improvement of Reliability via Redundancy
RAID Levels
Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
• Different RAID organizations, or RAID levels, have differing cost, performance, and reliability characteristics
RAID Level 0: Block striping; non-redundant.
• Used in high-performance applications where data loss is not critical.
RAID Level 1: Mirrored disks with block striping
• Offers best write performance.
• Popular for applications such as storing log files in a database system.
RAID Levels (Cont.)
Parity blocks: parity block j stores the XOR of bits from block j of each disk
• When writing data to a block j, parity block j must also be computed and written to disk
   Can be done by using old parity block, old value of current block and new value of current block (2 block reads + 2 block writes)
   Or by recomputing the parity value using the new values of blocks corresponding to the parity block
   • More efficient for writing large amounts of data sequentially
• To recover data for a block, compute XOR of bits from all other blocks in the set, including the parity block
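Both ways of maintaining the parity block, and the recovery step, reduce to XOR. A minimal sketch over equal-sized byte blocks:

```python
# XOR-based parity as used in parity RAID levels: computing a parity block,
# updating it on a small write, and reconstructing a lost data block.
def xor_blocks(*blocks: bytes) -> bytes:
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

d = [b"\x0f" * 4, b"\xf0" * 4, b"\x55" * 4, b"\xaa" * 4]   # data blocks of one stripe
parity = xor_blocks(*d)                                     # parity block

# Small write: new parity = old parity XOR old data XOR new data
# (2 block reads + 2 block writes, no need to read the whole stripe).
new_d2 = b"\x12" * 4
parity = xor_blocks(parity, d[2], new_d2)
d[2] = new_d2

# Recovery: a lost block is the XOR of all surviving blocks plus parity.
recovered = xor_blocks(d[0], d[1], d[3], parity)
assert recovered == d[2]
```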
RAID Levels (Cont.)
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
• E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
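The placement rule in the example is easy to check directly; this snippet (numbering the disks 1 to 5 as in the text) prints which disk holds the parity block for the first few sets:

```python
# RAID 5 parity placement for 5 disks: parity for the nth set of blocks
# goes on disk (n mod 5) + 1, so parity rotates across all the disks.
NUM_DISKS = 5
for n in range(10):
    parity_disk = (n % NUM_DISKS) + 1
    data_disks = [d for d in range(1, NUM_DISKS + 1) if d != parity_disk]
    print(f"set {n}: parity on disk {parity_disk}, data on disks {data_disks}")
```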
RAID Levels (Cont.)
RAID Level 5 (Cont.)
• Block writes occur in parallel if the blocks and their parity blocks are on different disks.
RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores two error-correction blocks (P, Q) instead of a single parity block, to guard against multiple disk failures.
• Better reliability than Level 5 at a higher cost
Becoming more important as storage sizes increase
RAID Levels (Cont.)
Other levels (not used in practice):
• RAID Level 2: Memory-Style Error-Correcting Codes (ECC) with bit striping
• RAID Level 3: Bit-Interleaved Parity
• RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a separate parity disk for corresponding blocks from N other disks
   RAID 5 is better than RAID 4: with RAID 4, random writes give the parity disk a much higher write load than the other disks, making it a bottleneck
Choice of RAID Level
Factors in choosing RAID level
• Monetary cost
• Performance: number of I/O operations per second, and bandwidth during normal operation
• Performance during failure
• Performance during rebuild of failed disk
Including time taken to rebuild failed disk
RAID 0 is used only when data safety is not important
• E.g. data can be recovered quickly from other sources
Choice of RAID Level (Cont.)
Hardware Issues
Hardware Issues (Cont.)
Latent failures: data successfully written earlier gets damaged
• can result in data loss even if only one disk fails
Data scrubbing:
• continually scan for latent failures, and recover from copy/parity
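A toy sketch of a scrubbing pass, assuming an in-memory "disk" with a mirror copy and stored checksums (the names and structure are invented for illustration; real scrubbing is done by the RAID controller or the operating system):

```python
import zlib

# Toy data-scrubbing pass: verify every block's checksum and repair blocks
# with latent damage from the mirror copy (parity could be used instead).
def scrub(primary: list, mirror: list, checksums: list) -> int:
    repaired = 0
    for i, block in enumerate(primary):
        if zlib.crc32(block) != checksums[i]:        # latent failure detected
            primary[i] = mirror[i]                   # recover from the copy
            repaired += 1
    return repaired

blocks = [bytes([i]) * 512 for i in range(4)]
sums = [zlib.crc32(b) for b in blocks]
primary, mirror = list(blocks), list(blocks)
primary[2] = b"\x00" * 512                           # simulate silent corruption
print("repaired:", scrub(primary, mirror, sums))     # -> repaired: 1
```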
Hot swapping: replacement of a disk while the system is running, without powering down
• Supported by some hardware RAID systems
• Reduces time to recovery and improves availability greatly
Many systems maintain spare disks which are kept online and used as replacements for failed disks immediately on detection of failure
• Reduces time to recovery greatly
Many hardware RAID systems ensure that a single point of failure will not stop the functioning of the system, by using
• Redundant power supplies with battery backup
• Multiple controllers and multiple interconnections to guard against controller/interconnection failures
Optimization of Disk-Block Access
End of Chapter 12