Lecture 01 - File Storage - Part 1
References/Reading Materials
-Describe and use Extendible hashing and Linear Hashing (Part 1 and Part 2)
• The DBMS software can then retrieve, update, and process this data as needed.
• Storage hierarchy
○ Primary storage
○ Secondary storage
○ Tertiary storage
Memory Hierarchies and Storage Devices
● Primary storage:
- Can be operated directly by the CPU.
- Provides fast access to data but is of limited storage capacity
○ If the speed of disk rotation is p revolutions per minute (rpm), then the average
rotational delay rd is the time for half a revolution:
○ rd = (1/2)*(1/p) min = (60*1000)/(2*p) msec
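As a quick check of the formula above, here is a minimal Python sketch. The function name and the sample rotation speeds (7200 and 15000 rpm) are illustrative assumptions, not values from the lecture.

```python
def average_rotational_delay_ms(rpm: float) -> float:
    """Average rotational delay rd in milliseconds: half of one revolution."""
    ms_per_minute = 60 * 1000
    return ms_per_minute / (2 * rpm)


# Sample speeds chosen for illustration only.
print(average_rotational_delay_ms(7200))   # roughly 4.17 msec
print(average_rotational_delay_ms(15000))  # 2.0 msec
```

A faster disk (larger p) gives a proportionally smaller rd, since the delay is inversely proportional to the rotation speed.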
Parameters of Disks
● When the block size is larger than the record size, each block will contain numerous
records.
● Some files may have unusually large records that cannot fit in one block.
Record Blocking and Spanned vs. Unspanned
Records (cont’d.)
● Records of a file must be allocated to disk blocks.
● Suppose that the block size is B bytes. For a file of fixed-length records of size R bytes, with
B ≥ R,
○ we can fit bfr = ⎣B/R⎦ records per block, where bfr is the blocking factor.
○ In general, R may not divide B exactly, so the unused space in each block is
B − (bfr * R) bytes.
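The two formulas above can be sketched in Python as follows. The function names and the sample sizes (B = 512, R = 100) are assumptions made for illustration.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """bfr = floor(B / R) for fixed-length records with B >= R."""
    if record_size > block_size:
        raise ValueError("record does not fit in one block (B < R)")
    return block_size // record_size


def unused_bytes_per_block(block_size: int, record_size: int) -> int:
    """Unused space per block in unspanned organization: B - (bfr * R)."""
    return block_size - blocking_factor(block_size, record_size) * record_size


# Illustrative sizes only: B = 512 bytes, R = 100 bytes.
print(blocking_factor(512, 100))         # 5 records per block
print(unused_bytes_per_block(512, 100))  # 12 bytes wasted per block
```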
● To utilize the unused space, store part of the record in one block and the rest in
another.
● If the rest of the record is not in the next consecutive block, a pointer at the end of
the first block points to the block that holds the remainder of the record.
● This organization is called spanned.
● If records are not allowed to cross block boundaries, the organization is called
unspanned.
○ used with fixed-length records having B > R because it makes each record start
at a known location in the block, simplifying record processing.
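The difference between spanned and unspanned organization: in spanned organization, part of a
record may be stored in one block and the rest in another; in unspanned organization, each
record is stored entirely within one block.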
Spanned vs. Unspanned Records (cont’d.)
● Unspanned organization
● Spanned organization
● For variable-length records using spanned organization, each block may store a
different number of records.
● In this case, the blocking factor bfr represents the average number of records per
block for the file.
● bfr is used to calculate the number of blocks b needed for a file of r records:
○ b = ⎡(r/bfr)⎤ blocks
○ ⎡(x)⎤ (ceiling function) rounds the value x up to the next integer.
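A small Python sketch of the block-count formula above. The function name and the sample file sizes are assumptions for illustration; bfr may be fractional when it is an average over variable-length records.

```python
import math


def blocks_needed(num_records: int, bfr: float) -> int:
    """b = ceil(r / bfr): number of blocks needed for a file of r records."""
    return math.ceil(num_records / bfr)


# Illustrative values only.
print(blocks_needed(30000, 5))  # 6000 blocks
print(blocks_needed(30001, 5))  # 6001 blocks: one extra record needs one more block
```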
File Headers (file descriptor)
● Contains information about a file that is needed by the system programs that access
the file records
● Includes information
○ To determine the disk address of the file blocks
○ Record format descriptions
Files of Unordered Records
(Heap Files/Pile Files)
● Simplest and the most basic type of organization
● Records are unordered
● New records are inserted at the end of the file
● Can use spanned or unspanned organization
● Inserting a record
○ Very efficient: the last block of the file is copied into the buffer, the new record
is added, and the block is written back to disk.
○ The address of the last block is kept in the file header.
● Searching a record
○ Involves a linear search, which is an expensive procedure.
○ Requires searching b/2 blocks on average when exactly one record satisfies the search.
○ If no records or several records satisfy the search, then all b blocks must be
searched.
● Deleting a record
○ Find the block, copy the block into the buffer, delete the record, rewrite the
block to the disk.
○ Leaves wasted storage space
○ Expensive operation
Techniques used in deleting
● Reuse the space of deleted records when inserting new records
○ This requires keeping track of empty locations.
● Have an extra bit/byte, called a deletion marker, stored with each record
○ When a record is deleted, set its deletion marker to a certain value.
○ When searching, consider only valid records.
○ Requires periodic reorganization to reclaim unused space.
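The heap file operations and the deletion-marker technique above can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions: records are bare integer keys, blocks are Python lists, and the class and method names are hypothetical rather than taken from any DBMS.

```python
class HeapFile:
    """Toy in-memory heap file: a list of blocks, each holding up to bfr slots."""

    def __init__(self, bfr):
        self.bfr = bfr
        self.blocks = []  # each slot is [key, deletion_marker]

    def insert(self, key):
        # New records always go at the end of the file (last block).
        if not self.blocks or len(self.blocks[-1]) == self.bfr:
            self.blocks.append([])
        self.blocks[-1].append([key, False])

    def search(self, key):
        """Linear search; returns (found, number_of_blocks_read)."""
        for blocks_read, block in enumerate(self.blocks, start=1):
            for slot_key, deleted in block:
                if slot_key == key and not deleted:  # skip invalid records
                    return True, blocks_read
        return False, len(self.blocks)  # unsuccessful search reads all b blocks

    def delete(self, key):
        """Set the deletion marker instead of physically removing the record."""
        for block in self.blocks:
            for slot in block:
                if slot[0] == key and not slot[1]:
                    slot[1] = True
                    return True
        return False

    def reorganize(self):
        """Periodic reorganization: repack valid records to reclaim space."""
        live = [k for block in self.blocks for k, deleted in block if not deleted]
        self.blocks = []
        for k in live:
            self.insert(k)


heap = HeapFile(bfr=3)
for key in range(1, 8):   # 7 records fill 3 blocks when bfr = 3
    heap.insert(key)
print(heap.search(5))     # (True, 2): found in the second block
```

Deleting a record leaves the block count unchanged until `reorganize` is called, which is the wasted space the slide refers to.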
Files of Ordered Records (Sorted Files)
● Records are ordered based on the values of one of the fields, called the ordering field.
● The ordering field may be the key field.
● Reading the records in order of the ordering field values becomes efficient.
● Finding the next record from the current one is efficient: the next record is in the
same block unless the current record is the last one in that block.
● Searching for records on the ordering field is very efficient.
○ Can use binary search
○ Usually accesses about log2(b) blocks
● Provides no advantage for random or ordered access based on a nonordering
field.
● Inserting records is expensive.
○ Find the correct position
○ Must make space in the file to insert the record.
Techniques
● Keep unused space in each block
○ Once the space is used, original problem resurfaces.
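To make the log2(b) search cost for sorted files concrete, here is a minimal sketch of binary search over blocks. It assumes a toy model in which the file is a Python list of non-empty blocks, each block a sorted list of keys; the function name and the sample file are illustrative.

```python
def binary_search_blocks(blocks, key):
    """Binary search on the ordering field; returns (found, blocks_read)."""
    lo, hi = 0, len(blocks) - 1
    blocks_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]      # one block access
        blocks_read += 1
        if key < block[0]:       # key lies before this block
            hi = mid - 1
        elif key > block[-1]:    # key lies after this block
            lo = mid + 1
        else:                    # key can only be inside this block
            return key in block, blocks_read
    return False, blocks_read


# Illustrative file: b = 16 blocks of 4 keys each, holding keys 0..63.
sorted_file = [[i * 4 + j for j in range(4)] for i in range(16)]
print(binary_search_blocks(sorted_file, 37))  # (True, 3)
```

With b = 16, a search reads only a handful of blocks, compared with about b/2 = 8 on average for a linear search of an unordered file.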
Hashing Techniques
Static hashing
● The address space is made up of buckets, each of which holds multiple records.
● A bucket is either one disk block or a cluster of contiguous blocks.
● The hashing function maps a key into a relative bucket number.
● Using the file header, the bucket number is converted into the corresponding disk
block address.
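The two-step mapping in static hashing (key to relative bucket number, then bucket number to disk block address) might look like the sketch below. The modulo hash function, the number of buckets, and the address table are all assumptions made for illustration; the lecture does not prescribe a particular hash function.

```python
NUM_BUCKETS = 4  # fixed when the file is created (static hashing)

# Hypothetical table kept in the file header: bucket number -> disk block address.
BUCKET_ADDRESSES = [100, 104, 108, 112]


def bucket_number(key: int) -> int:
    """Assumed hash function: key mod number of buckets."""
    return key % NUM_BUCKETS


def block_address(key: int) -> int:
    """Convert the relative bucket number into a disk block address."""
    return BUCKET_ADDRESSES[bucket_number(key)]


print(bucket_number(13))  # 1
print(block_address(13))  # 104
```

Because NUM_BUCKETS is fixed, the file cannot grow or shrink gracefully, which is the motivation for the dynamic schemes that follow.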
Dynamic hashing
● Extendible hashing
○ Uses a type of directory: an array of 2^d bucket addresses.
○ d is known as the global depth.
● In the example, the directory is an array of size 4 (2^2), so d = 2.
● Each element in the array is a pointer to a bucket.
To locate an entry (record), apply the hash function to the search field and take the last
two bits (because d is 2) of its binary representation.
E.g.:
● Locate the entry with hash value 5.
○ Binary code 101
○ Look at directory entry 01 and follow the pointer.
● Insert a data entry with hash value 13.
○ Binary code 1101
○ Last two bits 01
○ This leads to bucket B.
○ The bucket has space, so insert the entry.
● Insert an entry with hash value 20.
○ Binary code 10100
○ Last two bits 00
○ This leads to bucket A.
○ Bucket A is full.
○ Split the bucket by allocating a new bucket (its split image, A2) and redistribute the
contents across the old bucket and its split image.
• We need three bits to discriminate between A and A2.
• The directory has only enough slots to store all two-bit patterns.
• So double the directory.
Binary codes
20 - 10100
4 - 100
12 - 1100
32 - 100000
16 - 10000
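The binary codes above show why a third bit is needed. A short sketch, with an illustrative helper name, extracts the last d bits of each hash value for d = 2 and d = 3:

```python
def last_bits(hash_value: int, d: int) -> str:
    """Return the last d bits of hash_value as a zero-padded binary string."""
    return format(hash_value & ((1 << d) - 1), f"0{d}b")


for h in (20, 4, 12, 32, 16):
    print(h, last_bits(h, 2), last_bits(h, 3))
```

With d = 2 all five values end in 00, so they collide in bucket A. With d = 3, 32 and 16 end in 000 while 4, 12, and 20 end in 100, which separates A from its split image A2.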
● Does splitting a bucket always necessitate doubling the directory?
○ Not always.
○ E.g.:
○ Insert an entry with hash value 9 (binary 1001, last three bits 001).
○ It belongs to bucket B.
○ Bucket B is full, so split it and use directory elements 001 and 101 to point to B and
its split image.
○ But if A or A2 becomes full and an insertion then forces it to split, the directory
must be doubled again.
Binary codes
1 - 001
9 - 1001
5 - 101
21 - 10101
13 - 1101
● To determine whether a directory doubling is needed, we maintain a local depth for
each bucket.
● If a bucket whose local depth is equal to the global depth is split, the directory must
be doubled.
● Initially, all local depths are equal to the global depth.
● Increase the global depth by 1 each time the directory is doubled.
● Increase by 1 the local depth of the split bucket and assign this same local depth to
its split image.
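The local depth and global depth rules above can be put together in a minimal in-memory sketch. Several things here are assumptions for illustration: the class and method names are hypothetical, entries are bare hash values, the bucket capacity is 4, and there are no overflow pages, so more than `capacity` identical hash values would split forever. The inserted values are the ones listed in the binary code tables for buckets A and B.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []


class ExtendibleHashIndex:
    def __init__(self, global_depth=2, capacity=4):
        self.global_depth = global_depth
        self.capacity = capacity
        # Initially one bucket per slot, so every local depth equals the global depth.
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _slot(self, h):
        # Directory index = last global_depth bits of the hash value.
        return h & ((1 << self.global_depth) - 1)

    def lookup(self, h):
        return h in self.directory[self._slot(h)].entries

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.entries) < self.capacity:
            bucket.entries.append(h)
            return
        # Bucket is full. Double the directory only if local depth == global depth.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory  # new half mirrors the old
            self.global_depth += 1
        # Split: both the bucket and its split image get local depth + 1.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        new_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = image
        # Redistribute the old contents, then retry the pending insert.
        old_entries, bucket.entries = bucket.entries, []
        for e in old_entries:
            self.directory[self._slot(e)].entries.append(e)
        self.insert(h)


index = ExtendibleHashIndex(global_depth=2, capacity=4)
for h in (4, 12, 32, 16, 1, 5, 21, 13):
    index.insert(h)      # fills bucket A (slot 00) and bucket B (slot 01)
index.insert(20)         # A is full with local depth == global depth: directory doubles
index.insert(9)          # B is full with local depth < global depth: no doubling
print(index.global_depth, len(index.directory))  # 3 8
```

After these inserts, slot 000 holds 32 and 16, slot 100 holds 4, 12, and 20, slot 001 holds 1 and 9, and slot 101 holds 5, 21, and 13, matching the walkthrough above.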
Exercise 01
Consider the following extendible hashing index
diagram. You may assume that the entries in the
index are hash values.