Basic File Structure
Databases are stored physically as files of records, which are typically stored on magnetic disks. Access to these physical database file structures can be speeded up further by auxiliary data structures called indexes.
The collection of data that makes up a computerized database must be stored physically on some computer storage medium. The DBMS software can then retrieve, update, and process this data as needed. Computer storage media form a storage hierarchy that includes two main categories:
(In other words, the database, which is a collection of records, must be stored on some physical medium such as a hard disk; only then can we retrieve the data.)
■ Primary storage. This category includes storage media that can be operated on directly by the computer's central processing unit (CPU), such as the computer's main memory and smaller but faster cache memories. Primary storage provides fast access to data but has limited storage capacity.
■ Secondary storage. This category includes magnetic disks, optical disks, and tapes. These devices have larger capacity and lower cost but provide slower access to data; data on secondary storage cannot be processed directly by the CPU and must first be copied into primary storage.
[Figure: record storage formats — (b) a record with two variable-length fields and three fixed-length fields; (c) a variable-field record with three types of separator characters.]
For variable-length fields, each record has a value for each field, but we do not know the exact length of some field values. To determine the bytes within a particular record that represent each field, we can use special separator characters (such as ?, %, or $) that do not appear in any field values to terminate variable-length fields.
A file of records with optional fields can be formatted in different ways. If the total number of fields for the record type is large, but the number of fields that actually appear in a typical record is small, we can include in each record a sequence of <field-name, field-value> pairs rather than just the field values. Separator characters are then needed for separating the field name from the field value and for separating one field from the next field. A more practical option is to assign a short field type code—say, an integer number—to each field and include in each record a sequence of <field-type, field-value> pairs rather than <field-name, field-value> pairs.
A repeating field needs one separator character to separate the repeating values of
the field and another separator character to indicate termination of the field.
Finally, for a file that includes records of different types, each record is preceded by a
record type indicator.
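As a rough illustration of the <field-type, field-value> idea, the sketch below encodes only the fields present in a record as integer type codes with their values, using '%' as the field separator and '$' as the record terminator; the particular type codes and the '=' sign between type and value are assumptions made for this example, not a standard format.

```python
# Minimal sketch of a <field-type, field-value> record encoding.
# The type codes (1 = Name, 2 = Ssn, 3 = Salary) and the separators
# '%', '$', and '=' are hypothetical choices for illustration only.

FIELD_SEP = "%"    # separates one <type, value> pair from the next
RECORD_TERM = "$"  # marks the end of the record

def encode_record(pairs):
    """Encode only the fields actually present as 'type=value' items."""
    body = FIELD_SEP.join(f"{ftype}={value}" for ftype, value in pairs)
    return body + RECORD_TERM

def decode_record(data):
    """Recover the <type, value> pairs from an encoded record."""
    body = data.rstrip(RECORD_TERM)
    pairs = []
    for item in body.split(FIELD_SEP):
        ftype, value = item.split("=", 1)
        pairs.append((int(ftype), value))
    return pairs

# Only two of the (possibly many) optional fields appear in this record.
encoded = encode_record([(1, "Smith"), (3, "30000")])
print(encoded)                 # 1=Smith%3=30000$
print(decode_record(encoded))  # [(1, 'Smith'), (3, '30000')]
```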
The records of a file must be allocated to disk blocks because a block is the unit of
data transfer between disk and memory. When the block size is larger than the
record size, each block will contain numerous records, although some files may have
unusually large records that cannot fit in one block.
Suppose the block size is B bytes. For a file of fixed-length records of size R bytes, with B ≥ R, we can fit bfr = ⌊B/R⌋ records per block, where ⌊x⌋ (the floor function) rounds the number x down to an integer. The value bfr is called the blocking factor of the file. In general, R may not divide B exactly, so we have some unused space in each block equal to
B − (bfr * R) bytes
To utilize this unused space, we can store part of a record on one block and the rest
on another. A pointer at the end of the first block points to the block containing the
remainder of the record in case it is not the next consecutive block on disk. This
organization is called spanned because records can span more than one block.
Whenever a record is larger than a block, we must use a spanned organization. If
records are not allowed to cross block boundaries, the organization is called
unspanned.
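To make the block arithmetic concrete, the sketch below computes the blocking factor bfr, the unused bytes per block, and the number of blocks needed for r records under unspanned and spanned organization; the block size, record size, and record count are made-up example values.

```python
import math

def blocking_stats(B, R, r):
    """B = block size in bytes, R = fixed record size, r = number of records."""
    bfr = B // R                            # bfr = floor(B / R) records per block
    unused = B - bfr * R                    # wasted bytes per block (unspanned)
    blocks_unspanned = math.ceil(r / bfr)   # each record confined to one block
    blocks_spanned = math.ceil(r * R / B)   # records may cross block boundaries
    return bfr, unused, blocks_unspanned, blocks_spanned

# Example values (assumed for illustration): 512-byte blocks, 100-byte records.
bfr, unused, b_unspanned, b_spanned = blocking_stats(B=512, R=100, r=10_000)
print(bfr, unused, b_unspanned, b_spanned)   # 5 12 2000 1954
```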
File Headers
A file header or file descriptor contains information about a file that is needed by the system programs that access the file records. The header includes information to determine the disk addresses of the file blocks as well as record format descriptions, which may include field lengths and the order of fields within a record for fixed-length unspanned records, and field type codes, separator characters, and record type codes for variable-length records.
To search for a record on disk, one or more blocks are copied into main memory
buffers. Programs then search for the desired record or records within the buffers,
using the information in the file header. If the address of the block that contains the
desired record is not known, the search programs must do a linear search through the
file blocks. Each file block is copied into a buffer and searched until the record is
located or all the file blocks have been searched unsuccessfully. This can be very
time-consuming for a large file. The goal of a good file organization is to locate the
block that contains a desired record with a minimal number of block transfers.
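To make the buffer-based search concrete, here is a minimal sketch of a linear search over file blocks; read_block is an assumed stand-in for the routine that copies a disk block into a main memory buffer, and records are modeled as dictionaries.

```python
def linear_search(read_block, num_blocks, key_field, key_value):
    """Scan file blocks one at a time until a matching record is found.

    read_block(i) is assumed to return block i as a list of record dicts,
    standing in for the copy of a disk block into a main memory buffer.
    """
    blocks_read = 0
    for i in range(num_blocks):
        buffer = read_block(i)          # one block transfer disk -> memory
        blocks_read += 1
        for record in buffer:           # search within the buffer
            if record.get(key_field) == key_value:
                return record, blocks_read
    return None, blocks_read            # all blocks searched unsuccessfully
```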
Operations on Files
■ Reset. Sets the file pointer of an open file to the beginning of the file.
■ Find (or Locate). Searches for the first record that satisfies a search condition.
■ Read (or Get). Copies the current record from the buffer to a program variable in the user program.
■ FindNext. Searches for the next record in the file that satisfies the search condition.
■ Delete. Deletes the current record and (eventually) updates the file on disk to reflect the deletion.
■ Modify. Modifies some field values for the current record and (eventually) updates the file on disk to reflect the modification.
■ Insert. Inserts a new record in the file by locating the block where the record is to be inserted, transferring that block into a main memory buffer (if it is not already there), writing the record into the buffer, and (eventually) writing the buffer to disk to reflect the insertion.
■ Close. Completes the file access by releasing the buffers and performing any other needed cleanup operations.
■ Scan. If the file has just been opened or reset, Scan returns the first record; otherwise it returns the next record. If a condition is specified with the operation, the returned record is the first or next record satisfying the condition.
■ FindAll. Locates all the records in the file that satisfy a search condition.
■ FindOrdered. Retrieves all the records in the file in some specified order.
■ Reorganize. Starts the reorganization process; for example, the records may be reordered by sorting them on a specified field.
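These operations are typically combined record-at-a-time. The toy cursor below is a minimal sketch rather than an actual DBMS API: it mimics Reset, FindNext, and Read over an in-memory list of blocks so the access pattern can be seen end to end.

```python
class FileCursor:
    """Toy in-memory stand-in for an open file plus buffer, supporting a few
    of the operations above (Reset, FindNext, Read). Blocks are lists of dicts."""

    def __init__(self, blocks):
        self.blocks = blocks
        self.reset()

    def reset(self):                       # Reset: file pointer back to the start
        self.block_no, self.rec_no = 0, -1

    def find_next(self, condition):        # FindNext: advance to next matching record
        b, r = self.block_no, self.rec_no + 1
        while b < len(self.blocks):
            while r < len(self.blocks[b]):
                if condition(self.blocks[b][r]):
                    self.block_no, self.rec_no = b, r
                    return True
                r += 1
            b, r = b + 1, 0
        return False

    def read(self):                        # Read: copy current record to a program variable
        return dict(self.blocks[self.block_no][self.rec_no])

# Usage: retrieve every record with Salary > 30000.
f = FileCursor([[{"Name": "Smith", "Salary": 25000}],
                [{"Name": "Wong", "Salary": 43000}]])
f.reset()
while f.find_next(lambda rec: rec["Salary"] > 30000):
    print(f.read())
```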
Files of Unordered Records (Heap Files)
In this type of organization, records are placed in the file in the order in which they are inserted, so new records are inserted at the end of the file. Such an organization is called a heap or pile file.
It is also used to collect and store data records for future use.
Inserting a new record is very efficient: the last disk block of the file is copied into a buffer, the new record is added, and the block is then rewritten back to disk. The address of the last file block is kept in the file header.
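A minimal sketch of this insertion procedure over an in-memory model of the file: the last block is extended if it has room, otherwise a new block is started; the blocking factor and the list-of-blocks representation are assumptions for illustration.

```python
def heap_insert(blocks, record, bfr):
    """Append a record to the last block of a heap file (in-memory model).

    blocks: list of blocks, each a list of at most bfr records.
    The address of the last block would normally come from the file header.
    """
    if blocks and len(blocks[-1]) < bfr:
        last = blocks[-1]          # last block copied into a buffer
        last.append(record)        # new record added
        # ... the buffer would then be rewritten back to disk
    else:
        blocks.append([record])    # allocate a new block for the record
    return blocks

blocks = heap_insert([], {"Name": "Smith"}, bfr=2)
blocks = heap_insert(blocks, {"Name": "Wong"}, bfr=2)
blocks = heap_insert(blocks, {"Name": "Zelaya"}, bfr=2)
print(blocks)   # [[{'Name': 'Smith'}, {'Name': 'Wong'}], [{'Name': 'Zelaya'}]]
```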
However, searching for a record using any search condition involves a linear search through the file block by block, which is an expensive procedure. If only one record satisfies the search condition, then, on average, a program will read into memory and search half the file blocks before it finds the record; for a file of b blocks, this requires searching (b/2) blocks, on average.
To read all records in order of the values of some field, we create a sorted copy of the file. Sorting is an expensive operation for a large disk file, and special techniques for external sorting are used.
For a file of unordered fixed-length records using unspanned blocks and contiguous allocation, it is straightforward to access any record by its position in the file.
Files of Ordered Records (Sorted Files)
We can physically order the records of a file on disk based on the values of one of their fields—called the ordering field. This leads to an ordered or sequential file. Ordered records have some advantages over unordered files:
1. Reading the records in order of the ordering field values is extremely efficient because no sorting of the file is required.
2. Finding the next record from the current one in order of the ordering field usually requires no additional block accesses, because the next record is often in the same block as the current one.
3. Using a search condition based on the value of an ordering key field results in faster access when the binary search technique is used, which constitutes an improvement over linear searches.
A binary search for disk files can be done on the blocks rather than on the records; it accesses approximately log2(b) blocks, whether the record is found or not.
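A rough sketch of block-level binary search: each probe reads one block and compares the search value against the first and last ordering-field values in that block, so only about log2(b) blocks are transferred; read_block and the record layout are assumed stand-ins for the real disk interface.

```python
def binary_search_blocks(read_block, num_blocks, ordering_field, value):
    """Binary search on the blocks of an ordered file.

    read_block(i) is assumed to return block i as a non-empty list of record
    dicts, sorted on ordering_field across the whole file.
    """
    lo, hi = 0, num_blocks - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        buffer = read_block(mid)                      # one block access per probe
        if value < buffer[0][ordering_field]:         # before this block
            hi = mid - 1
        elif value > buffer[-1][ordering_field]:      # after this block
            lo = mid + 1
        else:                                         # must be in this block, if anywhere
            for record in buffer:
                if record[ordering_field] == value:
                    return record
            return None
    return None
```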
Disadvantages
Inserting and deleting records are expensive operations for an ordered file because the records must remain physically ordered. To insert a record, we must find its correct position in the file, based on its ordering field value, and then make space in the file to insert the record in that position. For a large file this can be very time-consuming because, on average, half the records of the file must be moved to make space for the new record. This means that half the file blocks must be read and rewritten after records are moved among them.
Hashing Techniques
Another type of primary file organization is based on hashing, which provides very fast access to records under certain search conditions. The search condition must be an equality condition on a single field, called the hash field. In most cases, the hash field is also a key field of the file, in which case it is called the hash key. The idea behind hashing is to provide a function h, called a hash function or randomizing function, which is applied to the hash field value of a record and yields the address of the disk block in which the record is stored.
Internal Hashing
For internal files, hashing is typically implemented as a hash table through the use of an array of records. Suppose that the array index range is from 0 to M − 1, as shown in Figure 17.8(a); then we have M slots whose addresses correspond to the array indexes. We choose a hash function that transforms the hash field value into an integer between 0 and M − 1. One common hash function is h(K) = K mod M, which returns the remainder of an integer hash field value K after division by M; this value is then used for the record address.
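A minimal illustration of h(K) = K mod M over an array of M slots; the table size and key values are arbitrary examples, and collisions are ignored here since collision resolution is discussed below.

```python
M = 7                         # number of slots in the hash table (example value)
table = [None] * M            # array of M record slots, addresses 0 .. M-1

def h(K):
    return K % M              # hash function: remainder of K after division by M

for key in (2345, 881, 9403):
    table[h(key)] = ("record for", key)   # collisions ignored in this sketch

print([h(key) for key in (2345, 881, 9403)])   # [0, 6, 2]
```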
External Hashing
Hashing for disk files is called external hashing. To suit the characteristics of disk storage, the target address space is made of buckets, each of which holds multiple records. A bucket is either one disk block or a cluster of contiguous disk blocks. The hashing function maps a key into a relative bucket number, rather than assigning an absolute block address to the bucket. A table maintained in the file header converts the bucket number into the corresponding disk block address.
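The two-step mapping can be sketched as follows: the hash function yields a relative bucket number, and a small header table translates that number into a disk block address; the block addresses below are made up for the example.

```python
M = 4                                  # number of buckets (example value)
# Bucket-number -> disk block address table, as kept in the file header.
bucket_to_block = {0: 7005, 1: 7030, 2: 7102, 3: 7211}   # made-up block addresses

def bucket_of(key):
    return key % M                     # relative bucket number, not a block address

key = 987654321
print(bucket_of(key), bucket_to_block[bucket_of(key)])   # 1 7030
```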
A collision occurs when the hash field value of a record being inserted hashes to an address that already contains a different record. In that situation, we must insert the new record in some other position, since its hash address is occupied. The process of finding another position is called collision resolution.
There are numerous methods for collision resolution, including the following:
■ Open addressing. Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found. Algorithm 17.2(b) may be used for this purpose (a sketch of this approach appears after this list).
■ Chaining. For this method, various overflow locations are kept, usually by extending the array with a number of overflow positions. Additionally, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location. A linked list of overflow records for each hash address is thus maintained.
■ Multiple hashing. The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing or applies a third hash function and then uses open addressing if necessary.
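As a sketch of the open addressing method (in the spirit of Algorithm 17.2(b), whose exact pseudocode is not reproduced here), the insertion below probes subsequent positions, wrapping around, until an empty slot is found; the table size and keys are example values.

```python
M = 7
table = [None] * M

def insert_open_addressing(key, record):
    """Place the record at h(key), or at the next free position (mod M)."""
    pos = key % M
    for _ in range(M):                     # at most M probes
        if table[pos] is None:
            table[pos] = (key, record)
            return pos
        pos = (pos + 1) % M                # check the subsequent position
    raise RuntimeError("hash table is full")

print(insert_open_addressing(2345, "rec A"))   # 2345 % 7 = 0 -> slot 0
print(insert_open_addressing(9, "rec B"))      # 9 % 7 = 2    -> slot 2
print(insert_open_addressing(16, "rec C"))     # 16 % 7 = 2 is occupied -> slot 3
```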
Each collision resolution method requires its own algorithms for insertion, retrieval, and deletion of records. With buckets, the collision problem is less severe, because as many records as will fit in a bucket can hash to the same bucket without causing problems. However, we must make provisions for the case where a bucket is filled to capacity and a new record being inserted hashes to that bucket. We can use a variation of chaining in which a pointer is maintained in each bucket to a linked list of overflow records for the bucket. The pointers in the linked list should be record pointers, which include both a block address and a relative record position within the block.
Hashing provides the fastest possible access for retrieving an arbitrary record given the value of its hash field. Although most good hash functions do not maintain records in order of hash field values, some functions—called order-preserving hash functions—do.
The hashing scheme described so far is called static hashing because a fixed number
of buckets M is allocated. This can be a serious drawback for dynamic files. Suppose
that we allocate M buckets for the address space and let m be the maximum number
of records that can fit in one bucket; then at most (m * M) records will fit in the allocated space. If the number of records turns out to be substantially fewer than (m * M), we are left with a lot of unused space. On the other hand, if the number of records increases to substantially more than (m * M), numerous collisions will result and retrieval will be slowed down because of the long lists of overflow records.
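A quick back-of-the-envelope view of the fixed capacity of a static hash file, using arbitrary example numbers for M and m:

```python
M = 1000          # number of allocated buckets (example)
m = 5             # records that fit in one bucket (example)
capacity = m * M  # at most m * M records fit in the allocated space

print(capacity)                     # 5000
print(3000 / capacity)              # 0.6 -> storing 3000 records leaves 40% unused
print(max(0, 12000 - capacity))     # 7000 records would end up in overflow chains
```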
Dynamic Hashing. A precursor to extendible hashing was dynamic hashing, in which the addresses of the buckets were either the n high-order bits or n − 1 high-order bits, depending on the total number of keys belonging to the respective bucket. The eventual storage of records in buckets for dynamic hashing is somewhat similar to extendible hashing; the main difference is in the organization of the directory. Whereas extendible hashing uses the notion of global depth (high-order d bits) for the flat directory and then combines adjacent collapsible buckets into a bucket of lower local depth, dynamic hashing maintains a tree-structured directory with two types of nodes:
■ Internal nodes that have two pointers—the left pointer corresponding to the 0 bit (in the hashed address) and the right pointer corresponding to the 1 bit.
■ Leaf nodes that hold a pointer to the actual bucket with records.
Extendible Hashing. In extendible hashing, a directory—an array of 2^d bucket addresses—is maintained, where d is called the global depth of the directory. The integer value corresponding to the first (high-order) d bits of a hash value is used as an index to the array to determine a directory entry, and the address in that entry determines the bucket in which the corresponding records are stored.
However, there does not have to be a distinct bucket for each of the 2^d directory locations. Several directory locations with the same first d′ bits for their hash values may contain the same bucket address if all the records that hash to these locations fit in a single bucket. A local depth d′—stored with each bucket—specifies the number of bits on which the bucket contents are based. Figure 17.11 shows a directory with global depth d = 3.
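The directory lookup itself amounts to taking the integer value of the first d bits of the hash value and using it as an array index. The sketch below shows this for d = 3, assuming 8-bit hash values and made-up bucket labels; bucket splitting is not modeled here.

```python
d = 3                                   # global depth of the directory
directory = ["b000", "b001", "b01", "b01",      # locations 010 and 011 share a bucket
             "b10",  "b10",  "b110", "b111"]    # so do 100 and 101 (local depth 2)
HASH_BITS = 8                           # assume 8-bit hash values for the example

def directory_index(hash_value):
    """Index = integer value of the first (high-order) d bits of the hash value."""
    return hash_value >> (HASH_BITS - d)

h = 0b01101101                          # example hash value; its first 3 bits are 011
print(directory_index(h), directory[directory_index(h)])   # 3 b01
```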
The value of d can be increased or decreased by one at a time, thus doubling or halving the number of entries in the directory array. Doubling is needed if a bucket, whose local depth d′ is equal to the global depth d, overflows. Halving occurs if d > d′ for all the buckets after some deletions occur. Most record retrievals require two block accesses—one to the directory and the other to the bucket.
To illustrate bucket splitting, suppose that a new inserted record causes overflow in
the bucket whose hash values start with 01—the third bucket in Figure 17.11. The
records will be distributed between two buckets: the first contains all records whose
hash values start with 010, and the second all those whose hash values start with
011. Now the two directory locations for 010 and 011 point to the two new distinct
buckets. Before the split, they pointed to the same bucket. The local depth d′ of the two new buckets is 3, which is one more than the local depth of the old bucket.
If a bucket that overflows and is split used to have a local depth d′ equal to the global depth d of the directory, then the size of the directory must now be doubled so that
we can use an extra bit to distinguish the two new buckets. For example, if the
bucket for records whose hash values start with 111 in Figure 17.11 overflows, the
two new buckets need a directory with global depth d = 4, because the two buckets
are now labeled 1110 and 1111, and hence their local depths are both 4. The directory size is hence doubled, and each of the other original locations in the directory
is also split into two locations, both of which have the same pointer value as did the
original location.
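The doubling step can be sketched by itself: every original directory location becomes two locations that initially carry the same bucket pointer, after which the overflowing bucket can be split across its two now-distinct locations. A minimal sketch, using the same toy representation as above:

```python
def double_directory(directory):
    """Double the directory: each original location becomes two locations that
    initially hold the same bucket pointer as the original."""
    return [entry for entry in directory for _ in range(2)]

directory = ["b0", "b10", "b11", "b11"]          # global depth d = 2 (example)
directory = double_directory(directory)          # global depth becomes d = 3
print(directory)   # ['b0', 'b0', 'b10', 'b10', 'b11', 'b11', 'b11', 'b11']
```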
The main advantage of extendible hashing is that the performance of the file does not degrade as the file grows, as opposed to static external hashing, where collisions increase and the corresponding chaining effectively
increases the average number of accesses per key. Additionally, no space is allocated
in extendible hashing for future growth, but additional buckets can be allocated
dynamically as needed. The space overhead for the directory table is negligible. The
maximum directory size is 2^k, where k is the number of bits in the hash value.
Another advantage is that splitting causes minor reorganization in most cases, since
only the records in one bucket are redistributed to the two new buckets. The only
time reorganization is more expensive is when the directory has to be doubled (or
halved). A disadvantage is that the directory must be searched before accessing the
buckets themselves, resulting in two block accesses instead of one in static hashing.
This performance penalty is considered minor, and thus the scheme is considered quite desirable for dynamic files.