MODULE - 1
The Heart of File Structure Design:
Disks (magnetic disks/optical disks) are slow. The time it takes to get information from
random access memory (RAM) is about 120 nanoseconds, i.e., 120 billionths of a second. Getting the same
information from a typical disk might take 30 milliseconds, or 30 thousandths of a second.
Disk access is therefore about a quarter of a million times longer than memory access
(30 milliseconds ÷ 120 nanoseconds = 250,000). On the other hand, disks provide enormous
capacity at much lower cost than memory, and they are nonvolatile. The tension between a
disk's relatively slow access time and its enormous, nonvolatile capacity is the driving force
behind file structure design. Good file structure design gives us access to all the data without
making our application spend a lot of time waiting for the disk.
File structure is a combination of representations for data in files and of operations for
accessing the data.
A Short History of File Structure Design
1. Early work: Early work assumed that files were stored on tapes. Access was sequential, and the
cost of access grew in direct proportion to the size of the file.
2. Emergence of disks and indexes: Sequential access was not a good solution for large
files. Disks allowed for direct access. Indexes made it possible to keep a list of keys and
the addresses of records in a small file that could be searched very quickly. With the key and
address, the user had direct access to the large primary file.
3. Emergence of trees: As indexes grew, they too became difficult to handle. Sorting the
indexes took too much time and reduced performance. The idea of using tree structures to
manage the index emerged in the early 1960s. Initially, binary search trees (BSTs) were used
for storing the records in the file. This resulted in uneven growth of the trees, which in turn
resulted in long searches requiring many disk accesses to find a record. Then AVL trees,
which are balanced trees, were used, and the problem of uneven growth was resolved.
However, AVL trees are suitable for data in memory, not for data in files. In the 1970s came
the idea of B-trees and B+ trees, which require O(log_k N) access time. Still, the efficiency
depended on the size of the file: as N (the number of records) increased, the efficiency decreased.
4. Hashing: Retrieving a record in a single access to the file. Ideally, hashing has an efficiency
of O(1). Hashing works well for files that do not change size greatly over time, but it does not
work well with dynamic files. Extendible hashing overcomes this limitation of hashing.
Logical file
The file as seen by a program. The use of a logical file allows a program to describe operations
to be performed on a file without knowing what physical file will be used. It acts as a "channel"
(like a telephone line) that hides the details of the file's location and physical format from the
program. The logical file has a logical name, which is what is used inside the program.
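A minimal sketch of this idea in C++ (the file name "students.dat" and the read loop are illustrative assumptions, not part of the notes): the program works only with the logical file, here the stream object in, while the operating system maps it to the physical file.

#include <fstream>
#include <iostream>

int main() {
    // "in" is the logical file: the name used inside the program.
    // "students.dat" names the physical file known to the operating system.
    std::ifstream in("students.dat");
    if (!in) {
        std::cerr << "cannot open physical file\n";
        return 1;
    }
    char ch;
    while (in.get(ch))   // every operation goes through the logical file
        std::cout << ch;
    return 0;
}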
The bottleneck of a disk access is moving the read/write arm. So it makes sense to store a file
in tracks that are below/above each other on different surfaces (i.e., in one cylinder), rather
than in several tracks on the same surface. Disk controllers, typically embedded in the disk
drive, act as an interface between the CPU and the disk hardware. The controller has an
internal cache (typically a few megabytes) that it uses to buffer data for read/write requests.
Estimating Capacities
Track capacity = number of sectors/track * bytes/sector
Cylinder capacity = number of tracks/cylinder * track capacity
Drive capacity = number of cylinders * cylinder capacity
Number of cylinders = number of tracks on one surface
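As a worked sketch in C++ under a purely hypothetical geometry (63 sectors/track, 512 bytes/sector, 16 tracks/cylinder, 4,096 cylinders; none of these figures come from the notes):

#include <cstdint>
#include <iostream>

int main() {
    // Hypothetical drive geometry:
    const std::uint64_t sectorsPerTrack   = 63;
    const std::uint64_t bytesPerSector    = 512;
    const std::uint64_t tracksPerCylinder = 16;    // one track per surface
    const std::uint64_t cylinders         = 4096;

    const std::uint64_t trackCap    = sectorsPerTrack * bytesPerSector;  // 32,256 bytes
    const std::uint64_t cylinderCap = tracksPerCylinder * trackCap;      // 516,096 bytes
    const std::uint64_t driveCap    = cylinders * cylinderCap;           // about 2.1 GB

    std::cout << trackCap << ' ' << cylinderCap << ' ' << driveCap << '\n';
}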
Organizing Tracks by Sector
There are two ways to organize the data on a disk: by sector and by user-defined block.
Regarding the physical placement of sectors, there are different views of the sectors on a track.
One view is of sectors as adjacent, fixed-size segments of a track that happen to hold a file.
When you want to read a series of sectors that are all in the same track, one right after the
other, you often cannot read adjacent sectors: after reading the data, it takes the disk
controller a certain amount of time to process the received information before it is ready to
accept more. If logically adjacent sectors were placed physically adjacent, we would miss the
start of the next sector while we were still processing the sector we had just read. In the
arrangement shown in the figure, it takes thirty-two revolutions to read all 32 sectors of a track.
I/O system designers have solved this problem by interleaving the sectors: leaving an
interval of several physical sectors between logically adjacent sectors. The figure below
illustrates the assignment of logical sector content to the thirty-two physical sectors of a track
with an interleaving factor of 5; it then takes five revolutions to read all 32 sectors of a track.
In the early 1990s, controller speeds improved so that disks could offer 1:1 interleaving,
meaning that successive sectors are physically adjacent and an entire track can be read in a
single rotation of the disk.
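A small C++ sketch of how such an assignment can be computed (the modular placement rule below is an illustrative assumption; real controllers used fixed interleave tables):

#include <iostream>

int main() {
    const int sectors = 32, interleave = 5;
    int physical[32];
    // Place logical sector L at physical slot (L * interleave) % sectors.
    // Since gcd(5, 32) = 1, every physical slot gets exactly one logical sector.
    for (int logical = 0; logical < sectors; ++logical)
        physical[(logical * interleave) % sectors] = logical;
    for (int slot = 0; slot < sectors; ++slot)
        std::cout << "physical slot " << slot << " holds logical sector "
                  << physical[slot] << '\n';
}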
Clusters
Another view of sector organization, designed to improve performance, is clusters. A cluster
is a fixed number of contiguous sectors. Once a given cluster has been found on a disk, all
sectors in that cluster can be accessed without requiring an additional seek.
To view a file as a series of clusters and still maintain the sectored view, the file manager ties
logical sectors to the physical clusters they belong to by using a file allocation table (FAT).
The FAT contains a list of all the clusters in a file, ordered according to the logical order of
the sectors they contain. With each cluster entry in the FAT is an entry giving the physical
location of the cluster.
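A toy FAT can be sketched as an array of "next cluster" links, as below (the cluster numbers and table size are made up for illustration):

#include <iostream>
#include <vector>

int main() {
    // fat[c] holds the next cluster of the file, or -1 at end of file.
    // Here a hypothetical file occupies clusters 2 -> 5 -> 3.
    std::vector<int> fat(8, -1);
    fat[2] = 5;
    fat[5] = 3;

    // Walk the chain from the file's first cluster
    // (which a directory entry would record).
    for (int c = 2; c != -1; c = fat[c])
        std::cout << "cluster " << c << '\n';
}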
Extents
If there is sufficient free space on a disk, it may be possible to store a file entirely in
contiguous clusters. Then we say that the file consists of one extent: all of its sectors, tracks,
and (if it is large) cylinders form one contiguous whole, and the whole file can be accessed
with the minimum amount of seeking.
Organizing Tracks by Block
Tracks can also be organized as a series of user-defined blocks rather than fixed-size sectors.
Each block is usually accompanied by sub-blocks containing extra information about the data
block, such as:
1. Count sub-block: contains the number of bytes in the accompanying data block.
2. Key sub-block: contains the keys of all the records that are stored in the
following data block.
Disks as Bottleneck
Processes are often disk-bound, i.e., the CPU often has to wait long periods of time for the
disk to transmit data before it can process it. Solutions to the disk bottleneck are:
Solution 1: Multiprogramming (the CPU works on other jobs while waiting for the disk).
Solution 2: Striping: disk striping involves splitting a file into parts and storing them on
several different drives, then letting the separate drives deliver their parts of the file to the
CPU simultaneously (it achieves parallelism).
Solution 3: RAID: Redundant Array of Independent Disks.
Solution 4: RAM disks: simulate the behaviour of a mechanical disk in main memory
(provides faster access).
Solution 5: Disk cache: a large block of main memory configured to contain pages of data
from a disk. First check the cache for the required data; if it is not available, go to the disk
and replace some page in the cache with the page from the disk containing the required data.
The Cost of Disk Access
Seek time is the time required to move the access arm to the correct cylinder. If we are
alternately accessing sectors from two files that are stored at opposite extremes of a disk
(one on the innermost cylinder, one on the outermost cylinder), seeking is very expensive.
Most hard disks available today have an average seek time of less than 10 milliseconds, and
high-performance hard disks have average seek times as low as 7.5 msec.
Rotational delay refers to the time it takes for the disk to rotate so the sector we want is
under the read/write head. A hard disk with a rotation speed of 5,000 rpm takes 12 msec for
one rotation; on average, the rotational delay is half a revolution, or 6 msec.
Transfer time: once the data we want is under the read/write head, it can be transferred. The
transfer time is given by the formula:
Transfer time = (number of bytes transferred / number of bytes on a track) × rotation time
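As a worked example with assumed figures (not from the notes): take an 8 msec average seek, 5,000 rpm rotation, and a track of 63 sectors × 512 bytes = 32,256 bytes. Reading one 512-byte sector then costs roughly 8 msec (seek) + 6 msec (rotational delay) + (512 / 32,256) × 12 msec ≈ 0.19 msec (transfer), about 14.2 msec in total; the mechanical delays dwarf the transfer itself.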
MAGNETIC TAPES
Magnetic tape units belong to a class of devices that provide no direct accessing facility but
can provide very rapid sequential access to data. Tapes are compact, stand up well under
different environmental conditions, and are easy to store and transport. Tapes were widely
used to store application data; currently, tapes are used as archival storage.
The surface of a typical tape can be seen as a set of parallel tracks, each of which is a
sequence of bits. In a nine-track tape, the nine bits that are at corresponding positions in the
nine respective tracks are taken to constitute one byte plus a parity bit, so a byte can be
thought of as a one-bit-wide slice of tape. Such a slice is called a frame. The parity bit is not
part of the data; it is used to check the validity of the data.
(Figure: a frame is a one-bit-wide slice cutting across tracks 1 through 9 of the tape.)
Frames are organised into data blocks of variable size, separated by interblock gaps (long
enough for the tape to accelerate to full speed and to decelerate to a stop, since tapes cannot
start and stop instantaneously).
The length s of magnetic tape needed is given by
s = n × (b + g), where n = number of data blocks, b = physical length of a data block, and
g = length of an interblock gap.
Effective transmission rate = effective recording density (bpi) × tape speed (ips). For
problems related to magnetic tapes, refer to the class notes.
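As a worked example with assumed figures (not from the notes): storing n = 1,000 blocks, each 6,000 bytes long, at a recording density of 6,250 bpi gives b = 6,000 / 6,250 = 0.96 inch per block; with an interblock gap of g = 0.3 inch, s = 1,000 × (0.96 + 0.3) = 1,260 inches (105 feet) of tape.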
Disks versus Tapes
In the past, both disks and tapes were used for secondary storage: disks were preferred for
random access and tapes for better sequential access. Now disks have taken over much of
secondary storage because of the decreased cost of disk storage, and tapes are used as
tertiary (archival) storage.
INTRODUCTION TO CD-ROM
CD-ROM: Compact Disc Read-Only Memory. A single disc can hold approximately 700 MB
of data. A CD-ROM is read-only: it is a publishing medium rather than a data storage and
retrieval medium like magnetic disks.
Physical Organization of CD-ROM
CD-ROM is the child of CD audio. Audio discs are designed to play music, not to provide
fast, random access to data. This biases CD toward having high storage capacity and
moderate data transfer rates, and against decent seek performance.
Reading Pits and Lands:
CD-ROMs are stamped from a glass master disc which has a coating that is changed by a
laser beam. When the coating is developed, the areas hit by the laser beam turn into pits
along the track followed by the beam. The smooth, unchanged areas between the pits are
called lands.
When we read the CD, we focus a beam of laser light on the track as it moves under the
optical pickup. The pits scatter the light, but the lands reflect it back to the pickup. This
pattern of high- and low-intensity reflected light is the signal used to reconstruct the
original digital information.
1s are represented by the transition from pit to land and back again: every time the light
intensity changes, we get a 1. The 0s are represented by the amount of time between
transitions; the longer between transitions, the more 0s we have.
Given this scheme, it is not possible to have two adjacent 1s: 1s are always separated by 0s.
In fact, due to the limits of the resolution of the optical pickup, there must be at least two 0s
between any pair of 1s. This means that every raw pattern of 8 data bits has to be translated
so that at least two 0s separate consecutive 1s. This translation is done with an EFM (Eight
to Fourteen Modulation) encoding lookup table: the EFM scheme turns the original 8 bits of
data into 14 expanded bits that can be represented as pits and lands on the disc.
CLV versus CAV
CLV (Constant Linear Velocity):
The data is stored on a single spiral track that winds for almost 3 miles from the centre to
the outer edge of the disc.
All the sectors take the same amount of space, and the storage capacity of every sector is the same.
All the sectors are written at maximum density (constant data density), so space is not
wasted in either the inner or the outer sectors.
Constant data density implies that the disc has to spin more slowly when reading the outer
sectors than when reading the inner (centre) sectors (variable speed of disc rotation).
Poor seeking performance.
CAV (Constant Angular Velocity):
The data is stored on a number of concentric tracks divided into pie-shaped sectors.
Inner sectors take less space than outer sectors, yet the storage capacity of every sector is the same.
Data is written less densely in the outer sectors and more densely in the inner sectors
(variable data density), so space is wasted in the outer sectors.
Variable data density implies that the disc rotates at constant speed, irrespective of whether
it is reading inner or outer sectors.
Seeking is fast compared to CLV.
Addressing
In CD audio each sector holds 2 kilobytes, and 75 sectors make up 1 second of audio playback.
According to the original Philips/Sony standards, a CD, whether used for audio or CD-ROM,
contains at least one hour of playing time. That means the disc is capable of holding at least
540,000 kilobytes of data: 60 minutes × 60 seconds/minute × 75 sectors/second × 2 KB/sector
= 540,000 KB. A sector is therefore addressed by its playing time, in the form
minute:second:sector (for example, 16:22:34).
1 second = 75 sectors
Each raw sector is laid out as follows (2,352 bytes in all):
| 12 bytes synch | 4 bytes sector ID | 2048 bytes user data | 4 bytes error detection | 8 bytes null | 276 bytes error correction |
A JOURNEY OF A BYTE
What happens when the following statement in the application program is executed?
write(fd,ch,1)
1. The program asks the operating system to write the contents of the variable ch to the next
available position in the file.
2. The operating system passes the job on to the file manager.
3. The file manager looks up the given file in a table containing information about it, such
as whether the file is open and available for use, what types of access are allowed, if any, and
what physical file the logical name fd corresponds to.
4. The file manager searches a file allocation table for the physical location of the sector that
is to contain the byte.
5. The file manager makes sure that the last sector in the file has been stored in a system I/O
buffer in RAM, then deposits the byte into its proper position in the buffer.
6. The file manager gives instructions to the I/O processor about where the byte is stored in
RAM and where it needs to be sent on the disk.
7. The I/O processor finds a time when the drive is available to receive the data and puts the
data in proper format for the disk. It may also buffer the data to send it out in chunks of the
proper size for the disk.
8. The I/O processor sends the data to the disk controller.
9. The controller instructs the drive to move the read/write head to the proper track, waits for
the desired sector to come under the read/write head, then sends the byte to the drive to be
deposited, bit by bit, on the surface of the disk.
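For concreteness, here is a minimal POSIX-style version of the statement in C++ (the file name and flags are illustrative assumptions; note that the notes' write(fd, ch, 1) is pseudocode, while the real call takes a pointer to the byte):

#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("textfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;
    char ch = 'P';
    // Ask the operating system to append one byte; everything described
    // in steps 2-9 above happens beneath this single call.
    ssize_t n = write(fd, &ch, 1);
    close(fd);
    return n == 1 ? 0 : 1;
}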
BUFFER MANAGEMENT
Buffering involves working with a large chunk of data in memory so that the number of
accesses to secondary storage can be reduced. Assume that the system has a single buffer and
is performing input and output one character at a time, alternately. In this case, the sector
containing the character to be read is constantly overwritten by the sector containing the spot
where the character will be written, and vice versa. In such a case the system needs more
than one buffer: at least one for input and another for output. Strategies to avoid this
problem:
Multiple buffering:
Suppose that a program is only writing to a disk and that it is I/O-bound. The CPU wants to
be filling a buffer at the same time that I/O is being performed. If two buffers are used and
I/O-CPU overlapping is permitted, the CPU can be filling one buffer while the contents of the
other are being transmitted to the disk. When both tasks are finished, the roles of the buffers
can be exchanged. This method is called double buffering. The technique need not be
restricted to two buffers.
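A minimal single-threaded C++ sketch of the buffer-swapping idea (the buffer size, output file name, and flushToDisk stand-in are assumptions; a real system overlaps the flush with CPU work via DMA or a second thread):

#include <algorithm>
#include <cstdio>
#include <vector>

const std::size_t BUF_SIZE = 4096;

void flushToDisk(const std::vector<char>& buf, std::FILE* f) {
    std::fwrite(buf.data(), 1, buf.size(), f);  // stands in for the disk transfer
}

int main() {
    std::vector<char> a(BUF_SIZE), b(BUF_SIZE);
    std::vector<char>* filling  = &a;  // the buffer the CPU fills
    std::vector<char>* draining = &b;  // the buffer the disk drains
    std::FILE* f = std::fopen("out.dat", "wb");
    if (!f) return 1;
    for (int block = 0; block < 8; ++block) {
        std::fill(filling->begin(), filling->end(), char('a' + block));  // CPU work
        std::swap(filling, draining);  // exchange the roles of the buffers
        flushToDisk(*draining, f);     // in a real system this overlaps the next fill
    }
    std::fclose(f);
}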
Buffer pooling:
Some file systems use a buffering scheme called buffer pooling: there is a pool of buffers,
and when a request for a sector is received, the operating system first looks to see whether
that sector is already in some buffer. If it is not, the O.S. brings the sector into some free
buffer; if no free buffer exists, it must choose an occupied buffer to replace (usually with the
LRU, least-recently-used, strategy).
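A compact C++ sketch of such a pool (the pool size, sector size, and readSectorFromDisk stand-in are illustrative assumptions):

#include <iostream>
#include <list>
#include <unordered_map>
#include <vector>

const std::size_t POOL_SIZE = 4;

std::vector<char> readSectorFromDisk(int sector) {
    return std::vector<char>(512, char(sector));  // stands in for a real disk read
}

std::list<int> lru;                               // front = most recently used
std::unordered_map<int, std::vector<char>> pool;  // sector number -> buffered data

const std::vector<char>& getSector(int sector) {
    if (pool.count(sector)) {              // hit: no disk access needed
        lru.remove(sector);
    } else {
        if (pool.size() == POOL_SIZE) {    // pool full: evict the LRU victim
            pool.erase(lru.back());
            lru.pop_back();
        }
        pool[sector] = readSectorFromDisk(sector);  // miss: one disk access
    }
    lru.push_front(sector);                // mark as most recently used
    return pool[sector];
}

int main() {
    for (int s : {1, 2, 3, 1, 4, 5, 1})
        getSector(s);
    std::cout << "buffers in use: " << pool.size() << '\n';  // at most POOL_SIZE
}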
Field Structures
Method 1: Fix the length of fields.
Here we have fixed the size of field 1 at 10 bytes, field 2 at 8 bytes, field 3 at 5 bytes, and so
on, which brings the total length of the record to 70 bytes. While reading the record, the first
10 bytes read are treated as field 1, the next 8 bytes as field 2, and so on.
The disadvantage of this method is the padding of each and every field to bring it to the
pre-defined length, which makes the file much larger: rather than using 4 bytes to store
"Ames", we use 10. We can also encounter problems with data that is too long to fit into the
allocated amount of space.
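A minimal C++ sketch of packing fixed-length fields (the 10- and 8-byte widths follow the notes; the packField helper and output file name are assumptions):

#include <algorithm>
#include <cstring>
#include <fstream>
#include <string>

// Copy src into exactly `width` bytes, space-padding (and truncating if too long).
void packField(char* dst, const std::string& src, std::size_t width) {
    std::memset(dst, ' ', width);
    std::memcpy(dst, src.data(), std::min(src.size(), width));
}

int main() {
    char record[18];                     // first two fields only: 10 + 8 bytes
    packField(record,      "Ames", 10);  // field 1: 6 of its 10 bytes are padding
    packField(record + 10, "Mary", 8);   // field 2
    std::ofstream("fixed.dat", std::ios::binary).write(record, sizeof record);
}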
Method 2: Begin each field with a length indicator.
The fields within a record are prefixed by a length byte or bytes.
Fields within a record can have different sizes.
Different records can have different length fields.
Programs which access the record must know the size and format of the length prefix.
There is external overhead for field separation equal to the size of the length prefix
per field.
04Ames04Mary0312305Maple10Stillwater07OK74075
05Mason04Alan029008Eastgate03Ada07OK74820
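A short C++ sketch of writing such length-prefixed fields (the two-digit ASCII prefix matches the example above; the writeField helper and file name are assumptions):

#include <fstream>
#include <iomanip>
#include <string>

// Prefix each field with its length as two ASCII digits, e.g. "04Ames".
void writeField(std::ostream& out, const std::string& field) {
    out << std::setw(2) << std::setfill('0') << field.size() << field;
}

int main() {
    std::ofstream out("fields.dat");
    for (const std::string& f : {"Ames", "Mary", "123", "Maple", "Stillwater", "OK74075"})
        writeField(out, f);  // yields 04Ames04Mary0312305Maple10Stillwater07OK74075
}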
Record Structures:
A record can be defined as a set of fields that belong together when the file is viewed in terms
of a higher level of organization.
Following are some of the most often used methods for organizing the records of a file.
Make records a predictable number of bytes.
Make records a predictable number of fields.
Begin each record with a length indicator.
Use an index to keep track of record addresses.
Place a delimiter at the end of each record.
Method 1: Make records a predictable number of bytes.
All records within a file have the same size.
Programs which access the file must know the record length.
Offset, or position, of the nth record of a file can be calculated.
There is no external overhead for record separation.
There may be internal fragmentation (unused space within records).
There will be no external fragmentation (unused space outside of records) except
for deleted records.
Method 2: Make records a predictable number of fields. Each record below consists of six
delimited fields, so the end of a record is recognized by counting fields:
Ames|Mary|123|Maple|Stillwater|OK74075|Mason|Alan|90|Eastgate|Ada|OK74820| …
Method 3: Begin each record with a length indicator
The records within a file are prefixed by a length byte or bytes.
Records within a file can have different sizes.
Different files can have different length records.
Programs which access the file must know the size and format of the length prefix.
Offset, or position, of the nth record of a file cannot be calculated.
There is external overhead for record separation equal to the size of the length
prefix per record.
39Ames|Mary|123|Maple|Stillwater|OK74075|35Mason|Alan|90|Eastgate|Ada|OK74820| …
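A sketch of reading records that begin with such a two-digit length prefix (the readRecord helper and file name are hypothetical):

#include <fstream>
#include <iostream>
#include <string>

// Read one record: a 2-digit ASCII length, then that many bytes.
bool readRecord(std::istream& in, std::string& rec) {
    char digits[2];
    if (!in.read(digits, 2)) return false;
    int len = (digits[0] - '0') * 10 + (digits[1] - '0');
    rec.resize(len);
    return static_cast<bool>(in.read(&rec[0], len));
}

int main() {
    std::ifstream in("records.dat", std::ios::binary);
    std::string rec;
    while (readRecord(in, rec))
        std::cout << rec << '\n';
}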
Method 5: Place a delimiter at the end of each record. Here the '#' character marks the end of
each record:
Ames|Mary|123|Maple|Stillwater|OK74075|#Mason|Alan|90|Eastgate|Ada|OK74820|# …
(Class diagram: IOBuffer, which holds a char array for the buffer value, declares read and
write operations; VariableLengthBuffer and FixedLengthBuffer are its subclasses, each
providing its own read and write operations.)
Here the member functions Read(), Write(), Pack(), and Unpack() of class IOBuffer are
virtual functions, so that the subclasses VariableLengthBuffer and FixedLengthBuffer can
define their own implementations. This means that the class IOBuffer does not have to
include an implementation of these methods.
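A skeleton of that hierarchy (the exact signatures are illustrative assumptions; the textbook classes carry more state and methods):

#include <iostream>

class IOBuffer {
public:
    explicit IOBuffer(int maxBytes = 1000) : size(maxBytes), buffer(new char[maxBytes]) {}
    virtual ~IOBuffer() { delete[] buffer; }
    // Pure virtual: each subclass supplies its own record layout.
    virtual int Read(std::istream&) = 0;         // read a record into the buffer
    virtual int Write(std::ostream&) const = 0;  // write the buffer as a record
    virtual int Pack(const char* field) = 0;     // add a field to the buffer
    virtual int Unpack(char* field) = 0;         // extract the next field
protected:
    int size;      // capacity of the buffer
    char* buffer;  // char array holding the buffer value
};

class FixedLengthBuffer : public IOBuffer { /* fixed-size fields ... */ };
class VariableLengthBuffer : public IOBuffer { /* length-prefixed fields ... */ };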
Record Access
When looking for an individual record, it is convenient to identify the record with a key
based on the record's contents. The key should be unique, so that duplicate entries can be
avoided. For example, in the previous section's example we might want to access the "Ames
record" or the "Mason record" rather than thinking in terms of the "first record" or the
"second record". When we are looking for a record containing the last name Ames, we want
to recognize it even if the user enters the key in the form "AMES", "ames", or "Ames". To do
this we must define a standard form for keys, along with associated rules and procedures for
converting keys into this standard form. This standard form is called the canonical form of
the key.
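One plausible canonical form, sketched in C++ as "trim blanks, then uppercase" (the exact rules are a design choice, not fixed by the notes):

#include <cctype>
#include <iostream>
#include <string>

// Convert a key to canonical form: strip surrounding blanks, then uppercase.
std::string canonical(const std::string& key) {
    std::size_t b = key.find_first_not_of(' ');
    if (b == std::string::npos) return "";
    std::size_t e = key.find_last_not_of(' ');
    std::string out = key.substr(b, e - b + 1);
    for (char& c : out)
        c = std::toupper(static_cast<unsigned char>(c));
    return out;
}

int main() {
    // "AMES", "ames" and " Ames " all match once converted.
    std::cout << (canonical("ames") == canonical(" AMES ")) << '\n';  // prints 1
}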
Sequential Search
Reading through the file, record by record, looking for a record with a particular key is called
sequential searching. In general, the work required to search sequentially for a record in a file
with n records is proportional to n: the sequential search is said to be of order O(n).
This efficiency is tolerable when the search is performed on data in main memory, but not on
data that has to be extracted from a secondary storage device, owing to the high delay
involved in accessing it. Instead of extracting records from the secondary storage device one
at a time, sequentially, we can read a set of records at once from the hard disk, store them in
main memory, and do the comparisons there. This is called record blocking, sketched below.
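A C++ sketch of sequential search with record blocking (fixed 64-byte records, 16 records per block, key compared against the start of each record; all of these choices are illustrative assumptions):

#include <cstring>
#include <fstream>
#include <iostream>

const int REC_LEN = 64, RECS_PER_BLOCK = 16;

// Read a whole block of records per disk access, then compare in memory.
long findByKey(std::ifstream& in, const char* key) {
    char block[REC_LEN * RECS_PER_BLOCK];
    long rrn = 0;
    while (in.read(block, sizeof block) || in.gcount() > 0) {
        long got = in.gcount() / REC_LEN;      // whole records actually read
        for (long i = 0; i < got; ++i, ++rrn)  // in-memory comparisons
            if (std::strncmp(block + i * REC_LEN, key, std::strlen(key)) == 0)
                return rrn;                    // found: return its RRN
        if (!in) break;                        // partial final block: stop
    }
    return -1;                                 // not found
}

int main() {
    std::ifstream in("fixed.dat", std::ios::binary);
    std::cout << findByKey(in, "Ames") << '\n';
}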
There are some cases in which sequential search is superior, such as:
Repetitive hits: Searching for patterns in ASCII files.
Searching records with a certain secondary key value.
Small Search Set: Processing files with few records.
Devices/media most hospitable to sequential access: tape, binary file on disk.
Unix Tools for Sequential Processing
Some of the UNIX commands that perform sequential access are:
cat: displays the contents of the file sequentially on the console.
%cat filename
Example: %cat myfile
Ames Mary 123 Maple Stillwater OK74075
Mason Alan 90 Eastgate Ada OK74820
wc: counts the number of lines, words, and characters in the file.
%wc filename
Example: %wc myfile
2 14 76 myfile
grep: (generalized regular expression) used for pattern matching.
%grep string filename
Example: % grep Ada myfile
Mason Alan 90 Eastgate Ada OK74820
Direct Access:
The most radical alternative to searching sequentially through a file for a record is a retrieval
mechanism known as direct access. The major problem with direct access is knowing where
the required record begins. One way to know the beginning (the byte offset) of the required
record is to maintain a separate index file. The other way is by relative record number (RRN).
If a file is a sequence of records, the RRN of a record gives its position relative to the
beginning of the file: the first record in a file has RRN 0, the next has RRN 1, and so forth.
For example, if we are interested in the record with RRN 546 and our file has a fixed record
length of 128 bytes per record, the byte offset is 546 × 128 = 69,888. In general, for the
record with RRN n,
Byte offset = n × r, where r is the length of a record.
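A C++ sketch of direct access by RRN using this formula (the file name is an assumption; the record length matches the example):

#include <fstream>
#include <iostream>

const long REC_LEN = 128;

int main() {
    std::ifstream in("fixed.dat", std::ios::binary);
    long rrn = 546;
    in.seekg(rrn * REC_LEN);       // byte offset = n * r = 69,888
    char record[REC_LEN];
    if (in.read(record, REC_LEN))  // one direct access, no scanning
        std::cout << "read record with RRN " << rrn << '\n';
}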
Header Records
It is often necessary or useful to keep track of some general information about a file to assist
in future use of the file. A header record is often placed at the beginning of the file to hold
information such as the number of records, the type of records the file contains, the size of
the file, and the date and time of the file's creation and modification. Header records make a
file a self-describing object, freeing the software that accesses the file from having to know
a priori everything about its structure. The header record usually has a different structure
from the data records.
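A sketch of a fixed-size header record in C++ (the particular fields follow the list above, but the layout itself is an assumption):

#include <cstdint>
#include <cstring>
#include <fstream>

// Fixed-size header written at byte 0; the data records follow it.
struct Header {
    char         fileType[8];   // type of records the file contains
    std::int32_t recordCount;   // number of records
    std::int32_t recordLength;  // size of each record in bytes
    std::int64_t createdAt;     // creation date/time (e.g., seconds since epoch)
};

int main() {
    Header h{};
    std::memcpy(h.fileType, "FIXED", 5);
    h.recordLength = 128;
    h.createdAt = 0;  // placeholder timestamp
    std::ofstream out("withheader.dat", std::ios::binary);
    out.write(reinterpret_cast<const char*>(&h), sizeof h);
}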
File Access and File Organization
File organization is static:
Fixed Length Records.
Variable Length Records.
File access is dynamic:
Sequential Access.
Direct Access.