CSC 202 File Processing
Data Processing
Data processing is the act of handling or manipulating data in some specified ways so as to give
meaning to data or to transform the data into information. It is the process through which facts
and figures are collected, assigned meaning, communicated to others and retained for future
use.
Data: The word "data" is the plural of datum, which means a fact, observation, assumption or
occurrence. More precisely, data are representations of facts pertaining to people, things, ideas
and events. Data are represented by symbols such as letters of the alphabet, numerals or
other special symbols.
Information: can be defined as “data that has been transformed into a meaningful and useful
form for specific purposes”. There is no hard and fast rule for determining when data becomes
information.
(a) Collection
Data originates naturally in the form of events, transactions, observations, measurements,
interviews, etc. This data is then recorded in some usable form. Observed data may be reported
as a narration or in table form. Data may be initially recorded on a paper source document and then
converted into a machine-usable form for processing, in what is termed Data Capture.
Alternatively, it may be recorded by a direct input device in a paperless, machine-readable
form using an on-line medium or a Direct Data Capture machine.
Activity:
Study and record the cost of upkeep for twenty students during the Harmattan semester of the
2014/2015 academic session. Make use of two students (a male and a female) at each level (100 &
200) of any five departments in KWASU; in all you are sampling twenty (20) students.
The fields to be used are: Matriculation Number, Surname, Other names, Sex, State,
Department, Level, Session, Semester, Feeding, Transport, Housing, Clothing, Recharge card,
and Books.
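The record layout above can be sketched as a simple data structure. The sketch below is illustrative only: the field names follow the activity, and the sample values are invented.

```python
from dataclasses import dataclass

@dataclass
class UpkeepRecord:
    matric_no: str
    surname: str
    other_names: str
    sex: str
    state: str
    department: str
    level: int
    session: str
    semester: str
    feeding: float
    transport: float
    housing: float
    clothing: float
    recharge_card: float
    books: float

    def total_upkeep(self):
        # Sum of all cost fields for one student
        return (self.feeding + self.transport + self.housing +
                self.clothing + self.recharge_card + self.books)

# One hypothetical sample record
r = UpkeepRecord("15/47CS/001", "Bello", "Aisha", "F", "Kwara",
                 "Computer Science", 100, "2014/2015", "Harmattan",
                 15000, 5000, 20000, 8000, 3000, 6000)
print(r.total_upkeep())  # 57000
```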
(b) Conversion
Once the data is collected, it is converted from its source documents to a form that is more
suitable for processing. The data is first codified by assigning identification codes. A code
comprises numbers, letters, special characters, or a combination of these. For example, an
employee may be allotted a code indicating his category, such as class A. It is useful to codify
data when the data requires classification. To classify means to categorize, i.e., data with
similar characteristics are placed in the same category or group. For example, one may arrange
accounts data according to account number or date, so that a balance sheet can easily be
prepared. After classification, the data is verified or checked to ensure its accuracy before
processing starts. After verification, the data is transcribed from one data medium to another.
For example, where data processing is done using a computer, the data may be transferred
from source documents to a machine-sensible form on magnetic tape or disk.
(c) Manipulation
Once data is collected and converted, it is ready for the manipulation function which converts
data into information. Manipulation consists of the following activities:
Sorting
It involves the arrangement of data items in a desired sequence. Usually, it is easier to work with
data if it is arranged in a logical sequence. Most often, the data are arranged in alphabetical
sequence. Sometimes sorting itself will transform data into information. For example, a simple
act of sorting the names in alphabetical order gives meaning to a telephone directory. The
directory would be practically worthless without sorting. Business data processing makes
extensive use of sorting techniques. Virtually all the records in business files are maintained in
some logical sequence. Numeric sorting is common in computer-based processing systems because it
is usually faster than alphabetical sorting.
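As a small illustration of sorting, the telephone-directory example above can be sketched in Python (sample entries invented):

```python
# Unsorted directory entries: (surname, phone number) -- sample data invented
directory = [("Okoro", "0803-111"), ("Adams", "0805-222"), ("Musa", "0802-333")]

# Sorting alphabetically by surname turns raw data into a usable directory
directory.sort(key=lambda entry: entry[0])
print(directory)  # sorted by surname: Adams, Musa, Okoro
```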
Calculating
Arithmetic manipulation of data is called calculating. Items of recorded data can be added to one
another, subtracted, divided or multiplied to create new data. Calculation is an integral part of
data processing. For example, in calculating an employee's pay, the hours worked multiplied by
the hourly wage rate gives the gross pay. Based on total earning, income-tax deductions are
computed and subtracted from gross-pay to arrive at net pay.
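The payroll example above can be sketched as follows (the wage and tax figures are invented; real income-tax rules are more involved):

```python
def net_pay(hours_worked, hourly_rate, tax_rate):
    # Gross pay: hours worked multiplied by the hourly wage rate
    gross = hours_worked * hourly_rate
    # Income-tax deduction computed from total earnings
    tax = gross * tax_rate
    # Net pay: gross pay less deductions
    return gross - tax

print(net_pay(40, 500, 0.10))  # 18000.0
```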
Summarizing
To summarize is to condense or reduce masses of data to a more usable and concise form. For
example, you may summarize a lecture attended in a class by writing small notes in one or two
pages. When the data involved is numbers, you summarize by counting or accumulating the
totals of the data in a classification or by selecting strategic data from the mass of data being
processed. For example, the summarizing activity may provide a general manager with sales
totals by major product line, the sales manager with sales totals by individual salesman as well
as by product line, and a salesman with sales data by customer as well as by product line.
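The sales-summary example can be sketched by accumulating totals per classification (sample sales records invented):

```python
from collections import defaultdict

# Invented sales records: (salesman, product line, amount)
sales = [("Ade", "Drinks", 200), ("Ade", "Snacks", 150),
         ("Bola", "Drinks", 300), ("Bola", "Drinks", 100)]

# Summarizing: accumulate totals of the data in each classification
totals_by_line = defaultdict(int)
for salesman, line, amount in sales:
    totals_by_line[line] += amount

print(dict(totals_by_line))  # {'Drinks': 600, 'Snacks': 150}
```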
Comparing
To compare data is to perform an evaluation in relation to some known measure. For example,
business managers compare data to discover how well their companies are doing. They may
compare current sales figures with those for last year to analyze the performance of the
company in the current month.
Storing
To store is to hold data for continued or later use. Storage is essential for any organized method
of processing and re-using data. The storage mechanisms for data processing systems are file
cabinets in a manual system, and electronic devices such as magnetic disks/magnetic tapes in
case of computer based system. The storing activity involves storing data and information in
organized manner in order to facilitate the retrieval activity.
Retrieving
To retrieve means to recover or find again the stored data or information. Thus data whether in
file cabinets or in computers can be recalled for further processing. Retrieval and comparison of
old data gives meaning to current information.
(e) Communication
Communication is the process of sharing information. Unless the information is made available
to the users who need it, it is worthless. Thus, communication involves the transfer of data and
information produced by the data processing system to the prospective users of such
information or to another data processing system. As a result, reports and documents are
prepared and delivered to the users. In electronic data processing, results are communicated
through display units or terminals.
(f) Reproduction
To reproduce is to copy or duplicate data or information. This reproduction activity may be done
by hand or by machine.
(i) Input
The term input refers to the activities required to record data and to make it available for
processing. The input can also include the steps necessary to check, verify and validate data
contents.
(ii) Processing
The term processing denotes the actual data manipulation techniques such as classifying,
sorting, calculating, summarizing, comparing, etc. that convert data into information.
(iii) Output
It is a communication function which transmits the information, generated after processing of
data, to persons who need the information. Sometimes output also includes decoding activity
which converts the electronically generated information into human-readable form.
(iv) Storage
It involves the filing/keeping of data and information for future use.
The activity of data processing can be viewed as a "system". A system can be defined as "a
group of interrelated components that seeks the attainment of a common goal by accepting
inputs and producing outputs in an organized process". For example, a production system
accepts raw material as input and produces finished goods as output. Similarly, a data
processing system can be viewed as a system that uses data as input and processes this data
to produce information as output. There are many kinds of data processing systems. A manual
data processing system is one that utilizes tools like pens, and filing cabinets. A mechanical
data processing system uses devices such as typewriters, calculating machines and book-
keeping machines. Finally, electronic data processing uses computers to automatically
process data.
Data Organization/Hierarchy
*Field
A field is a data item in a computer file. Its length may be fixed or variable. If all individuals have
3-digit employee numbers, a 3-digit field is required to store the particular data; hence it is a
fixed field. In contrast, since a customer's name varies considerably from one customer to
another, a variable amount of space must be available to store this element. This is called a
variable field.
*Record
A record is a collection of related data items or fields. Each record normally corresponds to a
specific unit of information. For example, various fields in a student record may include student
number, student's name, level and department. This is the data used to produce the student's
report. Each record contains all the data for a given student; related items are grouped
together to form a record.
*File
The collection of records is called a file. A file contains all the related records for an application.
Files are stored on some medium, such as floppy disk, magnetic tape, magnetic disk, flash
drive or memory card.
*Database
The collection of related files is called a database. A database contains all the related files for a
particular application.
A logical record contains all the data related to a single entity. It may be a payroll record for an
employee or a record of marks secured by a student in a particular examination. E.g. The record
of student (1345cs022).
A physical record refers to a record whose data fields are stored physically next to one
another. It is also the amount of data that is treated as a single unit by the input-output device.
i.e. the unit of transfer between disk and primary storage. Portions of the same logical record
may be located in different physical records or several logical records may be located in one
physical record. Generally, a physical record consists of more than one logical record.
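The relationship between logical and physical records can be sketched with a blocking factor, i.e., the number of logical records packed into one physical record (values invented):

```python
# Blocking: several logical records stored in one physical record (block).
# Here the blocking factor is 3, so each physical record holds up to
# three logical records -- a simplified sketch.
logical_records = ["rec1", "rec2", "rec3", "rec4", "rec5", "rec6", "rec7"]
BLOCKING_FACTOR = 3

physical_records = [logical_records[i:i + BLOCKING_FACTOR]
                    for i in range(0, len(logical_records), BLOCKING_FACTOR)]
print(physical_records)
# [['rec1', 'rec2', 'rec3'], ['rec4', 'rec5', 'rec6'], ['rec7']]
```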
*Master Files- Master files are permanent files kept up to date by applying the transactions that
occur during a particular operation. They generally contain two basic types of data:
data of a more or less permanent nature, and data which changes every time
transactions are applied to the file.
*Transaction Files- A transaction file accumulates records over a period of time, mainly in the
course of updating master files; it is usually emptied after use and re-accumulated when
required. Transaction files contain details of all transactions that have occurred in the last
period. A period may be a day, a week, a month or more. For example, a sales
transaction file may contain details of all sales made that day. Once the data has been
processed it can be discarded (although backup copies may be kept for a while).
*Security Files- Backup copies of master or transaction files. They are not used in the ordinary
course of processing; they are kept for replacement or reconciliation.
Searching: A query about a particular item of data may require looking through a master file to
find the appropriate record or records. The equivalent SQL statement is SELECT * FROM table WHERE criteria;
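The same search can be sketched procedurally: scanning a master file for records that satisfy a criterion, just as the SELECT statement does (sample records invented):

```python
# Master file held as a list of records (dictionaries); data invented
master = [
    {"matric_no": "1345cs022", "surname": "Musa", "level": 200},
    {"matric_no": "1345cs023", "surname": "Eze",  "level": 100},
]

def select(table, criteria):
    # Equivalent of: SELECT * FROM table WHERE criteria
    return [row for row in table if criteria(row)]

hits = select(master, lambda row: row["level"] == 200)
print(hits[0]["surname"])  # Musa
```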
BASIC FILE CONCEPT.
File is the basic unit of storage that enables a computer to distinguish one set of
information from another. The file is the central element in most applications. Before data
can be processed by a Computer-Based Information System (CBIS), it must be
systematically organized. The most common method is to arrange data into fields,
records, files and databases. Files can be considered to be the framework around which
data processing revolves. File processing is the process of creating, storing and accessing the
content of files.
a. Field
A field is the basic element of data. An individual field contains a single value, such
as an employee’s last name, a date, or the value of a sensor reading. It is characterized
by its length and data type (e.g., ASCII, string, decimal). Depending on the file design,
fields may be fixed length or variable length. In the latter case, the field often consists of
two or three subfields: the actual value to be stored, the name of the field, and, in some
cases, the length of the field. In other cases of variable-length fields, the length of the
field is indicated by the use of special demarcation symbols between fields.
b. Record
A record is a collection of related fields that can be treated as a unit by some application
program. For example, an employee record would contain such fields as name,
identification number, job designation, date of employment, and so on. Again, depending
on design, records may be of fixed length or variable length. A record will be of variable
length if some of its fields are of variable length or if the number of fields may vary. In the
latter case, each field is usually accompanied by a fieldname. In either case, the entire
record usually includes a length field.
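Fixed-length and variable-length records can be sketched with Python's struct module. The layout below (a 3-byte employee number plus a 20-byte name field, and a length byte before each variable field) is an invented example, not a standard format:

```python
import struct

# Fixed-length record: 3-byte employee number + 20-byte name field,
# every record occupies the same number of bytes (short names padded)
RECORD_FMT = "3s20s"
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # always 23 bytes

def pack_record(emp_no, name):
    return struct.pack(RECORD_FMT, emp_no.encode(), name.encode())

rec = pack_record("042", "Ada Obi")
print(len(rec))  # 23

# Variable-length record: each field is preceded by its length,
# so the record is only as long as its contents require
def pack_variable(fields):
    out = b""
    for f in fields:
        data = f.encode()
        out += struct.pack("B", len(data)) + data  # length byte, then value
    return out

v = pack_variable(["042", "Ada Obi"])
print(len(v))  # 12  (1+3 for the number, 1+7 for the name)
```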
c. File
A file is a collection of related records. The file is treated as a single entity by
users and applications and may be referenced by name. Files have names and may be
created and deleted. Access control restrictions usually apply at the file level. That is, in
a shared system, users and programs are granted or denied access to entire files. In
some more sophisticated systems, such controls are enforced at the record or even the
field level.
Naming Files
Files provide a way to store information and read it back later. This must be done
in a way as to shield the user from the details of how and where the information is
stored, and how the disks actually work. When a process creates a file, it gives the file a
name. When the process terminates, the file continues to exist and can be accessed by
other processes using its name.
The exact rules for file naming vary somewhat from system to system, but all
operating systems allow strings of one to eight letters as legal filenames. The file name
is chosen by the person creating it, usually to reflect its contents. There are few
constraints on the format of the filename: it can comprise the letters A-Z, the numbers 0-9
and special characters such as $ # & + @ ! ( ) - { } ' ` _ ~ as well as space. The symbols that
cannot be used in a filename are * | < > \ ^ = ? / [ ] ' ; , plus control characters. One issue
in choosing a file name is that different operating systems have different rules, which can
present problems when files are moved from one computer to another. For
example, Microsoft Windows is case insensitive, so MYEBOOKS, myebooks and
MyEbooks are all the same file to Microsoft Windows.
However, under the UNIX operating system, all three would be different files as, in this
instance, file names are case sensitive.
Naming Convention
Usually a filename has two parts separated by a period ("."). The part on the left
side of the period is called the main name, while the part on the right side is
called the extension. A good example of a filename is "course.doc": the main name is
course and the extension is doc. The file extension differentiates between different types
of files. We can have files with the same main name but different extensions, so we
generally refer to a file by its name together with its extension, which forms the complete
file name.
Table 3: Filename Extension of Sound Files
File Attributes
The particular information kept for each file varies from operating system to operating
system. No matter what operating system one might be using, files always have certain
attributes or characteristics. Different file attributes are discussed as follow.
a. File Name
The symbolic file name is the only information kept in human-readable form. As is obvious, a file
name helps users to differentiate between various files.
b. File Type
A file type is required for systems that support different types of files. As discussed earlier,
the file type is a part of the complete file name. We might have two different files, say "csc202.doc"
and "csc202.txt". The file type is therefore an important attribute which helps in differentiating
between files based on their types. File types indicate which application should be used to open
a particular file.
c. Location
This is a pointer to the device and location on that device of the file. As it is clear from the
attribute name, it specifies where the file is stored.
d. Size
The size attribute keeps track of the current size of a file in bytes, words or blocks. The size of a
file is measured in bytes. A floppy disk holds about 1.44 MB; a Zip disk holds 100 MB or 250 MB; a
CD holds about 700 MB; a DVD holds about 4.7 GB.
e. Protection
The protection attribute of a file keeps track of the access-control information that controls who
can do reading, writing, executing, and so on.
f. Usage Count
This value indicates the number of processes that are currently using (have opened) a particular
file.
FIELD MEANING
Protection Who can access the file and in what way?
Password Password needed to access the file
Creator Identity of the person who created the file
Owner Current owner
Read-only flag 0 for read/write, 1 for read only
Hidden flag 0 for normal, 1 for do not display in listing
System flag 0 for normal file, 1 for system file
Archive flag 0 for has been backed up, 1 for needs to be backed up
ASCII/binary file 0 for ASCII file, 1 for binary file
Random access flag 0 for sequential access only, 1 for random access
Temporary flag 0 for normal, 1 for delete on process exit
Lock flags 0 for unlocked, nonzero for locked
Record length Number of bytes in a record
Key position Offset of the key within each record
Key length Number of bytes in the key field
Creation time Date and time file was created
Time of last access Date and time file was last accessed
Time of last change Date and Time file was last changed
Current size Number of bytes in the file
Maximum size Maximum size file may grow
- The first four attributes relate to the file’s protection and tell who may access it and who
may not. All kinds of schemes are possible; in some systems the user must present a
password to access a file, in which case the password must be one of the attributes.
- The flags are bits or short fields that control or enable some specific properties. Hidden
files, for example, do not appear in listing of the files. The archive flag is a bit that keeps
track of whether the file has been backed up. The backup program clears it, and the
operating system sets it whenever a file is changed. In this way, the backup program can
tell which files need backing up. The temporary flag allows a file to be marked for
automatic deletion when the process that created it terminates.
- The record length, key position, and key length fields are only present in files whose
records can be looked up using a key. They provide the information required to find the
keys. The various times keep track of when the file was created, most recently accessed
and most recently modified. These are useful for a variety of purposes. For example, a
source file that has been modified after the creation of the corresponding object file
needs to be recompiled. These fields provide the necessary information.
- The current size tells how big the file is at present. Some mainframe operating systems
require the maximum size to be specified when the file is created, to let the operating
system reserve the maximum amount of storage in advance. Minicomputers and
personal computer systems are clever enough to do without this item.
A major disadvantage of file processing is duplication of data. Applications
are developed independently in a file processing system, leading to unplanned duplicate files.
Duplication is wasteful as it requires additional storage space, and changes in one file must be
made manually in all files. This also results in loss of data integrity. It is also possible that the
same data item may have different names in different files, or the same name may be used for
different data items in different files.
FILE ORGANISATION AND ACCESS METHOD
File organisation refers to the logical structuring of the records as determined by the way
in which they are accessed. File organisation refers to the structure of a file in terms of its
components and how they are mapped onto the backing store. Data files are organized so as to
facilitate access to records and to ensure their efficient storage. A trade-off between these two
requirements generally exists: if rapid access is required, more storage must be expended to
make it possible (for example, by providing indexes to the data records). Access to a record for
reading it (and sometimes updating it) is the essential operation on data. Any given file
organization supports one or more file access methods. Organisation is thus closely related to
but conceptually distinct from access methods. Access method is any algorithm used for the
storage and retrieval of records from a data file by determining the structural characteristics of
the file on which it is used.
I. The Serial File
These are files of unordered records. Data are collected in the order in which they arrive. When
records are received they are stored in the next available storage position. The purpose of the
serial file is simply to accumulate the mass of data and save it. Records may have different
fields, or similar fields in different orders. Because there is no structure to the serial file, record
access is by exhaustive search. That is, if we wish to find a record that contains a particular field
with a particular value, it is necessary to examine each record in the pile until the desired record
is found or the entire file has been searched. If we wish to find all records that contain a
particular field or contain that field with a particular value, then the entire file must be searched.
It allows quick insertion since there is no particular ordering: records can easily be appended
to the end of the file, and the file is easy to update. Serial files can be used as temporary files to
store transaction data. However, beyond these limited uses, this type of file is unsuitable for
most applications.
E.g. Records on Tape
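The exhaustive search described above can be sketched as follows (sample records invented):

```python
# A serial file: records stored in arrival order, with no structure
serial_file = [
    {"id": 7, "item": "ink"},
    {"id": 3, "item": "paper"},
    {"id": 9, "item": "toner"},
]

def exhaustive_search(records, key, value):
    # Every record must be examined until the desired record is found
    # or the entire file has been searched
    for record in records:
        if record[key] == value:
            return record
    return None

print(exhaustive_search(serial_file, "id", 3))  # {'id': 3, 'item': 'paper'}
```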
Periodically, a batch update is performed that merges the log file with the master file to
produce a new file in correct key sequence.
individual record retrieval through the index. The disadvantage here is that creation of the index
table causes additional overhead.
Functions of File Management
With respect to meeting user requirements, the extent of such requirements
depends on the variety of applications and the environment in which the computer
system will be used. The file management system should therefore ensure that:
- Each user is able to create, delete, read, write, and modify files.
- Each user may have controlled access to other users' files.
- Each user may control what types of access are allowed to the user's files.
- Each user is able to restructure the user's files in a form appropriate to the problem.
- Each user is able to move data between files.
- Each user is able to back up and recover the user's files in case of damage.
- Each user is able to access his or her files by name rather than by numeric identifier.
Device Drivers
At the lowest level, device drivers communicate directly with peripheral devices.
Drivers are special software programs that operate specific devices that can be either crucial or
optional to the functioning of the computer. Drivers help operate keyboards, printers, DVD
drives, etc.
The device driver is responsible for starting I/O operations on a device and processing the
completion of an I/O request. The typical devices controlled are disk and tape drives. Device
drivers are usually considered to be part of the operating system.
Logical I/O
Logical I/O enables users and applications to access records. Logical I/O provides a
general-purpose record I/O capability and maintains basic data about files. The level of
the file system closest to the user is often termed the access method. It provides a
standard interface between applications and the file systems and devices that hold the
data. Different access methods reflect different file structures and different ways of
accessing and processing the data.
Retrieve_One
This requires the retrieval of just a single record. Interactive, transaction-oriented
applications need this operation.
Retrieve_Next
This requires the retrieval of the record that is “next” in some logical sequence to the
most recently retrieved record. Some interactive applications, such as filling in forms,
may require such an operation. A program that is performing a search may also use this
operation.
Retrieve _Previous
This is similar to Retrieve_Next, but in this case the record that is “previous” to the
currently accessed record is retrieved.
Insert_One
Insert a new record into the file. It may be necessary that the new record fit into a
particular position to preserve a sequencing of the file.
Delete_One
Delete an existing record. Certain linkages or other data structures may need to be
updated to preserve the sequencing of the file.
Update_One
This operation retrieves a record, updates one or more of its fields, and rewrites the
updated record back into the file. If the length of the record has changed, the update
operation is generally more difficult than if the length is preserved.
Retrieve_Few
This retrieves a number of records. For example, an application or user may wish to
retrieve all records that satisfy a certain set of criteria. The nature of the operations that
are most commonly performed on a file will influence the way the file is organized as
previously discussed.
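The operations above can be sketched over an in-memory list standing in for a keyed file. The class and method names below mirror the operation names but are otherwise invented:

```python
# A minimal sketch of record-level operations on a keyed file
class RecordFile:
    def __init__(self):
        self.records = []
        self.pos = -1  # position of the most recently retrieved record

    def insert_one(self, record):
        # Keep the file sequenced on the 'key' field
        self.records.append(record)
        self.records.sort(key=lambda r: r["key"])

    def retrieve_one(self, key):
        for i, r in enumerate(self.records):
            if r["key"] == key:
                self.pos = i
                return r
        return None

    def retrieve_next(self):
        # The record "next" in sequence to the most recently retrieved one
        self.pos += 1
        return self.records[self.pos]

    def update_one(self, key, **fields):
        r = self.retrieve_one(key)
        if r:
            r.update(fields)  # rewrite the updated record in place
        return r

f = RecordFile()
f.insert_one({"key": 2, "name": "B"})
f.insert_one({"key": 1, "name": "A"})
print(f.retrieve_one(1)["name"])   # A
print(f.retrieve_next()["name"])   # B
```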
III. FILE DIRECTORIES
Concept of File Directory
To keep track of files, the file system normally provides directories, which, in many
systems are themselves files. The structure of the directories and the relationship among
them are the main areas where file systems tend to differ.
Associated with any file management system and collection of files is a file directory.
The directory contains information about the files, including attributes, location,
and ownership. Much of this information, especially that concerning storage, is
managed by the operating system. The directory is itself a file, accessible by various file
management routines.
1. Single-Level Directory
In a single-level directory system, all the files are placed in one directory. This is very
common on single-user operating systems. A single-level directory has significant
limitations when the number of files increases or when there is more than one user.
Since all files are in the same directory, they must have unique names. If there are two
users who call their data file “CSC202note.doc”, then the unique-name rule is violated.
2. Two-Level Directory
In the two-level directory system, the system maintains a master block that has one
entry for each user. This master block contains the addresses of the directory of the
users. This structure effectively isolates one user from another. This design eliminates
name conflicts among users and this is an advantage because users are completely
independent, but a disadvantage when the users want to cooperate on some task and
access files of other users. Some systems simply do not allow local files to be accessed
by other users. It is also unsatisfactory for users with many files because it is quite
common for users to want to group their files together in a logical way. Below is a
double-level directory.
3. Tree-Structured Directories
In a tree-structured directory, the directories themselves are considered as files. This
leads to the possibility of having sub-directories that can contain files and
sub-subdirectories. An important issue in a tree-structured directory structure is how to
handle the deletion of a directory. If a directory is empty, its entry in its containing
directory can simply be deleted. However, suppose the directory to be deleted is not
empty, but contains several files or sub-directories then it becomes a bit problematic.
Some systems will not delete a directory unless it is empty. Thus, to delete a directory,
someone must first delete all the files in that directory. If there are any subdirectories,
this procedure must be applied recursively to them so that they can be deleted too.
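The recursive deletion procedure can be sketched over a nested dictionary standing in for a directory tree (directories map names to entries; plain files map to None):

```python
# Recursive deletion of a (possibly non-empty) directory
def delete_dir(tree, name):
    entry = tree[name]
    if isinstance(entry, dict):
        # A directory: delete every file and subdirectory first,
        # applying the procedure recursively
        for child in list(entry):
            delete_dir(entry, child)
    del tree[name]  # now the entry itself can be removed

root = {"home": {"docs": {"a.txt": None}, "b.txt": None}}
delete_dir(root, "home")
print(root)  # {}
```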
4. Acyclic-Graph Directories
The acyclic directory structure is an extension of the tree-structured directory structure.
Unlike in the tree-structured directory where files and directories are owned by one
particular user, the acyclic structure takes away this prohibition and thus a directory or
file under directory can be owned by several users.
PATH NAMES
When a file system is organized as a directory tree, some way is needed for specifying
the filenames. Any file in the system can be located by following a path from the root or
master directory down various branches until the file is reached. The series of directory
names, culminating (ending) in the file name itself, constitutes a pathname for the file.
Two different methods commonly used are:
- Absolute Path name
- Relative Path name
Absolute Path Name
With this path name, each file is given a path consisting of the path from the root
directory to the file. As an example, the file in the lower left hand corner of the Figure
below has the pathname User_B/Word/Unit_A/ABC. The slash is used to delimit names
in the sequence. The name of the master directory is implicit, because all paths start at
that directory. Note that it is perfectly acceptable to have several files with the same file
name, as long as they have unique pathnames, which is equivalent to saying that the
same file name may be used in different directories. In this example, there is another file
in the system with the file name ABC, but that has the pathname User_B/Draw/ABC.
Note that absolute file names always start at the root directory and are unique. In UNIX
the file components of the path are separated by /. In MS-DOS the separator is \. In
MULTICS it is >.
Relative Path Name
With this method, the pathname is specified relative to the working directory. For example, if the
working directory for user B is "Word," then the pathname Unit_A/ABC is sufficient to identify the
file in the lower left-hand corner of the above figure.
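Resolving a relative pathname against the working directory can be sketched with Python's posixpath module (UNIX-style '/' separators assumed; the directory names come from the example above):

```python
import posixpath

working_dir = "/User_B/Word"   # current working directory
relative = "Unit_A/ABC"        # relative pathname

# Joining the relative pathname onto the working directory
# yields the absolute pathname, which starts at the root
absolute = posixpath.normpath(posixpath.join(working_dir, relative))
print(absolute)  # /User_B/Word/Unit_A/ABC
```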
e. Reading a File
To read a file, a system call is made that specifies the name of the file and where (in
memory) the next block of the file should be put. Again, the directory is searched for the
associated directory entry, and the directory will need a pointer to the next block to be
read. Once the block is read, the pointer is updated.
f. Deleting a File
To delete a file, the directory is searched for the named file. Having found the associated
directory entry, the space allocated to the file is released (so it can be reused by other
files) and the directory entry is invalidated.
g. Renaming a File
It frequently happens that a user needs to change the name of an existing file. This system
call makes that possible. It is not always strictly necessary, because the file can always
be copied to a new file with the new name, and the old file then deleted.
h. Appending a File
This call is a restricted form of the WRITE call: it can only add data to the end of the file.
Systems that provide a minimal set of system calls do not generally have APPEND.
i. List a Directory
This call lists the files in a directory, together with the contents of the directory entry for
each file in the list.
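The system calls described above can be demonstrated with Python's wrappers around them. The sketch below works in a temporary directory so no real files are touched:

```python
import os
import tempfile

d = tempfile.mkdtemp()                 # scratch directory for the demo
path = os.path.join(d, "course.doc")

with open(path, "w") as f:             # create + write
    f.write("lecture notes")

with open(path, "a") as f:             # append: add data only at the end
    f.write(" (appended)")

with open(path) as f:                  # read the file back
    print(f.read())

os.rename(path, os.path.join(d, "csc202.doc"))   # rename
print(os.listdir(d))                   # list the directory
os.remove(os.path.join(d, "csc202.doc"))         # delete
print(os.listdir(d))  # []
```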
FILE ALLOCATION.
The main purpose of a computer system is to execute programs. Those programs
together with the data they access must be in main memory during execution. Since main
memory is usually too small to accommodate all the data and programs permanently, the
computer system must provide secondary storage to back up main memory. Most modern
computer systems use disk as the primary storage medium for information (both programs and
data). We want to discuss how files are being allocated to the disk storage. In allocating disk
space, the following issues are involved:
- When a new file is created, is the maximum space required for that file allocated all at
once?
- Should the space allocated to a file be one or more contiguous units/portions? A portion
is a contiguous set of allocated blocks. The size of a portion can range from a single
block to the entire file. What should the size of portion allocated for a file be?
- What sort of data structure or table is used to keep track of the portions assigned to a
file? An example of such a structure is a file allocation table (FAT)
Portion Size.
The second issue as listed above is that of the size of the portion allocated to a file. At one
extreme, a portion large enough to hold the entire file is allocated. At the other extreme, space
on the disk is allocated one block at a time as the need arises. In choosing a portion size, there
is a tradeoff between efficiency from the point of view of a single file versus overall system
efficiency. A list of some items to be considered in the tradeoff is:
- Contiguity of space increases performance, especially for Retrieve_Next operations, and
greatly for transactions running in a transaction-oriented operating system
- Having a large number of small portions increases the size of tables needed to manage
the allocation information.
- Having fixed-size portions (for example, blocks) simplifies the reallocation of space.
- Having small fixed-size portions minimizes waste of unused storage due to over-
allocation.
1. Contiguous Allocation: With contiguous allocation, a single contiguous set of blocks is
allocated to a file at the time of creation. This is a pre-allocation strategy
using variable-size portions. The file allocation table needs just a single entry for each
file, showing the starting block and the length of the file. Contiguous allocation is the best
from the point of view of the individual sequential file. Multiple blocks can be read in at a
time to improve I/O performance for sequential processing. It is also easy to retrieve a
single block. Note that, with pre-allocation, it is necessary to declare the size of the file at
the time of creation. External fragmentation is a problem in this situation; compaction of the free
space is needed from time to time.
2. Linked/Chained Allocation: In Chained allocation, allocation is on an individual
block basis. Each block contains a pointer to the next block in the chain. Again, the file
allocation table needs just a single entry for each file, showing the starting block and the
length of the file. Although pre-allocation is possible, it is more common simply to allocate
blocks as needed. Any free block can be added to a chain. There is no external
fragmentation in this case because only one block at a time is needed. To select an
individual block of a file requires tracing through the chain to the desired block. One
consequence of linking/chaining is that there is no accommodation of the principle of
locality.
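The chain-tracing step can be sketched as follows; the block numbers and the next_block table are hypothetical, standing in for pointers that would actually be stored inside the blocks themselves:

```python
# Minimal sketch of chained allocation: each disk block holds a pointer
# to the next block of the file; -1 marks the end of the chain.
next_block = {4: 7, 7: 2, 2: 10, 10: -1}   # hypothetical on-disk pointers
start_block = 4                             # from the file allocation table

def blocks_of_file(start, next_block):
    """Trace the chain from the starting block to the end-of-file marker."""
    chain = []
    b = start
    while b != -1:
        chain.append(b)
        b = next_block[b]
    return chain

print(blocks_of_file(start_block, next_block))  # [4, 7, 2, 10]
```

Note that reaching block 10 requires visiting every block before it, which is exactly why chaining gives poor random access.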
BLOCKING OF RECORDS.
A file is a body of stored data or information in an electronic format. Almost all information
stored on computers is in the form of files. Files reside on mass storage devices such as hard
disks, optical disks, magnetic tapes, and floppy disks. When the Central Processing Unit (CPU)
of a computer needs data from a file, or needs to write data to a file, it temporarily stores the file
in its main memory, or Random Access Memory (RAM), while it works on the data.
A file consists of a collection of blocks and the operating system is responsible for
allocating blocks to files.
Records are the logical unit of access of a structured file, whereas blocks are the unit of I/O with
secondary storage. For I/O to be performed, records must be organized as blocks. On most
systems, blocks are of fixed length. This simplifies I/O, buffer allocation in main memory, and
the organisation of blocks on secondary storage. The larger the block, the more records that
are passed in one I/O operation. If a file is being processed or searched sequentially, this is an
advantage, because the number of I/O operations is reduced by using larger blocks, thus
speeding up processing. On the other hand, if records are being accessed randomly and no
particular locality of reference is observed, then larger blocks result in the unnecessary transfer
of unused records.
In short, the I/O transfer time is reduced by using larger blocks, but a
competing concern is that larger blocks require larger I/O buffers, making buffer management
more difficult. (A buffer is temporary storage for data being manipulated or processed.)
Methods of Blocking
There are three methods of blocking namely; fixed, variable-length spanned and variable-length
unspanned.
a. Fixed Blocking
It uses fixed-length records, and an integral number of records are stored in a block. There may
be unused space at the end of each block. This is referred to as internal fragmentation. File
fragmentation is defined as a condition in which files are broken apart on disk into small,
physically separated segments.
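The amount of internal fragmentation under fixed blocking is easy to compute. A small illustrative sketch, where the block and record sizes are made-up values:

```python
def fixed_blocking(block_size, record_size):
    """Records per block, and wasted bytes per block (internal fragmentation)."""
    records = block_size // record_size          # integral number of records
    waste = block_size - records * record_size   # unused space at the block end
    return records, waste

print(fixed_blocking(512, 120))  # (4, 32): 4 records fit, 32 bytes wasted
```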
File Space Management
Files are normally stored on disk, so management of disk space is a major concern to the
file system designers. To keep track of free disk space, the system maintains a free space list.
The free space list records all disk blocks that are free – those that are not allocated to some file
or directory.
To create a file, the system searches the free space and allocates that space to the new file.
This space is then removed from the free-space list. When a file is deleted, its disk space is
added to the free-space list.
Techniques Used in Space Management
There are four different techniques for managing free space. They are:
Bit tables
Chained free portion
Indexing
Free block list
i. Bit Tables
This method uses a vector containing one bit for each block on the disk. Each entry of a 0
corresponds to a free block, and each 1 corresponds to a block in use. For example, for the disk
layout shown below, a vector of length 35 is needed and would have the following value:
00111000011111000011111111111011000
A bit table has the advantage that it is relatively easy to find one or a contiguous group of free
blocks. Thus, a bit table works well with any of the file allocation methods outlined. Another
advantage is that it is as small as possible. However, it can still be sizeable. The amount of
memory (in bytes) required for a block bitmap is
disk size in bytes / (8 × file system block size)
Thus, for a 16-Gigabyte disk with 512-byte blocks, the bit table occupies about 4 Mbytes.
Accordingly, most file systems that use bit tables maintain auxiliary data structures that
summarise the contents of subranges of the bit table. For example, the table could be divided
logically into a number of equal-size sub-ranges. A summary table could include, for each sub-
range, the number of free blocks and the maximum-sized contiguous number of free blocks.
When the file system needs a number of contiguous blocks, it can scan the summary table to
find an appropriate sub-range and then search that sub-range.
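Finding a contiguous group of free blocks in a bit table amounts to scanning for a run of zeros. A sketch using the example vector above:

```python
bit_table = "00111000011111000011111111111011000"  # 0 = free, 1 = in use

def find_free_run(bits, n):
    """Index of the first run of n consecutive free (0) blocks, or -1 if none."""
    run, start = 0, 0
    for i, b in enumerate(bits):
        if b == "0":
            if run == 0:
                start = i       # remember where this run began
            run += 1
            if run == n:
                return start
        else:
            run = 0             # run broken by an allocated block
    return -1

print(find_free_run(bit_table, 4))  # 5
```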
iii. Indexing
The indexing approach treats free space as a file and uses an index table as described under
file allocation. There is one entry in the table for every free portion on the disk. This approach
provides efficient support for all of the file allocation methods.
iv. Free Block List
In this method, the numbers of all free blocks are kept in a list, part of which can be held in main
memory in one of two ways:
a. The list can be treated as a push-down stack (LIFO) with the first few thousand elements of
the stack kept in main memory. When a new block is allocated, it is popped from the top of the
stack, which is in main memory. Similarly, when a block is de-allocated, it is pushed onto the
stack. There has to be a transfer between disk and main memory when the in-memory portion of
the stack becomes either full or empty. Thus, this technique gives almost zero-time access most
of the time.
b. The list can be treated as a FIFO queue, with a few thousand entries from both the head and
the tail of the queue in main memory. A block is allocated by taking the first entry from the head
of the queue and de-allocated by adding it to the end of the tail of the queue. There only has to
be a transfer between disk and main memory when either the in-memory portion of the head of
the queue becomes empty or the in-memory portion of the tail of the queue becomes full.
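The stack variant can be sketched in a few lines; the block numbers are hypothetical:

```python
# Sketch: the in-memory part of the free block list treated as a LIFO stack.
free_stack = [17, 42, 9]        # hypothetical free block numbers held in memory

def allocate():
    """Pop a free block off the top of the stack."""
    return free_stack.pop()

def deallocate(block):
    """Push a freed block back onto the stack."""
    free_stack.append(block)

b = allocate()      # block 9 comes off the top
deallocate(b)       # and is pushed straight back
```

Since both operations touch only the in-memory top of the stack, allocation is near zero-cost until the in-memory portion fills or empties.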
a. Block Caching
The most common technique used to reduce disk accesses is the block cache. (Cache means
to hide.) In this context, a cache is a collection of blocks that logically belong to the disk, but are
being kept in memory for performance reasons. Various algorithms can be used to manage the
cache, but a common one is to check all read requests to see if the needed block is in the
cache. If it is, the read request can be satisfied without a disk access. If the block is not in the
cache, it is first read into the cache, and then copied to wherever it is needed. Subsequent
requests for the same block can be satisfied from the cache.
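As an illustration only (the notes do not prescribe a replacement policy), here is a tiny least-recently-used block cache; read_block stands in for the actual disk read:

```python
from collections import OrderedDict

class BlockCache:
    """Tiny LRU block cache sketch."""
    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block   # function: block number -> data
        self.cache = OrderedDict()

    def read(self, block_no):
        if block_no in self.cache:               # hit: no disk access needed
            self.cache.move_to_end(block_no)
            return self.cache[block_no]
        data = self.read_block(block_no)         # miss: read block into cache
        self.cache[block_no] = data
        if len(self.cache) > self.capacity:      # evict least recently used
            self.cache.popitem(last=False)
        return data

disk_reads = []
cache = BlockCache(2, lambda n: disk_reads.append(n) or f"block-{n}")
cache.read(1); cache.read(2); cache.read(1)      # second read of 1 is a hit
print(disk_reads)  # [1, 2] - only two disk accesses for three reads
```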
File System Reliability
Destruction of a file system is often a far greater disaster than destruction of a computer.
If a computer is destroyed by fire, lightning surges, or a cup of coffee poured onto the keyboard, it
is annoying and will cost money, but generally a replacement can be purchased with a minimum
of fuss. Inexpensive personal computers can even be replaced within a few hours. If a
computer file system is irrevocably lost, whether due to hardware, software or any other means,
restoring all the information will be difficult, time consuming, and in many cases, impossible. For
the people whose programs, documents, customer files, tax records, databases, marketing
plans, or other data are gone forever, the consequences can be catastrophic. While the file
system cannot offer any protection against any physical destruction of the equipment and
media, it can help protect the information.
a. Bad Block Management
A software solution requires the user or file system to carefully construct a file containing
all the bad blocks. This technique removes them from the free list, so they will never
occur in data files. As long as the bad block file is never read or written, no problem will
arise. Care has to be taken during disk backups to avoid reading this file.
b. Backups
Even with a clever strategy for dealing with bad blocks, it is important to back up files frequently.
After all, automatically switching to a spare track after a crucial data block has been ruined is not
easy. Backup technique is as simple as it sounds. It involves keeping another copy of the data
on some other machine or device so that the copy could be used in case of a system failure.
There are two types of backup techniques, namely full dump and incremental dump.
- Full dump simply refers to making a backup copy of the whole disk on another disk or
machine. It is pretty obvious that the process of full dumping is time consuming as well as
memory consuming.
- Incremental dump has some advantages over full dump. The simplest form of
incremental dumping is to make a full dump periodically (say monthly or weekly) and to
make a daily dump of only those files that have been modified since the last full dump. A
better scheme is to dump only those files that have been changed since they were last
dumped. Such a scheme of data backup is time efficient as well as memory efficient.
To implement this method, a list of dump times for each file must be kept on disk.
A. Intrusion/Categories of Intruders
Intrusion is a set of actions that attempt to compromise the integrity, confidentiality, or
availability of any resource on a computing platform.
Categories of Intruders
- Casual prying by non-technical users. Many people have terminals to timesharing
systems on their desks, and human nature being what it is, some of them will read other
people's electronic mail and other files if no barriers are placed in the way.
- Snooping by insiders. Students, system programmers, operators, and other technical
personnel often consider it to be a personal challenge to break the security of a local
computer system. They are often highly skilled and are willing to devote a substantial
amount of time to the effort.
- Determined attempts to make money. Some bank programmers have attempted to
modify a banking system to steal from the bank. Schemes vary from changing software to
truncate rather than round off interest (keeping the fraction of money for themselves),
to siphoning off accounts not used for years, to blackmail ("pay me or I will destroy all the
bank's records").
- Commercial or military espionage. Espionage refers to a serious and well-funded attempt by a
competitor or a foreign country to steal programs, trade secrets, patents, technology,
circuit designs, marketing plans, and so forth. Often this attempt will involve wiretapping
or even erecting antennas aimed at the computer to pick up its electromagnetic radiation.
The amount of effort that one puts into security and protection clearly depends on who the
enemy is thought to be. Absolute protection of the system from malicious abuse is not possible,
but the cost to the perpetrator can be made sufficiently high to deter most, if not all,
unauthorised attempts to access the information residing in the system.
Intrusion Detection
Intrusion detection strives to detect attempted or successful intrusions into computer systems
and to initiate appropriate responses to the intrusions. Intrusion can be detected through:
User Authentication
A major security problem for operating systems is authentication. The protection system
depends on the ability to identify the programs and processes currently executing, which in turn
depends on the ability to identify each user of the system. The process of identifying users when
they log on is called user authentication. How do we determine whether a user's identity is
authentic? Generally, authentication is based on one or more of three items:
- User possession (a key or card)
- User knowledge (a user identifier and password)
- User attributes (fingerprint, retina pattern, or signature).
1. Passwords
The most common approach to authenticating a user identity is the use of passwords. When
a user identifies herself by user ID or account name, she is asked for a password. If the user-
supplied password matches the password stored in the system, the system assumes that the
user is legitimate. Passwords are often used to protect objects in the computer system, in the
absence of more complete protection schemes. Different passwords may be associated with
different access rights. For example, different passwords may be used for reading files,
appending files, and updating files.
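In practice the system stores a hashed form of the password rather than the plaintext, so that a match can be verified without the stored file revealing the password itself. A sketch using Python's standard pbkdf2_hmac (the iteration count here is an arbitrary illustrative choice):

```python
import hashlib
import hmac
import os

def store_password(password):
    """Store a random salt plus a salted hash, never the plaintext."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def check_password(password, salt, digest):
    """Re-derive the hash and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)

salt, digest = store_password("s3cret")
print(check_password("s3cret", salt, digest))   # True
print(check_password("guess", salt, digest))    # False
```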
Password Vulnerabilities
Passwords are extremely common because they are easy to understand and use.
Unfortunately, passwords can often be guessed, accidentally exposed, sniffed, or illegally
transferred from an authorized user to an unauthorised one. There are two common ways to
guess a password.
- One way is for the intruder (either human or program) to know the user or to have
information about the user.
- The use of brute force: trying enumeration of all possible combinations of letters,
numbers, and punctuation until the password is found.
In addition to being guessed, passwords can be exposed as a result of visual or electronic
monitoring. An intruder can look over the shoulder of a user (shoulder surfing) when the user
is logging in and can learn the password easily by watching the keystrokes. Alternatively,
anyone with access to the network on which a computer resides could seamlessly add a
network monitor, allowing her to watch all data being transferred on the network (sniffing),
including user IDs and passwords. Encrypting the data stream containing the password solves
this problem. Exposure is a particularly severe problem if the password is written down where it
can be read or lost.
2. Biometrics
There are many other variations to the use of passwords for authentication. Palm or hand-
readers are commonly used to secure physical access—for example, access to a data center.
These readers match stored parameters against what is being read from hand-reader pads. The
parameters can include a temperature map, as well as finger length, finger width, and line
patterns. These devices are currently too large and expensive to be used for normal computer
authentication. Fingerprint readers have become accurate and cost-effective and should
become more common in the future. These devices read your finger's ridge patterns and
convert them into a sequence of numbers. Over time, they can store a set of sequences to
adjust for the location of the finger on the reading pad and other factors. Software can then scan
a finger on the pad and compare its features with these stored sequences to determine if the
finger on the pad is the same as the stored one.
B. Program Threats
When a program written by one user may be used by another, misuse and unexpected behavior
may result. Some common methods by which users gain access to the programs of others are:
Trojan horses, Trap doors, Stack and buffer overflow.
i. Trojan Horse
Many systems have mechanisms for allowing programs written by users to be executed by other
users. If these programs are executed in a domain that provides the access rights of the
executing user, the other users may misuse these rights. A text-editor program, for example,
may include code to search the file to be edited for certain keywords. If any are found, the entire
file may be copied to a special area accessible to the creator of the text editor. A code segment
that misuses its environment is called a Trojan horse.
Stack and buffer overflow attacks can be launched across a network connection, bypassing
the security added by firewalls. One solution to this problem is for the CPU to have a feature
that disallows execution of code in a stack section of memory.
C. System Threats
Most operating systems provide a means by which processes can give birth to other processes.
In such an environment, it is possible to create a situation where operating system resources
and user files are misused. The two most common methods for achieving this misuse are
worms and viruses.
i. Worms
A worm is a process that uses the spawn (giving birth/replicating) mechanism to ravage system
performance. The worm spawns copies of itself, using up system resources and perhaps locking
out all other processes. On computer networks, worms are particularly potent, since they may
reproduce themselves among systems and thus shut down the entire network.
ii. Viruses
Like worms, viruses are designed to spread into other programs and can wreak havoc in a
system by modifying or destroying files and causing system crashes and program malfunctions.
Whereas a worm is structured as a complete, standalone program, a virus is a fragment of code
embedded in a legitimate program. Viruses are a major problem for computer users, especially
users of microcomputer systems. Viruses are usually spread when users download viral
programs from public bulletin boards or exchange disks containing an infection. In recent years,
a common form of virus transmission has been via the exchange of Microsoft Office files, such
as Microsoft Word documents. Most commercial antivirus packages are effective against only
particular known viruses. They work by searching all the programs on a system for the specific
pattern of instructions known to make up the virus. When they find a known pattern, they
remove the instructions, disinfecting the program. These commercial packages have catalogs of
thousands of viruses for which they search. The best protection against computer viruses is
prevention, or the practice of safe computing. Purchasing unopened software from vendors
and avoiding free or pirated copies from public sources or disk exchange is the safest route to
preventing infection. Another defense is to avoid opening any e-mail attachments from
unknown users.
iii. Denial of Service
The last attack category, denial of service, is aimed not at gaining information or stealing
resources but rather at disrupting legitimate use of a system or facility. An intruder could delete
all the files on a system, for example. It involves launching an attack that prevents legitimate
use of system resources.
File Protection
There are three most popular implementations of file protection:
- File Naming
It depends upon the inability of a user to access a file he cannot name. This can be
implemented by allowing only users to see the files they have created. But since most file
systems allow only a limited number of characters for filenames, there is no guarantee that two
users will not use the same filenames.
- Password Protection
This scheme associates a password to each file. If a user does not know the password
associated to a file then he cannot access it. This is a very effective way of protecting files but
for a user who owns many files, and constantly changes the password to make sure that nobody
accesses these files will require that users have some systematic way of keeping track of their
passwords.
- Access Control
An access list is associated with each file or directory. The access list contains information on
the type of users and accesses that they can have on a directory or file. An example is the
following access list associated to a UNIX file or directory:
drwxrwxrwx
The d indicates that this is an access list for a directory; the first rwx indicates that it can be
read, written, and executed by the owner of the file; the second rwx is the access information for
users belonging to the same group as the owner (somewhere on the system is a list of users
belonging to the same group as the owner); and the last rwx is for all other users. Each rwx can be
changed, for example to r-- indicating that it can only be read, -w- for write-only, or --x for execute
only.
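Splitting such a mode string into its fields can be sketched as follows (the example string is hypothetical):

```python
def describe_mode(mode):
    """Break a UNIX mode string such as 'drwxr-x--x' into its four fields."""
    assert len(mode) == 10
    kind = "directory" if mode[0] == "d" else "file"
    owner, group, others = mode[1:4], mode[4:7], mode[7:10]
    return kind, owner, group, others

print(describe_mode("drwxr-x--x"))  # ('directory', 'rwx', 'r-x', '--x')
```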
File Characteristics
File Hit Rate: the term used to describe the rate of processing master files in terms of active
records. It is defined as the proportion of records updated or referenced on each update run in
relation to the total number of records in the master file. For instance, if 1,000 out of 10,000
records are processed, then the hit rate is 10%.
Volatility: This is the frequency with which records are added to or deleted from the file. We can
have volatile files or static files.
Size: This is the amount of data stored in the file. It may be expressed as a number of bytes,
kilobytes or megabytes.
Growth: Files often grow steadily in size as new records are added.
The common data processing techniques are:
- Batch processing
- Online processing
- Interactive processing, which includes:
  - Real-time processing
  - Multi-user processing
  - Multi-tasking processing
-Batch processing: Transactions are accumulated into batches of suitable sizes, and then each
batch is sorted and processed through a sequence of stages known as a run. Each batch is
identified by a batch number, which is recorded on a batch control slip. This slip also
contains control information (e.g. the number of items in the batch) and individual hardware
requirements. The identification and specifications are done using a special language called Job
Control Language (JCL). JCL makes it possible to specify names for the jobs, the files to be used by
each, the peripherals required, job priority, etc. The weaknesses are that it delays output and requires
physical transportation of data or manual intervention in the course of processing; it has minimal
application in modern computing.
-Online processing: involves direct connection of the data source to the computer either by
using wired or wireless connection. The computer, the data source (another computer, consoles
or any other input devices) and the output devices are said to be on-line when they can interact
automatically.
-Interactive processing: This involves hands-on transactions and is referred to as
"conversational mode processing". The software prompts the user for information, the user is
expected to respond promptly, and the response is processed immediately. The resultant
outcome is determined by the sequence of requests by the computer and the answers supplied by the
user. This technique receives and processes data at random intervals, hence time lag
is not critical. Examples of interactive processing are:
i. Real-time processing: Transactions are said to be real-time when processing is done
as events occur and the master files are updated immediately. Examples of real-time processing
are airline seat reservation, online banking, recharging a GSM account, etc.
ii. Multi-user processing. This has provision for a number of users to use the same computer
at a time. A special multi-user operating system (Windows, Linux, etc.) is required to control
the resources such that the delay in response to user requests is not noticeable.
iii. Multi-tasking processing. This technique facilitates the running of two or more tasks
(programs) concurrently on a computer. It requires a multi-tasking operating
system that allows high-speed switching between different tasks while affording access to
multiple sources of information.
- “Save as type” (at the bottom of the box). A dropdown box that allows you to choose a
format (type) for your file. The default file format will appear with the default file extension.
These three options also appear when you choose “File” ->“Open” from within Microsoft Word,
but they have slightly different names. Other programs will have the same three options, which
again might have slightly different names.
b. Using My Computer
Double-clicking on the My Computer icon, which is located in the upper left-hand corner of your
desktop, will open a window labeled “My Computer”. From within this window, you can open,
move, copy, and delete files; you can also create, move, copy, and delete folders. Double-
clicking on any folder icon also opens the contents of that.
At the “top” level of the directory structure are the drives, differentiated by letters:
- A:\ is your floppy disk drive
- C:\ is your hard disk
- D:\ is your Zip, CD, or DVD drive
- F:\ is probably your flash drive
Go to “View” at the top of the window to change the way files and folders are displayed within
the window. There are four ways to view files and folders:
- Large icons
- Small icons
- List – Choose this when you want to work with several files or folders at a time.
- Details – This is a good mode to work in when you want to see when the file was created,
its size, and other important information.
Up – Choosing “Up” enables you to navigate through the computer’s directory structure quickly.
Clicking on this button will change the contents of the current window, taking you “up” in the
directory structure until you get to the highest level, at which only the drives are shown.
Cut – When you single-click on a file or folder to select it, it will be highlighted in the window.
Choosing “Cut” will delete the file or folder from its current location and copy it to the clipboard
so that it can be pasted elsewhere.
Copy – Choosing “Copy” will copy a selected file or folder into the clipboard so that it can be
pasted elsewhere, but will not remove the file or folder from its current location.
Paste – Choosing “Paste” will paste a file or folder that is stored in the clipboard into the current
location.
Undo – Choosing “Undo” allows you to undo an action that you have just performed. This is
particularly useful when you have just deleted something you didn’t mean to delete.
Delete – Choosing “Delete” will delete a selected file or folder without copying it to the clipboard.
Properties – Choosing “Properties” will bring up a box that gives you information about a
particular file or folder.
To create a new folder in the current window, you can do one of two things:
Go to “File”-> “New”-> “Folder.”
A new folder appears in the current window with its name highlighted, allowing you to name it.
Right-click anywhere in the current window (not on an icon or filename) and choose
“New”-> “Folder.”
Right-clicking on a selected file or folder will allow you to do several useful things, among which
are the following:
- Rename a file or folder by choosing “Rename.” A blinking cursor will appear in the file or
folder name.
- Create a desktop shortcut by choosing "Send To"-> "Desktop as Shortcut."
- Copy the file or folder to a floppy disk by choosing “Send To”-> “3 ½ Floppy (A:).”
- Cut, copy, paste, or print a file.
My Documents Dialog Box
Name your files meaningfully. For example, if a letter is overdue, call it something like
"overdue081206" rather than something like "letter". How will you know who the letter is to
without opening it?
File as you go. The best time to file a document is when you first create it. So get in the
habit of using the “Save As” dialogue box to file your document as well as name it, putting
it in the right place in the first place.
Order your files for your convenience. If there are folders or files that you use a lot,
force them to the top of the file list by renaming them with AA at the beginning of the
filename.
Cull your files regularly. Sometimes what’s old is obvious as in the example of the
folder named “Invoices” above. If it’s not, keep your folders uncluttered by clearing out
the old files. Do NOT delete business related files unless you are absolutely certain that
you will never need the file again. Instead, in your main collection of folders in My
Documents, create a folder called “Old” or “Inactive” and move old files into it when you
come across them.
Back up your files regularly. Whether you’re copying your files onto another drive or
onto tape, it’s important to set up and follow a regular back up regime. If you follow these
file management tips consistently, even if you don’t know where something is, you know
where it should be.
Different types of algorithms exist with regard to file sorting and searching.
I. Sorting Algorithm
In computer science and mathematics, a sorting algorithm is a prescribed set of well-defined
rules or instructions that puts elements of a list in a certain order. The most-used orders are
numerical order and alphabetical order. Efficient sorting is important to optimizing the use of
other algorithms (such as search and merge algorithms) that require sorted lists to work
correctly. More formally, the output must satisfy two conditions:
1. The output is in non-decreasing order (each element is no smaller than the previous element
according to the desired total order);
2. The output is a permutation, or reordering, of the input.
Some popular Sorting Algorithms are as follows:
Bubble Sort
This is a sorting algorithm that continuously steps through a list, swapping items until they
appear in the correct order. Bubble sort is a straightforward and simple method of sorting data.
The algorithm starts at the beginning of the data set. It compares the first two elements, and if
the first is greater than the second, it swaps them. It continues doing this for each pair of
adjacent elements to the end of the data set. It is used for small lists.
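The adjacent-swap passes can be sketched as follows (the early-exit check is a common refinement, not required by the basic algorithm):

```python
def bubble_sort(items):
    """Repeatedly swap adjacent out-of-order pairs until the list is sorted."""
    data = list(items)
    n = len(data)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):          # last i items are already in place
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
                swapped = True
        if not swapped:                      # a pass with no swaps: done early
            break
    return data

print(bubble_sort([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]
```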
Insertion Sort
This is a simple sorting algorithm that is relatively efficient for small lists and mostly-sorted lists,
and often used as part of more sophisticated algorithms. It works by taking elements from the
list one by one and inserting them in their correct position into a new sorted list.
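A sketch of the insert-into-sorted-prefix idea (this version builds the sorted portion in place rather than in a separate new list, a common variant):

```python
def insertion_sort(items):
    """Insert each element into its correct position among those before it."""
    data = list(items)
    for i in range(1, len(data)):
        key = data[i]
        j = i - 1
        while j >= 0 and data[j] > key:   # shift larger elements one slot right
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = key                 # drop the element into its slot
    return data

print(insertion_sort([9, 3, 7, 1]))  # [1, 3, 7, 9]
```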
Merge Sort
Merge sort takes advantage of the ease of merging already sorted lists into a new sorted list. It
starts by comparing every two elements (i.e., 1 with 2, then 3 with 4...) and swapping them if the
first should come after the second. It then merges each of the resulting lists of two into lists of
four, then merges those lists of four, and so on; until at last two lists are merged into the final
sorted list. Of the algorithms described here, this is the first that scales well to very large lists.
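The merging idea can be sketched as follows. Note this is the recursive top-down form rather than the bottom-up pairwise merging described above; the merge step itself is identical:

```python
def merge_sort(items):
    """Split the list in half, sort each half, then merge the sorted halves."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # take the smaller head each time
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]      # append whichever half remains

print(merge_sort([6, 2, 9, 1, 5]))  # [1, 2, 5, 6, 9]
```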
Heap Sort
Heap sort works by determining the largest (or smallest) element of the list, placing that at the
end (or beginning) of the list, then continuing with the rest of the list, but accomplishes this task
efficiently by using a data structure called a heap, a special type of binary tree. Once the data
list has been made into a heap, the root node is guaranteed to be the largest
(or smallest) element. When it is removed and placed at the end of the list, the heap is
rearranged so the largest element remaining moves to the root.
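The text describes a max-heap whose root is moved to the end of the list; the mirror-image min-heap version is sketched below using Python's standard `heapq` module, where the root is always the smallest element:

```python
import heapq

def heap_sort(items):
    """Build a heap from the data, then repeatedly remove the root
    (the smallest element, since heapq implements a min-heap)."""
    heap = list(items)
    heapq.heapify(heap)                 # rearrange the list into a heap
    return [heapq.heappop(heap) for _ in range(len(heap))]
```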
Quick Sort
Quick sort is a divide and conquer algorithm which relies on a partition operation: to partition an
array, we choose an element, called a pivot, move all smaller elements before the pivot, and
move all greater elements after it. This can be done efficiently in linear time and in-place. We
then recursively sort the lesser and greater sub-lists. Efficient implementations of quick sort
(with in-place partitioning) are typically unstable sorts and somewhat complex, but are among
the fastest sorting algorithms in practice. Because of its modest space usage, quick sort is one
of the most popular sorting algorithms, available in many standard libraries. The most complex
issue in quick sort is choosing a good pivot element; consistently poor choices of pivots can
result in drastically slower performance.
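The partition-and-recurse idea can be sketched in Python. This copying version is easy to read; the in-place partitioning the text mentions is what gives real implementations their modest space usage:

```python
def quick_sort(items):
    """Partition around a pivot, then recursively sort the
    lesser and greater sub-lists."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]      # middle element chosen as pivot
    smaller = [x for x in items if x < pivot]
    equal   = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return quick_sort(smaller) + equal + quick_sort(greater)
```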
II. Search Algorithm
Linear search: processes the records of a file in their order of occurrence until it either
locates the desired record or has processed all the records.
- If the records in the file are ordered, the number of comparisons equals the position of
the desired record: the minimum is 1 and the maximum is n.
- Moving to the next record is done simply by incrementing the address of the current
record by the record size.
{
Linear search algorithm
FOR i := 1 TO n DO
IF key(sought) = key(i) THEN terminate successfully
NEXT i
Terminate unsuccessfully
}
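The pseudocode translates directly into Python (a sketch; the `key` parameter, my own addition, stands in for extracting the key field from a record):

```python
def linear_search(records, sought_key, key=lambda r: r):
    """Examine records in order of occurrence; return the index of
    the first record whose key matches, or -1 if none does."""
    for i, record in enumerate(records):
        if key(record) == sought_key:
            return i                    # terminate successfully
    return -1                           # terminate unsuccessfully
```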
Binary search: for an ordered file, this searching technique reduces the number of
comparisons by comparing the sought key with the key of the middle record of the file. The
upper or the lower half of the file is eliminated based on whether the sought key is
less than or greater than the middle key.
IF key(sought) < key(middle) THEN eliminate the upper portion, including the middle key
ELSE IF key(sought) > key(middle) THEN eliminate the lower portion, including the
middle key.
The procedure will continue until the desired record is found or it is determined that the record is
not available in the file.
{
Binary search algorithm
Lower := 1
Upper := n
WHILE Lower <= Upper DO
Middle := (Lower + Upper) / 2
IF key(sought) = key(Middle) THEN terminate successfully
ELSE IF key(sought) > key(Middle) THEN Lower := Middle + 1
ELSE Upper := Middle - 1
END IF
END WHILE
Terminate unsuccessfully
}
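The same Lower/Upper/Middle scheme can be written in Python (a sketch over a sorted list of keys, using 0-based indices rather than the pseudocode's 1-based ones):

```python
def binary_search(keys, sought_key):
    """Search a sorted list by repeatedly halving the range.
    Returns the index of the sought key, or -1 if absent."""
    lower, upper = 0, len(keys) - 1
    while lower <= upper:
        middle = (lower + upper) // 2
        if keys[middle] == sought_key:
            return middle               # terminate successfully
        elif sought_key > keys[middle]:
            lower = middle + 1          # eliminate lower portion
        else:
            upper = middle - 1          # eliminate upper portion
    return -1                           # terminate unsuccessfully
```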