Computer Science Notes Operating Systems
This minisite contains notes taken by Chris Northwood whilst studying Computer Science at the
University of York between 2005-09 and the University of Sheffield 2009-10.
They are published here in case others find them useful, but I provide no warranty for their accuracy,
completeness or whether or not they are up-to-date.
The contents of this page have dubious copyright status, as great portions of some of my revision notes
are verbatim from the lecture slides, what the lecturer wrote on the board, or what they said.
Additionally, lots of the images have been captured from the lecture slides.
An operating system is a program that acts as an intermediary between a user of the PC and the
hardware.
System Components
A computer system can be abstracted into various components. Users (people, or other machines and
systems) interact with system and application programs (which define the ways in which system resources
are used to solve the users' computing problems), which interact with the OS (which controls and co-
ordinates the use of hardware among the various application programs for the user), which interacts
with the hardware (the most basic computing resources - CPU, memory, I/O, etc). When developing an
OS, all levels must be considered, not just the OS level.
Users interact with the hardware (monitor, keyboard, mouse, etc) to control the system and
application programs. A program needs an efficient interface to the OS to utilise the resources it
needs to accomplish the tasks the user wants done, and also to communicate with the user.
The OS interface describes the calls that an application program can make into the OS.
The OS is designed for ease of use, in the sense of not having to deal with low-level hardware issues.
This is more important than CPU usage - a user doesn't care if the CPU is idle 99% of the time. The goal
of OS design is to provide a useful, simple and efficient system, with ease of use as the primary aim.
However, abstraction away from the physical machine to provide ease of use can lead to inefficiencies.
There's a trade off required between the extra functionality of the virtual machine and the overheads
required to implement this.
The earliest systems didn't have operating systems. Users had complete access to the machine and
the interface was basically raw hardware. Programming was accomplished by wiring up plugboards to
control basic behaviour of the computer - the problems to be solved were normally numerical
calculations.
Early batch systems had minimal operating systems and were typically mainframe computers used to support
commercial and scientific applications. Jobs and programs were written using punch cards and the
computers were normally application specific.
The minimal OS was required to interact with I/O devices and a subroutine library existed to manage I/O
interaction. The library is loaded into the top of memory and stays there. Application programs existed
(compiler/card reader), but were limited.
Optimisations could be made in reducing the setup time by batching similar jobs. A small computer
reads a number of jobs onto magnetic tape and the magtape is run on the mainframe. A number of jobs
is then done in succession without wasting CPU time. These systems had a small permanently resident
OS which had initial control and controlled automatic job sequencing.
These systems had a new memory layout and the processor had two modes - Normal, where some
instructions (e.g., HALT) are forbidden, which was used by the batch programs, and Privileged, where all
instructions are available, this was the level the OS ran in. Some CPUs also implemented protection in
Normal mode to prevent the user program overwriting the first section of memory in which the OS
existed. Normal mode also has an instruction to call the OS (hand back control).
Batch systems were good for CPU bound scientific computing with minimal I/O, but bad for data
processing with high I/O as the CPU is often idle. The speeds of the mechanical I/O devices is
considerably slower than that of the electronic computer and jobs often have to pause when performing
I/O. In some cases, the CPU was often idle for >80% of the time.
However, with a permanently resident OS in the memory, the OS now becomes an overhead, with less
physical memory available for the application running at the time. Programming discipline in the
application is still required as the OS can not protect against the application crashing the machine via
bad I/O, infinite loops, etc...
The introduction of ICs led to minicomputers, which tried to merge the scientific computers with I/O
computers and this introduced multiprogramming. Disk drives allowed jobs and programs to be stored
on disk, allowing the computer to have access to many jobs. The CPU can now load many jobs into
memory by giving each one a memory partition and then execute them by multiplexing between them.
In these OSs, the OS picks a job and then passes control to it. Eventually, the job either finishes or waits
for some task (e.g., I/O) to complete. In a non-multiprogrammed system, the CPU would now be idle. In
a multiprogrammed system, the OS switches the CPU to another job.
Eventually, the operation the original job was waiting for completes, making that job eligible for CPU
time again. As long as there is at least one job ready to execute, the CPU is not idle.
When a job completes, the OS can load another job into that memory slot. Reading jobs from cards onto
disk while another job runs, and printing completed jobs' output from disk afterwards, is called spooling
- Simultaneous Peripheral Operation On-Line - and is still used today for operations such as print queues.
Multiprogramming OSs are quite complex - they now handle quite complex I/O devices (disks, etc) and
swapping between applications when suspended on I/O.
Multiprogramming works best when there are a small number of large jobs - this reduces the overhead
of the OS due to managing memory between fewer jobs. This method also encourages a fairer overall use
of the computer.
In time-share systems, the user has the illusion of using the computer continuously and the user can
now interact directly with the computer whilst it is running. This encourages smaller jobs and more
efficient use of the computer. Additionally, many users can share the computer simultaneously,
submitting jobs/commands directly via keyboards/terminals, etc.
The first time-sharing system was CTSS, developed by MIT. This formed the basis for MULTICS, the first
commercial time-sharing OS, which had only limited success. UNIX was then developed by Bell Labs in the
high-level language C, which was itself based on B.
The fourth generation of operating systems brought about the first mass user operating systems, as LSI
and VLSI enabled low-cost computing, with mini and desktop computers becoming common.
The age of mass computing is based around desktop computers. The desktop PC is dedicated to a single
user and the distinction between desktop computers and minicomputers has become increasingly blurred.
Many minicomputers are now merely high-end desktop PCs, with advanced I/O capabilities and
multiprocessors.
The multiprogrammed OS (UNIX, VMS, etc) was developed further for mass user computers, adding
functionality like: massive file storage, fast networking, complex application programs and the potential
for multiprocessing. Modern desktop systems have dedicated single user OSs (e.g., CP/M, DOS, MS-DOS,
Mac OS, Windows) with complex I/O devices: keyboards, mice, display screens, small printers, graphics
tablets, etc, and complex graphic applications, with user convenience and responsiveness considered of
high importance.
An increased blurring between desktop computers and minicomputers has led to minicomputer OSs
becoming available for desktop, e.g., Linux - based on the BSD and SysV variants of UNIX.
Much of the current OS trends are based in the relative cost of computers and people. In the beginning,
computers were few with very expensive hardware, so the people cost was cheap compared to the
computer cost, therefore the goal was to maximise hardware utilisation. Now we have an amazing
number of computers with cheap hardware and the human cost is expensive compared to the hardware
cost. The goal now is ease of use.
Current OS trends are towards specialist requirements - moving away from the general purpose OS for
specialist needs, or tuning existing general purpose OSs. Real-time OSs have been introduced, which
require low overheads and high efficiency of resource usage. Multi-processor machines are becoming more
common and OSs need to be updated to handle the extra hardware. On the other hand, embedded devices
are now becoming very common, and these have limited power, space and weight requirements.
Resource Sharing
The OS contains a set of algorithms that allocates resources to the programs executed on behalf of the
user. These resources include time, power, hardware, etc...
Control Program
The control program controls the operation of the application programs to prevent errors affecting other
programs.
Kernel
The kernel is the only program resident all the time (all other programs are application programs).
The modern general purpose OS contains various components to accomplish these roles, such as:
• Process management
• File management
• Protection system
All but the last two in the above list can be considered to be resource management.
Processes
A program is an algorithm expressed in some notation. A process is the activity of executing the
program.
Early systems only allowed one program to be executed at a time. Later systems allowed
multiprogramming, where many programs are loaded into memory at once and only one can be
executed at any time. Programs are termed jobs or processes when executing. A process is sequential
and can only do one thing at a time.
Multiple user processes run in separate (protected) address spaces, and can't directly access other
processes' memory. Some user programs can share address space (e.g., an interpreter), but only the code
(which is read only) is shared. The kernel, however, can access everything.
As the CPU only has one program counter, the OS must use a mechanism known as process switching.
Process Creation
Processes are created by one of four principal events: system initialisation, a running process executing
a process-creation system call, a user request to create a new process, or the initiation of a batch job.
When a new process is created, the OS must assign memory and create a task/process control block
(TCB/PCB) which holds various data about the currently running process, such as:
• Process state
• Unique program ID
• Program counter
• CPU registers
• Memory information (this, along with the program counter and CPU registers forms the process
context)
Process Hierarchies
When a parent creates a child process, the child process can create its own child processes, and we get a
hierarchy (or tree) of processes. UNIX terms this as a process group, whereas Windows has no such
concept, and all processes are created equal.
Process State
In single processor systems, only one process can be in the running state at any time. If all processes are
waiting (or blocked), then the processor is idle until one of the processes becomes unblocked, or ready.
Process Termination
When a process is terminated, all resources used by the process are reclaimed by the OS, with the PCB
reclaimed last (as the PCB contains information on where the resources to be reclaimed are).
System Calls
System calls provide an interface between the running process and the OS. They are available as
assembly-level instructions, normally some kind of software interrupt (INT 0x80 on Linux), to which the
OS reacts appropriately.
Parameters are passed to system calls, just as in procedure calls. Normally registers, tables or stacks are used,
but in modern processors, tables are the most popular. Another method is using instructions that force
an address space change without using an interrupt, but the effect is the same.
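For example, on Linux with glibc, the same kernel service can be requested through the usual library
wrapper or through the generic syscall() wrapper, which passes the call number and parameters through to
the kernel (a minimal sketch):

    /* A minimal sketch (assuming Linux with glibc): the same kernel service
       invoked via the usual library wrapper and via the generic syscall()
       wrapper, which passes the call number and parameters to the kernel. */
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        const char msg[] = "hello via write()\n";

        /* Library wrapper: glibc sets up registers and traps into the kernel. */
        write(STDOUT_FILENO, msg, sizeof(msg) - 1);

        /* Explicit system call: SYS_write is the kernel's call number. */
        syscall(SYS_write, STDOUT_FILENO, "hello via syscall()\n", 20);

        return 0;
    }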
System calls normally map onto the services provided by the OS.
POSIX is an IEEE standard based on the UNIX interface. It is not an OS, merely a standard API which is
very vague and covers almost all system calls you would ever have to make. Some aspects of most
major OSs are POSIX compliant, but there exists no complete implementation of the POSIX standard.
Linux is primarily POSIX compliant for the core calls.
Threads
Traditional (heavyweight) processes have a single control path through their code. Many applications are
concurrent, but do not want the overhead of multiple processes and inter-process communication (e.g.,
a word processor has UI, reading input, a background spell checker, etc).
Many OSs provide threads (sometimes called lightweight processes), a basic unit of CPU use. Each is a
separate control path through the code, and threads within a process are essentially independent.
Threads can access all of the address space owned by the process, and threads have no protection against
each other (therefore, you should program threads safely).
Each thread comprises a thread ID, a program counter, a register set and a stack. The thread shares with
other threads belonging to a process, the code section, the data section and OS resources such as open
files. Each thread needs context, but not as much as the process needs, as it inherits some of the context
of the process.
Threading has various benefits, such as responsiveness - other threads can continue if one blocks (e.g.,
on I/O), resource benefits - there's only one copy of the code and TCB, etc, economy - thread creation
and management is less costly than that of processes, and better utilisations of multi-processor
architectures (threads can run in parallel).
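A minimal sketch of this using POSIX threads (assuming a pthreads implementation is available; compile
with -lpthread): two threads run inside one process and share its global data.

    /* Minimal POSIX threads sketch: two threads run in the same address
       space and can see the same global variable. */
    #include <pthread.h>
    #include <stdio.h>

    static int shared = 0;          /* visible to all threads in the process */

    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        shared += id;               /* unsynchronised - see the later sections */
        printf("thread %d running\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;

        pthread_create(&t1, NULL, worker, &id1);
        pthread_create(&t2, NULL, worker, &id2);

        pthread_join(t1, NULL);     /* wait for both threads to finish */
        pthread_join(t2, NULL);

        printf("shared = %d\n", shared);
        return 0;
    }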
User Threads
Implemented by user-level libraries, these provide threads to an application without the support of the
underlying OS. Here, the application handles its own threading, so the OS only sees one process. The
application has a thread table, used in the application for keeping track of thread state and context,
in a similar way to how the kernel handles TCBs for processes. The state diagram for a thread looks
similar to the one for a process.
User threads are usually fast and have low management overheads (as a library, the threads library is
within the address space of the process, so calls to the threads library functions are fast). The
application can also define its own management (i.e., scheduling) policy based on better information
about its threads than a more general kernel policy.
This method does have disadvantages, however. When a thread calls I/O, the whole process is blocked,
as the OS sees the whole process as waiting for I/O. When an application has threads that are likely to
block often, then user threads may not be the best option.
Kernel Threads
Kernel threads are implemented at the OS level, and therefore have higher overheads (a thread control
space/table is now needed in the TCB) and are slower, due to syscalls. With kernel threads, the OS
schedules all threads, so it can schedule appropriately (e.g., when one thread blocks, only move that
thread into waiting instead of the whole process). Threads can also be scheduled over multiple
processors, so we have parallelism instead of just concurrency (which can introduce its own set of problems).
Hybrid Threading
Both user and kernel threading have their advantages and disadvantages, so using a hybrid method
gives you the best of both worlds. Most real implementations of threading are hybrid.
A common approach is to multiplex a number of user threads onto a number of kernel threads; each
kernel thread has some set of user threads that take turns using the kernel thread. There is usually a
maximum number of kernel threads per user process.
Many-to-One
This is essentially user-level threading viewed in terms of its mapping onto the kernel: many user-level
threads are mapped to a single kernel thread. All thread management happens in user space, so this is
efficient, but it does suffer from the one-thread-blocking-the-whole-process problem.
You could also implement a kernel thread as a process, which is used on systems that don't support
kernel threads.
One-to-One
All threads here are kernel threads - each user thread maps directly on to one kernel thread. This has
more concurrency than the many-to-one model, as when one thread blocks another can run. This does
have the overhead of having to create a kernel thread for each user thread, however.
Many-to-Many
This is the normal method of implementing threading. Here, the user library controls everything - which
user threads are currently mapped onto kernel threads, which are not, and the swapping between them. The
library tries to control the threads
so that not all of the kernel threads end up blocked at the same time, by swapping in and out waiting
user threads to kernel threads.
This has the advantages of there being control over concurrency in the application, and the application
can allocate user threads to kernel threads in a way to optimise the concurrency and ensure that at least
one user thread is runnable at any time.
This method is quite complex, however, and there are overheads in both the OS and the application. The
user thread scheduling policy may also conflict with the kernel thread scheduling policy, introducing
inefficiencies.
Threading Issues
fork() creates a separate duplicate process and exec() replaces the entire process with the program
specified by its parameter. However, we need to consider what happens in a multi-threaded
process. exec() works in the same manner - replacing the entire process including any threads (kernel
threads are reclaimed by the kernel) - but if a thread calls fork(), should all threads be duplicated, or is
the new process single threaded?
Some UNIX systems work around this by having two versions of fork(), a fork() that duplicates all threads
and a fork that duplicates only the calling thread, and usage of these two versions depends entirely on
the application. If exec() is called immediately after a fork(), then duplicating all threads is not necessary,
but if exec() is not called, all threads are copied (an OS overhead is implied to copy all the kernel threads
as well).
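A minimal sketch of the usual fork()-then-exec() pattern on a POSIX system, in which the child
immediately replaces itself with a new program and the parent waits for it (the program run by the
child, ls, is just an illustrative choice):

    /* Sketch of the usual fork()/exec() pattern: the child replaces itself
       with a new program while the parent waits for it to finish. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid = fork();          /* duplicate the calling process */

        if (pid == 0) {
            /* child: replace the whole process image (threads and all) */
            execlp("ls", "ls", "-l", (char *)NULL);
            perror("execlp");        /* only reached if exec fails */
            return 1;
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);   /* parent: wait for the child to terminate */
        } else {
            perror("fork");
            return 1;
        }
        return 0;
    }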
Thread Cancellation
Cancellation is termination of the thread before it is complete (e.g., if multiple threads are searching a
database and one finds an answer, the others can stop).
There are two types of cancellation: asynchronous and deferred (or synchronous). With asynchronous
cancellation, the calling thread can immediately terminate the target thread; with deferred cancellation,
the target thread is informed that it is to be terminated by setting some status in the target thread.
When the target thread consults its status, it finds out it is to terminate, performs some clean-up and
terminates itself. This requires some co-ordination between threads.
Asynchronous cancellation is useful if the target thread does not respond; otherwise it should not be
used, as it can lead to inconsistencies.
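Deferred cancellation can be sketched with POSIX threads (deferred is the default cancellation type;
pthread_testcancel() marks an explicit safe point at which a pending request is acted upon):

    /* Sketch of deferred cancellation with POSIX threads: the target thread
       only acts on a pending cancellation request at a cancellation point,
       here made explicit with pthread_testcancel(). */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *search(void *arg)
    {
        (void)arg;
        for (;;) {
            /* ... do a chunk of work (e.g., scan part of a database) ... */
            pthread_testcancel();   /* safe point to act on a cancel request */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, search, NULL);

        sleep(1);                   /* let the thread do some work */
        pthread_cancel(t);          /* request (deferred) cancellation */
        pthread_join(t, NULL);      /* thread cleans up and terminates */

        puts("worker cancelled");
        return 0;
    }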
Signal Handling
Signals are used in UNIX to notify a process that a particular event (such as an error) has occurred. A
signal is normally generated by an event and is delivered to the process. Once delivered, the signal must
be handled.
Synchronous signals are delivered to the same process that performs the operation that caused the
signal, e.g., divide by 0 errors, etc...
Asynchronous signals are delivered to running processes and are generated by something external to the
process, e.g., a key press.
The signal can be handled by a default signal handler, or a user-defined signal handler. User-defined
overrides the default handler (in most cases).
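A minimal sketch of installing a user-defined handler with sigaction() on a POSIX system, overriding the
default action for SIGINT (the asynchronous signal generated by Ctrl-C):

    /* Sketch of installing a user-defined handler that overrides the default
       action for SIGINT (asynchronous, e.g., generated by Ctrl-C). */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t interrupted = 0;

    static void on_sigint(int signo)
    {
        (void)signo;
        interrupted = 1;            /* just record that the signal arrived */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigint;  /* user-defined handler replaces the default */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGINT, &sa, NULL);

        while (!interrupted)
            pause();                /* sleep until a signal is delivered */

        puts("caught SIGINT, exiting cleanly");
        return 0;
    }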
In multi-threaded systems, we have to consider which thread the signal should be delivered to:
• To the thread where the signal applies (how do you know this?)
• To a designated thread, which handles all signals (has the problem of there being the overhead
of another thread, and the thread may not be scheduled in the best way to handle the signal)
CPU Scheduling
CPU scheduling refers to the assignment of the CPU amongst processes that are ready. It selects from
among the processes in memory that are ready to execute and allocates the CPU to one of them. In
terms of process state, the scheduler decides which of the processes that are in the ready state should
be allocated to the CPU and enter the running state.
A CPU scheduling decision may take place when a process:
1. switches from the running state to the waiting state
2. switches from the running state to the ready state
3. switches from the waiting state to the ready state
4. terminates
A scheduling decision may also take place as the result of a hardware interrupt.
We call scheduling under conditions 1 and 4 nonpreemptive - once the program is running, it only gives
up the CPU voluntarily. Scheduling under conditions 2 and 3 is preemptive - that is, the running process
can be preempted by an interrupt. Problems may arise when a program is preempted whilst accessing
shared data, which gives us a need for mutual exclusion.
Dispatcher
The dispatcher module gives control of the CPU to the process selected by the scheduler. This involves:
• switching context
• jumping to the proper location in the user program to restart that program
With the dispatcher, we have a new metric, dispatch latency, which is the time it takes for the dispatcher
to stop one process and start another running. Dispatch latency is optimal when it is minimised.
The dispatcher is an example of a mechanism - how scheduling is actually carried out. Policies decide
what will be done, and the separation of policies and mechanisms is an important OS principle, allowing
flexibility if policy decisions are changed later.
Process scheduling is based upon the states in the process state transition graph. The OS needs to find the TCBs of all
processes efficiently, and needs to find a process waiting on a particular event quickly when that event
occurs. Specifically, the OS needs to find the next process to dispatch (i.e., from the ready queue) quickly
on a context switch. You also need to be able to move processes between queues efficiently.
A process is only in one queue at a time. For example, you may have a ready queue, which is the set of
all processes residing in main memory, ready to execute, and also a number of waiting queues, one for
each different I/O device, which is the set of processes waiting on I/O from that device. Processes move
between queues as they change state. Pictorially, this may look like this:
With this model, we have two types of schedulers, the short-term scheduler (or CPU scheduler), which
selects which process should be executed next from the ready queue and allocates CPU. It is invoked
very frequently (every 100 or so milliseconds), and must be very fast. Additionally, we have the long-
term scheduler (or job scheduler). This selects which new processes should be brought into the ready
queue, and is invoked very infrequently (in terms of seconds and perhaps even minutes). The long-term
scheduler controls the degree of multi-programming (i.e., the number of processes in memory).
However, how do we decide which jobs should be brought into main memory?
Processes can be described as being either I/O-bound (spends more time doing I/O than computations,
and has many short CPU bursts) or CPU-bound (spends more time doing computation, has a few very
long CPU bursts). The long term scheduler tries to maintain a balance between these two forms.
Some OSs also have a medium term scheduler, which swaps in/out partially executed processes.
Scheduling Policy
When the OS needs to select another process to execute, the scheduling policy decides which of the
processes to assign to the CPU. Scheduling policy effectively determines the order of the ready queue so
that the next process to run is at the head of the queue.
Scheduling policies are often defined on the CPU-I/O burst cycle. Most process executions consist of a
cycle of CPU execution and I/O wait. The characteristics of this cycle help determine policies. The
relationship between the relative sizes of CPU and I/O bursts is very important. I/O bursts are very long compared to
CPU bursts in most applications. A histogram of CPU-burst times gives us a typical CPU-burst graph,
showing many short CPU bursts and few long CPU bursts, which are the CPU bound processes.
A good scheduling policy keeps:
• Throughput high (the number of processes that complete their execution per time unit)
• Turnaround time low (the amount of time it takes to execute a particular process)
• Waiting time low (amount of time a process has been in the waiting queue)
• Response time low (amount of time it takes from when a request was submitted until the first
response is produced, not output - for time-sharing environments)
Turnaround time alone is a bad goal to optimise towards, as overall processes will take longer.
First-Come, First-Served Scheduling
In First-Come, First-Served (FCFS) scheduling, processes are scheduled in the order that they arrive in the
ready queue.
The average waiting time for FCFS is not optimal, however, as it suffers from the convoy effect, where
short processes are delayed behind long processes.
Shortest-Job-First Scheduling
Shortest-Job-First (SJF) scheduling works by associating each process with the length of its next CPU
burst, then using these lengths to schedule the process with the shortest time next.
This has two schemes: non-preemptive (once the CPU is given to the process, it cannot be preempted
until it completes its CPU burst); and preemptive (if a new process arrives with CPU burst length less
than the remaining time of the current executing process, preempt), which is sometimes known as
Shortest-Remaining-Time-First (SRTF) .
SJF is optimal, giving the minimum average waiting time for a given set of processes.
Determining the length of the next CPU burst can only be done by estimating the length. One way of
accomplishing this is by using the length of previous CPU bursts and exponential averaging with the
formula τ(n+1) = α·t(n) + (1 − α)·τ(n), where t(n) is the actual length of the nth CPU burst, τ(n+1) is
the predicted value for the next CPU burst, and 0 ≤ α ≤ 1.
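A small sketch of this prediction (the burst lengths, α = 0.5 and the initial guess τ(0) = 10 are
illustrative values only):

    /* Sketch of exponential averaging for predicting the next CPU burst,
       using tau(n+1) = alpha * t(n) + (1 - alpha) * tau(n). */
    #include <stdio.h>

    static double predict_next(double alpha, double actual, double predicted)
    {
        return alpha * actual + (1.0 - alpha) * predicted;
    }

    int main(void)
    {
        double tau = 10.0;                            /* initial guess (ms) */
        double bursts[] = { 6.0, 4.0, 6.0, 4.0, 13.0, 13.0, 13.0 };
        double alpha = 0.5;

        for (int i = 0; i < 7; i++) {
            printf("burst %d: actual %.1f, predicted %.1f\n", i, bursts[i], tau);
            tau = predict_next(alpha, bursts[i], tau);   /* update the estimate */
        }
        return 0;
    }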
Priority Scheduling
A priority number (integer) is associated with each process and the CPU is allocated to the process with
the highest priority (the smallest integer). This can be either preemptive or non-preemptive. SJF is a
priority scheduling policy if the priority is predicted next CPU burst time.
This does have the problem of starvation, where low priority processes may never execute, and the
solution to this is aging, where as time progresses, the priority of the process is increased.
Round Robin
Each process gets a small unit of CPU time (a time quantum), which is usually about 10-100 ms. After this
time has elapsed, the process is preempted and added to the end of the ready queue. A smaller time
quantum means more context switches and therefore more overheads, so the quantum needs to be set
to some appropriate value that is large with respect to the length of a context switch, but not so high as
to basically become a FIFO policy. If there are n processes in the ready queue and the time quantum is q,
then the process gets 1/n of the CPU time in chunks of at most q time units at once. No process waits
more than (n - 1)q time units.
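For example, with n = 5 processes in the ready queue and q = 20 ms, each process receives roughly a fifth
of the CPU and waits at most (5 − 1) × 20 = 80 ms (plus context-switch overhead) before its next time slice.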
Average turnaround time does not necessarily improve as the quantum increases, so quantum size is a
trade-off between good turnaround and low overheads.
Multi-level Scheduling
Here, the ready queue is partitioned into separate queues, e.g., a foreground queue (for interactive
processes) and a background queue (for batch processes). Each queue has its own scheduling algorithm,
for example the foreground one is a RR queue and the background one may be FCFS.
Each queue must also be scheduled for CPU time, for example we could have a fixed priority scheduling
algorithm (serve all from foreground, then from background), but this does have the possibility of
starvation, or we could have a time-slice, where each queue gets a certain amount of CPU time which it
can schedule among its processes, i.e., 80% to foreground and 20% to background.
A process can move between the various queues in a multi-level scheduling policy, and this is a way of
implementing aging. The multilevel-feedback-queue scheduler is defined by the following parameters:
• number of queues
• method used to determine which queue a process will enter when that process needs service
Thread Scheduling
For user-level threads, any general purpose scheduling policy can be used according to the best policy for
that application. General user-level thread implementations should allow the application sufficient
freedom to define an application-specific scheduling regime, i.e., allow the application to reorder the
thread scheduling queues at any time, with any policy.
Memory Management
Processes (including threads) need physical memory in order to execute. The OS manages memory and
allocates memory to processes from physical memory, and virtual memory. Physical memory
management enables the OS to safely share physical memory among processes. This allows us to
multiprogram the processor, to time-share the system (overlap computation and I/O) and to prevent one
process from using another process's memory. Memory management must also allow for relocation:
• You may need to load processes to different locations each time they execute
• You may need to swap all or part of a process to secondary storage during execution (and then
reload it at a different memory location, as above)
Some simple memory management techniques include loading the process into memory and switching
between them with no protection between the processes (DOS), or copying the entire process memory
to secondary storage when it performs I/O and then copy it back when it is ready to restart (early UNIX).
This is clearly inefficient.
Most modern OSs work by giving each process a part of memory that it is allowed to access and check
each memory access to ensure that the process stays within its own memory.
Main memory must accommodate both the OS and user processes. Usually, the main memory is divided
into a number of distinct partitions, such as the resident OS, which is normally a single partition held in
low memory, as this is where CPUs place interrupt vectors. User processes normally consist of one or
more partitions, within memory above the OS. However, we need to consider how we're going to divide
up these user partitions.
The simplest scheme is to partition into regions with fixed boundaries - this is called fixed partitioning
and has two types, where the partitions are of equal size, or where they are of unequal size. Any process
whose size is less than or equal to the partition size can be loaded into an available partition. If all the
partitions are full, the operating system can swap a process out of a partition. However, if a process
doesn't fit into a partition, it is the programmer's responsibility to make sure that the program can be
split over multiple partitions. With this, main memory use is inefficient. Any program, no matter how
small, occupies an entire partition. This is called internal fragmentation.
With unequal partitions, the problem is reduced, but not eliminated. Instead of creating x partitions of
equal size, partitions of varying sizes are created, so a program can be loaded into the one most
appropriate to its size.
Another method is dynamic partitioning, where the partitions are of variable length and number and are
created when the process asks for them - so, the process is allocated exactly as much memory as it
requires, eliminating internal fragmentation. It does however leave holes in memory, as processes are
swapped in and out in different places, leaving small gaps between processes. This is called external
fragmentation and could lead to a situation where there is enough free memory for a process to run, but
it is fragmented across main memory. The way to solve this is using compaction - rearranging the
programs in memory so they are contiguous and all free memory is in one block. This is very, very slow.
There are different ways of implementing allocation of partitions from a list of free holes, such as:
• first-fit: allocate the first hole that is big enough; searching can stop as soon as a large enough
hole is found
• best-fit: allocate the smallest hole that is big enough; the entire list of holes must be searched,
unless it is ordered by size
• next-fit: scan holes from the location of the last allocation and choose the next available block
that is large enough (can be implemented using a circular linked list)
First-fit is usually the best and fastest, but analysis has shown that under first-fit, fragmentation can
cause up to half of the available memory to be lost in holes (external fragmentation) or unused memory
(internal fragmentation). Best-fit has the smallest amount of external fragmentation. Next-fit can
produce external fragmentation similar to best or first, depending upon the next block found. It does use
holes at the end of memory more frequently than others perhaps leading to the largest blocks of
contiguous memory.
A common approach, especially in embedded systems, is the buddy system. Under the buddy system, the
entire space is treated as a single block of size 2^U. If a process requests memory of size s, such that
2^(U-1) < s ≤ 2^U, the entire block is allocated; otherwise the block is split into two equal buddies. This
process continues until the smallest block greater than or equal to s is generated.
At any time, the buddy system maintains a list of holes of each size 2^i. A hole on the (i + 1) list can be
split into 2 equal holes of size 2^i on the i list.
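A small sketch of the block-sizing rule that this splitting produces (the minimum block size of 32 bytes
is an arbitrary illustrative choice):

    /* Sketch of the block-sizing rule in a buddy allocator: round a request
       of s bytes up to the smallest power-of-two block that holds it
       (assuming a minimum block size of MIN_BLOCK bytes). */
    #include <stdio.h>

    #define MIN_BLOCK 32            /* illustrative smallest block size */

    static unsigned long buddy_block_size(unsigned long s)
    {
        unsigned long size = MIN_BLOCK;
        while (size < s)            /* keep doubling: ..., 2^i, 2^(i+1), ... */
            size *= 2;
        return size;
    }

    int main(void)
    {
        unsigned long requests[] = { 20, 33, 100, 1000 };
        for (int i = 0; i < 4; i++)
            printf("request %lu -> block of %lu bytes\n",
                   requests[i], buddy_block_size(requests[i]));
        return 0;
    }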
There are two main problems in implementing physical memory management - how can processes be
run in different areas of memory, given that they contain lots of addresses, and how can the OS provide
memory protection.
Memory addresses in a program must be translated into physical memory addresses, so programs are
created to be relocatable, such as ELF binaries on Linux. There are three points at which the binding of
program addresses to memory addresses can occur:
• compile time - if the memory location is known at compile time, static absolute addresses can be
compiled in; if the starting address changes, the code must be recompiled.
• load time - here, the code generated is relocatable and the code can be loaded at any address
and the final binding occurs when the program is loaded. If the memory locations change, the
code just needs to be reloaded.
• execution time - the binding is delayed until runtime if the process can be moved from one
memory segment to another during execution. This method does require hardware support.
To work around this, we have the concept of a logical address space that is bound to a separate physical
address space, which is the key to memory management. Logical addresses are the ones generated by
the CPU, and physical addresses are the ones seen by the memory unit. When using compile or load-
time address-binding schemes, the logical and physical addresses are the same, but with the execution-
time address-binding scheme, the logical (virtual) and physical addresses differ. This is accomplished using
a memory management unit.
The memory management unit (MMU) is a hardware device that maps a logical address to a physical
one, so the user program only deals with logical addresses, and never sees the physical one.
The most simple form of MMU is the relocation register. The relocation register holds the value of an
offset and then adds the offset to all logical addresses given to it to give the physical address. In this
system, logical addresses are in the range 0..max for all processes and physical addresses are in the range
R..R+max, where R is the value in the relocation register. The process only needs to generate the logical
addresses, and only sees the process running in addresses 0..max. This principle of mapping is central to
memory management.
The relocation register can provide memory protection with the addition of a limit register, which
contains the allowed range of logical addresses - each logical address must be less than the limit register.
Hardware support is obviously required for this.
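A minimal sketch of the check-and-offset performed for each logical address (the register values used
here are illustrative):

    /* Sketch of the relocation + limit register check performed by a simple
       MMU: each logical address is checked against the limit and then offset
       by the relocation register to form the physical address. */
    #include <stdio.h>
    #include <stdlib.h>

    static unsigned long relocation = 0x14000;  /* R: where the process is loaded */
    static unsigned long limit      = 0x02000;  /* allowed logical range 0..limit-1 */

    static unsigned long translate(unsigned long logical)
    {
        if (logical >= limit) {
            fprintf(stderr, "addressing error: trap to OS\n");
            exit(1);
        }
        return logical + relocation;            /* physical address */
    }

    int main(void)
    {
        printf("logical 0x100 -> physical 0x%lx\n", translate(0x100));
        return 0;
    }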
Another way of thinking of this is not thinking of a program as one large chunk of contiguous memory - a
process may be broken up into pieces that do not need to be loaded into main memory all the time. This
increases opportunities for multiprogramming as with many processes in main memory, it is more likely
a process will be in the ready state at any particular point. This is achieved by programming techniques
with minimal OS support.
So far, we have assumed that the entire program code and data must be loaded into memory for a process
to execute - this is called static loading. An alternative approach is called dynamic loading, which works
by not loading a routine until it is called. This gives better memory-space utilisation and unused routines
are never loaded. This is useful when large amounts of code are required to handle infrequently occurring
cases. If dynamic loading is implemented through program design, no special support from the OS is
required.
Dynamic loading enables a small part of the process to be loaded initially - usually around the program
entry point with part of the data and for a greater degree of multiprogramming as only parts of the
program are needed at any time in physical memory. It is usually the responsibility of the application to
ensure that the appropriate parts of the application are loaded at any time.
Overlays
These are slightly more structured: code is split into phases that are loaded over one another.
Again, the motivation is to keep only code and data in memory that is actually needed. When one phase
completes, the application arranges to overlay its memory with the next phase when required. The total
memory required is the same as the memory required by the largest phase. Some OS support may be
given to implement this.
This is useful when there is a limited amount of physical RAM, and some secondary storage is available
to hold code/data, e.g., in embedded systems.
Shared Libraries
Shared libraries can be implemented in two ways - using static or dynamic linking. In static linking, all
libraries required by the program are linked into the image (i.e., form part of the data/code of the
program), so the memory allocated must be big enough to hold the application and the libraries.
With dynamic linking, the libraries required by the program are dynamically linked at execution time
(just-in-time), so the memory allocated must be big enough to hold the application. Shared libraries are
allocated from a different area of memory.
Swapping
A process (or part of a process) can be swapped temporarily out of memory to a backing store (e.g., disk)
and then brought back into memory for continued execution. This is how the medium-term scheduler
operates. If you want to reload at a different address, execution-time binding is required.
The issue with swapping is the backing store - a fast disk large enough to accommodate copies of all
memory images for all users that must provide direct access to these memory images is required. A
major part of the swap time overhead is the transfer time. The total transfer time is directly proportional
to the amount of memory swapped.
One of the more common swapping variants is called roll-out, roll-in and it is used with priority-based
scheduling algorithms. Lower-priority processes are swapped out so higher-priority processes can be
loaded and executed.
Many different varieties of swapping can be found on many OSs, such as UNIX, Linux, Windows, etc...
Paging
The problems with using physical memory allocation with unequal, fixed or variable sized partitions are
fragmentation and inefficiencies (algorithms to track holes, and compaction). Paging and segmentation
are memory allocation techniques that largely eliminate these problems.
Many of these problems arise as memory allocation is of contiguous blocks of memory to processes.
Paging removes this restriction, as the logical address space of a process can be noncontiguous. The
physical memory space is divided into fixed-size blocks called frames, and logical memory is divided into
blocks of the same size called pages. Pages are then allocated and fit into frames.
To implement paging, we need to keep track of all free frames. To run a program of size n pages, we need
to find n free frames and load the program into them. There is a problem with internal fragmentation if a
process size is not an exact number of frames. Only part of one frame is lost to this fragmentation,
however, and if we make the frame size quite small, this isn't a significant problem.
With paging, the logical address generated by the CPU is divided into a page number p and a page
offset d. The page number is used as an index into a page table which contains the base address of each
page in physical memory and the page offset is combined with the base address to define the physical
memory address that is sent to the memory unit.
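A minimal sketch of the translation, assuming a 4 KiB page size and a toy single-level page table:

    /* Sketch of paging address translation: split the logical address into a
       page number p and offset d, look p up in the page table, and combine
       the frame base with d (page size assumed to be 4 KiB here). */
    #include <stdio.h>

    #define PAGE_SIZE 4096

    /* toy page table: page_table[p] holds the frame number for page p */
    static unsigned long page_table[] = { 5, 2, 7, 0 };

    static unsigned long translate(unsigned long logical)
    {
        unsigned long p = logical / PAGE_SIZE;   /* page number */
        unsigned long d = logical % PAGE_SIZE;   /* page offset */
        return page_table[p] * PAGE_SIZE + d;    /* physical address */
    }

    int main(void)
    {
        unsigned long logical = 1 * PAGE_SIZE + 123;   /* page 1, offset 123 */
        printf("logical %lu -> physical %lu\n", logical, translate(logical));
        return 0;
    }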
Usually, page tables are a fixed length, often the same size as a frame. If a process does not require all
entries in the page table, the redundant entries are set to be invalid. If a reference is made to an invalid
page table entry, the OS must handle a page fault and there is the potential for an erroneous logical
address to have been generated by the application process.
The page table is normally kept in main memory, so in this scheme every data/instruction access
requires two memory lookups, one for the page table and then one for the actual instruction. The two
memory access problem can be solved using a special piece of hardware called a translation look-aside
buffer (TLB).
TLBs are implemented using associative memory, which allows for a parallel search of page numbers in
order to get the frame number. The TLB is not a part of normal physical memory.
Memory protection in paging is implemented by associating a protection bit with each frame. The valid-
invalid bit is attached to each entry in the page table, where "valid" indicates that the associated page is
in a process' legal address space and is thus a legal page, whereas "invalid" indicates that the page is not
in the legal address space of the process.
Modern systems allow processes to have large logical address spaces, so the page table itself can
become very large. Hierarchical page tables can be used to break up the logical address space into
multiple page tables.
In address spaces > 32 bits, hashed page tables have become common. Here, the virtual page number is
hashed into a page table entry.
Often, having one page table per process can be expensive in terms of physical memory. The alternative
is the inverted page table, where there is one entry for each real page of memory and the entry consists
of the virtual address of the page stored in that real memory location, with information about the
process that owns that page. This decreases memory usage, but increases the time needed to search the
table when a page reference occurs. Hash tables can be used to limit the search to one, or at most a few,
page-table entries.
Segmentation
Paging separates the user's view of memory from actual physical memory. A user thinks of the program
as a collection of segments. A segment is a logical unit, such as: the main program, procedures,
functions, methods, objects, local variables, global variables, common blocks, the stack, the symbol table
and arrays. Segmentation is a memory management scheme that supports a user view of memory.
In segmentation, you have a logical address consisting of a segment number and an offset, and a segment
table that maps this two-dimensional logical address onto a one-dimensional physical address. Each table
entry has a base that contains the
starting physical address where segments reside in memory and the limit that specifies the length of the
segment.
With each entry in the segment table, a validation bit (indicating whether or not the segment is legal)
and read/write/execute privileges are associated. Since segments can vary in
length, memory allocation is a dynamic storage-allocation problem, which can lead to external
fragmentation.
Segmented Paging
Both paging and segmentation have advantages and disadvantages and most successful processor
families are moving towards a mixture of segmentation and paging. Segmented paging usually uses a
segment table, a multi-level page table and an offset.
Virtual Memory
Virtual memory is the separation of user logical memory from physical memory. Only part of the
program needs to be in memory for execution (i.e., only the part needed for current execution - follows
from the principle of locality, where programs and data references inside a process tend to cluster and
only a few pieces of a process will be needed in any period of time). The major advantages of virtual
memory are that the logical address space can be much larger than the physical address space and more
processes can be partly in physical memory at any time.
Virtual memory can be much larger than physical memory, with n pages of virtual memory mapped
to m pages of physical memory and other pages on secondary storage.
Virtual memory can be implemented assuming a virtual address space of 0..max, where max is usually a
power of 2, e.g., 2^32 bytes (4 GiB) in 32-bit systems. The process can be loaded anywhere in
physical memory and often the physical memory size is significantly smaller than the virtual address
space of an individual process.
Physical memory nowadays is very large, so virtual memory is only useful for large data sets, and for
compilation models.
Demand Paging
Demand paging brings a page into memory only when it is needed. This means there is less I/O, less
memory, a faster response time and more multiprogramming.
A page is needed when there is a reference to it (i.e., an instruction/data address). An invalid address will
lead to an abort, and a not-in-memory request will lead to a fetch to memory - this is called a page fault.
When the first reference to a page occurs, the page will not be in memory under this system, so a page
fault occurs. This is a non-maskable interrupt (it can't be masked by the user or OS) and it is
synchronous: the OS needs to know which running process caused the interrupt. This allows for the OS to
bring the appropriate page into main memory and for it to detect invalid memory references - the OS
contains a range of valid addresses for a process. The process often does not need the full range of
virtual memory, hence code size, data size and stack size can be used to check if a memory reference is
valid. This information is stored on a per process basis in the TCB.
When a page fault occurs, the OS looks at the internal table to decide if the process is permitted to make
the memory reference and if it is, it finds an empty frame and swaps the page into that frame, updates
the page table and the internal OS tables and continues the instruction that caused the page fault - as
with other standard interrupts.
Pure demand paging is where you start executing a program with no pages loaded into memory. This will
trigger a page fault immediately, however.
Demand paging is more complex when a TLB is involved. When a virtual address is given, the processor
consults the TLB, and if a TLB hit occurs, the real address is returned and main memory is consulted in
the usual manner. If the page table entry is not found in the TLB (a miss), the page number is used to
index the process page table, and then a check occurs to see if the page is in main memory. If it's not, a
page fault is issued and the non-TLB process as above is followed, with the TLB being updated as well as
the page table.
Page Replacement
Page replacement is needed when there are no free physical memory frames.
When we want to load a page into memory, we try to find a free frame to put the page into. If there is no
free frame, then we use a page replacement algorithm to select a victim frame, and then the victim is
copied back onto disk and the new page is loaded into its place and the page and frame tables are
updated.
The aim for designing a page replacement algorithm is to give the lowest page-fault rate. An algorithm is
evaluated by running a particular string of memory references and computing the number of page faults
on that string. Ideally, the number of page faults falls as the number of frames allocated increases.
Some algorithms, however, give more page faults when the number of frames is increased - this is called
Belady's Anomaly.
The simplest page replacement algorithm is the first-in-first-out method, which can be implemented
using a circular linked list.
Another algorithm is the least recently used (LRU) algorithm, which replaces the page that has not been
referenced for the longest time. By the principle of locality, this should be the page that is least likely to
be referenced in the future. Each page could be tagged with the time of last reference, but this would
require a great deal of overhead. LRU can be easily implemented using a stack of page numbers; when a
page is referenced, it is moved or placed on the top of the stack, and when looking for something for a
replacement, the bottom of the stack is all that you need.
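A small sketch of LRU over a reference string, using last-used timestamps rather than an explicit stack
(the reference string and frame count are illustrative):

    /* Sketch of LRU page replacement over a reference string: each frame
       records when its page was last used, and the victim is the frame with
       the oldest timestamp. */
    #include <stdio.h>

    #define NFRAMES 3

    int main(void)
    {
        int refs[] = { 7, 0, 1, 2, 0, 3, 0, 4, 2, 3 };
        int nrefs = sizeof refs / sizeof refs[0];
        int frames[NFRAMES], last_used[NFRAMES];
        int faults = 0;

        for (int i = 0; i < NFRAMES; i++) {
            frames[i] = -1;                  /* -1 marks an empty frame */
            last_used[i] = -1;               /* empty frames are picked first */
        }

        for (int t = 0; t < nrefs; t++) {
            int hit = -1;
            for (int i = 0; i < NFRAMES; i++)
                if (frames[i] == refs[t])
                    hit = i;

            if (hit >= 0) {
                last_used[hit] = t;          /* page already resident: touch it */
                continue;
            }

            faults++;                        /* page fault: pick the LRU victim */
            int victim = 0;
            for (int i = 1; i < NFRAMES; i++)
                if (last_used[i] < last_used[victim])
                    victim = i;
            frames[victim] = refs[t];
            last_used[victim] = t;
        }

        printf("page faults: %d\n", faults);
        return 0;
    }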
The second chance algorithm uses a reference bit, which is initially 0 and is set to 1 when the page is
accessed. The second chance algorithm uses a circular list of pages. When searching for a page to
replace, if a page's reference bit is 1, it is set to 0 and the page is left in memory, and the search
moves on to the next page in circular order subject to the same rules; a page whose reference bit is
already 0 is replaced.
We can also consider counting algorithms, where a counter is kept of the number of references that have
been made to each page. The least frequently used (LFU) algorithm replaces the page with the smallest
count. The most frequently used (MFU) algorithm is based on the argument that the page with the
smallest count was probably just brought in and has yet to be used.
The optimal algorithm replaces the page that will not be used for the longest period of time. It is
impossible to have perfect knowledge of future events, but if you already know what's happening (as
with a reference string), you can use this algorithm to measure how well your algorithm performs.
The OS may wish some frames to remain in physical memory frames, such as parts of the OS itself,
control structures, I/O buffers, etc..., which is why we can use page and frame locking. When a frame is
locked, it may not be replaced. This usually requires a lock bit to be added with each frame, and is
usually supported in the OS, not hardware.
Each process requires a minimum number of pages. If indirect referencing is used, each reference
requires a page, so each level of indirection increases the number of pages required.
Frame Allocation
Again, there are different methods of allocating frames. If there are 100 frames and 5 processes, each
process can be given 20 frames. This is called equal allocation. A different way is called proportional
allocation, where allocation is given according to the size of the process: a_i, the allocation for
process p_i, is a_i = (s_i / S) × m, where s_i is the size of process p_i, S = Σ s_i and m is the total
number of frames.
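For example (illustrative figures), with m = 62 free frames and two processes of sizes s1 = 10 pages and
s2 = 127 pages, S = 137, so p1 receives 10/137 × 62 ≈ 4 frames and p2 receives 127/137 × 62 ≈ 57 frames.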
A variation of the proportional allocation scheme is the priority allocation scheme, where the
proportions are computed based on priorities, rather than size (a higher priority process will execute
quicker if it has more frames).
If a process p_i develops a page fault, you can either select a replacement from one of its own frames or
a replacement from a frame of a lower-priority process. This brings us to the question of global or local
replacement. Global replacement will select a replacement frame from the set of all frames, so one
process can take a frame from another. In priority based allocation schemes, this may allow higher
priority processes to increase their allocation of frames by taking a frame from a low priority process.
Problems arise as the page fault behaviour of one process will also depend on the behaviour of other
processes.
In local replacement, each process selects from only its own set of allocated frames, so the process' page
fault behaviour is entirely dependent upon itself, and not upon other processes. The process execution
time becomes more consistent across multiple executions.
Thrashing
If a process does not have enough frames allocated, the page fault rate can get very high, so a lot of
swapping back and forth between main memory and secondary storage is done, and the result is
thrashing, which is where more work is done by the OS swapping and paging than is done by the
application in doing something useful. This has severe performance implications.
A common form of thrashing arises from the OS monitoring CPU utilisation: when many page faults occur,
utilisation decreases. As utilisation decreases, the OS increases the degree of multiprogramming to try
and raise it, which in turn increases the page faults and decreases utilisation further...
Thrashing occurs under global frame replacement more as frames can be stolen from other (active)
processes. Local frame replacement limits thrashing to within individual processes. If a process starts
thrashing, it cannot steal frames from another process and cause it to start thrashing.
To prevent thrashing, a process must be allocated a sufficient number of frames. We can predict a
sufficient number of frames using the locality model. The diagram below shows how memory accesses change
over execution of a program.
Thrashing will occur when the size of the locality > memory (pages) allocated to the process.
The working set model assumes locality - it defines a per-process working set window which contains the
most recent page references for a process. If a page is in active use, it will have been accessed in the
working set window, and will be in the working set. If a page is not being used, it will drop from the
working set sometime after its last reference. The working set approximates the locality of a process.
Formally, we have Δ, the working-set window width in time, and WSS_i, the working set size of process
P_i, which is the total number of pages accessed in the most recent Δ. The working set will vary with
time; if Δ is too small, it will not cover the entire locality, and if Δ is too large, it will cover
several localities. If Δ = ∞, it will cover the entire program. The most important property is the total
size of the working sets in the system, D = Σ WSS_i. If D > m (the total number of frames), thrashing
will occur, and a process must be suspended.
The working set can be implemented with an interval timer and reference bits. For example, if Δ =
10,000 cycles, each page has bits to indicate whether it has been accessed in the last 5000, 10,000
or 15,000 cycles. On a page access, the 5000 bit is set, and a timer interrupt every 5000 cycles shifts all
the bits along one, with the 5000 reference bit set to 0. If any of a page's bits is 1, the page is in
the working set.
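A small sketch of this ageing scheme, using three history bits per page (the page count, bit width and
access pattern are illustrative):

    /* Sketch of the reference-bit approximation of the working set: each page
       keeps a small history of reference bits that is shifted right on every
       timer interrupt; a page whose history is non-zero is in the working set. */
    #include <stdio.h>

    #define NPAGES 4

    static unsigned char history[NPAGES];   /* 3 bits of history per page */

    static void on_page_access(int page)
    {
        history[page] |= 0x4;               /* set the "current interval" bit */
    }

    static void on_timer_interrupt(void)
    {
        for (int p = 0; p < NPAGES; p++)
            history[p] >>= 1;               /* age every page by one interval */
    }

    static int in_working_set(int page)
    {
        return history[page] != 0;          /* referenced in a recent interval */
    }

    int main(void)
    {
        on_page_access(1);
        on_timer_interrupt();               /* page 1 is still in the working set */
        on_timer_interrupt();
        on_timer_interrupt();               /* after 3 intervals it has aged out */
        printf("page 1 in working set: %d\n", in_working_set(1));
        return 0;
    }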
This is not entirely accurate, as you cannot tell within an interval of 5000 cycles when a page reference
actually occurred. You can improve this by interrupting more frequently and increasing the number of reference
bits, but this has a greater overhead and reduces the time available to do useful work.
The working set model is quite clumsy and expensive to implement, and page-fault frequency is a more
direct approach. An "acceptable" page-fault rate is defined with upper and lower bounds: if a process's
page-fault rate falls below the lower bound, the process loses a frame, and if the page-fault rate rises
above the upper bound, the process gains a frame.
Inter-Process Communication
We need processes to co-operate to carry out particular jobs, e.g., one process requests a service from
another (such as requesting an I/O operation from the kernel, as the kernel is still a process), pipelined
processes working together (one process can't proceed until the previous one has finished) and where
multiple processes work on the same task, for example one might read from disk to memory, and
another reads from memory to the audio device. This can be accomplished using inter process
communication (IPC).
Independent processes cannot affect or be affected by the execution of another process (all processes
are in a separate protected memory address space). Co-operating processes can affect or be affected by
the execution of another process, via some form of IPC mechanism.
The OS must solve some of the problems that IPC causes, however. It must provide mechanisms to
control interaction between the processes and must order accesses to shared resources by processes. The
OS must also provide mechanisms by which processes can wait until a resource is available.
A paradigm for co-operating processes is the producer-consumer problem. A producer process produces
information that is consumed by a consumer process. There are two models which can be used here,
one is the unbounded-buffer, which places no practical limit on the size of the buffer, and the buffer is
just expanded as the producer fills it and the other is the bounded-buffer, which assumes there is a fixed
buffer size and the producer must wait for the consumer to remove data if the buffer is full.
One way to implement the bounded-buffer solution is to use a circular array/list, but this only works if
the processes can share memory. Another implementation that maximises the use of the buffer is to use
an integer called a counter which is then incremented when a producer writes to the buffer and
decremented when a consumer reads from it. These operations must be performed atomically, which will not be the case if a context switch occurs part-way through an increment or decrement.
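As a sketch (the names and sizes are illustrative), the shared buffer and counter might look like this in C; the counter++ and counter-- lines are exactly the non-atomic operations that lead to the race condition described next:

    #define BUFFER_SIZE 8

    int buffer[BUFFER_SIZE];
    int in = 0, out = 0;      /* next free slot / next full slot */
    int counter = 0;          /* number of items currently in the buffer */

    void producer(int item)
    {
        while (counter == BUFFER_SIZE)
            ;                 /* buffer full: busy-wait */
        buffer[in] = item;
        in = (in + 1) % BUFFER_SIZE;
        counter++;            /* NOT atomic: load, add, store */
    }

    int consumer(void)
    {
        while (counter == 0)
            ;                 /* buffer empty: busy-wait */
        int item = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
        counter--;            /* NOT atomic: a context switch here races with the producer */
        return item;
    }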
We now have a race condition, which is the situation where several processes access and manipulate
shared data concurrently and the final value of the shared data depends on which process finishes last,
so the data may become inconsistent. To prevent race conditions and to ensure consistent data,
concurrent processes must be synchronised. This requires mechanisms to ensure the orderly execution
of co-operating processes.
Critical Sections
Here, we can use critical sections. If we have n processes competing to use some shared data, each
process has a code segment called the critical section, in which the shared data is accessed. The problem
is to ensure that when one process is executing in its critical section, no other process is allowed to
execute in its critical section. A solution to this problem must satisfy three requirements:
• Mutual exclusion (mutexes) - if a process is executing in its critical section, then no other processes can execute in their critical sections
• Progress - if no process is executing in its critical section and there exist some processes that wish to enter their critical sections, then the selection of the processes that will enter their critical sections next cannot be postponed indefinitely
• Bounded waiting - a bound must exist on the number of times that other processes are allowed
to enter their critical sections after a process has made a request to enter its critical section and
before that request is granted
One way to accomplish this is to use operations such as EnterCriticalSection() and LeaveCriticalSection().
Processes may share some common variables to synchronise their actions. For example, we could have an integer variable called turn: if its value matches the value associated with a process, that process can enter its critical section, and on leaving it updates turn to pass the turn on. This satisfies the mutual exclusion requirement, but not progress.
A second algorithm is to use guard flags, which consists of an array of flags. When a process is ready to enter its critical section, it sets its flag to true, waits until no other process has its flag set, executes its critical section, and then sets its flag back to false. This again satisfies mutual exclusion, but not progress.
The final algorithm is a mixture of turn and flags: a process sets its flag, gives the turn to the other process, and then waits until either the other process is not interested or it is its own turn. This meets all three requirements, and solves the critical section problem for two processes.
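The two-process turn-and-flag algorithm described here is essentially Peterson's algorithm. A sketch in C, where i is 0 or 1 and the shared variables would live in memory visible to both processes (on modern hardware memory barriers would also be needed, which are omitted here):

    #include <stdbool.h>

    /* Shared between the two processes/threads. */
    volatile bool flag[2] = { false, false };  /* flag[i]: process i wants to enter */
    volatile int turn = 0;                     /* whose turn it is to defer to */

    void enter_critical_section(int i)
    {
        int other = 1 - i;
        flag[i] = true;          /* announce intent */
        turn = other;            /* give priority to the other process */
        while (flag[other] && turn == other)
            ;                    /* busy-wait until it is safe to enter */
    }

    void leave_critical_section(int i)
    {
        flag[i] = false;
    }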
However, we need to solve the problem for n processes, and one way of implementing this is to use the
Bakery (or Lamport) algorithm. Before entering its critical section, a process receives a number, and the
holder of the smallest number enters its critical section. If processes i and j receive the same number,
then if i < j, then i is served first, else j is served first. The number scheme used always generates
numbers in an increasing order of enumeration.
As in many aspects of software, hardware can make programming easier. Hardware exists that can solve
the critical section problem. If an operation TestAndSet existed, which could test and modify the
contents of a word atomically, mutual exclusion could be implemented. This does not satisfy the
bounded wait requirement, however.
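A sketch of a spinlock built on such an instruction - here the atomic operation is stood in for by the GCC/Clang builtin __sync_lock_test_and_set, since the notes do not tie TestAndSet to a particular instruction set:

    #include <stdbool.h>

    volatile bool lock = false;   /* shared: false = free, true = held */

    /* Stand-in for the atomic hardware instruction: returns the old value
     * and sets the word to true, as one indivisible operation. */
    bool test_and_set(volatile bool *target)
    {
        return __sync_lock_test_and_set(target, true);  /* GCC/Clang builtin */
    }

    void acquire(void)
    {
        while (test_and_set(&lock))
            ;                     /* busy-wait (spin) until the lock was free */
    }

    void release(void)
    {
        lock = false;             /* ideally __sync_lock_release(&lock) */
    }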
The software algorithms discussed above for exclusion of processes from critical sections are
problematic as they rely on the continuous testing of variables (busy waiting), which is inefficient and
wasteful of CPU cycles. The protocols are complex and unenforceable, relying on programmer discipline
and co-operation. The OS implements efficient IPC mechanisms instead.
Signals
POSIX/Linux signals are a dataless communication either between processes, or between the OS and a
process. Signals can be thought of as software interrupts and require a handler within a process. When
the event that causes the signal occurs, the signal is said to be generated and when the appropriate
action for the process in response to the signal is taken, the signal is said to be delivered. In the interim,
the signal is said to be pending. Signals consist of a number of types, such as system defined (normally
up to 31) corresponding to a particular event, or user-defined (normally integers > 31). Some of these signals are described below.
Synchronous signals occur at the same place, every time in the program (e.g., SIGFPE, SIGILL, SIGSEGV)
and are caused by the program itself. Asynchronous signals may occur at any time from a source other
than the process (another process e.g., SIGKILL, a terminal driver on a key press e.g., SIGINT on Ctrl+C,
on an alarm clock, e.g., SIGALRM) and a process can choose how to handle the specific signal.
A signal can be treated in the following ways: it can be ignored (except for SIGKILL); the default action set by the OS can be executed (the default action depends on the signal, and may be terminating the process, suspending the process, terminating the process and saving a memory image (a core dump), continuing a suspended process, or simply ignoring the signal); or a specific signal-catching function can be invoked.
• kill(pid, signal) - send a signal of value signal to the process with ID pid.
• sigaction(handler, signal) - installs a handler for a specific signal (handlers are specific to each
process)
• sigprocmask(..., mask, ...) - signals can be blocked or masked in much the same manner as
hardware interrupts can be masked by the OS
• sigpending(mask) - signals that are raised whilst they are masked out can be examined
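The real prototypes differ slightly from the simplified forms above (for example, sigaction() takes the signal number and a struct describing the handler). A minimal example that catches SIGINT:

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_sigint = 0;

    static void handle_sigint(int signum)
    {
        (void)signum;
        got_sigint = 1;               /* only async-signal-safe work in a handler */
    }

    int main(void)
    {
        struct sigaction sa = { 0 };
        sa.sa_handler = handle_sigint;
        sigemptyset(&sa.sa_mask);     /* block no extra signals during delivery */
        sigaction(SIGINT, &sa, NULL); /* install the handler for SIGINT (Ctrl+C) */

        while (!got_sigint)
            pause();                  /* sleep until any signal is delivered */

        printf("caught SIGINT, exiting\n");
        return 0;
    }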
Semaphores
Semaphores are "a non-negative integer, which apart from initialisation can only be acted on by two
standard, atomic, uninterruptible operations, WAIT (which decrements the value of the semaphore) and
SIGNAL (which increments the value)" - E. Dijkstra
You must guarantee that a maximum of one process can execute a WAIT or SIGNAL operation on the
same semaphore at the same time (atomicity).
There can be two types of semaphore, a counting semaphore (the integer value can range over an
unrestricted domain) and a binary semaphore, where the binary value can range only between 0 and 1.
This can be simpler to implement.
Semaphores can be used to implement mutual exclusion for the critical section of n processes. The
shared data is a binary-semaphore mutex which is WAITed on when a program wants to enter the critical
section and then SIGNALed when the critical section completes.
Semaphores can also be used to implement synchronisation. For example, B can only be executed in
program j only once A has executed in program i. When a semaphore is used to indicate a condition, it is
called a condition variable and synchronisation using a condition variable is called condition
synchronisation.
This can introduce two problems, however: starvation, which is indefinite blocking, where a process may never be removed from the queue in which it is suspended; and deadlock, where two or more processes are waiting indefinitely for an event that can be caused by only one of the waiting processes.
Counting semaphores can be used for resource management. If you consider a resource
with m allocatable elements, the counting semaphore is set to m and then users can wait for a resource
by waiting on the semaphore and then free up a resource by signalling a semaphore.
The classic problems of synchronisation, such as the bounded-buffer problem, the readers and writers
problem and the dining-philosophers problem can be solved using semaphores and mutexes.
OS-level semaphores ensure mutual exclusion over a shared resource only if the application is disciplined. For example, a process should only access shared data while it is holding the appropriate semaphores, and it should issue a signal to release the semaphores when it finishes the critical section.
Semaphore operations still require atomicity. If an operation is in the OS, it can be accessed by a process using a system call. When it is in the OS, the call will execute atomically, or at least appear atomic from the user's point of view, even if interrupts are enabled. True atomicity can only be achieved with hardware support, such as the TestAndSet instruction discussed earlier.
POSIX semaphores support two interfaces (Linux does not support these fully). Unnamed semaphores
are used within a process (or between related processes) and are created by a call to sem_init(...) by a
process. Operations include sem_wait(...) for a wait and sem_post(...) for a signal. sem_trywait(...) also
exists, which works like wait if the semaphore has a greater count than 0, but doesn't wait if the
semaphore is 0, returning immediately to the calling process. These semaphores are destroyed by a call
to sem_destroy(...), or if the creating process dies.
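A minimal sketch of an unnamed semaphore used as a binary semaphore (mutex) between two threads - the shared counter and iteration count are illustrative, and on Linux this would be compiled with -pthread:

    #include <semaphore.h>
    #include <pthread.h>
    #include <stdio.h>

    static sem_t mutex;            /* binary semaphore protecting the counter */
    static long counter = 0;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            sem_wait(&mutex);      /* WAIT: decrement, blocking if already 0 */
            counter++;             /* critical section */
            sem_post(&mutex);      /* SIGNAL: increment, waking a waiter */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        sem_init(&mutex, 0, 1);    /* 0 = shared between threads, initial value 1 */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        sem_destroy(&mutex);
        printf("counter = %ld\n", counter);  /* 200000 with correct exclusion */
        return 0;
    }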
Named semaphores can be used between any processes. The namespace is the same as the filesystem (so the name of a semaphore is effectively a filename). These are created (or opened) using sem_open(...) and can then be manipulated as above. sem_close(...) removes the link between a process and a semaphore, and named semaphores can live on past the termination of the creating process, so must be explicitly removed using sem_unlink(...). Here, the OS merely stores a list of processes waiting on a particular semaphore (i.e., processes that are blocked on a semaphore), but hardware support is still required for atomicity.
Linux implements SysV semaphores (i.e., semaphores as they existed in System V UNIX), which are not POSIX compliant. They are system-wide resources, managed by the OS, and any process can access a semaphore, which is stored in the semaphore table of the OS. Allocation occurs via semget(...), which creates a set of semaphores (possibly a set of one) or associates the caller with an existing set. All operations on semaphores occur using semop(semaphore, op, ...), and semctl(...) allows you to examine or change the value of a semaphore (e.g., to initialise it), as well as telling you how many processes are waiting on the semaphore and allowing you to destroy it.
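A sketch of the SysV calls for a single binary semaphore - the key is made up, error handling is omitted, and (as the Linux man pages note) the caller must define union semun itself:

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    /* The caller must define this union for semctl(). */
    union semun { int val; struct semid_ds *buf; unsigned short *array; };

    int main(void)
    {
        /* Create (or attach to) a set containing one semaphore. */
        int semid = semget((key_t)0x1234, 1, IPC_CREAT | 0600);

        /* Initialise semaphore 0 in the set to 1 (i.e., unlocked). */
        union semun arg = { .val = 1 };
        semctl(semid, 0, SETVAL, arg);

        /* WAIT: subtract 1, blocking if the value would go below 0. */
        struct sembuf wait_op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
        semop(semid, &wait_op, 1);

        /* ... critical section ... */

        /* SIGNAL: add 1, waking a blocked process if there is one. */
        struct sembuf signal_op = { .sem_num = 0, .sem_op = +1, .sem_flg = 0 };
        semop(semid, &signal_op, 1);

        /* Remove the semaphore set when it is no longer needed. */
        semctl(semid, 0, IPC_RMID);
        return 0;
    }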
Semaphores can also be used to provide mutual exclusion between threads, for example via the Linux pthreads library, which provides mutexes. pthread_mutex_init() initialises a mutex, pthread_mutex_lock() obtains a lock on a mutex and pthread_mutex_unlock() gives up the lock. The semaphores between pthreads are supported by the standard Linux kernel semaphores.
Condition synchronisation can also be used, where a condition variable enables threads to atomically block and test a condition under the protection of mutual exclusion, until the condition is true. If the condition is false, the thread blocks on the variable and automatically releases the mutex. If another thread changes the condition, it may wake up waiting threads by signalling the associated condition variable. Waiting threads will wake up, reacquire the mutex and re-evaluate the condition. pthread_cond_init() initialises a condition variable, which is used together with a mutex initialised by pthread_mutex_init(). pthread_cond_wait() takes a condition variable and a mutex as parameters; pthread_cond_signal() takes just the condition variable.
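A sketch of the wait/signal pattern with pthreads, using the static initialisers rather than the _init() calls; the data_ready flag is a made-up condition:

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool data_ready = false;

    void *consumer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!data_ready)                  /* always re-test: wakeups can be spurious */
            pthread_cond_wait(&cond, &lock); /* atomically releases lock and blocks */
        /* ... consume the data under the protection of lock ... */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    void *producer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        data_ready = true;                   /* change the condition */
        pthread_cond_signal(&cond);          /* wake one waiting thread */
        pthread_mutex_unlock(&lock);
        return NULL;
    }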
Using the low-level IPC mechanisms mentioned above relies on programmer discipline for the correct
use of semaphores to wait and signal in the correct order at the correct point in the algorithm.
Additionally, it requires an agreement between communicators (i.e., between the reader and the writer)
on the precise protocol to be used, and problems can be encountered if the writer assumes writers have
priority and readers assume that the readers have priority over the shared data. Sometimes, hardware
support is required to provide atomicity (TestAndSet, interrupt disabling, etc).
High-level IPC provides a structured IPC less susceptible to programmer error. It incorporates both the
means of shared data control and the shared data itself, as opposed to low level IPC, such as
semaphores, where the shared data control is distinct from the shared data (note, the POSIX
implementation assumes that the shared data in a file could be accessed from processes without waiting
for the appropriate semaphore).
Message Passing
Message passing is a mechanism for processes to communicate and synchronise their actions. In a
message system, processes communicate with each other without resorting to shared variables. If P and
Q wish to communicate, they need to establish a communication link between them and exchange
messages via send/receive.
This does require OS support and a number of implementation issues must be resolved - where are
messages stored whilst they are in transit, how many processes can join in with a communication, and is
the message size fixed or variable, is the link uni or bi-directional, synchronous or asynchronous, etc...
With message passing, processes must name each other explicitly, such as: send(P, message), to send a
message to process P and receive(Q, message) to receive a message from process Q.
With this method, links are established automatically and a link is associated with exactly one pair of communicating processes; between each pair there exists exactly one link. The link may be unidirectional, but it is usually bidirectional.
A method of indirect communication involves messages being sent to and received from mailboxes. Each mailbox has a unique ID and processes can communicate only if they share a mailbox. This introduces the problem of deciding who gets a message if a mailbox is shared, however. The primitives in this method are send(A, message) and receive(A, message), which send and receive messages via mailbox A.
In this case, a link is only established if processes share a common mailbox, and a link may be associated
with many processes. Each pair of processes may share several communication links and the links may
be uni- or bi-directional.
Mailboxes can be owned by either the OS or a process. If a mailbox is OS owned (in the OS address space), it is independent and not attached to any process, so operations and system calls are required for a process to create a new mailbox, send and receive messages through the mailbox, and delete mailboxes. If a mailbox is process owned (in the address space of a process), the owning process receives messages from the mailbox and other processes send messages to it. An operation or system call is still required to write to the mailbox.
The link (direct communication) or mailbox (indirect communication) will contain messages that have been sent but not yet received. This is called a buffer, and it can be implemented in more than one way:
• Zero capacity - 0 messages are held, and the sender must wait for the receiver
• Bounded capacity - here there is a buffer with a finite length of n messages and the sender must
wait if the link is full
Client-Server Communication
When the processes are on different machines, network communication is required. Here, three forms of
IPC are available: sockets, remote procedure calls (RPCs) and remote method invocation (Java). These
will be discussed in the NDS course.
In POSIX, message queues between processes are used. Messages are stored in FIFO order, and a queue can be set up so that a process sends or receives blocking (on a full or empty queue respectively), or sends or receives non-blocking. By default, a process reading from an empty queue will block.
In POSIX, the process creates or connects to an existing one using mq_open(name, ...), where name is a
filename in the same namespace as POSIX semaphores. Sending and receiving messages occur
using mq_send(...) and mq_receive(...), and a process can close its connection to a message queue with mq_close(...); however, the message queue will still exist after this call. To destroy a message queue, mq_unlink(...) is used, and the queue will be destroyed once all processes attached to the queue have closed their connections.
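A sketch of a single process creating a queue, sending one message and receiving it back - the queue name and message sizes are illustrative, and on Linux this is linked with -lrt:

    #include <mqueue.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    #define QUEUE_NAME "/demo_queue"   /* illustrative name, same namespace as named semaphores */

    int main(void)
    {
        struct mq_attr attr = { .mq_flags = 0, .mq_maxmsg = 10, .mq_msgsize = 128, .mq_curmsgs = 0 };

        /* Create the queue if it does not exist, and open it for reading and writing. */
        mqd_t mq = mq_open(QUEUE_NAME, O_CREAT | O_RDWR, 0600, &attr);

        /* Send one message with priority 0; blocks if the queue is full. */
        const char *msg = "hello";
        mq_send(mq, msg, strlen(msg) + 1, 0);

        /* Receive; the buffer must be at least mq_msgsize bytes. Blocks if the queue is empty. */
        char buf[128];
        unsigned prio;
        mq_receive(mq, buf, sizeof(buf), &prio);
        printf("received: %s (priority %u)\n", buf, prio);

        mq_close(mq);            /* drop this process's connection */
        mq_unlink(QUEUE_NAME);   /* destroy the queue once everyone has closed it */
        return 0;
    }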
The following high-level constructs (critical regions and monitors) are not implemented at the OS level, but usually as part of the programming language. There is no atomicity at the application level.
Critical regions are a high-level synchronisation construct. A shared variable v of type T is declared as v :
shared T. The variable v is only accessible inside the statement region v when B do S, where B is a
boolean expression, which means that when statement S is being executed, no other process can access
variable v.
Regions referring to the same shared variable exclude each other in time (i.e., mutual exclusion is
enforced). When a process tries to execute the region statement, the boolean expression B is evaluated.
If B is true, S is executed, but if B is false, the process is delayed until B becomes true and no other
process is in the region associated with v.
Critical regions are implemented using semaphores. With the shared variable x, the following are
associated: semaphore mutex, first-delay, second-delay; int first-count, second-count;. Mutually
exclusive access to the critical section is provided by mutex. If a process can not enter the critical section
because B is false, it initially waits on first-delay and is moved to second-delay before it is allowed to re-
evaluate B. A count of the number of processes waiting on first-delay and second-delay is kept using first-count and second-count respectively.
This algorithm assumes a FIFO ordering in the queueing of processes for a semaphore. For an arbitrary
queueing discipline, a more complicated implementation is required.
Monitors are high-level synchronisation constructs that allows the safe sharing of an abstract data type
among concurrent processes. The monitor encapsulates shared data together with procedures that act
upon that data. Only the procedures can be accessed from outside the monitor; the data inside the
monitor can not be directly accessed from outside the monitor. Only one process can be active inside a
monitor at a time.
To allow a process to wait within a monitor, a condition variable must be declared as: condition x, y;. The
condition variable can only be used with the operations wait and signal. x.wait() means that the process
invoking the operation is suspended until another process invokes x.signal(), which resumes exactly one
suspended process. If no process is suspended, then the signal operation has no effect.
A conditional-wait construct also exists: x.wait(c), where c is an integer expression evaluated when the wait operation is executed. The value is a priority number stored with the name of the process that is suspended. When x.signal() is executed, the process with the smallest associated priority number is resumed next.
However, monitors still rely on disciplined use:
• User processes must always make their calls on the monitor in a correct sequence.
• You must ensure that an uncooperative process does not ignore the mutual-exclusion gateway provided by the monitor and try to access the shared resource directly, without using the access protocols.
Deadlocks
Deadlocks occur when a set of blocked processes are each holding a resource and waiting to acquire a
resource held by another process in the set. Deadlocks can be characterised by the bridge-crossing
problem:
There is only one lane across the bridge. We can view each section of the bridge as a resource, and if a
deadlock occurs, it can be resolved if one car backs up (pre-empt resources and rollback), but sometimes
that will require the cars following it to be backed up also. Starvation can also occur, if only cars from one
direction are allowed to pass.
Deadlock can arise if the following four conditions hold simultaneously:
1. Mutual exclusion: only one process at a time can use the resource
2. Hold and wait: a process holding at least one resource is waiting to acquire additional resources held by other processes
3. No preemption: a resource can only be released voluntarily by the process holding it, after the process has completed its task
4. Circular wait: there exists a set {P0, P1, ..., Pn} of waiting processes such that P0 is waiting for a resource that is held by P1, P1 is waiting for a resource that is held by P2, ..., Pn-1 is waiting for a resource that is held by Pn, and Pn is waiting for a resource that is held by P0
We could plot a resource allocation graph which consists of a set of vertices V and a set of edges E, such
that V is partitioned into two types, P = {P1, P2, ..., Pn}, the set of all processes in the system and R = {R1,
R2, ..., Rm}, the set of all resource types in the system. A request is represented by a directed edge Pi → Rj; if the resource cannot be assigned immediately, the requesting process is blocked. Resource use (assignment) is indicated by a directed edge Rj → Pi, that is, the resource is allocated to the process. When a resource is released, the corresponding edge is removed.
The above resource allocation graph has no deadlocks, as there are no circularities. P3 has all the
resources it needs; when it frees R3 then P2 can proceed; when P2 releases R1, then P1 can proceed.
The above resource allocation graph does have a deadlock, however. Mutual exclusion holds over all of
the resources, a number of processes hold resources and are waiting for others and no process wants to
release a resource. P3 holds R3 and wants R2; P2 holds R1 and R2 and wants R3 and P1 holds R2 and wants
R1. This is a circular wait, and we have all the conditions for deadlock.
The following basic rules can be applied: if the graph contains no cycles, no deadlock has occurred; if the graph contains a cycle and there is only one instance per resource type, then deadlock has occurred; if the graph contains a cycle but there are several instances per resource type, there is only the possibility of deadlock.
Handling Deadlocks
The three basic methods for handling deadlocks are prevention/avoidance (ensuring that the system will never enter a deadlock state), detection and recovery (allowing the system to enter a deadlock state and then recover from it) and the head-in-the-sand approach (ignoring the problem and pretending that deadlocks never occur - this is the approach used by most OSs, including UNIX, Linux and Windows).
To implement prevention, we need to ensure that one or more of the deadlock conditions do not hold:
• Mutual exclusion is not required for sharable resources, but it must hold for non-sharable
resources.
• Hold and wait must guarantee that whenever a process requests a resource, it does not hold any
other resources. The process must request and be allocated all of its resources before it begins
execution, or the process can only request resources when the process currently has none. This
does have the possibility of low resource utilisation and starvation.
• No preemption - If a process that is holding some resources requests another resource that can
not be immediately allocated to it, then all resources currently being held are released.
Preempted resources are added to the list of resources for which the process is waiting and the
process will only be restarted when it can regain its old resources, as well as the new ones it is
requesting.
• Circular wait - impose a total ordering of all resource types and require that each process
requests resources in an increasing order of enumeration.
Deadlock-prevention algorithms prevent deadlock, so deadlock will never occur as long as (at least one of) the restrictions is enforced. But there is a cost to this: restraining how resource requests are made has the potential for low device utilisation and reduced system throughput.
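In practice, the circular-wait rule is often applied simply by numbering the locks and always acquiring them in the same order. A sketch with two pthread mutexes (the resource names are illustrative):

    #include <pthread.h>

    /* Resources numbered by a global ordering: A = 1, B = 2. */
    static pthread_mutex_t resource_a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t resource_b = PTHREAD_MUTEX_INITIALIZER;

    /* Every thread that needs both resources acquires them in increasing
     * order (A before B), so a cycle in the wait-for graph cannot form. */
    void use_both_resources(void)
    {
        pthread_mutex_lock(&resource_a);
        pthread_mutex_lock(&resource_b);
        /* ... use both resources ... */
        pthread_mutex_unlock(&resource_b);
        pthread_mutex_unlock(&resource_a);
    }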
Alternatively, with some additional a priori information, deadlock can be avoided on a per-request basis.
The simple method is that each process declares the maximum number of resources of each type that it
may need, or a dynamic alternative is to dynamically examine the resource-allocation state to ensure
that there can never be a circular-wait condition. The resource allocation state is defined by the number
of available and allocated resources, and the maximum demands of the process.
System States
When a process requests an available resource, the system must decide if immediate allocation would
leave the system in a safe state - a state where there exists a sequence of all processes so that all
processes can get their required resources and complete. If the system is not in a safe state, it is in an
unsafe state, which means the possibility of deadlock exists (but is not necessarily going to happen). The
deadlock state is a subset of the unsafe states.
Avoidance requires that the system will never enter an unsafe state, so an algorithm must pick a safe
sequence of events and processes for execution. One such algorithm is the resource-allocation graph
algorithm. A claim edge Pi → Rj indicates that process Pi may request resource Rj which is represented by
a dashed line. A claim edge is converted to a request edge when a process requests a resource. When a
resource is released by a process, the assignment reconverts to a claim edge. Resources must be claimed
a priori in the system.
File Systems
Files are an integral part of most computer systems. Input to applications can be by means of a file, and
output can be saved in a file for long-term storage. File systems can store large amounts of data, and the
storage is persistent. The file management system is considered part of the OS.
The file management system is the way in which a process may access files. The programmer does not need to develop file management software; this is provided by the OS. The objectives of a file management system are to meet the data management needs of the user, to guarantee that the data in the file is valid, to minimise the potential for lost or destroyed data and to optimise performance. The user requirements for data management can be broken down further into a number of categories.
To enable this, the file management system typically provides functions to identify and locate a selected file, to organise files in a structure (often accomplished using directories) which describes the locations of all files and their attributes, and to provide mechanisms for protection. The file management system also provides secondary storage management, that is, it allocates files to free disk blocks and manages free storage for available disk blocks.
Files
A file is a contiguous, logical address space (apart from sparse files); however, it may not be stored contiguously on disk. Files may be of different types, but from the perspective of the OS, files have no structure - a file is just a sequence of machine words and bytes. However, from the user or application perspective, files may have a simple record structure made up of lines (of fixed or variable length) or complex structures, which have special control characters in the structure to make up files of various types.
File attributes include time, date and user identification, which are used for protection, security and usage monitoring.
File types, traditionally implemented using file extensions (in Microsoft Windows, this is a legacy from
DOS), enables the OS to invoke an appropriate application for each file type. UNIX and its variants use a
"magic number" at the start of the file to declare its type.
File systems have criteria for file organisation, such as rapid access (especially needed when accessing a
single data item), ease of update (not always a concern, e.g., CD-ROM), economy of storage (we want to reduce wasted space, although this is less of a concern in modern computing), simple maintenance and reliability.
In sequential files, access starts at the beginning of the file; to read the middle of the file, you must start at the beginning and read until you reach the position you want, and then rewind to get back to the beginning. This method of file access is slow.
A fixed format is used for structuring data into records. Records are normally the same length, and the
fields in a record are in fixed order. One field (usually the first) is a key, which uniquely identifies the
record within that file. Usually, records are stored in key order and batch updates are used to reduce the
update time (records are cached and then ordered before an update is done). This method is designed
for sequential access devices (tapes), but sequential files can be implemented on random access devices,
such as disks.
Additions to and deletions from the file can cause problems. The file is stored in a sequential ordering of
record keys, so insertion requires "moving" some of the records on the storage medium. This is very
expensive in time. Usually, new records are placed in a new file, called the overflow, log or transaction
file and a periodic batch update occurs that merges the overflow file with the original file. Alternatively,
records are organised as a linked list. More overhead in terms of storage is required, and additional
processing is required also. This only really works (on sequential storage devices) if the records remain in
key order. Deletion requests can also be stored in the log file.
To avoid the slowness of purely sequential access, and to better support random access storage devices, indexed sequential files can be used. The index provides the lookup capability to get into the vicinity of a desired record faster. The index is searched for the key field that is equal to or less than the desired key value; this entry contains a pointer into the main file. The search then continues in the main file from the location indicated
by the pointer. The records are still stored in order, and new records are added to an overflow file (a
record in the main file preceding the new record is updated to contain a pointer to the new record in the
overflow file). The overflow is then periodically merged with the main file during a batch update.
Comparison: If a file contains 1,000,000 records, in a standard sequential file, the average lookup case is
500,000 accesses to find a record in the sequential file. If an index contains 1000 entries, it will take on
average 500 accesses to find the key, followed by 500 accesses in the main file. On average, it is now
1000 accesses.
Direct access files exploit the capability of random access devices (such as disks) to access directly any
address on the device. No sequential ordering of data in the file is required, and this method can be used
where rapid access is required, such as in swapping or conventional user files. It is not that useful for
backups or archives where only occasional access is required (these can therefore be implemented on
cheaper media, such as tapes).
Directories
Directories are used to organise files, and they themselves are files owned by the OS. Directories contain
information about files, such as name, attributes, location, ownership, size, access/update times and
protection information. They provide a mapping between file names and the files themselves.
From a user perspective, they need to search for a file (by name or attribute), create a file, delete a file,
list a directory and rename a file, etc...
The directory must be organised to obtain efficiency (e.g., to locate a file quickly), naming (so it is
convenient for users, two users can have the same name for different files and the same file can have
several different names) and grouping (a logical grouping of files by properties).
There are multiple ways of implementing directories. One such way is the single-level directory, where a
single directory exists for all users. However, if all users share the same directory, they must ensure that
all file names are unique, and there is the grouping problem of little structure or organisation.
In a two-level directory setup, each user has their own directory, which allows a user to have the same
filename as another user, this allows for efficient searching, but has the problem of there being no
grouping capability between users.
Another is the tree-structured directory, which allows for efficient searching and grouping capability,
without any sharing of files. In this model, we have the concept of a current, or working, directory. There
are issues with this model, however. There are absolute and relative path names, and creating a new file
or directory is done in the current directory. Also, how do you delete a directory? Most OSs require you to explicitly state that you want to remove all files in the directory, or require you to delete them beforehand.
Tree structure prohibits the sharing of files and directories, however. Acyclic graphs remove this
restriction. They are more complex than tree, and two different names can refer to the same file
(aliasing), so there can be multiple path names for the same file. This does introduce a deletion problem, however - you cannot delete a file until all references to it have been removed. This is difficult, so it can be solved using backpointers (so that all the pointers to a file can be deleted), backpointers in a daisy-chain organisation, or an entry-hold-count solution.
Acyclic graphs are complex to manage, however. You need to ensure there are no cycles, which links can
introduce. Merely having subdirectories and files can not introduce cycles. Another way of guaranteeing
no cycles is to only allow links to files, not subdirectories, or by using a cycle detection algorithm every
time a new link is added to determine whether or not it is okay (this is expensive).
A file has to be opened before it is of any use, and similarly, a file system must be mounted before any
files or directories are available.
Sharing of files on multi-user systems is desirable. Sharing can be done through a protection scheme. A
simple method is to ensure that only one user has write access to a file at a time. A complex method is
using a variation of semaphores implemented by some file systems (including in POSIX) called file locks.
On distributed systems, files may be shared across a network.
Protection
Protection exists to make sure that files are not read or modified by unauthorised persons. Protection
mechanisms are provided by the OS to safeguard information inside the computer. Objects, and the rights that apply to them in particular contexts, are identified with the aim of prohibiting processes from accessing objects they are not authorised to access.
A domain is associated with a set of pairs <objects, rights>, where rights will be a subset of operations
that are allowed on the object, and a domain could be a user or a process.
It is possible for an object to be in two domains at the same time with or without the same rights.
Different mechanisms are used for identifying domains, in UNIX a combination of user ID and group ID is
used.
Another method of providing protection is through access control lists (ACLs) and capabilities. These come from an abstract model (the access matrix), where rows correspond to domains and columns correspond to objects. An ACL defines a column - the access rights for an object in different domains - and capabilities define the rows, which are associated with each process and give the set of operations that are permitted.
The capability list is a row of the matrix, and a capability then is like an abstract data type, that is, it
points to an object and specifies which operations are allowed on it (it may also have a field 'type'
indicating what type of object it is). The list then specifies what capabilities are accessible in this domain.
In UNIX, the file owner (by default, the creator) is able to control what should be done, and by whom.
Types of accesses include read, write and execute. In UNIX, this is implemented as a bitmask with three
domains, owner, group and other. Groups in this case are created by a sysadmin, not the user.
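For example, the bitmask can be set from C with chmod() - the file name here is illustrative:

    #include <sys/stat.h>

    int main(void)
    {
        /* rw- for the owner, r-- for the group, nothing for others (i.e., mode 0640). */
        chmod("notes.txt", S_IRUSR | S_IWUSR | S_IRGRP);
        return 0;
    }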
File systems reside on secondary storage and contain both programs and data. They also contain swapped
out pages as part of the virtual memory management system. The main issues to be addressed here are
how to store files (a contiguous logical structure) on secondary storage (disk) and mapping high level file
operations on to low-level disk operations, whilst requiring efficient use of the disk and fast read/write
times.
Disks provide a series of blocks, usually of a fixed size, which are sometimes related to the memory frame/page size, often as a multiple, so that a whole number of file blocks can be loaded into one physical memory frame.
Disks are random access, that is, you can directly address any given block on the disk, and it is simple to
access any file sequentially or randomly. For I/O efficiency, transfers between memory and disk are often
performed in units of blocks.
For convenient access to disks, the OS imposes a logical file system, which defines how files look to the
user (i.e., operations, file attributes, directories, etc...). Algorithms and data structures are needed to
map the logical file structure onto the physical disk. This file organisation maps logical file structure onto
physical block addresses. A basic file system issues generic commands to a device driver to read and
write blocks. Finally, the I/O control contains device drivers to translate basic file system commands into
commands for a specific disk.
Several structures are used to implement file systems, such as on disk structures including the boot
control block, the partition control block, the directory structure to organise the files and a per file file
control block (FCB) containing permissions, dates (access, create, modify), owner, group, size and blocks.
In memory structures are also used, such as the partition table, the directory structure (with recently
accessed directories) and an open file table (both system wide, and per process).
The above diagram shows the necessary file system structures provided by the OS to open a file (a) and
reading a file (b).
Virtual file systems exist so that a single interface can be provided by the OS to many different types of
file systems. Different file systems can be mounted on different mount points within the same directory
structure, and operations have the same effect on all the different file systems.
This logical file system is called the Virtual File System, or VFS and it allows the same system call interface
(the API) to be used for different types of file system. The API is to the VFS interface, rather than any
specific type of file system, and the VFS can distinguish between local and remote files.
Directory Implementation
The simplest implementation of directories is a linear list of file names with pointers to the data blocks. This is time-consuming, however (e.g., a search is required when a file is created to ensure uniqueness), as linear searches are expensive, although this cost can be offset by keeping frequently used directory information in a cache. Keeping the names in order would reduce searches, but would make inserting and deleting names more costly.
Another implementation is to use a hash table, which decreases directory search time. There is a
problem with collisions (situations where two file names hash to the same location) in this approach,
however. A method is needed for knowing that a hash location has a number of corresponding file
names, and for determining which name to use. This type of hash table is called a chained-overflow hash
table.
In contiguous allocation, each file occupies a set of contiguous blocks on the disk. This is relatively simple
to implement, as only a starting location (block #) and length (number of blocks) are required. Random
access is also relatively easy. This is wasteful of space (dynamic storage-allocation problem) due to
fragmentation (see memory management earlier) and files can not easily grow.
A slightly more improved method is linked allocation, where each file is a linked list of disk blocks, which
may be scattered anywhere on the disk, and where each block contains a pointer to the next block.
Again, this is simple as only the starting address is required and there is no wasted space in this free
space management system. However, there is no random access, as you have to go down the linked list
to find the appropriate block.
The file allocation table (FAT) system is a variant of linked allocation used in MS-DOS and OS/2. A section
at the beginning of the disk is dedicated to a FAT, containing an entry for each disk block. The directory
entry then only needs the block number for the start of the file. The FAT, indexed by block number, then gives the number of the next block in the file, and so on.
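A sketch of following a file's chain of blocks through an in-memory copy of the FAT - the table, the 16-bit entries and the end-of-chain marker are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    #define END_OF_CHAIN 0xFFFFu     /* illustrative end-of-file marker */

    /* fat[b] holds the number of the block that follows block b in its file.
     * Visit every block of the file whose first block is start_block. */
    void read_file_blocks(const uint16_t *fat, uint16_t start_block)
    {
        for (uint16_t b = start_block; b != END_OF_CHAIN; b = fat[b])
            printf("read disk block %u\n", b);
    }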
Linked allocation solves the problems of contiguous allocation, but it does not support random access without the FAT. Random access can be achieved with an index table - all the pointers are brought together into
an index block, which exists per file, where the ith entry points to the ith block of file, and this gives us
random access. With large files, however, large index tables are needed, which is a disadvantage; multi-
level indexing can be introduced to combat this, however. Additionally, the index table is an overhead.
Since disk space is limited, space needs to be reused from deleted files and free space needs to be
monitored. One way of implementing this is using a bit vector which shows when a block is free or
occupied. An alternative to this bit vector is the linked list. A pointer is kept to the first free block, which
contains a pointer to the next free block, etc... This is inefficient, as to traverse the list requires multiple
disk block reads, however this is an uncommon operation, as most operations require just one block.
However, the linked list does not require any extra space, unlike the bit vector, although there is no easy way to obtain contiguous space.
These methods can be modified using grouping, storing the addresses of the next n free blocks in the
first free block; next n addresses of free blocks in the next free block, etc... - this is more efficient in the
number of blocks that need to be read; and counting, which takes advantage of the fact that free blocks
usually form together. The first free block contains the number of free blocks immediately following (i.e.,
forming a contiguous free store), together with the address of the next free block at the start of some
contiguous free store.
The UNIX file system assumes a structure where all files have distinct absolute paths, using a tree hierarchy, although it is possible to set up cycles via symbolic links.
The VFS provides uniform access to all file systems, including special files (e.g., /proc, /dev).
The file system organisation is a multi-level indexing scheme, with an i-node (index-node) associated
with each file, which contains attributes and disk addresses of the files' blocks. There is no distinction
between files and directories other than having a different mode type field.
In an i-node, the first few disk addresses are stored (which, for small files, will be all of the disk addresses), and there are pointers to further blocks (indirect blocks) which contain further disk addresses. This may go to further levels of indirection, which gives a very flexible system at the expense of complexity and storage space.
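A simplified sketch of such an i-node in C - the field names and the count of twelve direct blocks loosely follow the classic UNIX layout and are illustrative rather than any particular file system's on-disk format:

    #include <stdint.h>
    #include <time.h>

    #define NUM_DIRECT 12

    /* Simplified i-node: attributes plus the addresses of the file's blocks. */
    struct inode {
        uint16_t mode;                  /* type (file/directory/...) and permission bits */
        uint16_t uid, gid;              /* ownership */
        uint32_t size;                  /* file size in bytes */
        time_t   atime, mtime, ctime;   /* access, modification, change times */

        uint32_t direct[NUM_DIRECT];    /* addresses of the first few data blocks */
        uint32_t single_indirect;       /* block full of further data-block addresses */
        uint32_t double_indirect;       /* block of addresses of single-indirect blocks */
        uint32_t triple_indirect;       /* one more level, for very large files */
    };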
For directories, each entry contains a file name and an i-node number. Here, the i-node contains
information about the time, size, type, ownership and disk blocks contained in the i-node. If you imagine
that the file system knows the root directory and wants to look up /usr/neil/ops/exam, then from the
root i-node, the system locates the i-node number of the file /usr, and then looks for the next
component neil, and so on. Relative paths work in the same way, where there is a known current
directory i-node.
The Ext2 Linux general purpose filesystem implements this, with a mostly static allocation of i-nodes. The
disk is split up into block groups of one or more disk cylinders (i.e., those physically together) and blocks
are used for i-nodes and file data blocks. Bitmaps are used for keeping data blocks and i-nodes in the
group. When allocating disk blocks, related information is attempted to be kept together (i.e., i-nodes of
all files in a directory in the same block group) and new directories are created in block groups with the
smallest number of directories.
This scheme works well as long as there is > 10% free space.
With journalling file systems, files are stored both on disk and in main memory (a memory cache of disk
blocks is used for efficiency), however we have to consider the possibility of what happens when the
system fails. On the disk, file system data structure may be corrupted, but we must ensure there is no
loss of data (i.e., consistency is maintained). Journaling (or log-based transaction) file systems attempt to
maintain consistency.
For a file operation, updates to the disk structure are written to a log, and when complete, the changes
are considered committed. The updates for a file operation form a transaction, and actual updates to
disk structures are performed so that if there is a crash, it is known how much of the transaction has
been performed (and which other transactions remain in the log). ReiserFS under Linux is an example of
a journalling file system.
I/O
Handling I/O efficiently is important to the user, and therefore to the OS. All user interaction with the
machine is done using I/O and most user invoked operations (i.e., applications) will use I/O at some
point.
Operating systems provide I/O device handling based on utilising low-level hardware interfaces to
devices (i.e., accessing devices via special instructions). Device drivers are provided to handle the idiosyncrasies of individual devices: they are programmed in terms of the low-level hardware interface and provide a standard high-level application interface to devices, to enable their easy use by application software.
Devices are addressed either using direct mapping, where devices exist in a special I/O range and special
I/O instructions exist in the CPU to implement this, or by memory mapping, where devices appear as
addresses in the normal memory range and normal processor instructions are used to access these
addresses.
Devices tend to be accessible in the same place for a particular architecture, and each address range maps to the I/O control registers for the device, accessed by special I/O instructions. Some devices have both direct and memory-mapped address space (e.g., graphics cards).
I/O ports typically consist of a status register, which contains bits read by the host that indicate device state (e.g., data ready for a read), a control register which enables the host to change the mode of a device (e.g., from full to half-duplex for serial ports), a data-in register which is read by the host to get input, and a data-out register, which is written to by the host for data to be sent to the device. This introduces the issue of how the host knows when a particular condition has arisen in the status register.
Polling is one way to solve this problem, and is a host initiated approach. The device status register is
polled until the wanted condition occurs, but this way is inefficient as there is a busy-wait cycle to wait
for I/O from a device. The alternative solution is using interrupts, which is initiated by both the host and
the device. A CPU interrupt request line is triggered by the device and the interrupt handler receives
interrupts and uses an interrupt vector to dispatch an interrupt to the correct handler based on priority.
Most interrupts can be maskable to ignore or delay some interrupts, and the interrupt mechanism can
be used for exceptions. There is no need for busy-waiting in this model.
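A sketch of host-initiated polling against memory-mapped registers - the addresses and bit layout are invented purely for illustration:

    #include <stdint.h>

    /* Hypothetical memory-mapped device registers. */
    #define DEV_BASE     0xFFFF0000u
    #define STATUS_REG   (*(volatile uint8_t *)(DEV_BASE + 0x0))
    #define DATA_IN_REG  (*(volatile uint8_t *)(DEV_BASE + 0x4))

    #define STATUS_DATA_READY 0x01   /* set by the device when input is available */

    /* Busy-wait (poll) until the device reports data, then read it. */
    uint8_t read_byte_polled(void)
    {
        while ((STATUS_REG & STATUS_DATA_READY) == 0)
            ;                        /* the CPU does no useful work while spinning */
        return DATA_IN_REG;          /* reading data-in typically clears the status bit */
    }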
In the IA32 architecture, interrupts 0-31 are unmaskable, as some are exceptions, and 32-255 are
maskable and are used for devices.
For information on DMA, see ICS.
DMA is used to avoid programmed I/O for large data movement and is implemented using a DMA
controller to bypass the CPU and transfer data directly between the I/O device and memory.
Application Interfaces
There are many different hardware devices, but application processes want a common interface. Devices can vary in many dimensions:
• transfer unit - character devices (dealing with bytes or words of data, e.g., keyboards, mice, serial ports; commands include get and put, and libraries are often implemented to allow in-line editing), stream devices (a stream of bits) or block devices (dealing with blocks of data, such as disks; commands include read, write and seek, and most raw I/O filesystem access uses this)
• sequential or random access
• sharable or dedicated device
• device speed
• read-write, read-only or write-only
Timer devices also exist, which provide the current time, elapsed time, or just a standard timer.
I/O can be either blocking or non-blocking. That is, a process is suspended until I/O is completed under a
blocking scheme (this is easy to use and understand, but is insufficient for some needs, although blocking can be programmed around using threads), or under a non-blocking scheme, an I/O call returns as
much as available and returns quickly with a count of bytes read or written. A final scheme is
asynchronous, which is difficult to use and is a scheme where a program runs while the I/O executes and
the I/O subsystem signals the process when the I/O is completed.
Usually, hardware devices are controlled in the same manner as files (create, write, read, delete, etc...),
so the filesystem namespace can be used for devices (/dev in UNIX). The kernel implements I/O through
a series of layers as below:
The I/O subsystem of the kernel implements I/O scheduling and some I/O request ordering via a per-device queue, although some OSs do try to implement fairness. Additionally, buffering is also implemented, where data is stored in memory while it is transferred between devices (this copes with device speed mismatch, transfer size mismatch and "copy semantics").
Buffering can be implemented using many different methods, such as caching, where fast memory holds a copy of the data (always just a copy, and this is the key to performance), or spooling, where output for a device is held (this is useful if the device can only serve one request at a time, e.g., printing). Device reservation can also be used, where exclusive access is provided to a device and system calls for allocation and deallocation are used. However, deadlock (discussed above) can be encountered here.
The kernel has data structures for things like keeping state info for I/O components including open file
tables, network connections and the character device state, as well as many, many other data structures
to track buffers, memory allocation and "dirty" blocks. Some OSs use object-oriented methods and message passing to implement I/O.
Some OSs use a file based approach to kernel I/O structures. Devices are named as files and accesses to
devices are by file operations (with some exceptions, e.g., seek in a character device may have no effect).
In Linux, an additional command ioctl() provides direct access to the device driver, perhaps to set the
device into a particular mode (e.g., changing speed or other parameters of a serial port).
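A small example of ioctl() - asking a character device how many bytes are waiting to be read with FIONREAD; the device path is illustrative:

    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/ttyS0", O_RDONLY | O_NONBLOCK);  /* illustrative device */
        if (fd < 0)
            return 1;

        int bytes_waiting = 0;
        if (ioctl(fd, FIONREAD, &bytes_waiting) == 0)         /* direct request to the driver */
            printf("%d bytes waiting to be read\n", bytes_waiting);

        close(fd);
        return 0;
    }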
I/O Performance
I/O is a major factor in system performance, as it makes demands on the CPU to execute the device
driver or kernel I/O code and it forces context switches due to interrupts. Data copying and network
traffic are especially stressful.
Ways of improving performance include reducing the number of context switches, data copying and
interrupts by using large transfers, smart controllers and polling. DMA can also be used and the CPU,
memory, bus and I/O performance need to be balanced for highest throughput.
This brings us to the question of where functionality should be implemented to get the best performance - at the application, kernel or hardware level?