
Operating Systems

Julian Bradfield
jcb@inf.ed.ac.uk
IF–4.07

Course Outcomes

By the end of the course you should be able to
- describe the general architecture of computers
- describe, contrast and compare differing structures for operating systems
- understand and analyse theory and implementation of: processes, resource control (concurrency etc.), physical and virtual memory, scheduling, I/O and files

In addition, during the practical exercise and associated self-study, you will:
- become familiar (if not already) with the C language, gcc compiler, and Makefiles
- understand the high-level structure of the Linux kernel both in concept and source code
- acquire a detailed understanding of one aspect (the scheduler) of the Linux kernel

Course Aims

- general understanding of structure of modern computers
- purpose, structure and functions of operating systems
- illustration of key OS aspects by example

Course Outline

This outline is subject to modification during the course.
- Introduction; history of computers; overview of OS (this lecture)
- Computer architecture (high-level view); machines viewed at different abstraction levels
- Basic OS functions and the historical development of OSes
- Processes (1)
- Processes (2) – threads and SMP
- Scheduling (1) – CPU utilization and task scheduling
- Concurrency (1) – mutual exclusion, synchronization
- Concurrency (2) – deadlock, starvation, analysis of concurrency
- Memory (1) – physical memory, early paging and segmentation techniques
- Memory (2) – modern virtual memory concepts and techniques
- Memory (3) – paging policies
- I/O (1) – low-level I/O functions
- I/O (2) – high-level I/O functions and filesystems
- Case studies: one or both of: the Windows NT family; IBM's System/390 family. N.B. you will be expected to study Linux during the practical exercise and in self-study.
- Other topics to be determined, e.g. security.

Textbooks

There are many very good operating systems textbooks, most of which cover the material of the course (and much more).
I shall be (very loosely) following
W. Stallings, Operating Systems: Internals and Design Principles, Prentice-Hall/Pearson.
Another book that can equally well be used is
A. Silberschatz and P. Galvin, Operating Systems Concepts (5th or later edition), Addison-Wesley.
Most of the other major OS texts are also suitable.
You are expected to read around the subject in some textbook, but there is no specific requirement to buy Stallings 7th edition. References to Stallings change from edition to edition, so are mainly by keyword.

Assessment

The course is assessed by a written examination (75%), one practical exercise (15%) and an essay (10%).
The practical exercise will run through weeks 3–8, and will involve understanding and modifying the Linux kernel. The final assessed outcome is a relatively small part of the work, and will not be too hard; most of the work will be in understanding C, Makefiles, the structure of a real OS kernel, etc. This is essential for real systems work!
The essay will be due at the end of week 10, and will be from a list of topics: either a more extensive investigation of something covered briefly in lectures, or a study of something not covered. (Ideas welcome.)

Acknowledgement

I should like to thank Dr Steven Hand of the University of Cambridge, who has provided me with many useful figures for use in my slides, and allowed me to use some of his slides as a basis for some of mine.
A brief and selective history of computing . . .

Computing machines have been increasing in complexity for many centuries, but only recently have they become complex enough to require something recognizable as an operating system. Here, mostly for fun, is a quick review of the development of computers.

The abacus – some millennia BP.
[figure: Association pour le musée international du calcul de l'informatique et de l'automatique de Valbonne Sophia Antipolis (AMISA)]

Logarithms (Napier): the slide rule – 1622 Bissaker
First mechanical digital calculator – 1642 Pascal
[figure: original source unknown]

The Difference Engine, [The Analytical Engine] – 1812, 1832 Babbage / Lovelace.
[figure: Science Museum]
The Analytical Engine (never built) anticipated many modern aspects of computers. See http://www.fourmilab.ch/babbage/.

Electro-mechanical punched card – 1890 Hollerith (→ IBM)
Vacuum tube – 1905 De Forest
Relay-based IBM 610 hits 1 MultiplicationPS – 1935
ABC, 1st electronic digital computer – 1939 Atanasoff / Berry
Z3, 1st programmable computer – 1941 Zuse
Colossus, Bletchley Park – 1943
ENIAC – 1945, Eckert & Mauchley
[figure: University of Pennsylvania]
- 30 tons, 1000 sq feet, 140 kW
- 18k vacuum tubes, 20 10-digit accumulators
- 100 kHz, around 300 M(ult)PS
- in 1946 added blinking lights for the Press!
Programmed by a plugboard, so very slow to change program.

The Von Neumann Architecture

[figure: Memory; Input; Output; Control Unit; Arithmetic Logical Unit; Accumulator]
In 1945, John von Neumann drafted the EDVAC report, which set out the architecture now taken as standard.

the transistor – 1947 (Shockley, Bardeen, Brattain)
EDSAC, 1st stored program computer – 1949 (Wilkes)
- 3k vacuum tubes, 300 sq ft, 12 kW
- 500 kHz, ca 650 IPS
- 1K 17-bit words of memory (Hg ultrasonic delay lines)
- operating system of 31 words
- see http://www.dcs.warwick.ac.uk/~edsac/ for a simulator
TRADIC, 1st valve-free computer – 1954 (Bell Labs)
first IC – 1959 (Kilby & Noyce, TI)
IBM System/360 – 1964. Direct ancestor of today's zSeries, with continually evolved operating system.
Intel 4004, 1st µ-processor – 1971 (Ted Hoff)
Intel 8086, IBM PC – 1978
VLSI (> 100k transistors) – 1980
Levels of (Programming) Languages

[figure: Level 5 ML/Java Bytecode – interpret – Level 4 C/C++ Source – compile – Level 3 ASM Source – assemble – Level 2 Object File, with Other Object Files ("Libraries") – link – Level 1 Executable File ("Machine Code") – execute]

(Modern) Computers can be programmed at several levels. Each level relates to the one below via either translation/compilation or interpretation. Similarly, the operation of a computer can be described at many levels.
Exercise: justify (or attack) the placing of bytecode in Level 5 in the diagram.

Quick Review of Computer Architecture

[figure: Processor containing Register File (including PC), Control Unit and Execution Unit; Address/Data/Control bus; Memory (e.g. 64 MByte, 2^26 x 8 = 536,870,912 bits); Hard Disk; Framebuffer; Super I/O with Mouse, Keyboard, Serial; Sound Card; Reset]
(Please revise Inf2!)

Layered Virtual Machines

[figure: Virtual Machine M5 (Language L5) – Meta-Language Level; Virtual Machine M4 (Language L4) – Compiled Language Level; Virtual Machine M3 (Language L3) – Assembly Language Level; Operating System Level; Virtual Machine M2 (Language L2); Virtual Machine M1 (Language L1) – Conventional Machine Level; Actual Machine M0 (Language L0) – Digital Logic Level]

Think of a virtual machine in each layer built on the lower VM; the machine in one level understands the language of that level.
This course considers mainly levels 1 and 2.
Exercise: Operating Systems are often written in assembly language or C or higher. What does it mean to say level 2 is below levels 3 and 4?

Registers

(Very) fast on-chip memory. Typically 32 or 64 bits; nowadays from 8 to 128 registers is usual.
Data is loaded from memory into registers before being operated on.
Registers may be purely internal and not visible to the programmer, even at machine code level.
Most processors distinguish data and control registers: bits in a control register have special meaning to the CPU.
Intel Pentium has:
- eight 32-bit general purpose registers
- six 16-bit segment registers (for address space management)
- two 32-bit control registers, including Program Counter (called EIP by Intel)

IBM z/Architecture has:
- sixteen 64-bit general registers
- sixteen 64-bit floating point registers
- one 32-bit floating point control register
- sixteen 64-bit control registers
- sixteen 32-bit access registers (for address space management)
- one Program Status Word (PC)

The cache is fast, expensive memory sitting between CPU and main memory – cache ↔ CPU via special bus.
May have several levels of cache – current IBM mainframes have four.
The OS has to be aware of the cache and control it, e.g. when switching address spaces.

Memory Hierarchy

[figure: CPU (Register File, Execution Unit, Control Unit); Instruction and Data Caches (SRAM); Bus Interface Unit; Address/Data/Control bus; Main Memory (64MB DRAM); 32K ROM]

The Fetch–Execute Cycle

[figure: Control Unit with PC, Instruction Buffer (IB) and Decode; Execution Unit; Register File]
PC initialized to fixed value on CPU reset. Then repeat (until halt):
1. instruction is fetched from memory address in PC into instruction buffer
2. Control Unit decodes instruction
3. Execution Unit executes it
4. PC is updated: explicitly by jumps, implicitly otherwise
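To make the cycle concrete, here is a toy fetch–decode–execute loop in C (purely illustrative: the four-instruction machine and its opcodes are invented, not any real ISA):

    /* A toy fetch-decode-execute loop; the 4-instruction ISA is invented. */
    #include <stdio.h>

    enum { HALT, LOADI, ADD, JUMP };        /* invented opcodes */

    int main(void) {
        /* each instruction occupies two words: opcode, operand */
        int mem[] = { LOADI, 5, ADD, 3, JUMP, 8, ADD, 100, HALT, 0 };
        int pc = 0;                         /* PC: fixed value on reset */
        int acc = 0;                        /* accumulator */
        for (;;) {
            int op  = mem[pc];              /* 1. fetch */
            int arg = mem[pc + 1];
            pc += 2;                        /* 4. PC updated implicitly... */
            switch (op) {                   /* 2. decode */
            case LOADI: acc  = arg; break;  /* 3. execute */
            case ADD:   acc += arg; break;
            case JUMP:  pc   = arg; break;  /* ...or explicitly by jumps */
            case HALT:  printf("acc = %d\n", acc); return 0;
            }
        }
    }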
Input/Output Devices

We'll consider these later in the course. For now, note that:
- I/O devices typically connected to CPU via a bus (or via a chain of buses and bridges)
- wide range of devices, e.g.: hard disk, CD, graphics card, sound card, ethernet card, modem
- often with several stages and layers
- all of which are very slow compared to CPU.

Bus Hierarchy

[figure: Processor and Caches with two 64MByte DIMMs on the Processor Memory Bus (100MHz); Bridge to PCI Bus (33MHz) with Framebuffer and SCSI Controller; Bridge to ISA Bus (8MHz) with Sound Card]
Most computers have many different buses, with different functions and characteristics.

Buses

[figure: Processor, Memory and Other Devices sharing ADDRESS, DATA and CONTROL lines]
A bus is a group of 'wires' shared by several devices (e.g. CPU, memory, I/O). Buses are cheap and versatile, but can be a severe performance bottleneck (e.g. PC-card hard disks).
A bus typically has address lines, data lines and control lines.
Operated in master–slave protocol: e.g. to read data from memory, CPU (master) puts address on bus and asserts 'read'; memory (slave) retrieves data, puts data on bus; CPU reads from bus.
In some cases, may need initialization protocol to decide which device is the bus master; in others, it's pre-determined.

Interrupts

Devices are much slower than the CPU; can't have the CPU wait for a device. Also, external events may occur.
Interrupts provide a suitable mechanism. An interrupt is (logically) a signal line into the CPU. When asserted, the CPU jumps to a particular location (e.g. on x86, on interrupt (IRQ) n, the CPU jumps to the address stored in the nth entry of the table pointed to by the IDTR control register).
The jump saves state; when the interrupt handler finishes, it uses a special return instruction to restore control to the original program.
Thus, I/O operation is: instruct device and continue with other tasks; when device finishes, it raises interrupt; handler gets info from device etc. and schedules requesting task.
In practice (e.g. x86), there may be one or two interrupt pins on the chip, with an interrupt controller to encode external interrupts onto the bus for the CPU.
Direct Memory Access (DMA)

DMA means allowing devices to write directly (i.e. via bus) into main memory.
E.g., CPU tells device 'write next block of data into address x'; gets interrupt when done.
PCs have basic DMA; IBM mainframes' 'I/O channels' are a sophisticated extension of DMA (CPU can construct complex programs for device to execute).

So what is an Operating System for?

An OS must . . .
handle relations between CPU/memory and devices (relations between CPU and memory are usually in CPU hardware);
handle allocation of memory;
handle sharing of memory and CPU between different logical tasks;
handle file management;
ever more sophisticated tasks . . .
. . . in Windows, handle most of the UI graphics. (Is this OS business?)
Exercise: On the Web, find the Brown/Denning hierarchy of OS functions. Discuss the ordering of the hierarchy, paying particular attention to levels 5 and 6. Which levels does the Linux kernel handle? And Windows Vista?
(kernel: the single (logical) program that is loaded at boot time and has primary control of the computer.)

In the beginning. . .

Earliest 'OS' simply transferred programs from punched card reader to memory.
Everything else done by lights and switches on front panel.
Job scheduling done by sign-up sheets.
User (= programmer = operator) had to set up entire job (e.g.: load compiler, load source code, invoke compiler, etc) programmatically.
I/O directly programmed.

First improvements

Users write programs and give tape or cards to operator.
Operator feeds card reader, collects output, returns it to users.
(Improvement for user – not for operator!)
Start providing standard card libraries for linking, loading, I/O drivers, etc.
Early batch systems

Late 1950s–early 1960s saw introduction of batch systems (General Motors, IBM; standard on IBM 7090/7094).
[Figure 2.3: Memory Layout for a Resident Monitor – the monitor comprises Interrupt Processing, Device Drivers, Job Sequencing and the Control Language Interpreter; above the boundary lies the User Program Area]
- monitor is simple resident OS: reads jobs, transfers control to program, receives control back from program at end of task.
- batches of jobs can be put onto one tape and read in turn by monitor – reduces human intervention.
- monitor permanently resident: user programs must be loaded into different area of memory

Making good use of resource – multiprogramming

Even in the 60s, I/O was very slow compared to CPU. So jobs would waste most (typically > 75%) of the CPU cycles waiting for I/O.
Multiprogramming introduced: monitor loads several user programs; when one is waiting for I/O, run another.
Multiprogramming means the monitor must:
- manage memory among the various tasks
- schedule execution of the tasks
Multiprogramming OSes introduced early 60s – Burroughs MCP (1963) was an early (and advanced) example.
In 1964, IBM introduced the System/360 hardware architecture. Family of architectures, still going strong (S/360 → S/370 → S/370-XA → ESA/370 → ESA/390 → z/Architecture). Simulated/emulated previous IBM computers.
Early S/360 OSes not very advanced: DOS single batch; MFT ran fixed number of tasks. In 1967 MVT ran up to 15 tasks.

Protecting the monitor from the users

Having monitor co-resident with user programs is asking for trouble. Desirable features, needing hardware support, include:
- memory protection: user programs should not be able to . . . write to monitor memory,
- timer control: . . . or run for ever,
- privileged instructions: . . . or directly access I/O (e.g. might read next job by mistake) or certain other machine functions,
- interrupts: . . . or delay the monitor's response to external events

Using batch systems was (and is) pretty painful. E.g. on MVS, to assemble, link and run a program:

    //USUAL JOB A2317P,'MAE BIRDSALL'
    //ASM EXEC PGM=IEV90,REGION=256K,            EXECUTES ASSEMBLER
    //  PARM=(OBJECT,NODECK,'LINECOUNT=50')
    //SYSPRINT DD SYSOUT=*,DCB=BLKSIZE=3509      PRINT THE ASSEMBLY LISTING
    //SYSPUNCH DD SYSOUT=B                       PUNCH THE ASSEMBLY LISTING
    //SYSLIB DD DSNAME=SYS1.MACLIB,DISP=SHR      THE MACRO LIBRARY
    //SYSUT1 DD DSNAME=&&SYSUT1,UNIT=SYSDA,      A WORK DATA SET
    //  SPACE=(CYL,(10,1))
    //SYSLIN DD DSNAME=&&OBJECT,UNIT=SYSDA,      THE OUTPUT OBJECT MODULE
    //  SPACE=(TRK,(10,2)),DCB=BLKSIZE=3120,DISP=(,PASS)
    //SYSIN DD *                                 IN-STREAM SOURCE CODE
    .
    code
    .
    /*
    //LKED EXEC PGM=HEWL,                        EXECUTES LINKAGE EDITOR
    //  PARM='XREF,LIST,LET',COND=(8,LE,ASM)
    //SYSPRINT DD SYSOUT=*                       LINKEDIT MAP PRINTOUT
    //SYSLIN DD DSNAME=&&OBJECT,DISP=(OLD,DELETE)  INPUT OBJECT MODULE
    //SYSUT1 DD DSNAME=&&SYSUT1,UNIT=SYSDA,      A WORK DATA SET
    //  SPACE=(CYL,(10,1))
    //SYSLMOD DD DSNAME=&&LOADMOD,UNIT=SYSDA,    THE OUTPUT LOAD MODULE
    //  DISP=(MOD,PASS),SPACE=(1024,(50,20,1))
    //GO EXEC PGM=*.LKED.SYSLMOD,TIME=(,30),     EXECUTES THE PROGRAM
    //  COND=((8,LE,ASM),(8,LE,LKED))
    //SYSUDUMP DD SYSOUT=*                       IF FAILS, DUMP LISTING
    //SYSPRINT DD SYSOUT=*,                      OUTPUT LISTING
    //  DCB=(RECFM=FBA,LRECL=121)
    //OUTPUT DD SYSOUT=A,                        PROGRAM DATA OUTPUT
    //  DCB=(LRECL=100,BLKSIZE=3000,RECFM=FBA)
    //INPUT DD *                                 PROGRAM DATA INPUT
    .
    data
    .
    /*
    //

Virtual Memory

Multitasking, and time-sharing in particular, is much easier if all tasks are resident, rather than being swapped in and out of memory.
But not enough memory! Virtual memory decouples memory as seen by the user task from physical memory. The task sees virtual memory, which may be anywhere in real memory, and can be paged out to disk.
Hardware support required: all memory references by user tasks must be translated to real addresses – and if the virtual page is on disk, the monitor is called to load it back in real memory.
In 1963, Burroughs had virtual memory. IBM only introduced it to the mainframe line with S/370 in 1972.

Time-sharing

Allow interactive terminal access to computer, with many users sharing.
Early system (CTSS, Cambridge, Mass.) gave each user 0.2s of CPU time; monitor then saved user program state, loaded state of next scheduled user.
IBM's TSS for S/360 was similar – and a software engineering disaster. Major motivation for development of SE!

Virtual Memory Addressing

[figure: Processor issues a virtual address to the Memory Management Unit, which produces either a real address in Main Memory or a disk address in Secondary Memory]
The Process Concept

With virtual memory, it becomes natural to give different tasks their own independent address space or view of memory. The monitor then schedules processes appropriately, and does all context-switching (loading of virtual memory control info, etc.) transparently to the user process.
Note on terminology. It's common to use 'process' for a task with independent address space, espec. in Unix settings, but this is not a universal definition. Tasks sharing the same address space are called 'tasks' (IBM) or 'threads' (Unix). But some older OSes without virtual memory called their tasks 'processes'.
Communication between processes becomes a major issue (studied later); as does control of resources.

Memory Protection

Virtual memory itself allows user's memory to be isolated from kernel memory and other users' memory. Both for historical reasons and to allow user/kernel memory to be appropriately shared, many architectures have separate protection mechanisms as well:
- a frame or page may be read or write accessible only to a processor in a high privilege level;
- in S/370, each frame of memory has a 4-bit storage key, and each task runs with a particular key;
- the virtual memory mechanism may be extended with permission bits; frames can then be shared;
- a combination of all the above may be used.

Modes of CPU operation

To protect the OS from users, all modern CPUs operate in more than one privilege level:
- S/370 family has supervisor and problem states
- Intel x86 has rings 0, 1, 2, 3.
Transition to a higher privilege level is only allowed via tightly controlled mechanisms. E.g. IBM SVC (supervisor call) or Intel INT are like software interrupts: change to supervisor mode and jump to pre-determined address.
CPU instructions that can damage the system are restricted to supervisor state: e.g. virtual memory control, I/O.

OS structure – traditional

[figure: applications run unprivileged; the privileged kernel contains system calls, scheduler, file system, protocol code and device drivers, sitting directly on the hardware]
All OS function sits in the kernel. Some modern kernels are very large – tens of MLoC. A bug in any function can crash the system. . .
OS structure – microkernels

[figure: applications and servers (file system, device drivers, etc.) run unprivileged; the small privileged kernel contains the scheduler]
Small core, which talks to (maybe privileged) components in separate servers.

Kernel vs Microkernel

Microkernels:
- increase modularity
- increase extensibility
but
- have more overhead (due to IPC)
- can be difficult to implement (synchronization)
- often keep multiple copies of OS data structures
Modern real (rather than CS) OSes are hybrid:
- Linux is monolithic, but has modules that are dynamically (un)loadable
- Windows NT was orig. microkernel-ish, but for performance has put stuff back into the kernel.
See GNU Hurd (based on MACH microkernel) . . .

Processes – what are they?

Recall that a process is 'a program in execution'; it may have its own view of memory; it sees one processor, although it's sharing it with other processes – running on a virtual processor.
To switch between processes, we need to track:
- its memory, including stack and heap
- the contents of registers
- program counter
- its state

Process States

State is an abstraction used by the OS. One standard analysis has five states:
- New: process being created
- Running: process executing on CPU
- Ready: not on CPU, but ready to run
- Blocked: waiting for an event (and so not runnable)
- Exit: process finished, awaiting cleanup
Exercise: find out what process states Linux uses. How do they correspond to this set?
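As a minimal sketch in C (type and constant names invented, not from any particular kernel), the five-state model and its legal transitions might be captured as:

    /* Sketch of the five-state model; names are invented for illustration. */
    enum proc_state { PS_NEW, PS_READY, PS_RUNNING, PS_BLOCKED, PS_EXIT };

    /* legal transitions; the names match the state diagram below */
    int legal_transition(enum proc_state from, enum proc_state to) {
        switch (from) {
        case PS_NEW:     return to == PS_READY;      /* admit */
        case PS_READY:   return to == PS_RUNNING;    /* dispatch */
        case PS_RUNNING: return to == PS_READY       /* timeout/yield */
                             || to == PS_BLOCKED     /* event-wait */
                             || to == PS_EXIT;       /* release */
        case PS_BLOCKED: return to == PS_READY;      /* event */
        default:         return 0;
        }
    }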
The state of a process is maintained by the OS. Transitions between states happen as follows:
[figure: state diagram – New –admit→ Ready –dispatch→ Running; Running –timeout or yield→ Ready; Running –event-wait→ Blocked –event→ Ready; Running –release→ Exit]

- admit: process control set up, move to run queue
- dispatch: scheduler gives CPU to runnable process
- timeout/yield: running process forced to/volunteers to give up CPU
- event-wait: process needs to wait for e.g. I/O
- event: event occurs – wake up process and tell it
- release: process terminates, release resources

Context Switching

The PCB allows the OS to switch process contexts:
[figure: Process A executing; OS saves state into PCB A and restores state from PCB B; Process B executing; OS saves state into PCB B and restores state from PCB A; Process A executing again; the process not running is idle]
Time-consuming, so modern CPUs provide H/W support. (About 80 pages in the IBM ESA/390 manual – complex, sophisticated, rarely used mechanisms.)

Process Control Block

[figure: PCB containing Process Number (or Process ID); Current Process State; CPU Scheduling Information; Program Counter; Other CPU Registers; Memory Management Information; Other Information (e.g. list of open files, name of executable, identity of owner, CPU time used so far, devices owned); Refs to previous and next PCBs]

The PCB contains all necessary information:
- unique process ID
- process state
- PC and other registers (when not running)
- memory management info
- scheduling and accounting info
- ...
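A PCB might be sketched in C as below (field layout and types are invented for illustration; a real one, e.g. Linux's struct task_struct, is far larger):

    /* Sketch of a PCB; field types are invented for illustration. */
    struct pcb {
        int              pid;         /* unique process ID */
        enum proc_state  state;       /* current process state */
        int              priority;    /* CPU scheduling information */
        unsigned long    pc;          /* saved program counter */
        unsigned long    regs[16];    /* other CPU registers (when not running) */
        struct mm_info  *mm;          /* memory management info (invented type) */
        unsigned long    cpu_time;    /* accounting: CPU time used so far */
        struct pcb      *prev, *next; /* refs to previous and next PCBs */
    };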

Kernel Context?

In what context does the kernel execute?
- in older OSes, kernel is seen as single program in real memory
- in modern OSes, kernel may execute in context of user process
- parts of OS may be processes (in some sense)
For example, in both Unix and OS/390, I/O is dealt with by kernel code running in the context of the user process, but the master scheduler is independent of user processes.
(Using advanced features of S/390, the OS/390 kernel may be executing in the context of several user processes.)
Scheduling

When do processes move from Ready to Running? This is the job of the scheduler. We will look at this in detail later.

Creating Processes (1)

How, why, when are processes created?
- By the OS when a job is submitted or a user logs on.
- By the OS to perform background service for user (e.g. printing).
- By explicit request from user program (spawn, fork).
In Unix, a new process (and address space) is created for every program executed: e.g. the shell does fork() and the child process does execve() to load the program. N.B. fork() creates a full copy of the calling process.
In WinNT, CreateProcess() creates a new process and loads a program.
In OS/390, users create subtasks only for explicit concurrent processing, and all subtasks share the same address space. (For a new address space, submit a batch job. . . )

Creating Processes (2)

When a process is created, the OS must
- assign a unique identifier
- allocate memory space: both kernel memory for control structures, and user memory
- initialize the PCB and (maybe) memory management tables
- link the PCB into OS data structures
- initialize remaining control structures
- for WinNT, OS/390: load program
- for Unix: make child process a copy of parent
Modern Unices don't actually copy; they share and do copy-on-write.

Ending Processes

Processes may
- terminate voluntarily (Unix exit())
- perform an illegal operation (privileged instruction, access non-existent memory, etc.)
- be killed by user (Unix kill()) or OS because
  - allocated resources exceeded
  - task functionality no longer needed
  - parent terminating (in some OSes) . . .
On termination, the OS must:
- deal with pending output etc.
- release all system resources held by process
- unlink PCB from OS data structures
- reclaim all user and kernel memory
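To make the Unix lifecycle concrete, here is a minimal fork()/execve()/exit()/wait() sketch, roughly what a shell does to run /bin/ls (error handling abbreviated):

    /* Minimal sketch of the Unix fork/exec pattern described above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        pid_t pid = fork();               /* copy the calling process */
        if (pid == 0) {                   /* child: replace image with /bin/ls */
            char *argv[] = { "ls", "-l", NULL };
            char *envp[] = { NULL };
            execve("/bin/ls", argv, envp);
            perror("execve");             /* only reached on failure */
            exit(1);
        } else if (pid > 0) {             /* parent: wait for child to end */
            int status;
            waitpid(pid, &status, 0);
            printf("child %d exited\n", (int)pid);
        } else {
            perror("fork");
        }
        return 0;
    }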
Processes and Threads

Processes
- own resources such as address space, I/O devices, files
- are units of scheduling and execution
These are logically distinct. Some old OSes (MVS) and most modern OSes (Unix, Windows) allow many threads (or lightweight processes [some Unices] or tasks [IBM]) to execute concurrently in one process (or address space [IBM]).
Everything previously said about scheduling applies to threads; but process-level context is shared by the thread contexts. All threads in one process share system resources. Hence
- creating threads is quick (ca. 10 times quicker than processes)
- ending threads is quick
- switching threads within one process is quick
- inter-thread communication is quick and easy (have shared memory)

Real Threads vs Thread Libraries

Threads can be implemented as part of the OS; e.g. Linux, OS/390, Windows.
If the OS does not do this (or in any case), threads can be implemented by user-space libraries:
- thread library implements mini-process scheduler (entirely in user space)
- context of thread is PC, registers, stacks etc., saved in thread control block (stored in user process's memory)
- switching between threads can happen voluntarily, or on timeout (user level timer, rather than kernel timer)

Thread Operations

Thread state is similar to process state. Basic operations are similar:
- create: thread spawns new thread, specifying instruction pointer or routine to call. OS sets up thread context: registers, stack space, . . .
- block: thread waits for event. Other threads may execute.
- unblock: event occurs, thread becomes ready.
- finish: thread completes; context reclaimed.

Advantages of user-space threads include:
- context switching very fast – no OS involvement
- scheduling can be tailored to application
- thread library can be OS-independent
Disadvantages:
- if a thread makes a blocking system call, the entire process is blocked. There are ways to work round this. Exercise: How?
- user-space threads don't execute concurrently on multiprocessor systems.
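By contrast, OS-supported threads are what POSIX pthreads normally gives you; a minimal create/join sketch (compile with -pthread):

    /* Minimal POSIX threads example: create two threads, wait for both. */
    #include <stdio.h>
    #include <pthread.h>

    static void *worker(void *arg) {
        printf("thread %d running\n", *(int *)arg);
        return NULL;                   /* 'finish': context reclaimed at join */
    }

    int main(void) {
        pthread_t t1, t2;
        int id1 = 1, id2 = 2;
        pthread_create(&t1, NULL, worker, &id1);   /* 'create' */
        pthread_create(&t2, NULL, worker, &id2);
        pthread_join(t1, NULL);        /* block until each thread finishes */
        pthread_join(t2, NULL);
        return 0;
    }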
MultiProcessing

There is always a desire for faster computers. One solution is to use several processors connected together. The following taxonomy is widely used:
- Single Instruction Single Data stream (SISD): normal setup, one processor, one instruction stream, one memory.
- Single Instruction Multiple Data stream (SIMD): a single program executes in lockstep on several processors. E.g. vector processors (used for large scientific applications).
- Multiple Instruction Single Data stream (MISD): not used.
- Multiple Instruction Multiple Data stream (MIMD): many processors each executing different programs on different data.
Within MIMD systems, processors may be loosely coupled, for example, a network of separate computers with communication links; or tightly coupled, for example processors connected via single bus to shared memory.

Symmetric MultiProcessing – SMP

With shared memory multiprocessing, where does the OS run?
Master–slave: The kernel runs on one CPU, and dispatches user processes to others. All I/O etc. is done by request to the kernel on the master CPU. Easy, but inefficient and failure prone.
Symmetric: The kernel may execute on any CPU. Kernel may be multi-process or multi-threaded. Each processor may have its own scheduler. Much more flexible and efficient – but much more complex. This is SMP.
Exercise: Why is this MIMD, and not MISD?

SMP OS design considerations

- cache coherence: several CPUs, one shared memory. Each CPU has its own cache. What happens when CPU 1 writes to memory that CPU 2 has cached? This problem is usually solved by hardware designers, not OS designers.
- re-entrancy: several CPUs may call kernel simultaneously. Kernel code must be written to allow this.
- scheduling: genuine concurrency between threads. Also between kernel threads.
- memory: must maintain virtual memory consistency between processors (since each CPU has VM hardware support).
- fault tolerance: single CPU failure should not be catastrophic.

Scheduling

Scheduling happens over several time-scales and at several levels.
- Batch scheduling, long-term: which jobs should be started? Depends on, e.g., estimated resource requirements, tape drive requirements, . . .
- medium term: some OSes suspend or swap out processes to ameliorate resource contention. This is a medium term (seconds to minutes) procedure. We won't discuss it. Exercise: read up in the textbooks on suspension/swapout – which modern OSes do it?
- process scheduling, short-term: which process gets the CPU next? How long does it get?
We will consider mainly short-term scheduling here.
Scheduling Criteria

To schedule effectively, need to decide criteria for success! For example,
- good utilization: minimize the amount of CPU idle time
- good utilization: job throughput
- fairness: jobs should all get a 'fair' share of CPU . . .
- priority: . . . unless they're high priority
- response time: fast (in human terms) response to interactive input
- real-time: hard deadlines, e.g. chemical plant control
- predictability: avoid wild variations in user-visible performance
Balance is very system-dependent: on PCs, response time is important, utilization irrelevant; in a large financial data centre, throughput is vital.

Non-preemptive Policies

In a non-preemptive policy, once a job gets the CPU, it keeps it until it yields or needs I/O etc. Such policies are often suitable for long-term scheduling; not often used now for short-term. (Obviously poor for interactive response!)
- first-come-first-served (FCFS, FIFO, queue) – what it says. Favours long and CPU-bound processes over short or I/O-bound processes. Not often appropriate; but used as a sub-component of priority systems.
- shortest process next (SPN) – dispatch process with shortest expected processing time. Improves overall performance, favours short jobs. Poor predictability. How do you estimate expected time? For batch jobs (long-term), user can estimate; for short-term, can build up a (weighted) average CPU residency over time as the process executes, e.g. exponentially weighted averaging (see the sketch after this section).
- and others . . .

Preemptive Policies

Here we interrupt processes after some time (the quantum).
- round-robin: when the quantum expires, the running process is sent to the back of the ready queue. Good for general purposes. Tends to favour CPU-bound processes – can be refined to avoid this. How big should the quantum be? 'Slightly greater than the typical interaction time.' (How fast do you type?) Recent Linux kernels have a base quantum of around 50ms.
- shortest remaining time (SRT) – preemptive version of SPN. On quantum expiry, dispatch process with shortest expected running time. Tends to starve long CPU-bound processes. Estimation problem as for SPN.
- feedback: use dynamically assigned priorities:
  - new process starts in queue of priority 0 (highest);
  - each time it's pre-empted, goes to back of next lower priority queue;
  - dispatch first process in highest occupied queue.
  This tends to starve long jobs, esp. in interactive context. Possible solutions:
  - increase quantum for lower priority processes
  - raise priority for processes that are starved
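The exponentially weighted average mentioned under SPN can be sketched in a few lines of C (the weight 0.5 and the initial guess are just typical choices):

    /* Exponentially weighted CPU-burst estimate for SPN/SRT:
       next_estimate = alpha * last_burst + (1 - alpha) * old_estimate */
    static double estimate = 10.0;   /* initial guess, e.g. in ms */

    void record_burst(double last_burst) {
        const double alpha = 0.5;    /* recent bursts count more */
        estimate = alpha * last_burst + (1.0 - alpha) * estimate;
    }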
Scheduling evaluation: Suggested Reading

In your favourite OS textbook, read the chapter on basic scheduling. Study the section(s) on evaluation of scheduling algorithms. Aim to understand the principles of queueing analysis and simulation modelling for evaluating scheduler algorithms.
(E.g. Stallings 7/e chap 9 and online chap 20.)

Multiprocessor Scheduling

Scheduling for SMP systems involves:
- assigning processes to processors
- deciding on multiprogramming on each processor
- actually dispatching processes
processes to CPUs: Do we assign processes to processors statically (on creation), or dynamically? If statically, may have idle CPUs; if dynamically, complexity of scheduling is increased – esp. in SMP, where the kernel may be executing concurrently on several CPUs.
multiprogramming: Do we need to multiprogram on each CPU? 'Obviously, yes.' But if there are many CPUs, and the application is parallel at the thread level, it may be better (for response time) not to.

SMP scheduling: Dispatching

For process scheduling, performance analysis and simulation indicate that the differences between scheduling algorithms are much reduced in a multi-processor system. There may be no need to use complex systems: FCFS, or a slight variant, may suffice.
For thread scheduling, the situation is more complex. SMP allows many threads within a process to run concurrently; but because these threads are typically interacting frequently (unlike different user processes), it turns out that performance is sensitive to scheduling. Four main approaches:
- load sharing: idle processor selects ready thread from whole pool
- gang scheduling: a gang of related threads are simultaneously dispatched to a set of CPUs
- dedicated CPUs: static assignment of threads (within program) to CPUs
- dynamic scheduling: involve the application in changing number of threads; OS shares CPUs among applications 'fairly'.

Load sharing is simplest and most like the uniprocessing environment. As for process scheduling, FCFS works well. But it has disadvantages:
- the single pool of TCBs must be accessed with mutual exclusion – may be bottleneck, esp. on large systems
- preempted threads are unlikely to be rescheduled to same CPU; loses benefits of CPU cache (hence Linux, e.g., refines algorithm to try to keep threads on same CPU)
- program wanting all its threads running together is unlikely to get it – if threads are tightly coupled, could severely impact performance.
Most systems use load sharing, but with refinements or user-specifiable parameters to address some of the disadvantages. Gang scheduling or dedicated assignment may be used in special purpose (e.g. parallel numerical and scientific computation) systems.
Real-Time Scheduling

Real-time systems have deadlines. These may be hard: necessary for success of task, or soft: if not met, it's still worth running the task.
Deadlines give RT systems particular requirements in:
- determinism: need to acknowledge events (e.g. interrupt) within predetermined time
- responsiveness: and take appropriate action quickly enough
- user control: hardness of deadlines and relative priorities is (almost always) a matter for the user, not the system
- reliability: systems must 'fail soft'. panic() is not an option! Better still, they shouldn't fail.

RTOSes typically do not handle deadlines as such. Instead, they try to respond quickly to tasks' demands. This may mean allowing preemption almost everywhere, even in small kernel routines.
Suggested reading: read the section on real-time scheduling in Stallings (section 10.2).
Exercise: how does Linux handle real-time scheduling?

Concurrency

When multiprogramming on a uniprocessor, processes are interleaved in execution, but concurrent in the abstract. On multiprocessor systems, processes are really concurrent. This gives rise to many problems:
- resource control: if one resource, e.g. a global variable, is accessed by two processes, what happens? Depends on order of executions.
- resource allocation: processes can acquire resources and block, stopping other processes.
- debugging: execution becomes non-deterministic (for all practical purposes).

Concurrency – example problem

Suppose a server, which spawns a thread for each request, keeps count of the number of bytes written in some global variable bytecount.
If two requests are served in parallel, they look like

    serve request1                        serve request2
    tmp1 = bytecount + thiscount1;        tmp2 = bytecount + thiscount2;
    bytecount = tmp1;                     bytecount = tmp2;

Depending on the way in which the threads are scheduled, bytecount may be increased by thiscount1, thiscount2, or (correctly) thiscount1 + thiscount2.
Solution: control access to the shared variable: protect each read–write sequence by a lock which ensures mutual exclusion. (Remember Java synchronized.)
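In C with pthreads, the locking fix sketched above might look like this (bytecount and thiscount as in the example; the mutex is the added part):

    /* Protecting the bytecount update with a pthread mutex (sketch). */
    #include <pthread.h>

    static long bytecount = 0;
    static pthread_mutex_t bytecount_lock = PTHREAD_MUTEX_INITIALIZER;

    void add_bytes(long thiscount) {
        pthread_mutex_lock(&bytecount_lock);    /* enter critical section */
        bytecount = bytecount + thiscount;      /* read-write now atomic */
        pthread_mutex_unlock(&bytecount_lock);  /* leave critical section */
    }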
Mutual Exclusion

Allow processes to identify critical sections where they have exclusive access to a resource. The following are requirements:
- mutual exclusion must be enforced!
- processes blocking in noncritical section must not interfere with others
- processes wishing to enter critical section must eventually be allowed to do so
- entry to critical section should not be delayed without cause
- there can be no assumptions about speed or number of processors
A requirement on clients, which may or may not be enforced, is:
- processes remain in their critical section for finite time

Implementing Mutual Exclusion

How do we do it?
- via hardware: special machine instructions
- via OS support: OS provides primitives via system call
- via software: entirely by user code
Of course, OS support needs internal hardware or software implementation.
How do we do it in software?
We assume that mutual exclusion exists in hardware, so that memory access is atomic: only one read or write to a given memory location at a time. (True in almost all architectures.) (Exercise: is such an assumption necessary?)
We will now try to develop a solution for mutual exclusion of two processes, P0 and P1. (Let î mean 1 − i.)
Exercise: is it (a) true, (b) obvious, that doing it for two processes is enough?

Mutex – first attempt

Suppose we have a global variable turn. We could say that when Pi wishes to enter the critical section, it loops checking turn, and can proceed iff turn = i. When done, it flips turn. In pseudocode:

    while ( turn != i ) { }
    /* critical section */
    turn = î;

This has obvious problems:
- processes busy-wait
- the processes must take strict turns
although it does enforce mutex.

Mutex – second attempt

Need to keep state of each process, not just the id of the next process.
So have an array of two boolean flags, flag[i], indicating whether Pi is in critical. Then Pi does:

    while ( flag[î] ) { }
    flag[i] = true;
    /* critical section */
    flag[i] = false;

This doesn't even enforce mutex: P0 and P1 might check each other's flag, then both set own flags to true and enter critical section.
Mutex – third attempt

Maybe set one's own flag before checking the other's?

    flag[i] = true;
    while ( flag[î] ) { }
    /* critical section */
    flag[i] = false;

This does enforce mutex. (Exercise: prove it.)
But now both processes can set flag to true, then loop for ever waiting for the other! This is deadlock.

Mutex – fourth attempt

Deadlock arose because processes insisted on entering critical section and busy-waited. So if other process's flag is set, let's clear our flag for a bit to allow it to proceed:

    flag[i] = true;
    while ( flag[î] ) {
        flag[i] = false;
        /* sleep for a bit */
        flag[i] = true;
    }
    /* critical section */
    flag[i] = false;

OK, but now it is possible for the processes to run in exact synchrony and keep deferring to each other – livelock.

Mutex – Dekker's algorithm

Ensure that one process has priority, so will not defer; and give the other process priority after performing own critical section.

    flag[i] = true;
    while ( flag[î] ) {
        if ( turn == î ) {
            flag[i] = false;
            while ( turn == î ) { }
            flag[i] = true;
        }
    }
    /* critical section */
    turn = î;
    flag[i] = false;

Optional Exercise: show this works. (If you have lots of time.)

Mutex – Peterson's algorithm

Peterson came up with a much simpler and more elegant (and generalizable) algorithm.

    flag[i] = true;
    turn = î;
    while ( flag[î] && turn == î ) { }
    /* critical section */
    flag[i] = false;

Compulsory Exercise: show that this works. (Use textbooks if necessary.)
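For reference, a C11 rendering of Peterson's algorithm is sketched below. Caveat: on modern hardware the plain-variable version is broken by instruction and memory reordering; the sequentially consistent atomics used here restore the simple interleaving model that the correctness argument assumes.

    /* Peterson's algorithm with C11 seq_cst atomics (sketch). */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool flag[2];
    static atomic_int  turn;

    void lock(int i) {                 /* i is 0 or 1; other plays î */
        int other = 1 - i;
        atomic_store(&flag[i], true);
        atomic_store(&turn, other);
        while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
            ;                          /* busy-wait */
    }

    void unlock(int i) {
        atomic_store(&flag[i], false);
    }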
Mutual Exclusion: Using Hardware Support

On a uniprocessor, mutual exclusion can be achieved by preventing processes from being interrupted. So just disable interrupts! This technique is used extensively inside many OSes. Forbidden to user programs for obvious reasons. Can't be used in long critical sections, or we may lose interrupts.
This doesn't work in SMP systems. A number of SMP architectures provide special instructions. E.g. S/390 provides TEST AND SET, which reads a bit in memory and then sets it to 1, atomically as seen by other processors. This allows easy mutual exclusion: have shared variable token, then a process grabs the token using test-and-set:

    while ( test-and-set(token) == 1 ) { }
    /* critical section */
    token = 0;

This is still busy-waiting. Deadlock is possible: low priority process grabs the token, then high priority process pre-empts and busy waits for ever.
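C11 exposes essentially this primitive as atomic_flag; a minimal test-and-set spinlock sketch (with the same busy-waiting caveat):

    /* Test-and-set spinlock using C11's atomic_flag. */
    #include <stdatomic.h>

    static atomic_flag token = ATOMIC_FLAG_INIT;

    void spin_lock(void) {
        /* atomically read old value and set to 1; loop while it was 1 */
        while (atomic_flag_test_and_set(&token))
            ;                            /* busy-wait */
    }

    void spin_unlock(void) {
        atomic_flag_clear(&token);       /* token = 0 */
    }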

Semaphores

Dijkstra provided the first general-purpose abstract technique for OS and programming language control of concurrency.
A semaphore is a special (integer) variable s, which can be accessed only by the following operations:
- init(s,n): create the semaphore and initialize it to the non-negative value n.
- wait(s): the semaphore value is decremented. If the value is now negative, the calling process is blocked.
- signal(s): the semaphore is incremented. If the value is non-positive, one process blocked on wait is unblocked.
It is traditional, following Dijkstra, to use P (proberen) and V (verhogen) for wait and signal.

Types of semaphore

A semaphore is called strong if waiting processes are released FIFO; it is weak if no guarantee is made about the order of release. Strong semaphores are more useful and generally provided; henceforth, all semaphores are strong.
A binary or boolean semaphore takes only the values 0 and 1: wait decrements from 1 to 0, or blocks if already 0; signal unblocks, or increments from 0 to 1 if there are no blocked processes.
Recommended Exercise: Show how to use a private integer variable and two binary semaphores in order to implement a general semaphore. (Please think about this before looking up the answer!)

Implementing Semaphores

How do we implement a semaphore? Need an integer variable and a queue of blocked processes, protected against concurrent access.
Use any of the mutex techniques discussed earlier. So what have we bought by implementing semaphores?
Answer: the mutex problem (and the associated busy-waiting) are confined inside just two (or three) system calls. User programs do not need to busy-wait; only the OS busy-waits, and only during the (short) implementation of semaphore operations.
Using Semaphores

A semaphore gives an easy solution to user level mutual exclusion, for any number of processes. Let s be a semaphore initialized to 1. Then each process just does:

    wait(s);
    /* critical section */
    signal(s);

Exercise: what happens if s is initialized to m rather than 1?

The Producer–Consumer Problem

General problem occurring frequently in practice: a producer repeatedly puts items into a buffer, and a consumer takes them out. Problem: make this work, without delaying either party unnecessarily. (Note: can't just protect the buffer with a mutex lock, since the consumer needs to wait when the buffer is empty.)
Can be solved using semaphores. Assume the buffer is an unlimited queue. Declare two semaphores: init(n,0) (tracks number of items in buffer) and init(s,1) (used to lock the buffer).

    Producer loop                    Consumer loop
    datum = produce();               wait(n);
    wait(s);                         wait(s);
    append(buffer,datum);            datum = extract(buffer);
    signal(s);                       signal(s);
    signal(n);                       consume(datum);

Exercise: what happens if the consumer's wait operations are swapped?
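The scheme above transcribes almost directly into POSIX semaphores; a sketch, with produce(), consume(), append() and extract() assumed defined elsewhere:

    /* Producer-consumer with POSIX semaphores, following the slide's scheme. */
    #include <semaphore.h>

    extern int  produce(void);        /* assumed defined elsewhere */
    extern void append(int);
    extern int  extract(void);
    extern void consume(int);

    static sem_t n;   /* counts items in buffer; init to 0 */
    static sem_t s;   /* locks the buffer; init to 1 */

    void init_sems(void) {
        sem_init(&n, 0, 0);
        sem_init(&s, 0, 1);
    }

    void producer_step(void) {
        int datum = produce();
        sem_wait(&s);                 /* lock buffer */
        append(datum);
        sem_post(&s);                 /* unlock buffer */
        sem_post(&n);                 /* one more item */
    }

    void consumer_step(void) {
        int datum;
        sem_wait(&n);                 /* wait for an item */
        sem_wait(&s);
        datum = extract();
        sem_post(&s);
        consume(datum);
    }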
Monitors

Because solutions using semaphores have wait and signal separated in the code, they are hard to understand and check.
A monitor is an 'object' which provides some methods, all protected by a blocking mutex lock, so only one process can be 'in the monitor' at a time. Monitor local variables are only accessible from monitor methods.
Monitor methods may call:
- cwait(c) where c is a condition variable confined to the monitor: the process is suspended, and the monitor released for another process.
- csignal(c): some process suspended on c is released and takes the monitor.
Unlike semaphores, csignal does nothing if no process is waiting.
What's the point? The monitor enforces mutex; and all the synchronization is inside the monitor methods, where it's easier to find and check.
This version of monitors has some drawbacks; there are refinements which work better.

The Readers/Writers Problem

A common situation is to have a resource which may be read by many processes at once, but any read must block a write; and which can be written by only one process at once, blocking all other access.
This can be solved using semaphores. There are design decisions: do readers have priority? Or writers? Or do they all go into a common queue?
Suggested Reading: read about the problem in your OS textbook (e.g. Stallings 7/e 5.6).
Examples include:
- Unix file locks: many Unices provide read/write locking on files. See man fcntl on Linux.
- The OS/390 ENQ system call provides general purpose read/write locks.
- The Linux kernel uses 'read/write semaphores' internally. See lib/rwsem-spinlock.c.
Message Passing

Many systems provide message passing services. Processes may send and receive messages to and from each other.
send and receive may be blocking or non-blocking when there is no receiver waiting or no message to receive. Most usual is non-blocking send and blocking receive.
If message passing is reliable, it can be used for mutex and synchronization:
- simple mutex by using a single message as a token
- producer/consumer: producer sends data as messages to consumer; consumer sends null messages to producer to acknowledge consumption.
Message-passing is implemented using fundamental mutex techniques.

Deadlock

We have already seen deadlock. In general, deadlock is the permanent blocking of two (or more) processes in a situation where each holds a resource the other needs, but will not release it until after obtaining the other's resource:

    Process P        Process Q
    acquire(A);      acquire(B);
    acquire(B);      acquire(A);
    release(A);      release(B);
    release(B);      release(A);

Some example situations are:
- A is a disk file, B is a tape drive.
- A is an I/O port, B is a memory page.
Another instance of deadlock is message passing where two processes are each waiting for the other to send a message.

Preventing Deadlock

Deadlock requires three facts about system policy to be true:
- resources are held by only one process at a time
- a resource can be held while waiting for another
- processes do not unwillingly lose resources
If any of these does not hold, deadlock does not happen. If they are true, deadlock may happen if
- a circular dependency arises between resource requests
The first three can to some extent be prevented from holding, but not practically so. However, the fourth can be prevented by ordering resources, and requiring processes to acquire resources in increasing order.

Avoiding Deadlock

A more refined approach is to deny resource requests that might lead to deadlock. This requires processes to declare in advance the maximum resource they might need. Then when a process does request a resource, analyse whether granting the request might result in deadlock.
How do we do the analysis? If we grant the request, is there sufficient resource to allow one process to run to completion? And when it finishes (and releases its resources), can we run another? And so on. If not, we should deny (block) the original request.
Suggested Reading: Look up the banker's algorithm.
Deadlock Detection

Even if we don't use deadlock avoidance, similar techniques can be used to detect whether deadlock currently exists. What can we do then?
- kill all deadlocked processes (!)
- selectively kill deadlocked processes
- forcibly remove resources from some processes (what does the process do?)
- if checkpoint-restart is available, roll back to pre-deadlock point, and hope it doesn't happen next time (!)

Memory Management

The OS needs memory; the user program needs memory. In a multiprogramming world, each user process needs memory. They each need memory for:
- code (instructions, text): the program itself
- static data: data compiled into the program
- dynamic data: heap, stack
Memory management is the problem of providing this. Key requirements:
- relocation: moving programs in memory
- allocation: assigning memory for processes
- protection: preventing access to other processes' memory. . .
- sharing: . . . except when appropriate
- logical organization: how memory is seen by process
- physical organization: and how it is arranged in hardware

Relocation and Address Binding

When we load the contents of a static variable into a register, where is the variable in memory? When we branch, where do we branch to?
If programs are always loaded at the same place, we can determine this at compile time.
But in multiprogramming, we can't predict where a program will be loaded. So, the compiler can tag all memory references, and make them relative to the start of the program. Then a relocating loader loads the program at location X, say, and adds X to all memory addresses in the program. Expensive. And what if the program is swapped out and brought back elsewhere?

Writing relocatable code

One way round: provide hardware instructions that access memory relative to a base register, and have the programmer use these. The program loader then sets the base register, but nothing else.
E.g. in S/390, a typical instruction is

    L R13,568(R12)

meaning 'load register 13 with the value in address (contents of register 12 plus 568)'. The programmer (or assembler/compiler) makes all memory refs of this form; the programmer or OS loads R12 with the appropriate value.
This requires explicit programming: why not have the hardware and OS do it?
Segmentation

A segment is a portion of memory starting at an address given in a base register B. The OS loads a value b into B. When the program refers to memory address x, hardware transparently translates it to x + b.
To achieve protection, can add a limit register L. The OS loads L with the length of the segment l. Then if x > l, raise an address fault (exception). (Origin of the Unix error message 'Segmentation fault'.)
[figure: the logical address from the CPU is compared with the limit; if over the limit, address fault; otherwise the base is added to give the physical address in memory]

Partitioning

Segmentation allows programs to be put into any available chunk of memory. How do we partition memory between various processes?
- fixed partitioning: divide memory into fixed chunks. Disadvantage: small process in large chunk is wasteful. Example: OS/MFT.
- dynamic partitioning: load process into suitable chunk; when it exits, free the chunk, maybe merging with neighbouring free chunks. Disadvantage: (external) fragmentation – memory tends to get split into small chunks. May need to swap out a running process to make room for a higher priority new process. How do we choose chunks?
  - first fit: choose first big enough chunk
  - next fit: choose first big enough chunk after last allocated chunk
  - best fit: choose chunk with least waste
  First fit is generally best: next fit fragments a bit more; best fit fragments a lot.

Partitioning – the Buddy System

Compromise between fixed and dynamic.
- Memory is maintained as a binary tree of blocks of sizes 2^k for L ≤ k ≤ U, suitable upper and lower bounds.
- When a process of size s, 2^(i−1) < s ≤ 2^i, comes in, look for a free block of size 2^i. If none, find (recursively) a block of size 2^(i+1) and split it in two.
- When blocks are freed, merge free sibling nodes ('buddies') to re-create bigger blocks.
Variants on the buddy system are still used, e.g. in allocating memory within the Linux kernel. (I.e. memory for use by the kernel.)
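The two core computations of a buddy allocator fit in a few lines of C (a sketch only; the order bounds are invented and a real free-list structure is omitted):

    /* Buddy-system helpers (sketch). A block of order k has size 2^k;
       the buddy of the block at offset x is found by flipping bit k. */
    #include <stddef.h>

    #define MIN_ORDER 5     /* L: smallest block 32 bytes (invented) */
    #define MAX_ORDER 20    /* U: largest block 1 MB (invented) */

    /* smallest order i with 2^i >= s */
    int order_for(size_t s) {
        int i = MIN_ORDER;
        while (((size_t)1 << i) < s && i < MAX_ORDER)
            i++;
        return i;
    }

    /* offset of the buddy of the order-k block at 'offset' */
    size_t buddy_of(size_t offset, int k) {
        return offset ^ ((size_t)1 << k);
    }

On free, a real allocator checks whether buddy_of() is also free; if so, the two merge into one block of order k+1, recursively.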

Multiple Segments

Can extend segmentation to have multiple segments per program:
- hardware/OS provide different segments for different types of data, e.g. text (code), data (static data), stack (dynamic data). (How do you tell what sort of address is being used?)
- hardware/OS provides multiple segments at user request.
- logical memory address viewed as pair (s, o)
- process has a segment table: look up entry s in the table to get base and limit b_s, l_s
- translate as normal to o + b_s, or raise a fault if o > l_s

Exercise: look up how segmentation is done on the Intel x86 architecture.


Segmentation has some advantages:
- may correspond to user view of memory.
- importantly, protection can be done per segment: each segment can be protected against, e.g., read, write, execute.
- makes sharing of code/data easy. (But better to have a single list of segment descriptors, and have process segment tables point into that, than to duplicate information between processes.)
and some disadvantages:
- variable size segments lead to external fragmentation again;
- may need to compact memory to reduce fragmentation;
- small segments tend to minimize fragmentation, but annoy the programmer.

Paging

Small segments reduce fragmentation; variable size segments introduce various problems. A special case would be to have many small fixed-size segments always provided – invisibly to the programmer. This is paging.
Virtual storage is divided in pages of fixed size (typically 4KB). Each page is mapped to a frame of real storage, by means of a page table.
[figure: logical address (p, o) is translated via page table entry p to frame f, giving physical address (f, o)]
A page table entry includes a valid bit, since not all pages may have frames. The start and length of the page table are held in control registers, as for segmentation. It may also include protection via protection bit(s) in the page table entry, e.g. read, write, execute, supervisor-mode-only, etc.

Translation Lookaside Buffer

With paging (or segmentation), each logical memory reference needs two (or more) physical memory references. A translation lookaside buffer (TLB) is a special cache for keeping recently used paging information. The TLB is an associative cache mapping page address directly to frame address.
[figure: logical address (p, o) is looked up in the TLB (p1 → f1, p2 → f2, . . . ); on a miss, the page table is consulted; the result is physical address (f, o)]
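The translation itself is just bit manipulation; a sketch in C for 4KB pages (the page-fault handler is an invented name):

    /* Sketch of paged address translation with 4KB pages (12-bit offset). */
    #include <stdint.h>

    #define PAGE_SHIFT  12
    #define PAGE_SIZE   (1u << PAGE_SHIFT)     /* 4096 */
    #define OFFSET_MASK (PAGE_SIZE - 1)

    extern uint32_t handle_page_fault(uint32_t vaddr);  /* invented */

    /* page_table[p] holds the frame number; valid[] models the valid bit */
    uint32_t translate(uint32_t vaddr, const uint32_t *page_table,
                       const int *valid) {
        uint32_t p = vaddr >> PAGE_SHIFT;      /* page number */
        uint32_t o = vaddr & OFFSET_MASK;      /* offset within page */
        if (!valid[p])
            return handle_page_fault(vaddr);
        return (page_table[p] << PAGE_SHIFT) | o;   /* frame, offset */
    }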


Like all caches, the TLB introduces a coherency problem.
When the process context switches, the active page table changes: must flush the TLB.
When a page is freed, must invalidate its entry in the TLB.
Note that the TLB also caches protection bits; changes in protection bits must invalidate the TLB entry.

Multi-level Paging

Modern systems have an address space of at least 2^31 bytes, or 2^19 4K pages. That's a page table several megabytes long: one for each process. . .
Modern systems have two (or more) levels of page table:
[figure: virtual address split into P1, P2 and Offset; a base register points to the L1 page table, indexed by P1 to find an L2 page table, indexed by P2 to find the leaf PTE]
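A two-level walk is then two such lookups; a simplified C sketch using the classic 10+10+12-bit split (simplified: the L1 table here holds plain pointers rather than real PTEs):

    /* Sketch of a two-level page table walk (10+10+12-bit split). */
    #include <stdint.h>

    #define L1_SHIFT  22
    #define L2_SHIFT  12
    #define IDX_MASK  0x3FFu     /* 10 bits */
    #define OFF_MASK  0xFFFu     /* 12 bits */
    #define PTE_VALID 1u

    extern uint32_t fault(uint32_t vaddr);   /* invented fault handler */

    uint32_t walk(uint32_t vaddr, const uint32_t *const *l1_table) {
        const uint32_t *l2 = l1_table[(vaddr >> L1_SHIFT) & IDX_MASK];
        if (l2 == 0) return fault(vaddr);              /* no L2 table */
        uint32_t pte = l2[(vaddr >> L2_SHIFT) & IDX_MASK];
        if (!(pte & PTE_VALID)) return fault(vaddr);   /* invalid leaf PTE */
        return (pte & ~OFF_MASK) | (vaddr & OFF_MASK); /* frame | offset */
    }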

Sharing Pages

Memory can be shared by having different pages map to the same frame.
For code, need re-entrant code: stateless, not self-modifying.
Otherwise, use copy-on-write:
- mark the pages read-only in each process (using protection bits in the page table);
- when a process writes, it generates a protection exception;
- OS handles the exception by allocating a new frame, copying the shared page, and updating the process's page table

Virtual Memory

Pages do not have to be in real memory all the time! We can store them on disk when not needed.
- initialize process's page table with invalid entries;
- on first reference to page, get exception: handle it, allocate frame, update page table entry;
- when real memory gets tight, choose some pages, write them to disk, invalidate them, and free the frames for use elsewhere;
- when process refers to page on disk, get exception; handle by reading in from disk (if necessary paging out some other page).
OS often uses the frame-address portion of an invalid page table entry to keep its location on disk.


Hardware support for VM usually includes:
I a modified bit for each page: no need to write out a page if it has not
changed since it was last read in;
I a referenced bit or counter: unreferenced pages are the first
candidates for freeing.
Architectures differ in where this happens:
I On Intel, the modified and referenced bits are part of the page table
entry.
I On S/390, they are part of the storage key associated with each real
frame.
Exercise: What, if any, difference does this make to the OS memory
management routines?

Combined Paging and Segmentation: S/390

The concepts of paging and segmentation can be combined.
In S/390, they are intertwined, and can be seen as a 2-level paging system.
I Logical address is 31 bits:
I first 11 bits index into the current segment table;
I next 8 bits index into the page table;
I remaining bits are the offset.
Page tables can be paged out, by marking their entries invalid in the
segment table.
For normal programming, there is only one segment table per process.
Other segment tables (up to 16) are used by special-purpose instructions
for moving data between address spaces.

Combined Paging and Segmentation: Intel

Intel has full-blown segmentation and independent paging.
I Logical address is a 16-bit segment id and a 32-bit offset.
I Segment id indexes into a segment table; but
I the segment id portion of the logical address is found via a segment
register;
I which is usually implicit in the access type (CS register for instruction
accesses, DS for data, SS for stack, ES for string data), but can be
specified to be in any of six segment registers (there are exceptions).
I Segment registers are part of the task context. (Task context is
stored in special system segments!)
I May be a single global segment table; may also have task-specific
segment tables.
The result of segment translation is a 32-bit linear address.

Completely independently, the linear address goes through a two-level
paging system.
I Segment-related info (e.g. segment tables) can be paged out; so can
second-level page tables.
I There is no link between pages and segments: segments need not lie
on page boundaries.
I Pages can be 4KB, or 4MB.
I The page table register is part of the task context, stored in the task
segment (!).
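As a sketch of the S/390-style split described above (11-bit segment index,
8-bit page index, 12-bit offset in a 31-bit address), in C; the table layout
and names are invented for illustration, and the real translation is of
course done by the hardware:

    #include <stdint.h>

    #define SEG_INDEX(va)   (((va) >> 20) & 0x7FF)  /* top 11 of 31 bits */
    #define PAGE_INDEX(va)  (((va) >> 12) & 0xFF)   /* next 8 bits */
    #define PAGE_OFFSET(va) ((va) & 0xFFF)          /* final 12 bits */

    /* Walk segment table -> page table -> frame, as in a 2-level pager. */
    uint32_t translate_s390(uint32_t va, uint32_t **segment_table)
    {
        uint32_t *page_table = segment_table[SEG_INDEX(va)];
        uint32_t frame = page_table[PAGE_INDEX(va)];
        return (frame << 12) | PAGE_OFFSET(va);
    }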
Paging Policies

In such a virtual memory system, the OS has to decide when to page in
and out. What are the criteria?
I minimize the number of page faults: avoid paging out pages that will
soon be needed;
I minimize disk I/O: avoid reclaiming dirty (modified) pages.

Fetch Policies

When should a page be brought back into main memory from disk?
I demand paging: when referenced. The locality principle suggests
this should work well after an initial burst of activity.
I prepaging: try to bring in pages ahead of demand, exploiting
characteristics of disks to improve efficiency.
Prepaging was not shown to be effective, and has been little, if at all,
used.
A few years ago it became a live issue again with a study suggesting it
can now be useful.
http://www.cs.amherst.edu/~sfkaplan/research/prepaging/
Windows now prepages application programs based on your pattern of use
throughout the day.

Replacement Policy

When memory runs out, and a page is brought in, which page gets paged
out?
Aim: page out the page with the longest time until its next reference.
(This provably minimizes page faults.) In the absence of clairvoyance, we
can try:
I LRU – least recently used: choose the page with the longest time
since its last reference. This is almost optimal – but would have very
high overhead, even if hardware supported it.
I FIFO – first in, first out: simple, but pages out heavily used pages.
Performs poorly.
I clock policy: attempts to get some of the performance of LRU
without the overhead. See below.

Clock Replacement Policy

Makes use of the 'use' (accessed) bit provided by most hardware.
Put frames in a circular list 0, . . . , n − 1. Have an index i. When looking
for a page to replace, do:
    increment i;
    while (frame i used) {
        clear use bit on frame i;
        increment i; }
    return i;
Hence it doesn't choose a page unless it has been unreferenced for one
complete pass through storage. The clock algorithm performs reasonably
well, about 25% worse than LRU.
Enhance to reduce I/O: scan only unmodified frames, without clearing
use bits. If this fails, scan modified frames, clearing use bits. If this fails,
repeat from the beginning.
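Fleshed out into runnable C, the clock scan above might look like this;
NFRAMES, the use[] array and the static hand are assumptions of this toy
model, not any particular kernel's data structures:

    #define NFRAMES 8

    static int use[NFRAMES];  /* 'accessed' bit per frame, set by hardware */
    static int hand = 0;      /* the clock hand */

    int clock_victim(void)    /* returns the frame number to replace */
    {
        hand = (hand + 1) % NFRAMES;
        while (use[hand]) {           /* referenced since the last pass: */
            use[hand] = 0;            /* give it a second chance */
            hand = (hand + 1) % NFRAMES;
        }
        return hand;                  /* unreferenced for a full pass */
    }

The loop always terminates: it clears each use bit it passes, so after at
most one full sweep some frame is found clear.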
Page Caching

Many systems (including Linux) use a clock-like algorithm with the
addition of caches or buffers:
I When a page is replaced, it's added to the end of the free page list if
clean, or the modified page list if dirty.
I The actual frame used for the paged-in page is the head of the free
page list.
I If there are no free pages, or when the modified list gets beyond a
certain size, write out the modified pages and move them to the free
list.
This means that
I pages in the caches can be instantly restored if referenced again;
I I/O is batched, and therefore more efficient.
Linux allows you to tune various parameters of the paging caches. It also
has a background kernel thread that handles actual I/O; this also 'trickles
out' pages to keep a certain amount of memory free most of the time, to
make allocation fast.

Resident Set Management

In the previous schemes, when a process page faults, some other process's
page may be paged out. An alternative view is to manage independently
the resident set of each process:
I allocate a certain number of frames to each process (on what
criteria?);
I after a process reaches its allocation, if it page faults, choose some
page of that process to reclaim;
I re-evaluate the resident set size (RSS) from time to time.
How do we choose the RSS? The working set of a process over time ∆
is the set of pages referenced in the last ∆ time units. Aim to keep the
working set in memory (for what ∆?).
Working sets tend to be stable for some time (locality), and change to a
new stable set every so often ('interlocality transitions').

Actually tracking the working set is too expensive. Some approximations
are:
I page fault frequency: choose a threshold frequency f . On a page
fault:
I if the (virtual) time since the last fault is < 1/f , add one page to
the RSS; otherwise
I discard unreferenced pages, and shrink the RSS; clear use bits on
other pages.
Works quite well, but gives poor performance in interlocality transitions.
I variable-interval sampled working set: at intervals,
I evaluate the working set (clear use bits at start, check at end);
I make this the initial resident set for the next interval;
I add any faulted-in pages (i.e. shrink the RS only between intervals);
I the interval is every Q page faults (for some Q), subject to upper
and lower virtual time bounds U and L;
I tune Q, U, L according to experience. . .

Input/Output

I/O is the messiest part of most operating systems:
I dealing with wildly disparate hardware
I with speeds from 10^2 to 10^9 bps
I and applications from human communication to data storage
I varying complexity of device interface (e.g. line printer vs disk)
I data transfer sizes from 1 byte to megabytes
I in many different representations and encodings
I and giving many idiosyncratic error conditions
Uniformity is almost impossible.
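Going back to the page-fault-frequency approximation above, a toy sketch
in C; the threshold, vtime() and the helper functions are all hypothetical:

    extern unsigned long vtime(void);          /* per-process virtual time */
    extern void grow_rss(void);                /* allocate one more frame */
    extern void drop_unreferenced_pages(void); /* shrink RSS, clear use bits */

    #define MIN_GAP 1000    /* = 1/f in virtual time units; illustrative */

    void on_page_fault(void)
    {
        static unsigned long last_fault;
        unsigned long now = vtime();

        if (now - last_fault < MIN_GAP)
            grow_rss();                 /* faulting too often: enlarge RSS */
        else
            drop_unreferenced_pages();  /* faulting rarely: trim the RSS */

        last_fault = now;
    }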
I/O Techniques

The techniques for I/O have evolved (and sometimes unevolved):
I direct control: CPU controls the device by reading/writing data lines
directly.
I polled I/O: CPU communicates with hardware via a built-in
controller; busy-waits for completion of commands.
I interrupt-driven I/O: CPU issues a command to the device, gets an
interrupt on completion.
I direct memory access: CPU commands the device, which transfers
data directly to/from main memory (the DMA controller may be a
separate module, or on the device).
I I/O channels: the device has a specialized processor, interpreting a
special command set. The CPU asks the device to execute an entire
I/O program.
Terminology warning: Stallings uses 'programmed I/O' for 'polled I/O'; but the
PIO (programmed I/O) modes of PC disk drives are (optionally but usually)
interrupt-driven.

Programmed/Polled I/O

The device has registers, accessible via the system bus. For output:
I CPU places data in the data register;
I CPU puts a write command in the command register;
I CPU busy-waits reading the status register until the ready flag is set.
Similarly for input, where the CPU reads from the data register.

Interrupt-driven I/O

Recall the basic interrupt technique from an earlier lecture.
The interrupt handler is usually split into a device-independent prologue
(sometimes called the 'interrupt handler') and a device-dependent body
(sometimes called the 'interrupt service routine'). The prologue saves
context (if required) and does any interrupt demuxing; the body does
device-specific work, e.g. acknowledge the interrupt, read data, move it to
user space.
ISRs need to run fast (so the next interrupt can be handled), but may also
need to do complex work; therefore they often schedule the non-urgent
part to run later. (Linux 'bottom halves' (2.2 and before) or 'tasklets'
(2.4), MVS 'service request blocks'.)

DMA

A DMA controller accesses memory via the system bus, and devices via
the I/O bus. To use the system bus, it steals cycles: it takes mastery of
the bus for a cycle, causing the CPU to pause.
The CPU communicates (as bus master) with the DMA controller via the
usual bus technique: it puts the address of the memory to be read/written
on the data lines, the address of the I/O device on the address lines, and
read/write on the command lines. The DMA controller handles the
transfer between memory and device, and interrupts the CPU when
finished.
Note: DMA interacts with paging! Can't page out a page involved in
DMA. Solutions: either lock the page into memory, or copy to a buffer in
kernel memory and use that instead.
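The polled-output loop under Programmed/Polled I/O above, written out
in C. The register addresses and bit layout are invented for illustration (a
real device's datasheet defines them); volatile forces the compiler to
re-read the status register on every iteration of the busy-wait:

    #include <stdint.h>

    #define DEV_BASE 0xFF001000UL
    #define DATA   (*(volatile uint8_t *)(DEV_BASE + 0)) /* data reg */
    #define CMD    (*(volatile uint8_t *)(DEV_BASE + 1)) /* command reg */
    #define STATUS (*(volatile uint8_t *)(DEV_BASE + 2)) /* status reg */

    #define CMD_WRITE  0x01
    #define STAT_READY 0x80

    void polled_putc(uint8_t byte)
    {
        DATA = byte;                   /* place data in the data register */
        CMD = CMD_WRITE;               /* issue the write command */
        while (!(STATUS & STAT_READY)) /* busy-wait on the ready flag */
            ;
    }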
I/O Channels

IBM mainframe peripherals have always had sophisticated controllers
called 'channels'.
The operating system builds a channel program (with commands including
data transfer, conditionals and loops) in main memory, and issues a
START SUBCHANNEL instruction. The channel executes the entire program
before interrupting the CPU.
Channels and devices are themselves organized into a complex
communication network to achieve maximum performance.
IBM mainframe disk drives (DASD (direct access storage device)
volumes) are themselves much more sophisticated than PC disks, with
built-in facilities for structured (key, value) records and built-in searching
on keys: designed particularly for database applications.

Taming I/O programming

[Figure: layered I/O structure – an unprivileged application-I/O interface
(virtual device layer); privileged I/O buffering, I/O scheduling and
common I/O functions; a device driver layer, one driver per device; and
the hardware device layer (keyboard, hard disk, network).]

So far as possible, confine device-specific code to a small, low layer, and
write higher-level code in terms of abstract device classes.

Many systems classify devices into broad classes:
I character: terminals, printers, keyboards, mice, . . . typically transfer
data a byte at a time, don't store data.
I block: disk, CD-ROM, tape, . . . transfer data in blocks (fixed or
variable size), usually store data.
I network: ethernet etc.; tend to have mixed characteristics and need
idiosyncratic control.
I other: clocks etc.
Unix has the 'everything is a file' philosophy: devices appear as (special)
files. If read/write makes sense, you (the application programmer) can do
it; device-specific functions are available via the ioctl system call on the
device file.
But somebody still has to write the device driver!

Disk Basics

Disks are the main storage medium, and their physical characteristics give
rise to special considerations.
I A typical modern disk drive comprises several platters, each a thin
disk coated with magnetic material.
I A comb of heads is on a movable arm, with one head per surface.
I If the heads stay still, they access circles on the spinning platters.
One circle is called a track; the set of tracks is called a cylinder.
I Often, tracks are divided into fixed-length sectors.
Consequently, to access data in a given sector, need to:
I move the head assembly to the right cylinder (around 4 ms on
modern disks);
I wait for the right sector to rotate beneath the head (around 5 ms on
modern disks).
Disk scheduling is the art of minimizing these delays.
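Given that geometry, a (cylinder, head, sector) triple maps to a linear
block number in the conventional way; the geometry constants below are
examples only (sectors are traditionally numbered from 1, cylinders and
heads from 0):

    #define HEADS             16   /* surfaces, i.e. tracks per cylinder */
    #define SECTORS_PER_TRACK 63

    unsigned long chs_to_lba(unsigned c, unsigned h, unsigned s)
    {
        return ((unsigned long)c * HEADS + h) * SECTORS_PER_TRACK + (s - 1);
    }

Consecutive block numbers thus fill a track, then the next head on the
same cylinder, and only then move the arm – matching the relative costs
of the delays above.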
Disk Scheduling

If I/O requests from all programs are executed as they arrive, we can
expect much time to be wasted in seek and rotation delays. Try to avoid
this. How?
If we don't know the current disk position, all we can do is use expected
properties of disk access. Because of locality, LIFO may actually work
quite well. If we do know the current position, we can do intelligent
scheduling:
I SSTF (shortest service time first): do the request with the shortest
seek time.
I SCAN: move the head assembly from out to in and back again,
servicing requests for each cylinder as it's reached. Avoids
starvation; is harmed by locality.
I C-SCAN: scan in one direction only, then flip back. Avoids bias
towards extreme tracks.
I FSCAN, N-step-SCAN: avoid long delays by servicing only a quota
(N-step-SCAN) of requests per cylinder, or (FSCAN) only those
requests that arrived before the start of the current scan.

RAID

Disks are slow, and store critical information. RAID (redundant array
of independent (orig. inexpensive) disks) is a suite of techniques for
improving failure resistance and performance.
The basic idea is to view several ('small', cheap) physical disks as one
logical volume. There are seven levels of RAID.
I Level 0: data are striped across n disks. Data is divided into strips;
the first n logical strips are placed in the first physical strip of each
disk. Thus n consecutive logical strips can be read in parallel. The
choice of strip size depends on the application: high transfer rate
(small strips → high parallelism within one request), or high I/O
request rate (large strips → several different requests in parallel).
I Level 1: data are mirrored (duplicated) on each disk. Protects
against disk failure; no overhead, instant recovery.
I Level 2: data are striped in small (byte or word) strips across some
disks, with an error checksum (Hamming code) striped across other
disks. Overkill; not used.
I Level 3: same, but using only parity bits, stored on another disk. If
one disk fails, data can be read with on-the-fly parity computation;
the failed disk can then be regenerated. Has write overhead.
I Level 4: large data strips, as for level 0, with an extra parity strip on
another disk. Write overhead again; bottleneck on the parity disk.
I Level 5: as level 4, but distribute the parity strips across the disks,
avoiding the bottleneck.
I Level 6: data striping across n disks, with two different checksums
on two other disks (usually one simple parity check, one more
sophisticated checksum). Designed for very high reliability
requirements.

File Organization

Unix users are used to the idea of a file as an unstructured stream of
bytes. This is not universally the case. Structural hierarchy is often
provided at OS level:
I field: basic element of data. May be typed (string, integer, etc.).
May be of fixed or variable length. The field name may be explicit,
or implicit in its position in the record.
I record: collection of related fields, relating to one entity. May be of
fixed or variable length, and have fixed or variable fields.
I file: collection of records forming a single object at OS and user
level. (Usually) has a name, and is entered in directories or
catalogues. Usual unit of access control.
I database: collection of related data, often in multiple files, satisfying
certain design properties. Usually not at OS level. See the database
course.
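Referring back to RAID level 0: the striping arithmetic that sends n
consecutive logical strips to n different disks is just division and
remainder. NDISKS and the struct are illustrative:

    #define NDISKS 4

    struct stripe_loc { unsigned disk; unsigned strip; };

    struct stripe_loc locate(unsigned logical_strip)
    {
        struct stripe_loc loc;
        loc.disk  = logical_strip % NDISKS; /* round-robin across disks */
        loc.strip = logical_strip / NDISKS; /* position within that disk */
        return loc;
    }

Logical strips 0–3 land on disks 0–3 at strip 0, strips 4–7 at strip 1, and
so on, so a request spanning NDISKS consecutive strips can be served by
all the disks in parallel.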
Layers of Access to File Data

As usual, access is split into conceptual layers:
I device drivers: already covered.
I physical I/O: reading/writing blocks on disk. Already covered.
I basic I/O system: connects file-oriented I/O to physical I/O.
Scheduling, buffering etc. happen at this level.
I logical I/O: presents the application programmer with a (hopefully
uniform) view of files and records.
I access methods: provide the application programmer with routines
for indexed etc. access to files.

File Organization

Within the file, how are records structured and accessed? This may be the
concern of the application program only (e.g. Unix), or built in to the
operating system (e.g. OS/390).
I byte stream: unstructured stream of bytes. The only native Unix
type.
I pile: unstructured sequence of variable-length records. Records and
fields need to be self-identifying; can be searched only exhaustively.
I (fixed) sequential: sequence of fixed-length records. Can store only
the values of fields. One field may be a key. Search is sequential; if
records are ordered by key, it need not be exhaustive (but then there
are problems in update).
I indexed sequential: add an index file, indexing key fields by position
in the main file, and an overflow file for updates. Access is much
faster; updates are handled by adding to the (sequential) overflow
file. Every so often, merge the overflow file with the main file.
I indexed: drop the sequential ordering; use one exhaustive index plus
auxiliary indexes.
I hashed / direct: hash the key value directly into an offset within the
file. (Again, use an overflow file for hash value clashes.)

Directories and Catalogues

How is a file found on disk? Usually there are special files (in a known
location) listing other files with their locations.
Many systems have hierarchical directories:
I directories list files, including other directories;
I a file is located by a path through the directory tree, e.g.
/group/teaching/cs3/os/Modules/worker.c
I a directory entry may contain file metadata (owner, permissions,
access/mod times etc.), or this may be stored with the file;
I (usually) directories can only be accessed via system calls, not by
normal user I/O routines.

Unix Files and Directories

In Unix:
I files are unstructured byte sequences
I metadata (including pointers to data) is stored in an inode
I directories link names to inodes (and that's all)
I hence file permissions are entirely unrelated to directory permissions
I inodes may be listed in multiple directories
I inodes (and file data) are automatically freed when no directory
links to them
I the root directory of a filesystem is found in a fixed inode (number 2
in Linux filesystems).
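The name/inode split can be seen from user space with the standard
POSIX stat() call; this small program prints the inode number and link
count for each path given:

    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        for (int i = 1; i < argc; i++) {
            if (stat(argv[i], &st) != 0) { perror(argv[i]); continue; }
            printf("%s: inode %lu, %lu link(s)\n", argv[i],
                   (unsigned long)st.st_ino, (unsigned long)st.st_nlink);
        }
        return 0;
    }

Two names hard-linked to the same file report the same inode number,
with a link count of 2; removing one name just decrements the count.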
OS/390 Data Set Organization

is complex. To simplify:
I files may have any of the formats mentioned above, and others
I files live on a disk volume, which has a VTOC giving names and
some metadata (e.g. format) for files on the disk
I files from many volumes can be put in catalogs
I and a filename prefix can be associated with a catalog via the
master catalog. E.g. the file JCB.ASM.SOURCE will be found in the
catalog associated with JCB
I catalogs also contain additional metadata (security etc.), depending
on the files
I the master catalog is defined at system boot time from the VTOC
of the system boot volume.

Access Control

Often files are shared between users. Access rights may be restricted.
Types of access include
I knowledge of existence (e.g. seeing the directory entry)
I execute (for programs)
I read
I write
I write append-only
I change access rights
I delete

Access Control Mechanisms

include
I predefined permission bits, e.g. Unix read/write/execute for
owner/group/other users
I access control lists giving specific rights to specific users or groups
I capabilities granted to users over files (see Computer Security)

Blocking

How are the logical records packed into physical blocks on disk?
Depending on hardware, block size may be fixed, or variable. (PC/Unix
disks are fixed; S/390 DASDs allow varying block sizes.)
I fixed blocking: pack a constant number of fixed-length records into
each block
I variable, spanning: variable-length records, packed without regard to
block boundaries. May need implicit or explicit continuation pointers
when spanning blocks. Some records need two I/O operations.
I variable, non-spanning: records don't span blocks; just waste the
space at the end of the block
Choices are made on performance criteria.
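For fixed blocking, locating record n is pure arithmetic; the sizes below
are examples (note the 96 bytes wasted per block – the price of not
spanning):

    #define BLOCK_SIZE  4096
    #define RECORD_SIZE 100
    #define RECS_PER_BLOCK (BLOCK_SIZE / RECORD_SIZE)  /* 40 records */

    unsigned long record_block(unsigned long n)
    {
        return n / RECS_PER_BLOCK;                 /* which block to read */
    }

    unsigned long record_offset(unsigned long n)
    {
        return (n % RECS_PER_BLOCK) * RECORD_SIZE; /* byte offset in block */
    }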
File Space Allocation

How do we allocate physical blocks to files? This is very similar to
the memory allocation problem, but with more options (since the OS can
manipulate complex data structures, unlike memory hardware).
I contiguous allocation: makes file I/O easy and quick. But:
fragmentation; need for compaction. (If space is not pre-allocated
(which has its own problems!), may need dynamic compaction.)
I chained allocation: allocate blocks as and when needed, and chain
them together in one list per file. Easy, no fragmentation problems,
but a file may be scattered over the disk → very inefficient I/O;
direct access is slow.
I indexed allocation: the file has an index of blocks or sequences of
blocks allocated to it. E.g. file foo is on blocks 3,4,5,78,79,80. The
most popular method; gives direct access and sequential access;
avoids fragmentation; has some contiguity.
Similar approaches are taken to organizing free space on disk, though
many systems just use a bitmap of block allocation.

The Windows NT family

The successor to the Windows 9? family. Started after (the failure of)
OS/2.
I Started 1989. New codebase; microkernel-based architecture.
I NT 3.1 released 1993; poor quality.
I NT 3.5 released 1994 (3.51 in 1995); more or less usable.
I NT 4.0 released 1996; matched W95 look'n'feel. For performance,
some functions (esp. graphics) were put back into the kernel.
I Windows 2000 (NT 5.0): adds features for distributed processing;
Active Directory (distributed directory service).
I Windows XP: no really significant OS-side changes. Terminal servers
allow multiple users on one workstation (cf. Linux virtual consoles).
I Windows Vista: still NT, but many components extensively
re-worked. Interesting techniques include machine-learning based
paging. Security focus (Trusted Platform Module).
I Windows 7: trying to cut down on kernel bloat. Changes to memory
management and scheduling, but nothing revolutionary.

NT Design Principles

Design goals include:
I portability – not just Intel;
I security – for commercial and military use;
I POSIX compliance – to 'ease transition from Unix'. . .
I SMP support;
I extensibility;
I internationalization and localization;
I backwards compatibility (to Windows 9?, 3.1, even MS-DOS).
Accordingly, NT is micro-kernel based, modular/layered, and written in
high-level languages (C and C++).

NT Family General Structure

[Figure: NT family structure – in user mode, the logon process, Win32,
Win16, MS-DOS, POSIX and OS/2 applications run over their respective
subsystems, alongside the security subsystem; all sit above the native NT
interface (system calls). In kernel mode, the Executive (I/O manager, VM
manager, process manager, object manager, file system drivers, cache
manager, security manager, LPC facility) runs over the kernel and device
drivers, with the Hardware Abstraction Layer (HAL) between them and
the hardware.]
HAL, Kernel, Executive, Subsystems

The Hardware Abstraction Layer converts all hardware-specific details
into an abstract interface.
The (micro)kernel handles process and thread scheduling,
interrupt/exception handling, SMP synchronization, and recovery. It is
object-oriented. It is non-pageable and non-preemptible.
The executive comprises a number of modules, running in kernel mode
but in thread context (cf. Linux kernel threads).
Subsystems are user-mode processes providing native NT facilities to
other operating system APIs: Win32, POSIX, OS/2, Win3.1, MS-DOS.

Processes and Threads

NT has processes that own resources, including an address space, and
threads that are the dispatchable units, as in the general description
earlier in the course.
Process/thread services are provided by the executive's process manager.
General purpose; no built-in restrictions to parent/child relationships.
Scheduling is priority-based; processes returning from I/O get boosted,
and then decay each quantum. Special hacks for GUI response.
Quantum is around 20 ms on Workstation, or double/triple for the
'foreground task'. Longer quantum on server configurations.

Memory Management

NT has paged virtual memory. Page reclamation is local to process;
resident set size is managed according to global demand.
The executive virtual memory manager provides VM services to
processes, such as sharing and memory-mapped files.
In Vista, the system tries to pre-load pages that are likely to be needed
– it even tracks application usage by time of day in order to load them
before they're called.

Object Management

Almost everything is an object. The executive object manager provides
object services:
I hierarchical namespace for named objects (via directory objects)
I access control lists
I naming domains – mapping existing namespaces to the object
namespace
I symbolic links
I handles to objects (used by processes etc.)
I/O Management

The executive I/O manager supervises dispatching of basic I/O to device
drivers.
I asynchronous request/response model – the application queues a
request, and is signalled on completion
I device drivers and file system drivers can be stacked (cf. Solaris
streams)
I the cache manager provides general caching services
I network drivers include distributed system support (W2K and XP)
NT supports the old FAT-32 filesystem, but also the modern NT
filesystem:
I an NTFS volume occupies a partition, a disk, or multiple disks
I an NTFS file is structured: a file has attributes, including (possibly
multiple) data attributes and security attributes
I files are located via the MFT (master file table)
I NTFS is a journalling file system

z/OS – OS/390 – MVS

See earlier for the evolution. MVS is the basic operating system
component of z/OS. MVS comprises BCP (Basic Control Program) and
JES2 (Job Entry Subsystem).
MVS design objectives (1972–date!):
I performance
I reliability
I availability
I compatibility
in the large system environment.
A large multiprocessor MVS cluster is resilient. The system recovers from,
and reconfigures round, almost all failures, including hardware failures.
(99.999% uptime is claimed; some installations are said to have stayed up
for more than a decade.)

MVS Components

I supervisor: main OS functions
I Master Scheduler: system start and control, communication with
the operator; the master task in the system
I Job Entry Subsystem (JES2): entry of batch jobs, handling of
output
I System Management Facility (SMF): accounting, performance
analysis
I Resource Measurement Facility: records data about system events
and usage, for use by SMF and others
I Workload Manager: manages workload according to installation
goals
I Timesharing Option (TSO/E): provides interactive timesharing
services
I TCAM, VTAM, TCP/IP: telecoms and networking
I Global Resource Serialization: resource control across clusters
and many others.

Supervisor

I Dispatcher: main scheduler (in the OS sense)
I Real Storage Manager: manages real memory, decides on page
in/out/reclaim, determines resident set sizes etc.
I Auxiliary Storage Manager: handles page/swap in/out
I Virtual Storage Manager: address space management and virtual
allocation (calls RSM to get real memory)
I System Resources Manager: supervisor component of the Workload
Manager; advises/instructs the above components
Job Entry Subsystem

handles the processing of batch jobs:
I read from 'card reader' or TSO SUBMIT command
I convert JCL to internal form
I start execution
I spool output; hold for later inspection, and/or print
JES2 is a basic system; JES3 provides more advanced job scheduling
facilities.

Address spaces

A virtual address space includes:
I nucleus: important control blocks (in particular the CVT); most OS
routines
I Fixed Link Pack Area: non-pageable (e.g. for performance) shared
libraries etc.
I private area: includes address-space-local system data, and user
programs and data
I Common Service Area: data shared between all tasks
I System Queue Area: page tables etc.
I Pageable Link Pack Area: shared libraries permanently resident in
virtual memory
Note that (unlike Linux) most system data is mapped into all address
spaces.
Address spaces are created for each started task (operator START
command), job, and TSO user.
There are also data spaces: extra address spaces solely for user data.

Recap of other aspects

We have already described many aspects of S/390 and OS/390. To recap:
I Memory is paged on demand, with a two-level paging system.
I The task is the basic dispatchable unit. One address space may have
many tasks. The service request is a small unit, dispatchable to any
address space.
I There is a general resource control mechanism ENQ/DEQ.
I SMP is supported in hardware.
I I/O is highly sophisticated, offloaded from the CPU. Throughput
and transaction rates can be very high. (E.g. 10 000s of concurrent
users accessing terabyte databases; sustained I/O of tens of MB/s on
a single CPU.) Intra-cluster data transfer runs at GB/s.

VM – Virtual Machine Operating System

The S/390 hardware is easy to virtualize. VM is an S/390 OS exploiting
this.
I VM provides each user with a (configurable) virtual S/390 machine
to which they have full access.
I The VM CP (control program) gives each user a virtual console with
operator-like commands, at which
I they may start CMS (Conversational Monitor System), a single-user
operating system for interactive work;
I or they may load an S/390 operating system: MVS, Linux/390, or
even VM.
I VM is a fully paging virtual memory OS, so
I IBM OSes can be adjusted to allow communication with the VM
hypervisor, so that only one OS does the paging etc. (maybe VM,
maybe the guest).
A single S/390 machine can reasonably host 10–20 thousand virtual
machines running CMS. Some production shops run 3000 virtual Linux
machines on one S/390 under VM.
Security Overview

Security requirements are sometimes divided into categories:
I Confidentiality: data (or even its existence) should be protected from
disclosure to unauthorized entities.
I Integrity: data should not be modified by unauthorized entities.
I Availability: data should be available to authorized entities.
I Authenticity: all entities should be identified, so that all operations
are attributable. Also, nowadays,
I Non-repudiation: no entity should be able to deny doing any action
that it did do.

Attacks

Many attacks are obvious: cut power lines, intercept phone lines, etc.
Some are less obvious: traffic analysis may reveal critical information.
Data aggregation may extract sensitive information from several
apparently innocent sources.
Social engineering attacks often work.

Protection

Different degrees and granularities of protection can be provided, with
increasing difficulty, by operating systems:
I no protection: fine, if the system is contained by physical security.
I isolation of tasks: different tasks have separate address spaces,
filespaces, etc., with no communication.
I public/private: allow object owners to make them public (accessible
to other processes) or private.
I sharing via access lists: the OS enforces user-specified access
restrictions given in ACLs.
I . . . via capabilities: or with dynamically created access capabilities,
which may
I limit uses: constrain detailed use: printing, viewing, copying etc.

Techniques

I User identification
I Passwords
I One-time passwords
I Biometrics
I Confidentiality
I OS facilities
I Encryption
I Authenticity and non-repudiation
I Cryptographic signing
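The public/private and access-list schemes above generalize the familiar
Unix permission bits. A sketch of the classic nine-bit owner/group/other
check in C (real kernels add supplementary groups, the superuser, ACLs,
and so on):

    #include <sys/types.h>

    #define PERM_R 4
    #define PERM_W 2
    #define PERM_X 1

    int permits(mode_t mode, uid_t file_uid, gid_t file_gid,
                uid_t uid, gid_t gid, int want)  /* want: PERM_* mask */
    {
        int shift = (uid == file_uid) ? 6   /* owner bits */
                  : (gid == file_gid) ? 3   /* group bits */
                  : 0;                      /* other bits */
        return ((mode >> shift) & 7 & want) == want;
    }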
Malicious Software

Two main entry routes for malicious software:
I exploiting bugs in system software, e.g. buffer overflow attacks
I exploiting users, e.g. most email viruses
Malicious software includes worms, viruses, logic bombs, trojan horses,
trap doors. Prevention via:
I rigorous access control on a need-to-know basis
I reviews of potentially exploitable code
I user education
Detection via
I signature scanning
I sandboxed execution
I performance and system behaviour analysis

Cracking and counter-cracking

The ultimate desideratum of a cracker is complete privileged access to a
system. Attacks may involve several levels of indirection:
I break directly into a privileged network server, suborn an operator,
etc.
I break a user account, then exploit a weakness in the OS to get root
I break into a trusted but more vulnerable machine, use as relay

Cracking user accounts

Nowadays, by far the easiest way is by tricking the user into opening an
executable attachment or running a download. With foolishly designed
mail systems, you may not even need to trick the user. Counter-measures,
in increasing order of severity:
I educate users
I scan mail for known viruses
I modify local mail programs etc. to stop them executing attachments
I prohibit (by modifying the OS if necessary) execution of any
program not digitally signed or otherwise known to be trusted

More traditional techniques:
I guessing (or brute-force searching) passwords.
Counter: draconian password policies (problems with this?).
I faking login screens (this would be very easy on DICE).
Counter: train users to use 'secure attention' before log on, if
available.
Trapdoors and Backdoors

The cracker (either as an insider, or by cracking another machine) inserts
a trapdoor into a privileged program, allowing root access or whatever.
Classic paper by Ken Thompson (1984):
I modify the C compiler (cc) so that it recognizes when it is
compiling the login program: it then inserts a routine to allow a
master password for any account;
I and if it recognizes it is compiling itself, it inserts these two routines.
Moral: you can't trust a system built with somebody else's tools, even if
the source is clean.
Main counter: well-trusted sources. But . . .

In autumn 2003, somebody attempted to insert a trapdoor into the master
copy of the Linux kernel. They added what looked like an obscure test
with a typo to a system call, which would actually allow anybody to
become root by passing appropriate arguments. Fortunately,
I it was caught by clash detection in version control;
I the source tree they modified was not actually the master (although
it is a source used by many people).

Attacking Servers and Services

Many Unix machines run (or ran) servers such as FTP, Web, etc. Many
of these must provide services on behalf of all users; they therefore [!]
run as root. Windows machines similarly run several services in privileged
modes.
Buffer overflow vulnerabilities in privileged servers give direct entry for
the cracker. There are numerous examples, on both Unix and Windows.
Don't run unnecessary servers; use a firewall to block access to all except
trusted servers.

The Internet Worm of 3 November 1988

Robert Morris, Jr., released the first worm that seriously disrupted the
Internet. It used many techniques. For attack:
I a buffer overflow problem in the Unix finger service
I exploiting an intentional 'trapdoor' in Unix mail servers compiled in
debugging mode – very many production servers were!
I password guessing
I Unix remote execution allowed easy spread
It also had defensive capabilities:
I changes its name to sh
I fork()s to change process id frequently
I avoids leaving files around
I obfuscates data in memory to hinder analysis
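To illustrate how small the 2003-style trapdoor was: the widely reported
fragment (reconstructed here from contemporary accounts, so treat the
details as approximate) hid an assignment inside what looked like an
error check:

    if ((options == (__WCLONE|__WALL)) && (current->uid = 0))
            retval = -EINVAL;

The single '=' is the payload: when a caller passes that magic flag
combination, the process's uid is set to 0 (root). Since the assignment
evaluates to 0, the 'error' branch never runs, and to a casual reader the
line looks like a legitimate (if obscure) sanity check with '==' intended.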
Breaking root

Once the cracker has user access, the next step is to become superuser,
get admin status, etc. The usual vulnerabilities are easier to exploit when
already on the system. There may also be other possibilities, for example:
I MVS has a notion of authorized program which can do privileged
things
I one MVS system had a home-grown user command processor (=
shell) which ran authorized (so it could invoke authorized programs)
I and which ran in storage key 8 (normal user storage key) . . .
I MVS standard I/O access methods allow the user to intercept READ
calls on open files – so it should read, but not write
I the user could install a read hook on files being used by the shell

Intrusion detection, concealment and rootkits

Modern systems do a lot of monitoring to try to detect suspicious
activity: changed files, unusual processes. Therefore, after a successful
crack, the cracker needs to avoid detection. Often they install modified
copies of ls, ps etc.
Modern Linux rootkits try even harder: they install kernel modules which
modify the kernel code of system calls so that certain processes and files
are ignored – even a clean ls or ps will not show them. They also modify
the load average to ignore your password cracker, etc. etc.

Multilevel Security

Military and governmental security desires have driven much work on
secure operating systems.
Multilevel security is the concept of processing information with differing
degrees of sensitivity on the same system. E.g. the British official scheme
is unclassified, restricted, confidential, secret, top secret in a hierarchy,
refined by codewords restricting material to specified users. (E.g. the
famous 'top/most secret ultra' was the 'top/most secret' classification,
with the 'ultra' codeword identifying the Enigma decrypts.) A user
cleared to one level should not see higher-level material.
There are also applications in commerce, banking, and communications
(telephone billing system switching data).

The Bell–LaPadula Security Policy Model

A security policy model describes concisely and accurately what
constraints security places on information flow.
The BLP model is two basic principles:
I No Read Up (simple security property): a process running at one
level may not read data at a higher level;
I No Write Down (*-property): a process running at one level may
not write data at a lower level.
The second policy prevents viruses etc. from copying sensitive information
down to a lower security level, as well as stopping humans accidentally
doing so.
(How is declassification achieved?)
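The two BLP rules are easy to state as code. Modelling levels as integers
(higher = more sensitive) is a simplification: with codeword compartments
the levels form a partial order, not a total one.

    enum level { UNCLASSIFIED, RESTRICTED, CONFIDENTIAL,
                 SECRET, TOP_SECRET };

    int may_read(enum level subject, enum level object)
    {
        return subject >= object;    /* No Read Up */
    }

    int may_write(enum level subject, enum level object)
    {
        return subject <= object;    /* No Write Down */
    }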
Maintaining the Security Policy

The design principle is to mediate all data transfers through a small,
verifiable OS component, the reference monitor. This, combined with the
hardware it uses (and the human), forms the Trusted Computing Base
(TCB). Application programs do not have to be checked!
Special architectures are required for the highest security, but this is now
too expensive/inconvenient. Most current 'secure' operating systems run
on commodity hardware, and are 'hardened' versions of commodity OSes.

Evaluation Criteria

The U.S. Department of Defense designed the hugely influential Orange
Book criteria for evaluating the security of systems. Classification
scheme A1, B3, B2, B1, C2, C1, D. Only one general-purpose computer
(Honeywell SCOMP: purpose-built hardware and OS) ever achieved an
A1 rating, which required formal verification. Various Unices and MVS
reached B2 (structured mandatory access controls, etc. etc.).
Now superseded by the Common Criteria, based on the Orange Book and
British and European criteria.

The Common Criteria for Security Assurance Evaluation

0 Inadequate Assurance.
1 Functionally Tested. Provides analysis of the security functions,
using a functional and interface specification of the TOE, to
understand the security behaviour. The analysis is supported by
independent testing of the security functions.
2 Structurally Tested. Analysis of the security functions using a
functional and interface specification and the high-level design of the
subsystems of the TOE. Independent testing of the security
functions, evidence of developer 'black box' testing, and evidence
of a development search for obvious vulnerabilities.
3 Methodically Tested and Checked. The analysis is supported by
'grey box' testing, selective independent confirmation of the
developer test results, and evidence of a developer search for obvious
vulnerabilities. Development environment controls and TOE
configuration management are also required.
4 Methodically Designed, Tested and Reviewed. Analysis is supported
by the low-level design of the modules of the TOE, and a subset of
the implementation. Testing is supported by an independent search
for obvious vulnerabilities. Development controls are supported by a
life-cycle model, identification of tools, and automated configuration
management.
5 Semiformally Designed and Tested. Analysis includes all of the
implementation. Assurance is supplemented by a formal model and
a semiformal presentation of the functional specification and high-
level design, and a semiformal demonstration of correspondence.
The search for vulnerabilities must ensure relative resistance to
penetration attack. Covert channel analysis and modular design are
also required.
6 Semiformally Verified Design and Tested. Analysis is supported by a
modular and layered approach to design, and a structured
presentation of the implementation. The independent search for
vulnerabilities must ensure high resistance to penetration attack.
The search for covert channels must be systematic. Development
environment and configuration management controls are further
strengthened.
7 Formally Verified Design and Tested. The formal model is
supplemented by a formal presentation of the functional
specification and high level design showing correspondence.
Evidence of developer 'white box' testing and complete
independent confirmation of developer test results are required.
Complexity of the design must be minimised.

Single Level Security – the Easy Way

Many workers with security clearances want to use 'standard' computers
(i.e. Windows) in a simple (non-MLS) way. How is data protected?
Usually by a software add-on that maintains the hard drive in an
encrypted state.
The encryption key is held in a USB dongle. Several commercial devices
are approved to reduce the classification level of a laptop by two levels
when it is switched off (e.g. Top Secret to Confidential).
(Question: Top Secret material may not be removed from a government
site save in very unusual circumstances. Confidential material can be
taken for home working subject to certain precautions. What should be
the rule for such a laptop?)

Steganography – Hiding the Existence of Data

In many contexts, even revealing the existence of sensitive data may
be dangerous – whether because an enemy army may be able to draw
inferences, or because you're a political activist under investigation by the
security police. (In several countries, mere use of crypto is a crime; in the
U.K., you can be (legally) forced to decrypt any encrypted material.)
Steganography is the science of hiding data.
The simplest (don't use it) trick is to hide data in an image, by using the
least significant bit of each pixel to carry the data.

Stegfs: a Steganographic File System for Linux

Theory by Anderson, Needham, Shamir; implementation by McDonald
and Kuhn.
Stegfs provides a crypto-based secure filesystem with multiple levels of
security. If you have only the level 3 password (say), you can't tell that
there is a level 4, let alone that there is any level 4 data.
If the machine is running at level 3, the OS doesn't know about level 4
data, so it may write over it. . .
. . . so StegFS maintains several copies of data in dispersed blocks, in the
hope that one of them will survive until you next enter the relevant level
– every so often, you should enter the highest security level and run a
maintenance procedure to regenerate the multiple copies.
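The naive LSB trick (again: don't use it) is a few lines of C. This sketch
assumes a raw greyscale pixel buffer; real image formats, and real
steganography, are considerably more involved:

    #include <stddef.h>
    #include <stdint.h>

    /* Hide msg in the low bits of pixels; needs npixels >= 8 * msglen. */
    void embed(uint8_t *pixels, size_t npixels,
               const uint8_t *msg, size_t msglen)
    {
        for (size_t i = 0; i < msglen * 8 && i < npixels; i++) {
            uint8_t bit = (msg[i / 8] >> (i % 8)) & 1;
            pixels[i] = (pixels[i] & ~1u) | bit;  /* overwrite the LSB */
        }
    }

    uint8_t extract_byte(const uint8_t *pixels, size_t n) /* n-th byte */
    {
        uint8_t b = 0;
        for (int j = 0; j < 8; j++)
            b |= (pixels[n * 8 + j] & 1) << j;
        return b;
    }

Changing only the LSB alters each pixel's brightness by at most 1 part in
255, invisibly to the eye – but statistically detectably, which is why this
scheme should not be used in earnest.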