xv6: a simple, Unix-like teaching operating system
Russ Cox
Frans Kaashoek
Robert Morris
xv6-book@pdos.csail.mit.edu
2 Page tables
4 Locking
5 Scheduling
6 File system
7 Summary
A PC hardware
Index
This is a draft text intended for a class on operating systems. It explains the main concepts of operating systems by studying an example kernel, named xv6. xv6 is a re-implementation of Dennis Ritchie’s and Ken Thompson’s Unix Version 6 (v6). xv6 loosely follows the structure and style of v6, but is implemented in ANSI C for an x86-based multiprocessor.
The text should be read along with the source code for xv6. This approach is inspired by John Lions’s Commentary on UNIX 6th Edition (Peer to Peer Communications; ISBN: 1-57398-013-7; 1st edition (June 14, 2000)). See https://pdos.csail.mit.edu/6.828 for pointers to on-line resources for v6 and xv6, including several hands-on homework assignments using xv6.
We have used this text in 6.828, the operating systems class at MIT. We thank the faculty, teaching assistants, and students of 6.828 who have all directly or indirectly contributed to xv6. In particular, we would like to thank Austin Clements and Nickolai Zeldovich. Finally, we would like to thank people who emailed us bugs in the text or suggestions for improvements: Abutalib Aghayev, Sebastian Boehm, Anton Burtsev, Raphael Carvalho, Rasit Eskicioglu, Color Fuzzy, Giuseppe, Tao Guo, Robert Hilderman, Wolfgang Keller, Austin Liew, Pavan Maddamsetti, Jacek Masiulaniec, Michael McConville, miguelgvieira, Mark Morrissey, Harry Pan, Askar Safin, Salman Shah, Ruslan Savchenko, Pawel Szczurko, Warren Toomey, tyfkda, and Zou Chang Wei.
If you spot errors or have suggestions for improvement, please send email to Frans Kaashoek and Robert Morris (kaashoek,rtm@csail.mit.edu).
The rest of this chapter outlines xv6’s services—processes, memory, file descriptors, pipes, and file system—and illustrates them with code snippets and discussions of how the shell, which is the primary user interface to traditional Unix-like systems, uses them. The shell’s use of system calls illustrates how carefully they have been designed.
The shell is an ordinary program that reads commands from the user and executes them. The fact that the shell is a user program, not part of the kernel, illustrates the power of the system call interface: there is nothing special about the shell. It also means that the shell is easy to replace; as a result, modern Unix systems have a variety of shells to choose from, each with its own user interface and scripting features. The xv6 shell is a simple implementation of the essence of the Unix Bourne shell. Its implementation can be found at line (8550).
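The following discussion refers to a small fork/wait fragment along these lines (a sketch; details may differ from the original listing):

int pid;

pid = fork();
if(pid > 0){
  printf("parent: child=%d\n", pid);
  pid = wait();
  printf("parent: child %d is done\n", pid);
} else if(pid == 0){
  printf("child: exiting\n");
  exit();
} else {
  printf("fork error\n");
}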
The wait system call returns the pid of an exited child of the current process; if none of the caller’s children has exited, wait waits for one to do so. In the example, the output lines

parent: child=1234
child: exiting

might come out in either order, depending on whether the parent or child gets to its printf call first. After the child exits, the parent’s wait returns, causing the parent to print

parent: child 1234 is done
Although the child has the same memory contents as the parent initially, the parent and child are executing with different memory and different registers: changing a variable in one does not affect the other. For example, when the return value of wait is stored into pid in the parent process, it doesn’t change the variable pid in the child. The value of pid in the child will still be zero.
The exec system call replaces the calling process’s memory with a new memory image loaded from a file stored in the file system. The file must have a particular format, which specifies which part of the file holds instructions, which part is data, at which instruction to start, etc. xv6 uses the ELF format, which Chapter 2 discusses in more detail. When exec succeeds, it does not return to the calling program; instead, the instructions loaded from the file start executing at the entry point declared in the ELF header. Exec takes two arguments: the name of the file containing the executable and an array of string arguments. For example:
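A sketch of such a call (the exact listing may differ slightly):

char *argv[3];

argv[0] = "echo";
argv[1] = "hello";
argv[2] = 0;
exec("/bin/echo", argv);
printf("exec error\n");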
This fragment replaces the calling program with an instance of the program
/bin/echo running with the argument list echo hello. Most programs ignore the first
argument, which is conventionally the name of the program.
The xv6 shell uses the above calls to run programs on behalf of users. The main structure of the shell is simple; see main (8701). The main loop reads a line of input from the user with getcmd. Then it calls fork, which creates a copy of the shell process. The parent calls wait, while the child runs the command. For example, if the user had typed ‘‘echo hello’’ to the shell, runcmd would have been called with ‘‘echo hello’’ as the argument. runcmd (8606) runs the actual command. For ‘‘echo hello’’, it would call exec (8626). If exec succeeds then the child will execute instructions from echo instead of runcmd. At some point echo will call exit, which will cause the parent to return from wait in main (8701). You might wonder why fork and exec are not combined in a single call; we will see later that the separation of process creation from program loading is a clever design.
Xv6 allocates most user-space memory implicitly: fork allocates the memory required for the child’s copy of the parent’s memory, and exec allocates enough memory to hold the executable file. A process that needs more memory at run-time (perhaps for malloc) can call sbrk(n) to grow its data memory by n bytes; sbrk returns the location of the new memory.
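For example, a user-level allocator might grow the heap one page at a time with a call along these lines (a sketch; xv6’s sbrk returns (char*)-1 on failure):

char *p;

p = sbrk(4096);    // ask the kernel for 4096 more bytes of data memory
if(p == (char*)-1)
  printf("sbrk failed\n");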
Xv6 does not provide a notion of users or of protecting one user from another; in
Unix terms, all xv6 processes run as root.
char buf[512];
int n;

for(;;){
  n = read(0, buf, sizeof buf);
  if(n == 0)
    break;
  if(n < 0){
    fprintf(2, "read error\n");
    exit();
  }
  if(write(1, buf, n) != n){
    fprintf(2, "write error\n");
    exit();
  }
}
The important thing to note in the code fragment is that cat doesn’t know whether it
is reading from a file, console, or a pipe. Similarly cat doesn’t know whether it is
printing to a console, a file, or whatever. The use of file descriptors and the convention that file descriptor 0 is input and file descriptor 1 is output allows a simple implementation of cat.
The close system call releases a file descriptor, making it free for reuse by a future open, pipe, or dup system call (see below). A newly allocated file descriptor is always the lowest-numbered unused descriptor of the current process.
File descriptors and fork interact to make I/O redirection easy to implement.
Fork copies the parent’s file descriptor table along with its memory, so that the child
starts with exactly the same open files as the parent. The system call exec replaces the
calling process’s memory but preserves its file table. This behavior allows the shell to
implement I/O redirection by forking, reopening chosen file descriptors, and then exec-
ing the new program. Here is a simplified version of the code a shell runs for the
command cat < input.txt:
argv[0] = "cat";
argv[1] = 0;
if(fork() == 0) {
close(0);
open("input.txt", O_RDONLY);
exec("cat", argv);
}
After the child closes file descriptor 0, open is guaranteed to use that file descriptor for
the newly opened input.txt: 0 will be the smallest available file descriptor. Cat then
executes with file descriptor 0 (standard input) referring to input.txt.
The code for I/O redirection in the xv6 shell works in exactly this way (8630). Recall that at this point in the code the shell has already forked the child shell and that runcmd will call exec to load the new program. Now it should be clear why it is a good idea that fork and exec are separate calls: because they are separate, the shell can fork a child, use open, close, and dup in the child to change the standard input and output file descriptors, and then exec. No changes to the program being exec-ed (cat in our example) are required. If fork and exec were combined into a single system call, some other (probably more complex) scheme would be required for the shell to redirect standard input and output, or the program itself would have to understand how to redirect I/O.
Although fork copies the file descriptor table, each underlying file offset is shared
between parent and child. Consider this example:
if(fork() == 0) {
  write(1, "hello ", 6);
  exit();
} else {
  wait();
  write(1, "world\n", 6);
}
At the end of this fragment, the file attached to file descriptor 1 will contain the data
hello world. The write in the parent (which, thanks to wait, runs only after the
child is done) picks up where the child’s write left off. This behavior helps produce
sequential output from sequences of shell commands, like (echo hello; echo world)
>output.txt.
The dup system call duplicates an existing file descriptor, returning a new one that
refers to the same underlying I/O object. Both file descriptors share an offset, just as
the file descriptors duplicated by fork do. This is another way to write hello world
into a file:
fd = dup(1);
write(1, "hello ", 6);
write(fd, "world\n", 6);
Two file descriptors share an offset if they were derived from the same original file descriptor by a sequence of fork and dup calls. Otherwise file descriptors do not share offsets, even if they resulted from open calls for the same file. Dup allows shells to implement commands like ls existing-file non-existing-file > tmp1 2>&1. The 2>&1 tells the shell to give the command a file descriptor 2 that is a duplicate of descriptor 1. Both the name of the existing file and the error message for the non-existing file will show up in the file tmp1. The xv6 shell doesn’t support I/O redirection for the error file descriptor, but now you know how to implement it.
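One way a shell could do so is a small variation on the redirection code above (a sketch; argv is assumed to be set up for the command being run):

if(fork() == 0) {
  close(2);
  dup(1);    // descriptor 2 now refers to the same object (and offset) as descriptor 1
  exec("ls", argv);
}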
File descriptors are a powerful abstraction, because they hide the details of what
they are connected to: a process writing to file descriptor 1 may be writing to a file, to
a device like the console, or to a pipe.
Pipes
A pipe is a small kernel buffer exposed to processes as a pair of file descriptors,
one for reading and one for writing. Writing data to one end of the pipe makes that
data available for reading from the other end of the pipe. Pipes provide a way for
processes to communicate.
The following example code runs the program wc with standard input connected
to the read end of a pipe.
int p[2];
char *argv[2];

argv[0] = "wc";
argv[1] = 0;

pipe(p);
if(fork() == 0) {
  close(0);
  dup(p[0]);
  close(p[0]);
  close(p[1]);
  exec("/bin/wc", argv);
} else {
  close(p[0]);
  write(p[1], "hello world\n", 12);
  close(p[1]);
}
The program calls pipe, which creates a new pipe and records the read and write file
descriptors in the array p. After fork, both parent and child have file descriptors referring to the pipe. The child dups the read end onto file descriptor 0, closes the file descriptors in p, and execs wc. When wc reads from its standard input, it reads from the
pipe. The parent closes the read side of the pipe, writes to the pipe, and then closes
the write side.
If no data is available, a read on a pipe waits for either data to be written or all
file descriptors referring to the write end to be closed; in the latter case, read will return 0, just as if the end of a data file had been reached. The fact that read blocks
until it is impossible for new data to arrive is one reason that it’s important for the
child to close the write end of the pipe before executing wc above: if one of wc’s file
descriptors referred to the write end of the pipe, wc would never see end-of-file.
File system
The xv6 file system provides data files, which are uninterpreted byte arrays, and
directories, which contain named references to data files and other directories. The directories form a tree, starting at a special directory called the root. A path like /a/b/c refers to the file or directory named c inside the directory named b inside the directory named a in the root directory /. Paths that don’t begin with / are evaluated relative to the calling process’s current directory, which can be changed with the chdir system call. Both these code fragments open the same file (assuming all the directories involved exist):
chdir("/a");
chdir("b");
open("c", O_RDONLY);
open("/a/b/c", O_RDONLY);
The first fragment changes the process’s current directory to /a/b; the second neither
refers to nor changes the process’s current directory.
There are multiple system calls to create a new file or directory: mkdir creates a
new directory, open with the O_CREATE flag creates a new data file, and mknod creates
a new device file. This example illustrates all three:
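A sketch of such an example (the device numbers given to mknod are illustrative):

int fd;

mkdir("/dir");
fd = open("/dir/file", O_CREATE|O_WRONLY);
close(fd);
mknod("/console", 1, 1);

mknod creates a file that refers to a device rather than to file system data; the two extra arguments are the device’s major and minor numbers, which identify a kernel device. The fstat system call retrieves information about the object a file descriptor refers to; it fills in a struct stat, declared in stat.h: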
struct stat {
  short type;  // Type of file
  int dev;     // File system’s disk device
  uint ino;    // Inode number
  short nlink; // Number of links to file
  uint size;   // Size of file in bytes
};
A file’s name is distinct from the file itself; the same underlying file, called an inode, can have multiple names, called links. The link system call creates another file
system name referring to the same inode as an existing file. This fragment creates a
new file named both a and b.
open("a", O_CREATE|O_WRONLY);
link("a", "b");
Reading from or writing to a is the same as reading from or writing to b. Each inode
is identified by a unique inode number. After the code sequence above, it is possible to
determine that a and b refer to the same underlying contents by inspecting the result
of fstat: both will return the same inode number (ino), and the nlink count will be
set to 2.
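For example (a sketch; error handling omitted):

struct stat sta, stb;
int fda, fdb;

fda = open("a", O_RDONLY);
fdb = open("b", O_RDONLY);
fstat(fda, &sta);
fstat(fdb, &stb);
// sta.ino == stb.ino, and both sta.nlink and stb.nlink are 2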
The unlink system call removes a name from the file system. The file’s inode
and the disk space holding its content are only freed when the file’s link count is zero
and no file descriptors refer to it. Thus adding
unlink("a");
to the last code sequence leaves the inode and file content accessible as b. Furthermore,
fd = open("/tmp/xyz", O_CREATE|O_RDWR);
unlink("/tmp/xyz");
is an idiomatic way to create a temporary inode that will be cleaned up when the process closes fd or exits.
Shell commands for file system operations are implemented as user-level programs such as mkdir, ln, rm, etc. This design allows anyone to extend the shell with new user commands.
Real world
Unix’s combination of the ‘‘standard’’ file descriptors, pipes, and convenient shell
syntax for operations on them was a major advance in writing general-purpose
reusable programs. The idea sparked a whole culture of ‘‘software tools’’ that was responsible for much of Unix’s power and popularity, and the shell was the first so-called
‘‘scripting language.’’ The Unix system call interface persists today in systems like BSD,
Linux, and Mac OS X.
The Unix system call interface has been standardized through the Portable Operating System Interface (POSIX) standard. Xv6 is not POSIX compliant: it is missing system calls (including basic ones such as lseek), it implements system calls only partially, and so on. Our main goals for xv6 are simplicity and clarity while providing a simple UNIX-like system-call interface. Several people have extended xv6 with a few more basic system calls and a simple C library so that they can run basic Unix programs. Modern kernels, however, provide many more system calls, and many more kinds of kernel services, than xv6. For example, they support networking, windowing systems, user-level threads, drivers for many devices, and so on. Modern kernels evolve continuously and rapidly, and offer many features beyond POSIX.
For the most part, modern Unix-derived operating systems have not followed the
early Unix model of exposing devices as special files, like the console device file discussed above. The authors of Unix went on to build Plan 9, which applied the ‘‘resources are files’’ concept to modern facilities, representing networks, graphics, and other resources as files or file trees.
The file system abstraction has been a powerful idea. Even so, there are other
models for operating system interfaces. Multics, a predecessor of Unix, abstracted file
storage in a way that made it look like memory, producing a very different flavor of
interface. The complexity of the Multics design had a direct influence on the designers
of Unix, who tried to build something simpler.
This book examines how xv6 implements its Unix-like interface, but the ideas and
concepts apply to more than just Unix. Any operating system must multiplex processes onto the underlying hardware, isolate processes from each other, and provide mechanisms for controlled inter-process communication. After studying xv6, you should be
able to look at other, more complex operating systems and see the concepts underlying
xv6 in those systems as well.
An application in user mode can execute only ordinary instructions (e.g., adding numbers, etc.) and is said to be running in user space, while the software in kernel mode can also execute privileged instructions and is said to be running in kernel space. The software running in kernel space (or in kernel mode) is called the kernel.
An application that wants to read or write a file on disk must transition to the kernel to do so, because the application itself cannot execute I/O instructions. Processors provide a special instruction that switches the processor from user mode to kernel mode and enters the kernel at an entry point specified by the kernel. (The x86 processor provides the int instruction for this purpose.) Once the processor has switched to kernel mode, the kernel can then validate the arguments of the system call, decide whether the application is allowed to perform the requested operation, and then deny it or execute it. It is important that the kernel sets the entry point for transitions to kernel mode; if the application could decide the kernel entry point, a malicious application could enter the kernel at a point where the validation of arguments etc. is skipped.
Kernel organization
A key design question is what part of the operating system should run in kernel
mode. One possibility is that the entire operating system resides in the kernel, so that
the implementations of all system calls run in kernel mode. This organization is called
a monolithic kernel.
In this organization the entire operating system runs with full hardware privilege.
This organization is convenient because the OS designer doesn’t have to decide which part of the operating system doesn’t need full hardware privilege. Furthermore, it is easy for different parts of the operating system to cooperate. For example, an operating
system might have a buffer cache that can be shared both by the file system and the
virtual memory system.
A downside of the monolithic organization is that the interfaces between different
parts of the operating system are often complex (as we will see in the rest of this text),
and therefore it is easy for an operating system developer to make a mistake. In a monolithic kernel, a mistake is fatal, because an error in kernel mode will often cause the kernel to fail. If the kernel fails, the computer stops working, and thus all applications fail too. The computer must reboot to start again.
To reduce the risk of mistakes in the kernel, OS designers can minimize the
amount of operating system code that runs in kernel mode, and execute the bulk of
the operating system in user mode. This kernel organization is called a microkernel.
Figure 1-1 illustrates this microkernel design. In the figure, the file system runs as a user-level process. OS services running as processes are called servers. To allow applications to interact with the file server, the kernel provides an inter-process communication mechanism to send messages from one user-mode process to another. For example, if an application like the shell wants to read or write a file, it sends a message to the file server and waits for a response.
In a microkernel, the kernel interface consists of a few low-level functions for starting applications, sending messages, accessing device hardware, etc. This organization allows the kernel to be relatively simple, as most of the operating system resides in user-level servers.
Xv6 is implemented as a monolithic kernel, following most Unix operating systems. Thus, in xv6, the kernel interface corresponds to the operating system interface,
and the kernel implements the complete operating system. Since xv6 doesn’t provide
many services, its kernel is smaller than some microkernels.
Process overview
The unit of isolation in xv6 (as in other Unix operating systems) is a process. The
process abstraction prevents one process from wrecking or spying on another process’s
memory, CPU, file descriptors, etc. It also prevents a process from wrecking the kernel
itself, so that a process can’t subvert the kernel’s isolation mechanisms. The kernel
must implement the process abstraction with care because a buggy or malicious appli-
cation may trick the kernel or hardware in doing something bad (e.g., circumventing
enforced isolation). The mechanisms used by the kernel to implement processes in-
clude the user/kernel mode flag, address spaces, and time-slicing of threads.
To help enforce isolation, the process abstraction provides the illusion to a program that it has its own private machine. A process provides a program with what appears to be a private memory system, or address space, which other processes cannot
read or write. A process also provides the program with what appears to be its own
CPU to execute the program’s instructions.
Xv6 uses page tables (which are implemented by hardware) to give each process
its own address space. The x86 page table translates (or ‘‘maps’’) a virtual address (the
address that an x86 instruction manipulates) to a physical address (an address that the
processor chip sends to main memory).
Xv6 maintains a separate page table for each process that defines that process’s
address space. As illustrated in Figure 1-2, an address space includes the process’s user
memory starting at virtual address zero. Instructions come first, followed by global
variables, then the stack, and finally a ‘‘heap’’ area (for malloc) that the process can expand as needed.
Each process’s address space maps the kernel’s instructions and data as well as the user program’s memory. When a process invokes a system call, the system call executes in the kernel mappings of the process’s address space. This arrangement exists so that the kernel’s system call code can directly refer to user memory. In order to leave plenty of room for user memory, xv6’s address spaces map the kernel at high addresses, starting at 0x80100000.
The xv6 kernel maintains many pieces of state for each process, which it gathers
into a struct proc (2337). A process’s most important pieces of kernel state are its
page table, its kernel stack, and its run state. We’ll use the notation p->xxx to refer to
elements of the proc structure.
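The definition in proc.h looks roughly like this (an abridged sketch; the real structure has a few more fields, such as the process size and the kill flag):

struct proc {
  pde_t* pgdir;                // Page table
  char *kstack;                // Bottom of kernel stack for this process
  enum procstate state;        // Process state (UNUSED, RUNNABLE, RUNNING, ...)
  int pid;                     // Process ID
  struct proc *parent;         // Parent process
  struct trapframe *tf;        // Trap frame for the current system call or interrupt
  struct context *context;     // Saved registers; swtch() here to run the process
  struct file *ofile[NOFILE];  // Open files
  struct inode *cwd;           // Current directory
};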
Each process has a thread of execution (or thread for short) that executes the process’s instructions. A thread can be suspended and later resumed. To switch transparently between processes, the kernel suspends the currently running thread and resumes
another process’s thread. Much of the state of a thread (local variables, function call
return addresses) is stored on the thread’s stacks. Each process has two stacks: a user
stack and a kernel stack (p->kstack). When the process is executing user instructions,
only its user stack is in use, and its kernel stack is empty. When the process enters the
kernel (for a system call or interrupt), the kernel code executes on the process’s kernel
stack; while a process is in the kernel, its user stack still contains saved data, but isn’t
actively used. A process’s thread alternates between actively using its user stack and its
kernel stack. The kernel stack is separate (and protected from user code) so that the
kernel can execute even if a process has wrecked its user stack.
When a process makes a system call, the processor switches to the kernel stack, raises the hardware privilege level, and starts executing the kernel instructions that implement the system call. When the system call completes, the kernel returns to user space: the hardware lowers its privilege level, switches back to the user stack, and resumes executing user instructions just after the system call instruction. A process’s thread can ‘‘block’’ in the kernel to wait for I/O, and resume where it left off when the I/O has finished.
To allow the rest of the kernel to run, entry sets up a page table that maps virtual addresses starting at 0x80000000 (called KERNBASE) to physical addresses starting at 0x0.
Figure 1-4. A new kernel stack.
allocproc allocates a kernel stack for the process’s kernel thread. If the memory allocation fails, allocproc changes the state back to UNUSED and returns zero to signal failure.
Now allocproc must set up the new process’s kernel stack. allocproc is written so that it can be used by fork as well as when creating the first process. allocproc sets up the new process with a specially prepared kernel stack and set of kernel registers that cause it to ‘‘return’’ to user space when it first runs. The layout of the prepared kernel stack will be as shown in Figure 1-4. allocproc does part of this work by setting up return program counter values that will cause the new process’s kernel thread to first execute in forkret and then in trapret (2507-2512). The kernel thread will start executing with register contents copied from p->context. Thus setting p->context->eip to forkret will cause the kernel thread to execute at the start of forkret (2853). This function will return to whatever address is at the bottom of the stack. The context switch code (3059) sets the stack pointer to point just beyond the end of p->context. allocproc places p->context on the stack, and puts a pointer to trapret just above it; that is where forkret will return. trapret restores user registers from values stored at the top of the kernel stack and jumps into the process (3324). This setup is the same for ordinary fork and for creating the first process, though in the latter case the process will start executing at user-space location zero rather than at a return from fork.
As we will see in Chapter 3, the way that control transfers from user software to
the kernel is via an interrupt mechanism, which is used by system calls, interrupts, and
exceptions. Whenever control transfers into the kernel while a process is running, the
hardware and xv6 trap entry code save user registers on the process’s kernel stack.
userinit writes values at the top of the new stack that look just like those that would be there if the process had entered the kernel via an interrupt, so that the ordinary code for returning from the kernel back to the process’s user code will work.
The iret at the end of trapret pops %eip, %cs, %eflags, %esp, and %ss from the stack. The contents of the trap frame have been transferred to the CPU state, so the processor continues at the %eip specified in the trap frame. For initproc, that means virtual address zero, the first instruction of initcode.S.
At this point, %eip holds zero and %esp holds 4096. These are virtual addresses
in the process’s address space. The processor’s paging hardware translates them into
physical addresses. allocuvm has set up the process’s page table so that virtual address
zero refers to the physical memory allocated for this process, and set a flag (PTE_U)
that tells the paging hardware to allow user code to access that memory. The fact that
userinit (2533) set up the low bits of %cs to run the process’s user code at CPL=3
means that the user code can only use pages with PTE_U set, and cannot modify sensitive hardware registers such as %cr3. So the process is constrained to using only its
own memory.
Real world
In the real world, one can find both monolithic kernels and microkernels. Many
Unix kernels are monolithic. For example, Linux has a monolithic kernel, although
some OS functions run as user-level servers (e.g., the windowing system). Kernels
such as L4, Minix, and QNX are organized as a microkernel with servers, and have seen
wide deployment in embedded settings.
Most operating systems have adopted the process concept, and most processes
look similar to xv6’s. A real operating system would find free proc structures with an
explicit free list in constant time instead of the linear-time search in allocproc; xv6
uses the linear scan (the first of many) for simplicity.
Exercises
1. Set a breakpoint at swtch. Single step with gdb’s stepi through the ret to forkret,
then use gdb’s finish to proceed to trapret, then stepi until you get to initcode
at virtual address zero.
2. KERNBASE limits the amount of memory a single process can use, which might be
irritating on a machine with a full 4 GB of RAM. Would raising KERNBASE allow a
process to use more memory?
Page tables are the mechanism through which the operating system controls what
memory addresses mean. They allow xv6 to multiplex the address spaces of different
processes onto a single physical memory, and to protect the memories of different processes. The level of indirection provided by page tables allows many neat tricks. xv6
uses page tables primarily to multiplex address spaces and to protect memory. It also
uses a few simple page-table tricks: mapping the same memory (the kernel) in several
address spaces, mapping the same memory more than once in one address space (each
user page is also mapped into the kernel’s physical view of memory), and guarding a
user stack with an unmapped page. The rest of this chapter explains the page tables
that the x86 hardware provides and how xv6 uses them. Compared to a real-world
operating system, xv6’s design is restrictive, but it does illustrate the key ideas.
Paging hardware
As a reminder, x86 instructions (both user and kernel) manipulate virtual addresses.
The machine’s RAM, or physical memory, is indexed with physical addresses. The x86
page table hardware connects these two kinds of addresses, by mapping each virtual
address to a physical address.
An x86 page table is logically an array of 2^20 (1,048,576) page table entries
(PTEs). Each PTE contains a 20-bit physical page number (PPN) and some flags. The
paging hardware translates a virtual address by using its top 20 bits to index into the
page table to find a PTE, and replacing the address’s top 20 bits with the PPN in the
PTE. The paging hardware copies the low 12 bits unchanged from the virtual to the
translated physical address. Thus a page table gives the operating system control over
virtual-to-physical address translations at the granularity of aligned chunks of 4096
(2^12) bytes. Such a chunk is called a page.
As shown in Figure 2-1, the actual translation happens in two steps. A page table
is stored in physical memory as a two-level tree. The root of the tree is a 4096-byte
page directory that contains 1024 PTE-like references to page table pages. Each page table page is an array of 1024 32-bit PTEs. The paging hardware uses the top 10 bits of
a virtual address to select a page directory entry. If the page directory entry is present,
the paging hardware uses the next 10 bits of the virtual address to select a PTE from
the page table page that the page directory entry refers to. If either the page directory
entry or the PTE is not present, the paging hardware raises a fault. This two-level
structure allows a page table to omit entire page table pages in the common case in
which large ranges of virtual addresses have no mappings.
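In code, the index extraction amounts to a few shifts and masks, as in xv6’s PDX and PTX macros (a sketch of the arithmetic; mmu.h holds the real definitions):

#define PDXSHIFT 22                                 // offset of the page directory index
#define PTXSHIFT 12                                 // offset of the page table index
#define PDX(va)  (((uint)(va) >> PDXSHIFT) & 0x3FF) // top 10 bits: page directory index
#define PTX(va)  (((uint)(va) >> PTXSHIFT) & 0x3FF) // next 10 bits: page table index
// the low 12 bits, (uint)(va) & 0xFFF, are the offset within the 4096-byte page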
Each PTE contains flag bits that tell the paging hardware how the associated virtual address is allowed to be used. PTE_P indicates whether the PTE is present: if it is not set, a reference to the page causes a fault (i.e., is not allowed). PTE_W controls whether instructions are allowed to issue writes to the page; if not set, only reads and instruction fetches are allowed. PTE_U controls whether user programs are allowed to use the page; if clear, only the kernel is allowed to use the page. Figure 2-1 shows how it all works. The flags and all other page hardware-related structures are defined in mmu.h (0700).
Figure 2-1. x86 page table hardware. The CR3 register points to the page directory; a page directory entry points to a page table page; each entry holds a 20-bit physical page number and flags: P - Present; W - Writable; U - User; WT - 1=Write-through, 0=Write-back; CD - Cache Disabled; A - Accessed; D - Dirty (0 in page directory); AVL - Available for system use. Page table and page directory entries are identical except for the D bit.
A few notes about terms. Physical memory refers to storage cells in DRAM. A
byte of physical memory has an address, called a physical address. Instructions use
only virtual addresses, which the paging hardware translates to physical addresses, and
then sends to the DRAM hardware to read or write storage. At this level of discussion
there is no such thing as virtual memory, only virtual addresses.
Figure 2-2. Layout of the virtual address space of a process and the layout of the physical address space. Note that if a machine has more than 2 Gbyte of physical memory, xv6 can use only the memory that fits between KERNBASE and 0xFE000000.
The header memlayout.h declares the constants for xv6’s memory layout, and macros to convert virtual to physical addresses.
When a process asks xv6 for more memory, xv6 first finds free physical pages to
provide the storage, and then adds PTEs to the process’s page table that point to the
new physical pages. xv6 sets the PTE_U, PTE_W, and PTE_P flags in these PTEs. Most
processes do not use the entire user address space; xv6 leaves PTE_P clear in unused
PTEs. Different processes’ page tables translate user addresses to different pages of
physical memory, so that each process has private user memory.
Xv6 includes all mappings needed for the kernel to run in every process’s page table; these mappings all appear above KERNBASE. It maps virtual addresses KERNBASE:KERNBASE+PHYSTOP to 0:PHYSTOP. One reason for this mapping is so that the kernel can use its own instructions and data. Another reason is that the kernel sometimes needs to be able to write a given page of physical memory, for example when creating page table pages; having every physical page appear at a predictable virtual address makes this convenient. A defect of this arrangement is that xv6 cannot make use of more than 2 gigabytes of physical memory, because the kernel part of the address space is 2 gigabytes. Thus, xv6 requires that PHYSTOP be smaller than 2 gigabytes, even if the computer has more than 2 gigabytes of physical memory.
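This fixed offset makes converting between kernel virtual and physical addresses a simple addition or subtraction, as the memlayout.h macros show (a sketch of them):

#define KERNBASE 0x80000000                           // first kernel virtual address
#define V2P(a) (((uint)(a)) - KERNBASE)               // kernel virtual address -> physical
#define P2V(a) ((void *)(((char *)(a)) + KERNBASE))   // physical address -> kernel virtual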
Xv6 uses a separate, simpler allocator during entry, which allocates memory just after the end of the kernel’s data segment. This allocator does not support freeing and is limited by the 4 MB mapping in the entrypgdir, but that is sufficient to allocate the first kernel page table.
Figure 2-3. Memory layout of a user process with its initial stack. (From the bottom: text, data, the guard page, and the single stack page; the stack holds, from the top down, the argument strings, the argv array of pointers to them, the argc and argv arguments for main, and a return PC of 0xffffffff.)
kfree records the old start of the free list in r->next, and sets the free list equal to r. kalloc removes and returns the first element in the free list.
Figure 2-3 shows the layout of the user memory of an executing process in xv6. Each user process starts at address 0. The bottom of the address space contains the text for the user program, its data, and its stack. The heap is above the stack so that the heap can expand when the process calls sbrk. Note that the text, data, and stack sections are laid out contiguously in the process’s address space but xv6 is free to use non-contiguous physical pages for those sections. For example, when xv6 expands a process’s heap, it can use any free physical page for the new virtual page and then program the page table hardware to map the virtual page to the allocated physical page. This flexibility is a major advantage of using paging hardware.
The stack is a single page, and is shown with the initial contents as created by exec. Strings containing the command-line arguments, as well as an array of pointers to them, are at the very top of the stack. Just under that are values that allow a program to start at main as if the function call main(argc, argv) had just started. To guard against a stack growing off the stack page, xv6 places a guard page right below the stack. The guard page is not mapped and so if the stack runs off the stack page, the hardware will generate an exception because it cannot translate the faulting address. A real-world operating system might allocate more space for the stack so that it can grow beyond one page.
Code: exec
Exec is the system call that creates the user part of an address space. It initializes the
user part of an address space from a file stored in the file system. Exec (6610) opens
the named binary path using namei (6623), which is explained in Chapter 6. Then, it
reads the ELF header. Xv6 applications are described in the widely-used ELF format,
defined in elf.h. An ELF binary consists of an ELF header, struct elfhdr (0905), followed by a sequence of program section headers, struct proghdr (0924). Each proghdr describes a section of the application that must be loaded into memory; xv6 programs have only one program section header, but other systems might have separate
sections for instructions and data.
The first step is a quick check that the file probably contains an ELF binary. An
ELF binary starts with the four-byte ‘‘magic number’’ 0x7F, ’E’, ’L’, ’F’, or
ELF_MAGIC (0902). If the ELF header has the right magic number, exec assumes that
the binary is well-formed.
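A sketch of the fragment inside exec that performs this check (struct elfhdr and ELF_MAGIC come from elf.h; ip is the inode returned by namei, and bad is exec’s error exit):

struct elfhdr elf;

if(readi(ip, (char*)&elf, 0, sizeof(elf)) < sizeof(elf))
  goto bad;
if(elf.magic != ELF_MAGIC)
  goto bad;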
Exec allocates a new page table with no user mappings with setupkvm (6637), allocates memory for each ELF segment with allocuvm (6651), and loads each segment into memory with loaduvm (6655). allocuvm checks that the virtual addresses requested are below KERNBASE. loaduvm (1903) uses walkpgdir to find the physical address of the allocated memory at which to write each page of the ELF segment, and readi to read from the file.
The program section header for /init, the first user program created with exec,
looks like this:
# objdump -p _init
Program Header:
LOAD off 0x00000054 vaddr 0x00000000 paddr 0x00000000 align 2**2
Exercises
1. Look at real operating systems to see how they size memory.
2. If xv6 had not used super pages, what would be the right declaration for entrypgdir?
3. Write a user program that grows its address space by 1 byte by calling sbrk(1). Run the program and investigate the page table for the program before the call to sbrk and after the call to sbrk. How much space has the kernel allocated? What does the PTE for the new memory contain?
4. Modify xv6 so that the pages for the kernel are shared among processes, which reduces memory consumption.
5. Modify xv6 so that when a user program dereferences a null pointer, it will receive a fault. That is, modify xv6 so that virtual address 0 isn’t mapped for user programs.
6. Unix implementations of exec traditionally include special handling for shell scripts. If the file to execute begins with the text #!, then the first line is taken to be a program to run to interpret the file. For example, if exec is called to run myprog arg1 and myprog’s first line is #!/interp, then exec runs /interp with command line /interp myprog arg1. Implement support for this convention in xv6.
7. Delete the check if(ph.vaddr + ph.memsz < ph.vaddr) in exec.c, and construct a user program that exploits that the check is missing.
8. Change xv6 so that user processes run with only a minimal part of the kernel mapped and so that the kernel runs with its own page table that doesn’t include the user process.
9. How would you improve xv6’s memory layout if xv6 were running on a 64-bit processor?
X86 protection
The x86 has 4 protection levels, numbered 0 (most privilege) to 3 (least privilege).
In practice, most operating systems use only 2 levels: 0 and 3, which are then called
kernel mode and user mode, respectively. The current privilege level with which the
x86 executes instructions is stored in the %cs register, in the field CPL.
On the x86, interrupt handlers are defined in the interrupt descriptor table (IDT).
The IDT has 256 entries, each giving the %cs and %eip to be used when handling the
corresponding interrupt.
To make a system call on the x86, a program invokes the int n instruction, where
n specifies the index into the IDT. The int instruction performs the following steps:
• Fetch the n’th descriptor from the IDT, where n is the argument of int.
• Check that CPL in %cs is <= DPL, where DPL is the privilege level in the descriptor.
• Save %esp and %ss in CPU-internal registers, but only if the target segment selector’s PL < CPL.
• Load %ss and %esp from a task segment descriptor.
• Push %ss.
• Push %esp.
• Push %eflags.
• Push %cs.
• Push %eip.
• Clear the IF bit in %eflags, but only on an interrupt.
• Set %cs and %eip to the values in the descriptor.
The int instruction is a complex instruction, and one might wonder whether all these actions are necessary. For example, the check CPL <= DPL allows the kernel to forbid int calls to inappropriate IDT entries such as device interrupt routines. For a user program to execute int, the IDT entry’s DPL must be 3. If the user program doesn’t have the appropriate privilege, then int will result in int 13, which is a general protection fault. As another example, the int instruction cannot use the user stack to save values, because the process may not have a valid stack pointer; instead, the hardware uses the stack specified in the task segment, which is set by the kernel.
Figure 3-1 shows the stack after an int instruction completes and there was a privilege-level change (the privilege level in the descriptor is lower than CPL). If the int instruction didn’t require a privilege-level change, the x86 won’t save %ss and %esp. After both cases, %eip is pointing to the address specified in the descriptor table, and the instruction at that address is the next instruction to be executed and the first instruction of the handler for int n. It is the job of the operating system to implement these handlers, and below we will see what xv6 does.
An operating system can use the iret instruction to return from an int instruction. It pops the values saved during the int instruction from the stack, and resumes execution at the saved %eip.
Figure 3-2. The trapframe on the kernel stack.
alltraps (3304) continues to save processor registers: it pushes %ds, %es, %fs, %gs, and the general-purpose registers (3305-3310). The result of this effort is that the kernel stack now contains a struct trapframe (0602) containing the processor registers at the time of the trap (see Figure 3-2). The processor pushes %ss, %esp, %eflags, %cs, and %eip. The processor or the trap vector pushes an error number, and alltraps pushes the rest. The trap frame contains all the information necessary to restore the user mode processor registers when the kernel returns to the current process, so that the processor can continue exactly as it was when the trap started. Recall from Chapter 2 that userinit built a trapframe by hand to achieve this goal (see Figure 1-4).
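The corresponding structure in x86.h looks roughly like this (an abridged sketch; the real definition declares each field separately and carries 16-bit padding after each segment register):

struct trapframe {
  // registers as pushed by pusha in alltraps
  uint edi, esi, ebp, oesp, ebx, edx, ecx, eax;
  // segment registers pushed by alltraps
  ushort gs, fs, es, ds;
  uint trapno;
  // pushed by the x86 hardware (the error code sometimes by the trap vector)
  uint err;
  uint eip;
  ushort cs;
  uint eflags;
  // pushed by the hardware only when crossing from user to kernel mode
  uint esp;
  ushort ss;
};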
In the case of the first system call, the saved %eip is the address of the instruction
right after the int instruction. %cs is the user code segment selector. %eflags is the
content of the %eflags register at the point of executing the int instruction. As part
of saving the general-purpose registers, alltraps also saves %eax, which contains the
system call number for the kernel to inspect later.
Now that the user mode processor registers are saved, alltraps can finish setting up the processor to run kernel C code. The processor set the selectors %cs and %ss before entering the handler; alltraps sets %ds and %es (3313-3315).
Once the segments are set properly, alltraps can call the C trap handler trap. It pushes %esp, which points at the trap frame it just constructed, onto the stack as an argument to trap (3318). Then it calls trap (3319). After trap returns, alltraps pops the argument off the stack by adding to the stack pointer (3320) and then starts executing the code at label trapret.
Code: Interrupts
Devices on the motherboard can generate interrupts, and xv6 must set up the hardware to handle these interrupts. Devices usually interrupt in order to tell the kernel that some hardware event has occurred, such as I/O completion. Interrupts are usually optional in the sense that the kernel could instead periodically check (or ‘‘poll’’) the device hardware for new events. Interrupts are preferable to polling if the events are relatively rare, so that polling would waste CPU time. Interrupt handling shares some of the code already needed for system calls and exceptions.
Interrupts are similar to system calls, except devices generate them at any time. There is hardware on the motherboard to signal the CPU when a device needs attention (e.g., the user has typed a character on the keyboard). We must program the device to generate an interrupt, and arrange that a CPU receives the interrupt.
Let’s look at the timer device and timer interrupts. We would like the timer hardware to generate an interrupt, say, 100 times per second so that the kernel can track the passage of time and so the kernel can time-slice among multiple running processes. The choice of 100 times per second allows for decent interactive performance while not swamping the processor with handling interrupts.
Drivers
A driver is the code in an operating system that manages a particular device: it tells
the device hardware to perform operations, configures the device to generate interrupts
when done, and handles the resulting interrupts. Driver code can be tricky to write
because a driver executes concurrently with the device that it manages. In addition,
the driver must understand the device’s interface (e.g., which I/O ports do what), and
that interface can be complex and poorly documented.
The disk driver provides a good example. The disk driver copies data from and to the disk.
The IDE device provides access to disks connected to the PC standard IDE controller. IDE is now falling out of fashion in favor of SCSI and SATA, but the interface is simple and lets us concentrate on the overall structure of a driver instead of the details of a particular piece of hardware.
Xv6 represents file system blocks using struct buf (3850). BSIZE (4055) is identical to the IDE’s sector size and thus each buffer represents the contents of one sector on a particular disk device. The dev and sector fields give the device and sector number and the data field is an in-memory copy of the disk sector. Although the xv6 file system chooses BSIZE to be identical to the IDE’s sector size, the driver can handle a BSIZE that is a multiple of the sector size. Operating systems often use bigger blocks than 512 bytes to obtain higher disk throughput.
The flags track the relationship between memory and disk: the B_VALID flag means that data has been read in, and the B_DIRTY flag means that data needs to be written out.
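The structure the driver works with looks roughly like this (an abridged sketch of struct buf; the real definition also has a B_BUSY flag and the linked-list pointers used by the buffer cache):

struct buf {
  int flags;          // B_VALID | B_DIRTY
  uint dev;           // device number
  uint sector;        // sector number on that device
  uchar data[BSIZE];  // in-memory copy of the disk sector
};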
The kernel initializes the disk driver at boot time by calling ideinit (4251) from main (1232). Ideinit calls ioapicenable to enable the IDE_IRQ interrupt (4256). The call to ioapicenable enables the interrupt only on the last CPU (ncpu-1): on a two-processor system, CPU 1 handles disk interrupts.
Next, ideinit probes the disk hardware. It begins by calling idewait (4257) to wait for the disk to be able to accept commands. A PC motherboard presents the status bits of the disk hardware on I/O port 0x1f7. Idewait (4238) polls the status bits until the busy bit (IDE_BSY) is clear and the ready bit (IDE_DRDY) is set.
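The polling loop itself is small; a sketch of it (inb reads a byte from an I/O port, and IDE_BSY, IDE_DRDY, IDE_DF, and IDE_ERR are bit masks for the status byte, defined in ide.c):

static int
idewait(int checkerr)
{
  int r;

  // spin until the controller is no longer busy and reports ready
  while(((r = inb(0x1f7)) & (IDE_BSY|IDE_DRDY)) != IDE_DRDY)
    ;
  if(checkerr && (r & (IDE_DF|IDE_ERR)) != 0)
    return -1;
  return 0;
}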
Now that the disk controller is ready, ideinit can check how many disks are
present. It assumes that disk 0 is present, because the boot loader and the kernel were
both loaded from disk 0, but it must check for disk 1. It writes to I/O port 0x1f6 to
select disk 1 and then waits a while for the status bit to show that the disk is ready
(4259-4266). If not, ideinit assumes the disk is absent.
After ideinit, the disk is not used again until the buffer cache calls iderw,
which updates a locked buffer as indicated by the flags. If B_DIRTY is set, iderw
writes the buffer to the disk; if B_VALID is not set, iderw reads the buffer from the
disk.
Disk accesses typically take milliseconds, a long time for a processor. The boot loader issues disk read commands and reads the status bits repeatedly until the data is ready. This polling or busy waiting is fine in a boot loader, which has nothing better to do; in an operating system, however, it is more efficient to let another process run on the CPU and arrange to receive an interrupt when the disk operation has completed.
Real world
Supporting all the devices on a PC motherboard in its full glory is much work, because there are many devices, the devices have many features, and the protocol between device and driver can be complex. In many operating systems, the drivers together account for more code in the operating system than the core kernel.
Actual device drivers are far more complex than the disk driver in this chapter, but the basic ideas are the same: typically devices are slower than the CPU, so the hardware uses interrupts to notify the operating system of status changes. Modern disk controllers typically accept a batch of disk requests at a time and even reorder them to make most efficient use of the disk arm. When disks were simpler, operating systems often reordered the request queue themselves.
Many operating systems have drivers for solid-state disks because they provide
much faster access to data. But, although a solid-state disk works very differently from
a traditional mechanical disk, both devices provide block-based interfaces and read-
ing/writing blocks on a solid-state disk is still more expensive than reading/writing
RAM.
Other hardware is surprisingly similar to disks: network device buffers hold packets, audio device buffers hold sound samples, graphics card buffers hold video data and
command sequences. High-bandwidth devices—disks, graphics cards, and network
cards—often use direct memory access (DMA) instead of programmed I/O (insl,
outsl). DMA allows the device direct access to physical memory. The driver gives the
device the physical address of the buffer’s data and the device copies directly to or
from main memory, interrupting once the copy is complete. DMA is faster and more
efficient than programmed I/O and is less taxing for the CPU’s memory caches.
Some drivers dynamically switch between polling and interrupts, because using interrupts can be expensive, but using polling can introduce delay until the driver processes an event. For example, a network driver that receives a burst of packets may switch from interrupts to polling since it knows that more packets must be processed and it is less expensive to process them using polling. Once no more packets need to be processed, the driver may switch back to interrupts, so that it will be alerted immediately when a new packet arrives.
The IDE driver routes interrupts statically to a particular processor. Some drivers configure the IO APIC to route interrupts to multiple processors to spread out the work of processing packets. For example, a network driver might arrange to deliver interrupts for packets of one network connection to the processor that is managing that connection, while interrupts for packets of another connection are delivered to another processor. This routing can get quite sophisticated; for example, some network connections may be short-lived while others are long-lived, and the operating system wants to keep all processors busy to achieve high throughput.
If a program reads a file, the data for that file is copied twice. First, it is copied
from the disk to kernel memory by the driver, and then later it is copied from kernel
space to user space by the read system call. If the program then sends the data over
the network, the data is copied twice more: from user space to kernel space and from
kernel space to the network device. To support applications for which efficiency is important (e.g., serving popular images on the Web), operating systems use special code paths to avoid copies. As one example, in real-world operating systems, buffers typically match the hardware page size, so that read-only copies can be mapped into a process’s address space using the paging hardware, without any copying.
Exercises
Locking
Xv6 runs on multiprocessors: computers with multiple CPUs executing independently. These multiple CPUs share physical RAM, and xv6 exploits the sharing to maintain data structures that all CPUs read and write. This sharing raises the possibility of one CPU reading a data structure while another CPU is mid-way through updating it, or even multiple CPUs updating the same data simultaneously; without careful design such parallel access is likely to yield incorrect results or a broken data structure. Even on a uniprocessor, an interrupt routine that uses the same data as some interruptible code could damage the data if the interrupt occurs at just the wrong time.
Any code that accesses shared data concurrently must have a strategy for maintaining correctness despite concurrency. The concurrency may arise from accesses by multiple cores, or by multiple threads, or by interrupt code. xv6 uses a handful of simple concurrency control strategies; much more sophistication is possible. This chapter
focuses on one of the strategies used extensively in xv6 and many other systems: the
lock.
A lock provides mutual exclusion, ensuring that only one CPU at a time can hold
the lock. If a lock is associated with each shared data item, and the code always holds
the associated lock when using a given item, then we can be sure that the item is used
from only one CPU at a time. In this situation, we say that the lock protects the data
item.
The rest of this chapter explains why xv6 needs locks, how xv6 implements them,
and how it uses them. A key observation will be that if you look at some code in xv6,
you must ask yourself if another processor (or interrupt) could change the intended
behavior of the code by modifying data (or hardware resources) it depends on. You
must keep in mind that a single C statement can be several machine instructions and
thus another processor or an interrupt may muck around in the middle of a C statement. You cannot assume that lines of code on the page are executed atomically.
Concurrency makes reasoning about correctness much more difficult.
Race conditions
As an example of why we need locks, consider several processors sharing a single
disk, such as the IDE disk in xv6. The disk driver maintains a linked list of the outstanding disk requests (4226) and processors may add new requests to the list concurrently (4354). If there were no concurrent requests, you might implement the linked list
as follows:
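A sketch of such an implementation, numbered so that the line references below line up:

1 struct list {
2   int data;
3   struct list *next;
4 };
5
6 struct list *list = 0;
7
8 void
9 insert(int data)
10 {
11   struct list *l;
12
13   l = malloc(sizeof *l);
14   l->data = data;
15   l->next = list;
16   list = l;
17 }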
Figure 4-1. Example race: two CPUs both execute l->next = list before either executes list = l.
This implementation is correct if executed in isolation. However, the code is not correct if more than one copy executes concurrently. If two CPUs execute insert at the same time, it could happen that both execute line 15 before either executes line 16 (see Figure 4-1). If this happens, there will now be two list nodes with next set to the former value of list. When the two assignments to list happen at line 16, the second one will overwrite the first; the node involved in the first assignment will be lost.
The lost update at line 16 is an example of a race condition. A race condition is a situation in which a memory location is accessed concurrently, and at least one access is a write. A race is often a sign of a bug, either a lost update (if the accesses are writes) or a read of an incompletely-updated data structure. The outcome of a race depends on the exact timing of the two CPUs involved and how their memory operations are ordered by the memory system, which can make race-induced errors difficult to reproduce and debug. For example, adding print statements while debugging insert might change the timing of the code enough to make the race disappear.
The usual way to avoid races is to use a lock; a first (broken) attempt at an acquire function might look like this:
21 void
22 acquire(struct spinlock *lk)
23 {
24   for(;;) {
25     if(!lk->locked) {
26       lk->locked = 1;
27       break;
28     }
29   }
30 }
Unfortunately, this implementation does not guarantee mutual exclusion on a multi-
processor. It could happen that two CPUs simultaneously reach line 25, see that lk-
>locked is zero, and then both grab the lock by executing line 26. At this point, two
different CPUs hold the lock, which violates the mutual exclusion property. Rather
than helping us avoid race conditions, this implementation of acquire has its own
race condition. The problem here is that lines 25 and 26 are executed as separate actions.
In order for the routine above to be correct, lines 25 and 26 must execute in one
atomic (i.e., indivisible) step.
To execute those two lines atomically, xv6 relies on a special x86 instruction, xchg
(0569). In one atomic operation, xchg swaps a word in memory with the contents of a
register. The function acquire (1574) repeats this xchg instruction in a loop; each iter-
ation atomically reads lk->locked and sets it to 1 (1581). If the lock is already held,
lk->locked will already be 1, so the xchg returns 1 and the loop continues. If the
xchg returns 0, however, acquire has successfully acquired the lock—locked was 0
and is now 1—so the loop can stop. Once the lock is acquired, acquire records, for
debugging, the CPU and stack trace that acquired the lock. If a process forgets to re-
lease a lock, this information can help to identify the culprit. These debugging fields
are protected by the lock and must only be edited while holding the lock.
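A sketch of acquire’s central loop, matching this description; the real routine (1574) additionally disables interrupts on the current CPU and fills in the debugging fields mentioned above.
void
acquire(struct spinlock *lk)
{
  // xchg atomically stores 1 into lk->locked and returns the old value;
  // a result of 0 means the lock was free and this CPU now holds it.
  while(xchg(&lk->locked, 1) != 0)
    ;
}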
The function release (1602) is the opposite of acquire: it clears the debugging
fields and then releases the lock. The function uses an assembly instruction to clear
locked, because clearing this field should be atomic so that the xchg instruction won’t
see a subset of the 4 bytes that hold locked updated. The x86 guarantees that a 32-bit
movl updates all 4 bytes atomically. Xv6 cannot use a regular C assignment, because
the C language specification does not specify that a single assignment is atomic.
Xv6’s implementation of spin-locks is x86-specific, and xv6 is thus not directly
portable to other processors. To allow for portable implementations of spin-locks, the
C standard (since C11) provides a library of atomic operations; a portable operating
system would use those operations rather than x86-specific instructions.
Deadlock and lock ordering
If a code path in the kernel must hold several locks at the same time, the order in
which the locks are acquired matters. Suppose code path 1 acquires lock A and then
lock B, while code path 2 acquires lock B and then lock A. This situation can result
in a deadlock if two threads execute the code paths concurrently. Suppose thread T1
executes code path 1 and acquires lock A, and thread T2 executes code path 2 and
acquires lock B. Next T1 will try to acquire lock B, and T2 will try to acquire lock A.
Both acquires will block indefinitely, because in both cases the other thread holds the
needed lock, and won’t release it until its acquire returns. To avoid such deadlocks,
all code paths must acquire locks in the same order. The need for a global lock
acquisition order means that locks are effectively part of each function’s specification:
callers must invoke functions in a way that causes locks to be acquired in the
agreed-on order.
Xv6 has many lock-order chains of length two involving the ptable.lock, due to
the way that sleep works as discussed in Chapter 5. For example, ideintr holds the
ide lock while calling wakeup, which acquires the ptable lock. The file system code
contains xv6’s longest lock chains. For example, creating a file requires simultaneously
holding a lock on the directory, a lock on the new file’s inode, a lock on a disk block
buffer, idelock, and ptable.lock. To avoid deadlock, file system code always acquires
locks in the order mentioned in the previous sentence.
Interrupt handlers
Xv6 uses spin-locks in many situations to protect data that is used by both interrupt
handlers and threads. For example, a timer interrupt might (3414) increment ticks at
about the same time that a kernel thread reads ticks in sys_sleep (3823). The lock
tickslock serializes the two accesses.
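A sketch of the two accesses that tickslock serializes; the function names here are illustrative (xv6 does this work inside trap and sys_sleep), but ticks and tickslock are the real kernel globals.
extern struct spinlock tickslock;   // declared in xv6's trap.c
extern uint ticks;

void
on_timer_tick(void)                 // illustrative; xv6 does this in trap()
{
  acquire(&tickslock);
  ticks++;                          // the interrupt handler's update
  wakeup(&ticks);
  release(&tickslock);
}

uint
read_ticks(void)                    // illustrative; sys_sleep reads ticks this way
{
  uint t;

  acquire(&tickslock);
  t = ticks;                        // the kernel thread's read
  release(&tickslock);
  return t;
}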
Interrupts can cause concurrency even on a single processor: if interrupts are en-
abled, kernel code can be stopped at any moment to run an interrupt handler instead.
Suppose iderw held the idelock and then got interrupted to run ideintr. Ideintr
would try to lock idelock, see it was held, and wait for it to be released. In this situ-
ation, idelock will never be released—only iderw can release it, and iderw will not
continue running until ideintr returns—so the processor, and eventually the whole
system, will deadlock. To avoid this situation, if a spin-lock is used by an interrupt
handler, a processor must never hold that lock with interrupts enabled; xv6 is more
conservative and keeps interrupts disabled on a processor whenever that processor
holds any spin-lock (acquire calls pushcli and release calls popcli).
Sleep locks
Sometimes xv6 code needs to hold a lock for a long time. For example, the file
system (Chapter 6) keeps a file locked while reading and writing its content on the
disk, and these disk operations can take tens of milliseconds. Efficiency demands that
the processor be yielded while waiting so that other threads can make progress, and
this in turn means that xv6 needs locks that work well when held across context
switches. Xv6 provides such locks in the form of sleep-locks.
Xv6 sleep-locks support yielding the processor during their critical sections. This
property poses a design challenge: if thread T1 holds lock L1 and has yielded the pro-
cessor, and thread T2 wishes to acquire L1, we have to ensure that T1 can execute
while T2 is waiting so that T1 can release L1. T2 can’t use the spin-lock acquire func-
tion here: it spins with interrupts turned off, and that would prevent T1 from running.
To avoid this deadlock, the sleep-lock acquire routine (called acquiresleep) yields the
processor while waiting, and does not disable interrupts.
acquiresleep (4622) uses techniques that will be explained in Chapter 5. At a
high level, a sleep-lock has a locked field that is protected by a spinlock, and ac-
quiresleep’s call to sleep atomically yields the CPU and releases the spin-lock. The
result is that other threads can execute while acquiresleep waits.
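A sketch of the sleep-lock structure and of acquiresleep (4622), consistent with this description; the field names follow xv6’s sleeplock.h.
struct sleeplock {
  uint locked;          // is the lock held?
  struct spinlock lk;   // spinlock protecting this sleep lock
  char *name;           // for debugging
  int pid;              // process currently holding the lock
};

void
acquiresleep(struct sleeplock *lk)
{
  acquire(&lk->lk);
  while(lk->locked)
    sleep(lk, &lk->lk);   // atomically yields the CPU and releases lk->lk
  lk->locked = 1;
  lk->pid = myproc()->pid;
  release(&lk->lk);
}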
Because sleep-locks leave interrupts enabled, they cannot be used in interrupt
handlers. Because acquiresleep may yield the processor, sleep-locks cannot be used
inside spin-lock critical sections (though spin-locks can be used inside sleep-lock criti-
cal sections).
Xv6 uses spin-locks in most situations, since they have low overhead. It uses
sleep-locks only in the file system, where it is convenient to be able to hold locks
across lengthy disk operations.
Limitations of locks
Locks often solve concurrency problems cleanly, but there are times when they are
awkward. Subsequent chapters will point out such situations in xv6; this section out-
lines some of the problems that come up.
Sometimes a function uses data which must be guarded by a lock, but the func-
tion is called both from code that already holds the lock and from code that wouldn’t
otherwise need the lock. One way to deal with this is to have two variants of the
function, one that acquires the lock, and the other that expects the caller to already
hold the lock; see wakeup1 for an example (2953). Another approach is for the function
to require callers to hold the lock whether the caller needs it or not, as with sched
(2758). Kernel developers need to be aware of such requirements.
It might seem that one could simplify situations where both caller and callee need
a lock by allowing recursive locks, so that if a function holds a lock, any function it
calls is allowed to re-acquire that lock. However, recursive locks make it harder to
reason about the invariants a lock protects, since the protected data may be mid-update
when a nested acquire succeeds, so xv6 does not provide them.
Real world
Concurrency primitives and parallel programming are active areas of research, because
programming with locks is still challenging. It is best to use locks as the base for
higher-level constructs like synchronized queues, although xv6 does not do this. If you
program with locks, it is wise to use a tool that attempts to identify race conditions,
because it is easy to miss an invariant that requires a lock.
Most operating systems support POSIX threads (Pthreads), which allow a user
process to have several threads running concurrently on different processors. Pthreads
has support for user-level locks, barriers, etc. Supporting Pthreads requires support
from the operating system. For example, it should be the case that if one pthread
blocks in a system call, another pthread of the same process should be able to run on
that processor. As another example, if a pthread changes its process’s address space
(e.g., grow or shrink it), the kernel must arrange that other processors that run threads
of the same process update their hardware page tables to reflect the change in the ad-
dress space. On the x86, this involves shooting down the Translation Look-aside Buffer
(TLB) of other processors using inter-processor interrupts (IPIs).
It is possible to implement locks without atomic instructions, but it is expensive,
and most operating systems use atomic instructions.
Locks can be expensive if many processors try to acquire the same lock at the
same time. If one processor has a lock cached in its local cache, and another proces-
sor must acquire the lock, then the atomic instruction to update the cache line that
holds the lock must move the line from the one processor’s cache to the other proces-
sor’s cache, and perhaps invalidate any other copies of the cache line. Fetching a cache
line from another processor’s cache can be orders of magnitude more expensive than
fetching a line from a local cache.
To avoid the expenses associated with locks, many operating systems use lock-free
data structures and algorithms. For example, it is possible to implement a linked list
like the one in the beginning of the chapter that requires no locks during list searches,
and one atomic instruction to insert an item in a list. Lock-free programming is more
complicated, however, than programming locks; for example, one must worry about
instruction and memory reordering.
Exercises
1. Move the acquire in iderw to before sleep. Is there a race? Why don’t you
observe it when booting xv6 and running stressfs? Increase the critical section with a
dummy loop; what do you see now? Explain.
2. Remove the xchg in acquire. Explain what happens when you run xv6.
3. Write a parallel program using POSIX threads, which is supported on most op-
erating systems. For example, implement a parallel hash table and measure if the num-
ber of puts/gets scales with increasing number of cores.
4. Implement a subset of Pthreads in xv6. That is, implement a user-level thread
library so that a user process can have more than one thread and arrange that these
threads can run in parallel on different processors. Come up with a design that cor-
rectly handles a thread making a blocking system call and changing its shared address
space.
Scheduling
Any operating system is likely to run with more processes than the computer has
processors, so a plan is needed to time-share the processors among the processes. Ide-
ally the sharing would be transparent to user processes. A common approach is to
provide each process with the illusion that it has its own virtual processor by multi-
plexing the processes onto the hardware processors. This chapter explains how xv6
achieves this multiplexing.
Multiplexing
Xv6 multiplexes by switching each processor from one process to another in two
situations. First, xv6’s sleep and wakeup mechanism switches when a process waits for
device or pipe I/O to complete, or waits for a child to exit, or waits in the sleep sys-
tem call. Second, xv6 periodically forces a switch when a process is executing user in-
structions. This multiplexing creates the illusion that each process has its own CPU,
just as xv6 uses the memory allocator and hardware page tables to create the illusion
that each process has its own memory.
Implementing multiplexing poses a few challenges. First, how to switch from one
process to another? Although the idea of context switching is simple, the implementa-
tion is some of the most opaque code in xv6. Second, how to switch transparently to
user processes? Xv6 uses the standard technique of driving context switches with
timer interrupts. Third, many CPUs may be switching among processes concurrently,
and a locking plan is necessary to avoid races. Fourth, a process’s memory and other
resources must be freed when the process exits, but it cannot do all of this itself be-
cause (for example) it can’t free its own kernel stack while still using it. Finally, each
core of a multi-core machine must remember which process it is executing so that sys-
tem calls affect the correct process’s kernel state. Xv6 tries to solve these problems as
simply as possible, but nevertheless the resulting code is tricky.
xv6 must provide ways for processes to coordinate among themselves. For exam-
ple, a parent process may need to wait for one of its children to exit, or a process
reading a pipe may need to wait for some other process to write the pipe. Rather than
make the waiting process waste CPU by repeatedly checking whether the desired event
has happened, xv6 allows a process to give up the CPU and sleep waiting for an event,
and allows another process to wake the first process up. Care is needed to avoid races
that result in the loss of event notifications. As an example of these problems and
their solution, this chapter examines the implementation of pipes.
Figure 5-1. Switching from one user process to another. In this example, xv6 runs with one CPU (and
thus one scheduler thread). (Diagram: the shell’s kernel stack, the scheduler’s stack, and cat’s kernel stack
in kernel space, linked by swtch calls that save and restore contexts.)
Figure 5-1 outlines the steps involved in switching from one user process to an-
other: a user-kernel transition (system call or interrupt) to the old process’s kernel
thread, a context switch to the current CPU’s scheduler thread, a context switch to a
new process’s kernel thread, and a trap return to the user-level process. The xv6
scheduler has its own thread (saved registers and stack) because it is sometimes not
safe for it to execute on any process’s kernel stack; we’ll see an example in exit. In this
section we’ll examine the mechanics of switching between a kernel thread and a sched-
uler thread.
Switching from one thread to another involves saving the old thread’s CPU regis-
ters, and restoring the previously-saved registers of the new thread; the fact that %esp
and %eip are saved and restored means that the CPU will switch stacks and switch
what code it is executing.
The function swtch performs the saves and restores for a thread switch. swtch
doesn’t directly know about threads; it just saves and restores register sets, called con-
texts. When it is time for a process to give up the CPU, the process’s kernel thread
calls swtch to save its own context and return to the scheduler context. Each context
is represented by a struct context*, a pointer to a structure stored on the kernel
stack involved. Swtch takes two arguments: struct context **old and struct
context *new. It pushes the current registers onto the stack and saves the stack point-
er in *old. Then swtch copies new to %esp, pops previously saved registers, and re-
turns.
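For reference, the register set that swtch saves and restores, roughly as xv6 declares it in proc.h; %eip is pushed and popped by the call and ret instructions themselves, and %esp is recorded implicitly as the address of the structure.
struct context {
  uint edi;
  uint esi;
  uint ebx;
  uint ebp;
  uint eip;
};

void swtch(struct context **old, struct context *new);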
Let’s follow a user process through swtch into the scheduler. We saw in Chapter
3 that one possibility at the end of each interrupt is that trap calls yield. Yield in
turn calls sched, which calls swtch to save the current context in proc->context and
switch to the scheduler context previously saved in cpu->scheduler (2822).
Swtch (3052) starts by copying its arguments from the stack to the caller-saved reg-
isters %eax and %edx (3060-3061); swtch must do this before it changes the stack pointer
and can no longer access the arguments via %esp. Then swtch pushes the register
state, creating a context structure on the current stack. Only the callee-saved registers
need to be saved; the convention on the x86 is that these are %ebp, %ebx, %esi,
%edi, and %esp. Swtch pushes the first four explicitly and saves %esp implicitly as the
struct context* written to *old.
Figure 5-2. Example lost wakeup problem. (Diagram: recv tests the pointer and then spins or sleeps
forever, because send’s store and wakeup happen in between.)
Suppose recv finds q->ptr zero and, just before it reaches the call to sleep at line
216, send runs on another CPU: it changes q->ptr to be nonzero and calls wakeup,
which finds no processes sleeping and thus does nothing. Now recv continues execut-
ing at line 216: it calls sleep and goes to sleep. This causes a problem: recv is
asleep waiting for a pointer that has already arrived. The next send will wait for recv
to consume the pointer in the queue, at which point the system will be deadlocked.
The root of this problem is that the invariant that recv only sleeps when q->ptr
== 0 is violated by send running at just the wrong moment. One incorrect way of
protecting the invariant would be to modify the code for recv as follows:
300 struct q {
301   struct spinlock lock;
302   void *ptr;
303 };
304
305 void
306 send(struct q *q, void *p)
307 {
308   acquire(&q->lock);
309   while(q->ptr != 0)
310     ;
311   q->ptr = p;
312   wakeup(q);
313   release(&q->lock);
314 }
315
316 void*
317 recv(struct q *q)
318 {
319   void *p;
320
321   acquire(&q->lock);
322   while((p = q->ptr) == 0)
323     sleep(q);
324   q->ptr = 0;
325   release(&q->lock);
326   return p;
327 }
One might hope that this version of recv would avoid the lost wakeup because the
lock prevents send from running between recv’s test of q->ptr and its call to sleep.
It does, but it also deadlocks: recv sleeps while still holding q->lock, so send will
spin forever waiting for that lock. Xv6’s fix is to change the interface of sleep: the
caller passes the condition lock to sleep as an argument (called lk below), and sleep
acquires ptable.lock before releasing lk. Now
that sleep holds ptable.lock, it is safe to release lk: some other process may start a
call to wakeup(chan), but wakeup will not run until it can acquire ptable.lock, so it
must wait until sleep has finished putting the process to sleep, keeping the wakeup
from missing the sleep.
There is a minor complication: if lk is equal to &ptable.lock, then sleep would
deadlock trying to acquire it as &ptable.lock and then release it as lk. In this case,
sleep considers the acquire and release to cancel each other out and skips them en-
tirely (2890). For example, wait (2964) calls sleep with &ptable.lock.
Now that sleep holds ptable.lock and no others, it can put the process to sleep
by recording the sleep channel, changing the process state, and calling sched (2895-2898).
At some point later, a process will call wakeup(chan). Wakeup (2964) acquires pt-
able.lock and calls wakeup1, which does the real work. It is important that wakeup
hold the ptable.lock both because it is manipulating process states and because, as
we just saw, ptable.lock makes sure that sleep and wakeup do not miss each other.
Wakeup1 is a separate function because sometimes the scheduler needs to execute a
wakeup when it already holds the ptable.lock; we will see an example of this later.
Wakeup1 (2953) loops over the process table. When it finds a process in state SLEEPING
with a matching chan, it changes that process’s state to RUNNABLE. The next time the
scheduler runs, it will see that the process is ready to be run.
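A sketch of wakeup1 (2953) as just described; it assumes the caller already holds ptable.lock.
static void
wakeup1(void *chan)
{
  struct proc *p;

  for(p = ptable.proc; p < &ptable.proc[NPROC]; p++)
    if(p->state == SLEEPING && p->chan == chan)
      p->state = RUNNABLE;
}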
Xv6 code always calls wakeup while holding the lock that guards the sleep condi-
tion; in the example above that lock is q->lock. Strictly speaking it is sufficient if
wakeup always follows the acquire (that is, one could call wakeup after the release).
Why do the locking rules for sleep and wakeup ensure a sleeping process won’t miss
a wakeup it needs? The sleeping process holds either the lock on the condition or the
ptable.lock or both from a point before it checks the condition to a point after it is
marked as sleeping. If a concurrent thread causes the condition to be true, that thread
must either hold the lock on the condition before the sleeping thread acquired it, or
after the sleeping thread released it in sleep. If before, the sleeping thread must have
seen the new condition value, and decided to sleep anyway, so it doesn’t matter if it
misses the wakeup. If after, then the earliest the waker could acquire the lock on the
condition is after sleep acquires ptable.lock, so that wakeup’s acquisition of pt-
able.lock must wait until sleep has completely finished putting the sleeper to sleep.
Then wakeup will see the sleeping process and wake it up (unless something else
wakes it up first).
It is sometimes the case that multiple processes are sleeping on the same channel;
for example, more than one process reading from a pipe. A single call to wakeup will
wake them all up. One of them will run first and acquire the lock that sleep was
called with, and (in the case of pipes) read whatever data is waiting in the pipe. The
other processes will find that, despite being woken up, there is no data to read; from
their point of view the wakeup was spurious, and they must sleep again. For this
reason sleep is always called inside a loop that re-checks the condition.
Code: Pipes
The simple queue we used earlier in this chapter was a toy, but xv6 contains two real
queues that use sleep and wakeup to synchronize readers and writers. One is in the
IDE driver: a process adds a disk request to a queue and then calls sleep. The IDE
interrupt handler uses wakeup to alert the process that its request has completed.
A more complex example is the implementation of pipes. We saw the interface
for pipes in Chapter 0: bytes written to one end of a pipe are copied in an in-kernel
buffer and then can be read out of the other end of the pipe. Future chapters will ex-
amine the file descriptor support surrounding pipes, but let’s look now at the imple-
mentations of pipewrite and piperead.
Each pipe is represented by a struct pipe, which contains a lock and a data
buffer. The fields nread and nwrite count the number of bytes read from and written
to the buffer. The buffer wraps around: the next byte written after buf[PIPESIZE-1]
is buf[0]. The counts do not wrap. This convention lets the implementation distin-
guish a full buffer (nwrite == nread+PIPESIZE) from an empty buffer (nwrite ==
nread), but it means that indexing into the buffer must use buf[nread % PIPESIZE]
instead of just buf[nread] (and similarly for nwrite). Let’s suppose that calls to
piperead and pipewrite happen simultaneously on two different CPUs.
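For reference, a sketch of the pipe structure just described; the buffer is named buf to match the text (xv6’s own declaration may use a different field name), and the readopen and writeopen fields, not discussed here, record whether each end of the pipe is still open.
#define PIPESIZE 512

struct pipe {
  struct spinlock lock;
  char buf[PIPESIZE];   // wrap-around data buffer
  uint nread;           // total bytes read; never wraps
  uint nwrite;          // total bytes written; never wraps
  int readopen;         // read end still open
  int writeopen;        // write end still open
};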
Pipewrite (6830) begins by acquiring the pipe’s lock, which protects the counts,
the data, and their associated invariants. Piperead (6851) then tries to acquire the lock
too, but cannot. It spins in acquire (1574) waiting for the lock. While piperead waits,
pipewrite loops over the bytes being written—addr[0], addr[1], ..., addr[n-
1]—adding each to the pipe in turn (6844). During this loop, it could happen that the
buffer fills (6836). In this case, pipewrite calls wakeup to alert any sleeping readers to
the fact that there is data waiting in the buffer and then sleeps on &p->nwrite to wait
for a reader to take some bytes out of the buffer. Sleep releases p->lock as part of
putting pipewrite’s process to sleep.
Now that p->lock is available, piperead manages to acquire it and enters its crit-
ical section: it finds that p->nread != p->nwrite (6856) (pipewrite went to sleep be-
cause p->nwrite == p->nread+PIPESIZE (6836)) so it falls through to the for loop,
copies data out of the pipe (6863-6867), and increments nread by the number of bytes
copied. That many bytes are now available for writing, so piperead calls wakeup (6868)
to wake any sleeping writers before it returns to its caller. Wakeup finds a process
sleeping on &p->nwrite, the process that was running pipewrite but stopped when
the buffer filled. It marks that process as RUNNABLE.
Real world
The xv6 scheduler implements a simple scheduling policy, which runs each pro-
cess in turn. This policy is called round robin. Real operating systems implement more
sophisticated policies that, for example, allow processes to have priorities. The idea is
that a runnable high-priority process will be preferred by the scheduler over a
runnable low-priority process. These policies can become complex quickly because
there are often competing goals: for example, the operating system might also want to
guarantee fairness and high throughput. In addition, complex policies may lead to unintend-
ed interactions such as priority inversion and convoys. Priority inversion can happen
when a low-priority and high-priority process share a lock, which when acquired by
the low-priority process can prevent the high-priority process from making progress.
A long convoy can form when many high-priority processes are waiting for a low-pri-
ority process that acquires a shared lock; once a convoy has formed it can persist for a
long time. To avoid these kinds of problems, additional mechanisms are necessary in
sophisticated schedulers.
Sleep and wakeup are a simple and effective synchronization method, but there
are many others. The first challenge in all of them is to avoid the ‘‘lost wakeups’’ prob-
lem we saw at the beginning of the chapter. The original Unix kernel’s sleep simply
disabled interrupts, which sufficed because Unix ran on a single-CPU system. Because
xv6 runs on multiprocessors, it adds an explicit lock to sleep. FreeBSD’s msleep takes
the same approach. Plan 9’s sleep uses a callback function that runs with the
scheduling lock held just before going to sleep; the function serves as a last minute
check of the sleep condition, to avoid lost wakeups. The Linux kernel’s sleep uses an
explicit process queue, called a wait queue, instead of a wait channel; the queue has its
own internal lock.
Exercises
1. Sleep has to check lk != &ptable.lock to avoid a deadlock (2890-2893). Sup-
pose the special case were eliminated by replacing
if(lk != &ptable.lock){
  acquire(&ptable.lock);
  release(lk);
}
with
release(lk);
acquire(&ptable.lock);
Doing this would break sleep. How?
2. Most process cleanup could be done by either exit or wait, but we saw above
that exit must not free p->stack. It turns out that exit must be the one to close the
open files. Why? The answer involves pipes.
3. Implement semaphores in xv6. You can use mutexes but do not use sleep and
wakeup. Replace the uses of sleep and wakeup in xv6 with semaphores. Judge the re-
sult.
4. Fix the race mentioned above between kill and sleep, so that a kill that oc-
curs after the victim’s sleep loop checks p->killed but before it calls sleep results in
the victim abandoning the current system call.
5. Design a plan so that every sleep loop checks p->killed so that, for example,
a process that is in the IDE driver can return quickly from the while loop if another
process kills that process.
6. Design a plan that uses only one context switch when switching from one user
process to another. This plan involves running the scheduler procedure on the kernel
stack of the user process, instead of the dedicated scheduler stack. The main challenge
is to clean up a user process correctly. Measure the performance benefit of avoiding
one context switch.
7. Modify xv6 to turn off a processor when it is idle and just spinning in the loop
in scheduler. (Hint: look at the x86 HLT instruction.)
8. The lock p->lock protects many invariants, and when looking at a particular
piece of xv6 code that is protected by p->lock, it can be difficult to figure out which
invariant is being enforced. Design a cleaner plan, perhaps by splitting p->lock into
several locks.
File system
The purpose of a file system is to organize and store data. File systems typically
support sharing of data among users and applications, as well as persistence so that
data is still available after a reboot.
The xv6 file system provides Unix-like files, directories, and pathnames (see Chap-
ter 0), and stores its data on an IDE disk for persistence (see Chapter 3). The file sys-
tem addresses several challenges:
• The file system needs on-disk data structures to represent the tree of named di-
rectories and files, to record the identities of the blocks that hold each file’s con-
tent, and to record which areas of the disk are free.
• The file system must support crash recovery. That is, if a crash (e.g., power failure)
occurs, the file system must still work correctly after a restart. The risk is that a
crash might interrupt a sequence of updates and leave inconsistent on-disk data
structures (e.g., a block that is both used in a file and marked free).
• Different processes may operate on the file system at the same time, so the file
system code must coordinate to maintain invariants.
• Accessing a disk is orders of magnitude slower than accessing memory, so the file
system must maintain an in-memory cache of popular blocks.
The rest of this chapter explains how xv6 addresses these challenges.
Overview
The xv6 file system implementation is organized in seven layers, shown in Figure
6-1. The disk layer reads and writes blocks on an IDE hard drive. The buffer cache
layer caches disk blocks and synchronizes access to them, making sure that only one
kernel process at a time can modify the data stored in any particular block. The log-
ging layer allows higher layers to wrap updates to several blocks in a transaction, and
ensures that the blocks are updated atomically in the face of crashes (i.e., all of them
are updated or none). The inode layer provides individual files, each represented as an
inode with a unique i-number and some blocks holding the file’s data. The directory
layer implements each directory as a special kind of inode whose content is a sequence
of directory entries, each of which contains a file’s name and i-number. The pathname
layer provides hierarchical path names like /usr/rtm/xv6/fs.c, and resolves them
with recursive lookup. The file descriptor layer abstracts many Unix resources (e.g.,
pipes, devices, files, etc.) using the file system interface, simplifying the lives of applica-
tion programmers.
The file system must have a plan for where it stores inodes and content blocks on
the disk. To do so, xv6 divides the disk into several sections, as shown in Figure 6-2.
Figure 6-1. Layers of the xv6 file system (the diagram shows, among others, the inode, logging, and
buffer cache layers).
The file system does not use block 0 (it holds the boot sector). Block 1 is called the
superblock; it contains metadata about the file system (the file system size in blocks, the
number of data blocks, the number of inodes, and the number of blocks in the log).
Blocks starting at 2 hold the log. After the log are the inodes, with multiple inodes
per block. After those come bitmap blocks tracking which data blocks are in use. The
remaining blocks are data blocks; each is either marked free in the bitmap block, or
holds content for a file or directory. The superblock is filled in by a separate program,
called mkfs, which builds an initial file system.
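A sketch of the superblock contents just listed; the actual declaration in fs.h (4050) may also record the starting block of each region.
struct superblock {
  uint size;      // file system size in blocks
  uint nblocks;   // number of data blocks
  uint ninodes;   // number of inodes
  uint nlog;      // number of log blocks
};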
The rest of this chapter discusses each layer, starting with the buffer cache. Look
out for situations where well-chosen abstractions at lower layers ease the design of
higher ones.
Figure 6-2. Structure of the xv6 file system. The header fs.h (4050) contains constants and data struc-
tures describing the exact layout of the file system.
Code: Buffer cache
The buffer cache is a doubly-linked list of buffers. The function binit, called by
main (1230), initializes the list with the NBUF buffers in the static array buf (4450-4459).
All other access to the buffer cache refers to the linked list via bcache.head, not the
buf array.
A buffer has two state bits associated with it. B_VALID indicates that the buffer
contains a copy of the block. B_DIRTY indicates that the buffer content has been mod-
ified and needs to be written to the disk.
Bread (4502) calls bget to get a buffer for the given sector (4506). If the buffer
needs to be read from disk, bread calls iderw to do that before returning the buffer.
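A sketch of bread matching this description:
struct buf*
bread(uint dev, uint sector)
{
  struct buf *b;

  b = bget(dev, sector);         // locked buffer for this sector, cached or not
  if((b->flags & B_VALID) == 0)
    iderw(b);                    // fetch the sector's contents from the disk
  return b;
}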
Bget (4466) scans the buffer list for a buffer with the given device and sector num-
bers (4472-4480). If there is such a buffer, bget acquires the sleep-lock for the buffer.
bget then returns the locked buffer.
If there is no cached buffer for the given sector, bget must make one, possibly
reusing a buffer that held a different sector. It scans the buffer list a second time,
looking for a buffer that is not locked and not dirty: any such buffer can be used.
Bget edits the buffer metadata to record the new device and sector number and ac-
quires its sleep-lock. Note that the assignment to flags clears B_VALID, thus ensuring
that bread will read the block data from disk rather than incorrectly using the buffer’s
previous contents.
It is important that there is at most one cached buffer per disk sector, to ensure
that readers see writes, and because the file system uses locks on buffers for synchro-
nization. bget ensures this invariant by holding the bcache.lock continuously from
the first loop’s check of whether the block is cached through the second loop’s declara-
tion that the block is now cached (by setting dev, blockno, and refcnt). This causes
the check for a block’s presence and (if not present) the designation of a buffer to hold
the block to be atomic.
It is safe for bget to acquire the buffer’s sleep-lock outside of the bcache.lock
critical section, since the non-zero b->refcnt prevents the buffer from being re-used
for a different disk block. The sleep-lock protects reads and writes of the block’s
buffered content, while the bcache.lock protects information about which blocks are
cached.
If all the buffers are busy, then too many processes are simultaneously executing
file system calls; bget panics. A more graceful response might be to sleep until a
buffer became free, though there would then be a possibility of deadlock.
Logging layer
One of the most interesting problems in file system design is crash recovery. The
problem arises because many file system operations involve multiple writes to the disk,
and a crash after a subset of the writes may leave the on-disk file system in an incon-
sistent state. For example, suppose a crash occurs during file truncation (setting the
length of a file to zero and freeing its content blocks). Depending on the order of the
disk writes, the crash may either leave an inode with a reference to a content block
that is marked free, or it may leave an allocated but unreferenced content block.
The latter is relatively benign, but an inode that refers to a freed block is likely to
cause serious problems after a reboot. After reboot, the kernel might allocate that
block to another file, and now we have two different files pointing unintentionally to
the same block. If xv6 supported multiple users, this situation could be a security
problem, since the old file’s owner would be able to read and write blocks in the new
file, owned by a different user.
Xv6 solves the problem of crashes during file system operations with a simple
form of logging. An xv6 system call does not directly write the on-disk file system data
structures. Instead, it places a description of all the disk writes it wishes to make in a
log on the disk. Once the system call has logged all of its writes, it writes a special
commit record to the disk indicating that the log contains a complete operation. At
that point the system call copies the writes to the on-disk file system data structures.
After those writes have completed, the system call erases the log on disk.
If the system should crash and reboot, the file system code recovers from the
crash as follows, before running any processes. If the log is marked as containing a
complete operation, then the recovery code copies the writes to where they belong in
the on-disk file system. If the log is not marked as containing a complete operation,
the recovery code ignores the log. The recovery code finishes by erasing the log.
Log design
The log resides at a known fixed location, specified in the superblock. It consists
of a header block followed by a sequence of updated block copies (‘‘logged blocks’’).
The header block contains an array of sector numbers, one for each of the logged
blocks, and the count of log blocks. The count in the header block on disk is either
zero, indicating that there is no transaction in the log, or non-zero, indicating that the
log contains a complete committed transaction with the indicated number of logged
blocks. Xv6 writes the header block when a transaction commits, but not before, and
sets the count to zero after copying the logged blocks to the file system. Thus a crash
midway through a transaction will result in a count of zero in the log’s header block; a
crash after a commit will result in a non-zero count.
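The header block’s contents, as implied by this description (log.c declares something very similar; LOGSIZE is the fixed number of log blocks):
struct logheader {
  int n;               // number of logged blocks; zero means no committed transaction
  int block[LOGSIZE];  // home sector number of each logged block
};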
Each system call’s code indicates the start and end of the sequence of writes that
must be atomic with respect to crashes. To allow concurrent execution of file system
operations by different processes, the logging system can accumulate the writes of mul-
tiple system calls into one transaction. Thus a single commit may involve the writes of
multiple complete system calls. To avoid splitting a system call across transactions, the
logging system only commits when no file system system calls are underway.
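In xv6 these start and end markers are the functions begin_op and end_op, and a modified buffer is recorded in the transaction with log_write. A hedged sketch of the typical pattern inside a system call (update_block itself is hypothetical):
void
update_block(uint dev, uint blockno)   // hypothetical helper, for illustration
{
  struct buf *bp;

  begin_op();                // wait until the log has room for this call's writes
  bp = bread(dev, blockno);  // locked buffer holding the block
  // ... modify bp->data here ...
  log_write(bp);             // record the block in the current transaction
  brelse(bp);                // release the buffer
  end_op();                  // commit, possibly grouped with other system calls
}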
The idea of committing several transactions together is known as group commit.
Group commit reduces the number of disk operations because it amortizes the fixed
cost of a commit over multiple operations. Group commit also hands the disk system
more concurrent writes at the same time, perhaps allowing the disk to write them all
during a single disk rotation. Xv6’s IDE driver doesn’t support this kind of batching,
but xv6’s file system design allows for it.
Xv6 dedicates a fixed amount of space on the disk to hold the log. The total
number of blocks written by the system calls in a transaction must fit in that space.
This has two consequences. No single system call can be allowed to write more dis-
tinct blocks than there is space in the log. This is not a problem for most system calls,
but two of them can potentially write many blocks: write and unlink. A large file
write may write many data blocks and many bitmap blocks as well as an inode block;
unlinking a large file might write many bitmap blocks and an inode. Xv6’s write sys-
tem call breaks up large writes into multiple smaller writes that fit in the log, and un-
link doesn’t cause problems because in practice the xv6 file system uses only one
bitmap block. The other consequence of limited log space is that the logging system
cannot allow a system call to start unless it is certain that the system call’s writes will
fit in the space remaining in the log.
Inode layer
The term inode can have one of two related meanings. It might refer to the on-
disk data structure containing a file’s size and list of data block numbers. Or ‘‘inode’’
might refer to an in-memory inode, which contains a copy of the on-disk inode as
well as extra information needed within the kernel.
The on-disk inodes are packed into a contiguous area of disk called the inode
blocks. Every inode is the same size, so it is easy, given a number n, to find the nth
inode on the disk. In fact, this number n, called the inode number or i-number, is
how inodes are identified in the implementation.
The on-disk inode is defined by a struct dinode (4078). The type field distin-
guishes between files, directories, and special files (devices). A type of zero indicates
that an on-disk inode is free. The nlink field counts the number of directory entries
that refer to this inode, in order to recognize when the on-disk inode and its data
blocks should be freed.
Code: Inodes
Figure 6-3. The representation of a file on disk: a dinode records the file’s type, major and minor device
numbers, link count, and size, followed by NDIRECT direct block addresses and the address of an indirect
block holding 128 further block addresses.
A file’s link count can drop to zero while some process still holds the file open; that
process can keep reading and writing the file, because it successfully opened it. But if
a crash happens before the last process closes the file descriptor for the file, then the
file will be marked allocated on disk but no directory entry points to it.
File systems handle this case in one of two ways. The simple solution is that on
recovery, after reboot, the file system scans the whole file system for files that are
marked allocated, but have no directory entry pointing to them. If any such file exists,
then it can free those files.
The second solution doesn’t require scanning the file system. In this solution, the
file system records on disk (e.g., in the super block) the i-number of a file whose
link count drops to zero but whose reference count isn’t zero. If the file system re-
moves the file when its reference count reaches zero, then it updates the on-disk list by
removing that inode from the list. On recovery, the file system frees any file in the list.
Xv6 implements neither solution, which means that inodes may be marked allo-
cated on disk, even though they are not in use anymore. This means that over time
xv6 runs the risk that it may run out of disk space.
The on-disk inode structure, struct dinode, contains a size and an array of
block numbers (see Figure 6-3). The inode data is found in the blocks listed in the
dinode’s addrs array. The first NDIRECT blocks of data are listed in the first NDIRECT
entries of the array; the next NINDIRECT blocks of data are listed in the indirect block,
whose address is stored in the array’s final entry.
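For reference, a sketch of struct dinode (4078) consistent with this description and with Figure 6-3:
#define NDIRECT 12

struct dinode {
  short type;             // file type; zero means this on-disk inode is free
  short major;            // major device number (T_DEV only)
  short minor;            // minor device number (T_DEV only)
  short nlink;            // number of directory entries referring to this inode
  uint size;              // size of the file's content in bytes
  uint addrs[NDIRECT+1];  // NDIRECT direct block numbers plus one indirect block
};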
itrunc frees a file’s blocks, resetting the inode’s size to zero. Itrunc (5456) starts
by freeing the direct blocks (5462-5467), then the ones listed in the indirect block (5472-
5475), and finally the indirect block itself (5477-5478).
Bmap makes it easy for readi and writei to get at an inode’s data. Readi (5503)
starts by making sure that the offset and count are not beyond the end of the file.
Reads that start beyond the end of the file return an error (5514-5515) while reads that
start at or cross the end of the file return fewer bytes than requested (5516-5517). The
main loop processes each block of the file, copying data from the buffer into dst
(5519-5524). writei (5553) is identical to readi, with three exceptions: writes that start
at or cross the end of the file grow the file, up to the maximum file size (5566-5567); the
loop copies data into the buffers instead of out (5572); and if the write has extended the
file, writei must update its size (5577-5580).
Both readi and writei begin by checking for ip->type == T_DEV. This case
handles special devices whose data does not live in the file system; we will return to
this case in the file descriptor layer.
The function stati (5488) copies inode metadata into the stat structure, which is
exposed to user programs via the stat system call.
Real world
The buffer cache in a real-world operating system is significantly more complex
than xv6’s, but it serves the same two purposes: caching and synchronizing access to
the disk. Xv6’s buffer cache, like V6’s, uses a simple least recently used (LRU) eviction
policy; there are many more complex policies that can be implemented, each good for
some workloads and not as good for others. A more efficient LRU cache would elimi-
nate the linked list, instead using a hash table for lookups and a heap for LRU evic-
tions. Modern buffer caches are typically integrated with the virtual memory system
to support memory-mapped files.
Xv6’s logging system is inefficient. A commit cannot occur concurrently with file
system system calls. The system logs entire blocks, even if only a few bytes in a block
are changed. It performs synchronous log writes, a block at a time, each of which is
likely to require an entire disk rotation time. Real logging systems address all of these
problems.
Logging is not the only way to provide crash recovery. Early file systems used a
scavenger during reboot (for example, the UNIX fsck program) to examine every file
and directory and the block and inode free lists, looking for and resolving inconsisten-
cies. Scavenging can take hours for large file systems, and there are situations where it
is not possible to resolve inconsistencies in a way that causes the original system calls
to be atomic. Recovery from a log is much faster and causes system calls to be atomic
in the face of crashes.
Xv6 uses the same basic on-disk layout of inodes and directories as early UNIX;
this scheme has been remarkably persistent over the years. BSD’s UFS/FFS and Linux’s
ext2/ext3 use essentially the same data structures. The most inefficient part of the file
system layout is the directory, which requires a linear scan over all the disk blocks dur-
ing each lookup. This is reasonable when directories are only a few disk blocks, but is
expensive for directories holding many files. Microsoft Windows’s NTFS, Mac OS X’s
HFS, and Solaris’s ZFS, just to name a few, implement a directory as an on-disk bal-
anced tree of blocks. This is complicated but guarantees logarithmic-time directory
lookups.
Xv6 is naive about disk failures: if a disk operation fails, xv6 panics. Whether this
is reasonable depends on the hardware: if an operating system sits atop special hard-
ware that uses redundancy to mask disk failures, perhaps the operating system sees
failures so infrequently that panicking is okay. On the other hand, operating systems
using plain disks should expect failures and handle them more gracefully, so that the
loss of a block in one file doesn’t affect the rest of the file system.
Exercises
1. Why panic in balloc? Can xv6 recover?
2. Why panic in ialloc? Can xv6 recover?
3. Why doesn’t filealloc panic when it runs out of files? Why is this more
common and therefore worth handling?
4. Suppose the file corresponding to ip gets unlinked by another process between
sys_link’s calls to iunlock(ip) and dirlink. Will the link be created correctly?
Why or why not?
6. create makes four function calls (one to ialloc and three to dirlink) that
it requires to succeed. If any doesn’t, create calls panic. Why is this acceptable?
Why can’t any of those four calls fail?
7. sys_chdir calls iunlock(ip) before iput(cp->cwd), which might try to lock
cp->cwd, yet postponing iunlock(ip) until after the iput would not cause deadlocks.
Why not?
8. Implement the lseek system call. Supporting lseek will also require that you
modify filewrite to fill holes in the file with zero if lseek sets off beyond f->ip-
>size.
9. Add O_TRUNC and O_APPEND to open, so that > and >> operators work in the
shell.
Summary
This text introduced the main ideas in operating systems by studying one operating
system, xv6, line by line. Some code lines embody the essence of the main ideas (e.g.,
context switching, user/kernel boundary, locks, etc.) and each line is important; other
code lines provide an illustration of how to implement a particular operating system
idea and could easily be done in different ways (e.g., a better algorithm for scheduling,
better on-disk data structures to represent files, better logging to allow for concurrent
transactions, etc.). All the ideas were illustrated in the context of one particular, very
successful system call interface, the Unix interface, but those ideas carry over to the
design of other operating systems.
PC hardware
This appendix describes personal computer (PC) hardware, the platform on which
xv6 runs.
A PC is a computer that adheres to several industry standards, with the goal that
a given piece of software can run on PCs sold by multiple vendors. These standards
evolve over time and a PC from the 1990s doesn’t look like a PC now. Many of the cur-
rent standards are public and you can find documentation for them online.
From the outside a PC is a box with a keyboard, a screen, and various devices
(e.g., CD-ROM, etc.). Inside the box is a circuit board (the ‘‘motherboard’’) with CPU
chips, memory chips, graphic chips, I/O controller chips, and busses through which the
chips communicate. The busses adhere to standard protocols (e.g., PCI and USB) so
that devices will work with PCs from multiple vendors.
From our point of view, we can abstract the PC into three components: CPU,
memory, and input/output (I/O) devices. The CPU performs computation, the memo-
ry contains instructions and data for that computation, and devices allow the CPU to
interact with hardware for storage, communication, and other functions.
You can think of main memory as connected to the CPU with a set of wires, or
lines, some for address bits, some for data bits, and some for control flags. To read a
value from main memory, the CPU sends high or low voltages representing 1 or 0 bits
on the address lines and a 1 on the ‘‘read’’ line for a prescribed amount of time and
then reads back the value by interpreting the voltages on the data lines. To write a
value to main memory, the CPU sends appropriate bits on the address and data lines
and a 1 on the ‘‘write’’ line for a prescribed amount of time. Real memory interfaces
are more complex than this, but the details are only important if you need to achieve
high performance.
I/O
Processors must communicate with devices as well as memory. The x86 processor
provides special in and out instructions that read and write values from device ad-
dresses called I/O ports. The hardware implementation of these instructions is essen-
tially the same as reading and writing memory. Early x86 processors had an extra ad-
dress line: 0 meant read/write from an I/O port and 1 meant read/write from main
memory. Each hardware device monitors these lines for reads and writes to its as-
signed range of I/O ports. A device’s ports let the software configure the device, exam-
ine its status, and cause the device to take actions; for example, software can use I/O
port reads and writes to cause the disk interface hardware to read and write sectors on
the disk.
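xv6 wraps the in and out instructions in small inline-assembly helpers (in x86.h); a sketch, assuming the uchar and ushort typedefs from types.h:
static inline uchar
inb(ushort port)
{
  uchar data;

  asm volatile("in %1,%0" : "=a" (data) : "d" (port));
  return data;
}

static inline void
outb(ushort port, uchar data)
{
  asm volatile("out %0,%1" : : "a" (data), "d" (port));
}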
Many computer architectures have no separate device access instructions. Instead
the devices have fixed memory addresses and the processor communicates with the
device (at the operating system’s behest) by reading and writing values at those ad-
dresses. In fact, modern x86 architectures use this technique, called memory-mapped
I/O, for most high-speed devices such as network, disk, and graphics controllers. For
reasons of backwards compatibility, though, the old in and out instructions linger, as
do legacy hardware devices that use them, such as the IDE disk controller, which xv6
uses.
Figure B-1. The relationship between logical, linear, and physical addresses.
(Figure: layout of a segment descriptor, with Base, Limit, and Flags fields, as stored in the GDT or LDT.)
The GDT set up by the boot loader has a null entry, one entry for executable code, and
one entry for data. The code segment descriptor has a flag set that indicates that the code
should run in 32-bit mode (0660). With this setup, when the boot loader enters protect-
ed mode, logical addresses map one-to-one to physical addresses.
The boot loader executes an lgdt instruction (9141) to load the processor’s global
descriptor table (GDT) register with the value gdtdesc (9187-9189), which points to the
table gdt.
Once it has loaded the GDT register, the boot loader enables protected mode by
setting the 1 bit (CR0_PE) in register %cr0 (9142-9144). Enabling protected mode does
not immediately change how the processor translates logical to physical addresses; it is
only when one loads a new value into a segment register that the processor reads the
GDT and changes its internal segmentation settings. One cannot directly modify %cs,
so instead the code executes an ljmp (far jump) instruction (9153), which allows a code
segment selector to be specified. The jump continues execution at the next line (9156)
but in doing so sets %cs to refer to the code descriptor entry in gdt. That descriptor
describes a 32-bit code segment, so the processor switches into 32-bit mode. The boot
loader has nursed the processor through an evolution from 8088 through 80286 to
80386.
The boot loader’s first action in 32-bit mode is to initialize the data segment reg-
isters with SEG_KDATA (9158-9161). Logical addresses now map directly to physical ad-
dresses. The only step left before executing C code is to set up a stack in an unused
region of memory. The memory from 0xa0000 to 0x100000 is typically littered with
device memory regions, and the xv6 kernel expects to be placed at 0x100000. The
boot loader itself is at 0x7c00 through 0x7e00 (512 bytes). Essentially any other sec-
tion of memory would be a fine location for the stack. The boot loader chooses
0x7c00 (known in this file as $start) as the top of the stack; the stack will grow
down from there, toward 0x0000, away from the boot loader.
Finally the boot loader calls the C function bootmain (9168). Bootmain’s job is to
load and run the kernel. It only returns if something has gone wrong. In that case,
the code sends a few output words on port 0x8a00 (9170-9176). On real hardware, there
is no device connected to that port, so this code does nothing. If the boot loader is
running inside a PC simulator, port 0x8a00 is connected to the simulator itself and
can transfer control back to the simulator. Simulator or not, the code then executes an
infinite loop (9177-9178). A real boot loader might attempt to print an error message
first.
Real world
The boot loader described in this appendix compiles to around 470 bytes of ma-
chine code, depending on the optimizations used when compiling the C code. In or-
der to fit in that small amount of space, the xv6 boot loader makes a major simplify-
ing assumption, that the kernel has been written to the boot disk contiguously starting
at sector 1. More commonly, kernels are stored in ordinary file systems, where they
may not be contiguous, or are loaded over a network. These complications require the
boot loader to be able to drive a variety of disk and network controllers and to
understand various file systems and network protocols; in other words, the boot loader
itself must be a small operating system.
Exercises
1. Due to sector granularity, the call to readseg in the text is equivalent to read-
seg((uchar*)0x100000, 0xb500, 0x1000). In practice, this sloppy behavior turns
out not to be a problem. Why doesn’t the sloppy readsect cause problems?
2. Suppose you wanted bootmain() to load the kernel at 0x200000 instead of
0x100000, and you did so by modifying bootmain() to add 0x100000 to the va of each
ELF section. Something would go wrong. What?
3. It seems potentially dangerous for the boot loader to copy the ELF header to mem-
ory at the arbitrary location 0x10000. Why doesn’t it call malloc to obtain the memo-
ry it needs?