A New C Compiler
Ken Thompson
AT&T Bell Laboratories
Murray Hill, New Jersey 07974
ABSTRACT
This paper describes yet another series of C compilers. These compilers were
developed over the last several years and are now in use on Plan 9. These compilers are
experimental in nature and were developed to try out new ideas. Some of the ideas were
good and some not so good.
1. Introduction
Most C compilers consist of a multitude of passes with numerous interfaces. A typical C compiler has the following passes – pre-processing, lexical analysis and parsing, code generation, optional assembly optimisation, assembly (which itself is usually multiple passes), and loading. [Joh79a]
If one profiles what is going on in this whole process, it becomes clear that I/O dominates. Of the cpu cycles expended, most go into conversion to and from intermediate file formats. Even with these many passes, the code generated is mostly line-at-a-time and not very efficient. With these conventional compilers as benchmarks, it seemed easy to make a new compiler that could execute much faster and still produce better code.
The first three compilers built were for the National 32000, Western 32100, and an internal computer called
a Crisp. These compilers have drifted into disuse. Currently there are active compilers for the Motorola
68020 and MIPS 2000/3000 computers. [Mot85, Kan88]
2. Structure
The compiler is a single program that produces an object file. Combined in the compiler are the traditional
roles of pre-processor, compiler, code generator, local optimiser, and first half of the assembler. The object
files are binary forms of assembly language, similar to what might be passed between the first and second
passes of an assembler.
Object files and libraries are combined and loaded by a second program to produce the executable binary.
The loader combines the roles of second half of the assembler, global optimiser, and loader. There is a
third small program that serves as an assembler. It takes an assembler-like input and performs a simple
translation into the object format.
3. The Language
The compiler implements ANSI C with some restrictions and extensions. [Ker88] If this had been a product-oriented project rather than a research vehicle, the compiler would have implemented exact ANSI C. Several of the poorer features were left out. Also, several extensions were added to help in the implementation of Plan 9. [Pik90] There are many more departures from the standard, particularly in the libraries, that are beyond the scope of this paper.
One of the extensions is the declaration of unnamed substructures and subunions. For example:

struct lock
{
    int    locked;
};

struct node
{
    int    type;
    union
    {
        double dval;
        float  fval;
        long   lval;
    };
    struct lock;
} *node;
This is a declaration with an unnamed substructure, lock, and an unnamed subunion. It shows the two major uses of the feature. The first is that members of the subunit can be referenced as if they were members of the outer structure: node->dval and node->locked are both legitimate references. In C, the name of a union is almost always a non-entity that is mechanically declared and used to no purpose.
The second usage is poor man’s classes. When a pointer to the outer structure is used in a context that is only legal for an unnamed substructure, the compiler promotes the type. This happens in assignment statements and in argument passing where prototypes have been declared. Thus, continuing with the example,
lock = node;
would assign a pointer to the unnamed lock in the node to the variable lock. Another example,
extern void lock(struct lock*);

func(...)
{
    ...
    lock(node);
    ...
}
will pass a pointer to the lock substructure.
It would be nice to add casts to the implicit conversions to unnamed substructures, but this would conflict with existing C practice. The problem comes from the almost ambiguous dual meaning of the cast operator. One usage is conversion; for example, (double)5 is a conversion, but (struct lock*)node is a PL/1 ‘‘unspec.’’
Another extension allows an external variable to be dedicated to a machine register for the life of the program. This feature is used for two variables in the Plan 9 kernel, u and m. U is a pointer to the structure representing the currently running process and m is a pointer to the per-machine data structure.
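For illustration only (the structure tags here are invented, not the kernel’s actual declarations), the declarations have roughly this shape; the compiler then reserves a machine register for each variable in every function it compiles:

    extern register struct User *u;    /* the currently running process */
    extern register struct Mach *m;    /* the per-machine data structure */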
A function that returns a structure is compiled by passing a pointer to the result as an implicit first argument: an assignment of the form

x = f(...)

is rewritten as

f(&x, ...).

This saves a copy and makes the compilation much less clumsy. A disadvantage is that if such a function is called without an assignment, a dummy location must be invented. An earlier version of the compiler passed a null pointer in such cases, but was changed to pass a dummy argument after measuring some running programs.
There is also a danger in calling a function that returns a structure without declaring it as such. Before ANSI C function prototypes, this alone would probably have been reason enough to find some other way of returning structures. These compilers have an option that complains every time a function is compiled without having been fully specified by a prototype, which catches this and many other errors. This option is now the default and is highly recommended for all ANSI C compilers.
5. Implementation
The compiler is divided internally into four machine-independent passes, four machine-dependent passes,
and an output pass. The next nine sections describe each pass in order.
5.1. Parsing
The first pass is a YACC-based parser. [Joh79b] All code is put into a parse tree and collected, without
interpretation, for the body of a function. The later passes then walk this tree.
The input stream of the parser is a pushdown list of input activations. The preprocessor expansions of #define and #include are implemented as pushdowns. Thus there is no separate pass for preprocessing.
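A sketch of the pushdown, with invented names; the compiler’s own reader is more elaborate:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Io Io;
    struct Io
    {
        FILE *f;    /* open file for an #include, else NULL */
        char *s;    /* #define body being rescanned, else NULL */
        Io   *link; /* previous input activation */
    };

    Io *iostack;    /* top of the pushdown list of activations */

    int
    getch(void)
    {
        int c;
        Io *io;

        while((io = iostack) != NULL){
            if(io->s != NULL && *io->s != '\0')
                return *io->s++;            /* next macro character */
            if(io->f != NULL && (c = getc(io->f)) != EOF)
                return c;                   /* next file character */
            iostack = io->link;             /* exhausted: pop the activation */
            free(io);
        }
        return EOF;
    }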
Even though it is just one pass of many, the parsing takes 50% of the execution time of the whole compiler. Most of this (75%) is due to the inefficiencies of YACC. The remaining 25% of the parse time is due to the low-level character handling. The flexibility of YACC was very important in the initial writing of the compiler, but it would probably be worth the effort to write a custom recursive descent parser.
5.2. Typing
The next pass distributes typing information to every node of the tree. Implicit operations on the tree are
added, such as type promotions and taking the address of arrays and functions.
5.5. Addressability
This is the first of the machine-dependent passes. The addressability of a computer is defined as the set of expressions that are legal in the address field of a machine language instruction. The addressability of different computers varies widely. At one end of the spectrum are the 68020 and VAX, which allow a complex array of incrementing, decrementing, indexing and relative addressing. At the other end is the MIPS, which allows only registers and constant offsets from the contents of a register. The addressability can be different for different instructions within the same computer.
It is important to the code generator to know when a subtree represents an address of a particular type. This is done with a bottom-up walk of the tree. In this pass, the leaves are labelled with small integers. When an internal node is encountered, it is labelled by consulting a table indexed by the labels on the left and right subtrees. For example, on the 68020 computer, it is possible to address an offset from a named location. In C, this is represented by the expression *(&name+constant). This is marked addressable by the following table. In the table, a node represented by the left column is marked with a small integer from the right column. Marks of the form A1 are addressable while marks of the form N1 are not addressable.
    Node     Marked
    name     A1
    const    A2
    &A1      A3
    A3+A2    N1   (note this is not addressable)
    *N1      A4
Here there is a distinction between a node marked A1 and a node marked A4 because the address operator
of an A4 node is not addressable. So to extend the table:
    Node     Marked
    &A4      N2
    N2+N1    N1
The full addressability of the 68020 is expressed in 18 rules like this. When one ports the compiler, this
table is usually initialised so that leaves are labelled as addressable and nothing else. The code produced is
poor, but porting is easy. The table can be extended later.
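For illustration, the walk and its table lookup can be sketched in a few lines of C. The operator names and the rules encoded below are invented stand-ins for the compiler’s own, much larger, tables:

    enum { ONAME, OCONST, OADDR, OADD, OIND };  /* a few operators */
    enum { A1 = 1, A2, A3, A4, N1, N2 };        /* marks; A* are addressable */

    typedef struct Node Node;
    struct Node
    {
        int  op;
        int  mark;
        Node *left;
        Node *right;
    };

    /* a fragment of the 68020 tables above, as code */
    static int
    lookmark(int op, int l, int r)
    {
        (void)r;    /* the full tables consult both marks */
        switch(op){
        case ONAME:  return A1;
        case OCONST: return A2;
        case OADDR:  return l == A1 ? A3 : N2;  /* &A1 -> A3, &A4 -> N2 */
        case OADD:   return N1;                 /* A3+A2 and N2+N1 alike */
        case OIND:   return A4;                 /* *N1 -> A4 */
        }
        return N1;
    }

    void
    addressability(Node *n)
    {
        if(n == NULL)
            return;
        addressability(n->left);    /* bottom-up: label the children first */
        addressability(n->right);
        n->mark = lookmark(n->op,
            n->left != NULL ? n->left->mark : 0,
            n->right != NULL ? n->right->mark : 0);
    }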
In the same bottom-up pass of the tree, the nodes are labelled with a Sethi-Ullman complexity. [Set70] This number is roughly the number of registers required to compile the tree on an ideal machine. An addressable node is marked 0. A function call is marked infinite. A unary operator is marked as the maximum of 1 and the mark of its subtree. A binary operator with equal marks on its subtrees is marked with a subtree mark plus 1. A binary operator with unequal marks on its subtrees is marked with the maximum mark of its subtrees. The actual values of the marks are not too important, but the relative values are. The goal is to compile the harder (larger mark) subtree first.
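The numbering itself is short. Continuing the sketch above, with an invented OFUNC standing in for the call operator:

    enum { OFUNC = OIND+1, SUINF = 1000 };  /* SUINF: effectively infinite */

    int
    sethi(Node *n)
    {
        int l, r;

        if(n->mark >= A1 && n->mark <= A4)
            return 0;               /* addressable nodes need no registers */
        if(n->op == OFUNC)
            return SUINF;           /* function calls are marked infinite */
        if(n->right == NULL){       /* unary: max of 1 and the subtree */
            l = sethi(n->left);
            return l > 1 ? l : 1;
        }
        l = sethi(n->left);
        r = sethi(n->right);
        if(l == r)
            return l + 1;           /* equal subtrees need one more register */
        return l > r ? l : r;       /* unequal: the larger mark suffices */
    }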
5.7. Registerisation
Up to now, the compiler has operated on syntax trees that are roughly equivalent to the original source language. The previous pass has produced machine language in an internal format. The next two passes operate on the internal machine language structures. The purpose of the next pass is to reintroduce registers for heavily used variables.
All of the variables that can potentially be registerised within a routine are placed in a table. (Suitable variables are all automatic or external scalars that do not have their addresses taken. Some constants that are hard to reference are also considered for registerisation.) Four separate data flow equations are evaluated over the routine on all of these variables. Two of the equations are the normal set-behind and used-ahead bits that define the life of a variable. The two new bits tell if a variable life crosses a function call ahead or behind. By examining a variable over its lifetime, it is possible to get a cost for registerising it. Loops are detected and the costs are multiplied by three for every level of loop nesting. Costs are sorted and the variables are replaced by available registers on a greedy basis.
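A sketch of the cost bookkeeping, with invented names; the real pass derives its counts from the data flow bits rather than this simplified tally:

    #include <stdlib.h>

    enum { NREG = 8 };  /* registers available for variables (illustrative) */

    typedef struct Var Var;
    struct Var
    {
        char *name;
        long cost;      /* accumulated benefit of registerisation */
        int  reg;       /* assigned register, or -1 */
    };

    /* a reference at loop nesting depth d is worth 3^d */
    void
    tally(Var *v, int depth)
    {
        long w;

        w = 1;
        while(depth-- > 0)
            w *= 3;
        v->cost += w;
    }

    static int
    bycost(const void *a, const void *b)
    {
        const Var *x = *(const Var**)a;
        const Var *y = *(const Var**)b;

        return y->cost < x->cost ? -1 : y->cost > x->cost;  /* decreasing */
    }

    /* sort by decreasing cost, then hand out registers greedily */
    void
    registerise(Var **v, int nvar)
    {
        int i;

        qsort(v, nvar, sizeof v[0], bycost);
        for(i = 0; i < nvar; i++)
            v[i]->reg = i < NREG ? i : -1;
    }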
The 68020 has two different types of registers. For the 68020, two different costs are calculated for each variable life and the register type that affords the better cost is used. Ties are broken by counting the number of available registers of each type.
Note that externals are registerised together with automatics. This is done by evaluating the semantics of a
‘‘call’’ instruction differently for externals and automatics. Since a call goes outside the local procedure, it
is assumed that a call references all externals. Similarly, externals are assumed to be set before an ‘‘entry’’
instruction and assumed to be referenced after a ‘‘return’’ instruction. This makes sure that externals are in
memory across calls.
The overall results are very satisfying. It would be nice to be able to do this processing in a machine-independent way, but it is impossible to get all of the costs and side effects of different choices by examining the parse tree.
Most of the code in the registerisation pass is machine-independent. The major machine-dependency is in
examining a machine instruction to ask if it sets or references a variable.
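That machine-dependency amounts to a pair of predicates driven by a per-machine table. A sketch, with an invented and much-simplified operand encoding:

    typedef struct Var Var;
    typedef struct Adr Adr;
    typedef struct Instr Instr;

    struct Var   { char *name; };
    struct Adr   { Var *var; };             /* operand, grossly simplified */
    struct Instr { int op; Adr from, to; };

    enum { AMOVE, AADD, ACMP };             /* a few opcodes */
    enum { FromRead = 1, ToRead = 2, ToWrite = 4 };

    /* per-machine knowledge: how each opcode treats its operands */
    static int
    opflags(int op)
    {
        switch(op){
        case AMOVE: return FromRead|ToWrite;
        case AADD:  return FromRead|ToRead|ToWrite;
        case ACMP:  return FromRead|ToRead;
        }
        return 0;
    }

    /* does instruction p reference variable v? */
    int
    references(Instr *p, Var *v)
    {
        int f;

        f = opflags(p->op);
        if((f & FromRead) && p->from.var == v)
            return 1;
        return (f & ToRead) && p->to.var == v;
    }

    /* does instruction p set variable v? */
    int
    sets(Instr *p, Var *v)
    {
        return (opflags(p->op) & ToWrite) && p->to.var == v;
    }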
This pass removes redundant ‘‘move’’ instructions by iterating two transformations. The first deletes ‘‘move’’ instructions outright when it can be shown that they are unnecessary. Experiments have shown that it is also marginally worthwhile to rename uses of the destination variable as uses of the source variable up to the first use of the source variable.
The second transform will do relabelling without deleting instructions. When a ‘‘move’’ instruction is encountered, if the source variable has been set prior to the use of the destination variable then all of the references to the source variable are replaced by the destination and the ‘‘move’’ is inverted. Typically, this transformation will alter two ‘‘move’’ instructions and allow the first transformation another chance to remove code. This transformation uses the forward data flow set up in the previous pass.
Again, the following is a depiction of the transformation where the pattern is in the left column and the
rewrite is in the right column.
    SET a            SET b
    (no use of b)
    USE a            USE b
    (no use of b)
    MOVE a,b         MOVE b,a
Iterating these transformations will usually get rid of all redundant ‘‘move’’ instructions.
A problem with this organisation is that the costs of registerisation calculated in the previous pass depend on how well this pass can detect and remove redundant instructions. Often, a fine candidate for registerisation is rejected because of the cost of instructions that are later removed. Perhaps the registerisation pass should discount a large percentage of the cost of a ‘‘move’’ instruction, anticipating the effectiveness of this pass.
6. The loader
The loader is a multiple pass program that reads object files and libraries and produces an executable binary. The loader also does some minimal optimisations and code rewriting. Many of the operations performed by the loader are machine-dependent.
The first pass of the loader reads the object modules into an internal data structure that looks like binary assembly language. As the instructions are read, unconditional branch instructions are removed. Conditional branch instructions are inverted to prevent the insertion of unconditional branches. The loader will also make a copy of a few instructions to remove an unconditional branch. An example of this appears in a later section.
The next pass allocates addresses for all external data. The 68020 is typical: it can reference ±32K bytes from an address register. The loader allocates the address register A6 as the static pointer. The value placed in A6 is the base of the data segment plus 32K. It is then cheap to reference all data in the first 64K of the data segment. External variables are allocated to the data segment with the smallest variables allocated first. If all of the data cannot fit into the first 64K of the data segment, then usually only a few large arrays need more expensive addressing modes.
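A sketch of the smallest-first allocation (alignment and the actual segment base are ignored here, and the names are invented):

    #include <stdlib.h>

    typedef struct Sym Sym;
    struct Sym
    {
        char *name;
        long size;  /* size of the datum in bytes */
        long off;   /* assigned offset in the data segment */
    };

    static int
    bysize(const void *a, const void *b)
    {
        const Sym *x = *(const Sym**)a;
        const Sym *y = *(const Sym**)b;

        return x->size < y->size ? -1 : x->size > y->size;
    }

    /* smallest first, so as many variables as possible fall within
       the cheap 64K window around the static pointer */
    void
    datalayout(Sym **s, int n)
    {
        long off;
        int i;

        qsort(s, n, sizeof s[0], bysize);
        off = 0;
        for(i = 0; i < n; i++){
            s[i]->off = off;
            off += s[i]->size;
        }
    }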
For the MIPS computer, the loader makes a pass over the internal structures, exchanging instructions to try to fill ‘‘delay slots’’ with useful work. (A delay slot on the MIPS is a euphemism for a timing bug that must be avoided by the compiler.) If a useful instruction cannot be found to fill a delay slot, the loader will insert ‘‘noop’’ instructions. This pass is very expensive and does not do a good job. About 20% of all instructions are in delay slots. About 50% of these are useful instructions and 50% are ‘‘noops.’’ The vendor-supplied assembler does this job much more effectively, filling about 80% of the delay slots with useful instructions.
On the 68020 computer, branch instructions come in a variety of sizes depending on the relative distance of the branch. Thus the sizes of branch instructions can be mutually dependent. The loader uses a multiple pass algorithm to resolve the branch lengths. [Szy78] Initially, all branches are assumed minimal length. On each subsequent pass, the branches are reassessed and expanded if necessary. When no more expansions occur, the locations of the instructions in the text segment are known.
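A sketch of the multiple-pass resolution; the field names are invented, and the displacement thresholds are the usual 68020 byte, word, and long branch forms. Since sizes only grow, the iteration must terminate:

    typedef struct Inst Inst;
    struct Inst
    {
        int  isbranch;  /* is this a span-dependent branch? */
        Inst *targ;     /* branch target */
        long pc;        /* currently assigned address */
        int  size;      /* current encoding size in bytes */
    };

    /* size needed for a branch from `from' to `to' */
    static int
    branchsize(long from, long to)
    {
        long d = to - (from + 2);

        if(d >= -128 && d <= 127)
            return 2;   /* 8-bit displacement */
        if(d >= -32768 && d <= 32767)
            return 4;   /* 16-bit displacement */
        return 6;       /* 32-bit displacement */
    }

    void
    span(Inst *p, int n)
    {
        long pc;
        int i, ns, expanded;

        for(i = 0; i < n; i++)
            if(p[i].isbranch)
                p[i].size = 2;          /* assume minimal length */
        do{
            expanded = 0;
            pc = 0;                     /* reassign addresses */
            for(i = 0; i < n; i++){
                p[i].pc = pc;
                pc += p[i].size;
            }
            for(i = 0; i < n; i++){     /* grow branches that no longer fit */
                if(!p[i].isbranch)
                    continue;
                ns = branchsize(p[i].pc, p[i].targ->pc);
                if(ns > p[i].size){
                    p[i].size = ns;
                    expanded = 1;
                }
            }
        }while(expanded);
    }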
On the MIPS computer, all instructions are one size. A single pass over the instructions will determine the
locations of all addresses in the text segment.
The last pass of the loader produces the executable binary. A symbol table and other tables are produced to
help the debugger to interpret the binary symbolically.
The loader has source line numbers at its disposal, but the interpretation of these numbers relative to
#include files is not done. The loader is also in a good position to perform some global optimisations,
but this has not been exploited.
7. Performance
The following is a table of the source size of the various components of the compilers.

    lines   module
      409   machine-independent compiler headers
      975   machine-independent compiler Yacc
     5161   machine-independent compiler C
The following table is the timing of a test program that does Quine-McCluskey boolean function minimisation. The test program is a single file of 907 lines of C that is dominated by bit-picking and sorting. The execution time does not significantly depend on library implementation. Since no other compiler runs on Plan 9, these tests were run on a single-processor MIPS 3000 computer with vendor-supplied software. The optimiser in the vendor-supplied compiler is reputed to be extremely good. Another compiler, lcc, is compared in this list. Lcc is another new and highly portable compiler jointly written at Bell Labs and Princeton. None of the compilers were tuned on this test.

     1.0s   new cc compile time
     0.5s   new cc load time
    90.4s   new cc run time
8. Example
Here is a small example of a fragment of C code as compiled by the 68020 compiler.
int a[10];

void
f(void)
{
    int i;

    for(i=0; i<10; i++)
        a[i] = i;
}
The following is the tree of the assignment statement after all machine-independent passes. The numbers in
angle brackets are addressabilities. Numbers 10 or larger are addressable. The addressability, 9, for the
INDEX operation means addressable if its second operand is placed in an index register. The number in
parentheses is the Sethi-Ullman complexity. The typing information is at the end of each line.
ASSIGN (1) long
   INDEX <9> long
      ADDR <12> *long
         NAME "a" 0 <10> long
      NAME "i" -4 <11> long
   NAME "i" -4 <11> long
The following is the 68020 machine language generated before the registerisation pass. Note that there is
no assembly language in this compiler; this is a print of the internal form in the same sense as the previous
tree is a print of that internal form.
Here is some explanation of the notation: (SP) denotes an automatic variable; (SB) denotes an external variable; A7 is the stack pointer; $4 is a constant.
f:    TEXT
      SUBL  $4,A7
      CLRL  i(SP)
loop: MOVL  $10,R0
      CMPL  R0,i(SP)
      BLE   ret
      MOVL  i(SP),R0
      MOVL  i(SP),a(SB)(R0.L*4)
      ADDL  $1,i(SP)
      JMP   loop
ret:  ADDL  $4,A7
      RTS
The following is the code after all compiling passes, but before loading:
f:    TEXT
      SUBL  $4,A7
      CLRL  R1
loop: MOVL  $10,R0
      CMPL  R0,R1
      BLE   ret
      MOVL  R1,a(SB)(R1.L*4)
      ADDL  $1,R1
      JMP   loop
ret:  ADDL  $4,A7
      RTS
The code produced by the loader differs only in the expansion and inversion of the branch instructions.
9. Conclusions
The new compilers compile quickly, load slowly, and produce medium-quality object code. The compilers are relatively portable, requiring but a couple of weeks’ work to produce a compiler for a different computer.
As a whole, the experiment is a success. For Plan 9, where we needed several compilers with specialised
features and our own object formats, the project was indispensable.
Two problems have come up in retrospect. The first has to do with the division of labour between compiler and loader. Plan 9 runs on multi-processors and compilations are often done in parallel. Unfortunately, all compilations must be complete before loading can begin, and the load is single-threaded. With this model, any shift of work from compile to load results in a significant increase in real time. The same is true of libraries that are compiled infrequently and loaded often. In the future, we will try to put some of the loader’s work back into the compiler.
The second problem comes from the various optimisations performed over several passes. Often optimisations in different passes depend on each other. Iterating the passes could compromise efficiency, or even loop. We see no real solution to this problem.
10. References
Aho87. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, Compilers – Principles, Techniques, and Tools,
Addison Wesley, Reading, MA (1987).
Joh79a. S. C. Johnson, ‘‘A Tour Through the Portable C Compiler,’’ in UNIX Programmer’s Manual, Seventh Ed., Vol. 2A, AT&T Bell Laboratories, Murray Hill, NJ (1979).
Joh79b. S. C. Johnson, ‘‘YACC – Yet Another Compiler Compiler,’’ in UNIX Programmer’s Manual,
Seventh Ed., Vol. 2A, AT&T Bell Laboratories, Murray Hill, NJ (1979).
Kan88. Gerry Kane, MIPS RISC Architecture, Prentice-Hall, Englewood Cliffs, NJ (1988).
Ker88. Brian W. Kernighan and Dennis M. Ritchie, The C Programming Language, Second Edition,
Prentice-Hall, Englewood Cliffs, NJ (1988).
Mot85. Motorola, MC68020 32-Bit Microprocessor User’s Manual, Second Edition, Prentice-Hall, Englewood Cliffs, NJ (1985).
Pik90. Rob Pike, Dave Presotto, Ken Thompson, and Howard Trickey, ‘‘Plan 9 from Bell Labs,’’ Proc.
UKUUG Conf., London, UK (July 1990).
Set70. R. Sethi and J. D. Ullman, ‘‘The Generation of Optimal Code for Arithmetic Expressions,’’ J. ACM
17(4), pp. 715-728 (1970).
Szy78. T. G. Szymanski, ‘‘Assembling Code for Machines with Span-dependent Instructions,’’ Comm.
ACM 21(4), pp. 300-308 (1978).