Ranta, A. (2012) - Implementing Programming Languages: An Introduction To Compilers and Interpreters
Texts in Computing
Volume 16
Implementing
Programming
Languages
An Introduction to
Compilers and Interpreters
Volume 4
The Haskell Road to Logic, Maths and Programming
Kees Doets and Jan van Eijck
Volume 5
Bridges from Classical to Nonmonotonic Reasoning
David Makinson
Volume 6
Automata and Dictionaries
Denis Maurel and Franz Guenthner
Volume 7
Learn Prolog Now!
Patrick Blackburn, Johan Bos and Kristina Striegnitz
Volume 8
A Meeting of the Minds: Proceedings of the Workshop on Logic, Rationality
and Interaction, Beijing 2007
Johan van Benthem, Shier Ju and Frank Veltman, eds.
Volume 9
Logic for Artificial Intelligence & Information Technology
Dov M. Gabbay
Volume 10
Foundations of Logic and Theory of Computation
Amílcar Sernadas and Cristina Sernadas
Volume 11
Invariants: A Generative Approach to Programming
Daniel Zingaro
Volume 12
The Mathematics of the Models of Reference
Francesco Berto, Gabriele Rossi and Jacopo Tagliabue
Volume 13
Picturing Programs
Stephen Bloch
Volume 14
JAVA: Just in Time
John Latham
Volume 15
Design and Analysis of Purely Functional Programs
Christian Rinderknecht
Volume 16
Implementing Programming Languages. An Introduction to Compilers and
Interpreters
Aarne Ranta, with an appendix coauthored by Markus Forsberg
Implementing
Programming
Languages
An Introduction to
Compilers and Interpreters
Aarne Ranta
with an appendix coauthored by
Markus Forsberg
© Individual author and College Publications 2012. All rights
reserved.
ISBN 978-1-84890-064-6
College Publications
Scientific Director: Dov Gabbay
Managing Director: Jane Spurr
Department of Computer Science
King’s College London, Strand, London WC2R 2LS, UK
http://www.collegepublications.co.uk
Contents
Preface

1 Compilation Phases
1.1 From language to binary
1.2 Levels of languages
1.3 Compilation and interpretation
1.4 Compilation phases
1.5 Compilation errors
1.6 More compilation phases
1.7 Theory and practice
1.8 The scope of the techniques

2 Grammars
2.1 Defining a language
2.2 Using BNFC
2.3 Rules, categories, and trees
2.4 Precedence levels
2.5 Abstract and concrete syntax
2.6 Abstract syntax in Haskell
2.7 Abstract syntax in Java
2.8 List categories
2.9 Specifying the lexer
2.10 Working out a grammar
3 Lexing and Parsing*
3.1 The theory of formal languages
3.2 Regular languages and finite automata
3.3 The compilation of regular expressions
3.4 Properties of regular languages
3.5 Context-free grammars and parsing
3.6 LL(k) parsing
3.7 LR(k) parsing

4 Type Checking
4.1 The purposes of type checking
4.2 Specifying a type checker
4.3 Type checking and type inference
4.4 Context, environment, and side conditions
4.5 Proofs in a type system
4.6 Overloading and type conversions
4.7 The validity of statements and function definitions
4.8 Declarations and block structures
4.9 Implementing a type checker
4.10 Annotating type checkers
4.11 Type checker in Haskell
4.12 Type checker in Java

5 Interpreters
5.1 Specifying an interpreter
5.2 Side effects
5.3 Statements
5.4 Programs, function definitions, and function calls
5.5 Laziness
5.6 Implementing the interpreter
5.7 Interpreting Java bytecode*
5.8 Objects and memory management*

6 Code Generation
6.1 The semantic gap
6.2 Specifying the code generator
6.3 The compilation environment
6.4 Simple expressions and statements
6.5 Expressions and statements with jumps
6.6 Compositionality
6.7 Function calls and definitions
6.8 Putting together a class file
6.9 Implementing code generation
6.10 Compiling to native code*
6.11 Code optimization*
Index
Preface
Aimed as a first book on the topic, this is not a substitute for the “real” books if
you want to do research in compilers, or if you are involved in cutting edge im-
plementations of large programming languages. Some of the many things that
we have left out are low-level details of lexer implementation, algorithms for
building LR parser generators, data flow analysis, register allocation, memory
management, and parallelism. The lexer and parser details are left out because
they are nowadays handled by standard tools, and application programmers
can concentrate on just specifying their own languages and leave the details to
the tools. The other aspects are left out because they are handled by standard
back ends such as the Java Virtual Machine and LLVM. Appendix D of the
book gives reading hints on the more advanced topics.
The language CPP is a small part of the immense language C++; we could
almost as well say C or Java. However, the parser (Assignment 1) also contains
many of the tricky special features of C++ such as templates. The purpose
of this is to throw you into cold water and show that you can actually swim.
Managing to do this assignment will give you confidence that you can easily
cope with any feature of programming language syntax.
Assignments 2, 3, and 4 deal with a small language, which however contains
everything that is needed for writing useful programs: arithmetic expressions,
declarations and assignments, conditionals, loops, blocks, functions, strings,
input and output. They will give the basic understanding of how programming
languages work. Most features of imperative programming languages can then
be understood as variations of the same themes. Assignment 5 widens the
perspective by introducing the concepts of functional programming, such as
higher-order functions and closures.
The assignments are not only practical but also close to the “real world”.
Compiler books sometimes use toy languages and home-made virtual machines,
in order to focus on the pure concepts rather than messy details. But we have
preferred fragments of real languages (C++, Haskell) and a real virtual machine
(JVM). Even though some details of these languages might be messy, they have
a core that we find pure enough to give a good representation of the concepts.
The advantage of using real languages is that you can easily compare your
work with standard compilers. You can for instance produce your own Java
class files and link them together with files generated by standard Java compil-
ers. When running your code, you will probably experience the embarrassment
(and pleasure!) of seeing things like bytecode verification errors, which rarely
arise with standard compilers! To give even more perspective, we will show the
basics of compilation to Intel x86 code, which will enable you to compile and
run some simple programs “on bare silicon”.
The assignments require you to write code in two different formats:

• a grammar formalism: BNFC;
• a general-purpose programming language: Java or Haskell.
Thus you don’t need to write code for traditional compiler tools such as Lex and
Yacc. Such code, as well as many other parts of the compiler, are automatically
derived from the BNFC grammar. For the general-purpose language, you could
actually choose any of Java, Haskell, C, C++, C#, or OCaml, since BNFC
supports all these languages. But in this book, we will focus on the use of
Java and Haskell as implementation languages. You should choose the language
according to your taste and experience. If you want to use C++ or C#, you can
easily follow the Java code examples, whereas OCaml programmers can follow
Haskell. C is a little different, but mostly closer to Java than to Haskell. The
theory-based approach guarantees that very little in the material is tied to a
specific implementation language: since the compiler components are explained
on an abstract level, they can be easily implemented in different languages.
Web resources

This book has a web page,

http://digitalgrammars.com/ipl-book/

where supporting material, including the errata of the book, can be found.
Acknowledgements
Since I first lectured on compiler construction in 2002, more than a thousand
students have followed the courses and contributed to the evolution of this
material. I am grateful to all the students for useful feedback and for a confir-
mation that the chosen approach makes sense.
I also want to thank my course assistants throughout these years: Grégoire
Détrez, Arnar Birgisson, Ramona Enache, Krasimir Angelov, Michal Palka,
Jean-Philippe Bernardy, Kristoffer Hedberg, Anders Mörtberg, Daniel Hedin,
Håkan Burden, Kuchi Prasad, Björn Bringert, and Josef Svenningsson. They
have helped consolidate the material and come up with new ideas, in particular
for the exercises and assignments in this book. Björn Bringert wrote the code
that the Java examples in Chapters 4 and 5 are based on.
The book owes a lot to lecture notes written by earlier teachers of the
courses: Ulf Norell, Marcin Benke, Thomas Hallgren, Lennart Augustsson,
Thomas Johnsson, and Niklas Röjemo. Thus I have profited from first-rate
experience in programming language design and implementation, proven in
both large-scale implementations and scientific publications. The spirit of this
field in Gothenburg is exactly what I call a theory-based practical approach.
The language CPP is based on the language Javalette inherited from earlier
courses.
The BNF Converter started as joint work with Markus Forsberg in 2002.
In 2003, Michael Pellauer joined the project and created the first versions of C,
C++, and Java back ends, making BNFC multilingual. Markus’s PhD thesis
from 2007 has been one of the main sources of documentation of BNFC, and
he has helped in adapting one of our co-authored reports into an appendix of
this book: the quick reference manual to BNFC (Appendix A).
Over the years, BNFC has also received substantial contributions (such as
back ends for new languages) from Krasimir Angelov, Björn Bringert, Johan
Broberg, Paul Callaghan, Ola Frid, Peter Gammie, Patrik Jansson, Kristofer
Johannisson, Antti-Juhani Kaijanaho, and Ulf Norell. Many users of BNFC
have sent valuable feedback, proving the claim that one of the most precious
resources of a piece of software are its users.
A draft of this book was read by Rodolphe Lepigre and Jan Smith, who
made many corrections and valuable suggestions. Jane Spurr at King’s College
Publications was always responsive with accurate editorial help.
Gothenburg, 7 May 2012
Aarne Ranta
aarne@chalmers.se
Chapter 1
Compilation Phases
This chapter introduces the concepts and terminology for most of the later
discussion. It explains the difference between compilers and interpreters, the
division into low and high level languages, and the data structures and algo-
rithms involved in each component of a programming language implementation.
Many of these components are known as the compilation phases, that is, the
phases through which a compiler goes on its way from source code to machine
code.
At the lowest level, all data is represented in binary, as sequences of 0's and 1's. For instance, the non-negative numbers have standard binary encodings:

0 = 0
1 = 1
2 = 10
3 = 11
4 = 100
and so on. This generalizes easily to letters and to other characters, for instance
by the use of the ASCII encoding:
A = 65 = 1000001
B = 66 = 1000010
C = 67 = 1000011
and so on. In this way we can see that all data manipulated by computers
can be expressed by 0’s and 1’s. But what is crucial is that even the pro-
grams that manipulate the data can be so expressed. To take a real-world
example, programs in the JVM machine language (Java Virtual Machine)
are sequences of bytes, that is, groups of eight 0’s or 1’s (capable of expressing
the numbers from 0 to 255). A byte can encode a numeric value, for instance
an integer or a character as above. But it can also encode an instruction,
that is, a command to do something. For instance, addition and multiplication
(of integers) are expressed in JVM as bytes in the following way:
+ = 96 = 0110 0000 = 60
* = 104 = 0110 1000 = 68
We put a space in the middle of each byte to make it more readable, and
more spaces between bytes. The last figure shown is a hexadecimal encoding,
where each half-byte is encoded by a base-16 digit that ranges from 0 to F (with
A=10, B=11,. . . ,F=15). Hexadecimals are a common way to display binaries
in machine language documentation.
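To double-check such encodings, hexadecimal forms can be computed mechanically; here is a small Haskell sketch (ours, not the book's) using the standard Numeric module:

import Numeric (showHex)

-- print the hexadecimal forms of the JVM opcodes for + and *
main :: IO ()
main = do
  putStrLn (showHex (96 :: Int) "")   -- prints 60, the hex form of 0110 0000
  putStrLn (showHex (104 :: Int) "")  -- prints 68, the hex form of 0110 1000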
From the encodings of numbers and operators, one could construct a simple-
minded encoding of arithmetic formulas, by just putting together the codes for
5, +, and 6:

5 + 6  =  0000 0101  0110 0000  0000 0110
While this could be made to work, actual JVM chooses a more roundabout
way. In the logic that it follows, the expression is first converted to a postfix
form, where the operands come before the operator:
5 + 6 =⇒ 5 6 +
One virtue of the postfix form is that we don’t need parentheses. For instance,
(5 + 6) * 7 =⇒ 5 6 + 7 *
5 + (6 * 7) =⇒ 5 6 7 * +
At least the former expression needs parentheses when the usual infix order is
used, that is, when the operator is between the operands.
The way the JVM machine manipulates expressions is based on a so-called
stack, which is the working memory of the machine. The stack is like a pile
of plates, where new plates are pushed on the stack, and only one plate is
available at a time, the one last pushed—known as the top of the stack. An
arithmetic operation such as + (usually called “add”) takes the two top-most
elements from the stack and returns their sum on the top. Thus the compu-
tation of, say, 5 + 6, proceeds as follows, where the left column shows the
instructions and the right column the stack after each instruction:
bipush 5 ; 5
bipush 6 ; 5 6
iadd ; 11
The instructions are here shown as assembly code, which means that readable
instruction names and decimal numbers are used instead of binaries. The
instruction bipush means pushing an integer that has the size of one byte, and
iadd means integer addition.
To take a more complex example, the computation of 5 + (6 * 7) is
bipush 5 ; 5
bipush 6 ; 5 6
bipush 7 ; 5 6 7
imul ; 5 42
iadd ; 47
In this case, unlike the previous one, the stack at one point contains more
numbers than two; but the integer multiplication (imul) instruction correctly
picks the topmost ones 6 and 7 and returns the value 42 on the stack.
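The stack discipline is easy to model directly. The following Haskell fragment is a sketch of ours (not the book's code), with a toy instruction type covering just the three instructions used above:

data Instr = Bipush Integer | Iadd | Imul

-- execute one instruction; the topmost stack element is the head of the list
exec :: [Integer] -> Instr -> [Integer]
exec stack       (Bipush n) = n : stack
exec (y : x : s) Iadd       = x + y : s
exec (y : x : s) Imul       = x * y : s
exec _           _          = error "stack underflow"

run :: [Instr] -> [Integer]
run = foldl exec []

-- run [Bipush 5, Bipush 6, Bipush 7, Imul, Iadd]  evaluates to  [47]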
The binary JVM code must make it clear which bytes stand for numeric
values and which ones for instructions such as “add”. This is obvious if you
think that we need to read 0110 0000 sometimes as number 96, and sometimes
as addition. The way to make it clear that a byte stands for a numeric value is
to prefix it with a special instruction, the one called bipush. Thus we get the
code for an addition expression:

bipush 5
bipush 6
iadd

To convert this all into binary, we only need the code for the push instruction, bipush = 16 = 0001 0000:

5 + 6 = 0001 0000 0000 0101 0001 0000 0000 0110 0110 0000
• Both data and programs can be expressed as binary code, i.e. by 0’s and
1’s.
2. Compile the code for X, followed by the code for Y, followed by the code
for F.
This procedure is our first example of a compiler. It shows the two main ideas
of compilers, which we will repeat again and again in new configurations:
Figure 1.1: Some programming languages from the highest to the lowest level: human language; ML, Haskell, Lisp, Prolog; C++, Java; C; assembler; machine language.
The compilation of a source language expression such as

5 + 6 * 7

proceeds in two steps: first from the source language to assembly language, and then from assembly language to binary. The second step is very easy: you just look up the binary codes for each symbol
in the assembly language and put them together in the same order. It is
sometimes not regarded as a part of compilation proper, but as a separate level
of assembly. The main reason for this is purely practical: modern compilers
don’t need to go all the way to the binary, but just to the assembly language,
since there exist assembly programs that can do the rest.
A compiler is a program that translates code to some other code. It
does not actually run the program. An interpreter does this. Thus a source
language expression,

5 + 6 * 7

is turned by an interpreter directly into its value,

47
This computation can be performed without any translation of the source code
into machine code. However, a common practice is in fact a combination of
compilation and interpretation. For instance, Java programs are, as shown
above, compiled into JVM code. This code is in turn interpreted by a JVM
interpreter.
The compilation of Java is different from for instance the way C is translated
by GCC (GNU Compiler Collection). GCC compiles C into the native code
of each machine, which is just executed, not interpreted. JVM code must be
interpreted because it is not executable by any actual machine.
Sometimes a distinction is made between “compiled languages” and “inter-
preted languages”, C being compiled and Java being interpreted. This is really
a misnomer, in two ways. First, any language could have both an interpreter
and a compiler. Second, it’s not Java that is interpreted by a “Java interpreter”,
but JVM, a completely different language to which Java is compiled.
Here are some examples of how some known languages are normally treated:

• C is usually compiled all the way to native machine code.
• Java is compiled to JVM, which is then interpreted, often with JIT.
• JavaScript is usually interpreted in web browsers.
• Haskell can be both compiled (e.g. with GHC) and interpreted (e.g. in GHCi).
Advantages of compilation include the speed of the generated code; advantages of interpretation include portability and the absence of a separate compilation step.
The advent of JIT is blurring the distinction, as do virtual machines with
actual machine language instruction sets, such as VMWare. In general, the
best trade-offs are achieved by combinations of compiler and interpreter com-
ponents, reusing as much as possible (as we saw is done in the reuse of the
assembly phase). This leads us to the following topic: how compilers are di-
vided into separate components.
• The lexer reads a string of characters and chops it into tokens, i.e.
to “meaningful words”; the figure represents the token string by putting
spaces between tokens.
• The parser reads a string of tokens and groups it into a syntax tree,
i.e. to a structure indicating which parts belong together and how; the
figure represents the syntax tree by using parentheses.
• The type checker finds out the type of each part of the syntax tree that
might have alternative types, and returns an annotated syntax tree;
the figure represents the annotations by the letter i (“integer”) in square
brackets.
Figure 1.2: Compilation phases from Java source code to JVM assembly code: the source string passes through the lexer, the parser, the type checker, and the code generator in turn.
• The code generator converts the annotated syntax tree into a list of
target code instructions. The figure uses normal JVM assembly code,
where imul means integer multiplication, bipush pushing integer bytes,
and iload pushing values of integer variables.
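Seen as functions, the phases simply compose. The following Haskell sketch shows only the shape of the pipeline; all types and definitions are hypothetical stand-ins of ours, not the code generated by the tools discussed later:

newtype Tree = Tree String
newtype AnnotatedTree = AnnotatedTree String
type Instruction = String

lexer :: String -> [String]           -- characters to tokens
lexer = words                          -- a crude stand-in

parser :: [String] -> Tree             -- tokens to syntax tree
parser = Tree . unwords

typeChecker :: Tree -> AnnotatedTree   -- tree to annotated tree
typeChecker (Tree s) = AnnotatedTree s

codeGenerator :: AnnotatedTree -> [Instruction]
codeGenerator (AnnotatedTree s) = [s]

compiler :: String -> [Instruction]
compiler = codeGenerator . typeChecker . parser . lexer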
"hello
(4 * (y + 5) - 12))
sort(45)
Errors on later phases than type checking are usually not supported. One
reason is the principle (by Robin Milner, the creator of ML), that “well-typed
programs cannot go wrong”. This means that if a program passes the type
checker it will also work on later phases. Another, more general reason is that
the compiler phases can be divided into two groups:
• The front end, which performs analysis, i.e. inspects the program:
lexer, parser, type checker.
• The back end, which performs synthesis: code generator.
It is natural that only the front end (analysis) phases look for errors.
A good compiler finds all errors at the earliest occasion. Thereby it saves
work: it doesn’t try to type check code that has parse errors. It is also more
useful for the user, because it can then give error messages that go to the very
root of the problem.
Of course, compilers cannot find all errors, for instance, all bugs in the
program. The problem with an array index out of bounds is a typical example
of such errors. However, in general it is better to find errors at compile time
than at run time, and this is one aspect in which compilers are constantly
improving. One of the most important lessons of this book will be to understand
what is possible to do at compile time and what must be postponed to run time.
For instance, array index out of bounds is not possible to detect at compile time,
if the index is a variable that gets its value at run time.
Another typical example is the binding analysis of variables: if a variable
is used in an expression in Java or C, it must have been declared and given a
value. For instance, the following function is incorrect in C:
int main () {
printf("%d",x) ;
}
The reason is that x has not been declared, which for instance GCC correctly
reports as an error. But the following is correct in C:
int main () {
int x ;
printf("%d",x) ;
}
int main () {
int x ;
if (readInt()) x = 1 ;
printf("%d",x) ;
}
Here x gets a value under a condition. It may be that this condition is impos-
sible to decide at compile time. Hence it is not decidable at compile time if x
has a value—neither in the parser, nor in the type checker.
Exercise 1-2. The following C code has around six errors (some of them
depend on what you count as an error). Locate the errors and explain at which
compiler phase they are (or should be) revealed.
int main ()
{
int i ;
int j = k + 1 ;
int a[] = {1,2,3}
j = a + 6 ;
a[4] = 7 ;
printf(hello world\n) ;
}
Desugaring is normally done at the syntax tree level, and it can be inserted as
a phase between parsing and type checking. A disadvantage can be, however,
that errors arising in type checking then refer to code that the programmer has
never written herself, but that has been created by desugaring.
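As an illustration, here is a sketch of ours (with a hypothetical syntax tree type, not the book's) of desugaring C-style for loops into while loops at the syntax tree level:

data Exp = EVar String | EInt Integer | ELt Exp Exp
  deriving Show

data Stm
  = SFor Stm Exp Stm Stm   -- syntactic sugar: for (init ; cond ; step) body
  | SWhile Exp Stm
  | SBlock [Stm]
  | SAssign String Exp
  deriving Show

-- replace every for loop by a block containing a while loop
desugar :: Stm -> Stm
desugar (SFor i c st b) =
  SBlock [desugar i, SWhile c (SBlock [desugar b, desugar st])]
desugar (SWhile e s) = SWhile e (desugar s)
desugar (SBlock ss)  = SBlock (map desugar ss)
desugar s            = s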
Chapter 2
Grammars
The grammar file Calc.cf (Figure 2.1) defines the language of a calculator:

EAdd. Exp  ::= Exp  "+" Exp1 ;
ESub. Exp  ::= Exp  "-" Exp1 ;
EMul. Exp1 ::= Exp1 "*" Exp2 ;
EDiv. Exp1 ::= Exp1 "/" Exp2 ;
EInt. Exp2 ::= Integer ;

coercions Exp 2 ;
To check that BNFC is installed, type the command

bnfc
and you should get a message specifying the authors and license of BNFC and
its usage options. If the command bnfc does not work, you can install the
software from the BNFC homepage, which is linked from the book’s web page.
BNFC is available for Linux, Mac OS, and Windows, and there are several
installation methods (such as Debian packages), from which you can choose
the one most suitable for your platform and taste. Each platform also has
Unix-style shells: Cygwin in Windows and Terminal in Mac OS.
Now, assuming you have BNFC installed, you can run it on the file Calc.cf
in Figure 2.1.
bnfc -m Calc.cf

This generates the compiler modules in Haskell, which is BNFC's default language, together with a Makefile (due to the option -m). Alternatively, you can use the option -java1.5,

bnfc -m -java1.5 Calc.cf

to generate the components for Java. (Writing java1.4 also works, but generates clumsier code that doesn't use Java's generics.) Assuming the Haskell version, the generated modules are compiled by running
make
Again, this can fail at some point if you don’t have the Haskell tools installed:
the GHC Haskell compiler, the Happy parser generator, and the Alex lexer
generator. You don’t need to install them, if you aim to work in Java and not
in Haskell, but let us first assume you do have GHC, Happy, and Alex. Then
your run of make will successfully terminate by compiling and linking a test program called TestCalc.
Notice that TestCalc reads Unix standard input; the easiest way to provide an expression to parse is thus a pipe from the echo command:

echo "5 + 6 * 7" | ./TestCalc

The response of TestCalc is the following:
Parse Successful!
[Abstract Syntax]
EAdd (EInt 5) (EMul (EInt 6) (EInt 7))
[Linearized tree]
5 + 6 * 7
It first says that it has succeeded to parse the input, then shows an abstract
syntax tree, which is the result of parsing and gives the tree structure of the
expression. Finally, it displays the linearization, which is the string obtained
by using the grammar in the direction opposite to parsing. This string can
be different from the input string, for instance, if the input has unnecessary
parentheses.
Input can also be read from a file. The standard method for this is
./TestCalc FILE_with_an_expression
Running BNFC with the Java option generates, among other files,

Calc/AllVisitor.java
Calc/Test.java # top-level test file
Calc/Yylex # lexer
Calc/Calc.cup # parser
Calc.tex # language document
Makefile # Makefile
There are no Haskell files any more, but files for Java, its parser tool Cup, and
its lexer tool JLex. The Makefile works exactly like in the case of Haskell:
make
Well. . . if you have done exactly as shown above, you will probably fail with an error message saying that some Java classes cannot be found.
This problem is typical in Java when using libraries that reside in unusual
places, which often happens with user-installed libraries like Cup and JLex.
Fortunately there is an easy solution: you just have to define the class path
that Java uses for finding libraries. On my Ubuntu Linux laptop, the following
shell command does the job:
export CLASSPATH=.:/usr/local/java/Cup:/usr/local/java
Now I will get a better result with make. Then I can run the parser test in
almost the same way as with the version compiled with Haskell:

echo "5 + 6 * 7" | java Calc/Test
Parse Successful!
[Abstract Syntax]
(EAdd (EInt 5) (EMul (EInt 6) (EInt 7)))
[Linearized Tree]
5 + 6 * 7
To summarize, these are the two most important facts about BNFC:

• We can use a BNF grammar to generate several compiler components.
• The components can be generated in different languages from the same BNF source.
The reason we don’t give this analysis is that multiplication expressions have
a higher precedence. In BNFC, precedence levels are the digits attached
to category symbols. Thus Exp1 has precedence level 1, Exp2 has precedence
level 2, etc. The nonterminal Exp without a digit is defined to mean the same
as Exp0.
The rule

EAdd. Exp ::= Exp "+" Exp1 ;

can thus be read: an addition expression on level 0 consists of an expression on level 0, the token +, and an expression on level 1. Since the right operand must be on a higher level than the result, + is made left-associative.
What is the highest level? This is specified in the grammar by using a coercions
statement. For instance, coercions Exp 2 says that 2 is the highest level for
Exp. It is actually a shorthand for the following “ordinary” BNF rules:
_. Exp  ::= Exp1 ;
_. Exp1 ::= Exp2 ;
_. Exp2 ::= "(" Exp ")" ;
These rules are called coercions, since they just coerce expressions from one
category to another, without doing anything—that is, without creating new
nodes in the abstract syntax tree. The underscore in front of these rules is a
dummy label, which indicates that no constructor is added.
In practice, compilers don’t quite work in this simple way. The main reason
is that the tree obtained in parsing may have to be converted to another tree
before code generation. For instance, type annotations may have to be added
to an arithmetic expression tree in order to select the proper JVM instructions.
The BNF grammar specifies the abstract syntax of a language. But it
simultaneously specifies its concrete syntax as well. The concrete syntax
gives more detail than the abstract syntax: it says what the expression parts
look like and in what order they appear. One way to spell out the distinction
is by trying to separate these aspects in a BNF rule. Take, for instance, the
rule for addition expressions:
EAdd. Exp ::= Exp "+" Exp1 ;

Its purely abstract syntax content can be written as

EAdd. Exp ::= Exp Exp
which hides the actual symbol used for addition (and thereby the place where
it appears). It also hides the precedence levels, since they don’t imply any
differences in the abstract syntax trees.
In brief, the abstract syntax is extracted from a BNF grammar as follows:

1. Remove all terminals.
2. Remove all precedence digits, and thereby all coercions rules.

If this is performed with Calc.cf (Figure 2.1), the following rules remain:

EAdd. Exp ::= Exp Exp
ESub. Exp ::= Exp Exp
EMul. Exp ::= Exp Exp
EDiv. Exp ::= Exp Exp
EInt. Exp ::= Integer

These rules define the abstract syntax trees of the language:

• their nodes and leaves are constructors (i.e. labels of BNF rules).
In contrast, concrete syntax trees, also called parse trees, look different: their nodes are category symbols and their leaves are tokens.
Consider the expression 5 + 6 * 7 as analysed by Calc.cf (the tree diagrams are omitted here; the abstract syntax tree is the one printed as EAdd (EInt 5) (EMul (EInt 6) (EInt 7))).
A parse tree is an accurate encoding of the sequence of BNF rules applied, and
hence it shows all coercions between precedence levels and all tokens in the
input string.
For Calc.cf, BNFC generates a Haskell module AbsCalc, whose main content is a data type for abstract syntax trees:

data Exp =
EAdd Exp Exp
| ESub Exp Exp
| EMul Exp Exp
| EDiv Exp Exp
| EInt Integer
With this type, we can write an interpreter in a module of its own, importing the generated abstract syntax:

module Interpreter where

import AbsCalc

interpret :: Exp -> Integer
interpret x = case x of
  EAdd e1 e2 -> interpret e1 + interpret e2
  ESub e1 e2 -> interpret e1 - interpret e2
  EMul e1 e2 -> interpret e1 * interpret e2
  EDiv e1 e2 -> interpret e1 `div` interpret e2
  EInt n     -> n
Thus we can now turn our parser into an interpreter! We do this by modifying
the generated file TestCalc.hs: instead of showing the syntax tree, we let it
show the value from interpretation:
import LexCalc
import ParCalc
import AbsCalc
import Interpreter
import ErrM
main = do
  interact calc
  putStrLn ""

calc s =
  let Ok e = pExp (myLexer s)
  in show (interpret e)
This, in a nutshell, is how you can build any compiler on top of BNFC:

1. Write a grammar and convert it into compiler modules with BNFC.
2. Write code that manipulates the syntax trees (e.g. an interpreter).
3. Let the main file show the results of syntax tree manipulation.
If your Main module is in a file named Calculator.hs, you can compile it with GHC as follows:

ghc --make Calculator.hs
For abstract syntax in Java, BNFC generates a class structure:

• For each category, an abstract base class.
• For each constructor of the category, a class extending the base class.
This means quite a few files, which are for the sake of clarity put to a separate
directory Absyn. In the case of Calc.cf, we have the files
Calc/Absyn/EAdd.java
Calc/Absyn/EDiv.java
Calc/Absyn/EInt.java
Calc/Absyn/EMul.java
Calc/Absyn/ESub.java
Calc/Absyn/Exp.java
This is what the classes look like; we ignore some of the code in them now and
only show the parts crucial for abstract syntax representation:
The visitor method is the best choice in more advanced applications, and we
will return to it in the chapter on type checking. To get the calculator up
and running now, let us follow the simpler way. What we do is take the classes
generated in Calc/Absyn/ by BNFC, and add an eval method to each of them:
public abstract class Exp {
  public abstract int eval() ;
}

public class EAdd extends Exp {
  public final Exp exp_1, exp_2 ;
  public EAdd(Exp p1, Exp p2) { exp_1 = p1 ; exp_2 = p2 ; }
  public int eval() { return exp_1.eval() + exp_2.eval() ; }
}

The other classes are similar; EInt simply returns its integer value. With a main class Calculator that parses standard input and prints the result of eval, the calculator is compiled with
javac Calc/Calculator.java
and run on an example:

echo "(123 + 47 - 6) * 222 / 4" | java Calc/Calculator

which should print the value 9102.
A list of definitions, for instance, can be defined by a pair of ordinary BNF rules:

NilDef.  ListDef ::= ;
ConsDef. ListDef ::= Def ListDef ;

The first rule states that a list of definitions can be empty ("nil"). The second rule states that a list can be formed by prepending a definition to a list ("cons").
Lists often have terminators, i.e. tokens that appear after every item of a
list. For instance, function definitions might have semicolons (;) as terminators.
This is expressed as follows:

NilDef.  ListDef ::= ;
ConsDef. ListDef ::= Def ";" ListDef ;
The pattern of list rules is so common that BNFC has some special notations
for it. Thus lists of a category C can be denoted as [C ]. Instead of pairs of
rules, one can use the shorthand terminator. Thus the latter pair of rules for lists of definitions can be written concisely

terminator Def ";" ;

The former pair, where no terminator is used, is written with an "empty terminator",

terminator Def "" ;
Lists often have separators instead, i.e. tokens that appear between the items, such as the commas in function argument lists. These are declared with the shorthand separator, for instance,

separator Exp "," ;

This shorthand expands to a set of rules for the category ListExp. The rule for function calls can then be written

ECall. Exp ::= Ident "(" ListExp ")" ;
Instead of ListExp, BNFC programmers can use the bracket notation for lists, [Exp], which is borrowed from Haskell. Thus an alternative formulation of the ECall rule is

ECall. Exp ::= Ident "(" [Exp] ")" ;
The abstract syntax built by the parser in Haskell indeed represents lists as
Haskell lists, rather than as a new data type. In Java, list categories are
similarly represented as linked lists. To summarize,

• in the grammar, [C ] denotes the category of lists of category C;
• in Haskell, [C ] is represented by Haskell's own list type;
• in Java, [C ] is represented by the class java.util.LinkedList<C>.
Sometimes lists are required to be nonempty, i.e. have at least one element.
This is expressed in BNFC by adding the keyword nonempty:

terminator nonempty Stm "" ;
separator nonempty Exp "," ;
Their internal representations are still lists in the host language, which means
that empty lists are type-correct although the parser never returns them.
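As a small illustration of list categories on the Haskell side, here is a sketch of ours (not BNFC output), where function calls carry their arguments as an ordinary Haskell list:

newtype Ident = Ident String
  deriving Show

data Exp
  = ECall Ident [Exp]   -- ECall. Exp ::= Ident "(" [Exp] ")" ;
  | EInt Integer        -- EInt.  Exp ::= Integer ;
  deriving Show

-- count the integer literals in an expression, recursing through the argument list
countInts :: Exp -> Int
countInts (EInt _)     = 1
countInts (ECall _ es) = sum (map countInts es)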
The lexer of a BNFC-generated compiler recognizes a set of predefined token types:

• Integer, integer literals: sequences of digits, e.g. 123;
• Double, floating point literals: digits with a decimal point and possibly an exponent, e.g. 3.14 and 1.2e-3;
• String, string literals: any characters between double quotes, e.g. "hello";
• Char, character literals: any character between single quotes, e.g. 'x' and '7';
• Ident, identifiers: a letter followed by letters, digits, and the characters _ and ', e.g. x2, fun_1, a'.
The precise definitions of these types are given in the LBNF report, Appendix
A. Notice that the integer and floating point literals do not contain negative
numbers; negation is usually defined in a grammar rule as a prefix operator
working for expressions other than literals, too.
The predefined token types are often sufficient for language implementa-
tions, especially for new languages, which can be designed to follow the BNFC
rules for convenience. But BNFC also allows the definition of new token types.
These definitions are written by using regular expressions. For instance, one
can define a special type of upper-case identifiers as follows:

token UIdent (upper (letter | digit | '_')*) ;
This defines UIdent as a token type, which contains strings starting with an
upper-case letter and continuing with a (possibly empty) sequence of letters,
digits, and underscores.
The following table gives the main regular expressions available in BNFC:

expression    meaning
'a'           the character a
"foo"         the sequence of characters foo
A B           A followed by B (sequence)
A | B         A or B (union)
A*            zero or more A's
A+            one or more A's
A?            zero or one A (option)
char          any character
digit, letter, upper, lower   character classes
A - B         A but not B (difference)
eps           the empty string
When BNFC is run, bare token types are encoded as types of strings. For instance, the standard Ident type is in Haskell represented as

newtype Ident = Ident String

Position token types add to this a pair of integers indicating the line and the column in the input:

newtype Ident = Ident ((Int,Int),String)
The lexer can also be told to ignore comments. There are two kinds of comments:

• single-line comments, which run from a start token till the end of the line;
• arbitrary-length comments, which run from a start token till a closing token.
comment "//" ;
comment "/*" "*/" ;
Thus single-line comments need one token, the start token, whereas arbitrary-
length comments need the opening and the closing token.
Since comments are resolved by the lexer, they are processed by using a fi-
nite automaton. Therefore nested comments are not possible. A more thorough
explanation of this will be given in the next chapter.
• A program may contain comments, which are ignored by the parser. Com-
ments can start with the token // and extend to the end of the line. They
can also start with /* and extend to the next */.
comment "//" ;
comment "/*" "*/" ;
DFun. Def ::= Type Id "(" [Arg] ")" "{" [Stm] "}" ;
separator Arg "," ;
terminator Stm "" ;
• Variable declarations are of three kinds:

– a type and one variable,

int i ;

– a type and many variables,

int i, j ;

– a type and one initialized variable,

int i = 6 ;
Now, we could reuse the function argument declarations Arg as one kind of
statements. But we choose the simpler solution of restating the rule for one-
variable declarations.
• Expressions are specified with the following table that gives their prece-
dence levels. Infix operators are assumed to be left-associative, except
assignments, which are right-associative. The arguments in a function
call can be expressions of any level. Otherwise, the subexpressions are
assumed to be one precedence level above the main expression.
(The table, which lists the operators from the highest precedence level down, is not reproduced here.)
Finally, we need a coercions rule to specify the highest precedence level, and
a rule to form function argument lists.
coercions Exp 15 ;
separator Exp "," ;
• The available types are bool, double, int, string, and void.
Here we cannot use the built-in Ident type of BNFC, because apostrophes
(’) are not permitted! But we can define our identifiers easily by a regular
expression:

token Id (letter (letter | digit | '_')*) ;
Chapter 3

Lexing and Parsing*
A formal language is, mathematically, just any set of sequences of symbols, where symbols are just elements from any finite set, such as the 128 7-bit ASCII characters. Programming languages are examples of formal languages. They are rather complex in comparison to the examples usually studied in the theory; but the good news is that their complexity is mostly due to repetitions of simple well-known patterns.
A regular language is, like any formal language, a set of strings, i.e. se-
quences of symbols, from a finite set of symbols called the alphabet. Only
some formal languages are regular; in fact, regular languages are exactly those
that can be defined by regular expressions, which we already saw in Sec-
tion 2.9. We don’t even need all the expressions, but just five of them; the
other ones are convenient shorthands. They are shown in the following table,
together with the corresponding regular language in set-theoretic notation:
expression language
’a’ {a}
AB {ab|a ∈ [[A]], b ∈ [[B]]}
A|B [[A]] ∪ [[B]]
A* {a1 a2 . . . an |ai ∈ [[A]], n ≥ 0}
eps {} (empty string)
The table uses the notation [[A]] for the set corresponding to the expression A.
This notation is common in computer science to specify the semantics of a
language, in terms of the denotations of expressions.
When does a string belong to a regular language? A straightforward answer
would be to write a program that interprets the sets, e.g. in Haskell by using
list comprehensions instead of the set brackets. This implementation, however,
would be very inefficient. The usual way to go is to compile regular expres-
sions to finite automata. Finite automata are graphs that allow traversing
their input strings symbol by symbol. For example, the following automaton
recognizes a string that is either an integer literal or an identifier or a string
literal.
(The automaton diagram is omitted here.) It corresponds to the regular expression
digit digit*
| letter ('_' | letter | digit)*
| '"' (char - ('\' | '"') | '\' ('\' | '"'))* '"'
The automaton can be used for the recognition of tokens. In this case, a
recognized token is either a decimal integer, an identifier, or a string literal.
The recognition starts from the initial state, that is, the node marked “init”.
It goes to the next state depending on the first character. If it is a digit 0...9,
the state is the one marked “int”. With more digits, the recognition loops back
to this state. The state is marked with a double circle, which means it is a
final state, also known as an accepting state. The other accepting states
are “ident” and “string”. But on the way to “string”, there are non-accepting
states: the one before a second quote is read, and the one after an escape
(backslash) is read.
The automaton above is deterministic, which means that at any state,
any input symbol has at most one transition, that is, at most one way to go
to a next state. If a symbol with no transition is encountered, the string is
not accepted. For instance, a&b would not be an accepted string in the above
automaton; nor is it covered by the regular expression.
An automaton can also be nondeterministic, which means that some sym-
bols may have many transitions. An example is the following automaton, with
the corresponding regular expression that recognizes the language {ab, ac}:
a b | a c
Now, this automaton and indeed the expression might look like a stupid thing
to write anyway: wouldn’t it be much smarter to factor out the a and write
simply as follows?
a (b | c)
The answer is no, both surprisingly and in a way typical to compiler construc-
tion. The reason is that one should not try to optimize automata by hand—one
should let a compiler do that automatically and much more reliably! Gener-
ating a non-deterministic automaton is the standard first step of compiling
regular expressions. After that, deterministic and, indeed, minimal automata
can be obtained as optimizations.
Just to give an idea of how tedious it can be to create deterministic automata
by hand, think about compiling an English dictionary into an automaton. It
may start as follows (diagram omitted).
Automata can be conveniently written in the DOT language of the Graphviz tools, which can also render them graphically. For instance, the following code describes an automaton over the symbols a and b, accepting strings whose second-last symbol is a:
digraph {
rankdir = LR ;
start [label = "", shape = "plaintext"]
init [label = "init", shape = "circle"] ;
a [label = "", shape = "circle"] ;
end [label = "", shape = "doublecircle"] ;
start -> init ;
init -> init [label = "a,b"] ;
init -> a [label = "a"] ;
a -> end [label = "a,b"] ;
}
The intermediate abstract representation should encode the mathematical def-
inition of automata:
Definition. A finite automaton is a 5-tuple ⟨Σ, S, F, i, t⟩, where

• Σ is a finite set of symbols (the alphabet),
• S is a finite set of states,
• F ⊆ S is the set of final states,
• i ∈ S is the initial state,
• t is the transition function, giving for each state and each symbol (or ε) a set of successor states.
Step 2. Determination
One of the most powerful and amazing properties of finite automata is that they
can always be made deterministic by a fairly simple procedure. The procedure
is called the subset construction. In brief: for every state s and symbol a
in the automaton, form a new state σ(s, a) that “gathers” all those states to
which there is a transition from s by a. More precisely:
• σ(s, a) is the set of those states si to which one can arrive from s by
consuming just the symbol a. This includes of course the states to which
the path contains epsilon transitions.
• The transitions from σ(s, a) = {s1 , . . . , sn } for a symbol b are all the
transitions with b from any si . (When this is specified, the subset con-
struction must of course be iterated to build σ(σ(s, a), b).)
• The state σ(s, a) = {s1 , . . . , sn } is final if any of si is final.
Let us apply the construction to the previous nondeterministic automaton for

a b | a c
How does this come out? First we look at the possible transitions with the
symbol a from state 0. Because of epsilon transitions, there are no less than
four possible states, which we collect to the state named {2,3,6,7}. From this
state, b can lead to 4 and 9, because there is a b-transition from 3 to 4 and an
epsilon transition from 4 to 9. Similarly, c can lead to 8 and 9.
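The core of the subset construction is easy to express in code. Here is a Haskell sketch of ours, assuming the NFA is given by two functions, one for its epsilon transitions and one for its symbol transitions:

import qualified Data.Set as Set
import Data.Set (Set)

type State = Int

-- epsilon closure: add epsilon-reachable states until nothing changes
closure :: (State -> [State]) -> Set State -> Set State
closure eps ss
  | next == ss = ss
  | otherwise  = closure eps next
  where
    next = Set.union ss (Set.fromList (concatMap eps (Set.toList ss)))

-- the DFA state sigma(ss,a): all states reachable from ss by consuming just a
sigma :: (State -> [State]) -> (State -> Char -> [State])
      -> Set State -> Char -> Set State
sigma eps trans ss a =
  closure eps (Set.fromList
    [s' | s <- Set.toList (closure eps ss), s' <- trans s a])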
The resulting automaton is deterministic but not yet minimal. Therefore
we perform one more optimization.
Step 3. Minimization
Determination may leave the automaton with superfluous states. This means
that there are states without any distinguishing strings. A distinguishing
string for states s and u is a sequence x of symbols that ends up in an accepting
state when starting from s and in a non-accepting state when starting from u.
For example, in the previous deterministic automaton, the states 0 and
{2,3,6,7} are distinguished by the string ab. When starting from 0, it leads to
the final state {4,9}. When starting from {2,3,6,7}, there are no transitions
marked for a, which means that any string starting with a ends up in a dead
state which is non-accepting.
But the states {4,9} and {8,9} are not distinguished by any string: from both of them, the only string that ends in a final state is the empty string. The minimization can thus merge these states, and we get the final, optimized automaton (diagram omitted).
The algorithm for minimization is a bit more complicated than for determina-
tion. We omit its details here.
Exercise 3-2.+ Write a compiler from regular expressions to NFA’s, covering
the minimal set (symbol, sequence, union, closure, empty) and the notation
used in the presentation above. You can use BNFC and define a suitable token
type for symbols (Section 2.9). As for the precedences, closure should bind
stronger than sequence, sequence stronger than union. The automata and the
compiler can be expressed in a mathematical notation and pseudocode. For
instance, the definition of automata for one-symbol expressions suggests the base case of the compilation (diagram omitted here).
You can also write an actual compiler as a back-end to the parser. If you are
really ambitious, you can generate Graphviz code to show all the bubbles!
A deterministic automaton can moreover be made complete, so that every symbol has a transition from every state; this can always be guaranteed by adding a dedicated dead state as a goal for those symbols that are impossible to continue with.
The reasoning above relies on the correspondence theorem saying that
the following three are equivalent, convertible to each other: regular languages,
regular expressions, finite automata. The determination algorithm moreover
proves that there is always a deterministic automaton. The closure property
for regular languages and expressions follows. (The theorem is due to Stephen
Kleene, the father of regular expressions and automata. After him, the closure
construction A* is also known as the Kleene star.)
Another interesting property is inherent in the subset construction: the
size of a DFA can be exponential in the size of the NFA (and therefore of the
expression). The subset construction shows a potential for this, because there
could in principle be a different state in the DFA for every subset of the NFA, and the number of subsets of an n-element set is 2ⁿ.
A concrete example of the size explosion of automata is a language of
strings of a’s and b’s, where the nth element from the end is an a. Consider
this in the case n=2. The regular expression and NFA are easy:
(a|b)* a (a|b)
But how on earth can we make this deterministic? How can we know, when
reading a string, which a is the second-last element so that we can stop looping?
It is possible to solve this problem by the subset construction, which is left
as an exercise. But there is also an elegant direct construction, which I learned
from a student many years ago. The idea is that the state must “remember”
the last two symbols that have been read. Thus the states can be named aa,
ab, ba, and bb. The states aa and ab are accepting, because they have a as
the second-last symbol; the other two are not accepting. Now, for any more
symbols encountered, one can “forget” the previous second-last symbol and go
to the next state accordingly. For instance, if you are in ab, then a leads to ba
and b leads to bb. The complete automaton is shown in the book (diagram omitted here).
Notice that the initial state is bb, because a string must have at least two
symbols in order to be accepted.
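The "remember the last two symbols" idea translates directly into code. Here is a Haskell sketch of ours simulating this DFA:

-- state = the last two symbols read, initially ('b','b')
step :: (Char, Char) -> Char -> (Char, Char)
step (_, y) z = (y, z)

-- accept iff the second-last symbol of the input is 'a'
runDFA :: String -> Bool
runDFA s = fst (foldl step ('b', 'b') s) == 'a'

-- runDFA "abab" == True ;  runDFA "ba" == False ;  runDFA "a" == False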
With a similar reasoning, it is easy to see that a DFA for a as the third-last
symbol must have at least 8 states, for fourth-last 16, and so on. Unfortunately,
the exponential blow-up of automata is not only a theoretical construct, but
often happens in practice and can come as a surprise to those who build lexers
by using regular expressions.
The third property of finite-state automata we want to address is, well, their
finiteness. Remember from the definition that an automaton has a finite set
of states. This fact can be used for proving that an automaton cannot match
parentheses, i.e. guarantee that a string has as many left and right parentheses.
The argument uses, in the way typical of formal language theory, a’s and
b’s to stand for left and right parentheses, respectively. The language we want
to define is
{aⁿbⁿ | n = 0, 1, 2, . . .}
Now assume that the automaton is in state s_n after having read n a's and starting to read b's. This state must be different for every n, which means that there must be infinitely many states. For if we had s_m = s_n for some m ≠ n, then the automaton would recognize the strings aⁿbᵐ and aᵐbⁿ, which are not in the language! In a context-free grammar, on the other hand, this language is easy to define:
S ::= ;
S ::= "a" S "b" ;
and process it in parser tools. But there is a related construct that one might
want to try to treat in a lexer: nested comments. The case in point is code
of the form
a /* b /* c */ d */ e
If nested comments are allowed, everything between the outermost /* and */ is one comment, and the code that remains to be compiled is

a e

But a lexer that ends a comment at the first */ it encounters leaves the remaining code

a d */ e
The reason is that the lexer is implemented by using a finite automaton, which
cannot count the number of matching parentheses—in this case comment de-
limiters.
Exercise 3-3. Consider the simple NFA for the expression (a|b)* a (a|b)
discussed in the text. Make it into a DFA by using the subset construction. Is
the result different from the DFA constructed by informal reasoning?
Exercise 3-4. Test the explosion of automata in standard Lex-like tools by
compiling regular expressions similar to the previous exercise but with a as
the 10th-last or the 20th-last symbol. Do this by measuring the size of the
generated code in Haskell (Alex), Java (JLex), or C (Flex).
Thus we save the expression e and the statement s and build an SIf tree from
them, ignoring the terminals in the production.
The pseudocode shown is easy to translate to both imperative and func-
tional code. But we don’t recommend this way of implementing parsers, since
BNFC is easier to write and more powerful. We show it rather because it is a
useful introduction to the concept of conflicts, which arise even when BNFC
is used.
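To make the idea concrete, here is a minimal recursive-descent sketch of ours in Haskell (not the book's code), for a toy grammar with if statements and integer-expression statements:

data Exp = EInt Integer
  deriving Show
data Stm = SIf Exp Stm | SExp Exp
  deriving Show

type Tokens = [String]

pStm :: Tokens -> (Stm, Tokens)
pStm ("if" : "(" : ts) =
  let (e, ts1) = pExp ts
      ts2      = expect ")" ts1
      (s, ts3) = pStm ts2
  in (SIf e s, ts3)                  -- build an SIf tree from e and s
pStm ts =
  let (e, ts1) = pExp ts
  in (SExp e, expect ";" ts1)

pExp :: Tokens -> (Exp, Tokens)
pExp (t : ts) = (EInt (read t), ts)  -- assume the token is an integer literal
pExp []       = error "expected expression"

expect :: String -> Tokens -> Tokens
expect t (t' : ts) | t == t' = ts
expect t _                   = error ("expected " ++ t)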
As an example of a conflict, consider the rules for if statements with and
without else:
SIf. Stm ::= "if" "(" Exp ")" Stm
SIfElse. Stm ::= "if" "(" Exp ")" Stm "else" Stm
In an LL(1) parser, which rule should we choose when we see the token if? As
there are two alternatives, we have a conflict.
One way to solve conflicts is to write the grammar in a different way. In
this case, for instance, we can use left factoring, which means sharing the
common left part of the rules:
SIE. Stm ::= "if" "(" Exp ")" Stm Rest
RElse. Rest ::= "else" Stm
REmp. Rest ::=
To get the originally wanted abstract syntax, we have to define a function
(in the host language, i.e. Haskell or Java) that eliminates Rest trees in the
following way:
SIE exp stm REmp =⇒ SIf exp stm
SIE exp stm (RElse stm2) =⇒ SIfElse exp stm stm2
But it can be tricky to rewrite a grammar so that it enables LL(1) parsing.
Perhaps the most well-known problem is left recursion. A rule is left-recursive
if it has the form
C ::= C . . .
that is, the value category C is itself the first item on the right hand side. Left
recursion is common in programming languages, because operators such as +
are left associative. For instance, consider the simplest pair of rules for sums
of integers:
Exp ::= Exp "+" Integer
Exp ::= Integer
These rules make an LL(1) parser loop, because, to build an Exp, the parser
first tries to build an Exp, and so on. No input is consumed when trying this,
and therefore the parser loops.
The grammar can be rewritten, again, by introducing a new category:

Exp  ::= Integer Rest
Rest ::= "+" Integer Rest
Rest ::=
The new category Rest has right recursion, which is harmless. A tree con-
version is of course needed to return the originally wanted abstract syntax.
The clearest way to see conflicts and to understand the nature of LL(1)
parsing is to build a parser table from the grammar. This table has a row for
each category and a column for each token. Each cell shows what rule applies
when the category is being sought and the input begins with the token. For
example, in the grammar already considered above (with if and while statements), the cell (Stm, if) contains the rule SIf, and the cell (Stm, while) the rule SWhile; the full table is not reproduced here.

A conflict means that a cell contains more than one rule. This grammar has
no conflicts, but if we added the SIfElse rule, the cell (Stm,if) would contain
both SIf and SIfElse.
Exercise 3-5. Write a recursive-descent parser for the example grammar (with
if, while, expression statements, and integer expressions) in a general-purpose
language like Haskell or Java.
3.7 LR(k) parsing
The rightmost derivation of the same string fills in the rightmost nontermi-
nal first.
The LR(1) parser reads its input, and builds a stack of results, which are
combined afterwards, as soon as some grammar rule can be applied to the top
of the stack. When seeing the next token (lookahead 1), it chooses among five
actions:
• Shift: read one more token (i.e. move it from input to stack).
• Reduce: pop elements from the stack if they match the right-hand side of a rule, and replace them by the value constructed by the rule.
• Goto: jump to another state and act accordingly.
• Accept: return the single value on the stack when no input is left.
• Reject: report that there is input left but no action to take, or that the input is finished but the stack is not one with a single value of expected type.
Shift and reduce are the most common actions, and it is customary to illustrate
the parsing process by showing the sequence of these actions. Take, for instance,
the following grammar. We use integers as rule labels, so that we also cover the dummy coercion (label 2):

1. Exp  ::= Exp "+" Exp1
2. Exp  ::= Exp1
3. Exp1 ::= Exp1 "*" Integer
4. Exp1 ::= Integer
Initially, the stack is empty, so the parser must shift and put the token 1 to the
stack. The grammar has a matching rule, rule 4, and so a reduce is performed.
Then another reduce is performed by rule 2. Why? This is because the next
token (the lookahead) is +, and there is a rule that matches the sequence Exp
+. If the next token were *, then the second reduce would not be performed.
This is shown later in the process, when the stack is Exp + Exp1.
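For concreteness, here is what the trace could look like for the input 1 + 2 * 3 (the exact input string is an assumption here, but it is consistent with the description above):

stack               input        action
                    1 + 2 * 3    shift
1                   + 2 * 3      reduce 4
Exp1                + 2 * 3      reduce 2 (lookahead +)
Exp                 + 2 * 3      shift
Exp +               2 * 3        shift
Exp + 2             * 3          reduce 4
Exp + Exp1          * 3          shift (lookahead *, so no reduce)
Exp + Exp1 *        3            shift
Exp + Exp1 * 3                   reduce 3
Exp + Exp1                       reduce 1
Exp                              accept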
How does the parser know when to shift and when to reduce? Like in the
case of LL(k) parsing, it uses a table. In an LR(1) table, the rows are parser
states, and there is a column for each terminal and also for each nonterminal.
The cells are parser actions.
So, what is a parser state? It is a grammar rule together with the posi-
tion that has been reached when trying to match the rule. This position is
conventionally marked by a dot. Thus, for instance,

Stm ::= "if" "(" . Exp ")" Stm

is the state where an if statement is being read, and the parser has read the tokens if and ( and is about to look for an Exp.
Here is an example of an LR(1) table. It is based on the table produced by
BNFC and Happy from the previous grammar, so it is actually a variant called
LALR(1); see below. The Happy compiler has added two rules to the grammar:
rule (0) that produces integer literal terminals (L_int) from the nonterminal
Integer, and a start rule (here unnumbered), which adds the extra token $ to
mark the end of the input. Then also the other rules have to decide what to
do if they reach the end of input. We only show the columns for terminals and
the corresponding shift, reduce, and accept actions. For shift, the next state is
given. For reduce, the rule number is given.
                                      +    *    $    L_int
0   (start)                           -    -    -    s3
3   Integer -> L_int .                r0   r0   r0   -
4   Exp1 -> Integer .                 r4   r4   r4   -
5   Exp1 -> Exp1 . "*" Integer        -    s8   -    -
6   %start pExp -> Exp . $            s9   -    a    -
    Exp -> Exp . "+" Exp1
7   Exp -> Exp1 .                     r2   s8   r2   -
    Exp1 -> Exp1 . "*" Integer
8   Exp1 -> Exp1 "*" . Integer        -    -    -    s3
9   Exp -> Exp "+" . Exp1             -    -    -    s3
10  Exp -> Exp "+" Exp1 .             r1   s8   r1   -
    Exp1 -> Exp1 . "*" Integer
11  Exp1 -> Exp1 "*" Integer .        r3   r3   r3   -
The size of LR(1) tables can be large, because it is the number of rule
positions multiplied by the number of tokens and categories. For LR(2), we
need the square of the number, which is too large in practice. Even LR(1)
tables are usually not built in their full form. Instead, standard tools like Yacc,
Bison, CUP, Happy use LALR(1), look-ahead LR(1). In comparison to full
LR(1), LALR(1) reduces the number of states by merging some states that
are similar to the left of the dot. States 6, 7, and 10 in the above table are
examples of this.
In terms of general expressivity, the following inequations hold:

LR(0) < SLR < LALR(1) < LR(1)
LL(k) < LR(k), for any k
That a grammar is in LALR(1), or any other of the classes, means that its
parsing table has no conflicts. Therefore none of these classes can contain
ambiguous grammars.
Exercise 3-6. Trace the LR parsing of the (nonsense) statement
while (2 + 5) 3 * 6 * 7 ;
in the language which is the same as the language with + and * used in this
section, with while statements and expression statements added.
Exercise 3-7. Consider the language ’X’*, i.e. sequences of symbol X. Write
two context-free grammars for it: one left-recursive and one right-recursive.
With both grammars, trace the LR parsing of the string XXXX. What can you
say about the memory consumption of the two processes?
3.8 Finding and resolving conflicts

Parser tools report conflicts of two kinds: shift-reduce conflicts and reduce-reduce conflicts. The latter are more harmful, but also easier to eliminate. The clearest case is plain ambiguities. Assume, for instance, that a grammar tries to distinguish between variables and constants with two rules that both derive Exp from a plain Ident. Any Ident parsed as an Exp can then be reduced with both of the rules. The solution
to this conflict is to remove one of the rules and leave it to the type checker to
distinguish constants from variables.
A more tricky case is implicit ambiguities. The following grammar tries to
cover a fragment of C++, where a declaration (in a function definition) can
be just a type (DTyp), and a type can be just an identifier (TId). At the same
time, a statement can be a declaration (SDecl), but also an expression (SExp),
and an expression can be an identifier (EId).
The best-known shift-reduce conflict is the dangling else, arising from the SIf and SIfElse rules shown in Section 3.6. The problem arises when if statements are nested. Consider the following
input and position (.):
if (x > 0) if (y < 8) return y ; . else return x ;
There are two possible actions, which lead to two analyses of the statement.
The analyses are made explicit by braces.
shift: if (x > 0) { if (y < 8) return y ; else return x ;}
reduce: if (x > 0) { if (y < 8) return y ;} else return x ;
This conflict is so well established that it has become a “feature” of languages
like C and Java. It is solved by a principle followed by standard tools: when a
conflict arises, always choose shift rather than reduce. But this means, strictly
speaking, that the BNF grammar is no longer faithfully implemented by the
parser.
Hence, if your grammar produces shift-reduce conflicts, some programs that your grammar recognizes cannot actually be parsed. Usually these conflicts are not such well-understood ones as the dangling else, and it can take considerable effort to find and fix them. The most valuable tools in this work are the info files generated by some parser tools. For instance,
Happy can be used to produce an info file by the flag -i:
happy -i ParCPP.y
The resulting file ParCPP.info is a very readable text file. A quick way to check which rules are overshadowed in conflicts is to grep for the ignored reduce actions:

grep "(reduce" ParCPP.info

Interestingly, conflicts tend to cluster on a few rules. If you have very many, do

grep "(reduce" ParCPP.info | sort | uniq
The conflicts are (usually) the same in all standard tools, since they use the
LALR(1) method. Since the info file contains no Haskell, you can use Happy’s
info file even if you principally work with another tool.
Another diagnostic tool is the debugging parser. In Happy,
happy -da ParCPP.y
When you compile the BNFC test program with the resulting ParCPP.hs, it
shows the sequence of actions when the parser is executed. With Bison, you
can use gdb (GNU Debugger), which traces the execution back to lines in the
Bison source file.
Chapter 4
Type Checking
Type checking tries to find out if a program makes sense. This chapter defines
the traditional notion of type checking as exemplified by C and Java. While
most questions are straightforward, there are some tricky questions such as
variable scopes. And while most things in type checking are trivial for a human
to understand, Assignment 2 will soon show that it requires discipline and
perseverance to make a machine check types automatically.
This chapter provides all the concepts and tools needed for solving Assign-
ment 2, which is a type checker for a fragment of C++.
where the condition that the value y is a sorted version of the argument x is
expressed as the type Sorted(x,y). But at the time of writing this is still in
the avant-garde of programming language technology.
Coming back to more standard languages, type checking has another func-
tion completely different from correctness control. It is used for type anno-
tations, which means that it enables the compiler to produce more efficient
machine code. For instance, JVM has separate instructions for integer and
double-precision float addition (iadd and dadd, respectively). One might al-
ways choose dadd to be on the safe side, but the code becomes more efficient
if iadd is used whenever possible. One reason is that integers need just half of
the memory doubles need.
Since Java source code uses + ambiguously for integer and float addition,
the compiler must decide which one is in question. This is easy if the operands
are integer or float constants: the decision could already be made in the parser. But if the
operands are variables, and since Java uses the same kind of variables for all
types, the parser cannot decide this. Ultimately, recalling Section 3.9, this is so
because context-free grammars cannot deal with the copy language! It is the
type checker that is aware of the context, that is, what variables have been
declared and in what types. Luckily, the parser will already have analysed the
source code into a tree, so that the task of the type checker is not hopelessly
complicated.
a : bool    b : bool
--------------------
  a && b : bool
which can be read: if a has type bool and b has type bool, then a && b has type bool.
In general, an inference rule has a set of premisses J1, …, Jn and a conclusion J, which are separated by a horizontal line:

J1  …  Jn
---------
    J
This inference rule is read:
From the premisses J1 , . . . , Jn , we can conclude J.
There is also a shorter paraphrase:
If J1 , . . . , Jn , then J.
In type checkers (and also in interpreters), the rule is often applied upside
down:
To check J, check J1 , . . . , Jn .
The premisses and conclusions in inference rules are called judgements.
The most common judgements in type systems have the form
e:T
When we translate a typing rule to type checking code, its conclusion becomes a case for pattern matching, and its premisses become recursive calls for type checking. For instance, the above && rule becomes

check(a && b, bool) :
  check(a, bool)
  check(b, bool)

There are no patterns matching other types than bool, so type checking fails for them.
In a type inference rule, the premisses become recursive calls as well, but
the type in the conclusion becomes the value returned by the function:
infer(a && b) :
  check(a, bool)
  check(b, bool)
  return bool
Notice that the function should not just return bool outright: it must also
check that the operands are of type bool.
Both in inference rules and pseudocode, we use concrete syntax notation
for expression patterns—that is, a&&b rather than (EAnd a b). In real type
checking code, abstract syntax must of course be used.
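To make this concrete, here is a minimal self-contained Haskell sketch of the two functions for the && fragment; the datatypes and the error monad are stand-ins for what BNFC would generate, not the book's actual code:

  data Type = TBool | TInt
    deriving (Eq, Show)

  data Exp = ETrue | EFalse | EAnd Exp Exp
    deriving Show

  type Err = Either String     -- a stand-in for BNFC's error monad

  checkExp :: Exp -> Type -> Err ()
  checkExp e t = do
    t' <- inferExp e
    if t' == t
      then return ()
      else Left ("expected " ++ show t ++ ", found " ++ show t')

  inferExp :: Exp -> Err Type
  inferExp ETrue      = return TBool
  inferExp EFalse     = return TBool
  inferExp (EAnd a b) = do        -- premisses become recursive calls
    checkExp a TBool
    checkExp b TBool
    return TBool                  -- the type in the conclusion is returned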
4.4 Context, environment, and side conditions

In general, the typing of an expression depends on a context, so that the judgements have the form

Γ ⊢ e : T

which is read: expression e has type T in context Γ. For example, the following judgement holds:

x : int, y : int ⊢ x+y>y : bool

It means: in a context where x and y are integer variables, the expression x+y>y has type bool.
A context is a sequence of typings of variables,

x1 : T1, …, xn : Tn

The typing rule for variables is an axiom with a side condition:

Γ ⊢ x : T   if x : T in Γ
What does this mean? The condition “if x : T in Γ” is not a judgement but a
sentence in the metalanguage (English). Therefore it cannot appear above the
inference line as one of the premisses, but beside the line, as a side condition.
The situation becomes even clearer if we look at the pseudocode:

infer(Γ, x) :
  t := lookup(x, Γ)
  return t
Looking up the type of the variable is not a recursive call to infer or check, but
uses another function, lookup.
One way to make this fully precise is to look at actual implementation code;
let’s take Haskell code for brevity. Here we have the type inference and lookup
functions
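(The actual definitions fall on a page of their own in the book; the following self-contained approximation conveys the idea, and every name in it is an assumption.)

  import qualified Data.Map as Map

  type Ident   = String
  data Type    = TInt | TDouble | TBool deriving (Eq, Show)
  data Exp     = EVar Ident
  type Context = Map.Map Ident Type

  lookupVar :: Ident -> Context -> Either String Type
  lookupVar x cxt = case Map.lookup x cxt of
    Just t  -> Right t
    Nothing -> Left ("unknown variable " ++ x)

  inferExp :: Context -> Exp -> Either String Type
  inferExp cxt (EVar x) = lookupVar x cxt   -- not a recursive call to infer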
If the language has function definitions, we also need to look up the types of
functions when type checking function calls (f (a, b, c)). We will assume that
the context Γ also includes the type information for functions. Then Γ is more
properly called the environment for type checking, and not just the context.
The only place where the function storage part of the environment ever
changes is when type checking function definitions. The only place where it is
needed is when type checking function calls. The typing rule involves a lookup
of the function in Γ as a side condition, and the typings of the arguments as
premisses:
Γ ⊢ a1 : T1   ⋯   Γ ⊢ an : Tn
------------------------------   if f : (T1, …, Tn) → T in Γ
Γ ⊢ f(a1, …, an) : T
For the purpose of expressing the value of function lookup, we use the notation
(T1 , . . . , Tn ) → T for the type of functions, even though there is no such type
in the language described.
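In pseudocode, the corresponding type-inference case would look as follows (a sketch in the same style as the other rules, not verbatim from the book):

infer(Γ, f(a1, …, an)) :
  (T1, …, Tn) → T := lookup(f, Γ)
  for i = 1, …, n : check(Γ, ai, Ti)
  return T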
The tree can be made more explicit by adding explanations of which rules are applied at each inference:

(x) x : int, y : int ⊢ x : int      (y) x : int, y : int ⊢ y : int
------------------------------------------------------------- (+)
    x : int, y : int ⊢ x+y : int    (y) x : int, y : int ⊢ y : int
------------------------------------------------------------- (>)
    x : int, y : int ⊢ x+y>y : bool
In addition to the variable rule (marked x or y), the tree uses the rules for +
and >:
Γ ⊢ a : int    Γ ⊢ b : int
---------------------------
Γ ⊢ a + b : int

Γ ⊢ a : int    Γ ⊢ b : int
---------------------------
Γ ⊢ a > b : bool
As we will see in the next section, these rules are special cases of rules where also doubles and strings can be added and compared.
4.6 Overloading and type conversions

Γ ⊢ a : t    Γ ⊢ b : t
-----------------------   if t is int or double or string
Γ ⊢ a + b : t

Γ ⊢ a : t    Γ ⊢ b : t
-----------------------   if t is int or double or string
Γ ⊢ a == b : bool
and similarly for the other operators. Notice that a + expression has the same
type as its operands, whereas == always gives a boolean. In both cases, we can
first infer the type of the first operand and then check the second operand with
respect to this type:
infer(a + b) :
  t := infer(a)
  // check that t ∈ {int, double, string}
  check(b, t)
  return t
We have made string a possible type of +, following C++ and Java. For other
arithmetic operations, only int and double are possible.
Yet another case of expressions having different types is type conversions.
For instance, an integer can be converted into a double. This may sound trivial
from the ordinary mathematical point of view, because integers are a subset of
reals. But for most machines this is not the case, because integers and doubles
have totally different binary representations and different sets of instructions.
Therefore, the compiler usually has to generate a special instruction for type
conversions, both explicit and implicit ones.
The general idea of type conversions involves an ordering between types.
An object from a smaller type can be safely (i.e. without loss of information)
converted to a larger type, whereas the opposite is not safe. Then, for instance,
the typing rule for addition expressions becomes
Γ ⊢ a : t    Γ ⊢ b : u
------------------------   if t, u ∈ {int, double, string}
Γ ⊢ a + b : max(t, u)
Let us assume the following ordering:

int < double < string

Consider, then, the expression

1 + 2 + "hello" + 1 + 2
We will return to the details of its evaluation in Chapter 5. But you can already
now approach the question by finding out which type of + applies to each of
the four additions. Recall that + is left associative!
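As a hint: left associativity means that the expression parses as ((((1 + 2) + "hello") + 1) + 2). By the rule above, the innermost + is then used at type int, and each of the remaining three at type string, since max(int, string) = string.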
A similar rule could be given to return statements. However, when they occur
within function bodies, they can more properly be checked with respect to the
return types of the functions.
Similarly to statements, function definitions are just checked for validity:
x1 : T1, …, xm : Tm ⊢ s1 … sn valid
-------------------------------------
T f(T1 x1, …, Tm xm) {s1 … sn} valid
The variables declared as parameters of the function define the context in
which the body is checked. The body consists of a list of statements s1 . . . sn ,
which are checked in this context. One can think of this as a shorthand for n
premisses, where each statement is in turn checked in the same context. But
this is not quite true, because the context may change from one statement to
the other. We return to this in the next section.
To be really precise, the type checker of function definitions should also
check that all variables in the parameter list are distinct. We shall see in the
next section that variables introduced in declarations are checked to be new.
Then they must also be new with respect to the function parameters.
It would also make sense to add to the conclusion of this rule that Γ is
extended by the new function and its type. However, this would not be enough
for mutually recursive functions, that is, pairs of functions that call each
other. Therefore we rather assume that the functions in Γ are added at a
separate first pass of the type checker, which collects all functions and their
types (and also checks that all functions have different names). We return to
this in Section 4.9.
One could also add a condition that the function body contains a return statement of expected type. A more sophisticated version of this could also allow returns in branches of if statements.

The scope rules for variables in blocks are:
1. A variable declared in a block has its scope till the end of that block.
2. A variable can be declared again in an inner block, but not otherwise.
The following example illustrates the rules:
{
int x ;
{
x = 3 ; // x : int
double x ; // x : double
x = 3.14 ;
int z ;
}
x = x + 1 ; // x : int, receives the value 3 + 1
z = 8 ; // ILLEGAL! z is no more in scope
double x ; // ILLEGAL! x may not be declared again
int z ; // legal, since z is no more in scope
}
Our type checker has to control that the block structure is obeyed. This requires
a slight revision of the notion of context. Instead of a simple lookup table, Γ
must be made into a stack of lookup tables. We denote this with a dot notation, for example,

Γ1.Γ2

where Γ1 is an old (i.e. outer) context and Γ2 an inner context. The innermost context is the top of the stack.
The lookup function for variables must be modified accordingly. With just
one context, it looks for the variable everywhere. With a stack of contexts, it
starts by looking in the top-most context and goes deeper in the stack only if
it doesn’t find the variable.
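A Haskell sketch of this modified lookup, assuming the contexts are kept in a list with the innermost context first:

  import qualified Data.Map as Map

  type Ident   = String
  data Type    = TInt | TDouble deriving Show
  type Context = Map.Map Ident Type

  lookupVar :: Ident -> [Context] -> Either String Type
  lookupVar x []       = Left ("unknown variable " ++ x)
  lookupVar x (c : cs) = case Map.lookup x c of
    Just t  -> Right t              -- found in the top-most context
    Nothing -> lookupVar x cs       -- otherwise go deeper in the stack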
A declaration introduces a new variable in the current scope. This variable is
checked to be fresh with respect to the context. But how do we express that the
new variable is added to the context in which the later statements are checked?
This is done by a slight modification of the judgement that a statement is valid:
we can write rules checking that a sequence of statements is valid:

Γ ⊢ s1 … sn valid

A declaration extends the context used for checking the statements that follow:

Γ, x : T ⊢ s2 … sn valid
---------------------------   x not in the top-most context in Γ
Γ ⊢ T x; s2 … sn valid
In other words: a declaration followed by some other statements s2 . . . sn is
valid, if these other statements are valid in a context where the declared variable
is added. This addition causes the type checker to recognize the effect of the
declaration.
For block statements, we push a new context on the stack. In the rule
notation, this is seen as the appearance of a dot after Γ. Otherwise the logic is
similar to the declaration rule—but now, it is the statements inside the block
that are affected by the context change, not the statements after:
Γ. ⊢ r1 … rm valid    Γ ⊢ s2 … sn valid
-----------------------------------------
Γ ⊢ {r1 … rm} s2 … sn valid
4.9 Implementing a type checker
We make the check functions return a Void. Their job is to go through the
code and silently return if the code is correct. If they encounter an error, they
emit an error message. So does infer if type inference fails, and lookup if the
variable or function is not found in the environment. The extend functions can
be made to fail if the inserted variable or function name already exists in the
environment.
Most of the types involved in the signature above come from the abstract
syntax of the implemented language, hence ultimately from its BNF grammar.
The exceptions are FunType, and Env. FunType is a data structure that
contains a list of argument types and a value type. Env contains a lookup table
for functions and a stack of contexts, each of which is a lookup table. These
are our first examples of symbol tables, which are needed in all compiler
components after parsing. We don’t need the definitions of these types in the
pseudocode, but just the functions for lookup and for environment construction
(extend, newBlock, and emptyEnv). But we will show possible Haskell and
Java definitions below.
Here is the pseudocode for the function checking that a whole program is
valid. A program is a sequence of function definitions. It is checked in two
passes: first, collect the type signatures of each function by running extend
on each definition in turn. Secondly, check each function definition in the
environment that now contains all the functions with their types.
check(d1, …, dn) :
  Γ0 := emptyEnv()
  for i = 1, …, n : Γi := extend(Γi−1, di)
  for each i = 1, …, n : check(Γn, di)
We first use the extend function to update the environment with the types of
all functions. Then we check all definitions, on the last line, in the resulting
environment Γn , because the variables in each definition are not visible to other
definitions.
Checking a single function definition is derived from the rule in Section 4.7:
check(Γ, t f(t1 x1, …, tm xm){s1 … sn}) :
  Γ0 := Γ
  for i = 1, …, m : Γi := extend(Γi−1, xi, ti)
  check(Γm, s1 … sn)
Checking a statement list needs pattern matching over different forms of
statements. The most critical parts are declarations and blocks:
check(Γ, t x; s2 … sn) :
  // here, check that x is not yet in Γ
  Γ′ := extend(Γ, x, t)
  check(Γ′, s2 … sn)
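For blocks, the corresponding case could look as follows (a sketch; it pushes a new context with newBlock, and since Γ itself is used for the statements after the block, the block's variables disappear there):

check(Γ, {r1 … rm} s2 … sn) :
  check(newBlock(Γ), r1 … rm)
  check(Γ, s2 … sn)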
4.10 Annotating type checkers
This rule doesn’t add typed expressions to the parser, but only to the abstract
syntax, the pretty-printer, and the syntax-directed translation skeleton.
If type conversions are wanted, they can be added by a C++-style conversion rule. If the rule is not made internal, explicit type conversions also become possible
in the language. An implicit conversion adds this to the syntax tree as a part of
the type annotation process. For instance, in addition expressions, a conversion
is added to the operand that does not have the maximal type:
infer(Γ, a + b) :
  [a′ : u] := infer(Γ, a)
  [b′ : v] := infer(Γ, b)
  // here, check that u, v ∈ {int, double, string}
  if (u < v)
    return [v(a′) + b′ : v]
  else if (v < u)
    return [a′ + u(b′) : u]
  else
    return [a′ + b′ : u]
Exercise 4-4. Give the typing rule and the type-checking pseudocode for
explicit type conversions.
A suitable pipeline looks as follows. It calls the lexer within the parser, and
reports a syntax error if the parser fails. Then it proceeds to type checking,
showing an error message at failure and saying “OK” if the check succeeds.
When more compiler phases are added, the next one takes over from the OK
branch of type checking.
4.11 Type checker in Haskell
Symbol tables
The environment has separate parts for the function type table and the stack
of variable contexts. We use the Map type for symbol tables, and a list type for
the stack. Using lists for symbol tables is also possible, but less efficient and
moreover not supported by built-in update functions.
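Concretely, the environment could be declared as follows (one possible representation; the book's own definitions may differ in details):

  import qualified Data.Map as Map

  type Env     = (Sig, [Context])           -- function types and context stack
  type Sig     = Map.Map Id ([Type], Type)  -- argument types and return type
  type Context = Map.Map Id Type            -- variables declared in one block

where Id and Type come from the BNFC-generated abstract syntax.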
You should keep the datatypes abstract, that is, use them only via these oper-
ations. Then you can switch to another implementation if needed, for instance
to make it more efficient or add more things in the environment. You can
also more easily modify your type checker code to work as an interpreter or a
code generator, where the environment is different but the same operations are
needed.
Checking the overloaded addition uses a generic auxiliary for overloaded binary operations:

inferBin :: [Type] -> Env -> Exp -> Exp -> Err Type
inferBin types env exp1 exp2 = do
  -- the rest of the definition is reconstructed from the surrounding text:
  -- infer the first operand, require its type to be one of the allowed ones,
  -- then check the second operand against it
  typ <- inferExp env exp1
  if typ `elem` types
    then do
      checkExp env exp2 typ
      return typ
    else Bad ("wrong type of expression " ++ printTree exp1)
Notice that this function is able to change the environment. This means that
the checker for statement lists can be defined simply
-- the function head and the empty-list case are reconstructed here
checkStms :: Env -> [Stm] -> Err Env
checkStms env stms = case stms of
  [] -> return env
  x : rest -> do
    env' <- checkStm env x
    checkStms env' rest
4.12 Type checker in Java
Let us see how the calculator is implemented with the visitor pattern:
At least to me, the most difficult thing to understand with visitors is the
difference between accept and visit. It helps to look at what exactly happens
when the interpreter is run on an expression—let’s say 2 + 3:
But this is how Java can after all make it happen in a modular, type-correct way.

As an optimization, the recursive calls to eval in the definition above could be replaced by direct uses of accept. But this would not work for mutually recursive functions such as type inference and type checking.
For the return type R, we already have the class Type from the abstract syntax.
But we also need a representation of function types:
Now we can define the environment with two components: a symbol table
(HashMap) of function type signatures, and a stack (LinkedList) of variable
contexts. We also need lookup and update methods:
We also need something that Haskell gives for free: a way to compare types
for equality. This we can implement with a special enumeration type of type
codes:
Now we can give the headers of the main classes and methods:
On the top level, the compiler ties together the lexer, the parser, and the type
checker. Exceptions are caught at each level:
try {
l = new Yylex(new FileReader(args[0]));
parser p = new parser(l);
CPP.Absyn.Program parse_tree = p.pProgram();
new TypeChecker().typecheck(parse_tree);
} catch (TypeException e) {
System.out.println("TYPE ERROR");
System.err.println(e.toString());
System.exit(1);
} catch (IOException e) {
System.err.println(e.toString());
System.exit(1);
} catch (Throwable e) {
System.out.println("SYNTAX ERROR");
System.out.println ("At line " + String.valueOf(l.line_num())
+ ", near \"" + l.buff() + "\" :");
System.out.println(" " + e.getMessage());
System.exit(1);
}
The function typeCode converts source language types to their type codes:
Chapter 5
Interpreters
γ`e⇓v
81
This book has been purchased as a PDF version from the Publisher and is for the purchaser's sole use.
82 CHAPTER 5. INTERPRETERS
This book has been purchased as a PDF version from the Publisher and is for the purchaser's sole use.
5.2. SIDE EFFECTS 83
γ ⊢ e ⇓ ⟨v, γ′⟩
------------------------------
γ ⊢ x = e ⇓ ⟨v, γ′(x := v)⟩

γ ⊢ ++x ⇓ ⟨v + 1, γ(x := v + 1)⟩   if x := v in γ

γ ⊢ x++ ⇓ ⟨v, γ(x := v + 1)⟩   if x := v in γ
One might think that side effects only matter in expressions that have side
effects themselves, such as assignments. But also other forms of expressions
must be given all those side effects that occur in their parts. For instance,
x++ - ++x
is, even if perhaps bad style, an expression that can be interpreted easily with
the given rules. The interpretation rule for subtraction just has to take into
account the changing environment:
γ ⊢ a ⇓ ⟨u, γ′⟩    γ′ ⊢ b ⇓ ⟨v, γ″⟩
------------------------------------
γ ⊢ a - b ⇓ ⟨u − v, γ″⟩
So, what is the value of x++ - ++x in the environment x := 1? This is easy to
calculate by building a proof tree:
x := 1 ⊢ x++ ⇓ ⟨1, x := 2⟩    x := 2 ⊢ ++x ⇓ ⟨3, x := 3⟩
----------------------------------------------------------
x := 1 ⊢ x++ - ++x ⇓ ⟨−2, x := 3⟩
Another kind of side effect is IO actions, that is, input and output. For instance, printing a value is an output side effect. We will not treat
them with inference rules here, but show later how they can be implemented
in the interpreter code.
Exercise 5-0. In C, the evaluation order of the operands of subtraction is left
unspecified. What other value could the expression x++ - ++x in the environ-
ment x := 1 have in C?
5.3 Statements
Statements are executed for their side effects, not to receive values. Lists
of statements are executed in order, where each statement may change the
environment for the next one. Therefore the judgement form is
γ ⊢ s1 … sn ⇓ γ′
This can, however, be reduced to the interpretation of single statements by the
following two rules:
γ ⊢ ⇓ γ   (empty sequence)

γ ⊢ s1 ⇓ γ′    γ′ ⊢ s2 … sn ⇓ γ″
---------------------------------
γ ⊢ s1 … sn ⇓ γ″
Expression statements just ignore the value of the expression:
γ ⊢ e ⇓ ⟨v, γ′⟩
----------------
γ ⊢ e; ⇓ γ′
For if and while statements, the interpreter differs crucially from the type
checker, because it has to consider the two possible values of the condition
expression. Therefore, if statements have two rules: one where the condition
is true (1), one where it is false (0). In both cases, just one of the statements
in the body is executed. But recall that the condition can have side effects!
γ ⊢ e ⇓ ⟨1, γ′⟩    γ′ ⊢ s ⇓ γ″
-------------------------------
γ ⊢ if (e) s else t ⇓ γ″

γ ⊢ e ⇓ ⟨0, γ′⟩    γ′ ⊢ t ⇓ γ″
-------------------------------
γ ⊢ if (e) s else t ⇓ γ″
For while statements, the truth of the condition results in a loop where the
body is executed and the condition tested again. Only if the condition becomes
false (since the environment has changed) can the loop be terminated.
γ ⊢ e ⇓ ⟨1, γ′⟩    γ′ ⊢ s ⇓ γ″    γ″ ⊢ while (e) s ⇓ γ‴
---------------------------------------------------------
γ ⊢ while (e) s ⇓ γ‴

γ ⊢ e ⇓ ⟨0, γ′⟩
----------------------
γ ⊢ while (e) s ⇓ γ′
γ ⊢ t x; ⇓ γ, x := null
We don’t need to check for the freshness of the new variable, because this
has been done in the type checker! This is one instance of the principle of
Milner, that “well-typed programs can’t go wrong” (Section 1.5). However, in
this very case we would gain something with a run-time check, if the language
allows declarations in branches of if statements.
For block statements, we push a new environment on the stack, just as we
did in the type checker. The new variables declared in the block are added to
this new environment, which is popped away at exit from the block.
γ. ⊢ s1 … sn ⇓ γ′.δ
--------------------
γ ⊢ {s1 … sn} ⇓ γ′
What is happening in this rule? The statements in the block are interpreted in
the environment γ., which is the same as γ with a new, empty, variable storage
on the top of the stack. The new variables declared in the block are collected
in this storage, which we denote by δ. After the block, δ is discarded. But
the old γ part may still have changed, because the block may have given new
values to some old variables! Here is an example of how this works, with the
environment after each statement shown in a comment.
{
int x ; // x := null
{ // x := null.
int y ; // x := null. y := null
y = 3 ; // x := null. y := 3
x = y + y ; // x := 6. y := 3
} // x := 6
x = x + 1 ; // x := 7
}
5.5 Laziness
The rule for interpreting function calls is an example of the call by value
evaluation strategy. This means that the arguments are evaluated before the
function body is evaluated. Its alternative is call by name, which means
that the arguments are inserted into the function body as expressions, before
evaluation. One advantage of call by name is that it doesn’t need to evaluate
expressions that don’t actually occur in the function body. Therefore it is also
known as lazy evaluation. A disadvantage is that, if the variable is used more
than once, it has to be evaluated again and again. This, in turn, is avoided by
a more refined technique of call by need, which is the one used in Haskell.
We will return to evaluation strategies in Section 7.5. Most languages, in
particular C and Java, use call by value, which is why we have used it here, too.
But even these languages do have some exceptions: the boolean expressions a
&& b and a || b are evaluated lazily. Thus in a && b, a is evaluated first. If the
value is false (0), the whole expression comes out false, and b is not evaluated
at all. This is actually important, because it allows the programmer to write
x != 0 && 2/x > 1
which would otherwise result in a division-by-zero error when x == 0.
The operational semantics resembles if and while statements in Section
5.3. Thus it is handled with two rules—one for the 0 case and one for the 1
case:
γ ⊢ a ⇓ ⟨0, γ′⟩
--------------------
γ ⊢ a&&b ⇓ ⟨0, γ′⟩

γ ⊢ a ⇓ ⟨1, γ′⟩    γ′ ⊢ b ⇓ ⟨v, γ″⟩
------------------------------------
γ ⊢ a&&b ⇓ ⟨v, γ″⟩
For a || b, the evaluation stops if a == 1.
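Spelled out in the same style, the rules for || (left implicit in the text) would be:

γ ⊢ a ⇓ ⟨1, γ′⟩
--------------------
γ ⊢ a||b ⇓ ⟨1, γ′⟩

γ ⊢ a ⇓ ⟨0, γ′⟩    γ′ ⊢ b ⇓ ⟨v, γ″⟩
------------------------------------
γ ⊢ a||b ⇓ ⟨v, γ″⟩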
5.6 Implementing the interpreter
The top-level interpreter first gathers the function definitions to the environ-
ment, then executes the main function:
exec(d1 … dn) :
  γ0 := emptyEnv()
  for i = 1, …, n :
    γi := extend(γi−1, di)
  eval(γn, main())

exec(γ, e;) :
  ⟨v, γ′⟩ := eval(γ, e)
  return γ′

exec(γ, while (e) s) :
  ⟨v, γ′⟩ := eval(γ, e)
  if v = 0
    return γ′
  else
    γ″ := exec(γ′, s)
    exec(γ″, while (e) s)

eval(γ, a − b) :
  ⟨u, γ′⟩ := eval(γ, a)
  ⟨v, γ″⟩ := eval(γ′, b)
  return ⟨u − v, γ″⟩

eval(γ, f(a1, …, am)) :
  for i = 1, …, m :
    ⟨vi, γi⟩ := eval(γi−1, ai)
  t f(t1 x1, …, tm xm){s1 … sn} := lookup(f, γ)
  ⟨v, γ′⟩ := eval(x1 := v1, …, xm := vm, s1 … sn)
  return ⟨v, γm⟩
The implementation language takes care of the operations on values, for in-
stance, comparisons like v = 0 and calculations like u − v.
The implementation language may also need to define some predefined
functions, in particular ones used for input and output. Six such functions
are needed in Assignment 3 of this book: reading and printing integers, doubles,
and strings. The simplest way to implement them is as special cases of the eval
function, calling the host language printing or reading functions:
eval(γ, printInt(e)) :
  ⟨v, γ′⟩ := eval(γ, e)
  // print integer v to standard output
  return ⟨void-value, γ′⟩

eval(γ, readInt()) :
  // read integer v from standard input
  return ⟨v, γ⟩
The type Val can be thought of as a special case of Exp, only containing
literals (and negative numbers), but it is better implemented as an algebraic
data type. One way to do this is to derive the implementation from a BNFC
grammar by internal rules (cf. Section 4.10):
But some work remains to be done with the arithmetic operations. You cannot
simply write
VInteger(2) + VInteger(3)
because + in Haskell and Java is not defined for the type Val. Instead, you
have to define a special function addVal to the effect that
addVal(VInteger(u),VInteger(v)) = VInteger(u+v)
addVal(VDouble(u), VDouble(v)) = VDouble(u+v)
addVal(VString(u), VString(v)) = VString(u+v)
In Java, + will do for strings, but in Haskell you need ++. You won’t need any
other cases because, once again, well-typed programs can’t go wrong!
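A Haskell sketch of Val and addVal (the constructor and function names follow the text; everything else is an assumption):

  data Val = VInteger Integer | VDouble Double | VString String | VVoid
    deriving Show

  addVal :: Val -> Val -> Val
  addVal (VInteger u) (VInteger v) = VInteger (u + v)
  addVal (VDouble u)  (VDouble v)  = VDouble  (u + v)
  addVal (VString u)  (VString v)  = VString  (u ++ v)
  addVal _ _ = error "addVal: ill-typed operands (excluded by the type checker)"

The catch-all case can never be reached for type-checked programs, which is exactly the "can't go wrong" principle.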
The actual Haskell and Java code follows the same structure as in Chapter 4. In Haskell, the monad needs to be changed: the IO monad is now the most natural choice.
Notice that the Visitor interface requires a return type, which is expectedly
set to Val in ExpEvaluator, and to the dummy type Object in StmExecuter.
The environment can be changed as a side effect.
Both interpreters can be easily extended to debuggers, which print the
state (i.e. the values of variables) after each change of the state. They should
also print the statement or expression causing the change of the state.
5.7 Interpreting Java bytecode*

JVM computes expressions on a stack. For instance, the expression 5 + 6 * 7 can be computed by the following instruction sequence, where the comment after each instruction shows the contents of the stack after its execution:

bipush 5   ; 5
bipush 6   ; 5 6
bipush 7   ; 5 6 7
imul       ; 5 42
iadd       ; 47
instruction    explanation
bipush n       push byte constant n
iadd           pop topmost two values and push their sum
imul           pop topmost two values and push their product
iload i        push value stored in address i
istore i       pop topmost value and store it in address i
goto L         go to code position L
ifeq L         pop top value; if it is 0, go to position L
The instructions working on integers have variants for other types in the
full JVM; see next chapter, and also Appendix B for a longer list.
The load and store instructions are used to compile variables. The code
generator assigns a memory address to every variable. This address is an
integer. Declarations are compiled so that the next available address is reserved
to the variable in question; no instruction is generated. Using a variable as an
expression means loading it, whereas assigning to it means storing it. The
following code example with both C and JVM illustrates the workings:
while (exp)        TEST:
  stm                ; here, code to evaluate exp
                     ifeq END
                     ; here, code to execute stm
                     goto TEST
                   END:
The semantics that we gave earlier in this chapter for source code is correspondingly called big-step semantics. For instance, a + b is there specified by saying that a and b are evaluated first; but each of them can take any number of intermediate steps.
Our small-step rules for JVM relate configurations of the form ⟨P, V, S⟩, built from
• a code pointer P,
• a stack S,
• a variable storage V.
The rules work on instructions, executed one at a time. The next instruction is determined by the code pointer. Each instruction can change any of these three components.
Here are the small-step semantic rules for the instructions we have introduced:
⟨bipush v, P, V, S⟩ −→ ⟨P + 1, V, S.v⟩
⟨iadd, P, V, S.v.w⟩ −→ ⟨P + 1, V, S.(v + w)⟩
⟨imul, P, V, S.v.w⟩ −→ ⟨P + 1, V, S.(v × w)⟩
⟨iload i, P, V, S⟩ −→ ⟨P + 1, V, S.V(i)⟩
⟨istore i, P, V, S.v⟩ −→ ⟨P + 1, V(i := v), S⟩
⟨goto L, P, V, S⟩ −→ ⟨P(L), V, S⟩
⟨ifeq L, P, V, S.0⟩ −→ ⟨P(L), V, S⟩
⟨ifeq L, P, V, S.v⟩ −→ ⟨P + 1, V, S⟩   (v ≠ 0)
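The rules translate almost line by line into an executable transition function. Here is a self-contained Haskell sketch (the representation of code positions and storage is an assumption, and labels are taken to be already resolved to code positions; the head of the stack list corresponds to the top of the stack):

  import qualified Data.Map as Map

  data Instr
    = Bipush Int | Iadd | Imul
    | Iload Int | Istore Int
    | Goto Int | Ifeq Int
    deriving Show

  type Storage = Map.Map Int Int    -- V: addresses to values
  type Stack   = [Int]              -- S: top of the stack first

  step :: Instr -> (Int, Storage, Stack) -> (Int, Storage, Stack)
  step (Bipush v) (p, vs, s)         = (p + 1, vs, v : s)
  step Iadd       (p, vs, w : v : s) = (p + 1, vs, (v + w) : s)
  step Imul       (p, vs, w : v : s) = (p + 1, vs, (v * w) : s)
  step (Iload i)  (p, vs, s)         = (p + 1, vs, Map.findWithDefault 0 i vs : s)
  step (Istore i) (p, vs, v : s)     = (p + 1, Map.insert i v vs, s)
  step (Goto l)   (_, vs, s)         = (l, vs, s)
  step (Ifeq l)   (p, vs, v : s)     = (if v == 0 then l else p + 1, vs, s)
  step i          _                  = error ("stack underflow at " ++ show i)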
In the JVM case e ⇓ v means that executing the instructions in e returns the
value v on top of the stack after some number of steps and then terminates.
To make this completely precise, we should of course also specify how the
environment evolves.
Semantic rules are a precise, declarative specification of an interpreter.
They can guide its implementation. But they also make it possible, at least
in principle, to perform reasoning about compilers. If both the source lan-
guage and the target language have a formal semantics, it is possible to define
the correctness of a compiler precisely. For instance:
An expression compiler c is correct if, for all expressions e, e ⇓ v if
and only if c(e) ⇓ v.
Of course, it is a substantial task actually to carry out such a verification. To
start with, one has to make sure that arithmetic operations work in the same
way in the source and the target language. The machine’s representation of
integers is finite (for instance, 32 bits), which requires a careful specification
of what happens with an overflow. Floating-point arithmetic is even more
complex. In this book, we have not worried about such details. It might
happen that, for instance, interpreters written in Haskell and Java produce
different values in certain limiting cases.
Exercise 5-1.+ Implement an interpreter for a fragment of JVM, permitting
the execution of at least straight-code programs (i.e. programs without jumps).
You can work on text files with assembly code.
5.8 Objects and memory management*

Not all values have a fixed size. A string is an example: if you declare a string variable, the size of its value can grow beyond any limits when the program is running. This is usually the case with objects, which Java manipulates by using complex types, classes.
The compilation of object-oriented programs in full generality is beyond the
scope of this book. But we can get a flavour of the issues by just looking at
strings. Consider the function that replicates a string k times:
What happens when the string variable r is declared is that one memory word
is allocated to store an address. Loading r on the stack means loading just
this address—which has the same size as an integer, independently of the size
of the string itself. The address indicates the place where the string itself is
stored. It is not stored on the stack, but in another part of the memory, called
the heap. Let us look at what happens when the program is running with k = 2, s = "hi". We have to consider the "ordinary" variable storage (V, as in the semantic rules above) and the heap separately. The evolution of the stack and the heap is shown in Figure 5.1.
The variables that have their values in V , such as integer variables, are
called stack variables. In the current example, k and i are stack variables,
occupying addresses 0 and 3, respectively. Stack variables store objects of fixed
sizes, neatly piled on top of each other. But s and r are heap variables. For
them, V stores just addresses to the heap. In the heap, the addresses point to
objects of variable sizes. They can be shared (as in the case where s and r point
to the same string) or have some shared parts, with links to other segments
of data. They must in general be split around in different places in the heap,
because any amount of storage reserved for them may get insufficient. We will
not go into the details of how exactly strings are stored; the essential thing is
that an object on the heap need not be stored “in one place”.
The stack and the stack variable storage are allocated separately for each
function call. When the function terminates (usually by return), this storage
is freed. But this does not automatically free the heap storage. For instance, at
the exit from the function replicate, the storage for k, s, r, and i is emptied,
and therefore the addresses &s and &r disappear from V . However, we cannot
[Figure 5.1: The evolution of the variable storage and the heap in replicate with k = 2, s = "hi". The variables k and i occupy addresses 0 and 3; declaring r reserves address 2; r = s (aload 1, astore 2) makes address 2 point to the same heap string "hi" as address 1; after r = s + r, a call to the runtime string concatenation allocates a new heap object "hihi", whose address &r is stored in 2.]
just take away the strings "hi" and "hihi" from the heap, because they or
their parts may still be accessed by some other stack variables from outer calls.
This means that, while a program is running, its heap storage can grow beyond
all limits, even beyond the physical limits of the computer. To prevent this,
memory management is needed.
In C and C++, memory management is performed manually: the source
code uses the function malloc to reserve a part of memory and obtain a pointer
to the beginning of this part, and the function free to make the memory usable
for later calls of malloc, when it is no longer needed. In C++, the Standard
Template Library tries to hide much of this from the application programmer.
In Java, memory management is automatic, because JVM takes care of it.
The component of JVM that does this is called garbage collection. Garbage
collection is a procedure that finds out what parts of the heap are still needed,
in the sense that there are stack variables pointing to them. The parts that
are not needed can be freed for new uses.
Which one is better, manual memory management or garbage collection?
There are many programmers who prefer C-style manual management, because
it can be more precise and thereby consume just the minimum of memory
required, and also because garbage collection is a program that has to be run
beside the main program and can slow it down in unpredictable ways. On
the other hand, garbage collection techniques have improved a lot due to the
development of languages like Java and Haskell, which rely on it. It is so good
that its performance can be hard to beat by manual memory management.
Of course, manual memory management is needed on the implementation
level for languages that need heap allocation and garbage collection. It is
needed for writing the interpreter, as in the case of JVM. In the case of Haskell,
which is compiled to native code, garbage collection is a part of the run-time
system, which is support code linked together with the code of the compiled
program itself.
One of the simplest garbage collection methods is the mark-sweep garbage
collection algorithm. It is based on a memory model, which defines the stack
and heap data structures. For simplicity, let us think of them both as arrays.
In the stack, the array elements are of two kinds:
The heap is segmented to blocks. Each element in the heap is one of:
After each block, either a new begin node or an unused node must follow.
The algorithm itself composes three functions:
The boolean freeness value of beginning elements thus indicates marking in the
mark-sweep garbage collection. The mark of the beginning node applies to the
whole block, so that the sweep phase either preserves or frees all nodes of the
block.
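As an appetizer for the exercise, here is a much-simplified Haskell sketch of the mark and sweep phases over an abstract heap; the representation (blocks addressed by integers, each knowing the addresses it points to) is an assumption that ignores the array-level details above:

  import qualified Data.Map as Map

  type Address = Int
  data Block = Block { marked :: Bool, pointers :: [Address] }
  type Heap  = Map.Map Address Block

  -- Mark every block reachable from the roots (the addresses on the stack).
  mark :: [Address] -> Heap -> Heap
  mark roots heap = foldr visit heap roots
    where
      visit a h = case Map.lookup a h of
        Just b | not (marked b) ->
          foldr visit (Map.insert a (b { marked = True }) h) (pointers b)
        _ -> h                    -- already marked, or a dangling address

  -- Free the unmarked blocks and clear the marks of the survivors.
  sweep :: Heap -> Heap
  sweep = Map.map (\b -> b { marked = False }) . Map.filter marked

  gc :: [Address] -> Heap -> Heap
  gc roots = sweep . mark roots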
Exercise 5-2.+ Implement the mark-sweep garbage collection algorithm, in-
cluding a memory model needed for it. More details can be found in e.g. the
books Modern Compiler Implementation by Appel and Implementing Func-
tional Languages by Peyton Jones and Lester, both listed in Appendix D.
Chapter 6
Code Generation
There is a semantic gap between the basic constructs of high-level and ma-
chine languages. This may make machine languages look frighteningly different
from source languages. However, the syntax-directed translation method can
be put into use once again, and Assignment 4 will be painless for anyone who has
completed the previous assignments. It uses JVM, Java Virtual Machine, as
target code. For the really brave, we will also give an outline of compilation to
native Intel x86 code. This is the last piece in understanding the whole chain
from source code to bare silicon. We cannot give the full details, but focus on
two features that don’t show in the JVM but are important in real machines:
how to use registers and how to implement function calls by using stack frames.
This chapter provides all the concepts and tools needed for solving Assign-
ment 4, which is a compiler from a fragment of C++ to JVM.
The general picture is that machine code is simpler than source code. This is what makes the correspondence of concepts many-to-one: for instance, both statements and expressions are compiled to instructions. The same property makes the compilation of individual constructs one-to-many: typically, one statement or expression translates to many instructions. For example,

x + 3   ⇒   iload 0
            bipush 3
            iadd
But the good news resulting from this is that compilation is easy, because it
can proceed by just ignoring some information in the source language!
However, this is not completely true. Machine languages also need some
information that is not explicit in most source languages, but must be extracted
from the code in earlier compilation phases. In particular, the type checker has
to annotate the syntax tree with type information as shown in Section 4.10.
γ ⊢ e ↓ c

There would thus be one rule for each type, with type annotations assumed to be in place.
However, we will here use only pseudocode rather than inference rules. One
reason is that inference rules are not traditionally used for this task, so the
notation would be a bit home-made. Another, more important reason is that
the generated code is sometimes quite long, and the rules could become too
wide to fit on the page. But as always, rules and pseudocode are just two
concrete syntaxes for the same abstract ideas.
Following the above rules, the pseudocode for compiling * expressions be-
comes
compile(γ, [a * b : t]) :
  c := compile(γ, a)
  d := compile(γ, b)
  if t = int
    return c d imul
  else
    return c d dmul
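In Haskell, such a compilation scheme could be rendered with a state monad that accumulates the emitted instructions; everything below is a sketch with assumed names, not the book's generated code:

  import Control.Monad.State

  data Type = TInt | TDouble deriving Eq
  data Exp
    = ETyped Exp Type          -- annotation from Section 4.10
    | EMul Exp Exp
    | EInt Integer

  type Compile = State [String]     -- emitted code, most recent first

  emit :: String -> Compile ()
  emit i = modify (i :)

  compileExp :: Exp -> Compile ()
  compileExp e = case e of
    ETyped (EMul a b) t -> do
      compileExp a
      compileExp b
      emit (if t == TInt then "imul" else "dmul")
    ETyped e' _ -> compileExp e'
    EInt i      -> emit ("ldc " ++ show i)
    EMul _ _    -> error "compileExp: unannotated expression"

For instance, reverse (execState (compileExp e) []) gives the instructions in order.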
The exact definition of the environment need not bother us in the pseudocode.
We just need to know the utility functions that form its interface. Here are the
pseudocode signatures for the compilation and helper functions:
Notice that the maximum value of M is the maximum amount of variable stor-
age needed by the program, here 7. This information is needed when code is
generated for each function definition in JVM.
6.4 Simple expressions and statements
The dconst and iconst sets are better than bipush because they need no sec-
ond byte for the argument. It is of course easy to optimize the code generation
to one of these. But let us assume, for simplicity, the use of the worst-case
instructions:
compile(i) :   // integer literals
  emit(ldc i)

compile(d) :   // double literals
  emit(ldc2_w d)

compile(s) :   // string literals
  emit(ldc s)
Arithmetic operations were already covered. The scheme shown for multipli-
cation in Section 6.2 works also for subtraction and division. For addition,
we need one more case: string concatenation is compiled as a function call
(invokestatic, cf. Section 6.7):
compile([a + b : t]) :
  compile(a)
  compile(b)
  if t = int
    emit(iadd)
  else if t = double
    emit(dadd)
  else
    emit(invokestatic runtime/plusString(Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;)
Variables are loaded from the storage:
compile([x : int]) : emit(iload lookup(x))
compile([x : double]) : emit(dload lookup(x))
compile([x : string]) : emit(aload lookup(x))
Like for constants, there are special instructions available for small addresses;
see Appendix B.
Assignments need some care, since we are treating them as expressions
which both have side effects and return values. A simple-minded compilation
would give
i = 3 ;   ⇒   iconst_3
              istore_1
It follows from the semantics in Section 5.7 that after istore, the value 3 is
no more on the stack. This is fine as long as the expression is used only as a
statement. But if its value is needed, then we need both to store it and have
it on the stack. One way to guarantee this is to load the value back right after storing it (istore_1 followed by iload_1). Another way is to duplicate the top of the stack with the instruction dup:

⟨dup, P, V, S.v⟩ −→ ⟨P + 1, V, S.v.v⟩

This works for integers and strings; the variant for doubles is dup2. Thus we
can use the following compilation scheme for assignments:
compile([x = e : t]) :
  compile(e)
  if t = int
    emit(dup)
    emit(istore lookup(x))
  else if t = double
    emit(dup2)
    emit(dstore lookup(x))
  else
    emit(dup)
    emit(astore lookup(x))
What about if the value is not needed? Then we can use the pop instruction,
⟨pop, P, V, S.v⟩ −→ ⟨P + 1, V, S⟩
and its big sister pop2. The rule is common to all uses of expressions as
statements:
compile([e : t];) :
  compile(e)
  if t ∈ {int, bool, string}
    emit(pop)
  else if t = double
    emit(pop2)
  else return
The last case takes care of expressions of type void: these leave nothing on
the stack to pop. The only such expressions in our language are function calls
with void as return type.
Declarations have a compilation scheme that emits no code, but just reserves
a place in the variable storage:
compile(t x;) :
  extend(x, t)
The extend helper function looks up the smallest available address for a vari-
able, say i, and updates the compilation environment with the entry (x → i).
The “smallest available address” is incremented by the size of the type.
Blocks are likewise compiled by creating a new part of storage, which is
freed at exit from the block:
compile({s1 … sn}) :
  newBlock()
  for i = 1, …, n : compile(si)
  exitBlock()
6.5 Expressions and statements with jumps
while (exp)        TEST:
  stm                exp
                     ifeq END
                     stm
                     goto TEST
                   END:
As specified in Section 5.7, the ifeq instruction checks if the top of the stack
is 0. If yes, the execution jumps to the label; if not, it continues to the next
instruction. The checked value is the value of exp in the while condition. Value
0 means that the condition is false, hence the body is not executed. Otherwise,
the value is 1 and the body stm is executed. After this, we take a jump back
to the test of the condition.
if statements are compiled in a similar way:
if (exp)           exp
  stm1             ifeq FALSE
else               stm1
  stm2             goto TRUE
                 FALSE:
                   stm2
                 TRUE:
The idea is to have a label for the false case, similar to the label END in while
statements. But we also need a label for true, to prevent the execution of the
else branch. The compilation scheme is straightforward to extract from this
example.
JVM has no comparison operations, conjunction, or disjunction returning boolean values. Therefore, if we want the value of exp1 < exp2, we have to compile code that explicitly leaves 1 or 0 on the stack. We use the conditional jump if_icmplt LABEL, which compares the two elements on the top of the stack and jumps if the second-last is less than the last:

⟨if_icmplt L, P, V, S.v.w⟩ −→ ⟨P(L), V, S⟩   (v < w)
⟨if_icmplt L, P, V, S.v.w⟩ −→ ⟨P + 1, V, S⟩   (v ≥ w)

We can do this with just one label if we use code that first pushes 1 on the stack. This is overwritten by 0 if the comparison does not succeed:
bipush 1
exp1
exp2
if_icmplt TRUE
pop
bipush 0
TRUE:
There are instructions similar to if_icmplt for all comparisons of integers: eq,
ne, lt, gt, ge, and le. For doubles, the mechanism is different. There is one
instruction, dcmpg, which works as follows:
⟨dcmpg, P, V, S.d.e⟩ −→ ⟨P + 1, V, S.v⟩
where v = 1 if d > e, v = 0 if d = e, and v = −1 if d < e. We leave it as an
exercise (as a part of Assignment 4) to produce the full compilation schemes
for both integer and double comparisons.
Putting together the compilation of comparisons and while loops gives
terrible spaghetti code, shown here in the middle column:

                   TEST:                 TEST:
while (x < 9)        bipush 1
  stm                iload 0               iload 0
                     bipush 9              bipush 9
                     if_icmplt TRUE        if_icmpge END
                     pop
                     bipush 0
                   TRUE:
                     ifeq END
                     stm                   stm
                     goto TEST             goto TEST
                   END:                  END:

The right column shows better code doing the same job. It makes the comparison directly in the while jump, by using its negation if_icmpge; recall that
!(a < b) == (a >= b). The problem is: how can we get this code by using
the compilation schemes?
6.6 Compositionality
A syntax-directed translation function T is compositional, if the value re-
turned for a tree is a function of the values for its immediate subtrees:
T (C t1 . . . tn ) = f (T (t1 ), . . . , T (tn ))
In the implementation, this means that,
• in Haskell, pattern matching does not need patterns deeper than one;
• in Java, one visitor definition per class and function is enough.
In Haskell, it would be easy to use non-compositional compilation schemes,
by matching on deeper patterns:
compile (SWhile (ELt exp1 exp2) stm) = ...
In Java, another visitor must be written to define what can happen depending
on the condition part of while.
Another approach is to use compositional code generation followed by a
separate phase of back-end optimization of the generated code: run through
the code and look for code fragments that can be improved. This technique is
more modular and therefore usually preferable to non-compositional hacks in
code generation. We will return to optimizations in Section 6.11.
6.7 Function calls and definitions
The JVM instruction for function calls is invokestatic. As the name suggests,
we are only considering static methods here. The instruction needs to know
the type of the function. It also needs to know its class. But we assume for
simplicity that there is a global class C where all the called functions reside.
The precise syntax for invokestatic is shown by the following example:
invokestatic C/mean(II)I
This calls a function int mean (int x, int y) in class C. So the type is
written with a special syntax where the argument types are in parentheses
before the value type. Simple types have one-letter symbols corresponding to
Java types as follows:
I = int
D = double
V = void
Ljava/lang/String; = string
The top level structure in JVM (as in Java) is a class. Function definitions
are included in classes as methods. Here is a function and the compiled
method in JVM assembler (the source on the left, one possible compilation on the right):

int mean (int x, int y)      .method public static mean(II)I
{                              .limit locals 2
  return ((x+y) / 2) ;         .limit stack 2
}                              iload 0
                               iload 1
                               iadd
                               bipush 2
                               idiv
                               ireturn
                             .end method
The first line obviously shows the function name and type. The function body
is in the indented part. Before the body, two limits are specified: the storage
needed for local variables (V in the semantic rules) and the storage needed for
the evaluation stack (S in the semantics).
The local variables include the two arguments but nothing else, and since
they are integers, the limit is 2. The stack can be calculated by simulating
the JVM: it reaches 2 when pushing the two variables, but never beyond that.
The code generator can easily calculate these limits by maintaining them in
the environment; otherwise, one can use rough limits such as 1000.
Now we can give the compilation scheme for function definitions. We write
funtypeJVM (t1 , . . . , tm , t) to create the JVM representation for the type of
the function.
compile(t f (t1 x1 , . . . , tm xm ){s1 , . . . , sn }) :
emit(.method public static f funtypeJVM (t1 , . . . , tm , t))
emit(.limit locals locals(f ))
emit(.limit stack stack(f ))
for i = 1, . . . , m : extend(xi , ti )
for i = 1, . . . , n : compile(si )
emit(.end method)
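In Haskell, funtypeJVM could be sketched as follows, representing source types as plain strings for illustration (the names are assumptions of this sketch, not the book's code):

funtypeJVM :: [String] -> String -> String
funtypeJVM args ret = "(" ++ concatMap sym args ++ ")" ++ sym ret
  where
    sym "int"    = "I"
    sym "double" = "D"
    sym "void"   = "V"
    sym "string" = "Ljava/lang/String;"
    sym t        = error ("no JVM type for " ++ t)

For example, funtypeJVM ["int","int"] "int" yields (II)I, as in the mean example above.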
We have not yet shown how to compile return statements. JVM has separate
instructions for different types. Thus:
compile(return [e : t]; ) :
compile(e)
if t = string
emit(areturn)
else if t = double
emit(dreturn)
else
emit(ireturn)
compile(return; ) :
emit(return)
6.8 Putting together a class file
The methods are compiled as described in the previous section. Each method
has its own stack, locals, and labels; in particular, a jump from one method
can never reach a label in another method.
A class file can be built from a template like the following, where Foo stands for the class name:

.class public Foo
.super java/lang/Object

; the standard initializer, needed in every class
.method public <init>()V
  aload_0
  invokespecial java/lang/Object/<init>()V
  return
.end method

; the compiled methods are added here

If we follow the C convention as in Chapter 5, the class must have a main
method. In JVM, its type signature is different from C:

.method public static main([Ljava/lang/String;)V

following the Java convention that main takes an array of strings as its argument
and returns void. The code generator must therefore treat main as a special
case: create this type signature and reserve address 0 for the array variable.
The first available address for local variables is 1.
The class name, Foo in the above template, can be generated by the compiler
from the file name (without suffix). The IO functions (reading and printing
integers and doubles; cf. Section 5.7) can be put into a separate class, say
runtime, and then called as usual:
invokestatic runtime/printInt(I)V
invokestatic runtime/readInt()I
A function for string concatenation (+) can also be included in this class.
The easiest way to produce the runtime class is to write a Java program
runtime.java and compile it to runtime.class. Then you will be able to
run “standard” Java code together with code generated by your own compiler.
The class file and all JVM code shown so far is not binary code but assembly
code. It follows the format of Jasmin, which is a JVM assembler. In order
to create the class file Foo.class, you have to compile your source code into a
Jasmin file Foo.j. This file is assembled by the call

jasmin Foo.j

The resulting Foo.class can then be run with

java Foo

This executes the main function. A link for obtaining the Jasmin program is
given on the book web page.
You can disassemble Java class files with the command javap -c:
javap -c Foo
The notation is slightly different from Jasmin, but this is still a good way to
compare your own compiler with the standard javac compiler, and also to
get hints for your own compiler. The main differences are that jumps use line
numbers instead of labels, and that the ldc and invokestatic instructions
refer to the runtime constant pool instead of showing explicit arguments.
6.9 Implementing code generation

In Haskell, the compilation environment is conveniently threaded through the compiler with a state monad. A computation in the state monad can be seen as a function of type

s -> (s,v)
which takes a state as its input and returns a value and a new state. The state
can be inspected and modified by the library functions
get :: State s s
modify :: (s -> s) -> State s ()
Following the use of Void in Section 6.3, we give the compilation functions a
type whose return value doesn’t matter:

compileStm :: Stm -> State Env ()
compileExp :: Exp -> State Env ()

For example, the case of multiplication expressions in compileExp is:
EMul a b -> do
compileExp a
compileExp b
emit $ case typExp e of
Type_int -> imul_Instr
Type_double -> dmul_Instr
The helper function typExp is easy to define if the type checker has type-
annotated all trees, as explained in Section 4.10.
The environment has several components: symbol tables for functions and
variables, counters for variable addresses and labels, and also a counter for the
maximum stack depth if you want to give an accurate figure to limit stack
(Section 6.7). But the definition above assumes that the code, too, is put into
the environment! Otherwise both recursive calls to compileExp would need to
return some code, and the last step would concatenate these pieces with the
multiplication instruction.
Here is a partial definition of the environment, only containing what is
needed for variables and the code:
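A sketch of what this definition can look like (the field names are illustrative, and the imports are shared by the snippets below):

import qualified Data.Map as Map
import Control.Monad.State

data Env = Env {
  vars   :: [Map.Map String Int],  -- a stack of blocks, innermost first
  maxvar :: Int,                   -- smallest available address
  code   :: [Instruction]          -- generated code, in reverse order
  }

type Instruction = String

emptyEnv :: Env
emptyEnv = Env { vars = [Map.empty], maxvar = 0, code = [] }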
The emit function works by changing the code part of the environment with the
state monad library function modify (notice that the instructions are collected
in reverse order, which is more efficient):
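Continuing the sketch, emit could be defined as follows:

emit :: Instruction -> State Env ()
emit i = modify (\s -> s { code = i : code s })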
Similarly, the lookup of a variable address uses the get function for state mon-
ads:
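A sketch of the lookup, searching the blocks from the innermost outwards:

lookupVar :: String -> State Env Int
lookupVar x = do
  s <- get
  case [a | Just a <- map (Map.lookup x) (vars s)] of
    (a : _) -> return a
    []      -> error ("unknown variable " ++ x)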
Notice that a stack (i.e. list) of variable environments is needed, to take care
of block structure.
All operations of Section 6.3 can be implemented in a state monad by using
the get and modify functions, so that the imperative flavour and simplicity of
the compilation schemes is preserved.
In Java, the corresponding environment can be defined as a class; here is a sketch of the parts needed for variables:

class Env {
  private LinkedList<HashMap<String,Integer>> vars ;
  private int maxvar ;

  public Env() {
    vars = new LinkedList<HashMap<String,Integer>>();
    maxvar = 0 ;
  }
  public void addVar(String x, TypeCode t) {
    // use TypeCode to determine the increment of maxvar
  }
}
6.10 Compiling to native code*
The notation we will use is that of the assembly language NASM, Netwide
Assembler. NASM has open-source freely available tools and documentation;
see Appendix D for further information.
The machine also has a notion of a stack; it is a part of the memory that
can be accessed by its address. A memory address consists of a register (which
usually marks the start of the currently available memory segment) with an
offset, that is, the distance of the address from the register value as the number
of bytes. Thus for instance

add eax, [ebp-8]
means an addition to eax of the value stored in the address [ebp-8], where
ebp is the register pointing to the beginning of the current stack frame, that
is, the memory segment available for the current function call.
Let us look at a little example: a function computing the Fibonacci numbers
less than 500. We write the source code and corresponding NASM code side
by side.
The structure of the code is somewhat similar to JVM, in particular the way
the while loop is expressed with labels and jumps. The arithmetic operations
potentially need fewer instructions than in JVM; notice, however, that we use
registers rather than a stack for intermediate results.
Notice the importance of saving registers before calling a function. Every func-
tion uses registers in its own way and does not know what other functions
might have called it and how they use the registers. But since every function
has its own part of the stack (its stack frame), values stored on the stack are
safe. Either the caller or the callee (i.e. the called function) must make sure
to save registers on the stack; in the steps above, the caller does this.
Also notice the saving of the return address. Again, the callee cannot know
where it has been called from. But the caller has saved the pointer to the
calling code, and the execution of the program can continue from there when
the callee has returned. In JVM, this is not visible in the code, because the
JVM interpreter uses a hard-coded way of calling functions, which includes
keeping track of return addresses.
To give an example, let us look at a situation where a function old, with two
integer arguments, has called a function new, with one argument. When the
call of new is active, the stack looks as shown in Figure 6.1.
We follow the convention of drawing the stack so that it grows downwards.
This reflects the way memory addresses are created: the offsets are positive
before the frame pointer, negative after it. Thus, for instance, inside old
before calling new, the first argument to old is found at [ebp+8] and its first
local variable at [ebp-4],
...    local variables of the caller of old
...    saved registers of the caller of old
2      second argument to old
1      first argument to old
ret    saved return address for the call of old
fp     saved frame pointer of the caller of old      ← frame pointer of old
...    local variables of old
...    saved registers of old
3      argument to new
ret    saved return address for the call of new
fp     saved frame pointer of old                    ← frame pointer of new
...    local variables of new                        ← stack pointer

Figure 6.1: The stack when old has called new,
assuming the arguments and variables are integers and consume 4 bytes each.
The C-style calling conventions are supported by some special instructions:
enter for entering a stack frame, leave for leaving a stack frame, pusha for
pushing certain registers to the stack to save them, and popa for popping the
values back to the same registers. Instructions like these make the assembly
code shorter, but they may consume more processor time than the simpler in-
structions. One of the problems of x86 compilation is indeed the selection of
instructions, due to its CISC architecture (Complex Instruction Set Com-
puting). Another complicating feature of x86 is that floating point operations
work in a way rather different from the (historically older) integer operations.
One solution to the complications is the RISC architecture (Reduced
Instruction Set Computing), whose most common current appearance is
in the ARM processor used in mobile phones. RISC machines have fewer
instructions than x86, and these instructions are simpler and more uniform.
They also tend to have more registers, which work in a more uniform way.
But a more radical solution to native code compilation is to do it via in-
termediate code, such as LLVM (derived from “Low Level Virtual Machine”,
which is no longer its official name). LLVM has a RISC-style instruction set
with an infinite supply of virtual registers. Compilers can just generate
LLVM code and leave the generation of native code to LLVM tools.
Exercise 6-0. Compile the above Fibonacci program to assembly code with
GCC, by the command
gcc -S fib.c
and try to understand the resulting file fib.s. The code is not NASM, but
this is probably not the main problem. You can use the standard library func-
tion call printf("%i\n",hi) to express printInt(hi), if you put #include
<stdio.h> at the beginning of the file.
Exercise 6-1. Write compilation schemes for integer constants, variables, additions, subtractions, and multiplications in NASM. Assume the availability of
four integer registers (eax, ebx, ecx, edx) and an unlimited supply of stack
memory in addresses [ebp-4k] (that is, the offsets are multiples of 4). The
arithmetic instructions have the syntax

add dest, src

(and similarly for sub and imul), where the result is stored in the first operand.
Translate the expression

(x + y + z) * 2 + (y + z + u + v) * 3 - (u + v + w) * 4
Also use this as a test case for your compilation schemes. If you want to test
your compiler in reality, the book PC Assembly Language by Paul Carter gives
you the information needed (see Appendix D).
6.11 Code optimization*

Optimizations are often best separated from code generation proper,
into separate phases. We have already looked at two such cases: instruction
selection, which we wanted to perform in a compositional way (Section 6.6),
and register allocation, which is also better postponed to a phase after code
generation (Section 6.10).
Of course, the purpose of optimization can also be to improve the code
written by the programmer. It is then typically used for encouraging high-
level ways of programming—to free the programmer from thinking about low-
level details that can be left to the machine. One example is the technique of
constant folding. It allows the user to write statements such as

int s = 60 * 60 * 24 ;
Constant folding means that operations on constants are carried out at com-
pile time. Thus the shown complex expression will only consume one push
instruction in the JVM.
Constant folding can be seen as a special case of a technique known as
partial evaluation. It means the evaluation of expressions at compile time,
which in general is not total, because some variables get their values only at run
time. It needs a lot of care. For example, a tempting rule of partial evaluation
could be to reduce all self-subtractions into zero,
e − e =⇒ 0
But this can go wrong in several ways. One is the situation where e has side
effects—for instance,
i++ - i++
or a function call f() which returns an integer but prints hello each time.
Pure languages are ones in which expression evaluation doesn’t have side
effects. The advantage is that they can be optimized more aggressively than
non-pure languages—in fact, this is just one example of the ease of reasoning
about code in pure languages. Haskell is an example of a language that is,
at least almost, pure, and enjoys strongly optimizing compilers, such as GHC.
But even in pure languages, care is needed—for instance, the expression
1/x - 1/x

cannot be reduced to 0: if x is 0, evaluating the expression raises a run-time
error instead of returning a value.
A related optimization is the elimination of tail recursion: in a function whose
body ends with a call to the function itself, f (. . .){. . . f (. . .); }, the final call
can be compiled into a jump back to the beginning of the body, so that no new
stack frame is needed.
Constant folding can also be performed as a rewrite of the generated code,
one instruction sequence at a time:

bipush 5           bipush 5
bipush 6           bipush 42          bipush 47
bipush 7    =⇒     iadd        =⇒
imul
iadd
This example also shows that iterating the process can result in more op-
timizations, because the eliminable addition expression has here been created
by first eliminating the multiplication. Another example is the elimination of
stack duplication for an assignment whose value is not needed (cf. Section 6.4):
i = 6 ;   −→   bipush 6          bipush 6
               dup         =⇒    istore 4
               istore 4
               pop
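Rewrites like these can be implemented as a separate pass over the generated instruction list. The following Haskell sketch represents instructions as plain strings (an assumption made here for brevity), implements the two rewrites shown above, and iterates until nothing changes any more:

import Data.List (isPrefixOf)

type Instruction = String

peephole :: [Instruction] -> [Instruction]
peephole (p1 : p2 : op : is)
  | Just m <- num p1, Just n <- num p2, op == "imul"
      = peephole (push (m * n) : is)
  | Just m <- num p1, Just n <- num p2, op == "iadd"
      = peephole (push (m + n) : is)
peephole ("dup" : st : "pop" : is)
  | "istore" `isPrefixOf` st = st : peephole is
peephole (i : is) = i : peephole is
peephole [] = []

num :: Instruction -> Maybe Integer   -- recognize a pushed constant
num i = case words i of
  ["bipush", n] -> Just (read n)
  _             -> Nothing

push :: Integer -> Instruction
push n = "bipush " ++ show n

optimize :: [Instruction] -> [Instruction]   -- iterate to a fixpoint
optimize is = if is' == is then is else optimize is'
  where is' = peephole is

Running optimize on the five instructions above first folds the multiplication, then the addition, just as in the example.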
Register allocation is a procedure that tries to fit all variables and tem-
porary values of a program to a given set of registers. The small set of registers
in x86 is an example. This task is usually performed on an intermediate lan-
guage such as LLVM, after first letting the code generator use an unlimited
supply of registers, known as virtual registers.
The main concept in register allocation is liveness: variables that are “live”
at the same time cannot be kept in the same register. Live means that the value
of the variable may be needed later in the program. Here is an example piece
of code with the liveness of variables marked beside each statement:
int x = 1 ; // x live
int y = 2 ; // x y live
printInt(y) ; // x live
int z = 3 ; // x z live
printInt(x + z) ; // x z live
y = z - x ; // y live
z = f(y) ; // y live
printInt(y) ;
return ;
How many registers are needed for the three variables x, y, and z? The answer
is two, because y and z are never live at the same time and can hence be kept
in the same register.
A classic algorithm for register allocation works in two steps:
1. Liveness analysis: find out which variables are live at the same time, to
define an interference graph, where the nodes are variables and edges
express the relation “live at the same time”.
2. Graph colouring: colour the nodes of the graph so that the same colour
can be used for two nodes if they are not connected.
The colours in the second step correspond to registers. The example program
above has the following interference graph:
   x
  / \
 y   z
It shows that y and z can be given the same colour. Graph colouring is of course
a very general algorithm. It is also used for colouring countries on a map, where
the “interference” of two countries means that they have a common border.
Liveness analysis is an example of a family of techniques known as dataflow
analysis. In addition to register allocation, it can be used for tasks such as
dead-code elimination: a piece of code is dead if it can never be reached
when the program is run.
The colouring of a graph with k colours can be attempted with the following
two-step procedure (a Haskell sketch follows below):

1. Take away any node with fewer than k edges, together with its edges, until
no nodes are left. Maintain a list (a stack) of the removed nodes and
their edges.

2. When the graph is empty, rebuild it starting from the top of the stack
(i.e. from the last node removed). Since every node in the stack has
fewer than k neighbours, each node can always be given a colour that is
different from its neighbours.
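Here is a minimal Haskell sketch of this procedure, for graphs given as adjacency lists (all names are illustrative):

import Data.List (delete, (\\))

type Graph a = [(a, [a])]

colour :: Eq a => Int -> Graph a -> Maybe [(a, Int)]
colour k = go []
  where
    -- step 1: remove nodes with fewer than k edges, stacking them
    go stack [] = rebuild stack []
    go stack g  = case [n | n@(_, ns) <- g, length ns < k] of
      []             -> Nothing   -- every node has k or more neighbours
      (n@(v, _) : _) ->
        go (n : stack) [(u, delete v ns) | (u, ns) <- g, u /= v]
    -- step 2: rebuild from the top of the stack; a free colour always
    -- exists, since each stacked node had fewer than k edges
    rebuild [] acc = Just acc
    rebuild ((v, ns) : stack) acc =
      let used = [c | (u, c) <- acc, u `elem` ns]
      in  rebuild stack ((v, head ([0 .. k - 1] \\ used)) : acc)

For the interference graph above, colour 2 [('x',"yz"),('y',"x"),('z',"x")] indeed gives y and z the same colour.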
Of course, the procedure need not work for all graphs and all numbers k, since
it may be the case that every node has k or more neighbours. But you can test
this algorithm first with the simple graph shown above and 2 colours, and then
with a non-trivial example: draw a map of continental Europe and try to colour
it with 4 colours. Germany, for instance, has 9 neighbours, and its neighbour
Austria has 8. But you should start with the countries that have fewer neighbours,
and hope that at every point there will be countries with fewer than 4 neighbours
left.
Chapter 7
Functional Programming Languages
doub x = x + x ;
twice f x = f (f x) ;
quadruple = twice doub ;
main = twice quadruple 2 ;
This program has four function definitions. The first one defines a function
called doub, which for its argument x returns x + x. The second one defines
twice, which iterates the application of a function on an argument twice. The
third one, quadruple, applies twice to doub. The fourth one, main, prints the
result of applying twice to quadruple and 2.
We will explain the syntax and semantics of our functional language more
properly soon. Just one thing is needed now: we follow the syntax of Haskell
and write function applications by just putting the function and its arguments
one after the other,
f x y z
whereas languages like C, and also ordinary mathematics, use parentheses and
commas,
f (x, y, z)
As we will see later, this simplified notation is actually very logical. But let us
first walk through the computation of main in the above example:
main
= twice quadruple 2
= quadruple (quadruple 2)
= twice doub (twice doub 2)
= doub (doub (doub (doub 2)))
= doub (doub (doub (2 + 2)))
= doub (doub (doub 4))
= doub (doub (4 + 4))
= doub (doub 8)
= doub (8 + 8)
= doub 16
= 16 + 16
= 32
What we do in each step is replace some part of the expression by its definition,
possibly also replacing variables by their actual arguments. This operation is
called substitution, and it could be defined with syntax-directed translation.
However, it is very tricky in the presence of variable bindings and also com-
putationally expensive. Therefore we will use a better method when building
an interpreter in Section 7.4, generalizing the interpreter of function calls in
imperative languages (Section 5.4).
In a C-like language, the first two functions could be defined as follows:

// doub x = x + x
int doub(int x)
{
  return x + x ;
}

// twice f x = f (f x)
int twice(int f(int x), int x)
{
  return f(f(x)) ;
}
// not possible:
(int f (int x)) quadruple()
{
return twice(doub) ;
}
What we can do instead is to add the second argument explicitly:

int quadruple(int x)
{
return twice(doub, x) ;
}
This definition has the same meaning as the one without x; hence adding or
removing the second variable doesn’t change the meaning of the function.
To understand what precisely happens in function definitions, we introduce
types for functions. In Haskell-like languages, they are written in the following
way:

max : Int -> Int -> Int

(Haskell uses a double colon :: for typing, but we stick to a single :.) The
notation is right-associative, and hence equivalent to

Int -> (Int -> Int)

This type makes partial application possible: max can be applied to a single
argument, giving

max 4 : Int -> Int

This is a function that returns the maximum of its argument and 4. Notice
that application is left-associative, with max 4 5 the same as (max 4) 5.
In many other languages, the value of a function must be a “basic type”,
i.e. not a function type. This corresponds to having a tuple of arguments:

max : (Int * Int) -> Int
Tuples form a type of their own, with the following typing rule:

Γ ⊢ a : A    Γ ⊢ b : B
――――――――――――――
Γ ⊢ (a, b) : A * B
Partial application cannot access parts of tuples. Hence, using a function over
a tuple forces its application to both arguments.
But there is an equivalence between functions over tuples and two-place
functions:
(A ∗ B) → C ⇐⇒ A → B → C
Converting the first to the second is called currying, with reference to Haskell
B. Curry, the logician who invented many of the ideas underlying functional
programming. It is a powerful programming technique, but it also simplifies
the semantics and implementation of programming languages; for instance, as
we have seen, it enables the encoding of many-place functions as one-place
functions, which simplifies both the type checking and interpretation rules.
7.3 Anonymous functions

A lambda abstract is an expression

λx.e

(in the concrete syntax below, (\x -> e)), which denotes the function mapping
x to the value of e. With lambda abstracts, a function definition

f x1 . . . xn = e

can equivalently be written as the definition of the function name alone:
f = λx1 . . . . λxn .e
Many languages have no anonymous functions, so that every function must be
defined separately and given a name:

// triple x = x + x + x
int triple(int x)
{
return x + x + x ;
}
Summing up, the expressions of our functional language have the following
syntax:

Exp ::=
Ident -- variables, constants
| Integer -- integer literals
| "(" "\" Ident "->" Exp ")" -- abstractions
| "(" Exp Exp ")" -- applications
The interpreter works with evaluation judgements of the form

γ ⊢ e ⇓ v

which is read, “in the environment γ, the expression e evaluates to the value
v”. Notice that evaluation cannot change the environment. This is because
we are dealing with a purely functional language, a language without side
effects.
The value of an expression can be a plain number, as in

2 + 3 ∗ 8 ⇓ 26
But it can also be more complex. For instance, the Haskell function replicate
creates a list of some number of copies of an object, so that for instance
replicate 20 1 ⇓ [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
As an approximation of what values are we could now say: values are closed
expressions. In practice, however, it is better to include yet another ingredient:
to allow open expressions together with values for their free variables. For
instance,
(2 + 3 * x){x := 8}
is such a value. It could be computed further, by replacing x with 8. But we
can regard this as good enough a value even without further computation.
In general terms, we will use values of the form
e{γ}
where e is an expression and γ is an environment. The environment gives values
to the free variables of the expression. It, so to speak, closes the expression,
and is therefore called a closure.
The semantics we are now going to formulate uses two kinds of values:
• integers
• closures of lambda abstracts
The rules for variables and integer literals are simple: a variable expression is
evaluated by looking up the variable in the environment,

x := v is in γ
――――――――
γ ⊢ x ⇓ v

and an integer literal evaluates to itself:

γ ⊢ i ⇓ i
For lambda abstracts, the normal procedure is: do nothing. A lambda abstract
is itself a perfect representation of a function as a value. However, the body
of the lambda abstract may contain some other variables than the one bound
by the lambda. These variables get their values in the evaluation environment,
which is therefore added to the expression to form a closure:
γ ⊢ (λx.e) ⇓ (λx.e){γ}
Function application is the most complex case. Here we can recall how we
did for the imperative language in Section 5.4. There we had to deal with
applications to many arguments simultaneously, whereas here it is enough to
consider functions with one argument. Recalling moreover that evaluation has
no side effects, we can consider the following special case of the application
rule:
γ ⊢ a ⇓ u    x := u ⊢ s1 . . . sn ⇓ v
――――――――――――――――――――  if V f (T x){s1 . . . sn } in γ
γ ⊢ f (a) ⇓ v
Adapting this rule to the functional language requires two changes:

• the body of the function is a single expression e, not a sequence of statements s1 . . . sn ;
• the function f in an application (f a) is itself an expression, which must be evaluated before it can be applied.
The latter difference implies that the evaluation of f is not simply a look-up in
the function table. But we can just replace the look-up by a step of evaluating
the expression f . This evaluation results in a closure, with a lambda abstract
λx.e and an environment δ. Then (f a) is computed by evaluating e in an
environment where the variable x is set to the value of the argument a:
γ ⊢ f ⇓ (λx.e){δ}    γ ⊢ a ⇓ u    δ, x := u ⊢ e ⇓ v
――――――――――――――――――――――――――
γ ⊢ (f a) ⇓ v
For example, the definition

doub x = x + x

can equivalently be written

doub = \x -> x + x
In this case, the applied function has no free variables. But this is just a limiting
case. The need of closures is shown by an example with a two-place function,
plus x y = x + y
7.5 Call by value vs. call by name

Consider the following program:
infinite = 1 + infinite
first x y = x
main = first 5 infinite
With call by value, the computation of main proceeds as follows:

main
= first 5 infinite
= (\x -> \y -> x) 5 (1 + infinite)
= (\y -> 5) (1 + infinite)
= (\y -> 5) (2 + infinite)
...
which leads to non-termination. Even though the function first ignores its
second argument, call-by-value requires this argument to be evaluated.
With call by name,
main
= first 5 infinite
= (\x -> \y -> x) 5 infinite
= (\y -> 5) infinite
= 5
Call by name is, in this sense, the most general evaluation order: if there is
any order that makes the evaluation of an expression terminate, then
call-by-name is such an order.
Why isn’t call by name always used then? The reason is that it may be less
efficient, since it may lead to some expressions getting evaluated many times,
i.e. once for each time the argument is used. With call by value, the expression
is evaluated just once, and its value is then reused for all occurrences of the
variable. The following pair of examples shows what happens:
doub x = x + x
doub (doub 8)
= doub 8 + doub 8 -- by name
= 8 + 8 + 8 + 8
= 32
doub (doub 8)
= doub 16 -- by value
= 16 + 16
= 32
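Haskell itself evaluates with call by name (refined into call by need, where the value is additionally shared), so this behaviour can be observed directly in a few lines:

first :: a -> b -> a
first x y = x

infinite :: Integer
infinite = 1 + infinite

main :: IO ()
main = print (first 5 infinite)  -- prints 5; infinite is never evaluated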
7.6 Implementing an interpreter

As a test case for the interpreter, consider the following program, which
computes 2 to the power of 30:
doub x = x + x ;
pow x = if (x < 1) then 1 else doub (pow (x-1)) ;
main = pow 30 ;
Notice the scoping of variables: in an expression such as

\x -> \x -> x + x

the occurrences of x in the body refer to the innermost binding.
The code of the interpreter follows the semantic rules, case by case. Integer
literals and variables are evaluated as follows:

eval(γ, i) :
  return i

eval(γ, x) :
  e{δ} := lookup(γ, x)
  eval(⟨functions(γ), δ⟩, e)

Arithmetic operations evaluate their operands and combine the results:

eval(γ, a + b) :
  u := eval(γ, a)
  v := eval(γ, b)
  return u + v
The + on the last line is integer addition on the value level. It fails if the values
are not integers. But as long as the language has no type checker, we will know
this only at run time. Notice, moreover, that this rule can only be applied if u
and v are fully evaluated integers. The less than operator < has a similar rule,
returning 1 if the comparison is true, 0 if it is false.
If-then-else expressions are interpreted lazily, even if we use call by value as
the general strategy:

eval(γ, if c then a else b) :
  u := eval(γ, c)
  if u = 1
    eval(γ, a)
  else
    eval(γ, b)
Abstractions simply return closures with the variables of the current environ-
ment:
eval(γ, λx.b) :
  return (λx.b){variables(γ)}
Notice that we take only the variables of the environment into the closure, not
the function symbols.
Application is the most complex case. Here is a general rule, which works
for both call by value and call by name strategies. The decision is made in
just one point: when deciding what value to use for the bound variable when
evaluating the body.
eval(γ, (f a)) :
  (λx.b){δ} := eval(γ, f )
  if call by value
    u := eval(γ, a)
  else
    u := a{variables(γ)}
  eval(update(⟨functions(γ), δ⟩, x, u), b)
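As a self-contained Haskell sketch of the call-by-value case, with the simplification that the environment maps variables directly to values (function definitions are assumed to be preloaded into the same map):

import qualified Data.Map as Map

data Exp
  = EInt Integer
  | EVar String
  | EAdd Exp Exp
  | EAbs String Exp
  | EApp Exp Exp

data Value = VInt Integer | VClos String Exp Env
type Env = Map.Map String Value

eval :: Env -> Exp -> Value
eval env e = case e of
  EInt i   -> VInt i
  EVar x   -> maybe (error ("unbound variable " ++ x)) id
                    (Map.lookup x env)
  EAdd a b -> case (eval env a, eval env b) of
    (VInt u, VInt v) -> VInt (u + v)
    _                -> error "not integers"  -- found only at run time
  EAbs x b -> VClos x b env                   -- form a closure
  EApp f a -> case eval env f of
    VClos x b delta -> eval (Map.insert x (eval env a) delta) b
    _               -> error "not a function"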
7.7 Type checking functional languages*
Simple as the system is, it is much more powerful than the type system we
used for the imperative language in Chapter 4. The power comes from the
unconstrained generation of function types from any other types, giving rise to
functions of functions, and so on. For example, the function twice from the
beginning of this chapter has the type

twice : (Int -> Int) -> Int -> Int
In Section 7.2, we gave rules for this type system and explained the method of
currying, implying that we only need one-place functions.
A type checker could be implemented in the usual way by converting the
typing rules to type checking and inference code. Some care is thereby needed,
though. Starting with the abstraction rule,
Γ, x : A ⊢ b : B
――――――――――
Γ ⊢ λx.b : A → B
it is easy to define type checking:
check(Γ, λx.b, A → B) :
check(extend(Γ, x, A), b, B)
But what happens if we need type inference? Before even trying to formulate
type inference for lambda abstracts, we can simply notice that the expression
\x -> x

can be given any type of the form

A -> A
whatever type A is. Hence it is impossible to do type inference for all expressions
of our functional language, if we expect to return a unique type in simple type
theory.
One way to solve the type inference problem for lambda abstracts is to
change their syntax, so that it includes type information:
λx : t.b
But a more common solution is to make the type system polymorphic. This
means that one and the same expression can have many types.
7.8 Polymorphism*
The polymorphism idea was introduced in ML in the 1970’s and inherited by
Haskell in the 1990’s. It also inspired the template system of C++. Taking
the simplest possible example, the identity function of the last section, we can
write in C++:
// id : A -> A
template<class A> A id(A x)
{
return x ;
}
and in Java:

// id : A -> A
public static <A> A id(A x)
{
return x ;
}
Notice that different variables mean more generality than the same variable.
For example, a -> b is more general than a -> a, because it doesn’t force a
and b to be the same.
Let us take one of the examples into detailed consideration: \f -> \x -> f (f x).
We start the procedure by introducing a variable t for the type of the whole
expression. Since the expression is a function of f and x with a body of some
type c, we have

t = a -> b -> c

where a and b are the types of f and x, and the body has the typing

f (f x) : c

Since f is applied to an argument, its type must be a function type,

f : d -> e
But since f is the variable bound by the first lambda, we also have
f : a
and hence,
a = d -> e
Thus the result of applying f must have type e. But it must also have type c,
because f (f x) : c. What is more, it must also have type d, because f can
be applied to its own result. Hence
c = e = d
The type of x is on one hand b (as the second abstracted variable), on the
other hand d (because f applies to x). Hence
c = e = b = d
a = d -> d

Putting everything together, the type of the whole expression is

t = (d -> d) -> d -> d
7.9 Polymorphic type checking with unification*
The algorithm needs a supply of fresh type variables, for which we assume a
function

Ident fresh ()

which returns a new type variable each time it is called. The other central
concept is that of a substitution γ, a finite mapping from type variables to
types.
A substitution γ can be applied to a type t, which means replacing the type
variables in t with their values as given in γ. (Substitution in types is simpler
than substitution in expressions, because there are no variable bindings in
types. This is one reason why we avoided substitutions in the interpreter and
used closures instead.) We write
tγ
to apply the substitution γ to the type t. As an example of a substitution and
its application,
(a -> c -> d){a:= d -> d, c:=d, b:=d} ⇓ (d -> d) -> d -> d
is one that could have been used at a certain point in the type inference of the
previous section.
Now we are ready to start defining type inference. Constants and variables
are simple, and return the empty substitution {}:
infer(i) :
  return ⟨{}, Int⟩

infer(x) :
  t := lookup(x)
  return ⟨{}, t⟩
For lambda abstracts, we introduce a fresh type variable for the bound variable;
this represents the argument type of the function. Then we infer the type of
the body. This inference returns a substitution, which we have to apply to the
argument type variable, since it may give it a value. After the inference of the
body, we must discard the latest variable x from the context:
infer(λx.b) :
  a := fresh()
  extend(x, a)
  ⟨γ, t⟩ := infer(b)
  free(x)
  return ⟨γ, aγ → t⟩
As a first example, let us infer the type of the identity function by following
the rules. We write this in the same format as the definition of the infer
function, showing the actual values returned at each stage. We also make the
context explicit when relevant:
infer(λx.x) :
  a := fresh()
  extend(x, a)
  ⟨{}, a⟩ := infer(x) // in context x : a
  return ⟨{}, a{} → a⟩
infer(f a) :
  ⟨γ1 , t1 ⟩ := infer(f )
  ⟨γ2 , t2 ⟩ := infer(a)
  v := fresh()
  γ3 := mgu(t1 γ2 , t2 → v)
  return ⟨γ3 ◦ γ2 ◦ γ1 , vγ3 ⟩
The types of the function and the argument are inferred first. To combine
all information obtained, the type of the function is refined by applying the
substitution received from the argument, obtaining t1 γ2 . The type is then
expressed in terms of the inferred argument type t2 and an unknown value
type, which is represented by a fresh type variable v. These two types are sent
to unification, mgu. This gives yet another substitution γ3 , which is applied
to the value type. All information is finally gathered in the composition of
substitutions γ3 ◦ γ2 ◦ γ1 , which is returned together with the value type. The
composition of substitutions is defined via their application, in a way similar
to the usual composition of functions:
t(δ ◦ γ) = (tγ)δ
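In Haskell, substitutions and the two operations could be sketched as follows (with an illustrative Type datatype, and assuming the idempotent substitutions that mgu produces):

import qualified Data.Map as Map

data Type = TInt | TVar String | TFun Type Type
  deriving (Eq, Show)

type Subst = Map.Map String Type

apply :: Subst -> Type -> Type      -- the application t γ
apply g t = case t of
  TInt     -> TInt
  TVar v   -> maybe t id (Map.lookup v g)
  TFun a b -> TFun (apply g a) (apply g b)

-- apply (compose d g) t == apply d (apply g t)
compose :: Subst -> Subst -> Subst
compose d g = Map.map (apply d) g `Map.union` d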
It remains to define unification, mgu. It takes two types and returns their
most general unifier, which is a substitution γ that gives the same result
when applied to any of the two types. In other words,

t γ = u γ, where γ = mgu(t, u)

Of course, mgu can also fail, if the types are not unifiable.
mgu(v, t) : // the first type is a type variable
  if t = v
    return {}
  else if occurs(v, t)
    fail (“occurs check”)
  else
    return {v := t}
mgu(t, v) : // the second type is a type variable
  mgu(v, t)

mgu(t, u) : // other cases: succeeds only for equal types
  if t = u
    return {}
  else
    fail (“types not unifiable”)
There are thus two reasons why types may not unify. They can just be different
types, as in the last case. It rejects for instance the unification of Int with
Double, and Int with a function type. Or the unification can fail in the so-
called occurs check: if the type variable v occurs in the type t, the types
are not unifiable. This check rejects, for instance, the unification of v with
v → u. If the occurs check were not performed, the algorithm would return the
substitution

{v := v → u}

which could be applied over and over again without ever eliminating v. A
famous consequence of the occurs check is that self-application cannot be typed:
infer(λx.(x x)) :
  a := fresh()
  extend(x, a)
  ⟨γ, t⟩ := infer(x x) :
    ⟨{}, a⟩ := infer(x)
    ⟨{}, a⟩ := infer(x)
    b := fresh()
    γ := mgu(a, a → b) :
      fail (“occurs check”)
On the last line, mgu fails because of occurs check: a cannot unify with a → b.
In other words, a function cannot be applied to itself.
Exercise 7-2. Trace the type inference algorithm and unification with the
expression \f -> \x -> f (f x).
Exercise 7-3. Extend type inference and unification to pair and list types.
Can you find a generalization that makes this possible without adding new
cases to the function definitions?
Exercise 7-4.+ Implement the unification algorithm and type inference for
expressions. Try them out with all the examples discussed.
Exercise 7-5. Even if (\x -> (x x)) fails to type check, a “self-application”
is completely legal in (\x -> x)(\x -> x). Can you explain why?
Chapter 8

The Language Design Space
The functional language shown in the previous chapter was very simple, but it
can be made even simpler: the minimal language of lambda calculus has just
three grammar rules. It needs no integers, no booleans—almost nothing, since
everything can be defined by those three rules. This leads us to the notion
of Turing Completeness, which defines what a general-purpose programming
language must be able to do. In addition to lambda calculus, we will show
another Turing-complete language, which is an imperative language similar to
C, but still definable in less than ten lines. Looking at these languages gives
us tools to assess the popular thesis that “it doesn’t matter what language you
use, since it’s the same old Turing machine anyway”.
This chapter provides the concepts and tools needed for solving Assignment
6 in a satisfactory way—creating a domain-specific query language. But it is
your own imagination and willingness to learn more that set the limits of what
you can achieve.
In the 1930's, the question of the mechanical solvability of mathematical
problems led logicians to precise models of computation: among them, Turing's
machine model, Church's lambda calculus, and recursive functions. It was soon
proved that these models are equivalent: although they express programs and
computation in very different ways, they cover exactly the same programs.
And the solvability of mathematical problems got a negative answer:
it is not possible to construct a machine that can solve all problems. One of
the counter-examples is the halting problem: it was proved by Turing that
there cannot be any program (i.e. any Turing machine) which decides for any
given program and input if the program terminates with that input.
The models of computation also became prototypes for programming lan-
guages, corresponding to the different programming language paradigms
still in use today. Thus the Turing Machine itself was the prototypical imper-
ative language. Lambda Calculus was the prototypical functional language,
but the way programs are usually written looks more like recursive functions.
The term Turing-completeness is used for any programming language that
is equivalent to any of these models, that is, equivalent to the Turing Machine.
All general-purpose programming languages used today are Turing-complete.
But this doesn’t say very much: actually, a language can be very simple and
still Turing-complete.
8.2 Pure lambda calculus as a programming language*
Cutting down the grammar of the previous chapter to variables, lambda
abstracts, and applications leaves just three rules. This language is called the
pure lambda calculus. It doesn’t even need
integer constants, because they can be defined as follows:
0 = \f -> \x -> x
1 = \f -> \x -> f x
2 = \f -> \x -> f (f x)
3 = \f -> \x -> f (f (f x))
...
Thus the numeral for n is a function that applies any function f to any
argument x n times. Addition can then be defined as

PLUS = \m -> \n -> \f -> \x -> n f (m f x)

The intuition is: when you add n to m, you get a function that applies f first
m times and then n times. Altogether, you apply f to x m + n times.
Here is an example. You may want to carry it out in more detail by using
the operational semantics of the previous chapter.
PLUS 2 3
= (\m -> \n -> \f -> \x -> n f (m f x))
(\f -> \x -> f (f x)) (\f -> \x -> f (f (f x)))
= \f -> \x -> (\f -> \x -> f (f (f x)))
f ((\f -> \x -> f (f x)) f x)
= \f -> \x -> (\f -> \x -> f (f (f x))) f (f (f x))
= \f -> \x -> f (f (f (f (f x))))
= 5
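Since our notation follows Haskell, the encoding can be tried out in Haskell itself, at a fixed value type (church and toInt are helper names introduced here):

type Church = (Integer -> Integer) -> Integer -> Integer

church :: Integer -> Church   -- church n f x applies f to x n times
church 0 _ x = x
church n f x = f (church (n - 1) f x)

plus :: Church -> Church -> Church
plus m n f x = n f (m f x)

toInt :: Church -> Integer    -- decode by counting with (+1)
toInt n = n (+ 1) 0

-- toInt (plus (church 2) (church 3)) == 5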
Booleans can be encoded as follows:

TRUE = \x -> \y -> x
FALSE = \x -> \y -> y

The idea is that a boolean performs a choice from two arguments. TRUE chooses
the first, FALSE the second. (Thus FALSE happens to be equal to 0 after all,
by renaming of bound variables.) This property is put to use in the conditional
expression, which expects a Church boolean as its first argument, and the “if”
and “else” values as additional arguments:

IFTHENELSE = \b -> \x -> \y -> b x y
In a language with recursive definitions, the equation

fact n = IFTHENELSE (ISZERO n) 1 (MULT n (fact (PRED n)))

defines the factorial (n!). In pure lambda calculus, however, such definitions
are not available. The “definitions” we have shown above are not part of
the language, but just something that we use for introducing shorthands for
long expressions. A shorthand must always be possible to eliminate from an
expression. Using the defined name on the right hand side would create a
circular definition, and the name would be impossible to eliminate.
Thus the most amazing invention in pure lambda calculus is perhaps the
possibility to express recursion. This can be done with the fix-point combinator, also known as the Y combinator, which is the function

Y = \g -> (\x -> g (x x)) (\x -> g (x x))

Its crucial property is the equation

Y g = g (Y g)

which means that Y iterates g infinitely many times. You can easily verify this
property by taking a couple of computation steps (exercise).
The factorial can be defined in terms of the fix-point combinator as follows:
FACT =
Y (\f -> \n -> IFTHENELSE (ISZERO n) 1 (MULT n (f (PRED n))))
This corresponds very closely to the intuitive recursive definition, using the
Church numerals, booleans, and conditionals. It needs two auxiliary concepts:
ISZERO to decide if a numeral is equal to 0, and PRED to find the previous
numeral (i.e. n − 1, except 0 for 0). These can be defined as follows:

ISZERO = \n -> n (\x -> FALSE) TRUE
PRED = \n -> \f -> \x -> n (\g -> \h -> h (g f)) (\u -> x) (\u -> u)

together with multiplication,

MULT = \m -> \n -> \f -> m (n f)
The definition of PRED is rather complex; you might want to try and verify at
least that PRED 1 is 0.
To write programs in pure lambda calculus is certainly possible! But it is in-
convenient and inefficient. However, it is a good starting point for a language to
have a very small core language. The implementation (compiler, interpreter)
is then built for the core language with syntactic sugar and possibly optimiza-
tions. For instance, the functional language of Chapter 7 has built-in integers
as an optimization. Among real-world languages, Lisp is built from lambda
calculus with very few additions, such as a primitive notion of lists. Haskell
has a small core language based on lambda calculus with algebraic datatypes
and pattern matching, as well as primitive number types.
Exercise 8-0. Show the Y combinator property
Y g = g (Y g)
8.3 Another Turing-complete language*

The second minimal language we look at is BF, an imperative language in the
spirit of C, whose whole definition fits on a few lines.
A BF program has one implicit byte pointer, called “the pointer”, which
is free to move around within an array of bytes, initially all set to zero. The
pointer itself is initialized to point to the beginning of this array. The language
has eight commands, each of which is expressed by a single character:
> increment the pointer
< decrement the pointer
+ increment the byte at the pointer
- decrement the byte at the pointer
. output the byte at the pointer
, input a byte and store it in the byte at the pointer
[ jump forward past the matching ] if the byte at the pointer is 0
] jump backward to the matching [ unless the byte at the pointer is 0
All other characters are treated as comments and thereby ignored.
Here is an example of a BF program: char.bf, displaying the ASCII char-
acter set (from 0 to 255):
.+[.+]
Here is the program hello.bf, which prints “Hello”
++++++++++ Set counter 10 for iteration
[>+++++++>++++++++++<<-] Set up 7 and 10 on array and iterate
>++. Print ’H’
>+. Print ’e’
+++++++. Print ’l’
. Print ’l’
+++. Print ’o’
Exercise 8-4. Define integer addition in BF. You can first restrict it to numbers
whose size is one byte, then try to be more ambitious.
Exercise 8-5. Write an interpreter of BF and try it out on char.bf and
hello.bf.
Exercise 8-6. Write a compiler of BF via translation to C:
> ++p;
< --p;
+ ++*p;
- --*p;
. putchar(*p);
, *p = getchar();
[ while (*p) {
] }
The code should appear within a main () function, which initializes the storage
and the pointer as follows:
char a[30000];
char *p = a;
Test your compiler with the BF programs presented in this section. The array
size 30,000 comes from the original BF definition; to make BF truly Turing-
complete you can think of it as infinite.
8.4 Criteria for a good programming language

What makes a programming language good? A classic criterion is orthogonality:
a small number of constructs that can be combined freely. Other criteria include:

• Efficiency: the language should permit writing code that runs fast and
in small space.
• Clarity: the language should permit writing programs that are easy to
understand.
These criteria are obviously not always compatible, so trade-offs have to be made.
For instance, pure lambda calculus and BF obviously satisfy orthogonality, but
hardly any of the other criteria. Haskell and C++ are known for providing
many ways to do the same things, and therefore blamed for lack of orthogonal-
ity. But they are certainly good on many other counts.
In practice, different languages are good for different applications. For
instance, BF can be good for reasoning about computability. There may also
be languages that aren’t good for any applications. And even good languages
can be implemented in bad ways, let alone used in bad ways.
We suggested in Chapter 1 that languages are evolving toward higher and
higher levels, as a result of improved compiler technology and more powerful
computers. This creates more work for machines (and for compiler writers!)
but relieves the burden of language users. Here are some trends that can be
observed in the history:
• Toward richer type systems (from bit strings to numeric types to struc-
tures to algebraic data types to dependent types).
When a new language is designed, some of the main decisions to be made are
the following:

• Imperative or declarative?
• Interpreted or compiled?
• Portable or platform-dependent?
• Turing-complete or limited?
• Language or library?
Making a DSL Turing-complete means that it has, in theory at least, the same
power as general-purpose languages. PostScript and JavaScript are examples
of DSL’s that actually are Turing-complete. But the extra power comes with
the price that their halting problem is undecidable. Consequently, there are
no complexity guarantees for programs. For instance, a parser written in a
general-purpose language can be exponential or even loop infinitely, whereas a
parser written in BNFC is guaranteed to run in linear time.
Nevertheless, there is a rationale for DSL’s which are not languages at all,
but just libraries in general-purpose programming languages. They are known
as embedded languages.
Embedded languages have several advantages. For instance:

• No extra training is needed for those who already know the host language.
A well-known example of an embedded language is parser combinators in
Haskell. If we work with the combinators, we must first define the abstract
syntax datatypes manually:
data Stm = SIf Exp Stm | SWhile Exp Stm | SExp Exp
data Exp = EInt Integer
Then we define the parser functions, using objects of types Stm and Exp as
values. The terminals of the BNF grammar are treated by the lit function,
and the nonterminals by calls to the parsing functions. Semantic actions are
used for building the abstract syntax trees.
infixr 4 ...
infixl 3 ***
infixl 2 |||
-- read input [a], return value b and the rest of the input
type Parser a b = [a] -> [(b,[a])]
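One possible set of definitions for these combinators is the following sketch (the details may differ from the definitions in Figure 8.1):

lit :: Eq a => a -> Parser a a          -- accept one given token
lit c (t : ts) | t == c = [(c, ts)]
lit _ _                 = []

(***) :: Parser a b -> Parser a c -> Parser a (b, c)  -- sequence
(p *** q) ts = [((x, y), r) | (x, s) <- p ts, (y, r) <- q s]

(|||) :: Parser a b -> Parser a b -> Parser a b       -- alternative
(p ||| q) ts = p ts ++ q ts

(...) :: Parser a b -> (b -> c) -> Parser a c         -- semantic action
(p ... f) ts = [(f x, r) | (x, r) <- p ts]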
The example shows that combinators require more code to be written manually
than in BNFC, where the grammar is enough. On the other hand, the run-time
code becomes smaller, because no bulky Happy-generated files are needed.
Moreover, combinators have more expressive power than BNFC. In par-
ticular, they can deal with ambiguity, because the Parser type is a list of
all parse results, not only one. They are not even limited to context-free lan-
guages; writing a parser for the copy language is left as an exercise. This extra
power is in practice often the main reason to use combinators rather than BNF
grammars.
As recursive descent parsers in general, the combinators loop with left-
recursive rules. Left recursion must thus be manually eliminated, as shown in
Section 3.6. LL(1) conflicts as such are not a problem. But without left fac-
toring, parsers that have to inspect many paths can become very slow, because
they have to do backtracking from unsuccessful paths. If such things hap-
pen, it can be very difficult to find the reason, because there are no automatic
diagnostic tools similar to the LALR(1) table construction of Happy.
Our conclusion about parser combinators is that they very closely reflect the
pros and cons of embedded languages listed at the beginning of this section.
Moreover, BNF grammars are an already established, well-understood lan-
guage, which has available implementations (e.g. in Happy and, as a front-
end to it, in BNFC). Therefore there is little reason to use parser combinators
except when the additional power is needed or as a test case for functional
programming techniques, which in fact is a very common use.
All this said, there are plenty of cases where an embedded language can
be the best choice. For instance, it would be overkill to define a separate
language for arithmetic expressions, because they are so well supported in
standard programming languages. There are many cases where a well-defined
library works almost like a language of its own—for instance, in C++, where
the Standard Template Library gives a high-level access to data structures
freeing application programmers from details such as memory management.
Also the problem of complexity guarantees, which we encountered in the
case of parser combinators, can be overcome by using types as a control mech-
anism. We could for instance have a similar set of combinators as in Figure
8.1, but with finite automata as target type:
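As a sketch of the idea, the combinators can build a datatype of regular expressions instead of a backtracking function; since every such value denotes a finite automaton, matching can then be implemented with a linear-time guarantee, and backtracking can never arise:

data Reg a
  = Eps                  -- the empty string
  | Sym a                -- one token
  | Seq (Reg a) (Reg a)  -- sequence
  | Alt (Reg a) (Reg a)  -- alternative

lit :: a -> Reg a
lit = Sym

(***) :: Reg a -> Reg a -> Reg a
(***) = Seq

(|||) :: Reg a -> Reg a -> Reg a
(|||) = Alt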
8.7 Case study: BNFC as a domain-specific language
As concrete evidence for the brevity of BNFC, compare the size of the BNFC
source code of the language implemented in Section 2.10 (CPP.cf) with the
size of generated code in some target languages. For comparison, we also take
the raw C++ code, that is, the code generated by Bison and Flex. The figures
show that Bison and Flex just about halve the total size of the code needed to
be written, whereas BNFC shrinks it by almost two orders of magnitude!
Of course, C++ and other programmers can write shorter code than this by
hand, and they might not need all of the components generated by BNFC (for
instance, the documentation). Moreover, BNFC might not be powerful enough,
for instance, if the language to be implemented is not context-free. But even
for such projects, it can be a good starting point to write an approximative
BNF grammar and let BNFC produce the boilerplate code that serves as a
starting point.
In Section 8.4, we summarized a set of language design questions. Let us
look at the decisions made in the case of BNFC:
• Statically checked? Yes, some consistency checks are made before gener-
ating the host language code. But more checks would be desirable, for
instance, the check for LALR conflicts.
The most important lesson from BNFC is perhaps that its declarativity makes it
at once succinct, portable, and predictable. Using Happy, CUP, or Bison directly
would be none of these, because any host language code can be inserted in the
semantic actions.
The things that are not so good in BNFC concern the implementation more than the language itself:
Static checking should be closer to the BNFC source. We already mentioned the checking of conflicts. Another known issue is the use of identifiers. Using the same identifier both as a category and as a constructor is legal in Haskell but not in Java. BNFC issues only a warning about this, but if the warning is ignored, hard-to-understand Java errors may arise.
The maintenance of the BNFC code base is made difficult by the multitude
of different host languages. The standard lexer and parser tools are not always
backward-compatible, and since updating BNFC lags behind, users may need
to use older versions of these tools. Maintenance is further complicated by
the fact that the source code of BNFC has become a mess and is hard to
understand.
Just like in Assignment 1, you should compile your grammar often and rerun it on the example set. In this process, it may happen that you have to change your design. For instance, it may turn out that some constructs of your language were actually ambiguous; maybe you did not see this at first sight, but the conflicts found in LALR(1) table construction revealed it. If you find real ambiguities in this way, it is advisable to eliminate them by changing the grammar.
One more advantage of BNFC is that you can change your implementation language. There have been projects where a first prototype implementation was built in Haskell and a later production version in C. The BNF grammar remained unchanged; the programmers only needed to run bnfc with the -c option to create the new parser and abstract syntax.
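With a hypothetical grammar file Lang.cf, the switch is a single command:

bnfc -c Lang.cf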
It is also possible to combine implementation languages, because they can
communicate via the parser and the pretty-printer. Thus one can parse the code in Haskell, do the type checking in Haskell, pretty-print the type-annotated trees, and parse the resulting code in a C program that performs the rest of
the compilation. This property also makes BNFC usable for defining data for-
mats: you can define a structured data representation that permits exchange
of data between programs written in different languages. Sometimes such a
representation is a good alternative to more standard formats such as XML,
which is verbose and difficult for humans to read.
In code generation, it may make sense to create a BNFC grammar for the
target language as well—even for such a simple language as JVM assembler.
This has two advantages:
• code generation produces abstract syntax trees and can ignore some de-
tails of the target code concrete syntax;
• target code trees can be processed further, for instance, by optimizations.
Using BNFC also has a price: it forces you to restrict your language to what we like to boldly call well-behaved languages. Such languages are characterized by three main restrictions.
Surprisingly many legacy languages have constructs that are not “well-behaved”. For instance, Haskell violates all three restrictions. Java and C are largely well-behaved, and actually have BNFC grammars available on the BNFC web page. Full C++ requires a much more powerful parser than LR(k) for any k. But
often the features that go beyond the “well-behaved” languages can be avoided
without any penalty for usability.
In Haskell, layout syntax is a feature that violates all three of the requirements. Section A.8 in Appendix A explains how a restricted form of layout
syntax can be implemented in BNFC by using preprocessing. Opinions are
split about the utility of layout syntax: many programmers think it is a nui-
sance in Haskell. At the same time, layout syntax has gained new popularity
in Python.
One peculiarity of layout syntax is that it breaks the fundamental property of alpha convertibility, which says that changing a variable name (in all occurrences) does not change the program behaviour. A counterexample in Haskell is a definition of the following shape (reconstructed here to match the explanation below):
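eval e = case e of EMul x y -> eval x * eval y
                   EAdd x y -> eval x + eval y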
If you rename e to exp, the code gets a syntax error, because the branch EAdd no longer starts in the same column as EMul.
the documents say. But they do not reach publishing quality, which would
require an accurate and flawless rendering of the original.
Another important task of natural language processing is human-computer
interaction (HCI), where attempts are made to replace the use of computer
languages by human language. This language is often very restricted, for in-
stance, voice commands in a car enabling the use of the music player and the
navigation system. Unlike machine translation, such systems only have to deal
with small parts of language. But they have to do it with precision. The natu-
ral choice of techniques is hence similar to compilation: formal grammars and
semantic analysis.
-- general part
-- specific part
This grammar is not yet quite what we want; for instance, it says which number is prime even though we expect many numbers as an answer, so which numbers are prime would be more adequate. There is also a plain error: all properties are placed before kinds. This is right for even number, but wrong for greater than 3 number, which should be number greater than 3. Of course, we could solve both issues with more categories and rules, but this would clutter the abstract syntax with semantically irrelevant distinctions, such as singular and plural kinds and pre- and postfix properties.
abstract Arithm = {
cat Exp ;
fun EInt : Int -> Exp ;
fun EMul : Exp -> Exp -> Exp ;
}
concrete ArithmJava of Arithm = {
lincat Exp = Str ;
lin EInt i = i.s ;
lin EMul x y = x ++ "*" ++ y ;
}
concrete ArithmJVM of Arithm = {
lincat Exp = Str ;
lin EInt i = "ldc" ++ i.s ;
lin EMul x y = x ++ y ++ "imul" ;
}
This grammar has three GF modules: one abstract syntax module, Arithm, and two concrete syntax modules, ArithmJava and ArithmJVM. If you save the modules in the files Arithm.gf, ArithmJava.gf and ArithmJVM.gf, you can translate Java expressions to JVM expressions, and also vice versa. You first start the gf interpreter with the shell command

gf ArithmJava.gf ArithmJVM.gf

In the shell that opens, you use a pipe to parse from Java and linearize to JVM:
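A session has roughly this shape (the output lines are omitted here):

Arithm> parse -lang=ArithmJava "7 * 12 * 9" | linearize -lang=ArithmJVM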
Notice that the Java grammar is ambiguous: 7 * 12 * 9 has two parse trees. GF returns them both and produces two JVM expressions. Allowing ambiguity is one of the first things a natural language grammar has to do. In this case, however, we would rather eliminate the ambiguity by using precedences, as in Section 2.4.
So let us look at how GF works. The main idea is to separate abstract and concrete syntax. In BNF, these two aspects are expressed together. Thus the BNF rule

EMul. Exp ::= Exp "*" Exp ;

says two things:

• a multiplication expression is formed by writing the token * between two expressions (concrete syntax);
• EMul is a tree-building function that takes two Exp trees and forms an Exp tree (abstract syntax).
In GF, these two aspects are expressed by two different rules: a fun (function) rule and a lin (linearization) rule:

fun EMul : Exp -> Exp -> Exp ;
lin EMul x y = x ++ "*" ++ y ;
The rules are put into separate modules, marked as abstract and concrete.
This makes it possible to combine one abstract with several concretes, as we
did above. Consequently, we can translate by parsing a string in one concrete,
and linearizing it with another. Notice the concatenation symbol ++, which is
needed between tokens; in GF, just writing x "*" y would mean the applica-
tion of the function x to the arguments "*" and y.
In a BNF grammar, the set of categories is implicit in the sense that there
are no separate rules telling what categories there are, but they are collected
from the grammar rules. In GF, categories must be introduced explicitly, by
cat rules in the abstract and lincat rules in the concrete. The only excep-
tion is a handful of predefined categories, such as Int and Float.
The lincat rule specifies the linearization type of a category. In the
above grammar, it is just Str for Exp in both Java and JVM. But this can be
enriched to records, tables, and parameters. Here is a better grammar for Java
expressions, taking care of precedence with a parameter:
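The following sketch is consistent with the description below; the module name ArithmJava2 is ours, while Prec, the record fields s and p, and the helper parenth are as described in the text:

concrete ArithmJava2 of Arithm = {
  param Prec = P0 | P1 ;
  lincat Exp = {s : Str ; p : Prec} ;
  lin EInt i = {s = i.s ; p = P1} ;
  lin EMul x y = {
    s = x.s ++ "*" ++ case y.p of {
          P0 => parenth y.s ;
          P1 => y.s
          } ;
    p = P0
    } ;
  oper parenth : Str -> Str = \s -> "(" ++ s ++ ")" ;
}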
The grammar defines a parameter type (param) called Prec, with two values
representing precedence levels 0 and 1. The linearization type of Exp is a
record type, with a field s for a string and p for a precedence. Integer literals
have the higher precedence, and multiplications the lower one. Notice that the
built-in type Int also has a record as linearization type, which is shown by the
term i.s giving the string field of the record.
The interesting work is done in the linearization rule of EMul. There the
second operand gets parentheses if it is an expression on the lower level. The
first operand does not, because multiplication is left-associative. The choice is
made with a case expression, similar to case expressions in Haskell. Paren-
theses are added with the function parenth, which is defined as an auxiliary
operation (oper).
When parsed with this modified grammar, Java expressions get unique parse
trees. The JVM concrete syntax does not need changes. In fact, much of
the power of GF comes from the ability to use different linearization types in
different languages. In natural languages, parameters are linguistic features
such as number, gender, and case. These features work in very different ways
depending on language.
Abstract syntax
This is the basic query grammar, defining the type system and the forms of
queries. Each function is followed by a comment that shows an example that
has that structure. Notice the flag startcat setting the default start category.
Also notice that the keywords cat and fun need not be repeated in groups of
definitions of the same type.
abstract Query = {
flags startcat = Query ;
cat
Query ;
Kind ;
Property ;
Term ;
fun
QWhich : Kind -> Property -> Query ; -- which numbers are prime
QWhether : Term -> Property -> Query ; -- is any number prime
TAll : Kind -> Term ; -- all numbers
TAny : Kind -> Term ; -- any number
PAnd : Property -> Property -> Property ; -- even and prime
POr : Property -> Property -> Property ; -- even or odd
PNot : Property -> Property ; -- not prime
KProperty : Property -> Kind -> Kind ; -- even number
}
The MathQuery module inherits all categories and functions of Query and adds some of its own, along the lines of the sketch below.
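A plausible shape for this module (the functions listed are illustrative, not the book's exact set):

abstract MathQuery = Query ** {
  fun
    KNumber : Kind ;              -- number
    PEven : Property ;            -- even
    POdd : Property ;             -- odd
    PPrime : Property ;           -- prime
    PGreater : Int -> Property ;  -- greater than 3
}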
English
The BNF grammar for queries in Section 8.10 had two problems, which are
easy to solve in GF by parameters. For the different forms of kind expressions,
we use a parameter Number (in the grammatical sense), with values for the
singular (Sg) and the plural (Pl). The linearization type of Kind is a table
type, Number => Str, which is similar to inflection tables in grammars: it
gives a string value to each parameter of type Number. The linearization of
KNumber below is an example of a table. The selection operator (!) picks
values from tables, as exemplified in QWhich, TAll, and TAny. The placement
of properties is likewise controlled by a parameter type, Fix (prefix or postfix).
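A sketch of the English concrete syntax along these lines (the wording rules and the placement policy for conjunctions are our reconstructions, not the book's listing):

concrete QueryEng of Query = {
  param
    Number = Sg | Pl ;
    Fix = Pre | Post ;
  lincat
    Query, Term = Str ;
    Kind = Number => Str ;
    Property = {s : Str ; p : Fix} ;
  lin
    QWhich kind prop = "which" ++ kind ! Pl ++ "are" ++ prop.s ;
    QWhether term prop = "is" ++ term ++ prop.s ;
    TAll kind = "all" ++ kind ! Pl ;
    TAny kind = "any" ++ kind ! Sg ;
    PAnd p q = {s = p.s ++ "and" ++ q.s ;
                p = case <p.p, q.p> of {<Pre,Pre> => Pre ; _ => Post}} ;
    POr p q = {s = p.s ++ "or" ++ q.s ;
               p = case <p.p, q.p> of {<Pre,Pre> => Pre ; _ => Post}} ;
    PNot prop = {s = "not" ++ prop.s ; p = prop.p} ;
    KProperty prop kind = \\n => case prop.p of {
      Pre  => prop.s ++ kind ! n ;
      Post => kind ! n ++ prop.s
      } ;
}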
Exercise 8-11. Apply the query language to some other domain than mathe-
matics.
Exercise 8-12. Extend the query language with new question forms, such as
where and when questions, which may be appropriate for other domains than
mathematics.
Exercise 8-13.+ Port the query language to some other language than English,
without changing its abstract syntax.
Compiling to Haskell
The concrete syntax in GF is actually also an example of denotational seman-
tics! It is an interpretation of syntax trees as strings, records, and tables. We
can use this idea to give an implementation of the query language as translation
to Haskell code. In Haskell, we will use lists rather than sets as denotations of
kinds. But otherwise the translation is very much the same as the denotational
semantics.
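As a sketch of the idea (ours, not the book's listing; the bound 1000 is the one mentioned below), the Haskell denotations could look as follows:

type Kind     = [Integer]         -- kinds denote lists of numbers
type Property = Integer -> Bool   -- properties denote predicates

number :: Kind
number = [0..1000]                -- upper bound to prevent infinite search

prime :: Property
prime n = n > 1 && all (\d -> n `mod` d /= 0) [2 .. n - 1]

-- "which numbers are prime" then translates to: filter prime number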
Of course, we could make the generated Haskell code nicer by using precedences to eliminate some parentheses, as shown in Section 8.11. But this is not so important, because we use Haskell only internally, as a question answering engine. We set an upper bound of 1000 for numbers to prevent infinite search. Here is an example of a query and its translation as obtained in the GF shell:
We can thus translate queries from English to Haskell in GF. As the simplest possible end-user interface, we can write a shell script query, which pipes the English query to GF; GF produces a Haskell translation, which is then sent to the GHC Haskell compiler. The flag -e makes GHC work as an expression interpreter. The GF command pt -number=1 makes sure that just one expression is sent to Haskell, if the parse happens to be ambiguous.
#!/bin/bash
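# A sketch of the script body; the grammar file names QueryEng.gf and
# QueryHs.gf are assumptions. GF parses the query given as $1,
# pt -number=1 keeps a single tree, linearization to the Haskell
# concrete syntax yields an expression, and ghc -e evaluates it.
ghc -e "$(echo "p -lang=QueryEng \"$1\" | pt -number=1 | l -lang=QueryHs" \
  | gf --run QueryEng.gf QueryHs.gf)"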
./query "which numbers greater than 100 and smaller than 150 are prime"
[101,103,107,109,113,127,131,137,139,149]
lincat Kind = CN ;
lincat Property = A ;
lin KProperty prop kind = mkCN prop kind ;
The same definitions work for all of the currently 24 languages of the library, although for instance the order of the adjective and the noun can vary in the concrete syntax (even number in English becomes nombre pair in French). Differences of course also appear on the level of words, where mkA in each language produces an adjective inflection table:
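For instance, with a hypothetical function PEven from a MathQuery-style module:

lin PEven = mkA "even" ;   -- English
lin PEven = mkA "pair" ;   -- French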
English needs only one form of adjectives, but French has 4 and Finnish over
30.
GF makes it possible to generate and parse natural languages, whenever the sentences involved are within the scope of a grammar. What is out of reach, however, is parsing a natural language as a whole. The problem is that, in
natural language, the grammar is not something that can be given once and for
all. This is in contrast to programming languages, which are defined by their
grammars. For natural language, the grammar is more like an open research
problem, and the language can change at any time.
Nevertheless, it is instructive to compare natural language processing with compilation. Many compilation phases have counterparts in machine translation.
Lexical analysis, parsing, and generation are in both cases derived from gram-
mars, even though grammars are typically much harder to write for natural
languages than for programming languages. But what about semantic analy-
sis?
In compilers, semantic analysis usually requires more work than the other
phases. There are two reasons for this. First, the grammars are usually easy,
since computer languages are simple. Secondly, there is often a considerable
semantic gap between the source and target languages, which requires sub-
stantial analysis and maybe restructuring the tree. For instance, type informa-
tion may need to be added, and variables may need to be converted to memory
addresses or registers.
In natural language translation, writing the grammars can itself be a substantial task. On the other hand, the semantic gap between natural languages is typically not as wide as between high-level programming languages and machine languages. This is illustrated by the GF resource grammar library, which implements the same structures for many languages. As long as these structures are preserved in translation, all that is needed is to select proper words in the target language to fill in the structures.
However, when parsing natural language, semantic analysis problems due
to ambiguity soon arise. A typical example is word sense disambigua-
tion: one word may have several possible translations, corresponding to dif-
ferent meanings of the word. For instance, the English word drug is in French médicament (medical drug) or drogue (narcotic drug). In a sentence about drugs used against malaria, the proper translation is in most cases médicament,
because substances used against malaria are medical drugs, not narcotic drugs.
Notice the similarity of this analysis to overload resolution in compilers (Sec-
tion 4.6): to translate Java’s + into JVM, it is necessary to find out the types
of its operands, and then select either iadd or dadd or string concatenation.
Ambiguity extends from words to syntactic structures. Consider the sen-
tences
I ate a pizza with shrimps
I ate a pizza with friends
The phrases with shrimps and with friends can attach to the noun pizza but also to the whole act of eating. The preposition chosen in the French translation depends on this, and so indeed does the meaning of the sentence. Again, by
using our knowledge about what things can go together, we can guess that the
following analyses are correct:
I ate a (pizza with shrimps)
(I ate a pizza) with friends
Syntactic ambiguity is a problem that programming languages mostly avoid
by design. But we have already seen an exception: the dangling else problem
(Section 3.8). It is a real ambiguity in the grammar, but it is avoided by an
ad hoc rule forcing one of the parses. Another example is the type cast syntax
of C++, which creates an ambiguity between function declarations and object
declarations with casts:
char *aName(String(s));
Appendix A
BNFC Quick Reference

http://bnfc.digitalgrammars.com
For instance, the rule

SWhile. Stm ::= "while" "(" Exp ")" Stm ;

has the label SWhile and forms trees of the form (SWhile e s), where e is a
tree for Exp and s a tree for Stm.
More formally, an LBNF grammar consists of a collection of rules of the following form (expressed as a regular expression; Section A.9 gives a complete BNF definition of the notation):
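Ident . Ident ::= (Ident | String)* ;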
The first identifier is the rule label, followed by the value category. On the right-hand side of the production arrow (::=) is the list of production items. An item is either a quoted string (terminal) or a category symbol (nonterminal). The right-hand side of a rule whose value category is C is called a production for C.
Identifiers, that is, rule names and category symbols, can be chosen ad libitum, with the restrictions imposed by the target language. To satisfy Haskell, and C and Java as well, rule labels and categories should be identifiers that begin with a capital letter (lower-case labels are reserved for defined functions, Section A.7).
Additional features
Additional features
Basic LBNF as defined above is clearly sufficient for defining any context-
free language. However, it is not always convenient to define a programming
language purely with BNF rules. Therefore, some additional features are added
to LBNF: abstract syntax conventions, lexer rules, pragmas, and macros. These
features are treated in the subsequent sections.
Abstract syntax conventions. Creating an abstract syntax by adding
a node type for every BNF rule may sometimes become too detailed, or clut-
tered with extra structural levels. To remedy this, we have identified the most
common problem cases, and added to LBNF some extra conventions to handle
them.
Lexer rules. Some aspects of a language belong to its lexical structure
rather than its grammar, and are more naturally described by regular expres-
sions than by BNF rules. We have therefore added to LBNF two rule formats
to define the lexical structure: tokens and comments.
Pragmas. Pragmas are rules instructing the BNFC grammar compiler to
treat some rules of the grammar in certain special ways: to reduce the number
of entrypoints or to treat some syntactic forms as internal only.
Macros. Macros are syntactic sugar for potentially large groups of rules
and help to write grammars concisely. This is both for the writer’s and the
reader’s convenience; among other things, macros naturally force certain groups
of rules to go together, which could otherwise be spread arbitrarily in the
grammar.
Layout syntax. This is a non-context-free feature present in some pro-
gramming languages, such as Haskell and Python. LBNF has a set of rule
formats for defining a limited form of layout syntax. It works as a preprocessor
that translates layout syntax into explicit structure markers.
Semantic definitions. Some labels can be excluded from the final abstract
syntax by rules that define them in terms of other labels.
Semantic dummies
Sometimes the concrete syntax of a language includes rules that make no se-
mantic difference. An example is a BNF rule making the parser accept extra
semicolons after statements:
Stm ::= Stm ";" ;
As this rule is semantically a dummy, we do not want to represent it by a constructor in the abstract syntax. Instead, we introduce the following convention:

A rule label can be an underscore (_), which does not add anything to the syntax tree.
Thus we can write the following rule in LBNF:
_ . Stm ::= Stm ";" ;
Underscores are of course only meaningful as replacements of one-argument
constructors where the value type is the same as the argument type. Semantic
dummies leave no trace in the pretty-printer. Thus, for instance, the pretty-
printer “normalizes away” extra semicolons.
Precedence levels
A common idiom in (ordinary) BNF is to use indexed variants of categories to
express precedence levels:
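For instance (reconstructed here to match the labelled version and the data type shown below):

Exp  ::= Exp  "+" Exp1
Exp1 ::= Exp1 "*" Exp2
Exp2 ::= Integer
Exp  ::= Exp1
Exp1 ::= Exp2
Exp2 ::= "(" Exp ")"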
Thus Exp1 and Exp2 are indexed variants of Exp. The plain Exp is a synonym
of Exp0.
Transitions between indexed variants are semantically dummy, and we do
not want to represent them by constructors in the abstract syntax. To do this,
we extend the use of underscores to indexed variants. The example grammar
above can now be labelled as follows:
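EPlus.  Exp  ::= Exp  "+" Exp1 ;
ETimes. Exp1 ::= Exp1 "*" Exp2 ;
EInt.   Exp2 ::= Integer ;
_.      Exp  ::= Exp1 ;
_.      Exp1 ::= Exp2 ;
_.      Exp2 ::= "(" Exp ")" ;

In the abstract syntax, the indexed variants are merged into a single type, so that in Haskell we obtain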
data Exp = EInt Integer | ETimes Exp Exp | EPlus Exp Exp
Indexed categories can be used for purposes other than precedence, since the only thing that is formally checked is the type skeleton (see Section A.3). The parser does not need to know that the indices mean precedence, only that indexed variants have values of the same type. The pretty-printer, however, assumes that indexed categories are used for precedence, and may produce strange results if they are used in some other way.
Polymorphic lists
It is easy to define monomorphic list types in LBNF:
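For instance, with rules of this shape:

NilStm.  ListStm ::= ;
ConsStm. ListStm ::= Stm ";" ListStm ;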
However, compiler writers in languages like Haskell may want to use the predefined polymorphic lists, because of the language support for these constructs. LBNF permits the use of Haskell's list constructors as labels, and list brackets in category names:
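[].  [Stm] ::= ;
(:). [Stm] ::= Stm ";" [Stm] ;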
Then the smart programmer would also be careful to reverse the list when it
is used as an argument of another rule construction.
The BNF Converter automatically performs the left-recursion transforma-
tion for pairs of rules of the form
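[].  [C] ::= ;
(:). [C] ::= C x [C] ;

where C is any category and x is any sequence of terminals (possibly empty).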
The second-last rule corresponds to the absence of empty data types in Haskell. The last rule could be strengthened so as to require that all regular rule labels be unique; this is needed to guarantee error-free pretty-printing. Violating the strengthened rule currently generates only a warning, not a type error.
["abc7%"] denotes the union of the characters ’a’ ’b’ ’c’ ’7’ ’%’
{"abc7%"} denotes the sequence of the characters ’a’ ’b’ ’c’ ’7’ ’%’
The atomic expressions upper, lower, letter, and digit denote the character
classes suggested by their names (letters are isolatin1). The expression char
matches any character in the 8-bit ASCII range, and the “epsilon” expression
eps matches the empty string. Thus eps is equivalent to {""}, whereas the
empty language is expressed by [""].
Note. The empty language is not available for the Java lexer tool JLex.
A token type can also be declared as a position token; its representation then includes a pair of integers indicating the line and column of the first character of the token. The pretty-printer omits the position component.
comment "//" ;
comment "/*" "*/" ;
A.5 Pragmas
Internal pragmas
Sometimes we want to include in the abstract syntax structures that are not
part of the concrete syntax, and hence not parsable. They can be, for instance,
syntax trees that are produced by a type-annotating type checker. Even though
they are not parsable, we may want to pretty-print them, for instance, in the
type checker’s error messages. To define such an internal constructor, we use a pragma consisting of the keyword internal followed by an ordinary rule, for instance (with a hypothetical rule for type-annotated expressions):
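internal ETyped. Exp ::= "(" Exp ":" Type ")" ;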
For instance, the following pragma defines Stm and Exp to be the only entry
points:
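entrypoints Stm, Exp ;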
A.6 Macros
Terminators and separators
The terminator macro defines a pair of list rules by what token terminates
each element in the list. For instance,
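terminator Stm ";" ;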
tells that each statement (Stm) in a list [Stm] is terminated with a semicolon
(;). It is a shorthand for the pair of rules
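[].  [Stm] ::= ;
(:). [Stm] ::= Stm ";" [Stm] ;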
The qualifier nonempty in the macro makes one-element lists the base case.
Thus
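terminator nonempty Stm ";" ;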
is shorthand for
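(:[]). [Stm] ::= Stm ";" ;
(:).   [Stm] ::= Stm ";" [Stm] ;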
The terminator can be specified as empty "". No token is introduced then, but the list rules are otherwise generated in the same way.
Coercions
The coercions macro is a shorthand for a group of rules translating between
precedence levels. For instance,
coercions Exp 3 ;
is shorthand for
_. Exp ::= Exp1 ;
_. Exp1 ::= Exp2 ;
_. Exp2 ::= Exp3 ;
_. Exp3 ::= "(" Exp ")" ;
Because of the total coverage of these coercions, it does not matter if the integer
indicating the highest level (here 3) is bigger than the highest level actually
occurring, or if there are some other levels without productions in the grammar.
Unlabelled rules
The rules macro is a shorthand for a set of rules from which labels are gener-
ated automatically. For instance,
rules Type ::= Type "[" Integer "]" | "float" | "int" | Type "*" ;
is shorthand for
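Type1.      Type ::= Type "[" Integer "]" ;
Type_float. Type ::= "float" ;
Type_int.   Type ::= "int" ;
Type2.      Type ::= Type "*" ;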
The labels are created automatically. A label starts with the value category name. If the production has just one item, which is moreover possible as a part of an identifier, that item is used as a suffix. In other cases, an integer suffix is used. No global checks are performed when generating these labels. Any resulting label name clashes are caught by BNFC's type checking of the generated rules.
Notice that, using the rules macro, it is possible to define an LBNF gram-
mar without giving any labels. To guarantee the uniqueness of labels, the
productions of each category should then be grouped together.
We now want to add some syntactic sugar. Note that the labels for these rules all start with a lowercase letter, indicating that they correspond to defined functions rather than to nodes in the abstract syntax tree.
Functions are defined using the define keyword. Definitions have the form
define f x1 ... xn = e
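For instance, a hypothetical unary minus, assuming rules with labels ESub and EInt exist in the grammar:

neg. Exp2 ::= "-" Exp2 ;
define neg x = ESub (EInt 0) x ;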
Another use of defined functions is to simplify the abstract syntax for binary
operators. Instead of one node for each operator one can have a general node
(EOp) for all binary operator applications.
_. Op ::= Op1;
_. Op ::= Op2;
Precedence levels can be used to make sure that the pretty-printer prints enough parentheses.
to handle layout, they can use the Haskell layout resolver as a preprocessor to their front end, before the lexer. In Haskell, the layout resolver appears, automatically, in its most natural place, which is between the lexer and the parser. The layout pragmas of BNFC are not powerful enough to handle the full layout rule of Haskell 98, but they suffice for the “regular” cases.
Here is an example, found in the grammar of the logical framework Alfa (a predecessor of Agda):
layout "of", "let", "where", "sig", "struct" ;
The first line says that "of", "let", "where", "sig", "struct" are layout
words, i.e. start a layout list. A layout list is a list of expressions normally
enclosed in curly brackets and separated by semicolons, as shown by the Alfa
example
ECase. Exp ::= "case" Exp "of" "{" [Branch] "}" ;
c :: Nat = case x of {
True -> b
;False -> case y of {
False -> b
};Neither -> d
There are two more layout-related pragmas. The layout stop pragma, as in
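layout stop "in" ;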
tells the resolver that the layout list can be exited with some stop words, like
in, which exits a let list. It is not an error in the resolver to exit some other kind of layout list with in, but an error will then show up in the parser.
The
layout toplevel ;
pragma tells that the whole source file is a layout list, even though no layout
word indicates this. The position is the first column, and the resolver adds a
semicolon after every paragraph whose first token is at this position. No curly
brackets are added. The Alfa file above is an example of this, with two such
semicolons added.
To make layout resolution a stand-alone program, e.g. to serve as a preprocessor, the programmer can modify the BNFC-generated file (LayoutX.hs for the language X) and either compile it or run it in the GHCi interpreter.
Note. The generated layout resolver does not work correctly if a layout word
is the first token on a line.
Appendix B
Some JVM Instructions
These tables contain all instructions used in Chapters 5 and 6, and some others that can be useful as optimizations in Assignment 4. We use the dot (.) to separate values on the stack, and two-letter variables (dd, ee) to represent double values. The asterisk (*) in an explanation indicates that there is a longer explanation after the tables.
More explanations:
• dcmpg, dcmpl: the value left on the stack is 1 if the inequality holds, 0 if the values are equal, and -1 otherwise.
• dup2: the instruction duplicates the topmost two words, which can be one double value or two integer values.
• if_icmpeq, if_icmpne, if_icmplt, if_icmpge, if_icmpgt, if_icmple, corresponding to =, ≠, <, ≥, >, ≤, jump to the label if the comparison holds between the two topmost integer values on the stack.
• ifeq, ifne, iflt, ifge, ifgt, ifle, corresponding to =, ≠, <, ≥, >, ≤, jump to the label if the comparison holds between the top integer value on the stack and 0.
• ldc, ldc2_w: the constants pushed are stored in the constant pool, and the actual bytecode argument (after assembly) is a reference to this pool.
• pop2: the instruction pops the topmost two words, which can be one double value or two integer values.
Appendix C
Summary of the Assignments
The full specifications of the assignments can be found on the book web page,
together with supporting material such as test suites and code templates.
Appendix D
Further Reading
also known as the Dragon Book, is the classic compiler book. Its nickname comes from the cover picture, where a knight is fighting a dragon entitled “Complexity of Compiler Design”, with “LALR Parser Generation” as her (his?) sword and “Syntax-Directed Translation” as her shield. These concepts are central in the current book as well, but the 1009 pages of the Dragon Book also show how they work under the hood, and they cover advanced topics such as parallelism.
is a rich source on the earliest and most fundamental ideas, including Knuth's original paper on LR(k) parsing.
Chapter 2. Appel's book triple, mentioned under Chapter 1, was a direct inspiration for BNFC, showing how the same set of concepts can be implemented in different languages.
is the source of the code examples defining the fragment addressed in Assign-
ment 1. It is an unusual book on C++, focusing on the high-level aspects
of the language and the use of the Standard Template Library almost as an
embedded language.
Chapter 3. The Dragon and Tiger books (Chapter 1) cover the details needed
for implementing lexer and parser generators. But
is the classic book on formal language theory and its relation to computability.
It gives all details of the algorithms, including proofs of correctness.
Chapter 4.
explores the limits of what can be done with types, including specifications and
proofs of programs.
Chapter 5.
is the definitive source for JVM instructions, also giving hints on how to compile
to it.
explains the workings of JVM and a way to generate it by using the Jasmin
assembler.
is both brief and deep, and a major source on how machines work. It is free, and it comes with a set of tools that help to use the NASM assembler to actually produce code that runs on your computer platform.
may be making all other current ways to produce machine code obsolete.
Chapter 7.
is a thorough, yet easily readable article (for those who know Haskell, at least).
The type checker in Section 7.9 can be seen as a simplified version of the code
presented in this paper.
inspired a lot of the material in this chapter. For instance, it includes a discus-
sion of what programming languages will look like a hundred years from now
(i.e. from 2003).
gives a balanced view of the trade-offs faced when trying to raise the level of a language while at the same time keeping it efficient and attractive to mainstream programmers. Written by the creator of C++, it is honest about design flaws, many of which were detected when it was too late to fix them, because so many people had started using the flawed constructs in their code.
is a book about GF, which was used for implementing the query language
in English and Haskell. GF and BNFC share many ideas. Their common
mission is to make language implementation more accessible by the use of code
generation from grammars. The GF book and the current book were partly
written in parallel.
Index
tail recursion elimination, 120
target code optimization, 13
template system, 138
terminal, 20, 176
terminator, 29, 184
theory-based practical approach, ix
token, 8, 177, 182
token type, 31
top (stack machine), 3, 90
transition (automaton), 39
transition (operational semantics), 91
transition function, 42
transitive closure, 92
translation, xii, 4, 5, 7
tuple, 126
Turing Machine, 146
Turing-completeness, 146
type annotation, 58, 69
type checking, 8, 59
type code, 77
type conversion, 63
type error, 10
type inference, 60, 140
type system, 58, 152
type variable, 138
unconditional jump, 91
unification, 140, 142
union (regular expression), 41
XML, 160
Y combinator, 148
Yacc, xi, 37, 50, 53, 152, 157