Compiler Construction Notes
30 or so characters, from a single line of source code, are first transformed by lexical analysis into a sequence of 7
tokens. Those tokens are then used to build a tree of height 4 during syntax analysis. Semantic analysis may
transform the tree into one of height 5 that includes the type conversion necessary for real addition on an integer
operand. Intermediate code generation uses a simple traversal algorithm to linearize the tree back into a sequence
of machine-independent three-address-code instructions.
t1 = inttoreal(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Optimization of the intermediate code allows the four instructions to be reduced to two machine-independent
instructions. Final code generation might implement these two instructions using 5 machine instructions, in which
the actual registers and addressing modes of the CPU are utilized.
MOVF  id3, R2
MULF  #60.0, R2
MOVF  id2, R1
ADDF  R2, R1
MOVF  R1, id1
Regular Expressions
The notation we use to precisely capture all the variations that a given category of token may take is called a
"regular expression" (or, less formally, a "pattern"; the word "pattern" is vague, and there are many other
notations for patterns besides regular expressions). Regular expressions are a shorthand notation for sets of
strings. In order to even talk about "strings" you have to first define an alphabet, the set of characters which can
appear.
1. Epsilon is a regular expression denoting the set containing the empty string
2. Any letter in the alphabet is also a regular expression denoting the set containing a one-letter string
consisting of that letter.
3. For regular expressions r and s,
r|s
is a regular expression denoting the union of r and s
4. For regular expressions r and s,
rs
is a regular expression denoting the set of strings consisting of a member of r followed by a member of s
5. For regular expression r,
r*
is a regular expression denoting the set of strings consisting of zero or more occurrences of r.
6. You can parenthesize a regular expression to specify operator precedence (otherwise, alternation is like
plus, concatenation is like times, and closure is like exponentiation)
Although these operators are sufficient to describe all regular languages, in practice everybody uses extensions:
For regular expression r,
r+
is a regular expression denoting the set of strings consisting of one or more occurrences of r. Equivalent to
rr*
For regular expression r,
r?
is a regular expression denoting the set of strings consisting of zero or one occurrence of r. Equivalent to
r|epsilon
The notation [abc] is short for a|b|c. [a-z] is short for a|b|...|z
Finite Automata
A finite automaton is an abstract, mathematical machine, also known as a finite state machine, with the following
components:
1. A set of states S
2. A set of input symbols E (the alphabet)
3. A transition function move(state, symbol) : new state(s)
4. A start state S0
5. A set of final states F
For a deterministic finite automaton (DFA), the function move(state, symbol) goes to at most one state, and
symbol is never epsilon.
Finite automata correspond in a 1:1 relationship to transition diagrams; from any transition diagram one can write
down the formal automaton in terms of items #1-#5 above, and vice versa. To draw the transition diagram for a
finite automaton:
draw a circle for each state s in S; put a label inside the circles to identify each state by number or name
draw an arrow between Si and Sj, labeled with x whenever the transition says to move(Si, x) : Sj
draw a "wedgie" into the start state S0 to identify it
draw a second circle inside each of the final states in F
DFA Implementation
The nice part about DFAs is that they can be efficiently implemented on computers. What DFA does the following
code correspond to? What is the corresponding regular expression? You can speed this code fragment up even
further if you are willing to use gotos or write it in assembler.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int state = 0;                 /* start state S0 */
    int input = getchar();
    for (;;)
        switch (state) {
        case 0:
            switch (input) {
            case 'a': state = 1; input = getchar(); break;
            case 'b': input = getchar(); break;
            default: printf("dfa error\n"); exit(1);
            }
            break;                 /* back around the loop */
        case 1:
            switch (input) {
            case EOF: printf("accept\n"); exit(0);
            default: printf("dfa error\n"); exit(1);
            }
        }
}
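The same machine can also be implemented with a table. A sketch, assuming the two-state machine from the
switch code above (the 256-entry rows are indexed directly by the input character):

#include <stdio.h>
#include <stdlib.h>

#define NSTATES 2
#define ERR -1

int dfa[NSTATES][256];              /* dfa[s][c] = next state, or ERR */
int is_final[NSTATES] = { 0, 1 };   /* state 1 is the accepting state */

int main(void)
{
    int s, c;
    for (s = 0; s < NSTATES; s++)   /* by default, every move is an error */
        for (c = 0; c < 256; c++)
            dfa[s][c] = ERR;
    dfa[0]['a'] = 1;                /* the two real transitions */
    dfa[0]['b'] = 0;

    s = 0;                          /* start state S0 */
    while ((c = getchar()) != EOF) {
        s = dfa[s][c];
        if (s == ERR) { printf("dfa error\n"); exit(1); }
    }
    printf(is_final[s] ? "accept\n" : "dfa error\n");
    return 0;
}

The table-driven form trades a little memory for an inner loop that stays the same few instructions no matter how
many states the machine has.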
Regular expressions can be converted automatically to NFAs (Thompson's construction):
1. For epsilon, draw a start state and a final state connected by an epsilon transition.
2. For a letter x in the alphabet, draw a start state and a final state connected by a transition labeled x.
3. For regular expressions r and s, draw r|s by adding a new start state with epsilon transitions to the start
states of r and s, and a new final state with epsilon transitions from the final states of r and s.
4. For regular expressions r and s, draw rs by adding epsilon transitions from the final states of r to the start
state of s.
5. For regular expression r, draw r* by adding new start and final states, and epsilon transitions (a) from the
start state to the final state, (b) from the final state back to the start state, (c) from the new start to the old
start and from the old final states to the new final state.
6. For a parenthesized regular expression (r) you can use the NFA for r.
NFA's can be converted automatically to DFA's
In: NFA N
Out: DFA D
Method: Construct transition table Dtran (a.k.a. the "move function"). Each DFA state is a set of NFA states.
Dtran simulates in parallel all possible moves N can make on a given string.
Operations to keep track of sets of NFA states:
e_closure(s)
set of states reachable from state s via epsilon
e_closure(T)
set of states reachable from any state in set T via epsilon
move(T,a)
set of states to which there is an NFA transition from states in T on symbol a
Algorithm:
Dstates := {e_closure(start_state)}
while T := unmarked_member(Dstates) do {
mark(T)
for each input symbol a do {
U := e_closure(move(T,a))
if not member(Dstates, U) then
insert(Dstates, U)
Dtran[T,a] := U
}
}
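For small NFAs, a set of states fits in a single machine word, which makes these set operations cheap. A minimal
sketch of e_closure under that assumption (the example eps[] table describes a made-up 4-state NFA):

#include <stdio.h>

#define NSTATES 4
/* eps[s]: bitmask of states reachable from s by one epsilon transition */
unsigned eps[NSTATES] = { 0x2, 0x0, 0x9, 0x0 };

/* e_closure(T): all states reachable from the set T via epsilon moves */
unsigned e_closure(unsigned T)
{
    unsigned result = T, old;
    do {                            /* iterate to a fixed point */
        old = result;
        for (int s = 0; s < NSTATES; s++)
            if (result & (1u << s))
                result |= eps[s];
    } while (result != old);
    return result;
}

int main(void)
{
    printf("e_closure({s0}) = %#x\n", e_closure(0x1));  /* prints 0x3 */
    return 0;
}

move(T,a) can be implemented the same way, as one pass over a per-symbol transition table, and each Dtran key
is then just a word-sized set.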
lex(1)
and flex(1)
These programs generally take a lexical specification given in a .l file and create a corresponding C language
lexical analyzer in a file named lex.yy.c. The lexical analyzer is then linked with the rest of your compiler.
The C code generated by lex has the following public interface. Note the use of global variables instead of
parameters, and the use of the prefix yy to distinguish scanner names from your program names. This prefix is
also used in the YACC parser generator.
FILE *yyin;
int yylex();
char yytext[];
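A minimal driver that exercises this interface might look like the following (the category printing is just for
illustration; with flex, yytext is declared char * rather than char []):

#include <stdio.h>

extern FILE *yyin;          /* the stream the scanner reads from */
extern char yytext[];       /* the lexeme matched by the last yylex() */
extern int yylex(void);     /* returns 0 at end of input */

int main(int argc, char *argv[])
{
    int category;
    yyin = (argc > 1) ? fopen(argv[1], "r") : stdin;
    if (yyin == NULL) { perror(argv[1]); return 1; }
    while ((category = yylex()) != 0)
        printf("category %d: \"%s\"\n", category, yytext);
    return 0;
}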
The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is used to signify
lex elements. The whole file is divided into three sections separated by %%:
header
%%
body
%%
helper functions
The header consists of C code fragments enclosed in %{ and %} as well as macro definitions consisting of a name
and a regular expression denoted by that name. lex macros are invoked explicitly by enclosing the macro name in
curly braces. Following are some example lex macros.
letter  [a-zA-Z]
digit   [0-9]
ident   {letter}({letter}|{digit})*
The body consists of a sequence of regular expressions for different token categories and other lexical entities.
Each regular expression can have a C code fragment enclosed in curly braces that executes when that regular
expression is matched. For most of the regular expressions this code fragment (also called a semantic action)
consists of returning an integer that identifies the token category to the rest of the compiler, particularly for use by
the parser to check syntax. Some typical regular expressions and semantic actions might include:
" "
{ident}
"*"
You also need regular expressions for lexical errors such as unterminated character constants, or illegal characters.
The helper functions in a lex file typically compute lexical attributes, such as the actual integer or string values
denoted by literals.
The union yylval will hold the computed values of integer, real number, and string literals.
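Putting the three sections together, a small self-contained specification might look like this (the token category
numbers and the integer-only yylval are illustrative choices, not fixed by lex):

%{
#include <stdlib.h>
#define NUMBER 257
#define IDENT  258
int yylval;                 /* lexical attribute of the last token */
%}
letter  [a-zA-Z]
digit   [0-9]
ident   {letter}({letter}|{digit})*
%%
[ \t\n]+    { /* discard whitespace */ }
{digit}+    { yylval = atoi(yytext); return NUMBER; }
{ident}     { return IDENT; }
.           { fprintf(stderr, "illegal character '%s'\n", yytext); }
%%
int yywrap(void) { return 1; }   /* no more input files after EOF */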
Syntax Analysis
Parsing is the act of performing syntax analysis to verify an input program's compliance with the source language.
A by-product of this process is typically a tree that represents the structure of the program.
A grammar derives strings by beginning with a string X consisting of just the start symbol, and repeatedly
replacing one of X's nonterminals with the right-hand side of a production rule for that nonterminal. When X
consists only of terminal symbols, it is a string of the language denoted by the grammar. Each
iteration of the loop is a derivation step. If an iteration has several nonterminals to choose from at some
point, the rules of derivation would allow any of these to be applied. In practice, parsing algorithms tend to
always choose the leftmost nonterminal, or the rightmost nonterminal, resulting in what are called leftmost
derivations or rightmost derivations.
Grammar Ambiguity
The grammar
E -> E + E
E -> E * E
E -> ( E )
E -> ident
allows two different derivations for strings such as "x + y * z". The grammar is ambiguous, but the
semantics of the language dictate a particular operator precedence that should be used. One way to
eliminate such ambiguity is to rewrite the grammar. For example, we can force the precedence we want by
adding some nonterminals and production rules.
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> ident
S -> A B C
A -> a A
A -> epsilon
B -> b
C -> c
We can remove the left recursion in the expression grammar by introducing new nonterminals and new production rules.
E  -> T E'
E' -> + T E' | epsilon
T  -> F T'
T' -> * F T' | epsilon
F  -> ( E ) | ident
Getting rid of such immediate left recursion is not enough; one must also get rid of indirect left recursion, where
two or more nonterminals are mutually left-recursive. One can rewrite any CFG to remove left recursion
(Algorithm 4.1).

for i := 1 to n do
    for j := 1 to i-1 do begin
        replace each production of the form Ai -> Aj gamma
        with the productions Ai -> delta1 gamma | delta2 gamma | ... | deltak gamma,
        where Aj -> delta1 | delta2 | ... | deltak are the current Aj productions
    end
    eliminate the immediate left recursion among the Ai productions
Backtracking?
Current token could begin more than one of your possible production rules? Try all of them, remembering and
resetting the parser state for each try.
S -> cAd
A -> ab
A -> a
Left factoring the two A rules gives:

S  -> c A d
A  -> a A'
A' -> b | epsilon
One can also perform left factoring (Algorithm 4.2) to reduce or eliminate the lookahead or backtracking
needed to tell which production rule to use. If the end result has no lookahead or backtracking needed, the
resulting CFG can be solved by a "predictive parser" and coded easily in a conventional language. If
backtracking is needed, a recursive descent parser takes more work to implement, but is still feasible. As a
more concrete example:

S -> if E then S1
S -> if E then S1 else S2

Left factoring these two rules gives:

S  -> if E then S1 S'
S' -> else S2 | epsilon
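Once the grammar is left-factored and non-left-recursive, a predictive parser falls out almost mechanically: one
function per nonterminal, with the lookahead token deciding which production to use. A sketch for the
E/E'/T/T'/F expression grammar above (the scanner gettoken() and the token code ID are assumed):

#include <stdio.h>
#include <stdlib.h>

#define ID 256                  /* token category; illustrative */

int lookahead;                  /* current input token */
extern int gettoken(void);      /* the scanner; assumed to exist */

static void error(void) { printf("syntax error\n"); exit(1); }

static void match(int t)        /* consume token t, or die */
{
    if (lookahead == t) lookahead = gettoken();
    else error();
}

static void E(void), Eprime(void), T(void), Tprime(void), F(void);

static void E(void)      { T(); Eprime(); }
static void Eprime(void) { if (lookahead == '+') { match('+'); T(); Eprime(); }
                           /* else: the epsilon production, consume nothing */ }
static void T(void)      { F(); Tprime(); }
static void Tprime(void) { if (lookahead == '*') { match('*'); F(); Tprime(); } }
static void F(void)
{
    if (lookahead == '(')     { match('('); E(); match(')'); }
    else if (lookahead == ID) { match(ID); }
    else error();
}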
Follow(A)
Follow(A) for nonterminal A is the set of terminals that can appear immediately to the right of A in some
sentential form, i.e., those a for which S =>* alpha A a beta for some alpha and beta. To compute Follow, apply
these rules to all nonterminals in the grammar:
1. Add $ to Follow(S)
2. if A -> alpha B beta, then add First(beta) - {epsilon} to Follow(B)
3. if A -> alpha B, or A -> alpha B beta where epsilon is in First(beta), then add Follow(A) to Follow(B).
Bottom Up Parsing
Bottom up parsers start from the sequence of terminal symbols and work their way back up to the start
symbol by repeatedly replacing grammar rules' right hand sides by the corresponding non-terminal. This is
the reverse of the derivation process, and is called "reduction".
Example. For the grammar
(1) S -> aABe
(2) A -> Abc
(3) A -> b
(4) B -> d
the string "abbcde" can be parsed bottom-up by the following reduction steps:
abbcde
aAbcde
aAde
aABe
S
LR Parsers
LR denotes a class of bottom up parsers that is capable of handling virtually all programming language
constructs. LR is efficient; it runs in linear time with no backtracking needed. The class of languages
handled by LR is a proper superset of the class of languages handled by top down "predictive parsers". LR
parsing detects an error as soon as it is possible to do so. Generally, building an LR parser is too big and
complicated a job to do by hand, so we use tools to generate LR parsers.
The LR parsing algorithm is given below. See Figure 4.29 for a schematic.

ip = first symbol of input
repeat {
    s = state on top of parse stack
    a = *ip
    case action[s,a] of {
        shift s': push s' onto the stack; advance ip
        reduce A -> beta: pop |beta| states off the stack;
            push goto[top(stack), A]; output production A -> beta
        accept: halt, the input parses
        error: invoke error recovery
    }
}
Example shift-reduce conflict: in many languages, two nested "if" statements produce a situation where an "else"
clause could legally belong to either "if" (the "dangling else"). The usual rule (to shift) attaches the else to the
nearest (i.e. inner) if statement.
Example reduce-reduce conflict:

(1) S -> id LP plist RP
(2) S -> E GETS E
(3) plist -> plist , p
(4) plist -> p
(5) p -> id
(6) E -> id LP elist RP
(7) E -> id
(8) elist -> elist , E
(9) elist -> E

On input such as "id ( id ," the parser cannot tell whether to reduce the inner id to p (rule 5) or to E (rule 7),
since both plist and elist continue with a comma.
Example: computing the closure of an item set for the expression grammar (an item is a production rule with a
dot marking how much of the rule has been seen so far):

closure({[E -> E+.T]})
 = closure({[E -> E+.T], [T -> .T*F], [T -> .F]})
 = closure({[E -> E+.T], [T -> .T*F], [T -> .F], [F -> .(E)], [F -> .id]})
 = { [E -> E+.T], [T -> .T*F], [T -> .F], [F -> .(E)], [F -> .id] }
Valid Items: an item A -> beta1 . beta2 is valid for a viable prefix alpha beta1 if there is a derivation
S' =>*rm alpha A w =>rm alpha beta1 beta2 w.

Suppose A -> beta1 . beta2 is valid for alpha beta1, and alpha beta1 is on the parsing stack:
1. if beta2 != epsilon, we should shift
2. if beta2 = epsilon, A -> beta1 is the handle, and we should reduce by this production
Note: two valid items may tell us to do different things for the same viable prefix. Some of these conflicts
can be resolved using lookahead on the input string.
Example: constructing the LR item sets for the grammar

S -> aABe
A -> Abc
A -> b
B -> d

FIRST(S)  = {a}      FOLLOW(S)  = {$}
FIRST(A)  = {b}      FOLLOW(A)  = {b,d}
FIRST(B)  = {d}      FOLLOW(B)  = {e}
FIRST(S') = {a}      FOLLOW(S') = {$}

I0 = closure([S' -> .S])
   = closure([S' -> .S], [S -> .aABe])
goto(I0,S) = closure([S' -> S.]) = I1
goto(I0,a) = closure([S -> a.ABe])
           = closure([S -> a.ABe], [A -> .Abc], [A -> .b]) = I2
goto(I2,A) = closure([S -> aA.Be], [A -> A.bc])
           = closure([S -> aA.Be], [A -> A.bc], [B -> .d]) = I3
goto(I2,b) = closure([A -> b.]) = I4
goto(I3,B) = closure([S -> aAB.e]) = I5
goto(I3,b) = closure([A -> Ab.c]) = I6
goto(I3,d) = closure([B -> d.]) = I7
goto(I5,e) = closure([S -> aABe.]) = I8
goto(I6,c) = closure([A -> Abc.]) = I9
YACC
YACC files end in .y and take the form
declarations
%%
grammar
%%
subroutines
The declarations section defines the terminal symbols (tokens) and nonterminal symbols. The most useful
declarations are:
%token a
declares terminal symbol a; YACC can generate a set of #define's that map these symbols onto
integers, in a y.tab.h file
%start A
specifies the start symbol for the grammar (defaults to nonterminal on left side of the first production
rule).
The grammar gives the production rules, interspersed with program code fragments called semantic actions
that let the programmer do what's desired when the grammar productions are reduced. They follow the
syntax
A : body ;
where body is a sequence of zero or more terminals, nonterminals, or semantic actions (code, in curly braces)
separated by spaces. As a notational convenience, multiple production rules may be grouped together using
the vertical bar (|).
The Value Stack
YACC's parse stack contains only states
YACC maintains a parallel set of values
$ is used in semantic actions to name elements on the value stack
$$ denotes the value associated with the LHS (nonterminal) symbol
$n denotes the value associated with RHS symbol at position n.
Value stack typically used to construct the parse tree
Typical rule with semantic action: A : b C d { $$ = tree(R,3,$1,$2,$3); }
The default value stack is an array of integers
The value stack can hold arbitrary values in an array of unions
The union type is declared with %union and is named YYSTYPE
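The pieces above combine into a complete, if tiny, example. This sketch parses one line of single-digit sums and
products, using the precedence declarations described next (the one-character yylex and the ival field are
simplifications):

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
int yyparse(void);
void yyerror(const char *s);
%}
%union { int ival; }
%token <ival> NUMBER
%type  <ival> expr
%left '+'
%left '*'
%%
line : expr '\n'      { printf("%d\n", $1); }
     ;
expr : expr '+' expr  { $$ = $1 + $3; }
     | expr '*' expr  { $$ = $1 * $3; }
     | NUMBER         { $$ = $1; }
     ;
%%
int yylex(void)
{
    int c = getchar();
    if (isdigit(c)) { yylval.ival = c - '0'; return NUMBER; }
    return (c == EOF) ? 0 : c;
}
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
int main(void) { return yyparse(); }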
YACC precedence and associativity declarations
YACC headers can specify precedence and associativity rules for otherwise heavily ambiguous grammars.
Precedence increases with each successive declaration, so operators declared later bind more tightly. Example:
%right ASSIGN
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%%
expr: expr ASSIGN expr
| expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIVIDE expr
| expr POWER expr
;
Semantic Analysis
Semantic ("meaning") analysis refers to a phase of compilation in which the input program is studied in
order to determine what operations are to be carried out. The two primary components of a classic semantic
analysis phase are variable reference analysis and type checking. These components both rely on an
underlying symbol table.
What we have at the start of semantic analysis is a tree built by the parser. Semantic rules are typically written as
syntax-directed definitions; they can have all the synthesized attributes they want. In practice, attributes get
stored in parse tree nodes, and the semantic rules are evaluated either (a) during parsing (for easy rules) or (b)
during one or more (sub)tree traversals.
Symbol Table Module
Symbol tables are used to resolve names within name spaces. Symbol tables are generally organized
hierarchically according to the scope rules of the language. See the operations defined on pages 474-476 of
the text. To wit:
mktable(parent)
creates a new symbol table, whose scope is local to (or inside) parent
enter(table, symbolname, type, offset)
insert a symbol into a table
addwidth(table)
sums the widths of all entries in the table
enterproc(table, name, newtable)
enter the named procedure, with newtable as its local symbol table, into table
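A minimal hash-table realization of part of this interface might look like the following sketch (the fixed table
size, the int type code, and the lookup routine are simplifying assumptions):

#include <stdlib.h>
#include <string.h>

#define TBLSIZE 97

struct sym {                      /* one symbol table entry */
    char *name;
    int type, offset;
    struct sym *next;             /* hash chain */
};

struct table {
    struct table *parent;         /* the enclosing scope, or NULL */
    struct sym *bucket[TBLSIZE];
};

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TBLSIZE;
}

struct table *mktable(struct table *parent)
{
    struct table *t = calloc(1, sizeof *t);
    t->parent = parent;
    return t;
}

void enter(struct table *t, char *name, int type, int offset)
{
    struct sym *s = malloc(sizeof *s);
    unsigned h = hash(name);
    s->name = name; s->type = type; s->offset = offset;
    s->next = t->bucket[h];
    t->bucket[h] = s;
}

/* resolve a name by searching outward through enclosing scopes */
struct sym *lookup(struct table *t, const char *name)
{
    for ( ; t != NULL; t = t->parent)
        for (struct sym *s = t->bucket[hash(name)]; s; s = s->next)
            if (strcmp(s->name, name) == 0)
                return s;
    return NULL;
}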
Type Checking
Perhaps the primary component of semantic analysis in many traditional compilers consists of the type
checker. In order to check types, one first must have a representation of those types (a type system) and then
one must implement comparison and composition operators on those types using the semantic rules of the
source language being compiled. Lastly, type checking will involve adding synthesized attributes through
those parts of the language grammar that involve expressions and values.
Type Systems
Types are defined recursively according to rules defined by the source language being compiled. A type
system might start with rules like:
Base types (int, char, etc.) are types
Named types (via typedef, etc.) are types
Types composed using other types are types, for example:
array(T, indices) is a type. In some languages indices always start with 0, so array(T, size)
works.
T1 x T2 is a type (specifying, more or less, the tuple or sequence T1 followed by T2; x is a so-called cross-product operator).
record((f1 x T1) x (f2 x T2) x ... x (fn x Tn)) is a type
in languages with pointers, pointer(T) is a type
(T1 x ... Tn) -> Tn+1 is a type denoting a function mapping parameter types to a return type
In some languages, type expressions may contain variables whose values are types.
In addition, a type system includes rules for assigning these types to the various parts of the program;
usually this will be performed using attributes assigned to grammar symbols.
Representing C (C++, Java, etc.) Types
The type system is represented using data structures in the compiler's implementation language. In the
symbol table and in the parse tree attributes used in type checking, there is a need to represent and compare
source language types. You might start by trying to assign a numeric code to each type, kind of like the
integers used to denote each terminal symbol and each production rule of the grammar. But what about
arrays? What about structs? There are an infinite number of types; any attempt to enumerate them will fail.
Instead, you should create a new data type to explicitly represent type information. This might look
something like the following:
struct c_type {
   int base_type;                /* 1 = int, 2 = float, ... */
   union {
      struct array {             /* arrays: element count and element type */
         int size;
         struct c_type *elemtype;
      } a;
      struct c_type *p;          /* pointers: the referenced type */
      struct struc {             /* structs: tag and field list */
         char *label;
         struct field **f;
      } s;
   } u;
};

struct field {                   /* one member of a struct type */
   char *name;
   struct c_type *elemtype;
};
Given this representation, how would you initialize a variable to represent each of the following types:
int [10][20]
struct foo { int x; char *s; }
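One possible answer, assuming the struct c_type declaration above plus made-up codes for char, array, pointer,
and struct, and hypothetical constructor helpers:

#include <stdlib.h>

/* 1 = int and 2 = float come from the declaration above;
   the remaining codes are assumptions for this sketch */
enum { T_INT = 1, T_FLOAT = 2, T_CHAR = 3,
       T_ARRAY = 4, T_POINTER = 5, T_STRUCT = 6 };

struct c_type *new_type(int code)
{
    struct c_type *t = calloc(1, sizeof *t);
    t->base_type = code;
    return t;
}

struct c_type *new_array(int size, struct c_type *elem)
{
    struct c_type *t = new_type(T_ARRAY);
    t->u.a.size = size;
    t->u.a.elemtype = elem;
    return t;
}

struct field *new_field(char *name, struct c_type *ftype)
{
    struct field *f = calloc(1, sizeof *f);
    f->name = name;
    f->elemtype = ftype;
    return f;
}

void examples(void)
{
    /* int [10][20]: an array of 10 arrays of 20 ints */
    struct c_type *t1 = new_array(10, new_array(20, new_type(T_INT)));

    /* struct foo { int x; char *s; } */
    struct c_type *ptr_char = new_type(T_POINTER);
    ptr_char->u.p = new_type(T_CHAR);

    struct c_type *t2 = new_type(T_STRUCT);
    t2->u.s.label = "foo";
    t2->u.s.f = calloc(3, sizeof(struct field *));  /* NULL-terminated */
    t2->u.s.f[0] = new_field("x", new_type(T_INT));
    t2->u.s.f[1] = new_field("s", ptr_char);
    (void)t1; (void)t2;
}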
Run-time Environments
Relationship between source code names and data objects during execution
Procedure activations
Memory management and layout
Library functions
Scopes and Bindings
Variables may be declared explicitly or implicitly in some languages
Scope rules for each language determine how to go from names to declarations.
Each use of a variable name must be associated with a declaration. This is generally done via a symbol
table. In most compiled languages it happens at compile time (in contrast, for example, with LISP).
Environment and State
Environment maps source code names onto storage addresses (at compile time), while state maps storage
addresses into values (at runtime). Environment relies on binding rules and is used in code generation; state
operations are loads/stores into memory, as well as allocations and deallocations. Environment is concerned
with scope rules, state is concerned with things like the lifetimes of variables.
Runtime Memory Regions
Operating systems vary in terms of how they organize program memory for runtime execution, but a typical
scheme looks like this:
code
static data
stack (grows down)
heap (may grow up, from bottom of address space)
The code section may be read-only, and shared among multiple instances of a program. Dynamic loading
may introduce multiple code regions, which may not be contiguous, and some of them may be shared by
different programs. The static data area may consist of two sections, one for "initialized data" and one for
uninitialized data (i.e. all zeros at the start of execution). Some OSes place the heap at the very end of the
address space, with a big hole, so that either the stack or the heap may grow arbitrarily large. Other OSes fix the
stack size, place the heap above the stack, and grow the heap downward.
Questions to ask about a language, before writing its code generator
1. May procedures be recursive? (Duh, all modern languages...)
2. What happens to locals when a procedure returns? (Lazy deallocation rare)
3. May a procedure refer to non-local, non-global names? (Pascal-style nested procedures, and object
field names)
4. How are parameters passed? (Many styles are possible, with different declarations for each in Pascal; the rules vary by language.)
Automatic storage management is one of the single most important features that makes programming easier.
Basic problem in garbage collection: given a piece of memory, are there any pointers to it? (And if so,
where exactly are all of them please). Approaches:
reference counting (see the sketch after this list)
traversal of known pointers (marking)
copying
compacting
generational
conservative collection
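As a sketch of the first approach, a reference-counted object in C might look like this (real collectors must do
more; in particular, plain reference counting never reclaims cycles):

#include <stdlib.h>

struct obj {
    int refcount;            /* number of live pointers to this object */
    struct obj *child;       /* one outgoing pointer, for illustration */
};

struct obj *obj_new(void)
{
    struct obj *o = calloc(1, sizeof *o);
    o->refcount = 1;         /* the creating reference */
    return o;
}

struct obj *obj_retain(struct obj *o)
{
    if (o) o->refcount++;
    return o;
}

void obj_release(struct obj *o)
{
    if (o && --o->refcount == 0) {
        obj_release(o->child);   /* freeing o drops its outgoing reference */
        free(o);
    }
}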
Syntax-directed translation of expressions associates two attributes with each node: E.place, the location that
holds the expression's value, and E.code, the list of instructions that computes it.

E -> E1 * E2
    E.place = newtemp();
    E.code = E1.code || E2.code || gen(MUL, E.place, E1.place, E2.place);

E -> - E1
    E.place = newtemp();
    E.code = E1.code || gen(NEG, E.place, E1.place);

E -> ( E1 )
    E.place = E1.place;
    E.code = E1.code;

E -> id
    E.place = id.place;
    E.code = emptylist();
Three-Address Code
Basic idea: break down source language expressions into simple pieces that:
translate easily into real machine code
form a linearized representation of a syntax tree
allow us to check our own work to this point
allow machine independent code optimizations to be performed
increase the portability of the compiler
Instruction set:
x := y op z              store result of binary operation on y and z to x
x := op y                store result of unary operation on y to x
x := y                   store y to x
x := &y                  store address of y to x
x := *y                  store contents pointed to by y to x
*x := y                  store y to location pointed to by x
goto L                   unconditional jump to L
if x rop y then goto L   binary conditional jump to L (rop is a relational operator)
if x then goto L         unary conditional jump to L
if !x then goto L        unary negative conditional jump to L
param x                  store x as a parameter
call p,n,x               call procedure p with n parameters, store result in x
return x                 return from procedure, use x as the result
Declarations (Pseudo instructions): These declarations list size units as "bytes"; in a uniform-size
environment offsets and counts could be given in units of "slots", where a slot (4 bytes on 32-bit machines)
holds anything.
global x,n1,n2   declare a global named x at offset n1 having n2 bytes of space
proc x,n1,n2     declare a procedure named x with n1 bytes of parameter space and
                 n2 bytes of local variable space
local x,n        declare a local named x at offset n from the procedure frame
label Ln         designate that label Ln refers to the next instruction
end              declare the end of the current procedure
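Inside the compiler, each three-address instruction is typically stored as a quadruple: an opcode plus up to three
operand addresses. A sketch (the opcode list and the region/offset operand encoding are assumptions, not the
only reasonable choice):

/* one three-address instruction, stored as a quadruple */
enum opcode { O_ADD, O_MUL, O_NEG, O_ASN, O_ADDR, O_LOAD, O_STORE,
              O_GOTO, O_BIF, O_PARM, O_CALL, O_RET,
              D_GLOB, D_PROC, D_LOCAL, D_LABEL, D_END };

struct addr {                /* an operand: which region, what offset */
    int region;              /* e.g. GLOBAL, LOCAL, CONST */
    int offset;
};

struct instr {
    enum opcode op;
    struct addr dest, src1, src2;
    struct instr *next;      /* code is kept as a linked list */
};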
Adaptations for Object Oriented Code
x := y field z   lookup field named z within y, store address to x
class x,n1,n2    declare a class named x with n1 bytes of class variables and
                 n2 bytes of class method pointers
field x,n        declare a field named x at offset n in the class frame
new x            create a new instance of class named x
Intermediate Code for Control Flow
Code for control flow (if-then, switches, and loops) consists of code to test conditions, and the use of goto
instructions and labels to route execution to the correct code. Each chunk of code that is executed together
(no jumps into or out of it) is called a basic block. The basic blocks are nodes in a control flow graph, where
goto instructions, as well as falling through from one basic block to another, are edges connecting basic
blocks.
Depending on your source language's semantic rules for things like "short-circuit" evaluation for boolean
operators, the operators like || and && might be similar to + and * (non-short-circuit) or they might be more
like if-then code.
A general technique for implementing control flow code is to add new attributes to tree nodes to hold labels
that denote the possible targets of jumps. The labels in question are sort of analogous to FIRST and
FOLLOW; for any given list of instructions corresponding to a given tree node, we might want a .first
attribute to hold the label for the beginning of the list, and a .follow attribute to hold the label for the next
instruction that comes after the list of instructions. The .first attribute can be easily synthesized. The .follow
attribute must be inherited from a sibling. The labels have to actually be allocated and attached to
instructions at the appropriate tree nodes, corresponding to the grammar production rules that govern control
flow. An instruction in the middle of a basic block needs neither a .first nor a .follow.
S -> if E then S1
    E.true = newlabel();
    E.false = S.follow;
    S1.follow = S.follow;
    S.code = E.code || gen(LABEL, E.true) || S1.code

S -> if E then S1 else S2
    E.true = newlabel();
    E.false = newlabel();
    S1.follow = S.follow;
    S2.follow = S.follow;
    S.code = E.code || gen(LABEL, E.true) || S1.code || gen(GOTO, S.follow) ||
             gen(LABEL, E.false) || S2.code
Exercise: OK, so what does a while loop look like?
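One plausible answer, following the same pattern (assuming S.first labels the top of the loop, where the test
begins):

S -> while E do S1
    E.true = newlabel();
    E.false = S.follow;
    S1.follow = S.first;
    S.code = gen(LABEL, S.first) || E.code || gen(LABEL, E.true) ||
             S1.code || gen(GOTO, S.first)

The body jumps back to S.first so the condition is re-tested, and E.false = S.follow makes a false test exit the
loop.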
On Boolean Operators, and Short Circuit Control Flow
Different languages have different semantics for booleans; for example, Pascal treats them as identical to
arithmetic operators, while the C family of languages (and many others) specifies "short-circuit" evaluation,
in which operands are not evaluated once the boolean result is known. Some ("kitchen-sink"
design) languages have two sets of boolean operators: short-circuit and non-short-circuit.
Implementation techniques for these alternatives include:
1. treat boolean operators same as arithmetic operators, evaluate each and every one into temporary
variable locations.
2. add extra attributes to keep track of code locations that are targets of jumps. Boolean expressions'
results evaluate to jump instructions.
3. one could change the machine execution model so it implicitly routes control from expression failure
to the appropriate location. In order to do this one would:
mark boundaries of code in which failure propagates
maintain a stack of such marked "expression frames"
Non-short Circuit Example

a < b || c < d && e < f

translates into

100: if a < b goto 103
101: t1 = 0
102: goto 104
103: t1 = 1
104: if c < d goto 107
105: t2 = 0
106: goto 108
107: t2 = 1
108: if e < f goto 111
109: t3 = 0
110: goto 112
111: t3 = 1
112: t4 = t2 AND t3
113: t5 = t1 OR t4
Short-Circuit Example

a < b || c < d && e < f

translates into

    if a < b goto L1
    if c < d goto L2
    goto L3
L2: if e < f goto L1
    goto L3
L1: t5 = 1
    goto L4
L3: t5 = 0
L4: ...

Note: L3 might instead be the target E.false; L1 might instead be E.true;
no computation of a 0 or 1 into t5 might be needed at all.
Instruction Selection
Accessing values in registers is much faster than accessing main memory. Register allocation denotes the
selection of which variables will go into registers. Register assignment is the determination of exactly which
register a given variable is placed in. The goal of these operations is generally to minimize the total number of
memory accesses required by the program.

When the number of variables in use at a given time exceeds the number of registers available (the common
case), some variables may be used directly from memory if the instruction set supports memory-based
operations. When an instruction set does not support memory-based operations, all variables must be loaded
into a register in order to perform arithmetic or logic using them.
A virtual machine architecture such as the JVM changes the "final" code
generation somewhat. We have seen several changes, some of which
simplify final code generation and some of which complicate things.