Introduction to Compilers and Language Design -chapter7
Anyone is free to download and print the PDF edition of this book for per-
sonal use. Commercial distribution, printing, or reproduction without the
author’s consent is expressly prohibited. All other rights are reserved.
You can find the latest version of the PDF edition, and purchase inexpen-
sive hardcover copies at http://compilerbook.org
CHAPTER 7. SEMANTIC ANALYSIS
7.1. OVERVIEW OF TYPE SYSTEMS
/* This is C code */
int i;
int a[10];
for(i=0;i<100;i++) a[i] = i;
code that can explicitly examine the type of a variable. For example, the
instanceof operator in Java allows one to test for types explicitly:
/* This is C code */
int x = 32.5;
/* This is C code */
int *i;
float *f = i;
7.2. DESIGNING A TYPE SYSTEM
The compiler determines that 32.5 has type double, and therefore x
must also have type double. In a similar way, the output operator << is
defined to have a certain behavior on integers, another behavior on strings,
and so forth. In this case, the compiler already determined that the type of
x is double and so it chooses the variant of << that operates on doubles.
This is useful because variables and functions dealing with days and
months are now kept separate, preventing you from accidentally assigning
one to another, or from giving the value 13 to a variable of type Month.
C has a similar feature, but it is much weaker: typedef declares a new
name for a type, but doesn’t have any means of restricting the range, and
doesn’t prevent you from making assignments between types that share
the same base type:
/* This is C code */
typedef int Month;
typedef int Day;
/* Assigning m to d is allowed in C,
because they are both integers. */
Month m = 10;
Day d = m;
/* This is Go code */
type coordinates struct {
latitude float64
longitude float64
}
Less frequently used are union types in which multiple symbols occupy
the same memory. For example, in C, you can declare a union type
named number that contains an overlapping float and integer:
/* This is C code */
union number {
int i;
float f;
};
union number n;
n.i = 10;
n.f = 3.14;
In this case, n.i and n.f occupy the same memory. If you assign 10
to n.i and read it back, you will see 10 as expected. However, if you
assign 10 to n.i and read back n.f, you will likely observe a garbage
value, depending on how exactly the two values are mapped into memory.
Union types are occasionally handy when implementing operating system
features such as device drivers, because hardware interfaces often re-use
the same memory locations for multiple purposes.
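To make the overlap concrete, here is a minimal sketch that writes one member of the union and reads back either member. The helper names write_int_read_int, write_int_read_float, and show_overlap are hypothetical and exist only for this illustration. What the float member shows depends on the platform's floating point format.

```c
#include <stdio.h>

union number {
    int i;
    float f;
};

/* Hypothetical helper: write n.i and read n.i back. */
int write_int_read_int(int value)
{
    union number n;
    n.i = value;
    return n.i;
}

/* Hypothetical helper: write n.i, then read the same bytes as n.f. */
float write_int_read_float(int value)
{
    union number n;
    n.i = value;
    return n.f;
}

/* Hypothetical helper: print both views of the same memory. */
void show_overlap(void)
{
    union number n;
    n.i = 10;
    printf("size of union: %zu bytes\n", sizeof(n));
    printf("n.i = %d\n", n.i);
    printf("n.f = %g\n", n.f);
}
```

Calling write_int_read_int(10) gives back 10, while write_int_read_float(10) reinterprets the integer's bit pattern as a float and generally yields a value unrelated to 10.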
• Perform a bitwise copy. If the two variables have the same under-
lying storage size, the unlike assignment could be accomplished by
just copying the bits in one variable to the location of the other. This
is usually a bad idea, since there is no guarantee that one data type
has any meaning in the other context. But it does happen in a few
select cases, such as when assigning different pointer types in C.
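As a rough illustration of why a bitwise copy rarely preserves meaning, the following sketch copies the bits of a float into an integer of the same size. The helper name float_bits is hypothetical, and the sketch assumes that float is 32 bits wide and uses the IEEE 754 format, which is common but not guaranteed by the C standard.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the raw bits of a float into a 32-bit integer.
   Assumes a 32-bit float; if the sizes differ, only the smaller number
   of bytes is copied. */
uint32_t float_bits(float f)
{
    uint32_t bits = 0;
    size_t n = sizeof(bits) < sizeof(f) ? sizeof(bits) : sizeof(f);
    memcpy(&bits, &f, n);
    return bits;
}
```

Under those assumptions, float_bits(1.0f) is 0x3F800000, which has nothing to do with the integer 1. The bits were copied faithfully, but their meaning was not.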
The B-Minor type system is safe, static, and explicit. As a result, it is fairly
compact to describe, straightforward to implement, and eliminates a large
number of programming errors. However, it may be stricter than some other
languages, so there are many kinds of errors that the compiler must detect.
B-Minor has the following atomic types:
• All binary operators must have the same type on the left and right
hand sides.
• The comparison operators < <= >= > may only be applied to
integer values and always return boolean.
• The boolean operators ! && || may only be applied to boolean
values and always return boolean.
7.4. THE SYMBOL TABLE
The symbol table records all of the information that we need to know
about every declared variable (and other named items, like functions) in
the program. Each entry in the table is a struct symbol which is shown
in Figure 7.1.
The kind field indicates whether the symbol is a local variable, a global
variable, or a function parameter. The type field points to a type structure
indicating the type of the variable. The name field gives the name (obvi-
ously), and the which field gives the ordinal position of local variables
and parameters. (More on that later.)
As with all the other data structures we have created so far, we must
have a factory function like this:
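As a rough sketch, the structure and its factory function might look like the following. The four fields and the argument order of symbol_create follow the text and the sample code in this chapter. The enumeration names SYMBOL_LOCAL, SYMBOL_PARAM, and SYMBOL_GLOBAL, and the choice to start which at zero, are assumptions made for illustration.

```c
#include <stdlib.h>

struct type;   /* defined elsewhere in the compiler */

/* Assumed enumeration names: the text only says that kind
   distinguishes locals, globals, and parameters. */
typedef enum {
    SYMBOL_LOCAL,
    SYMBOL_PARAM,
    SYMBOL_GLOBAL
} symbol_t;

struct symbol {
    symbol_t kind;       /* local, parameter, or global */
    struct type *type;   /* type of the variable */
    char *name;          /* name as written in the source */
    int which;           /* ordinal position of a local or parameter */
};

/* Factory function: allocate a symbol and fill in its fields.
   Returns NULL if memory cannot be allocated. */
struct symbol * symbol_create( symbol_t kind, struct type *type, char *name )
{
    struct symbol *s = malloc(sizeof(*s));
    if(!s) return NULL;
    s->kind = kind;
    s->type = type;
    s->name = name;
    s->which = 0;   /* assumption: assigned later, when the symbol is bound */
    return s;
}
```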
Conceptually, the symbol table is just a map between the name of each
variable, and the symbol structure that describes it:
However, it’s not quite that simple, because most programming lan-
guages allow the same variable name to be used multiple times, as long
as each definition is in a distinct scope. In C-like languages (including B-
Minor) there is a global scope, a scope for function parameters and local
variables, and then nested scopes everywhere curly braces appear.
For example, the following B-Minor program defines the symbol x
three times, each with a different type and storage class. When run, the
program should print 10 hello false.
x: integer = 10;
[Figure: the symbol table as a stack of scope tables, listed from the
stack top down]
Inner scope table:     x -> LOCAL(0) BOOLEAN
Function scope table:  x -> PARAM(0) STRING
Global scope table:    x -> GLOBAL INTEGER
                       f -> GLOBAL FUNCTION
                       main -> GLOBAL FUNCTION
void scope_enter();
void scope_exit();
int scope_level();
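One possible way to implement this interface is a linked stack of scopes, each holding a simple list of bindings. The sketch below is only that: a real compiler would more likely use a hash table per scope. scope_bind and scope_lookup are called by the resolver code in this chapter, but their exact signatures here are assumed. scope_lookup_current, struct binding, and struct scope are hypothetical names introduced for this example.

```c
#include <stdlib.h>
#include <string.h>

struct symbol;   /* defined elsewhere in the compiler */

/* One name-to-symbol binding inside a scope (hypothetical structure). */
struct binding {
    const char *name;
    struct symbol *sym;
    struct binding *next;
};

/* One scope on the stack (hypothetical structure). */
struct scope {
    struct binding *bindings;
    struct scope *below;
};

static struct scope *scope_top = NULL;
static int scope_depth = 0;

/* Push a new, empty scope on top of the stack. */
void scope_enter(void)
{
    struct scope *s = calloc(1, sizeof(*s));
    if(!s) abort();
    s->below = scope_top;
    scope_top = s;
    scope_depth++;
}

/* Pop the topmost scope and free its bindings (not the symbols). */
void scope_exit(void)
{
    if(!scope_top) return;
    struct scope *s = scope_top;
    scope_top = s->below;
    scope_depth--;
    while(s->bindings) {
        struct binding *b = s->bindings;
        s->bindings = b->next;
        free(b);
    }
    free(s);
}

/* Number of scopes currently on the stack. */
int scope_level(void)
{
    return scope_depth;
}

/* Add a binding to the topmost scope. */
void scope_bind(const char *name, struct symbol *sym)
{
    if(!scope_top) return;
    struct binding *b = malloc(sizeof(*b));
    if(!b) abort();
    b->name = name;
    b->sym = sym;
    b->next = scope_top->bindings;
    scope_top->bindings = b;
}

/* Search a single scope for a name. */
static struct symbol * lookup_in(struct scope *s, const char *name)
{
    for(struct binding *b = s->bindings; b; b = b->next) {
        if(strcmp(b->name, name) == 0) return b->sym;
    }
    return NULL;
}

/* Search only the topmost scope; useful for detecting redeclarations. */
struct symbol * scope_lookup_current(const char *name)
{
    return scope_top ? lookup_in(scope_top, name) : NULL;
}

/* Search from the topmost scope down to the bottom of the stack. */
struct symbol * scope_lookup(const char *name)
{
    for(struct scope *s = scope_top; s; s = s->below) {
        struct symbol *sym = lookup_in(s, name);
        if(sym) return sym;
    }
    return NULL;
}
```

Because scope_lookup searches from the top of the stack downward, an inner definition of x hides an outer one until scope_exit removes it.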
7.5. NAME RESOLUTION
With the symbol table in place, we are now ready to match each use of a
variable name to its matching definition. This process is known as name
resolution. To implement name resolution, we will write a resolve
method for each of the structures in the AST, including decl_resolve(),
stmt_resolve() and so forth.
Collectively, these methods must iterate over the entire AST, looking
for variable declarations and uses. Wherever a variable is declared, it must
be entered into the symbol table and the symbol structure linked into the
AST. Wherever a variable is used, it must be looked up in the symbol table,
and the symbol structure linked into the AST. Of course, if a symbol is
declared twice in the same scope, or used without declaration, then an
appropriate error message must be emitted.
We will begin with declarations, as shown in Figure 7.4. Each decl
represents a variable declaration of some kind, so decl_resolve will
create a new symbol, and then bind it to the name of the declaration in
the current scope. If the declaration has an initializing expression
(d->value is not null) then that expression should be resolved. If the
declaration represents a function (d->code is not null) then we must
create a new scope and resolve the parameters and the code.
Figure 7.4 gives some sample code for resolving declarations. As al-
ways in this book, consider this starter code in order to give you the basic
idea. You will have to make some changes in order to accommodate all
the features of the language, handle errors cleanly, and so forth.
In a similar fashion, we must write resolve methods for each structure
in the AST. stmt_resolve() (not shown) must simply call the appropriate
resolve on each of its sub-components. In the case of a STMT_BLOCK,
it must also enter and leave a new scope. param_list_resolve() (also
not shown) must enter a new variable declaration for each parameter of a
function, so that those definitions are available to the code of a function.
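To give a feel for the shape of stmt_resolve(), here is a minimal sketch. The statement kinds other than STMT_BLOCK, and the field names of struct stmt, are assumptions made for this example. The scope and resolve functions are replaced by stand-ins that only count calls, so that the sketch compiles on its own.

```c
#include <stddef.h>

struct decl;   /* AST node types defined elsewhere */
struct expr;

/* Assumed statement kinds and fields, for illustration only. */
typedef enum { STMT_DECL, STMT_EXPR, STMT_IF_ELSE, STMT_BLOCK } stmt_t;

struct stmt {
    stmt_t kind;
    struct decl *decl;
    struct expr *expr;
    struct stmt *body;
    struct stmt *else_body;
    struct stmt *next;
};

/* Stand-ins for the real functions: they only count calls. */
static int enters = 0, exits = 0;
void scope_enter(void) { enters++; }
void scope_exit(void)  { exits++; }
void decl_resolve( struct decl *d ) { (void)d; }
void expr_resolve( struct expr *e ) { (void)e; }

void stmt_resolve( struct stmt *s )
{
    if(!s) return;
    switch(s->kind) {
        case STMT_DECL:
            decl_resolve(s->decl);
            break;
        case STMT_EXPR:
            expr_resolve(s->expr);
            break;
        case STMT_IF_ELSE:
            expr_resolve(s->expr);
            stmt_resolve(s->body);
            stmt_resolve(s->else_body);
            break;
        case STMT_BLOCK:
            /* A block opens a nested scope for its declarations. */
            scope_enter();
            stmt_resolve(s->body);
            scope_exit();
            break;
    }
    /* Continue with the next statement in the list. */
    stmt_resolve(s->next);
}
```

Every scope_enter() is paired with a scope_exit() in the same case, so the scope stack is back at its original depth once a block has been resolved.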
To perform name resolution on the entire AST, you may simply invoke
decl_resolve() once on the root node of the AST. This function will
traverse the entire tree by calling the necessary sub-functions.
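void decl_resolve( struct decl *d )
{
    if(!d) return;
    /* Assumption: scope level 1 is global; deeper levels are local. */
    symbol_t kind = scope_level() > 1 ? SYMBOL_LOCAL : SYMBOL_GLOBAL;
    d->symbol = symbol_create(kind,d->type,d->name);
    /* Resolve the initializer before the name itself is bound. */
    expr_resolve(d->value);
    scope_bind(d->name,d->symbol);
    if(d->code) {
        /* A function body gets its own scope for parameters and locals. */
        scope_enter();
        param_list_resolve(d->type->params);
        stmt_resolve(d->code);
        scope_exit();
    }
    decl_resolve(d->next);
}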
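void expr_resolve( struct expr *e )
{
    if(!e) return;
    if( e->kind==EXPR_NAME ) {
        /* A complete version must report an error if the lookup fails. */
        e->symbol = scope_lookup(e->name);
    } else {
        expr_resolve( e->left );
        expr_resolve( e->right );
    }
}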
7.6. IMPLEMENTING TYPE CHECKING
to the symbol structure, which contains the type. This type is copied and
returned to the parent node.
For interior nodes of the expression tree, we must compare the type
of the left and right subtrees, and determine if they are compatible with
the rules indicated in Section 7.3. If not, we emit an error message and
increment a global error counter. Either way, we return the appropriate
type for the operator. The types of the left and right branches are no longer
needed and can be deleted before returning.
Here is the basic code structure:
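struct type * expr_typecheck( struct expr *e )
{
    if(!e) return 0;
    /* Compute the types of both subtrees first. */
    struct type *lt = expr_typecheck(e->left);
    struct type *rt = expr_typecheck(e->right);
    struct type *result = 0;
    switch(e->kind) {
        case EXPR_INTEGER_LITERAL:
            result = type_create(TYPE_INTEGER,0,0);
            break;
        case EXPR_STRING_LITERAL:
            result = type_create(TYPE_STRING,0,0);
            break;
        /* ... one case for each remaining kind of expression ... */
    }
    /* The subtree types are no longer needed. */
    type_delete(lt);
    type_delete(rt);
    return result;
}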
Let’s consider the cases for a few operators in detail. Arithmetic oper-
ators can only be applied to integers, and always return an integer type:
case EXPR_ADD:
    if( lt->kind!=TYPE_INTEGER ||
        rt->kind!=TYPE_INTEGER ) {
        /* display an error */
    }
    result = type_create(TYPE_INTEGER,0,0);
    break;

/* Equality: both sides must have the same type, which may not be
   void, array, or function. The result is always boolean. */
case EXPR_EQ:
case EXPR_NE:
    if(!type_equals(lt,rt)) {
        /* display an error */
    }
    if(lt->kind==TYPE_VOID ||
       lt->kind==TYPE_ARRAY ||
       lt->kind==TYPE_FUNCTION) {
        /* display an error */
    }
    result = type_create(TYPE_BOOLEAN,0,0);
    break;

/* Array dereference: the left side must be an array and the index
   an integer. The result is the element type of the array. */
case EXPR_DEREF:
    if(lt->kind==TYPE_ARRAY) {
        if(rt->kind!=TYPE_INTEGER) {
            /* error: index not an integer */
        }
        result = type_copy(lt->subtype);
    } else {
        /* error: not an array */
        /* but we need to return a valid type */
        result = type_copy(lt);
    }
    break;
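The cases above lean on the helper functions type_create, type_copy, type_delete, and type_equals. A minimal sketch of how they might be written follows. The kind, subtype, and params fields appear in the code in this chapter, but the exact layout of struct type and the decision to ignore parameter lists when comparing function types are simplifying assumptions for this example.

```c
#include <stdlib.h>

typedef enum {
    TYPE_VOID, TYPE_BOOLEAN, TYPE_INTEGER,
    TYPE_STRING, TYPE_ARRAY, TYPE_FUNCTION
} type_t;

struct param_list;   /* function parameters, defined elsewhere */

struct type {
    type_t kind;
    struct type *subtype;        /* array element type or function return type */
    struct param_list *params;   /* only used by function types */
};

struct type * type_create( type_t kind, struct type *subtype,
                           struct param_list *params )
{
    struct type *t = malloc(sizeof(*t));
    if(!t) return NULL;
    t->kind = kind;
    t->subtype = subtype;
    t->params = params;
    return t;
}

/* Copy a type and its subtypes. Simplification: the parameter list
   is shared with the original rather than copied. */
struct type * type_copy( struct type *t )
{
    if(!t) return NULL;
    return type_create(t->kind, type_copy(t->subtype), t->params);
}

/* Free a type and its subtypes. Accepts null, so it is safe to call
   on the missing children of a leaf expression. */
void type_delete( struct type *t )
{
    if(!t) return;
    type_delete(t->subtype);
    free(t);
}

/* Structural equality. Simplification: parameter lists of function
   types are not compared in this sketch. */
int type_equals( struct type *a, struct type *b )
{
    if(!a || !b) return a == b;
    if(a->kind != b->kind) return 0;
    if(a->kind == TYPE_ARRAY || a->kind == TYPE_FUNCTION) {
        return type_equals(a->subtype, b->subtype);
    }
    return 1;
}
```

Because type_copy returns a fresh structure, a parent node can delete the types of its children without disturbing the type stored in the symbol table.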
and the other typechecking methods simply traverse the AST, compute the
type of expressions, and then check them against declarations and other
constraints as needed.
For example, decl_typecheck simply confirms that variable declarations
match their initializers and otherwise typechecks the body of function
declarations:
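A minimal sketch of that idea follows. To keep it self-contained, the AST structures are reduced to the few fields needed here, and expr_typecheck, type_equals, and stmt_typecheck are simple stand-ins. In a full compiler expr_typecheck would return a newly created type that must be deleted after the comparison. The counter typecheck_errors and the wording of the message are assumptions.

```c
#include <stdio.h>

typedef enum { TYPE_BOOLEAN, TYPE_INTEGER, TYPE_STRING } type_t;

/* Reduced stand-in structures, for illustration only. */
struct type   { type_t kind; };
struct expr   { struct type type; };   /* pretends its type is already known */
struct symbol { struct type *type; };
struct stmt;

struct decl {
    char *name;
    struct expr *value;
    struct stmt *code;
    struct symbol *symbol;
    struct decl *next;
};

int typecheck_errors = 0;   /* assumed global error counter */

/* Stand-ins for the real functions described in this chapter. */
static struct type * expr_typecheck( struct expr *e ) { return &e->type; }
static int type_equals( struct type *a, struct type *b ) { return a->kind == b->kind; }
static void stmt_typecheck( struct stmt *s ) { (void)s; }

void decl_typecheck( struct decl *d )
{
    if(!d) return;
    if(d->value) {
        /* The initializer must match the declared type. */
        struct type *t = expr_typecheck(d->value);
        if(!type_equals(t, d->symbol->type)) {
            fprintf(stderr,
                "type error: initializer of %s does not match its type\n",
                d->name);
            typecheck_errors++;
        }
    }
    if(d->code) {
        /* A function declaration: check the statements in its body. */
        stmt_typecheck(d->code);
    }
    decl_typecheck(d->next);
}
```

Counting errors instead of stopping at the first one lets the compiler report every type error in a single run.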
7.7. ERROR MESSAGES
s: string = "hello";
b: boolean = false;
i: integer = s + (b<5);
But, your project compiler can very easily have much more detailed
error messages like this:
It’s just a matter of taking some care in printing out each of the expres-
sions and types involved when a problem is found:
myprintf(
    "error: cannot add a %T (%E) to a %T (%E)\n",
    lt,e->left,rt,e->right
);
7.8 Exercises