Compiler Construction II Handout

COMPILER CONSTRUCTION II
BOOTSTRAPPING
Bootstrapping is a process in which simple language is used to translate more complicated
program which in turn may handle for more complicated program. This complicated program
can further handle even more complicated program and so on. Bootstrapping has the following
usage:
1. Bootstrapping is widely used in the compilation development.
2. Bootstrapping is used to produce a self-hosting compiler. Self-hosting compiler is a type
of compiler that can compile its own source code.
3. Bootstrap compiler is used to compile the compiler and then you can use this compiled
compiler to compile everything else as well as future versions of itself.
A compiler can be characterized by three languages:

1. Source Language
2. Target Language
3. Implementation Language
The T- diagram shows a compiler SCIT for Source S, Target T, implemented in I.
Follow some steps to produce a new language L for machine A:

1. Create a compiler SCAA for subset, S of the desired language, L using language "A" and
that compiler runs on machine A.
2. Create a compiler LCSA for language L written in a subset of L.
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L,

which runs on machine A and produces code for machine A.
1
The process described by the T-diagrams is called bootstrapping.
Writing a compiler for any high level language is a complicated process. It takes lot of time to
write a compiler from scratch. Hence simple language is used to generate target code in some
stages. To clearly understand the Bootstrapping technique consider a following scenario.
Suppose we want to write a cross compiler for new language X. The implementation language of
this compiler is say Y and the target code being generated is in language Z. That is, we create
XYZ. Now if existing compiler Y runs on machine M and generates code for M then it is
denoted as YMM. Now if we run XYZ using YMM then we get a compiler XMZ. That means a
compiler for source language X that generates a target code in language Z and which runs on
machine M.
Following diagram illustrates the above scenario.
Example:
We can create compiler of many different forms. Now we will generate.
Compiler which takes C language and generates an assembly language as an output with the
availability of a machine of assembly language.
Step-1: First we write a compiler for a small of C in assembly language.
Step-2: Then using with small subset of C i.e. C0, for the source language c the compiler is
written.
Step-3: Finally we compile the second compiler. using compiler 1 the compiler 2 is compiled.
2
Step-4: Thus we get a compiler written in ASM which compiles C and generates code in ASM.
LEXICAL ANALYSIS
Lexical analysis is the very first phase in the compiler designing. It takes the modified source
code which is written in the form of sentences. In other words, it helps you to converts a
sequence of characters into a sequence of tokens. The lexical analysis breaks this syntax into a
series of tokens. It removes any extra space or comment written in the source code.
Programs that perform lexical analysis are called lexical analyzers or lexers. A lexer contains
tokenizer or scanner. If the lexical analyzer detects that the token is invalid, it generates an error.
It reads character streams from the source code, checks for legal tokens, and pass the data to the
syntax analyzer when it demands.
Example
How Pleasant Is The Weather?
See this example; Here, we can easily recognize that there are five words How Pleasant, The,
Weather, Is. This is very natural for us as we can recognize the separators, blanks, and the
punctuation symbol.
HowPl easantIs Th ewe ather?
Now, check this example, we can also read this. However, it will take some time because
separators are put in the Odd Places. It is not something which comes to you immediately.
Basic Terminologies
What's a lexeme?
A lexeme is a sequence of characters that are included in the source program according to the
matching pattern of a token. It is nothing but an instance of a token.
What's a token?
The token is a sequence of characters which represents a unit of information in the source
program.
What is Pattern?
A pattern is a description which is used by the token. In the case of a keyword which uses as a
token, the pattern is a sequence of characters.
Lexical Analyzer Architecture: How tokens are recognized

The main task of lexical analysis is to read input characters in the code and produce tokens.
Lexical analyzer scans the entire source code of the program. It identifies each token one by one.
Scanners are usually implemented to produce tokens only when requested by a parser. Here is
how this works-
3
"Get next token" is a command which is sent from the parser to the lexical analyzer.
On receiving this command, the lexical analyzer scans the input until it finds the next token.
It returns the token to Parser.
Lexical Analyzer skips whitespaces and comments while creating these tokens. If any error is
present, then Lexical analyzer will correlate that error with the source file and line number.
Roles of the Lexical analyzer

Lexical analyzer performs below given tasks:
1. Helps to identify token into the symbol table
2. Removes white spaces and comments from the source program
3. Correlates error messages with the source program
4. Helps you to expands the macros if it is found in the source program
5. Read input characters from the source program
Example of Lexical Analysis, Tokens, Non-Tokens

Consider the following code that is fed to Lexical Analyzer
#include <stdio.h>
int maximum(int x, int y) {
// This will compare 2 numbers
if (x > y)
return x;
else {
return y;
}
}
4
Examples of Tokens created

Lexeme Token
int Keyword
maximum Identifier
( Operator
int Keyword
x Identifier
, Operator
int Keyword
Y Identifier
) Operator
{ Operator
If Keyword
Examples of Nontokens
Type Examples
Comment // This will compare 2 numbers
Pre-processor directive #include <stdio.h>
Pre-processor directive #define NUMS 8,9
Macro NUMS
Whitespace /n /b /t
Lexical Errors
A character sequence which is not possible to scan into any valid token is a lexical error.
Important facts about the lexical error:
1. Lexical errors are not very common, but it should be managed by a scanner
2. Misspelling of identifiers, operators, keyword are considered as lexical errors
5
3. Generally, a lexical error is caused by the appearance of some illegal character, mostly at
the beginning of a token.
Error Recovery in Lexical Analyzer

Here, are a few most common error recovery techniques:
1. Removes one character from the remaining input
2. In the panic mode, the successive characters are always ignored until we reach a well-
formed token
3. By inserting the missing character into the remaining input
4. Replace a character with another character
5. Transpose two serial characters
The Lexical Analyzer and Parser Comparisons

Lexical Analyser Parser
Scan Input program Perform syntax analysis
Identify Tokens Create an abstract representation of the

code
Insert tokens into Symbol Table Update symbol table entries
It generates lexical errors It generates a parse tree of the source code
Reasons for Separating Lexer from the Parser

1. The simplicity of design: It eases the process of lexical analysis and the syntax analysis
by eliminating unwanted tokens
2. To improve compiler efficiency: Helps you to improve compiler efficiency
3. Specialization: specialized techniques can be applied to improves the lexical analysis
process
4. Portability: only the scanner requires to communicate with the outside world
5. Higher portability: input-device-specific peculiarities restricted to the lexer
Advantages of Lexical analysis

1. Lexical analyzer method is used by programs like compilers which can use the parsed
data from a programmer's code to create a compiled binary executable code
2. It is used by web browsers to format and display a web page with the help of parsed data
from JavsScript, HTML, CSS
3. A separate lexical analyzer helps you to construct a specialized and potentially more
efficient processor for the task
6
Disadvantage of Lexical analysis

1. You need to spend significant time reading the source program and partitioning it in the
form of tokens
2. Some regular expressions are quite difficult to understand compared to PEG or EBNF
rules
3. More effort is needed to develop and debug the lexer and its token descriptions
4. Additional runtime overhead is required to generate the lexer tables and construct the
tokens
SYNTAX ANALYSIS
Syntax analysis is a second phase of the compiler design process that comes after lexical
analysis. It analyses the syntactical structure of the given input. It checks if the given input is in
the correct syntax of the programming language in which the input which has been written. It is
known as the Parse Tree or Syntax Tree.
The Parse Tree is developed with the help of pre-defined grammar of the language. The syntax
analyzer also checks whether a given program fulfills the rules implied by a context-free
grammar. If it satisfies, the parser then creates the parse tree of that source program. Otherwise,
it will display error messages.
Intermediate
token
Parser Parse Rest of Representation
Source Lexical
Program Tree Frontend
Analyzer
get next token
Symbol Table
Why do you need Syntax Analyzer?

1. Check if the code is valid grammatically
2. The syntactical analyzer helps you to apply rules to the code
3. Helps you to make sure that each opening brace has a corresponding closing balance
4. Each declaration has a type and that the type must be exists
Important Syntax Analyzer Terminology

Important terminologies used in syntax analysis process:
a. Sentence: A sentence is a group of character over some alphabet.
b. Lexeme: A lexeme is the lowest level syntactic unit of a language (e.g., total, start).
c. Token: A token is just a category of lexemes.
d. Keywords and reserved words – It is an identifier which is used as a fixed part of the
syntax of a statement. It is a reserved word which you can't use as a variable name or
identifier.
e. Noise words - Noise words are optional which are inserted in a statement to enhance the
readability of the sentence.
f. Comments – It is a very important part of the documentation. It mostly display by, /* */,
or//Blank (spaces)
7
g. Delimiters – It is a syntactic element which marks the start or end of some syntactic unit.
Like a statement or expression, "begin"...''end", or {}.
h. Character set - ASCII, Unicode
i. Identifiers – It is a restrictions on the length which helps you to reduce the readability of
the sentence.
j. Operator symbols - + and – performs two basic arithmetic operations.
k. Syntactic elements of the Language
Importance of Parsing
A parse also checks that the input string is well-formed, and if not, reject it.
Following are important tasks perform by the parser:

a. Helps you to detect all types of Syntax errors
8
b. Find the position at which error has occurred

c. Clear & accurate description of the error.
d. Recovery from an error to continue and find further errors in the code.
e. Should not affect compilation of "correct" programs.
f. The parse must reject invalid texts by reporting syntax errors
Parsing Techniques
Parsing techniques are divided into two different groups:
a. Top-Down Parsing,
b. Bottom-Up Parsing
Top-Down Parsing:
In the top-down parsing construction of the parse tree starts at the root and then proceeds towards
the leaves.
Two types of Top-down parsing are:

1. Predictive Parsing:
Predictive parse can predict which production should be used to replace the specific input
string. The predictive parser uses look-ahead point, which points towards next input
symbols. Backtracking is not an issue with this parsing technique. It is known as LL(1)
Parser
2. Recursive Descent Parsing:
This parsing technique recursively parses the input to make a parse tree. It consists of
several small functions, one for each nonterminal in the grammar.
Bottom-Up Parsing:
In the bottom-up parsing technique the construction of the parse tree starts with the leave, and
then it processes towards its root. It is also called as shift-reduce parsing. This type of parsing is
created with the help of using some software tools.
Error – Recovery Methods

Common Errors that occur in Parsing
1. Lexical: Name of an incorrectly typed identifier
2. Syntactical: unbalanced parenthesis or a missing semicolon
3. Semantical: incompatible value assignment
4. Logical: Infinite loop and not reachable code
A parser should able to detect and report any error found in the program. So, whenever an error
occurred the parser. It should be able to handle it and carry on parsing the remaining input. A
program can have following types of errors at various compilation process stages. There are five
common error-recovery methods which can be implemented in the parser
Statement mode recovery

1. In the case when the parser encounters an error, it helps you to take corrective steps. This
allows rest of inputs and states to parse ahead.
9
2. For example, adding a missing semicolon is comes in statement mode recover method.
However, parse designer need to be careful while making these changes as one wrong
correction may lead to an infinite loop.
Panic-Mode recovery
1. In the case when the parser encounters an error, this mode ignores the rest of the
statement and not process input from erroneous input to delimiter, like a semi-colon. This
is a simple error recovery method.
2. In this type of recovery method, the parser rejects input symbols one by one until a single
designated group of synchronizing tokens is found. The synchronizing tokens generally
using delimiters like or.
Phrase-Level Recovery:
Compiler corrects the program by inserting or deleting tokens. This allows it to proceed to parse
from where it was. It performs correction on the remaining input. It can replace a prefix of the
remaining input with some string this helps the parser to continue the process.
Error Productions
Error production recovery expands the grammar for the language which generates the erroneous
constructs. The parser then performs error diagnostic about that construct.
Global Correction:
The compiler should make less number of changes as possible while processing an incorrect
input string. Given incorrect input string a and grammar c, algorithms will search for a parse tree
for a related string b. Like some insertions, deletions, and modification made of tokens needed to
transform an into b is as little as possible.
Grammar
A grammar is a set of structural rules which describe a language. Grammars assign structure to
any sentence. This term also refers to the study of these rules, and this file includes morphology,
phonology, and syntax. It is capable of describing many, of the syntax of programming
languages.
Rules of Form Grammar

1. The non-terminal symbol should appear to the left of the at least one production
2. The goal symbol should never be displayed to the right of the::= of any production
3. A rule is recursive if LHS appears in its RHS
Notational Conventions
Notational conventions symbol may be indicated by enclosing the element in square brackets. It
is an arbitrary sequence of instances of the element which can be indicated by enclosing the
element in braces followed by an asterisk symbol, { ... }*.
10
It is a choice of the alternative which may use the symbol within the single rule. It may be
enclosed by parenthesis ([,] ) when needed.
Two types of Notational conventions area Terminal and Non-terminals

1. Terminals:
a. Lower-case letters in the alphabet such as a, b, c,
b. Operator symbols such as +,-, *, etc.
c. Punctuation symbols such as parentheses, hash, comma
d. 0, 1, ..., 9 digits
e. Boldface strings like id or if, anything which represents a single terminal symbol
2. Nonterminals:
a. Upper-case letters such as A, B, C
b. Lower-case italic names: the expression or some
Context Free Grammar

A CFG is a left-recursive grammar that has at least one production of the type. The rules in a
context-free grammar are mainly recursive. A syntax analyzer checks that specific program
satisfies all the rules of Context-free grammar or not. If it does meet, these rules syntax analyzers
may create a parse tree for that programme.
expression -> expression -+ term
expression -> expression – term
expression-> term
term -> term * factor
term -> expression/ factor
term -> factor factor
factor -> ( expression )
factor -> id
Grammar Derivation
Grammar derivation is a sequence of grammar rule which transforms the start symbol into the
string. A derivation proves that the string belongs to the grammar's language.
Left-most Derivation
When the sentential form of input is scanned and replaced in left to right sequence, it is known as
left-most derivation. The sentential form which is derived by the left-most derivation is called
the left-sentential form.
Right-most Derivation
Rightmost derivation scan and replace the input with production rules, from right to left,
sequence. It's known as right-most derivation. The sentential form which is derived from the
rightmost derivation is known as right-sentential form.
11
Syntax and Lexical Analyser: Comparisons

Syntax Analyser Lexical Analyser
The syntax analyzer mainly The lexical analyzer eases the
deals with recursive constructs task of the syntax analyzer.
of the language.
The syntax analyzer works on The lexical analyzer
tokens in a source program to recognizes the token in a
recognize meaningful source program.
structures in the programming
language.
It receives inputs, in the form It is responsible for the validity
of tokens, from lexical of a token supplied by the
analyzers. syntax analyzer
Disadvantages of using Syntax Analysers

1. It will never determine if a token is valid or not
2. Not helps you to determine if an operation performed on a token type is valid or not
3. You can't decide that token is declared & initialized before it is being used
SEMANTIC ANALYSIS
A semantic analyzer checks the semantics of a program, that is, whether the language constructs
are meaningful or not. A semantic analyzer mainly performs static type checking.
A compiler must ensure that the source program follows the syntax and semantic conventions of
the source language. Once the syntax is verified, the next task to be performed by a compiler is
to check the semantics of the language. A semantic analyzer shown in Figure 3.1 mainly verifies
whether the language constructs are meaningful (semantics) or not. This is called even static type
checking, which ensures that certain kinds of programming errors will be detected and reported.
Parsing cannot detect some errors. Some errors are captured during compile time called static
checking (e.g., type compatibility). Languages like C, C++, C#, Java, and Haskell uses static
checking. Static checking is even called early binding. During static checking programming
errors are caught early. This causes program execution to be efficient. Static checking not only
increases the efficiency and reliability of the compiled program, but also makes execution faster.
Semantic Analysis is the third phase of Compiler. Semantic Analysis makes sure that
declarations and statements of program are semantically correct. It is a collection of procedures
which is called by parser as and when required by grammar. Both syntax tree of previous phase
and symbol table are used to check the consistency of the given code. Type checking is an
important part of semantic analysis where compiler makes sure that each operator has matching
operands.
12
Semantic Analyzer:
It uses syntax tree and symbol table to check whether the given program is semantically
consistent with language definition. It gathers type information and stores it in either syntax tree
or symbol table. This type information is subsequently used by compiler during intermediate-
code generation.
Semantic Errors:
Errors recognized by semantic analyzer are as follows:
 Type mismatch
 Undeclared variables
 Reserved identifier misuse
Functions of Semantic Analysis:

1. Type Checking –
Ensures that data types are used in a way consistent with their definition.
2. Label Checking –
A program should contain labels references.
3. Flow Control Check –
Keeps a check that control structures are used in a proper manner.(example: no break
statement outside a loop)
Example:
float x = 10.1;
float y = x*20;
In the above example integer 30 will be typecasted to float 30.0 before multiplication, by
semantic analyzer.
Static and Dynamic Semantics

Static Semantics –
It is named so because of the fact that these are checked at compile time. The static semantics
and meaning of program during execution, are indirectly related.
Dynamic Semantic Analysis –

It defines the meaning of different units of program like expressions and statements. These are
checked at runtime unlike static semantics.
SYMBOL TABLE
Symbol table is an important data structure used in a compiler.
13
Symbol table is used to store the information about the occurrence of various entities such as
objects, classes, variable name, interface, function name etc. it is used by both the analysis and
synthesis phases.
The symbol table used for following purposes:

3. It is used to store the name of all entities in a structured form at one place.
4. It is used to verify if a variable has been declared.
5. It is used to determine the scope of a name.
6. It is used to implement type checking by verifying assignments and expressions in the
source code are semantically correct.
A symbol table can either be linear or a hash table. Using the following format, it maintains the
entry for each name.
<symbol name, type, attribute>
For example, suppose a variable store the information about the following variable declaration:
static int salary then, it stores an entry in the following format:
<salary, int, static>
The clause attribute contains the entries related to the name.
Implementation
The symbol table can be implemented in the unordered list if the compiler is used to handle the
small amount of data.
A symbol table can be implemented in one of the following techniques:
1. Linear (sorted or unsorted) list
2. Hash table
3. Binary search tree
Symbol table are mostly implemented as hash table.
Operations
The symbol table provides the following operations:
Insert ()
Insert () operation is more frequently used in the analysis phase when the tokens are identified
and names are stored in the table.
The insert() operation is used to insert the information in the symbol table like the unique name
occurring in the source code.
In the source code, the attribute for a symbol is the information associated with that symbol. The
information contains the state, value, type and scope about the symbol.
The insert () function takes the symbol and its value in the form of argument.
For example:
14
int x;
Should be processed by the compiler as:

insert (x, int)
lookup()
In the symbol table, lookup() operation is used to search a name. It is used to determine:
 The existence of symbol in the table.
 The declaration of the symbol before it is used.
 Check whether the name is used in the scope.
 Initialization of the symbol.
 Checking whether the name is declared multiple times.
The basic format of lookup() function is as follows:

lookup (symbol)
This format is varies according to the programming language.
Data structure for symbol table

1. A compiler contains two type of symbol table: global symbol table and scope symbol
table.
2. Global symbol table can be accessed by all the procedures and scope symbol table.
The scope of a name and symbol table is arranged in the hierarchy structure as shown below:
int value=10;

void sum_num()
{
int num_1;
int num_2;

{
int num_3;
int num_4;
}

int num_5;

{
int_num 6;
int_num 7;
}
}

Void sum_id
{
int id_1;
15
int id_2;

{
int id_3;
int id_4;
}

int num_5;
}
The above grammar can be represented in a hierarchical data structure of symbol tables:
The global symbol table contains one global variable and two procedure names. The name
mentioned in the sum_num table is not available for sum_id and its child tables.
Data structure hierarchy of symbol table is stored in the semantic analyzer. If you want to search
the name in the symbol table then you can search it using the following algorithm:
 First a symbol is searched in the current symbol table.
 If the name is found then search is completed else the name will be searched in the
symbol table of parent until,
 The name is found or global symbol is searched.
16
Representing Scope Information

In the source program, every name possesses a region of validity, called the scope of that name.
The rules in a block-structured language are as follows:
1. If a name declared within block B then it will be valid only within B.
2. If B1 block is nested within B2 then the name that is valid for block B2 is also valid for
B1 unless the name's identifier is re-declared in B1.
 These scope rules need a more complicated organization of symbol table than a list of
associations between names and attributes.
 Tables are organized into stack and each table contains the list of names and their
associated attributes.
 Whenever a new block is entered then a new table is entered onto the stack. The new
table holds the name that is declared as local to this block.
 When the declaration is compiled then the table is searched for a name.
 If the name is not found in the table then the new name is inserted.
 When the name's reference is translated then each table is searched, starting from the
each table on the stack.
For example:
int x;
void f(int m) {
float x, y;
{
int i, j;
int u, v;
}
}
int g (int n)
{
bool t;
}
17
Fig: Symbol table organization that complies with static scope information rules
INTERMEDIATE CODE GENERATION

In the analysis-synthesis model of a compiler, the front end of a compiler translates a source
program into an independent intermediate code, then the back end of the compiler uses this
intermediate code to generate the target code (which can be understood by the machine).
The benefits of using machine independent intermediate code are:

1. Because of the machine independent intermediate code, portability will be enhanced. For
ex, suppose, if a compiler translates the source language to its target machine language
without having the option for generating intermediate code, then for each new machine, a
full native compiler is required. Because, obviously, there were some modifications in the
compiler itself according to the machine specifications.
2. Retargeting is facilitated
3. It is easier to apply source code modification to improve the performance of source code
by optimising the intermediate code.
18
If we generate machine code directly from source code then for n target machine we will have n
optimisers and n code generators but if we will have a machine independent intermediate code,
we will have only one optimiser. Intermediate code can be either language specific (e.g.,
Bytecode for Java) or language. independent (three-address code).
The following are commonly used intermediate code representation:

1. Postfix Notation –
The ordinary (infix) way of writing the sum of a and b is with operator in the middle: a +
b
The postfix notation for the same expression places the operator at the right end as ab +.
In general, if e1 and e2 are any postfix expressions, and + is any binary operator, the
result of applying + to the values denoted by e1 and e2 is postfix notation by e1e2 +. No
parentheses are needed in postfix notation because the position and arity (number of
arguments) of the operators permit only one way to decode a postfix expression. In
postfix notation the operator follows the operand.
Example – The postfix representation of the expression (a – b) * (c + d) + (a – b) is: ab –
cd + *ab -+.
2. Three-Address Code –
A statement involving no more than three references (two for operands and one for result)
is known as three address statement. A sequence of three address statements is known as
three address code. Three address statement is of the form x = y op z , here x, y, z will
have address (memory location). Sometimes a statement might contain less than three
references but it is still called three address statement.
Example – The three address code for the expression a + b * c + d:
19
T1 = b * c
T2 = a + T1
T3 = T2 + d
T1, T2, T3 are temporary variables.
3. Syntax Tree –
Syntax tree is nothing more than condensed form of a parse tree. The operator and
keyword nodes of the parse tree are moved to their parents and a chain of single
productions is replaced by single link in syntax tree the internal nodes are operators and
child nodes are operands. To form syntax tree put parentheses in the expression, this way
it's easy to recognize which operand should come first.
Example –
x = (a + b * c) / (a – b * c)
CODE OPTIMIZATION
The code optimization in the synthesis phase is a program transformation technique, which tries
to improve the intermediate code by making it consume fewer resources (i.e. CPU, Memory) so
that faster-running machine code will result. Compiler optimizing process should meet the
following objectives:
1. The optimization must be correct, it must not, in any way, change the meaning of the
program.
2. Optimization should increase the speed and performance of the program.
3. The compilation time must be kept reasonable.
4. The optimization process should not delay the overall compiling process.
20
When to Optimize?
Optimization of the code is often performed at the end of the development stage since it reduces
readability and adds code that is used to increase the performance.
Types of Code Optimization –The optimization process can be broadly classified into two types :
1. Machine Independent Optimization – This code optimization phase attempts to improve
the intermediate code to get a better target code as the output. The part of the
intermediate code which is transformed here does not involve any CPU registers or
absolute memory locations.
2. Machine Dependent Optimization – Machine-dependent optimization is done after
the target code has been generated and when the code is transformed according to the
target machine architecture. It involves CPU registers and may have absolute memory
references rather than relative references. Machine-dependent optimizers put efforts to
take maximum advantage of the memory hierarchy.
Code Optimization is done in the following different ways:

1. Compile Time Evaluation :
(i) A = 2*(22.0/7.0)*r
Perform 2*(22.0/7.0)*r at compile time.
(ii) x = 12.4
y = x/2.3
Evaluate x/2.3 as 12.4/2.3 at compile time.
2. Variable Propagation :
//Before Optimization
c = a * b
x = a
till
d = x * b + 4

//After Optimization
c = a * b
x=a
till
d=a*b+4
Hence, after variable propagation, a*b and x*b will be identified as common sub-
expression.
3. Dead code elimination : Variable propagation often leads to making assignment

statement into dead code
c = a * b
x = a
till
21
d = a * b + 4

//After elimination :
c=a*b
till
d=a*b+4
4. Code Motion :
 Reduce the evaluation frequency of expression.
 Bring loop invariant statements out of the loop.
a = 200;
while(a>0)
{
b = x + y;
if (a % b == 0}
printf(“%d”, a);
}

//This code can be further optimized as
a = 200;
b = x + y;
while(a>0)
{
if (a % b == 0}
printf(“%d”, a);
}
5. Induction Variable and Strength Reduction :

 An induction variable is used in loop for the following kind of assignment i = i +
constant.
 Strength reduction means replacing the high strength operator by the low strength.
i = 1;
while (i<10)
{
y = i * 4;
}

//After Reduction
i=1
t=4
{
while( t<40)
22
y = t;
t = t + 4;
}
CODE GENERATION
Code generation can be considered as the final phase of compilation. Through post code
generation, optimization process can be applied on the code, but that can be seen as a part of
code generation phase itself. The code generated by the compiler is an object code of some
lower-level programming language, for example, assembly language. We have seen that the
source code written in a higher-level language is transformed into a lower-level language that
results in a lower-level object code, which should have the following minimum properties:
1. It should carry the exact meaning of the source code.
2. It should be efficient in terms of CPU usage and memory management.
We will now see how the intermediate code is transformed into target object code (assembly
code, in this case).
Directed Acyclic Graph

Directed Acyclic Graph (DAG) is a tool that depicts the structure of basic blocks, helps to see the
flow of values flowing among the basic blocks, and offers optimization too. DAG provides easy
transformation on basic blocks. DAG can be understood here:
 Leaf nodes represent identifiers, names or constants.
 Interior nodes represent operators.
 Interior nodes also represent the results of expressions or the identifiers/name where the
values are to be stored or assigned.
Example:
t0 = a + b
t1 = t0 + c
d = t0 + t1
[t0 = a + b]
[t1 = t0 + c]
[d = t0 + t1]
23
Peephole Optimization
This optimization technique works locally on the source code to transform it into an optimized
code. By locally, we mean a small portion of the code block at hand. These methods can be
applied on intermediate codes as well as on target codes. A bunch of statements is analyzed and
are checked for the following possible optimization:
Redundant instruction elimination

At source code level, the following can be done by the user:
int add_ten(int x) int add_ten(int x) int add_ten(int x) int add_ten(int x)
{ { { {
int y, z; int y; int y = 10; return x + 10;
y = 10; y = 10; return x + y; }
z = x + y; y = x + y; }
return z; return y;
} }
At compilation level, the compiler searches for instructions redundant in nature. Multiple loading
and storing of instructions may carry the same meaning even if some of them are removed. For
example:
 MOV x, R0
 MOV R0, R1
We can delete the first instruction and re-write the sentence as:
MOV x, R1
Unreachable code
Unreachable code is a part of the program code that is never accessed because of programming
constructs. Programmers may have accidently written a piece of code that can never be reached.
Example:
void add_ten(int x)
{
return x + 10;
printf(“value of x is %d”, x);
}
In this code segment, the printf statement will never be executed as the program control returns
back before it can execute, hence printf can be removed.
Flow of control optimization

There are instances in a code where the program control jumps back and forth without
performing any significant task. These jumps can be removed. Consider the following chunk of
code:
...
MOV R1, R2
GOTO L1
24
...
L1 : GOTO L2
L2 : INC R1
In this code,label L1 can be removed as it passes the control to L2. So instead of jumping to L1
and then to L2, the control can directly reach L2, as shown below:
...
MOV R1, R2
GOTO L2
...
L2 : INC R1
Algebraic expression simplification

There are occasions where algebraic expressions can be made simple. For example, the
expression a = a + 0 can be replaced by a itself and the expression a = a + 1 can simply be
replaced by INC a.
Strength reduction
There are operations that consume more time and space. Their ‘strength’ can be reduced by
replacing them with other operations that consume less time and space, but produce the same
result.
For example, x * 2 can be replaced by x << 1, which involves only one left shift. Though the
output of a * a and a2 is same, a2 is much more efficient to implement.
Accessing machine instructions

The target machine can deploy more sophisticated instructions, which can have the capability to
perform specific operations much efficiently. If the target code can accommodate those
instructions directly, that will not only improve the quality of code, but also yield more efficient
results.
Code Generator
A code generator is expected to have an understanding of the target machine’s runtime
environment and its instruction set. The code generator should take the following things into
consideration to generate the code:
 Target language : The code generator has to be aware of the nature of the target language
for which the code is to be transformed. That language may facilitate some machine-
specific instructions to help the compiler generate the code in a more convenient way.
The target machine can have either CISC or RISC processor architecture.
 IR Type : Intermediate representation has various forms. It can be in Abstract Syntax
Tree (AST) structure, Reverse Polish Notation, or 3-address code.
 Selection of instruction : The code generator takes Intermediate Representation as input
and converts (maps) it into target machine’s instruction set. One representation can have
many ways (instructions) to convert it, so it becomes the responsibility of the code
generator to choose the appropriate instructions wisely.
 Register allocation : A program has a number of values to be maintained during the
execution. The target machine’s architecture may not allow all of the values to be kept in
25
the CPU memory or registers. Code generator decides what values to keep in the
registers. Also, it decides the registers to be used to keep these values.
 Ordering of instructions : At last, the code generator decides the order in which the
instruction will be executed. It creates schedules for instructions to execute them.
Descriptors
The code generator has to track both the registers (for availability) and addresses (location of
values) while generating the code. For both of them, the following two descriptors are used:
 Register descriptor : Register descriptor is used to inform the code generator about the
availability of registers. Register descriptor keeps track of values stored in each register.
Whenever a new register is required during code generation, this descriptor is consulted
for register availability.
 Address descriptor : Values of the names (identifiers) used in the program might be
stored at different locations while in execution. Address descriptors are used to keep track
of memory locations where the values of identifiers are stored. These locations may
include CPU registers, heaps, stacks, memory or a combination of the mentioned
locations.
Code generator keeps both the descriptor updated in real-time. For a load statement, LD R1, x,
the code generator:
 updates the Register Descriptor R1 that has value of x and
 updates the Address Descriptor (x) to show that one instance of x is in R1.
Code Generation
Basic blocks comprise of a sequence of three-address instructions. Code generator takes these
sequence of instructions as input.
Note : If the value of a name is found at more than one place (register, cache, or memory), the
register’s value will be preferred over the cache and main memory. Likewise cache’s value will
be preferred over the main memory. Main memory is barely given any preference.
getReg : Code generator uses getReg function to determine the status of available registers and
the location of name values. getReg works as follows:
 If variable Y is already in register R, it uses that register.
 Else if some register R is available, it uses that register.
 Else if both the above options are not possible, it chooses a register that requires minimal
number of load and store instructions.
For an instruction x = y OP z, the code generator may perform the following actions. Let us
assume that L is the location (preferably register) where the output of y OP z is to be saved:
 Call function getReg, to decide the location of L.
 Determine the present location (register or memory) of y by consulting the Address
Descriptor of y. If y is not presently in register L, then generate the following instruction
to copy the value of y to L:
MOV y’, L
26
where y’ represents the copied value of y.

 Determine the present location of z using the same method used in step 2 for y and
generate the following instruction:
OP z’, L
where z’ represents the copied value of z.

 Now L contains the value of y OP z, that is intended to be assigned to x. So, if L is a
register, update its descriptor to indicate that it contains the value of x. Update the
descriptor of x to indicate that it is stored at location L.
 If y and z has no further use, they can be given back to the system.
Other code constructs like loops and conditional statements are transformed into assembly
language in general assembly way.
27

Compiler Construction II Handout

Uploaded by

Copyright:

Available Formats

Compiler Construction II Handout

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Compiler Construction II Handout

Uploaded by

Copyright:

Available Formats

COMPILER CONSTRUCTION II

A compiler can be characterized by three languages:

The T- diagram shows a compiler SCIT for Source S, Target T, implemented in I.

Follow some steps to produce a new language L for machine A:

2. Create a compiler LCSA for language L written in a subset of L.

3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L,

The process described by the T-diagrams is called bootstrapping.

Step-1: First we write a compiler for a small of C in assembly language.

Lexical Analyzer Architecture: How tokens are recognized

Roles of the Lexical analyzer

Example of Lexical Analysis, Tokens, Non-Tokens

Examples of Tokens created

Comment // This will compare 2 numbers

Pre-processor directive #include <stdio.h>

Pre-processor directive #define NUMS 8,9

Error Recovery in Lexical Analyzer

The Lexical Analyzer and Parser Comparisons

Scan Input program Perform syntax analysis

Identify Tokens Create an abstract representation of the

Insert tokens into Symbol Table Update symbol table entries

It generates lexical errors It generates a parse tree of the source code

Reasons for Separating Lexer from the Parser

Advantages of Lexical analysis

Disadvantage of Lexical analysis

Why do you need Syntax Analyzer?

Important Syntax Analyzer Terminology

Following are important tasks perform by the parser:

b. Find the position at which error has occurred

Two types of Top-down parsing are:

Error – Recovery Methods

Statement mode recovery

Rules of Form Grammar

Two types of Notational conventions area Terminal and Non-terminals

Context Free Grammar

Syntax and Lexical Analyser: Comparisons

Disadvantages of using Syntax Analysers

Functions of Semantic Analysis:

Static and Dynamic Semantics

Dynamic Semantic Analysis –

The symbol table used for following purposes:

The clause attribute contains the entries related to the name.

Symbol table are mostly implemented as hash table.

Should be processed by the compiler as:

The basic format of lookup() function is as follows:

Data structure for symbol table

Representing Scope Information

INTERMEDIATE CODE GENERATION

The benefits of using machine independent intermediate code are:

The following are commonly used intermediate code representation:

Code Optimization is done in the following different ways:

3. Dead code elimination : Variable propagation often leads to making assignment

5. Induction Variable and Strength Reduction :

Directed Acyclic Graph

Redundant instruction elimination

Flow of control optimization

Algebraic expression simplification

Accessing machine instructions

where y’ represents the copied value of y.

where z’ represents the copied value of z.