Bootstrapping is a process in which a simple language is used to translate a more complicated
program, which in turn may handle an even more complicated program. This complicated program
can further handle an even more complicated program, and so on. Bootstrapping has the following
characteristics:
1. Bootstrapping is widely used in compiler development.
2. Bootstrapping is used to produce a self-hosting compiler. A self-hosting compiler is a type
of compiler that can compile its own source code.
3. A bootstrap compiler is used to compile the compiler, and this compiled compiler can then be
used to compile everything else as well as future versions of itself.
Writing a compiler for any high-level language is a complicated process. It takes a lot of time to
write a compiler from scratch. Hence a simple language is used to generate target code in some
stages. To understand the bootstrapping technique clearly, consider the following scenario.
Suppose we want to write a cross compiler for a new language X. The implementation language of
this compiler is, say, Y and the target code being generated is in language Z. That is, we create
XYZ. Now if an existing compiler for Y runs on machine M and generates code for M, it is
denoted as YMM. If we now run XYZ through YMM, we get a compiler XMZ, that is, a
compiler for source language X that generates target code in language Z and which runs on
machine M.
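The composition above can be sketched in a few lines of Python. This is only an illustration: the
Compiler record and the compile_compiler function are hypothetical names introduced here, and each
compiler is described by nothing more than its three languages (source, implementation, target).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compiler:
    """A compiler described by its three languages (hypothetical helper)."""
    source: str  # language it accepts
    impl: str    # language it is written in (what it runs on)
    target: str  # language it emits


def compile_compiler(subject: Compiler, host: Compiler) -> Compiler:
    """Translate `subject` with `host`; only the implementation language changes."""
    if subject.impl != host.source:
        raise ValueError("host cannot read the subject's implementation language")
    return Compiler(subject.source, host.target, subject.target)


xyz = Compiler("X", "Y", "Z")  # cross compiler for X, written in Y, emitting Z
ymm = Compiler("Y", "M", "M")  # existing Y compiler that runs on M and emits M code
xmz = compile_compiler(xyz, ymm)
print(xmz)
```

The source and target languages of XYZ are unchanged; only the language it is written in moves from
Y to M, which is why the result is written XMZ.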
The following diagram illustrates the above scenario.
We can create compilers of many different forms. Now we will generate a compiler which takes the
C language as input and generates assembly language as output, given the availability of a
machine that runs assembly language.
Step-2: Then, using the small subset of C, i.e. C0, a compiler for the source language C is written.
Step-3: Finally, we compile the second compiler: using compiler 1, compiler 2 is compiled.
Step-4: Thus we get a compiler written in ASM which compiles C and generates code in ASM.
Lexical analysis is the very first phase of compiler design. It takes the modified source
code, which is written in the form of sentences, and converts the sequence of characters into a
sequence of tokens. It also removes any extra whitespace and comments written in the source code.
Programs that perform lexical analysis are called lexical analyzers or lexers. A lexer contains
a tokenizer or scanner. If the lexical analyzer detects that a token is invalid, it generates an
error. It reads character streams from the source code, checks for legal tokens, and passes the
data to the syntax analyzer when it demands.
How Pleasant Is The Weather?
In this example, we can easily recognize that there are five words: How, Pleasant, Is, The,
Weather. This is very natural for us as we can recognize the separators, blanks, and the
punctuation symbol.
HowPl easantIs Th ewe ather?
We can also read this example. However, it takes some time because the separators are put in
odd places; the words do not come to you immediately.
Basic Terminologies
What's a lexeme?
A lexeme is a sequence of characters in the source program that matches the pattern of a token.
It is nothing but an instance of a token.
What's a token?
A token is a sequence of characters which represents a unit of information in the source
program.
What's a pattern?
A pattern is a description of the form that the lexemes of a token may take. In the case of a
keyword used as a token, the pattern is simply the sequence of characters that forms the keyword.
"Get next token" is a command which is sent from the parser to the lexical analyzer.
On receiving this command, the lexical analyzer scans the input until it finds the next token.
It then returns the token to the parser.
The lexical analyzer skips whitespace and comments while creating these tokens. If any error is
present, the lexical analyzer correlates that error with the source file and line number.
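The interaction described above can be sketched as a small Python generator. This is a minimal
sketch, not a full lexer: the token classes, the keyword set and the function name tokens are
assumptions made for a tiny C-like language, and each call to next() plays the role of the
"get next token" command.

```python
import re

# Hypothetical token classes for a tiny C-like language (assumed for illustration).
TOKEN_SPEC = [
    ("COMMENT",    r"//[^\n]*|/\*.*?\*/"),
    ("WHITESPACE", r"\s+"),
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"[(){},;<>=+\-*/]"),
]
KEYWORDS = {"int", "if", "else", "return"}
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC), re.DOTALL
)


def tokens(source):
    """Yield (token, lexeme) pairs on demand, skipping whitespace and comments."""
    pos, line = 0, 1
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # No pattern matches here: report a lexical error with its line number.
            raise SyntaxError(f"line {line}: illegal character {source[pos]!r}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        line += lexeme.count("\n")
        if kind in ("COMMENT", "WHITESPACE"):
            continue
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        yield kind, lexeme


stream = tokens("int maximum(int x, int y) { // compare\n if (x > y) return x; }")
first_four = [next(stream) for _ in range(4)]  # four "get next token" requests
print(first_four)
```

Because the lexer is a generator, it scans only as far as the parser has asked, which mirrors the
on-demand behaviour described in the text.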
Lexeme      Token
int         Keyword
maximum     Identifier
(           Operator
int         Keyword
x           Identifier
,           Operator
int         Keyword
y           Identifier
)           Operator
{           Operator
if          Keyword
Examples of Nontokens
Type          Examples
Macro         NUMS
Whitespace    \n \b \t
Lexical Errors
A character sequence which is not possible to scan into any valid token is a lexical error.
Important facts about the lexical error:
1. Lexical errors are not very common, but they should be managed by the scanner.
2. Misspellings of identifiers, operators, and keywords are considered lexical errors.
3. Generally, a lexical error is caused by the appearance of some illegal character, mostly at
the beginning of a token.
Syntax analysis is the second phase of the compiler design process and comes after lexical
analysis. It analyses the syntactical structure of the given input and checks whether the input
is in the correct syntax of the programming language in which it has been written. The structure
it builds is known as the parse tree or syntax tree.
The parse tree is developed with the help of the pre-defined grammar of the language. The syntax
analyzer checks whether a given program fulfills the rules implied by a context-free
grammar. If it does, the parser creates the parse tree of that source program. Otherwise,
it displays error messages.
[Figure: Source program -> Lexical analyzer -> (token / "get next token") -> Parser -> Parse tree -> Rest of front end -> Representation; Symbol table]
g. Delimiters - syntactic elements which mark the start or end of some syntactic unit, such as
a statement or expression: "begin"..."end", or {}.
h. Character set - ASCII, Unicode.
i. Identifiers - restrictions on their length can reduce the readability of the sentence.
j. Operator symbols - + and - perform two basic arithmetic operations.
k. Syntactic elements of the language.
Importance of Parsing
A parser also checks that the input string is well-formed and, if not, rejects it.
Parsing Techniques
Parsing techniques are divided into two different groups:
a. Top-Down Parsing,
b. Bottom-Up Parsing
Top-Down Parsing:
In top-down parsing, construction of the parse tree starts at the root and then proceeds towards
the leaves.
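A common way to build a top-down parser by hand is recursive descent. The sketch below is an
illustration only: the grammar (expr -> term {+ term}, term -> factor {* factor},
factor -> name | ( expr )) and the function name parse_expression are assumptions, and the tree is
returned as nested tuples.

```python
def parse_expression(tokens):
    """Top-down parse of: expr -> term {+ term}, term -> factor {* factor},
    factor -> name | ( expr ). Returns the tree as nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    def term():
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def factor():
        token = peek()
        if token == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if token is not None and token.isidentifier():
            advance()
            return token
        raise SyntaxError(f"unexpected token {token!r}")

    tree = expr()  # start at the root (the start symbol) and work down to the leaves
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


tree = parse_expression(["a", "+", "b", "*", "c"])
print(tree)
```

Each grammar rule becomes one function, so the call structure follows the parse tree from the root
downwards.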
Bottom-Up Parsing:
In the bottom-up parsing technique, construction of the parse tree starts with the leaves and
then proceeds towards the root. It is also called shift-reduce parsing. Parsers of this type are
usually built with the help of software tools.
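The shift and reduce actions can be illustrated with a deliberately tiny, hand-written sketch. It
is an assumption-laden example: it handles only the grammar E -> E + id | id, and real bottom-up
parsers are table-driven rather than hard-coded like this.

```python
def shift_reduce(tokens):
    """Recognise strings of the assumed grammar E -> E + id | id.

    Returns (accepted, trace), where trace lists the shift and reduce actions taken.
    """
    stack, trace = [], []
    for token in tokens:
        stack.append(token)                      # shift the next input symbol
        trace.append(("shift", token))
        if stack[-3:] == ["E", "+", "id"]:       # handle on top of the stack
            stack[-3:] = ["E"]
            trace.append(("reduce", "E -> E + id"))
        elif stack == ["id"]:
            stack[:] = ["E"]
            trace.append(("reduce", "E -> id"))
    return stack == ["E"], trace


accepted, trace = shift_reduce(["id", "+", "id"])
print(accepted)
for action in trace:
    print(action)
```

The parse starts from the leaves (the id tokens) and ends when only the start symbol E remains on
the stack.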
2. For example, adding a missing semicolon falls under the statement-mode recovery method.
However, the parser designer needs to be careful while making these changes, as one wrong
correction may lead to an infinite loop.
Panic-Mode recovery
1. When the parser encounters an error, this mode ignores the rest of the statement and does
not process the input from the erroneous token up to a delimiter, like a semicolon. This
is a simple error recovery method.
2. In this type of recovery method, the parser rejects input symbols one by one until a token
from a designated set of synchronizing tokens is found. The synchronizing tokens are
generally delimiters, such as a semicolon.
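The token-skipping step of panic-mode recovery can be sketched as follows. The synchronizing set
and the function name skip_to_sync are assumptions chosen for illustration; a real parser would
call such a routine from its error handler.

```python
SYNC_TOKENS = {";", "}"}  # assumed synchronizing set for this sketch


def skip_to_sync(tokens, pos):
    """Discard tokens from `pos` until a synchronizing token has been consumed.

    Returns the index at which parsing should resume.
    """
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1
    return min(pos + 1, len(tokens))  # resume just after the delimiter


tokens = ["x", "=", "@", "3", ";", "y", "=", "4", ";"]
resume = skip_to_sync(tokens, 2)  # suppose an error was detected at index 2
print(tokens[resume:])
```

Everything between the error and the next delimiter is thrown away, which is what makes this
method simple but also liable to skip further errors in the discarded input.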
Phrase-Level Recovery:
The compiler corrects the program by inserting or deleting tokens, which allows it to continue
parsing from where it was. It performs the correction on the remaining input: it can replace a
prefix of the remaining input with some string that helps the parser to continue the process.
Error Productions
Error production recovery augments the grammar for the language with productions that generate
the erroneous constructs. The parser can then produce error diagnostics about such a construct.
Global Correction:
The compiler should make as few changes as possible while processing an incorrect
input string. Given an incorrect input string a and a grammar c, the algorithm searches for a
parse tree for a related string b, such that the number of insertions, deletions, and
modifications of tokens needed to transform a into b is as small as possible.
A grammar is a set of structural rules which describe a language. Grammars assign structure to
any sentence. The term also refers to the study of these rules, and this field includes
morphology, phonology, and syntax. A grammar is capable of describing much of the syntax of
programming languages.
Notational Conventions
An optional element may be indicated by enclosing it in square brackets, [ ... ]. An arbitrary
sequence of instances of an element may be indicated by enclosing the element in braces
followed by an asterisk symbol, { ... }*.
A choice between alternatives may be written with the symbol | within a single rule. It may be
enclosed in parentheses, ( ... ), when needed.
Grammar Derivation
A grammar derivation is a sequence of grammar rule applications which transforms the start
symbol into the string. A derivation proves that the string belongs to the grammar's language.
Left-most Derivation
When the sentential form is scanned and its nonterminals are replaced in left-to-right order,
that is, the left-most nonterminal is always replaced first, it is known as a left-most
derivation. A sentential form derived by a left-most derivation is called a left-sentential form.
Right-most Derivation
A right-most derivation scans the sentential form and replaces its nonterminals using the
production rules from right to left, that is, the right-most nonterminal is always replaced
first. A sentential form derived by a right-most derivation is known as a right-sentential form.
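The difference between the two orders can be seen by applying the same productions to different
nonterminals. The sketch below assumes the small grammar E -> E + E | id, which is not taken from
the text, and a hypothetical helper named derive.

```python
NONTERMINALS = {"E"}  # assumed grammar: E -> E + E | id


def derive(start, productions, leftmost=True):
    """Apply each production body to the left-most (or right-most) nonterminal in turn
    and return the list of sentential forms."""
    form = [start]
    steps = [" ".join(form)]
    for body in productions:
        positions = [i for i, symbol in enumerate(form) if symbol in NONTERMINALS]
        i = positions[0] if leftmost else positions[-1]
        form[i:i + 1] = body
        steps.append(" ".join(form))
    return steps


rules = [["E", "+", "E"], ["id"], ["id"]]
left = derive("E", rules, leftmost=True)
right = derive("E", rules, leftmost=False)
print(" => ".join(left))   # E => E + E => id + E => id + id
print(" => ".join(right))  # E => E + E => E + id => id + id
```

Both derivations end in the same string; only the intermediate sentential forms differ.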
A semantic analyzer checks the semantics of a program, that is, whether the language constructs
are meaningful or not. A semantic analyzer mainly performs static type checking.
A compiler must ensure that the source program follows the syntax and semantic conventions of
the source language. Once the syntax is verified, the next task to be performed by a compiler is
to check the semantics of the language. A semantic analyzer, shown in Figure 3.1, mainly verifies
whether the language constructs are meaningful (semantics) or not. This is also called static
type checking, which ensures that certain kinds of programming errors will be detected and
reported.
Parsing cannot detect some errors. Some errors are captured at compile time, which is called
static checking (e.g., type compatibility). Languages like C, C++, C#, Java, and Haskell use
static checking. Static checking is also called early binding. During static checking,
programming errors are caught early, which makes program execution efficient. Static checking
not only increases the efficiency and reliability of the compiled program, but also makes
execution faster.
Semantic analysis is the third phase of the compiler. Semantic analysis makes sure that the
declarations and statements of a program are semantically correct. It is a collection of
procedures which are called by the parser as and when required by the grammar. Both the syntax
tree of the previous phase and the symbol table are used to check the consistency of the given
code. Type checking is an important part of semantic analysis, where the compiler makes sure
that each operator has matching operands.
Semantic Analyzer:
It uses the syntax tree and symbol table to check whether the given program is semantically
consistent with the language definition. It gathers type information and stores it in either the
syntax tree or the symbol table. This type information is subsequently used by the compiler
during intermediate-code generation.
Semantic Errors:
Errors recognized by semantic analyzer are as follows:
Type mismatch
Undeclared variables
Reserved identifier misuse
float x = 10.1;
float y = x*20;
In the above example, the integer 20 will be typecast to the float 20.0 before multiplication, by
the semantic analyzer.
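The kind of check and implicit conversion described above can be sketched in Python. The function
name check_binary and the two-type rule set are assumptions made for illustration; real languages
have far richer conversion rules.

```python
def check_binary(op, left_type, right_type):
    """Return the result type of `left op right`, widening int to float when needed.

    Raises TypeError for a type mismatch that cannot be resolved by widening.
    """
    numeric = {"int", "float"}
    if left_type in numeric and right_type in numeric:
        return "float" if "float" in (left_type, right_type) else "int"
    raise TypeError(f"type mismatch: {left_type} {op} {right_type}")


result_type = check_binary("*", "float", "int")  # a float multiplied by an integer constant
print(result_type)
```

An analyzer built along these lines would also record that the int operand needs a conversion
node in the syntax tree before code is generated.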
The symbol table is an important data structure used in a compiler.
The symbol table is used to store information about the occurrence of various entities such as
objects, classes, variable names, interfaces, function names, etc. It is used by both the
analysis and synthesis phases.
A symbol table can be either a linear list or a hash table. It maintains an entry for each name
in the following format:
<symbol name, type, attribute>
For example, suppose the symbol table stores information about the following variable declaration:
static int salary
Then it stores an entry in the following format:
<salary, int, static>
The symbol table can be implemented as an unordered list if the compiler handles only a
small amount of data.
A symbol table can be implemented in one of the following techniques:
1. Linear (sorted or unsorted) list
2. Hash table
3. Binary search tree
The symbol table provides the following operations:
insert()
The insert() operation is used most frequently in the analysis phase, when the tokens are
identified and names are stored in the table.
The insert() operation is used to insert information into the symbol table, such as the unique
names occurring in the source code.
In the source code, the attribute for a symbol is the information associated with that symbol.
This information includes the state, value, type, and scope of the symbol.
The insert() function takes the symbol and its value as arguments.
For example:
int x;
The scopes of names and their symbol tables are arranged in a hierarchical structure, as shown
below:
int value = 10;
void sum_num()
{
    int num_1;
    int num_2;
    {
        int num_3;
        int num_4;
    }
    int num_5;
    {
        int num_6;
        int num_7;
    }
}
void sum_id()
{
    int id_1;
    int id_2;
    {
        int id_3;
        int id_4;
    }
    int num_5;
}
The above program can be represented by a hierarchical data structure of symbol tables:
The global symbol table contains one global variable and two procedure names. The names
mentioned in the sum_num table are not available to sum_id and its child tables.
The hierarchy of symbol tables is stored in the semantic analyzer. If you want to search for a
name in the symbol table, you can use the following algorithm:
1. First, the symbol is searched for in the current symbol table.
2. If the name is found, the search is complete; otherwise the name is searched for in the
symbol table of the parent,
3. until the name is found or the global symbol table has been searched.
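The search through the chain of parent tables can be sketched as follows. This is a minimal
sketch assuming each scope is a dictionary-backed table with a link to its parent; the class name
SymbolTable and its methods are illustrative, not a prescribed interface.

```python
class SymbolTable:
    """One scope's table, linked to the table of the enclosing scope."""

    def __init__(self, parent=None):
        self.parent = parent
        self.entries = {}

    def insert(self, name, type_, attribute=None):
        self.entries[name] = (name, type_, attribute)

    def lookup(self, name):
        table = self
        while table is not None:      # current table first, then each parent in turn
            if name in table.entries:
                return table.entries[name]
            table = table.parent
        return None                   # the global table was searched without success


global_table = SymbolTable()
global_table.insert("value", "int")
sum_num = SymbolTable(parent=global_table)
sum_num.insert("num_1", "int")
sum_id = SymbolTable(parent=global_table)
sum_id.insert("id_1", "int")

print(sum_num.lookup("value"))  # found in the parent (global) table
print(sum_id.lookup("num_1"))   # None: names of sum_num are not visible in sum_id
```

Because sum_id's parent is the global table and not sum_num, names declared in sum_num are never
reached from sum_id.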
For example:
int x;
void f(int m) {
    float x, y;
    int i, j;
    int u, v;
}
int g(int n) {
    bool t;
}
Fig: Symbol table organization that complies with static scope information rules
If we generate machine code directly from source code, then for n target machines we will need n
optimisers and n code generators; but if we have a machine-independent intermediate code,
we need only one optimiser. Intermediate code can be either language-specific (e.g.,
bytecode for Java) or language-independent (three-address code).
2. Three-Address Code –
A statement involving no more than three references (two for operands and one for the result)
is known as a three-address statement. A sequence of three-address statements is known as
three-address code. A three-address statement is of the form x = y op z, where x, y, and z
each have an address (memory location). Sometimes a statement might contain fewer than three
references, but it is still called a three-address statement.
Example – The three address code for the expression a + b * c + d:
T1 = b * c
T2 = a + T1
T3 = T2 + d
T1, T2, T3 are temporary variables.
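The translation above can be reproduced mechanically. The sketch below is one possible approach,
not a prescribed algorithm: it borrows Python's own ast module to parse the expression (an
assumption of convenience, since the expression syntax happens to be valid Python) and emits one
temporary per operator in a post-order walk. The function name three_address_code is illustrative.

```python
import ast


def three_address_code(expression):
    """Emit three-address statements for an arithmetic expression over + - * /."""
    ops = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}
    code = []

    def visit(node):
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            left = visit(node.left)       # operands first (post-order)
            right = visit(node.right)
            temp = f"T{len(code) + 1}"    # next unused temporary
            code.append(f"{temp} = {left} {ops[type(node.op)]} {right}")
            return temp
        raise ValueError("unsupported expression")

    visit(ast.parse(expression, mode="eval").body)
    return code


for line in three_address_code("a + b * c + d"):
    print(line)
```

Each emitted statement has at most two operands and one result, so it fits the x = y op z form.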
3. Syntax Tree –
A syntax tree is nothing more than a condensed form of a parse tree. The operator and
keyword nodes of the parse tree are moved to their parents, and a chain of single
productions is replaced by a single link. In a syntax tree the internal nodes are operators and
the child nodes are operands. To form a syntax tree, put parentheses in the expression; this way
it is easy to recognize which operand should come first.
Example –
x = (a + b * c) / (a - b * c)
Code optimization in the synthesis phase is a program transformation technique which tries
to improve the intermediate code by making it consume fewer resources (i.e. CPU, memory) so
that faster-running machine code will result. The optimizing process should meet the
following objectives:
1. The optimization must be correct; it must not, in any way, change the meaning of the
program.
2. Optimization should increase the speed and performance of the program.
3. The compilation time must be kept reasonable.
4. The optimization process should not delay the overall compiling process.
When to Optimize?
Optimization of the code is often performed at the end of the development stage since it reduces
readability and adds code that is used to increase the performance.
Types of Code Optimization –The optimization process can be broadly classified into two types :
1. Machine Independent Optimization – This code optimization phase attempts to improve
the intermediate code to get a better target code as the output. The part of the
intermediate code which is transformed here does not involve any CPU registers or
absolute memory locations.
2. Machine Dependent Optimization – Machine-dependent optimization is done after
the target code has been generated and when the code is transformed according to the
target machine architecture. It involves CPU registers and may have absolute memory
references rather than relative references. Machine-dependent optimizers try to
take maximum advantage of the memory hierarchy.
2. Variable Propagation :
//Before Optimization
c = a * b
x = a
d = x * b + 4
//After Optimization
c = a * b
x = a
d = a * b + 4
Hence, after variable propagation, a*b and x*b will be identified as common sub-expressions.
//After elimination of the unused copy x = a :
c = a * b
d = a * b + 4
4. Code Motion :
Reduce the evaluation frequency of expression.
Bring loop invariant statements out of the loop.
a = 200;
b = x + y;
if (a % b == 0)
    printf("%d", a);
//This code can be further optimized as
a = 200;
b = x + y;
if (a % b == 0)
    printf("%d", a);
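Bringing a loop-invariant statement out of a loop can be illustrated with a small Python sketch.
The loop bounds and the function names before_motion and after_motion are assumptions made for
this example; the point is only that b = x + y does not depend on the loop variable.

```python
def before_motion(x, y, n):
    hits = []
    for a in range(1, n + 1):
        b = x + y            # loop-invariant: recomputed on every iteration
        if a % b == 0:
            hits.append(a)
    return hits


def after_motion(x, y, n):
    hits = []
    b = x + y                # hoisted: evaluated once, before the loop
    for a in range(1, n + 1):
        if a % b == 0:
            hits.append(a)
    return hits


print(before_motion(2, 3, 20))
print(after_motion(2, 3, 20))
```

Both functions return the same list, but the second evaluates x + y once instead of n times.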
i = 1;
while (i < 10)
{
    y = i * 4;
    i = i + 1;
}
//After Reduction
t = 4;
while (t < 40)
{
    y = t;
    t = t + 4;
}
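That the reduced loop computes the same values can be checked directly. The sketch below assumes
the counter i advances by one on each iteration and that t starts at 4 (the value of i * 4 for
i = 1); the function names are illustrative.

```python
def before_reduction():
    values, i = [], 1
    while i < 10:
        values.append(i * 4)   # one multiplication per iteration
        i += 1
    return values


def after_reduction():
    values, t = [], 4
    while t < 40:
        values.append(t)       # the multiplication is replaced by a running sum
        t += 4
    return values


print(before_reduction() == after_reduction())
```

The bound changes from 10 to 40 because the new loop variable t always holds i * 4.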
Code generation can be considered the final phase of compilation. Further optimization can be
applied after code generation, but that can be seen as part of the code generation phase
itself. The code generated by the compiler is object code in some lower-level programming
language, for example, assembly language. We have seen that source code written in a
higher-level language is transformed into a lower-level language, resulting in lower-level
object code, which should have the following minimum properties:
1. It should carry the exact meaning of the source code.
2. It should be efficient in terms of CPU usage and memory management.
We will now see how the intermediate code is transformed into target object code (assembly
code, in this case).
[t0 = a + b]
[t1 = t0 + c]
[d = t0 + t1]
Peephole Optimization
This optimization technique works locally on the source code to transform it into optimized
code. By locally, we mean a small portion of the code block at hand. These methods can be
applied to intermediate code as well as to target code. A bunch of statements is analyzed and
checked for the following possible optimizations:
At the compilation level, the compiler searches for instructions that are redundant in nature.
Multiple load and store instructions may carry the same meaning even if some of them are
removed. For example:
MOV x, R0
MOV R0, R1
We can delete the first instruction and re-write the sentence as:
MOV x, R1
Unreachable code
Unreachable code is a part of the program code that is never accessed because of programming
constructs. Programmers may have accidentally written a piece of code that can never be reached.
int add_ten(int x)
{
    return x + 10;
    printf("value of x is %d", x);
}
In this code segment, the printf statement will never be executed, as program control returns
before it can execute; hence printf can be removed.
MOV R1, R2
GOTO L1
L1 : GOTO L2
L2 : INC R1
In this code, label L1 can be removed as it only passes control to L2. So instead of jumping to
L1 and then to L2, the control can directly reach L2, as shown below:
MOV R1, R2
GOTO L2
L2 : INC R1
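Retargeting a jump whose destination is itself only a jump can be sketched as a small peephole
pass. The instruction encoding (a list of (label, operation) pairs), the function name
thread_jumps, and the sample program with an earlier GOTO L1 are all assumptions for illustration.

```python
def thread_jumps(program):
    """Retarget every 'GOTO X' whose destination X is itself just 'GOTO Y'."""
    forwards = {
        label: op.split()[1]
        for label, op in program
        if label is not None and op.startswith("GOTO ")
    }
    result = []
    for label, op in program:
        if op.startswith("GOTO "):
            target, seen = op.split()[1], set()
            while target in forwards and target not in seen:  # follow chains, stop on cycles
                seen.add(target)
                target = forwards[target]
            op = f"GOTO {target}"
        result.append((label, op))
    return result


program = [
    (None, "MOV R1, R2"),
    (None, "GOTO L1"),
    ("L1", "GOTO L2"),
    ("L2", "INC R1"),
]
optimized = thread_jumps(program)
for label, op in optimized:
    print(f"{label + ' : ' if label else ''}{op}")
```

After this pass nothing jumps to L1 any more, so a later clean-up step can drop that instruction.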
Strength reduction
There are operations that consume more time and space. Their 'strength' can be reduced by
replacing them with other operations that consume less time and space, but produce the same
result.
For example, x * 2 can be replaced by x << 1, which involves only one left shift. Though the
output of a * a and a² is the same, a * a is much more efficient to implement.
Code Generator
A code generator is expected to have an understanding of the target machine’s runtime
environment and its instruction set. The code generator should take the following things into
consideration to generate the code:
Target language : The code generator has to be aware of the nature of the target language
for which the code is to be transformed. That language may facilitate some machine-
specific instructions to help the compiler generate the code in a more convenient way.
The target machine can have either CISC or RISC processor architecture.
IR Type : Intermediate representation has various forms. It can be in Abstract Syntax
Tree (AST) structure, Reverse Polish Notation, or 3-address code.
Selection of instruction : The code generator takes Intermediate Representation as input
and converts (maps) it into target machine’s instruction set. One representation can have
many ways (instructions) to convert it, so it becomes the responsibility of the code
generator to choose the appropriate instructions wisely.
Register allocation : A program has a number of values to be maintained during the
execution. The target machine’s architecture may not allow all of the values to be kept in
the CPU memory or registers. Code generator decides what values to keep in the
registers. Also, it decides the registers to be used to keep these values.
Ordering of instructions : Finally, the code generator decides the order in which the
instructions will be executed. It creates schedules for the instructions to execute them.
The code generator has to track both the registers (for availability) and addresses (location of
values) while generating the code. For both of them, the following two descriptors are used:
Register descriptor : Register descriptor is used to inform the code generator about the
availability of registers. Register descriptor keeps track of values stored in each register.
Whenever a new register is required during code generation, this descriptor is consulted
for register availability.
Address descriptor : Values of the names (identifiers) used in the program might be
stored at different locations during execution. Address descriptors are used to keep track
of the memory locations where the values of identifiers are stored. These locations may
include CPU registers, heaps, stacks, memory or a combination of the mentioned locations.
The code generator keeps both descriptors updated in real time. For a load statement, LD R1, x,
the code generator:
updates the register descriptor of R1 to show that it holds the value of x, and
updates the address descriptor of x to show that one instance of x is in R1.
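The two updates for a load can be sketched with plain dictionaries. This is an assumed
representation, not a prescribed one: the register descriptor maps each register to the set of
names it holds, and the address descriptor maps each name to the set of locations holding its
value; the function name load is illustrative.

```python
def load(register, name, register_descriptor, address_descriptor):
    """Record the effect of `LD register, name` on both descriptors."""
    for old in register_descriptor.get(register, set()):
        # The previous contents of the register are overwritten by the load.
        address_descriptor.get(old, set()).discard(register)
    register_descriptor[register] = {name}                    # register now holds only `name`
    address_descriptor.setdefault(name, set()).add(register)  # `name` is also in the register


register_descriptor = {"R1": set()}
address_descriptor = {"x": {"memory"}}
load("R1", "x", register_descriptor, address_descriptor)
print(register_descriptor)
print(address_descriptor)
```

After the call, R1 is recorded as holding x, and x is recorded as living both in memory and in R1.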
Code Generation
Basic blocks consist of a sequence of three-address instructions. The code generator takes this
sequence of instructions as input.
Note : If the value of a name is found at more than one place (register, cache, or memory), the
register’s value will be preferred over the cache and main memory. Likewise cache’s value will
be preferred over the main memory. Main memory is barely given any preference.
getReg : The code generator uses the getReg function to determine the status of available
registers and the location of name values. getReg works as follows:
If variable Y is already in register R, it uses that register.
Else if some register R is available, it uses that register.
Else if both the above options are not possible, it chooses a register that requires minimal
number of load and store instructions.
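The three rules can be sketched as a single function. The data layout is an assumption: the
register descriptor maps each register to the set of names it holds, and spill_cost is a
hypothetical table giving, for each register, the number of load and store instructions needed to
reuse it (how that cost is computed is outside this sketch).

```python
def get_reg(name, register_descriptor, spill_cost):
    """Pick a register for `name` following the three rules above."""
    for register, held in register_descriptor.items():
        if name in held:          # rule 1: the value is already in a register
            return register
    for register, held in register_descriptor.items():
        if not held:              # rule 2: some register is free
            return register
    # rule 3: reuse the register that needs the fewest loads and stores
    return min(register_descriptor, key=lambda register: spill_cost[register])


descriptor = {"R0": {"a"}, "R1": set()}
costs = {"R0": 2, "R1": 1}
print(get_reg("a", descriptor, costs))  # R0: already holds a
print(get_reg("y", descriptor, costs))  # R1: free register
```

When every register is occupied, the choice falls to the cost table, so the quality of the result
depends on how well those costs are estimated.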
For an instruction x = y OP z, the code generator may perform the following actions. Let us
assume that L is the location (preferably register) where the output of y OP z is to be saved:
Call function getReg, to decide the location of L.
Determine the present location (register or memory) of y by consulting the Address
Descriptor of y. If y is not presently in register L, then generate the following instruction
to copy the value of y to L:
MOV y’, L