Compiler Design
CS8602
COMPILER DESIGN
for
BE - CSE - VI SEM
S.PRABU
Associate Professor/CSE
CS8602 COMPILER DESIGN L T P C
3 0 2 4
OBJECTIVES:
To learn the various phases of a compiler.
To learn the various parsing techniques.
To understand intermediate code generation and run-time environment.
To learn to implement front-end of the compiler.
To learn to implement code generator.
LIST OF EXPERIMENTS:
1. Develop a lexical analyzer to recognize a few patterns in C. (Ex. identifiers, constants,
comments, operators etc.). Create a symbol table, while recognizing identifiers.
2. Implement a Lexical Analyzer using Lex Tool
3. Implement an Arithmetic Calculator using LEX and YACC
4. Generate three address code for a simple program using LEX and YACC.
5. Implement simple code optimization techniques (Constant folding, Strength reduction and
Algebraic transformation)
6. Implement back-end of the compiler for which the three address code is given as input and
the 8086 assembly language code is produced as output.
PRACTICALS: 30 PERIODS
THEORY: 45 PERIODS
TOTAL: 75 PERIODS
OUTCOMES:
On Completion of the course, the students should be able to:
Understand the different phases of a compiler.
Design a lexical analyzer for a sample language.
Apply different parsing algorithms to develop the parsers for a given grammar.
Understand syntax-directed translation and run-time environment.
Learn to implement code optimization techniques and a simple code generator.
Design and implement a scanner and a parser using LEX and YACC tools.
TEXT BOOK:
1. Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, "Compilers: Principles, Techniques and Tools", Second Edition, Pearson Education, 2009.
REFERENCES:
1. Randy Allen, Ken Kennedy, "Optimizing Compilers for Modern Architectures: A Dependence-based Approach", Morgan Kaufmann Publishers, 2002.
2. Steven S. Muchnick, "Advanced Compiler Design and Implementation", Morgan Kaufmann Publishers - Elsevier Science, India, Indian Reprint 2003.
3. Keith D. Cooper and Linda Torczon, "Engineering a Compiler", Morgan Kaufmann Publishers, Elsevier Science, 2004.
4. V. Raghavan, "Principles of Compiler Design", Tata McGraw Hill Education Publishers, 2010.
5. Allen I. Holub, "Compiler Design in C", Prentice-Hall Software Series, 1993.
CS8602 – COMPILER DESIGN
UNIT I
INTRODUCTION TO COMPILERS
Structure of a compiler – Lexical Analysis – Role of Lexical Analyzer – Input
Buffering – Specification of Tokens – Recognition of Tokens – Lex – Finite
Automata – Regular Expressions to Automata – Minimizing DFA.
Compiler Fundamentals
Translator
Is a program that converts one form of language into another.
One form of language → Translator → Other form of language
Compiler
Is a program that converts a high-level program into a low-level program.
Interpreter
Is a program that converts a high-level program into a low-level program, line by line.
High level program → Interpreter → Low level program
Assembler
Is a program that converts an assembly language program into a low-level (machine language) program.
Assembly language → Assembler → Low level program
CLASSIFICATION OF COMPILER:
Depending upon the function they perform, compilers are classified as:
• Single pass compiler
• Optimizing compiler
• Cross compiler
Optimizing Compiler
Is a compiler that minimizes or maximizes some attributes of an executable program.
Cross Compiler
Is a compiler capable of creating executable code for a platform other than the one on which the compiler itself is running.
LANGUAGE PROCESSOR
In addition to a compiler, several other programs may be required to create an executable target program, such as preprocessors, assemblers, linkers and loaders.
STRUCTURE OF A COMPILER
PHASES OF COMPILER:
Conceptually, a compiler operates in phases, each of which transforms the source program from one representation into another.
Phase: A phase is a logically cohesive operation that takes one format of source code as input
and generates another format as output.
LEXICAL ANALYZER:
The first phase is Lexical analyzer or Scanner or Linear analyzer
PURPOSE: It reads the source program one character at a time and groups the characters into tokens.
TOKENS:
Token is a sequence of characters that can be treated as a single logical entity.
The tokens may be
• Identifiers :- Ex., x, y, num, s
• Keywords :- Ex., int, float, char, do, while, if, else
• Operator symbols :- Ex., +, -, *, /
• Special symbols :- Ex., (, ), [, ], {, }, ,
• Constants :- Ex., 5,3
Specific strings:
Such tokens (e.g., keywords and operators) have a token type but no token value.
Ex., * is an operator; it has no value.
Classes of strings:
Such tokens (e.g., identifiers and constants) have a token type along with a token value.
Ex., 60 is a constant; its value is 60.
SYNTAX ANALYZER:
The second phase of compiler is Syntax analyzer or Parser or Hierarchical analyzer
PURPOSE: It accepts the stream of tokens and produces a syntactic structure called a parse tree.
PARSE TREE:
Is a diagram that gives the syntactic structure of an expression.
Recursive Rules:-
The Hierarchical structure of a program is usually expressed by recursive rules.
The rules are
• Any identifier is an expression
• Any number is an expression
• If expression 1 and expression 2 are expressions, then
expression 1 + expression 2 and
expression 1 * expression 2 are also expressions.
There are two types of parsing
• Top down parsing
• Bottom up parsing
SEMANTIC ANALYZER:
It checks the source program for semantic errors and gathers type information for the subsequent code generation phase.
It uses the hierarchical structure determined by the syntax analysis phase to identify the operators and operands of expressions and statements.
Type Checking:
An important component of semantic analysis is type checking.
The compiler checks that each operator has operands that are permitted by the source language specification.
Example:
When a binary arithmetic operator is applied to an integer and a real, the compiler may need to convert the integer to a real.
CODE OPTIMIZATION:
This phase is designed to improve the intermediate code, so that faster-running machine code will result.
The output is again three-address code, but with increased efficiency.
CODE GENERATOR:
The final phase of the compiler is the generation of target code.
Normally it is relocatable machine code or assembly code.
Example:
The F in each instruction tells that the instruction deals with floating-point numbers; the # sign indicates a constant.
The first instruction moves id3 into register R2; the second instruction multiplies the contents of R2 by 60, and so on.
Symbol Table:
The symbol table collects information about the various attributes of the identifiers in the program.
These attributes may provide information about the storage allocated for an identifier, its type and its scope.
In the case of procedure names, the important information stored includes:
• Number of arguments
• Types of arguments
• Method of passing each argument (pass by reference or pass by value)
During lexical analysis the identifier is entered into the symbol table; normally no other information is entered at that point.
The other phases enter information such as data type and address location into the symbol table.
ERROR HANDLER:
The error handler is invoked when an error in the source program is detected.
Each phase can encounter errors.
After detecting an error a phase must deal with that error, so that compilation can proceed.
Lexical analysis can detect errors due to unrecognized tokens.
Syntax analysis can detect errors due to invalid syntactic structure.
Semantic analysis can detect errors where the syntactic structure is correct but the types are invalid.
Errors Encountered in Different Phases
GROUPING OF PHASES
Front end : machine independent phases
• Lexical analysis
• Syntax analysis
• Semantic analysis
• Intermediate code generation
• Some code optimization
Back end : machine dependent phases
• Final code generation
• Machine-dependent optimizations
Front end
The front end analyzes the source code to build an internal representation of the program,
called the intermediate representation or IR.
It also manages the symbol table, a data structure mapping each symbol in the source code
to associated information such as location, type and scope.
This is done over several phases, which includes some of the following:
Back end
The term back end is sometimes confused with code generator because of the overlapped
functionality of generating assembly code.
Some literature uses middle end to distinguish the generic analysis and optimization phases
in the back end from the machine-dependent code generators.
The main phases of the back end include the following:
Preprocessor:
Preprocessors produce input to the compilers.
They perform following functions
• Macro processing
• File inclusion
• Rational preprocessors
• Language extensions
Assemblers:
Some compilers produce assembly code, which is passed to an assembler for further processing.
Definition:- An assembler converts the assembly code produced by the compiler into machine code.
Loading:-
Takes the relocatable machine code.
Alters the relocatable addresses in the machine code as required by the memory layout.
Places the instructions and data into memory at the proper locations.
Link Editor:-
The link editor allows us to make a single program from several files of relocatable machine code.
These files may be the result of several different compilations.
Some of the files may be library files or system routines.
LEXICAL ANALYSER
A lexical analyzer, also called a scanner, typically has the following functionality and
characteristics.
• Its primary function is to convert from a (often very long) sequence of characters into a
(much shorter, perhaps 10X shorter) sequence of tokens. This means less work for
subsequent phases of the compiler.
• The scanner must Identify and Categorize specific character sequences into tokens. It
must know whether every two adjacent characters in the file belong together in the same
token, or whether the second character must be in a different token.
• Most lexical analyzers discard comments & whitespace. In most languages these
characters serve to separate tokens from each other, but once lexical analysis is
completed they serve no purpose. On the other hand, the exact line # and/or column #
may be useful in reporting errors, so some record of what whitespace has occurred may
be retained. Note: in some languages, even popular ones, whitespace is significant.
• Handle lexical errors (illegal characters, malformed tokens) by reporting them intelligibly
to the user.
• Efficiency is crucial; a scanner may perform elaborate input buffering
• Token categories can be (precisely, formally) specified using regular expressions, e.g.
IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*
• Lexical Analyzers can be written by hand, or implemented automatically using finite
automata.
In compilers, a "token" is:
1. a single word of source code input (a.k.a. "lexeme")
2. an integer code that refers to a single word of input
3. a set of lexical attributes computed from a single word of input
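As an illustration of a hand-written lexical analyzer (the approach asked for in laboratory exercise 1), the following C sketch groups characters into identifier, constant and operator tokens using the pattern [a-zA-Z][a-zA-Z0-9]* given above. The token codes and the get_token interface are illustrative assumptions, not a standard API.

    #include <ctype.h>
    #include <stdio.h>

    /* Hypothetical token codes for this sketch */
    enum { TOK_EOF = 0, TOK_ID, TOK_NUM, TOK_OP };

    /* Reads one token from fp into lexeme[] and returns its code. */
    int get_token(FILE *fp, char lexeme[], int max) {
        int c, i = 0;
        while ((c = fgetc(fp)) != EOF && isspace(c))   /* whitespace only separates tokens */
            ;
        if (c == EOF) return TOK_EOF;
        if (isalpha(c)) {                              /* [a-zA-Z][a-zA-Z0-9]* */
            do {
                if (i < max - 1) lexeme[i++] = c;
                c = fgetc(fp);
            } while (c != EOF && isalnum(c));
            if (c != EOF) ungetc(c, fp);               /* push back the one-character lookahead */
            lexeme[i] = '\0';
            return TOK_ID;
        }
        if (isdigit(c)) {                              /* unsigned integer constants */
            do {
                if (i < max - 1) lexeme[i++] = c;
                c = fgetc(fp);
            } while (c != EOF && isdigit(c));
            if (c != EOF) ungetc(c, fp);
            lexeme[i] = '\0';
            return TOK_NUM;
        }
        lexeme[0] = c; lexeme[1] = '\0';               /* single-character operator or symbol */
        return TOK_OP;
    }

A real scanner would, in addition, look up identifiers in a keyword table and enter them into the symbol table, as described later in this unit.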
Sometimes lexical analyzers are divided into a cascade of two phases:
• The first, called "scanning", performs simple tasks such as deleting comments and compacting consecutive whitespace characters.
• The second, "lexical analysis" proper, produces the tokens.
Reasons for Separating Lexical Analysis from Parsing
1. Simplicity of design: a separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing.
2. Compiler efficiency is improved: specialized buffering techniques for reading input characters can speed up the compiler significantly.
3. Compiler portability is enhanced: input-device-specific peculiarities can be restricted to the lexical analyzer.
INPUT BUFFERING
The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often, however, many characters beyond the next token may have to be examined before the next token itself can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from an input buffer. Fig. 1.9 shows a buffer divided into two halves of, say, 100 characters each.
One pointer marks the beginning of the token being discovered. A lookahead pointer scans ahead of the beginning point until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice, each buffering scheme adopts one convention: a pointer is at either the symbol last read or the symbol it is ready to read.
The distance which the lookahead pointer may have to travel past the actual token may be large.
For example, in a PL/I program we may see:
DECLARE (ARG1, ARG2, …, ARGn)
without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file.
Since the buffer shown in the above figure is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead travelled into the other half and all the way through it to the middle, we could not reload that half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that the amount of lookahead is limited.
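A minimal C sketch of the buffer-pair idea is given below. The buffer size, the pointer names (lexemeBegin, forward) and the use of '\0' as a sentinel are illustrative assumptions (the sketch also assumes the source text contains no NUL bytes); each half is reloaded only when the lookahead pointer reaches the sentinel at its end.

    #include <stdio.h>

    #define BUFSIZE 4096                    /* size of each buffer half */

    static char buf[2 * BUFSIZE + 2];       /* two halves, each followed by a sentinel */
    static char *lexemeBegin;               /* marks the start of the current token   */
    static char *forward;                   /* lookahead pointer scanning ahead       */

    /* Refill one half and place the sentinel after the data actually read. */
    static void load_half(FILE *fp, char *half) {
        size_t n = fread(half, 1, BUFSIZE, fp);
        half[n] = '\0';
    }

    void init_buffers(FILE *fp) {
        load_half(fp, buf);
        lexemeBegin = forward = buf;
    }

    /* Advance the lookahead pointer, reloading the other half when a
       sentinel is reached; returns the next character or EOF. */
    int next_char(FILE *fp) {
        while (*forward == '\0') {
            if (forward == buf + BUFSIZE) {                 /* end of first half  */
                load_half(fp, buf + BUFSIZE + 1);
                forward = buf + BUFSIZE + 1;
            } else if (forward == buf + 2 * BUFSIZE + 1) {  /* end of second half */
                load_half(fp, buf);
                forward = buf;
            } else {
                return EOF;        /* sentinel inside a half: real end of input */
            }
        }
        return *forward++;
    }

With the sentinel, the common case needs only one test per character (*forward == '\0') instead of two, which is the efficiency gain mentioned above.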
TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g., IDENTIFIER,
NUMBER, COMMA). The process of forming tokens from an input stream of characters is called
tokenization.
A token can look like anything that is useful for processing an input text stream or text file.
Consider this expression in the C programming language: sum = 3 + 2;
Lexeme   Token type
sum      Identifier
=        Assignment operator
3        Number
+        Addition operator
2        Number
;        End of statement
Lexeme:
A collection or group of characters forming a token is called a lexeme.
Pattern:
A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of characters that form the
keyword. For identifiers and some other tokens, the pattern is a more complex structure that is
matched by many strings.
Expressing Tokens by Regular Expressions
Regular Expressions
Is a special kind of notation used to define a language, which in turn defines the tokens or strings.
Ex. the identifiers of a language can be described by
letter ( letter | digit )*
The vertical bar above means “or”.
The parentheses are used to group sub expressions, the star means "zero or more
occurrences of," and the juxtaposition of letter with the remainder of the expression signifies
concatenation.
Rules
A regular expression can be built by the following rules.
1. ε is a regular expression that denotes {ε}, the set containing the empty string.
2. If a is a symbol in ∑, then a is a regular expression that denotes {a}, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s), respectively. Then:
1. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first produce a stylized flowchart, called a transition diagram.
It is a diagrammatic representation, like a flowchart, used in lexical analysis to recognize tokens.
In transition diagrams the following notations are used.
States : represented by circles.
Edges : used to connect the states.
Labels : the input characters on the edges, showing the transitions.
(1) main()
start: 0 --m--> 1 --a--> 2 --i--> 3 --n--> 4 --(--> 5 --)--> 6 (accepting)
Lex can perform simple transformations by itself but its main purpose is to facilitate
lexical analysis, the processing of character sequences such as source code to produce symbol
sequences called tokens for use as input to other programs such as parsers. Lex can be used with
a parser generator to perform lexical analysis. It is easy, for example, to interface Lex and Yacc,
an open source program that generates code for the parser in the C programming language.
Lex is proprietary but versions based on the original code are available as open source. These
include a streamlined version called Flex, an acronym for "fast lexical analyzer generator," as well
as components of OpenSolaris and Plan 9.
AUTOMATA
These are language recognizers.
An automaton takes an input string and says "yes" if it is a sentence of the language, and "no" if it is not.
Finite Automata
The generalized transition diagram for a regular expression is called a finite automaton.
It is a labelled, directed graph.
Here the nodes are states, and the labelled edges are transitions.
Types
1. Non deterministic finite automata (NFA)
2. Deterministic finite automata (DFA)
NFA
A finite automaton which satisfies the following conditions:
• It has one start state.
• It has one or more final states.
• ε-transitions may be present.
• There can be one or more transitions from a state on a particular input.
“Non deterministic” means that more than one transition out of a state may be possible on the
same input symbol.
Ex. a*
DFA
A finite automaton which satisfies the following conditions:
• It has one start state.
• It has one or more final states.
• There are no ε-transitions.
• There is only one transition from a state on a particular input.
Examples of regular expressions (for constructing automata):
2. a
3. a/b
4. ab
5. a*
6. (a/b)*
7. (a/b)*abb
8. (0/1) (0/2)
9. 12/53
10. (abc)*
b*
(a/b)b*
((a/b)b*)*
13. (a*/b*)*
16. (a/b)a(a/b)*abb
17. 0(2/51)*
18. 4(0/123)*
19. (a,a*/bb*)
20. (0/1)*1
NFA to DFA Conversion
1. RE = a*
NFA (Thompson construction; figure omitted): start state 0 with ε-transitions to 1 and to the final state 3; 1 --a--> 2; 2 with ε-transitions back to 1 and to 3.
Subset construction:
A = ε-closure({0}) = {0, 1, 3}
move(A, a) = {2}, ε-closure({2}) = {1, 2, 3} = B
move(B, a) = {2}, ε-closure({2}) = B
Transition table:
State   a
A       B
B       B
Both A and B contain the NFA final state 3, so both are accepting.
Minimizing DFA:
A and B are both accepting and have identical transitions, so they are equivalent and collapse into a single state:
= {(A, B)}
Minimized transition table:
State   a
A       A
DFA: a single accepting start state with a self-loop on a.
2. RE = (a/b)*abb
NFA (figure omitted); applying the subset construction gives the following DFA transition table, with start state A and accepting state E:
State   a   b
A       B   C
B       B   D
C       B   C
D       B   E
E       B   C
Minimizing DFA:
Initial partition (nonaccepting / accepting):   { (A, B, C, D) (E) }
D is split off, since on b it goes to E:        { (A, B, C) (D) (E) }
B is split off, since on b it goes to D:        { (A, C) (B) (D) (E) }
No further splitting is possible, so A and C are equivalent; C is replaced by A.
Minimized transition table:
State   a   b
A       B   A
B       B   D
D       B   E
E       B   A
DFA: four states, start state A, accepting state E.
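As a check, the minimized DFA can be simulated directly from its transition table. The following C sketch encodes the four states A, B, D, E as 0-3 (an illustrative encoding) and accepts exactly the strings over {a, b} that end in abb.

    #include <stdio.h>

    /* Minimized DFA for (a|b)*abb: 0 = A (start), 1 = B, 2 = D, 3 = E (accepting). */
    static const int delta[4][2] = {
        /*        a  b */
        /* A */ { 1, 0 },
        /* B */ { 1, 2 },
        /* D */ { 1, 3 },
        /* E */ { 1, 0 },
    };

    /* Returns 1 if the string of a's and b's is accepted, 0 otherwise. */
    int accepts(const char *s) {
        int state = 0;                          /* start in state A */
        for (; *s; s++) {
            if (*s != 'a' && *s != 'b') return 0;      /* not in the alphabet */
            state = delta[state][*s - 'a'];     /* 'a' -> column 0, 'b' -> column 1 */
        }
        return state == 3;                      /* accept only in state E */
    }

    int main(void) {
        printf("%d %d %d\n", accepts("abb"), accepts("aababb"), accepts("abba"));
        /* expected output: 1 1 0 */
        return 0;
    }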
UNIT II
SYNTAX ANALYSIS
Syntax analysis is otherwise called parsing or hierarchical analysis.
Parser
Is software that accepts tokens as input and produces a parse tree as output.
(or) Syntax analysis is the second phase of the compiler. It gets the input from the tokens
and generates a syntax tree or parse tree.
The parser uses the first components of the tokens produced by the lexical analyzer to
create a tree-like intermediate representation that depicts the grammatical structure of the
token stream.
A typical representation is a syntax tree in which each interior node represents an
operation and the children of the node represent the arguments of the operation.
A syntax tree for the token stream is shown as the output of the syntactic analyzer in
following Fig.
This tree shows the order in which the operations in the assignment are to be performed.
In compiler model, the parser obtains a string of tokens from the lexical analyzer.
And verifies that the string of token names can be generated by the grammar for the
source language.
ERRORS (Error Recovery Strategies)
Panic mode recovery:
On discovering an error, the parser discards input symbols one at a time until a synchronizing token is found. The synchronizing tokens are usually delimiters, such as a semicolon or end. This method has the advantage of simplicity and does not go into an infinite loop. When multiple errors in the same statement are rare, it is quite useful.
Phrase-level recovery:
On discovering an error, the parser performs local correction on the remaining input that allows it to continue. Example: insert a missing semicolon or delete an extraneous semicolon.
Error productions:
The grammar is augmented with productions that generate common erroneous constructs, so that the parser can detect and report these errors.
Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to find a parse tree for a string y, such that the number of insertions, deletions and changes of tokens is as small as possible. However, these methods are in general too costly in terms of time and space.
CONTEXT-FREE GRAMMARS
A context-free grammar has four components.
• terminals,
• non terminals,
• a start symbol, and
• productions.
1.Terminals
Terminals are the basic symbols from which strings are formed.
2. Non terminals
Non terminals are syntactic variables that denote sets of strings.
The sets of strings denoted by non terminals help define the language generated by the
grammar.
3.Start Symbol
One of the nonterminals is distinguished as the start symbol; the set of strings it denotes is the language generated by the grammar.
4.Production
The productions of a grammar specify the manner in which the terminals and
nonterminals can be combined to form strings.
Each production consists of:
• A nonterminal (in left side)
• An arrow
• Set of terminals and non-terminals.
Example:
The following grammar defines simple arithmetic expressions
Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is, either nonterminals or terminals.
Unless stated otherwise, the left side of the first production is the start symbol.
DERIVATION
The derivational view gives a precise description of the top down construction of a parse
tree.
The central idea is that a production is treated as a rewriting rule, in which the nonterminal on the left is replaced by the string on the right side of the production.
Consider the following grammar G.
E → E + E | E * E | (E) | -E | id
The string -(id+id) is a sentence of the grammar because there is a derivation
E => -E => -(E+E) => -(id+E) => -(id+id)
At each step in a derivation there are two choices to be made:
which nonterminal to replace, and which of its productions to use.
Parse Tree
A parse tree may be viewed as a graphical representation for a derivation that filters out
the choice regarding replacement order.
Each interior node of a parse tree is labeled by some nonterminal.
The children of the node are labeled, from left to right, by the symbols of the production used.
The leaves of the parse tree are labeled by nonterminals or terminals and, read from left to right, they constitute a sentential form called the yield or frontier of the tree.
Ex. Parse tree for -(id+id)
Ambiguity
A grammar that produces more than one parse tree for a given input string is called an ambiguous grammar.
Or
An ambiguous grammar is one that produces more than one rightmost derivation or leftmost derivation for the same sentence.
PARSER
Top-Down Parsing (Recursive Descent Parsing)
Top-down parsing can be viewed as the problem of constructing a parse tree for the input string, starting from the root and creating the nodes of the parse tree in preorder.
Equivalently, top-down parsing can be viewed as finding a leftmost derivation for an
input string.
Recursive Descent Parsing.
• Backtracking is needed (If a choice of a production rule does not work, we backtrack
to try other alternatives.)
• It is a general parsing technique, but not widely used.
• Not efficient
Backtracking : Making repeated scans of input string.
Predictive Parsing
A special case of recursive descent parsing that requires no backtracking is called predictive parsing.
It is an efficient method.
It needs a special form of grammars (LL(1) grammars).
Predictive Parser is also known as LL(1) parser.
In many cases, by carefully writing a grammar,
• eliminating left recursion, and
• left factoring the resulting grammar,
we can obtain a grammar that can be parsed by a recursive descent parser that needs no backtracking.
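For example, after eliminating left recursion from the usual expression grammar, each nonterminal can be turned into a procedure that needs only one symbol of lookahead and no backtracking. The C sketch below is illustrative only: it uses single-character tokens, with 'i' standing for id.

    #include <stdio.h>
    #include <stdlib.h>

    /* Grammar (after removing left recursion):
           E  -> T E'      E' -> + T E' | e
           T  -> F T'      T' -> * F T' | e
           F  -> ( E ) | i                         */

    static const char *lookahead;

    static void error(void)   { printf("syntax error\n"); exit(1); }
    static void match(char t) { if (*lookahead == t) lookahead++; else error(); }

    static void E(void);
    static void F(void) { if (*lookahead == '(') { match('('); E(); match(')'); }
                          else match('i'); }
    static void Tp(void){ if (*lookahead == '*') { match('*'); F(); Tp(); } /* else epsilon */ }
    static void T(void) { F(); Tp(); }
    static void Ep(void){ if (*lookahead == '+') { match('+'); T(); Ep(); } /* else epsilon */ }
    static void E(void) { T(); Ep(); }

    int main(void) {
        lookahead = "i+i*i";              /* corresponds to id + id * id */
        E();
        if (*lookahead == '\0') printf("accepted\n"); else error();
        return 0;
    }

Each procedure decides which production to use by looking only at the current input symbol, which is exactly the "predictive" behaviour described above.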
A table-driven predictive parser has an input buffer, a stack, a parsing table, and an output
stream.
Input Buffer
The input buffer contains the string to be parsed, followed by $, a symbol used as a right
end marker to indicate the end of the input string.
Stack
The stack contains a sequence of grammar symbols with $ on the bottom, indicating the
bottom of the stack.
Initially, the stack contains the start symbol of the grammar on top of $.
Parsing Table
The parsing table is a two dimensional array M[A,a] where A is a nonterminal, and a is a
terminal or the symbol $.
BOTTOM UP PARSER
These are parsers which construct the parse tree from the leaves to the root (the starting nonterminal) for the given input string.
Here the input string is reduced to the starting nonterminal.
Why is a bottom-up parser otherwise called a shift-reduce parser?
Because it consists of shifting input symbols onto a stack until the right side of a production appears on top of the stack.
The right side may then be replaced by (reduced to) the nonterminal on the left side of the production, and the process is repeated.
LR parsing
A much more general method of shift reduce parsing is called LR parsing.
$ Symbol:-
It is used to mark the bottom of the stack and the right end of the input.
Initial condition:
Initially the stack is empty and the string w is in the input buffer, as follows:
Example
Consider the following grammar
Primary Operations
The primary operations of shift-reduce parser are
• Shift
• Reduce
Possible Actions
There are actually four possible actions
• Shift
• Reduce
• Accept
• Error
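As an illustration (a sketch using the grammar E → E + E | E * E | (E) | -E | id from the earlier section, and one particular sequence of moves), a shift-reduce parse of the input id + id proceeds as follows:

Stack          Input          Action
$              id + id $      shift
$ id           + id $         reduce by E → id
$ E            + id $         shift
$ E +          id $           shift
$ E + id       $              reduce by E → id
$ E + E        $              reduce by E → E + E
$ E            $              accept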
is the same. The SLR uses FOLLOW sets as lookahead sets which associate the right hand
side of a LR(0) core to a lookahead terminal. This is a greater simplification than in the case
of LALR because many conflicts may arise from LR(0) cores sharing the same right hand
side and lookahead terminal, conflicts that are not present in LALR. This is why SLR has less
language recognition power than LALR with Canonical LR being stronger than both since it
does not include any simplifications.
Yacc provides a general tool for describing the input to a computer program. The
Yacc user specifies the structures of his input, together with code to be invoked as each such
structure is recognized. Yacc turns such a specification into a subroutine that handles the
input process; frequently, it is convenient and appropriate to have most of the flow of
control in the user's application handled by this subroutine.
1) Lexical Analysis:
Lexical analyzer: scans the input stream and converts sequences of
characters into tokens.
Lex is a tool for writing lexical analyzers.
A Yacc specification describes a context-free grammar (CFG) that can be used to generate a parser.
Elements of a CFG:
1. Terminals: tokens and literal characters,
2. Variables (non terminals): syntactical elements,
3. Production rules, and
4. Start symbol.
SEMANTIC ANALYSIS
Semantic Analysis computes additional information related to the
meaning of the program once the syntactic structure is known.
In typed languages such as C, semantic analysis involves adding information to the symbol table and performing type checking.
The information to be computed is beyond the capabilities of
standard parsing techniques, therefore it is not regarded as syntax.
As with lexical and syntax analysis, for semantic analysis we need both a representation formalism and an implementation mechanism.
As the representation formalism, these notes use what are called syntax-directed translations.
Types of attributes –
Attributes may be of two types – Synthesized or Inherited.
1. Synthesized attributes –
A Synthesized attribute is an attribute of the non-terminal on the left-hand side
of a production. Synthesized attributes represent information that is being
passed up the parse tree. The attribute can take value only from its children
(Variables in the RHS of the production).
For eg. let’s say A -> BC is a production of a grammar, and A’s attribute is dependent
on B’s attributes or C’s attributes then it will be synthesized attribute.
2. Inherited attributes –
An attribute of a non terminal on the right-hand side of a production is called
an inherited attribute. The attribute can take value either from its parent or from
its siblings (variables in the LHS or RHS of the production).
For example, let’s say A -> BC is a production of a grammar and B’s attribute is
dependent on A’s attributes or C’s attributes then it will be inherited attribute.
Now, let’s discuss about S-attributed and L-attributed SDT.
S-attributed SDT :
• If an SDT uses only synthesized attributes, it is called as S-attributed
SDT.
• S-attributed SDTs are evaluated in bottom-up parsing, as the values of
the parent nodes depend upon the values of the child nodes.
• Semantic actions are placed at the rightmost place of the RHS.
2. L-attributed SDT:
• If an SDT uses both synthesized attributes and inherited attributes with a
restriction that inherited attribute can inherit values from left siblings
only, it is called as L-attributed SDT.
• Attributes in L-attributed SDTs are evaluated by depth-first and left-to-
right parsing manner.
• Semantic actions are placed anywhere in RHS.
For example,
A -> XYZ {Y.S = A.S, Y.S = X.S, Y.S = Z.S}
is not an L-attributed SDT, since Y.S = A.S and Y.S = X.S are allowed, but Y.S = Z.S violates the L-attributed definition because the attribute is inheriting a value from its right sibling.
Note – If a definition is S-attributed, then it is also L-attributed but NOT vice-versa.
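For instance, a minimal S-attributed definition for sums of digits can be written as follows (the attribute name val is an illustrative choice):

E → E1 + T   { E.val = E1.val + T.val }
E → T        { E.val = T.val }
T → digit    { T.val = digit.lexval }

Here every attribute is synthesized, so the values can be computed during a bottom-up parse; such rules fit naturally into the semantic actions of a Yacc specification.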
SYMBOL TABLES
A symbol table is a major data structure used in a compiler.
It associates attributes with the identifiers used in a program; for instance, a type attribute is usually associated with each identifier.
A symbol table is a necessary component because the definition (declaration) of an identifier appears once in a program, while uses of the identifier may appear in many places of the program text.
Identifiers and attributes are entered by the analysis phases when processing a definition (declaration) of an identifier:
• In simple languages with only global variables and implicit declarations, the scanner can enter an identifier into the symbol table if it is not already there.
• In block-structured languages with scopes and explicit declarations, the parser and/or semantic analyzer enter identifiers and the corresponding attributes.
Symbol table information is used by the analysis and synthesis phases:
• To verify that used identifiers have been defined (declared)
• To verify that expressions and assignments are semantically correct (type checking)
• To generate intermediate or target code
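A minimal C sketch of such a table is given below. The array-based layout, the field names and the fixed sizes are illustrative assumptions; a production compiler would typically use a hash table and one table per scope. The lexical analyzer calls insert() when it recognizes an identifier, and later phases fill in the attributes.

    #include <string.h>

    #define MAX_SYMBOLS 100

    /* One entry per identifier; the attribute fields (type, scope) are illustrative. */
    struct symbol {
        char name[32];
        char type[16];      /* e.g. "int", "float"                    */
        int  scope;         /* nesting level of the declaring block   */
    };

    static struct symbol table[MAX_SYMBOLS];
    static int nsymbols = 0;

    /* Returns the index of name, or -1 if it has not been entered yet. */
    int lookup(const char *name) {
        for (int i = 0; i < nsymbols; i++)
            if (strcmp(table[i].name, name) == 0)
                return i;
        return -1;
    }

    /* Enters name (typically called from the lexical analyzer); later phases
       fill in type and scope when the declaration is processed. */
    int insert(const char *name) {
        int i = lookup(name);
        if (i >= 0) return i;                    /* already present */
        if (nsymbols == MAX_SYMBOLS) return -1;  /* table full      */
        strncpy(table[nsymbols].name, name, sizeof table[nsymbols].name - 1);
        table[nsymbols].name[sizeof table[nsymbols].name - 1] = '\0';
        table[nsymbols].type[0] = '\0';
        table[nsymbols].scope = 0;
        return nsymbols++;
    }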
INTERMEDIATE LANGUAGES
The front end translates a source program into an intermediate representation from
which the back end generates target code.
Benefits of using a machine-independent intermediate form are:
1. Retargeting is facilitated. That is, a compiler for a different machine can be
created by attaching a back end for the new machine to an existing front end.
2. A machine-independent code optimizer can be applied to the intermediate
representation.
Three ways of intermediate representation:
* Syntax tree
* Postfix notation
* Three address code
The semantic rules for generating three-address code from common programming
language constructs are similar to those for constructing syntax trees or for generating
postfix notation.
Postfix notation:
Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of the tree in which a node appears immediately after its children.
The ordinary (infix) way of writing the sum of a and b is with operator in the
middle : a + b
The postfix notation for the same expression places the operator at the right end as
ab +.
In general, if e1 and e2 are any postfix expressions and + is any binary operator, the result of applying + to the values denoted by e1 and e2 is indicated in postfix notation by
e1 e2 +. No parentheses are needed in postfix notation because the position and
arity (number of arguments) of the operators permit only one way to decode a
postfix expression. In postfix notation the operator follows the operand.
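For example, (a + b) * c is written a b + c * in postfix, while a + b * c (where * has higher precedence) is written a b c * +.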
Three-address code
Three-address code is a type of intermediate code which is easy to generate and can be easily converted to machine code. It makes use of at most three addresses and one operator to represent an expression, and the value computed at each instruction is stored in a temporary variable generated by the compiler. The compiler decides the order of operations given by the three-address code.
General representation – a = b op c
1. Quadruple –
It is a structure with four fields, namely op, arg1, arg2 and result. op denotes the operator, arg1 and arg2 denote the two operands, and result is used to store the result of the expression.
Advantage –
• Easy to rearrange code for global optimization.
• One can quickly access the values of temporary variables using the symbol table.
Disadvantage –
• Contains a lot of temporaries.
• Temporary variable creation increases time and space complexity.
2. Triples –
This representation doesn't make use of an extra temporary variable to represent a single operation; instead, when a reference to another triple's value is needed, a pointer to that triple is used. So it consists of only three fields, namely op, arg1 and arg2.
Disadvantage –
• Temporaries are implicit and difficult to rearrange code.
3. Indirect Triples –
This representation makes use of pointers to a listing of all references to computations, which is made separately and stored. It is similar in utility to the quadruple representation but requires less space. Temporaries are implicit and it is easier to rearrange code.
Example – Consider expression a = b * – c + b * – c
Question – Write quadruple, triples and indirect triples for following expression : (x
+ y) * (y + z) + (x + y + z)
Explanation – The three address code is:
t1 = x + y
t2 = y + z
t3 = t1 * t2
t4 = t1 + z
t5 = t3 + t4
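From this three-address code, the quadruple and triple tables can be filled in as follows (the numbering of entries is one possible layout):

Quadruples:
#     op    arg1    arg2    result
(0)   +     x       y       t1
(1)   +     y       z       t2
(2)   *     t1      t2      t3
(3)   +     t1      z       t4
(4)   +     t3      t4      t5

Triples:
#     op    arg1    arg2
(0)   +     x       y
(1)   +     y       z
(2)   *     (0)     (1)
(3)   +     (0)     z
(4)   +     (2)     (3)

For indirect triples, a separate statement list simply points to these triple entries in execution order, e.g. (35) → (0), (36) → (1), (37) → (2), (38) → (3), (39) → (4).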
Type Expressions: Types have structure, which we shall represent using type expressions: a type expression is either a basic type or is formed by applying an operator called a type constructor to a type expression. The sets of basic types and constructors depend on the language to be checked.
CHECKING
A compiler must check that the source program follows both the syntactic and semantic conventions of the source language.
This checking is called static checking.
Checking during execution of the target program is called dynamic checking.
Static Check
Examples of static checks are,
• Type check
• Flow of control checks
• Uniqueness checks
Flow-Of-Control Checks
A compiler should report an error if a control-flow statement appears inside a block that does not give it a proper place to transfer control to.
Example:
For a break statement inside a loop, the compiler should check that the loop is properly closed (with a matching closing brace at its end), so that there is a place for control to go when the break executes.
Uniqueness Check
There are some situations in which an object must be defined exactly once.
Example.
In a switch-case statement, the labels of the case statements must be distinct.
Name-Related Checks
Sometimes, the same name must appear two or more times.
Example.,
In Ada, a loop or block may have a name that appears at the
beginning and end of the construct.
Type Checking
A type checker verifies that the type of a construct matches that expected by its context.
Example:
• The arithmetic operator mod in Pascal requires integer operands.
• The type checker must verify that both operands of mod have type integer.
Similarly, the type checker must verify that
• dereferencing is applied only to a pointer,
• indexing is applied only to an array, and
• a user-defined function is invoked with the correct number and types of arguments.
TYPE SYSTEMS
The design of a type checker for a language is based on information about the syntactic constructs in the language and the rules for assigning types to them.
A type system is a collection of rules for assigning type expressions to the
various parts of a program.
Type expressions:
The type of a language construct will be denoted by a “type expression”.
A type expression is either a basic type or is formed by applying an operator called a type constructor to other type expressions.
Example.,
A basic type is a type expression.
The basic types are
• Boolean
• Char
• Integer
• Real
• type-error
A special basic type is type-error
It will signal an error during type checking.
A convenient way to represent a type expression is to use a graph (a tree or a DAG).
Example:
In Pascal notation, a function that takes two char parameters and returns a pointer to an integer has the type expression
char × char → pointer(integer)
UNIT IV
RUN-TIME ENVIRONMENT AND CODE GENERATION
Storage Organization, Stack Allocation Space, Access to Non-local Data on the
Stack, Heap Management - Issues in Code Generation - Design of a simple Code
Generator.
STORAGE ORGANIZATION
Data Objects
Control Stack
When control returns from a call, the suspended activation of the caller can be resumed using the values previously saved on the control stack.
Heap
Not all languages, nor all compilers, use all of these fields of an activation record.
The purposes of the fields are:
• Temporaries: values arising in the evaluation of expressions are stored in the field for temporaries.
• Local data: it holds data that is local to an execution of a procedure.
• Saved machine status: it holds information about the state of the machine just before the procedure is called.
This information includes the values of the program counter and the machine registers.
These values are restored when control returns from the procedure.
• Access link: for a language like Fortran, access links are not needed, but for languages with nested scopes such as Pascal they are needed.
• Control link: It points to the activation record of the caller.
• Actual parameter: It is used by the calling procedure to supply
parameters to the called procedure.
• Returned value: It is used by the called procedure to return a value to the
calling procedure.
Activation Tree
A program consists of procedures; a procedure definition is a declaration that, in its simplest form, associates an identifier (the procedure name) with a statement (the body of the procedure). Each execution of a procedure is referred to as
an activation of the procedure. Lifetime of an activation is the sequence of steps
present in the execution of the procedure. If ‘a’ and ‘b’ be two procedures then their
activations will be non-overlapping (when one is called after other) or nested
(nested procedures). A procedure is recursive if a new activation begins before an
earlier activation of the same procedure has ended. An activation tree shows the
way control enters and leaves activations.
Issues with Nested Procedures: Access becomes far more complicated when a
language allows procedure declarations to be nested and also uses the normal static
scoping rule; that is, a procedure can access variables of the procedures whose
declarations surround its own declaration, following the nested scoping rule
described for blocks in Section 1.6.3. The reason is that knowing at compile time
that the declaration of p is immediately nested within q does not tell us the relative
positions of their activation records at run time. In fact, since either p or q or both
may be recursive, there may be several activation records of p and/or q on the
stack.
Finding the declaration that applies to a nonlocal name x in a nested
procedure p is a static decision; it can be done by an extension of the static-scope
rule for blocks. Suppose x is declared in the enclosing procedure q. finding the
relevant activation of q from an activation of p is a dynamic decision; it requires
additional run-time information about activations. One possible solution to this
problem is to use "access links”.
Heap Management
The heap is used for dynamically allocated memory; its important operations are allocation and de-allocation. In C++, Pascal and Java, allocation is done via the new operator, while in C it is done via the malloc function call. De-allocation is done either explicitly by the program or automatically (by a garbage collector). The heap is the portion of the store that is used for data that lives indefinitely, or until the program explicitly deletes it.
Heap Memory Manager: The memory manager keeps track of all the free space
in heap storage at all times. It performs two basic functions:
1. Allocation
2. De allocation
S.No | Stack                                                        | Heap
1    | A stack is a linear data structure.                          | A heap is a hierarchical data structure.
2    | High-speed access.                                           | Slower compared to the stack.
3    | Local variables only.                                        | Allows you to access variables globally.
4    | Limit on stack size dependent on the OS.                     | Does not have a specific limit on memory size.
5    | Variables cannot be resized.                                 | Variables can be resized.
6    | Memory is allocated in a contiguous block.                   | Memory is allocated in any random order.
7    | Allocation and de-allocation are done automatically by compiler instructions. | They are done manually by the programmer.
8    | Does not require explicit de-allocation of variables.        | Explicit de-allocation is needed.
9    | Less cost.                                                   | More cost.
10   | Fast access.                                                 | Slow access.
CODE GENERATION
The final phase of compiler model is the code generator.
This phase receives the optimized intermediate code and generates the target code.
Input : it takes the intermediate representation of the source program.
Ex. three address code
Output : it produces the equivalent target program.
Ex. assembly language code or machine code.
1.Choice of Input
There are several choices for the intermediate language, such as three-address code (quadruples or triples), postfix notation and syntax trees.
2.Target Programs
The output of the code generator is the target program.
Like the intermediate code, the output may take one of a variety of forms:
i. Absolute machine language
ii. Relocatable machine language
iii. Assembly language
3.Memory Management
Mapping names in the source program to addresses of data objects in run time
memory is done co-operatively by the front end and the code generator.
From the symbol-table information, a relative address can be determined for the
name in a data area for the procedure.
If machine code is being generated, labels in three address statements
have to be converted to addresses of instructions.
4.Instruction Selection
It is somewhat difficult to select the proper instruction from the instruction set
of the target machine.
The uniformity and completeness of the instruction set are important factors.
If the target machine does not support each data type in a uniform manner, then
each exception to the general rule requires special handling.
Instruction speeds and machine idioms are other important factors.
If we do not care about the efficiency of the target program, instruction
selection is straightforward.
For each type of three- address statement we can design a code skeleton that
outlines the target code to be generated for that construct.
Example:
A three-address statement of the form
x := y + z
where x, y, and z are statically allocated, can be translated into the target code as follows:
MOV Y, R0
ADD Z, R0
MOV R0, X
This kind of statement-by-statement code generation often produces poor code.
For example, the sequence of statements
a := b + c
d := a + e
would be translated into
MOV b, R0
ADD c, R0
MOV R0, a
MOV a, R0
ADD e, R0
MOV R0, d
Here the fourth statement (MOV a, R0) is redundant and can be eliminated.
The quality of the generated code is determined by its speed and size.
A target machine should have a rich instruction set.
If the target machine has an "increment" instruction (INC), then the three-address statement
a := a + 1
may be implemented more efficiently by the single instruction
INC a
If INC instruction not present, the code will be
MOV a, R0
ADD #1, R0
MOV R0, a
5.Register Allocation
Instructions involving register operands are usually shorter and faster
than those involving operands in memory.
Therefore, efficient utilization of registers is one of the important factors for generating good code.
The use of registers is often subdivided into two sub-problems:
• Register allocation: select the set of variables that will reside in registers at a point in the program.
• Register assignment: pick the specific register that each such variable will reside in.
Example: the assignment d := (a-b) + (a-c) + (a-c) might be translated into the following three-address code sequence (one possible sequence):
t := a - b
u := a - c
v := t + u
d := v + u
UNIT V
CODE OPTIMIZATION
Principal Sources of Optimization – Peep-hole optimization - DAG- Optimization
of Basic Blocks-Global Data Flow Analysis - Efficient Data Flow Algorithm.
Optimization
Optimization is the process of modifying the intermediate code into a more optimal intermediate code, so that it runs faster.
Optimizing Compiler
A compiler that applies code-improving transformations (optimizations) is called an optimizing compiler.
Input
The input of the code optimizer is the intermediate code (e.g. three-address code).
The intermediate code is taken from the front end of the compiler.
Output
The output of the code optimizer is the improved intermediate code.
That can be sent on to code generator phase.
The code generator produces the target program from the transformed
intermediate code.
1.Common Sub-expression Elimination
Here 4*i is calculated two times and stored in t6 and t7.
We can use the value of t6, which was previously computed, instead of computing t7 again.
This is described in diagram (b).
2.Copy Propagation
Assignments of the form f := g are called copy statements, or copies for short.
The common sub-expression is eliminated by introducing a new temporary variable; the following figure shows the copies introduced during common sub-expression elimination.
3.Dead Code Elimination
A variable is dead at a point if its value is not used afterwards; statements that compute dead values can be removed. For example, when the value being tested is known at compile time (as with a debugging flag that is always false), we can eliminate both the test and the printing from the object code.
One advantage of copy propagation is that it often turns the copy statement into dead code.
Ex.
x := t3
a[t2] := t5
a[t4] := x
goto B2
can be converted into
a[t2] := t5
a[t4] := t3
goto B2
4.Constant Folding
Evaluate constant expression at compile time and replace the constant
expressions by their values.
Ex. the expression 2*3.14 would be replaced by 6.28.
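As a small illustration in C source terms (the variable names are arbitrary):

    /* before folding */
    double circumference = 2 * 3.14 * r;
    /* after folding: the compiler evaluates 2 * 3.14 at compile time */
    double circumference = 6.28 * r;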
5.Code Motion
Code motion moves code outside a loop.
An important modification that decreases the amount of code in a loop is code motion.
This transformation takes an expression that yields the same result independent of the number of times the loop is executed (a loop-invariant computation) and places the expression before the loop.
Ex. while (i <= limit - 2) { …….. }
If the while block does not change the value of the variable "limit", then "limit - 2" is a loop-invariant computation.
Code motion will then result in the equivalent of
t = limit - 2;
while (i <= t)
{
……..
}
6.Induction Variable Elimination
Loops are usually processed inside out.
7.Reduction in strength
Reduction in strength replaces expensive operations by equivalent cheaper ones
on the target machine.
Ex. x^2 calls the exponentiation routine; the equivalent cheaper implementation is x*x.
Fixed-point multiplication or division by a power of two is cheaper to implement as a shift operation (for example, i * 8 can be replaced by i << 3).
8.Peephole Optimization
Optimization technique for improving code in a short range.
A simple but effective technique for locally improving the target code is called
peephole optimization.
It is a method for trying to improve the performance of the target program by examining a short sequence of target instructions (called the peephole) and replacing these instructions by a shorter or faster sequence, whenever possible.
The peephole is a small moving window on the target program.
The code in the peephole need not be contiguous.
Repeated passes over the target code are necessary to get the maximum benefit.
2.Flow-of-Control Optimizations (Eliminating Jumps over Jumps)
The unnecessary jump sequence
goto L1
….
L1: goto L2
can be replaced by
goto L2
….
L1: goto L2
If there are now no jumps to L1, the statement L1: goto L2 can also be eliminated.
Similarly,
if a < b goto L1
….
L1: goto L2
can be replaced by
if a < b goto L2
…
L1: goto L2
3.Algebraic Simplification
There is no end to the amount of algebraic simplification that can be attempted through peephole optimization.
The most frequent algebraic simplifications are statements such as
x := x + 0
or
x := x * 1
which are often produced by straightforward intermediate code generation algorithms.
They can be eliminated easily through the peephole optimization.
DAG
Directed Acyclic Graph
A DAG for basic block is a directed acyclic graph with the following labels on
nodes:
1. The leaves of the graph are labeled by unique identifiers, which can be variable names or constants.
2. Interior nodes of the graph are labeled by an operator symbol.
3. Exterior nodes of the graph are labeled by identifiers.
Construction of DAGs-
Following rules are used for the construction of DAGs-
Rule 1:
• Operators are represented by interior nodes.
• Identifiers are represented by exterior nodes.
Rule 2:
• A checking is made to find if there exists any node with the same value.
• A new node is created only when node has new value.
• This action helps in detecting the common sub-expressions and avoiding the
re-computation of the same.
Rule 3:
• The assignment instructions of the form x:=y are not performed unless they
are necessary.
Problem-03:
(1) a = b x c
(2) d = b
(3) e = d x c
(4) b = e
(5) f = b + c
(6) g = f + d
Solution-
As another example, consider the following sequence of three-address statements:
a := b + c
b := a - d
c := b + c
d := a - d
When we construct the node for the fourth statement, d := a - d, that computation is already present from statement 2 (b := a - d).
So there is no need to create a new node; instead this definition is added to the appropriate existing node.
Since there are only three operator nodes in the resulting DAG, we can reconstruct the three-address code as
a := b + c
d := a - d
c := d + c
Note that when we look for common sub-expressions, we really are looking for expressions that are guaranteed to compute the same value, no matter how that value is computed.
Thus the DAG method will miss the fact that the expression computed by the first and fourth statements in the following sequence is the same, namely b + c:
a := b + c
b := b – d
c := c + d
e := b + c
However, algebraic identities applied to the DAG may expose the equivalence.
The DAG for this sequence is shown below.
The operation on DAGs that corresponds to dead-code elimination is quite straightforward to implement.
We delete from a DAG any root (node with no ancestors) that has no live variables attached.
Repeated application of this transformation will remove from the DAG all nodes that correspond to dead code.
Let us take a global view and consider all the points in all the blocks.
A path from p1 to pn is a sequence of points p1, p2, …, pn such that for each i between 1 and n-1, either:
1. Pi is the point immediately preceding a statement and pi+1 is the point
immediately following that statement in the same block. Or
2. Pi is the end of some block and pi+1 is the beginning of a successor
block.
To optimize the code efficiently, the compiler collects all the information about the program and distributes this information to each block of the flow graph. This process is known as data-flow analysis.
Certain optimizations can only be achieved by examining the entire program; they cannot be achieved by examining just a portion of the program.
For this kind of optimization, use-definition chaining is one particular problem: for each use of a variable, we try to find out which definitions of that variable may reach (be applicable at) the statement.
Based on the local information a compiler can perform some optimizations. For
example, consider the following code:
1. x = a + b;
2. x = 6 * 3
o In this code, the first assignment to x is useless: the value computed for x is never used in the program.
o At compile time the expression 6*3 will be computed, simplifying the
second assignment statement to x = 18;
Some optimization needs more global information. For example, consider the
following code:
1. a = 1;
2. b = 2;
3. c = 3;
4. if (....) x = a + 5;
5. else x = b + 4;
6. c = x + 1;
In this code, the assignment at line 3 is useless, and the expression x + 1 at line 6 can be simplified to the value 7 (since x is 6 on both branches).
But it is less obvious how a compiler can discover these facts by looking only at one or two consecutive statements. A more global analysis is required so that the
compiler knows the following things at each point in the program:
o Which variables are guaranteed to have constant values
o Which variables will be used before being redefined
Data flow analysis is used to discover this kind of property. The data flow analysis
can be performed on the program's control flow graph (CFG).
The control flow graph of a program is used to determine those parts of a program
to which a particular value assigned to a variable might propagate.