
KARPAGA VINAYAGA COLLEGE OF ENGINEERING AND TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CS8602
COMPILER DESIGN

for
BE-CSE, VI SEM

S.PRABU
Associate Professor/CSE
CS8602 COMPILER DESIGN L T P C
3 0 2 4
OBJECTIVES:
• To learn the various phases of a compiler.
• To learn the various parsing techniques.
• To understand intermediate code generation and run-time environment.
• To learn to implement the front-end of a compiler.
• To learn to implement a code generator.

UNIT I INTRODUCTION TO COMPILERS 9


Structure of a compiler – Lexical Analysis – Role of Lexical Analyzer – Input Buffering –
Specification of Tokens – Recognition of Tokens – Lex – Finite Automata – Regular Expressions
to Automata – Minimizing DFA.
UNIT II SYNTAX ANALYSIS 12
Role of Parser – Grammars – Error Handling – Context-free grammars – Writing a grammar –
Top Down Parsing - General Strategies Recursive Descent Parser Predictive Parser-LL(1)
Parser-Shift Reduce Parser-LR Parser-LR(0) Item-Construction of SLR Parsing Table -
Introduction to LALR Parser - Error Handling and Recovery in Syntax Analyzer-YACC.

UNIT III INTERMEDIATE CODE GENERATION 8


Syntax Directed Definitions, Evaluation Orders for Syntax Directed Definitions, Intermediate
Languages: Syntax Tree, Three Address Code, Types and Declarations, Translation of
Expressions, Type Checking.

UNIT IV RUN-TIME ENVIRONMENT AND CODE GENERATION 8


Storage Organization, Stack Allocation Space, Access to Non-local Data on the Stack, Heap
Management - Issues in Code Generation - Design of a simple Code Generator.

UNIT V CODE OPTIMIZATION 8


Principal Sources of Optimization – Peep-hole optimization - DAG- Optimization of Basic Blocks-
Global Data Flow Analysis - Efficient Data Flow Algorithm.

LIST OF EXPERIMENTS:
1. Develop a lexical analyzer to recognize a few patterns in C. (Ex. identifiers, constants,
comments, operators etc.). Create a symbol table, while recognizing identifiers.
2. Implement a Lexical Analyzer using Lex Tool
3. Implement an Arithmetic Calculator using LEX and YACC
4. Generate three address code for a simple program using LEX and YACC.
5. Implement simple code optimization techniques (Constant folding, Strength reduction and
Algebraic transformation)
6. Implement back-end of the compiler for which the three address code is given as input and
the 8086 assembly language code is produced as output.
PRACTICALS 30 PERIODS
THEORY 45 PERIODS
TOTAL : 75 PERIODS
OUTCOMES:
On Completion of the course, the students should be able to:
• Understand the different phases of a compiler.
• Design a lexical analyzer for a sample language.
• Apply different parsing algorithms to develop parsers for a given grammar.
• Understand syntax-directed translation and run-time environments.
• Learn to implement code optimization techniques and a simple code generator.
• Design and implement a scanner and a parser using LEX and YACC tools.
TEXT BOOK:
1. Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, Compilers: Principles,
Techniques and Tools, Second Edition, Pearson Education, 2009.

REFERENCES
1. Randy Allen, Ken Kennedy, Optimizing Compilers for Modern Architectures: A
Dependence-based Approach, Morgan Kaufmann Publishers, 2002.
2. Steven S. Muchnick, Advanced Compiler Design and Implementation, Morgan Kaufmann
Publishers - Elsevier Science, India, Indian Reprint 2003.
3. Keith D Cooper and Linda Torczon, Engineering a Compiler, Morgan Kaufmann
Publishers Elsevier Science, 2004.
4. V. Raghavan, Principles of Compiler Design, Tata McGraw Hill Education Publishers,
2010.
5. Allen I. Holub, Compiler Design in C, Prentice-Hall Software Series, 1993.
CS8602 – COMPILER DESIGN

UNIT I
INTRODUCTION TO COMPILERS
Structure of a compiler – Lexical Analysis – Role of Lexical Analyzer – Input
Buffering – Specification of Tokens – Recognition of Tokens – Lex – Finite
Automata – Regular Expressions to Automata – Minimizing DFA.

Compiler Fundamentals
Translator
Is a program that converts one form of language into another.
One form of language → [Translator] → Other form of language

Compiler
Is a program that converts a high-level program into a low-level program.

High level program → [Compiler] → Low level program

Interpreter
Is a program that converts a high-level program into a low-level program, line by line.
High level program → [Interpreter] → Low level program

Assembler
Is a program that converts an assembly language program into a low-level program.
Assembly language → [Assembler] → Low level program

Difference between Compiler and Interpreter


Sn   Compiler                               Interpreter
1    Takes the entire program as input      Takes a single instruction as input
2    Intermediate code is generated         No intermediate code
3    Memory requirement is more             Memory requirement is less
4    Ex. C Compiler                         Ex. BASIC

CLASSIFICATION OF COMPILER:
Depending upon the function they perform, compilers are classified as:
• Single pass compiler


• Multi pass compiler


• Load and Go compiler
• Debugging compiler (or) Optimizing compiler
• Cross compiler

Single Pass Compiler


Is a compiler that passes through the source code of each compilation unit only once.

Multi Pass Compiler


Is a compiler that processes the source code of each compilation unit more than once.

Load & Go Compiler


It produces absolute code, which executes immediately after compilation.

Optimizing Compiler
Is a compiler that minimizes or maximizes some attribute of an executable program (for example, execution time or code size).

Cross Compiler
Is a compiler capable of creating executable code for a platform other than the one on
which the compiler itself runs.

LANGUAGE PROCESSOR
In addition to a compiler, several other programs may be required to create an executable
target program.


A source program may be divided into modules stored in separate files.


The task of collecting the source program is sometimes entrusted to a separate program,
called a preprocessor.
The preprocessor may also expand shorthand, called macros, into source language elements.
The modified source program is then fed to a compiler. The compiler may produce an
assembly-language program as its output, because assembly language is easier to produce
as output and is easier to debug.
The assembly language is then processed by a program called an assembler that produces
relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to
be linked together with other relocatable object files and library files into the code that
actually runs on the machine.
The linker resolves external memory addresses, where the code in one file may refer to a
location in another file.
The loader then puts together all of the executable object files into memory for execution.

STRUCTURE OF A COMPILER
PHASES OF COMPILER:
Conceptually, a compiler operates in phases.
Phase: A phase is a logically cohesive operation that takes one format of source code as input
and generates another format as output.

The six compiler phases are


• Lexical analyzer
• Syntax analyzer
• Semantic analyzer
• Intermediate code generator
• Code optimizer
• Code generator
The symbol table manager and the error handler interact with all six phases of the compiler.


LEXICAL ANALYZER:
The first phase is Lexical analyzer or Scanner or Linear analyzer
PURPOSE: It reads the source program one character at a time and groups the characters into tokens.

TOKENS:
Token is a sequence of characters that can be treated as a single logical entity.
The tokens may be
• Identifiers :- Ex., x, y, num, s
• Keywords :- Ex., int, float, char, do, while, if, else
• Operator symbols :- Ex., +, -, *, /
• Special symbols :- Ex., (, ), [, ], {, }, ,
• Constants :- Ex., 5,3
Example: position := initial + rate * 60


• In total there are 7 tokens


• Here position - is an identifier
• := - is an operator symbol
• initial - is an identifier
• + - is an operator symbol
• rate - is an identifier
• * - is an operator symbol
• 60 - is a constant
There are two types of tokens
i) Specific strings
ii) Classes of strings

Specific strings:
The token will have token type without token value.
Ex., * is an operator, it has no value.

Classes of strings:
The token will have token type along with the token value.
Ex., 60 is a constant, its value is 60.

SYNTAX ANALYZER:
The second phase of compiler is Syntax analyzer or Parser or Hierarchical analyzer

PURPOSE: It accepts the stream of tokens and produce syntactic structure called Parse tree.

PARSE TREE:
Is a diagram that gives the syntactic structure of an expression.
Example: for the expression position := initial + rate * 60, the parse tree is shown below.
[Figure: parse tree for the expression]


Recursive Rules:-
The Hierarchical structure of a program is usually expressed by recursive rules.
The rules are
• Any identifier is an expression
• Any number is an expression
• If expression 1 and expression 2 are expressions, then
expression 1 + expression 2
expression 1 * expression 2
are also expressions.
There are two types of parsing
• Top down parsing
• Bottom up parsing

SEMANTIC ANALYZER:
It checks the source program for semantic errors,
and gathers type information for the subsequent code generation phase.
It uses the hierarchical structure determined by the syntax analysis phase to identify the
operators and operands of expressions and statements.

Type Checking:
An important component of semantic analysis is type checking.
The compiler checks that each operator is applied to operands that are permitted by the
source language specification.


Example: when a binary arithmetic operator is applied to an integer and a real,
the compiler may need to convert the integer to a real.

INTERMEDIATE CODE GENERATION:


After syntax and semantic analysis, some compilers generate an explicit intermediate
representation of the source program.
The intermediate code should have two important properties
• It should be easy to produce
• It should be easy to translate into target language.
The intermediate code generator transforms the parse tree into an intermediate language
representation of the source program.
Three address code is one type of intermediate code.

Three Address Code:-


This is like the assembly language for a machine in which every memory location can act
like a register.
Three address code consists of a sequence of instructions, each of which has at most three
operands.

CODE OPTIMIZATION:
This phase is designed to improve the intermediate code,
so that faster-running machine code will result.
The output is still three address code, but with increased efficiency.


CODE GENERATOR:
The final phase of the compiler is the generation of target code.
Normally it is a relocatable machine code or assembly code.

Example:
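As a sketch of the kind of target code meant here (the MOVF/MULF/ADDF mnemonics and
register names follow the usual textbook convention and are assumptions):

    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1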

The F in each instruction tells that the instruction deals with floating point numbers.
The # sign indicates a constant.
The first instruction moves id3 into R2.
The second instruction multiplies the contents of register R2 by 60, and so on.


SYMBOL TABLE MANAGEMENT [Table management (or) Book keeping]:


A symbol table is a data structure containing a record for each identifier, with fields for the
attributes of the identifier.
This data structure allows us to find the record for each identifier quickly and to store or
retrieve data from that record quickly.

Symbol Table:
Symbol table collects the information about various attributes of various identifiers.
These attributes may provide information about the storage allocated for an identifier, its
type, its scope.
In case of procedure names, the important information stored includes
• Number of arguments
• Types of arguments
• Method of passing each argument (pass by reference or pass by value)
During lexical analysis the identifier is entered into the symbol table.
Normally no other information is entered.


All other phases enter information such as data type and address location into the symbol
table.
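As a minimal sketch of such a record in C (the field names and sizes here are illustrative
assumptions, not from the course material):

    struct symbol {
        char name[32];   /* the identifier itself                         */
        int  type;       /* data type, filled in by later phases          */
        int  scope;      /* scope of the identifier                       */
        int  address;    /* storage location allocated to the identifier */
        int  nargs;      /* for procedure names: number of arguments      */
    };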

ERROR HANDLER:
The error handler is invoked when an error in the source program is detected.
Each phase can encounter errors.
After detecting an error a phase must deal with that error, so that compilation can proceed.
Lexical analysis can detect errors due to unrecognized tokens.
Syntax analysis can detect errors due to invalid syntactic structure.
Semantic analysis can detect errors where the syntactic structure is correct but the types
are invalid.
Errors Encountered in Different Phases

Error Detection and Reporting


Compiler Errors :-
• Lexical errors (e.g. misspelled word)
• Syntax errors (e.g. unbalanced parentheses, missing semicolon)
• Semantic errors (e.g. type errors)
• Logical errors (e.g. infinite recursion)
Error Handling
• Report errors clearly and accurately
• Recover quickly if possible
• Poor error recovery may lead to an avalanche of errors

GROUPING OF PHASES
Grouping of Phases
Front end : machine independent phases
• Lexical analysis
• Syntax analysis
• Semantic analysis
• Intermediate code generation
• Some code optimization
Back end : machine dependent phases
• Final code generation
• Machine-dependent optimizations
Front end
The front end analyzes the source code to build an internal representation of the program,
called the intermediate representation or IR.


It also manages the symbol table, a data structure mapping each symbol in the source code
to associated information such as location, type and scope.
This is done over several phases, which include the phases listed above.

Back end
The term back end is sometimes confused with code generator because of the overlapped
functionality of generating assembly code.
Some literature uses middle end to distinguish the generic analysis and optimization phases
in the back end from the machine-dependent code generators.
The main phases of the back end are those listed above: final code generation and
machine-dependent optimizations.

Advantages of separating Front end and Back end


We can produce compilers for different source languages for one target machine by
combining different front ends with the back end for that target machine.
Similarly, we can produce compilers for different target machines, by combining a front end
with back ends for different target machines.

COUSINS OF THE COMPILER:-


Need:-
The input of the compiler may be produced by one or more preprocessors.
The output of the compiler may need further processing before running on the machine.
I. Preprocessor
II. Assembler
III. Two pass assembly
IV. Loaders and link editors

Preprocessor:
Preprocessors produce input to the compilers.
They perform following functions
• Macro processing
• File inclusion
• Rational preprocessors
• Language extensions
Assemblers:
Some compilers produce assembly code, which is passed to an assembler for further
processing.
Definition:- An assembler converts the assembly code produced by the compiler into
machine code.


Two Pass Assembly:


The simplest form of assembler makes two passes over the input.
PASS: One complete reading of the file from beginning to end.
During the first pass, all the identifiers are entered into SYMTAB with their storage locations.
During the second pass, it translates each opcode into the equivalent machine code,
and it translates each identifier into an address by using SYMTAB.
The output of the second pass is usually relocatable machine code.

Loaders And Link Editors:


Loader performs 2 functions
i) Loading
ii) Link editing

Loading:-
Take the relocatable machine code.
Alter the address in machine code as per memory.
Place it into the memory at proper location.

Link Editor:-
Link editor allows us to make a single program from several files of relocatable machine code.
These files may be the result of several different compilations.
Some of the files may be library files or system routines.

LEXICAL ANALYSER
A lexical analyzer, also called a scanner, typically has the following functionality and
characteristics.
• Its primary function is to convert from a (often very long) sequence of characters into a
(much shorter, perhaps 10X shorter) sequence of tokens. This means less work for
subsequent phases of the compiler.
• The scanner must Identify and Categorize specific character sequences into tokens. It
must know whether every two adjacent characters in the file belong together in the same
token, or whether the second character must be in a different token.
• Most lexical analyzers discard comments & whitespace. In most languages these
characters serve to separate tokens from each other, but once lexical analysis is


completed they serve no purpose. On the other hand, the exact line # and/or column #
may be useful in reporting errors, so some record of what whitespace has occurred may
be retained. Note: in some languages, even popular ones, whitespace is significant.
• Handle lexical errors (illegal characters, malformed tokens) by reporting them intelligibly
to the user.
• Efficiency is crucial; a scanner may perform elaborate input buffering.
• Token categories can be (precisely, formally) specified using regular expressions, e.g.
IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*
• Lexical Analyzers can be written by hand, or implemented automatically using finite
automata.
In compilers, a "token" is:
1. a single word of source code input (a.k.a. "lexeme")
2. an integer code that refers to a single word of input
3. a set of lexical attributes computed from a single word of input

ROLE OF THE LEXICAL ANALYSER


Primary task:
To read the input characters and produce a sequence of tokens.
These tokens can be used by the syntax analyzer or parser.
The following diagram shows the interaction between the lexical analyzer and the parser.

This is implemented by making the lexical analyzer a subroutine or coroutine of the parser.
Upon receiving a "get next token" command from the parser, the lexical analyzer reads
the input characters until it can identify the next token.
The lexical analyzer may also perform certain secondary tasks:
• Stripping comments and white space out of the source program.
• Correlating error messages from the compiler with the source program.

Sometimes lexical analyzers are divided into a cascade of two phases.
• First is called “scanning”.


• Second is called "lexical analysis".

The scanner is responsible for the simple tasks, while the lexical analyzer proper
performs the more complex operations.

Lexical Errors

Few errors are identified clearly at the lexical level, because a lexical analyzer has a
very localized view of a source program.
Ex. If the string fi is encountered in a C program for the first time in the context fi(a==x),
a lexical analyzer cannot tell whether "fi" is a misspelling of the keyword if or an
undeclared function identifier.
Since fi is a valid identifier, the lexical analyzer must return the token for an identifier.

Panic mode recovery


The simplest recovery technique is panic mode recovery.
We delete successive characters from the remaining input until the lexical analyzer can find a
well-formed token.
This recovery technique may confuse the parser, but in an interactive computing
environment it may be quite adequate.
Other possible error recovery actions are,
• Deleting an extraneous character.
• Inserting a missing character.
• Replacing an incorrect character by a correct character.
• Transposing two adjacent characters.
Issues in Lexical Analyzer.
There are a number of reasons why the analysis portion of a compiler is normally separated
into lexical analysis and parsing (syntax analysis) phases.

1. Simplicity of design is the most important consideration.


• The separation of lexical and syntactic analysis often allows us to simplify at least one
of these tasks.
• For example, a parser that had to deal with comments and whitespace as syntactic
units would be considerably more complex than one that can assume comments and
whitespace have already been removed by the lexical analyzer.
• If we are designing a new language, separating lexical and syntactic concerns can lead
to a cleaner overall language design.

2. Compiler efficiency is improved.


• A separate lexical analyzer allows us to apply specialized techniques that serve only
the lexical task, not the job of parsing.
• In addition, specialized buffering techniques for reading input characters can speed
up the compiler significantly.
3. Compiler portability is enhanced.
• Input-device-specific peculiarities can be restricted to the lexical analyzer.

Token, Pattern, Lexeme


Token : Sequence of characters having a collective meaning.
Pattern : A set of strings is described by a rule called as a pattern.
Lexeme : A sequence of characters in the source program that is matched by the
pattern for a token.

ISSUES OF LEXICAL ANALYZER


There are three issues in lexical analysis:
• To make the design simpler.
• To improve the efficiency of the compiler.
• To enhance compiler portability.

INPUT BUFFERING
The lexical analyzer scans the characters of the source program one at a time to discover
tokens. Often, however, many characters beyond the next token may have to be examined
before the next token itself can be determined. For this and other reasons, it is desirable for the
lexical analyzer to read its input from an input buffer. Fig. 1.9 shows a buffer divided into two
halves of, say 100 characters each.


One pointer marks the beginning of the token being discovered. A lookahead pointer scans
ahead of the beginning point, until the token is discovered. We view the position of each pointer
as being between the character last read and the character next to be read. In practice, each
buffering scheme adopts one convention: either a pointer is at the symbol last read or at the
symbol it is ready to read.

The distance which the lookahead pointer may have to travel past the actual token may be large.
For example, in a PL/I program we may see:
DECLARE (ARG1, ARG2… ARG n)

Without knowing whether DECLARE is a keyword or an array name until we see the character
that follows the right parenthesis. In either case, the token itself ends at the second E. If the look
ahead pointer travels beyond the buffer half in which it began, the other half must be loaded
with the next characters from the source file.

Since the buffer shown in the above figure is of limited size, there is an implied constraint on
how much lookahead can be used before the next token is discovered. In the above example, if
the lookahead traveled to the left half and all the way through the left half to the middle, we
could not reload the right half, because we would lose characters that had not yet been grouped
into tokens. While we can make the buffer larger if we choose, or use another buffering scheme,
we cannot ignore the fact that lookahead is limited.
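A minimal C sketch of this two-halves scheme, assuming a half size of 100 and illustrative
names (fill_half, advance); a real scanner would also keep a lexeme-beginning pointer and
enforce the lookahead constraint described above:

    #include <stdio.h>
    #include <string.h>

    #define HALF 100

    static char buf[2 * HALF];   /* buffer divided into two halves */
    static int  forward = 0;     /* lookahead pointer              */
    static FILE *src;

    /* Load HALF characters into one half, padding with '\0' (used
     * here as an end-of-file sentinel) when the file runs out.    */
    static void fill_half(int start)
    {
        size_t n = fread(buf + start, 1, HALF, src);
        memset(buf + start + n, '\0', HALF - n);
    }

    /* Return the next character, reloading the *other* half each
     * time the forward pointer crosses into it.                   */
    static char advance(void)
    {
        char c = buf[forward];
        forward = (forward + 1) % (2 * HALF);
        if (forward == 0)         fill_half(0);    /* wrapped around */
        else if (forward == HALF) fill_half(HALF); /* crossed middle */
        return c;
    }

    int main(void)
    {
        src = stdin;
        fill_half(0);                       /* prime the left half  */
        for (char c = advance(); c != '\0'; c = advance())
            putchar(c);                     /* demo: echo the input */
        return 0;
    }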

TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g., IDENTIFIER,
NUMBER, COMMA). The process of forming tokens from an input stream of characters is called
tokenization.
A token can look like anything that is useful for processing an input text stream or text file.
Consider this expression in the C programming language: sum=3+2;

Lexeme   Token type
sum      Identifier
=        Assignment operator
3        Number
+        Addition operator
2        Number
;        End of statement

Lexeme:
A collection or group of characters forming a token is called a lexeme.
Pattern:
A pattern is a description of the form that the lexemes of a token may take.

In the case of a keyword as a token, the pattern is just the sequence of characters that form the
keyword. For identifiers and some other tokens, the pattern is a more complex structure that is
matched by many strings.
Expressing Tokens by Regular Expressions
Regular Expressions
Is a special kind of notation used to define a language,
which in turn defines the tokens or strings.
letter(letter | digit )*
The vertical bar above means “or”.
The parentheses are used to group sub expressions, the star means "zero or more
occurrences of," and the juxtaposition of letter with the remainder of the expression signifies
concatenation.

Rules
A regular expression can be built by following rules.
1. ε is a regular expression that denotes {ε}, the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a},
that is, the set containing the string a.
3. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r) | (s) is a regular expression denoting the language L(r) U L(s).
2. (r) (s) is a regular expression denoting the language L(r) L(s) .
3. (r) * is a regular expression denoting (L (r)) * .
4. (r) is a regular expression denoting L(r).

Algebraic Properties Of Regular Expressions


There are a number of algebraic laws obeyed by regular expressions.
These can be used to manipulate regular expressions into equivalent forms.


The following table shows some algebraic laws.
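AXIOM                         DESCRIPTION
r | s = s | r                 | is commutative
r | (s | t) = (r | s) | t     | is associative
(r s) t = r (s t)             concatenation is associative
r (s | t) = r s | r t         concatenation distributes over |
(s | t) r = s r | t r
ε r = r,  r ε = r             ε is the identity element for concatenation
r* = (r | ε)*                 relation between * and ε
r** = r*                      * is idempotent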

Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first produce a stylized
flowchart, called a transition diagram.
It is a diagrammatic representation, like a flowchart, used in lexical analysis to recognize
tokens.
In the transition diagram the following notations are used.
States : represented by circles.
Edges : used to connect the states.
Labels : input characters marking the transitions along the edges.

(1) main()
start → (0) --m--> (1) --a--> (2) --i--> (3) --n--> (4) --(--> (5) --)--> (6)

(2) for any identifier
start → (0) --letter--> (1), with a letter/digit loop on state (1), then on any other
character → (2), the accepting state
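A small C sketch of how the identifier diagram above can be turned into code (the function
name and return convention are illustrative assumptions):

    #include <ctype.h>
    #include <stdio.h>

    /* Returns the length of the identifier at the start of s, or 0
     * if s does not begin with a letter.                           */
    int match_identifier(const char *s)
    {
        int i = 0;
        if (!isalpha((unsigned char)s[i]))    /* state 0: need a letter        */
            return 0;
        i++;                                  /* move to state 1               */
        while (isalnum((unsigned char)s[i]))  /* state 1: loop on letter/digit */
            i++;
        return i;                             /* any other character: accept   */
    }

    int main(void)
    {
        printf("%d\n", match_identifier("rate60+x"));   /* prints 6 */
        return 0;
    }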

Lex (lexical analyzer generator)


Lex is a program designed to generate scanners, also known as tokenizers, which
recognize lexical patterns in text. Lex is an acronym that stands for "lexical analyzer generator." It
is intended primarily for Unix-based systems. The code for Lex was originally developed by Eric
Schmidt and Mike Lesk.


Lex can perform simple transformations by itself but its main purpose is to facilitate
lexical analysis, the processing of character sequences such as source code to produce symbol
sequences called tokens for use as input to other programs such as parsers. Lex can be used with
a parser generator to perform lexical analysis. It is easy, for example, to interface Lex and Yacc,
an open source program that generates code for the parser in the C programming language.
Lex is proprietary but versions based on the original code are available as open source. These
include a streamlined version called Flex, an acronym for "fast lexical analyzer generator," as well
as components of OpenSolaris and Plan 9.
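As a minimal sketch of a Lex specification in this spirit (the token names printed and the
rule set are illustrative assumptions):

    %{
    #include <stdio.h>
    %}

    %%
    [a-zA-Z][a-zA-Z0-9]*   { printf("IDENTIFIER(%s)\n", yytext); }
    [0-9]+                 { printf("NUMBER(%s)\n", yytext); }
    [ \t\n]+               ;   /* discard whitespace */
    .                      { printf("UNKNOWN(%s)\n", yytext); }
    %%

    int main(void)  { yylex(); return 0; }
    int yywrap(void) { return 1; }

Such a file can be built with lex and a C compiler (lex scan.l && cc lex.yy.c).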

AUTOMATA
These are language recognizers.
An automaton will take a string as input and say "yes" if it is a sentence of the language,
and say "no" if it is not a sentence of the language.

Finite Automata
The generalized transition diagram for a regular expression is called a finite automaton.
It is a labelled, directed graph.
Here the nodes are states, and the labelled edges are transitions.
Types
1. Non deterministic finite automata (NFA)
2. Deterministic finite automata (DFA)
NFA
The finite automata which satisfy the following conditions:
• Should have one start state
• Having one or more final states
• ε-transitions may be present.
• There can be one or more transitions on a particular input symbol.
"Non-deterministic" means that more than one transition out of a state may be possible on the
same input symbol.
Ex. a*

DFA
The finite automata which satisfy the following conditions:
• Should have one start state
• Having one or more final states
• There cannot be ε-transitions.
• There can be only one transition on a particular input symbol.
Ex. a*

Convert RE into NFA:-


1. ε

2. a

3. a/b

4. ab

5. a*


6. (a/b)*

7. (a/b)*abb

8. (0/1) (0/2)

9. 12/53

10. (abc)*

11. ((a/b) b*)*


(a/b)


b*

(a/b)b*

((a/b)b*)*

12. (a/b)*abb (a/b)*

13. (a*/b*)*


14. (a/b)*a (a/b)

15. (a/b/c)* (a/b)*

16. (a/b)a(a/b)*abb

17. 0(2/51)*


18. 4(0/123)*

19. (aa*/bb*)

20. (0/1)*1


NFA to DFA
1. RE = a*

NFA

Finding the states:

ε-closure{0} = {0,1,3} = A
ε-closure[move(A,a)] = ε-closure{2} = {1,2,3} = B
ε-closure[move(B,a)] = ε-closure{2} = B

Transition table:

State   a
A       B
B       B

Minimizing DFA:
Initial partition: { (A B) }
A and B are both accepting and have identical transitions, so they merge: { A }

Minimized Transition table:

State   a
A       A


DFA:

2. RE = (a/b)*abb

NFA

ɛ closure {0} = {0,1,2,4,7} = A


mov(A,a) = ɛ closure {3,8} = {3,8,6,1,2,4,7} = B
mov(A,b) = ɛ closure {5} = {5,6,1,2,4,7} = C
mov(B,a) = ɛ closure {3,8} = B
mov(B,b) = ɛ closure {9,5} = {9,5,6,7,1,2,4} = D
mov(C,a) = ɛ closure {3,8} = B
mov(C,b) = ɛ closure {5} = C
mov(D,a) = ɛ closure {8,3} = B
mov(D,b) = ɛ closure {10,5} = {10,5,6,7,1,2,4} = E
mov(E,a) = ɛ closure {8,3} = B
mov(E,b) = ɛ closure {5} = C
Transition table:

State   a   b
A       B   C
B       B   D
C       B   C
D       B   E
E       B   C


Minimizing DFA:
{ (A,B,C,D) (E) }
{ (A,B,C) (D) (E) }
{ (A,C) (B) (D) (E) }
A and C have identical rows in the transition table, so they merge into a single state A,
giving the final partition { (AC) (B) (D) (E) }.

Minimized Transition table:

State   a   b
A       B   A
B       B   D
D       B   E
E       B   A

DFA:


UNIT II
SYNTAX ANALYSIS
Syntax analysis is otherwise called parsing or hierarchical analysis.
Parser
Is a program that accepts tokens as input and produces a parse tree as output.

Tokens → [Parser] → Parse Tree

(or) Syntax analysis is the second phase of the compiler. It gets the input from the tokens
and generates a syntax tree or parse tree.

The parser uses the first components of the tokens produced by the lexical analyzer to
create a tree-like intermediate representation that depicts the grammatical structure of the
token stream.
A typical representation is a syntax tree in which each interior node represents an
operation and the children of the node represent the arguments of the operation.
A syntax tree for the token stream is shown as the output of the syntactic analyzer in
following Fig.

This tree shows the order in which the operations in the assignment are to be performed.
In compiler model, the parser obtains a string of tokens from the lexical analyzer.
And verifies that the string of token names can be generated by the grammar for the
source language.


We expect the parser to report any syntax errors in an intelligible fashion,
and to recover from commonly occurring errors to continue processing the remainder of
the program.
Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to
the rest of the compiler for further processing.
There are two general types of parsing methods for grammars:
1. Top-down parsing method
2. Bottom-up parsing method

THE ROLE OF PARSER


The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and
verifies that the string can be generated by the grammar for the source language. It reports
any syntax errors in the program. It also recovers from commonly occurring errors so that it
can continue processing its input.

1. It verifies the structure generated by the tokens based on the grammar.


2. It constructs the parse tree.
3. It reports the errors.
4. It performs error recovery.


ERRORS

The parser cannot detect errors such as:


1. Variable re-declaration
2. Variable initialization before use
3. Data type mismatch for an operation.

The above issues are handled by Semantic Analysis phase.

Syntax error handling :


Programs can contain errors at many different levels. For example :
1. Lexical, such as misspelling an identifier, keyword or operator.
2. Syntactic, such as an arithmetic expression with unbalanced parentheses.
3. Semantic, such as an operator applied to an incompatible operand.
4. Logical, such as an infinitely recursive call.

Functions of error handler :


1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs.

Error recovery strategies :


The different strategies that a parser uses to recover from a syntactic error are:
1. Panic mode
2. Phrase level
3. Error productions
4. Global correction

Panic mode recovery:

On discovering an error, the parser discards input symbols one at a time until a synchronizing
token is found. The synchronizing tokens are usually delimiters, such as


semicolon or end. It has the advantage of simplicity and does not go into an infinite loop.
When multiple errors in the same statement are rare, this method is quite useful.

Phrase level recovery:

On discovering an error, the parser performs local correction on the remaining input
that allows it to continue. Example: Insert a missing semicolon or delete an extraneous
semicolon etc.

Error productions:

The parser is constructed using augmented grammar with error productions. If an


error production is used by the parser, appropriate error diagnostics can be generated to
indicate the erroneous constructs recognized by the input.

Global correction:

Given an incorrect input string x and grammar G, certain algorithms can be used to
find a parse tree for a string y, such that the number of insertions, deletions and changes of
tokens is as small as possible. However, these methods are in general too costly in terms of
time and space.

CONTEXT-FREE GRAMMARS
A context-free grammar has four components.
• terminals,
• non terminals,
• a start symbol, and
• productions.

1.Terminals
Terminals are the basic symbols from which strings are formed.


2. Non terminals
Non terminals are syntactic variables that denote sets of strings.
The sets of strings denoted by non terminals help define the language generated by the
grammar.

3.Start Symbol

In a grammar, one non terminal is distinguished as the start symbol.


And the set of strings it denotes is the language generated by the grammar.
Conventionally, the productions for the start symbol are listed first.

4.Production
The productions of a grammar specify the manner in which the terminals and
nonterminals can be combined to form strings.
Each production consists of:
• A nonterminal (on the left side)
• An arrow
• A set of terminals and non-terminals (on the right side)
Example:
The following grammar defines simple arithmetic expressions:

expr → expr op expr | ( expr ) | - expr | id
op → + | - | * | / | ^

Terminals are id, +, -, *, /, ^


Non terminals are expr and op.
Starting symbol is expr.


Notational Conventions


It is necessary to employ the following notations for identifying terminals and non
terminals.
Following symbols are terminals:
• Lowercase letters early in the alphabet, such as a, b, c.
• Operator symbols such as +, *, and so on.
• Punctuation symbols such as parentheses, comma, and so on.
• The digits 0,1,. . . ,9.
• Boldface strings such as id or if, each of which represents a single terminal
symbol.
Following symbols are Non Terminals:
• Uppercase letters early in the alphabet, such as A, B, C.
• The letter S, which, when it appears, is usually the start symbol.
• Lowercase, italic names such as expr or stmt.

Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that
is, either nonterminals or terminals.
Unless stated otherwise, the left side of the first production is the start symbol.

DERIVATION
The derivational view gives a precise description of the top down construction of a parse
tree.
The central idea is that a production is treated as a rewriting rule, in which the non terminal on
the left is replaced by the string on the right side of the production.
Consider the following grammar G.
E → E+E | E*E | (E) | -E | id
The string –(id+id) is a sentence of grammar because there is a derivation,
E=> -E => -(E+E) => -(id +E) => -(id+id)
At each step in derivation there are two choices to be made.
We need to choose which Non terminal to replace.


Left most derivation


Derivations in which only the left most non terminal in any sentential form is replaced at
each step.
Such a derivation is called Left most derivation.

Right most derivation


Derivations in which only the right most non terminal in any sentential form is replaced at
each step.
Such a derivation is called Right most derivation.
It is otherwise called canonical derivation.
Ex. The right most derivation of -(id+id):
E => -E => -(E+E) => -(E+id) => -(id+id)

Parse Tree
A parse tree may be viewed as a graphical representation for a derivation that filters out
the choice regarding replacement order.
Each interior node of a parse tree is labeled by some non terminal.
Children of the node are labeled from Left to Right.
The leaves of the parse tree are labeled by non terminals or terminals and are read from left
to right.
They constitute a sentential form, called the yield or frontier of the tree.
Ex. Parse tree for -(id+id)

Ambiguity
A grammar that produces more than one parse tree for a given input string is called an
ambiguous grammar.
Or


An ambiguous grammar is one that produces more than one Right most derivation or left
most derivation for the same sentence.

PARSER
Top-Down Parsing (Recursive Descent Parsing)
Top-down parsing can be viewed as the problem of constructing a parse tree for the input
string,
Starting from the root and creating the nodes of the parse tree in preorder.
Equivalently, top-down parsing can be viewed as finding a leftmost derivation for an
input string.
Recursive Descent Parsing.
• Backtracking is needed (If a choice of a production rule does not work, we backtrack
to try other alternatives.)
• It is a general parsing technique, but not widely used.
• Not efficient
Backtracking : Making repeated scans of input string.

Consider the grammar

S → cAd
A → ab | a

And the input string "cad".


To construct the parse tree, we initially create a tree consisting of a single node S,
and apply the first production, since the input begins with the character 'c'.


We can expand A using the first alternative.

The resultant string is now "cabd", but this does not match the input.

So we go back (backtrack) to A, and try the other alternative of A.
Now the resultant string is "cad", which matches the input.

We halt and announce successful completion of parsing.

Predictive Parsing
The special case of recursive descent parsing is called predictive parsing.
It requires no backtracking.
It is an efficient one.
It needs a special form of grammars (LL(1) grammars).
Predictive Parser is also known as LL(1) parser.
In many cases , carefully writing a grammar by,
• Eliminating left recursion
• Left factoring the resulting grammar
We can obtain a grammar that can be parsed by a recursive descent parser that needs no
backtracking.


Stack Implementation Of Predictive Parser


It is possible to build a non recursive predictive parser by maintaining a stack explicitly,
rather than implicitly via recursive calls.
The key problem during predictive parsing is that of determining the production to be
applied for a non-terminal.
The nonrecursive parser in the figure looks up the production to be applied in a parsing table.

A table-driven predictive parser has an input buffer, a stack, a parsing table, and an output
stream.

Input Buffer
The input buffer contains the string to be parsed, followed by $, a symbol used as a right
end marker to indicate the end of the input string.
Stack
The stack contains a sequence of grammar symbols with $ on the bottom, indicating the
bottom of the stack.
Initially, the stack contains the start symbol of the grammar on top of $.

Parsing Table
The parsing table is a two dimensional array M[A,a] where A is a nonterminal, and a is a
terminal or the symbol $.

The parser is controlled by a program that behaves as follows.


The program considers X, the symbol on the top of the stack, and a, the current input
symbol.
These two symbols determine the action of the parser.


There are three possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the
next input symbol.
3. If X is a nonterminal, the program consults entry M[X, a] of the parsing table. This
entry is either a production of the grammar, whose right side then replaces X on the
stack, or an error entry.
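A runnable C sketch of this driver loop for the toy LL(1) grammar S → a S b | c (the
grammar, the names, and the hard-coded table entries are illustrative assumptions, not from
the course material):

    #include <stdio.h>

    static char stk[100];
    static int  top = 0;

    static void push(char c) { stk[++top] = c; }

    int main(void)
    {
        const char *input = "aacbb$";   /* sentence to parse, '$' = end marker */
        int ip = 0;

        push('$');                      /* bottom-of-stack marker */
        push('S');                      /* start symbol           */

        while (stk[top] != '$') {
            char X = stk[top];          /* symbol on top of stack */
            char a = input[ip];         /* current input symbol   */
            if (X == 'S') {             /* nonterminal: consult table entry M[S, a]     */
                if (a == 'a') {         /* M[S, a] = S -> aSb: pop S, push RHS reversed */
                    top--; push('b'); push('S'); push('a');
                } else if (a == 'c') {  /* M[S, c] = S -> c */
                    top--; push('c');
                } else {
                    printf("error: no entry M[S, %c]\n", a); return 1;
                }
            } else if (X == a) {        /* terminal matches input: pop and advance */
                top--; ip++;
            } else {
                printf("error: expected %c, saw %c\n", X, a); return 1;
            }
        }
        printf(input[ip] == '$' ? "accept\n" : "error: trailing input\n");
        return 0;
    }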

BOTTOM UP PARSER
These are the parsers which construct the parse tree from the leaves to the root (the starting
non-terminal) for the given input string.
Here the input string is reduced to the starting non-terminal.
Why is a bottom up parser otherwise called a shift-reduce parser?
Because it consists of shifting input symbols onto a stack until the right side of a production
appears on the top of the stack.
The right side may then be replaced by (reduced to) the symbol on the left side of the
production, and the process is repeated.

LR parsing
A much more general method of shift reduce parsing is called LR parsing.

Shift Reduce Parsing


Shift reduce parsing attempts to construct a parse tree for an input string beginning at the
leaves (the bottom) and working up towards the root (the top).
We can think this process as one of “reducing” a string w to the starting symbol of the
grammar G.
At each reduction step, a particular substring matching the right side of a production is
replaced by the left side of that production.
Ex. Consider the grammar

S → aABe
A → Abc | b
B → d

The sentence abbcde can be reduced to S by the following steps:

abbcde → aAbcde → aAde → aABe → S

This can be achieved by deriving the right most derivation in reverse:

S => aABe => aAde => aAbcde => abbcde

Stack Implementation of Shift Reduce Parser


Data Structure:-
A convenient way to implement a shift-reduce parser is to use a stack.
The stack holds grammar symbols and an input buffer holds the string w.

$ Symbol:-
It is used to mark the bottom of the stack.
It also marks the right end of the input.

Initial condition:
Initially the stack is empty (apart from the $ marker) and the string w is in the input buffer:

STACK: $        INPUT: w$

At each step, the parser performs one of the following actions.


• Shift one symbol from the input onto the parse stack
• Reduce one handle on the top of the parse stack. The symbols from the right hand side
of a grammar rule are popped off the stack, and the non terminal symbol is pushed on
the stack in their place.
• Accept is the operation performed when the start symbol is alone on the parse stack
and the input is empty.


• Error actions occur when no successful parse is possible.

Example
Consider the following grammar

E → E + E
E → E * E
E → ( E )
E → id

The input string is id1 + id2 * id3.
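A sketch of the parser's moves on this input, following the usual textbook trace (the
decision to keep shifting at "$ E + E" honours the precedence of * over +):

STACK            INPUT                ACTION
$                id1 + id2 * id3 $    shift
$ id1            + id2 * id3 $        reduce by E → id
$ E              + id2 * id3 $        shift
$ E +            id2 * id3 $          shift
$ E + id2        * id3 $              reduce by E → id
$ E + E          * id3 $              shift
$ E + E *        id3 $                shift
$ E + E * id3    $                    reduce by E → id
$ E + E * E      $                    reduce by E → E * E
$ E + E          $                    reduce by E → E + E
$ E              $                    accept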

Primary Operations
The primary operations of shift-reduce parser are
• Shift
• Reduce

Possible Actions
There are actually four possible actions
• Shift
• Reduce
• Accept


• Error

Difference Between Bottom Up And Top Down Parser

Introduction to LALR Parser


The LALR parser and its alternatives, the SLR parser and the
Canonical LR parser, have similar methods and parsing tables; their main difference is in the
mathematical grammar analysis algorithm used by the parser generation tool. LALR
generators accept more grammars than do SLR generators, but fewer grammars than full
LR(1). Full LR involves much larger parse tables and is avoided unless clearly needed for
some particular computer language. Real computer languages can often be expressed as
LALR(1) grammars. In cases where they can't, a LALR(2) grammar is usually adequate. If
the parser generator allows only LALR(1) grammars, the parser typically calls some hand-
written code whenever it encounters constructs needing extended lookahead.

Similar to an SLR parser and Canonical LR parser generator, an LALR


parser generator constructs the LR(0) state machine first and then computes the lookahead
sets for all rules in the grammar, checking for ambiguity. The Canonical LR constructs full
lookahead sets. LALR uses merged sets; that is, it merges lookahead sets where the LR(0) core


is the same. The SLR uses FOLLOW sets as lookahead sets which associate the right hand
side of a LR(0) core to a lookahead terminal. This is a greater simplification than in the case
of LALR because many conflicts may arise from LR(0) cores sharing the same right hand
side and lookahead terminal, conflicts that are not present in LALR. This is why SLR has less
language recognition power than LALR with Canonical LR being stronger than both since it
does not include any simplifications.

YACC- YET ANOTHER COMPILER-COMPILER

Yacc provides a general tool for describing the input to a computer program. The
Yacc user specifies the structures of his input, together with code to be invoked as each such
structure is recognized. Yacc turns such a specification into a subroutine that handles the
input process; frequently, it is convenient and appropriate to have most of the flow of
control in the user's application handled by this subroutine.

1) Lexical Analysis:
Lexical analyzer: scans the input stream and converts sequences of
characters into tokens.
Lex is a tool for writing lexical analyzers.

2) Syntactic Analysis (Parsing):


Parser: reads tokens and assembles them into language constructs using the grammar rules of
the language.
Yacc is a tool for constructing parsers.

Yet Another Compiler-Compiler


Reads a specification file that codifies the grammar of a language and generates a parsing
routine.


Yacc specification describes a Context Free Grammar (CFG) that can be used to generate a
parser.
Elements of a CFG:
1. Terminals: tokens and literal characters,
2. Variables (non terminals): syntactical elements,
3. Production rules, and
4. Start symbol.

Format of a YACC specification file:


declarations
%%
grammar rules and associated actions
%%
C programs
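As a minimal runnable sketch in this format — a one-line calculator over single digits
(an illustrative example, not the full course experiment):

    %{
    #include <stdio.h>
    #include <ctype.h>
    int yylex(void);
    void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
    %}

    %token DIGIT

    %%
    line : expr '\n'        { printf("= %d\n", $1); }
         ;
    expr : expr '+' term    { $$ = $1 + $3; }
         | term
         ;
    term : term '*' factor  { $$ = $1 * $3; }
         | factor
         ;
    factor : DIGIT
           ;
    %%

    int yylex(void)
    {
        int c = getchar();
        if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
        return c;                    /* '+', '*', '\n' or EOF */
    }

    int main(void) { return yyparse(); }

It can be built with yacc calc.y && cc y.tab.c.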



UNIT III
INTERMEDIATE CODE GENERATION

SEMANTIC ANALYSIS
Semantic Analysis computes additional information related to the
meaning of the program once the syntactic structure is known.
In typed languages such as C, semantic analysis involves adding information to
the symbol table and performing type checking.
The information to be computed is beyond the capabilities of
standard parsing techniques, therefore it is not regarded as syntax.
As for Lexical and Syntax analysis, also for Semantic Analysis we need
both a representation Formalism and an Implementation Mechanism.
As representation formalism this lecture illustrates what are called Syntax
Directed Translations.

SYNTAX DIRECTED TRANSLATION


The Principle of Syntax Directed Translation states that the meaning of an
input sentence is related to its syntactic structure, i.e., to its Parse-Tree.
By Syntax Directed Translations we indicate those formalisms for
specifying translations for programming language constructs guided by context-
free grammars.
o We associate Attributes to the grammar symbols representing the
language constructs.
o Values for attributes are computed by Semantic Rules
associated with grammar productions.
Evaluation of Semantic Rules may:
o Generate Code;
o Insert information into the Symbol Table;
o Perform Semantic Check;
o Issue error messages;
o etc.



There are two notations for attaching semantic rules:

1. Syntax Directed Definitions. High-level specification hiding many


implementation details (also called Attribute Grammars).
2. Translation Schemes. More implementation oriented: Indicate the order
in which semantic rules are to be evaluated.

Syntax Directed Definitions


• Syntax Directed Definitions are a generalization of context-free grammars in
which:
1. Grammar symbols have an associated set of Attributes;
2. Productions are associated with Semantic Rules for computing the values of
attributes.
Such formalism generates Annotated Parse-Trees where each node of the
tree is a record with a field for each attribute (e.g.,X.a indicates the attribute
a of the grammar symbol X).
The value of an attribute of a grammar symbol at a given parse-tree node is
defined by a semantic rule associated with the production used at that node.

We distinguish between two kinds of attributes:


1. Synthesized Attributes. They are computed from the values of the
attributes of the children nodes.
2. Inherited Attributes. They are computed from the values of the attributes
of both the siblings and the parent nodes

Syntax Directed Definitions: An Example


• Example. Let us consider the Grammar for arithmetic expressions. The Syntax
Directed Definition associates to each non terminal a synthesized attribute called
val.
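As a sketch, the standard desk-calculator definition with a synthesized attribute val
(the grammar symbols n and digit are the usual textbook assumptions) is:

PRODUCTION        SEMANTIC RULE
L → E n           print(E.val)
E → E1 + T        E.val := E1.val + T.val
E → T             E.val := T.val
T → T1 * F        T.val := T1.val * F.val
T → F             T.val := F.val
F → ( E )         F.val := E.val
F → digit         F.val := digit.lexval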


S – attributed and L – attributed SDTs in Syntax directed translation

Before coming up to S-attributed and L-attributed SDTs, here is a brief intro to


Synthesized or Inherited attributes

Types of attributes –
Attributes may be of two types – Synthesized or Inherited.
1. Synthesized attributes –
A Synthesized attribute is an attribute of the non-terminal on the left-hand side
of a production. Synthesized attributes represent information that is being
passed up the parse tree. The attribute can take value only from its children
(Variables in the RHS of the production).
For example, let's say A -> BC is a production of a grammar; if A's attribute depends
on B's attributes or C's attributes, then it is a synthesized attribute.
2. Inherited attributes –
An attribute of a non terminal on the right-hand side of a production is called
an inherited attribute. The attribute can take value either from its parent or from
its siblings (variables in the LHS or RHS of the production).
For example, let's say A -> BC is a production of a grammar; if B's attribute depends
on A's attributes or C's attributes, then it is an inherited attribute.
Now, let’s discuss about S-attributed and L-attributed SDT.


1. S-attributed SDT:
• If an SDT uses only synthesized attributes, it is called as S-attributed
SDT.
• S-attributed SDTs are evaluated in bottom-up parsing, as the values of
the parent nodes depend upon the values of the child nodes.
• Semantic actions are placed in rightmost place of RHS.
2. L-attributed SDT:
• If an SDT uses both synthesized attributes and inherited attributes with a
restriction that inherited attribute can inherit values from left siblings
only, it is called as L-attributed SDT.
• Attributes in L-attributed SDTs are evaluated in a depth-first, left-to-right
parsing manner.
• Semantic actions are placed anywhere in RHS.
For example,
A -> XYZ {Y.S = A.S, Y.S = X.S, Y.S = Z.S}
is not an L-attributed grammar since Y.S = A.S and Y.S = X.S are allowed but Y.S =
Z.S violates the L-attributed SDT definition as attributed is inheriting the value from
its right sibling.
Note – If a definition is S-attributed, then it is also L-attributed but NOT vice-versa.


SYMBOL TABLES
A symbol table is a major data structure used in a compiler.
It associates attributes with identifiers used in a program. For instance, a type
attribute is usually associated with each identifier. A symbol table is a necessary
component because the definition (declaration) of an identifier appears once in a
program, while uses of the identifier may appear in many places of the program text.
Identifiers and attributes are entered by the analysis phases, when processing a
definition (declaration) of an identifier.
In simple languages with only global variables and implicit declarations,
the scanner can enter an identifier into the symbol table if it is not already there.
In block-structured languages with scopes and explicit declarations,
the parser and/or semantic analyzer enter identifiers and corresponding attributes.
Symbol table information is used by the analysis and synthesis phases:
• To verify that used identifiers have been defined (declared)
• To verify that expressions and assignments are semantically correct – type
checking
• To generate intermediate or target code

INTERMEDIATE LANGUAGES
The front end translates a source program into an intermediate representation from
which the back end generates target code.
Benefits of using a machine-independent intermediate form are:
1. Retargeting is facilitated. That is, a compiler for a different machine can be
created by attaching a back end for the new machine to an existing front end.
2. A machine-independent code optimizer can be applied to the intermediate
representation.


INTERMEDIATE LANGUAGES
Three ways of intermediate representation:
* Syntax tree
* Postfix notation
* Three address code

The semantic rules for generating three-address code from common programming
language constructs are similar to those for constructing syntax trees or for generating
postfix notation.

Graphical Representations: Syntax tree:

A syntax tree depicts the natural hierarchical structure of a source program. A
dag (Directed Acyclic Graph) gives the same information but in a more compact way
because common subexpressions are identified. A syntax tree and dag for the
assignment statement a := b * - c + b * - c are shown in Fig. 3.2.


Postfix notation:
Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes
of the tree in which a node appears immediately after its children. The postfix notation
for the syntax tree given above is:

a b c uminus * b c uminus * + assign
The ordinary (infix) way of writing the sum of a and b is with operator in the
middle : a + b
The postfix notation for the same expression places the operator at the right end as
ab +.
In general, if e1 and e2 are any postfix expressions, and + is any binary operator,
the result of applying + to the values denoted by e1 and e2 is postfix notation by
e1e2 +. No parentheses are needed in postfix notation because the position and
arity (number of arguments) of the operators permit only one way to decode a
postfix expression. In postfix notation the operator follows the operand.

Example – The postfix representation of the expression (a – b) * (c + d) + (a – b) is
ab – cd + * ab – +
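The claim that position and arity alone determine the grouping can be checked with a
small stack evaluator. The following C sketch (restricted to single-digit operands, an
assumption made only to keep the scanner trivial) evaluates exactly this pattern of
expression:

#include <stdio.h>
#include <ctype.h>

/* Evaluate a postfix string of single-digit operands and + - * operators.
   Each operator pops its two operands and pushes the result, so the
   position of the operator alone determines the grouping. */
int eval_postfix(const char *s) {
    int stack[64], top = -1;
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            stack[++top] = *s - '0';      /* push operand */
        } else {
            int b = stack[top--];         /* right operand */
            int a = stack[top--];         /* left operand  */
            switch (*s) {
            case '+': stack[++top] = a + b; break;
            case '-': stack[++top] = a - b; break;
            case '*': stack[++top] = a * b; break;
            }
        }
    }
    return stack[top];
}

int main(void) {
    /* (5 - 3) * (2 + 4) + (5 - 3) == 14, written in postfix: */
    printf("%d\n", eval_postfix("53-24+*53-+"));
    return 0;
}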


Three-address code
Three address code is a type of intermediate code which is easy to
generate and can be easily converted to machine code. It makes use of at most three
addresses and one operator to represent an expression, and the value computed at
each instruction is stored in a temporary variable generated by the compiler. The
compiler decides the order of operations given by the three address code.

General representation – a = b op c

where a, b or c represents operands such as names, constants or compiler-generated
temporaries, and op represents the operator.
Example-1: Convert the expression (- a) * (b + c) into three address code.
Ans: t1 = b + c
t2 = - a
t3 = t2 * t1

Implementation of Three Address Code –

There are 3 representations of three address code, namely:
1. Quadruples
2. Triples
3. Indirect Triples

1. Quadruple –
It is a structure that consists of 4 fields, namely op, arg1, arg2 and result. op denotes
the operator, arg1 and arg2 denote the two operands, and result is used to store
the result of the expression.
Advantage –
• Easy to rearrange code for global optimization.
• One can quickly access value of temporary variables using symbol table.


Disadvantage –
• Contains a lot of temporaries.
• Temporary variable creation increases time and space complexity.

Example – Consider expression a = b * – c + b * – c.


The three address code is:
t1 = - c
t2 = b * t1
t3 = - c
t4 = b * t3
t5 = t2 + t4
a = t5
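A sketch of how these quadruples might be stored in C; the struct layout and the use
of strings for operands are illustrative assumptions (a real compiler would typically
use symbol-table pointers instead):

#include <stdio.h>

/* A quadruple: op, arg1, arg2, result. */
struct quad {
    const char *op, *arg1, *arg2, *result;
};

int main(void) {
    /* Quadruples for a = b * - c + b * - c (uminus = unary minus). */
    struct quad code[] = {
        { "uminus", "c",  "",   "t1" },
        { "*",      "b",  "t1", "t2" },
        { "uminus", "c",  "",   "t3" },
        { "*",      "b",  "t3", "t4" },
        { "+",      "t2", "t4", "t5" },
        { "=",      "t5", "",   "a"  },
    };
    printf("%-8s %-5s %-5s %s\n", "op", "arg1", "arg2", "result");
    for (int i = 0; i < 6; i++)
        printf("%-8s %-5s %-5s %s\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}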

2. Triples –
This representation doesn’t make use of an extra temporary variable to represent a
single operation; instead, when a reference to another triple’s value is needed, a
pointer to that triple is used. So, it consists of only three fields, namely op, arg1 and
arg2.
Disadvantage –
• Temporaries are implicit and difficult to rearrange code.


• It is difficult to optimize, because optimization involves moving intermediate
code. When a triple is moved, any other triple referring to it must also be
updated. (With the help of a pointer, however, one can directly access a
symbol table entry.)
Example – Consider expression a = b * – c + b * – c
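The triple table for this example did not survive extraction; reconstructed from the
three-address code above, it would look like this (parenthesized numbers are pointers
to earlier triples):

    #    op      arg1   arg2
    (0)  uminus  c
    (1)  *       b      (0)
    (2)  uminus  c
    (3)  *       b      (2)
    (4)  +       (1)    (3)
    (5)  =       a      (4)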

3. Indirect Triples –
This representation makes use of pointers to a listing of all references to
computations, which is made separately and stored. It is similar in utility to the
quadruple representation but requires less space. Temporaries are implicit
and it is easier to rearrange code.
Example – Consider expression a = b * – c + b * – c
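The indirect-triple listing for this example is likewise missing; it consists of the same
triple table as above plus a separate list of pointers giving the execution order (the
statement numbers 35–40 are illustrative):

    List of pointers:        Triples:
    35: (0)                  (0)  uminus  c
    36: (1)                  (1)  *       b    (0)
    37: (2)                  (2)  uminus  c
    38: (3)                  (3)  *       b    (2)
    39: (4)                  (4)  +       (1)  (3)
    40: (5)                  (5)  =       a    (4)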


Question – Write quadruple, triples and indirect triples for following expression : (x
+ y) * (y + z) + (x + y + z)
Explanation – The three address code is:

t1 = x + y
t2 = y + z
t3 = t1 * t2
t4 = t1 + z
t5 = t3 + t4
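The answer tables did not survive extraction; reconstructed from the three-address
code above (the listing numbers in the indirect-triple table are illustrative):

Quadruples:
    #    op   arg1   arg2   result
    (0)  +    x      y      t1
    (1)  +    y      z      t2
    (2)  *    t1     t2     t3
    (3)  +    t1     z      t4
    (4)  +    t3     t4     t5

Triples:
    #    op   arg1   arg2
    (0)  +    x      y
    (1)  +    y      z
    (2)  *    (0)    (1)
    (3)  +    (0)    z
    (4)  +    (2)    (3)

Indirect Triples: the same triple table as above, plus a separate list of pointers
35: (0), 36: (1), 37: (2), 38: (3), 39: (4) giving the execution order.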


Types and Declarations


• Type checking uses logical rules to reason about the behavior of a program at
run time. Specifically, it ensures that the types of the operands match the type
expected by an operator. For example, the && operator in Java expects its two
operands to be booleans; the result is also of type boolean.
• Translation Applications. From the type of a name, a compiler can determine
the storage that will be needed for that name at run time. Type information is
also needed to calculate the address denoted by an array reference, to insert
explicit type conversions, and to choose the right version of an arithmetic
operator, among other things.
The actual storage for a procedure call or an object is allocated at run time, when the
procedure is called or the object is created. As we examine local declarations at
compile time, we can, however, lay out relative addresses, where the relative address
of a name or a component of a data structure is an offset from the start of a data area.

Type Expressions: Types have structure, which we shall represent using type
expressions: a type expression is either a basic type or is formed by applying an
operator called a type constructor to a type expression. The sets of basic types and
constructors depend on the language to be checked.

STATIC CHECKING
A compiler must check that the source program follows both the syntactic and
semantic conventions of the source language.
This checking is called static checking.
Checking done during execution of the target program is called dynamic checking.

Static Check
Examples of static checks are,
• Type check
• Flow of control checks
• Uniqueness checks


• Name related checks


Type Checking
A compiler should report an error if an operator is applied to incompatible
operands.

Flow-Of-Control Checks
Statements that cause flow of control to leave a construct must have some place
to which to transfer that control.
Example:
A break statement must be enclosed within a loop (or switch) construct; the
compiler must verify that such an enclosing construct exists for every break
statement.
Uniqueness Check
There are some situations in which an object must be defined exactly once.
Example:
In a switch-case statement, the labels of the case statements must be distinct.

Name-Related Checks
Sometimes, the same name must appear two or more times.
Example:
In Ada, a loop or block may have a name that appears at both the
beginning and the end of the construct.


POSITION OF A TYPE CHECKER

A type checker verifies that the type of a construct matches the type expected by its
context. Example:
• The arithmetic operator mod in Pascal requires integer operands.
• The type checker must verify that both operands of mod have type integer.
Similarly, the type checker must verify that:
• Dereferencing is applied only to a pointer
• Indexing is done only on an array
• A user-defined function is invoked with the correct number and types of
arguments.

TYPE SYSTEMS
The design of a type checker for a language is based on information about the
syntactic constructs in the language.
A type system is a collection of rules for assigning type expressions to the
various parts of a program.

Type expressions:
The type of a language construct will be denoted by a “type expression”.
A type expression is either a basic type or is formed by applying an operator
called a type constructor to type expressions.

Example.,
A basic type is a type expression.
The basic types are
• Boolean


• Char
• Integer
• Real
• type-error
A special basic type is type-error
It will signal an error during type checking.
A convenient way to represent a type expression is to use a graph.

Example:
In a Pascal-like notation, the type expression
char × char → pointer(integer)
denotes a function that takes a pair of characters and returns a pointer to an integer.

Checking done by a compiler is said to be static, while checking done when the
target program runs is termed dynamic.
A good type system eliminates the need for dynamic checking,
• because it allows us to determine statically that those errors cannot
occur when the target program runs.

Strongly typed language:

A language is said to be strongly typed if its compiler can guarantee that the
programs it accepts will execute without type errors.

SPECIFICATION OF A SIMPLE TYPE CHECKER

The type checker is a translation scheme that synthesizes the type of each
expression from the types of its subexpressions.
The type checker can handle arrays, pointers, statements and functions.
Consider the following grammar.
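The grammar figure is missing from this copy; the simple language usually used for
this type checker is given by productions along the following lines (a reconstruction,
so treat the exact productions as assumed):

P → D ; E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
E → literal | num | id | E mod E | E [ E ] | E ↑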


The nonterminals P, D, T and E generate programs, declarations, types and
expressions respectively.

The language has two basic types:
• char
• integer
A third basic type, type_error, is used to signal errors.
All arrays start at 1.
↑ integer leads to the type expression pointer(integer).

Type Checking Of Expression


In the following rules, the synthesized attribute type for E gives the type
expression assigned by the type system to the expression generated by E.
The first rules say that the constants represented by the tokens literal and
num have type char and integer, respectively.
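The rules figure is missing here; written as a syntax-directed definition they are
usually given along these lines (a reconstruction):

E → literal   { E.type := char }
E → num       { E.type := integer }
E → id        { E.type := lookup(id.entry) }
E → E1 mod E2 { E.type := if E1.type = integer and E2.type = integer
                          then integer else type_error }
E → E1 [ E2 ] { E.type := if E2.type = integer and E1.type = array(s, t)
                          then t else type_error }
E → E1 ↑      { E.type := if E1.type = pointer(t) then t else type_error }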


Type Checking Of Statements


Statements typically do not have values; the special basic type void is assigned
to them.
If an error is detected within a statement, the type assigned to the statement is
type_error.
Consider the following assignment, conditional and while statements.
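The corresponding rules figure is also missing; the usual formulation (again a
reconstruction) is:

S → id := E       { S.type := if id.type = E.type then void else type_error }
S → if E then S1  { S.type := if E.type = boolean then S1.type else type_error }
S → while E do S1 { S.type := if E.type = boolean then S1.type else type_error }
S → S1 ; S2       { S.type := if S1.type = void and S2.type = void
                              then void else type_error }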


UNIT IV
RUN-TIME ENVIRONMENT AND CODE GENERATION
Storage Organization, Stack Allocation Space, Access to Non-local Data on the
Stack, Heap Management - Issues in Code Generation - Design of a simple Code
Generator.

RUN TIME ENVIRONMENTS


While executing the program, the same name in the source text can denote
different data objects in the target machine.

Run Time Support Package

The allocation and de-allocation of data objects is managed by the run-time
support package.
It consists of routines loaded with the generated target code.
Its design is influenced by the semantics of procedures.
Each execution of a procedure is referred to as an activation of the
procedure.

STORAGE ORGANIZATION

The organization of run-time storage described here can be used for languages
such as Fortran, Pascal and C.

Subdivision of Run-Time Memory

The run-time storage is subdivided to hold:
1. The generated target code
2. Data objects
3. A counterpart of the control stack


Generated target code:

The size of the generated target code is fixed at compile time.


So the compiler can place it in a statically determined area.

Data Objects

All data objects in Fortran can be allocated statically.
Similarly, the sizes of some data objects may be known at compile time,
and these too can be placed in a statically determined area.

Control Stack

It is used to manage activations of procedures.

When a call occurs, execution of the current activation is interrupted and
information about the status of the machine is saved on the stack.


When control returns from the call, the interrupted activation can be resumed
using the values previously saved on the stack.

Heap

A separate area of run-time memory, called the heap, holds all other information.

The storage for some data objects is taken from the heap, and the heap may also
be used to keep information about activations whose lifetimes cannot be
managed with a stack.
The sizes of the stack and heap can change as the program executes,
and they can grow toward each other as needed.
Pascal and C need both a run-time stack and a heap.
Activation records

Information needed by a single execution of a procedure is managed using a
contiguous block of storage called an activation record or frame.
It consists of a collection of fields, as shown below.


Not all languages, nor all compilers, use all of these fields.
The purposes of the fields are:
• Temporaries: values arising in the evaluation of expressions are
stored in the fields for temporaries.
• Local data: holds data that is local to an execution of the procedure.
• Saved machine status: holds information about the state of the
machine just before the procedure is called.
This information includes the values of the program counter and the machine
registers.
These values are restored when control returns from the
procedure.
• Access link: for a language like Fortran access links are not needed, but
they are needed for a language like Pascal, where nested procedures may
access nonlocal data.
• Control link: it points to the activation record of the caller.
• Actual parameters: used by the calling procedure to supply
parameters to the called procedure.
• Returned value: used by the called procedure to return a value to the
calling procedure.
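A minimal C sketch of these fields gathered into a struct; the field types and array
sizes are purely illustrative assumptions (real layouts are machine- and
language-dependent):

/* One activation record (frame); field order follows the list above. */
struct activation_record {
    int   returned_value;     /* value handed back to the caller         */
    int   actual_params[4];   /* parameters supplied by the caller       */
    void *control_link;       /* -> activation record of the caller      */
    void *access_link;        /* -> frame of the enclosing scope, if any */
    void *saved_pc;           /* saved machine status: program counter   */
    long  saved_regs[8];      /* saved machine status: registers         */
    int   local_data[16];     /* data local to this activation           */
    int   temporaries[8];     /* intermediate values of expressions      */
};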

Activation Tree
A program consists of procedures; a procedure definition is a
declaration that, in its simplest form, associates an identifier (the procedure name)
with a statement (the body of the procedure). Each execution of a procedure is
referred to as an activation of the procedure. The lifetime of an activation is the
sequence of steps present in the execution of the procedure. If ‘a’ and ‘b’ are two
procedures, then their activations will be either non-overlapping (when one is called
after the other) or nested (as with nested procedures). A procedure is recursive if a
new activation begins before an earlier activation of the same procedure has ended.
An activation tree shows the way control enters and leaves activations.


Properties of activation trees are :-


• Each node represents an activation of a procedure.
• The root shows the activation of the main function.
• The node for procedure ‘x’ is the parent of node for procedure ‘y’ if and
only if the control flows from procedure x to procedure y.

Example – Consider the following program of Quicksort


main()
{
    int n;
    readarray();
    quicksort(1, n);
}

quicksort(int m, int n)
{
    if (n > m) {
        int i = partition(m, n);
        quicksort(m, i - 1);
        quicksort(i + 1, n);
    }
}
The activation tree for this program will be:
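The tree figure is missing from this copy; assuming the first call to partition returns
some index i, its general shape is (sub-trees of the recursive calls elided, since the
exact tree depends on the data):

main
├── readarray
└── quicksort(1, n)
    ├── partition(1, n)
    ├── quicksort(1, i - 1)
    │   └── …
    └── quicksort(i + 1, n)
        └── …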


Access to Nonlocal Data on the Stack


Introduction: It is important to know how procedures access their data,
especially the mechanism for finding data used within a procedure p that does
not belong to p. Access becomes more complicated in languages where procedures
can be declared inside other procedures. We therefore begin with the simple case of
C functions, and then introduce a language, ML, that permits both nested function
declarations and functions as "first-class objects": functions can take
functions as arguments and return functions as values. This capability can be
supported by modifying the implementation of the run-time stack.

Data Access without Nested Procedures: In the C family of languages, all


variables are defined either within a single function or outside any function
("globally"). Most importantly, it is impossible to declare one procedure whose
scope is entirely within another procedure. Rather, a global variable v has a scope
consisting of all the functions that follow the declaration of v, except where there is
a local definition of the identifier v.
Variables declared within a function have a scope consisting of that function only,
or part of it, if the function has nested blocks, as discussed in Section 1.6.3. For
languages that do not allow nested procedure declarations, allocation of storage for
variables and access to those variables is simple:
1. Global variables are allocated static storage. The locations of these variables
remain fixed and are known at compile time. So to access any variable that is not
local to the currently executing procedure, we simply use the statically determined
address.
2. Any other name must be local to the activation at the top of the stack. We may
access these variables through the top_sp pointer of the stack.
An important benefit of static allocation for globals is that declared procedures may
be passed as parameters or returned as results (in C, a pointer to the function is
passed) with no substantial change in the data-access strategy. With the C static-
scoping rule, and without nested procedures, any name nonlocal to one procedure is
nonlocal to all procedures, regardless of how they are activated. Similarly, if a
procedure is returned as a result, then any nonlocal name refers to the storage
statically allocated for it.


Issues with Nested Procedures: Access becomes far more complicated when a
language allows procedure declarations to be nested and also uses the normal static
scoping rule; that is, a procedure can access variables of the procedures whose
declarations surround its own declaration, following the nested scoping rule
described for blocks in Section 1.6.3. The reason is that knowing at compile time
that the declaration of p is immediately nested within q does not tell us the relative
positions of their activation records at run time. In fact, since either p or q or both
may be recursive, there may be several activation records of p and/or q on the
stack.
Finding the declaration that applies to a nonlocal name x in a nested
procedure p is a static decision; it can be done by an extension of the static-scope
rule for blocks. Suppose x is declared in the enclosing procedure q. Finding the
relevant activation of q from an activation of p is a dynamic decision; it requires
additional run-time information about activations. One possible solution to this
problem is to use "access links".

Heap Management
The heap is used for dynamically allocated memory; its important operations
include allocation and de-allocation. In C++, Pascal and Java, allocation is done via
the “new” operator, while in C it is done via the “malloc” function call. De-
allocation is done either automatically (e.g. by a garbage collector, as in Java) or
explicitly (e.g. free in C). The heap is the portion of the store that is
used for data that lives indefinitely, or until the program explicitly deletes it.

While local variables typically become inaccessible when their procedures
end, many languages enable us to create objects or other data whose existence is
not tied to the procedure activation that creates them. For example, both C++ and
Java provide new to create objects that may be passed, or pointers to
them may be passed, from procedure to procedure, so they continue to exist long
after the procedure that created them is gone. Such objects are stored on a heap.

Heap Memory Manager: The memory manager keeps track of all the free space
in heap storage at all times. It performs two basic functions:
1. Allocation
2. De-allocation


Properties of memory managers:

• Space efficiency
• Program efficiency
• Low overhead

Key Differences between Stack and Heap

1. A stack is a linear data structure; a heap is a hierarchical data structure.
2. The stack gives high-speed access; the heap is slower compared to the stack.
3. The stack holds local variables only; the heap allows you to access variables globally.
4. The limit on stack size is dependent on the OS; the heap does not have a specific limit on memory size.
5. Stack variables cannot be resized; heap variables can be resized.
6. Stack memory is allocated in a contiguous block; heap memory is allocated in any random order.
7. Stack management is done automatically by compiler instructions; heap management is done manually by the programmer.
8. The stack does not require explicit de-allocation of variables; the heap needs explicit de-allocation.
9. Stack: less cost; heap: more cost.
10. Stack: fast access; heap: slow access.

CODE GENERATION
The final phase of the compiler model is the code generator.
This phase receives the optimized intermediate code and generates the target
code.
Input: the intermediate representation of the source program,
e.g. three address code.
Output: the equivalent target program,
e.g. assembly language code or machine code.


The output code must be correct and of high quality.

High quality means it uses the system resources effectively.

ISSUES IN THE DESIGN OF A CODE GENERATOR


While the details are dependent on the target language and the operating system,
issues such as
• memory management,
• instruction selection,
• register allocation, and
• evaluation order
are inherent in almost all code generation problems.
The following things are main issues in design of code generator
1. Input to the code generator
2. Target programs
3. Memory management
4. Instruction selection
5. Register allocation
6. Choice of evaluation order:
7. Approaches to code generation

1.Input To The Code Generator


The input to the code generator consists of the intermediate representation of the
source program produced by the front end.
That can be combined with information in the symbol table.
That is used to determine the run time addresses of the data objects denoted by
the names in the intermediate representation.

Choice of input:
There are several choices for the intermediate language, such as


i. Linear representation (postfix notation)


ii. Graphical representation (syntax trees and dags)
iii. Three address representation (quadruples, triples)
iv. Virtual machine representation (Stack machine code)
Prior to code generation, the front end has scanned, parsed, and
translated the source program into the intermediate representation,
so the values of names appearing in the intermediate language can be
represented by quantities that the target machine can directly manipulate
(bits, integers, reals, pointers, etc.).
The necessary type checking has also taken place.
Therefore, the input to the code generator is assumed to be free from errors.

2.Target Programs
The output of the code generator is the target program.
Like the intermediate code, this output may take on a variety of forms:
i. absolute machine language
ii. Relocatable machine language
iii. assembly language.

i. Absolute machine language:


Producing an absolute machine language program has the advantage that it can
be placed in a fixed location in memory and executed immediately.
A small program can therefore be compiled and executed quickly.
Example., A number of “student-job” compilers, such as
a. WATFIV and
b. PL/C, produce absolute code.


ii. Relocatable machine language:


It allows subprograms to be compiled separately.
A set of relocatable object modules can be linked together and loaded for
execution by a linking loader.
If the target machine does not handle relocation automatically, the compiler
must provide explicit relocation information to the loader.

iii. Assembly language program:


It makes the code generation somewhat easier.
We can generate symbolic instructions and use the macro facilities of the
assembler to help generate code.

3.Memory Management
Mapping names in the source program to addresses of data objects in run time
memory is done co-operatively by the front end and the code generator.
From the symbol-table information, a relative address can be determined for the
name in a data area for the procedure.
If machine code is being generated, labels in three address statements
have to be converted to addresses of instructions.

4.Instruction Selection
It is somewhat difficult to select the proper instruction from the instruction set
of the target machine.
The uniformity and completeness of the instruction set are important factors.
If the target machine does not support each data type in a uniform manner, then
each exception to the general rule requires special handling.
Instruction speeds and machine idioms are other important factors.
If we do not care about the efficiency of the target program, instruction
selection is straightforward.


For each type of three-address statement we can design a code skeleton that
outlines the target code to be generated for that construct.

Example:
Consider a three-address statement of the form
x := y + z
where x, y, and z are statically allocated.
That can be translated into the target code as follows
MOV Y, R0
ADD Z, R0
MOV R0, X
This kind of statement-by-statement code generation often produces
poor code.
For example, the sequence of statements
a := b + c
d := a + e
would be translated into
MOV b, R0
ADD c, R0
MOV R0, a
MOV a, R0
ADD e, R0
MOV R0, d
Here the fourth statement, MOV a, R0, is redundant since the value of a is already in R0.
The quality of the generated code is determined by its speed and size.
A target machine should have rich instruction set.
If the target machine has an “increment” instruction (INC), then the three address
statement a := a + 1 may be implemented more efficiently by the single instruction


INC a.
If INC instruction not present, the code will be
MOV a, R0
ADD #1, R0
MOV R0, a

5.Register Allocation
Instructions involving register operands are usually shorter and faster
than those involving operands in memory.
Therefore, efficient utilization of register is one of the important factor for
generation of good code.
The use of registers is often subdivided into two sub problems:

1. During register allocation, we select the set of variables that will


reside in registers at a point in the program.
2. During a subsequent register assignment phase, we pick the specific register
that a variable will reside in.

Finding an optimal assignment of registers to variables is difficult, even
on a machine with a single register.
Certain machines require register-pair (an even and next odd-numbered
registers) for some Operands and results.
For example, in the IBM System/370 machines integer multiplication and
integer division involve register pairs.
Example:
The multiplication instruction is of the form
M x, y
where x is the multiplicand and y is the multiplier.
x is the even register of an even/odd register pair, and y is a single register.
The product occupies the entire even/odd register pair.


6.Choice of evaluation order:


The order in which computations are performed can affect the efficiency of the
target code.
Some computation orders require fewer registers to hold intermediate results
than others.
Selecting the best order is another difficult (NP-complete) problem.

7. Approaches To Code Generation


Most important criteria for a code generator is that it produce correct code.
Designing a code generator so it can be easily implemented, tested, and
maintained is an important design goal.

Design of a simple Code Generator


A code generator generates target code for a sequence of three- address statements
and effectively uses registers to store operands of the statements.

Register and Address Descriptors:


• A register descriptor is used to keep track of what is currently in each register.
The register descriptors show that initially all the registers are empty.
• An address descriptor stores the location where the current value of a name can
be found at run time.

Code Generator Algorithm


Input : Sequence of three address code statements
Output : Target code

For each statement of the form x := y OP z, perform the following actions:

1. Invoke the function getreg() to determine the location L in which the
computation should be done; L is usually a register.


2. Consult the address descriptor for y to determine y', a current location of y;
prefer a register if y is currently in both a register and memory. If the value of
y is not already in L, generate the instruction MOV y', L.
3. Generate the instruction OP z', L, where z' is a current location of z, and
update the address descriptor of x to indicate that x is in L.
4. If the current values of y and/or z have no next uses (are not live) and are in
registers, update the register descriptors to free those registers.

Generating Code for Assignment Statements:

• The assignment d := (a - b) + (a - c) + (a - c) might be translated into the
three-address code sequence shown below, together with the code generated for it.
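The code-sequence figure is missing from this copy; applying the algorithm above
gives the following (a reconstruction of the standard worked example; t, u and v are
the compiler-generated temporaries):

Three-address code:        Generated code:        Register descriptor:
t := a - b                 MOV a, R0
                           SUB b, R0              R0 contains t
u := a - c                 MOV a, R1
                           SUB c, R1              R1 contains u
v := t + u                 ADD R1, R0             R0 contains v
d := v + u                 ADD R1, R0             R0 contains d
                           MOV R0, d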


UNIT V
CODE OPTIMIZATION
Principal Sources of Optimization – Peep-hole optimization - DAG- Optimization
of Basic Blocks-Global Data Flow Analysis - Efficient Data Flow Algorithm.

Optimization
Optimization is the process of transforming the intermediate code into a more
optimal intermediate code, so that the program runs faster.

Optimizing Compiler
The compiler that apply code improving transformations (optimization), is
called optimizing compiler.

Properties of optimizing compiler


1. The transformation must preserve the meaning of the program.
2. The transformation must speed up the program by a measurable amount.
3. The transformation must be worth the effort.

Input
The input to the code optimizer is the intermediate code (e.g. three address
code).
The intermediate code is taken from the front end of the compiler.

Output
The output of the code optimizer is the improved intermediate code.
That can be sent on to code generator phase.
The code generator produces the target program from the transformed
intermediate code.


PRINCIPAL SOURCES OF OPTIMIZATION


Sources of Optimization
1.Common sub expression elimination
2.Copy propagation
3.Dead code elimination
4.Constant folding
5. Code Motion
6.Induction Variable Elimination
7.Reduction in strength
8.Peephole Optimization
8.1. Redundant instruction elimination
8.2. Flow of control optimization
8.3. Algebraic simplifications
8.4. Use of machine idioms

1.Common sub expression elimination


A repeated occurrence of an expression E is called a common sub expression
if E was previously computed and the values of the variables in E have not changed
since the previous computation.
In that case we can avoid re-computing the expression and use the previously
computed value.

Here 4*i is calculated twice and stored in t6 and t7.
We can use the value of t6, which was previously computed, instead of computing t7,
as illustrated below.
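The figure referred to above is missing; the transformation it showed has the
following shape (the variable names follow the text, and the surrounding array
accesses are assumed for illustration):

Before (a):                After (b):
t6 := 4 * i                t6 := 4 * i
x  := a[t6]                x  := a[t6]
t7 := 4 * i                y  := a[t6]
y  := a[t7]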


2.Copy Propagation
Assignments of the form f := g are called copy statements, or copies for short.
When common sub expressions are eliminated, new copies are introduced through
temporary variables; the following figure shows the copies introduced during
common sub expression elimination.

The idea behind the copy propagation transformation is to use g for f wherever
possible after the copy statement f := g. For example,
x := t3
a[t2] := t5
a[t4] := x
goto B2
can be converted into
x := t3
a[t2] := t5
a[t4] := t3
goto B2
This may not appear to be an improvement,
but it gives the opportunity to eliminate the assignment to x.

3.Dead code elimination


A variable is live at a point in a program if its value can be used subsequently.
Otherwise it is dead at that point.
A related idea is dead or useless code:
statements that compute values that are never used.
Consider the following code in C.
#define debug 0
if (debug)
{
    /* print debugging information */
}
The value of debug is always 0 throughout the program,
so the if block can never be reached.


We can eliminate both the test and the printing from the object code.
One advantage of copy propagation is that it often turns the copy statement into
dead code. For example,
x := t3
a[t2] := t5
a[t4] := t3
goto B2
becomes, after the dead assignment to x is eliminated,
a[t2] := t5
a[t4] := t3
goto B2

4.Constant Folding
Evaluate constant expression at compile time and replace the constant
expressions by their values.
Ex. the expression 2*3.14 would be replaced by 6.28.

5.Code motion
Code motion moves code outside a loop.
An important modification that decreases the amount of code in a loop is code
motion.
This transformation takes an expression that yields the same result independent
of the number of times the loop is executed (a loop-invariant computation)
and places the expression before the loop.

Ex. while (i <= limit-2)


{
……..
}

If the while block does not change the value of the variable limit, then limit - 2
is a loop-invariant computation, and code motion will result in the equivalent code
t = limit - 2;
while (i <= t)
{
    ……..
}
6.Induction Variable Elimination
loops are usually processed inside out.


Ex. consider the loop around B3.


Only the portion of the flow graph relevant to the transformation on B3 is
shown in the following figure.

Every time the value of j decreases by 1, the value of t4 decreases by 4, because
4*j is assigned to t4.
Such identifiers are called induction variables.
When there are two or more induction variables in a loop, it may be possible to
get rid of all but one by the process of induction-variable elimination.
For the inner loop around B3 in fig(a), we cannot get rid of either j or t4
completely.

7.Reduction in strength
Reduction in strength replaces expensive operations by equivalent cheaper ones
on the target machine.
Ex. x^2, which calls the exponentiation routine, has the equivalent cheaper
implementation x * x.
Fixed-point multiplication or division by a power of two is cheaper to
implement as a shift operation.

8.Peephole Optimization
Optimization technique for improving code in a short range.


A simple but effective technique for locally improving the target code is called
peephole optimization.
It is a method for trying to improve the performance of the target program by
examining a short sequence of target instructions (called the peephole) and
replacing these instructions by a shorter or faster sequence whenever possible.
The peephole is a small moving window on the target program.
The code in the peephole need not be contiguous.
Repeated passes over the target code are necessary to get the maximum benefit.

The following are examples of program transformations that are characteristic of
peephole optimization:
1. Redundant instruction elimination
2. Flow of control optimization
3. Algebraic simplifications
4. Use of machine idioms

1.Redundant instruction elimination


If the target code contains the instruction sequence,
MOV R0, a
MOV a, R0
We can delete the second instruction, because the first instruction ensures that the
value of a is already in register R0.

2.Flow of control optimization


Intermediate code generation algorithms frequently produce jumps to
jumps, jumps to conditional jumps, and conditional jumps to jumps.
These unnecessary jumps can be eliminated in either the intermediate code or
the target code by the following kinds of peephole optimizations.
The jump sequence
goto L1
….
L1: goto L2
can be replaced by
goto L2


….
L1: goto L2
If there are no other jumps to L1, the statement L1: goto L2 can also be eliminated.
Similarly,
if a < b goto L1
….
L1: goto L2
can be replaced by
if a < b goto L2
….
L1: goto L2

3.Algebraic Simplification
There is no end to the amount of algebraic simplification that can be attempted
through peephole optimization.
The frequent algebraic simplifications are statements such as
x := x + 0
or
x := x * 1
which are often produced by straightforward intermediate code generation algorithms.
They can be eliminated easily through peephole optimization.

4.Use of machine idioms


The target machine may have hardware instructions that implement certain specific
operations efficiently.
Use of these instructions can reduce the execution time significantly.
Ex. some machines have auto increment and auto decrement addressing modes.
The use of these modes greatly improves the quality of code when pushing or
popping a stack.
These modes can also be used in code for statements like i = i + 1;


DAG
Directed Acyclic Graph
A DAG for a basic block is a directed acyclic graph with the following labels on
its nodes:
1. The leaves of the graph are labeled by unique identifiers, which can
be variable names or constants.
2. Interior nodes of the graph are labeled by operator symbols.
3. Nodes may additionally be labeled by the identifiers whose current
values they represent.

A DAG is constructed for optimizing the basic block.


A DAG is usually constructed using Three Address Code.
Transformations such as dead code elimination and common sub expression
elimination are then applied.

Construction of DAGs-
Following rules are used for the construction of DAGs-
Rule 1:
• Operators are represented by interior nodes.
• Identifiers are represented by exterior nodes.
Rule 2:
• A check is made to find if there exists any node with the same value;
a new node is created only when no such node exists.
• This action helps in detecting common sub-expressions and avoiding the
re-computation of the same.
Rule 3:
• The assignment instructions of the form x:=y are not performed unless they
are necessary.

Problem-01:

Consider the following expression and construct a DAG for it-


(a+b)x(a+b+c)

Solution-


Three Address Code for the given expression is-


T1 = a + b
T2 = T1 + c
T3 = T1 x T2

Now, Directed Acyclic Graph is-
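The figure is missing from this copy; in text form the DAG is as follows (note that the
node T1 for a + b is shared by both of its parents, which is exactly the common
sub-expression):

        *          (T3)
       / \
      |   +        (T2)
      |  / \
      | /   c
      +            (T1, shared)
     / \
    a   b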

Problem-02:

Consider the following expression and construct a DAG for it-


(((a+a)+(a+a))+((a+a)+(a+a)))

Solution-
Directed Acyclic Graph for the given expression is-
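The figure is missing here as well; since every parenthesized pair contains the same
sub-expression, each level of the DAG is a + node whose two children are the single
shared node below it:

      +        (T3 = T2 + T2)
     / \
     \ /
      +        (T2 = T1 + T1)
     / \
     \ /
      +        (T1 = a + a)
     / \
     \ /
      a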


Problem-03:

Consider the following block and construct a DAG for it-

(1) a = b x c
(2) d = b
(3) e = d x c
(4) b = e
(5) f = b + c
(6) g = f + d

Solution-

Directed Acyclic Graph for the given block is-


OPTIMIZATION OF BASIC BLOCKS

There are a number of code-improving transformations for basic blocks.
These include structure-preserving transformations, such as common sub
expression elimination and dead code elimination, and algebraic transformations
such as reduction in strength.
Many of the structure-preserving transformations can be implemented by
constructing a dag for the basic block.
Ex. consider the following three address code.
a := b + c
b := a – d
c := b + c
d := a – d
The dag for the above statements is:

When we construct the node for the fourth statement d := a – d, we find that the
computation a – d is already present, from statement 2.


So there is no need to create a new node; we simply attach d as an additional label
to the existing node for a – d.
Since there are only three operator nodes in the above dag, we can reconstruct the
three address code as:
a := b + c
d := a – d
c := d + c
Note that when we look for common sub expressions, we really are looking for
expressions that are guaranteed to compute the same value,
no matter how that value is computed.
Thus the dag method will miss the fact that the expression computed by the first
and fourth statements in the following sequence is the same, namely b + c:
a := b + c
b := b – d
c := c + d
e := b + c
Even though b and c both change between the first and last statements, their sum
remains the same, because (b – d) + (c + d) = b + c.
However, algebraic identities applied to the dag may expose the equivalence.
The dag for this sequence is shown below.

The operation on dags that corresponds to dead code elimination is quite
straightforward to implement.
We delete from the dag any root (node with no ancestors) that has no live
variables attached.


Repeated application of this transformation will remove all nodes from the dag
that correspond to dead code.

The use of algebraic identities


Algebraic identities represent another important class of optimizations on basic
blocks. For example:
x + 0 = 0 + x = x
x – 0 = x
x * 1 = 1 * x = x
x / 1 = x
Another class of algebraic optimizations is reduction in strength:
replacing a more expensive operator by an equivalent cheaper one. For example:
x ** 2 = x * x
2 * x = x + x
x / 2 = x * 0.5
A third class of this optimization is constant folding:
evaluating constant expressions at compile time and replacing them by their values.
Ex. the expression 2 * 3.14 would be replaced by 6.28.

INTRODUCTION TO GLOBAL DATA-FLOW ANALYSIS


The process of collecting data-flow information that is useful for the purpose
of optimization is called data-flow analysis.
Data-flow information can be collected by setting up and solving systems of
equations that relate information at various points in a program.
A typical equation has the form
out[S] = gen[S] U (in[S] – kill[S])
which can be read as:
“the information at the end of a statement is either generated within the
statement, or enters at the beginning and is not killed as control flows through the
statement”.
Such equations are called data-flow equations.
The details of how data-flow equations are set up and solved depend on three
factors:


1. The notions of generating and killing depend on the desired information.


2. Since data flows along control paths, data-flow analysis is affected by the
control constructs in a program.
3. There are subtleties that go along with such statements as procedure calls,
assignments through pointer variables, and even assignments to array variables.
(Subtleties: points that are difficult to detect or grasp.)
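As a concrete illustration, here is a minimal C sketch of the usual iterative solution
of these equations for reaching definitions on a flow graph. The block count, the
bit-mask encoding of definitions, and all the names are illustrative assumptions:

#include <stdio.h>

#define NBLOCKS 4

/* Each definition is numbered 0..31 and represented as one bit. */
unsigned gen[NBLOCKS], kill[NBLOCKS];
unsigned in[NBLOCKS], out[NBLOCKS];
int npred[NBLOCKS];
int pred[NBLOCKS][NBLOCKS];     /* pred[b][j] = j-th predecessor of b */

/* Iterate the equations to a fixed point:
   in[B]  = union of out[P] over all predecessors P of B
   out[B] = gen[B] U (in[B] - kill[B])                               */
void reaching_definitions(void) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = 0; b < NBLOCKS; b++) {
            unsigned new_in = 0;
            for (int j = 0; j < npred[b]; j++)
                new_in |= out[pred[b][j]];
            unsigned new_out = gen[b] | (new_in & ~kill[b]);
            if (new_in != in[b] || new_out != out[b]) {
                in[b] = new_in;
                out[b] = new_out;
                changed = 1;
            }
        }
    }
}

int main(void) {
    /* Two blocks, B0 -> B1: B0 generates definition d0,
       B1 kills d0 and generates d1. */
    gen[0] = 1u << 0;
    gen[1] = 1u << 1;
    kill[1] = 1u << 0;
    npred[1] = 1;
    pred[1][0] = 0;
    reaching_definitions();
    printf("in[B1]  = %x\n", in[1]);   /* 1: d0 reaches the entry of B1 */
    printf("out[B1] = %x\n", out[1]);  /* 2: only d1 leaves B1 */
    return 0;
}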

Points and Paths

Within a basic block, a point is the position between two adjacent statements, as
well as the position before the first statement and after the last.
Thus the block B1 in the following diagram has four points: one before any of
the assignments and one after each of the three assignments.

Let us take a global view and consider all the points in all the blocks.
A path from p1 to pn is a sequence of points p1, p2, …, pn such that for each i
between 1 and n-1, either
1. pi is the point immediately preceding a statement and pi+1 is the point
immediately following that statement in the same block, or
2. pi is the end of some block and pi+1 is the beginning of a successor
block.

Efficient Data Flow Algorithm

To optimize the code efficiently, the compiler collects information about the
whole program and distributes this information to each block of the flow graph.
This process is known as data-flow analysis.
Certain optimizations can only be achieved by examining the entire program; they
cannot be achieved by examining just a portion of the program.
For this kind of optimization, use-definition chaining (ud-chaining) is one particular
problem: given a use of a variable, we try to find out which definitions of that
variable may apply at that statement.


Based on the local information a compiler can perform some optimizations. For
example, consider the following code:

1. x = a + b;
2. x = 6 * 3

o In this code, the first assignment to x is useless: the value computed for x is
never used in the program.
o At compile time the expression 6*3 will be computed, simplifying the
second assignment statement to x = 18;
Some optimization needs more global information. For example, consider the
following code:

1. a = 1;
2. b = 2;
3. c = 3;
4. if (....) x = a + 5;
5. else x = b + 4;
6. c = x + 1;

In this code, the assignment at line 3 is useless, since c is redefined at line 6 before
its value is ever used, and the expression x + 1 can be simplified to the value 7,
because both branches assign x the value 6.
But it is less obvious how a compiler can discover these facts by looking only
at one or two consecutive statements. A more global analysis is required, so that the
compiler knows the following things at each point in the program:
o Which variables are guaranteed to have constant values
o Which variables will be used before being redefined
Data flow analysis is used to discover this kind of property. The data flow analysis
can be performed on the program's control flow graph (CFG).
The control flow graph of a program is used to determine those parts of a program
to which a particular value assigned to a variable might propagate.
