PCC All Units QuestionBank


GURU NANAK INSTITUTE OF TECHNOLOGY, IBRAHIMPATNAM-501506

DEPARTMENT OF INFORMATION TECHNOLOGY


Subject: PRINCIPLES OF COMPILER CONSTRUCTION Subject Code-20PC0IT19
QUESTION BANK WITH ANSWERS
UNIT-I
PART-A
1. Define the terms Language Translator and compiler.
A language translator is a computer program that translates code written in one programming language
into another programming language. The most common type of language translator is a compiler.
A compiler is a type of language translator that takes the source code of a computer program written in a
high-level programming language and translates it into an executable form that can be run on a computer.
2. Compare and contrast compiler and interpreter

Compiler vs. Interpreter:

1. A compiler scans the whole program in one go; an interpreter translates the program one statement at a time.
2. Since a compiler scans the code in one go, the errors (if any) are reported together at the end; since an interpreter scans the code one line at a time, errors are reported line by line.
3. The main advantage of a compiler is its fast execution time; because interpreters are slow in executing the object code, they are preferred less.
4. A compiler converts the source code into object code; an interpreter does not convert source code into object code, it scans and executes it line by line.
5. A compiler does not require the source code for later execution; an interpreter requires the source code for later execution.
6. With a compiler, execution of the program takes place only after the whole program is compiled; with an interpreter, execution happens after every line is checked or evaluated.
7. With a compiler, the machine code is stored on disk; with an interpreter, machine code is not stored anywhere.
8. Compilers more often take a large amount of time to analyze the source code; in comparison, interpreters take less time to analyze the source code.
9. A compiler is more efficient; an interpreter is less efficient.
10. CPU utilization is higher with a compiler and lower with an interpreter.
11. Any change in the source program after compilation requires recompiling the entire code; any change in the source program during translation does not require retranslating the entire code.
12. A compiler takes the entire program as input; an interpreter takes a single instruction at a time as input.
13. With a compiler, object code is permanently saved for future use; with an interpreter, no object code is saved for future use.

E.g. C, C++ and C# are compiler-based programming languages; Python, Ruby, Perl, SNOBOL and MATLAB are interpreter-based programming languages.

3. Define Cross Compiler.


A cross-compiler is a type of compiler that runs on one platform or operating system but generates code
for another platform or operating system. In other words, it is a tool that allows developers to create
software for a target platform or architecture that is different from the one on which the compiler is
running.
4. List out the phases of the compiler along their input and output of each phase.
1. Lexical Analysis (Input: Source Code; Output: Tokens)
2. Syntax Analysis (Input: Tokens; Output: Parse Tree)
3. Semantic Analysis (Input: Parse Tree; Output: Symbol Table)
4. Intermediate Code Generation (Input: Parse Tree and Symbol Table; Output: Intermediate Code)
5. Optimization (Input: Intermediate Code; Output: Optimized Intermediate Code)
6. Code Generation (Input: Optimized Intermediate Code; Output: Machine Code)
5. What is the significance of lexical analysis phase of a compiler?
The lexical analysis phase, also known as the scanning phase, is the first phase of a compiler. It reads the
source code character by character, identifies the tokens (such as keywords, identifiers, operators, etc.) in
the code, and produces a stream of tokens as output.
6. Define Lexeme, Pattern and Token.
A lexeme is a sequence of characters that represents a meaningful unit in the source code of a programming
language, such as a keyword, an identifier, a number, a symbol, or a string literal. For example, in the expression
"x = 2 + y", the lexemes are "x", "=", "2", "+", and "y".
A pattern is a regular expression that describes the possible forms of a lexeme in the source code. Patterns are used
by the lexical analyzer (lexer) of a compiler to recognize and extract lexemes from the input stream of characters.
For example, the pattern for a decimal integer in many programming languages is "\d+", which matches one or
more digits.
A token is a pair consisting of a lexeme and its corresponding token type, which is a symbolic representation of the
category or class of the lexeme. Token types are defined by the programming language specification and may
include keywords, identifiers, operators, punctuation symbols, and other categories.
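For example, for the statement x = 2 + y the lexical analyzer might produce tokens such as the following (patterns shown in a common textbook notation):

Lexeme   Token type              Pattern
x        identifier              letter (letter | digit)*
=        assignment operator     =
2        number                  digit+
+        arithmetic operator     +
y        identifier              letter (letter | digit)*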

7. What is input Buffering in lexical analysis Phase? State its Types.


In lexical analysis, input buffering refers to the process of reading and storing the input characters of a program or a
file in a buffer, and then analyzing the buffer to recognize and extract lexemes. There are two common input
buffering schemes used in lexical analysis:

Line buffering: In this scheme, the input buffer stores one line of input at a time, and the lexer analyzes the line to
extract lexemes. Line buffering is simple and efficient, but it can cause problems when a lexeme spans across
multiple lines or when there are nested comments or string literals that contain newlines.

Block buffering: In this scheme, the input buffer stores a fixed-size block of input, which is larger than a line, and
the lexer analyzes the block to extract lexemes. Block buffering is more complex than line buffering, but it can
handle lexemes that span across multiple lines and can improve the efficiency of the lexer by reducing the number
of input reads.

8. What is the use of Sentinel in tokenization process?


One common technique used in tokenization is to use a "sentinel" character to mark the boundaries between tokens.
A sentinel character is a character that is not found in the text itself, but is used as a placeholder to indicate where
one token ends and the next begins.

9. Define regular Set or Regular language.


In computer science and formal language theory, a regular set or regular language is a set of strings that can
be generated by a regular expression or recognized by a finite automaton (FA).

10. What are the three basic sections of a LEX program?


A LEX program typically consists of three basic sections:
Definition section: This section contains the regular definitions, also called "macros," that define the patterns of
characters that the LEX program will recognize and translate into tokens. These macros are typically specified using
regular expressions.
Rules section: This section contains the rules that describe how the LEX program will match input text to the macros
defined in the definition section. Each rule consists of a regular expression and an associated action that will be
executed when that regular expression is matched.
User code section/Auxiliary Function Section: This section contains any additional code that the programmer
wants to include in the program. This can include functions, variables, and other program logic that is not directly
related to the lexing process. This section is optional and may be omitted if the program does not require any
additional code.

11.State the uses of the following Built in variables used in LEX programming.
i)yyin ii)yyout iii)yytext iv)yyleng v)yylineno vi)yylval

yyin: This variable is a file pointer that represents the input stream to be scanned by the Lex program. By default,
yyin is set to stdin, but it can be changed to any file or input stream that the program needs to read from.
yyout: This variable is a file pointer that represents the output stream for the Lex program. By default, yyout is set
to stdout, but it can be changed to any file or output stream that the program needs to write to.
yytext: This variable is a character array that contains the text of the current token matched by the Lex program's
regular expressions. The length of the token is stored in the yyleng variable.
yyleng: This variable stores the length of the current token matched by the Lex program's regular expressions. It is
typically used to extract the matched token from the yytext variable.
yylineno: This variable stores the current line number of the input stream being scanned by the Lex program. It is
typically used for error reporting and debugging purposes.
yylval: This variable is used to store the semantic value of the current token matched by the Lex program's regular
expressions. The programmer can define the type and contents of the yylval variable to suit their specific needs.
12. State the uses of the following Built in functions used in LEX programming.
i) yylex( ) ii) yywrap( ) iii) yyless(int n) iv) yymore( ) v) yyerror( )
yylex( ): This function is the core of a Lex program and is responsible for scanning the input stream, matching the
regular expressions defined in the program, and returning the corresponding tokens to the calling program. The
yylex() function is called repeatedly by the calling program until it returns an end-of-file token or signals an error.
yywrap( ): This function is used to indicate to the calling program that the end of the input stream has been reached.
When yylex() encounters the end of the input stream, it calls yywrap() to determine whether to continue scanning
the input or stop. If yywrap() returns a non-zero value, yylex() returns 0 to the calling program, indicating that the
scanning is complete. Otherwise, yywrap() returns 0, and yylex() resumes scanning the input stream.
yyless(int n): This function pushes back all but the first n characters of the current token onto the input stream, so that only the first n characters of yytext are kept as the match; the pushed-back characters will be rescanned when the scanner looks for the next token. yytext and yyleng are adjusted accordingly.
yymore( ): This function is used to indicate that the next match should be appended to the current token, rather than
starting a new token. This is useful when a single logical token spans multiple lines or is interrupted by whitespace
or other characters that are not part of the token.
yyerror( ): This function is used to report errors. It is supplied (or customized) by the programmer and is typically called by a Yacc/Bison parser when a syntax error is detected, though it can also be called explicitly from scanner actions to report an error message. The programmer can customize the behaviour of yyerror() to suit their specific needs, such as printing an error message to the console or logging the error to a file.

13. Build NFA for the regular expression r=(a+b)*ab


14. Write a regular expression for relational operators. Design a transition diagram for them.

15. Give a regular definition for signed and unsigned numbers.


digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
sign [ + | - ]
num (sign)? (digit)*
16. Define DFA.
A Deterministic Finite Automaton (DFA) is a collection of the following:
1) A finite set of states, denoted by Q.
2) A finite set of input symbols ∑.
3) A start state q0 such that q0 ∈ Q.
4) A set of final states F such that F ⊆ Q.
5) A mapping (transition) function denoted by δ. Two parameters are passed to this transition function: the current state and an input symbol. The transition function returns a state, called the next state.
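To illustrate how such a definition is realized in code, here is a minimal C sketch of a table-driven DFA for the language (a|b)*abb; the state numbering and names are chosen only for this example:

#include <stdio.h>

/* Table-driven DFA for (a|b)*abb.
   delta[state][symbol] gives the next state; symbol 0 = 'a', symbol 1 = 'b'. */
static const int delta[4][2] = {
    {1, 0},   /* state 0: start state              */
    {1, 2},   /* state 1: have seen "a"            */
    {1, 3},   /* state 2: have seen "ab"           */
    {1, 0}    /* state 3: have seen "abb" (final)  */
};

static int accepts(const char *s) {
    int state = 0;                              /* start in q0 */
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return 0;   /* symbol not in the alphabet */
        state = delta[state][*s - 'a'];         /* apply the transition function */
    }
    return state == 3;                          /* accept iff we stop in a final state */
}

int main(void) {
    const char *tests[] = { "abb", "aabb", "babb", "ab", "abab" };
    for (int i = 0; i < 5; i++)
        printf("%-5s : %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}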
17. State the reason behind how gcc compilers can be used across multiple languages and various platforms.
1. Modular architecture: GCC has a modular architecture that allows developers to easily add new languages, target
platforms, and optimization techniques. The modular design of GCC allows it to support a wide range of
programming languages, from low-level systems programming languages like C to high-level scripting languages
like Python.
2. Target platform independence: GCC is designed to be target platform independent, meaning it can be used to
compile code for a wide variety of target architectures and operating systems. This is achieved through the use of a
modular back-end that generates machine code specific to the target platform.
3. Standard compliance: GCC is designed to be compliant with various programming language standards, such as
ANSI C, C++ and POSIX, ensuring that code compiled with GCC is portable across different platforms and
compilers.
4. Open-source nature: GCC is an open-source compiler system, which means that its source code is freely available
to developers. This allows developers to modify and extend the compiler to suit their needs and contributes to the
growth of the compiler in terms of language and platform support.
5. Wide community support: GCC is supported by a large and active community of developers, which ensures that
the compiler is regularly updated and maintained with new language features and target platform support.
PART-B
1. Explain the input buffering scheme for scanning the source program. How the use of sentinels can improve its
performance? Describe in detail.
Buffering scheme is a technique used by lexical analyzers (scanners) to read and process the source program
efficiently. It involves reading the source program into memory in blocks or chunks, rather than character by
character. This reduces the number of system calls required to read the source program and can improve the
performance of the scanner.
There are two common buffering schemes used in scanning the source program:
Fixed-size buffer scheme:
In this scheme, the source program is read into a fixed-size buffer, and the scanner processes the input in chunks of
a certain size. For example, if the buffer size is set to 1024 bytes, the scanner reads the input 1024 bytes at a time.
The advantage of fixed-size buffering is that it is simple to implement, and the buffer size is fixed and does not need
to be dynamically adjusted. However, it can lead to inefficiencies if the source program contains long tokens that
span across buffer boundaries.
Dynamically growing buffer scheme:
In this scheme, the buffer size is dynamically increased as needed to accommodate the input. For example, if the
scanner encounters a long token that spans across buffer boundaries, the buffer size is increased to ensure that the
token can be read in a single buffer read operation.
The advantage of dynamically growing buffering is that it can improve the efficiency of the scanner by ensuring that
long tokens can be read in a single read operation. However, it requires more complex buffer management code, as
the buffer size needs to be adjusted dynamically.

In summary, buffering is an important technique for optimizing the performance of scanners in reading and
processing the source program. The choice of buffering scheme depends on the characteristics of the source program,
such as the average token length, the presence of long tokens, and the expected size of the input.

However, the input buffering scheme can lead to a problem called boundary crossing. Boundary crossing occurs
when the buffer ends in the middle of a token, causing the scanner to read beyond the end of the buffer to complete
the token. This can result in inefficiency and can also lead to errors in the scanning process.
To avoid boundary crossing, sentinels can be used. Sentinels are special characters that are appended to the end of
the input buffer to ensure that the scanner can always complete the processing of the last token. The sentinel can be
a character that does not occur in the source program, such as a null character or a special end-of-file character.
When the scanner encounters the sentinel, it knows that it has reached the end of the input and can stop processing.
This eliminates the need for the scanner to read beyond the end of the buffer, improving performance and reducing
the risk of errors. Overall, the use of sentinels can improve the performance of input buffering by avoiding boundary
crossing and allowing the scanner to process the source program more efficiently.
1. One-buffer scheme: In this scheme, only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
2. Two-buffer scheme: To overcome the problem of the one-buffer scheme, two buffers are used to store the input string, and they are scanned alternately: when the end of the current buffer is reached, the other buffer is filled. The only remaining problem is that if the length of a lexeme is longer than the length of a buffer, the input cannot be scanned completely. Initially both pointers, bp (lexeme begin) and fp (forward), point to the first character of the first buffer. Then fp moves to the right in search of the end of the lexeme; as soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an end-of-buffer (eof) character is placed at its end; similarly, the end of the second buffer is recognized by the end-of-buffer mark at its end. When fp encounters the first eof, the end of the first buffer is recognized and filling of the second buffer begins; in the same way, when the second eof is reached it indicates the end of the second buffer. The two buffers are filled alternately until the end of the input program is reached and the stream of tokens is identified. The eof character introduced at the end of each buffer is called a sentinel and is used to identify the end of the buffer.
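The two-buffer scheme with sentinels can be sketched in C roughly as follows; this is a simplified illustration of the idea described above, and BUF_SIZE, fill_half() and the bp/fp pointer names are choices made for this sketch rather than part of any standard interface:

#include <stdio.h>

#define BUF_SIZE 4096
#define SENTINEL '\0'                      /* eof/sentinel character (assumed absent from the source text) */

static char buf[2 * BUF_SIZE + 2];         /* two halves, each followed by a sentinel slot */
static char *lexeme_begin = buf;           /* bp: start of the current lexeme              */
static char *forward      = buf;           /* fp: scans ahead for the end of the lexeme    */

/* Read up to BUF_SIZE characters into one half and terminate it with the sentinel.
   If fewer than BUF_SIZE characters are read, the sentinel also marks end of input. */
static void fill_half(FILE *in, char *half) {
    size_t n = fread(half, 1, BUF_SIZE, in);
    half[n] = SENTINEL;
}

/* Advance fp and return the next input character, reloading a half when its
   sentinel is reached.  Returns EOF at the real end of the input. */
static int next_char(FILE *in) {
    char c = *forward++;
    if (c != SENTINEL)
        return (unsigned char)c;                    /* common case: no explicit end-of-buffer test */
    if (forward == buf + BUF_SIZE + 1) {            /* sentinel at the end of the first half  */
        fill_half(in, buf + BUF_SIZE + 1);          /* reload the second half                 */
        return next_char(in);
    }
    if (forward == buf + 2 * BUF_SIZE + 2) {        /* sentinel at the end of the second half */
        fill_half(in, buf);                         /* reload the first half                  */
        forward = buf;
        return next_char(in);
    }
    return EOF;                                     /* sentinel inside a half: real end of input */
}

int main(void) {
    FILE *in = stdin;
    fill_half(in, buf);                             /* prime the first half before scanning */
    int c, count = 0;
    while ((c = next_char(in)) != EOF) {
        lexeme_begin = forward;                     /* a real scanner would build lexemes between bp and fp here */
        count++;
    }
    printf("read %d characters\n", count);
    return 0;
}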

2. Explain the various phases of a compiler in detail. Also write down the output for the following expression
after each phase a: =b*c.
A compiler is a program that converts the source code written in a programming language into machine-readable
code that can be executed by a computer. The process of compiling involves several phases, which are:
Lexical Analysis - also known as tokenization, this phase breaks the source code into tokens or lexemes, which
are the smallest meaningful units of the language. The input to this phase is the source code, and the output is a
sequence of tokens that represent the various keywords, identifiers, operators, and literals used in the code. For
example, the expression a = b * c might be tokenized into the following sequence of tokens:
IDENTIFIER (a)
ASSIGNMENT_OPERATOR (=)
IDENTIFIER (b)
MULTIPLICATION_OPERATOR (*)
IDENTIFIER (c)

Syntax Analysis - also known as parsing, this phase analyzes the structure of the code to ensure that it conforms to
the rules of the language's grammar. The input to this phase is the sequence of tokens produced by the lexical
analysis phase, and the output is an abstract syntax tree (AST) that represents the structure of the code. The AST is
a hierarchical tree-like structure that captures the relationships between the various parts of the code. For example,
the AST for the expression a = b * c would have an assignment node at the root, with the identifier a as its left child and a multiplication node (whose children are b and c) as its right child.
3. Semantic Analysis - this phase checks the code for semantic correctness, which means ensuring that it follows the
rules and constraints of the language. This phase involves type checking, scope checking, and other kinds of
analyses that ensure the code is well-formed and meaningful. The output of this phase is a symbol table that
contains information about the various identifiers used in the code, such as their types, scope, and memory
locations. For example, in the expression a = b * c, the semantic analysis phase would ensure that b and c are both
of the same type and that a is of a compatible type to store the result of the multiplication.
4. Intermediate Code Generation - this phase transforms the AST into an intermediate representation (IR) that is
closer to machine code but still independent of any particular hardware or operating system. The IR is a lower-
level representation that simplifies the code and removes any language-specific constructs. The output of this
phase is the intermediate code that represents the original code in a simpler and more abstract form. For example,
the intermediate code for the expression a = b * c might look like this:
t1 = b * c
a = t1

5. Code Optimization - this phase optimizes the intermediate code to improve its efficiency and reduce its size. This
phase involves a range of techniques, such as constant folding, dead code elimination, and loop optimization that
make the code faster and smaller without changing its functionality. The output of this phase is optimized
intermediate code that is more efficient than the original intermediate code.
6. Code Generation - this phase generates the machine code that can be executed by the computer. The input to this
phase is the optimized intermediate code, and the output is the machine code that represents the original program
in binary form. The code generator is responsible for mapping the abstract operations in the intermediate code to
the specific instructions of the target processor. The output of this phase is the executable code that can be run on
the target hardware. Assuming that the target hardware is a hypothetical processor that uses a simple assembly language, the code generation phase would translate the optimized intermediate code t1 = b * c and a = t1 into the following machine code:

LOAD  R1, b        ; load the value of b into register 1
LOAD  R2, c        ; load the value of c into register 2
MUL   R3, R1, R2   ; multiply the values in registers 1 and 2 and store the result in register 3
STORE a, R3        ; store the value in register 3 into memory location a

This code loads the values of b and c into registers, multiplies them, and stores the result in memory location a. This code can be executed directly by the target processor to perform the computation represented by the original expression a = b * c.

3. Explain in detail about various cousins of a compiler.


Cousins of a Compiler:
The cousins of a compiler are the programs that make up the context in which the compiler typically operates. They are listed below.

INTERPRETER:
An interpreter is a program that directly executes the source
code of a program line by line. It does not translate the entire
program into machine code before execution, unlike a
compiler. Instead, it reads each line of code, interprets it, and
then executes it immediately. This makes it slower than a
compiled program, but also more flexible, as it can modify its
behavior based on runtime conditions.
ASSEMBLER:
An assembler is a program that translates assembly language
code into machine code. Assembly language is a low-level
programming language that uses mnemonic codes to represent
machine instructions. Assemblers are used to create
executable files for programs written in assembly language.

The assembler performs several tasks, including:

Tokenization: The assembler tokenizes the assembly language code, breaking it down into individual
instructions, symbols, and operands.

Parsing: The assembler parses the tokenized code, interpreting each instruction and operand and
generating corresponding machine code.

Symbol resolution: The assembler resolves symbols, which are names of functions, variables, or other
program elements that are used in the code. It ensures that each symbol is defined only once and that all
references to the symbol are resolved correctly.

Code generation: Finally, the assembler generates machine code, which consists of binary instructions
and data that can be executed directly by the computer's CPU
LINKER:
A linker is a program that combines object files generated by a compiler into a single executable file. The
linker performs several tasks, including:

Symbol resolution: The linker resolves symbols, which are names of functions, variables, or other
program elements that are used in multiple files. It ensures that each symbol is defined only once and that
all references to the symbol are resolved correctly.

Relocation: The linker relocates object files so that they can be loaded into memory and executed
correctly. This involves adjusting addresses and offsets in the code and data sections of the object files.

Library linking: The linker links object files with libraries that contain precompiled code and data. This
can include standard libraries provided by the operating system or third-party libraries.

Dead code elimination: The linker eliminates code and data that is not used by the program. This can
include unused functions, variables, and other program elements.
Output generation: Finally, the linker generates an executable file or a shared library that can be loaded
into memory and executed by the operating system.
LOADER:
A loader is a program that loads an executable file into memory and prepares it for execution. The loader
performs several tasks, including:

Allocating memory: The loader allocates memory for the program in the computer's memory space. It
determines the amount of memory required by the program and allocates the necessary space.

Resolving dependencies: If the program depends on other libraries or modules, the loader resolves those
dependencies and loads them into memory as well.

Relocating code: The loader may need to relocate the program's code in memory to ensure that it runs
correctly. This is necessary because the program may be designed to run at a specific memory address, but
that address may not be available when the program is loaded.

Setting up the program's environment: The loader sets up the program's environment, including its
initial values for registers and memory locations.

Starting program execution: Finally, the loader starts the program's execution by transferring control to
the program's entry point.
PREPROCESSOR:
A preprocessor is a program that processes the source code of a program before it is compiled. It is used to
perform tasks such as including header files, defining macros, and conditionally compiling code. The
preprocessor is typically run before the compiler as a separate step. Preprocessing is typically done using
directives, which are special commands that begin with a hash symbol (#) and are inserted into the source
code. Some common preprocessing directives include:

#include: This directive is used to include header files in the source code. Header files typically contain
function declarations, constants, and other definitions that are used in the source code.

#define: This directive is used to define macros in the source code. Macros are symbolic names that are
replaced with their corresponding values during preprocessing.

#ifdef, #ifndef, #else, #endif: These directives are used for conditional compilation. They allow parts of
the code to be included or excluded from the final compiled output based on certain conditions.

#pragma: This directive is used to provide hints or instructions to the compiler or linker. Pragmas are
typically used to control optimization or to specify linker options.

The preprocessor runs as a separate step before the compiler, and it generates a modified version of the
source code that is then passed to the compiler for compilation.
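A small illustrative C fragment using these directives (the macro names are arbitrary):

#include <stdio.h>            /* bring in declarations from a header file */

#define PI 3.14159            /* macro: every PI below is replaced by 3.14159 before compilation */
#define DEBUG                 /* defining DEBUG enables the conditional block below */

int main(void) {
    double r = 2.0;
    printf("area = %f\n", PI * r * r);
#ifdef DEBUG
    printf("debug build\n");  /* compiled only when DEBUG is defined */
#endif
    return 0;
}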
Debugger:
A debugger is a program that helps developers find and fix bugs in their code. It allows developers to step
through their code line by line, set breakpoints, inspect variables, and examine the call stack. Debuggers
can be integrated into development environments or run as standalone programs.
Optimizer:
An optimizer is a program that analyzes the code generated by a compiler and tries to improve its
performance. It performs tasks such as removing redundant code, reordering instructions and replacing
slow operations with faster ones. Optimizers can significantly improve the performance of compiled
programs.

4. List out different types of compiler and Explain.

Single-pass compiler:
A single-pass compiler reads the entire source code in one pass and generates the object code in a single
step. This type of compiler is faster than a multi-pass compiler, but may generate less efficient code.
Multi-pass compiler:
A multi-pass compiler reads the source code in multiple passes, performing different tasks such as lexical
analysis, syntax analysis, semantic analysis, and code generation. This type of compiler can generate more
efficient code than a single-pass compiler, but is slower.
Cross-compiler:
A cross-compiler is a compiler that runs on one platform and generates code for another platform. For
example, a compiler that runs on a Windows PC and generates code for a Linux server would be a cross-
compiler.
Just-in-time (JIT) compiler:
A JIT compiler is a compiler that generates machine code at runtime, just before the code is executed. This
allows for dynamic optimization of the code, and can lead to significant performance improvements.
Ahead-of-time (AOT) compiler:
An AOT compiler is a compiler that generates machine code ahead of time, before the code is executed.
This can improve startup time and reduce the memory footprint of the program, but can also increase the
size of the executable.
Incremental compiler:
An incremental compiler is a compiler that only recompiles parts of the code that have changed since the
last compilation. This can speed up the development process by reducing the time required for full
recompilations.
Optimizing compiler:
An optimizing compiler is a compiler that analyzes the code and generates more efficient machine code,
by performing optimizations such as loop unrolling, constant folding, and register allocation.
5. What is a regular expression? Give all the algebraic properties of regular expression.
A regular expression, also known as regex or regexp, is a pattern that describes a set of strings. It is a
sequence of characters that define a search pattern. Regular expressions are used in many programming
languages, text editors, and other software to match and manipulate text.The algebraic properties of
regular expressions are:
Closure:
The set of regular expressions is closed under union, concatenation, and Kleene star. This means that the
result of any operation on two regular expressions is itself a regular expression.
Associativity:
The union and concatenation operations are associative, which means that changing the grouping of the
expressions being combined does not affect the result. That is, (A ∪ B) ∪ C = A ∪ (B ∪ C) and (A • B) •
C = A • (B • C).
Commutativity:
The union operation is commutative, which means that changing the order of the expressions being
combined does not affect the result. That is, A ∪ B = B ∪ A.
Identity elements: There exists an identity element for both union and concatenation. The empty set is the
identity element for union, while the empty string is the identity element for concatenation.
Distributivity:
The concatenation operation distributes over union, which means that (A ∪ B) • C = (A • C) ∪ (B • C).
Kleene star properties:
The Kleene star operation is idempotent, which means that (A*)* = A* for any regular expression A. The Kleene star also satisfies the following properties:
a. A* contains the empty string ε and every concatenation of zero or more strings drawn from A; it is the smallest language containing ε and A that is closed under concatenation.
b. (A ∪ B)* = (A* • B*)*.
c. ε ∪ A • A* = A*.
Regular expressions have several identity rules, also known as laws or properties, that allow for the
manipulation and simplification of regular expressions. These rules include:
Union Identity: A ∪ ∅ = A
This rule states that the union of a regular expression A with the empty set (∅ ) is equivalent to A.
Concatenation Identity: A • ε = AThis rule states that the concatenation of a regular expression A with the
empty string (ε) is equivalent to A.
Kleene Star Identity: ε* = ε
This rule states that the Kleene star of the empty string (ε) is itself equal to ε.
Kleene Star of Union: (A ∪ B)* = (A* • B*)*
This rule states that the Kleene star of the union of two regular expressions A and B is equivalent to the Kleene star of the concatenation of their Kleene stars.
De Morgan's Laws: (A ∪ B)' = A' • B' and (A • B)' = A' ∪ B'
These laws state that the complement of the union of two regular expressions is equivalent to the
concatenation of their complements, and the complement of the concatenation of two regular expressions
is equivalent to the union of their complements. (In the identities below, ε denotes epsilon, the empty string; R, P and Q denote arbitrary regular expressions; ɸ denotes phi, the empty set.)
1. ɸ + R = R
2. ɸ.R = ɸ
3. ε.R = R.ε = R
4. R + R = R
5. R.R* = R*.R = R+
6. ε + R.R* = R.R* + ε = R*
7. ε* = ε,  ɸ* = ε
8. (R*)* = R*
9. R* = R+ + ε
10. ɸ.R = R.ɸ = ɸ
11. (PQ)*P = P(QP)*
12. (P + Q).R = PR + QR
13. R.(P + Q) = RP + RQ
14. (P + Q)* = (P*.Q*)* = (P* + Q*)*
15. R*R + R = R*R
16. (R + ε)* = R*
17. (R + ε) + R* = R*
18. (R + ε).R* = R*
19. (ε + R*) = R*
20. (R + ε)(R + ε)*(R + ε) = R*
6. Explain the structure of the LEX program. Write a Lex program to calculate sum of integers.
LEX is a lexical-analyzer generator: from a LEX specification it generates a program that recognizes patterns in text, such as programming-language tokens. Here is the general structure of a simple LEX program:

%{
/* Header file declarations and global definitions go here */
%}
%%
/* Regular expression rules go here */ {/* Action code for each rule goes here */}
%%
/* Additional functions or code go here */
A Lex program to calculate the sum of integers read from the input:

%{
/* Definition section: header files and global declarations */
#include <stdio.h>
#include <stdlib.h>   /* for atoi() */
int sum = 0;          /* global variable holding the running sum */
%}

%%
[0-9]+   { sum += atoi(yytext);             /* match an integer and add it to sum */ }
[+-]     { /* match + or -, do nothing */ }
\n       { printf("The sum is %d\n", sum);  /* on a newline, print the sum */ }
.        { printf("Invalid character\n");   /* any other character */ }
%%

/* Auxiliary functions (user code) section */
int yywrap(void) { return 1; }   /* report end of input: no further files to scan */

int main(void)
{
    yylex();
    return 0;
}
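To try the program (assuming lex or flex is installed, and that the specification above is saved as a file such as sum.l), it can typically be built and run with commands along these lines:

lex sum.l          (or: flex sum.l)
cc lex.yy.c        (adding -ll or -lfl is also common, though not needed here since main() and yywrap() are defined)
./a.out

Typing a line such as 10+20+5 followed by Enter should then print the sum 35.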

7. Construct the NFA for the following regular expression using Thompson’s Construction. Apply
subset construction method to convert it into DFA.(a+b)*abb#
8. Construct the NFA for the following regular expression using Thompson’s Construction. Apply
subset construction method to convert it into DFA.(a+b)*ab*a
9. Construct a DFA without constructing NFA (using syntax tree method) for the following regular
expression. Find the minimized DFA.(a|b|c)*d*(a*|b|ac+)
Now we will compute the follow position of each node
Now we will obtain Dtran for each state and each input

10. Construct a DFA by syntax tree construction method (a+b)*ab#
The Follow Position is computed as
UNIT-II
PART-A
1. Define Context Free Grammar.
CFG stands for Context-Free Grammar. It is a type of formal grammar that is used to describe the
syntax or structure of a programming language or any other formal language. In a CFG, a set of
production rules are defined to generate strings of symbols that belong to the language. These
production rules specify how one or more non-terminal symbols can be rewritten as a sequence of
terminal and/or non-terminal symbols. A non-terminal symbol is a symbol that can be replaced by a
sequence of symbols, while a terminal symbol is a symbol that cannot be rewritten further. CFGs are
widely used in computer science, particularly in the design and analysis of programming languages,
compilers, and parsers.

2. Define Parser and its role in language translation.


A parser is a computer program or a software tool that analyses the syntax of a sequence of symbols
or a string of characters according to the rules of a formal grammar. The parser typically takes input in
the form of a sequence of tokens, which are produced by a lexical analyser or tokenizer, and then uses
a grammar to analyse the syntax of the tokens and determine whether the input is syntactically valid
or not.
In programming languages, parsers are used to analyse and validate source code to ensure that it
conforms to the language's syntax rules. Parsers can also be used in other areas such as natural language
processing, where they are used to analyse and understand the structure of human language. There are
various types of parsers, such as top-down parsers, bottom-up parsers, recursive descent parsers, and
more. The choice of parser depends on the specific requirements of the grammar being parsed and the
application in which it is being used.
3. Why Lexical and Syntax Analyser are separated out in a compiler?
The lexical analyser scans the input program and collects the tokens from it, while the parser builds a parse tree using these tokens. These are two distinct activities, carried out by two separate phases. Separating these two phases has two advantages: 1) it accelerates the process of compilation, and 2) errors in the program can be identified more precisely.

4. Differentiate between Top down and Bottom up Parser

5. Consider the following grammar A-->ABd | Aa | a B-->Be | b and remove left recursion.
We can see that the productions A → ABd and A → Aa are left-recursive (A appears as the first symbol on their right-hand sides), and similarly B → Be is left-recursive. Immediate left recursion of the form A → Aα1 | Aα2 | β is eliminated by rewriting it as:
A  → β A'
A' → α1 A' | α2 A' | ε
Applying this to both A and B gives:

A  → a A'
A' → B d A' | a A' | ε
B  → b B'
B' → e B' | ε

Here the new non-terminal A' absorbs the left-recursive parts Bd and a of the original A-productions, and B' absorbs the left-recursive part e of the original B-productions. The resulting grammar generates the same language and is free of left recursion.

6. What is handle Pruning


A handle is a substring of a right-sentential form that matches the right-hand side of a production rule and whose reduction corresponds to one step in the reverse of a rightmost derivation. (The part of the sentential form up to and including the handle forms a viable prefix, often denoted γ.) Once a handle is identified, the parser reduces it to the corresponding non-terminal symbol using a production rule, and that non-terminal is then placed on the parse stack. The process of detecting handles and using them in reductions is called handle pruning.

7. What do you mean by LR Parsing?


LR Parsing is a type of bottom-up parsing technique used in compiler construction to analyse and recognize
the syntax of a programming language. LR stands for "Left-to-right scan of the input, Rightmost derivation (in reverse)": the parser reads the input from left to right and builds the parse tree by constructing a rightmost derivation of the grammar in reverse order.

8. Write down the Steps in construction of predictive Parsing Table.


The steps to construct a predictive parsing table are as follows:
1. For each non-terminal A in the grammar, find its FIRST set. The FIRST set of a non-terminal
is the set of terminals that can be the first symbol of any string derived from that non-
terminal.
2. For each non-terminal A in the grammar, find its FOLLOW set. For each production A → αBβ, everything in FIRST(β) except ε is added to FOLLOW(B); if β can derive the empty string (or β is empty), then the FOLLOW set of A is also added to the FOLLOW set of B. The end-of-input marker $ is placed in the FOLLOW set of the start symbol.
3. For each production A → α in the grammar, add the production A → α to each cell (A, a) in
the parsing table where a is in the FIRST set of α. If α can derive the empty string, then add
the production A → α to each cell (A, b) in the parsing table where b is in the FOLLOW set
of A.
4. If any cell (A, a) in the parsing table has more than one production, then the grammar is not
LL (1) and cannot be parsed by a predictive parser.
5. If any cell (A, a) in the parsing table is empty, then there is no production for parsing the
input symbol a when the parser is in state A. This indicates a syntax error in the input string.
6. If every cell in the parsing table is either empty or contains a single production, then the table
is a valid predictive parsing table for the grammar.
Overall, constructing a predictive parsing table involves identifying the FIRST and FOLLOW sets for
each non-terminal in the grammar, and using these sets to fill in the parsing table entries for each non-
terminal and terminal symbol combination. If the table contains more than one production for a cell, or if
any cell is empty, then the grammar is not suitable for predictive parsing.

9. Compare and contrast SLR, LALR and CLR Parser.

10. Check the following grammar is LL(1) or not.


S aB | 𝝐
B bC | 𝝐
C cS | 𝝐
Solution:
For the given productions S aB | 𝝐
B bC | 𝝐
C cS | 𝝐
We find the First and Follow of each non-terminals { S, B, C }

The parsing table entries are obtained as follows:

S → aB : FIRST(S) = {a}, so S → aB goes into cell (S, a);  S → ε : FOLLOW(S) = {$}, so S → ε goes into cell (S, $).
B → bC : FIRST(B) = {b}, so B → bC goes into cell (B, b);  B → ε : FOLLOW(B) = {$}, so B → ε goes into cell (B, $).
C → cS : FIRST(C) = {c}, so C → cS goes into cell (C, c);  C → ε : FOLLOW(C) = {$}, so C → ε goes into cell (C, $).

Since no cell of the parsing table contains multiple entries, the given grammar is LL(1).
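To make the table concrete, here is a minimal table-driven predictive (LL(1)) parser for this grammar, written in C as a sketch; the encoding of non-terminals as single characters, the function names and the test strings are choices made only for this illustration:

#include <stdio.h>
#include <string.h>

/* Parsing table for S -> aB | e, B -> bC | e, C -> cS | e  (e = epsilon).
   Returns the right-hand side to expand, "" for epsilon, NULL for an error cell. */
static const char *table(char nonterm, char lookahead) {
    if (nonterm == 'S' && lookahead == 'a') return "aB";
    if (nonterm == 'S' && lookahead == '$') return "";   /* S -> epsilon */
    if (nonterm == 'B' && lookahead == 'b') return "bC";
    if (nonterm == 'B' && lookahead == '$') return "";   /* B -> epsilon */
    if (nonterm == 'C' && lookahead == 'c') return "cS";
    if (nonterm == 'C' && lookahead == '$') return "";   /* C -> epsilon */
    return NULL;                                         /* empty cell: syntax error */
}

static int parse(const char *input) {
    char stack[100];
    int top = 0, i = 0;
    stack[top++] = '$';                         /* bottom-of-stack marker */
    stack[top++] = 'S';                         /* start symbol */

    while (top > 0) {
        char X = stack[--top];                  /* pop the top of the stack */
        char a = input[i] ? input[i] : '$';     /* current lookahead symbol */
        if (X == '$') {
            return a == '$';                    /* accept only if the input is also exhausted */
        } else if (X == 'a' || X == 'b' || X == 'c') {
            if (X == a) i++;                    /* match a terminal */
            else return 0;                      /* terminal mismatch: reject */
        } else {
            const char *rhs = table(X, a);
            if (!rhs) return 0;                 /* no table entry: reject */
            for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                stack[top++] = rhs[k];          /* push the RHS in reverse order */
        }
    }
    return 0;
}

int main(void) {
    const char *tests[] = { "", "a", "ab", "abc", "abca", "ba" };
    for (int t = 0; t < 6; t++)
        printf("%-5s -> %s\n", tests[t], parse(tests[t]) ? "accepted" : "rejected");
    return 0;
}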
3. Explain in detail the working of an LR parser and the process of constructing an LR parsing table to process a given input string.

LR parsing is a bottom-up parsing technique used to analyze and process input


strings based on a given context-free grammar (CFG). LR parsers are more
powerful than LL parsers as they can handle a broader class of grammars. An LR
parser uses a parsing table, often called an LR parsing table, to guide its parsing
process.

To understand the properties of LR parsers and the construction of an LR parsing


table, let's break down the key concepts and steps involved.

1. LR(0) Items: An LR(0) item is a production rule of the CFG with a dot (·) at some position of its right-hand side, indicating how much of the production has been seen so far during parsing. For example, for a production A -> αBβ, where A is a non-terminal symbol, B is a grammar symbol and α and β are sequences of grammar symbols, A -> α·Bβ is an LR(0) item; the dot marks the position in the production on which parsing is currently focused.
2. LR(0) Closure: Given a set of LR(0) items, the closure operation expands the set by
including all items that can be reached by applying the production rules of the
CFG. It ensures that all possible parsing configurations are considered.
3. LR(0) Sets of Items: An LR(0) set of items is a set of LR(0) items obtained by
applying closure and other operations. Each LR(0) set represents a state in the LR
parsing process.
4. LR(0) Automaton: An LR(0) automaton is a directed graph where each node
represents an LR(0) set of items, and the edges indicate transitions between sets
based on grammar symbols. The automaton is constructed by systematically
computing LR(0) sets of items.
5. LR(0) Parsing Table: The LR(0) parsing table is a two-dimensional table that guides
the parsing process based on the LR(0) automaton. The rows of the table
represent the states of the automaton, and the columns correspond to grammar
symbols and special end-of-input markers. The table entries contain the parsing
actions, which can be of three types:
 Shift: Move to the next state.
 Reduce: Apply a production rule and replace a set of symbols on the stack with a
non-terminal symbol.
 Accept: The input string is valid according to the CFG.
6. LR(1) Items: An LR(1) item extends the concept of LR(0) items by considering the
lookahead symbol, i.e., the next input symbol that the parser can see. Each LR(1)
item includes a lookahead symbol along with the partial production rule and
marker. For example, A -> α.Bβ, a represents an LR(1) item with a as the
lookahead symbol.
7. LR(1) Sets of Items: Similar to LR(0) sets, LR(1) sets of items are obtained by
applying closure and other operations to LR(1) items. Each LR(1) set represents a
state in the LR(1) parsing process.
8. LR(1) Automaton: The LR(1) automaton is constructed based on LR(1) sets of
items and is similar to the LR(0) automaton. However, the transitions in the LR(1)
automaton consider the lookahead symbols in addition to the grammar symbols.
9. LR(1) Parsing Table: The LR(1) parsing table is constructed based on the LR(1)
automaton. It is similar to the LR(0) parsing table but takes into account the
lookahead symbols while determining the parsing actions.

The process of constructing the LR parsing table involves the following steps:

The process of constructing the LR parsing table involves several steps. Let's go
through them in detail:

Step 1: Augment the Grammar To construct the LR parsing table, we first need to
augment the given context-free grammar (CFG). The augmentation involves
adding a new start symbol and a new production rule to ensure that the parser
can recognize the entire input string.

Step 2: Compute LR(1) Sets of Items We compute the LR(1) sets of items by
applying closure and other operations to LR(1) items. Initially, we start with the
closure of the item [S' -> .S, $], where S' is the augmented start symbol, S is
the original start symbol, and $ represents the end-of-input marker.

We then iteratively process each LR(1) set and expand it by considering the
transitions based on grammar symbols and lookahead symbols. By applying
closure and goto operations, we generate new LR(1) sets until no more sets can
be created.

Step 3: Construct LR(1) Automaton Using the computed LR(1) sets of items, we
construct the LR(1) automaton. Each LR(1) set of items represents a state in the
automaton, and the transitions between states are determined by the grammar
symbols and lookahead symbols.

Step 4: Fill the Parsing Table Now, we create the LR parsing table, which is a two-
dimensional table with rows representing the states of the LR(1) automaton and
columns representing grammar symbols and lookahead symbols.

For each state I, we perform the following actions:


 If there is an LR(1) item A -> α . a β, b in I, where a is a terminal symbol, we fill the entry (I, a) in the parsing table with the "Shift" action and the state to which we transition after shifting on a.
 If there is an LR(1) item A -> α . , a in I (the dot is at the end of the production) and A is not the augmented start symbol, we fill the entry (I, a) in the parsing table with the "Reduce" action and the production rule A -> α.
 If the item S' -> S . , $ is in I, we fill the entry (I, $) in the parsing table with the "Accept" action.
 For each non-terminal symbol B, if the goto operation on I with B reaches a state J, we fill the entry (I, B) in the parsing table with the "Goto" action and the state J.

Step 5: Handle Conflicts During the table construction process, conflicts may arise
in the parsing table. Conflicts occur when multiple actions are possible for a given
state and symbol. The two main types of conflicts are shift-reduce conflicts and
reduce-reduce conflicts.

 Shift-Reduce Conflict: Occurs when both a shift and a reduce action are possible.
It indicates ambiguity in the grammar, and the parser needs additional rules or
information to resolve the conflict.
 Reduce-Reduce Conflict: Occurs when multiple reduce actions are possible. It
indicates ambiguity in the grammar, and the grammar needs to be modified to
eliminate the conflict.

Conflicts need to be resolved to ensure that the parsing table is unambiguous


and deterministic. Different conflict resolution strategies can be employed, such
as precedence and associativity rules.

Once the LR parsing table is constructed and any conflicts are resolved, the parser
can use it to process a given input string by following the actions specified in the
table for each input.

4. Construct the LR(0) items for the grammar given below and develop the SLR parsing table: S → aSa | bSb | aa | bb
Solution:
5. Construct the SLR Parsing table for the following grammar. Also, Parse the input string a * b + a.

E→E+T|T
T → TF | F
F → F* | a | b.
Step1 − Construct the augmented grammar and number the productions.
(0) E′ → E
(1) E → E + T
(2) E → T
(3) T → TF
(4) T → F
(5) F → F ∗
(6) F → a
(7) F → b.
Step2 − Find closure & goto Functions to construct LR (0) items.
Box represents the New states, and the circle represents the Repeating State.
Computation of FOLLOW
We can find out
FOLLOW(E) = {+, $}
FOLLOW(T) = {+, a, b, $}
FOLLOW(F) = {+,*, a, b, $}

Parsing for Input String a * b + a

Stack Input String Action

0 a*b+a$ Shift

0a4 *b+a$ Reduce by F → a.

0F3 *b+a$ Shift

0F3*8 b+a$ Reduce by F → F ∗

0F3 b+a$ Reduce by T → F

0T2 b+a$ Shift

0T2b5 +a $ Reduce by F → b

0T2F7 +a $ Reduce by T → TF

0T2 +a $ Reduce by E → T

0E1 +a $ Shift

0E1+6 a$ Shift

0E1+6a4 $ Reduce by F → a

0E1+6F3 $ Reduce by T → F

0E1+6T9 $ Reduce by E → E + T

0E1 $ Accept
10. Describe in detail the steps of constructing the parsing table of an LALR parser with an example.

The construction of the parsing table for an LALR (Look-Ahead LR) parser combines a compact, LR(0)-sized set of states with look-ahead information: LR(1) states that share the same core (the same LR(0) items) are merged. LALR parsers are more compact than canonical LR(1) parsers while still being able to handle a wide range of grammars. Let's go through the steps of constructing the parsing table for an LALR parser with an example.

Example Grammar: Consider the following set of productions:

1. S -> A c
2. S -> B c
3. A -> a A
4. A -> ε
5. B -> b B
6. B -> ε

Step 1: Construct the LR(0) Automaton The LR(0) automaton is constructed by


computing the LR(0) sets of items for the given grammar. This step is similar to
the construction of the LR(0) automaton for an LR(0) parser. Each LR(0) set
represents a state in the automaton, and the transitions between states are
determined by the grammar symbols.

Step 2: Compute the Look-Ahead Sets In this step, we compute the look-ahead
sets for each LR(0) set of items. The look-ahead sets represent the viable prefixes,
which are the prefixes that can lead to a valid handle (right side of a production).
This step involves applying the look-ahead closure operation to propagate the
look-ahead sets throughout the LR(0) automaton.

Step 3: Merge Compatible States In this step, we merge compatible states of the
LR(0) automaton to create fewer and larger states. Two states are compatible if
they have the same core items (production rules with the same dot position) and
their look-ahead sets are identical. Merging compatible states helps reduce the
size of the LALR parsing table.

Step 4: Construct the Parsing Table Using the merged states from the previous
step, we construct the LALR parsing table. The parsing table is a two-dimensional
table with rows representing the states of the LALR parser, and columns
representing grammar symbols and end-of-input markers.

For each merged state I, we perform the following actions:


 If there is an LR(0) item A -> α . a β in I, where a is a terminal symbol, we fill
the entry (I, a) in the parsing table with the "Shift" action and the state to
which we transition after shifting.
 If there is an LR(0) item A -> α . in I, we fill the entry (I, b) in the parsing
table for every terminal symbol b in the look-ahead set of I with the "Reduce"
action and the production rule A -> α.
 If there is an LR(0) item S' -> S . in I, we fill the entry (I, $) in the parsing
table with the "Accept" action.

Step 5: Handle Conflicts During the table construction process, conflicts may arise
in the parsing table. Conflicts occur when multiple actions are possible for a given
state and symbol. Shift-reduce conflicts and reduce-reduce conflicts can occur.

Shift-reduce conflicts occur when both a shift and a reduce action are possible,
while reduce-reduce conflicts occur when multiple reduce actions are possible.
These conflicts need to be resolved to ensure the parsing table is unambiguous
and deterministic.

By following these steps, we can construct the parsing table for an LALR parser.
The parsing table guides the parsing process for a given input string, enabling the
LALR parser to recognize and analyze the input based on the given grammar.
UNIT-III QUESTION BANK WITH ANSWERS
PART-A

1 What is a translation scheme?


2 State any two rules for type checking.
3 Write down few functionalities of Semantic analyser.
4 Define Synthesized attribute and inherited attribute.
5 What is a dependency graph?
6 Differentiate SDT and SDD.
7 Define L-attributed Definition.
8 Justify why all S-attributed definitions are L-attributed.
9 Define Type equivalence State its types
10 List out the benefits of using machine independent intermediate forms.
11 Why quadruples are preferred over triples in optimizing compiler
1. What is a translation scheme?
A syntax-directed translation scheme is a context-free grammar in which semantic rules (actions) are embedded within the right-hand sides of the productions.
The translation scheme is used to specify the order in which the semantic rules are evaluated.
The position at which an action is to be executed is indicated by enclosing the action in braces and writing it at that position within the right-hand side of the production.

2. State any two rules for type checking.


Rule for Assignment Compatibility: The rule for assignment compatibility checks whether the type of
the expression being assigned is compatible with the type of the target variable. It ensures that the value
being assigned can be safely stored in the variable without violating type constraints.
Rule for Operator Compatibility: The rule for operator compatibility checks whether the types of
operands used in an expression are compatible with the operator being applied. It ensures that the
operands can be safely combined or manipulated according to the semantics of the operator.

3. Write down few functionalities of Semantic analyser.


The semantic analyser performs crucial tasks such as type checking, scope analysis, symbol table
construction, error detection, and intermediate code generation. It ensures that the program's
semantics are correct, identifies and reports semantic errors, and prepares the program for subsequent
phases of the compilation process.
4. Define synthesized attribute and inherited attribute.
A synthesized attribute is an attribute associated with a nonterminal symbol in a production rule
that derives the nonterminal. It represents information that is determined or synthesized at the
production's head or parent symbol based on the attributes of its child symbols. The value of a
synthesized attribute is computed and propagated from the child symbols to the parent symbol during
the parsing process.
Inherited Attribute: An inherited attribute is an attribute associated with a nonterminal symbol in
a production rule that derives one of its child symbols. It represents information that is passed from
the parent symbol to its child symbols during the parsing process. The value of an inherited attribute
at a child symbol is provided or inherited from its parent symbol.
5. What is a dependency graph?

A dependency graph is used to represent the flow of information among the attributes in
a parse tree. In a parse tree, a dependency graph basically helps to determine the
evaluation order for the attributes. The main aim of the dependency graphs is to help the
compiler to check for various types of dependencies between statements in order to
prevent them from being executed in the incorrect sequence.
6. Differentiate SDT and SDD.
SDD: Specifies the values of attributes by associating semantic rules with the
productions.
SDT scheme: embeds program fragments (also called semantic actions) within
production bodies. The position of the action defines the order in which the action is
executed (in the middle of production or end).
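A small illustrative pair of specifications for the same production (the attribute val and the print action are only for illustration):

SDD (semantic rule associated with the production):
    E → E1 + T        E.val = E1.val + T.val

SDT (semantic action embedded at a position in the production body):
    E → E1 + T { print('+') }

In the SDD the rule only states what E.val is; in the SDT the position of { print('+') } at the end of the body additionally specifies that the action is executed after E1 and T have been processed.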
7. Define L-attributed Definition
L-attributed grammars are a special type of attribute grammars. They allow the attributes
to be evaluated in one depth-first left-to-right traversal of the abstract syntax tree. As a
result, attribute evaluation in L-attributed grammars can be incorporated conveniently in
top-down parsing. In an L-attributed SDD the attributes may be either inherited or synthesized, provided each inherited attribute depends only on attributes of symbols to its left in the production (and on inherited attributes of the parent); this is referred to as an L-attributed definition. In an S-attributed SDD, all attributes are synthesized (an S-attributed definition).
8. Justify why all S-attributed definitions are L-attributed.
Synthesized Attribute Dependency: In S-attributed definitions, the computation of
synthesized attributes depends only on the attributes of symbols on the RHS of the
production rule. Since L-attributed definitions allow the computation of synthesized
attributes based on the attributes of child symbols, S-attributed definitions satisfy this
condition.
Inherited Attributes: S-attributed definitions do not involve inherited attributes. In L-
attributed definitions, inherited attributes are passed from the parent symbol to its child
symbols. Since S-attributed definitions do not have any inherited attributes, they
automatically satisfy the condition of not depending on inherited attributes.

Therefore, all S-attributed definitions can be considered L-attributed because they satisfy
the characteristics and restrictions of L-attributed definitions. The attribute computations
in S-attributed definitions follow a left-to-right order, and they depend only on the
attributes of symbols on the RHS of production rules, which aligns with the requirements
of L-attributed definitions.

9. Define Type equivalence State its types.


Type equivalence refers to the comparison or matching of types in a programming language. It
determines whether two types are considered equivalent or compatible based on certain criteria. Type
equivalence plays a crucial role in type checking and type inference during the compilation process.
1. Structural Equivalence
2. Name Equivalence
10. List out the benefits of using machine independent intermediate forms.
Machine-independent intermediate code during code generation offers benefits such as portability,
target independence, simplified optimization, modular design, language agnosticism, support for new
platforms, and improved debugging capabilities. These advantages contribute to the development of
efficient, maintainable, and versatile compilers.
11. Why quadruples are preferred over triples in optimizing compiler?
Quadruples are preferred over triples in an optimizing compiler because optimization frequently
moves instructions around.
In the case of triples, the result of any given operation is referred to by its position, so if one
instruction is moved, every reference to that result must be updated.
This problem does not arise with quadruples, since results are named by explicit temporaries.
PART-B

1. What is an attribute grammar? Describe in detail about its two types of attributes with suitable example.
2. Write S-attributed SDD for simple desk calculator draw annotated parse tree representing any valid input.
3. What is inherited attribute? Write down the SDD with inherited attribute to declare a list of identifiers.
4. Write a SDD for the grammar to declare variables with data type int, float or char. Draw a dependency graph for the declaration statement int a,b,c.
5. How syntax-directed definitions can be used to specify the construction of syntax trees. Give example.
6. Describe in detail about specification of a simple type checker with an example type system to report type error in various statements.
7. Explain in detail about various type expressions and the conventions used to represent various program constructs.
8. Explain various types of type equivalence with suitable example.
9. Describe in detail about various types of three address code with suitable examples.
10. Explain in detail about different symbol table implementation strategies.

1. What is an attribute grammar? Describe in detail about its two types of attributes with
suitable example.
Attribute grammar is a special form of context-free grammar where some additional information
(attributes) are appended to one or more of its non-terminals in order to provide context-sensitive
information. Each attribute has well-defined domain of values, such as integer, float, character,
string, and expressions.
Attribute grammar is a medium to provide semantics to the context-free grammar and it can help
specify the syntax and semantics of a programming language. Attribute grammar (when viewed as
a parse-tree) can pass values or information among the nodes of a tree.
Example:
E → E + T { E.value = E.value + T.value }

The right part of the CFG contains the semantic rules that specify how the grammar should be
interpreted. Here, the values of non-terminals E and T are added together and the result is copied
to the non-terminal E.

Semantic attributes are assigned values from their domains at parse time and are evaluated when
the assignments or conditions that use them are processed. Based on the way the attributes get
their values, they can be broadly divided into two categories:
1. Synthesized attributes 2.Inherited attributes.
Synthesized attributes
These attributes get values from the attribute values of their child nodes. To illustrate, assume the
following production:
S → ABC
If S is taking values from its child nodes (A,B,C), then it is said to be a synthesized attribute, as
the values of ABC are synthesized to S.

As in our previous example (E → E + T), the parent node E gets its value from its child node.
Synthesized attributes never take values from their parent nodes or any sibling nodes.

Inherited attributes
In contrast to synthesized attributes, inherited attributes can take values from parent and/or
siblings. As in the following production,
S → ABC
A can get values from S, B and C. B can take values from S, A, and C. Likewise, C can
take values from S, A, and B.

Expansion: When a non-terminal is expanded to terminals as per a grammatical rule.

Reduction: When a terminal or a string of grammar symbols is reduced to its corresponding
non-terminal according to the grammar rules. Syntax trees are traversed top-down and left to
right, and whenever a reduction occurs, we apply its corresponding semantic rules (actions).

Semantic analysis uses Syntax Directed Translations to perform the above tasks.
Semantic analyzer receives AST (Abstract Syntax Tree) from its previous stage (syntax
analysis).

Semantic analyzer attaches attribute information with AST, which are called Attributed
AST.
Attributes are two-tuple values: <attribute name, attribute value>

For example:

int value = 5;
<type, “integer”>
<presentvalue, “5”>

S-attributed SDT
If an SDT uses only synthesized attributes, it is called as S-attributed SDT. These
attributes are evaluated using S-attributed SDTs that have their semantic actions written
after the production (right hand side).

Attributes in S-attributed SDTs are evaluated in bottom-up parsing, as the values of the parent
nodes depend upon the values of the child nodes.
L-attributed SDT
This form of SDT uses both synthesized and inherited attributes with restriction of not
taking values from right siblings.

In L-attributed SDTs, a non-terminal can get values from its parent, child, and sibling
nodes. As in the following production
S → ABC
S can take values from A, B, and C (synthesized). A can take values from S only. B can
take values from S and A. C can get values from S, A, and B. No non-terminal can get
values from the sibling to its right.

Attributes in L-attributed SDTs are evaluated in a depth-first, left-to-right parsing manner.

2. Write S-attributed SDD for simple desk calculator draw annotated parse tree
representing any valid input.
The syntax-directed definition for a desk calculator program associates an integer-valued
synthesized attribute called val with each of the non-terminals E, T, and F. For each E-, T-,
and F-production, the semantic rule computes the value of attribute val for the non-terminal
on the left side from the values of val for the non-terminals on the right side.
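One standard formulation of this SDD, with a synthesized attribute val on each nonterminal and digit.lexval supplied by the lexical analyzer, is:

L → E n        { L.val = E.val }
E → E1 + T     { E.val = E1.val + T.val }
E → T          { E.val = T.val }
T → T1 * F     { T.val = T1.val * F.val }
T → F          { T.val = F.val }
F → ( E )      { F.val = E.val }
F → digit      { F.val = digit.lexval }

For an input such as 3 * 5 + 4 n, the annotated parse tree is built bottom-up: the F and T nodes for 3 and 5 yield T.val = 15, the F node for 4 yields 4, and the root E (and L) node is annotated with val = 19.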
3. What is inherited attribute? Write down the SDD with inherited attribute to declare a list
of identifiers.
Inherited Attributes – These are the attributes which derive their values from their parent or sibling nodes
i.e. value of inherited attributes are computed by value of parent or sibling nodes.
Example:
A --> BCD { C.in = A.in, C.type = B.type }
Computation of Inherited Attributes –
Construct the SDD using semantic actions.
The annotated parse tree is generated and attribute values are computed in top down manner.
Example: Consider the following grammar
S --> T L
T --> int
T --> float
T --> double
L --> L1, id
L --> id
The SDD for the above grammar can be written as follows
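A standard formulation, with an inherited attribute L.in carrying the type down the identifier list and Enter_type recording the type in the symbol table, is:

S → T L        { L.in = T.type }
T → int        { T.type = integer }
T → float      { T.type = float }
T → double     { T.type = double }
L → L1 , id    { L1.in = L.in ; Enter_type(id.entry, L.in) }
L → id         { Enter_type(id.entry, L.in) }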

Let us assume an input string int a, c for computing inherited attributes. The annotated parse tree for the
input string is

The value of L nodes is obtained from T.type (sibling) which is basically lexical value obtained as int, float
or double. Then L node gives type of identifiers a and c. The computation of type is done in top down
manner or pre-order traversal. Using function Enter_type the type of identifiers a and c is inserted in
symbol table at corresponding id.entry.

4. Write a SDD for the grammar to declare variables with data type int,float or char. Draw
a dependency graph for the declaration statement int a,b,c.
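One possible answer, a sketch that follows the same pattern as question 3 with char in place of double and Enter_type as the symbol-table helper:

D → T L        { L.in = T.type }
T → int        { T.type = integer }
T → float      { T.type = float }
T → char       { T.type = char }
L → L1 , id    { L1.in = L.in ; Enter_type(id.entry, L.in) }
L → id         { Enter_type(id.entry, L.in) }

For the declaration int a,b,c the dependency graph contains a node for T.type and a node for the inherited attribute L.in at each L node. Edges run from T.type to the L.in of the outermost list, from each L.in to the L1.in of its sublist, and from each L.in to the (dummy) attribute of the Enter_type call at the corresponding identifier, which forces a top-down, left-to-right evaluation order.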
5. How syntax-directed definitions can be used to specify the construction of syntax trees.
Give example.
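A sketch of the usual approach: each nonterminal carries a synthesized attribute node that points to a syntax-tree node, and the semantic rules build the tree bottom-up with constructors such as Node(op, left, right) for interior nodes and Leaf(token, value) for leaves (the constructor names are illustrative):

E → E1 + T     { E.node = new Node('+', E1.node, T.node) }
E → E1 - T     { E.node = new Node('-', E1.node, T.node) }
E → T          { E.node = T.node }
T → ( E )      { T.node = E.node }
T → id         { T.node = new Leaf(id, id.entry) }
T → num        { T.node = new Leaf(num, num.val) }

For an input such as a - 4 + c, these rules produce a tree whose root is a '+' node with a '-' subtree (for a - 4) and the leaf for c as children; since only synthesized attributes are used, the tree can be built in a single bottom-up pass.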
6. Describe in detail about specification of a simple type checker with an example type
system to report type error in various statements.
A type checker is a crucial component of a programming language compiler or interpreter that
ensures the compatibility of types within a program. It analyzes the static types of variables,
expressions, and statements to identify potential type errors before the program is executed. Here,
I'll describe the specifications of a simple type checker and provide an example type system to
demonstrate how it can report type errors in various statements.
Specification of a Simple Type Checker:
Type Definitions: The type checker needs to define the types available in the programming
language. These types can include basic types like integers, booleans, characters, and floating-
point numbers, as well as user-defined types like structures or classes.
Symbol Table: The type checker maintains a symbol table that keeps track of declared variables
and their associated types. It allows the type checker to look up the type of a variable when it is
encountered in expressions or statements.
Type Inference: The type checker performs type inference, which means it deduces the types of
expressions based on the types of their constituent parts. For example, if an expression contains an
addition operation between two integers, the type checker infers that the result is also an integer.
Type Checking Rules: The type checker defines a set of rules that specify how different types
can interact with each other. For example, it might specify that an addition operation is only valid
between two integers or floats and not between an integer and a string.
Error Reporting: When a type error is encountered, the type checker generates an error message
that describes the nature of the error and the location in the source code where the error occurred.
It may also suggest possible fixes or provide additional information to help the programmer
understand the issue.
Example Type System:
Let's consider a simple programming language with the following types: int (integer), bool
(boolean), and float (floating-point number).
The type checker will enforce the following rules:
An integer can be assigned to an integer variable.
A boolean can be assigned to a boolean variable.
A floating-point number can be assigned to a float variable.
An integer can be implicitly converted to a float.
A boolean cannot be implicitly converted to any other type.
Arithmetic operations (addition, subtraction, multiplication, division) are only allowed between
integers or floats.
Comparison operations (equality, inequality, greater than, less than) are only allowed between
integers or floats, and the result is a boolean.
Example Code:
python
x: int = 5
y: float = 2.5
z: bool = True
# Valid assignments
x = 10
y = 3.14
z = False
# Invalid assignments
x = "Hello"   # Type error: cannot assign a string to an int variable
y = x + y     # Type error: addition between an int and a float
z = x > y     # Type error: comparison between an int and a float
In this example, the type checker reports three type errors. The first occurs when a string is
assigned to an integer variable, violating rule 1. The second and third occur when an integer and
a float are mixed in an addition and in a comparison: in this simple type system, the implicit
int-to-float conversion of rule 4 applies only to direct assignments, so the mixed-operand
arithmetic violates rule 6 and the mixed-operand comparison violates rule 7. The type checker
produces error messages indicating the nature of each error and the specific line of code where it
occurs. Overall, a simple type checker analyses the types used in a program and ensures their
compatibility, helping to catch type errors early and promote type safety.
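A minimal sketch of such a checker in Python, using hypothetical helper names and mirroring the rules above (int-to-float widening is allowed on direct assignment, while mixed int/float operands inside an expression are treated as errors):

# Hypothetical minimal type checker for the rules listed above.
NUMERIC = {"int", "float"}

symbols = {}          # name -> declared type

def declare(name, typ):
    symbols[name] = typ

def check_assign(name, value_type):
    target = symbols[name]
    if target == value_type:
        return
    if target == "float" and value_type == "int":
        return                         # rule 4: int widens to float on assignment
    raise TypeError(f"cannot assign {value_type} to {target} variable '{name}'")

def check_binop(op, left_type, right_type):
    if op in {"+", "-", "*", "/"}:
        if left_type == right_type and left_type in NUMERIC:
            return left_type           # rule 6: arithmetic on matching numeric types
    if op in {"==", "!=", "<", ">"}:
        if left_type == right_type and left_type in NUMERIC:
            return "bool"              # rule 7: comparison yields a boolean
    raise TypeError(f"operator {op} not defined for {left_type} and {right_type}")

declare("x", "int"); declare("y", "float"); declare("z", "bool")
check_assign("x", "int")               # ok
check_assign("y", "int")               # ok, implicit widening on assignment
# check_assign("x", "string")          # would raise: rule 1 violated
# check_binop("+", "int", "float")     # would raise: mixed operand types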
7. Explain in detail about various type expressions and the conventions used to represent
various program constructs.
Type expressions are used to represent the types of variables, expressions, and program constructs
in a programming language. They describe the nature of the data and operations that can be
performed on them. The conventions used to represent various program constructs vary across
programming languages, but I'll provide a general overview of commonly used conventions.
1. Basic Types:
 Integers: Represented as int, integer, or simply i.
 Booleans: Represented as bool, boolean, or b.
 Floating-Point Numbers: Represented as float, double, or f.
 Characters: Represented as char or c.
 Strings: Represented as string or str.
2. Arrays:
 Arrays: Represented as T[], where T represents the type of elements in the array.
For example, an array of integers can be represented as int[] or array<int>.
 Array Size: Some languages specify the size of an array as part of the type
expression. For example, an array of 10 integers can be represented as int[10] or
array<int, 10>.
3. Pointers:
 Pointers: Represented using a * symbol. For example, a pointer to an integer can be
represented as int* or ptr<int>.
 Nullability: Some languages support Nullability, indicating that a pointer can have
a value of null. This is often represented using ? or Nullable<T>. For example, a
nullable pointer to an integer can be represented as int? or Nullable<int>.
4. Functions:
 Functions: Represented as (parameters) -> return_type. For example, a function
that takes two integers and returns a boolean can be represented as (int, int) ->
bool.
 Anonymous Functions/Lambdas: Represented using a shorthand notation
depending on the programming language. For example, in Python, a lambda
function that takes an integer and returns its square can be represented as lambda x:
x * x.
5. Structures/Classes:
 Structures/Classes: Represented using the name of the structure/class. For example,
a structure/class named Person is represented as Person.
 Generic Structures/Classes: If a structure/class is generic, type parameters are used.
For example, a generic list structure that can hold any type can be represented as
List<T>, where T represents the type parameter.
6. Union Types:
 Union Types: Represented using a | symbol between multiple types. It indicates
that a value can have any of the specified types. For example, a variable that can
hold either an integer or a float can be represented as int | float.
7. Custom Types:
 Custom types defined by the programmer are represented using the name chosen
for the type. For example, if a programmer defines a custom type named Color, it
is represented as Color.
These conventions may vary depending on the programming language and its type system. Some
languages may use keywords or specific syntax to represent certain constructs. It's important to
consult the documentation or specifications of the programming language you are working with to
understand the specific conventions and representations used.
8. Explain various types of type equivalence with suitable example.

Type equivalence refers to the comparison of types to determine whether they are equivalent or
compatible in a given context. Semantic analysis is a phase in the compilation process that focuses
on analysing the meaning and correctness of a program. It also involves checking whether two types
can be safely used interchangeably without violating the rules of the programming language. The
specific rules for type equivalence vary depending on the language and its type system. Here are a
few common aspects of type equivalence.
1. Compatibility of Basic Types: In many programming languages, there are predefined
basic types such as integers, floating-point numbers, booleans, etc. Type equivalence for basic
types typically involves checking if the types match exactly.
2. Compatibility of Composite Types: Composite types, such as arrays, structures, classes,
or records, often have additional considerations for type equivalence. This may include checking
the compatibility of their component types, the order of fields, and the presence of optional or
variable-length components.
3. Compatibility of User-Defined Types: Type equivalence also applies to user-defined types,
such as classes or structs defined by the programmer. In this context, type equivalence may
involve checking the inheritance hierarchy, interfaces, or base classes to ensure that the types are
compatible.
Type equivalence in semantic analysis ensures that type rules are enforced, preventing type errors and
ensuring type safety in a program.

The main difficulty arises from the fact that most modern languages allow the naming of user-
defined types. For instance, in C and C++ this is achieved by the typedef statement. When
checking equivalence of named types, we have two possibilities.
Name equivalence.

Name equivalence is a concept in type systems where types are considered equivalent
if they have the same name, regardless of their internal structure or composition. It
means that two types with the same name are treated as equivalent types, even if they
are defined differently or have different component types.

Name equivalence is commonly found in languages with nominal type systems, where
type compatibility is determined based on the names of the types. In such type systems,
types are given unique names or identifiers, and type equivalence is determined by
comparing these names.

Here are a few examples of name equivalence in different programming languages:

1. Java:
class MyClass {
    // ...
}

// The following types are name-equivalent
MyClass obj1;
MyClass obj2;

2. C++:
class Point {
    // ...
};

// The following types are name-equivalent
Point p1;
Point p2;

In both of these examples, the types MyClass in Java and Point in C++ are considered
equivalent because they have the same name. The internal structure of the types or their
definitions doesn't affect their equivalence.

Name equivalence simplifies type checking and comparison because it focuses solely
on the names of the types. However, it also means that types with the same name but
different structures or definitions are not considered equivalent. This can limit
flexibility in certain cases, especially when dealing with complex type structures or
when trying to establish compatibility between types defined separately.

In contrast to name equivalence, structural equivalence compares the internal structure
or composition of types to determine their equivalence. Structural equivalence is
typically found in languages with structural type systems, such as Haskell, where types
are considered equivalent if their structures match, regardless of their names.

It's important to note that different programming languages may have different type
equivalence rules depending on their design goals, type systems, and language
semantics. The choice between name equivalence and structural equivalence often
depends on the language's intended use cases and philosophy.
Structural equivalence, also known as structural typing or duck typing, is a concept in
type systems where types are considered equivalent or compatible if their structures
match, regardless of their names or declarations. It is a type system feature that allows
for flexible and dynamic type checking based on the shape or structure of types.

In a type system that employs structural equivalence, two types are considered
equivalent if they have the same structure, meaning that their components, fields,
methods, or properties match in terms of number, types, and sometimes order. This
enables objects or values of different types to be used interchangeably as long as their
structures align.

Structural equivalence is commonly found in dynamically-typed languages and
languages with optional static typing, as it promotes code reuse and flexibility. Here
are a few examples of languages and type systems that utilize structural equivalence:

1. Python: Python is a dynamically-typed language that utilizes duck typing, which is a
form of structural typing. In Python, types are determined by the presence of certain
methods or attributes, rather than explicit type declarations.
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# The following types are structurally equivalent
p1 = Point(0, 0)
p2 = {'x': 0, 'y': 0}
In these examples, types with the same structures are considered equivalent, allowing
objects or values of different types to be used interchangeably when their structures
match. This promotes code reuse and flexibility, as it enables polymorphism based on
the shape or capabilities of objects rather than explicit type relationships.

Structural equivalence simplifies type compatibility and allows for more flexible and
dynamic programming, but it also requires careful consideration and may introduce
potential issues such as accidental compatibility or difficulties in static analysis.
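The distinction can be made concrete with a small Python sketch; the tuple encoding of type expressions and the record names Point and Coord are invented for illustration:

# Hypothetical type expressions: ("record", name, tuple_of_field_types) or a basic type string.
PointType = ("record", "Point", ("int", "int"))
CoordType = ("record", "Coord", ("int", "int"))

def name_equivalent(t1, t2):
    # Named types match only if their names match.
    if isinstance(t1, tuple) and isinstance(t2, tuple):
        return t1[0] == t2[0] and t1[1] == t2[1]
    return t1 == t2

def structurally_equivalent(t1, t2):
    # Ignore the name; compare the component types recursively.
    if isinstance(t1, tuple) and isinstance(t2, tuple):
        return t1[0] == t2[0] and len(t1[2]) == len(t2[2]) and all(
            structurally_equivalent(a, b) for a, b in zip(t1[2], t2[2]))
    return t1 == t2

print(name_equivalent(PointType, CoordType))          # False: different names
print(structurally_equivalent(PointType, CoordType))  # True: same shape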

9. Describe in detail about various types of three address code with suitable examples.
Three-address code (TAC) is a low-level intermediate representation used in compilers
and code optimization. It represents high-level language statements in a simplified
form, with each statement containing at most three operands or addresses. TAC is
designed to be easily translated into machine code and enables efficient code
optimization techniques. There are various types of three-address code structures,
including:

Assignment Statements:
x=y+z
In this type of TAC, an assignment statement is represented with a destination operand
(x) and two source operands (y and z). The TAC generates code that computes the sum
of y and z and stores the result in x.

Arithmetic Expressions:
x = y * (a + b)
TAC for arithmetic expressions involves multiple statements. In this example, the
expression is split into two TAC statements. First, the sum of a and b is computed and
stored in a temporary variable (t1). Then, the product of y and t1 is calculated and
assigned to x.
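Concretely, the two TAC statements described here would read:

t1 = a + b
x = y * t1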

Conditional and Control Statements:


if x > 10 goto L1
Conditional and control statements in TAC often involve labels. In this example, the
TAC checks if the value of x is greater than 10. If the condition is true, the code jumps
to the label L1.
Procedure Calls:
z = foo(x, y)
TAC for procedure calls includes the name of the function (foo) and the arguments (x
and y). The TAC generates code to transfer control to the function and assigns the
return value to the variable z.
Array Operations:
x = arr[i]
TAC supports array operations. In this example, the TAC fetches the value at index i
from the array arr and assigns it to the variable x.

Pointer Operations:
x = *p
TAC allows pointer operations. In this example, the TAC fetches the value pointed to
by the pointer p and assigns it to the variable x.

Control Flow Statements:


if x > 10 goto L1
goto L2
TAC handles control flow statements such as if-else or switch statements. In this
example, if the condition x > 10 is true, the code jumps to label L1; otherwise, it
continues to label L2.
Function Definitions:
int foo(int a, int b) {
return a + b;
}
TAC represents function definitions, including parameter declarations, return type, and
the code inside the function. Each statement in the function is converted into TAC.

These examples illustrate the different types of three-address code structures
commonly used in compilers. TAC simplifies complex high-level language constructs
into a form that is easier to analyse and optimize. By breaking down statements into a
maximum of three operands, TAC enables efficient code generation and optimization
techniques such as constant folding, dead code elimination, and register allocation.
Implementation of Three Address Code –
There are 3 representations of three address code namely
1. Quadruple
It is a structure which consists of 4 fields namely op, arg1, arg2 and result. op denotes
the operator and arg1 and arg2 denotes the two operands and result is used to store the
result of the expression.
Advantage –
 Easy to rearrange code for global optimization.
 One can quickly access value of temporary variables using symbol table.
Disadvantage –
 Contain lot of temporaries.
 Temporary variable creation increases time and space complexity.

Example – Consider expression a = b * – c + b * – c. The three address code is:


t1 = uminus c
t2 = b * t1
t3 = uminus c
t4 = b * t3
t5 = t2 + t4
a = t5
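In quadruple form, this code would be laid out roughly as follows (op, arg1, arg2, result):

        op       arg1   arg2   result
(0)     uminus   c              t1
(1)     *        b      t1      t2
(2)     uminus   c              t3
(3)     *        b      t3      t4
(4)     +        t2     t4      t5
(5)     =        t5             a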
2. Triples
This representation doesn't make use of an extra temporary variable to represent a single
operation; instead, when a reference to another triple's value is needed, a pointer to that
triple is used. So it consists of only three fields, namely op, arg1 and arg2.

Disadvantage –
Temporaries are implicit, and it is difficult to rearrange the code.
It is difficult to optimize because optimization involves moving intermediate code. When a triple
is moved, any other triple referring to it must also be updated. With the help of a pointer one can
directly access a symbol table entry.
Example – Consider expression a = b * – c + b * – c
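The triples for this expression would look roughly like the following, with results referred to by their position numbers:

        op       arg1   arg2
(0)     uminus   c
(1)     *        b      (0)
(2)     uminus   c
(3)     *        b      (2)
(4)     +        (1)    (3)
(5)     =        a      (4)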

3. Indirect Triples:
This representation makes use of pointers to a separate listing of all references to
computations, which is made and stored separately. It is similar in utility to the
quadruple representation but requires less space. Temporaries are implicit and it is
easier to rearrange code.

Example – Consider expression a = b * – c + b * – c
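A sketch of the indirect-triple layout for this expression, assuming the statement list happens to start at entry 35 (the numbering is illustrative):

Statement list           Triples
35   (0)            (0)  uminus   c
36   (1)            (1)  *        b      (0)
37   (2)            (2)  uminus   c
38   (3)            (3)  *        b      (2)
39   (4)            (4)  +        (1)    (3)
40   (5)            (5)  =        a      (4)

Reordering the code only requires changing the statement list; the triples themselves stay where they are.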


10. Explain in detail about different symbol table implementation strategies.
The symbol table is defined as the set of Name and Value pairs.
Symbol Table is an important data structure created and maintained by the compiler in order to
keep track of semantics of variables i.e. it stores information about the scope and binding
information about names, information about instances of various entities such as variable and
function names, classes, objects, etc.
Thus compiler can keep track of all the identifiers with all the necessary information.

Items stored in Symbol table:

 Variable names and constants


 Procedure and function names
 Literal constants and strings
 Compiler generated temporaries
 Labels in source languages
Information used by the compiler from Symbol table:
 Data type and name
 Declaring procedures
 Offset in storage
 If structure or record then, a pointer to structure table.
 For parameters, whether parameter passing by value or by reference
 Number and type of arguments passed to function
 Base Address
Operations of Symbol table – The basic operations defined on a symbol table typically include allocate and free (to set up and discard the table), insert (to add a name together with its attributes), lookup (to search for a name and fetch its attributes), and set_attribute / get_attribute (to update or read the information attached to a name).

Implementation of Symbol table –


Following are commonly used data structures for implementing symbol table:-
1. List –
We use a single array or equivalently several arrays, to store names and their associated
information, new names are added to the list in the order in which they are encountered. The
position of the end of the array is marked by the pointer available, pointing to where the next
symbol-table entry will go. The search for a name proceeds backwards from the end of the array
to the beginning. When the name is located, the associated information can be found in the words
that follow it.

id1 info1 id2 info2 …….. id_n info_n


Linked List –
 This implementation is using a linked list. A link field is added to each record.
 Searching of names is done in order pointed by the link of the link field.
 A pointer “First” is maintained to point to the first record of the symbol table.
 Insertion is fast O(1), but lookup is slow for large tables – O(n) on average
Hash Table –
 In hashing scheme, two tables are maintained – a hash table and symbol table and are the
most commonly used method to implement symbol tables.
 A hash table is an array with an index range: 0 to table size – 1. These entries are pointers
pointing to the names of the symbol table.
 To search for a name we use a hash function that will result in an integer between 0 to
table size – 1.
 Insertion and lookup can be made very fast – O(1).
 The advantage is that a quick search is possible; the disadvantage is that hashing is
complicated to implement.
Binary Search Tree –
 Another approach to implementing a symbol table is to use a binary search tree i.e. we add
two link fields i.e. left and right child.
 Names are inserted as nodes of the tree so that it always maintains the property of a
binary search tree.
 Insertion and lookup are O(log2 n) on average.
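As a small illustration of the hash-table strategy, the following Python sketch (the class name and interface are illustrative, not a fixed API) chains colliding entries in buckets and provides the insert and lookup operations discussed above:

# Hypothetical hash-table symbol table with separate chaining.
class SymbolTable:
    def __init__(self, size=211):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def _hash(self, name):
        # Simple hash over the characters of the name, reduced to a bucket index.
        return sum(ord(ch) for ch in name) % self.size

    def insert(self, name, attributes):
        self.buckets[self._hash(name)].append((name, attributes))

    def lookup(self, name):
        for entry_name, attributes in self.buckets[self._hash(name)]:
            if entry_name == name:
                return attributes
        return None                      # symbol not present in the table

table = SymbolTable()
table.insert("interest", {"type": "int", "storage": "static"})
print(table.lookup("interest"))          # {'type': 'int', 'storage': 'static'}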
UNIT-IV QUESTION BANK SOLUTIONS
PART-A

1. Define Activation record.


An Activation Record is a data structure that is activated/ created when a
procedure/function is invoked, and it includes the following data about the function.
Activation Record in 'C' language consist of
Actual Parameters
Number of Arguments
Return Address
Return Value
Old Stack Pointer (SP)
Local Data in a function or procedure
2. What is static storage allocation?
In static allocation, names are bound to storage locations at compile time. If memory is created at
compile time then the memory will be created in the static area and only once. Static allocation does
not support dynamic data structures: memory is created only at compile time and is deallocated only
after the program completes.
3. Give some limitations of static storage allocation.
The drawback with static storage allocation is that the size and position of data objects must be
known at compile time. Another drawback is that recursive procedures cannot be supported.
4. Compare static, stack and Heap memory allocation strategies.
 Static memory allocation is performed at compile-time and involves assigning fixed
memory locations to variables and data structures.
 The memory allocated for static variables remains constant throughout the program's
execution.
 Static variables are typically used for global variables and variables declared with the static
keyword inside functions.
 The allocation and deallocation of static memory are handled automatically by the compiler.
 The main advantage of static memory allocation is its efficiency and deterministic nature, as
the memory is allocated and deallocated once at compile-time.

 Stack memory allocation is performed at runtime and follows a Last-In-First-Out (LIFO)


structure.
 The stack is a reserved portion of memory that grows and shrinks automatically as functions
are called and return.
 Local variables and function parameters are typically allocated on the stack.
 Stack memory allocation is efficient and fast because memory allocation and deallocation are
managed through simple stack pointer manipulation.
 Memory allocated on the stack is automatically deallocated when a function call is finished.
 However, stack memory is limited, and excessive memory usage can lead to stack overflow
errors.
 Heap memory allocation is dynamic and allows for the allocation and deallocation of
memory at runtime.
 Memory allocated on the heap remains available until explicitly deallocated by the
programmer.
 Heap memory is typically used for dynamically allocated data structures such as arrays and
objects.
 Heap memory allocation is more flexible but also less efficient compared to static and stack
allocation.
 Memory management in the heap is typically handled through functions like malloc() and
free() in languages like C.
 Improper management of heap memory can lead to memory leaks or fragmentation issues.

5. State the importance of control stack during procedure execution.

The control stack is vital for managing function calls, preserving return addresses, storing local
variables, enabling recursion, facilitating stack unwinding, and maintaining control flow during
program execution. It provides the necessary structure and organization for orderly procedure
execution and efficient handling of function calls.
6. List out different parameter passing methods
Pass-by-value.
Pass-by-reference.
Pass-by-value-result.
Pass-by-name.

7. What are the important factors affecting the target code generation?
1. Input to the code generator: The input to the code generator is intermediate representation together
with the information in the symbol table. Intermediate representation has the several choices:
 Postfix notation,
 Syntax tree or DAG,
 Three address code
The code generation phase requires complete, error-free intermediate code as its input.
2. Target Program: The target program is the output of the code generator. The output can be:

Assembly language: It allows subprogram to be separately compiled.

Relocatable machine language: It makes the process of code generation easier.

Absolute machine language: It can be placed in a fixed location in memory and can be executed
immediately.
3. Target Machine: the architecture of the target machine and its instruction set.
4. Instruction Selection: choosing target instructions that implement the intermediate code correctly and efficiently.
5. Register Allocation: proper utilization of registers improves code efficiency. Computations using registers are faster than those using memory, so efficient utilization of registers is important. The use of registers is subdivided into two sub-problems: register allocation (deciding which variables reside in registers) and register assignment (picking the specific register for each variable).
6. Choice of Evaluation Order: the efficiency of the target code can be affected by the order in which the
computations are performed.

8. State the rules to determine leaders of a basic block.


1. The first instruction of the program is a leader.
2. Any instruction that is the target of a branch (e.g., jump, conditional branch) is a leader.
3. Any instruction that follows an unconditional branch is a leader.
4. Any instruction that follows a conditional branch or return statement is a leader.
5. Any instruction that is the target of a procedure call is a leader.
9. Define Dominator in a flow graph.

In a flow graph, a node d dominates node n, if every path from initial node of the flow
graph to n goes through d. This will be denoted by d dom n. The initial node dominates all the
remaining nodes.
A dominator is a concept that refers to the relationship between nodes in a flow graph.
Specifically, a node X is said to dominate another node Y if every path from the entry node of the
graph to Y must go through X. In other words, X dominates Y if X is an ancestor of Y in the flow
graph; in particular, the entry of a loop dominates all nodes in the loop.
10. State the applications of DAG

DAGs are useful for representing many different types of flows, including data processing flows. By
thinking about large-scale processing flows in terms of DAGs, one can more clearly organize the various
steps and the associated order for these jobs.
1. Task Scheduling
2. Compiler Optimization:
3. Data Flow Analysis:
PART-B
1. Describe in detail about various operations in symbol table organization.

Symbol table is an important data structure created and maintained by compilers in order to
store information about the occurrence of various entities such as variable names, function
names, objects, classes, interfaces, etc. Symbol table is used by both the analysis and the
synthesis parts of a compiler.
A symbol table may serve the following purposes depending upon the language in hand:
 To store the names of all entities in a structured form at one place.
 To verify if a variable has been declared.
 To implement type checking, by verifying assignments and expressions in the source code
are semantically correct.
 To determine the scope of a name (scope resolution).
A symbol table is simply a table which can be either linear or a hash table. It maintains an
entry for each name in the following format:
<symbol name, type, attribute>
For example, if a symbol table has to store information about the following variable
declaration:
static int interest;
then it should store the entry such as:
<interest, int, static>
The attribute clause contains the entries related to the name.
Implementation
If a compiler is to handle a small amount of data, then the symbol table can be implemented
as an unordered list, which is easy to code, but it is only suitable for small tables only. A
symbol table can be implemented in one of the following ways:

 Linear (sorted or unsorted) list


 Binary Search Tree
 Hash table
Among all, symbol tables are mostly implemented as hash tables, where the source code
symbol itself is treated as a key for the hash function and the return value is the information
about the symbol.
Operations
A symbol table, either linear or hash, should provide the following operations.
insert()
This operation is more frequently used by analysis phase, i.e., the first half of the compiler
where tokens are identified and names are stored in the table. This operation is used to add
information in the symbol table about unique names occurring in the source code. The format
or structure in which the names are stored depends upon the compiler in hand.
An attribute for a symbol in the source code is the information associated with that symbol.
This information contains the value, state, scope, and type about the symbol. The insert()
function takes the symbol and its attributes as arguments and stores the information in the
symbol table.
For example:
int a;
should be processed by the compiler as:
insert(a, int);
lookup()
lookup() operation is used to search a name in the symbol table to determine:

 if the symbol exists in the table.


 if it is declared before it is being used.
 if the name is used in the scope.
 if the symbol is initialized.
 if the symbol declared multiple times.
The format of lookup() function varies according to the programming language. The basic
format should match the following:
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If the symbol
exists in the symbol table, it returns its attributes stored in the table.
Scope Management
A compiler maintains two types of symbol tables: a global symbol table which can be
accessed by all the procedures and scope symbol tables that are created for each scope in the
program.
To determine the scope of a name, symbol tables are arranged in hierarchical structure as
shown in the example below:
...
int value=10;

void pro_one()
{
int one_1;
int one_2;

{ \
int one_3; |_ inner scope 1
int one_4; |
} /

int one_5;
{ \
int one_6; |_ inner scope 2
int one_7; |
} /
}

void pro_two()
{
int two_1;
int two_2;

{ \
int two_3; |_ inner scope 3
int two_4; |
} /

int two_5;
}
...
The above program can be represented in a hierarchical structure of symbol tables:
The global symbol table contains names for one global variable (int value) and two procedure
names, which should be available to all the child nodes shown above. The names mentioned
in the pro_one symbol table (and all its child tables) are not available for pro_two symbols
and its child tables.
This symbol table data structure hierarchy is stored in the semantic analyser and whenever a
name needs to be searched in a symbol table, it is searched using the following algorithm (a small
code sketch of this search appears after the list):
 First a symbol will be searched in the current scope, i.e. current symbol table.
 if a name is found, then search is completed, else it will be searched in the parent symbol
table until,
 Either the name is found or global symbol table has been searched for the name.
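A minimal Python sketch of this search order (the Scope class and its interface are illustrative): each scope keeps a reference to its parent, and lookup walks outward until the global table has been searched:

# Hypothetical scoped symbol tables: lookup walks from the current scope to the global one.
class Scope:
    def __init__(self, parent=None):
        self.entries = {}        # name -> attributes
        self.parent = parent

    def insert(self, name, attributes):
        self.entries[name] = attributes

    def lookup(self, name):
        scope = self
        while scope is not None:
            if name in scope.entries:
                return scope.entries[name]
            scope = scope.parent
        return None              # searched up to the global table without success

global_scope = Scope()
global_scope.insert("value", ("int", "global"))
pro_one = Scope(parent=global_scope)
pro_one.insert("one_1", ("int", "local"))
print(pro_one.lookup("value"))   # found in the global table: ('int', 'global')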

2. List out various issues in design of a code generator and possible design alternatives.
Main issues in the design of a code generator are:
 Input to the code generator
 Target program
 Memory management
 Instruction selection
 Register allocation
 Evaluation order
1. Input to the code generator
The intermediate code created by the frontend, together with information from the symbol table that
defines the run-time addresses of the data objects signified by the names in the intermediate
representation, is fed into the code generator. Intermediate code may be represented mainly as
quadruples, triples, indirect triples, postfix notation, syntax trees, DAGs (Directed Acyclic
Graphs), etc. The code generation step assumes that the input is free of all syntactic and semantic
mistakes, that all essential type checking has been performed, and that type-conversion operators
have been introduced where needed.
2. Target program
The code generator's output is the target program. The result could be:
 Assembly language: It allows subprograms to be separately compiled.
 Relocatable machine language: It simplifies the code generating process.
 Absolute machine language: It can be stored in a specific position in memory and run immediately.
3. Memory management
In the memory management design, the source program's frontend and code generator map names
address data items in run-time memory. It utilizes a symbol table. In a three-address statement, a
name refers to the name's symbol-table entry. Labels in three-address statements must be
transformed into instruction addresses.
For example,
j: goto i generates the following jump instruction:
If i < j, a backward jump instruction is generated with a target address equal to the code location
of quadruple i.
If i > j, it is a forward jump. The location of the first machine instruction generated for
quadruple j must be saved on a list for quadruple i. When i is processed, the machine locations for
all instructions that jump forward to i are filled in.
4. Instruction selection
In instruction selection, the efficiency of the generated program is improved by selecting the
optimum instructions. The instruction set should be complete and consistent. Regarding efficiency,
instruction speeds and machine idioms have a big effect. Instruction selection is simple if we
don't care about the target program's efficiency. The following three-address statements, for
example, would be translated into the code sequence shown below:
P:=Q+R
S:=P+T
MOV Q, R0
ADD R, R0
MOV R0, P
MOV P, R0
ADD T, R0
MOV R0, S

The fourth instruction (MOV P, R0) is unnecessary, since it reloads the value of P that the previous
instruction has just stored from R0. Keeping it results in an inefficient code sequence. A given intermediate representation can be
translated into several distinct code sequences, each with considerable cost differences. Previous
knowledge of instruction cost is required to build good sequences, yet reliable cost information is
difficult to forecast.
5. Register allocation
Values held in registers can be accessed faster than those held in memory. Instructions involving
register operands are shorter and faster than those involving memory operands.
The following sub-problems arise when we use registers:
 Register allocation: In register allocation, we select the set of variables that will reside in the
register.
 Register assignment: In the Register assignment, we pick the register that contains a variable.
Certain machines require even-odd pairs of registers for some operands and results.
Example
Consider the following division instruction of the form:
D x, y
Where,
x is the dividend, held in the even register of an even/odd register pair
y is the divisor
The odd register of the pair is used to hold the quotient.
6. Evaluation order
The code generator determines the order in which the instructions are executed. The target code's
efficiency is influenced by order of computations. Many computational orders will only require a
few registers to store interim results. However, choosing the best order is a completely
challenging task in the general case.
3. Explain how flow graph representation of three address statement is helpful in code
generation
A basic block is a simple combination of statements. Except for entry and exit, the basic blocks
do not have any branches like in and out. It means that the flow of control enters at the beginning
and it always leaves at the end without any halt. The execution of a set of instructions of a basic
block always takes place in the form of a sequence.
The first step is to divide a group of three-address codes into the basic block. The new basic
block always begins with the first instruction and continues to add instructions until it reaches a
jump or a label. If no jumps or labels are identified, the control will flow from one instruction to
the next in sequential order.
The algorithm for the construction of the basic block is described below step by step:
Algorithm: The algorithm used here is partitioning the three-address code into basic blocks.
Input: A sequence of three-address codes will be the input for the basic blocks.
Output: A list of basic blocks with each three address statements, in exactly one block, is
considered as the output.
Method: We’ll start by identifying the intermediate code’s leaders. The following are some
guidelines for identifying leaders:
1. The first instruction in the intermediate code is generally considered as a leader.
2. The instructions that target a conditional or unconditional jump statement can be considered
as a leader.
3. Any instructions that are just after a conditional or unconditional jump statement can be
considered as a leader.
Each leader’s basic block will contain all of the instructions from the leader until the instruction
right before the following leader’s start.
Example of basic block:
Three Address Code for the expression a = b + c – d is:
T1 = b + c
T2 = T1 - d
a = T2
This represents a basic block in which all the statements execute in a sequence one after the other.

Basic Block Construction:


Let us understand the construction of basic blocks with an example:
Example:
1. PROD = 0
2. I = 1
3. T2 = addr(A) – 4
4. T4 = addr(B) – 4
5. T1 = 4 x I
6. T3 = T2[T1]
7. T5 = T4[T1]
8. T6 = T3 x T5
9. PROD = PROD + T6
10. I = I + 1
11. IF I <=20 GOTO (5)
Using the algorithm given above, we can identify the number of basic blocks in the above three-
address code easily-
There are two Basic Blocks in the above three-address code:
 B1 – Statement 1 to 4
 B2 – Statement 5 to 11
Transformations on Basic blocks:
Transformations on basic blocks can be applied to a basic block. While transformation, we don’t
need to change the set of expressions computed by the block.
There are two types of basic block transformations. These are as follows:
1. Structure-Preserving Transformations
Structure preserving transformations can be achieved by the following methods:
1. Common sub-expression elimination
2. Dead code elimination
3. Renaming of temporary variables
4. Interchange of two independent adjacent statements

2. Algebraic Transformations
In the case of algebraic transformation, we basically change the set of expressions into an
algebraically equivalent set.
For example, an expression such as
x := x + 0
or x := x * 1
can be eliminated from a basic block without changing the set of expressions it computes.
Flow Graph:
A flow graph is simply a directed graph. For the set of basic blocks, a flow graph shows the flow
of control information. A control flow graph is used to depict how the program control is being
parsed among the blocks. A flow graph is used to illustrate the flow of control between basic
blocks once an intermediate code has been partitioned into basic blocks. When the beginning
instruction of the Y block follows the last instruction of the X block, an edge might flow from
one block X to another block Y.
Let’s make the flow graph of the example that we used for basic block formation:

Flow Graph for above Example

Firstly, we compute the basic blocks (which is already done above). Secondly, we assign the
flow-control information: control enters at B1 and falls through from B1 into B2, and B2 has an
edge back to itself because statement 11 conditionally jumps to statement 5, the leader of B2.
4. Explain in detail about various register allocation and assignment strategies.
Register allocation is an important method in the final phase of the compiler. Registers
are faster to access than cache memory. Registers are available in small size up to few
hundred Kb .Thus it is necessary to use minimum number of registers for variable
allocation. There are three popular Register allocation algorithms.
1. Naive Register Allocation
2. Linear Scan Algorithm
3. Chaitin’s Algorithm
These are explained as following below.
1. Naïve Register Allocation:
 Naive (no) register allocation is based on the assumption that variables are stored in
Main Memory.
 We can’t directly perform operations on variables stored in Main Memory .
 Variables are moved to registers which allows various operations to be carried out
using ALU .
 ALU contains a temporary register where variables are moved before performing
arithmetic and logic operations .
 Once operations are complete we need to store the result back to the main memory in
this method .
 Transferring of variables to and fro from Main Memory reduces the overall speed of
execution .
a=b+c
d=a
c=a+d
Variables stored in Main Memory (as offsets from the frame pointer):

a → _2fp    b → _4fp    c → _6fp    d → _8fp

Machine Level Instructions :


LOAD R1, _4fp
LOAD R2, _6fp
ADD R1, R2
STORE R1, _2fp
LOAD R1, _2fp
STORE R1, _8fp
LOAD R1, _2fp
LOAD R2, _8fp
ADD R1, R2
STORE R1, _6fp
Advantages :
 Easy to understand operations and the flow of variables from Main memory to
Registers and vice versa .
 Only 2 registers are enough to perform any operations .
 Design complexity is less .
Disadvantages:
 Time complexity increases as variables is moved to registers from main memory.
 Too many LOAD and STORE instructions.
 To access a variable second time we need to STORE it to the Main Memory to record
any changes made and LOAD it again.
 This method is not suitable for modern compilers.
2. Linear Scan Algorithm:
 Linear Scan Algorithm is a global register allocation mechanism.
 It is a bottom up approach.
 If n variables are live at any point of time then we require ‘n’ registers.
 In this algorithm the variables are scanned linearly to determine the live ranges of the
variable based on which the registers are allocated.
 The main idea behind this algorithm is that to allocate minimum number of registers
such that these registers can be used again and this totally depends upon the live
range of the variables.
 For this algorithm we need to implement live variable analysis of Code Optimization.
a=b+c
d=e+f
d=d+e
IFZ a goto L0
b=a+d
goto L1
L0 : b = a - d
L1: i = b
Control Flow Graph:

 At any point of time the maximum number of live variables is 4 in this example. Thus
we require 4 registers at maximum for register allocation.
If we draw horizontal line at any point on the above diagram we can see that we require exactly
4 registers to perform the operations in the program.
Splitting:
 Sometimes the required number of registers may not be available. In such case we may
require to move some variables to and from the RAM . This is known as spilling.
 Spilling can be done effectively by moving the variable which is used less number of times
in the program.
Disadvantages:
 Linear Scan Algorithm doesn’t take into account the “lifetime holes” of the variable.
 Variables are not live throughout the program and this algorithm fails to record the holes in
the live range of the variable.
3. Graph Coloring (Chaitin’s Algorithm) :
 Register allocation is interpreted as a graph coloring problem.
 Nodes represent live range of the variable.
 Edges represent the connection between two live ranges.
 Assigning colour to the nodes such that no two adjacent nodes have same colour.
 Number of colours represents the minimum number of registers required.
A k-coloring of the graph is mapped to k registers.
Steps:
1. Choose an arbitrary node of degree less than k.
2. Push that node onto the stack and remove all of its outgoing edges.
3. Check whether the remaining nodes have degree less than k; if YES go to step 4, else go to #.
4. If the degree of any remaining vertex is less than k then push it onto the stack (removing its
edges as in step 2).
5. When there are no more nodes left in the graph and all nodes are on the stack, POP each node and
colour them such that no two adjacent nodes have the same colour.
6. The number of colours assigned to the nodes is the minimum number of registers needed.
# Spill some nodes based on their live ranges and then try again with the same k value. If the
problem persists, it means that the assumed k value can't be the minimum number of registers. Try
increasing the k value by 1 and repeat the whole procedure.
For the same instructions mentioned above the graph coloring will be as follows:
Assuming k=4

After performing the graph coloring, final graph is obtained as follows
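A rough Python sketch of the simplify and select phases outlined above (the interference graph is supplied as hypothetical adjacency sets, and spilling is not handled):

# Hypothetical Chaitin-style colouring: simplify (push nodes of degree < k), then select colours.
def color_graph(adjacency, k):
    graph = {n: set(neigh) for n, neigh in adjacency.items()}
    stack = []
    while graph:
        # Pick any node of degree < k; a real allocator would spill if none exists.
        node = next(n for n, neigh in graph.items() if len(neigh) < k)
        stack.append((node, graph.pop(node)))
        for neigh in graph.values():
            neigh.discard(node)
    colors = {}
    while stack:
        node, neighbours = stack.pop()
        used = {colors[n] for n in neighbours if n in colors}
        colors[node] = next(c for c in range(k) if c not in used)
    return colors

interference = {"a": {"b", "d"}, "b": {"a", "d"}, "d": {"a", "b"}, "e": set()}
print(color_graph(interference, k=3))    # e.g. {'e': 0, 'd': 0, 'b': 1, 'a': 2}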

5. Illustrate how to generate code for a basic block from its DAG representation with suitable
examples.
DAG representation for basic blocks
A DAG for basic block is a directed acyclic graph with the following labels on nodes:
The leaves of the graph are labelled by unique identifiers, which can be variable names or constants.
Interior nodes of the graph are labelled by an operator symbol.
Nodes are also given a sequence of identifiers as labels, to store the computed value.
DAGs are a type of data structure used to implement transformations on basic blocks.
A DAG provides a good way to determine common sub-expressions.
It gives a pictorial representation of how the value computed by a statement is used in subsequent statements.

Algorithm for construction of DAG


Input: It contains a basic block
Output: It contains the following information:

o Each node contains a label. For leaves, the label is an identifier.


o Each node contains a list of attached identifiers to hold the computed values.

Case (i) x:= y OP z


Case (ii) x:= OP y
Case (iii) x:= y
Method:
Step 1:
If y operand is undefined then create node(y).
If z operand is undefined then for case(i) create node(z).
Step 2:
For case(i), create node(OP) whose right child is node(z) and left child is node(y).
For case(ii), check whether there is node(OP) with one child node(y).
For case(iii), node n will be node(y).
Output:
For node(x) delete x from the list of identifiers. Append x to attached identifiers list
for the node n found in step 2. Finally set node(x) to n.
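A minimal Python sketch of this method (the data structures are illustrative): it reuses an existing node when the same operator is applied to the same children, which is how common sub-expressions are detected; removal of stale identifier labels is omitted for brevity:

# Hypothetical DAG construction for statements of the form x := y OP z / x := OP y / x := y.
nodes = {}          # key -> node id, used to detect common sub-expressions
labels = {}         # node id -> list of attached identifiers
current = {}        # identifier -> node id currently holding its value

def node_for(name_or_const):
    if name_or_const in current:            # variable already defined in this block
        return current[name_or_const]
    key = ("leaf", name_or_const)
    if key not in nodes:
        nodes[key] = len(nodes)              # create a new leaf node
    return nodes[key]

def assign(target, op=None, left=None, right=None):
    if op is None:                           # case (iii): x := y
        n = node_for(left)
    else:                                    # cases (i) and (ii)
        key = (op, node_for(left), node_for(right) if right is not None else None)
        if key not in nodes:
            nodes[key] = len(nodes)          # create node(OP) only if it does not exist
        n = nodes[key]
    labels.setdefault(n, []).append(target)  # attach x to the node found or created
    current[target] = n

assign("t1", "*", "4", "i")
assign("t3", "*", "4", "i")                  # reuses the node built for t1
print(current["t1"] == current["t3"])        # True: common sub-expression detected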
Example:
Consider the following three address statement:
1. S1:= 4 * i
2. S2:= a[S1]
3. S3:= 4 * i
4. S4:= b[S3]
5. S5:= s2 * S4
6. S6:= prod + S5
7. Prod:= s6
8. S7:= i+1
9. i := S7
10.if i<= 20 goto (1)

Stages in DAG Construction:


UNIT-V
PART-A
1. Define peephole optimization.
Peephole optimization is a simple and effective technique for locally improving target code.
This technique is applied to improve the performance of the target program by examining the
short sequence of target instructions and replacing these instructions by shorter or faster
sequence.
2. List out the characteristics of peephole optimization.
Some characteristics of peephole optimization are:
 Redundant instruction elimination.
 Elimination of unreachable code.
 Reduction in strength.
 Algebraic simplifications.
 Use of machine idioms.
3. State the criteria or allocation of registers for nested looping statements
Outer Loop Variables: The variables used in the outer loop are typically assigned registers
first, as they have a broader scope and are used across multiple iterations of the inner loop.
These variables are likely to be accessed frequently and may require efficient storage to
minimize memory access latency.
Inner Loop Variables: The variables used exclusively within the inner loop are typically
assigned registers next. Since these variables are limited to the inner loop's scope, they are not
needed outside of it and can be allocated registers separately from the outer loop variables.
Register Reuse: Depending on the number of available registers, some variables from the outer
loop might need to be temporarily spilled to memory to make room for the inner loop variables.
This is a trade-off between register usage and memory access overhead. The compiler might
prioritize the variables that are used most frequently or have the highest performance impact
for register allocation.

Loop Control Variables: Registers are usually allocated for loop control variables such as
loop counters or iterators. These variables are critical for controlling the flow of execution
within the loop and are likely to be accessed frequently. Register allocation ensures fast access
to these variables, minimizing the overhead associated with memory access.
4. Construct a DAG for the following basic block
d=b*c
e=a+b
b=b*c
a=e-d and generate the code using only one register.
Step-1: rename the values so that the common sub-expression becomes visible:

d1 = b0 * c0
e1 = a0 + b0
b1 = b0 * c0
a1 = e1 - d1
In the DAG, d and b are attached to the same * node (b0 * c0 is a common sub-expression), e to the
+ node and a to the - node. Code using only one register (R1):
R1 = b * c // compute b * c once and keep it in the register
d = R1 // assign R1 to d
e = a + b // b still holds its old value at this point
b = R1 // now overwrite b with b * c
a = e - d // calculate a using e and d

5. Define Constant Folding with an example.


This is an optimization technique which eliminates expressions that calculate a value that can be
determined before code execution.
If operands are known at compile time, then the compiler performs the operations statically.
An Example,
int x = (2 + 3) * y can be rewritten as int x = 5 * y

6. What are the basic goals of code movement?


 To reduce the size of the code, i.e., to improve its space complexity.
 To reduce the frequency of execution of code, i.e., to improve its time complexity.

7. What is code motion?
Code motion is an optimization technique in which amount of code in a loop is decreased. This
transformation is applicable to the expression that yields the same result independent of the number of
times the loop is executed. Such an expression is placed before the loop.
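For example (an illustrative C fragment, not taken from the source; the function names and the use of strlen as the loop-invariant computation are assumptions), code motion hoists the invariant call out of the loop:

#include <string.h>

/* Before: strlen(s) is re-evaluated on every iteration although s never changes. */
void count_before(const char *s, int *hist) {
    for (size_t i = 0; i < strlen(s); i++)
        hist[(unsigned char)s[i]]++;
}

/* After code motion: the loop-invariant expression is evaluated once, before the loop. */
void count_after(const char *s, int *hist) {
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
        hist[(unsigned char)s[i]]++;
}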

8. What are the various forms of object code?
Native/Binary Object Code: This is the most common form of object code and consists of
machine code specific to the target architecture or processor. It comprises binary instructions
that can be directly executed by the computer's hardware.
Assembly Object Code: Assembly object code is generated when the source code is
compiled into assembly language instructions specific to the target platform. Assembly code
is a low-level representation that can be further assembled into machine code.
Relocatable Object Code: Relocatable object code is generated when the compiler produces
machine code that can be loaded at different memory locations. It includes information such
as relocation entries, symbol tables, and references to external libraries. Relocatable code
allows flexibility in memory allocation during the linking phase.
Executable Object Code: Executable object code is the final form of object code that can be
directly executed as a standalone program. It combines the necessary machine code, data, and
resources required for the program to run. Executable files often have a specific file format
depending on the operating system.
Dynamic Shared Object Code: Dynamic shared object code, also known as dynamic link
libraries (DLLs) or shared libraries, is a form of object code that contains reusable code and
data. These libraries can be dynamically linked and loaded by multiple programs at runtime,
reducing code duplication and allowing efficient memory usage.
Intermediate Object Code: Intermediate object code refers to the compiled output generated
during the compilation process before the final object code is produced. These intermediate
files contain machine code and related information specific to each source file and are later
combined during the linking phase.

9. What are the criteria for code optimization?
1. The optimization must be correct; it must not, in any way, change the meaning of the program.
2. Optimization should increase the speed and performance of the program.
3. The compilation time must be kept reasonable.
4. The optimization process should not delay the overall compiling process.
10. Define loop unrolling with example.
Loop unrolling is a technique for attempting to minimize the cost of loop overhead, such as
branching on the termination condition and updating counter variables. This occurs by manually adding the
necessary code for the loop to occur multiple times within the loop body and then updating the conditions
and counters accordingly. The potential for performance improvement comes from the reduced loop
overhead, since fewer iterations are required to perform the same work, and also, depending on the code, the
possibility for better instruction pipelining.
Example:
for (i = 0; i < N; i++) {
a[i]=b[i]*c[i];
}
If we let N = 4, then we can substitute this straight-line code for the loop:
a[0] = b[0]*c[0];
a[1] = b[1]*c[1];
a[2] = b[2]*c[2];
a[3] = b[3]*c[3];
This unrolled code has no loop overhead code at all, that is, no iteration variable and no tests. But the
unrolled loop has the same problems as the inlined procedure—it may interfere with the cache and
expands the amount of code required.
We do not, of course, have to fully unroll loops. Rather than unroll the above loop four times, we
could unroll it twice. Unrolling produces this code:
for (i = 0; i < 2; i++) {
a[i*2] = b[i*2]*c[i*2];
a[i*2 + 1] = b[i*2 + 1]*c[i*2 + 1];
}
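The unrolled-by-two loop above is only equivalent to the original when N is even. A common variant, sketched below as an illustration (the function and parameter names are assumptions), keeps a short cleanup loop so the transformation is valid for any N:

void vmul(int N, const int *b, const int *c, int *a) {
    int i;
    for (i = 0; i + 1 < N; i += 2) {    /* unrolled body: two elements per iteration */
        a[i] = b[i] * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
    }
    for (; i < N; i++)                  /* cleanup loop for a possible leftover element */
        a[i] = b[i] * c[i];
}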
11. What do you mean by data flow analysis?
Data flow analysis is a technique used in compiler design to gather information about how data flows
through a program. It involves analysing the dependencies and relationships between variables and expressions
to infer information about their values and behaviour at different points in the program.
The primary goal of data flow analysis is to provide insights into how data is manipulated, used, and propagated
within a program. It helps compilers optimize code, perform static program analysis, detect errors, and make
informed decisions during program transformations.
PART-B
1. Suggest some Code-improving transformations and techniques for implementing these
transformations to achieve Code optimization.

The code produced by the straight forward compiling algorithms can often be made to run faster or take
less space, or both. This improvement is achieved by program transformations that are traditionally called
optimizations. Compilers that apply code-improving transformations are called optimizing compilers.

Optimizations are classified into two categories. They are
• Machine independent optimizations:
• Machine dependent optimizations:

Machine independent optimizations:
• Machine independent optimizations are program transformations that improve the target code without taking
into consideration any properties of the target machine.
Machine dependent optimizations:
• Machine dependent optimizations are based on register allocation and utilization of special machine-
instruction sequences.

The criteria for code improvement transformations:
Simply stated, the best program transformations are those that yield the most benefit for the least effort. The
transformations provided by an optimizing compiler should have several properties. They are:

1. The transformation must preserve the meaning of programs. That is, the optimization must not change
the output produced by a program for a given input, or cause an error such as division by zero, that was
not present in the original source program.
2. A transformation must, on the average, speed up programs by a measurable amount. We are also
interested in reducing the size of the compiled code although the size of the code has less importance
than it once had. Not every transformation succeeds in improving every program, occasionally an
“optimization” may slow down a program slightly.
3. The transformation must be worth the effort. It does not make sense for a compiler writer to expend
the intellectual effort to implement a code improving transformation and have the compiler expend
the additional time compiling source programs if this effort is not repaid when the target programs are
executed. “Peephole” transformations of this kind are simple enough and beneficial enough to be
included in any compiler.

PRINCIPAL SOURCES OF OPTIMISATION
A transformation of a program is called local if it can be performed by looking only at the statements in a
basic block; otherwise, it is called global. Many transformations can be performed at both the local and
global levels. Local transformations are usually performed first.
Function-Preserving Transformations

There are a number of ways in which a compiler can improve a program without changing the function it
computes.
Function preserving transformations examples:
Common sub expression elimination
Copy propagation,
Dead-code elimination
Constant folding

The other transformations come up primarily when global optimizations are performed.

Frequently, a program will include several calculations of the offset in an array. Some of the duplicate
calculations cannot be avoided by the programmer because they lie below the level of detail accessible
within the source language.


Common Sub expressions elimination:

• An occurrence of an expression E is called a common sub-expression if E was previously computed, and
the values of variables in E have not changed since the previous computation. We can avoid recomputing
the expression if we can use the previously computed value.

• For example
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t4: = 4*i
t5: = n
t6: = b [t4] +t5
The above code can be optimized using the common sub-expression elimination as
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t5: = n
t6: = b [t1] +t5
The common sub expression t4 := 4*i is eliminated as its computation is already available in t1 and the value of i has
not been changed from its definition to this use.
Copy Propagation:
Assignments of the form f: = g called copy statements, or copies for short. The idea behind the copy-
propagation transformation is to use g for f, whenever possible after the copy statement f: = g. Copy
propagation means use of one variable instead of another. This may not appear to be an improvement, but
as we shall see it gives us an opportunity to eliminate x.
• For example:
x=Pi;
A=x*r*r;
The optimization using copy propagation can be done as follows: A=Pi*r*r; Here the variable x is
eliminated
Dead-Code Eliminations:
A variable is live at a point in a program if its value can be used subsequently; otherwise, it is dead at that
point. A related idea is dead or useless code, statements that compute values that never get used. While the
programmer is unlikely to introduce any dead code intentionally, it may appear as the result of previous
transformations.

Example:

i = 0;
if (i == 1)
{
a = b + 5;
}
Here, the ‘if’ statement is dead code because this condition will never be satisfied.
Constant folding:
Deducing at compile time that the value of an expression is a constant and using the constant instead is
known as constant folding. One advantage of copy propagation is that it often turns the copy statement into
dead code.
For example,
a=3.14157/2 can be replaced by
a=1.570 thereby eliminating a division operation.

Loop Optimizations:

In loops, especially in the inner loops, programs tend to spend the bulk of their time. The running time of a
program may be improved if the number of instructions in an inner loop is decreased, even if we increase
the amount of code outside that loop.

Three techniques are important for loop optimization:
 Code motion, which moves code outside a loop;
 Induction-variable elimination, which we apply to eliminate extra induction variables from inner loops;
 Reduction in strength, which replaces an expensive operation by a cheaper one, such as a
multiplication by an addition.
Code Motion:
An important modification that decreases the amount of code in a loop is code motion. This transformation
takes an expression that yields the same result independent of the number of times a loop is executed (a
loop-invariant computation) and places the expression before the loop. Note that the notion “before the
loop” assumes the existence of an entry for the loop. For example, evaluation of limit-2 is a loop-invariant
computation in the following while-statement:
while (i <= limit-2) /* statement does not change limit*/
Code motion will result in the equivalent of
t= limit-2;
while (i<=t) /* statement does not change limit or t */

Induction Variables:
Loops are usually processed inside out. For example consider the loop around B3. Note that the values of
j and t4 remain in lock-step; every time the value of j decreases by 1, that of t4 decreases by 4 because 4*j
is assigned to t4. Such identifiers are called induction variables.

When there are two or more induction variables in a loop, it may be possible to get rid of all but one, by the
process of induction-variable elimination. For the inner loop around B3 in Fig.5.3 we cannot get rid of either
j or t4 completely; t4 is used in B3 and j in B4.

2. Describe in detail about how data flow equations are used in code optimization.
It is the analysis of flow of data in control flow graph, i.e., the analysis that determines the information
regarding the definition and use of data in program. With the help of this analysis, optimization can be done.
In general, its process in which values are computed using data flow analysis. The data flow property
represents information that can be used for optimization.
Data flow analysis is a technique used in compiler design to analyse how data flows through a program. It
involves tracking the values of variables and expressions as they are computed and used throughout the
program, with the goal of identifying opportunities for optimization and identifying potential errors.
The basic idea behind data flow analysis is to model the program as a graph, where the nodes represent
program statements and the edges represent data flow dependencies between the statements. The data flow
information is then propagated through the graph, using a set of rules and equations to compute the values of
variables and expressions at each point in the program.
Some of the common types of data flow analysis performed by compilers include:
Reaching Definitions Analysis: This analysis tracks the definition of a variable or expression and determines
the points in the program where the definition “reaches” a particular use of the variable or expression. This
information can be used to identify variables that can be safely optimized or eliminated.
Live Variable Analysis: This analysis determines the points in the program where a variable or expression
is “live”, meaning that its value is still needed for some future computation. This information can be used to
identify variables that can be safely removed or optimized.
Available Expressions Analysis: This analysis determines the points in the program where a particular
expression is “available”, meaning that its value has already been computed and can be reused. This
information can be used to identify opportunities for common subexpression elimination and other
optimization techniques.
Constant Propagation Analysis: This analysis tracks the values of constants and determines the points in
the program where a particular constant value is used. This information can be used to identify opportunities
for constant folding and other optimization techniques.
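As a concrete illustration, the small hypothetical C function below is annotated with what each of the above analyses would typically report (the function and variable names are assumptions, not from the source):

int f(int a, int b) {
    int x = a + b;     /* reaching definitions: this definition of x reaches the return */
    int y = a + b;     /* available expressions: a + b is already available, so reuse x */
    int c = 4 * 2;     /* constant propagation / folding: c is the constant 8 */
    int dead = y - c;  /* live variable analysis: dead is never used afterwards, so this is dead code */
    return x + c;      /* after optimization this is effectively return x + 8 */
}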
Data flow analysis can have a number of advantages in compiler design, including:
Improved code quality: By identifying opportunities for optimization and eliminating potential errors, data
flow analysis can help improve the quality and efficiency of the compiled code.
Better error detection: By tracking the flow of data through the program, data flow analysis can help
identify potential errors and bugs that might otherwise go unnoticed.
Increased understanding of program behaviour: By modelling the program as a graph and tracking the
flow of data, data flow analysis can help programmers better understand how the program works and how it
can be improved.
Basic Terminologies
Definition Point: a point in a program containing some definition.
Reference Point: a point in a program containing a reference to a data item.
Evaluation Point: a point in a program containing evaluation of expression.
Data Flow Properties –
Available Expression – An expression is said to be available at a program point x if it reaches x along every path
and none of its operands is redefined before reaching x. An expression is available at its evaluation point.
An expression a+b is said to be available if none of its operands gets modified between its evaluation and its use. Example –

 Advantage –
It is used to eliminate common sub expressions.
Reaching Definition – A definition D reaches a point x if there is a path from D to x along which
D is not killed, i.e., not redefined.
Example –

 Advantage –
It is used in constant and variable propagation.

Live variable – A variable is said to be live at a point p if, on some path from p onwards, its value is used
before it is redefined; otherwise, it is dead at p.
Example –
Advantage –
 It is useful for register allocation.
 It is used in dead code elimination.
Busy Expression – An expression is busy along a path if it is evaluated along that path and none of its
operands is defined before that evaluation along the path.
Advantage –
It is used for performing code movement optimization.
Features:
Identifying dependencies: Data flow analysis can identify dependencies between different parts of a
program, such as variables that are read or modified by multiple statements.
Detecting dead code: By tracking how variables are used, data flow analysis can detect code that is never
executed, such as statements that assign values to variables that are never used.
Optimizing code: Data flow analysis can be used to optimize code by identifying opportunities for common
subexpression elimination, constant folding, and other optimization techniques.
Detecting errors: Data flow analysis can detect errors in a program, such as uninitialized variables, by
tracking how variables are used throughout the program.
Handling complex control flow: Data flow analysis can handle complex control flow structures, such as
loops and conditionals, by tracking how data is used within those structures.
Interprocedural analysis: Data flow analysis can be performed across multiple functions in a program,
allowing it to analyse how data flows between different parts of the program.
Scalability: Data flow analysis can be scaled to large programs, allowing it to analyse programs with many
thousands or even millions of lines of code.
3. Elaborate how Peephole Optimization method is used in optimizing the target code generated.

Peephole optimization is a local optimization technique that compilers use to optimize the generated
code. It is called local optimization because it works by evaluating a small section of the generated
code, generally a few instructions, and optimizing them based on some predefined rules. The
evaluated section of code is known as a peephole or window, therefore, it is referred to as peephole
optimization.

Objectives of Peephole Optimization in Compiler Design

The following are the objectives of peephole optimization in compiler design:

 Increasing code speed: Peephole optimization seeks to improve the execution speed of generated
code by removing redundant instructions or unnecessary instructions.

 Reduced code size: Peephole optimization seeks to reduce generated code size by replacing the long
sequence of instructions with shorter ones.

 Getting rid of dead code: Peephole optimization seeks to get rid of dead code, such as unreachable
code, redundant assignments, or constant expressions that have no effect on the output of the
program.

 Simplifying code: Peephole optimization also seeks to make generated code more understandable
and manageable by removing unnecessary complexities.

Working of Peephole Optimization in Compiler design

The working of peephole optimization can be summarized in the following steps:

Step 1 – Identify the peephole: In the first step, the compiler finds the small sections of the
generated code that needs optimization.

Step 2 – Apply the optimization rule: After identification, in the second step, the compiler applies a
predefined set of optimization rules to the instructions in the peephole.

Step 3 – Evaluate the result: After applying optimization rules, the compiler evaluates the
optimized code to check whether the changes make the code better than the original in terms of
speed, size, or memory usage.

Step 4 – Repeat: The process is repeated by finding new peepholes and applying the optimization
rules until no more opportunities to optimize exists.

Peephole Optimization Techniques

Here are some of the commonly used peephole optimization techniques:

Constant Folding

Constant folding is a peephole optimization technique that involves evaluating constant expressions
at compile-time instead of run-time. This optimization technique can significantly improve the
performance of a program by reducing the number of computations performed at run-time.

Here is an example of constant folding:

Initial Code:
int x = 10 + 5;

int y = x * 2;
Optimized Code:
int x = 15;

int y = x * 2;

Explanation: In this code, the expression 10 + 5 is a constant expression, which means that its value
can be computed at compile-time. Instead of computing the value of the expression at run-time, the
compiler can replace the expression with its computed value, which is 15.

Strength Reduction:
Strength reduction is a peephole optimization technique that aims to replace computationally
expensive operations with cheaper ones, thereby improving the performance of a program.

Here is an example of strength reduction:

Initial Code:

int x = y / 4;

Optimized Code:

int x = y >> 2;

Explanation: In this code, the expression y / 4 involves a division operation, which is
computationally expensive. We can replace it with a right-shift operation, as bit-wise operations
are generally faster. Note that y / 4 and y >> 2 agree only when y is non-negative (or unsigned); for
negative signed values, C division rounds toward zero while the shift rounds toward negative infinity,
so a compiler applies this rewrite only when it can establish the equivalence.

Redundant Load and Store Elimination

Redundant load and store elimination is a peephole optimization approach that seeks to reduce
redundant memory accesses in a program. This optimization works by finding code that performs the
same memory access many times and removes the redundant accesses.

Initial Code:

int x = 5;
int y = x + 10;
int z = x + 20;

Optimized Code:

int x = 5;
int y = x + 10;
int z = y + 10; // optimized line

Explanation: In this code, the variable x is loaded from memory twice: once in the second line and
once in the third line. However, since the value of x does not change between the two accesses, the
second access is redundant. In the optimized code, the redundant load of x is eliminated by replacing
the second access with the value of y, which is computed using the value of x in the second line.

Null Sequences Elimination
Null sequences Elimination is a peephole optimization technique used in compiler design to remove
unnecessary instructions from a program. The optimization involves identifying and removing
sequences of instructions that have no effect on the final output of a program.
Here is an example of null sequences elimination:

Initial Code:

int x = 5;
int y = 10;
int z = x + y;
x = 5; // redundant instruction

Optimized Code:

int x = 5;
int y = 10;
int z = x + y;

Explanation: In this code, the value of x is assigned twice: once in the first line and once in the
fourth line. However, since the second assignment has no effect on the final output of the program, it
is a null sequence and can be eliminated.

Conclusion
Peephole optimization in compiler design helps in improving the performance of programs by
eliminating redundant code and optimizing code sequences. These techniques involve analysing
small sequences of instructions and making targeted optimizations that can significantly improve the
performance of a program.

4. Explain in detail about how data flow analysis of structured programs is performed.
Data-flow analysis is a technique for gathering information about the possible set of values calculated at
various points in a computer program. A program's control-flow graph (CFG) is used to determine those parts
of a program to which a particular value assigned to a variable might propagate.
Data flow analysis of structured programs involves analysing the flow of data within a program that adheres
to a structured programming paradigm. Structured programs follow control flow structures like sequences,
conditionals, and loops, which can be analysed to understand the behaviour of variables and expressions
throughout the program. The data flow analysis techniques used in structured programs are often based on
control flow graphs (CFGs) and the analysis of variables and their values at different program points.
Here are some key aspects of data flow analysis in structured programs:
Control Flow Graph (CFG): A control flow graph is constructed to represent the control flow structure of
the structured program. It consists of nodes representing basic blocks of code and edges representing the flow
of control between these blocks. Each block typically corresponds to a sequence of statements without any
branches or loops.
Reaching Definitions: Reaching definitions analysis determines the set of definitions that can potentially
reach a particular program point. It identifies the definitions of variables that are valid at different points in the
program. This analysis helps understand variable dependencies and how values are propagated through the
program.
Use-Def and Def-Use Chains: Use-Def and Def-Use chains represent the relationships between variable uses
and their corresponding definitions. A use-def chain links a variable use to its defining point, while a def-use
chain links a definition point to its uses. These chains help in understanding how values flow from definitions
to uses.
Available Expressions: Available expressions analysis identifies expressions whose values are available at
different program points. It determines which expressions have been computed and can be reused without
recomputation. This analysis helps in eliminating redundant computations and improving performance.
Liveness Analysis: Liveness analysis determines the set of variables that are live (potentially used) at different
program points. It helps in understanding variable lifetimes and is crucial for register allocation, memory
management, and other optimization techniques.
Constant Propagation: Constant propagation analysis tracks the propagation of constant values through the
program. It identifies variables that can be replaced with their constant values, which eliminates unnecessary
computations and improves performance.
Dead Code Elimination: Dead code elimination identifies and eliminates code that is guaranteed to have no
effect on the program's behaviour or final output. This analysis helps in removing unused or redundant
statements, improving code efficiency.
These data flow analysis techniques are often performed iteratively, with information being propagated
through the control flow graph until a fixed point is reached. This iterative process ensures that all relevant
data flow information is obtained.
By performing data flow analysis in structured programs, compilers and program analyzers gain insights into
variable dependencies, constant values, live variables, and other properties that can be used for optimization,
error detection, and program understanding. These analyses contribute to improving code quality,
performance, and reliability. The main tools used for such analysis, the data flow analysis equation and the data flow properties, are described below.

Data Flow Analysis Equation
The data flow analysis equation is used to collect information about a program block. The following
is the data flow analysis equation for a statement s-
Out[s] = gen[s] U (In[s] - Kill[s])
where
Out[s] is the information at the end of the statement s.
gen[s] is the information generated by the statement s.
In[s] is the information at the beginning of the statement s.
Kill[s] is the information killed or removed by the statement s.
The main aim of the data flow analysis is to find a set of constraints on the In[s]’s and Out[s]’s for
the statement s. The constraints include two types of constraints- The transfer function and the
Control Flow constraint.
Let’s discuss them.
Transfer Function
The semantics of the statement are the constraints for the data flow values before and after a
statement.
For example, consider two statements x = y and z = x. Both these statements are executed.
Thus, after execution, we can say that both x and z have the same value, i.e. y.
Thus, a transfer function depicts the relationship between the data flow values before and after a
statement.
There are two types of transfer functions-
1. Forward propagation
2. Backward propagation
Let’s see both of these.
 Forward propagation
o In forward propagation, the transfer function is represented by Fs for any statement s.
o This transfer function accepts the data flow values before the statement and outputs the
new data flow value after the statement.
o Thus the new data flow or the output after the statement will be Out[s] = Fs(In[s]).
 Backward propagation
o The backward propagation is the converse of the forward propagation.
o After the statement, a data flow value is converted to a new data flow value before the
statement using this transfer function.
o Thus the new data flow or the output will be In[s] = Fs(Out[s]).
Control-Flow Constraints
The second set of constraints comes from the control flow. If a block B contains the statements S1, S2, ..., Sn,
then the data flow value out of Si is equal to the data flow value into Si+1. That is:
IN[Si+1] = OUT[Si], for all i = 1, 2, ..., n – 1.
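A minimal iterative solver for these forward equations is sketched below in C (the bit-vector encoding, the particular four-block flow graph and its gen/kill sets are illustrative assumptions). Each bit stands for one definition; the solver re-applies IN[B] = union of OUT[P] over predecessors P and OUT[B] = gen[B] U (IN[B] - kill[B]) until nothing changes.

#include <stdio.h>

#define NBLOCKS 4

/* One bit per definition; the gen/kill sets are assumed to be precomputed. */
static unsigned gen_set[NBLOCKS] = { 0x3, 0x4, 0x8, 0x0 };
static unsigned kill_set[NBLOCKS] = { 0x4, 0x1, 0x2, 0x0 };

/* pred[b] lists the predecessors of block b, terminated by -1. */
static int pred[NBLOCKS][NBLOCKS] = {
    { -1 },          /* B0: entry block */
    { 0, 3, -1 },    /* B1: reached from B0 and from the back edge out of B3 */
    { 1, -1 },       /* B2: reached from B1 */
    { 2, -1 },       /* B3: reached from B2, branches back to B1 */
};

int main(void) {
    unsigned in[NBLOCKS] = { 0 }, out[NBLOCKS] = { 0 };
    int changed = 1;
    while (changed) {                                 /* iterate to a fixed point */
        changed = 0;
        for (int b = 0; b < NBLOCKS; b++) {
            unsigned newin = 0;
            for (int i = 0; pred[b][i] != -1; i++)    /* IN[b] = union of OUT over predecessors */
                newin |= out[pred[b][i]];
            unsigned newout = gen_set[b] | (newin & ~kill_set[b]);
            if (newin != in[b] || newout != out[b])
                changed = 1;
            in[b] = newin;
            out[b] = newout;
        }
    }
    for (int b = 0; b < NBLOCKS; b++)
        printf("B%d: IN = %#x, OUT = %#x\n", b, in[b], out[b]);
    return 0;
}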
Data Flow Properties
Some properties of the data flow analysis are-
 Available expression
 Reaching definition
 Live variable
 Busy expression
We will discuss these properties one by one.
Available Expression
An expression a + b is said to be available at a program point x if none of its operands gets modified
before their use. It is used to eliminate common subexpressions.
An expression is available at its evaluation point.
Example:
In the above example, the expression L1: 4 * i is an available expression since this expression is
available for blocks B2 and B3, and no operand is getting modified.
Reaching Definition
A definition D reaches a point x if there is a path from D to x along which D is not killed, i.e., redefined. It is generally
used in variable/constant propagation.
Example:

In the above example, D1 is a reaching definition for block B2 since the value of x is not changed (it
is two only) but D1 is not a reaching definition for block B3 because the value of x is changed to x +
2. This means D1 is killed or redefined by D2.
Live Variable
A variable x is said to be live at a point p if its value is used along some path starting at p before it is killed
or redefined; if it is killed or redefined before any such use, it is said to be dead.
It is generally used in register allocation and dead code elimination.
Example:

In the above example, the variable a is live at blocks B1,B2, B3 and B4 but is killed at block B5
since its value is changed from 2 to b + c. Similarly, variable b is live at block B3 but is killed at
block B4.
Busy Expression
An expression is said to be busy along a path if its evaluation occurs along that path, but none of its
operand definitions appears before it.
It is used for performing code movement optimization.

5. Elaborate the Concept of Data flow analysis with suitable algorithms and sample intermediate code.
CODE IMPROVING TRANSFORMATIONS

Algorithms for performing the code improving transformations rely on data-flow
information. Here we consider common sub-expression elimination, copy propagation and
transformations for moving loop-invariant computations out of loops and for eliminating
induction variables.
 Global transformations are not a substitute for local transformations; both must be performed.
Elimination of global common sub expressions:
 The available expressions data-flow problem discussed in the last section allows us to
determine if an expression at point p in a flow graph is a common sub-expression. The
following algorithm formalizes the intuitive ideas presented for eliminating common
subexpressions.
 ALGORITHM: Global common sub expression elimination.
INPUT: A flow graph with available expression information.
OUTPUT: A revised flow graph.
METHOD: For every statement s of the form x := y+z such that y+z is available at the
beginning of s's block and neither y nor z is defined prior to statement s in that block,
do the following.
 To discover the evaluations of y+z that reach s’s block, we follow flow graph
edges, searching backward from s’s block. However, we do not go through
any block that evaluates y+z. The last evaluation of y+z in each block
encountered is an evaluation of y+z that reaches s.
 Create new variable u.
 Replace each statement w: =y+z found in (1) by
u:=y+z
w:=u
 Replace statement s by x:=u.
 Some remarks about this algorithm are in order.
 The search in step(1) of the algorithm for the evaluations of y+z that reach statement s
can also be formulated as a data-flow analysis problem. However, it does not make sense to
solve it for all expressions y+z and all statements or blocks because too much
irrelevant information is gathered.
 Not all changes made by algorithm are improvements. We might wish to limit the
number of different evaluations reaching s found in step (1), probably to one.
 Algorithm will miss the fact that a*z and c*z must have the same value in
a := x+y          c := x+y
b := a*z    vs.   d := c*z
 Because this simple approach to common sub expressions considers only the literal
expressions themselves, rather than the values computed by expressions.
 Copy propagation:
Various algorithms introduce copy statements such as x := y. Copies may also be generated
directly by the intermediate code generator, although most of these involve temporaries
local to one block and can be removed by the DAG construction. Given a copy statement s: x := y, we may
substitute y for x wherever x is used, provided the following conditions are met for every such use u of x:
1. Statement s must be the only definition of x reaching u.
2. On every path from s to u, including paths that go through u several times, there are no
assignments to y.
 Condition (1) can be checked using ud-chaining information. We shall set up a new data-flow
analysis problem in which in[B] is the set of copies s: x:=y such that every path
from initial node to the beginning of B contains the statement s, and subsequent to the
last occurrence of s, there are no assignments to y.
ALGORITHM: Copy propagation.
INPUT: a flow graph G, with ud-chains giving the definitions reaching block B, and with c_in[B]
representing the solution to equations that is the set of copies x:=y that reach block B along every path,
with no assignment to x or y following the last occurrence of x:=y on the path. We also need ud-chains
giving the uses of each definition.
OUTPUT: A revised flow graph.
METHOD: For each copy s : x:=y do the following:
 Determine those uses of x that are reached by this definition of x, namely, s: x := y.
 Determine whether, for every use of x found in (1), s is in c_in[B], where B is the
block of this particular use, and moreover, no definitions of x or y occur prior to this
use of x within B. Recall that if s is in c_in[B] then s is the only definition of x that
reaches B.
 If s meets the conditions of (2), then remove s and replace all uses of x found in (1)
by y.
 Detection of loop-invariant computations:
 Ud-chains can be used to detect those computations in a loop that are loop-invariant, that
 is, whose value does not change as long as control stays within the loop. Loop is a region
 consisting of set of blocks with a header that dominates all the other blocks, so the only
 way to enter the loop is through the header.
 If an assignment x := y+z is at a position in the loop where all possible definitions of y
 and z are outside the loop, then y+z is loop-invariant because its value will be the same
each time x := y+z is encountered. Having recognized that the value of x will not change, consider
v := x+w, where w could only have been defined outside the loop; then x+w is also loop-invariant.
ALGORITHM: Detection of loop-invariant computations.
INPUT: A loop L consisting of a set of basic blocks, each block containing a sequence
of three-address statements. We assume ud-chains are available for the individual
statements.
OUTPUT: The set of three-address statements that compute the same value each time
executed, from the time control enters the loop L until control next leaves L.
METHOD: We shall give a rather informal specification of the algorithm, trusting
that the principles will be clear.
1. Mark “invariant” those statements whose operands are all either constant or have
all their reaching definitions outside L.
2. Repeat step (3) until at some repetition no new statements are marked “invariant”.
3. Mark “invariant” all those statements not previously so marked all of whose
operands either are constant, have all their reaching definitions outside L, or have
exactly one reaching definition, and that definition is a statement in L marked
invariant.
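As a tiny illustration of this marking process (a hypothetical loop written as C-level three-address code; the names are assumptions): on the first pass t1 is marked invariant because all reaching definitions of a and b are outside L, and on a later pass t2 is also marked because its only reaching definition of t1 is itself an invariant statement.

int a, b, n, out[100];

void loop_L(void) {
    int i = 0, t1, t2;
    while (i < n) {
        t1 = a + b;     /* step 1: a and b are defined only outside L, so mark invariant */
        t2 = t1 * 4;    /* step 3: the only reaching definition of t1 is marked invariant */
        out[i] = t2;    /* depends on i, which changes inside L: not invariant */
        i = i + 1;
    }
}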
Performing code motion:
Having found the invariant statements within a loop, we can apply to some of them an
optimization known as code motion, in which the statements are moved to the pre-header of
the loop. The following three conditions ensure that code motion does not change what
the program computes. Consider s: x := y+z.
1. The block containing s dominates all exit nodes of the loop, where an exit of the loop is a
node with a successor not in the loop.
2. There is no other statement in the loop that assigns to x. Again, if x is a temporary
assigned only once, this condition is surely satisfied and need not be changed.
3. No use of x in the loop is reached by any definition of x other than s. This condition too
will be satisfied, normally, if x is temporary.
ALGORITHM: Code motion.
INPUT: A loop L with ud-chaining information and dominator information.
OUTPUT: A revised version of the loop with a pre-header and some statements moved to the pre-
header.
METHOD:
1. Use loop-invariant computation algorithm to find loop-invariant statements.
2. For each statement s defining x found in step(1), check:
i) That it is in a block that dominates all exits of L,
ii) That x is not defined elsewhere in L, and
iii) That all uses in L of x can only be reached by the definition of x in statement s.
3. Move, in the order found by loop-invariant algorithm, each statement s found in (1) and meeting
conditions (2i), (2ii), (2iii) , to a newly created pre-header, provided any operands of s that are defined
in loop L have previously had their definition statements moved to the pre-header.
 To understand why no change to what the program computes can occur, condition (2i)
and (2ii) of this algorithm assure that the value of x computed at s must be the value of x
after any exit block of L. When we move s to a pre-header, s will still be the definition of
x that reaches the end of any exit block of L. Condition (2iii) assures that any uses of x
within L did, and will continue to, use the value of x computed by s.
Alternative code motion strategies:
 The condition (1) can be relaxed if we are willing to take the risk that we may actually
increase the running time of the program a bit; of course, we never change what the
program computes. The relaxed version of code motion condition (1) is that we may
move a statement s assigning x only if:
1. The block containing s either dominates all exits of the loop, or x is not used outside
the loop. For example, if x is a temporary variable, we can be sure that the value will
be used only in its own block.
 If code motion algorithm is modified to use condition (1’), occasionally the running time
will increase, but we can expect to do reasonably well on the average. The modified algorithm may
move to pre-header certain computations that may not be executed in the loop. Not only does this risk
slowing down the program significantly, it may also cause an error in certain circumstances.
 Even if none of the conditions of (2i), (2ii), (2iii) of code motion algorithm are met by an
assignment x: =y+z, we can still take the computation y+z outside a loop. Create a new
temporary t, and set t: =y+z in the pre-header. Then replace x: =y+z by x: =t in the loop.
In many cases we can propagate out the copy statement x: = t.
Maintaining data-flow information after code motion:
 The transformations of code motion algorithm do not change ud-chaining information,
since by condition (2i), (2ii), and (2iii), all uses of the variable assigned by a moved
statement s that were reached by s are still reached by s from its new position.
 Definitions of variables used by s are either outside L, in which case they reach the pre-header,
or they are inside L, in which case by step (3) they were moved to pre-header
ahead of s.
 If the ud-chains are represented by lists of pointers to pointers to statements, we can
maintain ud-chains when we move statement s by simply changing the pointer to s when
we move it. That is, we create for each statement s pointer ps, which always points to s.
 We put the pointer on each ud-chain containing s. Then, no matter where we move s, we
have only to change ps , regardless of how many ud-chains s is on.
 The dominator information is changed slightly by code motion. The pre-header is now
the immediate dominator of the header, and the immediate dominator of the pre-header is
the node that formerly was the immediate dominator of the header. That is, the pre-header
is inserted into the dominator tree as the parent of the header.
Elimination of induction variable:
 A variable x is called an induction variable of a loop L if every time the variable x
changes values, it is incremented or decremented by some constant. Often, an induction
variable is incremented by the same constant each time around the loop, as in a loop
headed by for i := 1 to 10.
 However, our methods deal with variables that are incremented or decremented zero, one,
two, or more times as we go around a loop. The number of changes to an induction
variable may even differ at different iterations.
 A common situation is one in which an induction variable, say i, indexes an array, and
some other induction variable, say t, whose value is a linear function of i, is the actual
offset used to access the array. Often, the only use made of i is in the test for loop
termination. We can then get rid of i by replacing its test by one on t.
 We shall look for basic induction variables, which are those variables i whose only
assignments within loop L are of the form i := i+c or i-c, where c is a constant.
ALGORITHM: Elimination of induction variables.
INPUT: A loop L with reaching definition information, loop-invariant computation information and
live variable information.
OUTPUT: A revised loop.
METHOD:
 Consider each basic induction variable i whose only uses are to compute other induction
variables in its family and in conditional branches. Take some j in i’s family, preferably one
such that c and d in its triple are as simple as possible and modify each test that i appears in to
use j instead. We assume in the following that c is positive. A test of the form ‘if i relop x goto
B’, where x is not an induction variable, is replaced by
r := c*x /* r := x if c is 1. */
r := r+d /* omit if d is 0 */
if j relop r goto B
where, r is a new temporary. The case ‘if x relop i goto B’ is handled analogously. If there are two
induction variables i1 and i2 in the test if i1 relop i2 goto B, then we check if both i1 and i2 can be
replaced. The easy case is when we have j1 with triple (i1, c1, d1) and j2 with triple (i2, c2, d2), and c1 = c2
and d1 = d2. Then, i1 relop i2 is equivalent to j1 relop j2.
 Now, consider each induction variable j for which a statement j: =s was introduced. First check
that there can be no assignment to s between the introduced statement j :=s and any use of j. In
the usual situation, j is used in the block in which it is defined, simplifying this check;
otherwise, reaching definitions information, plus some graph analysis is needed to implement
the check. Then replace all uses of j by uses of s and delete statement j: =s.
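A before/after sketch of this idea on the familiar 4*j pattern, written as C-level three-address code (the byte array, the bounds and the temporary t4 are illustrative assumptions):

char buf4[4 * 21];

/* Before: j is a basic induction variable; t4 = 4*j is in its family and is
   recomputed with a multiplication on every iteration. */
void before_iv(void) {
    int j = 20, t4;
    do {
        j = j - 1;
        t4 = 4 * j;
        buf4[t4] = 0;     /* t4 is used as a byte offset */
    } while (j > 0);      /* the loop test is on the basic variable j */
}

/* After: t4 runs in lock-step with j (strength reduction removes the multiply)
   and the test on j is rewritten as a test on t4, so j itself can be eliminated. */
void after_iv(void) {
    int t4 = 4 * 20;      /* initialised once, in the pre-header */
    do {
        t4 = t4 - 4;
        buf4[t4] = 0;
    } while (t4 > 0);
}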