PCC All Units QuestionBank
1. Compiler: scans the whole program in one go. Interpreter: translates the program one statement at a time.
3. Compiler: its main advantage is fast execution time. Interpreter: because it is slow in executing the object code, it is preferred less.
4. Compiler: converts the source code into object code. Interpreter: does not convert source code into object code; instead it scans it line by line.
5. Compiler: does not require the source code for later execution. Interpreter: requires the source code for later execution.
6. Compiler: execution of the program takes place only after the whole program is compiled. Interpreter: execution happens after every line is checked or evaluated.
7. Compiler: the machine code is stored in disk storage. Interpreter: machine code is nowhere stored.
8. Compiler: more often takes a large amount of time for analyzing the source code. Interpreter: in comparison, takes less time for analyzing the source code.
12. Compiler: takes the entire program as input. Interpreter: takes a single instruction as input.
13. Compiler: object code is permanently saved for future use. Interpreter: no object code is saved for future use.
E.g. C, C++, C#, etc. are programming languages that are compiler-based; Python, Ruby, Perl, SNOBOL, MATLAB, etc. are programming languages that are interpreter-based.
Line buffering: In this scheme, the input buffer stores one line of input at a time, and the lexer analyzes the line to
extract lexemes. Line buffering is simple and efficient, but it can cause problems when a lexeme spans across
multiple lines or when there are nested comments or string literals that contain newlines.
Block buffering: In this scheme, the input buffer stores a fixed-size block of input, which is larger than a line, and
the lexer analyzes the block to extract lexemes. Block buffering is more complex than line buffering, but it can
handle lexemes that span across multiple lines and can improve the efficiency of the lexer by reducing the number
of input reads.
11. State the uses of the following built-in variables used in LEX programming.
i) yyin  ii) yyout  iii) yytext  iv) yyleng  v) yylineno  vi) yylval
yyin: This variable is a file pointer that represents the input stream to be scanned by the Lex program. By default,
yyin is set to stdin, but it can be changed to any file or input stream that the program needs to read from.
yyout: This variable is a file pointer that represents the output stream for the Lex program. By default, yyout is set
to stdout, but it can be changed to any file or output stream that the program needs to write to.
yytext: This variable is a character array that contains the text of the current token matched by the Lex program's
regular expressions. The length of the token is stored in the yyleng variable.
yyleng: This variable stores the length of the current token matched by the Lex program's regular expressions. It is
typically used to extract the matched token from the yytext variable.
yylineno: This variable stores the current line number of the input stream being scanned by the Lex program. It is
typically used for error reporting and debugging purposes.
yylval: This variable is used to store the semantic value of the current token matched by the Lex program's regular
expressions. The programmer can define the type and contents of the yylval variable to suit their specific needs.
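The variables above can be seen together in the following small Lex (flex-style) sketch; it is only an illustration, and the input file name input.txt is an assumption, not part of the question:
%{
#include <stdio.h>
%}
%option yylineno
%%
[a-zA-Z_][a-zA-Z0-9_]*   { fprintf(yyout, "line %d: identifier '%s' (length %d)\n",
                                    yylineno, yytext, yyleng); }
.|\n                     { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }
int main(void)
{
    yyin = fopen("input.txt", "r");    /* assumed input file */
    if (yyin == NULL)
        yyin = stdin;                  /* fall back to standard input */
    yyout = stdout;                    /* write results to standard output */
    yylex();
    return 0;
}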
12. State the uses of the following built-in functions used in LEX programming.
i) yylex( )  ii) yywrap( )  iii) yyless(int n)  iv) yymore( )  v) yyerror( )
yylex( ): This function is the core of a Lex program and is responsible for scanning the input stream, matching the
regular expressions defined in the program, and returning the corresponding tokens to the calling program. The
yylex() function is called repeatedly by the calling program until it returns an end-of-file token or signals an error.
yywrap( ): This function is used to indicate to the calling program that the end of the input stream has been reached.
When yylex() encounters the end of the input stream, it calls yywrap() to determine whether to continue scanning
the input or stop. If yywrap() returns a non-zero value, yylex() returns 0 to the calling program, indicating that the
scanning is complete. Otherwise, yywrap() returns 0, and yylex() resumes scanning the input stream.
yyless(int n): This function returns all but the first n characters of the current token back to the input stream so that they are rescanned; yytext is truncated to its first n characters and yyleng is adjusted accordingly. It is useful when a rule has matched more input than the token actually needs.
yymore( ): This function is used to indicate that the next match should be appended to the current token, rather than
starting a new token. This is useful when a single logical token spans multiple lines or is interrupted by whitespace
or other characters that are not part of the token.
yyerror( ): This function is used to report errors detected during scanning and parsing. It is conventionally called by the parser (for example, a Yacc/Bison-generated yyparse()) when a syntax error is detected, and the programmer can customize its behavior to suit their specific needs, such as printing an error message to the console or logging the error to a file.
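A small flex-style sketch showing yymore() and yyless() in action; the patterns are purely illustrative:
%{
#include <stdio.h>
%}
%%
hyper       { yymore(); }                       /* append the next match onto this one        */
text        { printf("token: %s\n", yytext); }  /* prints "hypertext" when it follows "hyper" */
"<<<"       { yyless(1);                        /* push "<<" back; keep only the first "<"    */
              printf("saw <\n"); }
.|\n        { /* ignore anything else */ }
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }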
In summary, buffering is an important technique for optimizing the performance of scanners in reading and
processing the source program. The choice of buffering scheme depends on the characteristics of the source program,
such as the average token length, the presence of long tokens, and the expected size of the input.
However, the input buffering scheme can lead to a problem called boundary crossing. Boundary crossing occurs
when the buffer ends in the middle of a token, causing the scanner to read beyond the end of the buffer to complete
the token. This can result in inefficiency and can also lead to errors in the scanning process.
To avoid boundary crossing, sentinels can be used. Sentinels are special characters that are appended to the end of
the input buffer to ensure that the scanner can always complete the processing of the last token. The sentinel can be
a character that does not occur in the source program, such as a null character or a special end-of-file character.
When the scanner encounters the sentinel, it knows that it has reached the end of the input and can stop processing.
This eliminates the need for the scanner to read beyond the end of the buffer, improving performance and reducing
the risk of errors. Overall, the use of sentinels can improve the performance of input buffering by avoiding boundary
crossing and allowing the scanner to process the source program more efficiently.
1. One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, two buffers are used to store the input string. The two buffers are scanned alternately: when the end of the current buffer is reached, the other buffer is filled. The only remaining problem is that if the length of a lexeme is longer than the length of a buffer, the input cannot be scanned completely. Initially both the begin pointer (bp) and the forward pointer (fp) point to the first character of the first buffer. Then fp moves towards the right in search of the end of the lexeme; as soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To mark the boundary of the first buffer, an end-of-buffer (eof) character is placed at its end, and the end of the second buffer is likewise recognized by the end-of-buffer mark at its end. When fp encounters the first eof, the end of the first buffer is recognized and filling of the second buffer begins. In the same way, when the second eof is reached, it indicates the end of the second buffer. The two buffers are filled alternately until the end of the input program is reached and the stream of tokens is identified. The eof character introduced at the end of each buffer is called a sentinel and is used to identify the end of the buffer.
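A minimal C sketch of the two-buffer scheme with sentinels; the buffer size N, the choice of '\0' as the sentinel, and the helper names are assumptions made for the illustration (source programs are assumed not to contain the sentinel character):
#include <stdio.h>

#define N 4096                        /* size of each buffer half                */
#define SENTINEL '\0'                 /* eof mark placed after each half         */

static char  buf[2 * N + 2];          /* two halves, each followed by a sentinel */
static char *fp;                      /* forward pointer used by the scanner     */
static FILE *src;

/* Fill one half with up to N characters and place the sentinel after them. */
static void refill(char *half)
{
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

/* Load the first half and point fp at its first character. */
static void init_buffers(FILE *f)
{
    src = f;
    refill(buf);
    fp = buf;
}

/* Return the next character; reload the other half when the sentinel that marks
   the end of a half is reached. SENTINEL is returned at the real end of input. */
static char next_char(void)
{
    char c = *fp++;
    if (c == SENTINEL) {
        if (fp == buf + N + 1) {              /* end of first half reached  */
            refill(buf + N + 1);
            fp = buf + N + 1;
            c = *fp++;
        } else if (fp == buf + 2 * N + 2) {   /* end of second half reached */
            refill(buf);
            fp = buf;
            c = *fp++;
        } else {
            return SENTINEL;                  /* sentinel inside a half: real end of input */
        }
    }
    return c;
}

int main(void)
{
    init_buffers(stdin);
    long count = 0;
    while (next_char() != SENTINEL)
        count++;
    printf("%ld characters scanned\n", count);
    return 0;
}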
2. Explain the various phases of a compiler in detail. Also write down the output after each phase for the expression a = b * c.
A compiler is a program that converts the source code written in a programming language into machine-readable
code that can be executed by a computer. The process of compiling involves several phases, which are:
1. Lexical Analysis - also known as tokenization, this phase breaks the source code into tokens or lexemes, which
are the smallest meaningful units of the language. The input to this phase is the source code, and the output is a
sequence of tokens that represent the various keywords, identifiers, operators, and literals used in the code. For
example, the expression a = b * c might be tokenized into the following sequence of tokens:
IDENTIFIER (a)
ASSIGNMENT_OPERATOR (=)
IDENTIFIER (b)
MULTIPLICATION_OPERATOR (*)
IDENTIFIER (c)
2. Syntax Analysis - also known as parsing, this phase analyzes the structure of the code to ensure that it conforms to
the rules of the language's grammar. The input to this phase is the sequence of tokens produced by the lexical
analysis phase, and the output is an abstract syntax tree (AST) that represents the structure of the code. The AST is
a hierarchical tree-like structure that captures the relationships between the various parts of the code. For example,
the AST for the expression a = b * c might look like this:
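(Sketched here as one common shape, with the assignment at the root and the multiplication as its right child:)

        =
       / \
      a   *
         / \
        b   c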
3. Semantic Analysis - this phase checks the code for semantic correctness, which means ensuring that it follows the
rules and constraints of the language. This phase involves type checking, scope checking, and other kinds of
analyses that ensure the code is well-formed and meaningful. The output of this phase is a symbol table that
contains information about the various identifiers used in the code, such as their types, scope, and memory
locations. For example, in the expression a = b * c, the semantic analysis phase would ensure that b and c are both
of the same type and that a is of a compatible type to store the result of the multiplication.
4. Intermediate Code Generation - this phase transforms the AST into an intermediate representation (IR) that is
closer to machine code but still independent of any particular hardware or operating system. The IR is a lower-
level representation that simplifies the code and removes any language-specific constructs. The output of this
phase is the intermediate code that represents the original code in a simpler and more abstract form. For example,
the intermediate code for the expression a = b * c might look like this:
t1 = b * c
a = t1
5. Code Optimization - this phase optimizes the intermediate code to improve its efficiency and reduce its size. This
phase involves a range of techniques, such as constant folding, dead code elimination, and loop optimization that
make the code faster and smaller without changing its functionality. The output of this phase is optimized
intermediate code that is more efficient than the original intermediate code.
6. Code Generation - this phase generates the machine code that can be executed by the computer. The input to this
phase is the optimized intermediate code, and the output is the machine code that represents the original program
in binary form. The code generator is responsible for mapping the abstract operations in the intermediate code to
the specific instructions of the target processor. The output of this phase is the executable code that can be run on
the target hardware. Assuming that the target hardware is a hypothetical processor that uses a simple assembly language, the code generation phase would translate the optimized intermediate code t1 = b * c and a = t1 into the following machine code:
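(A sketch only; the mnemonics assume a simple hypothetical load/store machine, as stated above.)
LOAD  R1, b     ; R1 = b
LOAD  R2, c     ; R2 = c
MUL   R1, R2    ; R1 = R1 * R2   (t1 = b * c)
STORE R1, a     ; a = t1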
INTERPRETER:
An interpreter is a program that directly executes the source
code of a program line by line. It does not translate the entire
program into machine code before execution, unlike a
compiler. Instead, it reads each line of code, interprets it, and
then executes it immediately. This makes it slower than a
compiled program, but also more flexible, as it can modify its
behavior based on runtime conditions.
ASSEMBLER:
An assembler is a program that translates assembly language
code into machine code. Assembly language is a low-level
programming language that uses mnemonic codes to represent
machine instructions. Assemblers are used to create
executable files for programs written in assembly language.
Tokenization: The assembler tokenizes the assembly language code, breaking it down into individual
instructions, symbols, and operands.
Parsing: The assembler parses the tokenized code, interpreting each instruction and operand and
generating corresponding machine code.
Symbol resolution: The assembler resolves symbols, which are names of functions, variables, or other
program elements that are used in the code. It ensures that each symbol is defined only once and that all
references to the symbol are resolved correctly.
Code generation: Finally, the assembler generates machine code, which consists of binary instructions
and data that can be executed directly by the computer's CPU
LINKER:
A linker is a program that combines object files generated by a compiler into a single executable file. The
linker performs several tasks, including:
Symbol resolution: The linker resolves symbols, which are names of functions, variables, or other
program elements that are used in multiple files. It ensures that each symbol is defined only once and that
all references to the symbol are resolved correctly.
Relocation: The linker relocates object files so that they can be loaded into memory and executed
correctly. This involves adjusting addresses and offsets in the code and data sections of the object files.
Library linking: The linker links object files with libraries that contain precompiled code and data. This
can include standard libraries provided by the operating system or third-party libraries.
Dead code elimination: The linker eliminates code and data that is not used by the program. This can
include unused functions, variables, and other program elements.
Output generation: Finally, the linker generates an executable file or a shared library that can be loaded
into memory and executed by the operating system.
LOADER:
A loader is a program that loads an executable file into memory and prepares it for execution. The loader
performs several tasks, including:
Allocating memory: The loader allocates memory for the program in the computer's memory space. It
determines the amount of memory required by the program and allocates the necessary space.
Resolving dependencies: If the program depends on other libraries or modules, the loader resolves those
dependencies and loads them into memory as well.
Relocating code: The loader may need to relocate the program's code in memory to ensure that it runs
correctly. This is necessary because the program may be designed to run at a specific memory address, but
that address may not be available when the program is loaded.
Setting up the program's environment: The loader sets up the program's environment, including its
initial values for registers and memory locations.
Starting program execution: Finally, the loader starts the program's execution by transferring control to
the program's entry point.
PREPROCESSOR:
A preprocessor is a program that processes the source code of a program before it is compiled. It is used to
perform tasks such as including header files, defining macros, and conditionally compiling code. The
preprocessor is typically run before the compiler as a separate step. Preprocessing is typically done using
directives, which are special commands that begin with a hash symbol (#) and are inserted into the source
code. Some common preprocessing directives include:
#include: This directive is used to include header files in the source code. Header files typically contain
function declarations, constants, and other definitions that are used in the source code.
#define: This directive is used to define macros in the source code. Macros are symbolic names that are
replaced with their corresponding values during preprocessing.
#ifdef, #ifndef, #else, #endif: These directives are used for conditional compilation. They allow parts of
the code to be included or excluded from the final compiled output based on certain conditions.
#pragma: This directive is used to provide hints or instructions to the compiler or linker. Pragmas are
typically used to control optimization or to specify linker options.
The preprocessor runs as a separate step before the compiler, and it generates a modified version of the
source code that is then passed to the compiler for compilation.
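A short C fragment illustrating these directives; the macro names and the NDEBUG condition are only examples:
#include <stdio.h>              /* bring in declarations such as printf  */

#define PI 3.14159              /* object-like macro                     */
#define SQUARE(x) ((x) * (x))   /* function-like macro                   */

#ifndef NDEBUG
#define LOG(msg) printf("debug: %s\n", msg)
#else
#define LOG(msg)                /* compiled out when NDEBUG is defined   */
#endif

int main(void)
{
    LOG("starting");
    printf("area = %f\n", PI * SQUARE(2.0));
    return 0;
}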
Debugger:
A debugger is a program that helps developers find and fix bugs in their code. It allows developers to step
through their code line by line, set breakpoints, inspect variables, and examine the call stack. Debuggers
can be integrated into development environments or run as standalone programs.
Optimizer:
An optimizer is a program that analyzes the code generated by a compiler and tries to improve its
performance. It performs tasks such as removing redundant code, reordering instructions and replacing
slow operations with faster ones. Optimizers can significantly improve the performance of compiled
programs.
Single-pass compiler:
A single-pass compiler reads the entire source code in one pass and generates the object code in a single
step. This type of compiler is faster than a multi-pass compiler, but may generate less efficient code.
Multi-pass compiler:
A multi-pass compiler reads the source code in multiple passes, performing different tasks such as lexical
analysis, syntax analysis, semantic analysis, and code generation. This type of compiler can generate more
efficient code than a single-pass compiler, but is slower.
Cross-compiler:
A cross-compiler is a compiler that runs on one platform and generates code for another platform. For
example, a compiler that runs on a Windows PC and generates code for a Linux server would be a cross-
compiler.
Just-in-time (JIT) compiler:
A JIT compiler is a compiler that generates machine code at runtime, just before the code is executed. This
allows for dynamic optimization of the code, and can lead to significant performance improvements.
Ahead-of-time (AOT) compiler:
An AOT compiler is a compiler that generates machine code ahead of time, before the code is executed.
This can improve startup time and reduce the memory footprint of the program, but can also increase the
size of the executable.
Incremental compiler:
An incremental compiler is a compiler that only recompiles parts of the code that have changed since the
last compilation. This can speed up the development process by reducing the time required for full
recompilations.
Optimizing compiler:
An optimizing compiler is a compiler that analyzes the code and generates more efficient machine code,
by performing optimizations such as loop unrolling, constant folding, and register allocation.
5. What is a regular expression? Give all the algebraic properties of regular expression.
A regular expression, also known as regex or regexp, is a pattern that describes a set of strings. It is a
sequence of characters that define a search pattern. Regular expressions are used in many programming
languages, text editors, and other software to match and manipulate text. The algebraic properties of regular expressions are:
Closure:
The set of regular expressions is closed under union, concatenation, and Kleene star. This means that the
result of any operation on two regular expressions is itself a regular expression.
Associativity:
The union and concatenation operations are associative, which means that changing the grouping of the
expressions being combined does not affect the result. That is, (A ∪ B) ∪ C = A ∪ (B ∪ C) and (A • B) •
C = A • (B • C).
Commutativity:
The union operation is commutative, which means that changing the order of the expressions being
combined does not affect the result. That is, A ∪ B = B ∪ A.
Identity elements: There exists an identity element for both union and concatenation. The empty set is the
identity element for union, while the empty string is the identity element for concatenation.
Distributivity:
The concatenation operation distributes over union, which means that (A ∪ B) • C = (A • C) ∪ (B • C).
Kleene star properties:
The Kleene star operation is idempotent, which means that A* = (A*)* for any regular expression A. The
Kleene star also satisfies the following properties:
a. A* is the smallest language containing ε and A that is closed under concatenation. b. (A • B)* • A = A • (B • A)*. c. (A ∪ B)* = (A* • B*)* = (A* ∪ B*)*.
Regular expressions have several identity rules, also known as laws or properties, that allow for the
manipulation and simplification of regular expressions. These rules include:
Union Identity: A ∪ ∅ = A
This rule states that the union of a regular expression A with the empty set (∅ ) is equivalent to A.
Concatenation Identity: A • ε = AThis rule states that the concatenation of a regular expression A with the
empty string (ε) is equivalent to A.
Kleene Star Identity: ε* = ε
This rule states that the Kleene star of the empty string (ε) is itself equal to ε.
Kleene Star of Union: (A ∪ B)* = (A* • B*)*
This rule states that the Kleene star of the union of two regular expressions A and B is equivalent to the Kleene star of the concatenation of their Kleene stars.
De Morgan's Laws: (A ∪ B)' = A' ∩ B' and (A ∩ B)' = A' ∪ B'
These laws state that the complement of the union of two languages is equivalent to the intersection of their complements, and the complement of the intersection of two languages is equivalent to the union of their complements. (ε - epsilon, R - any regular expression, ɸ - the empty set phi)
1. ɸ + R = R
2. ɸ . R = ɸ
3. ε . R = R . ε = R
4. R + R = R
5. R . R* = R* . R = R+
6. ε + R . R* = R . R* + ε = R*
7. ε* = ε,  ɸ* = ε
8. (R*)* = R*
9. R* = R+ + ε
10. ɸ . R = R . ɸ = ɸ
11. (PQ)* P = P (QP)*
12. (P + Q) . R = PR + QR
13. R . (P + Q) = RP + RQ
14. (P + Q)* = (P* . Q*)* = (P* + Q*)*
15. R* R + R = R* R
16. (R + ε)* = R*
17. (R + ε) + R* = R*
18. (R + ε) . R* = R*
19. (ε + R*) = R*
20. (R + ε)(R + ε)*(R + ε) = R*
6. Explain the structure of the LEX program. Write a Lex program to calculate sum of integers.
The LEX program is a lexical analyzer generator that can be used to generate programs that recognize
patterns in text, such as programming language tokens. Here is an example of a simple LEX program
structure:
%{
/* Header file declarations and global definitions go here */
%}
%%
/* Regular expression rules go here */ {/* Action code for each rule goes here */}
%%
/* Additional functions or code go here */
%{
#include <stdio.h>
#include <stdlib.h>
int sum = 0; /* Global variable for the sum */
%}
%%
[0-9]+      { sum += atoi(yytext); }    /* add each integer to the running sum */
.|\n        { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }
int main()
{
    yylex();
    printf("Sum = %d\n", sum);
    return 0;
}
7. Construct the NFA for the following regular expression using Thompson’s Construction. Apply
subset construction method to convert it into DFA.(a+b)*abb#
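(For reference only: ignoring the end-marker #, Thompson's construction followed by subset construction and minimization usually yields the following four-state DFA for (a+b)*abb, with start state A and accepting state D. This is a sketch of the expected result; the intermediate NFA and subset states depend on how the construction is carried out.)
State   on a   on b
A       B      A
B       B      C
C       B      D
D       B      A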
8. Construct the NFA for the following regular expression using Thompson’s Construction. Apply
subset construction method to convert it into DFA.(a+b)*ab*a
9. Construct a DFA without constructing NFA (using syntax tree method) for the following regular
expression. Find the minimized DFA.(a|b|c)*d*(a*|b|ac+)
Now we will compute the follow position of each node
Now we will obtain Dtran for each state and each input
10. Construct a DFA by syntax tree construction method (a+b)*ab#
The Follow Position is computed as
PART-A
1. Define Context Free Grammar.
CFG stands for Context-Free Grammar. It is a type of formal grammar that is used to describe the
syntax or structure of a programming language or any other formal language. In a CFG, a set of
production rules are defined to generate strings of symbols that belong to the language. These
production rules specify how one or more non-terminal symbols can be rewritten as a sequence of
terminal and/or non-terminal symbols. A non-terminal symbol is a symbol that can be replaced by a
sequence of symbols, while a terminal symbol is a symbol that cannot be rewritten further. CFGs are
widely used in computer science, particularly in the design and analysis of programming languages,
compilers, and parsers.
5. Consider the following grammar: A --> ABd | Aa | a,  B --> Be | b, and remove left recursion.
We can see that the first production rule A → ABd is left-recursive, as A appears as the first symbol
on the right-hand side. To eliminate left recursion, we can use the following approach:
Identify the non-terminals that have left-recursive productions, in this case, A.
Create a new non-terminal symbol A' and replace the left-recursive productions with new productions that
use the non-left-recursive form of the same non-terminal A' instead of A.
Add a new production rule for A that starts with a non-left-recursive symbol (in this case, the terminal
symbol a) and ends with A', allowing the production of strings that start with a and end with the non-left-
recursive form of A.
Using this approach, we can transform the grammar as follows:
A  --> a A'
A' --> B d A' | a A' | ε
B  --> b B'
B' --> e B' | ε
Here, we have added a new non-terminal symbol A' and rewritten the left-recursive productions A --> ABd and A --> Aa as the productions A' --> B d A' and A' --> a A', with A --> a A' supplying the required leading a. Similarly, we have removed the left recursion from the B productions by introducing a new non-terminal symbol B' and the productions B --> b B' and B' --> e B' | ε, which allow zero or more occurrences of e after the initial b. The resulting grammar is free of left recursion.
1. LR(0) Items: An LR(0) item is a production rule of the CFG together with a dot (.) that marks how much of the production has been recognized so far. For example, for the production rule A -> αBβ, where A is a non-terminal symbol, B is a grammar symbol, and α and β are sequences of grammar symbols, a possible LR(0) item is A -> α.Bβ. The dot represents the position in the production rule on which parsing is currently focused.
2. LR(0) Closure: Given a set of LR(0) items, the closure operation expands the set by
including all items that can be reached by applying the production rules of the
CFG. It ensures that all possible parsing configurations are considered.
3. LR(0) Sets of Items: An LR(0) set of items is a set of LR(0) items obtained by
applying closure and other operations. Each LR(0) set represents a state in the LR
parsing process.
4. LR(0) Automaton: An LR(0) automaton is a directed graph where each node
represents an LR(0) set of items, and the edges indicate transitions between sets
based on grammar symbols. The automaton is constructed by systematically
computing LR(0) sets of items.
5. LR(0) Parsing Table: The LR(0) parsing table is a two-dimensional table that guides
the parsing process based on the LR(0) automaton. The rows of the table
represent the states of the automaton, and the columns correspond to grammar
symbols and special end-of-input markers. The table entries contain the parsing
actions, which can be of three types:
Shift: Move to the next state.
Reduce: Apply a production rule and replace a set of symbols on the stack with a
non-terminal symbol.
Accept: The input string is valid according to the CFG.
6. LR(1) Items: An LR(1) item extends the concept of LR(0) items by considering the
lookahead symbol, i.e., the next input symbol that the parser can see. Each LR(1)
item includes a lookahead symbol along with the partial production rule and
marker. For example, [A -> α.Bβ, a] represents an LR(1) item with a as the lookahead symbol.
7. LR(1) Sets of Items: Similar to LR(0) sets, LR(1) sets of items are obtained by
applying closure and other operations to LR(1) items. Each LR(1) set represents a
state in the LR(1) parsing process.
8. LR(1) Automaton: The LR(1) automaton is constructed based on LR(1) sets of
items and is similar to the LR(0) automaton. However, the transitions in the LR(1)
automaton consider the lookahead symbols in addition to the grammar symbols.
9. LR(1) Parsing Table: The LR(1) parsing table is constructed based on the LR(1)
automaton. It is similar to the LR(0) parsing table but takes into account the
lookahead symbols while determining the parsing actions.
The process of constructing the LR parsing table involves the following steps:
The process of constructing the LR parsing table involves several steps. Let's go
through them in detail:
Step 1: Augment the Grammar To construct the LR parsing table, we first need to
augment the given context-free grammar (CFG). The augmentation involves
adding a new start symbol and a new production rule to ensure that the parser
can recognize the entire input string.
Step 2: Compute LR(1) Sets of Items We compute the LR(1) sets of items by
applying closure and other operations to LR(1) items. Initially, we start with the
closure of the item [S' -> .S, $], where S' is the augmented start symbol, S is
the original start symbol, and $ represents the end-of-input marker.
We then iteratively process each LR(1) set and expand it by considering the
transitions based on grammar symbols and lookahead symbols. By applying
closure and goto operations, we generate new LR(1) sets until no more sets can
be created.
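As a small illustration (using a toy grammar rather than one from this section): for the grammar S -> C C, C -> c C | d, the closure of the initial item [S' -> .S, $] is the set { [S' -> .S, $], [S -> .C C, $], [C -> .c C, c/d], [C -> .d, c/d] }, because the C-items receive the lookaheads FIRST(C $) = {c, d}.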
Step 3: Construct LR(1) Automaton Using the computed LR(1) sets of items, we
construct the LR(1) automaton. Each LR(1) set of items represents a state in the
automaton, and the transitions between states are determined by the grammar
symbols and lookahead symbols.
Step 4: Fill the Parsing Table Now, we create the LR parsing table, which is a two-
dimensional table with rows representing the states of the LR(1) automaton and
columns representing grammar symbols and lookahead symbols.
Step 5: Handle Conflicts During the table construction process, conflicts may arise
in the parsing table. Conflicts occur when multiple actions are possible for a given
state and symbol. The two main types of conflicts are shift-reduce conflicts and
reduce-reduce conflicts.
Shift-Reduce Conflict: Occurs when both a shift and a reduce action are possible.
It indicates ambiguity in the grammar, and the parser needs additional rules or
information to resolve the conflict.
Reduce-Reduce Conflict: Occurs when multiple reduce actions are possible. It
indicates ambiguity in the grammar, and the grammar needs to be modified to
eliminate the conflict.
Once the LR parsing table is constructed and any conflicts are resolved, the parser
can use it to process a given input string by following the actions specified in the
table for each input.
4. Construct the LR(0) items for the grammar given below and develop the SLR parsing table: S → aSa | bSb | aa | bb
Solution:
5. Construct the SLR Parsing table for the following grammar. Also, Parse the input string a * b + a.
E→E+T|T
T → TF | F
F → F* | a | b.
Step1 − Construct the augmented grammar and number the productions.
(0) E′ → E
(1) E → E + T
(2) E → T
(3) T → TF
(4) T → F
(5) F → F ∗
(6) F → a
(7) F → b.
Step2 − Find closure & goto Functions to construct LR (0) items.
Box represents the New states, and the circle represents the Repeating State.
Computation of FOLLOW
We can find out
FOLLOW(E) = {+, $}
FOLLOW(T) = {+, a, b, $}
FOLLOW(F) = {+,*, a, b, $}
Stack        Input String   Action
0            a*b+a$         Shift
0T2b5        +a$            Reduce by F → b
0T2F7        +a$            Reduce by T → TF
0T2          +a$            Reduce by E → T
0E1          +a$            Shift
0E1+6        a$             Shift
0E1+6a4      $              Reduce by F → a
0E1+6F3      $              Reduce by T → F
0E1+6T9      $              Reduce by E → E + T
0E1          $              Accept
10. Describe in detail about the steps of construction parsing table of LALR parser with
example.
The construction of the parsing table for an LALR (Look-Ahead LR) parser is based on merging those states of the canonical LR(1) construction that share the same core of LR(0) items. LALR parsers are more compact than canonical LR(1) parsers, with tables comparable in size to SLR(1) tables, while still being able to handle a wide range of grammars. Let's go through the steps of constructing the parsing table for an LALR parser with an example.
Step 1: Take an example grammar and construct its LR(0) sets of items (the LR(0) automaton). Consider the grammar:
1. S -> A c
2. S -> B c
3. A -> a A
4. A -> ε
5. B -> b B
6. B -> ε
Step 2: Compute the Look-Ahead Sets In this step, we compute the look-ahead
sets for each LR(0) set of items. The look-ahead sets represent the viable prefixes,
which are the prefixes that can lead to a valid handle (right side of a production).
This step involves applying the look-ahead closure operation to propagate the
look-ahead sets throughout the LR(0) automaton.
Step 3: Merge Compatible States In this step, we merge compatible states to create fewer and larger states. Two states are compatible if they have the same core, that is, the same LR(0) items (productions with the same dot positions) once the lookahead symbols are ignored; the lookahead sets of the matching items are then combined by taking their union. Merging states with a common core is what reduces the size of the LALR parsing table compared with a canonical LR(1) table.
Step 4: Construct the Parsing Table Using the merged states from the previous
step, we construct the LALR parsing table. The parsing table is a two-dimensional
table with rows representing the states of the LALR parser, and columns
representing grammar symbols and end-of-input markers.
Step 5: Handle Conflicts During the table construction process, conflicts may arise
in the parsing table. Conflicts occur when multiple actions are possible for a given
state and symbol. Shift-reduce conflicts and reduce-reduce conflicts can occur.
Shift-reduce conflicts occur when both a shift and a reduce action are possible,
while reduce-reduce conflicts occur when multiple reduce actions are possible.
These conflicts need to be resolved to ensure the parsing table is unambiguous
and deterministic.
By following these steps, we can construct the parsing table for an LALR parser.
The parsing table guides the parsing process for a given input string, enabling the
LALR parser to recognize and analyze the input based on the given grammar.
UNIT-III QUESTION BANK WITH ANSWERS
PART-A
A dependency graph is used to represent the flow of information among the attributes in
a parse tree. In a parse tree, a dependency graph basically helps to determine the
evaluation order for the attributes. The main aim of the dependency graphs is to help the
compiler to check for various types of dependencies between statements in order to
prevent them from being executed in the incorrect sequence.
6. Differentiate SDT and SDD.
SDD: Specifies the values of attributes by associating semantic rules with the
productions.
SDT scheme: embeds program fragments (also called semantic actions) within
production bodies. The position of the action defines the order in which the action is
executed (in the middle of production or end).
7. Define L-attributed Definition
L-attributed grammars are a special type of attribute grammars. They allow the attributes
to be evaluated in one depth-first left-to-right traversal of the abstract syntax tree. As a
result, attribute evaluation in L-attributed grammars can be incorporated conveniently in
top-down parsing. In an L-attributed SDD, attributes may be inherited or synthesized; this is referred to as an L-attributed definition. In an S-attributed SDD, all attributes are synthesized (an S-attributed definition).
8. Justify why all S-attributed definitions are L-attributed.
Synthesized Attribute Dependency: In S-attributed definitions, the computation of
synthesized attributes depends only on the attributes of symbols on the RHS of the
production rule. Since L-attributed definitions allow the computation of synthesized
attributes based on the attributes of child symbols, S-attributed definitions satisfy this
condition.
Inherited Attributes: S-attributed definitions do not involve inherited attributes. In L-
attributed definitions, inherited attributes are passed from the parent symbol to its child
symbols. Since S-attributed definitions do not have any inherited attributes, they
automatically satisfy the condition of not depending on inherited attributes.
Therefore, all S-attributed definitions can be considered L-attributed because they satisfy
the characteristics and restrictions of L-attributed definitions. The attribute computations
in S-attributed definitions follow a left-to-right order, and they depend only on the
attributes of symbols on the RHS of production rules, which aligns with the requirements
of L-attributed definitions.
1. What is an attribute grammar? Describe in detail about its two types of attributes with suitable example.
2. Write S-attributed SDD for simple desk calculator and draw an annotated parse tree representing any valid input.
3. What is inherited attribute? Write down the SDD with inherited attribute to declare a list of identifiers.
4. Write a SDD for the grammar to declare variables with data type int, float or char. Draw a dependency graph for the declaration statement int a,b,c.
5. How syntax-directed definitions can be used to specify the construction of syntax trees. Give example.
6. Describe in detail about specification of a simple type checker with an example type system to report type error in various statements.
7. Explain in detail about various type expressions and the conventions used to represent various program constructs.
8. Explain various types of type equivalence with suitable example.
9. Describe in detail about various types of three address code with suitable examples.
10. Explain in detail about different symbol table implementation strategies.
1. What is an attribute grammar? Describe in detail about its two types of attributes with
suitable example.
Attribute grammar is a special form of context-free grammar where some additional information
(attributes) are appended to one or more of its non-terminals in order to provide context-sensitive
information. Each attribute has well-defined domain of values, such as integer, float, character,
string, and expressions.
Attribute grammar is a medium to provide semantics to the context-free grammar and it can help
specify the syntax and semantics of a programming language. Attribute grammar (when viewed as
a parse-tree) can pass values or information among the nodes of a tree.
Example:
E → E + T { E.value = E.value + T.value }
The right part of the CFG contains the semantic rules that specify how the grammar should be
interpreted. Here, the values of non-terminals E and T are added together and the result is copied
to the non-terminal E.
Semantic attributes may be assigned to their values from their domain at the time of parsing and
evaluated at the time of assignment or conditions. Based on the way the attributes get their values,
they can be broadly divided into two categories:
1. Synthesized attributes 2.Inherited attributes.
Synthesized attributes
These attributes get values from the attribute values of their child nodes. To illustrate, assume the
following production:
S → ABC
If S is taking values from its child nodes (A,B,C), then it is said to be a synthesized attribute, as
the values of ABC are synthesized to S.
As in our previous example (E → E + T), the parent node E gets its value from its child node.
Synthesized attributes never take values from their parent nodes or any sibling nodes.
Inherited attributes
In contrast to synthesized attributes, inherited attributes can take values from parent and/or
siblings. As in the following production,
S → ABC
A can get values from S, B and C. B can take values from S, A, and C. Likewise, C can
take values from S, A, and B.
Semantic analysis uses Syntax Directed Translations to perform the above tasks.
Semantic analyzer receives AST (Abstract Syntax Tree) from its previous stage (syntax
analysis).
Semantic analyzer attaches attribute information with AST, which are called Attributed
AST.
An attribute is a two-tuple value: <attribute name, attribute value>
For example:
int value = 5;
<type, “integer”>
<presentvalue, “5”>
S-attributed SDT
If an SDT uses only synthesized attributes, it is called as S-attributed SDT. These
attributes are evaluated using S-attributed SDTs that have their semantic actions written
after the production (right hand side).
In L-attributed SDTs, a non-terminal can get values from its parent, child, and sibling
nodes. As in the following production
S → ABC
S can take values from A, B, and C (synthesized). A can take values from S only. B can
take values from S and A. C can get values from S, A, and B. No non-terminal can get
values from the sibling to its right.
2. Write S-attributed SDD for simple desk calculator draw annotated parse tree
representing any valid input.
The syntax-directed definition for a desk calculator program associates an integer-valued synthesized attribute called val with each of the non-terminals E, T, and F. For each E-, T-, and F-production, the semantic rule computes the value of attribute val for the non-terminal on the left side from the values of val for the non-terminals on the right side.
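A commonly used formulation of this SDD (a sketch following the standard expression grammar; n denotes the end-of-line token):
L → E n        { print(E.val) }
E → E1 + T     { E.val = E1.val + T.val }
E → T          { E.val = T.val }
T → T1 * F     { T.val = T1.val * F.val }
T → F          { T.val = F.val }
F → ( E )      { F.val = E.val }
F → digit      { F.val = digit.lexval }
For an input such as 3 * 5 + 4 n, the annotated parse tree carries val = 15 at the T node covering 3 * 5 and val = 19 at the root E node, and the print action outputs 19.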
3. What is inherited attribute? Write down the SDD with inherited attribute to declare a list
of identifiers.
Inherited Attributes – These are the attributes which derive their values from their parent or sibling nodes
i.e. value of inherited attributes are computed by value of parent or sibling nodes.
Example:
A --> BCD { C.in = A.in, C.type = B.type }
Computation of Inherited Attributes –
Construct the SDD using semantic actions.
The annotated parse tree is generated and attribute values are computed in top down manner.
Example: Consider the following grammar
S --> T L
T --> int
T --> float
T --> double
L --> L1, id
L --> id
The SDD for the above grammar can be written as follows
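One way the SDD can be written (inh is the inherited attribute of L that carries the type; Enter_type and id.entry are as used in the explanation below):
S --> T L        { L.inh = T.type }
T --> int        { T.type = int }
T --> float      { T.type = float }
T --> double     { T.type = double }
L --> L1, id     { L1.inh = L.inh ; Enter_type(id.entry, L.inh) }
L --> id         { Enter_type(id.entry, L.inh) }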
Let us assume an input string int a, c for computing inherited attributes. The annotated parse tree for the
input string is
The value of L nodes is obtained from T.type (sibling) which is basically lexical value obtained as int, float
or double. Then L node gives type of identifiers a and c. The computation of type is done in top down
manner or pre-order traversal. Using function Enter_type the type of identifiers a and c is inserted in
symbol table at corresponding id.entry.
4. Write a SDD for the grammar to declare variables with data type int,float or char. Draw
a dependency graph for the declaration statement int a,b,c.
5. How syntax-directed definitions can be used to specify the construction of syntax trees.
Give example.
6. Describe in detail about specification of a simple type checker with an example type
system to report type error in various statements.
A type checker is a crucial component of a programming language compiler or interpreter that
ensures the compatibility of types within a program. It analyzes the static types of variables,
expressions, and statements to identify potential type errors before the program is executed. Here,
I'll describe the specifications of a simple type checker and provide an example type system to
demonstrate how it can report type errors in various statements.
Specification of a Simple Type Checker:
Type Definitions: The type checker needs to define the types available in the programming
language. These types can include basic types like integers, booleans, characters, and floating-
point numbers, as well as user-defined types like structures or classes.
Symbol Table: The type checker maintains a symbol table that keeps track of declared variables
and their associated types. It allows the type checker to look up the type of a variable when it is
encountered in expressions or statements.
Type Inference: The type checker performs type inference, which means it deduces the types of
expressions based on the types of their constituent parts. For example, if an expression contains an
addition operation between two integers, the type checker infers that the result is also an integer.
Type Checking Rules: The type checker defines a set of rules that specify how different types
can interact with each other. For example, it might specify that an addition operation is only valid
between two integers or floats and not between an integer and a string.
Error Reporting: When a type error is encountered, the type checker generates an error message
that describes the nature of the error and the location in the source code where the error occurred.
It may also suggest possible fixes or provide additional information to help the programmer
understand the issue.
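As a rough, language-neutral illustration of such a rule, the following C sketch checks a '+' expression given the already-computed types of its operands; the Type enumeration and the reporting style are assumptions for this example, not part of any particular compiler:
#include <stdio.h>

typedef enum { TYPE_INT, TYPE_FLOAT, TYPE_BOOL, TYPE_ERROR } Type;

/* Rule for '+': both operands must be numeric; an int operand is
   implicitly widened to float when combined with a float operand. */
Type check_add(Type left, Type right, int line)
{
    if ((left != TYPE_INT && left != TYPE_FLOAT) ||
        (right != TYPE_INT && right != TYPE_FLOAT)) {
        printf("line %d: type error: operands of '+' must be int or float\n", line);
        return TYPE_ERROR;
    }
    return (left == TYPE_FLOAT || right == TYPE_FLOAT) ? TYPE_FLOAT : TYPE_INT;
}

int main(void)
{
    check_add(TYPE_INT, TYPE_FLOAT, 3);   /* accepted: result has type float    */
    check_add(TYPE_BOOL, TYPE_INT, 4);    /* reported as a type error on line 4 */
    return 0;
}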
Example Type System:
Let's consider a simple programming language with the following types: int (integer), bool
(boolean), and float (floating-point number).
The type checker will enforce the following rules:
An integer can be assigned to an integer variable.
A boolean can be assigned to a boolean variable.
A floating-point number can be assigned to a float variable.
An integer can be implicitly converted to a float.
A boolean cannot be implicitly converted to any other type.
Arithmetic operations (addition, subtraction, multiplication, division) are only allowed between
integers or floats.
Comparison operations (equality, inequality, greater than, less than) are only allowed between
integers or floats, and the result is a boolean.
Example Code (Python-style):
x: int = 5
y: float = 2.5
z: bool = True
# Valid assignments
x = 10
y = 3.14
z = False
# Invalid assignments
x = "Hello"   # Type error: cannot assign a string to an int
y = x + y     # Mixed int/float addition (see discussion below)
z = x > y     # Mixed int/float comparison (see discussion below)
In this example, the type checker reports a type error for the assignment of a string to an integer variable, which violates rule 1. The mixed int/float addition and comparison are flagged as well if the checker requires both operands of an operator to have the same numeric type; if the implicit int-to-float conversion of rule 4 is also applied to operands, they are accepted instead, with a float and a boolean result respectively. In every case the type checker produces error messages indicating the nature of the error and the specific line of code where it occurs. Overall, a simple type checker analyses the types used in a program and ensures their compatibility, helping to catch type errors early and promote type safety.
7. Explain in detail about various type expressions and the conventions used to represent
various program constructs.
Type expressions are used to represent the types of variables, expressions, and program constructs
in a programming language. They describe the nature of the data and operations that can be
performed on them. The conventions used to represent various program constructs vary across
programming languages, but I'll provide a general overview of commonly used conventions.
2. Basic Types:
Integers: Represented as int, integer, or simply i.
Booleans: Represented as bool, boolean, or b.
Floating-Point Numbers: Represented as float, double, or f.
Characters: Represented as char or c.
Strings: Represented as string or str.
3. Arrays:
Arrays: Represented as T[], where T represents the type of elements in the array.
For example, an array of integers can be represented as int[] or array<int>.
Array Size: Some languages specify the size of an array as part of the type
expression. For example, an array of 10 integers can be represented as int[10] or
array<int, 10>.
4. Pointers:
Pointers: Represented using a * symbol. For example, a pointer to an integer can be
represented as int* or ptr<int>.
Nullability: Some languages support Nullability, indicating that a pointer can have
a value of null. This is often represented using ? or Nullable<T>. For example, a
nullable pointer to an integer can be represented as int? or Nullable<int>.
5. Functions:
Functions: Represented as (parameters) -> return_type. For example, a function
that takes two integers and returns a boolean can be represented as (int, int) ->
bool.
Anonymous Functions/Lambdas: Represented using a shorthand notation
depending on the programming language. For example, in Python, a lambda
function that takes an integer and returns its square can be represented as lambda x:
x * x.
6. Structures/Classes:
Structures/Classes: Represented using the name of the structure/class. For example,
a structure/class named Person is represented as Person.
Generic Structures/Classes: If a structure/class is generic, type parameters are used.
For example, a generic list structure that can hold any type can be represented as
List<T>, where T represents the type parameter.
7. Union Types:
Union Types: Represented using a | symbol between multiple types. It indicates
that a value can have any of the specified types. For example, a variable that can
hold either an integer or a float can be represented as int | float.
8. Custom Types:
Custom types defined by the programmer are represented using the name chosen
for the type. For example, if a programmer defines a custom type named Color, it
is represented as Color.
These conventions may vary depending on the programming language and its type system. Some
languages may use keywords or specific syntax to represent certain constructs. It's important to
consult the documentation or specifications of the programming language you are working with to
understand the specific conventions and representations used.
9. Explain various types of type equivalence with suitable example.
Type equivalence refers to the comparison of types to determine whether they are equivalent or compatible in a given context. Semantic analysis is a phase in the compilation process that focuses on analysing the meaning and correctness of a program; it also involves checking whether two types can be safely used interchangeably without violating the rules of the programming language. The
specific rules for type equivalence vary depending on the language and its type system. Here are a
few common aspects of type equivalence.
1. Compatibility of Basic Types: In many programming languages, there are predefined
basic types such as integers, floating-point numbers, booleans, etc. Type equivalence for basic
types typically involves checking if the types match exactly.
2. Compatibility of Composite Types: Composite types, such as arrays, structures, classes,
or records, often have additional considerations for type equivalence. This may include checking
the compatibility of their component types, the order of fields, and the presence of optional or
variable-length components.
3. Compatibility of User-Defined Types: Type equivalence also applies to user-defined types,
such as classes or structs defined by the programmer. In this context, type equivalence may
involve checking the inheritance hierarchy, interfaces, or base classes to ensure that the types are
compatible.
Type equivalence in semantic analysis ensures that type rules are enforced, preventing type errors and
ensuring type safety in a program.
The main difficulty arises from the fact that most modern languages allow the naming of user-
defined types. For instance, in C and C++ this is achieved by the typedef statement. When
checking equivalence of named types, we have two possibilities.
Name equivalence.
Name equivalence is a concept in type systems where types are considered equivalent
if they have the same name, regardless of their internal structure or composition. It
means that two types with the same name are treated as equivalent types, even if they
are defined differently or have different component types.
Name equivalence is commonly found in languages with nominal type systems, where
type compatibility is determined based on the names of the types. In such type systems,
types are given unique names or identifiers, and type equivalence is determined by
comparing these names.
1. Java:
class MyClass {
    // ...
}
MyClass obj1;
MyClass obj2;   // obj1 and obj2 have the same (name-equivalent) type
2. C++:
class Point {
    // ...
};
// The following declarations are name-equivalent
Point p1;
Point p2;
In both of these examples, the two variables are considered to have equivalent types because they are declared with the same type name (MyClass in Java, Point in C++). The internal structure of the types or their definitions doesn't affect their equivalence.
Name equivalence simplifies type checking and comparison because it focuses solely
on the names of the types. However, it also means that types with the same name but
different structures or definitions are not considered equivalent. This can limit
flexibility in certain cases, especially when dealing with complex type structures or
when trying to establish compatibility between types defined separately.
It's important to note that different programming languages may have different type
equivalence rules depending on their design goals, type systems, and language
semantics. The choice between name equivalence and structural equivalence often
depends on the language's intended use cases and philosophy.
Structural equivalence, also known as structural typing or duck typing, is a concept in
type systems where types are considered equivalent or compatible if their structures
match, regardless of their names or declarations. It is a type system feature that allows
for flexible and dynamic type checking based on the shape or structure of types.
In a type system that employs structural equivalence, two types are considered
equivalent if they have the same structure, meaning that their components, fields,
methods, or properties match in terms of number, types, and sometimes order. This
enables objects or values of different types to be used interchangeably as long as their
structures align.
Structural equivalence simplifies type compatibility and allows for more flexible and
dynamic programming, but it also requires careful consideration and may introduce
potential issues such as accidental compatibility or difficulties in static analysis.
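A concrete C illustration of the difference (the type names are made up for this example):
#include <stdio.h>

typedef struct { int x; int y; } PointA;   /* same structure ...            */
typedef struct { int x; int y; } PointB;   /* ... but a different type name */

int main(void)
{
    PointA p = {1, 2};
    PointB q = {3, 4};
    /* Under structural equivalence, PointA and PointB would be interchangeable,
       since their fields match in name, type, and order, and "p = q" would be
       allowed.  C itself treats the two struct definitions as distinct types
       (a name-style rule), so such an assignment is rejected by the compiler. */
    printf("%d %d %d %d\n", p.x, p.y, q.x, q.y);
    return 0;
}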
10. Describe in detail about various types of three address code with suitable examples.
Three-address code (TAC) is a low-level intermediate representation used in compilers
and code optimization. It represents high-level language statements in a simplified
form, with each statement containing at most three operands or addresses. TAC is
designed to be easily translated into machine code and enables efficient code
optimization techniques. There are various types of three-address code structures,
including:
Assignment Statements:
x=y+z
In this type of TAC, an assignment statement is represented with a destination operand
(x) and two source operands (y and z). The TAC generates code that computes the sum
of y and z and stores the result in x.
Arithmetic Expressions:
x = y * (a + b)
TAC for arithmetic expressions often requires multiple statements. In this example, the expression is split into two TAC statements: first, the sum of a and b is computed and stored in a temporary variable (t1); then, the product of y and t1 is calculated and assigned to x:
t1 = a + b
x = y * t1
Pointer Operations:
x = *p
TAC allows pointer operations. In this example, the TAC fetches the value pointed to
by the pointer p and assigns it to the variable x.
Disadvantage of triples –
Temporaries are implicit, which makes it difficult to rearrange code.
It is difficult to optimize, because optimization involves moving intermediate code; when a triple is moved, any other triple referring to it must be updated as well. (With the help of a pointer, one can directly access a symbol table entry.)
Example – Consider the expression a = b * – c + b * – c
3. Indirect Triples:
This representation makes use of pointers to a separately stored listing of all references to computations. It is similar in utility to the quadruple representation but requires less space than it. Temporaries are implicit, and it is easier to rearrange code.
The control stack is vital for managing function calls, preserving return addresses, storing local
variables, enabling recursion, facilitating stack unwinding, and maintaining control flow during
program execution. It provides the necessary structure and organization for orderly procedure
execution and efficient handling of function calls.
6. List out different parameter passing methods
Pass-by-value.
Pass-by-reference.
Pass-by-value-result.
Pass-by-name.
7. What are the important factors affecting the target code generation?
1. Input to the code generator: The input to the code generator is intermediate representation together
with the information in the symbol table. Intermediate representation has the several choices:
Postfix notation,
Syntax tree or DAG,
Three address code
The code generation phase requires complete, error-free intermediate code as its input.
2. Target Program: The target program is the output of the code generator. The output can be:
Absolute machine language: It can be placed in a fixed location in memory and can be executed
immediately.
3. Target Machine: architecture and its instruction set.
4. Instruction Selection:
5. Register Allocation: Proper utilization of registers improves code efficiency. Computations using registers are faster than those using memory, so efficient utilization of registers is important. The use of registers is subdivided into two subproblems: register allocation (selecting the set of variables that will reside in registers at each point in the program) and register assignment (picking the specific register in which each such variable will reside).
6. Choice of Evaluation order: The efficiency of the target code can be affected by the order in which the
computations are performed.
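For example (an illustrative sequence, not part of the original answer), consider the three-address code
t1 = a + b
t2 = c + d
t3 = e - t2
t4 = t1 - t3
Evaluated in the order shown, t1 must be kept in a register while t2 and t3 are computed, requiring three registers (or a store). Evaluating in the order t2, t3, t1, t4 computes t1 immediately before its only use, so two registers suffice.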
In a flow graph, a node d dominates a node n if every path from the initial node of the flow
graph to n goes through d. This is denoted by d dom n. The initial node dominates all the
remaining nodes.
A dominator therefore describes a relationship between nodes in a flow graph: a node X dominates
another node Y if every path from the entry node of the graph to Y must go through X. In other
words, X is an ancestor of Y with respect to control flow, and the entry of a loop dominates all
nodes in the loop.
10. State the applications of DAG
DAGs are useful for representing many different types of flows, including data processing flows. By
thinking about large-scale processing flows in terms of DAGs, one can more clearly organize the various
steps and the associated order for these jobs.
1. Task Scheduling
2. Compiler Optimization:
3. Data Flow Analysis:
PART-B
1. Describe in detail about various operations in symbol table organization.
Symbol table is an important data structure created and maintained by compilers in order to
store information about the occurrence of various entities such as variable names, function
names, objects, classes, interfaces, etc. Symbol table is used by both the analysis and the
synthesis parts of a compiler.
A symbol table may serve the following purposes depending upon the language in hand:
To store the names of all entities in a structured form at one place.
To verify if a variable has been declared.
To implement type checking, by verifying assignments and expressions in the source code
are semantically correct.
To determine the scope of a name (scope resolution).
A symbol table is simply a table which can be either linear or a hash table. It maintains an
entry for each name in the following format:
<symbol name, type, attribute>
For example, if a symbol table has to store information about the following variable
declaration:
static int interest;
then it should store the entry such as:
<interest, int, static>
The attribute clause contains the entries related to the name.
Implementation
If a compiler is to handle only a small amount of data, the symbol table can be implemented
as an unordered list, which is easy to code but suitable only for small tables. More generally, a
symbol table can be implemented as a linear (sorted or unsorted) list, a binary search tree, or a hash table.
Scope management: the compiler maintains a separate symbol table for each scope. Consider, for example, the following program with nested scopes:
void pro_one()
{
int one_1;
int one_2;
{ \
int one_3; |_ inner scope 1
int one_4; |
} /
int one_5;
{ \
int one_6; |_ inner scope 2
int one_7; |
} /
}
void pro_two()
{
int two_1;
int two_2;
{ \
int two_3; |_ inner scope 3
int two_4; |
} /
int two_5;
}
...
The above program can be represented in a hierarchical structure of symbol tables:
The global symbol table contains names for one global variable (int value) and two procedure
names, which should be available to all the child nodes shown above. The names mentioned
in the pro_one symbol table (and all its child tables) are not available for pro_two symbols
and its child tables.
This symbol table data structure hierarchy is stored in the semantic analyser and whenever a
name needs to be searched in a symbol table, it is searched using the following algorithm:
First, the symbol is searched for in the current scope, i.e. the current symbol table.
If the name is found, the search is complete; otherwise it is searched for in the parent symbol
table, and so on, until
either the name is found or the global symbol table has been searched for the name.
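A minimal C sketch of this scoped lookup (the structure layout and names below are assumptions made for illustration, not part of the original answer):

#include <stdio.h>
#include <string.h>
#include <stddef.h>

#define MAX_SYMBOLS 64

struct entry  { const char *name; const char *type; const char *attribute; };

struct symtab {
    struct entry   entries[MAX_SYMBOLS];
    int            count;
    struct symtab *parent;            /* enclosing scope; NULL for the global table */
};

/* Search the current scope first, then walk up the parent chain until the
   name is found or the global symbol table has also been searched.        */
struct entry *lookup(struct symtab *scope, const char *name) {
    for (; scope != NULL; scope = scope->parent)
        for (int i = 0; i < scope->count; i++)
            if (strcmp(scope->entries[i].name, name) == 0)
                return &scope->entries[i];
    return NULL;                      /* not declared in any enclosing scope */
}

int main(void) {
    struct symtab global  = { { { "value", "int", "global" } }, 1, NULL };
    struct symtab pro_one = { { { "one_1", "int", "local"  } }, 1, &global };
    struct entry *e = lookup(&pro_one, "value");   /* found in the parent (global) table */
    printf("%s : %s\n", e ? e->name : "(not found)", e ? e->type : "");
    return 0;
}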
2. List out various issues in design of a code generator and possible design alternatives.
Main issues in the design of a code generator are:
Input to the code generator
Target program
Memory management
Instruction selection
Register allocation
Evaluation order
1. Input to the code generator
The code generator is fed the intermediate code produced by the front end, together with information
from the symbol table that determines the run-time addresses of the data objects denoted by the names in
the intermediate representation. The intermediate code may be represented mainly as quadruples, triples,
indirect triples, postfix notation, syntax trees, DAGs (Directed Acyclic Graphs), etc. The code generation
step assumes that its input is free of all syntactic and semantic errors, that all essential type checking has
been performed, and that type-conversion operators have been inserted where needed.
2. Target program
The code generator's output is the target program. The result could be:
Assembly language: producing an assembly-language program makes the process of code generation somewhat easier.
Relocatable machine language: it allows subprograms to be compiled separately and linked together later.
Absolute machine language: it can be stored in a fixed position in memory and run immediately.
3. Memory management
The front end and the code generator together map names in the source program to addresses of data
items in run-time memory, making use of the symbol table. In a three-address statement, a name refers
to that name's symbol-table entry. Labels in three-address statements must be converted into instruction
addresses.
For example, the statement j: goto i generates a jump instruction as follows:
If i < j, a backward jump instruction is generated with a target address equal to the location of the code
for quadruple i.
If i > j, it is a forward jump. The location of the first machine instruction generated for quadruple j must
be saved on a list for quadruple i. When quadruple i is processed, the machine locations of all instructions
that forward-jump to i are filled in.
4. Instruction selection
The program's efficiency is improved by selecting the best instructions for the target machine. The
instruction set should be complete and uniform. Instruction speeds and machine idioms have a big effect
on efficiency. If we do not care about the efficiency of the target program, instruction selection is
straightforward.
The relevant three-address statements, for example, would be translated into the following code
sequence:
P:=Q+R
S:=P+T
MOV Q, R0
ADD R, R0
MOV R0, P
MOV P, R0
ADD T, R0
MOV R0, S
The fourth instruction (MOV P, R0) is redundant: it reloads the value of P that was just stored from R0
by the previous instruction, so the sequence is inefficient. A given intermediate representation can be
translated into several distinct code sequences with considerable cost differences between them. Building
good sequences requires prior knowledge of instruction costs, yet reliable cost information is difficult to
obtain.
5. Register allocation
Registers can be accessed faster than memory. Instructions involving register operands are shorter and
faster than those involving memory operands.
The following sub-problems arise when we use registers:
Register allocation: In register allocation, we select the set of variables that will reside in the
register.
Register assignment: In the Register assignment, we pick the register that contains a variable.
Certain machines require even-odd pairs of registers for some operands and results.
Example
Consider the following division instruction of the form:
D x, y
Where,
x is the dividend, held in the even register of an even/odd register pair
y is the divisor
After the division, the odd register of the pair holds the quotient (and the even register holds the remainder).
6. Evaluation order
The code generator determines the order in which the instructions are executed. The efficiency of the
target code is influenced by the order of computations; some orders require fewer registers to hold
intermediate results than others. However, choosing the best order is a difficult (NP-complete) problem
in the general case.
3. Explain how flow graph representation of three address statement is helpful in code
generation
A basic block is a straight-line sequence of statements. A basic block has no branches into or out of it
except at its entry and exit: the flow of control enters at the beginning and always leaves at the end,
without halting or branching in between. The instructions of a basic block therefore always execute as a
sequence.
The first step is to divide a group of three-address codes into the basic block. The new basic
block always begins with the first instruction and continues to add instructions until it reaches a
jump or a label. If no jumps or labels are identified, the control will flow from one instruction to
the next in sequential order.
The algorithm for the construction of the basic block is described below step by step:
Algorithm: The algorithm used here is partitioning the three-address code into basic blocks.
Input: A sequence of three-address codes will be the input for the basic blocks.
Output: A list of basic blocks with each three address statements, in exactly one block, is
considered as the output.
Method: We’ll start by identifying the intermediate code’s leaders. The following are some
guidelines for identifying leaders:
1. The first instruction in the intermediate code is generally considered as a leader.
2. The instructions that target a conditional or unconditional jump statement can be considered
as a leader.
3. Any instructions that are just after a conditional or unconditional jump statement can be
considered as a leader.
Each leader’s basic block will contain all of the instructions from the leader until the instruction
right before the following leader’s start.
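A small C sketch of the leader-marking step, assuming each three-address instruction record carries a flag saying whether it is a (conditional or unconditional) jump and, if so, the index of its target; these field names are hypothetical:

#include <stdbool.h>
#include <stdio.h>

struct tac {
    bool is_jump;    /* conditional or unconditional jump?      */
    int  target;     /* index of the jump target, if is_jump    */
    /* operator and operands omitted in this sketch             */
};

/* Mark leaders following the three rules above. Each basic block then runs
   from a leader up to, but not including, the next leader (or to the end). */
void mark_leaders(const struct tac *code, int n, bool *leader) {
    for (int i = 0; i < n; i++) leader[i] = false;
    if (n > 0) leader[0] = true;                  /* rule 1: first instruction    */
    for (int i = 0; i < n; i++) {
        if (code[i].is_jump) {
            leader[code[i].target] = true;        /* rule 2: jump target          */
            if (i + 1 < n) leader[i + 1] = true;  /* rule 3: instruction after it */
        }
    }
}

int main(void) {
    /* Hypothetical code: instruction 3 jumps back to instruction 1. */
    struct tac code[5] = { {false, 0}, {false, 0}, {false, 0}, {true, 1}, {false, 0} };
    bool leader[5];
    mark_leaders(code, 5, leader);
    for (int i = 0; i < 5; i++)
        printf("instr %d%s\n", i, leader[i] ? "  <- leader" : "");
    return 0;
}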
Example of basic block:
Three Address Code for the expression a = b + c – d is:
T1 = b + c
T2 = T1 - d
a = T2
This represents a basic block in which all the statements execute in a sequence one after the other.
2. Algebraic Transformations
In the case of algebraic transformation, we basically change the set of expressions into an
algebraically equivalent set.
For example, an expression such as
x := x + 0
or x := x * 1
can be eliminated from a basic block without changing the set of values it computes.
Flow Graph:
A flow graph is simply a directed graph over the set of basic blocks that shows the flow of control
information. A control flow graph depicts how program control is passed among the blocks. Once the
intermediate code has been partitioned into basic blocks, the flow graph illustrates the flow of control
between them: an edge goes from a block X to a block Y when the first instruction of block Y can follow
the last instruction of block X.
Let’s make the flow graph of the example that we used for basic block formation:
Firstly, we compute the basic blocks (which is already done above). Secondly, we assign the flow
control information.
4. Explain in detail about various register allocation and assignment strategies.
Register allocation is an important step in the final phase of the compiler. Registers are faster to
access than cache or main memory, but only a small amount of register storage is available, so it is
necessary to use the minimum number of registers for variable allocation. There are three popular
register allocation algorithms:
1. Naive Register Allocation
2. Linear Scan Algorithm
3. Chaitin’s Algorithm
These are explained as following below.
1. Naïve Register Allocation:
Naive (no) register allocation is based on the assumption that variables are stored in
Main Memory.
We cannot directly perform operations on variables stored in main memory.
Variables are moved into registers, which allows various operations to be carried out using the ALU.
The ALU contains a temporary register into which variables are moved before arithmetic and logic
operations are performed.
Once the operations are complete, the result must be stored back to main memory in this method.
Transferring variables to and from main memory reduces the overall speed of execution.
a=b+c
d=a
c=a+d
Variables stored in main memory (frame-pointer offsets):
a → 2(fp)    b → 4(fp)    c → 6(fp)    d → 8(fp)
2. Linear Scan Algorithm:
In linear scan allocation, the live range of each variable is computed and registers are assigned by
scanning these live ranges in order. At any point of time the maximum number of simultaneously live
variables in this example is 4, so at most 4 registers are required for register allocation.
If we draw a horizontal line at any point of the live-range diagram, it crosses at most 4 live ranges, so
exactly 4 registers are enough to perform the operations in the program.
Spilling:
Sometimes the required number of registers may not be available. In such a case some variables may
have to be moved to and from RAM; this is known as spilling.
Spilling can be done effectively by moving the variable that is used the fewest times in the program.
Disadvantages:
Linear Scan Algorithm doesn’t take into account the “lifetime holes” of the variable.
Variables are not live throughout the program and this algorithm fails to record the holes in
the live range of the variable.
3. Graph Coloring (Chaitin’s Algorithm) :
Register allocation is interpreted as a graph coloring problem.
Nodes represent live range of the variable.
Edges represent the connection between two live ranges.
Assigning colour to the nodes such that no two adjacent nodes have same colour.
Number of colours represents the minimum number of registers required.
A k-coloring of the graph is mapped to k registers.
Steps:
1. Choose an arbitrary node of degree less than k.
2. Push that node onto the stack and remove all of its edges.
3. If every remaining vertex now has degree less than k, go to step 4; otherwise go to #.
4. Push the remaining vertices of degree less than k onto the stack one at a time, removing their edges as in step 2.
5. When no vertices remain in the graph and all of them are on the stack, POP each node and colour it such that no two adjacent nodes have the same colour.
6. The number of colours assigned to the nodes is the minimum number of registers needed.
# Spill some nodes based on their live ranges and try again with the same k value. If the problem persists, the assumed k value cannot be the minimum number of registers; increase k by 1 and repeat the whole procedure.
For the same instructions mentioned above the graph coloring will be as follows:
Assuming k=4
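A simplified C sketch of the simplify/select phases on an adjacency matrix (spilling is only reported, not performed; N, K and all the names below are illustrative assumptions, not part of the original answer):

#include <stdbool.h>
#include <stdio.h>

#define N 4    /* number of live ranges (nodes) in this made-up example */
#define K 4    /* number of available registers (colours), i.e. k       */

/* adj[i][j] is true when live ranges i and j interfere (are live together). */
static bool adj[N][N];

static int degree(int v, const bool removed[N]) {
    int d = 0;
    for (int u = 0; u < N; u++)
        if (u != v && !removed[u] && adj[v][u]) d++;
    return d;
}

/* Returns true and fills colour[] with register numbers 0..K-1 on success;
   returns false when some node would have to be spilled and retried.      */
bool colour_graph(int colour[N]) {
    int  stack[N], top = 0;
    bool removed[N] = { false };

    /* Simplify: repeatedly push a node of degree < K and remove its edges. */
    for (int pushed = 0; pushed < N; pushed++) {
        int pick = -1;
        for (int v = 0; v < N; v++)
            if (!removed[v] && degree(v, removed) < K) { pick = v; break; }
        if (pick < 0) return false;            /* no such node: spill needed */
        stack[top++] = pick;
        removed[pick] = true;
    }

    /* Select: pop nodes and give each the lowest colour not used by an
       already-coloured neighbour.                                          */
    for (int v = 0; v < N; v++) colour[v] = -1;
    while (top > 0) {
        int v = stack[--top];
        bool used[K] = { false };
        for (int u = 0; u < N; u++)
            if (adj[v][u] && colour[u] >= 0) used[colour[u]] = true;
        for (int c = 0; c < K; c++)
            if (!used[c]) { colour[v] = c; break; }
    }
    return true;
}

int main(void) {
    /* Example interference: live ranges 0-1, 1-2 and 2-3 overlap pairwise. */
    adj[0][1] = adj[1][0] = true;
    adj[1][2] = adj[2][1] = true;
    adj[2][3] = adj[3][2] = true;

    int colour[N];
    if (colour_graph(colour))
        for (int v = 0; v < N; v++)
            printf("live range %d -> R%d\n", v, colour[v]);
    else
        printf("spill required\n");
    return 0;
}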
5. Illustrate how to generate code for a basic block from its DAG representation with suitable
examples.
DAG representation for basic blocks
A DAG for basic block is a directed acyclic graph with the following labels on nodes:
The leaves of the graph are labelled by unique identifiers, which can be variable names or constants.
Interior nodes of the graph are labelled by an operator symbol.
Nodes may also carry a sequence of identifiers as labels, recording the names that hold the computed value.
DAGs are a type of data structure. It is used to implement transformations on basic blocks.
DAG provides a good way to determine the common sub-expression.
It gives a picture representation of how the value computed by the statement is used in subsequent statements.
Loop Control Variables: Registers are usually allocated for loop control variables such as
loop counters or iterators. These variables are critical for controlling the flow of execution
within the loop and are likely to be accessed frequently. Register allocation ensures fast access
to these variables, minimizing the overhead associated with memory access.
4. Construct a DAG for the following basic block
d=b*c
e=a+b
b=b*c
a=e-d and generate the code using only one register.
Step-1
d0=b0*c0
e0=a0+b0
b1=b0*c0
a1=e0-d0
R1 = b * c // Store b * c in register
d = R1 // Assign R1 to d
e = a + b // Calculate e using a and b
b = R1 // Assign R1 to b
a = e - d // Calculate a using e and d
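One possible target-code sequence for this DAG using the single register R1, in the MOV/ADD style used earlier (MUL and SUB instructions of the same form are assumed); e is evaluated first so that b can be safely overwritten and the common value b * c can be stored into both d and b without being recomputed or reloaded:
MOV a, R1
ADD b, R1      // R1 = a + b (uses the original value of b)
MOV R1, e      // e = a + b
MOV b, R1
MUL c, R1      // R1 = b * c (the shared DAG node)
MOV R1, d      // d = b * c
MOV R1, b      // b = b * c (same node, R1 still holds the value)
MOV e, R1
SUB d, R1      // R1 = e - d
MOV R1, a      // a = e - d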
The code produced by the straight forward compiling algorithms can often be made to run faster or take
less space, or both. This improvement is achieved by program transformations that are traditionally called
optimizations. Compilers that apply code-improving transformations are called optimizing compilers.
1. The transformation must preserve the meaning of programs. That is, the optimization must not change
the output produced by a program for a given input, or cause an error such as division by zero, that was
not present in the original source program.
2. A transformation must, on the average, speed up programs by a measurable amount. We are also
interested in reducing the size of the compiled code, although code size matters less than it once did.
Not every transformation succeeds in improving every program; occasionally an “optimization” may
even slow a program down slightly.
3. The transformation must be worth the effort. It does not make sense for a compiler writer to expend
the intellectual effort to implement a code-improving transformation, and to have the compiler expend
the additional time compiling source programs, if this effort is not repaid when the target programs are
executed. Some transformations, such as “peephole” transformations, are simple enough and beneficial
enough to be included in any compiler.
There are a number of ways in which a compiler can improve a program without changing the function it
computes.
Function preserving transformations examples:
Common sub expression elimination
Copy propagation,
Dead-code elimination
Constant folding
The other transformations come up primarily when global optimizations are performed.
Frequently, a program will include several calculations of the offset in an array. Some of the duplicate
calculations cannot be avoided by the programmer because they lie below the level of detail accessible
within the source language.
• For example
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t4: = 4*i
t5: = n
t6: = b [t4] +t5
The above code can be optimized using the common sub-expression elimination as
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t5: = n
t6: = b [t1] +t5
The common sub-expression t4: = 4*i is eliminated, since its value has already been computed into t1 and
the value of i has not been changed between that definition and this use.
Copy Propagation:
Assignments of the form f: = g are called copy statements, or copies for short. The idea behind the copy-
propagation transformation is to use g for f wherever possible after the copy statement f: = g. Copy
propagation means using one variable instead of another. This may not appear to be an improvement by
itself, but as we shall see it gives us an opportunity to eliminate x.
• For example:
x=Pi;
A=x*r*r;
The optimization using copy propagation can be done as follows: A=Pi*r*r; Here the variable x is
eliminated
Dead-Code Eliminations:
A variable is live at a point in a program if its value can be used subsequently; otherwise, it is dead at that
point. A related idea is dead or useless code, statements that compute values that never get used. While the
programmer is unlikely to introduce any dead code intentionally, it may appear as the result of previous
transformations.
Example:
i=0;
if(i==1)
{
a=b+5;
}
Here, ‘if’ statement is dead code because this condition will never get satisfied.
Constant folding:
Deducing at compile time that the value of an expression is a constant and using the constant instead is
known as constant folding. One advantage of copy propagation is that it often turns the copy statement into
dead code.
For example,
a=3.14157/2 can be replaced by
a=1.570 thereby eliminating a division operation.
Loop Optimizations:
In loops, especially in the inner loops, programs tend to spend the bulk of their time. The running time of a
program may be improved if the number of instructions in an inner loop is decreased, even if we increase
the amount of code outside that loop.
Induction Variables:
Loops are usually processed inside out. For example consider the loop around B3. Note that the values of
j and t4 remain in lock-step; every time the value of j decreases by 1, that of t4 decreases by 4 because 4*j
is assigned to t4. Such identifiers are called induction variables.
When there are two or more induction variables in a loop, it may be possible to get rid of all but one, by the
process of induction-variable elimination. For the inner loop around B3 in Fig.5.3 we cannot get rid of either
j or t4 completely; t4 is used in B3 and j in B4.
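Since the referenced figure is not reproduced here, the following illustrative fragment (the names t4, t5, a and v are assumed) shows the effect of strength reduction on such a loop:
Before (inside B3):
j  = j - 1
t4 = 4 * j
t5 = a[t4]
if t5 > v goto B3
After strength reduction, t4 is initialised to 4 * j once before the loop and the body becomes:
j  = j - 1
t4 = t4 - 4
t5 = a[t4]
if t5 > v goto B3
The multiplication has been removed from the loop; j itself is kept because, as noted above, it is still used in B4.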
2. Describe in detail about how data flow equations are used in code optimization.
It is the analysis of the flow of data in the control flow graph, i.e., the analysis that determines the information
regarding the definition and use of data in a program. With the help of this analysis, optimization can be done.
In general, it is a process in which values are computed at each program point using data flow analysis. The
data flow properties represent information that can be used for optimization.
Data flow analysis is a technique used in compiler design to analyse how data flows through a program. It
involves tracking the values of variables and expressions as they are computed and used throughout the
program, with the goal of identifying opportunities for optimization and identifying potential errors.
The basic idea behind data flow analysis is to model the program as a graph, where the nodes represent
program statements and the edges represent data flow dependencies between the statements. The data flow
information is then propagated through the graph, using a set of rules and equations to compute the values of
variables and expressions at each point in the program.
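As a concrete illustration, the sketch below solves the classic reaching-definitions equations, in[B] = union of out[P] over the predecessors P of B and out[B] = gen[B] ∪ (in[B] - kill[B]), iteratively over bit vectors; the block count, gen/kill sets and predecessor matrix are made-up example data, not taken from the original text:

#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 4   /* number of basic blocks in this made-up example */

/* gen_set[B] and kill_set[B] are bit vectors over all definitions in the
   program; pred[B][P] != 0 means block P is a predecessor of block B.   */
static uint32_t gen_set[NBLOCKS]  = { 0x1, 0x2, 0x4, 0x8 };
static uint32_t kill_set[NBLOCKS] = { 0x2, 0x1, 0x0, 0x0 };
static int pred[NBLOCKS][NBLOCKS] = {
    {0,0,0,0},   /* B0: entry, no predecessors  */
    {1,0,1,0},   /* B1: predecessors B0 and B2  */
    {0,1,0,0},   /* B2: predecessor B1 (a loop) */
    {0,1,0,0},   /* B3: predecessor B1          */
};

int main(void) {
    uint32_t in[NBLOCKS] = {0}, out[NBLOCKS] = {0};
    int changed = 1;

    /* Iterate the data flow equations
         in[B]  = union of out[P] over all predecessors P of B
         out[B] = gen[B] | (in[B] & ~kill[B])
       until a fixed point is reached.                          */
    while (changed) {
        changed = 0;
        for (int b = 0; b < NBLOCKS; b++) {
            uint32_t new_in = 0;
            for (int p = 0; p < NBLOCKS; p++)
                if (pred[b][p]) new_in |= out[p];
            uint32_t new_out = gen_set[b] | (new_in & ~kill_set[b]);
            if (new_in != in[b] || new_out != out[b]) {
                in[b] = new_in;
                out[b] = new_out;
                changed = 1;
            }
        }
    }
    for (int b = 0; b < NBLOCKS; b++)
        printf("B%d: in=%#x out=%#x\n", b, (unsigned)in[b], (unsigned)out[b]);
    return 0;
}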
Some of the common types of data flow analysis performed by compilers include:
Reaching Definitions Analysis: This analysis tracks the definition of a variable or expression and determines
the points in the program where the definition “reaches” a particular use of the variable or expression. This
information can be used to identify variables that can be safely optimized or eliminated.
Live Variable Analysis: This analysis determines the points in the program where a variable or expression
is “live”, meaning that its value is still needed for some future computation. This information can be used to
identify variables that can be safely removed or optimized.
Available Expressions Analysis: This analysis determines the points in the program where a particular
expression is “available”, meaning that its value has already been computed and can be reused. This
information can be used to identify opportunities for common subexpression elimination and other
optimization techniques.
Constant Propagation Analysis: This analysis tracks the values of constants and determines the points in
the program where a particular constant value is used. This information can be used to identify opportunities
for constant folding and other optimization techniques.
Data flow analysis can have a number of advantages in compiler design, including:
Improved code quality: By identifying opportunities for optimization and eliminating potential errors, data
flow analysis can help improve the quality and efficiency of the compiled code.
Better error detection: By tracking the flow of data through the program, data flow analysis can help
identify potential errors and bugs that might otherwise go unnoticed.
Increased understanding of program behaviour: By modelling the program as a graph and tracking the
flow of data, data flow analysis can help programmers better understand how the program works and how it
can be improved.
Basic Terminologies
Definition Point: a point in a program containing some definition.
Reference Point: a point in a program containing a reference to a data item.
Evaluation Point: a point in a program containing evaluation of expression.
Data Flow Properties –
Available Expression – An expression is said to be available at a program point x if it reaches x along every
path and none of its operands is redefined before x. An expression is available at its evaluation point.
An expression a + b is said to be available if none of its operands gets modified between its evaluation and its use. Example –
Advantage –
It is used to eliminate common sub expressions.
Reaching Definition – A definition D reaches a point x if there is a path from D to x along which
D is not killed, i.e., not redefined.
Example –
Advantage –
It is used in constant and variable propagation.
Live variable – A variable is said to be live at a point p if, along some path from p to the end, the variable
is used before it is redefined; otherwise it becomes dead at p.
Example –
Advantage –
It is useful for register allocation.
It is used in dead code elimination.
Busy Expression – An expression is busy along a path if its evaluation occurs along that path and none of its
operands is defined before that evaluation along the path.
Advantage –
It is used for performing code movement optimization.
Features:
Identifying dependencies: Data flow analysis can identify dependencies between different parts of a
program, such as variables that are read or modified by multiple statements.
Detecting dead code: By tracking how variables are used, data flow analysis can detect code that is never
executed, such as statements that assign values to variables that are never used.
Optimizing code: Data flow analysis can be used to optimize code by identifying opportunities for common
subexpression elimination, constant folding, and other optimization techniques.
Detecting errors: Data flow analysis can detect errors in a program, such as uninitialized variables, by
tracking how variables are used throughout the program.
Handling complex control flow: Data flow analysis can handle complex control flow structures, such as
loops and conditionals, by tracking how data is used within those structures.
Interprocedural analysis: Data flow analysis can be performed across multiple functions in a program,
allowing it to analyse how data flows between different parts of the program.
Scalability: Data flow analysis can be scaled to large programs, allowing it to analyse programs with many
thousands or even millions of lines of code.
3. Elaborate how Peephole Optimization method is used in optimizing the target code generated.
Peephole optimization is a local optimization technique that compilers use to optimize the generated
code. It is called local optimization because it works by evaluating a small section of the generated
code, generally a few instructions, and optimizing them based on some predefined rules. The
evaluated section of code is known as a peephole or window, therefore, it is referred to as peephole
optimization.
Increasing code speed: Peephole optimization seeks to improve the execution speed of generated
code by removing redundant instructions or unnecessary instructions.
Reduced code size: Peephole optimization seeks to reduce generated code size by replacing the long
sequence of instructions with shorter ones.
Getting rid of dead code: Peephole optimization seeks to get rid of dead code, such as unreachable
code, redundant assignments, or constant expressions that have no effect on the output of the
program.
Simplifying code: Peephole optimization also seeks to make generated code more understandable
and manageable by removing unnecessary complexities.
Step 1 – Identify the peephole: In the first step, the compiler finds the small sections of the
generated code that needs optimization.
Step 2 – Apply the optimization rule: After identification, in the second step, the compiler applies a
predefined set of optimization rules to the instructions in the peephole.
Step 3 – Evaluate the result: After applying optimization rules, the compiler evaluates the
optimized code to check whether the changes make the code better than the original in terms of
speed, size, or memory usage.
Step 4 – Repeat: The process is repeated by finding new peepholes and applying the optimization
rules until no more opportunities to optimize exist.
Constant Folding
Constant folding is a peephole optimization technique that involves evaluating constant expressions
at compile-time instead of run-time. This optimization technique can significantly improve the
performance of a program by reducing the number of computations performed at run-time.
Initial Code:
int x = 10 + 5;
int y = x * 2;
Optimized Code:
int x = 15;
int y = x * 2;
Explanation: In this code, the expression 10 + 5 is a constant expression, which means that its value
can be computed at compile-time. Instead of computing the value of the expression at run-time, the
compiler can replace the expression with its computed value, which is 15.
Strength Reduction:
Strength reduction is a peephole optimization technique that aims to replace computationally
expensive operations with cheaper ones, thereby improving the performance of a program.
Initial Code:
int x = y / 4;
Optimized Code:
int x = y >> 2;
Explanation: Division by a power of two can be replaced by a cheaper right-shift instruction; this particular
replacement is valid when y is unsigned or known to be non-negative.
Redundant Load and Store Elimination:
Redundant load and store elimination is a peephole optimization approach that seeks to reduce
redundant memory accesses in a program. This optimization works by finding code that performs the
same memory access several times and removing the redundant accesses.
Initial Code:
int x = 5;
int y = x + 10;
int z = x + 20;
Optimized Code:
int x = 5;
int y = x + 10;
int z = y + 10; // optimized line
Explanation: In this code, the variable x is loaded from memory twice: once in the second line and
once in the third line. However, since the value of x does not change between the two accesses, the
second access is redundant. In the optimized code, the redundant load of x is eliminated by replacing
the second access with the value of y, which is computed using the value of x in the second line.
Elimination of Redundant Instructions (Null Sequences):
Initial Code:
int x = 5;
int y = 10;
int z = x + y;
x = 5; // redundant instruction
Optimized Code:
int x = 5;
int y = 10;
int z = x + y;
Explanation: In this code, the value of x is assigned twice: once in the first line and once in the
fourth line. However, since the second assignment has no effect on the final output of the program, it
is a null sequence and can be eliminated.
Conclusion
Peephole optimization in compile design helps in improving the performance of programs by
eliminating redundant code and optimizing code sequences. These techniques involve analysing
small sequences of instructions and making targeted optimizations that can significantly improve the
performance of a program.
4. Explain in detail about how data flow analysis of structured programs is performed.
Data-flow analysis is a technique for gathering information about the possible set of values calculated at
various points in a computer program. A program's control-flow graph (CFG) is used to determine those parts
of a program to which a particular value assigned to a variable might propagate.
Data flow analysis of structured programs involves analysing the flow of data within a program that adheres
to a structured programming paradigm. Structured programs follow control flow structures like sequences,
conditionals, and loops, which can be analysed to understand the behaviour of variables and expressions
throughout the program. The data flow analysis techniques used in structured programs are often based on
control flow graphs (CFGs) and the analysis of variables and their values at different program points.
Here are some key aspects of data flow analysis in structured programs:
Control Flow Graph (CFG): A control flow graph is constructed to represent the control flow structure of
the structured program. It consists of nodes representing basic blocks of code and edges representing the flow
of control between these blocks. Each block typically corresponds to a sequence of statements without any
branches or loops.
Reaching Definitions: Reaching definitions analysis determines the set of definitions that can potentially
reach a particular program point. It identifies the definitions of variables that are valid at different points in the
program. This analysis helps understand variable dependencies and how values are propagated through the
program.
Use-Def and Def-Use Chains: Use-Def and Def-Use chains represent the relationships between variable uses
and their corresponding definitions. A use-def chain links a variable use to its defining point, while a def-use
chain links a definition point to its uses. These chains help in understanding how values flow from definitions
to uses.
Available Expressions: Available expressions analysis identifies expressions whose values are available at
different program points. It determines which expressions have been computed and can be reused without
recomputation. This analysis helps in eliminating redundant computations and improving performance.
Liveness Analysis: Liveness analysis determines the set of variables that are live (potentially used) at different
program points. It helps in understanding variable lifetimes and is crucial for register allocation, memory
management, and other optimization techniques.
Constant Propagation: Constant propagation analysis tracks the propagation of constant values through the
program. It identifies variables that can be replaced with their constant values, which eliminates unnecessary
computations and improves performance.
Dead Code Elimination: Dead code elimination identifies and eliminates code that is guaranteed to have no
effect on the program's behaviour or final output. This analysis helps in removing unused or redundant
statements, improving code efficiency.
These data flow analysis techniques are often performed iteratively, with information being propagated
through the control flow graph until a fixed point is reached. This iterative process ensures that all relevant
data flow information is obtained.
By performing data flow analysis on structured programs, compilers and program analyzers gain insights into
variable dependencies, constant values, live variables, and other properties that can be used for optimization,
error detection, and program understanding. These analyses contribute to improving code quality,
performance, and reliability.
In the above example, D1 is a reaching definition for block B2 since the value of x is not changed (it
is two only) but D1 is not a reaching definition for block B3 because the value of x is changed to x +
2. This means D1 is killed or redefined by D2.
Live Variable
A variable x is said to be live at a point p if its value at p may still be used before it is killed or redefined
by some later block. If the value is killed or redefined before any use, the variable is said to be dead.
It is generally used in register allocation and dead code elimination.
Example:
In the above example, the variable a is live at blocks B1,B2, B3 and B4 but is killed at block B5
since its value is changed from 2 to b + c. Similarly, variable b is live at block B3 but is killed at
block B4.
Busy Expression
An expression is said to be busy along a path if its evaluation occurs along that path, but none of its
operand definitions appears before it.
It is used for performing code movement optimization.
5. Elaborate the Concept of Data flow analysis with suitable algorithms and sample intermediate code.
CODE IMPROVING TRANSFORMATIONS