SS Unit 4
Compiler functions can be broadly categorized into several stages of the compilation process, each serving a specific purpose. These functions may vary slightly depending on the specific compiler and the target language or platform. In addition, modern compilers often include features such as debugging support, profiling, and support for language extensions or optimizations specific to certain platforms.
In compilers, grammars play a crucial role in defining the syntax of a programming language.
Grammars formally describe the structure of valid programs in a language using a set of
production rules. There are primarily two types of grammars used in compilers:
1. Regular Grammars:
Regular grammars describe languages that can be recognized by finite automata.
They are often used for lexical analysis, where tokens such as identifiers, keywords,
and literals are identified.
Regular expressions are commonly used to define regular grammars.
2. Context-Free Grammars (CFG):
Context-free grammars are more powerful than regular grammars and are used to
describe the syntax of programming languages.
They consist of a set of production rules that specify how different parts of a language
can be combined to form valid programs.
Context-free grammars are typically used in syntax analysis (parsing) to create parse
trees or abstract syntax trees (ASTs) representing the structure of the program.
Tools like yacc/bison (for C/C++) or ANTLR (for Java) are often used to generate
parsers based on context-free grammars.
A context-free grammar consists of the following components:
1. Terminals: These are the basic symbols of the language, such as keywords, identifiers, operators, and punctuation marks. Terminals are the actual tokens recognized by the lexer.
2. Non-terminals: These are symbols that represent syntactic categories or groups of terminals.
Non-terminals are used to define the structure of the language. For example, in a
programming language, a non-terminal might represent an expression or a statement.
3. Production Rules: Production rules specify how non-terminals can be expanded into
sequences of terminals and non-terminals. Each production rule consists of a non-terminal on
the left-hand side and a sequence of terminals and non-terminals on the right-hand side.
Production rules define the syntax of the language.
4. Start Symbol: This is a special non-terminal that represents the entire program or the top-
level syntactic construct of the language. Parsing begins with the start symbol and proceeds
to build the parse tree or AST according to the production rules.
For example:
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | number
In this grammar, E, T, and F are non-terminals representing expressions, terms, and factors,
respectively. +, -, *, /, (, ), and number are terminals representing addition, subtraction,
multiplication, division, parentheses, and numeric literals. The production rules specify how
expressions, terms, and factors can be combined to form valid arithmetic expressions.
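As a rough illustration (not taken from any particular compiler), such a grammar can be handled by a hand-written recursive-descent parser once the left recursion is expressed as iteration, a transformation discussed later under Left Recursion. The sketch below, in C, parses and directly evaluates an arithmetic expression; all function and variable names are ours.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

static const char *p;              /* current position in the input string */

static double parse_expr(void);    /* forward declaration: E -> T { (+|-) T } */

/* F -> ( E ) | number */
static double parse_factor(void) {
    while (isspace((unsigned char)*p)) p++;
    if (*p == '(') {
        p++;                               /* consume '(' */
        double v = parse_expr();
        if (*p == ')') p++;                /* consume ')' */
        else { fprintf(stderr, "expected ')'\n"); exit(1); }
        return v;
    }
    char *end;
    double v = strtod(p, &end);            /* read a numeric literal */
    if (end == p) { fprintf(stderr, "expected number\n"); exit(1); }
    p = end;
    return v;
}

/* T -> F { (*|/) F }  (left recursion rewritten as a loop) */
static double parse_term(void) {
    double v = parse_factor();
    for (;;) {
        while (isspace((unsigned char)*p)) p++;
        if (*p == '*') { p++; v *= parse_factor(); }
        else if (*p == '/') { p++; v /= parse_factor(); }
        else return v;
    }
}

/* E -> T { (+|-) T }  (left recursion rewritten as a loop) */
static double parse_expr(void) {
    double v = parse_term();
    for (;;) {
        while (isspace((unsigned char)*p)) p++;
        if (*p == '+') { p++; v += parse_term(); }
        else if (*p == '-') { p++; v -= parse_term(); }
        else return v;
    }
}

int main(void) {
    p = "2 + 3 * (4 - 1)";
    printf("%g\n", parse_expr());   /* prints 11 */
    return 0;
}

Because each loop consumes operators from left to right, the parser also gives + , -, * and / their usual left associativity.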
LEXICAL ANALYSIS
Lexical analysis is the first phase of the compilation process where the input source code is
converted into a sequence of tokens for further processing by the compiler. It involves
breaking the source code into meaningful units called tokens. These tokens are the smallest
meaningful units of the language, such as keywords, identifiers, literals, operators, and
punctuation symbols. Here's an overview of the lexical analysis process:
1. Tokenization:
The source code is read character by character.
Characters are grouped into tokens based on predefined patterns defined by regular
expressions.
Common tokens include keywords (e.g., if, while, int), identifiers (e.g., variable
names), literals (e.g., numeric constants, string literals), operators (e.g., +, -, *, /), and
punctuation symbols (e.g., ,, ;, (, )).
2. Skipping Whitespace and Comments:
Whitespace characters (spaces, tabs, line breaks) and comments are typically ignored
during lexical analysis.
This simplifies the tokenization process and improves efficiency.
3. Error Handling:
Lexical analyzers may detect and report lexical errors, such as invalid characters or
unrecognized tokens.
Error handling mechanisms may include emitting error messages and possibly
recovering from errors to continue processing the input.
4. Building Symbol Table (optional):
During lexical analysis, a symbol table may be constructed to store information about
identifiers encountered in the source code.
The symbol table keeps track of identifier names, their types, and possibly other
attributes needed for subsequent compilation phases like semantic analysis.
5. Output:
The output of the lexical analysis phase is typically a stream of tokens represented as
(token_type, token_value) pairs.
This token stream serves as input to the next phase of the compiler, the syntax
analysis (parsing) phase.
Lexical analysis is often implemented using techniques such as finite automata or regular
expressions. Tools like lex (or flex) and ANTLR provide convenient ways to specify lexical
rules and generate lexical analyzers automatically based on those rules. The main goal of
lexical analysis is to simplify the subsequent phases of compilation by providing a structured
representation of the input source code.
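To make the tokenization step concrete, here is a minimal hand-written scanner in C. It is only a sketch: the token names, the tiny keyword table and the fixed input string are illustrative assumptions, not the behavior of lex/flex or any specific compiler.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *src = "int count = 42 + x;";
    const char *keywords[] = { "int", "if", "while", NULL };
    const char *p = src;

    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }       /* skip whitespace */

        if (isalpha((unsigned char)*p) || *p == '_') {           /* identifier or keyword */
            char buf[64]; int n = 0;
            while ((isalnum((unsigned char)*p) || *p == '_') && n < 63) buf[n++] = *p++;
            buf[n] = '\0';
            int kw = 0;
            for (int i = 0; keywords[i]; i++)
                if (strcmp(buf, keywords[i]) == 0) kw = 1;
            printf("(%s, \"%s\")\n", kw ? "KEYWORD" : "IDENTIFIER", buf);
        } else if (isdigit((unsigned char)*p)) {                  /* numeric literal */
            char buf[64]; int n = 0;
            while (isdigit((unsigned char)*p) && n < 63) buf[n++] = *p++;
            buf[n] = '\0';
            printf("(NUMBER, \"%s\")\n", buf);
        } else {                                                  /* operator / punctuation */
            printf("(SYMBOL, \"%c\")\n", *p);
            p++;
        }
    }
    return 0;
}

Running it on the sample input prints one (token_type, token_value) pair per token, which is exactly the kind of stream the parser consumes.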
SYNTACTIC ANALYSIS
Syntactic analysis, also known as parsing, is the second phase in the compilation process
following lexical analysis. Its primary objective is to analyze the structure of the source code
according to the rules of the programming language's grammar. This phase takes the stream
of tokens produced by the lexical analyzer and constructs a hierarchical structure that
represents the syntactic relationships between the tokens. The most common output of the
syntactic analysis phase is an Abstract Syntax Tree (AST) or a Parse Tree.
Here's a detailed explanation of the syntactic analysis process:
1. Grammar Specification:
The syntactic analysis phase relies on a formal grammar that defines the syntax of the
programming language.
Context-free grammars (CFGs) are commonly used to specify the grammar rules.
The grammar consists of a set of production rules that describe how different
elements of the language (e.g., expressions, statements) can be combined.
2. Parsing Algorithm:
The parsing algorithm is responsible for analyzing the input token stream and
constructing a hierarchical representation of the program's syntactic structure.
Common parsing algorithms include Recursive Descent Parsing, LL Parsing, LR
Parsing, and Earley Parsing.
These algorithms use the grammar rules to guide the parsing process, determining
how tokens are combined to form higher-level constructs.
3. Parse Tree or Abstract Syntax Tree (AST):
As the parsing algorithm processes the input tokens, it builds either a Parse Tree or an
Abstract Syntax Tree.
Parse Tree: Represents the hierarchical structure of the program according to the
grammar rules. It includes all the details of the parsing process, including non-
terminals.
Abstract Syntax Tree (AST): Represents the essential structure of the program
without capturing all the details of the parsing process. It typically excludes non-
essential syntactic elements such as parentheses and semicolons.
ASTs are more commonly used in subsequent compilation phases, as they provide a
clearer and more concise representation of the program's structure.
4. Error Handling:
During syntactic analysis, parsing errors may occur if the input program does not
conform to the grammar rules.
Error handling mechanisms may include providing informative error messages,
recovering from errors to continue parsing, or halting the compilation process if the
errors are severe.
5. Semantic Actions (optional):
Some parsing algorithms allow semantic actions to be associated with grammar rules.
Semantic actions perform additional processing or validation during parsing, such as
type checking or building symbol tables.
6. Output:
The output of the syntactic analysis phase is typically the Parse Tree or AST
representing the syntactic structure of the input program.
This hierarchical representation serves as input to subsequent compilation phases,
such as semantic analysis, optimization, and code generation.
Syntactic analysis is a critical phase in the compilation process, as it lays the foundation for
further processing of the program and ensures that it conforms to the syntax rules of the
programming language.
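To show what an AST might look like in memory, here is a minimal sketch in C for arithmetic expressions. The struct layout and function names are illustrative choices, not the representation used by any particular compiler.

#include <stdio.h>
#include <stdlib.h>

/* A node is either a numeric leaf or a binary operator with two children. */
typedef struct Node {
    char op;                 /* '+', '-', '*', '/' for operators; 0 for a number leaf */
    double value;            /* meaningful only when op == 0 */
    struct Node *left, *right;
} Node;

static Node *leaf(double v) {
    Node *n = malloc(sizeof *n);
    n->op = 0; n->value = v; n->left = n->right = NULL;
    return n;
}

static Node *binary(char op, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->op = op; n->value = 0; n->left = l; n->right = r;
    return n;
}

/* Post-order walk: children first, then the operator, just as a later
   code-generation pass would visit the tree. */
static double eval(const Node *n) {
    if (n->op == 0) return n->value;
    double a = eval(n->left), b = eval(n->right);
    switch (n->op) {
        case '+': return a + b;
        case '-': return a - b;
        case '*': return a * b;
        default:  return a / b;
    }
}

int main(void) {
    /* AST for 2 + 3 * 4: the '*' subtree is deeper, so it is evaluated first */
    Node *tree = binary('+', leaf(2), binary('*', leaf(3), leaf(4)));
    printf("%g\n", eval(tree));      /* prints 14 */
    return 0;
}

Note that the tree already encodes precedence: multiplication sits deeper than addition, so no parentheses or semicolons need to be stored.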
Code generation is a crucial phase in the compilation process where the compiler translates
the intermediate representation (IR) of the source code, typically an Abstract Syntax Tree
(AST) or some other intermediate representation, into executable code for a target platform.
This phase involves converting the high-level language constructs into low-level machine
instructions or bytecode that can be executed by the target hardware or virtual machine.
Code generation involves the following steps:
1. Intermediate Representation (IR):
Before code generation, the compiler typically performs various analyses and
optimizations on the source code, resulting in an intermediate representation (IR).
The IR is a structured representation of the program's semantics and control flow,
often represented as an Abstract Syntax Tree (AST), Three-Address Code (TAC), or
similar format.
The IR simplifies the code generation process by providing a uniform representation
of the program's behavior.
2. Instruction Selection:
The compiler selects appropriate machine instructions or bytecode instructions to
represent each operation in the IR.
This involves mapping high-level language constructs (such as assignments,
conditionals, loops) to sequences of low-level instructions that perform equivalent
operations on the target platform.
3. Register Allocation:
Register allocation assigns variables and intermediate values to hardware registers or
memory locations.
Efficient register allocation minimizes the use of memory accesses, which can
significantly improve performance.
Various register allocation algorithms are used to optimize register usage, such as
graph coloring, linear scan, and iterative algorithms.
4. Instruction Scheduling:
Instruction scheduling reorders the generated instructions to optimize performance,
such as minimizing pipeline stalls or maximizing instruction-level parallelism.
The goal is to exploit the available resources of the target hardware efficiently and
reduce execution time.
5. Addressing Modes and Memory Management:
For architectures with memory addressing modes, the compiler selects appropriate
addressing modes for memory accesses.
Memory management techniques, such as stack allocation, heap allocation, and static
allocation, are also handled during code generation.
6. Optimization:
Code generation itself can include optimization techniques aimed at improving the
efficiency and performance of the generated code.
Common optimizations performed during code generation include constant folding,
loop unrolling, inline expansion, and instruction scheduling.
7. Platform-specific Code Generation:
Code generation may vary depending on the target platform, such as x86, ARM, or
JVM bytecode.
Platform-specific optimizations and code generation strategies are applied to produce
code optimized for the target architecture.
8. Output:
The output of the code generation phase is the generated machine code or bytecode
suitable for execution on the target platform.
This code is typically stored in an executable file or bytecode format that can be
executed by the target hardware or virtual machine.
Overall, code generation transforms the high-level semantics of the source code into efficient
low-level instructions or bytecode, enabling the program to be executed on the target
platform with optimal performance and resource utilization.
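As a toy illustration of instruction selection, the C sketch below maps three-address statements onto a made-up MOV/ADD/MUL instruction set (written in the MOV source, destination style used later in these notes). The TAC layout, opcode names and the single-register strategy are assumptions for illustration only; a real code generator would also perform register allocation and scheduling.

#include <stdio.h>

/* One three-address statement: result = arg1 op arg2 (or a plain copy). */
typedef struct {
    char op;                          /* '+', '*' or '=' for a copy */
    const char *result, *arg1, *arg2;
} TAC;

/* Emit pseudo-assembly for each statement; every statement uses R0. */
static void emit(const TAC *code, int n) {
    for (int i = 0; i < n; i++) {
        const TAC *t = &code[i];
        if (t->op == '=') {                       /* simple copy */
            printf("MOV %s, R0\n", t->arg1);
            printf("MOV R0, %s\n", t->result);
            continue;
        }
        printf("MOV %s, R0\n", t->arg1);          /* load the first operand */
        printf("%s %s, R0\n",                      /* R0 = R0 op arg2 */
               t->op == '+' ? "ADD" : "MUL", t->arg2);
        printf("MOV R0, %s\n", t->result);        /* store the result */
    }
}

int main(void) {
    /* t0 = b * c ; a = t0 + d */
    TAC code[] = {
        { '*', "t0", "b", "c" },
        { '+', "a",  "t0", "d" },
    };
    emit(code, 2);
    return 0;
}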
Compiler design involves making various decisions and choices at different stages of the compilation process; these decisions impact the design, efficiency, and performance of the compiler. Such design options influence the complexity, performance, and usability of the compiler, so compiler designers must consider them carefully to create compilers that meet the requirements of the target language, platform, and application domains.
Grammar :
It is a finite set of formal rules for generating syntactically correct sentences or meaningful
correct sentences.
Constituents of Grammar :
Grammar is basically composed of two basic elements –
1. Terminal Symbols –
Terminal symbols are those which are the components of the sentences generated using
a grammar and are represented using lowercase letters like a, b, c, etc.
2. Non-Terminal Symbols –
Non-Terminal Symbols are those symbols which take part in the generation of the
sentence but are not the component of the sentence. Non-Terminal Symbols are also
called Auxiliary Symbols and Variables. These symbols are represented using a capital
letter like A, B, C, etc.
Formal Definition of Grammar :
Any Grammar can be represented by 4 tuples – <N, T, P, S>
N – Finite Non-Empty Set of Non-Terminal Symbols.
T – Finite Set of Terminal Symbols.
P – Finite Non-Empty Set of Production Rules.
S – Start Symbol (Symbol from where we start producing our sentences or strings).
Production Rules :
A production or production rule in computer science is a rewrite rule specifying a symbol
substitution that can be recursively performed to generate new symbol sequences. It is of
the form α-> β where α is a Non-Terminal Symbol which can be replaced by β which is a
string of Terminal Symbols or Non-Terminal Symbols.
Example-1 :
Consider Grammar G1 = <N, T, P, S>
N = {A} #Set of non-terminal symbols
T = {a,b} #Set of terminal symbols
P = {A->Aa, A->Ab, A->a, A->b, A->ε} #Set of all production rules
S = {A} #Start Symbol
As A is the start symbol, we can produce Aa, Ab, a, b and ε, which can further produce
strings where A is replaced by the strings mentioned in the production rules; hence this
grammar can be used to produce strings of the form (a+b)*.
Derivation Of Strings :
A->a #using production rule 3
OR
A->Aa #using production rule 1
Aa->ba #using production rule 4
OR
A->Aa #using production rule 1
Aa->AAa #using production rule 1
AAa->bAa #using production rule 4
bAa->ba #using production rule 5
Example-2 :
Consider Grammar G2 = <N, T, P, S>
N = {A} #Set of non-terminals Symbols
T = {a} #Set of terminal symbols
P = {A->Aa, A->AAa, A->a, A->ε} #Set of all production rules
S = {A} #Start Symbol
As A is the start symbol, we can produce Aa, AAa, a and ε, which can further produce
strings where A is replaced by the strings mentioned in the production rules; hence this
grammar can be used to produce strings of the form (a)*.
Derivation Of Strings :
A->a #using production rule 3
OR
A->Aa #using production rule 1
Aa->aa #using production rule 3
OR
A->Aa #using production rule 1
Aa->AAa #using production rule 1
AAa->Aa #using production rule 4
Aa->aa #using production rule 3
Equivalent Grammars :
Grammars are said to be equivalent if they produce the same language.
Lexical analysis
Lexical analysis is the first phase of a compiler. It takes modified source code from language
preprocessors, written in the form of sentences. The lexical analyzer breaks this input into a
series of tokens, removing any whitespace or comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works
closely with the syntax analyzer. It reads character streams from the source code, checks for
legal tokens, and passes the data to the syntax analyzer when it demands.
Tokens
A lexeme is a sequence of characters in the source program that is matched as an instance of
a token. There are predefined rules for every lexeme to be identified as a valid token. These
rules are defined by grammar rules, by means of a pattern. A pattern explains what can be a
token, and these patterns are defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators and
punctuation symbols can be considered as tokens.
Specifications of Tokens
Let us understand how the language theory undertakes the following terms:
Alphabets
An alphabet is any finite set of symbols; for example, {0, 1} is the binary alphabet and
{a–z, A–Z} is the set of English-language letters.
Strings
Any finite sequence of alphabet symbols (characters) is called a string. The length of a string
is the total number of occurrences of symbols in it, e.g., the length of the string tutorialspoint
is 14 and is denoted by |tutorialspoint| = 14. A string having no symbols, i.e. a string of zero
length, is known as the empty string and is denoted by ε (epsilon).
Special symbols
A typical high-level language also contains special-symbol tokens, such as the assignment
operator (=) and the preprocessor marker (#).
Language
A language is considered as a set of strings over some finite alphabet. Computer languages
are treated as such sets, and mathematical set operations can be performed on them. Finite
languages can be described by means of regular expressions.
Regular Expressions
The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes
that belong to the language at hand. It searches for the pattern defined by the language rules.
Regular expressions have the capability to express finite languages by defining a pattern for
finite strings of symbols. The grammar defined by regular expressions is known as regular
grammar. The language defined by regular grammar is known as regular language.
Regular expression is an important notation for specifying patterns. Each pattern matches a
set of strings, so regular expressions serve as names for a set of strings. Programming
language tokens can be described by regular languages. The specification of regular
expressions is an example of a recursive definition. Regular languages are easy to understand
and have efficient implementation.
There are a number of algebraic laws that are obeyed by regular expressions, which can be
used to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are union of two languages (L ∪ M), concatenation of
two languages (LM), and the Kleene closure of a language (L*), i.e., zero or more
concatenations of strings from L.
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then:
(r) | (s) is a regular expression denoting L(r) ∪ L(s)
(r)(s) is a regular expression denoting L(r)L(s)
(r)* is a regular expression denoting (L(r))*
letter = [a – z] or [A – Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Decimal = (sign)?(digit)+
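The Decimal pattern above can be written as the equivalent regular expression [+-]?[0-9]+. The small C sketch below checks candidate lexemes against it using the POSIX regex library (regex.h is POSIX, not standard C, but widely available); the sample strings are our own.

#include <regex.h>
#include <stdio.h>

int main(void) {
    /* Decimal = (sign)?(digit)+ written as a POSIX extended regular expression. */
    regex_t re;
    if (regcomp(&re, "^[+-]?[0-9]+$", REG_EXTENDED) != 0) return 1;

    const char *samples[] = { "42", "-7", "+100", "12a", "--3" };
    for (int i = 0; i < 5; i++) {
        int ok = regexec(&re, samples[i], 0, NULL, 0) == 0;   /* 0 means a match */
        printf("%-5s : %s\n", samples[i], ok ? "Decimal" : "not a Decimal");
    }
    regfree(&re);
    return 0;
}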
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted
solution is to use finite automata for verification.
Finite Automata
A finite automaton is a state machine that takes a string of symbols as input and changes its
state accordingly. A finite automaton is a recognizer for regular expressions. When a
regular-expression string is fed into a finite automaton, it changes its state for each literal. If
the input string is successfully processed and the automaton reaches a final state, the string is
accepted, i.e., the string just fed was a valid token of the language at hand.
The transition function (δ) maps a state from the finite set of states (Q) together with an
input symbol from the input alphabet (Σ) to a state:
δ : Q × Σ → Q
States : States of an FA are represented by circles. The name of the state is written
inside the circle.
Start state : The state from where the automata starts, is known as start state. Start
state has an arrow pointed towards it.
Intermediate states : All intermediate states have at least two arrows; one pointing to
them and another pointing out from them.
Final state : If the input string is successfully parsed, the automaton is expected to be
in this state. A final state is represented by double circles; it may have any number of
arrows pointing to it and out from it.
Transition : The transition from one state to another happens when a desired symbol
is found in the input. Upon transition, the automaton can either move to the next state
or stay in the same state. Movement from one state to another is shown as a directed
arrow, where the arrow points to the destination state. If the automaton stays in the
same state, an arrow pointing from a state to itself is drawn.
Example : We assume an FA that accepts any binary value ending in the digit 1.
FA = ({q0, qf}, {0, 1}, δ, q0, {qf})
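A two-state automaton of this kind can be encoded directly in C, as sketched below: q0 is the start state and qf (reached after reading a 1) is the final state, so a string is accepted exactly when it ends in 1. The state names and test strings are ours.

#include <stdio.h>

/* States of the automaton: Q0 = start, QF = final ("last symbol seen was 1"). */
enum { Q0, QF };

static int accepts(const char *input) {
    int state = Q0;
    for (const char *q = input; *q; q++) {
        if (*q == '1') state = QF;        /* reading 1 moves to the final state */
        else if (*q == '0') state = Q0;   /* reading 0 moves back to the start state */
        else return 0;                    /* not a binary symbol: reject */
    }
    return state == QF;                   /* accept iff the last symbol read was 1 */
}

int main(void) {
    const char *tests[] = { "101", "100", "1", "0", "111" };
    for (int i = 0; i < 5; i++)
        printf("%s -> %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}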
When the lexical analyzer reads the source code, it scans the code letter by letter; when it
encounters a whitespace, operator symbol, or special symbol, it decides that a word is
completed.
For example:
int intvalue;
While scanning the lexeme up to ‘int’, the lexical analyzer cannot determine whether it is the
keyword int or the prefix of the identifier intvalue.
The Longest Match Rule states that the lexeme scanned should be determined based on the
longest match among all the tokens available.
The lexical analyzer also follows rule priority, where a reserved word, e.g., a keyword, of a
language is given priority over user input. That is, if the lexical analyzer finds a lexeme that
matches an existing reserved word, it treats the lexeme as that keyword; using it as a
user-defined identifier would be an error.
Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn
the basic concepts used in the construction of a parser.
We have seen that a lexical analyzer can identify tokens with the help of regular expressions
and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to
the limitations of the regular expressions. Regular expressions cannot check balancing
tokens, such as parenthesis. Therefore, this phase uses context-free grammar (CFG), which is
recognized by push-down automata.
This implies that every regular grammar is also context-free, but there exist constructs that
are beyond the scope of regular grammars. CFG is a helpful tool in describing the syntax of
programming languages.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce
terminologies used in parsing technology.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially
the start symbol) by the right side of a production, for that non-terminal.
Example
Consider the grammar G = (V, Σ, P, S), where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → ℇ | Z → 0Q0 | N → 1Q1 }
S={Q}
This grammar describes palindrome language, such as: 1001, 11100111, 00100, 1010101,
11111, etc.
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token
streams. The parser analyzes the source code (token stream) against the production rules to
detect any errors in the code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and
generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers
use error recovering strategies, which we will learn later in this chapter.
Derivation
A derivation is basically a sequence of production rules applied in order to obtain the input
string. During parsing, we take two decisions for some sentential form of the input: deciding
which non-terminal is to be replaced, and deciding which production rule is used to replace it.
To decide the order in which non-terminals are replaced, we have two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-
most derivation. The sentential form derived by the left-most derivation is called the left-
sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-
most derivation. The sentential form derived from the right-most derivation is called the
right-sentential form.
Example
Production rules:
E→E+E
E→E*E
E → id
Input string: id + id * id
One left-most derivation of this string is:
E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id
Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are
derived from the start symbol. The start symbol of the derivation becomes the root of the
parse tree. Let us see this by an example from the last topic.
E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id
Step 1:
E→E*E
Step 2:
E→E+E*E
Step 3:
E → id + E * E
Step 4:
E → id + id * E
Step 5:
E → id + id * id
In a parse tree:
All leaf nodes are terminals.
All interior nodes are non-terminals.
In-order traversal gives the original input string.
A parse tree depicts associativity and precedence of operators. The deepest sub-tree is
traversed first, therefore the operator in that sub-tree gets precedence over the operator which
is in the parent nodes.
Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or right
derivation) for at least one string.
Example
E→E+E
E→E–E
E → id
For the string id + id – id, the above grammar generates two parse trees: one corresponding
to (id + id) – id and the other to id + (id – id).
Associativity
If an operand has operators on both sides, the side on which the operator takes this operand is
decided by the associativity of those operators. If the operation is left-associative, then the
operand will be taken by the left operator or if the operation is right-associative, the right
operator will take the operand.
Example
Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If
the expression contains:
id op id op id
it will be evaluated as:
(id op id) op id
Operations like Exponentiation are right associative, i.e., the order of evaluation in the same
expression will be:
id op (id op id)
Precedence
If two different operators share a common operand, the precedence of operators decides
which will take the operand. That is, 2+3*4 can have two different parse trees, one
corresponding to (2+3)*4 and another corresponding to 2+(3*4). By setting precedence
among operators, this problem can be easily removed. As in the previous example,
mathematically * (multiplication) has precedence over + (addition), so the expression 2+3*4
will always be interpreted as:
2 + (3 * 4)
Left Recursion
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains
‘A’ itself as the left-most symbol. Left-recursive grammar is considered to be a problematic
situation for top-down parsers. Top-down parsers start parsing from the Start symbol, which
in itself is non-terminal. So, when the parser encounters the same non-terminal in its
derivation, it becomes hard for it to judge when to stop parsing the left non-terminal and it
goes into an infinite loop.
Example:
(1) A => Aα | β
(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is any non-terminal symbol and α
represents a string of terminals and non-terminals.
A top-down parser will first try to expand A, which in turn yields a string beginning with A
itself, and the parser may go into a loop forever.
The production
A => Aα | β
is converted into the following productions:
A => βA'
A' => αA' | ε
This does not impact the strings derived from the grammar, but it removes immediate left
recursion.
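For instance, applying this rewriting to the arithmetic grammar given earlier in these notes, E -> E + T | E - T | T, with α standing for "+ T" or "- T" and β for "T", gives:
E -> T E'
E' -> + T E' | - T E' | ε
which generates exactly the same strings but is no longer left-recursive, so a top-down parser can handle it.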
The second method is to use the following algorithm, which should eliminate all direct and
indirect left recursion:
START
Arrange the non-terminals in some order A1, A2, ..., An
for i := 1 to n
   for j := 1 to i – 1
      replace each production of the form Ai => Ajγ by
      Ai => δ1γ | δ2γ | ... | δkγ, where Aj => δ1 | δ2 | ... | δk are the current Aj productions
   end for
   eliminate the immediate left recursion among the Ai productions
end for
END
Example
S => Aα | β
A => Sd
Substituting the productions of S into A => Sd gives:
S => Aα | β
A => Aαd | βd
and then we remove the immediate left recursion using the first technique:
A => βdA'
A' => αdA' | ε
Now none of the productions has either direct or indirect left recursion.
Left Factoring
If more than one grammar production rule has a common prefix string, then the top-down
parser cannot decide which of the productions it should take to parse the string at hand.
Example
A ⟹ αβ | α𝜸 | …
In this case the parser cannot determine which production to follow to parse the string, as
both productions start with the same symbol α (a terminal or non-terminal). To remove this
confusion, we use a technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this
technique, we make one production for each common prefix, and the rest of the derivation is
added by new productions.
The grammar above is then left factored as:
A => αA'
A'=> β | 𝜸 | …
Now the parser has only one production per prefix which makes it easier to take decisions.
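As an illustrative case, the familiar conditional-statement rules
stmt => if expr then stmt else stmt | if expr then stmt
share the common prefix "if expr then stmt". Left factoring them gives
stmt => if expr then stmt stmt'
stmt' => else stmt | ε
so the parser consumes the common prefix first and only afterwards decides whether an else part is present.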
An important part of parser-table construction is creating the First and Follow sets. These
sets tell us which terminals can appear at particular positions in a derivation, and they are
used to fill the parsing table: the entry T[A, t] = α records that non-terminal A should be
replaced using production α when the lookahead terminal is t.
First Set
This set is created to know which terminal symbol can be derived in the first position by a
non-terminal. For example, if α → t β, where t is a terminal and β is any string of symbols,
then t is in FIRST(α).
Follow Set
This set is created to know which terminal symbols can appear immediately to the right of a
non-terminal in some sentential form derived from the start symbol; those terminals make up
the FOLLOW set of that non-terminal.
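As a small worked example, take the simplified, left-factored expression grammar E => T E', E' => + T E' | ε, T => id. Then:
FIRST(T) = { id }    FIRST(E') = { +, ε }    FIRST(E) = { id }
FOLLOW(E) = { $ }    FOLLOW(E') = { $ }      FOLLOW(T) = { +, $ }
so the parsing-table entry for (E, id) is E => T E', the entry for (E', +) is E' => + T E', and the entry for (E', $) is E' => ε.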
Syntax analyzers receive their input, in the form of tokens, from lexical analyzers. Lexical
analyzers are responsible for the validity of the tokens supplied to the syntax analyzer. Syntax
analyzers have the following drawbacks: they cannot determine whether a token is declared
before it is used, whether it is initialized before it is used, or whether an operation performed
on a token type is valid. Such checks are handled by the semantic-analysis phase.
Code generation can be considered as the final phase of compilation. Further optimization
can be applied to the generated code, but that can be seen as part of the code generation
phase itself. The code generated by the compiler is object code in some lower-level
programming language, for example, assembly language. We have seen that the source code
written in a higher-level language is transformed into a lower-level language that results in
lower-level object code, which should have the following minimum properties: it should
carry the exact meaning of the source code, and it should be efficient in terms of CPU usage
and memory management.
We will now see how the intermediate code is transformed into target object code (assembly
code, in this case).
A Directed Acyclic Graph (DAG) is a tool that depicts the structure of basic blocks, helps to
visualize the flow of values between the basic blocks, and offers opportunities for
optimization. A DAG provides easy transformation of basic blocks. Consider the following
three-address code for a basic block:
[t0 = a + b]
[t1 = t0 + c]
[d = t0 + t1]
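Described textually, the DAG for this block has leaves a, b and c; an interior node for a + b (the value t0); a node adding that result to c (the value t1); and a node adding t0 and t1 (the value d). The node for a + b has two parents, which makes the reuse of t0 explicit; this sharing is what makes DAGs convenient for spotting common subexpressions within a basic block.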
Peephole Optimization
This optimization technique works locally on the source code to transform it into an
optimized code. By locally, we mean a small portion of the code block at hand. These
methods can be applied on intermediate codes as well as on target codes. A bunch of
statements is analyzed and checked for the following possible optimizations:
Redundant instruction elimination
At the compilation level, the compiler searches for instructions that are redundant in nature.
Multiple load and store instructions may carry the same meaning even if some of them are
removed. For example:
MOV x, R0
MOV R0, R1
We can delete the first instruction and rewrite it as:
MOV x, R1
Unreachable code
Unreachable code is a part of the program code that is never accessed because of
programming constructs. Programmers may have accidentally written a piece of code that can
never be reached.
Example:
int add_ten(int x)
{
    return x + 10;
    printf("value of x is %d", x);   /* unreachable: control returns before this line */
}
In this code segment, the printf statement will never be executed because program control
returns before it can execute; hence the printf can be removed.
Flow of control optimization
There are instances in code where the program control jumps back and forth without
performing any significant task. These jumps can be removed. Consider the following chunk
of code:
...
MOV R1, R2
GOTO L1
...
L1 : GOTO L2
L2 : INC R1
In this code, label L1 can be removed as it merely passes control to L2. So instead of jumping
to L1 and then to L2, control can reach L2 directly, as shown below:
...
MOV R1, R2
GOTO L2
...
L2 : INC R1
Algebraic expression simplification
There are occasions where algebraic expressions can be made simpler. For example, the
expression a = a + 0 can be replaced by a itself, and the expression a = a + 1 can simply be
replaced by INC a.
Strength reduction
There are operations that consume more time and space. Their ‘strength’ can be reduced by
replacing them with other operations that consume less time and space, but produce the same
result.
For example, x * 2 can be replaced by x << 1, which involves only one left shift. Though
a * a and a² produce the same result, a * a (a single multiplication) is much more efficient to
implement than computing the power a².
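A small before/after illustration in C (the function names are ours, and the shift form assumes a non-negative operand):

#include <stdio.h>

/* Before strength reduction: multiplication by a power of two. */
static int times_two(int x) { return x * 2; }

/* After strength reduction: the same result computed with a single
   left shift (valid here because x is non-negative). */
static int times_two_reduced(int x) { return x << 1; }

int main(void) {
    for (int x = 0; x < 5; x++)
        printf("%d: %d %d\n", x, times_two(x), times_two_reduced(x));
    return 0;
}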
Accessing machine instructions
The target machine can provide more sophisticated instructions, which can perform specific
operations much more efficiently. If the target code can accommodate those instructions
directly, this will not only improve the quality of the code but also yield more efficient
results.
Code Generator
A code generator is expected to have an understanding of the target machine's runtime
environment and its instruction set. It should take the following issues into consideration
while generating code:
Target language : The code generator has to be aware of the nature of the target
language for which the code is to be transformed. That language may facilitate some
machine-specific instructions to help the compiler generate the code in a more
convenient way. The target machine can have either CISC or RISC processor
architecture.
IR Type : Intermediate representation has various forms. It can be in Abstract Syntax
Tree (AST) structure, Reverse Polish Notation, or 3-address code.
Selection of instruction : The code generator takes Intermediate Representation as
input and converts (maps) it into target machine’s instruction set. One representation
can have many ways (instructions) to convert it, so it becomes the responsibility of
the code generator to choose the appropriate instructions wisely.
Register allocation : A program has a number of values to be maintained during the
execution. The target machine’s architecture may not allow all of the values to be kept
in CPU registers. The code generator decides which values to keep in registers, and
which registers to use to hold them.
Ordering of instructions : Finally, the code generator decides the order in which the
instructions will be executed, creating a schedule for them.
Descriptors
The code generator has to track both the registers (for availability) and addresses (location of
values) while generating the code. For both of them, the following two descriptors are used:
Register descriptor : Register descriptor is used to inform the code generator about
the availability of registers. Register descriptor keeps track of values stored in each
register. Whenever a new register is required during code generation, this descriptor is
consulted for register availability.
Address descriptor : Values of the names (identifiers) used in the program might be
stored at different locations while in execution. Address descriptors are used to keep
track of memory locations where the values of identifiers are stored. These locations
may include CPU registers, heaps, stacks, memory or a combination of the mentioned
locations.
The code generator keeps both descriptors updated in real time. For a load statement, LD R1,
x, the code generator updates the register descriptor of R1 to record that it now holds the
value of x, and updates the address descriptor of x to record that one instance of x is now in
R1.
Code Generation
Basic blocks consist of sequences of three-address instructions, and the code generator takes
these sequences of instructions as input.
Note : If the value of a name is found in more than one place (register, cache, or memory),
the register’s value is preferred over the cache and main memory, and likewise the cache’s
value is preferred over main memory. Main memory is given the least preference.
getReg : Code generator uses getReg function to determine the status of available registers
and the location of name values. getReg works as follows:
If variable Y is already in register R, it uses that register.
Else if some register R is available, it uses that register.
If neither of the above options is possible, it chooses a register that requires the
minimal number of load and store instructions.
For an instruction x = y OP z, the code generator may perform the following actions. Let us
assume that L is the location (preferably a register) where the output of y OP z is to be saved:
the generator calls getReg to choose L; it consults the address descriptor of y to find y's
current location y' and, if y is not already in L, emits a move from y' to L; it likewise
determines the current location z' of z; it then emits the instruction for OP applied to z' and L;
and finally it updates the descriptors to record that x now resides in L.
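As an illustrative trace (using the MOV source, destination style of the earlier examples, with the register names R1 and R2 and the ADD mnemonic chosen arbitrarily): suppose the statement is x = y + z, y currently lives only in memory, and z is already held in R2. If getReg returns L = R1, the generator emits
MOV y, R1
ADD R2, R1
and then updates the register descriptor of R1 and the address descriptor of x to record that the value of x is now in R1.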
Other code constructs like loops and conditional statements are translated into assembly
language in the usual way, using labels and conditional jump instructions.