
SS Unit 4


BASIC COMPILER FUNCTIONS:

Compiler functions can be broadly categorized into several stages in the compilation process,
each serving a specific purpose:

1. Lexical Analysis (Scanner):


 Breaks the source code into a sequence of tokens.
 Removes comments and whitespace.
 Identifies identifiers, keywords, constants, operators, etc.
2. Syntax Analysis (Parser):
 Analyzes the structure of the source code according to the grammar rules of the
programming language.
 Builds a parse tree or abstract syntax tree (AST) representing the syntactic structure of
the program.
3. Semantic Analysis:
 Checks the semantics of the program.
 Performs type checking to ensure that types are used consistently and correctly.
 Identifies and reports semantic errors.
4. Intermediate Code Generation:
 Transforms the source code or AST into an intermediate representation (IR).
 The IR is often simpler and more uniform than the source language, making further
analysis and optimization easier.
5. Optimization:
 Analyzes and transforms the intermediate code to improve its performance or reduce
its size.
 Common optimizations include constant folding, loop optimization, and inlining.
6. Code Generation:
 Translates the optimized intermediate code into machine code or bytecode suitable for
execution on the target platform.
 Allocates registers and memory locations for variables.
 Generates assembly code or machine instructions.
7. Symbol Table Management:
 Maintains information about identifiers (variables, functions, etc.) encountered during
compilation.
 Stores attributes such as type, scope, and memory location.
 Used during semantic analysis, code generation, and optimization.
8. Error Handling:
 Detects and reports errors in the source code.
 Provides informative error messages to aid debugging.
 May include syntax errors, semantic errors, and other issues.
9. Linking and Loading (optional for some compilers):
 Combines multiple object files and libraries into a single executable or shared library.
 Resolves references between different modules.
 Loads the executable into memory for execution.

These functions may vary slightly depending on the specific compiler and the target language
or platform. Additionally, modern compilers often include additional features such as
debugging support, profiling, and support for language extensions or optimizations specific to
certain platforms.
In compilers, grammars play a crucial role in defining the syntax of a programming language.
Grammars formally describe the structure of valid programs in a language using a set of
production rules. There are primarily two types of grammars used in compilers:

1. Regular Grammars:
 Regular grammars describe languages that can be recognized by finite automata.
 They are often used for lexical analysis, where tokens such as identifiers, keywords,
and literals are identified.
 Regular expressions are commonly used to define regular grammars.
2. Context-Free Grammars (CFG):
 Context-free grammars are more powerful than regular grammars and are used to
describe the syntax of programming languages.
 They consist of a set of production rules that specify how different parts of a language
can be combined to form valid programs.
 Context-free grammars are typically used in syntax analysis (parsing) to create parse
trees or abstract syntax trees (ASTs) representing the structure of the program.
 Tools like yacc/bison (for C/C++) or ANTLR (for Java) are often used to generate
parsers based on context-free grammars.

A context-free grammar consists of four components:

1. Terminals: These are the basic symbols of the language, such as keywords, identifiers,
operators, and punctuation marks. Terminals are the actual tokens recognized by the lexer.
2. Non-terminals: These are symbols that represent syntactic categories or groups of terminals.
Non-terminals are used to define the structure of the language. For example, in a
programming language, a non-terminal might represent an expression or a statement.
3. Production Rules: Production rules specify how non-terminals can be expanded into
sequences of terminals and non-terminals. Each production rule consists of a non-terminal on
the left-hand side and a sequence of terminals and non-terminals on the right-hand side.
Production rules define the syntax of the language.
4. Start Symbol: This is a special non-terminal that represents the entire program or the top-
level syntactic construct of the language. Parsing begins with the start symbol and proceeds
to build the parse tree or AST according to the production rules.

Here's a simple example of a context-free grammar for a basic arithmetic expression
language:

E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | number

In this grammar, E, T, and F are non-terminals representing expressions, terms, and factors,
respectively. +, -, *, /, (, ), and number are terminals representing addition, subtraction,
multiplication, division, parentheses, and numeric literals. The production rules specify how
expressions, terms, and factors can be combined to form valid arithmetic expressions.
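A hand-written parser for this grammar can be sketched directly in C. The sketch below is
illustrative only: the left-recursive rules E -> E + T and T -> T * F are implemented with
loops (which preserves their left associativity), values are evaluated on the fly instead of
building a tree, and the function names are invented for this example.

/* Minimal recursive-descent evaluator for the grammar above (a sketch). */
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static const char *p;                      /* current position in the input  */
static double parse_E(void);

static double parse_F(void)                /* F -> ( E ) | number            */
{
    while (isspace((unsigned char)*p)) p++;
    if (*p == '(') {
        p++;                               /* consume '('                    */
        double v = parse_E();
        if (*p == ')') p++;                /* consume ')'                    */
        return v;
    }
    char *end;
    double v = strtod(p, &end);            /* numeric literal                */
    p = end;
    return v;
}

static double parse_T(void)                /* T -> T * F | T / F | F         */
{
    double v = parse_F();
    while (isspace((unsigned char)*p)) p++;
    while (*p == '*' || *p == '/') {
        char op = *p++;
        double rhs = parse_F();
        v = (op == '*') ? v * rhs : v / rhs;
        while (isspace((unsigned char)*p)) p++;
    }
    return v;
}

static double parse_E(void)                /* E -> E + T | E - T | T         */
{
    double v = parse_T();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        double rhs = parse_T();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}

int main(void)
{
    p = "2 + 3 * (4 - 1)";
    printf("2 + 3 * (4 - 1) = %g\n", parse_E());   /* prints 11 */
    return 0;
}

Each non-terminal of the grammar becomes one parsing function, and the terminal number is
handled directly by strtod.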
LEXICAL ANALYSIS
Lexical analysis is the first phase of the compilation process where the input source code is
converted into a sequence of tokens for further processing by the compiler. It involves
breaking the source code into meaningful units called tokens. These tokens are the smallest
meaningful units of the language, such as keywords, identifiers, literals, operators, and
punctuation symbols. Here's an overview of the lexical analysis process:

1. Tokenization:
 The source code is read character by character.
 Characters are grouped into tokens based on predefined patterns defined by regular
expressions.
 Common tokens include keywords (e.g., if, while, int), identifiers (e.g., variable
names), literals (e.g., numeric constants, string literals), operators (e.g., +, -, *, /), and
punctuation symbols (e.g., ,, ;, (, )).
2. Skipping Whitespace and Comments:
 Whitespace characters (spaces, tabs, line breaks) and comments are typically ignored
during lexical analysis.
 This simplifies the tokenization process and improves efficiency.
3. Error Handling:
 Lexical analyzers may detect and report lexical errors, such as invalid characters or
unrecognized tokens.
 Error handling mechanisms may include emitting error messages and possibly
recovering from errors to continue processing the input.
4. Building Symbol Table (optional):
 During lexical analysis, a symbol table may be constructed to store information about
identifiers encountered in the source code.
 The symbol table keeps track of identifier names, their types, and possibly other
attributes needed for subsequent compilation phases like semantic analysis.
5. Output:
 The output of the lexical analysis phase is typically a stream of tokens represented as
(token_type, token_value) pairs.
 This token stream serves as input to the next phase of the compiler, the syntax
analysis (parsing) phase.

Lexical analysis is often implemented using techniques such as finite automata or regular
expressions. Tools like lex (or flex) and ANTLR provide convenient ways to specify lexical
rules and generate lexical analyzers automatically based on those rules. The main goal of
lexical analysis is to simplify the subsequent phases of compilation by providing a structured
representation of the input source code.
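As a rough sketch of these steps, the following hand-written scanner tokenizes a short C-like
declaration and prints (token_type, token_value) pairs. It is only an illustration; production
lexers are normally generated from regular expressions by tools such as lex/flex, and the token
names used here (KEYWORD, IDENTIFIER, CONSTANT, SYMBOL) are invented for the example.

/* A minimal hand-written scanner sketch. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

static const char *keywords[] = { "int", "if", "while", "return" };

static int is_keyword(const char *s)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i]) == 0) return 1;
    return 0;
}

int main(void)
{
    const char *src = "int value = 100;";
    const char *p = src;
    char lexeme[64];

    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }     /* skip whitespace     */

        if (isalpha((unsigned char)*p) || *p == '_') {          /* identifier/keyword  */
            int n = 0;
            while (isalnum((unsigned char)*p) || *p == '_') lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("(%s, \"%s\")\n", is_keyword(lexeme) ? "KEYWORD" : "IDENTIFIER", lexeme);
        } else if (isdigit((unsigned char)*p)) {                 /* numeric constant    */
            int n = 0;
            while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
            lexeme[n] = '\0';
            printf("(CONSTANT, \"%s\")\n", lexeme);
        } else {                                                 /* operator/punctuation */
            printf("(SYMBOL, \"%c\")\n", *p++);
        }
    }
    return 0;
}

Running it on int value = 100; prints (KEYWORD, "int"), (IDENTIFIER, "value"), (SYMBOL, "="),
(CONSTANT, "100"), and (SYMBOL, ";").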

SYNTACTIC ANALYSIS
Syntactic analysis, also known as parsing, is the second phase in the compilation process
following lexical analysis. Its primary objective is to analyze the structure of the source code
according to the rules of the programming language's grammar. This phase takes the stream
of tokens produced by the lexical analyzer and constructs a hierarchical structure that
represents the syntactic relationships between the tokens. The most common output of the
syntactic analysis phase is an Abstract Syntax Tree (AST) or a Parse Tree.
Here's a detailed explanation of the syntactic analysis process:

1. Grammar Specification:
 The syntactic analysis phase relies on a formal grammar that defines the syntax of the
programming language.
 Context-free grammars (CFGs) are commonly used to specify the grammar rules.
 The grammar consists of a set of production rules that describe how different
elements of the language (e.g., expressions, statements) can be combined.
2. Parsing Algorithm:
 The parsing algorithm is responsible for analyzing the input token stream and
constructing a hierarchical representation of the program's syntactic structure.
 Common parsing algorithms include Recursive Descent Parsing, LL Parsing, LR
Parsing, and Earley Parsing.
 These algorithms use the grammar rules to guide the parsing process, determining
how tokens are combined to form higher-level constructs.
3. Parse Tree or Abstract Syntax Tree (AST):
 As the parsing algorithm processes the input tokens, it builds either a Parse Tree or an
Abstract Syntax Tree.
 Parse Tree: Represents the hierarchical structure of the program according to the
grammar rules. It includes all the details of the parsing process, including non-
terminals.
 Abstract Syntax Tree (AST): Represents the essential structure of the program
without capturing all the details of the parsing process. It typically excludes non-
essential syntactic elements such as parentheses and semicolons.
 ASTs are more commonly used in subsequent compilation phases, as they provide a
clearer and more concise representation of the program's structure.
4. Error Handling:
 During syntactic analysis, parsing errors may occur if the input program does not
conform to the grammar rules.
 Error handling mechanisms may include providing informative error messages,
recovering from errors to continue parsing, or halting the compilation process if the
errors are severe.
5. Semantic Actions (optional):
 Some parsing algorithms allow semantic actions to be associated with grammar rules.
 Semantic actions perform additional processing or validation during parsing, such as
type checking or building symbol tables.
6. Output:
 The output of the syntactic analysis phase is typically the Parse Tree or AST
representing the syntactic structure of the input program.
 This hierarchical representation serves as input to subsequent compilation phases,
such as semantic analysis, optimization, and code generation.

Syntactic analysis is a critical phase in the compilation process, as it lays the foundation for
further processing of the program and ensures that it conforms to the syntax rules of the
programming language.
Code generation is a crucial phase in the compilation process where the compiler translates
the intermediate representation (IR) of the source code, typically an Abstract Syntax Tree
(AST) or some other intermediate representation, into executable code for a target platform.
This phase involves converting the high-level language constructs into low-level machine
instructions or bytecode that can be executed by the target hardware or virtual machine.

Code generation involves the following steps:

1. Intermediate Representation (IR):

 Before code generation, the compiler typically performs various analysis and
optimizations on the source code, resulting in an intermediate representation (IR).
 The IR is a structured representation of the program's semantics and control flow,
often represented as an Abstract Syntax Tree (AST), Three-Address Code (TAC), or
similar format.
 The IR simplifies the code generation process by providing a uniform representation
of the program's behavior.
2. Instruction Selection:
 The compiler selects appropriate machine instructions or bytecode instructions to
represent each operation in the IR.
 This involves mapping high-level language constructs (such as assignments,
conditionals, loops) to sequences of low-level instructions that perform equivalent
operations on the target platform.
3. Register Allocation:
 Register allocation assigns variables and intermediate values to hardware registers or
memory locations.
 Efficient register allocation minimizes the use of memory accesses, which can
significantly improve performance.
 Various register allocation algorithms are used to optimize register usage, such as
graph coloring, linear scan, and iterative algorithms.
4. Instruction Scheduling:
 Instruction scheduling reorders the generated instructions to optimize performance,
such as minimizing pipeline stalls or maximizing instruction-level parallelism.

 The goal is to exploit the available resources of the target hardware efficiently and
reduce execution time.
5. Addressing Modes and Memory Management:
 For architectures with memory addressing modes, the compiler selects appropriate
addressing modes for memory accesses.
 Memory management techniques, such as stack allocation, heap allocation, and static
allocation, are also handled during code generation.
6. Optimization:
 Code generation itself can include optimization techniques aimed at improving the
efficiency and performance of the generated code.
 Common optimizations performed during code generation include constant folding,
loop unrolling, inline expansion, and instruction scheduling.
7. Platform-specific Code Generation:
 Code generation may vary depending on the target platform, such as x86, ARM, or
JVM bytecode.
 Platform-specific optimizations and code generation strategies are applied to produce
code optimized for the target architecture.
8. Output:
 The output of the code generation phase is the generated machine code or bytecode
suitable for execution on the target platform.
 This code is typically stored in an executable file or bytecode format that can be
executed by the target hardware or virtual machine.

Overall, code generation transforms the high-level semantics of the source code into efficient
low-level instructions or bytecode, enabling the program to be executed on the target
platform with optimal performance and resource utilization.

COMPILER DESIGN OPTIONS

Compiler design involves making various decisions and choices at different stages of the
compilation process. These decisions impact the design, efficiency, and performance of the
compiler. Here are some key compiler design options:

1. Front-end Language Support:


 Choosing the programming languages supported by the compiler's front end. Some
compilers target specific languages, while others support multiple languages.
 Determining whether the compiler will handle a high-level programming language,
assembly language, or both.
2. Parsing Technique:
 Selecting the parsing technique for syntactic analysis, such as Recursive Descent
Parsing, LL Parsing, LR Parsing, or Earley Parsing.
 Choosing between hand-written parsers or parser generators like yacc/bison, ANTLR,
or language-specific parsing libraries.
3. Intermediate Representation (IR):
 Designing the intermediate representation (IR) used internally by the compiler. This
includes selecting an appropriate format for representing the program's semantics and
control flow.
 Deciding whether to use Abstract Syntax Trees (ASTs), Three-Address Code (TAC),
Control Flow Graphs (CFGs), or other IR formats.
4. Optimization Strategies:
 Choosing optimization techniques to improve the performance and efficiency of
generated code. Common optimizations include constant folding, loop optimization,
inlining, and register allocation.
 Determining the level of optimization (e.g., -O0, -O1, -O2, -O3) and whether
optimizations are performed during compilation or deferred to a separate optimization
phase.
5. Code Generation:
 Selecting the target architecture and generating machine code or bytecode optimized
for the target platform.
 Choosing between different instruction selection, register allocation, and instruction
scheduling algorithms to produce efficient code.
 Supporting multiple output formats such as executable files, shared libraries, or
bytecode for virtual machines.
6. Error Handling:
 Designing error handling mechanisms to detect and report syntax errors, semantic
errors, and other issues in the source code.
 Determining how errors are reported to users and whether the compiler attempts to
recover from errors to continue compilation.
7. Platform-specific Optimization:
 Implementing platform-specific optimizations to exploit the features and resources of
the target hardware architecture.
 Tailoring code generation and optimization strategies for specific CPU architectures,
instruction sets, or memory hierarchies.
8. Back-end Support:
 Supporting various back ends to generate code for different target platforms,
operating systems, or execution environments.
 Implementing support for cross-compilation to generate code for platforms different
from the one hosting the compiler.
9. Integration with Development Tools:
 Integrating the compiler with development tools such as text editors, IDEs, debuggers,
and version control systems.
 Providing support for features like syntax highlighting, code completion, debugging
symbols, and profiling information.

These design options influence the complexity, performance, and usability of the compiler.
Compiler designers must carefully consider these options to create compilers that meet the
requirements of the target language, platform, and application domains.

Grammar :
It is a finite set of formal rules for generating syntactically correct sentences or meaningful
correct sentences.

Constituents Of Grammar :
A grammar is basically composed of two basic elements –
1. Terminal Symbols –
Terminal symbols are the components of the sentences generated using a grammar and are
represented using lowercase letters like a, b, c, etc.
2. Non-Terminal Symbols –
Non-Terminal Symbols are those symbols which take part in the generation of the
sentence but are not themselves components of the sentence. Non-Terminal Symbols are also
called Auxiliary Symbols and Variables. These symbols are represented using capital
letters like A, B, C, etc.
Formal Definition of Grammar :
Any Grammar can be represented by 4 tuples – <N, T, P, S>
 N – Finite Non-Empty Set of Non-Terminal Symbols.
 T – Finite Set of Terminal Symbols.
 P – Finite Non-Empty Set of Production Rules.
 S – Start Symbol (Symbol from where we start producing our sentences or strings).
Production Rules :
A production or production rule in computer science is a rewrite rule specifying a symbol
substitution that can be recursively performed to generate new symbol sequences. It is of
the form α-> β where α is a Non-Terminal Symbol which can be replaced by β which is a
string of Terminal Symbols or Non-Terminal Symbols.

Example-1 :
Consider Grammar G1 = <N, T, P, S>
N = {A} #Set of non-terminal symbols
T = {a,b} #Set of terminal symbols
P = {A->Aa, A->Ab, A->a, A->b, A->ε} #Set of all production rules
S = {A} #Start Symbol
As the start symbol is A, we can produce Aa, Ab, a, b, or ε; in the first two cases A can again
be replaced using the production rules, and hence this grammar can be used to produce strings
of the form (a+b)*.

Derivation Of Strings :
A->a #using production rule 3
OR
A->Aa #using production rule 1
Aa->ba #using production rule 4
OR
A->Aa #using production rule 1
Aa->Aba #using production rule 2
Aba->ba #using production rule 5 (A->ε)

Example-2 :
Consider Grammar G2 = <N, T, P, S>
N = {A} #Set of non-terminals Symbols
T = {a} #Set of terminal symbols
P = {A->Aa, A->AAa, A->a, A->ε} #Set of all production rules
S = {A} #Start Symbol
As the start symbol is A, we can produce Aa, AAa, a, or ε; A can again be replaced using the
production rules, and hence this grammar can be used to produce strings of the form (a)*.

Derivation Of Strings :
A->a #using production rule 3
OR
A->Aa #using production rule 1
Aa->aa #using production rule 3
OR
A->Aa #using production rule 1
Aa->AAaa #using production rule 2
AAaa->Aaa #using production rule 4 (A->ε)
Aaa->aaa #using production rule 3

Equivalent Grammars :
Grammars are said to be equivalent if they produce the same language.

Different Types Of Grammars :


Grammars can be divided on the basis of –
 Type of Production Rules
 Number of Derivation Trees
 Number of Strings

Lexical analysis

Lexical analysis is the first phase of a compiler. It takes the modified source code from
language preprocessors, written in the form of sentences, and breaks it into a series of tokens,
removing any whitespace and comments in the source code.

If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works
closely with the syntax analyzer. It reads character streams from the source code, checks for
legal tokens, and passes the data to the syntax analyzer when it demands.

Tokens

A lexeme is a sequence of characters (alphanumeric) that forms a token. There are some
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
grammar rules, by means of a pattern. A pattern explains what can be a token, and these
patterns are defined by means of regular expressions.
In programming language, keywords, constants, identifiers, strings, numbers, operators and
punctuations symbols can be considered as tokens.

For example, in C language, the variable declaration line

int value = 100;

contains the tokens:

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).

Specifications of Tokens

Let us understand how the language theory undertakes the following terms:

Alphabets

Any finite set of symbols is called an alphabet. For example, {0,1} is the binary alphabet,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the set of
English letters.

Strings

Any finite sequence of alphabet symbols (characters) is called a string. The length of a string
is the total number of symbols in it, e.g., the length of the string tutorialspoint is 14 and
is denoted by |tutorialspoint| = 14. A string having no symbols, i.e. a string of zero length,
is known as an empty string and is denoted by ε (epsilon).

Special symbols

A typical high-level language contains the following classes of symbols:

Arithmetic Symbols      Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)

Punctuation             Comma (,), Semicolon (;), Dot (.), Arrow (->)

Assignment              =

Special Assignment      +=, /=, *=, -=

Comparison              ==, !=, <, <=, >, >=

Preprocessor            #

Location Specifier      &

Logical                 &, &&, |, ||, !

Shift Operators         >>, >>>, <<, <<<

Language

A language is a set of strings over some finite alphabet. Computer languages are sets of
strings, and mathematically set operations can be performed on them. Regular languages can be
described by means of regular expressions.

Regular Expressions

The lexical analyzer needs to scan and identify only a finite set of valid string/token/lexeme
that belong to the language in hand. It searches for the pattern defined by the language rules.

Regular expressions have the capability to express finite languages by defining a pattern for
finite strings of symbols. The grammar defined by regular expressions is known as regular
grammar. The language defined by regular grammar is known as regular language.

Regular expression is an important notation for specifying patterns. Each pattern matches a
set of strings, so regular expressions serve as names for a set of strings. Programming
language tokens can be described by regular languages. The specification of regular
expressions is an example of a recursive definition. Regular languages are easy to understand
and have efficient implementation.

There are a number of algebraic laws that are obeyed by regular expressions, which can be
used to manipulate regular expressions into equivalent forms.

Operations

The various operations on languages are:

 Union of two languages L and M is written as


L U M = {s | s is in L or s is in M}
 Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
 The Kleene Closure of a language L is written as
L* = Zero or more occurrence of language L.
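For example, if L = {a, b} and M = {c, d}, then:

L U M = {a, b, c, d}
LM = {ac, ad, bc, bd}
L* = {ε, a, b, aa, ab, ba, bb, aaa, …}, i.e., all strings made of a's and b's, including the
empty string.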

Notations

If r and s are regular expressions denoting the languages L(r) and L(s), then

 Union : (r)|(s) is a regular expression denoting L(r) U L(s)


 Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
 Kleene closure : (r)* is a regular expression denoting (L(r))*
 (r) is a regular expression denoting L(r)

Precedence and Associativity

 *, concatenation (.), and | (pipe sign) are left associative


 * has the highest precedence
 Concatenation (.) has the second highest precedence.
 | (pipe sign) has the lowest precedence of all.

Representing valid tokens of a language in regular expression

If x is a regular expression, then:

x* means zero or more occurrences of x,

i.e., it can generate { ε, x, xx, xxx, xxxx, … }

x+ means one or more occurrences of x,

i.e., it can generate { x, xx, xxx, xxxx, … }, which is equivalent to x.x*

x? means at most one occurrence of x,

i.e., it can generate either { x } or { ε }.

[a-z] is all lower-case alphabets of English language.

[A-Z] is all upper-case alphabets of English language.

[0-9] is all natural digits used in mathematics.

Representing occurrences of symbols using regular expressions

letter = [a-z] | [A-Z]

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]

sign = + | -

Representation of language tokens using regular expressions

Decimal = (sign)?(digit)+

Identifier = (letter)(letter | digit)*
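As a small sketch, these two patterns can also be checked by hand in C using the standard
character-classification functions; the function names is_identifier and is_decimal below are
invented for the example.

#include <stdio.h>
#include <ctype.h>

/* Identifier = (letter)(letter | digit)* */
static int is_identifier(const char *s)
{
    if (!isalpha((unsigned char)*s)) return 0;       /* must start with a letter */
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s)) return 0;    /* then letters or digits   */
    return 1;
}

/* Decimal = (sign)?(digit)+ */
static int is_decimal(const char *s)
{
    if (*s == '+' || *s == '-') s++;                  /* optional sign            */
    if (!isdigit((unsigned char)*s)) return 0;        /* at least one digit       */
    for (; *s; s++)
        if (!isdigit((unsigned char)*s)) return 0;
    return 1;
}

int main(void)
{
    printf("%d %d\n", is_identifier("count1"), is_identifier("1count"));  /* 1 0 */
    printf("%d %d\n", is_decimal("-42"), is_decimal("4.2"));              /* 1 0 */
    return 0;
}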

The remaining problem for the lexical analyzer is how to recognize, in the input, the strings
that match the regular expressions specifying the token patterns of a language. A well-accepted
solution is to use finite automata for this recognition.

Finite Automata

A finite automaton is a state machine that takes a string of symbols as input and changes its
state accordingly. A finite automaton is a recognizer for regular expressions. When an input
string is fed into the finite automaton, it changes its state for each symbol. If the input
string is successfully processed and the automaton reaches a final state, the string is
accepted, i.e., the string just fed was a valid token of the language in hand.

The mathematical model of finite automata consists of:

 Finite set of states (Q)


 Finite set of input symbols (Σ)
 One Start state (q0)
 Set of final states (qf)
 Transition function (δ)

The transition function (δ) maps a state and an input symbol to a state:
δ : Q × Σ → Q

Finite Automata Construction

Let L(r) be a regular language recognized by some finite automata (FA).

 States : States of the FA are represented by circles. The name of each state is written
inside its circle.
 Start state : The state from where the automaton starts is known as the start state. The
start state has an arrow pointing towards it.
 Intermediate states : Every intermediate state has at least two arrows; one pointing to
it and another pointing out from it.
 Final state : If the input string is successfully parsed, the automaton is expected to be
in this state. A final state is represented by double circles.
 Transition : The transition from one state to another state happens when a desired
symbol in the input is found. Upon transition, automata can either move to next state
or stay in the same state. Movement from one state to another is shown as a directed
arrow, where the arrows points to the destination state. If automata stays on the same
state, an arrow pointing from a state to itself is drawn.

Example : We assume the FA accepts any three-digit binary value ending in the digit 1.
FA = {Q(q0, qf), Σ(0,1), q0, qf, δ}
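The transition diagram for this FA is not reproduced here, but the same machine can be sketched
as a transition table in C. The state names below are invented for the example; ACCEPT plays
the role of the final state qf, and DEAD is a trap state for inputs longer than three digits.

#include <stdio.h>

/* States: START, ONE, TWO count the digits read so far; ACCEPT means "three
 * digits read, the last one was 1"; THREE_ZERO means "three digits read, the
 * last one was 0"; DEAD rejects anything longer.                             */
enum { START = 0, ONE, TWO, ACCEPT, THREE_ZERO, DEAD, NSTATES };

static const int delta[NSTATES][2] = {
    /*             input '0'     input '1' */
    [START]      = { ONE,         ONE    },
    [ONE]        = { TWO,         TWO    },
    [TWO]        = { THREE_ZERO,  ACCEPT },
    [ACCEPT]     = { DEAD,        DEAD   },
    [THREE_ZERO] = { DEAD,        DEAD   },
    [DEAD]       = { DEAD,        DEAD   },
};

static int accepts(const char *w)
{
    int state = START;
    for (; *w; w++) {
        if (*w != '0' && *w != '1') return 0;   /* symbol not in the alphabet */
        state = delta[state][*w - '0'];
    }
    return state == ACCEPT;
}

int main(void)
{
    printf("%d %d %d\n", accepts("101"), accepts("110"), accepts("1011"));  /* 1 0 0 */
    return 0;
}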

Longest Match Rule

When the lexical analyzer reads the source code, it scans the code letter by letter; and when it
encounters a whitespace, operator symbol, or special symbol, it decides that a word is completed.

For example:

int intvalue;

While scanning up to 'int', the lexical analyzer cannot yet determine whether it is the
keyword int or the beginning of the identifier intvalue.

The Longest Match Rule states that the lexeme scanned should be determined based on the
longest match among all the tokens available. Here the scanner keeps consuming characters as
long as they can extend the current lexeme, so it produces the single identifier intvalue
rather than the keyword int followed by value.

The lexical analyzer also follows rule priority, where a reserved word (e.g., a keyword) of the
language is given priority over user-defined names. That is, if the lexical analyzer finds a
lexeme that matches an existing reserved word, it classifies it as that keyword rather than as
an identifier.

Syntax Analysis

Syntax analysis or parsing is the second phase of a compiler. In this section, we shall learn
the basic concepts used in the construction of a parser.

We have seen that a lexical analyzer can identify tokens with the help of regular expressions
and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to
the limitations of the regular expressions. Regular expressions cannot check balancing
tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG), which is
recognized by push-down automata.

CFG, on the other hand, is a superset of Regular Grammar: every Regular Grammar is also
context-free, but there exist some constructs that are beyond the scope of Regular Grammar.
CFG is a helpful tool in describing the syntax of programming languages.

Context-Free Grammar

In this section, we will first see the definition of context-free grammar and introduce
terminologies used in parsing technology.

A context-free grammar has four components:


 A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of
strings. The non-terminals define sets of strings that help define the language
generated by the grammar.
 A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols
from which strings are formed.
 A set of productions (P). The productions of a grammar specify the manner in which
the terminals and non-terminals can be combined to form strings. Each production
consists of a non-terminal called the left side of the production, an arrow, and a
sequence of tokens and/or non-terminals, called the right side of the production.
 One of the non-terminals is designated as the start symbol (S); from where the
production begins.

The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially
the start symbol) by the right side of a production, for that non-terminal.

Example

We take the problem of the palindrome language, which cannot be described by means of a
Regular Expression. That is, L = { w | w = w^R } is not a regular language. But it can be
described by means of CFG, as illustrated below:

G = ( V, Σ, P, S )

Where:

V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → ℇ | Z → 0Q0 | N → 1Q1 }
S = { Q }

This grammar describes the (even-length) palindrome language over {0, 1}, generating strings
such as: 1001, 0110, 11100111, 010010, etc.

Syntax Analyzers

A syntax analyzer or parser takes the input from a lexical analyzer in the form of token
streams. The parser analyzes the source code (token stream) against the production rules to
detect any errors in the code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and
generating a parse tree as the output of the phase.

Parsers are expected to parse the whole code even if some errors exist in the program. For this,
parsers use error recovery strategies.

Derivation

A derivation is basically a sequence of production rules, in order to get the input string.
During parsing, we take two decisions for some sentential form of input:

 Deciding the non-terminal which is to be replaced.


 Deciding the production rule, by which, the non-terminal will be replaced.

To decide which non-terminal to replace and which production rule to use, we have two options.

Left-most Derivation

If the sentential form of an input is scanned and replaced from left to right, it is called left-
most derivation. The sentential form derived by the left-most derivation is called the left-
sentential form.

Right-most Derivation

If we scan and replace the input with production rules, from right to left, it is known as right-
most derivation. The sentential form derived from the right-most derivation is called the
right-sentential form.

Example

Production rules:

E→E+E
E→E*E
E → id

Input string: id + id * id

The left-most derivation is:

E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id

Notice that the left-most side non-terminal is always processed first.

The right-most derivation is:


E→E+E
E→E+E*E
E → E + E * id
E → E + id * id
E → id + id * id

Parse Tree

A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are
derived from the start symbol. The start symbol of the derivation becomes the root of the
parse tree. Let us see this by an example from the last topic.

We take the left-most derivation of id + id * id from the previous example.

The left-most derivation is:

E→E*E
E→E+E*E
E → id + E * E
E → id + id * E
E → id + id * id

Step 1:

E→E*E

Step 2:

E→E+E*E

Step 3:
E → id + E * E

Step 4:

E → id + id * E

Step 5:

E → id + id * id
In a parse tree:

 All leaf nodes are terminals.


 All interior nodes are non-terminals.
 In-order traversal gives original input string.

A parse tree depicts associativity and precedence of operators. The deepest sub-tree is
traversed first, therefore the operator in that sub-tree gets precedence over the operator which
is in the parent nodes.

Ambiguity

A grammar G is said to be ambiguous if it has more than one parse tree (left or right
derivation) for at least one string.

Example
E→E+E
E→E–E
E → id

For the string id + id – id, the above grammar generates two parse trees: one grouping the
operands as (id + id) – id and the other as id + (id – id).

A language is said to be inherently ambiguous if every grammar that generates it is ambiguous.


Ambiguity in a grammar is not good for compiler construction. No method can detect and
remove ambiguity automatically, but it can be removed either by re-writing the whole
grammar without ambiguity, or by setting and following associativity and precedence
constraints.

Associativity

If an operand has operators on both sides, the side on which the operator takes this operand is
decided by the associativity of those operators. If the operation is left-associative, the
operand will be taken by the left operator; if the operation is right-associative, the right
operator will take the operand.
Example

Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If
the expression contains:

id op id op id

it will be evaluated as:

(id op id) op id

For example, (id + id) + id

Operations like Exponentiation are right associative, i.e., the order of evaluation in the same
expression will be:

id op (id op id)

For example, id ^ (id ^ id)

Precedence

If two different operators share a common operand, the precedence of operators decides
which will take the operand. That is, 2+3*4 can have two different parse trees, one
corresponding to (2+3)*4 and another corresponding to 2+(3*4). By setting precedence
among operators, this problem can be easily removed. As in the previous example,
mathematically * (multiplication) has precedence over + (addition), so the expression 2+3*4
will always be interpreted as:

2 + (3 * 4)

These methods decrease the chances of ambiguity in a language or its grammar.
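The arithmetic grammar given earlier in this unit already encodes both rules, by using one
non-terminal per precedence level (writing id for a number or variable operand):

E -> E + T | E - T | T      (+ and - : lowest precedence, left associative)
T -> T * F | T / F | F      (* and / : higher precedence, left associative)
F -> ( E ) | id             (parentheses and operands: highest precedence)

With this grammar, id + id * id has exactly one parse tree: the * sub-tree is nested below the
+ node, so multiplication binds tighter, and repeated + or * operators group to the left.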

Left Recursion

A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains
‘A’ itself as the left-most symbol. Left-recursive grammar is considered to be a problematic
situation for top-down parsers. Top-down parsers start parsing from the Start symbol, which
in itself is non-terminal. So, when the parser encounters the same non-terminal in its
derivation, it becomes hard for it to judge when to stop parsing the left non-terminal and it
goes into an infinite loop.

Example:
(1) A => Aα | β

(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is any non-terminal symbol and α
represents a string of terminals and/or non-terminals.

(2) is an example of indirect-left recursion.

A top-down parser will first try to expand A, which in turn yields a string beginning with A
itself, and the parser may go into a loop forever.

Removal of Left Recursion

One way to remove left recursion is to use the following technique:

The production

A => Aα | β

is converted into following productions

A => βA'
A'=> αA' | ε

This does not impact the strings derived from the grammar, but it removes immediate left
recursion.

Second method is to use the following algorithm, which should eliminate all direct and
indirect left recursions.

START

Arrange non-terminals in some order like A1, A2, A3,…, An

for each i from 1 to n


{
for each j from 1 to i-1
{
replace each production of form Ai ⟹ Aj𝜸
with Ai ⟹ δ1𝜸 | δ2𝜸 | δ3𝜸 |…| δn𝜸
where Aj ⟹ δ1 | δ2 |…| δn are the current Aj productions
}
}
eliminate immediate left-recursion

END
Example

The production set

S => Aα | β
A => Sd

after applying the above algorithm, should become

S => Aα | β
A => Aαd | βd

and then, remove immediate left recursion using the first technique.

A => βdA'
A' => αdA' | ε

Now none of the productions has either direct or indirect left recursion.
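For example, applying the first technique to the left-recursive expression rule used earlier,

E => E + T | T

gives

E  => T E'
E' => + T E' | ε

which generates exactly the same strings but can be parsed top-down without looping.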

Left Factoring

If more than one grammar production rule has a common prefix string, then the top-down
parser cannot make a choice as to which of the productions it should take to parse the string
in hand.

Example

If a top-down parser encounters a production like

A ⟹ αβ | α𝜸 | …

Then it cannot determine which production to follow to parse the string, as both productions
start with the same symbol. To remove this confusion, we use a technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this
technique, we make one production for each common prefix, and the rest of the alternatives are
moved into new productions.

Example

The above productions can be written as

A => αA'
A'=> β | 𝜸 | …

Now the parser has only one production per prefix, which makes it easier to take decisions.
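A common concrete case (shown here only as an illustration) is the conditional statement:

stmt => if expr then stmt else stmt | if expr then stmt

After left factoring, the shared prefix if expr then stmt is kept in one production:

stmt  => if expr then stmt stmt'
stmt' => else stmt | ε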

First and Follow Sets

An important part of parser table construction is to create the First and Follow sets. These
sets indicate which terminal symbols can appear at particular positions in a derivation. They
are used to fill the parsing table, where the entry T[A, t] = α records which production α to
apply when the non-terminal A is being expanded and the terminal t is the next input symbol.

First Set

This set is created to know what terminal symbol is derived in the first position by a non-
terminal. For example,

α→tβ

That is α derives t (terminal) in the very first position. So, t ∈ FIRST(α).

Algorithm for calculating First set

Look at the definition of FIRST(α) set:

 if α is a terminal, then FIRST(α) = { α }.

 if α is a non-terminal and α → ℇ is a production, then ℇ is in FIRST(α).
 if α is a non-terminal and α → 𝜸1 𝜸2 𝜸3 … 𝜸n is a production, then everything in
FIRST(𝜸1) except ℇ is in FIRST(α); and if 𝜸1 … 𝜸i-1 can all derive ℇ, then everything in
FIRST(𝜸i) except ℇ is also in FIRST(α).

In short, FIRST(α) is the set of terminals that can appear first in some string derived from α.

Follow Set

Likewise, we calculate what terminal symbol immediately follows a non-terminal α in


production rules. We do not consider what the non-terminal can generate but instead, we see
what would be the next terminal symbol that follows the productions of a non-terminal.
Algorithm for calculating the Follow set:
 if α is the start symbol, then $ is in FOLLOW(α).
 if α is a non-terminal and has a production α → AB, then everything in FIRST(B)
except ℇ is in FOLLOW(A).
 if α is a non-terminal and has a production α → AB, where B can derive ℇ (or B is
empty), then everything in FOLLOW(α) is in FOLLOW(A).

In short, FOLLOW(A) = { t | S ⇒* βAtδ for some strings β and δ }.
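As a worked example, consider the non-left-recursive form of the arithmetic expression grammar
(the E rule was obtained in the Left Recursion section; T and F are transformed the same way):

E  => T E'
E' => + T E' | ε
T  => F T'
T' => * F T' | ε
F  => ( E ) | id

FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }

FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { *, +, ), $ }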

Limitations of Syntax Analyzers

Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical
analyzers are responsible for the validity of the tokens supplied to the syntax analyzer. Syntax
analyzers have the following drawbacks -

 it cannot determine if a token is valid,


 it cannot determine if a token is declared before it is being used,
 it cannot determine if a token is initialized before it is being used,
 it cannot determine if an operation performed on a token type is valid or not.

Code generation can be considered as the final phase of compilation. Optimization can still be
applied after code generation, but that can be seen as a part of the code generation phase
itself. The code generated by the compiler is object code in some lower-level programming
language, for example, assembly language. We have seen that source code written in a
higher-level language is transformed into a lower-level language, resulting in lower-level
object code, which should have the following minimum properties:

 It should carry the exact meaning of the source code.


 It should be efficient in terms of CPU usage and memory management.

We will now see how the intermediate code is transformed into target object code (assembly
code, in this case).

Directed Acyclic Graph

A Directed Acyclic Graph (DAG) is a tool that depicts the structure of basic blocks, helps to
see the flow of values among the basic blocks, and supports optimization too. A DAG makes
transformations on basic blocks easy. A DAG can be understood as follows:

 Leaf nodes represent identifiers, names or constants.


 Interior nodes represent operators.
 Interior nodes also represent the results of expressions or the identifiers/name where
the values are to be stored or assigned.
Example:
t0 = a + b
t1 = t0 + c
d = t0 + t1

(The DAG is built incrementally: first the node for t0 = a + b, then the node for t1 = t0 + c,
and finally the node for d = t0 + t1; the node for t0 is shared by the last two instructions.)
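A tiny sketch of how such a DAG can be built in C is shown below. Before creating a node for
(op, left, right), the builder searches the existing nodes, so the node for t0 = a + b is
created once and then shared by t1 = t0 + c and d = t0 + t1. The helper names make_leaf and
make_node are invented for this example.

#include <stdio.h>
#include <string.h>

#define MAXNODES 64

struct node { char op; int left, right; char name[8]; };   /* op == 0 marks a leaf */

static struct node nodes[MAXNODES];
static int nnodes = 0;

static int make_leaf(const char *name)
{
    for (int i = 0; i < nnodes; i++)
        if (nodes[i].op == 0 && strcmp(nodes[i].name, name) == 0)
            return i;                                      /* leaf already exists   */
    nodes[nnodes].op = 0;
    strcpy(nodes[nnodes].name, name);
    return nnodes++;
}

static int make_node(char op, int left, int right)
{
    for (int i = 0; i < nnodes; i++)
        if (nodes[i].op == op && nodes[i].left == left && nodes[i].right == right)
            return i;                                      /* reuse the shared node */
    nodes[nnodes].op = op;
    nodes[nnodes].left = left;
    nodes[nnodes].right = right;
    nodes[nnodes].name[0] = '\0';
    return nnodes++;
}

int main(void)
{
    int a = make_leaf("a"), b = make_leaf("b"), c = make_leaf("c");
    int t0 = make_node('+', a, b);        /* t0 = a + b                     */
    int t1 = make_node('+', t0, c);       /* t1 = t0 + c                    */
    int d  = make_node('+', t0, t1);      /* d  = t0 + t1                   */
    int again = make_node('+', a, b);     /* same as t0: no new node made   */
    printf("t0=%d t1=%d d=%d again=%d total nodes=%d\n", t0, t1, d, again, nnodes);
    return 0;
}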

Peephole Optimization

This optimization technique works locally on the source code to transform it into an
optimized code. By locally, we mean a small portion of the code block at hand. These
methods can be applied to intermediate code as well as to target code. A group of statements is
analyzed and checked for the following possible optimizations:

Redundant instruction elimination

At source code level, the following can be done by the user:

(Four equivalent versions of the same function, from least to most simplified:)

int add_ten(int x)
{
    int y, z;
    y = 10;
    z = x + y;
    return z;
}

int add_ten(int x)
{
    int y;
    y = 10;
    y = x + y;
    return y;
}

int add_ten(int x)
{
    int y = 10;
    return x + y;
}

int add_ten(int x)
{
    return x + 10;
}

At the compilation level, the compiler searches for instructions that are redundant in nature.
Multiple load and store instructions may carry the same meaning even if some of them are
removed. For example:

 MOV x, R0
 MOV R0, R1

We can delete the first instruction and re-write the sentence as:

MOV x, R1

Unreachable code
Unreachable code is a part of the program code that is never accessed because of
programming constructs. Programmers may have accidentally written a piece of code that can
never be reached.

Example:

int add_ten(int x)
{
    return x + 10;
    printf("value of x is %d", x);
}

In this code segment, the printf statement will never be executed, as program control returns
before it can execute; hence the printf can be removed.

Flow of control optimization

There are instances in a code where the program control jumps back and forth without
performing any significant task. These jumps can be removed. Consider the following chunk
of code:

...
MOV R1, R2
GOTO L1
...
L1 : GOTO L2
L2 : INC R1

In this code, label L1 can be removed as it simply passes control to L2. So instead of jumping
to L1 and then to L2, the control can directly reach L2, as shown below:

...
MOV R1, R2
GOTO L2
...
L2 : INC R1

Algebraic expression simplification

There are occasions where algebraic expressions can be made simple. For example, the
expression a = a + 0 can be replaced by a itself and the expression a = a + 1 can simply be
replaced by INC a.

Strength reduction

There are operations that consume more time and space. Their ‘strength’ can be reduced by
replacing them with other operations that consume less time and space, but produce the same
result.

For example, x * 2 can be replaced by x << 1, which involves only one left shift. Though the
output of a * a and a² is the same, a * a is much more efficient to implement than calling an
exponentiation routine.
Accessing machine instructions

The target machine can deploy more sophisticated instructions, which can have the capability
to perform specific operations much more efficiently. If the target code can accommodate those
instructions directly, that will not only improve the quality of the code, but also yield more
efficient results.

Code Generator

A code generator is expected to have an understanding of the target machine’s runtime


environment and its instruction set. The code generator should take the following things into
consideration to generate the code:

 Target language : The code generator has to be aware of the nature of the target
language for which the code is to be transformed. That language may facilitate some
machine-specific instructions to help the compiler generate the code in a more
convenient way. The target machine can have either CISC or RISC processor
architecture.
 IR Type : Intermediate representation has various forms. It can be in Abstract Syntax
Tree (AST) structure, Reverse Polish Notation, or 3-address code.
 Selection of instruction : The code generator takes Intermediate Representation as
input and converts (maps) it into target machine’s instruction set. One representation
can have many ways (instructions) to convert it, so it becomes the responsibility of
the code generator to choose the appropriate instructions wisely.
 Register allocation : A program has a number of values to be maintained during
execution. The target machine's architecture may not allow all of the values to be kept
in registers. The code generator decides which values to keep in the registers and
which registers to use to keep these values.
 Ordering of instructions : At last, the code generator decides the order in which the
instruction will be executed. It creates schedules for instructions to execute them.

Descriptors

The code generator has to track both the registers (for availability) and addresses (location of
values) while generating the code. For both of them, the following two descriptors are used:

 Register descriptor : Register descriptor is used to inform the code generator about
the availability of registers. Register descriptor keeps track of values stored in each
register. Whenever a new register is required during code generation, this descriptor is
consulted for register availability.
 Address descriptor : Values of the names (identifiers) used in the program might be
stored at different locations while in execution. Address descriptors are used to keep
track of memory locations where the values of identifiers are stored. These locations
may include CPU registers, heaps, stacks, memory or a combination of the mentioned
locations.

The code generator keeps both descriptors updated as it generates code. For a load statement,
LD R1, x, the code generator:

 updates the register descriptor of R1 to show that it now holds the value of x, and

 updates the address descriptor of x to show that one instance of x is in R1.

Code Generation

Basic blocks consist of a sequence of three-address instructions. The code generator takes this
sequence of instructions as input.

Note : If the value of a name is found at more than one place (register, cache, or memory),
the register's value will be preferred over the cache and main memory. Likewise, the cache's
value will be preferred over main memory; main memory is given the least preference.
getReg : Code generator uses getReg function to determine the status of available registers
and the location of name values. getReg works as follows:
 If variable Y is already in register R, it uses that register.
 Else if some register R is available, it uses that register.
 Else if both the above options are not possible, it chooses a register that requires
minimal number of load and store instructions.

For an instruction x = y OP z, the code generator may perform the following actions. Let us
assume that L is the location (preferably register) where the output of y OP z is to be saved:

 Call function getReg, to decide the location of L.


 Determine the present location (register or memory) of y by consulting the Address
Descriptor of y. If y is not presently in register L, then generate the following
instruction to copy the value of y to L:
MOV y’, L
where y’ represents the copied value of y.
 Determine the present location of z using the same method used in step 2 for y and
generate the following instruction:
OP z’, L
where z’ represents the copied value of z.
 Now L contains the value of y OP z, which is intended to be assigned to x. So, if L is a
register, update its descriptor to indicate that it contains the value of x. Update the
descriptor of x to indicate that it is stored at location L.
 If y and z have no further use, their locations can be released for reuse.

Other code constructs, like loops and conditional statements, are transformed into assembly
language in the usual way, using labels and conditional or unconditional jumps.
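The following is a very simplified sketch in C of the steps described above for an instruction
x = y OP z: a toy getReg() picks a register, the value of y is loaded into it if necessary, OP
is applied with z, and the register descriptor is updated to record that the register now holds
x. The register names R0-R3 and the two-operand MOV/ADD/SUB style are purely illustrative, not
a real instruction set.

#include <stdio.h>
#include <string.h>

#define NREGS 4
static char reg_holds[NREGS][16];      /* register descriptor: the name each register holds */

static int getReg(const char *name)
{
    for (int r = 0; r < NREGS; r++)                 /* value already in a register?    */
        if (strcmp(reg_holds[r], name) == 0) return r;
    for (int r = 0; r < NREGS; r++)                 /* otherwise take a free register  */
        if (reg_holds[r][0] == '\0') return r;
    return 0;                                       /* naive spill policy: reuse R0    */
}

static void gen(const char *x, const char *y, char op, const char *z)
{
    int r = getReg(y);
    if (strcmp(reg_holds[r], y) != 0)               /* y not in r yet: load it         */
        printf("MOV %s, R%d\n", y, r);
    printf("%s %s, R%d\n", op == '+' ? "ADD" : "SUB", z, r);
    strcpy(reg_holds[r], x);                        /* r now holds the value of x      */
}

int main(void)
{
    gen("t0", "a", '+', "b");     /* t0 = a + b  */
    gen("t1", "t0", '+', "c");    /* t1 = t0 + c */
    return 0;
}

For the two instructions t0 = a + b and t1 = t0 + c, this prints MOV a, R0 / ADD b, R0 /
ADD c, R0; the second instruction needs no load, because the register descriptor records that
t0 is already in R0.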
