Compiler Design 1
The compiler has two modules, namely the front end and the back end. The front
end consists of the lexical analyzer, syntax analyzer, semantic analyzer, and
intermediate code generator. The remaining phases together form the back end.
Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source
code as a stream of characters and converts it into meaningful lexemes. The lexical
analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
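As a rough illustration, a token can be modeled as a small record pairing a token
name with an attribute value. A minimal sketch in C, with hypothetical type and
field names:

/* One token: a kind plus the matched lexeme (attribute value). */
typedef enum { TOK_KEYWORD, TOK_IDENTIFIER, TOK_OPERATOR,
               TOK_CONSTANT, TOK_SYMBOL } TokenKind;

typedef struct {
    TokenKind kind;     /* the token-name */
    const char *lexeme; /* the attribute-value, e.g. "value" or "100" */
} Token;

For the C statement int value = 100; a lexical analyzer would emit tokens such as
{TOK_KEYWORD, "int"}, {TOK_IDENTIFIER, "value"}, {TOK_OPERATOR, "="},
{TOK_CONSTANT, "100"}, {TOK_SYMBOL, ";"}.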
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced
by lexical analysis as input and generates a parse tree (or syntax tree). In this phase,
token arrangements are checked against the source code grammar, i.e. the parser
checks if the expression made by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the
language. For example, it verifies that assignments are between compatible data
types, and flags errors such as adding a string to an integer. The semantic analyzer
also keeps track of identifiers, their types, and expressions, and checks whether
identifiers are declared before use. It produces an annotated syntax tree as output.
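For instance, a semantic analyzer for C would reject an assignment between
incompatible types even though it is syntactically well-formed:

int x = "hello";   /* parses fine, but fails semantic analysis: */
                   /* a string cannot be assigned to an int     */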
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate code of the source
code for the target machine. It represents a program for some abstract machine and
lies in between the high-level language and the machine language. This intermediate
code should be generated in such a way that it is easy to translate into the target
machine code.
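A common intermediate form is three-address code, in which each statement applies
at most one operator. As a sketch, the source statement a = b * c + d could be
lowered into a sequence like this (the temporary names t1 and t2 are illustrative):

t1 = b * c
t2 = t1 + d
a = t2

Each line is simple enough to map almost directly onto machine instructions, which
is what makes this form easy to translate.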
Code Optimization
The next phase performs code optimization on the intermediate code. Optimization
can be thought of as removing unnecessary lines of code and arranging the sequence
of statements to speed up program execution without wasting resources (CPU,
memory).
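Two typical examples of such optimizations are constant folding and dead-code
elimination. A sketch of the idea:

/* Before optimization */
x = 4 * 2;          /* multiplication performed at run time */
if (0) { y = x; }   /* condition is always false: dead code */

/* After constant folding and dead-code elimination */
x = 8;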
Code Generation
In this phase, the code generator takes the optimized representation of the
intermediate code and maps it to the target machine language. The code generator
translates the intermediate code into a sequence of (generally) relocatable machine
code. This sequence of machine instructions performs the same task as the
intermediate code would.
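As a hedged sketch, the three-address code from the earlier example might map onto
a simple hypothetical register machine as follows (the mnemonics and register names
are illustrative, not a real instruction set):

LOAD  R1, b    ; R1 <- b
MUL   R1, c    ; R1 <- b * c    (t1)
ADD   R1, d    ; R1 <- t1 + d   (t2)
STORE a, R1    ; a  <- t2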
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All the
identifiers' names, along with their types, are stored here. The symbol table makes it
easier for the compiler to quickly search for an identifier record and retrieve it. The
symbol table is also used for scope management.
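A symbol table is commonly implemented as a hash table keyed by the identifier's
name. A minimal sketch of one entry in C, with illustrative field names:

/* One symbol-table entry; field names are illustrative. */
struct Symbol {
    char name[64];        /* identifier name                  */
    char type[16];        /* e.g. "int", "float"              */
    int  scope_level;     /* enclosing scope, for scope rules */
    struct Symbol *next;  /* chaining within a hash bucket    */
};

Lookup hashes the name, walks the chain in that bucket, and returns the matching
entry, which is what makes retrieval fast.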
Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code
from language preprocessors, written in the form of sentences. The lexical analyzer
breaks this text into a series of tokens, removing any whitespace and comments in
the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical
analyzer works closely with the syntax analyzer: it reads character streams from the
source code, checks for legal tokens, and passes the data to the syntax analyzer on
demand.
Tokens
A lexeme is a sequence of (alphanumeric) characters matched as a token. There
are some predefined rules for every lexeme to be identified as a valid token. These
rules are defined by grammar rules, by means of a pattern. A pattern explains what
can be a token, and these patterns are defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers,
operators, and punctuation symbols can be considered tokens.
For example, in the C language, the variable declaration line
int value = 100;
contains the tokens: int (keyword), value (identifier), = (operator), 100 (constant),
and ; (symbol).
Specifications of Tokens
Let us understand how language theory treats the following terms:
Alphabets
Any finite set of symbols is an alphabet: {0,1} is the set of binary alphabets,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the set of hexadecimal alphabets, and {a-z, A-Z}
is the set of English-language alphabets.
Strings
Any finite sequence of symbols from an alphabet is called a string. The length of a
string is the total number of occurrences of symbols in it; e.g., the length of the string
tutorialspoint is 14 and is denoted by |tutorialspoint| = 14. A string having no
symbols, i.e., a string of zero length, is known as the empty string and is denoted
by ε (epsilon).
Special Symbols
A typical high-level language contains special symbols such as the following:
Assignment: =
Preprocessor: #
Language
A language is a set of strings over some finite alphabet. Since computer languages
are sets, mathematical set operations can be performed on them. Finite languages,
in particular, can be described by means of regular expressions.
Longest Match Rule
While scanning, for example, the declaration int intvalue;, the lexical analyzer
cannot determine, having read up to ‘int’, whether it is the keyword int or the
prefix of the identifier intvalue.
The Longest Match Rule states that the lexeme scanned should be determined based
on the longest match among all the tokens available.
Regular Expressions
Regular expressions are used to denote regular languages.
The lexical analyzer needs to scan and identify only a finite set of valid
strings/tokens/lexemes that belong to the language in hand. It searches for the
patterns defined by the language rules.
Regular expressions have the capability to express languages by defining a pattern
for finite strings of symbols. The grammar defined by regular expressions is known
as a regular grammar. The language defined by a regular grammar is known as
a regular language.
Regular expression is an important notation for specifying patterns. Each pattern
matches a set of strings, so regular expressions serve as names for a set of strings.
Programming language tokens can be described by regular languages. The
specification of regular expressions is an example of a recursive definition. Regular
languages are easy to understand and have efficient implementation.
There are a number of algebraic laws that are obeyed by regular expressions, which
can be used to manipulate regular expressions into equivalent forms.
Operations
The various operations on languages are:
• Union of two languages L and M is written as
L U M = {s | s is in L or s is in M}
• Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
• The Kleene Closure of a language L is written as
L* = zero or more occurrences of language L, i.e., {ε} ∪ L ∪ LL ∪ LLL ∪ …
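For example, if L = {a, b} and M = {c}, then L U M = {a, b, c}, LM = {ac, bc},
and L* = {ε, a, b, aa, ab, ba, bb, …}.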
Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
• Union : (r)|(s) is a regular expression denoting L(r) U L(s)
• Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
• Kleene closure : (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Precedence and Associativity
• *, concatenation (.), and | (pipe sign) are left associative
• * has the highest precedence
• Concatenation (.) has the second highest precedence.
• | (pipe sign) has the lowest precedence of all.
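For example, under these rules the expression a | b c* is interpreted as a | (b (c*)).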
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
• x* means zero or more occurrences of x,
i.e., it can generate { ε, x, xx, xxx, xxxx, … }
• x+ means one or more occurrences of x,
i.e., it can generate { x, xx, xxx, xxxx, … }, equivalent to x.x*
• x? means at most one occurrence of x,
i.e., it can generate either {x} or {ε}.
[a-z] is all lower-case letters of the English language.
[A-Z] is all upper-case letters of the English language.
[0-9] is all the decimal digits used in mathematics.
Representing occurrence of symbols using regular expressions
letter = [a-z] | [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
Representing language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-
accepted solution is to use finite automata for verification.
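As an illustration, the sketch below hand-codes a tiny finite automaton in C that
checks the Identifier pattern letter(letter | digit)* defined above; the function name
is illustrative:

#include <ctype.h>

/* Returns 1 if s matches letter(letter|digit)*, else 0.
   State 1: expect a leading letter; state 2: letters/digits. */
int is_identifier(const char *s)
{
    if (!isalpha((unsigned char)*s))     /* first symbol must be a letter */
        return 0;
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s)) /* rest: letter or digit only */
            return 0;
    return 1;                            /* ended in the accepting state */
}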
Pass of compiler
A pass refers to one complete traversal of the source code by the compiler. There
are single-pass compilers and multi-pass compilers. A single-pass compiler goes
through the program only once; in other words, it allows the source code to pass
through each compilation unit only once, immediately translating each code section
into its final machine code.
Multi-pass compiler goes through the source code several times. In other words,
it allows the source code to pass through each compilation unit several times.
Each pass takes the result of the previous pass as input and creates intermediate
outputs. Therefore, the code improves in each pass. The final code is generated
after the final pass. Multi-pass compilers perform additional tasks such as
intermediate code generation, machine dependent code optimization, and
machine independent code optimization.
Speed
Speed is a major difference between single-pass and multi-pass compilers. A
multi-pass compiler is slower than a single-pass compiler because each pass reads
and writes an intermediate file.
What is Interpreter?
An interpreter is a computer program which converts each high-level program
statement into machine code. This includes source code, pre-compiled code, and
scripts. Both compilers and interpreters do the same job, which is converting a
higher-level programming language to machine code. However, a compiler converts
the code into machine code (creating an executable) before the program runs,
whereas an interpreter converts code into machine code while the program is
running.
Bootstrapping in Compiler Design
Bootstrapping is a process in which a simple language is used to translate a more
complicated program, which in turn may handle an even more complicated program,
and so on.
Writing a compiler for any high-level language is a complicated process, and it takes
a lot of time to write one from scratch. Hence, a simple language is used to generate
the target code in stages. To clearly understand the bootstrapping technique,
consider the following scenario.
Suppose we want to write a cross compiler for a new language X. The
implementation language of this compiler is, say, Y, and the target code being
generated is in language Z. That is, we create the compiler XYZ (source X, written
in Y, target Z). Now if an existing compiler for Y runs on machine M and generates
code for M, it is denoted YMM. If we run XYZ using YMM, we get a compiler XMZ:
a compiler for source language X that generates target code in language Z and
runs on machine M.
Example:
Compilers can be created in many different forms. Here we generate a compiler that
takes C language as input and produces assembly language as output, given a
machine that runs assembly language.
• Step-1: First we write a compiler for a small subset of C, call it C0, in assembly
language.
• Step-2: Then, using the small subset C0 as the implementation language, a
compiler for a larger subset of the source language C is written. Repeating this
process eventually yields a compiler for the full language.
LEX
o Lex is a program that generates lexical analyzers. It is used with the YACC parser
generator.
o The lexical analyzer is a program that transforms an input stream into a sequence of
tokens.
o Lex reads a specification and produces C source code that implements the lexical
analyzer.
The function of Lex is as follows:
o First, the lexical analyzer specification is written as a program lex.l in the Lex
language. The Lex compiler then runs on the lex.l program and produces a C
program lex.yy.c.
o Finally, the C compiler compiles the lex.yy.c program and produces an object
program a.out.
o a.out is the lexical analyzer that transforms an input stream into a sequence of
tokens.
A Lex program has the following structure:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
In the rules section, each rule has the form pi { actioni }, where pi is a regular
expression and actioni describes the action the lexical analyzer should take when
pattern pi matches a lexeme.
User subroutines are auxiliary procedures needed by the actions. They can be
compiled separately and loaded with the lexical analyzer.
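A minimal sketch of a complete Lex specification, assuming a toy token set (the
printed labels are illustrative); it prints one line per number or identifier found in
the input:

%{
#include <stdio.h>
%}
%%
[0-9]+                  { printf("NUMBER(%s)\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*  { printf("IDENTIFIER(%s)\n", yytext); }
[ \t\n]+                { /* skip whitespace */ }
.                       { printf("SYMBOL(%s)\n", yytext); }
%%
int main(void)   { yylex(); return 0; }
int yywrap(void) { return 1; }

Saved as lex.l and processed as described above (lex lex.l, then cc lex.yy.c), this
produces an a.out that tokenizes its standard input.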
CONTEXT FREE GRAMMAR
Context-free Grammars: Definition: Formally, a context-free grammar G is a 4-tuple
G = (V, T, P, S), where:
1. V is a finite set of variables (or non-terminals). These describe sets of “related”
strings.
2. T is a finite set of terminals, i.e., the symbols that form the strings of the language.
3. P is a finite set of productions, each of the form A → α, where A is a non-terminal
and α is a string of terminals and non-terminals.
4. S ∈ V is the start symbol.
For example, consider the grammar
E → E A E | (E) | -E | id
A → + | - | * | /
where E, A are the non-terminals while id, +, *, -, /, (, ) are the terminals.
Top-Down Parser
We have learnt in the last chapter that the top-down parsing technique parses the
input and starts constructing a parse tree from the root node, gradually moving down
to the leaf nodes. The main types of top-down parsing are discussed below.
Recursive Descent Parsing
Recursive descent is a top-down parsing technique that constructs the parse tree
from the top and the input is read from left to right. It uses procedures for every
terminal and non-terminal entity. This parsing technique recursively parses the input
to make a parse tree, which may or may not require back-tracking. But the grammar
associated with it (if not left factored) cannot avoid back-tracking. A form of recursive-
descent parsing that does not require any back-tracking is known as predictive
parsing.
This parsing technique is regarded as recursive, as it uses a context-free grammar
which is recursive in nature.
Back-tracking
Top- down parsers start from the root node (start symbol) and match the input string
against the production rules to replace them (if matched). To understand this, take
the following example of CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For an input string: read, a top-down parser, will behave like this:
It will start with S from the production rules and will match its yield to the left-most
letter of the input, i.e. ‘r’. The first production of S (S → rXd) matches it, so the
top-down parser advances to the next input letter (i.e. ‘e’). The parser tries to expand
non-terminal ‘X’ and checks its first production from the left (X → oa). It does not
match the next input symbol, so the top-down parser backtracks to obtain the next
production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is
accepted.
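A sketch of this backtracking behavior in C, for the same grammar; the function
names and the match helper are illustrative:

#include <stdio.h>
#include <string.h>

static const char *input;   /* the string being parsed */

/* Try to match a literal at *pos; advance on success. */
static int match(const char *lit, int *pos)
{
    size_t n = strlen(lit);
    if (strncmp(input + *pos, lit, n) == 0) { *pos += n; return 1; }
    return 0;
}

static int parse_X(int *pos)
{
    int save = *pos;
    if (match("oa", pos)) return 1;  /* X -> oa */
    *pos = save;                     /* backtrack */
    return match("ea", pos);         /* X -> ea */
}

static int parse_Z(int *pos)
{
    return match("ai", pos);         /* Z -> ai */
}

static int parse_S(int *pos)
{
    int save = *pos;
    if (match("r", pos) && parse_X(pos) && match("d", pos))
        return 1;                    /* S -> rXd */
    *pos = save;                     /* backtrack */
    return match("r", pos) && parse_Z(pos) && match("d", pos); /* S -> rZd */
}

int main(void)
{
    int pos = 0;
    input = "read";
    if (parse_S(&pos) && input[pos] == '\0')
        printf("String accepted\n");
    else
        printf("String rejected\n");
    return 0;
}

On the input read, parse_X first tries X → oa, fails, restores the position, and
succeeds with X → ea, exactly as traced above.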
Left Recursion
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation
contains ‘A’ itself as the left-most symbol. Left-recursive grammar is considered to be
a problematic situation for top-down parsers. Top-down parsers start parsing from the
Start symbol, which in itself is non-terminal. So, when the parser encounters the same
non-terminal in its derivation, it becomes hard for it to judge when to stop parsing the
left non-terminal and it goes into an infinite loop.
Example:
(1) A => Aα | β
(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is a non-terminal symbol and
α, β represent strings of terminals and non-terminals.
(2) is an example of indirect-left recursion.
A top-down parser will first try to expand A, which in turn yields a string beginning
with A itself, so the parser may go into a loop forever.
Removal of Left Recursion
One way to remove left recursion is to use the following technique:
The production
A => Aα | β
is converted into following productions
A => βA'
A'=> αA' | ε
This does not impact the strings derived from the grammar, but it removes immediate
left recursion.
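For instance, applying this technique to the standard left-recursive expression
grammar (a textbook example, not one of this document's exercises):

E → E + T | T

Here β = T and α = + T, so the transformed productions are

E → T E'
E' → + T E' | ε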
Left Factoring
If more than one grammar production rule has a common prefix string, then the top-
down parser cannot make a choice as to which of the productions it should take to
parse the string in hand.
Example
If a top-down parser encounters a production like
A → αβ | αγ | …
then it cannot determine which production to follow to parse the string, as both
productions start with the same prefix α. To remove this confusion, we use a
technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this
technique, we make one production for each common prefix, and the rest of the
derivation is added by new productions.
Example
The above productions can be written as
A => αA'
A'=> β | γ | …
Now the parser has only one production per prefix which makes it easier to take
decisions.
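A concrete case (a standard textbook example, not from this document) is the
if-then-else statement grammar:

stmt → if expr then stmt else stmt | if expr then stmt

Both productions share the prefix "if expr then stmt", so left factoring gives

stmt → if expr then stmt stmt'
stmt' → else stmt | ε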
First Function-
First(α) is a set of terminal symbols that begin the strings derivable from α. If α
derives ε, then ε is also included in First(α).
Follow Function-
Follow(α) is a set of terminal symbols that appear immediately to the right of α.
Example-
For a grammar with the productions A → abc / def / ghi, we have-
First(A) = { a , d , g }
Problem-01:
Calculate the first and follow functions for the given grammar-
S → aBDh
B → cC
C → bC / ε
D → EF
E → g / ε
F → f / ε
Solution-
First Functions-
• First(S) = { a }
• First(B) = { c }
• First(C) = { b , ε }
• First(D) = { First(E) – ε } ∪ First(F) = { g , f , ε }
• First(E) = { g , ε }
• First(F) = { f , ε }
Follow Functions-
• Follow(S) = { $ }
• Follow(B) = { First(D) – ε } ∪ First(h) = { g , f , h }
• Follow(C) = Follow(B) = { g , f , h }
• Follow(D) = First(h) = { h }
• Follow(E) = { First(F) – ε } ∪ Follow(D) = { f , h }
• Follow(F) = Follow(D) = { h }
Predictive Parser
Predictive parser is a recursive descent parser, which has the capability to predict
which production is to be used to replace the input string. The predictive parser does
not suffer from backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points
to the next input symbols. To make the parser back-tracking free, the predictive
parser puts some constraints on the grammar and accepts only a class of grammar
known as LL(k) grammar.
Predictive parsing uses a stack and a parsing table to parse the input and generate
a parse tree. Both the stack and the input contain an end symbol $ to denote that
the stack is empty and the input is consumed. The parser refers to the parsing table
to decide on each input and stack element combination.
In recursive descent parsing, the parser may have more than one production to
choose from for a single instance of input, whereas in a predictive parser, each step
has at most one production to choose. There might be instances where no
production matches the input string, causing the parsing procedure to fail.
LL(1) Parsing:
Here the first L represents that the scanning of the input is done in a left-to-right
manner, and the second L shows that this parsing technique uses the leftmost
derivation. Finally, the 1 represents the number of look-ahead symbols, i.e., how
many input symbols the parser examines when making a decision.
In the parsing table, the rows contain the non-terminals and the columns contain
the terminal symbols. All the null (ε) productions of the grammar go under the
Follow-set elements, and the remaining productions lie under the elements of the
First set.
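A compact sketch of this table-driven scheme in C, for the toy grammar E → T E',
E' → + T E' | ε, T → id; the grammar, the single-character symbol encodings, and
the names are illustrative, and the parsing-table lookups are inlined as if/else tests:

#include <stdio.h>

/* Symbols: 'E','X','T' are non-terminals (X stands for E');
   'i' (id), '+', '$' are terminals. */
int main(void)
{
    const char *input = "i+i$";  /* id + id, then end marker */
    char stack[64] = "$E";       /* bottom marker, then start symbol */
    int top = 1, ip = 0;

    while (top >= 0) {
        char s = stack[top], a = input[ip];
        if (s == a) {                          /* match terminal or $ */
            top--; ip++;
        } else if (s == 'E' && a == 'i') {     /* M[E,id]: E -> T X */
            stack[top] = 'X'; stack[++top] = 'T';
        } else if (s == 'X' && a == '+') {     /* M[X,+]:  X -> + T X */
            stack[top] = 'X'; stack[++top] = 'T'; stack[++top] = '+';
        } else if (s == 'X' && a == '$') {     /* M[X,$]:  X -> ε, under Follow(X) */
            top--;
        } else if (s == 'T' && a == 'i') {     /* M[T,id]: T -> id */
            stack[top] = 'i';
        } else {
            printf("String rejected\n");
            return 1;
        }
    }
    printf("String accepted\n");
    return 0;
}

Note how the ε-production of E' is applied when the look-ahead is $, i.e. a member
of Follow(E'), matching the rule stated above.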
Example: consider the grammar
B → B and B
B → B or B
B → (B)
B → true
B → false
The given grammar is left-recursive. Removing left recursion using the
transformation
A → Aα | β   becomes   A → βA', A' → αA' | ε
we obtain:
B → true B' | false B' | (B) B'
B' → and B B' | or B B' | ε
If, after this step, the productions of a non-terminal still shared a common prefix
(B → λ1 | λ2 | … | λn with a common prefix), left factoring would be performed on
it as described earlier; here the transformed grammar has no common prefixes left.