Unit I SRM
Compilers – Analysis of the source program, Phases of a compiler – Cousins of the Compiler, Grouping
of Phases – Compiler construction tools, Lexical Analysis – Role of Lexical Analyzer, Input Buffering –
Specification of Tokens – design of lexical analysis (LEX), Finite automata (deterministic & non-
deterministic) – Conversion of regular expression to NDFA – Thompson's construction, Conversion of NDFA to DFA –
minimization of DFA, Derivation – parse tree – ambiguity
1. INTRODUCTION - COMPILER
A compiler is a program that can read a program in one language - the source language - and
translate it into an equivalent program in another language - the target language; see Fig. 1.1. An
important role of the compiler is to report any errors in the source program that it detects during
the translation process.
If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs; see Fig. 1.2.
An interpreter is another common kind of language processor. Instead of producing a target
program as a translation, an interpreter appears to directly execute the operations specified in the
source program on inputs supplied by the user, as shown in Fig. 1.3.
The differences between a compiler and an interpreter are summarized below:

    Compiler                                    Interpreter
    The program need not be compiled            The higher-level program is converted into
    every time it is run.                       lower-level form every time it is run.
    Errors are displayed after the              Errors are displayed for every
    entire program is checked.                  instruction interpreted (if any).
Java language processors combine compilation and interpretation. A Java source program may
first be compiled into an intermediate form called bytecodes. The bytecodes are then interpreted
by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one machine
can be interpreted on another machine, even across a network.
Just-in-time compilers: To achieve faster processing of inputs to outputs, a just-in-time
compiler can be used; it translates the bytecodes into platform-specific executable code that is
immediately executed.
Preprocessor
A preprocessor produces input to compilers. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are short hands for
longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
Compiler
The compiler may produce an assembly-language program as its output, because assembly
language is easier to produce as output and is easier to debug.
Assembler
An assembler translates assembly language programs into relocatable machine code. The output
of an assembler is called an object file, which contains a combination of machine instructions as
well as the data required to place these instructions in memory.
Linker
A linker or link editor is a computer program that takes one or more object files generated by
a compiler and combines them into a single executable file.
Large programs are often compiled in pieces, so the relocatable machine code may have to be
linked together with other relocatable object files and library files into the code that actually runs
on the machine.
The linker also resolves external memory addresses, where the code in one file may refer to a
location in another file.
Loader
The loader is responsible for loading executable files into memory and executing them. It
calculates the size of a program (instructions and data) and creates memory space for it. It also
initializes various registers to initiate execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.
A compiler is broadly divided into two phases: the analysis phase and the synthesis phase.
Analysis Phase
Known as the front end of the compiler, the analysis phase reads the source program, divides it
into its core parts, and checks for lexical, grammar, and syntax errors. It produces an
intermediate representation of the source program and a symbol table, which are fed to the
synthesis phase as input.
Synthesis Phase
Known as the back end of the compiler, the synthesis phase generates the target program with
the help of the intermediate code representation and the symbol table.
If we examine the compilation process in more detail, it is partitioned into a number of sub-
processes called phases.
A phase is a logically interrelated operation that takes the source program in one representation
and produces output in another representation.
2. PHASES OF A COMPILER
The compilation process is a sequence of various phases. Each phase takes input from its
previous stage, has its own representation of source program, and feeds its output to the next
phase of the compiler. Let us understand the phases of a compiler.
Lexical Analysis
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters and groups the characters into meaningful sequences called lexemes. For
each lexeme, the lexical analyzer produces as output a token of the form
(token-name, attribute-value)
that it passes on to the subsequent phase, syntax analysis. In the token, the first component
token-name is an abstract symbol that is used during syntax analysis, and the second component
attribute-value points to an entry in the symbol table for this token. Information from the symbol-
table entry is needed for semantic analysis and code generation.
For example, suppose a source program contains the assignment statement

position = initial + rate * 60

The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract symbol
standing for identifier and 1 points to the symbol table entry for position. The symbol-table entry
for an identifier holds information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token needs
no attribute-value, the second component is omitted. Any abstract symbol such as assign could be
used for the token-name, but for notational convenience the lexeme itself is chosen as the name of
the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table
entry for initial.
4. The addition symbol + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry
for rate.
6. The multiplication symbol * is a lexeme that is mapped into the token (*).
7. The number 60 is a lexeme that is mapped into the token (60).
Blanks separating the lexemes would be discarded by the lexical analyzer. The output of the
lexical analyzer for this assignment is the token sequence:

(id, 1) (=) (id, 2) (+) (id, 3) (*) (60)
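To make the mapping from lexemes to tokens concrete, the following small C sketch (not part of the original notes; the token format and the 1-based symbol-table indices mirror the example above) scans the assignment statement and prints the token stream:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Tiny illustrative symbol table: return the index of a name, inserting it if new. */
static char table[10][32];
static int nsyms = 0;
static int install(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i], name) == 0) return i + 1;
    strcpy(table[nsyms++], name);
    return nsyms;                       /* 1-based index, as in the example */
}

int main(void) {
    const char *p = "position = initial + rate * 60";
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }    /* discard blanks */
        if (isalpha((unsigned char)*p)) {                     /* identifier lexeme */
            char buf[32]; int n = 0;
            while (isalnum((unsigned char)*p)) buf[n++] = *p++;
            buf[n] = '\0';
            printf("(id, %d) ", install(buf));
        } else if (isdigit((unsigned char)*p)) {              /* number lexeme */
            int v = 0;
            while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
            printf("(%d) ", v);
        } else {                                              /* operator lexeme */
            printf("(%c) ", *p++);
        }
    }
    printf("\n");   /* prints: (id, 1) (=) (id, 2) (+) (id, 3) (*) (60) */
    return 0;
}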
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical
analysis as input and generates a parse tree (or syntax tree).
A typical representation is a syntax tree in which each interior node represents an operation and
the children of the node represent the arguments of the operation. It depicts the grammatical
structure of the token stream.
Semantic Analysis
The semantic analyser uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. i.e. It checks whether the
parse tree constructed follows the rules of language.
An important part of semantic analysis is type checking, where the compiler checks that each
operator has matching operands. For example, a binary arithmetic operator may be applied to
either a pair of integers or to a pair of floating-point numbers. If the operator is applied to a
floating-point number and an integer, the compiler may convert or coerce the integer into a
floating-point number.
In the above example, assume that the variables position, initial, and rate have been declared to be
floating-point numbers, and that the lexeme 60 by itself forms an integer. The type checker in the
semantic analyser discovers that the operator * is applied to a floating-point number rate and an
integer 60. In this case, the integer may be converted into a floating-point number.
The output of the semantic analyzer has an extra node for the operator inttofloat
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate code of the source code for the
target machine. It represents a program for some abstract machine. It is in between the high-level
language and the machine language.
This intermediate representation should have two important properties: it should be easy to
produce and it should be easy to translate into the target machine.
A commonly used intermediate form is three-address code, which consists of a sequence of
assembly-like instructions with three operands per instruction. Each operand can act like a
register.
The output of the intermediate code generator for the above example is:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Each three-address assignment instruction has at most one operator on the right side.
The compiler must generate a temporary name to hold the value computed by a three-
address instruction.
Some "three-address instructions" have fewer than three operands.
Code Optimization
Optimization removes unnecessary code lines and rearranges the sequence of statements in order
to speed up the program execution without wasting resources (CPU, memory).
In the above example the conversion of 60 from integer to floating point is done once, so the
inttofloat operation can be eliminated by replacing the integer 60 by the floating-point number
60.0. Moreover, t3 is used only once to transmit its value to id1, so the optimizer can transform
the above intermediate code into the shorter sequence:

t1 = id3 * 60.0
id1 = id2 + t1
Code Generation
The code generator takes as input an intermediate representation of the source program and maps
it into the target language.
If the target language is machine code, registers or memory locations are selected for each of the
variables used by the program. Then, the intermediate instructions are translated into sequences
of machine instructions that perform the same task.
For example, using registers R1 and R2, the above intermediate code might get translated into
the machine code:

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
Symbol-Table Management
It is a data structure maintained throughout all the phases of a compiler. All the identifiers'
names along with their types are stored here. The symbol table makes it easier for the compiler to
quickly search for an identifier's record and retrieve it. The symbol table is also used for scope
management.
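A minimal sketch of such a table in C, assuming a single flat scope and a fixed-size linear table (production compilers typically use hash tables and scope chains; the names here are illustrative):

#include <stdio.h>
#include <string.h>

/* One entry per identifier: its lexeme and its type. */
struct entry { char name[32]; char type[16]; };

static struct entry symtab[100];
static int count = 0;

/* Return the index of name, or -1 if it is absent. */
int lookup(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(symtab[i].name, name) == 0) return i;
    return -1;
}

/* Insert name with its type if not already present; return its index. */
int insert(const char *name, const char *type) {
    int i = lookup(name);
    if (i >= 0) return i;
    strcpy(symtab[count].name, name);
    strcpy(symtab[count].type, type);
    return count++;
}

int main(void) {
    insert("position", "float");
    insert("initial", "float");
    insert("rate", "float");
    printf("rate is entry %d with type %s\n",
           lookup("rate"), symtab[lookup("rate")].type);
    return 0;
}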
3. GROUPING OF PHASES
Several phases may be grouped together into a pass that reads an input file and writes an
output file. For example, the front-end phases of lexical analysis, syntax analysis,
semantic analysis, and intermediate code generation might be grouped together into one
pass. Code optimization might be an optional pass. Then there could be a back-end pass
consisting of code generation for a particular target machine.
4. COMPILER CONSTRUCTION TOOLS
Several specialized tools are available to implement various phases of a compiler. Some
commonly used compiler-construction tools include
1. Parser generators that automatically produce syntax analyzers from a grammatical description
of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of the
tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse
tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a target
machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part of
code optimization.
5. ROLE OF THE LEXICAL ANALYZER
As the first phase of a compiler, the main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce a sequence of tokens.
The lexical analyzer also interacts with the symbol table: when it discovers a lexeme constituting
an identifier, it enters that lexeme into the symbol table.
Interactions between the lexical analyzer and the parser
The interaction is implemented by having the parser call the lexical analyzer. The parser issues
getNextToken command, the lexical analyzer reads the characters from its input until it can
identify the next lexeme and produces the corresponding token, which it returns to the parser.
It reads the input characters of the source program and generates the stream of tokens.
Stripping out comments and whitespace
It keeps track of newline characters and associates a line number with each error
message.
It also performs macro expansion.
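The getNextToken interaction described above can be sketched as follows; the Token structure and the stubbed scanner body are illustrative assumptions, not a real API:

#include <stdio.h>

/* Illustrative token representation. */
enum token_name { TOK_ID, TOK_EOF };
struct Token { enum token_name name; int attribute; };

/* The lexical analyzer: reads characters until it can identify the next
   lexeme, then returns the corresponding token. (Stub shown here; a real
   scanner would consume its input character stream.) */
struct Token getNextToken(void) {
    static int calls = 0;
    struct Token t = { calls < 3 ? TOK_ID : TOK_EOF, calls + 1 };
    calls++;
    return t;
}

int main(void) {
    /* The parser repeatedly issues getNextToken commands. */
    for (struct Token t = getNextToken(); t.name != TOK_EOF; t = getNextToken())
        printf("parser received token (name=%d, attr=%d)\n", t.name, t.attribute);
    return 0;
}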
Why the analysis portion of a compiler is separated into lexical analysis and parsing (syntax
analysis) phases:
1. Simplicity of design: separating lexical analysis from syntax analysis simplifies each of the
two phases.
2. Improved compiler efficiency: a separate lexical analyzer can apply specialized techniques,
such as buffering of input characters, that speed up the compiler.
3. Enhanced compiler portability: input-device-specific peculiarities can be restricted to the
lexical analyzer.
Token: Token is a sequence of characters that can be treated as a single logical entity. Typical
tokens are,
1) Identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: A pattern is a rule describing the set of lexemes that can represent a particular token in
the source program.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and
semicolon.
Several pieces of information need to be maintained for an identifier - its lexeme, its type, and the
location at which it is first found - and these are kept in the symbol table. Thus, the appropriate
attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
Example
The Fortran statement E = M * C ** 2 is grouped into the token stream
(id, pointer to symbol-table entry for E) (assign_op) (id, pointer to entry for M) (mult_op)
(id, pointer to entry for C) (exp_op) (number, integer value 2).
Lexical Errors
A lexical error occurs when the lexical analyzer is unable to proceed because none of the patterns
for tokens matches any prefix of the remaining input.
The simplest recovery strategy is "panic mode" recovery:
- Delete successive characters from the remaining input, until the lexical analyzer can
find a well-formed token at the beginning of what input is left.
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Transformations like these may be tried in an attempt to repair the input. The simplest such
strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme
by a single transformation.
6. INPUT BUFFERING
To speed up the reading of source characters, the lexical analyzer often uses a two-buffer scheme.
The input is divided into two buffer halves of N characters each (N is usually the size of a disk
block). Two pointers are maintained: lexemeBegin marks the beginning of the current lexeme,
and forward scans ahead until a pattern match is found. A sentinel character (eof) is placed at the
end of each buffer half, so that advancing forward requires only a single test in the common
case; buffer-boundary handling is needed only when the sentinel is actually read.
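A sketch of the sentinel test in C, under the assumptions above (SENTINEL stands in for the eof character, and reload is an illustrative stand-in for reading the next disk block):

#include <stdio.h>
#include <string.h>

#define N 16            /* size of each buffer half; a disk block in practice */
#define SENTINEL '\0'   /* stands in for the eof sentinel of the scheme */

static char buf[2 * N + 2];   /* two halves, each followed by a sentinel slot */
static char *forward;

/* Illustrative refill: copy up to N source characters into a half and
   plant the sentinel right after them. */
static void reload(char *half, const char **src) {
    size_t n = strlen(*src);
    if (n > N) n = N;
    memcpy(half, *src, n);
    half[n] = SENTINEL;
    *src += n;
}

int main(void) {
    const char *source = "position = initial + rate * 60";
    reload(buf, &source);                 /* fill the first half */
    forward = buf;
    for (;;) {
        char c = *forward++;
        if (c != SENTINEL) {              /* the common case: one test only */
            putchar(c);                   /* hand the character to the scanner */
        } else if (forward - 1 == buf + N) {
            reload(buf + N + 1, &source); /* end of first half: refill second */
            forward = buf + N + 1;
        } else if (forward - 1 == buf + 2 * N + 1) {
            reload(buf, &source);         /* end of second half: refill first */
            forward = buf;
        } else {
            break;                        /* sentinel inside a half: end of input */
        }
    }
    putchar('\n');
    return 0;
}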
7. SPECIFICATION OF TOKENS
Expressing Tokens by Regular Expressions
Regular expressions are an important notation for specifying lexeme patterns.
Operations on Languages
In lexical analysis, the most important operations on languages are union, concatenation, and
closure.
Example
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. Then:
1. L ∪ D is the set of letters and digits - the language with 62 strings of length one.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L^4 is the set of all four-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter - the typical
form of identifiers.
6. D+ is the set of all strings of one or more digits.
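A quick arithmetic check of the first three counts, using |L| = 52 and |D| = 10:

#include <stdio.h>

int main(void) {
    int L = 52, D = 10;                      /* |L| = 52 letters, |D| = 10 digits */
    printf("|L U D| = %d\n", L + D);         /* union: 62 strings of length one */
    printf("|LD|    = %d\n", L * D);         /* concatenation: 520 strings */
    printf("|L^4|   = %d\n", L * L * L * L); /* all four-letter strings: 7311616 */
    return 0;
}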
Regular Expressions
Each regular expression r denotes a language L(r). The rules that define the regular expressions
over some alphabet Σ, and the languages that those expressions denote, are as follows.
BASIS:
There are two rules that form the basis:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty
string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with
one string, of length one, with a in its one position.
INDUCTION:
There are four parts to the induction whereby larger regular expressions are built from smaller
ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r); adding parentheses around an expression does not
change the language it denotes.
Example
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet
{a, b}. Another regular expression for the same language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all
strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}. Another regular expression for the same
language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings
consisting of zero or more a's and ending in b.
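Such languages can be checked directly with the POSIX regular-expression library; the sketch below tests the running example (a|b)*abb, written as an anchored extended regular expression:

#include <regex.h>
#include <stdio.h>

int main(void) {
    /* ERE for the running example (a|b)*abb, anchored to the whole string. */
    regex_t re;
    const char *tests[] = {"abb", "aabb", "babb", "ab", "abba"};
    if (regcomp(&re, "^(a|b)*abb$", REG_EXTENDED | REG_NOSUB) != 0) return 1;
    for (int i = 0; i < 5; i++)
        printf("%-5s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "accepted" : "rejected");
    regfree(&re);
    return 0;
}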
Regular Definitions
For notational convenience, we may give names to certain regular expressions and use those
names in subsequent expressions. A regular definition is a sequence of definitions of the form
d1 -> r1, d2 -> r2, ..., dn -> rn, where each di is a new symbol and each ri is a regular expression
over the alphabet together with the previously defined names. For example, identifiers can be
defined as:
letter -> A | B | ... | Z | a | b | ... | z
digit -> 0 | 1 | ... | 9
id -> letter ( letter | digit )*

8. DESIGN OF LEXICAL ANALYSIS (LEX)
Lex is a tool that allows a lexical analyzer to be specified by writing regular expressions to
describe patterns for tokens. A Lex program has three sections separated by %%: a declarations
section, a translation-rules section of the form pattern { action }, and a section of auxiliary
functions.
The third section holds whatever additional functions are used in the actions.
Alternatively, these functions can be compiled separately and loaded with the lexical
analyzer.
The lexical analyzer returns the token name to the parser, and uses the global variable yylval to
pass additional information about the lexeme found, if needed.
Example: A Lex program that recognizes a small set of typical tokens (keywords, identifiers,
numbers) and returns the token found.
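A minimal Lex specification in the classic style is sketched below; the token codes and the install functions are illustrative assumptions, not part of the original notes:

%{
/* Hypothetical token codes; in a real compiler these come from the
   parser (e.g. y.tab.h). installID/installNum are illustrative stubs. */
#define IF 258
#define THEN 259
#define ELSE 260
#define ID 261
#define NUMBER 262
int yylval;              /* attribute value passed along with the token */
int installID(void);
int installNum(void);
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      { /* no action and no return: whitespace is discarded */ }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { yylval = installID(); return ID; }
{number}  { yylval = installNum(); return NUMBER; }
%%
int installID(void)  { /* enter yytext (length yyleng) into the symbol table */ return 0; }
int installNum(void) { /* enter the number lexeme into a table of constants */ return 0; }

Running flex on this specification produces lex.yy.c, which can be compiled with cc lex.yy.c -lfl.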
Two variables that are set automatically by the lexical analyzer:
(a) yytext is a pointer to the beginning of the lexeme.
(b) yyleng is the length of the lexeme found.
10. CONVERSION OF REGULAR EXPRESSION TO NDFA (THOMPSON'S CONSTRUCTION)
METHOD: Begin by parsing the regular expression r into its constituent subexpressions. Then
build an NFA for each subexpression, working upward from the smallest subexpressions.
BASIS: For the expression ε, construct an NFA with a new start state i, a new accepting state f,
and an ε-transition from i to f. For any symbol a in Σ, construct the same two-state NFA with a
transition on a from i to f.
INDUCTION: Suppose N(s) and N(t) are NFAs for regular expressions s and t.
1. For s|t, create a new start state with ε-transitions to the start states of N(s) and N(t), and a
new accepting state with ε-transitions into it from the accepting states of N(s) and N(t).
2. For st, merge the accepting state of N(s) with the start state of N(t); the start state of N(s) and
the accepting state of N(t) become those of the combined NFA.
3. For s*, create new start and accepting states i and f; add ε-transitions from i to the start of
N(s), from the accepting state of N(s) to f, from i directly to f, and from the accepting state of
N(s) back to its start state.
Example
RE: (a|b)*abb
The expression is first decomposed into its subexpressions (shown by the parse tree for
(a|b)*abb), and NFAs are built bottom-up: for a and b, then a|b, then (a|b)*, then the
concatenations with a, b, and b. The result is an NFA with states 0 to 10, start state 0, and
accepting state 10.
11. CONVERSION OF NDFA TO DFA
ALGORITHM
INPUT: An NFA N
OUTPUT: A DFA D accepting the same language as N.
METHOD: The algorithm constructs a transition table Dtran for D. Each DFA state is a set of
NFA states. Three operations are used:
ε-closure(s): the set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T): the set of NFA states reachable from some state s in T on ε-transitions alone.
move(T, a): the set of NFA states to which there is a transition on input symbol a from some
state s in T.
The start state of D is ε-closure(s0), where s0 is the start state of N. While there is an unmarked
state T of D: mark T; for each input symbol a, let U = ε-closure(move(T, a)); if U is not yet a
state of D, add it unmarked; set Dtran[T, a] = U. The accepting states of D are those sets that
contain at least one accepting state of N.
Example: (a|b)*abb
Starting from the NFA above (start state 0, accepting state 10), the subset construction yields
the DFA states
A = {0,1,2,4,7}    B = {1,2,3,4,6,7,8}    C = {1,2,4,5,6,7}
D = {1,2,4,5,6,7,9}    E = {1,2,4,5,6,7,10}
Transition table Dtran for DFA D:

    State    a    b
    A        B    C
    B        B    D
    C        B    C
    D        B    E
    E        B    C

The start state of the DFA is A, and its only accepting state is E (it contains NFA state 10).
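The construction can be checked mechanically. The C sketch below hand-encodes the Thompson NFA for (a|b)*abb (states 0-10 as above) as bitmasks and runs the subset construction; it prints the same table, with E as the only accepting state:

#include <stdio.h>

/* Thompson NFA for (a|b)*abb: 11 states (0..10), start 0, accepting 10.
   eps[s] = bitmask of epsilon-successors of s; on[c][s] = bitmask of
   successors of s on symbol c (0 = 'a', 1 = 'b'). Encoded by hand. */
#define NSTATES 11
static const unsigned eps[NSTATES] = {
    1u<<1 | 1u<<7, 1u<<2 | 1u<<4, 0, 1u<<6, 0, 1u<<6, 1u<<1 | 1u<<7, 0, 0, 0, 0
};
static const unsigned on[2][NSTATES] = {
    /* a */ {0, 0, 1u<<3, 0, 0, 0, 0, 1u<<8, 0, 0, 0},
    /* b */ {0, 0, 0, 0, 1u<<5, 0, 0, 0, 1u<<9, 1u<<10, 0},
};

/* epsilon-closure of a set of NFA states, as a bitmask. */
static unsigned closure(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) set |= eps[s];
    } while (set != prev);
    return set;
}

/* move(T, c): states reachable from T on symbol c. */
static unsigned move(unsigned set, int c) {
    unsigned out = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1u << s)) out |= on[c][s];
    return out;
}

int main(void) {
    unsigned dstates[32];                        /* DFA states found so far */
    int dtran[32][2], ndfa = 0;
    dstates[ndfa++] = closure(1u << 0);          /* start state A */
    for (int i = 0; i < ndfa; i++) {             /* process unmarked states */
        for (int c = 0; c < 2; c++) {
            unsigned u = closure(move(dstates[i], c));
            int j;
            for (j = 0; j < ndfa; j++)
                if (dstates[j] == u) break;
            if (j == ndfa) dstates[ndfa++] = u;  /* new DFA state discovered */
            dtran[i][c] = j;
        }
    }
    printf("State  a  b  accepting?\n");
    for (int i = 0; i < ndfa; i++)
        printf("%c      %c  %c  %s\n", 'A' + i,
               'A' + dtran[i][0], 'A' + dtran[i][1],
               (dstates[i] & (1u << 10)) ? "yes" : "no");
    return 0;
}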
12. MINIMIZATION OF DFA
Algorithm
INPUT: A DFA D with set of states S, input alphabet Σ, start state s0, and set of accepting
states F.
OUTPUT: A DFA D' accepting the same language as D and having as few states as possible.
METHOD:
1. Start with an initial partition ∏ with two groups, F and S - F, the accepting and non-accepting
states of D.
2. Apply the following procedure to construct a new partition ∏new: for each group G of ∏,
partition G into subgroups such that two states s and t are in the same subgroup if and only if,
for all input symbols a, states s and t have transitions on a into the same group of ∏; replace G
in ∏new by the set of all subgroups formed.
3. If ∏new = ∏, let ∏final = ∏ and continue with step 4; otherwise, repeat step 2 with
∏ = ∏new.
4. Choose one state in each group of ∏final as the representative for that group. The
representatives are the states of the minimized DFA D'.
The initial partition consists of the two groups {A, B, C, D}{E}, which are respectively the
non-accepting states and the accepting states.
Consider the group {A, B, C, D}. On input a, each of these states goes to state B, so there is no
way to distinguish these states using strings that begin with a. On input b, states A, B, and C go
to members of group {A, B, C, D}, while state D goes to E, a member of another group. Thus, in
∏new, group {A, B, C, D} is split into {A, B, C}{D}, and ∏new for this round is {A, B, C}{D}{E}.
In the next round, we can split {A,B,C} into {A,C}{B}, since A and C each go to a member of {A,
B, C} on input b, while B goes to a member of another group, {D}. Thus, after the second round,
∏new = {A, C}{B}{D}{E}.
For the third round, we cannot split, since A and C each go to the same state on each input. We
conclude that ∏final = {A, C}{B}{D}{E}.
Construct the minimum-state DFA by choosing a representative for each group. Let us pick A, B,
D, and E as the representatives of these groups and replace C by A in the previous DFA table.
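The refinement rounds above can be reproduced with a short partition-refinement sketch in C; the DFA table is the one computed in the previous section, and group labels are assigned in first-seen order:

#include <stdio.h>
#include <string.h>

/* The DFA for (a|b)*abb from the previous section: states A..E = 0..4,
   transitions on inputs a (column 0) and b (column 1), accepting state E. */
#define NS 5
static const int trans[NS][2] = {{1,2},{1,3},{1,2},{1,4},{1,2}};
static const int accepting[NS] = {0,0,0,0,1};

int main(void) {
    int group[NS], next[NS];
    for (int s = 0; s < NS; s++)
        group[s] = accepting[s];         /* initial partition: S - F and F */
    for (;;) {
        /* Split: two states stay together iff they are in the same group
           and their transitions land in the same groups on every symbol. */
        int ngroups = 0;
        for (int s = 0; s < NS; s++) {
            int g = -1;
            for (int t = 0; t < s; t++)
                if (group[t] == group[s] &&
                    group[trans[t][0]] == group[trans[s][0]] &&
                    group[trans[t][1]] == group[trans[s][1]]) { g = next[t]; break; }
            next[s] = (g >= 0) ? g : ngroups++;
        }
        if (memcmp(group, next, sizeof group) == 0) break;   /* partition stable */
        memcpy(group, next, sizeof group);
    }
    /* Prints: A and C share a group; B, D, E are singletons. */
    for (int s = 0; s < NS; s++)
        printf("state %c is in group %d\n", 'A' + s, group[s]);
    return 0;
}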
13. CONTEXT-FREE GRAMMARS AND DERIVATIONS
A context-free grammar consists of terminals, nonterminals, a start symbol, and productions.
1. Terminals are the basic symbols from which strings are formed. The term "token name" is a
synonym for "terminal".
2. Nonterminals are syntactic variables that denote sets of strings. The sets of strings denoted by
nonterminals help define the language generated by the grammar.
3. In a grammar, one nonterminal is distinguished as the start symbol. By convention, the left
side of the first production is the start symbol.
4. The productions of a grammar specify the manner in which the terminals and nonterminals
can be combined to form strings. Each production consists of:
(a) A nonterminal called the head or left side of the production
(b) A body or right side consisting of zero or more terminals and nonterminals. The components
of the body describe one way in which strings of the nonterminal at the head can be constructed.
Notational Conventions
1. These symbols are terminals: lowercase letters early in the alphabet (a, b, c), operator symbols
such as + and *, punctuation symbols, digits, and boldface strings such as id or if.
2. These symbols are nonterminals: uppercase letters early in the alphabet (A, B, C), the letter S
(which, when it appears, is usually the start symbol), and lowercase names such as expr or stmt.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is,
either nonterminals or terminals.
4. Lowercase letters late in the alphabet, chiefly u,v,..., z, represent strings of terminals.
5. Lowercase Greek letters, α, β, γ for example, represent strings of grammar symbols. Thus, a
generic production can be written as A -> α, where A is the head and α the body.
6. A set of productions A -> α1, A -> α2, ..., A -> αk with a common head A (call them
A-productions) may be written as A -> α1 | α2 | ... | αk.
7. Unless stated otherwise, the head of the first production is the start symbol.
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> (E) | id
Derivations
Beginning with the start symbol, each step replaces a nonterminal by the body of one of its
productions; this sequence of replacements is called a derivation.
Top-down parsing corresponds to a leftmost derivation, whereas bottom-up parsing corresponds
to a rightmost derivation in reverse.
Example: Consider the following grammar
E -> E + E | E * E | - E | ( E ) | id
The string -(id) can be derived from E as follows:
E => -E => -(E) => -(id)
In the above derivation, the string -(id) is a sentence of the grammar. The strings -E, -(E) and
-(id) are all sentential forms of the grammar.
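The rewriting view of a derivation can be mechanized in a few lines; the sketch below hardcodes the production choices that derive -(id):

#include <stdio.h>
#include <string.h>

/* Replace the leftmost occurrence of the nonterminal E in the sentential
   form with the given production body, printing each derivation step. */
static void derive_step(char *form, const char *body) {
    char *p = strchr(form, 'E');   /* leftmost nonterminal */
    if (!p) return;
    char tail[64];
    strcpy(tail, p + 1);
    sprintf(p, "%s%s", body, tail);
    printf("=> %s\n", form);
}

int main(void) {
    char form[64] = "E";
    printf("%s\n", form);
    derive_step(form, "-E");       /* E -> - E   */
    derive_step(form, "(E)");      /* E -> ( E ) */
    derive_step(form, "id");       /* E -> id    */
    /* prints E => -E => -(E) => -(id), one sentential form per line */
    return 0;
}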
Leftmost Derivation: The derivation in which only the leftmost nonterminal is replaced at each
step.
Rightmost Derivation: The derivation in which only the rightmost nonterminal is replaced at
each step.
Left sentential form: If α is derived from the start symbol using a leftmost derivation, then α is
called a left sentential form of the grammar.
An analogous definition holds for right sentential forms of the grammar. The rightmost
derivation is sometimes called the canonical derivation.