Unit I SRM
Compilers – Analysis of the source program, Phases of a compiler – Cousins of the Compiler, Grouping
of Phases – Compiler construction tools, Lexical Analysis – Role of Lexical Analyzer, Input Buffering –
Specification of Tokens – design of lexical analysis (LEX), Finite automata (deterministic & non-
deterministic) – Conversion of regular expression to NDFA – Thompson's construction, Conversion of NDFA to DFA –
minimization of DFA, Derivation – parse tree – ambiguity
1. INTRODUCTION - COMPILER
A compiler is a program that can read a program in one language - the source language - and
translate it into an equivalent program in another language - the target language; see Fig. 1.1. An
important role of the compiler is to report any errors in the source program that it detects during
the translation process.
If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs; see Fig. 1.2.
An interpreter is another common kind of language processor. Instead of producing a target
program as a translation, an interpreter appears to directly execute the operations specified in the
source program on inputs supplied by the user, as shown in Fig. 1.3.
The differences between a compiler and an interpreter are summarized below:

    Compiler                                    Interpreter
    The program need not be compiled            The higher-level program is converted into
    every time it is run.                       lower-level form every time it is run.
    Errors are displayed after the              Errors are displayed for every
    entire program is checked.                  instruction interpreted (if any).
Java language processors combine compilation and interpretation. A Java source program may
first be compiled into an intermediate form called bytecodes. The bytecodes are then interpreted
by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one machine
can be interpreted on another machine, even across a network.
Just-in-time compilers: To achieve faster processing of inputs to outputs, a just-in-time
compiler can be used; it translates the bytecodes into platform-specific executable code that is
immediately executed.
Preprocessor
A preprocessor produces input to compilers. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are short hands for
longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
Compiler
The compiler may produce an assembly-language program as its output, because assembly
language is easier to produce as output and is easier to debug.
Assembler
An assembler translates assembly language programs into relocatable machine code. The output
of an assembler is called an object file, which contains a combination of machine instructions as
well as the data required to place these instructions in memory.
Linker
A linker or link editor is a computer program that takes one or more object files generated by
a compiler and combines them into a single executable file.
Large programs are often compiled in pieces, so the relocatable machine code may have to be
linked together with other relocatable object files and library files into the code that actually runs
on the machine.
The linker also resolves external memory addresses, where the code in one file may refer to a
location in another file.
Loader
The loader is responsible for loading executable files into memory and executing them. It
calculates the size of a program (instructions and data) and creates memory space for it. It also
initializes various registers to initiate execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform
(B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.
A compiler is broadly divided into two phases: the analysis phase and the synthesis phase.
Analysis Phase
Known as the front end of the compiler, the analysis phase reads the source program, divides it
into its core parts, and checks for lexical, grammar, and syntax errors. It produces an
intermediate representation of the source program and a symbol table, which are fed to the
synthesis phase as input.
Synthesis Phase
Known as the back end of the compiler, the synthesis phase generates the target program with
the help of the intermediate code representation and the symbol table.
If we examine the compilation process in more detail, it is partitioned into a number of sub-
processes called phases.
A phase is a logically interrelated operation that takes the source program in one representation
and produces output in another representation.
2. PHASES OF A COMPILER
The compilation process is a sequence of various phases. Each phase takes input from its
previous stage, has its own representation of source program, and feeds its output to the next
phase of the compiler. Let us understand the phases of a compiler.
Lexical Analysis
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters and groups the characters into meaningful sequences called lexemes. For
each lexeme, the lexical analyzer produces as output a token of the form
(token-name, attribute-value)
that it passes on to the subsequent phase, syntax analysis. In the token, the first component
token-name is an abstract symbol that is used during syntax analysis, and the second component
attribute-value points to an entry in the symbol table for this token. Information from the symbol-
table entry is needed for semantic analysis and code generation.
For example, suppose a source program contains the assignment statement

position = initial + rate * 60

The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract symbol
standing for identifier and 1 points to the symbol table entry for position. The symbol-table entry
for an identifier holds information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token needs
no attribute-value, the second component is omitted. Any abstract symbol such as assign could be
used for the token-name, but for notational convenience the lexeme itself is chosen as the name of
the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table
entry for initial.
4. The addition symbol + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry
for rate.
6. The multiplication symbol * is a lexeme that is mapped into the token (*).
7. The number 60 is a lexeme that is mapped into the token (60).
Blanks separating the lexemes would be discarded by the lexical analyzer. The output of the
lexical analyzer for this assignment is the token sequence:

(id, 1) (=) (id, 2) (+) (id, 3) (*) (60)
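To make the mapping from lexemes to tokens concrete, the following small C sketch (not part of the original notes; the token format and the 1-based symbol-table indices mirror the example above) scans the assignment statement and prints the token stream:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Tiny illustrative symbol table: return the index of a name, inserting it if new. */
static char table[10][32];
static int nsyms = 0;
static int install(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i], name) == 0) return i + 1;
    strcpy(table[nsyms++], name);
    return nsyms;                       /* 1-based index, as in the example */
}

int main(void) {
    const char *p = "position = initial + rate * 60";
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }    /* discard blanks */
        if (isalpha((unsigned char)*p)) {                     /* identifier lexeme */
            char buf[32]; int n = 0;
            while (isalnum((unsigned char)*p)) buf[n++] = *p++;
            buf[n] = '\0';
            printf("(id, %d) ", install(buf));
        } else if (isdigit((unsigned char)*p)) {              /* number lexeme */
            int v = 0;
            while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
            printf("(%d) ", v);
        } else {                                              /* operator lexeme */
            printf("(%c) ", *p++);
        }
    }
    printf("\n");   /* prints: (id, 1) (=) (id, 2) (+) (id, 3) (*) (60) */
    return 0;
}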
Syntax Analysis
The next phase is called syntax analysis or parsing. It takes the tokens produced by lexical
analysis as input and generates a parse tree (or syntax tree).
A typical representation is a syntax tree in which each interior node represents an operation and
the children of the node represent the arguments of the operation. It depicts the grammatical
structure of the token stream.
Semantic Analysis
The semantic analyser uses the syntax tree and the information in the symbol table to check the
source program for semantic consistency with the language definition. i.e. It checks whether the
parse tree constructed follows the rules of language.
An important part of semantic analysis is type checking, where the compiler checks that each
operator has matching operands. For example, a binary arithmetic operator may be applied to
either a pair of integers or to a pair of floating-point numbers. If the operator is applied to a
floating-point number and an integer, the compiler may convert or coerce the integer into a
floating-point number.
In the above example, assume that the variables position, initial, and rate have been declared to be
floating-point numbers, and that the lexeme 60 by itself forms an integer. The type checker in the
semantic analyser discovers that the operator * is applied to a floating-point number rate and an
integer 60. In this case, the integer may be converted into a floating-point number.
The output of the semantic analyzer has an extra node for the operator inttofloat
Intermediate Code Generation
After semantic analysis, the compiler generates an intermediate code of the source code for the
target machine. It represents a program for some abstract machine. It is in between the high-level
language and the machine language.
This intermediate representation should have two important properties: it should be easy to
produce and it should be easy to translate into the target machine.
A commonly used intermediate form is three-address code, which consists of a sequence of
assembly-like instructions with three operands per instruction. Each operand can act like a
register.
The output of the intermediate code generator for the above example is:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Each three-address assignment instruction has at most one operator on the right side.
The compiler must generate a temporary name to hold the value computed by a three-
address instruction.
Some "three-address instructions" have fewer than three operands.
Code Optimization
Optimization removes unnecessary code lines and rearranges the sequence of statements in order
to speed up the program execution without wasting resources (CPU, memory).
In the above example the conversion of 60 from integer to floating point is done once, so the
inttofloat operation can be eliminated by replacing the integer 60 by the floating-point number
60.0. Moreover, t3 is used only once to transmit its value to id1, so the optimizer can transform
the above intermediate code into the shorter sequence:

t1 = id3 * 60.0
id1 = id2 + t1
Code Generation
The code generator takes as input an intermediate representation of the source program and maps
it into the target language.
If the target language is machine code, registers or memory locations are selected for each of the
variables used by the program. Then, the intermediate instructions are translated into sequences
of machine instructions that perform the same task.
For example, using registers R1 and R2, the above intermediate code might get translated into
the machine code:

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
Symbol-Table Management
It is a data structure maintained throughout all the phases of a compiler. All the identifiers'
names along with their types are stored here. The symbol table makes it easier for the compiler to
quickly search for an identifier's record and retrieve it. The symbol table is also used for scope
management.
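A minimal sketch of such a table in C, assuming a single flat scope and a fixed-size linear table (production compilers typically use hash tables and scope chains; the names here are illustrative):

#include <stdio.h>
#include <string.h>

/* One entry per identifier: its lexeme and its type. */
struct entry { char name[32]; char type[16]; };

static struct entry symtab[100];
static int count = 0;

/* Return the index of name, or -1 if it is absent. */
int lookup(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(symtab[i].name, name) == 0) return i;
    return -1;
}

/* Insert name with its type if not already present; return its index. */
int insert(const char *name, const char *type) {
    int i = lookup(name);
    if (i >= 0) return i;
    strcpy(symtab[count].name, name);
    strcpy(symtab[count].type, type);
    return count++;
}

int main(void) {
    insert("position", "float");
    insert("initial", "float");
    insert("rate", "float");
    printf("rate is entry %d with type %s\n",
           lookup("rate"), symtab[lookup("rate")].type);
    return 0;
}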
3. GROUPING OF PHASES
Several phases may be grouped together into a pass that reads an input file and writes an
output file. For example, the front-end phases of lexical analysis, syntax analysis,
semantic analysis, and intermediate code generation might be grouped together into one
pass. Code optimization might be an optional pass. Then there could be a back-end pass
consisting of code generation for a particular target machine.
4. COMPILER CONSTRUCTION TOOLS
Several specialized tools are available to implement various phases of a compiler. Some
commonly used compiler-construction tools include
1. Parser generators that automatically produce syntax analyzers from a grammatical description
of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of the
tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse
tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a target
machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part of
code optimization.
5. ROLE OF THE LEXICAL ANALYZER
As the first phase of a compiler, the main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce a sequence of tokens.
The lexical analyzer also interacts with the symbol table: when it discovers a lexeme constituting
an identifier, it enters that lexeme into the symbol table.
Interactions between the lexical analyzer and the parser
The interaction is implemented by having the parser call the lexical analyzer. The parser issues
getNextToken command, the lexical analyzer reads the characters from its input until it can
identify the next lexeme and produces the corresponding token, which it returns to the parser.
It reads the input characters of the source program and generates the stream of tokens.
Stripping out comments and whitespace
It keeps track of newline characters and associates a line number with each error
message.
It also performs macro expansion.
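The getNextToken interaction described above can be sketched as follows; the Token structure and the stubbed scanner body are illustrative assumptions, not a real API:

#include <stdio.h>

/* Illustrative token representation. */
enum token_name { TOK_ID, TOK_EOF };
struct Token { enum token_name name; int attribute; };

/* The lexical analyzer: reads characters until it can identify the next
   lexeme, then returns the corresponding token. (Stub shown here; a real
   scanner would consume its input character stream.) */
struct Token getNextToken(void) {
    static int calls = 0;
    struct Token t = { calls < 3 ? TOK_ID : TOK_EOF, calls + 1 };
    calls++;
    return t;
}

int main(void) {
    /* The parser repeatedly issues getNextToken commands. */
    for (struct Token t = getNextToken(); t.name != TOK_EOF; t = getNextToken())
        printf("parser received token (name=%d, attr=%d)\n", t.name, t.attribute);
    return 0;
}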
Why the analysis portion of a compiler is separated into lexical analysis and parsing (syntax
analysis) phases:
1. Simplicity of design: separating lexical analysis from syntax analysis simplifies each of the
two phases.
2. Improved compiler efficiency: a separate lexical analyzer can apply specialized techniques,
such as buffering of input characters, that speed up the compiler.
3. Enhanced compiler portability: input-device-specific peculiarities can be restricted to the
lexical analyzer.
Token: Token is a sequence of characters that can be treated as a single logical entity. Typical
tokens are,
1) Identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: A pattern is a rule describing the set of lexemes that can represent a particular token in
the source program.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and
semicolon.
Several pieces of information need to be maintained for an identifier - its lexeme, its type, and the
location at which it is first found - and these are kept in the symbol table. Thus, the appropriate
attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
Example
The Fortran statement E = M * C ** 2 is grouped into the token stream
(id, pointer to symbol-table entry for E) (assign_op) (id, pointer to entry for M) (mult_op)
(id, pointer to entry for C) (exp_op) (number, integer value 2).
Lexical Errors
A lexical error occurs when the lexical analyzer is unable to proceed because none of the patterns
for tokens matches any prefix of the remaining input.
The simplest recovery strategy is "panic mode" recovery:
- Delete successive characters from the remaining input, until the lexical analyzer can
find a well-formed token at the beginning of what input is left.
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Transformations like these may be tried in an attempt to repair the input. The simplest such
strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme
by a single transformation.
6. INPUT BUFFERING
To speed up the reading of source characters, the lexical analyzer often uses a two-buffer scheme.
The input is divided into two buffer halves of N characters each (N is usually the size of a disk
block). Two pointers are maintained: lexemeBegin marks the beginning of the current lexeme,
and forward scans ahead until a pattern match is found. A sentinel character (eof) is placed at the
end of each buffer half, so that advancing forward requires only a single test in the common
case; buffer-boundary handling is needed only when the sentinel is actually read.
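A sketch of the sentinel test in C, under the assumptions above (SENTINEL stands in for the eof character, and reload is an illustrative stand-in for reading the next disk block):

#include <stdio.h>
#include <string.h>

#define N 16            /* size of each buffer half; a disk block in practice */
#define SENTINEL '\0'   /* stands in for the eof sentinel of the scheme */

static char buf[2 * N + 2];   /* two halves, each followed by a sentinel slot */
static char *forward;

/* Illustrative refill: copy up to N source characters into a half and
   plant the sentinel right after them. */
static void reload(char *half, const char **src) {
    size_t n = strlen(*src);
    if (n > N) n = N;
    memcpy(half, *src, n);
    half[n] = SENTINEL;
    *src += n;
}

int main(void) {
    const char *source = "position = initial + rate * 60";
    reload(buf, &source);                 /* fill the first half */
    forward = buf;
    for (;;) {
        char c = *forward++;
        if (c != SENTINEL) {              /* the common case: one test only */
            putchar(c);                   /* hand the character to the scanner */
        } else if (forward - 1 == buf + N) {
            reload(buf + N + 1, &source); /* end of first half: refill second */
            forward = buf + N + 1;
        } else if (forward - 1 == buf + 2 * N + 1) {
            reload(buf, &source);         /* end of second half: refill first */
            forward = buf;
        } else {
            break;                        /* sentinel inside a half: end of input */
        }
    }
    putchar('\n');
    return 0;
}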
7. SPECIFICATION OF TOKENS
Expressing Tokens by Regular Expressions
Regular expressions are an important notation for specifying lexeme patterns.
Operations on Languages
In lexical analysis, the most important operations on languages are union, concatenation, and
closure.
Example
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. Then:
1. L ∪ D is the set of letters and digits - the language with 62 strings of length one.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L^4 is the set of all four-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter - the typical
form of identifiers.
6. D+ is the set of all strings of one or more digits.
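A quick arithmetic check of the first three counts, using |L| = 52 and |D| = 10:

#include <stdio.h>

int main(void) {
    int L = 52, D = 10;                      /* |L| = 52 letters, |D| = 10 digits */
    printf("|L U D| = %d\n", L + D);         /* union: 62 strings of length one */
    printf("|LD|    = %d\n", L * D);         /* concatenation: 520 strings */
    printf("|L^4|   = %d\n", L * L * L * L); /* all four-letter strings: 7311616 */
    return 0;
}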
Regular Expressions
Each regular expression r denotes a language L(r). The rules that define the regular expressions
over some alphabet Σ, and the languages that those expressions denote, are as follows.
BASIS:
There are two rules that form the basis:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty
string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with
one string, of length one, with a in its one position.
INDUCTION:
There are four parts to the induction whereby larger regular expressions are built from smaller
ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r); adding parentheses around an expression does not
change the language it denotes.
Example
1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet
{a, b}. Another regular expression for the same language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all
strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}. Another regular expression for the same
language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings
consisting of zero or more a's and ending in b.
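Such languages can be checked directly with the POSIX regular-expression library; the sketch below tests the running example (a|b)*abb, written as an anchored extended regular expression:

#include <regex.h>
#include <stdio.h>

int main(void) {
    /* ERE for the running example (a|b)*abb, anchored to the whole string. */
    regex_t re;
    const char *tests[] = {"abb", "aabb", "babb", "ab", "abba"};
    if (regcomp(&re, "^(a|b)*abb$", REG_EXTENDED | REG_NOSUB) != 0) return 1;
    for (int i = 0; i < 5; i++)
        printf("%-5s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "accepted" : "rejected");
    regfree(&re);
    return 0;
}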
Regular Definitions
For notational convenience, we may give names to certain regular expressions and use those
names in subsequent expressions. A regular definition is a sequence of definitions of the form
d1 -> r1, d2 -> r2, ..., dn -> rn, where each di is a new symbol and each ri is a regular expression
over the alphabet together with the previously defined names. For example, identifiers can be
defined as:
letter -> A | B | ... | Z | a | b | ... | z
digit -> 0 | 1 | ... | 9
id -> letter ( letter | digit )*

8. DESIGN OF LEXICAL ANALYSIS (LEX)
Lex is a tool that allows a lexical analyzer to be specified by writing regular expressions to
describe patterns for tokens. A Lex program has three sections separated by %%: a declarations
section, a translation-rules section of the form pattern { action }, and a section of auxiliary
functions.
The third section holds whatever additional functions are used in the actions.
Alternatively, these functions can be compiled separately and loaded with the lexical
analyzer.
The lexical analyzer returns the token name to the parser, and uses the global variable yylval to
pass additional information about the lexeme found, if needed.
Example: A Lex program that recognizes a small set of typical tokens (keywords, identifiers,
numbers) and returns the token found.
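A minimal Lex specification in the classic style is sketched below; the token codes and the install functions are illustrative assumptions, not part of the original notes:

%{
/* Hypothetical token codes; in a real compiler these come from the
   parser (e.g. y.tab.h). installID/installNum are illustrative stubs. */
#define IF 258
#define THEN 259
#define ELSE 260
#define ID 261
#define NUMBER 262
int yylval;              /* attribute value passed along with the token */
int installID(void);
int installNum(void);
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      { /* no action and no return: whitespace is discarded */ }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { yylval = installID(); return ID; }
{number}  { yylval = installNum(); return NUMBER; }
%%
int installID(void)  { /* enter yytext (length yyleng) into the symbol table */ return 0; }
int installNum(void) { /* enter the number lexeme into a table of constants */ return 0; }

Running flex on this specification produces lex.yy.c, which can be compiled with cc lex.yy.c -lfl.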
Two variables that are set automatically by the lexical analyzer:
(a) yytext is a pointer to the beginning of the lexeme.
(b) yyleng is the length of the lexeme found.
10. CONVERSION OF REGULAR EXPRESSION TO NDFA (THOMPSON'S CONSTRUCTION)
METHOD: Begin by parsing the regular expression r into its constituent subexpressions. Then
build an NFA for each subexpression, working upward from the smallest subexpressions.
BASIS: For the expression ε, construct an NFA with a new start state i, a new accepting state f,
and an ε-transition from i to f. For any symbol a in Σ, construct the same two-state NFA with a
transition on a from i to f.
INDUCTION: Suppose N(s) and N(t) are NFAs for regular expressions s and t.
1. For s|t, create a new start state with ε-transitions to the start states of N(s) and N(t), and a
new accepting state with ε-transitions into it from the accepting states of N(s) and N(t).
2. For st, merge the accepting state of N(s) with the start state of N(t); the start state of N(s) and
the accepting state of N(t) become those of the combined NFA.
3. For s*, create new start and accepting states i and f; add ε-transitions from i to the start of
N(s), from the accepting state of N(s) to f, from i directly to f, and from the accepting state of
N(s) back to its start state.
Example
RE: (a|b)*abb
The expression is first decomposed into its subexpressions (shown by the parse tree for
(a|b)*abb), and NFAs are built bottom-up: for a and b, then a|b, then (a|b)*, then the
concatenations with a, b, and b. The result is an NFA with states 0 to 10, start state 0, and
accepting state 10.
11. CONVERSION OF NDFA TO DFA
ALGORITHM
INPUT: An NFA N
OUTPUT: A DFA D accepting the same language as N.
METHOD: The algorithm constructs a transition table Dtran for D. Each DFA state is a set of
NFA states. Three operations are used:
ε-closure(s): the set of NFA states reachable from NFA state s on ε-transitions alone.
ε-closure(T): the set of NFA states reachable from some state s in T on ε-transitions alone.
move(T, a): the set of NFA states to which there is a transition on input symbol a from some
state s in T.
The start state of D is ε-closure(s0), where s0 is the start state of N. While there is an unmarked
state T of D: mark T; for each input symbol a, let U = ε-closure(move(T, a)); if U is not yet a
state of D, add it unmarked; set Dtran[T, a] = U. The accepting states of D are those sets that
contain at least one accepting state of N.
Example: (a|b)*abb
Starting from the NFA above (start state 0, accepting state 10), the subset construction yields
the DFA states
A = {0,1,2,4,7}    B = {1,2,3,4,6,7,8}    C = {1,2,4,5,6,7}
D = {1,2,4,5,6,7,9}    E = {1,2,4,5,6,7,10}
Transition table Dtran for DFA D:

    State    a    b
    A        B    C
    B        B    D
    C        B    C
    D        B    E
    E        B    C

The start state of the DFA is A, and its only accepting state is E (it contains NFA state 10).
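The construction can be checked mechanically. The C sketch below hand-encodes the Thompson NFA for (a|b)*abb (states 0-10 as above) as bitmasks and runs the subset construction; it prints the same table, with E as the only accepting state:

#include <stdio.h>

/* Thompson NFA for (a|b)*abb: 11 states (0..10), start 0, accepting 10.
   eps[s] = bitmask of epsilon-successors of s; on[c][s] = bitmask of
   successors of s on symbol c (0 = 'a', 1 = 'b'). Encoded by hand. */
#define NSTATES 11
static const unsigned eps[NSTATES] = {
    1u<<1 | 1u<<7, 1u<<2 | 1u<<4, 0, 1u<<6, 0, 1u<<6, 1u<<1 | 1u<<7, 0, 0, 0, 0
};
static const unsigned on[2][NSTATES] = {
    /* a */ {0, 0, 1u<<3, 0, 0, 0, 0, 1u<<8, 0, 0, 0},
    /* b */ {0, 0, 0, 0, 1u<<5, 0, 0, 0, 1u<<9, 1u<<10, 0},
};

/* epsilon-closure of a set of NFA states, as a bitmask. */
static unsigned closure(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) set |= eps[s];
    } while (set != prev);
    return set;
}

/* move(T, c): states reachable from T on symbol c. */
static unsigned move(unsigned set, int c) {
    unsigned out = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1u << s)) out |= on[c][s];
    return out;
}

int main(void) {
    unsigned dstates[32];                        /* DFA states found so far */
    int dtran[32][2], ndfa = 0;
    dstates[ndfa++] = closure(1u << 0);          /* start state A */
    for (int i = 0; i < ndfa; i++) {             /* process unmarked states */
        for (int c = 0; c < 2; c++) {
            unsigned u = closure(move(dstates[i], c));
            int j;
            for (j = 0; j < ndfa; j++)
                if (dstates[j] == u) break;
            if (j == ndfa) dstates[ndfa++] = u;  /* new DFA state discovered */
            dtran[i][c] = j;
        }
    }
    printf("State  a  b  accepting?\n");
    for (int i = 0; i < ndfa; i++)
        printf("%c      %c  %c  %s\n", 'A' + i,
               'A' + dtran[i][0], 'A' + dtran[i][1],
               (dstates[i] & (1u << 10)) ? "yes" : "no");
    return 0;
}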
12. MINIMIZATION OF DFA
Algorithm
INPUT: A DFA D with set of states S, input alphabet Σ, start state s0, and set of accepting
states F.
OUTPUT: A DFA D' accepting the same language as D and having as few states as possible.
METHOD:
1. Start with an initial partition ∏ with two groups, F and S - F, the accepting and non-accepting
states of D.
2. Apply the following procedure to construct a new partition ∏new: for each group G of ∏,
partition G into subgroups such that two states s and t are in the same subgroup if and only if,
for all input symbols a, states s and t have transitions on a into the same group of ∏; replace G
in ∏new by the set of all subgroups formed.
3. If ∏new = ∏, let ∏final = ∏ and continue with step 4; otherwise, repeat step 2 with
∏ = ∏new.
4. Choose one state in each group of ∏final as the representative for that group. The
representatives are the states of the minimized DFA D'.
The initial partition consists of the two groups {A, B, C, D}{E}, which are respectively the
non-accepting states and the accepting states.
Consider the group {A, B, C, D}. On input a, each of these states goes to state B, so there is no
way to distinguish these states using strings that begin with a. On input b, states A, B, and C go
to members of group {A, B, C, D}, while state D goes to E, a member of another group. Thus, in
∏new, group {A, B, C, D} is split into {A, B, C}{D}, and ∏new for this round is {A, B, C}{D}{E}.
In the next round, we can split {A,B,C} into {A,C}{B}, since A and C each go to a member of {A,
B, C} on input b, while B goes to a member of another group, {D}. Thus, after the second round,
∏new = {A, C}{B}{D}{E}.
For the third round, we cannot split, since A and C each go to the same state on each input. We
conclude that ∏final = {A, C}{B}{D}{E}.
Construct the minimum-state DFA by choosing a representative for each group. Let us pick A, B,
D, and E as the representatives of these groups and replace C by A in the previous DFA table.
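The refinement rounds above can be reproduced with a short partition-refinement sketch in C; the DFA table is the one computed in the previous section, and group labels are assigned in first-seen order:

#include <stdio.h>
#include <string.h>

/* The DFA for (a|b)*abb from the previous section: states A..E = 0..4,
   transitions on inputs a (column 0) and b (column 1), accepting state E. */
#define NS 5
static const int trans[NS][2] = {{1,2},{1,3},{1,2},{1,4},{1,2}};
static const int accepting[NS] = {0,0,0,0,1};

int main(void) {
    int group[NS], next[NS];
    for (int s = 0; s < NS; s++)
        group[s] = accepting[s];         /* initial partition: S - F and F */
    for (;;) {
        /* Split: two states stay together iff they are in the same group
           and their transitions land in the same groups on every symbol. */
        int ngroups = 0;
        for (int s = 0; s < NS; s++) {
            int g = -1;
            for (int t = 0; t < s; t++)
                if (group[t] == group[s] &&
                    group[trans[t][0]] == group[trans[s][0]] &&
                    group[trans[t][1]] == group[trans[s][1]]) { g = next[t]; break; }
            next[s] = (g >= 0) ? g : ngroups++;
        }
        if (memcmp(group, next, sizeof group) == 0) break;   /* partition stable */
        memcpy(group, next, sizeof group);
    }
    /* Prints: A and C share a group; B, D, E are singletons. */
    for (int s = 0; s < NS; s++)
        printf("state %c is in group %d\n", 'A' + s, group[s]);
    return 0;
}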
13. CONTEXT-FREE GRAMMARS AND DERIVATIONS
A context-free grammar consists of terminals, nonterminals, a start symbol, and productions.
1. Terminals are the basic symbols from which strings are formed. The term "token name" is a
synonym for "terminal".
2. Nonterminals are syntactic variables that denote sets of strings. The sets of strings denoted by
nonterminals help define the language generated by the grammar.
3. In a grammar, one nonterminal is distinguished as the start symbol. By convention, the left
side of the first production is the start symbol.
4. The productions of a grammar specify the manner in which the terminals and nonterminals
can be combined to form strings. Each production consists of:
(a) A nonterminal called the head or left side of the production
(b) A body or right side consisting of zero or more terminals and nonterminals. The components
of the body describe one way in which strings of the nonterminal at the head can be constructed.
Notational Conventions
1. These symbols are terminals: lowercase letters early in the alphabet (a, b, c), operator symbols
such as + and *, punctuation symbols, digits, and boldface strings such as id or if.
2. These symbols are nonterminals: uppercase letters early in the alphabet (A, B, C), the letter S
(which, when it appears, is usually the start symbol), and lowercase names such as expr or stmt.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is,
either nonterminals or terminals.
4. Lowercase letters late in the alphabet, chiefly u,v,..., z, represent strings of terminals.
5. Lowercase Greek letters, α, β, γ for example, represent strings of grammar symbols. Thus, a
generic production can be written as A -> α, where A is the head and α the body.
6. A set of productions A -> α1, A -> α2, ..., A -> αk with a common head A (call them
A-productions) may be written as A -> α1 | α2 | ... | αk.
7. Unless stated otherwise, the head of the first production is the start symbol.
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> (E) | id
Derivations
Beginning with the start symbol, each step replaces a nonterminal by the body of one of its
productions; this sequence of replacements is called a derivation.
Top-down parsing corresponds to a leftmost derivation, whereas bottom-up parsing corresponds
to a rightmost derivation in reverse.
Example: Consider the following grammar
E -> E + E | E * E | - E | ( E ) | id
The string -(id) can be derived from E as follows:
E => -E => -(E) => -(id)
In the above derivation, the string -(id) is a sentence of the grammar. The strings -E, -(E) and
-(id) are all sentential forms of the grammar.
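The rewriting view of a derivation can be mechanized in a few lines; the sketch below hardcodes the production choices that derive -(id):

#include <stdio.h>
#include <string.h>

/* Replace the leftmost occurrence of the nonterminal E in the sentential
   form with the given production body, printing each derivation step. */
static void derive_step(char *form, const char *body) {
    char *p = strchr(form, 'E');   /* leftmost nonterminal */
    if (!p) return;
    char tail[64];
    strcpy(tail, p + 1);
    sprintf(p, "%s%s", body, tail);
    printf("=> %s\n", form);
}

int main(void) {
    char form[64] = "E";
    printf("%s\n", form);
    derive_step(form, "-E");       /* E -> - E   */
    derive_step(form, "(E)");      /* E -> ( E ) */
    derive_step(form, "id");       /* E -> id    */
    /* prints E => -E => -(E) => -(id), one sentential form per line */
    return 0;
}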
Leftmost Derivation: The derivation in which only the leftmost nonterminal is replaced at each
step.
Rightmost Derivation: The derivation in which only the rightmost nonterminal is replaced at
each step.
Left sentential form: If α is derived from the start symbol using a leftmost derivation, then α is
called a left sentential form of the grammar.
An analogous definition holds for right sentential forms of the grammar. The rightmost
derivation is sometimes called the canonical derivation.