
Automata Theory and Compiler Design


Dr. N. G. Goudru
Professor
Department of ISE
Sambhram Institute of Technology
Bangalore
MODULE – 1

Text Book: “Compilers: Principles, Techniques, and Tools” by Alfred V. Aho,
Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, Second edition,
Pearson publication.

Chapter - 1: 1.1, 1.2.


Introduction

 Programs running on a computer are written in some
programming language, called a high-level language.

 Before executing a program, the computer first translates the source
program into an assembly-level or machine-level
language that can be executed by the computer.

 The software that does this translation is called a compiler.


A Compiler

 A compiler is software that reads a program, called the source
program, and translates it into an equivalent machine
program, called the target program.

 An important task of the compiler is to report the errors
detected in the source program during the translation process.

Target program

 The target program is an executable machine-level language
program.

 The user calls the target program to process the input and
produce output.
Interpreter

 An interpreter is also a language processor.

 It directly executes the operations specified in the source
program on the input given by the user.

 It executes the source program statement by statement, and
also detects errors in the source program.
A Language processing system
A source program is translated into an equivalent target machine
code or executable machine-level language program using the
following steps.

(i) A source program is divided into modules and stored. The pre-
processor collects the source files and expands the required macros in
the source language statements. The modified source program is
then fed into a compiler.

(ii) The compiler, after compiling the modified source program,
produces an assembly-level language program as its output.

(iii) The assembly program is then processed by a software tool called
an assembler that produces machine code as its output.

(iv) The machine code is linked together with other object files
and library files by the linker.

(v) The loader then puts all the executable object files into
memory for execution.
Phases of Compiler OR Structure of Compiler

The symbol table contains a record for each variable, such as its
storage location, type, and scope.

A compiler is a software program that converts high-level source code
into low-level machine code that can be executed by the computer. The
process of conversion has the following phases.

(i) Lexical analyzer

The first phase of the compiler is lexical analysis, also known as scanning. It
reads the source code and breaks it into a stream of tokens, the basic units
of the programming language. The token stream is then passed on
to the syntax analyzer.

(ii) Syntax analyzer

The syntax analyzer performs syntax analysis, also known as parsing. The syntax
analyzer takes the stream of tokens generated by the lexical analyzer and checks
it against the grammar of the programming language. It checks whether the source
code is syntactically correct or not, and ensures that variables in the program are
used correctly. The outcome of this phase is called the syntax tree.

(iii) Semantic analyzer

It checks for semantic errors, such as undeclared variables and incorrect function
calls, if any.

(iv) Intermediate code generator

This phase generates an intermediate representation of the source code that
can be easily translated into machine code.

(v) Optimizer

The optimizer applies various optimization techniques to the intermediate code to
improve the performance of the machine code.

(vi) Code generator

This phase takes the optimized intermediate code and generates the actual machine
code that can be executed by the computer.
1) Lexical Analysis

 The first phase of the compiler is called lexical analysis or
scanning.

 The lexical analyser reads the stream of characters and groups
the characters into meaningful sequences called lexemes.

 For each lexeme, the lexical analyser produces an output called a
token.

 The format of a token is: <token-name, attribute-value>, where
token-name is an abstract symbol, and
attribute-value points to the entry in the symbol table (storage location,
type, scope, etc.).
Example 1: Construct the sequence of tokens for a source program
having the assignment statement,
position = initial + rate * 60
Answer:
The lexical analyser groups the statement into lexemes as follows:
1) position is a lexeme and is mapped into the token <id, 1>
2) = is a lexeme and is mapped into the token <=>
3) initial is a lexeme and is mapped into the token <id, 2>
4) + is a lexeme and is mapped into the token <+>
5) rate is a lexeme and is mapped into the token <id, 3>
6) * is a lexeme and is mapped into the token <*>
7) 60 is a lexeme and is mapped into the token <60>

The sequence of tokens for the assignment statement is:
<id,1> <=> <id,2> <+> <id,3> <*> <60>

Note: For operators, punctuation, and keywords there is no need for an
attribute value.
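The grouping above can be sketched in Python. This is a minimal illustration, not the textbook's implementation: the pattern table, the `tokenize` name, and the tuple representation of tokens are all choices made here for clarity. Identifiers get a symbol-table entry number as their attribute value; operators and constants carry only their lexeme.

```python
import re

# Token patterns, tried in order: identifiers, numbers, then operators.
TOKEN_SPEC = [
    ("id", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("num", r"\d+"),
    ("op", r"[=+\-*/]"),
    ("ws", r"\s+"),
]

def tokenize(source):
    """Group characters into lexemes and map each lexeme to a token.

    Identifiers get an attribute value: their entry number in the
    symbol table. Other lexemes carry the lexeme text itself.
    """
    symbol_table = {}
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                lexeme = m.group()
                if name == "id":
                    entry = symbol_table.setdefault(lexeme, len(symbol_table) + 1)
                    tokens.append(("id", entry))
                elif name != "ws":          # whitespace produces no token
                    tokens.append((lexeme,))
                pos += len(lexeme)
                break
        else:
            raise SyntaxError(f"illegal character {source[pos]!r}")
    return tokens

print(tokenize("position = initial + rate * 60"))
# [('id', 1), ('=',), ('id', 2), ('+',), ('id', 3), ('*',), ('60',)]
```

The output matches the token sequence <id,1> <=> <id,2> <+> <id,3> <*> <60> derived by hand above.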
2) Syntax Analysis

 Syntax analysis is the second phase of the compiler. It is also
called parsing.
 The parser takes the token stream and creates the syntax tree.
 In the syntax tree, a parent node represents an operator and a
child node represents a token.
 During evaluation of the token stream, it follows the usual
precedence convention, where * binds tighter than +, which binds
tighter than =.

<id,1> <=> <id,2> <+> <id,3> <*> <60>

Syntax tree
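The tree construction with the usual precedence can be sketched with a small precedence-climbing parser. This is an illustrative sketch, not the parser a real compiler would use: tokens are plain strings, and the tree is a nested tuple whose first element is the operator.

```python
# Operator precedence, as on the slide: = lowest, then +, then *.
PREC = {"=": 1, "+": 2, "*": 3}

def parse(tokens):
    """Precedence-climbing parser: returns a nested-tuple syntax tree
    whose parent nodes are operators and whose leaves are tokens."""
    def expr(min_prec, pos):
        node, pos = tokens[pos], pos + 1
        while pos < len(tokens) and PREC.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            # Parse the right operand at a strictly higher precedence,
            # so lower-precedence operators stay above in the tree.
            rhs, pos = expr(PREC[op] + 1, pos + 1)
            node = (op, node, rhs)
        return node, pos
    tree, _ = expr(1, 0)
    return tree

print(parse(["id1", "=", "id2", "+", "id3", "*", "60"]))
# ('=', 'id1', ('+', 'id2', ('*', 'id3', '60')))
```

The * node is lowest in the tree, so it is evaluated first, then +, then =, as the precedence convention requires.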
3) Semantic analysis
 The semantic analyser uses the syntax tree and the information in
the symbol table to check for semantic errors and saves the results in
the syntax tree or symbol table.
 The semantic analyser also performs
(i) type checking,
(ii) checking that each operator has matching operands,
(iii) checking conformance with array type declarations, etc.

For example, inttofloat explicitly converts an int into a floating-point
number.


4) Intermediate code generator

 After semantic analysis, the compiler generates low-level
intermediate code.

 To generate the intermediate code, the generator uses the three-
address code form, which has at most three operands per instruction.

 It also fixes the order in which operations are to be done.
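Generation of three-address code from a syntax tree can be sketched as follows. This is a simplified illustration under assumptions made here: the tree is the nested-tuple form from the parsing example, the helper name `to_tac` is hypothetical, and type conversions such as inttofloat are omitted.

```python
def to_tac(tree):
    """Flatten a nested-tuple syntax tree into three-address code:
    at most one operator and three addresses per instruction."""
    code = []
    counter = [0]
    def walk(node):
        if not isinstance(node, tuple):
            return node                      # a leaf: an id or a constant
        op, lhs, rhs = node
        l, r = walk(lhs), walk(rhs)
        if op == "=":
            code.append(f"{l} = {r}")
            return l
        counter[0] += 1                      # fresh temporary for the result
        t = f"t{counter[0]}"
        code.append(f"{t} = {l} {op} {r}")
        return t
    walk(tree)
    return code

print(to_tac(("=", "id1", ("+", "id2", ("*", "id3", "60")))))
# ['t1 = id3 * 60', 't2 = id2 + t1', 'id1 = t2']
```

Each instruction has one operator and at most three addresses, and the order of the instructions fixes the order in which operations are done.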


5) Code Optimizer
 Code optimization generates optimized code that helps
improve the execution time, storage requirements, etc. of the program.

 The optimized form of the intermediate code is:


6) Code generator

 The code generator takes the optimized intermediate code as
input.
 It assigns registers or memory locations to each of the variables.

 For example, using registers R1 and R2, the optimized intermediate
code is translated into machine code as follows:
LDF - load data to a floating-point register;
STF - store data from a floating-point register
Translation of an assignment statement into machine code by the
compiler: position = initial + rate * 60
Compiler construction tools
Compilers use specialized tools to implement their various phases.
Some commonly used compiler construction tools are:
1) Parser generator - produces a syntax analyzer.
2) Scanner generator - produces a lexical analyzer.
3) Syntax-directed translation engine - produces a collection of routines
for traversing a parse tree.
4) Code-generator generator - produces a code generator from a collection
of rules.
5) Data-flow analysis engine - facilitates the gathering of
information on how values are transmitted from one part of a
program to other parts of the program.
6) Compiler-construction toolkit - produces an integrated set of
routines for constructing the various phases of a compiler.

End of Module-1
MODULE – 2: Lexical analysis phase of compiler design
The role of the lexical analyzer
 The main task of the lexical analyzer is to identify lexemes.
 It reads the input characters of the source program, groups
them into lexemes, and produces as output a sequence of tokens, one
for each lexeme in the source program.
 The stream of tokens is sent to the parser for syntax analysis.
 When the lexical analyzer discovers a lexeme constituting an
identifier, it enters that lexeme into the symbol table.
 Another important task of the lexical analyzer is the removal of
comments, white space, newlines, tabs, etc.

Interaction between the lexical analyzer and the parser

The reasons why the lexical analysis phase and the syntax (parsing)
phase are separated in compiler design are:

 to ease design complexity,
 to improve the performance and efficiency of the compiler,
 to enhance compiler portability.
Tokens, Patterns & Lexemes
Lexeme
 The lexer takes a stream of characters and produces a stream of
tokens.
 The lexer partitions the string,
 reading from left to right,
 recognizing one token at a time.
 A lexeme is represented as <class, string>.

Example: if(x>=y) ← stream of characters

Lexemes: if, (, x, >=, y, )
Lexeme representation: <keyword, “if”>
Token stream:
<keyword, “if”> <LPAREN, “(”> <id, ”x”> <op, ”>=”> <id, ”y”>
<RPAREN, “)”>
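The left-to-right partitioning into <class, string> pairs can be sketched as below. The class names and pattern table are illustrative choices, not a fixed standard; note that keywords must be checked before general identifiers, and the two-character lexeme >= must be tried before the single-character >.

```python
import re

KEYWORDS = {"if", "then", "else", "while"}

# Lexeme classes, tried in order: multi-character operators first.
SPEC = [
    ("op", r">=|<=|==|[=+\-*/<>]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("id", r"[A-Za-z_]\w*"),
    ("ws", r"\s+"),
]

def lex(source):
    """Partition the input string left to right, one lexeme at a time,
    and represent each lexeme as a <class, string> pair."""
    out, pos = [], 0
    while pos < len(source):
        for cls, pat in SPEC:
            m = re.match(pat, source[pos:])
            if m:
                s = m.group()
                if cls == "id" and s in KEYWORDS:
                    cls = "keyword"          # keywords outrank identifiers
                if cls != "ws":              # whitespace yields no pair
                    out.append((cls, s))
                pos += len(s)
                break
        else:
            raise SyntaxError(f"illegal character {source[pos]!r}")
    return out

print(lex("if(x>=y)"))
# [('keyword', 'if'), ('LPAREN', '('), ('id', 'x'), ('op', '>='), ('id', 'y'), ('RPAREN', ')')]
```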
Token

 A token is a two-component representation denoted as
<token-name, attribute-value>.

 The token name is an abstract symbol representing the class of the
lexeme, for example keywords, identifiers, etc.

 Token names are the input symbols for the parser.


Classification of tokens

1) Keyword class: keywords like if, then, else, etc., belong to the
keyword class.
2) Identifier class: variables declared in the program, like var,
var1, sum, count, etc., belong to the identifier class.
3) Constant class: constants like 2, 5, -4, 5.4, etc., belong to the
constant class.
4) Operator class: symbols like (, ), [, ], <=, >=, =, etc., belong to
the operator class.
5) Delimiter class: punctuation marks like ;, :, “ ”, /, etc., belong
to the delimiter class.
6) White space class: blank space, \n, \t, etc.
Examples of tokens

 One token for each keyword. The pattern is the same as the
keyword itself.
 One token for each operator.
 One token representing all identifiers.
 One or more tokens representing constants such as numbers
and literal strings.
 One token for each punctuation symbol, such as left
parenthesis, right parenthesis, comma, and semicolon.
Attributes for Tokens

 The lexical analyser returns to the parser the token name and the
attribute values describing the lexeme represented by the
token.

 For example, the information (attributes) about an identifier is
its lexeme, its type, its entry location in the symbol table, its
scope, etc.

 For operators, punctuation, and keywords, there is no need for an
attribute value.

Example:
The token names and associated attribute values for the FORTRAN
statement E = M * C ** 2.

The sequence of pairs is:

 For example, the token number is given an integer-valued attribute.
Lexical Error

Consider the source-code statement fi(a==f(x)), where fi is a
misspelling of the keyword if.

 The lexical analyzer cannot tell whether fi is a misspelled keyword
or an undeclared function identifier.

 Since fi is a valid lexeme for the token id, the lexical analyzer
returns the token id to the parser, and in this case the parser handles
the error.
Input Buffering
 The program is stored on the hard disk.
 To read a token, the lexical analyzer uses two pointers.
 The first pointer is the lexemeBegin pointer and the second is the
forward pointer.
Example:
int main()
{

}
This program statement is stored in memory as follows:

lexemeBegin

i n t m a i n ( ) { }

forward
Buffering takes place as follows:
 The lexemeBegin pointer points to the first character of the
current lexeme.
 int is a token.
 The forward pointer is placed at the character i and moves to the
next character n; after reading t, the pointer encounters a blank
space and assumes that it is the end of the token.
 After reading the first token, both the lexemeBegin and forward
pointers move to the first character of the second token.

lexemeBegin

i n t m a i n ( ) { }

forward
Problems with this method of buffering

 To read each character from the hard disk, the processor uses one
system call.

 Suppose there are 1000 characters in a program; the system then
uses 1000 system calls, which is an overhead on system
performance.

 To overcome this problem, compilers use the following buffering
technique.
Buffering method 2

 A block of characters is read into the buffer in only one
system call.
 It is implemented in two ways:

(i) the one-buffer scheme,
(ii) the two-buffer scheme.
1) One-buffer scheme

 It uses only one buffer block to read the string; the block size is,
for example, 4096 bytes.

 The problem with this method is that when the input string is
larger than the buffer block, the buffer fails to store the whole
string.

 Whenever the forward pointer encounters the eof character, it
identifies that the buffer is full.
2) Two-buffer scheme
 It uses two buffer blocks.
 After the first buffer is completely filled, it fills the second
buffer.
 To determine whether a buffer is filled or not, a special
character eof is used.
 The other name for eof is the sentinel character. eof is not a part
of the source program.
 Whenever the forward pointer encounters eof, the first buffer is
full, and the lexemeBegin and forward pointers move to the second
buffer. While the second buffer is being scanned, the content of the
first buffer can be overwritten with the next block.
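The scheme can be sketched with a simplified simulation. This is an illustration of the idea only: a real scanner works on raw bytes with pointer arithmetic, and the tiny buffer size, the `read_lexemes` name, and the whitespace-only lexeme rule are assumptions made here. The key point shown is that a lexeme in progress survives a buffer refill because only the sentinel, not the end of input, is seen at the block boundary.

```python
EOF = "\0"       # sentinel appended at the end of each buffer load
BUF_SIZE = 8     # tiny for illustration; real compilers use e.g. 4096

def read_lexemes(source):
    """Simplified two-buffer scan: the source is loaded one block at a
    time, each block terminated by a sentinel, so the forward pointer
    needs only one end-of-buffer test per character."""
    # Split the source into sentinel-terminated blocks, simulating
    # one disk read (one system call) per block.
    blocks = [source[i:i + BUF_SIZE] + EOF
              for i in range(0, len(source), BUF_SIZE)]
    lexemes, current = [], ""
    for block in blocks:
        for ch in block:
            if ch == EOF:            # sentinel: refill from the next block
                break
            if ch.isspace():         # whitespace ends the current lexeme
                if current:
                    lexemes.append(current)
                current = ""
            else:
                current += ch        # lexeme may span a buffer boundary
    if current:
        lexemes.append(current)
    return lexemes

print(read_lexemes("int main ( ) { }"))
# ['int', 'main', '(', ')', '{', '}']
```

Note that the lexeme main starts in the first 8-character block and is completed after the refill, without being lost at the boundary.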
Input buffer Algorithm
Terms for parts of strings

1) Prefix: a prefix of a string s is any string obtained by removing
zero or more symbols from the end of s.
Example: prefixes of the word banana include ban, banana, and ε.
2) Suffix: a suffix of a string s is any string obtained by removing
zero or more symbols from the beginning of s.
Example: suffixes of the word banana include nana, banana, and ε.
3) Substring: a substring of s is obtained by deleting any prefix and
any suffix from s.
Example: substrings of the word banana include nan, banana, and ε.
4) Proper prefixes, suffixes, and substrings of a string s are those
prefixes, suffixes, and substrings, respectively, of s that are not ε
and not equal to s itself.
5) Subsequence: a subsequence of s is any string formed by deleting
zero or more not-necessarily-consecutive positions of s.
Example: baan is a subsequence of the string banana.
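These definitions translate directly into short Python helpers; the function names are illustrative choices, not standard library functions.

```python
def prefixes(s):
    """All prefixes of s: remove zero or more symbols from the end."""
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    """All suffixes of s: remove zero or more symbols from the beginning."""
    return {s[i:] for i in range(len(s) + 1)}

def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more (not necessarily
    consecutive) positions of s."""
    it = iter(s)
    # 'ch in it' consumes the iterator up to the first match, so the
    # characters of t must appear in s in order.
    return all(ch in it for ch in t)

print("ban" in prefixes("banana"))       # True
print("nana" in suffixes("banana"))      # True
print(is_subsequence("baan", "banana"))  # True
```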
Operations on languages

In lexical analysis the most important operations on languages are
union, concatenation, and closure.
Example
Let L = {A, B, C, ..., Z, a, b, c, ..., z} and D = {0, 1, 2, ..., 9}.

Find:

1) L U D = the set of letters and digits
2) LD = the set of strings of length two, each consisting of one
letter followed by one digit
3) L4 = the set of all 4-letter strings
4) L* = the set of all strings of letters, including ε
5) L(L U D)* = the set of all strings of letters and digits beginning
with a letter
6) D+ = the set of all strings of one or more digits
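The finite operations above can be computed directly on small sample alphabets. This sketch uses tiny stand-ins for L and D (the full alphabets would work the same way but produce large sets); the Kleene closure is approximated up to a bounded length, since the true L* is infinite.

```python
from itertools import product

# Small sample alphabets (the slide uses all letters and all digits).
L = set("ab")
D = set("01")

def concat(L1, L2):
    """Concatenation L1L2: every string of L1 followed by one of L2."""
    return {x + y for x, y in product(L1, L2)}

def power(lang, n):
    """L^n: L concatenated with itself n times (L^0 = {""})."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result

def closure_up_to(lang, k):
    """Bounded approximation of the Kleene closure L*: all strings
    built from at most k words of L (the true L* is infinite)."""
    return set().union(*(power(lang, n) for n in range(k + 1)))

union = L | D        # L U D: letters and digits
LD = concat(L, D)    # length-2 strings: a letter followed by a digit
L4 = power(L, 4)     # all 4-letter strings over {a, b}

print(sorted(LD))    # ['a0', 'a1', 'b0', 'b1']
print(len(L4))       # 16
```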
Regular Expressions
1) We can write L(L U D)* for identifiers as letter_ ( letter_
| digit )*.
2) A regular expression r denotes a language L, written L(r).
3) Epsilon, ε, is a RE, with L(ε) = {ε}.
4) If a is a symbol of Σ, then a is a RE with L(a) = {a}.
5) If r and s are REs, then L(r) U L(s), L(r)L(s), and (L(r))* are
regular languages.
Order of precedence
6) The unary operator * has the highest precedence and is left
associative.
7) Concatenation has the second highest precedence and is left
associative.
8) | (the union operator) has the lowest precedence and is left
associative.
Example - RE: (a)|((b)*(c)) = a|b*c.
Example: Let Σ = {a, b}.

No. RE          Language
1   a|b         L = {a, b}
2   (a|b)(a|b)  L = {aa, ab, ba, bb}
3   a*          L = {ε, a, aa, aaa, ...}
4   (a|b)*      L = {ε, a, b, aa, ab, ba, bb, aaa, ...}
5   a|a*b       L = {a, b, ab, aab, aaab, ...}

1) A language that can be defined by a RE is called a regular
language or regular set.
2) If two REs r and s denote the same regular set, we say they are
equivalent and write r = s; for example, (a|b) = (b|a).
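The table and the equivalence claim can be checked by brute force with Python's `re` module, enumerating all strings up to a bounded length (a sketch only: this tests equivalence up to that length, it is not a proof for the infinite languages).

```python
import re
from itertools import product

def language(regex, alphabet="ab", max_len=4):
    """Brute-force L(r): all strings over the alphabet of length at
    most max_len that the regular expression matches in full."""
    return {"".join(w)
            for n in range(max_len + 1)
            for w in product(alphabet, repeat=n)
            if re.fullmatch(regex, "".join(w))}

# (a|b) and (b|a) denote the same regular set, so they are equivalent.
print(language("a|b") == language("b|a"))   # True

# Row 5 of the table: a|a*b denotes {a, b, ab, aab, aaab, ...}.
print(sorted(language("a|a*b"), key=len))
```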
Algebraic Laws for REs

Let r, s, and t be three REs.
C identifiers and numbers defined as REs
1) Identifiers
(i) letter_ → A|B|...|Z|a|b|...|z|_
(ii) digit → 0|1|...|9
(iii) id → letter_ (letter_|digit)*

2) Unsigned numbers (integer or floating point)

Unsigned numbers are strings such as 5280, 0.01234, 6.336E4 or 1.89E-4.
The RE is:
(iv) digit → 0|1|...|9
(v) digits → digit digit* (e.g. 5280)
(vi) optionalFraction → . digits | ϵ (e.g. 0.01234)
(vii) optionalExponent → (E (+ | - | ϵ) digits) | ϵ (e.g. 6.336E4 or 1.89E-4)
(viii) number → digits optionalFraction optionalExponent (e.g. 1.0)
Note:
1) optionalFraction is either ϵ or a decimal point followed by one or
more digits.
2) optionalExponent is either ϵ or the letter E followed by an optional
+ or - sign, followed by one or more digits.
Extensions of REs
1) Positive closure
Let r be a RE generating a language L(r); then (r)+ generates the
language (L(r))+ and satisfies the relations
(i) r* = r+ | ϵ
(ii) r+ = r r* = r* r

2) Zero or one instance

The unary postfix operator ? means “zero or one occurrence”.
Example:
(i) r? = r | ϵ
(ii) L(r?) = L(r) U {ϵ}
Shorthand Notations
1) [abc] = a|b|c
2) [a-z] = a|b|...|z
Example:
Using shorthand notations we can rewrite the C identifiers as:
(i) letter_ → A|B|...|Z|a|b|...|z|_ becomes letter_ → [A-Za-z_]
(ii) digit → 0|1|...|9 becomes digit → [0-9]
(iii) id → letter_ (letter_|digit)* stays letter_ (letter_ | digit)*

Unsigned numbers:
(iv) digit → 0|1|...|9 becomes digit → [0-9]
(v) digits → digit digit* becomes digits → digit+
(vi) number → digits optionalFraction optionalExponent becomes
number → digits (. digits)? (E [+-]? digits)?
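The shorthand definitions carry over almost verbatim to Python regular expressions. The sketch below renders the identifier and unsigned-number patterns and classifies the example lexemes from the slide; the `classify` helper is an illustrative name introduced here.

```python
import re

# The shorthand definitions as Python regular expressions.
ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")       # letter_ (letter_|digit)*
NUMBER = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?")  # digits (. digits)? (E [+-]? digits)?

def classify(lexeme):
    """Classify a lexeme as a number, an id, or no match."""
    if NUMBER.fullmatch(lexeme):
        return "number"
    if ID.fullmatch(lexeme):
        return "id"
    return "no match"

for s in ["count", "_tmp1", "5280", "0.01234", "6.336E4", "1.89E-4"]:
    print(s, "->", classify(s))
```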
Exercise Examples

1) Describe the language denoted by the following REs:
(i) a(a|b)*a
(ii) ((ϵ|a)b*)*
(iii) (a|b)*a(a|b)(a|b)
(iv) a*ba*ba*ba*

2) Write REs for the following languages:
(v) all strings of lowercase letters that contain the five vowels in
order;
(vi) all strings of lowercase letters in which the letters are in
ascending lexicographic order.
Tokens, their Patterns, and Attribute Values
Transition Diagrams

 All transition diagrams are deterministic.
 We perform the conversion from RE to DFA.
 In manual construction of a DFA, an edge, after reading an input,
moves to the next state.
 Similarly, in the lexical analyzer, the forward pointer advances to
the next character position on reading the previous character.
Transition diagram that recognises the lexeme matching the token relop

0 is the start state. When the lexical analyzer encounters the < symbol,
it moves to state 1; if it then reads =, it moves to state 2, recognising
the <= relop.
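A walk through the relop transition diagram can be sketched in Python. This is a simplified illustration with explicit lookahead rather than numbered states; the attribute names LT, LE, EQ, NE, GT, GE follow the usual relop convention, and the `relop` function name is a choice made here. When the character after < or > is not part of the operator, the diagram retracts, returning the one-character operator without consuming the lookahead.

```python
def relop(source, pos=0):
    """Simulate the relop transition diagram: starting at state 0,
    read characters from position pos and return (token, attribute),
    or None if no relational operator starts here."""
    c = source[pos] if pos < len(source) else ""
    if c == "<":
        nxt = source[pos + 1] if pos + 1 < len(source) else ""
        if nxt == "=":
            return ("relop", "LE")
        if nxt == ">":
            return ("relop", "NE")
        return ("relop", "LT")      # retract: '<' stands alone
    if c == "=":
        return ("relop", "EQ")
    if c == ">":
        nxt = source[pos + 1] if pos + 1 < len(source) else ""
        if nxt == "=":
            return ("relop", "GE")
        return ("relop", "GT")      # retract: '>' stands alone
    return None                      # not a relational operator

print(relop("<= b"))   # ('relop', 'LE')
print(relop("< b"))    # ('relop', 'LT')
```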
Transition diagram that recognises the lexeme matching the token
identifier

(i) letter_ → A|B|...|Z|a|b|...|z|_
(ii) digit → 0|1|...|9
(iii) id → letter_ (letter_ | digit)*
Transition diagram that recognises the lexeme matching the token
keyword
Transition diagram that recognises the lexeme matching the token
unsigned number
The RE is:
(i) digits → digit digit*
(ii) optionalFraction → . digits | ϵ
(iii) optionalExponent → (E (+ | - | ϵ) digits) | ϵ
(iv) number → digits optionalFraction optionalExponent
Transition diagram that recognises the lexeme matching the token
white space
Implementation of relop transition diagram

END OF MODULE-2
