
Automata Theory and Compiler Design


Dr. N. G. Goudru
Professor
Department of ISE
Sambhram Institute of Technology
Bangalore
MODULE – 1

Text Book: “Compilers: Principles, Techniques, and Tools” by Alfred V. Aho,
Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, Second edition,
Pearson publication.

Chapter - 1: 1.1, 1.2.


Introduction

 Programs running on a computer are written in some
programming language, called a high-level language.

 Before executing a program, the computer first translates the source
program into an assembly-level or machine-level
language that can be executed by the computer.

 The software that does this translation is called a compiler.


A Compiler

 A compiler is software that reads a program, called the source
program, and translates it into an equivalent machine
program, called the target program.

 An important task of the compiler is to report the errors
detected in the source program during the translation process.

Target program

 The target program is an executable machine-level language
program.

 The user calls the target program to process the input and
produce output.
Interpreter

 An interpreter is also a language processor.

 It directly executes the operations specified in the source
program on the input given by the user.

 It executes the source program statement by statement, and
also detects errors in the source program.
A Language processing system
A source program is translated into an equivalent target machine
code or executable machine-level language program using the
following steps.

(i) A source program is divided into modules and stored. The pre-
processor collects the source files and expands the required macros in
the source language statements. The modified source program is
then fed into a compiler.

(ii) The compiler, after compiling the modified source program,
produces an assembly-level language program as its output.

(iii) The assembly program is then processed by a software tool called
an assembler that produces machine code as its output.

(iv) The machine code is linked together with other object files
and library files by the linker.

(v) The loader then puts all the executable object files into
memory for execution.
Phases of Compiler OR Structure of Compiler

The symbol table contains a record for each variable, such as its
storage location, type, and scope.

A compiler is a software program that converts high-level source code
into low-level machine code that can be executed by the computer. The
process of conversion has the following phases.

(i) Lexical analyzer

The first phase of the compiler is lexical analysis, also known as scanning. It
reads the source code and breaks it into a stream of tokens, the basic units
of the programming language. The token stream is then passed on
to the syntax analyzer.

(ii) Syntax analyzer

The syntax analyzer performs syntax analysis, also known as parsing. The syntax
analyzer takes the stream of tokens generated by the lexical analyzer and checks
it against the grammar of the programming language. It checks whether the source
code is syntactically correct or not, and ensures that variables in the program are
used correctly. The outcome of this phase is called the syntax tree.

(iii) Semantic analyzer

It checks for semantic errors, such as undeclared variables and incorrect function
calls, if any.

(iv) Intermediate code generator

This phase generates an intermediate representation of the source code that
can be easily translated into machine code.

(v) Optimizer

The optimizer applies various optimization techniques to the intermediate code to
improve the performance of the machine code.

(vi) Code generator

This phase takes the optimized intermediate code and generates the actual machine
code that can be executed by the computer.
1) Lexical Analysis

 The first phase of the compiler is called lexical analysis or
scanning.

 The lexical analyser reads the stream of characters and groups
the characters into meaningful sequences called lexemes.

 For each lexeme, the lexical analyser produces an output called a
token.

 The format of a token is: <token-name, attribute-value>, where
token-name is an abstract symbol, and
attribute-value points to the entry in the symbol table (storage location,
type, scope, etc.).
Example 1: Construct the sequence of tokens for a source program
having the assignment statement,
position = initial + rate * 60
Answer:
The lexical analyser groups the statement into lexemes as follows:
1) position is a lexeme and is mapped into the token <id, 1>
2) = is a lexeme and is mapped into the token <=>
3) initial is a lexeme and is mapped into the token <id, 2>
4) + is a lexeme and is mapped into the token <+>
5) rate is a lexeme and is mapped into the token <id, 3>
6) * is a lexeme and is mapped into the token <*>
7) 60 is a lexeme and is mapped into the token <60>

The sequence of tokens for the assignment statement is:
<id,1> <=> <id,2> <+> <id,3> <*> <60>

Note: For operators, punctuation, and keywords there is no need for an
attribute value.
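The grouping above can be sketched in Python. This is a minimal illustration, not the textbook's implementation: the pattern table, the `tokenize` name, and the tuple representation of tokens are all choices made here for clarity. Identifiers get a symbol-table entry number as their attribute value; operators and constants carry only their lexeme.

```python
import re

# Token patterns, tried in order: identifiers, numbers, then operators.
TOKEN_SPEC = [
    ("id", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("num", r"\d+"),
    ("op", r"[=+\-*/]"),
    ("ws", r"\s+"),
]

def tokenize(source):
    """Group characters into lexemes and map each lexeme to a token.

    Identifiers get an attribute value: their entry number in the
    symbol table. Other lexemes carry the lexeme text itself.
    """
    symbol_table = {}
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                lexeme = m.group()
                if name == "id":
                    entry = symbol_table.setdefault(lexeme, len(symbol_table) + 1)
                    tokens.append(("id", entry))
                elif name != "ws":          # whitespace produces no token
                    tokens.append((lexeme,))
                pos += len(lexeme)
                break
        else:
            raise SyntaxError(f"illegal character {source[pos]!r}")
    return tokens

print(tokenize("position = initial + rate * 60"))
# [('id', 1), ('=',), ('id', 2), ('+',), ('id', 3), ('*',), ('60',)]
```

The output matches the token sequence <id,1> <=> <id,2> <+> <id,3> <*> <60> derived by hand above.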
2) Syntax Analysis

 Syntax analysis is the second phase of the compiler. It is also
called parsing.
 The parser takes the token stream and creates the syntax tree.
 In the syntax tree, a parent node represents an operator and a
child node represents a token.
 During evaluation of the token stream, it follows the usual
precedence convention, where * binds tighter than +, which binds
tighter than =.

<id,1> <=> <id,2> <+> <id,3> <*> <60>

Syntax tree
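The tree construction with the usual precedence can be sketched with a small precedence-climbing parser. This is an illustrative sketch, not the parser a real compiler would use: tokens are plain strings, and the tree is a nested tuple whose first element is the operator.

```python
# Operator precedence, as on the slide: = lowest, then +, then *.
PREC = {"=": 1, "+": 2, "*": 3}

def parse(tokens):
    """Precedence-climbing parser: returns a nested-tuple syntax tree
    whose parent nodes are operators and whose leaves are tokens."""
    def expr(min_prec, pos):
        node, pos = tokens[pos], pos + 1
        while pos < len(tokens) and PREC.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            # Parse the right operand at a strictly higher precedence,
            # so lower-precedence operators stay above in the tree.
            rhs, pos = expr(PREC[op] + 1, pos + 1)
            node = (op, node, rhs)
        return node, pos
    tree, _ = expr(1, 0)
    return tree

print(parse(["id1", "=", "id2", "+", "id3", "*", "60"]))
# ('=', 'id1', ('+', 'id2', ('*', 'id3', '60')))
```

The * node is lowest in the tree, so it is evaluated first, then +, then =, as the precedence convention requires.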
3) Semantic analysis
 The semantic analyser uses the syntax tree and the information in
the symbol table to check for semantic errors and saves the results in
the syntax tree or symbol table.
 The semantic analyser also performs
(i) type checking,
(ii) checking that each operator has matching operands,
(iii) checking conformance with array type declarations, etc.

For example, inttofloat explicitly converts an int into a floating-point
number.


4) Intermediate code generator

 After semantic analysis, the compiler generates low-level
intermediate code.

 To generate the intermediate code, the generator uses the three-
address code form, which has at most three operands per instruction.

 It also fixes the order in which operations are to be done.
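Generation of three-address code from a syntax tree can be sketched as follows. This is a simplified illustration under assumptions made here: the tree is the nested-tuple form from the parsing example, the helper name `to_tac` is hypothetical, and type conversions such as inttofloat are omitted.

```python
def to_tac(tree):
    """Flatten a nested-tuple syntax tree into three-address code:
    at most one operator and three addresses per instruction."""
    code = []
    counter = [0]
    def walk(node):
        if not isinstance(node, tuple):
            return node                      # a leaf: an id or a constant
        op, lhs, rhs = node
        l, r = walk(lhs), walk(rhs)
        if op == "=":
            code.append(f"{l} = {r}")
            return l
        counter[0] += 1                      # fresh temporary for the result
        t = f"t{counter[0]}"
        code.append(f"{t} = {l} {op} {r}")
        return t
    walk(tree)
    return code

print(to_tac(("=", "id1", ("+", "id2", ("*", "id3", "60")))))
# ['t1 = id3 * 60', 't2 = id2 + t1', 'id1 = t2']
```

Each instruction has one operator and at most three addresses, and the order of the instructions fixes the order in which operations are done.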


5) Code Optimizer
 Code optimization generates optimized code that helps
improve the execution time, storage requirements, etc. of the program.

 The optimized form of the intermediate code is:


6) Code generator

 The code generator takes the optimized intermediate code as
input.
 It assigns registers or memory locations to each of the variables.

 For example, using registers R1 and R2, the optimized intermediate
code is translated into machine code as follows:
LDF - load data to a floating-point register;
STF - store data from a floating-point register
Translation of an assignment statement into machine code by the
compiler: position = initial + rate * 60
Compiler construction tools
Compilers use specialized tools to implement their various phases.
Some commonly used compiler construction tools are:
1) Parser generator - produces a syntax analyzer.
2) Scanner generator - produces a lexical analyzer.
3) Syntax-directed translation engine - produces a collection of routines
for traversing a parse tree.
4) Code-generator generator - produces a code generator from a collection
of rules.
5) Data-flow analysis engine - facilitates the gathering of
information on how values are transmitted from one part of a
program to other parts of the program.
6) Compiler-construction toolkit - produces an integrated set of
routines for constructing the various phases of a compiler.

End of Module-1
MODULE – 2: Lexical analysis phase of compiler design
The role of the lexical analyzer
 The main task of the lexical analyzer is to identify lexemes.
 It reads the input characters of the source program, groups
them into lexemes, and produces as output a sequence of tokens, one
for each lexeme in the source program.
 The stream of tokens is sent to the parser for syntax analysis.
 When the lexical analyzer discovers a lexeme constituting an
identifier, it enters that lexeme into the symbol table.
 Another important task of the lexical analyzer is the removal of
comments, white space, newlines, tabs, etc.

Interaction between the lexical analyzer and the parser

The reasons why the lexical analysis phase and the syntax (parsing)
phase are separated in compiler design are:

 to ease design complexity,
 to improve the performance and efficiency of the compiler,
 to enhance compiler portability.
Tokens, Patterns & Lexemes
Lexeme
 The lexer takes a stream of characters and produces a stream of
tokens.
 The lexer partitions the string,
 reading from left to right,
 recognizing one token at a time.
 A lexeme is represented as <class, string>.

Example: if(x>=y) ← stream of characters

Lexemes: if, (, x, >=, y, )
Lexeme representation: <keyword, “if”>
Token stream:
<keyword, “if”> <LPAREN, “(”> <id, ”x”> <op, ”>=”> <id, ”y”>
<RPAREN, “)”>
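The left-to-right partitioning into <class, string> pairs can be sketched as below. The class names and pattern table are illustrative choices, not a fixed standard; note that keywords must be checked before general identifiers, and the two-character lexeme >= must be tried before the single-character >.

```python
import re

KEYWORDS = {"if", "then", "else", "while"}

# Lexeme classes, tried in order: multi-character operators first.
SPEC = [
    ("op", r">=|<=|==|[=+\-*/<>]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("id", r"[A-Za-z_]\w*"),
    ("ws", r"\s+"),
]

def lex(source):
    """Partition the input string left to right, one lexeme at a time,
    and represent each lexeme as a <class, string> pair."""
    out, pos = [], 0
    while pos < len(source):
        for cls, pat in SPEC:
            m = re.match(pat, source[pos:])
            if m:
                s = m.group()
                if cls == "id" and s in KEYWORDS:
                    cls = "keyword"          # keywords outrank identifiers
                if cls != "ws":              # whitespace yields no pair
                    out.append((cls, s))
                pos += len(s)
                break
        else:
            raise SyntaxError(f"illegal character {source[pos]!r}")
    return out

print(lex("if(x>=y)"))
# [('keyword', 'if'), ('LPAREN', '('), ('id', 'x'), ('op', '>='), ('id', 'y'), ('RPAREN', ')')]
```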
Token

 A token is a two-component representation denoted as
<token-name, attribute-value>.

 The token name is an abstract symbol representing the class of the
lexeme, for example keywords, identifiers, etc.

 Token names are the input symbols for the parser.


Classification of tokens

1) Keyword class: keywords like if, then, else, etc., belong to the
keyword class.
2) Identifier class: variables declared in the program, like var,
var1, sum, count, etc., belong to the identifier class.
3) Constant class: constants like 2, 5, -4, 5.4, etc., belong to the
constant class.
4) Operator class: symbols like (, ), [, ], <=, >=, =, etc., belong to
the operator class.
5) Delimiter class: punctuation marks like ;, :, “ ”, /, etc., belong
to the delimiter class.
6) White space class: blank space, \n, \t, etc.
Examples of tokens

 One token for each keyword. The pattern is the same as the
keyword itself.
 One token for each operator.
 One token representing all identifiers.
 One or more tokens representing constants such as numbers
and literal strings.
 One token for each punctuation symbol, such as left
parenthesis, right parenthesis, comma, and semicolon.
Attributes for Tokens

 The lexical analyser returns to the parser the token name and the
attribute values describing the lexeme represented by the
token.

 For example, the information (attributes) about an identifier is
its lexeme, its type, its entry location in the symbol table, its
scope, etc.

 For operators, punctuation, and keywords, there is no need for an
attribute value.

Example:
The token names and associated attribute values for the FORTRAN
statement E = M * C ** 2.

The sequence of pairs is:

 For example, the token number is given an integer-valued attribute.
Lexical Error

Consider the source-code statement fi(a==f(x)), where fi is a
misspelling of the keyword if.

 The lexical analyzer cannot tell whether fi is a misspelled keyword
or an undeclared function identifier.

 Since fi is a valid lexeme for the token id, the lexical analyzer
returns the token id to the parser, and in this case the parser handles
the error.
Input Buffering
 The program is stored on the hard disk.
 To read a token, the lexical analyzer uses two pointers.
 The first pointer is the lexemeBegin pointer and the second is the
forward pointer.
Example:
int main()
{

}
This program statement is stored in memory as follows:

lexemeBegin

i n t m a i n ( ) { }

forward
Buffering takes place as follows:
 The lexemeBegin pointer points to the first character of the
current lexeme.
 int is a token.
 The forward pointer is placed at the character i and moves to the
next character n; after reading t, the pointer encounters a blank
space and assumes that it is the end of the token.
 After reading the first token, both the lexemeBegin and forward
pointers move to the first character of the second token.

lexemeBegin

i n t m a i n ( ) { }

forward
Problems with this method of buffering

 To read each character from the hard disk, the processor uses one
system call.

 Suppose there are 1000 characters in a program; the system then
uses 1000 system calls, which is an overhead on system
performance.

 To overcome this problem, compilers use the following buffering
technique.
Buffering method 2

 A block of characters is read into the buffer in only one
system call.
 It is implemented in two ways:

(i) the one-buffer scheme,
(ii) the two-buffer scheme.
1) One-buffer scheme

 It uses only one buffer block to read the string; the block size is,
for example, 4096 bytes.

 The problem with this method is that when the input string is
larger than the buffer block, the buffer fails to store the whole
string.

 Whenever the forward pointer encounters the eof character, it
identifies that the buffer is full.
2) Two-buffer scheme
 It uses two buffer blocks.
 After the first buffer is completely filled, it fills the second
buffer.
 To determine whether a buffer is filled or not, a special
character eof is used.
 The other name for eof is the sentinel character. eof is not a part
of the source program.
 Whenever the forward pointer encounters eof, the first buffer is
full, and the lexemeBegin and forward pointers move to the second
buffer. While the second buffer is being scanned, the content of the
first buffer can be overwritten with the next block.
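The scheme can be sketched with a simplified simulation. This is an illustration of the idea only: a real scanner works on raw bytes with pointer arithmetic, and the tiny buffer size, the `read_lexemes` name, and the whitespace-only lexeme rule are assumptions made here. The key point shown is that a lexeme in progress survives a buffer refill because only the sentinel, not the end of input, is seen at the block boundary.

```python
EOF = "\0"       # sentinel appended at the end of each buffer load
BUF_SIZE = 8     # tiny for illustration; real compilers use e.g. 4096

def read_lexemes(source):
    """Simplified two-buffer scan: the source is loaded one block at a
    time, each block terminated by a sentinel, so the forward pointer
    needs only one end-of-buffer test per character."""
    # Split the source into sentinel-terminated blocks, simulating
    # one disk read (one system call) per block.
    blocks = [source[i:i + BUF_SIZE] + EOF
              for i in range(0, len(source), BUF_SIZE)]
    lexemes, current = [], ""
    for block in blocks:
        for ch in block:
            if ch == EOF:            # sentinel: refill from the next block
                break
            if ch.isspace():         # whitespace ends the current lexeme
                if current:
                    lexemes.append(current)
                current = ""
            else:
                current += ch        # lexeme may span a buffer boundary
    if current:
        lexemes.append(current)
    return lexemes

print(read_lexemes("int main ( ) { }"))
# ['int', 'main', '(', ')', '{', '}']
```

Note that the lexeme main starts in the first 8-character block and is completed after the refill, without being lost at the boundary.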
Input buffer Algorithm
Terms for parts of strings

1) Prefix: a prefix of a string s is any string obtained by removing
zero or more symbols from the end of s.
Example: prefixes of the word banana include ban, banana, and ε.
2) Suffix: a suffix of a string s is any string obtained by removing
zero or more symbols from the beginning of s.
Example: suffixes of the word banana include nana, banana, and ε.
3) Substring: a substring of s is obtained by deleting any prefix and
any suffix from s.
Example: substrings of the word banana include nan, banana, and ε.
4) Proper prefixes, suffixes, and substrings of a string s are those
prefixes, suffixes, and substrings, respectively, of s that are not ε
and not equal to s itself.
5) Subsequence: a subsequence of s is any string formed by deleting
zero or more not-necessarily-consecutive positions of s.
Example: baan is a subsequence of the string banana.
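These definitions translate directly into short Python helpers; the function names are illustrative choices, not standard library functions.

```python
def prefixes(s):
    """All prefixes of s: remove zero or more symbols from the end."""
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    """All suffixes of s: remove zero or more symbols from the beginning."""
    return {s[i:] for i in range(len(s) + 1)}

def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more (not necessarily
    consecutive) positions of s."""
    it = iter(s)
    # 'ch in it' consumes the iterator up to the first match, so the
    # characters of t must appear in s in order.
    return all(ch in it for ch in t)

print("ban" in prefixes("banana"))       # True
print("nana" in suffixes("banana"))      # True
print(is_subsequence("baan", "banana"))  # True
```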
Operations on languages

In lexical analysis the most important operations on languages are
union, concatenation, and closure.
Example
Let L = {A, B, C, ..., Z, a, b, c, ..., z} and D = {0, 1, 2, ..., 9}.

Find:

1) L U D = the set of letters and digits
2) LD = the set of strings of length two, each consisting of one
letter followed by one digit
3) L4 = the set of all 4-letter strings
4) L* = the set of all strings of letters, including ε
5) L(L U D)* = the set of all strings of letters and digits beginning
with a letter
6) D+ = the set of all strings of one or more digits
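The finite operations above can be computed directly on small sample alphabets. This sketch uses tiny stand-ins for L and D (the full alphabets would work the same way but produce large sets); the Kleene closure is approximated up to a bounded length, since the true L* is infinite.

```python
from itertools import product

# Small sample alphabets (the slide uses all letters and all digits).
L = set("ab")
D = set("01")

def concat(L1, L2):
    """Concatenation L1L2: every string of L1 followed by one of L2."""
    return {x + y for x, y in product(L1, L2)}

def power(lang, n):
    """L^n: L concatenated with itself n times (L^0 = {""})."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result

def closure_up_to(lang, k):
    """Bounded approximation of the Kleene closure L*: all strings
    built from at most k words of L (the true L* is infinite)."""
    return set().union(*(power(lang, n) for n in range(k + 1)))

union = L | D        # L U D: letters and digits
LD = concat(L, D)    # length-2 strings: a letter followed by a digit
L4 = power(L, 4)     # all 4-letter strings over {a, b}

print(sorted(LD))    # ['a0', 'a1', 'b0', 'b1']
print(len(L4))       # 16
```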
Regular Expressions
1) We can write L(L U D)* for identifiers as letter_ ( letter_
| digit )*.
2) A regular expression r denotes a language L, written L(r).
3) Epsilon, ε, is a RE, with L(ε) = {ε}.
4) If a is a symbol of Σ, then a is a RE with L(a) = {a}.
5) If r and s are REs, then L(r) U L(s), L(r)L(s), and (L(r))* are
regular languages.
Order of precedence
6) The unary operator * has the highest precedence and is left
associative.
7) Concatenation has the second highest precedence and is left
associative.
8) | (the union operator) has the lowest precedence and is left
associative.
Example - RE: (a)|((b)*(c)) = a|b*c.
Example: Let Σ = {a, b}.

No. RE          Language
1   a|b         L = {a, b}
2   (a|b)(a|b)  L = {aa, ab, ba, bb}
3   a*          L = {ε, a, aa, aaa, ...}
4   (a|b)*      L = {ε, a, b, aa, ab, ba, bb, aaa, ...}
5   a|a*b       L = {a, b, ab, aab, aaab, ...}

1) A language that can be defined by a RE is called a regular
language or regular set.
2) If two REs r and s denote the same regular set, we say they are
equivalent and write r = s; for example, (a|b) = (b|a).
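The table and the equivalence claim can be checked by brute force with Python's `re` module, enumerating all strings up to a bounded length (a sketch only: this tests equivalence up to that length, it is not a proof for the infinite languages).

```python
import re
from itertools import product

def language(regex, alphabet="ab", max_len=4):
    """Brute-force L(r): all strings over the alphabet of length at
    most max_len that the regular expression matches in full."""
    return {"".join(w)
            for n in range(max_len + 1)
            for w in product(alphabet, repeat=n)
            if re.fullmatch(regex, "".join(w))}

# (a|b) and (b|a) denote the same regular set, so they are equivalent.
print(language("a|b") == language("b|a"))   # True

# Row 5 of the table: a|a*b denotes {a, b, ab, aab, aaab, ...}.
print(sorted(language("a|a*b"), key=len))
```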
Algebraic Laws for REs

Let r, s, and t be three REs.
C identifiers and numbers defined as REs
1) Identifiers
(i) letter_ → A|B|...|Z|a|b|...|z|_
(ii) digit → 0|1|...|9
(iii) id → letter_ (letter_|digit)*

2) Unsigned numbers (integer or floating point)

Unsigned numbers are strings such as 5280, 0.01234, 6.336E4 or 1.89E-4.
The RE is:
(iv) digit → 0|1|...|9
(v) digits → digit digit* (e.g. 5280)
(vi) optionalFraction → . digits | ϵ (e.g. 0.01234)
(vii) optionalExponent → (E (+ | - | ϵ) digits) | ϵ (e.g. 6.336E4 or 1.89E-4)
(viii) number → digits optionalFraction optionalExponent (e.g. 1.0)
Note:
1) optionalFraction is either ϵ or a decimal point followed by one or
more digits.
2) optionalExponent is either ϵ or the letter E followed by an optional
+ or - sign, followed by one or more digits.
Extensions of REs
1) Positive closure
Let r be a RE generating a language L(r); then (r)+ generates the
language (L(r))+ and satisfies the relations
(i) r* = r+ | ϵ
(ii) r+ = r r* = r* r

2) Zero or one instance

The unary postfix operator ? means “zero or one occurrence”.
Example:
(i) r? = r | ϵ
(ii) L(r?) = L(r) U {ϵ}
Shorthand Notations
1) [abc] = a|b|c
2) [a-z] = a|b|...|z
Example:
Using shorthand notations we can rewrite the C identifiers as:
(i) letter_ → A|B|...|Z|a|b|...|z|_ becomes letter_ → [A-Za-z_]
(ii) digit → 0|1|...|9 becomes digit → [0-9]
(iii) id → letter_ (letter_|digit)* stays letter_ (letter_ | digit)*

Unsigned numbers:
(iv) digit → 0|1|...|9 becomes digit → [0-9]
(v) digits → digit digit* becomes digits → digit+
(vi) number → digits optionalFraction optionalExponent becomes
number → digits (. digits)? (E [+-]? digits)?
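The shorthand definitions carry over almost verbatim to Python regular expressions. The sketch below renders the identifier and unsigned-number patterns and classifies the example lexemes from the slide; the `classify` helper is an illustrative name introduced here.

```python
import re

# The shorthand definitions as Python regular expressions.
ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")       # letter_ (letter_|digit)*
NUMBER = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?")  # digits (. digits)? (E [+-]? digits)?

def classify(lexeme):
    """Classify a lexeme as a number, an id, or no match."""
    if NUMBER.fullmatch(lexeme):
        return "number"
    if ID.fullmatch(lexeme):
        return "id"
    return "no match"

for s in ["count", "_tmp1", "5280", "0.01234", "6.336E4", "1.89E-4"]:
    print(s, "->", classify(s))
```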
Exercise Examples

1) Describe the language denoted by the following REs:
(i) a(a|b)*a
(ii) ((ϵ|a)b*)*
(iii) (a|b)*a(a|b)(a|b)
(iv) a*ba*ba*ba*

2) Write REs for the following languages:
(v) all strings of lowercase letters that contain the five vowels in
order;
(vi) all strings of lowercase letters in which the letters are in
ascending lexicographic order.
Tokens, their Patterns, and Attribute Values
Transition Diagrams

 All transition diagrams are deterministic.
 We perform the conversion from RE to DFA.
 In manual construction of a DFA, an edge, after reading an input,
moves to the next state.
 Similarly, in the lexical analyzer, the forward pointer advances to
the next character position on reading the previous character.
Transition diagram that recognises the lexeme matching the token relop

0 is the start state. When the lexical analyzer encounters the < symbol,
it moves to state 1; if it then reads =, it moves to state 2, recognising
the <= relop.
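A walk through the relop transition diagram can be sketched in Python. This is a simplified illustration with explicit lookahead rather than numbered states; the attribute names LT, LE, EQ, NE, GT, GE follow the usual relop convention, and the `relop` function name is a choice made here. When the character after < or > is not part of the operator, the diagram retracts, returning the one-character operator without consuming the lookahead.

```python
def relop(source, pos=0):
    """Simulate the relop transition diagram: starting at state 0,
    read characters from position pos and return (token, attribute),
    or None if no relational operator starts here."""
    c = source[pos] if pos < len(source) else ""
    if c == "<":
        nxt = source[pos + 1] if pos + 1 < len(source) else ""
        if nxt == "=":
            return ("relop", "LE")
        if nxt == ">":
            return ("relop", "NE")
        return ("relop", "LT")      # retract: '<' stands alone
    if c == "=":
        return ("relop", "EQ")
    if c == ">":
        nxt = source[pos + 1] if pos + 1 < len(source) else ""
        if nxt == "=":
            return ("relop", "GE")
        return ("relop", "GT")      # retract: '>' stands alone
    return None                      # not a relational operator

print(relop("<= b"))   # ('relop', 'LE')
print(relop("< b"))    # ('relop', 'LT')
```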
Transition diagram that recognises the lexeme matching the token
identifier

(i) letter_ → A|B|...|Z|a|b|...|z|_
(ii) digit → 0|1|...|9
(iii) id → letter_ (letter_ | digit)*
Transition diagram that recognises the lexeme matching the token
keyword
Transition diagram that recognises the lexeme matching the token
unsigned number
The RE is:
(i) digits → digit digit*
(ii) optionalFraction → . digits | ϵ
(iii) optionalExponent → (E (+ | - | ϵ) digits) | ϵ
(iv) number → digits optionalFraction optionalExponent
Transition diagram that recognises the lexeme matching the token
white space
Implementation of relop transition diagram

END OF MODULE-2
