
Lexical Analysis


Lexical Analysis

By:
Trusha R. Patel
Asst. Prof.
CE Dept, CSPIT, CHARUSAT

1
Role of Lexical Analyzer
1. Main task is to read the input characters of the
source program, group them into lexemes,
and produce as output a sequence of tokens,
one for each lexeme in the program

2. Interacts with the symbol table: when it discovers
a lexeme constituting an identifier, it needs to
enter that lexeme into the symbol table

2
Role of Lexical Analyzer
3. Strips out comments and whitespace

4. Generates error messages ( lexical errors )

5. Keeps track of the line number so it can
associate a line number with each error

3
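Interaction of Lexical Analyzer with Syntax Analyzer

Source program → Lexical Analyzer (Scanner) → token → Syntax Analyzer (Parser) → to semantic analyzer
Syntax Analyzer (Parser) → getNextToken → Lexical Analyzer (Scanner)
Lexical Analyzer and Syntax Analyzer both use the Symbol table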

4
Separated Lexical Analyzer from Syntax Analyzer

1. Simplicity of design
e.g. it is complex for the parser to deal with
comments and whitespace as syntactic units,
so they are removed during lexical analysis

2. Compiler efficiency is improved, as it allows
applying specialized techniques that serve
only the lexical task, not the parsing job
e.g. specialized input buffering techniques for
reading input characters can speed up the
compiler

5
Separated Lexical Analyzer from Syntax Analyzer

3. Compiler portability is enhanced
input-device-specific peculiarities can be
restricted to the lexical analyzer

6
Token, Pattern, Lexeme
 Token
 It’s a pair consisting of token name and attribute
value
 Generally write token name in boldface
 Refer to a token by its name

 Pattern
 Description of the form that lexemes of a token
may take

 Lexeme
 Sequence of characters in the source program that
matches the pattern for a token
7
Token, Pattern, Lexeme

Token        Informal description                       Sample lexemes

if           Characters i , f                           if
else         Characters e , l , s , e                   else
comparison   < or > or <= or >= or == or !=             <= , !=
id           Letter followed by letters and digits      pi , score , D2
number       Any numeric constant                       3.14159 , 6.02e23
literal      Anything but “ surrounded by “             “core dumped”

8
Attributes for Token
 When more than one lexeme can match a
pattern, the lexical analyzer must provide
additional information to the subsequent
compiler phases

 So the lexical analyzer returns the token name with
an attribute value

 A token has at most one associated attribute,
although this attribute may have a structure
that combines several pieces of information

9
Attributes for Token
 Example
 Token id
 Information about identifier is kept in symbol table
 Attribute value for identifier is a pointer to the
symbol table entry for that identifier

10
Attributes for Token
 Example
 Token names and associated attribute values for
E = M * C ** 2

< id , pointer to symbol-table entry of E >
< assign-op >
< id , pointer to symbol-table entry of M >
< mult-op >
< id , pointer to symbol-table entry of C >
< exp-op >
< number , integer value 2 >
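
The token sequence for E = M * C ** 2 can be reproduced with a small scanner. The following is only a sketch under assumptions: the regular expressions, the function name tokenize and the use of a list index in place of a "pointer to symbol-table entry" are all illustrative choices, not part of the slides.

```python
import re

# Token patterns, tried in order; "**" is listed before "*" so that the
# longer operator wins. Token names follow the slide, the regexes are assumed.
TOKEN_SPEC = [
    ("ws",        r"[ \t\n]+"),
    ("number",    r"[0-9]+"),
    ("id",        r"[A-Za-z_][A-Za-z_0-9]*"),
    ("exp-op",    r"\*\*"),
    ("mult-op",   r"\*"),
    ("assign-op", r"="),
]

def tokenize(source):
    """Return (tokens, symbol_table) for the given source text."""
    symbol_table = []                      # one entry per distinct identifier
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.compile(pattern).match(source, pos)
            if m:
                break
        else:
            raise SyntaxError(f"lexical error at offset {pos}")
        lexeme = m.group()
        pos = m.end()
        if name == "ws":
            continue                       # whitespace is stripped, no token
        if name == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            # the list index stands in for a pointer to the table entry
            tokens.append(("id", symbol_table.index(lexeme)))
        elif name == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((name, None))    # operators carry no attribute
    return tokens, symbol_table

tokens, table = tokenize("E = M * C ** 2")
print(tokens)
print(table)
```

Each id token carries the position of its lexeme in the symbol table, the number token carries its integer value, and the operator tokens carry no attribute.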

11
Lexical Error
 Suppose fi is encountered for the first time in a C program:
fi ( a == f(x) ) …
The lexical analyzer cannot tell whether fi is
a misspelling of the keyword if or an undeclared
function identifier

 Since fi is a valid lexeme for token id, the lexical
analyzer returns token id to the parser and
the parser handles the error

12
Lexical Error
 Suppose the lexical analyzer is unable to proceed
because none of the patterns for tokens
matches any prefix of the remaining input

 Simplest recovery strategy is “panic mode”
recovery:
delete successive characters from the
remaining input until the lexical analyzer can
find a well-formed token

13
Lexical Error
 Other possible error-recovery actions are
 Delete one character from the remaining input
 Insert a missing character into the remaining input
 Replace a character by another character
 Transpose two adjacent characters
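
Panic-mode recovery can be sketched as follows. This is a minimal illustration, assuming a toy token set (identifiers, integers and a few single-character operators) and a hypothetical function name scan_with_panic_mode; neither comes from the slides. As an extra assumption, skipping also stops at whitespace so that the reported error text stays short.

```python
import re

# One combined pattern for the well-formed tokens of a toy language;
# this token set is an assumption made for the example.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z_0-9]*|[0-9]+|[=+\-*/();]")

def scan_with_panic_mode(source):
    """Return (tokens, errors); errors are (offset, skipped_text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(source, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
        else:
            # Panic mode: delete successive characters until some token
            # pattern matches a prefix of the remaining input
            # (or whitespace is reached).
            start = pos
            while (pos < len(source) and not source[pos].isspace()
                   and not TOKEN.match(source, pos)):
                pos += 1
            errors.append((start, source[start:pos]))
    return tokens, errors

tokens, errors = scan_with_panic_mode("a = b @# 3;")
print(tokens)
print(errors)
```

The scanner keeps going after the bad characters, so one lexical error does not hide the tokens that follow it.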

14
Input Buffering
 In C, a single-character operator like > , = , <
can also be the beginning of a two-character
operator >= , == , <=

 Often we have to look one or more characters
beyond the next lexeme before we can be
sure we have the correct lexeme

 Introduce a two-buffer scheme that handles
large lookaheads

15
Buffer Pairs
 A large amount of time is taken to process
characters
This buffering technique has been developed
to reduce the amount of overhead required to
process a single input character

 It involves two buffers that are alternately
reloaded
Each buffer is of the same size N (usually the
size of a disk block)

[Figure: Buffer 1 and Buffer 2, each of size N]
16
Buffer Pairs
 One system read command reads N
characters into a buffer instead of one character
If fewer than N characters remain in the input file,
then a special character eof marks the end of the
source file

 Maintains two pointers
 lexemeBegin
 Marks the beginning of the current lexeme
 forward
 Scans ahead until a pattern match is found

17
Buffer Pairs

Buffer 1: a b = 2 ;      (N characters loaded in buffer)
Buffer 2: d = a b + 3 ;  (N characters loaded in buffer)

1. forward scans a b : match with id
2. = is the start of a new token, so move forward one position to the left
3. id token is generated for ab ; now set lexemeBegin to the position after forward
4. In the same way, tokens are generated for = , 2 , ;
5. Then the pointer reaches the end of the first buffer, so the next buffer is loaded
6. Set the pointers in the new buffer and repeat the same process for the next buffer

[Figure: a b = 2 ; d = a b + 3 ; with the forward and lexemeBegin pointers marked]
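
The alternating reload of the two buffers can be sketched in a few lines. This is a simplified illustration, not the scheme exactly as drawn: only the forward pointer is modeled (no lexemeBegin and no retraction), the class name BufferPair is hypothetical, and using "\0" as the eof marker is an assumption.

```python
import io

EOF = "\0"   # marker for end of input; the choice of "\0" is an assumption

class BufferPair:
    """Two alternately reloaded buffers of up to N characters each."""

    def __init__(self, stream, n=8):
        self.stream = stream
        self.n = n
        self.buffers = ["", ""]
        self.current = 0            # which buffer `forward` is in
        self.forward = 0            # index inside the current buffer
        self.reloads = 0            # number of read commands issued
        self._load(0)

    def _load(self, which):
        # One read command fetches up to N characters at once.
        chunk = self.stream.read(self.n)
        if len(chunk) < self.n:
            chunk += EOF            # fewer than N left: mark end of source
        self.buffers[which] = chunk
        self.reloads += 1

    def next_char(self):
        """Return the character under `forward` and advance it."""
        if self.forward == len(self.buffers[self.current]):
            # End of this buffer reached: reload the other one and switch.
            self.current = 1 - self.current
            self._load(self.current)
            self.forward = 0
        ch = self.buffers[self.current][self.forward]
        if ch != EOF:
            self.forward += 1
        return ch

reader = BufferPair(io.StringIO("ab=2;d=ab+3;"), n=5)
chars = []
while (ch := reader.next_char()) != EOF:
    chars.append(ch)
text = "".join(chars)
print(text, reader.reloads)
```

With N = 5 the 12 input characters are fetched by three read commands instead of twelve single-character reads, which is the overhead the scheme is meant to reduce.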
18
Symbol, Alphabet, String, Language
 Symbol
 It can be letters, digits, punctuations, operators
 E.g.
a-z A-Z 0-9 ; = + , etc.

 Alphabet (Σ)
 Finite set of symbols
 E.g.
{0,1} { a , b , c } etc.

19
Symbol, Alphabet, String, Language
 String
 String over an alphabet is a finite sequence of symbols
drawn from that alphabet
 E.g.
for alphabet {0,1} 0101 can be a string
 Empty string is denoted using ϵ

 Language (L)
 Any countable set of strings over some fixed
alphabet
 E.g.
for alphabet {0,1} the language that is the set of strings of
length 2 can be given as { 00 , 01 , 10 , 11 }
20
Terms for Parts of String
 Prefix
 String obtained by removing zero or more symbols
from the end of the string
 E.g. all possible prefixes of banana are:
ϵ b ba ban bana banan banana

 Suffix
 String obtained by removing zero or more symbols
from the beginning of the string
 E.g. all possible suffixes of banana are:
banana anana nana ana na a ϵ

21
Terms for Parts of String
 Substring
 Obtained by deleting any prefix and any suffix
from the string
 E.g. some substrings of banana can be
nan anan …

 Proper prefixes, suffixes and substrings
 Prefix, suffix and substring are called proper if
they are not ϵ and not equal to the string itself

22
Terms for Parts of String
 Subsequence
 String formed by deleting zero or more not necessarily consecutive
positions from the string
 E.g. some subsequences of banana can be
baan bn …
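
The terms for parts of a string translate directly into short functions. The sketch below uses hypothetical helper names (prefixes, suffixes, substrings, proper, is_subsequence) and represents ϵ by the empty Python string.

```python
def prefixes(s):
    # Remove zero or more symbols from the end of s.
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    # Remove zero or more symbols from the beginning of s.
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    # Delete any prefix and any suffix; the set removes repeats such as "an".
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def proper(parts, s):
    # Keep only the parts that are neither the empty string nor s itself.
    return [p for p in parts if p != "" and p != s]

def is_subsequence(t, s):
    # t is a subsequence of s if its symbols appear in s in the same order.
    it = iter(s)
    return all(ch in it for ch in t)

print(prefixes("banana"))
print(suffixes("banana"))
print(is_subsequence("baan", "banana"), is_subsequence("nab", "banana"))
```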

23
Operations on Languages

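OPERATION                  DEFINITION & NOTATION

Union of L and M           L∪M = { s | s is in L or s is in M }

Concatenation of L and M   LM = { st | s is in L and t is in M }

Kleene closure of L        L* = L^0 ∪ L^1 ∪ L^2 ∪ …  (zero or more concatenations of L)

Positive closure of L      L+ = L^1 ∪ L^2 ∪ L^3 ∪ …  (one or more concatenations of L)

Here L and M are two languages, L^0 = { ϵ } and L^i = L^(i-1) L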
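
For finite languages these operations can be tried out with Python sets. This is a sketch with hypothetical function names; since the closures are infinite in general, they are only built up to a chosen maximum power, which is an assumption of the example.

```python
def union(L, M):
    return L | M

def concat(L, M):
    # Every string of L followed by every string of M.
    return {s + t for s in L for t in M}

def power(L, i):
    # L^0 = {""} (the empty string); L^i = L^(i-1) L
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_closure(L, max_power):
    # Only L^0 ∪ ... ∪ L^max_power is built, since L* is infinite in general.
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result

def positive_closure(L, max_power):
    # Same as the Kleene closure but starting from L^1.
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result

L = {"0", "1"}
M = {"a", "b"}
print(sorted(concat(L, M)))
print(sorted(power(L, 2)))
print(sorted(kleene_closure(L, 2)))
```

Here power(L, 2) is the language of all strings of length 2 over {0,1} used as an example earlier.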

24
Regular Expression

Notation   Meaning

ϵ          Empty string
| or /     Choice
*          Kleene closure: 0 or more occurrences
+          Positive closure: 1 or more occurrences
?          0 or 1 occurrence
()         Grouping, e.g. of a concatenation
{}         Used to define range of occurrence
ø          Null set
U          Union
[]         Character set
^          Complement of a character set

25
Algebraic laws for Regular Expression

LAW                                 DESCRIPTION

r|s = s|r                           | is commutative
r|(s|t) = (r|s)|t                   | is associative
r(st) = (rs)t                       Concatenation is associative
r(s|t) = rs|rt ; (s|t)r = sr|tr     Concatenation distributes over |
ϵr = rϵ = r                         ϵ is the identity for concatenation
r* = ( r | ϵ )*                     ϵ is guaranteed in a closure
r** = r*                            * is idempotent

26
Regular expression
 RE for identifier

[ a-z A-Z _ ] [ a-z A-Z _ 0-9 ] *

27
Regular Definition
 Allows giving names to certain regular
expressions so that those names can be used in
subsequent expressions
 A regular definition is a sequence of definitions of
the form
d1 → r1
d2 → r2
…
dn → rn

where
1) each di is a new symbol, not in Σ and not the
same as any other of the d’s
2) each ri is a regular expression over the
alphabet Σ ∪ { d1 , d2 , … , di-1 }
28
Regular Definition
 Regular definition for identifier

letter → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*

 Using shorthands

letter → [ A-Z a-z _ ]
digit → [ 0-9 ]
id → letter ( letter | digit )*

29
Regular Definition
 Regular definition for unsigned number

digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ϵ
optionalExponent → ( E ( + | - | ϵ ) digits ) | ϵ
number → digits optionalFraction optionalExponent

 Using shorthands

digit → [ 0-9 ]
digits → digit+
number → digits ( . digits )? ( E [ + - ]? digits )?
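
The shorthand definitions for identifier and unsigned number map almost one to one onto Python's re syntax. This is a sketch with two adjustments that are assumptions of the example: the "." is escaped because it is a wildcard in Python, and the exponent marker accepts a lowercase e as well as E (the sample lexeme 6.02e23 shown earlier uses one).

```python
import re

# id → letter ( letter | digit )*
ID = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

# number → digits ( . digits )? ( E [+-]? digits )?
NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?")

def is_id(lexeme):
    # fullmatch requires the whole lexeme to fit the pattern
    return ID.fullmatch(lexeme) is not None

def is_number(lexeme):
    return NUMBER.fullmatch(lexeme) is not None

for lexeme in ["pi", "D2", "2D", "3.14159", "6.02e23", "3.", ".5"]:
    print(lexeme, is_id(lexeme), is_number(lexeme))
```

Note that "3." and ".5" are rejected, because the definition requires digits on both sides of the decimal point.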
30
Transition Diagram
 An intermediate step in the construction of a lexical
analyzer is the conversion of patterns into stylized
flowcharts called “transition diagrams”

 A transition diagram has a collection of nodes
or circles called “states”
Each state represents a condition that could
occur during the process of scanning the input
looking for a lexeme that matches one of
several patterns

31
Transition Diagram
 “Edges” are directed from one state to another,
labeled by a symbol or set of symbols
If the current state is “s” and the input symbol is “a”,
then find an edge out of “s” with label “a”; if
found, advance the forward pointer in the
buffer and enter the state of the transition
diagram to which that edge leads
Assume that all transition diagrams are
deterministic, meaning there is never more than
one edge out of a state with a given symbol
among its labels

32
Transition Diagram
 “Start state” or “initial state” is indicated by an edge
labeled “start”
A transition diagram always begins in the start state

 Certain states are called “accepting states” or
“final states”; they indicate that a lexeme has been found
and are drawn as double circles; if any action is to be
taken, attach that action to the accepting state

 If it is necessary to retract the forward pointer (at
that point you cannot yet decide whether the token
can be generated, so you need to check one
or more further characters), then additionally place
* near the accepting state
33
Transition Diagram
LEXEME        TOKEN NAME    ATTRIBUTE VALUE
<             relop         LT
<=            relop         LE
=             relop         EQ
<>            relop         NE
>             relop         GT
>=            relop         GE
if            if            -
then          then          -
Any id        id            Pointer to table entry
Any number    number        Pointer to table entry
Any ws        -             -

34
Transition Diagram (Recognition relational operation)

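start: state 0
state 0 on <       → state 1
state 1 on =       → state 2      return (relop , LE)
state 1 on >       → state 3      return (relop , NE)
state 1 on other   → state 4 *    return (relop , LT)
state 0 on =       → state 5      return (relop , EQ)
state 0 on >       → state 6
state 6 on =       → state 7      return (relop , GE)
state 6 on other   → state 8 *    return (relop , GT)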
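
The relop transition diagram can be coded by following its states one by one. The sketch below is one possible hand translation: the function name relop and the convention of returning the position after the lexeme are assumptions. The two starred states are the ones that retract, so they consume only one character.

```python
def relop(text, pos=0):
    """Follow the relop transition diagram starting at text[pos].

    Returns ((token_name, attribute), next_pos), or None if no
    relational operator starts at pos.
    """
    def peek(i):
        return text[i] if i < len(text) else ""   # "" stands for end of input

    c = peek(pos)                                  # state 0
    if c == "<":                                   # state 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2        # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2        # state 3
        return ("relop", "LT"), pos + 1            # state 4 (*): retract
    if c == "=":
        return ("relop", "EQ"), pos + 1            # state 5
    if c == ">":                                   # state 6
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2        # state 7
        return ("relop", "GT"), pos + 1            # state 8 (*): retract
    return None

print(relop("<= b"))
print(relop("<a"))
print(relop("a>=b", 1))
```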

35
Transition Diagram (Recognition identifier and keyword)

start: state 9
state 9 on letter             → state 10
state 10 on letter or digit   → state 10
state 10 on other             → state 11 *    return ( getToken() , installID() )

 Problem is to differentiate keywords and
identifiers

36
Transition Diagram (Recognition identifier and keyword)

 Two solutions

 Install keywords in the symbol table initially
 One field of the symbol table indicates whether an entry is a keyword or
an identifier
When a pattern matches, getToken looks the lexeme up in the symbol table
If found, return the pointer to its symbol-table entry
If not, installID enters that identifier in the symbol table
and returns the pointer to that entry

 Create a separate transition diagram for each
keyword
 All edges show the successive letters of the keyword;
the last test is for “nonletter-or-digit”
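
The first solution, installing the keywords in the symbol table up front, can be sketched as follows. Everything named here is hypothetical: the class SymbolTable, the keyword list and the split of work in which install_id returns the entry position while get_token returns the token name are assumptions made for illustration.

```python
KEYWORDS = ["if", "then", "else"]       # illustrative keyword set

class SymbolTable:
    def __init__(self):
        self.entries = []               # each entry: {"lexeme", "token"}
        self.index = {}                 # lexeme -> position in entries
        for kw in KEYWORDS:             # install keywords initially
            self._add(kw, kw)

    def _add(self, lexeme, token):
        self.index[lexeme] = len(self.entries)
        self.entries.append({"lexeme": lexeme, "token": token})
        return self.index[lexeme]

    def install_id(self, lexeme):
        """Return the entry position, entering the lexeme if it is new."""
        if lexeme in self.index:
            return self.index[lexeme]
        return self._add(lexeme, "id")

    def get_token(self, lexeme):
        """Return the keyword's own token name, or "id" otherwise."""
        pos = self.index.get(lexeme)
        return self.entries[pos]["token"] if pos is not None else "id"

table = SymbolTable()
for lexeme in ["if", "count", "then", "count"]:
    print(lexeme, table.get_token(lexeme), table.install_id(lexeme))
```

The "token" field plays the role of the symbol-table field that tells keywords and identifiers apart.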

37
Transition Diagram (Recognition unsigned number)

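[Transition diagram: from start, an edge on digit leads to a state that loops on digit; then optionally . and digit, looping on digit; then optionally E , an optional + or - , and digit, looping on digit. From each of the three digit-loop states an edge on other leads to an accepting state marked *]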

38
Transition Diagram (Recognition whitespace)

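[Transition diagram: from start, an edge on delim leads to a state that loops on delim; an edge on other then leads to an accepting state marked *]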

39
Lexical Analyzer Generator LEX

Lex source file (.l) → LEX Compiler → lex.yy.c

lex.yy.c → C Compiler → a.out

Input stream → a.out → Sequence of tokens

40
Structure of Lex Program
declarations
%%
translation rules
%%
auxiliary functions

declarations: declaration of variables, constants and regular definitions
translation rules: of the form  Pattern { Action }  where
Pattern: regular expression, which may use regular definitions from the declarations section
Action: fragment of code written in C
auxiliary functions: hold additional functions used in the actions

41
Finite Automata
 They are recognizers

 Simply say “yes” or “no” about each possible
input string

 Two flavors

 NFA (Nondeterministic Finite Automata)
 A symbol can label several edges out of the same state,
and ϵ is also a possible label

 DFA (Deterministic Finite Automata)
 For each state and each symbol, exactly one edge with that symbol
leaving that state

42
NFA
 Consists of

S                     Finite set of states
Σ                     Set of input symbols, called the input alphabet
Transition function   Gives, for each state and each input symbol (or ϵ), a set of next states
s0                    Start state or initial state, from S
F                     Set of accepting states, subset of S

43
DFA
 Special case of NFA

 No moves on input ϵ
 For each state “s” and input symbol “a”, there is exactly one
edge out of “s” labeled with “a”

44
RE to NFA (using Thompson Construction)
 McNaughton-Yamada-Thompson algorithm

 Works recursively

 For each subexpression it constructs an NFA with a
single accepting state

45
RE to NFA (using Thompson Construction)

[Figure: Thompson construction templates that combine N(s) and N(t) with ϵ-edges for s|t , st and s*]
46
RE to NFA (using Thompson Construction)

(a|b)*ab

[Figure: NFA obtained by Thompson construction, with edges labeled a , b and ϵ]

47
RE to NFA
 More examples
 (a|b)*
a * | b *
( a * | b ) * c a *
( a | b ) * c * d
a * b a
a + b * ( c | d )
a + b ( c | d ) a * b
( ( a * b * ) c ) * a *
 ( a | b )* a b b
 a a* | b b*

48
RE to DFA
 Steps

1. Add # at the end of the RE
2. Construct the syntax tree of the RE
3. Number the leaf positions (symbols of the RE), starting with 1
4. Find FIRSTPOS for all nodes in the constructed tree
5. Find FOLLOWPOS for all positions in the constructed tree
6. Take FIRSTPOS of the root (topmost node) as state A
7. Find the next state for each possible input symbol from A
8. If a new set appears then give it a new name (i.e. B, C, D, …)
9. Continue until no new set appears
10. Make the transition table
11. Construct the DFA from the transition table
12. Remove the transitions on #
13. Change the final state accordingly
14. If transitions are missing from any state for any symbol then
create a dead state and connect the missing transitions to it
49
nullable( ) , firstpos( ) , followpos( )
 nullable(n)
 If “n” is a leaf node labeled ϵ then nullable(n) is
TRUE
 If “n” is a leaf node labeled with a position i then nullable(n) is
FALSE

 firstpos(n)
 Set of positions (symbols) that can come first
in the subexpression rooted at node “n”

 followpos(n)
 Set of positions (symbols) that can follow
position “n”
50
nullable( ) , firstpos( ) , followpos( )

NODE n                    nullable(n)                     firstpos(n)

A leaf labeled ϵ          true                            Ø
A leaf with position i    false                           {i}
An or-node n = c1 | c2    nullable(c1) or nullable(c2)    firstpos(c1) ∪ firstpos(c2)
A cat-node n = c1 c2      nullable(c1) and nullable(c2)   if (nullable(c1)) firstpos(c1) ∪ firstpos(c2)
                                                          else firstpos(c1)
A star-node n = c1*       true                            firstpos(c1)
51
nullable( ) , firstpos( ) , followpos( )

Leaf n1:
FIRSTPOS (n1) = { n1 }

Star node * with child n1:
FIRSTPOS ( * ) = FIRSTPOS (n1)

Or node / with children n1 , n2:
FIRSTPOS ( / ) = FIRSTPOS (n1) U FIRSTPOS (n2)

Cat node . with children n1 , n2:
FIRSTPOS ( . ) = FIRSTPOS (n1) U FIRSTPOS (n2) if n1 is nullable
FIRSTPOS ( . ) = FIRSTPOS (n1) if n1 is not nullable
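
The rules for nullable and firstpos can be written as recursive functions over a syntax tree. The sketch below makes several assumptions: nodes are plain tuples of the form ("leaf", symbol, position), ("eps",), ("or", c1, c2), ("cat", c1, c2) or ("star", c1); lastpos is taken as the mirror image of firstpos; and followpos uses the usual textbook rules for cat-nodes and star-nodes, which are not spelled out on these slides.

```python
def nullable(n):
    kind = n[0]
    if kind == "eps":
        return True
    if kind == "leaf":
        return False
    if kind == "or":
        return nullable(n[1]) or nullable(n[2])
    if kind == "cat":
        return nullable(n[1]) and nullable(n[2])
    return True                                   # star-node

def firstpos(n):
    kind = n[0]
    if kind == "eps":
        return set()
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return firstpos(n[1]) | firstpos(n[2])
    if kind == "cat":
        if nullable(n[1]):
            return firstpos(n[1]) | firstpos(n[2])
        return firstpos(n[1])
    return firstpos(n[1])                         # star-node

def lastpos(n):
    # Mirror image of firstpos (assumed, not shown on the slides).
    kind = n[0]
    if kind == "eps":
        return set()
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return lastpos(n[1]) | lastpos(n[2])
    if kind == "cat":
        if nullable(n[2]):
            return lastpos(n[1]) | lastpos(n[2])
        return lastpos(n[2])
    return lastpos(n[1])                          # star-node

def followpos(n, table):
    # cat-node: firstpos(c2) follows every position in lastpos(c1)
    # star-node: firstpos(c1) follows every position in lastpos(c1)
    kind = n[0]
    if kind == "cat":
        for i in lastpos(n[1]):
            table[i] |= firstpos(n[2])
    elif kind == "star":
        for i in lastpos(n[1]):
            table[i] |= firstpos(n[1])
    for child in n[1:]:
        if isinstance(child, tuple):              # skip symbols and positions
            followpos(child, table)
    return table

# Syntax tree for ( a | b ) * a b #  with positions 1 to 5
a1, b2, a3, b4, end5 = (("leaf", s, i) for i, s in enumerate("abab#", start=1))
tree = ("cat", ("cat", ("cat", ("star", ("or", a1, b2)), a3), b4), end5)
table = followpos(tree, {i: set() for i in range(1, 6)})
print(firstpos(tree))
print(table)
```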

52
RE to DFA (using firstpos, followpos, lastpos)

( a | b ) * a b #
  1   2     3 4 5

Syntax tree with firstpos of each node (children indented under their parent):
.  {1,2,3}
   .  {1,2,3}
      .  {1,2,3}
         *  {1,2}
            /  {1,2}
               a  {1}
               b  {2}
         a  {3}
      b  {4}
   #  {5}

A = firstpos(root) = { 1 , 2 , 3 }

FOLLOWPOS (1) = { 1 , 2 , 3 }
FOLLOWPOS (2) = { 1 , 2 , 3 }
FOLLOWPOS (3) = { 4 }
FOLLOWPOS (4) = { 5 }
FOLLOWPOS (5) = { } = ø

53
RE to DFA (using firstpos, followpos, lastpos)

( a | b ) * a b #
  1   2     3 4 5

State A = { 1 , 2 , 3 }   (positions 1 , 2 , 3 hold the symbols a , b , a)
i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
i/p b = FOLLOWPOS (2) = { 1 , 2 , 3 } = A
i/p # = ---

STATE   POSITIONS           a   b   #
A       { 1 , 2 , 3 }       B   A   -
B       { 1 , 2 , 3 , 4 }

54
RE to DFA (using firstpos, followpos, lastpos)

( a | b ) * a b #
  1   2     3 4 5

State B = { 1 , 2 , 3 , 4 }   (positions 1 , 2 , 3 , 4 hold the symbols a , b , a , b)
i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
i/p b = FOLLOWPOS (2) U FOLLOWPOS (4) = { 1 , 2 , 3 , 5 } = C
i/p # = ---

STATE   POSITIONS           a   b   #
A       { 1 , 2 , 3 }       B   A   -
B       { 1 , 2 , 3 , 4 }   B   C   -
C       { 1 , 2 , 3 , 5 }

55
RE to DFA (using firstpos, followpos, lastpos)

( a | b ) * a b #
  1   2     3 4 5

State C = { 1 , 2 , 3 , 5 }   (positions 1 , 2 , 3 , 5 hold the symbols a , b , a , #)
i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
i/p b = FOLLOWPOS (2) = { 1 , 2 , 3 } = A
i/p # = FOLLOWPOS (5) = { } = D

STATE   POSITIONS           a   b   #
A       { 1 , 2 , 3 }       B   A   -
B       { 1 , 2 , 3 , 4 }   B   C   -
C       { 1 , 2 , 3 , 5 }   B   A   D
D       { }

56
RE to DFA (using firstpos, followpos, lastpos)

( a | b ) * a b #
  1   2     3 4 5

State D = { }
i/p a = ---
i/p b = ---
i/p # = ---

STATE   POSITIONS           a   b   #
A       { 1 , 2 , 3 }       B   A   -
B       { 1 , 2 , 3 , 4 }   B   C   -
C       { 1 , 2 , 3 , 5 }   B   A   D
D       { }                 -   -   -

57
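RE to DFA (using firstpos, followpos, lastpos)

( a | b ) * a b #
  1   2     3 4 5

STATE   POSITIONS           a   b   #
A       { 1 , 2 , 3 }       B   A   -
B       { 1 , 2 , 3 , 4 }   B   C   -
C       { 1 , 2 , 3 , 5 }   B   A   D
D       { }                 -   -   -

Resulting DFA
Initial state: A
A on a → B , A on b → A
B on a → B , B on b → C
C on a → B , C on b → A , C on # → D
Final state: C once the # transition is removed (C contains position 5 , the position of #)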
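
The state-by-state construction of this example can be automated once FOLLOWPOS is known. The sketch below takes the FOLLOWPOS sets and the symbol at each position from the worked example as given; the function names build_dfa and accepts are hypothetical, states are represented as frozensets of positions, and the # transition is left out by building the table over the alphabet {a, b} only, so a state is accepting when it contains position 5.

```python
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "#"}
START = frozenset({1, 2, 3})            # firstpos(root), state A

def build_dfa(start, followpos, symbol_at, alphabet):
    """Return {state: {symbol: next_state}} with states as frozensets."""
    transitions = {}
    pending = [start]
    while pending:
        state = pending.pop()
        if state in transitions:
            continue                    # already processed
        transitions[state] = {}
        for sym in alphabet:
            # Union of followpos(p) over the positions p in this state
            # that hold the symbol sym.
            target = set()
            for p in state:
                if symbol_at[p] == sym:
                    target |= followpos[p]
            if target:
                nxt = frozenset(target)
                transitions[state][sym] = nxt
                pending.append(nxt)
    return transitions

def accepts(transitions, start, end_position, text):
    state = start
    for ch in text:
        state = transitions[state].get(ch)
        if state is None:
            return False                # missing transition: dead state
    return end_position in state        # accepting if it holds the # position

dfa = build_dfa(START, FOLLOWPOS, SYMBOL_AT, "ab")
for text in ["ab", "aab", "abb", "ba", ""]:
    print(repr(text), accepts(dfa, START, 5, text))
```

The three states found are the sets named A , B and C in the table, and only strings ending in ab are accepted.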

58
RE to DFA
 More examples
( a* | b* ) c* a
( ( a* b a ) | ( b* a ) ) ( a b )*
( a | b )* a ( a| b )
( ( a* b* ) c )* a*

59
RE to NFA (using firstpos, followpos, lastpos)

( a | b ) * a b #
  1   2     3 4 5

Syntax tree with firstpos of each node (children indented under their parent):
.  {1,2,3}
   .  {1,2,3}
      .  {1,2,3}
         *  {1,2}
            /  {1,2}
               a  {1}
               b  {2}
         a  {3}
      b  {4}
   #  {5}

60
RE to NFA (using firstpos, followpos, lastpos)

( a | b ) * a b #
  1   2     3 4 5

FOLLOWPOS (1) = { 1 , 2 , 3 }
FOLLOWPOS (2) = { 1 , 2 , 3 }
FOLLOWPOS (3) = { 4 }
FOLLOWPOS (4) = { 5 }
FOLLOWPOS (5) = { }

1. Draw a state for each of the numbers 1 to 5
2. Take firstpos(root) = {1,2,3} as initial states
3. Take 5 ( # ) as final state
4. Find the follow of 1: it contains 1 , 2 , 3 , so draw transitions from
1 → 1 (1 means “a” so label with “a”)
1 → 2 (2 means “b” so label with “b”)
1 → 3 (3 means “a” so label with “a”)
5. Do the same for all states

[Figure: resulting NFA over the states 1 to 5]

We can construct a DFA from this NFA using the subset
construction method (covered in …

61
END

62
