Lexical Analysis
By:
Trusha R. Patel
Asst. Prof.
CE Dept, CSPIT, CHARUSAT
Role of Lexical Analyzer
1. Main task is to read input characters of the
source program, group them into lexemes
and produce as output a sequence of tokens
for each lexeme in the program
3. Stripping out comments and whitespace
Interaction of Lexical Analyzer with Syntax Analyzer
[Figure: Source program → Lexical Analyzer (Scanner) → token → Syntax Analyzer (Parser) → to semantic analyzer. The parser requests each token with getNextToken; both analyzers use the symbol table.]
Separating Lexical Analyzer from Syntax Analyzer
1. Simplicity of design
e.g. it is complex for the parser to deal with
comments and whitespace as syntactic units,
so they are removed in lexical analysis
Token, Pattern, Lexeme
Token
It’s a pair consisting of token name and attribute
value
Generally write token name in boldface
Refer to a token by its name
Pattern
Description of the form that lexemes of a token
may take
Lexeme
Sequence of characters in the source program that
matches the pattern for a token
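A minimal sketch of the three terms, assuming patterns are written as Python regular expressions. The token names and patterns in PATTERNS are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical token names and patterns, for illustration only.
PATTERNS = {
    "if": r"if",                       # keyword: exactly the characters i, f
    "id": r"[A-Za-z_][A-Za-z0-9_]*",   # letter followed by letters or digits
    "number": r"[0-9]+",               # one or more digits
}

def tokens_matching(lexeme):
    """Return the names of all tokens whose pattern matches the whole lexeme."""
    return [name for name, pattern in PATTERNS.items()
            if re.fullmatch(pattern, lexeme)]

print(tokens_matching("if"))      # the keyword pattern and the id pattern both match
print(tokens_matching("count"))
print(tokens_matching("42"))
```

The lexeme if matches two patterns here, which is why a real scanner needs a rule for preferring keywords over identifiers.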
Token    Informal description    Sample lexeme
if       Characters i, f         if
Attributes for Token
When more than one lexeme can match a
pattern, the lexical analyzer must provide
additional information about the matched lexeme
to the subsequent compiler phases
Example
Token id
Information about identifier is kept in symbol table
Attribute value for identifier is a pointer to the
symbol table entry for that identifier
Example
Name and associated attribute value for
E = M * C ** 2
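A small sketch of how the pairs for this statement could be produced. The token names (id, assign_op, mult_op, exp_op, number) and the use of a list index as the symbol-table pointer are assumptions made for illustration.

```python
import re

# Token names are illustrative assumptions, not fixed by the slides.
TOKEN_SPEC = [
    ("number",    r"[0-9]+"),
    ("id",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("exp_op",    r"\*\*"),   # listed before mult_op so ** is not read as two *
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("ws",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Return (tokens, symbol_table); an id attribute is a symbol-table index."""
    symbol_table, tokens = [], []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"no token pattern matches at position {pos}")
        name, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if name == "ws":
            continue                      # whitespace is stripped
        if name == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((name, symbol_table.index(lexeme)))
        elif name == "number":
            tokens.append((name, int(lexeme)))
        else:
            tokens.append((name, None))   # operators need no attribute value
    return tokens, symbol_table

tokens, table = tokenize("E = M * C ** 2")
for token in tokens:
    print(token)
```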
Lexical Error
If fi is encountered for the first time in a C program, as in
fi ( a == f(x) ) …
then the lexical analyzer cannot tell whether fi is
a misspelling of the keyword if or an undeclared
function identifier
If the lexical analyzer is unable to proceed
because none of the token patterns
matches any prefix of the remaining input,
the simplest recovery is "panic mode": delete successive
characters from the remaining input until a well-formed token is found
Other possible error-recovery actions are
Delete one character from the remaining input
Insert a missing character into the remaining input
Replace a character by another character
Transpose two adjacent characters
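The four single-character repairs can be sketched as a function that lists every string one edit away from a lexeme. The keyword set and the lowercase alphabet are hypothetical; a real scanner would rarely try all repairs.

```python
def one_edit_repairs(text, alphabet):
    """All strings reachable from text by one delete, insert, replace or transpose."""
    splits = [(text[:i], text[i:]) for i in range(len(text) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    return deletes | inserts | replaces | transposes

KEYWORDS = {"if", "for", "while"}  # hypothetical keyword set

def keyword_repairs(lexeme):
    """Keywords that are a single edit away from lexeme."""
    candidates = one_edit_repairs(lexeme, "abcdefghijklmnopqrstuvwxyz")
    return sorted(candidates & KEYWORDS)

print(keyword_repairs("fi"))    # transposing the two characters gives a keyword
```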
Input Buffering
In C, single-character operators like > , = , <
can also be the beginning of two-character
operators >= , == , <= , so the lexical analyzer
must look at least one character ahead before deciding on a token
Buffer Pairs
A large amount of time is taken to process
characters
This buffering technique has been developed
to reduce the amount of overhead required to
process a single input character
[Figure: two buffers, each of size N]
One system read command will read N
characters into buffer instead of one character
If fewer than N characters remain in input file,
then special character eof marks the end of
source file
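A rough sketch of the two-buffer scheme, assuming the source arrives through a file-like object and that the character "\0" can serve as the eof marker. The class name BufferPair, the helper read_all and the reads counter are hypothetical; lexemeBegin handling is left out to keep the sketch short.

```python
import io

EOF = "\0"  # assumed eof marker; must not occur in the source text

class BufferPair:
    """Two buffers of N characters, refilled alternately by one read each."""

    def __init__(self, stream, n=8):
        self.stream, self.n = stream, n
        self.buffers = ["", ""]
        self.current = 0          # which buffer forward is scanning
        self.forward = 0          # index of the next character in that buffer
        self.reads = 0            # number of read calls issued so far
        self._load(0)

    def _load(self, which):
        chunk = self.stream.read(self.n)   # one read fetches up to N characters
        self.reads += 1
        if len(chunk) < self.n:
            chunk += EOF                   # eof marks the end of the source
        self.buffers[which] = chunk

    def next_char(self):
        """Return the next character, switching buffers when one is exhausted."""
        if self.forward == len(self.buffers[self.current]):
            self.current = 1 - self.current
            self._load(self.current)
            self.forward = 0
        ch = self.buffers[self.current][self.forward]
        if ch != EOF:
            self.forward += 1
        return ch

def read_all(text, n=8):
    """Scan text through a BufferPair; return (characters seen, read calls used)."""
    pair, seen = BufferPair(io.StringIO(text), n), []
    while (ch := pair.next_char()) != EOF:
        seen.append(ch)
    return "".join(seen), pair.reads

print(read_all("ab=2;d=ab+3;"))   # 12 characters fetched with 2 reads of N = 8
```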
Example input: a b = 2 ; d = a b + 3 ;
Two pointers: lexemeBegin marks the start of the current lexeme, forward scans ahead
forward scans a and b: match with id
The next character starts a new token, so move forward one position left
id token generated for ab, now set lexemeBegin to the position after forward
In the same way tokens are generated for = , 2 , ;
Then forward reaches the end of the first buffer, so the next buffer is loaded
Set the pointers in the new buffer and repeat the same process for the next buffer
Symbol, Alphabet, String, Language
Symbol
It can be a letter, digit, punctuation mark or operator
E.g.
a-z A-Z 0-9 ; = + , etc.
Alphabet (Σ)
Finite set of symbols
E.g.
{0,1} { a , b , c } etc.
String
A string over an alphabet is a finite sequence of symbols
drawn from that alphabet
E.g. 0110 is a string over {0,1}
Language (L)
Any countable set of strings over some fixed
alphabet
E.g.
over the alphabet {0,1}, the language of all strings of
length 2 can be given as { 00 , 01 , 10 , 11 }
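The same language can be enumerated mechanically. A small sketch, with the helper name strings_of_length chosen here for illustration:

```python
from itertools import product

def strings_of_length(alphabet, k):
    """All strings of length k over the alphabet, symbols taken in sorted order."""
    return ["".join(symbols) for symbols in product(sorted(alphabet), repeat=k)]

print(strings_of_length({"0", "1"}, 2))
```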
Terms for Parts of String
Prefix
String obtained by removing zero or more symbols
from the end of the string
Suffix
String obtained by removing zero or more symbols
from the beginning of the string
Substring
Obtained by deleting any prefix and any suffix
from string
E.g. some substrings of banana can be
nan anan …
Subsequence
String formed by deleting zero or more, not necessarily consecutive,
positions from the string
E.g. some subsequences of banana are
baan bn …
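The four terms can be checked with a few helper functions. This is a sketch; the function names are chosen here and are not part of the slides.

```python
def prefixes(s):
    """Strings obtained by removing zero or more symbols from the end."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """Strings obtained by removing zero or more symbols from the beginning."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """Strings obtained by deleting any prefix and any suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def is_subsequence(t, s):
    """True if t can be obtained from s by deleting zero or more symbols."""
    remaining = iter(s)
    return all(ch in remaining for ch in t)   # each search resumes where the last stopped

print("nan" in substrings("banana"), is_subsequence("baan", "banana"))
```

Note that baan is a subsequence of banana but not a substring, since its symbols are not consecutive in banana.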
Operations on Languages
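Kleene closure of L     L* = L^0 ∪ L^1 ∪ L^2 ∪ …   (zero or more concatenations of L)
Positive closure of L   L+ = L^1 ∪ L^2 ∪ L^3 ∪ …   (one or more concatenations of L)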
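Both closures are infinite in general, so the sketch below only builds a finite approximation up to a chosen power. The function names and the max_power cut-off are assumptions made for illustration.

```python
def concat(l1, l2):
    """Concatenation of two languages given as finite sets of strings."""
    return {x + y for x in l1 for y in l2}

def power(lang, i):
    """L^i: L concatenated with itself i times; L^0 is the set holding the empty string."""
    result = {""}
    for _ in range(i):
        result = concat(result, lang)
    return result

def kleene_closure(lang, max_power):
    """Finite approximation of L*: union of L^0 .. L^max_power."""
    return set().union(*(power(lang, i) for i in range(max_power + 1)))

def positive_closure(lang, max_power):
    """Finite approximation of L+: union of L^1 .. L^max_power."""
    return set().union(*(power(lang, i) for i in range(1, max_power + 1)))

print(sorted(kleene_closure({"a", "b"}, 2)))
```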
Regular Expression
Notation   Meaning
ϵ          Null (empty) string
| or /     Choice
*          Kleene closure: 0 or more occurrences
+          Positive closure: 1 or more occurrences
()         Grouping (e.g. of a concatenation)
{}         Used to define range of occurrences
ø          Null set
U          Union
[]         Character set
Algebraic laws for Regular Expression
LAW                          DESCRIPTION
r | s = s | r                | is commutative
r | ( s | t ) = ( r | s ) | t    | is associative
r** = r*                     * is idempotent
Regular expression
RE for identifier
Regular Definition
Allows names to be given to certain regular
expressions; those names can be used in
subsequent expressions
A regular definition is a sequence of definitions of
the form
d1 → r1
d2 → r2
…
dn → rn
where
1) each di is a new symbol, not in Σ and not the same as any other d
2) each ri is a regular expression over the
alphabet Σ ∪ { d1 , d2 , … , di-1 }
Regular definition for identifier
letter → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
Using shorthands
letter → [A-Za-z_]
digit → [0-9]
id → letter ( letter | digit )*
Regular definition for unsigned number
digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ϵ
optionalExponent → ( E ( + | - | ϵ ) digits ) | ϵ
number → digits optionalFraction optionalExponent
Using shorthands
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
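The shorthand definition of number translates almost directly into a Python regular expression. A sketch, with the names DIGITS, NUMBER and is_unsigned_number chosen here for illustration:

```python
import re

# digit  -> [0-9]
# digits -> digit+
# number -> digits ( . digits )? ( E [+-]? digits )?
DIGITS = r"[0-9]+"
NUMBER = re.compile(rf"{DIGITS}(\.{DIGITS})?(E[+-]?{DIGITS})?")

def is_unsigned_number(lexeme):
    """True if the whole lexeme matches the pattern for number."""
    return NUMBER.fullmatch(lexeme) is not None

for lexeme in ["5280", "0.01234", "6.336E4", "1.89E-4", "1.", ".5"]:
    print(lexeme, is_unsigned_number(lexeme))
```

The last two lexemes are rejected because the definition requires digits on both sides of the point.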
Transition Diagram
An intermediate step in the construction of a lexical
analyzer is the conversion of patterns into stylized
flowcharts called “transition diagrams”
“Edges” are directed from one state to other,
labeled by a symbol or set of symbols
If the current state is “s” and the input symbol is “a”,
find the edge out of “s” with label “a”; if
found, advance the forward pointer in the
buffer and enter the state of the transition
diagram to which that edge leads
Assume that all transition diagrams are
deterministic, meaning there is never more than
one edge out of a state with a given symbol
among its labels
“Start state” or “initial state” indicated by edge
labeled “start”
Transition diagram always begin in the start state
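[Figure: transition diagram for relop; * marks a state where forward is retracted one position]
start → 0
0 → on < → … → on other → 4 *    return (relop , LT)
0 → on = → 5                     return (relop , EQ)
0 → on > → 6
6 → on = → 7                     return (relop , GE)
6 → on other → 8 *               return (relop , GT)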
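A transition diagram like the one for relop can be simulated with a state variable and a forward index. In this sketch the state numbers 0, 4, 5, 6, 7 and 8 follow the diagram; state 1 for the < branch and the LE token are assumptions, since they are not visible in the diagram. Retraction is modelled by simply not advancing forward on "other".

```python
def relop(text, pos=0):
    """Simulate the relop diagram; return ((token, attribute), next_pos) or None."""
    state, forward = 0, pos
    while True:
        ch = text[forward] if forward < len(text) else ""   # "" acts as "other"
        if state == 0:
            if ch == "<":
                state = 1                                   # assumed state number
            elif ch == "=":
                return ("relop", "EQ"), forward + 1         # state 5
            elif ch == ">":
                state = 6
            else:
                return None                                 # input is not a relop
            forward += 1
        elif state == 1:
            if ch == "=":
                return ("relop", "LE"), forward + 1         # assumed, not in the diagram
            return ("relop", "LT"), forward                 # state 4*: forward not advanced
        else:                                               # state 6
            if ch == "=":
                return ("relop", "GE"), forward + 1         # state 7
            return ("relop", "GT"), forward                 # state 8*: forward not advanced

print(relop(">= 1"), relop("> 1"))
```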
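Transition Diagram (Recognition identifier and keyword)
[Figure: transition diagram for identifiers and keywords, with a loop edge labeled "letter or digit"]
Two solutions
Transition Diagram (Recognition unsigned number)
[Figure: transition diagram for unsigned numbers, with edges labeled digit and E and three states marked *]
Transition Diagram (Recognition whitespace)
[Figure: transition diagram for whitespace, with an edge labeled delim]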
Lexical Analyzer Generator LEX
[Figure: lex.yy.c → C Compiler → a.out]
Structure of Lex Program
declarations
%%
translation rules
%%
auxiliary functions
where
declarations: declaration of variables, constants and regular definitions
translation rules: of the form  Pattern { Action }
Pattern: regular expression, which may use regular definitions from the declaration section
Action: fragment of code written in C
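The Pattern { Action } idea can be imitated in Python with a list of (pattern, action) pairs. This is only a sketch of the idea, not Lex itself: the rule list, token names and the scan function are hypothetical. Conflicts are resolved the way Lex resolves them, by preferring the longest match and, on a tie, the rule listed first.

```python
import re

# Translation rules in the spirit of "Pattern { Action }"; all names are hypothetical.
RULES = [
    (r"[ \t\n]+",               lambda lexeme: None),                    # skip whitespace
    (r"if|then|else",           lambda lexeme: (lexeme, None)),          # keywords
    (r"[A-Za-z_][A-Za-z0-9_]*", lambda lexeme: ("id", lexeme)),
    (r"[0-9]+",                 lambda lexeme: ("number", int(lexeme))),
    (r"<=|>=|==|<|>|=",         lambda lexeme: ("op", lexeme)),
]
COMPILED = [(re.compile(pattern), action) for pattern, action in RULES]

def scan(source):
    """Apply the rule matching the longest prefix; ties go to the earlier rule."""
    tokens, pos = [], 0
    while pos < len(source):
        best_len, best_action = 0, None
        for pattern, action in COMPILED:
            m = pattern.match(source, pos)
            if m and m.end() - pos > best_len:   # strictly longer, so earlier rules win ties
                best_len, best_action = m.end() - pos, action
        if best_action is None:
            raise ValueError(f"lexical error at position {pos}")
        token = best_action(source[pos:pos + best_len])
        if token is not None:
            tokens.append(token)
        pos += best_len
    return tokens

print(scan("if x1 >= 10 then ifx = 2"))
```

Here if is returned as a keyword because its rule comes first, while ifx is returned as id because the identifier pattern matches a longer prefix.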
Finite Automata
They are recognizers
Two flavors: NFA and DFA
Transition function (NFA): gives, for each state and each input symbol, a set
of next states
DFA
Special case of NFA
No moves on input ϵ
For each state “s” and input symbol “a”, there is exactly one
edge out of “s” labeled with “a”
RE to NFA (using Thompson Construction)
McNaughton-Yamada-Thompson algorithm
Works recursively
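[Figure: NFA fragments used by the construction]
Basis: for ϵ or a symbol a, a start state and an accepting state joined by one edge labeled ϵ or a
s|t: a new start state with ϵ-edges to the start states of N(s) and N(t); their accepting states have ϵ-edges to a new accepting state
st: N(s) followed by N(t); the accepting state of N(s) is merged with the start state of N(t)
s*: a new start state and a new accepting state around N(s), with ϵ-edges to enter N(s), to repeat it and to skip it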
(a|b)*ab
[Figure: NFA for (a|b)*ab built by Thompson's construction]
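The recursive construction can be sketched with one small function per operator, each returning an NFA fragment with a single start and a single accepting state. The class NFA, the helper names and the edge-list representation are assumptions made for this sketch; the expression is assembled by hand rather than parsed.

```python
import itertools

_ids = itertools.count()   # source of fresh state numbers

class NFA:
    """NFA fragment; edges are (source, label, target), label None means ϵ."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def symbol(a):
    s, f = next(_ids), next(_ids)
    return NFA(s, f, [(s, a, f)])

def union(n1, n2):                       # s|t
    s, f = next(_ids), next(_ids)
    eps = [(s, None, n1.start), (s, None, n2.start),
           (n1.accept, None, f), (n2.accept, None, f)]
    return NFA(s, f, n1.edges + n2.edges + eps)

def concat(n1, n2):                      # st
    # Merge the accepting state of N(s) with the start state of N(t).
    def rename(q):
        return n1.accept if q == n2.start else q
    edges2 = [(rename(src), label, rename(dst)) for src, label, dst in n2.edges]
    return NFA(n1.start, rename(n2.accept), n1.edges + edges2)

def star(n):                             # s*
    s, f = next(_ids), next(_ids)
    eps = [(s, None, n.start), (s, None, f),
           (n.accept, None, n.start), (n.accept, None, f)]
    return NFA(s, f, n.edges + eps)

def eps_closure(nfa, states):
    """All states reachable from the given states using only ϵ-edges."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in nfa.edges:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(nfa, text):
    current = eps_closure(nfa, {nfa.start})
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges if src in current and label == ch}
        current = eps_closure(nfa, moved)
    return nfa.accept in current

# (a|b)*ab assembled bottom-up
nfa = concat(concat(star(union(symbol("a"), symbol("b"))), symbol("a")), symbol("b"))
print(accepts(nfa, "bbaab"), accepts(nfa, "aba"))
```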
RE to NFA
More examples
(a|b)*
a * | b *
( a * | b ) * c a *
( a | b ) * c * d
a * b a
a + b * ( c | d )
a + b ( c | d ) a * b
( ( a * b * ) c ) * a *
( a | b )* a b b
a a* | b b*
RE to DFA
Steps
firstpos(n)
Set of positions that can match the first symbol of a
string generated by the subexpression rooted at node “n”
followpos(p)
Set of positions that can follow position “p” in some
string generated by the whole expression
nullable( ) , firstpos( ) , followpos( )
NODE n                   nullable(n)                     firstpos(n)
A leaf with position i   false                           {i}
An or-node n=c1|c2       nullable(c1) or nullable(c2)    firstpos(c1) ∪ firstpos(c2)
A cat-node n=c1c2        nullable(c1) and nullable(c2)   if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
A star-node n=c1*        true                            firstpos(c1)
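The rules above can be sketched as recursive functions over a syntax tree. The tuple encoding of nodes is hypothetical. The or-node case and lastpos follow the standard definitions, and followpos is filled in from cat-nodes and star-nodes in the usual way; these parts are assumptions where the table does not show them.

```python
# Syntax-tree nodes as tuples (hypothetical encoding for this sketch):
#   ("leaf", symbol, position), ("or", c1, c2), ("cat", c1, c2), ("star", c1)

def nullable(n):
    kind = n[0]
    if kind == "leaf":
        return False
    if kind == "or":
        return nullable(n[1]) or nullable(n[2])
    if kind == "cat":
        return nullable(n[1]) and nullable(n[2])
    return True                                   # star-node

def firstpos(n):
    kind = n[0]
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return firstpos(n[1]) | firstpos(n[2])
    if kind == "cat":
        return (firstpos(n[1]) | firstpos(n[2])) if nullable(n[1]) else firstpos(n[1])
    return firstpos(n[1])                         # star-node

def lastpos(n):
    kind = n[0]
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return lastpos(n[1]) | lastpos(n[2])
    if kind == "cat":
        return (lastpos(n[1]) | lastpos(n[2])) if nullable(n[2]) else lastpos(n[2])
    return lastpos(n[1])                          # star-node

def followpos(n, table=None):
    """Fill table[p] = followpos(p) for every position p in the tree."""
    table = {} if table is None else table
    kind = n[0]
    if kind == "leaf":
        table.setdefault(n[2], set())
        return table
    for child in n[1:]:
        followpos(child, table)
    if kind == "cat":                             # firstpos(c2) follows lastpos(c1)
        for p in lastpos(n[1]):
            table[p] |= firstpos(n[2])
    elif kind == "star":                          # firstpos(c1) follows lastpos(c1)
        for p in lastpos(n[1]):
            table[p] |= firstpos(n[1])
    return table

# (a|b)*ab# with positions 1..5
tree = ("cat", ("cat", ("cat", ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2))),
                        ("leaf", "a", 3)), ("leaf", "b", 4)), ("leaf", "#", 5))
print(firstpos(tree), followpos(tree))
```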
[Figure: syntax-tree fragments for a star-node n with child n1 and an or-node n with children n1 and n2]
RE to DFA (using firstpos, followpos, lastpos)
Augmented expression with positions
( a | b ) * a b #
  1   2     3 4 5
Syntax tree with firstpos of each node
. {1,2,3}
  . {1,2,3}
    . {1,2,3}
      * {1,2}
        | {1,2}
          a {1}
          b {2}
      a {3}
    b {4}
  # {5}
FOLLOWPOS (1) = { 1 , 2 , 3 }
FOLLOWPOS (2) = { 1 , 2 , 3 }
FOLLOWPOS (3) = { 4 }
FOLLOWPOS (4) = { 5 }
FOLLOWPOS (5) = { } = ø
Initial state A = firstpos(root) = { 1 , 2 , 3 }
State A { 1 , 2 , 3 }
i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
i/p b = FOLLOWPOS (2) = { 1 , 2 , 3 } = A
i/p # = ---
State B { 1 , 2 , 3 , 4 }
i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
i/p b = FOLLOWPOS (2) U FOLLOWPOS (4) = { 1 , 2 , 3 , 5 } = C
i/p # = ---
State C { 1 , 2 , 3 , 5 }
i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
i/p b = FOLLOWPOS (2) = { 1 , 2 , 3 } = A
i/p # = FOLLOWPOS (5) = { } = D
State D { }
i/p a = ---
i/p b = ---
i/p # = ---
Transition table
State                   a   b   #
A { 1 , 2 , 3 }         B   A   -
B { 1 , 2 , 3 , 4 }     B   C   -
C { 1 , 2 , 3 , 5 }     B   A   D
D { }                   -   -   -
Initial state: A
Final state: C, the state containing position 5 of the end marker #
RE to DFA
More examples
( a* | b* ) c* a
( ( a* b a ) | ( b* a ) ) ( a b )*
( a | b )* a ( a| b )
( ( a* b* ) c )* a*
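The state construction used in the worked example for (a|b)*ab# can be sketched as a loop over unmarked states. The followpos values are copied from that example; the function names are hypothetical. One difference is assumed here: the end marker # is not treated as an input symbol, so the empty state D never appears and a state is accepting when it contains position 5.

```python
# Values taken from the worked example for (a|b)*ab#.
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "#"}
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: set()}
FIRSTPOS_ROOT = frozenset({1, 2, 3})

def build_dfa(start, symbol_at, followpos, alphabet="ab"):
    """Return (states, transitions); each state is a frozenset of positions."""
    states, transitions, unmarked = [start], {}, [start]
    while unmarked:
        state = unmarked.pop(0)
        for a in alphabet:
            # Union of followpos(p) over the positions p in the state that carry symbol a.
            target = frozenset().union(*(followpos[p] for p in state if symbol_at[p] == a))
            if not target:
                continue                      # no transition on this symbol
            if target not in states:
                states.append(target)
                unmarked.append(target)
            transitions[(state, a)] = target
    return states, transitions

def accepts(text, start, transitions, end_position=5):
    state = start
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return end_position in state              # accepting: contains the position of #

STATES, TRANSITIONS = build_dfa(FIRSTPOS_ROOT, SYMBOL_AT, FOLLOWPOS)
for (state, a), target in TRANSITIONS.items():
    print(sorted(state), a, sorted(target))
```

For the other expressions in the list, only SYMBOL_AT, FOLLOWPOS, the start set and the end position would need to change.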
RE to NFA (using firstpos, followpos, lastpos)
( a | b ) * a b #
  1   2     3 4 5
FOLLOWPOS (1) = { 1 , 2 , 3 }
FOLLOWPOS (2) = { 1 , 2 , 3 }
FOLLOWPOS (3) = { 4 }
FOLLOWPOS (4) = { 5 }
FOLLOWPOS (5) = { }
Steps
Draw a state for each position number
Find the follow of 1: it contains 1 , 2 , 3, so draw transitions from 1
1 → 1 (1 means “a” so label with “a”)
1 → 2 (2 means “b” so label with “b”)
1 → 3 (3 means “a” so label with “a”)
Do the same for all states
Take 5 as final state
Take firstpos(root) = { 1 , 2 , 3 } as initial state
We can construct a DFA from this NFA using the subset construction method (covered in …)
[Figure: resulting NFA over states 1 to 5]
END