Lexical Analysis

By:
Trusha R. Patel
Asst. Prof.
CE Dept, CSPIT, CHARUSAT
Role of Lexical Analyzer
1. The main task is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the program
2. Interact with the symbol table: when it discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table
3. Strip out comments and whitespace
4. Generate error messages (lexical errors)
5. Keep track of the line number so that it can associate a line number with each error message
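Interaction of Lexical Analyzer with Syntax Analyzer
[Figure: source program -> Lexical Analyzer (Scanner) -> token -> Syntax Analyzer (Parser) -> to semantic analyzer. The parser asks the scanner for each token with getNextToken. Both analyzers are connected to the symbol table.]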
Separating Lexical Analyzer from Syntax Analyzer
1. Simplicity of design
   e.g. it is complex for the parser to deal with comments and whitespace as syntactic units, so they are removed during lexical analysis
2. Compiler efficiency is improved
   Separation allows specialized techniques that serve only the lexical task, not the parsing job; e.g. specialized input-buffering techniques for reading input characters can speed up the compiler
3. Compiler portability is enhanced
   Input-device-specific peculiarities can be restricted to the lexical analyzer
Token, Pattern, Lexeme
• Token
  - A pair consisting of a token name and an attribute value
  - Token names are generally written in boldface
  - A token is referred to by its name
• Pattern
  - Description of the form that the lexemes of a token may take
• Lexeme
  - Sequence of characters in the source program that matches the pattern for a token

Token        Informal description                      Sample lexemes
if           Characters i , f                          if
else         Characters e , l , s , e                  else
comparison   < or > or <= or >= or == or !=            <= , >=
id           Letter followed by letters and digits     pi , score , D2
number       Any numeric constant                      3.14159 , 6.02e23
literal      Anything but " , surrounded by "          "core dumped"
Attributes for Token
• When more than one lexeme can match a pattern, the lexical analyzer must provide additional information to the subsequent compiler phases
• So the lexical analyzer returns the token name together with an attribute value
• A token has at most one associated attribute, although this attribute may have a structure that combines several pieces of information
• Example: token id
  - Information about an identifier is kept in the symbol table
  - The attribute value for an identifier is a pointer to the symbol-table entry for that identifier
• Example: token names and associated attribute values for
    E = M * C ** 2
    < id , pointer to symbol-table entry for E >
    < assign-op >
    < id , pointer to symbol-table entry for M >
    < mult-op >
    < id , pointer to symbol-table entry for C >
    < exp-op >
    < number , integer value 2 >
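The token stream above can be reproduced with a small sketch. The patterns, the underscore spellings of the token names (assign_op, mult_op, exp_op) and the use of a list index as the "pointer" into the symbol table are assumptions made for illustration, not part of the slides.

```python
import re

# Hypothetical token patterns for this one statement.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("exp_op", r"\*\*"),      # listed before mult_op so "**" is not split into two "*"
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table); each token is (token name, attribute value)."""
    symbol_table = []          # identifier lexemes; the list index plays the role of the pointer
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"lexical error at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "ws":
            continue                                  # whitespace is stripped, not returned
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "number":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))               # operators carry no attribute
    return tokens, symbol_table


tokens, table = tokenize("E = M * C ** 2")
for token in tokens:
    print(token)
```

Operators need no attribute because the token name alone identifies the lexeme; only id and number carry a value.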
Lexical Error
• Suppose fi is encountered for the first time in a C program:
    fi ( a == f(x) ) …
  The lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier
• Since fi is a valid lexeme for the token id, the lexical analyzer returns the token id to the parser, and the parser handles the error
• A lexical error occurs when the lexical analyzer is unable to proceed because none of the token patterns matches any prefix of the remaining input
• The simplest recovery strategy is “panic mode” recovery: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token
• Other possible error-recovery actions are
  - Delete one character from the remaining input
  - Insert a missing character into the remaining input
  - Replace a character by another character
  - Transpose two adjacent characters
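Panic-mode recovery can be sketched as follows. The token patterns and the function name scan_with_panic_mode are hypothetical; the point is only the inner loop that discards characters until some pattern matches again.

```python
import re

# Hypothetical token patterns; anything else (e.g. "@" or "#") is a lexical error.
TOKEN = re.compile(
    r"(?P<id>[A-Za-z_][A-Za-z_0-9]*)|(?P<number>[0-9]+)|(?P<op>[-+*/=;()])|(?P<ws>\s+)"
)


def scan_with_panic_mode(source):
    """Return (tokens, errors); errors holds (position, deleted characters)."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            # Panic mode: delete successive characters until a token can be found.
            start = pos
            while pos < len(source) and TOKEN.match(source, pos) is None:
                pos += 1
            errors.append((start, source[start:pos]))
            continue
        if match.lastgroup != "ws":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens, errors


print(scan_with_panic_mode("a = 3 @# b;"))
```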
Input Buffering
• In C, single-character operators like > , = , < can also be the beginning of two-character operators such as >= , == , <=
• We often have to look one or more characters beyond the next lexeme before we can be sure that we have the correct lexeme
• A two-buffer scheme is introduced that handles large lookaheads
Buffer Pairs
• A large amount of time is taken to process characters, so this buffering technique has been developed to reduce the overhead required to process a single input character
• It involves two buffers that are alternately reloaded; each buffer is of the same size N (usually the size of a disk block)
    [ Buffer 1 : N characters ][ Buffer 2 : N characters ]
• One system read command reads N characters into a buffer, instead of reading one character at a time
• If fewer than N characters remain in the input file, a special character eof marks the end of the source file
• Two pointers are maintained
  - lexemeBegin: marks the beginning of the current lexeme
  - forward: scans ahead until a pattern match is found
• Example: the input  ab = 2 ; d = ab + 3 ;  is loaded N characters at a time
    Buffer 1:  a b = 2 ;
    Buffer 2:  d = a b + 3 ;
  - At the start of a new token, lexemeBegin and forward point to its first character; forward then scans ahead until the match with id is complete
  - forward is moved one position to the left, the id token is generated for ab, and lexemeBegin is set to the position just after the lexeme
  - Tokens are generated in the same way for = , 2 and ;
  - The forward pointer then reaches the end of the first buffer, so the next buffer is loaded
  - The pointers are set in the new buffer and the same process continues there
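A minimal sketch of the alternating reload, assuming a tiny N and a "\0" sentinel standing in for eof. The class and method names are hypothetical. Only the forward pointer is modeled; lexemeBegin and retraction are left out to keep the sketch short.

```python
import io

EOF = "\0"  # sentinel marking end of input; assumed never to occur in source text


class BufferPair:
    def __init__(self, stream, n=8):
        self.stream = stream
        self.n = n
        self.buf = [EOF] * (2 * n)    # Buffer 1 = buf[0:n], Buffer 2 = buf[n:2n]
        self.forward = 0
        self.reads = 0                # number of block reads issued
        self._reload(0)

    def _reload(self, half):
        chunk = self.stream.read(self.n)    # one read command fetches up to N characters
        self.reads += 1
        base = half * self.n
        for i in range(self.n):
            # Positions past the end of the input are filled with the eof sentinel.
            self.buf[base + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        c = self.buf[self.forward]
        if c == EOF:
            return EOF
        self.forward += 1
        if self.forward == self.n:            # end of Buffer 1: reload Buffer 2
            self._reload(1)
        elif self.forward == 2 * self.n:      # end of Buffer 2: reload Buffer 1 and wrap
            self._reload(0)
            self.forward = 0
        return c


def read_all(text, n=8):
    pair = BufferPair(io.StringIO(text), n)
    chars = []
    while (c := pair.next_char()) != EOF:
        chars.append(c)
    return "".join(chars), pair.reads


print(read_all("ab = 2 ; d = ab + 3 ;", n=8))
```

The second value returned by read_all counts block reads, which is the quantity the scheme tries to keep small compared with one read per character.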
Symbol, Alphabet, String, Language
• Symbol
  - It can be a letter, digit, punctuation mark or operator
  - E.g. a-z  A-Z  0-9  ;  =  +  ,  etc.
• Alphabet (Σ)
  - Finite set of symbols
  - E.g. {0,1}  { a , b , c }  etc.
• String
  - A string over an alphabet is a finite sequence of symbols drawn from that alphabet
  - E.g. for alphabet {0,1}, 0101 is a string
  - The empty string is denoted by ϵ
• Language (L)
  - Any countable set of strings over some fixed alphabet
  - E.g. for alphabet {0,1}, the language of all strings of length 2 is { 00 , 01 , 10 , 11 }
Terms for Parts of String
• Prefix
  - String obtained by removing zero or more symbols from the end of the string
  - E.g. all possible prefixes of banana are: ϵ  b  ba  ban  bana  banan  banana
• Suffix
  - String obtained by removing zero or more symbols from the beginning of the string
  - E.g. all possible suffixes of banana are: banana  anana  nana  ana  na  a  ϵ
• Substring
  - Obtained by deleting any prefix and any suffix from the string
  - E.g. some substrings of banana are nan , anan , …
• Proper prefixes, suffixes and substrings
  - A prefix, suffix or substring is called proper if it is not ϵ and not equal to the string itself
• Subsequence
  - String formed by deleting zero or more not necessarily consecutive positions from the string
  - E.g. some subsequences of banana are baan , bn , …
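These definitions translate almost directly into code. The helper names below are hypothetical, and the empty Python string "" stands for ϵ.

```python
def prefixes(s):
    """All prefixes of s, from ϵ up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s):
    """All suffixes of s, from s itself down to ϵ."""
    return [s[i:] for i in range(len(s) + 1)]


def is_proper(part, s):
    """A prefix, suffix or substring is proper if it is neither ϵ nor s."""
    return part != "" and part != s


def is_subsequence(t, s):
    """True if t can be formed by deleting zero or more positions of s."""
    remaining = iter(s)
    # "ch in remaining" consumes the iterator up to the first match, preserving order.
    return all(ch in remaining for ch in t)


print(prefixes("banana"))
print(suffixes("banana"))
print(is_subsequence("baan", "banana"), is_subsequence("nab", "banana"))
```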
Operations on Languages
OPERATION                   DEFINITION & NOTATION
Union of L and M            L∪M = { s | s is in L or s is in M }
Concatenation of L and M    LM = { st | s is in L and t is in M }
Kleene closure of L         L* = L^0 ∪ L^1 ∪ L^2 ∪ … , where L^0 = { ϵ } and L^i = L^(i-1) L
Positive closure of L       L+ = L^1 ∪ L^2 ∪ L^3 ∪ …
Here L and M are two languages
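For finite languages the operations can be tried out with Python sets. Because L* and L+ are infinite in general, the sketch below only enumerates them up to a chosen power k; that bound and all function names are assumptions of the example.

```python
def union(L, M):
    return L | M


def concat(L, M):
    return {s + t for s in L for t in M}


def power(L, i):
    result = {""}                 # L^0 = { ϵ }
    for _ in range(i):
        result = concat(result, L)
    return result


def kleene_closure_upto(L, k):
    """L^0 ∪ L^1 ∪ … ∪ L^k, a finite approximation of L*."""
    out = set()
    for i in range(k + 1):
        out |= power(L, i)
    return out


def positive_closure_upto(L, k):
    """L^1 ∪ … ∪ L^k, a finite approximation of L+."""
    out = set()
    for i in range(1, k + 1):
        out |= power(L, i)
    return out


L, D = {"a", "b"}, {"0", "1"}
print(sorted(union(L, D)))
print(sorted(concat(L, D)))
print(sorted(kleene_closure_upto({"a"}, 3)))
```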
Regular Expression
Notation   Meaning
ϵ          Empty (null) string
| or /     Choice
*          Kleene closure: 0 or more occurrences
+          Positive closure: 1 or more occurrences
?          0 or 1 occurrence
()         Grouping
{}         Used to define a range of occurrences
ø          Null set
U          Union
[]         Character set
^          Complement of a character set
Algebraic laws for Regular Expression
LAW                                DESCRIPTION
r|s = s|r                          | is commutative
r|(s|t) = (r|s)|t                  | is associative
r(st) = (rs)t                      Concatenation is associative
r(s|t) = rs|rt ; (s|t)r = sr|tr    Concatenation distributes over |
ϵr = rϵ = r                        ϵ is the identity for concatenation
r* = ( r | ϵ )*                    ϵ is guaranteed in a closure
r** = r*                           * is idempotent
Regular expression
• RE for identifier
    [ a-z A-Z _ ] [ a-z A-Z _ 0-9 ] *
Regular Definition
• Allows names to be given to certain regular expressions; those names can then be used in subsequent expressions
• A regular definition is a sequence of definitions of the form
    d1 → r1
    d2 → r2
    …
    dn → rn
  where
  1) each di is a new symbol, not in Σ and not the same as any other of the d’s
  2) each ri is a regular expression over the alphabet Σ ∪ { d1 , d2 , … , di-1 }
Regular Definition
• Regular definition for identifier
    letter → A | B | … | Z | a | b | … | z | _
    digit  → 0 | 1 | 2 | … | 9
    id     → letter ( letter | digit )*
• Using shorthands
    letter → [ A-Z a-z _ ]
    digit  → [ 0-9 ]
    id     → letter ( letter | digit )*
• Regular definition for unsigned number
    digit            → 0 | 1 | … | 9
    digits           → digit digit*
    optionalFraction → . digits | ϵ
    optionalExponent → ( E ( + | - | ϵ ) digits ) | ϵ
    number           → digits optionalFraction optionalExponent
• Using shorthands
    digit  → [ 0-9 ]
    digits → digit+
    number → digits ( . digits )? ( E [ + - ]? digits )?
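The shorthand forms of id and number map closely onto Python's re syntax. The sketch below is one possible transcription; the constant names are hypothetical, and the exponent marker is the uppercase E used in the definition above.

```python
import re

LETTER = r"[A-Za-z_]"
DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}+"

# id     -> letter ( letter | digit )*
ID = re.compile(rf"{LETTER}(?:{LETTER}|{DIGIT})*")
# number -> digits ( . digits )? ( E [+-]? digits )?
NUMBER = re.compile(rf"{DIGITS}(?:\.{DIGITS})?(?:E[+-]?{DIGITS})?")

for lexeme in ["score", "D2", "2D", "3.14159", "6.02E23", "3.", ".5"]:
    kind = "id" if ID.fullmatch(lexeme) else "number" if NUMBER.fullmatch(lexeme) else "no match"
    print(lexeme, "->", kind)
```

Building the patterns from named pieces mirrors how a regular definition reuses earlier names in later expressions.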
Transition Diagram
• An intermediate step in the construction of a lexical analyzer is the conversion of patterns into stylized flowcharts called “transition diagrams”
• A transition diagram has a collection of nodes or circles called “states”
  - Each state represents a condition that could occur during the process of scanning the input, looking for a lexeme that matches one of several patterns
• “Edges” are directed from one state to another and are labeled by a symbol or set of symbols
  - If the current state is “s” and the next input symbol is “a”, look for an edge out of “s” labeled “a”; if one is found, advance the forward pointer in the buffer and enter the state to which that edge leads
  - All transition diagrams are assumed to be deterministic, meaning there is never more than one edge out of a given state with a given symbol among its labels
• The “start state” or “initial state” is indicated by an edge labeled “start”
  - The transition diagram always begins in the start state
• Certain states are called “accepting states” or “final states”; they indicate that a lexeme has been found and are drawn as a double circle
  - If any action is to be taken, that action is attached to the accepting state
• If it is necessary to retract the forward pointer (because the token could only be decided after reading one or more characters beyond the lexeme), a * is additionally placed near the accepting state
Transition Diagram
LEXEME        TOKEN NAME    ATTRIBUTE VALUE
<             relop         LT
<=            relop         LE
=             relop         EQ
<>            relop         NE
>             relop         GT
>=            relop         GE
if            if            -
then          then          -
Any id        id            Pointer to table entry
Any number    number        Pointer to table entry
Any ws        -             -
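Transition Diagram (Recognition of relational operators)
[Figure, described as a list of transitions]
  state 0 (start):             on <  go to state 1 ; on =  go to state 5 ; on >  go to state 6
  state 1:                     on =  go to state 2 ; on >  go to state 3 ; on other  go to state 4
  state 2 (accepting):         return ( relop , LE )
  state 3 (accepting):         return ( relop , NE )
  state 4 (accepting, with *): return ( relop , LT )
  state 5 (accepting):         return ( relop , EQ )
  state 6:                     on =  go to state 7 ; on other  go to state 8
  state 7 (accepting):         return ( relop , GE )
  state 8 (accepting, with *): return ( relop , GT )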
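The relop diagram can be simulated by a loop that switches on the current state. This is a sketch; the function name get_relop and its return convention are assumptions. The accepting states are folded into the returns, and the two states marked * retract forward by one.

```python
def get_relop(text, pos=0):
    """Return ((token name, attribute), position after the lexeme), or None if no relop starts at pos."""
    state = 0
    forward = pos
    while True:
        c = text[forward] if forward < len(text) else ""   # "" acts as "other" at end of input
        forward += 1
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("relop", "EQ"), forward            # state 5
            elif c == ">":
                state = 6
            else:
                return None                                # not a relational operator
        elif state == 1:
            if c == "=":
                return ("relop", "LE"), forward            # state 2
            if c == ">":
                return ("relop", "NE"), forward            # state 3
            return ("relop", "LT"), forward - 1            # state 4: retract one character
        elif state == 6:
            if c == "=":
                return ("relop", "GE"), forward            # state 7
            return ("relop", "GT"), forward - 1            # state 8: retract one character


for sample in ["<=", "<>", "<", ">=", ">x", "="]:
    print(repr(sample), get_relop(sample))
```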
Transition Diagram (Recognition of identifiers and keywords)
[Figure, described as a list of transitions]
  state 9 (start):              on letter  go to state 10
  state 10:                     on letter or digit  stay in state 10 ; on other  go to state 11
  state 11 (accepting, with *): return ( getToken() , installID() )
• The problem is to differentiate keywords from identifiers
• Two solutions
  1. Install the keywords in the symbol table initially
     - One field of the symbol-table entry indicates whether it is a keyword or an identifier
     - When the pattern matches, the lexeme is looked up in the symbol table
     - If it is found, a pointer to its symbol-table entry is returned
     - If not, installID enters that identifier in the symbol table and returns a pointer to the new entry
     - getToken returns the token name recorded in that entry
  2. Create a separate transition diagram for each keyword
     - The edges show the successive letters of the keyword
     - The last edge tests for “nonletter-or-digit”
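The first solution can be sketched with a table that is preloaded with the keywords. The keyword set, the entry layout and the Python spellings install_id and get_token are assumptions made for illustration.

```python
KEYWORDS = ("if", "then", "else")    # hypothetical keyword set


class SymbolTable:
    def __init__(self):
        self.entries = []            # each entry: {"lexeme": ..., "token": ...}
        self.index = {}              # lexeme -> position in entries (the "pointer")
        for keyword in KEYWORDS:     # keywords are installed before scanning starts
            self._add(keyword, keyword)

    def _add(self, lexeme, token):
        self.index[lexeme] = len(self.entries)
        self.entries.append({"lexeme": lexeme, "token": token})
        return self.index[lexeme]

    def install_id(self, lexeme):
        """Return the pointer for lexeme, entering it as an id if it is new."""
        if lexeme in self.index:
            return self.index[lexeme]
        return self._add(lexeme, "id")

    def get_token(self, pointer):
        """Return the token name recorded in the entry: a keyword name or id."""
        return self.entries[pointer]["token"]


def classify(table, lexeme):
    pointer = table.install_id(lexeme)
    return table.get_token(pointer), pointer


table = SymbolTable()
for word in ["if", "score", "then", "score"]:
    print(word, classify(table, word))
```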
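Transition Diagram (Recognition of unsigned numbers)
[Figure: from the start state, one or more digits are read; an optional fraction follows as “.” and one or more digits; an optional exponent follows as E, an optional + or -, and one or more digits. From each of the three digit-reading states an edge labeled “other” leads to an accepting state marked *.]
Transition Diagram (Recognition of whitespace)
[Figure: from the start state an edge labeled delim leads to a state that loops on delim; from there an edge labeled “other” leads to an accepting state marked *.]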
Lexical Analyzer Generator LEX
[Figure: three steps]
  Lex source file (.l)  ->  Lex compiler  ->  lex.yy.c
  lex.yy.c              ->  C compiler    ->  a.out
  Input stream          ->  a.out         ->  Sequence of tokens
Structure of Lex Program
    declarations
    %%
    translation rules
    %%
    auxiliary functions
• declarations: declarations of variables, constants and regular definitions
• translation rules: each of the form  Pattern { Action }  where
  - Pattern is a regular expression, which may use the regular definitions from the declarations section
  - Action is a fragment of code written in C
• auxiliary functions: hold additional functions used in the actions
Finite Automata
• They are recognizers
  - They simply say “yes” or “no” about each possible input string
• Two flavors
  - NFA (Nondeterministic Finite Automata)
      A symbol can label several edges out of the same state, and ϵ is also a possible label
  - DFA (Deterministic Finite Automata)
      For each state and each symbol, exactly one edge with that symbol leaves that state
NFA
• Consists of
    S                      Finite set of states
    Σ                      Set of input symbols, called the input alphabet
    Transition function    Gives, for each state and each input symbol (or ϵ), a set of next states
    s0                     Start state or initial state, from S
    F                      Set of accepting states, a subset of S
DFA
• Special case of NFA
  - No moves on input ϵ
  - For each state “s” and input symbol “a”, there is exactly one edge out of “s” labeled “a”
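RE to NFA (using Thompson Construction)
• McNaughton-Yamada-Thompson algorithm
• Works recursively
• For each subexpression it constructs an NFA with a single accepting state
[Figures: the NFA for a single symbol s; the NFA for s|t, in which a new start state has ϵ-edges into N(s) and N(t) and both lead by ϵ-edges to a new accepting state; the NFA for st, in which N(s) is followed by N(t); the NFA for s*, in which ϵ-edges allow N(s) to be skipped or repeated]
[Figure: NFA for (a|b)*ab obtained with this construction]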
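The recursive construction can be sketched over a small expression tree. Everything named here (the NFA class, the tuple encoding of the tree, the "" label for ϵ) is an assumption of the sketch. For concatenation it links N(s) to N(t) with an ϵ-edge instead of merging the two states, a common simplification of the textbook construction.

```python
import itertools

EPS = ""  # label used for ϵ-edges


class NFA:
    def __init__(self):
        self.delta = {}                   # (state, symbol) -> set of next states
        self._ids = itertools.count()

    def new_state(self):
        return next(self._ids)

    def edge(self, src, symbol, dst):
        self.delta.setdefault((src, symbol), set()).add(dst)

    def build(self, node):
        """node is ("sym", c), ("or", l, r), ("cat", l, r) or ("star", c); returns (start, accept)."""
        kind = node[0]
        if kind == "sym":
            start, accept = self.new_state(), self.new_state()
            self.edge(start, node[1], accept)
            return start, accept
        if kind == "cat":
            s1, f1 = self.build(node[1])
            s2, f2 = self.build(node[2])
            self.edge(f1, EPS, s2)        # ϵ-link variant instead of merging f1 and s2
            return s1, f2
        if kind == "or":
            s1, f1 = self.build(node[1])
            s2, f2 = self.build(node[2])
            start, accept = self.new_state(), self.new_state()
            for s in (s1, s2):
                self.edge(start, EPS, s)
            for f in (f1, f2):
                self.edge(f, EPS, accept)
            return start, accept
        if kind == "star":
            s1, f1 = self.build(node[1])
            start, accept = self.new_state(), self.new_state()
            for dst in (s1, accept):      # enter or skip N(s); repeat or leave N(s)
                self.edge(start, EPS, dst)
                self.edge(f1, EPS, dst)
            return start, accept
        raise ValueError(kind)

    def closure(self, states):
        stack, seen = list(states), set(states)
        while stack:
            for t in self.delta.get((stack.pop(), EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def accepts(self, start, accept, text):
        current = self.closure({start})
        for ch in text:
            moved = set()
            for s in current:
                moved |= self.delta.get((s, ch), set())
            current = self.closure(moved)
        return accept in current


# (a|b)*ab
tree = ("cat", ("cat", ("star", ("or", ("sym", "a"), ("sym", "b"))), ("sym", "a")), ("sym", "b"))
nfa = NFA()
start, accept = nfa.build(tree)
print([w for w in ["ab", "bbab", "abb", "a", ""] if nfa.accepts(start, accept, w)])
```

Each call to build returns exactly one start state and one accepting state, which is the invariant the construction relies on when combining sub-automata.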
RE to NFA
• More examples
  - ( a | b )*
  - a* | b*
  - ( a* | b )* c a*
  - ( a | b )* c* d
  - a* b a
  - a+ b* ( c | d )
  - a+ b ( c | d ) a* b
  - ( ( a* b* ) c )* a*
  - ( a | b )* a b b
  - a a* | b b*
RE to DFA
• Steps
  1. Add # at the end of the RE
  2. Construct the syntax tree of the RE
  3. Give numbers (starting with 1) to all the leaves, i.e. to every occurrence of an alphabet symbol and to #
  4. Find FIRSTPOS for every node in the constructed tree
  5. Find FOLLOWPOS for every position in the constructed tree
  6. Take the FIRSTPOS set of the topmost (root) node as state A
  7. Find the next state for every possible input symbol from A
  8. If a new set appears, give it a new name (i.e. B, C, D, …)
  9. Continue until no new set appears
  10. Make the transition table
  11. Construct the DFA from the transition table
  12. Remove the transitions on #
  13. Change the final states accordingly
  14. If transitions are missing from any state for any symbol, create a dead-end state and connect the missing transitions to it
nullable( ) , firstpos( ) , followpos( )
• nullable(n)
  - If “n” is a leaf node labeled ϵ, then nullable(n) is TRUE
  - If “n” is a leaf node labeled with a position i, then nullable(n) is FALSE
• firstpos(n)
  - Set of positions that can come first in a string generated by the subexpression rooted at “n”
• followpos(p)
  - Set of positions that can immediately follow position “p”

NODE n                    nullable(n)                      firstpos(n)
A leaf labeled ϵ          true                             Ø
A leaf with position i    false                            {i}
An or-node n = c1 | c2    nullable(c1) or nullable(c2)     firstpos(c1) ∪ firstpos(c2)
A cat-node n = c1 c2      nullable(c1) and nullable(c2)    if (nullable(c1)) firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
A star-node n = c1*       true                             firstpos(c1)
FIRSTPOS for each kind of node (node diagrams omitted)
• Leaf n1:                   FIRSTPOS (n1) = { n1 }
• Star node * over n1:       FIRSTPOS ( * ) = FIRSTPOS (n1)
• Or node / over n1 , n2:    FIRSTPOS ( / ) = FIRSTPOS (n1) U FIRSTPOS (n2)
• Cat node . over n1 , n2:   FIRSTPOS ( . ) = FIRSTPOS (n1) U FIRSTPOS (n2) if n1 is nullable
                             FIRSTPOS ( . ) = FIRSTPOS (n1) if n1 is not nullable
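The rules can be written as recursive functions over a syntax tree. The tuple encoding of nodes is an assumption of this sketch. The table above gives only nullable and firstpos; the lastpos and followpos rules used here are the standard companion rules (lastpos mirrors firstpos, and followpos is filled in at cat-nodes and star-nodes).

```python
# Node encoding (hypothetical): ("eps",), ("leaf", symbol, position),
# ("or", c1, c2), ("cat", c1, c2), ("star", c1)

def nullable(n):
    kind = n[0]
    if kind == "leaf":
        return False
    if kind == "or":
        return nullable(n[1]) or nullable(n[2])
    if kind == "cat":
        return nullable(n[1]) and nullable(n[2])
    return True                                   # "eps" and "star"


def firstpos(n):
    kind = n[0]
    if kind == "eps":
        return set()
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return firstpos(n[1]) | firstpos(n[2])
    if kind == "cat":
        return firstpos(n[1]) | firstpos(n[2]) if nullable(n[1]) else firstpos(n[1])
    return firstpos(n[1])                         # "star"


def lastpos(n):
    kind = n[0]
    if kind == "eps":
        return set()
    if kind == "leaf":
        return {n[2]}
    if kind == "or":
        return lastpos(n[1]) | lastpos(n[2])
    if kind == "cat":
        return lastpos(n[1]) | lastpos(n[2]) if nullable(n[2]) else lastpos(n[2])
    return lastpos(n[1])                          # "star"


def followpos(n, table=None):
    """Return {position: set of positions that can follow it}."""
    table = {} if table is None else table
    kind = n[0]
    if kind == "leaf":
        table.setdefault(n[2], set())
    elif kind in ("or", "cat"):
        followpos(n[1], table)
        followpos(n[2], table)
        if kind == "cat":                         # lastpos(c1) is followed by firstpos(c2)
            for p in lastpos(n[1]):
                table[p] |= firstpos(n[2])
    elif kind == "star":                          # lastpos(c1) is followed by firstpos(c1)
        followpos(n[1], table)
        for p in lastpos(n[1]):
            table[p] |= firstpos(n[1])
    return table


# ( a | b ) * a b #  with positions 1 2 3 4 5
star = ("star", ("or", ("leaf", "a", 1), ("leaf", "b", 2)))
root = ("cat", ("cat", ("cat", star, ("leaf", "a", 3)), ("leaf", "b", 4)), ("leaf", "#", 5))
print(firstpos(root))
print(followpos(root))
```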
RE to DFA (using firstpos, followpos, lastpos)
• RE with the end marker and numbered positions
    ( a | b ) * a b #
      1   2     3 4 5
• Syntax tree with the FIRSTPOS of each node
    {1,2,3} .
      {1,2,3} .
        {1,2,3} .
          {1,2} *
            {1,2} /
              {1} a
              {2} b
          {3} a
        {4} b
      {5} #
• FOLLOWPOS
    FOLLOWPOS (1) = { 1 , 2 , 3 }
    FOLLOWPOS (2) = { 1 , 2 , 3 }
    FOLLOWPOS (3) = { 4 }
    FOLLOWPOS (4) = { 5 }
    FOLLOWPOS (5) = { } = ø
• State A = FIRSTPOS of the root = { 1 , 2 , 3 }
    i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
    i/p b = FOLLOWPOS (2) = { 1 , 2 , 3 } = A
    i/p # = ---
• State B = { 1 , 2 , 3 , 4 }
    i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
    i/p b = FOLLOWPOS (2) U FOLLOWPOS (4) = { 1 , 2 , 3 , 5 } = C
    i/p # = ---
• State C = { 1 , 2 , 3 , 5 }
    i/p a = FOLLOWPOS (1) U FOLLOWPOS (3) = { 1 , 2 , 3 , 4 } = B
    i/p b = FOLLOWPOS (2) = { 1 , 2 , 3 } = A
    i/p # = FOLLOWPOS (5) = { } = D
• State D = { }
    i/p a = ---
    i/p b = ---
    i/p # = ---
• Transition table
    STATE                    a    b    #
    A  { 1 , 2 , 3 }         B    A    -
    B  { 1 , 2 , 3 , 4 }     B    C    -
    C  { 1 , 2 , 3 , 5 }     B    A    D
    D  { }                   -    -    -
• DFA
  - Initial state: A
  - Transitions: A on a to B ; A on b to A ; B on a to B ; B on b to C ; C on a to B ; C on b to A ; C on # to D
  - Final state: C (after the transition on # is removed, the state that contains position 5 becomes final)
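The state-by-state expansion above can be sketched as a worklist loop. The FOLLOWPOS table and the position-to-symbol map are copied from this example; the function name build_dfa is hypothetical, and the # transition is left out of the alphabet so the removal of # transitions is already folded in.

```python
from collections import deque

SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "#"}
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: set()}
START = frozenset({1, 2, 3})          # FIRSTPOS of the root


def build_dfa(start, symbol_at, followpos, alphabet=("a", "b")):
    """Return (names, table, finals): set -> name, (name, symbol) -> name, final names."""
    names = {start: "A"}
    table = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for symbol in alphabet:
            # Union of FOLLOWPOS(p) over the positions p in this state that carry the symbol.
            target = frozenset().union(*(followpos[p] for p in state if symbol_at[p] == symbol))
            if not target:
                continue              # missing transition; it would go to a dead-end state
            if target not in names:
                names[target] = chr(ord("A") + len(names))
                queue.append(target)
            table[(names[state], symbol)] = names[target]
    end_position = next(p for p, s in symbol_at.items() if s == "#")
    finals = {name for positions, name in names.items() if end_position in positions}
    return names, table, finals


names, table, finals = build_dfa(START, SYMBOL_AT, FOLLOWPOS)
for key in sorted(table):
    print(key, "->", table[key])
print("final states:", finals)
```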
RE to DFA
• More examples
  - ( a* | b* ) c* a
  - ( ( a* b a ) | ( b* a ) ) ( a b )*
  - ( a | b )* a ( a | b )
  - ( ( a* b* ) c )* a*
RE to NFA (using firstpos, followpos, lastpos)
• RE with the end marker and numbered positions
    ( a | b ) * a b #
      1   2     3 4 5
• Syntax tree with the FIRSTPOS of each node
    {1,2,3} .
      {1,2,3} .
        {1,2,3} .
          {1,2} *
            {1,2} /
              {1} a
              {2} b
          {3} a
        {4} b
      {5} #
• Find FOLLOWPOS for all numbers
    FOLLOWPOS (1) = { 1 , 2 , 3 }
    FOLLOWPOS (2) = { 1 , 2 , 3 }
    FOLLOWPOS (3) = { 4 }
    FOLLOWPOS (4) = { 5 }
    FOLLOWPOS (5) = { }
• Construction
  - Draw the initial state, and one state for each number 1 to 5
  - firstpos(root) = { 1 , 2 , 3 }; it contains 1 , 2 , 3, so draw transitions from the initial state to
      1 (1 means “a” so label with “a”)
      2 (2 means “b” so label with “b”)
      3 (3 means “a” so label with “a”)
  - Do the same for all states, using the FOLLOWPOS of each number
  - Take 5 as the final state
• We can construct a DFA from this NFA using the subset construction method (covered in …)
END