
Lexical Analysis


Lexical Analysis and Design of Lexical Analyzer


Lexical Analysis
• The input is scanned completely to identify the tokens
• Tokens (logical units)
– Identifiers, keywords, operators, etc.

Specification of Tokens
– Strings and Languages
• A finite sequence of symbols is called a string
• A set of strings over some alphabet is called a language
– Operations on Languages
• Concatenation:
– L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• Union
– L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
• Kleene Closure
– L* = ⋃_{i=0}^{∞} L^i
• Positive Closure
– L+ = ⋃_{i=1}^{∞} L^i
– Regular Expressions

Regular Expression
• Notation for representing Tokens
• Ex: Identifiers in Pascal
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter ( letter | digit )*

The Reason Why Lexical Analysis is a Separate Phase
• Simplifies the design of the compiler
– Without a separate scanner, LL(1) or LR(1) parsing with one token of
lookahead would not be possible, because a single token may take
several characters to match
• Provides efficient implementation
– Systematic techniques to implement lexical analyzers
by hand or automatically from specifications
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character
encodings can be normalized (e.g. trigraphs)

Interaction of the Lexical Analyzer with the Parser
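[Figure: the lexical analyzer reads the source program and returns a (token, tokenval) pair to the parser each time the parser issues “get next token”. Both the lexical analyzer and the parser report errors, and both access the symbol table.]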

Attributes of Tokens

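Input to the lexical analyzer:

y := 31 + 28*x

Token stream passed to the parser:

<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>

In each pair, the first component is the token and the second is tokenval, the token attribute.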

Tokens, Patterns, and Lexemes


• A token is a classification of lexical units
– For example: id and num
• Lexemes are the specific character strings that
make up a token
– For example: abc and 123
• Patterns are rules describing the set of lexemes
belonging to a token
– For example: “letter followed by letters and digits” and
“non-empty sequence of digits”
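
The distinction between tokens, patterns, and lexemes can be illustrated with a small C sketch. It assumes the two patterns quoted above and ASCII input; the names `token_class`, `matches_id`, `matches_num`, and `classify` are hypothetical and do not come from the slides.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token classes for this sketch (not from the slides). */
enum token_class { TOK_NONE, TOK_ID, TOK_NUM };

/* Pattern for id: a letter followed by letters and digits. */
static int matches_id(const char *s)
{
    if (!isalpha((unsigned char)s[0]))
        return 0;
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Pattern for num: a non-empty sequence of digits. */
static int matches_num(const char *s)
{
    if (s[0] == '\0')
        return 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        if (!isdigit((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Map a lexeme to the token whose pattern it matches. */
enum token_class classify(const char *lexeme)
{
    if (matches_id(lexeme))
        return TOK_ID;
    if (matches_num(lexeme))
        return TOK_NUM;
    return TOK_NONE;
}
```

Under these assumptions, the lexeme abc is classified as id and the lexeme 123 as num, while a string such as 1x matches neither pattern.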

Specification of Patterns for Tokens: Definitions
• An alphabet Σ is a finite set of symbols
(characters)
• A string s is a finite sequence of symbols
from Σ
– |s| denotes the length of string s
– ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over
some fixed alphabet Σ

Specification of Patterns for Tokens: String Operations
• The concatenation of two strings x and y is
denoted by xy
• The exponentiation of a string s is defined
by

s^0 = ε
s^i = s^{i-1} s for i > 0

Note that sε = εs = s

Specification of Patterns for Tokens: Language Operations
• Union
L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
L^0 = {ε}; L^i = L^{i-1} L
• Kleene closure
L* = ⋃_{i=0}^{∞} L^i
• Positive closure
L+ = ⋃_{i=1}^{∞} L^i

Specification of Patterns for Tokens: Regular Expressions
• Basis symbols:
– ε is a regular expression denoting the language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
– r|s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is
called a regular set

Specification of Patterns for Tokens: Regular Definitions
• Regular definitions introduce a naming
convention:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over
Σ ∪ {d1, d2, …, d_{i-1}}
• Any dj in ri can be textually substituted in ri to
obtain an equivalent set of definitions

Specification of Patterns for Tokens: Regular Definitions
• Example:

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
• Regular definitions are not recursive:

digits → digit digits | digit     wrong!


Specification of Patterns for Tokens: Notational Shorthand
• The following shorthands are often used:

r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z

• Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
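
As an illustration of how the shorthand reads, the num pattern can be matched by hand in C. This is a minimal sketch, not how Lex implements it; the names `skip_digits` and `match_num` are hypothetical, and the function tests whether a whole string matches, whereas a scanner would look for the longest matching prefix.

```c
#include <ctype.h>
#include <stddef.h>

/* Skip digit* starting at s[i]; return the index just past the digits. */
static size_t skip_digits(const char *s, size_t i)
{
    while (isdigit((unsigned char)s[i]))
        i++;
    return i;
}

/* Return 1 if the whole string s matches
   num -> digit+ (. digit+)? ( E (+|-)? digit+ )?   and 0 otherwise. */
int match_num(const char *s)
{
    size_t i = skip_digits(s, 0);
    if (i == 0)
        return 0;                   /* digit+ needs at least one digit */

    if (s[i] == '.') {              /* optional fraction: . digit+ */
        size_t j = skip_digits(s, i + 1);
        if (j == i + 1)
            return 0;
        i = j;
    }

    if (s[i] == 'E') {              /* optional exponent: E (+|-)? digit+ */
        size_t k = i + 1;
        if (s[k] == '+' || s[k] == '-')
            k++;
        size_t j = skip_digits(s, k);
        if (j == k)
            return 0;
        i = j;
    }

    return s[i] == '\0';            /* nothing may follow the match */
}
```

Each `?` in the pattern becomes an `if` that is allowed to be skipped, and each `digit+` becomes a call to `skip_digits` followed by a check that at least one digit was consumed.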

Regular Definitions and Grammars
Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt

expr → term relop term
     | term
term → id
     | num

Regular definitions
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?

Coding Regular Definitions in Transition Diagrams
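relop → < | <= | <> | > | >= | =

Transition diagram for relop (start state 0; * marks a state that retracts the input by one character):

State  Input  Next  Action at next state
0      <      1
1      =      2     return(relop, LE)
1      >      3     return(relop, NE)
1      other  4 *   return(relop, LT)
0      =      5     return(relop, EQ)
0      >      6
6      =      7     return(relop, GE)
6      other  8 *   return(relop, GT)

id → letter ( letter | digit )*

Transition diagram for id (start state 9):

State  Input            Next  Action at next state
9      letter           10
10     letter or digit  10
10     other            11 *  return(gettoken(), install_id())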

Coding Regular Definitions in Transition Diagrams: Code
token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1:
      …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
    …
    }
  }
}

/* Decides the next start state to check */
int fail()
{ forward = token_beginning;
  switch (start) {
  case 0: start = 9; break;
  case 9: start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover(); break;
  default: /* error */ break;
  }
  return start;
}
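
The slide code relies on helpers (nextchar, fail, recover) that are not shown. As a self-contained illustration, the relop diagram alone can be coded as follows. This is a sketch under assumptions: the function name `scan_relop`, the `relop_attr` enum, and the convention of reporting the lexeme length through `len` are hypothetical, and the input is assumed to be a NUL-terminated string.

```c
#include <stddef.h>

/* Attribute values for relop, named after the diagram's return actions. */
enum relop_attr { NONE, LT, LE, EQ, NE, GT, GE };

/* Scan one relational operator at the start of s.
   Returns its attribute and stores the lexeme length in *len;
   returns NONE with *len == 0 when no relop starts at s. */
enum relop_attr scan_relop(const char *s, size_t *len)
{
    int state = 0;
    size_t forward = 0;             /* index of the next unread character */

    for (;;) {
        switch (state) {
        case 0: {
            char c = s[forward++];
            if (c == '<')      state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else { *len = 0; return NONE; }
            break;
        }
        case 1: {
            char c = s[forward++];
            if (c == '=')      state = 2;
            else if (c == '>') state = 3;
            else               state = 4;
            break;
        }
        case 2: *len = forward; return LE;
        case 3: *len = forward; return NE;
        case 4: forward--; *len = forward; return LT;   /* retract (*) */
        case 5: *len = forward; return EQ;
        case 6: {
            char c = s[forward++];
            state = (c == '=') ? 7 : 8;
            break;
        }
        case 7: *len = forward; return GE;
        case 8: forward--; *len = forward; return GT;   /* retract (*) */
        }
    }
}
```

States 4 and 8 are reached on a character that is not part of the operator, so they step `forward` back by one before returning; this corresponds to the states that retract the input in the diagram.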

The Lex and Flex Scanner Generators
• Lex and its newer cousin flex are scanner
generators
• Systematically translate regular definitions
into C source code for efficient scanning
• Generated code is easy to integrate in C
applications

Creating a Lexical Analyzer with Lex and Flex
lex source program (lex.l) → lex or flex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens

Design of a Lexical Analyzer Generator
• Translate regular expressions to NFA
• Translate NFA to an efficient DFA

regular expressions → NFA → DFA (the NFA-to-DFA step is optional)

Tokens can be recognized either by simulating the NFA or by simulating the DFA.

Nondeterministic Finite Automata
• An NFA is a 5-tuple (S, Σ, δ, s0, F) where

S is a finite set of states
Σ is a finite set of symbols, the alphabet
δ is a mapping from S × Σ to a set of states
s0 ∈ S is the start state
F ⊆ S is the set of accepting (or final) states

Transition Graph
• An NFA can be diagrammatically
represented by a labeled directed graph
called a transition graph

Example transition graph (start state 0):
0 -a-> 0, 0 -b-> 0, 0 -a-> 1, 1 -b-> 2, 2 -b-> 3

S = {0,1,2,3}
Σ = {a,b}
s0 = 0
F = {3}

Transition Table
• The mapping δ of an NFA can be
represented in a transition table

δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}

State   Input a   Input b
0       {0, 1}    {0}
1                 {2}
2                 {3}

The Language Defined by an NFA
• An NFA accepts an input string x if and only if
there is some path with edges labeled with
symbols from x in sequence from the start state to
some accepting state in the transition graph
• A state transition from one state to another on the
path is called a move
• The language defined by an NFA is the set of
input strings it accepts, such as (a|b)*abb for the
example NFA
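
The acceptance condition can be checked mechanically by tracking the set of states reachable after each input symbol. The following C sketch encodes the example NFA's transition table; the bit-mask representation of state sets and the names `delta` and `nfa_accepts` are assumptions made for illustration, not part of the slides.

```c
/* Example NFA: S = {0,1,2,3}, alphabet {a,b}, s0 = 0, F = {3}.
   A set of states is a bit mask: bit i is set when state i is in the set. */
enum { NSTATES = 4, NSYMS = 2 };

static const unsigned delta[NSTATES][NSYMS] = {
    /* state 0 */ { 1u << 0 | 1u << 1, 1u << 0 },  /* a: {0,1}  b: {0} */
    /* state 1 */ { 0,                 1u << 2 },  /* a: {}     b: {2} */
    /* state 2 */ { 0,                 1u << 3 },  /* a: {}     b: {3} */
    /* state 3 */ { 0,                 0       },  /* no outgoing edges */
};

/* Return 1 if some path labeled x leads from state 0 to state 3. */
int nfa_accepts(const char *x)
{
    unsigned current = 1u << 0;                    /* start in {s0} */
    for (; *x != '\0'; x++) {
        if (*x != 'a' && *x != 'b')
            return 0;                              /* symbol outside the alphabet */
        int sym = (*x == 'b');                     /* column 0 = a, column 1 = b */
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (current & (1u << s))
                next |= delta[s][sym];             /* follow every possible move */
        current = next;
    }
    return (current & (1u << 3)) != 0;             /* is an accepting state reachable? */
}
```

Because the set `current` holds every state reachable on the input read so far, the string is accepted exactly when that set contains state 3 at the end, which matches the "some path" wording of the definition.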

Design of a Lexical Analyzer Generator: RE to NFA to DFA
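Lex specification with regular expressions:
p1 { action1 }
p2 { action2 }
…
pn { actionn }

NFA: a new start state s0 has an ε-transition to each of N(p1), N(p2), …, N(pn); the accepting state of each N(pi) is associated with actioni.

The subset construction then converts this NFA into a DFA.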

From Regular Expression to NFA (Thompson’s Construction)

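ε: start state i with a single ε-edge to the accepting state f
a: start state i with a single edge labeled a to f
r1|r2: i has ε-edges to the start states of N(r1) and N(r2), whose accepting states have ε-edges to f
r1r2: N(r1) starts at i and is followed directly by N(r2), which ends at f
r*: i has an ε-edge into N(r) and N(r) has an ε-edge to f; an ε-edge from the end of N(r) back to its start allows repetition, and an ε-edge from i to f allows zero occurrences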


Combining the NFAs of a Set of Regular Expressions
a { action1 }: 1 -a-> 2
abb { action2 }: 3 -a-> 4 -b-> 5 -b-> 6
a*b+ { action3 }: 7 -a-> 7, 7 -b-> 8, 8 -b-> 8

Combined NFA: a new start state 0 has ε-transitions to states 1, 3, and 7; the three NFAs are otherwise unchanged.

Simulating the Combined NFA Example 1
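Combined NFA: 0 -ε-> 1, 3, 7; accepting states 2 (action1), 6 (action2), 8 (action3)

Input: a a b a

{0,1,3,7} -a-> {2,4,7} -a-> {7} -b-> {8} -a-> none

The last non-empty set {8} contains accepting state 8, so action3 is executed.

Must find the longest match:
Continue until no further moves are possible
When last state is accepting: execute action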

Simulating the Combined NFA Example 2
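Combined NFA: 0 -ε-> 1, 3, 7; accepting states 2 (action1), 6 (action2), 8 (action3)

Input: a b b a

{0,1,3,7} -a-> {2,4,7} -b-> {5,8} -b-> {6,8} -a-> none

The last non-empty set {6,8} contains accepting states 6 (action2) and 8 (action3).

When two or more accepting states are reached, the
first action given in the Lex specification is executed, here action2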

Deterministic Finite Automata


• A deterministic finite automaton is a special case
of an NFA
– No state has an ε-transition
– For each state s and input symbol a there is at most one
edge labeled a leaving s
• Each entry in the transition table is a single state
– At most one path exists to accept a string
– Simulation algorithm is simple

Example DFA

A DFA that accepts (a|b)*abb

State   a   b
0       1   0
1       1   2
2       1   3
3       1   0

Start state 0; accepting state 3.
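
Because every table entry of a DFA is a single state, simulation reduces to one table lookup per input character. The sketch below assumes the standard four-state DFA for (a|b)*abb with start state 0 and accepting state 3; the names `dtran` and `dfa_accepts` are hypothetical.

```c
/* Transition table of a DFA for (a|b)*abb: dtran[state][symbol],
   with symbol 0 = a and symbol 1 = b. State 3 is the only accepting state. */
static const int dtran[4][2] = {
    /* state 0 */ { 1, 0 },
    /* state 1 */ { 1, 2 },
    /* state 2 */ { 1, 3 },
    /* state 3 */ { 1, 0 },
};

/* Return 1 if the DFA accepts x, 0 otherwise. */
int dfa_accepts(const char *x)
{
    int state = 0;                          /* start state */
    for (; *x != '\0'; x++) {
        if (*x != 'a' && *x != 'b')
            return 0;                       /* symbol outside the alphabet */
        state = dtran[state][*x == 'b'];    /* exactly one successor state */
    }
    return state == 3;
}
```

Compared with simulating an NFA, there is no set of states to maintain: the single variable `state` is the whole configuration.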

Conversion of an NFA into a DFA
• The subset construction algorithm converts an
NFA into a DFA using:
ε-closure(s) = {s} ∪ {t | s -ε-> … -ε-> t}
ε-closure(T) = ⋃_{s∈T} ε-closure(s)
move(T,a) = {t | s -a-> t and s ∈ T}
• The algorithm produces:
Dstates is the set of states of the new DFA
consisting of sets of states of the NFA
Dtran is the transition table of the new DFA
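
A compact C sketch of the subset construction is given below, applied to the combined NFA for a, abb, and a*b+ (states 0 to 8). The bit-mask encoding of state sets, the fixed bound `MAX_DSTATES`, and the names `nfa_eps`, `nfa_delta`, `eps_closure`, `nfa_move`, and `subset_construction` are assumptions made for this example; only Dstates and Dtran come from the slides.

```c
enum { NFA_STATES = 9, NSYMS = 2, MAX_DSTATES = 32 };

/* Combined NFA as bit masks (bit i = NFA state i).
   nfa_eps[s] is the set reached from s by one epsilon edge;
   nfa_delta[s][0] and nfa_delta[s][1] are the moves on a and b. */
static const unsigned nfa_eps[NFA_STATES] = {
    [0] = 1u << 1 | 1u << 3 | 1u << 7,
};
static const unsigned nfa_delta[NFA_STATES][NSYMS] = {
    [1] = { 1u << 2, 0 },
    [3] = { 1u << 4, 0 },
    [4] = { 0, 1u << 5 },
    [5] = { 0, 1u << 6 },
    [7] = { 1u << 7, 1u << 8 },
    [8] = { 0, 1u << 8 },
};

/* epsilon-closure(T): add epsilon successors until nothing changes. */
static unsigned eps_closure(unsigned T)
{
    unsigned closure = T, before;
    do {
        before = closure;
        for (int s = 0; s < NFA_STATES; s++)
            if (closure & (1u << s))
                closure |= nfa_eps[s];
    } while (closure != before);
    return closure;
}

/* move(T, sym): states reachable from some state in T on symbol sym. */
static unsigned nfa_move(unsigned T, int sym)
{
    unsigned result = 0;
    for (int s = 0; s < NFA_STATES; s++)
        if (T & (1u << s))
            result |= nfa_delta[s][sym];
    return result;
}

unsigned dstates[MAX_DSTATES];      /* Dstates: each entry is a set of NFA states */
int dtran[MAX_DSTATES][NSYMS];      /* Dtran: -1 means no transition */

/* Build Dstates and Dtran; returns the number of DFA states,
   or -1 if MAX_DSTATES is too small. */
int subset_construction(void)
{
    int n = 0;
    dstates[n++] = eps_closure(1u << 0);
    for (int t = 0; t < n; t++) {           /* entries from index t on are unprocessed */
        for (int sym = 0; sym < NSYMS; sym++) {
            unsigned U = eps_closure(nfa_move(dstates[t], sym));
            dtran[t][sym] = -1;
            if (U == 0)
                continue;                   /* empty set: no transition */
            int u = 0;
            while (u < n && dstates[u] != U)
                u++;                        /* look for U among known states */
            if (u == n) {
                if (n == MAX_DSTATES)
                    return -1;
                dstates[n++] = U;           /* U is a new DFA state */
            }
            dtran[t][sym] = u;
        }
    }
    return n;
}
```

For this NFA the construction yields six DFA states, discovered in the order {0,1,3,7}, {2,4,7}, {8}, {7}, {5,8}, {6,8}. The sketch leaves transitions into the empty set undefined instead of adding an explicit dead state.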

ε-closure and move Examples

For the combined NFA (0 -ε-> 1, 3, 7; 1 -a-> 2; 3 -a-> 4 -b-> 5 -b-> 6; 7 -a-> 7; 7 -b-> 8; 8 -b-> 8) and the input a a b a:

ε-closure({0}) = {0,1,3,7}
move({0,1,3,7},a) = {2,4,7}
ε-closure({2,4,7}) = {2,4,7}
move({2,4,7},a) = {7}
ε-closure({7}) = {7}
move({7},b) = {8}
ε-closure({8}) = {8}
move({8},a) = ∅

{0,1,3,7} -a-> {2,4,7} -a-> {7} -b-> {8} -a-> none

Also used to simulate NFAs

Simulating an NFA using ε-closure and move

S := ε-closure({s0})
Sprev := ∅
a := nextchar()
while S ≠ ∅ do
    Sprev := S
    S := ε-closure(move(S,a))
    a := nextchar()
end do
if Sprev ∩ F ≠ ∅ then
    execute action in Sprev
    return “yes”
else return “no”
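
The pseudocode can be turned into C for the combined NFA of a, abb, and a*b+. This is a sketch under assumptions: state sets are bit masks, the input is a NUL-terminated string, and the names `action_of`, `eps_closure`, `nfa_move`, and `simulate` are hypothetical. Like the pseudocode, it inspects only the last non-empty state set; a full scanner would also remember the most recent accepting set so that it can fall back to a shorter match.

```c
#include <stddef.h>

enum { NFA_STATES = 9 };

/* Combined NFA as bit masks (bit i = NFA state i). */
static const unsigned nfa_eps[NFA_STATES] = {
    [0] = 1u << 1 | 1u << 3 | 1u << 7,
};
static const unsigned nfa_delta[NFA_STATES][2] = {   /* column 0 = a, 1 = b */
    [1] = { 1u << 2, 0 },
    [3] = { 1u << 4, 0 },
    [4] = { 0, 1u << 5 },
    [5] = { 0, 1u << 6 },
    [7] = { 1u << 7, 1u << 8 },
    [8] = { 0, 1u << 8 },
};

/* action_of[s] is the action number of accepting state s, 0 if s is not accepting. */
static const int action_of[NFA_STATES] = { [2] = 1, [6] = 2, [8] = 3 };

static unsigned eps_closure(unsigned T)
{
    unsigned closure = T, before;
    do {                                    /* repeat until no new state is added */
        before = closure;
        for (int s = 0; s < NFA_STATES; s++)
            if (closure & (1u << s))
                closure |= nfa_eps[s];
    } while (closure != before);
    return closure;
}

static unsigned nfa_move(unsigned T, char c)
{
    if (c != 'a' && c != 'b')
        return 0;                           /* includes the terminating NUL */
    unsigned result = 0;
    for (int s = 0; s < NFA_STATES; s++)
        if (T & (1u << s))
            result |= nfa_delta[s][c == 'b'];
    return result;
}

/* Returns the number of the action to execute (0 means "no") and
   stores the length of the matched lexeme in *len. */
int simulate(const char *input, size_t *len)
{
    unsigned S = eps_closure(1u << 0);
    unsigned Sprev = 0;
    size_t consumed = 0;                    /* characters matched so far */

    while (S != 0) {
        Sprev = S;
        S = eps_closure(nfa_move(S, input[consumed]));
        if (S != 0)
            consumed++;
    }

    /* Among the accepting states in Sprev, pick the action listed first. */
    int action = 0;
    for (int s = 0; s < NFA_STATES; s++)
        if ((Sprev & (1u << s)) && action_of[s] != 0 &&
            (action == 0 || action_of[s] < action))
            action = action_of[s];

    *len = action ? consumed : 0;
    return action;
}
```

On the input a a b a this selects action3 after matching three characters, and on a b b a it selects action2, since abb is listed before a*b+ in the specification.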

Minimizing the Number of States of a DFA

[Figure: a DFA over {a, b} with states A, B, C, D, E (left) and the equivalent minimized DFA with states A, B, D, E (right).]
