Chapter 3 - Lexical Analysis
[Figure: diagram including the symbol table]
Why separate lexical analysis
and parsing?
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
Tokens, Patterns and Lexemes
A token is a pair: a token name and an optional token
value
A pattern is a description of the form that the lexemes
of a token may take
A lexeme is a sequence of characters in the source
program that matches the pattern for a token
Example
Token        Informal description              Sample lexemes
if           Characters i, f                   if
else         Characters e, l, s, e             else
comparison   < or > or <= or >= or == or !=    <=, !=
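To make the distinction concrete, the sketch below represents a token as a (name, lexeme) pair in C and maps the sample lexemes from the table to token names. The names TokenName, Token, classify and TOK_UNKNOWN are hypothetical, chosen only for this illustration; a real lexical analyzer would scan the input rather than receive ready-made lexemes.

```c
#include <assert.h>
#include <string.h>

/* Token names from the table; TOK_UNKNOWN is an addition for this sketch. */
typedef enum { TOK_IF, TOK_ELSE, TOK_COMPARISON, TOK_UNKNOWN } TokenName;

/* A token: its name plus the matched lexeme, kept here as the optional value. */
typedef struct {
    TokenName name;
    const char *lexeme;
} Token;

/* Map a lexeme to the token whose pattern it matches. */
Token classify(const char *lexeme) {
    static const char *comparisons[] = { "<", ">", "<=", ">=", "==", "!=" };
    assert(lexeme != NULL);
    Token t = { TOK_UNKNOWN, lexeme };
    if (strcmp(lexeme, "if") == 0) {
        t.name = TOK_IF;
    } else if (strcmp(lexeme, "else") == 0) {
        t.name = TOK_ELSE;
    } else {
        for (size_t i = 0; i < sizeof comparisons / sizeof comparisons[0]; i++) {
            if (strcmp(lexeme, comparisons[i]) == 0) {
                t.name = TOK_COMPARISON;
                break;
            }
        }
    }
    return t;
}
```

Both "<=" and "!=" are different lexemes that yield the same token name, comparison, which is why the lexeme is carried along as the token value.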
[Figure: input buffer holding E = M * C * * 2 eof]
Sentinels
[Figure: buffer pair with sentinels: E = M eof * C * * 2 eof eof]
switch (*forward++) {
    case eof:
        if (forward is at end of first buffer) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if (forward is at end of second buffer) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    cases for the other characters;
}
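The pseudocode above can be turned into a small runnable sketch. The version below is one possible reading, under stated assumptions: the input comes from an in-memory string rather than a file, the NUL byte stands in for eof (so the input must not contain NUL), and the half-buffer size is kept tiny so that reloads happen often. Scanner, scanner_init, next_char and HALF are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HALF 4             /* tiny half-buffer size, so reloads happen often */
#define SENTINEL '\0'      /* stands in for eof; assumes no NUL in the input */

typedef struct {
    const char *src;           /* input not yet loaded (in-memory source) */
    size_t left;               /* number of bytes remaining in src */
    char buf[2 * HALF + 2];    /* two halves, each followed by a sentinel slot */
    char *forward;
} Scanner;

/* Fill one half and place the sentinel right after the bytes loaded. */
static void reload(Scanner *s, char *half) {
    size_t n = s->left < HALF ? s->left : HALF;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;
    s->src += n;
    s->left -= n;
}

void scanner_init(Scanner *s, const char *text) {
    assert(s != NULL && text != NULL);
    s->src = text;
    s->left = strlen(text);
    s->forward = s->buf;
    reload(s, s->buf);
}

/* Return the next input character, or -1 at the real end of input. */
int next_char(Scanner *s) {
    char *first_end = s->buf + HALF;       /* sentinel slot of first half */
    char *second = s->buf + HALF + 1;      /* start of second half */
    char *second_end = second + HALF;      /* sentinel slot of second half */
    for (;;) {
        char c = *s->forward++;
        if (c != SENTINEL)
            return (unsigned char)c;       /* common case: a single test */
        if (s->forward - 1 == first_end) {
            reload(s, second);
            s->forward = second;
        } else if (s->forward - 1 == second_end) {
            reload(s, s->buf);
            s->forward = s->buf;
        } else {
            s->forward--;                  /* sentinel inside a half: end of input */
            return -1;
        }
    }
}
```

The point of the sentinel is visible in next_char: for ordinary characters only one comparison is made, and the end-of-buffer checks run only when the sentinel is seen.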
Specification of tokens
In compiler theory, regular expressions are used
to formalize the specification of tokens
Regular expressions are a means of specifying regular
languages
Example:
letter_(letter_ | digit)*
Each regular expression is a pattern specifying the
form of strings
Regular expressions
Ɛ is a regular expression, L(Ɛ) = {Ɛ}
If a is a symbol in Σ, then a is a regular expression, L(a)
= {a}
(r) | (s) is a regular expression denoting the language
L(r) ∪ L(s)
(r)(s) is a regular expression denoting the language
L(r)L(s)
(r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
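As a worked illustration of these rules, take the alphabet Σ = {a, b}. The languages below follow directly from the definitions above; the choice of alphabet and expressions is only an example, and the block assumes the amsmath package for align* and \text.

```latex
\begin{align*}
L(a \mid b) &= \{a, b\} \\
L((a \mid b)(a \mid b)) &= \{aa, ab, ba, bb\} \\
L(a^*) &= \{\epsilon, a, aa, aaa, \dots\} \\
L((a \mid b)^*) &= \text{all strings of } a\text{'s and } b\text{'s, including } \epsilon \\
L(a \mid a^* b) &= \{a, b, ab, aab, aaab, \dots\}
\end{align*}
```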
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
One or more instances: (r)+
Zero or one instance: (r)?
Character classes: [abc]
Example:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_(letter_|digit)*
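A hand-written check for the id pattern might look like the sketch below. It tests whether a whole string matches letter_(letter_|digit)*, assuming an ASCII character set; is_letter_ and is_id are hypothetical helper names, and a real lexer would instead find the longest matching prefix of the input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* letter_ -> [A-Za-z_] (ASCII ranges assumed) */
static bool is_letter_(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

/* digit -> [0-9] */
static bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* id -> letter_ (letter_ | digit)* ; true only if the whole string matches. */
bool is_id(const char *s) {
    assert(s != NULL);
    if (!is_letter_(*s))
        return false;                 /* must start with letter_ */
    for (s++; *s != '\0'; s++) {
        if (!is_letter_(*s) && !is_digit(*s))
            return false;             /* every later character: letter_ or digit */
    }
    return true;
}
```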
Recognition of tokens
The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
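The number pattern can be checked by hand in the same spirit: one helper per regular definition, with the optional parts written as if statements. This is a minimal sketch that matches a whole string; is_number and the digits helper are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

/* digits -> digit+ ; advance *p past the digits, report whether any were seen. */
static bool digits(const char **p) {
    const char *start = *p;
    while (is_digit(**p))
        (*p)++;
    return *p != start;
}

/* number -> digits (. digits)? (E [+-]? digits)? */
bool is_number(const char *s) {
    assert(s != NULL);
    if (!digits(&s))
        return false;
    if (*s == '.') {                      /* optional fraction */
        s++;
        if (!digits(&s))
            return false;
    }
    if (*s == 'E') {                      /* optional exponent */
        s++;
        if (*s == '+' || *s == '-')
            s++;
        if (!digits(&s))
            return false;
    }
    return *s == '\0';                    /* nothing may follow */
}
```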
[Figure: lex.yy.c -> C compiler -> a.out; input stream -> a.out -> sequence of tokens]
Structure of Lex programs
Declarations
%%
Translation rules, each of the form: Pattern {Action}
%%
Auxiliary functions
Example
%{
    /* definitions of manifest constants
    LT, LE, EQ, NE, GT, GE,
    IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
…

%%

int installID() {/* function to install the lexeme, whose first
                    character is pointed to by yytext, and whose
                    length is yyleng, into the symbol table and
                    return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical
                     constants into a separate table */
}
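The body of installID is left empty in the example. One way it could be filled in is sketched below, with loud assumptions: the symbol table is a small fixed-size array searched linearly, the function returns a table index rather than a pointer, and the lexeme is passed as parameters instead of being read from yytext and yyleng. The names install_id, symtab, MAX_SYMBOLS and MAX_LEXEME are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMBOLS 64
#define MAX_LEXEME  32

/* Hypothetical fixed-size symbol table; a real compiler would likely hash. */
static char symtab[MAX_SYMBOLS][MAX_LEXEME];
static int symcount = 0;

/* Install the lexeme text[0..len-1] if it is absent and return its index.
   In the Lex program, text and len would be yytext and yyleng. */
int install_id(const char *text, int len) {
    assert(text != NULL && len > 0 && len < MAX_LEXEME);
    for (int i = 0; i < symcount; i++) {
        if ((int)strlen(symtab[i]) == len &&
            memcmp(symtab[i], text, (size_t)len) == 0)
            return i;                          /* already in the table */
    }
    assert(symcount < MAX_SYMBOLS);            /* sketch: no growth strategy */
    memcpy(symtab[symcount], text, (size_t)len);
    symtab[symcount][len] = '\0';
    return symcount++;
}
```

The length parameter matters because yytext points into the input buffer: only the first yyleng characters belong to the lexeme.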
Finite Automata
Regular expressions = specification
Finite automata = implementation
Finite Automata State Graphs
• A state
• An accepting state
• A transition (labelled with an input symbol, e.g. a)
A Simple Example
A finite automaton that accepts only “1”
Another Simple Example
A finite automaton accepting any number of 1’s
followed by a single 0
Alphabet: {0,1}
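Since the state graph itself did not survive, the sketch below encodes one plausible automaton for this language directly as a switch over states. The state names START, ACCEPT and DEAD, and the explicit dead state, are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* States of a DFA for "any number of 1's followed by a single 0".
   The names are labels chosen for this sketch. */
typedef enum { START, ACCEPT, DEAD } State;

bool accepts_ones_then_zero(const char *input) {
    assert(input != NULL);
    State s = START;
    for (; *input != '\0'; input++) {
        switch (s) {
        case START:
            if (*input == '1')
                s = START;            /* loop on 1 */
            else if (*input == '0')
                s = ACCEPT;           /* the single 0 */
            else
                s = DEAD;             /* symbol outside {0,1} */
            break;
        case ACCEPT:                  /* any symbol after the 0 is an error */
        case DEAD:
            s = DEAD;
            break;
        }
    }
    return s == ACCEPT;
}
```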
Epsilon Moves
Another kind of transition: Ɛ-moves
[Figure: an Ɛ-move from state A to state B]
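When simulating an automaton with Ɛ-moves, it is common to compute the set of states reachable without reading any input. The sketch below does this with bitmasks for a small hypothetical NFA (0 -Ɛ-> 1, 1 -Ɛ-> 2, and a state 3 with no Ɛ-moves); the automaton and the names eps and epsilon_closure are assumptions, not taken from the slides.

```c
#include <assert.h>

#define NSTATES 4

/* eps[s] is the bitmask of states reachable from s by one epsilon-move.
   Hypothetical NFA: 0 -eps-> 1, 1 -eps-> 2, state 3 has no epsilon-moves. */
static const unsigned eps[NSTATES] = { 1u << 1, 1u << 2, 0, 0 };

/* Return every state reachable from `set` using only epsilon-moves. */
unsigned epsilon_closure(unsigned set) {
    assert(set < (1u << NSTATES));
    unsigned closure = set, prev;
    do {                                   /* repeat until nothing new is added */
        prev = closure;
        for (int s = 0; s < NSTATES; s++)
            if (closure & (1u << s))
                closure |= eps[s];
    } while (closure != prev);
    return closure;
}
```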
Execution of Finite Automata
A DFA can take only one path through the state graph
Completely determined by input
Acceptance of NFAs
An NFA can get into multiple states
[Figure: NFA state graph with transitions labelled 0 and 1]
• Input: 1 0 1
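The idea of being in multiple states at once can be simulated by tracking a set of current states. Because the slide's state graph is not recoverable, the sketch below uses a hypothetical NFA over {0,1} that accepts strings ending in "01"; delta, accepting and nfa_accepts are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSTATES 3

/* Hypothetical NFA accepting strings over {0,1} that end in "01":
   state 0 loops on 0 and 1, 0 -0-> 1, 1 -1-> 2, state 2 is accepting.
   delta[s][a] is the bitmask of possible next states. */
static const unsigned delta[NSTATES][2] = {
    { (1u << 0) | (1u << 1), 1u << 0 },    /* state 0 */
    { 0,                     1u << 2 },    /* state 1 */
    { 0,                     0       },    /* state 2 */
};
static const unsigned accepting = 1u << 2;

/* Track every state the NFA could be in; accept if any one is accepting. */
bool nfa_accepts(const char *input) {
    assert(input != NULL);
    unsigned current = 1u << 0;            /* start in state 0 */
    for (; *input != '\0'; input++) {
        if (*input != '0' && *input != '1')
            return false;                  /* symbol outside the alphabet */
        int a = *input - '0';
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (current & (1u << s))
                next |= delta[s][a];
        current = next;
    }
    return (current & accepting) != 0;
}
```

On the input 1 0 1 this machine is in the state sets {0}, {0,1}, {0,2} in turn, and accepts because the final set contains the accepting state.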
NFA vs. DFA (1)
NFAs and DFAs recognize the same set of languages
(regular languages)
NFA vs. DFA (2)
For a given language the NFA can be simpler than the
DFA
[Figure: an NFA and a larger DFA for the same language, with transitions labelled 0 and 1]
Implementation
A DFA can be implemented by a 2D table T
One dimension is “states”
Other dimension is “input symbols”
For every transition Si -a-> Sk define T[i,a] = k
DFA “execution”
If in state Si and input a, read T[i,a] = k and skip to state
Sk
Very efficient
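A table-driven implementation can be sketched in a few lines of C. The table below is an assumed encoding of the earlier example language (any number of 1's followed by a single 0), with an explicit dead state added so that every entry is defined; the state names and dfa_accepts are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { S0, S1, S2, NUM_STATES };    /* S0 start, S1 accepting, S2 dead */

/* T[i][a] = k encodes the transition Si -a-> Sk for input symbols 0 and 1. */
static const int T[NUM_STATES][2] = {
    /*          0   1  */
    /* S0 */ { S1, S0 },
    /* S1 */ { S2, S2 },
    /* S2 */ { S2, S2 },
};

bool dfa_accepts(const char *input) {
    assert(input != NULL);
    int state = S0;
    for (; *input != '\0'; input++) {
        assert(*input == '0' || *input == '1');   /* alphabet is {0,1} */
        state = T[state][*input - '0'];           /* one lookup per symbol */
    }
    return state == S1;
}
```

Each input symbol costs one array lookup, which is where the efficiency comes from; the trade-off is the size of the table, one entry per state and input symbol.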