Lexical Analysis All Token List and Difference
Examples

a(b|c) represents {ab, ac}
a*b represents {b, ab, aab, aaab, ...}
(a|b)* represents any string of a's and b's, including the empty string.
(ab)* represents {ε, ab, abab, ababab, ...}
(a|b)(c|d) represents {ac, ad, bc, bd}
0*10* represents the set of all strings over {0, 1} containing exactly one 1.

Let d = (0|...|9), l = (A|...|Z)
1. A comment that begins with -- and ends with Eol:
    Comment = -- Not(Eol)* Eol
2. A fixed decimal literal:
    Lit = d+ . d+
3. An identifier, composed of letters, digits, and underscores, that begins with a letter, ends with a letter or digit, and contains no consecutive underscores:
    Id = l (l|d)* ( _ (l|d)+ )*
4. A comment delimited by ## markers that allows a single # within the comment body:
    Comment2 = ## ((#|ε) Not(#))* ##

Regular expressions are limited in descriptive power. They cannot represent many programming language constructs. In particular, they cannot describe languages that contain strings of the form {a^n b^n}, because a finite automaton has only finitely many states and cannot count an unbounded n. This language models balanced parentheses.

How to recognize tokens?

A recognizer for a language L is a program that takes a string x and answers "yes" if x ∈ L and "no" otherwise.
A recognizer for a regular expression, called a finite automaton, can be constructed from the regular expression.
Two classes of finite automata: i) nondeterministic and ii) deterministic.
Nondeterministic Finite Automaton (NFA): A mathematical model that consists of i) a set of states S, ii) a set of input symbols Σ, iii) a transition function that maps state-symbol pairs to sets of states, iv) a state s0 that is distinguished as the start state, and v) a set of states F distinguished as accepting or final states.
An NFA can be represented as a labeled directed graph, called a transition graph: nodes are states, and labeled edges represent the transition function. It can also be represented by a transition table.
An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state whose edge labels spell out x.
Language defined by an NFA: the set of strings it accepts.

Example NFA for the RE (a|b)*abb:

Transition graph (start state 0, accepting state 3):
    0 --a--> 0    0 --b--> 0    0 --a--> 1    1 --b--> 2    2 --b--> 3

Transition table:
    State   a        b
    0       {0, 1}   {0}
    1       -        {2}
    2       -        {3}
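The acceptance rule for the example NFA can be tried out by tracking the set of states the automaton could be in after each input symbol. The following is a minimal sketch in C, not a prescribed implementation: the function name nfa_accepts, the bitmask encoding of state sets, and the empty row for state 3 are assumptions made for illustration; rows 0 to 2 copy the transition table for (a|b)*abb.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_STATES 4
#define START_SET  (1u << 0)   /* the NFA starts in state 0 */
#define ACCEPT_SET (1u << 3)   /* state 3 is the only accepting state */

/* move_on[s][c] is the set of states reachable from state s on symbol c
   (column 0 = 'a', column 1 = 'b'); state i is stored as bit i.
   Row 3 (no outgoing edges) is an assumption, the table lists rows 0-2 only. */
static const unsigned move_on[NUM_STATES][2] = {
    { (1u << 0) | (1u << 1), 1u << 0 },  /* state 0: a -> {0,1}, b -> {0} */
    { 0u,                    1u << 2 },  /* state 1: a -> -,     b -> {2} */
    { 0u,                    1u << 3 },  /* state 2: a -> -,     b -> {3} */
    { 0u,                    0u      }   /* state 3: no moves             */
};

/* Returns 1 if some path labeled by input leads from the start state
   to an accepting state, and 0 otherwise. */
int nfa_accepts(const char *input)
{
    assert(input != NULL);
    unsigned current = START_SET;
    for (const char *p = input; *p != '\0'; ++p) {
        if (*p != 'a' && *p != 'b')
            return 0;                    /* symbol outside the alphabet {a, b} */
        int sym = (*p == 'b');
        unsigned next = 0u;
        for (int s = 0; s < NUM_STATES; ++s)
            if (current & (1u << s))
                next |= move_on[s][sym]; /* union of moves from every live state */
        current = next;
    }
    return (current & ACCEPT_SET) != 0;
}
```

For instance, nfa_accepts("aabb") follows the state sets {0}, {0,1}, {0,1}, {0,2}, {0,3} and answers yes, while nfa_accepts("abba") ends in {0,1} and answers no.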
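The regular definitions Lit and Id from the Examples slide can also be checked mechanically. The sketch below assumes a POSIX system that provides regex.h; the helper name matches and the two pattern constants are hypothetical, and the ^ and $ anchors and bracket classes are POSIX extended regular expression syntax rather than the slide notation. Letters are restricted to A-Z because the slide defines l = (A|...|Z).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Lit = d+ . d+ */
const char LIT_PATTERN[] = "^[0-9]+\\.[0-9]+$";

/* Id = l (l|d)* ( _ (l|d)+ )* */
const char ID_PATTERN[] = "^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$";

/* Returns 1 if the whole text matches the POSIX extended regular expression,
   0 if it does not, and -1 if the pattern fails to compile. */
int matches(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With these patterns, matches(ID_PATTERN, "A_B1") succeeds, while "A__B" (consecutive underscores) and "A_" (ends with an underscore) are rejected, which is what the prose description of Id requires.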
Tools for constructing Scanners

Several tools build lexical analyzers from a special-purpose notation based on regular expressions.

    Regular expression -> Scanner Generator -> Lexical Analyzer
    String -> Lexical Analyzer -> yes/no

Lex produces an entire scanner module that can be compiled and linked with other compiler modules.

Three components of a lex program:
    declarations
    %%
    transition rules
    %%
    auxiliary procedures
Any part may be omitted, but separators must appear.
lex generates a function yylex. Every time yylex is called, it returns a token.
Lex maintains a set of variables that hold attributes of lexemes.
    yytext: actual contents of the lexeme identified.
    yyleng: length of the lexeme.
    yylval: used to store the lexical value of the token, if any.
    yylineno: number of the current input line.

Lex - cont'd.

Part 1: Define a pattern for a token and give it a symbolic name, so that the name can be used when referring to the token in the description and action parts.
    digs     [0-9]+
    integer  {digs}
It can also contain variables and other declarations to be included in the C code generated by Lex. These are usually placed at the beginning and enclosed in %{ and %}:
    %{
    int linecount = 1;
    %}
Second part (transition rules): specifies how to define tokens and what actions to take when tokens are identified. For instance, return the identifier (say, an integer value) for the token.
    {real}      {return FLOAT;}
    begin       {return BEGIN;}
    whitespace  ;
Part 3: In certain cases, actions associated with tokens may be complex enough to warrant function and procedure definitions. Such procedures can be defined in the third section.
Part 1 and Part 3 exist mostly to support the definition of the second part.
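To make the calling convention concrete, here is a small hand-written scanner in C that imitates the contract described for yylex: each call returns one token code and leaves the lexeme's text, length, and line number in shared variables. It is only a sketch under stated assumptions, not the code Lex generates. The names next_token, scan_init, tok_text, tok_len, and line_no are hypothetical stand-ins for yylex, yytext, yyleng, and yylineno, and the float pattern is simplified to digits, a dot, and digits.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum token { TOK_EOF = 0, TOK_INTEGER, TOK_FLOAT, TOK_BEGIN, TOK_OTHER };

/* Hypothetical stand-ins for yytext, yyleng, and yylineno. */
char tok_text[64];
int  tok_len;
int  line_no = 1;

const char *cursor = "";               /* next unread input character */

/* Point the scanner at a new NUL-terminated input string. */
void scan_init(const char *input)
{
    assert(input != NULL);
    cursor = input;
    line_no = 1;
}

/* Returns the next token, in the way one call to yylex would. */
enum token next_token(void)
{
    /* Whitespace is matched and discarded, counting lines on the way. */
    while (*cursor == ' ' || *cursor == '\t' || *cursor == '\n') {
        if (*cursor == '\n')
            ++line_no;
        ++cursor;
    }

    const char *start = cursor;
    enum token kind = TOK_OTHER;
    if (*cursor == '\0') {
        kind = TOK_EOF;
    } else if (isdigit((unsigned char)*cursor)) {
        while (isdigit((unsigned char)*cursor))
            ++cursor;
        kind = TOK_INTEGER;
        /* Simplified float: digits '.' digits (no exponent part). */
        if (*cursor == '.' && isdigit((unsigned char)cursor[1])) {
            ++cursor;
            while (isdigit((unsigned char)*cursor))
                ++cursor;
            kind = TOK_FLOAT;
        }
    } else if (isalpha((unsigned char)*cursor)) {
        while (isalnum((unsigned char)*cursor))
            ++cursor;
    } else {
        ++cursor;                      /* any other single character */
    }

    /* Record the lexeme, truncating it if it exceeds the buffer. */
    tok_len = (int)(cursor - start);
    if (tok_len >= (int)sizeof tok_text)
        tok_len = (int)sizeof tok_text - 1;
    memcpy(tok_text, start, (size_t)tok_len);
    tok_text[tok_len] = '\0';

    if (kind == TOK_OTHER && strcmp(tok_text, "begin") == 0)
        kind = TOK_BEGIN;
    return kind;
}
```

After scan_init("begin 42\n3.5"), successive calls to next_token return TOK_BEGIN, TOK_INTEGER, TOK_FLOAT, and TOK_EOF, with tok_text holding "begin", "42", and "3.5" in turn and line_no advancing to 2 once the newline has been skipped.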
Example 3: Pascal lexemes

    digit      [0-9]
    digits     {digit}+
    letter     [A-Za-z]
    l_or_d     ({letter}|{digit})
    sign       [+\-]
    dtdgts     {dot}{digits}
    exponent   [Ee]{sign}?{digits}
    real       {digits}({dtdgts}|{exponent}|{dtdgts}{exponent})
    ident      {letter}{l_or_d}*
    newline    [\n]
    quote      [\"]
    wspace     [ \t]
    comment    ("{"[^}]*"}"|"(*""("*([^*)]|[^*]")"|"*"[^)])*"*"*"*)")
    string     \'([^'\n]|\'\')+\'
    badstring  {quote}['"]*{quote}
    dotdot     ".."
    dot        "."
    other      .

Practical Considerations

How to handle words such as if and while? Note that these words match the lexical syntax of ordinary identifiers.
Most languages make such words reserved. This facilitates parsing and eases programming. If they are not reserved, what does the following string really mean?
    if if then else = then;
Reserving such words makes the regular expression for identifiers complicated, because reserved words are lexically similar to identifiers. Two ways to write it:
    Use nots
    Directly write the regular expression (very hard though)
Simple solution: Treat reserved words as ordinary identifiers and use a separate table lookup to detect them.
Another solution: Define a distinct regular expression for each reserved word. A string may then match more than one regular expression, so some mechanism is needed for choosing one of them. In Lex, the order in which token specifications are listed makes a difference: when several patterns match the same lexeme, the rule listed first is chosen.
    if      {return(IF);}
    then    {return(THEN);}
    {id}    {return(ID);}
Problem: The underlying finite automaton and its transition table will be significantly larger.
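The table lookup mentioned as the simple solution could look roughly like the following C sketch. The scanner matches every word with the identifier pattern and then consults a sorted keyword table. The names classify_word, keywords, and the TOK_ constants are hypothetical, the table holds only a few sample words, and the comparison is case-sensitive, which is a simplifying assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum token { TOK_ID = 0, TOK_ELSE, TOK_IF, TOK_THEN, TOK_WHILE };

struct keyword {
    const char *word;
    enum token  tok;
};

/* Sample reserved words; the table must stay sorted by word for bsearch. */
static const struct keyword keywords[] = {
    { "else",  TOK_ELSE  },
    { "if",    TOK_IF    },
    { "then",  TOK_THEN  },
    { "while", TOK_WHILE }
};

/* bsearch callback: compares the searched lexeme with one table entry. */
static int compare_keyword(const void *key, const void *entry)
{
    return strcmp((const char *)key, ((const struct keyword *)entry)->word);
}

/* Called on a lexeme that matched the identifier pattern: returns the
   reserved-word token if the lexeme is in the table, and TOK_ID otherwise. */
enum token classify_word(const char *lexeme)
{
    assert(lexeme != NULL);
    const struct keyword *hit = bsearch(lexeme, keywords,
                                        sizeof keywords / sizeof keywords[0],
                                        sizeof keywords[0], compare_keyword);
    return hit ? hit->tok : TOK_ID;
}
```

With this arrangement the scanner needs only the single identifier rule, so the automaton stays small; classify_word("then") yields TOK_THEN, whereas classify_word("ifx") falls through to TOK_ID.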