
Lexical Analysis


Lexical Analysis

 Primary role: Scan a source program (a string) and break it up into small, meaningful units, called tokens.
Example:
    position := initial + rate * 60;
Transform into meaningful units: identifiers, constants, operators, and punctuation.
 Other roles:
 Removal of comments
 Case conversion
 Removal of white spaces
 Interpretation of compiler directives or pragmas: For instance, in Turbo Pascal {$R+} means range checking is enabled.
 Communication with symbol table: Store information regarding an identifier in the symbol table. Not advisable in cases where scopes can be nested.
 Preparation of output listing: Keep track of the source program, line numbers, and correspondences between error messages and line numbers.
 Why separate LA from parser?
 Simpler design of both LA and parser
 More efficient compiler
 More portable compiler

Tokens, Lexemes, and Patterns

 Token: a certain classification of entities of a program. There were four kinds of tokens in the previous example: identifiers, operators, constants, and punctuation.
 Lexeme: A specific instance of a token. Used to differentiate tokens. For instance, both position and initial belong to the identifier class; however, each is a different lexeme.
 The lexical analyzer may return a token type to the parser, but must also keep track of "attributes" that distinguish one lexeme from another. Examples of attributes:
 Identifiers: string
 Numbers: value
Attributes are used during semantic checking and code generation. They are not needed during parsing.
 Patterns: A rule describing how tokens are specified in a program. Needed because a language can contain infinitely many possible strings; they cannot all be enumerated.
 Formal mechanisms are used to represent these patterns. Formalism helps in describing precisely (i) which strings belong to the language, and (ii) which do not. It also forms the basis for developing tools that can automatically determine whether a string belongs to a language.
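The token-type/attribute split described above can be sketched as a tagged record. This is a minimal illustration, not code from the slides; all type and field names are assumed:

```c
#include <assert.h>
#include <string.h>

/* A minimal token record (names assumed, not from the slides): the token
   type is what the parser consumes; the attribute distinguishes lexemes
   of the same class, e.g. two different identifiers. */
typedef enum { TOK_ID, TOK_NUM, TOK_OP, TOK_PUNCT } TokenType;

typedef struct {
    TokenType type;
    union {
        char name[32];   /* identifiers: the lexeme string */
        long value;      /* numbers: the numeric value */
    } attr;
} Token;
```

Here position and initial would both carry type TOK_ID but different attr.name strings, which is exactly the token/lexeme distinction.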

How are patterns specified?

 Using a meta-language, called regular expressions.
 Alphabet: finite set of symbols. We use the term Σ for specifying an alphabet.
 Sentence or word: a string.
 Empty string: denoted ε, the string of length 0.
 Language: Any set of strings defined over an alphabet. From the lexical analyzer's point of view, this language denotes the set of all tokens in a programming language.
 Define the following operators over sets of strings:
 1. Union: L ∪ U
    S = L ∪ U = {s | (s ∈ L) ∨ (s ∈ U)}
 2. Concatenation: LU or L.U
    S = LU = {st | (s ∈ L) ∧ (t ∈ U)}
 3. Kleene closure: L∗, the set of all strings formed from L, including ε:
    S = L∗ = ∪_{i=0}^{∞} L^i
 4. Positive closure: L+
    S = L+ = LL∗
 Regular expression: a notation for defining the set of tokens that normally occur in programming languages.
 For each regular expression r, there is a corresponding set of strings, say L(r), that is said to be derived from the regular expression. It is also called a regular set.

Regular expressions

 ε is a regular expression.
    L(ε) = {ε}
   Note that this set is different from the empty set.
 If a ∈ Σ, then a is a regular expression.
    L(a) = {a}
 Operators: Assume that r, s are regular expressions. The following operators construct regular expressions from r and s:
 1. Choice |: r|s
    L(r|s) = L(r) ∪ L(s)
 2. Concatenation: rs or r.s
    L(rs) = L(r)L(s)
 3. Kleene closure: r∗
    L(r∗) = (L(r))∗
 Any finite set of strings can be represented by a regular expression of the form s1|s2| · · · |sk.
 Some additional operations for notational convenience:
 1. r+, denoting all strings consisting of one or more r.
    r∗ = (r+|ε)
 2. Not(r): denoting strings in the set (Σ∗ − L(r))
 3. r^k: denotes all strings formed by concatenating k strings from L(r).
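The concatenation operator defined above can be computed directly for small finite languages; a sketch (array sizes and names are illustrative only):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Concatenation of two finite languages, LU = { st | s in L, t in U },
   computed by pairing every string of L with every string of U. */
static int concat(const char *L[], int nl, const char *U[], int nu,
                  char out[][8]) {
    int k = 0;
    for (int i = 0; i < nl; i++)
        for (int j = 0; j < nu; j++)
            snprintf(out[k++], 8, "%s%s", L[i], U[j]);
    return k;   /* |LU| = |L| x |U| when all concatenations are distinct */
}
```

With L = {a, b} and U = {c, d} this yields {ac, ad, bc, bd}, the same set the regular expression (a|b)(c|d) denotes on the next slide.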
Examples

 a(b|c) represents {ab, ac}
 a∗b represents {b, ab, aab, aaab, . . .}
 (a|b)∗ represents any combination of a or b.
 (ab)∗ represents {ε, ab, abab, ababab, . . .}
 (a|b)(c|d) represents {ac, ad, bc, bd}
 0∗10∗ represents the set of all strings over {0, 1} containing exactly one 1.
 Let d = (0| · · · |9), l = (A| · · · |Z)
 1. A comment that begins with -- and ends with Eol:
    Comment = -- Not(Eol)∗ Eol
 2. A fixed decimal literal:
    Lit = d+.d+
 3. An identifier, composed of letters, digits, and underscores, that begins with a letter, ends with a letter or digit, and contains no consecutive underscores:
    Id = l(l|d)∗(_(l|d)+)∗
 4. A comment delimited by ## markers, but allowing a single # within the comment body:
    Comment2 = ##((#|ε)Not(#))∗##
 Regular expressions are limited in descriptive power. They cannot represent many programming language constructs: they cannot describe languages that contain strings of the form {a^n b^n}. This language describes balanced parentheses.

How to recognize tokens?

 A recognizer for a language L is a program that takes a string x and answers "yes" if x ∈ L, else "no".
 A recognizer for regular expressions, called a finite automaton, can be constructed from the regular expressions.
 Two classes of finite automata: i) Nondeterministic and ii) Deterministic.
 Nondeterministic Finite Automaton (NFA): A mathematical model that consists of i) a set of states S, ii) a set of input symbols Σ, iii) a transition function that maps state-symbol pairs to sets of states, iv) a state s0 that is distinguished as the start state, and v) a set of states F distinguished as accepting or final states.
 An NFA can be represented as a labeled directed graph, called a transition graph: nodes are states, and labeled edges represent the transition function. It can also be represented through a transition table.
 An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state. The language defined by an NFA is the set of strings it accepts.
 Example NFA for the RE (a|b)∗abb (transition diagram omitted; states 0 through 3, start state 0, accepting state 3):

    State | a      | b
    0     | {0, 1} | {0}
    1     | –      | {2}
    2     | –      | {3}
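The NFA table above can be simulated directly by tracking the set of active states; a sketch using a bitmask for the state set (input assumed to range over {a, b}):

```c
#include <assert.h>

/* Subset simulation of the slide's NFA for (a|b)*abb, driving the
   transition table directly with a bitmask of currently active states
   (bit i set means state i is active). */
static int nfa_accepts(const char *x) {
    unsigned cur = 1u << 0;                       /* start state: 0 */
    for (; *x; x++) {
        unsigned next = 0;
        if (cur & (1u << 0))                      /* 0 --a--> {0,1}, 0 --b--> {0} */
            next |= (*x == 'a') ? (1u << 0) | (1u << 1) : (1u << 0);
        if ((cur & (1u << 1)) && *x == 'b')       /* 1 --b--> {2} */
            next |= 1u << 2;
        if ((cur & (1u << 2)) && *x == 'b')       /* 2 --b--> {3} */
            next |= 1u << 3;
        cur = next;
    }
    return (cur & (1u << 3)) != 0;                /* accept iff state 3 active */
}
```

The string is accepted exactly when state 3 is among the active states at the end, i.e. some path reaches an accepting state.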

Deterministic Finite Automaton (DFA)

 DFA: a special case of an NFA where i) there are no ε-transitions, and ii) for each state s and input a, only one transition is possible. In other words, at every state s, the transition for an input a is known.
 Easy to determine whether a DFA accepts a string, as there is only one path.
 A nondeterministic finite automaton can be converted into a deterministic finite automaton.
 DFA for (a|b)∗abb (transition diagram omitted; start state 0, accepting state 3):

    State | a | b
    0     | 1 | 0
    1     | 1 | 2
    2     | 1 | 3
    3     | 1 | 0

 There is one entry for each input symbol in the transition table.
 Very easy to determine whether a string is accepted by a DFA: follow the transition graph. A scanner driver that interprets a transition table:

    State := Initial State;
    loop
      NextState := Table(State, CurrentChar);
      exit when NextState = Error;
      State := NextState;
      exit when CurrentChar = Eof;
      Read(CurrentChar);
    end loop;
    if State in FinalStates then return valid token
    else LexicalError;

Approaches to building Scanner

 There are many ways in which a scanner can be built.
 Approach 1: Construct an NFA from the regular expression specification, and use an NFA simulator (ASU Alg. 3.4) for simulating the generated NFA.
   Number of states: O(|r|), where |r| is the length of the regular expression.
   Time cost: O(|r| × |x|), where |x| is the string length.
   Space requirements: O(|r|)
 Approach 2: Construct a DFA from the NFA and then simulate the DFA. In this approach, there is a possibility of state explosion.
   Time: O(|x|)
   Space: O(2^|r|)
 Approach 3: Lazy transition evaluation. Construct a transition as and when needed. The computed transitions are stored in a cache; each time a transition is about to be made, the cache is consulted, and if the transition does not exist, the new transition is computed.
   Combines the space requirements of the NFA method with the time requirements of the DFA.
   Space requirements: size of cache + |r|
   Observed running time is almost as fast as that of a DFA recognizer; sometimes faster, because it does not compute transitions that never need to be computed.
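The table-driven driver above can be written out concretely for the (a|b)∗abb DFA; a sketch over a string rather than a character stream:

```c
#include <assert.h>

/* Table-driven DFA recognizer for (a|b)*abb, using the transition table
   from the slide (start state 0, accepting state 3). */
static int dfa_accepts(const char *x) {
    static const int table[4][2] = {              /* [state][0='a', 1='b'] */
        { 1, 0 },   /* state 0 */
        { 1, 2 },   /* state 1 */
        { 1, 3 },   /* state 2 */
        { 1, 0 },   /* state 3 */
    };
    int state = 0;
    for (; *x; x++) {
        if (*x != 'a' && *x != 'b')
            return 0;                             /* no transition: reject */
        state = table[state][*x == 'b'];
    }
    return state == 3;                            /* accept iff final state */
}
```

Note the O(|x|) running time of Approach 2: one table lookup per input character, with no backtracking.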
Tools for constructing Scanners

 Several tools exist for building lexical analyzers from a special-purpose notation based on regular expressions:
    Regular expression → Scanner Generator → Lexical Analyzer
    String → Lexical Analyzer → yes/no
 Lex produces an entire scanner module that can be compiled and linked with other compiler modules.
 Three components of a lex program:

    declarations
    %%
    transition rules
    %%
    auxiliary procedures

 Any part may be omitted, but the separators must appear.
 Lex generates a function yylex. Every time yylex is called, it returns a token.
 Lex maintains a set of variables to define attributes of lexemes:
 yytext: Actual contents of the lexeme identified.
 yyleng: Length of the lexeme.
 yylval: Used to store the lexical value of the token, if any.
 yylineno: Number of the current input line.

Lex - cont'd.

 Part 1: Define a pattern for a token and give it a symbolic name, so that the name can be used when referring to the token in the description and action parts:

    digs    [0-9]+
    integer {digs}

Part 1 can also contain variables and other declarations to be included in the C code generated by Lex. These are usually defined at the beginning and enclosed in %{ and %}:

    %{
    int linecount = 1;
    %}

 Part 2 (transition rules): specifies how to define tokens and what actions to take when tokens are identified, for instance returning the identifier (say, an integer value) for the token. Parts 1 and 3 mostly exist to enable defining the second part.

    {real}     {return FLOAT;}
    begin      {return BEGIN;}
    whitespace ;

 Part 3: In certain cases, the actions associated with tokens may be complex enough to warrant function and procedure definitions. Such procedures can be defined in the third section.
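The yylex interface above can be imitated by a small hand-written scanner. This is an illustrative analogue only, not Lex output: the token codes, the two-token grammar, and all names here are assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hand-written analogue of the yylex interface from the slides:
   recognizes the keyword "begin" and fixed decimal literals, skips
   blanks, and records the lexeme like yytext/yyleng would. */
enum { END = 0, FLOAT_TOK = 1, BEGIN_TOK = 2, ERROR_TOK = -1 };

static const char *input;         /* scanning position */
static char yytext_buf[64];       /* analogue of yytext */
static int  yyleng_val;           /* analogue of yyleng */

static int my_yylex(void) {
    while (*input == ' ') input++;                /* whitespace ; */
    if (*input == '\0') return END;
    const char *start = input;
    if (isdigit((unsigned char)*input)) {         /* {real} {return FLOAT;} */
        while (isdigit((unsigned char)*input)) input++;
        if (*input == '.') {
            input++;
            while (isdigit((unsigned char)*input)) input++;
        }
        yyleng_val = (int)(input - start);
        memcpy(yytext_buf, start, (size_t)yyleng_val);
        yytext_buf[yyleng_val] = '\0';
        return FLOAT_TOK;
    }
    if (strncmp(input, "begin", 5) == 0) {        /* begin {return BEGIN;} */
        input += 5;
        yyleng_val = 5;
        strcpy(yytext_buf, "begin");
        return BEGIN_TOK;
    }
    return ERROR_TOK;                             /* no pattern matched */
}
```

Each call returns one token and leaves the lexeme and its length in the yytext/yyleng analogues, which is exactly the calling convention the parser relies on.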

How are tokens specified in Lex?

    [xz]     x or z
    [x-z]    x through z
    [^x]     Any character except x
    .        Any character except end of line
    ^x       An x at the beginning of a line
    x$       An x at the end of a line
    x?       Optional x
    x*       0 or more instances of x
    x+       1 or more instances of x
    x{m,n}   m through n occurrences of x
    x|y      x or y
    (x)      same as x
    x/y      x only if followed by y

 Examples:
 A single character such as b matches b in the source program. Special characters such as ., %, (, ), etc. have special meaning; quote them, e.g. "(".
 [a-dw-z] = a|b|c|d|w|x|y|z
 a(b|c) means ab or ac.
 ab? is equivalent to a|ab, and (ab)? is equivalent to ab|ε.
 [a-z]$ matches a–z if at the end of a line.
 ^[a-z] matches a–z if at the beginning of a line.

Lex: Example 2

A scanner that adds line numbers to text:

    %{
    #include <stdio.h>
    int lineno = 1;
    %}
    line .*\n
    %%
    {line} {printf("%5d %s", lineno++, yytext);}
    %%
    main() {
      yylex(); return 0;
    }

A scanner that selects only lines that end or begin with the letter 'a':

    %{
    #include <stdio.h>
    %}
    ends_with_a   .*a\n
    begins_with_a a.*\n
    %%
    {ends_with_a}   ECHO;
    {begins_with_a} ECHO;
    .*\n            ;
    %%
    main() {
      yylex(); return 0;
    }
Example 3: Pascal lexemes

    digit     [0-9]
    digits    {digit}+
    letter    [A-Za-z]
    l_or_d    ({letter}|{digit})
    sign      [+\-]
    dot       "."
    dotdot    ".."
    dtdgts    {dot}{digits}
    exponent  [Ee]{sign}?{digits}
    real      {digits}({dtdgts}|{exponent}|{dtdgts}{exponent})
    ident     {letter}{l_or_d}*
    newline   [\n]
    quote     [\"]
    wspace    [ \t]
    comment   ("{"[^}]*"}"|"(*""("*([^*)]|[^*]")"|"*"[^)])*"*"*"*)")
    string    \'([^'\n]|\'\')+\'
    badstring {quote}['"]*{quote}
    other     .

Practical Considerations

 How to handle words such as if and while? Note that these words match the lexical syntax of ordinary identifiers.
 Most languages make such words reserved. This facilitates parsing and ease of programming. If not, what does the following string really mean?
    if if then else = then;
 Reserving such words makes the regular expressions complicated, because reserved words are similar to identifiers:
 Use Nots
 Directly write the regular expression (very hard, though)
 Simple solution: Treat reserved words as ordinary identifiers and use a separate table lookup to detect them.
 Another solution: Define a distinct regular expression for each reserved word. A string may then match more than one regular expression, so define some mechanism for choosing one of them. In Lex, the order of listing of the token specifications makes the difference:

    if     {return(IF);}
    then   {return(THEN);}
    {id}   {return(ID);}

 Problem: The underlying finite automaton and its transition table will be significantly larger.
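The "separate table lookup" solution above is simple to sketch: scan every word as an identifier, then probe a reserved-word table before returning the token (the token codes and the table contents here are assumptions):

```c
#include <assert.h>
#include <string.h>

/* Reserved-word detection by table lookup: every word is scanned as an
   identifier, then checked against a small table of reserved words. */
enum { ID = 0, IF = 1, THEN = 2, WHILE = 3 };

static int classify(const char *lexeme) {
    static const struct { const char *word; int tok; } reserved[] = {
        { "if", IF }, { "then", THEN }, { "while", WHILE },
    };
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(lexeme, reserved[i].word) == 0)
            return reserved[i].tok;
    return ID;                       /* not reserved: ordinary identifier */
}
```

This keeps the identifier regular expression (and hence the automaton) small, at the cost of one extra lookup per identifier; a real compiler would typically use a hash table or the symbol table itself instead of a linear scan.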

Practical Considerations: multi-character lookahead

 Some languages require considerable analysis to resolve ambiguities in tokens, in particular FORTRAN.
Example: two statements and the corresponding lexical views:
    do 100 x = 1, 10     do100x=1,10
    do 100 x = 1. 10     do100x=1.10
In one case a do loop, in the other an assignment to a variable do100x. The scanner cannot determine whether it is a do loop or a variable assignment until it reaches the ',' character.
 A milder form occurs in Pascal or Ada: to scan 10..100, two-character lookahead is needed after the 10.
 Solution: requires backtracking through the states. A Fortran scanner puts a temporary marker after do, and continues looking until it has found a comma, a period, or the end of the statement. If a comma, it returns to the marker; else it ignores the marker and passes do100x to the parser.
 In Ada, the tic character ' is used both as an attribute symbol (arrayname'length) and to delimit character literals ('x').
Solution: Treat the tic character as a special case. When a tic is seen, a flag (set by the parser) is checked to see whether an attribute symbol or a character literal may be read next.
 In Lex, use the operator '/' to represent such lookaheads: r1/r2 means recognize the string denoted by r1 only if followed by r2. For the FORTRAN do:
    DO/(letter|digit)* = (letter|digit)*,

Practical Considerations - cont'd.

Lexical Errors
 The scanner may come across certain errors: invalid character, invalid token, etc. These are usually detected by reaching a state that is not final when there are no transitions for the current input symbol.
 The scanner cannot uncover syntactic, semantic, or logical errors: the view of the lexical analyzer is localized.
 What to do when lexical errors occur?
 Delete the characters read so far and restart scanning at the next unread character.
 Delete the first character read by the scanner and resume scanning at the character following it.
 Local transformations: replace a character by another, transpose adjacent characters, etc.
 Note that error recovery at this stage may create errors in the parsing stage: for instance, replacing beg#in by beg in will cause an error during the parsing phase.
Another approach is for the scanner to provide a warning token to the parser; the parser can use this information to do syntactic error repair.
 Error recovery for a common problem: runaway comments and strings. A possible solution: introduce an error token that represents a runaway string or comment. Once the runaway error token is recognized, a special error message may be issued.
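The local-transformation recovery above can be sketched for the slide's beg#in example. The "valid character" test used here is an assumption for illustration; a real scanner would consult its character classes:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Local-transformation recovery: overwrite a character that can appear
   in no token with a blank, so beg#in becomes beg in -- which, as the
   slide notes, may then fail during parsing instead of scanning. */
static void replace_invalid(char *s) {
    for (; *s; s++)
        if (!isalnum((unsigned char)*s) && strchr(" \t\n_", *s) == NULL)
            *s = ' ';                 /* replace offending character */
}
```

This illustrates why lexical-level repair is risky: the scanner now emits two well-formed identifiers, beg and in, and the error only surfaces later, in a phase with less information about what went wrong.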
