Lexical Analysis
What's a Token?
The output of lexical analysis is a stream of tokens.
A token is a syntactic category.
In English:
noun, verb, adjective, ...
In a programming language:
Identifier, Integer, Keyword, Whitespace, ...
Identifiers: strings of letters or digits, starting with a letter
Integers: non-empty strings of digits
Keywords: else or if or begin or ...
Whitespace: non-empty sequences of blanks, newlines, and tabs
OpenPars: left-parentheses
2. Return:
   1. The type or syntactic category of the token
   2. The value or lexeme of the token (the substring itself)
Example
Our example again:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Lexical Analyzer: Implementation
The lexer usually discards uninteresting tokens that don't contribute to parsing.
Examples: whitespace, comments
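As an illustration of returning (type, lexeme) pairs and discarding whitespace, here is a minimal Python sketch run on the example input. The token class names other than those listed earlier (Relation, Assign, ClosePar, Semicolon) and all of the patterns are assumptions made for this example, not part of the slides. Note that Python's `re` tries alternatives in order, so `==` is listed before `=`.

```python
import re

# Hypothetical token classes; names and patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("Keyword",    r"\b(?:if|else)\b"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("Relation",   r"=="),
    ("Assign",     r"="),
    ("OpenPar",    r"\("),
    ("ClosePar",   r"\)"),
    ("Semicolon",  r";"),
    ("Whitespace", r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (token type, lexeme) pairs, dropping whitespace tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "Whitespace":      # the lexer discards whitespace
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
tokens = tokenize(source)
for kind, lexeme in tokens:
    print(kind, repr(lexeme))
```

The parser would then see only the 15 meaningful tokens, with no Whitespace entries.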
Next
We need:
A way to describe the lexemes of each token
A way to resolve ambiguities
Is "if" two variables "i" and "f"? Is "==" two equal signs "=" "="?
Regular Languages
There are several formalisms for specifying tokens.
Regular languages are the most popular:
Simple and useful theory
Easy to understand
Efficient implementations
Languages
Def. Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ (Σ is called the alphabet).
Examples of Languages
Alphabet = English characters
Language = English sentences
Not every string of English characters is an English sentence.
Alphabet = ASCII
Language = C programs
Notation
Languages are sets of strings, so we need some notation for specifying which sets we want.
For lexical analysis we care about regular languages, which can be described using regular expressions.
Regular Expressions and Regular Languages
Each regular expression is a notation for a regular language (a set of words).
If A is a regular expression, then we write L(A) to refer to the language denoted by A.
Regular Expressions
Single character: c
L(c) = { c } (for any character c in the alphabet Σ)
Concatenation: AB (where A and B are regular expressions)
L(AB) = { ab | a ∈ L(A) and b ∈ L(B) }
Another example:
L((0 | 1) (0 | 1)) = { 00, 01, 10, 11 }
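For finite languages, these definitions can be checked directly by modeling a language as a Python set of strings. This is only a sketch: the function names are made up for illustration, and it assumes the usual reading of A | B as the union of L(A) and L(B).

```python
def lang_char(c):
    """L(c) = { c } for a single character c."""
    return {c}

def lang_concat(la, lb):
    """L(AB) = { ab | a in L(A) and b in L(B) }."""
    return {a + b for a in la for b in lb}

def lang_union(la, lb):
    """L(A | B), assumed here to be the union of L(A) and L(B)."""
    return la | lb

bit = lang_union(lang_char("0"), lang_char("1"))   # L(0 | 1)
two_bits = lang_concat(bit, bit)                   # L((0 | 1) (0 | 1))
print(sorted(two_bits))
```

The result contains exactly the four strings 00, 01, 10, and 11.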
More Compound Regular Expressions
So far we do not have a notation for infinite languages.
Iteration: A*
L(A*) = { "" } ∪ L(A) ∪ L(AA) ∪ L(AAA) ∪ ...
Examples:
0* = { "", 0, 00, 000, ... }
1 0* = { strings starting with 1 and followed by 0s }
Epsilon: ε
L(ε) = { "" }
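Because L(A*) is infinite whenever L(A) contains a non-empty string, it cannot be enumerated in full. The sketch below lists only the strings built from at most `max_reps` repetitions; the bound and the function name are assumptions introduced for the example.

```python
def lang_star(la, max_reps):
    """Strings formed by concatenating at most max_reps strings from la.

    Zero repetitions gives the empty string, which is why "" is always
    a member of L(A*).
    """
    result = {""}
    layer = {""}
    for _ in range(max_reps):
        layer = {s + a for s in layer for a in la}
        result |= layer
    return result

zeros = lang_star({"0"}, 3)
print(sorted(zeros, key=len))
```

With a bound of 3, `zeros` holds the empty string, 0, 00, and 000, matching the first four members of the 0* example.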
Example: Keywords
keyword = else | if | begin | ...
Example: Integers
Integer: a non-empty string of digits
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
number = digit digit*
letter = A | ... | Z | a | ... | z
identifier = letter (letter | digit)*
Is (letter* | digit*) the same as (letter | digit)*?
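One way to explore the question is to try both expressions on a few strings. The sketch below uses Python's `re` module with character classes standing in for letter and digit (an assumption that restricts both to ASCII); `fullmatch` requires the whole string to match.

```python
import re

LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

either_run = re.compile(f"(?:{LETTER}*|{DIGIT}*)")        # (letter* | digit*)
mixed_run = re.compile(f"(?:{LETTER}|{DIGIT})*")          # (letter | digit)*
identifier = re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})*")  # letter (letter | digit)*

def matches(pattern, s):
    """True if the entire string s is in the language of pattern."""
    return pattern.fullmatch(s) is not None

for s in ["abc", "123", "a1", "9x"]:
    print(s, matches(either_run, s), matches(mixed_run, s), matches(identifier, s))
```

The two expressions differ: (letter* | digit*) accepts only strings that are all letters or all digits, so it rejects a1, while (letter | digit)* accepts any mixture.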
Whitespace: (' ' | '\t' | '\n')+
Finite Automata
Lexical structure is specified using regular expressions; finite automata are then created to recognize those regular expressions.
Finite automata:
Deterministic Finite Automata (DFAs)
Non-deterministic Finite Automata (NFAs)
Regular Expressions
3. Construct R, matching all lexemes for all tokens:
R = Keyword | Identifier | Number | ...
  = R1 | R2 | R3 | ...
Facts: If s ∈ L(R) then s is a lexeme.
Furthermore, s ∈ L(Ri) for some i.
This i determines the token that is reported.
4. Let the input be x1 ... xn (x1 ... xn are characters in the language alphabet).
For 1 ≤ i ≤ n, check: is x1...xi ∈ L(R)?
5. It must be that x1...xi ∈ L(Rj) for some i and j.
6. Remove x1...xi from the input and go to (4).
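The loop in steps 4 to 6 can be sketched in Python as follows. The steps do not say which i or j to pick when several qualify, so this sketch assumes two common conventions: take the longest matching prefix, and on a tie prefer the rule listed first. The rule set is a small illustrative stand-in for R1, R2, R3, and checking every prefix this way is quadratic, so it is meant for clarity rather than speed.

```python
import re

# R = R1 | R2 | R3 | ...; list order encodes priority (an assumption).
RULES = [
    ("Keyword",    re.compile(r"if|else|begin")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number",     re.compile(r"[0-9]+")),
    ("Whitespace", re.compile(r"[ \t\n]+")),
]

def next_token(text, start):
    """Steps 4-5: find a prefix x1...xi of text[start:] that is in some L(Rj)."""
    for end in range(len(text), start, -1):     # longest prefix first (assumed)
        prefix = text[start:end]
        for name, rule in RULES:                # first listed rule wins ties
            if rule.fullmatch(prefix):
                return name, prefix
    raise ValueError(f"no rule matches at position {start}")

def lex(text):
    tokens = []
    pos = 0
    while pos < len(text):
        name, lexeme = next_token(text, pos)
        tokens.append((name, lexeme))
        pos += len(lexeme)                      # step 6: remove x1...xi, repeat
    return tokens

print(lex("if iffy 42"))
```

Under these assumptions `if` is reported as a Keyword, while `iffy` is a single Identifier rather than a keyword followed by `fy`.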
Finite Automata
Regular expressions = specification
Finite automata = implementation
A finite automaton consists of:
An input alphabet Σ
A set of states S
A start state n
A set of accepting states F ⊆ S
A set of transitions state --input--> state
Transition s1 --a--> s2: in state s1, on input a, go to state s2.
Finite Automata State Graphs
[Figure: graphical notation for a state, the start state, an accepting state, and a transition labeled with its input character a]
A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start to some accepting state
Another Simple Example
A finite automaton accepting any number of 1s followed by a single 0
Alphabet: {0,1}
[Figure: state graph with transitions labeled 1 and 0]
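This automaton can be simulated with a small transition table. The sketch below assumes a two-state layout: a start state that loops on 1 and moves to an accepting state on 0. The state names `start` and `done` are invented for the example.

```python
# (state, input character) -> next state
TRANSITIONS = {
    ("start", "1"): "start",   # any number of 1s
    ("start", "0"): "done",    # followed by a single 0
}
START = "start"
ACCEPTING = {"done"}

def accepts(s):
    """Follow one transition per character; accept if we end in an accepting state."""
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:      # no transition available: stuck, so reject
            return False
    return state in ACCEPTING

for s in ["0", "1110", "", "111", "100"]:
    print(repr(s), accepts(s))
```

Since the accepting state has no outgoing transitions, any input after the single 0 causes the string to be rejected.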
Execution of Finite Automata
A DFA can take only one path through the state graph, completely determined by the input.
NFA vs. DFA (1)
NFAs and DFAs recognize the same set of languages (regular languages).
NFA vs. DFA (2)
For a given language the NFA can be simpler than the DFA.
[Figure: state graphs of an NFA and a DFA for the same language, with transitions labeled 0 and 1]
Lexical Specification
For ε: [NFA diagram]
For input a: [NFA diagram]
For A | B: [NFA diagram]
Example of RE -> NFA conversion
Consider the regular expression (1 | 0)*1
The NFA is:
[Figure: NFA state graph for (1 | 0)*1, with states labeled by letters (A, C, D, F, G, H, I, ...)]
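A conversion of this kind can be sketched in Python by gluing small NFAs together with ε-moves and then simulating the result on sets of states. This is a Thompson-style construction written for illustration: the class and function names are assumptions, states are numbered rather than lettered, and the resulting graph need not match the figure state for state.

```python
import itertools

EPS = None                      # label used for epsilon moves
_ids = itertools.count()        # fresh state numbers

class NFA:
    def __init__(self):
        self.start = next(_ids)
        self.accept = next(_ids)
        self.edges = []         # (from_state, label, to_state)

def char(c):
    n = NFA()
    n.edges = [(n.start, c, n.accept)]
    return n

def union(a, b):                # A | B
    n = NFA()
    n.edges = a.edges + b.edges + [
        (n.start, EPS, a.start), (n.start, EPS, b.start),
        (a.accept, EPS, n.accept), (b.accept, EPS, n.accept)]
    return n

def concat(a, b):               # AB
    n = NFA()
    n.edges = a.edges + b.edges + [
        (n.start, EPS, a.start), (a.accept, EPS, b.start),
        (b.accept, EPS, n.accept)]
    return n

def star(a):                    # A*
    n = NFA()
    n.edges = a.edges + [
        (n.start, EPS, a.start), (n.start, EPS, n.accept),
        (a.accept, EPS, a.start), (a.accept, EPS, n.accept)]
    return n

def eps_closure(nfa, states):
    """All states reachable from `states` using only epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in nfa.edges:
            if src == s and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(nfa, text):
    """Track every state the NFA could be in after each character."""
    current = eps_closure(nfa, {nfa.start})
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = eps_closure(nfa, moved)
    return nfa.accept in current

nfa = concat(star(union(char("1"), char("0"))), char("1"))   # (1 | 0)*1
for s in ["1", "0101", "10", ""]:
    print(repr(s), accepts(nfa, s))
```

The NFA accepts exactly the strings over {0,1} that end in 1, so `1` and `0101` are accepted while `10` and the empty string are not.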
Transition Table
Lex Specifications
A Lex program consists of three parts:
1. Declarations
%%
2. Translation rules
%%
3. Auxiliary procedures