The first phase of a compiler is the lexical analyzer, which recognizes basic language units called tokens. It classifies tokens into types like identifiers and keywords. Finite automata are used to precisely define the character sequences that belong to each token type. Regular expressions provide a textual representation of token definitions that can be used as input to a lexical analyzer generator tool, which produces a deterministic finite automaton table to recognize tokens.
Lexical Analysis
The first phase of the compiler is the lexical analyzer, also known as the scanner, which recognizes the basic language units, called tokens.

• The exact characters in a token are called its lexeme.

• Tokens are classified by token types, e.g. identifiers, punctuation marks, and keywords.

• Different types of tokens may have their own semantic attributes (or values), which must be extracted and stored in the symbol table.

• The lexical analyzer may perform semantic actions to extract such values and insert them in the symbol table (a Java sketch of tokens and their attributes follows this list).

• How to classify token types? It mainly depends on what form of input is needed by the next compiler phase, the parser. (The parser takes a sequence of tokens as its input.)

After we decide how to classify token types, we can use one of several ways to precisely express the classification.

• A common method is to use a finite automaton to define all character sequences (i.e. strings) which belong to a particular token type.
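As a concrete illustration, here is a minimal Java sketch (not from the text; the token-type names are hypothetical) of how a scanner might represent token types, lexemes, and semantic attributes:

    // Illustrative token types; a real language would have many more.
    enum TokenType { IDENTIFIER, KEYWORD, INT_CONSTANT, PLUS, EOF }

    // A token pairs a token type with its lexeme (the exact characters
    // matched). A semantic attribute, e.g. the numeric value of an
    // INT_CONSTANT, is extracted by a semantic action in the lexer and
    // would typically be stored in the symbol table.
    record Token(TokenType type, String lexeme, Object attribute) {
        static Token intConstant(String lexeme) {
            // Semantic action: convert the lexeme to its integer value.
            return new Token(TokenType.INT_CONSTANT, lexeme,
                             Integer.valueOf(lexeme));
        }
    }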
• We shall look at several examples (e.g. in Figure 2.3) of token types and their corresponding finite automata.

• A finite automaton has a set of states, a starting state, and a set of accepting states. An accepting state is also called a final state.
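Figure 2.3 is not reproduced here; as a hedged sketch, assuming the usual identifier shape letter (letter | digit)∗, the corresponding finite automaton can be written directly in Java:

    // A tiny DFA for identifiers: letter (letter | digit)*.
    // State 0 is the start state; state 1 is the only accepting state.
    // (The identifier shape is an assumption, not taken from Figure 2.3.)
    final class IdentifierDfa {
        static final int START = 0, IN_ID = 1, DEAD = -1;

        static int step(int state, char c) {
            switch (state) {
                case START: return Character.isLetter(c) ? IN_ID : DEAD;
                case IN_ID: return Character.isLetterOrDigit(c) ? IN_ID : DEAD;
                default:    return DEAD;
            }
        }

        static boolean accepts(String s) {
            int state = START;
            for (int i = 0; i < s.length() && state != DEAD; i++)
                state = step(state, s.charAt(i));
            return state == IN_ID;   // accepting (final) state reached
        }
    }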
Given the definitions of different token types, it is possible for a string to belong to more than one type. Such ambiguity is resolved by assigning priorities to token types. For example, keywords have a higher priority than identifiers.

• Finite automata for different token types are combined into a transition diagram for the lexical analyzer.

• Following the "longest match" rule, keep scanning the next character until there is no corresponding transition. The longest string which reaches an accepting state during the scanning is the recognized token (a sketch follows after this list).

• Go back to the starting state of the transition diagram, ready to recognize the next token in the program.
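Reusing the Token, TokenType, and IdentifierDfa sketches above, a hedged sketch of the longest-match rule with keyword priority (the keyword set and the //-style comment syntax are illustrative assumptions):

    import java.util.Set;

    // Longest match: keep stepping the automaton and remember the last
    // position at which an accepting state was seen; that position ends
    // the recognized token. Keywords take priority over identifiers.
    final class Scanner {
        static final Set<String> KEYWORDS = Set.of("if", "else", "while");

        static Token next(String input, int pos) {
            int state = IdentifierDfa.START;
            int lastAccept = -1;              // end of longest match so far
            for (int i = pos; i < input.length(); i++) {
                state = IdentifierDfa.step(state, input.charAt(i));
                if (state == IdentifierDfa.DEAD) break; // no transition: stop
                if (state == IdentifierDfa.IN_ID) lastAccept = i + 1;
            }
            if (lastAccept < 0)
                throw new IllegalArgumentException("no token at " + pos);
            String lexeme = input.substring(pos, lastAccept);
            // Priority rule: a keyword wins over the identifier token type.
            TokenType type = KEYWORDS.contains(lexeme) ? TokenType.KEYWORD
                                                       : TokenType.IDENTIFIER;
            // Scanning then restarts at lastAccept, i.e. back at the start
            // state of the transition diagram.
            return new Token(type, lexeme, null);
        }

        // The lexer can also strip comments: consume them without producing
        // a token ("//"-style line comments are an assumption).
        static int skipComment(String input, int pos) {
            if (input.startsWith("//", pos))
                while (pos < input.length() && input.charAt(pos) != '\n')
                    pos++;
            return pos;
        }
    }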
• Semantic actions can be specified in the transition diagram.

• (The lexical analyzer can also be used to remove comments from the program, as in the skipComment method above.)

Merging several transition diagrams into one may create the problem of nondeterminism.
A nondeterministic finite automaton (NFA) accepts an input string x if and only if there exists some path from the start state to some accepting state, such that the edge labels along the path spell out x. Let us look at examples of NFAs accepting and rejecting strings. Pay attention to the treatment of ε in spelling x: εy = y and yε = y.
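As a hedged sketch (assuming integer states and an NFA without ε-cycles), the "there exists some path" definition can be implemented literally: try every edge and backtrack on failure.

    import java.util.*;

    // NFA acceptance by trying paths recursively. ε edges consume no
    // input (εy = y and yε = y). Assumes no ε-cycles, or the recursion
    // would not terminate.
    final class Nfa {
        static final char EPSILON = '\u03B5';              // ε label
        record Edge(char label, int to) {}
        final Map<Integer, List<Edge>> edges = new HashMap<>();
        final int start;
        final Set<Integer> accepting;

        Nfa(int start, Set<Integer> accepting) {
            this.start = start; this.accepting = accepting;
        }
        void addEdge(int from, char label, int to) {
            edges.computeIfAbsent(from, k -> new ArrayList<>())
                 .add(new Edge(label, to));
        }

        boolean accepts(String x) { return path(start, x, 0); }

        private boolean path(int state, String x, int i) {
            if (i == x.length() && accepting.contains(state)) return true;
            for (Edge e : edges.getOrDefault(state, List.of())) {
                if (e.label() == EPSILON && path(e.to(), x, i))
                    return true;                           // εy = y
                if (i < x.length() && e.label() == x.charAt(i)
                        && path(e.to(), x, i + 1))
                    return true;
            }
            return false;                                  // backtrack
        }
    }

The repeated trial and failure in path is exactly the backtracking cost referred to next.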
Scanners based on NFAs can be inefficient due to the possibility of backtracking.
We study an algorithm which transforms an NFA into a DFA (deterministic finite automaton). This algorithm can be found on P. 27.
The intuition behind the algorithm which transforms an NFA to a DFA is factoring. Let us look at an extremely simple example first, to see the idea of factoring. The idea is formalized by identifying the set of states which can be reached after scanning a substring (a sketch follows below).
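A Java sketch of this factoring, reusing the Nfa class above: each DFA state is the set of NFA states reachable after scanning some prefix, and a DFA state is accepting iff it contains an accepting NFA state. The ε-closure helper used here is defined formally in the next paragraph.

    import java.util.*;

    final class SubsetConstruction {
        // ε-closure of a set: all states reachable via ε edges alone
        // (including the given states themselves).
        static Set<Integer> epsilonClosure(Nfa nfa, Set<Integer> states) {
            Deque<Integer> work = new ArrayDeque<>(states);
            Set<Integer> closure = new HashSet<>(states);
            while (!work.isEmpty())
                for (Nfa.Edge e : nfa.edges.getOrDefault(work.pop(), List.of()))
                    if (e.label() == Nfa.EPSILON && closure.add(e.to()))
                        work.push(e.to());
            return closure;
        }

        // Build the DFA transition table: build(...).get(S).get(c) is the
        // DFA state (a set of NFA states) reached from S on character c.
        static Map<Set<Integer>, Map<Character, Set<Integer>>>
                build(Nfa nfa, Set<Character> alphabet) {
            Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
            Set<Integer> startState = epsilonClosure(nfa, Set.of(nfa.start));
            Deque<Set<Integer>> work = new ArrayDeque<>(List.of(startState));
            while (!work.isEmpty()) {
                Set<Integer> s = work.pop();
                if (dfa.containsKey(s)) continue;        // already processed
                Map<Character, Set<Integer>> row = new HashMap<>();
                dfa.put(s, row);
                for (char c : alphabet) {
                    Set<Integer> next = new HashSet<>();
                    for (int q : s)
                        for (Nfa.Edge e : nfa.edges.getOrDefault(q, List.of()))
                            if (e.label() == c) next.add(e.to());
                    Set<Integer> target = epsilonClosure(nfa, next);
                    row.put(c, target);                  // empty set = dead state
                    work.push(target);
                }
            }
            return dfa;
        }
    }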
For an NFA which contains ε edges, we also need to define the ε-closure of a state s, which is the set of states reachable from s by taking ε transitions. The ε-closure of s of course includes s itself.

Regular Expressions

A DFA can be easily implemented based on a table look-up.
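For instance, a table-driven version of the earlier identifier automaton might look like this (an ASCII-only sketch; the textbook's table layout may differ):

    import java.util.Arrays;

    // Table-driven DFA: the transition function is a 2-D array indexed
    // by [state][character]; -1 is the dead state.
    final class TableDfa {
        static final int DEAD = -1;
        final int[][] table;        // table[state][ch] = next state or DEAD
        final boolean[] accepting;

        TableDfa(int[][] table, boolean[] accepting) {
            this.table = table; this.accepting = accepting;
        }

        boolean accepts(String s) {
            int state = 0;                              // start state
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c > 127) return false;              // ASCII-only sketch
                state = table[state][c];
                if (state == DEAD) return false;
            }
            return accepting[state];
        }

        // Build the table for: letter (letter | digit)*.
        static TableDfa identifiers() {
            int[][] t = new int[2][128];
            for (int[] row : t) Arrays.fill(row, DEAD);
            for (char c = 0; c < 128; c++) {
                if (Character.isLetter(c)) { t[0][c] = 1; t[1][c] = 1; }
                if (Character.isDigit(c))  { t[1][c] = 1; }
            }
            return new TableDfa(t, new boolean[] { false, true });
        }
    }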
For a programming language which has a big alphabet and many token types, it would be desirable to automatically construct such a table.
Such a lexical-analyzer generator takes a definition of token types as input and generates the DFA table. The graphical form of a DFA is not suitable as an input to the lexical-analyzer generator. Some textual representation is needed.
Regular expressions are such a textual representation.
Regular expressions are equivalent to DFAs in that:

1) For any given regular expression, there exists a DFA which accepts the same set of strings represented by the regular expression.

2) For any given DFA, there exists a regular expression which represents the same set of strings accepted by the DFA.

Regular expressions are composed by following a set of syntax rules:

• Given an input alphabet Σ, a regular expression is a string of symbols from the union of the set Σ and the set { (, ), ∗, |, ε }.

• The regular expression ε defines a language which contains the null string. What is the DFA to recognize it?

• The regular expression a defines the language { a }.

• If regular expressions RA and RB define languages A and B, respectively, then the regular expression (RA)|(RB) defines the language A ∪ B, the regular expression (RA)(RB) defines the language AB (concatenation), and the regular expression (RA)∗ defines the language A∗ (Kleene closure).

• The parentheses define the order of the construction operators (|, ∗, and concatenation) in regular expressions. Within the same pair of parentheses (or among the operators not in any parentheses), ∗ takes the highest precedence, concatenation comes next, and | comes last. For example, a|bc∗ is read as a|(b(c∗)).

• When a pair of parentheses is unnecessary for defining precedence, it can be omitted.

Let us look at a number of examples of regular expressions, including those in Figure 2.2.
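Figure 2.2 is not reproduced here; as purely illustrative examples: over Σ = {a, b, 0, 1}, the regular expression (a|b)(a|b|0|1)∗ defines identifier-like strings beginning with a letter, and (0|1)(0|1)∗ defines nonempty binary numerals.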
Two regular expressions are equivalent if they represent exactly the same set of strings. For example, (a|b)∗ and (a∗b∗)∗ are equivalent: both represent all strings over {a, b}.
There exist algorithms which, for any regular expression E, can directly construct a DFA to recognize the set of strings represented by E. We shall not study such a direct algorithm. Instead, we study an algorithm which constructs an NFA for the given regular expression (see P. 25).
We already know how to convert an NFA to a DFA.
There also exist algorithms which, for any DFA M, can construct a regular expression which represents the set of strings recognized by M. (Unfortunately, the regular expression generated by such algorithms can sometimes be difficult to read.) We do not discuss such algorithms in this course.
The compiler-generator tool JavaCC contains a lexical-analyzer generator. In our project, we will apply the tool to a simple language called miniJava. This will be discussed in our PSOs.
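For a flavor of the input format, JavaCC token declarations look roughly like this (a hedged sketch; the actual miniJava token definitions used in the PSOs will differ):

    SKIP : { " " | "\t" | "\n" | "\r" }

    TOKEN : {
      < IF : "if" >     // declared before IDENTIFIER, so "if" is a keyword
    | < IDENTIFIER : ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z","0"-"9"])* >
    | < INTEGER_LITERAL : (["0"-"9"])+ >
    }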