The first phase of a compiler is the lexical analyzer, which recognizes basic language units called tokens. It classifies tokens into types like identifiers and keywords. Finite automata are used to precisely define the character sequences that belong to each token type. Regular expressions provide a textual representation of token definitions that can be used as input to a lexical analyzer generator tool, which produces a deterministic finite automaton table to recognize tokens.
Lexical Analysis
The first phase of the compiler is the lexical analyzer, also known as the scanner, which recognizes the basic language units, called tokens.

• The exact characters in a token are called its lexeme.

• Tokens are classified by token types, e.g. identifiers, punctuation marks, and keywords.

• Different types of tokens may have their own semantic attributes (or values), which must be extracted and stored in the symbol table.

• The lexical analyzer may perform semantic actions to extract such values and insert them in the symbol table (a Java sketch of tokens and their attributes follows this list).

• How to classify token types? It mainly depends on what form of input is needed by the next compiler phase, the parser. (The parser takes a sequence of tokens as its input.)

After we decide how to classify token types, we can use one of several ways to precisely express the classification.

• A common method is to use a finite automaton to define all character sequences (i.e. strings) which belong to a particular token type.
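As a concrete illustration, here is a minimal Java sketch (not from the text; the token-type names are hypothetical) of how a scanner might represent token types, lexemes, and semantic attributes:

    // Illustrative token types; a real language would have many more.
    enum TokenType { IDENTIFIER, KEYWORD, INT_CONSTANT, PLUS, EOF }

    // A token pairs a token type with its lexeme (the exact characters
    // matched). A semantic attribute, e.g. the numeric value of an
    // INT_CONSTANT, is extracted by a semantic action in the lexer and
    // would typically be stored in the symbol table.
    record Token(TokenType type, String lexeme, Object attribute) {
        static Token intConstant(String lexeme) {
            // Semantic action: convert the lexeme to its integer value.
            return new Token(TokenType.INT_CONSTANT, lexeme,
                             Integer.valueOf(lexeme));
        }
    }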
• We shall look at several examples (e.g. in Figure 2.3) of token types and their corresponding finite automata.

• A finite automaton has a set of states, a starting state, and a set of accepting states. An accepting state is also called a final state.
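Figure 2.3 is not reproduced here; as a hedged sketch, assuming the usual identifier shape letter (letter | digit)∗, the corresponding finite automaton can be written directly in Java:

    // A tiny DFA for identifiers: letter (letter | digit)*.
    // State 0 is the start state; state 1 is the only accepting state.
    // (The identifier shape is an assumption, not taken from Figure 2.3.)
    final class IdentifierDfa {
        static final int START = 0, IN_ID = 1, DEAD = -1;

        static int step(int state, char c) {
            switch (state) {
                case START: return Character.isLetter(c) ? IN_ID : DEAD;
                case IN_ID: return Character.isLetterOrDigit(c) ? IN_ID : DEAD;
                default:    return DEAD;
            }
        }

        static boolean accepts(String s) {
            int state = START;
            for (int i = 0; i < s.length() && state != DEAD; i++)
                state = step(state, s.charAt(i));
            return state == IN_ID;   // accepting (final) state reached
        }
    }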
Given the definitions of different token types, it is possible for a string to belong to more than one type. Such ambiguity is resolved by assigning priorities to token types. For example, keywords have a higher priority than identifiers.

• Finite automata for different token types are combined into a transition diagram for the lexical analyzer.

• Following the "longest match" rule, keep scanning the next character until there is no corresponding transition. The longest string which reaches an accepting state during the scanning is the recognized token (a sketch follows after this list).

• Go back to the starting state of the transition diagram, ready to recognize the next token in the program.
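Reusing the Token, TokenType, and IdentifierDfa sketches above, a hedged sketch of the longest-match rule with keyword priority (the keyword set and the //-style comment syntax are illustrative assumptions):

    import java.util.Set;

    // Longest match: keep stepping the automaton and remember the last
    // position at which an accepting state was seen; that position ends
    // the recognized token. Keywords take priority over identifiers.
    final class Scanner {
        static final Set<String> KEYWORDS = Set.of("if", "else", "while");

        static Token next(String input, int pos) {
            int state = IdentifierDfa.START;
            int lastAccept = -1;              // end of longest match so far
            for (int i = pos; i < input.length(); i++) {
                state = IdentifierDfa.step(state, input.charAt(i));
                if (state == IdentifierDfa.DEAD) break; // no transition: stop
                if (state == IdentifierDfa.IN_ID) lastAccept = i + 1;
            }
            if (lastAccept < 0)
                throw new IllegalArgumentException("no token at " + pos);
            String lexeme = input.substring(pos, lastAccept);
            // Priority rule: a keyword wins over the identifier token type.
            TokenType type = KEYWORDS.contains(lexeme) ? TokenType.KEYWORD
                                                       : TokenType.IDENTIFIER;
            // Scanning then restarts at lastAccept, i.e. back at the start
            // state of the transition diagram.
            return new Token(type, lexeme, null);
        }

        // The lexer can also strip comments: consume them without producing
        // a token ("//"-style line comments are an assumption).
        static int skipComment(String input, int pos) {
            if (input.startsWith("//", pos))
                while (pos < input.length() && input.charAt(pos) != '\n')
                    pos++;
            return pos;
        }
    }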
• Semantic actions can be specified in the transition diagram.

• (The lexical analyzer can also be used to remove comments from the program, as in the skipComment method above.)

Merging several transition diagrams into one may create the problem of nondeterminism.
A nondeterministic finite automaton (NFA) accepts an input string x if and only if there exists some path from the start state to some accepting state, such that the edge labels along the path spell out x. Let us look at examples of NFAs accepting and rejecting strings. Pay attention to the treatment of ε in spelling x: εy = y and yε = y.
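As a hedged sketch (assuming integer states and an NFA without ε-cycles), the "there exists some path" definition can be implemented literally: try every edge and backtrack on failure.

    import java.util.*;

    // NFA acceptance by trying paths recursively. ε edges consume no
    // input (εy = y and yε = y). Assumes no ε-cycles, or the recursion
    // would not terminate.
    final class Nfa {
        static final char EPSILON = '\u03B5';              // ε label
        record Edge(char label, int to) {}
        final Map<Integer, List<Edge>> edges = new HashMap<>();
        final int start;
        final Set<Integer> accepting;

        Nfa(int start, Set<Integer> accepting) {
            this.start = start; this.accepting = accepting;
        }
        void addEdge(int from, char label, int to) {
            edges.computeIfAbsent(from, k -> new ArrayList<>())
                 .add(new Edge(label, to));
        }

        boolean accepts(String x) { return path(start, x, 0); }

        private boolean path(int state, String x, int i) {
            if (i == x.length() && accepting.contains(state)) return true;
            for (Edge e : edges.getOrDefault(state, List.of())) {
                if (e.label() == EPSILON && path(e.to(), x, i))
                    return true;                           // εy = y
                if (i < x.length() && e.label() == x.charAt(i)
                        && path(e.to(), x, i + 1))
                    return true;
            }
            return false;                                  // backtrack
        }
    }

The repeated trial and failure in path is exactly the backtracking cost referred to next.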
Scanners based on NFAs can be inefficient due to the possibility of backtracking.
We study an algorithm which transforms an NFA into a DFA (deterministic finite automaton). This algorithm can be found on P. 27.
The intuition behind the algorithm which transforms an NFA to a DFA is factoring. Let us look at an extremely simple example first, to see the idea of factoring. The idea is formalized by identifying the set of states which can be reached after scanning a substring (a sketch follows below).
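A Java sketch of this factoring, reusing the Nfa class above: each DFA state is the set of NFA states reachable after scanning some prefix, and a DFA state is accepting iff it contains an accepting NFA state. The ε-closure helper used here is defined formally in the next paragraph.

    import java.util.*;

    final class SubsetConstruction {
        // ε-closure of a set: all states reachable via ε edges alone
        // (including the given states themselves).
        static Set<Integer> epsilonClosure(Nfa nfa, Set<Integer> states) {
            Deque<Integer> work = new ArrayDeque<>(states);
            Set<Integer> closure = new HashSet<>(states);
            while (!work.isEmpty())
                for (Nfa.Edge e : nfa.edges.getOrDefault(work.pop(), List.of()))
                    if (e.label() == Nfa.EPSILON && closure.add(e.to()))
                        work.push(e.to());
            return closure;
        }

        // Build the DFA transition table: build(...).get(S).get(c) is the
        // DFA state (a set of NFA states) reached from S on character c.
        static Map<Set<Integer>, Map<Character, Set<Integer>>>
                build(Nfa nfa, Set<Character> alphabet) {
            Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
            Set<Integer> startState = epsilonClosure(nfa, Set.of(nfa.start));
            Deque<Set<Integer>> work = new ArrayDeque<>(List.of(startState));
            while (!work.isEmpty()) {
                Set<Integer> s = work.pop();
                if (dfa.containsKey(s)) continue;        // already processed
                Map<Character, Set<Integer>> row = new HashMap<>();
                dfa.put(s, row);
                for (char c : alphabet) {
                    Set<Integer> next = new HashSet<>();
                    for (int q : s)
                        for (Nfa.Edge e : nfa.edges.getOrDefault(q, List.of()))
                            if (e.label() == c) next.add(e.to());
                    Set<Integer> target = epsilonClosure(nfa, next);
                    row.put(c, target);                  // empty set = dead state
                    work.push(target);
                }
            }
            return dfa;
        }
    }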
For an NFA which contains ε edges, we also need to define the ε-closure of a state s, which is the set of states reachable from s by taking ε transitions. The ε-closure of s of course includes s itself.

Regular Expressions

A DFA can be easily implemented based on a table look-up.
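For instance, a table-driven version of the earlier identifier automaton might look like this (an ASCII-only sketch; the textbook's table layout may differ):

    import java.util.Arrays;

    // Table-driven DFA: the transition function is a 2-D array indexed
    // by [state][character]; -1 is the dead state.
    final class TableDfa {
        static final int DEAD = -1;
        final int[][] table;        // table[state][ch] = next state or DEAD
        final boolean[] accepting;

        TableDfa(int[][] table, boolean[] accepting) {
            this.table = table; this.accepting = accepting;
        }

        boolean accepts(String s) {
            int state = 0;                              // start state
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c > 127) return false;              // ASCII-only sketch
                state = table[state][c];
                if (state == DEAD) return false;
            }
            return accepting[state];
        }

        // Build the table for: letter (letter | digit)*.
        static TableDfa identifiers() {
            int[][] t = new int[2][128];
            for (int[] row : t) Arrays.fill(row, DEAD);
            for (char c = 0; c < 128; c++) {
                if (Character.isLetter(c)) { t[0][c] = 1; t[1][c] = 1; }
                if (Character.isDigit(c))  { t[1][c] = 1; }
            }
            return new TableDfa(t, new boolean[] { false, true });
        }
    }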
For a programming language which has a big alphabet and many token types, it would be desirable to automatically construct such a table.
Such a lexical-analyzer generator takes a definition of token types as input and generates the DFA table. The graphical form of a DFA is not suitable as an input to the lexical-analyzer generator. Some textual representation is needed.
Regular expressions are such a textual representation.
Regular expressions are equivalent to DFAs in that:

1) For any given regular expression, there exists a DFA which accepts the same set of strings represented by the regular expression.

2) For any given DFA, there exists a regular expression which represents the same set of strings accepted by the DFA.

Regular expressions are composed by following a set of syntax rules:

• Given an input alphabet Σ, a regular expression is a string of symbols from the union of the set Σ and the set { (, ), ∗, |, ε }.

• The regular expression ε defines a language which contains the null string. What is the DFA to recognize it?

• The regular expression a defines the language { a }.

• If regular expressions RA and RB define languages A and B, respectively, then the regular expression (RA)|(RB) defines the language A ∪ B, the regular expression (RA)(RB) defines the language AB (concatenation), and the regular expression (RA)∗ defines the language A∗ (Kleene closure).

• The parentheses define the order of the construction operators (|, ∗, and concatenation) in regular expressions. Within the same pair of parentheses (or among the operators not in any parentheses), ∗ takes the highest precedence, concatenation comes next, and | comes last. For example, a|bc∗ is read as a|(b(c∗)).

• When a pair of parentheses is unnecessary for defining precedence, it can be omitted.

Let us look at a number of examples of regular expressions, including those in Figure 2.2.
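Figure 2.2 is not reproduced here; as purely illustrative examples: over Σ = {a, b, 0, 1}, the regular expression (a|b)(a|b|0|1)∗ defines identifier-like strings beginning with a letter, and (0|1)(0|1)∗ defines nonempty binary numerals.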
Two regular expressions are equivalent if they represent exactly the same set of strings. For example, (a|b)∗ and (a∗b∗)∗ are equivalent: both represent all strings over {a, b}.
There exist algorithms which, for any regular expression E, can directly construct a DFA to recognize the set of strings represented by E. We shall not study such a direct algorithm. Instead, we study an algorithm which constructs an NFA for the given regular expression (see P. 25).
We already know how to convert an NFA to a DFA.
There also exist algorithms which, for any DFA M, can construct a regular expression which represents the set of strings recognized by M. (Unfortunately, the regular expression generated by such algorithms can sometimes be difficult to read.) We do not discuss such algorithms in this course.
The compiler-generator tool JavaCC contains a lexical-analyzer generator. In our project, we will apply the tool to a simple language called miniJava. This will be discussed in our PSOs.
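For a flavor of the input format, JavaCC token declarations look roughly like this (a hedged sketch; the actual miniJava token definitions used in the PSOs will differ):

    SKIP : { " " | "\t" | "\n" | "\r" }

    TOKEN : {
      < IF : "if" >     // declared before IDENTIFIER, so "if" is a keyword
    | < IDENTIFIER : ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z","0"-"9"])* >
    | < INTEGER_LITERAL : (["0"-"9"])+ >
    }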