Compiler Design - Lexical Analysis
CHAPTER 02:
LEXICAL ANALYSIS
Compiler Design Chapter 02 : Lexical Analysis
3. Regular Expressions
A regular expression is a formula that describes a set of strings.
Components of a regular expression:
x the character x
. any character, usually except a newline
[xyz] any one of the characters x, y, z, …
R? an R or nothing (R is optional)
R* zero or more occurrences of R
R+ one or more occurrences of R
R1R2 an R1 followed by an R2
R1|R2 either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view
the set of strings in each token class as a language, we can use the regular-expression notation
to describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or
digits. In regular expression notation, we would write:
Identifier = letter (letter | digit)*
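The definition above can be checked directly in code. The following is a minimal sketch, assuming that "letter" and "digit" mean the ASCII classes reported by isalpha() and isdigit(); the function name is_identifier is hypothetical and not part of the course material.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when s matches: letter (letter | digit)*
   Assumption: letters and digits are those recognized by <ctype.h>. */
bool is_identifier(const char *s) {
    assert(s != NULL);
    /* The first symbol must be a letter (this also rejects the empty string). */
    if (!isalpha((unsigned char)s[0])) {
        return false;
    }
    /* Every following symbol must be a letter or a digit. */
    for (const char *p = s + 1; *p != '\0'; ++p) {
        if (!isalnum((unsigned char)*p)) {
            return false;
        }
    }
    return true;
}
```

For instance, is_identifier("x1") holds, while is_identifier("1x") does not, because the first symbol is a digit.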
Here are the rules that define the regular expressions over an alphabet Σ.
ε is a regular expression denoting { ε }, the language containing only the
empty string.
For each 'a' in Σ, a is a regular expression denoting { a }, the language with only
one string, consisting of the single symbol 'a'.
If R and S are regular expressions denoting the languages Lr and Ls, then
(R) | (S) denotes Lr ∪ Ls
R.S denotes Lr.Ls
R* denotes Lr*
3.1 Operations
3.2 Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = + | - or [+-]
The only problem left with the lexical analyzer is how to verify the validity of a regular
expression used in specifying the patterns of keywords of a language. A well-accepted
solution is to use finite automata for verification.
Example 1
For each of the following regular expressions ri, we want to determine the language denoted
by ri.
r1 = (a|b)(a|b)
r2 = (a|b)*
r3 = (a*b*)*
r4 = a | a*b
L(r1) = {aa, ab, ba, bb}. It is the set of all strings of length two over the alphabet {a, b}.
L(r2) = {ε, a, aa, aaa, …, b, bb, bbb, …, ab, aab, abb, abbb, ba, bba, bbaa, aba, …}. It is the set
of all strings composed of any number of a's and b's, in any order.
L(r3) = L(r2)
L(r4) = {a, b, ab, aab, aaab, …, aaaa…ab}. In addition to the string a, this set contains the strings
composed of any number of a's (possibly zero) followed by a single b.
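Membership in these languages can be tested mechanically. The sketch below covers r1 and r4 only; the function names in_r1 and in_r4 are hypothetical. Note that b itself belongs to L(r4), since a* also matches zero occurrences of a.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* r1: exactly two symbols, each of them an 'a' or a 'b'. */
bool in_r1(const char *s) {
    assert(s != NULL);
    return strlen(s) == 2 && strspn(s, "ab") == 2;
}

/* r4: the single string "a", or any number of a's (possibly zero)
   followed by exactly one b. */
bool in_r4(const char *s) {
    assert(s != NULL);
    if (strcmp(s, "a") == 0) {
        return true;
    }
    size_t n = strspn(s, "a");  /* length of the leading run of a's */
    return s[n] == 'b' && s[n + 1] == '\0';
}
```

With these definitions, in_r4("aab") and in_r4("b") hold, while in_r4("aa") and in_r4("abb") do not.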
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been
found, although the actual lexeme may not consist of all positions between the lexemeBegin and
forward pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall
additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled “start”
entering from nowhere. The transition diagram always begins in the start state before any input
symbols have been used.
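These conventions can be illustrated with a hand-coded transition diagram. The following is only a sketch, assuming a two-state diagram for identifiers whose accepting state carries a *; the names scan_identifier, lexeme_begin and forward are illustrative choices, not a fixed API.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Follows a transition diagram for identifiers, starting at lexeme_begin.
   Returns the length of the lexeme found, or 0 when no identifier starts there. */
size_t scan_identifier(const char *lexeme_begin) {
    assert(lexeme_begin != NULL);
    const char *forward = lexeme_begin;
    int state = 0;  /* the diagram always begins in the start state */
    for (;;) {
        /* Read one input symbol and advance the forward pointer. */
        unsigned char c = (unsigned char)*forward++;
        if (state == 0) {
            if (!isalpha(c)) {
                return 0;  /* no edge leaves the start state on this symbol */
            }
            state = 1;
        } else if (!isalnum(c)) {
            /* Accepting state marked with *: the symbol just read is not
               part of the lexeme, so retract forward one position. */
            --forward;
            return (size_t)(forward - lexeme_begin);
        }
    }
}
```

On the input "count1+2" the function reads the + before it can accept, retracts, and reports a lexeme of length 6.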
The transition function (δ) maps each pair made of a state from the finite set of states (Q) and an
input symbol from the finite set of input symbols (Σ) to a next state:
δ: Q × Σ → Q
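A transition function of this kind is commonly stored as a two-dimensional table indexed by state and symbol. The sketch below is a hand-built automaton for the expression r4 of Example 1 (the string a, or any number of a's followed by a b); the state numbering and the names dfa_delta and dfa_accepts are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { DFA_STATES = 5, DFA_SYMBOLS = 2, DEAD = 4 };

/* dfa_delta[q][x] is the next state from state q on symbol x
   (column 0 is 'a', column 1 is 'b'). */
static const int dfa_delta[DFA_STATES][DFA_SYMBOLS] = {
    /* 0: start state        */ {1, 2},
    /* 1: read "a"           */ {3, 2},
    /* 2: read a...ab        */ {DEAD, DEAD},
    /* 3: read two or more a */ {3, 2},
    /* 4: dead state         */ {DEAD, DEAD},
};

static const bool dfa_final[DFA_STATES] = {false, true, true, false, false};

/* Applies the transition function once per input symbol. */
bool dfa_accepts(const char *input) {
    assert(input != NULL);
    int state = 0;
    for (const char *p = input; *p != '\0'; ++p) {
        if (*p != 'a' && *p != 'b') {
            return false;  /* symbol outside the alphabet */
        }
        state = dfa_delta[state][(*p == 'a') ? 0 : 1];
    }
    return dfa_final[state];
}
```

Each input symbol costs exactly one table lookup, so the running time is proportional to the length of the input.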
• For each state e and each input symbol a, there is at most one arc labeled a leaving e.
In the transition table of a DFA, each entry contains at most one state (the input symbols are
the characters of the source text), so it is very easy to determine whether a string is accepted by the
automaton: there is at most one path from the initial state labeled by the string in question, and
the string is accepted if that path ends in a final state.
Note: The transition table of an NFA for a regular expression pattern can be considerably
smaller than that of a DFA. However, a DFA has the advantage of being able to recognize
regular expression patterns faster than the equivalent NFA.
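The speed difference comes from the fact that an NFA simulation must follow a set of states rather than a single state. The sketch below uses a small hypothetical NFA, not taken from the text, that accepts the strings over {a, b} ending in ab; sets of states are stored as bit masks, and the names nfa_delta and nfa_accepts are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NFA_STATES = 3 };

/* nfa_delta[q][x] is a bit mask: bit t is set when t is a possible next state.
   Column 0 is 'a', column 1 is 'b'. A table entry may hold several states. */
static const unsigned nfa_delta[NFA_STATES][2] = {
    /* state 0 */ {(1u << 0) | (1u << 1), (1u << 0)},
    /* state 1 */ {0u, (1u << 2)},
    /* state 2 */ {0u, 0u},
};

static const unsigned nfa_final = 1u << 2;

bool nfa_accepts(const char *input) {
    assert(input != NULL);
    unsigned current = 1u << 0;  /* set containing only the initial state */
    for (const char *p = input; *p != '\0'; ++p) {
        if (*p != 'a' && *p != 'b') {
            return false;  /* symbol outside the alphabet */
        }
        int x = (*p == 'a') ? 0 : 1;
        unsigned next = 0u;
        /* Unlike a DFA, every state of the current set must be followed. */
        for (int q = 0; q < NFA_STATES; ++q) {
            if (current & (1u << q)) {
                next |= nfa_delta[q][x];
            }
        }
        current = next;
    }
    return (current & nfa_final) != 0u;
}
```

The table has only three rows, but each input symbol now costs a loop over the current set of states instead of a single lookup.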
a) Basic Components:
b) Concatenation:
c) Alternation (Union):
o If the regular expression is an alternation (e.g., "a|b"), create a new initial state
and ε-transitions from this initial state to the initial states of the expressions
being alternated.
d) Kleene Star (Closure):
o If the regular expression has the Kleene star (e.g., "a*"), create a new initial state,
a new accepting state, and ε-transitions from the new initial state to both the initial
state of the expression being repeated and the new accepting state (so that zero
occurrences are accepted), and from the accepting state of the expression to both
the new accepting state and the initial state of the expression.
o If there is only one expression in the regular expression, mark its final state as
an accepting state.
f) Finalization:
o If there are multiple expressions or operations, you should now have a modified
NFA. You can simplify it further by removing the ε-transitions (by performing
ε-closure operations); this does not change the language recognized.
This process constructs an NFA that recognizes the language defined by the given regular
expression. The key is to understand the operations in the regular expression (concatenation,
alternation, and closure) and systematically build the corresponding components in the NFA.
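The construction steps can be sketched as a handful of functions that combine NFA fragments. This is one possible arrangement rather than a definitive implementation: all names (State, Fragment, nfa_symbol, nfa_concat, nfa_union, nfa_star, nfa_matches) are hypothetical, and two choices are assumptions: alternation also creates a new accepting state, as in the usual Thompson-style construction, and the closure adds an ε-edge from the new initial state to the new accepting state so that the empty string is accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { MAX_STATES = 64, NONE = -1 };

/* A state has at most one symbol edge (symbol, on_symbol) and two ε-edges. */
typedef struct { char symbol; int on_symbol; int eps[2]; } State;
/* A partial NFA with one initial state and one accepting state. */
typedef struct { int start, accept; } Fragment;

static State states[MAX_STATES];
static int state_count = 0;

static int new_state(void) {
    assert(state_count < MAX_STATES);
    states[state_count] = (State){ '\0', NONE, { NONE, NONE } };
    return state_count++;
}

static void add_eps(int from, int to) {
    int slot = (states[from].eps[0] == NONE) ? 0 : 1;
    assert(states[from].eps[slot] == NONE);  /* never more than two ε-edges */
    states[from].eps[slot] = to;
}

static Fragment new_fragment(void) {
    int start = new_state();
    int accept = new_state();
    return (Fragment){ start, accept };
}

/* a) Basic component: initial state --c--> accepting state. */
Fragment nfa_symbol(char c) {
    Fragment f = new_fragment();
    states[f.start].symbol = c;
    states[f.start].on_symbol = f.accept;
    return f;
}

/* b) Concatenation: accepting state of r --ε--> initial state of s. */
Fragment nfa_concat(Fragment r, Fragment s) {
    Fragment f = { r.start, s.accept };
    add_eps(r.accept, s.start);
    return f;
}

/* c) Alternation: new initial state with ε-edges to both alternatives. */
Fragment nfa_union(Fragment r, Fragment s) {
    Fragment f = new_fragment();
    add_eps(f.start, r.start);
    add_eps(f.start, s.start);
    add_eps(r.accept, f.accept);
    add_eps(s.accept, f.accept);
    return f;
}

/* Kleene star: new initial and accepting states around r. */
Fragment nfa_star(Fragment r) {
    Fragment f = new_fragment();
    add_eps(f.start, r.start);
    add_eps(f.start, f.accept);   /* zero occurrences */
    add_eps(r.accept, r.start);   /* one more occurrence */
    add_eps(r.accept, f.accept);  /* stop repeating */
    return f;
}

/* Adds to the set every state reachable through ε-edges. */
static void close_over_eps(bool *set) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int q = 0; q < state_count; ++q)
            for (int k = 0; set[q] && k < 2; ++k) {
                int t = states[q].eps[k];
                if (t != NONE && !set[t]) { set[t] = true; changed = true; }
            }
    }
}

/* Simulates the fragment on input; true when the whole input is accepted. */
bool nfa_matches(Fragment f, const char *input) {
    assert(input != NULL);
    bool current[MAX_STATES] = { false };
    current[f.start] = true;
    close_over_eps(current);
    for (const char *p = input; *p != '\0'; ++p) {
        bool next[MAX_STATES] = { false };
        for (int q = 0; q < state_count; ++q)
            if (current[q] && states[q].on_symbol != NONE && states[q].symbol == *p)
                next[states[q].on_symbol] = true;
        close_over_eps(next);
        memcpy(current, next, sizeof current);
    }
    return current[f.accept];
}
```

For example, nfa_star(nfa_union(nfa_symbol('a'), nfa_symbol('b'))) builds an NFA for (a|b)*, which nfa_matches then accepts on "abba" and on the empty string.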
b. For each state in the DFA (a set of NFA states) and each input symbol, collect the NFA states
reachable from that set on the symbol, then compute the ε-closure of the result. This set of states
becomes the next state in the DFA for the corresponding input symbol.
c. Repeat step b for all newly generated states until no new states can be added.
DFA States and Transitions:
Each state in the DFA corresponds to a set of states from the NFA. The transitions in the
DFA are determined by the transitions of the NFA on the individual characters.
DFA Accepting States:
A state in the DFA is an accepting state if it contains at least one accepting state from the
NFA.
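The procedure can be sketched with sets of NFA states stored as bit masks. The ε-NFA used here is a small hypothetical automaton for a*b, not one from the text, and the sketch assumes that the initial DFA state is the ε-closure of the NFA's initial state. All names (eps_closure, move_on, Dfa, subset_construction) are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { N_NFA = 5, N_SYM = 2, MAX_DFA = 1 << N_NFA };

/* Hypothetical ε-NFA for a*b, with initial state 0 and accepting state 4:
   0 -ε-> 1, 0 -ε-> 3, 1 -a-> 2, 2 -ε-> 1, 2 -ε-> 3, 3 -b-> 4. */
static const unsigned eps_edges[N_NFA] = {
    (1u << 1) | (1u << 3), 0u, (1u << 1) | (1u << 3), 0u, 0u
};
static const unsigned sym_edges[N_NFA][N_SYM] = {  /* column 0 = 'a', 1 = 'b' */
    {0u, 0u}, {1u << 2, 0u}, {0u, 0u}, {0u, 1u << 4}, {0u, 0u}
};
static const unsigned nfa_final = 1u << 4;

/* All states reachable from the set using only ε-transitions. */
unsigned eps_closure(unsigned set) {
    unsigned closure = set, previous;
    do {
        previous = closure;
        for (int q = 0; q < N_NFA; ++q)
            if (closure & (1u << q)) closure |= eps_edges[q];
    } while (closure != previous);
    return closure;
}

/* All states reachable from the set by reading symbol x once. */
unsigned move_on(unsigned set, int x) {
    unsigned out = 0u;
    for (int q = 0; q < N_NFA; ++q)
        if (set & (1u << q)) out |= sym_edges[q][x];
    return out;
}

typedef struct {
    int count;                 /* number of DFA states found so far */
    unsigned subset[MAX_DFA];  /* the set of NFA states behind each DFA state */
    int next[MAX_DFA][N_SYM];  /* DFA transition table */
    bool accepting[MAX_DFA];
} Dfa;

void subset_construction(Dfa *dfa) {
    assert(dfa != NULL);
    dfa->count = 1;
    dfa->subset[0] = eps_closure(1u << 0);  /* assumed initial DFA state */
    /* States at index d and above are the ones still to be processed. */
    for (int d = 0; d < dfa->count; ++d) {
        /* Accepting when the set contains an accepting NFA state. */
        dfa->accepting[d] = (dfa->subset[d] & nfa_final) != 0u;
        for (int x = 0; x < N_SYM; ++x) {
            unsigned target = eps_closure(move_on(dfa->subset[d], x));
            int t = 0;
            while (t < dfa->count && dfa->subset[t] != target) ++t;
            if (t == dfa->count) dfa->subset[dfa->count++] = target;  /* new state */
            dfa->next[d][x] = t;
        }
    }
}
```

In this sketch the empty set of NFA states also becomes a DFA state; it acts as a dead state from which no accepting state can be reached.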
Here's an example conversion of a simple NFA to a DFA:
Before minimizing a DFA, it must be completed, i.e., a trash state must be added, like the state R in the
diagram below.
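Completion amounts to redirecting every missing transition to the trash state, which then loops on itself for every symbol. A minimal sketch, assuming a two-symbol alphabet and a table in which missing transitions are marked with -1; the names complete_dfa and MISSING are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { MISSING = -1, SYMBOLS = 2 };

/* Completes a partial DFA in place and returns the new number of states.
   The table must have room for one extra row, used by the trash state. */
int complete_dfa(int table[][SYMBOLS], int num_states) {
    assert(table != NULL && num_states > 0);
    int trash = num_states;  /* index of the trash state */
    int used = 0;
    for (int q = 0; q < num_states; ++q) {
        for (int x = 0; x < SYMBOLS; ++x) {
            if (table[q][x] == MISSING) {
                table[q][x] = trash;  /* redirect the missing transition */
                used = 1;
            }
        }
    }
    if (!used) {
        return num_states;  /* the automaton was already complete */
    }
    /* The trash state can never be left. */
    for (int x = 0; x < SYMBOLS; ++x) {
        table[trash][x] = trash;
    }
    return num_states + 1;
}
```

The trash state is not accepting, so completing the automaton does not change the language it recognizes.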
"Lex" is a popular tool used for generating lexical analyzers (also known as scanners or
tokenizers) in compiler construction and other related fields. It helps in converting a stream of
characters into a stream of tokens, which are the basic units of a programming language. Lexical
analysis is the first step in the compilation process, where the source code is divided into
meaningful tokens for further processing.
Lex operates based on regular expressions. It allows you to define patterns for tokens using
regular expressions and associate corresponding actions that generate output when a match is
found. The tool generates C code for the lexical analyzer, which can then be integrated into a
larger compiler.
The Lex tool works in conjunction with the "Yacc" (Yet Another Compiler Compiler) tool.
While Lex handles the lexical analysis phase, Yacc deals with the parsing phase, helping to
construct the syntax tree and perform syntactic analysis.
1. Specification: You write a Lex specification file that defines regular expressions and
the corresponding actions. Each regular expression corresponds to a token pattern, and
the associated action generates output for the matched token.
2. Compilation: You run the Lex tool on the specification file. Lex generates C code for
the lexical analyzer based on your specifications.
3. Integration: You include the generated C code in your larger compiler project. The
lexical analyzer code reads input characters and matches them against the defined
patterns.
4. Tokenization: As input characters are read, the lexical analyzer uses the generated C
code to match patterns. When a pattern is matched, the associated action is executed,
producing the corresponding token.
5. Output: The token stream generated by the lexical analyzer is passed to the parser
(usually implemented using Yacc or another parsing tool) for further syntactic analysis.
Here's a simple example of a Lex specification for recognizing integers and identifiers in a
programming language:
%{
#include <stdio.h>
%}

DIGIT      [0-9]
IDENTIFIER [a-zA-Z][a-zA-Z0-9]*

%%
{DIGIT}+     { printf("INTEGER: %s\n", yytext); }
{IDENTIFIER} { printf("IDENTIFIER: %s\n", yytext); }
.|\n         { /* ignore other characters */ }
%%

/* Called at end of input; returning 1 tells the scanner to stop. */
int yywrap(void) { return 1; }

int main(void) {
    yylex();
    return 0;
}
This example would recognize integers (sequences of digits) and identifiers (starting with a
letter and followed by letters or digits) and print corresponding messages for each match.