Lexical Analysis and Compiler Basics
What Do Compilers Do
• Compilers may generate three types of code:
– Pure Machine Code
• Machine instruction set without assuming the existence
of any operating system or library.
• Mostly OS code or embedded applications.
– Augmented Machine Code
• Code with OS routines and runtime support routines.
• The most common case.
– Virtual Machine Code
• Virtual instructions, can be run on any architecture with
a virtual machine interpreter or a just-in-time compiler
• Ex. Java
Source Program → Compiler → Target Program
Compilers
• Source languages: Fortran, Pascal, C, etc.
• Target languages: another PL, machine Lang
• Compilers:
– Single-pass
– Multi-pass
– Load-and-Go
– Debugging
– Optimizing
Source Program + Input → Interpreter → Output (and error messages)
Preprocessors, Compilers,
Assemblers, and Linkers

Skeletal Source Program
  → Preprocessor →
Source Program
  → Compiler →
Target Assembly Program
  → Assembler →
Relocatable Object Code
  → Linker (with Libraries and Relocatable Object Files) →
Absolute Machine Code

Try for example: gcc -v myprog.c
Phases of Compiler

Source Program (Character Stream) → Scanner → Tokens → Parser →
Syntactic Structure → Semantic Routines → Intermediate Representation →
Optimizer → Code Generator → Target machine code

Symbol and Attribute Tables (used by all phases of the compiler)

Scanner
The scanner begins the analysis of the source program by reading the
input, character by character, and grouping characters into individual
words and symbols (tokens).
RE (Regular Expression)
NFA (Non-deterministic Finite Automata)
DFA (Deterministic Finite Automata)
LEX
Parser
Given a formal syntax specification (typically as a context-free
grammar, CFG), the parser reads tokens and groups them into units as
specified by the productions of the CFG being used.
As syntactic structure is recognized, the parser either calls
corresponding semantic routines directly or builds a syntax tree.
CFG (Context-Free Grammar)
BNF (Backus-Naur Form)
GAA (Grammar Analysis Algorithms)
LL, LR, SLR, LALR Parsers
YACC
Semantic Routines
Perform two functions:
  Check the static semantics of each construct
  Do the actual translation
The heart of a compiler
Syntax Directed Translation
Semantic Processing Techniques
IR (Intermediate Representation)
Optimizer
The IR code generated by the semantic routines is analyzed and
transformed into functionally equivalent but improved IR code.
This phase can be very complex and slow.
Peephole optimization; loop optimization, register allocation, code
scheduling
Register and Temporary Management
Code Generator
Interpretive Code Generation
Generating Code from Tree/DAG
Grammar-Based Code Generator
Target machine code
Symbol-table Management
• To record the identifiers in source program
– Identifier is detected by lexical analysis and then is
stored in symbol table
• To collect the attributes of identifiers
(not by lexical analysis)
– Storage allocation : memory address
– Types
– Scope (where it is valid, local or global)
– Arguments (in case of procedure names)
• Number and types of arguments
• Call by value or by reference
• Return types
Translation of a Statement
The Reason Why Lexical Analysis is a
Separate Phase (Issues in Lexical Analysis)
• Simplifies the design of the compiler
– Without it, LL(1) or LR(1) parsing with one token of lookahead would not be practical
• Provides efficient implementation
– Systematic techniques to implement lexical analyzers
by hand or automatically
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character
encodings can be more easily translated
Interaction of the Lexical
Analyzer with the Parser

Source Program → Lexical Analyzer → (Token, tokenval) → Parser
Parser → Lexical Analyzer: get next token
Both report errors and consult the Symbol Table
Attributes of Tokens

The statement y := 31 + 28 * x is scanned as the token stream

<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>

Each token is passed to the parser as a token name plus tokenval (the
token attribute).
Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional
attribute value. The token name is an abstract symbol representing
a kind of lexical unit, e.g., a particular keyword, or a sequence of
input characters denoting an identifier. The token names are the
input symbols that the parser processes.
– For example: id and num
• A lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical
analyzer as an instance of that token.
– For example: abc and 123
Tokens, Patterns, and Lexemes
• A pattern is a description of the form that the lexemes of a
token may take. In the case of a keyword as a token, the
pattern is just the sequence of characters that form the
keyword. For identifiers and some other tokens, the pattern
is a more complex structure that is matched by many
strings.
– For example: “letter followed by letters and digits” and “non-empty
sequence of digits”
Example
• Consider the Pascal statement
– const pi = 3.1416;
Tokens, Patterns, and Lexemes
• In many programming languages, the following classes
cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the
keyword itself.
2. Tokens for the operators, either individually or in classes such as the token
comparison
3. One token representing all identifiers
4. One or more tokens representing constants, such as numbers and literal
strings
5. Tokens for each punctuation symbol, such as left and right parentheses,
comma, and semicolon.
Exercise
Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical
analyzer must provide the subsequent compiler phases
additional information about the particular lexeme that
matched.
– For example, the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was found
in the source program.
• Thus, in many cases the lexical analyzer returns to the parser
not only a token name, but an attribute value that describes
the lexeme represented by the token; the token name
influences parsing decisions, while the attribute value
influences translation of tokens after the parse.
Attributes for Tokens
• We shall assume that tokens have at most one associated
attribute, although this attribute may have a structure that
combines several pieces of information.
• The most important example is the token id, where we
need to associate with the token a great deal of
information.
• Normally, information about an identifier - e.g., its lexeme,
its type, and the location at which it is first found (in case
an error message about that identifier must be issued) - is
kept in the symbol table.
• Thus, the appropriate attribute value for an identifier is a
pointer to the symbol-table entry for that identifier.
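This attribute scheme can be sketched in C. The function below is a minimal, hypothetical symbol table (the name install_id echoes the Lex examples later in these notes, but this signature and storage layout are our assumptions): it returns the index of an identifier's entry, inserting the lexeme on first sight.

```c
#include <stdlib.h>
#include <string.h>

/* Minimal symbol-table sketch: the attribute value returned to the
   parser for an identifier token is an index into this table. */
#define MAX_SYMS 100

static char *sym_names[MAX_SYMS];
static int sym_count = 0;

/* Return the table entry for lexeme, inserting it on first sight. */
int install_id(const char *lexeme)
{
    for (int i = 0; i < sym_count; i++)
        if (strcmp(sym_names[i], lexeme) == 0)
            return i;                    /* already present: reuse the entry */
    char *copy = malloc(strlen(lexeme) + 1);
    strcpy(copy, lexeme);                /* keep our own copy of the lexeme */
    sym_names[sym_count] = copy;
    return sym_count++;
}
```

Returning the same index for repeated occurrences of an identifier is what makes the attribute act as a pointer to one shared entry.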
Example
Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of
other components, that there is a source-code error.
– Example: fi(a==f(x)) …
• fi is a valid lexeme for the token id to the parser.
• However, suppose a situation arises in which the lexical
analyzer is unable to proceed because none of the patterns
for tokens matches any prefix of the remaining input.
• The simplest recovery strategy is "panic mode" recovery.
– We delete successive characters from the remaining input, until the
lexical analyzer can find a well-formed token at the beginning of
what input is left.
– This recovery technique may confuse the parser, but in an
interactive computing environment it may be quite adequate.
Lexical Errors
• Other possible error-recovery actions are:
– Delete one character from the remaining input.
– Insert a missing character into the remaining input.
– Replace a character by another character.
– Transpose two adjacent characters.
Input Buffering
• Let us examine some ways that the simple but important task
of reading the source program can be sped up.
• This task is made difficult by the fact that we often have to
look one or more characters beyond the next lexeme before we
can be sure we have the right lexeme.
• Thus, we shall introduce a two-buffer scheme that handles
large lookaheads safely.
• We then consider an improvement involving "sentinels" that
saves time checking for the ends of buffers.
Input Buffering
• Two-buffer input scheme to look ahead on
the input and identify tokens
• Buffer pairs
• Sentinels (Guards)
Input Buffering
• Buffer Pairs
– Because of the amount of time taken to process
characters and the large number of characters that must
be processed during the compilation of a large source
program, specialized buffering techniques have been
developed to reduce the amount of overhead required to
process a single input character.
– An important scheme involves two buffers that are
alternately reloaded, as suggested in the figure.
Input Buffering
• Buffer Pairs
– Each buffer is of the same size N, and N is usually the
size of a disk block, e.g., 4096 bytes.
– Using one system read command we can read N
characters into a buffer, rather than using one system
call per character.
– If fewer than N characters remain in the input file, then
a special character, represented by eof, marks the end
of the source file and is different from any possible
character of the source program.
Input Buffering
• Two pointers to the input are maintained:
– Pointer lexemeBegin, marks the beginning of the current lexeme,
whose extent we are attempting to determine.
– Pointer forward scans ahead until a pattern match is found
• Once the next lexeme is determined, forward is set to the
character at its right end.
• Then, after the lexeme is recorded as an attribute value of a
token returned to the parser, lexemeBegin is set to the
character immediately after the lexeme just found.
Input Buffering
• Advancing forward requires that we first test whether we
have reached the end of one of the buffers, and if so, we must
reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer.
• As long as we never need to look so far ahead of the actual
lexeme that the sum of the lexeme's length plus the distance
we look ahead is greater than N, we shall never overwrite the
lexeme in its buffer before determining it.
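The buffer-pair discipline can be sketched in C. All names here (buffers_init, next_char, reload) are ours, not from any standard scanner API, and the sketch assumes the source contains no NUL bytes, so a NUL can serve as the eof sentinel; a tiny N is used so the test actually exercises buffer reloads.

```c
#include <stdio.h>

/* Illustrative sketch of the buffer-pair scheme.  Each buffer of size N
   has one extra sentinel slot after its valid data, so the scanner
   detects both "end of buffer" and "end of input" with a single
   comparison against the sentinel. */
#define N 8                 /* tiny buffer size, to force reloads */
#define SENTINEL '\0'

static char buf[2][N + 1];  /* two buffers, each with a sentinel slot */
static FILE *src;
static char *forward;       /* the scanning pointer */
static int cur;             /* which buffer forward currently points into */

static void reload(int which)
{
    size_t n = fread(buf[which], 1, N, src);
    buf[which][n] = SENTINEL;        /* sentinel marks end of valid data */
}

void buffers_init(FILE *f)
{
    src = f;
    cur = 0;
    reload(0);
    forward = buf[0];
}

/* Return the next character, swapping buffers at a full-buffer sentinel. */
int next_char(void)
{
    if (*forward == SENTINEL) {
        if (forward == buf[cur] + N) {   /* sentinel at the very end:
                                            reload and switch buffers */
            cur = 1 - cur;
            reload(cur);
            forward = buf[cur];
        }
        if (*forward == SENTINEL)        /* sentinel mid-buffer:
                                            true end of input */
            return EOF;
    }
    return *forward++;
}
```

A real scanner would also maintain lexemeBegin and refuse lookahead longer than N, exactly as the text above describes.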
Specification of Tokens
• Regular expressions are an important notation for
specifying lexeme patterns.
• While they cannot express all possible patterns, they are
very effective in specifying those types of patterns that we
actually need for tokens
Specification of Patterns for
Tokens: Terminology
• Strings and Languages
– An alphabet is a finite set of symbols
(characters)
• Typical examples of symbols are letters, digits, and
punctuation. The set {0,1} is the binary alphabet.
• ASCII is an important example of an alphabet; it is
used in many software systems.
• Unicode, which includes approximately 100,000
characters from alphabets around the world, is
another important example of an alphabet.
Specification of Patterns for
Tokens: Terminology
• Strings and Languages
– A string s is a finite sequence of symbols from the alphabet Σ
• |s| denotes the length of string s
• ε denotes the empty string, thus |ε| = 0
– A language is a specific set of strings over
some fixed alphabet Σ
Specification of Patterns for
Tokens: String Operations
• The concatenation of two strings x and y is
denoted by xy
• The exponentiation of a string s is defined
by
  s^0 = ε
  s^i = s^(i-1)s for i > 0
  (note that sε = εs = s)
Specification of Patterns for
Tokens: Language Operations
• Union
  L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
  LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
  L^0 = {ε}; L^i = L^(i-1)L
• Kleene closure
  L* = ∪ (i = 0, …, ∞) L^i
• Positive closure
  L+ = ∪ (i = 1, …, ∞) L^i
Example
• Let L be the set of letters {A, B, …, Z, a, b, …, z} and let D be
the set of digits {0, 1, …, 9}. We may think of L and D in two,
essentially equivalent, ways. One way is that L and D are, respectively, the
alphabets of uppercase and lowercase letters and of digits. The second
way is that L and D are languages, all of whose strings happen to be of
length one. Here are some other languages that can be constructed from
languages L and D.
• L ∪ D is the set of letters and digits - strictly speaking the language with
62 strings of length one, each of which strings is either one letter or one
digit.
• LD is the set of 520 strings of length two, each consisting of one letter
followed by one digit.
• L^4 is the set of all 4-letter strings.
• L* is the set of all strings of letters, including ε, the empty string.
• L(L ∪ D)* is the set of all strings of letters and digits beginning with a
letter.
• D+ is the set of all strings of one or more digits.
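The counts above follow from the sum and product rules for sets; a quick sanity check in C (the function names are ours, for illustration only):

```c
/* Sanity-check the cardinalities claimed above: L has 52 letters and D
   has 10 digits, so |L U D| = 52 + 10 (the sets are disjoint),
   |LD| = 52 * 10 (one letter followed by one digit), and |L^4| = 52^4. */
unsigned long size_union_LD(void)  { return 52UL + 10UL; }
unsigned long size_concat_LD(void) { return 52UL * 10UL; }
unsigned long size_L4(void)        { return 52UL * 52UL * 52UL * 52UL; }
```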
Specification of Patterns for
Tokens: Regular Expressions
• Basis symbols:
– ε is a regular expression denoting language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
– r | s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is
called a regular set
Example
• Let Σ = {a, b}.
– The regular expression a | b denotes the language {a, b}.
– (a | b)(a | b) denotes {aa, ab, ba, bb}, the language of all strings of
length two over the alphabet Σ. Another regular expression for the same
language is aa | ab | ba | bb.
– a* denotes the language consisting of all strings of zero or more
a's, that is, {ε, a, aa, aaa, …}.
– (a | b)* denotes the set of all strings consisting of zero or more
instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab,
ba, bb, aaa, …}. Another regular expression for the same
language is (a*b*)*.
– a | a*b denotes the language {a, b, ab, aab, aaab, …}, that is, the
string a and all strings consisting of zero or more a's and ending in
b.
Specification of Patterns for
Tokens: Regular Definitions
• For notational convenience, we may wish to give names to
certain regular expressions and use those names in subsequent
expressions, as if the names were themselves symbols.
• If Σ is an alphabet of basic symbols, then a regular definition is
a sequence of definitions of the form:
  d1 → r1
  d2 → r2
  …
  dn → rn
  where each ri is a regular expression over
  Σ ∪ {d1, d2, …, di-1}
• Each di is a new symbol, not in Σ and not the same as any other of the d's
Example
• C identifiers are strings of letters, digits, and underscores.
Here is a regular definition for the language of C
identifiers:
  letter_ → A | B | … | Z | a | b | … | z | _
  digit → 0 | 1 | … | 9
  id → letter_ ( letter_ | digit )*
Example
• Unsigned numbers (integer or floating point) are strings
such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular
definition:
  digit → 0 | 1 | … | 9
  digits → digit digit*
  optionalFraction → . digits | ε
  optionalExponent → ( E ( + | - | ε ) digits ) | ε
  number → digits optionalFraction optionalExponent
Specification of Patterns for
Tokens: Notational Shorthands
• We frequently use the following shorthands:
  r+ = rr*
  r? = r | ε (the unary postfix operator ? means "zero or one occurrence")
  [a-z] = a | b | c | … | z
• For example:
  digit → [0-9]
  num → digit+ (. digit+)? ( E (+|-)? digit+ )?
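The num shorthand maps almost character for character onto a POSIX extended regular expression; this small sketch (is_num is our name for the check, not part of any standard API) uses regcomp/regexec and anchors the pattern so the whole string must match.

```c
#include <regex.h>
#include <stddef.h>

/* Check a string against the num pattern above,
   digit+ (. digit+)? ( E (+|-)? digit+ )?,
   written as a POSIX ERE anchored at both ends. */
int is_num(const char *s)
{
    regex_t re;
    if (regcomp(&re, "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                         /* pattern failed to compile */
    int rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;                       /* 0 means the string matched */
}
```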
Exercise
• Describe the languages denoted by the following regular
expressions:
Regular Definitions and
Grammars
Grammar
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term
       | term
  term → id
       | num

Regular definitions
  if → if
  then → then
  else → else
  relop → < | <= | <> | > | >= | =
  id → letter ( letter | digit )*
  num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Implementing a Scanner Using
Transition Diagrams
relop → < | <= | <> | > | >= | =

Transition diagram (state 0 is the start; * marks states that retract
the input by one character):

  0 -"<"-> 1;  0 -"="-> 5;  0 -">"-> 6
  1 -"="-> 2: return(relop, LE)
  1 -">"-> 3: return(relop, NE)
  1 -other-> 4*: return(relop, LT)
  5: return(relop, EQ)
  6 -"="-> 7: return(relop, GE)
  6 -other-> 8*: return(relop, GT)
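The relop diagram translates almost mechanically into hand-written C. In this sketch (the function name and cursor convention are ours) the starred "other" states become a retraction of the cursor, and the cursor *pos is left just past the characters consumed by the operator.

```c
/* Hand-coded version of the relop transition diagram: scan one
   relational operator from s starting at *pos.  States 4 and 8 of the
   diagram are the "other" exits, realized here by retracting *pos. */
enum relop { LT, LE, EQ, NE, GT, GE, NONE };

enum relop scan_relop(const char *s, int *pos)
{
    char c = s[(*pos)++];
    if (c == '<') {                      /* state 1 */
        c = s[(*pos)++];
        if (c == '=') return LE;         /* state 2 */
        if (c == '>') return NE;         /* state 3 */
        (*pos)--;                        /* state 4*: retract */
        return LT;
    }
    if (c == '=') return EQ;             /* state 5 */
    if (c == '>') {                      /* state 6 */
        c = s[(*pos)++];
        if (c == '=') return GE;         /* state 7 */
        (*pos)--;                        /* state 8*: retract */
        return GT;
    }
    (*pos)--;                            /* not a relop at all */
    return NONE;
}
```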
id → letter ( letter | digit )*
(transition diagram: after an initial letter, loop on letter or digit;
on any other character, retract and return the token)

Lex workflow:
  lex source program (spec.l) → Lex compiler → lex.yy.c
  lex.yy.c → C compiler → a.out
  input character stream → a.out → sequence of tokens
Lex Specification
• A lex specification consists of three parts:
  %{ C declarations %}
  regular definitions
  %%
  translation rules
  %%
  user-defined auxiliary procedures
• The translation rules are of the form:
p1 { action1 }
p2 { action2 }
…
pn { actionn }
Regular Expressions in Lex
x        match the character x
\.       match the character .
"string" match the contents of string of characters
.        match any character except newline
^        match the beginning of a line
$        match the end of a line
[xyz]    match one character x, y, or z (use \ to escape -)
[^xyz]   match any character except x, y, and z
[a-z]    match one of a to z
r*       closure (match zero or more occurrences)
r+       positive closure (match one or more occurrences)
r?       optional (match zero or one occurrence)
r1r2     match r1 then r2 (concatenation)
r1|r2    match r1 or r2 (union)
(r)      grouping
r1/r2    match r1 when followed by r2 (lookahead)
{d}      match the regular expression defined by d
Lex actions
• BEGIN: indicates the start state. The lexical analyzer starts at state
0.
• ECHO: emits the input as it is.
• yytext: when the lexer matches or recognizes a token from the input,
the lexeme is stored in a null-terminated string called yytext.
• yylex(): as soon as a call to yylex() is encountered, the scanner
starts scanning the source program.
• yywrap(): called when the scanner encounters end of file. If it
returns 0, the scanner continues scanning; if it returns 1, the
scanner terminates.
• yyin: the standard input file that stores the input source program.
• yyleng: stores the length (number of characters) of the lexeme held
in yytext; its value is the same as strlen(yytext).
Installing Software
• Download Flex 2.5.4a
• Download Bison 2.4.1
• Download DevC++
• Install Flex at "C:\GnuWin32"
• Install Bison at "C:\GnuWin32"
• Install DevC++ at "C:\Dev-Cpp"
• Open Environment Variables (shown for Windows 8).
  – Add "C:\GnuWin32\bin;C:\Dev-Cpp\bin;" to Path.
Example Lex Specification 1

%{
#include <stdio.h>
%}
%%
[0-9]+  { printf("%s\n", yytext); }
.|\n    { }
%%
int main( )
{
    yylex( );
}
int yywrap( )
{
    return 1;
}

yytext contains the matching lexeme; the call to yylex( ) invokes the
lexical analyzer.

To build and run:
  lex spec.l
  gcc lex.yy.c -ll
  ./a.out < spec.l
Execution of Lex Specification 1

Only the digits in the input are printed.
Example Lex Specification 2

%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim     [ \t]+
%%
\n        { ch++; wd++; nl++; }
^{delim}  { ch += yyleng; }
{delim}   { ch += yyleng; wd++; }
.         { ch++; }
%%
main()
{ yylex();
  printf("%8d%8d%8d\n", nl, wd, ch);
}

(The regular definition delim names the pattern [ \t]+.)
Example Lex Specification 3

%{
#include <stdio.h>
%}
digit   [0-9]
letter  [A-Za-z]
id      {letter}({letter}|{digit})*
%%
{digit}+  { printf("number: %s\n", yytext); }
{id}      { printf("ident: %s\n", yytext); }
.         { printf("other: %s\n", yytext); }
%%
main()
{ yylex();
}
Example Lex Specification 4

%{  /* definitions of manifest constants */
#define LT (256)
…
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}      { }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { yylval = install_id(); return ID; }
{number}  { yylval = install_num(); return NUMBER; }
"<"       { yylval = LT; return RELOP; }
"<="      { yylval = LE; return RELOP; }
"="       { yylval = EQ; return RELOP; }
"<>"      { yylval = NE; return RELOP; }
">"       { yylval = GT; return RELOP; }
">="      { yylval = GE; return RELOP; }
%%
int install_id()
…

The token is returned to the parser and yylval carries the token
attribute; install_id() installs yytext as an identifier in the symbol
table.
Lex Program
(Write a program in Lex to identify identifiers and keywords in a sentence.)
%{
#include<stdio.h>
static int key_word=0;
static int identifier=0;
%}
%%
"include"|"for"|"define" {key_word++;printf("keyword found");}
"int"|"char"|"float"|"double" {identifier++;printf("identifier found");}
%%
int main()
{
printf("enter the sentence");
yylex();
printf("keyword are: %d\n and identifier are:%d\n",key_word,identifier);
}
int yywrap()
{
return 1;
}
Design of a Lexical Analyzer
Generator
• Translate regular expressions to an NFA
• Translate the NFA to an efficient DFA

  regular expressions → NFA → DFA (optional)

Example NFA for (a|b)*abb:
  S = {0,1,2,3}
  Σ = {a,b}
  s0 = 0
  F = {3}
  Transitions: 0 -a-> 0, 0 -b-> 0, 0 -a-> 1, 1 -b-> 2, 2 -b-> 3
Transition Table
• The mapping of an NFA can be
represented in a transition table

  δ(0,a) = {0,1}
  δ(0,b) = {0}
  δ(1,b) = {2}
  δ(2,b) = {3}

  State | Input a | Input b
    0   | {0,1}   | {0}
    1   |         | {2}
    2   |         | {3}
The Language Defined by an
NFA
• An NFA accepts an input string x iff there is some
path with edges labeled with symbols from x in
sequence from the start state to some accepting
state in the transition graph
• A state transition from one state to another on the
path is called a move
• The language defined by an NFA is the set of input
strings it accepts, such as (a|b)*abb for the
example NFA
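Acceptance can be checked by tracking the set of currently live states. This sketch (names are ours) hard-codes the (a|b)*abb example NFA above as a bit mask, where bit i set means state i is live:

```c
/* Simulate the example NFA for (a|b)*abb with bit sets.
   Transitions: 0 loops on a and b, 0 -a-> 1, 1 -b-> 2, 2 -b-> 3;
   state 3 is accepting. */
int nfa_accepts(const char *x)
{
    unsigned cur = 1u << 0;                  /* start in state 0 */
    for (const char *p = x; *p; p++) {
        unsigned next = 0;
        if (*p == 'a') {
            if (cur & (1u << 0)) next |= (1u << 0) | (1u << 1);
        } else if (*p == 'b') {
            if (cur & (1u << 0)) next |= 1u << 0;
            if (cur & (1u << 1)) next |= 1u << 2;
            if (cur & (1u << 2)) next |= 1u << 3;
        } else {
            return 0;                        /* symbol outside the alphabet */
        }
        cur = next;
    }
    return (cur & (1u << 3)) != 0;           /* accept iff state 3 is live */
}
```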
Design of a Lexical Analyzer
Generator: RE to NFA to DFA
Lex specification with NFA:

  p1 { action1 }
  p2 { action2 }
  …
  pn { actionn }

Each pattern pi is compiled to an NFA N(pi); a new start state s0 has
ε-transitions to the start of each N(pi), and the accepting state of
N(pi) is tagged with actioni. Subset construction (optional) then
yields a DFA.
From Regular Expression to NFA
(Thompson’s Construction)

  ε:       i -ε-> f
  a:       i -a-> f
  r1 | r2: i -ε-> N(r1) -ε-> f and i -ε-> N(r2) -ε-> f
  r1 r2:   i -> N(r1) -> N(r2) -> f
  r*:      i -ε-> N(r) -ε-> f, plus ε-edges i -ε-> f (skip) and from
           the end of N(r) back to its start (repeat)
Combining the NFAs of a Set of
Regular Expressions

  a    { action1 }   1 -a-> 2
  abb  { action2 }   3 -a-> 4 -b-> 5 -b-> 6
  a*b+ { action3 }   7 -a-> 7, 7 -b-> 8, 8 -b-> 8

Combined NFA: a new start state 0 with ε-transitions to states 1, 3,
and 7.
Simulating the Combined NFA
Example 1

Input a a b a:

  {0,1,3,7} -a-> {2,4,7} -a-> {7} -b-> {8} -a-> none

Must find the longest match: continue until no further moves are
possible; the last state set reached, {8}, contains an accepting state,
so its action (action3) is executed.
Simulating the Combined NFA
Example 2

Input a b b a:

  {0,1,3,7} -a-> {2,4,7} -b-> {5,8} -b-> {6,8} -a-> none

When two or more accepting states are reached (here 6 and 8), the
first action given in the Lex specification is executed (action2).
Deterministic Finite Automata
• A deterministic finite automaton is a special case
of an NFA
– No state has an ε-transition
– For each state s and input symbol a there is at most one
edge labeled a leaving s
• Each entry in the transition table is a single state
– At most one path exists to accept a string
– Simulation algorithm is simple
Example DFA

DFA for (a|b)*abb:

  state 0: a → 1, b → 0
  state 1: a → 1, b → 2
  state 2: a → 1, b → 3
  state 3 (accepting): a → 1, b → 0
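Since every DFA entry is a single state, simulation is a straight table walk; a sketch in C of the (a|b)*abb DFA (the table layout and function name are ours):

```c
/* Table-driven simulation of the DFA for (a|b)*abb.
   Rows are states 0-3; columns 0/1 are inputs a/b; state 3 accepts. */
static const int dtran[4][2] = {
    /* a  b */
    {  1, 0 },   /* state 0 */
    {  1, 2 },   /* state 1 */
    {  1, 3 },   /* state 2 */
    {  1, 0 },   /* state 3 */
};

int dfa_accepts(const char *x)
{
    int s = 0;                                /* start state */
    for (const char *p = x; *p; p++) {
        if (*p != 'a' && *p != 'b') return 0; /* outside the alphabet */
        s = dtran[s][*p == 'b'];              /* one table lookup per char */
    }
    return s == 3;
}
```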
Conversion of an NFA into a
DFA
• The subset construction algorithm converts an
NFA into a DFA using:
  ε-closure(s) = {s} ∪ {t | s -ε-> … -ε-> t}
  ε-closure(T) = ∪ (s ∈ T) ε-closure(s)
  move(T,a) = {t | s -a-> t and s ∈ T}
• The algorithm produces:
  Dstates, the set of states of the new DFA,
  consisting of sets of states of the NFA
  Dtran, the transition table of the new DFA
ε-closure and move Examples

Using the combined NFA for a / abb / a*b+:

  ε-closure({0}) = {0,1,3,7}
  move({0,1,3,7},a) = {2,4,7}
  ε-closure({2,4,7}) = {2,4,7}
  move({2,4,7},a) = {7}
  ε-closure({7}) = {7}
  move({7},b) = {8}
  ε-closure({8}) = {8}
  move({8},a) = ∅

On input a a b a: {0,1,3,7} → {2,4,7} → {7} → {8} → none.
The same operations are also used to simulate NFAs.
Simulating an NFA using
ε-closure and move

  S := ε-closure({s0})
  Sprev := ∅
  a := nextchar()
  while S ≠ ∅ do
    Sprev := S
    S := ε-closure(move(S,a))
    a := nextchar()
  end do
  if Sprev ∩ F ≠ ∅ then
    execute action in Sprev
    return “yes”
  else return “no”
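The ε-closure/move simulation can be instantiated in C for the combined NFA of a / abb / a*b+ (states 0-8, with ε-edges only from 0 to 1, 3, and 7). Sets are bit masks, and all names here are ours; the function returns which of the three actions fires.

```c
/* Bit-mask instantiation of the simulation for the combined NFA of
   a {action1} / abb {action2} / a*b+ {action3}.
   Accepting NFA states: 2 (action1), 6 (action2), 8 (action3). */
static unsigned eps_closure(unsigned S)
{
    if (S & (1u << 0))                   /* only state 0 has ε-edges */
        S |= (1u << 1) | (1u << 3) | (1u << 7);
    return S;
}

static unsigned move_on(unsigned S, char a)
{
    unsigned T = 0;
    if (a == 'a') {
        if (S & (1u << 1)) T |= 1u << 2; /* 1 -a-> 2 */
        if (S & (1u << 3)) T |= 1u << 4; /* 3 -a-> 4 */
        if (S & (1u << 7)) T |= 1u << 7; /* 7 -a-> 7 */
    } else if (a == 'b') {
        if (S & (1u << 4)) T |= 1u << 5; /* 4 -b-> 5 */
        if (S & (1u << 5)) T |= 1u << 6; /* 5 -b-> 6 */
        if (S & (1u << 7)) T |= 1u << 8; /* 7 -b-> 8 */
        if (S & (1u << 8)) T |= 1u << 8; /* 8 -b-> 8 */
    }
    return T;
}

/* Run the NFA on the whole string x; return 1, 2, or 3 for the
   first-listed action among the accepting states reached, 0 if stuck. */
int run_nfa(const char *x)
{
    unsigned S = eps_closure(1u << 0);
    for (const char *p = x; *p; p++) {
        S = eps_closure(move_on(S, *p));
        if (S == 0) return 0;            /* no pattern matches */
    }
    if (S & (1u << 2)) return 1;         /* action1: pattern a */
    if (S & (1u << 6)) return 2;         /* action2: listed before a*b+ */
    if (S & (1u << 8)) return 3;         /* action3: pattern a*b+ */
    return 0;
}
```

On input abb the final set is {6,8}, and action2 wins because it is listed first, matching Example 2 above.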
The Subset Construction
Algorithm

Initially, ε-closure(s0) is the only state in Dstates and it is unmarked
while there is an unmarked state T in Dstates do
  mark T
  for each input symbol a do
    U := ε-closure(move(T,a))
    if U is not in Dstates then
      add U as an unmarked state to Dstates
    end if
    Dtran[T,a] := U
  end do
end do
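The algorithm fits in a few lines of C when NFA state sets are bit masks. This sketch (all names are ours) hard-codes the combined NFA for a / abb / a*b+ (states 0-8, ε-edges only from 0 to 1, 3, and 7) and grows Dstates as a worklist; the unmarked states are simply those the outer loop has not reached yet.

```c
#define MAXD 32

/* ε-closure and move for the combined NFA of a / abb / a*b+. */
static unsigned eclose(unsigned S)
{
    if (S & (1u << 0)) S |= (1u << 1) | (1u << 3) | (1u << 7);
    return S;
}

static unsigned nfa_move(unsigned S, char a)
{
    unsigned T = 0;
    if (a == 'a') {
        if (S & (1u << 1)) T |= 1u << 2;
        if (S & (1u << 3)) T |= 1u << 4;
        if (S & (1u << 7)) T |= 1u << 7;
    } else {  /* 'b' */
        if (S & (1u << 4)) T |= 1u << 5;
        if (S & (1u << 5)) T |= 1u << 6;
        if (S & (1u << 7)) T |= 1u << 8;
        if (S & (1u << 8)) T |= 1u << 8;
    }
    return T;
}

unsigned Dstates[MAXD];       /* each DFA state is an NFA state set */
int Dtran[MAXD][2];           /* columns a/b; -1 marks the empty set */
int ndstates = 0;

static int find_or_add(unsigned S)
{
    for (int i = 0; i < ndstates; i++)
        if (Dstates[i] == S) return i;
    Dstates[ndstates] = S;
    return ndstates++;
}

/* Run subset construction; return the number of DFA states built. */
int subset_construct(void)
{
    const char sym[2] = { 'a', 'b' };
    find_or_add(eclose(1u << 0));        /* start state = ε-closure({0}) */
    for (int t = 0; t < ndstates; t++)   /* states beyond t are unmarked */
        for (int c = 0; c < 2; c++) {
            unsigned U = eclose(nfa_move(Dstates[t], sym[c]));
            Dtran[t][c] = U ? find_or_add(U) : -1;
        }
    return ndstates;
}
```

For this NFA the construction yields the six DFA states A-F listed in Example 2 below it in these notes.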
Subset Construction Example 1

NFA for (a|b)*abb (Thompson construction, states 0-10):
  0 -ε-> 1, 0 -ε-> 7;  1 -ε-> 2, 1 -ε-> 4;  2 -a-> 3;  4 -b-> 5;
  3 -ε-> 6, 5 -ε-> 6;  6 -ε-> 1, 6 -ε-> 7;  7 -a-> 8;  8 -b-> 9;  9 -b-> 10

Dstates
  A = {0,1,2,4,7}
  B = {1,2,3,4,6,7,8}
  C = {1,2,4,5,6,7}
  D = {1,2,4,5,6,7,9}
  E = {1,2,4,5,6,7,10}

DFA transitions (E accepting):
  A: a → B, b → C
  B: a → B, b → D
  C: a → B, b → C
  D: a → B, b → E
  E: a → B, b → C
Subset Construction Example 2

Combined NFA for a {action1} / abb {action2} / a*b+ {action3}
(states 0-8, ε-edges from 0 to 1, 3, and 7).

Dstates
  A = {0,1,3,7}
  B = {2,4,7}
  C = {8}
  D = {7}
  E = {5,8}
  F = {6,8}

DFA transitions:
  A: a → B, b → C
  B: a → D, b → E
  C: b → C
  D: a → D, b → C
  E: b → F
  F: b → C

Accepting: B (action1), C (action3), E (action3), F (action2, since
action2 is listed before action3).
Minimizing the Number of States
of a DFA

For the (a|b)*abb DFA of Example 1, states A and C are equivalent and
can be merged, leaving four states:

  before: A: a→B, b→C;  B: a→B, b→D;  C: a→B, b→C;  D: a→B, b→E;  E: a→B, b→C
  after:  A: a→B, b→A;  B: a→B, b→D;  D: a→B, b→E;  E: a→B, b→A
Minimizing the number of states of a
DFA (Hopcroft's Algorithm)

Algorithm: Minimizing the number of states of a DFA.
Input: A DFA D with set of states S, input alphabet Σ, start state s0, and set of accepting
states F.
Output: A DFA D' accepting the same language as D and having as few states as possible.
Method:
1. Start with an initial partition Π with two groups, F and S - F, the accepting and non-
accepting states of D.
2. Apply the following procedure to construct a new partition Πnew:
   Initially, let Πnew = Π;
   for ( each group G of Π ) do begin
     partition G into subgroups such that two states s and t are in the same subgroup if and
     only if for all input symbols a, states s and t have transitions on a to states in the same
     group of Π; /* at worst, a state will be in a subgroup by itself */
     replace G in Πnew by the set of all subgroups formed;
   end
3. If Πnew = Π, let Πfinal = Π and continue with step (4). Otherwise, repeat step (2) with Πnew in
place of Π.
4. Choose one state in each group of Πfinal as the representative for that group; the
representatives are the states of D'.
Example
• Using the above algorithm, we minimize the DFA with this
transition table:

  STATE | a | b
    A   | B | C
    B   | B | D
    C   | B | C
    D   | B | E
    E   | B | C

• Minimizing the states:
  Π0 = (ABCD) (E)
  Π1 = (ABC) (D) (E)
  Π2 = (AC) (B) (D) (E)
  Π3 = (AC) (B) (D) (E)

• Now construct the minimum-state DFA. It has four states,
corresponding to the four groups of Π3; pick A, B, D, and E as the
representatives of these groups. The initial state is A, and the only
accepting state is E. The table below shows the transition function
for the DFA.
Example (Contd…)

In the minimized table, every transition into C is replaced by A, since
A and C are in the same group; e.g., row E changes from (B, C) to (B, A):

  STATE | a | b
    A   | B | A
    B   | B | D
    D   | B | E
    E   | B | A
From Regular Expression to DFA
Directly
• The important states of an NFA are those
without an ε-transition, that is, if
move({s},a) ≠ ∅ for some a then s is an
important state
• The subset construction algorithm uses only
the important states when it determines
-closure(move(T,a))
From Regular Expression to DFA
Directly (Algorithm)
• Augment the regular expression r with a
special end symbol # to make accepting
states important: the new expression is r#
• Construct a syntax tree for r#
• Traverse the tree to construct functions
nullable, firstpos, lastpos, and followpos
From Regular Expression to DFA
Directly: Syntax Tree of (a|b)*abb#

The root is a concatenation node; reading the leaves left to right,
with position numbers assigned to each leaf:

  a:1 and b:2 under an alternation (|) node, under a closure (*) node,
  followed by a:3, b:4, b:5, and #:6.
From Regular Expression to DFA
Directly: Annotating the Tree
• nullable(n): the subtree at node n generates
languages including the empty string
• firstpos(n): set of positions that can match the first
symbol of a string generated by the subtree at
node n
• lastpos(n): the set of positions that can match the
last symbol of a string generated by the subtree at
node n
• followpos(i): the set of positions that can follow
position i in the tree
From Regular Expression to DFA
Directly: Annotating the Tree

  Node n               nullable(n)  firstpos(n)  lastpos(n)
  leaf ε               true         ∅            ∅
  leaf at position i   false        {i}          {i}

For (a|b)*abb#: the | and * nodes both have firstpos = lastpos = {1,2};
the leaf a at position 3 has firstpos = lastpos = {3}.

The resulting DFA (states named by their position sets):

  {1,2,3}: a → {1,2,3,4}, b → {1,2,3}
  {1,2,3,4}: a → {1,2,3,4}, b → {1,2,3,5}
  {1,2,3,5}: a → {1,2,3,4}, b → {1,2,3,6}
  {1,2,3,6} (accepting): a → {1,2,3,4}, b → {1,2,3}
Time-Space Tradeoffs

  Automaton | Space (worst case) | Time (worst case)
  NFA       | O(|r|)             | O(|r|·|x|)
  DFA       | O(2^|r|)           | O(|x|)
Compiler Construction tools
These tools use specialized languages for specifying
and implementing specific components, and many
use quite sophisticated algorithms.