Unit - I

Lexical Analysis and Lexical Analyzer Generators

Overview and History


• Cause
– Software for early computers was written in assembly
language
– The benefits of reusing software on different CPUs
started to become significantly greater than the cost of
writing a compiler

• The first real compiler
– The FORTRAN compilers of the late 1950s

Overview and History (Contd…)


• Compiler technology
– is more broadly applicable and has been
employed in rather unexpected areas.
• Text-formatting languages, like nroff and
troff; preprocessor packages like eqn, tbl, pic
• Silicon compiler for the creation of VLSI
circuits
• Command languages of OS
• Query languages of Database systems

What Do Compilers Do
• Compilers may generate three types of code:
– Pure Machine Code
• Uses the machine instruction set only, without assuming the
existence of any operating system or library
• Mostly used for OS kernels or embedded applications
– Augmented Machine Code
• Code that relies on OS routines and runtime support routines
• The most common case
– Virtual Machine Code
• Virtual instructions that can be run on any architecture with
a virtual machine interpreter or a just-in-time compiler
• Ex. Java

What Do Compilers Do (Contd…)


• Another way that compilers differ from one
another is in the format of the target
machine code they generate:
– Assembly or other source format
– Relocatable binary
• Relative address
• A linkage step is required
– Absolute binary
• Absolute address
• Can be executed directly

Compilers and Interpreters


• “Compilation”
– Translation of a program written in a source
language into a semantically equivalent
program written in a target language

  Source Program ──► Compiler ──► Target Program
                        │
                  Error messages

  Input ──► Target Program ──► Output

Compilers
• Source languages: Fortran, Pascal, C, etc.
• Target languages: another programming language, machine language
• Compilers:
– Single-pass
– Multi-pass
– Load-and-Go
– Debugging
– Optimizing

Compilers and Interpreters (Contd…)

• “Interpretation”
– Performing the operations implied by the
source program

  Source Program ──┐
                   ├──► Interpreter ──► Output
  Input ───────────┘         │
                       Error messages

Other Tools that Use the Analysis-Synthesis Model
• Editors (syntax highlighting)
• Pretty printers (e.g. doxygen)
• Static checkers (e.g. lint and splint)
• Interpreters
• Text formatters (e.g. TeX and LaTeX)
• Silicon compilers (e.g. VHDL)
• Query interpreters/compilers (Databases)

Preprocessors, Compilers, Assemblers, and Linkers

  Skeletal Source Program
          │
     Preprocessor
          │
    Source Program
          │
      Compiler              (try for example: gcc -v myprog.c)
          │
  Target Assembly Program
          │
      Assembler
          │
  Relocatable Object Code
          │
       Linker  ◄── Libraries and Relocatable Object Files
          │
  Absolute Machine Code

The Analysis-Synthesis Model of Compilation
• There are two parts to compilation:
– Analysis determines the operations implied by the
source program which are recorded in a tree
structure
– Synthesis takes the tree structure and translates the
operations therein into the target program

Phases of Compiler

The Structure of a Compiler

  Source Program          Tokens          Syntactic Structure
  (Character Stream) ──► Scanner ──► Parser ──► Semantic Routines
                                                       │
                                         Intermediate Representation
                                                       │
  Symbol and Attribute Tables                      Optimizer
  (used by all phases of the compiler)                 │
                                                 Code Generator
                                                       │
                                              Target machine code

The Structure of a Compiler (Contd…)

Scanner
 The scanner begins the analysis of the source program by reading
  the input, character by character, and grouping characters into
  individual words and symbols (tokens)
 RE ( Regular Expression )
 NFA ( Non-deterministic Finite Automata )
 DFA ( Deterministic Finite Automata )
 LEX

The Structure of a Compiler (Contd…)

Parser
 Given a formal syntax specification (typically as a context-free
  grammar [CFG]), the parser reads tokens and groups them into units
  as specified by the productions of the CFG being used.
 As syntactic structure is recognized, the parser either calls
  corresponding semantic routines directly or builds a syntax tree.
 CFG ( Context-Free Grammar )
 BNF ( Backus-Naur Form )
 GAA ( Grammar Analysis Algorithms )
 LL, LR, SLR, LALR Parsers
 YACC

The Structure of a Compiler (Contd…)

Semantic Routines
 Perform two functions
   Check the static semantics of each construct
   Do the actual translation
 The heart of a compiler
 Syntax Directed Translation
 Semantic Processing Techniques
 IR ( Intermediate Representation )

The Structure of a Compiler (Contd…)

Optimizer
 The IR code generated by the semantic routines is analyzed and
  transformed into functionally equivalent but improved IR code
 This phase can be very complex and slow
 Peephole optimization
 Loop optimization, register allocation, code scheduling
 Register and Temporary Management

The Structure of a Compiler (Contd...)

Code Generator
 Interpretive Code Generation
 Generating Code from Trees/DAGs
 Grammar-Based Code Generation

Symbol-table Management
• To record the identifiers in source program
– Identifier is detected by lexical analysis and then is
stored in symbol table
• To collect the attributes of identifiers
(not by lexical analysis)
– Storage allocation : memory address
– Types
– Scope (where it is valid, local or global)
– Arguments (in case of procedure names)
• Number and types of arguments
• Call by reference or by value
• Return types

Symbol-table Management

• Semantic analysis uses type information to check the type
consistency of identifiers
• Code generation uses storage-allocation information to generate
code with proper relocation addresses

Error Detection and Reporting


• Syntax and semantic analysis handle a large fraction
of errors
• Errors may occur in any compilation phase:
– Lexical phase: characters that could not form any token
• e.g. misspelling or juxtaposing of characters
– Syntax phase: tokens that violate structure rules
• e.g. unbalanced parentheses, missing punctuation
– Semantic phase: operations with no meaning
• e.g. adding an array name and a procedure name, undeclared
variables, truncation of results, unreachable code

Translation of A Statement
(figures: a sample statement traced through each compiler phase; slides not reproduced)
The Reason Why Lexical Analysis is a
Separate Phase (Issues in Lexical Analysis)
• Simplifies the design of the compiler
– e.g. parsing with one symbol of lookahead (LL(1) or LR(1)) would
not be possible if the parser also had to deal with whitespace
and comments
• Provides efficient implementation
– Systematic techniques to implement lexical analyzers
by hand or automatically
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character
encodings can be more easily translated
Interaction of the Lexical Analyzer with the Parser

  Source ──► Lexical ──── token, tokenval ────► Parser
  Program    Analyzer ◄─── get next token ─────
                │                                 │
              error        Symbol Table         error
Attributes of Tokens

  y := 31 + 28*x    ──► Lexical analyzer ──►

  <id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
      │   │
  token   tokenval (token attribute)    ──► Parser
Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional
attribute value. The token name is an abstract symbol representing
a kind of lexical unit, e.g., a particular keyword, or a sequence of
input characters denoting an identifier. The token names are the
input symbols that the parser processes.
– For example: id and num
• A lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical
analyzer as an instance of that token.
– For example: abc and 123
Tokens, Patterns, and Lexemes
• A pattern is a description of the form that the lexemes of a
token may take. In the case of a keyword as a token, the
pattern is just the sequence of characters that form the
keyword. For identifiers and some other tokens, the pattern
is a more complex structure that is matched by many
strings.
– For example: “letter followed by letters and digits” and “non-empty
sequence of digits”
Example
• Consider the Pascal statement
– const pi = 3.1416;
– The substring pi is a lexeme for the token identifier
Tokens, Patterns, and Lexemes
• In many programming languages, the following classes
cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the
keyword itself.
2. Tokens for the operators, either individually or in classes such as the token
comparison
3. One token representing all identifiers
4. One or more tokens representing constants, such as numbers and literal
strings
5. Tokens for each punctuation symbol, such as left and right parentheses,
comma, and semicolon.
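As a sketch of how classes 1 and 3 interact in practice: because the pattern for a keyword is the keyword itself, a scanner can match the general identifier pattern and then consult an exact-match keyword table, falling back to the catch-all id token. The token names and table below are illustrative, not from any particular compiler:

```c
#include <assert.h>
#include <string.h>

/* Illustrative token names; keyword tokens are contiguous. */
enum { TOK_IF, TOK_THEN, TOK_ELSE, TOK_ID };

/* Classify a lexeme that already matched the identifier pattern. */
int classify(const char *lexeme) {
    static const char *kw[] = { "if", "then", "else" };
    for (int i = 0; i < 3; i++)
        if (strcmp(lexeme, kw[i]) == 0)
            return TOK_IF + i;   /* exact keyword match */
    return TOK_ID;               /* one token for all identifiers */
}
```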
Exercise
Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical
analyzer must provide the subsequent compiler phases
additional information about the particular lexeme that
matched.
– For example, the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was found
in the source program.
• Thus, in many cases the lexical analyzer returns to the parser
not only a token name, but an attribute value that describes
the lexeme represented by the token; the token name
influences parsing decisions, while the attribute value
influences translation of tokens after the parse.
Attributes for Tokens
• We shall assume that tokens have at most one associated
attribute, although this attribute may have a structure that
combines several pieces of information.
• The most important example is the token id, where we
need to associate with the token a great deal of
information.
• Normally, information about an identifier - e.g., its lexeme,
its type, and the location at which it is first found (in case
an error message about that identifier must be issued) - is
kept in the symbol table.
• Thus, the appropriate attribute value for an identifier is a
pointer to the symbol-table entry for that identifier.
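A minimal sketch in C of this arrangement; the entry fields and the make_id helper are hypothetical, chosen only to show the token-name/attribute split and the reuse of one entry per identifier:

```c
#include <assert.h>
#include <string.h>

/* A symbol-table entry collects what is known about one identifier. */
struct sym_entry {
    char lexeme[32];
    int  first_line;          /* where the identifier was first seen */
};

/* A token pairs a token name with an attribute; for id the attribute
   is a pointer into the symbol table, not the raw lexeme. */
struct token {
    int name;                 /* e.g. TOK_ID */
    struct sym_entry *entry;  /* attribute value */
};

enum { TOK_ID = 1 };

static struct sym_entry symtab[100];
static int nsyms = 0;

/* Return an id token, installing the lexeme on first sight. */
struct token make_id(const char *lexeme, int line) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return (struct token){ TOK_ID, &symtab[i] };
    strcpy(symtab[nsyms].lexeme, lexeme);
    symtab[nsyms].first_line = line;
    return (struct token){ TOK_ID, &symtab[nsyms++] };
}
```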
Example
Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of
other components, that there is a source-code error.
– Example: fi(a==f(x)) …
• fi is a valid lexeme for the token id to the parser.
• However, suppose a situation arises in which the lexical
analyzer is unable to proceed because none of the patterns
for tokens matches any prefix of the remaining input.
• The simplest recovery strategy is "panic mode" recovery.
– We delete successive characters from the remaining input, until the
lexical analyzer can find a well-formed token at the beginning of
what input is left.
– This recovery technique may confuse the parser, but in an
interactive computing environment it may be quite adequate.
Lexical Errors
• Other possible error-recovery actions are:
– Delete one character from the remaining input.
– Insert a missing character into the remaining input.
– Replace a character by another character.
– Transpose two adjacent characters.
Input Buffering
• Let us examine some ways that the simple but important task
of reading the source program can be sped up.
• This task is made difficult by the fact that we often have to
look one or more characters beyond the next lexeme before we
can be sure we have the right lexeme.
• Thus, we shall introduce a two-buffer scheme that handles
large lookaheads safely.
• We then consider an improvement involving "sentinels" that
saves time checking for the ends of buffers.
Input Buffering
• Two-buffer input scheme to look ahead on
the input and identify tokens
• Buffer pairs
• Sentinels (Guards)
Input Buffering
• Buffer Pairs
– Because of the amount of time taken to process
characters and the large number of characters that must
be processed during the compilation of a large source
program, specialized buffering techniques have been
developed to reduce the amount of overhead required to
process a single input character.
– An important scheme involves two buffers that are
alternately reloaded, as suggested in the figure.
Input Buffering
• Buffer Pairs
– Each buffer is of the same size N, and N is usually the
size of a disk block, e.g., 4096 bytes.
– Using one system read command we can read N
characters into a buffer, rather than using one system
call per character.
– If fewer than N characters remain in the input file, then
a special character, represented by eof, marks the end
of the source file and is different from any possible
character of the source program.
Input Buffering
• Two pointers to the input are maintained:
– Pointer lexemeBegin, marks the beginning of the current lexeme,
whose extent we are attempting to determine.
– Pointer forward scans ahead until a pattern match is found
• Once the next lexeme is determined, forward is set to the
character at its right end.
• Then, after the lexeme is recorded as an attribute value of a
token returned to the parser, lexemeBegin is set to the
character immediately after the lexeme just found.
Input Buffering
• Advancing forward requires that we first test whether we
have reached the end of one of the buffers, and if so, we must
reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer.
• As long as we never need to look so far ahead of the actual
lexeme that the sum of the lexeme's length plus the distance
we look ahead is greater than N, we shall never overwrite the
lexeme in its buffer before determining it.
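The buffer-pair scheme can be sketched as follows. This toy version models only the forward pointer (lexemeBegin and retraction are omitted), substitutes a string for the source file, and uses a tiny N so the reloads are actually exercised; all names are illustrative:

```c
#include <assert.h>
#include <string.h>

#define N 8   /* toy buffer size; a real scanner would use e.g. 4096 */

static char buf[2][N + 1];   /* buffer pair; slot N holds the sentinel */
static const char *src;      /* stands in for the source file */
static int cur = 0;          /* buffer that forward currently points into */
static int fwd = 0;          /* index of forward within buf[cur] */

/* Fill buffer b with the next (at most) N characters of the "file",
   writing the sentinel '\0' after the last one. */
static void reload(int b) {
    int i = 0;
    while (i < N && *src)
        buf[b][i++] = *src++;
    buf[b][i] = '\0';
}

void buffers_init(const char *source) {
    src = source;
    cur = fwd = 0;
    reload(0);
}

/* Advance forward by one character; returns '\0' at end of input.
   A sentinel at index N means "end of buffer": switch to the other
   buffer; a sentinel before index N means real end of input. */
char next_char(void) {
    if (buf[cur][fwd] == '\0' && fwd == N) {
        cur = 1 - cur;
        reload(cur);
        fwd = 0;
    }
    return buf[cur][fwd] ? buf[cur][fwd++] : '\0';
}
```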
Specification of Tokens
• Regular expressions are an important notation for
specifying lexeme patterns.
• While they cannot express all possible patterns, they are
very effective in specifying those types of patterns that we
actually need for tokens
Specification of Patterns for
Tokens: Terminology
• Strings and Languages
– An alphabet Σ is a finite set of symbols
(characters)
• Typical examples of symbols are letters, digits, and
punctuation. The set {0,1} is the binary alphabet.
• ASCII is an important example of an alphabet; it is
used in many software systems.
• Unicode, which includes approximately 100,000
characters from alphabets around the world, is
another important example of an alphabet.
Specification of Patterns for
Tokens: Terminology
• Strings and Languages
– A string s is a finite sequence of symbols from Σ
• |s| denotes the length of string s
• ε denotes the empty string, thus |ε| = 0
– A language is a specific set of strings over
some fixed alphabet Σ
Specification of Patterns for
Tokens: String Operations
• The concatenation of two strings x and y is
denoted by xy
• The exponentiation of a string s is defined
by
s^0 = ε
s^i = s^(i-1)s for i > 0
(note that s^1 = s, and εs = sε = s)
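The definition above translates directly into code. Below is a minimal sketch in C; the helper name str_exp is hypothetical, chosen only to mirror the recurrence s^i = s^(i-1)s:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* s^i: i copies of s concatenated; s^0 is the empty string ε. */
char *str_exp(const char *s, int i) {
    size_t n = strlen(s);
    char *r = malloc(n * (size_t)i + 1);
    r[0] = '\0';
    while (i-- > 0)
        strcat(r, s);   /* r = r·s, i.e. s^k = s^(k-1)s */
    return r;
}
```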
Specification of Patterns for
Tokens: Language Operations
• Union
L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
L^0 = {ε}; L^i = L^(i-1)L
• Kleene closure
L* = ⋃ i=0,…,∞ L^i
• Positive closure
L+ = ⋃ i=1,…,∞ L^i
Example
• Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D be
the set of digits {0, 1, . . . , 9}. We may think of L and D in two,
essentially equivalent, ways. One way is that L and D are, respectively, the
alphabets of uppercase and lowercase letters and of digits. The second
way is that L and D are languages, all of whose strings happen to be of
length one. Here are some other languages that can be constructed from
languages L and D.
• L ∪ D is the set of letters and digits - strictly speaking the language with
62 strings of length one, each of which strings is either one letter or one
digit
• LD is the set of 520 strings of length two, each consisting of one letter
followed by one digit.
• L^4 is the set of all 4-letter strings.
• L* is the set of all strings of letters, including ε, the empty string.
• L(L ∪ D)* is the set of all strings of letters and digits beginning with a
letter.
• D+ is the set of all strings of one or more digits.
Specification of Patterns for
Tokens: Regular Expressions
• Basis symbols:
– ε is a regular expression denoting language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
– r | s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is
called a regular set
Example
• Let Σ = {a, b}.
– The regular expression a | b denotes the language {a, b}.
– (a | b)(a | b) denotes {aa, ab, ba, bb}, the language of all strings of
length two over the alphabet Σ. Another regular expression for the same
language is aa | ab | ba | bb.
– a* denotes the language consisting of all strings of zero or more
a's, that is, {ε, a, aa, aaa, . . . }.
– (a | b)* denotes the set of all strings consisting of zero or more
instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab,
ba, bb, aaa, . . .}. Another regular expression for the same
language is (a*b*)*.
– a | a*b denotes the language {a, b, ab, aab, aaab, . . .}, that is, the
string a and all strings consisting of zero or more a's and ending in
b.
Specification of Patterns for
Tokens: Regular Definitions
• For notational convenience, we may wish to give names to
certain regular expressions and use those names in subsequent
expressions, as if the names were themselves symbols.
• If Σ is an alphabet of basic symbols, then a regular definition is
a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over
Σ ∪ {d1, d2, …, di-1}
• Each di is a new symbol, not in Σ and not the same as any other
Example
• C identifiers are strings of letters, digits, and underscores.
Here is a regular definition for the language of C
identifiers:
letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ ( letter_ | digit )*
Example
• Unsigned numbers (integer or floating point) are strings
such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular
definition:
digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
Specification of Patterns for
Tokens: Notational Shorthands
• We frequently use the following shorthands:
r+ = rr*
r? = r | ε  (the unary postfix operator ? means "zero or one occurrence")
[a-z] = a | b | c | … | z
• For example:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Exercise
• Describe the languages denoted by the following regular
expressions:
Regular Definitions and Grammars

Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num

Regular definitions
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Implementing a Scanner Using Transition Diagrams

relop → < | <= | <> | > | >= | =

  state 0 (start):
    '<' → state 1          '=' → state 5: return(relop, EQ)
    '>' → state 6
  state 1:
    '=' → state 2: return(relop, LE)
    '>' → state 3: return(relop, NE)
    other → state 4*: return(relop, LT)
  state 6:
    '=' → state 7: return(relop, GE)
    other → state 8*: return(relop, GT)

  (a * marks a state that retracts the last character read)

id → letter ( letter | digit )*

  state 9 (start):
    letter → state 10
  state 10:
    letter or digit → state 10
    other → state 11*: return(gettoken(), install_id())
Implementing a Scanner Using Transition Diagrams (Code)

token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1:
      …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
    …
} } }

int fail()   /* decides what other start state is applicable */
{ forward = token_beginning;
  switch (start) {
  case 0:  start = 9;  break;
  case 9:  start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover(); break;
  default: /* compiler error */
  }
  return start;
}
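A compact, runnable rendering of the relop transition diagram (the id diagram would be handled analogously). The function name and the retract-as-shorter-lexeme treatment are illustrative choices, not from any particular compiler:

```c
#include <assert.h>

enum relop_t { LT, LE, EQ, NE, GT, GE, NONE };

/* Direct encoding of the relop transition diagram.  States marked '*'
   retract one character, which here just means the lexeme is shorter;
   *len receives the length of the matched lexeme. */
enum relop_t relop(const char *s, int *len) {
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }  /* states 0-1-2 */
        if (s[1] == '>') { *len = 2; return NE; }  /* states 0-1-3 */
        *len = 1; return LT;                       /* state 4: retract */
    case '=':
        *len = 1; return EQ;                       /* state 5 */
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }  /* states 0-6-7 */
        *len = 1; return GT;                       /* state 8: retract */
    default:
        *len = 0; return NONE;
    }
}
```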

The Lex and Flex Scanner
Generators
• Lex and its newer cousin flex are scanner
generators
• Systematically translate regular definitions
into C source code for efficient scanning
• Generated code is easy to integrate in C
applications
Creating a Lexical Analyzer with Lex and Flex

  lex source program ──► lex or flex ──► lex.yy.c
       (lex.l)            compiler

  lex.yy.c ──► C compiler ──► a.out

  input stream ──► a.out ──► sequence of tokens
Lex Specification
• A lex specification consists of three parts:
regular definitions, C declarations in %{ %}
%%
translation rules
%%
user-defined auxiliary procedures
• The translation rules are of the form:
p1 { action1 }
p2 { action2 }
…
pn { actionn }
Regular Expressions in Lex

x        match the character x
\.       match the character .
"string" match contents of string of characters
.        match any character except newline
^        match beginning of a line
$        match the end of a line
[xyz]    match one character x, y, or z (use \ to escape -)
[^xyz]   match any character except x, y, and z
[a-z]    match one of a to z
r*       closure (match zero or more occurrences)
r+       positive closure (match one or more occurrences)
r?       optional (match zero or one occurrence)
r1r2     match r1 then r2 (concatenation)
r1|r2    match r1 or r2 (union)
(r)      grouping
r1/r2    match r1 when followed by r2
{d}      match the regular expression defined by d
Lex actions
• BEGIN: Indicates the start state. The lexical analyzer starts at
state 0.
• ECHO: Emits the input as it is.
• yytext: When the lexer matches or recognizes a token from the
input, the lexeme is stored in a null-terminated string called
yytext.
• yylex(): As soon as a call to yylex() is encountered, the scanner
starts scanning the source program.
• yywrap(): Called when the scanner encounters end of file. If it
returns 0 the scanner continues scanning with new input; if it
returns 1 the scanner terminates.
• yyin: The standard input file pointer from which the scanner reads
the input source program.
• yyleng: Stores the length (number of characters) of the lexeme in
yytext; its value is the same as strlen(yytext).
Installing Software
• Download Flex 2.5.4a
• Download Bison 2.4.1
• Download DevC++
• Install Flex at "C:\GnuWin32"
• Install Bison at "C:\GnuWin32"
• Install DevC++ at "C:\Dev-Cpp"
• Open Environment Variables (for Windows 8)
– Add "C:\GnuWin32\bin;C:\Dev-Cpp\bin;" to Path.
Example Lex Specification 1

%{
#include <stdio.h>
%}
%%
[0-9]+   { printf("%s\n", yytext); }   /* yytext contains the matching lexeme */
.|\n     { }
%%
int main( )
{
  yylex( );   /* invokes the lexical analyzer */
}
int yywrap( )
{
  return 1;
}

Build and run:
  lex spec.l
  gcc lex.yy.c -ll
  ./a.out < spec.l
Execution of Lex Specification 1
(screenshot: only the digits in the input are printed)
Example Lex Specification 2

%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim [ \t]+
%%
\n        { ch++; wd++; nl++; }
^{delim}  { ch+=yyleng; }
{delim}   { ch+=yyleng; wd++; }
.         { ch++; }
%%
main()
{ yylex();
  printf("%8d%8d%8d\n", nl, wd, ch);
}
Example Lex Specification 3

%{
#include <stdio.h>
%}
digit  [0-9]
letter [A-Za-z]
id     {letter}({letter}|{digit})*
%%
{digit}+  { printf("number: %s\n", yytext); }
{id}      { printf("ident: %s\n", yytext); }
.         { printf("other: %s\n", yytext); }
%%
main()
{ yylex();
}
Example Lex Specification 4

%{ /* definitions of manifest constants */
#define LT (256)
…
%}
delim  [ \t\n]
ws     {delim}+
letter [A-Za-z]
digit  [0-9]
id     {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}      { }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { yylval = install_id(); return ID; }      /* token attribute */
{number}  { yylval = install_num(); return NUMBER; }
"<"       { yylval = LT; return RELOP; }
"<="      { yylval = LE; return RELOP; }
"="       { yylval = EQ; return RELOP; }
"<>"      { yylval = NE; return RELOP; }
">"       { yylval = GT; return RELOP; }
">="      { yylval = GE; return RELOP; }
%%
int install_id()   /* installs yytext as an identifier in the symbol table */
…
Lex Program
(Write a program in Lex to identify identifier and keyword in a sentence.)
%{
#include<stdio.h>
static int key_word=0;
static int identifier=0;
%}
%%
"include"|"for"|"define" {key_word++;printf("keyword found");}
"int"|"char"|"float"|"double" {identifier++;printf("identifier found");}
%%
int main()
{
printf("enter the sentence");
yylex();
printf("keyword are: %d\n and identifier are:%d\n",key_word,identifier);
}
int yywrap()
{
return 1;
}
Design of a Lexical Analyzer Generator

• Translate regular expressions to NFA
• Translate NFA to an efficient DFA (optional)

  regular expressions ──► NFA ──(optional)──► DFA
                           │                   │
                      simulate NFA        simulate DFA
                      to recognize        to recognize
                         tokens              tokens
Nondeterministic Finite Automata

• Definition: an NFA is a 5-tuple (S, Σ, δ, s0, F)
where
  S is a finite set of states
  Σ is a finite set of input symbols (the alphabet)
  δ is a mapping from S × (Σ ∪ {ε}) to sets of states
  s0 ∈ S is the start state
  F ⊆ S is the set of accepting (or final) states
Transition Graph

• An NFA can be diagrammatically represented by a labeled
directed graph called a transition graph

         a,b
        ┌───┐
        ▼   │    a         b         b
  start ►(0)┘ ──────► (1) ──────► (2) ──────► ((3))

  S = {0,1,2,3}    Σ = {a,b}    s0 = 0    F = {3}
Transition Table

• The mapping δ of an NFA can be represented in a transition table

  δ(0,a) = {0,1}      State |   a    |  b
  δ(0,b) = {0}          0   | {0,1}  | {0}
  δ(1,b) = {2}          1   |        | {2}
  δ(2,b) = {3}          2   |        | {3}
The Language Defined by an
NFA
• An NFA accepts an input string x iff there is some
path with edges labeled with symbols from x in
sequence from the start state to some accepting
state in the transition graph
• A state transition from one state to another on the
path is called a move
• The language defined by an NFA is the set of input
strings it accepts, such as (a|b)*abb for the
example NFA
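The transition table of this NFA can be simulated directly by tracking the set of states reachable so far; in the sketch below the sets are encoded as bit masks (bit s = state s), and input is assumed to range over {a, b} only:

```c
#include <assert.h>

/* NFA for (a|b)*abb over states {0,1,2,3}.
   delta[s][x]: successor set of state s on symbol x (0='a', 1='b'). */
static const unsigned delta[4][2] = {
    { 0x3, 0x1 },  /* δ(0,a)={0,1}  δ(0,b)={0} */
    { 0x0, 0x4 },  /* δ(1,b)={2} */
    { 0x0, 0x8 },  /* δ(2,b)={3} */
    { 0x0, 0x0 },
};

/* Simulate the NFA: the current "state" is a set of NFA states. */
int nfa_accepts(const char *w) {
    unsigned S = 0x1;                      /* start set {0} */
    for (; *w; w++) {
        unsigned T = 0;
        for (int s = 0; s < 4; s++)
            if (S & (1u << s))
                T |= delta[s][*w == 'b'];  /* union of all moves */
        S = T;
    }
    return (S & 0x8) != 0;                 /* does S contain state 3? */
}
```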
Design of a Lexical Analyzer Generator: RE to NFA to DFA

  Lex specification with regular expressions:
    p1 { action1 }
    p2 { action2 }
    …
    pn { actionn }

  NFA: a new start state s0 with ε-transitions to the sub-NFAs
  N(p1), N(p2), …, N(pn); the accepting state of each N(pi) is
  tagged with actioni

  Subset construction (optional) ──► DFA
From Regular Expression to NFA (Thompson’s Construction)

  ε:       start ►(i) ──ε──► ((f))

  a:       start ►(i) ──a──► ((f))

  r1 | r2: start ►(i) ──ε──► N(r1) ──ε──► ((f))
                  └────ε──► N(r2) ──ε────┘

  r1 r2:   start ►(i) ──► N(r1) ──► N(r2) ──► ((f))

  r*:      start ►(i) ──ε──► N(r) ──ε──► ((f))
           plus an ε-edge from i directly to f, and an ε-edge
           from the exit of N(r) back to its entry
Combining the NFAs of a Set of Regular Expressions

  a    { action1 }   start ►(1) ──a──► ((2))

  abb  { action2 }   start ►(3) ──a──► (4) ──b──► (5) ──b──► ((6))

  a*b+ { action3 }   start ►(7) ──b──► ((8))
                     with an a-loop on 7 and a b-loop on 8

  Combined NFA: a new start state 0 with ε-transitions to 1, 3, and 7
Simulating the Combined NFA: Example 1

  (accepting states: 2 → action1, 6 → action2, 8 → action3)

  Input: a a b a
    start : {0,1,3,7}
    a     : {2,4,7}
    a     : {7}
    b     : {8}
    a     : none
  Must find the longest match:
  continue until no further moves are possible.
  The last accepting state reached was 8, so execute action3.
Simulating the Combined NFA: Example 2

  Input: a b b a
    start : {0,1,3,7}
    a     : {2,4,7}
    b     : {5,8}
    b     : {6,8}
    a     : none
  When two or more accepting states are reached (here 6 → action2
  and 8 → action3), the first action given in the Lex specification
  is executed: action2.
Deterministic Finite Automata
• A deterministic finite automaton is a special case
of an NFA
– No state has an -transition
– For each state s and input symbol a there is at most one
edge labeled a leaving s
• Each entry in the transition table is a single state
– At most one path exists to accept a string
– Simulation algorithm is simple
Example DFA

• A DFA that accepts (a|b)*abb

  State |  a  |  b
    0   |  1  |  0
    1   |  1  |  2
    2   |  1  |  3
   (3)  |  1  |  0      (state 3 is accepting)
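Because each table entry is a single state, simulating this DFA takes exactly one array lookup per input character; a minimal sketch (input assumed over {a, b}):

```c
#include <assert.h>

/* Transition table of the DFA for (a|b)*abb; column 0='a', 1='b'. */
static const int dtran[4][2] = {
    { 1, 0 },   /* state 0 */
    { 1, 2 },   /* state 1 */
    { 1, 3 },   /* state 2 */
    { 1, 0 },   /* state 3 (accepting) */
};

/* Deterministic simulation: one current state, one lookup per char. */
int dfa_accepts(const char *w) {
    int s = 0;
    for (; *w; w++)
        s = dtran[s][*w == 'b'];
    return s == 3;
}
```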
Conversion of an NFA into a DFA

• The subset construction algorithm converts an
NFA into a DFA using:
  ε-closure(s) = {s} ∪ {t | s →ε … →ε t}
  ε-closure(T) = ⋃ s∈T ε-closure(s)
  move(T,a) = {t | s →a t and s ∈ T}
• The algorithm produces:
  Dstates, the set of states of the new DFA,
  consisting of sets of states of the NFA
  Dtran, the transition table of the new DFA
ε-closure and move Examples

  (on the combined NFA for a | abb | a*b+; state 0 has
  ε-transitions to 1, 3, and 7)

  ε-closure({0})      = {0,1,3,7}
  move({0,1,3,7},a)   = {2,4,7}
  ε-closure({2,4,7})  = {2,4,7}
  move({2,4,7},a)     = {7}
  ε-closure({7})      = {7}
  move({7},b)         = {8}
  ε-closure({8})      = {8}
  move({8},a)         = ∅

  On input "aaba": {0,1,3,7} → {2,4,7} → {7} → {8} → none
  Also used to simulate NFAs
Simulating an NFA using ε-closure and move

  S := ε-closure({s0})
  Sprev := ∅
  a := nextchar()
  while S ≠ ∅ do
    Sprev := S
    S := ε-closure(move(S,a))
    a := nextchar()
  end do
  if Sprev ∩ F ≠ ∅ then
    execute action in Sprev
    return “yes”
  else return “no”
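The two helper functions can be instantiated in C for the combined NFA of the earlier examples, with state sets encoded as bit masks; the values asserted below are exactly the ε-closure and move results worked out on the ε-closure and move Examples slide:

```c
#include <assert.h>

#define NSTATES 9   /* combined NFA for a | abb | a*b+, states 0..8 */

/* ε-edges and symbol edges as bit sets (bit s = state s). */
static const unsigned eps[NSTATES] = {
    0x08A, 0, 0, 0, 0, 0, 0, 0, 0    /* 0 →ε {1,3,7} */
};
static const unsigned delta[NSTATES][2] = {   /* column 0='a', 1='b' */
    {0,0}, {0x004,0}, {0,0}, {0x010,0}, {0,0x020},
    {0,0x040}, {0,0}, {0x080,0x100}, {0,0x100}
};

/* ε-closure(T): smallest superset of T closed under ε-moves. */
unsigned eps_closure(unsigned T) {
    unsigned C = T, old;
    do {
        old = C;
        for (int s = 0; s < NSTATES; s++)
            if (C & (1u << s)) C |= eps[s];
    } while (C != old);
    return C;
}

/* move(T,x): states reachable from some state in T on symbol x. */
unsigned move_on(unsigned T, int x) {
    unsigned U = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) U |= delta[s][x];
    return U;
}
```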
The Subset Construction Algorithm

  Initially, ε-closure(s0) is the only state in Dstates and it is unmarked
  while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a ∈ Σ do
      U := ε-closure(move(T,a))
      if U is not in Dstates then
        add U as an unmarked state to Dstates
      end if
      Dtran[T,a] := U
    end do
  end do
Subset Construction Example 1

  NFA for (a|b)*abb (Thompson construction, states 0–10):
    0 →ε 1, 0 →ε 7
    1 →ε 2, 1 →ε 4
    2 →a 3, 4 →b 5
    3 →ε 6, 5 →ε 6
    6 →ε 1, 6 →ε 7
    7 →a 8, 8 →b 9, 9 →b 10 (accepting)

  Resulting DFA:
    Dstates
    A = {0,1,2,4,7}        A →a B, A →b C
    B = {1,2,3,4,6,7,8}    B →a B, B →b D
    C = {1,2,4,5,6,7}      C →a B, C →b C
    D = {1,2,4,5,6,7,9}    D →a B, D →b E
    E = {1,2,4,5,6,7,10}   E →a B, E →b C   (E accepting)
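The subset construction can be run mechanically on this NFA. The sketch below encodes the ε-NFA for (a|b)*abb with bit masks (helper names are illustrative) and reproduces the Dstates A–E listed above, in order of discovery:

```c
#include <assert.h>

#define NSTATES 11   /* Thompson-style ε-NFA for (a|b)*abb, states 0..10 */

static const unsigned eps[NSTATES] = {
    0x082,            /* 0 →ε {1,7} */
    0x014,            /* 1 →ε {2,4} */
    0x000,
    0x040,            /* 3 →ε {6}   */
    0x000,
    0x040,            /* 5 →ε {6}   */
    0x082,            /* 6 →ε {1,7} */
    0x000, 0x000, 0x000, 0x000
};
static const unsigned delta[NSTATES][2] = {   /* column 0='a', 1='b' */
    {0,0},{0,0},{0x008,0},{0,0},{0,0x020},{0,0},{0,0},
    {0x100,0},{0,0x200},{0,0x400},{0,0}
};

static unsigned eclosure(unsigned T) {
    unsigned C = T, old;
    do {
        old = C;
        for (int s = 0; s < NSTATES; s++)
            if (C & (1u << s)) C |= eps[s];
    } while (C != old);
    return C;
}

static unsigned move_on(unsigned T, int x) {
    unsigned U = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) U |= delta[s][x];
    return U;
}

/* Subset construction: dstates[i] is the NFA-state set of DFA state i,
   dtran[i][x] the index of its successor; returns the DFA state count. */
int subset_construction(unsigned dstates[32], int dtran[32][2]) {
    int n = 0, t = 0;                  /* states before index t are marked */
    dstates[n++] = eclosure(0x001);    /* ε-closure({0}) */
    while (t < n) {                    /* while an unmarked T exists */
        for (int x = 0; x < 2; x++) {
            unsigned U = eclosure(move_on(dstates[t], x));
            int j;
            for (j = 0; j < n; j++)
                if (dstates[j] == U) break;
            if (j == n) dstates[n++] = U;   /* add U, unmarked */
            dtran[t][x] = j;
        }
        t++;                                /* mark T */
    }
    return n;
}
```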
Subset Construction Example 2

  (combined NFA for a { action1 }, abb { action2 }, a*b+ { action3 })

  Dstates
    A = {0,1,3,7}
    B = {2,4,7}   accepting: action1 (state 2)
    C = {8}       accepting: action3 (state 8)
    D = {7}
    E = {5,8}     accepting: action3 (state 8)
    F = {6,8}     accepting: action2 (state 6, listed first)

  Dtran
    A →a B, A →b C
    B →a D, B →b E
    C →b C
    D →a D, D →b C
    E →b F
    F →b C
Minimizing the Number of States of a DFA

  The five-state DFA for (a|b)*abb (states A–E from Example 1) has
  equivalent states A and C; merging them gives a four-state DFA:

  before:  A →a B, A →b C;  B →a B, B →b D;  C →a B, C →b C;
           D →a B, D →b E;  E →a B, E →b C
  after:   A →a B, A →b A;  B →a B, B →b D;
           D →a B, D →b E;  E →a B, E →b A
DFA (Hoff croft Algorithm)
 Algorithm: Minimizing the number of states of a DFA.
 Input: A DFA D with set of states S, input alphabet , state state s0, and set of accepting
states F.
 Output: A DFA D' accepting the same language as D and having as few states as possible.
 Method:
1. Start with an initial partition  with two groups, F and S – F, the accepting and non-
accepting states of D.
2. Apply the procedure, to construct a new partition new
Initially, let new = ;
for ( each group G of  ) do begin
partition G into subgroups such that two states s and t are in the same subgroup if and
only if for all input symbols a, states s and t have transitions on a to states in the same
group of II; /* at worst, a state will be in a subgroup by itself * /
replace G in new by the set of all subgroups formed;
end
3. If new = , let final =  and continue with step (4) . Otherwise, repeat step (2) with new in
place of new.
4. Choose one state in each group of final as the representative for that group. The
Example
• Using the above algorithm, minimize the DFA whose transition table is given below.

      STATE   a   b
        A     B   C
        B     B   D
        C     B   C
        D     B   E
        E     B   C

• Minimizing the states:
  Π0 = (ABCD) (E)
  Π1 = (ABC) (D) (E)
  Π2 = (AC) (B) (D) (E)
  Π3 = (AC) (B) (D) (E)

• Now construct the minimum-state DFA. It has four states, corresponding to the
  four groups of Π3; pick A, B, D, and E as the representatives of these groups.
  The initial state is A, and the only accepting state is E. The table below
  shows the transition function for the minimized DFA.
Example (Contd…)

Transition table of the given DFA:

      STATE   a   b
        A     B   C
        B     B   D
        C     B   C
        D     B   E
        E     B   C

Transition table of the minimized DFA:

      STATE   a   b
        A     B   A
        B     B   D
        D     B   E
        E     B   A
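The partition-refinement loop, run on this example's transition table, can be sketched as follows (illustrative Python, not from the slides):

```python
# Sketch of DFA minimization by partition refinement: start from {F, S-F}
# and split groups until no group splits further.

def minimize(states, alphabet, delta, accepting):
    """Return the final partition of `states` into groups of equivalent states."""
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        new = []
        for G in partition:
            buckets = {}   # split G by which groups each state's transitions reach
            for s in G:
                key = tuple(next(i for i, g in enumerate(partition)
                                 if delta[(s, a)] in g) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new.extend(buckets.values())
        if len(new) == len(partition):   # no group split: partition is final
            return new
        partition = new

# Transition table of the example DFA (accepting state: E):
delta = {('A', 'a'): 'B', ('A', 'b'): 'C', ('B', 'a'): 'B', ('B', 'b'): 'D',
         ('C', 'a'): 'B', ('C', 'b'): 'C', ('D', 'a'): 'B', ('D', 'b'): 'E',
         ('E', 'a'): 'B', ('E', 'b'): 'C'}
groups = minimize('ABCDE', 'ab', delta, {'E'})
print(sorted(sorted(g) for g in groups))   # [['A', 'C'], ['B'], ['D'], ['E']]
```

A and C end up in the same group, matching Π3 = (AC)(B)(D)(E) above.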
From Regular Expression to DFA
Directly

• The important states of an NFA are those without an ε-transition; that is, if
  move({s}, a) ≠ ∅ for some a, then s is an important state
• The subset construction algorithm uses only the important states when it
  determines ε-closure(move(T, a))
From Regular Expression to DFA
Directly (Algorithm)
• Augment the regular expression r with a
special end symbol # to make accepting
states important: the new expression is r#
• Construct a syntax tree for r#
• Traverse the tree to construct functions
nullable, firstpos, lastpos, and followpos
From Regular Expression to DFA
Directly: Syntax Tree of (a|b)*abb#

[Syntax tree for (a|b)*abb#, with position numbers attached to the leaves:

                 •  (concatenation)
                / \
               •   #(6)
              / \
             •   b(5)
            / \
           •   b(4)
          / \
         *   a(3)    (closure)
         |
         |           (alternation)
        / \
     a(1)  b(2)                        ]
From Regular Expression to DFA
Directly: Annotating the Tree
• nullable(n): the subtree at node n generates a
  language that includes the empty string
• firstpos(n): the set of positions that can match the first
  symbol of a string generated by the subtree at node n
• lastpos(n): the set of positions that can match the
  last symbol of a string generated by the subtree at node n
• followpos(i): the set of positions that can follow
  position i in the tree
From Regular Expression to DFA
Directly: Annotating the Tree

• Leaf ε:
    nullable = true;  firstpos = ∅;  lastpos = ∅
• Leaf with position i:
    nullable = false;  firstpos = {i};  lastpos = {i}
• Or-node n = c1 | c2:
    nullable(n) = nullable(c1) or nullable(c2)
    firstpos(n) = firstpos(c1) ∪ firstpos(c2)
    lastpos(n)  = lastpos(c1) ∪ lastpos(c2)
• Cat-node n = c1 c2:
    nullable(n) = nullable(c1) and nullable(c2)
    firstpos(n) = if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n)  = if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
• Star-node n = c1*:
    nullable(n) = true;  firstpos(n) = firstpos(c1);  lastpos(n) = lastpos(c1)
From Regular Expression to DFA
Directly: Syntax Tree of (a|b)*abb#, Annotated

[Same tree, with firstpos to the left and lastpos to the right of each node;
 only the star-node is nullable:

 root • (cat)              firstpos {1,2,3}   lastpos {6}
   ├ • (cat)               firstpos {1,2,3}   lastpos {5}
   │  ├ • (cat)            firstpos {1,2,3}   lastpos {4}
   │  │  ├ • (cat)         firstpos {1,2,3}   lastpos {3}
   │  │  │  ├ * (nullable) firstpos {1,2}     lastpos {1,2}
   │  │  │  │  └ |         firstpos {1,2}     lastpos {1,2}
   │  │  │  │     ├ a(1)   firstpos {1}       lastpos {1}
   │  │  │  │     └ b(2)   firstpos {2}       lastpos {2}
   │  │  │  └ a(3)         firstpos {3}       lastpos {3}
   │  │  └ b(4)            firstpos {4}       lastpos {4}
   │  └ b(5)               firstpos {5}       lastpos {5}
   └ #(6)                  firstpos {6}       lastpos {6}   ]
From Regular Expression to DFA
Directly: followpos

for each node n in the tree do
    if n is a cat-node with left child c1 and right child c2 then
        for each i in lastpos(c1) do
            followpos(i) := followpos(i) ∪ firstpos(c2)
        end do
    else if n is a star-node then
        for each i in lastpos(n) do
            followpos(i) := followpos(i) ∪ firstpos(n)
        end do
    end if
end do
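The annotation rules and the followpos rules above can be sketched together in Python (illustrative; the tuple encoding of the syntax tree for (a|b)*abb# is an assumption of this sketch, not from the slides):

```python
# Syntax-tree nodes: ('leaf', pos, sym), ('or', c1, c2), ('cat', c1, c2), ('star', c1).
# annotate() returns (nullable, firstpos, lastpos) and fills in followpos.

followpos = {i: set() for i in range(1, 7)}

def annotate(n):
    kind = n[0]
    if kind == 'leaf':
        _, pos, _ = n
        return False, {pos}, {pos}
    if kind == 'or':
        n1, f1, l1 = annotate(n[1])
        n2, f2, l2 = annotate(n[2])
        return n1 or n2, f1 | f2, l1 | l2
    if kind == 'cat':
        n1, f1, l1 = annotate(n[1])
        n2, f2, l2 = annotate(n[2])
        for i in l1:                         # cat-node rule for followpos
            followpos[i] |= f2
        return (n1 and n2,
                f1 | f2 if n1 else f1,       # firstpos rule for cat
                l2 | l1 if n2 else l2)       # lastpos rule for cat
    n1, f1, l1 = annotate(n[1])              # star-node
    for i in l1:                             # star-node rule for followpos
        followpos[i] |= f1
    return True, f1, l1

# (a|b)*abb# with leaf positions 1..6:
tree = ('cat', ('cat', ('cat', ('cat',
        ('star', ('or', ('leaf', 1, 'a'), ('leaf', 2, 'b'))),
        ('leaf', 3, 'a')), ('leaf', 4, 'b')), ('leaf', 5, 'b')), ('leaf', 6, '#'))
root = annotate(tree)
print(root)        # (False, {1, 2, 3}, {6}), matching the annotated tree
print(followpos)   # matches the followpos table of the example
```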
From Regular Expression to DFA
Directly: Algorithm

s0 := firstpos(root), where root is the root of the syntax tree
Dstates := {s0}, with s0 unmarked
while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a ∈ Σ do
        let U be the set of positions that are in followpos(p)
            for some position p in T,
            such that the symbol at position p is a
        if U is not empty and not in Dstates then
            add U as an unmarked state to Dstates
        end if
        Dtran[T, a] := U
    end do
end do
From Regular Expression to DFA
Directly: Example

Node   followpos
 1     {1, 2, 3}
 2     {1, 2, 3}
 3     {4}
 4     {5}
 5     {6}
 6     ∅

[Resulting DFA (start state {1,2,3}; accepting state {1,2,3,6}):
  {1,2,3}   -a-> {1,2,3,4}    {1,2,3}   -b-> {1,2,3}
  {1,2,3,4} -a-> {1,2,3,4}    {1,2,3,4} -b-> {1,2,3,5}
  {1,2,3,5} -a-> {1,2,3,4}    {1,2,3,5} -b-> {1,2,3,6}
  {1,2,3,6} -a-> {1,2,3,4}    {1,2,3,6} -b-> {1,2,3}    ]
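Feeding the followpos table above into the Dstates/Dtran algorithm reproduces this four-state DFA (illustrative Python sketch, not from the slides):

```python
# Direct RE -> DFA construction for (a|b)*abb#, driven by the followpos table.
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
sym = {1: 'a', 2: 'b', 3: 'a', 4: 'b', 5: 'b'}   # position 6 is the end marker #

s0 = frozenset({1, 2, 3})                # firstpos of the root
dstates, unmarked, dtran = {s0}, [s0], {}
while unmarked:
    T = unmarked.pop()                   # mark T
    for a in 'ab':
        U = set()
        for p in T:                      # union of followpos(p) over positions
            if sym.get(p) == a:          # in T whose symbol is a
                U |= followpos[p]
        U = frozenset(U)
        if U and U not in dstates:
            dstates.add(U)
            unmarked.append(U)
        dtran[(T, a)] = U

accepting = {S for S in dstates if 6 in S}   # states containing the # position
print(len(dstates), sorted(next(iter(accepting))))   # prints: 4 [1, 2, 3, 6]
```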
Time-Space Tradeoffs

Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r|·|x|)
DFA         O(2^|r|)             O(|x|)
Compiler Construction Tools
These tools use specialized languages for specifying
and implementing specific components, and many
use quite sophisticated algorithms.

• Parser generators: automatically produce a
syntax analyzer from a grammatical description of
a programming language.

• Scanner generators: produce a lexical analyzer
from a regular-expression description of the
tokens of a language.
Compiler Construction Tools
• Syntax-directed translation engines: produce a
collection of routines for walking a parse tree
and generating intermediate code.

• Code-generator generators: produce a code
generator from a collection of rules for
translating each operation of the intermediate
language into the machine language for a
target machine.
Compiler Construction Tools
• Data-flow analysis engines: facilitate the
gathering of information about how values are
transmitted from one part of a program to every
other part. Data-flow analysis is a key part of
code optimization.

• Compiler construction toolkits: provide an
integrated set of routines for constructing the
various phases of a compiler.