Unit - I

Lexical Analysis and Lexical Analyzer Generators

Overview and History


• Cause
– Software for early computers was written in assembly
language
– The benefits of reusing software on different CPUs
started to become significantly greater than the cost of
writing a compiler

• The first real compiler
– The FORTRAN compilers of the late 1950s

Overview and History (Contd…)


• Compiler technology
– is more broadly applicable and has been
employed in rather unexpected areas.
• Text-formatting languages, like nroff and
troff; preprocessor packages like eqn, tbl, pic
• Silicon compiler for the creation of VLSI
circuits
• Command languages of OS
• Query languages of Database systems

What Do Compilers Do
• Compilers may generate three types of code:
– Pure Machine Code
• Uses the machine instruction set only, without assuming the
existence of any operating system or library
• Mostly used for OS kernels or embedded applications
– Augmented Machine Code
• Code that relies on OS routines and runtime support routines
• The most common case
– Virtual Machine Code
• Virtual instructions that can be run on any architecture with
a virtual machine interpreter or a just-in-time compiler
• Ex. Java

What Do Compilers Do (Contd…)


• Another way that compilers differ from one
another is in the format of the target
machine code they generate:
– Assembly or other source format
– Relocatable binary
• Relative address
• A linkage step is required
– Absolute binary
• Absolute address
• Can be executed directly

Compilers and Interpreters


• “Compilation”
– Translation of a program written in a source
language into a semantically equivalent
program written in a target language

  Source Program ──► Compiler ──► Target Program
                        │
                  Error messages

  Input ──► Target Program ──► Output

Compilers
• Source languages: Fortran, Pascal, C, etc.
• Target languages: another programming language, machine language
• Compilers:
– Single-pass
– Multi-pass
– Load-and-Go
– Debugging
– Optimizing

Compilers and Interpreters (Contd…)

• “Interpretation”
– Performing the operations implied by the
source program

  Source Program ──┐
                   ├──► Interpreter ──► Output
  Input ───────────┘         │
                       Error messages

Other Tools that Use the Analysis-Synthesis Model
• Editors (syntax highlighting)
• Pretty printers (e.g. doxygen)
• Static checkers (e.g. lint and splint)
• Interpreters
• Text formatters (e.g. TeX and LaTeX)
• Silicon compilers (e.g. VHDL)
• Query interpreters/compilers (Databases)

Preprocessors, Compilers, Assemblers, and Linkers

  Skeletal Source Program
          │
     Preprocessor
          │
    Source Program
          │
      Compiler              (try for example: gcc -v myprog.c)
          │
  Target Assembly Program
          │
      Assembler
          │
  Relocatable Object Code
          │
       Linker  ◄── Libraries and Relocatable Object Files
          │
  Absolute Machine Code

The Analysis-Synthesis Model of Compilation
• There are two parts to compilation:
– Analysis determines the operations implied by the
source program which are recorded in a tree
structure
– Synthesis takes the tree structure and translates the
operations therein into the target program

Phases of Compiler

The Structure of a Compiler

  Source Program          Tokens          Syntactic Structure
  (Character Stream) ──► Scanner ──► Parser ──► Semantic Routines
                                                       │
                                         Intermediate Representation
                                                       │
  Symbol and Attribute Tables                      Optimizer
  (used by all phases of the compiler)                 │
                                                 Code Generator
                                                       │
                                              Target machine code

The Structure of a Compiler (Contd…)

Scanner
 The scanner begins the analysis of the source program by reading
  the input, character by character, and grouping characters into
  individual words and symbols (tokens)
 RE ( Regular Expression )
 NFA ( Non-deterministic Finite Automata )
 DFA ( Deterministic Finite Automata )
 LEX

The Structure of a Compiler (Contd…)

Parser
 Given a formal syntax specification (typically as a context-free
  grammar [CFG]), the parser reads tokens and groups them into units
  as specified by the productions of the CFG being used.
 As syntactic structure is recognized, the parser either calls
  corresponding semantic routines directly or builds a syntax tree.
 CFG ( Context-Free Grammar )
 BNF ( Backus-Naur Form )
 GAA ( Grammar Analysis Algorithms )
 LL, LR, SLR, LALR Parsers
 YACC

The Structure of a Compiler (Contd…)

Semantic Routines
 Perform two functions
   Check the static semantics of each construct
   Do the actual translation
 The heart of a compiler
 Syntax Directed Translation
 Semantic Processing Techniques
 IR ( Intermediate Representation )

The Structure of a Compiler (Contd…)

Optimizer
 The IR code generated by the semantic routines is analyzed and
  transformed into functionally equivalent but improved IR code
 This phase can be very complex and slow
 Peephole optimization
 Loop optimization, register allocation, code scheduling
 Register and Temporary Management

The Structure of a Compiler (Contd...)

Code Generator
 Interpretive Code Generation
 Generating Code from Trees/DAGs
 Grammar-Based Code Generation

Symbol-table Management
• To record the identifiers in source program
– Identifier is detected by lexical analysis and then is
stored in symbol table
• To collect the attributes of identifiers
(not by lexical analysis)
– Storage allocation : memory address
– Types
– Scope (where it is valid, local or global)
– Arguments (in case of procedure names)
• Number and types of arguments
• Call by reference or by value
• Return types

Symbol-table Management

• Semantic analysis uses type information to check the type
consistency of identifiers
• Code generation uses storage-allocation information to generate
code with proper relocation addresses

Error Detection and Reporting


• Syntax and semantic analysis handle a large fraction
of errors
• Errors may occur in any compilation phase:
– Lexical phase: characters that could not form any token
• e.g. misspelling or juxtaposing of characters
– Syntax phase: tokens that violate structure rules
• e.g. unbalanced parentheses, missing punctuation
– Semantic phase: operations with no meaning
• e.g. adding an array name and a procedure name, undeclared
variables, truncation of results, unreachable code

Translation of A Statement
(figures: a sample statement traced through each compiler phase; slides not reproduced)
The Reason Why Lexical Analysis is a
Separate Phase (Issues in Lexical Analysis)
• Simplifies the design of the compiler
– e.g. parsing with one symbol of lookahead (LL(1) or LR(1)) would
not be possible if the parser also had to deal with whitespace
and comments
• Provides efficient implementation
– Systematic techniques to implement lexical analyzers
by hand or automatically
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character
encodings can be more easily translated
Interaction of the Lexical Analyzer with the Parser

  Source ──► Lexical ──── token, tokenval ────► Parser
  Program    Analyzer ◄─── get next token ─────
                │                                 │
              error        Symbol Table         error
Attributes of Tokens

  y := 31 + 28*x    ──► Lexical analyzer ──►

  <id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
      │   │
  token   tokenval (token attribute)    ──► Parser
Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional
attribute value. The token name is an abstract symbol representing
a kind of lexical unit, e.g., a particular keyword, or a sequence of
input characters denoting an identifier. The token names are the
input symbols that the parser processes.
– For example: id and num
• A lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical
analyzer as an instance of that token.
– For example: abc and 123
Tokens, Patterns, and Lexemes
• A pattern is a description of the form that the lexemes of a
token may take. In the case of a keyword as a token, the
pattern is just the sequence of characters that form the
keyword. For identifiers and some other tokens, the pattern
is a more complex structure that is matched by many
strings.
– For example: “letter followed by letters and digits” and “non-empty
sequence of digits”
Example
• Consider the Pascal statement
– const pi = 3.1416;
– The substring pi is a lexeme for the token identifier
Tokens, Patterns, and Lexemes
• In many programming languages, the following classes
cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the
keyword itself.
2. Tokens for the operators, either individually or in classes such as the token
comparison
3. One token representing all identifiers
4. One or more tokens representing constants, such as numbers and literal
strings
5. Tokens for each punctuation symbol, such as left and right parentheses,
comma, and semicolon.
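As a sketch of how classes 1 and 3 interact in practice: because the pattern for a keyword is the keyword itself, a scanner can match the general identifier pattern and then consult an exact-match keyword table, falling back to the catch-all id token. The token names and table below are illustrative, not from any particular compiler:

```c
#include <assert.h>
#include <string.h>

/* Illustrative token names; keyword tokens are contiguous. */
enum { TOK_IF, TOK_THEN, TOK_ELSE, TOK_ID };

/* Classify a lexeme that already matched the identifier pattern. */
int classify(const char *lexeme) {
    static const char *kw[] = { "if", "then", "else" };
    for (int i = 0; i < 3; i++)
        if (strcmp(lexeme, kw[i]) == 0)
            return TOK_IF + i;   /* exact keyword match */
    return TOK_ID;               /* one token for all identifiers */
}
```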
Exercise
Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical
analyzer must provide the subsequent compiler phases
additional information about the particular lexeme that
matched.
– For example, the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was found
in the source program.
• Thus, in many cases the lexical analyzer returns to the parser
not only a token name, but an attribute value that describes
the lexeme represented by the token; the token name
influences parsing decisions, while the attribute value
influences translation of tokens after the parse.
Attributes for Tokens
• We shall assume that tokens have at most one associated
attribute, although this attribute may have a structure that
combines several pieces of information.
• The most important example is the token id, where we
need to associate with the token a great deal of
information.
• Normally, information about an identifier - e.g., its lexeme,
its type, and the location at which it is first found (in case
an error message about that identifier must be issued) - is
kept in the symbol table.
• Thus, the appropriate attribute value for an identifier is a
pointer to the symbol-table entry for that identifier.
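A minimal sketch in C of this arrangement; the entry fields and the make_id helper are hypothetical, chosen only to show the token-name/attribute split and the reuse of one entry per identifier:

```c
#include <assert.h>
#include <string.h>

/* A symbol-table entry collects what is known about one identifier. */
struct sym_entry {
    char lexeme[32];
    int  first_line;          /* where the identifier was first seen */
};

/* A token pairs a token name with an attribute; for id the attribute
   is a pointer into the symbol table, not the raw lexeme. */
struct token {
    int name;                 /* e.g. TOK_ID */
    struct sym_entry *entry;  /* attribute value */
};

enum { TOK_ID = 1 };

static struct sym_entry symtab[100];
static int nsyms = 0;

/* Return an id token, installing the lexeme on first sight. */
struct token make_id(const char *lexeme, int line) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return (struct token){ TOK_ID, &symtab[i] };
    strcpy(symtab[nsyms].lexeme, lexeme);
    symtab[nsyms].first_line = line;
    return (struct token){ TOK_ID, &symtab[nsyms++] };
}
```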
Example
Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of
other components, that there is a source-code error.
– Example: fi(a==f(x)) …
• fi is a valid lexeme for the token id to the parser.
• However, suppose a situation arises in which the lexical
analyzer is unable to proceed because none of the patterns
for tokens matches any prefix of the remaining input.
• The simplest recovery strategy is "panic mode" recovery.
– We delete successive characters from the remaining input, until the
lexical analyzer can find a well-formed token at the beginning of
what input is left.
– This recovery technique may confuse the parser, but in an
interactive computing environment it may be quite adequate.
Lexical Errors
• Other possible error-recovery actions are:
– Delete one character from the remaining input.
– Insert a missing character into the remaining input.
– Replace a character by another character.
– Transpose two adjacent characters.
Input Buffering
• Let us examine some ways that the simple but important task
of reading the source program can be sped up.
• This task is made difficult by the fact that we often have to
look one or more characters beyond the next lexeme before we
can be sure we have the right lexeme.
• Thus, we shall introduce a two-buffer scheme that handles
large lookaheads safely.
• We then consider an improvement involving "sentinels" that
saves time checking for the ends of buffers.
Input Buffering
• Two-buffer input scheme to look ahead on
the input and identify tokens
• Buffer pairs
• Sentinels (Guards)
Input Buffering
• Buffer Pairs
– Because of the amount of time taken to process
characters and the large number of characters that must
be processed during the compilation of a large source
program, specialized buffering techniques have been
developed to reduce the amount of overhead required to
process a single input character.
– An important scheme involves two buffers that are
alternately reloaded, as suggested in the figure.
Input Buffering
• Buffer Pairs
– Each buffer is of the same size N, and N is usually the
size of a disk block, e.g., 4096 bytes.
– Using one system read command we can read N
characters into a buffer, rather than using one system
call per character.
– If fewer than N characters remain in the input file, then
a special character, represented by eof, marks the end
of the source file and is different from any possible
character of the source program.
Input Buffering
• Two pointers to the input are maintained:
– Pointer lexemeBegin, marks the beginning of the current lexeme,
whose extent we are attempting to determine.
– Pointer forward scans ahead until a pattern match is found
• Once the next lexeme is determined, forward is set to the
character at its right end.
• Then, after the lexeme is recorded as an attribute value of a
token returned to the parser, lexemeBegin is set to the
character immediately after the lexeme just found.
Input Buffering
• Advancing forward requires that we first test whether we
have reached the end of one of the buffers, and if so, we must
reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer.
• As long as we never need to look so far ahead of the actual
lexeme that the sum of the lexeme's length plus the distance
we look ahead is greater than N, we shall never overwrite the
lexeme in its buffer before determining it.
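The buffer-pair scheme can be sketched as follows. This toy version models only the forward pointer (lexemeBegin and retraction are omitted), substitutes a string for the source file, and uses a tiny N so the reloads are actually exercised; all names are illustrative:

```c
#include <assert.h>
#include <string.h>

#define N 8   /* toy buffer size; a real scanner would use e.g. 4096 */

static char buf[2][N + 1];   /* buffer pair; slot N holds the sentinel */
static const char *src;      /* stands in for the source file */
static int cur = 0;          /* buffer that forward currently points into */
static int fwd = 0;          /* index of forward within buf[cur] */

/* Fill buffer b with the next (at most) N characters of the "file",
   writing the sentinel '\0' after the last one. */
static void reload(int b) {
    int i = 0;
    while (i < N && *src)
        buf[b][i++] = *src++;
    buf[b][i] = '\0';
}

void buffers_init(const char *source) {
    src = source;
    cur = fwd = 0;
    reload(0);
}

/* Advance forward by one character; returns '\0' at end of input.
   A sentinel at index N means "end of buffer": switch to the other
   buffer; a sentinel before index N means real end of input. */
char next_char(void) {
    if (buf[cur][fwd] == '\0' && fwd == N) {
        cur = 1 - cur;
        reload(cur);
        fwd = 0;
    }
    return buf[cur][fwd] ? buf[cur][fwd++] : '\0';
}
```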
Specification of Tokens
• Regular expressions are an important notation for
specifying lexeme patterns.
• While they cannot express all possible patterns, they are
very effective in specifying those types of patterns that we
actually need for tokens
Specification of Patterns for
Tokens: Terminology
• Strings and Languages
– An alphabet Σ is a finite set of symbols
(characters)
• Typical examples of symbols are letters, digits, and
punctuation. The set {0,1} is the binary alphabet.
• ASCII is an important example of an alphabet; it is
used in many software systems.
• Unicode, which includes approximately 100,000
characters from alphabets around the world, is
another important example of an alphabet.
Specification of Patterns for
Tokens: Terminology
• Strings and Languages
– A string s is a finite sequence of symbols from Σ
• |s| denotes the length of string s
• ε denotes the empty string, thus |ε| = 0
– A language is a specific set of strings over
some fixed alphabet Σ
Specification of Patterns for
Tokens: String Operations
• The concatenation of two strings x and y is
denoted by xy
• The exponentiation of a string s is defined
by
s^0 = ε
s^i = s^(i-1)s for i > 0
(note that s^1 = s, and εs = sε = s)
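The definition above translates directly into code. Below is a minimal sketch in C; the helper name str_exp is hypothetical, chosen only to mirror the recurrence s^i = s^(i-1)s:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* s^i: i copies of s concatenated; s^0 is the empty string ε. */
char *str_exp(const char *s, int i) {
    size_t n = strlen(s);
    char *r = malloc(n * (size_t)i + 1);
    r[0] = '\0';
    while (i-- > 0)
        strcat(r, s);   /* r = r·s, i.e. s^k = s^(k-1)s */
    return r;
}
```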
Specification of Patterns for
Tokens: Language Operations
• Union
L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
L^0 = {ε}; L^i = L^(i-1)L
• Kleene closure
L* = ⋃ i=0,…,∞ L^i
• Positive closure
L+ = ⋃ i=1,…,∞ L^i
Example
• Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D be
the set of digits {0, 1, . . . , 9}. We may think of L and D in two,
essentially equivalent, ways. One way is that L and D are, respectively, the
alphabets of uppercase and lowercase letters and of digits. The second
way is that L and D are languages, all of whose strings happen to be of
length one. Here are some other languages that can be constructed from
languages L and D.
• L ∪ D is the set of letters and digits - strictly speaking the language with
62 strings of length one, each of which strings is either one letter or one
digit
• LD is the set of 520 strings of length two, each consisting of one letter
followed by one digit.
• L^4 is the set of all 4-letter strings.
• L* is the set of all strings of letters, including ε, the empty string.
• L(L ∪ D)* is the set of all strings of letters and digits beginning with a
letter.
• D+ is the set of all strings of one or more digits.
Specification of Patterns for
Tokens: Regular Expressions
• Basis symbols:
– ε is a regular expression denoting language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
– r | s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is
called a regular set
Example
• Let Σ = {a, b}.
– The regular expression a | b denotes the language {a, b}.
– (a | b)(a | b) denotes {aa, ab, ba, bb}, the language of all strings of
length two over the alphabet Σ. Another regular expression for the same
language is aa | ab | ba | bb.
– a* denotes the language consisting of all strings of zero or more
a's, that is, {ε, a, aa, aaa, . . . }.
– (a | b)* denotes the set of all strings consisting of zero or more
instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab,
ba, bb, aaa, . . .}. Another regular expression for the same
language is (a*b*)*.
– a | a*b denotes the language {a, b, ab, aab, aaab, . . .}, that is, the
string a and all strings consisting of zero or more a's and ending in
b.
Specification of Patterns for
Tokens: Regular Definitions
• For notational convenience, we may wish to give names to
certain regular expressions and use those names in subsequent
expressions, as if the names were themselves symbols.
• If Σ is an alphabet of basic symbols, then a regular definition is
a sequence of definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over
Σ ∪ {d1, d2, …, di-1}
• Each di is a new symbol, not in Σ and not the same as any other
Example
• C identifiers are strings of letters, digits, and underscores.
Here is a regular definition for the language of C
identifiers:
letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ ( letter_ | digit )*
Example
• Unsigned numbers (integer or floating point) are strings
such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular
definition:
digit → 0 | 1 | … | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
Specification of Patterns for
Tokens: Notational Shorthands
• We frequently use the following shorthands:
r+ = rr*
r? = r | ε  (the unary postfix operator ? means "zero or one occurrence")
[a-z] = a | b | c | … | z
• For example:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Exercise
• Describe the languages denoted by the following regular
expressions:
Regular Definitions and Grammars

Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num

Regular definitions
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Implementing a Scanner Using Transition Diagrams

relop → < | <= | <> | > | >= | =

  state 0 (start):
    '<' → state 1          '=' → state 5: return(relop, EQ)
    '>' → state 6
  state 1:
    '=' → state 2: return(relop, LE)
    '>' → state 3: return(relop, NE)
    other → state 4*: return(relop, LT)
  state 6:
    '=' → state 7: return(relop, GE)
    other → state 8*: return(relop, GT)

  (a * marks a state that retracts the last character read)

id → letter ( letter | digit )*

  state 9 (start):
    letter → state 10
  state 10:
    letter or digit → state 10
    other → state 11*: return(gettoken(), install_id())
Implementing a Scanner Using Transition Diagrams (Code)

token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1:
      …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
    …
} } }

int fail()   /* decides what other start state is applicable */
{ forward = token_beginning;
  switch (start) {
  case 0:  start = 9;  break;
  case 9:  start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover(); break;
  default: /* compiler error */
  }
  return start;
}
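A compact, runnable rendering of the relop transition diagram (the id diagram would be handled analogously). The function name and the retract-as-shorter-lexeme treatment are illustrative choices, not from any particular compiler:

```c
#include <assert.h>

enum relop_t { LT, LE, EQ, NE, GT, GE, NONE };

/* Direct encoding of the relop transition diagram.  States marked '*'
   retract one character, which here just means the lexeme is shorter;
   *len receives the length of the matched lexeme. */
enum relop_t relop(const char *s, int *len) {
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }  /* states 0-1-2 */
        if (s[1] == '>') { *len = 2; return NE; }  /* states 0-1-3 */
        *len = 1; return LT;                       /* state 4: retract */
    case '=':
        *len = 1; return EQ;                       /* state 5 */
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }  /* states 0-6-7 */
        *len = 1; return GT;                       /* state 8: retract */
    default:
        *len = 0; return NONE;
    }
}
```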

The Lex and Flex Scanner
Generators
• Lex and its newer cousin flex are scanner
generators
• Systematically translate regular definitions
into C source code for efficient scanning
• Generated code is easy to integrate in C
applications
Creating a Lexical Analyzer with Lex and Flex

  lex source program ──► lex or flex ──► lex.yy.c
       (lex.l)            compiler

  lex.yy.c ──► C compiler ──► a.out

  input stream ──► a.out ──► sequence of tokens
Lex Specification
• A lex specification consists of three parts:
regular definitions, C declarations in %{ %}
%%
translation rules
%%
user-defined auxiliary procedures
• The translation rules are of the form:
p1 { action1 }
p2 { action2 }
…
pn { actionn }
Regular Expressions in Lex

x        match the character x
\.       match the character .
"string" match contents of string of characters
.        match any character except newline
^        match beginning of a line
$        match the end of a line
[xyz]    match one character x, y, or z (use \ to escape -)
[^xyz]   match any character except x, y, and z
[a-z]    match one of a to z
r*       closure (match zero or more occurrences)
r+       positive closure (match one or more occurrences)
r?       optional (match zero or one occurrence)
r1r2     match r1 then r2 (concatenation)
r1|r2    match r1 or r2 (union)
(r)      grouping
r1/r2    match r1 when followed by r2
{d}      match the regular expression defined by d
Lex actions
• BEGIN: Indicates the start state. The lexical analyzer starts at
state 0.
• ECHO: Emits the input as it is.
• yytext: When the lexer matches or recognizes a token from the
input, the lexeme is stored in a null-terminated string called
yytext.
• yylex(): As soon as a call to yylex() is encountered, the scanner
starts scanning the source program.
• yywrap(): Called when the scanner encounters end of file. If it
returns 0 the scanner continues scanning with new input; if it
returns 1 the scanner terminates.
• yyin: The standard input file pointer from which the scanner reads
the input source program.
• yyleng: Stores the length (number of characters) of the lexeme in
yytext; its value is the same as strlen(yytext).
Installing Software
• Download Flex 2.5.4a
• Download Bison 2.4.1
• Download DevC++
• Install Flex at "C:\GnuWin32"
• Install Bison at "C:\GnuWin32"
• Install DevC++ at "C:\Dev-Cpp"
• Open Environment Variables (for Windows 8)
– Add "C:\GnuWin32\bin;C:\Dev-Cpp\bin;" to Path.
Example Lex Specification 1

%{
#include <stdio.h>
%}
%%
[0-9]+   { printf("%s\n", yytext); }   /* yytext contains the matching lexeme */
.|\n     { }
%%
int main( )
{
  yylex( );   /* invokes the lexical analyzer */
}
int yywrap( )
{
  return 1;
}

Build and run:
  lex spec.l
  gcc lex.yy.c -ll
  ./a.out < spec.l
Execution of Lex Specification 1
(screenshot: only the digits in the input are printed)
Example Lex Specification 2

%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim [ \t]+
%%
\n        { ch++; wd++; nl++; }
^{delim}  { ch+=yyleng; }
{delim}   { ch+=yyleng; wd++; }
.         { ch++; }
%%
main()
{ yylex();
  printf("%8d%8d%8d\n", nl, wd, ch);
}
Example Lex Specification 3

%{
#include <stdio.h>
%}
digit  [0-9]
letter [A-Za-z]
id     {letter}({letter}|{digit})*
%%
{digit}+  { printf("number: %s\n", yytext); }
{id}      { printf("ident: %s\n", yytext); }
.         { printf("other: %s\n", yytext); }
%%
main()
{ yylex();
}
Example Lex Specification 4

%{ /* definitions of manifest constants */
#define LT (256)
…
%}
delim  [ \t\n]
ws     {delim}+
letter [A-Za-z]
digit  [0-9]
id     {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}      { }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { yylval = install_id(); return ID; }      /* token attribute */
{number}  { yylval = install_num(); return NUMBER; }
"<"       { yylval = LT; return RELOP; }
"<="      { yylval = LE; return RELOP; }
"="       { yylval = EQ; return RELOP; }
"<>"      { yylval = NE; return RELOP; }
">"       { yylval = GT; return RELOP; }
">="      { yylval = GE; return RELOP; }
%%
int install_id()   /* installs yytext as an identifier in the symbol table */
…
Lex Program
(Write a program in Lex to identify identifier and keyword in a sentence.)
%{
#include<stdio.h>
static int key_word=0;
static int identifier=0;
%}
%%
"include"|"for"|"define" {key_word++;printf("keyword found");}
"int"|"char"|"float"|"double" {identifier++;printf("identifier found");}
%%
int main()
{
printf("enter the sentence");
yylex();
printf("keyword are: %d\n and identifier are:%d\n",key_word,identifier);
}
int yywrap()
{
return 1;
}
Design of a Lexical Analyzer Generator

• Translate regular expressions to NFA
• Translate NFA to an efficient DFA (optional)

  regular expressions ──► NFA ──(optional)──► DFA
                           │                   │
                      simulate NFA        simulate DFA
                      to recognize        to recognize
                         tokens              tokens
Nondeterministic Finite Automata

• Definition: an NFA is a 5-tuple (S, Σ, δ, s0, F)
where
  S is a finite set of states
  Σ is a finite set of input symbols (the alphabet)
  δ is a mapping from S × (Σ ∪ {ε}) to sets of states
  s0 ∈ S is the start state
  F ⊆ S is the set of accepting (or final) states
Transition Graph

• An NFA can be diagrammatically represented by a labeled
directed graph called a transition graph

         a,b
        ┌───┐
        ▼   │    a         b         b
  start ►(0)┘ ──────► (1) ──────► (2) ──────► ((3))

  S = {0,1,2,3}    Σ = {a,b}    s0 = 0    F = {3}
Transition Table

• The mapping δ of an NFA can be represented in a transition table

  δ(0,a) = {0,1}      State |   a    |  b
  δ(0,b) = {0}          0   | {0,1}  | {0}
  δ(1,b) = {2}          1   |        | {2}
  δ(2,b) = {3}          2   |        | {3}
The Language Defined by an
NFA
• An NFA accepts an input string x iff there is some
path with edges labeled with symbols from x in
sequence from the start state to some accepting
state in the transition graph
• A state transition from one state to another on the
path is called a move
• The language defined by an NFA is the set of input
strings it accepts, such as (a|b)*abb for the
example NFA
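The transition table of this NFA can be simulated directly by tracking the set of states reachable so far; in the sketch below the sets are encoded as bit masks (bit s = state s), and input is assumed to range over {a, b} only:

```c
#include <assert.h>

/* NFA for (a|b)*abb over states {0,1,2,3}.
   delta[s][x]: successor set of state s on symbol x (0='a', 1='b'). */
static const unsigned delta[4][2] = {
    { 0x3, 0x1 },  /* δ(0,a)={0,1}  δ(0,b)={0} */
    { 0x0, 0x4 },  /* δ(1,b)={2} */
    { 0x0, 0x8 },  /* δ(2,b)={3} */
    { 0x0, 0x0 },
};

/* Simulate the NFA: the current "state" is a set of NFA states. */
int nfa_accepts(const char *w) {
    unsigned S = 0x1;                      /* start set {0} */
    for (; *w; w++) {
        unsigned T = 0;
        for (int s = 0; s < 4; s++)
            if (S & (1u << s))
                T |= delta[s][*w == 'b'];  /* union of all moves */
        S = T;
    }
    return (S & 0x8) != 0;                 /* does S contain state 3? */
}
```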
Design of a Lexical Analyzer Generator: RE to NFA to DFA

  Lex specification with regular expressions:
    p1 { action1 }
    p2 { action2 }
    …
    pn { actionn }

  NFA: a new start state s0 with ε-transitions to the sub-NFAs
  N(p1), N(p2), …, N(pn); the accepting state of each N(pi) is
  tagged with actioni

  Subset construction (optional) ──► DFA
From Regular Expression to NFA (Thompson’s Construction)

  ε:       start ►(i) ──ε──► ((f))

  a:       start ►(i) ──a──► ((f))

  r1 | r2: start ►(i) ──ε──► N(r1) ──ε──► ((f))
                  └────ε──► N(r2) ──ε────┘

  r1 r2:   start ►(i) ──► N(r1) ──► N(r2) ──► ((f))

  r*:      start ►(i) ──ε──► N(r) ──ε──► ((f))
           plus an ε-edge from i directly to f, and an ε-edge
           from the exit of N(r) back to its entry
Combining the NFAs of a Set of Regular Expressions

  a    { action1 }   start ►(1) ──a──► ((2))

  abb  { action2 }   start ►(3) ──a──► (4) ──b──► (5) ──b──► ((6))

  a*b+ { action3 }   start ►(7) ──b──► ((8))
                     with an a-loop on 7 and a b-loop on 8

  Combined NFA: a new start state 0 with ε-transitions to 1, 3, and 7
Simulating the Combined NFA: Example 1

  (accepting states: 2 → action1, 6 → action2, 8 → action3)

  Input: a a b a
    start : {0,1,3,7}
    a     : {2,4,7}
    a     : {7}
    b     : {8}
    a     : none
  Must find the longest match:
  continue until no further moves are possible.
  The last accepting state reached was 8, so execute action3.
Simulating the Combined NFA: Example 2

  Input: a b b a
    start : {0,1,3,7}
    a     : {2,4,7}
    b     : {5,8}
    b     : {6,8}
    a     : none
  When two or more accepting states are reached (here 6 → action2
  and 8 → action3), the first action given in the Lex specification
  is executed: action2.
Deterministic Finite Automata
• A deterministic finite automaton is a special case
of an NFA
– No state has an -transition
– For each state s and input symbol a there is at most one
edge labeled a leaving s
• Each entry in the transition table is a single state
– At most one path exists to accept a string
– Simulation algorithm is simple
Example DFA

• A DFA that accepts (a|b)*abb

  State |  a  |  b
    0   |  1  |  0
    1   |  1  |  2
    2   |  1  |  3
   (3)  |  1  |  0      (state 3 is accepting)
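Because each table entry is a single state, simulating this DFA takes exactly one array lookup per input character; a minimal sketch (input assumed over {a, b}):

```c
#include <assert.h>

/* Transition table of the DFA for (a|b)*abb; column 0='a', 1='b'. */
static const int dtran[4][2] = {
    { 1, 0 },   /* state 0 */
    { 1, 2 },   /* state 1 */
    { 1, 3 },   /* state 2 */
    { 1, 0 },   /* state 3 (accepting) */
};

/* Deterministic simulation: one current state, one lookup per char. */
int dfa_accepts(const char *w) {
    int s = 0;
    for (; *w; w++)
        s = dtran[s][*w == 'b'];
    return s == 3;
}
```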
Conversion of an NFA into a DFA

• The subset construction algorithm converts an
NFA into a DFA using:
  ε-closure(s) = {s} ∪ {t | s →ε … →ε t}
  ε-closure(T) = ⋃ s∈T ε-closure(s)
  move(T,a) = {t | s →a t and s ∈ T}
• The algorithm produces:
  Dstates, the set of states of the new DFA,
  consisting of sets of states of the NFA
  Dtran, the transition table of the new DFA
ε-closure and move Examples

  (on the combined NFA for a | abb | a*b+; state 0 has
  ε-transitions to 1, 3, and 7)

  ε-closure({0})      = {0,1,3,7}
  move({0,1,3,7},a)   = {2,4,7}
  ε-closure({2,4,7})  = {2,4,7}
  move({2,4,7},a)     = {7}
  ε-closure({7})      = {7}
  move({7},b)         = {8}
  ε-closure({8})      = {8}
  move({8},a)         = ∅

  On input "aaba": {0,1,3,7} → {2,4,7} → {7} → {8} → none
  Also used to simulate NFAs
Simulating an NFA using ε-closure and move

  S := ε-closure({s0})
  Sprev := ∅
  a := nextchar()
  while S ≠ ∅ do
    Sprev := S
    S := ε-closure(move(S,a))
    a := nextchar()
  end do
  if Sprev ∩ F ≠ ∅ then
    execute action in Sprev
    return “yes”
  else return “no”
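The two helper functions can be instantiated in C for the combined NFA of the earlier examples, with state sets encoded as bit masks; the values asserted below are exactly the ε-closure and move results worked out on the ε-closure and move Examples slide:

```c
#include <assert.h>

#define NSTATES 9   /* combined NFA for a | abb | a*b+, states 0..8 */

/* ε-edges and symbol edges as bit sets (bit s = state s). */
static const unsigned eps[NSTATES] = {
    0x08A, 0, 0, 0, 0, 0, 0, 0, 0    /* 0 →ε {1,3,7} */
};
static const unsigned delta[NSTATES][2] = {   /* column 0='a', 1='b' */
    {0,0}, {0x004,0}, {0,0}, {0x010,0}, {0,0x020},
    {0,0x040}, {0,0}, {0x080,0x100}, {0,0x100}
};

/* ε-closure(T): smallest superset of T closed under ε-moves. */
unsigned eps_closure(unsigned T) {
    unsigned C = T, old;
    do {
        old = C;
        for (int s = 0; s < NSTATES; s++)
            if (C & (1u << s)) C |= eps[s];
    } while (C != old);
    return C;
}

/* move(T,x): states reachable from some state in T on symbol x. */
unsigned move_on(unsigned T, int x) {
    unsigned U = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) U |= delta[s][x];
    return U;
}
```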
The Subset Construction Algorithm

  Initially, ε-closure(s0) is the only state in Dstates and it is unmarked
  while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a ∈ Σ do
      U := ε-closure(move(T,a))
      if U is not in Dstates then
        add U as an unmarked state to Dstates
      end if
      Dtran[T,a] := U
    end do
  end do
Subset Construction Example 1

  NFA for (a|b)*abb (Thompson construction, states 0–10):
    0 →ε 1, 0 →ε 7
    1 →ε 2, 1 →ε 4
    2 →a 3, 4 →b 5
    3 →ε 6, 5 →ε 6
    6 →ε 1, 6 →ε 7
    7 →a 8, 8 →b 9, 9 →b 10 (accepting)

  Resulting DFA:
    Dstates
    A = {0,1,2,4,7}        A →a B, A →b C
    B = {1,2,3,4,6,7,8}    B →a B, B →b D
    C = {1,2,4,5,6,7}      C →a B, C →b C
    D = {1,2,4,5,6,7,9}    D →a B, D →b E
    E = {1,2,4,5,6,7,10}   E →a B, E →b C   (E accepting)
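The subset construction can be run mechanically on this NFA. The sketch below encodes the ε-NFA for (a|b)*abb with bit masks (helper names are illustrative) and reproduces the Dstates A–E listed above, in order of discovery:

```c
#include <assert.h>

#define NSTATES 11   /* Thompson-style ε-NFA for (a|b)*abb, states 0..10 */

static const unsigned eps[NSTATES] = {
    0x082,            /* 0 →ε {1,7} */
    0x014,            /* 1 →ε {2,4} */
    0x000,
    0x040,            /* 3 →ε {6}   */
    0x000,
    0x040,            /* 5 →ε {6}   */
    0x082,            /* 6 →ε {1,7} */
    0x000, 0x000, 0x000, 0x000
};
static const unsigned delta[NSTATES][2] = {   /* column 0='a', 1='b' */
    {0,0},{0,0},{0x008,0},{0,0},{0,0x020},{0,0},{0,0},
    {0x100,0},{0,0x200},{0,0x400},{0,0}
};

static unsigned eclosure(unsigned T) {
    unsigned C = T, old;
    do {
        old = C;
        for (int s = 0; s < NSTATES; s++)
            if (C & (1u << s)) C |= eps[s];
    } while (C != old);
    return C;
}

static unsigned move_on(unsigned T, int x) {
    unsigned U = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) U |= delta[s][x];
    return U;
}

/* Subset construction: dstates[i] is the NFA-state set of DFA state i,
   dtran[i][x] the index of its successor; returns the DFA state count. */
int subset_construction(unsigned dstates[32], int dtran[32][2]) {
    int n = 0, t = 0;                  /* states before index t are marked */
    dstates[n++] = eclosure(0x001);    /* ε-closure({0}) */
    while (t < n) {                    /* while an unmarked T exists */
        for (int x = 0; x < 2; x++) {
            unsigned U = eclosure(move_on(dstates[t], x));
            int j;
            for (j = 0; j < n; j++)
                if (dstates[j] == U) break;
            if (j == n) dstates[n++] = U;   /* add U, unmarked */
            dtran[t][x] = j;
        }
        t++;                                /* mark T */
    }
    return n;
}
```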
Subset Construction Example 2

  (combined NFA for a { action1 }, abb { action2 }, a*b+ { action3 })

  Dstates
    A = {0,1,3,7}
    B = {2,4,7}   accepting: action1 (state 2)
    C = {8}       accepting: action3 (state 8)
    D = {7}
    E = {5,8}     accepting: action3 (state 8)
    F = {6,8}     accepting: action2 (state 6, listed first)

  Dtran
    A →a B, A →b C
    B →a D, B →b E
    C →b C
    D →a D, D →b C
    E →b F
    F →b C
Minimizing the Number of States of a DFA

  The five-state DFA for (a|b)*abb (states A–E from Example 1) has
  equivalent states A and C; merging them gives a four-state DFA:

  before:  A →a B, A →b C;  B →a B, B →b D;  C →a B, C →b C;
           D →a B, D →b E;  E →a B, E →b C
  after:   A →a B, A →b A;  B →a B, B →b D;
           D →a B, D →b E;  E →a B, E →b A
DFA (Hoff croft Algorithm)
 Algorithm: Minimizing the number of states of a DFA.
 Input: A DFA D with set of states S, input alphabet , state state s0, and set of accepting
states F.
 Output: A DFA D' accepting the same language as D and having as few states as possible.
 Method:
1. Start with an initial partition  with two groups, F and S – F, the accepting and non-
accepting states of D.
2. Apply the procedure, to construct a new partition new
Initially, let new = ;
for ( each group G of  ) do begin
partition G into subgroups such that two states s and t are in the same subgroup if and
only if for all input symbols a, states s and t have transitions on a to states in the same
group of II; /* at worst, a state will be in a subgroup by itself * /
replace G in new by the set of all subgroups formed;
end
3. If new = , let final =  and continue with step (4) . Otherwise, repeat step (2) with new in
place of new.
4. Choose one state in each group of final as the representative for that group. The
Example
• Using the above algorithm, minimize the DFA whose transition table is given below.

      STATE   a   b
        A     B   C
        B     B   D
        C     B   C
        D     B   E
        E     B   C

• Minimizing the states:
  Π0 = (ABCD) (E)
  Π1 = (ABC) (D) (E)
  Π2 = (AC) (B) (D) (E)
  Π3 = (AC) (B) (D) (E)

• Now construct the minimum-state DFA. It has four states, corresponding to the
  four groups of Π3; pick A, B, D, and E as the representatives of these groups.
  The initial state is A, and the only accepting state is E. The table below
  shows the transition function for the minimized DFA.
Example (Contd…)

Transition table of the given DFA:

      STATE   a   b
        A     B   C
        B     B   D
        C     B   C
        D     B   E
        E     B   C

Transition table of the minimized DFA:

      STATE   a   b
        A     B   A
        B     B   D
        D     B   E
        E     B   A
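The partition-refinement loop, run on this example's transition table, can be sketched as follows (illustrative Python, not from the slides):

```python
# Sketch of DFA minimization by partition refinement: start from {F, S-F}
# and split groups until no group splits further.

def minimize(states, alphabet, delta, accepting):
    """Return the final partition of `states` into groups of equivalent states."""
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        new = []
        for G in partition:
            buckets = {}   # split G by which groups each state's transitions reach
            for s in G:
                key = tuple(next(i for i, g in enumerate(partition)
                                 if delta[(s, a)] in g) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new.extend(buckets.values())
        if len(new) == len(partition):   # no group split: partition is final
            return new
        partition = new

# Transition table of the example DFA (accepting state: E):
delta = {('A', 'a'): 'B', ('A', 'b'): 'C', ('B', 'a'): 'B', ('B', 'b'): 'D',
         ('C', 'a'): 'B', ('C', 'b'): 'C', ('D', 'a'): 'B', ('D', 'b'): 'E',
         ('E', 'a'): 'B', ('E', 'b'): 'C'}
groups = minimize('ABCDE', 'ab', delta, {'E'})
print(sorted(sorted(g) for g in groups))   # [['A', 'C'], ['B'], ['D'], ['E']]
```

A and C end up in the same group, matching Π3 = (AC)(B)(D)(E) above.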
From Regular Expression to DFA
Directly

• The important states of an NFA are those without an ε-transition; that is, if
  move({s}, a) ≠ ∅ for some a, then s is an important state
• The subset construction algorithm uses only the important states when it
  determines ε-closure(move(T, a))
From Regular Expression to DFA
Directly (Algorithm)
• Augment the regular expression r with a
special end symbol # to make accepting
states important: the new expression is r#
• Construct a syntax tree for r#
• Traverse the tree to construct functions
nullable, firstpos, lastpos, and followpos
From Regular Expression to DFA
Directly: Syntax Tree of (a|b)*abb#

[Syntax tree for (a|b)*abb#, with position numbers attached to the leaves:

                 •  (concatenation)
                / \
               •   #(6)
              / \
             •   b(5)
            / \
           •   b(4)
          / \
         *   a(3)    (closure)
         |
         |           (alternation)
        / \
     a(1)  b(2)                        ]
From Regular Expression to DFA
Directly: Annotating the Tree
• nullable(n): the subtree at node n generates a
  language that includes the empty string
• firstpos(n): the set of positions that can match the first
  symbol of a string generated by the subtree at node n
• lastpos(n): the set of positions that can match the
  last symbol of a string generated by the subtree at node n
• followpos(i): the set of positions that can follow
  position i in the tree
From Regular Expression to DFA
Directly: Annotating the Tree

• Leaf ε:
    nullable = true;  firstpos = ∅;  lastpos = ∅
• Leaf with position i:
    nullable = false;  firstpos = {i};  lastpos = {i}
• Or-node n = c1 | c2:
    nullable(n) = nullable(c1) or nullable(c2)
    firstpos(n) = firstpos(c1) ∪ firstpos(c2)
    lastpos(n)  = lastpos(c1) ∪ lastpos(c2)
• Cat-node n = c1 c2:
    nullable(n) = nullable(c1) and nullable(c2)
    firstpos(n) = if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n)  = if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
• Star-node n = c1*:
    nullable(n) = true;  firstpos(n) = firstpos(c1);  lastpos(n) = lastpos(c1)
From Regular Expression to DFA
Directly: Syntax Tree of (a|b)*abb#, Annotated

[Same tree, with firstpos to the left and lastpos to the right of each node;
 only the star-node is nullable:

 root • (cat)              firstpos {1,2,3}   lastpos {6}
   ├ • (cat)               firstpos {1,2,3}   lastpos {5}
   │  ├ • (cat)            firstpos {1,2,3}   lastpos {4}
   │  │  ├ • (cat)         firstpos {1,2,3}   lastpos {3}
   │  │  │  ├ * (nullable) firstpos {1,2}     lastpos {1,2}
   │  │  │  │  └ |         firstpos {1,2}     lastpos {1,2}
   │  │  │  │     ├ a(1)   firstpos {1}       lastpos {1}
   │  │  │  │     └ b(2)   firstpos {2}       lastpos {2}
   │  │  │  └ a(3)         firstpos {3}       lastpos {3}
   │  │  └ b(4)            firstpos {4}       lastpos {4}
   │  └ b(5)               firstpos {5}       lastpos {5}
   └ #(6)                  firstpos {6}       lastpos {6}   ]
From Regular Expression to DFA
Directly: followpos

for each node n in the tree do
    if n is a cat-node with left child c1 and right child c2 then
        for each i in lastpos(c1) do
            followpos(i) := followpos(i) ∪ firstpos(c2)
        end do
    else if n is a star-node then
        for each i in lastpos(n) do
            followpos(i) := followpos(i) ∪ firstpos(n)
        end do
    end if
end do
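The annotation rules and the followpos rules above can be sketched together in Python (illustrative; the tuple encoding of the syntax tree for (a|b)*abb# is an assumption of this sketch, not from the slides):

```python
# Syntax-tree nodes: ('leaf', pos, sym), ('or', c1, c2), ('cat', c1, c2), ('star', c1).
# annotate() returns (nullable, firstpos, lastpos) and fills in followpos.

followpos = {i: set() for i in range(1, 7)}

def annotate(n):
    kind = n[0]
    if kind == 'leaf':
        _, pos, _ = n
        return False, {pos}, {pos}
    if kind == 'or':
        n1, f1, l1 = annotate(n[1])
        n2, f2, l2 = annotate(n[2])
        return n1 or n2, f1 | f2, l1 | l2
    if kind == 'cat':
        n1, f1, l1 = annotate(n[1])
        n2, f2, l2 = annotate(n[2])
        for i in l1:                         # cat-node rule for followpos
            followpos[i] |= f2
        return (n1 and n2,
                f1 | f2 if n1 else f1,       # firstpos rule for cat
                l2 | l1 if n2 else l2)       # lastpos rule for cat
    n1, f1, l1 = annotate(n[1])              # star-node
    for i in l1:                             # star-node rule for followpos
        followpos[i] |= f1
    return True, f1, l1

# (a|b)*abb# with leaf positions 1..6:
tree = ('cat', ('cat', ('cat', ('cat',
        ('star', ('or', ('leaf', 1, 'a'), ('leaf', 2, 'b'))),
        ('leaf', 3, 'a')), ('leaf', 4, 'b')), ('leaf', 5, 'b')), ('leaf', 6, '#'))
root = annotate(tree)
print(root)        # (False, {1, 2, 3}, {6}), matching the annotated tree
print(followpos)   # matches the followpos table of the example
```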
From Regular Expression to DFA
Directly: Algorithm

s0 := firstpos(root), where root is the root of the syntax tree
Dstates := {s0}, with s0 unmarked
while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a ∈ Σ do
        let U be the set of positions that are in followpos(p)
            for some position p in T,
            such that the symbol at position p is a
        if U is not empty and not in Dstates then
            add U as an unmarked state to Dstates
        end if
        Dtran[T, a] := U
    end do
end do
From Regular Expression to DFA
Directly: Example

Node   followpos
 1     {1, 2, 3}
 2     {1, 2, 3}
 3     {4}
 4     {5}
 5     {6}
 6     ∅

[Resulting DFA (start state {1,2,3}; accepting state {1,2,3,6}):
  {1,2,3}   -a-> {1,2,3,4}    {1,2,3}   -b-> {1,2,3}
  {1,2,3,4} -a-> {1,2,3,4}    {1,2,3,4} -b-> {1,2,3,5}
  {1,2,3,5} -a-> {1,2,3,4}    {1,2,3,5} -b-> {1,2,3,6}
  {1,2,3,6} -a-> {1,2,3,4}    {1,2,3,6} -b-> {1,2,3}    ]
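Feeding the followpos table above into the Dstates/Dtran algorithm reproduces this four-state DFA (illustrative Python sketch, not from the slides):

```python
# Direct RE -> DFA construction for (a|b)*abb#, driven by the followpos table.
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
sym = {1: 'a', 2: 'b', 3: 'a', 4: 'b', 5: 'b'}   # position 6 is the end marker #

s0 = frozenset({1, 2, 3})                # firstpos of the root
dstates, unmarked, dtran = {s0}, [s0], {}
while unmarked:
    T = unmarked.pop()                   # mark T
    for a in 'ab':
        U = set()
        for p in T:                      # union of followpos(p) over positions
            if sym.get(p) == a:          # in T whose symbol is a
                U |= followpos[p]
        U = frozenset(U)
        if U and U not in dstates:
            dstates.add(U)
            unmarked.append(U)
        dtran[(T, a)] = U

accepting = {S for S in dstates if 6 in S}   # states containing the # position
print(len(dstates), sorted(next(iter(accepting))))   # prints: 4 [1, 2, 3, 6]
```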
Time-Space Tradeoffs

Automaton   Space (worst case)   Time (worst case)
NFA         O(|r|)               O(|r|·|x|)
DFA         O(2^|r|)             O(|x|)
Compiler Construction Tools
These tools use specialized languages for specifying
and implementing specific components, and many
use quite sophisticated algorithms.

• Parser generators: automatically produce a
syntax analyzer from a grammatical description of
a programming language.

• Scanner generators: produce a lexical analyzer
from a regular-expression description of the
tokens of a language.
Compiler Construction Tools
• Syntax-directed translation engines: produce a
collection of routines for walking a parse tree
and generating intermediate code.

• Code-generator generators: produce a code
generator from a collection of rules for
translating each operation of the intermediate
language into the machine language for a
target machine.
Compiler Construction Tools
• Data-flow analysis engines: facilitate the
gathering of information about how values are
transmitted from one part of a program to every
other part. Data-flow analysis is a key part of
code optimization.

• Compiler construction toolkits: provide an
integrated set of routines for constructing the
various phases of a compiler.