Unit-1 PCD
Structure of compiler – Functions and Roles of lexical phase – Input buffering – Representation
of tokens using regular expression – Properties of regular expression – Finite Automata – Regular
Expression to Finite Automata – NFA to Minimized DFA.
STRUCTURE OF COMPILER:
A compiler is a translator program that reads a program written in one language (the source
language) and translates it into an equivalent program in another language (the target language).
As an important part of this translation process, the compiler reports to its user the presence of
errors in the source program.
Preprocessors perform the following functions:
Fig. 1.2. A language-processing system
1. Macro processing: A preprocessor may allow a user to define macros that are shorthand
for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text. For
example, the C preprocessor causes the contents of the file <stdio.h> to replace the
statement #include <stdio.h> when it processes a file containing this statement.
3. "Rational" preprocessors: These processors augment older languages with more modern
flow-of-control and data-structuring facilities. For example, such a preprocessor might
provide the user with built-in macros for constructs like while-statements or if-statements,
where none exist in the programming language itself.
4. Language extensions: These processors attempt to add capabilities to the language by what
amounts to built-in macros.
Assemblers:
Some compilers produce assembly code, which is passed to an assembler for producing a
relocatable machine code that is passed directly to the loader/linker editor. The assembly code is
the mnemonic version of machine code. A typical sequence of assembly code is:
MOV a, R1
ADD #2, R1
MOV R1, b
A compiler operates in phases, each of which transforms the source program from one
representation to another. The structure of a compiler is shown in Fig. 1.3. The first three phases
form the analysis portion of the compiler and the rest of the phases form the synthesis portion.
Syntax Analysis:
Hierarchical analysis is called parsing or syntax analysis. It involves grouping the tokens of the
source program into grammatical phrases that are used by the compiler to synthesize the output.
The source program is represented by a parse tree such as the one shown in Fig. 1.5.
• An important component of semantic analysis is type checking, i.e., verifying that the
operands of each operator are type compatible.
• For example, a real number used to index an array is reported as an error.
• There is a better way to perform the same calculation as the above three-address code,
which is given as follows:
temp1 = id3 * 60.0
id1 = id2 + temp1
• There are various techniques used by most of the optimizing compilers, such as:
1. Common sub-expression elimination
2. Dead Code elimination
3. Constant folding
4. Copy propagation
5. Induction variable elimination
6. Code motion
7. Reduction in strength, etc.
Code Generation:
• The final phase of the compiler is the generation of target code, consisting of relocatable
machine code or assembly code.
• The intermediate instructions are each translated into a sequence of machine instructions that
perform the same task. A crucial aspect is the assignment of variables to registers.
• Using registers R1 and R2, the translation of the given example is:
MOV id3, R2
MUL #60.0, R2
MOV id2, R1
ADD R2, R1
MOV R1, id1
Symbol-Table Management:
• An essential function of a compiler is to record the identifiers used in the source
program and collect information about their attributes.
• A symbol table is a data structure containing a record for each identifier, with fields for
its attributes (such as its type, its scope and, for procedure names, the number and types
of its arguments).
• The data structure allows the compiler to find the record for each identifier quickly and
to store or retrieve data from that record quickly.
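The record-per-identifier idea can be sketched in Python as follows. This is a minimal illustration, assuming a hash table (a Python dict) as the underlying structure; the class name SymbolTable, the methods insert and lookup, and the attribute names are hypothetical and are not taken from these notes.

```python
class SymbolTable:
    """Maps each identifier to a record (a dict of attribute fields)."""

    def __init__(self):
        self._records = {}

    def insert(self, name, **attributes):
        # Create the record the first time the identifier is seen;
        # later calls add or update fields of the same record.
        record = self._records.setdefault(name, {"name": name})
        record.update(attributes)
        return record

    def lookup(self, name):
        # Returns None when the identifier has not been recorded.
        return self._records.get(name)


table = SymbolTable()
table.insert("rate", type="real", scope="global")
table.insert("max", type="procedure", arg_types=["int", "int"])
table.insert("rate", initialized=True)   # a later phase adds a field
```

A dict gives expected constant-time insert and lookup, which matches the requirement that records be found and updated quickly.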
Error Handling and Reporting:
• Each phase can encounter errors. After detecting an error, a phase must deal with that error
so that compilation can proceed, allowing further errors to be detected.
• The lexical phase can detect errors where the characters remaining in the input do not form
any token.
• The syntax and semantic phases handle a large fraction of errors. Errors where the token
stream violates the syntax rules are detected by the syntax phase.
• During semantic analysis, the compiler tries to detect constructs that have the right syntactic
structure but no meaning for the operation involved, e.g., an attempt to add two
identifiers, one of which is an array name and the other a procedure name.
Fig.1.4. Translation of statement
FUNCTIONS AND ROLES OF LEXICAL PHASE:
• It is the first phase of a compiler.
• Its main task is to read input characters and produce tokens.
• "Get next token" is a command sent from the parser to the lexical analyzer (LA).
• On receipt of the command, the LA scans the input until it determines the next token and
returns it.
• It skips white space and comments while creating these tokens.
• If an error is present, the LA correlates that error with the source file and line number.
Fig. 1.5. Interaction of lexical analyzer with parser
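The interaction described above can be sketched as a small Python class whose get_next_token method is called by the parser on demand. This is only an illustration under stated assumptions: the token names (id, num, op), the "//" comment style, and the class name Lexer are hypothetical choices, not part of these notes.

```python
import re

# Assumed token patterns, tried in this order at each input position.
TOKEN_SPEC = [
    ("skip",    r"[ \t]+|//[^\n]*"),      # white space and // comments
    ("newline", r"\n"),
    ("num",     r"\d+(?:\.\d+)?"),
    ("id",      r"[A-Za-z][A-Za-z0-9]*"),
    ("op",      r"[+\-*/=()]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.line = 1          # used to correlate errors with the source line

    def get_next_token(self):
        """Return (token, lexeme, line), or ("eof", "", line) at end of input."""
        while self.pos < len(self.text):
            match = MASTER.match(self.text, self.pos)
            if match is None:
                bad = self.text[self.pos]
                raise SyntaxError(f"line {self.line}: unexpected character {bad!r}")
            self.pos = match.end()
            kind = match.lastgroup
            if kind == "newline":
                self.line += 1
            elif kind != "skip":
                return (kind, match.group(), self.line)
        return ("eof", "", self.line)
```

A parser would call get_next_token repeatedly until it receives the eof token; white space and comments never reach the parser.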
Issues in Lexical Analysis:
There are several reasons for separating the analysis phase of compiling into lexical analysis
and parsing:
1. Simpler design is the most important consideration.
2. Compiler efficiency is improved. A large amount of time is spent reading the source
program and partitioning it into tokens. Buffering techniques for reading input
characters and processing tokens speed up the performance of the compiler.
3. Compiler portability is enhanced.
• There is a set of strings in the input for which the same token is produced as output. This
set is described by a rule called a pattern.
• A lexeme is a sequence of characters in the source program that is matched by the pattern
for a token.
Fig.1.6. Examples of tokens
Attributes of tokens:
• The lexical analyzer collects information about tokens into their associated attributes. The
tokens influence parsing decisions and the attributes influence the translation of tokens.
• Usually a token has a single attribute, i.e., a pointer to the symbol-table entry in which the
information about the token is kept.
Example: The tokens and associated attribute values for the statement given,
Lexical errors:
Few errors are discernible at the lexical level alone, because the LA has a very localized view of
the source program. The LA may be unable to proceed because none of the patterns for tokens
matches a prefix of the remaining input.
Error-recovery actions are:
1. Deleting an extraneous character.
2. Inserting a missing character.
3. Replacing an incorrect character by a correct character.
4. Transposing two adjacent characters.
INPUT BUFFERING
This section discusses a two-buffer input scheme that is useful when lookahead on the input is
necessary to identify tokens. It then discusses other techniques for speeding up the LA, such as
the use of “sentinels” to mark the buffer end.
Buffer Pairs:
Because a large amount of time is consumed in scanning characters, specialized buffering
techniques have been developed to reduce the overhead required to process an input character.
A buffer divided into two N-character halves is shown in Fig. 1.7. Typically, N is the number of
characters on one disk block, e.g., 1024 or 4096.
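When the forward pointer is about to move past the right end of the buffer, the left half is
filled with N new input characters and the forward pointer wraps around to the beginning of the
buffer. Code for advancing the forward pointer is shown in Fig. 1.8.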
Fig. 1.8. Code to advance forward pointer
Sentinels:
• With the previous algorithm, each time we move the forward pointer we need to check that
we have not moved off one half of the buffer. If we have, then we must reload the other half.
• The number of tests can be reduced if we extend each buffer half to hold a sentinel character
(eof) at the end.
• The new arrangement and code are shown in Fig. 1.9 and Fig. 1.10. In most cases this code
performs only one test, to see whether forward points to an eof.
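The sentinel scheme can be sketched in Python as follows. This is an illustration rather than the code of the figures: the class name TwoBufferInput, the choice of "\0" as the eof sentinel, and the use of None to signal end of input are assumptions, and the lexeme-beginning pointer is left out.

```python
import io

EOF = "\0"   # sentinel character, assumed never to occur in the source text


class TwoBufferInput:
    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        # Layout: [first half: N chars][EOF][second half: N chars][EOF]
        self.buf = [EOF] * (2 * n + 2)
        self._load(0)
        self.forward = -1   # index of the character most recently returned

    def _load(self, start):
        chunk = self.stream.read(self.n)
        self.buf[start:start + len(chunk)] = chunk
        if len(chunk) < self.n:
            # Short read: an EOF inside the half marks the real end of input.
            self.buf[start + len(chunk)] = EOF

    def next_char(self):
        """Return the next input character, or None at end of input."""
        self.forward += 1
        ch = self.buf[self.forward]
        if ch != EOF:                        # the only test on the common path
            return ch
        if self.forward == self.n:           # sentinel closing the first half
            self._load(self.n + 1)
            return self.next_char()
        if self.forward == 2 * self.n + 1:   # sentinel closing the second half
            self._load(0)
            self.forward = -1                # wrap around to the beginning
            return self.next_char()
        self.forward -= 1                    # EOF inside a half: end of input
        return None
```

For an ordinary character only the comparison with EOF is executed; the tests for the buffer boundaries run only after a sentinel has been seen.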
REPRESENTATION OF TOKENS USING REGULAR EXPRESSION:
Regular expressions are an important notation for specifying patterns. The term alphabet or
character class denotes any finite set of symbols. Typical examples of symbols are letters and
characters. The set {0,1} is the binary alphabet. ASCII and EBCDIC are two examples of computer
alphabets.
• A string over some alphabet is a finite sequence of symbols drawn from that alphabet. The
length of a string s, written as |s|, is the number of occurrences of symbols in s.
• The empty string, denoted by ε, is a special string of length zero.
• The term language denotes any set of strings over some fixed alphabet.
• The empty set ∅ and the set {ε}, which contains only the empty string, are both languages
under this definition.
• If x and y are strings, then the concatenation of x and y, written as xy, is the string formed
by appending y to x.
Some common terms associated with the parts of the string are shown in Fig. 1.11.
Fig. 1.11 Terms for parts of a string
Operations on Languages
There are several important operations that can be applied to languages. The various operations
and their definition are shown in Fig.1.12.
Fig. 1.12 Definitions of operations on languages.
Let L be the set {A,B,...,Z,a,b,...,z} consisting of the uppercase and lowercase letters, and D the
set {0,1,2,...,9} consisting of the ten decimal digits. Here are some examples of new
languages created from L and D.
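The operations can be tried out on L and D with Python sets of strings. This is a sketch under one stated assumption: since L* is infinite in general, the closure functions below enumerate only the powers up to a caller-supplied bound (the parameter max_power is hypothetical and is not part of the definitions).

```python
from itertools import product
import string


def union(l, m):
    return l | m


def concat(l, m):
    # LM = { xy | x in L and y in M }
    return {x + y for x, y in product(l, m)}


def power(l, i):
    # L^0 = {""} (the set containing only the empty string), L^i = L^(i-1) L
    result = {""}
    for _ in range(i):
        result = concat(result, l)
    return result


def kleene_closure(l, max_power):
    # Finite approximation of L*: the union of L^0 .. L^max_power.
    result = set()
    for i in range(max_power + 1):
        result |= power(l, i)
    return result


def positive_closure(l, max_power):
    # Finite approximation of L+: the union of L^1 .. L^max_power.
    result = set()
    for i in range(1, max_power + 1):
        result |= power(l, i)
    return result


L = set(string.ascii_letters)   # {A,...,Z,a,...,z}
D = set(string.digits)          # {0,...,9}

letters_or_digits = union(L, D)                                  # L ∪ D
letter_then_digit = concat(L, D)                                 # LD
short_identifiers = concat(L, kleene_closure(union(L, D), 2))    # part of L(L ∪ D)*
```

Here short_identifiers holds every string of length one to three that starts with a letter and continues with letters or digits, which is the finite part of L(L ∪ D)* reachable with max_power = 2.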
Regular Language
• A regular language over an alphabet is the one that can be obtained from the basic
languages using the operations Union, Concatenation and Kleene *.
• A language is said to be a regular language if there exists a Deterministic Finite Automaton
(DFA) for that language.
• The language accepted by a DFA is a regular language.
• A regular language can be converted into a regular expression by leaving out {} or by
replacing {} with () and by replacing ∪ by +.
Regular Expressions
A regular expression is a formula that describes a set of strings.
Components of a regular expression:
x the character x
. any character, usually except a newline
[xyz] any of the characters x, y, z, ...
R? an R or nothing (i.e., an optional R)
R* zero or more occurrences of R
R+ one or more occurrences of R
R1R2 an R1 followed by an R2
R1 | R2 either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the
set of strings in each token class as a language, we can use the regular-expression notation to
describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits.
In regular expression notation we would write:
identifier = letter (letter | digit)*
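The identifier definition can be checked with Python's re module. This is a small sketch, assuming that "letter" means an ASCII letter and "digit" a decimal digit; the helper name is_identifier is hypothetical.

```python
import re

# A letter followed by zero or more letters or digits.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(lexeme):
    # fullmatch requires the whole lexeme to match the pattern.
    return IDENTIFIER.fullmatch(lexeme) is not None
```

Under this definition "rate" and "id3" are identifiers, while "3id" and the empty string are not.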
PROPERTIES OF REGULAR EXPRESSION:
There are a number of algebraic laws obeyed by regular expressions and these can be used to
manipulate regular expressions into equivalent forms. Algebraic properties are shown in Fig.
1.13.
Transition diagrams for relational operators and identifiers are shown below.
Fig. 1.14 Transition diagrams for relational operators
FINITE AUTOMATA:
A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a
sentence of the language and "no" otherwise. A regular expression is compiled into a
recognizer by constructing a generalized transition diagram called a finite automaton. A finite
automaton can be deterministic or nondeterministic, where "nondeterministic" means that more
than one transition out of a state may be possible on the same input symbol.
• A is the set of all accepting states or final states.
• δ is the transition function, Q×∑→Q
For any element q of Q and any symbol a in ∑, we interpret δ(q,a) as the state to which the finite
automaton moves if it is in state q and receives the input a.
How to draw a DFA?
1. Start scanning the string from its left end.
2. Go over the string one symbol at a time.
3. Be ready with the answer as soon as the string is entirely scanned.
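The three steps above describe how a DFA processes its input, and they translate directly into a short simulation loop. The sketch below uses a hypothetical four-state DFA for strings over {a, b} that end in abb; the state numbers 0 to 3 and the names DELTA, START, ACCEPTING and accepts are assumptions made for illustration.

```python
# Transition function δ as a dict: (state, symbol) -> next state.
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START = 0
ACCEPTING = {3}


def accepts(text):
    state = START
    for symbol in text:                  # scan left to right, one symbol at a time
        if (state, symbol) not in DELTA:
            return False                 # symbol outside the alphabet: reject
        state = DELTA[(state, symbol)]
    return state in ACCEPTING            # the answer is ready once the scan ends
```

Because δ yields exactly one next state, the loop does a single dictionary lookup per input symbol.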
Non-Deterministic Finite Automata (NFA) Definition:
An NFA is defined as a 5-tuple, M = (Q, ∑, q0, A, δ) where,
• Q is a finite set of states
• ∑ is the set of input symbols (finite alphabet)
• q0 is the initial state
• A is the set of all accepting states or final states.
• δ is the transition function, Q×∑→2^Q (each state and input symbol map to a set of states)
Regular expression to NFA (Thompson’s construction method):
For each kind of RE, define an NFA.
Input: A Regular Expression R over an alphabet ∑
Output: An NFA N accepting L(R)
Method:
1. R = ε
2. R = a
3. R= a | b
4. R=ab
5. R= a*
In general, the rules of Thompson’s construction method are given below:
Problems:
1. Construct NFA for: (a+b)*abb
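The five cases of the method can be sketched in Python. Several things here are assumptions made for illustration: the regular expression is given as a hand-built syntax tree of tuples rather than parsed from text, ε is represented by None, states are numbered in creation order starting from 0, and the function name thompson is hypothetical. In the problem above, + denotes union, so (a+b)*abb is the same expression as (a|b)*abb.

```python
def thompson(tree):
    """Build an NFA from a regex syntax tree; returns (edges, start, accept)."""
    edges = {}        # (state, symbol) -> set of next states; symbol None is ε
    counter = [0]     # state 0 is the start state of the whole NFA

    def new_state():
        counter[0] += 1
        return counter[0]

    def add(src, symbol, dst):
        edges.setdefault((src, symbol), set()).add(dst)

    def build(node, start):
        # Builds the fragment for node beginning at start; returns its accept state.
        kind = node[0]
        if kind == "eps":                     # R = ε
            accept = new_state()
            add(start, None, accept)
        elif kind == "sym":                   # R = a
            accept = new_state()
            add(start, node[1], accept)
        elif kind == "alt":                   # R = a | b
            tails = []
            for branch in node[1:]:
                branch_start = new_state()
                add(start, None, branch_start)
                tails.append(build(branch, branch_start))
            accept = new_state()
            for tail in tails:
                add(tail, None, accept)
        elif kind == "cat":                   # R = ab: accept of a is start of b
            accept = build(node[2], build(node[1], start))
        elif kind == "star":                  # R = a*
            inner_start = new_state()
            inner_accept = build(node[1], inner_start)
            accept = new_state()
            add(start, None, inner_start)
            add(start, None, accept)
            add(inner_accept, None, inner_start)
            add(inner_accept, None, accept)
        else:
            raise ValueError(f"unknown node kind: {kind!r}")
        return accept

    return edges, 0, build(tree, 0)


a, b = ("sym", "a"), ("sym", "b")
TREE = ("cat", ("cat", ("cat", ("star", ("alt", a, b)), a), b), b)   # (a|b)*abb
EDGES, START, ACCEPT = thompson(TREE)
```

With this numbering the NFA for (a|b)*abb has states 0 to 10 with accepting state 10. A drawing of the same NFA may number its states differently.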
to which the NFA makes a transition on the given input symbol. In other words, after processing
a sequence of input symbols the DFA is in a state that corresponds to the set of NFA states
reachable from the start state on the same inputs.
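There are three operations that can be applied on NFA states:
• ε-closure(s): the set of NFA states reachable from state s on ε-transitions alone.
• ε-closure(T): the set of NFA states reachable from some state s in T on ε-transitions alone.
• move(T, a): the set of NFA states to which there is a transition on input symbol a from some
state s in T.
The starting state of the automaton is assumed to be s0. The ε-closure(s) operation computes
exactly the states reachable from a particular state without consuming any input symbol. With
these operations defined, the states to which the automaton can make a transition from set T on
input a can be simply specified as: ε-closure(move(T, a))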
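A minimal Python sketch of ε-closure and move follows. The transition table is an assumption: it encodes an 11-state NFA for (a|b)*abb with states 0 to 10, reconstructed so that it agrees with the state sets used in this section's worked example. ε is represented by None, and the function names are hypothetical.

```python
EPS = None
# (state, symbol) -> set of next states
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9},   (9, "b"): {10},
}


def epsilon_closure(states):
    """All states reachable from the given set on ε-transitions alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(states, symbol):
    """All states reachable from some state in the set on one input symbol."""
    result = set()
    for s in states:
        result |= NFA.get((s, symbol), set())
    return frozenset(result)


A = epsilon_closure({0})
B = epsilon_closure(move(A, "a"))
```

The closure is computed with an explicit stack, so each state is visited at most once.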
Subset Construction Algorithm:
Example: Convert the NFA for the expression: (a|b)*abb into a DFA using the subset
construction algorithm.
Step 1: Convert the above expression into an NFA using Thompson’s construction rules.
Step 2: The start state of the equivalent DFA is ε-closure(0).
A = ε-closure(0) = {0,1,2,4,7}
Step 2.1: Compute ε-closure(move(A,a))
move(A,a) ={3,8}
ε-closure(move(A,a)) = ε-closure(3,8) = {3,8,6,7,1,2,4}
ε-closure(move(A,a)) ={1,2,3,4,6,7,8}
Dtran[A,a]=B
Step 2.2: Compute ε-closure(move(A,b))
move(A,b) ={5}
ε-closure(move(A,b)) = ε-closure(5) = {5,6,7,1,2,4}
ε-closure(move(A,b)) ={1,2,4,5,6,7}
Dtran[A,b]=C
The DFA and transition table after step 2 are shown below.
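Step 3: Compute Transition from state B on input symbol {a,b}
B = {1,2,3,4,6,7,8}
Step 3.1: Compute ε-closure(move(B,a))
move(B,a) ={3,8}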
ε-closure(move(B,a)) = ε-closure(3,8) = {3,8,6,7,1,2,4}
ε-closure(move(B,a)) ={1,2,3,4,6,7,8}
Dtran[B,a]=B
Step 3.2: Compute ε-closure(move(B,b))
move(B,b) ={5,9}
ε-closure(move(B,b)) = ε-closure(5,9) = {5,9,6,7,1,2,4}
ε-closure(move(B,b)) ={1,2,4,5,6,7,9}
Dtran[B,b]=D
The DFA and transition table after step 3 are shown below.
Step 4: Compute Transition from state C on input symbol {a,b}
C = {1,2,4,5,6,7}
Step 4.1: Compute ε-closure(move(C,a))
move(C,a) ={3,8}
ε-closure(move(C,a)) = ε-closure(3,8) = {3,8,6,7,1,2,4}
ε-closure(move(C,a)) ={1,2,3,4,6,7,8}
Dtran[C,a]= B
Step 4.2: Compute ε-closure(move(C,b))
move(C,b) ={5}
ε-closure(move(C,b)) = ε-closure(5) = {5,6,7,1,2,4}
ε-closure(move(C,b)) ={1,2,4,5,6,7}
Dtran[C,b]= C
Step 5: Compute Transition from state D on input symbol {a,b}
D = {1,2,4,5,6,7,9}
Step 5.1: Compute ε-closure(move(D,a))
move(D,a) ={3,8}
ε-closure(move(D,a)) = ε-closure(3,8) = {3,8,6,7,1,2,4}
ε-closure(move(D,a)) ={1,2,3,4,6,7,8}
Dtran[D,a]= B
Step 5.2: Compute ε-closure(move(D,b))
move(D,b) ={5,10}
ε-closure(move(D,b)) = ε-closure(5,10) = {5,10,6,7,1,2,4}
ε-closure(move(D,b)) ={1,2,4,5,6,7,10}
Dtran[D,b]= E
Step 6: Compute Transition from state E on input symbol {a,b}
E = {1,2,4,5,6,7,10}
Step 6.1: Compute ε-closure(move(E,a))
move(E,a) ={3,8}
ε-closure(move(E,a)) = ε-closure(3,8) = {3,8,6,7,1,2,4}
ε-closure(move(E,a)) ={1,2,3,4,6,7,8}
Dtran[E,a]= B
Step 6.2: Compute ε-closure(move(E,b))
move(E,b) ={5}
ε-closure(move(E,b)) = ε-closure(5) = {5,6,7,1,2,4}
ε-closure(move(E,b)) ={1,2,4,5,6,7}
Dtran[E,b]= C
The DFA and transition table after step 6 are shown below.
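The steps of the worked example can be cross-checked with a short Python sketch of the subset construction. Its assumptions: the NFA table below is reconstructed from the state sets that appear in the steps (states 0 to 10, ε written as None), DFA states are named A, B, C, ... in the order they are discovered, and the function names are hypothetical.

```python
from collections import deque
from string import ascii_uppercase

EPS = None
NFA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3},    (4, "b"): {5},
    (3, EPS): {6},    (5, EPS): {6},
    (6, EPS): {1, 7},
    (7, "a"): {8},    (8, "b"): {9},   (9, "b"): {10},
}
ALPHABET = "ab"
NFA_ACCEPT = 10


def epsilon_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(start):
    start_set = epsilon_closure({start})
    names = {start_set: "A"}        # set of NFA states -> DFA state name
    dtran = {}                      # (DFA state, symbol) -> DFA state
    queue = deque([start_set])      # discovered but unprocessed DFA states
    while queue:
        current = queue.popleft()
        for symbol in ALPHABET:
            moved = set()
            for s in current:
                moved |= NFA.get((s, symbol), set())
            target = epsilon_closure(moved)
            if not target:
                continue            # no transition on this symbol
            if target not in names:
                names[target] = ascii_uppercase[len(names)]
                queue.append(target)
            dtran[(names[current], symbol)] = names[target]
    return names, dtran


NAMES, DTRAN = subset_construction(0)
ACCEPTING = {name for subset, name in NAMES.items() if NFA_ACCEPT in subset}
```

Under these assumptions the sketch finds the five states A to E with the same Dtran entries as the steps above, and E is the only accepting state because it is the only subset containing NFA state 10.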
Minimized DFA:
Convert the above DFA into a minimized DFA by applying the following algorithm.
Minimized DFA algorithm:
Input: A DFA with ‘s’ states
Output: A minimized DFA with a reduced number of states.
Steps:
1. Partition the set of states into two groups, the accepting states and the non-accepting
states. Call this initial partition π.
2. For each group G of π, apply step 3 to construct π new. If π new differs from π, replace π
by π new and repeat; stop when π = π new.
3. Divide G into as many subgroups as necessary, such that two states s and t stay in the
same subgroup only when, for every input symbol a, s and t have transitions on a to
states in the same group of π. Place the newly formed subgroups in π new.
4. Choose a representative state for each group.
5. Remove any dead states.
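Steps 1 to 4 can be sketched in Python as a partition-refinement loop. The assumptions are stated plainly: the input DFA is the five-state DFA A to E for (a|b)*abb with a transition defined for every state and symbol, the representative of a group is its alphabetically smallest state, and dead-state removal (step 5) is not implemented because this DFA has no dead state. The function name minimize is hypothetical.

```python
STATES = ["A", "B", "C", "D", "E"]
ALPHABET = "ab"
ACCEPTING = {"E"}
DTRAN = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}


def minimize(states, alphabet, dtran, accepting):
    # Step 1: initial partition into non-accepting and accepting states.
    partition = [g for g in (set(states) - set(accepting), set(accepting)) if g]
    while True:
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        new_partition = []
        for group in partition:
            # Step 3: states stay together only if, for every symbol,
            # their transitions lead into the same group of the partition.
            buckets = {}
            for s in sorted(group):
                signature = tuple(group_of[dtran[(s, a)]] for a in alphabet)
                buckets.setdefault(signature, set()).add(s)
            new_partition.extend(buckets.values())
        # Step 2: groups are only ever split, so an unchanged count means π = π new.
        if len(new_partition) == len(partition):
            break
        partition = new_partition
    # Step 4: choose a representative for each group and rebuild the table.
    rep = {s: min(g) for g in partition for s in g}
    min_dtran = {(r, a): rep[dtran[(r, a)]] for r in set(rep.values()) for a in alphabet}
    return partition, rep, min_dtran


PARTITION, REP, MIN_DTRAN = minimize(STATES, ALPHABET, DTRAN, ACCEPTING)
```

For this DFA the refinement merges A and C, since both go to B on a and into the same group on b, and leaves B, D and E as separate states, so the minimized DFA has four states.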
After applying the minimized DFA algorithm for the regular expression (a|b)*abb, the transition
table for the minimized DFA is as follows.
Transition table for minimized-state DFA:
Minimized DFA:
Exercises:
Convert the following regular expressions into minimized-state DFAs:
1. (a|b)*
2. (b|a)*abb(b|a)*
3. ((a|c)*)ac(ba)*