Process of Execution of A Program :: Compiler Design
UNIT-1
INTRODUCTION
Process of execution of a program:
The hardware understands only machine language, which is difficult for humans to read. So we write programs
in a high-level language, which is easier for us to understand and remember. These programs are
then fed into a series of tools and OS components to obtain the desired code that can be used by the
machine. This is known as the Language Processing System.
The high-level language is converted into binary language in various phases. A compiler is a
program that converts high-level language to assembly language. Similarly, an assembler is a
program that converts the assembly language to machine-level language.
Let us first understand how a program, using C compiler, is executed on a host machine.
Before diving straight into the concepts of compilers, we should understand a few other tools that
work closely with compilers.
PREPROCESSOR:
A preprocessor produces input to the compiler. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are short hands for
longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
COMPILER:
A compiler is a translator program that takes a program written in a high-level language (HLL), the source program,
and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of
a compiler is reporting errors to the programmer.
Executing a program written in an HLL programming language basically consists of two parts: the source
program must first be compiled (translated) into an object program; then the resulting object program
is loaded into memory and executed.
[Figure: source program → COMPILER → assembly program → ASSEMBLER → machine code]
INTERPRETER:
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. JAVA also uses an
interpreter. The process of interpretation can be carried out in the following phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Advantages: Execution can begin immediately without a separate compilation step, and errors are reported in terms of the source program, which simplifies debugging.
Disadvantages: Interpreted execution is slower than running compiled machine code, and the source program must be re-analysed on every run.
LOADER:
Once the assembler produces an object program, that program must be placed into memory and
executed. The assembler could place the object program directly in memory and transfer control
to it, thereby causing the machine-language program to be executed. However, this would waste core by
leaving the assembler in memory while the user's program was being executed. Also, the
programmer would have to retranslate the program with each execution, thus wasting translation
time. To overcome this waste of translation time and memory, system programmers
developed another component called the loader.
"A loader is a program that places programs into memory and prepares them for execution." It
would be more efficient if subroutines could be translated into object form that the loader could
"relocate" directly behind the user's program. The task of adjusting programs so that they may be
placed in arbitrary core locations is called relocation. Relocating loaders perform four functions:
allocation, linking, relocation, and loading.
STRUCTURE OF A COMPILER: A compiler can broadly be divided into two phases based
on the way they compile.
Analysis Phase: Known as the front-end of the compiler, the analysis phase of the compiler
reads the source program, divides it into core parts, and then checks for lexical, grammar, and
syntax errors. The analysis phase generates an intermediate representation of the source program
and symbol table, which should be fed to the Synthesis phase as input.
Synthesis Phase: Known as the back-end of the compiler, the synthesis phase generates the
target program with the help of intermediate source code representation and symbol table. A
compiler can have many phases and passes.
Pass: A pass refers to the traversal of a compiler through the entire program.
Phase: A phase of a compiler is a distinguishable stage, which takes input from the previous
stage, processes and yields output that can be used as input for the next stage. A pass can have
more than one phase.
PHASES OF A COMPILER:
A compiler operates in phases. A phase is a logically interrelated operation that takes the source
program in one representation and produces output in another representation. The phases of a
compiler are described below; as noted above, they are grouped into the two parts of compilation, analysis and synthesis.
Lexical Analysis: The first phase of the compiler works as a text scanner. This phase scans the source
code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer
represents these lexemes in the form of tokens as:
<token-name, attribute-value>
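As an illustration of the <token-name, attribute-value> pair, a lexical analyzer written in C might represent a token with a structure like the following (a minimal sketch; the type and field names are illustrative, not from these notes):

#include <stdio.h>

/* Illustrative token categories (assumed names, not from these notes). */
typedef enum { TOK_ID, TOK_NUM, TOK_KEYWORD, TOK_RELOP } TokenType;

/* A token is a <token-name, attribute-value> pair; the attribute is, for
   example, an index into the symbol table or a literal value. */
typedef struct {
    TokenType type;     /* token name (category)            */
    int attribute;      /* e.g. symbol-table index or value */
} Token;

int main(void) {
    Token t = { TOK_ID, 100 };  /* identifier whose symbol-table entry is at location 100 */
    printf("<%d, %d>\n", t.type, t.attribute);
    return 0;
}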
Syntax Analysis: The next phase is called the syntax analysis or parsing. It takes the token
produced by lexical analysis as input and generates a parse tree (or syntax tree). In this phase,
token arrangements are checked against the source code grammar, i.e., the parser checks if the
expression made by the tokens is syntactically correct.
Semantic Analysis: Semantic analysis checks whether the constructed parse tree follows the
rules of the language. For example, it checks that values are assigned between compatible data types and
reports errors such as adding a string to an integer. The semantic analyzer also keeps track of identifiers, their types and
expressions, and whether identifiers are declared before use. The semantic analyzer
produces an annotated syntax tree as its output.
Intermediate Code Generation: After semantic analysis, the compiler generates an intermediate representation of the source program for some abstract machine. This intermediate code should be generated in such a way that it is easy to translate into the target machine code.
Code Optimization: The next phase does code optimization of the intermediate code.
Optimization can be assumed as something that removes unnecessary code lines, and arranges
the sequence of statements in order to speed up the program execution without wasting resources
(CPU, memory).
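As a small illustration (a hypothetical C fragment, not taken from these notes), constant folding and dead-code elimination are two such optimizations:

/* Before optimization (hypothetical input): */
int scaled(int r) {
    int factor = 4 * 3;      /* constant expression                */
    int unused = r * r * r;  /* dead code: result is never used    */
    return factor * r;
}

/* What an optimizing compiler may effectively produce: */
int scaled_optimized(int r) {
    return 12 * r;           /* 4 * 3 folded to 12, dead code removed */
}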
Code Generation: In this phase, the code generator takes the optimized representation of the
intermediate code and maps it to the target machine language. The code generator translates the
intermediate code into a sequence of (generally) relocatable machine code. This sequence of
machine instructions performs the same task as the intermediate code would.
Symbol Table: It is a data structure maintained throughout all the phases of a compiler. All the
identifiers’ names along with their types are stored here. The symbol table makes it easier for the
compiler to quickly search the identifier record and retrieve it. The symbol table is also used for
scope management.
LEXICAL ANALYSIS:
It is the first phase of the compiler. It reads the input source program from left to right, one character at a
time, and generates the sequence of tokens.
Each token is a single logically cohesive unit such as an identifier, keyword, operator, or
punctuation mark. The parser can then use these tokens to determine the syntax of the source program.
Because the lexical analyzer scans the program to recognize the tokens, it is also called a scanner.
Apart from token identification, the lexical analyzer also performs functions such as removing comments and white space and keeping track of line numbers for error reporting.
The lexical analyzer works in two phases: in the first phase it performs scanning, and in the
second phase it does lexical analysis, i.e., it generates the series of tokens.
TOKEN, PATTERNS & LEXEME: Let us learn some terms that are frequently
used when we talk about the activity of lexical analysis.
Tokens: A token describes the class or category of the input string. For example, identifiers, keywords,
and constants are tokens.
Patterns: A pattern is a rule describing the set of lexemes that can represent a particular token; for example, the pattern for an identifier is "a letter followed by letters or digits".
Lexemes: A lexeme is a sequence of characters in the source program that matches the pattern of a
token, for example int, i, num, ans.
In the statement if (a < b), the strings "if", "(", "a", "<", "b" and ")" are all lexemes.
When we want to compile a given source program we submit it to the compiler. The
compiler scans the program and produces the sequence of tokens; because of this scanning activity the lexical analyzer is also
called a scanner.
LEXEME    TOKEN
int       keyword
max       identifier
(         operator
int       keyword
a         identifier
,         operator
b         identifier
{         operator
Lexical errors:
These types of errors can be detected during the lexical analysis phase. Typical lexical-phase errors are:
1. Appearance of an illegal character, for example a stray $ at the end of a statement.
2. Exceeding the allowed length of an identifier.
INPUT BUFFERING:
The lexical analyzer scans the input string from left to right, one character at a time. It uses two
pointers, begin_ptr (bp) and forward_ptr (fp), to keep track of the portion of the input scanned. Initially
both pointers point to the first character of the input string, as shown below.

bp, fp
 |
 i n t   i , j ; i = i + 1 ; j = j + 1 ;
(Initial configuration: bp and fp both point to the 'i' of "int")

The forward_ptr moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme. In the above example, as soon as forward_ptr (fp) encounters
a blank space, the lexeme "int" is identified.
The fp is then moved ahead past the white space: when fp encounters white space it ignores it
and moves ahead. Then both begin_ptr (bp) and forward_ptr (fp) are set at the next token, i.
The input characters are read from secondary storage, but reading one character at a time from secondary
storage is costly. Hence a buffering technique is used.
A block of data is first read into a buffer and then scanned by the lexical analyzer. There are two
methods used in this context:
One-buffer scheme: In this scheme only one buffer is used to store the input string. The problem with
this scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the
lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.

bp                      fp
 |                       |
 i n t   i = i + 1          (single buffer)
Two-buffer scheme: To overcome the problems of the one-buffer scheme, in this method two buffers
are used to store the input string.
The first buffer and the second buffer are scanned alternately; when the end of the current buffer is
reached, the other buffer is filled.
The only problem with this method is that if the length of the lexeme is larger than the length of a
buffer, then the input cannot be scanned completely.
Initially both fp and bp point to the first character of the first buffer; then fp moves
towards the right in search of the end of the lexeme.
As soon as a blank character is recognized, the string between bp and fp is identified as the
corresponding token. To identify the boundary of the first buffer, an "end of buffer" (eof) character
is placed at the end of the first buffer.
Similarly, the end of the second buffer is recognized by the end-of-buffer mark present at the end of
the second buffer. When fp encounters the first eof, the end of the first buffer is recognized, and
filling of the second buffer is started.

bp
 |
Buffer 1:  i n t   i = i + 1   eof
Buffer 2:  ; j = j + 1 ;   eof
                           |
                           fp

In the same way, when the second eof is reached, it indicates the end of the second buffer.
The two buffers are filled alternately until the end of the input program is reached and the stream of tokens is
identified. The eof character introduced at the end of each buffer is called a sentinel.
if (fp == eof(buff1))          /* sentinel at the end of buffer 1 */
{
    fp++;                      /* refill buffer 2 */
}
else if (fp == eof(buff2))     /* sentinel at the end of buffer 2 */
{
    fp++;                      /* refill buffer 1 */
}
else if (fp == eof(input))     /* real end of the input */
    return;
else
    fp++;                      /* ordinary character: advance */
Regular expressions: Regular expressions are mathematical symbolisms that describe the
set of strings of a specific language. They provide a convenient and useful notation for representing
tokens.
Here are the rules that define regular expressions over the input alphabet ∑:
1. ε is a regular expression denoting the language {ε}.
2. For each symbol a in ∑, a is a regular expression denoting the language {a}.
3. If r and s are regular expressions, then (r)/(s), (r).(s) and (r)* are also regular expressions (these operations are described further below).
A language denoted by regular expressions is said to be a regular set (or) a regular language
Problems:
1. Write a regular expression for the language containing all strings of length two over ∑ = {0, 1}.
Sol: R.E. = (0+1)(0+1)
2. Write a regular expression for the language containing all strings that end with "abb" over ∑ = {a, b}.
Sol: R.E. = (a+b)*abb
Recognizing of tokens: For a programming language there are various types of tokens such as
identifiers, keywords, constants, operators and so on. A token is usually represented by a pair:
token type and token value.
The token type tells us the category of the token, and the token value gives information regarding the
token. The token value is also called the token attribute. During the lexical analysis process the
symbol table is maintained.
The token value can be a pointer into the symbol table in the case of identifiers and constants. The lexical
analyzer reads the input program and generates a symbol table for tokens.
Our lexical analyzer will generate the following token stream. Consider the code:
if (a < 10)
    i = i + 2;
else
    i = i - 2;
Token stream (pairs of token type and token value):
1, (8,1), (5,100), (7,1), (6,105), (8,2), (5,107), (9,1), (6,10), 2, (5,107), 10, (5,107), (9,2), (6,110)
Symbol table (fragment):
LOCATION   TYPE         VALUE
105        constant     10
107        identifier   i
110        constant     2
Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
1. (r)/(s) is a regular expression denoting L(r) ∪ L(s), which represents the union
operation.
2. (r).(s) is a regular expression denoting L(r).L(s), which represents the concatenation operation.
3. (r)* is a regular expression denoting (L(r))*, which represents the Kleene closure.
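For example (a worked illustration, not part of the original notes): if r = a and s = b, then r/s denotes the language {a, b}, (r).(s) denotes {ab}, and (a)* denotes {ε, a, aa, aaa, ...}.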
AXIOM                    DESCRIPTION
r/s = s/r                / is commutative
r/(s/t) = (r/s)/t        / is associative
εr = r,  rε = r          ε is the identity for concatenation
r** = r*                 * is idempotent
RECOGNIZING TOKENS:
For a programming language there are various types of tokens such as identifiers, keywords,
constants, operators and so on. A token is usually represented by a pair: token type and token
value.
The token type tells us the category of the token, and the token value gives us the information regarding the
token.
Consider the following grammar fragment for conditional statements and relational expressions:
S → iEtS | iEtSeS | ε
E → T relop T | T
T → id | num
Here the terminals are i, t, e, relop, id and num. They stand for the sets of strings given by the following
regular definitions:
i      → if
t      → then
e      → else
relop  → = | < | > | <= | >= | <>
id     → letter (letter | digit)*
num    → digit+ (. digit+)?
letter → [A-Z a-z]
digit  → [0-9]
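A hand-written recognizer following these regular definitions could be sketched in C as below (a minimal illustrative sketch; the function next_token and the token codes are assumptions, not from these notes):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Illustrative token codes (assumed, not from these notes). */
enum { TOK_IF, TOK_THEN, TOK_ELSE, TOK_RELOP, TOK_ID, TOK_NUM, TOK_ERROR };

/* Recognize one token starting at *p, following the regular definitions:
   relop = = < > <= >= <>,  id = letter(letter|digit)*,  num = digit+(.digit+)? */
int next_token(const char **p) {
    const char *s = *p;
    while (*s == ' ' || *s == '\t' || *s == '\n') s++;   /* skip white space */
    if (*s == '\0') { *p = s; return TOK_ERROR; }        /* end of input */

    if (*s == '<' || *s == '>' || *s == '=') {           /* relop */
        s++;
        if ((s[-1] == '<' && (*s == '=' || *s == '>')) ||
            (s[-1] == '>' && *s == '='))
            s++;
        *p = s;
        return TOK_RELOP;
    }
    if (isalpha((unsigned char)*s)) {                    /* id = letter (letter|digit)* */
        const char *start = s;
        while (isalnum((unsigned char)*s)) s++;
        *p = s;
        if (s - start == 2 && strncmp(start, "if", 2) == 0)   return TOK_IF;
        if (s - start == 4 && strncmp(start, "then", 4) == 0) return TOK_THEN;
        if (s - start == 4 && strncmp(start, "else", 4) == 0) return TOK_ELSE;
        return TOK_ID;
    }
    if (isdigit((unsigned char)*s)) {                    /* num = digit+ (. digit+)? */
        while (isdigit((unsigned char)*s)) s++;
        if (*s == '.' && isdigit((unsigned char)s[1])) {
            s++;
            while (isdigit((unsigned char)*s)) s++;
        }
        *p = s;
        return TOK_NUM;
    }
    *p = s + 1;                                          /* skip an unknown character */
    return TOK_ERROR;
}

int main(void) {
    const char *src = "if x1 <= 10 then y else 2.5";
    while (*src)
        printf("token code %d\n", next_token(&src));
    return 0;
}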
These are the patterns for the tokens of the expression grammar given above.
More generally, a grammar G is defined by the four-tuple (V, T, P, S), where
V = set of variables (non-terminals)
T = set of terminals
P = set of productions
S = start symbol
Ex: Let A → aB and B → b | ε.
Then the grammar G is defined as
V = {A, B}
T = {a, b}
P = {A → aB, B → b | ε}
S = {A}
The language generated by such a grammar is called a regular language, and it can be represented by a DFA;
every regular grammar can be represented using a finite automaton.
Subsequence of a string: Any string formed by removing zero or more symbols, not necessarily
contiguous, is called a subsequence of the string; for example, "cmplr" is a subsequence of "compiler".
In the token-stream example above, the scanner scans the input string, recognizes "if" as a keyword and returns
token type 1, since in the given encoding code 1 indicates the keyword "if"; hence 1 appears at the
beginning of the token stream.
Next is the pair (8, 1), where 8 indicates a parenthesis and 1 indicates the opening parenthesis
"(". Then the scanner reads the input 'a', recognizes it as an identifier, and searches the symbol table to check
whether an entry for it is already present. If not, it inserts the information about this identifier into the symbol
table and returns the location of the new entry.
If the same identifier or variable is already present in the symbol table, the lexical analyzer does not
insert it into the table again; instead it returns the location where it is present.
1) Strings: A string is a finite sequence of alphabet symbols (letters). Strings are
synonymously called words.
The length of a string s is denoted by |s|.
The empty string is denoted by ε.
The empty set of strings is denoted by φ.
Prefix of a string: A string obtained by removing zero or more trailing symbols. For example, for the string
"Hindustan" a prefix could be "Hindu".
Suffix of a string: A string obtained by removing zero or more leading symbols. For example, for the
string "Hindustan" a suffix could be "stan".
Substring: A string obtained by removing a prefix and a suffix of a given string is called a
substring. For example, for the string "Hindustan", the string "indu" is a substring.
3) Comments: The regular expression for a comment statement can be written as
r.e. = // (letter + digit + whitespace)*                     (single-line comment)
r.e. = /* (letter + digit + whitespace + newline)* */        (multi-line comment)
Lexical analysis is the process of recognizing tokens from the input source program. To recognize
tokens, the lexical analyzer performs the following steps.
Step 1: The tokens of the source language are identified and specified.
Step 2: A token is read from the input buffer and a regular expression is built for the corresponding
token.
Step 3: From these regular expressions a finite automaton is built. The finite automaton is usually in
non-deterministic form; that means a non-deterministic finite automaton (NFA) is built.
Step 4: For each state of the NFA a function is designed, and each input along the transition edges
corresponds to the input parameters of these functions.
Step 5: The set of such functions ultimately makes up the lexical analyzer program.
Finite automata are typically represented using transition diagrams. A transition diagram can
be defined as a collection of states and labeled edges (transitions) between them.
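As a minimal sketch of Steps 4 and 5 (the state and function names are illustrative, not from these notes), the transition diagram for id = letter (letter | digit)* can be implemented as one function per state:

#include <ctype.h>
#include <stdio.h>

/* Transition diagram for id = letter (letter | digit)* as per-state functions.
   state0: start state; expects a letter.
   state1: inside the identifier; loops on letters/digits, accepts at end of input. */

static int state1(const char *s) {
    while (isalnum((unsigned char)*s))   /* loop on letter | digit */
        s++;
    return (*s == '\0');                 /* accept if the whole input was consumed */
}

static int state0(const char *s) {
    if (isalpha((unsigned char)*s))      /* the first symbol must be a letter */
        return state1(s + 1);
    return 0;                            /* reject */
}

int main(void) {
    printf("%d\n", state0("count1"));    /* 1: valid identifier */
    printf("%d\n", state0("1count"));    /* 0: starts with a digit */
    return 0;
}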
Reserved words are the special words used in the programming language that are associated
with some meaning; for example, if, else, while, for, break and so on are reserved words
used in the C language. The lexical analyzer should identify the reserved words correctly.
An identifier is a name that refers to some stored value. It is a collection of letters or
alphanumeric characters; the first character of an identifier must always be a letter.
Lexical analysis is the process of recognizing tokens from the source program, and the compiler does
this job by constructing a recognizer that looks for the lexemes stored in the input buffer. This
recognizer works on the rule: "if more than one pattern matches, the recognizer has to choose the
longest lexeme matched." For example, <= is returned as a single relational-operator token rather than as < followed by =.
LEX: For efficient design of a compiler, various tools have been built for constructing lexical
analyzers using a special-purpose notation called regular expressions.
The regular expressions are used in recognizing the tokens. We will now discuss a special
language that specifies the tokens using regular expressions. A tool called LEX accepts this
specification.
LEX scans the source program in order to get the stream of tokens, and these are related together
so that various programming constructs such as expressions, blocks of statements, procedures, and
control structures can be realized.
During the parsing of the program, rules are defined to establish the relationship between the
tokens. These rules are called a grammar.
YACC (Yet Another Compiler-Compiler) is another automated tool, which is used to specify the grammar
for realizing the source-program constructs.
YACC takes the description of a grammar in a specification file and produces a C routine
called a parser. Thus LEX and YACC are two important utilities that generate the lexical analyzer
and the syntax analyzer.
Basically, LEX is a Unix utility which generates the lexical analyzer.
A LEX-generated lexer is very fast in finding the tokens as compared to a hand-written lexical
analyzer in C.
The LEX specification file is created using the extension .l; for example, the specification file
can be x.l. From this specification LEX produces a file lex.yy.c, which is a C program that is the actual lexical analyzer.
The LEX specification file stores the regular expressions for the tokens, and the lex.yy.c file
consists of a tabular representation of the transition diagrams constructed from these regular
expressions.
[Figure: x.l → LEX → lex.yy.c;  lex.yy.c → C compiler → a.out;  input stream → a.out → sequence of tokens]
The lexemes can be recognized with the help of this tabular representation of the
transition diagrams.
The actions associated with the regular expressions in lex.l are pieces of C code and are carried over
directly into lex.yy.c.
Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the
lexical analyzer that transforms an input stream into a sequence of tokens.
A LEX program consists of three sections:
1. Declaration section
2. Rule section
3. Procedure section (auxiliary procedure section)
The general layout of a LEX specification is:
%{
Declaration section
%}
%%
Rule section
%%
Auxiliary procedure section
Declaration section:
Declaration of variables is done in the declaration section; regular definitions can also be written
here.
In general, the definition section is used to define macros and to include important
header files written in C.
Rule section: The rule section consists of regular expressions with associated actions. The
translation rules take the format:
R1 {action1}
R2 {action2}
...
Rn {actionn}
Here Ri indicates a regular expression and actioni describes the action the lexical analyzer needs to
take when the corresponding regular expression is matched.
The rule section is the most important section: here the patterns are associated with C statements.
The patterns are nothing but regular expressions.
Procedure section: In this section the required procedures are defined; these procedures may also be required
by the actions in the rule section.
It is also called the C-code section: it contains C statements and functions that are called
by the rules in the rule section.
Ex:
%{
/* declaration section (empty here) */
%}
%%
"RAMA"   |
"SITA"   |
"geeta"  { printf("\n noun"); }
"sings"  |
"dances" |
"eat"    { printf("\n verb"); }
%%
main()
{
    yylex();
}
int yywrap()
{
    return 1;
}
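To build and run such a LEX specification on a typical Unix system (assuming lex or flex is installed; the file name words.l is illustrative), the usual steps are:
lex words.l          (generates lex.yy.c)
cc lex.yy.c -ll      (compiles it and links the lex library; with flex, use -lfl)
./a.out              (runs the lexical analyzer on standard input)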
During the process of compilation it is always efficient to have a symbol table, so
that while the lexer is running we can add new words without modifying or recompiling the
LEX program. Two important activities are associated with the symbol
table: insert_word() and search_word().
The insert_word() routine inserts a newly encountered word into the symbol table.
The search_word() routine performs the look-up activity.
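A minimal sketch of such a symbol table in C is shown below (a simple linear-search table; the array sizes and details are assumptions for illustration, not the notes' actual implementation):

#include <stdio.h>
#include <string.h>

#define MAX_WORDS 100
#define WORD_LEN  32

static char table[MAX_WORDS][WORD_LEN];  /* stored words             */
static int  word_count = 0;              /* number of entries so far */

/* search_word: return the index of the word in the table, or -1 if absent. */
int search_word(const char *word) {
    for (int i = 0; i < word_count; i++)
        if (strcmp(table[i], word) == 0)
            return i;
    return -1;
}

/* insert_word: insert the word if it is new; return its index either way. */
int insert_word(const char *word) {
    int idx = search_word(word);
    if (idx >= 0)
        return idx;                       /* already present: just look it up */
    if (word_count >= MAX_WORDS)
        return -1;                        /* table full */
    strncpy(table[word_count], word, WORD_LEN - 1);
    return word_count++;
}

int main(void) {
    printf("%d\n", insert_word("count"));   /* 0: new entry     */
    printf("%d\n", insert_word("count"));   /* 0: found again   */
    printf("%d\n", search_word("total"));   /* -1: not in table */
    return 0;
}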
Command-line parameters are the parameters that appear on the command prompt. The
command-line interface allows the user to interact with the computer by typing
commands. In C we can receive these parameters in the main function in the form of an array of
character strings.
For example, for the command  cp abc.txt pqr.txt
argv[0] = "cp"
argv[1] = "abc.txt"
argv[2] = "pqr.txt"
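For illustration, a small C program (a sketch, not from these notes) that prints the command-line parameters it receives:

#include <stdio.h>

int main(int argc, char *argv[]) {
    /* argv[0] is the program name; argv[1]..argv[argc-1] are the parameters. */
    for (int i = 0; i < argc; i++)
        printf("argv[%d] = %s\n", i, argv[i]);
    return 0;
}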
Impacts on Compilers
Since the design of programming languages and compilers are intimately related, the
advances in programming languages placed new demands on compiler writers. They had to
devise algorithms and representations to translate and support the new language features. Since
the 1940's, computer architecture has evolved as well. Not only did the compiler writers have to
track new language features, they also had to devise translation algorithms that would take
maximal advantage of the new hardware capabilities.
Parallelism
All modern microprocessors exploit instruction-level parallelism. However, this parallelism can
be hidden from the programmer. Programs are written as if all instructions were executed in
sequence; the hardware dynamically checks for dependencies in the sequential instruction stream
and issues them in parallel when possible.
Memory Hierarchies
A memory hierarchy consists of several levels of storage with different speeds and sizes, with
the level closest to the processor being the fastest but smallest. The average memory-access
time of a program is reduced if most of its accesses are satisfied by the faster levels of the
hierarchy. Both parallelism and the existence of a memory hierarchy improve the potential
performance of a machine, but they must be harnessed effectively by the compiler to deliver real
performance on an application.