
UNIT-1 COMPILER DESIGN

INTRODUCTION

Process of execution of a program:

The hardware understands only machine language, which humans cannot read or remember easily. So we write programs
in a high-level language, which is easier for us to understand and remember. These programs are
then fed into a series of tools and OS components to obtain the code that can be used by the
machine. This chain of tools is known as the Language Processing System.

The high-level language is converted into binary language in various phases. A compiler is a
program that converts high-level language to assembly language. Similarly, an assembler is a
program that converts the assembly language to machine-level language.

Let us first understand how a program written in C is compiled and executed on a host machine.

- The user writes a program in the C language (high-level language).
- The C compiler compiles the program and translates it into an assembly program (low-level language).
- An assembler then translates the assembly program into machine code (an object file).
- A linker links all parts of the program together to produce executable machine code.
- A loader loads the executable into memory, and the program is then executed by the processor.
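As a rough, hedged illustration of these stages, the comments in the following one-line C program note the artifact each tool in the chain would typically produce for it; the file names hello.c, hello.s, hello.o and a.out are assumptions for illustration, not part of the notes.

#include <stdio.h>
/* hello.c : what the user writes in the high-level language                  */
int main(void)
{
    printf("hello\n");   /* compiler output : hello.s (assembly program)       */
    return 0;            /* assembler output: hello.o (object / machine code)  */
}
/* linker output: a.out (executable machine code), which the loader places
   into memory so the processor can execute it                                */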

Before diving straight into the concepts of compilers, we should understand a few other tools that
work closely with compilers.


PREPROCESSOR:

A preprocessor produces input to compilers. It may perform the following functions.

1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.

2. File inclusion: A preprocessor may include header files into the program text.

3. Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.

4. Language extensions: These preprocessors attempt to add capabilities to the language by means of built-in macros.
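As a small, hedged C sketch of the first two functions above; the macro names and values are illustrative, not taken from the notes.

#include <stdio.h>               /* file inclusion: the header text is pasted in here          */

#define PI 3.14159               /* macro processing: a simple object-like macro               */
#define AREA(r) (PI * (r) * (r)) /* a parameterized macro, a shorthand for a longer construct  */

int main(void)
{
    /* before compilation proper, the preprocessor expands this call to
       (3.14159 * (2.0) * (2.0)) */
    printf("area = %f\n", AREA(2.0));
    return 0;
}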

COMPILER:

A compiler is a translator program that takes a program written in a high-level language (HLL), the source program,
and translates it into an equivalent program in machine-level language (MLL), the target program. An important part of
a compiler is reporting errors in the source program to the programmer.

Executing a program written in an HLL basically has two parts: the source program must first be compiled (translated)
into an object program; the resulting object program is then loaded into memory and executed.

ASSEMBLER: Programmers found it difficult to write or read programs in machine language.
They began to use mnemonics (symbols) for each machine instruction, which they would
subsequently translate into machine language. Such a mnemonic machine language is now called
an assembly language. Programs known as assemblers were written to automate the translation of
assembly language into machine language. The input to an assembler is called the source
program, and the output is a machine-language translation (object program).


Source program → COMPILER → Assembly language → ASSEMBLER → Machine code

INTERPRETER:

An interpreter is a program that appears to execute a source program as if it were machine
language; it produces the result directly when the source program and its data are given to it as
input.

Languages such as BASIC, SNOBOL and LISP can be translated using interpreters. JAVA also uses an
interpreter. The process of interpretation can be carried out in the following phases:

1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution

Advantages:

- Modification of the user program can easily be made and implemented as execution proceeds.
- The type of object that a name denotes may change dynamically.
- Debugging a program and finding errors is a simpler task for a program that is interpreted.
- The interpreter makes the language machine independent.

Disadvantages:

- Execution of the program is slower.
- Memory consumption is higher.


The main difference between an interpreter and a compiler is that a compiler translates the whole source program into target code before execution, whereas an interpreter translates and executes the program statement by statement.

LOADER AND LINK-EDITOR:

Once the assembler produces an object program, that program must be placed into memory and
executed. The assembler could place the object program directly in memory and transfer control
to it, thereby causing the machine-language program to be executed. However, this would waste core by
leaving the assembler in memory while the user's program was being executed. Also, the
programmer would have to retranslate the program with each execution, thus wasting translation
time. To overcome this problem of wasted translation time and memory, system programmers
developed another component called the loader.

"A loader is a program that places programs into memory and prepares them for execution." It
would be more efficient if subroutines could be translated into object form which the loader could
"relocate" directly behind the user's program. The task of adjusting programs so that they may be
placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.


STRUCTURE OF A COMPILER: A compiler can broadly be divided into two phases based
on the way it compiles.

Analysis Phase: Known as the front end of the compiler, the analysis phase of the compiler
reads the source program, divides it into core parts, and then checks for lexical, grammar, and
syntax errors. The analysis phase generates an intermediate representation of the source program
and the symbol table, which are fed to the synthesis phase as input.

Synthesis Phase: Known as the back end of the compiler, the synthesis phase generates the
target program with the help of the intermediate code representation and the symbol table. A
compiler can have many phases and passes.

Pass: A pass refers to the traversal of a compiler through the entire program.

Phase: A phase of a compiler is a distinguishable stage, which takes input from the previous
stage, processes and yields output that can be used as input for the next stage. A pass can have
more than one phase.

PHASES OF A COMPILER:

A compiler operates in phases. A phase is a logically interrelated operation that takes the source
program in one representation and produces output in another representation. The phases of a
compiler are shown below. There are two parts of compilation:

a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)

The compilation process is partitioned into a number of sub-processes called 'phases'.


Lexical Analysis: The first phase of the compiler works as a text scanner. This phase scans the source
code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer
represents these lexemes in the form of tokens as:

<token-name, attribute-value>
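As a hedged C sketch of how such <token-name, attribute-value> pairs might be represented; the enum and struct names, and the sample statement, are illustrative assumptions rather than part of the notes.

/* one possible in-memory form of a token */
enum TokenName { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_STAR };

struct Token {
    enum TokenName name;   /* the token class, e.g. identifier or number           */
    int            attr;   /* attribute value, e.g. an index into the symbol table */
};

/* for the statement  position = initial + rate * 60  a scanner could emit roughly:
   {TOK_ID,1} {TOK_ASSIGN,0} {TOK_ID,2} {TOK_PLUS,0} {TOK_ID,3} {TOK_STAR,0} {TOK_NUM,60} */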

Syntax Analysis: The next phase is called the syntax analysis or parsing. It takes the token
produced by lexical analysis as input and generates a parse tree (or syntax tree). In this phase,
token arrangements are checked against the source code grammar, i.e., the parser checks if the
expression made by the tokens is syntactically correct.

Semantic Analysis: Semantic analysis checks whether the parse tree constructed follows the
rules of the language. For example, it checks that assignment of values happens between compatible data types, and
flags errors such as adding a string to an integer. The semantic analyzer also keeps track of identifiers, their types and
expressions, whether identifiers are declared before use, etc. The semantic analyzer
produces an annotated syntax tree as its output.

Intermediate Code Generation: After semantic analysis, the compiler generates an


intermediate code of the source code for the target machine. It represents a program for some
abstract machine. It is in between the high-level language and the machine language. This


intermediate code should be generated in such a way that it makes it easier to be translated into
the target machine code.

Code Optimization: The next phase does code optimization of the intermediate code.
Optimization can be assumed as something that removes unnecessary code lines, and arranges
the sequence of statements in order to speed up the program execution without wasting resources
(CPU, memory).

Code Generation: In this phase, the code generator takes the optimized representation of the
intermediate code and maps it to the target machine language. The code generator translates the
intermediate code into a sequence of (generally) relocatable machine code. This sequence of
machine-code instructions performs the same task as the intermediate code would.

Symbol Table: It is a data structure maintained throughout all the phases of a compiler. All the
identifiers' names along with their types are stored here. The symbol table makes it easier for the
compiler to quickly search for an identifier record and retrieve it. The symbol table is also used for
scope management.


ROLE OF LEXICAL ANALYZER:

It is the first phase of the compiler. It reads the input source program from left to right, one character at a
time, and generates a sequence of tokens.

Each token is a single logical cohesive unit such as an identifier, keyword, operator or
punctuation mark. The parser can then use these tokens to determine the syntax of the source program.

The role of lexical analyzer in the process of compilation is given below.


As the lexical analyzer scans the program to recognize the tokens, it is also called a scanner.
Apart from token identification, the lexical analyzer also performs the following functions.

Functions of the lexical analyzer:

- It produces a stream of tokens.
- It eliminates blanks and comments.
- It generates the symbol table, which stores the information about identifiers and constants encountered in the input.
- It keeps track of line numbers.
- It reports errors encountered while generating the tokens.

The lexical analyzer works in two phases. In the first phase it performs scanning, and in the
second phase it does lexical analysis, i.e., it generates the series of tokens.

DIFFERENCES BETWEEN LEXICAL ANALYSIS VS PARSING:

- Lexical analysis is the phase of compilation in which the stream of tokens is generated by scanning the source program; parsing is the phase in which the stream of tokens obtained from the lexical analysis phase is used to build the parse tree.
- Lexical analysis is also recognized as the scanning phase; parsing is also recognized as the syntax-analysis phase.
- An input buffering scheme is used to scan the source code; top-down and bottom-up parsing techniques are used for syntax analysis.
- Regular expressions and finite automata are used in the design of the lexical analyzer; context-free grammars are used in the design of the parser.
- lex is an automated tool used to generate a lexical analyzer; yacc is an automated tool used to generate a syntax analyzer.

TOKEN, PATTERNS & LEXEME: Let us learn some terminologies which are frequently
used when we talk about the activity of lexical analysis.

Tokens: A token describes the class or category of an input string. For example, identifiers, keywords and
constants are tokens.

Patterns: A pattern is the rule that describes a token.

Lexemes: A lexeme is a sequence of characters in the source program that matches the pattern of a
token, for example int, i, num, ans.

For example, in if (a<b)

here "if", "(", "a", "<", "b" and ")" are all lexemes.

"if" is a keyword, "(" is an open parenthesis, and "a" is an identifier.

Now the pattern defining an identifier could be:

- An identifier is a collection of letters.
- An identifier is a collection of alphanumeric characters, and its beginning character must necessarily be a letter.

When we want to compile a given source program we submit it to the compiler. The
compiler scans the program and produces the sequence of tokens; therefore the lexical analyzer is also
called a scanner.

LEXEME    TOKEN
int       Keyword
max       Identifier
(         Operator
int       Keyword
a         Identifier
,         Operator
b         Identifier
{         Operator

The blank and new line characters can be ignored.


Lexical errors:

These types of errors can be detected during the lexical analysis phase. Typical lexical-phase errors
are:

- Exceeding the length of an identifier or numeric constant.
- Appearance of illegal characters.
- Unmatched strings.

Consider: if(" \n hello India "); $

This is a lexical error, as an illegal character $ appears at the end of the statement. An error also
occurs if the length of an identifier is exceeded.

INPUT BUFFERING:

The lexical analyzer scans the input string from left to right, one character at a time. It uses two
pointers, begin-ptr (bp) and forward-ptr (fp), to keep track of the portion of the input scanned. Initially
both pointers point to the first character of the input string, as shown below.

bp
↓
i n t   i , j ; i = i + 1 ; j = j + 1 ;
↑
fp
(Initial configuration)

The forward-ptr moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme. In the above example, as soon as forward-ptr (fp) encounters
a blank space, the lexeme "int" is identified.

The fp is then moved ahead over the white space. When fp encounters white space, it ignores it
and moves ahead. Then both begin-ptr (bp) and forward-ptr (fp) are set at the next token, i.


The input characters are read from secondary storage, but reading in this way from secondary
storage is costly. Hence a buffering technique is used.

A block of data is first read into a buffer and then scanned by the lexical analyzer. There are two
methods used in this context:

- One-buffer scheme
- Two-buffer scheme

One-buffer scheme: In this scheme only one buffer is used to store the input string. The problem with
this scheme is that if a lexeme is very long it crosses the buffer boundary; to scan the rest of the
lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.

bp
↓
i n t   i = i + 1
↑
fp

Two-buffer scheme: To overcome the problems of the one-buffer scheme, in this method two buffers
are used to store the input string.

- The first buffer and second buffer are scanned alternately; when the end of the current buffer is reached, the other buffer is filled.
- The only problem with this method is that if the length of the lexeme is larger than the length of the buffer, then the input cannot be scanned completely.
- Initially both fp and bp point to the first character of the first buffer. Then fp moves towards the right in search of the end of the lexeme.
- As soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an "end of buffer" (eof) character is placed at the end of the first buffer.

Similarly, the end of the second buffer is recognized by the end-of-buffer mark present at the end of the
second buffer. When fp encounters the first eof, the end of the first buffer is recognized and
filling of the second buffer is started.

bp
↓
Buffer 1:  i n t   i = i + 1   eof
Buffer 2:  ; j = j + 1 ;       eof
↑
fp

In the same way, when the second eof is reached, it indicates the end of the second buffer.
Both buffers are filled alternately until the end of the input program, and the stream of tokens is
identified. This eof character introduced at the end is called a sentinel.


Code for input buffering (here eof(buffN) stands for reaching the sentinel at the end of buffer N, as described above):

if (fp == eof(buff1))
{
    fp++;            /* end of buffer 1: refill buffer 2 and continue scanning there */
}
else if (fp == eof(buff2))
{
    fp++;            /* end of buffer 2: refill buffer 1 and continue scanning there */
}
else if (fp == eof(input))
    return;          /* end of the whole input */
else
    fp++;            /* ordinary character: just advance the forward pointer */

Regular expressions: Regular expressions are mathematical symbolisms which describe the
set of strings of a specific language. They provide a convenient and useful notation for representing
tokens.

Here are some rules that define regular expressions over an input alphabet ∑.

1. ε is a regular expression that denotes the set containing only the empty string.
2. If R1 and R2 are regular expressions then R = R1 + R2 is also a regular expression, which
   represents the union operation.
3. If R1 and R2 are regular expressions then R = R1.R2 is also a regular expression, which
   represents the concatenation operation.
4. If R1 is a regular expression then R = R1* is also a regular expression, which represents the
   Kleene closure.

A language denoted by a regular expression is said to be a regular set (or a regular language).

Problems: Write a regular expression for the language containing all strings of length two over
∑ = {0, 1}.

Sol: R.E = (0+1)(0+1)

Write a regular expression for the language containing all strings that end with "abb" over
∑ = {a, b}.

Sol: (a+b)*abb

Recognizing tokens: For a programming language there are various types of tokens such as
identifiers, keywords, constants, operators and so on. A token is usually represented by a pair:
token type and token value.


Token type | Token value

The token type tells us the category of the token, and the token value gives the information regarding
the token. The token value is also called the token attribute. During the lexical analysis process the
symbol table is maintained.

The token value can be a pointer into the symbol table in the case of identifiers and constants. The lexical
analyzer reads the input program and generates a symbol table for tokens.

We will consider some encoding of tokens as follows

Token        Code   Value
if           1      -
else         2      -
while        3      -
for          4      -
identifier   5      pointer to symbol table
constant     6      pointer to symbol table
<            7      1
<=           7      2
>            7      3
>=           7      4
!=           7      5
(            8      1
)            8      2
+            9      1
-            9      2
=            10     -

Our lexical analyzer will generate the following token stream. Consider the code:

if(a<10)
    i=i+2;
else
    i=i-2;

1, (8,1), (5,100), (7,1), (6,105), (8,2), (5,107), 10, (5,107), (9,1), (6,110), 2, (5,107), 10, (5,107), (9,2), (6,110)

The corresponding symbol table for identifiers and constant will be


Location counter   Type         Value
100                Identifier   a
...
105                Constant     10
...
107                Identifier   i
...
110                Constant     2

REGULAR EXPRESSION: A regular expression is used to describe the tokens of a
programming language. A regular expression (RE) is built out of smaller REs.

- Each RE denotes a language. The language denoted by a regular expression is called a regular set.
- Here are some rules that define regular expressions over the input set denoted by ∑:
- ε is an RE denoting the language that contains only the empty string.
- a is an RE denoting the language containing only {a}.

Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:

1. (r)/(s) is a regular expression denoting L(r) U L(s), which represents the union operation.
2. (r).(s) is an RE denoting L(r).L(s), which represents the concatenation operation.
3. (r)* is an RE denoting (L(r))*, which represents the Kleene closure.

Ex: let ∑ = {a, b}

- The RE a/b denotes the set {a, b}.
- The RE (a/b)(a/b) denotes {aa, ab, ba, bb}, the set of all strings of a's and b's of length two. Another RE for this same set is aa/ab/ba/bb.
- The RE a* denotes the set of all strings of zero or more a's: {ε, a, aa, aaa, ...}.
- The RE (a/b)* denotes the set of all strings made up of zero or more a's and b's, including the empty string.

AXIOM DESCRIPTION


r/s = s/r                 / is commutative
r/(s/t) = (r/s)/t         / is associative
(rs)t = r(st)             concatenation is associative
r(s/t) = rs/rt            concatenation distributes over /
(s/t)r = sr/tr            concatenation distributes over /
εr = r                    ε is the identity element for concatenation
rε = r
r* = (r/ε)*               relation between * and ε
r** = r*                  * is idempotent

RECOGNIZING TOKENS:

For a programming language there are various types of tokens such as identifiers, keywords,
constants, operators and so on. A token is usually represented by a pair: token type and token
value.

Token type | Token value

The token type tells us the category of the token and the token value gives us the information regarding
the token.

Ex: consider the grammar,

S → iEtS | iEtSeS | ε
E → T relop T | T
T → id | num

Here the terminals are i, t, e, relop, id and num. They generate sets of strings given by the following
regular definitions.

i → i
t → t



e → e

relop → = | < | > | <= | >= | <>

id → letter (letter | digit)*

num → digit+ (. digit+)?

letter → [A-Za-z]

digit → [0-9]

These are the patterns for the tokens of the expression grammar.

Lexeme      Token name   Attribute value
i           i            -
e           e            -
t           t            -
=           relop        EQ
<           relop        LT
>           relop        GT
>=          relop        GE
<=          relop        LE
<>          relop        NE
any id      id           pointer to table entry
any num     num          pointer to table entry
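As a hedged C sketch, the POSIX <regex.h> functions regcomp/regexec/regfree can be used to try the id pattern letter(letter|digit)* against candidate lexemes; the pattern string and the test lexemes below are illustrative, not from the notes.

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    /* letter (letter | digit)* , anchored so the whole lexeme must match */
    regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED);

    const char *lexemes[] = { "rate", "r2d2", "60", "_tmp" };
    for (int i = 0; i < 4; i++)
        printf("%-5s -> %s\n", lexemes[i],
               regexec(&re, lexemes[i], 0, NULL, 0) == 0 ? "id" : "not an id");

    regfree(&re);
    return 0;
}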

REGULAR DEFINITION FOR LANGUAGE CONSTRUCTS:

Regular grammar: A regular grammar (G) is defined as

G=<V, T, P, S> where

V= set of variables

T= set of terminals

P=set of production

S= start symbol


Ex: let A → aB



B → b | ε

Then the grammar G is defined as

V = {A, B}

T = {a, b}

P = {A → aB, B → b | ε}

S = {A}

This is called a regular language, and it can be represented by a DFA; a regular grammar can be
represented using finite automata.

Sequence of strings: Any string formed by removing zero or more, not necessarily contiguous,
symbols from a given string is called a subsequence of that string.

For example, "mutine" is a subsequence of "multiline".

In the earlier token-stream example, the scanner scans the input string, recognizes "if" as a keyword and returns
token type 1, since in the given encoding the code 1 indicates the keyword "if"; hence 1 is at the
beginning of the token stream.

Next is the pair (8, 1), where 8 indicates a parenthesis and 1 indicates the opening parenthesis
"(". Then we scan the input "a", recognize it as an identifier, and search the symbol table to check
whether the same entry is present. If not, the lexical analyzer inserts the information about this identifier into the symbol
table and returns its location (100).

If the same identifier or variable is already present in the symbol table, the lexical analyzer does not
insert it into the table again; instead it returns the location where it is present.

REGULAR DEFINITION FOR LANGUAGE CONSTRUCTS:

1) Strings: A string is a collection of a finite number of alphabet symbols or letters. Strings are
synonymously called words.

- The length of a string s is denoted by |s|.
- The empty string is denoted by ε.
- The empty set of strings is denoted by φ.

2) Sequences: The following terms are commonly used for strings.

- Prefix of a string: A string obtained by removing zero or more trailing symbols. For example, for the string "Hindustan" a prefix could be "Hindu".
- Suffix of a string: A string obtained by removing zero or more leading symbols. For example, for the string "Hindustan" a suffix could be "stan".
- Substring: A string obtained by removing a prefix and a suffix of a given string. For example, for the string "Hindustan", the string "indu" is a substring.

3) Comments: The regular expression for a single-line comment statement can be

r.e = // (letter + digit + whitespace)*

The regular expression for a multi-line comment statement can be

r.e = /* (letter + digit + whitespace + newline)* */

TRANSITION DIAGRAM FOR RECOGNITION OF TOKENS:

Lexical analysis is a process of recognizing tokens from the input source program. To recognize
tokens the lexical analyzer performs the following steps.

Step 1: The lexical analyzer stores the input in the input buffer.

Step 2: The token is read from the input buffer and a regular expression is built for the corresponding
token.

Step 3: From these regular expressions a finite automaton is built. The finite automaton is usually in
non-deterministic form, that is, a non-deterministic finite automaton (NFA) is built.

Step 4: For each state of the NFA, a function is designed, and each input along the transition edges
corresponds to the input parameters of these functions.

Step 5: The set of such functions ultimately creates the lexical analyzer program.

Finite automata are typically represented using transition diagrams. A transition graph can
be defined as a collection of:

1. A finite set of states K.
2. A finite set of input symbols ∑.
3. A non-empty state S of K, called the start state.
4. A set F ⊆ K of final states.
5. A transition function K × ∑ → K, taking a state from K and an input symbol from ∑.
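As a hedged C sketch of a transition-diagram recognizer for the token id = letter (letter | digit)*; the state numbers and the helper name match_id are illustrative assumptions, not from the notes.

#include <ctype.h>

/* returns the length of the identifier lexeme starting at s, or 0 if there is none */
int match_id(const char *s)
{
    int state = 0, i = 0;
    for (;;) {
        char c = s[i];
        switch (state) {
        case 0:                                        /* start state            */
            if (isalpha((unsigned char)c)) { state = 1; i++; }
            else return 0;                             /* no identifier here     */
            break;
        case 1:                                        /* inside the identifier  */
            if (isalnum((unsigned char)c)) { i++; }
            else return i;                             /* accepting: stop on any other input */
            break;
        }
    }
}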

G.LAVANYA, Asst.Prof, NIT, NARASARAOPETA Page 19


UNIT-1 COMPILER DESIGN

RESERVED WORDS AND IDENTIFIERS:

Reserved words are special words used in a programming language that are associated
with a fixed meaning; for example if, else, while, for and break are reserved words
used in the C language. The lexical analyzer should identify the reserved words correctly.

An identifier is a kind of variable name which stores some value. It is a collection of letters or
alphanumeric characters. The first character of an identifier must always be a letter.

SIGNIFICANCE OF LEXEMES WITH LONGEST PREFIX:

Lexical analysis is a process of recognizing tokens from the source program, and the compiler does
this job by constructing a recognizer that looks for the lexemes stored in the input buffer. This
recognizer works on the rule: "if more than one pattern matches, then the recognizer has to choose the
longest lexeme matched."
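A hedged C sketch of this longest-match rule for the relational operators < and <=; the token values follow the earlier encoding table (relop code 7, values 1 = LT and 2 = LE), while the helper itself is illustrative.

/* returns the length of the relop lexeme starting at s (0 if none),
   and stores its attribute value (1 = LT, 2 = LE) through *value      */
int match_lt_relop(const char *s, int *value)
{
    if (s[0] == '<') {
        if (s[1] == '=') { *value = 2; return 2; }  /* "<=" : the longer lexeme wins      */
        *value = 1;
        return 1;                                    /* "<"  : only one character matches */
    }
    return 0;
}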

LEX: For efficient design of a compiler, various tools have been built for constructing the lexical
analyzer using a special-purpose notation called regular expressions.

The regular expressions are used in recognizing the tokens. Now we will discuss a special
language that specifies the tokens using regular expressions. A tool called LEX accepts
this specification.

Lex scans the source program in order to get the stream of tokens, and these are related together
so that various programming constructs such as expressions, blocks of statements, procedures and
control structures can be realized.

This task of relating the tokens together is known as parsing.

During the parsing of the program, rules are defined to establish the relationship between the
tokens. These rules are called a grammar.

YACC (Yet Another Compiler-Compiler) is another automated tool which is used to specify the grammar
for realizing the source programming constructs.

YACC takes the description of a grammar in a specification file and produces a C routine
called a parser. Thus LEX and YACC are two important utilities that generate the lexical analyzer
and the syntax analyzer.

LEX: LEXICAL ANALYZER GENERATOR:

For efficient design of a compiler, various tools have been built for constructing the lexical analyzer
using a special-purpose notation called regular expressions.


The regular expressions are used in recognizing the tokens. Now we will discuss a special
language that specifies the tokens using regular expressions; a tool called LEX accepts this
specification. Basically LEX is a Unix utility which generates the lexical analyzer.

A LEX-generated lexer is very fast at finding tokens compared to a lexical analyzer written by hand
in C.

The LEX specification file is created using the extension .l; for example, the specification file
can be x.l. From this specification LEX produces lex.yy.c, a C program which is the actual lexical analyzer.

The LEX specification file stores the regular expressions for the tokens, and the lex.yy.c file
contains the tabular representation of the transition diagrams constructed from the regular
expressions.

Lex source program (x.l)  →  LEX compiler  →  lex.yy.c

lex.yy.c                  →  C compiler    →  a.out

input stream              →  a.out         →  sequence of tokens

The lexemes can be recognized with the help of this tabular representation of the
transition diagram.

The actions associated with the regular expressions in the lex specification file are pieces of C code and are carried
over directly into lex.yy.c.

Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the
lexical analyzer that transforms an input stream into a sequence of tokens.

Recognizing words with LEX:

Lex program consists of three sections.

1. Declaration section
2. Rule section
3. Procedure section (auxiliary procedure section)

Lex source program has the basic format as

%{


Declaration section

%}

%%

Rule section

%%

Auxiliary procedure section.

Declaration section:

- Declaration of variables is done in the declaration section; regular definitions can also be written here.
- In general, the definition section is used to define macros and to import the header files written in C.

Rule section: The rule section consists of regular expressions with associated actions. The
translation rules take the format:

R1 {action1}

R2 {action2}

Rn {actionn}

Here Ri indicates a regular expression and actioni describes the action that the lexical analyzer needs to
take for the corresponding regular expression.

The rule section is the most important section; here the patterns are associated with C statements.
Patterns are nothing but regular expressions.

Auxiliary procedure section:

- In this section the required procedures are defined; these procedures may also be required by the actions in the rule section.
- It is also called the C-code section. It contains C statements and functions that are called by the rules in the rule section.

Ex:

%{
#include <stdio.h>
%}
%%
"RAMA"   |
"SITA"   |
"geeta"  { printf("\n noun"); }
"sings"  |
"dances" |
"eat"    { printf("\n verb"); }
%%
int main()
{
    yylex();
    return 0;
}
int yywrap()
{
    return 1;
}

USE OF SYMBOL TABLE IN LEX:

- During the process of compilation it is efficient to have a symbol table so that, while the lexer is running, we can add new words without modifying or recompiling the lex program. There are two important activities associated with the symbol table: insert_word() and search_word().
- insert_word() will insert a newly encountered word into the symbol table.
- search_word() will perform the look-up activity.
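A hedged C sketch of these two operations; the array-based table, its size, and the exact signatures of insert_word() and search_word() are assumptions for illustration only.

#include <string.h>
#include <stdlib.h>

#define MAXSYM 100

static char *table[MAXSYM];     /* very simple linear symbol table */
static int   count = 0;

/* returns the index of w in the table, or -1 if it is not present */
int search_word(const char *w)
{
    for (int i = 0; i < count; i++)
        if (strcmp(table[i], w) == 0)
            return i;
    return -1;
}

/* inserts w if it is new and returns its index (look-up first, then insert); strdup is POSIX */
int insert_word(const char *w)
{
    int i = search_word(w);
    if (i >= 0)
        return i;
    table[count] = strdup(w);
    return count++;
}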

Parsing command lines with LEX:

The command-line parameters are the parameters that appear on the prompt. The
command-line interface allows the user to interact with the computer by typing commands. In C we can pass
these parameters to the main function in the form of an array of character strings.

Ex: $ cp abc.txt pqr.txt

Here argv[0] = cp

     argv[1] = abc.txt

     argv[2] = pqr.txt
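A minimal hedged C sketch of receiving these command-line parameters through main(argc, argv), matching the cp example above:

#include <stdio.h>

int main(int argc, char *argv[])
{
    /* argv[0] is the command name; argv[1]..argv[argc-1] are its parameters */
    for (int i = 0; i < argc; i++)
        printf("argv[%d] = %s\n", i, argv[i]);
    return 0;
}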


THE EVOLUTION OF PROGRAMMING LANGUAGES


The first electronic computers appeared in the 1940's and were programmed in machine language
by sequences of 0's and 1's that explicitly told the computer what operations to execute and in
what order. The operations themselves were very low level: move data from one location to
another, add the contents of two registers, compare two values, and so on. Needless to say, this
kind of programming was slow, tedious, and error prone. And once written, the programs were
hard to understand and modify.

The Move to Higher-level Languages


The first step towards more people-friendly programming languages was the
development of mnemonic assembly languages in the early 1950's. Initially, the instructions in
an assembly language were just mnemonic representations of machine instructions. Later, macro
instructions were added to assembly languages so that a programmer could define parameterized
short hands for frequently used sequences of machine instructions.

Impacts on Compilers
Since the design of programming languages and compilers are intimately related, the
advances in programming languages placed new demands on compiler writers. They had to
devise algorithms and representations to translate and support the new language features. Since
the 1940's, computer architecture has evolved as well. Not only did the compiler writers have to
track new language features, they also had to devise translation algorithms that would take
maximal advantage of the new hardware capabilities.

THE SCIENCE OF BUILDING A COMPILER


Compiler design is full of beautiful examples where complicated real-world problems are
solved by abstracting the essence of the problem mathematically. These serve as excellent
illustrations of how abstractions can be used to solve problems: take a problem, formulate a
mathematical abstraction that captures the key characteristics, and solve it using mathematical
techniques. The problem formulation must be grounded in a solid understanding of the
characteristics of computer programs, and the solution must be validated and refined empirically.

Modelling in Compiler Design and Implementation:


The study of compilers is mainly a study of how we design the right mathematical models and
choose the right algorithms, while balancing the need for generality and power against simplicity
and efficiency.
The Science of Code Optimization
The term "optimization" in compiler design refers to the attempts that a compiler makes
to produce code that is more efficient than the obvious code. "Optimization" is thus a misnomer,
since there is no way that the code produced by a compiler can be guaranteed to be as fast or
faster than any other code that performs the same task.

APPLICATIONS OF COMPILER TECHNOLOGY


Compiler design is not only about compilers, and many people use the technology learned by
studying compilers in school, yet have never, strictly speaking, written (even part of) a compiler
for a major programming language. Compiler technology has other important uses as well.
Additionally, compiler design impacts several other areas of computer science. In this section,
we review the most important interactions and applications of the technology.
Implementation of High-Level Programming Languages
A high-level programming language defines a programming abstraction: the programmer
expresses an algorithm using the language, and the compiler must translate that program to the
target language. Generally, higher-level programming languages are easier to program in, but are
less efficient, that is, the target programs run more slowly. Programmers using a low-level
language have more control over a computation and can, in principle, produce more efficient
code.

Optimizations for Computer Architectures


The rapid evolution of computer architectures has also led to an insatiable demand for new
compiler technology. Almost all high-performance systems take advantage of the same two basic
techniques: parallelism and memory hierarchies. Parallelism can be found at several levels: at
the instruction level, where multiple operations are executed simultaneously and at
the processor level, where different threads of the same application are run on different
processors. Memory hierarchies are a response to the basic limitation that we can build very fast
storage or very large storage, but not storage that is both fast and large.

Parallelism
All modern microprocessors exploit instruction-level parallelism. However, this parallelism can
be hidden from the programmer. Programs are written as if all instructions were executed in
sequence; the hardware dynamically checks for dependencies in the sequential instruction stream
and issues them in parallel when possible.
Memory Hierarchies
A memory hierarchy consists of several levels of storage with different speeds and sizes, with
the level closest to the processor being the fastest but smallest. The average memory-access
time of a program is reduced if most of its accesses are satisfied by the faster levels of the
hierarchy. Both parallelism and the existence of a memory hierarchy improve the potential
performance of a machine, but they must be harnessed effectively by the compiler to deliver real
performance on an application.

PROGRAMMING LANGUAGE BASICS


If a language uses a policy that allows the compiler to decide an issue, then we say that the
language uses a static policy or that the issue can be decided at compile time. On the other hand,
a policy that only allows a decision to be made when we execute the program is said to be a
dynamic policy or to require a decision at run time.
