CD - 1
Unit – I Syllabus
Introduction: Language processors, The Structure of a Compiler, the science of building a
compiler.
Lexical Analysis: The Role of the lexical analyzer, Input buffering, Specification of tokens,
Recognition of tokens, the lexical analyzer generator Lex, Design of a Lexical Analyzer
generator.
INTRODUCTION
LANGUAGE PROCESSOR:
A language processor is a special type of computer software that translates source code into
machine code. The different types of language processors are:
Compiler
Assembler
Interpreter
Compiler
A compiler is a program that can read a program in one language (the source language) and
translate it into an equivalent program in another language (the target language). An important
role of the compiler is to report any errors in the source program that it detects during the
translation process.
If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs.
Simply put, the compiler translates a high-level language into assembly language.
Interpreter:
An interpreter is another common kind of language processor. Instead of producing a
target program as a translation, an interpreter appears to directly execute the operations specified
in the source program on inputs supplied by the user, as shown in Fig.
An interpreter, however, can usually give better error diagnostics than a compiler, because
it executes the source program statement by statement.
The modified source program is then fed to a compiler. The compiler may produce an
assembly-language program as its output, because assembly language is easier to produce as
output and is easier to debug. The assembly language is then processed by a program called an
assembler that produces relocatable machine code as its output.
A compiler maps a source program into a target program, and this mapping has two parts:
analysis and synthesis. If the analysis part detects that the source program is either syntactically
ill formed or semantically unsound, then it must provide informative messages, so the user can
take corrective action. The analysis part also collects information about the source program and
stores it in a data structure called a symbol table, which is passed along with the intermediate
representation to the synthesis part.
The synthesis part constructs the desired target program from the intermediate representation and
the information in the symbol table. The analysis part is often called the front end of the
compiler; the synthesis part is the back end.
Lexical Analysis:
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters making up the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the
form (token-name, attribute-value) that it passes on to the subsequent phase, syntax analysis.
position = initial + rate * 60
The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract
symbol standing for identifier and 1 points to the symbol-table entry for position. The
symbol-table entry for an identifier holds information about the identifier, such as its name and
type.
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token needs
no attribute-value, we have omitted the second component. We could have used any abstract
symbol such as assign for the token-name, but for notational convenience we have chosen to use
the lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table
entry for initial.
4. + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry
for rate.
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).
Blanks separating the lexemes would be discarded by the lexical analyzer. Figure 1.7 shows the
representation of the assignment statement (1.1) after lexical analysis as the sequence of tokens
(id, 1) (=) (id, 2) (+) (id, 3) (*) (60) ………………..(1.2)
In this representation, the token names =, +, and * are abstract symbols for the assignment,
addition, and multiplication operators, respectively.
Syntax Analysis:
The second phase of the compiler is syntax analysis or parsing. The parser uses the first
components of the tokens produced by the lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token stream. A typical representation
is a syntax tree in which each interior node represents an operation and the children of the node
represent the arguments of the operation.
Semantic Analysis:
The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition. An important
part of semantic analysis is type checking, where the compiler checks that each operator has
matching operands. For example, many programming language definitions require an array index
to be an integer; the compiler must report an error if a floating-point number is used to index an
array.
Code Optimization:
The machine-independent code-optimization phase attempts to improve the intermediate
code so that better target code will result. Usually better means faster, but other objectives may be
desired, such as shorter code, or target code that consumes less power. For the running
assignment, the optimized intermediate code is
t1 = id3 * 60.0
id1 = id2 + t1
Code Generation:
The code generator takes as input an intermediate representation of the source program
and maps it into the target language. If the target language is machine code, registers or memory
locations are selected for each of the variables used by the program. Then, the intermediate
instructions are translated into sequences of machine instructions that perform the same task.
For example, using registers R1 and R2, the intermediate code in (1.4) might get translated into
the machine code
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
The first operand of each instruction specifies a destination. The F in each instruction
tells us that it deals with floating-point numbers. The code loads the contents of address id3 into
register R2, and then multiplies it with the floating-point constant 60.0. The # signifies that 60.0
is to be treated as an immediate constant. The third instruction moves id2 into register R1 and
the fourth adds to it the value previously computed in register R2. Finally, the value in register
R1 is stored into the address of id1, so the code correctly implements the assignment statement
(1.1).
Symbol-Table Management:
The symbol table is a data structure containing a record for each variable name, with
fields for the attributes of the name. The data structure should be designed to allow the compiler
to find the record for each name quickly and to store or retrieve data from that record quickly.
Compiler-Construction Tools:
Some commonly used compiler-construction tools include
1. Parser generators that automatically produce syntax analyzers from a grammatical description
of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of the
tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a parse tree
and generating intermediate code.
4. Code-generation generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a target
machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values are
transmitted from one part of a program to each other part. Data-flow analysis is a key part of code
optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing
various phases of a compiler.
A compiler must accept all source programs that conform to the specification of the
language; the set of source programs is infinite and any program can be very large, consisting of
possibly millions of lines of code. Any transformation performed by the compiler while
translating a source program must preserve the meaning of the program being compiled.
Compiler writers thus have influence over not just the compilers they create, but all the programs
that their compilers compile.
1. Modeling in Compiler Design and Implementation
The study of compilers is mainly a study of how we design the right mathematical models and
choose the right algorithms, while balancing the need for generality and power against simplicity
and efficiency.
• The compilation time must be kept reasonable.
• The engineering effort required must be manageable.
Applications of Compiler Technology:
1. Implementation of High-Level Programming Languages.
2. Optimizations for Computer Architectures
3. Design of New Computer Architectures
4. Program Translations
5. Software Productivity Tools
LEXICAL ANALYSIS
As the first phase of a compiler, the main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce as output a sequence of
tokens for each lexeme in the source program. The stream of tokens is sent to the parser for
syntax analysis. When the lexical analyzer discovers a lexeme constituting an identifier, it needs
to enter that lexeme into the symbol table.
Since the lexical analyzer is the part of the compiler that reads the source text, it may
perform certain other tasks besides identification of lexemes. One such task is stripping out
comments and whitespace (blank, newline, tab, and perhaps other characters that are used to
separate tokens in the input). Another task is correlating error messages generated by the
compiler with the source program.
Lexical analysis is divided into two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such as
deletion of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence
of tokens as output.
Example:
Lexeme    Token    Pattern
float     float    float
key       id       a letter followed by any number of letters or digits
=         relop    < | <= | = | <> | > | >=
1.2       num      any numeric constant
;         ;        ;
Fig: Examples of tokens
Lexical Errors:
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a
source-code error. For instance, suppose the string fi is encountered for the first time in a C
program in the context:
fi ( a == f ( x ) ) ...
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared
function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the
token id to the parser and let some other phase of the compiler — probably the parser in this case
— handle an error due to transposition of the letters.
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Example: The token names and associated attribute values for the Fortran statement
E = M * C ** 2
are written below as a sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
INPUT BUFFERING
Before recognizing lexemes, the lexical analyzer must read the source program efficiently. This
task is made difficult by the fact that we often have to look one or more characters beyond the
next lexeme before we can be sure we have the right lexeme. For instance, we cannot be sure
we've seen the end of an identifier until we see a character that is not a letter or digit, and
therefore is not part of the lexeme for id. In C, single-character operators like -, =, or < could also
be the beginning of a two-character operator like ->, ==, or <=.
There are two methods:
1. A two-buffer scheme that handles large lookaheads safely.
2. Sentinels that save time checking for the ends of buffers.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using
one system read command we can read N characters into a buffer, rather than using one system
call per character. If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file and is different from any possible character of
the source program.
Once the next lexeme is determined, forward is set to the character at its right end. Then, after
the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set
to the character immediately after the lexeme just found. In the figure above, we see forward has
passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted
one position to its left.
Algorithm (advancing forward in the two-buffer scheme):
if forward is at the end of the first buffer then begin
    reload the second buffer;
    forward = beginning of the second buffer;
end
else if forward is at the end of the second buffer then begin
    reload the first buffer;
    forward = beginning of the first buffer;
end
else forward = forward + 1;
Sentinels:
If we use the previous scheme, we must check, each time we advance forward, that we have
not moved off one of the buffers; if we do, then we must also reload the other buffer. Thus, for
each character read, we make two tests: one for the end of the buffer, and one to determine what
character is read (the latter may be a multiway branch). We can combine the buffer-end test with
the test for the current character if we extend each buffer to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural choice
is the character eof.
Algorithm (lookahead code with sentinels):
switch ( *forward++ ) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}
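As a concrete illustration, here is a minimal C sketch of the two-buffer scheme with sentinels. It is a sketch under simplifying assumptions, not the book's code: input is simulated from a string instead of disk reads, '\0' stands in for the eof sentinel (so the source text itself must not contain '\0'), and N is shrunk to 4 so the reloads actually happen:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define N 4                 /* buffer half size (4096 in practice) */
#define EOF_CH '\0'         /* stand-in for the eof sentinel */

static char buf[2][N + 1];  /* two buffer halves, each with a sentinel slot */
static const char *src;     /* stand-in for the input file */
static int cur;             /* which half the forward pointer is in */
static int pos;             /* forward's offset within that half */

/* Read up to N characters into one half and plant the sentinel after them. */
static void reload(int half) {
    size_t n = strlen(src);
    if (n > N) n = N;
    memcpy(buf[half], src, n);
    buf[half][n] = EOF_CH;  /* sentinel: end of buffer or end of input */
    src += n;
}

void init(const char *input) {
    src = input;
    cur = 0;
    pos = 0;
    reload(0);
}

/* Return the next input character. The sentinel folds the two tests
 * ("end of buffer?" and "which character?") into one comparison. */
char advance(void) {
    char c = buf[cur][pos++];
    if (c == EOF_CH) {
        if (pos - 1 == N) {         /* sentinel at end of a half: reload */
            cur = 1 - cur;
            pos = 0;
            reload(cur);
            return advance();
        }
        return EOF_CH;              /* eof within a half: real end of input */
    }
    return c;
}
```

Reading characters with advance() until it returns EOF_CH reproduces the input stream while issuing one reload per N characters, exactly as the case-eof code above describes.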
SPECIFICATION OF TOKENS:
Regular expressions are an important notation for specifying lexeme patterns. While they
cannot express all possible patterns, they are very effective in specifying those types of patterns
that we actually need for tokens. In this section we shall study the formal notation for regular
expressions.
Strings and Languages
An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and
punctuation. The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet;
it is used in many software systems.
Operations on Languages:
Let L be the set of letters and D the set of digits. Examples of languages formed from L and D:
1. L ∪ D is the set of letters and digits: all strings of length one that are either a letter or a digit.
2. LD is the set of all strings of length two, each consisting of a letter followed by a digit.
3. L4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
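Since L and D are finite, the sizes of several of these languages can be computed directly. A small C sketch (the function names are illustrative, not from the text) of how union and concatenation compose cardinalities for finite languages over disjoint symbol sets:

```c
#include <assert.h>

/* |L| = 52 (upper- and lower-case letters), |D| = 10 (digits).
 * For finite languages over disjoint symbol sets:
 *   disjoint union:  |A u B| = |A| + |B|
 *   concatenation:   |AB|    = |A| * |B| (each string factors uniquely) */
long card_union(long a, long b)  { return a + b; }
long card_concat(long a, long b) { return a * b; }
```

With these, |L ∪ D| = 62, |LD| = 520, and |L4| = 52 * 52 * 52 * 52 = 7,311,616 strings. L* and D+ are infinite, so no such count exists for them.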
RECOGNITION OF TOKENS:
It means how to take the patterns for all the needed tokens and build a piece of code that
examines the input string and finds a prefix that is a lexeme matching one of the patterns. Our
discussion will make use of the following running example.
For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals"
and <> is "not equals," because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of
tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described
using regular definitions, as shown below.
digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? ( E [+-]? digits )?
letter -> [A-Za-z]
id     -> letter ( letter | digit )*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>
Figure: Tokens, their patterns, and attribute values
Transition Diagrams:
Transition diagrams have a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input looking for a
lexeme that matches one of several patterns. We may think of a state as summarizing all we need
to know about what characters we have seen between the lexemeBegin pointer and the forward
pointer.
Edges are directed from one state of the transition diagram to another. Each edge is
labeled by a symbol or set of symbols. If we are in some state s, and the next input symbol is a,
we look for an edge out of state s labeled by a (and perhaps by other symbols, as well). If we find
such an edge, we advance the forward pointer and enter the state of the transition diagram to
which that edge leads. We shall assume that all our transition diagrams are deterministic,
meaning that there is never more than one edge out of a given state with a given symbol among
its labels.
Example: The figure below is a transition diagram that recognizes the lexemes matching the token
relop. We begin in state 0, the start state. If we see < as the first input symbol, then among the
lexemes that match the pattern for relop we can only be looking at <, <>, or <=. We therefore go
to state 1, and look at the next character. If it is =, then we recognize lexeme <=, enter state 2, and
return the token relop with attribute LE, the symbolic constant representing this particular
comparison operator. If in state 1 the next character is >, then instead we have lexeme <>, and
enter state 3 to return an indication that the not-equals operator has been found. On any other
character, the lexeme is <, and we enter state 4 to return that information. Note, however, that
state 4 has a * to indicate that we must retract the input one position.
Figure: Transition diagram for relop
The transition diagram for token number is shown in Fig. 3.16, and is so far the most complex
diagram we have seen. Beginning in state 12, if we see a digit, we go to state 13. In that state, we
can read any number of additional digits. However, if we see anything but a digit or a dot, we
have seen a number in the form of an integer; 123 is an example. That case is handled by entering
state 20, where we return token number and a pointer to a table of constants where the
found lexeme is entered. These mechanics are not shown on the diagram but are analogous to the
way we handled identifiers.
The final transition diagram, shown in Fig. 3.17, is for whitespace. In that diagram, we look for
one or more "whitespace" characters, represented by delim in that diagram — typically these
characters would be blank, tab, newline, and perhaps other characters that are not considered by
the language design to be part of any token.
Use of Lex:
An input file, which we call lex.l, is written in the Lex language and describes the lexical
analyzer to be generated. The Lex compiler transforms lex.l to a C program, in a file that is
always named lex.yy.c. The latter file is compiled by the C compiler into a file called a.out, as
always. The C-compiler output is a working lexical analyzer that can take a stream of input
characters and produce a stream of tokens.
Structure of Lex Programs:
A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
Pattern1   { Action1 }
Pattern2   { Action2 }
……….
Pattern-n  { Action-n }
Each pattern is a regular expression, which may use the regular definitions of the
declarations section. The actions are fragments of code, typically written in C, although many
variants of Lex using other languages have been created. The third section holds whatever
additional functions are used in the actions. Alternatively, these functions can be compiled
separately and loaded with the lexical analyzer.
In the declarations section we see a pair of special brackets, %{ and %}. Anything within
these brackets is copied directly to the file lex.yy.c, and is not treated as a regular definition. It
is common to place there the definitions of manifest constants, using C #define statements to
associate unique integer codes with each of the manifest constants.
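Putting the three sections together, a Lex source file covering a fragment of the running example might look like the sketch below. This follows the shape of the standard Dragon Book Lex example; the integer codes are illustrative, and installID and installNum are assumed auxiliary functions that enter the current lexeme (pointed to by yytext, of length yyleng) into the appropriate table:

```lex
%{
    /* manifest constants: copied verbatim into lex.yy.c */
    #define IF     256
    #define ID     257
    #define NUMBER 258
    #define RELOP  259
    #define LT     260
    #define LE     261
%}
/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      { /* no action and no return: strip whitespace */ }
if        { return IF; }
{id}      { yylval = installID();  return ID; }
{number}  { yylval = installNum(); return NUMBER; }
"<"       { yylval = LT; return RELOP; }
"<="      { yylval = LE; return RELOP; }
%%
int installID()  { /* enter yytext into the symbol table; return its index */ }
int installNum() { /* enter the lexeme into the table of constants */ }
```

Running the Lex compiler on this file produces lex.yy.c, whose yylex() function returns one token code per call, with the attribute value left in yylval.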
Example: Suppose the DFA of Fig. 3.54 is given input abba. The sequence of states entered is
0137, 247, 58, 68, and at the final a there is no transition out of state 68. Thus, we consider the
sequence from the end, and in this case, 68 itself is an accepting state that reports pattern p2 =
abb.
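Once a DFA such as that of Fig. 3.54 has been built, simulating it is just one table lookup per input character. As an illustration (not the exact DFA of Fig. 3.54), here is a C sketch of a table-driven simulation of the classic textbook DFA accepting (a|b)*abb:

```c
#include <assert.h>

/* Transition table for the DFA accepting (a|b)*abb; state 3 is accepting.
 * move[s][0] is the transition on 'a', move[s][1] the transition on 'b'. */
static const int move[4][2] = {
    {1, 0},   /* state 0 */
    {1, 2},   /* state 1 */
    {1, 3},   /* state 2 */
    {1, 0},   /* state 3 (accepting) */
};

/* Run the DFA over the whole string s; return 1 iff it ends in state 3. */
int dfa_accepts(const char *s) {
    int state = 0;
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b')
            return 0;                    /* symbol not in the alphabet */
        state = move[state][*s == 'b'];
    }
    return state == 3;
}
```

A lexical analyzer would not run the DFA to the end of the input as this sketch does; it would remember the last accepting state passed and, on getting stuck (as on the final a of abba), retract to it and report the corresponding pattern.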