Unit-1 Notes CD OU
UNIT-I
Syllabus:
Introduction: The Structure of a Compiler, Phases of Compilation, The
Translation Process, Major Data Structures in a Compiler, Bootstrapping and
Porting.
Lexical Analysis (Scanner): The Role of the Lexical Analyzer, Input
Buffering, Specification of Tokens, Recognition of Tokens, The Lexical
Analyzer Generator Lex.
Objective: To introduce the basic concepts of designing and implementing translators for
various languages, and system-building tools such as LEX for lexical analysis.
Outcome:
Introduction
In this new era, everyone uses software applications in their daily lives. Software
applications are written in programming languages, and these programs must be
converted into a form that can be executed by a computer. Software that does this
conversion is called a translator.
Types of Translators
1. Interpreter
2. Assembler
3. Compiler
1) Interpreter: An interpreter is a translator that translates a high-level language into a
low-level language. It reads the source code one instruction or line at a time, converts
that line into machine code, and executes it. The machine code is then discarded and the
next line is read.
A compiler, by contrast, scans the entire program once and then converts it into machine
language, which can then be executed by the computer's processor. In short, a compiler
translates the entire program in one go; an interpreter first converts the high-level
language into an intermediate form and then executes it line by line.
A compiler generates an error report after translating the entire code, whereas an
interpreter stops as soon as an error is encountered, reports it, and scans no further
code.
A compiler reads a program written in one language, called the source language, and
translates it into an equivalent program in another language, called the target language.
During this conversion the compiler detects and reports any syntactic errors in the
source program.
The first computers of the late 1940s were programmed in machine language. Machine
language was soon replaced by assembly language, in which instructions and memory
locations are given symbolic names. An assembler translates the symbolic assembly
code into equivalent machine code. Assembly language is an improvement, but it is still
machine dependent. Later, high-level languages were introduced, in which programs are
written using English-like statements.
Brief History
• The term “compiler” was coined in the early 1950s by Grace Murray Hopper.
Translation was then viewed as the “compilation” of a sequence of routines selected
from a library
• The first compiler of the high-level language FORTRAN was developed between
1954 and 1957 at IBM by a group led by John Backus
• The study of the scanning and parsing problems was pursued in the 1960s and
1970s and led to a fairly complete solution
• The development of methods for generating efficient target code, known as
optimization techniques, is still ongoing research
• Compiler technology has also been applied in rather unexpected areas.
The Structure of a Compiler
A compiler is divided into two major parts: analysis and synthesis.
1) The analysis part breaks up the source program into its constituent pieces and creates
an intermediate representation of the source program. The analysis part consists of the
following phases:
Lexical Analysis
Syntax Analysis
Semantic Analysis
2) The synthesis part constructs the desired target program from the intermediate
representation of the source program. It consists of the following phases:
Intermediate Code Generator
Code Optimizer
Code Generator
Figure (Phases of a compiler): Lexical Analyzer → tokens → Syntax Analyzer → syntax tree
→ Semantic Analyzer → syntax tree → Intermediate Code Generator → intermediate
representation → Machine-Independent Code Optimizer → intermediate representation →
Code Generator → target code → Machine-Dependent Code Optimizer; the Symbol-Table
Manager and the Error Handler interact with every phase.
The symbol table, which stores information about the entire source program, is
used by all phases of the compiler.
Phases of Compilation
Lexical Analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Code Generation
Symbol Table Management
Error Handling
Lexical Analysis:
The first phase of a compiler is called lexical analysis or scanning. The lexical
analyzer reads the stream of characters that make up the source program starting
from left to right and groups the characters into meaningful sequences called
lexemes. For each lexeme, the lexical analyzer produces a token of the form
<token-name, attribute-value>
token-name is an abstract symbol that is used during syntax analysis.
attribute-value points to an entry in the symbol table
Example: p = i + r * 60
Here p, =, i, +, r, *, 60 are all separate lexemes.
<id,1> <=> <id,2> <+> <id,3> <*> <60> are the tokens generated.
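As a rough illustration (not part of the original notes), one way to represent such
<token-name, attribute-value> pairs in C is sketched below; the enum values and field
names are assumptions made for this sketch.
/* Minimal token record: the name drives parsing, the attribute points
   into the symbol table (for identifiers) or holds the constant value. */
enum TokenName { ID, ASSIGN, PLUS, STAR, NUM };

struct Token {
    enum TokenName name;   /* abstract symbol used during syntax analysis */
    int attribute;         /* symbol-table index or numeric value         */
};

/* Token stream for p = i + r * 60:
   {ID,1} {ASSIGN} {ID,2} {PLUS} {ID,3} {STAR} {NUM,60} */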
Syntax Analysis:
The second phase of the compiler is syntax analysis or parsing. The parser uses
the tokens produced by the lexical analyzer to create a tree-like intermediate
representation that depicts the grammatical structure of the token stream. A
typical representation is a syntax tree in which each interior node represents an
operation and the children of the node represent the arguments(operands) of the
operation.
Example:
For p = i + r * 60, the syntax tree is
            =
         /     \
    <id,1>      +
             /     \
        <id,2>      *
                 /     \
            <id,3>      60
Semantic Analysis:
It is the third phase of the compiler. The semantic analyzer uses the syntax tree
and the information in the symbol table to check the source program for semantic
consistency with the language definition. An important part of semantic analysis is type
checking; where the language permits it, the compiler inserts type conversions
(coercions), for example converting an integer operand into a floating-point number.
Example:
For p = i + r * 60, assuming p, i and r are floating-point variables, the integer 60 must
be converted, so the semantic analyzer adds an inttofloat node and the tree becomes
<id,1> = <id,2> + <id,3> * inttofloat(60)
Intermediate Code Generation:
It is the fourth phase of the compiler. After syntax and semantic analysis of the source
program, many compilers generate an explicit low-level or machine-like intermediate
representation of the source program. A common choice is three-address code, a
sequence of assembly-like instructions with at most three operands per instruction.
Example:
For p = i + r * 60, the three-address code is
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Code Optimization:
It is the fifth phase of the compiler. It takes the intermediate code as input and
produces optimized intermediate code as output.
Example:
For p = i + r * 60, the optimizer can perform the inttofloat conversion once and for all
at compile time and eliminate the temporary t3, giving
t1 = id3 * 60.0
id1 = id2 + t1
Code Generation:
It is the sixth phase of the compiler. The code generator takes the intermediate
representation (or the optimized machine-independent representation) as input and
maps it into the target language, such as machine code or assembly code.
Example:
For p = i + r * 60, with the optimized intermediate code above and
using registers R1 and R2, the machine code is:
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
Symbol Table Manager:
Symbol table is used to store all the information about identifiers used in the
program.
It is a data structure containing a record for each identifier, with fields for
the attributes of the identifier.
It allows the compiler to find the record for each identifier quickly and to store or
retrieve data from that record.
Whenever an identifier is detected in any of the phases, it is stored in the
symbol table.
Error Handling:
One of the most important functions of a compiler is the detection and
reporting of errors in the source program. The error message should allow the
programmer to determine exactly where the errors have occurred.
Each phase can encounter errors. After detecting an error, a phase must
handle the error so that compilation can proceed.
Figure (Translation of p = i + r * 60 through the phases): the syntax analyzer builds the
tree := with children <id,1> and the subtree <id,2> + <id,3> * 60; the semantic analyzer
rewrites 60 as inttofloat(60); after intermediate code generation, optimization and code
generation the target code is
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Syntax errors are errors in the program text; they may be either lexical or grammatical.
A lexical error is a mistake in a lexeme, for example typing tehn instead of then, or
missing one of the quotes in a literal.
A grammatical error is one that violates the (grammatical) rules of the language, for
example if x = 7 y := 4 (missing semicolon).
Syntax errors must be detected by a compiler and at least reported to the user (in a
helpful way). If possible, the compiler should make the appropriate correction(s).
Compiler-Construction Tools
Software development environments contain tools such as language editors, debuggers,
version managers, profilers, test harnesses, and so on, which are used while constructing
compilers. In addition to these general software-development tools, other more
specialized tools have been created to implement various phases of a compiler. They
include parser generators, scanner generators, syntax-directed translation engines,
code-generator generators, data-flow analysis engines, and compiler-construction
toolkits.
Major Data Structures in a Compiler
Syntax Tree: The parser generates the syntax tree. The syntax tree is constructed as a
standard pointer-based structure that is dynamically allocated; the entire tree can be
kept as a single variable pointing to the root node. Each node is a record whose fields
represent the information collected by the parser and the semantic analyzer.
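A minimal sketch of such a node record in C follows, assuming a binary expression tree;
the type names, fields, and the mknode helper are illustrative, not part of the notes.
#include <stdlib.h>

/* Hypothetical syntax-tree node: each node is a dynamically allocated
   record pointing to its children. */
enum NodeKind { OP_NODE, ID_NODE, NUM_NODE };

struct TreeNode {
    enum NodeKind kind;
    char op;                    /* operator, e.g. '+', '*', when kind == OP_NODE */
    int symtab_index;           /* symbol-table entry when kind == ID_NODE       */
    double value;               /* constant value when kind == NUM_NODE          */
    struct TreeNode *left, *right;
};

/* Allocate a node; the caller fills in the kind-specific fields. */
struct TreeNode *mknode(enum NodeKind kind, struct TreeNode *l, struct TreeNode *r) {
    struct TreeNode *n = malloc(sizeof *n);
    n->kind = kind;
    n->left = l;
    n->right = r;
    return n;
}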
Symbol table is used by both the analysis and the synthesis parts of a compiler.
A symbol table may serve the following purposes:-
To store the names of all entities in a structured form at one place.
To verify if a variable has been declared.
To implement type checking, by verifying that assignments and expressions in the
source code are semantically correct.
To determine the scope of a name (scope resolution).
The scanner and parser may enter identifiers into the table; the semantic analyzer then
adds the data type and other information to those entries.
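A minimal sketch in C of a symbol-table record with insert and lookup, assuming a simple
linear table; the structure, field names, and function names are illustrative only.
#include <string.h>

#define MAX_SYMS 512

struct Symbol {
    char name[64];      /* identifier lexeme                         */
    char type[16];      /* data type filled in by semantic analysis  */
    int  scope;         /* scope level for scope resolution          */
};

static struct Symbol symtab[MAX_SYMS];
static int nsyms = 0;

/* Return the index of name, or -1 if it has not been declared. */
int lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return i;
    return -1;
}

/* Insert a new identifier and return its index (no overflow check
   in this sketch; a real table would also handle nsyms == MAX_SYMS). */
int insert(const char *name, int scope) {
    strncpy(symtab[nsyms].name, name, sizeof symtab[nsyms].name - 1);
    symtab[nsyms].scope = scope;
    return nsyms++;
}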
Literal Table: The Literal Table Stores constants and strings used in the program.
One literal table applies globally to the entire program
Used by the code generator to:
Assign addresses for literals
Enter data definitions in the target code file
Avoids the replication of constants and strings. Quick insertion and lookup are
essential. Deletion is not allowed.
Intermediate Code:
Depending on the kind of intermediate code, it may be kept in an array of text strings,
in a temporary text file, or in a linked list of structures.
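As an illustration of the linked-list option, a three-address instruction could be stored
in a record along these lines; the field names and sizes are assumptions for the sketch.
/* Hypothetical quadruple node for three-address code, kept in a linked list. */
struct Quad {
    char op[8];          /* operator, e.g. "+", "*", "inttofloat" */
    char arg1[32];       /* first operand                         */
    char arg2[32];       /* second operand (may be empty)         */
    char result[32];     /* result name, e.g. a temporary t1      */
    struct Quad *next;   /* next instruction in the list          */
};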
Temporary Files:
Computers did not have enough memory for the entire program to be kept in
memory during compilation. This was solved by using temporary files to hold the
products of intermediate steps.
Nowadays, memory constraints are rarely a problem, but compilers still occasionally
generate intermediate files during some of the steps.
Bootstrapping:
When a translator is required for an old language on a new machine, or for a new
language on an old machine, the best approach is to make use of compilers that already
exist on either machine; that is, write the compiler in another language for which a
compiler already exists.
Bootstrapping is used to create compilers and to move them from one machine to
another by modifying the back end.
For example, suppose we want to write a compiler for a new language X, where the
implementation language of this compiler is Y and the target code being generated is in
language Z; we denote this compiler XYZ. If Y runs on machine M and generates code for
M, it is denoted YMM. If we now run XYZ using YMM, we get a compiler XMZ: a compiler
for source language X that generates target code in language Z and runs on machine M.
T-diagram: (X → Z, written in Y) run through (Y → M, running on M) gives (X → Z, running on M).
For example, a Pascal-to-C++ translator written in C can be bootstrapped with a C
translator for machine M: running the Pascal-to-C++ compiler (written in C) through the
C-to-M compiler yields a Pascal-to-C++ translator that runs on M.
T-diagram: (P → C++, written in C) run through (C → M, running on M) gives (P → C++, running on M).
Porting:
The process of modifying an existing compiler to work on a new machine is often
known as porting the compiler.
To develop a compiler for a new hardware machine from an existing compiler, only the
synthesis part of the compiler has to be changed, because the synthesis part is the
machine-dependent part. This is what is meant by porting.
Native Compiler: A native compiler is a compiler that generates code for the same
platform on which it runs. It converts a high-level language into the computer's native
machine language.
The Role of the Lexical Analyzer:
The lexical analyzer is the first phase of a compiler; its main task is to read the input
characters and produce a sequence of tokens for the parser. The getNextToken command
causes the lexical analyzer to read characters from its input until it can identify the
next lexeme and produce the next token for it, which it returns to the parser.
The lexical analyzer interacts with the symbol table: when it discovers a lexeme
constituting an identifier, it enters that lexeme into the symbol table.
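A rough sketch of this interaction in C; the Token type, the TOK_EOF code, and the
getNextToken prototype below are assumptions for illustration, not a fixed interface.
#define TOK_EOF 0                      /* hypothetical end-of-input token code */

struct Token { int name; int attribute; };
struct Token getNextToken(void);       /* implemented by the lexical analyzer  */

/* The parser repeatedly asks the scanner for the next token until
   the end of the input is reached. */
void parse(void) {
    struct Token tok = getNextToken();
    while (tok.name != TOK_EOF) {
        /* grammar-driven processing of tok goes here */
        tok = getNextToken();
    }
}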
Input Buffering:
The lexical analyzer scans the input from left to right one character at a time. It
uses two pointers begin ptr(bp) and forward ptr(fp) to keep track of the pointer of
the input scanned.
The forward ptr (fp) moves ahead to search for the end of the lexeme. As soon as a blank
space (whitespace) is encountered, it indicates the end of the lexeme; the lexeme is
identified, the whitespace is skipped, and then both pointers are placed at the next
character.
bp
↓
i n t   i = 1 0 ;
↑
fp
Input characters are read from secondary storage, but reading them one at a time is
costly. To speed up the scanning process, a buffering technique is used: a block of data
is first read into a buffer, and the lexical analysis process is then continued on the
buffer.
Buffering techniques:
1. Buffer (one or two buffers are used)
2. Sentinels
Buffering:
One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The
problem with this scheme is that if a lexeme is very long it crosses the buffer boundary,
and the buffer has to be refilled to scan the rest of the lexeme, which overwrites the
first part of the lexeme.
Two Buffer Scheme: To overcome the problem of one buffer scheme, two buffers are
used to store the input string. The first buffer and second buffer are scanned
alternately.
Initially both bp and fp point to the first character of the first buffer. Then fp moves
towards the right in search of the end of the lexeme. As soon as a blank character is
recognized, the string between bp and fp is identified as the corresponding token. To
identify the boundary of the first buffer, an end-of-buffer character is placed at the
end of the first buffer.
Sentinel:
A special character introduced at the end of the buffer, which is not part of the
program, is called a sentinel; eof is the natural choice for the sentinel.
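A sketch in C of the classic two-buffer scheme with sentinels, in which the common case
needs only one test per character; the buffer size, names, and the reload routine are
assumptions made for this sketch.
#include <stdio.h>

#define N 4096                         /* size of each buffer half                 */
#define SENTINEL '\0'                  /* stand-in for the eof sentinel            */

static char buf[2 * N + 2];            /* first half buf[0..N-1], sentinel buf[N];
                                          second half buf[N+1..2N], sentinel buf[2N+1] */
static char *forward;                  /* fp: scans ahead for the end of the lexeme */
static FILE *src;

/* Read up to N characters into one half and terminate it with a sentinel. */
static void reload(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

void init_buffers(FILE *f) {
    src = f;
    reload(buf);                       /* fill the first half                */
    buf[2 * N + 1] = SENTINEL;         /* mark the end of the second half    */
    forward = buf;
}

/* Return the next input character; halves are switched only when the
   sentinel at a half boundary is reached. */
int next_char(void) {
    char c = *forward++;
    if (c != SENTINEL)
        return c;                      /* common case: no boundary test needed */
    if (forward == buf + N + 1) {      /* sentinel ending the first half       */
        reload(buf + N + 1);
        return next_char();
    }
    if (forward == buf + 2 * N + 2) {  /* sentinel ending the second half      */
        reload(buf);
        forward = buf;
        return next_char();
    }
    return EOF;                        /* sentinel inside a half: real end of input */
}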
Specification of Tokens:
An alphabet is any finite set of symbols. Examples of symbols are letters, digits,
and punctuation. The set {0, 1} is the binary alphabet. ASCII and Unicode are
important examples of an alphabet. A string over an alphabet is a finite sequence
of symbols drawn from that alphabet. In language theory, the terms "sentence" and
"word" are often used as synonyms for "string".
A language is any countable set of strings over some fixed alphabet. This definition
includes the empty set ø and the set {ε} whose only member is the empty string.
Longest match rule: When the lexical analyzer reads the source code, it scans the code
character by character; when it encounters a whitespace, an operator symbol, or a
special symbol, it decides that a lexeme is complete. Among the candidate matches, the
longest prefix of the remaining input that matches a token pattern is chosen as the next
lexeme.
Recognition of Tokens:
Patterns are created using regular expressions. These patterns are used to build a piece
of code that examines the input string to find a prefix that matches one of the required
lexemes.
Consider the below grammar for branching statements and recognize the tokens
from it.
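The grammar itself does not appear in these notes; the grammar for branching statements
usually used with this example (following the dragon book) is along these lines:
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε
expr -> term relop term
      | term
term -> id
      | number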
The terminals of the grammar are if, then, else, relop, id, and number, which are
the names of tokens. The patterns for these tokens are described using the
following regular definitions.
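The regular definitions referred to here are not reproduced in the notes; they are
usually written along the following lines:
digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z]
id     -> letter (letter | digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>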
With the help of the above definitions, the lexical analyzer will recognize if, then, and
else as keywords, and relop, id, and number as lexemes.
Lexical analyzer also strips out white space, by recognizing the “token” with the
following regular expression:
ws -> (blank | tab | newline)+
The below table shows, for each lexeme or family of lexemes, which token name is
returned to the parser and what is the attribute value of token.
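The table is not reproduced in these notes; for this example it usually looks roughly as
follows (the attribute values for relop are symbolic constants such as LT, LE, and so on):
Lexemes        Token name    Attribute value
any ws         -             -
if             if            -
then           then          -
else           else          -
any id         id            pointer to symbol-table entry
any number     number        pointer to table entry
<              relop         LT
<=             relop         LE
=              relop         EQ
<>             relop         NE
>              relop         GT
>=             relop         GE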
The Lexical Analyzer Generator Lex:
A Lex program has three sections, separated by double percent signs. The default layout
of a Lex file is:
{definitions}
%%
{rules}
%%
{auxiliary routines}
Definitions Section: This section contains regular definitions and declarations of
variables and constants; C declarations are enclosed between %{ and %}:
%{
Declarations
%}
Rules Section: The translation rules each have the form Rulei{ Actioni }.
Each rule is a regular expression, which may use the regular definitions of the
declaration section. The actions are fragments of code, typically written in C. The
following syntax is used for the rules section of a LEX specification:
%%
Rule1 { Action1 }
Rule2 { Action2 }
%%
The lexical analyzer reads the input character by character; if a character or set of
characters matches one of the regular expressions, the corresponding action part is
executed.
Auxiliary Routines: The third section contains additional functions that are
required. These functions may be compiled separately and loaded with the lexical
analyzer.
Some of these procedures are required by the actions in the rules section. yylex() and
yywrap() are predefined procedures of LEX.
The input notation for the Lex tool is referred to as the Lex language, and the tool
itself is the Lex compiler. The Lex compiler transforms lex.l into a C program, in a
file that is always named lex.yy.c. The latter file is compiled by the C compiler into
a file called a.out.
Figure: lex.l → Lex compiler → lex.yy.c; lex.yy.c → C compiler → a.out
The below program appends line number to the lines of the loaded file:
/* Declaration section */
%{
int yylineno;
%}
/* Rules section */
%%
^(.*)\n printf("%4d\t%s", ++yylineno, yytext);
%%
/* Auxiliary Procedures (a typical completion so the program links on its own) */
int yywrap(void) { return 1; }

int main(int argc, char *argv[]) {
    if (argc > 1)
        yyin = fopen(argv[1], "r");   /* number the lines of the named file */
    yylex();
    return 0;
}
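Assuming the program above is saved as linenum.l (the file name is only an example), it
can typically be generated, compiled, and run as follows:
lex linenum.l          (produces lex.yy.c; flex linenum.l works the same way)
cc lex.yy.c -o linenum
./linenum somefile.c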