Automata and Compiler Design - Lecture Notes On UNIT 1

The document discusses language processing and the key components involved. It begins by explaining how language translators like compilers and interpreters bridge the gap between human and machine languages. It then defines compilers and interpreters, and describes the typical phases in a language processing system including preprocessing, compiling, assembling, linking, and loading. The document also categorizes different types of compilers and discusses the phases a compiler typically proceeds through like lexical analysis, syntax analysis, semantic analysis, code generation, and optimization.


UNIT-I

Introduction to Language Processing

As computers became an inevitable and indigenous part of human life, and several
languages with different and more advanced features evolved to make it easier for the user
to communicate with the machine, the development of translator (mediator) software became
essential to fill the huge gap between human and machine understanding. This process is
called Language Processing, to reflect the goal and intent of the process. To understand it
better, we have to be familiar with some key terms and concepts explained in the following lines.

Language Translator

A language translator is a computer program which translates a program written in one
(source) language into its equivalent program in another (target) language. The source program
is in a high-level language, whereas the target language can be anything from the machine
language of a target machine (from microprocessor to supercomputer) to another high-level language.

Two commonly used translators:

1. Compiler

2. Interpreter

 Compiler

A compiler is a program that reads a program in one language, called the source
language, and translates it into its equivalent program in another language, called the
target language; in addition, it presents error information to the user.
If the target program is an executable machine-language program, it can then be called by
the user to process inputs and produce outputs.

Input → Target Program → Output

Fig: Running the target program

 Interpreter
An interpreter is another common kind of language processor. Instead of producing a
target program as a translation, an interpreter appears to directly execute the
operations specified in the source program on inputs supplied by the user.

Source Program, Input → Interpreter → Output

Fig: Running an interpreter

LANGUAGE PROCESSING SYSTEM

Based on the input the translator takes and the output it produces, a language translator
can be called any one of the following.
Preprocessor

A preprocessor takes the skeletal source program as input and produces an extended version of it,
which results from expanding the macros and manifest constants, if any, and including header
files, etc. For example, the C preprocessor is a macro processor that is used automatically by the
C compiler to transform our source before actual compilation. In addition, a preprocessor
performs the following activities:

- Collects all the modules and files, in case the source program is divided into different
modules stored in different files.

- Expands shorthands/macros into source language statements.

Compiler

A compiler is a translator that takes as input a source program written in a high-level language
and converts it into its equivalent target program in machine language. In addition to the above,
the compiler also:

- Reports to its user the presence of errors in the source program.

- Facilitates the user in rectifying the errors and executing the code.

Assembler

An assembler is a program that takes as input an assembly language program and converts it into
its equivalent machine language code.

Loader/Linker

This is a program that takes as input relocatable code, collects the library functions and
relocatable object files, and produces the equivalent absolute machine code.

Specifically, loading consists of taking the relocatable machine code, altering the
relocatable addresses, and placing the altered instructions and data in memory at the proper
locations.

Linking allows us to make a single program from several files of relocatable machine code.
These files may have been the result of several different compilations, and one or more may be
library routines provided by the system, available to any program that needs them.
In addition to these translators, programs like interpreters, text formatters, etc., may be used
in a language processing system. To translate a program in a high-level language into an
executable one, the compiler performs the compile and linking functions by default.

Normally, the steps in a language processing system include preprocessing the skeletal source
program, which produces an extended or expanded source program, followed by compiling the
result, then linking/loading, and finally executable code is produced. As I said earlier, not
all these steps are mandatory; in some cases the compiler performs the linking and loading
functions implicitly.

Source Program
      ↓
Preprocessor
      ↓
Modified Source Program
      ↓
Compiler
      ↓
Target Assembly Program
      ↓
Assembler
      ↓
Relocatable Machine Code
      ↓
Loader/Linker  ←  Library files, Relocatable Object files
      ↓
Target Machine Code

Fig: Context of a Compiler in a Language Processing System


Types of Compilers

Based on the specific input it takes and the output it produces, compilers can be classified
into the following types:

Traditional Compilers(C, C++, Pascal):

These Compilers convert a source program in a HLL into its equivalent in native machine code
or object code.

Interpreters (LISP, SNOBOL, Java 1.0):

These compilers first convert the source code into an intermediate code, and then interpret
(emulate) it to its equivalent machine code.

Cross-Compilers: These are the compilers that run on one machine and produce code for
another machine.

Incremental Compilers:

These compilers separate the source into user-defined steps, compiling/recompiling step by
step and interpreting the steps in a given order.

Converters (e.g. COBOL to C++):

These programs compile from one high-level language to another.

Just-In-Time (JIT) Compilers (Java, Microsoft .NET):

These are runtime compilers from an intermediate language (byte code, MSIL) to executable
code or native machine code. They perform type-based verification, which makes the
executable code more trustworthy.

Ahead-of-Time (AOT) Compilers (e.g., .NET ngen):

These are the pre-compilers to native code for Java and .NET.
Binary Compilation: These compilers compile object code of one platform into object code
of another platform.

Phases of a Compiler

Due to the complexity of the compilation task, a compiler typically proceeds in a sequence of
compilation phases. The phases communicate with each other via clearly defined interfaces.
Generally an interface contains a data structure (e.g., a tree) and a set of exported functions.
Each phase works on an abstract intermediate representation of the source program, not the
source program text itself (except the first phase).

Compiler Phases are the individual modules which are chronologically executed to perform their
respective Sub-activities, and finally integrate the solutions to give target code.

It is desirable to have relatively few phases, since it takes time to read and write intermediate
files. The following diagram (Figure 1.2) depicts the phases of a compiler through which it goes
during compilation. Therefore a typical compiler has the following phases:

 Lexical Analyzer(Scanner)

 Syntax Analyzer(Parser)

 Semantic Analyzer

 Intermediate Code Generator

 Code Optimizer

 Code Generator

In addition to these, it also has symbol table management and error handler phases. Not all
the phases are mandatory in every compiler, e.g., the Code Optimizer phase is optional in some
cases. The description is given in the next section.

The phases of a compiler are divided into two parts: the first three phases are called the
analysis part, and the remaining three the synthesis part.
Figure: Different Phases of a Compiler
Phase, Passes of a Compiler

In some applications we can have a compiler that is organized into what are called passes,
where a pass is a collection of phases that converts the input into a completely different
representation. Each pass makes a complete scan of the input and produces output to be
processed by the subsequent pass, for example a two-pass assembler.

The Front-End & Back-End of a Compiler

All of these phases of a general Compiler are conceptually divided into The Front-end,
and The Back-end. This division is due to their dependence on either the Source Language or the
Target machine. This model is called an Analysis & Synthesis model of a compiler.
The Front-end of the compiler consists of phases that depend primarily on the Source
language and are largely independent of the target machine. For example, front-end of the
compiler includes Scanner, Parser, Creation of Symbol table, Semantic Analyzer, and the
Intermediate Code Generator.

The back-end of the compiler consists of those phases that depend on the target machine and
are independent of the source language, depending only on the intermediate language. It
includes the various aspects of the code optimization phase and code generation, along with
the necessary error handling and symbol table operations.

Lexical Analyzer (Scanner):

The Scanner is the first phase that works as interface between the compiler and the Source
language program and performs the following functions:

 Reads the characters in the source program and groups them into a stream of tokens, in
which each token specifies a logically cohesive sequence of characters, such as an
identifier, a keyword, a punctuation mark, or a multi-character operator like :=.

 The character sequence forming a token is called a lexeme of the token.

 The Scanner generates a token-id, and also enters the identifier's name in the symbol
table if it doesn't already exist.
 It also removes comments and unnecessary spaces.

The format of the token is

<token name, attribute value>

Syntax Analyzer (Parser):

The Parser interacts with the Scanner, its subsequent phase Semantic Analyzer and performs the
following functions:

 Groups the received and recorded token stream into syntactic structures, usually into a
structure called a parse tree whose leaves are tokens.

 The interior nodes of this tree represent streams of tokens that logically belong
together.

 That is, it checks the syntax of the program elements.

Semantic Analyzer:

This phase receives the syntax tree as input and checks the semantic correctness of the
program. Though the tokens may be valid and syntactically correct, they may still be incorrect
semantically. Therefore the semantic analyzer checks the semantics (meaning) of the
statements formed.

 The syntactically and semantically corrected structures are produced here in the form of a
syntax tree, a DAG, or some other sequential representation such as a matrix.

Intermediate Code Generator:

This phase takes the syntactically and semantically correct structure as input, and produces its
equivalent intermediate notation of the source program. The intermediate code should have two
important properties: it should be easy to produce, and easy to translate into the target
program. Example intermediate code forms are three-address code, Polish notation, etc.
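For example, the assignment position = initial + rate * 60 (the statement traced later in these notes) could be rendered in three-address code along these lines:

```
t1 = inttofloat(60)
t2 = rate * t1
t3 = initial + t2
position = t3
```

Each three-address instruction has at most one operator on the right-hand side, so the compiler-generated temporaries t1, t2, t3 make the evaluation order explicit.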

Code Optimizer:

This phase is optional in some compilers, but it is very useful and beneficial in terms of saving
development time, effort, and cost. This phase performs the following specific functions:

 Attempts to improve the intermediate code so as to obtain faster machine code. Typical
functions include loop optimization, removal of redundant computations, strength reduction,
frequency reduction, etc.

 Sometimes the data structures used in representing the intermediate forms may also be
changed.

Code Generator:

This is the final phase of the compiler; it generates the target code, normally consisting of
relocatable machine code, assembly code, or absolute machine code.

 Memory locations are selected for each variable used, and assignment of variables to
registers is done.

 Intermediate instructions are translated into a sequence of machine instructions.
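Continuing the same example, the three-address code for position = initial + rate * 60 might end up as target code of roughly this shape (the registers and mnemonics are illustrative, not a real instruction set):

```
LDF  R2, rate        // load rate into register R2
MULF R2, R2, #60.0   // R2 = rate * 60.0
LDF  R1, initial     // load initial into register R1
ADDF R1, R1, R2      // R1 = initial + rate * 60.0
STF  position, R1    // store the result into position
```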


The Compiler also performs the Symbol table management and Error handling throughout the
compilation process. Symbol table is nothing but a data structure that stores different source
language constructs, and tokens generated during the compilation. These two interact with all
phases of the Compiler.

For example, suppose the source program is an assignment statement; the following figure shows
how the phases of the compiler process the program.
The input source program is position = initial + rate * 60.

Fig: Translation of an assignment Statement


LEXICAL ANALYSIS

As the first phase of a compiler, the main task of the lexical analyzer is to read the
input characters of the source program, group them into lexemes, and produce as output a
sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the
parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table
as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to
enter that lexeme into the symbol table. This process is shown in the figure below.

Fig: Lexical Analyzer

When the lexical analyzer identifies the first token it sends it to the parser; the parser
receives the token and calls the lexical analyzer for the next token by issuing the getNextToken
command. This process continues until the lexical analyzer has identified all the tokens. During
this process the lexical analyzer ignores white space and comment lines.

Tokens, Patterns and Lexemes

 A token is a pair consisting of a token name and an optional attribute value.


The token name is an abstract symbol representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input characters denoting an identifier. The
token names are the input symbols that the parser processes. In what follows, we
shall generally write the name of a token in boldface. We will often refer to a
token by its token name.

 A pattern is a description of the form that the lexemes of a token may take. In the
case of a keyword as a token, the pattern is just the sequence of characters that
form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.
 A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of
that token.

Example: In the C statement

printf("Total = %d\n", score);

both printf and score are lexemes matching the pattern for token id, and "Total = %d\n"
is a lexeme matching the pattern for token literal.

Fig: Examples of Tokens

Lexical Analysis versus parsing

There are a number of reasons why the analysis portion of a compiler is normally separated into
lexical analysis and parsing (syntax analysis) phases.

 Simplicity of design is the most important consideration. The separation of lexical and
syntactic analysis often allows us to simplify at least one of these tasks. For example, a
parser that had to deal with comments and whitespace as syntactic units would be
considerably more complex than one that can assume comments and whitespace have
already been removed by the lexical analyzer.

 Compiler efficiency is improved. A separate lexical analyzer allows us to apply


specialized techniques that serve only the lexical task, not the job of parsing. In addition,
specialized buffering techniques for reading input characters can speed up the compiler
significantly.
 Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to
the lexical analyzer.

Input Buffering

Before discussing the problem of recognizing lexemes in the input, let us examine
some ways that the simple but important task of reading the source program can be sped up.
This task is made difficult by the fact that we often have to look one or more characters beyond
the next lexeme before we can be sure we have the right lexeme. There are many situations
where we need to look at least one additional character ahead. For instance, we cannot be sure
we've seen the end of an identifier until we see a character that is not a letter or digit, and
therefore is not part of the lexeme for id. In C, single-character operators like -, =, or <
could also be the beginning of a two-character operator like ->, ==, or <=. Thus, we shall
introduce a two-buffer scheme that handles large lookaheads safely. We then consider an
improvement involving "sentinels" that saves time checking for the ends of buffers.

Buffer Pairs

Because of the amount of time taken to process characters and the large number of characters
that must be processed during the compilation of a large source program, specialized buffering
techniques have been developed to reduce the amount of overhead required to process a single
input character. An important scheme involves two buffers that are alternately reloaded.

Fig: Using a Pair of Input Buffers

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096
bytes. Using one system read command we can read N characters into a buffer, rather than
using one system call per character. If fewer than N characters remain in the input file, then a
special character, represented by eof, marks the end of the source file and is different from any
possible character of the source program.
 Two pointers to the input are maintained:

1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent
we are attempting to determine.

2. Pointer forward scans ahead until a pattern match is found; the exact strategy
whereby this determination is made will be covered in the balance of this chapter.

Once the next lexeme is determined, forward is set to the character at its right end. Then,
after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin
is set to the character immediately after the lexeme just found. In the figure, we see forward has
passed the end of the next lexeme, ** (the FORTRAN exponentiation operator), and must be
retracted one position to its left.

Advancing forward requires that we first test whether we have reached the end of one
of the buffers, and if so, we must reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the
actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater
than N, we shall never overwrite the lexeme in its buffer before determining it.

Sentinels

If we use the above scheme as described, we must check, each time we advance forward,
that we have not moved off one of the buffers; if we do, then we must also reload the other
buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one
to determine what character is read (the latter may be a multi way branch). We can combine the
buffer-end test with the test for the current character if we extend each buffer to hold a sentinel
character at the end. The sentinel is a special character that cannot be part of the source program,
and a natural choice is the character eof. Figure 3.4 shows the same arrangement as Fig. 3.3, but
with the sentinels added. Note that eof retains its use as a marker for the end of the entire input.
Any eof that appears other than at the end of a buffer means that the input is at an end. Figure 3.5
summarizes the algorithm for advancing forward. Notice how the first test, which can be part of
a multiway branch based on the character pointed to by forward, is the only test we make, except
in the case where we actually are at the end of a buffer or the end of the input.

Fig: Sentinels at the end of each buffer

switch ( *forward++ ) {
    case eof:
        if ( forward is at end of first buffer ) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if ( forward is at end of second buffer ) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
    /* cases for the other characters */
}

Fig: Lookahead code with sentinels


Specification of tokens

Regular expressions are an important notation for specifying lexeme patterns. While they cannot express
all possible patterns, they are very effective in specifying those types of patterns that we actually need for
tokens.

LEX the Lexical Analyzer generator

Lex is a tool used to generate a lexical analyzer. The input notation for the Lex tool is
referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the
Lex compiler transforms the input patterns into a transition diagram and generates code in a
file called lex.yy.c; this is a C program which, given to the C compiler, yields the object code.
Here we need to know how to write the Lex language. The structure of a Lex program is given below.

Structure of LEX Program

A Lex program has the following form:

declarations

%%

translation rules

%%

auxiliary functions

The declarations section includes declarations of variables, manifest constants (identifiers
declared to stand for a constant, e.g., the name of a token), and regular definitions. C
declarations to be copied into the generated program appear between %{ . . . %}.

The translation rules each have the form

Pattern {Action}

The auxiliary functions section contains procedures, for example to install identifiers and
numbers in the symbol table.
LEX Program Example:

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */

delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
"<"       {yylval = LT; return(RELOP);}
"<="      {yylval = LE; return(RELOP);}
"="       {yylval = EQ; return(RELOP);}
"<>"      {yylval = NE; return(RELOP);}
">"       {yylval = GT; return(RELOP);}
">="      {yylval = GE; return(RELOP);}

%%

int installID() {/* function to install the lexeme, whose first character is
                    pointed to by yytext, and whose length is yyleng, into the
                    symbol table and return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical constants into
                     a separate table */
}

Fig: Lex program for tokens

Important Questions:

1. Explain the Compiler phases?

2. Explain the Lexical analyzer?

3. Write short notes on tokens, pattern and lexemes?

4. Write Short notes on Input buffering?

5. Explain LEX tool?

Assignment Questions:

1. Write the differences between compilers and interpreters?

2. Write short notes on token recognition?

3. Write the Applications of the Finite Automata?

4. Explain How Finite automata is useful in the lexical analysis?

5. Explain DFA and NFA with an Example?

Multiple Choices:

1. The action of parsing the source program into proper syntactic classes is known
as________________.
a. Lexical analysis
b. Syntax analysis
c. Interpretation analysis
d. Parsing
2. The output of _____________ is absolute machine code.
a. Preprocessor
b. Loader linkeditor
c. Assembler
d. Compiler

3. Intermediate code generator is ________________ phase in compiler design.


a. second
b. first
c. fourth
d. third

4. The total number of phases in compiler design is______________.


a. 4
b. 5
c. 3
d. 6
5. Semantic analysis is related to ____________________phase.
a. Neither analysis nor synthesis
b. Analysis and synthesis
c. Analysis
d. Synthesis

6. System program such as compiler are designed so that they are________________.


a. Re-enterable
b. Serially usable
c. Recursive
d. Non reusable

7. _______________ is related to the synthesis phase.


a. Syntax analysis
b. Code generation
c. Lexical analysis
d. Semantic analysis
8. Parsing in compiler design is________________ phase.
a. Second
b. Fourth
c. First
d. Third

9. The input to________________is a target assembly program.


a. Assembler
b. Preprocessor
c. Loader link editor
d. Compiler

10. Parsing in compiler design is________________ phase.


a. Second
b. Fourth
c. First
d. Third

11. The input to________________is a target assembly program.


a. Assembler
b. Preprocessor
c. Loader link editor
d. Compiler

12. In lexical analysis the original string that comprises the token is called a
_______________.
A. Pass
B. LEX
C. Lexeme
D. Phase

13. In an incompletely specified automaton_________________.


A. Start state may not be there
B. From any given state there can't be any token leading to two different states
C. Some states have transitions on some tokens
D. No edge should be labeled epsilon
14. Machine independent synthesis phase is __________________.
A. Code generator
B. Intermediate code generation
C. Lexical analysis
D. Syntax analysis

15. An interpreter is preferred to a compiler because_________________________.


A. Debugging can be slower
B. It is very helpful in the initial stages of program development.
C. It takes less time to execute
D. It needs less computer resources

16. In C language a void pointer is capable of storing ___________________ type.


A. Int only
B. Char only
C. Float only
D. Any type

17. Pick the odd man out


A. Fortran
B. Pascal
C. C
D. Lisp

18. _____________ symbol table implementation makes efficient use of memory.


A. Self organizing list
B. Search tree
C. List
D. Hash table

19. The cost of developing a compiler______________.


A. Is inversely proportional to the complexity of the architecture of the target machine.
B. Is inversely proportional to the flexibility of the available instruction set
C. Is inversely proportional to the complexity of the source language
D. Is proportional to the complexity of the source language
20. FORTRAN is a_______________.
A. Regular language
B. Turing language
C. Context sensitive language
D. Context free language

21. Which of the following translation program converts assembly language programs to
object program
A. Assembler
B. Compiler
C. Linker
D. Preprocessor

22. The output of lexical analysis is________________


A. A sequence of patterns
B. A sequence of lexemes
C. A sequence of tokens
D. A sequence of characters

23. Which of the following is not related to analysis phase________________.


A. Semantic analysis
B. Code generation
C. Syntax analysis
D. Lexical analysis

24. ________________is not related to synthesis phase.


A. Optimization
B. Code generation
C. Lexical analysis
D. Intermediate code generation

25. Transition diagrams use ________________________ notation to represent a state.


A. Rectangles
B. Ellipses
C. Triangles
D. Circles
Answers:

1) A 2) B 3)C 4) d 5) c 6) a 7) b 8) a 9) a 10 ) a 11) a 12) c 13) c 14) b 15)a 16)d

17) a 18) d 19) a 20) b 21) a 22) c 23) b 24) c 25) d

Fill in the Blanks:

1. A __________ is a program which performs translation from a HLL into the machine


language of a computer
2. A ________ is a program that reads a program written in one language (source) and
translates it into its equivalent program in another language
3. In a _________ each node represents an operator
4. In a syntax tree the children of a node represent the _______ of the operation
5. ___________ breaks up the source program into constituent pieces and creates an
intermediate representation
6. ____________ constructs the desired target program from the intermediate representation
7. __________ reads the source program from left to right and groups it into tokens
8. ________ is a sequence of characters having a collective meaning
9. In _________ tokens are grouped hierarchically into nested collections with collective
meaning
10. In __________ certain checks are performed to ensure that the components of a program
fit together meaningfully
11. _________ is a data structure that contains a record for each identifier
12. The character sequence forming a token is called the _________ for the token
13. The __________ of the compiler includes those phases which are dependent on the source
language and independent of the target machine
14. The _________ of the compiler is dependent on the target language and independent of the
source language
15. A _____ means one complete scan of the source program
16. ____ and ___ languages permit one-pass compilation
17. ____ and _____ are the languages whose structure requires that a compiler have at least
two passes
18. ________ is a concept of obtaining a compiler for a language by using a compiler written
in the same language
19. Using the facilities offered by a language to compile itself is called _________
20. A compiler which runs on one machine and produces target code for another machine is
called ______
21. ________ represents patterns of strings of characters
22. ________ is a tool that has been widely used to specify lexical analyzers for a variety
of languages
23. A Lex program consists of three parts: _______, ________ and ________
24. Formally a finite automaton is a five tuple given as __________________
25. Tokens are recognized by ________
26. Finite automata are used in the ________ phase of the compiler

Answers:

1. Translator 2. Compiler 3. Syntax tree 4. Arguments 5. Analysis part 6. Synthesis


part 7. Lexical analysis 8. Token 9. Syntax analysis 10. Semantic analysis 11. Symbol
table 12. Lexeme 13. Front end 14. Back end 15. Pass 16. PASCAL, C 17. Modula-2,
C++ 18. Bootstrapping 19. Bootstrapping 20. Cross compiler 21. Regular expressions
22. Lex 23. Declarations, translation rules, auxiliary procedures 24. M= (Q, ∑,
δ,q0,F) 25. Regular expressions 26. lexical analysis
