Compiler Design Notes
A compiler operates in various phases; each phase transforms the source program from one representation to another.
Every phase takes its input from the previous phase and feeds its output to the next phase of the compiler.
There are six phases in a compiler, and each of them helps convert the high-level language into machine code. The
phases of a compiler are:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Code Generation
For example, in lexical analysis the statement x = y + 10 is broken into the following tokens:
x → identifier
= → assignment operator
y → identifier
+ → addition operator
10 → number
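As a minimal sketch (the type and field names below are illustrative, not from any particular compiler), a lexer can emit each token as a record pairing a token class with its lexeme:

#include <stdio.h>

/* Illustrative token classes for the statement x = y + 10 */
typedef enum { TOK_IDENTIFIER, TOK_ASSIGN, TOK_PLUS, TOK_NUMBER } TokenType;

typedef struct {
    TokenType type;     /* the token class                   */
    const char *lexeme; /* the matched source text, e.g. "x" */
} Token;

int main(void) {
    /* Token stream a lexer would produce for: x = y + 10 */
    Token tokens[] = {
        { TOK_IDENTIFIER, "x"  },
        { TOK_ASSIGN,     "="  },
        { TOK_IDENTIFIER, "y"  },
        { TOK_PLUS,       "+"  },
        { TOK_NUMBER,     "10" },
    };
    int n = sizeof tokens / sizeof tokens[0];
    for (int i = 0; i < n; i++)
        printf("<%d, %s>\n", tokens[i].type, tokens[i].lexeme);
    return 0;
}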
In a parse tree:
Interior node: a record with an operator field and two fields for the children
Leaf: a record with two or more fields; one for the token and the others for information about the token
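A hedged C sketch of these two kinds of records (the field names are illustrative):

/* Interior node: an operator field and two fields for the children. */
typedef struct Node {
    char op;            /* the operator, e.g. '+' or '='        */
    struct Node *left;  /* first child                          */
    struct Node *right; /* second child                         */
} Node;

/* Leaf: two or more fields; one for the token itself and the rest
 * for information about the token (e.g. a symbol-table index). */
typedef struct {
    int token;          /* token class                          */
    int attribute;      /* extra information, e.g. symtab index */
} Leaf;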
Phase 3: Semantic Analysis
Semantic analysis checks the semantic consistency of the code. It uses the syntax tree of the previous phase along with
the symbol table to verify that the given source code is semantically consistent. It also checks whether the code is
conveying an appropriate meaning.
The semantic analyzer will check for type mismatches, incompatible operands, functions called with improper
arguments, undeclared variables, etc.
The functions of the semantic analysis phase are:
Ensures that the components of the program fit together meaningfully
Stores the gathered type information in the symbol table or the syntax tree
Allows you to perform type checking
Reports a semantic error when a type mismatch occurs and no type-conversion rule satisfies the desired operation
Collects type information and checks for type compatibility
Checks whether the operands are permitted by the source language
Example
float x = 20.2;
float y = x*30;
In the above code, the semantic analyzer will convert the integer 30 to the float 30.0 before the multiplication.
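A minimal sketch of how an analyzer might insert that conversion, assuming an illustrative syntax-tree representation (none of these names come from a real compiler):

#include <stdlib.h>

typedef enum { TYPE_INT, TYPE_FLOAT } Type;
typedef enum { EXPR_LEAF, EXPR_INT_TO_FLOAT } ExprKind;

typedef struct Expr {
    ExprKind kind;
    Type type;
    struct Expr *child;   /* operand of a conversion node */
} Expr;

/* If the operand is int-typed, wrap it in an int_to_float node
 * so that both operands of the multiplication have type float. */
Expr *coerce_to_float(Expr *e) {
    if (e->type == TYPE_INT) {
        Expr *conv = malloc(sizeof *conv);
        conv->kind = EXPR_INT_TO_FLOAT;
        conv->type = TYPE_FLOAT;
        conv->child = e;
        return conv;
    }
    return e;   /* already float: no coercion needed */
}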
Phase 4: Intermediate Code Generation
Once the semantic analysis phase is over, the compiler generates intermediate code for the target machine. The
intermediate code represents a program for some abstract machine.
Intermediate code sits between the high-level language and the machine language. It needs to be generated in a
manner that makes it easy to translate into the target machine code.
Functions of intermediate code generation:
It should be generated from the semantic representation of the source program
Holds the values computed during the process of translation
Helps you to translate the intermediate code into the target language
Allows you to maintain the precedence ordering of the source language
It holds the correct number of operands for each instruction
Example
total = count + rate * 5
The intermediate code, using the three-address code method, is:
t1 := int_to_float(5)
t2 := rate * t1
t3 := count + t2
total := t3
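One common in-memory representation for three-address code is the quadruple: an operator, up to two operands, and a result field. A hedged sketch (the field names are illustrative), with each row corresponding to one line of the code above:

/* A three-address instruction stored as a quadruple. */
typedef struct {
    const char *op;     /* e.g. "*", "+", "int_to_float", ":=" */
    const char *arg1;   /* first operand                       */
    const char *arg2;   /* second operand (may be empty)       */
    const char *result; /* destination temporary or variable   */
} Quad;

/* total = count + rate * 5 as a quadruple sequence: */
Quad code[] = {
    { "int_to_float", "5",     "",   "t1"    },
    { "*",            "rate",  "t1", "t2"    },
    { "+",            "count", "t2", "t3"    },
    { ":=",           "t3",    "",   "total" },
};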
Phase 5: Code Optimization
The next phase is optimization of the intermediate code. This phase removes unnecessary lines of code and arranges
the sequence of statements to speed up the execution of the program without wasting resources. The main goal of this
phase is to improve on the intermediate code to generate code that runs faster and occupies less space.
The primary functions of this phase are:
It helps you to establish a trade-off between execution and compilation speed
Improves the running time of the target program
Generates streamlined code still in intermediate representation
Removing unreachable code and getting rid of unused variables
Moving statements whose values are not altered inside the loop (loop-invariant statements) out of the loop
Example:
Consider the following code
a = intofloat(10)
b = c * a
d = e + b
f = d
After folding intofloat(10) to the constant 10.0, propagating it into b, propagating d into f, and deleting the
now-dead assignments to a and d, it can become
b = c * 10.0
f = e + b
Phase 6: Code Generation
Code generation is the last and final phase of a compiler. It takes its input from the code optimization phase and
produces the target code or object code as a result. The objective of this phase is to allocate storage and generate
relocatable machine code.
It also allocates memory locations for the variables. The instructions in the intermediate code are converted into
machine instructions. This phase converts the optimized intermediate code into the target language.
The target language is the machine code. Therefore, all the memory locations and registers are selected and
allotted during this phase. The code generated by this phase is executed to take inputs and generate the expected
outputs.
Example
a = b + 60.0
would possibly be translated into register-based code such as:
MOVF b, R1
ADDF #60.0, R1
MOVF R1, a
Symbol Table Management
A symbol table contains a record for each identifier with fields for the attributes of the identifier. This data
structure lets the compiler find the record for each identifier quickly and store or retrieve data from that record
quickly.
Why two-buffer scheme is used in lexical analysis? Elaborate input buffering strategy, used in lexical analysis phase.
Lexical analysis, the first phase of a compiler, breaks down the source code into meaningful units called tokens
(keywords, identifiers, operators, etc.). During this process, the lexical analyzer needs to look ahead at the input
stream to correctly identify tokens. However, relying on a single buffer can lead to inefficiencies when dealing with
long lexemes (tokens) that might span across the buffer boundary.
The two-buffer scheme addresses this by utilizing two buffers of a fixed size (typically the same size):
1. Active Buffer: This buffer currently holds the input characters being scanned.
2. Inactive Buffer: This buffer is filled with the next chunk of input characters in anticipation of needing more
data.
The lexical analyzer works as follows:
It scans the active buffer for a complete token using patterns or rules.
If the potential token reaches the end of the active buffer, the following steps occur:
o The inactive buffer becomes the active buffer.
o Scanning resumes in the newly active buffer, continuing the token identification process.
What is Input Buffering in compiler design?
To identify tokens, lexical analysis would otherwise have to visit secondary memory for each character, which takes a
long time and is costly. As a result, the input strings are buffered before being examined by lexical analysis.
Lexical analysis reads the input string one character at a time from left to right to detect tokens. To scan tokens, it
employs two pointers.
The Begin Pointer (bptr) points to the start of the string to be read.
The Look Ahead Pointer (lptr) moves ahead until it finds the end of the token.
Example: For the statement int a, b;
Both pointers begin at the start of the string that is saved in the buffer.
The Look Ahead Pointer examines the buffer until it finds the token.
Before the token ("int") can be identified, the character ("blank space") beyond the token ("int") must be
checked.
Both pointers will be set to the next token ('a') after processing the token ("int"), and this procedure will be
continued throughout the program.
A buffer can be divided into two halves. If the look-ahead pointer moves past the halfway point, the
second half is filled with fresh characters to read. If the look-ahead pointer reaches the right end of the second
half, the first half is filled with new characters, and so on.
Sentinels − Each time the forward pointer is advanced, a check must be made to ensure that one half of the
buffer has not been moved off; if it has, the other half must be reloaded. A sentinel (a special character, such
as eof, placed at the end of each half) reduces this check to a single comparison (a C sketch follows this list).
Buffer Pairs
Specialized buffering techniques decrease the overhead required to process an input character.
The scheme consists of two buffers, each of N-character size, which are reloaded alternately.
There are two pointers: lexemeBegin and forward.
lexemeBegin marks the start of the current lexeme, whose extent is yet to be determined.
forward scans ahead until it finds a match for a pattern.
When a lexeme is discovered, forward is set to the character at its right end, and once the lexeme is
processed, lexemeBegin is set to the character immediately after it.
The collection of characters between the two pointers is the current lexeme.
Preliminary Scanning − Pre-processing the character stream being subjected to lexical analysis saves the
trouble of moving the look ahead pointer back and forth over a string of blanks.
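Putting the pieces together, here is a hedged C sketch of buffer pairs with sentinels; the buffer size N, the reload helper, and the eof handling are simplifications for illustration:

#include <stdio.h>

#define N 4096                       /* size of each buffer half */

static char buf[2 * N + 2];          /* two halves + one sentinel slot each */
static char *forward = buf;          /* the forward (look-ahead) pointer    */

/* Read up to N characters into the given half and place an EOF
 * sentinel immediately after the last character actually read.
 * (A real scanner must also handle input bytes equal to EOF.) */
static void reload(char *half, FILE *in) {
    size_t n = fread(half, 1, N, in);
    half[n] = (char)EOF;             /* sentinel */
}

/* Advance forward by one character.  A sentinel at a half boundary
 * triggers a reload of the other half; a sentinel anywhere else
 * marks the true end of the input. */
static int next_char(FILE *in) {
    for (;;) {
        char c = *forward++;
        if (c != (char)EOF)
            return (unsigned char)c;
        if (forward == buf + N + 1) {            /* end of first half  */
            reload(buf + N + 1, in);             /* refill second half */
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 2) { /* end of second half */
            reload(buf, in);                     /* refill first half  */
            forward = buf;
        } else {
            return EOF;                          /* true end of input  */
        }
    }
}

Before scanning begins, reload(buf, in) must be called once to fill the first half; the lexemeBegin pointer and the retraction logic are omitted for brevity.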
List the cousins of compiler and explain the role of any one of them.
Preprocessor
The preprocessor is one of the cousins of the Compiler. It is a program that performs
preprocessing. It performs processing on the given data and produces an output. The output
generated is used as an input for some other program.
The preprocessor increases the readability of the code by replacing a complex expression with a
simpler one by using a macro.
A preprocessor performs multiple types of functionality and operations on the data.
Some of them are-
Macro processing
Macro processing is mapping the input to output data based on a certain set of rules and defined
processes. These rules are known as macros.
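For instance, in C the preprocessor expands macros purely textually before the compiler proper runs:

#include <stdio.h>

/* A macro: the preprocessor replaces every use of SQUARE(x)
 * with ((x) * (x)) before compilation proper begins. */
#define SQUARE(x) ((x) * (x))

int main(void) {
    printf("%d\n", SQUARE(5));  /* expands to ((5) * (5)) -> prints 25 */
    return 0;
}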
Rational Preprocessors
Rational preprocessors augment older languages with modern flow-of-control and data-structuring
facilities.
File Inclusion
The preprocessor is also used to include header files in the program text. A header file is a text file
included in our source program file during compilation. When the preprocessor finds an #include
directive in the program, it replaces it with the entire content of the specified header file.
Language extension
Language extension is used to add new capabilities to the existing language. This is done by
including certain libraries in our program, which provides extra functionality. An example of this is
Equel, a database query language embedded in C.
Error Detection
Some preprocessors are capable of performing error checking on the source code that is given as
input to them. For example, they can check whether the header files are included properly and whether the
macros are defined correctly.
Conditional Compilation
Certain preprocessors are capable of including or excluding certain pieces of code based on the
result of a condition. They provide more flexibility to the programmers for writing the code as they
allow the programmers to include or exclude certain features of the program based upon some
condition.
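In C, for example, the preprocessor keeps or drops code depending on a condition evaluated before compilation:

#include <stdio.h>

#define DEBUG 1   /* set to 0 to exclude the debugging code entirely */

int main(void) {
#if DEBUG
    /* This line survives preprocessing only when DEBUG is non-zero;
     * otherwise the compiler never even sees it. */
    printf("debug build\n");
#else
    printf("release build\n");
#endif
    return 0;
}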
Assembler
Assembler is also one of the cousins of the compiler. A compiler takes the preprocessed code and
then converts it into assembly code. This assembly code is given as input to the assembler, and
the assembler converts it into the machine code. Assembler comes into effect in the compilation
process after the Compiler has finished its job.
There are two types of assemblers-
One-Pass assembler: They go through the source code (output of Compiler) only
once and assume that all symbols will be defined before any instruction that references
them.
Two-Pass assembler: Two-pass assemblers work by creating a symbol table with the
symbols and their values in the first pass, and then using the symbol table in a second
pass, they generate code.
Linker
Linker takes the output produced by the assembler as input and combines them to create an
executable file. It merges two or more object files that might be created by different assemblers
and creates a link between them. It also appends all the libraries that will be required for the
execution of the file. A linker's primary function is to search and find referred modules in a program
and establish the memory address where these codes will be loaded.
Multiple tasks that can be performed by linkers include-
Library Management: Linkers can be used to add external libraries to our code to add
additional functionalities. By adding those libraries, our code can now use the functions
defined in those libraries.
Code Optimization: Linkers are also used to optimize the code generated by the
compiler by reducing the code size and increasing the program's performance.
Memory Management: Linkers are also responsible for managing the memory
requirement of the executable code. It allocates the memory to the variables used in
the program and ensures they have a consistent memory location when the code is
executed.
Symbol Resolution: Linkers link multiple object files, and a symbol can be redefined
in multiple files, giving rise to a conflict. The linker resolves these conflicts by choosing
one definition to use.
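As an illustration (the file names are hypothetical), it is the linker that binds the extern reference in main.c to the definition in util.c:

/* util.c — provides the definition of the symbol `answer` */
int answer = 42;

/* main.c — contains only a reference; the linker resolves it
 * against the definition found in util.o at link time. */
#include <stdio.h>

extern int answer;   /* declared here, defined in util.c */

int main(void) {
    printf("%d\n", answer);
    return 0;
}

Compiling with cc main.c util.c produces one executable in which the reference and the definition share a single address.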
Loader
The loader works after the linker has performed its task and created the executable code. It takes
the input of executable files generated from the linker, loads it to the main memory, and prepares
this loaded code for execution by a computer. It also allocates memory space to the program. The
loader is also responsible for the execution of programs by allocating RAM to the program and
initializing specific registers.
Relocation: The loader adjusts the memory addresses of the program to relocate its
location in memory.
Symbol Resolution: The loader is used to resolve the symbols not defined directly in
the program. They do this by looking for the definition of that symbol in a library linked
to the executable file.
Dynamic Linking: The loader dynamically links the libraries into the executable file at
runtime to add additional functionality to our program.
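On POSIX systems, for example, a program can ask the loader to link a library at runtime through the dlopen/dlsym interface (the library name below is Linux-specific):

#include <stdio.h>
#include <dlfcn.h>   /* dlopen, dlsym, dlclose */

int main(void) {
    /* Ask the dynamic loader to map the math library into memory. */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* Resolve the symbol "cos" inside the freshly loaded library. */
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (cosine)
        printf("cos(0.0) = %f\n", cosine(0.0));

    dlclose(handle);   /* unmap the library again */
    return 0;
}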
Define Top-Down parsing. What are the key problems with Top-Down parsing?
In top-down parsing, the parse tree is generated from top to bottom, i.e., from the root to the leaves, and expanded
until all the leaves are generated.
The parse tree is rooted at the starting symbol of the grammar. The parser starts the derivation from the start
symbol of the grammar and performs a leftmost derivation at each step.
Drawback of Top-Down Parsing
Top-down parsing tries to identify the left-most derivation for an input string ω, which is equivalent to generating
a parse tree for the input string ω that starts from the root and produces the nodes in a pre-defined order.
The reason that top-down parsing follows the left-most derivation for an input string ω, and not the right-most
derivation, is that the parser scans the input string ω from left to right, one symbol/token at a time. The
left-most derivation generates the leaves of the parse tree in left-to-right order, which matches the input
scan order.
In top-down parsing, each terminal symbol produced by the predicted production of the grammar is compared
with the input symbol pointed to by the string marker. If the match is successful, the parser can continue.
If a mismatch occurs, the prediction has gone wrong.
At this point it is essential to reject the previous prediction. The prediction that led to the mismatching
terminal symbol is rejected, and the string marker (pointer) is reset to the position it had when the rejected
production was made. This is known as backtracking.
Backtracking is the major drawback of top-down parsing.
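A minimal C sketch of backtracking, using the illustrative grammar S → c A d, A → a b | a (this grammar is a textbook-style example, not from the notes above):

#include <stdio.h>

static const char *input;   /* string being parsed         */
static int pos;             /* the string marker (pointer) */

static int match(char c) {
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

/* A -> a b | a : try the first alternative; on mismatch,
 * reset the marker (backtrack) and try the second. */
static int parse_A(void) {
    int save = pos;                 /* remember the marker position */
    if (match('a') && match('b')) return 1;
    pos = save;                     /* backtrack */
    return match('a');
}

/* S -> c A d */
static int parse_S(void) {
    return match('c') && parse_A() && match('d');
}

int main(void) {
    input = "cad";
    pos = 0;
    if (parse_S() && input[pos] == '\0')
        printf("accepted\n");       /* backtracking rescued A -> a */
    else
        printf("rejected\n");
    return 0;
}

For the input cad, the first alternative A → a b fails on d, the marker is reset, and A → a succeeds; this re-scanning is exactly the cost that deterministic parsers avoid.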
Solution-
Backtracking can be avoided by parsing deterministically, for example with a shift-reduce parser. The trace below
parses the input int id , id ; with the grammar S → T L ;  T → int  L → L , id | id:
Stack        Input              Action
$            int id , id ; $    Shift
$ int        id , id ; $        Reduce T → int
$ T          id , id ; $        Shift
$ T id       , id ; $           Reduce L → id
$ T L        , id ; $           Shift
$ T L ,      id ; $             Shift
$ T L , id   ; $                Reduce L → L , id
$ T L        ; $                Shift
$ T L ;      $                  Reduce S → T L ;
$ S          $                  Accept
High-Level Languages (HLLs):
Human-Readable: HLLs use syntax that resembles natural language or mathematical notation, making them
easier for programmers to understand and write compared to low-level languages (machine code, assembly).
Abstraction: HLLs provide a layer of abstraction from the underlying hardware details. Programmers don't
need to worry about memory management, register usage, or specific machine instructions.
Platform Independence: HLL code is generally portable across different hardware platforms with the help of
compilers or interpreters. The code itself is written independently of the target machine.
Factors Affecting "Purity" of HLLs:
Side Effects: Some HLLs allow for side effects, meaning they can modify data outside the current function or
scope, potentially impacting program behavior. Examples of side effects include modifying global variables,
reading/writing to files, or interacting with external devices. Languages that emphasize minimizing side
effects might be considered "purer."
Memory Management: High-level languages might handle memory management automatically (garbage
collection) or require manual memory allocation/deallocation by the programmer. Languages with automatic
memory management can be considered "purer" from the perspective of programmer convenience and
reducing potential errors.
Low-Level Access: Some HLLs offer mechanisms to access low-level hardware features or perform
operations closer to the machine. This can be useful for performance optimization in certain scenarios, but it
introduces some dependency on the underlying architecture and reduces portability. Languages that minimize
low-level access might be considered "purer."
The criteria for code optimization are:
The optimization must be correct; it must not, in any way, change the meaning of the program.
Optimization should increase the speed and performance of the program.
The compilation time must be kept reasonable.
The optimization process should not delay the overall compilation process.