Compiler Lecture-1
TAHSINA HASHEM
Lecturer
CSE,JU
TEXTBOOK
Compilers: Principles, Techniques, and Tools
Aho, Lam, Sethi, Ullman
Modern Compiler Implementation in C (The Tiger Book).
Andrew W. Appel
GRADING POLICY
Attendance 10%
Assignment 5%
Class Test 20% (Best two of three)
(includes exercises and lecture materials)
Surprise test 5%
(at the end or starting of lecture)
==========================================================
Total 40%
Final Exam 60%
(includes exercises and lecture materials)
LANGUAGE PROCESSOR
A computer understands instructions only in machine code, i.e. in the form of 0s and 1s. Writing a
computer program directly in machine code is a tedious task.
Programs are mostly written in high-level languages such as Java, C++, and Python; such programs are
called source code. Source code cannot be executed directly by the computer and must be
converted into machine language to be executed.
Hence, a special piece of translator system software is used to translate a program written in a high-level
language into machine code. This translator is called a language processor, and the program after translation into
machine code is called the object program (object code).
The language processors can be any of the following three types:
1. Compiler
2. Interpreter
3. Assembler
LANGUAGE PROCESSOR: COMPILER
A compiler is a program that reads the whole program written in a source
language (a high-level language) in one go and translates it into an
equivalent program in a target language.
LANGUAGE PROCESSOR: INTERPRETER
An interpreter is another common kind of language processor.
Instead of producing a target program as a translation, an interpreter appears to
directly execute the operations specified in the source program on inputs
supplied by the user.
[Figure: the interpreter takes the source program and the input, and produces the output and error messages]
COMPILER VS. INTERPRETER
HYBRID COMPILER
Java language processors combine compilation and interpretation.
A Java source program may first be compiled into an intermediate
form called bytecodes.
The bytecodes are then interpreted by a virtual machine.
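The same compile-then-interpret split can be observed in Python, which is used here only as a stand-in for Java because its standard library exposes both steps. This is a sketch of the idea, not of how the Java toolchain works; the names `source`, `code_object`, and `namespace` are illustrative.

```python
import dis

source = "total = 2 + 3 * 4"

# Step 1 (compilation): translate source text into a bytecode object.
code_object = compile(source, "<demo>", "exec")

# Step 2 (interpretation): the Python virtual machine executes the bytecode.
namespace = {}
exec(code_object, namespace)
print(namespace["total"])

# Show the bytecode instructions; the exact listing depends on the Python version.
dis.dis(code_object)
```

The bytecode listing differs between interpreter versions, which is why no expected listing is shown here.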
OTHER LANGUAGE PROCESSORS: ASSEMBLER
COMPILATION TASK IS FULL OF VARIETY
Thousands of source languages
• Fortran, Pascal, C, C++, Java, ……
Thousands of target languages
• Some other lower level language (assembly language), machine language
Compilation process has similar variety
• Single pass, multi-pass, load-and-go, debugging, optimizing….
Variety is overwhelming……
Terminology:
Structure ≡ Syntax
Meaning ≡ Semantics
ANALYSIS-SYNTHESIS MODEL OF COMPILATION
Two major parts --
Analysis: an intermediate representation is created from the given source
program.
Lexical Analyzer, Syntax Analyzer and Semantic Analyzer
Synthesis: the equivalent target program is created from this intermediate
representation
Intermediate Code Generator, Code Optimizer, and Code Generator
PHASES OF COMPILER
COMPILATION STEPS/PHASES
Lexical Analysis: Generates the “tokens” in the source program
Syntax Analysis: Recognizes “sentences” in the program using the syntax of the
language
Semantic Analysis: Infers information about the program using the semantics of the
language
Intermediate Code Generation: Generates “abstract” code based on the syntactic
structure of the program and the semantic information
Optimization: Refines the generated code using a series of optimizing transformations
Final Code Generation: Translates the abstract intermediate code into specific
machine instructions
LEXICAL ANALYSIS
Converts the stream of characters representing the input program into meaningful sequences called lexemes.
For each lexeme, the lexical analyzer produces as output
A token of the form:
< token-name, attribute-value >
token-name an abstract symbol that is used during syntax analysis
attribute-value points to an entry in the symbol table for this token
Example:
Input: “*x++”  Output: three tokens: “*” (operator), “x” (identifier), “++” (operator)
Input: “static int”  Output: two tokens: “static”, “int”
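The two examples above can be reproduced with a small regular-expression lexer. This is a minimal sketch: the token classes in `TOKEN_SPEC` and the function name `tokenize` are assumptions made for illustration, and the second field of each pair is the lexeme itself rather than a symbol-table pointer.

```python
import re

# Hypothetical token classes. The two-character operator "++" is listed
# before the single-character operators so that it is matched as one token.
TOKEN_SPEC = [
    ("keyword",    r"\b(?:static|int)\b"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("number",     r"\d+"),
    ("operator",   r"\+\+|[-+*/=]"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token-name, lexeme) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "skip":  # whitespace separates lexemes but is not a token
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("*x++"))
print(tokenize("static int"))
```

Listing keywords before identifiers makes “static” and “int” come out as keywords, while a longer name such as “interior” still falls through to the identifier rule.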
LEXICAL ANALYSIS
Input:
Output: Sequence of tokens
• In this representation, the token names =, +, and * are abstract symbols for
the assignment, addition, and multiplication operators, respectively.
SYNTAX ANALYSIS (PARSING)
Builds a tree, called a parse tree, that reflects the structure of the input sentence.
A typical representation is a syntax tree, in which each interior node represents an operation and the children of
the node represent the arguments of the operation.
Example:
The phrase: x = +y
Four tokens: “x”, “=”, “+” and “y”
Structure: x = (+(y)), i.e., an assignment expression
SYNTAX ANALYSIS: GRAMMARS
Expression grammar
Exp → Exp ‘+’ Exp
| Exp ‘*’ Exp
| ID
| NUMBER
SYNTAX ANALYSIS: SYNTAX TREE
Input: result = a + b * 10
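A syntax tree for this input can be sketched with a small recursive-descent parser. The sketch assumes that ‘*’ binds tighter than ‘+’ and that both are left-associative, which the expression grammar above does not state on its own; the function names and the nested-tuple tree format are illustrative choices.

```python
import re


def lex(text):
    """Split the input into identifier, number, and operator lexemes."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[=+*]", text)


def parse_assignment(tokens):
    """assignment -> ID '=' sum"""
    if len(tokens) < 3 or tokens[1] != "=":
        raise SyntaxError("expected: ID '=' expression")
    tree, pos = parse_sum(tokens, 2)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return ("=", tokens[0], tree)


def parse_sum(tokens, pos):
    """sum -> product ('+' product)*"""
    left, pos = parse_product(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_product(tokens, pos + 1)
        left = ("+", left, right)  # interior node: operator with two children
    return left, pos


def parse_product(tokens, pos):
    """product -> operand ('*' operand)*"""
    left, pos = parse_operand(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_operand(tokens, pos + 1)
        left = ("*", left, right)
    return left, pos


def parse_operand(tokens, pos):
    """operand -> ID | NUMBER (leaves of the tree)"""
    if pos >= len(tokens) or tokens[pos] in ("=", "+", "*"):
        raise SyntaxError("expected an identifier or a number")
    return tokens[pos], pos + 1


tree = parse_assignment(lex("result = a + b * 10"))
print(tree)  # ('=', 'result', ('+', 'a', ('*', 'b', '10')))
```

The multiplication ends up deeper in the tree than the addition, so it is evaluated first.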
SEMANTIC ANALYSIS
Check the source program for semantic errors
It uses the hierarchical structure determined by the syntax-analysis phase to
identify the operators and operands of expressions and statements
Performs type checking
Operator operand compatibility
Example:
The compiler must report an error if a floating-point number is used to index an array.
SEMANTIC ANALYSIS
The language specification may permit some type conversions
called coercions.
Example:
When a binary arithmetic operator is applied to an integer and a floating-point number,
the compiler may convert (coerce) the integer into a floating-point number.
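Both checks described for semantic analysis, rejecting a floating-point array index and coercing an integer operand, can be sketched as follows. The type names "int" and "float" and the functions `check_binary` and `check_index` are hypothetical, chosen only for this illustration.

```python
NUMERIC = {"int", "float"}


def check_binary(op, left_type, right_type):
    """Type-check an arithmetic operator.

    Returns (result_type, coercions), where coercions lists which operand
    ("left" or "right") must be converted from int to float.
    """
    if left_type not in NUMERIC or right_type not in NUMERIC:
        raise TypeError(f"operator {op!r} cannot be applied to {left_type} and {right_type}")
    if left_type == right_type:
        return left_type, []
    # Mixed int/float operands: coerce the integer side.
    return "float", ["left"] if left_type == "int" else ["right"]


def check_index(element_type, index_type):
    """Type-check an array access and return the type of the element."""
    if index_type != "int":
        raise TypeError(f"array index must be int, got {index_type}")
    return element_type


print(check_binary("*", "float", "int"))  # ('float', ['right'])
print(check_index("float", "int"))        # float
```

Calling `check_index("float", "float")` raises `TypeError`, which corresponds to the compiler reporting an error.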
INTERMEDIATE CODE GENERATION
Translates the hierarchical structure, represented as a decorated tree, into intermediate code
A program translated for an abstract machine
Properties of intermediate codes
Should be easy to produce
Should be easy to translate into the target program
Intermediate code hides many machine-level details, but has instruction-level
mapping to many assembly languages
Main motivation: portability
One commonly used form is “Three-address Code”
INTERMEDIATE CODE GENERATION
We consider an intermediate form called “three-address code”.
It resembles the assembly language for a machine in which every memory
location can act like a register.
Three-address code consists of a sequence of instructions,
each of which has at most three operands.
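A minimal sketch of generating three-address code from a syntax tree is given here. It assumes the tree is a nested tuple of the form (operator, left, right) with strings as leaves, and that temporaries are named t1, t2, and so on; both conventions are illustrative.

```python
import itertools


def gen_three_address(tree):
    """Flatten a nested-tuple syntax tree into three-address instructions."""
    code = []
    temps = (f"t{n}" for n in itertools.count(1))

    def walk(node):
        """Emit code for a subtree and return the name that holds its value."""
        if isinstance(node, str):  # leaf: identifier or number
            return node
        op, left, right = node
        left_addr = walk(left)
        right_addr = walk(right)
        temp = next(temps)
        code.append(f"{temp} = {left_addr} {op} {right_addr}")
        return temp

    op, target, expr = tree
    if op != "=":
        raise ValueError("expected an assignment at the root of the tree")
    code.append(f"{target} = {walk(expr)}")
    return code


tree = ("=", "result", ("+", "a", ("*", "b", "10")))
for line in gen_three_address(tree):
    print(line)
# t1 = b * 10
# t2 = a + t1
# result = t2
```

Each emitted instruction names at most three addresses: two operands and one result.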
CODE OPTIMIZATION
Apply a series of transformations to improve the time and space efficiency of
the generated code.
Peephole optimizations: generate new instructions by combining/expanding on
a small number of consecutive instructions.
Global optimizations: reorder, remove or add instructions to change the
structure of generated code
Consumes a significant fraction of the compilation time
Optimization capability varies widely
Simple optimization techniques can be very valuable
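One simple peephole optimization can be sketched as follows: a computation into a temporary followed immediately by a copy out of that temporary is combined into a single instruction. The sketch assumes instructions are tuples, (dest, left, op, right) for a computation and (dest, source) for a copy, and that names of the form t1, t2, ... are compiler temporaries that are never read again after the copy. These are assumptions of this illustration, not properties the sketch verifies.

```python
def is_temp(name):
    """Assumed naming convention: temporaries are 't' followed by digits."""
    return name[0] == "t" and name[1:].isdigit()


def peephole(code):
    """Merge 't = a op b' followed by 'x = t' into 'x = a op b'."""
    optimized = []
    i = 0
    while i < len(code):
        current = code[i]
        following = code[i + 1] if i + 1 < len(code) else None
        if (len(current) == 4 and following is not None
                and len(following) == 2
                and following[1] == current[0]
                and is_temp(current[0])):
            # Write the result straight into the copy's destination.
            optimized.append((following[0],) + current[1:])
            i += 2
        else:
            optimized.append(current)
            i += 1
    return optimized


code = [("t1", "b", "*", "10"), ("t2", "a", "+", "t1"), ("result", "t2")]
print(peephole(code))
# [('t1', 'b', '*', '10'), ('result', 'a', '+', 't1')]
```

The pass only ever inspects two consecutive instructions, which is what makes it a peephole optimization rather than a global one.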
CODE GENERATION
Map instructions in the intermediate code to specific machine instructions.
Memory management, register allocation, instruction selection, instruction
scheduling, …
Generates sufficient information to enable symbolic debugging.
CODE GENERATION
For example, using registers R1 and R2, the intermediate code might get
translated into the machine code
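A minimal sketch of such a translation is given here, assuming a hypothetical load/store machine. The instruction names LD, ST, ADD, and MUL and the “#” prefix for constants are illustrative and do not belong to any real instruction set. The generator is deliberately naive: it performs no register allocation and moves every value through memory, using only R1 and R2.

```python
OPCODES = {"+": "ADD", "*": "MUL"}  # hypothetical mnemonics


def operand(name):
    """Write constants as immediates (#10); everything else is a memory name."""
    return f"#{name}" if name.isdigit() else name


def gen_machine_code(code):
    """Translate (dest, left, op, right) instructions into assembly-like text."""
    asm = []
    for dest, left, op, right in code:
        asm.append(f"LD R1, {operand(left)}")     # load the first operand
        asm.append(f"LD R2, {operand(right)}")    # load the second operand
        asm.append(f"{OPCODES[op]} R1, R1, R2")   # R1 = R1 op R2
        asm.append(f"ST {dest}, R1")              # store the result
    return asm


intermediate = [("t1", "b", "*", "10"), ("result", "a", "+", "t1")]
for line in gen_machine_code(intermediate):
    print(line)
# LD R1, b
# LD R2, #10
# MUL R1, R1, R2
# ST t1, R1
# LD R1, a
# LD R2, t1
# ADD R1, R1, R2
# ST result, R1
```

A real code generator would keep t1 in a register instead of storing and reloading it, which is the job of register allocation.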
SYMBOL TABLE
Records the identifiers used in the source program
Collect information about various attributes of each identifier
Variables: type, scope, storage allocation
Procedure: number and types of arguments, method of argument passing
It’s a data structure containing a record for each identifier
Different fields are collected and used at different phases of compilation
When an identifier in the source program is detected by the lexical analyzer, the
identifier is entered into the symbol table
SYMBOL TABLE
It is built during the lexical and syntax analysis phases, and the compiler uses it to achieve compile-time efficiency.
The information is collected by the analysis phases of the compiler and used by the synthesis phases to generate code.
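A symbol table of this kind can be sketched as a dictionary with one record per identifier. The class name `SymbolTable`, its method names, and the attribute keys used here are illustrative assumptions, not a prescribed design.

```python
class SymbolTable:
    """One record per identifier; later phases add fields to the record."""

    def __init__(self):
        self._records = {}

    def enter(self, name):
        """Called when the lexical analyzer detects an identifier.

        Returns the existing record, or creates a new one on first sight.
        """
        return self._records.setdefault(name, {"name": name})

    def set_attribute(self, name, key, value):
        """Called by later phases to record type, scope, storage, and so on."""
        self._records[name][key] = value

    def lookup(self, name):
        """Return the record for name, or None if it was never entered."""
        return self._records.get(name)


table = SymbolTable()
table.enter("result")                            # lexical analysis
table.set_attribute("result", "type", "float")   # semantic analysis
table.set_attribute("result", "scope", "global")
print(table.lookup("result"))
# {'name': 'result', 'type': 'float', 'scope': 'global'}
```

A dictionary gives constant-time lookup on average, which matters because every phase consults the table repeatedly.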