CD Notes
Computer programs are generally written in high-level languages (like C++, Python, and
Java). A language processor, or language translator, is a computer program that converts
source code from one programming language to another language or to machine code
(also known as object code). It also reports errors found during translation.
What is a Language Processor?
Compilers and interpreters translate programs written in high-level languages into
machine code that a computer understands, while assemblers translate programs written
in low-level (assembly) language into machine code. The compilation process consists of
several stages. To help programmers write error-free code, tools are available.
Assembly language is machine-dependent, yet the mnemonics used to represent its
instructions are not directly understandable by the machine, and a high-level language is
machine-independent. A computer understands instructions only in machine code, i.e. in
the form of 0s and 1s, and it is a tedious task to write a computer program directly in
machine code. Programs are therefore written mostly in high-level languages like Java,
C++, and Python; these are called source code. Source code cannot be executed directly
by the computer and must be converted into machine language to be executed. Hence, a
special translator system software is used to translate a program written in a high-level
language into machine code; this software is called a language processor, and the
translated program is called the object program (or object code).
Types of Language Processors
The language processors can be any of the following three types:
1. Compiler
The language processor that reads the complete source program written in high-level
language as a whole in one go and translates it into an equivalent program in machine
language is called a Compiler. Example: C, C++, C#.
In a compiler, the source code is translated to object code successfully only if it is free of
errors. If there are any errors in the source code, the compiler reports them at the end of
compilation, together with their line numbers. The errors must be removed before the
compiler can successfully recompile the source code. Once compiled, the object program
can be executed any number of times without translating it again.
2. Assembler
The Assembler is used to translate the program written in Assembly language into
machine code. The source program is an input of an assembler that contains assembly
language instructions. The output generated by the assembler is the object code or
machine code understandable by the computer. The assembler is basically the first
interface that allows humans to communicate with the machine; it fills the gap between
human and machine so that they can communicate with each other. Code written in
assembly language consists of mnemonics (instructions) such as ADD, MUL, SUB, DIV,
MOV and so on, and the assembler converts these mnemonics into binary code. These
mnemonics also depend upon the architecture of the machine.
For example, the architectures of the Intel 8085 and Intel 8086 are different.
3. Interpreter
An interpreter is a language processor that translates a single statement of the source
program into machine code and executes it immediately, before moving on to the next
line. If there is an error in a statement, the interpreter terminates its translating process
at that statement and displays an error message; it moves on to the next line for
execution only after the removal of the error. An interpreter directly executes
instructions written in a programming or scripting language without previously
converting them to object code or machine code: it translates one line at a time and
then executes it.
Example: Perl, Python and Matlab.
Compiler vs. Interpreter
Compiler: requires a lot of memory for generating object code.
Interpreter: requires less memory than a compiler because no object code is generated.
Introduction to Compiler Design
A compiler basically has two phases, namely the analysis phase and the synthesis phase.
The analysis phase creates an intermediate representation from the given source
code. The synthesis phase creates an equivalent target program from the intermediate
representation.
A compiler is a software program that converts the high-level source code written in a
programming language into low-level machine code that can be executed by the
computer hardware. The process of converting the source code into machine code
involves several phases or stages, which are collectively known as the phases of a
compiler. The typical phases of a compiler are:
1. Lexical Analysis: The first phase of a compiler is lexical analysis, also known as
scanning. This phase reads the source code and breaks it into a stream of
tokens, which are the basic units of the programming language. The tokens
are then passed on to the next phase for further processing.
2. Syntax Analysis: The second phase of a compiler is syntax analysis, also
known as parsing. This phase takes the stream of tokens generated by the
lexical analysis phase and checks whether they conform to the grammar of
the programming language. The output of this phase is usually an Abstract
Syntax Tree (AST).
3. Semantic Analysis: The third phase of a compiler is semantic analysis. This
phase checks whether the code is semantically correct, i.e., whether it
conforms to the language’s type system and other semantic rules. In this
stage, the compiler checks the meaning of the source code to ensure that it
makes sense. The compiler performs type checking, which ensures that
variables are used correctly and that operations are performed on compatible
data types. The compiler also checks for other semantic errors, such as
undeclared variables and incorrect function calls.
4. Intermediate Code Generation: The fourth phase of a compiler is intermediate
code generation. This phase generates an intermediate representation of the
source code that can be easily translated into machine code.
5. Optimization: The fifth phase of a compiler is optimization. This phase applies
various optimization techniques to the intermediate code to improve the
performance of the generated machine code.
6. Code Generation: The final phase of a compiler is code generation. This phase
takes the optimized intermediate code and generates the actual machine code
that can be executed by the target hardware.
In summary, the phases of a compiler are: lexical analysis, syntax analysis, semantic
analysis, intermediate code generation, optimization, and code generation.
Symbol Table – It is a data structure being used and maintained by the compiler,
consisting of all the identifier’s names along with their types. It helps the compiler to
function smoothly by finding the identifiers quickly.
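To make the idea concrete, below is a minimal sketch in C (not taken from any particular
compiler) of a symbol table kept as a simple linear list; the names sym_insert and
sym_lookup and the fixed table size are illustrative assumptions, and real compilers
typically use hash tables and also record scope and storage information.

#include <stdio.h>
#include <string.h>

/* Illustrative symbol table: a fixed-size linear list mapping identifier
   names to a type string. Real compilers usually use hash tables and
   store scope, memory offsets, etc. */
#define MAX_SYMS 256

struct symbol {
    char name[64];
    char type[16];
};

static struct symbol table[MAX_SYMS];
static int sym_count = 0;

/* Look up an identifier; returns its index, or -1 if it is not present. */
int sym_lookup(const char *name) {
    for (int i = 0; i < sym_count; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}

/* Insert an identifier with its type (if new); returns its index. */
int sym_insert(const char *name, const char *type) {
    int i = sym_lookup(name);
    if (i >= 0) return i;                 /* already in the table */
    if (sym_count >= MAX_SYMS) return -1; /* table full */
    strncpy(table[sym_count].name, name, sizeof table[sym_count].name - 1);
    strncpy(table[sym_count].type, type, sizeof table[sym_count].type - 1);
    return sym_count++;
}

int main(void) {
    sym_insert("a", "int");
    sym_insert("b", "int");
    printf("a is at index %d with type %s\n",
           sym_lookup("a"), table[sym_lookup("a")].type);
    return 0;
}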
The analysis of a source program is divided into mainly three phases. They are:
1. Linear Analysis-
This involves a scanning phase where the stream of characters is read from
left to right. It is then grouped into various tokens having a collective
meaning.
2. Hierarchical Analysis-
In this analysis phase, based on a collective meaning, the tokens are
categorized hierarchically into nested groups.
3. Semantic Analysis-
This phase is used to check whether the components of the source program
are meaningful or not.
The compiler has two modules, namely the front end and the back end. The front end
comprises the lexical analyzer, syntax analyzer, semantic analyzer, and intermediate
code generator. The remaining phases are assembled to form the back end.
1. Lexical Analyzer –
It is also called a scanner. It takes the output of the preprocessor (which
performs file inclusion and macro expansion) as the input which is in a pure
high-level language. It reads the characters from the source program and
groups them into lexemes (sequence of characters that “go together”). Each
lexeme corresponds to a token. Tokens are defined by regular expressions
which are understood by the lexical analyzer. It also detects lexical errors
(e.g., erroneous characters) and removes comments and white space.
2. Syntax Analyzer – It is sometimes called a parser. It constructs the parse
tree. It takes all the tokens one by one and uses Context-Free Grammar to
construct the parse tree.
Why Grammar?
The rules of programming can be entirely represented in a few productions.
Using these productions we can represent what the program actually is. The
input has to be checked whether it is in the desired format or not.
The parse tree is also called the derivation tree. Parse trees are generally
constructed to check for ambiguity in the given grammar. There are certain
rules associated with the derivation tree.
o Any identifier is an expression.
o Any number can be called an expression.
o Performing any operation on given expressions always results in an expression;
for example, the sum of two expressions is also an expression.
o The parse tree can be compressed to form a syntax tree.
Syntax errors can be detected at this level if the input is not in accordance with the
grammar.
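To illustrate how a parser checks the input against a grammar, here is a minimal
recursive-descent sketch in C for the toy expression grammar E → T { + T },
T → F { * F }, F → id | digit | ( E ). The grammar, the function names, and the use of
single-character identifiers are simplifying assumptions; a real syntax analyzer would
also build a parse or syntax tree rather than only checking the syntax.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Recursive-descent recognizer for the toy grammar
   E -> T { '+' T },  T -> F { '*' F },  F -> id | digit | '(' E ')' */
static const char *p;            /* current position in the input */

static void expr(void);

static void error(void) {
    printf("syntax error at '%c'\n", *p);
    exit(1);
}

static void factor(void) {
    if (isalpha((unsigned char)*p) || isdigit((unsigned char)*p)) {
        p++;                     /* a one-character identifier or number */
    } else if (*p == '(') {
        p++;
        expr();
        if (*p == ')') p++; else error();
    } else {
        error();
    }
}

static void term(void) {
    factor();
    while (*p == '*') { p++; factor(); }
}

static void expr(void) {
    term();
    while (*p == '+') { p++; term(); }
}

int main(void) {
    p = "a+b*(c+2)";
    expr();
    if (*p == '\0') printf("input parsed successfully\n");
    else error();
    return 0;
}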
Lexical Analysis is the first phase of the compiler, also known as scanning. It converts the
high-level input program into a sequence of tokens.
1. Lexical Analysis can be implemented with the Deterministic finite Automata.
2. The output is a sequence of tokens that is sent to the parser for syntax
analysis
What is a Token?
A lexical token is a sequence of characters that can be treated as a unit in the grammar
of the programming languages. Example of tokens:
Type token (id, number, real, . . . )
Punctuation tokens (IF, void, return, . . . )
Alphabetic tokens (keywords)
Keywords; Examples-for, while, if etc.
Identifier; Examples-Variable name, function name, etc.
Operators; Examples '+', '++', '-' etc.
Separators; Examples ',' ';' etc
Example of Non-Tokens:
Comments, preprocessor directive, macros, blanks, tabs, newline, etc.
Lexeme: The sequence of characters matched by a pattern to form the corresponding
token or a sequence of input characters that comprises a single token is called a lexeme.
eg- “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;” .
How Lexical Analyzer Works?
1. Input preprocessing: This stage involves cleaning up the input text and
preparing it for lexical analysis. This may include removing comments,
whitespace, and other non-essential characters from the input text.
2. Tokenization: This is the process of breaking the input text into a sequence of
tokens. This is usually done by matching the characters in the input text
against a set of patterns or regular expressions that define the different types
of tokens.
3. Token classification: In this stage, the lexer determines the type of each
token. For example, in a programming language, the lexer might classify
keywords, identifiers, operators, and punctuation symbols as separate token
types.
4. Token validation: In this stage, the lexer checks that each token is valid
according to the rules of the programming language. For example, it might
check that a variable name is a valid identifier, or that an operator has the
correct syntax.
5. Output generation: In this final stage, the lexer generates the output of the
lexical analysis process, which is typically a list of tokens. This list of tokens
can then be passed to the next stage of compilation or interpretation.
The lexical analyzer identifies errors with the help of the automaton and the grammar of
the given language on which it is based, such as C or C++, and gives the row number
and column number of the error.
Suppose we pass a statement through the lexical analyzer: a = b + c; It will
generate a token sequence like this: id = id + id; where each id refers to its
variable in the symbol table, which references all of its details. For example, consider the program
int main()
{
    // 2 variables
    int a, b;
    a = 10;
    return 0;
}
Above are the valid tokens. You can observe that we have omitted comments. As another
example, the lexemes below map to the following tokens:
Lexemes   Tokens
(         LPAREN
a         IDENTIFIER
b         IDENTIFIER
)         RPAREN
=         ASSIGNMENT
a         IDENTIFIER
2         INTEGER
;         SEMICOLON
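For illustration, here is a minimal, hypothetical tokenizer sketch in C that produces this
kind of token stream for the statement a = b + c;. The token names (IDENTIFIER,
ASSIGNMENT, and so on) and the classification rules are simplified assumptions; a real
lexical analyzer is normally driven by regular expressions and a finite automaton.

#include <ctype.h>
#include <stdio.h>

/* Toy tokenizer: prints the token stream for a statement such as "a = b + c;". */
int main(void) {
    const char *src = "a = b + c;";
    const char *p = src;
    while (*p != '\0') {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* skip white space */
        if (isalpha((unsigned char)*p)) {
            printf("IDENTIFIER(");
            while (isalnum((unsigned char)*p)) putchar(*p++);
            printf(") ");
        } else if (isdigit((unsigned char)*p)) {
            printf("INTEGER(");
            while (isdigit((unsigned char)*p)) putchar(*p++);
            printf(") ");
        } else if (*p == '=') { printf("ASSIGNMENT ");              p++; }
        else if (*p == '+' || *p == '-') { printf("OPERATOR(%c) ", *p); p++; }
        else if (*p == ';') { printf("SEMICOLON ");                 p++; }
        else                { printf("UNKNOWN(%c) ", *p);           p++; }
    }
    putchar('\n');
    return 0;
}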
Advantages
1. Simplifies Parsing: Breaking down the source code into tokens makes it
easier for computers to understand and work with the code. This helps
programs like compilers or interpreters to figure out what the code is
supposed to do. It’s like breaking down a big puzzle into smaller pieces, which
makes it easier to put together and solve.
2. Error Detection: Lexical analysis will detect lexical errors such as misspelled
keywords or undefined symbols early in the compilation process. This helps in
improving the overall efficiency of the compiler or interpreter by identifying
errors sooner rather than later.
3. Efficiency: Once the source code is converted into tokens, subsequent
phases of compilation or interpretation can operate more efficiently. Parsing
and semantic analysis become faster and more streamlined when working
with tokenized input.
Disadvantages
1. Limited Context: Lexical analysis operates based on individual tokens and
does not consider the overall context of the code. This can sometimes lead to
ambiguity or misinterpretation of the code’s intended meaning especially in
languages with complex syntax or semantics.
2. Overhead: Although lexical analysis is necessary for the compilation or
interpretation process, it adds an extra layer of overhead. Tokenizing the
source code requires additional computational resources which can impact the
overall performance of the compiler or interpreter.
3. Debugging Challenges: Lexical errors detected during the analysis phase
may not always provide clear indications of their origins in the original source
code. Debugging such errors can be challenging especially if they result from
subtle mistakes in the lexical analysis process.
Bootstrapping
Bootstrapping is the technique of writing a compiler in the language it is intended to
compile. It proceeds in steps:
Step-1: A compiler for a small subset of C, called C0, is first written in an existing
language.
Step-2: Then, using this small subset C0, the compiler for the source language C is
written.
Step-3: Finally, the second compiler is compiled: using compiler 1, compiler 2 is
compiled.
Advantages:
Disadvantages:
1. It can be a time-consuming process, especially for complex languages or
compilers.
2. Debugging a bootstrapped compiler can be challenging since any errors or
bugs in the compiler will affect the subsequent versions of the compiler.
3. Bootstrapping requires that a minimal version of the compiler be written in a
different language, which can introduce compatibility issues between the two
languages.
4. Overall, bootstrapping is a useful technique in compiler design, but it
requires careful planning and execution to ensure that the benefits outweigh
the drawbacks.
The lexical analyzer scans the input from left to right one character at a time. It uses two
pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input
scanned.
Input buffering is an important concept in compiler design that refers to the way in which
the compiler reads input from the source code. In many cases, the compiler reads input
one character at a time, which can be a slow and inefficient process. Input buffering is a
technique that allows the compiler to read input in larger chunks, which can improve
performance and reduce overhead.
1. The basic idea behind input buffering is to read a block of input from the
source code into a buffer, and then process that buffer before reading the next
block. The size of the buffer can vary depending on the specific needs of the
compiler and the characteristics of the source code being compiled. For
example, a compiler for a high-level programming language may use a larger
buffer than a compiler for a low-level language, since high-level languages
tend to have longer lines of code.
2. One of the main advantages of input buffering is that it can reduce the
number of system calls required to read input from the source code. Since
each system call carries some overhead, reducing the number of calls can
improve performance. Additionally, input buffering can simplify the design of
the compiler by reducing the amount of code required to manage input.
However, there are also some potential disadvantages to input buffering. For example, if
the size of the buffer is too large, it may consume too much memory, leading to slower
performance or even crashes. Additionally, if the buffer is not properly managed, it can
lead to errors in the output of the compiler.
Overall, input buffering is an important technique in compiler design that can help
improve performance and reduce overhead. However, it must be used carefully and
appropriately to avoid potential problems.
Initially both the pointers point to the first character of the input string as shown below.
The forward ptr (fp) moves ahead to search for the end of the lexeme. As soon as a
blank space is encountered, it indicates the end of the lexeme. In the above example, as
soon as fp encounters a blank space, the lexeme “int” is identified. When fp encounters
white space, it ignores it and moves ahead; then both the begin ptr (bp) and forward
ptr (fp) are set at the next token. The input characters are thus read from secondary
storage, but reading from secondary storage in this way is costly. Hence a buffering
technique is used: a block of data is first read into a buffer and then scanned by the
lexical analyzer. There are two methods used in this context: the One Buffer Scheme and
the Two Buffer Scheme. These are explained below.
1. One Buffer Scheme: In this scheme, only one buffer is used to store the
input string. The problem with this scheme is that if a lexeme is very long, it
crosses the buffer boundary; to scan the rest of the lexeme the buffer has to
be refilled, which overwrites the first part of the lexeme.
2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, in this
method two buffers are used to store the input string, and the first and second
buffers are scanned alternately: when the end of the current buffer is reached,
the other buffer is filled. The only problem with this method is that if the length
of a lexeme is longer than the length of a buffer, the input cannot be scanned
completely. Initially both bp and fp point to the first character of the first buffer.
Then fp moves towards the right in search of the end of the lexeme; as soon as a
blank character is recognized, the string between bp and fp is identified as the
corresponding token. To identify the boundary of the first buffer, an end-of-buffer
(eof) character is placed at the end of the first buffer; similarly, the end of the
second buffer is recognized by the end-of-buffer mark present at its end. When fp
encounters the first eof, the end of the first buffer is recognized and filling of the
second buffer begins; in the same way, when the second eof is reached, it
indicates the end of the second buffer. The two buffers are filled alternately until
the end of the input program is reached and the stream of tokens is identified.
The eof character introduced at the end of each buffer is called a sentinel and is
used to identify the end of the buffer. (A small C sketch of the two buffer scheme
is given after this list.)
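The following C sketch illustrates the two buffer scheme under simplifying assumptions:
the buffer halves are deliberately small, secondary storage is simulated by a string, the
null character stands in for the eof sentinel, and the name fill_half is illustrative.

#include <stdio.h>
#include <string.h>

/* Two-buffer scheme sketch: each half ends in a sentinel; when the forward
   pointer reaches a sentinel, the other half is refilled. */
#define HALF 16
#define EOF_CHAR '\0'                 /* stands in for the eof sentinel */

static char buffer[2 * HALF + 2];     /* two halves, each with a sentinel slot */
static const char *input = "int   count = 273 ;";
static size_t consumed = 0;           /* characters already copied from input */

/* Refill one half of the buffer and terminate it with the sentinel. */
static char *fill_half(char *half) {
    size_t n = strlen(input) - consumed;
    if (n > HALF) n = HALF;
    memcpy(half, input + consumed, n);
    consumed += n;
    half[n] = EOF_CHAR;
    return half;
}

int main(void) {
    char *fp = fill_half(buffer);     /* forward pointer starts in half 1 */
    for (;;) {
        if (*fp == EOF_CHAR) {
            if (consumed >= strlen(input)) break;      /* real end of input */
            if (fp < buffer + HALF + 1)
                fp = fill_half(buffer + HALF + 1);     /* switch to half 2 */
            else
                fp = fill_half(buffer);                /* switch back to half 1 */
            continue;
        }
        putchar(*fp++);   /* a real scanner would group characters into lexemes */
    }
    putchar('\n');
    return 0;
}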
Advantages:
Input buffering can reduce the number of system calls required to read input from the
source code, which can improve performance.
Input buffering can simplify the design of the compiler by reducing the amount of code
required to manage input.
Disadvantages:
If the size of the buffer is too large, it may consume too much memory, leading to slower
performance or even crashes.
If the buffer is not properly managed, it can lead to errors in the output of the compiler.
Overall, the advantages of input buffering generally outweigh the disadvantages when
used appropriately, as it can improve performance and simplify the compiler design.
Recognition of Tokens
1. Transition Table
2. Transition Diagram
1. Transition Table
It is a tabular representation that lists all possible transitions for each state and input
symbol combination.
EXAMPLE
Assume the following grammar fragment to generate a specific language,
where the terminals if, then, else, relop, id and num generate sets of strings given by the
following regular definitions,
where letter and digit are defined as (letter → [A-Z a-z] and digit → [0-9]).
For this language, the lexical analyzer will recognize the keywords if, then,
and else, as well as lexemes that match the patterns for relop, id, and number.
To simplify matters, we make the common assumption that keywords are also
reserved words: that is, they cannot be used as identifiers.
The token num represents the unsigned integer and real numbers of Pascal.
In addition, we assume lexemes are separated by white space, consisting of
nonnull sequences of blanks, tabs, and newlines.
Our lexical analyzer will strip out white space. It will do so by comparing a string
against the regular definition ws, below.
If a match for ws is found, the lexical analyzer does not return a token to the
parser.
It is the following token that gets returned to the parser.
2. Transition Diagram
It is a directed labeled graph consisting of nodes and edges. Nodes represent states,
while edges represent state transitions.
Components of Transition Diagram
1. One state is labelled the Start State. It is the initial state of the transition diagram.
2. Positions in a transition diagram are drawn as circles and are called states.
3. The states are connected by arrows called edges. Labels on edges indicate the
input characters.
4. Zero or more final states, or accepting states, are represented by a double circle;
they indicate that a token has been found.
Example:
Here is the transition diagram of Finite Automata that recognizes the lexemes matching
the token relop.
Here is the Finite Automata Transition Diagram for recognizing white spaces.
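One common way to turn the relop transition diagram into code is sketched below in C:
the states and edges of the diagram are folded into a switch on the current character,
and the token names LT, LE, NE, EQ, GT and GE are illustrative assumptions.

#include <stdio.h>

/* Recognizes the relational operators <, <=, <>, =, > and >= at the start of s,
   returning an illustrative token name and the number of characters consumed. */
const char *relop(const char *s, int *len) {
    *len = 1;
    switch (s[0]) {
    case '<':
        if (s[1] == '=') { *len = 2; return "LE"; }
        if (s[1] == '>') { *len = 2; return "NE"; }
        return "LT";
    case '=':
        return "EQ";
    case '>':
        if (s[1] == '=') { *len = 2; return "GE"; }
        return "GT";
    default:
        *len = 0;
        return NULL;                  /* not a relational operator */
    }
}

int main(void) {
    const char *tests[] = { "<=", "<>", "<", "=", ">=", ">" };
    for (int i = 0; i < 6; i++) {
        int len;
        printf("%-3s -> %s\n", tests[i], relop(tests[i], &len));
    }
    return 0;
}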
Lexical Analysis
It is the first step of compiler design; it takes a stream of characters as input and gives
tokens as output, a process also known as tokenization. The tokens can be classified into
identifiers, separators, keywords, operators, constants, and special characters.
It has three phases:
1. Tokenization: It takes the stream of characters and converts it into tokens.
2. Error Messages: It gives errors related to lexical analysis such as exceeding
length, unmatched string, etc.
3. Eliminate Comments: Eliminates comments as well as extra spaces, blank lines,
new lines, and indentation.
Lex
Lex is a tool or a computer program that generates lexical analyzers (which convert the
stream of characters into tokens). The Lex tool itself is a compiler: the Lex compiler takes
as input a specification of patterns (regular expressions with actions) and transforms it
into a lexical analyzer. It is commonly used with YACC (Yet Another Compiler Compiler).
It was written by Mike Lesk and Eric Schmidt.
Function of Lex
1. In the first step, the source program written in the Lex language, with the file name
‘File.l’, is given as input to the Lex compiler (commonly known as Lex) to get the output
lex.yy.c.
2. After that, the output lex.yy.c is used as input to the C compiler, which produces an
‘a.out’ file; finally, the output file a.out takes the stream of characters and generates
tokens as output.
lex.yy.c: It is a C program.
File.l: It is a Lex source program
a.out: It is a Lexical analyzer
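As a rough illustration of what File.l might look like, here is a minimal, hypothetical Lex
specification; the pattern set and token names are toy assumptions rather than a
complete scanner, and the embedded actions are ordinary C code.

%{
/* File.l : a minimal, illustrative Lex specification.
   Typical build: lex File.l && cc lex.yy.c, then run ./a.out */
#include <stdio.h>
%}
%%
"if"|"else"|"while"      { printf("KEYWORD(%s)\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*   { printf("IDENTIFIER(%s)\n", yytext); }
[0-9]+                   { printf("INTEGER(%s)\n", yytext); }
"="                      { printf("ASSIGNMENT\n"); }
[+\-*/]                  { printf("OPERATOR(%s)\n", yytext); }
";"                      { printf("SEMICOLON\n"); }
[ \t\n]+                 { /* skip white space */ }
.                        { printf("UNKNOWN(%s)\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }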
DFA (Deterministic Finite Automata)
o In DFA, there is only one path for specific input from the current state to the next
state.
o DFA does not accept the null move, i.e., the DFA cannot change state without any
input character.
o DFA can contain multiple final states. It is used in Lexical Analysis in Compiler.
In the following diagram, we can see that from state q0 for input a, there is only one path
which is going to q1. Similarly, from q0, there is only one path for input b going to q2.
Example 01: Construct a DFA with ∑ = {0, 1} that accepts only the input string
“10”.
In the above example, the language contains only one string, given below:
L = {10}
The DFA that accepts only the input string “10” is given below.
Example 02: Construct a DFA with ∑ = {a, b} that accepts only the input “aaab”.
The given question provides the following language:
L = {aaab}
Example 03
Construct a DFA which accepts all strings over the alphabet ∑ = {0, 1} that start with
“0”.
Solution:
The given question provides the following language
L = {0, 01, 00, 010, 011, 000, 001, …}
Let’s draw the DFA, which accepts all the strings starting with “0”.
Example 04
Construct a DFA which accepts all strings over the alphabet ∑ = {0, 1} that start with “01”.
Solution:
The given question provides the following language
L = {01, 010, 011, 0100, 0101, 0110, 0111, …}
Let’s draw the DFA which accepts all the strings of the above language
Example 05
Construct a DFA which accepts all strings over the alphabet ∑ = {0, 1} that represent
binary integers divisible by 3.
Solution:
The given question provides the following language
L = {0, 11, 110, 1001, 1100, 1111, …}
The explanation for string 1001 is given below.
1001 in binary is 9 in decimal, and 9/3 = 3; hence, 1001 is divisible by 3.
The following diagram represents the DFA accepter for the language where binary
integers are divisible by three.
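This DFA can be simulated with a small transition table, as in the C sketch below: state i
means “the value read so far is congruent to i modulo 3”, so reading bit b moves state i
to (2·i + b) mod 3, and state 0 is the accepting state. The function name
divisible_by_3 is illustrative.

#include <stdio.h>

/* Transition table: delta[state][bit] for the divisible-by-3 DFA. */
static const int delta[3][2] = {
    /* on '0'  on '1' */
    {     0,      1 },   /* state 0: value mod 3 == 0 (accepting) */
    {     2,      0 },   /* state 1: value mod 3 == 1 */
    {     1,      2 },   /* state 2: value mod 3 == 2 */
};

int divisible_by_3(const char *w) {
    int state = 0;
    for (const char *p = w; *p; p++)
        state = delta[state][*p - '0'];
    return state == 0;               /* accept iff we end in state 0 */
}

int main(void) {
    const char *tests[] = { "0", "11", "110", "1001", "111", "10" };
    for (int i = 0; i < 6; i++)
        printf("%-5s %s\n", tests[i],
               divisible_by_3(tests[i]) ? "accepted" : "rejected");
    return 0;
}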
DFA Example 06
Construct a DFA with ∑ = {0, 1} for the language accepting strings ending with ‘0011’.
Solution:
The given question provides the following language (L) and DFA diagram
L = {0011, 00011, 10011, 110011, 000011, …}
NFA (Non-Deterministic finite automata)
o A finite automaton is called an NFA when there can be several paths for a specific
input from the current state to the next state.
o Not every NFA is a DFA, but each NFA can be translated into an equivalent DFA.
o An NFA is defined in the same way as a DFA but with the following two exceptions:
it can have multiple next states, and it can contain ε-transitions.
In the following image, we can see that from state q0 for input a, there are two next
states q1 and q2, similarly, from q0 for input b, the next states are q0 and q1. Thus it is
not fixed or determined that with a particular input where to go next. Hence this FA is
called non-deterministic finite automata.
An NFA is also defined by the same five tuples as a DFA, but with a different transition
function, as shown below:
δ: Q × ∑ → 2^Q
where 2^Q is the power set of Q, i.e. the set of all subsets of the states.
Example 1:
Solution:
Transition diagram:
Transition Table:
States   0        1
→q0      q0, q1   q1
q1       q2       q0
*q2      q2       q1, q2
In the above diagram, we can see that when the current state is q0, on input 0, the next
state will be q0 or q1, and on 1 input the next state will be q1. When the current state is
q1, on input 0 the next state will be q2 and on 1 input, the next state will be q0. When
the current state is q2, on 0 input the next state is q2, and on 1 input the next state will
be q1 or q2.
NFA Example
Design an NFA with ∑ = {0, 1} for all binary strings where the second last bit is 1.
Solution
The language generated by this example will include all strings in which the second-last bit is 1.
L = {10, 010, 000010, 11, 101011, …}
The following NFA automaton machine accepts all strings where the second last bit is 1.
States   0    1
q0       q0   q0, q1
q1       q2   q2
q2       –    –
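An NFA can be simulated by keeping track of the set of currently active states. The C
sketch below does this for the automaton above using a bit mask; the helper names step
and accepts are illustrative assumptions.

#include <stdio.h>

/* Bit i set in the mask means state qi is currently active. */
enum { Q0 = 1 << 0, Q1 = 1 << 1, Q2 = 1 << 2 };

/* Apply the transition table above to every active state. */
static int step(int states, char bit) {
    int next = 0;
    if (states & Q0) next |= (bit == '1') ? (Q0 | Q1) : Q0; /* q0: 0->q0, 1->{q0,q1} */
    if (states & Q1) next |= Q2;                            /* q1: 0/1 -> q2 */
    /* q2 has no outgoing transitions */
    return next;
}

static int accepts(const char *w) {
    int states = Q0;                         /* start in q0 */
    for (const char *p = w; *p; p++)
        states = step(states, *p);
    return (states & Q2) != 0;               /* accept if q2 is active */
}

int main(void) {
    const char *tests[] = { "10", "010", "11", "000010", "00", "1" };
    for (int i = 0; i < 6; i++)
        printf("%-7s %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}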
NFA Example
Draw an NFA with ∑ = {0, 1} such that the third symbol from the right is “1”.
Solution
The language generated by this example will include all strings where the third symbol from the
right is “1”.
L = {100, 111, 101, 110, 0100, 1100, 00100, 100101, …}
NFA for the above language is shown below, where
{q0, q1, q2, q3} refers to the set of states
{0, 1} refers to the set of input alphabets
δ refers to the transition function
q0 refers to the initial state
{q3} refers to the set of final states
Transition function δ is defined as
δ (q0, 0) = q0
δ (q0, 1) = q0, q1
δ (q1, 0) = q2
δ (q1, 1) = q2
δ (q2, 0) = q3
δ (q2, 1) = q3
δ (q3, 0) = ϕ
δ (q3, 1) = ϕ
Transition Table for the above Non-Deterministic Finite Automata is –
States   0    1
q0       q0   q0, q1
q1       q2   q2
q2       q3   q3
q3       –    –
NFA Example
Construct an NFA with ∑ = {a, b, c} where strings contain some a’s followed by some b’s
followed by some c’s.
Solution
The strings generated by this language will be like:
L = {abc, aabbcc, aaabbcc, aabbbcc, aaaabbbbccc, …}
So, the NFA transition diagram for the above language is shown below, where
{q0, q1, q2} refers to the set of states
{a,b,c} refers to the set of input alphabets
δ refers to the transition function
q0 refers to the initial state
{q2} refers to the set of final states
Transition function δ is defined as
δ (q0, a) = q0
δ (q0, b) = q1
δ (q0, c) = ϕ
δ (q1, a) = ϕ
δ (q1, b) = q1
δ (q1, c) = q2
δ (q2, a) = ϕ
δ (q2, b) = ϕ
δ (q2, c) = q2
Transition Table for the above Non-Deterministic Finite Automata is-
States a b c
q0 q0 q1 –
q1 – q1 q2
q2 – – q2
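The transition function above has at most one move per state and input, with ϕ for the
missing moves, so it can be simulated directly by adding a dead state that stands for ϕ.
The C sketch below encodes the table exactly as given; the enum names and the helper
accepts are illustrative.

#include <stdio.h>

/* States of the automaton; DEAD stands for ϕ (no move). */
enum { Q0, Q1, Q2, DEAD };

static int delta(int state, char ch) {
    switch (state) {
    case Q0: return ch == 'a' ? Q0 : ch == 'b' ? Q1 : DEAD;
    case Q1: return ch == 'b' ? Q1 : ch == 'c' ? Q2 : DEAD;
    case Q2: return ch == 'c' ? Q2 : DEAD;
    default: return DEAD;
    }
}

static int accepts(const char *w) {
    int state = Q0;                  /* q0 is the initial state */
    for (const char *p = w; *p; p++)
        state = delta(state, *p);
    return state == Q2;              /* q2 is the only accepting state */
}

int main(void) {
    const char *tests[] = { "abc", "aabbcc", "aaabbcc", "acb", "ca" };
    for (int i = 0; i < 5; i++)
        printf("%-8s %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}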