UNIT 2 Compiler Design
Lexical analysis is the first phase of compiler design. A lexer takes the modified source code, which is written in the form of sentences. In other words, it converts a sequence of characters into a sequence of tokens. The lexical analyzer breaks the input into a series of tokens and removes any extra whitespace or comments written in the source code.
Programs that perform lexical analysis in compiler design are called lexical analyzers or lexers. A lexer contains a tokenizer or scanner. If the lexical analyzer detects an invalid token, it generates an error. The role of the lexical analyzer in compiler design is to read the character stream from the source code, check for legal tokens, and pass the data to the syntax analyzer when the syntax analyzer demands it.
What’s a lexeme?
A lexeme is a sequence of characters in the source program that matches the pattern for a token. In other words, it is an instance of a token.
What’s a token?
A token is a sequence of characters that represents a single unit of information in the source program.
What is a Pattern?
A pattern is a rule describing the set of lexemes that can represent a particular token. For a keyword used as a token, the pattern is simply the keyword's own sequence of characters; for an identifier, the pattern is typically a regular expression such as letter (letter | digit)*.
The main task of lexical analysis is to read the input characters of the code and produce tokens. The lexical analyzer scans the entire source code of the program and identifies each token one by one. Scanners are usually implemented to produce tokens only when requested by the parser. Here is how recognition of tokens in compiler design works:
1. “Get next token” is a command sent from the parser to the lexical analyzer (see the sketch after this list).
2. On receiving this command, the lexical analyzer scans the input until it finds the next token.
3. It skips comments and whitespace while creating these tokens. It also handles compiler directives.
4. If any error is present, the lexical analyzer correlates that error with the source file and line number.
5. It returns the token to the parser.
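The following is a minimal C sketch of this parser-driven interface. All names here (TokenType, Token, get_next_token) are illustrative inventions, not part of any real compiler's API; the sketch assumes single-character operators and omits keyword lookup and comment handling.

#include <ctype.h>
#include <stdio.h>

typedef enum { TOK_IDENTIFIER, TOK_NUMBER, TOK_OPERATOR, TOK_EOF } TokenType;

typedef struct {
    TokenType type;
    char text[64];   /* the lexeme itself */
    int line;        /* for error reporting */
} Token;

static int line_no = 1;

/* Called by the parser whenever it needs the next token. */
Token get_next_token(FILE *src) {
    Token t = { TOK_EOF, "", 0 };
    int c;

    /* Skip whitespace, tracking line numbers for error messages. */
    while ((c = fgetc(src)) != EOF && isspace(c))
        if (c == '\n') line_no++;

    t.line = line_no;
    if (c == EOF) return t;

    int i = 0;
    if (isalpha(c)) {                      /* identifier (keyword lookup omitted) */
        t.type = TOK_IDENTIFIER;
        do { t.text[i++] = (char)c; } while (isalnum(c = fgetc(src)) && i < 63);
        if (c != EOF) ungetc(c, src);      /* put back the lookahead character */
    } else if (isdigit(c)) {               /* number literal */
        t.type = TOK_NUMBER;
        do { t.text[i++] = (char)c; } while (isdigit(c = fgetc(src)) && i < 63);
        if (c != EOF) ungetc(c, src);
    } else {                               /* single-character operator */
        t.type = TOK_OPERATOR;
        t.text[i++] = (char)c;
    }
    t.text[i] = '\0';
    return t;
}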
The lexical analyzer skips whitespace and comments while creating these tokens and, if any error is present, correlates that error with the source file and line number. For example, consider the following C program and the tokens produced for it:
#include <stdio.h>

int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}
Examples of Tokens created
Lexeme    Token
int       Keyword
maximum   Identifier
(         Operator
int       Keyword
x         Identifier
,         Operator
int       Keyword
y         Identifier
)         Operator
{         Operator
if        Keyword
Examples of Nontokens
Type                      Examples
Comment                   // This will compare 2 numbers
Pre-processor directive   #include <stdio.h>
Pre-processor directive   #define NUMS 8,9
Macro                     NUMS
Whitespace                \n \b \t
Lexical Errors
A character sequence that cannot be scanned into any valid token is a lexical error.
Important facts about lexical errors:
● Lexical errors are not very common, but they should be managed by the scanner.
● Misspellings of identifiers, operators, and keywords are considered lexical errors.
● Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token (for example, a stray @ in a C program).
3. Input Buffering
The lexical analyzer scans the input from left to right one character at a time. It uses two pointers, a begin pointer (bp) and a forward pointer (fp), to keep track of the portion of the input scanned. Initially, both pointers point to the first character of the input string.
The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is encountered, it indicates the end of the lexeme; for example, while scanning a declaration such as int i, j; the moment fp encounters the blank space after int, the lexeme “int” is identified.
When fp encounters white space, it ignores it and moves ahead; then both the begin pointer (bp) and the forward pointer (fp) are set at the beginning of the next token.
Input characters are thus read from secondary storage, but reading from secondary storage one character at a time is costly. Hence a buffering technique is used: a block of data is first read into a buffer and then scanned by the lexical analyzer. There are two methods used in this context: the one-buffer scheme and the two-buffer scheme. These are explained below.
1. One Buffer Scheme:
In this scheme, only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
2. Two Buffer Scheme:
To overcome this problem, two buffers of the same size are used and filled alternately, with a special eof character marking the end of each buffer. When fp encounters the first eof, the end of the first buffer is recognized and filling of the second buffer begins; similarly, the end of the second buffer is recognized by the eof mark present at its end. Alternating in this way, both buffers can be refilled until the end of the input program is reached and the stream of tokens is identified. The eof character introduced at the end of a buffer is called a sentinel, and it is used to identify the end of the buffer.
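As a concrete illustration, here is a minimal C sketch of the sentinel check at the heart of the two-buffer scheme. The names (BUF_SIZE, buf1, buf2, fill_buffer, next_char) and the use of '\0' as the sentinel are assumptions for this sketch, not a standard API; the point is that the common case costs only a single comparison per character.

#include <stdio.h>

#define BUF_SIZE 4096
#define SENTINEL '\0'   /* assumes the source text contains no NUL bytes */

static char buf1[BUF_SIZE + 1], buf2[BUF_SIZE + 1];  /* +1 slot for the sentinel */
static char *forward;   /* the forward pointer (fp) */
static FILE *src;

/* Read the next block into buf and mark its end with the sentinel. */
static void fill_buffer(char *buf) {
    size_t n = fread(buf, 1, BUF_SIZE, src);
    buf[n] = SENTINEL;
}

/* Advance fp by one character, reloading the other buffer at a sentinel. */
static int next_char(void) {
    char c = *forward++;
    if (c != SENTINEL)
        return (unsigned char)c;          /* common case: exactly one test */
    if (forward == buf1 + BUF_SIZE + 1) { /* sentinel at the end of buffer 1 */
        fill_buffer(buf2);
        forward = buf2;
        return next_char();
    }
    if (forward == buf2 + BUF_SIZE + 1) { /* sentinel at the end of buffer 2 */
        fill_buffer(buf1);
        forward = buf1;
        return next_char();
    }
    return EOF;                           /* sentinel inside a buffer: real end of input */
}

int main(void) {
    src = stdin;
    fill_buffer(buf1);
    forward = buf1;
    long count = 0;
    while (next_char() != EOF)
        count++;
    printf("%ld characters read\n", count);
    return 0;
}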
4. Specification of Tokens
In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The
length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example,
banana is a string of length six. The empty string, denoted ε, is the string of length zero.
a. Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of
string s. For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning
of s. For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example, nan is a
substring of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more, not necessarily consecutive, symbols of s. For example, baan is a subsequence of banana.
b. Operations on languages:
The following are the operations that can be applied to languages:
1. Union → The union of two languages L and M is written as
L ∪ M = { s | s is in L or s is in M }
2. Concatenation → The concatenation of two languages L and M is written as
LM = { st | s is in L and t is in M }
3. Kleene closure → The Kleene closure of a language L is written as
L* = zero or more occurrences of language L
4. Positive closure → The positive closure of a language L is written as
L+ = one or more occurrences of language L
The following example shows these operations on languages. Let L = {0, 1} and S = {a, b, c}:
● L ∪ S = {0, 1, a, b, c}
● LS = {0a, 0b, 0c, 1a, 1b, 1c}
● L* = the set of all strings over {0, 1}, including ε
● L+ = the set of all non-empty strings over {0, 1}
c. Regular Expressions
Each regular expression r denotes a language L(r).
Here are the rules that define the regular expressions over some alphabet Σ and the languages that
those expressions denote:
1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose only member is the empty
string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language with
one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has the highest precedence.
5. Concatenation has the second highest precedence.
6. | has the lowest precedence.
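For example, under these conventions the regular expression (a)|((b)*(c)) may be written as a|b*c; both denote the language containing the single string a together with all strings of zero or more b's followed by a c.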
5. Lex
Lex is a program generator for lexical analyzers: it takes a specification of token patterns and produces C source code for a scanner.
● The Lex source program (conventionally a file such as lex.l) is run through the Lex compiler to produce the C file lex.yy.c.
● Finally, lex.yy.c is run through the C compiler to produce an object program a.out.
The output of the C compiler is the working lexical analyzer, which takes a stream of input characters and produces a stream of tokens.
A Lex program consists of three sections:
1. Definition Section: The definition section contains declarations of variables, regular definitions, and constants. In the definition section, text is enclosed in “%{ %}” brackets. Anything written inside these brackets is copied directly to the file lex.yy.c.
Syntax:
%{
// Definitions
%}
2. Rules Section: The rules section contains a series of rules of the form: pattern action. The pattern must be un-indented (it must start at the beginning of the line), and the action must begin on the same line, enclosed in {} braces. The rules section is enclosed between “%%” and “%%”.
Syntax:
%%
pattern action
%%
3. User Code Section: This section contains C statements and additional functions. These functions can also be compiled separately and linked with the lexical analyzer.
Basic Program Structure:
%{
// Definitions
%}
%%
Rules
%%
User code
Example: Count the number of characters and the number of lines in the input.
%{
int no_of_lines = 0;
int no_of_chars = 0;
%}
%%
\n { ++no_of_lines; }
. { ++no_of_chars; }
%%
int yywrap() { return 1; }
int main()
{
yylex();
printf("Number of lines = %d\n", no_of_lines);
printf("Number of characters = %d\n", no_of_chars);
return 0;
}
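Assuming the specification above is saved as count.l (the file name is just for illustration), it can be built and run as follows:

lex count.l
cc lex.yy.c -o count
./count < input.txt

lex translates the specification into the C file lex.yy.c, the C compiler turns that into an executable, and the program prints the line and character counts for input.txt. Because the program supplies its own main() and yywrap(), it does not need to be linked against the Lex library.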
6. Finite Automata
A finite automaton is a five-tuple (Q, Σ, δ, q0, F): a finite set of states Q, an input alphabet Σ, a transition function δ, a start state q0, and a set of final states F. It recognizes patterns by taking a string of symbols as input and changing its state accordingly. A finite automaton can be described by:
● a transition diagram
● a transition table
● a transition function (δ)
Transition Diagram
A transition diagram or state transition diagram is a directed graph constructed as follows:
● Each state is drawn as a node (a circle).
● Each transition is drawn as a directed edge labelled with the input symbol on which it occurs.
● The start state is marked with an incoming arrow, and final states are drawn as double circles.
(The diagram for the following example is omitted in the source; its inputs are {0, 1}.)
Transition Table
State    0     1
→q1      q1    q2
q2       qf    q2
*qf      -     qf
(→ marks the start state, * marks the final state, and - means no transition.)
Transition Function
δ(q1, 1) = q2
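To make the notation concrete, below is a small C sketch (an illustration added here, not part of the source) that simulates this automaton by indexing into its transition table; states q1, q2, qf are encoded as 0, 1, 2, and -1 marks a missing transition.

#include <stdio.h>

/* Row = state (0 = q1, 1 = q2, 2 = qf); column = input symbol (0 or 1). */
static const int delta[3][2] = {
    { 0,  1 },   /* q1: on 0 -> q1, on 1 -> q2 */
    { 2,  1 },   /* q2: on 0 -> qf, on 1 -> q2 */
    { -1, 2 },   /* qf: on 0 -> none, on 1 -> qf */
};

/* Returns 1 if the automaton accepts the given string of '0'/'1' characters. */
static int accepts(const char *s) {
    int state = 0;                 /* start in q1 */
    for (; *s; s++) {
        int sym = *s - '0';
        if (sym < 0 || sym > 1)
            return 0;              /* symbol not in the alphabet {0, 1} */
        state = delta[state][sym];
        if (state < 0)
            return 0;              /* no transition: the machine dies */
    }
    return state == 2;             /* accept iff we stop in the final state qf */
}

int main(void) {
    printf("10 -> %d\n", accepts("10"));  /* q1 -1-> q2 -0-> qf: accepted */
    printf("11 -> %d\n", accepts("11"));  /* q1 -1-> q2 -1-> q2: rejected */
    return 0;
}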
Example 1:
A DFA with Σ = {0, 1} that accepts all strings ending with 0.
Solution: (the transition diagram is omitted in the source; the table below realizes this language)
State    0     1
→q0      q1    q0
*q1      q1    q0
Example 2:
An NFA with Σ = {0, 1} whose language, as the following table shows, is the set of strings starting with 01.
Transition Table:
State    0     1
→q0      q1    -
q1       -     q2
*q2      q2    q2
Transition Function:
δ(q0, 0) = {q1}
δ(q0, 1) = ϕ
δ(q1, 0) = ϕ
δ(q1, 1) = {q2}
δ(q2, 0) = {q2}
δ(q2, 1) = {q2}
Example 3:
An NFA with ∑ = {0, 1} that accepts all strings of length at least 2.
Solution:
Transition Table:
State    0     1
→q0      q1    q1
q1       q2    q2
*q2      q2    q2
State q2 loops on both inputs, so every string of length two or more is accepted.
Conversion from NFA to DFA (subset construction)
Step 1: Initially Q' = ϕ.
Step 2: Add the start state of the NFA to Q' as the start state of the DFA.
Step 3: In Q', find the possible set of states for each input symbol. If this set of states is not in Q', then add it to Q'.
Step 4: In the DFA, the final states will be all the states which contain final states of the NFA.
Example 1:
(The NFA for this example, over Σ = {a, b} with states q0, q1, q2, is given by a transition diagram omitted in the source.)
From the start state q0, one of the moves yields the set { q0, q1 }. Now { q0, q1 } will be considered as a single state. As its entry is not in Q', add it to Q'.
So Q' = { q0, { q0, q1 } }
The moves from state { q0, q1 } on the different input symbols are not yet present in the transition table of the DFA, so we calculate them as the union of the moves of q0 and q1 (the detailed calculation is omitted in the source); this yields one further state, { q0, q2 }.
As no new state is generated after that, we are done with the conversion. The final state of the DFA will be state { q0, q2 }.
The parameters of the resulting DFA are as follows.
Q' = { q0, { q0, q1 }, { q0, q2 } }
∑ = { a, b }
F = { { q0, q2 } }, with the transition function δ' as calculated above.
The final DFA for the above NFA is shown in Figure 2 (the figure is omitted in the source).
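Because the subset construction is completely mechanical, it is easy to code. The sketch below uses C bitmasks to represent subsets of NFA states; the hard-coded NFA is an assumed example (δ(q0,a) = {q0,q1}, δ(q0,b) = {q0}, δ(q1,b) = {q2}) chosen so that the algorithm reproduces exactly the DFA states Q' = { q0, { q0, q1 }, { q0, q2 } } obtained above.

#include <stdio.h>

#define NSTATES 3   /* NFA states: bit 0 = q0, bit 1 = q1, bit 2 = q2 */
#define NSYMS   2   /* input symbols: column 0 = a, column 1 = b */

/* nfa[state][symbol] is a bitmask of successor states (assumed example). */
static const unsigned nfa[NSTATES][NSYMS] = {
    { (1u << 0) | (1u << 1), 1u << 0 },  /* q0: a -> {q0,q1}, b -> {q0} */
    { 0u,                    1u << 2 },  /* q1: b -> {q2} */
    { 0u,                    0u      },  /* q2: no moves */
};

/* A DFA state is a subset of NFA states, stored as a bitmask. */
static unsigned move(unsigned set, int sym) {
    unsigned out = 0u;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1u << s))
            out |= nfa[s][sym];
    return out;
}

int main(void) {
    unsigned dstates[1u << NSTATES];
    int n = 0;
    dstates[n++] = 1u << 0;              /* Steps 1-2: start subset {q0} = mask 1 */
    for (int i = 0; i < n; i++) {        /* Step 3: process each discovered subset */
        for (int sym = 0; sym < NSYMS; sym++) {
            unsigned t = move(dstates[i], sym);
            int seen = 0;
            for (int j = 0; j < n; j++)
                if (dstates[j] == t) seen = 1;
            if (!seen)
                dstates[n++] = t;        /* new DFA state discovered */
            /* masks: 1 = {q0}, 3 = {q0,q1}, 5 = {q0,q2} */
            printf("delta'(%u, %c) = %u\n", dstates[i], "ab"[sym], t);
        }
    }
    /* Step 4: subsets whose bitmask includes q2 (bit 2) are final. */
    for (int i = 0; i < n; i++)
        if (dstates[i] & (1u << 2))
            printf("final DFA state: %u\n", dstates[i]);
    return 0;
}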
A non-deterministic finite automaton (NFA) is a finite automaton in which, for some state and input symbol, the machine can go to more than one state. It can contain ε-moves. It can be represented as M = (Q, ∑, δ, q0, F).
NFA with ε-moves: If a finite automaton contains any ε-transition (a move on the empty string), it is called an NFA with ε-moves.
ε-closure: The ε-closure of a state A is the set of states that can be reached from A using only ε (null) moves, including the state A itself. The state itself is always included in its own ε-closure, even when it has no explicit ε-transition.
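A small worked instance of the definition: if δ(A, ε) = {B} and δ(B, ε) = {C}, then ε-closure(A) = {A, B, C}, ε-closure(B) = {B, C}, and ε-closure(C) = {C}.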
Conversion from NFA with ε-moves to DFA:
Step 1: Take the ε-closure of the starting state of the NFA as the starting state of the DFA.
Step 2: Find the states that can be reached from the present state on each input symbol; that is, take the union of the transition values and their ε-closures for each NFA state contained in the current DFA state.
Step 3: If a new state is found, take it as the current state and repeat Step 2.
Step 4: Repeat Step 2 and Step 3 until no new state appears in the transition table of the DFA.
Step 5: Mark as final every DFA state that contains a final state of the NFA.
Example 1:
(The NFA with ε-moves for this example, with states A, B, C, is omitted in the source.)
Hence,
1. δ'(A, 0) = B
2. δ'(A, 1) = B
3. δ'(B, 0) = ϕ
4. δ'(B, 1) = C
5. δ'(C, 0) = ϕ
6. δ'(C, 1) = ϕ