
UNIT 2 Compiler Design



1. What is Lexical Analysis?

Lexical analysis is the first phase of compiler design. A lexer takes the source code as a stream of characters and converts it into a sequence of tokens, removing extra whitespace and comments along the way.
Programs that perform lexical analysis in compiler design are called lexical analyzers or lexers.
A lexer contains a tokenizer or scanner. If the lexical analyzer detects an invalid token, it
generates an error. The role of the lexical analyzer in compiler design is to read the character stream
of the source code, check for legal tokens, and pass the tokens to the syntax analyzer when it
demands them.

Example

How Pleasant Is The Weather?


Consider this lexical analysis example: here we can easily recognize that there are five words: How,
Pleasant, Is, The, Weather. This comes naturally to us because we can recognize the separators, blanks,
and the punctuation symbol.

HowPl easantIs Th ewe ather?


Now check this example. We can still read it, but it takes some time because the
separators are in odd places. The words no longer come to us immediately.

What’s a lexeme?

A lexeme is a sequence of characters in the source program that matches the pattern for a
token. It is simply an instance of a token.

What’s a token?

Tokens in compiler design are sequences of characters that represent a unit of information
in the source program.
What is Pattern?

A pattern is a rule describing the set of lexemes that can represent a particular token. In the case of a
keyword used as a token, the pattern is simply the keyword's character sequence.

Lexical Analyzer Architecture: How tokens are recognized

The main task of lexical analysis is to read input characters in the code and produce tokens.

Lexical analyzer scans the entire source code of the program. It identifies each token one by one.
Scanners are usually implemented to produce tokens only when requested by a parser. Here is
how recognition of tokens in compiler design works-

1. Lexical analysis is the first phase of the compiler.

2. Its main task is to read input characters and produce tokens.

3. “Get next token” is a command sent from the parser to the lexical analyzer.
4. On receiving this command, the lexical analyzer scans the input until it finds the next
token.
5. It skips comments and whitespace while creating these tokens. It also handles compiler
directives.
6. If any error is present, the lexical analyzer correlates that error with the source file and
line number.
7. It returns the token to the parser.

2. Roles of the Lexical analyzer

The lexical analyzer performs the following tasks:

● Identifies tokens and enters them into the symbol table


● Removes whitespace and comments from the source program
● Correlates error messages with the source program
● Expands macros found in the source program
● Reads input characters from the source program

Example of Lexical Analysis, Tokens, Non-Tokens

Consider the following code that is fed to Lexical Analyzer

#include <stdio.h>
int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}
Examples of Tokens created

Lexeme     Token

int        Keyword
maximum    Identifier
(          Operator
int        Keyword
x          Identifier
,          Operator
int        Keyword
y          Identifier
)          Operator
{          Operator
if         Keyword

Examples of Nontokens

Type                      Examples
Comment                   // This will compare 2 numbers
Pre-processor directive   #include <stdio.h>
Pre-processor directive   #define NUMS 8,9
Macro                     NUMS
Whitespace                \n \b \t

Lexical Errors

A character sequence which is not possible to scan into any valid token is a lexical error.
Important facts about the lexical error:

● Lexical errors are not very common, but they should be managed by the scanner
● Misspellings of identifiers, operators, or keywords are considered lexical errors
● Generally, a lexical error is caused by the appearance of some illegal character, mostly at
the beginning of a token

3. Input Buffering
The lexical analyzer scans the input from left to right one character at a time. It uses two pointers,
begin ptr (bp) and forward ptr (fp), to keep track of the portion of input scanned.

Initially both pointers point to the first character of the input string.

The forward ptr moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme; for example, when the input begins with "int ",
as soon as fp encounters the blank space the lexeme “int” is identified.

When fp encounters whitespace, it skips over it, and then both the begin ptr (bp) and forward
ptr (fp) are set to the start of the next token.

Input characters are read from secondary storage, but reading character by character from secondary
storage is costly. Hence a buffering technique is used: a block of data is first read into a buffer, and
then scanned by the lexical analyzer. Two schemes are used in this context: the One Buffer Scheme
and the Two Buffer Scheme. These are explained below.
1. One Buffer Scheme:
In this scheme, only one buffer is used to store the input string. The problem with this
scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the
lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.

2. Two Buffer Scheme:


To overcome the problem of the one-buffer scheme, this method uses two buffers to store
the input string. The two buffers are scanned alternately; when the end of the
current buffer is reached, the other buffer is filled. The only remaining problem is
that if a lexeme is longer than the combined buffer length, the input still cannot
be scanned completely.
Initially both bp and fp point to the first character of the first buffer. Then fp
moves towards the right in search of the end of a lexeme. As soon as a blank character is
recognized, the string between bp and fp is identified as the corresponding token. To identify the
boundary of the first buffer, an end-of-buffer character is placed at the end of the first
buffer.

Similarly, the end of the second buffer is recognized by the end-of-buffer mark present at its
end. When fp encounters the first eof, it recognizes the end of the first buffer, and
filling of the second buffer begins. In the same way, the second eof indicates the end
of the second buffer. The two buffers are filled alternately in this fashion until the end of the
input program is reached and the stream of tokens is identified. The eof character introduced at
the end of each buffer is called a sentinel, and it is used to identify the end of the buffer.

4. Specification of Tokens

There are 3 specifications of tokens:


1) Strings
2) Language
3) Regular expression

1) and 2) Strings and Languages


An alphabet or character class is a finite set of symbols.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
A language is any countable set of strings over some fixed alphabet.

In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The
length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example,
banana is a string of length six. The empty string, denoted ε, is the string of length zero.

a. Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of
string s. For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning
of s. For example, nana is a suffix of banana.

3. A substring of s is obtained by deleting any prefix and any suffix from s. For example, nan is a
substring of banana.

4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and
substrings, respectively, of s that are not ε and not equal to s itself.

5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive
positions of s. For example, baan is a subsequence of banana.

b. Operations on languages:
The following are the operations that can be applied to languages:
1. Union → The union of two languages L and M is written as
L ∪ M = { s | s is in L or s is in M }
2. Concatenation → The concatenation of two languages L and M is written as
LM = { st | s is in L and t is in M }
3. Kleene closure → The Kleene closure of a language L is written as
L* = zero or more occurrences of the language L
4. Positive closure → The positive closure of a language L is written as
L+ = one or more occurrences of the language L
The following example shows the operations on languages. Let L = {0, 1} and S = {a, b, c}. Then L ∪ S = {0, 1, a, b, c}, LS = {0a, 0b, 0c, 1a, 1b, 1c}, and L* is the set of all strings of 0s and 1s, including ε.

3) Regular Expressions
Each regular expression r denotes a language L(r).
Here are the rules that define the regular expressions over some alphabet Σ and the languages that
those expressions denote:
1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose member is the empty
string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language with
one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has the highest precedence.
5. Concatenation has the second highest precedence.
6. | has the lowest precedence.
Thus a|bc* is grouped as a|(b(c*)).

5. A LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS

There is a wide range of tools for constructing lexical analyzers.


Lex
Lex is a tool used in the lexical analysis phase to recognize tokens using regular expressions. Lex is
commonly used together with the yacc parser generator.
Creating a lexical analyzer
• First, a specification of the lexical analyzer is prepared by creating a program lex.l in the Lex
language. Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.

• Finally, lex.yy.c is run through the C compiler to produce an object program a.out.
The output of the C compiler is the working lexical analyzer, which takes a stream of input characters
and produces a stream of tokens.

In the input file, there are 3 sections:

1. Definition Section: The definition section contains declarations of variables, regular
definitions, and constants. In the definition section, text is enclosed in “%{ %}” brackets.
Anything written inside these brackets is copied directly to the file lex.yy.c.
Syntax:
%{

// Definitions

%}

2. Rules Section: The rules section contains a series of rules of the form: pattern action. The
pattern must be unindented, and the action must begin on the same line, enclosed in {} brackets.
The rules section is enclosed between “%%” markers.
Syntax:
%%

pattern action

%%
3. User Code Section: This section contains C statements and additional functions. We can also
compile these functions separately and load with the lexical analyzer.
Basic Program Structure:

%{

// Definitions

%}

%%

Rules

%%

User code section

Example : Count the number of characters and number of lines in the input

%{

int no_of_lines = 0;

int no_of_chars = 0;

%}

%%

\n { ++no_of_lines; }

. { ++no_of_chars; }

end { return 0;}

%%

int yywrap(){ return 1; }

int main()
{

yylex();

printf("number of lines = %d, number of chars = %d\n" , no_of_lines, no_of_chars );

return 0;
}

6. Finite Automata

A finite automaton is defined by a five-tuple (Q, Σ, q0, qf, δ) of states and transitions on
input symbols.
It recognizes patterns by taking a string of symbols as input and changing its state accordingly.
The mathematical model for finite automata consists of;

● Finite set of states(Q)

● Finite set of input symbols(Σ)

● One start state(q0)

● Set of final states(qf)

● Transition function(δ)

Finite Automata can be represented as follows −

● Transition diagram

● Transition table

● Transition function

Transition Diagram

A transition diagram or state transition diagram is a directed graph which can be constructed as
follows:

o There is a node for each state in Q, represented by a circle.


o There is a directed edge from node q to node p labeled a if δ(q, a) = p.
o The start state is marked by an incoming arrow with no source.
o Accepting or final states are indicated by a double circle.

An example of transition diagram is given below −

Here,

● {0,1}: Inputs

● q1: Initial state

● q2: Intermediate state

● qf: Final state

Transition Table

A transition table represents the automaton as follows:

o Columns correspond to input symbols.


o Rows correspond to states.
o Entries give the next state.
o The start state is denoted by an arrow with no source.
o The accepting state is denoted by a star.

The transition table is as follows −


State/input symbol    0     1

->q1                  q1    q2
q2                    qf    q2
*qf                   -     qf

Transition function

The transition function is denoted by δ.

δ (current_state, current_input_symbol) = next_state

For example, δ(q0,a)=q1

From Above table: δ(q1,0)=q1

δ(q1,1)=q2

Finite Automaton can be classified into two types −

● Deterministic Finite Automaton (DFA)

● Non-deterministic Finite Automaton (NDFA / NFA)


DFA (Deterministic finite automata)
● DFA stands for deterministic finite automaton. “Deterministic” refers to the uniqueness of the
computation: the machine reads the input string one symbol at a time, and for each state and
input symbol there is exactly one next state.
Example 1:

DFA with ∑ = {0, 1} accepting all strings starting with 0.

Solution:

Example 2:
DFA with Σ = {0, 1} accepts all strings ending with 0.

Figure: DFA with Σ = {0, 1}

NFA (Non-Deterministic finite automata)


o NFA stands for non-deterministic finite automaton. It is easier to construct an NFA than a DFA
for a given regular language.
o A finite automaton is called an NFA when there can be several paths for a specific input from
the current state to the next state.
o Not every NFA is a DFA, but every NFA can be translated into a DFA.
o An NFA is defined in the same way as a DFA, but with the following two exceptions: a state may
have multiple next states for the same input, and it may contain ε-transitions.

Example 1:

NFA with ∑ = {0, 1} accepting all strings starting with 01.


Solution:

Transition Table:

Present State   Next State for Input 0   Next State for Input 1

→q0             q1                       -
q1              -                        q2
*q2             q2                       q2

Transition Function:
δ(q0, 0) = {q1}
δ(q0, 1) = {-}
δ(q1, 0) = {-}
δ(q1, 1) = {q2}
δ(q2, 0) = {q2}
δ(q2, 1) = {q2}
Example 2:
NFA with ∑ = {0, 1} accepting all strings of length at least 2.
Solution:
Transition Table:

Present State   Next State for Input 0   Next State for Input 1

→q0             q1                       q1
q1              q2                       q2
*q2             q2                       q2

Conversion from NFA to DFA


In this section, we will discuss the method of converting NFA to its equivalent DFA. Steps for
converting NFA to DFA:
Step 1: Initially Q' = ϕ.
Step 2: Add q0 of the NFA to Q'. Then find the transitions from this start state.

Step 3: For each state in Q', find the set of states reachable for each input symbol. If this set of
states is not in Q', add it to Q'.

Step 4: The final states of the DFA are all those states that contain a final state of the NFA.

Example 1:

Consider the following NFA shown in Figure 1.


Step 1: Q’ = ɸ
Step 2: Q’ = {q0}
Step 3: For each state in Q’, find the states reachable on each input symbol.
Currently the only state in Q’ is q0. Find the moves from q0 on input symbols a and b using the
transition function of the NFA, and update the transition table of the DFA.
Now we will obtain δ' transition for state q0.
δ'([q0], a) = [q0, q1]
δ'([q0], b) = [q0]

Now { q0, q1 } will be considered as a single state. As its entry is not in Q’, add it to Q’.
So Q’ = { q0, { q0, q1 } }
Since the moves from state { q0, q1 } on the input symbols are not yet in the transition table of the
DFA, we calculate them:

δ’ ( { q0, q1 }, a ) = δ ( q0, a ) ∪ δ ( q1, a )


= { q0, q1 }

δ’ ( { q0, q1 }, b ) = δ ( q0, b ) ∪ δ ( q1, b )


= { q0, q2 }
Now we will update the transition table of DFA.
Now { q0, q2 } will be considered as a single state.
As its entry is not in Q’, add it to Q’.

So Q’ = { q0, { q0, q1 }, { q0, q2 } }


Since the moves from state {q0, q2} on the input symbols are not yet in the transition table of the
DFA, we calculate them:

δ’ ( { q0, q2 }, a ) = δ ( q0, a ) ∪ δ ( q2, a )


= { q0, q1 }
δ’ ( { q0, q2 }, b ) = δ ( q0, b ) ∪ δ ( q2, b )
= { q0 }

Now we will update the transition table of DFA.


δ’ (transition function of the DFA):

State        a           b

→[q0]        [q0, q1]    [q0]
[q0, q1]     [q0, q1]    [q0, q2]
*[q0, q2]    [q0, q1]    [q0]

As no new state is generated, the conversion is complete. The final state of the DFA is
state { q0, q2 }.
Following are the various parameters for DFA.
Q’ = { q0, { q0, q1 }, { q0, q2 } }
∑ = { a, b }
F = { { q0, q2 } } and transition function δ’ as shown above.
The final DFA for above NFA has been shown in Figure 2.

Conversion from NFA with ε to DFA

A non-deterministic finite automaton (NFA) is a finite automaton where, for some states, a given
input can take the machine to more than one state. It can contain ε-moves. It is represented as
M = { Q, ∑, δ, q0, F }.
NFA with ε-moves: if a finite automaton contains any ε-transition, it is called an NFA
with ε-moves.
ε-closure: the ε-closure of a state A is the set of states that can be reached from state A
using only ε (null) moves, including the state A itself. A state is always a member of its own
ε-closure, even when it has no outgoing ε-transitions.

Steps for converting NFA with ε to DFA:

Step 1: Take the ε-closure of the starting state of the NFA as the starting state of the DFA.

Step 2: For each input symbol, find the states that can be reached from the present state: that is,
the union of the transition values and their ε-closures for each NFA state in the current state
of the DFA.

Step 3: If a new state is found, take it as the current state and repeat Step 2.

Step 4: Repeat Steps 2 and 3 until no new state appears in the transition table of the DFA.

Step 5: Mark as final those DFA states that contain a final state of the NFA.

Example 1:

Convert the NFA with ε into its equivalent DFA.


Solution:

Let us obtain ε-closure of each state.

1. ε-closure {q0} = {q0, q1, q2}


2. ε-closure {q1} = {q1}
3. ε-closure {q2} = {q2}
4. ε-closure {q3} = {q3}
5. ε-closure {q4} = {q4}

Now, let ε-closure {q0} = {q0, q1, q2} be state A.

Hence,

δ'(A, 0) = ε-closure {δ((q0, q1, q2), 0) }


= ε-closure {δ(q0, 0) ∪ δ(q1, 0) ∪ δ(q2, 0) }
= ε-closure {q3}
= {q3} New entry; call it state B.

δ'(A, 1) = ε-closure {δ((q0, q1, q2), 1) }


= ε-closure {δ(q0, 1) ∪ δ(q1, 1) ∪ δ(q2, 1) }
= ε-closure {q3}
= {q3} = B, the same state B.

Thus we have obtained

1. δ'(A, 0) = B
2. δ'(A, 1) = B

The partial DFA will be


Now, for state B

δ'(B, 0) = ε-closure {δ(q3, 0) }
= ε-closure {ϕ}
= ϕ

δ'(B, 1) = ε-closure {δ(q3, 1) }
= ε-closure {q4}
= {q4} New entry; call it state C.

Thus we have obtained

3. δ'(B, 0) = ϕ
4. δ'(B, 1) = C

Now, For state C:

1. δ'(C, 0) = ε-closure {δ(q4, 0) } = ε-closure {ϕ} = ϕ

2. δ'(C, 1) = ε-closure {δ(q4, 1) } = ε-closure {ϕ} = ϕ

Thus we have obtained

5. δ'(C, 0) = ϕ
6. δ'(C, 1) = ϕ

Since C = { q4 } contains the final state q4, C is a final state of the DFA.

The DFA will be,
