
CHAPTER 02:
LEXICAL ANALYSIS

1. Overview of lexical analysis


 To identify tokens, we need some method of describing the possible tokens that can
appear in the input stream. For this purpose, we introduce regular expressions, a notation that
can be used to describe essentially all the tokens of a programming language.
 Secondly, having decided what the tokens are, we need some mechanism to recognize them
in the input stream. This is done by token recognizers, which are designed using
transition diagrams and finite automata.

2. Role of lexical analyzer


Lexical analysis is the first phase of a compiler. Its main task is to read the input characters
and produce as output a sequence of tokens that the parser uses for syntax analysis.


Figure 2.1: Role of lexical analyzer


Upon receiving a "get next token" command from the parser, the lexical analyzer reads
input characters until it can identify the next token. It then returns to the parser a
representation of the token it found. The representation will be an integer code if the token is a
simple construct such as a parenthesis, comma, or colon.
The lexical analyzer may also perform certain secondary tasks at the user interface. One
such task is stripping out comments and white space from the source program, in the form
of blank, tab, and newline characters. Another is correlating error messages from the compiler
with the source program.
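
As a minimal sketch of this interaction (the token codes and the canned token stream are hypothetical, for illustration only), the parser can be seen as pulling integer-coded tokens from the scanner one request at a time:

#include <stdio.h>

/* Hypothetical integer codes for simple constructs */
enum token { TOK_EOF, TOK_ID, TOK_LPAREN, TOK_RPAREN, TOK_COMMA };

/* Stub scanner: replays a canned token stream, standing in for real lexing */
static enum token stream[] = { TOK_ID, TOK_LPAREN, TOK_ID, TOK_COMMA,
                               TOK_ID, TOK_RPAREN, TOK_EOF };
static int pos = 0;

enum token get_next_token(void) { return stream[pos++]; }

int main(void) {
    enum token t;
    /* the parser drives the scanner: one "get next token" request at a time */
    while ((t = get_next_token()) != TOK_EOF)
        printf("token code: %d\n", t);
    return 0;
}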

3. Regular Expressions
A regular expression is a formula that describes a set of strings.
The components of a regular expression are:
x        the character x
.        any character, usually except newline
[xyz]    any one of the characters x, y, z, …
R?       an R or nothing (i.e., R optionally)
R*       zero or more occurrences of R
R+       one or more occurrences of R
R1R2     an R1 followed by an R2
R1|R2    either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view
the set of strings in each token class as a language, we can use the regular-expression notation
to describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or
digits. In regular-expression notation, we would write:
Identifier = letter (letter | digit)*


Here are the rules that define regular expressions over an alphabet ∑:
 ε is a regular expression denoting {ε}, the language containing only the
empty string.
 For each 'a' in ∑, a is a regular expression denoting {a}, the language with only
one string, consisting of the single symbol 'a'.
 If R and S are regular expressions denoting the languages LR and LS, then
(R) | (S) denotes LR ∪ LS
(R).(S) denotes LR.LS
(R)* denotes LR*

3.1 Operations

The various operations on languages are:

 Union of two languages L and M is written as L ∪ M = {s | s is in L or s is in M}

 Concatenation of two languages L and M is written as LM = {st | s is in L and t is in M}

 The Kleene closure of a language L, written L*, is the set of strings formed by
concatenating zero or more strings of L.
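
For instance, if L = {a, b} and M = {0, 1}, then L ∪ M = {a, b, 0, 1}, LM = {a0, a1, b0, b1}, and L* = {ε, a, b, aa, ab, ba, bb, aaa, …}.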

3.2 Notations

If r and s are regular expressions denoting the languages L(r) and L(s), then

 Union : (r)|(s) is a regular expression denoting L(r) U L(s)

 Concatenation : (r)(s) is a regular expression denoting L(r)L(s)

 Kleene closure : (r)* is a regular expression denoting (L(r))*

 (r) is a regular expression denoting L(r)

3.3 Precedence and Associativity

 *, concatenation (.), and | (pipe sign) are left associative

 * has the highest precedence

 Concatenation (.) has the second highest precedence.


 | (pipe sign) has the lowest precedence of all.

3.4 Representing valid tokens of a language in regular expressions

If x is a regular expression, then:

 x* means zero or more occurrences of x,

i.e., it can generate { ε, x, xx, xxx, xxxx, … }

 x+ means one or more occurrences of x,

i.e., it can generate { x, xx, xxx, xxxx, … }; equivalently, x+ = x.x*

 x? means at most one occurrence of x,

i.e., it can generate either {x} or {ε}.

[a-z] matches any lowercase letter of the English alphabet.

[A-Z] matches any uppercase letter of the English alphabet.

[0-9] matches any decimal digit.

3.5 Representing the occurrence of symbols using regular expressions

letter = A | B | … | Z | a | b | … | z or [A-Za-z]

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]

sign = + | - or [+-]

3.6 Representing language tokens using regular expressions

Decimal = (sign)? (digit)+

ident = letter (letter | digit)* or ident = [A-Za-z] [A-Za-z0-9]*
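
As a quick check of these definitions (a minimal sketch using the POSIX regex.h API; the anchors ^ and $ are added so that the whole string must match the pattern):

#include <regex.h>
#include <stdio.h>

/* Returns 1 if the whole string s matches the extended regex pattern */
static int matches(const char *pattern, const char *s) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return 0;  /* bad pattern */
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    const char *decimal = "^[+-]?[0-9]+$";          /* (sign)? (digit)+        */
    const char *ident   = "^[A-Za-z][A-Za-z0-9]*$"; /* letter (letter|digit)*  */
    printf("%d %d\n", matches(decimal, "-42"), matches(decimal, "x1"));  /* 1 0 */
    printf("%d %d\n", matches(ident, "x1"), matches(ident, "9lives"));   /* 1 0 */
    return 0;
}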

The remaining problem for the lexical analyzer is how to check that an input string matches
one of the regular expressions specifying the tokens of a language. A well-accepted
solution is to use finite automata for this recognition.


Example 1

For each of the following regular expressions ri, we want to determine the language denoted
by ri.

r1 = (a|b)(a|b)

r2 = (a|b)*

r3 = (a*b*)*

r4 = a | a*b

Each language denoted by ri is written L(ri).

L(r1) = {aa, ab, ba, bb}

L(r2) = {ε, a, aa, aaa, …, b, bb, bbb, …, ab, aab, abb, abbb, ba, bba, bbaa, aba, …}. It is the set
of all strings composed of any number of a's and any number of b's, in any order.

L(r3) = L(r2)

L(r4) = {a, b, ab, aab, aaab, …}. In addition to the string a, this set contains all strings
composed of zero or more a's followed by a single b.

4. Approaches to constructing lexical analyzers

Constructing lexical analyzers is a crucial step in the compilation process of a programming
language. It involves analyzing the input source code and breaking it down into lexical units
(or tokens) such as keywords, identifiers, operators, constants, etc. There are several approaches
to constructing lexical analyzers, including:
a) Using Regular Expressions:
The most common approach is to use regular expressions to define patterns for each type of
token in the language. Each regular expression describes the structure of a type of token, and a
lexical analyzer uses these expressions to identify and extract tokens from the source code.
b) Deterministic Finite Automata (DFA):
DFAs are finite state machines that can be used to recognize lexical patterns in the source code.
Each state in the automaton represents a possible position in the currently recognized pattern,
and transitions between states are triggered by the characters read. DFAs are often generated
automatically from regular expressions.


c) Automatically Generated Lexical Analyzers:


There are tools and lexical analyzer generators, such as Lex and Flex, that take a specification
of lexical patterns using regular expressions as input and automatically generate the code for a
lexical analyzer.
d) Manual Lexical Analysis:
In some cases, developers may choose to manually create lexical analyzers by writing code to
recognize tokens one by one. This may be necessary when the language structure is complex or
when specific needs arise.
e) Recursive Descent Parsers:
In certain situations, lexical analyzers are embedded directly into the syntax analyzers. This
approach is commonly used in recursive descent parsers, where tokens are generated on-the-fly
during syntax analysis.
f) Two-Pass Analyzers:
In some languages, it is necessary to perform an initial pass to identify tokens, followed by a
second pass for syntax analysis. This approach is used when the language's syntax is ambiguous
or complex.
The choice of approach often depends on the complexity of the programming language being
analyzed, performance requirements, and available tools. Regular expressions and automatic
lexer generation tools are commonly used because they greatly simplify the process of
constructing lexical analyzers and ensure accurate implementation of the language's lexical
specifications.
In the following sections, we describe three of these approaches: transition diagrams, finite
automata, and the Lex generator.

4.1 Transition Diagram


A transition diagram has a collection of nodes or circles, called states. Each state represents
a condition that could occur during the process of scanning the input looking for a lexeme that
matches one of several patterns. Edges are directed from one state of the transition diagram to
another. Each edge is labeled by a symbol or set of symbols.
If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled
a. If we find such an edge, we advance the forward pointer and enter the state of the transition
diagram to which that edge leads.
Some important conventions about transition diagrams are:


1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been
found, although the actual lexeme may not consist of all the positions between the lexemeBegin
and forward pointers. We always indicate an accepting state by a double circle.

2. In addition, if it is necessary to retract the forward pointer by one position, then we shall
additionally place a * near that accepting state.

3. One state is designated as the start or initial state; it is indicated by an edge labeled "start"
entering from nowhere. The transition diagram always begins in the start state, before any input
symbols have been consumed.

Figure 2.2: Example of Transition Diagrams


As an intermediate step in the construction of a lexical analyzer (LA), we first produce a
stylized flowchart, called a transition diagram. Positions in a transition diagram (TD) are drawn
as circles and are called states.

Figure 2.3: A transition diagram for id's and keywords


The transition diagram above recognizes an identifier, defined to be a letter followed by any
number of letters or digits. A sequence of transition diagrams can be converted into a program
that looks for the tokens specified by the diagrams: each state gets a segment of code, as the
sketch below illustrates.
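
As a minimal hand-written sketch of this idea (the function name is illustrative, not generated code), the identifier diagram of Figure 2.3 can be turned into code in which each state is a case of a switch:

#include <ctype.h>
#include <stdio.h>

/* Returns 1 if s is an identifier: a letter followed by letters or digits */
int is_identifier(const char *s) {
    int state = 0;                                /* state 0: start          */
    for (; *s; s++) {
        switch (state) {
        case 0:                                   /* expecting first letter  */
            if (isalpha((unsigned char)*s)) state = 1;
            else return 0;
            break;
        case 1:                                   /* letter/digit loop       */
            if (!isalnum((unsigned char)*s)) return 0;
            break;
        }
    }
    return state == 1;                            /* state 1 is accepting    */
}

int main(void) {
    printf("%d %d %d\n", is_identifier("x1"),
           is_identifier("count"), is_identifier("9lives"));  /* 1 1 0 */
    return 0;
}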


4.2. Construction of lexical analyzer based on finite automata


A finite automaton is a state machine that takes a string of symbols as input and changes its
state accordingly. A finite automaton is a recognizer for regular expressions. When an input
string is fed into the automaton, it changes its state for each symbol read. If the input string is
successfully processed and the automaton reaches a final state, the string is accepted, i.e., it is
a valid token of the language at hand.

The mathematical model of finite automata consists of:

 Finite set of states (Q)


 Finite set of input symbols (Σ)
 One Start state (q0)
 Set of final states (qf)
 Transition function (δ)

The transition function (δ) maps a pair consisting of a state from Q and an input symbol from
Σ to a state:

Q × Σ → Q
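
As a minimal sketch (illustrative C types, not a fixed API), this 5-tuple maps naturally onto a data structure in which the transition function δ is stored as a two-dimensional table indexed by state and symbol:

#include <stdbool.h>

#define NSTATES 4
#define NSYMS   2                   /* e.g. Σ = {a, b}, encoded as 0 and 1 */

typedef struct {
    int  start;                     /* q0: the start state                 */
    bool final[NSTATES];            /* final[q] is true iff q is in F      */
    int  delta[NSTATES][NSYMS];     /* δ : Q × Σ -> Q, one entry per pair  */
} DFA;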

4.2.1 General Approach to Constructing a lexical analyzer


This approach rests on the following demonstrable propositions:
• The set of lexical units of a given language constitutes a regular language L.
• For any regular expression r, there exists a nondeterministic finite automaton (NFA, written
AFN below) that accepts the regular set described by r.
• If a language L is accepted by a nondeterministic finite automaton (AFN), then there exists a
deterministic finite automaton (DFA, written AFD below) accepting L.
On this basis we can define a rigorous approach for constructing a lexical analyzer using
finite-state automata. This approach consists of six steps:


Step 1: Specification of lexical units
Specify each type of lexical unit using a regular expression (RE).
Step 2: Conversion of each RE into an AFN
Convert each regular expression into a (non-deterministic) finite-state
automaton.
Step 3: Union of the AFNs
Build the union automaton of all the automata from step 2
(we can add a new initial state from which a set of ε-labeled arcs starts).
Step 4: Determinization, i.e. transformation of the AFN obtained into an AFD
Make the automaton of step 3 deterministic.
Step 5: Minimization of the AFD
Minimize the automaton obtained in step 4.
Step 6: Implementation of the minimized AFD
Implement the automaton obtained in step 5 by simulating it from its
transition table.

4.2.2 Non-deterministic finite automaton


An AFN is defined by:
• A set of states E
• A set of input symbols, or alphabet, Σ
• An initial state e0
• A set F of final (acceptance) states
• A transition function that maps each pair (state, symbol) to a set
of states.
Example
Consider the AFN that recognizes the language described by the regular expression (a|b)*abb
(see Figure 2.4). For this automaton:

E = {0, 1, 2, 3} e0 = 0 Σ = {a, b} F = {3}


Figure 2.4: An AFN for the expression (a|b)*abb
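
Assuming the standard four-state AFN for this expression (state 0 loops on both a and b, and the suffix abb is then checked through states 1, 2, and 3), its transition function is:

State   a         b
0       {0, 1}    {0}
1       -         {2}
2       -         {3}
3       -         -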


4.2.3 Deterministic finite automaton
An AFD is a special case of an AFN in which:

• No state has an ε-transition

• For each state e and each input symbol a there is at most one arc labeled a leaving e

In the transition table of an AFD, each entry contains at most a single state (the input symbols
are the characters of the source text), so it is very easy to determine whether a string is accepted
by the automaton, since there is at most one path from the initial state to a final state labeled
by the string in question.

Note: The transition table of an AFN for a regular expression pattern can be considerably
smaller than that of an AFD. However, an AFD has the advantage of recognizing
regular expression patterns faster than the equivalent AFN.
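
As a minimal sketch of such a table-driven recognizer (the states A, B, C, D of the AFD for (a|b)*abb are encoded 0 to 3; this AFD is the one derived from the Figure 2.4 AFN by the subset construction worked out in section 4.2.5 below):

#include <stdbool.h>
#include <stdio.h>

static const int delta[4][2] = {  /* rows: states A..D; columns: a, b */
    {1, 0},                       /* A: a -> B, b -> A */
    {1, 2},                       /* B: a -> B, b -> C */
    {1, 3},                       /* C: a -> B, b -> D */
    {1, 0},                       /* D: a -> B, b -> A */
};

bool accepts(const char *s) {
    int q = 0;                                /* start in state A          */
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return false;
        q = delta[q][*s - 'a'];               /* one table lookup per char */
    }
    return q == 3;                            /* D is the only final state */
}

int main(void) {
    printf("%d %d %d\n", accepts("abb"), accepts("aabb"), accepts("ab"));
    return 0;                                 /* prints: 1 1 0 */
}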

4.2.4 Construction of AFN from regular expressions

Constructing an NFA (nondeterministic finite automaton) from a regular expression follows
a systematic process. The goal is to create a finite automaton that
recognizes the language defined by the given regular expression. Here's a step-by-step process
to construct an NFA from a regular expression:

a) Basic Components:

o Single Characters: Each character in the regular expression corresponds to a

small NFA fragment: a start state, an accepting state, and a transition between
them labeled with that character. For example, for the regular expression "a",
you get a single transition labeled "a".


o Epsilon Transitions (ε-transitions): These are transitions that don't consume

any input. They are used to glue fragments together for concatenation,
alternation, and closure. Represent them with arrows labeled ε.

b) Concatenation:

o If the regular expression is a concatenation of expressions (e.g., "ab"), create an


ε-transition from the accepting state of the first expression to the initial state of
the second expression.

c) Alternation (Union):

o If the regular expression is an alternation (e.g., "a|b"), create a new initial state
with ε-transitions to the initial states of the expressions being alternated, and a
new accepting state receiving ε-transitions from their accepting states.

d) Kleene Star (Closure):

o If the regular expression has the Kleene star (e.g., "a*"), create a new initial state,
a new accepting state, and ε-transitions from the new initial state to the initial
state of the expression being repeated and from the accepting state of the
expression to both the new accepting state and the initial state of the expression.


e) Create Accepting States:

o If there is only one expression in the regular expression, mark its final state as
an accepting state.

f) Finalization:

o If there are multiple expressions or operations, you should now have a modified
NFA. You can simplify it further by removing any ε-transitions (by performing
ε-closure operations) if the language allows it.

This process constructs an NFA that recognizes the language defined by the given regular
expression. The key is to understand the operations in the regular expression (concatenation,
alternation, and closure) and systematically build the corresponding components in the NFA.
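
As an illustrative sketch of these operations (hypothetical C types and helper names; memory management and error handling are omitted), each construction step can be coded as a function that combines NFA fragments using ε-transitions:

#include <stdlib.h>

enum { EPS = 0 };                 /* label 0 stands for an ε-move          */

typedef struct State State;
struct State {
    int    label[2];              /* symbol on each outgoing edge (or EPS) */
    State *out[2];                /* Thompson states need at most 2 edges  */
    int    nout;
};

/* A fragment has exactly one start state and one accepting state */
typedef struct { State *start, *accept; } Frag;

static State *state(void) { return calloc(1, sizeof(State)); }

static void edge(State *from, int sym, State *to) {
    from->label[from->nout] = sym;
    from->out[from->nout++] = to;
}

/* Single character c: start --c--> accept */
Frag atom(int c) {
    Frag f = { state(), state() };
    edge(f.start, c, f.accept);
    return f;
}

/* Concatenation ab: ε-edge from accept(a) to start(b) */
Frag cat(Frag a, Frag b) {
    edge(a.accept, EPS, b.start);
    return (Frag){ a.start, b.accept };
}

/* Alternation a|b: new start/accept with ε-edges around both fragments */
Frag alt(Frag a, Frag b) {
    Frag f = { state(), state() };
    edge(f.start, EPS, a.start);   edge(f.start, EPS, b.start);
    edge(a.accept, EPS, f.accept); edge(b.accept, EPS, f.accept);
    return f;
}

/* Kleene star a*: ε-edges allow skipping the fragment or repeating it */
Frag star(Frag a) {
    Frag f = { state(), state() };
    edge(f.start, EPS, a.start);   edge(f.start, EPS, f.accept);
    edge(a.accept, EPS, a.start);  edge(a.accept, EPS, f.accept);
    return f;
}

int main(void) {
    /* Build Thompson's NFA for (a|b)*abb */
    Frag f = cat(star(alt(atom('a'), atom('b'))),
                 cat(atom('a'), cat(atom('b'), atom('b'))));
    (void)f;   /* a simulator would now run ε-closures from f.start */
    return 0;
}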

4.2.5. Transition from AFN to AFD

Converting a Nondeterministic Finite Automaton (NFA) to a Deterministic Finite Automaton
(DFA) is a crucial step in the implementation of regular expression matching and various other
language recognition tasks. The conversion uses the subset construction algorithm. Here's the
step-by-step process:
Subset Construction:
For each state in the DFA, we need to determine its transitions based on the ε-closure of the
states reachable from the initial state of the NFA using ε-transitions.
a. Start with the ε-closure of the initial state of the NFA as the initial state of the DFA.


b. For each state in the DFA, compute the ε-closure of the set of states reachable from it
using ε-transitions. This set of states becomes the next state in the DFA for the corresponding
input symbols.
c. Repeat step b for all newly generated states until no new states can be added.
DFA States and Transitions:
Each state in the DFA corresponds to a set of states from the NFA. The transitions in the
DFA are determined by the transitions of the NFA on the individual characters.
DFA Accepting States:
A state in the DFA is an accepting state if it contains at least one accepting state from the
NFA.
Here's an example conversion of a simple NFA to a DFA:
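
As a worked example, consider the AFN of Figure 2.4. It has no ε-transitions, so every ε-closure is simply the set itself, and the subset construction proceeds as follows:

A = {0}:    on a -> {0, 1} = B,   on b -> {0} = A
B = {0, 1}: on a -> {0, 1} = B,   on b -> {0, 2} = C
C = {0, 2}: on a -> {0, 1} = B,   on b -> {0, 3} = D
D = {0, 3}: on a -> {0, 1} = B,   on b -> {0} = A

No new sets appear, so the resulting AFD has the four states A, B, C, D with start state A; D is the only accepting state, because it is the only set containing the AFN accepting state 3.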

4.2.6. Minimization of an AFD


Myhill-Nerode theorem (minimization of an AFD). Let L be a regular (rational) language.
Among all AFDs recognizing L, there is one and only one with a minimal number of states.

Before minimizing an AFD, it must be completed, i.e., a sink ("trash") state must be added,
like the state R in the diagram below.


The minimization algorithm is as follows:

1. Complete the AFD, named D.

2. Construct an initial partition ∏ containing two groups of states, I and II, such that ∏ = {I
= (accepting states of D), II = (all other states of D)}.
3. For each state of D:
 Build its row of the transition table
 Mark the source and target states according to their group in ∏
4. If states in the same group of ∏ behave differently (their transitions lead to different groups):
 Separate those states into a new group (III, for example); preferably make only one
separation per iteration.
 Return to step 3.
5. End: the states of the minimal AFD are the groups of ∏.

Here is an example of minimizing an AFD:
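
As a worked example, apply the algorithm to the AFD with states A, B, C, D obtained in section 4.2.5 (it is already complete, so no trash state is needed):

1. Initial partition: ∏ = {I = {D}, II = {A, B, C}}.
2. Within group II, on input b: A goes to A (group II) and B goes to C (group II), but C goes to D (group I). C diverges, so we split: ∏ = {I = {D}, II = {A, B}, III = {C}}.
3. Within group II, on input b: A goes to A (group II) but B goes to C (group III). B diverges, so we split again: ∏ = {{A}, {B}, {C}, {D}}.
4. All groups are now singletons, so this AFD was already minimal.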

5. Generating a lexical analyzer: LEX

"Lex" is a popular tool used for generating lexical analyzers (also known as scanners or
tokenizers) in compiler construction and other related fields. It helps in converting a stream of
characters into a stream of tokens, which are the basic units of a programming language. Lexical
analysis is the first step in the compilation process, where the source code is divided into
meaningful tokens for further processing.


Lex operates based on regular expressions. It allows you to define patterns for tokens using
regular expressions and associate corresponding actions that generate output when a match is
found. The tool generates C code for the lexical analyzer, which can then be integrated into a
larger compiler.

The Lex tool works in conjunction with the "Yacc" (Yet Another Compiler Compiler) tool.
While Lex handles the lexical analysis phase, Yacc deals with the parsing phase, helping to
construct the syntax tree and perform syntactic analysis.

Here's a simplified overview of how Lex works:

1. Specification: You write a Lex specification file that defines regular expressions and
the corresponding actions. Each regular expression corresponds to a token pattern, and
the associated action generates output for the matched token.

2. Compilation: You run the Lex tool on the specification file. Lex generates C code for
the lexical analyzer based on your specifications.

3. Integration: You include the generated C code in your larger compiler project. The
lexical analyzer code reads input characters and matches them against the defined
patterns.

4. Tokenization: As input characters are read, the lexical analyzer uses the generated C
code to match patterns. When a pattern is matched, the associated action is executed,
producing the corresponding token.

5. Output: The token stream generated by the lexical analyzer is passed to the parser
(usually implemented using Yacc or another parsing tool) for further syntactic analysis.

Here's a simple example of a Lex specification for recognizing integers and identifiers in a
programming language:

%{
#include <stdio.h>
%}

DIGIT      [0-9]
IDENTIFIER [a-zA-Z][a-zA-Z0-9]*

%%
{DIGIT}+      printf("INTEGER: %s\n", yytext);
{IDENTIFIER}  printf("IDENTIFIER: %s\n", yytext);
.|\n          { /* ignore other characters */ }
%%

int yywrap(void) { return 1; }  /* no further input files to scan */

int main() {
    yylex();
    return 0;
}
This example would recognize integers (sequences of digits) and identifiers (starting with a
letter and followed by letters or digits) and print corresponding messages for each match.
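
To try this specification (assuming it is saved as scanner.l, a hypothetical file name), one would typically run "flex scanner.l" to generate lex.yy.c, then compile it with "cc lex.yy.c -o scanner"; the resulting program reads its standard input and prints one line per recognized token.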
