compiler_design- Module2-print
Programming language basics - lexical analysis – role of lexical analyzer – input buffering - specification of tokens –
recognition of tokens using finite automata. (15 Hrs)
LEXICAL ANALYSIS
A simple way to build a lexical analyzer is to construct a diagram that illustrates the structure of the tokens of the source language, and then to hand-translate the diagram into a program for finding tokens. Efficient lexical analyzers can be produced in this manner.
Since the lexical analyzer is the part of the compiler that reads the source text, it may also perform certain secondary tasks at the user interface. One such task is stripping comments and white space (blanks, tabs, and newline characters) out of the source program. Another is correlating error messages from the compiler with the source program.
Tokens, Patterns and Lexemes
There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token. The pattern is said to match each string in the set.
In most programming languages, the following constructs are treated as tokens: keywords, operators,
identifiers, constants, literal strings, and punctuation symbols such as parentheses, commas, and semicolons.
Lexeme
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. For example, in the Pascal statement const pi = 3.1416; the substring pi is a lexeme for the token identifier.
Patterns
A pattern is a rule describing the set of lexemes that can represent a particular token in the source program. The pattern for the token const in the table below is just the single string const that spells out the keyword.

Token     Sample Lexemes        Pattern
const     const                 const
if        if                    if
id        pi, count, D2         letter followed by letters and digits
Certain language conventions impact the difficulty of lexical analysis. Languages such as FORTRAN require certain constructs to appear in fixed positions on the input line; thus the alignment of a lexeme may be important in determining the correctness of a source program.
Attributes of Tokens
The lexical analyzer returns to the parser a representation for the token it has found. The representation is
an integer code if the token is a simple construct such as a left parenthesis, comma, or colon. The representation is
a pair consisting of an integer code and a pointer to a table if the token is a more complex element such as an
identifier or constant.
The integer code gives the token type, the pointer points to the value of that token. Pairs are also returned
whenever we wish to distinguish between instances of a token.
The lexical analyzer collects information about tokens and their associated attributes. The tokens influence parsing decisions; the attributes influence the translation of tokens. As a practical matter, a token usually has only a single attribute: a pointer to the symbol-table entry in which the information about the token is kept; the pointer becomes the attribute of the token.
The attributes influence the translation of tokens:
i. Constants: the value of the constant.
ii. Identifiers: a pointer to the corresponding symbol-table entry.
Example: In the Fortran statement
E = M * C ** 2
the tokens and their attribute values are:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<num, integer value 2>
There are three general approaches to the implementation of a lexical analyzer:
1. Use a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.
2. Write the lexical analyzer in a conventional systems-programming language, using the input facilities of that language to read input (a sketch of this approach follows the list).
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.
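As an illustration of the second approach only (not taken from the text), here is a minimal hand-written scanner sketch in C; the token codes ID, NUM, DONE and the name gettoken are assumptions made for the example.

#include <ctype.h>
#include <stdio.h>

/* Hypothetical token codes for this sketch. */
enum { ID = 256, NUM, DONE };

static char lexbuf[128];   /* text of the current lexeme      */
static int  lexval;        /* attribute value for a NUM token */

/* Return the next token from src, skipping white space. */
static int gettoken(FILE *src)
{
    int c = fgetc(src);
    while (c == ' ' || c == '\t' || c == '\n')     /* strip white space */
        c = fgetc(src);
    if (c == EOF)
        return DONE;
    if (isdigit(c)) {                    /* num -> digit+ */
        lexval = 0;
        do {
            lexval = lexval * 10 + (c - '0');
            c = fgetc(src);
        } while (isdigit(c));
        ungetc(c, src);                  /* one character of lookahead */
        return NUM;
    }
    if (isalpha(c)) {                    /* id -> letter(letter|digit)* */
        int i = 0;
        do {
            lexbuf[i++] = (char)c;
            c = fgetc(src);
        } while (isalnum(c) && i < (int)sizeof lexbuf - 1);
        lexbuf[i] = '\0';
        ungetc(c, src);
        return ID;
    }
    return c;    /* operators and punctuation stand for themselves */
}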
INPUT BUFFERING
The lexical analyzer scans the characters of the source program one at a time to discover tokens; often, many characters beyond the next token may have to be examined before the next token itself can be determined. For this reason it is desirable for the lexical analyzer to read its input from a buffer. One pointer marks the beginning of the token being discovered, while a lookahead pointer scans ahead until the token is discovered. We view the position of each pointer as being between the character last read and the character next to be read. In practice, each buffering scheme adopts one convention: a pointer is either at the symbol last read or at the symbol it is ready to read.
The distance which the lookahead pointer may have to travel past the actual token may be large. For example, in a PL/I program we may see:
DECLARE ( ARG1 , ARG2 , … , ARGn )
without knowing whether DECLARE is a keyword or an array name until we see the character that follows the right parenthesis. In either case, the token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began, the other half must be loaded with the next characters from the source file.
Since the buffer is of limited size, there is an implied constraint on how much lookahead can be used before the next token is discovered. In the above example, if the lookahead traveled into the left half and all the way through it to the middle, we could not reload the right half, because we would lose characters that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use another buffering scheme, we cannot ignore the fact that the amount of lookahead is limited.
BUFFER PAIRS
A buffer is divided into two N-character halves, as shown below:

[Figure: a two-half buffer holding the input E = M * C * * 2 followed by eof, with the pointers lexeme_beginning and forward marking the current lexeme]
Each half is of the same size N, where N is usually the number of characters on one disk block, e.g., 1024 or 4096 bytes.
Using one system read command we can read N characters into a buffer half.
If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file.
Two pointers to the input are maintained:
1 Pointer lexeme_beginning, marks the beginning of the current lexeme, whose extent we are attempting to
determine.
2 Pointer forward scans ahead until a match for a pattern is found.
Once the next lexeme is determined, the forward pointer is set to the character at its right end. The string of characters between the two pointers is the current lexeme. After the lexeme is recorded as an attribute value of a token returned to the parser, lexeme_beginning is set to the character immediately after the lexeme just found.
Advancing the forward pointer:
Advancing the forward pointer requires that we first test whether we have reached the end of one of the buffer halves; if so, we must reload the other half from the input and move forward to the beginning of the newly loaded half. If the end of the second half is reached, we reload the first half and the pointer wraps to the beginning of the buffer.
Sentinels
For each character read, we make two tests: one for the end of the buffer, and one to determine what
character is read. We can combine the buffer-end test with the test for the current character if we extend each
buffer to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural choice is
the character eof.
The sentinel arrangement places an eof at the end of each buffer half:

[Figure: the buffer pair with an eof sentinel at the end of each half]

Note that eof retains its use as a marker for the end of the entire input: any eof that appears other than at the end of a buffer means that the input is at an end.
The lookahead code with sentinels is:

forward := forward + 1;
if forward↑ = eof then
begin
    if forward at end of first half then
    begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then
    begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
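A minimal C sketch of the buffer-pair scheme with sentinels follows; it is an illustration under stated assumptions, not the text's implementation. The names fillbuf, initbuf and nextchar are invented for the example, '\0' stands in for the eof sentinel (so the source is assumed to contain no NUL characters), and the lexeme_beginning pointer is omitted for brevity.

#include <stdio.h>

#define N 4096                  /* size of each buffer half  */
#define SENTINEL '\0'           /* stand-in for the eof mark */

static char  buf[2 * N + 2];    /* two halves, each followed by a sentinel slot */
static char *forward;           /* lookahead pointer */
static FILE *src;

/* Fill one half with up to N characters and place the sentinel just after
   the characters actually read; a sentinel that is not at the end of a
   half therefore marks the true end of the input. */
static void fillbuf(char *half)
{
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

static void initbuf(FILE *fp)
{
    src = fp;
    fillbuf(buf);                 /* load the first half  */
    buf[2 * N + 1] = SENTINEL;    /* second half is empty */
    forward = buf;
}

/* Advance forward and return the next character, reloading a half when the
   sentinel at its end is reached; returns EOF at end of input. */
static int nextchar(void)
{
    char c = *forward++;
    if (c != SENTINEL)
        return (unsigned char)c;
    if (forward == buf + N + 1) {        /* sentinel at end of first half  */
        fillbuf(buf + N + 1);            /* reload second half             */
        return nextchar();
    }
    if (forward == buf + 2 * N + 2) {    /* sentinel at end of second half */
        fillbuf(buf);                    /* reload first half              */
        forward = buf;
        return nextchar();
    }
    return EOF;    /* sentinel within a half: end of input */
}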
Operations on languages:
The following operations can be applied to languages; their formal definitions appear after the list:
1 Union
2 Concatenation
3 Kleene closure
4 Positive closure
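Formally, for languages L and M:

Union : L U M = { s | s is in L or s is in M }
Concatenation : LM = { st | s is in L and t is in M }
Kleene closure : L* = the set of strings obtained by concatenating zero or more strings of L (so ε is always in L*)
Positive closure : L+ = the set of strings obtained by concatenating one or more strings of L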
The following example shows these operations on languages. Let L={0,1} and S={a,b,c}:
1 Union : L U S = {0, 1, a, b, c}
2 Concatenation : LS = {0a, 1a, 0b, 1b, 0c, 1c}
3 Kleene closure : L* = {ε, 0, 1, 00, 01, 10, 11, 000, …}
4 Positive closure : L+ = {0, 1, 00, 01, 10, 11, 000, …}
Regular Expressions
Each regular expression r denotes a language L(r).
Here are the rules that define the regular expressions over some
alphabet Σ and the languages that those expressions denote:
1 ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the empty string.
2 If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language with one string, of
length one, with ‘a’ in its one position.
3 Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4 The unary operator * has highest precedence and is left associative.
5 Concatenation has second highest precedence and is left associative.
6 | has lowest precedence and is left associative.
Regular set
A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s
denote the same regular set, we say they are equivalent and write r = s.
Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
………
dn → rn
1 Each di is a distinct name.
2 Each ri is a regular expression over the alphabet Σ U {d1, d2, …, di−1}.
Example: Identifiers are the set of strings of letters and digits beginning with a letter. The regular definition for this set is:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
Notational Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.
1 One or more instances (+):
- The unary postfix operator + means “ one or more instances of” .
- If r is a regular expression that denotes the language L(r), then ( r )+ is a regular expression that denotes the
language (L (r ))+
- Thus the regular expression a+ denotes the set of all strings of one or more a’s.
- The operator + has the same precedence and associativity as the operator *.
- E.g., L(a+) = { a, aa, aaa, … }
2 Zero or one instance (?):
- The unary postfix operator ? means "zero or one instance of".
- The notation r? is a shorthand for r | ε.
- If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) U { ε }.
3 Character classes:
- The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c.
- A character class such as [a–z] denotes the regular expression a | b | c | … | z.
- Using character classes, we can describe identifiers as the strings generated by the regular expression [A–Za–z][A–Za–z0–9]*.
4 The two algebraic identities r* = r+ | ε and r+ = rr* relate the Kleene and positive closure operators.
Non-regular Set
A language which cannot be described by any regular expression is a non-regular set. Example: the set of all strings of balanced parentheses is not regular; it can, however, be specified by a context-free grammar. Repeating strings of the form wcw (w a string of a's and b's) are another non-regular set.
RECOGNITION OF TOKENS
Consider the following grammar fragment for branching statements:
stmt → if expr then stmt
      | if expr then stmt else stmt
      | ε
expr → term relop term
      | term
term → id
      | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions:
if → if
then → then
else → else
relop → <|<=|=|<>|>|>=
id → letter(letter|digit)*
num → digit+ (.digit+)?(E(+|-)?digit+)?
Our lexical analyzer will strip out white space. It does so by comparing a string against the regular
definition ws (white space) as shown:
delim →blank | tab | newline
ws → delim+
For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as the
lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved; that is, they
cannot be used as identifiers.
If a match for ws is found, the lexical analyzer does not return a token to the parser. Our goal is to construct a lexical analyzer that will isolate the lexeme for the next token in the input buffer and produce as output a pair consisting of the appropriate token and attribute value. The following table shows the regular-expression patterns for tokens:

Regular expression    Token    Attribute value
ws                    –        –
if                    if       –
then                  then     –
else                  else     –
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE
Finite Automata
A recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise. We compile a regular expression into a recognizer called a finite automaton (plural: automata). A finite automaton is a mathematical model consisting of a number of states and edges; it is a transition diagram that recognizes a regular expression or grammar.
There are two types of finite automata:
Non-deterministic Finite Automata (NFA)
Deterministic Finite Automata (DFA)
DFA is a special case of an NFA in which
i) no state has an ε-transition, and
ii) for each state and input symbol there is at most one outgoing transition.
A DFA is denoted by the five-tuple M = (S, Σ, δ, s0, F), where
S – finite set of states
Σ – finite set of input symbols
δ – transition function that maps each state-symbol pair to a state
s0 – the start state
F – the set of final (accepting) states
Transition Table:
The transition function δ maps S × Σ into S, where S is the set of states and Σ is the set of input symbols. This function can be displayed as a transition table (state transition table, STT): a table with a row for each state and a column for each input symbol (and for ε in the case of an NFA). Given a state and a symbol, the table entry is the next state. An example appears below.
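For example, the DFA given later in this module for the pattern (a | b)*abb has the transition table

State    a    b
0        1    0
1        1    2
2        1    3
3        1    0

with start state 0 and accepting state 3.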
Transition Diagram
It is a diagrammatic representation of the actions that take place when a lexical analyzer is called by the parser to get the next token.
A transition diagram is a special kind of flowchart for language analysis. In a transition diagram the boxes of the flowchart are drawn as circles, called states; states are connected by arrows, called edges. The label on an edge indicates the input character that can appear after that state. The diagram keeps track of information about the characters that are seen as the forward pointer scans the input, by moving from position to position as characters are read.
a) Transition diagram for identifiers: from the start state, an edge labelled letter leads to a state with a self-loop labelled letter or digit; any other character leads to an accepting state, with the forward pointer retracted by one position.
An NFA can be represented diagrammatically by a labelled directed graph called a transition graph, in which the nodes are states and the labelled edges represent the transition function. This graph looks like a transition diagram, but the same character can label two or more transitions out of one state, and edges can be labelled by the special symbol ε as well as by input symbols.

[Figure: NFA transition graph for (a | b)*abb]

This transition graph recognizes the input pattern (a | b)*abb; i.e., strings such as abb, aabb, babb, ababb and baabb are accepted.
Here,
The set of states is {0,1,2,3}
Input symbol set is {a,b}
State 0 is distinguished as the start state.
Accepting state 3 is indicated by a double circle.
Transition diagram for DFA
The following state transition diagram recognizes the input pattern (a | b)*abb:

[Figure: DFA transition diagram for (a | b)*abb]

It contains no ε-transitions, and no state has two outgoing transitions on the same input symbol. A table-driven recognizer for this DFA is sketched below.
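As an illustration (a sketch, not part of the text), here is a table-driven C recognizer for (a | b)*abb, using the transition table given earlier; the function name match_abb is invented for the example.

#include <stdio.h>

/* Transition table for the DFA recognizing (a|b)*abb.
   Rows are states 0..3; column 0 is input 'a', column 1 is input 'b'. */
static const int delta[4][2] = {
    {1, 0},    /* state 0 */
    {1, 2},    /* state 1 */
    {1, 3},    /* state 2 */
    {1, 0},    /* state 3 (accepting) */
};

/* Return 1 if s is accepted by the DFA, 0 otherwise. */
static int match_abb(const char *s)
{
    int state = 0;                        /* start state */
    for (; *s != '\0'; s++) {
        if (*s == 'a')       state = delta[state][0];
        else if (*s == 'b')  state = delta[state][1];
        else return 0;                    /* symbol outside the alphabet */
    }
    return state == 3;                    /* accept iff we end in state 3 */
}

int main(void)
{
    printf("%d %d %d\n", match_abb("abb"),     /* 1 */
                         match_abb("ababb"),   /* 1 */
                         match_abb("abab"));   /* 0 */
    return 0;
}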
Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
User subroutines are auxiliary procedures needed by the actions. These can be compiled separately and
loaded with the lexical analyzer.
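As an illustration of this layout, here is a small, self-contained Lex specification built from the regular definitions used earlier in this module; the printf actions and the main routine are choices made for the example, not fixed by Lex.

%{
#include <stdio.h>
%}
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}       { /* strip white space: no action and no return */ }
if         { printf("IF\n"); }
then       { printf("THEN\n"); }
else       { printf("ELSE\n"); }
{id}       { printf("ID(%s)\n", yytext); }
{number}   { printf("NUM(%s)\n", yytext); }
%%
int yywrap(void) { return 1; }

int main(void)
{
    yylex();    /* scan standard input until end of file */
    return 0;
}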
A parser generator is a program that takes as input a specification of a syntax and produces as output a procedure for recognizing that language. Historically, parser generators are also called compiler-compilers. YACC (Yet Another Compiler-Compiler) is an LALR(1) (Look-Ahead, Left-to-right scan, Rightmost-derivation producer with 1 token of lookahead) parser generator, originally designed to be complemented by Lex.
Yacc turns such a specification into a subroutine that handles the input process; frequently, it is convenient and appropriate to have most of the flow of control in the user's application handled by this subroutine. A Yacc specification has the same three-part layout:
/* definitions */
....
%%
/* rules */
....
%%
/* auxiliary routines */
....
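For concreteness, here is a small Yacc specification in this layout: the classic single-digit desk calculator, a standard textbook illustration rather than part of these notes.

%{
#include <ctype.h>
#include <stdio.h>
int yylex(void);
void yyerror(const char *s);
%}
%token DIGIT
%%
line   : expr '\n'         { printf("%d\n", $1); }
       ;
expr   : expr '+' term     { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor   { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'      { $$ = $2; }
       | DIGIT
       ;
%%
/* auxiliary routines: a one-character-at-a-time lexical analyzer */
int yylex(void)
{
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;
}
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
int main(void) { return yyparse(); }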