Module 4
The parser examines the sequence of tokens returned by the lexical analyzer and
extracts the constructs of the language appearing within it. Thus, the role of the parser is:
To identify the language constructs present in a given input program. If the parser
determines the input to be valid, it outputs a representation of the input in the
form of a parse tree.
If the input is grammatically incorrect, the parser declares the detection of a syntax error
in the input.
1. Lexical Errors: These are mainly spelling mistakes and the accidental insertion of
foreign characters (for example, $, if the language does not allow it); they are mostly
caught by the lexical analyzer.
2. Syntactic Errors: These are grammatical mistakes such as misplaced semicolons or
unbalanced parentheses; they are detected by the parser.
3. Semantic Errors: These include type mismatches between operators and operands, or
the use of an undeclared identifier.
4. Logical Errors: These are errors such as infinite loops. There is no way to catch
logical errors automatically.
The presence of an error in the input stream leads the parser to an erroneous state,
from which it cannot proceed until a certain portion of its work is undone. The strategies
involved in this process are broadly known as error recovery strategies:
Panic mode
Phrase level
Error Production
Global Correction
Panic mode:
On detecting an error, the parser discards input tokens until it reaches a state
from which parsing can safely resume.
Phrase level:
In this strategy, the parser makes some local corrections on the remaining input on
detection of an error, so that the resulting input stream forms a valid construct of the language.
Error Production:
This involves modifying the grammar of the language to include error situations.
Global Corrections:
Given an incorrect input string, the parser (in principle) finds a parse for a closest
correct string, making as few insertions, deletions, and changes of tokens as possible. Such
global least-cost correction is mostly of theoretical interest.
A grammar gives a precise, yet easy-to-understand, syntactic specification for the
programs of a particular programming language.
An efficient parser can be constructed automatically from a properly designed
grammar.
A grammar imparts a structure to a program that is useful for its translation into
object code and for the detection of errors.
For example, the production
expression → expression + expression
is read as "one way to form an expression is to take two smaller expressions and connect them
with a plus sign."
A set of terminal symbols, sometimes referred to as "tokens." The terminals are the
elementary symbols of the language defined by the grammar
A set of productions, where each production consists of a nonterminal called the head
or left side of the production, an arrow, and a sequence of terminals and/or non-
terminals, called the body or right side of the production. The intuitive intent of a
production is to specify one of the written forms of a construct; if the head
nonterminal represents a construct, then the body represents a written form of the
construct.
The following are certain notational conventions that are followed in context-free grammars:
3. Capital letters near the end of the alphabet, chiefly X, Y, Z, represent grammar
symbols, that is, either non-terminals or terminals.
4. Lowercase letters near the end of the alphabet, chiefly u, v, ..., z, represent strings of
terminals.
7. Unless otherwise stated, the left side of the first production is the start symbol.
E → E + E | E * E | ( E ) | - E | id
The non-terminal E is an abbreviation for expression. Starting from a single E, we can
repeatedly apply productions in any order to obtain a sequence of replacements. For example:
E ⇒ - E ⇒ - ( E ) ⇒ - ( id )
The following is the derivation for the string aabbbcc; in each sequence of symbols,
the nonterminal that is rewritten in the following step is underlined.
Here we write T ⇒* aabbbcc to mean that T derives aabbbcc in zero or more
steps. Derivations can be classified into two types:
Leftmost derivation
Rightmost Derivation
The intermediate strings obtained while deriving a string from a CFG are
called sentential forms. A sentential form may consist of terminals as well as non-terminals;
that is, if S ⇒* α, then α is a sentential form of the grammar. There are two types of
sentential forms, based upon the derivation used:
Left sentential form
Right sentential form
A parse tree is a tree structure used to represent a top-down derivation, with the
root of the tree as the start symbol of the grammar. The interior nodes of the tree are non-
terminals of the grammar and the leaves are the terminals.
A parse tree pictorially shows how the start symbol of a grammar derives a string
in the language. If nonterminal A has a production A → XYZ, then a parse tree may have an
interior node labeled A with three children labeled X, Y, and Z, from left to right:
If A is the nonterminal labeling some interior node and X1, X2, ..., Xn are the
labels of the children of that node from left to right, then there must be a
production A → X1 X2 ... Xn. Here, X1, X2, ..., Xn each stand for a symbol that is
either a terminal or a nonterminal. As a special case, if A → ε is a production,
then a node labeled A may have a single child labeled ε.
The leaves of the parse tree are labeled by non-terminals or terminals and, read
from left to right, they constitute a sentential form, called the yield or frontier of the tree.
The given CFG is
S → XX
X → XXX | bX | Xb | a
To the leftmost X, let us apply the production X → bX. To the right X, let us apply X → XXX.
Reading from left to right we have produced bbaaaab. These tree diagrams are
called parse trees.
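The replacement process just described can be sketched in a few lines of code. The following is a minimal sketch (my own construction, not from the text) that replays one leftmost derivation for the grammar S → XX, X → XXX | bX | Xb | a and checks that the yield is the string bbaaaab mentioned above.

```python
# Grammar encoded as head -> list of alternative bodies (strings of symbols).
GRAMMAR = {
    "S": ["XX"],
    "X": ["XXX", "bX", "Xb", "a"],
}

def apply_leftmost(sentential, body):
    """Replace the leftmost non-terminal (an uppercase letter) by body."""
    for i, sym in enumerate(sentential):
        if sym.isupper():
            assert body in GRAMMAR[sym], "not a production of this grammar"
            return sentential[:i] + body + sentential[i + 1:]
    raise ValueError("no non-terminal left to rewrite")

# One leftmost derivation of bbaaaab (the production bodies applied in order):
steps = ["XX", "bX", "bX", "a", "XXX", "a", "a", "Xb", "a"]
form = "S"
for body in steps:
    form = apply_leftmost(form, body)
print(form)  # -> bbaaaab
```

Applying a different sequence of production bodies would yield a different string of the language.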
4.2.2.3 AMBIGUITY
A grammar G is said to be ambiguous if there exists more than one parse tree for
the same sentence. An ambiguous grammar can have more than one leftmost and rightmost
derivations. For example, consider the grammar,
E → E + E | E * E | ( E ) | - E | a
and consider the string a*a + a. There are two leftmost derivations for this string, as
shown below:
Hence, the given grammar is ambiguous. It is possible to construct two parse trees with the
same yield as shown in figure 4.3.
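The ambiguity can also be confirmed mechanically. The following brute-force sketch (entirely my own construction; the grammar encoding and the depth bound are assumptions, not from the text) enumerates leftmost derivations of a*a + a in the grammar E → E + E | E * E | ( E ) | - E | a and finds more than one:

```python
# Leftmost-derivation search in the ambiguous expression grammar.
GRAMMAR = {"E": [["E", "+", "E"], ["E", "*", "E"],
                 ["(", "E", ")"], ["-", "E"], ["a"]]}
TARGET = ["a", "*", "a", "+", "a"]

def derivations(form, history, limit, found):
    if len([s for s in form if s not in GRAMMAR]) > len(TARGET):
        return                      # already more terminals than the target
    if form == TARGET:
        found.append(history)       # one complete leftmost derivation
        return
    if limit == 0:
        return
    for i, sym in enumerate(form):  # expand only the leftmost non-terminal
        if sym in GRAMMAR:
            for body in GRAMMAR[sym]:
                nxt = form[:i] + body + form[i + 1:]
                derivations(nxt, history + [" ".join(nxt)], limit - 1, found)
            break

found = []
derivations(["E"], [], 6, found)
print(len(found))  # 2 distinct leftmost derivations -> the grammar is ambiguous
```

Each complete leftmost derivation corresponds to exactly one parse tree, so two derivations mean two parse trees with the same yield.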
stmt → if expr then stmt
     | if expr then stmt else stmt
     | other_stmt
Now consider the following code segment:
if E1 then if E2 then S1 else S2
We can have two different parse trees, as shown in Figure 4.4(a) and Figure 4.4(b).
The first figure shows the situation in which the else is taken with the outer if statement; in
the second case, the else is taken with the inner if. Thus, the outer one is an if-then statement
while the inner one is an if-then-else statement. Most programming languages accept the
second one as the correct syntax; that is, the else is associated with the innermost if.
Figure 4.4 Two parse trees for the if-then-else statement
The grammar can be disambiguated by insisting that the statement appearing between a then
and an else must be "matched":
stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other_stmt
unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt
Here, other_stmt represents all statements apart from if. There can be only
one parse tree for each sentence of this grammar, so we can say that this grammar is unambiguous.
We can also disambiguate the grammar by specifying the associativity and
precedence of the arithmetic operators. The unary minus operator has the highest precedence,
followed by the exponentiation operator, then * and /, and finally + and -.
So the grammar is rewritten such that precedence and associativity are given
to the operators. The grammar is rewritten starting with the lowest precedence:
E → E + T | E - T | T
T → T * F | T / F | F
F → G ^ F | G
G → - G | ( E ) | a
where S1 and S2 are variables from which the strings given by the regular
expressions r1 and r2, respectively, can be derived, and S is the start symbol of the new grammar.
When a regular expression r* is encountered, a new start symbol S is added along with
the productions
S → S1 S | ε
where S1 is the start symbol for the set of strings that can be derived from r, and S
is the start symbol of the new grammar.
When a regular expression with unary + (r+) is encountered, a new start symbol S
is added along with the productions
S → S2 S | S2
where S2 is the start symbol for the set of strings that can be derived from r, and S
is the start symbol of the new grammar.
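The whole construction can be sketched as a recursive translation over a tiny regex syntax tree. The sketch below is my own (the AST encoding, the function name to_cfg, and the generated symbol names S1, S2, ... are all assumptions, not from the text); it applies the standard productions for union, concatenation, closure, and positive closure.

```python
import itertools

counter = itertools.count(1)          # fresh start symbols S1, S2, ...

def to_cfg(regex, prods):
    """Return a start symbol for `regex`, appending its productions to `prods`.
    regex is a char, ('|', r1, r2), ('.', r1, r2), ('*', r), or ('+', r)."""
    S = "S%d" % next(counter)
    if isinstance(regex, str):                  # single terminal symbol
        prods.append((S, [regex]))
    elif regex[0] == '|':                       # union: S -> S1 | S2
        a, b = to_cfg(regex[1], prods), to_cfg(regex[2], prods)
        prods += [(S, [a]), (S, [b])]
    elif regex[0] == '.':                       # concatenation: S -> S1 S2
        a, b = to_cfg(regex[1], prods), to_cfg(regex[2], prods)
        prods.append((S, [a, b]))
    elif regex[0] == '*':                       # closure: S -> S1 S | epsilon
        a = to_cfg(regex[1], prods)
        prods += [(S, [a, S]), (S, [])]
    elif regex[0] == '+':                       # positive closure: S -> S1 S | S1
        a = to_cfg(regex[1], prods)
        prods += [(S, [a, S]), (S, [a])]
    return S

prods = []
start = to_cfg(('*', ('|', 'a', 'b')), prods)   # grammar for (a|b)*
```

Running this on (a|b)* yields a start symbol S1 with the closure productions S1 → S2 S1 | ε, plus the union productions for a|b.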
4.3.1 PARSERS
A parser for a grammar G is a program that takes as input a string w and produces
as output either a parse tree for w, if w is a sentence of G, or an error message indicating that
w is not a sentence of G. In reality, the parse tree exists only as a sequence of actions made by
stepping through the tree construction process. There are two types of parsers: bottom-up
and top-down.
Bottom-up parsers build parse trees from the bottom (leaves) to the top (root),
while top-down parsers start with the root and work down to the leaves. In both cases, the
input to the parser is scanned from left to right, one symbol at a time.
S ⇒ ABC ⇒ XYZBC ⇒ · · ·
If the root labeled S has children labeled A, B, and C, we create the first step of
the leftmost derivation by replacing S by the labels of its children; i.e., S ⇒ ABC. Here S is
the first sentential form and ABC is the second.
If the node for A has children labeled X, Y, and Z in the tree, we create the next step of the
derivation by replacing A by the labels of its children; i.e., ABC ⇒ XYZBC. Continuing
in this manner, we finally get all terminals at the leaf nodes.
Example:
(Draw the parse trees for the leftmost and rightmost derivations of this grammar.)
In general, the selection of a production for a non-terminal may involve trial and error; that is,
we may have to try a production and backtrack to try another production if the first is found to
be unsuitable. A production is unsuitable if, after using the production, we cannot complete
the tree to match the input string. Predictive parsing is a special form of recursive-descent
parsing, in which the current input token unambiguously determines the production to be
applied at each step. After eliminating left recursion and left factoring, we can obtain a
grammar that can be parsed by a recursive-descent parser that needs no backtracking.
Basically, it removes the need for backtracking by fixing one production for every
combination of non-terminal and input token.
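A backtrack-free recursive-descent parser can be sketched directly from such a grammar. The sketch below is my own, assuming the usual non-left-recursive expression grammar E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id; each non-terminal becomes one method, and the one-token lookahead picks the production.

```python
class Parser:
    """Predictive recursive-descent parser; tokens are strings like 'id'."""

    def __init__(self, tokens):
        self.toks = tokens + ["$"]     # $ marks the end of the input
        self.pos = 0

    def look(self):
        return self.toks[self.pos]

    def match(self, t):
        if self.look() != t:
            raise SyntaxError("expected %s, got %s" % (t, self.look()))
        self.pos += 1

    def E(self):
        self.T(); self.Eprime()

    def Eprime(self):
        if self.look() == "+":          # E' -> + T E'
            self.match("+"); self.T(); self.Eprime()
        # otherwise E' -> epsilon; nothing to consume, no backtracking

    def T(self):
        self.F(); self.Tprime()

    def Tprime(self):
        if self.look() == "*":          # T' -> * F T'
            self.match("*"); self.F(); self.Tprime()

    def F(self):
        if self.look() == "(":          # F -> ( E )
            self.match("("); self.E(); self.match(")")
        else:                           # F -> id
            self.match("id")

def accepts(tokens):
    p = Parser(tokens)
    try:
        p.E(); p.match("$")
        return True
    except SyntaxError:
        return False

print(accepts(["id", "+", "id", "*", "id"]))  # True
print(accepts(["id", "+"]))                   # False
```

Note that each decision looks at exactly one token, which is what makes the grammar LL(1).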
However, in practice, LL(1) grammars are used i.e. one lookahead token is used.
M[X, a], where X is a non-terminal and a is a terminal of the grammar.
• Steps to be followed:
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ε can be added to any FIRST set:
1. If X is a terminal, then FIRST(X) = {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → Y1Y2.....Yk is a production, then place a in FIRST(X) if for
some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), FIRST(Y2), ..., FIRST(Yi-1); that
is, Y1...Yi-1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, ..., k, then add ε to FIRST(X). For
example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we
add nothing more to FIRST(X), but if Y1 ⇒* ε, then we add FIRST(Y2), and so on.
To compute FOLLOW(A) for all non-terminals A, apply the following rules until nothing can
be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in
FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε,
then everything in FOLLOW(A) is in FOLLOW(B).
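Both computations are fixed-point iterations and can be sketched compactly. The encoding below is my own (not from the text): productions are (head, body) pairs, "" stands for ε, and the grammar is the usual expression grammar.

```python
GRAMMAR = [
    ("E",  ["T", "E'"]),
    ("E'", ["+", "T", "E'"]), ("E'", []),
    ("T",  ["F", "T'"]),
    ("T'", ["*", "F", "T'"]), ("T'", []),
    ("F",  ["(", "E", ")"]), ("F", ["id"]),
]
NONTERMS = {head for head, _ in GRAMMAR}
START = "E"

def first_of_seq(seq, FIRST):
    """FIRST of a sequence of grammar symbols ('' stands for epsilon)."""
    out = set()
    for sym in seq:
        syms = FIRST[sym] if sym in NONTERMS else {sym}
        out |= syms - {""}
        if "" not in syms:
            return out
    out.add("")                        # every symbol in seq can derive epsilon
    return out

FIRST = {n: set() for n in NONTERMS}
changed = True
while changed:                         # iterate until no FIRST set grows
    changed = False
    for head, body in GRAMMAR:
        new = first_of_seq(body, FIRST)
        if not new <= FIRST[head]:
            FIRST[head] |= new; changed = True

FOLLOW = {n: set() for n in NONTERMS}
FOLLOW[START].add("$")                 # rule 1: $ follows the start symbol
changed = True
while changed:                         # iterate until no FOLLOW set grows
    changed = False
    for head, body in GRAMMAR:
        for i, sym in enumerate(body):
            if sym in NONTERMS:
                rest = first_of_seq(body[i + 1:], FIRST)
                new = (rest - {""}) | (FOLLOW[head] if "" in rest else set())
                if not new <= FOLLOW[sym]:
                    FOLLOW[sym] |= new; changed = True

print(sorted(FIRST["E"]))    # ['(', 'id']
print(sorted(FOLLOW["F"]))   # ['$', ')', '*', '+']
```

The results agree with the well-known sets for this grammar: FIRST(E) = FIRST(T) = FIRST(F) = {(, id} and FOLLOW(F) = {+, *, ), $}.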
The following algorithm can be used to construct a predictive parsing table for a grammar G.
The idea behind the algorithm is the following. Suppose A → α is a production with a in
FIRST(α). Then, the parser will expand A by α when the current input symbol is a. The only
complication occurs when α = ε or α ⇒* ε. In this case, we should again expand A by α if the
current input symbol is in FOLLOW(A), or if the $ on the input has been reached and $ is in
FOLLOW(A). So, the algorithm goes as follows:
The construction of the parse table is aided by two functions associated with a grammar G.
These functions, FIRST and FOLLOW, allow us to fill in the entries of a predictive parsing
table for G, whenever possible.
If α is any string of grammar symbols, FIRST(α) is the set of terminals that begin the strings
derived from α. If α ⇒* ε, then ε is also in FIRST(α).
Note that there may, at some time during the derivation, have been symbols between A and
a; but if so, they derived ε and disappeared. Also, if A can be the rightmost symbol in some
sentential form, then $ is in FOLLOW(A).
Parsing algorithm
The parser considers X, the symbol on top of the stack, and a, the current input symbol.
Assume that $ is a special token that marks the bottom of the stack and terminates the input
string.
repeat
    if X = a = $ then halt and accept
    else if X = a ≠ $ then pop X off the stack and advance the input pointer
    else if X is a non-terminal then
        if M[X, a] = X → Y1 Y2 ... Yk then pop X and push Yk, ..., Y2, Y1 (Y1 on top)
        else error
    else error
end
E  → T E'
E' → + T E' | ε
T  → F T'
T' → * F T' | ε
F  → ( E ) | id
        id          +            *            (           )           $
E       E → TE'                               E → TE'
E'                  E' → +TE'                             E' → ε      E' → ε
T       T → FT'                               T → FT'
T'                  T' → ε       T' → *FT'                T' → ε      T' → ε
F       F → id                                F → (E)
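The table-driven algorithm can be sketched directly with this table hard-coded. The encoding below is my own (the tuple keys and the use of an empty list for an ε-body are assumptions, not from the text):

```python
# Parsing table M[X, a] for the expression grammar; [] encodes an epsilon body.
TABLE = {
    ("E",  "id"): ["T", "E'"],   ("E",  "("): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],
    ("E'", ")"):  [],            ("E'", "$"): [],
    ("T",  "id"): ["F", "T'"],   ("T",  "("): ["F", "T'"],
    ("T'", "*"):  ["*", "F", "T'"],
    ("T'", "+"):  [],            ("T'", ")"): [], ("T'", "$"): [],
    ("F",  "id"): ["id"],        ("F",  "("): ["(", "E", ")"],
}
NONTERMS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    stack = ["$", "E"]                 # start symbol on top of $
    toks = tokens + ["$"]
    i = 0
    while stack:
        X, a = stack.pop(), toks[i]
        if X == a == "$":
            return True                # halt: input accepted
        elif X not in NONTERMS:        # terminal on top of the stack
            if X != a:
                return False           # error: terminal mismatch
            i += 1                     # matched; advance the input
        else:
            body = TABLE.get((X, a))
            if body is None:
                return False           # error: blank table entry
            stack.extend(reversed(body))   # push body, leftmost symbol on top
    return False

print(ll1_parse(["id", "+", "id", "*", "id"]))  # True
print(ll1_parse(["id", "id"]))                  # False
```

Blank entries of the table correspond to error actions; the blank-entry check is where a real parser would invoke error recovery.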
S → aAcBe
A → Ab | b
B → d
Consider the input string abbcde. Let us choose the leftmost b and replace it by A, the left
side of the production A → b. We obtain the string aAbcde. We now find that Ab, b, and d
each match the right side of some production. Suppose this time we choose to replace the
substring Ab by A, the left side of the production A → Ab. We now obtain aAcde. Then,
replacing d by B, the left side of the production B → d, we obtain aAcBe. We can now
replace this entire string by S.
Each replacement of the right side of a production by the left side in the process
above is called a reduction. Thus, by a sequence of four reductions we were able to reduce
abbcde to S. These reductions, in fact, traced out a rightmost derivation in reverse.
A substring which is the right side of a production such that replacement of that
substring by the production left side leads eventually to a reduction to the start symbol, by the
reverse of a rightmost derivation, is called a "handle." The process of bottom-up parsing may
be viewed as one of finding and reducing handles.
We must not be misled by the simplicity of this example. In many cases the
leftmost substring which matches the right side of some production is not a handle
because a reduction by the production may yield a string which cannot be reduced to
the start symbol. For example, if we replaced b by A in the second string aAbcde we would
obtain a string aAAcde which cannot be subsequently reduced to S. For this reason, we must
give a more precise definition of a handle. We shall see that if we write a rightmost derivation
in reverse, then the sequence of replacements made in that derivation naturally defines a
sequence of correct replacements that reduce the sentence to the start symbol.
Handles
A substring that matches the RHS of some production AND whose reduction to the
non-terminal on the LHS is a step along the reverse of some rightmost derivation.
Formally: if S ⇒* αAw ⇒ αβw by a rightmost derivation, then β, in the position following α,
is a handle of αβw.
Example
Consider:
S → aABe
A → Abc | b
B → d
S ⇒ aABe ⇒ aAde ⇒ aAbcde ⇒ abbcde
Handle Pruning
S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ... ⇒ γn-1 ⇒ γn = w
Apply the following simple algorithm:
for i ← n to 1 by -1
    Find the handle βi in γi
    Replace βi with Ai to generate γi-1
Example:
Consider the grammar E → E + E | E * E | ( E ) | id and the input string id1 + id2 * id3. The
following sequence of reductions reduces id1 + id2 * id3 to the start symbol E.
Right-sentential form      Handle
id1 + id2 * id3            id1
E + id2 * id3              id2
E + E * id3                id3
E + E * E                  E * E
E + E                      E + E
E
There are two problems that must be solved if we are to automate parsing by
handle pruning. The first is how to locate a handle in a right-sentential form, and the second is
what production to choose in case there is more than one production with the same right side.
Stack Input
$ w$
The parser operates by shifting zero or more input symbols onto the stack until a
handle is on top of the stack. The parser then reduces the handle to the left side of the
appropriate production. The parser repeats this cycle until it has detected an error or until the
stack contains the start symbol and the input is empty:
Stack Input
$S $
In this configuration the parser halts and announces successful completion of parsing.
Although the primary operations of the parser are shift and reduce, there are actually
four possible actions a shift-reduce parser can make: (1) shift, (2) reduce, (3) accept, and (4)
error.
(1) In a shift action, the next input symbol is shifted onto the top of the stack.
(2) In a reduce action, the parser knows the right end of the handle is at the top of the
stack. It must then locate the left end of the handle within the stack and decide with
what nonterminal to replace the handle.
(3) In an accept action, the parser announces successful completion of parsing.
(4) In an error action, the parser discovers that a syntax error has occurred and calls an
error recovery routine.
Example: The following figure shows the steps through the actions a shift reduce parser
might take in parsing the input string id1 + id2 * id3
$E + E * id3$ shift
$E + E * id3$ shift
$E + E * id3 $ reduce by
$E + E * E $ reduce by
$E + E $ reduce by
$E $ accept
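This cycle can be sketched as a naive shift-reduce loop. The sketch below is entirely my own (not the text's algorithm): it greedily reduces whenever the top of the stack matches a right side, and it uses a hypothetical one-token lookahead tie-break so that the parser reproduces the trace above rather than reducing E + E too early.

```python
# Naive shift-reduce loop for the ambiguous grammar E -> E + E | E * E | (E) | id.
PRODUCTIONS = [
    ("E", ["E", "+", "E"]),
    ("E", ["E", "*", "E"]),
    ("E", ["(", "E", ")"]),
    ("E", ["id"]),
]

def shift_reduce(tokens):
    stack, toks, i, trace = ["$"], tokens + ["$"], 0, []
    while True:
        reduced = False
        for head, body in PRODUCTIONS:
            if stack[-len(body):] == body:
                # Hypothetical tie-break (my own, not in the text): delay the
                # E + E reduction when * is the next token, since the grammar
                # alone does not say which handle to pick.
                if body == ["E", "+", "E"] and toks[i] == "*":
                    continue
                del stack[-len(body):]          # pop the handle ...
                stack.append(head)              # ... and push the left side
                trace.append("reduce by %s -> %s" % (head, " ".join(body)))
                reduced = True
                break
        if reduced:
            continue
        if stack == ["$", "E"] and toks[i] == "$":
            trace.append("accept")
            return trace
        if toks[i] == "$":
            trace.append("error")
            return trace
        stack.append(toks[i]); i += 1           # shift the next input symbol
        trace.append("shift " + stack[-1])

for step in shift_reduce(["id", "+", "id", "*", "id"]):
    print(step)
```

On id + id * id this produces three reductions by E → id, then E → E * E, then E → E + E, then accept, matching the table above. A real shift-reduce parser would use parser states rather than a hand-written tie-break to locate handles.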
(1) When we shift an input symbol a onto the stack, we create a one-node tree labeled a.
Both the root and the yield of this tree are a, and the yield truly represents the string of
terminals "reduced" (by zero reductions) to the symbol a.
(2) When we reduce X1X2...Xn to A, we create a new node labeled A. Its children, from left
to right, are the roots of the trees for X1, X2, ..., Xn. If for all i the tree for Xi has yield xi,
then the yield for the new tree is x1x2...xn. This string has in fact been reduced to A by a
series of reductions culminating in the present one. As a special case, if we reduce ε to
A, we create a node labeled A with one child labeled ε.
SLR (for Simple LR) is the weakest of the three methods (SLR, canonical LR, LALR) in
terms of the grammars for which it succeeds, but it is the easiest to implement.
1. An LR(0) item (or simply item) of a grammar is a production rule augmented with a
position marker (a dot) somewhere within its right hand side.
For example, the production A → XYZ yields the following four items:
A → .XYZ
A → X.YZ
A → XY.Z
A → XYZ.
Intuitively, an item indicates how much of a production we have seen at a given point in
the parsing process.
Here the "S" stands for simple.
For the rules in an augmented grammar G', begin at rule zero and follow the steps
below:
a. Use one read operation on each item C (non-terminal or terminal) in the current state.
Operations defined
A, S, X: non-terminals
start: if S is a symbol with [S → w] as a production rule, then [S → .w] is the item
associated with the start state.
read: if [A → x.Cz] is an item in some state, then [A → xC.z] is associated with some
other state. When performing a read, all the items with the dot before the same C are
associated with the same state. (Note that the dot may be before anything, either terminal
or non-terminal.)
complete: if [A → x.Xy] is an item, then every rule of the grammar of the
form [X → .z] must be included within this state. Repeat adding items until no new
items can be added. (Note that the dot is before a non-terminal.)
0. S' → S$
1. S → aSbS
2. S → a
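The complete (closure) and read operations can be sketched for this augmented grammar. The item representation below, a (head, body, dot-position) tuple, is my own encoding, not from the text.

```python
# Augmented grammar: S' -> S$, S -> aSbS | a.
GRAMMAR = {
    "S'": [["S", "$"]],
    "S":  [["a", "S", "b", "S"], ["a"]],
}

def closure(items):
    """The `complete` operation: items are (head, body, dot) tuples.
    Add [X -> .z] for every X appearing right after a dot, until stable."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in GRAMMAR:  # dot before a non-terminal
                for prod in GRAMMAR[body[dot]]:
                    item = (body[dot], tuple(prod), 0)
                    if item not in items:
                        items.add(item); changed = True
    return items

def goto(items, sym):
    """The `read` operation: advance the dot over sym, then close the result."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == sym}
    return closure(moved)

I0 = closure({("S'", ("S", "$"), 0)})   # start state from item [S' -> .S$]
print(sorted(I0))
```

The start state I0 contains [S' → .S$] plus both S-productions with the dot at the left end; goto(I0, "a") then builds the state reached by reading an a.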