Compiler Design Unit 2
UNIT -2
SYNTAX ANALYSIS
The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be
generated by the grammar for the source language. The parser should report any syntax errors in an
intelligible fashion. The two types of parsers employed are:
1. Top-down parser: builds the parse tree from the top (root) to the bottom (leaves).
2. Bottom-up parser: builds the parse tree from the leaves and works up to the root.
Accordingly, there are two types of parsing methods: top-down parsing and bottom-up parsing.
Given a context-free grammar, a parse tree according to the grammar is a tree with the following
properties:
1. The root is labeled by the start symbol.
2. Each leaf is labeled by a terminal or by ε.
3. Each interior node is labeled by a nonterminal.
JSVG Krishna, Associate Professor.
Ambiguity
A grammar can have more than one parse tree generating a given string of terminals. Such a grammar is
said to be ambiguous. To show that a grammar is ambiguous, all we need to do is find a terminal string
that is the yield of more than one parse tree. Since a string with more than one parse tree usually has
more than one meaning, we need to design unambiguous grammars for compiling applications, or to use
ambiguous grammars with additional rules to resolve the ambiguities.
Example: Suppose we used a single nonterminal string and did not distinguish between digits and
lists:
string → string + string | string - string | 0 | 1 | ... | 9
The figure shows that an expression like 9-5+2 has more than one parse tree with this grammar. The two
trees for 9-5+2 correspond to the two ways of parenthesizing the expression: (9-5)+2 and 9-(5+2). The
second parenthesization gives the expression the unexpected value 2 rather than the customary value 6.
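The two readings of 9-5+2 can be checked mechanically. The following is a minimal sketch, assuming the single-nonterminal grammar string → string + string | string - string | digit; the function name `parenthesizations` is hypothetical and chosen only for illustration. It enumerates every parse tree by picking each operator in turn as the root.

```python
from typing import List, Tuple


def parenthesizations(tokens: List[str]) -> List[Tuple[str, int]]:
    """Return (parenthesized text, value) for every parse tree of `tokens`
    under the assumed grammar: string -> string + string | string - string | digit."""
    if len(tokens) == 1:
        return [(tokens[0], int(tokens[0]))]
    results = []
    # Operators sit at the odd positions; each one can be the root of a tree.
    for i in range(1, len(tokens), 2):
        op = tokens[i]
        for left_text, left_val in parenthesizations(tokens[:i]):
            for right_text, right_val in parenthesizations(tokens[i + 1:]):
                value = left_val + right_val if op == "+" else left_val - right_val
                results.append((f"({left_text}{op}{right_text})", value))
    return results


trees = parenthesizations(["9", "-", "5", "+", "2"])
for text, value in trees:
    print(text, "=", value)
```

Two trees are found, one for each parenthesization, and they evaluate to different values, which is why the grammar is unsuitable as it stands.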
Example
Consider the grammar G with nonterminals N = {S, A, B}, terminals T = {a, b}, and productions
P = {S → AB, A → a, B → b}.
Here S produces AB, and we can replace A by a and B by b. The only string that can be derived is ab,
i.e., L(G) = {ab}.
Reasons for using regular expressions, rather than grammars, to define the lexical syntax of a language:
a) Regular expressions provide a more concise and easier-to-understand notation for tokens than
grammars.
b) The lexical rules of a language are simple, and to describe them we do not need a notation as
powerful as grammars.
c) More efficient lexical analyzers can be constructed automatically from regular expressions than
from arbitrary grammars.
d) Separating the syntactic structure of a language into lexical and nonlexical parts provides a
convenient way of modularizing the front end into two manageable-sized components.
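Ambiguity, where a grammar produces more than one leftmost or more than one rightmost derivation
(and hence more than one parse tree) for some sentence, can sometimes be eliminated by rewriting the
grammar.
Consider this example,
G: stmt → if expr then stmt
| if expr then stmt else stmt
| other
This grammar is ambiguous, since the string if E1 then if E2 then S1 else S2 has two parse trees: the
else can be attached to either then.
The general rule is “Match each else with the closest unmatched then.” This disambiguating rule can be
incorporated directly into the grammar:
stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt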
3. Eliminating left-recursion
Left factoring
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive
parsing. When it is not clear which of two alternative productions to use to expand a non-terminal A, we
can rewrite the A-productions to defer the decision until we have seen enough of the input to make the
right choice.
◼ Consider S → if E then S else S | if E then S
◼ Which of the two productions should we use to expand non-terminal S when the next
token is if?
We can solve this problem by factoring out the common part in these rules.
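The factoring step can be sketched in a few lines. This is an illustrative sketch only: the helper names `common_prefix` and `left_factor`, the list-of-symbols representation, and the convention of naming the new nonterminal by appending a prime are all assumptions made for the example. It performs one round of the transformation A → αβ1 | αβ2 into A → αA', A' → β1 | β2.

```python
def common_prefix(alternatives):
    """Longest sequence of symbols shared by the start of every alternative."""
    prefix = []
    for symbols in zip(*alternatives):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix


def left_factor(nonterminal, alternatives):
    """One round of left factoring; returns a dict of the rewritten productions."""
    # Group the alternatives by their first symbol.
    groups = {}
    for alt in alternatives:
        groups.setdefault(alt[0], []).append(alt)
    result = {nonterminal: []}
    new_name = nonterminal + "'"
    for group in groups.values():
        if len(group) == 1:          # nothing to factor
            result[nonterminal].append(group[0])
            continue
        prefix = common_prefix(group)
        result[nonterminal].append(prefix + [new_name])
        # The remainders become alternatives of the new nonterminal.
        result[new_name] = [alt[len(prefix):] or ["ε"] for alt in group]
        new_name += "'"
    return result


factored = left_factor("S", [
    ["if", "E", "then", "S", "else", "S"],
    ["if", "E", "then", "S"],
])
for head, bodies in factored.items():
    print(head, "->", " | ".join(" ".join(body) for body in bodies))
```

For the conditional statement this yields S → if E then S S' and S' → else S | ε, so the choice between the two forms is deferred until the parser has seen whether an else follows.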
2.4 PARSING
Parsing is the process of analyzing a continuous stream of input in order to determine its grammatical
structure with respect to a given formal grammar.
Types of parsing:
Top-down parsing : The parser starts with the start symbol and tries to transform it into the
input string. Example : LL parsers.
Bottom-up parsing : The parser starts with the input string and tries to rewrite it to the
start symbol. Example : LR parsers.
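Example: Consider the grammar S → cAd, A → ab | a and the input string w = cad.
Step 1:
Create a tree consisting of a single node labeled S, with the input pointer at ‘c’, the first symbol of w.
Expand S using its only production, S → cAd.
Step 2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second symbol
of w, ‘a’, and consider the next leaf, ‘A’. Expand A using the first alternative, A → ab.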
Step 3:
The second symbol ‘a’ of w also matches the second leaf of the tree, so advance the input pointer to the
third symbol of w, ‘d’. But the third leaf of the tree is ‘b’, which does not match the input symbol ‘d’.
Hence discard the chosen production and reset the input pointer to the second symbol of w. This is
backtracking.
Step 4: Now try the second alternative for A.
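The backtracking behaviour in these steps can be sketched as follows. The grammar S → cAd, A → ab | a is an assumption inferred from the leaves and alternatives mentioned in the steps, and the function names are hypothetical. Python generators are used so that, when a later symbol fails to match, control returns to the most recent choice point and the next alternative is tried.

```python
# Assumed grammar for illustration: S -> c A d, A -> a b | a
GRAMMAR = {"S": [["c", "A", "d"]], "A": [["a", "b"], ["a"]]}


def derive(symbol, text, pos):
    """Yield every input position reachable after matching `symbol` at `pos`."""
    if symbol not in GRAMMAR:                      # terminal symbol
        if pos < len(text) and text[pos] == symbol:
            yield pos + 1
        return
    for alternative in GRAMMAR[symbol]:            # try alternatives in order
        yield from match_sequence(alternative, text, pos)


def match_sequence(symbols, text, pos):
    """Match a sequence of grammar symbols, backtracking over earlier choices."""
    if not symbols:
        yield pos
        return
    for middle in derive(symbols[0], text, pos):
        yield from match_sequence(symbols[1:], text, middle)


def accepts(text):
    """True if the whole of `text` can be derived from S."""
    return any(end == len(text) for end in derive("S", text, 0))


print(accepts("cad"))   # first alternative A -> ab fails on 'd', A -> a succeeds
print(accepts("cabd"))
print(accepts("cd"))
```

Exhaustive backtracking of this kind can be costly, which is the motivation for predictive parsers that never need to undo a choice.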
It is possible to build a nonrecursive predictive parser by maintaining a stack explicitly, rather than
implicitly via recursive calls. The key problem during predictive parsing is that of determining the
production to be applied for a nonterminal. The nonrecursive parser in the figure looks up the production
to be applied in a parsing table. In what follows, we shall see how the table can be constructed directly
from certain grammars.
A table-driven predictive parser has an input buffer, a stack, a parsing table, and an output
stream. The input buffer contains the string to be parsed, followed by $, a symbol used as a right
endmarker to indicate the end of the input string. The stack contains a sequence of grammar symbols
with $ on the bottom.
Method. Initially, the parser is in a configuration in which it has $S on the stack with S, the start symbol
of G, on top, and w$ in the input buffer. The program that utilizes the predictive parsing table M to
produce a parse for the input is shown below.
set ip to point to the first symbol of w$;
repeat
    let X be the top stack symbol and a the symbol pointed to by ip;
    if X is a terminal or $ then
        if X = a then
            pop X from the stack and advance ip
        else error()
    else /* X is a nonterminal */
        if M[X, a] = X → Y1 Y2 ... Yk then begin
            pop X from the stack;
            push Yk, Yk-1, ..., Y1 onto the stack, with Y1 on top;
            output the production X → Y1 Y2 ... Yk
        end
        else error()
until X = $ /* stack is empty */
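The driver loop above can be sketched in Python as follows. The parsing table here is an assumption: it is written out by hand for the usual expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id, and the names `TABLE` and `predictive_parse` are hypothetical.

```python
EPS = "ε"
NONTERMINALS = {"E", "E'", "T", "T'", "F"}
# Hand-written table M[X, a] for the assumed expression grammar.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [EPS], ("E'", "$"): [EPS],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [EPS], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [EPS], ("T'", "$"): [EPS],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}


def predictive_parse(tokens, start="E"):
    """Return the productions of a leftmost derivation, or raise SyntaxError."""
    tokens = tokens + ["$"]
    stack = ["$", start]          # top of stack is the end of the list
    ip = 0
    output = []
    while True:
        X, a = stack[-1], tokens[ip]
        if X not in NONTERMINALS:             # X is a terminal or $
            if X != a:
                raise SyntaxError(f"expected {X!r}, found {a!r}")
            if X == "$":
                return output                 # stack and input both exhausted
            stack.pop()
            ip += 1
        else:
            body = TABLE.get((X, a))
            if body is None:                  # blank entry means error
                raise SyntaxError(f"no rule for {X} on {a!r}")
            stack.pop()
            # Push the body in reverse so its first symbol ends up on top.
            stack.extend(s for s in reversed(body) if s != EPS)
            output.append(f"{X} -> {' '.join(body)}")


productions = predictive_parse(["id", "+", "id", "*", "id"])
print("\n".join(productions))
```

The productions are printed in the order in which they are applied, which corresponds to a leftmost derivation of the input.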
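To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ε can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) = {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a nonterminal and X → Y1Y2...Yk is a production, then place a in FIRST(X) if, for some i,
a is in FIRST(Yi) and ε is in all of FIRST(Y1), ..., FIRST(Yi-1). If ε is in FIRST(Yj) for all
j = 1, 2, ..., k, then add ε to FIRST(X).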
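The "apply until nothing changes" idea is a fixed-point iteration. The sketch below assumes the usual textbook rules for FIRST and, purely for illustration, the expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id; the name `compute_first` is hypothetical.

```python
EPS = "ε"
# Assumed example grammar; any symbol that is not a key is a terminal.
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F": [["(", "E", ")"], ["id"]],
}


def compute_first(grammar):
    """Iterate over all productions until no FIRST set grows any further."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                all_nullable = True           # is every symbol seen so far nullable?
                for sym in body:
                    if sym == EPS:
                        break
                    if sym not in grammar:    # terminal: FIRST(sym) = {sym}
                        first[head].add(sym)
                        all_nullable = False
                        break
                    first[head] |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        all_nullable = False
                        break
                if all_nullable:
                    first[head].add(EPS)
                changed |= len(first[head]) != before
    return first


first = compute_first(GRAMMAR)
for nt, symbols in first.items():
    print(f"FIRST({nt}) = {sorted(symbols)}")
```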
Algorithm: construction of a predictive parsing table.
Input : Grammar G
Output : Parsing table M
Method :
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in
FIRST(α) and $ is in FOLLOW(A) , add A → α to M[A, $].
4. Make each undefined entry of M be error.
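The four steps can be sketched as below. This is an illustration under assumptions: the grammar is the usual expression grammar, the FIRST and FOLLOW sets are supplied as given values rather than computed, blank (error) entries are represented by missing dictionary keys, and the function names are hypothetical.

```python
EPS = "ε"
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}
FOLLOW = {"E": {"$", ")"}, "E'": {"$", ")"}, "T": {"+", "$", ")"},
          "T'": {"+", "$", ")"}, "F": {"+", "*", "$", ")"}}


def first_of_string(symbols):
    """FIRST of a sequence of grammar symbols (a production body)."""
    result = set()
    for sym in symbols:
        if sym == EPS:
            continue
        sym_first = FIRST.get(sym, {sym})     # a terminal's FIRST set is itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)                           # every symbol could vanish
    return result


def build_table():
    """Steps 1 to 3; entries left out of the dict are the error entries of step 4."""
    table = {}
    for head, bodies in GRAMMAR.items():
        for body in bodies:
            first_alpha = first_of_string(body)
            targets = first_alpha - {EPS}
            if EPS in first_alpha:
                targets |= FOLLOW[head]       # FOLLOW may include $
            for terminal in targets:
                table.setdefault((head, terminal), []).append(body)
    return table


table = build_table()
conflicts = {key: rules for key, rules in table.items() if len(rules) > 1}
print(len(table), "entries,", len(conflicts), "multiply defined")
```

Storing a list of productions per entry makes multiply defined entries easy to detect: any list longer than one means the grammar is not LL(1).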
Example: for the expression grammar E → TE’, E’ → +TE’ | ε, T → FT’, T’ → *FT’ | ε, F → (E) | id:
First( ) :
FIRST(E) = { (, id }
FIRST(E’) = { +, ε }
FIRST(T) = { (, id }
FIRST(T’) = { *, ε }
FIRST(F) = { (, id }
Follow( ) :
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = { +, *, $, ) }
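The FOLLOW sets listed above can be reproduced with a short fixed-point computation. The sketch assumes the expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id, takes the FIRST sets as given, and scans each production body from right to left; the name `compute_follow` is hypothetical.

```python
EPS = "ε"
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}


def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                    # $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                # `trailer` holds the terminals that can follow the current symbol.
                trailer = set(follow[head])
                for sym in reversed(body):
                    if sym in grammar:        # nonterminal
                        before = len(follow[sym])
                        follow[sym] |= trailer
                        changed |= len(follow[sym]) != before
                        if EPS in first[sym]:
                            trailer = trailer | (first[sym] - {EPS})
                        else:
                            trailer = set(first[sym])
                    elif sym != EPS:          # terminal
                        trailer = {sym}
    return follow


follow = compute_follow(GRAMMAR, FIRST, "E")
for nt, symbols in follow.items():
    print(f"FOLLOW({nt}) = {sorted(symbols)}")
```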
2.5.5 LL(1) GRAMMAR
The parsing table algorithm can be applied to any grammar G to produce a parsing table M.
For some grammars, for example if G is left recursive or ambiguous, M will have at
least one multiply defined entry. A grammar whose parsing table has no multiply defined
entries is said to be LL(1). It can be shown that the above algorithm produces, for every
LL(1) grammar G, a parsing table M that parses all and only the sentences of G.
LL(1) grammars have several distinctive properties. No ambiguous or left recursive grammar
can be LL(1). There remains the question of what should be done when a table has multiply
defined entries. One easy recourse is to eliminate all left recursion and then left factor the
grammar, hoping to produce a grammar whose parsing table has no multiply defined entries.
Unfortunately, there are some grammars for which no alteration whatsoever will yield an
LL(1) grammar. In general, there are no universal rules for converting multiply defined
entries into single-valued entries without affecting the language recognized by the parser.
The main difficulty in using predictive parsing is in writing a grammar for the source
language such that a predictive parser can be constructed from the grammar. Although left
recursion elimination and left factoring are easy to do, they make the resulting grammar hard
to read and difficult to use for translation purposes. To alleviate some of this difficulty, a
common organization for a parser in a compiler is to use a predictive parser for control
constructs and to use operator precedence for expressions. However, if an LR parser generator
is available, one can get all the benefits of predictive parsing and operator precedence
automatically.
Error recovery in predictive parsing can be handled using the following two methods:
Panic-mode recovery
Phrase-level recovery
Panic-mode error recovery is based on the idea of skipping symbols on the input until a
token in a selected set of synchronizing tokens appears. Its effectiveness depends on the
choice of synchronizing set. The sets should be chosen so that the parser recovers quickly
from errors that are likely to occur in practice.
As a starting point, we can place all symbols in FOLLOW(A) into the synchronizing set for
nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from
the stack, it is likely that parsing can continue.
It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if
semicolons terminate statements, as in C, then keywords that begin statements may not
appear in the FOLLOW set of the nonterminal generating expressions. A missing semicolon
after an assignment may therefore result in the keyword beginning the next statement being
skipped. Often, there is a hierarchical structure on constructs in a language; e.g., expressions
appear within statements, which appear within blocks, and so on. We can add to the
synchronizing set of a lower-level construct the symbols that begin higher-level constructs.
If a nonterminal can generate the empty string, then the production deriving ε can be
used as a default. Doing so may postpone some error detection, but cannot cause an error
to be missed. This approach reduces the number of nonterminals that have to be considered
during error recovery.
If a terminal on top of the stack cannot be matched, a simple idea is to pop the
terminal, issue a message saying that the terminal was inserted, and continue parsing. In
effect, this approach takes the synchronizing set of a token to consist of all other tokens.
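Two of the heuristics above can be sketched on top of a table-driven parser: skipping input until a token in FOLLOW(A) appears and then popping A, and popping an unmatched terminal while reporting it as inserted. The table and FOLLOW sets are assumptions written out for the usual expression grammar (E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id), and the function name and message wording are hypothetical.

```python
EPS = "ε"
NONTERMINALS = {"E", "E'", "T", "T'", "F"}
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [EPS], ("E'", "$"): [EPS],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [EPS], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [EPS], ("T'", "$"): [EPS],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
# Synchronizing sets: here simply FOLLOW(A), the starting point named above.
FOLLOW = {"E": {"$", ")"}, "E'": {"$", ")"}, "T": {"+", "$", ")"},
          "T'": {"+", "$", ")"}, "F": {"+", "*", "$", ")"}}


def parse_with_recovery(tokens):
    """Parse to the end of the input and return the list of error messages."""
    tokens = tokens + ["$"]
    stack, ip, errors = ["$", "E"], 0, []
    while True:
        X, a = stack[-1], tokens[ip]
        if X == "$":
            if a == "$":
                return errors
            errors.append(f"unexpected {a!r} skipped")
            ip += 1
        elif X not in NONTERMINALS:           # terminal on top of the stack
            if X == a:
                ip += 1
            else:
                errors.append(f"missing {X!r} inserted")
            stack.pop()
        else:
            body = TABLE.get((X, a))
            if body is None:
                errors.append(f"cannot start {X} with {a!r}")
                # Skip input up to a synchronizing token, then give up on X.
                # Every FOLLOW set here contains $, so the scan always stops.
                while tokens[ip] not in FOLLOW[X]:
                    ip += 1
                stack.pop()
            else:
                stack.pop()
                stack.extend(s for s in reversed(body) if s != EPS)


print(parse_with_recovery(["id", "+", "id"]))
print(parse_with_recovery(["(", "id"]))
print(parse_with_recovery(["id", "+", "*", "id"]))
```

Using FOLLOW(A) alone keeps the sketch short; as the text notes, a practical parser would usually enlarge the synchronizing sets.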
Phrase-level recovery involves filling the blank entries in the predictive parsing table with
pointers to error routines. These routines may change, delete, or insert symbols in the input,
and may also pop symbols from the stack.
Handles: A handle of a string is a substring that matches the right side of a production, and
whose reduction to the non-terminal on the left side of the production represents one step
along the reverse of a rightmost derivation.
Example:
Consider the grammar:
E→E+E
E→E*E
E→(E)
E→id
Handle pruning:
A rightmost derivation in reverse can be obtained by “handle pruning”. That is, if w is a sentence
of the grammar at hand, then w = γn, where γn is the nth right-sentential form of
some rightmost derivation.
Actions in shift-reduce parser:
• shift - The next input symbol is shifted onto the top of the stack.
• reduce - The parser replaces the handle on top of the stack with the non-terminal on the left side of the matching production.
• accept - The parser announces successful completion of parsing.
• error - The parser discovers that a syntax error has occurred and calls an error recovery routine.
1. Shift-reduce conflict:
Example:
Consider the grammar:
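E → E+E | E*E | id, with input id+id*id.
Once E+E is on the stack and * is the next input symbol, the parser can either reduce by E → E+E or
shift *. The grammar rules out neither choice, and this is a shift-reduce conflict.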
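The effect of the conflict can be made visible with a small experiment. The sketch below is a simplified shift-reduce loop for this grammar only, not a general parser: each id is treated as already reduced to E when it is shifted, and the `prefer_shift` flag is a hypothetical knob that decides what to do whenever both a shift and a reduction by E → E op E are possible.

```python
def shift_reduce(tokens, prefer_shift):
    """Parse with E -> E+E | E*E | id.
    Stack entries are operators or strings standing for a reduced E."""
    stack, i = [], 0
    while True:
        # A handle E op E is on top when the second entry from the top is an operator.
        can_reduce = len(stack) >= 3 and stack[-2] in ("+", "*")
        can_shift = i < len(tokens)
        if can_reduce and not (can_shift and prefer_shift):
            right, op, left = stack.pop(), stack.pop(), stack.pop()
            stack.append(f"({left}{op}{right})")    # reduce E op E to E
        elif can_shift:
            stack.append(tokens[i])                 # shift the next token
            i += 1
        else:
            return stack


tokens = ["id", "+", "id", "*", "id"]
print(shift_reduce(tokens, prefer_shift=False))
print(shift_reduce(tokens, prefer_shift=True))
```

Resolving the conflict in favour of reduce groups the input as (id+id)*id, while resolving it in favour of shift groups it as id+(id*id), so the two choices correspond to the two parse trees of the ambiguous grammar.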
INTRODUCTION TO LR PARSERS
LR(k) parsing is an efficient bottom-up syntax analysis technique that can be used to parse a large
class of context-free grammars. The ‘L’ is for left-to-right scanning of the input, the ‘R’ for
constructing a rightmost derivation in reverse, and the ‘k’ for the number of input symbols of
lookahead used in making parsing decisions. When ‘k’ is omitted, it is assumed to be 1.
Advantages of LR parsing:
1. It recognizes virtually all programming language constructs for which a CFG can be written.
2. It is an efficient non-backtracking shift-reduce parsing method.
3. The class of grammars that can be parsed using LR methods is a proper superset of the class of
grammars that can be parsed with predictive parsers.
4. It detects a syntactic error as soon as possible on a left-to-right scan of the input.
Drawbacks of LR method:
It is too much work to construct an LR parser by hand for a typical programming language grammar.
A specialized tool, called an LR parser generator, is needed. Example: YACC.
Types of LR parsing method:
1. SLR- Simple LR
Easiest to implement, least powerful.
2. CLR- Canonical LR
Most powerful, most expensive.
3. LALR- Look-Ahead LR
Intermediate in size and cost between the other two methods.
An LR parser consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (action and goto).
Action : The parsing program determines sm, the state currently on top of stack, and ai, the current
input symbol. It then consults action[sm,ai] in the action table which can have one of four values:
1. shift s, where s is a state,
2. reduce by a grammar production A → β,
3. accept,
4. error.
Goto : The function goto takes a state and grammar symbol as arguments and produces a state.
LR Parsing algorithm:
Input: An input string w and an LR parsing table with functions action and goto for grammar G.
Output: If w is in L(G), a bottom-up-parse for w; otherwise, an error indication.
Method: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input buffer.
An LR(0) item of a grammar G is a production of G with a dot at some position of the right side. For
example, production A → XYZ yields the four items :
A → . XYZ
A → X . YZ
A → XY . Z
A → XYZ .
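An item is conveniently represented as a production plus a dot position. The following sketch uses a hypothetical triple (head, body, dot) and lists the items of one production; a body of n symbols has n + 1 items.

```python
def lr0_items(head, body):
    """All LR(0) items of the production head -> body, as (head, body, dot) triples."""
    return [(head, tuple(body), dot) for dot in range(len(body) + 1)]


def show(item):
    """Render an item with '.' marking the dot position."""
    head, body, dot = item
    symbols = list(body[:dot]) + ["."] + list(body[dot:])
    return f"{head} -> {' '.join(symbols)}"


items = lr0_items("A", ["X", "Y", "Z"])
for item in items:
    print(show(item))
```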
Closure operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I by
the two rules:
1. Initially, every item in I is added to closure(I).
2. If A → α . Bβ is in closure(I) and B → γ is a production, then add the item B → . γ to
closure(I), if it is not already there. We apply this rule until no more new items can be added to closure(I).
Goto operation:
Goto(I, X) is defined to be the closure of the set of all items [A→ αX . β] such that [A→ α . Xβ] is in I.
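Both operations can be sketched directly from their definitions. The grammar below (S' → E, E → E + T | T, T → T * F | F, F → id) is an assumption chosen for illustration, items are hypothetical (head, body, dot) triples, and a worklist replaces the "repeat until nothing changes" wording of the closure rule.

```python
# Assumed augmented grammar for illustration.
GRAMMAR = {
    "S'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("id",)],
}


def closure(items):
    """Add B -> . γ for every nonterminal B that appears right after a dot."""
    result = set(items)
    pending = list(items)
    while pending:
        head, body, dot = pending.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for production in GRAMMAR[body[dot]]:
                new_item = (body[dot], production, 0)
                if new_item not in result:
                    result.add(new_item)
                    pending.append(new_item)
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over `symbol` wherever possible, then take the closure."""
    moved = {(h, b, d + 1) for (h, b, d) in items if d < len(b) and b[d] == symbol}
    return closure(moved)


start_state = closure({("S'", ("E",), 0)})
after_e_plus = goto(goto(start_state, "E"), "+")
print(len(start_state), "items in the initial state")
print(sorted(after_e_plus))
```

Returning a frozenset makes each item set hashable, so sets of items can later be compared and collected without extra bookkeeping.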
Steps to construct SLR parsing table for grammar G are:
1. Augment G and produce G’
2. Construct the canonical collection of set of items C for G’
3. Construct the parsing action function action and goto using the following algorithm that requires
FOLLOW(A) for each non-terminal of grammar.
Add the augmenting production S' → E and insert the '•' symbol at the first position of every production in G:
S' → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id
I0 State:
Add all productions starting with E to the I0 State because "•" is followed by the non-terminal E. So, the
I0 State becomes
I0 = S' → •E
E → •E + T
E → •T
Add all productions starting with T and F to the modified I0 State because "•" is followed by a
non-terminal. So, the I0 State becomes
I0 = S' → •E
E → •E + T
E → •T
T → •T * F
T → •F
F → •id
Add all productions starting with T and F to the I5 State because "•" is followed by a non-terminal. So,
the I5 State becomes
I5 = E → E +•T
T → •T * F
T → •F
F → •id
Add all productions starting with F to the I6 State because "•" is followed by the non-terminal F. So,
the I6 State becomes
I6 = T → T * •F
F → •id
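The whole collection of item sets can be generated by applying goto repeatedly, starting from I0. The sketch below is an illustration under assumptions: items are hypothetical (head, body, dot) triples, and the symbol order E, T, F, +, *, id is chosen so that the states happen to be discovered in the order used in this example. It checks that the states numbered 5 and 6 contain the items listed above.

```python
GRAMMAR = {
    "S'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("id",)],
}
SYMBOLS = ["E", "T", "F", "+", "*", "id"]     # exploration order (an assumption)


def closure(items):
    result, pending = set(items), list(items)
    while pending:
        _, body, dot = pending.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for production in GRAMMAR[body[dot]]:
                item = (body[dot], production, 0)
                if item not in result:
                    result.add(item)
                    pending.append(item)
    return frozenset(result)


def goto(items, symbol):
    return closure({(h, b, d + 1) for (h, b, d) in items
                    if d < len(b) and b[d] == symbol})


def canonical_collection():
    """Breadth-first discovery of every reachable set of items."""
    states = [closure({("S'", ("E",), 0)})]
    for state in states:                      # the list grows while we iterate
        for symbol in SYMBOLS:
            target = goto(state, symbol)
            if target and target not in states:
                states.append(target)
    return states


states = canonical_collection()
for number, state in enumerate(states):
    shown = sorted(f"{h} -> {' '.join(b[:d] + ('•',) + b[d:])}" for h, b, d in state)
    print(f"I{number}:", "; ".join(shown))
```

The transitions recorded by goto between these states are the edges of the DFA; collecting them alongside the states would be the next step towards the SLR table.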
Drawing DFA
Explanation: