4 Syntax Analysis - Bottom Up Parsing
Bottom-up parsing reverses the process used in top-down parsing. Instead of expanding successive
non-terminals according to production rules, the parser repeatedly collapses part of the current string (a right
sentential form) into a non-terminal until only the start non-terminal remains; i.e. a parse can be regarded as a
series of reductions.
This approach is also known as shift-reduce parsing and is the primary parsing method for many compilers,
mainly because it is fast and because tools exist that automatically generate a parser from the grammar.
Example
Consider the grammar below:
1 𝑆 → 𝑎𝐴𝐵𝑒
2 𝐴 → 𝐴𝑏𝑐|𝑏
3 𝐵→𝑑
Parsing the sentence 𝑎𝑏𝑏𝑐𝑑𝑒 using the bottom up approach gives the following reductions:
𝑎𝑏𝑏𝑐𝑑𝑒
𝑎𝐴𝑏𝑐𝑑𝑒 by 2 (𝐴 → 𝑏)
𝑎𝐴𝑑𝑒 by 2 (𝐴 → 𝐴𝑏𝑐)
𝑎𝐴𝐵𝑒 by 3 (𝐵 → 𝑑)
𝑆 by 1 (𝑆 → 𝑎𝐴𝐵𝑒)
Read in reverse, these steps give the rightmost derivation:
𝑆 ⇒ 𝑎𝐴𝐵𝑒 ⇒ 𝑎𝐴𝑑𝑒 ⇒ 𝑎𝐴𝑏𝑐𝑑𝑒 ⇒ 𝑎𝑏𝑏𝑐𝑑𝑒
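As a quick check of the derivation in this example, the following sketch replays it in Python by always expanding the rightmost non-terminal. The function name `rightmost_derivation` and the plain ASCII spelling of the grammar symbols are assumptions made for illustration only.

```python
# Grammar from the example: 1) S -> aABe   2) A -> Abc | b   3) B -> d
NONTERMINALS = {"S", "A", "B"}

def rightmost_derivation(start, steps):
    """Apply each (head, body) step to the rightmost non-terminal of the form."""
    forms = [start]
    form = start
    for head, body in steps:
        # Position of the rightmost non-terminal in the current form.
        idx = max(i for i, ch in enumerate(form) if ch in NONTERMINALS)
        if form[idx] != head:
            raise ValueError(f"rightmost non-terminal is {form[idx]}, not {head}")
        form = form[:idx] + body + form[idx + 1:]
        forms.append(form)
    return forms

# The reductions of the example, listed in reverse order.
STEPS = [("S", "aABe"), ("B", "d"), ("A", "Abc"), ("A", "b")]
FORMS = rightmost_derivation("S", STEPS)
print(" => ".join(FORMS))
```

Reading `FORMS` from last to first gives back the sequence of reductions performed by the bottom-up parse.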
Generally, bottom-up parsing starts from the leaf nodes of a tree and works in upward direction till it reaches
the root node. Here, we start from a sentence and then apply production rules in reverse manner in order to
reach the start symbol. The figure below depicts the bottom-up parsers available.
Disadvantages
The primary drawback of LR parsers is that constructing LR parsing tables by hand requires too much work.
However, tools exist to generate LR parsers from a given grammar, e.g. parser generators such as YACC and
Bison.
LR Parsing Methods
There are three widely used algorithms for constructing an LR parser:
i) SLR: this stands for Simple LR. It is easy to implement but less powerful than the other methods. It
handles the smallest class of grammars and has few states, hence a very small table.
ii) Canonical LR: this is the most general and powerful method. However, it is tedious and costly to
implement; for the same grammar, it has many more states than an SLR parser. It handles the
complete set of LR(1) grammars and generates a large table with a large number of states.
iii) LALR: this stands for Lookahead LR. It lies between SLR and canonical LR in power, yet it can be
implemented efficiently, i.e. it has the same number of states as the Simple LR parser for the same
grammar.
Notice that most parser generators generate LALR parsers since they are a trade-off between power and
efficiency.
The input stream holds terminals, while the stack can hold a mixture of terminals and non-terminals, the latter
generated by earlier reductions.
The shift operation moves a symbol from the input to the stack, while the reduce operation replaces the
symbols on top of the stack that match the right hand side of a production with the non-terminal on its left
hand side.
When the input is exhausted and all reductions have been performed, the stack should hold only the start
symbol.
Example
Consider the following grammar:
𝑒𝑥𝑝 → 𝑒𝑥𝑝 + 𝑒𝑥𝑝
𝑒𝑥𝑝 → 𝑒𝑥𝑝 ∗ 𝑒𝑥𝑝
LL | LR
Does a leftmost derivation. | Does a rightmost derivation in reverse.
Starts with the root nonterminal on the stack. | Ends with the root nonterminal on the stack.
Ends when the stack is empty. | Starts with an empty stack.
Uses the stack for designating what is still to be expected. | Uses the stack for designating what is already seen.
Builds the parse tree top-down. | Builds the parse tree bottom-up.
Continuously pops a nonterminal off the stack, and pushes the corresponding right hand side. | Tries to recognize a right hand side on the stack, pops it, and pushes the corresponding nonterminal.
Expands the non-terminals. | Reduces the non-terminals.
Reads the terminals when it pops one off the stack. | Reads the terminals while it pushes them on the stack.
Pre-order traversal of the parse tree. | Post-order traversal of the parse tree.
Handles
A handle of a right-sentential form γ is a production A → β and a position in γ where β may be found and
replaced by A to produce the previous right sentential form in a rightmost derivation of γ i.e.,
if S ⇒*rm αAw ⇒rm αβw, then A → β in the position following α is a handle of αβw.
Because γ is a right-sentential form, the string w to the right of a handle contains only terminal symbols.
Informally, a "handle" is a substring that matches the body of a production, and whose reduction represents
one step along the reverse of a rightmost derivation; reducing it therefore still allows further reductions back
to the start symbol. A substring that merely matches a production body is not necessarily a handle: the
production must be applied at the specific position used in the rightmost derivation.
For convenience, we refer to the body β rather than A → β as a handle.
Handle-pruning
The process used to construct a bottom-up parse is called handle-pruning. Start with a string of terminals w to
be parsed. If w is a sentence of the grammar at hand, then let w = γn, where γn is the nth right-sentential form of
some as yet unknown rightmost derivation
S = γ0 ⇒rm γ1 ⇒rm γ2 ⇒rm ... ⇒rm γn = w
Compiler Construction ~ Wainaina Page 3 of 12
To reconstruct this derivation in reverse order, we locate the handle βn in γn and replace βn by the head of the
relevant production An → βn to obtain the previous right-sentential form γn-1.
Then repeat this process. That is; locate the handle βn-1 in γn-1 and reduce this handle to obtain the right-
sentential form γn-2. If by continuing this process we produce a right-sentential form consisting only of the start
symbol S, then we halt and announce successful completion of parsing.
Algorithm
Apply the following simple algorithm:
For i = n down to 1
1. Find the handle Ai → βi in γi
2. Replace βi with Ai to generate γi−1
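The algorithm does not say how a handle is found. As an illustration only, the sketch below finds handles by backtracking search for the grammar S → aABe, A → Abc | b, B → d: it tries every substring that matches a production body and has only terminals to its right, and keeps a reduction only if it eventually leads back to S. Practical parsers locate handles with tables rather than by search; the name `handle_prune` is hypothetical.

```python
GRAMMAR = [("S", "aABe"), ("A", "Abc"), ("A", "b"), ("B", "d")]
NONTERMINALS = {head for head, _ in GRAMMAR}

def handle_prune(form, start="S"):
    """Return the forms from `form` back to `start`, or None if no parse exists."""
    if form == start:
        return [form]
    for head, body in GRAMMAR:
        pos = form.find(body)
        while pos != -1:
            suffix = form[pos + len(body):]
            # Only terminals may appear to the right of a handle.
            if not any(ch in NONTERMINALS for ch in suffix):
                rest = handle_prune(form[:pos] + head + suffix, start)
                if rest is not None:
                    return [form] + rest
            pos = form.find(body, pos + 1)
    return None

FORMS = handle_prune("abbcde")
print(FORMS)
```

Requiring a terminal-only suffix means each accepted step is the reverse of a rightmost derivation step, so the returned list is the sequence γn, γn−1, ..., γ0.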
Stack implementation
One scheme to implement a handle-pruning, bottom-up parser is to use a shift-reduce parser.
Shift-reduce parsers use a stack and an input buffer as shown below
1. Initialize the stack with $
2. Repeat until the top of the stack is the goal symbol and the input token is $
   i). Find the handle:
       if we don't have a handle on top of the stack, shift an input symbol onto the stack
   ii). Prune the handle:
       if we have a handle A → β on the stack, reduce:
       pop |β| symbols off the stack
       push A onto the stack
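The loop above can be sketched in Python as follows. This is a minimal sketch under a strong assumption: for the small grammar E → E + T | T, T → id | ( E ), any production body found on top of the stack (trying longer bodies first) is treated as a handle. That happens to work for this grammar but not in general, which is why LR parsers use a DFA instead. The names `RULES` and `shift_reduce` are illustrative.

```python
# Bodies are tried in this order, so "E + T" wins over the shorter "T".
RULES = [
    ("E", ["E", "+", "T"]),
    ("T", ["(", "E", ")"]),
    ("E", ["T"]),
    ("T", ["id"]),
]

def shift_reduce(tokens, goal="E"):
    """Return (accepted, trace) for a list of tokens."""
    stack = ["$"]
    rest = list(tokens) + ["$"]
    trace = []
    while True:
        for head, body in RULES:
            if stack[-len(body):] == body:
                # Prune the handle: pop |body| symbols, push the head.
                del stack[-len(body):]
                stack.append(head)
                trace.append(f"reduce {head} -> {' '.join(body)}")
                break
        else:
            # No handle on top of the stack: shift, or stop at end of input.
            if rest[0] == "$":
                break
            stack.append(rest.pop(0))
            trace.append(f"shift {stack[-1]}")
    return stack == ["$", goal], trace

OK, TRACE = shift_reduce(["id", "+", "id"])
print(OK)
print("\n".join(TRACE))
```

The parse succeeds when the stack holds only $ and the goal symbol once the input is exhausted; anything else left on the stack signals a syntax error.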
LR Parsers
As with LL(1), the aim of LR parsers is to make the choice of action depend only on the next input symbol and
the top of the stack. To achieve this, a DFA needs to be constructed. The DFA is used to determine the next
action, and it only needs to look at the current state (stored at the top of the stack) and either the next input
symbol (to choose a shift or reduce action) or the nonterminal just produced (to choose the state entered after
a reduction).
We represent the DFA as a table, where we cross-index a DFA state with a symbol (terminal or nonterminal).
A shift-reduce parser's DFA can be encoded in two tables, each with one row for each state:
the action table encodes what to do given the current state and the next input symbol
the goto table encodes the transitions to take after a reduction
Once the DFA has been encoded as a table, the parser performs one of the following actions at each step:
Actions (1)
Given the current state and input symbol, the main possible actions are
si – shift the input symbol and state i onto the stack (i.e., shift and move to state i )
rj – reduce using grammar production j
The production tells us how many <symbol, state> pairs to pop off the stack: one pair for each symbol in its body
Actions (2)
Other possible action table entries
accept
blank – no transition, which means syntax error
An LR parser will detect an error as soon as possible on a left-to-right scan
A real compiler needs to produce an error message, recover, and continue parsing when this happens
Goto
When a reduction is performed, <symbol, state> pairs are popped from the stack revealing a state
uncovered_s on the top of the stack
goto[uncovered_s, A] is the new state to push on the stack when reducing by production A → β (after popping
β and revealing state uncovered_s on top)
Skeleton parser:
This takes k shifts, l reduces, and 1 accept, where k is the length of the input string and l is the length of the
reverse rightmost derivation
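A table-driven parser along these lines might look like the sketch below. The ACTION and GOTO tables are hand-built LR(0) tables for the small grammar S → E$, E → E + T | T, T → id | ( E ), with productions numbered 2 to 5 in that order. As simplifications, the stack stores only states rather than <symbol, state> pairs, and seeing the end marker $ in state 1 triggers accept instead of a shift. All names are illustrative.

```python
# production number -> (head, length of body)
PRODUCTIONS = {2: ("E", 3), 3: ("E", 1), 4: ("T", 1), 5: ("T", 3)}
TERMINALS = ("id", "+", "(", ")", "$")

def reduce_on_all(j):
    # LR(0): a state with a completed item reduces whatever the next symbol is.
    return {t: ("r", j) for t in TERMINALS}

ACTION = {
    0: {"id": ("s", 5), "(": ("s", 6)},
    1: {"+": ("s", 3), "$": ("acc", None)},
    3: {"id": ("s", 5), "(": ("s", 6)},
    4: reduce_on_all(2),
    5: reduce_on_all(4),
    6: {"id": ("s", 5), "(": ("s", 6)},
    7: {"+": ("s", 3), ")": ("s", 8)},
    8: reduce_on_all(5),
    9: reduce_on_all(3),
}
GOTO = {0: {"E": 1, "T": 9}, 3: {"T": 4}, 6: {"E": 7, "T": 9}}

def lr_parse(tokens):
    """Return (accepted, number of shifts, number of reduces)."""
    stack = [0]
    rest = list(tokens) + ["$"]
    shifts = reduces = 0
    while True:
        # A missing (blank) entry means syntax error.
        kind, arg = ACTION[stack[-1]].get(rest[0], ("err", None))
        if kind == "s":
            stack.append(arg)
            rest.pop(0)
            shifts += 1
        elif kind == "r":
            head, body_len = PRODUCTIONS[arg]
            del stack[-body_len:]                 # reveal uncovered_s
            stack.append(GOTO[stack[-1]][head])   # goto[uncovered_s, head]
            reduces += 1
        else:
            return kind == "acc", shifts, reduces

RESULT = lr_parse(["id", "+", "(", "id", ")"])
print(RESULT)
```

For this input the sketch performs one shift per token and one reduce per step of the reverse rightmost derivation, followed by a single accept.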
Structure of an LR parser
LR(k) items
The table construction algorithms use sets of LR(k) items or configurations to represent the possible states in a
parse. An LR(k) item is a pair [α, β], where
α is a production from G with a • at some position in the RHS, marking how much of the RHS of a
production has already been seen
β is a lookahead string containing k symbols (terminals or $)
Two cases of interest are k = 0 and k = 1:
LR(0) items play a key role in the SLR(1) table construction algorithm. An LR(0) item carries no lookahead
symbols.
LR(1) items play a key role in the LR(1) and LALR(1) table construction algorithms.
Note: an LR(k) parser recognizes an occurrence of β (the handle) after having seen everything derived from β
plus k symbols of lookahead
Example
The • indicates how much of an item we have seen at a given state in the parse:
[A → • X Y Z] indicates that the parser is looking for a string that can be derived from X Y Z
[A → X Y • Z] indicates that the parser has seen a string derived from X Y and is looking for one derivable from Z
Generally, an LR(0) item is a grammar rule with a dot at some position on the right-hand side. For example, the
rule A → X Y Z generates four LR(0) items:
1. [A → • X Y Z]
2. [A → X • Y Z]
3. [A → X Y • Z]
4. [A → X Y Z •]
An LR(0) state is a set of (LR(0)) items. The items of a state indicate how much of the input has been
recognized.
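Enumerating the LR(0) items of a rule is mechanical: a rule with n symbols on its right-hand side has n + 1 possible dot positions. A minimal sketch, in which the tuple representation (head, body, dot position) and the function names are assumptions for illustration:

```python
def lr0_items(head, body):
    """Return every LR(0) item of head -> body as (head, body, dot_position)."""
    return [(head, tuple(body), dot) for dot in range(len(body) + 1)]

def show(item):
    """Format an item in the bracket notation used in the text."""
    head, body, dot = item
    symbols = list(body[:dot]) + ["•"] + list(body[dot:])
    return f"[{head} → {' '.join(symbols)}]"

ITEMS = lr0_items("A", ["X", "Y", "Z"])
for item in ITEMS:
    print(show(item))
```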
i). closure0
Given an item [A → α • Bβ], its closure contains the item and any other items that can generate legal
substrings to follow α. Thus, if the parser has viable prefix α on its stack, the input should reduce to Bβ (or γ
for some other item [B → •γ] in the closure).
function closure0(I)
    repeat
        for each item [A → α • B β] ∈ I
            for each production B → γ
                add [B → • γ] to I
    until no more items can be added to I
    return I
Generally, if I contains A → μ • B χ, then closure(I) adds to I all items B → • ν for each rule for B,
continuing recursively, i.e. adds all rules that may be needed in recognizing B.
Example, for the expression grammar
1) E' → E
2) E → E + T
3) E → T
4) T → T * F
5) T → F
6) F → ( E )
7) F → id
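For this grammar, the closure of the single item [E' → • E] can be computed with the sketch below. Items are represented as (production index, dot position) pairs, with productions indexed from 0 rather than 1; this representation and the worklist formulation are illustrative choices, not part of the original algorithm.

```python
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]

def closure0(items):
    """Add [B -> . gamma] for every B that appears right after a dot."""
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        body = GRAMMAR[prod][1]
        if dot == len(body):
            continue  # dot at the end: nothing follows it
        for q, (head, _) in enumerate(GRAMMAR):
            if head == body[dot] and (q, 0) not in result:
                result.add((q, 0))
                work.append((q, 0))
    return frozenset(result)

def show(item):
    prod, dot = item
    head, body = GRAMMAR[prod]
    return f"{head} → {' '.join(body[:dot] + ('•',) + body[dot:])}"

I0 = closure0({(0, 0)})
for item in sorted(I0):
    print(show(item))
```

Because E, T and F are all reachable from the symbol after the dot, the closure contains one item with a leading dot for every production of the grammar.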
ii). goto0
Let I be a set of LR(0) items and X be a grammar symbol, then, GOTO(I,X) is the closure of the set of all
items [A → αX • β] such that [A → α • Xβ] ∈ I
If I is the set of valid items for some viable prefix γ, then GOTO(I,X) is the set of valid items for the viable
prefix γX.
function goto0(I, X)
    let J be the set of items [A → α X • β]
        such that [A → α • X β] ∈ I
    return closure0(J)
The result of goto(I, X) contains, for every item A → μ • X χ in I, the closure of A → μ X • χ, i.e. the state
reached after X is recognized, for X ∈ V. For example, if I = {E' → E •, E → E • + T}, then goto(I, +) is:
E → E + • T
T → • T * F
T → • F
F → • ( E )
F → • id
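The goto computation for the expression grammar can be checked with a short sketch. It uses a two-item set containing [E' → E •] and [E → E • + T]; only the second item has + after the dot, so it alone contributes to the result. Items are (production index, dot position) pairs with productions indexed from 0, which is an illustrative representation.

```python
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]

def closure0(items):
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        body = GRAMMAR[prod][1]
        if dot == len(body):
            continue
        for q, (head, _) in enumerate(GRAMMAR):
            if head == body[dot] and (q, 0) not in result:
                result.add((q, 0))
                work.append((q, 0))
    return frozenset(result)

def goto0(items, symbol):
    """Move the dot over `symbol` wherever possible, then take the closure."""
    moved = {
        (prod, dot + 1)
        for prod, dot in items
        if dot < len(GRAMMAR[prod][1]) and GRAMMAR[prod][1][dot] == symbol
    }
    return closure0(moved)

I = {(0, 1), (1, 1)}          # E' -> E .   and   E -> E . + T
RESULT = goto0(I, "+")
print(sorted(RESULT))
```

An empty result, as for `goto0(I, "*")`, corresponds to a missing transition in the DFA.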
We assume without loss of generality that each grammar contains a rule S' → S, where S' is the start symbol
and S' does not occur anywhere else.
The following algorithm computes the collection of sets of LR(0) items:
function items(G’)
    s0 = closure0({[S’ → • S$]})
    S = {s0}
    repeat
        for each set of items s ∈ S
            for each grammar symbol X
                if goto0(s,X) ≠ ∅ and goto0(s,X) ∉ S
                    add goto0(s,X) to S
    until no more sets of items can be added to S
    return S
Example of LR(0)
Consider the following grammar
1. S → E$
2. E → E + T
3. |T
4. T → id
5. | (E)
I0 : S → • E $
     E → • E + T
     E → • T
     T → • id
     T → • ( E )
I1 : S → E • $
     E → E • + T
I2 : S → E $ •
I3 : E → E + • T
     T → • id
     T → • ( E )
I4 : E → E + T •
I5 : T → id •
I6 : T → ( • E )
     E → • E + T
     E → • T
     T → • id
     T → • ( E )
I7 : T → ( E • )
     E → E • + T
I8 : T → ( E ) •
I9 : E → T •
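The collection I0 to I9 can be reproduced with a compact sketch of closure0, goto0 and items. Items are (production index, dot position) pairs with productions indexed from 0, $ is treated as an ordinary terminal, and states are discovered in an order that need not match the numbering used here; these are all illustrative choices.

```python
GRAMMAR = [
    ("S", ("E", "$")),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("id",)),
    ("T", ("(", "E", ")")),
]
SYMBOLS = ["E", "T", "id", "+", "(", ")", "$"]

def closure0(items):
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        body = GRAMMAR[prod][1]
        if dot == len(body):
            continue
        for q, (head, _) in enumerate(GRAMMAR):
            if head == body[dot] and (q, 0) not in result:
                result.add((q, 0))
                work.append((q, 0))
    return frozenset(result)

def goto0(items, symbol):
    moved = {
        (prod, dot + 1)
        for prod, dot in items
        if dot < len(GRAMMAR[prod][1]) and GRAMMAR[prod][1][dot] == symbol
    }
    return closure0(moved)

def items():
    """Collect every non-empty item set reachable from the start state."""
    states = [closure0({(0, 0)})]
    for state in states:          # the list grows while it is being scanned
        for symbol in SYMBOLS:
            target = goto0(state, symbol)
            if target and target not in states:
                states.append(target)
    return states

STATES = items()
print(len(STATES))
```

Scanning a list that grows during iteration plays the role of the repeat loop: the scan ends exactly when no new set of items has been added.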