Syntax Analysis
Syntax definition
Context free grammars
a set of tokens (terminal symbols)
a set of non terminal symbols
a set of productions of the form
nonterminal → string of terminals and non terminals
a start symbol
A grammar is written as the tuple <T, N, P, S>
A grammar derives strings by beginning with start
symbol and repeatedly replacing a non terminal by
the right hand side of a production for that non
terminal.
The strings that can be derived from the start
symbol of a grammar G form the language L(G)
defined by the grammar.
Examples
String of balanced parentheses
S → ( S ) S | ε
Grammar
list → list + digit
| list - digit
| digit
digit → 0 | 1 | … | 9
The language consists of lists of
digits separated by + or -.
Derivation
list ⇒ list + digit
⇒ list - digit + digit
⇒ digit - digit + digit
⇒ 9 - digit + digit
⇒ 9 - 5 + digit
⇒ 9 - 5 + 2
Therefore, the string 9-5+2 belongs to the
language specified by the grammar
The name context free comes from the fact
that use of a production X → α does not
depend on the context of X
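The derivation of 9-5+2 can be replayed mechanically. The following is a minimal Python sketch, assuming the list/digit grammar above; the names GRAMMAR, STEPS and derive are hypothetical and only illustrate the idea of repeatedly replacing a non terminal by the right hand side of one of its productions.

```python
# Grammar from the slide: list -> list + digit | list - digit | digit
GRAMMAR = {
    "list": [["list", "+", "digit"], ["list", "-", "digit"], ["digit"]],
    "digit": [[str(d)] for d in range(10)],
}

def derive(start, steps):
    """Apply each (nonterminal, rhs) step to the leftmost occurrence
    of that nonterminal and return every sentential form produced."""
    form = [start]
    forms = [" ".join(form)]
    for nonterminal, rhs in steps:
        if rhs not in GRAMMAR[nonterminal]:
            raise ValueError(f"{nonterminal} -> {rhs} is not a production")
        i = form.index(nonterminal)      # raises ValueError if absent
        form[i:i + 1] = rhs              # replace the nonterminal by the rhs
        forms.append(" ".join(form))
    return forms

# The six steps used in the derivation of 9 - 5 + 2
STEPS = [
    ("list", ["list", "+", "digit"]),
    ("list", ["list", "-", "digit"]),
    ("list", ["digit"]),
    ("digit", ["9"]),
    ("digit", ["5"]),
    ("digit", ["2"]),
]

if __name__ == "__main__":
    for sentential_form in derive("list", STEPS):
        print(sentential_form)
```

Every string printed is a sentential form; the last one contains only tokens and is therefore a string of the language.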
Examples
Grammar for Pascal block
block → begin statements end
statements → stmt-list | ε
stmt-list → stmt-list ; stmt
| stmt
Syntax analyzers
Testing for membership whether w
belongs to L(G) is just a yes or no
answer
However the syntax analyzer
Must generate the parse tree
Handle errors gracefully if string is not in
the language
Derivation
If there is a production A → α then we say that A
derives α and is denoted by A ⇒ α
α A β ⇒ α γ β if A → γ is a production
If α1 ⇒ α2 ⇒ … ⇒ αn then α1 ⇒+ αn
Given a grammar G and a string w of terminals
in L(G) we can write S ⇒+ w
If S ⇒* α where α is a string of terminals and non
terminals of G then we say that α is a sentential
form of G
Derivation
Parse tree
It shows how the start symbol of a grammar
derives a string in the language
root is labeled by the start symbol
leaf nodes are labeled by tokens
Each internal node is labeled by a non
terminal
if A is a non-terminal labeling an internal
node and x1, x2, …, xn are labels of children of
that node then A → x1 x2 … xn is a production
Example
Parse tree for 9-5+2
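[Figure: parse tree with root list, whose children are list, + and digit (2); the inner list has children list, - and digit (5); the innermost list derives digit (9)]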
Ambiguity
A Grammar can have more than
one parse tree for a string
Consider grammar
string → string + string
| string - string
| 0 | 1 | … | 9
String 9-5+2 has two parse trees
[Figure: the two parse trees for 9-5+2, one grouping the string as (9-5)+2 and the other as 9-(5+2)]
Ambiguity
Associativity
Precedence
String a+5*2 has two possible
interpretations because of two
different parse trees
corresponding to
(a+5)*2 and a+(5*2)
Precedence determines the
correct interpretation.
Parsing
Example
Construction of the parse tree is done by
starting with a root labeled by the start symbol and
repeating the following two steps
at a node labeled with non terminal A, select
one of the productions of A and construct
the children nodes
(Which production?)
find the next node at which a subtree is to be
constructed (Which node?)
Parse
array [ num dotdot num ] of integer
Start symbol: type
Expanded using the rule type → simple
Can not proceed, as non terminal simple never
generates a string beginning with token array.
Therefore, this choice requires back-tracking.
[Figure: with look-ahead token array, type is expanded to array [ simple ] of type; simple then derives num dotdot num, and the inner type derives simple, which derives integer]
First set:
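{ match is declared first because simple calls it }
procedure match(t: token);
begin
  if lookahead = t
    then lookahead := nexttoken   { advance to the next input token }
    else error
end;

{ simple → integer | char | num dotdot num }
procedure simple;
begin
  if lookahead = integer
    then match(integer)
  else if lookahead = char
    then match(char)
  else if lookahead = num
    then begin
      match(num);
      match(dotdot);
      match(num)
    end
  else
    error
end;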
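The same idea can be written as a small recursive descent parser in Python. This is only a sketch: the productions type → array [ simple ] of type | simple are an assumption inferred from the array example, the token names are plain strings, and the class and function names (Parser, ParseError, parse) are hypothetical.

```python
class ParseError(Exception):
    """Raised when the lookahead token does not fit any production."""

class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" is an assumed end marker
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):
        # Consume the lookahead if it is the expected token.
        if self.lookahead == t:
            self.pos += 1
        else:
            raise ParseError(f"expected {t!r}, found {self.lookahead!r}")

    def simple(self):
        # simple -> integer | char | num dotdot num
        if self.lookahead == "integer":
            self.match("integer")
        elif self.lookahead == "char":
            self.match("char")
        elif self.lookahead == "num":
            self.match("num")
            self.match("dotdot")
            self.match("num")
        else:
            raise ParseError(f"unexpected token {self.lookahead!r}")

    def type_(self):
        # Assumed: type -> array [ simple ] of type | simple
        # The lookahead token alone decides which production to use.
        if self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            self.simple()

def parse(tokens):
    parser = Parser(tokens)
    parser.type_()
    parser.match("$")        # the whole input must be consumed
    return True

if __name__ == "__main__":
    print(parse("array [ num dotdot num ] of integer".split()))
```

Because each procedure looks only at the current lookahead token before choosing a production, no back-tracking is needed.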
Ambiguity
Dangling else problem
stmt → if expr then stmt
| if expr then stmt else stmt
According to this grammar, the string
if e1 then if e2 then s1 else s2
has two parse trees
[Figure: the two parse trees for if e1 then if e2 then s1 else s2; in one tree else s2 is attached to the inner if e2, in the other to the outer if e1]
Left recursion
A top down parser with production
A → A α may loop forever
From the grammar A → A α | β
left recursion may be eliminated by
transforming the grammar to
A → β R
R → α R | ε
[Figure: parse tree corresponding to the left recursive grammar]
Example
Consider grammar for arithmetic expressions
E → E + T | T
T → T * F | F
F → ( E ) | id
After removal of left recursion the grammar becomes
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
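The transformation A → Aα | β into A → βR, R → αR | ε can be sketched in Python as follows. The function name and the list-of-lists grammar representation are assumptions made for illustration; the sketch handles only immediate left recursion and names the new non terminal by appending a prime.

```python
EPS = "ε"

def eliminate_immediate_left_recursion(nt, productions):
    """productions: list of right-hand sides, each a list of symbols.
    Returns a dict with the rewritten productions for nt and, if needed,
    for the new non terminal nt'."""
    # Split A -> A alpha | beta into the alphas and the betas.
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == nt]
    betas = [rhs for rhs in productions if not rhs or rhs[0] != nt]
    if not alphas:
        return {nt: productions}          # nothing to do
    new_nt = nt + "'"
    return {
        # A -> beta A'   (an ε alternative becomes just A')
        nt: [[s for s in beta if s != EPS] + [new_nt] for beta in betas],
        # A' -> alpha A' | ε
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[EPS]],
    }

if __name__ == "__main__":
    print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
    print(eliminate_immediate_left_recursion("T", [["T", "*", "F"], ["F"]]))
```

Applied to E → E + T | T this yields E → T E' and E' → + T E' | ε, matching the grammar above.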
S → Aa | b
A → Ac | Aad | bd | ε
After the second step (removal of left recursion)
the grammar becomes
S → Aa | b
A → bdA' | A'
A' → cA' | adA' | ε
Left factoring
In top-down parsing when it is not clear which
production to choose for expansion of a symbol
defer the decision till we have seen enough input.
In general if A → αβ1 | αβ2
defer decision by expanding A to αA'
we can then expand A' to β1 or β2
Therefore A → αβ1 | αβ2
transforms to
A → αA'
A' → β1 | β2
Predictive parsers
A non recursive top down parsing method
Parser predicts which production to use
It removes backtracking by fixing one
production for every non-terminal and input
token(s)
Predictive parsers accept LL(k) languages
First L stands for left to right scan of input
Second L stands for leftmost derivation
k stands for the number of lookahead tokens
In practice LL(1) is used
Predictive parsing
Predictive parser can be implemented
by maintaining an external stack
[Figure: the parser reads the input, uses a stack and the parse table, and produces the output]
Parse table is a
two dimensional array
M[X,a] where X is a
non terminal and a is
a terminal of the grammar
Parsing algorithm
The parser considers 'X' the symbol on top of stack, and 'a'
the current input symbol
These two symbols determine the action to be taken by the
parser
Assume that '$' is a special token that is at the bottom of
the stack and terminates the input string
if X = a = $ then halt
if X = a ≠ $ then pop(X) and ip++
if X is a non terminal
  then if M[X,a] = {X → UVW}
    then begin pop(X); push(W,V,U)
    end
  else error
Example
Consider the grammar
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
Parse table for the grammar
      id        +          *          (         )        $
E     E→TE'                           E→TE'
E'              E'→+TE'                         E'→ε     E'→ε
T     T→FT'                           T→FT'
T'              T'→ε       T'→*FT'              T'→ε     T'→ε
F     F→id                            F→(E)
Example
Stack       input             action
$E          id + id * id $    expand by E→TE'
$E'T        id + id * id $    expand by T→FT'
$E'T'F      id + id * id $    expand by F→id
$E'T'id     id + id * id $    pop id and ip++
$E'T'       + id * id $       expand by T'→ε
$E'         + id * id $       expand by E'→+TE'
$E'T+       + id * id $       pop + and ip++
$E'T        id * id $         expand by T→FT'
$E'T'F      id * id $         expand by F→id
$E'T'id     id * id $         pop id and ip++
$E'T'       * id $            expand by T'→*FT'
$E'T'F*     * id $            pop * and ip++
$E'T'F      id $              expand by F→id
$E'T'id     id $              pop id and ip++
$E'T'       $                 expand by T'→ε
$E'         $                 expand by E'→ε
$           $                 halt
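The parsing algorithm and the trace can be reproduced with a short table-driven parser. The Python sketch below hard-codes the LL(1) parse table of the expression grammar; the names TABLE, NONTERMINALS and ll1_parse are hypothetical, and error handling is reduced to raising SyntaxError.

```python
EPS = "ε"
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

# M[X, a]: right-hand side to expand X with when the lookahead is a
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [EPS], ("E'", "$"): [EPS],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],
    ("T'", "+"): [EPS], ("T'", ")"): [EPS], ("T'", "$"): [EPS],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}

def ll1_parse(tokens, start="E"):
    """Return the list of productions used, in the order they are applied."""
    tokens = list(tokens) + ["$"]
    stack = ["$", start]              # top of the stack is the end of the list
    ip = 0
    output = []
    while True:
        X, a = stack[-1], tokens[ip]
        if X == a == "$":
            return output             # halt
        if X not in NONTERMINALS:     # X is a terminal (or $)
            if X != a:
                raise SyntaxError(f"expected {X!r}, found {a!r}")
            stack.pop()
            ip += 1
        else:
            rhs = TABLE.get((X, a))
            if rhs is None:
                raise SyntaxError(f"no entry M[{X}, {a}]")
            stack.pop()
            output.append((X, rhs))
            # push the right-hand side in reverse so its first symbol is on top
            stack.extend(s for s in reversed(rhs) if s != EPS)

if __name__ == "__main__":
    for lhs, rhs in ll1_parse(["id", "+", "id", "*", "id"]):
        print(lhs, "→", " ".join(rhs))
```

For id + id * id the productions are printed in the same order as the expand actions of the trace.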
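First and Follow
First(α) is the set of terminals that begin the strings derived from α; ε is in First(α) if α ⇒* ε
Follow(A) is the set of terminals that can appear immediately to the right of A in some sentential form; $ is in Follow(S) for the start symbol S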
Example
For the expression grammar
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
First(E) = First(T) = First(F) = { (, id }
First(E') = { +, ε }
First(T') = { *, ε }
Example
For the expression grammar
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
follow(E) = follow(E') = { $, ) }
follow(T) = follow(T') = { $, ), + }
follow(F) = { $, ), +, * }
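The first and follow sets listed above can be computed by iterating until nothing changes. The following Python sketch assumes the expression grammar after removal of left recursion; the function names and the dictionary representation of the grammar are hypothetical.

```python
EPS = "ε"
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_of_string(symbols, first):
    """First set of a string of grammar symbols, given the current first sets."""
    result = set()
    for s in symbols:
        if s == EPS:
            continue
        f = first[s] if s in first else {s}   # a terminal is its own first set
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)                            # every symbol can derive ε
    return result

def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                for i, s in enumerate(rhs):
                    if s not in grammar:
                        continue               # only non terminals have follow sets
                    tail = first_of_string(rhs[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:            # what follows nt also follows s
                        new |= follow[nt]
                    if not new <= follow[s]:
                        follow[s] |= new
                        changed = True
    return follow

if __name__ == "__main__":
    FIRST = compute_first(GRAMMAR)
    FOLLOW = compute_follow(GRAMMAR, FIRST, "E")
    for nt in GRAMMAR:
        print(nt, sorted(FIRST[nt]), sorted(FOLLOW[nt]))
```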
Practice Assignment
Construct LL(1) parse table for the expression
grammar
bexpr → bexpr or bterm | bterm
bterm → bterm and bfactor | bfactor
bfactor → not bfactor | ( bexpr ) | true | false
Steps to be followed
Remove left recursion
Compute first sets
Compute follow sets
Construct the parse table
Not to be submitted
Error handling
Stop at the first error and print a message
Compiler writer friendly
But not user friendly
Every reasonable compiler must recover from error and identify as
many errors as possible
However, multiple error messages due to a single fault must be
avoided
Error recovery methods
Panic mode
Phrase level recovery
Error productions
Global correction
Panic mode
Simplest and the most popular method
Most tools provide for specifying panic
mode recovery in the grammar
When an error is detected
Discard tokens one at a time until a set of
tokens is found whose role is clear
Skip to the next token that can be placed
reliably in the parse tree
Panic mode
Consider following code
begin
a = b + c;
x = p r ;
h = x < 0;
end;
The second statement has a syntax error
Panic mode recovery for begin-end block
skip ahead to the next ; and try to parse the next statement
It discards one statement and tries to continue parsing
May fail if no further ; is found
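Panic mode recovery with ; as the synchronizing token can be sketched in Python as below. The statement form (operand = operand followed by operator and operand pairs, ending in ;), the operator set and all names are assumptions made for this illustration; the begin and end keywords are left out to keep the sketch short.

```python
OPERATORS = {"+", "-", "*", "/", "<", ">"}
SYNC = ";"                      # synchronizing token

class StatementParser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" is an assumed end marker
        self.pos = 0
        self.errors = []

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.lookahead != t:
            raise SyntaxError(f"expected {t!r}, found {self.lookahead!r}")
        self.pos += 1

    def operand(self):
        if not self.lookahead.isalnum():
            raise SyntaxError(f"expected operand, found {self.lookahead!r}")
        self.pos += 1

    def statement(self):
        # Assumed form: operand = operand (operator operand)* ;
        self.operand()
        self.match("=")
        self.operand()
        while self.lookahead in OPERATORS:
            self.pos += 1
            self.operand()
        self.match(SYNC)

    def statements(self):
        while self.lookahead != "$":
            try:
                self.statement()
            except SyntaxError as err:
                self.errors.append(str(err))
                # Panic mode: discard tokens until the synchronizing token.
                while self.lookahead not in (SYNC, "$"):
                    self.pos += 1
                if self.lookahead == SYNC:
                    self.pos += 1            # resume after the ;
        return self.errors

if __name__ == "__main__":
    source = "a = b + c ; x = p r ; h = x < 0 ;"
    print(StatementParser(source.split()).statements())
```

Only one message is reported for the faulty second statement, and the third statement is still parsed.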
Error productions
Add erroneous constructs as productions in the
grammar
Works only for most common mistakes which
can be easily identified
Essentially makes common errors as part of the
grammar
Complicates the grammar and does not work
very well
Global corrections
Considering the program as a whole find a
correct nearby program
Nearness may be measured using certain
metric
PL/C compiler implemented this scheme:
anything could be compiled!
It is complicated and not a very good idea!
Assignment
Reading assignment: Read about error
recovery in LL(1) parsers
Assignment to be submitted:
introduce synch symbols (using both follow
and first sets) in the parse table created for
the boolean expression grammar in the
previous assignment
Parse not (true and or false) and show
how error recovery works
Due on todate+10
Bottom up parsing
Construct a parse tree for an input string beginning at
leaves and going towards root
OR
Reduce a string w of input to start symbol of grammar
Consider a grammar
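S → aABe
A → Abc | b
B → d
And reduction of a string
abbcde
aAbcde
aAde
aABe
S
In shift reduce parsing a dot separates the part of the string already processed (left) from the remaining input (right)
Initially the string is .w
A shift moves the dot over the next terminal: .pqr becomes p.qr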
Example
Assume grammar is
E → E+E | E*E | id
Parse id*id+id
String       action
.id*id+id    shift
id.*id+id    reduce E → id
E.*id+id     shift
E*.id+id     shift
E*id.+id     reduce E → id
E*E.+id      reduce E → E*E
E.+id        shift
E+.id        shift
E+id.        reduce E → id
E+E.         reduce E → E+E
E.           ACCEPT
Bottom up parsing
A more powerful parsing technique
LR grammars are more expressive than LL grammars
Can handle left recursive grammars
Can handle virtually all the programming languages
Natural expression of programming language syntax
Automatic generation of parsers (Yacc, Bison etc.)
Detects errors as soon as possible
Allows better error recovery
Handle
A string that matches right hand side of a production and
whose replacement gives a step in the reverse right most
derivation
If S ⇒rm* αAw ⇒rm αβw then β (corresponding to production
A → β) in the position following α is a handle of αβw. The
string w consists of only terminal symbols
We only want to reduce a handle and not any rhs
Handle pruning: If β is a handle and A → β is a production
then replace β by A
A right most derivation in reverse can be obtained by
handle pruning.
Handles
Handles always appear at the top of the stack
and never inside it
This makes stack a suitable data structure
Consider two cases of right most derivation to
verify the fact that handle appears on the top of
the stack
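S ⇒ αAz ⇒ αβByz ⇒ αβγyz
S ⇒ αBxAz ⇒ αBxyz ⇒ αγxyz
Case 1
Stack    input   action
αβγ      yz      reduce by B → γ
αβB      yz      shift y
αβBy     z       reduce by A → βBy
αA       z
Case 2
Stack    input   action
αγ       xyz     reduce by B → γ
αB       xyz     shift x
αBx      yz      shift y
αBxy     z       reduce by A → y
αBxA     z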
Conflicts
The general shift-reduce technique is:
if there is no handle on the stack then shift
If there is a handle then reduce
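With grammar E → E+E | E*E | id, stack E+E and remaining input *id, the parser can either reduce or shift (shift reduce conflict)
Reduce first
Stack     input   action
E+E       *id     reduce by E → E+E
E         *id     shift
E*        id      shift
E*id              reduce by E → id
E*E               reduce by E → E*E
E
Shift first
Stack     input   action
E+E       *id     shift
E+E*      id      shift
E+E*id            reduce by E → id
E+E*E             reduce by E → E*E
E+E               reduce by E → E+E
E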
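With productions M → R+R, M → R+c and R → c (as used in the traces below), input c+c can be reduced in two ways (reduce reduce conflict)
Stack    input   action
         c+c     shift
c        +c      reduce by R → c
R        +c      shift
R+       c       shift
R+c              reduce by R → c
R+R              reduce by M → R+R
M

Stack    input   action
         c+c     shift
c        +c      reduce by R → c
R        +c      shift
R+       c       shift
R+c              reduce by M → R+c
M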
LR parsing
[Figure: the LR parser reads the input, uses a stack and a parse table made of an action part and a goto part, and produces the output]
Configurations in LR parser
Stack: S0 X1 S1 X2 … Xm Sm
Input: ai ai+1 … an $
If action[Sm,ai] = shift S
Then the configuration becomes
Stack: S0 X1 S1 … Xm Sm ai S    Input: ai+1 … an $
If action[Sm,ai] = reduce A → β
Then the configuration becomes
Stack: S0 X1 S1 … Xm-r Sm-r A S    Input: ai ai+1 … an $
Where r = |β| and S = goto[Sm-r,A]
If action[Sm,ai] = accept
Then parsing is completed. HALT
If action[Sm,ai] = error
Then invoke error recovery routine.
LR parsing Algorithm
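Initial state:
Stack: S0
Input: w$
Loop {
  let S be the state on top of the stack and a the current input symbol
  if action[S,a] = shift S'
    then push(a); push(S'); ip++
  else if action[S,a] = reduce A → β
    then pop (2*|β|) symbols;
      push(A); push(goto[S'',A])
      (S'' is the state on top of the stack after popping symbols)
  else if action[S,a] = accept
    then exit
  else error
}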
Example
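E → E+T | T
T → T*F | F
F → ( E ) | id
Productions are numbered in the order listed: (1) E → E+T, (2) E → T, (3) T → T*F, (4) T → F, (5) F → (E), (6) F → id
State   id    +     *     (     )     $      E    T    F
0       s5                s4                 1    2    3
1             s6                      acc
2             r2    s7          r2    r2
3             r4    r4          r4    r4
4       s5                s4                 8    2    3
5             r6    r6          r6    r6
6       s5                s4                      9    3
7       s5                s4                           10
8             s6                s11
9             r1    s7          r1    r1
10            r3    r3          r3    r3
11            r5    r5          r5    r5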
Parse id + id * id
Stack                      Input        Action
0                          id+id*id$    shift 5
0 id 5                     +id*id$      reduce by F → id
0 F 3                      +id*id$      reduce by T → F
0 T 2                      +id*id$      reduce by E → T
0 E 1                      +id*id$      shift 6
0 E 1 + 6                  id*id$       shift 5
0 E 1 + 6 id 5             *id$         reduce by F → id
0 E 1 + 6 F 3              *id$         reduce by T → F
0 E 1 + 6 T 9              *id$         shift 7
0 E 1 + 6 T 9 * 7          id$          shift 5
0 E 1 + 6 T 9 * 7 id 5     $            reduce by F → id
0 E 1 + 6 T 9 * 7 F 10     $            reduce by T → T*F
0 E 1 + 6 T 9              $            reduce by E → E+T
0 E 1                      $            ACCEPT
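The LR parsing algorithm is small once the table is available. The Python sketch below encodes the action and goto tables of the expression grammar (productions numbered 1 to 6: E → E+T, E → T, T → T*F, T → F, F → (E), F → id) and records the reductions performed; the names ACTION, GOTO, PRODUCTIONS and lr_parse are hypothetical.

```python
# production number -> (left-hand side, length of right-hand side)
PRODUCTIONS = {
    1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
    4: ("T", 1), 5: ("F", 3), 6: ("F", 1),
}

ACTION = {
    0: {"id": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4: {"id": "s5", "(": "s4"},
    5: {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6: {"id": "s5", "(": "s4"},
    7: {"id": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}

GOTO = {
    0: {"E": 1, "T": 2, "F": 3},
    4: {"E": 8, "T": 2, "F": 3},
    6: {"T": 9, "F": 3},
    7: {"F": 10},
}

def lr_parse(tokens):
    """Return the production numbers in the order they are reduced."""
    tokens = list(tokens) + ["$"]
    stack = [0]                 # states and symbols alternate, a state is on top
    ip = 0
    reductions = []
    while True:
        state, a = stack[-1], tokens[ip]
        act = ACTION[state].get(a)
        if act is None:
            raise SyntaxError(f"no action for state {state} on {a!r}")
        if act == "acc":
            return reductions
        if act[0] == "s":                       # shift: push symbol and state
            stack += [a, int(act[1:])]
            ip += 1
        else:                                   # reduce by production n
            n = int(act[1:])
            lhs, length = PRODUCTIONS[n]
            del stack[len(stack) - 2 * length:] # pop 2*|rhs| entries
            stack += [lhs, GOTO[stack[-1]][lhs]]
            reductions.append(n)

if __name__ == "__main__":
    print(lr_parse(["id", "+", "id", "*", "id"]))
```

The sequence of reductions is a right most derivation of the input in reverse.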
Parser states
Goal is to know the valid reductions at any
given point
Summarize all possible stack prefixes as a
parser state
Parser state is defined by a DFA state that
reads in the stack
Accept states of DFA are unique reductions
Viable prefixes
α is a viable prefix of the grammar if
There is a w such that αw is a right sentential form
α.w is a configuration of the shift reduce parser
LR(0) items
An LR(0) item of a grammar G is a production of G with
a special symbol . at some position of the right side
Thus production A → XYZ gives four LR(0) items
A → .XYZ
A → X.YZ
A → XY.Z
A → XYZ.
An item indicates how much of a production has been
seen at a point in the process of parsing
Symbols on the left of . are already on the stack
Symbols on the right of . are expected in the input
Start state
Start state of DFA is the empty stack
corresponding to the S' → .S item
This means no input has been seen
The parser expects to see a string derived from S
Closure operation
If I is a set of items for a grammar G then
closure(I) is a set constructed as follows:
Every item in I is in closure (I)
If A → α.Bβ is in closure(I) and B → γ is a
production then B → .γ is in closure(I)
Example
Consider the grammar
E' → E
E → E + T | T
T → T * F | F
F → ( E ) | id
If I is { E' → .E } then closure(I) is
E' → .E
E → .E + T
E → .T
T → .T * F
T → .F
F → .id
F → .(E)
Goto operation
Goto(I,X), where I is a set of items and X is a grammar
symbol,
is the closure of the set of items A → αX.β
such that A → α.Xβ is in I
Sets of items
C : Collection of sets of LR(0) items for
grammar G
C = { closure ( { S' → .S } ) }
repeat
  for each set of items I in C
  and each grammar symbol X
  such that goto (I,X) is not empty and not
  in C
    ADD goto(I,X) to C
until no more additions
Example
Grammar:
E' → E
E → E+T | T
T → T*F | F
F → (E) | id
I0: closure(E' → .E)
E' → .E
E → .E + T
E → .T
T → .T * F
T → .F
F → .(E)
F → .id
I1: goto(I0,E)
E' → E.
E → E. + T
I2: goto(I0,T)
E → T.
T → T. * F
I3: goto(I0,F)
T → F.
I4: goto(I0,( )
F → (.E)
E → .E + T
E → .T
T → .T * F
T → .F
F → .(E)
F → .id
I5: goto(I0,id)
F → id.
I6: goto(I1,+)
E → E + .T
T → .T * F
T → .F
F → .(E)
F → .id
I7: goto(I2,*)
T → T * .F
F → .(E)
F → .id
I8: goto(I4,E)
F → (E.)
E → E. + T
goto(I4,T) is I2
goto(I4,F) is I3
goto(I4,( ) is I4
goto(I4,id) is I5
I9: goto(I6,T)
E → E + T.
T → T. * F
goto(I6,F) is I3
goto(I6,( ) is I4
goto(I6,id) is I5
I10: goto(I7,F)
T → T * F.
goto(I7,( ) is I4
goto(I7,id) is I5
I11: goto(I8,) )
F → (E).
goto(I8,+) is I6
goto(I9,*) is I7
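The closure and goto operations and the construction of the collection C can be sketched in Python as follows. An item is represented as a pair (production index, dot position); this representation and the function names are assumptions made for illustration, and the sets are discovered in an order that may differ from the numbering I0 to I11 used above.

```python
# Augmented expression grammar; production 0 is E' -> E
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("F",)),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    """items: set of (production index, dot position) pairs."""
    result = set(items)
    work = list(items)
    while work:
        p, dot = work.pop()
        rhs = GRAMMAR[p][1]
        # A -> alpha . B beta with B a non terminal: add B -> . gamma
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for q, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (q, 0) not in result:
                    result.add((q, 0))
                    work.append((q, 0))
    return frozenset(result)

def goto(items, X):
    """Move the dot over X in every item that allows it, then close."""
    moved = {(p, dot + 1) for p, dot in items
             if dot < len(GRAMMAR[p][1]) and GRAMMAR[p][1][dot] == X}
    return closure(moved) if moved else frozenset()

def canonical_collection():
    symbols = sorted({s for _, rhs in GRAMMAR for s in rhs})
    C = [closure({(0, 0)})]
    for I in C:                      # C grows while it is being scanned
        for X in symbols:
            J = goto(I, X)
            if J and J not in C:
                C.append(J)
    return C

if __name__ == "__main__":
    C = canonical_collection()
    print(len(C), "sets of items")
```

For the expression grammar the collection contains twelve sets, in agreement with I0 to I11.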
[Figures: transition diagram of the DFA over the item sets I0 to I11, shown first with the transitions on terminals (id, (, ), *), then with the transitions on non terminals (E, T, F), and finally with both combined]
Construct SLR parse table
If A → α.aβ is in Ii and goto(Ii,a) = Ij
then action[i,a] = shift j
If A → α. is in Ii
then action[i,a] = reduce A → α for all a in follow(A)
If S' → S. is in Ii
then action[i,$] = accept
If goto(Ii,A) = Ij
then goto[i,A] = j for all non terminals A
Notes
This method of parsing is called SLR (Simple LR)
LR parsers accept LR(k) languages
L stands for left to right scan of input
R stands for rightmost derivation
k stands for the number of lookahead tokens
Assignment
Construct SLR parse table for following grammar
E → E + E | E - E | E * E | E / E | ( E ) | digit
Show steps in parsing of string
9*5+(2+3*7)
Steps to be followed
Due on todate+5
Example
Consider following grammar and its SLR parse table:
S' → S
S → L=R
S → R
L → *R
L → id
R → L
I0: S' → .S
S → .L=R
S → .R
L → .*R
L → .id
R → .L
I1: goto(I0, S)
S' → S.
I2: goto(I0, L)
S → L.=R
R → L.
Assignment (not to be submitted):
Construct rest of the items and the parse table.
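Productions are numbered in the order listed: (1) S' → S, (2) S → L=R, (3) S → R, (4) L → *R, (5) L → id, (6) R → L
State   =       *     id    $      S    L    R
0               s4    s5           1    2    3
1                           acc
2       s6,r6               r6
3                           r3
4               s4    s5                8    7
5       r5                  r5
6               s4    s5                8    9
7       r4                  r4
8       r6                  r6
9                           r2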
There is both a shift and a reduce entry in action[2,=]. Therefore state 2 has
a shift-reduce conflict on symbol =. However, the grammar is not
ambiguous.
Canonical LR Parsing
Carry extra information in the state so that
wrong reductions by A → α will be ruled out
Redefine LR items to include a terminal
symbol as a second component (look ahead
symbol)
The general form of the item becomes [A →
α.β, a] which is called LR(1) item.
Item [A → α., a] calls for reduction only if next
input is a. The set of such symbols a will be a
subset of Follow(A).
Closure(I)
repeat
  for each item [A → α.Bβ, a] in I
    for each production B → γ in G'
    and for each terminal b in First(βa)
      add item [B → .γ, b] to I
until no more additions to I
Example
Consider the following grammar
S' → S
S → CC
C → cC | d
Compute closure(I) where I={[S' → .S, $]}
S' → .S,    $
S → .CC,    $
C → .cC,    c
C → .cC,    d
C → .d,     c
C → .d,     d
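The LR(1) closure can be checked with a few lines of Python. This sketch represents an item as (production index, dot position, lookahead) and computes First(βa) from the first symbol only, which is valid here because no production of this grammar derives ε; that simplification and all names are assumptions of the sketch.

```python
# Augmented grammar; production 0 is S' -> S
GRAMMAR = [
    ("S'", ("S",)),
    ("S", ("C", "C")),
    ("C", ("c", "C")),
    ("C", ("d",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def first(symbol, seen=frozenset()):
    """First set of one symbol; assumes the grammar has no ε productions."""
    if symbol not in NONTERMINALS:
        return {symbol}
    result = set()
    for lhs, rhs in GRAMMAR:
        # skip symbols already being expanded to avoid endless recursion
        if lhs == symbol and rhs[0] not in seen | {symbol}:
            result |= first(rhs[0], seen | {symbol})
    return result

def closure1(items):
    """items: set of (production index, dot position, lookahead) triples."""
    result = set(items)
    work = list(items)
    while work:
        p, dot, a = work.pop()
        rhs = GRAMMAR[p][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            beta_a = rhs[dot + 1:] + (a,)        # the string βa
            for b in first(beta_a[0]):           # First(βa), no ε in this grammar
                for q, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == rhs[dot]:
                        item = (q, 0, b)
                        if item not in result:
                            result.add(item)
                            work.append(item)
    return result

if __name__ == "__main__":
    for p, dot, a in sorted(closure1({(0, 0, "$")})):
        lhs, rhs = GRAMMAR[p]
        print(lhs, "→", " ".join(rhs[:dot]) + "." + " ".join(rhs[dot:]), ",", a)
```

The six items produced are the ones listed in the example above.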
Example
Construct sets of LR(1) items for the grammar on previous
slide
I0: S' → .S,    $
S → .CC,        $
C → .cC,        c/d
C → .d,         c/d
I1: goto(I0,S)
S' → S.,        $
I2: goto(I0,C)
S → C.C,        $
C → .cC,        $
C → .d,         $
I3: goto(I0,c)
C → c.C,        c/d
C → .cC,        c/d
C → .d,         c/d
I4: goto(I0,d)
C → d.,         c/d
I5: goto(I2,C)
S → CC.,        $
I6: goto(I2,c)
C → c.C,        $
C → .cC,        $
C → .d,         $
I7: goto(I2,d)
C → d.,         $
I8: goto(I3,C)
C → cC.,        c/d
I9: goto(I6,C)
C → cC.,        $
Construction of Canonical LR
parse table
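Construct C={I0, …, In} the sets of LR(1) items.
If [A → α.aβ, b] is in Ii and goto(Ii,a) = Ij
then action[i,a] = shift j
If [A → α., a] is in Ii
then action[i,a] = reduce A → α
If [S' → S., $] is in Ii
then action[i,$] = accept
If goto(Ii,A) = Ij then goto[i,A] = j for all non terminals A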
Parse table
Productions are numbered: (1) S → CC, (2) C → cC, (3) C → d
State   c     d     $      S    C
0       s3    s4           1    2
1                   acc
2       s6    s7                5
3       s3    s4                8
4       r3    r3
5                   r1
6       s6    s7                9
7                   r3
8       r2    r2
9                   r2
For each core present in LR(1) items find all sets having
the same core and replace these sets by their union
Let J = I1 ∪ I2 ∪ … ∪ Ik
since I1, I2, …, Ik have the same core, goto(J,X) will have
the same core
Let K = goto(I1,X) ∪ goto(I2,X) ∪ … ∪ goto(Ik,X); then
goto(J,X) = K
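Parse table after merging the sets with the same core (I3 with I6, I4 with I7, I8 with I9)
State   c      d      $      S    C
0       s36    s47           1    2
1                     acc
2       s36    s47                5
36      s36    s47                89
47      r3     r3     r3
5                     r1
89      r2     r2     r2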
Error Recovery
An error is detected when an entry in the action table is
found to be empty.
Panic mode error recovery can be implemented as
follows:
scan down the stack until a state S with a goto on a
particular nonterminal A is found.
discard zero or more input symbols until a symbol a is
found that can legitimately follow A.
stack the state goto[S,A] and resume parsing.
Parser Generator
Some common parser generators
[Figure: Lex produces lex.yy.c and Yacc produces y.tab.c (C code for the parser); the C compiler compiles them into object code; the resulting parser reads the input program and produces an abstract syntax tree]