
Compilers - Week 4

I- predictive parsing (introduction)


-A top-down algorithm that is similar to recursive descent, except that
the parser is able to correctly guess which production will lead to a
successful parse.
-it uses look-ahead: the parser peeks at the next k tokens of the input
without consuming them.
-only works for a restricted form of grammars (LL(k) grammars)
-no backtracking involved with predictive parsing

II- LL(k) grammars


L: left-to-right scan
L: left-most derivation
k: number of tokens of lookahead (usually 1)
Note:
When parsing an LL(1) grammar, there is always at most one
production that can succeed at each step.

III- left factoring


Consider the following grammar:
E -> T + E | T
T -> int | int * T | (E)

-With only one token of lookahead, if the parser sees that the next terminal
in the input stream is an int, it can't tell which production to use,
because two of T's production rules begin with int.
-likewise, the parser wouldn't be able to decide on a production rule for E,
since both of E's productions begin with the nonterminal T.
-to solve this problem, the grammar is left-factored.

-The idea behind left factoring is to eliminate the common prefixes of
multiple productions for one nonterminal.
-What we're going to do is:
1- factor out that common prefix into a single production
2- introduce a new nonterminal for the different suffixes
E -> TX
3- write multiple production rules, one for each possible suffix
X -> +E | ε
-repeat the same steps for T:
T -> intY | (E)
Y -> *T | ε
Note: This is effectively going to delay the decision about which
production we're using.
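
This factoring step is mechanical enough to automate. A minimal sketch in
Python (illustrative only: it factors a single leading symbol and assumes at
most one group of alternatives shares a prefix, which covers the example
above; the representation and names are my own, not from the lecture):

from collections import defaultdict

def left_factor(nt, productions, new_nt):
    groups = defaultdict(list)
    for p in productions:
        groups[p[0]].append(p)           # group alternatives by first symbol
    out = {nt: []}
    for head, group in groups.items():
        if len(group) == 1:
            out[nt].append(group[0])     # no shared prefix: keep as-is
        else:
            out[nt].append((head, new_nt))        # factored production
            out[new_nt] = [p[1:] for p in group]  # () stands for ε
    return out

# T -> int | int * T | (E)   becomes   T -> int Y | (E),  Y -> * T | ε
print(left_factor('T', [('int',), ('int', '*', 'T'), ('(', 'E', ')')], 'Y'))
# {'T': [('int', 'Y'), ('(', 'E', ')')], 'Y': [(), ('*', 'T')]}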

IV- Parsing table

-When the current nonterminal is E (that is, the leftmost nonterminal
on the parse tree is E) and the next input is int, we should use the
production E -> TX, that is, expand E with the children T and X.
Note: Blank entries are error entries. For example, if the current
nonterminal is E and the next input is *, there is no way to proceed, and a
parsing error should be reported.
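
The table itself appears as a figure in the slides; reconstructed here for
the left-factored grammar (rows: nonterminals, columns: next input token;
blank entries are errors):

           int        *        +        (        )        $
E          T X                          T X
X                              + E               ε        ε
T          int Y                        ( E )
Y                     * T      ε                 ε        ε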

V- implementation of Predictive parsing algorithm


-instead of using recursive functions to trace out the parse tree, we're
going to use a stack that records the frontier of the parse tree, where:
i- nonterminals that have yet to be expanded and terminals that
have yet to be matched against the input are stored on the
stack.
ii- the leftmost pending terminal (the terminal we're trying to
match) or nonterminal (the nonterminal we're trying to expand)
is on top of the stack.
-Input is rejected if we reach an error state.
-Input is accepted when we reach the end of the input with an
empty stack, that is, we have neither unexpanded nonterminals nor
unmatched terminals.

Code:
initialize stack = <S $> and next (pointer to the first input token)
repeat
  case stack of
    <X, rest> : if T[X, *next] = Y1...Yn
                  then stack <- <Y1...Yn, rest>;
                  else error();
    <t, rest> : if t == *next++
                  then stack <- <rest>;
                  else error();
until stack == < >

Notes:
i- the stack initially contains the starting nonterminal S and the
dollar sign $.
ii- the dollar sign marks the bottom of the stack and the end of the
input.

If the top of the stack is a terminal:
if it matches the next token in the input, the input is
advanced and we pop the stack; if not, an error is raised.
If the top of the stack is a nonterminal X:
1- we look at our parsing table under the entry for X and the
next input symbol, and that should give us the right-hand
side of a production.
2- we pop the stack
3- we push the symbols of that right-hand side (the children of X)
onto the stack
4- if there's no entry for the current nonterminal and input
in the table, then an error is raised.

Note: a nonterminal on top of the stack is always the leftmost
pending nonterminal of the derivation.
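
A runnable sketch of this loop in Python (the table is the LL(1) table for
the left-factored grammar from section III; Python and the names TABLE and
parse are illustration choices, not from the lecture):

TABLE = {
    ('E', 'int'): ['T', 'X'], ('E', '('): ['T', 'X'],
    ('X', '+'): ['+', 'E'],   ('X', ')'): [], ('X', '$'): [],
    ('T', 'int'): ['int', 'Y'], ('T', '('): ['(', 'E', ')'],
    ('Y', '*'): ['*', 'T'],   ('Y', '+'): [], ('Y', ')'): [], ('Y', '$'): [],
}
NONTERMINALS = {'E', 'X', 'T', 'Y'}

def parse(tokens):
    tokens = tokens + ['$']          # $ marks the end of the input
    stack, pos = ['$', 'E'], 0       # start symbol on top, $ at the bottom
    while stack:
        top = stack.pop()
        if top in NONTERMINALS:      # expand using the table entry
            rhs = TABLE.get((top, tokens[pos]))
            if rhs is None:
                return False         # blank entry: parsing error
            stack.extend(reversed(rhs))  # push children, leftmost on top
        elif top == tokens[pos]:     # terminal: match and advance the input
            pos += 1
        else:
            return False
    return pos == len(tokens)        # accept iff all input was consumed

print(parse(['int', '*', 'int', '+', 'int']))   # True
print(parse(['int', '+']))                      # False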
VI- construction of parsing tables


i- Conditions for constructing an LL(1) parsing table:
If we have a nonterminal A, a production rule A -> α, and a token
t, then T[A, t] = α in the following cases:
i- if α can derive t in the first position after zero or more
moves; in that case we say that t ∈ First(α) (t is one of the
terminals that α can produce in the first position)
ii- if α can derive ε in zero or more moves, that is, if α
can be totally erased, and t follows A in at least one
derivation; in that case we say that t ∈ Follow(A)
Note: A doesn't produce t; t appears in a derivation after A.
iii- if we're at the end of the input and A is still our leftmost
nonterminal, then our only hope is to get rid of A
completely, so we pick a production for A that can go to
ε, that is A -> α with ε ∈ First(α) and $ ∈ Follow(A)

ii- first sets


-X could be a single terminal, a single nonterminal, or
a string of grammar symbols.
First(X) = { t | X ->* tα } ∪ { ε | X ->* ε }
-The first part represents all the possible terminals
that can be derived in the first position after zero or
more moves.
-The second part checks whether X can produce the empty
string ε after zero or more moves.

If X is a terminal:
The First set of a terminal contains only the terminal
itself.
If X is a nonterminal:
i- and it goes to ε, then ε ∈ First(X)
ii- and it goes to a sequence of other nonterminals
which all go to ε, then ε ∈ First(X)
iii- and it goes to a sequence of nonterminals which all go
to ε, followed by another symbol α (a nonterminal or a
terminal), then First(α) ⊆ First(X)
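
These rules can be run to a fixed point. A minimal sketch in Python (the
dict representation, EPS marker, and function names are illustration
choices, not from the lecture; the grammar is the left-factored one from
section III):

EPS = 'ε'
GRAMMAR = {
    'E': [['T', 'X']],
    'X': [['+', 'E'], [EPS]],
    'T': [['int', 'Y'], ['(', 'E', ')']],
    'Y': [['*', 'T'], [EPS]],
}

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    def first_of(symbol):
        # the First set of a terminal is just the terminal itself
        return first[symbol] if symbol in grammar else {symbol}
    changed = True
    while changed:                       # iterate until nothing new is added
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                before = len(first[nt])
                for sym in rhs:
                    if sym == EPS:
                        first[nt].add(EPS)
                        break
                    first[nt] |= first_of(sym) - {EPS}
                    if EPS not in first_of(sym):
                        break            # this symbol can't vanish, stop here
                else:
                    first[nt].add(EPS)   # every symbol in rhs can vanish
                changed |= len(first[nt]) != before
    return first

print(first_sets(GRAMMAR))
# {'E': {'int', '('}, 'X': {'+', 'ε'}, 'T': {'int', '('}, 'Y': {'*', 'ε'}}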
iii- follow sets
Follow(X) = { t | S ->* β X t δ }
we say that t is in Follow(X) if there is some derivation where
the terminal t can appear immediately after the symbol X.
Consider the grammar:
S -> Xt
X -> AB
i- First(B) - {ε} ⊆ Follow(A)
ii- Follow(X) ⊆ Follow(B), where X is on the left of the
production and B is the rightmost terminal or nonterminal on
the right side of the production:
S -> Xt -> ABt
iii- Follow(X) ⊆ Follow(A) if B ->* ε
Notes:
i- the dollar sign is in the follow set of the starting symbol.
ii- Epsilons never appear in follow sets, so follow sets are
just sets of terminals.
iii- If Follow(X) ⊆ Follow(E) and Follow(E) ⊆ Follow(X), then
Follow(E) = Follow(X).
iv- Blank entries represent parsing errors
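
A matching fixed-point sketch for Follow sets, building on the first_sets
sketch above (same caveats: the representation is an assumption made for
illustration):

def follow_sets(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add('$')           # $ is in Follow of the start symbol
    def first_of(symbol):
        return first[symbol] if symbol in grammar else {symbol}
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue     # follow sets are kept for nonterminals
                    before = len(follow[sym])
                    tail_vanishes = True
                    for nxt in rhs[i + 1:]:
                        # epsilons never appear in follow sets
                        follow[sym] |= first_of(nxt) - {EPS}
                        if EPS not in first_of(nxt):
                            tail_vanishes = False
                            break
                    if tail_vanishes:
                        # Follow(lhs) ⊆ Follow(rightmost surviving symbol)
                        follow[sym] |= follow[nt]
                    changed |= len(follow[sym]) != before
    return follow

first = first_sets(GRAMMAR)
print(follow_sets(GRAMMAR, first, 'E'))   # set order may vary:
# {'E': {')', '$'}, 'X': {')', '$'}, 'T': {'+', ')', '$'}, 'Y': {'+', ')', '$'}}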

iv- more on LL(1) parsing tables


Consider the following grammar:
S -> Sa | b
First(S) = {b}
Follow(S) = {a, $}
Parsing table:

          a          b          $
S                    Sa, b
Here we have a multiply defined entry since for the nonterminal
S, we can obtain b in the first position by using either of these
two production rules.
-if S is our leftmost nonterminal, and b is our next input symbol,
this table doesn't tell us exactly what move to make. (it’s not
deterministic)
-if any entry in the table is multiply defined, then the grammar is
not LL(1).
Grammars that are guaranteed to not be LL(1):
-Not left factored
-Left recursive
-Ambiguous
-grammars that require more than one token of look ahead
Notes:
-the above list is not complete.
-just because a grammar is left factored, not left
recursive, and unambiguous, doesn’t guarantee it’s
LL(1). The only way to know for sure is to construct
the parsing table and check if all entries are not
multiply defined.
-The grammars that describe most programming
languages are not LL(1), since LL(1) grammars are too
weak to capture all of the interesting and important
constructs in commonly used programming languages.
VII- bottom-up parsing
-Although bottom-up parsing builds on the ideas in top-down parsing,
it’s more general and just as efficient.
-an important advantage of bottom-up parsing is that it doesn't require
left-factored grammars, though it still can't deal with ambiguous
grammars.
-bottom-up parsing reduces the input string of tokens to the start
symbol by inverting productions (reductions).
-When we do a reduction we replace the children (right hand side) of
some production by its left hand side (the parent).

An interesting fact:
A bottom-up parser traces a right-most derivation in reverse
(using reductions instead of productions).

VIII- shift-reduce parsing (the only two moves used by a bottom-up parser)
Let αβω be an intermediate string of a bottom-up parse, and assume the next
reduction is by X -> β. Then ω is a string of terminals, because a
bottom-up parse traces a rightmost derivation in reverse, which means X has
to be the rightmost nonterminal, that is, there are no nonterminals to the
right of X.
Note:
-those terminal symbols represented by ω to the right of the
right most non-terminal are exactly the unexamined input.
-a vertical bar is placed between the examined substring of the
input -which contains both terminals and nonterminals- and the
unexamined substring of the input -which contains terminals
only.
i- shift moves
a shift move reads one token of input; it can be
represented by moving the vertical bar one token to the right.
ii- reduce moves
a reduce move applies an inverse production at the right end
of the string on the left of the vertical bar.
Note:
it turns out that the left string can be implemented by a
stack, and that's because we only do reduce operations
immediately to the left of the vertical bar, so it's always
some suffix of the string to the left of the vertical bar
where the reduction is happening.

In short:
A shift move pushes a terminal that has been read onto the stack,
while a reduce move pops some number of symbols - that
represent the right hand side of a production rule- off of the
stack, and pushes a nonterminal -that represents the left hand
side of the production rule- onto the stack.
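
For example (a hand-worked trace, using the grammar and input string that
also appear in section IX below):
E -> T + E | T
T -> int * T | int | (E)
Input: int * int + int

|int * int + int        shift
int |* int + int        shift
int * |int + int        shift
int * int |+ int        reduce T -> int
int * T |+ int          reduce T -> int * T
T |+ int                shift
T + |int                shift
T + int |               reduce T -> int
T + T |                 reduce E -> T
T + E |                 reduce E -> T + E
E |                     accept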

iii- shift-reduce conflict


-If it is legal to shift or reduce, there is a shift-reduce conflict.
-shift-reduce conflicts are not good, but they’re easier to remove.
iv- reduce-reduce conflict
-If it is legal to reduce by two different productions, there is a
reduce-reduce conflict.
-reduce-reduce conflicts are always bad, they indicate some kind
of serious problem with the grammar.
Note:
If you have either one of these conflicts, it means that there's
some state in which the parser doesn't know what to do, and you
either need to rewrite the grammar, or give it a hint as to what it
should do in order to successfully parse your language.
IX- handles
Consider the grammar:
E -> T + E | T
T -> int * T | int | (E)
And the input string: int * int + int
-At the step int | * int + int, we could have decided to reduce int to
T, but that would've been a fatal mistake.
-There is no production in the grammar whose right-hand side begins with
T *, and therefore if we were to make this move, we would get stuck and
never reach the starting symbol.
-therefore, we can only reduce if the result can still be reduced to the
starting symbol.
Assume the rightmost derivation:
S ->* αXω -> αβω
Here, it's okay to reduce β by X because we can still, by some
sequence of moves, get back to the start symbol.
-we say that αβ is a handle of αβω.
-a handle is a reduction that also allows further reductions back to the
start symbol.
Note:
-handles appear only at the top of the stack, never inside, and
therefore they never appear to the left of the rightmost
nonterminal (the nonterminal that has just replaced a bunch of
terminals in the stack).
-The next handle is obtained after a sequence of shift moves
-Bottom-up parsing is based on recognizing handles.
X- recognizing handles
-For an arbitrary grammar, there is no known algorithm for recognizing
handles, but the good news is that there are some heuristics that
correctly identify handles, and they work for a fairly large class of CFGs.
-LR(k) grammars are one of the most general families of deterministic
grammars that we know of, but those aren't the ones that are actually used
in practice. Most practical bottom-up tools use LALR(k) grammars, which are
a subset of the LR(k) grammars.
-SLR(k) grammars are simplified versions of the LALR(k) ones.

Note: in LR(k), L stands for left-to-right scan, R stands for rightmost
derivation, and k represents the number of tokens of lookahead.

How does recognizing handles work?


-At each step the parser sees only the stack, not the entire input.
-α is a viable prefix if there is an ω such that α|ω is a state of a
shift-reduce parser. The parser knows only α and the first token
of ω (using lookahead).
-A viable prefix does not extend past the right end of the handle.
-It's called a "viable" prefix because it is a prefix of the handle.
-As long as the parser has viable prefixes on the stack, no parsing
error has been detected.
A very important fact:
-For any grammar, the set of viable prefixes is a regular
language, which means that the set of viable prefixes can be
recognized by a finite automaton.

Items of a production:
An item is a production with a “.” somewhere on the RHS.
Consider the following production:
T -> (E)
Possible items for this production are:
T -> .(E)
T -> (.E)
T -> (E.)
T -> (E).
Notes:
-the only item for a production X -> ε is X -> .
-items are often referred to as LR(0) items.
-in any successful parse, what is on the stack always has to
be a prefix of the right-hand side of some production or
productions.
Consider the input (int), and the following production rules:
E -> T + E | T
T -> int * T | int | (E)
-(E is a prefix of the rhs of T -> (E), and it’s gonna be
reduced after the next shift.
- Item T -> (E.) records the fact that we're working on the
production T -> (E), so far we've seen (E, and we’re
hoping to see a ).

-what's to the left of the dot records the portion of the
input that the parser has already seen, and what's to the
right of the dot records the portion of the input that the
parser is waiting to see on the stack before it can perform a
reduction.
Structure of the Stack
-The stack consists of prefixes of right-hand sides, and every
prefix will eventually be reduced to the left-hand side of its
production.
-we're always working on the top-most prefix on the stack.
-when a bunch of prefixes have been removed from the stack
through reductions, then we get to work on the prefixes that are
lower in the stack.
There is more to this part that you can check out in the slides.

XI- Recognizing viable prefixes algorithm


1- Add a dummy production S’ -> S to G, where S is the start symbol of
our grammar G.
2- since the set of viable prefixes for a given grammar is regular, we
construct a nondeterministic finite automaton (NFA) that recognizes
the viable prefixes.
-The states of the NFA are the items of G.
-The input of the NFA is the stack (read from the bottom of
the stack towards the top).
-The NFA always outputs either yes, meaning this stack could
wind up parsing the input, or no, meaning that what we've got on
the stack doesn't resemble any valid stack for any possible
parse of any input string of this grammar.
3- we have two classes of transitions:
i- Extending a prefix:
for an item E -> α.Xβ, if having α on the stack at this point is valid,
and the next symbol on the stack is X, then we can transition to
the state E -> αX.β:
E -> α.Xβ  --X-->  E -> αX.β
Note: in this case X is either a terminal or a nonterminal.
ii- guessing (discovering) where the end of the prefix is:
for an item E -> α.Xβ, if having α on the stack at this point is valid,
and the next symbol on the stack is not X but rather a symbol
that is eventually going to reduce to X (that is, it can be generated
by a sequence of X's productions), then we can make an
epsilon move, that is, we just shift to a state where we try to
recognize the right-hand side of something derived from X:
E -> α.Xβ  --ε-->  X -> .γ   for each production X -> γ
Note: in this case X is a nonterminal.
4- Every state is an accepting state
not every state is going to have a transition on every possible
symbol, so there will be plenty of possible stacks that are
rejected
5- Start state is S’ -> .S
Example:
The NFA used to recognize viable prefixes of the following grammar:
S’ -> E
E -> T + E | T
T -> int * T | int | (E)
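
The NFA diagram itself is drawn in the slides. As a small illustration of
how its states (items) and the two classes of transitions can be generated,
here is a Python sketch (the representation is my own, not from the
lecture):

GRAMMAR = {
    "S'": [['E']],
    'E':  [['T', '+', 'E'], ['T']],
    'T':  [['int', '*', 'T'], ['int'], ['(', 'E', ')']],
}

def items(grammar):
    # an item is (lhs, rhs, dot position)
    return [(nt, tuple(rhs), d)
            for nt, rhss in grammar.items()
            for rhs in rhss
            for d in range(len(rhs) + 1)]

def transitions(grammar):
    moves = []
    for (nt, rhs, d) in items(grammar):
        if d < len(rhs):
            X = rhs[d]
            # class i: consume X from the stack, move the dot past it
            moves.append(((nt, rhs, d), X, (nt, rhs, d + 1)))
            if X in grammar:
                # class ii: epsilon move to every item X -> .γ
                for gamma in grammar[X]:
                    moves.append(((nt, rhs, d), 'ε', (X, tuple(gamma), 0)))
    return moves

for src, label, dst in transitions(GRAMMAR)[:4]:
    print(src, '--', label, '-->', dst)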

XII- valid items


-The previous NFA can be converted into a DFA, where multiple states of
the NFA become the items of one state of the DFA.
-the start state of the DFA contains the item S' -> .E
-The states of the DFA are called "canonical collections of items" or
"canonical collections of LR(0) items"
-an item X -> β.γ is valid for a viable prefix αβ if the following condition
holds:
S' ->* αXω -> αβγω
-by a series of rightmost derivation steps, we can get to the
configuration αXω, and then in one step X goes to βγ.
-that means that after seeing αβ on the stack, this item can
describe the top of the stack.
-The valid items for a viable prefix α are the items in the final
state of the DFA after it reads that prefix (all items in that
state).
Example: valid items for the prefix (int * are:
T -> int * . T
T -> .(E)
T -> .int * T
T -> .int
-an item can be valid for multiple prefixes
Example:
The item T -> (.E) is valid for all sequences of open
parentheses (i.e (, ((, (((, ((((, ….)
XIII- SLR (A bottom-up parsing algorithm)
i- LR(0): a weak bottom-up parsing algorithm
-Assume the following:
The stack contains α
The next input is t
The DFA is run on input α (that is, it reads the contents of the
stack) and terminates in state s.
-Reduce by X -> β if s contains the item X -> β. (dot at the very end)
if we see a complete production (dot all the way at the right
end) in the final state of the DFA, then we just
reduce by that production.
-Shift if s contains the item X -> β.tω
That says it's okay at this point to push t onto the stack (to do
a shift move) since t is our next input.
Conflicts:
i- reduce-reduce conflict
if any state of the DFA has two possible reductions (i.e.
two complete productions)
Example:
X -> β. and X -> ω.
ii- shift-reduce conflict
if any state has both a reduce item and a shift item.
Example:
X -> β. and Y -> ω.tδ, where t is the next token of
the input.
ii- improving LR(0) by using SLR
SLR improves LR(0) by adding another condition for the
reduce move, which is the following:
t ∈ Follow(X)
Consider the item X -> β.
Here β is on top of the stack: β|t
If we were to reduce β to X, we'd have X on top of
the stack: X|t, which says that t has to follow X.
Notes:
-If there are conflicts under these rules, the grammar is not
SLR.
-we take into account two pieces of information; the
contents of the stack (that’s what the DFA does for us, it
tells us what items are possible when we get to the top of
the stack), and what’s coming up in the input.
-SLR grammars are those grammars where there are no
conflicts, meaning there is a unique move in every possible
state under those rules.
-No ambiguous grammar is SLR.
iii- how to make SLR parse even more grammars?
By using precedence declarations.
Consider the following grammar:
E -> E + E | E * E | (E) | int
The DFA for this grammar contains a state with the following
items:
E -> E * E .
E -> E . + E which means we have a shift/reduce conflict.
By declaring that * has higher precedence than +, the conflict is
resolved in favor of reducing.
Note:
These declarations do not define precedence; they define
conflict resolutions.
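
Concretely, on input int * int + int the parser reaches the configuration
E * E |+ int, where both items above apply. Reducing (because * has higher
precedence than +) groups the multiplication first, giving (int * int) + int;
shifting instead would have grouped it as int * (int + int). As an aside not
from the lecture: in yacc/bison-style tools this is expressed with the
declarations %left '+' followed by %left '*' (later declarations bind
tighter).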
iv- SLR algorithm

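The algorithm itself is shown as a figure in the slides; its content is
roughly the following (reconstructed here, so treat it as a sketch):

Let M be the DFA for viable prefixes of G
Let |x1...xn$ be the initial configuration
Repeat:
    Let α|ω be the current configuration
    Run M on the current stack α
    If M rejects α, report a parsing error (the stack is not a viable prefix)
    If M accepts α with items I, let a be the next input token:
        Shift if X -> β.aγ ∈ I
        Reduce by X -> β if X -> β. ∈ I and a ∈ Follow(X)
        Report a parsing error if neither applies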
Notes:
-M is the machine (the DFA) and G is our grammar.
-S|$ means all the input has been read and we've reduced the
entire input to the start symbol.
-checking whether M rejects α is not necessary, since a parsing
error is reported if neither a shift nor a reduce move can
take place, which already implies that we will never form an
invalid stack.
-"M accepts α with items I" means I is the set of items of the
DFA state that M halts in.
-If there is a conflict in the last step, the grammar is not SLR(k),
where k is the amount of lookahead (which is 1 in practice).

For the "configuration / DFA halt state / action" table:
In order to determine the DFA halt state for the configuration you
wind up at after applying the action from the previous row,
run the DFA over the stack again, starting from the first state (the
one containing the item S' -> .E); the DFA halt state is the one you
end up in.
XIV- SLR improvements
-The SLR parsing algorithm has one major inefficiency: most of the work
the automaton does when it reads the stack is redundant (repeated).
-The automaton is rerun on the entire stack even though only a small
number of symbols at the top of the stack change at each step, while
most of the stack stays the same.
How to solve the problem of redundant DFA work?
-In order to solve this problem, we need to remember the state of
the automaton on each prefix of the stack, by changing the
representation of what goes on the stack to a pair instead of
just a symbol:
<symbol S, DFA state st>
-The DFA state st is the result of running the
DFA on all symbols below S on the stack.
Note: the bottom element of the stack is <any, start>, where
any is a dummy symbol and start is the start state of the DFA.
We have two tables:
i- Goto table (the graph of the DFA written out as an array)
goto is just the transition function of the DFA:
goto[i, A] = j   (i.e. state i moves to state j on symbol A)
Possible moves of the SLR parsing algorithm:
-Shift x: push <a, x> on the stack, where a is the current
input token and x is a DFA state
-Reduce X -> α: as before
-Accept
-Error
ii- Action table
-It tells us which kind of move to make in every
possible state.
-It's indexed by a state of the automaton and the next
input symbol: action[i, a]
-possible moves are shift, reduce, accept, or error:
-If state i has an item X -> α.aβ and goto[i, a] = j, then
action[i, a] = shift j (push the pair <a, j> onto the stack)
-If state i has an item X -> α. and a ∈ Follow(X) and X ≠ S', then
action[i, a] = reduce X -> α
-If state i has the item S' -> S., then
action[i, $] = accept
-Otherwise:
action[i, a] = error
Note: the algorithm uses only the DFA states and the
input; the stack symbols are never actually used.
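
A small driver sketch in Python for this table form (the ACTION/GOTO tables
and the toy grammar S' -> S, S -> a below are hypothetical, made up just to
exercise the loop):

def slr_parse(tokens, action, goto, start_state):
    """Generic table-driven SLR loop; stack holds <symbol, DFA state> pairs."""
    stack = [('any', start_state)]        # bottom element <any, start>
    tokens = tokens + ['$']
    pos = 0
    while True:
        state = stack[-1][1]              # only DFA states and input are used
        move = action.get((state, tokens[pos]), ('error',))
        if move[0] == 'shift':            # push <token, new state>, advance
            stack.append((tokens[pos], move[1]))
            pos += 1
        elif move[0] == 'reduce':         # pop |rhs| pairs, push <lhs, goto>
            _, lhs, rhs = move
            del stack[len(stack) - len(rhs):]
            stack.append((lhs, goto[(stack[-1][1], lhs)]))
        elif move[0] == 'accept':
            return True
        else:
            return False

# Hypothetical tables for the toy grammar S' -> S, S -> a:
ACTION = {(0, 'a'): ('shift', 2),
          (2, '$'): ('reduce', 'S', ['a']),
          (1, '$'): ('accept',)}
GOTO = {(0, 'S'): 1}
print(slr_parse(['a'], ACTION, GOTO, 0))   # True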

😁
CHECK OUT THE TWO EXAMPLES WRITTEN IN THE NOTEBOOK
And with that, the compiler's season comes to an end! Good luck!
