Chapter 3: Syntax Analysis
Check syntax and construct abstract syntax tree
The information which the syntax analysis phase receives from the previous phase (lexical analysis) is whether a token is valid and which class of tokens it belongs to. Deeper questions about program structure are therefore beyond the capabilities of lexical analysis:
Limitations of regular languages
. How can language syntax be described precisely and conveniently? Can regular expressions be used?
. Many languages are not regular; for example, the strings of balanced parentheses
Regular expressions cannot be used to describe language syntax precisely and conveniently. Many languages are not regular; for example, consider the language consisting of all strings of balanced parentheses. There is no regular expression for this language. Regular expressions cannot be used for syntax analysis (specification of a grammar) because:
. The pumping lemma for regular languages prevents the representation of constructs like strings of balanced parentheses, where there is no limit on the number of parentheses. Such constructs are allowed by most programming languages.
. This is because a finite automaton may repeat states; however, it does not have the power to remember the number of times a state has been reached.
. Many programming languages have an inherently recursive structure that can be defined by context-free grammars (CFGs) rather intuitively.
So a more powerful notation is needed to describe the valid strings of tokens.
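To see the difference concretely, here is a minimal Python sketch (our own illustration, not part of the notes) that recognizes balanced parentheses by counting nesting depth; a finite automaton, having only finitely many states, cannot count unbounded depth:

# Recognize the non-regular language of balanced parentheses.
# The counter 'depth' can grow without bound, which is exactly
# the ability a DFA with finitely many states lacks.
def balanced(s: str) -> bool:
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:        # a ')' with no matching '('
                return False
        else:
            return False         # only parentheses are allowed
    return depth == 0

# Usage: balanced("(()())") is True, balanced("(()") is False.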
A context-free grammar (CFG) consists of:
A set of tokens, known as terminal symbols (T). Terminals are the basic symbols from which strings are formed (usually written as lowercase letters).
A set of non-terminals (N): non-terminals are syntactic variables that denote sets of strings. These sets of strings help define the language generated by the grammar (usually written as capital letters).
A set of productions (P): the productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
A designation of one of the non-terminals as the start symbol (S); the set of strings it denotes is the language defined by the grammar.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start
symbol) by the right hand side of a production for that non-terminal.
Example
Consider the grammar S → (S)S | ε, with productions P1: S → (S)S and P2: S → ε. The string (( )( )) can be derived as follows:
S ⇒ (S)S        by P1
⇒ ((S)S)S       by P1
⇒ ((S)S)        by P2
⇒ (( )S)        by P2
⇒ (( )(S)S)     by P1
⇒ (( )( )S)     by P2
⇒ (( )( ))      by P2
i.e. S ⇒* (( )( ))
Similarly, in a grammar for arithmetic expressions over digits, the final steps of a derivation of 9-5+2 look like:
9 - digit + digit
⇒ 9 - 5 + digit
⇒ 9 - 5 + 2
It is interesting to note that the name context-free grammar comes from the fact that the use of a production X → α does not depend on the context of X.
Syntax analyzers
. Testing for membership, i.e. whether a word w belongs to L(G), gives just a "yes" or "no" answer
. The parser must also generate the parse tree and handle errors gracefully if the string is not in the language
A syntax analyzer not only tests whether a construct is syntactically correct, i.e. belongs to the language represented by the specified grammar, but also generates the parse tree. It also reports appropriate error messages when the string is not in the language represented by the specified grammar. Many grammars may represent the same language; however, tools such as yacc (yet another compiler-compiler) and other parser generators are sensitive to the grammar's form.
Derivation
What is Derivation?
The process of deriving a string is called a derivation.
The geometrical representation of a derivation is called a parse tree or derivation tree.
1. Leftmost derivation
The process of deriving a string by expanding the leftmost non-terminal at each step is called a leftmost derivation.
The geometrical representation of a leftmost derivation is called a leftmost derivation tree.
Example-
Consider the following grammar-
S → aB | bA
A → aS | bAA | a
B → bS | aBB | b
Leftmost Derivation of the string w = aaabbabbba-
S → aB
→ aaBB (Using B → aBB)
→ aaaBBB (Using B → aBB)
→ aaabBB (Using B → b)
→ aaabbB (Using B → b)
→ aaabbaBB (Using B → aBB)
→ aaabbabB (Using B → b)
→ aaabbabbS (Using B → bS)
→ aaabbabbbA (Using S → bA)
→ aaabbabbba (Using A → a)
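As an aside, the derivation above can be replayed mechanically. The following Python sketch (our own illustration; all names are assumptions) expands the leftmost non-terminal at each step with the chosen production:

# Replay a leftmost derivation: at each step, replace the leftmost
# non-terminal (an uppercase letter) using the chosen production.
def leftmost_derive(start, steps):
    """steps is a list of (non_terminal, right_side) productions to apply."""
    form = start
    print(form)
    for nt, rhs in steps:
        i = next(k for k, ch in enumerate(form) if ch.isupper())
        assert form[i] == nt, f"leftmost non-terminal is {form[i]}, not {nt}"
        form = form[:i] + rhs + form[i+1:]
        print("=>", form)
    return form

# The leftmost derivation of w = aaabbabbba from the grammar above:
steps = [("S", "aB"), ("B", "aBB"), ("B", "aBB"), ("B", "b"), ("B", "b"),
         ("B", "aBB"), ("B", "b"), ("B", "bS"), ("S", "bA"), ("A", "a")]
assert leftmost_derive("S", steps) == "aaabbabbba"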
2. Rightmost Derivation-
The process of deriving a string by expanding the rightmost non-terminal at each step is called a rightmost derivation.
The geometrical representation of a rightmost derivation is called a rightmost derivation tree.
Example- Consider the following grammar-
S → aB | bA
A → aS | bAA | a
B → bS | aBB | b
Rightmost Derivation of the string w = aaabbabbba-
S → aB
→ aaBB (Using B → aBB)
→ aaBaBB (Using B → aBB)
→ aaBaBbS (Using B → bS)
→ aaBaBbbA (Using S → bA)
→ aaBaBbba (Using A → a)
→ aaBabbba (Using B → b)
→ aaaBBabbba (Using B → aBB)
→ aaaBbabbba (Using B → b)
→ aaabbabbba (Using B → b)
Parse tree
It shows how the start symbol of a grammar derives a string in the language:
. the root is labeled by the start symbol
. leaf nodes are labeled by tokens
. each internal node is labeled by a non-terminal
A parse tree may be viewed as a graphical representation of a derivation that filters out the choice regarding replacement order. Thus, a parse tree pictorially shows how the start symbol of a grammar derives a string in the language. Each interior node of a parse tree is labeled by some non-terminal A, and the children of the node are labeled, from left to right, by the symbols of the right side of the production by which this A was replaced in the derivation. The root of the parse tree is labeled by the start symbol and the leaves by non-terminals or terminals; read from left to right, they constitute a sentential form, called the yield or frontier of the tree. So, if A is a non-terminal labeling an internal node and x1, x2, …, xn are the labels of its children, from left to right, then A → x1 x2 … xn is a production.
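As a hedged illustration (not from the notes), a parse tree and its yield might be represented as follows; the frontier method reads the leaf labels off from left to right:

# A parse-tree node: interior nodes carry a non-terminal label,
# leaves carry a terminal (token) label and have no children.
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []   # empty list => leaf

    def frontier(self):
        """Concatenate leaf labels left to right: the yield of the tree."""
        if not self.children:
            return self.label
        return "".join(c.frontier() for c in self.children)

# Parse tree for 9-5+2 with string -> string + string | string - string | 0|...|9:
tree = Node("string", [
    Node("string", [Node("string", [Node("9")]),
                    Node("-"),
                    Node("string", [Node("5")])]),
    Node("+"),
    Node("string", [Node("2")]),
])
assert tree.frontier() == "9-5+2"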
Example: parse tree for 9-5+2
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Ambiguity: A grammar is said to be ambiguous if there is some string that it can generate in more than one way (i.e., the string has more than one parse tree, or more than one leftmost derivation). A language is inherently ambiguous if it can only be generated by ambiguous grammars.
For example, consider the following grammar: string → string + string | string - string | 0 | 1 | … | 9
In this grammar, the string 9-5+2 has two possible parse trees, corresponding to (9-5)+2 and 9-(5+2).
Ambiguity is harmful to the intent of the program. The input might be interpreted in a way that was not the intention of the programmer, as in the 9-5+2 example. There is no general technique for handling ambiguity: it is not possible to build a tool that automatically identifies and removes ambiguity from an arbitrary grammar. However, it can be removed, broadly speaking, in the following possible ways:
Associativity
If an operand has operators on both sides, the side on which the operator takes this operand decides the associativity of that operator.
A binary operation on a set S that does not satisfy the associative law is called non-associative. A left-associative operation is a non-associative operation that is conventionally evaluated from left to right, i.e., the operand is taken by the operator on its left side.
A grammar such as the following generates strings with left-associative operators. (Note that it is left recursive and a naive top-down parser may go into an infinite loop on it; we will handle this problem later by making it right recursive.)
left → left + letter | letter
Precedence: Precedence is a simple ordering, based on either importance or sequence. One thing is
said to "take precedence" over another if it is either regarded as more important or is to be performed
first.
For example, consider the string a+5*2. It has two possible interpretations because of two different
parse trees corresponding to (a+5)*2 and a+(5*2). But the * operator has precedence over the +
operator. So, the second interpretation is correct. Hence, precedence determines the correct interpretation.
Parsing
. Process of determining whether a string can be generated by a grammar
Parsing falls in two categories:
Top-down parsing: Construction of the parse tree starts at the root (from the
start symbol) and proceeds towards the leaves (tokens or terminals)
Bottom-up parsing: Construction of the parse tree starts from the leaf nodes
(tokens or terminals of the grammar) and proceeds towards root (start symbol)
Parsing is the process of analyzing a continuous stream of input (read from a file or a keyboard, for
example) in order to determine its grammatical structure with respect to a given formal grammar.
The task of the parser is essentially to determine if and how the input can be derived from the start
symbol within the rules of the formal grammar. This can be done in essentially two ways:
Top-down parsing
It can be viewed as an attempt to find a leftmost derivation of an input string. The parser starts with the start symbol (the root) and tries to create the nodes of the parse tree in pre-order. Intuitively, the parser starts from the largest elements and breaks them down into incrementally smaller parts. It may require backtracking: if one derivation of a production fails, the syntax analyzer restarts the process using different rules of the same production, so the input string may be processed more than once before the right production is found. LL (Left-to-right scan of the input, Leftmost derivation) parsers and recursive descent parsers are examples of top-down parsers.
For example, consider the grammar
S → cAd
A → ab | a
and the input string w = cad.
In Fig. A the parse tree is not correct, since its leaf nodes do not match the input string: the production A → ab is not suitable for the given input string. So we need to go backward (backtrack) on the expansion of A and, rather than A → ab, use the other alternative A → a (shown in Fig. B), which is accepted.
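A minimal Python sketch of this backtracking behavior (our own illustration; the function names are assumptions): the alternatives for A are tried in order, and the parser falls back to the next alternative when the rest of the input cannot be matched:

# Backtracking recursive descent for S -> cAd, A -> ab | a.
def parse_A(s, i):
    """Yield every position just past a match of A starting at s[i]."""
    if s[i:i+2] == "ab":
        yield i + 2          # first alternative:  A -> ab
    if s[i:i+1] == "a":
        yield i + 1          # second alternative: A -> a

def parse_S(s):
    """Return True iff the whole string matches S -> cAd."""
    if s[:1] != "c":
        return False
    for j in parse_A(s, 1):              # try the alternatives for A in turn
        if s[j:j+1] == "d" and j + 1 == len(s):
            return True                  # 'd' matched and input fully consumed
    return False                         # every alternative failed

assert parse_S("cad")    # A -> ab fails here, so the parser backtracks to A -> a
assert parse_S("cabd")   # A -> ab succeeds directly
assert not parse_S("cd")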
Predictive Parser
A predictive parser is a special case of a recursive descent parser, where no backtracking is required. Predictive parsing relies on information about which first symbols can be generated by the right side of a production. The lookahead symbol guides the selection of the production A → α to be used:
if α starts with a token, then the production can be used when the lookahead symbol matches this token;
if α starts with a non-terminal B, then the production can be used if the lookahead symbol can be generated from B.
So match(a) moves the lookahead cursor one symbol forward if and only if the lookahead points to a; otherwise an error is reported.
Predictive parsing identifies which production to use to expand the current non-terminal, without backtracking. The predictive parser uses a lookahead pointer that points to the next input symbol, which makes the parser free of backtracking.
A predictive parser can be implemented by maintaining an external stack, as sketched below.
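As an illustration (our own sketch, not the original figure), here is a table-driven predictive parser with an explicit stack in Python, for the balanced-parentheses grammar S → (S)S | ε used earlier; the table M maps a (non-terminal, lookahead) pair to the right side to expand:

# Table-driven predictive parsing with an explicit stack.
# M[(non_terminal, lookahead)] gives the right side to push.
M = {
    ("S", "("): "(S)S",
    ("S", ")"): "",      # epsilon: ')' is in FOLLOW(S)
    ("S", "$"): "",      # epsilon at end of input
}

def predictive_parse(w):
    stack = ["$", "S"]               # bottom marker, then the start symbol
    tokens = list(w) + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top == "$" and tokens[i] == "$":
            return True              # accept: stack and input both exhausted
        if top.isupper():            # non-terminal: consult the table
            rhs = M.get((top, tokens[i]))
            if rhs is None:
                return False         # error: no table entry
            stack.extend(reversed(rhs))   # push right side, leftmost on top
        elif top == tokens[i]:
            i += 1                   # terminal: match and advance lookahead
        else:
            return False             # error: terminal mismatch
    return False

assert predictive_parse("(()())")
assert not predictive_parse("(()")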
First
FIRST (α) is defined as the collection of terminal symbols which are the first letters of strings derived from α.
Follow
Follow (A) is defined as the collection of terminal symbols that can occur immediately to the right of A in some sentential form.
Example- Consider the following grammar-
S → ACB | CbB | Ba
A → da | BC
B → g | ε
C → h | ε

Production           First                            Follow
S → ACB | CbB | Ba   First(S) = {d, g, h, ε, b, a}    Follow(S) = {$}
A → da | BC          First(A) = {d, g, h, ε}          Follow(A) = {h, g, $}
B → g | ε            First(B) = {g, ε}                Follow(B) = {$, a, h, g}
C → h | ε            First(C) = {h, ε}                Follow(C) = {g, $, b, h}
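These sets can be computed by a fixed-point iteration. The following Python sketch (our own illustration; EPS and END stand for ε and $, and all names are assumptions) reproduces the table above:

# Fixed-point computation of FIRST and FOLLOW for the example grammar.
EPS, END = "ε", "$"
grammar = {          # non-terminal -> list of right sides (tuples of symbols)
    "S": [("A","C","B"), ("C","b","B"), ("B","a")],
    "A": [("d","a"), ("B","C")],
    "B": [("g",), (EPS,)],
    "C": [("h",), (EPS,)],
}
NT = set(grammar)
first = {X: set() for X in NT}
follow = {X: set() for X in NT}
follow["S"].add(END)             # $ follows the start symbol

def first_of_seq(seq):
    """FIRST of a sequence of grammar symbols."""
    out = set()
    for sym in seq:
        f = first[sym] if sym in NT else {sym}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                 # every symbol in seq can derive ε
    return out

changed = True
while changed:                   # iterate until neither set grows any more
    changed = False
    for X, rhss in grammar.items():
        for rhs in rhss:
            rhs = tuple(s for s in rhs if s != EPS)   # ε-production => empty
            f = first_of_seq(rhs) if rhs else {EPS}
            if not f <= first[X]:
                first[X] |= f; changed = True
            for i, sym in enumerate(rhs):
                if sym in NT:    # FOLLOW rule: look at what comes after sym
                    tail = first_of_seq(rhs[i+1:]) if rhs[i+1:] else {EPS}
                    add = (tail - {EPS}) | (follow[X] if EPS in tail else set())
                    if not add <= follow[sym]:
                        follow[sym] |= add; changed = True

print(first)    # e.g. first["S"] == {"d", "g", "h", "ε", "b", "a"}
print(follow)   # e.g. follow["A"] == {"h", "g", "$"}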
Important Notes-
ε may appear in the FIRST set of a non-terminal, but never in a FOLLOW set.
If k = 1, it is an LL(1) grammar: we look ahead only one input symbol at a time. It is used in top-down parsing. Consider the following grammar:
1. S → aA | Bb
2. A → aB | Cb
3. B → bbC | aC
4. C → bD
5. D → d
Consider the input string w = aaabd.
The LL(1) derivation proceeds by looking at one input symbol at a time:
S ⇒ aA (rule 1, for the 1st letter of w = aaabd)
⇒ aaB (rule 2, for the 2nd letter of w = aaabd)
⇒ aaaC (rule 3, for the 3rd letter of w = aaabd)
⇒ aaabD (rule 4, for the 4th letter of w = aaabd)
⇒ aaabd (rule 5, for the last letter of w = aaabd)
Hence w = aaabd can be derived by selecting a unique production at each step with one symbol of lookahead, based on these productions.
If a single production cannot be selected deterministically from one input symbol, the grammar is not LL(1).
Example:
S → abB | aaA
B → d
A → c | d
Let the input string be w = abd.
At the start symbol S there are two possibilities for deriving the first input symbol a (either S → abB or S → aaA), so the choice is not deterministic with one symbol of lookahead. If we look at two symbols at a time, we can derive w = abd as follows:
S ⇒ abB (the lookahead ab selects S → abB) ⇒ abd (using B → d)
Hence the grammar is LL(2) but not LL(1).
Recursion
A grammar may contain left recursion or right recursion.
Left recursion
A production of a grammar is said to have left recursion if the leftmost symbol of its right-hand side is the same as the non-terminal on its left-hand side.
A grammar containing a production with left recursion is called a left recursive grammar.
Example-
S → Sa | ε
(Left Recursive Grammar)
Left recursion is considered to be a problematic situation for top-down parsers.
Therefore, left recursion has to be eliminated from the grammar.
Elimination of Left Recursion
Left recursion is eliminated by converting the grammar into a right recursive grammar.
If we have the left-recursive pair of productions-
A → Aα | β
where β does not begin with an A, then we can eliminate the left recursion by replacing this pair with-
A → βA'
A' → αA' | ε
(Right Recursive Grammar)
For example, consider the grammar A → ABd | Aa | a, B → Be | b. Eliminating left recursion gives:
A → aA'
A' → BdA' | aA' | ε
B → bB'
B' → eB' | ε
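As an illustration (not from the notes), the transformation for immediate left recursion can be written as a small Python function; right sides are tuples of symbols, and the new non-terminal A' is derived from A:

# Eliminate immediate left recursion:  A -> Aα1|...|Aαm | β1|...|βn
# becomes  A -> β1A'|...|βnA'  and  A' -> α1A'|...|αmA' | ε.
def eliminate_left_recursion(nt, rhss):
    """rhss: list of right sides (tuples of symbols) for non-terminal nt."""
    rec = [rhs[1:] for rhs in rhss if rhs and rhs[0] == nt]   # the α parts
    non = [rhs for rhs in rhss if not rhs or rhs[0] != nt]    # the β parts
    if not rec:
        return {nt: rhss}          # no left recursion: nothing to do
    new = nt + "'"
    return {
        nt:  [beta + (new,) for beta in non],
        new: [alpha + (new,) for alpha in rec] + [()],        # () stands for ε
    }

# A -> ABd | Aa | a   becomes   A -> aA',  A' -> BdA' | aA' | ε
print(eliminate_left_recursion("A", [("A","B","d"), ("A","a"), ("a",)]))
# {'A': [('a', "A'")], "A'": [('B', 'd', "A'"), ('a', "A'"), ()]}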
Right Recursion
A production of a grammar is said to have right recursion if the rightmost symbol of its right-hand side is the same as the non-terminal on its left-hand side.
A grammar containing a production with right recursion is called a right recursive grammar.
Example-
S → aS | ε (Right Recursive Grammar)
Right recursion does not create any problem for top-down parsers.
Therefore, there is no need to eliminate right recursion from the grammar.
Left factoring
Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive parsing. The basic idea is that when it is not clear which of two or more alternative
productions to use to expand a non-terminal A, we defer the decision till we have seen enough input
to make the right choice.
Therefore A → αβ1 | αβ2 transforms to
A → αA'
A' → β1 | β2
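A hedged Python sketch of left factoring a pair of alternatives (the names are illustrative; it factors out their longest common prefix α):

# Left-factor two alternatives with a common prefix:
# A -> αβ1 | αβ2  becomes  A -> αA'  and  A' -> β1 | β2.
def left_factor(nt, rhs1, rhs2):
    """rhs1, rhs2: right sides as tuples of symbols."""
    k = 0
    while k < min(len(rhs1), len(rhs2)) and rhs1[k] == rhs2[k]:
        k += 1                      # length of the common prefix α
    if k == 0:
        return {nt: [rhs1, rhs2]}   # nothing to factor
    new = nt + "'"
    return {
        nt:  [rhs1[:k] + (new,)],
        new: [rhs1[k:], rhs2[k:]],  # an empty tuple stands for ε
    }

# Concretely, A -> ab | ac:
print(left_factor("A", ("a","b"), ("a","c")))
# {'A': [('a', "A'")], "A'": [('b',), ('c',)]}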
Bottom-up parsing
A parser can start with the input and attempt to rewrite it to the start symbol. LR (Left-to-right scan of the input, Rightmost derivation in reverse) parsers are examples of bottom-up parsers.
We will now study a general style of bottom up parsing, also known as shift-reduce parsing. Shift-
reduce parsing attempts to construct a parse tree for an input string beginning at the leaves (the
bottom) and working up towards the root (the top). We can think of the process as one of "reducing"
a string w to the start symbol of a grammar. At each reduction step, a particular substring matching
the right side of a production is replaced by the symbol on the left of that production, and if the
substring is chosen correctly at each step, a rightmost derivation is traced out in reverse.
A shift-reduce parser works with a stack (initially containing only $) and an input buffer (initially w$); it succeeds when the stack contains $S and the input is exhausted ($). At each step the parser performs one of four actions:
1. Shift: the parser shifts zero or more input symbols onto the stack until a handle β is on top of the stack.
2. Reduce: β is reduced to the left-hand side of the production.
3. Accept: announce successful completion of parsing.
4. Error: call an error recovery routine.
E.g., for the production rule A → abc, the substring abc is the handle β and A is the left side of the production.
The string to be replaced at each reduction step is called a handle. Although a handle of a string can be described as a substring that equals the right side of a production rule, not every such substring is a handle: reducing a handle must correspond to one step of a rightmost derivation in reverse.
Consider the grammar
S → aABe
A → Abc | b
B → d
The sentence abbcde can be reduced to S as follows:
abbcde
aAbcde (reducing b to A)
aAde (reducing Abc to A)
aABe (reducing d to B)
S (reducing aABe to S)
These reductions trace out the following rightmost derivation in reverse:
S ⇒ aABe
⇒ aAde
⇒ aAbcde
⇒ abbcde
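The following Python sketch (our own illustration) replays this shift-reduce parse; note that the sequence of shift/reduce decisions is supplied by hand here, since choosing them automatically is exactly what an LR parser's DFA is for (see below):

# Replay the shift-reduce parse of 'abbcde' for the grammar
# S -> aABe, A -> Abc | b, B -> d, following the reductions above.
def run(w, script):
    stack, inp = ["$"], list(w) + ["$"]
    for action, arg in script:
        if action == "shift":
            stack.append(inp.pop(0))         # move one symbol onto the stack
        else:                                # ("reduce", (lhs, rhs))
            lhs, rhs = arg
            assert stack[-len(rhs):] == list(rhs), "handle not on top of stack"
            del stack[-len(rhs):]            # pop the handle ...
            stack.append(lhs)                # ... and push the left side
        print(action, "".join(stack), "".join(inp))
    return stack == ["$", "S"] and inp == ["$"]

script = [("shift", None), ("shift", None),
          ("reduce", ("A", "b")),
          ("shift", None), ("shift", None),
          ("reduce", ("A", "Abc")),
          ("shift", None),
          ("reduce", ("B", "d")),
          ("shift", None),
          ("reduce", ("S", "aABe"))]
assert run("abbcde", script)   # ends with stack $S and input $ => accept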
An LR parser can be one of: SLR (Simple LR), LALR (Look-Ahead LR), or CLR (Canonical LR).
How does the LR(k) parser know when to shift and when to reduce?
. It uses a DFA.
. At each step, the parser runs the DFA using the symbols on the stack as input.
. The DFA's input is the sequence of terminals and non-terminals on the stack, from bottom to top.
. The current state of the DFA plus the next k tokens indicate whether to shift or reduce.
Error Recovery
A parser should be able to detect and report any error in the program. It is expected that when an error is encountered, the parser should be able to handle it and carry on parsing the rest of the input. The parser is mostly expected to check for errors, but errors may be encountered at various stages of the compilation process. The following are common strategies a parser can use to recover from errors:
Panic mode
When a parser encounters an error anywhere in a statement, it ignores the rest of the statement, skipping the input from the erroneous token up to a delimiter such as a semicolon. This is the easiest way of error recovery, and it also prevents the parser from getting into an infinite loop.
Statement mode
When a parser encounters an error, it tries to take corrective measures so that the rest of the statement's input allows the parser to continue, for example by inserting a missing semicolon or replacing a comma with a semicolon. Parser designers have to be careful here, because one wrong correction may lead to an infinite loop.
Error productions
Some common errors that may occur in code are known to compiler designers. The designers can create an augmented grammar containing productions that generate the erroneous constructs, so that these errors are recognized when they are encountered.
Global correction
The parser considers the program in hand as a whole and tries to figure out what the program is
intended to do and tries to find out a closest match for it, which is error-free. When an erroneous input
(statement) X is fed, it creates a parse tree for some closest error-free statement Y. This may allow the
parser to make minimal changes in the source code, but due to the complexity (time and space) of this
strategy, it has not been implemented in practice yet.