Compiler Design - Syntax Analysis
We have seen that a lexical analyzer can identify tokens with the help of regular expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence, because of the limitations of regular expressions. Regular expressions cannot check balancing tokens, such as parentheses nested to an arbitrary depth. Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.
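For instance, recognizing strings with balanced parentheses requires an unbounded counter or stack, which is exactly the extra power a push-down automaton adds over a finite automaton. A minimal Python sketch of such a check (the function name is ours, not part of any tool mentioned here):

def is_balanced(s: str) -> bool:
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1        # push
        elif ch == ')':
            depth -= 1        # pop
            if depth < 0:     # a closing parenthesis with no matching opening one
                return False
    return depth == 0         # every opening parenthesis must be closed

print(is_balanced("((a + b) * c)"))   # True
print(is_balanced("((a + b) * c"))    # False

Since the counter can grow without bound, no fixed amount of memory (and hence no finite automaton or regular expression) suffices for inputs of arbitrary nesting depth.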
Context-Free Grammar
In this section, we will first see the definition of context-free
grammar and introduce terminologies used in parsing technology.
Example
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → ε | Z → 0Q0 | N → 1Q1 }
S = Q
This grammar describes the language of even-length binary palindromes, such as: 1001, 11100111, 0110, etc.
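The productions can be turned almost directly into a recognizer. Here is a minimal Python sketch, with Z and N folded into the single rule Q → 0Q0 | 1Q1 | ε for brevity (the function name is ours):

def in_language(s: str) -> bool:
    if s == "":
        return True                    # Q → ε
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "01":
        return in_language(s[1:-1])    # Q → 0Q0 or Q → 1Q1
    return False

for w in ["1001", "11100111", "0110", "10"]:
    print(w, in_language(w))   # True, True, True, False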
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer
in the form of token streams. The parser analyzes the source code
(token stream) against the production rules to detect any errors in
the code. The output of this phase is a parse tree.
In this way, the parser accomplishes two tasks: it parses the code looking for errors, and it generates a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error-recovery strategies, which we will learn later in this chapter.
Derivation
A derivation is a sequence of production rule applications that obtains the input string from the start symbol. During parsing, we take two decisions for some sentential form of the input:
Deciding the non-terminal which is to be replaced.
Deciding the production rule by which the non-terminal will be replaced.
To decide which non-terminal to replace, we can use one of two conventions:
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called a left-most derivation. The sentential form derived by a left-most derivation is called the left-sentential form.
Right-most Derivation
If we scan and replace the input with production rules from right to left, it is known as a right-most derivation. The sentential form derived from a right-most derivation is called the right-sentential form.
Example
Production rules:
E → E + E
E → E * E
E → id
Input string: id + id * id
Left-most derivation:
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id
Notice that the left-most non-terminal is always rewritten first.
Right-most derivation:
E → E + E
E → E + E * E
E → E + E * id
E → E + id * id
E → id + id * id
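The two derivations differ only in which occurrence of E is rewritten at each step. The following Python sketch (helper names are ours) replays both by always substituting into the left-most or the right-most E of the current sentential form:

def derive(choices, leftmost=True):
    form = "E"
    print(form)
    for rhs in choices:
        i = form.index("E") if leftmost else form.rindex("E")
        form = form[:i] + rhs + form[i + 1:]   # rewrite that occurrence of E
        print(form)

# Left-most derivation of id + id * id
derive(["E * E", "E + E", "id", "id", "id"], leftmost=True)

# Right-most derivation of id + id * id
derive(["E + E", "E * E", "id", "id", "id"], leftmost=False)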
Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient
to see how strings are derived from the start symbol. The start
symbol of the derivation becomes the root of the parse tree. Let us
see this by an example from the last topic.
E → E * E
E → E + E * E
E → id + E * E
E → id + id * E
E → id + id * id
Step 1: E → E * E
Step 2: E → E + E * E
Step 3: E → id + E * E
Step 4: E → id + id * E
Step 5: E → id + id * id
In a parse tree:
All leaf nodes are terminals.
All interior nodes are non-terminals.
In-order traversal gives the original input string.
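These properties can be checked directly on the tree built above. Below is a minimal Python sketch (the Node class is ours) of the final parse tree for id + id * id; collecting its leaves from left to right reproduces the input string:

class Node:
    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.children = list(children)

def leaves(node):
    if not node.children:          # a leaf: a terminal symbol
        return [node.symbol]
    out = []                       # an interior node: a non-terminal
    for child in node.children:
        out.extend(leaves(child))
    return out

# E → E * E, then the left E → E + E, then each remaining E → id
tree = Node("E", [
    Node("E", [Node("E", [Node("id")]),
               Node("+"),
               Node("E", [Node("id")])]),
    Node("*"),
    Node("E", [Node("id")]),
])
print(" ".join(leaves(tree)))   # id + id * id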
Ambiguity
A grammar G is said to be ambiguous if it has more than one parse
tree (left or right derivation) for at least one string.
Example
E → E + E
E → E – E
E → id
For the string id + id – id, the above grammar generates two parse trees: one that groups the expression as (id + id) – id and another that groups it as id + (id – id).
A language that can only be generated by ambiguous grammars is said to be inherently ambiguous. Ambiguity in a grammar is not good for compiler construction. No method can detect and remove ambiguity automatically, but it can be removed either by re-writing the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.
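To make the two parse trees concrete, the following Python sketch (the tuple encoding is ours) builds both trees for id + id – id and prints their fully parenthesized forms:

def show(t):
    if t == "id":
        return "id"
    op, left, right = t            # an interior node: (operator, left, right)
    return f"({show(left)} {op} {show(right)})"

tree1 = ("-", ("+", "id", "id"), "id")   # '-' at the root
tree2 = ("+", "id", ("-", "id", "id"))   # '+' at the root

print(show(tree1))   # ((id + id) - id)
print(show(tree2))   # (id + (id - id))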
Associativity
If an operand has operators on both sides, the side on which the operator takes this operand is decided by the associativity of those operators. If the operation is left-associative, the operand will be taken by the operator on its left; if the operation is right-associative, the operator on its right will take the operand. Operations such as addition, multiplication, subtraction, and division are left-associative; exponentiation is usually right-associative.
Example
The expression id op id op id is grouped as (id op id) op id if op is left-associative, and as id op (id op id) if op is right-associative.
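The difference matters for operators such as subtraction, where the two groupings produce different values. A minimal Python sketch (helper names are ours) of the two folding orders:

def fold_left(op, xs):
    acc = xs[0]
    for x in xs[1:]:
        acc = op(acc, x)      # ((x1 op x2) op x3) ...
    return acc

def fold_right(op, xs):
    acc = xs[-1]
    for x in reversed(xs[:-1]):
        acc = op(x, acc)      # x1 op (x2 op (x3 ...))
    return acc

sub = lambda a, b: a - b
print(fold_left(sub, [8, 4, 2]))    # (8 - 4) - 2 = 2
print(fold_right(sub, [8, 4, 2]))   # 8 - (4 - 2) = 6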
Precedence
If two different operators share a common operand, the precedence of the operators decides which will take the operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4 and another corresponding to 2+(3*4). By setting precedence among operators, this problem can be easily removed. Mathematically, * (multiplication) has precedence over + (addition), so the expression 2+3*4 will always be interpreted as:
2 + (3 * 4)
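One common way to encode precedence is to give each level its own grammar rule, so that the tighter-binding operator ends up lower in the parse tree. The following Python sketch (a toy evaluator of our own, not a quoted implementation) uses the unambiguous layered grammar expr → term (+ term)*, term → factor (* factor)*, factor → number, and therefore evaluates 2+3*4 as 2+(3*4):

import re

def evaluate(src):
    tokens = re.findall(r"\d+|[+*]", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():                 # factor → number
        nonlocal pos
        tok = tokens[pos]; pos += 1
        return int(tok)

    def term():                   # term → factor ('*' factor)*
        nonlocal pos
        value = factor()
        while peek() == "*":
            pos += 1
            value *= factor()
        return value

    def expr():                   # expr → term ('+' term)*
        nonlocal pos
        value = term()
        while peek() == "+":
            pos += 1
            value += term()
        return value

    return expr()

print(evaluate("2+3*4"))   # 14, i.e. 2 + (3 * 4)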
Left Recursion
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’ itself as the left-most symbol. A left-recursive grammar is considered a problematic situation for top-down parsers. Top-down parsers start parsing from the start symbol, which is itself a non-terminal. So, when the parser encounters the same non-terminal in its derivation, it keeps expanding that non-terminal without consuming any input, and goes into an infinite loop.
Example:
(1) A => Aα | β
(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is a non-terminal symbol and α represents a string of terminals and non-terminals. A top-down parser will first try to expand A, which in turn yields a string beginning with A itself, so the parser may go into a loop forever. (2) is an example of indirect left recursion: expanding S leads back to S through A (S => Aα => Sdα).
The left-recursive production
A => Aα | β
is re-written as:
A => βA'
A' => αA' | ε
This does not change the strings derived from the grammar, but it removes the immediate left recursion.
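This transformation is mechanical enough to automate. Below is a minimal Python sketch (our own grammar encoding: each right-hand side is a list of symbols, and ε is the empty list) of removing immediate left recursion from the productions of one non-terminal:

def remove_immediate_left_recursion(head, productions):
    recursive = [p[1:] for p in productions if p and p[0] == head]   # the α parts
    others = [p for p in productions if not p or p[0] != head]       # the β parts
    if not recursive:
        return {head: productions}       # nothing to do
    new = head + "'"                     # the fresh non-terminal A'
    return {
        head: [beta + [new] for beta in others],               # A  → βA'
        new:  [alpha + [new] for alpha in recursive] + [[]],   # A' → αA' | ε
    }

# A => Aα | β
print(remove_immediate_left_recursion("A", [["A", "α"], ["β"]]))
# {'A': [['β', "A'"]], "A'": [['α', "A'"], []]}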
Example
The production set
S => Aα | β
A => Sd
exhibits indirect left recursion. Substituting the productions of S into A => Sd gives:
S => Aα | β
A => Aαd | βd
and then, the immediate left recursion in A is removed using the first technique:
A => βdA'
A' => αdA' | ε
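Running the sketch above on the substituted production set A => Aαd | βd produces the same result:

print(remove_immediate_left_recursion("A", [["A", "α", "d"], ["β", "d"]]))
# {'A': [['β', 'd', "A'"]], "A'": [['α', 'd', "A'"], []]}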
Left Factoring
If more than one grammar production rule has a common prefix string, then the top-down parser cannot make a choice as to which of the productions it should take to parse the string in hand.
Example
A => αβ | αγ | …
Then the parser cannot determine which production to follow to parse the string, as both productions start with the same string α. To remove this confusion, we use a technique called left factoring.
After left factoring, the grammar becomes:
A => αA'
A' => β | γ | …
Now the parser has only one production per common prefix, which makes it easier to make decisions.
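Left factoring can also be automated. The following Python sketch uses the same grammar encoding as before; for brevity it factors only a common first symbol rather than the longest common prefix:

from collections import defaultdict

def left_factor(head, productions):
    groups = defaultdict(list)
    for p in productions:
        groups[p[0] if p else ""].append(p)   # group by the first symbol
    result = {head: []}
    primes = 0
    for prefix, group in groups.items():
        if len(group) == 1 or not prefix:
            result[head].extend(group)            # nothing to factor
        else:
            primes += 1
            new = head + "'" * primes             # fresh non-terminal A', A'', ...
            result[head].append([prefix, new])    # A  → αA'
            result[new] = [p[1:] for p in group]  # A' → β | γ | ...
    return result

# A => αβ | αγ
print(left_factor("A", [["α", "β"], ["α", "γ"]]))
# {'A': [['α', "A'"]], "A'": [['β'], ['γ']]}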
First Set
The FIRST set of a grammar symbol (or string of symbols) α is the set of terminals that can appear at the beginning of some string derived from α. That is, a terminal t belongs to FIRST(α) whenever
α → tβ
for some string of symbols β, where → here denotes derivation in zero or more steps. If α can derive the empty string ε, then ε is also included in FIRST(α).
Follow Set
The FOLLOW set of a non-terminal A is the set of terminals that can appear immediately to the right of A in some sentential form derived from the start symbol. That is, a terminal t belongs to FOLLOW(A) whenever the start symbol derives a sentential form of the shape αAtβ.
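Both sets are usually computed together by a fixed-point iteration over the productions. Below is a minimal Python sketch (our own grammar encoding; the ε and $ markers are conventions of the sketch, with $ marking the end of input) that computes FIRST and FOLLOW for a small grammar:

EPS, END = "ε", "$"

def first_of(seq, FIRST):
    # FIRST of a sequence of symbols, given FIRST for single symbols.
    out = set()
    for sym in seq:
        out |= FIRST[sym] - {EPS}
        if EPS not in FIRST[sym]:
            return out
    out.add(EPS)              # every symbol in seq can derive ε
    return out

def compute_first_follow(grammar, start, terminals):
    FIRST = {t: {t} for t in terminals}
    FIRST.update({nt: set() for nt in grammar})
    FOLLOW = {nt: set() for nt in grammar}
    FOLLOW[start].add(END)
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                f = first_of(body, FIRST)          # FIRST(body) feeds FIRST(head)
                if not f <= FIRST[head]:
                    FIRST[head] |= f; changed = True
                for i, sym in enumerate(body):     # FOLLOW for non-terminals in body
                    if sym not in grammar:
                        continue
                    rest = first_of(body[i + 1:], FIRST)
                    add = (rest - {EPS}) | (FOLLOW[head] if EPS in rest else set())
                    if not add <= FOLLOW[sym]:
                        FOLLOW[sym] |= add; changed = True
    return FIRST, FOLLOW

# E → T E',  E' → + T E' | ε,  T → id
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["id"]],
}
FIRST, FOLLOW = compute_first_follow(grammar, "E", {"+", "id"})
print(FIRST["E"], FOLLOW["E'"])   # {'id'} {'$'}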