Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Context Free Grammar

Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 23

Context-Free Grammars and Languages

(Based on Hopcroft, Motwani and Ullman (2007) & Cohen (1997))


Introduction
Consider an example sentence: A small cat eats the fish
English grammar has rules for constructing sentences; e.g.,
1. A sentence can be a subject followed by a predicate
(Subject: A small cat)
(Predicate: eats the fish)
2. A subject can be a noun-phrase (Noun-Phrase: A small cat)
3. A noun-phrase can be an article, followed by an adjective followed by a noun
(Article: A) (Adjective: small) (Noun: cat)
4. A predicate can be a verb-phrase (Verb-Phrase: eats the fish)
5. A verb-phrase can be a verb followed by a noun-phrase
(Verb: eats) (Noun-Phrase: the fish)
7. A noun-phrase can be an article, followed by a noun
(Article: the) (Noun: fish)
8. A verb can be: buries, touches, grabs, eats
9. An adjective can be: big, small
10. An article can be: the, a, an
Definition: The things that cannot be replaced by anything are called terminals.
Definition: The things that must be replaced by other things are called nonterminals or
variables
In the above example,
small and eats are terminals.
noun-phrase and verb-phrase are nonterminals.
Definition: The sequence of applications of the rules that produces the finished string of
terminals from the starting symbol is called a derivation.
Context-Free Grammar (CFG)
Example 1: terminals: = { a }
productions:
S -> aS
S -> ^

nonterminal: N = { S }

The above CFG defines a* = { ^, a, aa, aaa, aaaa, aaaaa, . . . }


It can generate or derive aaaa as follows:
S => aS
=> aaS
=> aaaS
=> aaaaS
=> aaaa
The same derivation can also be written as:

S => aS => aaS => aaaS => aaaaS => aaaa


Example 2: terminal: = { a }
productions:
S -> SS
S -> a
S -> ^

nonterminal: N = { S }

This CFG can be written in more compact notation:


S -> SS | a | ^
which is also called the Backus Normal Form or Backus-Naur Form (BNF).
The above CFG defines the same CFL: a*
It can generate the string aa as follows:
S => SS => SSS
=> SSa
=>SSSa
=>SaSa
=>aSa
=>aa
In this grammar, a string in the CFL has infinitely many derivations. In other examples,
a string is usually generated in limited number of ways.
Definition: A context-free grammar (CFG) is composed of:
1. : A finite non-empty set of symbols called terminals from which strings of the
language are made.
2. N: A finite set of symbols called nonterminals. and N are disjoint.
3. S: A special start symbol in N.
4. P: A finite set of productions of the form:
Ni a string from ( + N)*
where Ni is in N.
Thus, each production is of the form:
one nonterminal -> finite string of terminals and/or nonterminals
where the strings of terminals, nonterminals can consist of only terminals or of only
nonterminals, or any mixture of terminals and nonterminals or even the empty string. We
require that at least one production has the nonterminal S as its left side. On the right
side of the arrow an string from ( + N)* is allowed.
Convention: Terminals will typically be lowercase letters. Nonterminals will typically be
uppercase letters.
Example 3: A CFG for ab* = { a, ab, abb, abbb, abbbb, . . . . }
1. Terminals: = {a, b},
2. Nonterminal: N = {S, B}
3. Productions: P =
{
S -> aB
B -> bB | ^

}
DERIVATION of

abbb using the CFG of example 3:

S => aB => abB => abbB => abbbB => abbb


Most of the time the set of productions are explicitly given for a CFG and the terminals
and the non-terminals are understood from the context as shown in examples 4-6 :

Example 4: A CFG for

aba*
S ->
A ->

= { ab, aba, abaa, abaaa, . . . .

abA
aA | ^

DERIVATION of abaaa using the CFG of example 4:


S => abA => abaA => abaaA => abaaaA => abaaa
Example 5: A CFG for

ab*a
S ->
B ->

= { aa, aba, abba, abbba, . . . .

aBa
aB | ^

DERIVATION of abbbba using the CFG of example 5:


S => aBa => abBa => abbBa => abbbBa => abbbbBa => abbbba
Example 6: A CFG for { ancbn :n > 0} = { acb, aacbb, aaacbbb, . . . .
S ->

aSb | acb

DERIVATION of aaacbbb using the CFG of example 6:


S => aSb => aaSbb => aaacbbb
Definition: The language generated (defined, derived, produced) by a CFG, G, is the set
of all strings of terminals that can be produced from the start symbol S using the
productions as substitutions. A language generated by a CFG, G, is called a context-free
language (CFL) and is denoted by L(G).
Example: terminals: = {a}, nonterminal: N = {S}
productions: P =
{
S -> aS
S -> ^
}
Let L1 the language generated by this CFG, and let L2 be the language generated by
regular expression a*.

Claim: L1 = L2.
Proof:
We first show that L2 L1.
Consider a^n L2 for n >= 1. We can generate an by using first production n times, and
then second production.
Can generate ^ L2 by using second production only.
Hence L2 L1.
We now show that L1 L2.
Since a is the only terminal, CFG can only produce strings having only as.
Thus, L1 L2.
Note that Two types of arrows:
-> used in statement of productions
=> used in derivation of string
in the above derivation of a4, there were many unfinished stages that consisted of both
terminals and nonterminals. These are called working strings.
^ is neither a nonterminal (since it cannot be replaced with something else) nor a
terminal (since it disappears from the string).
Example: terminals: = { a, b }
nonterminals: S
productions:
S -> aS
S -> bS
S -> a
S -> b
More compact notation:
S -> aS | bS | a | b
Can produce the string abbab as follows:
S => aS
=> abS
=> abbS
=> abbaS
=> abbab
Let L1 be the CFL, and let L2 be the language generated by the
regular expression (a + b)+.
Claim: L1 = L2.
Proof:
First we show that L2 L1.
Consider any string w L2.
Read letters of w from left to right.
For each letter read in, if it is not the last, then

use the production S -> aS if the letter is a or


use the production S -> bS if the letter is b
For the last letter of the string,
use the production S -> a if the letter is a or
use the production S -> b if the letter is b
In each stage of the derivation, the working string has the form
(string of terminals)S. Hence, we have shown how to generate w using the CFG, which
means that w L1. Hence, L2 L1.
Now we show that L1 L2.
To show this, we need to show that if w L1, then w L2.
This is equivalent to showing that if w L2, then w L1.
Note that the only string w L2 is w = A.
But note that A cannot be generated by the CFG, so A L1.
Hence, we have proven that L1 L2.

Trees
Can use a tree to illustrate how a string is derived from a CFG.
Definition: These trees are called syntax trees, parse trees, generation trees,
production trees, or derivation trees.
Example: CFG:
terminals: a, b
nonterminals: S, A
productions:
S ! AAA | A
A ! AA | aA | Ab | a | b
String abaaba has the following derivation:
S => AAA
=> aAAA
=> abAA
=> abAbA
=> abaAbA
=> abaabA
=> abaaba
which corresponds to the following derivation tree:
S
/|\
/|\
/|\
/|\

AAA
/||\|
/||\|
a AAb a
|/\
|/\
baA
|
|
a

Example: CFG for simplified arithmetic expressions.


Terminals
Nonterminals
Productions

: +, _, 0, 1, 2, . . . , 9
:S
:

S!S+S|S_S|0|1|2||9
Consider the expression 2 _ 3 + 4.
Ambiguous how to evaluate this:
Does this mean (2 _ 3) + 4 = 10 or 2 _ (3 + 4) = 14 ?
Can eliminate ambiguity by examining the two possible derivation trees
SS
/|\/|\
/|\/|\
/|\/|\
/|\/|\
S+SS*S
/|\||/|\
/|\||/|\
2*3423+4
Eliminate the Ss as follows:
+*
/\/\
/\/\
/\/\
/\/\
*42+
/\/\
/\/\
2334
Note that we can construct a new notation for mathematical expressions:
start at top of tree
walk around tree keeping left hand touching tree
first time hit each terminal, print it out.

This gives us a string which is in operator prefix notation or Polish notation.


In above examples,
first tree yields
+_234
second tree yields
_2+34
To evaluate the string:
1. scan string from left to right.
2. the first time we read a substring of the form operator-operand-operand (o-o-o),
replace the three symbols with the one result of the indicated arithmetic
calculation.
3. go back to step 1
Example: (from above)
first tree yields:
string first o-o-o substring
+ _ 2 3 4 _ 2 3 + 6 4 + 6 4 10
second tree yields:
string first o-o-o substring
_ 2 + 3 4 + 3 4 _ 2 7 _ 2 7 14

Definition: A CFG is ambiguous if for at least one string in its CFL there are two possible
derivations of the string that correspond to two different syntax trees.
Example: PALINDROME
Terminals
: a, b
Nonterminals : S
Productions :
S ! aSa | bSb | a | b | _
Can generate the string babbab as follows:
S => bSb
=> baSab
=> babSbab
=> babbab
which has derivation tree:
S
/|\
bSb

/|\
aSa
/|\
bSb
|
^
Can show that this CFG is unambiguous.

Definition: For a given CFG, the total language tree is the tree with root S, whose
children are all the productions of S, whose second descendents are all the working
strings that can be constructed by applying one production to the leftmost nonterminal in
each of the children, and so on.
Example:
Terminals : a, b
Nonterminals : S, X
Productions :
S -> aX | Xa | aXbXa
X -> ba | ab
This CFG has total language tree as follows:
S
/|\
/|\
/|\
/|\
/|\
aX Xa aXbXa
/|/|/\
/|/|/\
aba aab baa aba ababXa aabbXa
/\/\
ababbaa abababa aabbbaa aabbaba
The CFL is finite.
Other References:
http://web.njit.edu/~marvin/cis341/chap12.pdf
http://www.cs.appstate.edu/~dap/classes/2490/chap12.html

Chapter 13 Grammatical Format


Regular Grammars
We previously saw that
CFGs can generate some regular languages.
CFGs can generate some nonregular languages.
We will see that
all regular languages can be generated by CFGs.
some nonregular languages cannot be generated by CFGs.
Can turn FA into a CFG as follows:
Example: L = all strings ending in a.
FA:

S-

A
+
b

B
b
9

Definition: The path development of a string processed on a machine:


Start in starting state S.
For each state visited, print out the input letters used thus far and the current
state.

The string ababba has following path development on the FA:


S
aA
abB
abaA
ababB
ababbB
ababbaA
ababba
Now we define the following productions:
S -> aA | bB
A -> aA | bB | _
B > aA | bB
Note that:
The CFG has a production
X -> cY
if and only if in the FA, there is an arc from state X to state Y labeled with c.
The CFG has a production
X -> A
if and only if state X in the FA is a final state.
Derivation of ababba using the CFG:
S => aA
=> abB
=> abaA
=> ababB
=> ababbB
=> ababbaA
=> ababba
There is a one-to-one correspondence between path developments on the FA and
derivations in the CFG; i.e., we can use the pigeonhole principle. The derivation of the
string ababba using the CFG is exactly the same as the path development given above.
Theorem 21 All regular languages are CFLs.

10

Example:
FA:

A+

a
a

a
S-

C+

b
b
b
a

productions:
S => aS | bA
A => aC | bB | A
B => aB | bC
C => aA | bB | A
Consider a CFG G = (E, , R, S), where
E is the set of terminals
is the set of nonterminals, and
S is the starting nonterminal R (E+ )* is the set of productions,
where a production (N, U) R with N and U (E + )* is written as N -> U
Definition: For a given CFG G = (E,,R, S), W is a semiword if W E * ;
i.e., W is a string of terminals (maybe none) cancatenated with exactly one nonterminal
(on the right).

11

Example: aabaN is a semiword if N is a nonterminal and a and b are terminals.

Definition: A Regular grammar, RG, is composed of


1. : A finite non-empty set of symbols called terminals
2. N: A finite set of symbols, called Non-terminals. and N are disjoint
3. S: A special start symbol in N.
4. P: A set of productions in one of the following two forms (not both):
4a.

Right RG:

Ni *Nj

4b.

Left RG:

Ni Nj *

A regular grammar can be either Right Regular or Left Regular not both. Each of N i
and Nj is a single non-terminal and they could be the same. That is, N i = Nj is
allowed.
Theorem 22 If a CFG is a regular grammar, then the language generated by this CFG is
regular.
Proof.
We will prove theorem by showing that there is a TG that accepts the language enerated
by the CFG.
Suppose CFG is as follows:
N1 -> w1M1
N2 -> w2M2
...
Nn -> wnMn
Nn+1 -> wn+1
Nn+2 -> wn+2
...
Nn+m ! wn+m
where Ni and Mi are nonterminals (not necessarily distinct) and wi E* are strings of
terminals. Thus, wiMi is a semiword. At least one of the Ni = S. Assume that N1 = S.
Create a state of the TG for each nonterminal Ni and for each nonterminal Mj .
Also create a state +.
Make the state for nonterminal S the initial state of the transition graph.
Draw an arc labeled with wi from state Ni to state Mi if and only if there is a
production Ni -> wiMi.
Draw an arc labeled with wi from state Ni to state + if and only if there is a
production Ni -> wi.
Thus, we have created a TG.

12

By considering the path developments of words accepted by the TG, we can


show that there is a one-to-one correspondence between words accepted by TG
and words in CFL.
Thus, these are the same language.
Kleenes Theorem implies that the language has a regular expression.
Thus, language is regular.

Remarks:
all regular languages can be generated by some regular grammars (Theorem 21)
all regular grammars generate some regular language.
a regular language may have many CFGs that generate it, where some of the
CFGs may not be regular grammars.
Example: CFG
productions:
S -> aB | bA | abA | baB
A -> abaA | bb
B -> baA | ab

Corresponding TG (note that CFG does not generate A):

ab

aba
bb

b
S-

ba
a

a, b

ba
B

Definition: A production (N, U) R is a A-production if U = A, i.e., the

13

production is N -> A. If CFG does not contain a A -production, then A CFL.


However, CFG may have A -production and A CFL.
Example: productions:
S -> aX
X -> A
Chomsky Normal Form
A Productions and Nullable Nonterminals
Recall we previously defined A -production:
N-> A where N is some nonterminal.
Note that
If some CFL contains the word A, then the CFG must have a A -production.
However, if a CFG has a A -production, then the CFL does not necessarily
contain A;
e.g.,
S -> aX
X -> A
which defines the CFL {a}.
Definition: For a given CFG with as its set of nonterminals E and A as its set of
terminals, a working string W 2 (E + )* is any string of nonterminals and/or terminals
that can be generated from the CFG starting from any nonterminal.
Definition: For a given CFG having a nonterminal X and W a possible working string, we
use the notation
X => W
if there is some derivation in the CFG starting from X that can result in the working string
W.
Example: CFG:
S -> a | Xb | aY a
X -> Y | A
Y -> X | a
Since we have the following derivation
S => aY a
=> aXa
=> aa
we can write S *=> aY a and S *=> aXa and S *=> aa.
Definition: In a given CFG, a nonterminal X is nullable if

14

1. There is a production X -> A, or


2. X *=> A; i.e., there is a derivation that starts at X and leads to A
3. X => . => A
Chomsky Normal Form
Definition: A CFG G = (E, ,R, S) is in Chomsky Normal Form (CNF) if (N, U) R
implies U () + E; i.e., each of its productions has one of the two forms:
1. Nonterminal -> string of exactly two Nonterminals
2. Nonterminal -> one terminal
Theorem 26 For any CFL L, the non- A words of L can be generated by a
CFG in CNF.

Leftmost Nonterminals and Derivations


Definition: The leftmost nonterminal (LMN) in a working string is the first nonterminal that
we encounter when we scan the string from left to right. Example: In the string
bbabXbaY SbXbY , the LMN is X.
Definition: If a word w is generated by a CFG by a certain derivation and at each step in
the derivation, a rule of production is applied to the leftmost nonterminal in the working
string, then this derivation is called a leftmost derivation (LMD).
Example: CFG:
S -> baXaS | ab
X -> Xab | aa
The following is a LMD:
S => baXaS
=> baXabaS
=> baXababaS
=> baaaababaS
=> baaaababaab

15

Example: CFG:
S -> XY
X -> Y b | Xa | aa | Y Y
Y -> XbbX | ab
The word abbaaabbabab has the following derivation tree:
S
/\
/ \
/ \
/
\
/
\
X _Y_
/\ //\\
Yb /||\
/\ XbbX
ab /\/\
Xa /\
/\Y Y
aa /\/\
ab ab
Note that if we walk around the tree starting down the left branch of the root with our left
hand always touching the tree, then the order in which we first visit each nonterminal
corresponds to the order in which the nonterminals are replaced in LMD.
This is true for any derivation in any CFG
Theorem 27 Any word that can be generated by a given CFG by some derivation
also has a LMD.
Other References:
http://web.njit.edu/~marvin/cis341/chap13.pdf

16

Chapter 14 Pushdown Automata


Introduction
Previously, we saw connection between
Regular languages and
Finite automata
We saw that certain languages generated by CFGs could not be accepted by FAs.
Pushdown Automata
Now we will introduce new kind of machine: pushdown automaton (PDA).
Will see connection between
context-free languages and
pushdown automata
Pushdown automata and FAs share some features, but a PDA can have one extra key
feature: STACK. (infinitely long INPUT TAPE on which input is written.)
INPUT TAPE is divided into cells, and each cell holds one input letter or a blank .

17

Once blank is encountered on INPUT TAPE, all of the following cells also
contain .
Read TAPE one cell at a time, from left to right. Cannot go back.
START, ACCEPT, and REJECT states.
Once enter either ACCEPT or REJECT state, cannot ever leave.
READ state to read input letter from INPUT TAPE.
Also, have an infinitely tall PUSHDOWN STACK, which has Last-In-First-Out
(LIFO) discipline.
Always start with STACK empty.
STACK can hold letters of STACK alphabet (which can be same as input
alphabet) and blanks .

PUSH and POP states alter contents of STACK.


PUSH adds something to the top of the STACK.
POP takes off the thing on the top of the STACK.

A
B
B

.
.

Example: Convert FA to PDA

a
b

ba
18

a
+

Figure 2:

An FA which is converted to a PDA in Figure 3

Start

Rea
d

Rea
d

Rea
d

Accept

b
Reject

Figure 3: A

Reject

PDA

19

A pushdown automaton for { anbn : where n >= 0 } is given below in Figure 4. Please
note that X is used a stack symbol; that is, when an a is read from input X is
pushed into the stack and when a b is read from the input X is poped out of the
stack. On the other hand, a could also have been used as a stack symbol.

Figure 4.

A Pushdown Automaton

20

Determinism and Nondeterminism


Definition: A PDA is deterministic if each input string can only be processed by the
machine in one way.
Definition: A PDA is nondeterministic if there is some string that can be processed by the
machine in more than one way.
A nondeterministic PDA
may have more than one edge with the same label leading out of a certain READ
state or POP state.
may have more than one arc leaving the START state.
Both deterministic and nondeterministic PDAs
may have no edge with a certain label leading out of a certain READ state or
POP state.
if we are in a READ or POP state and encounter a letter for which there is no outedge from this state, the PDA crashes.
Remarks:
For FAs, nondeterminism does not increase power of machines.
For PDAs, nondeterminism does increase power of machines.
Example: Language PALINDROMEX, which consists of all words of the form
sXreverse(s) where s is any string generated by (a + b)_.
PALINDROMEX = {X, aXa, bXb, aaXaa, abXba, baXab, bbXbb, . . .}
Each word in PALINDROMEX has odd length and X in middle.
When processing word on PDA, first read letters from TAPE and PUSH letters
onto STACK until read in X.
Then POP letters off STACK, and check if they are the same as rest of input
string on TAPE.

PDA:
Input alphabet E = {a, b,X}
Stack alphabet L = {a, b}

21

Start

Push a

a
Read
1

Accep
t

PO
P1

Push b

Read
2

PO
P2

PO
P3

Formal Definition of PDA


Definition: A pushdown automaton (PDA) is a collection of eight things:
An alphabet of input letters.
An input TAPE (infinite in one direction), which initially contains the input string to
be processed followed by an infinite number of blanks
An alphabet E of STACK characters.
A pushdown STACK (infinite in one direction), which initially contains all blanks
.
One START state that has only out-edges, no in-edges. Can have more than one
arc leaving the START state. There are no labels on arcs leaving the START
state.
Halt states of two kinds:
1. zero or more ACCEPT states
2. zero or more REJECT states

Each of the Halt states have in-edges but no out-edges.


Finitely many nonbranching PUSH states that introduce characters from E onto
the top of the STACK.
Finitely many branching states of two kinds:
1. READ states, which read the next unused letter from TAPE and may have
out-edges labeled with letters from or a blank . (There is no restriction

22

on duplication of labels and no requirement that there be a label for each


letter of , or .)
2. POP states, which read the top character of STACK and may have outedges labeled with letters of E and the blank character , with no
restrictions.
Remarks:
The definition for PDA allows for nondeterminism. If we want to consider a PDA that
does not have nondeterminism, then we will call it a deterministic PDA.

Other References:
http://www.cs.odu.edu/~toida/nerzic/390teched/cfl/cfg.html

23

You might also like