
Natural Language Processing
Lecture 7: Parsing with Context Free Grammars II.
CKY for PCFGs. Earley Parser.

11/13/2020

COMS W4705
Yassine Benajiba
Recall: Syntactic Ambiguity
S → NP VP NP → she
VP → V NP NP → glasses
VP → VP PP D → the
PP → P NP N → cat
NP → D N N → glasses
NP → NP PP V → saw
P → with

Two parse trees for “she saw the cat with glasses”:
(1) PP attached to the VP: (S (NP she) (VP (VP (V saw) (NP (D the) (N cat))) (PP (P with) (NP glasses))))
(2) PP attached to the NP: (S (NP she) (VP (V saw) (NP (NP (D the) (N cat)) (PP (P with) (NP glasses)))))

Which parse tree is “better”? More probable?


Probabilities for Parse Trees
• Let T_G be the set of all parse trees generated by grammar G.

• We want a model that assigns a probability P(t) to each parse tree t ∈ T_G, such that Σ_{t ∈ T_G} P(t) = 1.

• We can use this model to select the most probable parse


tree compatible with an input sentence.

• This is another example of a generative model!


Selecting Parse Trees
• Let T_G(s) be the set of trees generated by grammar G whose yield (sequence of leaves) is the string s.

• The most likely parse tree produced by G for string s is t* = argmax_{t ∈ T_G(s)} P(t).

• How do we define P(t)?

• How do we learn such a model from training data (annotated or un-annotated)?

• How do we find the highest-probability tree for a given sentence? (parsing/decoding)
Probabilistic Context Free
Grammars (PCFG)
• A PCFG consists of a Context Free Grammar
G=(N, Σ, R, S) and a probability P(A → β) for each
production A → β ∈ R.

• The probabilities for all rules with the same left-hand-side sum up to 1: for each A ∈ N, Σ_{A → β ∈ R} P(A → β) = 1.

• Think of P(A → β) as the conditional probability of the rule A → β, given the left-hand-side nonterminal A.
PCFG Example
S → NP VP [1.0] NP → she [0.05]
VP → V NP [0.6] NP → glasses [0.05]
VP → VP PP [0.4] D → the [1.0]
PP → P NP [1.0] N → cat [0.3]
NP → D N [0.7] N → glasses [0.7]
NP → NP PP [0.2] V → saw [1.0]
P → with [1.0]
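
A quick sanity check (not part of the original slides): a few lines of Python can verify that the rule probabilities for each left-hand-side nonterminal of this example grammar sum to 1. The rules dictionary simply transcribes the grammar above.

from collections import defaultdict

# The example PCFG above, as (lhs, rhs) -> probability.
rules = {
    ("S",  ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")):  0.6,
    ("VP", ("VP", "PP")): 0.4,
    ("PP", ("P", "NP")):  1.0,
    ("NP", ("D", "N")):   0.7,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("she",)):     0.05,
    ("NP", ("glasses",)): 0.05,
    ("D",  ("the",)):     1.0,
    ("N",  ("cat",)):     0.3,
    ("N",  ("glasses",)): 0.7,
    ("V",  ("saw",)):     1.0,
    ("P",  ("with",)):    1.0,
}

totals = defaultdict(float)
for (lhs, rhs), p in rules.items():
    totals[lhs] += p
for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, f"rules for {lhs} do not sum to 1"
print(dict(totals))  # every nonterminal maps to 1.0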
Parse Tree Probability
• Given a parse tree t containing rule applications r_1, …, r_n, the probability of t is P(t) = P(r_1) × … × P(r_n).

Rules used by the tree in which the PP attaches to the VP:
S → NP VP      1.0
NP → she       .05
VP → VP PP     .4
VP → V NP      .6
V → saw        1.0
NP → D N       .7
D → the        1.0
N → cat        .3
PP → P NP      1.0
P → with       1.0
NP → glasses   .05

1 x .05 x .4 x .6 x 1 x 0.7 x 1 x 0.3 x 1 x 1 x .05 = .000126


Parse Tree Probability
• Given a parse tree t containing rule applications r_1, …, r_n, the probability of t is P(t) = P(r_1) × … × P(r_n).

Rules used by the tree in which the PP attaches to the NP:
S → NP VP      1.0
NP → she       .05
VP → V NP      .6
V → saw        1.0
NP → NP PP     .2
NP → D N       .7
D → the        1.0
N → cat        .3
PP → P NP      1.0
P → with       1.0
NP → glasses   .05

1 x .05 x .6 x 1 x .2 x .7 x 1 x .3 x 1 x 1 x .05 = 0.000063 < 0.000126
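
To make the comparison concrete, here is a small Python sketch (my own illustration, not from the slides) that multiplies out the rule probabilities used by each of the two trees:

from math import prod

# Rule probabilities read off the two trees above.
vp_attachment = [1.0, 0.05, 0.4, 0.6, 1.0, 0.7, 1.0, 0.3, 1.0, 1.0, 0.05]  # PP attaches to the VP
np_attachment = [1.0, 0.05, 0.6, 1.0, 0.2, 0.7, 1.0, 0.3, 1.0, 1.0, 0.05]  # PP attaches to the NP

print(prod(vp_attachment))  # ≈ 0.000126
print(prod(np_attachment))  # ≈ 0.000063, so the VP-attachment parse is more probable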


Estimating PCFG
probabilities
• Supervised training: We can estimate PCFG probabilities from a treebank, a corpus manually annotated with constituency structure, using maximum likelihood estimates:

      P_MLE(A → β) = count(A → β) / count(A)
• Unsupervised training:

• What if we have a grammar and a corpus, but no annotated


parses?

• Can use the inside-outside algorithm for parsing and do EM


estimation of the probabilities (not discussed in this course)
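
A minimal sketch of the supervised (maximum likelihood) estimate: count each rule in a treebank and divide by the count of its left-hand side. The two toy trees below are hypothetical placeholders, not real treebank data.

from collections import Counter

def rules_of(tree):
    # Yield the (lhs, rhs) productions of a tree given as nested tuples,
    # where a leaf is a plain string and an internal node is (label, child, ...).
    if isinstance(tree, str):
        return
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (lhs, rhs)
    for child in children:
        yield from rules_of(child)

# A toy two-tree "treebank" (hypothetical data, for illustration only).
treebank = [
    ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", ("D", "the"), ("N", "cat")))),
    ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "glasses"))),
]

rule_counts = Counter(rule for tree in treebank for rule in rules_of(tree))
lhs_counts = Counter()
for (lhs, _rhs), count in rule_counts.items():
    lhs_counts[lhs] += count

# Maximum likelihood estimate: P(A -> beta) = count(A -> beta) / count(A)
probabilities = {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}
for rule, p in sorted(probabilities.items()):
    print(rule, round(p, 3))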
The Penn Treebank
• Syntactically annotated corpus of newspaper text (1989
Wall Street Journal Articles).
• The source text is naturally occurring but the treebank is
not:
• Assumes a specific linguistic theory (although a simple
one).
• Very flat structure (NPs, Ss, VPs).
PTB Example
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken))
(, ,)
(ADJP (NML (CD 61) (NNS years))
(JJ old))
(, ,))
(VP (MD will)
(VP (VB join)
(NP (DT the) (NN board))
(PP-CLR (IN as)
(NP (DT a) (JJ nonexecutive) (NN director)))
(NP-TMP (NNP Nov.) (CD 29))))
(. .)))
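
Bracketed trees in this format can be loaded with standard tools; for example, a sketch using NLTK's Tree.fromstring (assuming NLTK is installed; the outer empty bracketing of the raw PTB file is dropped here):

from nltk import Tree

ptb = """(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,)
            (ADJP (NML (CD 61) (NNS years)) (JJ old)) (, ,))
           (VP (MD will)
               (VP (VB join) (NP (DT the) (NN board))
                   (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
                   (NP-TMP (NNP Nov.) (CD 29))))
           (. .))"""

tree = Tree.fromstring(ptb)
print(tree.leaves())        # the yield of the tree
print(tree.productions())   # the CFG rules used in this tree
tree.pretty_print()         # ASCII rendering of the tree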
Parsing with PCFG
• We want to use a PCFG to answer the following questions:

• What is the total probability of the sentence under the


PCFG?

• What is the most probable parse tree for a sentence


under the PCFG? (decoding/parsing)

• We can modify the CKY algorithm.


Basic idea: Compute these probabilities bottom-up using
dynamic programming.
Computing Probabilities
Bottom-Up

Each node's probability is the rule probability times the probabilities of its children
(shown here for the parse in which the PP attaches to the NP):

S → NP VP:    .05 x .00126 x 1 = .000063
VP → V NP:    1 x .0021 x .6 = .00126
NP → NP PP:   .21 x .05 x .2 = .0021
NP → D N:     1 x .3 x .7 = .21
PP → P NP:    1 x 1 x .05 = .05
Lexical rules: NP → she (.05), V → saw (1.0), D → the (1.0), N → cat (.3), P → with (1.0), NP → glasses (.05)
CKY for PCFG Parsing
• Let T_G(A, s) be the set of trees generated by grammar G starting at nonterminal A, whose yield is the string s.

• Use a chart π so that π[i,j,A] contains the probability of the highest-probability parse tree for the substring s[i,j] rooted in nonterminal A.

• We want to find π[0,length(s),S] -- the probability of the highest-scoring parse tree for s rooted in the start symbol S.
CKY for PCFG Parsing
• To compute π[0,length(s),S] we can use the following recursive definition:

      π[i,j,A] = max over rules A → B C ∈ R and split points i < k < j of P(A → B C) x π[i,k,B] x π[k,j,C]

Base case: π[i,i+1,A] = P(A → s[i,i+1]) if A → s[i,i+1] ∈ R, and 0 otherwise.

• Then fill the chart using dynamic programming.


CKY for PCFG Parsing
• Input: PCFG G=(N, Σ, R, S), input string s of length n.

• for i=0…n-1:                                      initialization
      for A ∈ N:
          π[i,i+1,A] = P(A → s[i,i+1]) if A → s[i,i+1] ∈ R else 0

• for length=2…n:                                   main loop
      for i=0…(n-length):
          j = i+length
          for k=i+1…j-1:
              for A ∈ N:
                  π[i,j,A] = max(π[i,j,A], max over A → B C ∈ R of P(A → B C) x π[i,k,B] x π[k,j,C])

Use backpointers to retrieve the highest-scoring parse tree (see previous lecture).
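
A compact Python sketch of the algorithm above (my own rendering of the pseudocode; the dictionary-based grammar encoding and the function name are assumptions, and the grammar must be in CNF):

from collections import defaultdict

def pcfg_cky(words, lexical, binary, start="S"):
    # pi[(i, j, A)] = probability of the best A-rooted tree over words[i:j].
    n = len(words)
    pi = defaultdict(float)
    back = {}

    # Initialization: spans of length 1 are covered by lexical rules A -> word.
    for i, w in enumerate(words):
        for A, p in lexical.get(w, []):
            pi[(i, i + 1, A)] = p

    # Main loop: build longer spans from shorter ones with binary rules A -> B C.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    score = p * pi[(i, k, B)] * pi[(k, j, C)]
                    if score > pi[(i, j, A)]:
                        pi[(i, j, A)] = score
                        back[(i, j, A)] = (k, B, C)

    return pi[(0, n, start)], back

# Grammar of the running example (binary rules and lexical rules).
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.6, ("VP", "VP", "PP"): 0.4,
          ("PP", "P", "NP"): 1.0, ("NP", "D", "N"): 0.7, ("NP", "NP", "PP"): 0.2}
lexical = {"she": [("NP", 0.05)], "glasses": [("NP", 0.05), ("N", 0.7)],
           "the": [("D", 1.0)], "cat": [("N", 0.3)], "saw": [("V", 1.0)],
           "with": [("P", 1.0)]}

prob, back = pcfg_cky("she saw the cat with glasses".split(), lexical, binary)
print(prob)  # ≈ 0.000126, the PP-attached-to-VP parse

The back table holds the backpointers, so the highest-scoring tree can be reconstructed exactly as described on the slide.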
Probability of a Sentence

• What if we are interested in the probability of a sentence, not of a single parse tree (for example, because we want to use the PCFG as a language model)?

• Problem: Spurious ambiguity. Need to sum the


probabilities of all parse trees for the sentence.

• How do we have to change CKY to compute this?
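
One standard answer, sketched below (this sum-based variant is the inside algorithm; it is my illustration, not code from the course): keep the same chart, but replace the max over rules and split points with a sum, so each chart entry accumulates the probability mass of all subtrees.

from collections import defaultdict

def sentence_probability(words, lexical, binary, start="S"):
    # Inside algorithm: like the CKY sketch above, but summing over all parses
    # instead of maximizing.
    n = len(words)
    inside = defaultdict(float)
    for i, w in enumerate(words):
        for A, p in lexical.get(w, []):
            inside[(i, i + 1, A)] += p
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    inside[(i, j, A)] += p * inside[(i, k, B)] * inside[(k, j, C)]
    return inside[(0, n, start)]

# With the example grammar from the CKY sketch above, the sentence
# "she saw the cat with glasses" gets probability ≈ 0.000189 = 0.000126 + 0.000063,
# the sum over its two parse trees.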


Earley Parser
• CKY parser starts with words and builds parse trees bottom-
up; requires the grammar to be in CNF.

• The Earley parser instead starts at the start symbol and tries
to “guess” derivations top-down.

• It discards derivations that are incompatible with the


sentence.

• The Earley parser sweeps through the sentence left-to-right only once. It keeps partial derivations in a table (“chart”).

• Allows arbitrary CFGs, no limitation to CNF.


Parser States
• Earley parser keeps track of partial derivations using parser
states / items.

• States represent hypotheses about constituent structure based on the grammar, taking into account the input.

• Parser states are represented as dotted rules with spans.


• The constituents to the left of the · have already been seen in the input string s (within the indicated span).

S → · NP VP [0,0]    “According to the grammar, there may be an NP starting in position 0.”

NP → D A · N [0,2] "There is a determiner followed by an adjective in s[0,2]“

NP → NP PP · [3,8] "There is a complete NP in s[3,8], consisting of an NP and PP”


Earley Parser (sketch)
Grammar: S → NP VP, VP → V NP, VP → VP PP, PP → P NP, NP → D N, NP → NP PP,
         V → saw, P → with, D → the, N → cat, N → tail, N → student

Chart states so far:
S → · NP VP [0,0]
NP → · NP PP [0,0]
NP → · D N [0,0]
D → · the [0,0]
Three parser operations:
1. Predict new subtrees top-down.

the student saw the cat with the tail


0 1 2 3 4 5 6 7
Earley Parser (sketch)
Grammar: S → NP VP, VP → V NP, VP → VP PP, PP → P NP, NP → D N, NP → NP PP,
         V → saw, P → with, D → the, N → cat, N → tail, N → student

Chart states so far:
S → · NP VP [0,0]
NP → · NP PP [0,0]
NP → · D N [0,0]
D → the · [0,1]
Three parser operations:
1. Predict new subtrees top-down.

2. Scan input terminals.

the student saw the cat with the tail


0 1 2 3 4 5 6 7
Earley Parser (sketch)
Grammar: S → NP VP, VP → V NP, VP → VP PP, PP → P NP, NP → D N, NP → NP PP,
         V → saw, P → with, D → the, N → cat, N → tail, N → student

Chart states so far:
S → · NP VP [0,0]
NP → · NP PP [0,0]
NP → · D N [0,0]
D → the · [0,1]   (passive state)
Three parser operations:
1. Predict new subtrees top-down.

2. Scan input terminals.

the student saw the cat with the tail


0 1 2 3 4 5 6 7
Earley Parser (sketch)
Grammar: S → NP VP, VP → V NP, VP → VP PP, PP → P NP, NP → D N, NP → NP PP,
         V → saw, P → with, D → the, N → cat, N → tail, N → student

Chart states so far:
S → · NP VP [0,0]
NP → · NP PP [0,0]
NP → D · N [0,1]
D → the · [0,1]   (passive state)
Three parser operations:
1. Predict new subtrees top-down.

2. Scan input terminals.

3. Complete with passive states.

the student saw the cat with the tail


0 1 2 3 4 5 6 7
Earley Parser (sketch)
Grammar: S → NP VP, VP → V NP, VP → VP PP, PP → P NP, NP → D N, NP → NP PP,
         V → saw, P → with, D → the, N → cat, N → tail, N → student

Chart states so far:
S → · NP VP [0,0]
NP → · NP PP [0,0]
NP → D · N [0,1]
D → the · [0,1]
N → · cat [1,1]
N → · tail [1,1]
N → · student [1,1]

Three parser operations:
1. Predict new subtrees top-down.

2. Scan input terminals.

3. Complete with passive states.

the student saw the cat with the tail


0 1 2 3 4 5 6 7
Earley Parser (sketch)
Grammar: S → NP VP, VP → V NP, VP → VP PP, PP → P NP, NP → D N, NP → NP PP,
         V → saw, P → with, D → the, N → cat, N → tail, N → student

Chart states so far:
S → · NP VP [0,0]
NP → · NP PP [0,0]
NP → D · N [0,1]
D → the · [0,1]
N → · cat [1,1]
N → · tail [1,1]
N → student · [1,2]

Three parser operations:
1. Predict new subtrees top-down.

2. Scan input terminals.

3. Complete with passive states.

the student saw the cat with the tail


0 1 2 3 4 5 6 7
Earley Parser (sketch)
Grammar: S → NP VP, VP → V NP, VP → VP PP, PP → P NP, NP → D N, NP → NP PP,
         V → saw, P → with, D → the, N → cat, N → tail, N → student

Chart states so far:
S → · NP VP [0,0]
NP → · NP PP [0,0]
NP → D N · [0,2]
D → the · [0,1]
N → · cat [1,1]
N → · tail [1,1]
N → student · [1,2]

Three parser operations:
1. Predict new subtrees top-down.

2. Scan input terminals.

3. Complete with passive states.

the student saw the cat with the tail


0 1 2 3 4 5 6 7
Earley Parser (sketch)
Grammar: S → NP VP, VP → V NP, VP → VP PP, PP → P NP, NP → D N, NP → NP PP,
         V → saw, P → with, D → the, N → cat, N → tail, N → student

Chart states so far:
S → NP · VP [0,2]
NP → NP · PP [0,2]
NP → D N · [0,2]
D → the · [0,1]
N → · cat [1,1]
N → · tail [1,1]
N → student · [1,2]

Three parser operations:
1. Predict new subtrees top-down.

2. Scan input terminals.

3. Complete with passive states.

the student saw the cat with the tail


0 1 2 3 4 5 6 7
Earley Algorithm
• Keep track of parser states in a table (“chart”). Chart[k]
contains a set of all parser states that end in position k.
• Input: Grammar G=(N, Σ, R, S), input string s of length n.

• Initialization: For each production S → α ∈ R, add a state S → · α [0,0] to Chart[0].

• for i = 0 to n:
• for each state in Chart[i]:
• if state is of form A → α · s[i] β [k,i]:
      scan(state)
• elif state is of form A → α · B β [k,i]:
      predict(state)
• elif state is of form A → α · [k,i]:
      complete(state)
Earley Algorithm
• Keep track of parser states in a table (“chart”). Chart[k]
contains a set of all parser states that end in position k.
• Input: Grammar G=(N, Σ, R, S), input string s of length n.

• Initialization: For each production S → α ∈ R, add a state S → · α [0,0] to Chart[0].

• for i = 0 to n:
• for each state in Chart[i]:
• if state is of form A → α · s[i] β [k,i]:
      scan(state)
• elif state is of form A → α · B β [k,i]:
      predict(state)
• elif state is of form A → α · [k,i]:
      complete(state)
(Otherwise the state is of the form A → α · x β [k,i], where x is a terminal other than s[i]; in that case we don’t want to do anything.)
Earley Algorithm - Scan
• The scan operation can only be applied to a state if the dot is
in front of a terminal symbol that matches the next input
terminal.

• function scan(state): // state is of form A → α · s[i] β [k,i]

  • Add a new state A → α s[i] · β [k,i+1] to Chart[i+1]
Earley Algorithm - Predict
• The predict operation can only be applied to a state if the dot is
in front of a non-terminal symbol.

• function predict(state): // state is of form A → α · B β [k,i]

  • For each production B → γ ∈ R, add a new state B → · γ [i,i] to Chart[i]

• Note that this modifies Chart[i] while the algorithm is looping


through it.

• No duplicate states are added (Chart[i] is a set)


Earley Algorithm - Complete
• The complete operation may only be applied to a passive item.

• function complete(state): // state is of form A → α · [k,j]

  • For each state B → β · A γ [i,k] in Chart[k], add a new state B → β A · γ [i,j] to Chart[j]

• Note that this modifies Chart[i] while the algorithm is looping


through it.

• Note that it is important to make a copy of the old state


before moving the dot.
• This operation is similar to the combination operation in CKY!
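
Putting the three operations together, here is a compact recognizer sketch (my own illustration of the algorithm on these slides, without backpointers; the grammar encoding and the State tuple are assumptions, not the course's reference implementation):

from collections import namedtuple

State = namedtuple("State", ["lhs", "rhs", "dot", "start", "end"])

def next_symbol(state):
    # The symbol immediately after the dot, or None if the state is passive.
    return state.rhs[state.dot] if state.dot < len(state.rhs) else None

def earley_recognize(words, grammar, start="S"):
    # grammar: dict mapping each nonterminal to a list of right-hand-side tuples.
    # Assumes no epsilon (empty) rules.
    n = len(words)
    nonterminals = set(grammar)
    chart = [set() for _ in range(n + 1)]

    # Initialization: predict every start-symbol rule at position 0.
    for rhs in grammar[start]:
        chart[0].add(State(start, rhs, 0, 0, 0))

    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            state = agenda.pop()
            sym = next_symbol(state)
            if sym is None:
                # Complete: advance every state that was waiting for state.lhs.
                for waiting in list(chart[state.start]):
                    if next_symbol(waiting) == state.lhs:
                        new = waiting._replace(dot=waiting.dot + 1, end=i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
            elif sym in nonterminals:
                # Predict: add fresh dotted rules for the nonterminal after the dot.
                for rhs in grammar[sym]:
                    new = State(sym, rhs, 0, i, i)
                    if new not in chart[i]:
                        chart[i].add(new)
                        agenda.append(new)
            elif i < n and sym == words[i]:
                # Scan: the terminal after the dot matches the next input word.
                chart[i + 1].add(state._replace(dot=state.dot + 1, end=i + 1))

    # Accept if some start-symbol state covers the whole input and is passive.
    return any(s.lhs == start and s.start == 0 and s.end == n and next_symbol(s) is None
               for s in chart[n])

# Grammar and sentence from the sketch slides (terminals are lowercase words).
grammar = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "PP": [("P", "NP")],
    "NP": [("D", "N"), ("NP", "PP")],
    "D": [("the",)], "V": [("saw",)], "P": [("with",)],
    "N": [("cat",), ("tail",), ("student",)],
}
print(earley_recognize("the student saw the cat with the tail".split(), grammar))  # True

Adding backpointers to each new state (recording which states it was built from) turns the recognizer into a parser, as discussed on the following slides.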
Earley Algorithm - Runtime
• The runtime depends on the number of items in the chart
(each item is “visited” exactly once).

• We proceed through the input exactly once, which takes


O(N).

• For each chart position, each dotted rule can appear with O(N) possible start positions.

• Each complete operation can produce O(N) possible new


items (with different starting points).

• Total: O(N3)
Earley Algorithm -
Some Observations
• How do we recover parse trees?

• What happens in case of ambiguity?

• Multiple ways to Complete the same state.

• Keep back-pointers in the parser state objects.

• Or use a separate data structure (CKY-style table or


hashed states)

• How do we make the algorithm work with PCFG?

• Easy to compute probabilities on Complete. Follow back pointer with


max probability.
