3
Parsing Structure in Text
Syllabus
Shallow vs Deep parsing, Approaches in parsing, Types
of parsing, Regex parser, Dependency parser, chunking,
Information extraction, Relation Extraction, Building
first NLP Application, Machine translation application
What is parsing?
• Parsing in NLP is the process of determining the syntactic structure of a text by
analyzing its constituent words based on an underlying grammar (of the language).
as noun_phrase, verb_phrase etc. have children, hence they are called non-terminals;
finally, the leaves of the tree are called terminals.
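The terminal / non-terminal vocabulary can be illustrated with NLTK's Tree class. This is a minimal sketch: the bracketed parse below is written by hand for illustration (it is not produced by a parser), and the POS labels DT, NN and VBD are assumed Penn Treebank-style tags.

```python
from nltk import Tree

# Hand-written parse of "The cat chased the mouse" (assumed structure)
tree = Tree.fromstring(
    "(S (NP (DT The) (NN cat)) (VP (VBD chased) (NP (DT the) (NN mouse))))"
)

# Every node that has children is a non-terminal
non_terminals = [subtree.label() for subtree in tree.subtrees()]

# The leaves of the tree are the terminals (the words themselves)
terminals = tree.leaves()

print(non_terminals)
print(terminals)
```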
Set of grammar
rules(productions)
Example
• Noun phrase (NP): Noun phrases act as a subject or object to a verb. E.g.: "The cat chased the mouse."
• Verb phrase (VP): These phrases are lexical units that have a verb acting as the
head word. Usually, there are two forms of verb phrases. One form has the verb
• Adjective phrase (ADJP): These are phrases with an adjective as the head word.
Their main role is to describe or qualify nouns and pronouns in a sentence, and they
will be placed either before or after the noun or pronoun. E.g.: "The sky was beautifully blue."
• Adverb phrase (ADVP): These phrases act like adverbs since the adverb acts as
the head word in the phrase. Adverb phrases are used as modifiers for nouns, verbs,
ran quickly."
head word and other lexical components like nouns, pronouns, and so on. These act
• VP (Verb Phrase):
– "She is cooking dinner."
– "They have been playing soccer."
• PP (Prepositional Phrase):
– "The cat is on the roof."
– "He went to the store."
• ADJP (Adjective Phrase):
– "The movie was very exciting."
– "The cake tastes really delicious."
• S (Sentence):
– "I went to the store."
– "She loves to read."
• NP (Noun Phrase):
– "The big dog barked loudly."
– "My friend is coming over tomorrow."
Deep Parsing
In deep parsing, the search strategy will give a complete syntactic structure to a
sentence.
In some cases we need to go further and use semantic parsing to understand the
meaning of the sentence.
• In deep or full parsing, typically, grammar concepts such as CFG, and probabilistic
S ⇒ aSa
S ⇒ abSba
S ⇒ abbSbba
S ⇒ abbcbba
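The derivation above is consistent with the grammar S → aSa | bSb | c. That grammar is an assumption, since the rules themselves are not shown here. Under that assumption, a short sketch that rebuilds the same sequence of sentential forms for a given string could look like this (`derive` is a hypothetical helper name):

```python
def derive(target):
    """Return the derivation of `target` under the assumed grammar
    S -> aSa | bSb | c, as a list of sentential forms."""
    steps = ["S"]
    left, right, middle = "", "", target
    while middle != "c":
        # Each step must peel off a matching 'a...a' or 'b...b' pair
        if len(middle) < 3 or middle[0] != middle[-1] or middle[0] not in "ab":
            raise ValueError(f"{target!r} is not generated by the grammar")
        left += middle[0]
        right = middle[0] + right
        middle = middle[1:-1]
        steps.append(left + "S" + right)
    steps.append(left + "c" + right)  # finally apply S -> c
    return steps

print(" ⇒ ".join(derive("abbcbba")))
```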
Input:
a*b+c
CFG and Parse tree in NLP
Top Down and Bottom Up Parsers
Top-down parsers
A top-down parser starts from the starting rule and rewrites it
step by step into symbols that match words in the input
sentence.
Bottom-up parsers
Build the parse tree from leaves to root. Bottom-up
parsing can be defined as an attempt to reduce
the input string w to the start symbol of the grammar.
E.g.
Grammar
S -> NP VP
NP -> ART N
NP -> ART ADJ N
VP -> V
VP -> V NP
V -> cried
N -> dogs | man
ART -> the
ADJ -> old
Input 1: The dogs cried
Input 2: The old man cried
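As a rough sketch, the example grammar and both inputs can be run through one top-down and one bottom-up parser from NLTK. The choice of RecursiveDescentParser (top-down) and BottomUpChartParser (bottom-up) is an assumption made for illustration; the inputs are lower-cased so that "The" matches the terminal 'the'.

```python
import nltk
from nltk.parse.chart import BottomUpChartParser

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> ART N | ART ADJ N
VP -> V | V NP
V -> 'cried'
N -> 'dogs' | 'man'
ART -> 'the'
ADJ -> 'old'
""")

top_down = nltk.RecursiveDescentParser(grammar)   # expands rules from S downwards
bottom_up = BottomUpChartParser(grammar)          # combines words upwards towards S

results = {}
for sentence in ["The dogs cried", "The old man cried"]:
    tokens = sentence.lower().split()
    td_trees = list(top_down.parse(tokens))
    bu_trees = list(bottom_up.parse(tokens))
    results[sentence] = (td_trees, bu_trees)
    print(td_trees[0])
```

Both strategies should arrive at the same tree for these inputs; they differ only in the order in which the tree is built.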
Let's write our first grammar with very limited vocabulary and very generic rules:
# toy CFG
from nltk import CFG
from nltk.parse.generate import generate

# S   is the entire sentence
# VP  is a verb phrase
# V   is a verb
# NP  is a noun phrase (a chunk that has a noun in it)
# Det is a determiner used in the sentences
# N   is a set of example nouns
toy_grammar = CFG.fromstring("""
S -> NP VP
VP -> V NP
V -> "eats" | "drinks"
NP -> Det N
Det -> "a" | "an" | "the"
N -> "president" | "Obama" | "apple" | "coke"
""")

# Print the first two sentences the grammar can generate
for sentence in generate(toy_grammar, n=2):
    print(" ".join(sentence))
This grammar can generate only a finite number of sentences. Leaving out the determiners, they include:
• President eats apple
• Obama drinks coke
On the other hand, the same grammar can construct meaningless sentences such as:
• Apple eats coke
• President drinks Obama
A syntactic parser checks only the structure, so a syntactically well-formed
sentence can still be meaningless.
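Both points can be checked with a short sketch that redefines the same toy grammar, counts everything it generates, and parses one of the meaningless sentences. Using nltk.ChartParser here is an assumption; any CFG parser would do.

```python
import nltk
from nltk.parse.generate import generate

toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
V -> "eats" | "drinks"
NP -> Det N
Det -> "a" | "an" | "the"
N -> "president" | "Obama" | "apple" | "coke"
""")

# The grammar has no recursion, so the set of sentences is finite
all_sentences = list(generate(toy_grammar))
print(len(all_sentences))

# A meaningless sentence is still accepted, because it is well-formed
parser = nltk.ChartParser(toy_grammar)
tokens = "the apple eats a coke".split()
trees = list(parser.parse(tokens))
print(trees[0])
```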
Different types of parsers
Before removing left recursion
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
After removing left recursion
E -> T E'
E' -> + T E' | ε
T -> F T'
T' -> * F T' | ε
F -> ( E ) | id
Example: Write down the algorithm using recursive
procedures to implement the following grammar.
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → ∗FT′ | ε
F → (E) | id
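One possible way to write the recursive procedures is sketched below, with one method per non-terminal. It assumes that E′ also has an ε alternative (as in the grammar obtained after removing left recursion) and that the input is already split into the tokens id, +, *, ( and ). The names `Parser` and `accepts` are illustrative.

```python
class Parser:
    """Recursive descent recognizer for the expression grammar."""

    def __init__(self, tokens):
        self.tokens = tokens + ["$"]   # "$" marks the end of the input
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.peek()!r}")
        self.pos += 1

    def E(self):                       # E -> T E'
        self.T()
        self.E_prime()

    def E_prime(self):                 # E' -> + T E' | ε
        if self.peek() == "+":
            self.match("+")
            self.T()
            self.E_prime()
        # otherwise take the ε alternative and consume nothing

    def T(self):                       # T -> F T'
        self.F()
        self.T_prime()

    def T_prime(self):                 # T' -> * F T' | ε
        if self.peek() == "*":
            self.match("*")
            self.F()
            self.T_prime()

    def F(self):                       # F -> ( E ) | id
        if self.peek() == "(":
            self.match("(")
            self.E()
            self.match(")")
        else:
            self.match("id")


def accepts(text):
    """Return True if the space-separated token string is derivable from E."""
    parser = Parser(text.split())
    try:
        parser.E()
        parser.match("$")              # the whole input must be consumed
        return True
    except SyntaxError:
        return False


print(accepts("id + id * id"))
print(accepts("id + * id"))
```

Because left recursion has been removed, each procedure consumes at least one token before calling itself again, so the recursion always terminates.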
To understand this, take the following example CFG:
S -> aAb | aBb
A -> cx | dx
B -> xe
String: adxb
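Assuming the example is meant to show a top-down parser that tries the alternatives of a rule in order and backtracks when one fails, the behaviour can be sketched in plain Python. The `GRAMMAR` table, the helper names and the trace format are all illustrative.

```python
GRAMMAR = {
    "S": [["a", "A", "b"], ["a", "B", "b"]],
    "A": [["c", "x"], ["d", "x"]],
    "B": [["x", "e"]],
}

def parse(symbol, text, pos, trace):
    """Yield every end position reachable by deriving `symbol` from text[pos:]."""
    if symbol not in GRAMMAR:                      # terminal: must match the input
        if text.startswith(symbol, pos):
            yield pos + len(symbol)
        return
    for rhs in GRAMMAR[symbol]:                    # try each alternative in order
        trace.append(f"try {symbol} -> {' '.join(rhs)} at {pos}")
        yield from parse_sequence(rhs, text, pos, trace)

def parse_sequence(symbols, text, pos, trace):
    """Yield end positions after deriving all `symbols` in sequence."""
    if not symbols:
        yield pos
        return
    for mid in parse(symbols[0], text, pos, trace):
        yield from parse_sequence(symbols[1:], text, mid, trace)

def accepts(text):
    trace = []
    ok = any(end == len(text) for end in parse("S", text, 0, trace))
    return ok, trace

ok, trace = accepts("adxb")
print(ok)
print("\n".join(trace))
```

For the string adxb, the alternative A -> c x is tried first and fails on the character d, after which the parser backtracks and succeeds with A -> d x.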
2. A shift-reduce parser
• The shift-reduce parser is a simple kind of bottom-up parser.
• As is common with all bottom-up parsers, a shift-reduce parser
tries to find a sequence of words and phrases that correspond to the
right-hand side of a grammar production and replaces them with the
left-hand side of the production, until the whole sentence is reduced.
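A minimal sketch with NLTK's ShiftReduceParser, assuming a small toy grammar and sentence chosen only for illustration:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> Det N
V -> 'eats' | 'drinks'
Det -> 'a' | 'an' | 'the'
N -> 'president' | 'apple' | 'coke'
""")

# Words are shifted onto a stack one at a time; whenever the top of the
# stack matches the right-hand side of a production, it is reduced to the
# left-hand side of that production.
parser = nltk.ShiftReduceParser(grammar)
tokens = "the president eats an apple".split()
trees = list(parser.parse(tokens))

for tree in trees:
    print(tree)
```

Passing trace=2 to ShiftReduceParser prints each shift and reduce step. Note that this parser does not backtrack, so it can miss a parse that exists if it makes the wrong choice early on.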
Example: http://www.nltk.org/howto/parse.html
4. A regex parser
A regex parser uses a regular expression defined in the form of grammar on top of a
POS-tagged string. The parser will use these regular expressions to parse the given
sentences and generate a parse tree out of this.
A working example of the regex parser is:
# Regex parser
>>> import nltk
>>> from nltk.chunk.regexp import *
>>> chunk_rules = ChunkRule("<.*>+", "chunk everything")
>>> reg_parser = RegexpParser('''
...     NP: {<DT>? <JJ>* <NN>*}    # NP
...     P: {<IN>}                  # Preposition
...     V: {<V.*>}                 # Verb
...     PP: {<P> <NP>}             # PP -> P NP
...     VP: {<V> <NP|PP>*}         # VP -> V (NP|PP)*''')
>>> test_sent = "Mr. Obama played a big role in the Health insurance bill"
>>> test_sent_pos = nltk.pos_tag(nltk.word_tokenize(test_sent))
>>> parsed_out = reg_parser.parse(test_sent_pos)
>>> print(parsed_out)
Tree('S', [('Mr.', 'NNP'), ('Obama', 'NNP'), Tree('VP', [Tree('V',
[('played', 'VBD')]), Tree('NP', [('a', 'DT'), ('big', 'JJ'), ('role',
'NN')])]), Tree('P', [('in', 'IN')]), ('Health', 'NNP'), Tree('NP',
[('insurance', 'NN'), ('bill', 'NN')])])
The following is a graphical representation of the tree for the preceding code:
A chunk can be defined as the minimal unit that can be processed. So, for
example, the sentence "the President speaks about the health care reforms"
can be broken into two chunks. One is "the President", which is noun
dominated, and hence is called a noun phrase (NP). The remaining part of
the sentence is dominated by a verb, hence it is called a verb phrase (VP).
If you look closely, there is one more sub-chunk in the part "speaks about the
health care reforms". Here, one more NP exists, so this part can be broken down
again into "speaks about" and "health care reforms", as shown in the following figure:
So, let's write some code snippets to do some basic chunking:
# Chunking
>>> import nltk
>>> from nltk.chunk.regexp import *
>>> test_sent = ("The prime minister announced he had asked the chief "
...              "government whip, Philip Ruddock, to call a special party room "
...              "meeting for 9am on Monday to consider the spill motion.")
>>> test_sent_pos = nltk.pos_tag(nltk.word_tokenize(test_sent))
>>> # VP chunks: one or more verbs, optionally followed by a pronoun
>>> rule_vp = ChunkRule(r'(<VB.*>)?(<VB.*>)+(<PRP>)?', 'Chunk VPs')
>>> parser_vp = RegexpChunkParser([rule_vp], chunk_label='VP')
>>> print(parser_vp.parse(test_sent_pos))
>>> # NP chunks: optional determiner/adverb, adjectives or numbers, then nouns
>>> rule_np = ChunkRule(r'(<DT>?<RB>?)?<JJ|CD>*(<JJ|CD><,>)*(<NN.*>)+', 'Chunk NPs')
>>> parser_np = RegexpChunkParser([rule_np], chunk_label="NP")
>>> print(parser_np.parse(test_sent_pos))
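The noun chunks of "the President speaks about the health care reforms" can also be found without a tagger model if the POS tags are supplied by hand. The tags and the single NP rule below are assumptions made for this sketch, not the output of nltk.pos_tag.

```python
from nltk import RegexpParser

# Hand-assigned Penn Treebank-style tags (an assumption for illustration)
tagged = [
    ("the", "DT"), ("President", "NNP"), ("speaks", "VBZ"), ("about", "IN"),
    ("the", "DT"), ("health", "NN"), ("care", "NN"), ("reforms", "NNS"),
]

# NP: optional determiner, any adjectives, then one or more nouns
np_chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
chunked = np_chunker.parse(tagged)

# Collect the words inside every NP chunk
noun_phrases = [
    [word for word, tag in subtree.leaves()]
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "NP")
]
print(chunked)
print(noun_phrases)
```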
Information Extraction
Two operations
Let's start with a very generic example where we are given a text file of
the content and we need to extract some of the most insightful named
entities from it:
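In NLTK the entities would typically come from nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))), which needs several downloaded models. The sketch below skips that step and shows only how entities are read off the resulting tree. The tree is a hand-built stand-in for a chunker's output, the sentence is made up, and `named_entities` is a hypothetical helper.

```python
from nltk import Tree

def named_entities(chunked):
    """Return (label, text) pairs for every labelled chunk directly below the root."""
    entities = []
    for node in chunked:
        if isinstance(node, Tree):                 # untagged words are plain tuples
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities

# Hand-built stand-in for the output of an NE chunker (assumed labels and tags)
chunked = Tree("S", [
    Tree("PERSON", [("Philip", "NNP"), ("Ruddock", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("Canberra", "NNP")]),
    ("on", "IN"),
    ("Monday", "NNP"),
])

entities = named_entities(chunked)
print(entities)
```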
Relation extraction
• Relation extraction is another commonly used information extraction
operation.
• Relation extraction, as the name suggests, is the process of extracting the
relationships between different entities.
• A variety of relationships can exist between entities.
• We have seen relationships like inheritance/synonymous/analogous.
• The definition of the relation can depend on the information need.
• For example, if we want to find out from unstructured text data
who is the writer of which book, then authorship could be a relation between
the author name and the book name.
• With NLTK, the idea is to use the same IE pipeline that we used up to NER and
extend it with a relation pattern based on the NER tags.
Process of Relation Extraction
Example
In this example we use NLTK's built-in ieer corpus, in which the
sentences are already tagged up to NER. The only things we need to specify are the
relation pattern we want and the kinds of named entities the relation
should hold between.
[ORG: 'Christian Democrats'] ', the leading political forces in' [LOC: 'Italy']
[ORG: 'AP'] ') _ Lebanese guerrillas attacked Israeli forces in southern' [LOC: 'Lebanon']
[ORG: 'Security Council'] 'adopted Resolution 425. Huge yellow banners hung across
intersections in' [LOC: 'Beirut']
[ORG: 'U.N.'] 'failures in' [LOC: 'Africa']
[ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
[ORG: 'U.N.'] 'partners on a more effective role in' [LOC: 'Africa']
[ORG: 'AP'] ') _ A bomb exploded in a mosque in central' [LOC: 'San`a']
[ORG: 'Krasnoye Sormovo'] 'shipyard in the Soviet city of' [LOC: 'Gorky']
[ORG: 'Kelab Golf Darul Ridzuan'] 'in' [LOC: 'Perak']
[ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
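Each line of this output has the form [ORG] filler [LOC], where the filler text between the two entities matches an "in" pattern. NLTK provides nltk.sem.extract_rels for running this on the ieer corpus. The sketch below is not that implementation: it is a plain-Python illustration of the same idea on a hand-made, already NER-tagged sequence, and the regular expression, the `extract_relations` helper and the sample chunks are all assumptions.

```python
import re

# Filler must contain the word "in" and not be followed by an "-ing" word
IN = re.compile(r".*\bin\b(?!\b.+ing)")

def extract_relations(chunks, subj_label, obj_label, pattern):
    """Return (subject, filler, object) triples from NER-tagged chunks.

    `chunks` is a list whose items are either plain strings (untagged text)
    or (label, text) tuples (named entities).
    """
    relations = []
    for i, chunk in enumerate(chunks):
        if not (isinstance(chunk, tuple) and chunk[0] == subj_label):
            continue
        filler = []
        for nxt in chunks[i + 1:]:
            if isinstance(nxt, tuple):
                # Only the first entity after the subject is considered
                if nxt[0] == obj_label and pattern.match(" ".join(filler)):
                    relations.append((chunk[1], " ".join(filler), nxt[1]))
                break
            filler.append(nxt)
    return relations

# Hand-made input imitating NER output (illustrative, not corpus data)
chunks = [
    ("ORG", "WHYY"), "in", ("LOC", "Philadelphia"), "and",
    ("ORG", "Open Text"), ", based in", ("LOC", "Waterloo"),
    ("ORG", "Idealab"), "said", ("PER", "Bill Gross"),
]

relations = extract_relations(chunks, "ORG", "LOC", IN)
for subj, filler, obj in relations:
    print(f"[ORG: {subj!r}] {filler!r} [LOC: {obj!r}]")
```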