
Module No. 3
Parsing Structure in Text
Syllabus
Shallow vs Deep parsing, Approaches in parsing, Types
of parsing, Regex parser, Dependency parser, chunking,
Information extraction, Relation Extraction, Building
first NLP Application, Machine translation application
What is parsing?
• Parsing in NLP is the process of determining the syntactic structure of a text by analyzing its constituent words based on an underlying grammar (of the language).
• The outcome of the parsing process is a parse tree, where the sentence is the root; intermediate nodes such as noun_phrase and verb_phrase have children, and hence are called non-terminals; and the leaves of the tree are called terminals.
Input text → Parser (using a set of grammar rules/productions) → Valid parse tree
Example
Input: Tom ate an apple
Input: An apple ate Tom
Shallow parsing

•Shallow parsing is an analysis of a sentence that identifies the constituent parts of the sentence (nouns, verbs, adjectives, etc.).
•Shallow parsing is the task of extracting only a limited part of the syntactic information from the given text.
•It analyzes a sentence to identify the constituents (noun groups, verb groups, etc.). However, it does not specify their internal structure, nor their role in the main sentence.
•Shallow syntactic parsing (also called "chunking") typically identifies noun, verb, and preposition phrases in a sentence, while deep syntactic parsing produces full parse trees, in which the syntactic function (e.g., part of speech, or POS) of each word or phrase is tagged with a short label.
An example of shallow parsing
5 higher-order parsing tags
• Noun phrase (NP): These are phrases where a noun acts as the head word. Noun phrases act as a subject or object to a verb. E.g.: "The cat chased the mouse."
• Verb phrase (VP): These phrases are lexical units that have a verb acting as the head word. Usually, there are two forms of verb phrases: one has only the verb components, while the other has the verb components as well as other entities such as nouns, adjectives, or adverbs as parts of the object. E.g.: "She walked quickly to the store."
• Adjective phrase (ADJP): These are phrases with an adjective as the head word. Their main role is to describe or qualify nouns and pronouns in a sentence, and they will be placed either before or after the noun or pronoun. E.g.: "The sky was beautifully blue."
• Adverb phrase (ADVP): These phrases act like adverbs, since the adverb acts as the head word in the phrase. Adverb phrases are used as modifiers for nouns, verbs, or adverbs themselves by providing further details that describe or qualify them. E.g.: "He ran quickly."
• Prepositional phrase (PP): These phrases usually contain a preposition as the head word and other lexical components like nouns, pronouns, and so on. These act like an adjective or adverb describing other words or phrases. E.g.: "She sat on the chair."
Cont…
• VP (Verb Phrase):
– "She is cooking dinner."
– "They have been playing soccer."
• PP (Prepositional Phrase):
– "The cat is on the roof."
– "He went to the store."
• ADJP (Adjective Phrase):
– "The movie was very exciting."
– "The cake tastes really delicious."
• S (Sentence):
– "I went to the store."
– "She loves to read."
• NP (Noun Phrase):
– "The big dog barked loudly."
– "My friend is coming over tomorrow."
Deep Parsing
• In deep parsing, the search strategy gives a complete syntactic structure to a sentence.
• In some cases, we need to go for semantic parsing to understand the meaning of the sentence.
• It is suitable for complex NLP applications.
• In deep or full parsing, grammar concepts such as context-free grammar (CFG) and probabilistic context-free grammar (PCFG), together with a search strategy, are typically used to give a complete syntactic structure to a sentence.
The two Approaches in parsing
Parse Trees and Syntactic Ambiguity
Probability based parsing
S -> NP VP [0.7]
S -> VP [0.3]
NP -> Det N [0.6]
NP -> ProperN [0.4]
VP -> V NP [0.5]
VP -> V [0.3]
VP -> VP PP [0.2]
PP -> P NP [1.0]
Det -> 'the' [0.6]
Det -> 'a' [0.4]
N -> 'dog' [0.4]
N -> 'cat' [0.3]
N -> 'ball' [0.3]
ProperN -> 'John' [1.0]
V -> 'chased' [0.6]
V -> 'ate' [0.4]
P -> 'with' [0.5]
P -> 'in' [0.5]
Example: scoring a parse with the PCFG above
For the sentence "John chased the dog", the parse uses the rules:
1. S -> NP VP (0.7)
2. NP -> ProperN (0.4), ProperN -> 'John' (1.0)
3. VP -> V NP (0.5), V -> 'chased' (0.6)
4. NP -> Det N (0.6), Det -> 'the' (0.6), N -> 'dog' (0.4)
The probability of the whole parse tree is the product of the probabilities of all the rules used:
0.7 × 0.4 × 1.0 × 0.5 × 0.6 × 0.6 × 0.6 × 0.4 ≈ 0.0121
A PCFG parser scores every candidate tree this way and returns the highest-probability parse, which resolves ambiguity toward the statistically more likely reading.
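If NLTK is available, the grammar above can be loaded verbatim and the most probable parse found with the Viterbi parser; the following is a minimal sketch (the example sentence is our choice):

```python
import nltk

# The PCFG from the slide, verbatim
pcfg = nltk.PCFG.fromstring("""
S -> NP VP [0.7]
S -> VP [0.3]
NP -> Det N [0.6]
NP -> ProperN [0.4]
VP -> V NP [0.5]
VP -> V [0.3]
VP -> VP PP [0.2]
PP -> P NP [1.0]
Det -> 'the' [0.6]
Det -> 'a' [0.4]
N -> 'dog' [0.4]
N -> 'cat' [0.3]
N -> 'ball' [0.3]
ProperN -> 'John' [1.0]
V -> 'chased' [0.6]
V -> 'ate' [0.4]
P -> 'with' [0.5]
P -> 'in' [0.5]
""")

# ViterbiParser returns the single most probable tree
parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("John chased the dog".split()):
    print(tree.prob())  # product of the probabilities of all rules in the tree
    print(tree)
```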
Context free grammar
Context free grammar is a formal grammar which is used to generate all possible strings
in a given formal language.
Context free grammar G can be defined by four tuples as:
G= (V, T, P, S)
Where,
G describes the grammar.
V describes a finite set of non-terminal symbols.
T describes a finite set of terminal symbols.
P describes a set of production rules.
S is the start symbol.
In CFG, the start symbol is used to derive the string. You can derive the string by repeatedly replacing a non-terminal with the right-hand side of a production, until all non-terminals have been replaced by terminal symbols.

Terminal and non-terminal?
•Terminal symbols are the components of the sentences generated using a grammar, and are represented using lowercase letters like a, b, c, etc.
•Non-terminal symbols take part in the generation of the sentence but are not themselves components of the sentence; they are represented using capital letters like S, A, B, etc.
Example:
Production rules:
S → aSa
S → bSb
S → c

Derivation:
S ⇒ aSa ⇒ abSba ⇒ abbSbba ⇒ abbcbba

By applying the productions S → aSa and S → bSb recursively, and finally applying the production S → c, we get the string abbcbba.
Exercise: which of the strings {abcb, abcba, bacab, aca} can this grammar derive?
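The three productions can be mirrored directly in a few lines of plain Python to check which strings the grammar derives; a small sketch (the function name is ours):

```python
def derivable(s):
    """Check whether s can be derived from S -> aSa | bSb | c."""
    if s == "c":                         # base case: S -> c
        return True
    if len(s) >= 3 and s[0] == s[-1] and s[0] in "ab":
        return derivable(s[1:-1])        # peel off the matching pair: S -> aSa or S -> bSb
    return False

print([w for w in ["abbcbba", "abcb", "abcba", "bacab", "aca"] if derivable(w)])
# abcb is the only one that cannot be derived
```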
Capabilities of CFG
There are various capabilities of CFG:
•Context-free grammar is useful to describe most programming languages.
•If the grammar is properly designed, then an efficient parser can be constructed automatically.
•Using associativity and precedence information, suitable grammars for expressions can be constructed.
•Context-free grammar is capable of describing nested structures like balanced parentheses, matching begin-end blocks, corresponding if-then-else's, and so on.
Derivation
Derivation is a sequence of production rules. It is used to get the input string
through these production rules. During parsing we have to take two decisions.
These are as follows:
•We have to decide the non-terminal which is to be replaced.
•We have to decide the production rule by which the non-terminal will be
replaced.
•There are two common strategies for deciding which non-terminal to replace:
1. Left-most derivation
In the left-most derivation, the leftmost non-terminal in the sentential form is replaced at each step, so the input string is effectively derived from left to right.
Example:
Production rules:
S = S + S
S = S - S
S = a | b | c
Input: a-b+c
The left-most derivation is:
S = S + S
S = S - S + S
S = a - S + S
S = a - b + S
S = a - b + c
2. Right-most derivation
In the right-most derivation, the rightmost non-terminal in the sentential form is replaced at each step, so the input string is effectively derived from right to left.
Example:
Production rules:
S = S + S
S = S - S
S = a | b | c
Input: a-b+c
The right-most derivation is:
S = S - S
S = S - S + S
S = S - S + c
S = S - b + c
S = a - b + c
Parse tree
•A parse tree is the graphical representation of symbols, which can be terminals or non-terminals.
•In parsing, the string is derived using the start symbol. The root of the parse tree is that start symbol.
•A parse tree follows the precedence of operators: the deepest sub-tree is traversed first, so the operator in a parent node has lower precedence than the operator in its sub-tree.

The parse tree follows these points:
•All leaf nodes have to be terminals.
•All interior nodes have to be non-terminals.
•In-order traversal gives the original input string.
Example:
S -> sAB
A -> a
B -> b
For the input string "sab", the parse tree is:
[S s [A a] [B b]]

Example:
S -> AB
A -> c / aA
B -> d / bB
For the input string "acbd", the parse tree is:
[S [A a [A c]] [B b [B d]]]
Example:
Production rules:
S= S + S | S * S
S= a|b|c

Input:
a*b+c
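Under this grammar the input a*b+c is ambiguous: it can be parsed as (a*b)+c or as a*(b+c). A small plain-Python sketch (ours, not from the slides) counts the parse trees by trying every operator as the top-level split:

```python
def count_parses(toks):
    """Count parse trees for the grammar S -> S '+' S | S '*' S | 'a' | 'b' | 'c'."""
    if len(toks) == 1:
        return 1 if toks[0] in ("a", "b", "c") else 0
    total = 0
    # try every operator position as the top-level S -> S op S split
    for i in range(1, len(toks) - 1):
        if toks[i] in ("+", "*"):
            total += count_parses(toks[:i]) * count_parses(toks[i + 1:])
    return total

print(count_parses(list("a*b+c")))  # 2: (a*b)+c and a*(b+c)
```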
CFG and Parse tree in NLP
Top-down and bottom-up parsers

Example grammar:
S -> NP VP
NP -> ART N
NP -> ART ADJ N
VP -> V
VP -> V NP
V -> cried
N -> dogs | man
ART -> the
ADJ -> old

 Top-down parsers
A top-down parser starts from the start rule and rewrites it step by step into symbols that match words in the input sentence.

 Bottom-up parsers
A bottom-up parser builds the parse tree from leaves to root. Bottom-up parsing can be defined as an attempt to reduce the input string w to the start symbol of the grammar.

Input 1: The dogs cried
Input 2: The old man cried
Let's write our first grammar with a very limited vocabulary and very generic rules:
# toy CFG
from nltk import CFG
from nltk.parse.generate import generate
toy_grammar = CFG.fromstring("""
# S is the entire sentence; VP is a verb phrase; V is a verb
# NP is a noun phrase (a chunk that has a noun in it); Det is a determiner
S -> NP VP
VP -> V NP
V -> "eats" | "drinks"
NP -> Det N
Det -> "a" | "an" | "the"
N -> "president" | "Obama" | "apple" | "coke"
""")
for sentence in generate(toy_grammar, n=10):
    print(" ".join(sentence))
Now, this grammar can generate a finite number of sentences, such as:
• President eats apple
• Obama drinks coke
On the other hand, the same grammar can construct meaningless sentences such as:
• Apple eats coke
• President drinks Obama
When it comes to a syntactic parser, there is a chance that a syntactically formed
sentence could be meaningless.
Different types of parsers

•A parser processes an input string by using a set of grammatical rules and builds one or more constituent structures that conform to the grammar.
•A grammar is a declarative specification of a well-formed sentence.
•A parser is a procedural interpretation of the grammar. It searches through the space of possible trees and finds an optimal tree for the given sentence.
•Parser Types:
1. A Recursive Descent Parser
2. A Shift-Reduce Parser
3. A Chart Parser
4. A Regex Parser
5. Dependency Parsing
1. A Recursive Descent Parser
•One of the most straightforward forms of parsing is recursive
descent parsing.
•This is a top-down process in which the parser attempts to
verify that the syntax of the input stream is correct, as it is read
from left to right.
•A basic operation necessary for this involves reading characters
from the input stream and matching them with the terminals from
the grammar that describes the syntax of the input.
•Our recursive descent parser will look ahead one character and
advance the input stream reading pointer when it gets a proper
match.
•A recursive descent parser uses the technique of top-down parsing without backtracking.
•It can be defined as a parser that uses recursive procedures to process the input string, with no backtracking.
•It can be implemented simply in any language that supports recursion.
•The major approach of recursive-descent parsing is to relate each non-terminal with a procedure.
•The objective of each procedure is to read a sequence of input characters that can be produced by the corresponding non-terminal, and return a pointer to the root of the parse tree for that non-terminal. The structure of the procedure is prescribed by the productions for the corresponding non-terminal.
How to create a recursive descent parser?
•By carefully writing the grammar, that is, eliminating left recursion and left factoring from it, the resulting grammar can be parsed by a recursive descent parser. For a recursive descent parser, we write one procedure for every variable.

•Left recursion: a production of a grammar is said to have left recursion if the leftmost symbol of its RHS is the same as the non-terminal on its LHS.
•Right recursion: a production of a grammar is said to have right recursion if the rightmost symbol of its RHS is the same as the non-terminal on its LHS.
To prepare a grammar for recursive descent parsing:
1. Remove left recursion
2. Remove ambiguity

Before removing left recursion:
E -> E + T | T
T -> T * F | F
F -> ( E ) | id

After removing left recursion:
E -> T E'
E' -> + T E' | ε
T -> F T'
T' -> * F T' | ε
F -> ( E ) | id
Example − Write down the algorithm using recursive procedures to implement the following grammar.
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
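A minimal hand-written recognizer for this grammar, sketching how each non-terminal becomes a procedure (names such as Eprime and the token 'id' are our conventions, not from the slides):

```python
class RDParser:
    """Recursive descent recognizer for:
    E -> T E' ; E' -> + T E' | ε ; T -> F T' ; T' -> * F T' | ε ; F -> ( E ) | id
    """
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def eat(self, tok):
        # advance the reading pointer only on a proper match
        if self.peek() != tok:
            raise SyntaxError("expected %r, got %r" % (tok, self.peek()))
        self.i += 1

    def E(self):            # E -> T E'
        self.T(); self.Eprime()

    def Eprime(self):       # E' -> + T E' | ε   (ε: just return)
        if self.peek() == "+":
            self.eat("+"); self.T(); self.Eprime()

    def T(self):            # T -> F T'
        self.F(); self.Tprime()

    def Tprime(self):       # T' -> * F T' | ε
        if self.peek() == "*":
            self.eat("*"); self.F(); self.Tprime()

    def F(self):            # F -> ( E ) | id
        if self.peek() == "(":
            self.eat("("); self.E(); self.eat(")")
        else:
            self.eat("id")

def accepts(tokens):
    p = RDParser(tokens)
    try:
        p.E()
        return p.i == len(tokens)   # all input must be consumed
    except SyntaxError:
        return False

print(accepts(["id", "+", "id", "*", "id"]))  # True
print(accepts(["id", "+"]))                   # False
```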
To understand this, take the following example of a CFG:
S -> aAb | aBb
A -> cx | dx
B -> xe
String: adxb
2. A shift-reduce parser
•The shift-reduce parser is a simple kind of bottom-up parser.
•As is common with all bottom-up parsers, a shift-reduce parser tries to find a sequence of words and phrases that correspond to the right-hand side of a grammar production and replaces them with the left-hand side of the production, until the whole sentence is reduced.
•A shift-reduce parser attempts to construct a parse in a similar manner as bottom-up parsing, i.e., the parse tree is constructed from the leaves (bottom) to the root (up).
•A more general form of the shift-reduce parser is the LR parser.
•This parser requires some data structures, i.e.:
 - An input buffer for storing the input string.
 - A stack for storing grammar symbols and applying the production rules.
Example 1 – Consider the grammar
S –> S + S
S –> S * S
S –> id
Perform Shift Reduce parsing for input string “id + id + id”.
Example 2 – Consider the grammar
E –> 2E2
E –> 3E3
E –> 4
Perform Shift Reduce parsing for input string “32423”.
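The shift-reduce loop itself can be sketched in a few lines of plain Python; the greedy "reduce whenever possible" policy below happens to work for Example 2's grammar, though a real LR parser consults a parsing table to decide between shifting and reducing:

```python
def shift_reduce(tokens, productions, start):
    """Trace a simple shift-reduce parse. productions: list of (lhs, rhs-tuple)."""
    stack, tokens = [], list(tokens)
    while True:
        # try to reduce: does the top of the stack match some RHS?
        for lhs, rhs in productions:
            n = len(rhs)
            if n <= len(stack) and stack[-n:] == list(rhs):
                stack[-n:] = [lhs]
                print("reduce %s -> %s   stack: %s" % (lhs, "".join(rhs), "".join(stack)))
                break
        else:
            if not tokens:
                break                       # nothing to shift, nothing to reduce
            stack.append(tokens.pop(0))     # shift the next input symbol
            print("shift  %s          stack: %s" % (stack[-1], "".join(stack)))
    return stack == [start]   # accept iff the input reduced to the start symbol

grammar = [("E", ("2", "E", "2")), ("E", ("3", "E", "3")), ("E", ("4",))]
print(shift_reduce("32423", grammar, "E"))  # True: "32423" reduces to E
```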
3. A chart parser
•A chart parser is a type of parser suitable for ambiguous
grammars (including grammars of natural languages).
•Dynamic programming stores intermediate results and reuses them
when appropriate, achieving significant efficiency gains.
•This technique can be applied to syntactic parsing.
•This allows us to store partial solutions to the parsing task and then look them up when necessary in order to efficiently arrive at a complete solution. This approach to parsing is known as chart parsing.

Example: http://www.nltk.org/howto/parse.html
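The dynamic-programming idea behind chart parsing can be sketched with a tiny CYK-style recognizer in plain Python; the toy grammar below (in Chomsky normal form) is our own assumption:

```python
def cyk_recognize(tokens, lexical, binary, start="S"):
    """CYK chart recognizer. chart[i][j] holds every non-terminal that can
    derive tokens[i:j]; cells are filled shortest-span first and then reused
    by longer spans (the dynamic-programming step)."""
    n = len(tokens)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, word in enumerate(tokens):                  # lexical rules A -> word
        chart[i][i + 1] = {lhs for lhs, rhs in lexical if rhs == word}
    for span in range(2, n + 1):                       # binary rules A -> B C
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                  # every split point
                for lhs, (b, c) in binary:
                    if b in chart[i][k] and c in chart[k][j]:
                        chart[i][j].add(lhs)
    return start in chart[0][n]

lexical = [("Det", "the"), ("N", "dog"), ("N", "cat"), ("V", "chased")]
binary = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP"))]
print(cyk_recognize("the dog chased the cat".split(), lexical, binary))  # True
```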
4. A regex parser
A regex parser uses regular expressions defined in the form of a grammar on top of a POS-tagged string. The parser uses these regular expressions to parse the given sentences and generate a parse tree out of them.
A working example of the regex parser is:
# Regex parser
>>> import nltk
>>> from nltk.chunk.regexp import *
>>> reg_parser = RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}               # Preposition
V: {<V.*>}              # Verb
PP: {<P> <NP>}          # PP -> P NP
VP: {<V> <NP|PP>*}      # VP -> V (NP|PP)*''')
>>> test_sent = "Mr. Obama played a big role in the Health insurance bill"
>>> test_sent_pos = nltk.pos_tag(nltk.word_tokenize(test_sent))
>>> parsed_out = reg_parser.parse(test_sent_pos)
>>> parsed_out
Tree('S', [('Mr.', 'NNP'), ('Obama', 'NNP'), Tree('VP', [Tree('V',
[('played', 'VBD')]), Tree('NP', [('a', 'DT'), ('big', 'JJ'), ('role',
'NN')])]), Tree('P', [('in', 'IN')]), ('Health', 'NNP'), Tree('NP',
[('insurance', 'NN'), ('bill', 'NN')])])
The following is a graphical representation of the tree for the preceding code:

In the current example, we define the kinds of patterns (regular expressions over the POS tags) that we think will make a phrase. For example, anything matching {<DT>? <JJ>* <NN>*}, that is, an optional determiner followed by adjectives and then nouns, is most likely a noun phrase. This is more of a linguistic rule that we have defined to get a rule-based parse tree.
Chunking
•Chunking is shallow parsing where, instead of reaching out to the deep structure of the sentence, we try to club together chunks of the sentence that constitute some meaning.

A chunk can be defined as the minimal unit that can be processed. So, for example, the sentence "the President speaks about the health care reforms" can be broken into two chunks. One is "the President", which is noun dominated, and hence is called a noun phrase (NP). The remaining part of the sentence is dominated by a verb, hence it is called a verb phrase (VP). Within the part "speaks about the health care reforms" there is one more sub-chunk, an NP, so the VP can be broken down again into "speaks about" and "health care reforms", as shown in the following figure:
So, let's write some code snippets to do some basic chunking:
# Chunking
>>> import nltk
>>> from nltk.chunk.regexp import *
>>> test_sent = "The prime minister announced he had asked the chief government whip, Philip Ruddock, to call a special party room meeting for 9am on Monday to consider the spill motion."
>>> test_sent_pos = nltk.pos_tag(nltk.word_tokenize(test_sent))
>>> rule_vp = ChunkRule(r'(<VB.*>)?(<VB.*>)+(<PRP>)?', 'Chunk VPs')
>>> parser_vp = RegexpChunkParser([rule_vp], chunk_label='VP')
>>> print(parser_vp.parse(test_sent_pos))
>>> rule_np = ChunkRule(r'(<DT>?<RB>?)?<JJ|CD>*(<JJ|CD><,>)*(<NN.*>)+', 'Chunk NPs')
>>> parser_np = RegexpChunkParser([rule_np], chunk_label='NP')
>>> print(parser_np.parse(test_sent_pos))
Information Extraction

Information extraction is the process of parsing through unstructured data and extracting essential information into more editable and structured data formats.

Two operations:
• Named entity recognition --- (X, Y)
• Relation extraction --- (X, Y, Z)
Named-entity recognition (NER)
NER is a way of extracting some of the most common entities, such as names, organizations, and locations. However, some modified NER systems can be used to extract entities such as product names, biomedical entities, author names, brand names, and so on.

Let's start with a very generic example where we are given a text file of content and we need to extract some of the most insightful named entities from it:
Relation extraction
•Relation extraction is another commonly used information extraction
operation.
•Relation extraction, as it sounds, is the process of extracting the different relationships between different entities.
•There are a variety of relationships that exist between entities.
•We have seen relationships like inheritance, synonymy, and analogy.
•The definition of the relation can depend on the information need.
•For example, in the case where we want to find out from unstructured text data who is the writer of which book, authorship could be a relation between the author name and the book name.
•With NLTK, the idea is to use the same IE pipeline that we used up to NER, and extend it with a relation pattern based on the NER tags.
Process of Relation Extraction
Example
So, in the following code, we use the inbuilt ieer corpus, where the sentences are already tagged up to NER, and the only thing we need to specify is the relation pattern we want and the kind of NER tags we want the relation to connect.

In the following code, a relationship between an organization and a location has been defined, and we want to extract all the combinations of these patterns.
import re
import nltk
from nltk.corpus import ieer        # requires: nltk.download('ieer')
from nltk.sem import relextract

IN = re.compile(r'.*\bin\b .*')
for fileid in ieer.fileids():
    for doc in ieer.parsed_docs(fileid):
        for rel in relextract.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
            print(relextract.rtuple(rel))
Output

[ORG: 'Christian Democrats'] ', the leading political forces in' [LOC: 'Italy']
[ORG: 'AP'] ') _ Lebanese guerrillas attacked Israeli forces in southern' [LOC: 'Lebanon']
[ORG: 'Security Council'] 'adopted Resolution 425. Huge yellow banners hung across
intersections in' [LOC: 'Beirut']
[ORG: 'U.N.'] 'failures in' [LOC: 'Africa']
[ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
[ORG: 'U.N.'] 'partners on a more effective role in' [LOC: 'Africa']
[ORG: 'AP'] ') _ A bomb exploded in a mosque in central' [LOC: 'San`a']
[ORG: 'Krasnoye Sormovo'] 'shipyard in the Soviet city of' [LOC: 'Gorky']
[ORG: 'Kelab Golf Darul Ridzuan'] 'in' [LOC: 'Perak']
[ORG: 'U.N.'] 'peacekeeping operation in' [LOC: 'Somalia']
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
