Text Chunking Using NLTK
What is chunking
Why chunking
How to do chunking
An example: chunking a Wikipedia page
Some suggestions
Useful links
Why chunking? Chunking builds on POS tagging and supports Information Extraction tasks:
keywords extraction
entity recognition
relation extraction
What is an NP chunk?
base NP: an individual noun phrase that contains no other NP-chunks
Examples:
1. We saw the yellow dog. (2 NP chunks: "We" and "the yellow dog")
2. The market for system-management software for Digital's hardware is fragmented enough that a giant such as Computer Associates should do well there. (NP chunks include "The market", "system-management software", and "Digital's hardware")
Two ways to represent chunk structures:
Trees
Tags: IOB tags (I = inside, O = outside, B = begin); the label O marks tokens outside any chunk
He PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
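These IOB triples map directly onto the tree representation. A minimal sketch (assuming NLTK is installed) using nltk.chunk.conlltags2tree:

import nltk

# convert (word, POS, IOB) triples into a chunk tree
conlltags = [("He", "PRP", "B-NP"),
             ("accepted", "VBD", "B-VP"),
             ("the", "DT", "B-NP"),
             ("position", "NN", "I-NP")]
tree = nltk.chunk.conlltags2tree(conlltags)
print(tree)
# (S (NP He/PRP) (VP accepted/VBD) (NP the/DT position/NN))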
{<DT|PP\$>?<JJ>*<NN>}: an optional determiner or possessive pronoun, any number of adjectives, then a noun
sequences of proper nouns: {<NNP>+}
consecutive nouns: {<NN>+}
# define several tag patterns, used in the same way as on the previous slide
patterns = """
NP: {<DT|PP\$>?<JJ>*<NN>}
    {<NNP>+}
    {<NN>+}
"""
NPChunker = nltk.RegexpParser(patterns)  # create a chunk parser
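A short usage sketch: parse takes a POS-tagged sentence, i.e. a list of (word, tag) pairs, and returns an nltk.Tree with an NP subtree for each match (the tagged sentence below is made up for illustration):

import nltk

# made-up POS-tagged sentence for illustration
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]
result = NPChunker.parse(sentence)  # returns an nltk.Tree
print(result)
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))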
Data from the CoNLL 2000 Corpus (see next slide for how to load this corpus):
(NP UAL/NNP Corp./NNP stock/NN)  # e.g. define {<NNP>+<NN>} to capture this pattern
(NP more/JJR borrowers/NNS)
(NP the/DT fact/NN)
(NP expected/VBN mortgage/NN servicing/NN fees/NNS)
(NP a/DT $/$ 7.6/CD million/CD reduction/NN)
(NP other/JJ matters/NNS)
(NP general/JJ and/CC administrative/JJ expenses/NNS)
(NP a/DT special/JJ charge/NN)
(NP the/DT increased/VBN reserve/NN)
Precision-recall tradeoff
Note that by adding more rules/tag patterns you may achieve higher recall, but precision will usually go down.
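One way to see this tradeoff empirically (a sketch; assumes the conll2000 corpus has been downloaded via nltk.download('conll2000')): score a grammar against the gold-standard chunks, then add patterns and re-run.

import nltk
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

# score a one-pattern grammar; add {<NNP>+}, {<NN>+}, ... and re-score
cp = nltk.RegexpParser(r"NP: {<DT|PP\$>?<JJ>*<NN>}")
print(cp.evaluate(test_sents))  # prints IOB accuracy, precision, recall, F-measure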
The CoNLL 2000 corpus:
text divided into training and testing portions
POS tags and chunk tags available in IOB format
# get training and testing data
from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
# training the chunker, ChunkParser is a class defined in the next slide
NPChunker = ChunkParser(train_sents)
class ChunkParser(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # train a trigram tagger that maps POS-tag sequences to IOB chunk tags
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)

    def parse(self, sentence):
        # sentence is a list of (word, POS) pairs
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
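Once trained, the chunker can be scored against the held-out data with ChunkParserI's evaluate method; a one-line sketch, reusing the test_sents loaded above:

# evaluate the trained chunker on the CoNLL 2000 test set;
# the ChunkScore reports IOB accuracy, precision, recall, and F-measure
print(NPChunker.evaluate(test_sents))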
Approach One (regular-expression chunker)
Pros: more control over which tag patterns to extract
Approach Two (trained chunker)
Pros: high precision/recall for extracting all NP chunks
Cons: possibly needs more post-processing to filter unwanted words
Pipeline for chunking a Wikipedia page:
Wikipedia page → BoilerPipe API → plain text file → tokenization → sentence segmentation → POS tagging → chunking → keywords extraction → a list of keywords
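A minimal sketch of this pipeline in NLTK (the BoilerPipe step is assumed to have already produced the plain text, and the usual NLTK tokenizer/tagger models to have been downloaded; extract_np_keywords and its parameters are illustrative names, not from the original):

import nltk
from collections import Counter

def extract_np_keywords(text, chunker):
    # text: plain text, e.g. extracted from a Wikipedia page via BoilerPipe
    counts = Counter()
    for sent in nltk.sent_tokenize(text):        # sentence segmentation
        tokens = nltk.word_tokenize(sent)        # tokenization
        tagged = nltk.pos_tag(tokens)            # POS tagging
        tree = chunker.parse(tagged)             # chunking
        for subtree in tree.subtrees(lambda t: t.label() == 'NP'):
            term = ' '.join(word.lower() for word, tag in subtree.leaves())
            counts[term] += 1                    # keywords extraction by frequency
    return counts.most_common()                  # frequency-ranked keyword list

Called as extract_np_keywords(text, NPChunker), this yields frequency-ranked NP chunks like the list below.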
Top unigram NP chunks, as (freq, term) pairs:
(118, 'iphone')
(55, 'apple')
(19, 'screen')
(18, 'software')
(16, 'update')
(16, 'phone')
(13, 'application')
(12, 'user')
(12, 'itunes')
(11, 'june')
(10, 'trademark')
Some suggested tag patterns:
2. {<JJ>*<NN>}
3. {<NNP>+}
Useful links:
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html