Text Chunking Using NLTK

This document provides an overview of chunking, which is the process of grouping words into meaningful multi-word phrases based on part-of-speech tags. It discusses two main approaches for chunking - using regular expressions to define tag patterns for chunks, or training a chunk parser on annotated corpora. The document also provides code examples for chunking sentences using these two approaches and evaluating the results. Finally, it describes how to apply chunking to extract keywords from a Wikipedia page.


Dongqing Zhu

What is chunking
Why chunking
How to do chunking
An example: chunking a Wikipedia page
Some suggestions
Useful links

recovering phrases constructed from part-of-speech tagged words, for example:


finding base noun phrases (focus of this tutorial)
finding verb groups, etc.

a step beyond POS tagging, towards Information Extraction:

keyword extraction
entity recognition
relation extraction
other applicable areas

What is an NP chunk

base NP: an individual noun phrase that contains no other NP chunks

Examples:
1. We saw the yellow dog. (2 NP chunks: "We" and "the yellow dog")
2. The market for system-management software for Digital's hardware is fragmented enough that a giant such as Computer Associates should do well there. (5 NP chunks)

Representation of chunk structures: trees or IOB tags

IOB tags (I = inside, O = outside, B = begin)
label O for tokens outside any chunk

Tree file format:

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))

IOB file format:

he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
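
NLTK can convert between the two representations; a minimal sketch (it reuses the regular-expression chunker introduced later in this tutorial to build the example tree):

import nltk

# build the chunk tree for the example sentence above
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(sentence)

# tree -> IOB triples of (word, POS tag, chunk tag)
iob = nltk.chunk.tree2conlltags(tree)
print(iob)   # [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ...]

# IOB triples -> tree
print(nltk.chunk.conlltags2tree(iob))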

Two approaches will be covered in this tutorial

approach one: chunking with regular expressions
approach two: training a chunk parser on an annotated corpus

Use POS tagging as the basis for extracting higher-level structure, i.e., phrases

Key step: define tag patterns for deriving chunks

a tag pattern is a sequence of part-of-speech tags delimited using angle brackets
example: <DT>?<JJ>*<NN> defines a common NP pattern, i.e., an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN)

Define tag patterns to find NP chunks

# Python code (remember to install and import nltk)
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]  # a simple sentence with POS tags
pattern = "NP: {<DT>?<JJ>*<NN>}"        # define a tag pattern of an NP chunk
NPChunker = nltk.RegexpParser(pattern)  # create a chunk parser
result = NPChunker.parse(sentence)      # parse the example sentence
print result
# or draw graphically using result.draw()

# output:
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))

More tag patterns/rules for NP chunks

determiner/possessive, adjectives and noun: {<DT|PP\$>?<JJ>*<NN>}
sequences of proper nouns: {<NNP>+}
consecutive nouns: {<NN>+}

# define several tag patterns, used in the same way as on the previous slide
patterns = """
NP:
    {<DT|PP\$>?<JJ>*<NN>}
    {<NNP>+}
    {<NN>+}
"""
NPChunker = nltk.RegexpParser(patterns)  # create a chunk parser
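
A minimal sketch of applying this multi-rule chunker (the example sentence and its POS tags are assumed here for illustration):

import nltk

patterns = """
NP:
    {<DT|PP\$>?<JJ>*<NN>}
    {<NNP>+}
    {<NN>+}
"""
NPChunker = nltk.RegexpParser(patterns)

# a hand-tagged example sentence (assumed for illustration)
sentence = [("UAL", "NNP"), ("Corp.", "NNP"), ("stock", "NN"),
            ("rose", "VBD"), ("sharply", "RB")]
print(NPChunker.parse(sentence))
# (S (NP UAL/NNP Corp./NNP) (NP stock/NN) rose/VBD sharply/RB)

Note that these rules split "UAL Corp. stock" into two chunks; the corpus example on the next slide suggests adding a rule like {<NNP>+<NN>} to capture it as one.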

Obtain tag patterns from corpus

Data from CoNLL 2000 Corpus (see next slide for how to load this corpus):

(NP UAL/NNP Corp./NNP stock/NN)   # e.g. define {<NNP>+<NN>} to capture this pattern
(NP more/JJR borrowers/NNS)
(NP the/DT fact/NN)
(NP expected/VBN mortgage/NN servicing/NN fees/NNS)
(NP a/DT $/$ 7.6/CD million/CD reduction/NN)
(NP other/JJ matters/NNS)
(NP general/JJ and/CC administrative/JJ expenses/NNS)
(NP a/DT special/JJ charge/NN)
(NP the/DT increased/VBN reserve/NN)

Precision-recall tradeoff: note that by adding more rules/tag patterns you may achieve higher recall, but precision will usually go down.
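
A small sketch of browsing the corpus to spot tag patterns like those above (assumes the CoNLL 2000 data has been downloaded, e.g. via nltk.download()):

import nltk
from nltk.corpus import conll2000

# print a few NP-chunked sentences and look at the tag sequences inside NP chunks
for sent in conll2000.chunked_sents('train.txt', chunk_types=['NP'])[:3]:
    print(sent)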

Use CoNLL 2000 Corpus for training

CoNLL 2000 corpus contains 270k words of WSJ text
divided into training and testing portions
POS tags and chunk tags available in IOB format

# get training and testing data
from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
# train the chunker; ChunkParser is a class defined on the next slide
NPChunker = ChunkParser(train_sents)

Define the ChunkParser class

to learn tag patterns for NP chunks

class ChunkParser(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
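
Once trained, the parser is used just like the regular-expression chunker; a brief usage sketch (the tagged example sentence is assumed for illustration):

from nltk.corpus import conll2000

train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
NPChunker = ChunkParser(train_sents)   # the class defined above

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(NPChunker.parse(sentence))       # returns an nltk.Tree with NP subtrees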

Evaluate the trained chunk parser

>>> print NPChunker.evaluate(test_sents)
ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.5%
    Recall:        86.8%
    F-Measure:     84.6%

# the chunker got decent results and is ready to use
# Note: IOB Accuracy corresponds to the IOB file format described on the
# "Representation of chunk structures" slide
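
For comparison, the regular-expression chunker from approach one can be scored against the same test set; a sketch (the scores depend entirely on which rules you defined):

import nltk
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
patterns = """
NP:
    {<DT|PP\$>?<JJ>*<NN>}
    {<NNP>+}
    {<NN>+}
"""
cp = nltk.RegexpParser(patterns)
print(cp.evaluate(test_sents))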

Approach One
Pros: more control over what kind of tag patterns you want to match
Cons: difficult to come up with a set of rules that captures all base NP chunks while keeping a high precision

Approach Two
Pros: high precision/recall for extracting all NP chunks
Cons: possibly needs more post-processing to filter out unwanted words

Chunking a Wikipedia page

Pipeline: Wikipedia page → BoilerPipe API → plain text file → sentence segmentation → tokenization → POS tagging → chunking → keyword extraction → a list of keywords
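
The BoilerPipe step (pulling clean article text out of the HTML page) is not shown in the original code; a hedged sketch using the python-boilerpipe wrapper, assuming it is installed and that the URL below is the page you want:

# pip install boilerpipe  (Python wrapper around the BoilerPipe Java library)
import codecs
from boilerpipe.extract import Extractor

url = "http://en.wikipedia.org/wiki/IPhone"   # assumed example URL
extractor = Extractor(extractor='ArticleExtractor', url=url)

# write the extracted plain text to a file for the next steps of the pipeline
with codecs.open('wiki_page.txt', 'w', encoding='utf-8') as f:
    f.write(extractor.getText())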


# Python code for sentence segmentation, tokenization and POS tagging

import nltk
rawtext = open(plain_text_file).read()  # plain_text_file: path to the extracted plain text
sentences = nltk.sent_tokenize(rawtext)  # NLTK default sentence segmenter
sentences = [nltk.word_tokenize(sent) for sent in sentences]  # NLTK word tokenizer
sentences = [nltk.pos_tag(sent) for sent in sentences]  # NLTK POS tagger


for sent in sentences:
    # TO DO (already covered on the previous slides; a sketch follows below):
    # 1. create a chunk parser by defining patterns of NP chunks, or use the trained one
    # 2. parse every sentence
    # 3. store the NP chunks
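
A minimal sketch of those three steps using the regular-expression chunker from approach one (the helper extract_np_chunks and the choice of pattern are illustrative, not part of the original slides):

import nltk

NPChunker = nltk.RegexpParser("NP: {<JJ>*<NN>}")  # or use the trained ChunkParser

def extract_np_chunks(tree):
    # collect the word sequences of all NP subtrees in a parsed sentence
    chunks = []
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':        # older NLTK versions: subtree.node
            chunks.append(" ".join(word for word, tag in subtree.leaves()))
    return chunks

np_chunks = []
for sent in sentences:                     # sentences: POS-tagged, from the previous slide
    tree = NPChunker.parse(sent)
    np_chunks.extend(extract_np_chunks(tree))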


# extract the keywords based on frequency

# a tree traversal function for extracting NP chunks from the parsed tree
def traverse(t):
    try:
        t.node                   # note: newer NLTK versions use t.label() instead of t.node
    except AttributeError:
        return                   # t is a leaf, i.e. a (word, tag) tuple, not a subtree
    else:
        if t.node == 'NP':
            print t              # or do something else, e.g. store the chunk
        else:
            for child in t:
                traverse(child)
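
The "based on frequency" part is left implicit on the slides; a small sketch of counting chunk frequencies to produce (freq, term) pairs like those below (it assumes the np_chunks list collected by the chunking loop above):

from collections import defaultdict

chunk_freq = defaultdict(int)
for chunk in np_chunks:
    chunk_freq[chunk.lower()] += 1

# sort into (freq, term) pairs, most frequent first
keywords = sorted(((f, term) for term, f in chunk_freq.items()), reverse=True)
for pair in keywords[:15]:
    print(pair)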

Tag patterns used for this example: 1. {<NN>+}  2. {<JJ>*<NN>}  3. {<NNP>+}

Top unigram NP chunks, as (freq, term) pairs:
(118, 'iphone')
(55, 'apple')
(19, 'screen')
(18, 'software')
(16, 'update')
(16, 'phone')
(13, 'application')
(12, 'user')
(12, 'itunes')
(11, 'june')
(10, 'trademark')

Top bigram NP chunks


(6, 'app store')
(4, 'ipod touch')
(3, 'virtual keyboard')
(3, 'united kingdom')
(3, 'steve jobs')
(3, 'ocean telecom')
(3, 'mac os x')
...


Top NP chunks containing >2 terms


(3, 'mac os x')
(1, 'real-time geographic location')
(1, 'apple-approved cryptographic signature')
(1, 'free push-email service')
(1, 'computerized global information')
(1, 'apple ceo steve jobs')
(1, 'direct internal camera-to-e-mail picture')

Guess what the title of this Wikipedia page is?

If using approach one, define a few good tag patterns to extract only the things you're interested in
e.g. do not include determiners
define tag patterns for n-grams (n < 4)

If using approach two, do some post-processing to filter unwanted chunks (a small sketch follows below)
e.g. drop long NP phrases
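
A minimal post-processing sketch along those lines, reusing the (freq, term) keyword list built earlier (the length threshold and stopword list are illustrative):

# keep only chunks shorter than 4 words and drop bare determiners
filtered = [(freq, term) for freq, term in keywords
            if len(term.split()) < 4 and term not in ('the', 'a', 'an')]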

Try to form a tag cloud by taking just the frequent bigram and trigram NP chunks, using PMI or TF-IDF information to prune a bit, and then adding some unigrams (remember you are allowed no more than 15 tags).
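
A sketch of PMI-based pruning using NLTK's collocation tools (it assumes tokens is the flat list of words from the page, e.g. [w for sent in sentences for (w, tag) in sent]):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                       # ignore bigrams seen fewer than 3 times
top_bigrams = finder.nbest(BigramAssocMeasures().pmi, 15)
print(top_bigrams)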

Chapters from the book Natural Language Processing with Python:
http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html

LingPipe does chunking in a very similar way:
http://alias-i.com/lingpipe/demos/tutorial/posTags/read-me.html
