Text Chunking Using NLTK
What is chunking
Why chunking
How to do chunking
An example: chunking a Wikipedia page
Some suggestions
Useful links
Why chunking? Chunking builds on POS tagging and supports Information Extraction tasks:
keywords extraction
entity recognition
relation extraction
What is an NP chunk?
base NP: an individual noun phrase that contains no other NP-chunks
Examples:
1. We saw the yellow dog. (2 NP chunks: "We" and "the yellow dog")
2. The market for system-management software for Digital's hardware is fragmented enough that a giant such as Computer Associates should do well there. (NP chunks include "The market", "system-management software", and "Digital's hardware")
Two ways to represent chunk structures:
Trees
Tags: IOB tags (I = inside, O = outside, B = begin); the label O marks tokens outside any chunk
He PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
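These IOB triples map directly onto the tree representation. A minimal sketch (assuming NLTK is installed) using nltk.chunk.conlltags2tree:

import nltk

# convert (word, POS, IOB) triples into a chunk tree
conlltags = [("He", "PRP", "B-NP"),
             ("accepted", "VBD", "B-VP"),
             ("the", "DT", "B-NP"),
             ("position", "NN", "I-NP")]
tree = nltk.chunk.conlltags2tree(conlltags)
print(tree)
# (S (NP He/PRP) (VP accepted/VBD) (NP the/DT position/NN))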
{<DT|PP\$>?<JJ>*<NN>}: an optional determiner or possessive pronoun, any number of adjectives, then a noun
sequences of proper nouns: {<NNP>+}
consecutive nouns: {<NN>+}
# define several tag patterns, used in the same way as on the previous slide
patterns = """
NP: {<DT|PP\$>?<JJ>*<NN>}
    {<NNP>+}
    {<NN>+}
"""
NPChunker = nltk.RegexpParser(patterns)  # create a chunk parser
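A short usage sketch: parse takes a POS-tagged sentence, i.e. a list of (word, tag) pairs, and returns an nltk.Tree with an NP subtree for each match (the tagged sentence below is made up for illustration):

import nltk

# made-up POS-tagged sentence for illustration
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]
result = NPChunker.parse(sentence)  # returns an nltk.Tree
print(result)
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))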
Data from the CoNLL 2000 Corpus (see next slide for how to load this corpus):
(NP UAL/NNP Corp./NNP stock/NN)  # e.g. define {<NNP>+<NN>} to capture this pattern
(NP more/JJR borrowers/NNS)
(NP the/DT fact/NN)
(NP expected/VBN mortgage/NN servicing/NN fees/NNS)
(NP a/DT $/$ 7.6/CD million/CD reduction/NN)
(NP other/JJ matters/NNS)
(NP general/JJ and/CC administrative/JJ expenses/NNS)
(NP a/DT special/JJ charge/NN)
(NP the/DT increased/VBN reserve/NN)
Precision-recall tradeoff
Note that by adding more rules/tag patterns you may achieve higher recall, but precision will usually go down.
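One way to see this tradeoff empirically (a sketch; assumes the conll2000 corpus has been downloaded via nltk.download('conll2000')): score a grammar against the gold-standard chunks, then add patterns and re-run.

import nltk
from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

# score a one-pattern grammar; add {<NNP>+}, {<NN>+}, ... and re-score
cp = nltk.RegexpParser(r"NP: {<DT|PP\$>?<JJ>*<NN>}")
print(cp.evaluate(test_sents))  # prints IOB accuracy, precision, recall, F-measure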
The CoNLL 2000 corpus:
text divided into training and testing portions
POS tags and chunk tags available in IOB format
# get training and testing data
from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
# training the chunker, ChunkParser is a class defined in the next slide
NPChunker = ChunkParser(train_sents)
class ChunkParser(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # train a trigram tagger that maps POS-tag sequences to IOB chunk tags
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)

    def parse(self, sentence):
        # sentence is a list of (word, POS) pairs
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
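Once trained, the chunker can be scored against the held-out data with ChunkParserI's evaluate method; a one-line sketch, reusing the test_sents loaded above:

# evaluate the trained chunker on the CoNLL 2000 test set;
# the ChunkScore reports IOB accuracy, precision, recall, and F-measure
print(NPChunker.evaluate(test_sents))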
Approach One (regular-expression chunker)
Pros: more control over which tag patterns to extract
Approach Two (trained chunker)
Pros: high precision/recall for extracting all NP chunks
Cons: possibly needs more post-processing to filter unwanted words
Pipeline for chunking a Wikipedia page:
Wikipedia page → BoilerPipe API → plain text file → tokenization → sentence segmentation → POS tagging → chunking → keywords extraction → a list of keywords
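A minimal sketch of this pipeline in NLTK (the BoilerPipe step is assumed to have already produced the plain text, and the usual NLTK tokenizer/tagger models to have been downloaded; extract_np_keywords and its parameters are illustrative names, not from the original):

import nltk
from collections import Counter

def extract_np_keywords(text, chunker):
    # text: plain text, e.g. extracted from a Wikipedia page via BoilerPipe
    counts = Counter()
    for sent in nltk.sent_tokenize(text):        # sentence segmentation
        tokens = nltk.word_tokenize(sent)        # tokenization
        tagged = nltk.pos_tag(tokens)            # POS tagging
        tree = chunker.parse(tagged)             # chunking
        for subtree in tree.subtrees(lambda t: t.label() == 'NP'):
            term = ' '.join(word.lower() for word, tag in subtree.leaves())
            counts[term] += 1                    # keywords extraction by frequency
    return counts.most_common()                  # frequency-ranked keyword list

Called as extract_np_keywords(text, NPChunker), this yields frequency-ranked NP chunks like the list below.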
Top unigram NP chunks, as (freq, term) pairs:
(118, 'iphone')
(55, 'apple')
(19, 'screen')
(18, 'software')
(16, 'update')
(16, 'phone')
(13, 'application')
(12, 'user')
(12, 'itunes')
(11, 'june')
(10, 'trademark')
Some suggested tag patterns:
2. {<JJ>*<NN>}
3. {<NNP>+}
Useful links:
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html