Nltk natural language toolkit overview and application @ PyCon.tw 2012

NLTK: Natural Language Toolkit
Overview and Application
Jimmy Lai
jimmy.lai@oi-sys.com
Software Engineer @ 引京聚點
2012/06/09

Outline
1. An application based on NLP: 聚寶評
2. Introduction to Natural Language Processing
3. Brief History of NLTK
4. NLTK

聚寶評 www.ezpao.com

Food search engine

Searches food reviews across major blogs

聚寶評 www.ezpao.com

Semantic analysis search engine

Review topic analysis

Analysis of dishes shared by users

Positive/negative review analysis

Mobile version: user-shared dishes + review analysis

m.ezpao.com

User-shared dishes + positive/negative review analysis

Natural Language Processing (NLP)
• Speech recognition
• Part-of-speech tagging
• Parsing
• Natural language generation
• Text classification
• Information extraction
• Machine translation
• Textual entailment
via Wikipedia

NLTK: Natural Language Toolkit
• http://www.nltk.org/
• Authors: Steven Bird, Edward Loper, Ewan Klein
• Originally developed for a course whose students have a
background in either computer science or linguistics.
• Currently:
– Education: used in over 100 courses in 23 countries.
– Research: cited by over 250 papers.

Outline
1. An application based on NLP: 聚寶評
2. Introduction to Natural Language Processing
3. Brief History of NLTK
4. NLTK:
a. Installation
b. Annotated Text Corpora
c. Text Tokenization, Normalization, Analysis,
Distribution Analysis
d. Part-of-speech Tagging
e. Text Classification
f. Named Entity Recognition

Install NLTK
Python 2.6+

pip install numpy
pip install matplotlib
pip install nltk

Annotated Text Corpora (1/3)
nltk.corpus
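• Corpus: a collection of data that carries some kind of structured
annotation; it may cover multiple languages.
• Examples:
– stopwords: lists of common words (stop words)
– sinica_treebank: Chinese corpus annotated with sentence structure
– brown: English corpus with 15 categories and part-of-speech tags
– wordnet: English dictionary with parts of speech, synonyms and antonyms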

nltk.corpus
import nltk

# download corpus on demand
nltk.download()

# stopwords
nltk.corpus.stopwords.words()
nltk.corpus.stopwords.words('english')
nltk.corpus.stopwords.words('french')

some_english_stopwords = ['most', 'me', 'below', 'when', 'which', 'what',
                          'of', 'it', 'very', 'our']

# Chinese treebank
nltk.corpus.sinica_treebank

# Examples
(嘉珍, Nba)
(和, Caa)
(我, Nhaa)
(住在, VC1)
(同一條, DM)
(巷子, Nab)
(我們, Nhaa)
(是, V_11)
(鄰居, Nab)

(也, Dbb)
(是, V_11)
(同班, Nv3)
(同學, Nab)
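
As a small illustration of what a stop word list is for, the sketch below filters the sample stop words from this slide out of a tokenized sentence. It is plain Python only; the example sentence and the helper name remove_stopwords are made up for the illustration, and in practice the full list would come from nltk.corpus.stopwords.words('english').

```python
# Sample stop words copied from the slide (a small subset, not the full NLTK list).
some_english_stopwords = ['most', 'me', 'below', 'when', 'which', 'what',
                          'of', 'it', 'very', 'our']


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stop words (case-insensitive)."""
    stopword_set = set(stopwords)  # set membership tests are fast
    return [tok for tok in tokens if tok.lower() not in stopword_set]


# Hypothetical example sentence, already split into tokens.
tokens = ['Most', 'of', 'our', 'neighbors', 'like', 'it', 'very', 'spicy']
content_words = remove_stopwords(tokens, some_english_stopwords)
print(content_words)  # ['neighbors', 'like', 'spicy']
```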

nltk.corpus
• ACE Named Entity Chunker (Maximum entropy)
• Australian Broadcasting Commission 2006
• Alpino Dutch Treebank
• BioCreAtIvE (Critical Assessment of Information Extraction Systems in Biology)
• Brown Corpus
• Brown Corpus (TEI XML Version)
• CESS-CAT Treebank
• CESS-ESP Treebank
• Chat-80 Data Files
• City Database
• The Carnegie Mellon Pronouncing Dictionary (0.6)
• ComTrans Corpus Sample
• CONLL 2000 Chunking Corpus
• CONLL 2002 Named Entity Recognition Corpus
• Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset)
• Dependency Parsed Treebank
• Sample European Parliament Proceedings Parallel Corpus
• Portuguese Treebank
• Gazeteer Lists
• Genesis Corpus
• Project Gutenberg Selections
• NIST IE-ER DATA SAMPLE
• C-Span Inaugural Address Corpus
• Indian Language POS-Tagged Corpus
• JEITA Public Morphologically Tagged Corpus (in ChaSen format)
• PC-KIMMO Data Files
• KNB Corpus (Annotated blog corpus)
• Language Id Corpus
• Lin's Dependency Thesaurus
• MAC-MORPHO: Brazilian Portuguese news text with part-of-speech tags
• Machado de Assis -- Obra Completa
• Sentiment Polarity Dataset Version 2.0
• Names Corpus, Version 1.3 (1994-03-29)
• NomBank Corpus 1.0
• NPS Chat
• Paradigm Corpus

Text Tokenization
nltk.tokenize
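Web Text Processing Flow
HTML → ASCII → Text → Vocabulary

# Python 2 / NLTK 2.x code, as on the original slide.
# (Python 3 moved urlopen to urllib.request; nltk.clean_html is no longer
# implemented in NLTK 3.)
import nltk
from urllib import urlopen

url = 'http://www.nltk.org/'  # placeholder; the slide does not give a URL
html = urlopen(url).read()             # fetch the raw HTML
raw = nltk.clean_html(html)            # strip markup, keep the text
sents = nltk.sent_tokenize(raw)        # split the text into sentences
# For the string '3.33':
#   wordpunct_tokenize gives ['3', '.', '33']
#   word_tokenize gives ['3.33']
tokens = []
for sent in sents:
    tokens.extend(nltk.word_tokenize(sent))
text = nltk.Text(tokens)
words = [word.lower() for word in text]
vocab = sorted(set(words))             # distinct lowercase tokens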
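
The difference between the two tokenizers on '3.33' can be reproduced with regular expressions alone. In the sketch below, WORDPUNCT is the pattern that NLTK's WordPunctTokenizer is built on; NUMBER_AWARE is a hypothetical stand-in that merely keeps decimal numbers together and is not how nltk.word_tokenize works internally. The sample sentence and the helper build_vocab are assumptions made for the illustration.

```python
import re

# Runs of word characters, or runs of non-space punctuation.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical number-aware variant: try decimal numbers first,
# then fall back to the pattern above.
NUMBER_AWARE = re.compile(r"\d+(?:\.\d+)+|\w+|[^\w\s]+")


def build_vocab(tokens):
    """Lowercase the tokens and return the sorted list of distinct ones."""
    return sorted(set(tok.lower() for tok in tokens))


raw = "The Pie costs 3.33 dollars. The pie is good."
print(WORDPUNCT.findall("3.33"))     # ['3', '.', '33']
print(NUMBER_AWARE.findall("3.33"))  # ['3.33']
vocab = build_vocab(NUMBER_AWARE.findall(raw))
print(vocab)
```

Lowercasing before building the set is what merges 'The' and 'the', or 'Pie' and 'pie', into a single vocabulary entry.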

Text Normalization (1/2)
nltk.stem
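• Stemming: reduces a word (present tense, past tense, singular, plural)
to its base form, so that different forms of a word can be grouped as
the same word.
• Well-known algorithm: Porter Stemmer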
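
To make the idea concrete, here is a toy suffix-stripping stemmer in plain Python. It is deliberately much cruder than the Porter algorithm (which NLTK provides as nltk.stem.PorterStemmer); the rule table and the function name naive_stem are assumptions made for this sketch only.

```python
def naive_stem(word):
    """Strip a few common English suffixes (toy rules, far cruder than Porter)."""
    word = word.lower()
    # Longer suffixes are checked first so that 'ies' wins over 's'.
    for suffix, replacement in (('ies', 'y'), ('ing', ''), ('ed', ''), ('s', '')):
        # Keep at least three characters of stem to avoid mangling short words.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word


forms = ['cook', 'cooks', 'cooked', 'cooking']
stems = [naive_stem(w) for w in forms]
print(stems)  # ['cook', 'cook', 'cook', 'cook']
```

All four surface forms collapse to the same stem, which is what lets later steps such as counting or classification treat them as one word.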

Text Analysis
nltk.text
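
nltk.text wraps a token list in a Text object that offers exploratory views such as a concordance, which lists every occurrence of a word together with its surrounding context. The sketch below reproduces that idea in plain Python; the function, the window size and the sample tokens are illustrative assumptions, not NLTK's implementation.

```python
def concordance(tokens, target, window=2):
    """Return each occurrence of target with `window` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(' '.join(left + [tok] + right))
    return lines


# Made-up sample text, already tokenized.
tokens = 'the beef noodles are great but the pork noodles are salty'.split()
for line in concordance(tokens, 'noodles'):
    print(line)
```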

Text Distribution Analysis
nltk.probability
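
nltk.probability provides FreqDist, a frequency distribution that records how often each sample occurs. The same kind of counting can be sketched with the standard library's collections.Counter; the token list here is made up for the illustration.

```python
from collections import Counter

# Made-up sample text, already tokenized.
tokens = 'the beef noodles are great but the pork noodles are salty'.split()

freq = Counter(tokens)        # token -> number of occurrences
total = sum(freq.values())    # total number of tokens

print(freq.most_common(3))    # the three most frequent tokens with their counts
print(freq['noodles'] / float(total))  # relative frequency of one token
```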

Part-of-speech Tagging (1/3)
nltk.tag
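• Part-of-speech (POS) tagging: labels each word
with its part of speech.
• The same word means different things under different parts of speech;
for example, "book" as a noun is a book, and as a verb means to reserve.
• POS tagging attaches more semantic information
to the text.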
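
A minimal sketch of the simplest statistical approach, a unigram tagger that assigns each word the tag it carried most often in training data. It is written in plain Python and trained on the handful of (word, tag) pairs from the sinica_treebank example earlier in the deck; the function names are assumptions for this sketch, and NLTK ships its own version as nltk.UnigramTagger. A unigram tagger looks at one word at a time, so it cannot resolve an ambiguity like noun versus verb "book"; that requires taggers that also use the surrounding context.

```python
from collections import Counter, defaultdict

# (word, tag) pairs from the sinica_treebank example shown earlier in the deck.
tagged_words = [
    ('嘉珍', 'Nba'), ('和', 'Caa'), ('我', 'Nhaa'), ('住在', 'VC1'),
    ('同一條', 'DM'), ('巷子', 'Nab'), ('我們', 'Nhaa'), ('是', 'V_11'),
    ('鄰居', 'Nab'), ('也', 'Dbb'), ('是', 'V_11'), ('同班', 'Nv3'),
    ('同學', 'Nab'),
]


def train_unigram_tagger(tagged):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default=None):
    """Tag each word; words never seen in training get the default tag."""
    return [(w, model.get(w, default)) for w in words]


model = train_unigram_tagger(tagged_words)
result = tag(['我們', '是', '同學', '嗎'], model)
print(result)
```

The last word does not occur in the training pairs, so it comes back with the default tag None, which shows why real taggers need far more training data or a fallback strategy.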

Text Classification (1/3)
nltk.classify
• Text classification: analyzes a text and assigns it to
one of a set of predefined categories.
• Based on statistical machine learning algorithms;
well-known ones include:
– Naïve Bayes Classifier
– Decision Tree
– Support Vector Machine

nltk.classify
Machine Learning Approach Work Flow
Training: Text with Label → Feature Generation → Features → Learning Algorithms → Classifier Model
Prediction: Text without Label → Feature Generation → Features → Classifier Model → Label
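
The work flow above can be sketched end to end with a small Naive Bayes classifier in plain Python. This is a simplified version with add-one smoothing, not the implementation behind nltk.NaiveBayesClassifier; the function names and the four training reviews are made up for the illustration.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature generation: the set of lowercase words in the text."""
    return set(text.lower().split())


def train(labeled_texts):
    """Learning: count labels and per-label feature occurrences."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_texts:
        feats = extract_features(text)
        label_counts[label] += 1
        feature_counts[label].update(feats)
        vocab |= feats
    return label_counts, feature_counts, vocab


def predict(text, model):
    """Prediction: pick the label with the highest log posterior."""
    label_counts, feature_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / float(total_docs))  # log prior
        total_feats = sum(feature_counts[label].values())
        for feat in extract_features(text):
            if feat not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = feature_counts[label][feat] + 1
            score += math.log(count / float(total_feats + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training reviews, for illustration only.
training = [
    ('the noodles were delicious', 'pos'),
    ('great service and tasty soup', 'pos'),
    ('the soup was cold and bland', 'neg'),
    ('terrible service and salty noodles', 'neg'),
]
model = train(training)
print(predict('delicious soup and great noodles', model))  # pos
print(predict('cold salty soup', model))                   # neg
```

The same extract_features function is used for training and prediction, mirroring the shared Feature Generation step in both rows of the work flow.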

Named Entity Recognition (NER) (1/2)
nltk.tag, nltk.chunk
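• Named Entity Recognition: extracts named entities from text. A named
entity is a compound term that carries a complete meaning, for example
a person name, a place name, or an event.

NER General Work Flow
Sentence Segmentation → Tokenization → POS Tagging → Named Entity Recognition → Relation Recognition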
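
As a rough illustration of the Named Entity Recognition step alone, the sketch below groups consecutive proper-noun tokens from an already POS-tagged sentence into candidate entities. This is a simplistic heuristic in plain Python, not the trained chunker behind nltk.ne_chunk, and it does not assign entity types; the tagged sentence and the function name are assumptions made for the example.

```python
def chunk_proper_nouns(tagged_tokens):
    """Group consecutive NNP-tagged tokens into candidate named entities."""
    entities, current = [], []
    for word, tag in tagged_tokens:
        if tag == 'NNP':
            current.append(word)
        elif current:
            entities.append(' '.join(current))
            current = []
    if current:  # flush an entity that ends the sentence
        entities.append(' '.join(current))
    return entities


# Hypothetical output of the tokenization and POS tagging steps
# (Penn Treebank tags: NNP = proper noun, VBD = past-tense verb, IN = preposition).
tagged = [('Jimmy', 'NNP'), ('Lai', 'NNP'), ('spoke', 'VBD'),
          ('at', 'IN'), ('PyCon', 'NNP'), ('Taiwan', 'NNP')]
entities = chunk_proper_nouns(tagged)
print(entities)  # ['Jimmy Lai', 'PyCon Taiwan']
```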

Thank you for your attention.
Q&A
We are hiring!
• Core engine algorithm R&D engineer
• System R&D engineer
• Web application R&D engineer

Oxygen Intelligence Taiwan Limited
引京聚點知識結構搜索股份有限公司
• Company profile: http://www.ezpao.com/about/
• Job openings: http://www.ezpao.com/join/
• Please send your résumé to jimmy.lai@oi-sys.com
