These slides introduce NLTK, a Python toolkit for Natural Language Processing (NLP). The author covers several useful NLTK topics and demonstrates them with code examples.
7. Natural Language Processing (NLP)
• Speech recognition
• Part-of-speech tagging
• Parsing
• Natural language generation
• Text classification
• Information extraction
• Machine translation
• Textual entailment
via Wikipedia
8. NLTK: Natural Language Toolkit
• http://www.nltk.org/
• Authors: Steven Bird, Edward Loper, Ewan Klein
• Originally developed for courses whose students have
a background in either computer science or
linguistics.
• Currently:
– Education: over 100 courses in 23 countries.
– Research: over 250 papers cite NLTK.
9. Outline
1. An application based on NLP: 聚寶評
2. Introduction to Natural Language Processing
3. Brief History of NLTK
4. NLTK:
a. Installation
b. Annotated Text Corpora
c. Text Tokenization, Normalization, Analysis,
Distribution Analysis
d. Part-of-speech Tagging
e. Text Classification
f. Named Entity Recognition
13. Annotated Text Corpora (3/3)
nltk.corpus
• ACE Named Entity Chunker (Maximum entropy)
• Australian Broadcasting Commission 2006
• Alpino Dutch Treebank
• BioCreAtIvE (Critical Assessment of Information Extraction Systems in Biology)
• Brown Corpus
• Brown Corpus (TEI XML Version)
• CESS-CAT Treebank
• CESS-ESP Treebank
• Chat-80 Data Files
• City Database
• The Carnegie Mellon Pronouncing Dictionary (0.6)
• ComTrans Corpus Sample
• CONLL 2000 Chunking Corpus
• CONLL 2002 Named Entity Recognition Corpus
• Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset)
• Dependency Parsed Treebank
• Sample European Parliament Proceedings Parallel Corpus
• Portuguese Treebank
• Gazeteer Lists
• Genesis Corpus
• Project Gutenberg Selections
• NIST IE-ER DATA SAMPLE
• C-Span Inaugural Address Corpus
• Indian Language POS-Tagged Corpus
• JEITA Public Morphologically Tagged Corpus (in ChaSen format)
• PC-KIMMO Data Files
• KNB Corpus (Annotated blog corpus)
• Language Id Corpus
• Lin's Dependency Thesaurus
• MAC-MORPHO: Brazilian Portuguese news text with part-of-speech tags
• Machado de Assis -- Obra Completa
• Sentiment Polarity Dataset Version 2.0
• Names Corpus, Version 1.3 (1994-03-29)
• NomBank Corpus 1.0
• NPS Chat
• Paradigm Corpus
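As a rough illustration of how one of the corpora listed above might be read through nltk.corpus, the sketch below pulls the first few words of the Brown Corpus news category. It assumes NLTK is installed and the corpus data has been fetched with nltk.download("brown"); the helper name sample_corpus_words is hypothetical, and it simply returns an empty list when NLTK or the data is missing.

```python
def sample_corpus_words(n=10):
    """Return the first n words of the Brown Corpus 'news' category.

    Hypothetical helper: returns [] if NLTK is not installed or the
    corpus data has not been downloaded.
    """
    try:
        # Assumption: NLTK is installed and nltk.download("brown") was run.
        from nltk.corpus import brown
        return list(brown.words(categories="news")[:n])
    except (ImportError, LookupError):
        # ImportError: NLTK missing. LookupError: corpus data missing.
        return []


sample = sample_corpus_words()
print(sample)
```

Other corpus readers under nltk.corpus are used in a similar way, although the methods available (words, sents, tagged_words, and so on) depend on how each corpus is annotated.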
14. Text Tokenization
nltk.tokenize
Web Text Processing Flow: HTML → ASCII → Text → Vocabulary
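import nltk
from urllib.request import urlopen   # Python 3 location of urlopen
from bs4 import BeautifulSoup        # third-party; nltk.clean_html() was removed in NLTK 3

url = "http://www.nltk.org/"         # placeholder: any page to process
html = urlopen(url).read()
raw = BeautifulSoup(html, "html.parser").get_text()  # strip HTML markup
sents = nltk.sent_tokenize(raw)      # needs the "punkt" tokenizer data
tokens = []                          # wordpunct_tokenize: ['3', '.', '33']
for sent in sents:                   # word_tokenize: ['3.33']
    tokens.extend(nltk.word_tokenize(sent))
text = nltk.Text(tokens)
words = [word.lower() for word in text]
vocab = sorted(set(words))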
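The code comments on this slide contrast two tokenizers on the string '3.33'. The sketch below reproduces the wordpunct behaviour with the standard library only, so it runs without NLTK or its data. The regular expression is, to the best of my knowledge, the pattern behind NLTK's WordPunctTokenizer; the function names and the sample sentence are illustrative assumptions, not part of the slides.

```python
import re


def wordpunct_tokenize(text):
    """Split text into runs of word characters or runs of punctuation."""
    # Assumed to mirror the pattern used by nltk.tokenize.WordPunctTokenizer.
    return re.findall(r"\w+|[^\w\s]+", text)


def build_vocab(tokens):
    """Lowercase the tokens and return the sorted set of distinct ones."""
    return sorted(set(token.lower() for token in tokens))


# Hypothetical sample text, chosen to show the '3.33' split.
tokens = wordpunct_tokenize("The price is 3.33 dollars. The Price is right.")
vocab = build_vocab(tokens)

print(wordpunct_tokenize("3.33"))  # ['3', '.', '33']
print(vocab)  # ['.', '3', '33', 'dollars', 'is', 'price', 'right', 'the']
```

Because this tokenizer splits on every punctuation run, decimals and abbreviations are broken apart, which is the reason the slide prefers word_tokenize for this text.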
22. Text Classification (1/3)
nltk.classify
• Text Classification: analyze a text and assign it to
one of a set of predefined categories.
• Based on statistical machine learning algorithms;
well-known ones include:
– Naïve Bayes Classifier
– Decision Tree
– Support Vector Machine
23. Text Classification (2/3)
nltk.classify
Machine Learning Approach Work Flow
Training: Text with Label → Feature Generation → Features → Learning Algorithms → Classifier Model
Prediction: Text without Label → Feature Generation → Features → Classifier Model → Label
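The training and prediction flow above can be sketched with a small Naive Bayes classifier written against the standard library only. This is a simplified illustration, not NLTK's nltk.classify implementation: features are word presence, probabilities use add-one smoothing, and only words present in the input are scored. The function names and the tiny training set are hypothetical.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature generation: the set of lowercased words in the text."""
    return set(text.lower().split())


def train(labeled_texts):
    """Training: count labels and per-label feature occurrences."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        features = extract_features(text)
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocabulary |= features
    return label_counts, feature_counts, vocabulary


def predict(model, text):
    """Prediction: return the label with the highest log-probability."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    # Words never seen in training carry no information, so drop them.
    features = extract_features(text) & vocabulary
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for feature in features:
            # Add-one smoothed estimate of P(feature present | label).
            score += math.log((feature_counts[label][feature] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training texts.
training_data = [
    ("great fun movie", "pos"),
    ("wonderful great acting", "pos"),
    ("boring bad plot", "neg"),
    ("bad awful acting", "neg"),
]
model = train(training_data)
print(predict(model, "great acting fun"))  # pos
print(predict(model, "bad boring plot"))   # neg
```

The same extract_features function is used in both phases, which matches the diagram: the classifier model only ever sees features, never raw text.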
25. Named Entity Recognition (NER) (1/2)
nltk.tag, nltk.chunk
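• Named Entity Recognition: extract named entities
from text. A named entity is a multi-word expression
with a complete meaning of its own, for example
a person name, a place name, or an event.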
NER General Work Flow
Sentence Segmentation → Tokenization → POS Tagging → Named Entity Recognition → Relation Recognition
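To make the five stages of this flow concrete, the toy sketch below chains them using deliberately naive stand-ins built on the standard library: regex sentence splitting, regex tokenization, a capitalization-based "tagger", grouping of consecutive proper-noun tokens into entities, and pairing of entities that share a sentence as candidate relations. None of these functions are NLTK APIs or reflect how nltk.tag and nltk.chunk work internally; all names and the sample text are hypothetical.

```python
import re
from itertools import combinations


def segment_sentences(text):
    """Stage 1: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Stage 2: runs of word characters or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", sentence)


def pos_tag(tokens):
    """Stage 3 (stand-in): capitalized tokens are 'NNP', the rest 'X'."""
    return [(tok, "NNP" if tok[0].isupper() else "X") for tok in tokens]


def recognize_entities(tagged):
    """Stage 4: join consecutive 'NNP' tokens into one entity string."""
    entities, current = [], []
    for token, tag in tagged:
        if tag == "NNP":
            current.append(token)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def recognize_relations(entities):
    """Stage 5 (stand-in): every entity pair in a sentence is a candidate."""
    return list(combinations(entities, 2))


def run_pipeline(text):
    """Run all five stages and return (entities, relations) per sentence."""
    results = []
    for sentence in segment_sentences(text):
        entities = recognize_entities(pos_tag(tokenize(sentence)))
        results.append((entities, recognize_relations(entities)))
    return results


# Hypothetical input with fictional names.
pipeline_output = run_pipeline("Alice Chen visited Taipei. Bob Lin works in Tokyo.")
for entities, relations in pipeline_output:
    print(entities, relations)
```

A real pipeline would replace each stand-in with a trained component; the point here is only the order of the stages and the data passed between them.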
28. Thank you for your attention.
Q&A
We are hiring!
• Core engine algorithm R&D engineer
• System R&D engineer
• Web application R&D engineer
Oxygen Intelligence Taiwan Limited
引京聚點 知識結構搜索股份有限公司
• About the company: http://www.ezpao.com/about/
• Job openings: http://www.ezpao.com/join/
• Please send your résumé to jimmy.lai@oi-sys.com