Text classification in scikit-learn

Text classification in Scikit-learn
Jimmy Lai
r97922028 [at] ntu.edu.tw
http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
2013/06/17

Outline
1. Introduction to Data Analysis
2. Setup packages
3. Scikit-learn tutorial
4. Text Classification in Scikit-learn
2

Critical Technologies for Big Data
Analysis
• Please refer
http://www.slideshare.net/jimmy
_lai/when-big-data-meet-python
for more detail.
Collecting
User Generated
Content
Machine
Generated Data
Storage
Computing
Analysis
Visualization
Infrastructure
C/JAVA
Python/R
Javascript
3

Setup all packages on Ubuntu
• Packages required:
– pip
– Numpy
– Scipy
– Matplotlib
– Scikit-learn
– Psutil
– IPython
• Commands
sudo apt-get install python-pip
sudo apt-get build-dep python-
numpy
sudo apt-get build-dep python-scipy
sudo apt-get build-dep python-
matplotlib
# install packages in a virtualenv
pip install numpy
pip install scipy
pip matplotlib
pip install scikit-learn
pip install psutil
pip install ipython
4

Setup IPython Notebook
• Install:
$ pip install ipython
• Create config:
$ ipython profile create
• Edit config:
– c.NotebookApp.certfile =
u’cert_file’
– c.NotebookApp.password =
u’hashed_password’
– c.IPKernelApp.pylab = 'inline'
• Run server:
$ ipython notebook --ip=* --
port=9999
• Generate cert_file:
$ openssl req -x509 -nodes -days
365 -newkey rsa:1024 -keyout
mycert.pem -out mycert.pem
• Generate
hashed_password:
In [1]: from IPython.lib import
passwd
In [2]: passwd()
Via http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
5

Fast prototyping - IPython Notebook
• Write python code in browser:
– Exploit the remote server resources
– View the graphical results in web page
– Sketch code pieces as blocks
– Refer http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-
prototyping-using-ipython-notebook for more introduction.
6

Scikit-learn Cheat-sheet
Via http://peekaboo-vision.blogspot.tw/2013/01/machine-learning-cheat-sheet-for-scikit.html
7

Scikit-learn Tutorial
• https://github.com/ogrisel/parallel_ml_tutorial
8

Demo Code
• Demo Code:
ipython_demo/text_classification_demo.ipynb
in https://bitbucket.org/noahsark/slideshare
• Ipython Notebook:
– Install
$ pip install ipython
– Execution (under ipython_demo dir)
$ ipython notebook --pylab=inline
– Open notebook with browser, e.g.
http://127.0.0.1:8888
9

Machine learning classification
• 𝑋𝑖 = [𝑥1, 𝑥2, … , 𝑥 𝑛], 𝑥 𝑛 ∈ 𝑅
• 𝑦𝑖 ∈ 𝑁
• 𝑑𝑎𝑡𝑎𝑠𝑒𝑡 = 𝑋, 𝑌
• 𝑐𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑒𝑟 𝑓: 𝑦𝑖 = 𝑓(𝑋𝑖)
10

Text classification
Feature
Generation
Feature
Selection
Classification
Model Training
Model
Parameter
Tuning
11

From: zyeh@caspian.usc.edu (zhenghao yeh)
Subject: Re: Newsgroup Split
Organization: University of Southern California, Los Angeles, CA
Lines: 18
Distribution: world
NNTP-Posting-Host: caspian.usc.edu
In article <1quvdoINN3e7@srvr1.engin.umich.edu>, tdawson@engin.umich.edu
(Chris Herringshaw) writes:
|> Concerning the proposed newsgroup split, I personally am not in favor of
|> doing this. I learn an awful lot about all aspects of graphics by reading
|> this group, from code to hardware to algorithms. I just think making 5
|> different groups out of this is a wate, and will only result in a few posts
|> a week per group. I kind of like the convenience of having one big forum
|> for discussing all aspects of graphics. Anyone else feel this way?
|> Just curious.
|>
|>
|> Daemon
|>
I agree with you. Of cause I'll try to be a daemon :-)
Yeh
USC
Dataset:
20 newsgroups
dataset
Text
Structured Data
12

Dataset in sklearn
• sklearn.datasets
– Toy datasets
– Download data from http://mldata.org repository
• Data format of classification problem
– Dataset
• data: [raw_data or numerical]
• target: [int]
• target_names: [str]
13

Feature extraction from structured
data (1/2)
• Count the frequency of
keyword and select the
keywords as features:
['From', 'Subject',
'Organization',
'Distribution', 'Lines']
• E.g.
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Organization: University of Maryland, College
Park
Distribution: None
Lines: 15
Keyword Count
Distribution 2549
Summary 397
Disclaimer 125
File 257
Expires 116
Subject 11612
From 11398
Keywords 943
Originator 291
Organization 10872
Lines 11317
Internet 140
To 106
14

Feature extraction from structured
data (2/2)
• Separate structured
data and text data
– Text data start from
“Line:”
• Transform token matrix
as numerical matrix by
sklearn.feature_extract
ionDictVectorizer
• E.g.
[{‘a’: 1, ‘b’: 1}, {‘c’: 1}] =>
[[1, 1, 0], [0, 0, 1]]
15

Text Feature extraction in sklearn
• sklearn.feature_extraction.text
• CountVectorizer
– Transform articles into token-count matrix
• TfidfVectorizer
– Transform articles into token-TFIDF matrix
• Usage:
– fit(): construct token dictionary given dataset
– transform(): generate numerical matrix
16

Text Feature extraction
• Analyzer
– Preprocessor: str -> str
• Default: lowercase
• Extra: strip_accents – handle unicode chars
– Tokenizer: str -> [str]
• Default: re.findall(ur"(?u)bww+b“, string)
– Analyzer: str -> [str]
1. Call preprocessor and tokenizer
2. Filter stopwords
3. Generate n-gram tokens
17

Feature Selection
• Decrease the number of features:
– Reduce the resource usage for faster learning
– Remove the most common tokens and the most
rare tokens (words with less information):
• Parameter for Vectorizer:
– max_df
– min_df
– max_features
19

Classification Model Training
• Common classifiers in sklearn:
– sklearn.linear_model
– sklearn.svm
• Usage:
– fit(X, Y): train the model
– predict(X): get predicted Y
20

Cross Validation
• When tuning the parameters of model, let
each article as training and testing data
alternately to ensure the parameters are not
dedicated to some specific articles.
– from sklearn.cross_validation import KFold
– for train_index, test_index in KFold(10, 2):
• train_index = [5 6 7 8 9]
• test_index = [0 1 2 3 4]
21

Performance Evaluation
• 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝑡𝑝
𝑡𝑝+𝑓𝑝
• 𝑟𝑒𝑐𝑎𝑙𝑙 =
𝑡𝑝
𝑡𝑝+𝑓𝑛
• 𝑓1𝑠𝑐𝑜𝑟𝑒 = 2
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛×𝑟𝑒𝑐𝑎𝑙𝑙
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛+𝑟𝑒𝑐𝑎𝑙𝑙
• sklearn.metrics
– precision_score
– recall_score
– f1_score
Source: http://en.wikipedia.org/wiki/Precision_and_recall
22

Visualization
1. Matplotlib
2. plot() function of Series, DataFrame
23

Experiment Result
• Future works:
– Feature selection by statistics or dimension reduction
– Parameter tuning
– Ensemble models
24

Text classification in scikit-learn

Related slideshows

More Related Content

Text classification in scikit-learn