Text classification in Scikit-learn
Jimmy Lai
r97922028 [at] ntu.edu.tw
http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
2013/06/17
Outline
1. Introduction to Data Analysis
2. Setup packages
3. Scikit-learn tutorial
4. Text Classification in Scikit-learn
Critical Technologies for Big Data Analysis
• Please refer to
http://www.slideshare.net/jimmy_lai/when-big-data-meet-python
for more detail.
[Diagram: a big-data pipeline — collecting user-generated content and machine-generated data, then storage, computing, analysis, and visualization on shared infrastructure; typical languages: C/Java for infrastructure, Python/R for analysis, JavaScript for visualization.]
Setup all packages on Ubuntu
• Packages required:
– pip
– Numpy
– Scipy
– Matplotlib
– Scikit-learn
– Psutil
– IPython
• Commands:
sudo apt-get install python-pip
sudo apt-get build-dep python-numpy
sudo apt-get build-dep python-scipy
sudo apt-get build-dep python-matplotlib
# install packages in a virtualenv
pip install numpy
pip install scipy
pip install matplotlib
pip install scikit-learn
pip install psutil
pip install ipython
Setup IPython Notebook
• Install:
$ pip install ipython
• Create config:
$ ipython profile create
• Edit config:
– c.NotebookApp.certfile = u'cert_file'
– c.NotebookApp.password = u'hashed_password'
– c.IPKernelApp.pylab = 'inline'
• Run server:
$ ipython notebook --ip=* --port=9999
• Generate cert_file:
$ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
• Generate hashed_password:
In [1]: from IPython.lib import passwd
In [2]: passwd()
Via http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
Fast prototyping - IPython Notebook
• Write Python code in the browser:
– Exploit the remote server's resources
– View graphical results in the web page
– Sketch code pieces as blocks
– Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for more introduction.
Scikit-learn Cheat-sheet
Via http://peekaboo-vision.blogspot.tw/2013/01/machine-learning-cheat-sheet-for-scikit.html
Scikit-learn Tutorial
• https://github.com/ogrisel/parallel_ml_tutorial
Demo Code
• Demo code:
ipython_demo/text_classification_demo.ipynb
in https://bitbucket.org/noahsark/slideshare
• IPython Notebook:
– Install:
$ pip install ipython
– Execution (under the ipython_demo dir):
$ ipython notebook --pylab=inline
– Open the notebook in a browser, e.g.
http://127.0.0.1:8888
Machine learning classification
• X_i = [x_1, x_2, …, x_n], x_n ∈ ℝ
• y_i ∈ ℕ
• dataset = (X, Y)
• classifier f: y_i = f(X_i)
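A concrete sketch of the notation above (data values and the choice of model here are illustrative, not from the slides):

```python
from sklearn.linear_model import LogisticRegression

# Each row of X is a feature vector X_i = [x_1, ..., x_n];
# y holds the integer class labels y_i.
X = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [0.9, 0.0]]
y = [0, 0, 1, 1]

clf = LogisticRegression()        # the classifier f
clf.fit(X, y)                     # learn f from (X, Y)
pred = clf.predict([[0.1, 0.9]])  # y_i = f(X_i)
```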
Text classification
1. Feature Generation
2. Feature Selection
3. Classification Model Training
4. Model Parameter Tuning
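The stages above map naturally onto scikit-learn objects; a minimal sketch (the documents and labels are invented for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["graphics card drivers", "newsgroup split discussion",
        "opengl rendering code", "moderation and posting rules"]
labels = [0, 1, 0, 1]

pipeline = Pipeline([
    # feature generation (plus crude selection via max_df)
    ('vect', TfidfVectorizer(max_df=0.9)),
    # classification model training
    ('clf', LogisticRegression()),
])
pipeline.fit(docs, labels)
```

Model parameter tuning would then search over pipeline parameters such as vect__max_df and clf__C, e.g. with a grid search.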
From: zyeh@caspian.usc.edu (zhenghao yeh)
Subject: Re: Newsgroup Split
Organization: University of Southern California, Los Angeles, CA
Lines: 18
Distribution: world
NNTP-Posting-Host: caspian.usc.edu
In article <1quvdoINN3e7@srvr1.engin.umich.edu>, tdawson@engin.umich.edu
(Chris Herringshaw) writes:
|> Concerning the proposed newsgroup split, I personally am not in favor of
|> doing this. I learn an awful lot about all aspects of graphics by reading
|> this group, from code to hardware to algorithms. I just think making 5
|> different groups out of this is a wate, and will only result in a few posts
|> a week per group. I kind of like the convenience of having one big forum
|> for discussing all aspects of graphics. Anyone else feel this way?
|> Just curious.
|>
|>
|> Daemon
|>
I agree with you. Of cause I'll try to be a daemon :-)
Yeh
USC
Dataset: the 20 newsgroups dataset. Each article mixes structured data (the headers) with free text (the body).
Dataset in sklearn
• sklearn.datasets
– Toy datasets
– Download data from http://mldata.org repository
• Data format of classification problem
– Dataset
• data: [raw_data or numerical]
• target: [int]
• target_names: [str]
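A toy dataset illustrates this format; the 20 newsgroups text data follows the same shape via sklearn.datasets.fetch_20newsgroups (which downloads on first use, so it is not run here):

```python
from sklearn.datasets import load_iris

dataset = load_iris()
print(dataset.data.shape)          # numerical feature matrix
print(dataset.target[:3])          # integer labels
print(list(dataset.target_names))  # label names as strings
```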
Feature extraction from structured data (1/2)
• Count the frequency of each header keyword and select the common
keywords as features:
['From', 'Subject', 'Organization', 'Distribution', 'Lines']
• E.g.
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Organization: University of Maryland, College Park
Distribution: None
Lines: 15
• Keyword counts across the corpus:
Keyword       Count
Distribution   2549
Summary         397
Disclaimer      125
File            257
Expires         116
Subject       11612
From          11398
Keywords        943
Originator      291
Organization  10872
Lines         11317
Internet        140
To              106
Feature extraction from structured data (2/2)
• Separate structured data from text data
– The text data starts after the "Lines:" header
• Transform the token matrix into a numerical matrix with
sklearn.feature_extraction.DictVectorizer
• E.g.
[{'a': 1, 'b': 1}, {'c': 1}] =>
[[1, 1, 0], [0, 0, 1]]
Text Feature extraction in sklearn
• sklearn.feature_extraction.text
• CountVectorizer
– Transform articles into token-count matrix
• TfidfVectorizer
– Transform articles into token-TFIDF matrix
• Usage:
– fit(): construct token dictionary given dataset
– transform(): generate numerical matrix
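A minimal fit/transform round trip (the two toy documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat and the dog"]
vect = CountVectorizer()
vect.fit(docs)            # construct the token dictionary
X = vect.transform(docs)  # sparse token-count matrix
print(sorted(vect.vocabulary_))  # ['and', 'cat', 'dog', 'sat', 'the']
print(X.toarray())
```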
Text Feature extraction
• Analyzer
– Preprocessor: str -> str
• Default: lowercase
• Extra: strip_accents – handle unicode chars
– Tokenizer: str -> [str]
• Default: re.findall(ur"(?u)\b\w\w+\b", string)
– Analyzer: str -> [str]
1. Call preprocessor and tokenizer
2. Filter stopwords
3. Generate n-gram tokens
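These pieces can be exercised directly through build_analyzer(); the sketch below turns on stop-word filtering and bigrams (parameters chosen for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(lowercase=True, strip_accents='unicode',
                       stop_words='english', ngram_range=(1, 2))
analyze = vect.build_analyzer()  # preprocessor + tokenizer + n-grams
tokens = analyze("The Quick Brown Fox")
print(tokens)  # unigrams first, then bigrams; 'the' is a stop word
```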
Feature Selection
• Decrease the number of features:
– Reduce resource usage for faster learning
– Remove the most common and the rarest tokens (less informative words)
• Parameters for Vectorizer:
– max_df
– min_df
– max_features
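A sketch of max_df in action (toy documents): tokens whose document frequency exceeds the threshold are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["common word apple", "common word banana",
        "common word cherry", "common word durian"]
# 'common' and 'word' occur in every document (df = 1.0 > 0.5),
# so max_df=0.5 removes them; the rare fruit tokens survive min_df=1
vect = CountVectorizer(max_df=0.5, min_df=1)
vect.fit(docs)
print(sorted(vect.vocabulary_))  # ['apple', 'banana', 'cherry', 'durian']
```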
Classification Model Training
• Common classifiers in sklearn:
– sklearn.linear_model
– sklearn.svm
• Usage:
– fit(X, Y): train the model
– predict(X): get predicted Y
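The fit/predict contract in miniature (data invented; LinearSVC stands in for any sklearn.svm or sklearn.linear_model classifier):

```python
from sklearn.svm import LinearSVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]  # label follows the first feature

clf = LinearSVC()
clf.fit(X, y)                   # train the model
print(clf.predict([[1, 0.5]]))  # predicted Y for new X
```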
Cross Validation
• When tuning model parameters, use each article alternately as
training and as testing data, so the parameters do not overfit to
specific articles.
– from sklearn.cross_validation import KFold
– for train_index, test_index in KFold(10, 2):
• train_index = [5 6 7 8 9]
• test_index = [0 1 2 3 4]
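The slide uses the 2013-era API; in current scikit-learn, KFold lives in sklearn.model_selection and the data is passed to split(). The same first fold comes out:

```python
from sklearn.model_selection import KFold

X = list(range(10))  # stand-ins for ten articles
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print(train_index, test_index)
# first fold: train [5 6 7 8 9], test [0 1 2 3 4]
```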
Performance Evaluation
• precision = tp / (tp + fp)
• recall = tp / (tp + fn)
• F1 score = 2 × (precision × recall) / (precision + recall)
• sklearn.metrics
– precision_score
– recall_score
– f1_score
Source: http://en.wikipedia.org/wiki/Precision_and_recall
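Worked on a small invented prediction vector (for the positive class tp = 2, fp = 1, fn = 1, so all three scores come out to 2/3):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_score(y_true, y_pred))  # tp/(tp+fp) = 2/3
print(recall_score(y_true, y_pred))     # tp/(tp+fn) = 2/3
print(f1_score(y_true, y_pred))         # 2/3
```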
Visualization
1. Matplotlib
2. plot() function of Series, DataFrame
Experiment Result
• Future work:
– Feature selection by statistics or dimensionality reduction
– Parameter tuning
– Ensemble models
