18. A: I really like eggs
    B: I don't like cabbage, and don't like stew
    C: I really, really like stew

         i    really  like  eggs  cabbage  and   don't  stew
    A    0    0.58    0     1.58  0        0     0      0
    B    0    0       0     0     1.58     1.58  3.16   0.58
    C    0    1.17    0     0     0        0     0      0.58
19. A: I really like eggs
    B: I don't like cabbage, and don't like stew
    C: I really, really like stew

         i    really  like  eggs  cabbage  and   don't  stew
    A    0    0.35    0     0.94  0        0     0      0
    B    0    0       0     0     0.31     0.31  0.63   0.11
    C    0    0.89    0     0     0        0     0      0.44
Hello!
PSC @ Ibuildings
Twitter
Email
Blog - related posts
This is what you need to do to implement a classifier
And also our table of contents
A note on PHP: questions at the end, but ask syntax questions straight away
First, talk about what and why?
What is it - Assign documents to classes from a predefined set
Classes can be any label - e.g. topic words, categories
Documents in this case are text - web pages, emails, books
But it can be really anything as long as you can extract features from it
Algos not hard, applicable in all langs. Python/Java have good library versions
So - Why do in PHP? Integrate into web apps - WP, Drupal, MediaWiki
Classification is really the organising of information - something we do every day
Lots of uses - can group into common tasks of filter, organise, add metadata
Might do all three with uploading photos to flickr or facebook
Filter, get rid of bad ones.
Organise, upload to album or set
Tag photos with people in them etc.
Filtering is binary - Class OR Not Class
- often hide or remove one lot
Can often break the other types down into a series of these binary choices
BUT: simple, not easy. In flickr example, what is good?
- photographer, composition, light etc.
- regular person, contains their friends etc. - SUBJECTIVE
Organising is putting document in one place - one label chosen from a set of many possible
Single label only (often EXACTLY 1, 0 not allowed)
Folders, albums, libraries, handwriting recognition
Tagging, can have multiple
Often 0 to many labels
Often for tagging topics in content
E.g. a news story on US-China embargo talks might be filed under: US, China, Trade
In 80s people would come up with rules - computers would apply
IF this term AND this term THEN this category
Took a lot of time - needed a domain expert - needed a knowledge engineer to get the knowledge out of the expert
Hard to scale, need more experts for new categories - subjective - experts disagree
Usually result was 60%-90% accurate
Machine Learning - ‘look at examples’ - Supervised Learning
Work out rules based on manually classified data
People don’t need to explain their thinking - just organise - easier
Scales better, is cheaper, and about as accurate!
In the picture, it’s easy to see by looking at the groupings what the ‘rule’ for classifying m&ms is
So what do we need?
1. the classes to classify to
2. A set of manually classified documents to train the classifier on
3. A set of manually classified docs to test on
In some cases may have a third set of manually classified docs for validation
How do we use these? We train a learner on training data to create a model
Then use the model to classify each test document
Compare manual to automatic judgements
Here we’ve got a binary classification, for a spam checker
Top is the manual judgement, side is classifier judgement
Boxes will just be counts of judgements
With that we can calculate some stats
Accuracy is just the percentage correct - BUT on a heavily biased set, saying 'no' to everything scores as accurate. Or we may want bias, e.g. preferring FN over FP with spam
Precision measures how many of the predicted positives are true positives
Recall measures what percentage of the available positives we capture
Can have one without the other: a high threshold for precision, mark everything positive for recall
Researchers quote breakeven or fbeta
To compare classifiers, researchers often quote breakeven point
This is just where recall and precision are equal
F-Beta allows weighting precision more than recall, or vice versa.
Beta = 1 is balanced
If beta = 0.5, recall is half as important as precision, as in a spam checker
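As a minimal PHP sketch of these measures (function names and example counts are mine, not from the slides):

    <?php
    // Counts come from the confusion matrix: true positives, true
    // negatives, false positives, false negatives.
    function accuracy($tp, $tn, $fp, $fn) {
        return ($tp + $tn) / ($tp + $tn + $fp + $fn);
    }

    function precision($tp, $fp) {
        return $tp / ($tp + $fp);
    }

    function recall($tp, $fn) {
        return $tp / ($tp + $fn);
    }

    // F-beta: beta < 1 weights precision more, beta > 1 weights recall more.
    function fBeta($precision, $recall, $beta = 1.0) {
        $b2 = $beta * $beta;
        return (1 + $b2) * ($precision * $recall) / ($b2 * $precision + $recall);
    }

    // A spam checker might use beta = 0.5: recall half as important.
    echo fBeta(precision(80, 5), recall(80, 20), 0.5);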
Before classifying, we need to extract features. How do we represent text?
All this work is classic Information Retrieval
Bag of Words is so called because we discard the structure, and just note down appearances of words
Throw away the ordering, any structure at all from web pages etc.
See why called vector space in a couple of slides
First we have to extract words
Simplest version: take continuous sequences of word characters
Ignore all punctuation including apostrophes etc.
Each new token we find in each document will be added to a dictionary
Each document has a vector - there is a dimension for each dictionary word
Value is 1 if the document contained that token, 0 if it did not
Here is the collection of these two phrases as a vector.
1 if the word is in the document, 0 if not
Note both vectors have the same dimensions
In a real document collection there are lots of dimensions!
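A sketch of this step in PHP (a deliberately simple tokeniser; all names are mine):

    <?php
    // Lowercase, then take continuous runs of word characters,
    // ignoring punctuation.
    function tokenise($text) {
        return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    // Build the shared dictionary, then a 1/0 vector per document.
    function binaryVectors(array $texts) {
        $dictionary = array();
        $tokenised  = array();
        foreach ($texts as $id => $text) {
            $tokenised[$id] = tokenise($text);
            foreach ($tokenised[$id] as $token) {
                $dictionary[$token] = true;
            }
        }
        $vectors = array();
        foreach ($tokenised as $id => $tokens) {
            foreach (array_keys($dictionary) as $term) {
                $vectors[$id][$term] = in_array($term, $tokens) ? 1 : 0;
            }
        }
        return $vectors; // every vector has the same dimensions
    }

    print_r(binaryVectors(array('A' => 'I really like eggs',
                                'B' => 'I do not like cabbage')));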
We can plot the documents on a graph - using 2 terms ‘i’ and ‘really’
Here the green circle is document A, the red triangle document B
The documents on the last slide are in 8-dimensional space - 8 terms
But we want more information - how important a term is to a document
Need to capture a position in that dimension other than 0 & 1
A weight
TFIDF is a classic and very common weighting - there are a lot of variations though
TF is just the count of instances of that term in the doc
IDF is the log of the number of docs divided by the number containing the term (log base 2 in these examples)
Gives less common terms a higher weight
So best is uncommon term that appears a lot
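A PHP sketch of the weighting; the example slides use log base 2, so this does too (names mine):

    <?php
    // tf:  count of the term in the document
    // idf: log2(total docs / docs containing the term)
    function tfIdf($termCountInDoc, $totalDocs, $docsWithTerm) {
        return $termCountInDoc * log($totalDocs / $docsWithTerm, 2);
    }

    // 'eggs' in doc A: appears once, in 1 of the 3 docs.
    echo tfIdf(1, 3, 1); // 1.58, matching the slide 18 table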
Lets look at a similar example to before, with some term weights added
The idf means that the ‘i’ and ‘like’ actually disappear here
In all docs - no distinguishing power - no value to doc
'Don't' gets weighted higher in B - it appears twice
Then normalise to unit length
Normalising is just each value divided by the total length
(the sqrt of the sum of the squared values)
I and Like still 0 though
Waste of time processing
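The normalisation step as a PHP sketch (names mine):

    <?php
    // Divide each weight by the vector length: sqrt of the sum of squares.
    function normalise(array $vector) {
        $sumOfSquares = 0.0;
        foreach ($vector as $weight) {
            $sumOfSquares += $weight * $weight;
        }
        $length = sqrt($sumOfSquares);
        if ($length == 0.0) {
            return $vector; // all-zero vector, nothing to scale
        }
        foreach ($vector as $term => $weight) {
            $vector[$term] = $weight / $length;
        }
        return $vector;
    }

    // Doc A from slide 18 becomes the 0.35 / 0.94 row of slide 19.
    print_r(normalise(array('really' => 0.58, 'eggs' => 1.58)));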
Maybe there are others that are a waste of time?
DR or term space reduction is removing terms that don’t contribute much
This can often be by a factor of 10 or 100
Speeds up execution
May have heard of stop words - Common in search engines
Words like 'of', 'the', 'an' - or 'het', 'de' in Dutch
Little to no semantic value to us
Can use a stoplist of words, or infer it from low idf scores
Collection stop words
'Pokemon' in general English: not a stop word. 'Pokemon' on a Pokemon forum: a stop word.
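Reusing tokenise() from the earlier sketch, filtering is a one-liner (the stoplist is just the examples above, and $text is assumed):

    <?php
    // Remove stoplist words from the token stream; the list could also
    // be inferred from low IDF scores across the collection.
    $stopWords = array('of', 'the', 'an', 'het', 'de');
    $tokens    = array_diff(tokenise($text), $stopWords);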
Try to come up with a 'root' word
Maps lots of different variations onto one term, reducing dimensions
The result is usually not a real word, it's just repeatable
Chi-square - 'chi' as in the Greek letter, not Chinese - helps choose indicative terms for each class
Statistical technique - Calculates how related a term is to a class
Take 4 Counts from data. How many spam docs contain term etc.
We look for difference between expected and actual counts
For a given cell Expected is the row sum * col sum / total
Square the difference, divide by the expected value, and add them all up
Plug the numbers into this formula: a one step way of doing the same thing
Comes out with a number - not interesting absolutely
But is interesting relatively
Chi-square is a distribution, so we can calculate a probability of the events being unrelated using the area from this distribution
1 DF because there is one variable (term) and one dependent (class)
P is the chance that variables are independent
For > 10.83 we are 99.9% certain the variables change together
Can work out the probability number from a chi-square distribution
But for DR, can just use a threshold and remove terms below
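The slide's formula isn't reproduced in these notes; below is the standard one-step shortcut for a 2×2 table, which should be equivalent (names and example counts are mine):

    <?php
    // Chi-square via the 2x2 shortcut: N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D))
    //   $a = spam docs with the term     $b = ham docs with the term
    //   $c = spam docs without the term  $d = ham docs without the term
    function chiSquare($a, $b, $c, $d) {
        $n = $a + $b + $c + $d;
        return $n * pow($a * $d - $b * $c, 2)
             / (($a + $b) * ($c + $d) * ($a + $c) * ($b + $d));
    }

    // For DR, just keep terms over a threshold, e.g. 10.83 for 99.9%.
    var_dump(chiSquare(150, 10, 1850, 1990) > 10.83); // true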
OK, so we’ve got a good set of data, now we need a learner
Tree of 'has term?' questions - ends in a class decision
Easy to classify, and recursive building algorithm pretty easy
Algo is: if the whole collection is one class, then make a leaf of that class
Else, choose the best term - Split into 2 collections, WITH and WITHOUT term
Recurse on each half
But how does it determine best?
First, calculate entropy
Take counts for how many docs in total, how many spam, how many ham
The minus section (one -p·log2(p) term per class) is repeated for multiple classes
Represents the number of bits needed to encode the class of a randomly chosen document from this set
How much new information we get - Easier to see on graph
Percentage of spam on the horizontal, entropy on the vertical
If all spam or no spam no entropy - we know what will come out
If 50/50 entropy is 1 - we can’t guess ahead of time
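As a PHP sketch (function name mine; binary case, as on the slide):

    <?php
    // Entropy of a set: -p(spam)·log2(p(spam)) - p(ham)·log2(p(ham)).
    // Add one minus term per class for multi-class problems.
    function entropy($spamCount, $hamCount) {
        $total  = $spamCount + $hamCount;
        $result = 0.0;
        foreach (array($spamCount, $hamCount) as $count) {
            if ($count > 0) {
                $p = $count / $total;
                $result -= $p * log($p, 2);
            }
        }
        return $result;
    }

    echo entropy(50, 50);  // 1 - can't guess ahead of time
    echo entropy(100, 0);  // 0 - we know what will come out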
We want to reduce entropy - so that the sets are more consistent
We're using the entropy to calculate the maximum information gain
This is the overall reduction in entropy
The original entropy minus the new entropy
New is weighted by the proportion of docs in each group
withCount is the number of docs that have the feature
woutCount is the number without, total is the total
The split is how many of each class are in the group
The entropy is calculated with the formula before
The proportion is just the percentage of the total documents
Final col is just entropy times proportion
Note that the 'with' group is very biased, with a low entropy
BUT - only a small proportion, so the final information gain is low
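Putting that into a PHP sketch, reusing entropy() from the sketch above (count names follow the withCount/woutCount ones just mentioned):

    <?php
    // Gain = original entropy minus the entropy after splitting on the
    // feature, each group weighted by its proportion of the documents.
    function informationGain($spam, $ham, $withSpam, $withHam) {
        $total     = $spam + $ham;
        $withCount = $withSpam + $withHam;
        $woutCount = $total - $withCount;

        $newEntropy = ($withCount / $total) * entropy($withSpam, $withHam)
                    + ($woutCount / $total) * entropy($spam - $withSpam, $ham - $withHam);

        return entropy($spam, $ham) - $newEntropy;
    }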
Easy to implement recursive builder
If ‘spam’ or ‘ham’ are empty - we say the tree is a leaf node.
If not, we find the term with the highest info gain
And build a subtree based on the sets of docs with and without the term
Just need to traverse to classify
A completely made-up example of an output tree.
Millions of ways to do this, of course!
Simple function to return leaf node
Assumes document is an array of words
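The slide code itself isn't in these notes, so this is my own minimal sketch of both the builder and the traversal; it assumes each doc is an array with 'words' and 'class' keys, plus a hypothetical bestTerm() helper that picks the highest information-gain term:

    <?php
    // Recursive builder: a leaf when only one class is left, otherwise
    // split on the best term and recurse on the with/without halves.
    // (Real code also needs the stop conditions discussed next.)
    function buildTree(array $docs) {
        $classes = array();
        foreach ($docs as $doc) {
            $classes[$doc['class']] = true;
        }
        if (count($classes) == 1) {
            return array('leaf' => key($classes));
        }
        $term    = bestTerm($docs); // hypothetical: highest information gain
        $with    = array();
        $without = array();
        foreach ($docs as $doc) {
            if (in_array($term, $doc['words'])) {
                $with[] = $doc;
            } else {
                $without[] = $doc;
            }
        }
        return array(
            'term'    => $term,
            'with'    => buildTree($with),
            'without' => buildTree($without),
        );
    }

    // Classifying is just traversing 'has term?' questions to a leaf.
    function classify(array $tree, array $words) {
        if (isset($tree['leaf'])) {
            return $tree['leaf'];
        }
        $branch = in_array($tree['term'], $words) ? 'with' : 'without';
        return classify($tree[$branch], $words);
    }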
Problem: Tree gets too specific to training data - Need to generalise
Stop condition - min info gain or other
Pruning - test by trimming off bottom parts of tree
Use validation set to test effectiveness of measures
DTs generate human interpretable rules - very handy
BUT expensive to train, needs a small number of dimensions, and often requires rebuilding
KNN is much cheaper at training time - as there is no training
Recall we can regard documents as vectors in a N-dimensional space
Where N is the size of the dictionary
Lets consider only 2 terms
Docs with weights for terms X and Y
Documents of class triangle and class circle
They seem to have a spatial cluster
This is also true in higher dimension for real documents
Class of new doc = class of its K nearest neighbours
The K is how many we look for
In this case K is three, and the nearest three are all green circles.
Choosing K is kind of hard; you might try a few different values, but it's usually in the 11-30 doc range - and not a multiple of the number of classes, to avoid ties
Only real challenge is comparing documents
Here we are looking at just the X and Y distance - this is the Euclidean distance
Very easy. Simply looking at the difference between one and the other
Can actually do the whole thing in the database !
But, has some problems, so more common...
Cosine similarity: a similarity measure that goes to 1 for identical docs, 0 for orthogonal
Easy to do with normalised vectors - just take dot product
Multiply each dimension in Doc A with it in Doc B, and sum
Provides better matching than Euclidean
We could just loop over documents, find K most similar
But search engines do a very similar job - why not use one?
Two options when classifying: count most common or add similarities
The second helps, e.g. if there are 5 good matches in class A and 10 poor matches in class B
For multiple class tagging: use thresholds
BUT: Have to compare all documents
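A PHP sketch of both steps, assuming unit-length vectors keyed by term and training docs stored as array('vector' => ..., 'class' => ...) (my own layout, not the slides'):

    <?php
    // Dot product of two normalised vectors = cosine similarity.
    function cosine(array $a, array $b) {
        $sum = 0.0;
        foreach ($a as $term => $weight) {
            if (isset($b[$term])) {
                $sum += $weight * $b[$term];
            }
        }
        return $sum;
    }

    // kNN with the 'add similarities' voting variant.
    function knnClassify(array $training, array $docVector, $k = 15) {
        $scores = array();
        foreach ($training as $id => $doc) {
            $scores[$id] = cosine($doc['vector'], $docVector);
        }
        arsort($scores);

        $votes = array();
        foreach (array_slice(array_keys($scores), 0, $k) as $id) {
            $class = $training[$id]['class'];
            $votes[$class] = (isset($votes[$class]) ? $votes[$class] : 0)
                           + $scores[$id];
        }
        arsort($votes);
        return key($votes); // for multi-class tagging, threshold $votes instead
    }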
Search engines do a very similar job, use similar scoring. Why not use one?
We can use Zend Framework's native PHP implementation of Lucene
We add an unindexed ‘class’ field, and our contents
We would loop over our training data this way, adding documents
Then, we construct a query.
Use the same analyser, so test documents are tokenised the same way as the training data
And take a count of how often each word appears
We don’t have IDF, so we’re just filtering short words
Construct a query with the top 50 words by term frequency
Results: take the most common class
Works OK, not great.
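The actual slide code isn't in these notes; this sketch is from memory of the ZF1 Zend_Search_Lucene API (reusing tokenise() from earlier), so treat the details as approximate:

    <?php
    require_once 'Zend/Search/Lucene.php';

    // Train: index the docs with an unindexed 'class' field and
    // unstored (but indexed) contents.
    $index = Zend_Search_Lucene::create('/tmp/classifier-index');
    foreach ($trainingDocs as $doc) {
        $luceneDoc = new Zend_Search_Lucene_Document();
        $luceneDoc->addField(Zend_Search_Lucene_Field::UnIndexed('class', $doc['class']));
        $luceneDoc->addField(Zend_Search_Lucene_Field::UnStored('contents', $doc['text']));
        $index->addDocument($luceneDoc);
    }

    // Classify: no IDF available, so filter short words, query with the
    // top 50 by term frequency, take the most common class in the hits.
    $tokens = array_filter(tokenise($newDocText), function ($word) {
        return strlen($word) > 3;
    });
    $counts = array_count_values($tokens);
    arsort($counts);
    $query = implode(' ', array_slice(array_keys($counts), 0, 50));

    $votes = array();
    foreach ($index->find($query) as $hit) {
        $votes[$hit->class] = isset($votes[$hit->class]) ? $votes[$hit->class] + 1 : 1;
    }
    arsort($votes);
    echo key($votes);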
With Java Lucene you can get a term vector - includes the true weights
We aren’t limited to using pure PHP search engines though
Flax is based on the open source Xapian engine - kind of Xapian's equivalent of Solr
Has a similarity search that makes KNN ridiculously easy and very effective
It works along the same lines as before, but extracts a set of relevant terms from the document or documents in question
The weighting scheme is BM25 - more advanced
This code creates a database, adds two fields to it, and indexes a document
Uses a restful web service - available from any language
Very similar to the Lucene loop
Except we add, then remove, a document to use the searchSimilar feature
Gets good accuracy and is really fast.
However, if we want to use this kind of technique and don't have a Flax server handy, there is another related technique
Instead of taking each value and comparing it
We take the average of all the documents in each class
And compare against that
Very easy
This works surprisingly well!
Here we compute the centroid, or average, of all the docs in each class
By summing the weight * 1/count.
You might do this in the database, pretty straightforward op.
Called a Rocchio classifier - pronounced 'rohk-key-oh' - because it's based on a relevance feedback technique by Rocchio
Classify by doing similarity against each - taking closest
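A sketch reusing normalise() and cosine() from the earlier sketches (layout mine):

    <?php
    // Centroid of a class: sum each term weight scaled by 1/count,
    // then normalise back to unit length.
    function centroid(array $vectors) {
        $count = count($vectors);
        $sum   = array();
        foreach ($vectors as $vector) {
            foreach ($vector as $term => $weight) {
                $sum[$term] = (isset($sum[$term]) ? $sum[$term] : 0)
                            + $weight / $count;
            }
        }
        return normalise($sum);
    }

    // Classify by similarity against each class centroid, taking the closest.
    function rocchioClassify(array $centroids, array $docVector) {
        $bestClass = null;
        $bestScore = -1.0;
        foreach ($centroids as $class => $classCentroid) {
            $score = cosine($classCentroid, $docVector);
            if ($score > $bestScore) {
                $bestScore = $score;
                $bestClass = $class;
            }
        }
        return $bestClass;
    }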
Quick and easy probability based classifier
Very commonly used in spam checking, very trendy a couple of years back
Naive assumption is that words are independent
One word does not influence chances of seeing another - not true!
BUT: Means that we don’t need an example for each combination of attributes
Bayes is good at very high dimensionality because of this
This is Bayes' theorem: Pr(Class|Doc) = Pr(Doc|Class) × Pr(Class) / Pr(Doc). Read the pipe as 'given', Pr as 'probability of'
Pr(Doc) is constant, can be dropped for ranking
Pr(Class) is either count or assumed - e.g. 60% spam = 0.6, or just use 0.5/0.5 for binary
Have to work out Pr(Doc|Class)
We calculate that by looking at the probability of the features in the docs
We can look at the data itself to calculate the term likelihoods
Conditional probability: Docs with term in class / Docs in class
We had 1757 docs with the word register in the spam class, and about 16,000 docs in the spam class, so the probability is about 0.11.
Register is more spam than ham, sent is more ham than spam
Can calculate in SQL directly
ClassCount is the number of docs in that class - from earlier query
Divide: Number of docs in class containing term / Number of docs in class
The stored value is the likelihood of seeing that term in a doc of that class
Would call once for each class
Independence assumption lets us treat probability of doc as product of probabilities of word for the given class
Loop over the terms and multiply likelihood for each class
Assumed prior of 0.5
This is the multivariate Bernoulli model - the multinomial variant instead uses the term's count in the class over the overall term count
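As a PHP sketch (names mine): $likelihoods[$class][$term] holds the stored docs-with-term / docs-in-class values from the query above, and tokenise() is reused from earlier.

    <?php
    // Naive Bayes: score = Pr(Class) × product of Pr(term|Class).
    function bayesClassify(array $likelihoods, array $words, $prior = 0.5) {
        $scores = array();
        foreach ($likelihoods as $class => $termProbs) {
            $score = $prior; // assumed prior; could use class counts instead
            foreach ($words as $word) {
                // Words unseen in training are skipped here; real code
                // would smooth the estimates.
                if (isset($termProbs[$word])) {
                    $score *= $termProbs[$word];
                }
            }
            $scores[$class] = $score;
        }
        arsort($scores);
        return key($scores); // real code often sums logs to avoid underflow
    }

    echo bayesClassify($likelihoods, tokenise($text));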
To sum up, these are the steps for a wide range of problems
Step 1: Recognising that something is a classification problem
- context-sensitive spelling, author identification, intrusion detection, finding genes in DNA
Then extract features from the docs
Apply a learner to generate a model for classifying
Something for your mental toolbox!
Thanks to the people who put their photos on flickr under Creative Commons