18. A: I really like eggs
    B: I don't like cabbage, and don't like stew
    C: I really, really like stew

         i    really  like  eggs  cabbage  and   don't  stew
    A    0    0.58    0     1.58  0        0     0      0
    B    0    0       0     0     1.58     1.58  3.16   0.58
    C    0    1.17    0     0     0        0     0      0.58
19. A: I really like eggs
    B: I don't like cabbage, and don't like stew
    C: I really, really like stew

         i    really  like  eggs  cabbage  and   don't  stew
    A    0    0.35    0     0.94  0        0     0      0
    B    0    0       0     0     0.31     0.31  0.63   0.11
    C    0    0.89    0     0     0        0     0      0.44
Hello!
PSC @ Ibuildings
Twitter
Email
Blog - related posts
This is what you need to do to implement a classifier
And also our table of contents
A note on PHP: questions at the end, but ask syntax questions straight away
First, talk about what and why?
What is it - Assign documents to classes from a predefined set
Classes can be any label - e.g. topic words, categories
Documents in this case are text - web pages, emails, books
But it can be really anything as long as you can extract features from it
Algos not hard, applicable in all langs. Python/Java have good library versions
So - Why do in PHP? Integrate into web apps - WP, Drupal, MediaWiki
Classification is really the organising of information - something we do every day
Lots of uses - can group into common tasks of filter, organise, add metadata
Might do all three with uploading photos to flickr or facebook
Filter, get rid of bad ones.
Organise, upload to album or set
Tag photos with people in them etc.
Filtering is binary - Class OR Not Class
- often hide or remove one lot
Can often break the other types down into a series of these binary choices
BUT: simple, not easy. In flickr example, what is good?
- photographer, composition, light etc.
- regular person, contains their friends etc. - SUBJECTIVE
Organising is putting document in one place - one label chosen from a set of many possible
Single label only (often EXACTLY 1, 0 not allowed)
Folders, albums, libraries, handwriting recognition
Tagging, can have multiple
Often 0 to many labels
Often for tagging topics in content
E.g. a news story on US-China embargo talks might be filed under: US, China, Trade
In 80s people would come up with rules - computers would apply
IF this term AND this term THEN this category
Took a lot of time - needed a domain expert - needed a knowledge engineer to get the knowledge out of the expert
Hard to scale, need more experts for new categories - subjective - experts disagree
Usually result was 60%-90% accurate
Machine Learning - ‘look at examples’ - Supervised Learning
Work out rules based on manually classified data
People don’t need to explain their thinking - just organise - easier
Scales better, is cheaper, and about as accurate!
In the picture, it’s easy to see by looking at the groupings what the ‘rule’ for classifying m&ms is
So what do we need?
1. the classes to classify to
2. A set of manually classified documents to train the classifier on
3. A set of manually classified docs to test on
In some cases may have a third set of manually classified docs for validation
How do we use these? We train a learner on training data to create a model
Then use the model to classify each test document
Compare manual to automatic judgements
Here we’ve got a binary classification, for a spam checker
Top is the manual judgement, side is classifier judgement
Boxes will just be counts of judgements
With that we can calculate some stats
Accuracy is just the percentage correct - BUT on a heavily biased set, saying 'no' to everything scores as accurate. Or we may want bias, e.g. preferring FN over FP with spam
Precision measures how many of the predicted positives are true positives
Recall measures what percentage of the available positives we capture
Can have one without the other: a high threshold for precision, mark everything positive for recall
Researchers quote breakeven or fbeta
To compare classifiers, researchers often quote breakeven point
This is just where recall and precision are equal
F-Beta allows weighting precision more than recall, or vice versa.
Beta = 1 is balanced
If beta = 0.5, recall is half as important as precision, as in a spam checker
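As a minimal PHP sketch of these measures (function names and example counts are mine, not from the slides):

    <?php
    // Counts come from the confusion matrix: true positives, true
    // negatives, false positives, false negatives.
    function accuracy($tp, $tn, $fp, $fn) {
        return ($tp + $tn) / ($tp + $tn + $fp + $fn);
    }

    function precision($tp, $fp) {
        return $tp / ($tp + $fp);
    }

    function recall($tp, $fn) {
        return $tp / ($tp + $fn);
    }

    // F-beta: beta < 1 weights precision more, beta > 1 weights recall more.
    function fBeta($precision, $recall, $beta = 1.0) {
        $b2 = $beta * $beta;
        return (1 + $b2) * ($precision * $recall) / ($b2 * $precision + $recall);
    }

    // A spam checker might use beta = 0.5: recall half as important.
    echo fBeta(precision(80, 5), recall(80, 20), 0.5);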
Before classifying, we need to extract features. How do we represent text?
All this work is classic Information Retrieval
Bag of Words is so called because we discard the structure, and just note down appearances of words
Throw away the ordering, any structure at all from web pages etc.
See why called vector space in a couple of slides
First we have to extract words
Simplest version: take continuous sequences of word characters
Ignore all punctuation including apostrophes etc.
Each new token we find in each document will be added to a dictionary
Each document has a vector - there is a dimension for each dictionary word
Value is 1 if the document contained that token, 0 if it did not
Here is the collection of these two phrases as a vector.
1 if the word is in the document, 0 if not
Note both vectors have the same dimensions
In a real document collection there are lots of dimensions!
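A sketch of this step in PHP (a deliberately simple tokeniser; all names are mine):

    <?php
    // Lowercase, then take continuous runs of word characters,
    // ignoring punctuation.
    function tokenise($text) {
        return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    // Build the shared dictionary, then a 1/0 vector per document.
    function binaryVectors(array $texts) {
        $dictionary = array();
        $tokenised  = array();
        foreach ($texts as $id => $text) {
            $tokenised[$id] = tokenise($text);
            foreach ($tokenised[$id] as $token) {
                $dictionary[$token] = true;
            }
        }
        $vectors = array();
        foreach ($tokenised as $id => $tokens) {
            foreach (array_keys($dictionary) as $term) {
                $vectors[$id][$term] = in_array($term, $tokens) ? 1 : 0;
            }
        }
        return $vectors; // every vector has the same dimensions
    }

    print_r(binaryVectors(array('A' => 'I really like eggs',
                                'B' => 'I do not like cabbage')));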
We can plot the documents on a graph - using 2 terms ‘i’ and ‘really’
Here the green circle is document A, the red triangle document B
The documents on the last slide are in 8-dimensional space - 8 terms
But we want more information - how important a term is to a document
Need to capture a position in that dimension other than 0 & 1
A weight
TFIDF is a classic and very common weighting - there are a lot of variations though
TF is just the count of instances of that term in the doc
IDF is the log of the number of docs divided by the number containing the term (log base 2 in these examples)
Gives less common terms a higher weight
So best is uncommon term that appears a lot
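A PHP sketch of the weighting; the example slides use log base 2, so this does too (names mine):

    <?php
    // tf:  count of the term in the document
    // idf: log2(total docs / docs containing the term)
    function tfIdf($termCountInDoc, $totalDocs, $docsWithTerm) {
        return $termCountInDoc * log($totalDocs / $docsWithTerm, 2);
    }

    // 'eggs' in doc A: appears once, in 1 of the 3 docs.
    echo tfIdf(1, 3, 1); // 1.58, matching the slide 18 table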
Lets look at a similar example to before, with some term weights added
The idf means that the ‘i’ and ‘like’ actually disappear here
In all docs - no distinguishing power - no value to doc
'Don't' gets weighted higher in B - it appears twice
Then normalise to unit length
Normalising is just each value divided by the total length
(the sqrt of the sum of the squared values)
I and Like still 0 though
Waste of time processing
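The normalisation step as a PHP sketch (names mine):

    <?php
    // Divide each weight by the vector length: sqrt of the sum of squares.
    function normalise(array $vector) {
        $sumOfSquares = 0.0;
        foreach ($vector as $weight) {
            $sumOfSquares += $weight * $weight;
        }
        $length = sqrt($sumOfSquares);
        if ($length == 0.0) {
            return $vector; // all-zero vector, nothing to scale
        }
        foreach ($vector as $term => $weight) {
            $vector[$term] = $weight / $length;
        }
        return $vector;
    }

    // Doc A from slide 18 becomes the 0.35 / 0.94 row of slide 19.
    print_r(normalise(array('really' => 0.58, 'eggs' => 1.58)));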
Maybe there are others that are a waste of time?
DR or term space reduction is removing terms that don’t contribute much
This can often be by a factor of 10 or 100
Speeds up execution
May have heard of stop words - Common in search engines
Words like 'of', 'the', 'an' - or 'het', 'de' in Dutch
Little to no semantic value to us
Can use a stoplist of words, or infer it from low idf scores
Collection stop words
'Pokemon' in general English: not a stop word. 'Pokemon' on a Pokemon forum: a stop word.
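Reusing tokenise() from the earlier sketch, filtering is a one-liner (the stoplist is just the examples above, and $text is assumed):

    <?php
    // Remove stoplist words from the token stream; the list could also
    // be inferred from low IDF scores across the collection.
    $stopWords = array('of', 'the', 'an', 'het', 'de');
    $tokens    = array_diff(tokenise($text), $stopWords);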
Try to come up with a 'root' word
Maps lots of different variations onto one term, reducing dimensions
The result is usually not a real word, it's just repeatable
Chi-square - 'chi' as in the Greek letter, not Chinese - helps choose indicative terms for each class
Statistical technique - Calculates how related a term is to a class
Take 4 Counts from data. How many spam docs contain term etc.
We look for difference between expected and actual counts
For a given cell Expected is the row sum * col sum / total
Square the difference, divide by the expected value, and add them all up
Plug the numbers into this formula: a one step way of doing the same thing
Comes out with a number - not interesting absolutely
But is interesting relatively
Chi-square is a distribution, so we can calculate a probability of the events being unrelated using the area from this distribution
1 DF because there is one variable (term) and one dependent (class)
P is the chance that variables are independent
For > 10.83 we are 99.9% certain the variables change together
Can work out the probability number from a chi-square distribution
But for DR, can just use a threshold and remove terms below
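The slide's formula isn't reproduced in these notes; below is the standard one-step shortcut for a 2×2 table, which should be equivalent (names and example counts are mine):

    <?php
    // Chi-square via the 2x2 shortcut: N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D))
    //   $a = spam docs with the term     $b = ham docs with the term
    //   $c = spam docs without the term  $d = ham docs without the term
    function chiSquare($a, $b, $c, $d) {
        $n = $a + $b + $c + $d;
        return $n * pow($a * $d - $b * $c, 2)
             / (($a + $b) * ($c + $d) * ($a + $c) * ($b + $d));
    }

    // For DR, just keep terms over a threshold, e.g. 10.83 for 99.9%.
    var_dump(chiSquare(150, 10, 1850, 1990) > 10.83); // true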
OK, so we’ve got a good set of data, now we need a learner
Tree of 'has term?' questions - ends in a class decision
Easy to classify, and recursive building algorithm pretty easy
Algo is: if the whole collection is one class, then make a leaf of that class
Else, choose the best term - Split into 2 collections, WITH and WITHOUT term
Recurse on each half
But how does it determine best?
First, calculate entropy
Take counts for how many docs in total, how many spam, how many ham
The minus section (one -p·log2(p) term per class) is repeated for multiple classes
Represents the number of bits needed to encode the class of a randomly chosen document from this set
How much new information we get - Easier to see on graph
Percentage of spam on the horizontal, entropy on the vertical
If all spam or no spam no entropy - we know what will come out
If 50/50 entropy is 1 - we can’t guess ahead of time
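As a PHP sketch (function name mine; binary case, as on the slide):

    <?php
    // Entropy of a set: -p(spam)·log2(p(spam)) - p(ham)·log2(p(ham)).
    // Add one minus term per class for multi-class problems.
    function entropy($spamCount, $hamCount) {
        $total  = $spamCount + $hamCount;
        $result = 0.0;
        foreach (array($spamCount, $hamCount) as $count) {
            if ($count > 0) {
                $p = $count / $total;
                $result -= $p * log($p, 2);
            }
        }
        return $result;
    }

    echo entropy(50, 50);  // 1 - can't guess ahead of time
    echo entropy(100, 0);  // 0 - we know what will come out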
We want to reduce entropy - so that the sets are more consistent
We're using the entropy to calculate the maximum information gain
This is the overall reduction in entropy
The original entropy minus the new entropy
New is weighted by the proportion of docs in each group
withCount is the number of docs that have the feature
woutCount is the number without, total is the total
The split is how many of each class are in the group
The entropy is calculated with the formula before
The proportion is just the percentage of the total documents
Final col is just entropy times proportion
Note that the 'with' group is very biased, with a low entropy
BUT - only a small proportion, so the final information gain is low
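Putting that into a PHP sketch, reusing entropy() from the sketch above (count names follow the withCount/woutCount ones just mentioned):

    <?php
    // Gain = original entropy minus the entropy after splitting on the
    // feature, each group weighted by its proportion of the documents.
    function informationGain($spam, $ham, $withSpam, $withHam) {
        $total     = $spam + $ham;
        $withCount = $withSpam + $withHam;
        $woutCount = $total - $withCount;

        $newEntropy = ($withCount / $total) * entropy($withSpam, $withHam)
                    + ($woutCount / $total) * entropy($spam - $withSpam, $ham - $withHam);

        return entropy($spam, $ham) - $newEntropy;
    }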
Easy to implement recursive builder
If ‘spam’ or ‘ham’ are empty - we say the tree is a leaf node.
If not, we find the term with the highest info gain
And build a subtree based on the sets of docs with and without the term
Just need to traverse to classify
A completely made-up example of an output tree.
Millions of ways to do this, of course!
Simple function to return leaf node
Assumes document is an array of words
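The slide code itself isn't in these notes, so this is my own minimal sketch of both the builder and the traversal; it assumes each doc is an array with 'words' and 'class' keys, plus a hypothetical bestTerm() helper that picks the highest information-gain term:

    <?php
    // Recursive builder: a leaf when only one class is left, otherwise
    // split on the best term and recurse on the with/without halves.
    // (Real code also needs the stop conditions discussed next.)
    function buildTree(array $docs) {
        $classes = array();
        foreach ($docs as $doc) {
            $classes[$doc['class']] = true;
        }
        if (count($classes) == 1) {
            return array('leaf' => key($classes));
        }
        $term    = bestTerm($docs); // hypothetical: highest information gain
        $with    = array();
        $without = array();
        foreach ($docs as $doc) {
            if (in_array($term, $doc['words'])) {
                $with[] = $doc;
            } else {
                $without[] = $doc;
            }
        }
        return array(
            'term'    => $term,
            'with'    => buildTree($with),
            'without' => buildTree($without),
        );
    }

    // Classifying is just traversing 'has term?' questions to a leaf.
    function classify(array $tree, array $words) {
        if (isset($tree['leaf'])) {
            return $tree['leaf'];
        }
        $branch = in_array($tree['term'], $words) ? 'with' : 'without';
        return classify($tree[$branch], $words);
    }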
Problem: Tree gets too specific to training data - Need to generalise
Stop condition - min info gain or other
Pruning - test by trimming off bottom parts of tree
Use validation set to test effectiveness of measures
DTs generate human interpretable rules - very handy
BUT expensive to train, needs a small number of dimensions, and often requires rebuilding
KNN is much cheaper at training time - as there is no training
Recall we can regard documents as vectors in a N-dimensional space
Where N is the size of the dictionary
Lets consider only 2 terms
Docs with weights for terms X and Y
Documents of class triangle and class circle
They seem to have a spatial cluster
This is also true in higher dimension for real documents
Class of new doc = class of its K nearest neighbours
The K is how many we look for
In this case K is three, and the nearest three are all green circles.
Choosing K is kind of hard; you might try a few different values, but it's usually in the 11-30 doc range - and not a multiple of the number of classes, to avoid ties
Only real challenge is comparing documents
Here we are looking at just the X and Y distance - this is the Euclidean distance
Very easy. Simply looking at the difference between one and the other
Can actually do the whole thing in the database !
But, has some problems, so more common...
Cosine similarity: a similarity measure that goes to 1 for identical docs, 0 for orthogonal
Easy to do with normalised vectors - just take dot product
Multiply each dimension in Doc A with it in Doc B, and sum
Provides better matching than Euclidean
We could just loop over documents, find K most similar
But search engines do a very similar job - why not use one?
Two options when classifying: count most common or add similarities
The second helps, e.g. if there are 5 good matches in class A and 10 poor matches in class B
For multiple class tagging: use thresholds
BUT: Have to compare all documents
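A PHP sketch of both steps, assuming unit-length vectors keyed by term and training docs stored as array('vector' => ..., 'class' => ...) (my own layout, not the slides'):

    <?php
    // Dot product of two normalised vectors = cosine similarity.
    function cosine(array $a, array $b) {
        $sum = 0.0;
        foreach ($a as $term => $weight) {
            if (isset($b[$term])) {
                $sum += $weight * $b[$term];
            }
        }
        return $sum;
    }

    // kNN with the 'add similarities' voting variant.
    function knnClassify(array $training, array $docVector, $k = 15) {
        $scores = array();
        foreach ($training as $id => $doc) {
            $scores[$id] = cosine($doc['vector'], $docVector);
        }
        arsort($scores);

        $votes = array();
        foreach (array_slice(array_keys($scores), 0, $k) as $id) {
            $class = $training[$id]['class'];
            $votes[$class] = (isset($votes[$class]) ? $votes[$class] : 0)
                           + $scores[$id];
        }
        arsort($votes);
        return key($votes); // for multi-class tagging, threshold $votes instead
    }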
Search engines do a very similar job, use similar scoring. Why not use one?
We can use Zend Framework's native PHP implementation of Lucene
We add an unindexed ‘class’ field, and our contents
We would loop over our training data this way, adding documents
Then, we construct a query.
Use the same analyser, so test documents are tokenised the same way as the training data
And take a count of how often each word appears
We don’t have IDF, so we’re just filtering short words
Construct a query with the top 50 words by term frequency
Results: take the most common class
Works OK, not great.
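The actual slide code isn't in these notes; this sketch is from memory of the ZF1 Zend_Search_Lucene API (reusing tokenise() from earlier), so treat the details as approximate:

    <?php
    require_once 'Zend/Search/Lucene.php';

    // Train: index the docs with an unindexed 'class' field and
    // unstored (but indexed) contents.
    $index = Zend_Search_Lucene::create('/tmp/classifier-index');
    foreach ($trainingDocs as $doc) {
        $luceneDoc = new Zend_Search_Lucene_Document();
        $luceneDoc->addField(Zend_Search_Lucene_Field::UnIndexed('class', $doc['class']));
        $luceneDoc->addField(Zend_Search_Lucene_Field::UnStored('contents', $doc['text']));
        $index->addDocument($luceneDoc);
    }

    // Classify: no IDF available, so filter short words, query with the
    // top 50 by term frequency, take the most common class in the hits.
    $tokens = array_filter(tokenise($newDocText), function ($word) {
        return strlen($word) > 3;
    });
    $counts = array_count_values($tokens);
    arsort($counts);
    $query = implode(' ', array_slice(array_keys($counts), 0, 50));

    $votes = array();
    foreach ($index->find($query) as $hit) {
        $votes[$hit->class] = isset($votes[$hit->class]) ? $votes[$hit->class] + 1 : 1;
    }
    arsort($votes);
    echo key($votes);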
With Java Lucene you can get a term vector - includes the true weights
We aren’t limited to using pure PHP search engines though
Flax is based on the open source Xapian engine - kind of Xapian's equivalent of Solr
Has a similarity search that makes KNN ridiculously easy and very effective
It works along the same lines as before, but extracts a set of relevant terms from the document or documents in question
The weighting scheme is BM25 - more advanced
This code creates a database, adds two fields to it, and indexes a document
Uses a restful web service - available from any language
Very similar to the Lucene loop
Except we add, then remove, a document to use the searchSimilar feature
Gets good accuracy and is really fast.
However, if we want to use this kind of technique and don't have a Flax server handy, there is another related technique
Instead of taking each value and comparing it
We take the average of all the documents in each class
And compare against that
Very easy
This works surprisingly well!
Here we compute the centroid, or average, of all the docs in each class
By summing the weight * 1/count.
You might do this in the database, pretty straightforward op.
Called a Rocchio classifier - pronounced 'rohk-key-oh' - because it's based on a relevance feedback technique by Rocchio
Classify by doing similarity against each - taking closest
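A sketch reusing normalise() and cosine() from the earlier sketches (layout mine):

    <?php
    // Centroid of a class: sum each term weight scaled by 1/count,
    // then normalise back to unit length.
    function centroid(array $vectors) {
        $count = count($vectors);
        $sum   = array();
        foreach ($vectors as $vector) {
            foreach ($vector as $term => $weight) {
                $sum[$term] = (isset($sum[$term]) ? $sum[$term] : 0)
                            + $weight / $count;
            }
        }
        return normalise($sum);
    }

    // Classify by similarity against each class centroid, taking the closest.
    function rocchioClassify(array $centroids, array $docVector) {
        $bestClass = null;
        $bestScore = -1.0;
        foreach ($centroids as $class => $classCentroid) {
            $score = cosine($classCentroid, $docVector);
            if ($score > $bestScore) {
                $bestScore = $score;
                $bestClass = $class;
            }
        }
        return $bestClass;
    }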
Quick and easy probability based classifier
Very commonly used in spam checking, very trendy a couple of years back
Naive assumption is that words are independent
One word does not influence chances of seeing another - not true!
BUT: Means that we don’t need an example for each combination of attributes
Bayes is good at very high dimensionality because of this
This is Bayes' theorem: Pr(Class|Doc) = Pr(Doc|Class) × Pr(Class) / Pr(Doc). Read the pipe as 'given', Pr as 'probability of'
Pr(Doc) is constant, can be dropped for ranking
Pr(Class) is either count or assumed - e.g. 60% spam = 0.6, or just use 0.5/0.5 for binary
Have to work out Pr(Doc|Class)
We calculate that by looking at the probability of the features in the docs
We can look at the data itself to calculate the term likelihoods
Conditional probability: Docs with term in class / Docs in class
We had 1757 docs with the word register in the spam class, and about 16,000 docs in the spam class, so the probability is about 0.11.
Register is more spam than ham, sent is more ham than spam
Can calculate in SQL directly
ClassCount is the number of docs in that class - from earlier query
Divide: Number of docs in class containing term / Number of docs in class
The stored value is the likelihood of seeing that term in a doc of that class
Would call once for each class
Independence assumption lets us treat probability of doc as product of probabilities of word for the given class
Loop over the terms and multiply likelihood for each class
Assumed prior of 0.5
This is the multivariate Bernoulli model - the multinomial variant instead uses the term's count in the class over the overall term count
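As a PHP sketch (names mine): $likelihoods[$class][$term] holds the stored docs-with-term / docs-in-class values from the query above, and tokenise() is reused from earlier.

    <?php
    // Naive Bayes: score = Pr(Class) × product of Pr(term|Class).
    function bayesClassify(array $likelihoods, array $words, $prior = 0.5) {
        $scores = array();
        foreach ($likelihoods as $class => $termProbs) {
            $score = $prior; // assumed prior; could use class counts instead
            foreach ($words as $word) {
                // Words unseen in training are skipped here; real code
                // would smooth the estimates.
                if (isset($termProbs[$word])) {
                    $score *= $termProbs[$word];
                }
            }
            $scores[$class] = $score;
        }
        arsort($scores);
        return key($scores); // real code often sums logs to avoid underflow
    }

    echo bayesClassify($likelihoods, tokenise($text));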
To sum up, these are the steps for a wide range of problems
Step 1: Recognising that something is a classification problem
- context-sensitive spelling, author identification, intrusion detection, finding genes in DNA
Then extract features from the docs
Apply a learner to generate a model for classifying
Something for your mental toolbox!
Thanks to the people who put their photos on flickr under Creative Commons