Week 4
Introduction
Naive Bayes Classifier
The bag of words representation
Sentiment example
A Practical (other problems)
Evaluation measures
Introduction
The goal of text classification is to take a single observation, extract
some useful features, and then classify the observation into one of a
set of discrete classes, e.g.
• text categorization: assigning a label or category to an entire text or document
• sentiment analysis: the extraction of sentiment, the positive or negative orientation that a writer expresses toward some object
• spam detection: assigning an email to one of the two classes, spam or not-spam
• author attribution
• topic labeling
Text Classification: definition
Input:
• a document x
• a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class c ∈ C
Supervised Binary Classification problem
Given a series of input/output pairs:
• (x(i), y(i))
For each observation x(i)
• We represent x(i) by a feature vector [x1, x2,…, xn]
• We compute an output: a predicted class ŷ(i) ∈ {0, 1}
Naive Bayes classifier
P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
so that
P(A|B) = P(A) P(B|A) / P(B)
This is called Bayes' theorem.
Relies on a very simple representation of the document: the bag of words.
Denote a document data sample d conditional on a class C as d|C.
From Bayes' theorem, we have
P(C) P(d|C) = P(d) P(C|d)
so
P(C|d) = P(C) P(d|C) / P(d)
If there are two classes C1 and C2, the Bayes classifier chooses the class Ci with the higher value of P(Ci|d).
The Bag of Words Representation
Bag of Words assumption: Assume position doesn’t matter
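A minimal sketch of this representation in Python (the example review text is invented for illustration):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace: word positions are discarded,
    # only the count of each word type is kept.
    return Counter(text.lower().split())

print(bag_of_words("sweet movie I recommend it it is sweet"))
# e.g. Counter({'sweet': 2, 'it': 2, 'movie': 1, 'i': 1, ...})
```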
Positive or negative movie review?
[Figure: the review is reduced to a bag of word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ...; the classifier γ maps this bag of words to a class: γ(d) = c.]
"Likelihood" "Prior"
Document d
represented as
features x1..xn
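A minimal sketch of this decision rule (the priors and likelihoods below are made-up placeholder numbers, not trained values); working in log space avoids underflow when many small probabilities are multiplied:

```python
import math

def naive_bayes_class(doc_words, priors, likelihoods):
    # priors: {class: P(c)}; likelihoods: {class: {word: P(w|c)}}
    # Choose the class c maximizing log P(c) + sum_i log P(x_i|c).
    scores = {}
    for c, prior in priors.items():
        scores[c] = math.log(prior) + sum(math.log(likelihoods[c][w]) for w in doc_words)
    return max(scores, key=scores.get)

# Placeholder probabilities, purely for illustration.
priors = {"pos": 0.4, "neg": 0.6}
likelihoods = {"pos": {"great": 0.05, "boring": 0.001},
               "neg": {"great": 0.01, "boring": 0.03}}
print(naive_bayes_class(["great", "great"], priors, likelihoods))   # -> pos
```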
Multinomial Naive Bayes: Learning
From the training corpus, extract the Vocabulary V.
Calculate the P(cj) terms:
• For each cj in C do
  • docsj ← all docs with class = cj
  • P(cj) ← |docsj| / (total # of documents)
Calculate the P(wk|cj) terms:
• Textj ← single doc containing all docsj
• For each word wk in Vocabulary
  • nk ← # of occurrences of wk in Textj
  • P(wk|cj) ← (nk + 1) / (n + |V|), where n is the total number of tokens in Textj (add-1 smoothing, as in the worked example below)
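A compact sketch of this training procedure in Python, using the add-1 smoothed likelihood from the slides that follow; the toy corpus at the bottom is invented purely for illustration:

```python
from collections import Counter

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class labels."""
    vocab = {w for doc in docs for w in doc}
    prior, likelihood = {}, {}
    for c in set(labels):
        docs_c = [d for d, y in zip(docs, labels) if y == c]     # docs_j
        prior[c] = len(docs_c) / len(docs)                       # P(c_j) = N_cj / N_total
        text_c = Counter(w for d in docs_c for w in d)           # Text_j: all docs_j as one doc
        n = sum(text_c.values())
        # Add-1 smoothed likelihood for every vocabulary word w_k (n_k = text_c[w])
        likelihood[c] = {w: (text_c[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, prior, likelihood

# Toy corpus, invented for illustration:
docs = [["good", "fun"], ["boring", "bad"], ["good", "good"]]
labels = ["+", "-", "+"]
vocab, prior, likelihood = train_multinomial_nb(docs, labels)
print(prior["+"], likelihood["+"]["good"])   # 2/3 and (3+1)/(4+4) = 0.5
```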
Let's do a worked sentiment example!
A worked sentiment example with add-1 smoothing
1. Prior from training:
P̂(cj) = Ncj / Ntotal
P(−) = 3/5, P(+) = 2/5
2. Drop "with" (it does not occur in the training data, so it is not in the vocabulary)
3. Likelihoods from training:
P̂(wi|c) = (count(wi, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)
A Practical (other problems using Naïve Bayes)
Evaluation measures:
The 2-by-2 confusion matrix
Evaluation: Precision and recall
Precision: % of items the system detected (i.e., items the system labeled as positive) that are in fact positive (according to the human gold labels)
Recall: % of items that are in fact positive (according to the human gold labels) that the system correctly detected
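A small sketch computing precision, and recall from the same confusion matrix, out of true-positive (tp), false-positive (fp), and false-negative (fn) counts; the counts below are placeholders:

```python
def precision_recall(tp, fp, fn):
    # Precision: of the items the system labeled positive, the fraction that are truly positive.
    # Recall: of the items that are truly positive, the fraction the system found.
    return tp / (tp + fp), tp / (tp + fn)

# Placeholder confusion-matrix counts, for illustration only.
print(precision_recall(tp=30, fp=10, fn=20))   # (0.75, 0.6)
```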
Likelihoods from the practical example, with add-1 smoothing and |V| = 9:
P(English|Pos) = (count(English, Pos) + 1) / (7 + |V|) = (0 + 1) / (7 + 9) = 1/16
P(Summer|Pos) = (count(Summer, Pos) + 1) / (7 + |V|) = (2 + 1) / (7 + 9) = 3/16
P(Weather|Pos) = (count(Weather, Pos) + 1) / (7 + |V|) = (1 + 1) / (7 + 9) = 1/8
P(English|Neg) = (count(English, Neg) + 1) / (5 + |V|) = (0 + 1) / (5 + 9) = 1/14
P(Summer|Neg) = (count(Summer, Neg) + 1) / (5 + |V|) = (0 + 1) / (5 + 9) = 1/14
P(Weather|Neg) = (count(Weather, Neg) + 1) / (5 + |V|) = (1 + 1) / (5 + 9) = 1/7
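These smoothed likelihoods can be checked with a few lines (the counts, the class totals 7 and 5, and |V| = 9 are taken directly from the lines above):

```python
from fractions import Fraction

V = 9  # vocabulary size |V| from the example above

def add1_likelihood(count, class_total):
    # Add-1 smoothing: (count(w, c) + 1) / (total tokens in class c + |V|)
    return Fraction(count + 1, class_total + V)

print(add1_likelihood(0, 7), add1_likelihood(2, 7), add1_likelihood(1, 7))  # 1/16 3/16 1/8 (Pos)
print(add1_likelihood(0, 5), add1_likelihood(0, 5), add1_likelihood(1, 5))  # 1/14 1/14 1/7 (Neg)
```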
Classifying sentiment for input x
p(+|x) = P(y=1|x) = σ(w·x + b)
       = σ([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
       = σ(0.833) = 0.70
p(−|x) = 1 − 0.70 = 0.30
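The arithmetic above can be reproduced directly (weights w, features x, and bias b exactly as on the slide):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
x = [3, 2, 1, 3, 0, 4.19]
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b
print(round(z, 3), round(sigmoid(z), 2))   # 0.833 0.7
print(round(1 - sigmoid(z), 2))            # 0.3
```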
Supervised classification with gradient descent
We have a feature representation of the input: for each input observation x(i), a vector of features [x1, x2, ..., xn]. Feature j for input x(i) is xj, more completely xj(i).
We are given m input/output pairs (x(i), y(i)); we know the correct label y (either 0 or 1) for each x(i).
The logistic regression classifier computes ŷ, the estimated class, via p(y|x), the sigmoid function applied to the score:
p(y=1|x) = σ(w·x + b)
We want to set w and b to minimize the distance between our estimated ŷ(i) and the true y(i), i.e. to minimize a loss function L.
We have the gradient-descent update
θ̂(t) = θ̂(t−1) − η ∂L/∂θ
where η is a small positive number, the learning rate.
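A minimal sketch of one such update for logistic regression with cross-entropy loss on a single example (the learning rate and data are placeholders); for this loss the gradient with respect to weight wj is (σ(w·x + b) − y)·xj:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sgd_step(w, b, x, y, eta=0.1):
    # One update theta_t = theta_{t-1} - eta * dL/dtheta for cross-entropy loss:
    # dL/dw_j = (y_hat - y) * x_j and dL/db = (y_hat - y).
    y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    w = [wj - eta * (y_hat - y) * xj for wj, xj in zip(w, x)]
    b = b - eta * (y_hat - y)
    return w, b

# Placeholder data: one positive training example.
print(sgd_step(w=[0.0, 0.0], b=0.0, x=[3.0, 2.0], y=1))   # ([0.15, 0.1], 0.05)
```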
Let’s visualize for a single scalar w
[Figure: a single neural unit. Input layer x = [0.5, 0.6, 0.1], with weights and a bias; the unit outputs y = σ(w·x + b) = 0.7047.]
Non-Linear Activation Functions besides sigmoid
Most common: tanh, and ReLU (Rectified Linear Unit)
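Both activations are one-liners (standard definitions):

```python
import math

def tanh(z):
    # tanh squashes its input into (-1, 1)
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    # ReLU (Rectified Linear Unit): identity for positive z, zero otherwise
    return max(0.0, z)

print(round(tanh(0.5), 3), relu(-0.81), relu(4.19))   # 0.462 0.0 4.19
```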
Feedforward Neural Networks
Multilayer notation
Feedforward nets for classification
[Figure: logistic regression (input x1..xn, hand-designed features f1, f2, ..., fn, weights, sigmoid output σ) side by side with a 2-layer feedforward network (input x1..xn, features f1, f2, ..., fn, hidden layer with weights W, output layer with weights U, sigmoid output σ).]
Just adding a hidden layer to logistic regression
• allows the network to use non-linear interactions between features
• which may (or may not) improve performance.
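A minimal forward pass for such a 2-layer network (the random weight matrices W and U, the ReLU hidden activation, and the zero biases are assumptions chosen only to illustrate the shapes):

```python
import numpy as np

def feedforward_2layer(x, W, b1, U, b2):
    # Hidden layer: h = ReLU(W x + b1); output: y = sigmoid(U h + b2)
    h = np.maximum(0, W @ x + b1)
    return 1 / (1 + np.exp(-(U @ h + b2)))

rng = np.random.default_rng(0)
x = np.array([3, 2, 1, 3, 0, 4.19])            # input feature vector
W, b1 = rng.normal(size=(3, 6)), np.zeros(3)   # input -> hidden (3 hidden units)
U, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden -> output
print(feedforward_2layer(x, W, b1, U, b2))     # a probability in (0, 1)
```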
Sentiment classification example:
Classifying sentiment for input x
x = [3, 2, 1, 3, 0, 4.19], with hidden-layer bias b = [-1, 0, 1].
Then the hidden pre-activations are
W x + b = [4.19, -0.81, 14.19];
after the (ReLU) activation, the hidden values are h = [4.19, 0, 14.19].
Assuming an output node with a sigmoid, with output weights u = [0.2, 0.3, 0.9] and output bias 0.5, then
ŷ = σ(u·h + 0.5) = σ(14.11) ≈ 1
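Given the hidden values h = [4.19, 0, 14.19] and the output weights and bias above, the final sigmoid can be checked directly (the ReLU hidden activation is an assumption suggested by the numbers):

```python
import math

h_pre = [4.19, -0.81, 14.19]                    # hidden pre-activations from above
h = [max(0.0, z) for z in h_pre]                # ReLU -> [4.19, 0.0, 14.19]
u, b = [0.2, 0.3, 0.9], 0.5                     # output weights and bias

z = sum(ui * hi for ui, hi in zip(u, h)) + b    # 0.2*4.19 + 0.9*14.19 + 0.5 = 14.109
print(round(1 / (1 + math.exp(-z)), 6))         # 0.999999, i.e. ~1
```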
Backpropagation
For training, we need the derivative of the loss with
respect to weights in early layers of the network
The collection of weights can also be denoted as θ
We want the weights that minimize the loss, averaged
over all examples.
We again have the gradient-descent update
θ̂(t) = θ̂(t−1) − η ∂L/∂θ
Since the loss is calculated at the end (output) of the network, the weights of earlier layers are updated by propagating the error backwards through the network with the chain rule: error backpropagation.
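A minimal sketch of one backpropagation step for a tiny 2-layer network (sigmoid output, cross-entropy loss; the shapes and data are placeholders), showing how the chain rule carries the error from the output back to the earlier weights W:

```python
import numpy as np

def backprop_step(x, y, W, b1, U, b2, eta=0.1):
    # Forward pass: ReLU hidden layer, sigmoid output
    h = np.maximum(0, W @ x + b1)
    y_hat = 1 / (1 + np.exp(-(U @ h + b2)))

    # Backward pass (chain rule), for cross-entropy loss L
    d_out = y_hat - y                  # dL/d(output pre-activation)
    dU, db2 = np.outer(d_out, h), d_out
    d_h = (U.T @ d_out) * (h > 0)      # error propagated back through the ReLU
    dW, db1 = np.outer(d_h, x), d_h

    # Gradient-descent updates: theta_t = theta_{t-1} - eta * dL/dtheta
    return W - eta * dW, b1 - eta * db1, U - eta * dU, b2 - eta * db2

# Placeholder data and random initial weights, for illustration only.
rng = np.random.default_rng(0)
x, y = np.array([0.5, 0.6, 0.1]), np.array([1.0])
W, b1 = rng.normal(size=(3, 3)), np.zeros(3)
U, b2 = rng.normal(size=(1, 3)), np.zeros(1)
W, b1, U, b2 = backprop_step(x, y, W, b1, U, b2)
```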