
Week 4


Naive Bayesian for text classification

 Introduction
 Naive Bayes Classifier
 The bag of words representation
 Sentiment example
 A Practical (other problems)
 Evaluation measures
Introduction
 The goal of text classification is to take a single observation, extract some useful features, and then classify the observation into one of a set of discrete classes, e.g.
 text categorization: assigning a label or category to an entire text or document
 sentiment analysis: the extraction of sentiment, the positive or negative orientation that a writer expresses toward some object
 spam detection: assigning an email to one of the two classes spam or not spam
 author attribution
 topic labeling
 Text Classification: definition
Input:
• a document x
• a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class ĉ ∈ C
 Supervised Binary Classification problem
Given a series of input/output pairs:
• (x(i), y(i))
For each observation x(i)
• We represent x(i) by a feature vector [x1, x2,…, xn]
• We compute an output: a predicted class ŷ(i) ∈ {0,1}
Naive Bayes classifier

 Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions.
 They are applied to a wide range of tasks and application areas.
 They are part of computer science, but many of their methods are closely related to probability and statistics.
 A classifier is a hypothesis or discrete-valued function that is used to assign (categorical) class labels to data points.
 Naive Bayes is a simple technique for constructing classifiers.
 Conditional probability is used to calculate the probability of one event happening given that another event has already happened.
 Denote event A conditional on event B as A|B
P(A|B) = P(A∩B) / P(B)
or P(A∩B) = P(B) P(A|B) = P(A) P(B|A)
 Rearranging gives P(A|B) = P(A) P(B|A) / P(B). This is called Bayes' Theorem.
 Naive Bayes relies on a very simple representation of the document: a bag of words.
 Denote a document data sample d conditional on a class C as d|C
 From Bayes' theorem, we have P(C) P(d|C) = P(d) P(C|d), so
P(C|d) = P(d|C) P(C) / P(d)
 If there are two classes C1 and C2, the Bayes classifier chooses the class Ci with the higher value of P(Ci|d)
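A minimal sketch of this decision rule in Python, using made-up priors and likelihoods for two classes; since P(d) is the same for both classes, it is enough to compare P(C) P(d|C):

```python
# Hypothetical numbers for illustration only.
priors = {"C1": 0.5, "C2": 0.5}            # P(C)
likelihoods = {"C1": 0.03, "C2": 0.001}    # P(d|C) for the same document d

# P(C|d) is proportional to P(C) * P(d|C); P(d) cancels when comparing classes.
scores = {c: priors[c] * likelihoods[c] for c in priors}
predicted = max(scores, key=scores.get)
print(scores, "->", predicted)             # chooses the class with the higher posterior
```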
The Bag of Words Representation
Bag of Words assumption: Assume position doesn’t matter

Positive or negative movie review?

+ ...zany characters and richly applied satire, and some great plot twists
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!
− ...awful pizza and ridiculously overpriced...
The bag of words representation

The document is reduced to a vector of word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ..., and the classifier γ(bag of words) = c maps these counts to a class c.
"Likelihood" "Prior"

Document d
represented as
features x1..xn

CS3TM20©XH 10
Multinomial Naive Bayes: Learning
 From the training corpus, extract the Vocabulary
 Calculate the P(cj) terms
• For each cj in C do
docsj ← all docs with class = cj
P(cj) ← |docsj| / |total # of documents|
 Calculate the P(wk | cj) terms
• Textj ← single doc containing all of docsj
• For each word wk in Vocabulary
nk ← # of occurrences of wk in Textj
P(wk | cj) ← (nk + 1) / (total # of word tokens in Textj + |Vocabulary|)
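A minimal training sketch following these steps, using a tiny made-up labelled corpus; the likelihood function already includes the add-1 (Laplace) smoothing used in the worked example below:

```python
from collections import Counter, defaultdict

# Hypothetical toy training corpus: (document, class) pairs.
train = [("great fun film", "+"), ("fun and great plot", "+"),
         ("boring plot", "-"), ("not fun boring film", "-")]

vocab = {w for doc, _ in train for w in doc.split()}
docs_per_class = Counter(c for _, c in train)
word_counts = defaultdict(Counter)              # word_counts[c][w] = count(w, c)
for doc, c in train:
    word_counts[c].update(doc.split())

priors = {c: n / len(train) for c, n in docs_per_class.items()}      # P(cj)

def likelihood(w, c):                                                # P(wk | cj) with add-1
    return (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))

print(priors)
print(likelihood("fun", "+"), likelihood("fun", "-"))
```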
Let's do a worked sentiment example!

A worked sentiment example with add-1 smoothing
1. Prior from training:
P̂(c_j) = N_{c_j} / N_total, giving P(-) = 3/5 and P(+) = 2/5
2. Drop "with"
3. Likelihoods from training, with add-1 smoothing:
P(w_i|c) = (count(w_i, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)
4. Scoring the test set:
Compute P(c) Π_i P(w_i|c) for each class and predict the class with the higher score.
A Practical (other problems using Naïve Bayesian)

Gender classification with NLTK

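A minimal sketch of the NLTK practical, following the well-known NLTK book gender-classification example (the last letter of a name as the single feature); the exact feature set used in the practical may differ:

```python
import random
import nltk
from nltk.corpus import names

# nltk.download('names')  # uncomment on first run to fetch the corpus

def gender_features(name):
    # One simple feature: the last letter of the name.
    return {"last_letter": name[-1].lower()}

labeled = ([(n, "male") for n in names.words("male.txt")] +
           [(n, "female") for n in names.words("female.txt")])
random.shuffle(labeled)

featuresets = [(gender_features(n), g) for n, g in labeled]
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(gender_features("Neo")))
print(nltk.classify.accuracy(classifier, test_set))
```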
Evaluation measures:
The 2-by-2 confusion matrix
Evaluation: Precision and recall
Precision: % of items the system detected (i.e., items the system labeled as positive) that are in fact positive (according to the human gold labels)
Recall: % of items present in the input that were correctly identified by the system.
Why Precision and recall
 Our dumb pie-classifier
Just label nothing as "about pie"
 Accuracy=99.99%
but
 Recall = 0
(it doesn't get any of the 100 Pie tweets)
 Precision and recall, unlike accuracy, emphasize true positives: finding the things that we are supposed to be looking for.
A combined measure: F-measure
 F measure: a single number that combines P and R:
F_β = (β² + 1) P R / (β² P + R)
 We almost always use balanced F1 (i.e., β = 1): F1 = 2 P R / (P + R)
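A short sketch computing precision, recall, and F1 from the cells of the 2-by-2 confusion matrix; the counts below are made up for illustration:

```python
# Hypothetical confusion-matrix counts.
tp, fp, fn = 70, 10, 30          # true positives, false positives, false negatives

precision = tp / (tp + fp)       # of items labeled positive, how many really are
recall = tp / (tp + fn)          # of true positives in the data, how many were found
f1 = 2 * precision * recall / (precision + recall)   # balanced F-measure (beta = 1)
print(precision, recall, f1)
```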


Development Test Sets ("Devsets") and Cross-validation

Training set Development Test Set Test Set

Train on the training set, tune on the devset, report on the test set

This avoids overfitting ('tuning to the test set')
More conservative estimate of performance
But paradox: we want as much data as possible for training and as much as possible for dev; how do we split?
Cross-validation: multiple splits
Pool results over splits; compute pooled dev performance
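A minimal sketch of k-fold cross-validation, assuming a plain Python list of labelled examples and a hypothetical evaluate(train, test) function that trains a classifier and returns its score:

```python
def cross_validate(data, k, evaluate):
    """Split data into k folds; each fold serves as the held-out split once."""
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)   # pooled performance over the k splits

# Usage (hypothetical): cross_validate(labeled_examples, k=10, evaluate=my_eval_fn)
```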
Exam style question:
You are given three documents with their class labels as in Table 1.
You are asked to use a naïve Bayes classifier to determine whether the new document "English summer weather" belongs to class positive (P) or negative (N). Estimate the probabilities using Maximum Likelihood Estimation with Laplace smoothing. Give your answer as fractions.
Table 1:
Document ID | Text                      | Class
1           | Summer weather is nice    | P
2           | I love summer             | P
3           | Weather in England is bad | N
Test        | English summer weather    | ?
Estimated probabilities:
Class prior: P(Pos) = 2/3, P(Neg) = 1/3, |V| = 9

P(English | Pos) = (count(English, Pos) + 1) / (7 + |V|) = (0+1) / (7+9) = 1/16
P(Summer | Pos) = (count(Summer, Pos) + 1) / (7 + |V|) = (2+1) / (7+9) = 3/16
P(Weather | Pos) = (count(Weather, Pos) + 1) / (7 + |V|) = (1+1) / (7+9) = 2/16 = 1/8

P(English | Neg) = (count(English, Neg) + 1) / (5 + |V|) = (0+1) / (5+9) = 1/14
P(Summer | Neg) = (count(Summer, Neg) + 1) / (5 + |V|) = (0+1) / (5+9) = 1/14
P(Weather | Neg) = (count(Weather, Neg) + 1) / (5 + |V|) = (1+1) / (5+9) = 2/14 = 1/7

P(Pos | English summer weather) ∝ P(English summer weather | Pos) × P(Pos)
= 1/16 × 3/16 × 1/8 × 2/3 = 1/1024 ≈ 0.000977
P(Neg | English summer weather) ∝ P(English summer weather | Neg) × P(Neg)
= 1/14 × 1/14 × 1/7 × 1/3 = 1/4116 ≈ 0.000243
Since the positive score is higher, the test sentence is predicted as Positive.
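A quick check of these fractions using Python's exact arithmetic; the counts come directly from Table 1 (7 positive tokens, 5 negative tokens, |V| = 9):

```python
from fractions import Fraction as F

V = 9                        # vocabulary size
pos_tokens, neg_tokens = 7, 5

def smoothed(count, total_tokens):
    # Add-1 (Laplace) smoothed likelihood.
    return F(count + 1, total_tokens + V)

pos = F(2, 3) * smoothed(0, pos_tokens) * smoothed(2, pos_tokens) * smoothed(1, pos_tokens)
neg = F(1, 3) * smoothed(0, neg_tokens) * smoothed(0, neg_tokens) * smoothed(1, neg_tokens)
print(pos, float(pos))       # 1/1024 ≈ 0.000977
print(neg, float(neg))       # 1/4116 ≈ 0.000243
print("Positive" if pos > neg else "Negative")
```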
Logistic regression and neural network

 Introduction
 Logistic regression classifier
 Supervised learning and gradient descent
 Neural network classifier

Introduction
 Detecting patterns is a central part of NLP:
Positive/negative sentiment
Spam/not spam
Authorship attribution
 Classification is the task of choosing the correct class label for a given input. Each input is considered in isolation from all other inputs, and the set of labels is defined in advance. We mainly look at two classes (binary classification).
 Both logistic regression and neural network classifiers are an important analytic tool in the natural and social sciences.
 They are baseline supervised machine learning tools for classification.
Logistic regression classifier
 Sigmoid function: σ(z) = 1 / (1 + e^(−z))
 We'll compute w∙x + b
 And then pass it through the sigmoid function: σ(w∙x + b)
 Which is treated as a probability p(y=1|x)
 P(y=0|x) = 1 − p(y=1|x)
Sentiment classification example:

It's hokey . There are virtually no surprises , and the writing is second-rate . So why was it so enjoyable ? For one thing , the cast is great . Another nice touch is the music . I was overcome with the urge to get off the couch and start dancing . It sucked me in , and it'll do the same to you .

Does y=1 (positive) or y=0 (negative)?

Classifying sentiment for input x

 Suppose w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7], b = 0.1
 x = [3, 2, 1, 3, 0, 4.19]

P(y=1|x) = σ(w⋅x + b)
= σ([2.5, -5.0, -1.2, 0.5, 2.0, 0.7]⋅[3, 2, 1, 3, 0, 4.19] + 0.1)
= σ(0.833) = 0.70
P(y=0|x) = 1 − 0.70 = 0.30
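A small sketch reproducing this calculation; the weights, bias, and feature values are the ones given above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
b = 0.1
x = [3, 2, 1, 3, 0, 4.19]

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # w.x + b = 0.833
p_pos = sigmoid(z)                             # P(y=1|x) ≈ 0.70
print(round(z, 3), round(p_pos, 2), round(1 - p_pos, 2))
```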
Supervised classification with gradient descent
 We have a feature representation of the input.
 For each input observation x(i), a vector of features [x1, x2, ..., xn]. Feature j for input x(i) is xj, more completely xj(i).
 We are given m input/output pairs (x(i), y(i)). (We know the correct label y, either 0 or 1, for each x(i).)
 The logistic regression classifier computes ŷ, the estimated class, via p(y|x), the sigmoid of w∙x + b.
 We want to set w and b to minimize the distance between our estimate ŷ(i) and the true y(i).
 Representing θ = (w, b), we need an optimization algorithm to update θ to minimize a loss function.
 The loss is the cross entropy (negative log likelihood), chosen so that the likelihood of the true y in the training data, given the observations x, is maximized:

L = −log p(y|x) = −y log(ŷ) − (1−y) log(1−ŷ)

 For the full data set, the cost is this loss averaged over all m training examples.
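A small sketch of this loss, evaluated at the earlier sentiment example where ŷ = σ(w∙x + b) ≈ 0.70; the true label y = 1 is assumed for illustration:

```python
import math

def cross_entropy(y, y_hat):
    # L = -y log(y_hat) - (1 - y) log(1 - y_hat)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(cross_entropy(1, 0.70))   # ≈ 0.357: low loss, the classifier leans the right way
print(cross_entropy(0, 0.70))   # ≈ 1.204: higher loss if the true label were negative
```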


Intuition of gradient descent
How do I get to the bottom of this river canyon?

Look around me 360°
Find the direction of steepest slope down
Go that way
 The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in that function.
 Gradient Descent: find the gradient of the loss function at the current point and move in the opposite direction.
 We want the weights that minimize the loss, averaged over all examples.

 We update the parameters as
θ̂(t) = θ̂(t−1) − η ∂L/∂θ
 η is a small positive number, the learning rate.
 Let's visualize this for a single scalar w
Q: Given the current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function.
So here, where the slope at the current w is negative, we'll move w in the positive direction.
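A minimal sketch of this update rule for logistic regression with a single feature, using the standard cross-entropy gradients ∂L/∂w = (ŷ − y)x and ∂L/∂b = ŷ − y; the toy data and learning rate are made up, and updates are made per example rather than averaged over the full set:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical one-feature training data: (x, y) pairs.
data = [(2.0, 1), (1.5, 1), (-1.0, 0), (-2.5, 0)]
w, b, eta = 0.0, 0.0, 0.1          # initial parameters and learning rate

for step in range(100):
    for x, y in data:
        y_hat = sigmoid(w * x + b)
        # theta(t) = theta(t-1) - eta * dL/dtheta
        w -= eta * (y_hat - y) * x
        b -= eta * (y_hat - y)

print(round(w, 2), round(b, 2), round(sigmoid(w * 2.0 + b), 2))
```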


Neural network classifier
 Consider the sigmoid function
 We'll compute a neural network f(x, w)
 And then pass it through the sigmoid function: σ(f(x, w))
 Which is treated as a probability p(y=1|x)
 P(y=0|x) = 1 − p(y=1|x)
Neural Network Unit
[Figure: a single unit, from the input layer through weights and bias to a weighted sum, a non-linear transform, and the output value.]
Example: Suppose a unit has
w = [0.2, 0.3, 0.9]
b = 0.5
With an input x = [0.5, 0.6, 0.1] and a sigmoid activation function:
y = σ(w∙x + b) = σ(0.87) = 0.7047
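A small sketch of this single unit, using the weights, bias, and input from the example above:

```python
import math

def unit(x, w, b):
    # Weighted sum followed by the sigmoid non-linearity.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

print(unit([0.5, 0.6, 0.1], [0.2, 0.3, 0.9], 0.5))   # ≈ 0.7047
```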
Non-Linear Activation Functions besides sigmoid
Most common: tanh and ReLU (Rectified Linear Unit)
 Feedforward Neural Networks
 Multilayer notation
Feedforward nets for classification
[Figure: logistic regression (features f1, f2, ..., fn fed through weights W to a sigmoid output σ) compared with a 2-layer feedforward network (inputs x1..xn fed through weights W to a hidden layer, then through weights U to a sigmoid output σ).]
Just adding a hidden layer to logistic regression
• allows the network to use non-linear interactions between features
• which may (or may not) improve performance.
Sentiment classification example:

It's hokey . There are virtually no surprises , and the writing is second-rate . So why was it so enjoyable ? For one thing , the cast is great . Another nice touch is the music . I was overcome with the urge to get off the couch and start dancing . It sucked me in , and it'll do the same to you .

Does y=1 (positive) or y=0 (negative)?

Classifying sentiment for input x
 x = [3, 2, 1, 3, 0, 4.19]
 Assuming three hidden nodes with ReLU activation, the hidden pre-activations are
[4.19, -0.81, 14.19]
and after the ReLU they become
[4.19, 0, 14.19]
 Assuming an output node with a sigmoid activation, weights [0.2, 0.3, 0.9] and bias 0.5, then
ŷ = σ(0.2×4.19 + 0.3×0 + 0.9×14.19 + 0.5) = σ(14.1) ≈ 1
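A small sketch of this forward pass; since the hidden-layer weights themselves are not shown, the sketch starts from the hidden pre-activations given above:

```python
import math

def relu(v):
    return [max(0.0, z) for z in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

h_pre = [4.19, -0.81, 14.19]          # hidden pre-activations from the slide
h = relu(h_pre)                       # [4.19, 0, 14.19]

u, b_out = [0.2, 0.3, 0.9], 0.5       # output weights and bias from the slide
y_hat = sigmoid(sum(ui * hi for ui, hi in zip(u, h)) + b_out)
print(h, round(y_hat, 4))             # ≈ 1.0
```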
Backpropagation
 For training, we need the derivative of the loss with respect to the weights in the early layers of the network.
 The collection of weights can also be denoted as θ.
 We want the weights that minimize the loss, averaged over all examples.
 We again have
θ̂(t) = θ̂(t−1) − η ∂L/∂θ
 Since the loss is calculated at the end of the network, the weights are updated by error backpropagation.
