Text Classification Using Logistic Regression
Logistic Regression
• Positive/negative sentiment
• Spam/not spam
• Authorship attribution (Hamilton or Madison?)
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2, …, cJ}
Output: a predicted class ŷ ∈ C
z = w·x + b
If this sum is high, we say y=1; if low, then y=0.
But we want a probabilistic classifier:
y = σ(z) = 1 / (1 + e^(−z))
Idea of logistic regression
[Plot: P(y=1) = σ(w·x + b), the sigmoid curve as a function of w·x + b.]
Turning a probability into a classifier
ŷ = 1 if w·x + b > 0
ŷ = 0 if w·x + b ≤ 0
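A minimal sketch of this pipeline in Python (the function and variable names are mine, not from the slides): compute z = w·x + b, squash it with the sigmoid, and threshold.

```python
import math

def sigmoid(z):
    # Logistic function: maps any real score z to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    # Binary decision rule: y = 1 iff w·x + b > 0 (equivalently, sigmoid(z) > 0.5).
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (1 if z > 0 else 0), sigmoid(z)
```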
Sentiment example: does y=1 or y=0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice touch
is the music . I was overcome with the urge to get off the couch and start dancing .
It sucked me in , and it'll do the same to you .
Extracted features: x1=3, x2=2, x3=1, x4=3, x5=0, x6=4.19
Figure 5.2: A sample mini test document showing the extracted features in the vector x.
Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using Eq. 5.5.
Suppose w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7] and b = 0.1.
Classifying sentiment for input x
p(+|x) = P(Y=1|x) = σ(w·x + b)
                  = σ([2.5, −5.0, −1.2, 0.5, 2.0, 0.7]·[3, 2, 1, 3, 0, 4.19] + 0.1)
                  = σ(0.833)
                  = 0.70                                          (5.6)
p(−|x) = P(Y=0|x) = 1 − σ(w·x + b)
                  = 0.30
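A quick numeric check of Eq. 5.6, using the feature values from Figure 5.2 and the example weights above (a minimal sketch; the helper names are mine):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps a real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Feature vector x from Figure 5.2 and the example weights w, b from the slide.
x = np.array([3, 2, 1, 3, 0, 4.19])
w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
b = 0.1

z = w.dot(x) + b          # 0.833
p_pos = sigmoid(z)        # P(y=1 | x) ≈ 0.70
p_neg = 1.0 - p_pos       # P(y=0 | x) ≈ 0.30
print(f"z = {z:.3f}, P(+|x) = {p_pos:.2f}, P(-|x) = {p_neg:.2f}")
```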
Logistic regression is commonly applied to all sorts of NLP tasks, and any property
of the input can be a feature. Consider the task of period disambiguation: deciding
if a period is the end of a sentence or part of a word, by classifying each period.
We can build features for logistic regression for any classification task: period disambiguation
Task: decide if a period is the end of a sentence or part of a word, by classifying each
period into one of two classes, EOS (end-of-sentence) and not-EOS. We might use features
like x1 below expressing that the current word is lower case and the class is EOS (perhaps
with a positive weight), or that the current word is in our abbreviations dictionary
("Prof.") and the class is EOS (perhaps with a negative weight). A feature can also express
a quite complex combination of properties. For example a period following an upper cased
word is likely to be an EOS, but if the word itself is St. and the previous word is
capitalized, then the period is likely part of a shortening of the word street.
End of sentence: "This ends in a period."
Not end of sentence: "The house at 465 Main St. is new."
x1 = 1 if Case(wi) = Lower, 0 otherwise
x2 = 1 if wi ∈ AcronymDict, 0 otherwise
x3 = 1 if wi = "St." and Case(wi−1) = Cap, 0 otherwise
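A minimal sketch of these three indicator features in Python (ABBREVIATIONS and the function name are my stand-ins for the AcronymDict and notation above, not an implementation from the slides):

```python
# Stand-in for the AcronymDict / abbreviations dictionary mentioned above (hypothetical contents).
ABBREVIATIONS = {"Prof.", "Dr.", "Mr.", "Mrs.", "St."}

def period_features(words, i):
    """Return [x1, x2, x3] for the token words[i], which ends in a period."""
    wi = words[i]
    prev = words[i - 1] if i > 0 else ""
    x1 = 1 if wi.islower() else 0                         # Case(wi) = Lower
    x2 = 1 if wi in ABBREVIATIONS else 0                  # wi ∈ AcronymDict
    x3 = 1 if wi == "St." and prev[:1].isupper() else 0   # wi = "St." and Case(wi−1) = Cap
    return [x1, x2, x3]

# "The house at 465 Main St. is new." -> the period in "St." is not an EOS
print(period_features(["The", "house", "at", "465", "Main", "St.", "is", "new."], 5))  # [0, 1, 1]
```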
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment, − sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1 = count("awesome")
◦ x2 = log(number of words in review)
◦ a vector w of weights [w1, w2, …, wn]
◦ one weight wi for each feature xi
Wait, where did the W’s come from?
Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, ŷ.
We want to set w and b to minimize the distance between our
estimate ŷ(i) and the true y(i).
• We need a distance estimator: a loss function or a cost function
• We need an optimization algorithm to update w and b to minimize the loss.
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
The distance between ŷ and y
The probability the classifier assigns to the correct label is p(y|x) = ŷ^y (1 − ŷ)^(1−y),
noting:
if y=1, this simplifies to ŷ
if y=0, this simplifies to 1 − ŷ
Deriving cross-entropy loss for a single observation x
Goal: maximize the probability of the correct label p(y|x)
Maximize: p(y|x) = ŷ^y (1 − ŷ)^(1−y)
Now take the log of both sides (mathematically handy)
Maximize: log p(y|x) = y log ŷ + (1 − y) log(1 − ŷ)
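Flipping the sign of this log probability gives something to minimize: the cross-entropy loss named earlier. A minimal sketch in Python (my own function name, not from the slides):

```python
import math

def cross_entropy_loss(y_hat, y):
    # L_CE(ŷ, y) = −[ y·log(ŷ) + (1−y)·log(1−ŷ) ]
    # i.e. the negative log of the probability the model assigns to the correct label y.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```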
Let's see if this works for our sentiment example
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice
touch is the music . I was overcome with the urge to get off the couch and
start dancing . It sucked me in , and it'll do the same to you .
Extracted features: x1=3, x2=2, x3=1, x4=3, x5=0, x6=4.19 (Figure 5.2)
True value is y=1. How well is our model doing?
p(+|x) = P(Y=1|x) = σ(w·x + b)
                  = σ([2.5, −5.0, −1.2, 0.5, 2.0, 0.7]·[3, 2, 1, 3, 0, 4.19] + 0.1)
                  = σ(0.833)
                  = 0.70
p(−|x) = P(Y=0|x) = 1 − σ(w·x + b)
                  = 0.30
Suppose the true value instead was y=0.
What's the loss?
The loss when the model was right (if true y=1)
is lower than the loss when the model was wrong (if true y=0).
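As a quick check of that claim, plugging the example's ŷ = P(+|x) = 0.70 into the cross-entropy loss sketched above:

```python
import math

y_hat = 0.70                              # P(y=1|x) for the example review
loss_if_right = -math.log(y_hat)          # true y=1: −log(0.70) ≈ 0.36
loss_if_wrong = -math.log(1 - y_hat)      # true y=0: −log(0.30) ≈ 1.20
print(loss_if_right, loss_if_wrong)
```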
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function.
[Plot: the loss as a function of w. At the current value w1 the slope of the loss is
negative, so one step of gradient descent moves w toward the minimum wmin, where the
loss is 0 (the goal).]
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.
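For logistic regression with the cross-entropy loss, the gradient has a simple form: the error (ŷ − y) times the feature value, ∂L/∂wj = (ŷ − y)·xj. The sketch below shows one stochastic gradient descent update under that standard result (the function and parameter names, and the learning rate value, are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, x, y, lr=0.1):
    """One stochastic gradient descent update on a single example (x, y)."""
    y_hat = sigmoid(w.dot(x) + b)       # current estimate of P(y=1|x)
    grad_w = (y_hat - y) * x            # ∂L_CE/∂w_j = (ŷ − y)·x_j
    grad_b = (y_hat - y)                # ∂L_CE/∂b   = (ŷ − y)
    return w - lr * grad_w, b - lr * grad_b   # move against the gradient
```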
The softmax function
◦ Turns a vector z = [z1, z2, ..., zk] of k arbitrary values into probabilities:
  softmax(zi) = exp(zi) / Σj exp(zj),  for 1 ≤ i ≤ k
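A minimal sketch of softmax in Python (the max-subtraction step and the example vector are my additions, for numerical stability and illustration):

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) is a standard numerical-stability trick; it doesn't change the result.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # probabilities that sum to 1
```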
Softmax in multinomial logistic regression