Text Classification using Logistic Regression

Logistic Regression

• Important analytic tool in natural and social sciences
• Baseline supervised machine learning tool for classification
• Is also the foundation of neural networks
The two phases of logistic regression

Training: we learn weights w and b using stochastic gradient descent and cross-entropy loss.

Test: given a test example x, we compute p(y|x) using the learned weights w and b, and return whichever label (y = 1 or y = 0) has the higher probability.
Classification Reminder

• Positive/negative sentiment
• Spam/not spam
• Authorship attribution (Hamilton or Madison?)
Text Classification: definition

Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2,…, cJ}

Output: a predicted class ŷ ∈ C


Binary Classification in Logistic Regression

Given a series of input/output pairs:
◦ (x(i), y(i))
For each observation x(i):
◦ We represent x(i) by a feature vector [x1, x2,…, xn]
◦ We compute an output: a predicted class ŷ(i) ∈ {0,1}
Logistic Regression for one observation x

Input observation: vector x = [x1, x2,…, xn]
Weights: one per feature: W = [w1, w2,…, wn]
◦ Sometimes we call the weights θ = [θ1, θ2,…, θn]
Output: a predicted class ŷ ∈ {0,1}

(multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})


How to do classification
For each feature xi, weight wi tells us the importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias:

$z = \left(\sum_{i=1}^{n} w_i x_i\right) + b$

Using dot product notation, the same sum is written:

$z = w \cdot x + b$
If this sum is high, we say y=1; if low, then y=0
But we want a probabilistic classifier

We need to formalize “sum is high”.


We’d like a principled classifier that gives us a
probability, just like Naive Bayes did
We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)
Features in logistic regression
• For feature xi, weight wi tells us how important xi is
• xi = "review contains 'awesome'": wi = +10
• xj = "review contains 'abysmal'": wj = -10
• xk = "review contains 'mediocre'": wk = -2
The problem: z isn't a probability, it's just a number!
◦ Since the weights are real-valued, z ranges from −∞ to ∞.

Solution: use a function of z that goes from 0 to 1

$y = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + \exp(-z)}$

The sigmoid takes a real-valued number and maps it into the range (0, 1). It is nearly linear around 0, but outlier values get squashed toward 0 or 1.
The very useful sigmoid or logistic function
The sigmoid (named because it looks like an s) is also called the logistic function, and gives logistic regression its name. The sigmoid has the following equation:

$y = \sigma(z) = \frac{1}{1 + e^{-z}}$
Idea of logistic regression

We’ll compute w∙x+b


And then we’ll pass it through the
sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
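The idea above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper names `sigmoid`, `dot`, and `prob_positive` are assumptions introduced here, and the weights in the usage line are made-up values.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to a value between 0 and 1."""
    # Branch on the sign of z so exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(w, x):
    """Dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def prob_positive(w, x, b):
    """Treat sigma(w . x + b) as P(y=1 | x)."""
    return sigmoid(dot(w, x) + b)

# Made-up weights and features: here z = 0.5*2.0 - 0.25*4.0 + 0.0 = 0.
p = prob_positive([0.5, -0.25], [2.0, 4.0], 0.0)
```

Because z is exactly 0 in the usage line, `p` sits exactly halfway between the two classes.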
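Making probabilities with sigmoids

$P(y=1) = \sigma(w \cdot x + b) = \frac{1}{1 + \exp(-(w \cdot x + b))}$

$P(y=0) = 1 - \sigma(w \cdot x + b) = \frac{\exp(-(w \cdot x + b))}{1 + \exp(-(w \cdot x + b))}$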
Turning a probability into a classifier

ŷ = 1 if P(y=1|x) > 0.5
ŷ = 0 otherwise

0.5 here is called the decision boundary
The probabilistic classifier

[Figure: sigmoid curve showing P(y=1) plotted against w∙x + b]
Turning a probability into a classifier

ŷ = 1 if w∙x+b > 0
ŷ = 0 if w∙x+b ≤ 0
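The two ways of stating the decision rule (threshold the probability at 0.5, or check the sign of w∙x+b) can be compared with a short sketch. The function names `predict` and `predict_from_score` are hypothetical, and the inputs in the usage lines are invented for illustration.

```python
import math

def score(w, x, b):
    """z = w . x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, x, b):
    """Return 1 if P(y=1|x) > 0.5, else 0."""
    p = 1.0 / (1.0 + math.exp(-score(w, x, b)))  # sigma(z); fine for moderate |z|
    return 1 if p > 0.5 else 0

def predict_from_score(w, x, b):
    """Same decision, made directly on the sign of z."""
    return 1 if score(w, x, b) > 0 else 0

# Invented examples with z = 1, z = -1 and z = 0.
examples = [([1.0, -1.0], [2.0, 1.0], 0.0),
            ([1.0, -1.0], [1.0, 2.0], 0.0),
            ([1.0, -1.0], [1.0, 1.0], 0.0)]
labels = [predict(w, x, b) for w, x, b in examples]
```

Since σ(0) = 0.5 and σ is increasing, both functions agree on these inputs; checking the sign of z simply skips the exponential.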
Sentiment example: does y=1 or y=0?

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .

Extracted features: x1=3, x2=2, x3=1, x4=3, x5=0, x6=4.19

Figure 5.2 A sample mini test document showing the extracted features in the vector x.

Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using Eq. 5.5:

$p(+|x) = P(Y=1|x) = \sigma(w \cdot x + b)$
$= \sigma([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.19] + 0.1)$
$= \sigma(0.833)$
$= 0.70$  (5.6)
Classifying sentiment for input x

Suppose w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7]
b = 0.1

$p(+|x) = P(Y=1|x) = \sigma(w \cdot x + b) = \sigma(0.833) = 0.70$
$p(-|x) = P(Y=0|x) = 1 - \sigma(w \cdot x + b) = 0.30$
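The arithmetic for this review can be checked with a few lines of Python. The weights, bias, and feature values are the ones given on the slide; the variable names are only for this sketch.

```python
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]   # weights from the slide
x = [3, 2, 1, 3, 0, 4.19]              # extracted features x1..x6
b = 0.1                                # bias from the slide

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # w . x + b
p_pos = 1.0 / (1.0 + math.exp(-z))             # P(+|x) = sigma(z)
p_neg = 1.0 - p_pos                            # P(-|x)

print(f"z = {z:.3f}, P(+|x) = {p_pos:.2f}, P(-|x) = {p_neg:.2f}")
# prints: z = 0.833, P(+|x) = 0.70, P(-|x) = 0.30
```

The large negative weight on x2 is outweighed here by the contributions of x1, x4, and x6, which is why the score ends up positive.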
We can build features for logistic regression for any classification task: period disambiguation

Logistic regression is commonly applied to all sorts of NLP tasks, and any property of the input can be a feature. Consider the task of period disambiguation: deciding if a period is the end of a sentence or part of a word, by classifying each period into one of two classes, EOS (end-of-sentence) and not-EOS. We might use features like x1 below expressing that the current word is lower case and the class is EOS (perhaps with a positive weight), or that the current word is in our abbreviations dictionary ("Prof.") and the class is EOS (perhaps with a negative weight). A feature can also express a quite complex combination of properties. For example, a period following an upper-cased word is likely to be an EOS, but if the word itself is St. and the previous word is capitalized, then the period is likely part of a shortening of the word street.

◦ End of sentence: This ends in a period.
◦ Not end: The house at 465 Main St. is new.

$x_1 = \begin{cases} 1 & \text{if } \mathrm{Case}(w_i) = \mathrm{Lower} \\ 0 & \text{otherwise} \end{cases}$

$x_2 = \begin{cases} 1 & \text{if } w_i \in \mathrm{AcronymDict} \\ 0 & \text{otherwise} \end{cases}$

$x_3 = \begin{cases} 1 & \text{if } w_i = \text{St.} \text{ and } \mathrm{Case}(w_{i-1}) = \mathrm{Cap} \\ 0 & \text{otherwise} \end{cases}$
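As a rough sketch, the three indicator features could be extracted like this. The contents of `ACRONYM_DICT`, the whitespace tokenization, and the function name `period_features` are all assumptions made for illustration; the slides do not specify them.

```python
# Hypothetical, tiny stand-in for the abbreviation dictionary named on the slide.
ACRONYM_DICT = {"Prof.", "Dr.", "St."}

def period_features(tokens, i):
    """Return [x1, x2, x3] for tokens[i], a token that ends in a period."""
    word = tokens[i]
    prev = tokens[i - 1] if i > 0 else ""
    x1 = 1 if word.islower() else 0                            # Case(w_i) = Lower
    x2 = 1 if word in ACRONYM_DICT else 0                      # w_i in AcronymDict
    x3 = 1 if word == "St." and prev[:1].isupper() else 0      # St. after a capitalized word
    return [x1, x2, x3]

eos = "This ends in a period.".split()
not_eos = "The house at 465 Main St. is new.".split()

f_eos = period_features(eos, 4)          # token "period."
f_not_eos = period_features(not_eos, 5)  # token "St."
```

With these two sentences, "period." fires only x1, while "St." fires x2 and x3, so weights learned on such features could separate the two cases.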
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1= count( "awesome")
◦ x2 = log(number of words in review)
◦ A vector w of weights [w1, w2, …, wn]
◦ wi for each feature fi
Wait, where did the W’s come from?

Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, ŷ
We want to set w and b to minimize the distance between our estimate ŷ(i) and the true y(i).
• We need a distance estimator: a loss function or a cost function
• We need an optimization algorithm to update w and b to minimize the loss.
Learning components

A loss function:
◦ cross-entropy loss

An optimization algorithm:
◦ stochastic gradient descent
The distance between ŷ and y

We want to know how far is the classifier output:

ŷ = σ(w∙x+b)

from the true output:

y [= either 0 or 1]

We'll call this difference:

L(ŷ, y) = how much ŷ differs from the true y
Intuition of negative log likelihood loss
= cross-entropy loss

A case of conditional maximum likelihood estimation
We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
Deriving cross-entropy loss for a single observation x

Goal: maximize probability of the correct label p(y|x)

Since there are only 2 discrete outcomes (0 or 1) we can express the probability p(y|x) from our classifier (the thing we want to maximize) as

$p(y|x) = \hat{y}^{\,y} (1 - \hat{y})^{1-y}$

noting:
if y=1, this simplifies to ŷ
if y=0, this simplifies to 1 − ŷ

Now take the log of both sides (mathematically handy)
Maximize:

$\log p(y|x) = \log\left[\hat{y}^{\,y} (1 - \hat{y})^{1-y}\right] = y \log \hat{y} + (1 - y) \log(1 - \hat{y})$

Whatever values maximize log p(y|x) will also maximize p(y|x)

Now flip the sign to turn this into a loss: something to minimize
Cross-entropy loss (because this is the formula for cross-entropy(y, ŷ))
Minimize:

$L_{CE}(\hat{y}, y) = -\log p(y|x) = -\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right]$

Or, plugging in the definition of ŷ = σ(w∙x+b):

$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(w \cdot x + b) + (1 - y) \log(1 - \sigma(w \cdot x + b))\right]$
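The steps of the derivation can be mirrored in code. The sketch below assumes natural logarithms and an estimate ŷ strictly between 0 and 1; the function names `bernoulli_prob` and `cross_entropy` are introduced here for illustration only.

```python
import math

def bernoulli_prob(y_hat: float, y: int) -> float:
    """p(y|x) written as y_hat^y * (1 - y_hat)^(1 - y), for y in {0, 1}."""
    return (y_hat ** y) * ((1.0 - y_hat) ** (1 - y))

def cross_entropy(y_hat: float, y: int) -> float:
    """L_CE = -[y log y_hat + (1 - y) log(1 - y_hat)], with 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

y_hat = 0.8  # an arbitrary example estimate
checks = [(y, bernoulli_prob(y_hat, y), cross_entropy(y_hat, y)) for y in (0, 1)]
```

For either label, the loss equals the negative log of the probability the model assigned to that label, which is the identity the derivation relies on.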

Let's see if this works for our sentiment example
We want loss to be:
• smaller if the model estimate is close to correct
• bigger if model is confused
Let's first suppose the true label of this is y=1 (positive)

It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice
touch is the music . I was overcome with the urge to get off the couch and
start dancing . It sucked me in , and it'll do the same to you .
True value is y=1. How well is our model doing?

$p(+|x) = P(Y=1|x) = \sigma(w \cdot x + b) = \sigma(0.833) = 0.70$

Pretty well! What's the loss?

$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(w \cdot x + b) + (1 - y) \log(1 - \sigma(w \cdot x + b))\right]$
$= -\log \sigma(w \cdot x + b)$
$= -\log(0.70)$
$= 0.36$

Suppose true value instead was y=0.

$p(-|x) = P(Y=0|x) = 1 - \sigma(w \cdot x + b) = 0.30$

What's the loss?

$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(w \cdot x + b) + (1 - y) \log(1 - \sigma(w \cdot x + b))\right]$
$= -\log(1 - \sigma(w \cdot x + b))$
$= -\log(0.30)$
$= 1.2$
The loss when model was right (if true y=1):

$-\log(0.70) = 0.36$

is lower than the loss when model was wrong (if true y=0):

$-\log(0.30) = 1.2$

Sure enough, loss was bigger when model was wrong!
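The comparison can be reproduced numerically. This sketch assumes natural logarithms and uses the rounded estimate ŷ = 0.70 from the sentiment example; `cross_entropy` is a helper name chosen here.

```python
import math

def cross_entropy(y_hat: float, y: int) -> float:
    """L_CE = -[y log y_hat + (1 - y) log(1 - y_hat)], natural log."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

y_hat = 0.70                               # model's estimate of P(+|x)
loss_if_positive = cross_entropy(y_hat, 1) # true label y=1: model was right
loss_if_negative = cross_entropy(y_hat, 0) # true label y=0: model was wrong

print(f"{loss_if_positive:.2f} {loss_if_negative:.2f}")
# prints: 0.36 1.20
```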


Our goal: minimize the loss
Let's make explicit that the loss function is parameterized by weights θ = (w, b)
• And we'll represent ŷ as f(x; θ) to make the dependence on θ more obvious
We want the weights that minimize the loss, averaged over all examples:

$\hat{\theta} = \operatorname{argmin}_{\theta} \frac{1}{m} \sum_{i=1}^{m} L_{CE}(f(x^{(i)}; \theta), y^{(i)})$
Intuition of gradient descent
How do I get to the bottom of this river canyon?

Look around me 360°
Find the direction of steepest slope down
Go that way
Our goal: minimize the loss
For logistic regression, loss function is convex
• A convex function has just one minimum
• Gradient descent starting from any point is
guaranteed to find the minimum
• (Loss for neural networks is non-convex)
Let's first visualize for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function

[Figure: loss plotted against w. At the current point w1, the slope of the loss is negative, so one step of gradient descent moves w in the positive direction, toward wmin (the goal).]
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.

Gradient Descent: Find the gradient of the loss function at the current point and move in the opposite direction.
How much do we move in that direction?

• The value of the gradient (slope in our example) $\frac{d}{dw} L(f(x; w), y)$, weighted by a learning rate η
• Higher learning rate means move w faster

$w^{t+1} = w^{t} - \eta \frac{d}{dw} L(f(x; w), y)$
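To make the single-scalar picture concrete, here is a small sketch of gradient descent on a toy convex loss. The loss L(w) = (w − 3)² is an assumption chosen only because its slope is easy to write down; it is not the logistic regression loss.

```python
def loss(w: float) -> float:
    """Toy convex loss with its minimum at w = 3 (illustrative only)."""
    return (w - 3.0) ** 2

def slope(w: float) -> float:
    """Derivative dL/dw of the toy loss."""
    return 2.0 * (w - 3.0)

def gradient_descent_step(w: float, eta: float) -> float:
    """Move w against the slope, scaled by the learning rate eta."""
    return w - eta * slope(w)

w = 0.0            # starting point: slope is negative here, so w must increase
eta = 0.1          # learning rate
first = gradient_descent_step(w, eta)
for _ in range(100):
    w = gradient_descent_step(w, eta)
```

Starting to the left of the minimum, the slope is negative, so each update increases w; after enough steps w settles very close to 3.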
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
Real gradients
Are much longer; lots and lots of weights
For each dimension wi the gradient component i
tells us the slope with respect to that variable.
◦ “How much would a small change in wi influence the
total loss function L?”
◦ We express the slope as a partial derivative ∂/∂wi of the loss
The gradient is then defined as a vector of these
partials.
The gradient
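We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:

$\nabla_{\theta} L(f(x; \theta), y) = \left[ \frac{\partial}{\partial w_1} L(f(x; \theta), y), \; \frac{\partial}{\partial w_2} L(f(x; \theta), y), \; \ldots, \; \frac{\partial}{\partial w_n} L(f(x; \theta), y), \; \frac{\partial}{\partial b} L(f(x; \theta), y) \right]^{T}$

The final equation for updating θ based on the gradient is thus

$\theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L(f(x; \theta), y)$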


Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer.
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume the 3 parameters (2 weights and 1 bias) in θ0 are zero:
w1 = w2 = b = 0
η = 0.1
Example of gradient descent
w1 = w2 = b = 0; x1 = 3; x2 = 2

Update step for θ is:

$\theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L(f(x^{(i)}; \theta), y^{(i)})$

where

$\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = [\sigma(w \cdot x + b) - y]\, x_j$

$\frac{\partial L_{CE}(\hat{y}, y)}{\partial b} = \sigma(w \cdot x + b) - y$

Gradient vector has 3 dimensions:

$\nabla_{w,b} L = \left[ \frac{\partial L_{CE}}{\partial w_1}, \frac{\partial L_{CE}}{\partial w_2}, \frac{\partial L_{CE}}{\partial b} \right] = \left[ (\sigma(w \cdot x + b) - y) x_1, \; (\sigma(w \cdot x + b) - y) x_2, \; \sigma(w \cdot x + b) - y \right]$
$= \left[ (\sigma(0) - 1) x_1, \; (\sigma(0) - 1) x_2, \; \sigma(0) - 1 \right] = [-0.5 x_1, \; -0.5 x_2, \; -0.5] = [-1.5, \; -1.0, \; -0.5]$

Now that we have a gradient, we compute the new parameter vector θ1 by moving θ0 in the opposite direction from the gradient (η = 0.1):

$\theta^{1} = [w_1, w_2, b] - \eta\,[-1.5, -1.0, -0.5] = [0, 0, 0] - 0.1 \cdot [-1.5, -1.0, -0.5] = [0.15, \; 0.1, \; 0.05]$

Note that enough negative examples would eventually make w2 negative
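The single update in this example can be traced in code. The sketch uses the standard cross-entropy gradient for logistic regression, (σ(w∙x+b) − y)·xj for each weight and σ(w∙x+b) − y for the bias; the function names are chosen here for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, b, x, y):
    """Gradient of the cross-entropy loss for one example (x, y)."""
    err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
    return [err * xj for xj in x], err      # (dL/dw_j for each j, dL/db)

def sgd_step(w, b, x, y, eta):
    """One gradient descent update: theta <- theta - eta * gradient."""
    grad_w, grad_b = gradient(w, b, x, y)
    new_w = [wi - eta * gi for wi, gi in zip(w, grad_w)]
    return new_w, b - eta * grad_b

w0, b0 = [0.0, 0.0], 0.0      # all parameters start at zero
x, y = [3.0, 2.0], 1          # x1 = 3, x2 = 2, true label positive
grad_w, grad_b = gradient(w0, b0, x, y)   # [-1.5, -1.0] and -0.5
w1, b1 = sgd_step(w0, b0, x, y, eta=0.1)  # approximately [0.15, 0.1] and 0.05
```

After this one positive example, w2 is positive even though x2 counts negative lexicon words, which is the situation the note about negative examples refers to.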


Mini-batch training
Stochastic gradient descent chooses a single
random example at a time.
That can result in choppy movements
More common to compute gradient over batches of
training instances.
Batch training: entire dataset
Mini-batch training: m examples (512, or 1024)
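One way to picture mini-batch training is to average the per-example gradients over the m sampled examples before taking a step. The sketch below makes that assumption explicit; the helper names, the tiny dataset, and the batch size of 2 are invented for illustration.

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_gradient(w, b, batch):
    """Average the cross-entropy gradient over a batch of (x, y) pairs."""
    m = len(batch)
    grad_w, grad_b = [0.0] * len(w), 0.0
    for x, y in batch:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
        for j, xj in enumerate(x):
            grad_w[j] += err * xj / m
        grad_b += err / m
    return grad_w, grad_b

def sample_batch(data, m, rng):
    """Draw m distinct training examples at random."""
    return rng.sample(data, m)

# Invented toy data: (features, label) pairs.
data = [([3.0, 2.0], 1), ([1.0, 4.0], 0), ([2.0, 0.0], 1), ([0.0, 3.0], 0)]
batch = sample_batch(data, 2, random.Random(0))
gw, gb = minibatch_gradient([0.0, 0.0], 0.0, data[:2])   # gradient on the first two examples
```

Averaging over several examples smooths out the choppy single-example updates, at the cost of more computation per step.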
Multinomial Logistic Regression
Often we need more than 2 classes
◦ Positive/negative/neutral
◦ Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
◦ Classify emergency SMSs into different actionable classes
If >2 classes we use multinomial logistic regression
= Softmax regression
= Multinomial logit
= (defunct names: Maximum entropy modeling or MaxEnt)
So "logistic regression" will just mean binary (2 output classes)
Multinomial Logistic Regression
The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1

Need a generalization of the sigmoid called the softmax
◦ Takes a vector z = [z1, z2, ..., zk] of k arbitrary values
◦ Outputs a probability distribution
  ◦ each value in the range [0,1]
  ◦ all the values summing to 1
The softmax function
Turns a vector z = [z1, z2, ..., zk] of k arbitrary values into probabilities

$\mathrm{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{k} \exp(z_j)}, \quad 1 \le i \le k$
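A minimal sketch of the softmax follows. Subtracting the maximum before exponentiating is a common numerical safeguard added here, not something stated on the slide; it leaves the result unchanged because the same factor cancels in numerator and denominator.

```python
import math

def softmax(z):
    """Turn a list of k arbitrary real values into k probabilities."""
    m = max(z)                                # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.0])              # arbitrary example scores
two_class = softmax([2.0, 0.0])[0]            # softmax over [z, 0] ...
sigmoid_of_z = 1.0 / (1.0 + math.exp(-2.0))   # ... matches sigmoid(z)
```

The last two lines illustrate why the softmax is called a generalization of the sigmoid: with two classes and one score fixed at 0, it reduces to σ(z).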
Softmax in multinomial logistic regression

Input is still the dot product between weight vector w and input vector x
But now we'll need separate weight vectors for each of the K classes.
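Putting the two points together, a hedged sketch of multinomial classification keeps one weight vector and one bias per class, scores each class with a dot product, and passes the K scores through the softmax. The weight values, the per-class bias list, and the function names below are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(W, b, x):
    """W holds one weight vector per class; b holds one bias per class."""
    scores = [sum(wi * xi for wi, xi in zip(w_k, x)) + b_k
              for w_k, b_k in zip(W, b)]
    return softmax(scores)

def predict_class(W, b, x):
    """Return the index of the most probable class."""
    probs = class_probabilities(W, b, x)
    return max(range(len(probs)), key=probs.__getitem__)

# Made-up parameters for K = 3 classes and 2 features.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
x = [2.0, 1.0]
probs = class_probabilities(W, b, x)   # class scores are 2, 1 and 0
best = predict_class(W, b, x)
```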
