
Linear Classification with Logistic Regression

Ryan P. Adams
COS 324 – Elements of Machine Learning
Princeton University

When discussing linear regression, we examined two different points of view that often led
to similar algorithms: one based on constructing and minimizing a loss function, and the other
based on maximizing the likelihood. Classification has a similar set of parallel viewpoints and
algorithms, but we’ll start with the probabilistic view.
In probabilistic linear regression, we studied the idea that there was some generating procedure that took in an input, applied an idealized function to it, and then added noise. We examined, in
particular, the case where that noise was zero-mean Gaussian noise. In probabilistic classification
we will take a similar view, except that a Gaussian distribution will not make sense because the
data will now be binary rather than real-valued. Our go-to distribution for binary data is the
Bernoulli, which is just the biased coin flip. The outcome can take the value 0 or 1 and there is a
parameter θ ∈ [0, 1] that is the mean of the distribution. The probability mass function (PMF) is

Pr(y | θ) = θ^y (1 − θ)^{1−y}   (Bernoulli PMF) .   (1)

This PMF might look strangely complicated for coin flips if you haven’t seen it before, but all that’s going on here is that it’s using the fact that z^0 = 1 and z^1 = z as a kind of trick to slice out the right values. What we’re going to do to turn this into a model for supervised binary classification is to say that θ is a function of the input x. We can’t directly use the function w^T x because that will produce values less than 0 and greater than 1. To address this, we use a function that transforms w^T x into [0, 1]. There are various choices we could make for such a function, but the most common is to choose the logistic function:

σ(z) = exp{z} / (1 + exp{z}) = 1 / (1 + exp{−z}) .   (2)
This function is shown in Figure 1 where you can see that this is an example of a sigmoid (“s-
shaped”) function. We often use σ (·) to denote this function.
Putting these pieces together, we can construct a model that takes in a location x (and weights w )
and produces a Bernoulli distribution:

Pr(y | x, w) = σ(w^T x)^y (1 − σ(w^T x))^{1−y} .   (3)
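To make the pieces concrete, here is a minimal NumPy sketch of this model; the helper names (`sigmoid`, `bernoulli_prob`) are illustrative rather than standard.

```python
import numpy as np

def sigmoid(z):
    # The logistic function sigma(z) = 1 / (1 + exp(-z)) from Eqn. 2.
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_prob(y, x, w):
    # Eqn. 3: Pr(y | x, w) = sigma(w^T x)^y * (1 - sigma(w^T x))^(1 - y).
    theta = sigmoid(w @ x)
    return theta ** y * (1.0 - theta) ** (1 - y)
```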

[Figure 1: The logistic function f(z) = 1/(1 + exp{−z}).]

The actual problem we want to solve, however, is to find the maximum likelihood estimate of w after seeing N data {x_n, y_n}_{n=1}^N, where x_n ∈ R^D and y_n ∈ {0, 1}. We are taking these to be independent Bernoulli distributions, conditioned on w and the x_n, so the likelihood is a product:

Pr({y_n}_{n=1}^N | {x_n}_{n=1}^N, w) = ∏_{n=1}^N σ(w^T x_n)^{y_n} (1 − σ(w^T x_n))^{1−y_n} .   (4)

This is the function that we will want to maximize with respect to w , and as in the linear regression
case we’ll want to take the log first to avoid numeric difficulties due to products of small numbers:
w^MLE = arg max_w { Σ_{n=1}^N y_n log σ(w^T x_n) + (1 − y_n) log(1 − σ(w^T x_n)) } .   (5)

Even though this objective function is concave in w, it is not possible to maximize it directly by setting the gradient to zero and solving for w as we did with linear regression. Instead, we’ll have to maximize it the hard way, using gradient ascent.
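As a sketch, the objective in Eqn. 5 is easy to write down directly; this assumes NumPy and the `sigmoid` helper from the earlier sketch, with a small clip added only to avoid log(0) in floating point.

```python
def log_likelihood(w, X, y):
    # Eqn. 5: sum_n [ y_n log sigma(w^T x_n) + (1 - y_n) log(1 - sigma(w^T x_n)) ].
    # X has shape (N, D); y has shape (N,) with entries in {0, 1}.
    p = sigmoid(X @ w)
    eps = 1e-12  # guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```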

Gradient Ascent/Descent
The idea of gradient descent is to solve the problem
min_z f(z) ,   (6)

in a situation where we have access to the gradient ∇_z f(z) but limited additional information. (Gradient ascent is the same idea, but where we’re maximizing and we flip the signs of everything. Here I’ll frame everything in terms of minimization.) Gradient descent observes that the negative gradient points in the direction of steepest descent. If we take a small enough step in that direction, then we’re very likely to go “downhill” and find a z that reduces the value of f(z):

z^(t+1) ← z^(t) − α ∇_z f(z^(t)) ,   (7)

where we start at some arbitrary (perhaps random) initialization z^(0). The constant α > 0 must
be simultaneously small enough that we’re tending to move downhill, while large enough that we
make progress. Iteratively taking such steps will send us toward a critical point (a place where
the gradient is zero) and in convex problems this critical point will be the global minimum. In
non-convex problems we often simply cross our fingers and hope that the critical point we converge
to is a minimum that is not too bad.
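In code, the basic loop is a few lines; the following sketch assumes a fixed step size `alpha` and a fixed number of iterations rather than a principled stopping rule.

```python
import numpy as np

def gradient_descent(grad_f, z0, alpha=0.1, num_steps=1000):
    # Eqn. 7: repeatedly take z <- z - alpha * grad f(z).
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(num_steps):
        z = z - alpha * grad_f(z)
    return z
```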

Newton’s Method Setting α can be difficult in practice, and it often needs to vary over the
course of the optimization in order to achieve a good solution. Moreover, as can be seen from the
zig-zagging pathology, going directly downhill may not be the best thing to do if the local shape
of the function is stretched out in some directions and compressed in others. Newton’s method is
one important example of a second order optimization method. Roughly speaking, the order of
an optimization approach refers to the number of derivatives used, so gradient descent is a first
order method, while a second order method would use the Hessian matrix in some form. The
idea of Newton’s method is to assume that the function f(z) we are trying to minimize is approximately quadratic in the immediate vicinity of our current iterate. We can estimate that quadratic using a Taylor expansion around the current point z^(t):
f(z) ≈ f(z^(t)) + (z − z^(t))^T ∇_z f(z^(t)) + (1/2)(z − z^(t))^T H_z[f(z^(t))](z − z^(t)) ,   (8)
where H_z[f(·)] is the Hessian of f(·) with respect to z. If this were the true function, then we could actually compute the minimum exactly by taking the gradient and setting it to zero:

∇_z { f(z^(t)) + (z − z^(t))^T ∇_z f(z^(t)) + (1/2)(z − z^(t))^T H_z[f(z^(t))](z − z^(t)) }   (9)
= ∇_z f(z^(t)) + H_z[f(z^(t))](z − z^(t)) = 0   (10)

and then solving for z :

∇_z f(z^(t)) + H_z[f(z^(t))] z − H_z[f(z^(t))] z^(t) = 0   (11)
H_z[f(z^(t))] z = H_z[f(z^(t))] z^(t) − ∇_z f(z^(t))   (12)
z = H_z[f(z^(t))]^{-1} ( H_z[f(z^(t))] z^(t) − ∇_z f(z^(t)) )   (13)
  = z^(t) − H_z[f(z^(t))]^{-1} ∇_z f(z^(t)) .   (14)

If we imagine then at each step of the optimization saying “assume f (·) is locally quadratic and
jump to where the minimum should be” then you get the update:

z^(t+1) ← z^(t) − H_z[f(z^(t))]^{-1} ∇_z f(z^(t)) .   (15)

This is exactly like the gradient descent update but rather than scale the gradient with a constant α,
we use it to solve a linear system with the Hessian. There are a huge number of ways this can go
wrong and so there is a large literature on variations and tweaks to improve things. For example,
one may not have direct access to the Hessian but can only compute Hessian-vector products; the
Hessian may be too big and so you don’t want to represent it at all, much less solve a linear system
with it; you may want to add a learning rate anyway rather than try to jump all the way to the
solution; your Hessian may not be positive definite and so this method may tell you to jump to
infinity. In the current moment of machine learning, where people seem to care the most about
optimizing large neural networks, second order methods seem to offer no practical improvement
at all over first order methods, or at least not enough to justify their complexity and computational
cost.
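For completeness, a rough sketch of the update in Eqn. 15, assuming we can form the Hessian explicitly and it is well conditioned:

```python
import numpy as np

def newton_method(grad_f, hess_f, z0, num_steps=20):
    # Eqn. 15: z <- z - H^{-1} grad f(z), done by solving a linear system
    # rather than forming the inverse explicitly.
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(num_steps):
        z = z - np.linalg.solve(hess_f(z), grad_f(z))
    return z
```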

Stochastic Gradient Descent The workhorse of machine learning at the moment is stochastic
gradient descent (SGD). In SGD, we don’t have access to the true gradient but only to a noisy
version of it. It turns out that if the noise isn’t too bad, and you decay the learning rate over time,
then you will still converge to a solution. The way in which this is most helpful is in tackling large
data sets with gradient descent: the true gradient of the training loss will be an average over all of
the data, but we can often estimate it well using a small subset (“mini-batch”) of the data. This
will be an unbiased estimate and so things are still likely to work. It additionally seems to be the
case that the noise arising from stochastic gradient descent for deep neural networks actually helps
them generalize by somehow avoiding poor local minima in the training loss. That is, some early
theoretical evidence and much empirical evidence indicates that the noisy gradient introduces an
implicit regularization into the model that helps prevent overfitting.

SGD for Logistic Regression


We now return to the problem specified by Eqn. 5 and examine the gradient arising from a single
one of the data:
∇_w { y_n log σ(w^T x_n) + (1 − y_n) log(1 − σ(w^T x_n)) } .   (16)

We’re going to perform gradient descent by performing updates that subtract the negative of the
gradient, i.e., by adding the gradient. We’ll then make this a stochastic method by choosing data
uniformly at random rather than summing over the entire data set.
First, there are two good identities to know about the logistic function:
1 − σ(z) = 1 − exp{z}/(1 + exp{z}) = (1 + exp{z})/(1 + exp{z}) − exp{z}/(1 + exp{z}) = 1/(1 + exp{z}) = σ(−z)   (17)

and
(d/dz) σ(z) = (d/dz) (1 + exp{−z})^{-1} = exp{−z}/(1 + exp{−z})^2 = [exp{−z}/(1 + exp{−z})] · [1/(1 + exp{−z})]   (18)
= σ(−z) σ(z) = (1 − σ(z)) σ(z) .   (19)

We can use these to get an intuitive form for the gradient:


∇_w { y_n log σ(w^T x_n) + (1 − y_n) log(1 − σ(w^T x_n)) }   (20)
= y_n [1/σ(w^T x_n)] ∇_w {σ(w^T x_n)} + (1 − y_n) [1/σ(−w^T x_n)] ∇_w {σ(−w^T x_n)}   (21)
= y_n x_n (1 − σ(w^T x_n)) − (1 − y_n) x_n σ(w^T x_n)   (22)
= y_n x_n − y_n x_n σ(w^T x_n) − x_n σ(w^T x_n) + y_n x_n σ(w^T x_n)   (23)
= y_n x_n − x_n σ(w^T x_n)   (24)
= x_n (y_n − σ(w^T x_n)) .   (25)

The single-example gradient can then be used to form an unbiased estimate of the true (full-data)
gradient by sampling n uniformly at random from 1, . . . , N and then using the nth datum to perform
the update:

w^(t+1) ← w^(t) + α x_n (y_n − σ((w^(t))^T x_n)) .   (26)

Remarkably, this is actually the same rule we identified for gradient descent for least squares
regression in that it takes a step proportional to the error, weighted by the input features. It is almost
exactly what we saw from the perceptron learning rule, except using the sigmoid function rather
than the sign function.
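Putting Eqn. 26 into code gives a very short training loop. This is only a sketch: it reuses the `sigmoid` helper above, fixes the learning rate, and samples one example uniformly at random per step.

```python
import numpy as np

def sgd_logistic(X, y, alpha=0.1, num_steps=10000, seed=0):
    # Stochastic gradient ascent on the log likelihood, Eqn. 26:
    #   w <- w + alpha * x_n * (y_n - sigma(w^T x_n)).
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_steps):
        n = rng.integers(N)               # pick a datum uniformly at random
        err = y[n] - sigmoid(w @ X[n])    # signed prediction error
        w = w + alpha * X[n] * err
    return w
```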

Linear Separability and Regularization In general, linear separability is a good thing for a
binary classification problem. It means the problem is easy in some sense, and simple algorithms
like the perceptron learning rule will work. However, it creates a pathology for unregularized
logistic regression. Consider the fact that the decision boundary in a linear classifier is independent of the scale of the parameters. You can see this by recalling that the decision boundary is the set {x : w^T x = 0} and that this set isn’t changed if we multiply w by some constant c. For a given decision boundary, however, the scale does affect the likelihood in logistic regression by causing the logistic function to become steeper. This is probably an obvious statement, but just in case: you can see this steepness by thinking about the derivative of σ(z) evaluated at z = 0 versus the derivative of σ(10z) at z = 0. The derivatives are σ(z)(1 − σ(z)) and 10σ(10z)(1 − σ(10z)), respectively, and so that linear regime in the middle is ten times steeper when the input is scaled by a factor of ten.
If we currently have a decision boundary such that the data are all correctly classified, then
increasing the scale of the weights will push the predictions further towards their correct answers.
Imagine that we have a set of weights ŵ with unit norm, i.e., ||ŵ|| = 1 for whatever norm you want. We construct a logistic regression classifier with weights w = c ŵ and seek only to fit the
constant c > 0 to the data. Recall that changing c does not move the decision boundary for the
classifier. We take the nth example and examine the derivative of its log likelihood with respect
to c:
(∂/∂c) { y_n log σ(c ŵ^T x_n) + (1 − y_n) log(1 − σ(c ŵ^T x_n)) } = ŵ^T x_n (y_n − σ(c ŵ^T x_n)) .   (27)
If y_n = 0 then (y_n − σ(c ŵ^T x_n)) < 0 and if y_n = 1 then (y_n − σ(c ŵ^T x_n)) > 0. Note also that due to the fixed decision boundary, if y_n = 0 is classified correctly then ŵ^T x_n < 0 and is positive otherwise. Similarly, if y_n = 1 is classified correctly, then ŵ^T x_n > 0 and is negative otherwise.
Thus the derivative of the log likelihood with respect to c is always positive for an example that ŵ
classifies correctly. If the data are linearly separable, then there exists a ŵ such that all of the data
have log likelihoods with positive derivatives with respect to c. In that situation, gradient ascent
on c would cause it to grow without bound. This essentially drives the sigmoid function to be
sharper and sharper until it becomes a Heaviside step function. This is a kind of overfitting: the
model is becoming perfectly confident about the data and using very large weights to achieve it.
We have already learned a solution to this problem: regularize the weights. A common thing
to do is to use the same squared L2 norm that we used in ridge regression: essentially saying, as before, that we are going to find the MAP estimate with a Gaussian prior on the weights.
w^MAP = arg max_w { log Pr({y_n}_{n=1}^N | {x_n}_{n=1}^N, w) − (λ/2) ||w||_2^2 } .   (28)
The gradient of the resulting objective is then
∇_w { Σ_{n=1}^N y_n log σ(w^T x_n) + (1 − y_n) log(1 − σ(w^T x_n)) − (λ/2) w^T w }   (29)
= Σ_{n=1}^N x_n (y_n − σ(w^T x_n)) − λ w .   (30)

The constant λ now has a scale relative to N, so we can either make our single-example stochastic updates scale up by a factor of N or scale λ down by a factor of N. Since α and λ are arbitrary constants, this doesn’t have a practical effect on the algorithm. However, using the latter adjustment results in a small addition to the previous stochastic gradient descent update rule:

w^(t+1) ← w^(t) + α ( x_n (y_n − σ((w^(t))^T x_n)) − (λ/N) w^(t) ) ,   (31)
which we can rewrite as:
w^(t+1) ← (1 − αλ/N) w^(t) + α x_n (y_n − σ((w^(t))^T x_n)) .   (32)
This shows why machine learning researchers (and neural network researchers in particular) often refer to L2 regularization as “weight decay”. In the gradient ascent update rules, this regularization term introduces a “decay toward zero then add the gradient” dynamic.
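The regularized update in Eqn. 32 changes only one line of the earlier SGD sketch; again this is only a sketch with an arbitrary fixed learning rate.

```python
import numpy as np

def sgd_logistic_l2(X, y, lam=1.0, alpha=0.1, num_steps=10000, seed=0):
    # Eqn. 32: w <- (1 - alpha*lam/N) * w + alpha * x_n * (y_n - sigma(w^T x_n)),
    # i.e. decay the weights toward zero, then add the stochastic gradient.
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_steps):
        n = rng.integers(N)
        err = y[n] - sigmoid(w @ X[n])
        w = (1.0 - alpha * lam / N) * w + alpha * X[n] * err
    return w
```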

Beyond Binary Classification
Unlike some binary classification approaches, logistic regression generalizes naturally to K > 2
classes. This is essentially because the Bernoulli (binomial) distribution generalizes directly to
the categorical (multinomial) distribution. The parameter is an element of the K − 1 simplex, i.e., θ ∈ R^K, where θ_k > 0 and Σ_{k=1}^K θ_k = 1. Even though θ has K dimensions, it only has K − 1 degrees of freedom, since it must sum to one.
For the data, rather than imagining that our labels are y_n ∈ {0, 1} we now imagine that they are y_n ∈ {0, 1}^K subject to the constraint that Σ_{k=1}^K y_{n,k} = 1. This is what we refer to as a “one-hot
coding”: a binary vector with as many dimensions as classes and all zeros except for a one in the
dimension of the label for the example. We can then write an equivalent to Eqn. 1 as
Pr(y | θ) = ∏_{k=1}^K θ_k^{y_k} .   (33)

As in binary logistic regression, we have to find a way to map our inputs x ∈ R^D into the vector θ. For K > 2, we’ll have K weight vectors w_k ∈ R^D and we will compute the inner product of x with each of them. After that, we will exponentiate them and then divide by the total across the classes:

θ_k = exp{x^T w_k} / Σ_{k'=1}^K exp{x^T w_{k'}} .   (34)
This exponentiate-and-normalize is often called a softmax, and it ensures that each of the values is non-negative and that they sum to one, as we require for θ. Combining Eqns. 33 and 34, we can write a “softmax regression” likelihood:
Pr(y | x, {w_k}_{k=1}^K) = ∏_{k=1}^K ( exp{x^T w_k} / Σ_{k'=1}^K exp{x^T w_{k'}} )^{y_k}   (35)
With this likelihood in hand, we can write the optimization problem for maximizing the log
likelihood after seeing N data:
{w_k^MLE}_{k=1}^K = arg max_{{w_k}_{k=1}^K} { Σ_{n=1}^N [ Σ_{k=1}^K y_{n,k} x_n^T w_k − log Σ_{k=1}^K exp{x_n^T w_k} ] } .   (36)

As in binary logistic regression, we maximize this by taking the gradient and performing (stochastic)
gradient ascent:
∇_{w_k} { Σ_{n=1}^N [ Σ_{k=1}^K y_{n,k} x_n^T w_k − log Σ_{k=1}^K exp{x_n^T w_k} ] } = Σ_{n=1}^N [ y_{n,k} x_n − (exp{x_n^T w_k} / Σ_{k'=1}^K exp{x_n^T w_{k'}}) x_n ]   (37)
= Σ_{n=1}^N x_n ( y_{n,k} − exp{x_n^T w_k} / Σ_{k'=1}^K exp{x_n^T w_{k'}} ) .   (38)
This is satisfying as a fairly direct analog to Eqn. 25: the inputs weighted by the difference between
the true label and the prediction.
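A full-batch version of the gradient in Eqn. 38 can be computed for all K classes at once. In this sketch the weight vectors are stacked into a D × K matrix `W` and the labels into an N × K one-hot matrix `Y`; these conventions are assumptions made for the example.

```python
import numpy as np

def softmax_regression_grad(W, X, Y):
    # W: (D, K) weight vectors as columns; X: (N, D) inputs; Y: (N, K) one-hot labels.
    logits = X @ W                                        # x_n^T w_k for every n and k
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)      # softmax, Eqn. 34
    # Eqn. 38: the gradient wrt w_k is sum_n x_n (y_{n,k} - softmax_k(x_n)).
    return X.T @ (Y - probs)                              # shape (D, K), one column per class
```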

An Aside: Computing Log-Sum-Exp The log-of-sum-of-exponentials term in Eqn. 36 comes
up a lot in machine learning and it is annoying because it is numerically prone to underflow and
overflow. Let’s look at a simplified version for a vector z ∈ R^J:

log Σ_{j=1}^J exp{z_j}   (Log-Sum-Exp)   (39)

Imagine that one entry in z is much larger than the others. In this case, the value of the log-sum-exp
will essentially just be that large entry in z . However, exponentiating a large floating point number
may overflow and give you inf. Taking the log of inf will still be inf (or NaN), which is not
what you want. We can tweak things to be better behaved, however, by introducing an arbitrary
constant c. Note that we can roll a constant into the log-sum-exp without changing its value:

log Σ_{j=1}^J exp{z_j} = c + log exp{−c} + log Σ_{j=1}^J exp{z_j} = c + log { exp{−c} Σ_{j=1}^J exp{z_j} }   (40)
= c + log Σ_{j=1}^J exp{z_j − c}   (41)

If we make c = max_j z_j, then the largest thing we’re taking an exponential of is zero. All of the other values are less than or equal to zero, so we will not get overflow. The values might be large and negative, but this is tolerable because, in floating point, underflow of the exponential function just gives zero. In the worst case, after exponentiation everything but the big value becomes zero, and the big value becomes one. Then the log term goes away and the entire quantity is just c = max_j z_j, which is essentially the correct answer.
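This trick is only a couple of lines to implement; a minimal sketch follows (libraries such as SciPy also provide an equivalent `logsumexp`).

```python
import numpy as np

def logsumexp(z):
    # Eqn. 41 with c = max_j z_j: c + log sum_j exp(z_j - c).
    c = np.max(z)
    return c + np.log(np.sum(np.exp(z - c)))
```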

Generalized Linear Models


Logistic regression is a special case of a popular and important class of statistical models called
generalized linear models (GLMs). The GLM framework allows one to model different kinds of
label spaces using this same recipe of linear function, nonlinear transformation, and likelihood.
Note that in linear regression, binary logistic regression, and softmax regression, we were using a
linear function of x to parameterize the mean of the distribution on the output. The GLM frames this
in a slightly different way than we have here, by calling the inverse transformation a link function,
but the concept is essentially the same. A couple of common examples of GLM likelihoods are the
Poisson, where the labels are non-negative integers:

Pr(y_n | λ_n) = λ_n^{y_n} exp{−λ_n} / y_n! ,   λ_n = exp{w^T x_n}   (42)
and similarly one could construct an exponential distribution regression model on the positive reals:

Pr(y_n | λ_n) = λ_n exp{−λ_n y_n} ,   λ_n = exp{w^T x_n} .   (43)
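As a concrete instance of the GLM recipe, here is a sketch of the per-example Poisson regression log likelihood from Eqn. 42, dropping the constant log y_n! term; the function name is just for illustration.

```python
import numpy as np

def poisson_log_likelihood(w, x, y):
    # Log link: lambda = exp(w^T x).  The Poisson log PMF is
    # y * log(lambda) - lambda - log(y!), and the last term does not depend on w.
    log_lam = w @ x
    return y * log_lam - np.exp(log_lam)
```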

Changelog
• 8 October 2018 – Initial version.
