
Machine Learning Course - CS-433

Logistic Regression

Oct 17, 2019

changes by Rüdiger Urbanke 2019, 2018, 2017, 2016; © Mohammad Emtiyaz Khan 2015
Last updated on: October 18, 2019
Logistic regression
Recall that in the previous lecture we discussed what happens if we treat binary classification as regression with, let's say, y = 0 and y = 1 as the two possible (target) values, and then decide on the label by checking whether the predicted value is smaller or larger than 0.5.
We have also discussed that it is tempting to interpret the predicted value as a probability.
But there are problems: (i) the predicted values are in general not in [0, 1]; further, (ii) very large (y ≫ 1) or very small (y ≪ 0) values of the prediction will contribute to the error if we use the squared loss, even though they indicate that we are very confident in the resulting classification.
It is therefore natural that we transform the predictions, which take values in (−∞, ∞), into a true probability by applying an appropriate function. There are several possible such functions. The logistic function

    σ(z) := e^z / (1 + e^z)

is a natural and popular choice, see the next figure.¹

[Figure: the logistic function σ(z) for z ∈ [−10, 10]; it increases monotonically from 0 to 1 and crosses 0.5 at z = 0.]

¹ If you implement this function, note that you are applying the exponential function to potentially large (in magnitude) values. This can lead to overflows. One workaround is to implement the function by first checking the value of z and by treating large (in magnitude) values separately.
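To make the footnote concrete, here is a minimal sketch of a numerically stable implementation in Python with NumPy (the language and the function name are our choice, not part of the course material); it treats large positive and negative arguments separately, exactly as suggested.

    import numpy as np

    def sigmoid(z):
        # Numerically stable logistic function sigma(z) = e^z / (1 + e^z).
        z = np.asarray(z, dtype=float)
        out = np.empty_like(z)
        pos = z >= 0
        # For z >= 0 we have exp(-z) <= 1, so no overflow: sigma(z) = 1 / (1 + e^(-z)).
        out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
        # For z < 0 we have exp(z) < 1, so no overflow: sigma(z) = e^z / (1 + e^z).
        ez = np.exp(z[~pos])
        out[~pos] = ez / (1.0 + ez)
        return out

    print(sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0. 0.5 1.], no overflow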

Consider the binary classification case and assume that our two class labels are {0, 1}. We proceed as follows. Given a training set Strain we learn a weight vector w (we will discuss how to do this shortly) and a "shift" (scalar) w0. Given a "new" feature vector x, we predict the (posterior) probability of the two class labels given x by means of

    p(1 | x, w) = σ(xᵀw + w0),
    p(0 | x, w) = 1 − σ(xᵀw + w0).
Note that we predict a real value (a probability) and not a label. This is the reason it is called logistic regression. But typically we use logistic regression as the first step of a classifier. In the second step we quantize the value to a binary value, typically according to whether the predicted probability is smaller or larger than 0.5.
So very large and very small (large negative) values of xᵀw + w0 correspond to probabilities p(1 | x, w) very close to 1 and 0, respectively.
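To make the two-step classifier explicit, here is a small sketch in Python with NumPy (a language choice of ours, not of these notes; expit is SciPy's numerically stable logistic function, and the remaining names are illustrative):

    import numpy as np
    from scipy.special import expit as sigmoid  # sigma(z), implemented stably

    def predict_proba(x, w, w0):
        # Step 1: p(1 | x, w) = sigma(x^T w + w0), a real value in (0, 1).
        return sigmoid(x @ w + w0)

    def predict_label(x, w, w0):
        # Step 2: quantize, i.e., output label 1 iff the probability exceeds 0.5.
        return int(predict_proba(x, w, w0) > 0.5)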
The following figure visualizes the probabilities obtained for a 2-D problem (taken from KPM Chapter 7). More precisely, this is a case with two features and hence two weights that we learn. We see the effect of changing the weight vector on the resulting probability function.
It is easy to see what the roles of w and w0 are. The vector w is orthogonal to the "surface of transition", and w0 allows us to shift the transition point along the vector w. E.g., if w = (1, 0) and w0 = 0 then the transition between the two levels happens at the x1 = 0 plane. By scaling w we can make the transition faster or slower, and by changing w0 we can shift the decision region along the vector w.
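A small numerical illustration of this (with made-up numbers; Python with NumPy): scaling w sharpens the transition, while w0 shifts it along w.

    import numpy as np
    from scipy.special import expit as sigmoid

    x = np.array([0.1, 0.7])              # a point slightly on the positive side of x1 = 0
    for scale in (1.0, 5.0, 25.0):
        w = scale * np.array([1.0, 0.0])  # scaling w makes the transition sharper
        print(scale, sigmoid(x @ w))      # approx. 0.525, 0.622, 0.924
    # Choosing w0 = -0.5 moves the transition plane from x1 = 0 to x1 = 0.5:
    print(sigmoid(x @ np.array([1.0, 0.0]) - 0.5))  # approx. 0.40, now on the "0" side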
At this point it is hopefully clear how we use logistic regression to do classification. To repeat, given the weight vector w we predict the probability of the class label 1 to be p(1 | x, w) = σ(xᵀw + w0) and then quantize. What we need to discuss next is how we learn the model, i.e., how we find a good weight vector w given some training set Strain.

A word about notation


In the beginning of this course we started with an arbitrary feature vector x. We then discussed that often it is useful to add the constant 1 to this feature vector, and we called the resulting vector x̃. We also discussed that often it is useful to add further features, and we then called the resulting vector φ(x). Note that in particular for logistic regression it is crucial that we have the constant term contained in x, since this allows us to "shift" the decision region.
We will assume from now on that the vector x always contains the constant term as well as any further features we care to add. This will save us from a flood of notation.
Hence, from now on we no longer need the extra term w0; the term xᵀw suffices, since x already contains the constant.
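In code, this convention simply amounts to prepending a column of ones to the data matrix. A minimal sketch (Python with NumPy; the helper name is ours):

    import numpy as np

    def add_constant(X):
        # Prepend a column of ones so that x^T w contains the former shift w0.
        return np.column_stack([np.ones(X.shape[0]), X])

    X = np.array([[0.5, 1.2],
                  [3.0, -0.7]])   # hypothetical 2-feature data
    print(add_constant(X))        # the first weight now plays the role of w0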

Training
As always we assume that we have our training set Strain, consisting of iid samples {(xn, yn)}, n = 1, …, N, sampled according to a fixed but unknown distribution D.
Exploiting that the samples (xn, yn) are independent, the probability of y (the vector of all labels) given X (the matrix of all inputs) and w (the weight vector) has a simple product form:

    p(y | X, w) = ∏_{n=1}^N p(yn | xn)
                = ∏_{n: yn=1} p(yn = 1 | xn) · ∏_{n: yn=0} p(yn = 0 | xn)
                = ∏_{n=1}^N σ(xnᵀw)^{yn} [1 − σ(xnᵀw)]^{1−yn}.

It is convenient to take the logarithm of this probability to bring it into an even simpler form. In addition we add a minus sign to the expression. In this way our objective will be to minimize the resulting cost function (rather than maximizing it). This is consistent with our previous examples, where we always minimized the cost function. We call the resulting cost function L(w),

    L(w) = − ∑_{n=1}^N yn ln σ(xnᵀw) + (1 − yn) ln[1 − σ(xnᵀw)]
         = ∑_{n=1}^N ln[1 + exp(xnᵀw)] − yn xnᵀw.

In the last step we have used the specific form of the logistic function σ(z) to bring the cost function into a nice form.
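The last expression translates directly into code. A sketch in Python with NumPy (the function name is ours); note that np.logaddexp(0, t) evaluates ln(1 + e^t) without overflow, in the spirit of the earlier footnote.

    import numpy as np

    def logistic_loss(w, X, y):
        # L(w) = sum_n ln(1 + exp(x_n^T w)) - y_n * x_n^T w
        t = X @ w   # vector of the N inner products x_n^T w
        return np.sum(np.logaddexp(0.0, t) - y * t)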
Before we continue, note the following. In principle we should have written down the likelihood of the data (y, X) given the parameter w, i.e., p(y, X | w). But

    p(y, X | w) = p(X | w) p(y | X, w)
                = p(X) p(y | X, w),

where in the second step we have made the natural assumption that the X data does not depend on the parameter we choose in our model. Note that this is an assumption and part of our model. But now note that the factor p(X) is a constant with respect to the choice of w, and hence plays no role when we apply the maximum likelihood criterion.

Maximum likelihood criterion


Recall what we did so far. Under the assumption that the samples are independent, we have written down the likelihood of the data given a particular choice of weights w. We then choose the weights w that maximize this likelihood.
Equivalently, we choose the weights that maximize the log-likelihood. This is called the maximum-likelihood criterion. In a final reformulation, we added a negative sign to bring the cost function to our standard form and called it L(w). In this form, we are looking for the weights w that minimize L(w). In formulae, we choose the weight w⋆ so that

    w⋆ = argmin_w L(w).

As we discussed in the context of the probabilistic interpretation of the least-squares problem, one justification of the maximum-likelihood criterion is that, under some mild technical conditions, it is consistent. I.e., if we assume that the data was generated according to a model in this class, we have iid samples, and we use this procedure to estimate the underlying parameter, then our estimate will converge to the true parameter as we get more and more data. Of course, in practice the data is unlikely to have been generated in this way and there might not be any probabilistic model underlying it. But nevertheless, this gives our method a theoretical justification.

Conditions of optimality
As we want to minimize L(w), let us look at the stationary points of this function by computing the gradient, setting it to zero, and solving for w. Note that

    ∂ ln[1 + exp(x)] / ∂x = σ(x).

Therefore

    ∇L(w) = ∑_{n=1}^N xn (σ(xnᵀw) − yn)
          = Xᵀ [σ(Xw) − y].

Recall that by our convention the matrix X has N rows, one per input sample. Further, y is the column vector of length N which represents the N labels corresponding to the samples. Therefore, Xw is a column vector of length N. The expression σ(Xw) means that we apply the function σ to each of the N components of Xw. In this way we can express the gradient in a compact form.
There is no closed-form solution for the equation ∇L(w) = 0. Let us therefore discuss how to solve this equation in an iterative fashion by using gradient descent or Newton's method.
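The compact matrix form makes the implementation a one-liner. A sketch (Python with NumPy; the names are ours):

    import numpy as np
    from scipy.special import expit as sigmoid

    def logistic_grad(w, X, y):
        # grad L(w) = X^T (sigma(Xw) - y)
        return X.T @ (sigmoid(X @ w) - y)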

Convexity
Since we are planning to iteratively minimize our cost function, it is good to know that this cost function is convex.

Lemma. The cost function

    L(w) = ∑_{n=1}^N ln[1 + exp(xnᵀw)] − yn xnᵀw

is convex in the weight vector w.

Proof. Recall that the sum (with non-negative weights) of any number of (strictly) convex functions is (strictly) convex. Note that L(w) is the sum of 2N functions. N of them have the form −yn xnᵀw, i.e., they are linear in w, and a linear function is convex. Therefore it suffices to show that the other N functions are convex as well. Let us consider one of those. It has the form ln[1 + exp(xnᵀw)]. Note that ln(1 + exp(x)) is convex in x: it has first derivative σ(x) and second derivative

    ∂² ln(1 + exp(x)) / ∂x² = ∂σ(x) / ∂x = σ(x)(1 − σ(x)),   (1)

which is non-negative. The proof is complete by noting that ln[1 + exp(xnᵀw)] is the composition of this convex function with a linear function of w, and is therefore convex.

Note: Alternatively, to prove that a function is convex (strictly convex) we can check that the Hessian (the matrix consisting of second derivatives) is positive semi-definite (positive definite). We will do this shortly.
Gradient descent
As we have done for other cost functions, we can apply a (stochastic) gradient descent algorithm to minimize our cost function. E.g., for the batch version we can implement the update equation

    w(t+1) := w(t) − γ(t) ∇L(w(t)),

where γ(t) > 0 is the step size and w(t) is the sequence of weight vectors.
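Putting the pieces together, here is a sketch of the batch update with a constant step size (Python with NumPy; the step size, iteration count, and names are illustrative choices of ours, not prescribed by the notes):

    import numpy as np
    from scipy.special import expit as sigmoid

    def logistic_gd(X, y, gamma=0.1, max_iters=1000):
        # Batch gradient descent on L(w), starting from w = 0.
        w = np.zeros(X.shape[1])
        for _ in range(max_iters):
            w = w - gamma * X.T @ (sigmoid(X @ w) - y)
        return w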

Newton’s method
The gradient method is a first-order method, i.e., it only uses the gradient (the first derivative). We get a more powerful optimization algorithm if we also use second-order terms. Of course there is a trade-off: on the one hand we need fewer steps to converge if we use second-order terms; on the other hand every iteration is more costly. Let us now describe a scheme that also makes use of second-order terms. It is called Newton's method.

Hessian of the Log-Likelihood


Let us compute the Hessian of the cost function L(w); call it H(w). What is the Hessian? If w has D components then this is the D × D symmetric matrix with entries

    H_{i,j} = ∂²L(w) / (∂wi ∂wj).

Recall that the cost function L(w) is a sum of N terms, all of the same form. So let us first compute the Hessian corresponding to one such term. We already computed the gradient of one such term and got

    xn (σ(xnᵀw) − yn).

Recall that this gradient is a vector of length D (the dimension of the feature vector x and hence also the dimension of the weight vector), where the i-th component is the derivative of that term with respect to wi. If you look at the above expression you see that this gradient is equal to xn (a vector) times the scalar (σ(xnᵀw) − yn). Note that xn does not depend on w and neither does yn. The only dependence on w is in the term σ(xnᵀw). Therefore, the Hessian associated to one term will be

    xn (∇σ(xnᵀw))ᵀ.

We have already seen that σ′(x) = σ(x)(1 − σ(x)). Therefore, by the chain rule one such term gives rise to the Hessian

    xn xnᵀ σ(xnᵀw)(1 − σ(xnᵀw)).

It remains to do the sum over all N samples. Rather than just summing, let us put this again in a compact form by using the data matrix X. We get

    H(w) = Xᵀ S X,

where S is an N × N diagonal matrix with diagonal entries

    S_nn := σ(xnᵀw)[1 − σ(xnᵀw)].

Note that the diagonal entries of S are non-negative. Hence H(w) is positive semi-definite. This gives us an alternative proof that our original cost function is convex.
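In code we can form H(w) and check positive semi-definiteness numerically. A sketch (Python with NumPy; the names are ours). Instead of materializing the diagonal matrix S, we scale the rows of X, which computes the same product X^T S X.

    import numpy as np
    from scipy.special import expit as sigmoid

    def logistic_hessian(w, X):
        # H(w) = X^T S X with S_nn = sigma(x_n^T w) * (1 - sigma(x_n^T w))
        s = sigmoid(X @ w)
        return (X * (s * (1.0 - s))[:, None]).T @ X

    # Sanity check of convexity: all eigenvalues of the symmetric H are >= 0, e.g.
    # np.linalg.eigvalsh(logistic_hessian(w, X)).min() >= -1e-12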

Newton’s Method
Gradient descent uses only first-order information and takes steps in the direction opposite to the gradient. This makes sense since the gradient points in the direction of increasing function values and we want to minimize the function.
Newton's method uses second-order information and takes steps in the direction that minimizes a quadratic approximation. More precisely, it approximates the function locally by a quadratic form and then moves in the direction where this quadratic form has its minimum. The update equation is of the form

    w(t+1) = w(t) − γ(t) (H(t))⁻¹ ∇L(w(t)),

where H(t) := H(w(t)).

Where does this update equation come from? Recall that the Taylor series approximation of a function (up to second-order terms) around a point w⋆ has the form

    L(w) ≈ L(w⋆) + ∇L(w⋆)ᵀ(w − w⋆) + (1/2)(w − w⋆)ᵀ H(w⋆)(w − w⋆).

The right-hand side is a local approximation of L(w). Assume that we take the right-hand side to be an exact representation of our cost function. We want to minimize this function. So let us look where the right-hand side takes its minimum value. If we think that this approximation is reasonably good, then it makes sense to move the new weight vector to the position of this minimum.
Let us take the gradient of the right-hand side and set it to zero. We get

    ∇L(w⋆) + H(w⋆)(w − w⋆) = 0.

Solving for w gives us w = w⋆ − H(w⋆)⁻¹ ∇L(w⋆). This corresponds exactly to the stated update equation, except that the update includes an extra step size γ. Why do we need this factor? Recall that the right-hand side is only an approximation. Caution therefore dictates that we only move part of the way towards the indicated minimum.
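A sketch of one damped Newton step (Python with NumPy; the names are ours). We solve the linear system H d = ∇L(w) rather than forming the inverse explicitly, which is cheaper and numerically safer.

    import numpy as np
    from scipy.special import expit as sigmoid

    def newton_step(w, X, y, gamma=1.0):
        # w_new = w - gamma * H(w)^{-1} grad L(w)
        s = sigmoid(X @ w)
        grad = X.T @ (s - y)
        H = (X * (s * (1.0 - s))[:, None]).T @ X   # H(w) = X^T S X
        return w - gamma * np.linalg.solve(H, grad)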

Regularized Logistic Regression


Although the cost function for logistic regression is lower-bounded by 0, we run into problems if the data is linearly separable. In this case there is no finite weight vector w which achieves the minimum of the cost function, and if we continue to run the optimization the weights will tend to infinity.
To avoid this problem, as for standard regression problems, we can add a penalty term. E.g., we consider the cost function

    argmin_w − ∑_{n=1}^N ln p(yn | xnᵀw) + (λ/2) ‖w‖².
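The penalized cost and its gradient change only by the λ-terms. A sketch (Python with NumPy; the names are ours):

    import numpy as np
    from scipy.special import expit as sigmoid

    def reg_logistic_loss_grad(w, X, y, lam):
        # Penalized cost  L(w) + (lam/2) * ||w||^2  and its gradient.
        t = X @ w
        loss = np.sum(np.logaddexp(0.0, t) - y * t) + 0.5 * lam * (w @ w)
        grad = X.T @ (sigmoid(t) - y) + lam * w
        return loss, grad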
