Machine Learning - Logistic Regression
Logistic Regression
If you implement this function, note that you are applying the exponential function to potentially large (in magnitude) values.
[Figure: plot of the sigmoid function $\sigma(x)$ for $x \in [-10, 10]$; its values lie between 0 and 1.]
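One common way to avoid overflow is to exponentiate only non-positive arguments. A minimal NumPy sketch (the helper name stable_sigmoid is ours; it assumes the usual parameterization $\sigma(x) = 1/(1 + e^{-x})$):

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid sigma(x) = 1 / (1 + exp(-x)) that never exponentiates a positive value."""
    x = np.asarray(x, dtype=float)           # expects an array of inputs
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos])) # exp of non-positive values only
    exp_x = np.exp(x[~pos])                  # here x < 0, so exp cannot overflow
    out[~pos] = exp_x / (1.0 + exp_x)
    return out
```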
Training
As always we assume that we have our training set $S_{\text{train}}$, consisting of iid samples $\{(x_n, y_n)\}_{n=1}^{N}$, sampled according to a fixed but unknown distribution $\mathcal{D}$.
Exploiting that the samples $(x_n, y_n)$ are independent, the probability of $y$ (the vector of all labels) given $X$ (the matrix of all inputs) and $w$ (the weight vector) has a simple product form:
$$
\begin{aligned}
p(y \mid X, w) &= \prod_{n=1}^{N} p(y_n \mid x_n) \\
&= \prod_{n : y_n = 1} p(y_n = 1 \mid x_n) \prod_{n : y_n = 0} p(y_n = 0 \mid x_n) \\
&= \prod_{n=1}^{N} \sigma(x_n^\top w)^{y_n} \, \big[1 - \sigma(x_n^\top w)\big]^{1 - y_n}.
\end{aligned}
$$
To find the maximum-likelihood weights we equivalently minimize the negative log-likelihood. This gives the cost function
$$
L(w) = \sum_{n=1}^{N} \big( \ln[1 + \exp(x_n^\top w)] - y_n x_n^\top w \big),
$$
and we look for
$$
w^\star = \operatorname*{argmin}_{w} L(w).
$$
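As a small illustration (not part of the derivation), the cost $L(w)$ can be evaluated stably with np.logaddexp, which computes $\ln(1 + e^z)$ without overflow; the function name logistic_cost is ours:

```python
import numpy as np

def logistic_cost(w, X, y):
    """Negative log-likelihood L(w) = sum_n ln(1 + exp(x_n^T w)) - y_n * x_n^T w."""
    z = X @ w                               # z_n = x_n^T w, one entry per sample
    # np.logaddexp(0, z) = ln(exp(0) + exp(z)) = ln(1 + exp(z)), computed stably
    return np.sum(np.logaddexp(0.0, z) - y * z)
```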
Conditions of optimality
As we want to minimize L(w), let us look at the
stationary points of this function by computing the
gradient, setting it to zero, and solving for w. Note
that
$$
\frac{\partial \ln[1 + \exp(x)]}{\partial x} = \sigma(x).
$$
Therefore
$$
\nabla L(w) = \sum_{n=1}^{N} x_n \big(\sigma(x_n^\top w) - y_n\big) = X^\top \big[\sigma(Xw) - y\big].
$$
Recall that by our convention the matrix X has
N rows, one per input sample. Further, y is the
column vector of length N which represents the N
labels corresponding to each sample.
Therefore, $Xw$ is a column vector of length $N$. The expression $\sigma(Xw)$ means that we apply the function $\sigma$ to each of the $N$ components of $Xw$. In this way we can express the gradient in compact form.
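Written this way, the gradient is a single line of matrix code. A minimal sketch (the function names are ours; a numerically stable sigmoid could be substituted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # plain sigmoid (see the stable variant above)

def logistic_gradient(w, X, y):
    """Gradient of L(w): X^T (sigma(Xw) - y), one entry per weight."""
    return X.T @ (sigmoid(X @ w) - y)
```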
There is no closed-form solution to this equation. Let us therefore discuss how to solve it iteratively, using gradient descent or Newton's method.
Convexity
Since we are planning to iteratively minimize our cost function, it is good to know that this cost function is convex.
Lemma. The cost function
$$
L(w) = \sum_{n=1}^{N} \big( \ln[1 + \exp(x_n^\top w)] - y_n x_n^\top w \big)
$$
is convex in the weight vector w.
Proof. Recall that the sum (with non-negative weights) of any number of (strictly) convex functions is (strictly) convex. Note that $L(w)$ is the sum of $2N$ functions. $N$ of them have the form $-y_n x_n^\top w$, i.e., they are linear in $w$, and a linear function is convex. Therefore it suffices to show that the other $N$ functions are convex as well. Let us consider one of those. It has the form $\ln[1 + \exp(x_n^\top w)]$. Note that $\ln(1 + \exp(x))$ is convex: it has first derivative $\sigma(x)$ and second derivative
$$
\frac{\partial^2 \ln(1 + \exp(x))}{\partial x^2} = \frac{\partial \sigma(x)}{\partial x} = \sigma(x)\big(1 - \sigma(x)\big), \qquad (1)
$$
which is non-negative.
The proof is complete by noting that $\ln[1 + \exp(x_n^\top w)]$ is the composition of the convex function $\ln(1 + \exp(\cdot))$ with a linear function of $w$, and composing a convex function with a linear map preserves convexity.
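As an informal numerical sanity check of equation (1) (not a substitute for the proof), one can compare the analytic expression $\sigma(x)(1 - \sigma(x))$ with a central finite-difference approximation of the second derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log1pexp(x):
    return np.logaddexp(0.0, x)              # ln(1 + exp(x))

x = np.linspace(-5.0, 5.0, 101)
h = 1e-3
# Central finite-difference approximation of the second derivative of ln(1 + exp(x))
numeric = (log1pexp(x + h) - 2.0 * log1pexp(x) + log1pexp(x - h)) / h**2
analytic = sigmoid(x) * (1.0 - sigmoid(x))
print(np.max(np.abs(numeric - analytic)))    # tiny, limited only by finite-difference error
```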
Note: Alternatively, to prove that a function is convex (strictly convex) we can check that the Hessian (the matrix consisting of second derivatives) is positive semi-definite (positive definite). We will do this shortly.
Gradient descent
As we have done for other cost functions, we can apply a (stochastic) gradient descent algorithm to minimize our cost function. E.g., for the batch version we can implement the update equation
$$
w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla L(w^{(t)}),
$$
where $\gamma^{(t)} > 0$ is the step size and $w^{(t)}$ is the sequence of weight vectors.
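A minimal sketch of this batch update (the step size and iteration count are arbitrary illustrative choices; $\gamma$ must be small enough for the iteration to converge):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, gamma=0.01, num_iters=1000):
    """Batch gradient descent: w <- w - gamma * X^T (sigma(Xw) - y)."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = X.T @ (sigmoid(X @ w) - y)    # full-batch gradient from the text
        w = w - gamma * grad                 # gamma must be small enough to converge
    return w
```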
Newton’s method
The gradient method is a first-order method, i.e., it only uses the gradient (the first derivative). We get a more powerful optimization algorithm if we also use second-order terms. Of course there is a trade-off: on the one hand we need fewer steps to converge if we use second-order terms, on the other hand every iteration is more costly. Let us now describe a scheme that also makes use of second-order terms. It is called Newton's method.
To apply it we need the Hessian of $L(w)$. Recall that the gradient is a sum of terms of the form
$$
x_n\big(\sigma(x_n^\top w) - y_n\big).
$$
Taking the derivative of each such term with respect to $w$ gives
$$
x_n \big(\nabla_w \sigma(x_n^\top w)\big)^\top = \sigma(x_n^\top w)\big(1 - \sigma(x_n^\top w)\big)\, x_n x_n^\top.
$$
Summing over $n$, the Hessian can be written compactly as
$$
H(w) = X^\top S X,
$$
where $S$ is the $N \times N$ diagonal matrix with entries $S_{nn} = \sigma(x_n^\top w)\big(1 - \sigma(x_n^\top w)\big)$.
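A direct sketch of assembling this Hessian (forming the diagonal matrix $S$ explicitly for clarity, which is wasteful for large $N$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_hessian(w, X):
    """Hessian H(w) = X^T S X with S_nn = sigma(x_n^T w) (1 - sigma(x_n^T w))."""
    s = sigmoid(X @ w)                       # sigma(x_n^T w), one entry per sample
    S = np.diag(s * (1.0 - s))               # (N, N) diagonal matrix
    return X.T @ S @ X                       # (D, D) positive semi-definite matrix
```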
Newton’s Method
Gradient descent uses only first-order information and takes steps in the direction opposite to the gradient. This makes sense since the gradient points in the direction of increasing function values and we want to minimize the function.
Newton’s method uses second-order information
and takes steps in the direction that minimizes a
quadratic approximation. More precisely, it approximates the function locally by a quadratic form and then moves in the direction where this quadratic
form has its minimum. The update equation is of the form
$$
w^{(t+1)} = w^{(t)} - \gamma^{(t)} \big(H(w^{(t)})\big)^{-1} \nabla L(w^{(t)}).
$$
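A sketch of the resulting Newton iteration, taking a full step ($\gamma^{(t)} = 1$) and re-using the gradient and Hessian expressions from above; in practice some step-size control or a small regularization of $H$ may be needed if $H$ is (nearly) singular:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, num_iters=10):
    """Newton's method for logistic regression (full step, gamma = 1)."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        p = sigmoid(X @ w)                            # predicted probabilities
        grad = X.T @ (p - y)                          # gradient X^T (sigma(Xw) - y)
        H = X.T @ (X * (p * (1.0 - p))[:, None])      # Hessian X^T S X without forming S
        w = w - np.linalg.solve(H, grad)              # Newton step
    return w
```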