
Machine Learning Course - CS-433

Logistic Regression

Oct 17, 2019

changes by Rüdiger Urbanke 2019, 2018, 2017, 2016; © Mohammad Emtiyaz Khan 2015
Last updated on: October 18, 2019
Logistic regression
Recall that in the previous lecture we discussed what happens if we treat binary classification as regression with, let's say, y = 0 and y = 1 as the two possible (target) values, and then decide on the label by checking whether the predicted value is smaller or larger than 0.5.
We have also discussed that it is tempting to interpret the predicted value as a probability.
But there are problems: (i) the predicted values are in general not in [0, 1]; further, (ii) very large (y ≫ 1) or very small (y ≪ 0) values of the prediction will contribute to the error if we use the squared loss, even though they indicate that we are very confident in the resulting classification.
It is therefore natural that we transform the predictions, which take values in (−∞, ∞), into a true probability by applying an appropriate function. There are several possible such functions. The logistic function

    σ(z) := e^z / (1 + e^z)

is a natural and popular choice, see the next figure.¹

[Figure: the logistic function σ(z) for z ∈ [−10, 10]; it increases monotonically from 0 to 1 and crosses 0.5 at z = 0.]

¹ If you implement this function, note that you are applying the exponential function to potentially large (in magnitude) values. This can lead to overflows. One workaround is to implement the function by first checking the value of z and by treating large (in magnitude) values separately.
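To make the footnote concrete, here is a minimal sketch of a numerically stable implementation in Python with NumPy (the language and the function name are our choice, not part of the course material); it treats large positive and negative arguments separately, exactly as suggested.

    import numpy as np

    def sigmoid(z):
        # Numerically stable logistic function sigma(z) = e^z / (1 + e^z).
        z = np.asarray(z, dtype=float)
        out = np.empty_like(z)
        pos = z >= 0
        # For z >= 0 we have exp(-z) <= 1, so no overflow: sigma(z) = 1 / (1 + e^(-z)).
        out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
        # For z < 0 we have exp(z) < 1, so no overflow: sigma(z) = e^z / (1 + e^z).
        ez = np.exp(z[~pos])
        out[~pos] = ez / (1.0 + ez)
        return out

    print(sigmoid(np.array([-1000.0, 0.0, 1000.0])))  # [0. 0.5 1.], no overflow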

Consider the binary classification case and assume that our two class labels are {0, 1}. We proceed as follows. Given a training set Strain we learn a weight vector w (we will discuss how to do this shortly) and a "shift" (scalar) w0. Given a "new" feature vector x, we predict the (posterior) probability of the two class labels given x by means of

    p(1 | x, w) = σ(xᵀw + w0),
    p(0 | x, w) = 1 − σ(xᵀw + w0).
Note that we predict a real value (a probability) and not a label. This is the reason it is called logistic regression. But typically we use logistic regression as the first step of a classifier. In the second step we quantize the value to a binary value, typically according to whether the predicted probability is smaller or larger than 0.5.
So very large and very small (large negative) values of xᵀw + w0 correspond to probabilities p(1 | x, w) very close to 1 and 0, respectively.
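To make the two-step classifier explicit, here is a small sketch in Python with NumPy (a language choice of ours, not of these notes; expit is SciPy's numerically stable logistic function, and the remaining names are illustrative):

    import numpy as np
    from scipy.special import expit as sigmoid  # sigma(z), implemented stably

    def predict_proba(x, w, w0):
        # Step 1: p(1 | x, w) = sigma(x^T w + w0), a real value in (0, 1).
        return sigmoid(x @ w + w0)

    def predict_label(x, w, w0):
        # Step 2: quantize, i.e., output label 1 iff the probability exceeds 0.5.
        return int(predict_proba(x, w, w0) > 0.5)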
The following figure visualizes the probabilities obtained for a 2-D problem (taken from KPM Chapter 7). More precisely, this is a case with two features and hence two weights that we learn. We see the effect of changing the weight vector on the resulting probability function.
It is easy to see what the roles of w and w0 are. The vector w is orthogonal to the "surface of transition", and w0 allows us to shift the transition point along the vector w. E.g., if w = (1, 0) and w0 = 0 then the transition between the two levels happens at the x1 = 0 plane. By scaling w we can make the transition faster or slower, and by changing w0 we can shift the decision region along the vector w.
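A small numerical illustration of this (with made-up numbers; Python with NumPy): scaling w sharpens the transition, while w0 shifts it along w.

    import numpy as np
    from scipy.special import expit as sigmoid

    x = np.array([0.1, 0.7])              # a point slightly on the positive side of x1 = 0
    for scale in (1.0, 5.0, 25.0):
        w = scale * np.array([1.0, 0.0])  # scaling w makes the transition sharper
        print(scale, sigmoid(x @ w))      # approx. 0.525, 0.622, 0.924
    # Choosing w0 = -0.5 moves the transition plane from x1 = 0 to x1 = 0.5:
    print(sigmoid(x @ np.array([1.0, 0.0]) - 0.5))  # approx. 0.40, now on the "0" side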
At this point it is hopefully clear how we use logistic regression to do classification. To repeat, given the weight vector w we predict the probability of the class label 1 to be p(1 | x, w) = σ(xᵀw + w0) and then quantize. What we need to discuss next is how we learn the model, i.e., how we find a good weight vector w given some training set Strain.

A word about notation


In the beginning of this course we started with an arbitrary feature vector x. We then discussed that often it is useful to add the constant 1 to this feature vector, and we called the resulting vector x̃. We also discussed that often it is useful to add further features, and we then called the resulting vector φ(x). Note that in particular for logistic regression it is crucial that we have the constant term contained in x, since this allows us to "shift" the decision region.
We will assume from now on that the vector x always contains the constant term as well as any further features we care to add. This will save us from a flood of notation.
Hence, from now on we no longer need the extra term w0; the term xᵀw suffices, since x already contains the constant.
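In code, this convention simply amounts to prepending a column of ones to the data matrix. A minimal sketch (Python with NumPy; the helper name is ours):

    import numpy as np

    def add_constant(X):
        # Prepend a column of ones so that x^T w contains the former shift w0.
        return np.column_stack([np.ones(X.shape[0]), X])

    X = np.array([[0.5, 1.2],
                  [3.0, -0.7]])   # hypothetical 2-feature data
    print(add_constant(X))        # the first weight now plays the role of w0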

Training
As always we assume that we have our training set Strain, consisting of iid samples {(xn, yn)}, n = 1, …, N, sampled according to a fixed but unknown distribution D.
Exploiting that the samples (xn, yn) are independent, the probability of y (the vector of all labels) given X (the matrix of all inputs) and w (the weight vector) has a simple product form:

    p(y | X, w) = ∏_{n=1}^N p(yn | xn)
                = ∏_{n: yn=1} p(yn = 1 | xn) · ∏_{n: yn=0} p(yn = 0 | xn)
                = ∏_{n=1}^N σ(xnᵀw)^{yn} [1 − σ(xnᵀw)]^{1−yn}.

It is convenient to take the logarithm of this probability to bring it into an even simpler form. In addition we add a minus sign to the expression. In this way our objective will be to minimize the resulting cost function (rather than maximizing it). This is consistent with our previous examples, where we always minimized the cost function. We call the resulting cost function L(w),

    L(w) = − ∑_{n=1}^N yn ln σ(xnᵀw) + (1 − yn) ln[1 − σ(xnᵀw)]
         = ∑_{n=1}^N ln[1 + exp(xnᵀw)] − yn xnᵀw.

In the last step we have used the specific form of the logistic function σ(z) to bring the cost function into a nice form.
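The last expression translates directly into code. A sketch in Python with NumPy (the function name is ours); note that np.logaddexp(0, t) evaluates ln(1 + e^t) without overflow, in the spirit of the earlier footnote.

    import numpy as np

    def logistic_loss(w, X, y):
        # L(w) = sum_n ln(1 + exp(x_n^T w)) - y_n * x_n^T w
        t = X @ w   # vector of the N inner products x_n^T w
        return np.sum(np.logaddexp(0.0, t) - y * t)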
Before we continue, note the following. In principle we should have written down the likelihood of the data (y, X) given the parameter w, i.e., p(y, X | w). But

    p(y, X | w) = p(X | w) p(y | X, w)
                = p(X) p(y | X, w),

where in the second step we have made the natural assumption that the X data does not depend on the parameter we choose in our model. Note that this is an assumption and part of our model. But now note that the factor p(X) is a constant with respect to the choice of w, and hence plays no role when we apply the maximum likelihood criterion.

Maximum likelihood criterion


Recall what we did so far. Under the assumption that the samples are independent, we have written down the likelihood of the data given a particular choice of weights w. We then choose the weights w that maximize this likelihood.
Equivalently, we choose the weights that maximize the log-likelihood. This is called the maximum-likelihood criterion. In a final reformulation, we added a negative sign to bring the cost function to our standard form and called it L(w). In this form, we are looking for the weights w that minimize L(w). In formulae, we choose the weight w⋆ so that

    w⋆ = argmin_w L(w).

As we discussed in the context of the probabilistic interpretation of the least-squares problem, one justification of the maximum-likelihood criterion is that, under some mild technical conditions, it is consistent. I.e., if we assume that the data was generated according to a model in this class, we have iid samples, and we use this procedure to estimate the underlying parameter, then our estimate will converge to the true parameter as we get more and more data. Of course, in practice the data is unlikely to have been generated in this way and there might not be any probabilistic model underlying it. But nevertheless, this gives our method a theoretical justification.

Conditions of optimality
As we want to minimize L(w), let us look at the stationary points of this function by computing the gradient, setting it to zero, and solving for w. Note that

    ∂ ln[1 + exp(x)] / ∂x = σ(x).

Therefore

    ∇L(w) = ∑_{n=1}^N xn (σ(xnᵀw) − yn)
          = Xᵀ [σ(Xw) − y].

Recall that by our convention the matrix X has N rows, one per input sample. Further, y is the column vector of length N which represents the N labels corresponding to the samples. Therefore, Xw is a column vector of length N. The expression σ(Xw) means that we apply the function σ to each of the N components of Xw. In this way we can express the gradient in a compact form.
There is no closed-form solution for the equation ∇L(w) = 0. Let us therefore discuss how to solve this equation in an iterative fashion by using gradient descent or Newton's method.
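The compact matrix form makes the implementation a one-liner. A sketch (Python with NumPy; the names are ours):

    import numpy as np
    from scipy.special import expit as sigmoid

    def logistic_grad(w, X, y):
        # grad L(w) = X^T (sigma(Xw) - y)
        return X.T @ (sigmoid(X @ w) - y)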

Convexity
Since we are planning to iteratively minimize our cost function, it is good to know that this cost function is convex.

Lemma. The cost function

    L(w) = ∑_{n=1}^N ln[1 + exp(xnᵀw)] − yn xnᵀw

is convex in the weight vector w.

Proof. Recall that the sum (with non-negative weights) of any number of (strictly) convex functions is (strictly) convex. Note that L(w) is the sum of 2N functions. N of them have the form −yn xnᵀw, i.e., they are linear in w, and a linear function is convex. Therefore it suffices to show that the other N functions are convex as well. Let us consider one of those. It has the form ln[1 + exp(xnᵀw)]. Note that ln(1 + exp(x)) is convex in x: it has first derivative σ(x) and second derivative

    ∂² ln(1 + exp(x)) / ∂x² = ∂σ(x) / ∂x = σ(x)(1 − σ(x)),   (1)

which is non-negative. The proof is complete by noting that ln[1 + exp(xnᵀw)] is the composition of this convex function with a linear function of w, and is therefore convex.

Note: Alternatively, to prove that a function is convex (strictly convex) we can check that the Hessian (the matrix consisting of second derivatives) is positive semi-definite (positive definite). We will do this shortly.
Gradient descent
As we have done for other cost functions, we can apply a (stochastic) gradient descent algorithm to minimize our cost function. E.g., for the batch version we can implement the update equation

    w(t+1) := w(t) − γ(t) ∇L(w(t)),

where γ(t) > 0 is the step size and w(t) is the sequence of weight vectors.
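Putting the pieces together, here is a sketch of the batch update with a constant step size (Python with NumPy; the step size, iteration count, and names are illustrative choices of ours, not prescribed by the notes):

    import numpy as np
    from scipy.special import expit as sigmoid

    def logistic_gd(X, y, gamma=0.1, max_iters=1000):
        # Batch gradient descent on L(w), starting from w = 0.
        w = np.zeros(X.shape[1])
        for _ in range(max_iters):
            w = w - gamma * X.T @ (sigmoid(X @ w) - y)
        return w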

Newton’s method
The gradient method is a first-order method, i.e., it only uses the gradient (the first derivative). We get a more powerful optimization algorithm if we also use second-order terms. Of course there is a trade-off: on the one hand we need fewer steps to converge if we use second-order terms; on the other hand every iteration is more costly. Let us now describe a scheme that also makes use of second-order terms. It is called Newton's method.

Hessian of the Log-Likelihood


Let us compute the Hessian of the cost function L(w); call it H(w). What is the Hessian? If w has D components then this is the D × D symmetric matrix with entries

    H_{i,j} = ∂²L(w) / (∂wi ∂wj).

Recall that the cost function L(w) is a sum of N terms, all of the same form. So let us first compute the Hessian corresponding to one such term. We already computed the gradient of one such term and got

    xn (σ(xnᵀw) − yn).

Recall that this gradient is a vector of length D (the dimension of the feature vector x and hence also the dimension of the weight vector), where the i-th component is the derivative of that term with respect to wi. If you look at the above expression you see that this gradient is equal to xn (a vector) times the scalar (σ(xnᵀw) − yn). Note that xn does not depend on w and neither does yn. The only dependence on w is in the term σ(xnᵀw). Therefore, the Hessian associated to one term will be

    xn (∇σ(xnᵀw))ᵀ.

We have already seen that σ′(x) = σ(x)(1 − σ(x)). Therefore, by the chain rule one such term gives rise to the Hessian

    xn xnᵀ σ(xnᵀw)(1 − σ(xnᵀw)).

It remains to do the sum over all N samples. Rather than just summing, let us put this again in a compact form by using the data matrix X. We get

    H(w) = Xᵀ S X,

where S is an N × N diagonal matrix with diagonal entries

    S_nn := σ(xnᵀw)[1 − σ(xnᵀw)].

Note that the diagonal entries of S are non-negative. Hence H(w) is positive semi-definite. This gives us an alternative proof that our original cost function is convex.
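In code we can form H(w) and check positive semi-definiteness numerically. A sketch (Python with NumPy; the names are ours). Instead of materializing the diagonal matrix S, we scale the rows of X, which computes the same product X^T S X.

    import numpy as np
    from scipy.special import expit as sigmoid

    def logistic_hessian(w, X):
        # H(w) = X^T S X with S_nn = sigma(x_n^T w) * (1 - sigma(x_n^T w))
        s = sigmoid(X @ w)
        return (X * (s * (1.0 - s))[:, None]).T @ X

    # Sanity check of convexity: all eigenvalues of the symmetric H are >= 0, e.g.
    # np.linalg.eigvalsh(logistic_hessian(w, X)).min() >= -1e-12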

Newton’s Method
Gradient descent uses only first-order information and takes steps in the direction opposite to the gradient. This makes sense since the gradient points in the direction of increasing function values and we want to minimize the function.
Newton's method uses second-order information and takes steps in the direction that minimizes a quadratic approximation. More precisely, it approximates the function locally by a quadratic form and then moves in the direction where this quadratic form has its minimum. The update equation is of the form

    w(t+1) = w(t) − γ(t) (H(t))⁻¹ ∇L(w(t)),

where H(t) := H(w(t)).

Where does this update equation come from? Recall that the Taylor series approximation of a function (up to second-order terms) around a point w⋆ has the form

    L(w) ≈ L(w⋆) + ∇L(w⋆)ᵀ(w − w⋆) + (1/2)(w − w⋆)ᵀ H(w⋆)(w − w⋆).

The right-hand side is a local approximation of L(w). Assume that we take the right-hand side to be an exact representation of our cost function. We want to minimize this function. So let us look where the right-hand side takes its minimum value. If we think that this approximation is reasonably good, then it makes sense to move the new weight vector to the position of this minimum.
Let us take the gradient of the right-hand side and set it to zero. We get

    ∇L(w⋆) + H(w⋆)(w − w⋆) = 0.

Solving for w gives us w = w⋆ − H(w⋆)⁻¹ ∇L(w⋆). This corresponds exactly to the stated update equation, except that the update includes an extra step size γ. Why do we need this factor? Recall that the right-hand side is only an approximation. Caution therefore dictates that we only move part of the way towards the indicated minimum.
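A sketch of one damped Newton step (Python with NumPy; the names are ours). We solve the linear system H d = ∇L(w) rather than forming the inverse explicitly, which is cheaper and numerically safer.

    import numpy as np
    from scipy.special import expit as sigmoid

    def newton_step(w, X, y, gamma=1.0):
        # w_new = w - gamma * H(w)^{-1} grad L(w)
        s = sigmoid(X @ w)
        grad = X.T @ (s - y)
        H = (X * (s * (1.0 - s))[:, None]).T @ X   # H(w) = X^T S X
        return w - gamma * np.linalg.solve(H, grad)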

Regularized Logistic Regression


Although the cost function for logistic regression is lower-bounded by 0, we run into problems if the data is linearly separable. In this case there is no finite weight vector w which achieves the minimum of the cost function, and if we continue to run the optimization the weights will tend to infinity.
To avoid this problem, as for standard regression problems, we can add a penalty term. E.g., we consider the cost function

    argmin_w − ∑_{n=1}^N ln p(yn | xnᵀw) + (λ/2) ‖w‖².
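The penalized cost and its gradient change only by the λ-terms. A sketch (Python with NumPy; the names are ours):

    import numpy as np
    from scipy.special import expit as sigmoid

    def reg_logistic_loss_grad(w, X, y, lam):
        # Penalized cost  L(w) + (lam/2) * ||w||^2  and its gradient.
        t = X @ w
        loss = np.sum(np.logaddexp(0.0, t) - y * t) + 0.5 * lam * (w @ w)
        grad = X.T @ (sigmoid(t) - y) + lam * w
        return loss, grad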
