Lecture 8 Logistic Regression
Outline
1. The classification problem
2. Why not linear regression?
3. Logistic regression
4. Logistic regression cost function
5. Worked examples
The classification problem
The linear regression model discussed in the previous lesson assumes that the
response variable 𝑦 is quantitative (metrical)
• in many situations, the response variable is instead qualitative (categorical)
Qualitative variables take values in an unordered set 𝒞 = {"cat₁", …, "cat_K"}, such as:
• eye color ∈ {"brown", "blue", "green"}
• email ∈ {"spam", "not spam"}
The classification problem
The process of estimating categorical outcomes using a set of regressors 𝝋 is called
classification
Often we are more interested in estimating the probabilities that 𝝋 belongs to each
category in 𝒞
The most probable category is then chosen as the class for the observation 𝝋
Examples of classification problems
• A person arrives at the emergency room with a set of symptoms that could possibly
be attributed to one of three medical conditions
Which of the three conditions does the individual have?
• A biologist collects DNA sequence data for a number of patients with and without
a given disease
which DNA mutations are deleterious (disease-causing) and which are not?
Example: cat vs dog classification
Suppose that we measure the weight and height of some dogs and cats
We want to learn a classifier function 𝑓(⋅) that can tell us if a given input vector
𝝋 = [𝜑₁, 𝜑₂]ᵀ is a dog or a cat
• 𝜑₁: weight
• 𝜑₂: height
[Figure: scatter plot of height (cm) against weight, with the "Cats" and "Dogs" clusters separated by the classifier function 𝑓(⋅)]
QUIZ: Consider a company that produces sliding gates. The gates can have four
possible weights (300 kg, 400 kg, 500 kg, 600 kg). We want to detect the weight of the
gate. This is a:
❑ A regression problem
❑ A classification problem
Why not linear regression?
Suppose that we are trying to estimate the medical condition of a patient in the
emergency room based on her symptoms
There are three possibilities: stroke, drug overdose and epileptic seizure
We could consider encoding these values as a quantitative response variable, 𝑦, as
1 if stroke
𝑦 = <2 if drug overdose
3 if epileptic seizure
However, we are implicitly saying that the «difference» between drug overdose and
stroke is the same as the «difference» between epileptic seizure and drug
overdose, which does not make much sense
10 /33
Why not linear regression?
We can also change the encoding to
    ⎧ 1 if epileptic seizure
𝑦 = ⎨ 2 if stroke
    ⎩ 3 if drug overdose
This would imply a totally different relationship among the three conditions
• each of these codings would produce fundamentally different linear models…
• …that would ultimately lead to different sets of estimates on test observations
In general, there is no natural way to convert a qualitative response variable with more
than two levels into a quantitative response that is ready for linear regression
Why not linear regression?
With two levels, the situation is better. For instance, perhaps there are only two
possibilities for the patient’s medical condition: stroke and drug overdose
    ⎧ 0 if stroke
𝑦 = ⎨
    ⎩ 1 if drug overdose
We can fit a linear regression to this binary response, and classify as drug overdose if
𝑦̂ > 0.5 and stroke otherwise, interpreting 𝑦̂ as the probability of drug overdose
However, if we use linear regression, some of our estimates might be outside the [0, 1]
interval, which does not make sense as a probability: there is nothing that “saturates” the
output between 0 and 1. This is exactly what the logistic function (sigmoid) provides
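To make the saturation point concrete, here is a minimal NumPy sketch on toy data (the data and variable names are illustrative, not from the slides): a least-squares fit to 0/1 labels produces "probabilities" above 1, while passing the same scores through the logistic function keeps them strictly inside (0, 1).

```python
import numpy as np

# Toy binary data: small feature values -> class 0, large -> class 1
phi = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

# Least-squares linear fit: y_hat = theta0 + theta1 * phi
Phi = np.column_stack([np.ones_like(phi), phi])
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat_linear = Phi @ theta

# The same scores squashed through the logistic function stay inside (0, 1)
y_hat_squashed = 1.0 / (1.0 + np.exp(-(Phi @ theta)))

print(y_hat_linear.max())    # exceeds 1: not interpretable as a probability
print(y_hat_squashed.max())  # below 1: always a valid probability
```

The outlying point at 𝜑 = 10 is what drags the linear prediction past 1; the sigmoid is immune to this by construction.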
Outline
5. Worked examples
13 /33
Logistic regression
Purpose: estimate the probability that a set of input regressors 𝝋 ∈ ℝ^{𝑑×1} belongs to one
of two classes 𝑦 ∈ {0, 1}
The model first forms the linear combination
𝑎 = Σⱼ 𝜑ⱼ ⋅ 𝜃ⱼ = 𝝋ᵀ𝜽
and then passes it through the logistic function (sigmoid)
𝑠(𝑎) = 1 / (1 + 𝑒⁻ᵃ) = 𝑒ᵃ / (1 + 𝑒ᵃ)
• 𝑎 ≫ 0 ⇒ 𝑠(𝑎) ≈ 1
• 𝑎 ≪ 0 ⇒ 𝑠(𝑎) ≈ 0
[Figure: the sigmoid curve, equal to 0.5 at 𝑎 = 0 and saturating at 0 and 1]
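A possible NumPy implementation of the sigmoid (the function name and the stabilization trick are not from the slides — a naive `1 / (1 + exp(-a))` overflows for very negative `a`, so each sign is handled separately, a common practice):

```python
import numpy as np

def sigmoid(a):
    """Logistic function s(a) = 1 / (1 + e^(-a)), computed stably.

    e^(-a) is evaluated only where a >= 0, and e^(a) only where a < 0,
    so np.exp is never called on a large positive argument.
    """
    a = np.atleast_1d(np.asarray(a, dtype=float))
    out = np.empty_like(a)
    pos = a >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-a[pos]))
    ea = np.exp(a[~pos])
    out[~pos] = ea / (1.0 + ea)
    return out
```

`sigmoid(0.0)` gives 0.5, and for 𝑎 ≫ 0 or 𝑎 ≪ 0 the output saturates at 1 and 0, matching the two bullet points above.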
Logistic regression
Purpose: estimate the probability that a set of input regressors 𝝋 ∈ ℝ^{𝑑×1} belongs to one
of two classes 𝑦 ∈ {0, 1}
𝑃(𝑦 = 1 | 𝝋) = 𝑠(𝑎) = 𝑠(𝝋ᵀ𝜽) = 1 / (1 + 𝑒^{−𝝋ᵀ𝜽})
Logistic regression cost function
Suppose we have at our disposal a dataset 𝒟 = {(𝝋(1), 𝑦(1)), …, (𝝋(𝑁), 𝑦(𝑁))}, where
𝝋(𝑖) ∈ ℝ^{𝑑×1} and 𝑦(𝑖) ∈ {0, 1}, 𝑖 = 1, …, 𝑁, i.i.d.
Estimate a logistic regression model 𝑃(𝑦(𝑖) = 1 | 𝝋(𝑖)) = 1 / (1 + 𝑒^{−𝝋(𝑖)ᵀ𝜽}) ≡ 𝜋(𝑖)
The parameters 𝜽 are estimated by minimizing the cost function
𝐽(𝜽) = − Σᵢ₌₁ᴺ [ 𝑦(𝑖) ⋅ ln 𝜋(𝑖) + (1 − 𝑦(𝑖)) ⋅ ln(1 − 𝜋(𝑖)) ]
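The cost above translates almost line-for-line into NumPy (a sketch; the helper names `cost` and `sigmoid` are mine, and the tiny dataset is made up for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cost(theta, Phi, y):
    """Cross-entropy cost J(theta) = -sum_i [ y(i) ln pi(i) + (1-y(i)) ln(1-pi(i)) ].

    Phi: (N, d) matrix whose i-th row is the regressor vector phi(i)^T
    y:   (N,) vector of 0/1 labels
    """
    pi = sigmoid(Phi @ theta)  # pi(i) = P(y(i) = 1 | phi(i))
    return -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

# With theta = 0 every pi(i) = 0.5, so J = N * ln 2 regardless of the labels
Phi = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
J0 = cost(np.zeros(2), Phi, y)
print(J0)  # 3 * ln 2 ≈ 2.079
```

The 𝜽 = 𝟎 check is a handy sanity test: any implementation that does not return 𝑁 ⋅ ln 2 there has a bug.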
Logistic regression cost function
QUIZ: In the logistic regression cost function, where are the parameters 𝜽 that we
want to estimate?
𝐽(𝜽) = − Σᵢ₌₁ᴺ [ 𝑦(𝑖) ⋅ ln 𝜋(𝑖) + (1 − 𝑦(𝑖)) ⋅ ln(1 − 𝜋(𝑖)) ]
❑ In the 𝑦(𝑖) terms
❑ In the ln terms
❑ In the 𝜋(𝑖) terms
Logistic regression cost function
Cost function interpretation
Suppose there is only one datum 𝒟 = {(𝝋, 𝑦)}
         ⎧ − ln 𝜋        if 𝑦 = 1
⇒ 𝐽(𝜽) = ⎨
         ⎩ − ln(1 − 𝜋)   if 𝑦 = 0
Case 𝑦 = 1: 𝐽(𝜽) = − ln 𝜋
• 𝐽(𝜽) ≈ 0  if 𝑦 = 1 and 𝜋 ≈ 1
• 𝐽(𝜽) → +∞ if 𝑦 = 1 and 𝜋 ≈ 0
Logistic regression cost function
Cost function interpretation
Suppose there is only one datum 𝒟 = {(𝝋, 𝑦)}
         ⎧ − ln 𝜋        if 𝑦 = 1
⇒ 𝐽(𝜽) = ⎨
         ⎩ − ln(1 − 𝜋)   if 𝑦 = 0
Case 𝑦 = 0: 𝐽(𝜽) = − ln(1 − 𝜋)
• 𝐽(𝜽) ≈ 0  if 𝑦 = 0 and 𝜋 ≈ 0
• 𝐽(𝜽) → +∞ if 𝑦 = 0 and 𝜋 ≈ 1
IN-DEPTH ANALYSIS
Computation of the minimum of 𝐽(𝜽)
We have to compute the gradient of 𝐽(𝜽) with respect to 𝜽 ∈ ℝ^{𝑑×1}. First, compute the
derivative of 𝑠(𝑎) = 1 / (1 + 𝑒⁻ᵃ)
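Worked out explicitly (a standard computation, spelled out here since the slide leaves it implicit):

```latex
\frac{ds(a)}{da}
  = \frac{d}{da}\,(1 + e^{-a})^{-1}
  = \frac{e^{-a}}{(1 + e^{-a})^{2}}
  = \underbrace{\frac{1}{1 + e^{-a}}}_{s(a)} \cdot
    \underbrace{\frac{e^{-a}}{1 + e^{-a}}}_{1 - s(a)}
  = s(a)\,\bigl(1 - s(a)\bigr)
```

Applied to 𝜋(𝑖) = 𝑠(𝝋(𝑖)ᵀ𝜽), the chain rule then gives ∇𝜽 𝜋(𝑖) = 𝝋(𝑖) ⋅ 𝜋(𝑖)(1 − 𝜋(𝑖)), which is the ingredient that makes the gradient of 𝐽(𝜽) collapse to a compact form.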
IN-DEPTH ANALYSIS
Computation of the minimum of 𝐽(𝜽)
We can now compute the gradient of 𝐽(𝜽), using 𝜋(𝑖) = 1 / (1 + 𝑒^{−𝝋(𝑖)ᵀ𝜽}) and
∇𝜽 𝜋(𝑖) = 𝝋(𝑖) ⋅ 𝜋(𝑖)(1 − 𝜋(𝑖)):

𝐽(𝜽) = − Σᵢ₌₁ᴺ [ 𝑦(𝑖) ln 𝜋(𝑖) + (1 − 𝑦(𝑖)) ln(1 − 𝜋(𝑖)) ]

∇𝜽 𝐽(𝜽) = − Σᵢ₌₁ᴺ [ 𝑦(𝑖) ⋅ ∇𝜽 𝜋(𝑖) / 𝜋(𝑖) − (1 − 𝑦(𝑖)) ⋅ ∇𝜽 𝜋(𝑖) / (1 − 𝜋(𝑖)) ]
        = − Σᵢ₌₁ᴺ [ 𝑦(𝑖) 𝝋(𝑖)(1 − 𝜋(𝑖)) − (1 − 𝑦(𝑖)) 𝝋(𝑖) 𝜋(𝑖) ]
        = Σᵢ₌₁ᴺ 𝝋(𝑖) ⋅ [ −𝑦(𝑖) + 𝑦(𝑖)𝜋(𝑖) + 𝜋(𝑖) − 𝑦(𝑖)𝜋(𝑖) ]
        = Σᵢ₌₁ᴺ 𝝋(𝑖) ⋅ (𝜋(𝑖) − 𝑦(𝑖))    ∈ ℝ^{𝑑×1}
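The closed-form gradient Σᵢ 𝝋(𝑖)(𝜋(𝑖) − 𝑦(𝑖)) can be checked numerically against a finite difference of 𝐽(𝜽). A sketch on synthetic data (all names and the data are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cost(theta, Phi, y):
    pi = sigmoid(Phi @ theta)
    return -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))

def gradient(theta, Phi, y):
    """Gradient of J: sum_i phi(i) * (pi(i) - y(i)), i.e. Phi^T (pi - y)."""
    pi = sigmoid(Phi @ theta)
    return Phi.T @ (pi - y)

# Synthetic problem: intercept column + two random regressors
rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = (Phi[:, 1] > 0).astype(float)
theta = rng.normal(size=3)
g = gradient(theta, Phi, y)
```

Each component of `g` should agree with the central difference (J(θ + εeⱼ) − J(θ − εeⱼ)) / 2ε to within numerical precision — a quick test worth running whenever the gradient is implemented by hand.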
Gradient descent
It can be shown that:
• The cost function 𝐽(𝜽) is convex and admits a unique minimum
• The equations found by posing ∇𝜽 𝐽(𝜽) = 𝟎 are nonlinear in 𝜽, and it is not possible to
find a solution in closed form
✓ For this reason, we need to resort to iterative optimization algorithms, such as
gradient descent:
𝜽(𝑘 + 1) = 𝜽(𝑘) − 𝛼 ⋅ ∇𝜽 𝐽(𝜽)|𝜽 = 𝜽(𝑘) ,    𝛼 > 0: learning rate
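The update rule above fits in a few lines of NumPy. A self-contained sketch on a toy 1-D problem (function name, step size, iteration count, and data are all illustrative choices, not prescribed by the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, y, alpha, n_iter):
    """Batch gradient descent: theta(k+1) = theta(k) - alpha * grad J(theta(k))."""
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(Phi @ theta)            # pi(i) = P(y(i) = 1 | phi(i))
        theta -= alpha * (Phi.T @ (pi - y))  # gradient step
    return theta

# Toy 1-D data: label switches from 0 to 1 between phi = 2 and phi = 3
phi = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Phi = np.column_stack([np.ones_like(phi), phi])  # intercept column + feature
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta = fit_logistic(Phi, y, alpha=0.05, n_iter=5000)
pred = (sigmoid(Phi @ theta) >= 0.5).astype(float)  # classify at threshold 0.5
```

The learned weight on 𝜑 comes out positive and the fitted boundary lands between 2 and 3, so the training points are all classified correctly.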
Gradient descent
𝐽(𝜽) = − Σᵢ₌₁ᴺ [ 𝑦(𝑖) ⋅ ln 𝜋(𝑖) + (1 − 𝑦(𝑖)) ⋅ ln(1 − 𝜋(𝑖)) ]
Repeat {
    𝜃₀ ← 𝜃₀ − 𝛼 ⋅ Σᵢ₌₁ᴺ (𝜋(𝑖) − 𝑦(𝑖))
    𝜃₁ ← 𝜃₁ − 𝛼 ⋅ Σᵢ₌₁ᴺ (𝜋(𝑖) − 𝑦(𝑖)) ⋅ 𝜑₁(𝑖)
    ⋮
}
(all the 𝜃ⱼ are updated simultaneously; the 𝜃₀ update has no regressor factor because
the intercept corresponds to 𝜑₀(𝑖) = 1)
Logistic regression recap
The logistic regression model, despite its name, is not used for regression, but for
classification
Once the model estimates the probability of a class, we can classify a point to a particular
class if the probability for that class is above a threshold (usually 0.5)
𝑠(𝝋ᵀ𝜽) = 1 / (1 + 𝑒^{−𝝋ᵀ𝜽})
Logistic regression recap
The classification boundary found by logistic regression is linear
In fact, classifying with the rule
𝑦 = 1 if 𝑠(𝝋ᵀ𝜽) ≥ 0.5
is equivalent to classifying with 𝑦 = 1 if 𝝋ᵀ𝜽 ≥ 0, since 𝑠(𝑎) ≥ 0.5 exactly when 𝑎 ≥ 0:
the decision boundary 𝝋ᵀ𝜽 = 0 is a hyperplane (a line, in the weight–height plane)
[Figure: the cat vs dog data, height (cm) against weight (kg), with the "Cats" and "Dogs" clusters separated by the linear classifier]
Students admissions classification
We want to estimate if a student will get admitted to a university given the results on
two exams (𝜑₁, 𝜑₂)
• The training set consists of 𝑁 = 100 students with 𝜑₁(𝑖), 𝜑₂(𝑖) and 𝑦(𝑖) ∈ {0, 1},
for 𝑖 = 1, …, 𝑁

% Read data from file
data = load('studentsdata.csv');
Phi = data(:, [1, 2]); y = data(:, 3);
% Setup the data matrix appropriately, and add ones for the intercept term
[N, m] = size(Phi); d = m + 1;
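The slide's snippet is MATLAB; a NumPy equivalent of the same setup is sketched below. Since the course's `studentsdata.csv` is not available here, synthetic exam scores and a made-up admission rule stand in for the real file:

```python
import numpy as np

# Stand-in for studentsdata.csv (not available here): each row is one
# student's two exam scores; the 0/1 label follows a hypothetical rule
rng = np.random.default_rng(1)
scores = rng.uniform(30.0, 100.0, size=(100, 2))   # phi_1(i), phi_2(i)
y = (scores.sum(axis=1) > 130.0).astype(float)     # y(i) in {0, 1}

# Set up the data matrix and add ones for the intercept term,
# mirroring the MATLAB snippet in the slide
N, m = scores.shape
d = m + 1
Phi = np.column_stack([np.ones(N), scores])        # shape (N, d) = (100, 3)
```

`np.column_stack` plays the role of MATLAB's concatenation with a ones column; from here `Phi` and `y` plug directly into the gradient-descent update from the previous slides.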
The Framingham Heart Study
In the late 1940s, the U.S. Government set out to better understand cardiovascular disease
The city of Framingham (MA) was selected as the site for the study in 1948
• Appropriate size
• Stable population
• Cooperative doctors and residents
A total of 5209 patients aged 30-59 were enrolled. They had to complete a survey and
take an exam every 2 years:
• Physical characteristics and behavioral characteristics
• Test results
The Framingham Heart Study
We will build models using the Framingham data to estimate and prevent heart disease
The Framingham Heart Study
Demographic risk factors
• male: sex of patient
• age: age in years at first examination
• education: some high school (1), high school (2), some college (3), college (4)
The Framingham Heart Study
Risk factors from first examination
• totChol: Total cholesterol (mg/dL)
• sysBP: Systolic blood pressure (mmHg)
• diaBP: Diastolic blood pressure (mmHg)
• BMI: Body Mass Index (kg/m²)
• heartRate: Heart rate (beats/minute)
• glucose: Blood glucose level (mg/dL)
Use logistic regression to estimate whether or not a patient experienced CHD within
10 years of the first examination
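The Framingham data itself is not distributed with these slides, but the whole pipeline — probability model, gradient descent, thresholding — can be sketched end-to-end on synthetic stand-in data. Everything below (the three predictors, the "true" weights used to generate labels, step size, iteration count) is an illustrative assumption, not the actual study result:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Synthetic stand-in for standardized risk factors (e.g. age, sysBP, totChol)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
Phi = np.column_stack([np.ones(len(X)), X])        # intercept + risk factors

# Hypothetical ground truth used only to generate labels:
# y(i) = 1 means "CHD within 10 years", risk growing with each factor
true_theta = np.array([-1.0, 0.8, 1.2, 0.6])
y = (rng.uniform(size=len(X)) < sigmoid(Phi @ true_theta)).astype(float)

# Fit by batch gradient descent on the cross-entropy cost
theta = np.zeros(Phi.shape[1])
for _ in range(5000):
    theta -= 0.001 * (Phi.T @ (sigmoid(Phi @ theta) - y))

# Classify "CHD within 10 years" when the estimated probability >= 0.5
pred = (sigmoid(Phi @ theta) >= 0.5).astype(float)
accuracy = float(np.mean(pred == y))
```

Because the labels are themselves drawn from a logistic model, the fitted weights come out close to `true_theta`, and the accuracy is limited only by the intrinsic label noise — a useful mental model for what logistic regression can and cannot recover on real risk-factor data.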
The Framingham Heart Study