Logistic Regression Lecture Notes
Logistic Regression
In the last module, you learnt Linear Regression, which is a supervised regression model. In other words,
linear regression allows you to make predictions from labelled data when the target (output) variable is
numeric.
In this module, you moved to the next step, i.e., Logistic Regression. Logistic Regression is a
supervised classification model. It allows you to make predictions from labelled data when the target (output)
variable is categorical.
Binary Classification
You first learnt what binary classification is. Basically, it is a classification problem in which the target
variable has only two possible values, i.e., two classes. Some examples of binary classification
are –
1. A bank wants to predict, based on some variables, whether a particular customer will default on a
loan or not
2. A factory manager wants to predict, based on some variables, whether a particular machine will
break down in the next month or not
3. Google’s backend wants to predict, based on some variables, whether an incoming email is spam or
not
You then saw the diabetes example, which was discussed in detail. In this example, you try to predict
whether a person has diabetes or not, based on that person’s blood sugar level.
You saw why a simple decision-boundary approach does not work very well for this example. It would be
too risky to decide the class purely on the basis of a cut-off, since, especially in the middle range of blood
sugar values, the patients could belong to either class, diabetic or non-diabetic.
Hence, instead of assigning a class directly, you model the probability of diabetes with the sigmoid function –
P(Diabetes) = 1 / (1 + e^-(β0 + β1x))
Likelihood
The next step, just like linear regression, would be to find the best fit curve. Hence, you learnt that in order
to find the best fit sigmoid curve, you need to vary β0 and β1 until you get the combination of beta values
that maximises the likelihood. For the diabetes example, likelihood is given by the expression –
Likelihood = (1-P1)(1-P2)(1-P3)(1-P4)(P5)(1-P6)(P7)(P8)(P9)(P10)
Here, Pi is the predicted probability of diabetes for the i-th patient; the likelihood multiplies Pi for the
patients who actually have diabetes and (1 - Pi) for those who do not.
This process, where you vary the betas, until you find the best fit curve for probability of diabetes, is called
logistic regression.
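A minimal sketch of this maximum-likelihood fit in Python, using statsmodels (the blood sugar values and labels below are made up purely for illustration and follow the pattern of the likelihood expression above):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: blood sugar levels and diabetes labels (1 = diabetic).
blood_sugar = np.array([85, 90, 100, 110, 120, 140, 150, 170, 180, 200])
diabetes = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# Add the intercept term (beta_0) and fit a logistic regression by
# maximum likelihood using a binomial GLM with the default logit link.
X = sm.add_constant(blood_sugar)
model = sm.GLM(diabetes, X, family=sm.families.Binomial()).fit()

print(model.params)       # beta_0 and beta_1 that maximise the likelihood
print(model.predict(X))   # fitted P(Diabetes) for each person
```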
Then, you saw a simpler way of interpreting the equation for logistic regression. You saw that the following
linearised equation is much easier to interpret –
ln(P / (1 - P)) = β0 + β1x
The left-hand side of this equation is what is called the log odds. Basically, the odds of having diabetes, P / (1 - P),
indicate how much more likely a person is to have diabetes than to not have it. For example, a person for
whom the odds of having diabetes are equal to 3 is 3 times more likely to have diabetes than to not have
it. In other words, P(Diabetes) = 3*P(No diabetes).
Also, you saw how the odds vary with x. Basically, with every linear increase in x, the odds increase
multiplicatively, by a constant factor of e^β1 per unit of x. For example, in the diabetes case, after every increase
of 11.5 in the value of x, the odds get approximately doubled, i.e., increase by a multiplicative factor of around 2.
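As a short worked check of this doubling behaviour (taking β1 ≈ 0.06 purely as an illustrative value, not the exact coefficient from the lecture):

$$
\frac{\text{odds}(x + 11.5)}{\text{odds}(x)}
= \frac{e^{\beta_0 + \beta_1 (x + 11.5)}}{e^{\beta_0 + \beta_1 x}}
= e^{11.5\,\beta_1}
\approx e^{11.5 \times 0.06}
\approx 2
$$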
In this session, you learnt how to build a multivariate logistic regression model in Python. The equation for
multivariate logistic regression is basically just an extension of the univariate equation –
ln(P / (1 - P)) = β0 + β1x1 + β2x2 + ... + βnxn
The example used for building the multivariate model was the Telecom Churn example. Basically, you
learnt how Python can be used to estimate the probability of a customer churning, based on the values of 21
predictor variables, like monthly charges, paperless billing, etc.
First, the data, which was present in 3 separate CSV files, was imported. After creating a merged master data
set, one that contains all 21 variables, data preparation was carried out (a sketch of typical preparation steps is shown below).
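The exact preparation code is not reproduced in these notes; the following is a minimal sketch of the kind of steps involved, assuming hypothetical file names (churn_data.csv, customer_data.csv, internet_data.csv) and column names (customerID, Churn, tenure, MonthlyCharges, TotalCharges):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Import the three CSV files and merge them on the customer identifier
# to build the master data set (file and column names are assumed here).
churn = pd.read_csv("churn_data.csv")
customer = pd.read_csv("customer_data.csv")
internet = pd.read_csv("internet_data.csv")
telecom = churn.merge(customer, on="customerID").merge(internet, on="customerID")

# Convert the target to 0/1, drop the identifier, and create dummy
# variables for the categorical predictors.
telecom["Churn"] = telecom["Churn"].map({"Yes": 1, "No": 0})
telecom = telecom.drop("customerID", axis=1)
dummies = pd.get_dummies(telecom.select_dtypes(include="object"), drop_first=True).astype(int)
telecom = pd.concat([telecom.select_dtypes(exclude="object"), dummies], axis=1)

# Split into training and test sets, then scale the continuous columns.
X = telecom.drop("Churn", axis=1)
y = telecom["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100)

scaler = StandardScaler()
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]   # assumed numeric columns
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```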
After all of this was done, a logistic regression model was built in Python using the GLM() function from the
statsmodels library. This model contained all the variables, some of which had insignificant coefficients.
Hence, some of these variables were removed, first using an automated approach, i.e. RFE, and then a
manual approach based on the VIFs and p-values.
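A minimal sketch of what this coarse-to-fine elimination can look like, reusing X_train and y_train from the preparation sketch above (the choice of 15 features is illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Automated step: keep the 15 strongest predictors according to RFE.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=15)
rfe = rfe.fit(X_train, y_train)
selected_cols = X_train.columns[rfe.support_]

# Manual step: inspect VIFs (together with the p-values from the fitted
# statsmodels summary) and drop variables that are highly multicollinear
# or insignificant, refitting the model after each drop.
vif = pd.DataFrame({
    "feature": selected_cols,
    "VIF": [variance_inflation_factor(X_train[selected_cols].values, i)
            for i in range(len(selected_cols))],
}).sort_values("VIF", ascending=False)
print(vif)
```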
Code along the following lines was used in statsmodels to build the logistic regression model.
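This is a minimal sketch of the statsmodels call, assuming the training data and the RFE-selected columns from the sketches above:

```python
import statsmodels.api as sm

# Add the intercept column and fit a binomial GLM (the default link for
# the Binomial family is the logit, i.e., logistic regression).
X_train_sm = sm.add_constant(X_train[selected_cols])
logm = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
res = logm.fit()

# The summary lists each coefficient with its p-value, which drives the
# manual elimination described above.
print(res.summary())

# Predicted churn probabilities on the training set.
y_train_pred = res.predict(X_train_sm)
```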
You first learnt what a confusion matrix is. It is basically a matrix showing the counts of all the actual
and predicted labels. For a binary problem such as churn, it has the following layout (the actual counts from the case study are not reproduced here):

                          Predicted: No churn    Predicted: Churn
Actual: No churn          True Negatives         False Positives
Actual: Churn             False Negatives        True Positives

From the confusion matrix, you can see that the correctly predicted labels are present in the first row, first
column and the last row, last column. Hence, we defined accuracy as –
Accuracy = (True Positives + True Negatives) / Total number of predictions
For your model, you got an accuracy of about 80%, which seemed good, but when you relooked at the confusion
matrix, you saw that there were a lot of misclassifications going on. Hence, we brought in two new metrics,
i.e. Sensitivity and Specificity. They were defined as follows:
Sensitivity = True Positives / (True Positives + False Negatives)
Specificity = True Negatives / (True Negatives + False Positives)
You found out that your specificity was good (~89%) but your sensitivity was only about 53%. Hence, this needed
to be taken care of.
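A sketch of how these numbers can be computed from the predicted probabilities (using a 0.5 cut-off to start with, and the y_train and y_train_pred variables from the sketches above):

```python
from sklearn.metrics import confusion_matrix

# Classify a customer as churn if the predicted probability exceeds 0.5.
y_train_pred_label = (y_train_pred > 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_train, y_train_pred_label).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate / recall
specificity = tn / (tn + fp)   # true negative rate
print(accuracy, sensitivity, specificity)
```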
ROC Curve
You had got a sensitivity of only about 53%, mainly because of the cut-off of 0.5 that you had
arbitrarily chosen. This cut-off now had to be optimised in order to get a decent value of sensitivity,
and this is where the ROC curve came in. You first saw what the True Positive Rate (TPR) and the False Positive Rate
(FPR) are. They were defined as follows –
True Positive Rate (TPR) = True Positives / (True Positives + False Negatives)
False Positive Rate (FPR) = False Positives / (False Positives + True Negatives)
When you plotted the true positive rate against the false positive rate, you got a graph showing the trade-off
between them; this curve is known as the ROC curve. You plotted this curve for your case study (a sketch of how it can be generated is shown below).
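A minimal sketch of generating the ROC curve, continuing with y_train and y_train_pred from above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# FPR and TPR for every possible cut-off on the predicted probabilities.
fpr, tpr, thresholds = roc_curve(y_train, y_train_pred)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_train, y_train_pred):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # the 45-degree "no skill" line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```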
Then you also plotted the accuracy, sensitivity, and specificity against the cut-off.
From this plot, you concluded that the optimal cut-off for the model was around 0.3, and you chose this value to
be your threshold and got decent values of all the three metrics – Accuracy (~77%), Sensitivity (~78%), and
Specificity (~77%).
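A sketch of how that cut-off plot can be generated (again continuing with y_train and y_train_pred from above):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Compute accuracy, sensitivity and specificity for a range of cut-offs.
rows = []
for cutoff in np.arange(0.1, 1.0, 0.1):
    pred = (y_train_pred > cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_train, pred).ravel()
    rows.append({
        "cutoff": round(cutoff, 1),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    })

metrics = pd.DataFrame(rows)
metrics.plot(x="cutoff")   # the region where the curves meet suggests the cut-off
plt.show()
```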
And similar to what you did for sensitivity and specificity, you also plotted a trade-off curve between
precision and recall, where Precision = True Positives / (True Positives + False Positives) and
Recall = True Positives / (True Positives + False Negatives).
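A minimal sketch of that precision-recall trade-off plot, with the same variables as above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_train, y_train_pred)

# Plot precision and recall against the cut-off to see where they cross.
plt.plot(thresholds, precision[:-1], label="precision")
plt.plot(thresholds, recall[:-1], label="recall")
plt.xlabel("cut-off")
plt.legend()
plt.show()
```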
Recall the Telecom business problem. The data used to build the model was from 2014. You split the original data into
two parts, training data and test data; however, both of these parts contained data from 2014.
This is called in-sample validation. Testing your model only on this test data may not be enough, as the test data is
too similar to the training data.
So, it makes sense to also test the model on data from some other time period, such as 2016. This is called out-of-time
validation.
The third approach is k-fold cross-validation. Basically, with k = 3, there are 3 iterations in which evaluation is done.
In the first iteration, 2/3rd of the data is used as training data and the remaining 1/3rd is held out as test data. In the
next iteration, a different 1/3rd of the data is held out as the test set, the model is built on the remaining 2/3rd and then
evaluated. Similarly, the third iteration is completed, so that every data point is used for testing exactly once.
Such an approach is necessary if the data you have for model building is very small, i.e., has very few data points.
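A minimal sketch of 3-fold cross-validation using sklearn (reusing the X and y from the preparation sketch above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 3-fold cross-validation: the model is trained on 2/3 of the data and
# evaluated on the remaining 1/3, three times, so that every fold is
# used for testing exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=3, scoring="accuracy")
print(scores, scores.mean())
```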
If these three methods of validation are still unclear to you, you need not worry as of now. They will be covered at
length in Course 4 (Predictive Analytics II).
Model Stability
Broadly, a model is considered stable if its performance on unseen data is comparable to its performance on the
training data and its coefficients do not change drastically when the training data changes slightly. Again, if stability
is still a little cloudy, you need not worry. It will also be covered at length in Course 4 (Predictive Analytics II).
Disclaimer: All content and material on the upGrad website is copyrighted material,
either belonging to upGrad or its bonafide contributors and is purely for the
dissemination of education. You are permitted to access print and download extracts
from this site purely for your own education only and on the following basis:
• You can download this document from the website for self-use only.
• Any copies of this document, in part or full, saved to disc or to any other storage
medium may only be used for subsequent, self-viewing purposes or to print
an individual extract or copy for non-commercial personal use only.
• Any further dissemination, distribution, reproduction, copying of the content of
the document herein or the uploading thereof on other websites or use of
the content for any other commercial/unauthorised purposes in any way
which could infringe the intellectual property rights of upGrad or its
contributors, is strictly prohibited.
• No graphics, images or photographs from any accompanying text in this
document will be used separately for unauthorised purposes.
• No material in this document will be modified, adapted or altered in any way.
• No part of this document or upGrad content may be reproduced or stored in
any other web site or included in any public or private electronic retrieval
system or service without upGrad’s prior written permission.
• Any rights not expressly granted in these terms are reserved.