Logistic Regression Lecture Notes

Uploaded by

Pankaj Pandey

Lecture Notes

Logistic Regression
In the last module, you learnt linear regression, which is a supervised regression model: it lets you make
predictions from labelled data when the target (output) variable is numeric.

In this module, you moved to the next step, logistic regression. Logistic regression is a supervised
classification model: it lets you make predictions from labelled data when the target (output) variable is
categorical.

Binary Classification
You first learnt what binary classification is: a classification problem in which the target variable has
only two possible values, i.e. two classes. Some examples of binary classification are:

1. A bank wants to predict, based on some variables, whether a particular customer will default on a
loan or not
2. A factory manager wants to predict, based on some variables, whether a particular machine will
break down in the next month or not
3. Google’s backend wants to predict, based on some variables, whether an incoming email is spam or
not

You then studied the diabetes example in detail, in which you try to predict whether a person has diabetes
based on that person’s blood sugar level.

You saw why a simple decision boundary does not work well for this example. It would be too risky to assign
the class purely on the basis of a cutoff because, especially around the middle of the range, a patient
could belong to either class, diabetic or non-diabetic.

© Copyright 2018. UpGrad Education Pvt. Ltd. All rights reserved


Hence, you learnt that it is better to talk in terms of probability. One curve that models the probability
of diabetes well is the sigmoid curve.

Its equation is given by the following expression –

P(Diabetes) = 1 / (1 + e^(-(β0 + β1x)))

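
To make the shape of the sigmoid curve concrete, here is a minimal Python sketch. The coefficient values BETA0 and BETA1 are hypothetical, chosen only so that the curve crosses 0.5 at a blood sugar level of about 225; they are not the values fitted in the module.

```python
import math

def sigmoid_probability(x, beta0, beta1):
    """Return P(Diabetes) for blood sugar level x under the sigmoid model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Hypothetical coefficients, chosen only for illustration.
BETA0, BETA1 = -13.5, 0.06

# The probability rises smoothly from near 0 to near 1 as x increases.
for level in (150, 225, 300):
    print(level, round(sigmoid_probability(level, BETA0, BETA1), 3))
```

Unlike a hard cutoff, the curve gives patients in the middle of the range a probability near 0.5 instead of forcing them into one class.
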
Likelihood
The next step, just like in linear regression, is to find the best-fit curve. To find the best-fit sigmoid
curve, you vary β0 and β1 until you get the combination of beta values that maximises the likelihood. For
the diabetes example, the likelihood is given by the expression –

Likelihood = (1-P1)(1-P2)(1-P3)(1-P4)(P5)(1-P6)(P7)(P8)(P9)(P10)

Generally, if Pi is the predicted probability of diabetes for person i, the likelihood is the product:
[(1-Pi) for all non-diabetics] × [(Pi) for all diabetics]

This process, in which you vary the betas until you find the best-fit curve for the probability of
diabetes, is called logistic regression.
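
The likelihood for the ten-patient diabetes example can be sketched in a few lines of Python. The labels follow the likelihood expression (patients 5, 7, 8, 9 and 10 are diabetic); the two lists of predicted probabilities are hypothetical and only illustrate that a better-fitting curve gives a higher likelihood.

```python
# Labels read off the likelihood expression for the diabetes example:
# patients 5, 7, 8, 9 and 10 are diabetic (1), the rest are not (0).
LABELS = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def likelihood(probabilities, labels):
    """Product of P_i for diabetics and (1 - P_i) for non-diabetics."""
    result = 1.0
    for p, y in zip(probabilities, labels):
        result *= p if y == 1 else (1.0 - p)
    return result

# Hypothetical predicted probabilities from two candidate curves.
good_fit = [0.1, 0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9, 0.9]
poor_fit = [0.5] * 10

# The curve that assigns high P to diabetics and low P to non-diabetics wins.
print(likelihood(good_fit, LABELS) > likelihood(poor_fit, LABELS))
```
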

Odds and Log Odds

Then, you saw a simpler way of interpreting the equation for logistic regression. You saw that the following
linearised equation is much easier to interpret:

ln(P/(1-P)) = β0 + β1x

The left-hand side of this equation is called the log odds. The odds of having diabetes, P/(1-P), indicate
how much more likely a person is to have diabetes than to not have it. For example, a person for whom the
odds of having diabetes are equal to 3 is 3 times more likely to have diabetes than to not have it. In
other words, P(Diabetes) = 3*P(No diabetes).
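
The relationship between probability, odds and log odds can be checked with a short Python sketch. The function names are illustrative and do not come from the module.

```python
import math

def odds(p):
    """Odds of the event: P / (1 - P)."""
    return p / (1.0 - p)

def probability_from_odds(o):
    """Invert the odds back to a probability: P = odds / (1 + odds)."""
    return o / (1.0 + o)

def log_odds(p):
    """Natural log of the odds, the left-hand side of the linearised equation."""
    return math.log(odds(p))

# Odds of 3 mean P(Diabetes) is three times P(No diabetes).
p = probability_from_odds(3.0)
print(p, 1.0 - p)
```
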

Also, you saw how the odds vary with x. For every fixed additive increase in x, the odds are multiplied by
a constant factor. For example, in the diabetes case, every increase of 11.5 in the value of x
approximately doubles the odds, i.e., multiplies them by a factor of around 2.
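
The doubling can be verified numerically. Since the odds equal e^(β0 + β1x), adding Δ to x multiplies the odds by e^(β1Δ). In the sketch below, BETA1 is back-solved from the doubling statement (an assumption, roughly 0.06) and BETA0 is an arbitrary illustrative value; neither is quoted from the module.

```python
import math

# Assumption: beta1 is back-solved from "odds double for every +11.5 in x".
# beta0 is arbitrary; it cancels out of the ratio below.
BETA1 = math.log(2) / 11.5
BETA0 = -13.5

def odds_at(x):
    """Odds of diabetes at blood sugar level x: e^(beta0 + beta1 * x)."""
    return math.exp(BETA0 + BETA1 * x)

# Ratio of odds for two x values that are 11.5 apart.
ratio = odds_at(231.5) / odds_at(220.0)
print(round(BETA1, 4), round(ratio, 6))
```

The ratio does not depend on the starting value of x, which is what "multiplicative" means here.
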

Multivariate Logistic Regression (Telecom Churn Example)

In this session, you learnt how to build a multivariate logistic regression model in Python. The equation
for multivariate logistic regression is just an extension of the univariate equation:

P = 1 / (1 + e^(-(β0 + β1x1 + β2x2 + ... + βnxn)))

Multivariate Logistic Regression (Model Building)


The example used for building the multivariate model in Python was the telecom churn example. You learnt
how Python can be used to estimate the probability of a customer churning based on the values of 21
predictor variables such as monthly charges, paperless billing, etc.

First, the data was imported from 3 separate CSV files. After creating a merged master data set containing
all 21 variables, data preparation was done, which involved the following steps:

1. Missing value imputation
2. Outlier treatment
3. Dummy variable creation for categorical variables
4. Test-train split of the data
5. Standardisation of the scales of continuous variables
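
Steps 3 to 5 can be sketched with the Python standard library alone. This is an illustrative sketch, not the code used in the module (which would typically rely on pandas and scikit-learn); the helper names, the 30% test fraction, the seed and the toy values are all assumptions.

```python
import random
import statistics

def one_hot(values):
    """Dummy variables for a categorical column, dropping the first level."""
    levels = sorted(set(values))[1:]
    return [[1 if v == level else 0 for level in levels] for v in values]

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle row indices reproducibly and split them into train and test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(len(rows) * (1 - test_fraction)))
    return [rows[i] for i in indices[:cut]], [rows[i] for i in indices[cut:]]

def standardise(train_values, test_values):
    """Scale both sets with the mean and standard deviation of the training set."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)

    def scale(v):
        return (v - mean) / std

    return [scale(v) for v in train_values], [scale(v) for v in test_values]

# Hypothetical toy data: monthly charges and a paperless-billing flag.
charges = [29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75, 104.80, 56.15]
billing = ["Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]

dummies = one_hot(billing)
train, test = train_test_split(charges)
train_scaled, test_scaled = standardise(train, test)
print(len(train), len(test), dummies[:3])
```

The scaling statistics come from the training set only, which is why the split happens before standardisation in the list of steps.
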

After all of this was done, a logistic regression model was built in Python using the GLM() function from
the statsmodels library. This model contained all the variables, some of which had insignificant
coefficients. Hence, some of these variables were removed, first using an automated approach, i.e.
recursive feature elimination (RFE), and then using a manual approach based on the VIFs and p-values.

The logistic regression model was built with GLM() from statsmodels, using a binomial family.
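
As a rough illustration of what the fitting step does, the sketch below maximises the log-likelihood with plain gradient ascent in standard-library Python. It is not the statsmodels code from the module, and GLM() uses a different optimisation routine; the function names, learning rate, iteration count and toy data are all assumptions made for this example.

```python
import math

def predict_proba(row, betas):
    """Sigmoid of beta0 + beta1*x1 + ... + betan*xn."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], row))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(X, y, betas):
    """Sum of log(P_i) for label 1 and log(1 - P_i) for label 0."""
    total = 0.0
    for row, label in zip(X, y):
        p = predict_proba(row, betas)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total

def fit_logistic(X, y, learning_rate=0.1, iterations=2000):
    """Maximise the log-likelihood by batch gradient ascent."""
    betas = [0.0] * (len(X[0]) + 1)
    for _ in range(iterations):
        gradient = [0.0] * len(betas)
        for row, label in zip(X, y):
            error = label - predict_proba(row, betas)
            gradient[0] += error
            for j, x in enumerate(row, start=1):
                gradient[j] += error * x
        betas = [b + learning_rate * g / len(X) for b, g in zip(betas, gradient)]
    return betas

# Hypothetical standardised data: two predictors and a 0/1 churn label.
X = [[-1.2, 0.3], [-0.8, -0.5], [-0.3, 0.9], [0.1, -1.1],
     [0.4, 0.2], [0.9, -0.4], [1.3, 0.8], [-0.4, 1.0]]
y = [0, 0, 1, 1, 0, 1, 1, 0]

betas = fit_logistic(X, y)
print([round(b, 2) for b in betas])
```

Maximising the log-likelihood gives the same betas as maximising the likelihood itself, and it avoids multiplying many small numbers together.
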

Model Evaluation: Accuracy, Sensitivity, and Specificity

You first learnt what a confusion matrix is: a matrix showing the counts of each combination of actual and
predicted labels.

In the confusion matrix, the correctly predicted labels are in the first row, first column and the last
row, last column. Hence, accuracy was defined as:

Accuracy = (Correctly predicted labels) / (Total number of labels)

For your model, you got an accuracy of about 80%, which seemed good, but when you looked at the confusion
matrix again, you saw that there were a lot of misclassifications. Hence, two new metrics were introduced,
i.e. sensitivity and specificity. They were defined as follows:

Sensitivity = (Number of actual Yeses correctly predicted) / (Total number of actual Yeses)
Specificity = (Number of actual Nos correctly predicted) / (Total number of actual Nos)

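Then you saw that the different elements in the confusion matrix can be labelled as follows:

Actual/Predicted | Not Churn            | Churn
Not Churn        | True Negatives (TN)  | False Positives (FP)
Churn            | False Negatives (FN) | True Positives (TP)

Hence, you rewrote the sensitivity and specificity formulas as:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

You found that your specificity was good (~89%) but your sensitivity was only 53%. Hence, this needed to
be addressed.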
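
The three metrics can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions with churn (1) as the positive class; the labels are hypothetical and are not the case-study data.

```python
def confusion_counts(actual, predicted):
    """Count TN, FP, FN, TP for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def accuracy(tn, fp, fn, tp):
    return (tp + tn) / (tn + fp + fn + tp)

def sensitivity(tn, fp, fn, tp):
    return tp / (tp + fn)

def specificity(tn, fp, fn, tp):
    return tn / (tn + fp)

# Hypothetical labels (1 = churn, 0 = no churn), not the case-study data.
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts), sensitivity(*counts), specificity(*counts))
```

In this toy data the accuracy looks acceptable while half of the actual churners are missed, which is the same pattern seen in the case study.
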

ROC Curve
The sensitivity of 53% was mainly a result of the cut-off point of 0.5, which had been chosen arbitrarily.
This cut-off point had to be optimised to get a decent value of sensitivity, and this is where the ROC
curve came in. You first saw what the True Positive Rate (TPR) and the False Positive Rate (FPR) were.
They were defined as follows:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

When you plotted the true positive rate against the false positive rate, you got a graph showing the
trade-off between them; this curve is known as the ROC curve. You plotted this curve for your case study.

The closer this curve is to the upper-left corner, the larger the area under the curve (AUC) and the
better your model. The closer the curve is to the 45-degree diagonal, the worse your model.
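
A minimal sketch of how the ROC points and the AUC can be computed by hand is shown below. It sweeps the cut-off over every distinct predicted probability and integrates with the trapezoidal rule; the function names and the toy data are assumptions, and in practice a library routine would normally be used instead.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) pairs for every cut-off, from strictest to loosest."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= cutoff and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= cutoff and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical labels and predicted churn probabilities.
actual = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

print(round(auc(roc_points(actual, scores)), 3))
```

A model that ranks every churner above every non-churner reaches an AUC of 1, while a curve along the 45-degree diagonal corresponds to an AUC of 0.5.
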

Then you also plotted accuracy, sensitivity, and specificity against the cut-off value.

From this plot, you concluded that the optimal cut-off for the model was around 0.3. You chose this value
as your threshold and got decent values of all three metrics: Accuracy (~77%), Sensitivity (~78%), and
Specificity (~77%).
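
The cut-off search can also be done programmatically. In the module the cut-off was read off a plot; the sketch below instead uses an assumed criterion, picking the cut-off at which sensitivity and specificity are closest to each other. The data are hypothetical, so the resulting cut-off is not expected to match the case-study value of 0.3.

```python
def metrics_at(actual, scores, cutoff):
    """Accuracy, sensitivity and specificity when predicting 1 for score >= cutoff."""
    tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= cutoff)
    tn = sum(1 for a, s in zip(actual, scores) if a == 0 and s < cutoff)
    positives = sum(actual)
    negatives = len(actual) - positives
    return (tp + tn) / len(actual), tp / positives, tn / negatives

def balanced_cutoff(actual, scores, cutoffs):
    """Pick the cut-off where sensitivity and specificity are closest (assumed criterion)."""
    def gap(cutoff):
        _, sens, spec = metrics_at(actual, scores, cutoff)
        return abs(sens - spec)
    return min(cutoffs, key=gap)

# Hypothetical labels and predicted probabilities.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.55, 0.35, 0.45, 0.6, 0.8]
cutoffs = [i / 10 for i in range(1, 10)]

best = balanced_cutoff(actual, scores, cutoffs)
print(best, metrics_at(actual, scores, best))
```

Lowering the cut-off raises sensitivity at the cost of specificity, so the chosen value is a compromise between the two.
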

Model Evaluation: Precision and Recall


You also learnt about precision and recall, another pair of industry-relevant metrics used to evaluate the
performance of a logistic regression model. They were defined as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

And similar to what you did for sensitivity and specificity, you also plotted a trade-off curve between
precision and recall.
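
The trade-off can be seen by computing both metrics at a few cut-offs. The sketch below uses the standard definitions (precision is TP / (TP + FP), recall is TP / (TP + FN)); the data and the three cut-off values are hypothetical.

```python
def precision_recall_at(actual, scores, cutoff):
    """Precision and recall when predicting 1 for score >= cutoff."""
    tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= cutoff)
    fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= cutoff)
    fn = sum(1 for a, s in zip(actual, scores) if a == 1 and s < cutoff)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical labels and predicted probabilities.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.55, 0.35, 0.45, 0.6, 0.8]

# As the cut-off rises, precision tends to go up while recall goes down.
for cutoff in (0.2, 0.4, 0.6):
    print(cutoff, precision_recall_at(actual, scores, cutoff))
```

Recall is the same quantity as sensitivity; precision instead asks how many of the predicted churners actually churned.
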

After experimenting with the metrics and choosing a cut-off point of 0.3, you made predictions on the test
set and got decent values there as well. So, you chose this as your final model.

Model Validation
A model can be validated using:
• In-sample validation
• Out-of-time validation
• K-fold cross validation

Recall the telecom business problem. The data used to build the model was from 2014. You split the original
data into two parts, training and test data. However, both parts contained data from 2014.

This is called in-sample validation. Testing your model on this test data may not be enough, though, as the
test data is too similar to the training data.

So, it makes sense to test the model on data from some other time, like 2016. This is called out-of-time
validation.
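
The difference between the two data sets can be sketched as a simple filter on the observation year. The record layout (a "year" field and a "churn" label) and the helper name are hypothetical; only the years 2014 and 2016 come from the example.

```python
def split_by_year(records, build_year, out_of_time_year):
    """Separate model-building data from data collected in a later period."""
    build = [r for r in records if r["year"] == build_year]
    out_of_time = [r for r in records if r["year"] == out_of_time_year]
    return build, out_of_time

# Hypothetical customer records; the field names are assumptions.
records = [
    {"year": 2014, "churn": 0},
    {"year": 2014, "churn": 1},
    {"year": 2014, "churn": 0},
    {"year": 2016, "churn": 1},
    {"year": 2016, "churn": 0},
]

build_data, out_of_time_data = split_by_year(records, 2014, 2016)
print(len(build_data), len(out_of_time_data))
```

The 2014 records would then be split into training and test sets for in-sample validation, while the 2016 records are kept aside purely for evaluation.
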

Another way to do the same thing is to use K-fold cross validation, in which the evaluation is repeated for
k iterations. E.g. here is how 3-fold cross validation works:

There are 3 iterations in which evaluation is done. In the first iteration, 1/3rd of the data is held out as
testing data and the remaining 2/3rd is used as training data. In the next iteration, a different 1/3rd of
the data is held out as the testing data set, and the model is built on the rest and evaluated. The third
iteration is completed similarly, so every data point is used for testing exactly once.

Such an approach is especially useful if the data you have for model building is very small, i.e., has very
few data points.
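
The fold rotation can be sketched as an index generator. This is a minimal illustration with a hypothetical helper name; it assumes the rows have already been shuffled and uses contiguous blocks as folds.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 3-fold cross validation on 9 rows: each row is tested exactly once.
for train, test in k_fold_indices(9, 3):
    print("train:", train, "test:", test)
```

The model is rebuilt on each training part and evaluated on the matching test part, and the k evaluation results are then typically averaged.
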

If these three methods of validation are still unclear to you, you need not worry as of now. They will be covered at
length in Course 4 (Predictive Analytics II).

Model Stability

Obviously, a good model will be stable. A model is considered stable if it has:


• Performance Stability - Results of in-sample validation approximately match those of out-of-time validation
• Variable Stability - Sample used for model building hasn't changed too much and has the same general
characteristics

Again, if stability is still a little cloudy, you need not worry. It will also be covered at length in Course 4 (Predictive
Analytics II)

Disclaimer: All content and material on the UpGrad website is copyrighted material, either belonging to UpGrad or
its bonafide contributors and is purely for the dissemination of education. You are permitted to access print and
download extracts from this site purely for your own education only and on the following basis:

• You can download this document from the website for self-use only.
• Any copies of this document, in part or full, saved to disc or to any other storage medium may only be used
for subsequent, self-viewing purposes or to print an individual extract or copy for non-commercial personal
use only.
• Any further dissemination, distribution, reproduction, copying of the content of the document herein or the
uploading thereof on other websites or use of content for any other commercial/unauthorized purposes in
any way which could infringe the intellectual property rights of UpGrad or its contributors, is strictly
prohibited.
• No graphics, images or photographs from any accompanying text in this document will be used separately
for unauthorised purposes.
• No material in this document will be modified, adapted or altered in any way.
• No part of this document or UpGrad content may be reproduced or stored in any other web site or included
in any public or private electronic retrieval system or service without UpGrad’s prior written permission.
• Any rights not expressly granted in these terms are reserved.
