Logistic Regression Lecture Notes
Logistic Regression
In the last module, you learnt Linear Regression, which is a supervised regression model. In other words,
linear regression allows you to make predictions from labelled data when the target (output) variable is
numeric.
In this module, you moved to the next step, i.e., Logistic Regression. Logistic Regression is a
supervised classification model. It allows you to make predictions from labelled data when the target (output)
variable is categorical.
Binary Classification
You first learnt what binary classification is. Basically, it is a classification problem in which the target
variable has only two possible values, i.e., two classes. Some examples of binary classification
are –
1. A bank wants to predict, based on some variables, whether a particular customer will default on a
loan or not
2. A factory manager wants to predict, based on some variables, whether a particular machine will
break down in the next month or not
3. Google’s backend wants to predict, based on some variables, whether an incoming email is spam or
not
You then saw the diabetes example, which was discussed in detail. In this example, you try to predict
whether a person has diabetes or not, based on that person’s blood sugar level.
You saw why a simple decision-boundary approach does not work very well for this example. It would be
too risky to decide the class purely on the basis of a cut-off, since, especially in the middle range of blood
sugar values, the patients could belong to either class, diabetic or non-diabetic.
Hence, instead of assigning a class directly, you model the probability of diabetes with the sigmoid function –
P(Diabetes) = 1 / (1 + e^-(β0 + β1x))
Likelihood
The next step, just like linear regression, would be to find the best fit curve. Hence, you learnt that in order
to find the best fit sigmoid curve, you need to vary β0 and β1 until you get the combination of beta values
that maximises the likelihood. For the diabetes example, likelihood is given by the expression –
Likelihood = (1-P1)(1-P2)(1-P3)(1-P4)(P5)(1-P6)(P7)(P8)(P9)(P10)
Here, Pi is the predicted probability of diabetes for the i-th patient; the likelihood multiplies Pi for the
patients who actually have diabetes and (1 - Pi) for those who do not.
This process, where you vary the betas, until you find the best fit curve for probability of diabetes, is called
logistic regression.
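A minimal sketch of this maximum-likelihood fit in Python, using statsmodels (the blood sugar values and labels below are made up purely for illustration and follow the pattern of the likelihood expression above):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: blood sugar levels and diabetes labels (1 = diabetic).
blood_sugar = np.array([85, 90, 100, 110, 120, 140, 150, 170, 180, 200])
diabetes = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# Add the intercept term (beta_0) and fit a logistic regression by
# maximum likelihood using a binomial GLM with the default logit link.
X = sm.add_constant(blood_sugar)
model = sm.GLM(diabetes, X, family=sm.families.Binomial()).fit()

print(model.params)       # beta_0 and beta_1 that maximise the likelihood
print(model.predict(X))   # fitted P(Diabetes) for each person
```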
Then, you saw a simpler way of interpreting the equation for logistic regression. You saw that the following
linearised equation is much easier to interpret –
ln(P / (1 - P)) = β0 + β1x
The left-hand side of this equation is what is called the log odds. Basically, the odds of having diabetes, P / (1 - P),
indicate how much more likely a person is to have diabetes than to not have it. For example, a person for
whom the odds of having diabetes are equal to 3 is 3 times more likely to have diabetes than to not have
it. In other words, P(Diabetes) = 3*P(No diabetes).
Also, you saw how the odds vary with x. Basically, with every linear increase in x, the odds increase
multiplicatively, by a constant factor of e^β1 per unit of x. For example, in the diabetes case, after every increase
of 11.5 in the value of x, the odds get approximately doubled, i.e., increase by a multiplicative factor of around 2.
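As a short worked check of this doubling behaviour (taking β1 ≈ 0.06 purely as an illustrative value, not the exact coefficient from the lecture):

$$
\frac{\text{odds}(x + 11.5)}{\text{odds}(x)}
= \frac{e^{\beta_0 + \beta_1 (x + 11.5)}}{e^{\beta_0 + \beta_1 x}}
= e^{11.5\,\beta_1}
\approx e^{11.5 \times 0.06}
\approx 2
$$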
In this session, you learnt how to build a multivariate logistic regression model in Python. The equation for
multivariate logistic regression is basically just an extension of the univariate equation –
ln(P / (1 - P)) = β0 + β1x1 + β2x2 + ... + βnxn
The example used for building the multivariate model was the Telecom Churn example. Basically, you
learnt how Python can be used to estimate the probability of a customer churning, based on the values of 21
predictor variables, like monthly charges, paperless billing, etc.
First, the data, which was present in 3 separate CSV files, was imported. After creating a merged master data
set, one that contains all 21 variables, data preparation was carried out (a sketch of typical preparation steps is shown below).
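The exact preparation code is not reproduced in these notes; the following is a minimal sketch of the kind of steps involved, assuming hypothetical file names (churn_data.csv, customer_data.csv, internet_data.csv) and column names (customerID, Churn, tenure, MonthlyCharges, TotalCharges):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Import the three CSV files and merge them on the customer identifier
# to build the master data set (file and column names are assumed here).
churn = pd.read_csv("churn_data.csv")
customer = pd.read_csv("customer_data.csv")
internet = pd.read_csv("internet_data.csv")
telecom = churn.merge(customer, on="customerID").merge(internet, on="customerID")

# Convert the target to 0/1, drop the identifier, and create dummy
# variables for the categorical predictors.
telecom["Churn"] = telecom["Churn"].map({"Yes": 1, "No": 0})
telecom = telecom.drop("customerID", axis=1)
dummies = pd.get_dummies(telecom.select_dtypes(include="object"), drop_first=True).astype(int)
telecom = pd.concat([telecom.select_dtypes(exclude="object"), dummies], axis=1)

# Split into training and test sets, then scale the continuous columns.
X = telecom.drop("Churn", axis=1)
y = telecom["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=100)

scaler = StandardScaler()
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]   # assumed numeric columns
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```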
After all of this was done, a logistic regression model was built in Python using the GLM() function from the
statsmodels library. This model contained all the variables, some of which had insignificant coefficients.
Hence, some of these variables were removed, first using an automated approach, i.e. RFE, and then a
manual approach based on the VIFs and p-values.
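A minimal sketch of what this coarse-to-fine elimination can look like, reusing X_train and y_train from the preparation sketch above (the choice of 15 features is illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Automated step: keep the 15 strongest predictors according to RFE.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=15)
rfe = rfe.fit(X_train, y_train)
selected_cols = X_train.columns[rfe.support_]

# Manual step: inspect VIFs (together with the p-values from the fitted
# statsmodels summary) and drop variables that are highly multicollinear
# or insignificant, refitting the model after each drop.
vif = pd.DataFrame({
    "feature": selected_cols,
    "VIF": [variance_inflation_factor(X_train[selected_cols].values, i)
            for i in range(len(selected_cols))],
}).sort_values("VIF", ascending=False)
print(vif)
```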
Code along the following lines was used in statsmodels to build the logistic regression model.
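This is a minimal sketch of the statsmodels call, assuming the training data and the RFE-selected columns from the sketches above:

```python
import statsmodels.api as sm

# Add the intercept column and fit a binomial GLM (the default link for
# the Binomial family is the logit, i.e., logistic regression).
X_train_sm = sm.add_constant(X_train[selected_cols])
logm = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
res = logm.fit()

# The summary lists each coefficient with its p-value, which drives the
# manual elimination described above.
print(res.summary())

# Predicted churn probabilities on the training set.
y_train_pred = res.predict(X_train_sm)
```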
You first learnt what a confusion matrix is. It is basically a matrix showing the counts of all the actual
and predicted labels. For a binary problem such as churn, it has the following layout (the actual counts from the case study are not reproduced here):

                          Predicted: No churn    Predicted: Churn
Actual: No churn          True Negatives         False Positives
Actual: Churn             False Negatives        True Positives

From the confusion matrix, you can see that the correctly predicted labels are present in the first row, first
column and the last row, last column. Hence, we defined accuracy as –
Accuracy = (True Positives + True Negatives) / Total number of predictions
For your model, you got an accuracy of about 80%, which seemed good, but when you relooked at the confusion
matrix, you saw that there were a lot of misclassifications going on. Hence, we brought in two new metrics,
i.e. Sensitivity and Specificity. They were defined as follows:
Sensitivity = True Positives / (True Positives + False Negatives)
Specificity = True Negatives / (True Negatives + False Positives)
You found out that your specificity was good (~89%) but your sensitivity was only about 53%. Hence, this needed
to be taken care of.
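A sketch of how these numbers can be computed from the predicted probabilities (using a 0.5 cut-off to start with, and the y_train and y_train_pred variables from the sketches above):

```python
from sklearn.metrics import confusion_matrix

# Classify a customer as churn if the predicted probability exceeds 0.5.
y_train_pred_label = (y_train_pred > 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_train, y_train_pred_label).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate / recall
specificity = tn / (tn + fp)   # true negative rate
print(accuracy, sensitivity, specificity)
```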
ROC Curve
You had got a sensitivity of only about 53%, mainly because of the cut-off of 0.5 that you had
arbitrarily chosen. This cut-off now had to be optimised in order to get a decent value of sensitivity,
and this is where the ROC curve came in. You first saw what the True Positive Rate (TPR) and the False Positive Rate
(FPR) are. They were defined as follows –
True Positive Rate (TPR) = True Positives / (True Positives + False Negatives)
False Positive Rate (FPR) = False Positives / (False Positives + True Negatives)
When you plotted the true positive rate against the false positive rate, you got a graph showing the trade-off
between them; this curve is known as the ROC curve. You plotted this curve for your case study (a sketch of how it can be generated is shown below).
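A minimal sketch of generating the ROC curve, continuing with y_train and y_train_pred from above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# FPR and TPR for every possible cut-off on the predicted probabilities.
fpr, tpr, thresholds = roc_curve(y_train, y_train_pred)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_train, y_train_pred):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")   # the 45-degree "no skill" line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```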
Then you also plotted the accuracy, sensitivity, and specificity against the cut-off.
From this plot, you concluded that the optimal cut-off for the model was around 0.3, and you chose this value to
be your threshold and got decent values of all the three metrics – Accuracy (~77%), Sensitivity (~78%), and
Specificity (~77%).
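A sketch of how that cut-off plot can be generated (again continuing with y_train and y_train_pred from above):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Compute accuracy, sensitivity and specificity for a range of cut-offs.
rows = []
for cutoff in np.arange(0.1, 1.0, 0.1):
    pred = (y_train_pred > cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_train, pred).ravel()
    rows.append({
        "cutoff": round(cutoff, 1),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    })

metrics = pd.DataFrame(rows)
metrics.plot(x="cutoff")   # the region where the curves meet suggests the cut-off
plt.show()
```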
And similar to what you did for sensitivity and specificity, you also plotted a trade-off curve between
precision and recall, where Precision = True Positives / (True Positives + False Positives) and
Recall = True Positives / (True Positives + False Negatives).
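A minimal sketch of that precision-recall trade-off plot, with the same variables as above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_train, y_train_pred)

# Plot precision and recall against the cut-off to see where they cross.
plt.plot(thresholds, precision[:-1], label="precision")
plt.plot(thresholds, recall[:-1], label="recall")
plt.xlabel("cut-off")
plt.legend()
plt.show()
```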
Recall the Telecom business problem. The data used to build the model was from 2014. You split the original data into
two parts, training data and test data; however, both of these parts contained data from 2014.
This is called in-sample validation. Testing your model only on this test data may not be enough, as the test data is
too similar to the training data.
So, it makes sense to also test the model on data from some other time period, such as 2016. This is called out-of-time
validation.
The third approach is k-fold cross-validation. Basically, with k = 3, there are 3 iterations in which evaluation is done.
In the first iteration, 2/3rd of the data is used as training data and the remaining 1/3rd is held out as test data. In the
next iteration, a different 1/3rd of the data is held out as the test set, the model is built on the remaining 2/3rd and then
evaluated. Similarly, the third iteration is completed, so that every data point is used for testing exactly once.
Such an approach is necessary if the data you have for model building is very small, i.e., has very few data points.
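A minimal sketch of 3-fold cross-validation using sklearn (reusing the X and y from the preparation sketch above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 3-fold cross-validation: the model is trained on 2/3 of the data and
# evaluated on the remaining 1/3, three times, so that every fold is
# used for testing exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=3, scoring="accuracy")
print(scores, scores.mean())
```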
If these three methods of validation are still unclear to you, you need not worry as of now. They will be covered at
length in Course 4 (Predictive Analytics II).
Model Stability
Broadly, a model is considered stable if its performance on unseen data is comparable to its performance on the
training data and its coefficients do not change drastically when the training data changes slightly. Again, if stability
is still a little cloudy, you need not worry. It will also be covered at length in Course 4 (Predictive Analytics II).
Disclaimer: All content and material on the upGrad website is copyrighted material,
either belonging to upGrad or its bonafide contributors and is purely for the
dissemination of education. You are permitted to access print and download extracts
from this site purely for your own education only and on the following basis:
• You can download this document from the website for self-use only.
• Any copies of this document, in part or full, saved to disc or to any other storage
medium may only be used for subsequent, self-viewing purposes or to print
an individual extract or copy for non-commercial personal use only.
• Any further dissemination, distribution, reproduction, copying of the content of
the document herein or the uploading thereof on other websites or use of
the content for any other commercial/unauthorised purposes in any way
which could infringe the intellectual property rights of upGrad or its
contributors, is strictly prohibited.
• No graphics, images or photographs from any accompanying text in this
document will be used separately for unauthorised purposes.
• No material in this document will be modified, adapted or altered in any way.
• No part of this document or upGrad content may be reproduced or stored in
any other web site or included in any public or private electronic retrieval
system or service without upGrad’s prior written permission.
• Any rights not expressly granted in these terms are reserved.