Supervised Logistic Tutorial Final PDF
Logistic Regression
Packages
library(readr)
library(ggplot2)
library(caret)
Note: Install any of the above packages if you do not already have them installed.
Linear models are models where the error term is assumed to follow a normal distribution. An extension of
linear models is generalized linear models (GLMs); in these models we no longer assume that the error terms
of the model are normally distributed.
In some applications we will not have a continuous outcome variable as in linear regression, but rather a
categorical outcome. More specifically, a dichotomous outcome, meaning the outcome variable has only two
levels. An example of such an outcome is whether a student passed a module or not.
To model data with a dichotomous outcome (y) and one or more continuous independent variables (x), logistic
regression can be used. The logistic regression model is defined as

ln( p(y = 1) / (1 − p(y = 1)) ) = β0 + β1x1

The quantity inside the log function is the odds of y being 1. β0 is the intercept coefficient and β1 the slope
coefficient associated with x1. We therefore model the log odds of y being equal to 1 as a linear function of
an intercept and the independent variables.
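To make the link between log odds and probability concrete, here is a short sketch of the inverse logit transform. The values of b0, b1 and x1 below are made up for illustration, not taken from any fitted model:

```r
# Inverse logit: turn a log-odds value into a probability.
# b0, b1 and x1 are illustrative values only.
b0 <- -23.5
b1 <- 0.67
x1 <- 40                                      # e.g. an obesity index of 40
log_odds <- b0 + b1 * x1                      # linear predictor
prob <- exp(log_odds) / (1 + exp(log_odds))   # inverse logit
prob
```

Any log-odds value on the real line maps to a probability strictly between 0 and 1, which is what makes this link function suitable for a dichotomous outcome.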
Example
We will now do an example to fit a logistic regression model to some data. We will first import the data
and print the head of the data so we can see what it looks like. This dataset contains 462 observations of
3 variables: the obesity index of the person (obesity), whether the person has chronic heart disease
(chd), and a factor version of the outcome (chd_dum). The outcome is 1 if the person does have heart disease.
We also plot the outcome variable against the obesity index, and we can see that there is a relationship
between the two variables.
Heart = read_csv('Heart_data_supervised.csv')
head(Heart)
## # A tibble: 6 x 3
## obesity chd chd_dum
## <dbl> <dbl> <fct>
## 1 40.3 1 1
## 2 43.9 1 1
## 3 47.0 1 1
## 4 41.0 1 1
## 5 38.1 1 1
## 6 45.1 1 1
plot = ggplot(Heart,aes(x=obesity,y=chd))+geom_point()
plot
[Scatter plot of chd (0 or 1) on the vertical axis against obesity on the horizontal axis.]
We will now split the data into a training and a testing set. We set a random seed to ensure the random
sample remains reproducible. We then draw a sample of size 300 as the training dataset, which will be used
to estimate the model parameters, and a test sample of size 162, which will be used to evaluate the model.
Because the test dataset was not seen by the model during training, the results we obtain from the
evaluation will be unbiased.
set.seed(1234)
train_ind <- sample(seq_len(nrow(Heart)), size = 300)
TrainSetHeart = Heart[train_ind,]
TestSetHeart = Heart[-train_ind,]
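As a quick sanity check on the split above, the training and test indices should partition the 462 rows with no overlap. A base-R sketch on the indices alone:

```r
# Reproduce the index split and confirm the sizes (462 = 300 + 162).
n <- 462
set.seed(1234)
train_ind <- sample(seq_len(n), size = 300)
test_ind <- setdiff(seq_len(n), train_ind)
length(train_ind)   # 300 training rows
length(test_ind)    # 162 test rows
```

Negative indexing with Heart[-train_ind,] achieves the same partition as setdiff here: every row not sampled into the training set lands in the test set exactly once.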
We will now fit the logistic regression model using the glm function on the training dataset.
The first argument (formula = chd_dum ~ obesity) is the structure of the model: chd_dum is the outcome
variable explained by the independent variables, which is obesity in this case.
The second argument (data = TrainSetHeart) gives the model the data it should use to fit the model.
The third argument (family = binomial) indicates that the model we want to fit is a logistic regression.
The following outputs a summary of the model. From this we can see that the estimate for the intercept
is -23.52 and for the slope associated with obesity is 0.67. Both estimates are significant at a 1% level of
significance since their p-values are smaller than 0.01.
Logistic = glm(formula = chd_dum ~ obesity, data = TrainSetHeart, family = binomial)
summary(Logistic)
##
## Call:
## glm(formula = chd_dum ~ obesity, family = binomial, data = TrainSetHeart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9493 -0.0868 -0.0196 0.0771 2.6753
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -23.5176 3.5790 -6.571 5.0e-11 ***
## obesity 0.6723 0.1014 6.628 3.4e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.468 on 299 degrees of freedom
## Residual deviance: 60.211 on 298 degrees of freedom
## AIC: 64.211
##
## Number of Fisher Scoring iterations: 8
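One way to interpret the slope is on the odds scale: exponentiating the coefficient from the summary above gives the multiplicative change in the odds of heart disease per one-unit increase in obesity, roughly a doubling in this example.

```r
# Slope copied from the summary output above (log-odds scale).
b1 <- 0.6723
odds_ratio <- exp(b1)   # odds multiplier per one-unit increase in obesity
odds_ratio              # approximately 1.96
```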
We will now extract the coefficients of the logistic regression model and use them to overlay a curve on our
scatter plot from above to see the estimated model. Note that β0 + β1x1 gives the log odds of y being
1, so we need to transform this to obtain the probability of chronic heart disease.
B0 = Logistic$coefficients[1]
B1 = Logistic$coefficients[2]
Obs = seq(10,70)
LogOdds = B0+B1*Obs
Prob = exp(LogOdds)/(1+exp(LogOdds))
Predicted = as.data.frame(cbind(Obs,Prob))
plot + geom_line(data=Predicted, aes(x=Obs, y=Prob))
[Scatter plot of chd against obesity with the fitted logistic curve overlaid.]
We will now use the test dataset to evaluate the fit of the model. Let’s start by defining the confusion matrix.

                 1 (Actual)            0 (Actual)
1 (Predicted)    True Positive (TP)    False Positive (FP)
0 (Predicted)    False Negative (FN)   True Negative (TN)

True Positive (TP) - the number of observations which are 1 and are correctly predicted by the model as 1.
False Negative (FN) - the number of observations which are 1 and are wrongly predicted by the model as 0.
False Positive (FP) - the number of observations which are 0 and are wrongly predicted by the model as 1.
True Negative (TN) - the number of observations which are 0 and are correctly predicted by the model as 0.
Using these counts we can calculate some metrics, for example:
Accuracy = (TP+TN)/(TP+FN+FP+TN) - the proportion of correctly predicted observations.
Recall = TP/(TP+FN) - the proportion of ones which are correctly predicted as one.
Precision = TP/(TP+FP) - the proportion of observations predicted as one which are actually one.
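As a worked example, plugging in the counts from the confusion matrix reported later in this tutorial (TP = 52, TN = 104, FP = 3, FN = 3) reproduces the metrics:

```r
# Counts taken from the confusionMatrix output in this tutorial.
TP <- 52; TN <- 104; FP <- 3; FN <- 3
Accuracy  <- (TP + TN) / (TP + FN + FP + TN)   # 156/162, about 0.963
Recall    <- TP / (TP + FN)                    # 52/55,  about 0.945
Precision <- TP / (TP + FP)                    # 52/55,  about 0.945
c(Accuracy, Recall, Precision)
```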
Now we will calculate these metrics for our example. The model gives us the probability of an observation
being 1. We therefore have to choose a cut-off probability which we will use to decide whether an observation
should be predicted as 1. A cut-off of 0.5 is a logical choice, since above this point the probability of the
observation being 1 is greater than the probability of it being 0.
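The choice of cut-off matters. A toy sketch with made-up probabilities and labels shows that lowering the cut-off predicts more observations as 1, which raises recall but can lower precision:

```r
# Hypothetical predicted probabilities and true labels (illustrative only).
prob  <- c(0.20, 0.45, 0.60, 0.90, 0.35)
truth <- c(0, 1, 1, 1, 0)
pred_high <- ifelse(prob > 0.5, 1, 0)  # strict cut-off: misses the 0.45 case
pred_low  <- ifelse(prob > 0.3, 1, 0)  # loose cut-off: catches it, plus a 0
sum(pred_high == 1 & truth == 1)  # TP at cut-off 0.5: 2 (recall 2/3)
sum(pred_low  == 1 & truth == 1)  # TP at cut-off 0.3: 3 (recall 3/3)
sum(pred_low  == 1 & truth == 0)  # FP at cut-off 0.3: 1 (precision drops to 3/4)
```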
LogOdds = B0 + B1*(TestSetHeart$obesity)
Prob = exp(LogOdds)/(1 + exp(LogOdds))
pred = ifelse(Prob > 0.5, 1, 0)
TP = sum(pred == 1 & TestSetHeart$chd == 1)
TN = sum(pred == 0 & TestSetHeart$chd == 0)
FP = sum(pred == 1 & TestSetHeart$chd == 0)
FN = sum(pred == 0 & TestSetHeart$chd == 1)
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Accuracy
## [1] 0.962963
Precision = TP/(TP+FP)
Precision
## [1] 0.9454545
Recall = TP/(TP+FN)
Recall
## [1] 0.9454545
The confusion matrix can also be obtained using the caret package and confusionMatrix function. When
using the confusionMatrix function, make use of the positive argument. This specifies the factor level that
corresponds to a “positive” result.
# Recode predictions and actual values as factors.
y_pred_num <- ifelse(Prob > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels = c(0,1))
y_act <- TestSetHeart$chd_dum
# Performance metrics.
confusionMatrix(data = y_pred, reference = y_act, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 104 3
## 1 3 52
##
## Accuracy : 0.963
## 95% CI : (0.9211, 0.9863)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9174
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9455
## Specificity : 0.9720
## Pos Pred Value : 0.9455
## Neg Pred Value : 0.9720
## Prevalence : 0.3395
## Detection Rate : 0.3210
## Detection Prevalence : 0.3395
## Balanced Accuracy : 0.9587
##
## 'Positive' Class : 1
##
Note: When using the confusionMatrix function, the rows and columns of the confusion matrix are always
ordered numerically or alphabetically by factor level. Pay close attention to the positive outcome specified,
so that the evaluation metrics are calculated and interpreted correctly.
Can you identify the TP, TN, FP, and FN values? Pay close attention to the labels displayed in the confusion
matrix.
Practical Question(s):
Note: Packages can be used to build the logistic regression model(s), as well as to obtain prediction results
and appropriate evaluation metrics.
Question 1
Use the dataset Heart and replicate the results obtained in the example above. The Heart data is loaded
with the code below into the object heart.
## # A tibble: 6 x 2
## obesity chd
## <dbl> <dbl>
## 1 40.3 1
## 2 43.9 1
## 3 47.0 1
## 4 41.0 1
## 5 38.1 1
## 6 45.1 1
Question 2
Consider the dataset Titanic. This dataset contains information about 714 passengers on the Titanic. The
dataset has the following independent variables: age, gender (which is 1 if the passenger is female and 0 if
the passenger is male) and passenger class. The dataset also contains the variable survived, which is 1 if a
passenger survived and 0 if not. A survival outcome of 1 (passenger survived) is considered a positive
outcome. The Titanic data is loaded with the code below into the object titanic.
head(titanic) # View the first 6 rows
## # A tibble: 6 x 5
## X1 Survived PassengerClass Age Gender
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 3 22 0
## 2 2 1 1 38 1
## 3 3 1 3 26 1
## 4 4 1 1 35 1
## 5 5 0 3 35 0
## 6 7 0 1 54 0
Question 2a
Use the dataset and split it into a train and test set using a random seed of 1357, with a train set of size
600 and a test set of size 114.
Question 2b
Use the train dataset obtained in Question 2a along with the glm function to fit a logistic regression model
which uses all 3 independent variables to model whether a passenger survived or not. Give a summary of
the fitted model.
Question 2c
Use the summary of the model to obtain the estimates as well as the relevant p-values of the parameters.
Comment on these values as well as their significance.
Question 2d
Use the test dataset to evaluate the model. Use this set to calculate the true positive, true negative, false
positive and false negative values. Then use these values to calculate the accuracy, precision and recall rates.
Comment on each of these metrics with regard to the performance of the model.
Question 3
Repeat Question 2. Split the data into a train and test set using a random seed of 42 and an 80/20 split.
The Titanic data is loaded with the code below into the object titanic.
## # A tibble: 6 x 4
## Survived PassengerClass Age Gender
## <dbl> <dbl> <dbl> <dbl>
## 1 0 3 22 0
## 2 1 1 38 1
## 3 1 3 26 1
## 4 1 1 35 1
## 5 0 3 35 0
## 6 0 1 54 0