Supervised Logistic Tutorial Final PDF
Logistic Regression
Packages
library(readr)
library(ggplot2)
library(caret)
Note: Install any of the above packages if you do not already have them installed.
Linear models are models where the error term is assumed to follow a normal distribution. An extension of
linear models is generalized linear models (GLMs); in these models we no longer assume that the error terms
of the model are normally distributed.
In some applications we will not have a continuous outcome variable as in linear regression, but rather a
categorical outcome. More specifically, a dichotomous outcome, meaning the outcome variable has only two
levels. An example of such an outcome is whether a student passed a module or not.
To model data with a dichotomous outcome (y) and one or more continuous independent variables (x), logistic
regression can be used. The logistic regression model is defined as

ln( p(y = 1) / (1 − p(y = 1)) ) = β0 + β1x1

The quantity inside the log function is the odds of y being 1. β0 is the intercept coefficient and β1 the slope
coefficient associated with x1. We therefore model the log odds of y being equal to 1 as a linear function of
an intercept and the independent variables.
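To make the link between log odds and probability concrete, here is a short sketch of the inverse logit transform. The values of b0, b1 and x1 below are made up for illustration, not taken from any fitted model:

```r
# Inverse logit: turn a log-odds value into a probability.
# b0, b1 and x1 are illustrative values only.
b0 <- -23.5
b1 <- 0.67
x1 <- 40                                      # e.g. an obesity index of 40
log_odds <- b0 + b1 * x1                      # linear predictor
prob <- exp(log_odds) / (1 + exp(log_odds))   # inverse logit
prob
```

Any log-odds value on the real line maps to a probability strictly between 0 and 1, which is what makes this link function suitable for a dichotomous outcome.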
Example
We will now do an example to fit a logistic regression model to some data. We will first import the data
and print the head of the data so we can see what it looks like. This dataset contains 462 observations of
3 variables: the obesity index of the person (obesity), whether the person has chronic heart disease
(chd), and a factor version of the outcome (chd_dum). The outcome is 1 if the person does have heart disease.
We also plot the outcome variable against the obesity index, and we can see that there is a relationship
between the two variables.
Heart = read_csv('Heart_data_supervised.csv')
head(Heart)
## # A tibble: 6 x 3
## obesity chd chd_dum
## <dbl> <dbl> <fct>
## 1 40.3 1 1
## 2 43.9 1 1
## 3 47.0 1 1
## 4 41.0 1 1
## 5 38.1 1 1
## 6 45.1 1 1
plot = ggplot(Heart,aes(x=obesity,y=chd))+geom_point()
plot
[Scatter plot of chd (0 or 1) on the vertical axis against obesity on the horizontal axis.]
We will now split the data into a training and a testing set. We set a random seed to ensure the random
sample remains reproducible. We then draw a sample of size 300 as the training dataset, which will be used
to estimate the model parameters, and a test sample of size 162, which will be used to evaluate the model.
Because the test dataset was not seen by the model during training, the results we obtain from the
evaluation will be unbiased.
set.seed(1234)
train_ind <- sample(seq_len(nrow(Heart)), size = 300)
TrainSetHeart = Heart[train_ind,]
TestSetHeart = Heart[-train_ind,]
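As a quick sanity check on the split above, the training and test indices should partition the 462 rows with no overlap. A base-R sketch on the indices alone:

```r
# Reproduce the index split and confirm the sizes (462 = 300 + 162).
n <- 462
set.seed(1234)
train_ind <- sample(seq_len(n), size = 300)
test_ind <- setdiff(seq_len(n), train_ind)
length(train_ind)   # 300 training rows
length(test_ind)    # 162 test rows
```

Negative indexing with Heart[-train_ind,] achieves the same partition as setdiff here: every row not sampled into the training set lands in the test set exactly once.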
We will now fit the logistic regression model using the glm function on the training dataset.
The first argument (formula = chd_dum ~ obesity) is the structure of the model: chd_dum is the outcome
variable explained by the independent variables, which is obesity in this case.
The second argument (data = TrainSetHeart) gives the model the data it should use to fit the model.
The third argument (family = binomial) indicates that the model we want to fit is a logistic regression.
The following outputs a summary of the model. From this we can see that the estimate for the intercept
is -23.52 and for the slope associated with obesity is 0.67. Both estimates are significant at a 1% level of
significance since their p-values are smaller than 0.01.
Logistic = glm(formula = chd_dum ~ obesity, data = TrainSetHeart, family = binomial)
summary(Logistic)
##
## Call:
## glm(formula = chd_dum ~ obesity, family = binomial, data = TrainSetHeart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9493 -0.0868 -0.0196 0.0771 2.6753
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -23.5176 3.5790 -6.571 5.0e-11 ***
## obesity 0.6723 0.1014 6.628 3.4e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.468 on 299 degrees of freedom
## Residual deviance: 60.211 on 298 degrees of freedom
## AIC: 64.211
##
## Number of Fisher Scoring iterations: 8
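One way to interpret the slope is on the odds scale: exponentiating the coefficient from the summary above gives the multiplicative change in the odds of heart disease per one-unit increase in obesity, roughly a doubling in this example.

```r
# Slope copied from the summary output above (log-odds scale).
b1 <- 0.6723
odds_ratio <- exp(b1)   # odds multiplier per one-unit increase in obesity
odds_ratio              # approximately 1.96
```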
We will now extract the coefficients of the logistic regression model and use them to overlay a curve on our
scatter plot from above to see the estimated model. Note that β0 + β1x1 gives the log odds of y being
1, so we need to transform this to obtain the probability of chronic heart disease.
B0 = Logistic$coefficients[1]
B1 = Logistic$coefficients[2]
Obs = seq(10,70)
LogOdds = B0+B1*Obs
Prob = exp(LogOdds)/(1+exp(LogOdds))
Predicted = as.data.frame(cbind(Obs,Prob))
plot + geom_line(data=Predicted, aes(x=Obs, y=Prob))
[Scatter plot of chd against obesity with the fitted logistic curve overlaid.]
We will now use the test dataset to evaluate the fit of the model. Let’s start by defining the confusion matrix.

                 1 (Actual)            0 (Actual)
1 (Predicted)    True Positive (TP)    False Positive (FP)
0 (Predicted)    False Negative (FN)   True Negative (TN)

True Positive (TP) - the number of observations which are 1 and are correctly predicted by the model as 1.
False Negative (FN) - the number of observations which are 1 and are wrongly predicted by the model as 0.
False Positive (FP) - the number of observations which are 0 and are wrongly predicted by the model as 1.
True Negative (TN) - the number of observations which are 0 and are correctly predicted by the model as 0.
Using these counts we can calculate some metrics, for example:
Accuracy = (TP+TN)/(TP+FN+FP+TN) - the proportion of correctly predicted observations.
Recall = TP/(TP+FN) - the proportion of ones which are correctly predicted as one.
Precision = TP/(TP+FP) - the proportion of observations predicted as one which are actually one.
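As a worked example, plugging in the counts from the confusion matrix reported later in this tutorial (TP = 52, TN = 104, FP = 3, FN = 3) reproduces the metrics:

```r
# Counts taken from the confusionMatrix output in this tutorial.
TP <- 52; TN <- 104; FP <- 3; FN <- 3
Accuracy  <- (TP + TN) / (TP + FN + FP + TN)   # 156/162, about 0.963
Recall    <- TP / (TP + FN)                    # 52/55,  about 0.945
Precision <- TP / (TP + FP)                    # 52/55,  about 0.945
c(Accuracy, Recall, Precision)
```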
Now we will calculate these metrics for our example. The model gives us the probability of an observation
being 1. We therefore have to choose a cut-off probability which we will use to decide whether an observation
should be predicted as 1. A cut-off of 0.5 is a logical choice, since above this point the probability of the
observation being 1 is greater than the probability of it being 0.
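The choice of cut-off matters. A toy sketch with made-up probabilities and labels shows that lowering the cut-off predicts more observations as 1, which raises recall but can lower precision:

```r
# Hypothetical predicted probabilities and true labels (illustrative only).
prob  <- c(0.20, 0.45, 0.60, 0.90, 0.35)
truth <- c(0, 1, 1, 1, 0)
pred_high <- ifelse(prob > 0.5, 1, 0)  # strict cut-off: misses the 0.45 case
pred_low  <- ifelse(prob > 0.3, 1, 0)  # loose cut-off: catches it, plus a 0
sum(pred_high == 1 & truth == 1)  # TP at cut-off 0.5: 2 (recall 2/3)
sum(pred_low  == 1 & truth == 1)  # TP at cut-off 0.3: 3 (recall 3/3)
sum(pred_low  == 1 & truth == 0)  # FP at cut-off 0.3: 1 (precision drops to 3/4)
```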
LogOdds = B0 + B1*(TestSetHeart$obesity)
Prob = exp(LogOdds)/(1 + exp(LogOdds))
pred = ifelse(Prob > 0.5, 1, 0)
TP = sum(pred == 1 & TestSetHeart$chd == 1)
TN = sum(pred == 0 & TestSetHeart$chd == 0)
FP = sum(pred == 1 & TestSetHeart$chd == 0)
FN = sum(pred == 0 & TestSetHeart$chd == 1)
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Accuracy
## [1] 0.962963
Precision = TP/(TP+FP)
Precision
## [1] 0.9454545
Recall = TP/(TP+FN)
Recall
## [1] 0.9454545
The confusion matrix can also be obtained using the caret package and confusionMatrix function. When
using the confusionMatrix function, make use of the positive argument. This specifies the factor level that
corresponds to a “positive” result.
# Recode predictions and actual values as factors.
y_pred_num <- ifelse(Prob > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels = c(0,1))
y_act <- TestSetHeart$chd_dum
# Performance metrics.
confusionMatrix(data = y_pred, reference = y_act, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 104 3
## 1 3 52
##
## Accuracy : 0.963
## 95% CI : (0.9211, 0.9863)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9174
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9455
## Specificity : 0.9720
## Pos Pred Value : 0.9455
## Neg Pred Value : 0.9720
## Prevalence : 0.3395
## Detection Rate : 0.3210
## Detection Prevalence : 0.3395
## Balanced Accuracy : 0.9587
##
## 'Positive' Class : 1
##
Note: When using the confusionMatrix function, the rows and columns of the confusion matrix are always
ordered numerically or alphabetically by factor level. Pay close attention to the positive outcome specified,
so that the evaluation metrics are calculated and interpreted correctly.
Can you identify the TP, TN, FP, and FN values? Pay close attention to the labels displayed in the confusion
matrix.
Practical Question(s):
Note: Packages can be used to build the logistic regression model(s), as well as to obtain prediction results
and appropriate evaluation metrics.
Question 1
Use the dataset Heart and replicate the results obtained in the example above. The Heart data is loaded
with the code below into the object heart.
## # A tibble: 6 x 2
## obesity chd
## <dbl> <dbl>
## 1 40.3 1
## 2 43.9 1
## 3 47.0 1
## 4 41.0 1
## 5 38.1 1
## 6 45.1 1
Question 2
Consider the dataset Titanic. This dataset contains information about 714 passengers on the Titanic. The
dataset has the following independent variables: age, gender (which is 1 if the passenger is female and 0 if
the passenger is male) and passenger class. The dataset also contains the variable survived, which is 1 if a
passenger survived and 0 if not. A survival outcome of 1 (passenger survived) is considered a positive
outcome. The Titanic data is loaded with the code below into the object titanic.
head(titanic) # View the first 6 rows
## # A tibble: 6 x 5
## X1 Survived PassengerClass Age Gender
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 3 22 0
## 2 2 1 1 38 1
## 3 3 1 3 26 1
## 4 4 1 1 35 1
## 5 5 0 3 35 0
## 6 7 0 1 54 0
Question 2a
Use the dataset and split it into a train and test set using a random seed of 1357, with a train set of size
600 and a test set of size 114.
Question 2b
Use the train dataset obtained in Question 2a along with the glm function to fit a logistic regression model
which uses all 3 independent variables to model whether a passenger survived or not. Give a summary of
the fitted model.
Question 2c
Use the summary of the model to obtain the estimates as well as the relevant p-values of the parameters.
Comment on these values as well as their significance.
Question 2d
Use the test dataset to evaluate the model. Use this set to calculate the true positive, true negative, false
positive and false negative values. Then use these values to calculate the accuracy, precision and recall rates.
Comment on each of these metrics with regard to the performance of the model.
Question 3
Repeat Question 2. Split the data into a train and test set using a random seed of 42 and an 80/20 split.
The Titanic data is loaded with the code below into the object titanic.
## # A tibble: 6 x 4
## Survived PassengerClass Age Gender
## <dbl> <dbl> <dbl> <dbl>
## 1 0 3 22 0
## 2 1 1 38 1
## 3 1 3 26 1
## 4 1 1 35 1
## 5 0 3 35 0
## 6 0 1 54 0