
Supervised Learning - Logistic Regression Tutorial

WST 212 2020

Logistic Regression

Packages

The packages required for this tutorial are:

library(readr)
library(ggplot2)
library(caret)

## Warning: package 'caret' was built under R version 3.6.3

## Loading required package: lattice

Note: Install any of the above packages that you do not already have installed.
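If any of them are missing, they can be installed from CRAN first, for example:

install.packages(c('readr', 'ggplot2', 'caret'))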
Linear models are models where the error term is assumed to follow a normal distribution. Generalized
linear models (GLMs) extend linear models by dropping the assumption that the error terms of the model
are normally distributed.
In some applications we will not have a continuous outcome variable as in linear regression, but rather a
categorical one. More specifically, a dichotomous outcome, meaning the outcome variable has only two levels.
An example of such an outcome is whether a student passed a module or not.
To model data with a dicotomous outcome (y) and one or more continuous independent variables (x) logistic
regression can be used. The logistic regression model is defined as

ln( p(y = 1) / (1 − p(y = 1)) ) = β0 + β1x1

Inside the log function are the odds of y being 1. β0 is the intercept coefficient and β1 the slope coefficient
associated with x1. We therefore model the log odds of y being equal to 1 as a linear function of an intercept
and the independent variables.
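Solving this equation for p(y = 1) gives the inverse relationship

p(y = 1) = exp(β0 + β1x1) / (1 + exp(β0 + β1x1))

which is what we will use later to convert fitted log odds back into probabilities.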

Example

We will now work through an example of fitting a logistic regression model to some data. We first import the
data and print its head so we can see what it looks like. This dataset contains 462 observations of
2 variables: the obesity index of the person (obesity) and whether the person has chronic heart disease
(chd). The variable chd is 1 if the person does have heart disease and 0 otherwise. We also plot the outcome
variable against the obesity index, and we can see that there is a relationship between the two variables.

Heart = read_csv('Heart_data_supervised.csv')

## Parsed with column specification:
## cols(
##   obesity = col_double(),
##   chd = col_double()
## )

# Define the dummy variable to be a factor variable
Heart$chd_dum <- factor(Heart$chd, levels = c(0,1))
head(Heart)

## # A tibble: 6 x 3
## obesity chd chd_dum
## <dbl> <dbl> <fct>
## 1 40.3 1 1
## 2 43.9 1 1
## 3 47.0 1 1
## 4 41.0 1 1
## 5 38.1 1 1
## 6 45.1 1 1

plot = ggplot(Heart,aes(x=obesity,y=chd))+geom_point()
plot

[Figure: scatter plot of chd (0/1) against obesity, with obesity roughly 20 to 60 on the x-axis]
We will now split the data into a training and a testing set. We set a random seed value to ensure the
random sample remains reproducible. We then draw a sample of size 300 as the training dataset, which
will be used to estimate the model parameters, and a test sample of size 162, which will be used to evaluate
the model. Because the test dataset is not seen by the model during training, the results we obtain from
the evaluation will be unbiased.

set.seed(1234)
train_ind <- sample(seq_len(nrow(Heart)), size = 300)

TrainSetHeart = Heart[train_ind,]
TestSetHeart = Heart[-train_ind,]

We will now fit the logistic regression model using the glm function on the training dataset.
The first argument (formula = chd_dum~obesity) gives the structure of the model: chd_dum is the outcome
variable explained by the independent variables, which here is just obesity.
The second argument (data = TrainSetHeart) gives the data that should be used to fit the model.
The third argument (family = binomial) indicates that the model we want to fit is a logistic regression.

# Fitting logistic regression model
Logistic = glm(formula = chd_dum~obesity,data = TrainSetHeart,family = binomial)

The following outputs a summary of the model. From this we can see that the estimate for the intercept
is -23.52 and for the slope associated with obesity is 0.67. Both estimates are significant at a 1% level of
significance since their p-values are smaller than 0.01.

summary(Logistic)

##
## Call:
## glm(formula = chd_dum ~ obesity, family = binomial, data = TrainSetHeart)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9493 -0.0868 -0.0196 0.0771 2.6753
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -23.5176 3.5790 -6.571 5.0e-11 ***
## obesity 0.6723 0.1014 6.628 3.4e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 388.468 on 299 degrees of freedom
## Residual deviance: 60.211 on 298 degrees of freedom
## AIC: 64.211
##
## Number of Fisher Scoring iterations: 8
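
As a quick check of what these coefficients mean, consider a person with an obesity index of 35. The fitted log odds are -23.5176 + 0.6723 × 35 ≈ 0.013, giving a probability of exp(0.013)/(1 + exp(0.013)) ≈ 0.503, i.e. close to an even chance of chronic heart disease. In R this can be computed directly with the built-in inverse logit function:

plogis(-23.5176 + 0.6723*35) # approximately 0.503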

We will now extract the coefficients of the logistic regression model and use them to overlay a line on our
scatter plot from above, so we can see the estimated model. Note that β0 + β1x1 gives the log odds of y being
1, so we need to transform this to obtain the probability of chronic heart disease.

B0 = Logistic$coefficients[1]
B1 = Logistic$coefficients[2]
Obs = seq(10,70)
LogOdds = B0+B1*Obs
Prob = exp(LogOdds)/(1+exp(LogOdds))
Predicted = as.data.frame(cbind(Obs,Prob))

plot = ggplot(Heart,aes(x = obesity,y = chd)) + geom_point() +
  geom_line(data = Predicted,mapping = aes(x = Obs,y = Prob),col='Red',lwd=1)
plot

[Figure: scatter plot of chd (0/1) against obesity with the fitted logistic curve overlaid in red]

We will now use the test dataset to evaluate the fit of the model. Let’s start by defining the confusion matrix.

                1 (Actual)            0 (Actual)
1 (Predicted)   True Positive (TP)    False Positive (FP)
0 (Predicted)   False Negative (FN)   True Negative (TN)

True Positive (TP) - the number of observations that are 1 and are correctly predicted by the model as 1.
False Negative (FN) - the number of observations that are 1 and are wrongly predicted by the model as 0.
False Positive (FP) - the number of observations that are 0 and are wrongly predicted by the model as 1.
True Negative (TN) - the number of observations that are 0 and are correctly predicted by the model as 0.

Using this we can calculate some metrics, for example:
Accuracy = (TP+TN)/(TP+FN+FP+TN) - the proportion of correctly predicted observations.
Recall = TP/(TP+FN) - the proportion of ones which are correctly predicted as one.
Precision = TP/(TP+FP) - the proportion of observations predicted as one which are actually one.
Now we will calculate these metrics for our example. The model gives us the probability of an observation
being 1. We therefore have to choose a cut-off probability which we will use to decide whether an observation
should be predicted as 1. A cut-off of 0.5 is a logical choice, since above this point the probability of the
observation being 1 is greater than the probability of it being 0.

LogOdds = B0 + B1*(TestSetHeart$obesity)
Prob = exp(LogOdds)/(1 + exp(LogOdds))
Yhat = Prob > 0.5

Y = TestSetHeart$chd
TP = sum((Y==1)*(Yhat==1))
TN = sum((Y==0)*(Yhat==0))
FP = sum((Y==0)*(Yhat==1))
FN = sum((Y==1)*(Yhat==0))

Accuracy = (TP+TN)/(TP+TN+FP+FN)
Accuracy

## [1] 0.962963

Precision = TP/(TP+FP)
Precision

## [1] 0.9454545

Recall = TP/(TP+FN)
Recall

## [1] 0.9454545

The confusion matrix can also be obtained using the confusionMatrix function from the caret package. When
doing so, make use of the positive argument, which specifies the factor level that corresponds to a
"positive" result.

# Predict the response probabilities for the test set.
pred <- predict(Logistic, newdata = TestSetHeart, type = "response")

# Recode predictions as factors.
y_pred_num <- ifelse(pred > 0.5, 1, 0)
y_pred <- factor(y_pred_num, levels = c(0,1))
y_act <- TestSetHeart$chd_dum

# Performance metrics.
confusionMatrix(data = y_pred, reference = y_act, positive = "1")

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 104 3
## 1 3 52
##
## Accuracy : 0.963
## 95% CI : (0.9211, 0.9863)
## No Information Rate : 0.6605
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9174
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9455
## Specificity : 0.9720
## Pos Pred Value : 0.9455
## Neg Pred Value : 0.9720
## Prevalence : 0.3395
## Detection Rate : 0.3210
## Detection Prevalence : 0.3395
## Balanced Accuracy : 0.9587
##
## 'Positive' Class : 1
##

Note: When using the confusionMatrix function, the rows and columns of the confusion matrix are always
ordered by factor level (numerically or alphabetically), regardless of which level is positive. Pay close
attention to the positive outcome specified, so that the evaluation metrics are calculated and interpreted
correctly.
Can you identify the TP, TN, FP, and FN values? Pay close attention to the labels displayed in the confusion
matrix.
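The counts are also stored in the object returned by confusionMatrix, so they can be read off programmatically rather than from the printed output. A minimal sketch, assigning the result to an object cm (the name is illustrative):

cm <- confusionMatrix(data = y_pred, reference = y_act, positive = "1")
cm$table           # the 2x2 table of counts
cm$table["1","1"]  # predicted 1, actual 1, i.e. TP when positive = "1"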

Practical Question(s):
Note: Packages can be used to build the logistic regression model(s), as well as to obtain prediction results
and appropriate evaluation metrics.

Question 1
Use the dataset Heart and replicate the results obtained in the example above. The Heart data is loaded
with the code below into the object heart.

# Load the data
heart = read_csv('Heart_data_supervised.csv')

## Parsed with column specification:
## cols(
##   obesity = col_double(),
##   chd = col_double()
## )

head(heart) # View the first 6 rows

## # A tibble: 6 x 2
## obesity chd
## <dbl> <dbl>
## 1 40.3 1
## 2 43.9 1
## 3 47.0 1
## 4 41.0 1
## 5 38.1 1
## 6 45.1 1

Question 2
Consider the dataset Titanic. This dataset contains information about 714 passengers on the Titanic. The
dataset has the following independent variables: age, gender (which is 1 if the passenger is female and 0 if
the passenger is male) and passenger class. The dataset also contains the variable survived, which is 1 if a
passenger survived and 0 if not; a survival outcome of 1 (passenger survived) is considered the positive
outcome. The Titanic data is loaded with the code below into the object titanic.

# Load the data
titanic = read_csv('Titanic.csv')

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Survived = col_double(),
##   PassengerClass = col_double(),
##   Age = col_double(),
##   Gender = col_double()
## )

head(titanic) # View the first 6 rows

## # A tibble: 6 x 5
## X1 Survived PassengerClass Age Gender
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 3 22 0
## 2 2 1 1 38 1
## 3 3 1 3 26 1
## 4 4 1 1 35 1
## 5 5 0 3 35 0
## 6 7 0 1 54 0

Question 2a

Use the dataset and split it into a train and test set using a random seed of 1357, with a train set of size
600 and a test set of size 114.

Question 2b

Use the train dataset obtained in Question 2a along with the glm function to fit a logistic regression model
that uses all 3 independent variables to model whether a passenger survived or not. Give a summary of
the fitted model.

Question 2c

Use the summary of the model to get the estimates as well as the relevant p-values of the parameters.
Comment on these values as well as their significance.

Question 2d

Use the test dataset to evaluate the model. Use this set to calculate the true positive, true negative, false
positive and false negative values. Then use these values to calculate the accuracy, precision and recall rates.
Comment on each of these metrics with regard to the performance of the model.

Question 3

Repeat Question 2, but split the data into a train and test set using a random seed of 42 and an 80/20 split.
The Titanic data is loaded with the code below into the object titanic.

# Load the data
titanic = read_csv('Titanic.csv')

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Survived = col_double(),
##   PassengerClass = col_double(),
##   Age = col_double(),
##   Gender = col_double()
## )

titanic <- titanic[,2:5]
head(titanic) # View the first 6 rows

## # A tibble: 6 x 4
## Survived PassengerClass Age Gender
## <dbl> <dbl> <dbl> <dbl>
## 1 0 3 22 0
## 2 1 1 38 1
## 3 1 3 26 1
## 4 1 1 35 1
## 5 0 3 35 0
## 6 0 1 54 0
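
As a hint for the 80/20 split, the training set size can be computed from the number of rows and then used with sample exactly as in the example above (a sketch under that assumption; the object names are illustrative):

set.seed(42)
train_size <- floor(0.8 * nrow(titanic)) # 80% of the rows for training
train_ind <- sample(seq_len(nrow(titanic)), size = train_size)
TrainSetTitanic <- titanic[train_ind,]
TestSetTitanic <- titanic[-train_ind,]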
