Logistic Regression
As in univariate logistic regression, let π(X) represent the probability of an event that
depends on p covariates or independent variables. Then, using an inverse-logit (inv.logit)
formulation for modeling the probability, we have:

\[
\pi(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}
\]
So, the form is identical to univariate logistic regression, but now with more than one
covariate. [Note: by “univariate” logistic regression, I mean logistic regression with
one independent variable; really there are two variables involved, the independent
variable and the dichotomous outcome, so it could also be termed bivariate.]
To obtain the corresponding logit function from this, we calculate (letting X represent
the whole set of covariates X1, X2, ..., Xp):

\[
\mathrm{logit}[\pi(X)] = \ln\left[\frac{\pi(X)}{1-\pi(X)}\right]
= \ln\left[\frac{\dfrac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}}{1 - \dfrac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}}\right]
= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p
\]
So, again, we see that the logit of the probability of an event given X is a simple
linear function.
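As a quick aside (an illustrative sketch, not part of the original notes), the logit and inverse-logit pair is easy to write and check in R; plogis and qlogis are the built-in equivalents:

> logit <- function(p) log(p / (1 - p))           # maps (0, 1) to (-Inf, Inf)
> inv.logit <- function(x) exp(x) / (1 + exp(x))  # maps back; same as plogis(x)
> logit(inv.logit(2.5))
[1] 2.5
> inv.logit(0)
[1] 0.5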
To summarize, we have

\[
\pi(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
\]

which gives the probabilities of outcome events given the covariate values X1, X2, ..., Xp,
and
\[
\mathrm{logit}[\pi(X)] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p
\]
which shows that logistic regression is really just a standard linear regression model,
once we transform the probability of the dichotomous outcome by the logit transform.
The transform changes the range from the (0, 1) scale of π(X) to (−∞, +∞), as usual
for linear regression.
Again analogously to univariate logistic regression, the above equations describe mean
probabilities, and each data point will have an error term. Once again, we assume
that this error has mean zero, and that the outcome follows a binomial distribution with
mean π(X) and variance π(X)(1 − π(X)). Of course, X is now a vector, whereas before
it was a scalar value.
When all covariate values are set equal to zero, the probability of an event reduces to

\[
\pi(x) = \frac{e^{\beta_0}}{1 + e^{\beta_0}}
\]
exactly the same as in the univariate case. So, the interpretation of β0 remains
the same as in the simpler case: β0 sets the “baseline” event rate, through the
above function, when all covariate values are set equal to zero.
For example, if β0 = 0 then
\[
\pi(x) = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{e^{0}}{1 + e^{0}} = \frac{1}{1 + 1} = 0.5
\]
and if β0 = 1 then
\[
\pi(x) = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{e^{1}}{1 + e^{1}} = 0.73
\]
and if β0 = −1 then
\[
\pi(x) = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{e^{-1}}{1 + e^{-1}} = 0.27
\]
and so on.
As before, positive values of β0 give probabilities greater than 0.5, while negative
values of β0 give probabilities less than 0.5, when all covariates are set to zero.
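These baseline values are easy to check with R's built-in inverse logit, plogis (a quick illustration, not from the original notes):

> plogis(0)
[1] 0.5
> plogis(1)
[1] 0.7310586
> plogis(-1)
[1] 0.2689414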
• When there is more than one independent variable, if all variables are
completely uncorrelated with each other, then the interpretations of all
coefficients are simple, and follow the above pattern: we have OR = e^{zβ_i}
for any variable X_i, i = 1, 2, ..., p, where the OR represents the odds
ratio for a change of size z in that variable. (A short R illustration
follows these two points.)
• When the variables are not uncorrelated, the interpretation is more difficult.
It is common to say that OR = e^{zβ_i} represents the odds ratio for
a change of size z in that variable, adjusted for the effects of the other
variables. While this is essentially correct, we must keep in mind that
confounding and collinearity can change and obscure these estimated rela-
tionships. The way confounding operates is identical to what we saw for
linear regression.
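For instance (an illustrative sketch with a hypothetical coefficient value, not a fit from these notes):

> beta.i <- 0.0347    # hypothetical fitted coefficient for X_i
> z <- 10             # change of size z in X_i
> exp(z * beta.i)     # odds ratio for a 10-unit increase, about 1.41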
As in the univariate case, the distribution associated with logistic regression is the
binomial. For a single subject with covariate values x_i = (x_{1i}, x_{2i}, ..., x_{pi}), the
likelihood function is

\[
\pi(x_i)^{y_i} \bigl(1 - \pi(x_i)\bigr)^{1 - y_i}
\]

and for a data set of n independent subjects, the likelihood is the product

\[
\prod_{i=1}^{n} \pi(x_i)^{y_i} \bigl(1 - \pi(x_i)\bigr)^{1 - y_i}
\]
We again omit the details here (as in the univariate case, no easy closed-form formulas
exist), and will rely on statistical software to find the maximum likelihood estimates
for us.
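To make the idea concrete, here is a minimal sketch (not part of the original notes) that maximizes the binomial log-likelihood numerically on simulated data and checks the answer against glm(); all variable names are illustrative:

> # negative log-likelihood: -sum_i [ y_i*eta_i - log(1 + exp(eta_i)) ], eta = X beta
> negloglik <- function(beta, X, y) {
+   eta <- as.vector(X %*% beta)
+   -sum(y * eta - log(1 + exp(eta)))
+ }
> set.seed(1)
> n <- 200
> x1 <- rnorm(n)
> x2 <- rbinom(n, 1, 0.5)
> y <- rbinom(n, 1, plogis(-1 + 0.8 * x1 + 0.5 * x2))
> X <- cbind(1, x1, x2)
> optim(c(0, 0, 0), negloglik, X = X, y = y, method = "BFGS")$par  # numerical MLE
> coef(glm(y ~ x1 + x2, family = binomial))                        # agrees closely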
Inferences typically rely on SE formulae for confidence intervals, and likelihood ratio
testing for hypothesis tests. Again, we will omit the details, and rely on statistical
software.
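Continuing the simulated example above (again a sketch, not the notes' code), Wald intervals and a likelihood ratio test are one line each in R:

> fit1 <- glm(y ~ x1 + x2, family = binomial)
> confint.default(fit1)              # Wald (SE-based) 95% confidence intervals
> fit0 <- glm(y ~ x1, family = binomial)
> anova(fit0, fit1, test = "Chisq")  # likelihood ratio test for x2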
In all cases, we will follow a similar procedure to that followed for multiple linear
regression:
1. Look at various descriptive statistics to get a feel for the data. For logistic
regression, this usually includes comparing descriptive statistics within the
“outcome = yes = 1” versus “outcome = no = 0” subgroups.
2. The above “by outcome group” descriptive statistics are often sufficient for
discrete covariates, but you may want to prepare some graphics for continuous
variables. Recall that we did this for the age variable when looking at the CHD
example.
7. Think about any “interaction terms” that you may want to try in the model.
8. Perform some sort of model selection technique, or, often much better, think
about avoiding any strict model selection by finding a set of models that seem
to have something to contribute to overall conclusions.
9. Based on all work done, draw some inferences and conclusions. Carefully inter-
pret each estimated parameter, perform “model criticism”, possibly repeating
some of the above steps (for example, run further models), as needed.
10. Other inferences, such as predictions for future observations, and so on.
As with linear regression, the above should not be considered as “rules”, but rather
as a rough guide as to how to proceed through a logistic regression analysis.
Chapter 1 (Section 1.6.1) of the Hosmer and Lemeshow book describes a data set
called ICU. Deleting the ID variable, there are 20 variables in this data set; they
are described in the table in that section of the text.
The main outcome is vital status, alive or dead, coded as 0/1 respectively, under the
variable name sta. For this illustrative example, we will investigate the effect of the
dichotomous variables sex, ser, and loc. Later, we will look at more of the variables.
> # create a reduced data set with the four variables of interest
> icu1.dat <- data.frame(sta=icu.dat$sta, loc=icu.dat$loc,
+                        sex=icu.dat$sex, ser=icu.dat$ser)
> summary(icu1.dat)
sta loc sex ser
Min. :0.0 Min. :0.000 Min. :0.00 Min. :0.000
1st Qu.:0.0 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
Median :0.0 Median :0.000 Median :0.00 Median :1.000
Mean :0.2 Mean :0.125 Mean :0.38 Mean :0.535
3rd Qu.:0.0 3rd Qu.:0.000 3rd Qu.:1.00 3rd Qu.:1.000
Max. :1.0 Max. :2.000 Max. :1.00 Max. :1.000
# Notice that loc, sex, and ser need to be made into factor variables
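# One standard way to do this (the conversion commands themselves are
# not shown in the original output):
> icu1.dat$loc <- factor(icu1.dat$loc)
> icu1.dat$sex <- factor(icu1.dat$sex)
> icu1.dat$ser <- factor(icu1.dat$ser)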
# Look at reduced data set again, this time with factor variables
> summary(icu1.dat)
sta loc sex ser
Min. :0.0 0:185 0:124 0: 93
1st Qu.:0.0 1: 5 1: 76 1:107
Median :0.0 2: 10
Mean :0.2
3rd Qu.:0.0
Max. :1.0
# Preliminary comments:
- Not too many events, only 20% rate
- loc may not be too useful, poor variability
- sex and ser reasonably well balanced
# Pairwise cross-tabulations of the four variables:
> table(icu1.dat$sta, icu1.dat$sex)   # rows: sta; columns: sex
      0   1
  0 100  60
  1  24  16
> table(icu1.dat$sta, icu1.dat$ser)   # rows: sta; columns: ser
      0   1
  0  67  93
  1  26  14
> table(icu1.dat$sta, icu1.dat$loc)   # rows: sta; columns: loc
      0   1   2
  0 158   0   2
  1  27   5   8
> table(icu1.dat$sex, icu1.dat$ser)   # rows: sex; columns: ser
      0   1
  0  54  70
  1  39  37
> table(icu1.dat$sex, icu1.dat$loc)   # rows: sex; columns: loc
      0   1   2
  0 116   3   5
  1  69   2   5
> table(icu1.dat$ser, icu1.dat$loc)   # rows: ser; columns: loc
      0   1   2
  0  84   2   7
  1 101   3   3
Call:
glm(formula = sta ~ sex, family = binomial, data = icu1.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6876 -0.6876 -0.6559 -0.6559 1.8123
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4271 0.2273 -6.278 3.42e-10 ***
sex1 0.1054 0.3617 0.291 0.771
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
$intercept.ci
[1] -1.8726220 -0.9816107
$slopes.ci
[1] -0.6035757 0.8142967
$OR
sex1
1.111111
$OR.ci
[1] 0.5468528 2.2575874
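For reference, these numbers can be reproduced with base R alone (a sketch; logistic.regression.or.ci is the course's helper function, and output is assumed to hold the glm fit):

> exp(coef(output)["sex1"])               # the $OR value
> exp(confint.default(output)["sex1", ])  # the $OR.ci values (Wald-based)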
Call:
glm(formula = sta ~ ser, family = binomial, data = icu1.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8098 -0.8098 -0.5296 -0.5296 2.0168
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9466 0.2311 -4.097 4.19e-05 ***
ser1 -0.9469 0.3682 -2.572 0.0101 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
$intercept.ci
[1] -1.3994574 -0.4937348
$slopes.ci
[1] -1.6685958 -0.2252964
$OR
ser1
0.3879239
$OR.ci
[1] 0.1885116 0.7982796
Call:
glm(formula = sta ~ loc, family = binomial, data = icu1.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7941 -0.5617 -0.5617 -0.5617 1.9619
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7668 0.2082 -8.484 < 2e-16 ***
loc1 18.3328 1073.1090 0.017 0.986370
loc2 3.1531 0.8175 3.857 0.000115 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
$intercept.ci
[1] -2.174912 -1.358605
$slopes.ci
[,1] [,2]
[1,] -2084.922247 2121.587900
[2,] 1.550710 4.755395
$OR
loc1 loc2
9.158944e+07 2.340741e+01
$OR.ci
[,1] [,2]
[1,] 0.000000 Inf
[2,] 4.714817 116.2095
Note the enormous estimate and standard error for loc1: no patients with loc = 1
survived (the sta by loc table has a zero cell), so the maximum likelihood estimate
is effectively infinite, and the Wald-based confidence interval of (0, ∞) is
uninformative.
Call:
glm(formula = sta ~ sex + ser, family = binomial, data = icu1.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8168 -0.8048 -0.5266 -0.5266 2.0221
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.96129 0.27885 -3.447 0.000566 ***
sex1 0.03488 0.36896 0.095 0.924688
ser1 -0.94442 0.36915 -2.558 0.010516 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
$intercept.ci
[1] -1.5078281 -0.4147469
$slopes.ci
[,1] [,2]
[1,] -0.6882692 0.758025
[2,] -1.6679299 -0.220904
$OR
sex1      ser1
1.0354933 0.3889063
$OR.ci
[,1] [,2]
[1,] 0.5024449 2.1340574
[2,] 0.1886372 0.8017936
To obtain predicted probabilities for each combination of sex and ser, we set up a
small data frame of covariate patterns:
> newdata
sex ser
1 0 0
2 0 1
3 1 0
4 1 1
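The predicted probabilities themselves are not shown above; a call along the following lines (a sketch, assuming the sex + ser fit is stored in output) would produce them:

> # sex and ser must be factors with the same levels used in the fit
> newdata$sex <- factor(newdata$sex, levels = c(0, 1))
> newdata$ser <- factor(newdata$ser, levels = c(0, 1))
> predict(output, newdata = newdata, type = "response")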
We will continue with the same example, but now consider many more variables,
including both categorical and continuous variables.
Very shortly, we will see an excellent way to simultaneously select a model and to
investigate confounding in data sets with a large number of variables. For now, we
will take a quick look at logistic regression using four variables from the ICU data
set: age, sex, ser, and typ.
> summary(icu2.dat)
sta sex ser age typ
Min. :0.0 0:124 0: 93 Min. :16.00 0: 53
1st Qu.:0.0 1: 76 1:107 1st Qu.:46.75 1:147
Median :0.0 Median :63.00
Mean :0.2 Mean :57.55
3rd Qu.:0.0 3rd Qu.:72.00
Max. :1.0 Max. :92.00
> table(icu2.dat$sta, icu2.dat$typ)   # rows: sta; columns: typ
      0   1
  0  51 109
  1   2  38
> table(icu2.dat$ser, icu2.dat$typ)   # rows: ser; columns: typ
      0   1
  0   1  92
  1  52  55
Descriptive summaries of age by vital status (not shown here) suggest that those
with higher ages also have higher death rates.
> output <- glm(sta ~ sex + ser + age + typ, data=icu2.dat, family=binomial)
> logistic.regression.or.ci(output)
$regression.table
Call:
glm(formula = sta ~ sex + ser + age + typ, family = binomial,
data = icu2.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2753 -0.7844 -0.3920 -0.2281 2.5072
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.26359 1.11678 -4.713 2.44e-06 ***
sex1 -0.20092 0.39228 -0.512 0.60851
ser1 -0.23891 0.41697 -0.573 0.56667
age 0.03473 0.01098 3.162 0.00156 **
typ1 2.33065 0.80238 2.905 0.00368 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
$intercept.ci
[1] -7.452432 -3.074752
$slopes.ci
[,1] [,2]
[1,] -0.96976797 0.56792495
[2,] -1.05615818 0.57834395
[3,] 0.01320442 0.05624833
[4,] 0.75801170 3.90328595
$OR
sex1 ser1 age typ1
0.8179766 0.7874880 1.0353364 10.2846123
$OR.ci
[,1] [,2]
[1,] 0.3791710 1.764602
[2,] 0.3477894 1.783083
[3,] 1.0132920 1.057860
[4,] 2.1340289 49.565050
As expected, age has a strong effect, with an odds ratio of 1.035 per year, or
1.035^{10} ≈ 1.41 per decade (95% CI per year of (1.013, 1.058), so (1.141, 1.755)
per decade). The typ variable also has a very strong effect, with an estimated odds
ratio above 10 and a confidence interval lying entirely above 2.
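These per-decade figures are easy to reproduce from the fitted model (a sketch, assuming the fit is stored in output as above):

> exp(10 * coef(output)["age"])               # OR per decade of age
> exp(10 * confint.default(output)["age", ])  # Wald 95% CI per decade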
There does indeed seem to be some confounding between ser and typ: the coefficient
estimate for ser has changed drastically from when typ was not in the model. In fact,
ser no longer looks “important”; it has been “replaced” by typ. Because of the high
correlation between ser and typ, it is difficult to separate out the effects of these two
variables.
We will return to this issue when we discuss model selection for logistic regression.
Going back to the example where we had just sex and ser in the model, what if we
wanted to investigate an interaction term between these two variables?
So far, we have seen that ser is associated with a strong effect, but the effect of sex
was inconclusive. What if the effect of ser is different among males and females,
i.e., what if there is an interaction (sometimes called effect modification) between
sex and ser?
> # the interaction variable is the product of the ser and sex indicators
> ser.sex <- icu.dat$ser * icu.dat$sex
> ser.sex
[1] 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1
[36] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 1 0
[71] 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0
[106] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
[141] 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0
[176] 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
> summary(icu3.dat)
sta ser sex ser.sex
Min. :0.0 Min. :0.000 Min. :0.00 Min. :0.000
1st Qu.:0.0 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:0.000
Median :0.0 Median :1.000 Median :0.00 Median :0.000
Mean :0.2 Mean :0.535 Mean :0.38 Mean :0.185
3rd Qu.:0.0 3rd Qu.:1.000 3rd Qu.:1.00 3rd Qu.:0.000
Max. :1.0 Max. :1.000 Max. :1.00 Max. :1.000
> output <- glm(sta ~ sex + ser + ser.sex, data=icu3.dat, family=binomial)
> logistic.regression.or.ci(output)
$regression.table
Call:
glm(formula = sta ~ sex + ser + ser.sex, family = binomial, data = icu3.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8696 -0.7244 -0.4590 -0.4590 2.1460
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.7777 0.2930 -2.654 0.00795 **
sex -0.4263 0.4799 -0.888 0.37440
ser -1.4195 0.4945 -2.870 0.00410 **
ser.sex 1.1682 0.7518 1.554 0.12021
---
20
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
$intercept.ci
[1] -1.3519790 -0.2034301
$slopes.ci
[,1] [,2]
[1,] -1.366836 0.5142995
[2,] -2.388770 -0.4502697
[3,] -0.305277 2.6416881
$OR
sex ser ser.sex
0.6529412 0.2418301 3.2162162
$OR.ci
[,1] [,2]
[1,] 0.25491223 1.6724666
[2,] 0.09174244 0.6374562
[3,] 0.73691921 14.0368799
Note that one needs to be careful in interpreting the odds ratios from the above
output, because of the interaction term.
The OR given above for the sex variable, 0.653, applies only within the medical service
(coded as 0 in the ser variable), and the OR given above for ser, 0.242, applies only
within males (coded 0 in the sex variable).
In order to obtain the OR for sex within ser = 1 (the surgical category), or to obtain
the OR for ser within females (sex = 1), one needs to multiply by the OR from the
interaction term. Hence, we have:

OR for sex within ser = 1: 0.653 × 3.216 = 2.10
OR for ser within sex = 1: 0.242 × 3.216 = 0.778
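A confidence interval for such a combined odds ratio needs the covariance between the two coefficients as well; a minimal sketch (assuming the interaction fit above is stored in output):

> b <- coef(output)
> V <- vcov(output)
> est <- b["ser"] + b["ser.sex"]   # log odds ratio for ser within sex = 1
> se <- sqrt(V["ser", "ser"] + V["ser.sex", "ser.sex"] + 2 * V["ser", "ser.sex"])
> exp(est)                         # combined OR, about 0.78
> exp(est + c(-1.96, 1.96) * se)   # approximate 95% CI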