SPSS Binary Logistic Regression Demo 1: Terminate
A link for the data, as well as this PowerPoint, will be made available for download underneath the video
description. Additionally, a “running” document containing links to other videos on logistic regression
using other programs will be made available as well.
If you find the video and materials I have made available useful, please take the time to “like” the video
and share it with others. Also, please consider subscribing to receive information on new statistics videos
I upload!
Binary logistic regression (BLR) is utilized when a researcher desires to model the relationship between one or
more predictor variables and a binary dependent variable. Fundamentally, the researcher is addressing the
question, “What is the probability that a given case falls into one of two categories on the dependent variable,
given the predictors in the model?”
One might be inclined to ask why we don’t use standard ordinary least squares regression (OLS) instead of BLR.
OLS regression assumes (a) a linear relationship between the independent variables and the dependent
variable, (b) the residuals are normally distributed, and (c) the residuals exhibit constant variance (Pituch &
Stevens, 2016). All three assumptions are violated if the outcome variable in an OLS model is binary. And
pivoting off (a), the relationship between one or more predictors and the probability of a target outcome is
inherently non-linear (and takes on an S-shaped curve), as probabilities are bounded at 0 and 1. When
modeling a binary outcome using OLS regression, the estimation of model parameters ignores this
boundedness, which has the notable effect of producing predicted probabilities that fall outside the 0-1 range.
BLR estimates regression parameters by taking into account the fact that probabilities are bounded at 0 and 1.
It also does not assume that residuals are normally distributed or exhibit constant variance.
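In its standard form, the BLR model expresses the probability of the target outcome as an S-shaped (logistic) function of a linear combination of the predictors, which keeps predicted probabilities within the 0-1 range:

P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}}

Equivalently, the model is linear on the log-odds (logit) scale:

\ln\left(\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k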
Model estimation:
Unlike OLS regression, BLR uses maximum likelihood (ML) to estimate model parameters. Maximum likelihood
estimation is an iterative process aimed at arriving at population (parameter) values that most likely produced
the observed (sample) data. In general, this estimation approach assumes large samples and, aside from issues
of power, smaller sample sizes can create problems with model convergence and estimation of model
parameters. [Side note: With smaller samples, exact logistic regression or the Firth procedure using penalized
maximum likelihood can be used. Unfortunately, these options are not commonly available in statistics
programs.]
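Concretely, ML estimation selects the coefficient values that maximize the log-likelihood of the observed 0/1 outcomes given the model’s predicted probabilities:

\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \ln(\hat{p}_i) + (1 - y_i) \ln(1 - \hat{p}_i) \right]

where \hat{p}_i is the predicted probability of the target category for case i. Multiplying the maximized log-likelihood by -2 yields the -2LL (model deviance) value that SPSS reports.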
As with standard OLS regression, evaluation of fit occurs at two levels. The first involves evaluating the fit of the
full-model (i.e., containing the full set of predictors), which is typically done using a likelihood-ratio chi-square
test (which compares the full model with a null, or intercept-only, model) and the Hosmer-Lemeshow test
results. Moreover, overall model fit is often assessed using “pseudo-r-squared” indices and by evaluating the
degree to which the model is able to classify individuals into groups on the dependent variable (Smith &
McKenna, 2013). Following this, one evaluates the individual predictors for their contribution to overall model fit.
This is done using either the Wald test or likelihood ratio tests (the latter involves comparing the full model with
all the predictors against a reduced model with a given predictor removed).
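For the full-model test, the likelihood-ratio chi-square is simply the drop in deviance from the null (intercept-only) model to the full model, evaluated with degrees of freedom equal to the number of estimated predictor coefficients, k:

\chi^2_{LR} = (-2LL_{null}) - (-2LL_{full}), \qquad df = k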
SPSS example
Scenario:
In this example, we are attempting to model the likelihood of early termination from counseling in a sample
of n=45 clients at a community mental health center. The dependent variable in the model is 'terminate'
(coded 1=terminated early, 0=did not terminate early), where the “did not terminate” group is the reference
(baseline) category and the “terminated early” group is the target category. Two predictors in the model are
categorical: gender identification ('genderid', coded 0=identified as male, 1=identified as female) and 'income'
(ordinal variable, coded 1=low, 2=medium, 3=high). The reference category for ‘genderid’ is male
identification, whereas the reference category for ‘income’ is the low income group. Finally, two predictors
are assumed continuous in the model: avoidance of disclosure ('avdisc') and symptom severity ('sympsev').
https://drive.google.com/open?id=1Etmudy8b6SZRykSPxCyFzG8ANQZwv966
When you have categorical predictor variables, you’ll need to let SPSS know to treat them as factors. By doing so,
SPSS will do the dummy coding for you. Here, I’ve clicked the Categorical button and, when the dialog box
opened up, I moved ‘income’ over to the right, set the ‘contrast’ to Indicator, and clicked on ‘first’ to make the
low income group the reference category. For the changes to take effect, I had to click the ‘change’ button.
Although gender identification could technically be included as a predictor without this recoding (since it is a
dichotomous variable), I am going ahead and treating the variable as a categorical covariate and setting the
reference category to ‘First’. I also clicked Change for the changes to take effect.
Clicking on the Options button allows you to request Classification Plots, the Hosmer-Lemeshow goodness-of-fit
test (although this test is not highly recommended), and confidence intervals for the odds ratios.
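For readers who prefer syntax to point-and-click, the choices described above can be reproduced with (approximately) the following pasted LOGISTIC REGRESSION command; the variable names come from the dataset, and the CRITERIA line reflects typical SPSS defaults, which may vary slightly across versions:

LOGISTIC REGRESSION VARIABLES terminate
  /METHOD=ENTER genderid income avdisc sympsev
  /CONTRAST (genderid)=Indicator(1)
  /CONTRAST (income)=Indicator(1)
  /CLASSPLOT
  /PRINT=GOODFIT CI(95)
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

Here, Indicator(1) sets the first category of each factor as the reference, /CLASSPLOT requests the classification plot, and /PRINT=GOODFIT CI(95) requests the Hosmer-Lemeshow test and 95% confidence intervals for the odds ratios.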
The output is presented in blocks.
The -2*Log likelihood (also referred to as “model deviance”; Pituch & Stevens, 2016) is most useful for
comparing competing models, as the difference in -2LL between nested models is distributed as chi-square (Field, 2018).
Given the limitations of “pseudo-R-square” as indices of model fit, some authors have proposed an
alternative approach to computing R-square based on logistic regression that is more akin to the R-
square in OLS regression. Tabachnick and Fidell (2013) and Lomax and Hahs-Vaughn (2012) suggest
that one can compute an R-square value by (a) correlating group membership on the dependent
variable with the predicted probability associated with (target) group membership (effectively a
point-biserial correlation) and (b) squaring that correlation. As a demonstration with the current
data:
With the current data, the correlation between group membership on the dependent variable and the predicted
probabilities is r = .729, which yields R-square = .531 (as can be seen, these two variables share
considerable variation).
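In SPSS syntax, this two-step computation amounts to saving the predicted probabilities from the fitted model (saved as PRE_1 by default) and then correlating them with the outcome; a sketch:

LOGISTIC REGRESSION VARIABLES terminate
  /METHOD=ENTER genderid income avdisc sympsev
  /CONTRAST (genderid)=Indicator(1)
  /CONTRAST (income)=Indicator(1)
  /SAVE=PRED.
CORRELATIONS
  /VARIABLES=terminate PRE_1.

Squaring the resulting point-biserial correlation gives the R-square analogue described above.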
The Hosmer & Lemeshow test is another test that can be used to
evaluate global fit. A non-significant test result (as seen here; p=.97)
is an indicator of good model fit.
The 95% confidence interval for the odds ratio (OR) can also be used to determine whether the observed OR is
significantly different from the null OR of 1.0. If 1.0 falls between the lower and upper bound of a given
interval, then the computed odds ratio is not significantly different from 1.0 (indicating no change in the odds
as a function of the predictor).
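The interval SPSS reports is obtained by exponentiating the Wald confidence limits computed on the raw coefficient:

95\% \; CI_{OR} = \exp\left(b \pm 1.96 \times SE(b)\right)

Because the interval for the OR excludes 1.0 exactly when the interval for b excludes 0, the two tests agree.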
Avoidance of disclosure is a positive and significant (b=.3866, s.e.=.138, p=.005) predictor of the
probability of early termination, with the OR indicating that for every one unit increase on this predictor
the odds of early termination change by a factor of 1.472 (meaning the odds are increasing).
Symptom severity is a negative and significant (b=-.3496, s.e.=.137, p=.010) predictor of the probability of
early termination. The OR indicates that for every one unit increment on the predictor, the odds of
terminating early change by a factor of .705 (meaning that the odds are decreasing).
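As a check, both odds ratios follow directly from exponentiating the coefficients:

OR_{avdisc} = e^{0.3866} \approx 1.472, \qquad OR_{sympsev} = e^{-0.3496} \approx 0.705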
Genderid is a non-significant predictor of early termination (b=-1.5079, s.e.=.956, p=.115). [Had the
predictor been significant, then the negative coefficient would be taken as an indicator that females
(coded 1) are less likely to terminate early than males.]
Income is represented by two dummy variables:
The first dummy variable is a comparison of the medium (coded 1 on the variable) and low (reference
category coded 0 on the variable) income groups. The negative coefficient suggests that persons in the
medium income category were less likely to terminate early than those in the low income category.
Nevertheless, the difference is not significant (b=-1.0265, s.e.=1.323, p=.438). Similarly, the second
dummy variable compares the high income group (coded 1 on the variable) and the low income group
(again, the reference category; coded 0). The difference between the groups is not significant, however
(b=-.8954, s.e.=1.022, p=.401).
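For reference, the Indicator(first) scheme used here generates the two income dummy variables as follows:

Income group       income(1)   income(2)
low (reference)        0           0
medium                 1           0
high                   0           1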
References
Field, A. (2018). Discovering statistics using IBM SPSS statistics (5th ed.). Los Angeles: Sage.
Lomax, R.G., & Hahs-Vaughn, D.L. (2012). An introduction to statistical concepts (3rd ed.). New York: Routledge.
Osborne, J.W. (2015). Best practices in logistic regression. Los Angeles: Sage.
Pituch, K.A., & Stevens, J.A. (2016). Applied multivariate statistics for the social sciences (6th ed.). New York: Routledge.
Smith, T.J., & McKenna, C.M. (2013). A comparison of logistic regression pseudo R2 indices. Multiple Linear Regression Viewpoints,
39, 17-26. Retrieved from http://www.glmj.org/archives/articles/Smith_v39n2.pdf on June 20, 2019.
Tabachnick, B.G., & Fidell, L.S. (2013). Using multivariate statistics (6th ed.). New York: Pearson.