Binary Logistic Regression Using Stata 17 Drop-Down Menus
May 2021
https://drive.google.com/file/d/1YisZj7IqObhPExjFxQ1XJLqZcPh1ylSA/view?usp=sharing
[Dataset variables. Example 1: pass (outcome), anxiety, mastery, interest. Example 2 adds: medinc, incomeLev, highinc.]
Example 1: Binary logistic regression with three continuous predictors
[Dialog screenshot: pass is specified as the dependent variable; anxiety, mastery, and interest are the independent variables.]
Step 2
Step 3: Choose whether you wish to display regression coefficients or
odds ratios (along with making any other selections). The default is odds
ratios. To select estimated regression coefficients, click the other button.
Note: for our analysis we will begin with the unstandardized regression
coefficients, and then move to odds ratios later. This merely reflects my
preferred way of looking at the results.
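For reference, the command-line equivalent of these menu selections is roughly the following (a sketch; the variable names match the dataset described above):
* Example 1: report unstandardized regression coefficients
logit pass anxiety mastery interest
The odds-ratio version of the command is shown later, when we return to odds ratios.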
The model chi-square tests whether the model represents a significant increment in fit relative to a null/baseline/intercept-only model. You can also think of it as a test of whether at least one regression slope is significantly different from zero.
The chi-square test result indicates that our model is a significant improvement in fit relative to an intercept-only model, χ²(3) = 36.74, p < .001.
Here, we have the log likelihood for the model (-271.22786), which is an indicator of the degree of lack of fit of the model. The more negative the LL, the worse the model fits the data; the less negative the LL, the better the fit of the model to the data. [Other programs, such as SPSS, provide a -2LL to create an index of deviance from perfect fit, whereby the LL is rescaled so that 0 indicates perfect fit and increasingly positive values indicate worse fit.] By itself, this index usually is not very informative about what constitutes a poorly fitting model. However, it is used in the construction of several measures of overall model fit.
For example, we can compute the LR chi-square value (shown above) in the following way: LR χ² = -2(LL_null - LL_full), where LL_full is the log likelihood of the full model (shown above as -271.22786) and LL_null is the log likelihood of an intercept-only (null) model (although not shown, for this model it is -289.5974). Thus: LR χ² = -2(-289.5974) - [-2(-271.22786)] = 579.1948 - 542.45572 = 36.73908. In effect, the chi-square value is the difference in model deviances for the null and full models (see Pituch & Stevens, 2016).
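If the model has been fit from the command line, this calculation can be reproduced from Stata's stored results (a sketch; after logit, e(ll) holds the full-model log likelihood and e(ll_0) the null-model log likelihood):
display -2*(e(ll_0) - e(ll))    // LR chi-square = 36.73908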
The pseudo R-square value that is provided is an analogy to the R-square value we are most familiar with in the context of linear regression. It can be computed from the log likelihoods (or deviances) for the full and intercept-only models as follows: McFadden's pseudo R-square = 1 - (LL_full/LL_null) = [-2LL_null - (-2LL_full)] / (-2LL_null).
In the previous slide, we computed the -2LL (model deviance) values for the null and full models as 579.1948 and 542.45572, so pseudo R-square = (579.1948 - 542.45572)/579.1948 = .0634.
Values between .2 and .4 may indicate ‘strong improvement in fit’ (Pituch & Stevens, 2016; citing McFadden, 1979).
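The same stored results give McFadden's pseudo R-square directly (a sketch, run after the logit command):
display 1 - e(ll)/e(ll_0)    // = 36.73908/579.1948, approximately .0634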
The Hosmer and Lemeshow test provides a second global fit test, comparing the 'estimated model to one that has perfect fit' (Pituch & Stevens, 2016, p. 455). If this test is not significant, then you have evidence of an adequately specified model. If it is significant, then you have evidence that the model is misspecified, such as through omission of 'nonlinear and/or interaction terms' (p. 456). Here, we see the Hosmer and Lemeshow test is not statistically significant [χ²(8) = 2.11, p = .9776], indicating adequate fit of the model.
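From the command line, this test can be requested after fitting the model (a sketch; the group(10) option gives the usual ten-group version):
estat gof, group(10)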
This table provides additional information that you may find useful in
describing how well your model is fitting. Specifically, it presents
information on the degree to which the observed outcomes are
predicted by your model.
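From the command line, the classification table can be requested after estimation (a sketch; the default classification cutoff is a predicted probability of .5):
estat classification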
The first column of results contains the unstandardized regression coefficients. The slopes for each predictor are provided, along with the intercept (constant). The statistical significance of these parameter estimates is tested using the Wald Z test results.
In logistic regression, we are not modeling the pr(Y=1) directly as a linear function of the predictors. Rather, we use a
mathematical transformation of probabilities into a new variable called a logit. This allows us to model pr(Y=1) – albeit
as a transformed version of itself – as a linear function of the predictors. This linearization of the relationship between
the predictors and the pr(Y=1) occurs via the use of the logit link function (see Heck et al., 2012).
Given logit(Y=1) = ln[pr(Y=1)/(1 - pr(Y=1))] = b0 + b1X1 + b2X2 + b3X3, we see that the fitted values for the regression model are predicted logits(Y=1) for the cases in our data. This means that each unstandardized regression slope is interpreted most literally as the predicted change in logits (or ln(odds(Y=1))) per unit increment on a given predictor, controlling for the other predictors in the model.
Nevertheless, if you want to think about the meaning of the unstandardized regression slopes more generally you can do
so in the following way:
a) With a positive slope, you can say that as scores on the predictor increase, so does the probability of falling into the
target group (Y=1).
b) With a negative slope, you can say that as scores on the predictor increase, the probability of falling into the target group (Y=1) decreases.
c) A slope of zero indicates no systematic increase or decrease in the probability of a case falling into the target group
(Y=1) with increasing values on the predictor.
**Just keep in mind that the regression slopes for your predictors are NOT directly interpreted as the amount of change in pr(Y=1) given a unit increment on the predictor.
The results given here indicate:
Anxiety is a negative and significant predictor of the probability of a student passing the test (b=-.081, s.e.=.032, Wald Z
= -2.51, p=.012).
Mastery goals is a positive and significant predictor of the probability of a student passing the test (b=.096, s.e.=.026,
Wald Z = 3.64, p<.001).
Although the slope for interest is positive, it was not statistically significantly different from 0 (b=.042, s.e.=.031, Wald Z
= 1.35, p=.177).
Here, I am re-invoking the default presentation of odds
ratios. [Again, the sequencing of steps I am using is
merely my own preference.]
In addition to reporting on the
unstandardized regression slope, you likely
will want to report on the odds ratio. That
can be done using the ‘or’ option when
you run your analysis.
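As a command-line sketch of the 'or' option just described:
logit pass anxiety mastery interest, or
* equivalently, the logistic command reports odds ratios by default
logistic pass anxiety mastery interest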
As noted previously, in logistic regression logits are a transformation of probabilities via the logit (log-odds) link function: logit(Y=1) = ln[pr(Y=1)/(1 - pr(Y=1))].
In this equation, you should notice that part of the transformation is that of pr(Y=1) into odds(Y=1). Odds are computed as a ratio of the probabilities of two mutually exclusive outcomes: e.g., odds(A) = pr(A)/pr(not A), or odds(A) = pr(A)/[1 - pr(A)]. If we take the natural log of the odds(A), this will return the logit(A).
Before continuing…review of a few more concepts
Now, we can convert the logit(Y=1) for each case in our data back into the odds(Y=1) through exponentiation: odds(Y=1) = e^(logit(Y=1)).
From there, we can return the pr(Y=1) from odds(Y=1) using the following mathematical transformation: pr(Y=1) = odds(Y=1)/[1 + odds(Y=1)].
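These transformations can be carried out case by case in Stata after estimation (a sketch; xb_hat, odds_hat, and p_hat are hypothetical variable names):
predict xb_hat, xb                          // predicted logit(Y=1) for each case
generate odds_hat = exp(xb_hat)             // odds(Y=1) = e^logit(Y=1)
generate p_hat = odds_hat/(1 + odds_hat)    // pr(Y=1) = odds/(1 + odds)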
At this point, I hope you see that using 'probability' and 'odds' interchangeably is an incorrect way of referring to these concepts, and that odds ratios (see upcoming slides) are not ratios of probabilities (as you might think of when talking about relative risk).
Now back to our output…
Something of note:
If an odds ratio is > 1, then it is indicating that the odds associated with target group membership are increasing with
increases on the predictor. [Logically, we can reason that the probability of a case falling into the target group is greater at
higher levels of the predictor variable.]
If an odds ratio is < 1, then it is indicating that the odds of target group membership are decreasing with increases on the
predictor. [Logically, we can reason that the probability of a case falling into the target group is lower at higher levels on the
predictor variable.]
You can theoretically use the confidence intervals to test the odds ratios for statistical significance. The null hypothesis in
these cases is that the odds ratio for a given predictor equals 1.
If 1 falls between the lower and upper bounds, then you cannot reject the null hypothesis that the population odds ratio equals 1.
If 1 falls outside the lower and upper bounds, then you can infer that the population odds ratio is not equal to 1.
If we wish to talk about the odds ratios in terms of their formal meaning, then we can interpret them as follows:
The odds ratio for anxiety is .922, meaning that the odds of a student passing the test (Y=1) change by a factor of .922 with every unit increase on anxiety. [Since we are multiplying the odds by .922 per unit increase on the predictor, this must mean our odds are decreasing with each increase on the predictor.]
The odds ratio for mastery is 1.101, meaning that the odds of a student passing the test (Y=1) change by a factor of 1.101 with every unit increase on mastery. [Since we are multiplying the odds by 1.101 per unit increase on the predictor, this must mean our odds are increasing with each increase on the predictor.]
The odds ratio for interest is 1.043, meaning that the odds of a student passing the test (Y=1) change by a factor of 1.043 with every unit increase on interest. [Since we are multiplying the odds by 1.043 per unit increase on the predictor, this must mean our odds are increasing with each increase on the predictor.]
A little finesse:
Odds ratios (OR) less than 1 can be tricky to interpret. As such, you may wish to report only odds ratios that are greater than
1. For those odds ratios that are already greater than 1 (such as for mastery goals and interest in the table above) this is no
great issue. Just report the odds ratios.
For those that are less than 1, you can simply compute the inverse of the odds ratio and this will provide you with an odds
ratio where Y=0 is treated as the target group (when interpreting the predictor):
OR(Y=0) = 1/OR(Y=1)
We see that anxiety has an OR=.922. However, if we take its multiplicative inverse, we obtain OR=1.0846 [1/.922 = 1.0846].
We interpret this to mean that for each one unit increase on anxiety, the predicted odds of not passing the test changes by a
factor of 1.0846.
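As a quick arithmetic check (sketch):
display 1/.922    // = 1.0846, the odds ratio treating Y=0 as the target group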
Finally, some folks prefer to use odds ratios (OR) in a manner that communicates the percentage change in odds(Y=1) per
unit increase on a predictor.
This is easily accomplished using the following formula: 100%*(OR-1) ; see Pampel (2000, p. 23)
For example, for mastery goals OR=1.101. As such, we can say that for every unit increase on mastery goals, the predicted
odds(Y=1) change by 100% * (1.101 – 1) = 10.1%. Since the OR is > 1, this means the predicted odds are increasing by
10.1% per unit increase on mastery goals.
For anxiety, OR=.922. From this we can say that for every unit increase on anxiety, the predicted odds(Y=1) change by 100% * (.922 - 1) = -7.8%. Since the OR is < 1, this means the odds are decreasing by 7.8% per unit increase on anxiety.
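And for the percentage-change calculations themselves (sketch):
display 100*(1.101 - 1)    // approximately 10.1: odds increase 10.1% per unit of mastery
display 100*(.922 - 1)     // approximately -7.8: odds decrease 7.8% per unit of anxiety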
Predicted probabilities and group membership
Syntax written from command line
Using the command 'predict p' (or the drop-down menu), we can create a new variable (p), saved to the dataset, that contains the predicted probability of case i falling into the target group (i.e., Y=1). [For this dataset, this is the predicted probability of each student passing the test.] Moreover, the next set of commands generates predicted group membership based on the predicted probabilities of Y=1. The code creates a new variable 'prgp' representing the predicted group, with values of 0 and 1 assigned based on the probabilities in the 'p' variable generated from the syntax above.
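The referenced commands are not reproduced here, but a minimal sketch consistent with the description (assuming the conventional .5 classification cutoff) is:
predict p                                   // predicted pr(Y=1) for each student
generate prgp = (p >= .5) if !missing(p)    // predicted group: 1 = pass, 0 = not pass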
We see that, based on the model, the first student had a predicted probability of passing of 29.38%, whereas student 4 had a predicted probability of 53.307%. Student 1 was predicted to not pass (i.e., Y=0), whereas student 4 was predicted to pass (i.e., Y=1).
Identifying collinearity among predictors
Below is the syntax in the command line for generating collinearity diagnostics for our predictors.
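The exact command from the slide is not reproduced here. One common approach (a sketch, not necessarily the command used in the original) is to fit an OLS regression with the same predictors purely to obtain collinearity diagnostics, since collinearity depends only on the predictors:
regress pass anxiety mastery interest
estat vif    // reports VIF and 1/VIF (the tolerance) for each predictor
The user-written collin command, if installed, produces similar tolerance/VIF output.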
The Tolerance values in the first column are computed as 1 - R-square, where each predictor is regressed onto the remaining predictors.
For example, if I regress anxiety onto mastery and interest, the R-square value is .098. As you can see, anxiety is only weakly a linear function of the other two predictors. The tolerance value is 1 - .098 = .902.
Tolerance values closer to 0 signal greater problems, and values of .1 or .2 are often regarded as signifying more serious problems.
The Variance inflation factor (VIF) for each predictor is the inverse of the tolerance:
VIF = 1/tolerance
Various conventions for what constitutes a serious collinearity problem have been recommended in the literature, with a VIF of 10 often treated as indicating more substantial problems; other authors suggest cutoffs around 5, 6, or 7.
In general, we see no problems with multicollinearity among our predictors based on the Tolerance and VIF values.
For a more in-depth discussion on identifying collinearity in the context of multiple logistic regression, go to:
https://youtu.be/dvjogQ43xs4
Example 2: Analysis with categorical predictor included alongside the previous predictors
[Dataset variables for Example 2: pass (outcome), anxiety, mastery, interest, medinc, highinc.]
For this analysis, we will introduce ‘incomeLev’ as an additional predictor in our model. This ordered categorical variable is
coded 1=low income, 2=medium income, 3=high income.
Note: Categorical predictor variables are permissible in logistic regression so long as they have been recoded into binary
predictors (such as through the use of a dummy coding system). We can have Stata do this for us by including the i. prefix
before the variable name.
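A command-line sketch of this model using factor-variable notation (i.incomeLev has Stata create the dummy variables, with the lowest category, low income, as the default reference group):
logit pass anxiety mastery interest i.incomeLev
logit pass anxiety mastery interest i.incomeLev, or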
The LR chi-square indicates that our model represents a significant improvement in fit relative to an intercept-only model, χ²(5) = 46.66, p < .001. McFadden's pseudo R-square of .0806 indicates roughly an 8% proportional reduction in deviance relative to the intercept-only model.
The Hosmer and Lemeshow test is not
significant [χ²(8) = 6.83, p=.5551], suggesting
adequate fit.
The overall classification accuracy is still rather poor (at 65.2%).
The sensitivity of the model is reasonably high at 83.7%, whereas
the specificity is poor at 40%.
Now let’s look at the contributions of the individual predictors to the model…
The ‘incomeLev’ variable has been dummy coded for inclusion in the model using the following system:
Low income students were coded 0 on both ‘medium income’ and ‘high income’. Medium income students were coded 1
on ‘medium income’ and 0 on ‘high income’. High income students were coded 0 on ‘medium income’ and 1 on ‘high
income’.
The regression coefficient for 'medium income' is equal to the difference in predicted logits for the medium income and low income students. The regression coefficient for 'high income' equals the difference in predicted logits for the high income and low income students.
We see that both dummy variables are statistically significant in the model, with both p’s below .01.
The positive regression coefficient (b=.999) and OR > 1 (2.715) associated with the ‘medium income’ dummy variable
indicates that students in the medium income category are more likely to pass the test than students in the low
income category. The positive regression coefficient (b=1.283) and OR > 1 (3.607) associated with the ‘high income’
dummy variable also indicates that students in the high income category are more likely to pass the test than
students in the low income category.
Finally, formally speaking, the odds ratio for 'medium income' is interpreted as follows: 'The odds of a student in the medium income group passing (Y=1) are 2.715 times the odds of a student in the low income group passing.' The odds ratio for 'high income' is interpreted as follows: 'The odds of a student in the high income group passing (Y=1) are 3.607 times the odds of a student in the low income group passing.'
You will notice that the 'low income' category has no regression slope or odds ratio associated with it. The reason is that it serves as the reference group: all group information is contained in the two dummy variables, and attempting to add a dummy variable for the low income group into the model would yield a variable that is redundant with the others.
For more details on dummy coding concepts, see https://youtu.be/XGlbGaOsV9U (note: this presentation was
developed using SPSS)
References and suggested readings
Field, A. (2018). Discovering statistics using IBM SPSS statistics (5th ed.). Los Angeles, CA: Sage.
Fox, J. (1997). Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage.
Heck, R. H., Thomas, S. L., & Tabata, L. N. (2012). Multilevel modeling of categorical outcomes using IBM SPSS. New York: Routledge.
Lomax, R. G., & Hahs-Vaughn, D. L. (2012). An introduction to statistical concepts (3rd ed.). New York: Routledge.
Osborne, J. W. (2012). Logits and tigers and bears, oh my! A brief look at the simple math of logistic regression and how it can improve dissemination of results. Practical Assessment, Research, and Evaluation, 17(11). Downloaded on March 10, 2021 from https://pdfs.semanticscholar.org/caa3/c477fadbd9e51824d71d41e70898ff14b4af.pdf?_ga=2.121394440.1816740118.1616501380-125705265.1610722785
Osborne, J. W. (2015). Best practices in logistic regression. Los Angeles, CA: Sage.
Pampel, F. C. (2000). Logistic regression: A primer. Thousand Oaks, CA: Sage.
Pituch, K. A., & Stevens, J. P. (2016). Applied multivariate statistics for the social sciences. New York: Routledge.