
Binary logistic regression using Stata 17 drop-down menus

May 2021

Mike Crowson, Ph.D.


Learning Sciences Program
University of Oklahoma
Link to dataset used in this presentation:

https://drive.google.com/file/d/1YisZj7IqObhPExjFxQ1XJLqZcPh1ylSA/view?usp=sharing

Link to YouTube video demo: https://youtu.be/PvEjbhnIFic


For our regression analysis, we will be using the logit command in Stata. As noted in the Stata manual (see
https://www.stata.com/manuals/rlogit.pdf), the dependent variable should be coded as 0 and 1. A value of 0 is recognized as a ‘failure’ (reflecting the non-target outcome), whereas a value of 1 is recognized as a ‘success’ (reflecting the target outcome). [In fact, any nonzero, nonmissing value will be recognized as ‘success’.] My recommendation is to check the coding of your dependent variable before performing your analysis. In the screenshot of a subset of the data above, the ‘Did not pass’ group is coded as 0 and the ‘Passed’ group is coded as 1 (even though what you see in the column are the group labels associated with the value codes of 0 and 1).
We are testing the probability of a student passing a test (coded 0=did not pass, 1=passed) as a function of three
continuous predictor variables (anxiety, mastery goals, and interest) and one factor variable (incomeLev).
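
A quick way to verify the coding before fitting the model is from the Command window (a minimal sketch; ‘pass’ is the dependent variable in this dataset):

* display the underlying 0/1 codes rather than the value labels
tabulate pass, nolabel

* or inspect the codes and their attached labels together
codebook pass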

[Path diagrams for the two examples: in Example 1, anxiety, mastery, and interest predict pass; in Example 2, the incomeLev dummies (medinc, highinc) are added as predictors.]
Example 1: Binary logistic regression with three continuous predictors

[Path diagram: anxiety, mastery, and interest predicting pass.]
Although not displayed, the predictors are correlated in the model.


[Steps 1 and 2: screenshots of the drop-down menu selections that open the logit dialog.]
Step 3: Choose whether you wish to display regression coefficients or
odds ratios (along with making any other selections). The default is odds
ratios. To select estimated regression coefficients, click the other button.

Note: for our analysis we will begin with the unstandardized regression
coefficients, and then move to odds ratios later. This merely reflects my
preferred way of looking at the results.
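
If you prefer typing commands to the menus, the same model can be fit directly from the Command window (a sketch using this dataset's variable names):

* unstandardized coefficients (log-odds metric)
logit pass anxiety mastery interest

* odds ratios instead (the logistic command is equivalent)
logit pass anxiety mastery interest, or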
The model chi-square tests whether the model represents a significant increment in fit relative to a null/baseline/intercept-only model. You can also think of it as a test of whether at least one regression slope is significantly different from zero.

The chi-square test result indicates that our model is a significant improvement in fit relative to an intercept-only model, χ²(3) = 36.74, p < .001.
Here, we have the log likelihood for the model (-271.22786), which is an indicator of the degree of lack of fit of the model. The more negative the LL, the worse the model fits the data; the less negative the LL, the better the fit of the model to the data. [Other programs, such as SPSS, provide a -2LL to create an index of the deviance from perfect fit, whereby the LL is rescaled so that 0 indicates perfect fit and increasingly positive values indicate worse fit.] By itself, this index usually is not very informative about what constitutes a poorly fitting model. However, it is used in the construction of several measures of overall model fit.

For example, we can compute the LR chi-square value (shown above) as χ² = 2(LL_full - LL_null), where LL_full is the log likelihood of the full model (shown above as -271.22786) and LL_null is the log likelihood of an intercept-only (null) model (although not shown, for this model it is -289.5974). Thus: χ² = 2[-271.22786 - (-289.5974)] = 36.73908. In effect, the chi-square value is the difference in model deviances (-2LL) for the null and full models, 579.19480 - 542.45572 (see Pituch & Stevens, 2016).
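
You can verify this arithmetic in Stata using the log likelihoods reported in the output:

display 2*(-271.22786 - (-289.5974))   // LR chi2 = 36.73908
display 579.19480 - 542.45572          // the same value via the two deviances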
The pseudo R-square value that is provided is analogous to the R-square value we are most familiar with in the context of linear regression. It can be computed from the log likelihoods for the full and intercept-only models as follows:

McFadden's R² = 1 - (LL_full / LL_null) = 1 - (-2LL_full / -2LL_null)

In the previous slide, we'd computed the -2LL (model deviance) for the full and null models as 542.45572 and 579.19480, respectively.

Thus, McFadden's R² is computed as: 1 - (542.45572 / 579.19480) = 1 - .93657 = .0634.

The formula for McFadden's R² shows you that the index represents the proportionate reduction in the lack of fit as a result of including your predictors (recall the -2LL's are rescaled to range from 0 to positive infinity, so that more positive deviance values represent worse fit). Equivalently, you can interpret McFadden's pseudo R-square as reflecting the proportionate improvement in fit.

Values between .2 and .4 may indicate ‘strong improvement in fit’ (Pituch & Stevens, 2016; citing McFadden, 1979).
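
The same check can be run directly:

display 1 - (-271.22786)/(-289.5974)   // McFadden's pseudo R2 = .0634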
The Hosmer & Lemeshow test provides a second global fit test, comparing the ‘estimated model to one that has perfect fit’ (Pituch & Stevens, 2016, p. 455). If this test is not significant, then you have evidence of an adequately specified model. If it is significant, then you have evidence that the model is misspecified, such as through omission of ‘nonlinear and/or interaction terms’ (p. 456). Here, we see the Hosmer and Lemeshow test is not statistically significant [χ²(8) = 2.11, p = .9776], indicating adequate fit of the model.
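
If you fit the model from the command line, the Hosmer and Lemeshow test is available as a postestimation command; group(10) requests the conventional ten groups (a sketch):

estat gof, group(10) table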
This table provides additional information that you may find useful in describing how well your model is fitting. Specifically, it presents information on the degree to which the observed outcomes are predicted by your model.

The overall percentage correct indicates the percentage of cases whose observed outcome was correctly predicted by the model. In this output, the overall percentage is 63.3%, computed as: 100%[(71+198) / (71+109+47+198)] = 100%[269/425] = 63.3%.
Two indices that you may wish to report on are the sensitivity and specificity of the model.

Sensitivity refers to the percentage of cases observed to fall in the target group (Y=1; e.g., observed as having passed the test) who were correctly predicted by the model to fall into that group (e.g., predicted to pass). In effect, it is an index of the model's ability to correctly identify cases that fall into the target group.

We can calculate the sensitivity of the model by examining the frequencies in the first column of the table. The sensitivity for the model is calculated as: 100%[198/(47+198)] = 100%[198/245] = 80.8%.
Specificity refers to the percentage of cases observed to fall into the non-target (or reference) category (e.g., observed as not having passed the test) who were correctly predicted by the model to fall into that group (e.g., predicted not to pass). In other words, it reflects the degree to which the model correctly identifies cases that do not fall into the target group.

The frequencies contained in the second column above can be used to compute the specificity of the model. The specificity for this model is calculated as: 100%[71/(71+109)] = 100%[71/180] = 39.4%.
Overall, the predictive accuracy rate was very modest at 63.3%. The model exhibits good sensitivity: among those persons who passed the test, 80.8% were correctly predicted to pass. The model exhibits poor specificity: among those persons who did not pass the test, only 39.4% were correctly predicted not to pass.
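
If you are working from the command line, the whole classification table, including the sensitivity, specificity, and the predictive values discussed next, can be reproduced with a single postestimation command (the default cutoff is .5):

estat classification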
An alternate way of looking at classification accuracy is to compute the positive and negative predictive values. These values represent the percentage of cases predicted by the model to fall into the target group (or the non-target group) that were actually observed to fall into that group.

The positive predictive value (PPV) can be used to address the question, ‘Among those cases that were predicted by the model to fall into the target group, what percentage were actually observed to fall into that group?’

We can calculate the PPV as follows: 100%[198/(198+109)] = 100%[198/307] = 64.5%.

In other words, among those students that were predicted by the model to pass the test, only 64.5% actually passed the test.
The negative predictive value (NPV) can be used to address the question, ‘Among those cases that were predicted by the model to NOT fall into the target group, what percentage were actually observed to NOT fall into that group?’

We can calculate the NPV as follows: 100%[71/(71+47)] = 100%[71/118] = 60.17%.

In other words, among those students that were predicted by the model to not pass the test, only 60.17% actually did not pass the test.
This portion of our output contains the unstandardized regression slopes, the associated significance tests, and confidence intervals for the regression coefficients.

The first column of results contains the unstandardized regression coefficients. The slope for each predictor is provided, along with the intercept (constant). The statistical significance of these parameter estimates is tested using the Wald Z test results.

The Wald Z is computed as: Z = b / SE(b).

As an example, the Z for the anxiety coefficient is computed as -.081/.032 ≈ -2.51 (Stata uses the unrounded estimates, so hand computation from the rounded values will differ slightly).


In logistic regression, we are predicting the probability of a case falling into a target group [e.g., pr(Y=1), passed test] as a function of the predictors in the model. Since probabilities are necessarily bounded at 0 and 1, attempting to model them directly as a function of the predictors using OLS regression would create serious conceptual and statistical problems (see Pampel, 2000, for a nice discussion).

In logistic regression, we are not modeling the pr(Y=1) directly as a linear function of the predictors. Rather, we use a mathematical transformation of probabilities into a new variable called a logit. This allows us to model pr(Y=1) (albeit as a transformed version of itself) as a linear function of the predictors. This linearization of the relationship between the predictors and the pr(Y=1) occurs via the use of the logit link function (see Heck et al., 2012).

Given logit(Y=1) = b0 + b1X1 + b2X2 + ... + bkXk, we see that the fitted values for the regression model are predicted logits(Y=1) for the cases in our data. This means that each unstandardized regression slope is interpreted most literally as the predicted change in the logit (i.e., in ln(odds(Y=1))) per unit increment on a given predictor, controlling for the other predictors in the model.
Nevertheless, if you want to think about the meaning of the unstandardized regression slopes more generally you can do
so in the following way:

a) With a positive slope, you can say that as scores on the predictor increase, so does the probability of falling into the target group (Y=1).
b) With a negative slope, you can say that as scores on the predictor increase, the probability of falling into the target group (Y=1) decreases.
c) A slope of zero indicates no systematic increase or decrease in the probability of a case falling into the target group (Y=1) with increasing values on the predictor.

**Just keep in mind that the regression slopes for your predictors are NOT directly interpreted as the amount of change in pr(Y=1) given a unit increment on the predictor.
The results given here indicate:

Anxiety is a negative and significant predictor of the probability of a student passing the test (b=-.081, s.e.=.032, Wald Z
= -2.51, p=.012).

Mastery goals is a positive and significant predictor of the probability of a student passing the test (b=.096, s.e.=.026,
Wald Z = 3.64, p<.001).

Although the slope for interest is positive, it is not statistically significantly different from 0 (b=.042, s.e.=.031, Wald Z = 1.35, p=.177).
Here, I am re-invoking the default presentation of odds
ratios. [Again, the sequencing of steps I am using is
merely my own preference.]
In addition to reporting on the
unstandardized regression slope, you likely
will want to report on the odds ratio. That
can be done using the ‘or’ option when
you run your analysis.
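
A convenient detail if you are typing commands: after estimation you can redisplay the most recent results in either metric without refitting the model:

logit, or   // replay the last logit results as odds ratios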

[Output: odds ratios and 95% confidence intervals for the odds ratios.]
Before continuing…review of a few more concepts

As noted previously, in logistic regression logits are a transformation of probabilities via the logit link function:

logit(Y=1) = ln[pr(Y=1) / (1 - pr(Y=1))]

In the equation above, you should notice that part of the transformation is that of pr(Y=1) into odds(Y=1). Odds are computed as a ratio of the probabilities of two mutually exclusive outcomes: e.g., odds(A) = pr(A) / pr(not A), or odds(Y=1) = pr(Y=1) / pr(Y=0). If we take the natural log of the odds(A), this will return the logit(A).
Before continuing…review of a few more concepts

Now, we can convert the logit(Y=1) for each case in our data back into the odds(Y=1) through exponentiation:

odds(Y=1) = e^logit(Y=1), where e (the base of natural logarithms) is raised to the power of the logit.

From there, we can return the pr(Y=1) from the odds(Y=1) using the following mathematical transformation:

pr(Y=1) = odds(Y=1) / [1 + odds(Y=1)]

At this point, I hope you see that using ‘probability’ and ‘odds’ interchangeably is an incorrect way of referring to these concepts, and that odds ratios (see upcoming slides) are not ratios of probabilities (as you might think of when talking about relative risk).
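
As a quick worked check, take student 1's predicted probability reported later in this presentation (p ≈ .2938):

display .2938/(1 - .2938)             // odds(Y=1) = .4160
display ln(.2938/(1 - .2938))         // logit(Y=1) = -.8770
display exp(-.877)/(1 + exp(-.877))   // recovers pr(Y=1) = .2938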
Now back to our output…

Something of note:

An unstandardized regression slope of 0 will be associated with an odds ratio of 1.


An unstandardized regression slope > 0 will be associated with an odds ratio > 1.
An unstandardized regression slope < 0 will be associated with an odds ratio < 1.
Each odds ratio in this table indicates the multiplicative change in the odds (of a case falling into the target group, or Y=1)
per unit increase on a given predictor, controlling for the others in the model.
If an odds ratio is 1, then it is indicating that there is no change in odds per unit increase on the predictor. [Logically, we can
reason that the probability of a case falling into the target group does not change depending on level of the predictor
variable.]

If an odds ratio is > 1, then it is indicating that the odds associated with target group membership are increasing with
increases on the predictor. [Logically, we can reason that the probability of a case falling into the target group is greater at
higher levels of the predictor variable.]

If an odds ratio is < 1, then it is indicating that the odds of target group membership are decreasing with increases on the
predictor. [Logically, we can reason that the probability of a case falling into the target group is lower at higher levels on the
predictor variable.]
You can also use the confidence intervals to test the odds ratios for statistical significance. The null hypothesis in these cases is that the odds ratio for a given predictor equals 1.

If 1 falls between the lower and upper bounds, then you cannot reject the null hypothesis that the population odds ratio equals 1.
If 1 falls outside the lower and upper bounds, then you can infer that the population odds ratio is not equal to 1.
If we wish to talk about the odds ratios in terms of their formal meaning, then we can interpret them as follows:

The odds ratio for anxiety is .922, meaning that the odds of a student passing the test (Y=1) change by a factor of .922 with every unit increase on anxiety. [Since we are multiplying the odds by .922 per unit increase on the predictor, this must mean our odds are decreasing with each increase on the predictor.]

The odds ratio for mastery is 1.101, meaning that the odds of a student passing the test (Y=1) increase by a factor of 1.101 with every unit increase on mastery. [Since we are multiplying the odds by 1.101 per unit increase on the predictor, this must mean our odds are increasing with each increase on the predictor.]

The odds ratio for interest is 1.043, meaning that the odds of a student passing the test (Y=1) increase by a factor of 1.043 with every unit increase on interest. [Since we are multiplying the odds by 1.043 per unit increase on the predictor, this must mean our odds are increasing with each increase on the predictor.]
A little finesse:

Odds ratios (OR) less than 1 can be tricky to interpret. As such, you may wish to report only odds ratios that are greater than 1. For those odds ratios that are already greater than 1 (such as for mastery goals and interest in the table above) this is no great issue. Just report the odds ratios.

For those that are less than 1, you can simply compute the inverse of the odds ratio, and this will provide you with an odds ratio where Y=0 is treated as the target group (when interpreting the predictor):

OR(Y=0) = 1 / OR(Y=1)

We see that anxiety has an OR = .922. However, if we take its multiplicative inverse, we obtain OR = 1.0846 [1/.922 = 1.0846]. We interpret this to mean that for each one-unit increase on anxiety, the predicted odds of not passing the test change by a factor of 1.0846.
Finally, some folks prefer to use odds ratios (OR) in a manner that communicates the percentage change in odds(Y=1) per unit increase on a predictor.

This is easily accomplished using the following formula: 100% * (OR - 1); see Pampel (2000, p. 23).

For example, for mastery goals OR = 1.101. As such, we can say that for every unit increase on mastery goals, the predicted odds(Y=1) change by 100% * (1.101 - 1) = 10.1%. Since the OR is > 1, this means the predicted odds are increasing by 10.1% per unit increase on mastery goals.

For anxiety, OR = .922. From this we can say that for every unit increase on anxiety, the predicted odds(Y=1) change by 100% * (.922 - 1) = -7.8%. Since the OR is < 1, this means the odds are decreasing by 7.8% per unit increase on anxiety.
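
Both conversions are one-line computations:

display 1/.922            // inverted OR for anxiety = 1.0846
display 100*(1.101 - 1)   // percentage change in odds for mastery = 10.1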
Predicted probabilities and group membership
Syntax written from command line

Using the command ‘predict p’ (or the drop-down menu), we can create a new variable (p), saved to the dataset, that is the predicted probability of case i falling into the target group (i.e., Y=1). [For this dataset, this is the predicted probability of each student passing the test.] Moreover, the next set of commands will generate predicted group membership based on the predicted probabilities of Y=1. The code creates a new variable ‘prgp’ representing the predicted group, with values of 0 and 1 assigned based on the probabilities in the ‘p’ variable generated by the syntax above.
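
The deck shows the exact recode syntax in a screenshot; a minimal sketch that accomplishes the same thing with a .5 cutoff is:

predict p                                  // predicted pr(Y=1) after logit
generate prgp = (p >= .5) if !missing(p)   // predicted group membership
list pass p prgp in 1/5                    // inspect the first few students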

We see that the first student had a model-based predicted probability of passing of 29.38%, whereas student 4 had a predicted probability of 53.307%. Student 1 was predicted to not pass (i.e., Y=0), whereas student 4 was predicted to pass (i.e., Y=1).
Identifying collinearity among predictors
Below is the syntax in the command line for generating collinearity diagnostics for our predictors.
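
(The deck shows this syntax as a screenshot. One common approach, sketched here: the user-written collin command, locatable via findit collin, reports both tolerance and VIF; alternatively, regress the outcome on the predictors as a linear probability model purely to obtain the diagnostics.)

collin anxiety mastery interest

* or, with built-in commands:
regress pass anxiety mastery interest
estat vif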
The Tolerance values in the first column are computed as 1 - R², where each predictor (k) is regressed onto the remaining predictors.

For example, if I regress anxiety onto mastery and interest, the R-square value is .098. As you can see, anxiety is largely not a linear function of the other two predictors. The tolerance value is 1 - .098 = .902.

Tolerance values closer to 0 signal greater problems, and values of .1 or .2 are often regarded as signifying more serious problems.
The variance inflation factor (VIF) for each predictor is the inverse of the tolerance:

VIF = 1 / tolerance

The VIF for anxiety is computed as: 1/.902 = 1.109.

Various conventions for what constitutes a serious collinearity problem have been recommended in the literature, with 10 often being treated as indicating more substantial problems; other authors suggest cutoffs of around 5 to 7.
In general, we see no problems with multicollinearity among our predictors based on the Tolerance and VIF values.

For a more in-depth discussion on identifying collinearity in the context of multiple logistic regression, go to:
https://youtu.be/dvjogQ43xs4
Example 2: Analysis with categorical predictor included alongside the previous predictors

[Path diagram: anxiety, mastery, interest, and the incomeLev dummies (medinc, highinc) predicting pass.]
For this analysis, we will introduce ‘incomeLev’ as an additional predictor in our model. This ordered categorical variable is
coded 1=low income, 2=medium income, 3=high income.

Note: Categorical predictor variables are permissible in logistic regression so long as they have been recoded into binary
predictors (such as through the use of a dummy coding system). We can have Stata do this for us by including the i. prefix
before the variable name.
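
From the Command window, the factor-variable version of the model is simply (a sketch):

logit pass anxiety mastery interest i.incomeLev       // coefficients
logit pass anxiety mastery interest i.incomeLev, or   // odds ratios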
The LR chi-square indicates that our model represents a significant improvement in fit relative to an intercept-only model,
χ²(5) = 46.66, p < .001. McFadden’s pseudo R-square of .0806 suggests the model fit is improved by approximately 8.06%
over the intercept-only model.
The Hosmer and Lemeshow test is not
significant [χ²(8) = 6.83, p=.5551], suggesting
adequate fit.
The overall classification accuracy is still rather poor (at 65.2%).
The sensitivity of the model is reasonably high at 83.7%, whereas
the specificity is poor at 40%.
Now let’s look at the contributions of the individual predictors to the model…

[Output: tables containing the regression coefficients and the odds ratios.]

We see again that anxiety and mastery goals are significant predictors of the probability of a student passing the test, whereas interest is unrelated to that probability (p=.259).

The ‘incomeLev’ variable has been dummy coded for inclusion in the model using the following system:
Low income students were coded 0 on both ‘medium income’ and ‘high income’. Medium income students were coded 1
on ‘medium income’ and 0 on ‘high income’. High income students were coded 0 on ‘medium income’ and 1 on ‘high
income’.

The regression coefficient for ‘medium income’ is equal to the difference in predicted logits for the medium income and low income students. The regression coefficient for ‘high income’ equals the difference in predicted logits for the high income and low income students.
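
If you prefer to see the dummy coding explicitly, this sketch creates the dummies by hand and fits an equivalent model; tabulate's generate() option names them inc1 (low), inc2 (medium), and inc3 (high):

tabulate incomeLev, generate(inc)
logit pass anxiety mastery interest inc2 inc3, or   // low income is the omitted reference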
We see that both dummy variables are statistically significant in the model, with both p’s below .01.

The positive regression coefficient (b=.999) and OR > 1 (2.715) associated with the ‘medium income’ dummy variable
indicates that students in the medium income category are more likely to pass the test than students in the low
income category. The positive regression coefficient (b=1.283) and OR > 1 (3.607) associated with the ‘high income’
dummy variable also indicates that students in the high income category are more likely to pass the test than
students in the low income category.
Finally, formally speaking, the odds ratio for ‘medium income’ is interpreted as follows: ‘The odds of a student in the medium income group passing (Y=1) are 2.715 times the odds of a student in the low income group passing.’ The odds ratio for ‘high income’ is interpreted as follows: ‘The odds of a student in the high income group passing (Y=1) are 3.607 times the odds of a student in the low income group passing.’
You will notice that this ‘variable’ has no regression slope or odds ratio associated with it. The reason is that all group
information is contained in the dummy variables. Attempting to add a dummy variable for the low group into the
model would yield a variable that is redundant with the others.

For more details on dummy coding concepts, see https://youtu.be/XGlbGaOsV9U (note: this presentation was
developed using SPSS)
References and suggested readings

Field, A. (2018). Discovering statistics using IBM SPSS statistics (5th ed.). Los Angeles, CA: Sage.

Fox, J. (1997). Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage.

Heck, R. H., Thomas, S. L., & Tabata, L. N. (2012). Multilevel modeling of categorical outcomes using IBM SPSS. New York, NY: Routledge.

Lomax, R. G., & Hahs-Vaughn, D. L. (2012). An introduction to statistical concepts (3rd ed.). New York, NY: Routledge.

Osborne, J. W. (2012). Logits and tigers and bears, oh my! A brief look at the simple math of logistic regression and how it can improve dissemination of results. Practical Assessment, Research, and Evaluation, 17(11). Downloaded on March 10, 2021 from https://pdfs.semanticscholar.org/caa3/c477fadbd9e51824d71d41e70898ff14b4af.pdf?_ga=2.121394440.1816740118.1616501380-125705265.1610722785

Osborne, J. W. (2015). Best practices in logistic regression. Los Angeles, CA: Sage.

Pampel, F. C. (2000). Logistic regression: A primer. Thousand Oaks, CA: Sage.

Pituch, K. A., & Stevens, J. P. (2016). Applied multivariate statistics for the social sciences. New York, NY: Routledge.
