SPSS Manual
1
and a different y-intercept. For example, when Asian = 1 the y-intercept is ($\beta_0 + \beta_3$). In this
model, setting all the indicator variables equal to zero represents White.
In the regression context a categorical variable is called a factor and its values are called
levels. Rule for categorical variables in regression: whenever a factor has k levels, it can be
included in the multiple regression model by using (k-1) indicator variables.
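The (k-1) rule can be sketched in a few lines of Python. This is only an illustration and not part of SPSS: the function name `indicator_code` and the four-level list are assumptions made for the example.

```python
def indicator_code(value, levels, reference):
    """Return the k-1 indicator values for `value`, omitting `reference`."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    # One 0/1 indicator per non-reference level.
    return {level: int(value == level) for level in levels if level != reference}


# Hypothetical factor with k = 4 levels; White is the reference level.
levels = ["White", "Black", "Asian", "Latino"]

print(indicator_code("Asian", levels, reference="White"))
print(indicator_code("White", levels, reference="White"))
```

A student in the reference level gets zeros on every indicator, which is why setting all indicators to zero represents White.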
BINARY LOGISTIC REGRESSION
As part of enrollment projections, retention is an important variable to
consider. We'll use our fake data set to illustrate the use of logistic regression models
and discriminant analysis to identify the variables that contribute to retention.
Institutional characteristics and mission drive the variables that contribute to retention.
The fake data set is purposely elementary to demonstrate the statistical technique.
Notice that in the data set the variable retention is already coded as a dummy variable
with 1 representing retention, and 0 representing did not retain.
One of the consequences of the linear regression model is that the dependent variable must
be a continuous variable because the condition of normal error terms implies that the
dependent variable has a normal distribution. Thus, we cannot use linear regression to
predict retention status. Logistic regression models a function of the probability of retaining
the student.
Let p = Pr(retention = 1), the probability that the value of retention is 1, that is, the
probability that the student comes back. The logistic regression approach models the
function
$$\ln\left(\frac{\Pr(\text{retention}=1)}{1-\Pr(\text{retention}=1)}\right)$$
as a linear function of, say, the first year GPA, financial aid,
social, gender and ethnicity as follows:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{FYGPA} + \beta_2\,\text{FINAID} + \beta_3\,\text{SOCIAL} + \beta_4\,\text{FEMALE} + \beta_5\,\text{BLACK} + \cdots + \beta_9\,\text{LATINO}$$
where $\beta_0, \ldots, \beta_9$ are unknown parameters to be estimated from the data.
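To make the link between the linear predictor and the probability concrete, here is a minimal Python sketch of the logit function and its inverse. The function names are assumptions for illustration; SPSS performs this conversion internally when it saves predicted probabilities.

```python
import math


def logit(p):
    """Log-odds ln(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Probability p that corresponds to a log-odds value eta."""
    return 1.0 / (1.0 + math.exp(-eta))


# A log-odds of 0 corresponds to a probability of one half.
print(inverse_logit(0.0))
# Converting to log-odds and back recovers the probability.
print(inverse_logit(logit(0.62)))
```

Whatever value the linear function takes, the inverse logit maps it back into the interval (0, 1), which is what makes the model suitable for a 0/1 outcome.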
Data. The dependent variable should be dichotomous. Independent variables can be interval
level or categorical; if categorical, they should be dummy or indicator coded (there is an
option in the procedure to recode categorical variables automatically).
Assumptions. Logistic regression does not rely on distributional assumptions in the same
sense that discriminant analysis does. However, your solution may be more stable if your
predictors have a multivariate normal distribution. Additionally, as with other forms of
regression, multicollinearity among the predictors can lead to biased estimates and inflated
standard errors. The procedure is most effective when group membership is a truly
categorical variable; if group membership is based on values of a continuous variable (for
example, "high GPA" versus "low GPA"), you should consider using linear regression to take
advantage of the richer information offered by the continuous variable itself.
SPSS for Institutional Researchers 45
Graphical Analysis
A plot of the binary dependent variable versus an independent variable is not worthwhile,
since there are only two distinct values for the dependent variable. Although no graphical
approach can be prescribed for all problems, it is occasionally useful to examine a scatterplot
of one of the independent variables versus another, with codes to indicate whether the
dependent variable is 0 or 1.
Graphs > Scatter > Matrix > Define
Move the variables to be plotted to the Matrix Variables box
Retention > Set Markers by
Add a title if desired
OK
[Figure: scatterplot matrix. Panel variables: Grade Point Average, Amount of Financial (label truncated), NSSE Q10a. Markers set by Retention Status: Retain; Did not return/Not retained.]
The plot above shows no clear differences in the association between financial aid and NSSE
Q10a (social), and the association between first year GPA and NSSE Q10a for returning and
non-returning students. Returning students seem to have slightly lower first year GPA and
lower amounts of financial aid than non-returning students.
Analyze > Regression > Binary Logistic
Retention > Dependent
Finaid, social (see note 1), First year Grade Point Average, ethnicity, gender > Covariate(s)
Method > Enter (forward conditional, forward LR, forward Wald,
backward conditional, backward LR, or backward Wald, as desired)
Categorical
    Ethnicity > Categorical Covariates
    Gender > Categorical Covariates
    Ethnicity > Reference Category: First > Change
    Gender > Reference Category: First > Change
    Continue
Save
    Predicted values: Probabilities, Group Membership
    Residuals: Standardized, Deviance
    Continue
Options
    Statistics and Plots: Classification plots, Hosmer-Lemeshow goodness of fit, CI for Exp(B)
    Continue
OK

Note 1: Social refers to Question 10a on the National Survey of Student Engagement: "Mark the box that best
represents the quality of your relationships with people at your institution: Other Students - Friendly,
supportive, sense of belonging." It is scored on a 7-point Likert scale, so higher scores are better.
Dependent Variable Encoding

Original Value                 Internal Value
Did not return/Not retained    0
Retain                         1
Categorical Variables Codings

                                                                  Parameter coding
Variable   Category                                    Frequency  (1)    (2)    (3)    (4)    (5)
Ethnicity  White                                       23         .000   .000   .000   .000   .000
           Black or African American                   17         1.000  .000   .000   .000   .000
           Asian                                       3          .000   1.000  .000   .000   .000
           American Indian or Alaska Native            3          .000   .000   1.000  .000   .000
           Native Hawaiian or Other Pacific Islander   3          .000   .000   .000   1.000  .000
           Hispanic or Latino                          1          .000   .000   .000   .000   1.000
Gender     Male                                        25         .000
           Female                                      25         1.000
Classification Table (a, b)

                                                      Predicted Retention Status
Step    Observed Retention Status      Did not return/Not retained   Retain   Percentage Correct
Step 0  Did not return/Not retained    0                             19       .0
        Retain                         0                             31       100.0
        Overall Percentage                                                    62.0

a. Constant is included in the model.
b. The cut value is .500
Variables in the Equation

                   B      S.E.   Wald    df   Sig.   Exp(B)
Step 0  Constant   .490   .291   2.823   1    .093   1.632

Variables not in the Equation (a)

                                 Score   df   Sig.
Step 0  Variables  FYGPA         5.148   1    .023
                   FINAID        1.311   1    .252
                   SOCIAL        .300    1    .584
                   GENDER(1)     2.122   1    .145
                   ETHNICIT      6.209   5    .286
                   ETHNICIT(1)   .080    1    .777
                   ETHNICIT(2)   1.113   1    .291
                   ETHNICIT(3)   1.113   1    .291
                   ETHNICIT(4)   1.113   1    .291
                   ETHNICIT(5)   1.665   1    .197

a. Residual Chi-Squares are not computed because of redundancies.
Omnibus Tests of Model Coefficients

                 Chi-square   df   Sig.
Step 1  Step     12.405       9    .191
        Block    12.405       9    .191
        Model    12.405       9    .191

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      54.001              .220                   .299

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      3.304        8    .914
Classification Table (a)

                                                      Predicted Retention Status
Step    Observed Retention Status      Did not return/Not retained   Retain   Percentage Correct
Step 1  Did not return/Not retained    10                            9        52.6
        Retain                         5                             26       83.9
        Overall Percentage                                                    72.0

a. The cut value is .500
Variables in the Equation

                                                                      95.0% C.I. for EXP(B)
                         B        S.E.     Wald    df   Sig.   Exp(B)     Lower   Upper
Step 1 (a)  FYGPA        -1.558   .879     3.146   1    .076   .210       .038    1.178
            FINAID       .000     .000     .222    1    .638   1.000      1.000   1.000
            SOCIAL       -.157    .182     .742    1    .389   .855       .598    1.221
            GENDER(1)    -.461    .821     .315    1    .575   .631       .126    3.154
            ETHNICIT                       3.428   5    .634
            ETHNICIT(1)  -.702    .795     .779    1    .377   .496       .104    2.355
            ETHNICIT(2)  -1.596   1.363    1.371   1    .242   .203       .014    2.932
            ETHNICIT(3)  -1.791   1.505    1.416   1    .234   .167       .009    3.185
            ETHNICIT(4)  -1.921   1.550    1.537   1    .215   .146       .007    3.053
            ETHNICIT(5)  -7.028   36.707   .037    1    .848   .001       .000    1.6E+28
            Constant     7.516    3.148    5.699   1    .017   1837.557

a. Variable(s) entered on step 1: FYGPA, FINAID, SOCIAL, GENDER, ETHNICIT.
[Figure: Observed Groups and Predicted Probabilities, Step number 1. Text histogram of FREQUENCY (vertical axis, 1 to 4) against predicted probability (horizontal axis, 0 to 1; group D below .5, group R above .5). Predicted Probability is of Membership for Retain. The Cut Value is .50. Symbols: D - Did not return/Not retained; R - Retain. Each Symbol Represents .25 Cases.]
Interpreting the output
Dependent variable Encoding. This table informs you of how the procedure handled the
dichotomous dependent variable, which helps you to interpret the values of the
parameter coefficients. Since Retain was coded as 1 the probabilities computed using the
model will correspond to the probability that the student will return.
Categorical variables codings. This table supplies information about how categorical
predictors were treated. In this case White was used as the reference category for
ethnicity and Male was used as the reference category for gender. The variable
Ethnicity(1) will be associated with black students, Ethnicity(2) with Asian students, and
so on.
Classification Table. The classification table helps you assess the performance of your
model by crosstabulating the observed response categories with the predicted response
categories. There are two classification tables in the output. The first classification table
(Step 0) corresponds to a model that does not include any independent (predictor)
variables. This model correctly classifies all the 31 returning students, but incorrectly
classifies the 19 non-returning students as being retained. The overall misclassification
rate is 38%. The second classification table (Step 1) when all the independent variables
are in the model shows an overall misclassification rate of 28%. Nine of the non-returning
students and five of the returning students were misclassified.
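The percentages quoted above can be reproduced from the counts in the two classification tables. The following Python sketch is illustrative only; the function name `classification_summary` and the list-of-lists layout are assumptions, not SPSS output.

```python
def classification_summary(table):
    """Summarize a 2x2 table where table[i][j] is the number of cases with
    observed class i and predicted class j (0 = did not return, 1 = retained)."""
    total = sum(sum(row) for row in table)
    correct = table[0][0] + table[1][1]
    overall = 100.0 * correct / total
    row_pct = [100.0 * table[i][i] / sum(table[i]) for i in range(2)]
    return {
        "row_pct_correct": [round(x, 1) for x in row_pct],
        "overall_pct_correct": round(overall, 1),
        "misclassification_pct": round(100.0 - overall, 1),
    }


step0 = [[0, 19], [0, 31]]   # constant-only model
step1 = [[10, 9], [5, 26]]   # model with all predictors

print(classification_summary(step0))
print(classification_summary(step1))
```

The off-diagonal entries of `step1` are the nine non-returning and five returning students that the full model misclassifies.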
Variables in the Equation. This table summarizes the roles of the parameters in the
model. In Step 0, there is only a constant in the model (an estimate of $\beta_0$). This estimate
is .49. Thus,
$$\ln\left(\frac{p}{1-p}\right) = 0.49 \quad\text{and}\quad \frac{p}{1-p} = e^{0.49} = 1.632.$$
The latter quotient is called the odds. The probability of a
student returning is 1.632 times the probability of a student not returning.
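As a quick check of this arithmetic, the sketch below converts the Step 0 constant into odds and then into a probability. It is a minimal Python illustration with assumed variable names. The resulting probability should match the observed retention rate of 31 out of 50 students, since the constant-only model simply reproduces the overall rate.

```python
import math

b0 = 0.49                    # Step 0 constant from the output
odds = math.exp(b0)          # odds of returning, about 1.632
p = odds / (1.0 + odds)      # probability of returning implied by the odds

print(round(odds, 3))
print(round(p, 3))
print(31 / 50)               # observed proportion retained
```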
For the model that includes all the variables we have
$$\ln\left(\frac{p}{1-p}\right) = 7.516 - 1.558\,\text{FYGPA} + .000\,\text{FINAID} - .157\,\text{SOCIAL} - .461\,\text{FEMALE} - .702\,\text{BLACK} - \cdots - 7.028\,\text{LATINO}$$
The expression
$$e^{(-1.558)(3.5-2.5)} = \frac{\dfrac{\Pr(\text{retention}=1 \mid \text{FYGPA}=3.5)}{1-\Pr(\text{retention}=1 \mid \text{FYGPA}=3.5)}}{\dfrac{\Pr(\text{retention}=1 \mid \text{FYGPA}=2.5)}{1-\Pr(\text{retention}=1 \mid \text{FYGPA}=2.5)}} = .210$$
compares the odds of retaining a
student with first year GPA of 3.5 to the odds of retaining a student with first year GPA of
2.5 when the values of all the other independent variables are the same for both
students. This quotient is called the odds ratio, and the model predicts that all else being
equal, the odds of returning for a student with a 3.5 first year GPA are .21 times the odds
of returning for a student with a 2.5 first year GPA, that is, the odds of retaining the
student with the lower first year GPA are higher. If the difference in first year GPA were
only .3 the odds ratio would be $e^{(-1.558)(.3)} = .63$.
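The odds ratios quoted above follow directly from the FYGPA coefficient. The short Python sketch below reproduces them; the helper name `odds_ratio` is an assumption for illustration. Because the coefficient is itself rounded to three decimals, the results agree with the SPSS table only to about two or three decimals.

```python
import math

B_FYGPA = -1.558   # estimated coefficient for first year GPA (from the output)


def odds_ratio(coef, delta):
    """Odds ratio for a change of `delta` units in a predictor with coefficient `coef`."""
    return math.exp(coef * delta)


print(odds_ratio(B_FYGPA, 3.5 - 2.5))   # GPA 3.5 versus 2.5, about .21
print(odds_ratio(B_FYGPA, 0.3))         # GPA difference of .3, about .63
```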
For each of the covariates (independent variables) the table provides the value of the
sample odds ratio, Exp(B), the significance of the estimated coefficient and a confidence interval
for the population odds ratio. A significance value < .05 indicates a potentially good
predictor of retention status.
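The Wald statistic, its significance, and the confidence interval for Exp(B) can all be derived from B and its standard error. The Python sketch below does this for the FYGPA row. It assumes the usual large-sample normal approximation and uses illustrative variable names. Small differences from the SPSS values are expected because B and S.E. are rounded in the printed table.

```python
import math

B, SE = -1.558, 0.879        # FYGPA row of Variables in the Equation
Z_95 = 1.959964              # standard normal quantile for a 95% interval

wald = (B / SE) ** 2                       # Wald chi-square with 1 df
sig = math.erfc(math.sqrt(wald / 2.0))     # upper-tail chi-square(1) probability
exp_b = math.exp(B)                        # sample odds ratio
lower = math.exp(B - Z_95 * SE)            # lower 95% limit for Exp(B)
upper = math.exp(B + Z_95 * SE)            # upper 95% limit for Exp(B)

print(round(wald, 3), round(sig, 3))
print(round(exp_b, 3), round(lower, 3), round(upper, 3))
```

The interval for FYGPA contains 1, which is consistent with its significance value of .076 being above .05.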
Variables not in the Equation. In Block 0, the variable with the highest score (if
significant) is included first by Forward stepwise regression methods. This information is
ignored if we use Enter as the method.
Omnibus Test for Model Coefficients. This is the analog of the ANOVA F test in linear
regression. This test compares the likelihood of the data as measured by the current
model to the likelihood of the data under the model containing only a constant term.
Large Chi-square values with a small significance value (< .05) indicate that the data are
better explained by the current model than by the constant-only model. In this
example, the chi-square value is 12.405 with a p-value = .191. The model that includes
first year GPA, financial aid, social, ethnicity and gender does not explain the data
significantly better than the constant-only model.
Model Summary. The -2 log-likelihood and pseudo R-square statistics are computed. When
the procedure is a backward, forward or stepwise selection these values are computed at
each step. The -2 log-likelihood is a measure of the likelihood of the data under the
current model; smaller values indicate a better fit. The Cox & Snell R-squared and the Nagelkerke adjusted R-squared are
descriptive measures of the fit of the model, similar in spirit to the R-squared
in linear regression. By the Nagelkerke measure, the model explains roughly 29.9% of the variation seen in retention status.
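The omnibus chi-square and both pseudo R-square statistics can be recovered from the -2 log-likelihood values. The Python sketch below is an illustration under stated assumptions. It computes the constant-only -2 log-likelihood from the 31 retained and 19 non-retained students, takes the Step 1 value (54.001) from the output, and applies the standard Cox & Snell and Nagelkerke formulas. Variable names are hypothetical.

```python
import math

n_retained, n_not_retained = 31, 19
n = n_retained + n_not_retained
p0 = n_retained / n

# -2 log-likelihood of the constant-only model (every case gets probability p0).
null_neg2ll = -2.0 * (n_retained * math.log(p0) + n_not_retained * math.log(1.0 - p0))
model_neg2ll = 54.001            # Step 1 value from the Model Summary table

# Omnibus (likelihood ratio) chi-square: the drop in -2 log-likelihood.
chi_square = null_neg2ll - model_neg2ll

# Pseudo R-square statistics.
cox_snell = 1.0 - math.exp(-chi_square / n)
nagelkerke = cox_snell / (1.0 - math.exp(-null_neg2ll / n))

print(round(chi_square, 3))      # compare with 12.405
print(round(cox_snell, 3))       # compare with .220
print(round(nagelkerke, 3))      # compare with .299
```

Nagelkerke's statistic rescales Cox & Snell's so that its maximum possible value is 1, which is why it is always the larger of the two.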
Hosmer and Lemeshow Test. This is a goodness-of-fit test of the null hypothesis that
the model adequately fits the data. If the chi-square value is small and the p-value for
this test is greater than .05, we fail to reject the null hypothesis and conclude that
there is no evidence of lack of fit. In this example the p-value is .914, so the fit is adequate.
Observed Groups and Predicted Probabilities. This is a visual display of predicted group
membership. A case is classified into group 1 if the predicted Pr(retention = 1) > .5. The 5
retained students misclassified into the "did not retain" group (predicted probability below .5)
are represented by the five sets of 4 Rs. Each set of four Rs represents one observation.
The Save command we issued created four additional variables that are stored in the data
worksheet. They are the retention probabilities (pre_3), the predicted group membership
(pgr_3), standardized residuals (zre_3) and deviance residuals (dev_3). Plots of the residuals
can be used to ascertain whether the logistic model fits the data well. When the fit is
adequate and the sample size is large, the standardized residuals will follow a standard
normal distribution. The deviance residuals will also be approximately normally distributed
when the model fits the data well. Normal quantile plots of these residuals can be plotted
for examination.
REFINING THE MODEL
In logistic regression we can perform several types of variable selection in order to streamline
the model. The choices are forward Wald, forward conditional, forward LR, backward Wald,
backward conditional and backward LR. The difference lies in the criterion used
to enter (forward) or remove (backward) variables from the model. Below are some highlights
of the Backward LR procedure applied to these data.
Omnibus Tests of Model Coefficients

                     Chi-square   df   Sig.
Step 1      Step     12.405       9    .191
            Block    12.405       9    .191
            Model    12.405       9    .191
Step 2 (a)  Step     -.223        1    .637
            Block    12.182       8    .143
            Model    12.182       4    .016
Step 3 (a)  Step     -.157        1    .692
            Block    12.025       7    .100
            Model    12.025       3    .007
Step 4 (a)  Step     -.833        1    .362
            Block    11.193       6    .083
            Model    11.193       2    .004
Step 5 (a)  Step     -5.691       5    .337
            Block    5.502        1    .019
            Model    5.502        1    .019

a. A negative Chi-squares value indicates that the Chi-squares value has decreased from the
previous step.
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      54.001              .220                   .299
2      54.224              .216                   .294
3      54.381              .214                   .291
4      55.214              .201                   .273
5      60.905              .104                   .142

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      3.304        8    .914
2      6.531        8    .588
3      10.290       8    .245
4      7.265        8    .508
5      4.514        7    .719
Variables not in the Equation (e)

                                           Score   df   Sig.
Step 2 (a)  Variables           FINAID       .224    1    .636
            Overall Statistics               .224    1    .636
Step 3 (b)  Variables           FINAID       .064    1    .800
                                GENDER(1)    .158    1    .691
Step 4 (c)  Variables           FINAID       .216    1    .642
                                SOCIAL       .828    1    .363
                                GENDER(1)    .047    1    .829
Step 5 (d)  Variables           FINAID       .343    1    .558
                                SOCIAL       .302    1    .583
                                GENDER(1)    1.187   1    .276
                                ETHNICIT     5.444   5    .364
                                ETHNICIT(1)  .019    1    .891
                                ETHNICIT(2)  .597    1    .440
                                ETHNICIT(3)  1.074   1    .300
                                ETHNICIT(4)  1.753   1    .185
                                ETHNICIT(5)  .810    1    .368

a. Variable(s) removed on step 2: FINAID.
b. Variable(s) removed on step 3: GENDER.
c. Variable(s) removed on step 4: SOCIAL.
d. Variable(s) removed on step 5: ETHNICIT.
e. Residual Chi-Squares are not computed because of redundancies.
Step Summary (a, b)

        Improvement                 Model
Step    Chi-square   df   Sig.      Chi-square   df   Sig.    Correct Class %   Variable
2       -.223        1    .637      12.182       8    .143    72.0%             OUT: FINAID
3       -.157        1    .692      12.025       7    .100    72.0%             OUT: GENDER
4       -.833        1    .362      11.193       6    .083    76.0%             OUT: SOCIAL
5       -5.691       5    .337      5.502        1    .019    70.0%             OUT: ETHNICIT

a. No more variables can be deleted from or added to the current model.
b. End block: 1
The procedure took 5 steps. The first variable removed was financial aid followed by gender,
social and ethnicity in that order. Notice that removal of the variable ethnicity that is coded
as five separate indicator variables results in the simultaneous removal of all 5 indicators.
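The improvement chi-square reported at each backward step is the change in the -2 log-likelihood caused by removing that variable. The Python sketch below checks this against the Model Summary values. It is illustrative only, and the list names are assumptions.

```python
# -2 log-likelihood at steps 1 through 5 (from the Model Summary table).
neg2ll = [54.001, 54.224, 54.381, 55.214, 60.905]
removed = ["FINAID", "GENDER", "SOCIAL", "ETHNICIT"]

# Removing a variable can only increase -2 log-likelihood, so each step's
# chi-square (previous minus current) is negative.
step_chi_square = [round(prev - cur, 3) for prev, cur in zip(neg2ll, neg2ll[1:])]

for name, chi in zip(removed, step_chi_square):
    print(f"OUT: {name:<9} improvement chi-square = {chi}")
```

The small magnitudes for FINAID, GENDER and SOCIAL show how little each contributed, while removing ETHNICIT costs more in total but is spread over 5 degrees of freedom.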
The final model is
$$\ln\left(\frac{p}{1-p}\right) = 6.38 - 1.73\,\text{FYGPA}$$