Module 4 - Logistic Regression PDF
Objectives
Understand the principles and theory underlying logistic regression
Understand proportions, probabilities, odds, odds ratios, logits and exponents
Be able to implement multiple logistic regression analyses using SPSS and
accurately interpret the output
Understand the assumptions underlying logistic regression analyses and how
to test them
Appreciate the applications of logistic regression in educational research, and
think about how it may be useful in your own research
You can jump to specific pages using the contents list below. If you are new to this
module start at the overview and work through section by section using the 'Next' and
'Previous' buttons at the top and bottom of each page. Be sure to tackle the exercise
and the quiz to get a good understanding.
Contents
4.1 Overview
4.2
Quiz A
4.3
4.4
4.5
4.6
4.7
4.8
4.9 Assumptions
4.10
4.11
4.12
4.13
4.14 Model diagnostics
4.15
Quiz B
Exercise
4.1 Overview
What is Multiple Logistic Regression?
In the last two modules we have been concerned with analysis where the outcome
variable (sometimes called the dependent variable) is measured on a continuous
scale. However, many of the variables we meet in education and social science more
generally have just a few, maybe only two, categories. Frequently we have only a
dichotomous or binary outcome. For example this might be whether a student plans to
continue in full-time education after age 16 or not, whether they have identified
Special Educational Needs (SEN), whether they achieve a certain threshold of
educational achievement, whether they do or do not go to university, etc. Note that
here the two outcomes are mutually exclusive and one must occur. We usually code
such outcomes as 0 if the event does not occur (e.g. the student does not plan to
continue in FTE after the age of 16) and 1 if the event does occur (e.g. the student
does plan to continue in FTE after age 16).
This module first covers some basic descriptive methods for the analysis of binary
outcomes. This introduces some key concepts about percentages, proportions,
probabilities, odds and odds-ratios. We then show how variation in a binary response
can be modeled using regression methods to link the outcome to explanatory
variables. In analyzing such binary outcomes we are concerned with modeling the
probability of the event occurring given the level/s of one or more explanatory
(independent/predictor) variables. This module is quite difficult because there are
many new concepts in it. However if you persevere through the rationale below, you
will find (hopefully!) that the examples make sense of it all. Also, like all things, the
concepts and applications will grow familiar with use, so work through the examples
and take the quizzes.
This module continues to use the LSYPE dataset (LSYPE 15,000) introduced in
the previous module. We recommend that you retrieve it from the ESDS website:
playing around with the data will really help you to understand this stuff!
Figure 4.2.1: Aspiration to continue in full time education (FTE) after the age of
16 by gender: Cell counts and percentages
We have coded not aspiring to continue in FTE after age 16 as 0 and aspiring to do so
as 1. Although it is possible to code the variable with any values, employing the values
0 and 1 has advantages. The mean of the variable will equal the proportion of cases
with the value 1 and can therefore be interpreted as a probability. Thus we can see
that the percentage of all students who aspire to continue in FTE after age 16 is
81.6%. This is equivalent to saying that the probability of aspiring to continue in FTE in
our sample is 0.816.
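This equivalence is easy to verify. A minimal sketch, using a made-up five-student sample rather than the LSYPE data:

```python
# Hypothetical miniature sample: 1 = aspires to continue in FTE, 0 = does not
aspire = [1, 1, 1, 1, 0]

p = sum(aspire) / len(aspire)  # the mean of a 0/1 variable
print(p)  # equals the proportion coded 1, so it can be read as a probability
```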
The odds of aspiring to continue in FTE are therefore 0.816 / (1 - 0.816) = 4.43, i.e.
students are 4.43 times more likely to aspire to continue in FTE than not to aspire to
continue in FTE.
We don't actually have to calculate the odds directly from the numbers of students if
we know the proportion for whom the event occurs, since the odds of the event
occurring can be obtained directly from this proportion using the formula
odds = p/(1-p), where p is the probability of the event occurring:
The above are the unconditional odds, i.e. the odds in the sample as a whole.
However odds become really useful when we are interested in how some other
variable might affect our outcome. We consider here what the odds of aspiring to
remain in FTE are separately for boys and girls, i.e. conditional on gender. We have
seen the odds of the event can be gained directly from the proportion by the formula
odds=p/(1-p).
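The formula is a one-liner to compute; a quick sketch using the overall proportion quoted above:

```python
def odds(p):
    """Odds of an event from its probability: p / (1 - p)."""
    return p / (1 - p)

# Overall proportion aspiring to continue in FTE (from Figure 4.2.1)
print(round(odds(0.816), 2))  # 0.816 / 0.184, roughly 4.43
```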
These are the conditional odds, i.e. the odds depending on the condition of gender,
either boy or girl.
We can see the odds of girls aspiring to continue in FTE are higher than for boys. We
can in fact directly compare the odds for boys and the odds for girls by dividing one by
the other to give the Odds Ratio (OR). If the odds were the same for boys and for girls
then we would have an odds ratio of 1. If however the odds differ then the OR will
depart from 1. In our example the odds for girls are 6.53 and the odds for boys are
3.27, so the OR = 6.53 / 3.27 = 2.00, or roughly 2:1. This says that girls are twice as
likely as boys to aspire to continue in FTE.
Note that the way odds ratios are expressed depends on the baseline or comparison
category. For gender we have coded boys=0 and girls =1, so the boys are our natural
base group. However if we had taken girls as the base category, then the odds ratio
would be 3.27 / 6.53 = 0.50:1. This implies that boys are half as likely to aspire to
continue in FTE as girls. You will note that saying Girls are twice as likely to aspire as
boys is actually identical to saying boys are half as likely to aspire as girls. Both
figures say the same thing but just differ in terms of the base.
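The base-category point can be checked in a couple of lines, using the conditional odds from the text:

```python
odds_girls, odds_boys = 6.53, 3.27  # conditional odds reported above

or_girls_vs_boys = odds_girls / odds_boys  # boys as the base group
or_boys_vs_girls = odds_boys / odds_girls  # girls as the base group

print(round(or_girls_vs_boys, 2))  # roughly 2.0: girls twice as likely
print(round(or_boys_vs_girls, 2))  # roughly 0.5: boys half as likely
```

Swapping the base category simply inverts the OR: 0.5 = 1 / 2.0.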
Odds Ratios from 0 to just below 1 indicate the event is less likely to happen in the
comparison than in the base group, odds ratios of 1 indicate the event is exactly as
likely to occur in the two groups, while odds ratios from just above 1 to infinity indicate
the event is more likely to happen in the comparator than in the base group.
Extension D provides a table that shows the equivalence between ORs in the range 0
to 1 with those in the range 1 to infinity.
odds(boys) = 3.27 * 1 = 3.27
odds(girls) = 3.27 * 2.00 = 6.53
So another way of looking at this is that the odds for each gender can be expressed
as a constant (the odds for the base group) multiplied by a gender-specific
multiplicative factor (namely the OR).
However there are problems in using ORs directly in any modeling because they are
asymmetric. As we saw in our example above, an OR of 2.0 indicates the same
relative ratio as an OR of 0.50, an OR of 3.0 indicates the same relative ratio as an
OR of 0.33, an OR of 4.0 indicates the same relative ratio as an OR of 0.25 and so on.
This asymmetry is unappealing because ideally the odds for males would be the
opposite of the odds for females.
So if we take the log of each side of the equation we can then express the log odds
as:
Log [p/(1-p)] = constant + log (OR)
If the constant is labelled a, the log of the OR is labelled b, and the variable gender (x)
takes the value 0 for boys and 1 for girls, then:
Log [p/(1-p)] = a + bx
Note that taking the log of the odds has converted this from a multiplicative to an
additive relationship with the same form as the linear regression equations we have
discussed in the previous two modules (it is not essential, but if you want to
understand how logarithms do this it is explained in Extension E). So the log of the
odds can be expressed as an additive function of a + bx. This equation can be
generalised to include any number of explanatory variables:
Log [p/(1-p)] = a + b1x1+ b2x2 + b3x3 + ... + bnxn.
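Taking logs is what turns the multiplicative odds relationship into this additive one. A quick numeric check, using the gender example's figures (odds of 3.27 for the base group and an OR of 2.00) as illustrative inputs:

```python
import math

base_odds, OR = 3.27, 2.00  # odds for boys, girls-vs-boys odds ratio
a = math.log(base_odds)     # intercept on the log-odds scale
b = math.log(OR)            # coefficient for gender (x = 0 boys, 1 girls)

# Multiplication of odds becomes addition of log odds:
lhs = math.log(base_odds * OR)  # log odds for girls
rhs = a + b * 1                 # a + bx with x = 1
print(round(lhs, 3), round(rhs, 3))  # the two sides agree
```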
Output from a logistic regression of gender on educational aspiration
If we use SPSS to complete a logistic regression (more on this later) using the student
level data from which the summary Figure 4.2.1 was constructed, we get the logistic
regression output shown below (Figure 4.2.3).
Figure 4.2.3: Output from a logistic regression of gender on aspiration to
continue in FTE post 16
Let's explain what this output means. The B weights give the linear combination of the
explanatory variables that best predicts the log odds. So we can determine that the log
odds for each gender are:
Male: log odds = a = log(3.27), approximately 1.18
Female: log odds = a + b = log(6.53), approximately 1.88
The inverse of the log function is the exponential function, sometimes conveniently
also called the anti-logarithm (nice and logical!). So if we want to convert the log odds
back to odds we take the exponent of the log odds. So the odds for our example are:
Male: odds = exp(1.18), approximately 3.27
Female: odds = exp(1.88), approximately 6.53
The odds ratio is given in the SPSS output for the gender variable [indicated as
Exp(B)] showing that girls are twice as likely as boys to aspire to continue in FTE.
By simple algebra we can rearrange the formula odds = p/(1-p) to solve for
probabilities, i.e. p = odds/(1+odds):
Males: p = 3.27 / (1 + 3.27), approximately .77
Females: p = 6.53 / (1 + 6.53), approximately .87
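Putting the whole chain together in code (a sketch: the coefficients 1.18 and 0.69 are approximations derived from the odds quoted above, not values copied from the SPSS output):

```python
import math

a, b = 1.18, 0.69  # approximate constant and gender coefficient

for x, label in [(0, "male"), (1, "female")]:
    log_odds = a + b * x
    odds = math.exp(log_odds)  # anti-log: back from log odds to odds
    p = odds / (1 + odds)      # and on to a probability
    print(label, round(odds, 2), round(p, 2))
```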
These probabilities, odds and odds ratios - derived from the logistic regression model
- are identical to those calculated directly from Figure 4.2.1. This is because we have
just one explanatory variable (gender) and it has only two levels (girls and boys). This
is called a saturated model for which the expected counts and the observed counts
are identical. The logistic regression model will come into its own when we have an
explanatory variable with more than two values, or where we have multiple
explanatory variables. However what we hope this section has done is show you how
probabilities, odds, and odds ratios are all related, how we can model the proportions
in a binary outcome through a linear prediction of the log odds (or logits), and how
these can be converted back into odds ratios for easier interpretation.
Take the quiz to check you are comfortable with what you have learnt so far. If you are
not perturbed by maths and formulae why not check out Extension E for more about
logs and exponents.
We can think of the data in Figure 4.2.1 (Page 4.2) in two ways. One is to think of
them as two proportions, the proportion of students who aspire to continue in FTE in
two independent samples, a sample of boys and a sample of girls. The other way is to
think of the data as 13,825 observations, with the response always either 0 (does not
wish to continue in FTE) or 1 (wishes to continue in FTE). Thus our response or
outcome distribution is actually what is known as a binomial distribution since it is
made up of only two values, students either do not aspire (0) or do aspire (1) to
continue in FTE.
As another way to consider the logic of logistic regression, consistent with what we
have already described but coming at it from a different perspective, let's consider first
why we cannot model a binary outcome using the linear regression methods we
covered in modules 2 and 3. We will see that significant problems arise in trying to use
linear regression with binary outcomes, which is why logistic regression is needed.
Figure 4.3.1: A linear regression of age 11 test score against achieving five or
more GCSE grades A*-C including English and maths (fiveem)
The linear regression of age 11 score on fiveem gives the following regression
equation:
predicted probability = .454 + .032 * X (where X = age 11 score, which can range from -24 to 39).
The predicted values take the form of proportions or probabilities. Thus at the average
age 11 score (which was set to 0, see Extension A) the predicted probability is simply
the intercept or .454 (i.e. 45.4%). At an age 11 score 1 SD below the mean (X=-10)
the predicted probability = .454 + .032*-10 = .134, or 13.4%. At an age 11 score 1 SD
above the mean (X=10) the predicted probability = .454 + .032*10 = .774, or 77.4%.
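These three predictions are easy to reproduce, using the coefficients quoted above:

```python
intercept, slope = 0.454, 0.032  # linear regression coefficients from the text

for x in (-10, 0, 10):  # 1 SD below the mean, the mean, 1 SD above
    print(x, round(intercept + slope * x, 3))
# x = -10 gives .134, x = 0 gives .454, x = 10 gives .774, as in the text
```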
However there are two problems with linear regression that make it inappropriate to
use with binary outcomes. One problem is conceptual and the other statistical.
Let's deal with the statistical problem first. The problem is that a binary outcome
violates the assumption of normality and homoscedasticity inherent in linear
regression. Remember that linear regression assumes that most of the observed
values of the outcome variable will fall close to those predicted by the linear
regression equation and will have an approximately normal distribution (See Page
2.6). Yet with a binary outcome only two Y values exist, so there can only be two
residuals for any value of X: either 1 minus the predicted value (when Y=1) or 0 minus
the predicted value (when Y=0). The assumption of normality clearly becomes
nonsensical with a binary response: the distribution of residuals cannot be normal
when the distribution only has two values. A binary outcome also violates the
assumption of homoscedasticity, namely that
the variance of errors is constant at all levels of X. If you look at Figure 4.3.1 it is
apparent that near the lower and upper extremes of X, where the line comes close to
the floor of 0 and the ceiling of 1, the residuals will be relatively small, but near the
middle values of X the residuals will be relatively large. Thus there are good statistical
reasons for rejecting a linear regression model for binary outcomes.
The second problem is conceptual. Probabilities and proportions are different from
continuous outcomes because they are bounded by a minimum of 0 and a maximum
of 1, and by definition probabilities and proportions cannot exceed these limits. Yet the
linear regression line can extend upwards beyond 1 for large values of X and
downwards below 0 for small values of X. Look again at Figure 4.3.1. We see that
students with an age 11 score below -15 are actually predicted to have a less than
zero (<0) probability of achieving fiveem. Equally problematic are those with an age
11 score above 18 who are predicted to have a probability of achieving fiveem greater
than one (>1). These values simply make no sense.
We could attempt a solution to the boundary problem by assuming that any predicted
values of Y>1 should be truncated to the maximum of 1. The regression line would be
straight to the value of 1, but any increase in X above this point would have no
influence on the predicted outcome. Similarly any predicted values of Y<0 could be
truncated to 0 so any decrease in X below this point would have no influence on the
predicted outcome. However there is another functional form of the relationship
between X and a binary Y that makes more theoretical sense than this truncated
linearity. Stay tuned
Figure 4.4.1: Probability of 5+ GCSE A*-C including English & maths by age 11
test score
We can see that the relationship between age 11 score and fiveem actually takes the
form of an S shaped curve (a sigmoid). In fact whenever we have a binary outcome
(and are thus interested in modeling proportions) the sigmoid or S shaped curve is a
better function than a linear relationship. Remember that in linear regression a one
unit increase in X is assumed to have the same impact on Y wherever it occurs in the
distribution of X. However the S shape curve represents a nonlinear relationship
between X and Y. While the relationship is approximately linear between probabilities
of 0.2 and 0.8, the curve levels off as it approaches the ceiling of 1 and the floor of 0.
The effect of a unit change in age 11 score on the predicted probability is relatively
small near the floor and near the ceiling compared to the middle. Thus a change of 2
or 3 points in age 11 score has quite a substantial impact on the probability of
achieving fiveem around the middle of the distribution, but much larger changes in age
11 score are needed to effect the same change in predicted probabilities at the
extremes. Conceptually the S-shaped curve makes better sense than the straight line
and is far better at dealing with probabilities.
The logistic function transforms the log odds to express them as predicted
probabilities. First, it applies the reverse of the log (called the exponential or antilogarithm) to both sides of the equation, eliminating the log on the left hand side, so
the odds can be expressed as:
p/(1-p) = Exp(a+bx).
Second, the formula can be rearranged by simple algebra¹ to solve for the value p.
p = Exp(a+bx) / [ 1 + Exp(a+bx)]
So the logistic function transforms the log odds into predicted probabilities. Figure
4.4.2 shows the relationship between the log odds (or logit) of an event occurring and
the probabilities of the event as created by the logistic function. This function gives the
distinct S shaped curve.
¹ You will have to take my word on this, but a formula p/(1-p) = x can be rearranged to p = x/(1+x).
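The logistic function and the equivalence of the two forms can be sketched as follows (the a, b and x values are illustrative):

```python
import math

def logistic(a, b, x):
    """Predicted probability: exp(a + bx) / (1 + exp(a + bx))."""
    z = math.exp(a + b * x)
    return z / (1 + z)

a, b, x = -0.337, 0.235, 5  # illustrative coefficients (see Page 4.5)
p = logistic(a, b, x)

# Rearranging back: p / (1 - p) reproduces exp(a + bx), i.e. the odds
print(round(p / (1 - p), 4), round(math.exp(a + b * x), 4))
```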
Look back to Figure 4.4.1 where the blue bars shows the actual proportion of
students achieving fiveem for each age 11 test score. We have superimposed over
the actual figures a red line that shows the predicted probabilities of achieving fiveem
as modeled from a logistic regression using age 11 test score as the explanatory
variable. Comparing these predicted probabilities (red line) to the actual probability of
achieving fiveem (blue bars) we can see that the modeled probabilities fit the actual
data extremely well.
A unit change in X produces a much smaller change in the probability when we are at
the ceiling (or the floor) than at the middle of the curve. Let's look at specific figures
using Figure 4.4.3, which shows the relationship between probabilities, odds and log
odds.
The log odds retain the non-linear relationship with probability at the ceiling, since a
change in p from .5 to .6 is associated with an increase in log odds from 0 to 0.4, while
a change in probability from .8 to .9 is associated with an increase in log odds from
1.39 to 2.20 (a change of 0.8). They also reflect the non-linear relationship at the floor,
since a decrease in probability from .5 to .4 is associated with a decrease in log odds
from 0 to -0.4, while a decrease in probability from .2 to .1 is associated with a
decrease in log odds from -1.39 to -2.20 (a change of -0.8) (see Figure 4.4.3). Also, as
we discussed on Page 4.2, log odds have the advantage that they are symmetrical
around 0: a probability of 0.8 that an event will occur has log odds of 1.39, while a
probability of 0.2 that it will occur has log odds of -1.39.
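A couple of lines confirm this symmetry:

```python
import math

def logit(p):
    """Log odds (logit) of a probability p."""
    return math.log(p / (1 - p))

print(round(logit(0.8), 2))  # 1.39
print(round(logit(0.2), 2))  # -1.39: the mirror image of logit(0.8)
print(logit(0.5))            # 0.0: even odds sit exactly at zero
```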
So now rather than log odds or logits, which people are not happy talking about, we
can talk about odds and probabilities, which people are happier talking about (at least
relatively!).
The B coefficients describe the logistic regression equation using age 11 score to
predict the log odds of achieving fiveem, thus the logistic equation is:
log [p/(1-p)] = -.337 + .235 * age 11 score.
Figure 4.5.2 lets us visualize the equation. The left hand chart shows the linear
relationship between age 11 score and the log odds of achieving fiveem. This line has
an intercept of -.337 and a slope of .235 and is clearly linear. However we can use the
logistic function to transform the log odds to predicted probabilities, which are shown
in the right hand chart. Looking back to Figure 4.4.1 on Page 4.4 we see how well
these predicted probabilities match the actual data on the proportion of pupils
achieving fiveem at each age 11 score.
Figure 4.5.2: Relationship between age 11 score and (a) the log odds of
achieving 5+ GCSE A*-C including English & maths (b) the probability of
achieving 5+ GCSE A*-C including English and maths.
The logistic regression equation indicates that a one unit increase in age 11 test score
is associated with a .235 increase in the log odds of achieving fiveem. Taking the
exponent of the log odds, indicated in the output as Exp(B), gives the Odds Ratio,
which shows that a one unit increase in age 11 test score increases the odds of
achieving fiveem by a multiplicative factor of 1.27. Various procedures also exist to
calculate the effects of a unit change in the b on the probability of Y occurring.
However the effect on probabilities depends on the point of the logistic curve at which
the effect is calculated (e.g. a one unit change in age 11 score from -10 to -9 would
give a different change in the predicted probability than a one unit change in age 11
score from 0 to 1). This is why we typically stick to ORs as the main way of
interpreting the logistic regression results. (For more detail on interpreting the age 11
coefficient see Pages 4.10 and 4.12).
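The dependence of the probability change on where you sit on the curve can be seen directly, using the coefficients from the equation above:

```python
import math

a, b = -0.337, 0.235  # logistic regression coefficients from the text

# Odds ratio per unit of age 11 score (1.26 here; the output's 1.27
# comes from the unrounded coefficient)
print(round(math.exp(b), 2))

def prob(x):
    """Predicted probability of fiveem at age 11 score x."""
    z = math.exp(a + b * x)
    return z / (1 + z)

# The same one-unit change moves the probability by different amounts:
print(round(prob(1) - prob(0), 3))     # near the middle of the curve
print(round(prob(-9) - prob(-10), 3))  # near the floor: a far smaller shift
```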
Summary
So in summary we have seen that when attempting to predict probabilities (which we
are doing when we model binary outcomes) linear regression is inappropriate, both for
statistical and conceptual reasons. With binary outcomes the form of the relationship
between an explanatory variable X and the probability of Y is better modeled by an
S-shaped curve. While the relationship between X and the probability of Y is non-linear
(it is in fact curvilinear), and therefore cannot be modeled directly as a linear function
of our explanatory variables, there can be a linear and additive combination of our
explanatory variables that predict the log odds, which are not restricted by the floor of
0 and ceiling of 1 inherent in probabilities. These predicted log odds can be converted
back to odds (by taking the exponential) and to predicted probabilities using the
logistic function.
Estimation proceeds iteratively, with the software refining the coefficient estimates
(usually about half a dozen times) until convergence is reached (that is, until the
improvement in the LL does not differ significantly from zero).
The deviance has little intuitive meaning because it depends on the sample size and
the number of parameters in the model as well as on the goodness of fit. We therefore
need a standard to help us evaluate its relative size. One way to interpret the size of
the deviance is to compare the value for our model against a baseline model. In
linear regression we have seen how SPSS performs an ANOVA to test whether or not
the model is better at predicting the outcome than simply using the mean of the
outcome. The change in the -2LL statistic can be used to do something similar: to test
whether the model is significantly more accurate than simply always guessing that the
outcome will be the more common of the two categories. We use this as the baseline
because in the absence of any explanatory variables the best guess will be the
category with the largest number of cases.
Let's clarify this with our fiveem example. In our sample 46.3% of students achieve
fiveem while 53.7% do not. The probability of picking at random a student who does
not achieve the fiveem threshold is therefore slightly higher than the probability of
picking a student who does. If you had to pick one student at random and guess
whether they would achieve fiveem or not, what would you guess? Assuming you
have no other information about them, it would be most logical to guess that they
would not achieve fiveem simply because a slight majority do not. This is the
baseline model which we can test our later models against. This is also the logistic
model when only the constant is included. If we then add explanatory variables to the
model we can compute the improvement as follows:
χ² = [-2LL (baseline)] - [-2LL (new)]
with degrees of freedom equal to the difference in the number of parameters (k)
between the two models.
If our new model explains the data better than the baseline model there should be a
significant reduction in the deviance (-2LL) which can be tested against the chi-square
distribution to give a p value. Don't worry - SPSS will do this for you! However if you
would like to learn more about the process you can go to Extension F.
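The mechanics of the comparison are just a subtraction and a chi-square lookup. A sketch with made-up deviance values (not figures from any output in this module):

```python
# Hypothetical deviances; -2LL falls as useful predictors are added
deviance_baseline = 19000.0  # constant-only model (made up)
deviance_new = 18850.0       # after adding one explanatory variable (made up)

chi_sq = deviance_baseline - deviance_new  # change in -2LL
df = 1                                     # one extra parameter estimated

# 3.84 is the 5% critical value of chi-square with 1 df
print(chi_sq, chi_sq > 3.84)  # a significant improvement in fit
```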
The deviance statistic is useful for more than just comparing the model to the baseline
- you can also compare different variations of your model to see if adding or removing
certain explanatory variables will improve its predictive power (Figure 4.6.1)! If the
deviance (-2LL) is decreasing to a statistically significant degree with each set of
explanatory variables added to the model then it is improving at accurately predicting
the outcome for each case.
[Figure 4.6.1: blocks of explanatory variables added to the model in sequence: prior attainment; attitude to school; SES and aspirations; parental expectations.]
Of course this will be true only if our additional explanatory variables actually add
significantly to the prediction of the outcome! As in linear regression, we want to know
not only how well the model overall fits the data, but also the individual contributions of
the explanatory variables. SPSS will calculate standard errors and significance values
for all variables added to our model, so we can judge how much they have added to
the prediction of the outcome.
As you can see it is still possible to group the explanatory variables in blocks and to
enter these blocks into the model in order of importance. Thus the above screen shot
shows we are at 'Block 1 of 1', but we can use the 'Next' button to set up a second
block if we want to. The 'Enter' option should also be familiar - when selected, all
explanatory variables (here labeled 'covariates' by SPSS, just to add an extra little
challenge!) in the specific block are forced into the model simultaneously.
The main difference for logistic regression is that the automated stepwise entry
methods are different. Once again the forward and backward methods are present.
They differ in how they construct the regression model, with the forward method
adding explanatory variables to a basic model (which includes only the constant, B0)
and the backwards method removing explanatory variables from the full model (one
including all the specified explanatory variables). SPSS makes these decisions based
on whether the explanatory variables meet certain criteria. You can choose three
different types of criteria for both forward and backward stepwise entry methods:
Conditional, LR and Wald. LR stands for Likelihood Ratio which is considered the
criterion least prone to error.
We haven't gone into too much detail here, partly because stepwise methods confuse
us but mainly because they are not generally recommended. They take important
decisions away from the researcher and base them on mathematical criteria rather
than sound theoretical logic. Stepwise methods are only really recommended if you
are developing a theory from scratch and have no empirical evidence or sensible
theories about which explanatory variables are most important. Most of the time we
have some idea about which predictors are important and the relative importance of
each one, which allows us to specify the entry method for the regression analysis
ourselves.
4.9 Assumptions
You will find that the assumptions for logistic regression are very similar to the
assumptions for linear regression. If you need a recap, rather than boring you by
repeating ourselves like statistically obsessed parrots (the worst kind of parrot) we
direct you to our multiple regression assumptions on Page 3.3. However, there are
still three key assumptions which you should be aware of:
Linearity (sort of...): For linear regression the assumption is that the outcome variable
has a linear relationship with the explanatory variables, but for logistic regression this
is not possible because the outcome is binary. The assumption of linearity in logistic
regression is that any explanatory variables have a linear relationship with the logit of
the outcome variable. 'What are they on about now?' we imagine you're sighing. If the
relationship between the log odds of the outcome occurring and each of the
explanatory variables is not linear then our model will not be accurate. We'll discuss
how to evaluate this in the context of SPSS over the coming pages, but the best way
to check that the model you are creating is sensible is by looking at the model fit
statistics and pseudo R2. If you are struggling with the concept of logits and log odds
you can revise Pages 4.2 and 4.4 of this module.
Where your data have a hierarchical structure (e.g. students clustered within schools)
you can control for clustering through the use of multilevel regression models (also called
hierarchical linear models, mixed models, random effects or variance component
models) which explicitly recognize the hierarchical structure that may be present in
your data. Sounds complicated, right? It definitely can be and these issues are more
complicated than we need here where we are focusing on understanding the
essentials of logistic regression. However if you feel you want to develop these skills
we have an excellent sister website provided by another NCRM supported node called
LEMMA which explicitly provides training on using multilevel modeling including for
logistic regression. We also know of some good introductory texts on multilevel
modelling and you can find all of this among our Resources.
household, but the relationship does not appear strong enough (Pearson's r = .42) to
be considered a problem. Usually values of r = .8 or more are cause for concern. As
before the Variance Inflation Factor (VIF) and tolerance statistics can be used to help
you verify that multicollinearity is not a problem (see Page 3.3).
Descriptive statistics
As with all research we should not run straight to the statistical analyses, even though
this might be the sexier bit because it is so clever! The starting point should always be
simple descriptive statistics so we better understand our data before we engage with
the more complex stuff. So what is the pattern of association between our key
variables of ethnicity, SEC and gender and fiveem?
Remember, as we said on Page 4.2, that the advantage of coding our binary
response as 0 and 1 is that the mean will then indicate the proportion achieving our
outcome (the value coded 1). We can therefore just ask for simple means of fiveem by
our independent variables (Figure 4.10.1).
Note: You will remember from module 2 that LSYPE was unable to code the SEC of
the head of household for quite a large proportion (18%) of LSYPE students. These
cases are coded 0 which has been defined as the missing value for SECshort.
However, we do not want to lose this many cases from the analysis. To include these
cases we have redefined the missing values property for SECshort to no missing
values. This means that those cases where SEC is not known (0) are included in the
above table. We will explain how to do this on Page 4.13.
We see that the proportion of White British students achieving fiveem is 48%. The
proportion is substantially higher among Indian students (59%) and substantially lower
among Black Caribbean students (33%), with the other ethnic groups falling in
between these two extremes. There is also a substantial association between SEC
and fiveem. While 29% of students from low SEC homes achieve fiveem, this rises to
45% for students from middle SEC homes and 66% for students from high SEC
homes. Finally there is also a difference related to gender, with 43% of boys achieving
fiveem compared to 51% of girls.
In the table you will see that we have also calculated and included the odds and the
odds ratios (OR) as described on Page 4.2. You will remember the odds are
calculated as [p/(1-p)] and represent the odds of achieving fiveem relative to the
proportion not achieving fiveem. We did this in Microsoft EXCEL, but you could
equally easily use a calculator. The OR compares the odds of success for a particular
group to a base category for that variable. For the variable ethnicity we have selected
White British as the base category. From the OR we can therefore say that Indian
students are 1.58 times more likely than White British students to achieve fiveem.
From Page 4.7 you will also remember that we can express this in percentage terms
((OR - 1) * 100), so Indian students are 58% more likely to achieve fiveem than White
British students. Conversely for Black Caribbean students the OR is 0.53, so Black
Caribbean students are about half as likely as White British students to achieve
fiveem. In percentage terms ((1 - OR) * 100) they are 47% less likely to achieve fiveem.
The ORs for SEC and for gender were calculated in the same way.
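Turning an OR into a "percent more/less likely" statement is a one-liner either way:

```python
def pct_vs_base(odds_ratio):
    """Express an odds ratio as percent more/less likely than the base group."""
    if odds_ratio >= 1:
        return f"{round((odds_ratio - 1) * 100)}% more likely"
    return f"{round((1 - odds_ratio) * 100)}% less likely"

print(pct_vs_base(1.58))  # Indian vs White British: "58% more likely"
print(pct_vs_base(0.53))  # Black Caribbean vs White British: "47% less likely"
```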
To do this we will need to run a logistic regression which will attempt to predict the
outcome fiveem based on a students ethnic group, SEC and gender.
Take the following route through SPSS: Analyse> Regression > Binary Logistic
The logistic regression pop-up box will appear and allow you to input the variables as
you see fit and also to activate certain optional features. First of all we should tell
SPSS which variables we want to examine. Our outcome measure is whether or not
the student achieves five or more A*-Cs (including Maths and English) and is coded
as 0 for no and 1 for yes. This variable is labelled fiveem and should be moved in to
the Dependent box.
Any explanatory variables need to be placed in what is named the covariates box. If
the explanatory variable is continuous it can be dropped in to this box as normal and
SPSS can be trusted to add it to the model. However, the process is slightly more
demanding for categorical variables such as the three we wish to add, because we
need to tell SPSS to set up dummy variables based on a specific baseline category
(we do not need to create the dummies ourselves this time, hooray!).
Let's run through this process. To start with, move ethnic, SEC and Gender into the
covariates box. Once they are there we need to define them as categorical
variables. To do this we need to click the button marked Categorical (a rare moment
of simplicity from our dear friend SPSS) to open a submenu. You need to move all of
the explanatory variables that are categorical from the left hand list (Covariates) to the
right hand window; in this case we need to move all of them!
The next step is to tell SPSS which category is the reference (or baseline) category for
each variable. To do this we must click on each in turn and use the controls on the
bottom right of the menu which are marked Change Contrast. The first thing to note
is the little drop down menu which is set to Indicator as a default. This allows you to
alter how categories within variables are compared in a number of ways (that you may
or may not be pleased to hear are beyond the scope of this module). For our purposes
we can stick with the default of indicator, which essentially creates dummy variables
for each category to compare against a specified reference category, a process
which you are probably getting familiar with by now (if not, head to Page 3.6).
All we need to do then is tell SPSS whether the first or last category should be used
as the reference and then click Change to finalize the setting. For our Ethnic variable
the first category is 0 White-British (the category with the highest number of
participants) so, as before, we will use this as the reference category. Change the
selection to First and click Change. For the Gender variable we only have two
categories and could use either male (0) or female (1) as the reference. Previously
we have used male as the reference so we will stick with this (once again, change the
selection to First and click Change). Finally, for Socio Economic Class (sec) we will
use the least affluent class as the reference (Never worked/long term unemployed, coded 8). This time we will use the Last option, given that the SEC categories are coded
such that the least affluent one is assigned the highest value code. Remember to click
Change! You will see that your selections have appeared in brackets next to each
variable and you can click Continue to close the submenu.
Notice that on the main Logistic Regression menu you can change the option for
which method you use with a drop down menu below the covariates box. As we are
entering all three explanatory variables together as one block you can leave this as
Enter. You will also notice that our explanatory variables (Covariates) now have Cat
printed next to them in brackets. This simply means that they have been defined as
categorical variables, not that they have suddenly become feline (that would just be
silly).
If you recall, the save options actually create new variables for your data set. We can
ask SPSS to calculate four additional variables for us:
Predicted probabilities: This creates a new variable that tells us, for each case, the
predicted probability that the outcome will occur (that fiveem will be achieved) based
on the model.
Predicted Group Membership: This new variable estimates the outcome for each
participant based on their predicted probability. If the predicted probability is >0.5 then
they are predicted to achieve the outcome; if it is <0.5 they are predicted not to achieve
the outcome. This 0.5 cut-point can be changed, but it is sensible to leave it at the
default. The predicted classification is useful for comparison with the actual outcome!
Residuals (standardized): This provides the residual for each participant (in terms
of standard deviations for ease of interpretation). This shows us the difference
between the actual outcome (0 or 1) and the predicted probability of the outcome and
is therefore a useful measure of error.
Cook's distance: We've come across this in our travels before. This generates a statistic
called Cook's distance for each participant, which is useful for spotting cases which
unduly influence the model (a value greater than 1 usually warrants further
investigation).
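To make the first three of these saved quantities concrete, here is a small sketch (in Python, with made-up numbers; SPSS computes all of this for you) of how they relate for a single case. The standardized residual here is computed in its Pearson form, the raw error divided by its binomial standard deviation:

```python
import math

p_hat = 0.73    # hypothetical predicted probability of achieving fiveem
actual = 0      # this student did not in fact achieve fiveem

# Predicted group membership: 1 if the predicted probability exceeds the .5 cut-point
predicted_group = 1 if p_hat > 0.5 else 0

# Standardized (Pearson) residual: raw error divided by sqrt(p * (1-p))
residual = actual - p_hat
std_residual = residual / math.sqrt(p_hat * (1 - p_hat))

print(predicted_group)         # a misclassified case, since the actual outcome is 0
print(round(std_residual, 2))  # a large negative residual flags the poor prediction
```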
The other options can be useful for the statistically-minded but for the purposes of our
analysis the options above should suffice (we think we are fairly thorough!). Click on
Continue to shut the sub-menu. The next sub-menu to consider is called Options. Here it is worth requesting the Classification plots and the Hosmer-Lemeshow goodness-of-fit test, both of which we will use when interpreting the output.
Click on Continue to close the sub-menu. Once you are happy with all the settings take
a deep breath... and click OK to run the analysis.
Figure 4.12.1: Case Processing Summary and Variable Encoding for Model
The Case Processing Summary simply tells us how many cases are included in
our analysis. The second row tells us that 3423 participants are missing data on some
of the variables included in our analysis (they are missing either ethnicity, gender or
fiveem; remember we have included all cases with missing SEC), but this still leaves
us with 12347 cases to analyse. The Dependent Variable Encoding reminds us how
our outcome variable is encoded: 0 for 'no' (not getting 5 or more A*-C grades
including Maths and English) and 1 for 'yes' (making the grade!).
The Categorical Variables Codings table (Figure 4.12.2) shows the parameter coding for each dummy variable, so that, for example, Indian students are indicated by ethnic(2), Pakistani students by ethnic(3) and so on. You will also see that Never worked/long
term unemployed is the base category for SEC, and that each of the other SEC
categories has a parameter coding of 1-7 reflecting each of the seven dummy SEC
variables that SPSS has created. This is only important in terms of how the output is
labeled, nothing else, but you will need to refer to it later to make sense of the output.
The next set of output is under the heading of Block 0: Beginning Block (Figure
4.12.3):
Figure 4.12.3: Classification Table and Variables in the Equation
This set of tables describes the baseline model, that is, a model that does not include
our explanatory variables! As we mentioned previously, the predictions of this baseline
model are made purely on whichever category occurred most often in our dataset. In
this example the model always guesses no because more participants did not
achieve 5 or more A*-C grades than did (6422 compared to 5925 according to our first
column). The overall percentage row tells us that this approach to prediction is correct
52.0% of the time, so it is only a little better than tossing a coin!
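The 52.0% figure is just the share of the larger outcome group. Assuming the counts quoted above, a one-line check (sketched in Python):

```python
# Baseline model: always predict the more common outcome ("no")
no_count, yes_count = 6422, 5925     # did not / did achieve fiveem
accuracy = no_count / (no_count + yes_count)
print(round(accuracy * 100, 1))      # percentage of cases classified correctly
```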
The Variables in the Equation table shows us the coefficient for the constant (B0). This
table is not particularly important but we've highlighted the significance level to
illustrate a cautionary tale! According to this table the model with just the constant is a
statistically significant predictor of the outcome (p <.001). However it is only accurate
52% of the time! The reason we can be so confident that our baseline model has
some predictive power (better than just guessing) is that we have a very large sample
size: even though the improvement in prediction (the effect size) is only marginal, we have
enough cases to provide strong evidence that the improvement is unlikely to be due
to sampling variation. You will see that our large sample size leads to high levels of statistical
significance for relatively small effects in a number of cases.
We have not printed the next table, Variables not in the Equation, because all it
really does is tell us that none of our explanatory variables were actually included in
this baseline model (Block 0), which we know anyway! It is however worth noting the
number in brackets next to each variable; this is the parameter coding we
mentioned earlier. As you can see, you will need to refer to the Categorical Variables
Encoding Table to make sense of these!
Now we move to the regression model that includes our explanatory variables. The
next set of tables begins with the heading of Block 1: Method = Enter (Figure 4.12.4):
The Omnibus Tests of Model Coefficients is used to check that the new model (with
explanatory variables included) is an improvement over the baseline model. It uses
chi-square tests to see if there is a significant difference between the Log-likelihoods
(specifically the -2LLs) of the baseline model and the new model. If the new model
has a significantly reduced -2LL compared to the baseline then it suggests that the
new model is explaining more of the variance in the outcome and is an improvement!
Here the chi-square is highly significant (chi-square = 1566.7, df = 15, p < .001) so our
new model is significantly better.
To confuse matters there are three different versions: Step, Block and Model. The
Model row always compares the new model to the baseline. The Step and Block rows
are only important if you are adding the predictors to the model in a stepwise or
hierarchical manner. If we were building the model up in stages then these rows would
compare the -2LLs of the newest model with the previous version to ascertain whether
or not each new set of explanatory variables was causing improvements. In this case
we have added all three explanatory variables in one block and therefore have only
one step. This means that the chi-square values are the same for step, block and
model. The Sig. values are p < .001, which indicates that the accuracy of the model
improves when we add our explanatory variables.
The Model Summary (also in Figure 4.12.4) provides the -2LL and pseudo-R2 values
for the full model. The -2LL value for this model (15529.8) is what was compared to
the -2LL for the previous null model in the omnibus test of model coefficients which
told us there was a significant decrease in the -2LL, i.e. that our new model (with
explanatory variables) is a significantly better fit than the null model. The R2 values tell
us approximately how much variation in the outcome is explained by the model (like in
linear regression analysis). We prefer to use the Nagelkerke R2, which
suggests that the model explains roughly 16% of the variation in the outcome. Notice
how the two versions (Cox & Snell and Nagelkerke) do vary! This just goes to show
that these R2 values are approximations and should not be overly emphasized.
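Both pseudo-R2 values can be reproduced from the -2LL figures alone, which also shows where the difference between them comes from. A sketch in Python (the baseline -2LL is not printed above, but is implied by the model -2LL of 15529.8 and the chi-square of 1566.7):

```python
import math

n = 12347                          # cases analysed
m2ll_model = 15529.8               # -2LL of the full model (Model Summary)
m2ll_null = m2ll_model + 1566.7    # baseline -2LL implied by the omnibus chi-square

# Cox & Snell R2: based on the improvement in -2LL per case
cox_snell = 1 - math.exp(-(m2ll_null - m2ll_model) / n)

# Nagelkerke R2: Cox & Snell rescaled by its maximum attainable value,
# which is why it is always the larger of the two
max_cs = 1 - math.exp(-m2ll_null / n)
nagelkerke = cox_snell / max_cs

print(round(cox_snell, 3), round(nagelkerke, 3))
```

The second printed value agrees with the Nagelkerke figure of roughly .159 quoted above.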
Moving on, the Hosmer & Lemeshow test (Figure 4.12.5) of the goodness of fit
suggests the model is a good fit to the data as p = 0.792 (> .05). However the chi-squared statistic on which it is based is very dependent on sample size, so the value
cannot be interpreted in isolation from the size of the sample. As it happens, this p
value may change when we allow for interactions in our data, but that will be
explained in a subsequent model on Page 4.13. You will notice that the output also
includes a contingency table, but we do not study this in any detail so we have not
included it here.
More useful is the Classification Table (Figure 4.12.6). This table is the equivalent to
that in Block 0 (Figure 4.12.3) but is now based on the model that includes our
explanatory variables. As you can see our model is now correctly classifying the
outcome for 64.5% of the cases compared to 52.0% in the null model. A marked
improvement!
However the most important of all output is the Variables in the Equation table (Figure
4.12.7). We need to study this table extremely closely because it is at the heart of
answering our questions about the joint association of ethnicity, SEC and gender with
exam achievement.
This table provides the regression coefficient (B), the Wald statistic (to test the
statistical significance) and the all-important Odds Ratio (Exp(B)) for each variable
category.
Looking first at the results for SEC, there is a highly significant overall effect
(Wald = 1283, df = 7, p < .001). The b coefficients for all SECs (1-7) are significant and
positive, indicating that increasing affluence is associated with increased odds of
achieving fiveem. The Exp(B) column (the Odds Ratio) tells us that students from the
highest SEC homes are over eleven (11.37) times more likely than those from the lowest SEC
homes (our reference category) to achieve fiveem. Comparatively, those from the SEC
group just above the poorest homes are about 1.37 times (or 37%) more likely to
achieve fiveem than those from the lowest SEC group. The effect of gender is also
significant and positive, indicating that girls are more likely to achieve fiveem than
boys. The OR tells us they are 1.48 times (or 48%) more likely to achieve fiveem,
even after controlling for ethnicity and SEC (refer back to Page 4.7 effect size of
explanatory variables to remind yourself how these percentages are calculated).
Most importantly, controlling for SEC and gender has changed the associations
between ethnicity and fiveem. The overall association between fiveem and ethnicity
remains highly significant, as indicated by the overall Wald statistic, but the size of the
b coefficients[2] and the associated ORs for most of the ethnic groups has changed
substantially. This is because the SEC profile for most ethnic minority groups is lower
than for White British, so controlling for SEC has significantly changed the odds ratios
for these ethnic groups (as it did in our multiple linear regression example). We saw in
Figure 4.10.1 that Indian students (Ethnic(2)) were significantly more likely than White
British students to achieve fiveem (OR=1.58), and now we see that this increases
even further after controlling for SEC and gender (OR=1.97). Bangladeshi students
(Ethnic(4)) were previously significantly less likely than White British students to
achieve fiveem (OR=.80) but are now significantly more likely (OR=1.47). Pakistani
(Ethnic(3)) students were also previously significantly less likely than White British
students to achieve fiveem (OR=.64) but now do not differ significantly after controlling
for SEC (OR=.92). The same is true for Black African (Ethnic(6)) students (OR change
from .83 to .95). However the OR for Black Caribbean (Ethnic(5)) students has not
changed much at all (OR change .53 to .57) and they are still significantly less likely to
achieve fiveem than White British students, even after accounting for the influence of
social class and gender.
[2] Before running this model we ran a model that just included ethnic group to estimate the b
coefficients and to test the statistical significance of the ethnic gaps for fiveem. We haven't
reported it here because the Odds Ratios from that model are identical to those shown in
Figure 4.10.1. However, the b coefficients and their statistical significance are shown as
Model 1 in Figure 4.15.1, where we show how to present the results of a logistic regression.
The final piece of output is the classification plot (Figure 4.12.8). Conceptually it
answers a similar question to the classification table (see Figure 4.12.6), which is:
how accurate is our model in classifying individual cases? However the classification
plot gives some finer detail. This plot shows you the frequency of categorizations for
different predicted probabilities and whether they were 'yes' or 'no' categorizations.
This provides a useful visual guide to how accurate our model is by displaying how
many times the model would predict a yes outcome based on the calculated
predicted probability when in fact the outcome for the participant was no.
If the model is good at predicting the outcome for individual cases we should see a
bunching of the observations towards the left and right ends of the graph. Such a plot
would show that where the event did occur (fiveem was achieved, as indicated by a 'y'
in the graph) the predicted probability was also high, and that where the event did not
occur (fiveem was not achieved, indicated by an 'n' in the graph) the predicted
probability was also low. The graph shows that quite a lot of cases are actually
in the middle area of the plot, i.e. the model is predicting a probability of around .5 (or
a 50:50 chance) that fiveem will be achieved. So while our model identifies that SEC,
ethnicity and gender are significantly associated with the fiveem outcome, and indeed
can explain 15.9% of the variance in outcome (quoting the Nagelkerke pseudo-R2),
they do not predict the outcome for individual students very well. This is important
because it indicates that social class, ethnicity and gender do not determine students'
outcomes (although they are significantly associated with them). There is substantial
individual variability that cannot be explained by social class, ethnicity or gender, and
we might expect that this reflects individual factors like prior attainment, student effort,
teaching quality, etc.
Let's move on to discuss interaction terms; for now we will save explaining how to
test the assumptions of the model for a little later. Something to look forward to!
The first step is to add all the interaction terms, starting with the highest. With three
explanatory variables there is the possibility of a 3-way interaction (ethnic * gender *
SEC). If we include a higher order (3 way) interaction we must also include all the
possible 2-way interactions that underlie it (and of course the main effects). There are
three 2-way interactions: ethnic*gender, ethnic*SEC and Gender*SEC. Our strategy
here is to start with the most complex 3-way interaction to see if it is significant. If it is
not then we can eliminate it and just test the 2-way interactions. If any of these are not
significant then we can eliminate them. In this way we can see if any interaction terms
make a statistically significant contribution to the interpretation of the model.
The dataset we are using also contains some useful extra variables which we created for the last module. The process for
creating a model with interaction terms is very similar to doing it without them so we
wont repeat the whole process in detail (see the previous page, Page 4.12, if you
require a recap). However, there is a key extra step. To create an interaction term,
highlight the two relevant variables together in the left-hand variable list (hold down
Ctrl as you click each one) and then click the button marked '>a*b>' to send the
product term to the Covariates box. The two variables will appear next to each other
separated by a '*'. In this way you can add all the interaction terms to your model.
Even though we have chosen to use the three category SEC measure, the output is
very extensive when we include all possible interaction terms. We have a total of 55
interaction terms (three for gender*SECshort, seven for ethnic*gender, 21 for
ethnic*SECshort, and the remainder for the three-way interaction).
Remember to tell SPSS which variables are categorical and set the options as we
showed you on Page 4.11!
Before running this model you will need to do one more thing. Wherever it was not
possible to estimate the SEC of the household in which the student lived SECshort
was coded 0. To exclude these cases from any analysis the missing value indicator
for SECshort is currently set to the value 0. As discussed on Page 3.9, it is actually
very useful to include a dummy variable for missing data where possible. If we want to
include these cases we will need to tell SPSS. Go to the Variable view and find the
row of options for SECshort. Click on the box for Missing and change the option to No
missing values (see below) and click OK to confirm the change.
This will ensure that SPSS makes us a dummy variable for SEC missing. You can
now click OK on the main menu screen to run the model!
Working out the ORs with interaction effects is somewhat tricky (remember we
encountered a similar issue in the multiple linear regression module on Page 3.11). As
we have discussed, each B coefficient represents the change in the logit of our
outcome predicted by a one unit change in our explanatory variable, but this is more
complicated when we also have interactions between our explanatory variables.
Each of the ethnic coefficients represents the difference between that ethnic
group and White British students, but crucially only for students in the baseline
category for SECshort (i.e. low SEC students).
For SECshort the coefficients represent the difference between each of the
medium and high SEC categories and the baseline category of low SEC, but
only for White British students.
The coefficients for each ethnic * SECshort interaction term represent how
much the SEC contrasts vary for each ethnic group, relative to the size of the
SEC effect among White British students.
To work out the SEC gap among Black Caribbean students, for example, we add the
B coefficient for high SEC [SECshort(2)] to the B coefficient for the interaction
between Black Caribbean and high SEC [SECshort(2) by ethnic(5)]. This gives
1.735 + (-.782) = 0.953. What does this mean? Not much yet; we need to take the
exponent to turn it into a trusty odds ratio: Exp(0.953) = 2.59 (we used the =EXP()
function in EXCEL, but you could use a calculator). This means that among Black
Caribbean students, high SEC students are only 2.6 times more likely to achieve
fiveem than low SEC students; the SEC gap is much smaller than among White
British students. The important point to remember is that you cannot simply add up
the Exp(B)s to arrive here; it only works if you add the B coefficients in their original
form and then take the exponent of this sum!
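The rule 'add the Bs, then exponentiate' is easy to check in code; a sketch using the coefficients quoted above:

```python
import math

b_high_sec = 1.735        # B for high vs low SEC among White British students
b_interaction = -0.782    # B for the Black Caribbean * high SEC interaction

# Correct: sum the B coefficients first, then exponentiate
or_combined = math.exp(b_high_sec + b_interaction)
print(round(or_combined, 2))

# Equivalently, ORs multiply (they never add): exp(a + b) == exp(a) * exp(b)
print(round(math.exp(b_high_sec) * math.exp(b_interaction), 2))
```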
Interpreting the ethnic gaps at different levels of SEC
The model has told us what the ethnic gaps are among low SEC students (the
reference category). Suppose I wanted to know the estimated size of the ethnic
gaps among high SEC students; how would I do this? To find out you would rerun
the model but set high SEC as the base or reference category. The coefficients for
each ethnic group would then represent the differences between the average for that
ethnic group and White British students among those from high SEC homes.
Currently SECshort is coded as follows, with the last category used as the reference
group:

SECshort value   Label
0                Missing SEC
1                High SEC
2                Middle SEC
3                Low SEC (Reference category LAST)
We can simply recode the value for missing cases from 0 to 9 and set the reference
category to the first value, so High SEC becomes the reference category, as shown
below:

SECshort value   Label
1                High SEC (Reference category FIRST)
2                Middle SEC
3                Low SEC
9                Missing SEC
You can do this through the Transform > Recode into Different Variables windows menu
(see the Foundation Module) or simply through the following syntax:
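A sketch (it assumes SECshort is coded as described above, and creates the SECshortnew variable referred to below):

```spss
* Recode the missing-SEC value from 0 to 9, copying all other values across.
RECODE SECshort (0=9) (ELSE=COPY) INTO SECshortnew.
VALUE LABELS SECshortnew 1 'High SEC' 2 'Middle SEC' 3 'Low SEC' 9 'Missing SEC'.
EXECUTE.
```

Remember that you will still need to declare SECshortnew as categorical, with the first category as the reference, when you add it to the logistic regression.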
We then rerun the model simply adding SECshortnew as our SEC measure. It is
important to note that computationally this is an exactly equivalent model to when low
SEC was the reference category. The coefficients for other variables (for example,
gender) are identical, the contrast between low and high SEC homes is the same (you
can check the B value in the output below), and the R2 and log-likelihood are exactly
the same. All that has varied is that the coefficients printed for ethnicity are now the
contrasts among high SEC rather than low SEC homes.
The output is shown below (Figure 4.13.2). For convenience we have added labels to
the values so you can identify the groups. As you know, this is not done by SPSS so it
is vital that you refer to the Categorical variables encoding table when interpreting
your output. It is apparent that the ethnic gaps are substantially different among high
SEC students than among low SEC students. Among low SEC students the only significant
contrasts were that Indian, Bangladeshi and Any other ethnic group had higher
performance than White British (see Figure 4.13.1). However among students from
high SEC homes while Indian students again achieve significantly better outcomes
than White British students, both Black Caribbean (OR=.36, p<.005) and Black African
(OR=.685, p<.025) are significantly less likely to achieve fiveem than White British
students by a considerable margin. Black Caribbean students are only about one third
as likely to achieve fiveem as White British high SEC students. In percentage terms
we can say they are 64% ((1 - 0.358) * 100) less likely to achieve fiveem than White
British students of the same SEC group.
Figure 4.13.2: Variables in the Equation Table with high SEC as the reference
category
If we wanted to evaluate the ethnic gaps among students from middle SEC homes we
would simply follow the same logic as above, recoding our values so that middle SEC
was the first (or last) category and setting the reference category appropriately.
Figure 4.13.3: Mean Number of Students with Five or More A*-C grades (inc.
English and Maths) by SEC and Ethnicity
The line graph shows a clear interaction between SEC and ethnicity. If the two
explanatory variables did not interact we would expect all of the lines to have
approximately the same slope (that is, the lines on the graph would be parallel
if there were no interaction effect), but it seems that the effect of SEC on fiveem is
different for different ethnic groups. For example the relationship appears to be very
linear for White British students (blue line): as the socio-economic group becomes
more affluent the probability of fiveem increases. This is not the case for all of the ethnic
groups. For example, with regard to Black Caribbean students there is a big increase
in fiveem as we move from low SEC to intermediate SEC, but a much smaller
increase as we move to high SEC. As you (hopefully) can see, the line graph is a
good way of visualizing an interaction between two explanatory variables.
Now that we have seen how to create and interpret our logistic regression models
both with and without interaction terms we must again turn our attention to the
important business of checking that the assumptions underlying our model are met
and that the results are not misleading due to any extreme cases.
Linearity of Logit
This assumption is confusing but it is not usually an issue. Problems with the linearity
of the logit can usually be identified by looking at the model fit and pseudo R2 statistics
(Nagelkerke R2, see Page 4.12, Figure 4.12.4). The Hosmer and Lemeshow test,
which as you may recall was discussed on Page 4.12 and shown as SPSS output in
Figure 4.12.5 (reprinted below), is a good test of how well your model fits the data. If
the test is not statistically significant (as is the case with our model here!) you can be
fairly confident that you have fitted a good model.
Figure 4.14.1: Hosmer and Lemeshow Test
With regard to the Nagelkerke R2 you are really just checking that your model is
explaining a reasonable amount of the variance in the data. Though in this case the
value of .159 (about 16%) is not high in absolute terms, it is highly statistically
significant.
Of course this approach is not perfect. Field (2009, p.296, see Resources) suggests
an altogether more technical approach to testing the linearity of the logit if you wish to
have more confidence that your model is not violating its assumptions.
Independent Errors
As we mentioned on Page 4.9 checking for this assumption is only really necessary
when data is clustered hierarchically and this is beyond the scope of this website. We
thoroughly recommend our sister site LEMMA (see Resources) if you want to learn
more about this.
Multicollinearity
It is important to know how to perform the diagnostics if you believe there might be a
problem. The first thing to do is simply create a correlation matrix and look for high
coefficients (those above .8 may be worthy of closer scrutiny). You can do this very
simply on SPSS: Analyse > Correlate > Bivariate will open up a menu with a single
window and all you have to do is add all of the relevant explanatory variables into it
and click OK to produce a correlation matrix.
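The same screening can also be done programmatically. This sketch (Python with numpy, using simulated data rather than the LSYPE variables) flags any pair of explanatory variables correlated above .8:

```python
import numpy as np

# Simulated explanatory variables: x2 is deliberately almost a copy of x1
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.2, size=500)
x3 = rng.normal(size=500)

corr = np.corrcoef([x1, x2, x3])
# Flag every pair whose absolute correlation exceeds the .8 rule of thumb
flagged = [(i, j) for i in range(3) for j in range(i + 1, 3)
           if abs(corr[i, j]) > 0.8]
print(flagged)   # only the x1/x2 pair should warrant closer scrutiny
```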
If you are after more detailed collinearity diagnostics it is unfortunate that SPSS does
not make it easy to perform them when creating a logistic regression model (such a
shame, it was doing so well after including the interaction button). However, if you
recall, it is possible to collect such diagnostics using the menus for multiple linear
regression (see Page 3.14). Because the tests of multicollinearity are actually
independent of the type of regression model you are making (they examine only the
explanatory variables), you can get them by running a multiple linear regression
using the exact same variables as you used for your logistic regression. Most of the
output will be meaningless because the outcome variable is not continuous (which
violates a key assumption of linear regression methods) but the multicollinearity
diagnostics will be fine! Of course we have discussed this whole issue in the previous
module (Page 3.14).
Influential Cases
On Page 4.11 we showed you how to request the model's residuals and the Cook's
distances as new variables for analysis. As you may recall from the previous module
(Page 3.14), if a case has a Cook's distance greater than one it may be unduly
influencing your model. Requesting the Cook's distance will have created a new
variable in your dataset called COO_1 (note that this might be different if you have
created other variables in previous exercises; this is why clearly labeling variables is
so useful!). To check that we have no cases where Cook's distance is greater than
one we can simply look at the frequencies: Analyse > Descriptive Statistics >
Frequencies, add COO_1 into the window, and click OK. Your output, terrifyingly, will
look something like Figure 4.14.2 (only much, much longer!):
This is not so bad though; remember we are looking for values greater than one, and
these values are in order. If you scroll all the way to the bottom of the table you will
see that the highest Cook's distance is less than .014, nowhere near the level
of 1 at which we would need to be concerned.
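The same check can be scripted rather than eyeballed; a sketch (Python, with hypothetical saved Cook's distances):

```python
# Hypothetical column of saved Cook's distances (the COO_1 variable in SPSS)
cooks = [0.0001, 0.0032, 0.0138, 0.0007, 0.0002]

# No case should exceed the conventional threshold of 1
print(max(cooks))
print(max(cooks) < 1)
```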
Finally we have reached the end of our journey through the world of Logistic
Regression. Let us now take stock and discuss how you might go about pulling all of
this together and reporting it.
Model 1 shows the simple association between ethnic group and the fiveem outcome.
Model 2 shows what happens when we add SECshort and gender to the model.
Model 3 shows the significant interaction that exists between ethnic group and
SECshort which needs to be taken into account. Summarising the results of the three
models alongside each other in this way lets you tell the story of your analysis and
show how your modeling developed.
The elements of this table (Figure 4.15.1) that you choose to discuss in more detail in
your text will depend on the precise nature of your research question, but as you can
see it provides a fairly concise presentation of nearly all of the key relevant statistics.
Figure 4.15.1: Logistic regression models for fiveem

Variable                                Model 1               Model 2               Model 3
                                      B      SE    OR      B      SE    OR      B      SE    OR
Constant                           -.100   .020   .91   -1.088   .043   .34   -1.184   .051   .31

Ethnic group (base = White British)
  Mixed heritage                   -.142   .076   .87    -.133   .079   .88    -.079   .171   .92
  Indian                         .458***   .067  1.58   .678***  .070  1.97   .784***  .128  2.19
  Pakistani                     -.447***   .071   .64    -.138   .074   .87     .152   .125  1.16
  Bangladeshi                    -.222**   .079   .80    .223**  .082  1.25   .544***  .123  1.72
  Black Caribbean               -.626***   .093   .53   -.632*** .097   .53    -.246   .210   .78
  Black African                   -.190*   .086   .83    -.010   .091   .99     .068   .156  1.07
  Any other group                  .188*   .082  1.21   .335***  .086  1.40   .605***  .163  1.83

Gender (base = male)
  Female

Socio-Economic Class (SEC) (base = low)
  Missing                                                 .476   .053  1.61
  High                                                   1.585   .049  4.88
  Medium                                                  .705   .048  2.02

Interaction ethnic * SEC (Model 3 only)
  Mixed heritage * missing SEC                                                 -.005   .251  1.00
  Mixed heritage * high SEC                                                    -.129   .217   .88
  Mixed heritage * medium SEC                                                  -.019   .235   .98
  Indian * missing SEC                                                         -.065   .198   .94
  Indian * high SEC                                                            -.159   .215   .85
  Indian * medium SEC                                                          -.145   .179   .86
  Pakistani * missing SEC                                                      -.328   .198   .72
  Pakistani * high SEC                                                        -.519*   .237   .59
  Pakistani * medium SEC                                                      -.418*   .187   .66
  Bangladeshi * missing SEC                                                    -.320   .190   .73
  Bangladeshi * high SEC                                                     -1.009**  .366   .36
  Bangladeshi * medium SEC                                                    -.711**  .226   .49
  Black Caribbean * missing SEC                                                -.492   .318   .61
  Black Caribbean * high SEC                                                  -.782**  .263   .46
  Black Caribbean * medium SEC                                                 -.065   .276   .94
  Black African * missing SEC                                                   .288   .246  1.33
  Black African * high SEC                                                     -.447   .231   .64
  Black African * medium SEC                                                    .001   .273  1.00
  Any other * missing SEC                                                      -.177   .241   .84
  Any other * high SEC                                                         -.418   .239   .66
  Any other * medium SEC                                                       -.453   .234   .64

-2LL                                20652                 19335                 19291
Model chi-square       chi-square=165, df=7, p<.001  chi-square=1317, df=4, p<.001  chi-square=44, df=21, p<.001
Nagelkerke R2                        1.5%                 12.5%                 12.9%
Hosmer & Lemeshow test              p=1.00                p=0.026               p=0.535
Classification accuracy             54.7%                 63.8%                 64.0%

* p < .05; ** p < .01; *** p < .001.
If you want to see an example of a published paper presenting the results of a logistic
regression see:
Strand, S. & Winston, J. (2008). Educational aspirations in inner city schools.
Educational Studies, 34, (4), 249-267.
Conclusion
We hope that, now you have braved this module, you are confident in your knowledge
of what logistic regression is and how it works, and that you are confident about
creating and interpreting your own logistic regression models using SPSS. Most of all
we hope that all of the formulae have not frightened you away! Logistic regression
can be an extremely useful tool for educational research, as we hope our LSYPE
example has demonstrated, and so getting to grips with it can be a very useful
experience. Whew! Why not have a little lie down (and perhaps a stiff drink) and then
return to test your knowledge with our quiz and exercise?
Exercise
We have seen that prior attainment, specifically age 11 test score, is a strong
predictor of later achievement. Maybe some of the ethnic, social class and gender
differences in achievement at age 16 reflect differences that were already apparent in
attainment at age 11? This would have substantive implications for education policy,
because it would indicate that attention would need to focus as much on what has
happened during the primary school years up to age 11 as on the effects of education
during the secondary school years up to age 16.
Answer the following questions in full sentences, with supporting tables or graphs
where appropriate, as this will help you to better understand how you might apply
these techniques to your own research. The answers are on the next page.
Note: The variable names as they appear in the SPSS dataset are listed in brackets.
We have also included some hints in italics.
Question 1
Exam score at age 11 is included in the LSYPE dataset as ks2stand. Before we
include it in our model we need to know how it is related to our outcome variable,
fiveem. Graphically show the relationship between ks2stand and fiveem.
Question 2
In this module we have established that ethnicity, Socio-economic class and gender
can all be used to predict whether or not students pass 5 exams with grades A-C at
age 16 (our trusty fiveem). Does adding age 11 exam score (ks2stand) to this main
effects model as an explanatory variable make a statistically significant contribution to
predicting fiveem?
Question 3
Does adding age 11 score as an explanatory variable substantially improve the fit of
the model? That is to say, does it improve how accurately the model predicts whether
or not students achieve fiveem?
Run the analysis again but this time with two blocks, including ethnic, SECshort
and gender in the first block and ks2stand in the second. Examine the -2LL and
pseudo-R2 statistics.
Question 4
Following on from question 3, what difference (if any) does adding age 11 score to the
model make to the ethnic, gender and SEC coefficients? What is your interpretation of
this result?
No need to carry out further analysis just examine the Variables in the
equation tables for each block.
Question 5
Is there an interaction effect between SEC and age 11 score?
Run the model again but this time including a SECshort*ks2stand interaction.
You may also wish to graph the relationship between age 11 exam score and
SEC in the context of fiveem.
Question 6
Are there any overly influential cases in this model?
You will need to get the Cooks distances for each case. This may require you
to re-run the model with different option selections.
Answers
Question 1
There is arguably more than one way to achieve this but we have gone for a bar graph
which uses ks2stand as the category (X axis) and the mean fiveem score as the main
variable (Y axis). If you have forgotten how to do this on SPSS we run through it as
part of the Foundation Module.
The mean fiveem score (which varies between 0 and 1) provides an indication of how
many students passed five or more GCSEs (including maths and English) at each
level of age 11 standard score. As you can see, the mean score is far lower for those
with lower age 11 scores but right up at the maximum of 1 for those with the highest
scores. Note that the shape of the graph matches the sigmoid shape for binary
outcomes which we have seen throughout this module. Based on this graph there
appears to be strong evidence that age 11 score is a good predictor of fiveem.
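The graph described above is built by grouping the 0/1 fiveem outcome by age 11 score and taking the mean within each score, which gives the proportion passing at that score. A minimal sketch with toy data (not the LSYPE values):

```python
# Group a 0/1 outcome by a score and take the mean within each score.
from collections import defaultdict

records = [  # (ks2stand, fiveem) pairs -- illustrative values only
    (-2, 0), (-2, 0), (-1, 0), (-1, 1),
    (0, 0), (0, 1), (1, 1), (1, 1), (2, 1), (2, 1),
]

totals = defaultdict(lambda: [0, 0])  # score -> [sum of fiveem, count]
for score, outcome in records:
    totals[score][0] += outcome
    totals[score][1] += 1

proportions = {s: passed / n for s, (passed, n) in sorted(totals.items())}
print(proportions)  # {-2: 0.0, -1: 0.5, 0: 0.5, 1: 1.0, 2: 1.0}
```

Plotting these proportions against the score values produces the rising, sigmoid-like shape discussed above.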
Question 2
The table below shows that the explanatory variable for age 11 score is indeed a
statistically significant predictor of fiveem (this can be ascertained from the Sig.
column). The Exp(B) column shows that the odds ratio is 1.273, meaning that a one
unit increase in age 11 score (an increase of 1 point) increases the odds of achieving
fiveem by a multiplicative factor of 1.273. This is very substantial when you
consider how large the range of possible age 11 scores is!
See Page 4.5 if you want to review how to interpret logistic regression coefficients for
continuous explanatory variables.
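Because the odds ratio for a continuous predictor applies per unit, it compounds multiplicatively across larger score differences, which is why 1.273 per point is so substantial. A quick check of the arithmetic:

```python
import math

# An OR of 1.273 applies per point of age 11 score, multiplicatively.
or_per_point = 1.273

# Over a 10-point difference the odds are scaled by 1.273**10.
print(round(or_per_point ** 10, 1))  # 11.2

# Equivalently, on the log-odds scale the coefficient is B = ln(1.273).
print(round(math.log(or_per_point), 3))  # 0.241
```

So a student 10 points higher at age 11 has odds of achieving fiveem roughly eleven times larger, all else equal.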
Question 3
In order to explore just how much impact adding age 11 exam score as an
explanatory variable had on the model we re-ran the model but entered ks2stand as a
second block (block 2). The classification tables, Nagelkerke R2 and omnibus tests
can then be compared across these blocks to assess the impact of accounting for
prior achievement.
As we have discussed on Page 4.12, the omnibus test tells us whether or not our
model is better at predicting the outcome than the baseline model (which always
predicts whichever of the two outcomes was more frequent in the data). The Block
row tells us whether the new block significantly improves the number of correct
predictions compared to the previous block. As you can see, the omnibus test table for
Block 2 indicates that the addition of ks2stand does indeed improve the accuracy of
predictions to a statistically significant degree (highlighted - Sig is <.05). The Model
row test is also significant for both blocks, suggesting both are better at predicting the
outcome than the baseline model.
The classification table (third table down) shows us just how much more accurately
block 2 describes the data compared to block 1. The model defined as block 1
correctly classifies 64.1% of cases, an improvement over the baseline model but still
not great. The inclusion of ks2stand in block 2 increases the proportion of correct
classifications substantially, to 80.9%. Finally, the Model Summary table helps to
confirm this. The deviance (-2LL) is substantially lower for block 2 than for block 1 and
the Nagelkerke pseudo R2 is .589 (59% of variance explained) for block 2 compared
to .134 (13% of variance explained) for block 1. Overall, age 11 score is clearly very
important for explaining age 16 exam success!
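The block comparison SPSS reports is a likelihood-ratio test: the drop in deviance (-2LL) between blocks is a chi-square statistic with degrees of freedom equal to the number of added parameters. A sketch of that comparison, using the reported block 1 deviance and a hypothetical block 2 value purely for illustration:

```python
# Likelihood-ratio (chi-square) test between two nested blocks.
deviance_block1 = 19335.0  # -2LL with ethnic, SECshort and gender
deviance_block2 = 14000.0  # hypothetical -2LL after adding ks2stand

chi_square = deviance_block1 - deviance_block2
df = 1  # one parameter added (ks2stand)
print(chi_square)  # 5335.0

# The critical value for chi-square with df=1 at p=.05 is 3.84, so a
# drop anywhere near this size is statistically significant.
print(chi_square > 3.84)  # True
```

This is exactly the quantity the omnibus test's Block row reports in the SPSS output.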
Question 4
Below you will see the Variables in the Equation table for block 2 with the sig and
Exp(B) columns from the same table for block 1. These are taken from the same
SPSS output that we generated for question 3. Notice how in most cases the odds
ratios [Exp(B)] are less in block 2 than they were in block 1. In addition, note that three
explanatory variables are not statistically significant in both versions of the model.
Question 5
Before exploring the interaction statistically it is worth first examining the relationship
by looking at a line graph (though remembering that this graph does not account for
the influence of other explanatory variables such as gender and ethnicity):
As you may recall from Page 4.13, if the lines in this graph were approximately
parallel we would expect there to be no interaction. Though this does not appear to be
the case here, the interaction is certainly not clear cut: you could argue that in the
middle range of age 11 scores there is a clear ordering by affluence, whereby the
wealthiest group has the highest pass rate and the least wealthy the lowest. There are
considerable fluctuations, particularly at the upper and lower ends of the age 11 score
range, but these may not be as important as they appear given the large range of
possible age 11 score values.
Let us check the Variables in the Equation table for the logistic regression model when we
include a ks2stand*SECshort interaction term. As you can see, it doesn't actually help
that much! Though the significance level is greater than the commonly used 5%
(p<.05) it is still less than 10%. The comparison between the lower SEC group and the
intermediate SEC group (for White British males only) is statistically significant at the
.05 level. Deciding whether or not to include this interaction in your final model would
be a judgment call. Given that the interaction effect does not appear to be particularly
pronounced (the odds ratios are relatively small for the interactions and the odds
ratios for the other variables have changed very little) we would probably not include it,
for the sake of parsimony.
Question 6
As we saw on Page 4.14, cases that are having a particularly large influence on a
model can be identified by requesting a statistic called Cook's distance. If this statistic
is greater than 1 for a given case then that case may be an outlier that is powerful
enough to unduly influence the model (this is usually a more significant issue when you
have a smaller sample). The table below shows the Cook's distances produced by the
interaction model we created in Question 5. We have completely removed the middle
section because it is a horrendously long table. The Cook's distances are in the leftmost
column and listed in numerical order. As you can see, the largest value (at the bottom
of the table) is .05158. This is much lower than 1, the value which would usually be
cause for concern. It seems that our model has no cases that are overly influential.
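Once the Cook's distances are saved (one per case), the screening itself is just a threshold check. A minimal sketch, using illustrative values that echo the maximum of around .05 discussed above:

```python
# Flag potentially influential cases by Cook's distance > 1.
# Values are illustrative, not the actual LSYPE output.
cooks_distances = [0.0001, 0.0042, 0.0137, 0.0296, 0.05158]

influential = [d for d in cooks_distances if d > 1.0]
print(max(cooks_distances))  # 0.05158
print(influential)           # [] -- no case exceeds the usual cut-off of 1
```

An empty flagged list is exactly the conclusion reached above: no single case is driving the model.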