Handling Missing Data: Analysis of a Challenging Data Set Using Multiple Imputation
To cite this article: Maria Pampaka, Graeme Hutcheson & Julian Williams (2016) Handling
missing data: analysis of a challenging data set using multiple imputation, International Journal of
Research & Method in Education, 39:1, 19-37, DOI: 10.1080/1743727X.2014.979146
1. Introduction
Missing data is certainly not a new issue for educational research, particularly given the
constraints of designing and performing research in schools and other educational
establishments. Consider the situation when a researcher gets permission to administer
a questionnaire about bullying to the students during class time. On the agreed day of
administration various scenarios could take place: (A) some pupils may have been
absent at random without predictable reasons, (B) some pupils may have been absent
because they are representing their school in competitions (these pupils may be the
keenest and most engaged), and (C) some pupils did not respond to sensitive questions
(maybe they are more likely to be bullied or have special needs). All the above
scenarios will lead to missing data, but with different degrees of bias (i.e. errors due
to systematically favouring certain groups or outcomes) depending on the object of
the analysis. For example, if data are missing due to scenario B, the analysis will
under-represent the more highly attaining or engaged pupils. If data are missing due
to scenario C, those pupils with special educational needs will be under-represented,
causing any results to be significantly biased. In the real world, it is likely that data
are missing due to multiple reasons, with the above scenarios happening simultaneously
in any single project.

∗Corresponding author. Email: maria.pampaka@manchester.ac.uk
Missing data is a particular issue for longitudinal studies, especially when the
design involves transitions between phases of education when pupils tend to move
between institutions. This is a major issue in the wider social science literature,
which acknowledges that nearly all longitudinal studies suffer from significant attrition,
raising concerns about the characteristics of the dropouts compared to the remaining
subjects. This raises questions about the validity of inferences when applied to the
target population (Little 1988; Little and Rubin 1989; Schafer and Graham 2002;
Kim and Fuller 2004; Plewis 2007).
Even though the issues around missing data are well-documented, it is common
practice to ignore missing data and employ analytical techniques that simply delete
all cases that have some missing data on any of the variables considered in the analysis.
See, for example, Horton and Kleinman (2007) for a review of medical research reports,
and King et al. (2001, 49) who state that ‘approximately 94% (of analyses) use listwise
deletion1 to eliminate entire observations. [ . . . ] The result is a loss of valuable infor-
mation at best and severe selection bias at worst’.
In regression modelling, the use of step-wise selection methods2 is particularly
dangerous in the presence of missing data as the loss of information can be severe
and may not even be obvious to the analyst. A demonstration of this is provided
using the data presented in Table 1, where the variable ‘SCORE’ (a continuous vari-
able) is modelled using 3 candidate explanatory variables for 10 cases. It should be
noted here that this example is simply presented for illustration since in reality we
would not usually carry out such analyses on such small samples. In order to
compare successive models, a typical step-wise procedure first deletes any missing
data list-wise, leaving only the complete cases. Any case that has a missing data
point in any of the candidate variables is removed from the analysis, even when the
data may only be missing on a variable that is not included in the final model. This
can result in the loss of substantial amounts of information and introduce bias into
the selected models.
Table 2. Linear regression models for SCORE by GCSE (SCORE ~ GCSE) based on different
modelling procedures.
Table 2 shows the ordinary least squares regression model ‘SCORE ~ GCSE’ that
was selected using a step-wise procedure and the ‘same model’ constructed using all
available data. The step-wise procedure resulted in a model estimated from a smaller
sample as cases 2, 6, and 7 are excluded. The exclusion of these cases introduced
bias in the analysis as these three cases all have relatively low General Certificate of
Secondary Education (GCSE) marks. The parameter estimates from the step-wise pro-
cedure are based on a sub-sample that does not reflect the sample particularly accu-
rately. The models provide very different impressions about the relationship between
SCORE and GCSE marks. It is important to note that very little warning may be
given about the amount of data discarded by the model selection procedure. The loss
of data is only evident in the model output, by comparing the degrees of freedom
reported for the two models.3 This simple example highlights two common problems
in the analysis and reporting of results: the highly problematic step-wise regression,
and the sample size on which the models are calculated (which is rarely the same as
that reported with the description of the sample of the study). Both of these problems
are issues within popular analysis packages making it important for analysts to check
and report how a statistical package deals with model selection4 and also to make
sure that the selection process does not needlessly exclude data.
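The loss described above can be reproduced in R with a small simulated data set (the names echo the SCORE/GCSE example, but the values are hypothetical):

```r
# Hypothetical data: SCORE modelled by one useful predictor (GCSE) and two
# candidate predictors (X2, X3) that contain scattered missing values.
dat <- data.frame(
  SCORE = c(52, 48, 60, 55, 45, 40, 42, 58, 61, 50),
  GCSE  = c(6, 4, 7, 6, 5, 3, 4, 7, 7, 5),
  X2    = c(1, NA, 2, 3, 1, NA, 2, 2, 3, 1),   # candidate, never selected
  X3    = c(2, 3, NA, 1, 2, 2, NA, 3, 1, 2)    # candidate, never selected
)

# A step-wise procedure first deletes list-wise: only complete cases remain,
# even though X2 and X3 never enter the final model.
complete   <- na.omit(dat)
m.stepwise <- lm(SCORE ~ GCSE, data = complete)  # fitted on 6 complete cases

# The 'same' model fitted on all cases available for SCORE and GCSE.
m.all <- lm(SCORE ~ GCSE, data = dat)            # fitted on all 10 cases

# The silent loss of cases shows up only in the residual degrees of freedom.
c(stepwise = df.residual(m.stepwise), all.data = df.residual(m.all))
```

Comparing `df.residual()` for the two fits (4 versus 8 here) is the kind of check recommended in the text, since the selection procedure itself gives little warning.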
Even though missing data is an important issue, it is rarely dealt with or even
acknowledged in educational research (for an exception to this, see Wayman 2003).
Whilst data imputation (particularly multiple imputation (MI)) is now generally
accepted by statisticians, non-specialist researchers have been slow to adopt it. Data
imputation makes an easy target for criticism, mainly because it involves adding simu-
lated data to a raw data set, which causes some suspicion that the data are being manipu-
lated in some way resulting in a sample that is not representative. In fact, imputation
does the opposite, by using what information is available to simulate the missing
data so as to minimize the bias in results due to ‘missingness’.
Our aim in this paper is to review some of the issues surrounding missing data and
imputation methods and demonstrate how missing data can be imputed using readily-
available software. Using a real data set which (i) had serious quantities of missing data
It can be argued that the above names are not intuitive and could lead to confusion
(e.g. between MAR and MCAR which could be thought of as synonymous when in
reality they are not). However, the classification has stuck in the statistical terminology
and it is important in determining the possible resolutions of missing data problems, as
we will illustrate in the next section.
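The practical consequences of these mechanisms can be seen in a short R simulation (all variable names and values below are hypothetical): under MCAR a complete-case mean is unbiased, whereas under MAR, where missingness depends on an observed covariate, it is not.

```r
set.seed(42)
n     <- 10000
gcse  <- rnorm(n)                  # an observed covariate
score <- 0.5 * gcse + rnorm(n)     # the outcome of interest

# MCAR: every value is equally likely to be missing.
score.mcar <- ifelse(runif(n) < 0.3, NA, score)

# MAR: missingness depends on the *observed* covariate (low-GCSE cases are
# more likely to be missing), but not on the unobserved score itself.
p.miss    <- plogis(-2 * gcse)
score.mar <- ifelse(runif(n) < p.miss, NA, score)

# Complete-case means: close to the truth under MCAR, biased upwards under
# MAR because low-GCSE (hence low-score) cases drop out.
round(c(true = mean(score),
        mcar = mean(score.mcar, na.rm = TRUE),
        mar  = mean(score.mar,  na.rm = TRUE)), 2)
```

Because the MAR missingness here is fully driven by an observed variable, an imputation model that uses GCSE can recover the lost information; under MNAR (missingness driven by the unobserved values themselves) no such fix is available from the data alone.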
Step 1 – Imputation: Impute missing values using an appropriate model that incorporates
appropriate random variation. During this first step, sets of plausible values for
missing observations are created that reflect uncertainty about the non-response model.
These sets of plausible values can then be used M times5 to ‘complete’ the missing
values and create M ‘completed’ data sets.
Step 2 – Analysis: Perform the desired analysis on each of these M data sets using
standard complete-data methods.
Step 3 – Combination: During this final step, the results from the M analyses are
combined into a single set of estimates and standard errors, which allows the
uncertainty regarding the imputation to be taken into account.
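The combination step follows Rubin’s rules: the combined point estimate is the mean of the M estimates, and its total variance adds the between-imputation variance to the average within-imputation variance. A minimal R sketch, with hypothetical estimates:

```r
# Hypothetical parameter estimates and standard errors from M = 5 analyses
# of the 'completed' data sets.
Q <- c(1.02, 0.95, 1.10, 0.99, 1.04)    # estimate from each data set
U <- c(0.20, 0.21, 0.19, 0.20, 0.22)^2  # squared standard errors

M     <- length(Q)
Q.bar <- mean(Q)               # combined point estimate
W     <- mean(U)               # average within-imputation variance
B     <- var(Q)                # between-imputation variance
T.var <- W + (1 + 1/M) * B     # total variance (Rubin 1987)

c(estimate = Q.bar, se = sqrt(T.var))
```

The total variance is always at least the within-imputation variance, which is how the uncertainty introduced by imputing (rather than observing) the missing values is carried into the final standard errors.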
Fundamental to MI is the model, and hence the technique/algorithm used, for the
imputation of values. The non-statistically minded reader can skip the next section and
jump to the software which implements these algorithms and produces the desired imputations.
for each observation with missing data, multiple entries are created in an augmented
dataset for each possible value of the missing covariates, and a probability of observing
that value is estimated . . . the augmented complete-dataset is then used to fit the
regression model.
Bayesian8 MI methods are increasingly popular: these are performed using a Bayesian
predictive distribution to generate the imputations (Nielsen 2003) and specifying prior
values for all the parameters (Ibrahim et al. 2005). According to Schafer and Graham
(2002), Bayesian methods bring together MI methods and ML methods:
[ . . . ] the attractive properties of likelihood carry over to the Bayesian method of MI,
because in the Bayesian paradigm we combine a likelihood function with a prior distri-
bution for the parameters. As the sample size grows, the likelihood dominates the
prior, and Bayesian and likelihood answers become similar. (154)
computing the observed data likelihood [ . . . ] and taking random draws from it, is com-
putationally infeasible with classical methods. Even maximizing the function takes inor-
dinately long with standard optimization routines. In response to such difficulties, the
Imputation-Posterior (IP) and Expectation-Maximization (EM) algorithms were devised
and subsequently applied to this problem. From the perspective of statisticians, IP is
now the gold standard of algorithms for multivariate normal multiple imputations, in
large part because it can be adapted to numerous specialized models. Unfortunately,
from the perspective of users, it is slow and hard to use. Because IP is based on
Markov Chain Monte Carlo (MCMC) methods, considerable expertise is needed to
judge convergence, and there is no agreement among experts about this except for
special cases. (54)
In response to the above difficulties, the same group developed a new algorithm, the
EMB algorithm, which combines the classic EM algorithm with a bootstrap approach
to take draws from the posterior distribution. This algorithm expands substantially the
range of computationally feasible data types and sizes for which MI can be used
(Honaker and King 2010; Honaker, King, and Blackwell 2011). It should be noted
that this was the algorithm used within Amelia for our data set. Other tools are pre-
sented next.
Of particular interest for this paper is the educational outcome ‘AS grade’ high-
lighted in the final data point of Figure 1. This variable was only requested from stu-
dents at the final data point (because this is when it would have been available) and
was the cause of much of the missing data, given the large attrition rates at this stage
of the study. Fortunately, many of these missing data points for ‘AS grade’ were
accessed at a later date by collecting data directly from the schools and the students
via additional telephone surveys. With this approach we have been able to fill in the
grades for 879 additional pupils.
This supplementary collection of much of the missing data from the study enabled
an evaluation to be made of the success of data imputation. We compare the results of
analyses conducted on the data which was initially available with the enhanced, sup-
plementary data and evaluate the success of the imputation technique for replacing
the missing data.
Drop-out ~ Course + GCSE-grade + Disposition + Maths Self Efficacy

where Drop-out and Course are binary categorical variables, GCSE-grade is an ordered
categorical variable,11 and Disposition and Maths Self Efficacy are continuous.
The analysis reported here is restricted to the data and the model used in Hutcheson,
Pampaka and Williams (2011), which included the binary classifications of whether
students had ‘dropped out’ of the course that were retrieved after the initial study. In
this paper, we model dropout using the initial data (n = 495) and compare the resulting
model to a model where the missing dropout scores are imputed (n = 1374). An evaluation
of the accuracy of the imputation is made by comparing the models with the one
derived from the actual data that were recovered (n = 1374). The only difference
between the imputed model and the one reported for the actual data is that the
former includes 879 imputed values for dropout, whilst the latter includes 879 values
for dropout that were retrieved after the initial study.
Table 3. A logistic regression model of ‘dropout’ using the 495 cases available at the end of the
initial study.
The results of a logistic regression model of missingness (i.e. whether students provided
information at the end of the study (1) or not (0), modelled with respect to the following
explanatory variables: Course, Disposition, GCSE-grade, and Maths Self Efficacy)
are shown in Table 4 and the effects of the explanatory variables are illustrated via
the effect plots in Figure 2 (Fox 1987; Fox and Hong 2009).
The logistic regression in Table 4 clearly shows the difference between GCSE
grades for those students for whom information about dropout is available at the end
of the initial study and those for whom it is not. The difference is particularly clear
in the case of the students with A∗ grades, as these students are more than three
times as likely (exp(1.17) = 3.22 times) to provide information about dropout as
those with an intermediate-C grade (IntC).
Given these missingness patterns, the model in Table 3 is, therefore, likely to over-
estimate the effect of the high-achieving pupils.
Figure 2. Effect plots of a logistic regression model of missingness on dropout variable. The
graphs show the size of the effect of the explanatory variables on the response variable Y (i.e. the
probability of providing information, thus not missing). The Y range is set to be the same for all
graphs in order for the relative effects to be comparable.
The imputation was carried out with Amelia II, which runs within the R programme
(R Core Team 2013). Amelia II assumes that the complete data are multivariate
normal, which is ‘often a crude approximation to the true distribution of the
data’; however, there is ‘evidence that this model works as well as other, more compli-
cated models even in the face of categorical or mixed data’ (Honaker, King, and Black-
well 2013, 4). Amelia II also makes the usual assumption in MI that the data are missing
at random (MAR), which means that the pattern of missingness only depends on the
observed data and not the unobserved. The model we presented in Table 4 and
Figure 2 shows that missingness depends on GCSE grades, which is an observed vari-
able. Finally, the missing data points in the current analysis are binary, making Amelia
an appropriate choice.
The missing values for dropout were imputed using a number of variables available
in the full data set. In addition to using information about Course, Disposition, GCSE-
grade, and Maths Self Efficacy to model the missing data, information about EMA (i.e.
whether the student was holding Educational Maintenance Allowance), ethnicity,
gender, Language (whether the student’s first language was English), LPN (i.e.
whether the student was from Low Participation Neighbourhood), uniFAM (whether
the student was not first generation at HE), and HEFCE (an ordered categorical variable
denoting socio-economic status) were also included.12 Although relatively few imputa-
tions are often required (3–10, see Rubin 1987), it is recommended that more imputa-
tions are used when there are substantial amounts of missing data. For this analysis, we
erred on the side of caution and used 100 imputed data sets.
Amelia imputed 100 separate data sets, each of which could have been used to model
dropout. In order to obtain parameter estimates for the overall imputed model, the models
computed on the individual data sets were combined. The combined estimates and standard
errors for the imputed model were then obtained using the Zelig library for R (Owen et al.
2013). The overall statistics for the imputed models computed using Zelig are shown in
Table 5, with the software instructions provided in the Appendix.
The conclusions for the model based on the imputed data are broadly the same as for
the model with missing data (n = 495), with Course and GCSE-grade both showing
significance. Disposition and Self Efficacy are non-significant in both models. The
main difference between the models is found in the standard error estimates for the
GCSE grades (the regression coefficients for the models are broadly similar across
the two models) with the imputed data model allowing for a greater differentiation
of ‘GCSE-grade’, with significant differences demonstrated between more categories
compared to the initial model (the Higher B and Intermediate B groups are now signifi-
cantly different to the reference category).
Table 6. A logistic regression model of ‘dropout’ using the full/retrieved data (n = 1374).

Explanatory variables     Model with missing      Model with imputed      Model using full
                          data (n = 495)          data                    data (n = 1374)
                          est, (s.e.), p          est, (s.e.), p          est, (s.e.), p
Course UoM (ref: Trad)    −1.15, (0.26), <.001    −0.87, (0.22), <.001    −1.29, (0.16), <.001
Disposition               −0.09, (0.05), .06      −0.08, (0.04), .06      −0.13, (0.03), <.001
GCSE-grade (ref: IntC)
  Higher C                −0.44, (0.57), .44      −0.36, (0.34), .29      −0.26, (0.29), .38
  Intermediate B          −0.46, (0.32), .16      −0.64, (0.23), .007     −0.88, (0.20), <.001
  Higher B                −0.67, (0.34), .05      −0.95, (0.26), <.001    −1.02, (0.21), <.001
  A                       −1.85, (0.37), <.001    −1.55, (0.26), <.001    −2.25, (0.24), <.001
  A∗                      −4.90, (1.06), <.001    −2.74, (0.46), <.001    −3.83, (0.50), <.001
Maths Self Efficacy       −0.07, (0.1), .49       −0.06, (0.08), .48      −0.18, (0.06), <.01
In our own data set, the models did not in fact change much as a result of imputing
missing data, but we were able to show that imputation improved the models and
changed the significance of the effects of some of the important variables.
The results from our study demonstrated that the initial data sample which included a
substantial amount of missing data was likely to be biased, as a regression model using
this data set was quite different from the one based on a more complete data set that
included information subsequently collected.
Imputing the missing data proved to be a useful exercise, as it ‘improved’ the
model, particularly with respect to the parameters for GCSE-grade, but it is important
to note that it did not entirely recreate the structure of the full data set, as Disposition
and Maths Self Efficacy remained non-significant. This, however, is not that surprising
given these variables were insignificant in the original sample of 495, and it could also
be due to the self-report nature of these variables in comparison to more robust GCSE
grades. The failure to reconstruct the results from the full data set is not so much a
failure of the MI technique, but a consequence of the initial model (where Disposition
and Maths Self Efficacy were not significant). The information available to impute the
missing dropout data points was not sufficient to accurately recreate the actual relation-
ships between these variables. Incorporating additional information in the imputation
process might have rectified this to some extent.13
The results of the MI are encouraging, particularly as the amount of missing data
was relatively large (over 60% for dropout) and also missing on a ‘simple’ binary vari-
able. It is also worth noting that the model evaluated is one which shows a high degree
of imprecision (Nagelkerke’s pseudo-R² = 0.237 for the full data set). There is,
therefore, likely to also be substantial imprecision in the imputed data. This empirical
study demonstrates that even with this very difficult data set, MI still proved to be
useful.
The imputation process was a useful exercise in understanding the data and the pat-
terns of missingness. In this study, the model based on imputed data was broadly
similar to the model based on the original data (n = 495). This finding does not diminish
the usefulness of the analysis, as it reinforces the evidence that the missing data may
not have heavily biased this model. In cases where there is more substantial bias, larger
discrepancies between the imputed and non-imputed models may be expected. For the
current data, even though the missing data were not particularly influential, the imputa-
tion was still advantageous.
The most important conclusion from this paper is that missing data can have adverse
effects on analyses and imputation methods should be considered when this is an issue.
This study shows the value of MI even when imputing large amounts of missing data
points for a binary outcome variable. It is encouraging that tools now exist to enable
MI to be applied relatively simply using easy-to-access software (see Hutcheson and
Pampaka, 2012, for a review). Thus, MI techniques are now within the grasp of most edu-
cational researchers and should be used routinely in the analysis of educational data.
Acknowledgements
As authors of this paper we acknowledge the support of The University of Manchester. We are
grateful for the valuable feedback of the anonymous reviewer(s).
Funding
This work was supported by the Economic and Social Research Council (ESRC) through
various grant awards: The Transmaths (www.transmaths.org) projects [RES-062-23-1213 and
RES-139-25-0241] investigated the transition of students into post-compulsory mathematics
education, and the most recent Teleprism (www.teleprism.com) study [RES-061-025-0538]
explores the progression in Secondary education, whilst dealing with methodological chal-
lenges, including missing data.
Supplemental data
Supplemental data for this article can be accessed at http://research-training.net/missingdata/
Notes
1. List-wise deletion is one traditional statistical method for handling missing data, which
entails an entire record being excluded from analysis if any single item/question value is
missing. An alternative approach is pairwise deletion, when the case is excluded only
from analyses involving variables that have missing values.
2. Selection methods here refer to the procedures followed for the selection of explanatory
variables in regression modelling. In step-wise selection methods, the choice of predictive/
explanatory variables is carried out by an automatic procedure. The most widely used
step-wise methods are backward elimination (i.e. starting with all candidate variables and
iteratively excluding the least significant) and forward selection (i.e. starting with no
explanatory variables and iteratively adding the most significant).
3. The same applies to popular packages such as SPSS, where the only way to determine
the sample size used for each model is to check the degrees of freedom in the ANOVA
table.
4. Some statistical packages automatically delete all cases list-wise (SPSS, for example),
while others (e.g. the ‘step()’ procedure implemented in R (R Core Team 2013)) do not
allow step-wise regression to be easily applied in the presence of missing data – when
the sample size changes as a result of variables being added or removed, the procedure
halts.
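For example, with R’s built-in airquality data (which contains missing values), step() can only be run once the incomplete cases have been removed explicitly, which at least makes the deletion visible:

```r
# airquality has NAs in Ozone and Solar.R; running step() on the full data
# would fail because the sample size changes between candidate models.
complete <- na.omit(airquality)          # explicit list-wise deletion
full     <- lm(Ozone ~ Solar.R + Wind + Temp, data = complete)
selected <- step(full, trace = 0)        # backward elimination on n = 111

# How many cases the list-wise deletion discarded:
nrow(airquality) - nrow(complete)
```

Here 42 of the 153 cases are discarded before selection even begins, and the analyst has at least been forced to notice.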
5. Rubin (1987) showed that the relative efficiency of an estimate based on m imputations,
relative to one based on an infinite number of them, is approximately 1/(1 + λ/m), where
λ is the rate of missing information. On this basis, it is often reported that there is little
practical benefit in using more than 5–10 imputations (Schafer 1999; Schafer and Graham
2002).
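These efficiency figures can be computed directly in R:

```r
# Relative efficiency of m imputations versus infinitely many:
# 1 / (1 + lambda/m), where lambda is the rate of missing information.
rel.eff <- function(m, lambda) 1 / (1 + lambda / m)

# Even with 60% missing information (roughly the situation in this paper),
# 5 imputations already give about 89% efficiency, and 100 about 99%.
round(rel.eff(c(5, 10, 100), lambda = 0.6), 3)   # 0.893 0.943 0.994
```

The gain from 10 to 100 imputations is therefore small in efficiency terms; the case for larger m rests on the stability of the combined variance estimates when much data is missing.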
6. ML estimation is a statistical method for estimating population parameters (i.e. mean and
variance) from sample data that selects as estimates those parameter values maximizing the
probability of obtaining the observed data (http://www.merriam-webster.com/dictionary/
maximum%20likelihood).
7. From a statistical point of view P(Ycomplete; θ) has two possible interpretations, which guide
the choice of estimation methods for dealing with missing data:
† when regarded as the repeated-sampling distribution for Ycomplete, it describes the probability
of obtaining any specific data set among all the possible data sets that could arise
over hypothetical repetitions of the sampling procedure and data collection;
† when considered as a likelihood function for θ (the unknown parameter), the realized
value of Ycomplete is substituted into P and the resulting function for θ summarizes
the data’s evidence about the parameters.
12. As Honaker, King, and Blackwell (2013) advise:
it is often useful to add more information to the imputation model than will be present
when the analysis is run. Since imputation is predictive, any variables that would
increase predictive power should be included in the model, even if including them
would produce bias in estimating a causal effect or collinearity would preclude
determining which variables had a relationship with the dependent variable.
13. It is important to collect data that may aid the imputation process even if these data are not
part of the final model.
References
ACME. 2009. The Mathematics Education Landscape in 2009. A report of the Advisory
Committee on Mathematics Education (ACME) for the DCSF/DIUS STEM high level strategy
group meeting, June 12. Accessed March 1, 2010. http://www.acme-uk.org/downloaddoc.asp?id=139
Allison, P. D. 2000. “Multiple Imputation for Missing Data: A Cautionary Tale.” Sociological
Methods and Research 28 (3): 301–309.
Durrant, G. B. 2009. “Imputation Methods for Handling Item Non-response in Practice:
Methodological Issues and Recent Debates.” International Journal of Social Research
Methodology 12 (4): 293–304.
Fox, J. 1987. “Effect Displays for Generalized Linear Models.” Sociological Methodology 17:
347–361.
Fox, J., and J. Hong. 2009. “The Effects Package. Effect Displays for Linear, Generalized
Linear, Multinomial-Logit, and Proportional-Odds Logit Models.” Journal of Statistical
Software 32 (1): 1–24.
Gottardo, R. 2004. “EMV: Estimation of Missing Values for a Data Matrix.” Accessed January
15, 2014. http://ftp.auckland.ac.nz/software/CRAN/doc/packages/EMV.pdf
Harrell, F. E. 2008. “Hmisc: Harrell Miscellaneous.” Accessed January 15, 2014. http://cran.r-project.org/web/packages/Hmisc/index.html
Hastie, T., R. Tibshirani, B. Narasimhan, and G. Chu. 2014. “Impute: Imputation for Microarray
Data.” R Package Version 1.32.0. Accessed January 15, 2014. http://www.bioconductor.org/packages/release/bioc/manuals/impute/man/impute.pdf
Honaker, J., and G. King. 2010. “What to Do about Missing Values in Time-Series Cross-section
Data.” American Journal of Political Science 54 (2): 561–581.
Honaker, J., G. King, and M. Blackwell. 2011. “Amelia II: A Program for Missing Data.”
Journal of Statistical Software 45 (7): 1–47.
Honaker, J., G. King, and M. Blackwell. 2013. “Amelia II: A Program for Missing Data.”
Accessed January 15, 2014. http://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf
Horton, N. J., and K. P. Kleinman. 2007. “Much Ado about Nothing: A Comparison of Missing
Data Methods and Software to Fit Incomplete Data Regression Models.” The American
Statistician 61 (1): 79–90.
Hutcheson, G. D., and M. Pampaka. 2012. “Missing Data: Data Replacement and Imputation
(Tutorial).” Journal of Modelling in Management 7 (2): 221–233.
Hutcheson, G. D., M. Pampaka, and J. Williams. 2011. “Enrolment, Achievement and Retention
on ‘Traditional’ and ‘Use of Mathematics’ Pre-university Courses.” Research in
Mathematics Education 13 (2): 147–168.
Ibrahim, J. G., M.-H. Chen, S. R. Lipsitz, and A. H. Herring. 2005. “Missing-Data Methods for
Generalized Linear Models.” Journal of the American Statistical Association 100 (469):
332–346.
Kim, J. K., and W. Fuller. 2004. “Fractional Hot Deck Imputation.” Biometrika 91 (3): 559–578.
King, G., J. Honaker, A. Joseph, and K. Scheve. 2001. “Analyzing Incomplete Political Science
Data: An Alternative Algorithm for Multiple Imputation.” American Political Science
Review 95 (1): 49–69.
Lee, E.-K., D. Yoon, and T. Park. 2009. “ArrayImpute: Missing Imputation for Microarray
Data.” Accessed January 15, 2014. http://cran.uvigo.es/web/packages/arrayImpute/arrayImpute.pdf
Little, R. J. A. 1988. “Missing-Data Adjustments in Large Surveys.” Journal of Business &
Economic Statistics 6 (3): 287–296.
Little, R. J. A., and D. B. Rubin. 1987. Statistical Analysis with Missing Data. New York: John
Wiley & Sons.
Little, R. J. A., and D. B. Rubin. 1989. “The Analysis of Social Science Data with Missing
Values.” Sociological Methods & Research 18 (2–3): 292–326.
Nielsen, S. R. F. 2003. “Proper and Improper Multiple Imputation.” International Statistical
Review/Revue Internationale de Statistique 71 (3): 593–607.
Owen, M., O. Lau, K. Imai, and G. King. 2013. “Zelig v4.0-10 Core Model Reference Manual.”
Accessed December 5, 2013. http://cran.r-project.org/web/packages/Zelig/
Pampaka, M., I. Kleanthous, G. D. Hutcheson, and G. Wake. 2011. “Measuring Mathematics
Self-efficacy as a Learning Outcome.” Research in Mathematics Education 13 (2):
169–190.
Pampaka, M., J. S. Williams, G. Hutcheson, L. Black, P. Davis, P. Hernandez-Martines, and G.
Wake. 2013. “Measuring Alternative Learning Outcomes: Dispositions to Study in Higher
Education.” Journal of Applied Measurement 14 (2): 197–218.

Appendix
The analysis that identified imbalances in the distribution of missing data was a logit regression
model of missing values for the dropout variable. The R package was used for this analysis. The
code below shows a logit model of missingness based on the explanatory variables:
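A sketch of such a model (using simulated stand-in data, since the study’s data frame is not reproduced here, and the variable names from the appendix):

```r
set.seed(7)
# Stand-in data frame: 200 hypothetical students, with Dropout missing for
# those who provided no information at the end of the study.
mydata <- data.frame(
  Dropout      = ifelse(runif(200) < 0.4, NA, rbinom(200, 1, 0.3)),
  Course       = factor(sample(c("Trad", "UoM"), 200, replace = TRUE)),
  GCSEgrade    = factor(sample(c("IntC", "HigherC", "IntB", "HigherB", "A"),
                               200, replace = TRUE)),
  Disposition  = rnorm(200),
  SelfEfficacy = rnorm(200)
)

# Response for the missingness model: information provided (1) or not (0).
mydata$Provided <- as.numeric(!is.na(mydata$Dropout))

missingness.model <- glm(Provided ~ Course + Disposition +
                           GCSEgrade + SelfEfficacy,
                         family = binomial, data = mydata)
summary(missingness.model)
```

In the study itself, this is the model whose results are reported in Table 4 and Figure 2; with the simulated data above the coefficients are of course uninformative, and only the structure of the call matters.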
ords = "GCSEgrade")
The command above instructs amelia to impute 100 data sets (m = 100) using nominal (noms)
and ordinal (ords) variables and save these to the object ‘imputed.datasets’. The ‘imputed.datasets’
object holds 100 data sets containing the imputed values, each of which can be viewed or analysed
separately using the command:
imputed.datasets$imputations[[i]]
Step 5: Run the models of interest using the ‘m’ imputed data sets.
Step 6: Combine the model’s parameters.
The logit regression model of dropout (our response variable of interest) can be applied to each
of the imputed data sets, which results in 100 different models for the imputed data. These 100
models need to be combined to provide single parameter estimates across all the imputed data.
An easy method for this is to use the R library ‘Zelig’ (Owen et al. 2013). Zelig first computes
models for each individual imputed data set and saves analyses to the object ‘Zelig.model.imp’:
Zelig.model.imp <- zelig(Dropout ~ Course + Disposition +
        GCSEgrade + SelfEfficacy,
    model = "logit",
    data = imputed.datasets$imputations)