Tutorial: How to Generate Missing Data
Xijuan Zhang
University of British Columbia
Abstract
Missing data are common in psychological and educational research. With the
improvement in computing technology in recent decades, more researchers have begun
developing missing data techniques. In their research, they often conduct Monte Carlo simulation
studies to compare the performances of different missing data techniques. During such
simulation studies, researchers must generate missing data in the simulated dataset by
deciding which data values to delete. However, in the current literature, there are few
guidelines on how to generate missing data for simulation studies. Our paper is one of the
first to examine ways of generating missing data for simulation studies. We
emphasize the importance of specifying missing data rules, which are statistical models for
generating missing data. We begin the paper by reviewing the types of missing data
mechanisms and missing data patterns. We then explain how to specify missing data rules
to generate missing data with different mechanisms and patterns. We end the paper by
presenting recommendations for generating missing data for simulation studies.
Keywords: Missing Data; Incomplete Data; Simulation Studies; Creating
Missing Data; Generating Missing Data
Missing data are prevalent in many psychological and educational research studies,
particularly those where questionnaires are used to collect data and where participants are
studied over a period of time. Historically, statistical analysis methods were developed
assuming no missing data, and statistical techniques for handling missing data were hard to
implement due to their intensive computation. However, with the advance of computing
technology, beginning in the late 1980s, the problem of missing data began to receive a lot
of attention. In recent decades, more and more research articles have studied statistical
techniques for dealing with missing data. Two of the most commonly studied modern
missing data techniques are the full information maximum likelihood (FIML; Arbuckle,
1999; Schafer & Graham, 2002a) and multiple imputation (MI; Little & Rubin, 2019); a
relatively less popular method is the two-stage approach (TS; Savalei & Bentler, 2005; Yuan &
Bentler, 2000).
In addition, due to the increase in computing power, Monte Carlo simulation
studies have become routinely used by quantitative researchers to evaluate different
statistical techniques. In typical simulation studies, quantitative researchers first specify
the population parameters and distribution, then generate sample data from the
population distribution they specified, and finally, analyze the data using different
statistical techniques. With simulation studies, researchers can compare the effectiveness of
different statistical techniques since they know the true population parameters from which
the sample data are generated; therefore, simulation studies have become a valuable tool for
comparing different existing statistical techniques or for studying new statistical
techniques. Due to the importance of simulation studies, how to generate data for
simulation studies and how to design a good simulation have become research topics in their own right.
For example, many researchers (e.g., Fleishman, 1978; Foldnes & Olsson, 2016; Mattson,
1997; Olvera Astivia & Zumbo, 2015; Reinartz, Echambadi, & Chin, 2002) have examined
different ways of modelling and generating non-normal data, and have provided
recommendations for doing so in simulation studies.
In the context of conducting simulation studies for studying missing data techniques
such as FIML and MI, researchers not only have to generate sample data but also need to
generate missing data in the sample data. In other words, researchers must decide which
values in the sample data should be deleted in order to create missingness in the dataset;
and after generating missing data, researchers can analyze the incomplete data to compare
different missing data techniques such as FIML and MI.¹ However, unlike the research on
generating non-normal data, there has been almost no research that examines ways of
generating missing data and offers recommendations for simulation studies. The current
paper is one of the first to address this gap in the research.
The main purpose of our paper is to explain the statistical modelling for generating
missing data with different properties that are important for simulation studies. To design
a good simulation study involving missing data, researchers must understand the modelling
behind missing data generation and systematically manipulate different properties of the
missing data (e.g., varying one missing data property while holding the other properties
constant). However, in the current missing data literature, most simulation studies’ designs
were done haphazardly, usually based on what past simulation studies had done.
In particular, most simulation studies were not designed with the missing data generation
modelling in mind and did not systematically vary the properties of the missing data,
thus creating confounds in the results of the simulation studies. In addition, the computer
algorithms for generating missing data are often not very consistent with the desired
modelling of the missing data generation. Our paper will also address these problems in
the current literature.
¹ Generating missing data for simulation studies is different from imputing missing data in MI. For the
former, we create missing data in simulated complete datasets; however, for the latter, we fill in the missing
data based on the best estimates of the parameters.
We begin by reviewing the statistical framework for generating missing data by introducing concepts such as missing data rules, missing data mechanisms,
and missing data patterns. Specifically, we define the term “missing data rules” to mean
the statistical model for generating missing data. We then explain how to use the missing
data rules to generate missing data with different properties. Here, we focus on describing
how to generate missing data with different missing data mechanisms because missing data
mechanism is one of the most important properties that affect the performance of modern
missing data techniques, and is almost always manipulated in simulation studies involving
missing data. Finally, we conclude with recommendations for generating missing data for
simulation studies.
Preliminaries
In this paper, we use the term “missing data rule(s)" to mean a statistical model for
generating missing data. This model allows us to calculate the probability of being missing
for each subject and each variable. An example of a missing data rule is: each subject has a 20%
probability of being missing on the variable Y . In statistical terms, this missing data rule
is P (M = 1) = 0.2, where M is a random indicator variable with M = 1 indicating a
missing value in Y . This missing data rule assumes that the chance of one subject being
missing is independent of the chance of another subject being missing, a common
assumption made by researchers when generating missing data (Graham, 2010). A given
dataset can have a set of missing data rules, one for each variable, or a single missing data
rule for multiple variables.
Like other kinds of statistical models, a missing data rule has one or more parameters
associated with it. Specifically, these parameters are associated with the distribution of the
missing data indicator M . When we generate missing data, we have to specify these
parameters associated with the missing data rule. In the above example of missing data
rule, the parameter associated with the missing data rule is the 20% probability of being
missing. This parameter pertains to the population. With sample data, the parameters
associated with the missing data rule can only be estimated. Although the average of the
estimated parameter values over repeated samples is equal to the true parameter value, the
estimated parameter value in a specific sample is usually different from the true parameter
value. In conclusion, when generating missing data for simulation studies, it is important
to explicitly state the missing data rule and the parameters associated with it. As we will
show later, knowing the missing data rule makes it easy for us to figure out many
properties of the missing data being created.
In the missing data literature, missing data mechanism is usually defined as the
statistical relationship between subjects (or variables) and the probability of missing data
(Nakagawa, 2015). In this paper, we note that missing data mechanisms are equivalent to
missing data rules. More precisely, a missing data rule is a specific missing data mechanism
that describes how missing values are generated in the data. An introductory course on
missing data usually explains the three types of missing data (i.e., three types of missing
data rules) described in Rubin (1976): 1) missing completely at random (MCAR), 2)
missing at random (MAR), and 3) missing not at random (MNAR). In this section, we
review these three types of missing data mechanisms in both informal and formal terms.
Suppose a dataset contains variables Y1 , . . . , Yp , and only Y1 has missing values. If Y1 is MCAR, then the probability of a subject being missing on Y1 does not depend on any data values, observed or missing. An example of MCAR data is when
paper-form questionnaire data are missing because a house cat spilled coffee on the table.
In this case, neither observed nor missing data can predict the probability of
being missing. If Y1 is MAR, then the probability of a subject being missing depends on its
observed values of other variables but does not depend on its value of Y1 . In other words,
MAR means conditionally missing at random: conditional on the observed values of other
variables, the probability of being missing does not depend on the value of Y1 . An example
of MAR data is when shy participants are less willing to answer questions regarding their
sexuality, thus creating missing values on a sexuality survey item. In this case, if
researchers can measure participants’ shyness, they can predict the probability of missing
data on the sexuality question. If Y1 is neither MCAR nor MAR, then Y1 is MNAR, where
the probability of a subject having a missing value on Y1 depends on its value of Y1 . A
classical example of MNAR data is when participants with high income avoid answering
questions about income. In this case, the probability of missing the income data is related
to participants’ own income.
To define MAR and MNAR, we have to break down y into the observed (yobs ) and the
unobserved or missing (ymis ) parts of y; that is y = (ymis , yobs )T . In this case, since Y1 is
the only variable with missing data, ymis = y1 and yobs = (y2 , . . . , yp )T . MCAR occurs when
the distribution of M does not depend on y at all; that is, $P(M = 1 \mid y_{mis}, y_{obs}) = P(M = 1)$. MAR occurs when
the distribution of M depends on yobs but not ymis :
$$P(M = 1 \mid y_{mis}, y_{obs}) = P(M = 1 \mid y_{obs}).$$
Lastly, MNAR occurs when the distribution of M depends on ymis ; that is, when
P (M = 1|(ymis , yobs )T ) and P (M = 0|(ymis , yobs )T ) cannot be simplified further.
From the above definitions of MCAR, MAR and MNAR data, we can see that
MCAR can be viewed as a special case of MAR data or MNAR data. Specifically, MAR
data becomes MCAR data when M ’s dependency on yobs is zero; similarly, MNAR becomes
MCAR data when M ’s dependency on ymis is zero. In fact, the difference between MCAR
and MAR can be viewed on a continuum of M ’s increasing dependency on yobs ; similarly,
the difference between MCAR and MNAR can be seen on a continuum of M ’s increasing
dependency on ymis . In other words, some data can be more or less MAR depending on
how strongly M is related to yobs ; and some data can be more or less MNAR depending on
how strongly M is related to ymis . In the later sections, we will focus on explaining how to
generate missing data with varying degrees of being MAR.
Another important concept related to the types of missing data mechanisms is
ignorability. Ignorable data are the types of missing data that can be effectively handled by
modern missing data techniques such as FIML, MI and TS. Missing data need to satisfy
two conditions to become ignorable missing data: 1) the missing data are either MCAR or
MAR; 2) parameters associated with the specific missing data rule are distinct from the
parameters associated with the distribution of the variables in the dataset (Rubin, 1976).
The second condition means that the parameters associated with the distribution of M are
distinct from the parameters associated with the distribution of Y . To explain why these
conditions are needed, let θ and φ be the parameters associated with Y and M ,
respectively, and let f (y, m; θ, φ) denote the joint density of Y and M . Because θ and φ are
distinct, when the data are incomplete, the observed data likelihood can be obtained via
the marginal of yobs as follows:
$$f(y_{obs}, m; \theta, \phi) = \int f(y_{obs}, y_{mis}; \theta)\, f(m \mid y_{obs}, y_{mis}; \phi)\, dy_{mis}. \qquad (1)$$
When the data are MCAR, f (m|yobs , ymis ; φ) = f (m; φ); when the data are MAR,
f (m|yobs , ymis ; φ) = f (m|yobs ; φ). Since neither f (m; φ) nor f (m|yobs ; φ) involves ymis , we
can take f (m; φ) or f (m|yobs ; φ) out of the integral. In other words, for MCAR or MAR
data, it is sufficient to maximize $\int f(y_{obs}, y_{mis}; \theta)\, dy_{mis}$ with respect to θ if we only want to
estimate θ. There are MAR data that violate the second assumption for ignorable missing
data (i.e., θ and φ are not distinct); in such cases, statistical methods assuming ignorability
are not fully efficient but still yield consistent estimates. Therefore, in practice, ignorable missing data stand
for MCAR or MAR data and non-ignorable missing data implies MNAR data. The
advantage of ignorable data and their relationship with types of missing mechanisms
motivate researchers to generate missing data with different missing mechanisms when
studying methods for handling missing data.
Missing data pattern refers to the arrangement of observed and missing values in a
dataset (Graham, 2010). It is often confused with missing data mechanism (e.g. Grigsby &
McLawhorn, 2019). The distinction is that a specific missing data mechanism is a missing
data rule that describes the relationship between subjects and the probability of missing
whereas a specific missing data pattern is a data configuration that describes the location
of the missing values in the data.
There are generally three kinds of missing data patterns. The univariate pattern
occurs when missing values are on one variable or a group of variables that is either entirely
observed or entirely missing for each case, but all other variables are completely observed
(Schafer & Graham, 2002b) (see Figure 1a). The univariate pattern has the lowest number
of missing data patterns; that is, it has only two patterns, one where subjects have
complete data and one where subjects have missing data.
Another type of missing pattern is the monotone pattern (e.g. Newman, 2003; Schafer &
Graham, 2002b; Strike, Emam, & Madhavji, 2001). In the monotone missing pattern, a
group of variables Y1 , . . . , Yp can be ordered in such a way that if Yj is missing for a
subject, then Yj+1 , . . . , Yp are also missing (see Figure 1b). Notice that the univariate
pattern can be viewed as a special case of the monotone pattern. Monotone patterns can be
seen in longitudinal studies with attrition, where Yj represents a variable or a group of
variables collected at time j. Last, the general missing data pattern occurs when a group of
variables may be missing for any subject, creating a dataset with missing values dispersed
throughout the data matrix in a haphazard fashion (Graham, 2010) (see Figure 1c).
Although missing data pattern and missing data mechanism are distinct concepts,
they do affect each other. Given a specific missing data rule with certain types of missing
data mechanism, the number and the type of missing data pattern will be determined. For
example, suppose a dataset has variables Y1 , . . . , Yp . If the missing data rule is that each subject
has a 20% probability of being missing on the variable Y1 , then the missing data pattern is
univariate, implying two missing data patterns.
When designing simulation studies examining missing data techniques such as FIML
and MI, researchers often consider missing data patterns less important than missing data
mechanisms, probably because missing data patterns are not directly related to the
ignorability property of missing data. However, some researchers have found that the
number of missing data patterns can affect the performance of missing data
techniques (Savalei & Bentler, 2005; Zhang & Savalei, 2020). Therefore, when designing
simulation studies, researchers should manipulate both missing data mechanisms and
patterns; this requires researchers to know how to generate missing data with different
missing data mechanisms and patterns, which we will explain in detail in the following
section.
In this section, we will explain how to specify missing data rules to generate data
with different missing data mechanisms and patterns. We also describe the parameters and
various properties associated with each missing data rule. We mainly focus on MCAR and
MAR data with the univariate missing data pattern because they are the most commonly
studied missing data in the missing data literature, but we will also briefly discuss
generating MNAR data as well as generating missing data with a large number of missing
data patterns.
Missing Data Rules for MCAR Data. Missing data rules for MCAR data
always involve each subject’s probability of being missing from one or more variables. The
probability of being missing is the parameter value associated with the missing data rule,
denoted as φ earlier. This parameter value affects the expected percentage of missing values and
the expected number of missing data patterns in a sample dataset.
For MCAR data with univariate pattern, the missing data rule is that each subject
has π probability of being missing from the variable(s) with missing data. Putting it in
statistical terms, this missing data rule is P (M = 1) = π where M is the missing data
indicator, and π is the parameter associated with the missing data rule. Given this missing
data rule, researchers can determine various properties associated with the MCAR data,
including the expected percentage of missing values and the expected number of missing
data patterns in the MCAR data.
To explain how the missing data rule affects the missing data properties, let n be the
number of subjects in the data, and K be the random variable indicating the number of
subjects with missing values in the data. Given this missing data rule and assuming the
chance of one subject being missing is independent of the chance of another being missing,
K follows a binomial distribution: K ∼ Bin(n, π), where 0 ≤ π ≤ 1. Since E(K) = nπ and
Var(K) = nπ(1 − π), the expected percentage of missing values is
$$E(\Pi) = E\left(\frac{K}{n}\right) = \frac{1}{n}E(K) = \pi, \qquad (2)$$
where Π = K/n is the random variable denoting the estimated percentage of missing values in a sample. The variance of Π is
$$\operatorname{Var}(\Pi) = \operatorname{Var}\left(\frac{K}{n}\right) = \frac{1}{n^2}\operatorname{Var}(K) = \frac{\pi(1-\pi)}{n}. \qquad (3)$$
This variance shows that given our missing data rule, researchers may not always obtain
the exact π percentage of missing values in a sample dataset.
Researchers can also determine the expected number of distinct missing data patterns
in a sample, given a MCAR missing data rule. Consider two possible missing data
patterns: pattern 1 includes subjects with complete data; pattern 2 includes subjects with
missing values. For j ∈ {1, 2}, let Ij be the indicator variable of the event that pattern j is
present in at least one subject in the sample. The probability that pattern 1 is present in
at least one subject is $P(I_1 = 1) = E(I_1) = 1 - \pi^n$. The probability that pattern 2 is
present in at least one subject is $P(I_2 = 1) = E(I_2) = 1 - (1 - \pi)^n$. Let D be the number
of distinct missing data patterns: $D = \sum_{j=1}^{2} I_j$. The expected number of distinct missing
data patterns is
$$E(D) = E\left(\sum_{j=1}^{2} I_j\right) = \sum_{j=1}^{2} E(I_j) = (1 - \pi^n) + (1 - (1-\pi)^n) = 2 - \pi^n - (1-\pi)^n, \qquad (4)$$
which shows that as the sample size increases, the expected number of patterns converges
very quickly to 2, which is the maximum number of patterns for this missing data rule.
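To illustrate with assumed values, if π = 0.1 and n = 50, Equation (4) gives $E(D) = 2 - 0.1^{50} - 0.9^{50} \approx 2 - 0.005 \approx 1.99$, so both patterns are expected to appear even in a fairly small sample.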
Creating More Missing Data Patterns for MCAR Data. To generate MCAR
data with more missing data patterns, researchers can allow each subject’s chance of being
missing from one variable to be independent of its chance of being missing from another
variable. For example, if the variables Y1 , . . . , Yl have missing values, then the missing data
rule that can create the maximum number of missing data patterns is that each subject has πi
probability of being missing from variable Yi , where i ∈ {1, . . . , l}. In this case, there is a
total of l parameters: π1 , . . . , πl . For each variable Yi with missing data, the expected
percentage of missing values and the variance associated with the estimated missing
percentage are the same as those shown in (2) and (3), respectively; i.e., $E(\Pi_i) = \pi_i$ and
$\operatorname{Var}(\Pi_i) = \pi_i(1-\pi_i)/n$.
As mentioned before, this missing data rule can create the maximum number of
missing data patterns for l variables with missing data (i.e., $m = 2^l$ patterns).
However, in a given sample, some of the missing data patterns may not be realized. The
expected number of distinct missing data patterns in a sample is
$$E(D) = m - \sum_{j=1}^{m} (1 - \eta_j)^n, \qquad (5)$$
where $\eta_j$ is the probability that a subject belongs to missing data pattern j.
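As an illustration of this rule, the following is a minimal R sketch of our own (not the authors' OSF code; the sample size, number of variables, and probabilities are assumed for demonstration), in which each variable with missing data receives its own independent missing data indicator:

# MCAR rule creating the maximum number of patterns for l = 3 variables
set.seed(123)
n   <- 500
pis <- c(0.1, 0.2, 0.3)                        # pi_1, pi_2, pi_3
Y   <- matrix(rnorm(n * 3), nrow = n,
              dimnames = list(NULL, c("Y1", "Y2", "Y3")))
M <- sapply(pis, function(p) rbinom(n, 1, p))  # independent indicators, one per variable
Y[M == 1] <- NA                                # delete values where the indicator is 1
nrow(unique(is.na(Y)))                         # distinct patterns realized (at most 2^3 = 8)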
Implementing Missing Data Rules for MCAR Data. In the missing data
literature, there are generally two methods for implementing MCAR missing data rules.
The first method is randomly deleting the desired percentage of missing values (e.g.,
Enders, 2001b; Savalei & Bentler, 2005; Savalei & Yuan, 2009; Strike et al., 2001; Yuan &
Bentler, 2000). The deletion can be accomplished by deleting every ith subject (e.g.,
deleting every second subject to create 50% missing data) (e.g., Yuan & Bentler, 2000) or
deleting randomly until the desired percentage is reached (e.g., Savalei & Bentler, 2005;
Savalei & Yuan, 2009; Strike et al., 2001). One problem with this method is that the
estimated probability of being missing is equal to the expected probability of being missing
across different datasets; however, as shown in (3), there is sampling variability associated
with the estimated percentage of missing data.² Whether this problem matters for
² We note that in a planned missing data design, the percentage of missing data is held constant across samples.
simulation studies depends on the purpose of the simulation study. For example, if the
purpose of the simulation is to examine the average performance of a missing data
technique across samples with a large sample size, then this issue does not matter because
the sampling variability is very small for a large sample and does not really affect the
computation of statistics that are aggregated across samples. However, if the purpose is to
examine the performance of missing data techniques under small sample sizes, it may be
better to incorporate the sampling variability when implementing the missing data rule;
this will make the simulation more realistic.
The second method involves comparing the values of a variable that has missing data
with the corresponding values of a uniform random variable ranging between zero and one
(e.g., De Raadt, Warrens, Bosker, & Kiers, 2019; Enders, 2001a, 2004, 2010; Jamshidian &
Siavash, 2010; Kim & Bentler, 2002). Taking a concrete example, suppose that there are
200 subjects in the data and the missing data rule is that each subject has 20% probability
of being missing from the variable Y . Given that the data of 200 subjects for variable Y
are already generated, to create missing values, we first draw 200 subjects from a uniform
random variable U ranging from zero to one. Then we pair 200 subjects for Y with the 200
corresponding subjects for U . If the ith value of U is less than 0.2, then the ith subject's value
on Y should be removed. This method is equivalent to implementing the missing data rule
directly by allowing each subject in Y to have a 20% chance of being missing. In fact, we can
create a missing value indicator M from U by letting M = 1 whenever U ≤ 0.2. In other
words, instead of drawing cases from a uniform variable, we can draw 200 subjects from an
indicator random variable M that has 20% chance of being one, and then delete subjects
for Y when M equals one. We recommend this way of implementing the missing data rule
because it is the most direct and straightforward way of implementing MCAR missing data
rules. For sample R code for generating MCAR data, please refer to our Open Science
Framework (OSF) website:
https://osf.io/pmn9z/?view_only=37c891661d00406a8691ed365a8b8ff6
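To illustrate, the following is a minimal R sketch of our own (not the code posted on the OSF website; the sample size and probability of missing are assumed), implementing the MCAR rule by drawing the indicator M directly and deleting Y values wherever M = 1:

# MCAR rule: each subject has a 20% probability of being missing on Y
set.seed(123)
n      <- 200                            # sample size (assumed)
p_miss <- 0.2                            # P(M = 1), the parameter of the MCAR rule
y <- rnorm(n)                            # complete data for the variable Y
m <- rbinom(n, size = 1, prob = p_miss)  # missing data indicator M ~ Bernoulli(p_miss)
y[m == 1] <- NA                          # delete Y values where M = 1
mean(is.na(y))                           # estimated percentage of missing; varies around p_miss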
For MAR data, the probability of a subject having a missing value depends on the
observed values of other variables. In other words, researchers can predict the probability
of missing values from the observed values of other variables. For the rest of the paper, we
call the variable that can predict the probability of missing values the missing data
predictor. A missing data predictor can be one single variable in the dataset or it can be a
new variable that is a linear combination of several variables in the dataset.
Generating MAR data is more complicated than generating MCAR data in two ways.
First, the missing data rules for MAR data are more complicated than those for MCAR
data. The missing data rules for MAR data can be organized into several categories: 1)
single cutoff method; 2) multiple cutoff method; 3) percentile method; 4) logistic regression
method. The most commonly used MAR missing data rule in psychological research is the
single cutoff method (e.g., Allison, 2000; Enders, 2004; Musil, Warner, Yobas, & Jones,
2002; Yuan & Bentler, 2000; Yuan & Savalei, 2014). Second, researchers can vary the
strength and shape of the dependency between the missing data indicator and the missing
data predictor. The strength of the dependency can be weak or strong; the shape of the
dependency can be linear or curvilinear. The dependency commonly used in simulation
studies is strong and linear (e.g., Yuan & Savalei, 2014; Yuan, Tong, & Zhang, 2015). In
addition, like MCAR data, MAR data can vary in the number and type of missing data
patterns, with the univariate pattern being the most commonly studied pattern (e.g.,
Enders, 2001b, 2010; Jia & Wu, 2019; Yuan & Bentler, 2000).
In the following sections, we will explain the different types of missing data rules for
generating MAR missing data for simulation studies. Within each type of missing data
rule, we explain the different kinds of patterns and strengths of dependency. We focus
more on the MAR data generated using the single cutoff method with univariate pattern
and linear dependency because this kind of MAR data is more commonly used in
simulation studies involving missing data.
values, we first calculate the unconditional probability of a subject being missing from Y1 :
$$\pi_{miss} = P(M = 1) = \pi_1 \pi_0 + \pi_2 (1 - \pi_0).$$
Then let K be a random variable indicating the number of subjects with missing data in a
sample. We know K ∼ Bin(n, πmiss ). Thus, the expected percentage of missing values
across samples is
$$E\left(\frac{K}{n}\right) = \pi_{miss} = \pi_1\pi_0 + \pi_2(1-\pi_0), \qquad (7)$$
and the variance of the estimated percentage is
$$\operatorname{Var}\left(\frac{K}{n}\right) = \frac{\pi_{miss}(1-\pi_{miss})}{n}. \qquad (8)$$
Notice that Equations (7) and (8) are the same as Equations (2) and (3), except that π in (2)
and (3) is replaced by πmiss in (7) and (8). Similarly, by replacing π in (4) with πmiss ,
researchers can find the expected number of patterns for this MAR missing data rule:
$$E(D) = 2 - \pi_{miss}^n - (1 - \pi_{miss})^n.$$
Since M and U are two binary variables, researchers can measure the strength of
dependency between M and U using the absolute risk difference (ARD) or odds ratio (OR),
which are standard association measures for binary variables. The respective equations for
ARD and OR are
$$ARD = |P(M = 1 \mid U = 1) - P(M = 1 \mid U = 0)| = |\pi_1 - \pi_2|, \qquad (9)$$
and
$$OR = \frac{P(M = 1 \mid U = 1)/(1 - P(M = 1 \mid U = 1))}{P(M = 1 \mid U = 0)/(1 - P(M = 1 \mid U = 0))} = \frac{\pi_1/(1 - \pi_1)}{\pi_2/(1 - \pi_2)}. \qquad (10)$$
Large ARD values indicate strong dependency; OR values farther away from one indicate
strong dependency. Notice that OR is not defined when 1 − π1 = 0 or 1 − π2 = 0; therefore,
if any of these cases occurs, ARD should be used to measure the strength of dependency.
Equations (9)-(10) measure the strength of dependency between M and U at the
population level. At the sample level, the estimated strength of dependency may vary from
sample to sample. The variances associated with the estimated ARD and estimated
log(OR) are as follows (see Agresti & Kateri, 2011, for derivation):
$$\operatorname{Var}(\Pi_1 - \Pi_2) = \operatorname{Var}(\Pi_1) + \operatorname{Var}(\Pi_2) = \frac{\pi_1(1-\pi_1)}{n_1} + \frac{\pi_2(1-\pi_2)}{n - n_1}, \qquad (11)$$
and
$$\operatorname{Var}(\log OR) = \frac{1}{A} + \frac{1}{B} + \frac{1}{C} + \frac{1}{D}, \qquad (12)$$
where A, B, C and D are defined in Table 1.
Since OR is closely related to the logistic regression model, we can also define the
relationship between M and U using the logistic regression framework. In other words, the
log-odds of M can be predicted by U :
$$\log \frac{P(M = 1)}{1 - P(M = 1)} = \beta_0 + \beta_1 U, \qquad (13)$$
where
$$\beta_0 = \log \frac{P(M = 1 \mid U = 0)}{1 - P(M = 1 \mid U = 0)} = \log \frac{\pi_2}{1 - \pi_2}, \qquad \beta_1 = \log(OR) = \log \frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}.$$
In the logistic regression, the higher the β1 value, the stronger the dependency is. We note
that Equation (13) shows that the missing data rule specified using the single cutoff
method is actually equivalent to a logistic regression model, which can be directly used to
generate MAR data using the logistic regression method. We will explain more about the
connections between these two methods in the section about the logistic regression method.
An example of a missing data rule with the strongest dependency is if a subject has
Y2 ≥ 0, then its Y1 value is always missing (see Table 2a for the contingency table for this
missing data rule). The cutoff point used in this missing data rule is Y2 = 0. The three
parameters associated with this rule are π0 = P (U = 1) = P (Y2 ≥ 0) = 0.5, π1 = 1, and
π2 = 0, assuming Y2 follows the standard normal distribution. To demonstrate the property
of this missing data rule, we have simulated a large sample dataset (n = 1, 000, 000) and
then generated missing values in the dataset according to the above missing data rule.
Since Y2 is a continuous variable while M is a categorical random variable, we used
boxplots to show the association between Y2 and M . Figure 2a shows the association
between Y2 and M based on this missing data rule: there is a complete separation of
the boxplot for M = 0 from the one for M = 1 along the Y2 = 0 cutoff point; this indicates
that researchers can accurately predict the value of M based on the value of Y2 ; in other
words, M and Y2 are highly associated with each other. In addition, another interesting
property for missing data rules with the strongest dependency is that the percentage of
missing values only depends on the parameter π0 . For the missing data rule in our example,
the expected percentage of missing values calculated using Equation (7) is πmiss = π0 .
An example missing data rule with an even weaker dependency is if a subject has
Y2 ≥ 0, then its Y1 value has 60% probability of being missing; otherwise, its Y1 has 40%
probability of being missing (see Table 2c for contingency table). The parameters are
π0 = 0.5, π1 = 0.6 and π2 = 0.4. The ARD and OR are 0.2 and 2.25. The logistic equation
is $\log \frac{P(M=1)}{1-P(M=1)} = \log(0.67) + \log(2.25)\,U$. Figure 2c shows that the boxplot of Y2 for
subjects with M = 1 almost completely overlaps with the one for M = 0, making the
prediction of M based on Y2 only slightly better than chance.
Creating More Missing Data Patterns Under the Single Cutoff Method.
To generate MAR data with many missing data patterns, we can let the different missing
data indicators depend on different missing data predictors. For example, if two variables,
Y1 and Y2 , have MAR missing data, we can let the probabilities of missing values for Y1
and Y2 depend on the observed values of Y3 and Y4 , respectively. This way of creating
missing data patterns can be used in combination with the single-cutoff, multiple-cutoff,
percentile, or logistic regression method for generating MAR data.
In the case of the single-cutoff method, an example missing data rule that can create
the maximum number of patterns (i.e., four patterns) for two variables Y1 and Y2 with
missing data is if the subject has Y3 ≥ a1 , then its Y1 has π1 probability of being missing,
otherwise, Y1 has π2 probability of being missing; if the subject has Y4 ≥ a2 , then its Y2 has
π3 probability of being missing; otherwise Y2 has π4 probability of being missing. With this
missing data rule, we can still use Equation (5) (with m = 4) to calculate the expected
number of patterns in a sample data. However, for MAR data, the probability of each
missing data pattern (i.e., η1 , . . . , η4 in Equation (5)) also depends on the correlation
between the missing data predictors Y3 and Y4 . In the most extreme case, if Y3 and Y4 have
a correlation of one and a1 equals a2 , then this missing data rule creates data with only
two patterns (i.e., univariate pattern); in other words, the probabilities of the other two
patterns are both zero. Therefore, to maximize the number of patterns in a sample, we
suggest generating missing data predictors that are moderately correlated. With
moderately correlated missing data predictors, the probability of each pattern is greater
than zero, so that (5) quickly converges to m, the maximum number of patterns, as
n → ∞.
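A minimal R sketch of this approach (our own illustration, not the authors' code; the sample size, cutoffs, probabilities, and a correlation of 0.3 between the predictors are assumed):

# Two single cutoff rules with different predictors create up to four patterns
set.seed(123)
n  <- 500
y3 <- rnorm(n)                                   # predictor for missingness on Y1
y4 <- 0.3 * y3 + rnorm(n, sd = sqrt(1 - 0.3^2))  # predictor for missingness on Y2, r = 0.3
y1 <- rnorm(n)
y2 <- rnorm(n)
m1 <- rbinom(n, 1, ifelse(y3 >= 0, 0.6, 0.2))    # rule for Y1
m2 <- rbinom(n, 1, ifelse(y4 >= 0, 0.5, 0.1))    # rule for Y2
y1[m1 == 1] <- NA
y2[m2 == 1] <- NA
nrow(unique(cbind(is.na(y1), is.na(y2))))        # distinct patterns realized (at most 4)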
Implementing Missing Data Rules for the Single Cutoff Method. In the
missing data literature, there are three different methods to implement missing data rules
associated with the single-cutoff method, each with some drawbacks. The first method
involves setting a missing data rule, and then applying this rule subject by subject until
the desired percentage of missing data is reached (e.g., Enders, 2004). This method is
highly problematic because it violates the assumption that each subject is coming from the
same population. If we assume that each subject comes from the same population and thus
follows the same missing data rule, then it does not make sense to apply the missing
data rule to some cases but not to others. A consequence of this method is that the
percentage of missing values in a sample may be very different from the expected
percentage of missing data given by the missing data rule (see Equation (7)). Another
problem with this method is that it is impossible to determine the strength of dependency
between the missing data indicator and the missing data predictor since the missing data
rule is not applied to every subject.
The second method involves deleting a subject whenever its percentile ranking in a
sample is higher than the desired missing data percentage (e.g., Enders, 2001b, 2010;
Savalei & Yuan, 2009). For example, suppose that each subject’s probability of being
missing from Y1 depends on its Y2 value, and we want k percent of subjects in a sample to
have missing values in Y1 . Using this method, subjects whose Y2 values are in the top k
percent will have their Y1 values deleted. This is equivalent to the missing data rule that
sets a cutoff point corresponding to the quantile point for the top k percent values of Y2
and that says if a subject’s Y2 value is greater than the cutoff point, then its Y1 ’s probability
of being missing is one, otherwise, Y1 ’s probability of being missing is zero. Notice that
this missing data rule is the one with the strongest dependency between the missing data
indicator and predictor. Therefore, one disadvantage with this method is that it does not
allow researchers to vary the strength of dependency. Another problem with this method is
that the cutoff point may vary across datasets. In other words, the estimated percentage of
missing values is forced to be the same across datasets by shifting the cutoff point. Shifting
the cutoff point violates the assumption that the same missing data rule should be applied
to datasets that come from the same population; thus, we suggest setting a cutoff point
and holding it constant across datasets when generating MAR data.
The third method involves deleting the desired percentage of subjects that are above
or below a specific cutoff point (e.g., Savalei & Bentler, 2005; Yuan & Bentler, 2000). This
method allows researchers to generate MAR data with different strengths of dependency,
and is almost equivalent to implementing the missing data rule directly to all subjects.
However, one problem with this method is that the estimated values for the parameters
(i.e., π1 and π2 in Equation (6)) associated with the missing data rule are forced to be the same across
datasets, even though, as shown in (11), there are variances associated with the
estimated values across samples. This problem is trivial if researchers are only interested in
large-sample simulations, but if researchers want to study small samples with missing data,
it may be important to incorporate the variances of parameter estimates.
Since each of the three methods mentioned above has drawbacks, we do not
recommend any of these methods. We recommend that researchers explicitly specify a
missing data rule before generating missing data, and then apply this missing data rule to
every subject in the dataset. If researchers have the desired percentage of missing data,
they should manipulate the parameters associated with the missing data rule so that the
expected percentage of missing equals the desired percentage of missing values.
Specifically, researchers can manipulate the parameters π0 , π1 and π2 in Equation (7) so
that πmiss in (7) equals to the desired percentage of missing values. Similarly, if researchers
have the desired strength of dependency between missing data indicator and predictor,
they can manipulate the parameter values so that the ARD and OR in Equations (9)-(10)
show the desired strength of dependency (see our OSF website for sample R code).
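For example, a minimal R sketch of this recommended implementation (our own illustration, not the authors' OSF code; the parameter values π1 = 0.6, π2 = 0.4, cutoff a = 0, and the data-generating model are assumed):

# Single cutoff rule: if Y2 >= a, Y1 is missing with probability pi1, otherwise pi2
set.seed(123)
n   <- 200
pi1 <- 0.6
pi2 <- 0.4
a   <- 0
y2  <- rnorm(n)                                # missing data predictor (fully observed)
y1  <- 0.5 * y2 + rnorm(n, sd = sqrt(0.75))    # variable that will have MAR missingness
u   <- as.numeric(y2 >= a)                     # cutoff indicator U
p   <- ifelse(u == 1, pi1, pi2)                # each subject's probability of being missing
m   <- rbinom(n, 1, p)                         # indicator M, applied to every subject
y1[m == 1] <- NA
mean(is.na(y1))                                # close to pi_miss = pi1 * 0.5 + pi2 * 0.5 = 0.5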
Multiple Cutoff Method. When using the multiple cutoff method to generate
MAR data, we need to specify multiple cutoff points in a missing data predictor. One
advantage of the multiple cutoff method is that it can be used to create a nonlinear
relationship between the missing data indicator and the missing data predictor (e.g.,
Collins, Schafer, & Kam, 2001; Graham, 2010). A nonlinear relationship occurs when
subjects with extreme values on the missing data predictor have a higher or lower
probability of being missing than subjects with mid-range values on the predictor. In
contrast, a linear relationship occurs when the probability of being missing gradually
increases or decreases as the value of the missing data predictor increases. In the following
subsections, we explain how to specify the missing data rules associated with the multiple
cutoff method to create a linear and nonlinear relationship between the missing data
indicator and the missing data predictor.
Missing Data Rules for the Multiple Cutoff Method. When using the
multiple cutoff method to create a nonlinear relationship between the missing data
indicator and the missing data predictor, we need to specify an upper cutoff and a lower
cutoff. Suppose the probability of missing values on Y1 depends on two cutoff points, a and
−a, in the variable Y2 . Let M be the missing data indicator, and U be the indicator
denoting whether the Y2 value falls outside the two cutoff points: U = 1 when Y2 ≥ a or
Y2 ≤ −a, and U = 0 when −a < Y2 < a. In statistical terms, the missing data rule is
P (M = 1|U = 1) = π1 and P (M = 1|U = 0) = π2 . Notice that this missing data rule is the
same as the one for the single cutoff method. In other words, in the case of a nonlinear
relationship, a missing data rule associated with the multiple cutoff method can be framed
to be the same as the missing data rule associated with the single cutoff method. As a
result, in this case, all equations for the single cutoff method can be used for the multiple
cutoff method.
On the other hand, to create a linear relationship between the missing data indicator
and the missing data predictor, researchers need to specify at least two cutoff points in the
missing data predictor. Most of the time, researchers specify three or four cutoff points,
which are usually the quartile or quantile points of the missing data predictors (e.g.,
Graham, 2010; Strike et al., 2001). In other words, researchers can use the quartile or
quantile points to divide the values of the missing data predictor into four or five groups,
and each subject’s value on the missing data predictor has an equal chance to be in any of
the groups. Going from the group with the lowest values to the group with the highest
values, the probability of being missing usually increases or decreases at a constant rate
(e.g., Graham, 2010; Strike et al., 2001).
For example, let V be an indicator of the quartile group of Y2 :
$$V = \begin{cases} 1 & \text{if } Y_2 < Q_1 \\ 2 & \text{if } Q_1 \le Y_2 < Q_2 \\ 3 & \text{if } Q_2 \le Y_2 < Q_3 \\ 4 & \text{if } Y_2 \ge Q_3, \end{cases} \qquad (14)$$
where Q1 , Q2 , and Q3 are the quartile points in Y2 . The missing data rule is that
$$P(M = 1 \mid V = 1) = \pi_1, \quad P(M = 1 \mid V = 2) = \pi_2,$$
$$P(M = 1 \mid V = 3) = \pi_3, \quad \text{and} \quad P(M = 1 \mid V = 4) = \pi_4, \qquad (15)$$
where π1 = 0.3, π2 = 0.4, π3 = 0.5 and π4 = 0.6 in this example. There are five parameters
associated with this missing data rule. Four of them, of course, are π1 , π2 , π3 and π4 . The
fifth parameter is the one related to the probability of V : P (V = i) = π0 = 0.25 where
i ∈ {1, 2, 3, 4}; the value for π0 is set when researchers decide to use quartile cutoff points.
For each parameter, researchers can calculate the variances associated with the estimates of
the parameters. Let n be the total number of subjects, and n0 = 0.25n be the number of
subjects in each quartile group. The variance for the estimated π0 is
$$\operatorname{Var}(\Pi_0) = \frac{\pi_0(1-\pi_0)}{n}. \qquad (16)$$
Similarly, the variance for each estimated πj , j ∈ {1, 2, 3, 4}, is
$$\operatorname{Var}(\Pi_j) = \frac{\pi_j(1-\pi_j)}{n_0}. \qquad (17)$$
Table 3 shows the contingency table for M and V . Using this contingency table, we can
calculate each subject’s probability of being missing by calculating the marginal probability
of M = 1:
$$\pi_{miss} = P(M = 1) = \pi_1\pi_0 + \pi_2\pi_0 + \pi_3\pi_0 + \pi_4\pi_0$$
$$= (0.3)(0.25) + (0.4)(0.25) + (0.5)(0.25) + (0.6)(0.25) = 0.45. \qquad (18)$$
Let n be sample size and K be the number of subjects with missing data. We know that
K ∼ Binomial(n, πmiss ). Therefore, the expected percentage of missing values is
$$E\left(\frac{K}{n}\right) = \pi_{miss} = \pi_1\pi_0 + \pi_2\pi_0 + \pi_3\pi_0 + \pi_4\pi_0 = 0.45, \qquad (19)$$
and the variance is
$$\operatorname{Var}\left(\frac{K}{n}\right) = \frac{\pi_{miss}(1-\pi_{miss})}{n}. \qquad (20)$$
The expected number of distinct missing patterns can be calculated by Equation (4) by
setting π = πmiss . Overall, the multiple cutoff method is very similar to the single cutoff
method. The main difference is that with the single cutoff method, the missing data predictor
only has one cutoff point, whereas with the multiple cutoff method, the missing data
predictor usually has three or four cutoff points.
Similar to the single cutoff method, as the AARD increases, the strength of
dependency increases. With quartile points, the maximum AARD is 1/3 = 0.33. In this
case of maximum AARD, the parameters need to be set as π1 = 0, π2 = 0.33, π3 = 0.67
and π4 = 1. Figure 3a shows the relationship between the missing data predictor Y2 and
the missing data indicator M for our previous example with AARD = 0.1, and Figure 3b
shows the relationship between Y2 and M for the example with the maximum AARD (i.e.,
AARD = 0.33). As expected, as the strength of dependency increases (i.e., comparing
Figure 3a and 3b), the boxplot for M = 1 overlaps less with the one for M = 0. However,
with the multiple cutoff method, we can no longer achieve the case where the boxplot for
M = 1 is completely separate from the boxplot for M = 0 (as shown in Figure 2a); this
means that with the multiple cutoff method, researchers can never achieve the strongest
dependency which they can do with the single cutoff method. In fact, as the number of
cutoff points increases, the maximum strength of dependency we can create decreases. The
reason is that the possible range of the probability of missing values is from 0 to 1, and as
the number of cutoff points increases, researchers need to divide this range into smaller and
smaller pieces, thus the maximum AARD decreases.
For the example missing data rule in (15), the corresponding logistic regression model is approximately
$$\log \frac{P(M = 1)}{1 - P(M = 1)} = -1.25 + 0.42\,V, \qquad (22)$$
where the regression coefficients are obtained by fitting a straight line describing the
relationship between V and the log-odds of M . As the regression coefficient for V
increases, the strength of dependency increases. However, similar to the single cutoff
method, in the case of the maximum strength of dependency (i.e., when AARD = 0.33),
the logistic regression model cannot be estimated because the log-odds of M for V = 4 (or
for V = 1) is not defined.3
In conclusion, we can use AARD or the coefficient from the logistic regression model
to measure the strength of dependency when we use the multiple cutoff method to specify
missing data rules. Higher AARD or regression coefficient value indicates a higher strength
of dependency; however, in the case of the maximum dependency, we can only calculate
AARD as the regression coefficient is undefined.
³ If the parameters in (15) are set as π1 = 0, π2 = 0.33, π3 = 0.67 and π4 = 1, then when V = 4 the log-odds is $\log \frac{P(M=1)}{1-P(M=1)} = \log(1/0)$, which is undefined. If the parameters are set as π1 = 1, π2 = 0.67, π3 = 0.33 and π4 = 0, the log-odds when V = 1 is likewise undefined.
Implementing Missing Data Rules for the Multiple Cutoff Method. In the
missing data literature, to implement missing data rules associated with the multiple cutoff
method, researchers usually just delete the desired percentage of subjects that are below
the lowest cutoff, above the highest cutoff, or between two cutoffs (e.g., Graham, 2010;
Strike et al., 2001). For example, to implement the missing data rule in (15), researchers
will delete 30% of subjects with Y2 < Q1 , 40% of subjects with Q1 ≤ Y2 < Q2 , and so on.
With this method, the estimated values for the parameters π1 , π2 , π3 and π4 are held
constant across the datasets; in other words, there is no sampling variability for the
parameter estimates. This issue may be a problem if researchers want to study small
samples. Once again, we recommend researchers specify a missing data rule and then apply
this missing data rule to every subject in the dataset.
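A minimal R sketch of this recommended implementation for the quartile-based rule in (15) (our own illustration, not the authors' OSF code; the sample size and the assumption that Y2 is standard normal in the population are ours):

# Multiple cutoff rule: P(M = 1 | V = j) = 0.3, 0.4, 0.5, 0.6 for quartile groups j = 1,...,4
set.seed(123)
n  <- 200
pj <- c(0.3, 0.4, 0.5, 0.6)
y2 <- rnorm(n)
y1 <- 0.5 * y2 + rnorm(n, sd = sqrt(0.75))
v  <- cut(y2, breaks = qnorm(c(0, 0.25, 0.5, 0.75, 1)), labels = FALSE)  # population quartile group V
m  <- rbinom(n, 1, pj[v])            # apply the rule to every subject
y1[m == 1] <- NA
mean(is.na(y1))                      # close to pi_miss = 0.45 from Equation (18)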
Percentile Method.
Missing Data Rules for the Percentile Method. The percentile method is an
extension of the multiple cutoff method. In the percentile method, each subject’s
probability of being missing depends on its percentile rank in the missing data predictor,
therefore, it can be viewed as the multiple cutoff method where each subject has its own
cutoff point based on its percentile rank.
To define the missing data rule formally, suppose that a subject’s probability of being
missing from Y1 is related to its percentile rank on the missing data predictor Y2 . Again,
let M be the missing data indicator. If there is a direct relationship between the missing
data indicator and the missing data predictor, then the missing data rule is if a subject is
at kth percentile on Y2 , then it has k% probability of being missing from Y1 or
P (M = 1|Y2 = qk ) = k/100 where qk is the Y2 value corresponding to its kth percentile. If
there is an inverse relationship, then the missing data rule is: if a subject is at the kth
percentile on Y2 , then it has a (100 − k)% probability of being missing from Y1 , or
P (M = 1|Y2 = qk ) = 1 − k/100. These two missing data rules are the only possible missing
data rules associated with the percentile method. Since the percentile method only involves
these two missing data rules, there are no parameter values researchers need to consider
when generating MAR data using the percentile method.
Since the CDF is a function that maps a value of a random variable to its percentile rank, this
means the percentile ranks of all possible Y2 values are distributed as the standard uniform
distribution. As a result, the expected percentile rank of a subject is the 50th percentile;
thus, each subject’s probability of being missing is 50% or P (M = 1) = 0.5. Let n be
sample size and K be the number of subjects with missing data. We know K ∼ Bin(n, 0.5).
Therefore, the expected percentage of missing values is
$$E\left(\frac{K}{n}\right) = 0.5. \qquad (24)$$
The expected number of distinct missing patterns can be calculated by Equation (4) by
setting π = 0.5.
With the percentile method, researchers cannot vary the strength of dependency. The
reason is that the two missing data rules associated with the percentile method only vary
in the direction of dependency between the missing data indicator and the missing data
predictor, and do not vary in the strength of dependency. Since the percentile method can
be viewed as the multiple cutoff method with a large number of cutoffs, the strength of
dependency created by the percentile method is less than the maximum strength created
by the single cutoff method (see Figure 2a) or by the multiple cutoff method with quartile
cutoffs (see Figure 3b). Figure 4 shows the relationship between Y2 and M for the two
missing data rules. As expected, relative to the boxplots in Figure 2a and 3b, the boxplots
for M = 0 and M = 1 in Figure 4a or 4b have more overlap with each other.
Perhaps, one way to quantify the strength of dependency created by the percentile
method is to find a logistic regression model that approximates the missing data rule.
Based on a large sample simulation (n = 1, 000, 000) where Y2 follows the standard normal
distribution, if the probability of being missing from Y1 is directly related to the percentile
rank of Y2 , an approximate logistic regression model is
$$\log \frac{P(M = 1)}{1 - P(M = 1)} = 1.70\,Y_2. \qquad (26)$$
If the probability of being missing from Y1 is inversely related to the percentile rank of Y2 ,
then the logistic regression is the same as the above except that the coefficient 1.70 is
replaced with -1.70. Equation (26) shows that the missing data rule specified using the
percentile method can also be specified using the logistic regression method (which will be
explained in the next section). Therefore, the percentile method can also be viewed as a
part of the logistic method.
The advantage of using the percentile method is that the probability of missing
values gradually increases or decreases as the value of the missing data predictor Y2
increases. This gradual change in probability as Y2 increases is more realistic than the sudden
change in probability as Y2 passes a certain cutoff, as in the single or multiple
cutoff method. However, we do not recommend the percentile method to generate MAR
data because this method does not allow researchers to vary the strength of dependency
and the expected percentage of missing data. Alternatively, according to Equation (26),
researchers can use the logistic regression method to generate MAR data equivalent to
those created by the percentile method. With the logistic regression method, researchers
can vary the strength of dependency and the percentage of missing values (see the next
section for details).
Implementing Missing Data Rules for the Percentile Method. In the
missing data literature, to implement the missing data rule associated with the percentile
method, researchers usually apply the missing data rule in an ascending order according to
the value of the missing data predictor (i.e., from the lowest Y2 value to the highest Y2
value) until the desired percentage of missing data is reached (e.g., Enders, 2001a). This
implementation is very problematic. If researchers believe that each subject comes from the
same population and thus follows the same missing data rule, it does not make sense that
they apply the missing data rule to only a fraction of the subjects. As we have mentioned
above, the expected percentage of missing data is 50% when the percentile method is used.
However, with this implementation, the percentage of missing data in a dataset is
commonly set to 5% or 15%, which is highly unlikely given this missing data rule.
If researchers want to use the percentile method, they should apply the missing data
rule to each subject. In addition, they should calculate each subject’s percentile rank on Y2
based on the population distribution of Y2 , not based on the sample distribution of Y2
values (see our OSF website for sample R code).
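A minimal R sketch of this recommendation (our own illustration, not the authors' OSF code; it assumes Y2 is standard normal in the population, so pnorm() gives the population percentile rank):

# Percentile rule (direct relationship): P(M = 1 | Y2 = q_k) = k/100
set.seed(123)
n  <- 200
y2 <- rnorm(n)
y1 <- 0.5 * y2 + rnorm(n, sd = sqrt(0.75))
p  <- pnorm(y2)          # population percentile rank of each Y2 value, in [0, 1]
m  <- rbinom(n, 1, p)    # apply the rule to every subject
y1[m == 1] <- NA
mean(is.na(y1))          # close to the expected 50% in Equation (24)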
Logistic Regression Method. As shown before, the missing data rules associated
with the single cutoff, multiple cutoff, and percentile methods can be reframed as logistic
regression models (see Equations (13), (22), and (26)). In other words, the single cutoff,
multiple cutoff, and percentile methods are all related to the logistic regression method for
generating MAR data. In this section, we explain how to directly use logistic regression
models to generate MAR data.
Missing Data Rules for the Logistic Regression Method. When using the
logistic regression method to generate MAR data, we can view the logistic regression model
as the missing data rule, and the population regression coefficients associated with the
model as the parameters associated with the missing data rule. For example, if each
subject’s probability of being missing from Y1 is related to the missing data predictor Y2 ,
then the logistic regression model for subject i is
$$\log \frac{P(M_i = 1 \mid y_{2,i})}{1 - P(M_i = 1 \mid y_{2,i})} = \beta_0 + \beta_1 y_{2,i}, \qquad (27)$$
where M is the missing data indicator and y2,i is subject i’s value on Y2 . The parameters
associated with the missing data rule are β0 and β1 .4 Conditional on the value of Y2 , each
subject’s (or subject i’s) probability of being missing is given by
$$P(M_i = 1 \mid y_{2,i}) = \frac{1}{1 + e^{-\beta_0 - \beta_1 y_{2,i}}}. \qquad (28)$$
Because the above function is continuous, it means the probability of being missing for Y1
gradually increases or decreases as the value of Y2 increases, an advantage shared with the
percentile method.
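A minimal R sketch of the logistic regression method (our own illustration, not the authors' OSF code; the coefficients β0 = −0.5 and β1 = 1 and the data-generating model are assumed):

# Logistic regression rule: P(M = 1 | Y2) follows Equation (28)
set.seed(123)
n     <- 200
beta0 <- -0.5
beta1 <- 1
y2 <- rnorm(n)
y1 <- 0.5 * y2 + rnorm(n, sd = sqrt(0.75))
p  <- plogis(beta0 + beta1 * y2)   # 1 / (1 + exp(-beta0 - beta1 * y2))
m  <- rbinom(n, 1, p)              # missing data indicator, applied to every subject
y1[m == 1] <- NA
mean(is.na(y1))                    # no closed form for the expected percentage; see Equation (29)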
With the logistic regression, there is no simple formula for calculating the expected
percentage of missing data.5 We can use computer simulation to estimate the expected
percentage of missing by calculating the mean of the probabilities in a sample with a large
sample size (e.g., n = 100, 000):
$$\pi_{miss} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{1 + e^{-\beta_0 - \beta_1 y_{2,i}}}. \qquad (29)$$
⁴ With sample data, the regression coefficients and the variances associated with the coefficients can be
estimated using the maximum likelihood method. More details can be found in any textbook on logistic
regression (e.g., Hilbe, 2009).
⁵ The reason is that it is hard to solve $P(M = 1) = E\left(\frac{1}{1 + e^{-\beta_0 - \beta_1 Y_2}}\right)$ analytically, since it involves finding
the expected value of a nonlinear transformation of a random variable.
situation relative to the single cutoff and multiple cutoff methods. However, the
disadvantage of the logistic regression method is that it does not allow researchers to set a
very strong dependency between the missing data indicator and predictor.
Generating MNAR Missing Data. MNAR data are less studied in the missing
data literature relative to the MCAR and MAR data because most statistical methods for
handling missing data are unable to handle MNAR data. Generating MNAR data is very
similar to generating MAR data. Recall that the only difference between MAR and MNAR
data is that in MAR data, the probability of missing values for one variable depends on the
observed values of another variable, but in MNAR data, the probability of missing values
depends on the variable’s own value. Therefore, when generating MNAR missing data for
simulation studies, researchers can change the missing data predictor to the variable with
missing values, and then use one of the methods for generating MAR data to generate
MNAR data. For example, suppose a missing data rule that generates MAR data says
when Y2 value is above a cutoff point a, Y1 has π1 probability of being missing, otherwise,
Y1 has π2 probability of being missing. To change this MAR missing data rule to one that
generates MNAR data, we simply have to change the variable Y2 to Y1 ; therefore, the
corresponding missing data rule for generating MNAR data is when Y1 value is above a
cutoff point a, Y1 has π1 probability of being missing; otherwise, Y1 has π2 probability of
being missing. In summary, by changing the missing data predictor to the variable with
missing data, researchers can change all the MAR missing data rules to MNAR missing
data rules, and then generate MNAR missing data according to the MNAR missing data
rules.
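A minimal R sketch of this MNAR rule (our own illustration, not the authors' code; the parameter values are assumed):

# MNAR rule: the missing data predictor is Y1 itself
set.seed(123)
n   <- 200
pi1 <- 0.6   # P(M = 1 | Y1 >= a)
pi2 <- 0.4   # P(M = 1 | Y1 < a)
a   <- 0
y1  <- rnorm(n)
p   <- ifelse(y1 >= a, pi1, pi2)  # probability depends on Y1's own (possibly unobserved) value
m   <- rbinom(n, 1, p)
y1[m == 1] <- NA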
When generating missing data for simulation studies, properties of the data that are
not related to the missing data rule may also affect the results of the simulation studies. In
this section, we explain two important factors of the data that are not related to the
missing data rule but can often affect the performance of common missing data techniques
such as FIML and MI.
First, the correlations among the variables in the dataset may affect how well missing
data techniques (e.g., FIML and MI) handle MAR data. On the one hand, as mentioned
previously, the correlations among variables may affect the number of missing data
patterns in a MAR dataset. Specifically, if researchers want to create more missing data
patterns by letting different missing data indicators depend on different missing data
predictors, then the more correlated the missing data predictors are, the fewer missing data
patterns there will be. The number of missing data patterns, in turn, may affect the
performance of missing data techniques (Savalei & Bentler, 2005; Zhang & Savalei, 2020).
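The following R sketch (with assumed cutoff rules and sample size) illustrates this point: two missing data indicators depend on two missing data predictors, Y3 and Y4, and the frequencies of the resulting missing data patterns are tallied for uncorrelated and for highly correlated predictors.

# A minimal sketch showing how the correlation between two missing data
# predictors affects the missing data patterns; the cutoff rules and the
# correlation values are assumed for illustration.
library(MASS)                                      # for mvrnorm()
set.seed(789)

pattern_freq <- function(rho, n = 10000) {
  sigma <- matrix(c(1, rho, rho, 1), 2, 2)
  preds <- mvrnorm(n, mu = c(0, 0), Sigma = sigma) # predictors Y3 and Y4
  m1 <- preds[, 1] >= 0                            # indicator: Y1 missing when Y3 >= 0
  m2 <- preds[, 2] >= 0                            # indicator: Y2 missing when Y4 >= 0
  table(paste0(as.integer(m1), as.integer(m2)))    # frequency of each missing data pattern
}

pattern_freq(rho = 0)      # four patterns, each covering roughly 25% of subjects
pattern_freq(rho = 0.9)    # the mixed patterns "01" and "10" become much less common

As the correlation between the predictors approaches 1, the mixed patterns disappear, leaving essentially two missing data patterns.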
On the other hand, the correlation between the missing data predictor and the
variable with missing data may affect how well missing data techniques such as MI predict
the values of the missing data in a MAR dataset. In the special case of uncorrelated MAR
data, the correlation between the missing data predictor and the variable with missing data
is zero, but the probability of missing values is related to the values of the missing data
predictor. For example, suppose Y2 is the missing data predictor such that for subjects
with Y2 ≥ 0, their Y1 values are missing (i.e., single cutoff method with the strongest
dependency), but Y2 and Y1 have a correlation of zero. In this case, subjects with missing
values on Y1 have high values on Y2, but had we observed their values on Y1, the
distribution of their Y1 values would be the same as that for the subjects without missing values. In
other words, given the Y2 values, we can predict which subjects have missing values on Y1
but not their missing values of Y1 . Although uncorrelated MAR data provide us with
slightly more information about the variable with missing data relative to MCAR data
(with MCAR data, we cannot even predict which subjects have missing values), they
definitely provide less information about the missing data relative to MAR data where the
missing data predictor and the variable with missing data are moderately or strongly
correlated. Therefore, if researchers wish to generate MAR data that differ more clearly
from MCAR data, we recommend that they generate correlated MAR data.
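The distinction can be seen directly in simulated data. The R sketch below (using the strongest single cutoff rule and an assumed correlation of .5 for the correlated case) compares the means of the would-be-missing and observed Y1 values under uncorrelated and correlated MAR data.

# A minimal sketch contrasting uncorrelated and correlated MAR data under the
# rule "Y1 is missing whenever Y2 >= 0"; the correlation of .5 is assumed.
set.seed(321)
n  <- 100000
y2 <- rnorm(n)
y1_unc <- rnorm(n)                               # uncorrelated with Y2
y1_cor <- 0.5 * y2 + rnorm(n, sd = sqrt(0.75))   # correlated about .5 with Y2
miss <- y2 >= 0                                  # missing data indicator implied by the rule

# Uncorrelated MAR: the deleted and retained Y1 values have the same distribution,
# so knowing Y2 does not help recover the missing values.
mean(y1_unc[miss]); mean(y1_unc[!miss])          # both close to 0

# Correlated MAR: the deleted Y1 values are systematically higher, and this is
# the information that techniques such as FIML and MI can exploit through Y2.
mean(y1_cor[miss]); mean(y1_cor[!miss])          # roughly +0.40 versus -0.40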
The second factor that may affect the performance of missing data techniques is the
location of the variables with certain properties (e.g., variables with model misfit, variables
with nonnormality) relative to the location of the variables with missing data. When we
have missing data, we lose information about the features of the data that have missing
values.⁶ Therefore, the location of the variables with certain properties may interact with
the location of the variables with missing data to affect the performance of missing data
techniques. For example, Zhang and Savalei (2020) showed that when the variables that
are misspecified are the same as those with missing data (i.e., the location of the model
misfit overlaps with the location of missing data), the approximate model fit improves
relative to the fit for data without missing values because some of the information
⁶ We are not using the term “information” in a technical sense (e.g., it does not mean Fisher information).
By “information,” we mean things about the dataset (e.g., the covariance structure of the data) that allow us
to learn about certain properties of the data (e.g., model misfit).
regarding the model misfit is lost due to missing data. In contrast, when the variables that
are misspecified are different from the ones with missing data, the model fit does not
change much because the information regarding model misfit is not affected by the missing
data. Of course, depending on the purpose of the simulation study, researchers may only
be interested in a small number of properties of the data; nonetheless, when designing the
study, they should think carefully about how the location of these properties may interact
with the location of missing data.
In conclusion, there are two characteristics of the data that are not related to the
missing data rule but may still affect the results of simulation studies. Of course, the
factors that influence simulation results are not limited to these two; our main message is
that researchers should also consider factors unrelated to missing data rules or missing
data mechanisms when they design simulation studies involving missing data.
Simulation studies play a crucial part in the development and evaluation of many
statistical methods, including statistical techniques for handling missing data (e.g., FIML
or MI). To conduct simulation studies involving missing data, researchers must sample
data from a known population distribution and then generate missing data in the sample
data (i.e., deciding which values to delete in the data). The main purpose of the current
paper is to provide guidelines on generating missing data for simulation studies, which, to
our knowledge, has not been done in past research. Specifically, we have provided detailed explanations
regarding the statistical models, also known as “missing data rules”, for generating missing
data with different missing data mechanisms and patterns. For each type of missing data
rule, we have also explained a computer algorithm that can implement the rule and
provided R code for the algorithm. We conclude our paper by providing the following
summary of recommendations for generating missing data for simulation studies.
• Researchers should always specify the missing data rule and identify the parameters
associated with the rule before generating missing data on the computer. Knowing
the specific missing data rule makes it easier for researchers to figure out and
understand the missing data properties, such as the expected percentage of missing
values, the type of missing mechanism, and the number of missing data patterns.
• Researchers should apply the missing data rule subject by subject when generating
missing data on the computer. It is the easiest and most straightforward way to
apply the missing data rule to generate missing data.
• Researchers should maximize the difference between MCAR and MAR data to
achieve a strong manipulation of the type of missing data mechanism. To maximize
the difference between MCAR data and MAR data, we suggest that researchers
include a MAR dataset with the strongest dependency (between the missing data
indicator and the missing data predictor) using the single cutoff method, and make
sure that for all MAR data, there is a moderate correlation between the missing data
predictor and the variable with missing data (i.e., avoid uncorrelated MAR data).
• If researchers wish to include more realistic MAR data that do not involve sudden
changes in the probability of missing values as the value of the missing data predictor
increases, we suggest that they generate MAR data using the logistic regression
method rather than the percentile method because the percentile method does not
allow researchers to manipulate the strength of dependency between the missing data
indicator and the missing data predictor.
• If researchers want to manipulate the type of missing data mechanism, they should
control for the number of missing data patterns between conditions with different
missing data mechanisms. In other words, they should compare MCAR and MAR
data with approximately the same number of missing data patterns; one possible way to
do so is sketched after this list.
• Researchers should consider how data properties not related to the missing data rule
can affect the performance of the missing data technique. In this paper, we have
explained how correlations among variables and the location of variables with certain
properties may affect the performance of missing data techniques. Depending on the
purpose of the simulation study, researchers should consider the data properties that
are important for their own study.
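For the recommendation on controlling the number of missing data patterns, the R sketch below shows one possible way (not necessarily the approach used in this paper, and with assumed cutoff rules) to obtain an MCAR condition with exactly the same missing data patterns and percentages as a MAR condition: generate the MAR missing data indicators first, and then permute them jointly across subjects so that missingness no longer depends on the predictors.

# A minimal sketch: permuting MAR indicators to create a matched MCAR condition.
# The single cutoff rules on Y3 and Y4 are assumed for illustration.
set.seed(654)
n  <- 10000
y3 <- rnorm(n); y4 <- rnorm(n)                 # missing data predictors
y1 <- rnorm(n); y2 <- rnorm(n)                 # variables that will have missing data

m_mar  <- cbind(y3 >= 0, y4 >= 0)              # MAR indicators from single cutoff rules
m_mcar <- m_mar[sample(n), ]                   # jointly permuted rows: MCAR indicators

dat_mar  <- cbind(y1, y2); dat_mar[m_mar]   <- NA    # delete values under the MAR rule
dat_mcar <- cbind(y1, y2); dat_mcar[m_mcar] <- NA    # delete values under the MCAR rule

# Both conditions have identical missing data patterns and pattern frequencies
table(apply(is.na(dat_mar),  1, paste, collapse = ""))
table(apply(is.na(dat_mcar), 1, paste, collapse = ""))

Because the permutation is independent of all data values, the permuted indicators satisfy MCAR while preserving the exact pattern frequencies of the MAR condition.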
References
Graham, J. W. (2010). Missing data: Analysis and design. New York: Springer.
Grigsby, T. J., & McLawhorn, J. (2019). Missing data techniques and the statistical
conclusion validity of survey-based alcohol and drug use research studies: A review
and comment on reproducibility. Journal of Drug Issues, 49 (1), 44–56. doi:
10.1177/0022042618795878
Hilbe, J. M. (2009). Logistic regression models. Chapman and Hall/CRC.
Jamshidian, M., & Jalal, S. (2010). Tests of homoscedasticity, normality, and missing
completely at random for incomplete multivariate data. Psychometrika, 75, 649–674.
doi: 10.1007/s11336-010-9175-3
Jia, F., & Wu, W. (2019). Evaluating methods for handling missing ordinal data in
structural equation modeling. Behavior Research Methods. Advance online
publication. doi: 10.3758/s13428-018-1187-4
Kim, K. H., & Bentler, P. M. (2002). Tests of homogeneity of means and covariance
matrices for multivariate incomplete data. Psychometrika, 67, 609–624. doi:
10.1007/BF02295134
Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data (Vol. 793). John
Wiley & Sons.
Mattson, S. (1997). How to generate non-normal data for simulation of structural equation
models. Multivariate Behavioral Research, 32 (4), 355–373. doi:
10.1207/s15327906mbr3204_3
Miao, W., Ding, P., & Geng, Z. (2016). Identifiability of normal and normal mixture
models with nonignorable missing data. Journal of the American Statistical
Association, 111 (516), 1673–1683. doi: 10.1080/01621459.2015.1105808
Musil, C. M., Warner, C. B., Yobas, P. K., & Jones, S. L. (2002). Performance of weighted
estimating equations for longitudinal binary data with drop-outs missing at random.
Western Journal of Nursing Research, 24, 815–829. doi:
10.1177/019394502762477004
Table 1
Contingency table for generating MAR data using the single cutoff method.
[Table body: a 2 × 2 contingency table of the missing data indicator M (1, 0) by the indicator U.]
Note. … and M = 0 when Y2 is not missing. U is the indicator variable indicating whether Y2 is equal to …
Table 2
Contingency tables for MAR data with different strengths of dependency
[Panels (a)–(d): 2 × 2 contingency tables of the missing data indicator M (1, 0) by the indicator U.]
Note. M is the missing data indicator with M = 1 when Y1 is missing, and M = 0 when
Y1 is not missing. U is the missing data predictor with U = 1 when Y2 ≥ 0, and U = 0 when
Y2 < 0. Suppose Y2 follows the standard normal distribution. The missing data rule for (a) is: if a
subject has Y2 ≥ 0, then its Y1 value is always missing. The rule for (b) is: if a subject has Y2 ≥ 0, then its
Y1 has 70% probability of being missing; otherwise, Y1 has 30% probability of being missing. The
rule for (c) is: if a subject has Y2 ≥ 0, then its Y1 has 60% probability of being missing; otherwise,
its Y1 has 40% probability of being missing. The rule for (d) is: each subject has 50% probability of
being missing from Y1. As the table goes from (a) to (d), the strength of dependency goes from
the strongest to the weakest.
Table 3
Contingency table for generating MAR data using the multiple cutoff method
[Table body: a contingency table of the missing data indicator M (1, 0) by the quartile indicator V.]
Note. M is the missing data indicator with M = 1 when Y1 is missing, and M = 0 when
Y1 is not missing. V is a discrete uniform random variable indicating which quartile the Y2 value
falls in.
Figure 1 . Types of missing data patterns. Rows represent subjects; columns represent
variables. The shaded cells represent the location of missing values.
Figure 2 . MAR data created by the single cutoff method, varying in the strength of
dependency. M is the missing data indicator with M = 1 indicating Y1 is missing, and
M = 0 indicating Y1 is not missing. Y2 is the missing data predictor, which follows the
standard normal distribution. Since M is a binary variable while Y2 is a continuous
variable, boxplots can be used to show the strength of dependency between M and Y2 . The
strength of dependency decreases as the boxplot for M = 0 overlaps more with the one for
M = 1. In other words, the strength of dependency decreases as the graph goes from (a) to
(d). Each graph is based on a large simulated dataset.
Figure 3 . MAR data created by the multiple cutoff method, varying in the strength of
dependency. M is the missing data indicator with M = 1 indicating Y1 is missing, and
M = 0 indicating Y1 is not missing. Y2 is the missing data predictor, which follows the
standard normal distribution. Since M is a binary variable while Y2 is a continuous
variable, boxplots can be used to show the strength of dependency between M and Y2 . The
strength of dependency decreases as the boxplot for M = 0 overlaps more with the one for
M = 1. In other words, the strength of dependency in graph (b) (AARD=0.33) is stronger
than that in graph (a) (AARD=0.10). Each graph is based on a large simulated dataset.
Figure 4 . MAR data that are generated using the percentile method. M is the missing
data indicator with M = 1 indicating Y1 is missing, and M = 0 indicating Y1 is not
missing. Y2 is the missing data predictor, which follows the standard normal distribution.
Since M is a binary variable while Y2 is a continuous variable, boxplots can be used to
show the dependency between M and Y2 . In graph (a), the relationship between the
probability of being missing from Y1 and the percentile rank on Y2 is a direct relationship;
in graph (b), this relationship is an inverse relationship. Each graph is based on a large
simulated dataset.