SUMMARY
This paper presents computational results for some alternative methods of
analysing multivariate data with missing values. We recommend an
algorithm due to Orchard and Woodbury (1972), which gives an estimator
that is maximum likelihood when the data come from a multivariate normal
population. We include a derivation of the estimator that does not assume
a multivariate normal population, as an iterated form of Buck's (1960)
method.
We derive an approximate method of assigning standard errors to
regression coefficients estimated from incomplete observations, and quote
supporting evidence from simulation studies.
A brief account is given of the application of these methods to some
school examinations data.
1. INTRODUCTION
MANY multivariate analysis techniques, and in particular multiple regression, assume
that one starts with an array of numbers $x_{ij}$ representing the value of the $j$th variable
in the $i$th observation, for $j = 1, \dots, n$ and $i = 1, \dots, N$ if we have $N$
observations and $n$ variables. From these raw data one then forms a square matrix
$a_{jk}$ of sums of squares and products defined by the equation

$$a_{jk} = \sum_{i=1}^{N} x_{ij} x_{ik}. \qquad (1.1)$$

One can then proceed to a multiple regression analysis or to any of the more specialized
analyses, such as principal component analysis, factor analysis or interdependence
analysis.
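In matrix terms (1.1) is simply the cross-product of the data array with itself. The following minimal sketch (ours, not the paper's) shows the computation for a complete data array, using NumPy:

import numpy as np

def ssp_matrix(x: np.ndarray) -> np.ndarray:
    """Sums of squares and products: a[j, k] = sum over i of x[i, j] * x[i, k].

    x is an N x n array with rows as observations and columns as variables.
    """
    return x.T @ x

# Example with N = 4 observations on n = 2 variables.
x = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
a = ssp_matrix(x)  # 2 x 2 matrix of sums of squares and products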
But what should we do if there are gaps in the original data, that is to say if
individual variables are missing in some observations? Sometimes the fact that the
variable is missing indicates that its true value is probably unusual, and in these
circumstances any mechanical method of analysis may be very misleading. But
information about some variables may simply not be readily available, particularly
if the relevance of this information is doubtful, as in exploratory regression work.
We can now find the value $\theta_M$ of $\theta$ that maximizes the left-hand side of (2.3). This
may depend on $\theta_A$, so we may write

$$\theta_M = \phi(\theta_A). \qquad (2.4)$$

Equation (2.4) represents a transformation of the parameter space into itself.
We now define the Missing Information Principle:

$$\theta = \phi(\theta). \qquad (2.5)$$
We call (2.5) the "fixed point equation". We justify this approach by two theorems,
which show that the maximum likelihood estimator of $\theta$ is a root of the fixed point
equation, and conversely that every root of the fixed point equation is a maximum or
stationary value of the likelihood.
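In later (EM) terminology, and on the reading of (2.3) as the expected complete-data log likelihood given the observed data, the transformation and its fixed point condition may be summarized as follows (our restatement, not the authors' notation):

$$\phi(\theta_A) = \arg\max_{\theta}\, E\{\log L(z; \theta) \mid y;\, \theta_A\}, \qquad \hat{\theta} = \phi(\hat{\theta}).$$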
Hence, if the likelihood function is differentiable, any solution of the fixed point
equations automatically satisfies the "likelihood equations" found by setting the
partial derivatives of the likelihood equal to zero. Some solutions of the likelihood
equations may not be fixed points, since the fixed point method involves finding the
$\theta_M$ giving a global maximum, and not merely a stationary value, of the left-hand side
of (2.3). (In this respect our approach differs slightly from that of Orchard and
Woodbury, who implicitly define $\phi$ by setting the derivatives of (2.2) with respect to
$\theta_M$ equal to zero.)
Theorem 1. The maximum likelihood estimator $\hat{\theta}$ satisfies (2.5).

Theorem 2. If $L(z \mid y; \theta)$ is a differentiable function of $\theta$, then any other value
of $\theta$ satisfying (2.5) must represent either a maximum or a stationary value of $L(y; \theta)$.
To prove these theorems, consider the last term on the right-hand side of (2.3).
We have

$$-\tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{n} \sum_{k=1}^{n} \sigma^{jk} (x_{ij} - \mu_j)(x_{ik} - \mu_k),$$

where $\sigma^{jk}$ denotes the $(j,k)$th element of $\Sigma^{-1}$. Taking expectations with the
known variables fixed, this term becomes

$$-\tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{n} \sum_{k=1}^{n} \sigma^{jk} \bigl\{(\hat{x}_{ij} - \mu_j)(\hat{x}_{ik} - \mu_k) + \sigma_{jk \cdot P_i}\bigr\}$$

for $1 \le j, k \le n$, where $P_i$ denotes the set of variables observed in observation $i$,
$\sigma_{jk \cdot P_i}$ is the partial covariance of $x_j$ and $x_k$ given the variables in $P_i$, and

$$\hat{x}_{ij} = x_{ij}, \quad \text{if } x_{ij} \text{ is observed};$$
$$\hat{x}_{ij} = \text{a linear combination of the variables in } P_i, \quad \text{if } x_{ij} \text{ is missing}. \qquad (2.6)$$

Now set $\mu_A = \mu_M = \hat{\mu}$, $\Sigma_A = \Sigma_M = \hat{\Sigma}$. The fixed point equations
are

$$\hat{\mu}_j = \frac{1}{N} \sum_{i=1}^{N} \hat{x}_{ij}, \qquad (2.7)$$

$$\hat{\sigma}_{jk} = \frac{1}{N} \sum_{i=1}^{N} \bigl\{(\hat{x}_{ij} - \hat{\mu}_j)(\hat{x}_{ik} - \hat{\mu}_k) + \sigma_{jk \cdot P_i}\bigr\}. \qquad (2.8)$$
At each iteration the data are completed by equation (2.6), and the means and a sum
of squares and products matrix are found for the completed variables. This matrix is
adjusted by adding $\sigma_{jk \cdot P_i}$, for every observation $i$, to the $(j,k)$th element. Note that
this adjustment is zero unless both $x_{ij}$ and $x_{ik}$ are missing.
It seems reasonable to hope that in this case we have a unique fixed point of the
transformation (2.4), since the effect of changing one component of $\theta_A$ is to change
the corresponding component of $\theta_M$ by a smaller amount in the same sense. Thus
cycling through equations (2.6)-(2.9) represents an iterative procedure for finding
the maximum likelihood estimates $\hat{\mu}_j$, $\hat{\sigma}_{jk}$, which is much simpler to carry out than
working directly from the likelihood equations.
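The cycle through these equations is straightforward to program. The following sketch (ours, not the authors' code) implements the iteration for a data array with missing values coded as NaN, using NumPy; it assumes every observation has at least one observed variable and every variable is observed at least once, and uses the maximum likelihood divisor $N$ in (2.8):

import numpy as np

def em_normal(x, n_iter=100, tol=1e-8):
    """Iterate equations (2.6)-(2.8) on an N x n array x with np.nan
    marking missing values. Returns the fitted mean vector and
    covariance matrix."""
    N, n = x.shape
    miss = np.isnan(x)
    mu = np.nanmean(x, axis=0)                 # starting values
    sigma = np.diag(np.nanvar(x, axis=0))
    for _ in range(n_iter):
        xhat = np.where(miss, 0.0, x)          # completed data, equation (2.6)
        adjust = np.zeros((n, n))              # accumulates sigma_{jk.P_i}
        for i in range(N):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            soo = sigma[np.ix_(o, o)]
            smo = sigma[np.ix_(m, o)]
            coef = smo @ np.linalg.inv(soo)    # regression of missing on observed
            xhat[i, m] = mu[m] + coef @ (x[i, o] - mu[o])
            # Conditional covariance of the missing variables: the adjustment
            # is zero unless both x_{ij} and x_{ik} are missing, as noted above.
            adjust[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ smo.T
        mu_new = xhat.mean(axis=0)             # equation (2.7)
        centred = xhat - mu_new
        sigma_new = (centred.T @ centred + adjust) / N   # equation (2.8)
        done = (np.abs(mu_new - mu).max() < tol and
                np.abs(sigma_new - sigma).max() < tol)
        mu, sigma = mu_new, sigma_new
        if done:
            break
    return mu, sigma

Replacing the divisor N by N-1 in the covariance update gives the iterated form of Buck's method discussed in Section 3.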
$$\bar{x}_j = \sum_{i=1}^{N} \hat{x}_{ij}/N. \qquad (3.2)$$
The problem remains to determine suitable formulae for the correction terms $c_{ijk}$.
This problem is a subtle one, and is discussed rather briefly by Buck. The solution
has been indicated at the end of Section 2, but it is of interest to explore it in more
detail, discussing terms of order $N^{-1}$ in a special case. Suppose that we have only
one incomplete observation, in which only the first $p\ (<n)$ variables are known. For
notational convenience we assume that the incomplete observation has $i = 1$.
Let $\bar{x}_l$ denote

$$\bar{x}_l = \frac{1}{N-1} \sum_{i=2}^{N} x_{il},$$

i.e. the mean value of the $l$th variable over all complete observations. Define $b_{jk}$,
for $j = 1, \dots, n$ and $k = 1, \dots, p$, as

$b_{jk}$ = the partial regression coefficient of $x_j$ on $x_k$, estimated from the complete data, if $j > p$;
$b_{jk}$ = 0, if $j \le p$ and $j \ne k$;
$b_{jk}$ = 1, if $j \le p$ and $j = k$.

The incomplete observation is then completed by setting

$$\hat{x}_{1j} = \bar{x}_j + \sum_{k=1}^{p} b_{jk} (x_{1k} - \bar{x}_k),$$

and we write $s_{jk}$ for the resulting corrected sum of squares and products of the
completed data.
We are now concerned with the expected value of $s_{jk}$. We must therefore define some
properties of the population from which the observations are drawn.

Without loss of generality we may assume that the true means of all variables
are zero.

Let $u_{jk}$ denote the covariance of $x_j$ and $x_k$. Let $v_{jk}$ denote the "partial covariance"
of $x_j$ and $x_k$, by which we mean the covariance of the residuals of $x_j$ and $x_k$ after
linear regression on the known variables $x_1, \dots, x_p$; the covariance of the
corresponding regression predictions is then $u_{jk} - v_{jk}$.

The sampling fluctuations of the estimated regression coefficients $b_{jk}$ enter
through the quantities $g^{l_1 l_2}$, where $g^{l_1 l_2}$ is the $(l_1, l_2)$th element of the inverse of the
$(p \times p)$ matrix $G$ of sums of squares and products of the known variables over the
complete observations. Now

$$E(g^{l_1 l_2}) = \frac{u^{l_1 l_2}}{N - p - 2},$$

where $u^{l_1 l_2}$ denotes the $(l_1, l_2)$th element of the inverse of the $p \times p$ matrix $(u_{l_1 l_2})$.
It follows that

$$E(\hat{v}_{jk}) = \frac{N - p - 2}{N - 2}\, v_{jk},$$

where $\hat{v}_{jk}$ is the estimate of the partial covariance obtained from the complete
observations.
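To give a sense of the magnitude (our arithmetic, not the paper's): with N = 50 observations and p = 3 known variables the factor is (50-3-2)/(50-2) = 45/48, roughly 0.94, so the naive estimate of the partial covariance is biased downward by about 6 per cent; with N = 200 the bias falls below 2 per cent.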
It is somewhat remarkable that all the various terms of order $1/N$ in the above
analysis cancel, and that the resulting formula for $c_{ijk}$ is the "naive" estimate of the
partial covariance of $x_j$ and $x_k$ given the known variables. This cancellation does not
happen exactly with more complicated deletion patterns, but there is no simple
correction formula for the bias of order $1/N$.
The analysis implicitly assumes that the probability of a particular variable being
missing is independent of the numerical values of any of the variables for this
observation. This important assumption was noted in the Introduction. But the
analysis does not assume that the underlying population is multivariate normal.
This is of some practical significance, since multiple regression is widely applied to
non-normal data. On the other hand, it is worth noting that if the population is not
multivariate normal, then any unknown variable is not necessarily best estimated by
a linear function of the known variables for the observation. So it may be possible
to develop slightly more powerful estimators for particular non-normal populations.
This analysis has concentrated on the situation where we have $(N-1)$ complete
observations and a single incomplete observation. We now consider what to do
when more observations are incomplete. Buck's method uses only the complete
observations to define the means $\bar{x}_j$ and the estimated covariance matrix $u_{jk}$. But our
simulation studies suggest that an iterated version of Buck's method is generally
superior. This method takes trial values for the $\bar{x}_j$ and the $u_{jk}$, and uses them to compute
the $\hat{x}_{ij}$ and $c_{ijk}$, and hence $a_{jk}$ and $\bar{x}_j$, from (3.1) and (3.2). We then set the new trial
values to

$$\bar{x}_j = \sum_{i=1}^{N} \hat{x}_{ij}/N, \qquad (3.4)$$

$$u_{jk} = a_{jk}/(N-1), \qquad (3.5)$$

and repeat the process until there are no further changes.
We noted in Section 2 that Orchard and Woodbury have derived the same
algorithm, with N substituted for N-1 in (3.5), as giving maximum likelihood
estimates when the population is multivariate normal.
After fitting missing values in this way, Method 4 proceeds with an ordinary
least squares analysis. But this is inefficient, since it amounts to giving incomplete
observations the same weight as complete observations. Method 5
computes a weight $w_i$ for each observation $i$, and carries out a weighted least squares
analysis.
For each method we must decide when an observation is so incomplete that it should
be ignored. For all methods we ignore observations with all variables missing. For
Method 4 we also ignore observations with all independent variables missing, since
the inclusion of such observations with full weight was found to make the results
significantly worse.
In all cases the data were generated from a multivariate normal population with
1 variable identified as the dependent variable, and between 2 and 4 independent
variables. Some of the populations were the same as those studied by Haitovsky,
but others had smaller values of $R^2$. We took 50, 100 or 200 observations and deleted
either 5, 10, 20 or 40 per cent of the observed values of each variable. The values
to be deleted were chosen randomly, and independently for each variable. Our
criterion for judging the effectiveness of each estimator was the residual sum of
squares of deviations between the observed and fitted values of the dependent variable
when the deleted values were restored. In symbols we may write this as

$$S = \sum_{i=1}^{N} \Bigl( y_i - b_0 - \sum_{j} b_j x_{ij} \Bigr)^2.$$
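In code, the criterion is immediate. The sketch below (ours, with hypothetical argument names) evaluates S for the intercept b0 and coefficients b produced by any of the methods, given the restored data:

import numpy as np

def criterion_s(x_restored: np.ndarray, y: np.ndarray, b0: float, b: np.ndarray) -> float:
    """Residual sum of squares S of the dependent variable, computed
    after the deleted values of the independent variables are restored."""
    residuals = y - b0 - x_restored @ b
    return float(residuals @ residuals)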
TABLE 1
Average percentage increase in residual sum of squares over best fit when all variables
are known (averaged over 10 runs)

Population C (5 var., R² = 0.4402)
  Method 1:   1.6   0.8    7.7   3.3   2.4   36.2  12.1   7.3   37.4  12.1
  Method 2:   0.8   0.3    3.4   1.8   0.9   23.1   4.1   2.5   25.3   6.9
  Method 3:   0.8   0.3    2.6   1.7   0.8    9.5   2.9   1.5    6.8   3.0
  Method 4:   0.9   0.3    2.9   1.8   0.7   11.0   3.0   1.3    7.1   3.3
  Method 5:   0.8   0.3    2.9   1.8   0.8   10.7   2.9   1.4    6.8   3.2
  Method 6:   0.8   0.3    2.9   1.8   0.8   10.4   3.0   1.4    6.8   3.1

Population D (5 var., R² = 0.6339)
  Method 1:   1.6   0.8    7.7   3.3   2.4   36.2  12.1   7.3   37.4  12.1
  Method 2:   0.9   0.3    4.2   2.0   1.0   24.6   4.8   2.8   25.2   7.3
  Method 3:   0.9   0.3    3.2   1.8   0.9   11.2   3.4   1.9    6.5   3.3
  Method 4:   1.1   0.4    3.9   2.2   0.9   15.1   3.4   1.6    8.6   4.1
  Method 5:   1.0   0.3    3.6   2.0   0.9   13.9   3.2   1.6    8.1   3.8
  Method 6:   1.0   0.3    3.6   2.0   1.0   12.9   3.4   1.8    8.0   3.8

Population E (5 var., R² = 0.7173)
  Method 1:   1.6   0.8    7.7   3.3   2.4   36.2  12.1   7.3   37.4  12.1
  Method 2:   0.7   0.3    5.7   1.5   1.2   25.6   6.1   3.4   27.3   8.0
  Method 3:   0.7   0.3    5.2   1.3   1.1   16.3   5.8   2.5    9.7   4.8
  Method 4:   0.8   0.3    7.4   1.4   1.2   14.2   4.7   2.6   18.8   5.7
  Method 5:   0.8   0.3    6.1   1.4   1.2   12.1   4.8   2.3   17.7   5.2
  Method 6:   0.8   0.3    5.8   1.3   1.2   12.8   5.2   2.3   14.6   4.9

Population F (5 var., R² = 0.9866)
  Method 1:   1.6   0.8    7.7   3.3   2.4   36.2  12.1   7.3   37.4  12.1
  Method 2:   1.4   0.7    6.4   2.9   2.0   32.6   9.9   6.4   32.7  10.6
  Method 3:   1.5   0.7    5.3   3.0   1.9   27.0   8.7   5.5   23.5   8.5
  Method 4:  15.9   4.2   77.9  33.2  13.0  245.4  65.5  26.4  118.2  66.6
  Method 5:   1.6   0.6   13.5   4.0   2.2   78.4  15.4   5.7   77.6  22.1
  Method 6:   1.4   0.6    5.6   3.1   2.0   25.3   8.5   5.5   25.8   8.6

Population G (5 var., R² = 0.9904)
  Method 1:   1.6   0.8    7.7   3.3   2.4   36.2  12.1   7.3   37.4  12.1
  Method 2:   1.4   0.7    6.3   2.8   2.0   33.6  10.1   6.5   33.4  10.8
  Method 3:   1.5   0.7    5.3   3.0   2.0   30.9   8.4   5.8   24.4   9.1
  Method 4:  21.5   5.5  104.2  47.8  20.1  372.9  96.6  37.2  178.3   8.2
  Method 5:   1.6   0.6   10.5   3.9   2.2  112.1  18.3   6.8  119.5  30.6
  Method 6:   1.4   0.6    5.6   3.1   2.1   28.2   8.3   5.8   26.7   9.1
TABLE 1 (continued)
Population covariance matrices

Population A      x1      x2      y
  x1            1.0000
  x2            0.9817  1.0000
  y             0.9722  0.9697  1.0000                  R² = 0.9516

Population B      x1      x2      x3      y
  x1            1.0000
  x2            0.9128  1.0000
  x3            0.8730  0.9529  1.0000
  y             0.2570  0.2851  0.2977  1.0000          R² = 0.0888

Population C      x1      x2      x3      x4      y
  x1            1.0000
  x2            0.8385  1.0000
  x3            0.4596  0.6077  1.0000
  x4            0.3618  0.4706  0.7962  1.0000
  y             0.7522  0.5958  0.6979  0.8232  2.2500  R² = 0.4402

Population E      x1      x2      x3      x4      y
  x1            1.0000
  x2            0.8743  1.0000
  x3            0.4570  0.8255  1.0000
  x4            0.3765  0.5181  0.6080  1.0000
  y             0.3705  0.4575  0.5039  0.8261  1.0000  R² = 0.7173

Population F      x1      x2      x3      x4      y
  x1            1.0000
  x2            0.8738  1.0000
  x3            0.5166  0.6314  1.0000
  x4            0.4267  0.4650  0.7119  1.0000
  y             0.7852  0.6137  0.6389  0.8283  1.0000  R² = 0.9866

Population G      x1      x2      x3      x4      y
  x1            1.0000
  x2            0.8385  1.0000
  x3            0.4596  0.6077  1.0000
  x4            0.3618  0.4706  0.7962  1.0000
  y             0.7522  0.5958  0.6979  0.8232  1.0000  R² = 0.9904
It remains to compare the best of the least squares approaches, Method 6, with
Iterated Buck, Method 3. There is not much to choose between the methods, but
Method 3 is marginally better in a large majority of the cases considered. From a
computing point of view the methods are very similar, and the weighting procedures
in Methods 5 and 6 are used to derive approximate standard errors in the regression
coefficients for Method 3. We return to this point in Section 5 below.
It is perhaps worth noting that we also tested the straight maximum likelihood
method of Orchard and Woodbury. The results are almost identical to those of
Method 3: mostly they are worse, but by less than 0.1 per cent. We therefore see
no reason to use straight maximum likelihood in preference to the conventional
correction represented by Method 3.
Our simulation results suggest that the approximate theory is adequate to give general guidance about the precision
of the estimates. But we should point out that we have not tested the theory with
more systematic deletion patterns. Such systematic patterns of missing data often
arise in practice, and may not be quite as well covered by our approximate theory.
The overall covariance matrix for all variables is then estimated by corrected
maximum likelihood, and the appropriate submatrices are extracted for any desired
regression analyses. The standard errors of the resulting regression coefficients can
then be estimated by the method described in Section 5. This approach may
underestimate precision when missing variables are highly correlated with known
variables that are excluded from the regression analysis, but it seems a reasonably
safe procedure.
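Concretely, the extraction step amounts to standard partitioning of the estimated moments. A minimal sketch (ours, with illustrative names) follows:

import numpy as np

def regression_from_moments(mu: np.ndarray, sigma: np.ndarray,
                            predictors: list, response: int):
    """Regression coefficients from an estimated mean vector mu and
    covariance matrix sigma over all variables: b solves
    sigma_xx b = sigma_xy, and b0 = mu_y - b' mu_x."""
    idx = np.asarray(predictors)
    sxx = sigma[np.ix_(idx, idx)]           # covariance of the chosen predictors
    sxy = sigma[idx, response]              # covariances with the response
    b = np.linalg.solve(sxx, sxy)
    b0 = mu[response] - b @ mu[idx]
    return b0, b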
Another feature is concerned with the problem discussed by Woodbury (1971).
If two variables are never observed together, then the data give no evidence about
their conditional correlation, given all other variables. Woodbury then recommends
setting this conditional correlation equal to zero. The iterative procedure converges
to this solution, but incredibly slowly.
              Test:    1     2     3     4     5     6
        (no. items)  (10)  (25)  (15)  (10)  (15)  (25)
  Strategy A           x     x     x
  Strategy B           x           x     x     x
  Strategy C                             x     x     x

There were six parts; the number of items in each part appears in brackets.
Three choice strategies, A, B and C, were available, and the parts taken under
each strategy are marked with crosses. Although 100 questions were presented,
each candidate was only required to tackle 50. The problem, then, is to estimate
what scores the candidates who went for strategy A would have obtained on
parts 4, 5 and 6, and then to cumulate actual and estimated scores to obtain an
overall mark.
There were 321 observations, 73 following Strategy A, 188 Strategy B and 60
Strategy C. Tests 2-6 are in increasing order of difficulty.
We analysed the data on two different bases: once applying the corrected
maximum likelihood method to the data as presented, and once after making
angular transformations of all the scores. The final results were very similar. The
calculations using angular transformations converged after 92 iterations and required
50 seconds of CPU time on the Univac 1108 computer. Without angular trans-
formations the calculations terminated after 100 iterations, having nearly but not
quite converged; this run required 70 seconds of CPU time. The structure of the
problem is revealed by the way the assumed means of the six variables changed as
the iteration proceeded, starting from initial estimates based on the observations
for which each variable was actually observed.
The final estimates indicate that the fitted scores on the easier Tests 2 and 3 for the
candidates who did not take them were somewhat higher than the scores obtained
by the candidates who did take them. But the opposite effect is seen with the more
difficult Tests 5 and 6. This is as it should be.
ACKNOWLEDGEMENTS
This work has been based on earlier work on regression with missing values,
carried out at Imperial College and at Scientific Control Systems Limited. In
particular we should like to acknowledge the benefit we have gained from studying
the Imperial College M.Sc. Project Reports by M. J. Wallace in 1969 and D. Chant
in 1970.
One of us (R. J. A. Little) acknowledges the receipt of a Research Studentship
from the Science Research Council.
Finally, we are most grateful for many constructive criticisms of earlier drafts
of this paper by Professor D. E. Barton.
REFERENCES
BEALE, E. M. L. (1970). Computational methods in least squares. In Integer and Nonlinear
Programming (J. Abadie, ed.), pp. 213-227. Amsterdam: North Holland.
BUCK, S. F. (1960). A method of estimation of missing values in multivariate data suitable for
use with an electronic computer. J. R. Statist. Soc. B, 22, 302-306.
HAITOVSKY, Y. (1968). Missing data in regression analysis. J. R. Statist. Soc. B, 30, 67-82.
JOWETT, G. H. (1963). Application of Jordan's procedure for matrix inversion in multiple
regression and multivariate distance analysis. J. R. Statist. Soc. B, 25, 352-357.
KENDALL, M. G. and STUART, A. (1967). The Advanced Theory of Statistics, Vol. II, 2nd ed.
London: Griffin.
ORCHARD, T. and WOODBURY, M. A. (1972). A missing information principle: theory and
applications. In Proc. 6th Berkeley Symp. Math. Statist. Prob., Vol. I, pp. 697-715.
WOODBURY, M. A. (1971). Contribution to the discussion of "The analysis of incomplete data"
by H. O. Hartley and R. R. Hocking. Biometrics, 27, 808-813.