Robust Detection of Multiple Outliers in A Multivariate Data Set
Abstract

Many methods are available for detecting multiple outliers in a single multivariate sample, but very few for the case where there may be more than one group in the data. We propose a method for detecting outliers, which are points that are distant from every group. The use of the method is illustrated by its application to two real data sets, and by a simulation study of its performance for various configurations of groups and outliers in the data. When the number of groups is not known in advance, the analysis can be repeated over a range of values in order to choose a satisfactory solution.
CHAPTER ONE
Introduction
cally present in data of this type and tend to cause problems in the
this grouping are not known beforehand. Outliers in this context are
Rocke & Woodruff, 1996; Penny & Jolliffe, 2001). Caroni (1998)
with a common covariance matrix, but she assumed that the number
of populations was known. Thus, her results are not applicable to
al. (1999) removed this restriction. Recently, Hardin & Rocke (2004)
here. The only other methods in the literature that appear to have the
this in a single group, using Wilks' statistic, was given by Caroni
& Prescott (1992) but it is not obvious how it can be extended to the
published real data sets and a simulation study of its performance for
that has been reduced by omitting points that are even further from
the mean than the one currently being tested. Points sufficiently
masking: if there are actually more outliers than the number being
tested for, then the covariance matrix will be inflated by these extra
outliers, thus reducing the distances d_i and making it less likely that
procedure.
These considerations make it desirable to construct a robust version of the Mahalanobis distance,

    d_i(T(X), C(X)) = [(x_i − T(X))′ C(X)⁻¹ (x_i − T(X))]^{1/2},

where T(X) and C(X) are robust estimators of the center and the scatter of the data, respectively. Well-known choices include the minimum volume ellipsoid (MVE) (Rousseeuw & van Zomeren, 1990) and the minimum covariance determinant (MCD) (Rousseeuw & van Driessen, 1999), although for estimators of this kind the computation required to be sure of reaching the solution was, in many cases, infeasible.
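As a concrete illustration of such robust distances, the following is a minimal sketch using scikit-learn's MinCovDet (a fast MCD implementation) in the role of T(X) and C(X); the simulated data, the planted outliers and the 0.975 chi-squared cutoff are our assumptions for the example, not part of the method described in this work.

    # Robust Mahalanobis distances with MCD standing in for T(X), C(X).
    import numpy as np
    from scipy.stats import chi2
    from sklearn.covariance import MinCovDet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    X[:3] += 6.0                              # three planted outliers

    mcd = MinCovDet(random_state=0).fit(X)    # robust center and scatter
    d2 = mcd.mahalanobis(X)                   # squared robust distances
    flagged = np.where(d2 > chi2.ppf(0.975, df=X.shape[1]))[0]
    print(flagged)                            # indices of suspected outliers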
Step 1. Using either of the following two versions of the method, determine an initial basic subset of m > p
observations where m is an integer chosen by the data analyst: m = cp with c = 3, 4 or 5 is suggested.
Version 1: Initial Subset Selected Based on Mahalanobis Distances
For all i = 1, ..., n points, compute the Mahalanobis distances d_i(x̄, S), where x̄ and S are the mean and covariance matrix of all n observations. The initial basic subset consists of the m points with the smallest distances. (In Version 2 of the method, distances are instead measured from the coordinate-wise medians, a more robust but not affine-equivariant start. A code sketch of the complete single-sample iteration is given after Step 4 below.)
Step 2. Compute the discrepancies d_i(x̄_b, S_b) for all i = 1, ..., n, where x̄_b and S_b are the mean and covariance matrix of the observations in the current basic subset.
Step 3. Determine a new basic subset consisting of the observations with the smallest discrepancies, specifically those satisfying d_i(x̄_b, S_b) < C_npr · χ_{p,α/n}, where χ²_{p,α} is the 100(1 − α) percentile of the chi-squared distribution with p degrees of freedom, and C_npr = C_np + C_hr is a variance inflation factor, with C_hr = max{0, (h − r)/(h + r)} and

    C_np = 1 + (p + 1)/(n − p) + 1/(n − h − p),

where r is the size of the current basic subset and h = ⌊(n + p + 1)/2⌋.
Step 4. Iterate Steps 2 and 3 until the basic subset no longer grows. Outliers are those points, if any, still excluded from the basic subset when the algorithm terminates.
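To make the procedure concrete, here is a minimal NumPy/SciPy sketch of the single-sample iteration. The helper names (mahalanobis, bacon_cutoff, bacon) and the simplified stopping rule are ours; this is a sketch of the published algorithm, not the authors' own code.

    import numpy as np
    from scipy.stats import chi2

    def mahalanobis(X, center, cov):
        """d_i(center, cov): Mahalanobis distance of every row of X."""
        diff = X - center
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        return np.sqrt(d2)

    def bacon_cutoff(n, p, r, alpha=0.05):
        """Step 3 threshold C_npr * chi_{p, alpha/n}, where chi is the
        square root of the chi-squared percentile and h = (n+p+1)//2."""
        h = (n + p + 1) // 2
        c_np = 1 + (p + 1) / (n - p) + 1 / (n - h - p)
        c_hr = max(0.0, (h - r) / (h + r))
        return (c_np + c_hr) * chi2.ppf(1 - alpha / n, df=p) ** 0.5

    def bacon(X, c=4, alpha=0.05):
        """Single-sample BACON sketch; returns a boolean outlier mask."""
        n, p = X.shape
        # Step 1 (Version 1): the m = c*p points closest to the classical mean.
        d = mahalanobis(X, X.mean(axis=0), np.cov(X, rowvar=False))
        basic = np.sort(np.argsort(d)[:c * p])
        while True:
            # Step 2: discrepancies from the current basic subset.
            d = mahalanobis(X, X[basic].mean(axis=0),
                            np.cov(X[basic], rowvar=False))
            # Step 3: new basic subset = points below the corrected cutoff.
            new = np.where(d < bacon_cutoff(n, p, len(basic), alpha))[0]
            if np.array_equal(new, basic):   # Step 4: subset stopped changing
                break
            basic = new
        mask = np.ones(n, dtype=bool)
        mask[basic] = False                  # points left outside are outliers
        return mask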
Extension of the BACON Algorithm to Grouped Data
In the unequal-covariances case there is a separate covariance matrix for each group, and hence a basic subset will be needed within each group. The first step of the algorithm for both cases, that is, with equal or unequal covariance matrices, uses the k-medoids method to select k representative points (objects) from the data set. The corresponding clusters are built around these medoids, which are chosen so that they are centrally located in the clusters they represent. In other words, they indicate which group each point is most likely to come from.
Step 1. Start the analysis by using the k-medoids method to find the
centers of k subgroups, where k has been chosen in advance.
Step 2. Compute the distance of each point from its nearest center, and form the initial basic subset(s) from the m points with the smallest of these distances, d_(1), ..., d_(m), irrespective of which group they belong to. In the unequal-covariances case, each selected point joins the basic subset of the group whose center it is nearest.
Step 3. For every point i and every group j, compute the discrepancies D_ij = d_i(x̄_jb, S_jb), where x̄_jb and S_jb are the mean and covariance, respectively, of the current basic subset of group j.
Step 4. Assign each point to the group for which it has the smallest discrepancy, D_i = min_j D_ij.
Step 5. Enlarge the basic subset(s) by including all points with D_i < C_npr · χ_{p,α/n}.
Step 6. Iterate Steps 3 to 5 until the basic subsets do not change. Points remaining outside the basic subset(s) when the algorithm terminates are declared to be outliers.
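A sketch of Steps 2 to 6 for the grouped case follows (the unequal-covariances variant, with one basic subset per group). It reuses the mahalanobis and bacon_cutoff helpers above, takes the Step 1 medoid indices as given, and assumes every basic subset keeps more than p points so that the covariance matrices stay invertible; these simplifications are ours.

    import numpy as np

    def extended_bacon(X, medoids, m, alpha=0.05):
        """Grouped BACON sketch. `medoids`: indices from Step 1 (k-medoids)."""
        n, p = X.shape
        k = len(medoids)
        # Step 2: the m points nearest to any center start the basic subsets.
        eu = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
        nearest = eu.argmin(axis=1)
        start = np.sort(np.argsort(eu.min(axis=1))[:m])
        groups = [start[nearest[start] == j] for j in range(k)]
        while True:
            # Step 3: discrepancies D_ij from every group's basic subset.
            D = np.column_stack([
                mahalanobis(X, X[g].mean(axis=0), np.cov(X[g], rowvar=False))
                for g in groups])
            # Step 4: match each point to its closest group.
            best, Di = D.argmin(axis=1), D.min(axis=1)
            # Step 5: admit all points below the corrected cutoff
            # (r taken as the total size of all basic subsets).
            inside = Di < bacon_cutoff(n, p, sum(len(g) for g in groups), alpha)
            new = [np.where(inside & (best == j))[0] for j in range(k)]
            # Step 6: stop when the basic subsets no longer change.
            if all(np.array_equal(a, b) for a, b in zip(groups, new)):
                break
            groups = new
        return np.where(~inside)[0]          # declared outliers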
Examples of Applications
For numerical illustration, we take two real data sets in which it is suspected that there may be outliers. We will not carry out detailed analyses of the data, but simply demonstrate how easily our method reveals the outliers. The first set consists of five socioeconomic measures for each of 47 cases; the data are available within the S-PLUS package. The cluster analyses and plots of the data suggest that there are three main groups in the data, with one clear outlier (point number 45). Figure 1 shows the data plotted in the space of the first two principal components, with this point marked. The extended BACON algorithm with k = 3 (and p = 5) detected this outlier after only a few rounds of iteration.
The second data set consists of measurements on 76 young bulls of three breeds. Here the analysis draws attention to points 16 and 51, which are possible outliers. In fact, our method works without difficulty even in this case, where the groups are not clearly separated.
Figure 1. Data set consisting of five socioeconomic measures on 47 cases, plotted in the space of the first two principal components, with the outlying point marked.
Figure 2. Data set consisting of measurements on 76 young bulls of three breeds, plotted in the space of the first two principal components.
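For readers who prefer Python to S-PLUS, a hypothetical snippet for reproducing this kind of display might look as follows; the file name data.csv, the placeholder medoids and the choice of m are ours, and in practice the medoids would come from a genuine k-medoids run (Step 1).

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    X = pd.read_csv('data.csv').to_numpy(dtype=float)   # hypothetical export
    out = extended_bacon(X, medoids=[0, 1, 2], m=15)    # placeholder medoids
    Z = PCA(n_components=2).fit_transform(X)
    plt.scatter(Z[:, 0], Z[:, 1], c='k')
    plt.scatter(Z[out, 0], Z[out, 1], c='r', label='flagged outliers')
    plt.xlabel('PC 1'); plt.ylabel('PC 2'); plt.legend(); plt.show()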
Simulation Study
In all cases, the covariance matrix used for data generation was the identity, the mean of the first group was zero and the mean of the second group was at (5, 0, ..., 0)′. For k = 3, the mean of the third group was placed at (2.5, 4.33, 0, ..., 0)′ so that the means of the three groups formed an equilateral triangle with sides of length 5. The results were obtained using the 5% level of significance for the chi-squared criterion.
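The null configuration just described is straightforward to generate; a sketch follows (equal group sizes are our assumption, since the text does not state how n was split between groups).

    import numpy as np

    def simulate_null(n, p, k, rng):
        """k identity-covariance groups with the Table 1 means; no outliers."""
        means = [np.zeros(p), np.r_[5.0, np.zeros(p - 1)]]
        if k == 3:
            means.append(np.r_[2.5, 4.33, np.zeros(p - 2)])  # equilateral triangle
        sizes = np.full(k, n // k)
        sizes[: n % k] += 1                  # split n as evenly as possible
        return np.vstack([rng.normal(size=(s, p)) + mu
                          for s, mu in zip(sizes, means)])

    X = simulate_null(n=100, p=5, k=3, rng=np.random.default_rng(1))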
for the starting basic subset, we examined only n = 100 points. The covariance matrix was the identity matrix for the first group, and also for the third if k = 3, while for the second group the variance in the first
zero and that of the second was (0, 8.66, 0, ..., 0)′, so that the
The third
Table 1. Percentage of 1000 sets of n = 50 or n = 100 simulated data points in which any outliers were declared at the nominal 5% level of significance: equal covariances assumed
(p = dimensions, k = groups, m = size of initial basic subset)

                          n = 50                      n = 100
                  No. of outliers detected    No. of outliers detected
  p    k    m       0       1      2+           0       1      2+
  2    2   10     97.0%   2.8%   0.0%         96.3%   3.5%   0.2%
           20     96.8    3.1    0.1          95.7    4.1    0.2
           30     96.8    3.1    0.1          95.8    4.0    0.2
       3   10     97.6    2.3    0.1          97.0    3.0    0.0
           20     98.2    1.7    0.1          96.5    3.5    0.0
           30     96.9    2.9    0.2          96.6    3.4    0.0
  5    2   15     97.2    2.7    0.1          96.1    3.9    0.0
           20     98.1    1.9    0.0          97.5    2.4    0.1
           30     97.6    2.3    0.1          97.5    2.5    0.0
       3   15     95.2    4.6    0.2          96.8    3.1    0.1
           20     97.8    2.2    0.0          97.0    2.9    0.1
           30     96.8    3.1    0.0          97.3    2.6    0.1
  8    2   20     98.4    1.6    0.0          98.3    1.6    0.1
           30     99.0    1.0    0.0          98.4    1.6    0.0
       3   20     96.9    3.0    0.1          97.5    2.3    0.2
           30     97.6    2.4    0.0          98.3    1.6    0.1
Table 2. Percentage of 1000 sets of n = 100 simulated data points in which any outliers were declared at the nominal 5% level of significance: unequal covariances
group for k = 3. In the case of two groups, the slippages were (−x, 0)′, (0, x)′ and (0, −x)′ (and zero in the remaining dimensions if p > 2) in the first group, and (x, 0)′, (0, x)′ and (0, −x)′ in the second group, with x = 4 and x = 6 for equal and unequal covariances, respectively. In the case of three groups, the slippages were (−x, 0)′ and (−y, y)′ in the first group, (x, 0)′ and (y, y)′ in the second, and (y, −y)′ and (−y, −y)′ in the third, where x = 6 and y = 4.24 for p = 2 or 5, and the same values multiplied by 1.5 for p = 8 (otherwise detection probabilities became very low). These combinations of values were chosen to give a symmetric arrangement of outliers relative to groups, although the symmetry does not hold with respect to generalized distance for unequal covariances. The results in Table 3 serve to show that both versions of the algorithm work well, despite the apparently lower power for k = 3, p = 8, which is due to the greater conservatism of the test in this case.
It is possible that the performance of our algorithm could be improved by adjusting the correction factors used
in Step 5. We used the same factors as in the single-sample case. These should be applicable when the basic subset is small in size, but not necessarily when it is large. A point will be declared to be an outlier if the
minimum of its distances to the k groups exceeds the criterion, so the factors could possibly be adjusted to
allow for this, although in fact this consideration only seems relevant for outliers that fall ‘in between’ the groups
in the multidimensional space. The correction factors used in the single-group case were derived empirically
from very extensive simulation studies. Given the huge range of possible configurations that need to be
examined, it would be rather difficult to carry out an equivalent study in the case of several groups.
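Under the same assumptions as the sketches above, the attained size can at least be checked empirically. For example (the replication count is reduced from 1000 for brevity, and the medoids are crudely taken as the points nearest the known group centers, which is our shortcut rather than a genuine k-medoids run):

    import numpy as np

    rng = np.random.default_rng(2)
    hits = 0
    for _ in range(200):                  # 1000 replicates in the study
        X = simulate_null(n=100, p=5, k=3, rng=rng)
        centers = [np.zeros(5), np.r_[5.0, np.zeros(4)],
                   np.r_[2.5, 4.33, 0, 0, 0]]
        meds = [np.linalg.norm(X - mu, axis=1).argmin() for mu in centers]
        hits += extended_bacon(X, medoids=meds, m=20, alpha=0.05).size > 0
    print(f'any outliers declared in {100 * hits / 200:.1f}% of replicates')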
CHAPTER FOUR
Conclusions
detect multiple outliers in data containing several groups, as the original BACON algorithm did for one group, without requiring almost 100 points that we are sure are not outliers in order to start the estimation. Few of the data sets listed in Table 1 of Baxter (1995) have more than 100 points and 12 of them do not even have 50 points, while p ranges from seven to 20. In several of these examples there would be too few observations to apply even the equal-covariances version of the method, let alone to estimate separate covariance matrices. The extended BACON algorithm, by contrast, requires only a small initial subset of points that are assumed to be free of outliers.
References
Barnett, V. & Lewis, T. (1994) Outliers in Statistical Data, 3rd edn (Chichester:
Wiley).
Billor, N., Hadi, A.S. & Velleman, P.F. (2000) BACON: blocked adaptive computationally efficient outlier nominators, Computational Statistics & Data Analysis, 34, pp. 279–298.
Hadi, A.S. (1992) Identifying multiple outliers in multivariate data, Journal of the Royal Statistical Society, Series B, 54, pp. 761–771.
Hardin, J. & Rocke, D.M. (2004) Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator, Computational Statistics & Data Analysis, 44, pp. 625–638.
Prentice Hall).
Mosteller, F. & Tukey, J.W. (1977) Data Analysis and Regression (Reading,
MA: Addison-Wesley).
Rousseeuw, P.J. & van Driessen, K. (1999) A fast algorithm for the minimum covariance determinant estimator, Technometrics, 41, pp. 212–223.
Rousseeuw, P.J. & van Zomeren, B. (1990) Unmasking multivariate outliers and leverage points, Journal of the American Statistical Association, 85, pp. 633–639.
Sain, S.R., Gray, H.L., Woodward, W.A. & Fisk, M.D. (1999) Outlier detection from a mixture distribution when training data are unlabeled, Bulletin of the Seismological Society of America, 89, pp. 294–304.
Venables, W.N. & Ripley, B.D. (1999) Modern Applied Statistics with S-PLUS, 3rd edn (New York: Springer).
Wang, S., Woodward, W.A., Gray, H.L., Wiechecki, S. & Sain, S.R. (1997) A new test for outlier detection from a multivariate mixture distribution, Journal of Computational and Graphical Statistics, 6, pp. 285–299.