
TITLE PAGE

Robust Detection of Multiple Outliers in a Multivariate Data Set


ABSTRACT

Many methods have been developed for detecting multiple outliers in a

single multivariate sample, but very few for the case where there

may be groups in the data set. We propose a method of

simultaneously determining groups (as in cluster analysis) and

detecting outliers, which are points that are distant from every

group. Our method is an adaptation of the BACON algorithm

proposed by Billor, Hadi and Velleman for the robust detection of

multiple outliers in a single group of multivariate data. There are two

versions of our method, depending on whether or not the groups can

be assumed to have equal covariance matrices. The effectiveness of

the method is illustrated by its application to two real data sets and

further shown by a simulation study for different sample sizes and

dimensions for 2 and 3 groups, with and without planted outliers in

the data. When the number of groups is not known in advance, the

algorithm could be used as a robust method of cluster analysis, by

running it for various numbers of groups and choosing the best

solution.
CHAPTER ONE

Introduction

The enormous extent of the literature on cluster analysis shows how

important it is in many different fields of science to be able to find

groups of objects on the basis of multidimensional measurements.

To take just one example, it is common in archaeology to come

across papers dealing with the clustering of a set of objects (for

example, pieces of pottery or glass on the basis of their chemical

composition) in order to identify those with a common place of

origin or manufacture. According to Baxter (1999), outliers are

typically present in data of this type and tend to cause problems in the

application of standard statistical procedures. Hence, it is desirable

to identify outliers. However, ‘much statistical methodology dealing

with the detection of such outliers is not well suited to archaeometric

data that, in the event, consists of two or more groups’ (Baxter,

1999, p. 321). Our purpose in the present study is to develop a

method of detecting multivariate outliers that can be applied to data

that are expected to have a group structure, although the details of

this grouping are not known beforehand. Outliers in this context are

points that are remote from every group.

Many methods of detecting outliers in multivariate data have


been proposed in the literature, almost all of them applicable to a

single sample (Barnett & Lewis, 1994;

Rocke & Woodruff, 1996; Penny & Jolliffe, 2001). Caroni (1998)

extended the application of Wilks’ well-known test statistic for a

single outlier to the case of samples from several subpopulations

with a common covariance matrix, but she assumed that the number

of populations was known. Thus, her results are not applicable to

the situation in which it is also necessary to carry out a cluster

analysis, with an unknown number of groups. A method given by

Wang et al. (1997) for detecting multiple outliers from a mixture

distribution was also restricted, since it required the existence of a

set of points whose group membership was known, although Sain et

al. (1999) removed this restriction. Recently, Hardin & Rocke (2004)

have proposed another method for the problem we are considering

here. The only other methods in the literature that appear to have the

same purpose as the one we introduce here are those of Glascock

(1992) and Beier & Mommsen (1994).

It is important that the method should be able to detect any number

of outliers, not just a prespecified number. A procedure for doing

this in a single group, using Wilks’ statistic, was given by Caroni
& Prescott (1992) but it is not obvious how it can be extended to the

case examined here. Instead, we propose a method based on a

computationally efficient robust method of outlier detection in a

single group, given the name BACON by Billor et al. (2000). We

extend the BACON algorithm to grouped multivariate data below.

The application of the proposed method is illustrated using two

published real data sets and a simulation study of its performance for

various configurations of data is carried out.


CHAPTER TWO

Robust Detection of Multivariate Outliers

Wilks’ statistic, widely used for detecting multivariate outliers, is

equivalent to using the Mahalanobis distances of the n sample points

xi from the sample mean x̄,

$d_i(\bar{x}, S) = (x_i - \bar{x})'\, S^{-1} (x_i - \bar{x})$

where S is the covariance matrix of the entire sample or – in the case

of the multiple outlier version (Caroni & Prescott, 1992) – a sample

that has been reduced by omitting points that are even further from

the mean than the one currently being tested. Points sufficiently

distant from the mean (in relation to percentage points of an F

distribution) are declared to be outliers. An obvious danger is that of

masking: if there are actually more outliers than the number being

tested for, then the covariance matrix will be inflated by these extra

outliers, thus reducing the distances di and making it less likely that

outliers will be declared. Furthermore, it is possible that distortion of

the covariance matrix by outliers may lead to the candidate outliers

being tested in the wrong order in the sequential version of the

procedure.
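To make the classical screening concrete, the following sketch (in Python with NumPy and SciPy, which are not part of the original study) computes the squared Mahalanobis distances and compares them with an F-based critical value of the kind described above. The cut-off shown is one standard choice, with a Bonferroni adjustment over the n points; it is an illustration of the idea rather than the exact sequential rule of Caroni & Prescott (1992).

```python
import numpy as np
from scipy import stats

def mahalanobis_sq(X):
    """Squared Mahalanobis distances of each row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    diffs = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # covariance of the entire sample
    return np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)

def classical_outlier_flags(X, alpha=0.05):
    """Flag points whose squared distance exceeds an F-based critical value.

    The cut-off uses the null distribution of the within-sample squared
    Mahalanobis distance (a scaled Beta, expressed through an F quantile),
    Bonferroni-adjusted over the n points; illustrative only.
    """
    n, p = np.asarray(X).shape
    d2 = mahalanobis_sq(X)
    f = stats.f.ppf(1 - alpha / n, p, n - p - 1)
    cutoff = p * (n - 1) ** 2 * f / (n * (n - p - 1 + p * f))
    return d2 > cutoff, d2
```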
These considerations make it desirable to construct a robust

version of the multivariate outlier detecting procedure, as can be

done in other multivariate analyses such as principal components

analysis (Campbell, 1980; Caroni, 2000; Hubert et al., 2005). The

Mahalanobis distance is thus replaced by a robust distance

$RD_i^2 = (x_i - T(X))'\, (C(X))^{-1} (x_i - T(X))$

where T (X) and C(X) are robust estimators of the center and the

covariance matrix, respec- tively. Various estimators have been

proposed in the literature, such as those based on the minimum

volume ellipsoid (MVE) (Rousseeuw & van Zomeren, 1990) and the

minimum covariance determinant (Rousseeuw & van Driessen,

1999). These estimators have the desirable properties of high

breakdown points and affine invariance. However, their

implementation posed considerable practical problems since the amount of

computation required
to be sure of reaching the solution was, in many cases, infeasible.
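As an illustration of replacing x̄ and S by robust estimators T(X) and C(X), the sketch below uses scikit-learn's MinCovDet implementation of the FAST-MCD algorithm of Rousseeuw & van Driessen (1999), which made the MCD practical for routine use. The function name and the chi-squared cut-off are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy import stats
from sklearn.covariance import MinCovDet

def robust_distances_mcd(X, alpha=0.05, random_state=0):
    """Robust squared distances RD_i^2 with T(X), C(X) taken from an MCD fit."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    mcd = MinCovDet(random_state=random_state).fit(X)   # FAST-MCD location and scatter
    rd2 = mcd.mahalanobis(X)                            # (x_i - T(X))' C(X)^{-1} (x_i - T(X))
    cutoff = stats.chi2.ppf(1 - alpha, df=p)            # common chi-squared(p) rule of thumb
    return rd2, rd2 > cutoff
```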

The BACON multiple outlier detection method proposed by Billor et

al. (2000) is based on the methods of Hadi (1992, 1994), which

aimed for efficient computation of the MVE. One version of BACON

is nearly affine equivariant, has a high breakdown point (upwards of

40%), and nevertheless is computationally efficient even for very

large datasets. Another version is affine equivariant at the expense of

a somewhat lower breakdown point (about 20%), but with the

advantage of even lower computational cost.

The BACON Algorithm

The following steps summarize the general BACON algorithm for

detecting outliers in a single sample of n independent p-dimensional

observations (Billor et al., 2000).

Step 1. Using either of the following two versions of the method, determine an initial basic subset of m > p
observations where m is an integer chosen by the data analyst: m = cp with c = 3, 4 or 5 is suggested.
Version 1: Initial Subset Selected Based on Mahalanobis Distances
For all i = 1,...,n points, compute the Mahalanobis distances

$d_i(\bar{x}, S) = (x_i - \bar{x})'\, S^{-1} (x_i - \bar{x})$

where x̄ and S are the mean and covariance matrix, respectively, of all n observations. The m observations with
the smallest values of di(x̄, S) form the basic subset.
Version 2: Initial Subset Selected Based on Distances from the Medians
The basic subset comprises the m observations with the smallest values of ‖xi − M‖, where M is a vector
containing the coordinate-wise medians, xi is row i of X, and ‖·‖ is the vector norm.
Step 2. Compute all n discrepancies $d_i(\bar{x}_b, S_b) = (x_i - \bar{x}_b)'\, S_b^{-1} (x_i - \bar{x}_b)$, where x̄b and
Sb are the mean and the covariance matrix, respectively, of the basic subset.

Step 3. Determine a new basic subset consisting of the observations with the smallest discrepancies,
specifically those satisfying $d_i(\bar{x}_b, S_b) < c_{npr}\,\chi_{p,\alpha/n}$, where $\chi^2_{p,\alpha}$ is the $100(1-\alpha)$
percentile of the chi-squared distribution with p degrees of freedom, and $c_{npr} = c_{np} + c_{hr}$
is a variance inflation factor, with $c_{hr} = \max\{0, (h - r)/(h + r)\}$ and

$$c_{np} = 1 + \frac{p + 1}{n - p} + \frac{1}{n - h - p}$$

where r is the size of the current basic subset and $h = \lfloor (n + p + 1)/2 \rfloor$.

Step 4. Return to Step 2, stopping the iterative procedure when the

basic subset no longer grows. Outliers are those points, if any, still

remaining outside the basic subset when the iterative procedure

terminates.
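A minimal sketch of the single-sample steps above is given below, assuming Python with NumPy and SciPy; the function name bacon and the lack of guards for degenerate cases (for example, a basic subset too small to give a stable covariance estimate) are my own simplifications.

```python
import numpy as np
from scipy import stats

def bacon(X, c=4, alpha=0.05, version=2, max_iter=100):
    """Sketch of the single-sample BACON algorithm (Billor et al., 2000).

    version=1 starts from Mahalanobis distances (affine equivariant);
    version=2 starts from distances to the coordinate-wise median (more robust).
    Returns a boolean outlier mask and the final basic-subset mask.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    m = min(c * p, n)                                   # initial basic-subset size, m = cp

    # Step 1: initial basic subset
    if version == 1:
        diffs = X - X.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(X, rowvar=False))
        d = np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)
    else:
        d = np.linalg.norm(X - np.median(X, axis=0), axis=1)
    basic = np.zeros(n, dtype=bool)
    basic[np.argsort(d)[:m]] = True

    h = (n + p + 1) // 2
    chi = np.sqrt(stats.chi2.ppf(1 - alpha / n, df=p))  # chi_{p, alpha/n}

    for _ in range(max_iter):
        # Step 2: discrepancies from the basic subset's mean and covariance
        xb = X[basic].mean(axis=0)
        Sb_inv = np.linalg.inv(np.cov(X[basic], rowvar=False))
        diffs = X - xb
        disc = np.sqrt(np.einsum('ij,jk,ik->i', diffs, Sb_inv, diffs))

        # Step 3: new basic subset, using the variance inflation factor c_npr
        r = basic.sum()
        c_np = 1 + (p + 1) / (n - p) + 1 / (n - h - p)
        c_hr = max(0.0, (h - r) / (h + r))
        new_basic = disc < (c_np + c_hr) * chi

        # Step 4: stop when the basic subset no longer grows
        if new_basic.sum() <= basic.sum():
            break
        basic = new_basic

    return ~basic, basic                                 # outliers are points left outside
```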

Billor et al. (2000) presented the results of simulation studies that

demonstrate the success of this method. It has high power for

detecting outliers and the solution is reached in very few iterations.

The speed of computation is a great advantage over earlier methods.


The Extended BACON Algorithm for Identification of Outliers in Grouped

Data

We now examine the problem of detecting outliers from a

multivariate mixture distribution of several populations. We propose

an extended version of the BACON algorithm for use when the

sample comprises two or more subgroups. Since we will not assume

that we know which group each observation belongs to, we need to

identify subgroups and outliers at the same time. The number k of

subgroups will be chosen in advance, but the analysis can be

repeated for different values of k in order to find the best solution, as

is done in many conventional methods of cluster analysis.

We develop two versions of the method. One allows a different

covariance matrix for each group, and hence a basic subset will be

formed in each group separately. The other version assumes a

common covariance matrix and one basic subset will be formed,

comprising observations from different groups.

The first step of the algorithm for both cases, that is, with equal or

unequal covariances, requires first finding the centers of k subgroups

where k has been chosen in advance. Of the various techniques in the

literature for finding the centers of k clusters, the most commonly


used seem to be the k-means and k-medoids methods. Since the

former is known to be sensitive to outliers, we prefer the more robust

k-medoids method (Kaufman & Rousseeuw, 1990; Venables &

Ripley, 1999). The idea of this method is to select k representative

points (objects) from the data set. The corresponding clusters are

then found by assigning each remaining point to the nearest

representative object. The representative objects must be chosen so

that they are centrally located in the clusters they represent. In other

words, the average distance of the representative object to all other

objects of the same cluster is minimized. Such an optimal

representative object is called the medoid of its cluster.
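The sketch below illustrates the k-medoids idea with a simple alternating ("Voronoi-style") update rather than the full PAM swap procedure of Kaufman & Rousseeuw (1990); it is intended only to show how medoid centres can be obtained for the first step of the extended algorithm, and the function name is my own.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, random_state=0):
    """Small alternating k-medoids sketch (not the full PAM swap algorithm).

    Medoids are actual data points chosen to minimise the total distance of
    cluster members to their medoid; each point is then assigned to its
    nearest medoid.  Returns medoid indices and cluster labels.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rng = np.random.default_rng(random_state)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)

    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)        # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):                               # update each medoid
            members = np.where(labels == j)[0]
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids

    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels
```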

After finding these starting points and starting clusters, we then

form the basic subset. If we allow a different covariance matrix in

each group, then a basic subset is formed in every cluster separately,

just as in the single-group BACON algorithm. If we assume equal

covariance matrices, then we require a pooled within-group

estimate, which we obtain by choosing for the basic subset those

points nearest to their respective centers, whichever group they

come from.

The extended BACON algorithm can be summarized in the following steps.

Step 1. Start the analysis by using the k-medoids method to find the
centers of k subgroups, where k has been chosen in advance.

Step 2. Compute all the Euclidean distances dij between points i =

1,...,n and centers j = 1,..., k. Form groups by allocating each

point to its nearest center, by identifying di = minj dij.

Step 3. Form the basic subset. In the equal covariance version, we

take the m points with smallest distances from their corresponding

center, d(1), ..., d(m), irrespective of which group they belong to. In the

unequal covariance version, we take in each group the m points with

the smallest distance from their center.

Step 4. Using the covariance matrix or matrices of the basic subset

or subsets, compute all the Mahalanobis distances between the n

points and the means of the k groups’ basic subsets

$D_{ij} = (x_i - \bar{x}_{jb})'\, S_{jb}^{-1} (x_i - \bar{x}_{jb}), \qquad i = 1, \ldots, n;\; j = 1, \ldots, k$

where x̄jb and Sjb are the mean and covariance, respectively, of the

basic subset in group j ; if the covariances are assumed to be equal,

then we use Sjb = Sb, a pooled within-group estimator. Now

reallocate each point to its nearest center by identifying Di = minj Dij.
Step 5. Enlarge the basic subset(s) by including all points with Di <

$c_{npr}\,\chi_{p,\alpha/n}$, where the variance inflation factor is defined in Step 3 of

the single-sample version of the algorithm.

Step 6. Return to Step 4, stopping the procedure when the sizes of

the basic subsets do not change. Points remaining outside the basic

subset(s), if any, are outliers.
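A sketch of the extended algorithm follows, reusing the k_medoids sketch above; the equal-covariance version grows a single pooled basic subset, while equal_cov=False keeps a separate basic subset and covariance matrix per group. The treatment of r in the inflation factor for the unequal-covariance case, and the absence of guards for groups that lose all basic-subset members, are my own simplifications.

```python
import numpy as np
from scipy import stats

def extended_bacon(X, k, m=None, alpha=0.05, equal_cov=True, max_iter=100, random_state=0):
    """Sketch of the extended BACON algorithm for grouped multivariate data.

    Relies on the k_medoids sketch defined earlier.
    Returns an outlier mask and the final group labels.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    m = m if m is not None else 3 * p
    h = (n + p + 1) // 2
    chi = np.sqrt(stats.chi2.ppf(1 - alpha / n, df=p))

    # Steps 1-2: k-medoid centres and Euclidean allocation
    medoids, labels = k_medoids(X, k, random_state=random_state)
    d = np.linalg.norm(X - X[medoids][labels], axis=1)

    # Step 3: initial basic subset(s)
    basic = np.zeros(n, dtype=bool)
    if equal_cov:
        basic[np.argsort(d)[:m]] = True                  # m smallest distances overall
    else:
        for j in range(k):
            members = np.where(labels == j)[0]
            basic[members[np.argsort(d[members])[:m]]] = True   # m per group

    for _ in range(max_iter):
        # Step 4: Mahalanobis distances to each group's basic-subset mean
        means = [X[basic & (labels == j)].mean(axis=0) for j in range(k)]
        if equal_cov:
            centred = np.vstack([X[basic & (labels == j)] - means[j] for j in range(k)])
            S_pooled_inv = np.linalg.inv(centred.T @ centred / (len(centred) - k))
        D = np.empty((n, k))
        for j in range(k):
            if equal_cov:
                Sj_inv = S_pooled_inv
            else:
                Sj_inv = np.linalg.inv(np.cov(X[basic & (labels == j)], rowvar=False))
            diffs = X - means[j]
            D[:, j] = np.sqrt(np.einsum('ij,jk,ik->i', diffs, Sj_inv, diffs))
        labels = np.argmin(D, axis=1)                    # reallocate to nearest centre
        Di = D[np.arange(n), labels]

        # Step 5: enlarge the basic subset(s) using the BACON inflation factor
        r = basic.sum()
        c_np = 1 + (p + 1) / (n - p) + 1 / (n - h - p)
        c_hr = max(0.0, (h - r) / (h + r))
        new_basic = Di < (c_np + c_hr) * chi

        # Step 6: stop when the basic-subset sizes no longer change
        if new_basic.sum() == basic.sum():
            basic = new_basic
            break
        basic = new_basic

    return ~basic, labels
```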

Examples of Applications

For numerical illustration using actual data, we take two sets of real

data, for both of which the previously published analyses suggest

that there may be outliers. We will not carry out detailed analysis of

the data, but simply demonstrate how easily our method reveals

outliers. The first set consists of five socioeconomic measures for

each of 47 Swiss provinces (Mosteller & Tukey, 1977). These data

are available within the S-PLUS package. The cluster analyses and

principal components analysis given by Venables & Ripley (1999)

suggest that there are three main groups in the data, with one clear

outlier (point number 45). Figure 1 shows the data plotted in the

space of the first two principal components, with this point
marked. The extended BACON algorithm with k = 3 (and p = 5)

identifies it as an outlier (at the 5% significance level) in just two

rounds of iteration.
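A hypothetical usage illustration follows; the variable swiss_x and the way it is loaded are assumptions (the five socioeconomic measures correspond to the swiss.x data of S-PLUS), and extended_bacon is the sketch given earlier, not the authors' actual program.

```python
import numpy as np

# swiss_x: hypothetical 47 x 5 NumPy array holding the five socioeconomic
# measures for the Swiss provinces, assumed to have been loaded beforehand.
outliers, labels = extended_bacon(swiss_x, k=3, alpha=0.05, equal_cov=True)
print("flagged provinces (1-based):", np.where(outliers)[0] + 1)
```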

As a second example, we consider a set of data consisting of the

measurements of seven characteristics on 76 young bulls of three


breeds (Johnson & Wichern, 2002). The data are plotted in the space
of the first two principal components in Figure 2. Attention is drawn

to points 16 and 51, which are possible outliers. In fact, our method

identifies point 51 as an outlier at the 5% level for k = 3 or 4 groups

and at the 7% level for k = 2, whereas none of the solutions

identifies point 16 as an outlier even at the 10% level. Even for k = 4

and p = 7, results were obtained effectively immediately on an

up-to-date personal computer. This example shows that the algorithm

works without difficulty even in a case where the groups are not

clearly separated.
Figure 1. Data set consisting of five socioeconomic measures on 47

Swiss provinces (taken from Venables & Ripley, 1999), plotted in

the space of the first two principal components. Point number 45 is

identified as an outlier by our method


Figure 2. Data set consisting of seven characteristics measured on

76 young bulls of three breeds, plotted in the space of the first two

principal components. Points 16 and 51 appear to be possible

outliers. The remaining points are labeled as 1, 2 or 3 corresponding

to the breed (labeled 1, 5 and 8 respectively in the original: Johnson

& Wichern, 2002)


CHAPTER THREE

Simulation Study

We carried out a simulation study of the performance of the extended


BACON algorithm for various configurations of data, with and

without outliers. We used k = 2 or 3 groups, p = 2, 5 or 8 dimensions,

and various choices of the size of the initial subset m, depending on

p. For the equal-covariances version of the algorithm, we generated

n = 50 or 100 points and for the unequal-covariances version n = 100

points, divided equally between the k groups. The algorithms were

implemented in a program written in double precision in Microsoft

FORTRAN, obtaining pseudorandom multivariate normal deviates

from the IMSL routine DRNMVN.

Initially we examined data generated from multivariate normal


distributions without any outliers. In Table 1, we show results

obtained using the equal-covariances version of the algorithm, with

a total of n = 50 or n = 100 points, equally divided between the groups.

In all cases, the covariance matrix used for data generation was the

identity, the mean of the first group was zero and the mean of the

second group was at (5, 0, ..., 0)′. For k = 3, the mean of the third

group was placed at (2.5, 4.33, 0, ..., 0)′ so that the means of the

three groups formed an equilateral triangle. The results in Table 1


indicate that the algorithm was rather conservative in detecting

outliers. The degree of conservatism increased somewhat as p

increased, but remained quite moderate so that the performance of

the test in the absence of outliers appears to be satisfactory. These

results were obtained using the 5% level of significance for the chi-

squared comparisons in the algorithm; results were also obtained for

the 1% level and show a similar picture.
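The data configuration just described can be reproduced with a short routine such as the following; the function name and the use of NumPy in place of the FORTRAN/IMSL code actually used are assumptions.

```python
import numpy as np

def simulate_groups(n=100, p=5, k=3, seed=0):
    """Generate n points split equally among k multivariate normal groups.

    Identity covariance in every group; means at (0, ..., 0), (5, 0, ..., 0)
    and, for k = 3, (2.5, 4.33, 0, ..., 0), matching the configuration in the text.
    """
    rng = np.random.default_rng(seed)
    means = np.zeros((k, p))
    means[1, 0] = 5.0
    if k == 3:
        means[2, :2] = [2.5, 4.33]
    per_group = n // k
    X = np.vstack([rng.multivariate_normal(mu, np.eye(p), size=per_group) for mu in means])
    labels = np.repeat(np.arange(k), per_group)
    return X, labels
```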

Table 2 shows equivalent results for the unequal-covariances


version of the algorithm. In this case, since more points are needed

for the starting basic subset, we only examined n = 100 points. The

covariance matrix was the identity matrix for the first group, and also

the third if k = 3, while for the second group the variance in the first

dimension was increased to 2. The mean of the first group was at

zero and of the second was (0, 8.66, 0, ..., 0)′ so that the

generalized distance was similar to that used with equal covariances.

The third
Table 1. Percentage of 1000 sets of n = 50 or n = 100 simulated data points in which any outliers were
declared at the nominal 5% level of significance: equal covariances assumed

                                          n = 50                       n = 100
Dimensions  Groups  Basic subset   No. of outliers detected     No. of outliers detected
p           k       m              0      1      2+             0      1      2+
2           2       10             97.0%  2.8%   0.0%           96.3%  3.5%   0.2%
                    20             96.8   3.1    0.1            95.7   4.1    0.2
                    30             96.8   3.1    0.1            95.8   4.0    0.2
            3       10             97.6   2.3    0.1            97.0   3.0    0.0
                    20             98.2   1.7    0.1            96.5   3.5    0.0
                    30             96.9   2.9    0.2            96.6   3.4    0.0
5           2       15             97.2   2.7    0.1            96.1   3.9    0.0
                    20             98.1   1.9    0.0            97.5   2.4    0.1
                    30             97.6   2.3    0.1            97.5   2.5    0.0
            3       15             95.2   4.6    0.2            96.8   3.1    0.1
                    20             97.8   2.2    0.0            97.0   2.9    0.1
                    30             96.8   3.1    0.0            97.3   2.6    0.1
8           2       20             98.4   1.6    0.0            98.3   1.6    0.1
                    30             99.0   1.0    0.0            98.4   1.6    0.0
            3       20             96.9   3.0    0.1            97.5   2.3    0.2
                    30             97.6   2.4    0.0            98.3   1.6    0.1

Table 2. Percentage of 1000 sets of n = 100 simulated data points in which any outliers were declared at
the nominal 5% level of significance: unequal covariances

Dimensions  Groups  Basic subset   No. of outliers detected
p           k       m              0      1      2+
2           2       10             96.6%  3.1%   0.3%
                    20             96.4   3.5    0.1
                    30             96.2   3.6    0.2
            3       10             94.1   5.2    0.7
                    20             95.8   4.1    0.1
                    30             94.8   5.0    0.2
5           2       15             96.4   3.5    0.1
                    20             96.8   3.1    0.1
                    30             95.7   3.8    0.5
            3       15             97.0   2.9    0.1
                    20             96.7   3.1    0.2
                    30             96.2   3.6    0.2
8           2       20             97.7   2.2    0.1
                    30             97.9   2.1    0.0
            3       20             99.6   0.4    0.0
                    30             99.5   0.5    0.0
group’s mean was again placed to form an equilateral triangle. The results in Table 2 are basically the same as
in Table 1, although perhaps more conservative for k = 3 and p = 8. We next examined the performance of the
algorithm in data in which outliers had been planted. We kept the same configurations of group means and
covariances as above. Because there are so many possible configurations of the number and location of outliers,
it is only feasible to give the illustrative results of Table 3. Here we generated six outliers by adding
slippages to the first three points of each group for k = 2, or the first two points of each
Table 3. Performance of the extended BACON algorithm on data containing six planted outliers: 1000
simulations, n = 100, m = 30. For location of outliers, see text

                              Percentage of planted outliers detected
Dimensions  Groups    Equal covariance matrices    Unequal covariance matrices
p           k
2           2         40.60                        35.43
            3         38.47                        27.70
5           2         16.43                        14.35
            3         14.82                        10.03
8           2         55.03                        43.82
            3         54.97                        19.70

group for k = 3. In the case of two groups, the slippages were (−x, 0)′, (0, x)′ and (0, −x)′ (and zero in the remaining
dimensions if p > 2) in the first group, and (x, 0)′, (0, x)′ and (0, −x)′ in the second group, with x = 4 and x = 6 for
equal and unequal covariances, respectively. In the case of three groups, the slippages were (−x, 0)′ and (−y, y)′ in
the first group, (x, 0)′ and (y, y)′ in the second, and (y, −y)′ and (−y, −y)′ in the third, where x = 6, y = 4.24 for p = 2 or 5
and the same values multiplied by 1.5 for p = 8 (otherwise detection probabilities became very low). These
combinations of values were chosen to give a symmetric arrangement of outliers relative to the groups, although
the symmetry does not hold with respect to generalized distance for unequal covariances. The results in Table 3
serve to show that both versions of the algorithm work well, despite the apparently lower power for k = 3, p = 8,
which is due to the greater conservatism of the test in this case.
It is possible that the performance of our algorithm could be improved by adjusting the correction factors used
in Step 5. We used the same factors as in the single-sample case. These should be applicable when the basic
subset is small in size, but not necessarily when it is large. A point will be declared to be an outlier if the
minimum of its distances to the k groups exceeds the criterion, so the factors could possibly be adjusted to
allow for this, although in fact this consideration only seems relevant for outliers that fall ‘in between’ the groups
in the multidimensional space. The correction factors used in the single-group case were derived empirically
from very extensive simulation studies. Given the huge range of possible configurations that need to be
examined, it would be rather difficult to carry out an equivalent study in the case of several groups.
CHAPTER FOUR

Conclusions

The simulation studies show that our extended BACON algorithm

works as well for identifying outliers against the background of

several groups as the original BACON algorithm did for one group,

at least over the range of data


configurations we have examined in
our simulation study. The extended algorithm has two versions,

according to whether equal within-group covariance matrices are

assumed or not. The advantage of the equal-covariance matrix version

is that the estimate of Sb, the covariance matrix of a basic subset, is a

pooled estimate and thus we require only a single subset of size m to

start the algorithm, instead of k subsets each of size m for unequal

covariances. In the latter case, starting with km points in the initial

basic subset and taking m = 3p as a minimum requirement, implies

that at least 3kp points are needed. Thus, if p = 8 and k = 4, we need

almost 100 points that we are sure are not outliers in order to start the

procedure. Many real data sets are much smaller than

this. For example, only three of the 20 archaeological examples

listed in Table 1 of Baxter (1995) have more than 100 points and 12

of them do not even have 50 points, while p ranges from seven to 20.
In several of these examples, applying even the equal-covariances

version of our algorithm would be problematic. In general, however,

we do not think that assuming equal covariances is a sound strategy

for exploratory data analysis. Furthermore, the results of Caroni

(1998) suggest that it is a dangerous assumption in our particular

application of detecting outliers from groups in multivariate data.

We mentioned in the Introduction four existing methods of

detecting multiple outliers in grouped multivariate data. The one

developed by Wang et al. (1997) initially assumed that a set of data

was available with known group membership, as in discriminant

analysis, in order to have good starting values for their iterative

method of estimation. Sain et al. (1999) removed this restriction by

using conventional clustering algorithms to obtain an initial

empirical grouping to provide starting values. The technique is based

on normal likelihoods and therefore might be susceptible to the kind

of problems the robust BACON algorithm avoids. They also limited

their detailed investigation to the mixture with two components,

which corresponds to the practical problem behind their investigation,

namely, distinguishing nuclear blasts (the outliers) from

earthquakes and mining explosions (the two groups) using seismic

data. The method proposed recently by Hardin & Rocke (2004) is

similar to ours in that it seeks to define outliers by their Mahalanobis


distances with respect to robust estimates of location and dispersion

of each group. It operates separately on each group so does not have

the option of equal covariances that we have in our method. The

other two methods of detecting outliers can be found in the

archaeometric literature and are not widely known elsewhere. Beier

& Mommsen’s (1994) method is similar to ours in its later stages,

where it uses Mahalanobis distances and χ² measures of group

membership. However, it forms groups sequentially, not

simultaneously as we require. The method of Glascock (1992) does

form groups simultaneously, but it employs an iterative re-allocation

procedure rather like some optimization methods of cluster

analysis. It does not ‘grow’ groups in the fashion of BACON.

Neither of these methods has a version for equal covariance

matrices.

If the number of groups k is not known beforehand, but will be

determined by the analysis, then our algorithm works as a method of

cluster analysis. The analysis must be repeated for different choices

of k, and the ‘best’ solution chosen. As in other methods of cluster

analysis, the selection could be based on various measures of within-

groups homogeneity or between-groups heterogeneity. Since our

method excludes outliers, it should work better than other methods,

which generally do not detect outliers effectively (Baxter, 1999) and


whose results must therefore tend to be distorted by the presence of

outliers.
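For completeness, one possible way of automating this choice of k is sketched below; it reuses the extended_bacon sketch given earlier, and the average silhouette width from scikit-learn is only one illustrative criterion, not one prescribed by the text.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def choose_k(X, k_values=(2, 3, 4, 5)):
    """Run the extended BACON sketch for several k and keep the best-scoring solution.

    Points flagged as outliers by each run are excluded before scoring, so they
    do not distort the comparison; the average silhouette width is used purely
    as an example of a homogeneity/heterogeneity criterion.
    """
    best = None
    for k in k_values:
        outliers, labels = extended_bacon(X, k=k)
        keep = ~outliers
        if np.unique(labels[keep]).size < 2:
            continue                                   # silhouette needs at least two groups
        score = silhouette_score(X[keep], labels[keep])
        if best is None or score > best[1]:
            best = (k, score, outliers, labels)
    return best          # (k, score, outlier mask, labels) of the chosen solution
```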
References

Barnett, V. & Lewis, T. (1994) Outliers in Statistical Data, 3rd edn (Chichester: Wiley).

Baxter, M.J. (1995) Standardization and transformation in principal component analysis, with applications to archaeometry, Applied Statistics, 44, pp. 513–527.

Baxter, M.J. (1999) Detecting multivariate outliers in artefact compositional data, Archaeometry, 41, pp. 321–338.

Beier, T. & Mommsen, H. (1994) Modified Mahalanobis filters for grouping pottery by chemical composition, Archaeometry, 36, pp. 287–306.

Billor, N., Hadi, A.S. & Velleman, P.F. (2000) BACON: blocked adaptive computationally efficient outlier nominators, Computational Statistics and Data Analysis, 34, pp. 279–298.

Campbell, N.A. (1980) Robust procedures in multivariate analysis. I: robust covariance estimation, Applied Statistics, 29, pp. 231–237.

Caroni, C. (1998) Wilks’ outlier test in more than one multivariate sample, Communications in Statistics – Simulation and Computation, 27, pp. 79–94.

Caroni, C. (2000) Outlier detection by robust principal components analysis, Communications in Statistics – Simulation and Computation, 29, pp. 139–151.

Caroni, C. & Prescott, P. (1992) Sequential application of Wilks’ multivariate outlier test, Applied Statistics, 41, pp. 355–364.

Glascock, M.D. (1992) Characterization of archaeological ceramics at MURR by neutron activation analysis and multivariate statistics, in: H. Neff (ed.) Chemical Characterization of Ceramic Pastes in Archaeology. Monographs in World Archaeology, 7, pp. 11–26 (Madison, WI: Prehistory Press).

Hadi, A.S. (1992) Identifying multiple outliers in multivariate data, Journal of the Royal Statistical Society, Series B, 54, pp. 761–771.

Hadi, A.S. (1994) A modification of a method for the detection of outliers in multivariate samples, Journal of the Royal Statistical Society, Series B, 56, pp. 393–396.

Hardin, J. & Rocke, D.M. (2004) Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator, Computational Statistics and Data Analysis, 44, pp. 625–638.

Hubert, M., Rousseeuw, P.J. & Vanden Branden, K. (2005) ROBPCA: a new approach to robust principal component analysis, Technometrics, 47, pp. 64–79.

Johnson, R.A. & Wichern, D.W. (2002) Applied Multivariate Statistical Analysis, 5th edn (Upper Saddle River, NJ: Prentice Hall).

Kaufman, L. & Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis (New York: John Wiley).

Mosteller, F. & Tukey, J.W. (1977) Data Analysis and Regression (Reading, MA: Addison-Wesley).

Penny, K.I. & Jolliffe, I.T. (2001) A comparison of multivariate outlier detection methods for clinical laboratory safety data, Statistician, 50, pp. 295–307.

Rocke, D.M. & Woodruff, D.L. (1996) Identification of outliers in multivariate data, Journal of the American Statistical Association, 91, pp. 1047–1061.

Rousseeuw, P.J. & van Driessen, K. (1999) A fast algorithm for the minimum covariance determinant estimator, Technometrics, 41, pp. 212–223.

Rousseeuw, P.J. & van Zomeren, B. (1990) Unmasking multivariate outliers and leverage points (with Discussion), Journal of the American Statistical Association, 85, pp. 633–639.

Sain, S.R., Gray, H.L., Woodward, W.A. & Fisk, M.D. (1999) Outlier detection from a mixture distribution when training data are unlabeled, Bulletin of the Seismological Society of America, 89, pp. 294–304.

Venables, W.N. & Ripley, B.D. (1999) Modern Applied Statistics with S-PLUS, 3rd edn (New York: Springer-Verlag).

Wang, S., Woodward, W.A., Gray, H.L., Wiechecki, S. & Sain, S.R. (1997) A new test for outlier detection from a multivariate mixture distribution, Journal of Computational and Graphical Statistics, 6, pp. 285–299.
