Algorithms For Spatial Outlier Detection: Chang-Tien Lu Dechang Chen Yufeng Kou
rithms) should be selected in applications. If the attribute
both algorithms be used.

[Figure 1 plot; axes: X coordinate, Y coordinate, Attribute Value]
Figure 1. A spatial data set. Objects are located in the X-Y plane. The height of each vertical line segment represents the attribute value of each object.

In both Algorithms 1 and 2, once an S-outlier is detected, some corrections are made immediately. These include replacing the attribute value of the outlier by the average attribute value of its neighbors and some subsequent updating computation. The effect of these corrections is to prevent normal points close to the true outliers from being claimed as possible outliers. There is a direct method to reduce the risk of overstating the number of outliers without replacing the attribute value of the detected outlier, as described in Algorithm 3. The method in Algorithm 3 defines the neighborhood function differently. Instead of the average attribute value, g(x_i) is chosen to be the median of the attribute values of the points in NN_k(x_i). The motivation for using the median is that it is a robust estimator of the "center" of a sample.

Algorithm 3 (Median Algorithm)

1. For each spatial point x_i, compute the k nearest neighbor set NN_k(x_i), the neighborhood function g(x_i) = median of the data set {f(x) : x ∈ NN_k(x_i)}, and the comparison function h_i = h(x_i) = f(x_i) − g(x_i).
2. Let µ and σ denote the sample mean and sample standard deviation of the data set {h_1, h_2, ..., h_n}. Standardize the data set and compute the absolute values y_i = |(h_i − µ)/σ| for i = 1, 2, ..., n.
3. For a given positive integer m, let i_1, i_2, ..., i_m be the m indices whose y values are the m largest in {y_1, y_2, ..., y_n}. Then the m S-outliers are x_{i_1}, x_{i_2}, ..., x_{i_m}.

A quick illustration of Algorithms 1, 2, and 3 is to apply them to the data in Figure 1. Table 1 shows the results of the three algorithms with parameters k = m = 3, compared with the existing approaches. As can be seen, all three proposed algorithms accurately detect S1, S2, and S3 as spatial outliers, but the z algorithm, Scatterplot, and Moran Scatterplot falsely identify E1 and E2 as spatial outliers. In this table, the rank of the outliers is defined in an obvious way. For example, in the iterative r and iterative z algorithms, the rank is the order of iterations, while in both the z and median algorithms, the rank is determined by the y value.

      Methods
Rank  Scatterplot  Moran Scatterplot  z Alg.  Iterative z Alg.  Iterative r Alg.  Median Alg.
1     E1           S1                 S1      S1                S1                S1
2     E2           E1                 E1      S2                S2                S2
3     S2           E2                 E2      S3                S3                S3

Table 1. The top three spatial outliers detected by Scatterplot, Moran scatterplot, z, iterative z, iterative r, and median algorithms.

4 Experiments

We empirically compared the detection performance of our proposed methods with the z algorithm by mining a real-life census data set. The experimental results indicate that our algorithms can successfully identify spatial outliers ignored by the z algorithm and can avoid detecting false spatial outliers. In this experiment, we tested various attributes from census data compiled by the U.S. Census Bureau [17]. The attributes tested include population, population density, percent of white persons, percent of black or African American persons, percent of American Indian persons, percent of Asian persons, and percent of female persons. We first ran the four algorithms (z, iterative r, iterative z, and median algorithms) to detect which counties have abnormal population. There are 3192 counties in the USA. We show the top 10 counties which are most likely to be spatial outliers.

Table 2 provides the experimental results for all four spatial outlier detection algorithms. For the top 10 spatial outliers detected by the z, iterative z, and median algorithms, most
of them are the same, in slightly different order. In fact, there are eight spatial outliers in common detected by the three algorithms. Further examination shows that the top 10 counties selected by the iterative z and median algorithms are true outliers. But Ventura Co. in California was falsely detected by the non-iterative z algorithm. This falsely detected outlier was avoided by both the iterative z and median algorithms. The last column of the table shows the top ten candidate outliers from the iterative r algorithm. Although these 10 candidates are true outliers from a practical examination, they are very different from those obtained from the other three methods. This difference is due to the fact that the iterative r algorithm focuses on the ratio between the attribute value and the averaged attribute value of neighbors.

Table 2. The top ten spatial outliers detected by z, iterative z, iterative r, and Median algorithms.

Experimental results from other attributes also show that the iterative algorithms and median method are more accurate than the non-iterative algorithm in terms of falsely detected spatial outliers. For running the algorithms and generating more results, we refer interested readers to [7], where we developed a software package which implements all the existing and proposed algorithms.

5 Conclusion

In this paper we propose three spatial outlier detection algorithms to analyze spatial data: two algorithms based on iteration and one algorithm based on median. The experimental results confirm the effectiveness of our approach in reducing the risk of falsely claiming regular spatial points as outliers, which exists in commonly used detection methodologies. Furthermore, it carries the important bonus of ordering the spatial outliers with respect to their degree of outlierness.

References

[1] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley, New York, 3rd edition, 1994.
[2] R. Haining. Spatial Data Analysis in the Social and Environmental Sciences. Cambridge University Press, 1993.
[3] J. Haslett, R. Brandley, P. Craig, A. Unwin, and G. Wills. Dynamic Graphics for Exploring Spatial Data With Application to Locating Global and Local Anomalies. The American Statistician, 45:234–242, 1991.
[4] D. Hawkins. Identification of Outliers. Chapman and Hall, 1980.
[5] R. Johnson. Applied Multivariate Statistical Analysis. Prentice Hall, 1992.
[6] E. Knorr and R. Ng. Algorithms for Mining Distance-Based Outliers in Large Datasets. In Proc. 24th VLDB Conference, 1998.
[7] C.-T. Lu, H. Wang, and Y. Kou. http://europa.nvc.cs.vt.edu/~ctlu/Project/MapView/index.htm.
[8] A. Luc. Exploratory Spatial Data Analysis and Geographic Information Systems. In M. Painho, editor, New Tools for Spatial Analysis, pages 45–54, 1994.
[9] A. Luc. Local Indicators of Spatial Association: LISA. Geographical Analysis, 27(2):93–115, 1995.
[10] Y. Panatier. Variowin. Software For Spatial Data Analysis in 2D. New York: Springer-Verlag, 1996.
[11] I. Ruts and P. Rousseeuw. Computing Depth Contours of Bivariate Point Clouds. Computational Statistics and Data Analysis, 23:153–168, 1996.
[12] S. Shekhar and S. Chawla. A Tour of Spatial Databases. Prentice Hall, 2002.
[13] S. Shekhar, C.-T. Lu, and P. Zhang. Detecting Graph-Based Spatial Outlier: Algorithms and Applications (A Summary of Results). In Proc. of the Seventh ACM-SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, Aug 2001.
[14] S. Shekhar, C.-T. Lu, and P. Zhang. Detecting Graph-Based Spatial Outlier. Intelligent Data Analysis: An International Journal, 6(5):451–468, 2002.
[15] S. Shekhar, C.-T. Lu, and P. Zhang. A Unified Approach to Spatial Outliers Detection. GeoInformatica, An International Journal on Advances of Computer Science for Geographic Information System, 7(2), June 2003.
[16] T. Johnson, I. Kwok, and R. Ng. Fast Computation of 2-Dimensional Depth Contours. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 224–228. AAAI Press, 1998.
[17] U.S. Census Bureau, United States Department of Commerce. http://www.census.gov/.
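
As a supplementary illustration, the three steps of Algorithm 3 (the median algorithm) can be sketched in Python. This is a minimal sketch under stated assumptions, not the software package referenced in [7]: it assumes NumPy, Euclidean distance with a brute-force neighbor search, that a point is excluded from its own neighbor set, and that the sample standard deviation uses the n − 1 divisor. The function name median_outliers and the toy 5 x 5 grid are hypothetical and chosen only for illustration.

```python
import numpy as np

def median_outliers(coords, values, k, m):
    """Sketch of Algorithm 3: rank points by |(h_i - mu) / sigma|.

    coords: (n, d) spatial locations; values: (n,) attribute values f(x_i).
    Returns the indices of the m top-ranked points and the full y array.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)

    # Step 1a: k nearest neighbors by Euclidean distance (assumed metric).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # assumption: x_i is not its own neighbor
    nn = np.argsort(dist, axis=1, kind="stable")[:, :k]

    # Step 1b: g(x_i) = median over NN_k(x_i), h_i = f(x_i) - g(x_i).
    g = np.median(values[nn], axis=1)
    h = values - g

    # Step 2: standardize h and take absolute values.
    sigma = h.std(ddof=1)  # sample standard deviation (n - 1 divisor)
    if sigma == 0:
        raise ValueError("all h_i are equal; y_i is undefined")
    y = np.abs((h - h.mean()) / sigma)

    # Step 3: the m largest y values give the S-outliers.
    top = np.argsort(-y, kind="stable")[:m]
    return top.tolist(), y

# Toy data (hypothetical): a flat 5 x 5 grid with one spike at the center.
xs, ys = np.meshgrid(np.arange(5), np.arange(5), indexing="ij")
coords = np.column_stack([xs.ravel(), ys.ravel()])
values = np.full(25, 10.0)
values[12] = 100.0  # grid cell (2, 2)

top, y = median_outliers(coords, values, k=4, m=1)
print(top)
```

On this toy grid, the four cells adjacent to the spike each have neighbor values {10, 10, 10, 100}, whose median is 10, so their h value stays 0. With a neighborhood average instead, the same cells would get g = 32.5 and h = −22.5, which is the kind of contamination of normal points near a true outlier that the median is meant to avoid.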