Distance-based outliers: algorithms and applications

Published: 01 February 2000

Abstract

This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB (distance-based) outliers. Specifically, we show that (i) outlier detection can be done efficiently for large datasets, and for k-dimensional datasets with large values of k (e.g., $k \ge 5$); and (ii) outlier detection is a meaningful and important knowledge discovery task.

First, we present two simple algorithms, both having a complexity of $O(k N^2)$, k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we present an optimized cell-based algorithm whose complexity is linear with respect to N, but exponential with respect to k. We provide experimental results indicating that this algorithm significantly outperforms the two simple algorithms for $k \leq 4$. Third, for datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most three passes over a dataset. Again, experimental results show that this algorithm is by far the best for $k \leq 4$. Finally, we discuss our work on three real-life applications, including one on spatio-temporal data (e.g., a video surveillance application), in order to confirm the relevance and broad applicability of DB outliers.
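To make the $O(k N^2)$ bound concrete, the following is a minimal sketch of a brute-force distance-based outlier check. It is not the paper's algorithm: the abstract does not state the outlier definition, so the sketch assumes the commonly used DB(p, D) formulation, in which an object is an outlier if at least a fraction p of the dataset lies farther than distance D from it. The function name `db_outliers`, the parameters `p` and `d`, the Euclidean metric, and the choice to count an object as its own neighbour are all assumptions made for illustration.

```python
import math


def db_outliers(points, p, d):
    """Return indices of assumed DB(p, D) outliers by brute force.

    Assumption: a point is an outlier if at least a fraction p of all
    points lie farther than distance d from it. Equivalently, at most
    N * (1 - p) points (the point itself included) lie within distance d.
    """
    n = len(points)
    max_neighbors = n * (1 - p)  # largest neighbourhood an outlier may have
    outliers = []
    for i, a in enumerate(points):
        near = 0
        for b in points:
            # math.dist is Euclidean distance, O(k) for k-dimensional points.
            if math.dist(a, b) <= d:
                near += 1
                if near > max_neighbors:
                    break  # too many neighbours: not an outlier, stop early
        else:
            # Inner loop finished without exceeding the neighbour limit.
            outliers.append(i)
    return outliers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(db_outliers(data, p=0.75, d=2.0))
```

Each of the N points is compared against up to N others, and each comparison costs O(k), which gives the quadratic bound quoted in the abstract. The early exit only reduces work for non-outliers; the worst case is unchanged.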

Published In

The VLDB Journal — The International Journal on Very Large Data Bases  Volume 8, Issue 3-4
February 2000
184 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Algorithms
  2. Data mining
  3. Data mining applications
  4. Outliers/exceptions

Qualifiers

  • Article
