Article. DOI: 10.1145/1150402.1150447

Mining distance-based outliers from large databases in any metric space

Published: 20 August 2006

Abstract

Let R be a set of objects. An object o ∈ R is an outlier if there exist fewer than k objects in R whose distances to o are at most r. The values of k, r, and the distance metric are provided by the user at run time. The objective is to return all outliers with the smallest I/O cost. This paper considers a generic version of the problem, where no information is available for outlier computation except for the objects' mutual distances. We prove an upper bound on the memory consumption that permits the discovery of all outliers by scanning the dataset 3 times. The upper bound turns out to be extremely low in practice, e.g., less than 1% of R. Since the actual memory capacity of a realistic DBMS is typically larger, we develop a novel algorithm that integrates our theoretical findings with carefully designed heuristics that leverage the additional memory to improve I/O efficiency. Our technique reports all outliers by scanning the dataset at most twice (in some cases, even once), and significantly outperforms the existing solutions, by up to an order of magnitude.
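
The outlier definition at the start of the abstract can be illustrated with a brute-force, in-memory sketch. This is not the paper's I/O-efficient algorithm; it only restates the definition as code. The function name `distance_based_outliers`, its parameters, and the toy data are hypothetical. The sketch also assumes a literal reading of the definition in which o counts among the objects within distance r of itself; the abstract does not say whether o is included.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def distance_based_outliers(
    objects: Sequence[T],
    k: int,
    r: float,
    dist: Callable[[T, T], float],
) -> List[T]:
    """Return every object with fewer than k objects within distance r.

    Assumption: an object counts as its own neighbor (dist(o, o) == 0 <= r).
    Only pairwise distances are used, so any metric can be supplied.
    """
    outliers = []
    for o in objects:
        neighbors = 0
        for p in objects:
            if dist(o, p) <= r:
                neighbors += 1
                if neighbors >= k:
                    break  # o already has k neighbors, so it is not an outlier
        if neighbors < k:
            outliers.append(o)
    return outliers


# Toy 1-D example with absolute difference as the metric.
points = [0.0, 0.1, 0.2, 5.0]
result = distance_based_outliers(points, k=2, r=0.5, dist=lambda a, b: abs(a - b))
print(result)
```

With k = 2 and r = 0.5, the point 5.0 has only itself within distance r, so it is the single reported outlier. The nested loop costs a quadratic number of distance computations in the worst case, which is the kind of cost the paper's scan-bounded approach is designed to avoid on disk-resident data.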


    Published In

    KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2006
    986 pages
    ISBN:1595933395
    DOI:10.1145/1150402

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. metric data
    2. mining
    3. outlier

    Qualifiers

    • Article

    Conference

    KDD06

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2023) Pre-Cutoff Value Calculation Method for Accelerating Metric Space Outlier Detection. International Journal of Grid and High Performance Computing, 16:1 (1-17). DOI: 10.4018/IJGHPC.334125. Online publication date: 28-Nov-2023
    • (2022) Fast, exact, and parallel-friendly outlier detection algorithms with proximity graph in metric spaces. The VLDB Journal, 31:4 (797-821). DOI: 10.1007/s00778-022-00729-1. Online publication date: 27-Jan-2022
    • (2020) Ordinal Outlier Algorithm for Anomaly Detection of High-Dimensional Data Sets. 2020 Chinese Control And Decision Conference (CCDC) (5356-5361). DOI: 10.1109/CCDC49329.2020.9164610. Online publication date: Aug-2020
    • (2020) Uncertain distance-based outlier detection with arbitrarily shaped data objects. Journal of Intelligent Information Systems, 57:1 (1-24). DOI: 10.1007/s10844-020-00624-7. Online publication date: 15-Oct-2020
    • (2020) A Fast Distance-Based Outlier Detection Technique Using a Divisive Hierarchical Clustering Algorithm. New Developments in Unsupervised Outlier Detection (39-69). DOI: 10.1007/978-981-15-9519-6_3. Online publication date: 25-Nov-2020
    • (2020) Developments in Unsupervised Outlier Detection Research. New Developments in Unsupervised Outlier Detection (13-36). DOI: 10.1007/978-981-15-9519-6_2. Online publication date: 25-Nov-2020
    • (2019) A Fast kNN-Based Approach for Time Sensitive Anomaly Detection over Data Streams. Computational Science – ICCS 2019 (59-74). DOI: 10.1007/978-3-030-22741-8_5. Online publication date: 8-Jun-2019
    • (2019) Research Issues in Outlier Detection. Outlier Detection: Techniques and Applications (29-51). DOI: 10.1007/978-3-030-05127-3_3. Online publication date: 11-Jan-2019
    • (2019) Temperature Anomaly Detection by Integrating Local Contrast and Global Contrast. Advances in Intelligent, Interactive Systems and Applications (227-233). DOI: 10.1007/978-3-030-02804-6_30. Online publication date: 17-Jan-2019
    • (2018) An attempt to analyze data distribution for abnormal behaviors. 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC) (275-280). DOI: 10.1109/CCWC.2018.8301728. Online publication date: Jan-2018
