
Mining distance-based outliers in near linear time with randomization and a simple pruning rule

Published: 24 August 2003
    Abstract

    Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.
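
    The abstract does not spell out the algorithm, so the following is only a minimal Python sketch of the idea under stated assumptions: the outlier score is taken to be the distance to the k-th nearest neighbor, the n highest-scoring examples are reported, and an example is dropped as soon as its running score falls below the score of the weakest outlier found so far. The names `top_outliers`, `k`, `n`, and `seed` are illustrative and do not come from the paper, and the sketch works in memory rather than on disk.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n=1, seed=0):
    """Return (outliers, comparisons) for a nested loop with a pruning cutoff.

    Assumed score: distance to the k-th nearest neighbor (larger = more unusual).
    `outliers` is a list of (score, index) pairs, strongest first.
    """
    if len(points) <= k:
        raise ValueError("need more than k points")

    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # scan the data in random order

    top = []        # min-heap of (score, index): the n strongest outliers so far
    cutoff = 0.0    # score of the weakest outlier kept so far
    comparisons = 0

    for i in order:
        nearest = []  # negated distances: a max-heap of the k smallest seen so far
        pruned = False
        for j in order:
            if i == j:
                continue
            comparisons += 1
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The running k-th nearest distance can only shrink as more
            # neighbors are seen, so it is an upper bound on the final score.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True  # cannot be among the top n: stop scanning
                break
        if pruned:
            continue
        heapq.heappush(top, (-nearest[0], i))
        if len(top) > n:
            heapq.heappop(top)       # drop the weakest candidate
        if len(top) == n:
            cutoff = top[0][0]       # raise the bar for later examples
    return sorted(top, reverse=True), comparisons
```

    In this sketch the worst case is still quadratic, but a typical non-outlier meets k close neighbors after only a few comparisons and is discarded early, which is consistent with the abstract's claim that the cost of processing non-outliers does not depend on the size of the data set. The returned `comparisons` count can be checked against `len(points) * (len(points) - 1)` to see how much work the cutoff saves.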

      Published In

      KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2003
      736 pages
      ISBN:1581137370
      DOI:10.1145/956750

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. anomaly detection
      2. disk-based algorithms
      3. distance-based operations
      4. outliers

      Qualifiers

      • Article

      Conference

      KDD03
      Acceptance Rates

      KDD '03 Paper Acceptance Rate 46 of 298 submissions, 15%;
      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
