DOI: 10.1145/1081870.1081891

Feature bagging for outlier detection

Published: 21 August 2005

Abstract

    Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel feature bagging approach for detecting outliers in very large, high-dimensional, and noisy databases is proposed. It combines results from multiple outlier detection algorithms that are applied using different sets of features. Every outlier detection algorithm uses a small subset of features that are randomly selected from the original feature set. As a result, each outlier detector identifies different outliers and assigns to every data record an outlier score that corresponds to its probability of being an outlier. The outlier scores computed by the individual outlier detection algorithms are then combined in order to find better-quality outliers. Experiments performed on several synthetic and real-life data sets show that the proposed methods for combining outputs from multiple outlier detection algorithms provide non-trivial improvements over the base algorithm.
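
    The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the base detector (distance to the k-th nearest neighbour), the subset-size rule, the summation used to combine scores, and the names `knn_score` and `feature_bagging` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def knn_score(points, k):
    """Assumed base detector: distance to the k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def feature_bagging(data, rounds=10, k=3, seed=0):
    """Sum base-detector scores computed on random feature subsets."""
    rng = random.Random(seed)
    n_features = len(data[0])
    total = [0.0] * len(data)
    for _ in range(rounds):
        # Hypothetical subset-size rule: between half and all-but-one features.
        size = rng.randint(max(1, n_features // 2), max(1, n_features - 1))
        subset = rng.sample(range(n_features), size)
        # Project every record onto the chosen features only.
        projected = [[row[f] for f in subset] for row in data]
        # Combine by summation (one possible combination rule).
        for i, s in enumerate(knn_score(projected, k)):
            total[i] += s
    return total


# Toy data: a Gaussian cluster plus one record far away in every feature.
rng = random.Random(1)
data = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(30)]
data.append([8.0] * 6)

scores = feature_bagging(data)
top = max(range(len(scores)), key=scores.__getitem__)
print("highest-scoring record:", top)
```

    On this toy data the appended record is far from the cluster in every feature subset, so it should receive the highest combined score. Real data would need a stronger base detector and a combination rule chosen with care, because raw scores from different subspaces are not necessarily on the same scale.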


    Published In

    KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
    August 2005
    844 pages
    ISBN:159593135X
    DOI:10.1145/1081870

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bagging
    2. detection rate
    3. false alarm
    4. feature subsets
    5. integration
    6. outlier detection

    Qualifiers

    • Article

    Conference

    KDD05

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
