Article

Feature bagging for outlier detection

Authors:

Aleksandar Lazarevic,

Vipin Kumar

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

Pages 157 - 166

https://doi.org/10.1145/1081870.1081891

Published: 21 August 2005

Abstract

Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel feature bagging approach for detecting outliers in very large, high-dimensional, and noisy databases is proposed. It combines results from multiple outlier detection algorithms, each applied to a different set of features. Every outlier detection algorithm uses a small subset of features randomly selected from the original feature set. As a result, each outlier detector identifies different outliers and assigns to every data record an outlier score that corresponds to its probability of being an outlier. The outlier scores computed by the individual outlier detection algorithms are then combined in order to find better-quality outliers. Experiments performed on several synthetic and real-life data sets show that the proposed methods for combining outputs from multiple outlier detection algorithms provide non-trivial improvements over the base algorithm.
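The procedure outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the base detector, the subset size, or the combination rule, so all three are assumptions here: the base detector is a plain k-th-nearest-neighbor distance score, each round draws between half and all-but-one of the features, and scores are combined by summation. The function names `knn_distance_scores` and `feature_bagging_scores` are hypothetical and are not taken from the paper.

```python
import math
import random


def knn_distance_scores(rows, features, k):
    """Stand-in base detector (an assumption, not the paper's choice):
    score each row by the distance to its k-th nearest neighbor,
    measured only on the given feature indices."""
    scores = []
    for i, a in enumerate(rows):
        dists = sorted(
            math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))
            for j, b in enumerate(rows)
            if j != i
        )
        scores.append(dists[k - 1])
    return scores


def feature_bagging_scores(rows, n_rounds=10, k=3, seed=0):
    """Run the base detector on random feature subsets and sum the scores."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    combined = [0.0] * len(rows)
    for _ in range(n_rounds):
        # Assumed subset size: between half and all-but-one of the features.
        size = rng.randint(max(1, n_features // 2), max(1, n_features - 1))
        features = rng.sample(range(n_features), size)
        # Assumed combination rule: add this round's scores to the running total.
        for i, score in enumerate(knn_distance_scores(rows, features, k)):
            combined[i] += score
    return combined
```

Records with the largest combined score are the strongest outlier candidates. Because distances computed on subsets of different sizes are not on the same scale, a more careful version might normalize or rank the scores of each round before combining them.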

Index Terms

Feature bagging for outlier detection
1. Human-centered computing
  1. Visualization
    1. Visualization application domains
      1. Geographic visualization
      2. Scientific visualization
2. Information systems
  1. Information systems applications
    1. Data mining
    2. Spatial-temporal systems

Recommendations

Enhancing Outlier Detection by an Outlier Indicator
Machine Learning and Data Mining in Pattern Recognition
Outlier detection is an important task in data mining and has high practical value in numerous applications such as astronomical observation, text detection, fraud detection and so on. At present, a large number of popular outlier detection ...
Outlier detection based on rough sets theory

An outlier in a dataset is a point or a class of points that is considerably dissimilar to or inconsistent with the remainder of the data. Detection of outliers is important for many applications and has always attracted attention among data mining ...
Neighborhood outlier detection

KNN (k nearest neighbor) is widely discussed and applied in pattern recognition and data mining, however, as a similar outlier detection method using local information for mining a new outlier, neighborhood outlier detection, few literatures are ...

Published In

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

August 2005

844 pages

ISBN:159593135X

DOI:10.1145/1081870

General Chair: Robert Grossman, University of Illinois at Chicago & Open Data Partners, USA

Program Chairs: Roberto Bayardo, IBM Almaden Research, USA; Kristin Bennett, RPI, USA

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

KDD05: The Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 21 - 24, 2005

Chicago, Illinois, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Bibliometrics

Total Citations: 417
Total Downloads: 3,991
Downloads (Last 12 months): 132
Downloads (Last 6 weeks): 11

Reflects downloads up to 12 Aug 2024

Cited By

Komadina A, Kovačević I, Štengl B, Groš S (2024). Comparative Analysis of Anomaly Detection Approaches in Firewall Logs: Integrating Light-Weight Synthesis of Security Logs and Artificially Generated Attack Detection. Sensors 24:8 (2636). Online publication date: 20-Apr-2024.
https://doi.org/10.3390/s24082636
Zhao M, Peng H, Li L, Ren Y (2024). Graph Attention Network and Informer for Multivariate Time Series Anomaly Detection. Sensors 24:5 (1522). Online publication date: 26-Feb-2024.
https://doi.org/10.3390/s24051522
Zeng Y, Deng C, Xiong F (2024). Enhancing rubber rupture detection in rubber bearing through generative adversarial network and feature-bagging zero-shot methodology. Structural Health Monitoring. Online publication date: 25-Jul-2024.
https://doi.org/10.1177/14759217241264096
Li C, Fan S, Zhao H, Liu X (2024). CNV-FB: A Feature bagging strategy-based approach to detect copy number variants from NGS data. Journal of Bioinformatics and Computational Biology 21:06. Online publication date: 10-Jan-2024.
https://doi.org/10.1142/S0219720023500269
Wang X, Duan L, He C, Chen Y, Wu X (2024). An Efficient Adaptive Multi-Kernel Learning With Safe Screening Rule for Outlier Detection. IEEE Transactions on Knowledge and Data Engineering 36:8 (3656-3669). Online publication date: Aug-2024.
https://doi.org/10.1109/TKDE.2023.3330708
Gholami A, Tiwari A, Qin C, Pannala S, Srivastava A, Sharma R, Pandey S, Rahmatian F (2024). Detection and Classification of Anomalies in Power Distribution System Using Outlier Filtered Weighted Least Square. IEEE Transactions on Industrial Informatics 20:5 (7513-7523). Online publication date: May-2024.
https://doi.org/10.1109/TII.2024.3360523
Yang J, Rahardja S, Rahardja S (2024). FOOR: Be Careful for Outlier-Score Outliers When Using Unsupervised Outlier Ensembles. IEEE Transactions on Computational Social Systems 11:2 (2843-2852). Online publication date: Apr-2024.
https://doi.org/10.1109/TCSS.2023.3280593
Alrumaih T, Alenazi M (2024). CGAAD: Centrality- and Graph-Aware Deep-Learning Model for Detecting Cyberattacks Targeting Industrial Control Systems in Critical Infrastructure. IEEE Internet of Things Journal 11:13 (24162-24182). Online publication date: 1-Jul-2024.
https://doi.org/10.1109/JIOT.2024.3390691
Kim D, Park J, Chung H, Jeong S (2024). Unsupervised outlier detection using random subspace and subsampling ensembles of Dirichlet process mixtures. Pattern Recognition (110846). Online publication date: Jul-2024.
https://doi.org/10.1016/j.patcog.2024.110846
Wang X, Xing Q, Xiao H, Ye M (2024). Contrastive learning enhanced by graph neural networks for Universal Multivariate Time Series Representation. Information Systems 125 (102429). Online publication date: Nov-2024.
https://doi.org/10.1016/j.is.2024.102429
