Fast Distributed Outlier Detection in Mixed-Attribute Data Sets

Authors:

Matthew Eric Otey,

Srinivasan Parthasarathy

Data Mining and Knowledge Discovery, Volume 12, Issue 2-3

Pages 203 - 228

https://doi.org/10.1007/s10618-005-0014-6

Published: 01 May 2006

Abstract

Efficiently detecting outliers or anomalies is an important problem in many areas of science, medicine and information technology. Applications range from data cleaning to clinical diagnosis, from detecting anomalous defects in materials to fraud and intrusion detection. Over the past decade, researchers in data mining and statistics have addressed the problem of outlier detection using both parametric and non-parametric approaches in a centralized setting. However, there are still several challenges that must be addressed. First, most approaches to date have focused on detecting outliers in a continuous attribute space. However, almost all real-world data sets contain a mixture of categorical and continuous attributes. Categorical attributes are typically ignored or incorrectly modeled by existing approaches, resulting in a significant loss of information. Second, there have not been any general-purpose distributed outlier detection algorithms. Most distributed detection algorithms are designed with a specific domain (e.g. sensor networks) in mind. Third, the data sets being analyzed may be streaming or otherwise dynamic in nature. Such data sets are prone to concept drift, and models of the data must be dynamic as well. To address these challenges, we present a tunable algorithm for distributed outlier detection in dynamic mixed-attribute data sets.



Index Terms

Fast Distributed Outlier Detection in Mixed-Attribute Data Sets
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs
  2. Information systems applications
    1. Data mining


Published In

Data Mining and Knowledge Discovery Volume 12, Issue 2-3

May 2006

185 pages

ISSN: 1384-5810

Copyright © 2006 Springer Science+Business Media, Inc.

Publisher

Kluwer Academic Publishers

United States
