Fast Distributed Outlier Detection in Mixed-Attribute Data Sets

Published: 01 May 2006

Abstract

Efficiently detecting outliers or anomalies is an important problem in many areas of science, medicine and information technology. Applications range from data cleaning to clinical diagnosis, from detecting anomalous defects in materials to fraud and intrusion detection. Over the past decade, researchers in data mining and statistics have addressed the problem of outlier detection using both parametric and non-parametric approaches in a centralized setting. However, several challenges remain. First, most approaches to date have focused on detecting outliers in a continuous attribute space, yet almost all real-world data sets contain a mixture of categorical and continuous attributes. Categorical attributes are typically ignored or incorrectly modeled by existing approaches, resulting in a significant loss of information. Second, there have not been any general-purpose distributed outlier detection algorithms; most distributed detection algorithms are designed with a specific domain (e.g., sensor networks) in mind. Third, the data sets being analyzed may be streaming or otherwise dynamic in nature. Such data sets are prone to concept drift, and models of the data must be dynamic as well. To address these challenges, we present a tunable algorithm for distributed outlier detection in dynamic mixed-attribute data sets.

Published In

Data Mining and Knowledge Discovery  Volume 12, Issue 2-3
May 2006
185 pages

Publisher

Kluwer Academic Publishers

United States

Author Tags

  1. anomaly detection
  2. data streams
  3. distributed data mining
  4. mining dynamic data
  5. mixed-attribute data sets
  6. outlier detection

Qualifiers

  • Article
