A comprehensive survey of numeric and symbolic outlier mining techniques

Authors:

Malik Agyemang,

Rada Alhajj

Intelligent Data Analysis, Volume 10, Issue 6

Pages 521 - 538

Published: 01 December 2006

Abstract

Data that appear to have characteristics different from the rest of the population are called outliers. Identifying outliers in huge data repositories is a very complex task called outlier mining. Outlier mining has been likened to finding needles in a haystack. Nevertheless, it has a number of practical applications in areas such as fraud detection, network intrusion detection, and the identification of competitor and emerging business trends in e-commerce. This survey discusses practical applications of outlier mining and provides a taxonomy for categorizing related mining techniques. It also provides a comprehensive review of these techniques, with their advantages and disadvantages, along with some current research issues.


Index Terms

A comprehensive survey of numeric and symbolic outlier mining techniques
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.


Published In

Intelligent Data Analysis Volume 10, Issue 6

December 2006

110 pages

ISSN:1088-467X

Publisher

IOS Press

Netherlands
