A comprehensive survey of numeric and symbolic outlier mining techniques

Published: 01 December 2006

Abstract

    Data that appear to have different characteristics from the rest of the population are called outliers. Identifying outliers in huge data repositories is a very complex task called outlier mining, and it has been likened to finding needles in a haystack. Nevertheless, outlier mining has a number of practical applications in areas such as fraud detection, network intrusion detection, and the identification of competitor and emerging business trends in e-commerce. This survey discusses practical applications of outlier mining and provides a taxonomy for categorizing related mining techniques. A comprehensive review of these techniques, with their advantages and disadvantages, is provided along with some current research issues.


        Published In

        Intelligent Data Analysis  Volume 10, Issue 6
        December 2006
        110 pages

        Publisher

        IOS Press

        Netherlands

        Author Tags

        1. Symbolic
        2. depth-based
        3. distance-based
        4. distribution-based
        5. exception patterns
        6. interestingness
        7. outliers
        8. rule-based
        9. taxonomy
        10. unexpectedness
        11. web-based

        Cited By

        • (2024) An Exploratory Investigation of Log Anomalies in Unmanned Aerial Vehicles. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-13. DOI: 10.1145/3597503.3639186. Online publication date: 20-May-2024
        • (2023) A Survey on Explainable Anomaly Detection. ACM Transactions on Knowledge Discovery from Data 18:1, pp. 1-54. DOI: 10.1145/3609333. Online publication date: 6-Sep-2023
        • (2019) An Efficient Anomaly Detection Framework for Electromagnetic Streaming Data. Proceedings of the 4th International Conference on Big Data and Computing, pp. 151-155. DOI: 10.1145/3335484.3335521. Online publication date: 10-May-2019
        • (2019) Anomaly Detection Methods for Categorical Data. ACM Computing Surveys 52:2, pp. 1-35. DOI: 10.1145/3312739. Online publication date: 30-May-2019
        • (2019) Method of Fraudster Fingerprint Formation During Mobile Application Installations. 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), pp. 1099-1103. DOI: 10.1109/IDAACS.2019.8924369. Online publication date: 18-Sep-2019
        • (2018) A Survey on Anomaly detection in Evolving Data. ACM SIGKDD Explorations Newsletter 20:1, pp. 13-23. DOI: 10.1145/3229329.3229332. Online publication date: 29-May-2018
        • (2017) Anomaly detection on the edge. MILCOM 2017 - 2017 IEEE Military Communications Conference (MILCOM), pp. 678-682. DOI: 10.1109/MILCOM.2017.8170817. Online publication date: 23-Oct-2017
        • (2017) Maritime anomaly detection in ferry tracks. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2647-2651. DOI: 10.1109/ICASSP.2017.7952636. Online publication date: 5-Mar-2017
        • (2016) A survey on Neyman-Pearson classification and suggestions for future research. WIREs Computational Statistics 8:2, pp. 64-81. DOI: 10.5555/3160181.3160184. Online publication date: 1-Mar-2016
        • (2016) Mining social networks for anomalies. Journal of Network and Computer Applications 68:C, pp. 213-229. DOI: 10.1016/j.jnca.2016.02.021. Online publication date: 1-Jun-2016
