
Random clustering-based outlier detector

Published: 09 July 2024

Abstract

    Outlier detection is one of the most important problems in contemporary data analysis. Many methods are currently used for anomaly and outlier detection, but no universal tool yet delivers consistently high effectiveness. In this study, we present a novel approach to outlier detection based on the law of large numbers. The main idea is to cluster the elements of the analyzed set randomly and then mark as outliers those elements that are sufficiently distant from the random cluster centers. Besides being highly effective, the proposed approach is also very intuitive. Numerical experiments confirm the effectiveness of the proposed method, with accuracy and precision reaching a value of 1. Further advantages of the approach are the simplicity of its interpretation and the possibility of its modification by people who lack extensive experience in data analysis. The effectiveness of the proposed method was compared with that of other recognized outlier detection techniques on both artificially generated and empirical data sets.
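
    The abstract gives only the outline of the method, so the following is a minimal sketch of that outline rather than the authors' algorithm. It assumes that random centers are drawn from the data itself, that a point's score is its distance to the nearest random center averaged over many rounds, and that a point is flagged when its score exceeds the mean score by a chosen number of standard deviations. The function names, the parameters `n_clusters`, `n_rounds`, and `n_sigmas`, and the threshold rule are all hypothetical.

```python
import math
import random


def random_clustering_scores(points, n_clusters=3, n_rounds=200, seed=0):
    """Average distance of each point to its nearest randomly drawn center."""
    rng = random.Random(seed)
    totals = [0.0] * len(points)
    for _ in range(n_rounds):
        # Assumption: the random cluster centers are sampled from the data.
        centers = rng.sample(points, n_clusters)
        for i, p in enumerate(points):
            totals[i] += min(math.dist(p, c) for c in centers)
    # Averaging over many rounds stabilises the score (law of large numbers).
    return [t / n_rounds for t in totals]


def flag_outliers(scores, n_sigmas=3.0):
    """Flag scores above mean + n_sigmas * std (a hypothetical threshold rule)."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [s > mean + n_sigmas * std for s in scores]


if __name__ == "__main__":
    # A tight 7 x 7 grid of inliers plus one distant point.
    data = [(0.1 * x, 0.1 * y) for x in range(7) for y in range(7)]
    data.append((50.0, 50.0))
    flags = flag_outliers(random_clustering_scores(data))
    print([p for p, is_outlier in zip(data, flags) if is_outlier])
```

    In this sketch a point far from the bulk of the data is rarely close to any random center, so its averaged score stays large while the scores of ordinary points stay small. The number of clusters and the distance metric, both listed among the author tags, are the natural parameters to vary.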


    Published In

    Information Sciences: an International Journal  Volume 667, Issue C
    May 2024
    974 pages

    Publisher

    Elsevier Science Inc.

    United States

    Author Tags

    1. Outlier detection
    2. Clustering
    3. Random clustering
    4. Number of clusters
    5. Metric
    6. Law of large numbers

    Qualifiers

    • Research-article
