MEOD: Memory-Efficient Outlier Detection on Streaming Data
Abstract
:1. Introduction
- Distance-based outlier detection: For finding outliers, the distance between the data points is calculated, and outlier points are those whose distance is much bigger than the average distance [1,2]. As compared with the other two techniques, this technique is much simpler to use, and unlike the statistical-based approach, no prior assumptions are required.
- Density-based outlier detection: This approach compares the object density with respect to neighboring objects. Outlier points have a lower density than neighboring points. Local outlier factor (LOF) [3] and local correlation integral (LOCI) [4] are the popular density-based approaches provided by most of the machine learning libraries. Most of the methods, for example, DBSCAN [5], use a clustering technique for outlier detection. The outliers are treated as a by-product of the clustering technique. Initially, clustering is applied, and then the points far away from the centroid are identified as outliers.
- It is a local outlier detection technique over streaming data based on the local correlation integral (LOCI) technique.
- It works with limited memory resources using a data summarization mechanism.
- It uses a swarm intelligence technique to find the optimal value for radius r for LOCI calculations.
- To improve the efficiency of the algorithm, MEOD finds an outlier factor value for only candidate points rather than the whole dataset.
2. Preliminaries
2.1. Local Correlation Integral
2.2. Particle Swarm Optimization
2.3. Memory-Efficient Approach
- Summarization: Due to memory constraints, it is not feasible to preserve all the data points in the stream. If the previous data points are deleted, then the new events cannot be distinguished from the past ones. This affects the accuracy of the evaluation as there is no history of data to be considered while checking the local outlier. In the summarization phase, the summary of previous data points is preserved. For every window slot, b/2 points are processed, and a summary is generated of these points. For summary generation, the clustering technique is used. A large number f of cluster counts is set, and the cluster centers are preserved as a summary of information with granularity deviation factor (MDEF) and normalized deviation factor (σMDEF) at radius r values. The remaining points are deleted, and then the next slot is processed.
- Merging: The cluster centers generated in previous sliding window i-1 and clusters generated in the current sliding window i are merged in this phase, and a single value of cluster centers is preserved. For clustering, the cluster centers from the sliding windows i-1 and i are merged using a weighted c-means clustering algorithm where each point has a weight that shows the point importance value in the clustering process. In this process, each point is a cluster center. Hence the count of cluster members is assigned as a weight to the cluster center. After the clustering process, the weights of the cluster centers are updated as:
- Revised insertion: In the revised insertion phase, the processing is completed using the b/2 points in the current sliding window and the summarized points preserved in the memory. The outlier of a point is calculated using MDEF values. If the point is present in a radius of previously summarized points, then this point is not considered an outlier. Hence, there is no need to calculate the outlier factor of such points.
3. Proposed Outlier Detection Technique
- Revised insertion: For each sliding window, the outliers are detected using the LOCI technique, finding the outlier based on the density of neighboring points defined using the radius r. The radius value can be user-defined, but in order to remove such dependency, the r-value is automatically found using an optimization function applying the PSO-based approach to find the optimal value of the radius r.
- Summarization: The data representative points are extracted using a k-means clustering algorithm. The cluster centroids, treated as representative points for the rest of the cluster points, are preserved as summary information with a granularity deviation factor (MDEF) and a normalized deviation factor (σMDEF) at radius r values.
- Merging: A summary generated in the current sliding window is merged with the previous sliding window summary using the weighted c-means algorithm.
4. Implementation Details
5. Results
5.1. Accuracy Analysis
5.2. Time and Memory Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Radovanovic, M.; Nanopoulos, A.; Ivanovic, M. Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE Trans. Knowl. Data Eng. 2015, 27, 1369–1382. [Google Scholar] [CrossRef]
- Zhang, K.; Hutter, M.; Jin, H. A new local distance-based outlier detection approach for scattered real-world data. In Advances in Knowledge Discovery and Data Mining (PAKDD 2009). Lecture Notes in Computer Science; Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5476. [Google Scholar] [CrossRef]
- Breunig, M.; Kriegel, H.-P.; Ng, R.; Sander, J. LOF: Identifying density-based local outliers. Proc. ACM SIGMOD Int. Conf. Manag. Data (SIGMOD’00) 2000, 29, 93–104. [Google Scholar] [CrossRef]
- Papadimitriou, S.; Gibbons, P.B.; Faloutsos, C. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering, Bangalore, India, 5–8 March 2003; pp. 315–326. [Google Scholar] [CrossRef]
- Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96); AAAI Press, Association for Computing Machinery: New York, NY, USA, 1996; Volume 96, pp. 226–231. [Google Scholar] [CrossRef]
- Chen, F.; Lu, C.-T.; Boedihardjo, A.P. GLS-SOD: A generalized local statistical approach for spatial outlier detection. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2010; pp. 1069–1078. [Google Scholar] [CrossRef]
- Liu, X.; Lu, C.-T.; Chen, F. Spatial outlier detection: Random walk based approaches. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS’10); Association for Computing Machinery: New York, NY, USA, 2010; pp. 370–379. [Google Scholar] [CrossRef]
- Kriegel, H.P.; Kr€oger, P.; Schubert, E.; Zimek, A. LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09); Association for Computing Machinery: New York, NY, USA, 2009; pp. 1649–1652. [Google Scholar] [CrossRef]
- Mukhopadhyay, A.; Maulik, U.; Bandyopadhyay, S.; Coello, C.A.C. A survey of multiobjective evolutionary algorithms for data mining: Part I. IEEE Trans. Evol. Comput. 2014, 18, 4–19. [Google Scholar] [CrossRef]
- Misinem; Bakar, A.A.; Hamdan, A.R.; Nazri, M.Z.A. A rough set outlier detection based on particle swarm optimization. In Proceedings of the 10th International Conference on Intelligent Systems Design and Applications, Cairo, Egypt, 29 November–1 December 2010; pp. 1021–1025. [Google Scholar] [CrossRef]
- Alam, S.; Dobbie, G.; Koh, Y.S.; Riddle, P. Web bots detection using particle swarm optimization based clustering. In Proceedings of the 2014 IEEE Congress on Evolutionary Computation (CEC), Beijing, China, 6–11 July 2014; pp. 2955–2962. [Google Scholar] [CrossRef]
- Mohemmed, A.W.; Zhang, M.; Will, B. Particle swarm optimisation for outlier detection. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO’10); Association for Computing Machinery: New York, NY, USA, 2010; pp. 83–84. [Google Scholar] [CrossRef] [Green Version]
- Hashmi, A.; Doja, M.; Ahmad, T. An optimized density-based algorithm for anomaly detection in high dimensional datasets. Scalable Comput. Pract. Exp. 2018, 19, 69–77. [Google Scholar] [CrossRef]
- Knorr, E.M.; Ng, R.T.; Tucakov, V. Distance-based outliers: Algorithms and applications. VLDB J. 2000, 8, 237–253. [Google Scholar] [CrossRef]
- Karale, A. Outlier detection methods and the challenges for their implementation with streaming data. J. Mob. Multimed. 2020, 16, 351–388. [Google Scholar] [CrossRef]
- Pokrajac, D.; Lazarevic, A.; Latecki, L.J. Incremental local outlier detection for data streams. In Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining, Honolulu, HI, USA, 1 March–5 April 2007; pp. 504–515. [Google Scholar] [CrossRef] [Green Version]
- Salehi, M.; Leckie, C.; Bezdek, J.C.; Vaithianathan, T.; Zhang, X. Fast Memory efficient local outlier detection in data streams. IEEE Trans. Knowl. Data Eng. 2016, 28, 3246–3260. [Google Scholar] [CrossRef]
- Karale, A.; Lazarova, M.; Koleva, P.; Poulkov, V. A hybrid PSO-MiLOF approach for outlier detection in streaming data. In Proceedings of the 43rd International Conference on Telecommunications and Signal Processing (TSP), Milan, Italy, 7–9 July 2020; pp. 474–479. [Google Scholar] [CrossRef]
- UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 8 March 2021).
- Kaggle Wine Dataset. Available online: https://www.kaggle.com/rishidamarla/2d-3d-pca-t-sne-and-umap-on-wine-dataset (accessed on 8 March 2021).
Sr. No. | Dataset | Data Points (n) | Dimensions (D) |
---|---|---|---|
1. | UCI Vowel (Vl) | 1040 | 10 |
2. | UCI Glass | 214 | 10 |
3. | UCI Pendigit (Pt) | 3600 | 16 |
4. | IBRL | 3000 | 2 |
5. | Kaggle Wine | 177 | 2 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Karale, A.; Lazarova, M.; Koleva, P.; Poulkov, V. MEOD: Memory-Efficient Outlier Detection on Streaming Data. Symmetry 2021, 13, 458. https://doi.org/10.3390/sym13030458
Karale A, Lazarova M, Koleva P, Poulkov V. MEOD: Memory-Efficient Outlier Detection on Streaming Data. Symmetry. 2021; 13(3):458. https://doi.org/10.3390/sym13030458
Chicago/Turabian StyleKarale, Ankita, Milena Lazarova, Pavlina Koleva, and Vladimir Poulkov. 2021. "MEOD: Memory-Efficient Outlier Detection on Streaming Data" Symmetry 13, no. 3: 458. https://doi.org/10.3390/sym13030458