DOI: 10.1145/3388142.3388160
research-article

A Genetic-Based Incremental Local Outlier Factor Algorithm for Efficient Data Stream Processing

Published: 17 April 2020

Abstract

Interest in outlier detection methods is increasing because detecting outliers is an important operation in many applications, such as detecting fraudulent credit card transactions, network intrusion detection, and data analysis in different domains. We are now in the big data era, and data streams are an important type of big data. With the growing need to analyze high-velocity data streams, it becomes difficult to apply older outlier detection methods efficiently. Local Outlier Factor (LOF) is a well-known outlier detection algorithm. A major challenge of LOF is that it requires the entire dataset and the distance values to be stored in memory. Another issue is that LOF must be recalculated from the beginning whenever the dataset changes. This paper proposes a novel local outlier detection algorithm for data streams, called Genetic-based Incremental Local Outlier Factor (GILOF). The algorithm works without any prior knowledge of the data distribution and executes in limited memory. Our experiments with various real-world datasets demonstrate that GILOF achieves better execution time and accuracy than other state-of-the-art LOF algorithms.
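
To make the memory and recomputation costs of classic LOF concrete, the following is a minimal sketch of the batch LOF computation of Breunig et al. (2000) in plain Python. It is not the GILOF algorithm and is not taken from the paper. The function name `lof_scores`, the Euclidean distance, and the simplified tie handling (exactly k neighbours, no duplicate points) are assumptions made for illustration.

```python
import math


def lof_scores(points, k):
    """Batch LOF for a small list of equal-length numeric tuples.

    Illustrative assumptions: Euclidean distance, exactly k nearest
    neighbours per point (ties at the k-distance are broken by index),
    and no duplicate points (duplicates could give a zero reachability sum).
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")

    # Full pairwise distance matrix: O(n^2) memory over the whole dataset.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point (excluding itself) and its k-distance.
    knn, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        knn.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance(j), dist(i, j)).
    lrd = []
    for i in range(n):
        reach = sum(max(k_dist[j], dist[i][j]) for j in knn[i])
        lrd.append(k / reach if reach > 0 else math.inf)

    # LOF: mean ratio of the neighbours' densities to the point's own density.
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
    for point, score in zip(data, lof_scores(data, k=2)):
        print(point, round(score, 3))
```

In this toy dataset the four corner points of the unit square score close to 1, while the isolated point (5, 5) receives the largest score. Because every step reads the full distance matrix, adding or removing a single point means rebuilding the neighbour lists and densities, which is the limitation that incremental variants for data streams aim to avoid.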

Cited By

  • (2023) Cleaning Big Data Streams: A Systematic Literature Review. Technologies 11:4 (101). DOI: 10.3390/technologies11040101. Online publication date: 26-Jul-2023
  • (2022) A Credit Conflict Detection Model Based on Decision Distance and Probability Matrix. Wireless Communications & Mobile Computing 2022. DOI: 10.1155/2022/3795183. Online publication date: 1-Jan-2022
  • (2021) Improving the outlier detection method in concrete mix design by combining the isolation forest and local outlier factor. Construction and Building Materials 270 (121396). DOI: 10.1016/j.conbuildmat.2020.121396. Online publication date: Feb-2021

Published In

ICCDA '20: Proceedings of the 2020 4th International Conference on Compute and Data Analysis
March 2020
224 pages
ISBN:9781450376440
DOI:10.1145/3388142
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data science
  2. evolutionary computation
  3. genetic algorithm
  4. outlier detection
  5. stream data mining

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICCDA 2020

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 21
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 25 Jan 2025

Citations

  • (2021) Improving the Efficiency of Genetic-Based Incremental Local Outlier Factor Algorithm for Network Intrusion Detection. Advances in Artificial Intelligence and Applied Cognitive Computing (1011-1027). DOI: 10.1007/978-3-030-70296-0_81. Online publication date: 15-Oct-2021
  • (2020) A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams. Big Data and Cognitive Computing 5:1 (1). DOI: 10.3390/bdcc5010001. Online publication date: 29-Dec-2020
  • (2020) An Efficient Local Outlier Factor for Data Stream Processing: A Case Study. 2020 International Conference on Computational Science and Computational Intelligence (CSCI) (1525-1528). DOI: 10.1109/CSCI51800.2020.00282. Online publication date: Dec-2020
  • (2020) A Grid Partition-Based Local Outlier Factor by Reachability Distance for Data Stream Processing. 2020 International Conference on Computational Science and Computational Intelligence (CSCI) (369-375). DOI: 10.1109/CSCI51800.2020.00069. Online publication date: Dec-2020
