Using distribution analysis for parameter selection in RepStream
Abstract
One of the most significant challenges in data clustering is the evolution of data distributions over time. Many clustering algorithms have been introduced to deal specifically with streaming data, but they commonly require users to set input parameters that tell the algorithm under what criteria data points may be clustered together. Setting the initial parameters for a clustering algorithm is itself a non-trivial task, and the evolution of the data distribution over time means that even optimally set parameters can become non-optimal as the stream evolves. In this paper we extend the RepStream algorithm, a combined graph- and density-based clustering algorithm, so that its primary input parameter, the $K$ value, is automatically adjusted over time. We introduce a feature called the edge distribution score, computed over the data held in memory, along with an incremental method for adjusting the $K$ parameter based on this score. We evaluate our method against RepStream itself and against other contemporary stream clustering algorithms, and show that automatically adjusting the $K$ value over time leads to higher-quality clustering output even when the initial parameters are set poorly.
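As a rough illustration of the approach described above, the Python sketch below pairs a hypothetical edge distribution score, here a simple coefficient of variation over the K-NN edge lengths currently held in memory, with a one-step-per-window adjustment of $K$ whenever the score leaves a target band. The score formula, the thresholds, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import statistics

def edge_distribution_score(edge_lengths):
    """Illustrative score: dispersion of the in-memory K-NN edge lengths
    relative to their mean (a coefficient of variation)."""
    mean = statistics.mean(edge_lengths)
    return statistics.pstdev(edge_lengths) / mean if mean > 0 else 0.0

def adjust_k(k, score, low=0.3, high=0.7, k_min=3, k_max=50):
    """Nudge K by a single step when the score leaves the [low, high] band."""
    if score > high:   # edge lengths highly spread out: reduce connectivity
        return max(k_min, k - 1)
    if score < low:    # edge lengths very uniform: connectivity can grow
        return min(k_max, k + 1)
    return k

# One update per processed window of the stream.
k = 9
for window in ([0.8, 0.9, 1.1], [0.2, 1.9, 2.4], [1.0, 1.0, 1.1]):
    k = adjust_k(k, edge_distribution_score(window))
    print(k)
```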
Keywords: cluster analysis, data clustering, parameter adjustment, graph-based clustering, KNN graphs.
Mathematics Subject Classification: Primary: 62H30.
Table 1. Input parameters used by each algorithm

| Algorithm | Density parameters | Grid granularity | Reclustering method | Decay parameter | Distance parameter | Other parameters |
|---|---|---|---|---|---|---|
| CluStream | | | $k$-means | $\delta$ | | $\alpha$, $l$, snapshot parameters |
| SWClustering | $\beta$ | | $k$-means | | $\epsilon$ | $N$, window parameters |
| DenStream | $\mu$, $\beta$ | | | | $\epsilon$ | |
| HPStream | | | | $\lambda$ | $r$ | $l$ projected dimensions |
| D-Stream | $C_m$, $C_l$, $\beta$ | $len$ | | $\lambda$ | | |
| ExCC | | granularity in all dimensions | | | | |
| MR-Stream | $C_H$, $C_L$, $\beta$, $\mu$ | | hierarchical | $\lambda$ | $\epsilon$ | $H$ hierarchical limit |
| FlockStream | $\beta$, $\mu$ | | | $\lambda$ | $\epsilon$ | |
| DeBaRa | $dPts$ | | | | | $f$ |
| DBStream | $\alpha$, $w\_min$ | | | $\lambda$, $t\_gap$ | $r$ | |
| SNCStream | $\psi$, $\beta$ | | | $\lambda$ | $\epsilon$ | $K$ graph connectivity |
| PASCAL | | | evolutionary | | | evolutionary EDA hyper-parameters |
| HASTREAM | $minPts$ | | hierarchical | $\lambda$ | | |
| BEStream | $\Delta$, $\tau$ | | | $\lambda$ | $\xi$ | $\theta$ direction parameter |
| ADStream | $\xi$, $\epsilon$ | | | $\lambda$ | | |
| PatchWork | $ratio$, $minPoints$, $minCells$ | | | | $\epsilon$ | |
| RepStream | | | | $\lambda$ | $\alpha$ | $K$ graph connectivity |

Table 2. Best and worst $K$ values for RepStream, according to $F$ score

| Dataset | Best $K$ | Best $F$ score | Worst $K$ | Worst $F$ score |
|---|---|---|---|---|
| DS1 | 7 | 0.7208 | 18 | 0.2767 |
| DS2 | 7 | 0.6371 | 21 | 0.2594 |
| SynTest | 9 | 0.8614 | 5 | 0.5435 |
| Closer | 9 | 0.7989 | 5 | 0.4345 |
| TreeCov | 29 | 0.6108 | 5 | 0.2978 |
| KDD99 | 30 | 0.7898 | 5 | 0.2636 |

Table 3. Comparison of our dynamic $K$ method versus RepStream with optimal and worst $K$ values

| Dataset | Best F-Measure | Worst F-Measure | Dynamic $K$ |
|---|---|---|---|
| DS1 | 0.7208 | 0.2767 | 0.5153 |
| DS2 | 0.6371 | 0.2594 | 0.4137 |
| SynTest | 0.7989 | 0.5435 | 0.7091 |
| Closer | 0.8614 | 0.4345 | 0.8410 |
| TreeCov | 0.6108 | 0.2978 | 0.5920 |
| KDD99 | 0.7898 | 0.2636 | 0.7882 |
Figures
- Figure 1. An illustration of the representative and point-level sparse graphs in the RepStream algorithm
- Figure 2. The representative vertices $R_1$ and $R_2$ share a reciprocal connection at the representative level, and are also density related. The density radius of each vertex is shown as $DR_1$ and $DR_2$ respectively: the average distance to its neighbours at the point level, multiplied by the alpha scaling factor (see the sketch after this list)
- Figure 3. Intra- and inter-class edges. The edge $E_1$ is considered an inter-class edge as it connects two vertices $R_1$ and $R_2$ that belong to different ground-truth classes. Edge $E_2$ connects two vertices belonging to the same class and is thus considered an intra-class edge (see the helper after this list)
- Figure 4. An illustration of relative edge lengths of nearest neighbours in the middle of a cluster versus near the edge of a cluster
- Figure 5. Different cases for the distribution of edge lengths
- Figure 6. Visualisation of the DS1 dataset
- Figure 7. Visualisation of the DS2 dataset
- Figure 8. Two dimensional representations of the 5 classes in the SynTest dataset. The main class is always present and steadily changes shape, the smaller classes appear at various points through the dataset, as shown in Figure 9
- Figure 9. The class presence of the classes in the SynTest dataset. A marker indicates that the class is present in the dataset during the given time window
- Figure 10. The evolution of the Closer dataset, showing slices of its 3 sections
- Figure 11. Comparative purity for Tree Cover dataset
- Figure 12. Comparative purity for Tree Cover dataset
- Figure 13. Comparative purity for KDD '99 Cup dataset
- Figure 14. Comparative purity for KDD '99 Cup dataset
- Figure 15. Comparative purity for DS1 dataset
- Figure 16. The $K$ value selected by our dynamic $K$ method on the DS1 dataset
- Figure 17. Comparative purity for DS2 dataset
- Figure 18. Comparative purity for SynTest dataset
- Figure 19. Comparative purity for Closer dataset
- Figure 20. Comparative purity for Tree Cover dataset
- Figure 21. Comparative purity for KDD '99 Cup dataset
- Figure 22. The $K$ value selected by our dynamic $K$ method on the KDD dataset
- Figure 23. F-Measure comparison vs RepStream using optimal parameters on DS1 dataset
- Figure 24. F-Measure comparison vs RepStream using optimal parameters on DS2 dataset
- Figure 25. F-Measure comparison vs RepStream using optimal parameters on SynTest dataset
- Figure 26. F-Measure comparison vs RepStream using optimal parameters on Closer dataset
- Figure 27. F-Measure comparison vs RepStream using optimal parameters on Tree Cover dataset
- Figure 28. F-Measure comparison vs RepStream using optimal parameters on KDD '99 Cup dataset
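The density radius described in the Figure 2 caption reduces to a one-line computation: the mean distance from a representative vertex to its point-level neighbours, scaled by the alpha factor. The sketch below is a minimal illustration assuming Euclidean distance; the function name and signature are hypothetical.

```python
import math

def density_radius(vertex, point_neighbours, alpha):
    """Average Euclidean distance from a representative vertex to its
    point-level neighbours, scaled by the alpha factor (Figure 2)."""
    distances = [math.dist(vertex, p) for p in point_neighbours]
    return alpha * sum(distances) / len(distances)

# A representative at the origin with three point-level neighbours, alpha = 1.5:
# distances are 1.0, 2.0 and sqrt(2), so DR = 1.5 * (3 + sqrt(2)) / 3.
print(density_radius((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)], 1.5))
```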
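Likewise, the intra/inter-class edge split from Figure 3 can be phrased as a short helper: an edge is inter-class exactly when its endpoints carry different ground-truth labels. The data layout (edges as index pairs, labels as a dict) is an assumption made for illustration.

```python
def split_edges_by_class(edges, labels):
    """Partition edges into intra- and inter-class sets (Figure 3): an edge
    is inter-class exactly when its two endpoints carry different labels."""
    intra = [e for e in edges if labels[e[0]] == labels[e[1]]]
    inter = [e for e in edges if labels[e[0]] != labels[e[1]]]
    return intra, inter

# Edge (0, 1) joins two class-"a" vertices (intra); (1, 2) crosses classes (inter).
print(split_edges_by_class([(0, 1), (1, 2)], {0: "a", 1: "a", 2: "b"}))
```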