Using distribution analysis for parameter selection in RepStream
Abstract
One of the most significant challenges in data clustering is the evolution of data distributions over time. Many clustering algorithms have been introduced to deal specifically with streaming data, but they commonly require users to set input parameters that tell the algorithm under what criteria data points may be clustered together. Setting the initial parameters for a clustering algorithm is itself a non-trivial task, and the evolution of the data distribution over time means that even optimally set parameters can become non-optimal as the stream evolves. In this paper we extend the RepStream algorithm, a combined graph- and density-based clustering algorithm, so that its primary input parameter, the $K$ value, is automatically adjusted over time. We introduce a feature called the edge distribution score, computed over the data held in memory, along with an incremental method for adjusting the $K$ parameter based on this score. We evaluate our method against RepStream itself and against other contemporary stream clustering algorithms, and show that automatically adjusting the $K$ value over time leads to higher-quality clustering output even when the initial parameters are set poorly.
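As a rough illustration of the approach described above, the Python sketch below pairs a hypothetical edge distribution score, here a simple coefficient of variation over the K-NN edge lengths currently held in memory, with a one-step-per-window adjustment of $K$ whenever the score leaves a target band. The score formula, the thresholds, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
import statistics

def edge_distribution_score(edge_lengths):
    """Illustrative score: dispersion of the in-memory K-NN edge lengths
    relative to their mean (a coefficient of variation)."""
    mean = statistics.mean(edge_lengths)
    return statistics.pstdev(edge_lengths) / mean if mean > 0 else 0.0

def adjust_k(k, score, low=0.3, high=0.7, k_min=3, k_max=50):
    """Nudge K by a single step when the score leaves the [low, high] band."""
    if score > high:   # edge lengths highly spread out: reduce connectivity
        return max(k_min, k - 1)
    if score < low:    # edge lengths very uniform: connectivity can grow
        return min(k_max, k + 1)
    return k

# One update per processed window of the stream.
k = 9
for window in ([0.8, 0.9, 1.1], [0.2, 1.9, 2.4], [1.0, 1.0, 1.1]):
    k = adjust_k(k, edge_distribution_score(window))
    print(k)
```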
Keywords: cluster analysis, data clustering, parameter adjustment, graph-based clustering, KNN graphs.
Mathematics Subject Classification: Primary: 62H30.
Table 1. Input parameters used by each algorithm

| Algorithm | Density parameters | Grid granularity | Reclustering method | Decay parameter | Distance parameter | Other parameters |
|---|---|---|---|---|---|---|
| CluStream | | | $k$-means | $\delta$ | | $\alpha$, $l$, snapshot parameters |
| SWClustering | $\beta$ | | $k$-means | | $\epsilon$ | $N$, window parameters |
| DenStream | $\mu$, $\beta$ | | | | $\epsilon$ | |
| HPStream | | | | $\lambda$ | $r$ | $l$ projected dimensions |
| D-Stream | $C_m$, $C_l$, $\beta$ | $len$ | | $\lambda$ | | |
| ExCC | | granularity in all dimensions | | | | |
| MR-Stream | $C_H$, $C_L$, $\beta$, $\mu$ | | hierarchical | $\lambda$ | $\epsilon$ | $H$ hierarchical limit |
| FlockStream | $\beta$, $\mu$ | | | $\lambda$ | $\epsilon$ | |
| DeBaRa | $dPts$ | | | | | $f$ |
| DBStream | $\alpha$, $w\_min$ | | | $\lambda$, $t\_gap$ | $r$ | |
| SNCStream | $\psi$, $\beta$ | | | $\lambda$ | $\epsilon$ | $K$ graph connectivity |
| PASCAL | | | evolutionary | | | evolutionary EDA hyper-parameters |
| HASTREAM | $minPts$ | | hierarchical | $\lambda$ | | |
| BEStream | $\Delta$, $\tau$ | | | $\lambda$ | $\xi$ | $\theta$ direction parameter |
| ADStream | $\xi$, $\epsilon$ | | | $\lambda$ | | |
| PatchWork | $ratio$, $minPoints$, $minCells$ | | | | $\epsilon$ | |
| RepStream | | | | $\lambda$ | $\alpha$ | $K$ graph connectivity |

Table 2. Best and worst $K$ values for RepStream, according to $F$ score

| Dataset | Best $K$ | Best $F$ score | Worst $K$ | Worst $F$ score |
|---|---|---|---|---|
| DS1 | 7 | 0.7208 | 18 | 0.2767 |
| DS2 | 7 | 0.6371 | 21 | 0.2594 |
| SynTest | 9 | 0.8614 | 5 | 0.5435 |
| Closer | 9 | 0.7989 | 5 | 0.4345 |
| TreeCov | 29 | 0.6108 | 5 | 0.2978 |
| KDD99 | 30 | 0.7898 | 5 | 0.2636 |

Table 3. Comparison of our dynamic $K$ method versus RepStream with optimal and worst $K$ values

| Dataset | Best F-Measure | Worst F-Measure | Dynamic $K$ |
|---|---|---|---|
| DS1 | 0.7208 | 0.2767 | 0.5153 |
| DS2 | 0.6371 | 0.2594 | 0.4137 |
| SynTest | 0.7989 | 0.5435 | 0.7091 |
| Closer | 0.8614 | 0.4345 | 0.8410 |
| TreeCov | 0.6108 | 0.2978 | 0.5920 |
| KDD99 | 0.7898 | 0.2636 | 0.7882 |
Figures
- Figure 1. An illustration of the representative and point-level sparse graphs in the RepStream algorithm
- Figure 2. The representative vertices $R_1$ and $R_2$ share a reciprocal connection at the representative level, and are also density related. The density radius of each vertex is shown as $DR_1$ and $DR_2$ respectively: the average distance to its neighbours at the point level, multiplied by the alpha scaling factor (see the sketch after this list)
- Figure 3. Intra- and inter-class edges. The edge $E_1$ is considered an inter-class edge as it connects two vertices $R_1$ and $R_2$ that belong to different ground-truth classes. Edge $E_2$ connects two vertices belonging to the same class and is thus considered an intra-class edge (see the helper after this list)
- Figure 4. An illustration of relative edge lengths of nearest neighbours in the middle of a cluster versus near the edge of a cluster
- Figure 5. Different cases for the distribution of edge lengths
- Figure 6. Visualisation of the DS1 dataset
- Figure 7. Visualisation of the DS2 dataset
- Figure 8. Two dimensional representations of the 5 classes in the SynTest dataset. The main class is always present and steadily changes shape, the smaller classes appear at various points through the dataset, as shown in Figure 9
- Figure 9. The class presence of the classes in the SynTest dataset. A marker indicates that the class is present in the dataset during the given time window
- Figure 10. The evolution of the Closer dataset, showing slices of its 3 sections
- Figure 11. Comparative purity for Tree Cover dataset
- Figure 12. Comparative purity for Tree Cover dataset
- Figure 13. Comparative purity for KDD '99 Cup dataset
- Figure 14. Comparative purity for KDD '99 Cup dataset
- Figure 15. Comparative purity for DS1 dataset
- Figure 16. The $K$ value selected by our dynamic $K$ method on the DS1 dataset
- Figure 17. Comparative purity for DS2 dataset
- Figure 18. Comparative purity for SynTest dataset
- Figure 19. Comparative purity for Closer dataset
- Figure 20. Comparative purity for Tree Cover dataset
- Figure 21. Comparative purity for KDD '99 Cup dataset
- Figure 22. The $K$ value selected by our dynamic $K$ method on the KDD dataset
- Figure 23. F-Measure comparison vs RepStream using optimal parameters on DS1 dataset
- Figure 24. F-Measure comparison vs RepStream using optimal parameters on DS2 dataset
- Figure 25. F-Measure comparison vs RepStream using optimal parameters on SynTest dataset
- Figure 26. F-Measure comparison vs RepStream using optimal parameters on Closer dataset
- Figure 27. F-Measure comparison vs RepStream using optimal parameters on Tree Cover dataset
- Figure 28. F-Measure comparison vs RepStream using optimal parameters on KDD '99 Cup dataset
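The density radius described in the Figure 2 caption reduces to a one-line computation: the mean distance from a representative vertex to its point-level neighbours, scaled by the alpha factor. The sketch below is a minimal illustration assuming Euclidean distance; the function name and signature are hypothetical.

```python
import math

def density_radius(vertex, point_neighbours, alpha):
    """Average Euclidean distance from a representative vertex to its
    point-level neighbours, scaled by the alpha factor (Figure 2)."""
    distances = [math.dist(vertex, p) for p in point_neighbours]
    return alpha * sum(distances) / len(distances)

# A representative at the origin with three point-level neighbours, alpha = 1.5:
# distances are 1.0, 2.0 and sqrt(2), so DR = 1.5 * (3 + sqrt(2)) / 3.
print(density_radius((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)], 1.5))
```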
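Likewise, the intra/inter-class edge split from Figure 3 can be phrased as a short helper: an edge is inter-class exactly when its endpoints carry different ground-truth labels. The data layout (edges as index pairs, labels as a dict) is an assumption made for illustration.

```python
def split_edges_by_class(edges, labels):
    """Partition edges into intra- and inter-class sets (Figure 3): an edge
    is inter-class exactly when its two endpoints carry different labels."""
    intra = [e for e in edges if labels[e[0]] == labels[e[1]]]
    inter = [e for e in edges if labels[e[0]] != labels[e[1]]]
    return intra, inter

# Edge (0, 1) joins two class-"a" vertices (intra); (1, 2) crosses classes (inter).
print(split_edges_by_class([(0, 1), (1, 2)], {0: "a", 1: "a", 2: "b"}))
```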