Abstract
Ensemble clustering techniques have improved in recent years, offering better average performance across domains and data sets. Their benefits range from finding novel clusterings, unattainable by any single clustering algorithm, to providing clustering stability, such that quality is little affected by noise, outliers or sampling variations. The main clustering ensemble strategies are: combining the results of different clustering algorithms; producing different results by resampling the data, as in bagging and boosting techniques; and executing a given algorithm multiple times with different parameters or initializations. Ensemble techniques are often developed for supervised settings and later adapted to the unsupervised setting. Recently, Blaser and Fryzlewicz proposed an ensemble technique for classification based on resampling and transforming input data; specifically, they employed random rotations to significantly improve the performance of Random Forests. In this work, we empirically studied the effects of random transformations based on rotation matrices, Mahalanobis distance and density proximity on ensemble clustering. Our experiments considered 12 data sets and 25 variations of random transformations, yielding a total of 5580 transformed data sets, which were applied to 8 algorithms and evaluated by 4 clustering measures. Statistical tests identified 17 random transformations that can viably be applied to ensembles and to standard clustering algorithms, with positive effects on cluster quality. In our results, the best performing transformations were the Mahalanobis-based ones.
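To make the best-performing ingredients concrete, the following is a minimal Python sketch, not the paper's exact experimental pipeline: it combines a Mahalanobis-based whitening transform and Haar-random rotations with k-means base clusterings, merged through an evidence-accumulation (co-association) consensus in the style of Fred and Jain (2005). The data set (Iris), the number of clusters k = 3 and the ensemble size of 20 members are illustrative assumptions, not choices taken from the paper.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

rng = np.random.default_rng(0)
X = load_iris().data          # illustrative data set
n, d = X.shape
k, n_members = 3, 20          # illustrative parameter choices

def random_rotation(d, rng):
    # QR decomposition of a Gaussian matrix gives a Haar-distributed
    # orthogonal matrix (sign-corrected so the distribution is uniform).
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def mahalanobis_whitening(X):
    # After multiplying centered data by the inverse square root of the
    # covariance matrix, Euclidean distance equals Mahalanobis distance.
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return (X - X.mean(axis=0)) @ W

# Each ensemble member clusters a randomly rotated view of the whitened
# data; points placed in the same cluster gain one co-association vote.
coassoc = np.zeros((n, n))
Xw = mahalanobis_whitening(X)
for _ in range(n_members):
    labels = KMeans(n_clusters=k, n_init=5).fit_predict(Xw @ random_rotation(d, rng))
    coassoc += labels[:, None] == labels[None, :]

# Consensus partition: average-linkage hierarchical clustering of the
# co-association dissimilarities (evidence accumulation).
dist = 1.0 - coassoc / n_members
consensus = fcluster(linkage(dist[np.triu_indices(n, k=1)], method="average"),
                     t=k, criterion="maxclust")
print(np.bincount(consensus)[1:])  # cluster sizes of the consensus partition

Each random rotation changes the axis-aligned geometry seen by the base algorithm, so ensemble members disagree in informative ways; the co-association matrix records how often each pair of points is clustered together, and the consensus partition is read off that matrix.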










Notes
Experiments indicate the algorithm is stable with respect to variations in the value of h.
This technique is available in the e1071 package for R.
This technique is available in the hkclustering package for R.
References
Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, New Orleans, Louisiana, pp 1027–1035
Barthélemy JP, Leclerc B (1995) The median procedure for partitions. DIMACS Series in Discrete Mathematics and Theoretical Computer Science 19:3–34
Ben-Hur A, Elisseeff A, Guyon I (2001) A stability based method for discovering structure in clustered data. In: Pacific symposium on biocomputing. Hawaii, vol 7, pp 6–17
Blaser R, Fryzlewicz P (2016) Random rotation ensembles. J Mach Learn Res 17:1–26
Breiman L (1996) Bagging predictors. Machine Learning 24 (2):123–140. https://doi.org/10.1023/A:1018054314350
Jain BJ (2016) Condorcet's jury theorem for consensus clustering and its implications for diversity. arXiv:1604.07711
Campello RJGB, Moulavi D, Zimek A, Sander J (2015) Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 10(1):5. https://doi.org/10.1145/2733381
Conover WJ, Iman RL (1979) On multiple-comparisons procedures. Los Alamos Scientific Laboratory Tech. Rep. LA-7677-MS, pp 1–14
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1):1–38
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(Jan):1–30
Diaconis P, Shahshahani M (1994) On the eigenvalues of random matrices. Journal of Applied Probability 31(A):49–62. https://doi.org/10.2307/3214948
Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9):1090–1099
Dunn JC (1974) Well-separated clusters and optimal fuzzy partitions. J Cybern 4(1):95–104. https://doi.org/10.1080/01969727408546059
Efron B (1979) Bootstrap methods: another look at the jackknife. The Annals of Statistics 7(1):1–26
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Conference on knowledge discovery and data mining. Portland, Oregon, USA, vol 96, pp 226–231
Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th international conference on machine learning (ICML-03). Washington, DC, pp 186–193
Fred ALN, Jain AK (2002) Data clustering using evidence accumulation. In: 16th international conference on pattern recognition, 2002. Proceedings. https://doi.org/10.1109/ICPR.2002.1047450, vol 4. IEEE, Quebec, pp 276–280
Fred ALN, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6):835–850
Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701
Frossyniotis D, Likas A, Stafylopatis A (2004) A clustering method based on boosting. Pattern Recogn Lett 25(6):641–654
Householder AS (1958) Unitary triangularization of a nonsymmetric matrix. Journal of the ACM (JACM) 5(4):339–342. https://doi.org/10.1145/320941.320947
Hubert L, Arabie P (1985) Comparing partitions. Journal of Classification 2(1):193–218. https://doi.org/10.1007/BF01908075
Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1):100–108. http://www.jstor.org/stable/2346830
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Computing Surveys (CSUR) 31(3):264–323
Leisch F (1999) Bagged clustering. Working paper, SFB Adaptive Information Systems and Modelling in Economics and Management Science
Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 19 Jul 2017
Lloyd S (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137. https://doi.org/10.1109/TIT.1982.1056489
Mahalanobis PC (1936) On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India 2(1):49–55
Mehta P, Bukov M, Wang C-H, Day AGR, Richardson C, Fisher CK, Schwab DJ (2019) A high-bias, low-variance introduction to machine learning for physicists. Physics Reports. https://doi.org/10.1016/j.physrep.2019.03.001
Minaei-Bidgoli B, Topchy A, Punch WF (2004) Ensembles of partitions via data resampling. In: International conference on information technology: coding and computing, 2004. Proceedings. ITCC 2004. https://doi.org/10.1109/ITCC.2004.1286629, vol 2. IEEE, Las Vegas, pp 188–192
Minaei-Bidgoli B, Parvin H, Alinejad-Rokny H, Alizadeh H, Punch WF (2014) Effects of resampling method and adaptation on clustering ensemble efficacy. Artificial Intelligence Review 41(1):27–48. https://doi.org/10.1007/s10462-011-9295-x
Moreau JV, Jain AK (1987) The bootstrap approach to clustering. In: Pattern recognition theory and applications. Springer, Berlin, pp 63–71
Ostrovsky R, Rabani Y, Schulman LJ, Swamy C (2006) The effectiveness of Lloyd-type methods for the k-means problem. In: 47th annual IEEE symposium on foundations of computer science (FOCS '06). Washington, DC, USA, pp 165–176
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65. https://doi.org/10.1016/0377-0427(87)90125-7
Schapire RE (1990) The strength of weak learnability. Machine Learning 5(2):197–227. https://doi.org/10.1023/A:1022648800760
Siersdorfer S, Sizov S (2004) Restrictive clustering and metaclustering for self-organizing document collections. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. https://doi.org/10.1145/1008992.1009032. ACM, New York, pp 226–233
Silva GR, Albertini MK (2017) Using multiple clustering algorithms to generate constraint rules and create consensus clusters. In: 2017 Brazilian conference on intelligent systems (BRACIS). https://doi.org/10.1109/BRACIS.2017.78. IEEE, Uberlandia, pp 312–317
Stoyanov K (2015) Hierarchical k-means clustering and its application in customer segmentation. Ph.D. thesis, University of Essex, UK
Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3(Dec):583–617
Topchy A, Jain AK, Punch W (2004) A mixture model for clustering ensembles. In: Proceedings of the 2004 SIAM international conference on data mining. https://doi.org/10.1137/1.9781611972740.35. SIAM, Florida, pp 379–390
Topchy A, Jain AK, Punch WF (2003) Combining multiple weak clusterings. In: Third IEEE international conference on data mining, 2003. ICDM 2003. IEEE, Melbourne, pp 331–338
Vendramin L, Campello RJGB, Hruschka ER (2010) Relative clustering validity criteria: a comparative overview. Statistical Analysis and Data Mining 3 (4):209–235. https://doi.org/10.1002/sam.10080
Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, CA, USA
Wu J, Liu H, Xiong H, Cao J, Chen J (2015) K-means-based consensus clustering: a unified view. IEEE Trans Knowl Data Eng 27(1):155–169. https://doi.org/10.1109/TKDE.2014.2316512
Yu Z, Luo P, You J, Wong HS, Leung H, Wu S, Zhang J, Han G (2016) Incremental semi-supervised clustering ensemble for high dimensional data clustering. IEEE Trans Knowl Data Eng 28(3):701–714
Acknowledgments
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and in part by the CAPES-PrInt internationalization funding program.
Cite this article
Rodrigues, G.D., Albertini, M.K. & Yang, X. An empirical evaluation of random transformations applied to ensemble clustering. Multimed Tools Appl 79, 34253–34285 (2020). https://doi.org/10.1007/s11042-020-08947-x