Abstract
Among the areas of data and text mining employed today in operations research (OR), science, economics and technology, clustering serves as a preprocessing step in data analysis. An important component of clustering theory is determining the true number of clusters, a problem that has not yet been satisfactorily solved. In this paper, we address it through the cluster stability approach. For several candidate numbers of clusters, we estimate the stability of the partitions obtained by clustering samples. Partitions are considered consistent if their clusters are stable. Cluster validity is measured by the total number of edges in the clusters’ minimal spanning trees that connect points from different samples; that is, we use the Friedman and Rafsky two-sample test statistic. Under the homogeneity hypothesis that the samples are well mingled within the clusters, this statistic is asymptotically normally distributed. Relying on this fact, we compute the standard score of the edge count for each cluster and represent the partition quality by the worst cluster, i.e., the one with the minimal standard score. It is natural to expect that the true number of clusters is characterized by the empirical distribution of this score having the shortest left tail. The proposed methodology sequentially constructs this distribution and estimates its left asymmetry. Several numerical experiments demonstrate the ability of the approach to detect the true number of clusters.
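The per-cluster score described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names fr_standard_score and partition_quality are hypothetical, sample membership is assumed to be coded as 0/1, and the variance is the permutation variance conditional on the tree, as given by Friedman and Rafsky (1979), rather than whatever asymptotic normalization the paper itself may use.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def fr_standard_score(points, sample_ids):
    """Standard score of the cross-sample MST edge count for one cluster.

    Assumptions (not from the paper): sample_ids holds 0/1 labels, the
    cluster has at least 4 distinct points, and both samples are present.
    Degenerate trees (e.g. a star) give zero variance and are not handled.
    """
    points = np.asarray(points, dtype=float)
    sample_ids = np.asarray(sample_ids)
    N = len(points)
    m = int(np.sum(sample_ids == 0))
    n = N - m

    # Dense Euclidean distance matrix. SciPy reads zero entries as
    # "no edge", so duplicate points would need jittering beforehand.
    mst = minimum_spanning_tree(squareform(pdist(points))).tocoo()
    rows, cols = mst.row, mst.col

    # Number of MST edges joining points from different samples.
    cross = int(np.sum(sample_ids[rows] != sample_ids[cols]))

    # C = number of edge pairs sharing a common node.
    degree = np.bincount(np.concatenate([rows, cols]), minlength=N)
    C = np.sum(degree * (degree - 1)) / 2.0

    # Mean and conditional variance under random relabeling of the points.
    mean = 2.0 * m * n / N
    var = (2.0 * m * n / (N * (N - 1))) * (
        (2.0 * m * n - N) / N
        + (C - N + 2) / ((N - 2) * (N - 3)) * (N * (N - 1) - 4.0 * m * n + 2)
    )
    return (cross - mean) / np.sqrt(var)


def partition_quality(points, cluster_ids, sample_ids):
    """Quality of a partition: the minimal standard score over its clusters."""
    points = np.asarray(points, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    sample_ids = np.asarray(sample_ids)
    scores = []
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        scores.append(fr_standard_score(points[mask], sample_ids[mask]))
    return min(scores)
```

A strongly negative score signals a cluster in which the two samples are poorly mingled. Repeating partition_quality over many sample pairs for each candidate number of clusters would yield the empirical distribution whose left tail the method compares.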
References
Akteke-Öztürk B, Weber G-W, Kropat E (2008) Continuous optimization approach for minimum sum of squares. In: ISI proceedings of the 20th Mini-EURO Conference “continuous optimization and knowledge-based technologies”. Neringa, Lithuania, pp 253–258
Akume D, Weber G-W (2002) Cluster algorithms: theory and methods. J Comput Technol Vychisl Tekhnol 7(1): 15–27
Bagirov A (2009) Large scale non smooth optimization problems in data mining. In: Proceedings of the XIII international conference applied stochastic models and data analysis (ASMDA). Vilnius
Bagirov A, Ugon J, Webb D (2009) A new global k-means algorithm for clustering large data sets. In: Proceedings of the XIII international conference applied stochastic models and data analysis (ASMDA). Vilnius
Baringhaus L, Franz C (2004) On a new multivariate two-sample test. J Multivar Anal 88(1): 190–206
Barzily Z, Volkovich Z, Akteke-Öztürk B, Weber G-W (2008) Cluster stability using minimal spanning trees. In: ISI proceedings of the 20th Mini-EURO conference “continuous optimization and knowledge-based technologies”. EurOPT 2008, Neringa, Lithuania, pp 248–253
Barzily Z, Volkovich Z, Akteke-Öztürk B, Weber G-W (2009) On a minimal spanning tree approach in the cluster validation problem. Informatica 20(2): 187–202
Ben-Hur A, Guyon I (2003) Detecting stable clusters using principal component analysis. In: Brownstein MJ, Khodursky A (eds) Methods in molecular biology. Humana Press, NJ, pp 159–182
Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data. In: Pacific symposium on biocomputing. pp 6–17
Büyükbebeci E (2009) Comparison of MARS, CMARS and CART in predicting default probabilities for emerging markets, M.Sc. term project Report/Thesis in financial mathematics. Institute of Applied Mathematics of METU, Ankara
Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3: 1–27
Celeux G, Govaert G (1992) A classification EM algorithm and two stochastic versions. Comput Stat Data Anal 14: 315–332
Cheng R, Milligan G (1996) Measuring the influence of individual data points in a cluster analysis. J Classif 13: 315–335
Conover WJ, Johnson ME, Johnson MM (1981) Comparative study of tests of homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics 23: 351–361
Dhillon I, Kogan J, Nicholas C (2003) Feature selection and document clustering. In: Berry M (ed) A comprehensive survey of text mining. Springer, Berlin, pp 73–100
Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3(7): 0036.1–0036.21
Duran BS (1976) A survey of nonparametric tests for scale. Commun Stat Theory Methods 5: 1287–1312
Friedman JH, Rafsky LC (1979) Multivariate generalizations of the Wolfowitz and Smirnov two-sample tests. Ann Stat 7: 697–717
Gordon AD (1999) Classification. Chapman and Hall, CRC, Boca Raton
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Hartigan JA (1985) Statistical theory in clustering. J Classif 2: 63–76
Hastie T, Tibshirani R, Friedman JH (2001) The elements of statistical learning: data mining, inference and prediction. Springer, Berlin
Henze N (1988) A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann Stat 16: 772–783
Henze N, Penrose M (1999) On the multivariate runs test. Ann Stat 27: 290–298
Jain AK, Moreau JV (1987) Bootstrap technique in cluster analysis. Pattern Recognit 20(5): 547–568
Jain A, Xu X, Ho T, Xiao F (2002) Uniformity testing using minimal spanning tree. ICPR 4: 281–284
Karasözen B, Rubinov A, Weber G-W (2006) Optimization in Data Mining. Eur J Oper Res 173(3): 701–704
Kaufman L, Rousseeuw PJ (1990) Finding groups in data. Wiley, New York
Klebanov L (2005) N-distances and their applications. The Karolinum Press: Charles University in Prague, Prague
Klebanov L (2003) One class of distribution free multivariate tests. Sanct-Petersburg Math Soc Preprint, 3
Kropat E, Weber G-W, Pedamallu CS (2009) Regulatory networks under ellipsoidal uncertainty-optimization theory and dynamical systems. Preprint at IAM, METU
Krzanowski W, Lai Y (1988) A criterion for determining the number of groups in a dataset using sum of squares clustering. Biometrics 44: 23–34
Kuhn H (1955) The Hungarian method for the assignment problem. Naval Res Logistics Q 2: 83–97
Lange T, Roth V, Braun M, Buhmann JM (2004) Stability-based validation of clustering solutions. Neural Comput 16(6): 1299–1323
Levine E, Domany E (2001) Resampling method for unsupervised estimation of cluster validity. Neural Comput 13: 2573–2593
Milligan G, Cooper M (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50: 159–179
Mufti GB, Bertrand P, El-Moubarki L (2005) Determining the number of groups from measures of cluster validity. In: Proceedings of ASMDA 2005. pp 404–414
Nesetril J, Milkova E, Nesetrilova H (2001) Otakar Boruvka on minimum spanning tree problem, Translation of both the 1926 papers, comments, history. Discrete Math 3–36
Özögür-Akyüz S, Weber G-W (2009) Infinite kernel learning by infinite and semi-infinite programming. In: Proceedings of the second global conference on power control and optimization, AIP conference proceedings 1159. Bali, Indonesia, June 1–3, Hakim AH, Vasant P, Barsoum N (guest eds)
Roth V, Lange T, Braun M, Buhmann J (2002) A resampling approach to cluster validation, COMPSTAT, available at http://www.cs.uni-bonn.De/~braunm
Sezgin Alp Ö, Büyükbebeci E, Iscanoglu Cekic A, Yerlikaya-Özkurt F, Taylan P, Weber G-W. CMARS and GAM & CQP: modern optimization methods applied to international credit default prediction. Preprint at IAM, METU, submitted for publication
Smith S, Jain A (1984) Testing for uniformity in multidimensional data. IEEE Trans Pattern Anal Mach Intell 6: 73–80
Sugar C, James G (2003) Finding the number of clusters in a data set: an information theoretic approach. J Am Stat Assoc 98: 750–763
Taylan P, Weber G-W, Yerlikaya F (2008) Continuous optimization applied in MARS for modern applications in finance, science and technology. In: ISI proceedings of the 20th Mini-EURO conference “continuous optimization and knowledge-based technologies”. EurOPT 2008, Neringa, Lithuania, pp 317–322
Tibshirani R, Walther G (2005) Cluster validation by prediction strength. J Comput Graph Stat 14(3): 511–528
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters via the gap statistic. J Royal Stat Soc B 63(2): 411–423
Varma S, Simon R (2004) Iterative class discovery and feature selection using minimal spanning trees. BMC Bioinformatics 5:126
Volkovich Z, Barzily Z, Morozensky L (2006) A cluster stability criteria based on the two-sample test concept. In: Proceedings of the second workshop on algorithmic techniques for data mining (ATDM). Springer, pp 329–338
Volkovich Z, Barzily Z, Morozensky L (2008) A statistical model of cluster stability. Pattern Recognit 41(7): 2174–2188
Volkovich Z, Barzily Z, Avros R, Toledano-Kitai D (2009) On application of the K-nearest neighbors approach for cluster validation. In: Proceedings of the XIII international conference applied stochastic models and data analysis (ASMDA). Vilnius
Volkovich Z, Barzily Z, Weber G-W, Toledano-Kitai D (2009) Cluster stability estimation based on a minimal spanning trees approach. In: Proceedings of the second global conference on power control and optimization (PCO). Bali, Indonesia
Weber G-W, Batmaz I, Köksal G, Taylan P, Yerlikaya-Özkurt F. CMARS: a new contribution to nonparametric regression with multivariate adaptive regression splines supported by continuous optimisation. Preprint at IAM, METU, submitted for publication
Weber G-W, Taylan P, Yildirak K, Görgülü ZK (2009) Financial regression and organization. To appear in the special issue on optimization in finance, of dynamics of continuous, discrete and impulsive systems (Series B)
Wishart D (1969) Mode analysis: a generalization of nearest neighbor which reduces chaining effects. Numer Taxonomy 76:282–311, AJ Cole, Academic Press, London
Xu Y, Olman V, Xu D (2002) Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18: 535–545
Zahn C (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans Comput C-20(1): 68–86
Zech G, Aslan B (2005) New test for the multivariate two-sample problem based on the concept of minimum energy. J Stat Comput Simul 75(2): 109–119
Cite this article
Volkovich, Z., Barzily, Z., Weber, GW. et al. An application of the minimal spanning tree approach to the cluster stability problem. Cent Eur J Oper Res 20, 119–139 (2012). https://doi.org/10.1007/s10100-010-0157-4
DOI: https://doi.org/10.1007/s10100-010-0157-4