An application of the minimal spanning tree approach to the cluster stability problem

Volkovich, Z.; Barzily, Z.; Weber, G.-W.; Toledano-Kitai, D.; Avros, R.

doi:10.1007/s10100-010-0157-4

Original Paper
Published: 15 July 2010

Volume 20, pages 119–139, (2012)

Central European Journal of Operations Research

Abstract

Among the areas of data and text mining employed today in OR, science, economy and technology, clustering theory serves as a preprocessing step in data analysis. An important component of clustering theory is the determination of the true number of clusters, a problem that has not been satisfactorily solved. In this paper, we address it with the cluster stability approach. For several candidate numbers of clusters, we estimate the stability of the partitions obtained by clustering samples. Partitions are considered consistent if their clusters are stable. Cluster validity is measured by the total number of edges, in the clusters’ minimal spanning trees, that connect points from different samples; this is the Friedman and Rafsky two-sample test statistic. Under the homogeneity hypothesis that the samples are well mingled within the clusters, this statistic is asymptotically normally distributed. Resting upon this fact, we compute the standard score of the edge count for each cluster and represent the partition quality by the worst cluster, that is, the one with the minimal standard score. It is natural to expect that the true number of clusters is characterized by the empirical distribution with the shortest left tail. The proposed methodology sequentially constructs this distribution and estimates its left-asymmetry. Several numerical experiments demonstrate the ability of the approach to detect the true number of clusters.
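
The per-cluster score described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes Euclidean distances, builds the minimal spanning tree with a plain Prim's algorithm, and standardizes the cross-sample edge count with the permutation-null mean and conditional variance given by Friedman and Rafsky (1979), which may differ from the asymptotic normalization used in the paper. The function names (`mst_edges`, `cross_edge_score`, `partition_score`) are hypothetical.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns index pairs."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known link from the tree to each point
    best_from = [-1] * n         # tree endpoint of that link
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_from[j] = 0
    edges = []
    for _ in range(n - 1):
        # Attach the closest point that is not yet in the tree.
        j = min((k for k in range(n) if not in_tree[k]),
                key=best_dist.__getitem__)
        in_tree[j] = True
        edges.append((best_from[j], j))
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[j], points[k])
                if d < best_dist[k]:
                    best_dist[k] = d
                    best_from[k] = j
    return edges


def cross_edge_score(sample_a, sample_b):
    """Return (cross-sample MST edge count, standard score) for one cluster."""
    m, n = len(sample_a), len(sample_b)
    total = m + n
    if m == 0 or n == 0 or total < 4:
        raise ValueError("need both samples non-empty and at least 4 points")
    points = list(sample_a) + list(sample_b)
    labels = [0] * m + [1] * n
    edges = mst_edges(points)
    cross = sum(1 for i, j in edges if labels[i] != labels[j])
    # Number of MST edge pairs that share a node (C in Friedman and Rafsky).
    degree = [0] * total
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    shared = sum(d * (d - 1) // 2 for d in degree)
    # Null mean and conditional variance under random relabelling.
    mean = 2 * m * n / total
    var = (2 * m * n / (total * (total - 1))) * (
        (2 * m * n - total) / total
        + (shared - total + 2) / ((total - 2) * (total - 3))
        * (total * (total - 1) - 4 * m * n + 2)
    )
    z = (cross - mean) / math.sqrt(var) if var > 0 else 0.0
    return cross, z


def partition_score(clusters):
    """Worst (minimal) standard score over (sample_a, sample_b) pairs."""
    return min(cross_edge_score(a, b)[1] for a, b in clusters)
```

Here `clusters` is assumed to hold, for each cluster of a partition, the points of that cluster coming from each of two samples. Well-mingled samples give many cross-sample edges and a high score; a cluster populated mostly by one sample in one region gives few cross-sample edges and a strongly negative score, which then determines the score of the whole partition.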

References

Akteke-Öztürk B, Weber G-W, Kropat E (2008) Continuous optimization approach for minimum sum of squares. In: ISI proceedings of the 20th Mini-EURO Conference “continuous optimization and knowledge-based technologies”. Neringa, Lithuania, pp 253–258
Akume D, Weber G-W (2002) Cluster algorithms: theory and methods. J Comput Technol Vychisl Tekhnol 7(1): 15–27
Bagirov A (2009) Large scale non smooth optimization problems in data mining. In: Proceedings of the XIII international conference applied stochastic models and data analysis (ASMDA). Vilnius
Bagirov A, Ugon J, Webb D (2009) A new global k-means algorithm for clustering large data sets. In: Proceedings of the XIII international conference applied stochastic models and data analysis (ASMDA). Vilnius
Baringhaus L, Franz C (2004) On a new multivariate two-sample test. J Multivar Anal 88(1): 190–206
Barzily Z, Volkovich Z, Akteke-Öztürk B, Weber G-W (2008) Cluster stability using minimal spanning trees. In: Proceedings of the 20th mini conference “continuous optimization and knowledge-based technologies”. EurOPT’, Lithuania, pp 248–253
Barzily Z, Volkovich Z, Akteke-Öztürk B, Weber G-W (2009) On a minimal spanning tree approach in the cluster validation problem. Informatica 20(2): 187–202
Ben-Hur A, Guyon I (2003) Detecting stable clusters using principal component analysis. In: Brownstein MJ, Khodursky A (eds) Methods in molecular biology. Humana Press, NJ, pp 159–182
Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data. In: Pacific symposium on biocomputing. pp 6–17
Büyükbebeci E (2009) Comparison of MARS, CMARS and CART in predicting default probabilities for emerging markets, M.Sc. term project Report/Thesis in financial mathematics. Institute of Applied Mathematics of METU, Ankara
Calinski R, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3: 1–27
Celeux G, Govaert G (1992) A classification EM algorithm and two stochastic versions. Comput Stat Data Anal 14: 315–332
Cheng R, Milligan G (1996) Measuring the influence of individual data points in a cluster analysis. J Classif 13: 315–335
Conover WJ, Johnson ME, Johnson MM (1981) Comparative study of tests of homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics 23: 351–361
Dhillon I, Kogan J, Nicholas C (2003) Feature selection and document clustering. In: Berry M (ed) A comprehensive survey of text mining. Springer, Berlin, pp 73–100
Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3(7): 0036.1–0036.21
Duran BS (1976) A survey of nonparametric tests for scale. Commun Stat Theory Methods 5: 1287–1312
Friedman JH, Rafsky LC (1979) Multivariate generalizations of the Wolfowitz and Smirnov two-sample tests. Ann Stat 7: 697–717
Gordon AD (1999) Classification. Chapman and Hall, CRC, Boca Raton
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Hartigan JA (1985) Statistical theory in clustering. J Classif 2: 63–76
Hastie T, Tibshirani R, Friedman JH (2001) The elements of statistical learning: data mining, inference and prediction. Springer, Berlin
Henze N (1988) A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann Stat 16: 772–783
Henze N, Penrose M (1999) On the multivariate runs test. Ann Stat 27: 290–298
Jain AK, Moreau JV (1987) Bootstrap technique in cluster analysis. Pattern Recognit 20(5): 547–568
Jain A, Xu X, Ho T, Xiao F (2002) Uniformity testing using minimal spanning tree. ICPR 4: 281–284
Karasözen B, Rubinov A, Weber G-W (2006) Optimization in Data Mining. Eur J Oper Res 173(3): 701–704
Kaufman L, Rousseeuw PJ (1990) Finding groups in data. Wiley, New York
Klebanov L (2005) N-distances and their applications. The Karolinum Press: Charles University in Prague, Prague
Klebanov L (2003) One class of distribution free multivariate tests. Sanct-Petersburg Math Soc Preprint, 3
Kropat E, Weber G-W, Pedamallu CS (2009) Regulatory networks under ellipsoidal uncertainty-optimization theory and dynamical systems. Preprint at IAM, METU
Krzanowski W, Lai Y (1985) A criterion for determining the number of groups in a dataset using sum of squares clustering. Biometrics 44: 23–34
Kuhn H (1955) The hungarian method for the assignment problem. Naval Res Logistics Q 2: 83–97
Lange T, Roth V, Braun M, Buhmann JM (2004) Stability-based validation of clustering solutions. Neural Comput 15(6): 1299–1323
Levine E, Domany E (2001) Resampling method for unsupervised estimation of cluster validity. Neural Comput 13: 2573–2593
Milligan G, Cooper M (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50: 159–179
Mufti GB, Bertrand P, El-Moubarki L (2005) Determining the number of groups from measures of cluster validity. In: Proceedings of ASMDA 2005. pp 404–414
Nesetril J, Milkova E, Nesetrilova H (2001) Otakar Boruvka on minimum spanning tree problem, Translation of both the 1926 papers, comments, history. Discrete Math 3–36
Özögür-Akyüz S, Weber G-W (2009) Infinite kernel learning by infinite and semi-infinite programming. In: Proceedings of the second global conference on power control and optimization, AIP conference proceedings 1159. Bali, Indonesia, June 1–3, Hakim AH, Vasant P, Barsoum N (guest eds)
Roth V, Lange T, Braun M, Buhmann J (2002) A resampling approach to cluster validation, COMPSTAT, available at http://www.cs.uni-bonn.De/~braunm
Sezgin Alp Ö, Büyükbebeci E, Iscanoglu Cekic A, Yerlikaya-Özkurt F, Taylan P, Weber G-W, - CMARS and GAM & CQP—modern optimization methods applied to international credit default prediction, preprint at IAM, METU, submitted for publication
Smith S, Jain A (1984) Testing for uniformity in multidimensional data. IEEE Trans Pattern Anal Mach Intell 6: 73–80
Sugar C, James G (2003) Finding the number of clusters in a data set: an information theoretic approach. J Am Stat Assoc 98: 750–763
Taylan P, Weber G-W, Yerlikaya F (2008) Continuous optimization applied in MARS for modern applications in finance, science and technology. In: ISI proceedings of 20th Mini-EURO conference continuous optimization and knowledge-based technologies. EurOPT 2008 317-322, Neringa, Lithuania
Tibshirani R, Walther G (2005) Cluster validation by prediction strength. J Comput Graph Stat 14(3): 511–528
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters via the gap statistic. J Royal Stat Soc B 63(2): 411–423
Varma S, Simon R (2004) Iterative class discovery and feature selection using minimal spanning trees. BMC Bioinformatics 5:126
Volkovich Z, Barzily Z, Morozensky L (2006) A cluster stability criteria based on the two-sample test concept. In: Proceeding of the second workshop on algorithmic techniques for data mining (ATDM). Springer, pp 329–338
Volkovich Z, Barzily Z, Morozensky L (2008) A statistical model of cluster stability. Pattern Recognit 41(7): 2174–2188
Volkovich Z, Barzily Z, Avros R, Toledano-Kitai D (2009) On application of the K-nearest neighbors approach for cluster validation. In: Proceeding of the XIII international conference applied stochastic models and data analysis (ASMDA). Vilnius
Volkovich Z, Barzily Z, Weber G-W, Toledano-Kitai D (2009) Cluster stability estimation based on a minimal spanning trees approach. The second global conference on power and optimization (PCO). Bali, Indonesia
Weber G-W, Batmaz I, Köksal G, Taylan P, Yerlikaya-Özkurt F CMARS: A new contribution to nonparametric regression with multivariate adaptive regression splines supported by continuous optimisation, preprint at IAM, METU, submitted for publication
Weber G-W, Taylan P, Yildirak K, Görgülü ZK (2009) Financial regression and organization. To appear in the special issue on optimization in finance, of dynamics of continuous, discrete and impulsive systems (Series B)
Wishart D (1969) Mode analysis: a generalization of nearest neighbor which reduces chaining effects. Numer Taxonomy 76:282–311, AJ Cole, Academic Press, London
Xu Y, Olman V, Xu D (2002) Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18: 535–545
Zahn C (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans Comput C-20(1): 68–86
Zech G, Aslan B (2005) New test for the multivariate two-sample problem based on the concept of minimum energy. J Stat Comput Simul 75(2): 109–119

Author information

Authors and Affiliations

Ort Braude College of Engineering, 21982, Karmiel, Israel
Z. Volkovich, Z. Barzily, D. Toledano-Kitai & R. Avros
Institute of Applied Mathematics, Middle East Technical University, 06531, Ankara, Turkey
G.-W. Weber
University of Siegen, Siegen, Germany
G.-W. Weber
University of Aveiro, Aveiro, Portugal
G.-W. Weber
Universiti Teknologi Malaysia, Skudai, Malaysia
G.-W. Weber

Corresponding author

Correspondence to Z. Volkovich.

About this article

Cite this article

Volkovich, Z., Barzily, Z., Weber, GW. et al. An application of the minimal spanning tree approach to the cluster stability problem. Cent Eur J Oper Res 20, 119–139 (2012). https://doi.org/10.1007/s10100-010-0157-4

Issue Date: March 2012
