Towards understanding hierarchical clustering: A data distribution perspective

Authors:

Jian Chen

Neurocomputing, Volume 72, Issue 10-12

Pages 2319 - 2330

https://doi.org/10.1016/j.neucom.2008.12.011

Published: 01 June 2009

Abstract

Hierarchical clustering is an important category of clustering methods, and considerable research effort has focused on algorithm-level improvements to the hierarchical clustering process. In this paper, our goal is instead to provide a systematic understanding of hierarchical clustering from a data distribution perspective. Specifically, we investigate how the "true" cluster distribution affects clustering performance, and how hierarchical clustering schemes relate to validation measures under different data distributions. To this end, we provide an organized study of these issues. One of our key findings is that hierarchical clustering tends to produce clusters with high variation in cluster sizes regardless of the "true" cluster distribution. Our results also show that the F-measure, an external clustering validation measure, is biased towards hierarchical clustering algorithms that tend to increase the variation in cluster sizes. In light of this, we propose $F_{norm}$, a normalized version of the F-measure, to solve the cluster validation problem for hierarchical clustering. Experimental results show that $F_{norm}$ is indeed more suitable than the unnormalized F-measure for evaluating hierarchical clustering results across data sets with different data distributions.
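The abstract names two quantities without defining them: the variation in cluster sizes and the F-measure. The following is a minimal sketch of both, under two assumptions that are not taken from the abstract: size variation is measured by the coefficient of variation (sample standard deviation divided by the mean), and the F-measure is the class-size-weighted form commonly used in document clustering, where each true class is matched to its best-scoring cluster. The function names `size_cv` and `f_measure` are illustrative only. The normalized measure is not sketched because the abstract does not give its definition.

```python
from collections import Counter
from statistics import mean, stdev


def size_cv(labels):
    """Coefficient of variation (sample std / mean) of the group sizes.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        return 0.0  # a single group has no size variation
    return stdev(sizes) / mean(sizes)


def f_measure(true_labels, cluster_labels):
    """Class-size-weighted F-measure of a flat clustering.

    Assumption: F = sum_i (n_i / n) * max_j F(i, j), where F(i, j) is the
    harmonic mean of precision n_ij / n_j and recall n_ij / n_i.
    """
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))  # n_ij counts

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck == 0:
                continue  # no shared objects, so F(i, j) is 0
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


# Toy example: one object from the small class is absorbed by the big cluster.
true_labels = [0, 0, 0, 0, 1, 1]
skewed_clusters = [0, 0, 0, 0, 0, 1]
print(size_cv(true_labels), size_cv(skewed_clusters))
print(f_measure(true_labels, skewed_clusters))
```

In this toy case the clustering has a larger size variation than the true classes. Comparing the two coefficients of variation before and after clustering is one simple way to observe the kind of shift in size distribution that the abstract describes.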

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.

Copyright © 2009 Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands
