Towards understanding hierarchical clustering: A data distribution perspective

Published: 01 June 2009

Abstract

Hierarchical clustering is an important category of clustering methods. Considerable research effort has focused on algorithm-level improvements of the hierarchical clustering process. In this paper, our goal is to provide a systematic understanding of hierarchical clustering from a data distribution perspective. Specifically, we investigate how the "true" cluster distribution affects clustering performance, and how hierarchical clustering schemes and validation measures relate to each other under different data distributions. To this end, we provide an organized study to illustrate these issues. One of our key findings is that hierarchical clustering tends to produce clusters with high variation in cluster sizes, regardless of the "true" cluster distribution. Our results also show that the F-measure, an external clustering validation measure, is biased towards hierarchical clustering algorithms that tend to increase the variation in cluster sizes. In light of this, we propose F_norm, a normalized version of the F-measure, to address the cluster validation problem for hierarchical clustering. Experimental results show that F_norm is more suitable than the unnormalized F-measure for evaluating hierarchical clustering results across data sets with different data distributions.
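The two quantities the abstract relies on, the variation in cluster sizes and the F-measure, can be illustrated with a minimal sketch. It makes assumptions not stated in the abstract: the coefficient of variation is taken as the sample standard deviation of the cluster sizes divided by their mean, and the F-measure is the commonly used class-weighted form, in which each true class is matched to its best-scoring cluster. The function names `cluster_size_cv` and `f_measure` are hypothetical, and the normalized F-measure is not reproduced because the abstract does not define it.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_size_cv(labels):
    """Coefficient of variation of cluster sizes.

    Assumption: CV = sample standard deviation / mean of the sizes.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        return 0.0  # a single cluster has no size variation
    return stdev(sizes) / mean(sizes)


def f_measure(true_classes, cluster_labels):
    """Class-weighted clustering F-measure (assumed definition).

    For class c and cluster k with overlap n_ck:
      recall = n_ck / n_c, precision = n_ck / n_k,
      F(c, k) = 2 * precision * recall / (precision + recall).
    The overall score is sum over classes of (n_c / n) * max_k F(c, k).
    """
    n = len(true_classes)
    class_sizes = Counter(true_classes)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_classes, cluster_labels))

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap.get((c, k), 0)
            if n_ck == 0:
                continue  # no shared objects, F(c, k) is zero
            recall = n_ck / n_c
            precision = n_ck / n_k
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


# Toy data: two balanced "true" classes of four objects each.
truth = ["a"] * 4 + ["b"] * 4
balanced = [0] * 4 + [1] * 4   # recovers the true classes
skewed = [0] * 7 + [1] * 1     # one large and one tiny cluster

for name, labels in [("balanced", balanced), ("skewed", skewed)]:
    print(name, round(cluster_size_cv(labels), 3), round(f_measure(truth, labels), 3))
```

Comparing the cluster-size CV of a clustering result with the CV of the true class sizes is one way to quantify how far the algorithm shifts the size distribution, which is the effect the abstract describes.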

Publisher: Elsevier Science Publishers B. V., Netherlands

Author Tags

1. Coefficient of variation (CV)
2. F-measure
3. Hierarchical clustering
4. Measure normalization
5. Unweighted pair group method with arithmetic mean (UPGMA)
