Clustering aggregation

Authors: Aristides Gionis, Heikki Mannila, Panayiotis Tsaparas

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 1, Issue 1

Pages 4 - es

https://doi.org/10.1145/1217299.1217303

Published: 01 March 2007

Abstract

We consider the following problem: given a set of clusterings, find a single clustering that agrees as much as possible with the input clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the clustering aggregation problem; each categorical attribute can be viewed as a clustering of the input rows where rows are grouped together if they take the same value on that attribute. Clustering aggregation can also be used as a metaclustering method to improve the robustness of clustering by combining the output of multiple algorithms. Furthermore, the problem formulation does not require a priori information about the number of clusters; it is naturally determined by the optimization function.
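To make the categorical-data example concrete, here is a minimal sketch in Python. It assumes that agreement between two clusterings is measured by counting the pairs of rows that one clustering groups together and the other separates; the abstract does not state the measure, so treat this as one plausible choice. The function names and the toy rows are hypothetical.

```python
from itertools import combinations


def attribute_to_clustering(rows, column):
    """View one categorical column as a clustering: rows sharing a value share a label."""
    return [row[column] for row in rows]


def disagreement_distance(labels_a, labels_b):
    """Count pairs of objects that one clustering joins and the other splits.

    Assumption: this pair-counting measure stands in for "agreement";
    a lower count means the two clusterings agree more.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same objects")
    count = 0
    for i, j in combinations(range(len(labels_a)), 2):
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        if together_a != together_b:
            count += 1
    return count


# Hypothetical toy table with two categorical attributes: color and size.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("blue", "small"),
    ("blue", "large"),
]
by_color = attribute_to_clustering(rows, 0)
by_size = attribute_to_clustering(rows, 1)
print(disagreement_distance(by_color, by_size))
```

Under this measure, an aggregate clustering would be one whose total distance to all the attribute clusterings is as small as possible.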

In this article, we give a formal statement of the clustering aggregation problem, and we propose a number of algorithms. Our algorithms make use of the connection between clustering aggregation and the problem of correlation clustering. Although the problems we consider are NP-hard, for several of our methods, we provide theoretical guarantees on the quality of the solutions. Our work provides the best deterministic approximation algorithm for the variation of the correlation clustering problem we consider. We also show how sampling can be used to scale the algorithms for large datasets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions.
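The link to correlation clustering can be illustrated with a short Python sketch. It assumes the pair-counting notion of disagreement: for each pair of objects, record the fraction of input clusterings that separate them, then charge a candidate clustering that fraction when it joins the pair and one minus that fraction when it splits the pair. The function names are hypothetical, and returning the best input clustering is shown only as a simple baseline, not as a statement of the article's algorithms.

```python
from itertools import combinations


def separation_fractions(clusterings):
    """For every pair (u, v), the fraction of input clusterings that separate u and v."""
    m = len(clusterings)
    n = len(clusterings[0])
    return {
        (u, v): sum(c[u] != c[v] for c in clusterings) / m
        for u, v in combinations(range(n), 2)
    }


def correlation_cost(labels, fractions):
    """Correlation-clustering style cost of a candidate clustering.

    Joining a pair costs the fraction of inputs that separate it;
    splitting a pair costs the fraction of inputs that join it.
    """
    cost = 0.0
    for (u, v), x in fractions.items():
        cost += x if labels[u] == labels[v] else 1.0 - x
    return cost


def best_input_clustering(clusterings):
    """Baseline (an assumption of this sketch): return the input with the lowest cost."""
    fractions = separation_fractions(clusterings)
    return min(clusterings, key=lambda c: correlation_cost(c, fractions))


# Hypothetical input: three clusterings of four objects.
clusterings = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
fractions = separation_fractions(clusterings)
print(best_input_clustering(clusterings))
```

With m input clusterings, m times this cost equals the total number of pairwise disagreements between the candidate and the inputs, which is why minimizing one minimizes the other.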

Index Terms

Clustering aggregation
1. Information systems
  1. Information systems applications
    1. Data mining
2. Theory of computation
  1. Design and analysis of algorithms

ACM Transactions on Knowledge Discovery from Data Volume 1, Issue 1

March 2007

161 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/1217299

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
