
Clustering aggregation

Published: 01 March 2007

Abstract

We consider the following problem: given a set of clusterings, find a single clustering that agrees as much as possible with the input clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the clustering aggregation problem; each categorical attribute can be viewed as a clustering of the input rows where rows are grouped together if they take the same value on that attribute. Clustering aggregation can also be used as a metaclustering method to improve the robustness of clustering by combining the output of multiple algorithms. Furthermore, the problem formulation does not require a priori information about the number of clusters; it is naturally determined by the optimization function.
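As a minimal sketch of the categorical-data view described above, the Python below treats each attribute column of a small made-up table as a clustering of the rows, then counts the row pairs on which two clusterings disagree. The pair-counting disagreement measure is an assumption of this sketch (the abstract only says the aggregate should agree as much as possible with the inputs), and the table and function names are hypothetical.

```python
from itertools import combinations


def attribute_to_clustering(column):
    """Give rows that share an attribute value the same integer cluster label."""
    label_of = {}
    return [label_of.setdefault(value, len(label_of)) for value in column]


def disagreements(c1, c2):
    """Count row pairs that one clustering joins and the other separates."""
    count = 0
    for u, v in combinations(range(len(c1)), 2):
        together_in_c1 = c1[u] == c1[v]
        together_in_c2 = c2[u] == c2[v]
        if together_in_c1 != together_in_c2:
            count += 1
    return count


# Hypothetical toy table: each row is (color, size).
rows = [
    ("red", "small"),
    ("red", "small"),
    ("blue", "small"),
    ("blue", "large"),
]

# Transpose rows into attribute columns; each column induces one clustering.
columns = list(zip(*rows))
clusterings = [attribute_to_clustering(col) for col in columns]

print(clusterings)
print(disagreements(clusterings[0], clusterings[1]))
```

Under this assumed measure, an aggregate clustering would be one that keeps the total number of such disagreements with all input clusterings small.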
In this article, we give a formal statement of the clustering aggregation problem, and we propose a number of algorithms. Our algorithms make use of the connection between clustering aggregation and the problem of correlation clustering. Although the problems we consider are NP-hard, for several of our methods, we provide theoretical guarantees on the quality of the solutions. Our work provides the best deterministic approximation algorithm for the variation of the correlation clustering problem we consider. We also show how sampling can be used to scale the algorithms for large datasets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions.
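The connection to correlation clustering mentioned above can be illustrated with a short Python sketch. It assumes that the distance between two objects is the fraction of input clusterings that separate them, and that a candidate clustering pays that distance for every pair it joins and one minus that distance for every pair it splits. The greedy pivot rule with a 1/2 threshold is a simple heuristic chosen for illustration; it is not claimed to be one of the article's algorithms and carries no guarantee here. All names and the toy inputs are hypothetical.

```python
from itertools import combinations


def separation_fractions(clusterings):
    """dist[u][v] = fraction of input clusterings placing u and v apart."""
    n = len(clusterings[0])
    m = len(clusterings)
    dist = [[0.0] * n for _ in range(n)]
    for u, v in combinations(range(n), 2):
        apart = sum(1 for c in clusterings if c[u] != c[v])
        dist[u][v] = dist[v][u] = apart / m
    return dist


def correlation_cost(labels, dist):
    """Pay dist for each joined pair and (1 - dist) for each split pair."""
    cost = 0.0
    for u, v in combinations(range(len(labels)), 2):
        cost += dist[u][v] if labels[u] == labels[v] else 1.0 - dist[u][v]
    return cost


def pivot_aggregate(dist):
    """Greedy heuristic: each unassigned pivot absorbs unassigned close objects."""
    n = len(dist)
    labels = [None] * n
    next_label = 0
    for pivot in range(n):
        if labels[pivot] is not None:
            continue
        labels[pivot] = next_label
        for v in range(pivot + 1, n):
            # Join v when fewer than half of the inputs separate it from the pivot.
            if labels[v] is None and dist[pivot][v] < 0.5:
                labels[v] = next_label
        next_label += 1
    return labels


# Hypothetical input: three clusterings of six objects.
inputs = [
    [0, 0, 1, 1, 2, 2],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 2],
]
dist = separation_fractions(inputs)
consensus = pivot_aggregate(dist)
print(consensus, correlation_cost(consensus, dist))
```

Note that the number of clusters is never passed in: it falls out of the pairwise distances, in line with the abstract's remark that the optimization function determines it.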



Published In

ACM Transactions on Knowledge Discovery from Data, Volume 1, Issue 1
March 2007
161 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/1217299

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Data clustering
  2. clustering aggregation
  3. clustering categorical data
  4. correlation clustering

Qualifiers

  • Article
