RIC: Parameter-free noise-robust clustering

Published: 01 December 2007

Abstract

How do we find a natural clustering of a real-world point set that contains an unknown number of clusters with different shapes, and that may be contaminated by noise? Most clustering algorithms were designed under certain assumptions (e.g., Gaussianity); they often require the user to supply input parameters, and they are sensitive to noise. In this article, we propose a robust framework for determining a natural clustering of a given dataset, based on the minimum description length (MDL) principle. The proposed framework, robust information-theoretic clustering (RIC), is orthogonal to any known clustering algorithm: given a preliminary clustering, RIC purifies the clusters from noise and adjusts them so that it simultaneously determines the most natural number and shape (subspace) of the clusters. RIC can be combined with any clustering technique, ranging from K-means and K-medoids to advanced methods such as spectral clustering. In fact, RIC is able to purify and improve an initial coarse clustering even if we start with very simple methods. In an extension, we propose a fully automatic stand-alone clustering method and efficiency improvements. RIC scales well with the dataset size. Extensive experiments on synthetic and real-world datasets validate the proposed framework.
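The abstract describes the mechanism only at a high level. As a minimal illustrative sketch of the underlying MDL intuition, the Python code below relabels a cluster member as noise whenever a uniform background model encodes it more cheaply than its cluster's Gaussian model. This is a toy under stated assumptions, not the paper's actual RIC algorithm or its coding scheme: the function names (gaussian_code_length, uniform_code_length, purify), the Gaussian and uniform cost models, and the thresholding rule are all hypothetical choices made here for illustration.

import numpy as np

def gaussian_code_length(points, mean, cov, eps=1e-9):
    # Bits needed to encode each point under a Gaussian cluster model
    # (negative log2-likelihood), used here as a stand-in MDL coding cost.
    d = points.shape[1]
    cov = cov + eps * np.eye(d)            # regularize for numerical stability
    inv = np.linalg.inv(cov)
    logdet = np.linalg.slogdet(cov)[1]
    diff = points - mean
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)   # Mahalanobis terms
    nll = 0.5 * (quad + logdet + d * np.log(2 * np.pi))
    return nll / np.log(2)                 # convert nats to bits

def uniform_code_length(points, lo, hi):
    # Bits per point under a uniform "noise" model over the data bounding box.
    bits = np.sum(np.log2(np.maximum(hi - lo, 1e-12)))
    return np.full(len(points), bits)

def purify(points, labels):
    # Toy purification step: relabel a cluster member as noise (-1) whenever
    # the uniform noise model encodes it more cheaply than its cluster's Gaussian.
    lo, hi = points.min(axis=0), points.max(axis=0)
    new_labels = labels.copy()
    for k in np.unique(labels):
        if k == -1:
            continue                       # already marked as noise
        idx = np.where(labels == k)[0]
        members = points[idx]
        if len(members) <= points.shape[1]:
            continue                       # too few points to fit a Gaussian
        mean = members.mean(axis=0)
        cov = np.cov(members, rowvar=False)
        cluster_bits = gaussian_code_length(members, mean, cov)
        noise_bits = uniform_code_length(members, lo, hi)
        new_labels[idx[cluster_bits > noise_bits]] = -1
    return new_labels

The labels argument could come from any base method (for example, a plain K-means run), which mirrors the abstract's point that this kind of MDL-based purification can be layered on top of an initial coarse clustering.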

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 1, Issue 3
December 2007
145 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/1297332

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Clustering
  2. data summarization
  3. noise robustness
  4. parameter-free data mining

Cited By

  • (2024) Dynamic Multi-Network Mining of Tensor Time Series. Proceedings of the ACM Web Conference 2024, 4117-4127. DOI: 10.1145/3589334.3645461
  • (2024) Robust clustering algorithm: The use of soft trimming approach. Pattern Recognition Letters. DOI: 10.1016/j.patrec.2024.06.032
  • (2023) Fast and Multi-aspect Mining of Complex Time-stamped Event Streams. Proceedings of the ACM Web Conference 2023, 1638-1649. DOI: 10.1145/3543507.3583370
  • (2022) Simple epidemic models with segmentation can be better than complex ones. PLOS ONE 17(1), e0262244. DOI: 10.1371/journal.pone.0262244
  • (2020) The Data Mining Group at University of Vienna. Datenbank-Spektrum 20(1), 71-79. DOI: 10.1007/s13222-020-00337-9
  • (2019) Multi-aspect Mining of Complex Sensor Sequences. 2019 IEEE International Conference on Data Mining (ICDM), 299-308. DOI: 10.1109/ICDM.2019.00040
  • (2019) k Is the Magic Number: Inferring the Number of Clusters Through Nonparametric Concentration Inequalities. Machine Learning and Knowledge Discovery in Databases, 257-273. DOI: 10.1007/978-3-030-46150-8_16
  • (2018) Modeling online user behaviors with competitive interactions. Information & Management. DOI: 10.1016/j.im.2018.09.007
  • (2017) Rumors Spread Slowly in a Small-World Spatial Network. SIAM Journal on Discrete Mathematics 31(4), 2414-2428. DOI: 10.1137/16M1083256
  • (2017) The dynamics of health sentiments with competitive interactions in social media. 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), 101-106. DOI: 10.1109/ISI.2017.8004882
