On Clustering Validation Techniques

Authors:

Yannis Batistakis,

Michalis Vazirgiannis

Journal of Intelligent Information Systems, Volume 17, Issue 2-3

Pages 107 - 145

https://doi.org/10.1023/A:1012801612483

Published: 02 December 2001

Abstract

Cluster analysis aims to identify groups of similar objects and thereby helps to discover the distribution of patterns and interesting correlations in large data sets. It has been the subject of wide research because it arises in many application domains in engineering, business, and the social sciences. In recent years especially, the availability of huge transactional and experimental data sets, together with the emerging requirements of data mining, has created a need for clustering algorithms that scale and can be applied in diverse domains.

This paper introduces the fundamental concepts of clustering and surveys the widely known clustering algorithms in a comparative way. It also addresses an important issue in the clustering process: assessing the quality of clustering results, which is also related to the inherent features of the data set under consideration. A review of the clustering validity measures and approaches available in the literature is presented. Finally, the paper highlights the issues that recent algorithms under-address and outlines trends in the clustering process.

On Clustering Validation Techniques
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
  2. Information systems applications
    1. Data mining

Published In

Journal of Intelligent Information Systems Volume 17, Issue 2-3

December 2001

217 pages

ISSN:0925-9902

Copyright © 2001 Kluwer Academic Publishers.

Publisher

Kluwer Academic Publishers

United States

Cited By

Poulakis, Y., Doulkeridis, C., and Kyriazis, D. (2024). A Survey on AutoML Methods and Systems for Clustering. ACM Transactions on Knowledge Discovery from Data, 18(5), 1-30. Online publication date: 26-Jan-2024.
https://dl.acm.org/doi/10.1145/3643564
Silva, D., Carvalho, D., and Silla, C. (2024). A Clustering-Based Computational Model to Group Students With Similar Programming Skills From Automatic Source Code Analysis Using Novel Features. IEEE Transactions on Learning Technologies, 17, 428-444. Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.1109/TLT.2023.3273926
Tu, H., Ding, S., Xu, X., Hou, H., Li, C., and Ding, L. (2024). Non-iterative border-peeling clustering algorithm based on swap strategy. Information Sciences: an International Journal, 654(C). Online publication date: 1-Jan-2024.
https://dl.acm.org/doi/10.1016/j.ins.2023.119864
Arango-Abella, M. and Figueroa-García, J. (2024). Classification of Users of a Health Service Provider Using Unsupervised Machine Learning Methods. SN Computer Science, 5(5). Online publication date: 11-May-2024.
https://dl.acm.org/doi/10.1007/s42979-024-02685-9
Jafseer, K., Shailesh, S., and Sreekumar, A. (2024). CPOCEDS-concept preserving online clustering for evolving data streams. Cluster Computing, 27(3), 2983-2998. Online publication date: 1-Jun-2024.
https://dl.acm.org/doi/10.1007/s10586-023-04121-8
Domínguez-Álvarez, D., de la Cruz, A., Gorla, A., Caballero, J., Chandra, S., Blincoe, K., and Tonella, P. (2023). LibKit: Detecting Third-Party Libraries in iOS Apps. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1407-1418. Online publication date: 30-Nov-2023.
https://dl.acm.org/doi/10.1145/3611643.3616344
de Wet, R. and Engelbrecht, A. (2023). Set-based Particle Swarm Optimization for Data Clustering: Comparison and Analysis of Control Parameters. Proceedings of the 2023 7th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence, 103-110. Online publication date: 23-Apr-2023.
https://dl.acm.org/doi/10.1145/3596947.3596956
Ding, W., Li, W., Zhang, Z., Wan, C., Duan, J., and Lu, S. (2023). Time-Varying Gaussian Markov Random Fields Learning for Multivariate Time Series Clustering. IEEE Transactions on Knowledge and Data Engineering, 35(11), 11950-11966. Online publication date: 1-Nov-2023.
https://dl.acm.org/doi/10.1109/TKDE.2022.3232331
Ren, Z., Wang, S., and Zhang, Y. (2023). Weakly supervised machine learning. CAAI Transactions on Intelligence Technology, 8(3), 549-580. Online publication date: 28-Apr-2023.
https://dl.acm.org/doi/10.1049/cit2.12216
Díez-Sanmartín, C., Cabezuelo, A., and Belmonte, A. (2023). A new approach to predicting mortality in dialysis patients using sociodemographic features based on artificial intelligence. Artificial Intelligence in Medicine, 136(C). Online publication date: 1-Feb-2023.
https://dl.acm.org/doi/10.1016/j.artmed.2022.102478
