Abstract
Clustering methods often come down to optimizing a numeric criterion defined from a distance or a dissimilarity measure. It can be shown that this problem is often equivalent to estimating the parameters of a probabilistic model under the classification likelihood approach. For instance, the inertia criterion optimized by the k-means algorithm corresponds to the hypothesis of a population arising from a Gaussian mixture. In this paper, we propose a mixture model adapted to categorical data. Using the classification likelihood approach, we develop the Classification EM (CEM) algorithm to estimate the parameters of the mixture model. With our probabilistic model, the data are not distorted and the estimated parameters readily indicate the characteristics of the clusters. This probabilistic approach gives an interpretation of the criterion optimized by the k-modes algorithm, an extension of k-means to categorical attributes, and allows us to study the behavior of this algorithm.
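As an illustration of the criterion mentioned above, k-modes is usually described as minimizing the total number of attribute mismatches between each object and the mode of its cluster. The following is a minimal sketch under that assumption, not the authors' implementation: the function names (`matching_distance`, `k_modes`), the caller-supplied starting modes, and the toy data are all hypothetical choices made for the example.

```python
from collections import Counter


def matching_distance(x, mode):
    """Simple matching dissimilarity: number of attributes where x and mode differ."""
    return sum(a != b for a, b in zip(x, mode))


def k_modes(rows, init_modes, max_iter=100):
    """Alternate assignment and mode-update steps until the partition is stable.

    rows: list of equal-length tuples of categorical values.
    init_modes: list of k starting modes (assumption: supplied by the caller).
    Returns (labels, modes, criterion).
    """
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its closest mode (ties go to the lowest index).
        new_labels = [
            min(range(len(modes)), key=lambda k: matching_distance(x, modes[k]))
            for x in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each mode takes the most frequent category of every attribute.
        for k in range(len(modes)):
            members = [x for x, z in zip(rows, labels) if z == k]
            if members:  # an empty cluster keeps its previous mode
                modes[k] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    # Criterion: total number of mismatches between rows and their cluster modes.
    criterion = sum(matching_distance(x, modes[z]) for x, z in zip(rows, labels))
    return labels, modes, criterion


# Toy data: six objects described by two categorical attributes.
rows = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("c", "z")]
labels, modes, criterion = k_modes(rows, init_modes=[("a", "x"), ("b", "z")])
print(labels, modes, criterion)
```

The sketch only shows the hard-assignment criterion; the mixture model and CEM estimation steps are developed in the paper itself and are not reproduced here.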
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jollois, FX., Nadif, M. (2002). Clustering Large Categorical Data. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_25
DOI: https://doi.org/10.1007/3-540-47887-6_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive