Abstract
Distribution mixtures with product components have been applied repeatedly to determine clusters in multivariate data. Unfortunately, for categorical variables the mixture parameters are not uniquely identifiable, and therefore the result of cluster analysis may become questionable. We give a simple proof that any non-degenerate discrete product mixture can be equivalently described by infinitely many different parameter sets. Nevertheless, a unique result of cluster analysis can be guaranteed by additional constraints. We propose a heuristic method of sequential estimation of components to guarantee a unique identification of clusters by means of the EM algorithm. The application of the method is illustrated by a numerical example.
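For orientation, the following is a minimal sketch of the standard EM iteration for a mixture of products of categorical distributions, the model class the abstract refers to. It is not the sequential estimation method proposed in the paper, whose details are not given here. The function name `em_categorical_mixture`, its parameters, the random Dirichlet initialization, and the small smoothing constants are all assumptions made for illustration.

```python
import numpy as np


def em_categorical_mixture(X, n_components, n_categories, n_iter=100, seed=0):
    """Plain EM for a mixture of products of categorical distributions.

    X is an integer array of shape (n_samples, n_vars) with values in
    0..n_categories-1. Returns (weights, theta, history), where
    theta[m, j, k] estimates P(x_j = k | component m) and history holds
    the log-likelihood evaluated before each update.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(n_components, 1.0 / n_components)
    # Random starting point (an assumption): one Dirichlet draw per
    # component and variable, giving shape (n_components, d, n_categories).
    theta = rng.dirichlet(np.ones(n_categories), size=(n_components, d))
    onehot = np.eye(n_categories)[X]  # shape (n, d, n_categories)
    history = []
    for _ in range(n_iter):
        # E-step: log w_m + sum_j log theta[m, j, x_j] for every sample.
        log_p = np.einsum("ndk,mdk->nm", onehot, np.log(theta)) + np.log(weights)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        history.append(log_norm.sum())
        resp = np.exp(log_p - log_norm[:, None])  # posterior component weights
        # M-step: responsibility-weighted relative frequencies.
        # The tiny constants (an assumption) only keep the logarithms finite.
        weights = resp.sum(axis=0) + 1e-12
        weights /= weights.sum()
        theta = np.einsum("nm,ndk->mdk", resp, onehot) + 1e-9
        theta /= theta.sum(axis=2, keepdims=True)
    return weights, theta, np.array(history)
```

Running this sketch from several seeds may end at different parameter sets. That is compatible with the non-identifiability discussed in the abstract, although it can also reflect ordinary local maxima, so it should not be read as a demonstration of the paper's result.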
This research was supported by the EC project no. FP6-507752 MUSCLE, by the grant No. 1ET400750407 of the Grant Agency of the Academy of Sciences CR and partially by the project MŠMT 1M0572 DAR and GAČR 402/03/1310.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Grim, J. (2006). EM Cluster Analysis for Categorical Data. In: Yeung, D.-Y., Kwok, J.T., Fred, A., Roli, F., de Ridder, D. (eds) Structural, Syntactic, and Statistical Pattern Recognition. SSPR/SPR 2006. Lecture Notes in Computer Science, vol 4109. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11815921_70
DOI: https://doi.org/10.1007/11815921_70
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37236-3
Online ISBN: 978-3-540-37241-7
eBook Packages: Computer Science, Computer Science (R0)