Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

Improved initialisation of model-based clustering using Gaussian hierarchical partitions

  • Regular Article
  • Published:
Advances in Data Analysis and Classification Aims and scope Submit manuscript

Abstract

Initialisation of the EM algorithm in model-based clustering is often crucial. Various starting points in the parameter space often lead to different local maxima of the likelihood function and, so to different clustering partitions. Among the several approaches available in the literature, model-based agglomerative hierarchical clustering is used to provide initial partitions in the popular mclust R package. This choice is computationally convenient and often yields good clustering partitions. However, in certain circumstances, poor initial partitions may cause the EM algorithm to converge to a local maximum of the likelihood function. We propose several simple and fast refinements based on data transformations and illustrate them through data examples.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2

Similar content being viewed by others

References

  • Auder B, Lebret R, Lovleff S, Langrognet F (2014) Rmixmod: an interface for MIXMOD. http://CRAN.R-project.org/package=Rmixmod, R package version 2.0.2

  • Banfield J, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821

    Article  MATH  MathSciNet  Google Scholar 

  • Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725

    Article  Google Scholar 

  • Biernacki C, Celeux G, Govaert G (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Comput Stat Data Anal 41(3):561–575

    Article  MATH  MathSciNet  Google Scholar 

  • Biernacki C, Celeux G, Govaert G, Langrognet F (2006) Model-based cluster and discriminant analysis with the MIXMOD software. Comput Stat Data Anal 51:587–600

    Article  MATH  MathSciNet  Google Scholar 

  • Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recognit 28:781–793

    Article  Google Scholar 

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Series B Stat Methodol 39:1–38

    MATH  MathSciNet  Google Scholar 

  • Everitt B, Landau S, Leese M, Stahl D (2011) Cluster analysis, 5th edn. Wiley, Chichester, UK

    Book  MATH  Google Scholar 

  • Flury B (1997) A first course in multivariate statistics. Springer, New York

    Book  MATH  Google Scholar 

  • Forina M, Armanino C, Castino M, Ubigli M (1986) Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25:189–201

    Google Scholar 

  • Fraley C (1998) Algorithms for model-based Gaussian hierarchical clustering. SIAM J Sci Compu 20(1):270–281

    Article  MATH  MathSciNet  Google Scholar 

  • Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41:578–588

    Article  MATH  Google Scholar 

  • Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631

    Article  MATH  MathSciNet  Google Scholar 

  • Fraley C, Raftery AE, Murphy TB, Scrucca L (2012) MCLUST version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report 597, Department of Statistics, University of Washington

  • Fraley C, Raftery AE, Scrucca L (2015) mclust: normal mixture modelling for model-based clustering, classification, and density estimation. http://CRAN.R-project.org/package=mclust, R package version 5.0.1

  • Gordon AD (1999) Classification, 2nd edn. Chapman & Hall/CRC

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218

    Article  Google Scholar 

  • Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Inc

  • Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, UK

  • Maitra R (2009) Initializing partition-optimization algorithms. IEEE/ACM Trans Comput Biol Bioinform 6(1):144–157

    Article  Google Scholar 

  • McLachlan G, Krishnan T (2008) The EM algorithm and extensions, 2nd edn. Wiley-Interscience, Hoboken, New Jersey

    Book  MATH  Google Scholar 

  • McLachlan G, Peel D (2000) Finite mixture models. Wiley, New York

    Book  MATH  Google Scholar 

  • McLachlan GJ (1988) On the choice of starting values for the EM algorithm in fitting mixture models. Statistician 37(4/5):417

    Article  MathSciNet  Google Scholar 

  • McNicholas PD, ElSherbiny A, McDaid AF, Murphy TB (2015) pgmm: Parsimonious Gaussian Mixture Models. http://CRAN.R-project.org/package=pgmm, R package version 1.2

  • Melnykov V, Maitra R (2010) Finite mixture models and model-based clustering. Stat Surv 4:80–116

    Article  MATH  MathSciNet  Google Scholar 

  • Melnykov V, Melnykov I (2012) Initializing the EM algorithm in Gaussian mixture models with an unknown number of components. Comput Stat Data Anal 56(6):1381–1395

    Article  MATH  MathSciNet  Google Scholar 

  • Milligan GW, Cooper MC (1986) A study of the comparability of external criteria for hierarchical cluster analysis. Multivar Behav Res 21(4):441–458

    Article  Google Scholar 

  • Raftery AE, Dean N (2006) Variable selection for model-based clustering. J Am Stat Assoc 101(473):168–178

    Article  MATH  MathSciNet  Google Scholar 

  • Schwartz G (1978) Estimating the dimension of a model. Ann Stat 6:31–38

    Google Scholar 

  • Wu CJ (1983) On the convergence properties of the EM algorithm. Ann Stat 11(1):95–103

    Article  MATH  Google Scholar 

Download references

Acknowledgments

The authors are grateful to the Coordinating Editor and two referees for their very helpful comments. This work was supported by NIH Grants R01-HD054511, R01-HD070936 and U54-HL127624, and by Science Foundation Ireland Walton Research Fellowship Number 11/W.1/I2079.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Luca Scrucca.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Scrucca, L., Raftery, A.E. Improved initialisation of model-based clustering using Gaussian hierarchical partitions. Adv Data Anal Classif 9, 447–460 (2015). https://doi.org/10.1007/s11634-015-0220-z

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11634-015-0220-z

Keywords

Mathematics Subject Classification