
Parsimonious mixtures of multivariate contaminated normal distributions. (English) Zbl 1353.62124

Summary: A mixture of multivariate contaminated normal distributions is developed for model-based clustering. In addition to the parameters of the classical normal mixture, our contaminated mixture has, for each cluster, a parameter controlling the proportion of mild outliers and one specifying the degree of contamination. Crucially, these parameters do not have to be specified a priori, adding flexibility to our approach. Parsimony is introduced via eigen-decomposition of the component covariance matrices, and sufficient conditions for the identifiability of all members of the resulting family are provided. An expectation-conditional maximization (ECM) algorithm is outlined for parameter estimation, and various implementation issues are discussed. Using a large-scale simulation study, the behavior of the proposed approach is investigated and a comparison with well-established finite mixtures is provided. The performance of this novel family of models is also illustrated on artificial and real data.
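As a brief sketch of the model the summary describes (notation chosen here for illustration, not taken verbatim from the paper): the density of the multivariate contaminated normal distribution for cluster \(g\) is a two-component scale mixture of normals with a common mean,
\[
f(\mathbf{x};\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g,\alpha_g,\eta_g)=\alpha_g\,\phi(\mathbf{x};\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)+(1-\alpha_g)\,\phi(\mathbf{x};\boldsymbol{\mu}_g,\eta_g\boldsymbol{\Sigma}_g),
\]
where \(\phi\) denotes the multivariate normal density, \(\alpha_g\in(0.5,1)\) plays the role of the proportion of good (non-outlying) observations, and \(\eta_g>1\) the degree of contamination, i.e. the inflation applied to \(\boldsymbol{\Sigma}_g\) for mild outliers. Clustering is then based on the \(G\)-component mixture \(\sum_{g=1}^{G}\pi_g\,f(\mathbf{x};\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g,\alpha_g,\eta_g)\), with \(\pi_g>0\) and \(\sum_{g=1}^{G}\pi_g=1\); parsimony is obtained by constraining the factors of the eigen-decomposition \(\boldsymbol{\Sigma}_g=\lambda_g\mathbf{D}_g\mathbf{A}_g\mathbf{D}_g^{\top}\) (volume \(\lambda_g\), orientation \(\mathbf{D}_g\), shape \(\mathbf{A}_g\)) to be equal or not across clusters, in the spirit of [8] and [27].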

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F10 Point estimation
62-07 Data analysis (statistics) (MSC2010)

References:

[1] Aggarwal, C. C. (2013). Outlier Analysis. Springer, New York, NY. · Zbl 1291.68004
[2] Aitken, A. (1926). A series formula for the roots of algebraic and transcendental equations. Proceedings of the Royal Society of Edinburgh 45, 14-22. · JFM 51.0096.03
[3] Aitkin, M. and Wilson, G. T. (1980). Mixture models, outliers, and the EM algorithm. Technometrics22, 325-331. · Zbl 0466.62034
[4] Andrews, J. L. and McNicholas, P. D. (2012). Model‐based clustering, classification, and discriminant analysis with the multivariate t‐distribution: the tEIGEN family. Statistics and Computing22, 1021-1029. · Zbl 1252.62062
[5] Andrews, J. L., Wickins, J. R., Boers, N. M. and McNicholas, P. D. (2015). teigen: model‐based clustering and classification with the multivariate t distribution. Version 2.1.0 (2015‐11‐20). URL http://CRAN.R‐project.org/package=teigen
[6] Bagnato, L. and Punzo, A. (2013). Finite mixtures of unimodal beta and gamma densities and the k‐bumps algorithm. Computational Statistics28, 1571-1597. · Zbl 1306.65024
[7] Bai, X., Yao, W. and Boyer, J. E. (2012). Robust fitting of mixture regression models. Computational Statistics and Data Analysis56, 2347-2359. · Zbl 1252.62011
[8] Banfield, J. D. and Raftery, A. E. (1993). Model‐based Gaussian and non‐Gaussian clustering. Biometrics49, 803-821. · Zbl 0794.62034
[9] Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data. Wiley Series in Probability & Statistics. Chichester, UK: John Wiley & Sons. · Zbl 0801.62001
[10] Becker, C. and Gather, U. (1999). The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association94, 947-955. · Zbl 1072.62600
[11] Berkane, M. and Bentler, P. M. (1988). Estimation of contamination parameters and identification of outliers in multivariate data. Sociological Methods and Research17, 55-64.
[12] Biernacki, C. (2004). An asymptotic upper bound of the likelihood to prevent Gaussian mixtures from degenerating. Tech. rep., Université de Franche‐Comté, Besançon, FR.
[13] Biernacki, C., Celeux, G. and Govaert, G. (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics and Data Analysis41, 561-575. · Zbl 1429.62235
[14] Biernacki, C., Celeux, G., Govaert, G., Langrognet, F., Noulin, G. and Vernaz, Y. (2008). \( \mathsf{MIXMOD} \) ‐ Statistical documentation. Downloadable from http://www.mixmod.org/IMG/pdf/statdoc_2_1_1.pdf.
[15] Biernacki, C. and Chrétien, S. (2003). Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures with EM. Statistics and Probability Letters61, 373-382. · Zbl 1038.62023
[16] Bock, H. H. (2002). Clustering methods: from classical models to new approaches. Statistics in Transition5, 725-758.
[17] Böhning, D. (2000). Computer‐assisted analysis of mixtures and applications: meta‐analysis, disease mapping and others. Vol. 81 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, London, UK. · Zbl 0951.62088
[18] Böhning, D., Dietz, E., Schaub, R., Schlattmann, P. and Lindsay, B. (1994). The distribution of the likelihood ratio for mixtures of densities from the one‐parameter exponential family. Annals of the Institute of Statistical Mathematics46, 373-388. · Zbl 0802.62017
[19] Böhning, D. and Ruangroj, R. (2002). A note on the maximum deviation of the scale‐contaminated normal to the best normal distribution. Metrika55, 177-182. · Zbl 1320.62035
[20] Browne, R. P. and McNicholas, P. D. (2014). Estimating common principal components in high dimensions. Advances in Data Analysis and Classification8, 217-226. · Zbl 1474.62183
[21] Browne, R. P. and McNicholas, P. D. (2015). mixture: mixture models for clustering and classification. Version 1.4 (2015‐03‐10). URL http://CRAN.R‐project.org/package=mixture
[22] Browne, R. P., McNicholas, P. D. and Sparling, M. D. (2012). Model‐based learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence34, 814-817.
[23] Browne, R. P., Subedi, S. and McNicholas, P. D. (2013). Constrained optimization for a subset of the Gaussian parsimonious clustering models. arXiv.org e‐print 1306.5824, available at: http://arxiv.org/abs/1306.5824.
[24] Byers, S. and Raftery, A. E. (1998). Nearest‐neighbor clutter removal for estimating features in spatial point processes. Journal of the American Statistical Association93, 577-584. · Zbl 0926.62089
[25] Campbell, N. A. (1984). Mixture models and atypical values. Mathematical Geology16, 465-477.
[26] Campbell, N. A. and Mahon, R. J. (1974). A multivariate study of variation in two species of rock crab of the genus Leptograpsus. Australian Journal of Zoology22, 417-425.
[27] Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering models. Pattern Recognition28, 781-793.
[28] Celeux, G., Hurn, M. and Robert, C. P. (2000). Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association95, 957-970. · Zbl 0999.62020
[29] Coretto, P. and Hennig, C. (2011). Maximum likelihood estimation of heterogeneous mixtures of Gaussian and uniform distributions. Journal of Statistical Planning and Inference141, 462-473. · Zbl 1203.62017
[30] Coretto, P. and Hennig, C. (2015). Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. arXiv.org e‐print 1406.0808, available at: http://arxiv.org/abs/1406.0808.
[31] Crawford, S. L. (1994). An application of the Laplace method to finite mixture distributions. Journal of the American Statistical Association89, 259-267. · Zbl 0795.62022
[32] Cuesta‐Albertos, J. A., Gordaliza, A. and Matrán, C. (1997). Trimmed k‐means: An attempt to robustify quantizers. The Annals of Statistics25, 553-576. · Zbl 0878.62045
[33] Davies, L. and Gather, U. (1993). The identification of multiple outliers. Journal of the American Statistical Association88, 782-792. · Zbl 0797.62025
[34] De Veaux, R. D. and Krieger, A. M. (1990). Robust estimation of a normal mixture. Statistics and Probability Letters10, 1-7.
[35] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B39, 1-38. · Zbl 0364.62022
[36] Di Zio, M., Guarnera, U. and Rocci, R. (2007). A mixture of mixture models for a classification problem: the unity measure error. Computational Statistics and Data Analysis51, 2573-2585. · Zbl 1161.62373
[37] Flury, B. N. and Gautschi, W. (1986). An algorithm for simultaneous orthogonal transformation of several positive definite matrices to nearly diagonal form. SIAM Journal on Scientific and Statistical Computing7, 169-184. · Zbl 0614.65043
[38] Forina, M., Leardi, R., Armanino, C. and Lanteri, S. (1998). PARVUS: an extendible package for data exploration, classification and correlation. Tech. rep., Institute of Pharmaceutical and Food Analysis and Technologies, Genoa, IT.
[39] Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model‐based cluster analysis. Computer Journal41, 578-588. · Zbl 0920.68038
[40] Fraley, C., Raftery, A. E., Murphy, T. B. and Scrucca, L. (2012). mclust version 4 for R: normal mixture modeling for model‐based clustering, classification, and density estimation. Technical report 597, Department of Statistics, University of Washington, Seattle, WA.
[41] Fraley, C., Raftery, A. E., Scrucca, L., Murphy, T. B. and Fop, M. (2015). mclust: normal mixture modelling for model‐based clustering, classification, and density estimation. Version 5.1 (2015‐10‐27). URL http://CRAN.R‐project.org/package=mclust
[42] Gallegos, M. T. and Ritter, G. (2005). A robust method for cluster analysis. The Annals of Statistics33, 347-380. · Zbl 1064.62074
[43] Gallegos, M. T. and Ritter, G. (2009). Trimmed ML estimation of contaminated mixtures. Sankhyā: The Indian Journal of Statistics, Series A71, 164-220. · Zbl 1193.62021
[44] García‐Escudero, L. A. and Gordaliza, A. (1999). Robustness properties of k means and trimmed k means. Journal of the American Statistical Association94, 956-969. · Zbl 1072.62547
[45] García‐Escudero, L. A., Gordaliza, A. and Matrán, C. (2003). Trimming tools in exploratory data analysis. Journal of Computational and Graphical Statistics12, 434-449.
[46] García‐Escudero, L. A., Gordaliza, A., Matrán, C. and Mayo‐Iscar, A. (2008). A general trimming approach to robust cluster analysis. The Annals of Statistics36, 1324-1345. · Zbl 1360.62328
[47] García‐Escudero, L. A., Gordaliza, A., Matrán, C. and Mayo‐Iscar, A. (2010). A review of robust clustering methods. Advances in Data Analysis and Classification4, 89-109. · Zbl 1284.62375
[48] Gerogiannis, D., Nikou, C. and Likas, A. (2009). The mixtures of Student’s t‐distributions as a robust framework for rigid registration. Image and Vision Computing27, 1285-1294.
[49] Hartigan, J. A. (1985). Statistical theory in clustering. Journal of Classification2, 63-76. · Zbl 0575.62058
[50] Hastie, T. and Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society: Series B58, 155-176. · Zbl 0850.62476
[51] Hathaway, R. J. (1986). A constrained EM algorithm for univariate normal mixtures. Journal of Statistical Computation and Simulation23, 211-230.
[52] Hawkins, D. (2013). Identification of Outliers. Monographs on Statistics and Applied Probability. Springer, The Netherlands.
[53] Hennig, C. (2002). Fixed point clusters for linear regression: computation and comparison. Journal of Classification19, 249-276. · Zbl 1017.62057
[54] Hennig, C. (2004). Breakdown points for maximum likelihood estimators of location‐scale mixtures. The Annals of Statistics32, 1313-1340. · Zbl 1047.62063
[55] Hennig, C. and Hausdorf, B. (2015). prabclus: functions for clustering of presence‐absence, abundance and multilocus genetic data. Version 2.2‐6 (2015‐01‐14). URL http://CRAN.R‐project.org/package=prabclus
[56] Holzmann, H., Munk, A. and Gneiting, T. (2006). Identifiability of finite mixtures of elliptical distributions. Scandinavian Journal of Statistics33, 753-763. · Zbl 1164.62354
[57] Hunter, D. R. and Lange, K. (2000). Rejoinder to discussion of "Optimization transfer using surrogate objective functions". Journal of Computational and Graphical Statistics9, 52-59.
[58] Hurley, C. (2004). Clustering visualizations of multivariate data. Journal of Computational and Graphical Statistics13, 788-806.
[59] Ingrassia, S. (2004). A likelihood‐based constrained algorithm for multivariate normal mixture models. Statistical Methods and Applications13, 151-166. · Zbl 1205.62066
[60] Ingrassia, S. and Rocci, R. (2007). Constrained monotone EM algorithms for finite mixture of multivariate Gaussians. Computational Statistics and Data Analysis51, 5339-5351. · Zbl 1445.62116
[61] Ingrassia, S. and Rocci, R. (2011). Degeneracy of the EM algorithm for the MLE of multivariate Gaussian mixtures and dynamic constraints. Computational Statistics and Data Analysis55, 1715-1725. · Zbl 1328.65030
[62] Karlis, D. and Xekalaki, E. (2003). Choosing initial values for the EM algorithm for finite mixtures. Computational Statistics and Data Analysis41, 577-590. · Zbl 1429.62082
[63] Lebret, R., Iovleff, S., Langrognet, F., Biernacki, C., Celeux, G. and Govaert, G. (2012). Rmixmod: The R Package of the Model‐Based Unsupervised, Supervised and Semi‐Supervised Classification Mixmod Library.
[64] Li, J. (2005). Clustering based on a multi‐layer mixture model. Journal of Computational and Graphical Statistics14, 547-568.
[65] Little, R. J. A. (1988). Robust estimation of the mean and covariance matrix from data with missing values. Applied Statistics37, 23-38. · Zbl 0647.62040
[66] Lo, Y. (2005). Likelihood ratio tests of the number of components in a normal mixture with unequal variances. Statistics and Probability Letters71, 225-235. · Zbl 1065.62024
[67] Lo, Y. (2008). A likelihood ratio test of a homoscedastic normal mixture against a heteroscedastic normal mixture. Statistics and Computing18, 233-240.
[68] Lo, Y., Mendell, N. R. and Rubin, D. B. (2001). Testing the number of components in a normal mixture. Biometrika88, 767-778. · Zbl 0985.62019
[69] Markatou, M. (2000). Mixture models, robustness, and the weighted likelihood methodology. Biometrics56, 483-486. · Zbl 1060.62511
[70] McLachlan, G. and Krishnan, T. (2007). The EM algorithm and extensions (2nd edn.). Vol. 382 of Wiley Series in Probability and Statistics. John Wiley & Sons, New York.
[71] McLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, NY. · Zbl 0697.62050
[72] McLachlan, G. J. and Peel, D. (1998). Robust cluster analysis via mixtures of multivariate t‐distributions. In: Amin, A., Dori, D., Pudil, P. and Freeman, H. (Eds.), Advances in Pattern Recognition. Vol. 1451 of Lecture Notes in Computer Science. Springer, Berlin‐Heidelberg, pp. 658-666.
[73] McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. John Wiley & Sons, New York, NY. · Zbl 0963.62061
[74] McNicholas, P. D. (2010). Model‐based classification using latent Gaussian mixture models. Journal of Statistical Planning and Inference140, 1175-1181. · Zbl 1181.62095
[75] McNicholas, P. D. (2016). Mixture Model‐Based Classification. Chapman & Hall/CRC Press, Boca Raton, FL.
[76] McNicholas, P. D., Murphy, T. B., McDaid, A. F. and Frost, D. (2010). Serial and parallel implementations of model‐based clustering via parsimonious Gaussian mixture models. Computational Statistics and Data Analysis54, 711-723. · Zbl 1464.62131
[77] Meng, X.‐L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika80, 267-278. · Zbl 0778.62022
[78] Peel, D. and McLachlan, G. J. (2000). Robust mixture modelling using the t distribution. Statistics and Computing10, 339-348.
[79] Punzo, A., Browne, R. P. and McNicholas, P. D. (2016). Hypothesis testing for mixture model selection. Journal of Statistical Computation and Simulation 86, 2797-2818. · Zbl 07184768
[80] Punzo, A., Mazza, A. and McNicholas, P. D. (2015). ContaminatedMixt: model‐based clustering and classification with the multivariate contaminated normal distribution. Version 1.0 (2015‐12‐20). URL http://CRAN.R‐project.org/package=ContaminatedMixt.
[81] R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R‐project.org/.
[82] Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology25, 111-164.
[83] Ritter, G. (2015). Robust Cluster Analysis and Variable Selection. Vol. 137 of Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press: Boca Raton, FL. · Zbl 1341.62037
[84] Ruwet, C., García‐Escudero, L. A., Gordaliza, A. and Mayo‐Iscar, A. (2012). The influence function of the tclust robust clustering procedure. Advances in Data Analysis and Classification6, 107-130. · Zbl 1255.62182
[85] Ruwet, C., García‐Escudero, L. A., Gordaliza, A. and Mayo‐Iscar, A. (2013). On the breakdown behavior of the tclust clustering procedure. Test22, 466-487. · Zbl 1273.62146
[86] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics6, 461-464. · Zbl 0379.62005
[87] Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society. Series B: Statistical Methodology62, 795-809. · Zbl 0957.62020
[88] Teicher, H. (1963). Identifiability of finite mixtures. Annals of Mathematical Statistics34, 1265-1269. · Zbl 0137.12704
[89] Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In: Olkin, I. (Ed.), Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Stanford Studies in Mathematics and Statistics. Stanford University Press, CA, Ch. 39, pp. 448-485. · Zbl 0201.52803
[90] Verdinelli, I. and Wasserman, L. (1991). Bayesian analysis of outlier problems using the Gibbs sampler. Statistics and Computing1, 105-117.
[91] Wolfe, J. H. (1965). A computer program for the maximum likelihood analysis of types. Technical Bulletin 65-15, U.S. Naval Personnel Research Activity.
[92] Yakowitz, S. J. and Spragins, J. D. (1968). On the identifiability of finite mixtures. The Annals of Mathematical Statistics39, 209-214. · Zbl 0155.25703
[93] Yao, W. (2012). Model based labeling for mixture models. Statistics and Computing22, 337-347. · Zbl 1322.62047
[94] Yao, W., Wei, Y. and Yu, C. (2014). Robust mixture regression using the t‐distribution. Computational Statistics and Data Analysis71, 116-127. · Zbl 1471.62227