
Conditional mixture modelling for heavy-tailed and skewed data. (English) Zbl 07858732

Summary: Overparameterization is a serious concern for multivariate mixture models, as it can lead to model overfitting and, as a result, mixture order underestimation. Parsimonious modelling is one of the most effective remedies in this context. In Gaussian mixture models, the majority of parameters are associated with covariance matrices, and parsimonious models based on factor analysers and on the spectral decomposition of dispersion parameters are the most popular in the literature. Drawbacks of these models include a lack of flexibility in imposing different covariance structures on individual components and limitations in modelling compact clusters. Recently introduced conditional mixture models provide substantial flexibility in addressing these concerns. The components of such mixtures are formulated as a product of conditional distributions, with univariate Gaussian densities being the primary choice. However, the presence of heavy tails or skewness in any dimension can lead to fitting problems. We propose a flexible model that is free of the above-mentioned limitations and name it the contaminated transformation conditional mixture model. A series of simulation studies demonstrates that it can effectively account for skewness and heavy tails. Applications to real-life data sets show good results and highlight the promise of the proposed model.
© 2023 John Wiley & Sons, Ltd.
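
The product-of-conditionals formulation mentioned in the summary can be illustrated with a minimal sketch. It assumes, hypothetically, that each coordinate is Gaussian given the preceding coordinates, with a conditional mean that is linear in them; the summary does not spell out this parameterization, and the function names and argument layout below are illustrative only. The contamination and transformation parts of the proposed model are not covered.

```python
import math

def normal_logpdf(x, mean, var):
    """Log-density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def component_logpdf(x, intercepts, slopes, variances):
    """Log-density of one component written as a product of conditionals.

    Assumed form (an illustration, not taken from the paper):
        x_j | x_1..x_{j-1} ~ N(intercepts[j] + sum_l slopes[j][l] * x_l, variances[j])
    where slopes[j] holds j coefficients (empty for the first coordinate).
    """
    total = 0.0
    for j, xj in enumerate(x):
        mean = intercepts[j] + sum(b * xl for b, xl in zip(slopes[j], x[:j]))
        total += normal_logpdf(xj, mean, variances[j])
    return total

def mixture_logpdf(x, weights, components):
    """Log-density of the mixture, using log-sum-exp for numerical stability."""
    logs = [math.log(w) + component_logpdf(x, *params)
            for w, params in zip(weights, components)]
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))

# Usage: a two-component mixture in two dimensions with made-up parameters.
comp_a = ([0.0, 1.0], [[], [0.5]], [1.0, 2.0])
comp_b = ([3.0, -1.0], [[], [-0.2]], [0.5, 0.5])
value = mixture_logpdf([0.3, 1.2], [0.6, 0.4], [comp_a, comp_b])
```

Under this assumed form, setting a slope to zero removes one conditional dependence, so each component can carry its own covariance structure, which is the kind of flexibility the summary refers to.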

MSC:

62-XX Statistics
