A hierarchical modeling approach for clustering probability density functions. (English) Zbl 1471.62034

Summary: The problem of clustering probability density functions is emerging in different scientific domains. The methods proposed for clustering probability density functions are mainly focused on univariate settings and are based on heuristic clustering solutions. New aspects of the problem associated with the multivariate setting and a model-based perspective are investigated. The novel approach relies on a hierarchical mixture modeling of the data. The method is introduced in the univariate context and then extended to multivariate densities by means of a factorial model performing dimension reduction. Model fitting is carried out using an EM algorithm. The proposed method is illustrated through simulated experiments and applied to two real data sets in order to compare its performance with alternative clustering strategies.

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)

Software:

Latent GOLD
Full Text: DOI

References:

[1] Banfield, J. D.; Raftery, A. E., Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821, (1993) · Zbl 0794.62034
[2] Bouveyron, C.; Brunet-Saulnier, C., Model-based clustering of high-dimensional data: a review, Computational Statistics and Data Analysis, 71, 52-78, (2014) · Zbl 1471.62032
[3] Bouveyron, C.; Girard, S.; Schmid, C., High-dimensional data clustering, Computational Statistics and Data Analysis, 52, 502-519, (2007) · Zbl 1452.62433
[4] Calò, D. G.; Viroli, C., A dimensionally reduced finite mixture model for multilevel data, Journal of Multivariate Analysis, 101, 2543-2553, (2010) · Zbl 1198.62063
[5] Chervoneva, I.; Zhan, T.; Iglewicz, B.; Hauck, W. W.; Birk, D. E., Two-stage hierarchical modeling for analysis of subpopulations in conditional distributions, Journal of Applied Statistics, 39, 445-460, (2012) · Zbl 1514.62223
[6] Delicado, P., Dimensionality reduction when data are density functions, Computational Statistics and Data Analysis, 55, 401-420, (2011) · Zbl 1247.62148
[7] Dempster, A. P.; Laird, N. M.; Rubin, D. B., Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society Series B, 39, 1-38, (1977), (with discussion) · Zbl 0364.62022
[8] Golyandina, N.; Pepelyshev, A.; Steland, A., New approaches to nonparametric density estimation and selection of smoothing parameters, Computational Statistics and Data Analysis, 56, 2206-2218, (2012) · Zbl 1252.62039
[9] Irpino, A.; Verde, R., Dynamic clustering of interval data using a Wasserstein-based distance, Pattern Recognition Letters, 29, 1648-1658, (2008)
[10] Lukočienė, O.; Vermunt, J. K., Determining the number of components in mixture models for hierarchical data, (Fink, A.; etal., Advances in Data Analysis, Data Handling and Business Intelligence, (2010), Springer Heidelberg), 241-249
[11] Mardia, K. V.; Kent, J. T.; Bibby, J. M., Multivariate analysis, (2003), Academic Press Oxford · Zbl 0432.62029
[12] McLachlan, G. J.; Baek, J.; Rathnayake, S. I., Mixtures of factor analyzers for the analysis of high-dimensional data, (Mengersen, K. L.; Titterington, D. M., Mixtures: Estimation and Applications, (2011), John Wiley and Sons Chichester, UK)
[13] McLachlan, G. J.; Peel, D., Finite mixture models, (2000), Wiley New York · Zbl 0963.62061
[14] Montanari, A.; Viroli, C., Heteroscedastic factor mixture analysis, Statistical Modelling, 10, 441-460, (2010) · Zbl 07256833
[15] Noirhomme-Fraiture, M.; Brito, P., Far beyond the classical data models: symbolic data analysis, Statistical Analysis and Data Mining, 4, 157-170, (2011) · Zbl 07260275
[16] Palardy, G.; Vermunt, J. K., Multilevel growth mixture models for classifying groups, Journal of Educational and Behavioral Statistics, 35, 532-565, (2010)
[17] Sakurai, Y.; Chong, R.; Lei, L.; Faloutsos, C., Efficient distribution mining and classification, (Proceedings of the SIAM International Conference on Data Mining, (2008), Atlanta, USA), 632-643
[18] Schwarz, G., Estimating the dimension of a model, Annals of Statistics, 6, 461-464, (1978) · Zbl 0379.62005
[19] Skrondal, A.; Rabe-Hesketh, S., Generalized latent variable modeling: multilevel, longitudinal and structural equation models, (2004), Chapman and Hall New York · Zbl 1097.62001
[20] Spellman, E.; Vemuri, B. C.; Rao, M., Using the KL-center for efficient and accurate retrieval of distributions arising from texture images, (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, (2005)), 111-116
[21] Tay, L.; Diener, E.; Drasgow, F.; Vermunt, J. K., Multilevel mixed-measurement IRT analysis: an explication and application to self-reported emotions across the world, Organizational Research Methods, 14, 177-207, (2011)
[22] Terada, Y.; Yadohisa, H., Non-hierarchical clustering for distribution-valued data, (Lechevallier, Y.; Saporta, G., Proceedings of COMPSTAT 2010, (2010)), 1653-1660
[23] Varriale, R.; Vermunt, J. K., Multilevel mixture factor models, Multivariate Behavioral Research, 47, 247-275, (2012)
[24] Verde, R.; Irpino, A., Comparing histogram data using a Mahalanobis-Wasserstein distance, (Brito, P., Proceedings of COMPSTAT 2008, (2008)), 77-89 · Zbl 1147.62054
[25] Vermunt, J. K., A hierarchical mixture model for clustering three-way data sets, Computational Statistics and Data Analysis, 51, 5368-5376, (2007) · Zbl 1445.62154
[26] Vermunt, J. K., Multilevel latent variable modeling: an application in educational testing, Austrian Journal of Statistics, 37, 285-299, (2008)
[27] Vermunt, J. K., Latent class and finite mixture models for multilevel data sets, Statistical Methods in Medical Research, 17, 33-51, (2008) · Zbl 1154.62086
[28] Vermunt, J. K.; Magidson, J., Hierarchical mixture models for nested data structures, (Weihs, C.; Gaul, W., Classification: The Ubiquitous Challenge, (2005), Springer Heidelberg), 176-183
[29] Vermunt, J. K.; Magidson, J., LG-syntax user’s guide: manual for Latent GOLD 4.5, syntax module, (2008), Statistical Innovations Inc. Belmont, MA
[30] Vrac, M.; Billard, L.; Diday, E.; Chédin, A., Copula analysis of mixture models, Computational Statistics, 1-31, (2011) · Zbl 1304.65087
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.