×

Bayesian nonparametric clustering as a community detection problem. (English) Zbl 1510.62284

Summary: A wide class of Bayesian nonparametric priors leads to the representation of the distribution of the observable variables as a mixture density with an infinite number of components. Such a representation induces a clustering structure in the data. However, due to label switching, cluster identification is not straightforward a posteriori and some post-processing of the MCMC output is usually required. Alternatively, observations can be mapped on a weighted undirected graph, where each node represents a sample item and edge weights are given by the posterior pairwise similarities. It is shown how, after building a particular random walk on such a graph, it is possible to apply a community detection algorithm, known as map equation, leading to the minimisation of the expected description length of the partition. A relevant feature of this method is that it allows for the quantification of the posterior uncertainty of the classification.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F15 Bayesian inference
62-08 Computational methods for problems pertaining to statistics

References:

[1] Aitkin, M., How many components in a finite mixture?, (Mengersen, K.; Robert, C.; Titterington, M., Mixtures: Estimation and Applications (2011), John Wiley & Sons Inc, Chichester), 123-144
[2] Antoniak, C., Mixtures of Dirichlet processes with applications to Bayesian non parametric problems, Ann. Statist., 2, 1152-1174 (1974) · Zbl 0335.60034
[3] Arbel, J.; Salomond, J.-B., Sequential quasi Monte Carlo for Dirichlet process mixture models, (BNP 2017 - 11th Conference on Bayesian NonParametrics (2017)), 1, URL https://hal.archives-ouvertes.fr/hal-01667781. Poster
[4] Azzalini, A.; Menardi, G., Clustering via nonparametric density estimation: The R Package pdfCluster, J. Stat. Softw., 57, 11, 1-26 (2014)
[5] Baudry, J.-P., Estimation and model selection for model-based clustering with the conditional classification likelihood, Electron. J. Stat., 9, 1, 1041-1077 (2015) · Zbl 1325.62120
[6] Biernacki, C.; Celeux, G.; Govaert, G., Assessing a mixture model for clustering with the integrated completed likelihood, IEEE Trans. Pattern Anal. Mach. Intell., 22, 7, 719-725 (2000)
[7] Binder, D. A., Bayesian Cluster analysis, Biometrika, 65, 1, 31-38 (1978) · Zbl 0376.62007
[8] Blackwell, D.; MacQueen, J. B., Ferguson Distributions Via Pólya Urn Schemes, Ann. Statist., 1, 2, 353-355 (1973) · Zbl 0276.62010
[9] Csardi, G.; Nepusz, T., The igraph software package for complex network research, InterJournal, Complex Systems, 1695 (2006), URL http://igraph.org
[10] Dahl, D. B., Modal clustering in a class of Product Partition Models, Bayesian Anal., 4, 243-264 (2009) · Zbl 1330.62248
[11] Ferguson, T. S., A Bayesian analysis of some nonparametric problems, Ann. Statist., 1, 2, 209-230 (1973) · Zbl 0255.62037
[12] Flury, B.; Riedwyl, H., Multivariate Statistics: A practical approach (1988), Chapman & Hall: Chapman & Hall London
[13] Forina, M.; Armanino, C.; Castino, M.; Ubigli, M., Multivariate data analysis as a discriminating method of the origin of wines, J. R. Stat. Soc. Ser. B Stat. Methodol., 25, 3, 189-201 (1986), URL https://ojs.openagrar.de/index.php/VITIS/article/view/5950
[14] Fortunato, S., Community detection in graphs, Phys. Rep., 486, 3, 75-174 (2010)
[15] Fortunato, S.; Hric, D., Community detection in networks: A user guide, Phys. Rep., 659, 1-44 (2016), Community detection in networks: A user guide
[16] Fraley, C.; Raftery, A., Model-based clustering, discriminant analysis, and density estimation, J. Amer. Statist. Assoc., 97, 611-631 (2002) · Zbl 1073.62545
[17] Fritsch, A.; Ickstadt, K., Improved criteria for clustering based on the posterior similarity matrix, Bayesian Anal., 4, 2, 367-391 (2009) · Zbl 1330.62249
[18] Frühwirth-Schnatter, S., Finite Mixture and Markov Switching Models (2006), Springer: Springer Berlin · Zbl 1108.62002
[19] Frühwirth-Schnatter, S.; Grün, B.; Malsiner-Walli, G., Comment on the paper by Wade and Ghahramani, Bayesian Anal., 13, 2, 601-603 (2018)
[20] Gordon, A., Classification, Chapman & Hall/CRC Monographs on Statistics & Applied Probability (1999), CRC Press · Zbl 0929.62068
[21] Griffin, J., Sequential Monte Carlo methods for normalized random measure with independent increments mixtures, Stat. Comput., 27 (2011)
[22] Griffin, J. E.; Walker, S. G., Posterior simulation of Normalized Random measure mixtures, J. Comput. Graph. Statist., 20, 1, 241-259 (2011)
[23] Hartigan, J. A., Clustering Algorithms (1975), John Wiley & Sons, Inc.: John Wiley & Sons, Inc. New York, NY, USA · Zbl 0372.62040
[24] Hubert, L.; Arabie, P., Comparing partitions, J. Classification, 2, 1, 193-218 (1985)
[25] Ishwaran, H.; James, L. F., Gibbs Sampling Methods for Stick Breaking Priors, J. Amer. Statist. Assoc., 96, 161-173 (2001) · Zbl 1014.62006
[26] Jain, S.; Neal, R. M., Splitting and merging components of a nonconjugate Dirichlet process mixture model, Bayesian Anal., 2, 3, 445-472 (2007) · Zbl 1331.62145
[27] Jara, A., Applied Bayesian non- and semi-parametric inference using DPpackage, R News, 7, 3, 17-26 (2007)
[28] Jara, A.; Hanson, T.; Quintana, F.; Müller, P.; Rosner, G., DPpackage: Bayesian semi- and nonparametric Modeling in R, J. Stat. Softw., 40, 5, 1-30 (2011)
[29] Kaufman, L.; Rousseeuw, P., Finding Groups in Data: An Introduction to Cluster Analysis (1990), Wiley: Wiley New York · Zbl 1345.62009
[30] Lau, J. W.; Green, P. J., Bayesian model-based clustering procedures, J. Comput. Graph. Statist., 16, 3, 526-558 (2007)
[31] Lijoi, A.; Prünster, I., Models beyond the Dirichlet process, (Hjort, N. L.; Holmes, C.; Müller, P.; Walker, S. G., Bayesian Nonparametrics. Bayesian Nonparametrics, Cambridge Series in Statistical and Probabilistic Mathematics (2010), Cambridge University Press), 80-136
[32] Lovász, L., Random Walks on graphs: A survey, (Miklós, D.; Sós, V. T.; Szőnyi, T., Combinatorics, Paul Erdős is Eighty, Vol. 2 (1996), János Bolyai Mathematical Society: János Bolyai Mathematical Society Budapest), 353-398 · Zbl 0854.60071
[33] Maceachern, S.; Clyde, M.; Liu, J., Sequential importance sampling for nonparametric Bayes models: The next generation, Canad. J. Statist., 27 (1997) · Zbl 0957.62068
[34] McLachlan, G.; Peel, D., Finite Mixture Models (2000), Wiley: Wiley New York · Zbl 0963.62061
[35] Medvedovic, M.; Guo, J., Bayesian model-averaging in unsupervised learning from microarray data, (BIOKDD (2004)), 40-47
[36] Medvedovic, M.; Sivaganesan, S., Bayesian Infinite mixture model based clustering of gene expression profiles, Bioinformatics, 18, 1194-1206 (2002)
[37] Meilă, M., Comparing clusterings — an information based distance, J. Multivariate Anal., 98, 5, 873-895 (2007) · Zbl 1298.91124
[38] Miller, J. W.; Harrison, M. T., Inconsistency of pitman-yor process mixtures for the number of components, J. Mach. Learn. Res., 15, 96, 3333-3370 (2014) · Zbl 1319.62100
[39] Neal, R. M., Markov Chain sampling methods for Dirichlet process mixture models, J. Comput. Graph. Statist., 9, 2, 249-265 (2000)
[40] Papaspiliopoulos, O.; Roberts, G. O., Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models, Biometrika, 95, 1, 169-186 (2008) · Zbl 1437.62576
[41] Raftery, A. E.; Dean, N., Variable selection for model-based clustering, J. Amer. Statist. Assoc., 101, 473, 168-178 (2006) · Zbl 1118.62339
[42] Rastelli, R.; Latouche, P.; Friel, N., Choosing the number of groups in a latent stochastic blockmodel for dynamic networks, Netw. Sci., 6, 4, 469-493 (2018)
[43] Roeder, K., Density estimation with confidence sets exemplified by superclusters and voids in the Galaxies, J. Amer. Statist. Assoc., 85, 411, 617-624 (1990) · Zbl 0704.62103
[44] Rosvall, M.; Axelsson, D.; Bergstrom, C., The map equation, Eur. Phys. J. Spec. Top., 178, 13-23 (2009)
[45] Rosvall, M.; Bergstrom, C., Maps of random walks on complex networks reveal community structure, PNAS, 105, 1118-1123 (2008), www.pnas.org/cgi/doi/10.1073/pnas.0706851105
[46] Schwarz, G., Estimating the dimension of a model, Ann. Statist., 6, 2, 461-464 (1978) · Zbl 0379.62005
[47] Scrucca, L.; Fop, M.; Murphy, T. B.; Raftery, A. E., Mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models, R J., 8, 1, 205-233 (2016)
[48] Scrucca, L.; Raftery, A. E., Clustvarsel: A package implementing variable selection for Gaussian model-based clustering in R, J. Stat. Softw., 84, 1, 1-28 (2018)
[49] Sethuraman, J., A constructive definition of Dirichlet priors, Statist. Sinica, 4, 639-650 (1994) · Zbl 0823.62007
[50] Tyron, R. C., Cluster Analysis: Correlation Profile and Orthometric (factor) Analysis for the Isolation of Unities in Mind and Personality (1939), Edwards Brothers
[51] Wade, S.; Ghahramani, Z., Bayesian cluster analysis: Point estimation and credible Balls (with Discussion), Bayesian Anal., 13, 2, 559-626 (2018) · Zbl 1407.62241
[52] Walker, S. G., Sampling the Dirichlet mixture model with Slices, Comm. Statist. Simulation Comput., 36, 1, 45-54 (2007) · Zbl 1113.62058
[53] Yau, C.; Holmes, C., Hierarchical Bayesian nonparametric mixture models for clustering with variable relevance determination, Bayesian Anal., 6, 2, 329-351 (2011) · Zbl 1330.62265
[54] Zubin, J., A technique for measuring like-mindedness, J. Abnorm. Soc. Psychol., 33, 4, 508-516 (1938)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.