×

Cohesion and repulsion in Bayesian distance clustering. (English) Zbl 07877167

Summary: Clustering in high-dimensions poses many statistical challenges. While traditional distance-based clustering methods are computationally feasible, they lack probabilistic interpretation and rely on heuristics for estimation of the number of clusters. On the other hand, probabilistic model-based clustering techniques often fail to scale and devising algorithms that are able to effectively explore the posterior space is an open problem. Based on recent developments in Bayesian distance-based clustering, we propose a hybrid solution that entails defining a likelihood on pairwise distances between observations. The novelty of the approach consists in including both cohesion and repulsion terms in the likelihood, which allows for cluster identifiability. This implies that clusters are composed of objects which have small dissimilarities among themselves (cohesion) and similar dissimilarities to observations in other clusters (repulsion). We show how this modeling strategy has interesting connection with existing proposals in the literature. The proposed method is computationally efficient and applicable to a wide variety of scenarios. We demonstrate the approach in simulation and an application in digital numismatics. Supplementary Material with code is available online.

MSC:

62-XX Statistics

References:

[1] American Numismatic Society (n.d.). “Silver Denarius of Nero, Rome, AD 64 - AD 65 (BMC 74, RIC I Second Edition Nero 53),” ANS ID: 1944.100.39423, available at http://numismatics.org/collection/1944.100.39423.
[2] Argiento, R., and De Iorio, M. (2022), “Is infinity that far? A Bayesian Nonparametric Perspective of Finite Mixture Models,”Annals of Statistics, 50, 2641-2663. DOI: . · Zbl 1539.62085
[3] Barcella, W., De Iorio, M., and Baio, G. (2017), “A Comparative Review of Variable Selection Techniques for Covariate Dependent Dirichlet Process Mixture Models,”Canadian Journal of Statistics, 45, 254-273. DOI: . · Zbl 1474.62118
[4] Barry, D., and Hartigan, J. A. (1992), “Product Partition Models for Change Point Problems,”The Annals of Statistics, 20, 260-279. DOI: . · Zbl 0780.62071
[5] Beaumont, M. A., Zhang, W., Balding, D. J. (2002), “Approximate Bayesian Computation in Population Genetics,”Genetics, 162, 2025-2035. DOI: .
[6] Besag, J. (1975), “Statistical Analysis of Non-Lattice Data,”Journal of the Royal Statistical Society, Series D, 24, 179-195. DOI: .
[7] Betancourt, B., Zanella, G., and Steorts, R. C. (2022), “Random Partition Models for Microclustering Tasks,”Journal of the American Statistical Association, 117, 1215-1227. DOI: . · Zbl 1506.62327
[8] Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999), “When Is “Nearest Neighbor” Meaningful?” in Database Theory—ICDT’99, eds. Beeri, C., and Buneman, P., pp. 217-235. Berlin Heidelberg: Springer. DOI: .
[9] Chandler, R. E., and Bate, S. (2007), “Inference for Clustered Data Using the Independence Loglikelihood,”Biometrika, 94, 167-183. DOI: . · Zbl 1142.62367
[10] Classical Numismatic Group (n.d.). “Triton XX, Lot: 673,” RIC I 53; WCN 57; RSC 119; BMCRE 74-6; BN 220-1, available at https://www.cngcoins.com/Coin.aspx?CoinID=324866.
[11] Corander, J., Sirén, J., and Arjas, E. (2008), “Bayesian Spatial Modeling of Genetic Population Structure,”Computational Statistics, 23, 111-129. DOI: .
[12] Dahl, D. B. (2008), “Distance-Based Probability Distribution for Set Partitions with Applications to Bayesian Nonparametrics,” in “2008 JSM Proceedings: Papers Presented at the Joint Statistical Meetings, Denver, Colorado, August 3-7, 2008, and other ASA-sponsored Conferences; Communicating Statistics: Speaking Out and Reaching Out, ,” Alexandria, VA: American Statistical Association.
[13] Dahl, D. B., Day, R., and Tsai, J. W. (2017), “Random Partition Distribution Indexed by Pairwise Information,”Journal of the American Statistical Association, 112, 721-732. DOI: .
[14] Dahl, D. B., Johnson, D. J., and Müller, P. (2022), “Search Algorithms and Loss Functions for Bayesian Clustering,”Journal of Computational and Graphical Statistics, 31, 1189-1201. DOI: . · Zbl 07633313
[15] Dasgupta, A., and Raftery, A. E. (1998), “Detecting Features in Spatial Point Processes with Clutter via Model-Based Clustering,”Journal of the American Statistical Association, 93, 294-302. DOI: . · Zbl 0906.62105
[16] Denison, D. G. T., and Holmes, C. C. (2001), “Bayesian Partitioning for Estimating Disease Risk,”Biometrics, 57, 143-149. DOI: . · Zbl 1209.62276
[17] Dryden, I. L., and Mardia, K. V. (2016), Statistical Shape Analysis, with Applications in R, New York, NY: Wiley. DOI: . · Zbl 1381.62003
[18] Duan, L. L., and Dunson, D. B. (2021), “Bayesian Distance Clustering,”Journal of Machine Learning Research, 22, 1-27. http://jmlr.org/papers/v22/20-688.html. · Zbl 07626739
[19] Fearnhead, P., and Prangle, D. (2012), “Constructing Summary Statistics for Approximate Bayesian Computation: Semi-Automatic Approximate Bayesian Computation,”Journal of the Royal Statistical Society, Series B, 74, 419-474. DOI: . · Zbl 1411.62057
[20] Fúquene, J., Steel, M., and Rossell, D. (2019), “On Choosing Mixture Components via Non-Local Priors,”Journal of the Royal Statistical Society, Series B, 81, 809-837. DOI: . · Zbl 1429.62243
[21] Fritsch, A., and Ickstadt, K. (2009), “Improved Criteria for Clustering based on the Posterior Similarity Matrix,”Bayesian Analysis, 4, 367-391. DOI: . · Zbl 1330.62249
[22] Gao, T., Kovalsky, S. Z., Boyer, D. M., and Daubechies, I. (2019a), “Gaussian Process Landmarking for Three-Dimensional Geometric Morphometrics,”SIAM Journal on Mathematics of Data Science, 1, 237-267. DOI: . · Zbl 1499.60113
[23] Gao, T., Kovalsky, S. Z., and Daubechies, I. (2019b). “Gaussian Process Landmarking on Manifolds,”SIAM Journal on Mathematics of Data Science, 1, 208-236. DOI: . · Zbl 1499.60114
[24] Gerhard Hirsch Nachfolger (2013). “Auktion 293, Lot 2656,” image available at https://www.numisbids.com/n.php?p=lot&sid=514&lot=2656.
[25] Gnedin, A. V., and Pitman, J. (2006), “Exchangeable Gibbs Partitions and Stirling Triangles,”Journal of Mathematical Sciences, 138, 5674-5685. DOI: .
[26] Hartigan, J. (1990), “Partition Models,”Communications in Statistics - Theory and Methods, 19, 2745-2756. DOI: .
[27] Hennig, C. (2015), “What are the True Clusters?”Pattern Recognition Letters, 64, 53-62. DOI: .
[28] Ishwaran, H., and James, L. F. (2003), “Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models,”Statistica Sinica, 13, 1211-1235. https://www.jstor.org/stable/24307169. · Zbl 1086.62036
[29] Jain, S., and Neal, R. M. (2004), “A Split-Merge Markov chain Monte Carlo Procedure for the Dirichlet Process Mixture Model,”Journal of Computational and Graphical Statistics, 13, 158-182. DOI: .
[30] Johnstone, I. M., and Titterington, D. M. (2009), “Statistical Challenges of Hhigh-Dimensional Data,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367, 4237-4253. DOI: . · Zbl 1185.62007
[31] Klaus, J. (1995), Topology (2nd ed.), New York: Springer-Verlag.
[32] Kleinberg, J. (2002), “An Impossibility Theorem for Clustering,” in “Proceedings of the 15th International Conference on Neural Information Processing Systems,” NIPS’02, pp. 463-470, Cambridge, MA: MIT Press. DOI: .
[33] Kruskal, J. B. (1964), “Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis,”Psychometrika, 29, 1-27. DOI: . · Zbl 0123.36803
[34] Larribe, F., and Fearnhead, P. (2011), “On Composite Likelihoods in Statistical Genetics,”Statistica Sinica, 21, 43-69. https://www.jstor.org/stable/24309262. · Zbl 1206.62171
[35] Lau, J. W., and Green, P. J. (2007), “Bayesian Model-Based Clustering Procedures,”Journal of Computational and Graphical Statistics, 16, 526-558. DOI: .
[36] Lele, S., and Taper, M. L. (2002), “A Composite Likelihood Approach to (co)variance Components Estimation,”Journal of Statistical Planning and Inference, 103, 117-135. DOI: . · Zbl 1005.62066
[37] Li, N., and Stephens, M. (2003), “Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data,”Genetics, 165, 2213-2233. DOI: .
[38] Lindsay, B. G. (1988), “Composite Likelihood Methods,” in “Statistical Inference from Stochastic Processes (Ithaca, NY, 1987),” volume 80 of Contemporary Mathematics, pp. 221-239. Providence, RI: American Mathematical Society. DOI: . · Zbl 0672.62069
[39] Lipman, Y., Yagev, S., Poranne, R., Jacobs, D. W., and Basri, R. (2014), “Feature Matching with Bounded Distortion,”ACM Transactions on Graphics, 33, 1-14. DOI: . · Zbl 1322.68244
[40] Lowe, D. (2004). “Distinctive Image Features from Scale-Invariant Keypoints,”International Journal of Computer Vision, 60, 91-110. DOI: .
[41] Marjoram, P., Molitor, J., Plagnol, V., and Tavaré, S. (2003), “Markov Chain Monte Carlo Without Likelihoods,”Proceedings of the National Academy of Sciences, 100, 15324-15328. DOI: .
[42] Mclachlan, G., and Basford, K. (1988), “Mixture Models: Inference and Applications to Clustering,”Journal of the Royal Statistical Society, Series C, 38, 384-384. DOI: . · Zbl 0697.62050
[43] Meilǎ, M. (2007), “Comparing Clusterings—An Information based Distance,”Journal of Multivariate Analysis, 98, 873-895. DOI: . · Zbl 1298.91124
[44] Miller, J. (2020), “BayesianMixtures.” Version 0.1.1, available at https://github.com/jwmi/BayesianMixtures.jl.
[45] Miller, J., Betancourt, B., Zaidi, A., Wallach, H., and Steorts, R. C. (2015), “Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set,” online preprint. DOI: .
[46] Miller, J. W., and Harrison, M. T. (2018), “Mixture Models With a Prior on the Number of Components,”Journal of the American Statistical Association, 113, 340-356. DOI: . · Zbl 1398.62066
[47] Møller, J., and Skare, Ø. (2001), “Coloured Voronoi Tessellations for Bayesian Image Analysis and Reservoir Modelling,”Statistical Modelling, 1, 213-232. DOI: . · Zbl 1104.62106
[48] Müller, P., Quintana, F., and Rosner, G. L. (2011), “A Product Partition Model With Regression on Covariates,”Journal of Computational and Graphical Statistics, 20, 260-278. DOI: .
[49] Nowicki, K., and Snijders, T. A. B. (2001), “Estimation and Prediction for Stochastic Blockstructures,”Journal of the American Statistical Association, 96, 1077-1087. DOI: . · Zbl 1072.62542
[50] Paganin, S., Herring, A. H., Olshan, A. F., and Dunson, D. B. (2021), “Centered Partition Processes: Informative Priors for Clustering (with Discussion),”Bayesian Analysis, 16, 301-670. DOI: . · Zbl 1483.62109
[51] Petralia, F., Rao, V., and Dunson, D. (2012), “Repulsive Mixtures,” in “Advances in Neural Information Processing Systems” (Vol. 25), eds. Pereira, F., Burges, C. J. C., Bottou, L., Weinberger, K. Q.. Curran Associates, Inc. Available at https://proceedings.neurips.cc/paper/2012/file/8d6dc35e506fc23349dd10ee68dabb64-Paper.pdf.
[52] Pitman, J. (1996). “Some Developments of the Blackwell-MacQueen URN Scheme,”Lecture Notes-Monograph Series, 30, 245-267. DOI: .
[53] Pitman, J., and Yor, M. (1997), “The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator,”The Annals of Probability, 25, 855-900. http://www.jstor.org/stable/2959614. DOI: . · Zbl 0880.60076
[54] Quinlan, J. J., Quintana, F. A., and Page, G. L. (2017), “Parsimonious Hierarchical Modeling Using Repulsive Distributions,” DOI: .
[55] Quintana, F. A. (2006), “A Predictive View of Bayesian Clustering,”Journal of Statistical Planning and Inference, 136, 2407-2429. DOI: . · Zbl 1090.62023
[56] Quintana, F. A., and Iglesias, P. L. (2003), “Bayesian Clustering and Product Partition Models,”Journal of the Royal Statistical Society, Series B, 65, 557-574. DOI: . · Zbl 1065.62115
[57] Rigon, T., Herring, A. H., and Dunson, D. B. (2023), “A Generalized Bayes Framework for Probabilistic Clustering,”Biometrika. DOI: . · Zbl 07802183
[58] Schmidt, M. N., and Morup, M. (2013), “Nonparametric Bayesian Modeling of Complex Networks: An Introduction,”IEEE Signal Processing Magazine, 30, 110-128. DOI: .
[59] Snijders, T. A. B., and Nowicki, K. (1997), “Estimation and Prediction for Stochastic Blockmodels for Graphs with Latent Block Structure,”Journal of the Classification, 14, 75-100. DOI: . · Zbl 0896.62063
[60] Stephens, M. (2000), “Dealing with Label Switching in Mixture Models,”Journal of the Royal Statistical Society, Series B, 62, 795-809. DOI: . · Zbl 0957.62020
[61] Szeliski, R. (2010), Computer Vision: Algorithms and Applications, London: Springer. DOI: .
[62] Taylor, Z. M. (2020), “The Computer-Aided Die Study (CADS): A Tool for Conducting Numismatic Die Studies with Computer Vision and Hierarchical Clustering,” Bachelor’s thesis, Trinity Uni. Available at https://digitalcommons.trinity.edu/compsci_honors/54/.
[63] Varin, C., Reid, N., and Firth, D. (2011), “An Overview of Composite Likelihood Methods,”Statistica Sinica, 21, 5-42. https://www.jstor.org/stable/24309261. · Zbl 1534.62022
[64] Wade, S., and Ghahramani, Z. (2018), “Bayesian Cluster Analysis: Point Estimation and Credible Balls”, (with Discussion), Bayesian Analysis, 13, 559-626. DOI: . · Zbl 1407.62241
[65] Xu, Y., Müller, P., and Telesca, D. (2016), “Bayesian Inference for Latent Biologic Structure with Determinantal Point Processes (DPP),”Biometrics, 72, 955-964. DOI: . · Zbl 1390.62320
[66] Zanella, G., Betancourt, B., Wallach, H., Miller, J., Zaidi, A., and Steorts, R. C. (2016), “Flexible Models for Microclustering with Application to Entity Resolution,” in “Proceedings of the 30th International Conference on Neural Information Processing Systems,” NIPS’16, pp. 1425-1433. Red Hook, NY: Curran Associates Inc.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.