×

Bayesian model-based clustering for populations of network data. (English) Zbl 07832647

Summary: There is increasing appetite for analysing populations of network data due to the fast-growing body of applications demanding such methods. While methods exist to provide readily interpretable summaries of heterogeneous network populations, these are often descriptive or ad hoc, lacking any formal justification. In contrast, principled analysis methods often provide results difficult to relate back to the applied problem of interest. Motivated by two complementary applied examples, we develop a Bayesian framework to appropriately model complex heterogeneous network populations, while also allowing analysts to gain insights from the data and make inferences most relevant to their needs. The first application involves a study in computer science measuring human movements across a university. The second analyses data from neuroscience investigating relationships between different regions of the brain. While both applications entail analysis of a heterogeneous population of networks, network sizes vary considerably. We focus on the problem of clustering the elements of a network population, where each cluster is characterised by a network representative. We take advantage of the Bayesian machinery to simultaneously infer the cluster membership, the representatives, and the community structure of the representatives, thus allowing intuitive inferences to be made. The implementation of our method on the human movement study reveals interesting movement patterns of individuals in clusters, readily characterised by their network representative. For the brain networks application, our model reveals a cluster of individuals with different network properties of particular interest in neuroscience. The performance of our method is additionally validated in extensive simulation studies.

MSC:

62Pxx Applications of statistics

Software:

kmed; blockmodels

References:

[1] AIROLDI, E. M. and BLOCKER, A. W. (2013). Estimating latent processes on a network from indirect measurements. J. Amer. Statist. Assoc. 108 149-164. Digital Object Identifier: 10.1080/01621459.2012.756328 Google Scholar: Lookup Link MathSciNet: MR3174609 · Zbl 06158332 · doi:10.1080/01621459.2012.756328
[2] ARROYO, J., ATHREYA, A., CAPE, J., CHEN, G., PRIEBE, C. E. and VOGELSTEIN, J. T. (2021). Inference for multiple heterogeneous networks with a common invariant subspace. J. Mach. Learn. Res. 22 Paper No. 142, 49. MathSciNet: MR4318498 · Zbl 07415085
[3] ARROYO RELIÓN, J. D., KESSLER, D., LEVINA, E. and TAYLOR, S. F. (2019). Network classification with applications to brain connectomics. Ann. Appl. Stat. 13 1648-1677. Digital Object Identifier: 10.1214/19-AOAS1252 Google Scholar: Lookup Link MathSciNet: MR4019153 · Zbl 1433.62314 · doi:10.1214/19-AOAS1252
[4] BALACHANDRAN, P., KOLACZYK, E. D. and VILES, W. D. (2017). On the propagation of low-rate measurement error to subgraph counts in large networks. J. Mach. Learn. Res. 18 Paper No. 61, 33. MathSciNet: MR3687604 · Zbl 1440.62226
[5] BASSETT, D. S. and BULLMORE, E. D. (2006). Small-world brain networks. Neuroscientist 12 512-523.
[6] BUDIAJI, W. (2019). kmed: Distance-based K-medoids R package version 0.3.0.
[7] CARVALHO, C. M., POLSON, N. G. and SCOTT, J. G. (2009). Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics. 73-80. PMLR, U.S.
[8] CHANG, J., KOLACZYK, E. D. and YAO, Q. (2022). Estimation of subgraph densities in noisy networks. J. Amer. Statist. Assoc. 117 361-374. Digital Object Identifier: 10.1080/01621459.2020.1778482 Google Scholar: Lookup Link MathSciNet: MR4399091 · Zbl 1506.62301 · doi:10.1080/01621459.2020.1778482
[9] Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. Ann. Statist. 43 177-214. Digital Object Identifier: 10.1214/14-AOS1272 Google Scholar: Lookup Link MathSciNet: MR3285604 · Zbl 1308.62038 · doi:10.1214/14-AOS1272
[10] CRADDOCK, R. C., JAMES, G. A., HOLTZHEIMER, P. E. III, HU, X. and MAYBERG, H. S. (2012). A whole brain fMRI atlas generated via spatially constrained spectral clustering. Hum. Brain Mapp. 33 1914-1928.
[11] DIQUIGIOVANNI, J. and SCARPA, B. (2019). Analysis of association football playing styles: An innovative method to cluster networks. Stat. Model. 19 28-54. Digital Object Identifier: 10.1177/1471082X18808628 Google Scholar: Lookup Link MathSciNet: MR3903741 · Zbl 07289524 · doi:10.1177/1471082X18808628
[12] DONNAT, C. and HOLMES, S. (2018). Tracking network dynamics: A survey using graph distances. Ann. Appl. Stat. 12 971-1012. Digital Object Identifier: 10.1214/18-AOAS1176 Google Scholar: Lookup Link MathSciNet: MR3834292 · Zbl 1405.62171 · doi:10.1214/18-AOAS1176
[13] DURANTE, D., DUNSON, D. B. and VOGELSTEIN, J. T. (2017). Nonparametric Bayes modeling of populations of networks. J. Amer. Statist. Assoc. 112 1516-1530. Digital Object Identifier: 10.1080/01621459.2016.1219260 Google Scholar: Lookup Link MathSciNet: MR3750873 · doi:10.1080/01621459.2016.1219260
[14] FIELDS, S. and SONG, O. (1989). A novel genetic system to detect protein-protein interactions. Nature 340 245-246.
[15] Frühwirth-Schnatter, S. and Malsiner-Walli, G. (2019). From here to infinity: Sparse finite versus Dirichlet process mixtures in model-based clustering. Adv. Data Anal. Classif. 13 33-64. Digital Object Identifier: 10.1007/s11634-018-0329-y Google Scholar: Lookup Link MathSciNet: MR3935190 · Zbl 1474.62225 · doi:10.1007/s11634-018-0329-y
[16] GINESTET, C. E., LI, J., BALACHANDRAN, P., ROSENBERG, S. and KOLACZYK, E. D. (2017). Hypothesis testing for network data in functional neuroimaging. Ann. Appl. Stat. 11 725-750. Digital Object Identifier: 10.1214/16-AOAS1015 Google Scholar: Lookup Link MathSciNet: MR3693544 · Zbl 1391.62217 · doi:10.1214/16-AOAS1015
[17] GOLLINI, I. and MURPHY, T. B. (2016). Joint modeling of multiple network views. J. Comput. Graph. Statist. 25 246-265. Digital Object Identifier: 10.1080/10618600.2014.978006 Google Scholar: Lookup Link MathSciNet: MR3474046 · doi:10.1080/10618600.2014.978006
[18] HANDCOCK, M. S. and GILE, K. J. (2010). Modeling social networks from sampled data. Ann. Appl. Stat. 4 5-25. Digital Object Identifier: 10.1214/08-AOAS221 Google Scholar: Lookup Link MathSciNet: MR2758082 · Zbl 1189.62187 · doi:10.1214/08-AOAS221
[19] HEITJAN, D. F. and RUBIN, D. B. (1990). Inference from coarse data via multiple imputation with application to age heaping. J. Amer. Statist. Assoc. 85 304-314.
[20] Heitjan, D. F. and Rubin, D. B. (1991). Ignorability and coarse data. Ann. Statist. 19 2244-2253. Digital Object Identifier: 10.1214/aos/1176348396 Google Scholar: Lookup Link MathSciNet: MR1135174 · Zbl 0745.62004 · doi:10.1214/aos/1176348396
[21] Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002). Latent space approaches to social network analysis. J. Amer. Statist. Assoc. 97 1090-1098. Digital Object Identifier: 10.1198/016214502388618906 Google Scholar: Lookup Link MathSciNet: MR1951262 · Zbl 1041.62098 · doi:10.1198/016214502388618906
[22] INRA and LEGER, J.-B. (2015). blockmodels: Latent and stochastic block model estimation by a ‘V-EM’ algorithm R package version 1.1.1.
[23] JIANG, X., GOLD, D. and KOLACZYK, E. D. (2011). Network-based auto-probit modeling for protein function prediction. Biometrics 67 958-966. Digital Object Identifier: 10.1111/j.1541-0420.2010.01519.x Google Scholar: Lookup Link MathSciNet: MR2829270 · Zbl 1226.62116 · doi:10.1111/j.1541-0420.2010.01519.x
[24] JOSEPHS, N., LI, W. and KOLACZYK, E. D. (2021). Network recovery from unlabeled noisy samples. In 2021 55th Asilomar Conference on Signals, Systems, and Computers 1268-1273. IEEE, U.S.
[25] Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā Ser. A 62 49-66. MathSciNet: MR1769735 · Zbl 1081.62516
[26] KIM, H. and PARK, H. (2007). Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23 1495-1502.
[27] KIM, J. K. and HONG, M. (2012). Imputation for statistical inference with coarse data. Canad. J. Statist. 40 604-618. Digital Object Identifier: 10.1002/cjs.11142 Google Scholar: Lookup Link MathSciNet: MR2968401 · Zbl 1349.62024 · doi:10.1002/cjs.11142
[28] KOLACZYK, E. D., LIN, L., ROSENBERG, S., WALTERS, J. and XU, J. (2020). Averages of unlabeled networks: Geometric characterization and asymptotic behavior. Ann. Statist. 48 514-538. Digital Object Identifier: 10.1214/19-AOS1820 Google Scholar: Lookup Link MathSciNet: MR4065172 · Zbl 1439.62068 · doi:10.1214/19-AOS1820
[29] KOSKINEN, J. H. ROBINS, G. L. WANG, P. PATTISON, P. E. (2013). Bayesian analysis for partially observed network data, missing ties, attributes and actors. Soc. Netw. 35 514-527.
[30] Le, C. M., Levin, K. and Levina, E. (2018). Estimating a network from multiple noisy realizations. Electron. J. Stat. 12 4697-4740. Digital Object Identifier: 10.1214/18-ejs1521 Google Scholar: Lookup Link MathSciNet: MR3894068 · Zbl 1409.62115 · doi:10.1214/18-ejs1521
[31] LE, C. M. and LI, T. (2022). Linear regression and its inference on noisy network-linked data. J. R. Stat. Soc. Ser. B. Stat. Methodol. 84 1851-1885. MathSciNet: MR4515560 · Zbl 07686601
[32] LEVIN, K., ATHREYA, A., TANG, M., LYZINSKI, V., PARK, Y. and PRIEBE, C. E. (2017). A central limit theorem for an omnibus embedding of multiple random graphs and implications for multiscale network inference. ArXiv preprint. Available at arXiv:1705.09355.
[33] LI, W., SUSSMAN, D. L. and KOLACZYK, E. D. (2021). Causal inference under network interference with noise. ArXiv preprint. Available at arXiv:2105.04518.
[34] LIAO, X., VASILAKOS, A. V. and HE, Y. (2017). Small-world human brain networks: Perspectives and challenges. Neurosci. Biobehav. Rev. 77 286-300.
[35] LU, X. and MARRON, J. S. (2014). Analysis of juggling data: Object oriented data analysis of clustering in acceleration functions. Electron. J. Stat. 8 1842-1847. Digital Object Identifier: 10.1214/14-EJS937D Google Scholar: Lookup Link MathSciNet: MR3273603 · Zbl 1305.62014 · doi:10.1214/14-EJS937D
[36] LUNAGÓMEZ, S., OLHEDE, S. C. and WOLFE, P. J. (2021). Modeling network populations via graph distances. J. Amer. Statist. Assoc. 116 2023-2040. Digital Object Identifier: 10.1080/01621459.2020.1763803 Google Scholar: Lookup Link MathSciNet: MR4353730 · Zbl 1524.62123 · doi:10.1080/01621459.2020.1763803
[37] MALSINER-WALLI, G., FRÜHWIRTH-SCHNATTER, S. and GRÜN, B. (2016). Model-based clustering based on sparse finite Gaussian mixtures. Stat. Comput. 26 303-324. Digital Object Identifier: 10.1007/s11222-014-9500-2 Google Scholar: Lookup Link MathSciNet: MR3439375 · Zbl 1342.62109 · doi:10.1007/s11222-014-9500-2
[38] MANTZIOU, A., LUNAGÓMEZ, S. and MITRA, R. (2024). Supplement to “Bayesian model-based clustering for populations of network data.” https://doi.org/10.1214/23-AOAS1789SUPPA, https://doi.org/10.1214/23-AOAS1789SUPPB
[39] MARCHETTE, D. J. and HOHMAN, E. L. (2015). Utilizing covariates in partially observed networks. In 2015 18th International Conference on Information Fusion (Fusion) 166-172. IEEE, U.S.
[40] MUKHERJEE, S. S., SARKAR, P. and LIN, L. (2017). On clustering network-valued data. In Advances in Neural Information Processing Systems.
[41] Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Statist. 9 249-265. Digital Object Identifier: 10.2307/1390653 Google Scholar: Lookup Link MathSciNet: MR1823804 · doi:10.2307/1390653
[42] NEWMAN, M. E. J. (2018). Estimating network structure from unreliable measurements. Phys. Rev. E 98 062321.
[43] NIELSEN, A. M. and WITTEN, D. (2018). The multiple random dot product graph model. ArXiv preprint. Available at arXiv:1811.12172.
[44] PEIXOTO, T. P. (2018). Reconstructing networks with unknown and heterogeneous errors. Phys. Rev. X 8 041011.
[45] PRASAD, G., JOSHI, S. H., NIR, T. M., TOGA, A. W., THOMPSON, P. M., ALZHEIMER’S DISEASE NEUROIMAGING INITIATIVE (ADNI) et al. (2015). Brain connectivity and novel network measures for Alzheimer’s disease classification. Neurobiol. Aging 36 S121-S131.
[46] PRIEBE, C. E., SUSSMAN, D. L., TANG, M. and VOGELSTEIN, J. T. (2015). Statistical inference on errorfully observed graphs. J. Comput. Graph. Statist. 24 930-953. Digital Object Identifier: 10.1080/10618600.2014.951049 Google Scholar: Lookup Link MathSciNet: MR3432923 · doi:10.1080/10618600.2014.951049
[47] RASTELLI, R. and FRIEL, N. (2018). Optimal Bayesian estimators for latent variable cluster models. Stat. Comput. 28 1169-1186. Digital Object Identifier: 10.1007/s11222-017-9786-y Google Scholar: Lookup Link MathSciNet: MR3850389 · Zbl 1430.62140 · doi:10.1007/s11222-017-9786-y
[48] RICHARDSON, S. and GREEN, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. Ser. B 59 731-792. Digital Object Identifier: 10.1111/1467-9868.00095 Google Scholar: Lookup Link MathSciNet: MR1483213 · doi:10.1111/1467-9868.00095
[49] RICHIARDI, J., ERYILMAZ, H., SCHWARTZ, S., VUILLEUMIER, P. and VAN DE VILLE, D. (2011). Decoding brain states from fMRI connectivity graphs. NeuroImage 56 616-626.
[50] SCHÜTZE, H., MANNING, C. D. and RAGHAVAN, P. (2008). Introduction to Information Retrieval 39. Cambridge Univ. Press, Cambridge. · Zbl 1160.68008
[51] SHAW, P., MIKUSZ, M., NURMI, P. and DAVIES, N. (2018). Tacita: A privacy preserving public display personalisation service. In Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers 448-451.
[52] SHEN, W., WANG, Y., BAI, X., WANG, H. and LATECKI, L.J. (2013). Shape clustering: Common structure discovery. Pattern Recognit. 46 539-550. · Zbl 1251.68270
[53] SIGNORELLI, M. and WIT, E. C. (2020). Model-based clustering for populations of networks. Stat. Model. 20 9-29. Digital Object Identifier: 10.1177/1471082X19871128 Google Scholar: Lookup Link MathSciNet: MR4052400 · Zbl 07290023 · doi:10.1177/1471082X19871128
[54] SONG, J. J., LEE, H.-J., MORRIS, J. S. and KANG, S. (2007). Clustering of time-course gene expression data using functional data analysis. Comput. Biol. Chem. 31 265-274. · Zbl 1122.92029
[55] Wang, S., Arroyo, J., Vogelstein, J. T. and Priebe, C. E. (2019). Joint embedding of graphs. IEEE Trans. Pattern Anal. Mach. Intell.
[56] WHITE, J. G., SOUTHGATE, E., THOMSON, N. J. and BRENNER, S. (1986). The structure of the nervous system of the nematode Caenorhabditis elegans. Philos. Trans. R. Soc. Lond. B, Biol. Sci. 314 1-340.
[57] YOUNG, J.-G., CANTWELL, G. T. and NEWMAN, M. E. J. (2020). Bayesian inference of network structure from unreliable data. J. Complex Netw. 8 cnaa046, 26. Digital Object Identifier: 10.1093/comnet/cnaa046 Google Scholar: Lookup Link MathSciNet: MR4228651 · doi:10.1093/comnet/cnaa046
[58] YOUNG, S. J. and SCHEINERMAN, E. R. (2007). Random dot product graph models for social networks. In Algorithms and Models for the Web-Graph. Lecture Notes in Computer Science 4863 138-149. Springer, Berlin. Digital Object Identifier: 10.1007/978-3-540-77004-6_11 Google Scholar: Lookup Link MathSciNet: MR2504912 · Zbl 1136.05322 · doi:10.1007/978-3-540-77004-6_11
[59] ZHANG, J., CHENG, W., WANG, Z., ZHANG, Z., LU, W., LU, G. and FENG, J. (2012). Pattern classification of large-scale functional brain networks: Identification of informative neuroimaging markers for epilepsy. PLoS ONE 7 e36733.
[60] ZHAO, Y., WU, Y.-J., LEVINA, E. and ZHU, J. (2017). Link prediction for partially observed networks. J. Comput. Graph. Statist. 26 725-733. Digital Object Identifier: 10.1080/10618600.2017.1286243 Google Scholar: Lookup Link MathSciNet: MR3698680 · doi:10.1080/10618600.2017.1286243
[61] ZUO, X.-N., ANDERSON, J. S., BELLEC, P., BIRN, R. M., BISWAL, B. B., BLAUTZIK, J., BREITNER, J. C. S., BUCKNER, R. L., CALHOUN, V. D., CASTELLANOS, F. X. et al. (2014). An open science resource for establishing reliability and reproducibility in functional connectomics. Sci. Data 1 1-13.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.