
Distributed computation for marginal likelihood based model choice. (English) Zbl 1531.62014

Summary: We propose a general method for distributed Bayesian model choice, using the marginal likelihood, in which a data set is split into non-overlapping subsets. These subsets are accessed only locally by individual workers, and no data is shared between workers. We approximate the model evidence for the full data set by Monte Carlo sampling from the posterior on every subset, yielding a model evidence per subset. The results are combined using a novel approach that corrects for the splitting using summary statistics of the generated samples. Our divide-and-conquer approach enables Bayesian model choice in the large-data setting, exploiting all available information while limiting communication between workers. We derive theoretical error bounds that quantify the resulting trade-off between computational gain and loss in precision. The embarrassingly parallel nature of the method yields important speed-ups on massive data sets, as illustrated by our real-world experiments. In addition, we show how the suggested approach can be extended to model choice within a reversible jump setting that explores multiple feature combinations in a single run.
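To make the split-sample-combine scheme concrete, below is a minimal, self-contained sketch, not the authors' implementation, of the generic divide-and-conquer evidence identity Z = (prod_s Z_s) * int prod_s pi_s(theta) p(theta)^(1-S) d(theta), where Z_s and pi_s are the evidence and posterior of shard s. The correction integral is evaluated from Gaussian summary statistics (mean and variance) of the subposterior samples; the toy Gaussian model, prior variance, shard count, and all function names are assumptions for illustration, and the paper's actual correction may differ in detail.

```python
# Hypothetical sketch: embarrassingly parallel evidence estimation for a toy
# Gaussian model y_i ~ N(theta, sigma2) with prior theta ~ N(0, tau2).
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 4.0          # known obs. variance, prior variance (assumed)
S = 4                            # number of workers / shards (assumed)
y = rng.normal(0.7, np.sqrt(sigma2), size=2000)
shards = np.array_split(y, S)

def log_evidence(ys):
    """Analytic log marginal likelihood of one shard; stands in for the
    per-worker Monte Carlo evidence estimate a real worker would produce."""
    n, s, ss = len(ys), ys.sum(), (ys**2).sum()
    C = n / sigma2 + 1.0 / tau2
    return (-0.5 * n * np.log(2 * np.pi * sigma2) - ss / (2 * sigma2)
            - 0.5 * np.log(tau2 * C) + s**2 / (2 * sigma2**2 * C))

def subposterior_samples(ys, m=5000):
    """Exact posterior draws for the shard; a worker would run MCMC here."""
    C = len(ys) / sigma2 + 1.0 / tau2
    return rng.normal(ys.sum() / sigma2 / C, 1.0 / np.sqrt(C), size=m)

# Each worker communicates only log Z_s and sample summaries (mean, variance).
logZ_s, means, vars_ = [], [], []
for ys in shards:
    draws = subposterior_samples(ys)
    logZ_s.append(log_evidence(ys))
    means.append(draws.mean())
    vars_.append(draws.var())

# Combine: log Z = sum_s log Z_s + log int prod_s pi_s(t) / p(t)^(S-1) dt,
# with each subposterior pi_s approximated as N(mean_s, var_s) so that the
# correction integral is a closed-form Gaussian integral.
b = sum(m / v for m, v in zip(means, vars_))
c = sum(1.0 / v for v in vars_) - (S - 1) / tau2     # must stay positive
a = (sum(-0.5 * np.log(2 * np.pi * v) - m**2 / (2 * v)
         for m, v in zip(means, vars_))
     + 0.5 * (S - 1) * np.log(2 * np.pi * tau2))
log_corr = a + b**2 / (2 * c) + 0.5 * np.log(2 * np.pi / c)

print("combined:", sum(logZ_s) + log_corr, "  exact:", log_evidence(y))
```

In this conjugate toy model the subposteriors are exactly Gaussian, so the combined estimate matches the full-data evidence up to Monte Carlo error in the summary statistics; for non-Gaussian subposteriors the Gaussian summary is only an approximation, which is where error bounds of the kind derived in the paper become relevant.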

MSC:

62F15 Bayesian inference
62-08 Computational methods for problems pertaining to statistics
65C05 Monte Carlo methods
