Bayesian fractional posteriors. (English) Zbl 1473.62116

Authors’ abstract: We consider the fractional posterior distribution that is obtained by updating a prior distribution via Bayes’ theorem with a fractional likelihood function, that is, the usual likelihood function raised to a fractional power. First, we analyze the contraction property of the fractional posterior in a general misspecified framework. Our contraction results require only a prior mass condition on a certain Kullback-Leibler (KL) neighborhood of the true parameter (or of the KL divergence minimizer in the misspecified case), and obviate the construction of test functions and sieves commonly used in the literature for analyzing the contraction property of a regular posterior. We show through a counterexample that some condition controlling the complexity of the parameter space is necessary for the regular posterior to contract, so the fractional posterior allows additional flexibility in the choice of the prior. Second, we derive a novel Bayesian oracle inequality based on a PAC-Bayes inequality in misspecified models. Our derivation reveals several advantages of averaging-based Bayesian procedures over optimization-based frequentist procedures. As an application of the Bayesian oracle inequality, we derive a sharp oracle inequality in multivariate convex regression problems. We also illustrate the theory in Gaussian process regression and density estimation problems.
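
To make the construction in the abstract concrete, here is a minimal sketch of a fractional posterior in a conjugate toy model. The Beta-Bernoulli setting is an illustrative assumption and does not come from the paper. Likewise, the function names and the counts used are hypothetical. With a Beta(a, b) prior and s successes and f failures, raising the Bernoulli likelihood to a power alpha in (0, 1] gives a Beta(a + alpha*s, b + alpha*f) fractional posterior; alpha = 1 recovers the regular posterior.

```python
def fractional_beta_posterior(successes, failures, alpha, a=1.0, b=1.0):
    """Return the Beta parameters of the fractional posterior for Bernoulli data.

    Prior: Beta(a, b). Fractional likelihood: [p^s (1 - p)^f]^alpha.
    The product of the two is proportional to a Beta density with
    parameters (a + alpha * s, b + alpha * f).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return a + alpha * successes, b + alpha * failures


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var


# Hypothetical data: 7 successes and 3 failures, uniform Beta(1, 1) prior.
regular = fractional_beta_posterior(7, 3, alpha=1.0)   # regular posterior
tempered = fractional_beta_posterior(7, 3, alpha=0.5)  # fractional posterior

regular_mean, regular_var = beta_mean_var(*regular)
tempered_mean, tempered_var = beta_mean_var(*tempered)
```

In this toy case the fractional power acts like a reduced effective sample size: the fractional posterior is more spread out than the regular one and its mean is pulled closer to the prior mean.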

MSC:

62G07 Density estimation
62G20 Asymptotic properties of nonparametric inference
60G22 Fractional processes, including fractional Brownian motion
