
Asymptotically optimal algorithms for budgeted multiple play bandits. (English) Zbl 1446.91032

Summary: We study a generalization of the multi-armed bandit problem with multiple plays in which pulling each arm incurs a cost and the agent has a per-round budget that limits how much she can expect to spend. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our setting. We then study a variant of Thompson sampling for Bernoulli rewards and a variant of KL-UCB for both single-parameter exponential families and bounded, finitely supported rewards. We show that these algorithms are asymptotically optimal, both in rate and in the leading problem-dependent constants, including in the thick-margin setting where multiple arms fall on the decision boundary.
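To make the budgeted multiple-play setup concrete, the sketch below shows one round of a Thompson-sampling-style allocation for Bernoulli arms with known pull costs and a per-round budget on expected spend. It is an illustration only, not the authors' exact procedure: the function name, the Beta(1,1) priors, and the fractional-knapsack greedy on posterior-sampled means (ranking arms by sampled mean per unit cost) are all assumptions made for this sketch.

```python
import numpy as np

def budgeted_thompson_step(successes, failures, costs, budget, rng):
    """One round of a budgeted Thompson-sampling-style allocation (illustrative sketch).

    successes, failures: per-arm Beta posterior counts (Beta(1,1) priors assumed).
    costs: known cost of pulling each arm.
    budget: maximum expected cost allowed this round.
    Returns a vector of pull probabilities from a fractional-knapsack greedy
    on the posterior-sampled means.
    """
    K = len(costs)
    theta = rng.beta(successes + 1, failures + 1)   # posterior samples of the means
    order = np.argsort(-theta / costs)              # greedy by sampled value per unit cost
    probs = np.zeros(K)
    remaining = budget
    for k in order:
        if costs[k] <= remaining:                   # pull this arm with probability 1
            probs[k] = 1.0
            remaining -= costs[k]
        else:                                       # spend the leftover budget fractionally
            probs[k] = remaining / costs[k]
            break
    return probs

# Toy usage: 4 Bernoulli arms with hidden means, known costs, budget of 2 per round.
rng = np.random.default_rng(0)
succ, fail = np.zeros(4), np.zeros(4)
costs = np.array([1.0, 1.0, 2.0, 0.5])
hidden_means = np.array([0.3, 0.5, 0.7, 0.2])       # unknown to the learner
for t in range(100):
    p = budgeted_thompson_step(succ, fail, costs, budget=2.0, rng=rng)
    pulls = rng.random(4) < p                        # realize the randomized allocation
    rewards = rng.random(4) < hidden_means           # Bernoulli rewards, observed only if pulled
    succ += pulls * rewards
    fail += pulls * (1 - rewards)
```

The greedy-by-ratio allocation is Dantzig's fractional-knapsack rule applied to the sampled means; it respects the budget in expectation while leaving at most one arm pulled with a fractional probability.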

MSC:

91A60 Probabilistic games; gambling
91A68 Algorithmic game theory and complexity

Software:

R
