Smoothness-adaptive contextual bandits. (English) Zbl 07640290

Summary: We study a nonparametric multiarmed bandit problem with stochastic covariates, where a key complexity driver is the smoothness of payoff functions with respect to covariates. Previous studies have focused on deriving minimax-optimal algorithms in cases where it is a priori known how smooth the payoff functions are. In practice, however, the smoothness of payoff functions is typically not known in advance, and misspecification of smoothness may severely deteriorate the performance of existing methods. In this work, we consider a framework where the smoothness of payoff functions is not known and study when and how algorithms may adapt to unknown smoothness. First, we establish that designing algorithms that adapt to unknown smoothness of payoff functions is, in general, impossible. However, under a self-similarity condition (which does not reduce the minimax complexity of the dynamic optimization problem at hand), we establish that adapting to unknown smoothness is possible and further devise a general policy for achieving smoothness-adaptive performance. Our policy infers the smoothness of payoffs throughout the decision-making process while leveraging the structure of off-the-shelf nonadaptive policies. We establish that for problem settings with either differentiable or nondifferentiable payoff functions, this policy matches (up to a logarithmic scale) the regret rate that is achievable when the smoothness of payoffs is known a priori.
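The summary does not reproduce the policy itself, but the role that smoothness plays in nonadaptive baselines can be illustrated with a toy sketch. The following is a hypothetical illustration, not the authors' algorithm: a binned UCB policy on covariates in [0, 1], whose bin count is tuned to an assumed Hölder exponent β via the classical choice n ≈ T^{1/(2β+1)} (for one covariate dimension). All function names and the simulation setup are assumptions for illustration; a smoothness-adaptive policy in the paper's sense must achieve comparable regret without being handed β.

```python
import math
import random

def binned_ucb_regret(T, beta, reward_fns, seed=0):
    """Toy binned UCB policy on covariates x ~ Uniform[0, 1].

    The bin count n = ceil(T**(1/(2*beta+1))) is the standard tuning
    for an assumed Holder smoothness beta with a one-dimensional
    covariate; this is exactly where a priori knowledge of smoothness
    enters a nonadaptive policy. Returns cumulative pseudo-regret.
    """
    rng = random.Random(seed)
    n_bins = max(1, math.ceil(T ** (1.0 / (2.0 * beta + 1.0))))
    k = len(reward_fns)
    counts = [[0] * k for _ in range(n_bins)]
    sums = [[0.0] * k for _ in range(n_bins)]
    regret = 0.0
    for t in range(1, T + 1):
        x = rng.random()                        # stochastic covariate
        b = min(int(x * n_bins), n_bins - 1)    # covariate's bin

        def ucb(a):                             # UCB index within bin b
            if counts[b][a] == 0:
                return float("inf")             # force initial exploration
            mean = sums[b][a] / counts[b][a]
            return mean + math.sqrt(2.0 * math.log(t) / counts[b][a])

        a = max(range(k), key=ucb)
        means = [f(x) for f in reward_fns]
        counts[b][a] += 1
        sums[b][a] += means[a] + rng.gauss(0.0, 0.1)   # noisy payoff
        regret += max(means) - means[a]                # per-round gap
    return regret

# Two arms whose mean payoffs cross at x = 0.5, so the better arm
# depends on the covariate.
arms = [lambda x: x, lambda x: 1.0 - x]
r = binned_ucb_regret(T=2000, beta=1.0, reward_fns=arms)
```

Misspecifying `beta` here only changes the bin count, but in either direction it degrades the regret rate, which is the failure mode the adaptive policy is designed to avoid.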

MSC:

62-XX Statistics

References:

[1] Agrawal R (1995) The continuum-armed bandit problem. SIAM J. Control Optim. 33(6):1926-1951. · Zbl 0848.93069 · doi:10.1137/S0363012992237273
[2] Agrawal S, Avadhanula V, Goyal V, Zeevi A (2019) MNL-bandit: A dynamic learning approach to assortment selection. Oper. Res. 67(5):1453-1485. · Zbl 1444.90021
[3] Armstrong TB, Kolesár M (2018) Optimal inference in a class of regression models. Econometrica 86(2):655-683. · Zbl 1414.62150 · doi:10.3982/ECTA14434
[4] Audibert J-Y, Tsybakov AB (2007) Fast learning rates for plug-in classifiers. Ann. Statist. 35(2):608-633. · Zbl 1118.62041 · doi:10.1214/009053606000001217
[5] Auer P, Ortner R, Szepesvári C (2007) Improved rates for the stochastic continuum-armed bandit problem. Bshouty N, Gentile C, eds. Internat. Conf. Comput. Learn. Theory (Springer, Berlin, Heidelberg, Germany), 454-468. · Zbl 1203.68134
[6] Ban G-Y, Keskin NB (2021) Personalized dynamic pricing with machine learning: High-dimensional features and heterogeneous elasticity. Management Sci. 67(9):5549-5568.
[7] Bastani H, Bayati M (2020) Online decision making with high-dimensional covariates. Oper. Res. 68(1):276-294. · Zbl 1445.90042
[8] Bastani H, Bayati M, Khosravi K (2021a) Mostly exploration-free algorithms for contextual bandits. Management Sci. 67(3):1329-1349.
[9] Bastani H, Simchi-Levi D, Zhu R (2021b) Meta dynamic pricing: Learning across experiments. Management Sci., ePub ahead of print September 9, https://doi.org/10.1287/mnsc.2021.4071.
[10] Bastani H, Harsha P, Perakis G, Singhvi D (2018) Learning personalized product recommendations with customer disengagement. Preprint, submitted August 29, http://dx.doi.org/10.2139/ssrn.3240970.
[11] Bickel PJ, Rosenblatt M (1973) On some global measures of the deviations of density function estimates. Ann. Statist. 1(6):1071-1095. · Zbl 0275.62033 · doi:10.1214/aos/1176342558
[12] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations Trends Machine Learn. 5(1):1-122. · Zbl 1281.91051 · doi:10.1561/2200000024
[13] Bubeck S, Munos R, Stoltz G (2009a) Pure exploration in multi-armed bandits problems. Gavaldà R, Lugosi G, Zeugmann T, Zilles S, eds. Internat. Conf. Algorithmic Learn. Theory (Springer, Berlin, Heidelberg, Germany), 23-37. · Zbl 1262.68061
[14] Bubeck S, Stoltz G, Szepesvári C, Munos R (2009b) Online optimization in X-armed bandits. Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 201-208.
[15] Bull AD (2012) Honest adaptive confidence bands and self-similar functions. Electronic J. Statist. 6:1490-1516. · Zbl 1295.62049
[16] Bull AD, Nickl R (2013) Adaptive confidence sets in L². Probab. Theory Related Fields 156(3-4):889-919. · Zbl 1273.62105 · doi:10.1007/s00440-012-0446-z
[17] Chandrashekar A, Amat F, Basilico J, Jebara T (2017) Artwork personalization at Netflix. Netflix Tech Blog (December 7), https://netflixtechblog.com/artwork-personalization-c589f074ad76.
[18] Chatterji N, Muthukumar V, Bartlett P (2020) OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. Chiappa S, Calandra R, eds. Internat. Conf. Artificial Intelligence Statist. (PMLR, Cambridge, MA), 1844-1854.
[19] Chick SE, Gans N, Yapar O (2021) Bayesian sequential learning for clinical trials of multiple correlated medical interventions. Management Sci., ePub ahead of print November 5, https://doi.org/10.1287/mnsc.2021.4137.
[20] Chu W, Li L, Reyzin L, Schapire RE (2011) Contextual bandits with linear payoff functions. Gordon G, Dunson D, Dudík M, eds. Internat. Conf. Artificial Intelligence Statist. (JMLR, Cambridge, MA), 208-214.
[21] Cohen MC, Lobel I, Paes Leme R (2020) Feature-based dynamic pricing. Management Sci. 66(11):4921-4943.
[22] Donoho DL, Johnstone IM (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3):425-455. · Zbl 0815.62019 · doi:10.1093/biomet/81.3.425
[23] Donoho DL, Johnstone IM, Kerkyacharian G, Picard D (1995) Wavelet shrinkage: Asymptopia? J. Roy. Statist. Soc. B 57(2):301-337. · Zbl 0827.62035
[24] Dudík M, Hsu D, Kale S, Karampatziakis N, Langford J, Reyzin L, Zhang T (2011) Efficient optimal learning for contextual bandits. Preprint, submitted June 13, https://arxiv.org/abs/1106.2369.
[25] Foster DJ, Krishnamurthy A, Luo H (2019) Model selection for contextual bandits. Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 14741-14752.
[26] Giné E, Nickl R (2010) Confidence bands in density estimation. Ann. Statist. 38(2):1122-1170. · Zbl 1183.62062 · doi:10.1214/09-AOS738
[27] Goldenshluger A, Nemirovski A (1997) On spatially adaptive estimation of nonparametric regression. Math. Methods Statist. 6(2):135-170. · Zbl 0892.62018
[28] Goldenshluger A, Zeevi A (2013) A linear response bandit problem. Stochastic Systems 3(1):230-261. · Zbl 1352.91009
[29] Gur Y, Momeni A (2019) Adaptive sequential experiments with unknown information arrival processes. Preprint, submitted June 28, https://arxiv.org/abs/1907.00107v1.
[30] Hall P (1992) Effect of bias estimation on coverage accuracy of bootstrap confidence intervals for a probability density. Ann. Statist. 20(2):675-694. · Zbl 0748.62028 · doi:10.1214/aos/1176348651
[31] Hu Y, Kallus N, Mao X (2019) Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes. Preprint, submitted September 5, https://arxiv.org/abs/1909.02553v1.
[32] Javanmard A, Nazerzadeh H (2019) Dynamic pricing in high-dimensions. J. Machine Learn. Res. 20(1):315-363. · Zbl 1484.91202
[33] Juditsky A (1997) Wavelet estimators: Adapting to unknown smoothness. Math. Methods Statist. 6(1):1-25. · Zbl 0871.62039
[34] Kallus N, Udell M (2020) Dynamic assortment personalization in high dimensions. Oper. Res. 68(4):1020-1037. · Zbl 1451.90077
[35] Kleinberg RD (2005) Nearly tight bounds for the continuum-armed bandit problem. Weiss Y, Schölkopf B, Platt J, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 697-704.
[36] Langford J, Zhang T (2008) The epoch-greedy algorithm for multi-armed bandits with side information. Koller D, Schuurmans D, Bengio Y, Bottou L, eds. Advances in Neural Information Processing Systems (Curran Associates, Red Hook, NY), 817-824.
[37] Lepski OV, Mammen E, Spokoiny VG (1997) Optimal spatial adaptation to inhomogeneous smoothness: An approach based on kernel estimates with variable bandwidth selectors. Ann. Statist. 25(3):929-947. · Zbl 0885.62044 · doi:10.1214/aos/1069362731
[38] Lepskii O (1992) Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates. Theory Probab. Appl. 36(4):682-697. · Zbl 0776.62039 · doi:10.1137/1136085
[39] Li W, Chen N, Hong LJ (2020) Dimension reduction in contextual online learning via nonparametric variable selection. Preprint, submitted September 17, https://arxiv.org/abs/2009.08265.
[40] Locatelli A, Carpentier A (2018) Adaptivity to smoothness in X-armed bandits. Bubeck S, Perchet V, Rigollet P, eds. Conf. Learn. Theory (PMLR, Cambridge, MA), 1463-1492.
[41] Low MG (1997) On nonparametric confidence intervals. Ann. Statist. 25(6):2547-2554. · Zbl 0894.62055 · doi:10.1214/aos/1030741084
[42] Mukherjee R, Sen S (2018) Optimal adaptive inference in random design binary regression. Bernoulli 24(1):699-739. · Zbl 1387.62053 · doi:10.3150/16-BEJ893
[43] Nickl R, Szabó B (2016) A sharp adaptive confidence ball for self-similar functions. Stochastic Processes Their Appl. 126(12):3913-3934. · Zbl 1348.62165 · doi:10.1016/j.spa.2016.04.017
[44] Nickl R, van de Geer S (2013) Confidence sets in sparse regression. Ann. Statist. 41(6):2852-2876. · Zbl 1288.62108 · doi:10.1214/13-AOS1170
[45] Perchet V, Rigollet P (2013) The multi-armed bandit problem with covariates. Ann. Statist. 41(2):693-721. · Zbl 1360.62436 · doi:10.1214/13-AOS1101
[46] Picard D, Tribouley K (2000) Adaptive confidence interval for pointwise curve estimation. Ann. Statist. 28(1):298-335. · Zbl 1106.62331 · doi:10.1214/aos/1016120374
[47] Qian W, Yang Y (2016) Randomized allocation with arm elimination in a bandit problem with covariates. Electronic J. Statist. 10(1):242-270. · Zbl 1332.62138 · doi:10.1214/15-EJS1104
[48] Qiang S, Bayati M (2016) Dynamic pricing with demand covariates. Preprint, submitted April 25, https://arxiv.org/abs/1604.07463.
[49] Reeve HWJ, Mellor J, Brown G (2018) The k-nearest neighbour UCB algorithm for multi-armed bandits with covariates. Preprint, submitted March 1, https://arxiv.org/abs/1803.00316.
[50] Rigollet P, Zeevi A (2010) Nonparametric bandits with covariates. Kalai AT, Mohri M, eds. Conf. Learn. Theory (Omnipress, Madison, WI), 54-66.
[51] Robbins H (1952) Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58(5):527-535. · Zbl 0049.37009 · doi:10.1090/S0002-9904-1952-09620-8
[52] Tewari A, Murphy SA (2017) From ads to interventions: Contextual bandits in mobile health. Rehg J, Murphy S, Kumar S, eds. Mobile Health (Springer, Cham, Switzerland), 495-517. · doi:10.1007/978-3-319-51394-2_25
[53] Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285-294. · JFM 59.1159.03 · doi:10.2307/2332286
[54] Wang C-C, Kulkarni SR, Poor HV (2005) Bandit problems with side observations. IEEE Trans. Automatic Control 50(3):338-355. · Zbl 1366.91063 · doi:10.1109/TAC.2005.844079
[55] Wang Y, Chen B, Simchi-Levi D (2019) Multi-modal dynamic pricing. Preprint, submitted November 26, http://dx.doi.org/10.2139/ssrn.3489355.
[56] Woodroofe M (1979) A one-armed bandit problem with a concomitant variable. J. Amer. Statist. Assoc. 74(368):799-806. · Zbl 0442.62063 · doi:10.1080/01621459.1979.10481033
[57] Yang Y, Zhu D (2002) Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. Ann. Statist. 30(1):100-121. · Zbl 1012.62088 · doi:10.1214/aos/1015362186
[58] Zhou Z, Wang Y, Mamani H, Coffey DG (2019) How do tumor cytogenetics inform cancer treatments? Dynamic risk stratification and precision medicine using multi-armed bandits. Preprint, submitted July 12, http://dx.doi.org/10.2139/ssrn.3405082.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases, these data have been complemented or enhanced by data from zbMATH Open. This list attempts to reflect the references of the original paper as accurately as possible, without claiming completeness or a perfect matching.