
Transfer learning for contextual multi-armed bandits. (English) Zbl 1539.62097

Summary: Motivated by a range of applications, we study the problem of transfer learning for nonparametric contextual multi-armed bandits under the covariate shift model, where data collected from source bandits are available before learning on the target bandit begins. The minimax rate of convergence for the cumulative regret is established, and a novel transfer learning algorithm that attains the minimax regret is proposed. The results quantify the contribution of the data from the source domains to learning in the target domain in the context of nonparametric contextual multi-armed bandits.
In view of the general impossibility of adaptation to unknown smoothness, we develop a data-driven algorithm that achieves near-optimal statistical guarantees (up to a logarithmic factor) while automatically adapting to the unknown parameters over a large collection of parameter spaces under an additional self-similarity assumption. A simulation study is carried out to illustrate the benefits of utilizing the data from the source domains for learning in the target domain.
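To make the setting concrete, here is a minimal simulation sketch in the spirit of the simulation study mentioned above. It is not the authors' minimax-optimal procedure: it runs an ordinary binned UCB policy on the target bandit and, optionally, warm-starts the per-bin statistics with randomly assigned pulls collected under a shifted source covariate density (the covariate shift). The two reward functions, the Beta(2, 5) source density, the noise level, and the number of bins are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative smooth mean-reward functions for two arms (assumed, not from the paper).
def mean_reward(arm, x):
    return np.sin(3.0 * x) if arm == 0 else 0.5 + 0.3 * x

def pull(arm, x):
    return mean_reward(arm, x) + rng.normal(scale=0.1)

N_BINS, T, N_SOURCE = 20, 5000, 5000

def run(use_source):
    # Per-bin, per-arm pull counts and reward sums for a simple binned UCB rule.
    counts = np.full((N_BINS, 2), 1e-9)
    sums = np.zeros((N_BINS, 2))

    if use_source:
        # Source data: covariates from a shifted density (covariate shift),
        # arms assigned uniformly at random before target learning starts.
        for x in rng.beta(2.0, 5.0, size=N_SOURCE):
            a = int(rng.integers(2))
            b = min(int(x * N_BINS), N_BINS - 1)
            counts[b, a] += 1.0
            sums[b, a] += pull(a, x)

    regret = 0.0
    for t in range(1, T + 1):
        x = rng.uniform()                          # target covariate, Uniform(0, 1)
        b = min(int(x * N_BINS), N_BINS - 1)
        ucb = sums[b] / counts[b] + np.sqrt(2.0 * np.log(t + 1.0) / counts[b])
        a = int(np.argmax(ucb))
        counts[b, a] += 1.0
        sums[b, a] += pull(a, x)
        regret += max(mean_reward(0, x), mean_reward(1, x)) - mean_reward(a, x)
    return regret

print("cumulative regret, target data only:", round(run(False), 1))
print("cumulative regret, with source data:", round(run(True), 1))
```

Running both variants typically shows a smaller cumulative regret for the warm-started policy, illustrating in miniature the benefit of source-domain data that the paper quantifies.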

MSC:

62G08 Nonparametric regression and quantile regression
62L12 Sequential estimation
62G15 Nonparametric tolerance and confidence regions
68T05 Learning and adaptive systems in artificial intelligence
