Online bootstrap inference for policy evaluation in reinforcement learning. (English) Zbl 07784954

Summary: The recent emergence of reinforcement learning (RL) has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for inference in online learning are restricted to settings involving independently sampled observations, while inference methods in RL have so far been limited to the batch setting. The bootstrap is a flexible and efficient approach for statistical inference in online learning algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this article, we study the use of the online bootstrap method for inference in RL policy evaluation. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm across a range of real RL environments. Supplementary materials for this article are available online.
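The summary describes the procedure only at a high level; as a rough illustration, the following is a minimal sketch of an online multiplier bootstrap wrapped around TD(0) with linear function approximation. It is not the authors' exact algorithm: the environment interface (env.reset, env.step), the policy pi, the feature map phi, the step-size schedule, and the exponential multiplier weights are all illustrative assumptions. Each transition's TD update is replicated with independent mean-one random weights, and the spread of the Polyak-Ruppert averaged replicates around the averaged iterate yields per-coordinate confidence intervals for the value-function parameters.

```python
# Minimal sketch (not the authors' exact procedure) of an online multiplier
# bootstrap for TD(0) with linear function approximation. The environment
# interface (env.reset/env.step), policy pi, feature map phi, and all
# constants below are illustrative assumptions.
import numpy as np

def online_bootstrap_td(env, pi, phi, d, gamma=0.95, n_steps=10_000,
                        n_boot=200,
                        step_size=lambda t: 0.5 * (t + 1) ** -0.67,
                        rng=np.random.default_rng(0)):
    theta = np.zeros(d)               # main TD(0) iterate
    theta_bar = np.zeros(d)           # Polyak-Ruppert average of the iterates
    boot = np.zeros((n_boot, d))      # randomly perturbed replicates
    boot_bar = np.zeros((n_boot, d))  # their running averages

    s = env.reset()
    for t in range(n_steps):
        a = pi(s)
        s_next, r, done = env.step(a)           # assumed interface
        x, x_next = phi(s), phi(s_next)
        alpha = step_size(t)

        # Standard TD(0) semi-gradient update at the current iterate
        td_err = r + gamma * x_next @ theta - x @ theta
        theta += alpha * td_err * x
        theta_bar += (theta - theta_bar) / (t + 1)

        # Bootstrap replicates: same update scaled by i.i.d. mean-one weights
        w = rng.exponential(1.0, size=n_boot)            # E[w] = 1, Var(w) = 1
        boot_err = r + gamma * boot @ x_next - boot @ x  # per-replicate TD error
        boot += alpha * (w * boot_err)[:, None] * x
        boot_bar += (boot - boot_bar) / (t + 1)

        s = env.reset() if done else s_next

    # Per-coordinate 95% intervals from the spread of the bootstrap averages
    # around the averaged iterate
    half_width = np.quantile(np.abs(boot_bar - theta_bar), 0.95, axis=0)
    return theta_bar, theta_bar - half_width, theta_bar + half_width
```

Because the replicates reuse the same observed transitions and differ only through the random weights, the whole procedure runs in a single pass over the data stream, which is what makes it suitable for the online setting discussed in the summary.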

MSC:

62-XX Statistics

Software:

CausalRL; OpenAI Gym
