
Target network and truncation overcome the deadly triad in \(Q\)-learning. (English) Zbl 1532.68075

Summary: \(Q\)-learning with function approximation is one of the most empirically successful yet theoretically mysterious reinforcement learning (RL) algorithms, and was identified in [R. S. Sutton, Lect. Notes Comput. Sci. 1572, 11–17 (1999; doi:10.1007/3-540-49097-3_2)] as one of the most important theoretical open problems in the RL community. Even in the basic setting where linear function approximation is used, there are well-known divergent examples. In this work, we propose a stable online variant of \(Q\)-learning with linear function approximation that uses a target network and truncation and is driven by a single trajectory of Markovian samples. We establish finite-sample guarantees for the algorithm, which imply a sample complexity of \(\tilde{\mathcal{O}}(\epsilon^{-2})\) up to a function approximation error. Importantly, the results hold under minimal assumptions, and we do not modify the problem parameters to achieve stability.
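
The summary describes the algorithm only at a high level. The following Python sketch illustrates, under assumptions made for this example, how such an update could look: an online semi-gradient \(Q\)-learning step with linear function approximation, a periodically synchronized target network, and truncation of the iterates along a single trajectory. It is a minimal illustration, not the authors' algorithm; the Gym-style environment interface, the feature map `phi`, all hyperparameters, and the particular form of truncation (projection of the weight vector onto an \(\ell_2\)-ball) are hypothetical choices for the sketch.

```python
# Minimal sketch (assumptions noted above): Q-learning with linear function
# approximation, a target network, and truncation, run on a single trajectory.
import numpy as np

def q_learning_target_truncation(env, phi, num_actions, d,
                                 gamma=0.99, alpha=0.05, eps=0.1,
                                 target_period=100, radius=10.0,
                                 num_steps=10_000, seed=0):
    """phi(s, a) -> R^d feature vector; env follows a Gym-like API (hypothetical)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)          # online weight vector
    theta_target = theta.copy()  # target-network weights, frozen between syncs

    state, _ = env.reset(seed=seed)
    for t in range(num_steps):
        # epsilon-greedy behavior policy along the single trajectory
        if rng.random() < eps:
            action = int(rng.integers(num_actions))
        else:
            action = int(np.argmax([phi(state, a) @ theta for a in range(num_actions)]))

        next_state, reward, terminated, truncated, _ = env.step(action)

        # Bellman target computed with the *target* weights
        q_next = max(phi(next_state, a) @ theta_target for a in range(num_actions))
        td_error = reward + gamma * (0.0 if terminated else q_next) - phi(state, action) @ theta

        # semi-gradient update, then truncation: project theta back onto an
        # l2-ball of the given radius to keep the iterates bounded
        theta += alpha * td_error * phi(state, action)
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm

        # periodically synchronize the target network
        if (t + 1) % target_period == 0:
            theta_target = theta.copy()

        state = next_state
        if terminated or truncated:
            state, _ = env.reset()

    return theta
```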

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68T07 Artificial neural networks and deep learning
68T09 Computational aspects of data analysis and big data
90C40 Markov and semi-Markov decision processes
62L20 Stochastic approximation

References:

[1] Agarwal, N., Chaudhuri, S., Jain, P., Nagaraj, D. M., and Netrapalli, P., Online target \(Q\)-learning with reverse experience replay: Efficiently finding the optimal policy for linear MDPs, in Proceedings of the International Conference on Learning Representations, 2021.
[2] Baird, L., Residual algorithms: Reinforcement learning with function approximation, in Machine Learning Proceedings 1995, Elsevier, Amsterdam, 1995, pp. 30-37.
[3] Bertsekas, D. P. and Tsitsiklis, J. N., Neuro-Dynamic Programming, Athena Scientific, Nashua, NH, 1996. · Zbl 0924.68163
[4] Bhandari, J., Russo, D., and Singal, R., A finite-time analysis of temporal difference learning with linear function approximation, in Proceedings of the Conference on Learning Theory, 2018, pp. 1691-1692.
[5] Borkar, V. S., A concentration bound for contractive stochastic approximation, Systems Control Lett., 153 (2021), 104947. · Zbl 1475.93106
[6] Borkar, V. S. and Meyn, S. P., The ODE method for convergence of stochastic approximation and reinforcement learning, SIAM J. Control Optim., 38 (2000), pp. 447-469. · Zbl 0990.62071
[7] Cai, Q., Yang, Z., Lee, J. D., and Wang, Z., Neural temporal difference and \(Q\)-learning provably converge to global optima, Math. Oper. Res., (2023), doi:10.1287/moor.2023.1370.
[8] Carvalho, D., Melo, F. S., and Santos, P., A new convergent variant of \(Q\)-learning with linear function approximation, Adv. Neural Inf. Process. Syst., 33 (2020).
[9] Chen, Z., Clarke, J. P., and Maguluri, S. T., Target Network and Truncation Overcome the Deadly Triad in \(Q\)-Learning, preprint, arXiv:2203.02628, 2022.
[10] Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K., Finite-sample analysis of off-policy TD-learning via generalized Bellman operators, Adv. Neural Inf. Process. Syst., 34 (2021), pp. 21440-21452.
[11] Chen, Z., Zhang, S., Doan, T. T., Clarke, J.-P., and Maguluri, S. T., Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning, Automatica, 146 (2022), 110623. · Zbl 1504.93364
[12] Devraj, A. M. and Meyn, S., Zap \(Q\)-learning, in Advances in Neural Information Processing Systems, 2017, pp. 2235-2244.
[13] Du, S. S., Lee, J. D., Mahajan, G., and Wang, R., Agnostic \(Q\)-learning with function approximation in deterministic systems: Near-optimal bounds on approximation error and sample complexity, in Advances in Neural Information Processing Systems, 2020.
[14] Du, S. S., Luo, Y., Wang, R., and Zhang, H., Provably efficient \(Q\)-learning with function approximation via distribution shift error checking oracle, in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 8060-8070.
[15] Duan, Y., Jia, Z., and Wang, M., Minimax-optimal off-policy evaluation with linear function approximation, in International Conference on Machine Learning, PMLR, 2020, pp. 2701-2709.
[16] Ernst, D., Geurts, P., and Wehenkel, L., Tree-based batch mode reinforcement learning, J. Mach. Learn. Res., 6 (2005), pp. 503-556. · Zbl 1222.68193
[17] Fan, J., Wang, Z., Xie, Y., and Yang, Z., A theoretical analysis of deep \(Q\)-learning, in Learning for Dynamics and Control, PMLR, 2020, pp. 486-489.
[18] Gao, Z., Ma, Q., Başar, T., and Birge, J. R., Finite-Sample Analysis of Decentralized \(Q\)-Learning for Stochastic Games, preprint, arXiv:2112.07859, 2021.
[19] Györfi, L., Kohler, M., Krzyzak, A., and Walk, H., A Distribution-Free Theory of Nonparametric Regression, Springer, Berlin, 2002. · Zbl 1021.62024
[20] Hasselt, H., Double \(Q\)-learning, Adv. Neural Inf. Process. Syst., 23 (2010), pp. 2613-2621.
[21] Jaakkola, T., Jordan, M. I., and Singh, S. P., Convergence of stochastic iterative dynamic programming algorithms, in Advances in Neural Information Processing Systems, 1994, pp. 703-710. · Zbl 0822.68095
[22] Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I., Is \(Q\)-learning provably efficient?, in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 4868-4878.
[23] Jin, C., Yang, Z., Wang, Z., and Jordan, M. I., Provably efficient reinforcement learning with linear function approximation, in Proceedings of the Conference on Learning Theory, PMLR, 2020, pp. 2137-2143.
[24] Khodadadian, S., Chen, Z., and Maguluri, S. T., Finite-sample analysis of off-policy natural actor-critic algorithm, in Proceedings of the International Conference on Machine Learning, PMLR, 2021, pp. 5420-5431.
[25] Lee, D. and He, N., A unified switching system perspective and convergence analysis of \(Q\)-learning algorithms, Adv. Neural Inf. Process. Syst., 33 (2020), pp. 15556-15567.
[26] Levin, D. A. and Peres, Y., Markov Chains and Mixing Times, AMS, Providence, RI, 2017. · Zbl 1390.60001
[27] Li, G., Chen, Y., Chi, Y., Gu, Y., and Wei, Y., Sample-efficient reinforcement learning is feasible for linearly realizable MDPs with limited revisiting, Adv. Neural Inf. Process. Syst., 34 (2021).
[28] Li, G., Shi, L., Chen, Y., Gu, Y., and Chi, Y., Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning, Adv. Neural Inf. Process. Syst., 34 (2021).
[29] Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y., Sample complexity of asynchronous \(Q\)-learning: Sharper analysis and variance reduction, Adv. Neural Inf. Process. Syst., 33 (2020), pp. 7031-7043.
[30] Ma, S., Chen, Z., Zhou, Y., and Zou, S., Greedy-GQ with variance reduction: Finite-time analysis and improved complexity, in Proceedings of the International Conference on Learning Representations, 2021.
[31] Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S., Toward off-policy learning control with function approximation, in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 719-726.
[32] Melo, F. S., Meyn, S. P., and Ribeiro, M. I., An analysis of reinforcement learning with function approximation, in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 664-671.
[33] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al., Human-level control through deep reinforcement learning, Nature, 518 (2015), pp. 529-533.
[34] Munos, R. and Szepesvári, C., Finite-time bounds for fitted value iteration, J. Mach. Learn. Res., 9 (2008). · Zbl 1225.68203
[35] Qu, G. and Wierman, A., Finite-time analysis of asynchronous stochastic approximation and \(Q\)-learning, in Proceedings of the Conference on Learning Theory, PMLR, 2020, pp. 3185-3205.
[36] Robbins, H. and Monro, S., A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400-407. · Zbl 0054.05901
[37] Singh, S. P. and Yee, R. C., An upper bound on the loss from approximate optimal-value functions, Mach. Learn., 16 (1994), pp. 227-233. · Zbl 0939.68781
[38] Srikant, R. and Ying, L., Finite-time error bounds for linear stochastic approximation and TD-learning, in Proceedings of the Conference on Learning Theory, 2019, pp. 2803-2830.
[39] Sutton, R. S., Open theoretical questions in reinforcement learning, in European Conference on Computational Learning Theory, Springer, Berlin, 1999, pp. 11-17.
[40] Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 2018. · Zbl 1407.68009
[41] Szepesvári, C. and Munos, R., Finite time bounds for sampling-based fitted value iteration, in Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 880-887.
[42] Tsitsiklis, J. N., Asynchronous stochastic approximation and \(Q\)-learning, Mach. Learn., 16 (1994), pp. 185-202. · Zbl 0820.68105
[43] Tsitsiklis, J. N. and Van Roy, B., An analysis of temporal-difference learning with function approximation, IEEE Trans. Automat. Control, 42 (1997), pp. 674-690. · Zbl 0914.93075
[44] Wang, R., Foster, D., and Kakade, S. M., What are the statistical limits of offline RL with linear function approximation?, in Proceedings of the International Conference on Learning Representations, 2020.
[45] Wang, Y. and Zou, S., Finite-sample analysis of Greedy-GQ with linear function approximation under Markovian noise, in Proceedings of the Conference on Uncertainty in Artificial Intelligence, PMLR, 2020, pp. 11-20.
[46] Watkins, C. J. and Dayan, P., \(Q\)-learning, Mach. Learn., 8 (1992), pp. 279-292. · Zbl 0773.68062
[47] Xie, T. and Jiang, N., Batch value-function approximation with only realizability, in Proceedings of the International Conference on Machine Learning, PMLR, 2021, pp. 11404-11413.
[48] Xu, P. and Gu, Q., A finite-time analysis of \(Q\)-learning with neural network function approximation, in Proceedings of the International Conference on Machine Learning, PMLR, 2020, pp. 10555-10565.
[49] Xu, T. and Liang, Y., Sample complexity bounds for two timescale value-based reinforcement learning algorithms, in Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 811-819.
[50] Yang, L. and Wang, M., Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound, in Proceedings of the International Conference on Machine Learning, PMLR, 2020, pp. 10746-10756.
[51] Zanette, A., Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL, in Proceedings of the International Conference on Machine Learning, PMLR, 2021, pp. 12287-12297.
[52] Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E., Learning near optimal policies with low inherent Bellman error, in Proceedings of the International Conference on Machine Learning, PMLR, 2020, pp. 10978-10989.
[53] Zanette, A. and Wainwright, M., Stabilizing \(Q\)-learning with linear architectures for provable efficient learning, in Proceedings of the International Conference on Machine Learning, PMLR, 2022, pp. 25920-25954.
[54] Zhang, S., Yao, H., and Whiteson, S., Breaking the deadly triad with a target network, in Proceedings of the 38th International Conference on Machine Learning, PMLR, 2021, pp. 12621-12631.
[55] Zou, S., Xu, T., and Liang, Y., Finite-sample analysis for SARSA with linear function approximation, in Advances in Neural Information Processing Systems, 2019, pp. 8668-8678.