Abstract
The out-of-distribution (OOD) problem is a central obstacle in offline reinforcement learning. Existing approaches conservatively estimate the Q-values of OOD actions, but imposing this conservatism through a fixed constraint can be overly pessimistic and impair learning throughout policy training. Moreover, the diverse task distributions across environments and behavior policies call for tailored solutions. To address these challenges, we propose the Adaptable Conservative Q-Learning (ACQ) method, which exploits the Q-value distribution of each fixed dataset to construct a highly generalizable metric that balances the conservative constraint against the training objective. Experimental results show that ACQ is competitive with a wide range of offline RL algorithms and significantly improves CQL's normalized return on most D4RL MuJoCo locomotion tasks.
L. Qiu—Work partly done during internship at Cognitive Computing Lab, Baidu Research.
J. Yan—The SJTU authors were supported by NSFC (61972250, U19B2035), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).
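As a rough illustration of the mechanism the abstract describes (scaling CQL's conservative penalty by a statistic of the dataset's Q-value distribution), the following is a minimal PyTorch sketch. It is an assumption-laden stand-in rather than the paper's implementation: `q_net`, `target_q`, and `policy` are hypothetical modules, the penalty is a simplified CQL-style gap between policy-action and dataset-action Q-values, and the standard-deviation scaling rule merely approximates the role of ACQ's adaptive metric.

```python
import torch
import torch.nn.functional as F

def adaptive_cql_critic_loss(q_net, target_q, policy, batch,
                             alpha_base=5.0, gamma=0.99):
    """Illustrative CQL-style critic loss whose conservatism weight
    adapts to the batch Q-value distribution (not the exact ACQ rule)."""
    s, a, r, s_next, done = batch  # tensors drawn from the fixed dataset

    # Standard Bellman backup, as in the actor-critic that CQL builds on.
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * target_q(s_next, a_next)

    q_data = q_net(s, a)
    bellman_loss = F.mse_loss(q_data, target)

    # Simplified conservative penalty: push Q-values of policy actions
    # down and Q-values of dataset actions up.
    penalty = (q_net(s, policy(s)) - q_data).mean()

    # Hypothetical adaptive weight: when the dataset's Q-values are
    # widely spread, shrink the penalty weight; when they are tightly
    # concentrated, keep the conservative constraint strong.
    alpha = alpha_base / (q_data.std().detach() + 1e-6)

    return bellman_loss + alpha * penalty
```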
References
Agarwal, A., Kakade, S., Yang, L.F.: Model-based reinforcement learning with a generative model is minimax optimal. In: Conference on Learning Theory, pp. 67–83. PMLR (2020)
Agarwal, R., Schuurmans, D., Norouzi, M.: An optimistic perspective on offline reinforcement learning. In: ICML, pp. 104–114. PMLR (2020)
Ajay, A., Kumar, A., Agrawal, P., Levine, S., Nachum, O.: OPAL: offline primitive discovery for accelerating offline reinforcement learning. arXiv preprint arXiv:2010.13611 (2020)
An, G., Moon, S., Kim, J.H., Song, H.O.: Uncertainty-based offline reinforcement learning with diversified Q-ensemble. Adv. Neural. Inf. Process. Syst. 34, 7436–7447 (2021)
Azar, M.G., Osband, I., Munos, R.: Minimax regret bounds for reinforcement learning. In: International Conference on Machine Learning, pp. 263–272. PMLR (2017)
Chen, L., et al.: Decision transformer: reinforcement learning via sequence modeling. Adv. Neural. Inf. Process. Syst. 34, 15084–15097 (2021)
Dulac-Arnold, G., et al.: Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Mach. Learn. 110(9), 2419–2468 (2021)
Fu, J., Kumar, A., Nachum, O., Tucker, G., Levine, S.: D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219 (2020)
Fujimoto, S., Gu, S.S.: A minimalist approach to offline reinforcement learning. Adv. Neural. Inf. Process. Syst. 34, 20132–20145 (2021)
Guiñón, J.L., Ortega, E., García-Antón, J., Pérez-Herranz, V.: Moving average and Savitzky-Golay smoothing filters using Mathcad. Papers ICEE 2007, 1–4 (2007)
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: ICML, pp. 1861–1870. PMLR (2018)
Jaques, N., et al.: Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456 (2019)
Jin, Y., Yang, Z., Wang, Z.: Is pessimism provably efficient for offline RL? In: International Conference on Machine Learning, pp. 5084–5096. PMLR (2021)
Kendall, A., et al.: Learning to drive in a day. In: ICRA, pp. 8248–8254. IEEE (2019)
Kidambi, R., Rajeswaran, A., Netrapalli, P., Joachims, T.: MOReL: model-based offline reinforcement learning. Adv. Neural. Inf. Process. Syst. 33, 21810–21823 (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kostrikov, I., Nair, A., Levine, S.: Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169 (2021)
Kumar, A., Zhou, A., Tucker, G., Levine, S.: Conservative Q-learning for offline reinforcement learning. Adv. Neural. Inf. Process. Syst. 33, 1179–1191 (2020)
Lange, S., Gabel, T., Riedmiller, M.: Batch reinforcement learning. In: Reinforcement Learning: State-of-the-Art, pp. 45–73. Springer (2012)
Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T.: Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037 (2017)
Levine, S., Kumar, A., Tucker, G., Fu, J.: Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020)
Li, G., Wei, Y., Chi, Y., Gu, Y., Chen, Y.: Breaking the sample size barrier in model-based reinforcement learning with a generative model. Adv. Neural. Inf. Process. Syst. 33, 12861–12872 (2020)
Liu, Y., Swaminathan, A., Agarwal, A., Brunskill, E.: Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473 (2019)
Lyu, J., Ma, X., Li, X., Lu, Z.: Mildly conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2206.04745 (2022)
Ma, Y., Jayaraman, D., Bastani, O.: Conservative offline distributional reinforcement learning. Adv. Neural. Inf. Process. Syst. 34, 19235–19247 (2021)
Munemasa, I., Tomomatsu, Y., Hayashi, K., Takagi, T.: Deep reinforcement learning for recommender systems. In: ICOIACT, pp. 226–233. IEEE (2018)
Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., Schuurmans, D.: Algaedice: policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074 (2019)
O’Donoghue, B., Osband, I., Munos, R., Mnih, V.: The uncertainty bellman equation and exploration. In: International Conference on Machine Learning, pp. 3836–3845 (2018)
Ovadia, Y., et al.: Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In: Advances in Neural Information Processing Systems 32 (2019)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32 (2019)
Precup, D., Sutton, R.S., Dasgupta, S.: Off-policy temporal-difference learning with function approximation. In: ICML, pp. 417–424 (2001)
Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., Russell, S.: Bridging offline reinforcement learning and imitation learning: a tale of pessimism. Adv. Neural. Inf. Process. Syst. 34, 11702–11716 (2021)
Sinha, S., Mandlekar, A., Garg, A.: S4rl: surprisingly simple self-supervision for offline reinforcement learning in robotics. In: Conference on Robot Learning, pp. 907–917. PMLR (2022)
Sutton, R.S., Mahmood, A.R., White, M.: An emphatic approach to the problem of off-policy temporal-difference learning. J. Mach. Learn. Res. 17(1), 2603–2631 (2016)
Todorov, E., Erez, T., Tassa, Y.: MuJoCo: a physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE (2012)
Vinyals, O., et al.: StarCraft II: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782 (2017)
Wu, K., et al.: ACQL: an adaptive conservative Q-learning framework for offline reinforcement learning (2022)
Wu, Y., Tucker, G., Nachum, O.: Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361 (2019)
Wu, Y., et al.: Uncertainty weighted actor-critic for offline reinforcement learning. arXiv preprint arXiv:2105.08140 (2021)
Yu, T., Kumar, A., Rafailov, R., Rajeswaran, A., Levine, S., Finn, C.: COMBO: conservative offline model-based policy optimization. Adv. Neural. Inf. Process. Syst. 34, 28954–28967 (2021)
Yu, T., et al.: MOPO: model-based offline policy optimization. Adv. Neural. Inf. Process. Syst. 33, 14129–14142 (2020)
Zou, L., Xia, L., Ding, Z., Song, J., Liu, W., Yin, D.: Reinforcement learning to optimize long-term user engagement in recommender systems. In: SIGKDD, pp. 2810–2818 (2019)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Qiu, L., Li, X., Liang, L., Sun, M., Yan, J. (2024). Adaptable Conservative Q-Learning for Offline Reinforcement Learning. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14427. Springer, Singapore. https://doi.org/10.1007/978-981-99-8435-0_16
DOI: https://doi.org/10.1007/978-981-99-8435-0_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8434-3
Online ISBN: 978-981-99-8435-0