Abstract
The out-of-distribution (OOD) problem is a central obstacle in offline reinforcement learning. Existing approaches conservatively estimate the Q-values of OOD actions, but imposing this conservatism through a fixed constraint can be overly pessimistic and impair learning throughout policy training. Moreover, the diverse task distributions across environments and behavior policies call for tailored solutions. To address these challenges, we propose the Adaptable Conservative Q-Learning (ACQ) method, which exploits the Q-value distribution of each fixed dataset to construct a highly generalizable metric that balances the conservative constraint against the training objective. Experimental results show that ACQ is competitive with a wide range of offline RL algorithms and significantly improves CQL's normalized return on most D4RL MuJoCo locomotion tasks.
L. Qiu—Work partly done during internship at Cognitive Computing Lab, Baidu Research.
J. Yan—The SJTU authors were supported by NSFC (61972250, U19B2035), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).
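As a rough illustration of the mechanism the abstract describes (scaling CQL's conservative penalty by a statistic of the dataset's Q-value distribution), the following is a minimal PyTorch sketch. It is an assumption-laden stand-in rather than the paper's implementation: `q_net`, `target_q`, and `policy` are hypothetical modules, the penalty is a simplified CQL-style gap between policy-action and dataset-action Q-values, and the standard-deviation scaling rule merely approximates the role of ACQ's adaptive metric.

```python
import torch
import torch.nn.functional as F

def adaptive_cql_critic_loss(q_net, target_q, policy, batch,
                             alpha_base=5.0, gamma=0.99):
    """Illustrative CQL-style critic loss whose conservatism weight
    adapts to the batch Q-value distribution (not the exact ACQ rule)."""
    s, a, r, s_next, done = batch  # tensors drawn from the fixed dataset

    # Standard Bellman backup, as in the actor-critic that CQL builds on.
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * target_q(s_next, a_next)

    q_data = q_net(s, a)
    bellman_loss = F.mse_loss(q_data, target)

    # Simplified conservative penalty: push Q-values of policy actions
    # down and Q-values of dataset actions up.
    penalty = (q_net(s, policy(s)) - q_data).mean()

    # Hypothetical adaptive weight: when the dataset's Q-values are
    # widely spread, shrink the penalty weight; when they are tightly
    # concentrated, keep the conservative constraint strong.
    alpha = alpha_base / (q_data.std().detach() + 1e-6)

    return bellman_loss + alpha * penalty
```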
References
Agarwal, A., Kakade, S., Yang, L.F.: Model-based reinforcement learning with a generative model is minimax optimal. In: Conference on Learning Theory, pp. 67–83. PMLR (2020)
Agarwal, R., Schuurmans, D., Norouzi, M.: An optimistic perspective on offline reinforcement learning. In: ICML, pp. 104–114. PMLR (2020)
Ajay, A., Kumar, A., Agrawal, P., Levine, S., Nachum, O.: OPAL: offline primitive discovery for accelerating offline reinforcement learning. arXiv preprint arXiv:2010.13611 (2020)
An, G., Moon, S., Kim, J.H., Song, H.O.: Uncertainty-based offline reinforcement learning with diversified Q-ensemble. Adv. Neural. Inf. Process. Syst. 34, 7436–7447 (2021)
Azar, M.G., Osband, I., Munos, R.: Minimax regret bounds for reinforcement learning. In: International Conference on Machine Learning, pp. 263–272. PMLR (2017)
Chen, L., et al.: Decision transformer: reinforcement learning via sequence modeling. Adv. Neural. Inf. Process. Syst. 34, 15084–15097 (2021)
Dulac-Arnold, G., et al.: Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Mach. Learn. 110(9), 2419–2468 (2021)
Fu, J., Kumar, A., Nachum, O., Tucker, G., Levine, S.: D4RL: datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219 (2020)
Fujimoto, S., Gu, S.S.: A minimalist approach to offline reinforcement learning. Adv. Neural. Inf. Process. Syst. 34, 20132–20145 (2021)
Guiñón, J.L., Ortega, E., García-Antón, J., Pérez-Herranz, V.: Moving average and Savitzky-Golay smoothing filters using Mathcad. Papers ICEE 2007, 1–4 (2007)
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: ICML, pp. 1861–1870. PMLR (2018)
Jaques, N., et al.: Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456 (2019)
Jin, Y., Yang, Z., Wang, Z.: Is pessimism provably efficient for offline RL? In: International Conference on Machine Learning, pp. 5084–5096. PMLR (2021)
Kendall, A., et al.: Learning to drive in a day. In: ICRA, pp. 8248–8254. IEEE (2019)
Kidambi, R., Rajeswaran, A., Netrapalli, P., Joachims, T.: MOReL: model-based offline reinforcement learning. Adv. Neural. Inf. Process. Syst. 33, 21810–21823 (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kostrikov, I., Nair, A., Levine, S.: Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169 (2021)
Kumar, A., Zhou, A., Tucker, G., Levine, S.: Conservative Q-learning for offline reinforcement learning. Adv. Neural. Inf. Process. Syst. 33, 1179–1191 (2020)
Lange, S., Gabel, T., Riedmiller, M.: Batch reinforcement learning. In: Reinforcement Learning: State-of-the-Art, pp. 45–73. Springer (2012)
Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T.: Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037 (2017)
Levine, S., Kumar, A., Tucker, G., Fu, J.: Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643 (2020)
Li, G., Wei, Y., Chi, Y., Gu, Y., Chen, Y.: Breaking the sample size barrier in model-based reinforcement learning with a generative model. Adv. Neural. Inf. Process. Syst. 33, 12861–12872 (2020)
Liu, Y., Swaminathan, A., Agarwal, A., Brunskill, E.: Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473 (2019)
Lyu, J., Ma, X., Li, X., Lu, Z.: Mildly conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2206.04745 (2022)
Ma, Y., Jayaraman, D., Bastani, O.: Conservative offline distributional reinforcement learning. Adv. Neural. Inf. Process. Syst. 34, 19235–19247 (2021)
Munemasa, I., Tomomatsu, Y., Hayashi, K., Takagi, T.: Deep reinforcement learning for recommender systems. In: ICOIACT, pp. 226–233. IEEE (2018)
Nachum, O., Dai, B., Kostrikov, I., Chow, Y., Li, L., Schuurmans, D.: Algaedice: policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074 (2019)
O’Donoghue, B., Osband, I., Munos, R., Mnih, V.: The uncertainty bellman equation and exploration. In: International Conference on Machine Learning, pp. 3836–3845 (2018)
Ovadia, Y., et al.: Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In: Advances in Neural Information Processing Systems 32 (2019)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32 (2019)
Precup, D., Sutton, R.S., Dasgupta, S.: Off-policy temporal-difference learning with function approximation. In: ICML, pp. 417–424 (2001)
Rashidinejad, P., Zhu, B., Ma, C., Jiao, J., Russell, S.: Bridging offline reinforcement learning and imitation learning: a tale of pessimism. Adv. Neural. Inf. Process. Syst. 34, 11702–11716 (2021)
Sinha, S., Mandlekar, A., Garg, A.: S4rl: surprisingly simple self-supervision for offline reinforcement learning in robotics. In: Conference on Robot Learning, pp. 907–917. PMLR (2022)
Sutton, R.S., Mahmood, A.R., White, M.: An emphatic approach to the problem of off-policy temporal-difference learning. J. Mach. Learn. Res. 17(1), 2603–2631 (2016)
Todorov, E., Erez, T., Tassa, Y.: MuJoCo: a physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE (2012)
Vinyals, O., et al.: StarCraft II: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782 (2017)
Wu, K., et al.: ACQL: an adaptive conservative Q-learning framework for offline reinforcement learning (2022)
Wu, Y., Tucker, G., Nachum, O.: Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361 (2019)
Wu, Y., et al.: Uncertainty weighted actor-critic for offline reinforcement learning. arXiv preprint arXiv:2105.08140 (2021)
Yu, T., Kumar, A., Rafailov, R., Rajeswaran, A., Levine, S., Finn, C.: COMBO: conservative offline model-based policy optimization. Adv. Neural. Inf. Process. Syst. 34, 28954–28967 (2021)
Yu, T., et al.: MOPO: model-based offline policy optimization. Adv. Neural. Inf. Process. Syst. 33, 14129–14142 (2020)
Zou, L., Xia, L., Ding, Z., Song, J., Liu, W., Yin, D.: Reinforcement learning to optimize long-term user engagement in recommender systems. In: SIGKDD, pp. 2810–2818 (2019)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Qiu, L., Li, X., Liang, L., Sun, M., Yan, J. (2024). Adaptable Conservative Q-Learning for Offline Reinforcement Learning. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14427. Springer, Singapore. https://doi.org/10.1007/978-981-99-8435-0_16
DOI: https://doi.org/10.1007/978-981-99-8435-0_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8434-3
Online ISBN: 978-981-99-8435-0