Adaptable Conservative Q-Learning for Offline Reinforcement Learning

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14427)


Abstract

The out-of-distribution (OOD) problem is a major obstacle in offline reinforcement learning. Current approaches conservatively estimate the Q-values of OOD actions, but enforcing a constant conservative constraint can make them excessively pessimistic and hinder model learning throughout the policy learning process. Moreover, the diverse task distributions across environments and behavior policies call for tailored solutions. To tackle these challenges, we propose Adaptable Conservative Q-Learning (ACQ), which exploits the Q-value distribution of each fixed dataset to derive a highly generalizable metric that balances the conservative constraint against the training objective. Experimental results show that ACQ is competitive with a variety of offline RL algorithms and significantly improves the performance of CQL on most D4RL MuJoCo locomotion tasks in terms of normalized return.
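
For readers unfamiliar with the conservative constraint mentioned above: in standard CQL this constraint is a Q-value penalty scaled by a fixed coefficient α, and ACQ's contribution, as described in the abstract, is to make that coefficient dataset-adaptive. The sketch below restates the standard CQL-style objective and marks where such an adaptive weight would enter; the symbol α(D) is only an illustrative placeholder, since the paper's actual metric, built from the dataset's Q-value distribution, is not reproduced here.

```latex
% Standard CQL-style objective with a conservatism weight.
% ACQ, per the abstract, adapts this weight to each dataset;
% \alpha(\mathcal{D}) below is an illustrative placeholder,
% not the paper's actual metric.
\begin{equation}
\min_{Q}\;
\alpha(\mathcal{D})\,
\Big(
  \mathbb{E}_{s\sim\mathcal{D}}\big[\log\textstyle\sum_{a}\exp Q(s,a)\big]
  \;-\;
  \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big]
\Big)
\;+\;
\tfrac{1}{2}\,
\mathbb{E}_{(s,a,s')\sim\mathcal{D}}
\Big[\big(Q(s,a)-\hat{\mathcal{B}}^{\pi}\hat{Q}(s,a)\big)^{2}\Big]
\end{equation}
```

Here D denotes the fixed offline dataset and the hatted Bellman operator the empirical backup; the first term pushes down Q-values of out-of-distribution actions while keeping in-distribution Q-values up, so a larger weight yields stronger conservatism. The abstract's point is that holding this weight constant across datasets over- or under-constrains some of them, motivating a dataset-dependent choice.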

L. Qiu—Work partly done during internship at Cognitive Computing Lab, Baidu Research.

J. Yan—The SJTU authors were supported by NSFC (61972250, U19B2035) and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

Author information

Corresponding author

Correspondence to Junchi Yan.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Qiu, L., Li, X., Liang, L., Sun, M., Yan, J. (2024). Adaptable Conservative Q-Learning for Offline Reinforcement Learning. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14427. Springer, Singapore. https://doi.org/10.1007/978-981-99-8435-0_16

  • DOI: https://doi.org/10.1007/978-981-99-8435-0_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8434-3

  • Online ISBN: 978-981-99-8435-0

  • eBook Packages: Computer Science, Computer Science (R0)
