arXiv:2006.04779 Abstract | arXiv Analytics

arXiv:2006.04779 [cs.LG]

Conservative Q-Learning for Offline Reinforcement Learning

Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine

Published 2020-06-08, Version 1

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a principled policy improvement procedure. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
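The abstract describes CQL as the standard Bellman error objective plus a simple Q-value regularizer, without giving the regularizer's form. The following is a minimal sketch for the discrete-action case, assuming the log-sum-exp form of the penalty commonly associated with the paper: push down a soft maximum of Q over all actions and push up Q at the actions actually present in the dataset. The function name `cql_loss`, its arguments, and the weight `alpha` are illustrative choices, not identifiers taken from the authors' code.

```python
import numpy as np

def cql_loss(q_values, actions, td_targets, alpha=1.0):
    """Illustrative conservative Q-learning loss for discrete actions.

    q_values:   (batch, num_actions) Q(s, .) for each sampled state
    actions:    (batch,) integer actions taken in the dataset
    td_targets: (batch,) precomputed Bellman backup targets
    alpha:      hypothetical weight on the conservative regularizer
    """
    rows = np.arange(q_values.shape[0])
    q_data = q_values[rows, actions]  # Q(s, a) at dataset actions

    # Standard Bellman error term.
    bellman_error = 0.5 * np.mean((q_data - td_targets) ** 2)

    # Numerically stable log-sum-exp over actions (a soft maximum of Q).
    q_max = q_values.max(axis=1, keepdims=True)
    soft_max_q = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))

    # Regularizer: lower Q on all actions, raise it on dataset actions.
    regularizer = np.mean(soft_max_q - q_data)

    return bellman_error + alpha * regularizer, bellman_error, regularizer

# Toy batch: 2 states, 3 actions.
q = np.array([[1.0, 2.0, 0.5], [0.0, 0.0, 0.0]])
total, bellman, reg = cql_loss(q, np.array([1, 0]), np.array([1.5, 0.2]))
print(total, bellman, reg)
```

Because the log-sum-exp is never smaller than any single Q-value, the regularizer in this sketch is non-negative, and setting `alpha=0` recovers the plain Bellman error. In a deep Q-learning implementation the same two terms would be computed on network outputs and differentiated with an autodiff library.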

Comments: Preprint. Website at: https://sites.google.com/view/cql-offline-rl

Categories: cs.LG, stat.ML

Keywords: offline reinforcement learning, existing offline rl methods, outperforms existing offline rl, conservative q-learning, multi-modal data distributions

Related articles:

arXiv:2302.14372 [cs.LG] (Published 2023-02-28)

The In-Sample Softmax for Offline Reinforcement Learning

Chenjun Xiao, Han Wang, Yangchen Pan, Adam White, Martha White

arXiv:2111.10919 [cs.LG] (Published 2021-11-21, updated 2022-08-30)

Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation

Dylan J. Foster, Akshay Krishnamurthy, David Simchi-Levi, Yunzong Xu

arXiv:2006.13888 [cs.LG] (Published 2020-06-24)

RL Unplugged: Benchmarks for Offline Reinforcement Learning

Caglar Gulcehre et al.
