arXiv:1805.07805 [cs.LG]

Safe Policy Learning from Observations

Elad Sarafian, Aviv Tamar, Sarit Kraus

Published 2018-05-20 (Version 1)

In this paper, we consider the problem of learning a policy by observing numerous non-expert agents. Our goal is to extract a policy that, with high confidence, performs better than the average of the observed agents. Such a setting is important for real-world problems where expert data is scarce but non-expert data can easily be obtained, e.g., by crowdsourcing. Our approach is to pose this problem as safe policy improvement in Reinforcement Learning. First, we evaluate an average behavior policy and approximate its value function. Then, we develop a stochastic policy improvement algorithm, termed Rerouted Behavior Improvement (RBI), that safely improves the average behavior. The primary advantages of RBI over current safe learning methods are its stability in the presence of value estimation errors and the elimination of a policy search process. We demonstrate these advantages in a Taxi grid-world domain and in four games from the Atari Learning Environment.
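
The abstract describes the method only at a high level: estimate the average behavior policy, approximate its value function, and then improve it under a safety constraint that keeps the new policy close to the observed behavior. The sketch below illustrates what one such constrained improvement step could look like for a single state with a discrete action space; the function name, the [c_min, c_max] ratio bounds on pi(a)/beta(a), and the greedy rerouting of probability mass are illustrative assumptions, not the authors' exact algorithm.

```python
# Hedged sketch of a "reroute"-style safe policy improvement step.
# Assumptions (not stated in the abstract): a discrete action space,
# an estimated average-behavior policy beta(a|s), an approximate
# action-value function q(s, a) for that behavior policy, and a
# constraint pi(a|s) in [c_min * beta(a|s), c_max * beta(a|s)].

import numpy as np


def reroute_improvement(beta, q, c_min=0.5, c_max=1.5):
    """Return an improved action distribution for a single state.

    beta : (A,) estimated average-behavior probabilities, sums to 1.
    q    : (A,) approximate action values under the behavior policy.
    Probability mass is shifted towards high-value actions while each
    pi(a) stays within [c_min * beta(a), c_max * beta(a)], limiting
    how far the improved policy can stray from the observed behavior.
    """
    lo, hi = c_min * beta, c_max * beta
    pi = lo.copy()                      # start every action at its lower bound
    budget = 1.0 - pi.sum()             # probability mass left to distribute
    # Greedily assign the remaining mass to the highest-value actions first.
    for a in np.argsort(q)[::-1]:
        give = min(hi[a] - pi[a], budget)
        pi[a] += give
        budget -= give
        if budget <= 1e-12:
            break
    return pi


if __name__ == "__main__":
    beta = np.array([0.25, 0.25, 0.25, 0.25])   # estimated average behavior
    q = np.array([1.0, 0.2, -0.5, 0.1])         # noisy value estimates
    print(reroute_improvement(beta, q))         # mass rerouted to action 0
```

Because the improved policy is obtained in closed form from beta and q, with mass bounded per action, a misestimated value for one action can only reroute a limited amount of probability, which is one plausible reading of the claimed robustness to value estimation errors and the absence of a policy search.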

Related articles:
arXiv:1712.06924 [cs.LG] (Published 2017-12-19): Safe Policy Improvement with Baseline Bootstrapping
arXiv:1907.05079 [cs.LG] (Published 2019-07-11): Safe Policy Improvement with Soft Baseline Bootstrapping
arXiv:2010.12645 [cs.LG] (Published 2020-10-23): Towards Safe Policy Improvement for Non-Stationary MDPs