arXiv:2309.15257 [cs.LG]

STARC: A General Framework For Quantifying Differences Between Reward Functions

Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate

Published 2023-09-26 (Version 1)

In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward function from data. However, the theoretical foundations of reward learning are not yet well-developed. In particular, it is typically not known when a given reward learning algorithm will, with high probability, learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to predict in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. Moreover, we also identify a number of issues with reward metrics proposed in earlier works. Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms easier and more principled.
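
To give a concrete feel for what a standardised reward comparison might look like, here is a minimal, illustrative sketch. It is not the construction from the paper: the tabular reward representation R[s, a, s'], the least-squares removal of potential shaping, the plain L2 normalisation and L2 distance, and the function names (canonicalise, starc_like_distance) are all simplifying assumptions made purely for illustration; the paper's actual canonicalisation functions, normalisations, and metrics may differ.

```python
# Illustrative sketch of a STARC-style reward comparison (NOT the paper's
# exact construction). Assumes small tabular rewards R[s, a, s'].
import numpy as np


def canonicalise(R: np.ndarray, gamma: float) -> np.ndarray:
    """Remove potential shaping: fit a potential phi minimising
    ||R + gamma*phi[s'] - phi[s]||_2 by least squares, and return the residual."""
    n_s, n_a, _ = R.shape
    # Linear map phi -> (gamma * phi[s'] - phi[s]), flattened over (s, a, s').
    A = np.zeros((n_s * n_a * n_s, n_s))
    for s in range(n_s):
        for a in range(n_a):
            for s2 in range(n_s):
                row = (s * n_a + a) * n_s + s2
                A[row, s2] += gamma
                A[row, s] -= 1.0
    phi, *_ = np.linalg.lstsq(A, -R.ravel(), rcond=None)
    return R + (A @ phi).reshape(R.shape)


def starc_like_distance(R1: np.ndarray, R2: np.ndarray, gamma: float = 0.9) -> float:
    """Canonicalise each reward, normalise to unit L2 norm, take the L2 distance."""
    def standardise(R):
        C = canonicalise(R, gamma)
        norm = np.linalg.norm(C)
        return C / norm if norm > 0 else C
    return float(np.linalg.norm(standardise(R1) - standardise(R2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R = rng.normal(size=(4, 2, 4))
    phi = rng.normal(size=4)
    # Potential shaping leaves optimal policies unchanged, so a STARC-style
    # distance between R and its shaped version should be (near) zero.
    R_shaped = R + 0.9 * phi[None, None, :] - phi[:, None, None]
    print(starc_like_distance(R, R_shaped))                     # ~0
    print(starc_like_distance(R, rng.normal(size=(4, 2, 4))))   # > 0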

Related articles:
arXiv:2310.18564 [cs.LG] (Published 2023-10-28)
A General Framework for Robust G-Invariance in G-Equivariant Networks
arXiv:2403.13249 [cs.LG] (Published 2024-03-20)
A Unified and General Framework for Continual Learning
arXiv:2402.01922 [cs.LG] (Published 2024-02-02, updated 2024-05-28)
A General Framework for Learning from Weak Supervision
Hao Chen et al.