{ "id": "2209.11745", "version": "v1", "published": "2022-09-23T17:47:24.000Z", "updated": "2022-09-23T17:47:24.000Z", "title": "Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning", "authors": [ "Fan Chen", "Song Mei", "Yu Bai" ], "categories": [ "cs.LG", "cs.AI", "math.ST", "stat.ML", "stat.TH" ], "abstract": "Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) is recently proposed by Foster et al. (2021) as a necessary and sufficient complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL with the DEC framework. First, we propose two new DEC-type complexity measures: Explorative DEC (EDEC), and Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021) which require either bounding a variant of the DEC which may be prohibitively large, or designing problem-specific estimation subroutines. As applications, we recover existing and obtain new sample-efficient learning results for a wide range of tractable RL problems using essentially a single algorithm. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling or Maximum Likelihood Estimation, showing that they enjoy similar regret bounds as E2D-TA under similar structural conditions as the DEC.", "revisions": [ { "version": "v1", "updated": "2022-09-23T17:47:24.000Z" } ], "analyses": { "keywords": [ "decision-estimation coefficient", "reward-free learning", "unified algorithms", "general model estimation subroutine", "enjoy similar regret bounds" ], "note": { "typesetting": "TeX", "pages": 0, "language": "en", "license": "arXiv", "status": "editable" } } }