{ "id": "1206.6461", "version": "v1", "published": "2012-06-27T19:59:59.000Z", "updated": "2012-06-27T19:59:59.000Z", "title": "On the Sample Complexity of Reinforcement Learning with a Generative Model", "authors": [ "Mohammad Gheshlaghi Azar", "Remi Munos", "Bert Kappen" ], "comment": "Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)", "categories": [ "cs.LG", "stat.ML" ], "abstract": "We consider the problem of learning the optimal action-value function in the discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample-complexity of model-based value iteration algorithm in the presence of the generative model, which indicates that for an MDP with N state-action pairs and the discount factor \\gamma\\in[0,1) only O(N\\log(N/\\delta)/((1-\\gamma)^3\\epsilon^2)) samples are required to find an \\epsilon-optimal estimation of the action-value function with the probability 1-\\delta. We also prove a matching lower bound of \\Theta (N\\log(N/\\delta)/((1-\\gamma)^3\\epsilon^2)) on the sample complexity of estimating the optimal action-value function by every RL algorithm. To the best of our knowledge, this is the first matching result on the sample complexity of estimating the optimal (action-) value function in which the upper bound matches the lower bound of RL in terms of N, \\epsilon, \\delta and 1/(1-\\gamma). Also, both our lower bound and our upper bound significantly improve on the state-of-the-art in terms of 1/(1-\\gamma).", "revisions": [ { "version": "v1", "updated": "2012-06-27T19:59:59.000Z" } ], "analyses": { "keywords": [ "sample complexity", "generative model", "reinforcement learning", "lower bound", "optimal action-value function" ], "tags": [ "conference paper" ], "note": { "typesetting": "TeX", "pages": 0, "language": "en", "license": "arXiv", "status": "editable", "adsabs": "2012arXiv1206.6461G" } } }