arXiv Analytics

arXiv:1912.01192 [cs.LG]

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Tiancheng Jin, Haipeng Luo

Published 2019-12-03, Version 1

We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|^2\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure sub-linear regret in this challenging setting. Our key technical contribution is an optimistic loss estimator that is inversely weighted by an \emph{upper occupancy bound}.
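For intuition, a loss estimator that is inversely weighted by an upper occupancy bound plausibly takes the following form (a sketch reconstructed from the abstract's description, not quoted from the paper; the symbols $q_t$, $u_t$, and the indicator normalization are assumptions):

$$\hat{\ell}_t(x,a) \;=\; \frac{\ell_t(x,a)}{u_t(x,a)}\,\mathbf{1}\{(x_t, a_t) = (x, a)\},$$

where $q_t(x,a)$ is the probability that the learner's policy visits state-action pair $(x,a)$ in episode $t$, and $u_t(x,a) \ge q_t(x,a)$ is an upper bound on that occupancy taken over a confidence set of plausible transition functions. Since $\mathbb{E}[\hat{\ell}_t(x,a)] = \frac{q_t(x,a)}{u_t(x,a)}\,\ell_t(x,a) \le \ell_t(x,a)$, the estimator never over-estimates a loss in expectation, which is what makes it optimistic and compensates for the uncertainty introduced by the unknown transition function.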
