arXiv Analytics

arXiv:1912.01192 [cs.LG]

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Tiancheng Jin, Haipeng Luo

Published 2019-12-03, Version 1

We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|^2\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure sub-linear regret in this challenging setting. Our key technical contribution is an optimistic loss estimator that is inversely weighted by an \emph{upper occupancy bound}.
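For intuition, a loss estimator that is inversely weighted by an upper occupancy bound plausibly takes the following form (a sketch reconstructed from the abstract's description, not quoted from the paper; the symbols $q_t$, $u_t$, and the indicator normalization are assumptions):

$$\hat{\ell}_t(x,a) \;=\; \frac{\ell_t(x,a)}{u_t(x,a)}\,\mathbf{1}\{(x_t, a_t) = (x, a)\},$$

where $q_t(x,a)$ is the probability that the learner's policy visits state-action pair $(x,a)$ in episode $t$, and $u_t(x,a) \ge q_t(x,a)$ is an upper bound on that occupancy taken over a confidence set of plausible transition functions. Since $\mathbb{E}[\hat{\ell}_t(x,a)] = \frac{q_t(x,a)}{u_t(x,a)}\,\ell_t(x,a) \le \ell_t(x,a)$, the estimator never over-estimates a loss in expectation, which is what makes it optimistic and compensates for the uncertainty introduced by the unknown transition function.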
