Since its introduction a decade ago, relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains, and it has supplied algorithmic components used by many recently proposed reinforcement learning (RL) algorithms. Unfortunately, there are no theoretical guarantees for the performance of REPS. In this project we take the first steps toward addressing this shortcoming.
Researchers
- Aldo Pacchiano, UC Berkeley, https://people.eecs.berkeley.edu/~pacchiano/
- Ofir Nachum, Google, https://research.google/people/105364/
- Peter Bartlett, UC Berkeley, https://www.stat.berkeley.edu/~bartlett/
Overview
While REPS is well known in the community, there are no guarantees on its performance when it is run with stochastic and gradient-based solvers. In this work we aim to fill this gap by providing guarantees and convergence rates for the sub-optimality of a policy learned using first-order optimization methods applied to the REPS objective. We first consider the setting in which we are given access to exact gradients and demonstrate how near-optimality of the objective translates to near-optimality of the policy. We then consider the practical setting of stochastic gradients and introduce a technique for calculating unbiased gradients of the REPS objective using generative access to the underlying Markov decision process (MDP). Our results match state-of-the-art convergence rates in the saddle-point RL literature, suggesting that REPS enjoys not only favorable practical performance but also strong theoretical guarantees.
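For concreteness, below is a minimal sketch of the kind of first-order computation involved. It assumes the classical sample-based REPS dual with a discount factor gamma and KL radius epsilon, J(eta, V) = eta*epsilon + eta*log(mean_i exp(delta_i/eta)) with Bellman error delta_i = r_i + gamma*V(s'_i) - V(s_i); the exact objective, regularization, and gradient estimator analyzed in the project may differ.

```python
import numpy as np

def reps_dual_and_grad(eta, V, batch, gamma=0.99, epsilon=0.1):
    """Empirical REPS dual objective and its exact gradient (illustrative sketch).

    batch: list of transitions (s, a, r, s_next) drawn via generative access
    to a tabular MDP with states indexed 0..|S|-1.
    Dual:  J(eta, V) = eta*epsilon + eta*log( mean_i exp(delta_i / eta) ),
    where  delta_i = r_i + gamma*V[s_next_i] - V[s_i].
    """
    s, a, r, s_next = map(np.asarray, zip(*batch))
    # Actions a are not needed for the dual itself; in REPS they are used later
    # when extracting a policy from the softmax weights w below.
    delta = r + gamma * V[s_next] - V[s]

    # Numerically stable log-mean-exp of delta / eta.
    z = delta / eta
    z_max = z.max()
    log_mean_exp = z_max + np.log(np.mean(np.exp(z - z_max)))
    J = eta * epsilon + eta * log_mean_exp

    # Softmax weights over transitions.
    w = np.exp(z - z_max)
    w /= w.sum()

    # dJ/d eta = epsilon + log-mean-exp - (1/eta) * sum_i w_i * delta_i
    grad_eta = epsilon + log_mean_exp - (w @ delta) / eta

    # dJ/dV: transition i contributes w_i * (gamma * e_{s_next_i} - e_{s_i}).
    grad_V = np.zeros_like(V)
    np.add.at(grad_V, s_next, gamma * w)
    np.add.at(grad_V, s, -w)

    return J, grad_eta, grad_V
```

Note that because the dual takes the log of an expectation, gradients computed from a finite batch as above are biased estimates of the gradient of the population objective; obtaining unbiased stochastic gradients with only generative access to the MDP is exactly the difficulty the technique described in the overview is meant to address.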
Closing Report (Sept 2021): LP-based Algorithms for Reinforcement Learning