Statistically Efficient Offline RL with General Function Approximation

Abstract 

Offline reinforcement learning (RL) aims to learn effective policies from a previously collected dataset of interactions, without access to further interaction with the environment. To handle datasets with partial coverage, conservatism has recently been shown to be necessary for offline RL, both in practice and in theory. Existing offline RL algorithms, however, either offer no theoretical guarantees or are impractical due to strong assumptions (such as tabular or linear parameterization) or computational intractability. We propose to take steps towards closing this gap by borrowing ideas from optimal transport in optimization theory and combining them with recent advances in conservative offline RL algorithm design. We plan to design algorithms and analyze their sample efficiency both theoretically and empirically. In addition, we will investigate fundamental limitations and barriers in offline learning pertaining to dataset coverage, function approximation, model misspecification, and the presence of adversaries. We will also conduct thorough empirical evaluations and apply the proposed methods to real applications.

Researchers

  • BAIR Faculty Members: Kannan Ramchandran (kannanr [at] eecs [dot] berkeley [dot] edu), Jiantao Jiao (jiantao [at] berkeley [dot] edu)
  • BAIR PhD students/postdocs: Hanlin Zhu (hanlinzhu [at] berkeley [dot] edu), Kunhe Yang (kunheyang [at] berkeley [dot] edu), Paria Rashidinejad (paria [dot] rashidinejad [at] berkeley [dot] edu)
  • Meta AI researcher: Amy Zhang (amyzhang [at] fb [dot] com)

Overview

In contrast to the more classical setting of online RL, which involves interaction with the environment, offline RL assumes access only to a previously collected dataset of interactions. Effective offline RL is crucial: it exploits ample previously collected data, avoids potentially costly or infeasible exploration, and can revolutionize many important real-world applications such as healthcare, climate science, and scientific discovery.

There has been a surge of interest in offline RL over the past few years. On the practical front, several algorithms have been developed to improve the applicability of offline learning, with the majority relying on the notion of conservatism to handle partial dataset coverage. While these initial results are promising, offline RL is far from solved for the purposes of real applications. For instance, most of these algorithms lack theoretical guarantees on sample efficiency and characterizations of their failure modes [1], can be brittle to changes in implementation, hyperparameters, and random seeds [7, 8], and only perform well on datasets with certain types of coverage [1].

On the theoretical front, conservative offline RL algorithms with theoretical guarantees have been proposed. While some of these algorithms handle general function approximation, they are computationally intractable and rely on prohibitively strong assumptions such as uniform all-policy dataset coverage and Bellman completeness [2, 10, 11]. In our previous work [5], we introduced the single-policy concentrability framework for studying offline RL under partial coverage and showed that offline RL with lower confidence bounds is information-theoretically near-optimal in the tabular setting (a minimal sketch of this principle follows the list below). However, forming confidence bounds for general function approximators such as neural networks remains challenging, and heuristics such as ensembles do not offer competitive performance [1]. This motivates us to investigate alternative offline RL algorithms that are:

  1. Practical, i.e., computationally efficient and avoiding the construction of confidence bounds;

  2. Applicable to real scenarios, i.e., performing well on datasets with partial coverage and avoiding prohibitive assumptions such as Bellman completeness and all-policy coverage;

  3. Theoretically grounded, i.e., enjoying sample-efficiency guarantees with general function approximation and recovering the near-optimal sample complexity in the tabular setting;

  4. Empirically strong, i.e., achieving state-of-the-art performance in practice.
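To make the lower-confidence-bound principle mentioned above concrete, here is a minimal tabular sketch in the spirit of pessimistic value iteration. It is an illustration only, not the exact algorithm or constants of [5]; the penalty scale `beta`, the function name, and the array layout are choices made for this sketch.

```python
import numpy as np

def lcb_value_iteration(counts, rewards, gamma=0.99, beta=1.0, iters=1000):
    """Tabular value iteration with a count-based lower-confidence-bound penalty.

    counts[s, a, s'] -- number of dataset transitions observed from (s, a) to s'
    rewards[s, a]    -- empirical mean reward at (s, a), assumed to lie in [0, 1]
    beta             -- penalty scale (a hyperparameter of this sketch)
    """
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=-1)                        # visit counts n(s, a)
    p_hat = counts / np.maximum(n_sa[..., None], 1)   # empirical transition model
    penalty = beta / np.sqrt(np.maximum(n_sa, 1))     # larger penalty where data is scarce
    penalty[n_sa == 0] = 1.0 / (1.0 - gamma)          # maximal pessimism for unseen (s, a)

    q = np.zeros((S, A))
    for _ in range(iters):
        v = q.max(axis=-1)
        # Pessimistic Bellman update: subtract the penalty from the usual target.
        q = np.maximum(rewards + gamma * (p_hat @ v) - penalty, 0.0)
    return q.argmax(axis=-1)  # greedy policy w.r.t. the pessimistic Q-values
```

The penalty shrinks as 1/sqrt(n(s, a)), so actions well covered by the dataset are evaluated nearly faithfully while poorly covered actions are discouraged; this is what allows pessimistic methods to compete with any policy the dataset covers instead of requiring all-policy coverage.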

To achieve the above, we plan to break the information-theoretic barriers of value-function approximation [12] by designing algorithms that revolve around the linear programming (LP) formulation of reinforcement learning; one standard form of this LP is shown below. We will build on the promising initial results recently reported in [6] and combine them with ideas from optimal transport in optimization theory. Beyond the criteria above, we plan to investigate the difficulties of offline learning with general function approximation and to assess the robustness of our algorithms to model misspecification and adversaries.
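For concreteness, one textbook form of the LP formulation referenced above is the primal linear program over discounted state-action occupancy measures (written here for an infinite-horizon discounted MDP with initial state distribution ρ; this is the generic formulation, not the specific program analyzed in [6]):

```latex
\begin{align*}
\max_{d \ge 0} \quad & \sum_{s, a} d(s, a)\, r(s, a) \\
\text{s.t.} \quad & \sum_{a} d(s, a)
  \;=\; (1 - \gamma)\, \rho(s)
  \;+\; \gamma \sum_{s', a'} P(s \mid s', a')\, d(s', a')
  \qquad \text{for all } s.
\end{align*}
```

Feasible points of this program are exactly the normalized occupancy measures of stationary policies, and an optimal policy can be recovered as π(a | s) ∝ d(s, a). Optimizing over occupancy measures rather than value functions underlies the recent results of [6], which combine realizability-type assumptions with single-policy concentrability.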

References

[1] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

[2] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR, 2019.

[3] Jianqing Fan, Zhaoran Wang, Yuchen Xie, and Zhuoran Yang. A theoretical analysis of deep q-learning. In Learning for Dynamics and Control, pages 486–489. PMLR, 2020.

[4] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 2020.

[5] Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34, 2021.

[6] Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason D. Lee. Offline reinforcement learning with realizability and single-policy concentrability. arXiv preprint arXiv:2202.04634, 2022.

[7] Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. A closer look at deep policy gradients. In International Conference on Learning Representations, 2019.

[8] Riashat Islam, Peter Henderson, Maziar Gomrokchi, and Doina Precup. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. International Conference on Learning Representations, 2017.

[10] Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.

[11] Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability. In International Conference on Machine Learning, pages 11404–11413. PMLR, 2021.

[12] Dylan J. Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919, 2021.