Automatic Curriculum Generation and Emergent Complexity via Inter-agent Competition

Reinforcement Learning (RL) has been most successful when agents can collect extensive training experience in a simulated environment [1-4]. However, building simulated environments requires a great deal of manual effort, is error-prone, and is unlikely to cover the space of all real-world tasks. Inter-agent competition has the potential to automatically generate increasingly challenging and complex environments as agents learn and adapt to each other [5-7]. The diagram above describes one such paradigm, which served as the basis for our NeurIPS 2020 paper, described in more detail below.

Closing Report

July 12, 2021

Researchers

  • Michael Dennis, BAIR (contact email: michael_dennis@berkeley.edu)
  • Eugene Vinitsky, BAIR
  • Arnaud Fickinger, BAIR
  • Natasha Jaques, Google
  • Igor Mordatch, Google
  • Stuart Russell, BAIR

Overview

We propose to develop novel RL algorithms and training objectives which harness competitive games as a way to generate a series of progressively more difficult tasks for agents to solve.

Our initial collaborative research on this topic resulted in a NeurIPS paper on a technique called Protagonist Antagonist Induced Regret Environment Design (PAIRED), a training regime in which an adversary learns to design complex but feasible environments by maximizing the difference in scores between a pair of agents, called the regret, as depicted in the diagram below.
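
To make the objective concrete, the following is a minimal sketch of one PAIRED update. The `adversary`, `protagonist`, and `antagonist` policy objects and their `design_environment`, `rollout`, and `reinforce` methods are hypothetical placeholders, not the paper's implementation; only the regret estimate (the antagonist's best return minus the protagonist's average return) follows the method described above.

```python
import numpy as np

def paired_regret_update(adversary, protagonist, antagonist, n_episodes=4):
    """One PAIRED iteration (sketch with hypothetical policy objects)."""
    # The adversary proposes an environment (e.g., block positions in a maze).
    env = adversary.design_environment()

    # Roll out both student agents in the proposed environment.
    pro_returns = [protagonist.rollout(env) for _ in range(n_episodes)]
    ant_returns = [antagonist.rollout(env) for _ in range(n_episodes)]

    # Regret estimate: the antagonist's best return minus the
    # protagonist's average return on the same environment.
    regret = np.max(ant_returns) - np.mean(pro_returns)

    # Each player ascends its own objective: the students maximize their
    # own returns, while the adversary maximizes the regret.
    protagonist.reinforce(env, reward=np.mean(pro_returns))
    antagonist.reinforce(env, reward=np.max(ant_returns))
    adversary.reinforce(env, reward=regret)
    return regret
```

Because the regret is highest on environments the antagonist can solve but the protagonist cannot yet, the adversary is pushed toward levels that are challenging yet feasible, rather than simply unsolvable.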

A small sample of the results of this work can be seen in the following figure. Compared with two baseline methods, Domain Randomization (DR), which generates random block positions, and Minimax, which generates adversarial block positions, PAIRED transfers better to complex unseen environments at test time. We are extending this work to new environments by modifying the game-theoretic setting to increase stability and reduce variance.
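
For intuition on how the three environment designers differ, here is a hedged sketch of the reward each designer would receive; the function name and signature are illustrative assumptions, not the experimental code.

```python
def designer_reward(mode, protagonist_return, antagonist_return):
    """Reward given to the environment designer under each scheme (a sketch).

    Domain Randomization has no adaptive designer, Minimax rewards the
    designer for making the agent fail outright, and PAIRED rewards it
    with the regret, which favors levels that are solvable (the antagonist
    succeeds) but not yet solved by the protagonist.
    """
    if mode == "dr":
        return 0.0                                      # random levels, no signal
    if mode == "minimax":
        return -protagonist_return                      # pure adversarial pressure
    if mode == "paired":
        return antagonist_return - protagonist_return   # regret
    raise ValueError(f"unknown mode: {mode}")
```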

We are planning a second project, Adversarial Surprise, which harnesses information-theoretic objectives and inter-agent competition by training agents to minimize their Bayesian surprise while maximizing surprise in another agent. We will also investigate cooperative curriculum generation, including how to develop more teachable agents.
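
As a rough illustration of the planned objective, the sketch below computes the two opposing rewards. The `density_model` object with a `log_prob` method is a hypothetical stand-in for whatever learned predictive model the agent maintains over its observations; the reward structure simply mirrors the minimize-versus-maximize-surprise description above.

```python
import numpy as np

def surprise_rewards(observations, density_model):
    """Opposing reward signals for the Adversarial Surprise setup (sketch).

    Bayesian surprise for an observation is approximated here by its
    negative log-probability under the agent's learned model. One agent is
    rewarded for keeping its own surprise low; its opponent receives the
    negated reward, so it benefits from forcing the first agent into
    states the model does not predict well.
    """
    surprise = [-density_model.log_prob(obs) for obs in observations]
    minimizer_reward = -float(np.sum(surprise))   # low surprise is good
    maximizer_reward = -minimizer_reward          # push the opponent's surprise up
    return minimizer_reward, maximizer_reward
```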

Technical Objective

We aim to create a training paradigm which automatically generates a curriculum of complex training tasks (see Fig. 1) to effectively prepare agents to generalize zero-shot to unknown, challenging test tasks (see Fig. 2). We are already making progress towards these objectives.

References

[1] Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., ... & Oh, J. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350-354.

[2] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[3] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484.

[4] Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., ... & Józefowicz, R. (2019). Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.

[5] Wang, R., Lehman, J., Clune, J., & Stanley, K. O. (2019). Paired open-ended trailblazer (POET): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753.

[6] Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., & Mordatch, I. (2019). Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528.

[7] Leibo, J. Z., Hughes, E., Lanctot, M., & Graepel, T. (2019). Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742.