Learning to Collaborate with Human Players

For agents trained by deep reinforcement learning to help humans in realistic settings, we need to ensure that the agents are robust. In collaborative scenarios with humans, evaluating robustness to tail failure cases is nontrivial, because adversarial training (and evaluation) assumes an overly conservative model of one’s partner’s behavior. We propose that AI system designers use a suite of unit tests that sanity-check agent behavior in a variety of states and with different (reasonable) collaboration partners.


31/8/2021: This project was accepted to AAMAS 2021, and the full paper is now available on arXiv.


  • Paul Knott, University of Nottingham, webpage
  • Micah Carroll, UC Berkeley, webpage
  • Sam Devlin, Microsoft Research, webpage
  • Kamil Ciosek, Microsoft Research, webpage
  • Katja Hofmann, Microsoft Research, webpage
  • Anca Dragan, UC Berkeley, webpage
  • Rohin Shah, UC Berkeley, webpage


Since the real world is very diverse, and human behavior often changes in response to agent deployment, the agent will likely encounter novel situations that have never been seen during training. This results in an evaluation challenge: if we cannot rely on the average training or validation reward as a metric, then how can we effectively evaluate robustness?

We take inspiration from the practice of unit testing in software engineering. Specifically, we suggest that when designing AI agents that collaborate with humans, designers should search for potential edge cases in possible partner behavior and possible states encountered, and write tests that check whether the agent's behavior in these edge cases is reasonable.

Concretely, we identify that robustness in collaborative scenarios is relevant on two different axes: state robustness (robustness to specific states) and agent robustness (robustness to the specific collaboration partners the agent is paired with). Our suite of unit tests is thus designed to stress test the trained agents with unlikely (but still plausible) states and collaborators that one might encounter in real deployment scenarios.
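As a rough illustration of the two axes, the following minimal sketch organizes a test suite by axis and reports a pass rate for each. Everything in it is a hypothetical toy stand-in: the one-variable "pot" environment, the `step` function, the `RobustnessTest` class, and the example policies are assumptions made for illustration, not the Overcooked-AI API or the test suite from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy types: a state is a dict, a policy maps a state to an action.
State = Dict[str, int]
Policy = Callable[[State], str]


def step(state: State, agent_action: str, partner_action: str) -> State:
    """Toy dynamics (assumption): each 'add_onion' fills the pot, up to 3 onions."""
    nxt = dict(state)
    for action in (agent_action, partner_action):
        if action == "add_onion" and nxt["pot_onions"] < 3:
            nxt["pot_onions"] += 1
    return nxt


@dataclass
class RobustnessTest:
    name: str
    axis: str                        # "state" or "agent" robustness
    start_state: State               # possibly unusual starting state
    partner: Policy                  # possibly unusual collaboration partner
    check: Callable[[State], bool]   # is the final state reasonable?
    horizon: int = 5


def run_test(agent: Policy, test: RobustnessTest) -> bool:
    """Roll out the agent with the test's partner from the test's start state."""
    state = dict(test.start_state)
    for _ in range(test.horizon):
        state = step(state, agent(state), test.partner(state))
    return test.check(state)


def score_by_axis(agent: Policy, tests: List[RobustnessTest]) -> Dict[str, float]:
    """Fraction of tests passed, reported separately for each robustness axis."""
    results: Dict[str, List[bool]] = {}
    for test in tests:
        results.setdefault(test.axis, []).append(run_test(agent, test))
    return {axis: sum(passed) / len(passed) for axis, passed in results.items()}


# Hypothetical partners and agents for the toy environment.
helpful_partner: Policy = lambda s: "add_onion"
idle_partner: Policy = lambda s: "wait"
robust_agent: Policy = lambda s: "add_onion" if s["pot_onions"] < 3 else "wait"
brittle_agent: Policy = lambda s: "add_onion" if s["pot_onions"] == 0 else "wait"

pot_full = lambda s: s["pot_onions"] == 3

TESTS = [
    RobustnessTest("pot nearly full at start", "state",
                   {"pot_onions": 2}, helpful_partner, pot_full),
    RobustnessTest("partner never helps", "agent",
                   {"pot_onions": 0}, idle_partner, pot_full),
]

print(score_by_axis(robust_agent, TESTS))
print(score_by_axis(brittle_agent, TESTS))
```

In this toy setup the brittle agent relies on its partner to finish the pot, so it passes the state test but fails once the partner stops helping. Reporting pass rates per axis makes that kind of gap visible where a single aggregate number would hide it.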

In recent work on Overcooked, agents were trained to collaborate with humans in completing a cooking task. Although the trained agents achieved high reward with a limited set of humans at test time, they were far from robust: under some realistic circumstances that would have occurred in practice given enough evaluation runs, the agents systematically failed to cooperate with humans.

In the Overcooked-AI environment, we apply this methodology of using unit tests to evaluate robustness. While we acknowledge that a test suite cannot guarantee full edge case coverage, we argue that it is still a significant improvement over the current status quo of looking only at reward, which covers almost no edge cases.

In addition, current agents do not meet the robustness requirements from our current test suite – no deep RL agent scored above 65%. This suggests that our simple approach can serve as a good metric for the foreseeable future.

To show how this suite of tests can give more information about robustness than reward alone, we compare three proposals for improving robustness in human-AI cooperation scenarios:

  • Improving the quality of human models: using Theory of Mind (ToM) for human modeling instead of simple Behavior Cloning (BC).
  • Improving the diversity of human models that the agent is trained with: training with a population of agents (composed of BC agents, ToM agents, or a mix of the two).
  • Leveraging human-human gameplay data: human-human gameplay data can be used as an inductive bias for the relevant states that the AI agents should be able to perform well on. We leverage this data by starting the environment at training time from states sampled at random from the human data. We call this technique "diverse starts".
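The diverse starts idea in the last item can be sketched as a small sampling helper. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sample_start_state`, the flat list-of-trajectories data layout, and the `p_human` mixing probability are all hypothetical choices made for illustration.

```python
import random
from typing import Optional, Sequence, TypeVar

S = TypeVar("S")  # whatever type the environment uses for states


def sample_start_state(
    human_trajectories: Sequence[Sequence[S]],
    default_state: S,
    p_human: float = 1.0,              # assumption: chance of using a human state
    rng: Optional[random.Random] = None,
) -> S:
    """Pick the initial state for one training episode.

    With probability p_human, return a state drawn uniformly from all states
    seen in the human-human trajectories; otherwise return the default start.
    """
    rng = rng or random.Random()
    # Flatten so that every recorded state is equally likely to be chosen.
    states = [state for trajectory in human_trajectories for state in trajectory]
    if states and rng.random() < p_human:
        return rng.choice(states)
    return default_state


# Usage with placeholder states (strings stand in for real environment states).
trajectories = [["s0", "s1", "s2"], ["t0", "t1"]]
start = sample_start_state(trajectories, default_state="standard_start",
                           rng=random.Random(0))
```

Flattening the trajectories weights long games more heavily than short ones. Sampling a trajectory first and then a state within it would weight games equally instead. Which is preferable depends on the data, and the text above does not specify either.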

We find that the test suite provides insight into the effects of these proposals that was generally not revealed by looking solely at the average validation reward.

Through our experiments, we find that the unit tests revealed information about each method that was relatively uncorrelated with the average reward metric. Sometimes unit test robustness increased at the cost of average reward (as with initialization from states in human-human gameplay). Sometimes different types of robustness were affected while average reward stayed the same (as with the use of a single ToM agent as the partner). Sometimes unit test robustness remained the same while average reward improved (as with the use of a mixture of BC and ToM agents). And sometimes unit test robustness and average reward moved in sync (as with the effects of using a population).

In sum, our paper tries to show that unit test suites can provide a better source of information about agent robustness than reward alone, and so should likely be a part of any real-world deployment pipeline for AI agents in which robustness is critical.


Contact emails: mdc@berkeley.edu, Paul.Knott@nottingham.ac.uk