Towards Trustworthy Reinforcement Learning


Recent work has found that seemingly capable deep RL policies may harbour serious failure modes, being exploitable by an adversarial opponent acting in a shared environment. We seek to develop techniques to improve the robustness of deep RL policies, enabling deployment in safety-critical settings that may contain adversaries. We are currently investigating modifications to self-play. We have found that our technique trains policies to be robust to a broader range of opponents, but that the resulting hardened policy remains vulnerable to a smaller class of attacks.



To date, we have investigated two modifications to self-play to make training more robust:

  • Opponent diversity: rather than training against a single opponent, train against a broader range of opponents, such as both an opponent created by our adversarial attack and an opponent created by normal training. This helps avoid catastrophic forgetting, and can be viewed as a more computationally tractable version of population-based training.
  • Burst training: self-play is typically justified with reference to fictitious play, which is guaranteed to converge to a Nash equilibrium in finite two-player zero-sum games. Fictitious play requires each player to repeatedly compute a best response to the opponent's past play. However, self-play typically trains a deep RL policy for only a short period of time at each step, which is not enough to converge to even an approximately optimal response. As a result, self-play will tend to find local rather than global equilibria.

An obvious way to resolve this is to significantly increase the number of timesteps of deep RL training performed at each step of self-play. But this would be computationally expensive, and there is no way to know a priori what a sufficient number of timesteps would be.

Burst training instead gradually increases the number of timesteps with each epoch. This is computationally efficient since, initially, useful behaviour can be learned without training to convergence. Meanwhile, in the limit it will still allow adequate time for deep RL training to converge.
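
The schedule described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function name `burst_schedule`, the geometric growth rule, the specific numbers, and the choice to give the opponent and the victim the same burst length within an epoch are all assumptions made for illustration. The text only specifies that the number of timesteps increases with each epoch.

```python
from typing import Iterator, Tuple


def burst_schedule(
    initial_steps: int, growth: float, num_epochs: int
) -> Iterator[Tuple[int, str, int]]:
    """Yield (epoch, role, timesteps) for alternating opponent/victim bursts.

    Hypothetical helper: the burst length grows geometrically with the epoch,
    which is one possible way to "gradually increase" it.
    """
    if initial_steps <= 0:
        raise ValueError("initial_steps must be positive")
    if growth < 1.0:
        raise ValueError("growth must be >= 1 so that bursts never shrink")
    for epoch in range(num_epochs):
        steps = int(round(initial_steps * growth ** epoch))
        # Train the opponent against the frozen victim, then the victim
        # against the frozen opponent, each for the same burst length.
        yield epoch, "opponent", steps
        yield epoch, "victim", steps


# Example: four epochs, doubling the burst length each time.
schedule = list(burst_schedule(initial_steps=1000, growth=2.0, num_epochs=4))
for epoch, role, steps in schedule:
    print(f"epoch {epoch}: train {role} for {steps} timesteps")
```

Early epochs are cheap because the bursts are short, while the unbounded growth means that later bursts eventually become long enough for each side to approach a best response. A slower growth rule (for example, linear) would trade total compute against how quickly that regime is reached.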

Applying these techniques to a multi-agent simulated robotics game, we are able to fine-tune a pre-existing “victim” policy to be robust to a range of adversaries that the victim was originally vulnerable to. However, so far we have found that the adversary retains the upper hand, and is still able to exploit the victim in new ways.

Figure 1: burst training curves, alternating training the opponent (red background) and victim (green background), with the number of training timesteps increasing over time. The victim plays against two opponents: one opponent that is trained from a random initialization, and tends to exploit low-level bugs in the victim policy; and another opponent that is pre-trained to a more sensible high-level policy, such as tackling the victim. The victim reward (in yellow) decreases as the randomly initialized policy (in blue) learns. However, the victim improves qualitatively, becoming robust to attacks from earlier checkpoints.

We believe this technique may still be useful when trying to improve robustness to natural perturbations, where reducing the surface area of vulnerability can significantly decrease failure rates. However, our defence is not currently strong enough to protect against a determined adversary with access to our policy, although it will increase the cost of the attack by demanding greater compute.

We plan to investigate ways to scale this defence further, as we have so far only trained for tens of millions of timesteps, whereas the agents we were fine-tuning were trained for billions of timesteps. It is likely this defence will work considerably better with sufficient compute. We also intend to investigate alternatives, such as population-based training with a greater diversity of agents, and methods to detect.

Note this triad is unfunded and as a result Adam, the PhD student forming part of this triad, has so far only been able to work on this collaboration on a part-time basis due to commitments he has had to other, funded projects. He has been advising Sergei Volodin, an undergraduate researcher, who is responsible for much of the implementation and experiments documented in this report.