The goal of this project is to use reinforcement learning to build pixel-based agents that transfer zero-shot from a simulated environment to reality, which is challenging due to variations and mismatches in transition dynamics, sensor readings, and environment conditions. The dominant current technique for handling this mismatch is hand-designed domain randomization, in which an engineer picks a set of features that the agent should be robust to (e.g., colors of segmentation masks, distractor objects) and uniformly varies those features during training. This requires both human intuition and code to modify the simulator. We propose a code-free, risk-sensitive variant of this procedure in which we select our randomizations in a latent space, allowing us to automatically select the riskiest randomizations. By training the agent against an adversary that selects these randomizations in a zero-sum game, we can construct RL agents that are robust to visual and semantic variations.
Fig. 1. Randomized color segment masks (Tobin et al., 2017)
Domain randomization, in which features that the agent should be robust to are randomly varied, is one of the most important existing techniques for generalization from simulator to reality. As an example, Fig. 1 shows randomized color segments and object colors in an object picking task. Because the colors vary, the agent learns that it cannot rely on color and develops a policy that is invariant to it. However, to make this technique work, a researcher had to identify that color invariance was important and write code to vary the colors of the segmentation masks and objects.
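As a rough illustration of the hand-coded approach described above, the following minimal sketch recolors a segmentation mask with uniformly sampled colors. The helper names (`sample_randomized_colors`, `recolor`) and the tiny mask are hypothetical and chosen only for illustration; they are not taken from Tobin et al. or from any particular simulator.

```python
import random


def sample_randomized_colors(num_segments, rng):
    """Draw one RGB color per segment id, uniformly at random."""
    return [tuple(rng.randint(0, 255) for _ in range(3))
            for _ in range(num_segments)]


def recolor(mask, palette):
    """Replace each segment id in a 2D mask with its RGB color."""
    return [[palette[segment_id] for segment_id in row] for row in mask]


# Hypothetical 2x3 segmentation mask with three segment ids (0, 1, 2).
rng = random.Random(0)
mask = [[0, 0, 1],
        [2, 1, 1]]

# Each training episode would draw a fresh palette, so color carries no signal.
palette = sample_randomized_colors(3, rng)
image = recolor(mask, palette)
```

Note that even this small example requires someone to decide that color is the feature to vary and to write the recoloring code, which is the manual step the proposal aims to remove.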
Instead of randomizing hand-selected features, we propose to learn the same invariances using the simulator. We will use an autoencoder-based dynamics model to compress the image into a set of factorized latents and then perform randomization on the latents instead of at the image level. For an appropriately constructed autoencoder, the latents can control both the semantics of the scenario and its dynamics, and can occasionally be used to create samples outside the training distribution. An appropriately constructed latent distribution would allow us to randomize features like color, the number of objects in a scenario, and the masses of objects. By rolling the dynamics forward entirely in the latent space while applying perturbations to the latents, we can create a wide array of new training settings without having to write any new code.
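The idea of perturbing latents and rolling the dynamics forward in latent space can be sketched as follows. This is a toy illustration under stated assumptions: `toy_latent_dynamics` is a hypothetical stand-in for the learned dynamics model, the latent is a short list of floats, and the perturbation is Gaussian noise applied once at the start of the rollout. The actual model and perturbation scheme may differ.

```python
import random


def toy_latent_dynamics(z, action):
    """Hypothetical stand-in for a learned latent dynamics model z' = f(z, a)."""
    return [0.9 * zi + 0.1 * action for zi in z]


def perturb_latent(z, scale, rng, dims=None):
    """Add Gaussian noise to the chosen latent dimensions (all if dims is None)."""
    dims = range(len(z)) if dims is None else dims
    perturbed = list(z)
    for i in dims:
        perturbed[i] += rng.gauss(0.0, scale)
    return perturbed


def rollout_in_latent_space(z0, actions, scale, rng, dims=None):
    """Perturb the initial latent, then roll the dynamics forward without decoding."""
    z = perturb_latent(z0, scale, rng, dims)
    trajectory = [z]
    for action in actions:
        z = toy_latent_dynamics(z, action)
        trajectory.append(z)
    return trajectory


rng = random.Random(0)
z0 = [0.0, 1.0, -1.0, 0.5]  # assumed: the encoder output for one observation
trajectory = rollout_in_latent_space(
    z0, actions=[1.0, 0.0, -1.0], scale=0.3, rng=rng, dims=[0, 2]
)
```

Restricting `dims` to a subset mimics randomizing a single factor (say, color) in a factorized latent while leaving the others untouched. Which dimensions correspond to which factors would have to be determined empirically.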
Furthermore, our approach allows us to align domain randomizations with notions of robustness. Given the latent structure, we can search over latent perturbations using the value function, or learn an adversary, allowing us to quickly find the perturbations that actually degrade the performance of our RL agent. In contrast, standard domain randomization can require thousands of environments to even begin to generalize effectively, likely because most of the randomized environments are uninformative.
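One simple way to realize the value-function search described above is random search over bounded latent perturbations, keeping the candidate with the lowest estimated value. The sketch below is an assumption-laden illustration rather than the project's method: `toy_value_function` is a hypothetical stand-in for the agent's learned value estimate, and the per-dimension `budget` and uniform sampling are arbitrary choices. A learned adversary or gradient-based search could replace the inner loop.

```python
import random


def toy_value_function(z):
    """Hypothetical stand-in for the agent's value estimate V(z); peaks at z = 1."""
    return -sum((zi - 1.0) ** 2 for zi in z)


def find_worst_perturbation(z, value_fn, budget, num_candidates, rng):
    """Random search for the bounded perturbation that minimizes value_fn."""
    best_delta = [0.0] * len(z)
    best_value = value_fn(z)
    for _ in range(num_candidates):
        # Sample each coordinate within [-budget, budget].
        delta = [rng.uniform(-budget, budget) for _ in z]
        value = value_fn([zi + di for zi, di in zip(z, delta)])
        if value < best_value:
            best_delta, best_value = delta, value
    return best_delta, best_value


rng = random.Random(0)
z = [1.0, 1.0, 1.0]
delta, worst_value = find_worst_perturbation(
    z, toy_value_function, budget=0.5, num_candidates=200, rng=rng
)
```

Training on the latent `z + delta` rather than on a uniformly sampled perturbation concentrates experience on the settings where the agent currently performs worst, which is the sense in which the randomization is risk-sensitive.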
Our initial work has focused on understanding the latent structure of autoencoders trained on procedurally generated environments such as CoinRun. These environments are essentially generated by domain randomization, allowing us to focus on the question of whether the latents can be appropriately perturbed to create randomizations when the training distribution has sufficient variety. As Fig. 2 demonstrates, the answer is affirmative: given sufficient variety in the training distribution, we can generate color and scene variations. We can now proceed to investigate whether adversarial selection of these latents is sufficient to induce robustness or generalization to a test set of CoinRun environments.