Task-Specific World Models for Robotic Manipulation

In this project, we aim to develop methods that learn world models for agents solving difficult real-world robotics tasks. Specifically, we focus on tasks, such as cable manipulation, that require modeling fine-grained details of a scene to make accurate future predictions. To do this, we will explore models that localize spatial regions of interest in images and construct patch- or region-based selection methods to acquire fine-grained detail.
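As a rough illustration of what region-based selection might look like, the sketch below crops fixed-size, full-resolution patches around detected regions of interest. The function name, patch size, and ROI format are illustrative assumptions, not part of the project's actual method:

```python
import numpy as np

def extract_roi_patches(image, rois, patch_size=64):
    """Crop fixed-size patches centered on regions of interest.

    `image` is an (H, W, C) array; `rois` is a list of (y, x) patch
    centers in pixel coordinates. Hypothetical helper for illustration.
    """
    h, w = image.shape[:2]
    half = patch_size // 2
    patches = []
    for y, x in rois:
        # Clamp the window so the crop stays inside the image bounds.
        y0 = min(max(y - half, 0), h - patch_size)
        x0 = min(max(x - half, 0), w - patch_size)
        patches.append(image[y0:y0 + patch_size, x0:x0 + patch_size])
    return np.stack(patches)
```

In practice the ROI centers would come from a learned localization module (e.g. a detector or attention map) rather than being supplied by hand.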



We approach this problem in a video prediction setting by learning image-based world models for robotic control tasks. Existing methods typically operate on fixed-resolution images and can discard fine-grained scene information when trading off image resolution against processing speed for control. We are exploring a method that uses observations at both higher and lower resolutions for efficient and accurate future video prediction. This builds on existing work on object detection and segmentation, and on the less explored area of video models grounded in object instances or action scene graphs. Future work may extend to using the video prediction model for visual planning or other model-based RL techniques.
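One simple way to realize a mixed-resolution observation, sketched below under our own assumptions about the preprocessing, is to pair a downsampled global view of the frame with a full-resolution crop around a region of interest. The function, sizes, and downsampling scheme are hypothetical:

```python
import numpy as np

def build_observation(frame, roi_center, low_res=(32, 32), patch_size=64):
    """Pair a coarse global view with a fine local crop.

    `frame` is an (H, W, C) array; `roi_center` is a (y, x) pixel
    coordinate. The coarse view supplies scene context cheaply, while
    the full-resolution patch preserves fine detail (e.g. the exact
    configuration of a cable).
    """
    h, w = frame.shape[:2]
    # Nearest-neighbor downsample via index selection (illustrative).
    ys = np.linspace(0, h - 1, low_res[0]).astype(int)
    xs = np.linspace(0, w - 1, low_res[1]).astype(int)
    coarse = frame[np.ix_(ys, xs)]
    # Full-resolution crop around the ROI, clamped to the frame bounds.
    half = patch_size // 2
    y0 = min(max(roi_center[0] - half, 0), h - patch_size)
    x0 = min(max(roi_center[1] - half, 0), w - patch_size)
    fine = frame[y0:y0 + patch_size, x0:x0 + patch_size]
    return coarse, fine
```

A video prediction model could then consume both tensors, keeping per-step cost close to that of a low-resolution model while retaining detail where it matters.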