Large-scale 3D Reconstruction from Multi-view Image Datasets

This ongoing project uses large-scale multi-view image datasets available online to build a multi-view 3D reconstruction approach that works on wide-baseline images.



Current computer vision systems for 3D reconstruction from images do not work in the wild on everyday objects when only a few images are available. They either learn no priors and require hundreds of images to reconstruct a single scene, rely on CAD models, which are limited and finite, or are weakly supervised approaches trained per category. However, a single image often contains enough information to estimate and reason about a rough 3D shape of the underlying object. For example, in this FB Marketplace image, even from a single view we can roughly tell the shape of the stroller: two big and two small wheels, a U-shaped rod for pushing it, and a concave seat in which the child sits. More views supplement our understanding of the shape and make it more accurate. Still, obtaining ground-truth 3D shape annotations for such in-the-wild images is almost impossible. Therefore, any successful 3Dification approach will likely rely on multi-view supervision.

Our goal is to build a large-scale 3Dification approach that can learn to reconstruct shape from a few views for a wide variety of objects, using natural multi-view supervision during training. Such supervision will come from videos and product listings available online, which typically show multiple views of the item being sold. Our plan is to:

  1. Use multi-view supervision from Amazon and Facebook Marketplace listings to build 3D models of a diverse variety of objects.
  2. Train a single-view shape predictor supervised by these models, then predict shape from single-view images at test time.
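To make the multi-view supervision signal concrete, the sketch below shows one common way to formulate it: a predicted 3D point set is projected into several views with known cameras, and the training loss is the reprojection error against the 2D observations in each view. This is a minimal toy illustration with synthetic cameras and points, not the project's actual method; the pinhole model, camera matrices, and point sets are all assumptions made for the example.

```python
import numpy as np

def project(points_3d, camera):
    """Pinhole projection with focal length 1; camera is a 3x4 [R|t] matrix."""
    homog = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    cam_pts = homog @ camera.T              # points in the camera frame, (N, 3)
    return cam_pts[:, :2] / cam_pts[:, 2:]  # perspective divide -> (N, 2)

def multiview_loss(pred_points, cameras, observations):
    """Mean squared reprojection error of a predicted shape across all views."""
    errs = [np.mean((project(pred_points, C) - obs) ** 2)
            for C, obs in zip(cameras, observations)]
    return float(np.mean(errs))

# Synthetic ground-truth shape (3 points in front of the cameras)
# and two cameras: one at the origin, one translated along x.
gt = np.array([[0.0, 0.0, 4.0], [1.0, 0.0, 5.0], [0.0, 1.0, 6.0]])
cams = [np.hstack([np.eye(3), np.array([[0.0], [0.0], [0.0]])]),
        np.hstack([np.eye(3), np.array([[0.5], [0.0], [0.0]])])]
obs = [project(gt, C) for C in cams]  # the "multi-view supervision"

print(multiview_loss(gt, cams, obs))            # correct shape -> 0.0
print(multiview_loss(gt + 0.1, cams, obs) > 0)  # perturbed shape -> True
```

In the real setting the observations would come from listing photos or video frames rather than synthetic projections, and the loss would be minimized over the parameters of a shape-prediction network rather than over the points directly.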

Such a system can reason about shape and occlusion, augmenting object recognition capabilities. In addition, it will have immediate applications in AR/VR and robotics.