We investigated methods for combining the strengths of semantic (instance) segmentation and monocular depth estimation to predict sharp, high-quality relative depth maps for Augmented Reality applications.
- Utkarsh Singhal, UC Berkeley, firstname.lastname@example.org
- Stella Yu, UC Berkeley, email@example.com
- Chun Kai Wang, Amazon, firstname.lastname@example.org
- Michael Lou, Amazon, email@example.com
Introduction: For Augmented Reality (AR) applications, knowing the relative depth of every point in a scene is necessary for producing convincing augmented images. Since handheld phone cameras are ubiquitous, monocular depth estimation is a practical way to obtain a scene's depth map. However, existing monocular depth estimation methods produce blurry depth maps with inaccurate occlusion edges, and they often mis-group small objects, confusing foreground with background; such depth maps are unsuitable for AR. Modern semantic segmentation methods, on the other hand, produce sharp outputs with high-quality object boundaries on diverse inputs. In this project, we combine the strengths of both tasks via a pairwise-affinity formulation.
Research Approach: We tackle the problem on three fronts: (a) to model grouping explicitly, we compute pairwise affinities between neighboring pixels and use them to selectively pool semantic and geometric features; (b) to overcome the limitations of RGB-D sensors (noise, missing data), we train on a combination of RGB-D data and high-quality, large-scale synthetic datasets such as Hypersim; (c) we analyze the effect of different loss functions and adopt a robust reverse Huber (berHu) loss on pairwise depth differences to encourage sharp relative depth maps.
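To illustrate point (a), the sketch below shows one plausible form of affinity-weighted pooling in NumPy: each pixel mixes its own feature with its horizontal and vertical neighbors', weighted by a per-edge affinity, so features propagate within an object but are blocked at low-affinity (boundary) edges. The function name, the 4-neighbor scheme, and the normalization are our illustrative assumptions, not the exact layer used in the model.

```python
import numpy as np

def affinity_pool(feat, aff_right, aff_down):
    """One affinity-weighted pooling step (illustrative sketch).

    feat:      (H, W, C) feature map.
    aff_right: (H, W-1) affinity between pixel (i, j) and (i, j+1), in [0, 1].
    aff_down:  (H-1, W) affinity between pixel (i, j) and (i+1, j), in [0, 1].
    Returns the affinity-normalized average of each pixel and its neighbors.
    """
    out = feat.copy()                      # self term has weight 1
    wsum = np.ones(feat.shape[:2])
    # Mix in right/left neighbors, weighted by the shared edge affinity.
    out[:, :-1] += aff_right[..., None] * feat[:, 1:]
    wsum[:, :-1] += aff_right
    out[:, 1:] += aff_right[..., None] * feat[:, :-1]
    wsum[:, 1:] += aff_right
    # Mix in down/up neighbors.
    out[:-1] += aff_down[..., None] * feat[1:]
    wsum[:-1] += aff_down
    out[1:] += aff_down[..., None] * feat[:-1]
    wsum[1:] += aff_down
    return out / wsum[..., None]
```

With all affinities at zero, every pixel keeps its own feature; with high affinities, features are smoothed within the neighborhood. In a trained network this step would typically be repeated several times with learned affinities.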
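For point (c), the reverse Huber (berHu) penalty is L1 near zero and quadratic in the tails; applied to differences between neighboring depths, it compares predicted and ground-truth pairwise depth gaps rather than absolute depths. The following is a minimal NumPy sketch under our own assumptions (4-neighbor differences, an adaptive threshold set to a fraction of the maximum residual); the exact neighborhood and threshold in the paper may differ.

```python
import numpy as np

def berhu(residual, c):
    """Reverse Huber: |x| for |x| <= c, (x^2 + c^2) / (2c) otherwise."""
    a = np.abs(residual)
    return np.where(a <= c, a, (a ** 2 + c ** 2) / (2 * c))

def pairwise_depth_loss(pred, gt, c_frac=0.2):
    """berHu loss on horizontal and vertical neighbor depth differences."""
    # Depth gaps to the right and lower neighbor, for prediction and GT.
    dp = [pred[:, 1:] - pred[:, :-1], pred[1:, :] - pred[:-1, :]]
    dg = [gt[:, 1:] - gt[:, :-1], gt[1:, :] - gt[:-1, :]]
    res = np.concatenate([(p - g).ravel() for p, g in zip(dp, dg)])
    # Adaptive threshold (a common berHu choice); epsilon avoids c = 0.
    c = c_frac * np.abs(res).max() + 1e-8
    return float(berhu(res, c).mean())
```

Because the loss is defined on depth differences, it is invariant to a global depth offset, which suits relative depth estimation; the quadratic tail penalizes large boundary errors more strongly than plain L1.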
Results: On the target test set ibims-100, our model outperforms the previous state of the art by 4% on DBE (depth boundary error) completeness and by 11% on DBE accuracy. Qualitatively, it produces semantically consistent depth maps while preserving thin structures in foreground objects; in several test examples, it predicted plausible depth even in regions where the ground-truth RGB-D sensor failed. A compilation of visual results can be found here.