Top: Direct shape alignment to stereo images using PCA shape priors [ ]. Bottom: Multi-object scene parsing and reconstruction using deep shape priors [ ].
Representing scenes at the granularity of objects is a prerequisite for scene understanding and decision making. In autonomous driving, for instance, the ability to parse scenes into static and moving objects and to obtain 3D information about them helps in reasoning about possible actions and collision-free paths. In DirectShape [ ], we propose a novel method for estimating the pose and shape of vehicles in stereo images that does not require a dense stereo reconstruction. Instead, a shape space of vehicle objects, represented in a low-dimensional PCA embedding, is fitted to the stereo images using direct image alignment principles. We demonstrate that our approach improves a variety of deep-learning-based 3D object detectors, which we use to initialize the pose estimates of the objects.
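The idea of a low-dimensional PCA shape space can be illustrated with a small numerical sketch. The toy data, the 300-dimensional shape vectors, and the 5-dimensional latent space below are assumptions for illustration only; DirectShape fits such shape coefficients, jointly with the object pose, by minimizing photometric residuals in the stereo pair rather than by projecting known shapes.

```python
import numpy as np

# Toy illustration (not the DirectShape implementation): a PCA shape space.
# Shapes are flattened vectors (e.g. SDF samples or vertex coordinates);
# the training shapes here are random placeholders.
rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 300))      # 50 hypothetical training shapes

mean = shapes.mean(axis=0)
U, S, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
basis = Vt[:5]                           # keep a 5-dimensional embedding

def decode(z):
    """Reconstruct a full shape vector from low-dimensional coefficients z."""
    return mean + z @ basis

def encode(shape):
    """Project a shape vector into the latent coefficient space."""
    return (shape - mean) @ basis.T

z = encode(shapes[0])                    # 5 coefficients
recon = decode(z)                        # approximate 300-dim shape
```

In the actual method, `z` and the 6-DoF pose are the free variables of a direct image alignment objective, so no per-pixel depth estimate is needed.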
In Elich et al. [ ], we use deep-learning-based shape spaces of various object categories, including typical household objects. We devise an encoder-decoder architecture that recursively parses RGB images into individual objects together with their shape parameters, texture, 3D position, and orientation. The decoder is implemented as a differentiable renderer that renders the signed distance field representation of the objects, together with their texture, back into images. In this way, the model can be trained in a self-supervised manner on RGB-D images. The method achieves competitive results in object segmentation and image reconstruction compared to previous approaches that do not use explicit 3D representations.
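Rendering a signed distance field amounts to finding, for each camera ray, the point where the SDF crosses zero. A common way to do this is sphere tracing, sketched below on an analytic sphere SDF; this is a generic toy example, not the paper's learned SDF decoder or its differentiable renderer.

```python
import numpy as np

def sdf_sphere(p, center, radius):
    """Signed distance from point p to a sphere surface (negative inside)."""
    return np.linalg.norm(p - center) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4):
    """March along a unit-direction ray; the SDF value is always a safe
    step size, so we advance by it until we are within eps of the surface."""
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        if d < eps:
            return t                     # hit: depth along the ray
        t += d
    return None                          # miss within the step budget

# A unit sphere 3 units in front of the camera; a ray along +z
# should intersect its near side at depth 2.
sdf = lambda p: sdf_sphere(p, np.array([0.0, 0.0, 3.0]), 1.0)
depth = sphere_trace(np.zeros(3), np.array([0.0, 0.0, 1.0]), sdf)
```

With a learned, differentiable SDF in place of the analytic sphere, the rendered depth and color become differentiable in the shape and texture parameters, which is what enables the self-supervised training on RGB-D images.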
In a collaboration with the Computer Graphics group at the University of Tuebingen, we developed a meta-learning approach for multi-view stereo reconstruction [ ]. Through meta-learning, the method is trained to adapt more effectively to novel datasets via self-supervised learning.
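The general pattern of meta-learning for fast adaptation can be sketched with a Reptile-style loop: an inner loop fine-tunes on one task with a task-specific loss, and an outer loop nudges the shared initialization toward the adapted parameters. Everything below is a hypothetical toy (a least-squares loss stands in for the self-supervised photometric objective, and tasks are random linear problems); it is not the algorithm of the cited paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def inner_adapt(theta, X, y, lr=0.05, steps=10):
    """A few gradient steps on one task's loss (toy stand-in for
    self-supervised fine-tuning on a novel dataset)."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

theta = np.zeros(3)                      # meta-learned initialization
for task in range(20):                   # each task = one hypothetical scene
    w_true = rng.normal(size=3)
    X = rng.normal(size=(32, 3))
    y = X @ w_true
    adapted = inner_adapt(theta, X, y)
    theta += 0.1 * (adapted - theta)     # Reptile-style outer update
```

The outcome is an initialization `theta` from which a handful of self-supervised gradient steps already fit a new task well, which is the property the meta-learned multi-view stereo model exploits on novel datasets.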