Representing scenes at the granularity of objects is a prerequisite for scene understanding and decision making. For instance, in autonomous driving, the ability to parse scenes into static and moving objects and to recover their 3D information is useful for reasoning about possible actions and collision-free paths.
In our prior work, we have used shape spaces represented in a latent embedding of signed distance functions (SDFs) to estimate object pose and shape from stereo images (e.g.~[ ]). In Elich et al. [ ], we use deep-learning-based shape spaces of various object categories, including typical household objects. We devise a deep encoder-decoder architecture which recursively parses RGB images into individual objects together with their shape parameters, texture, 3D position, and orientation. The decoder is implemented as a differentiable renderer which renders the signed distance field representation of the objects and their texture back into images. In this way, the model can be trained in a self-supervised manner on RGB-D images. The method achieves competitive results in object segmentation and image reconstruction compared to previous approaches that do not use explicit 3D representations.
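To make the main ingredients of such a pipeline concrete, the following Python sketch illustrates how a latent shape code can condition an SDF decoder and how sphere tracing can render the predicted SDF to a depth map that is compared against the depth channel of an RGB-D image for self-supervision. This is only an illustrative sketch under assumed names, network sizes, and loss choices, not the architecture of Elich et al. [ ].

\begin{verbatim}
import torch
import torch.nn as nn

class LatentSDFDecoder(nn.Module):
    """DeepSDF-style decoder: maps a latent shape code and a 3D query
    point (in the object frame) to a signed distance value."""
    def __init__(self, latent_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, latent, points):
        # latent: (B, latent_dim), points: (B, N, 3) -> sdf: (B, N)
        z = latent.unsqueeze(1).expand(-1, points.shape[1], -1)
        return self.mlp(torch.cat([z, points], dim=-1)).squeeze(-1)

def sphere_trace_depth(decoder, latent, ray_origins, ray_dirs, n_steps=32):
    """Differentiable sphere tracing: march along each camera ray by the
    predicted signed distance until the surface (SDF close to 0) is hit."""
    depth = torch.zeros(ray_origins.shape[:-1], device=ray_origins.device)
    for _ in range(n_steps):
        points = ray_origins + depth.unsqueeze(-1) * ray_dirs
        sdf = decoder(latent, points)
        depth = depth + sdf  # step forward by the signed distance
    return depth  # (B, N) rendered depth along each ray

# Hypothetical self-supervised depth term against the RGB-D input:
# depth_pred = sphere_trace_depth(decoder, latent, origins, dirs)
# loss = torch.nn.functional.l1_loss(depth_pred, depth_gt)
\end{verbatim}

Because the rendered depth is differentiable with respect to the latent shape code (and, if the query points are transformed into the object frame, with respect to the object pose), a reconstruction loss of this kind can supervise shape and pose jointly without ground-truth 3D annotations.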
todo: physics-based object tracking
todo: physically plausible object pose estimation
todo: event-based non-rigid tracking