TL;DR
We train a pixel-space diffusion model from scratch to jointly predict normals, albedo, and reflectance from three-frame videos of object motion.
Abstract
Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel-space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: For static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, and by using generative perception to capture visual ambiguities, our work suggests ways to improve visual reasoning in physically-embodied systems.
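The abstract describes a conditional diffusion model that denoises stacked shape-and-material maps in pixel space, conditioned on a short video. The sketch below is a minimal illustration of that idea under assumptions not taken from the paper: a DDPM-style ancestral sampler, a toy convolutional denoiser (the name `AttributeDenoiser` and its design are hypothetical), and a 9-channel output stacking normals, albedo, and reflectance.

```python
# Minimal sketch of conditional pixel-space diffusion sampling, for illustration only.
# Assumptions (not from the paper): DDPM-style sampling, a placeholder conv denoiser,
# and a 9-channel output stacking normals (3), albedo (3), and reflectance (3).
import torch
import torch.nn as nn

class AttributeDenoiser(nn.Module):
    """Hypothetical denoiser: predicts noise on the stacked attribute maps,
    conditioned on a short (3-frame) video concatenated along channels."""
    def __init__(self, video_ch=9, attr_ch=9, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(video_ch + attr_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, attr_ch, 3, padding=1),
        )

    def forward(self, x_t, video, t):
        # A real model would also embed the timestep t; omitted to keep the sketch short.
        return self.net(torch.cat([x_t, video], dim=1))

@torch.no_grad()
def sample(model, video, steps=50):
    """DDPM-style ancestral sampling of shape-and-material maps given the video."""
    b, _, h, w = video.shape
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(b, 9, h, w)                      # start from pure noise
    for t in reversed(range(steps)):
        eps = model(x, video, t)                     # predicted noise
        a_t, ab_t = alphas[t], alpha_bars[t]
        mean = (x - (1 - a_t) / torch.sqrt(1 - ab_t) * eps) / torch.sqrt(a_t)
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    normals, albedo, reflectance = x.split(3, dim=1)
    return normals, albedo, reflectance

# Usage: a 3-frame RGB video stacked along channels (3 frames x 3 channels = 9).
video = torch.randn(1, 9, 64, 64)
normals, albedo, reflectance = sample(AttributeDenoiser(), video)
```

Because the sampler starts from independent noise each time, repeated calls produce different plausible attribute maps for the same input, which is the property the paper exploits to represent ambiguity.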
Qualitative Comparison
Here we compare to DiffusionRenderer (CVPR 2025), another video-conditioned model that can output albedo and normals. Objects 5 (Teapot) and 6 (Cactus) are from the Stanford-ORB dataset (NeurIPS 2023); the input comes from irregular camera motion that is different from the horizontal object motion that we use for training.
Extending to Longer Observations
Capturing Ambiguity
Reducing Ambiguity by Exploiting Motion
There is usually less ambiguity about shape and material in a video than in a static image. A good example is Hartung and Kersten's "Shiny or Matte" perceptual demonstration, shown here. The still image can be seen as either shiny or matte material, but once the object moves, we immediately perceive it as one or the other.
Our model behaves the same way. Given the static image, it produces a variety of albedo maps, including both matte and shiny interpretations. Given the shiny motion video, its albedo predictions collapse to a smaller set of spatially uniform possibilities. (The figure on the right shows a PCA embedding of 100 albedo-map samples for each of the static and motion cases.)
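The sketch below illustrates, under assumptions of ours rather than the authors' analysis code, how one might project many sampled albedo maps into 2D with PCA to visualize the distribution tightening when motion is observed. The helper name `embed_albedo_samples` and the placeholder random data are hypothetical.

```python
# Minimal sketch (not the authors' analysis code): project sampled albedo maps
# into 2D with PCA to visualize how the sample distribution tightens with motion.
import numpy as np
from sklearn.decomposition import PCA

def embed_albedo_samples(static_samples, motion_samples):
    """static_samples, motion_samples: arrays of shape (N, H, W, 3) of albedo maps."""
    flat = lambda s: s.reshape(len(s), -1)  # flatten each map to one vector per sample
    all_samples = np.concatenate([flat(static_samples), flat(motion_samples)], axis=0)
    coords = PCA(n_components=2).fit_transform(all_samples)
    return coords[:len(static_samples)], coords[len(static_samples):]

# Placeholder data standing in for 100 model samples per condition.
static = np.random.rand(100, 64, 64, 3)   # diverse samples from the static image
motion = np.random.rand(100, 64, 64, 3)   # samples from the motion video
static_2d, motion_2d = embed_albedo_samples(static, motion)
print(static_2d.shape, motion_2d.shape)   # (100, 2) (100, 2)
```

Plotting the two sets of 2D points (e.g., with a scatter plot) would show a spread-out cloud for the static case and a compact cluster for the motion case.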
Model Architecture
Results Visualizations
(1) Objects with synthetic textures:
BibTeX
@article{han2025generative,
  title={Generative Perception of Shape and Material from Differential Motion},
  author={Han, Xinran Nicole and Nishino, Ko and Zickler, Todd},
  journal={arXiv preprint arXiv:2506.02473},
  year={2025}
}
Acknowledgements
We thank Jianbo Shi for reviewing the manuscript. We also thank Kohei Yamashita for guidance on the data rendering pipeline and Boyuan Chen for discussions about video diffusion modeling. This work was supported in part by NSF cooperative agreement PHY-2019786 (an NSF AI Institute, http://iaifi.org) and by JSPS 21H04893 and JST JPMJAP2305.
The website template was borrowed from Michaël Gharbi.