Abstract
Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when uncertain, they move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: for static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions quickly converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, our work suggests a generative perception approach for improving visual reasoning in physically embodied systems.
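To make the sampling idea concrete, here is a minimal PyTorch-style sketch of drawing one shape-and-material hypothesis from a video-conditioned diffusion model. The denoiser interface, channel layout, and noise schedule are illustrative assumptions, not the paper's actual architecture; the point is only that repeated sampling with different seeds yields the multimodal hypotheses described above.

```python
import torch

# Assumed DDPM-style linear noise schedule; the real model's schedule differs.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_maps(denoiser, video_frames, out_channels=7, size=128):
    """Draw one sample of stacked shape-and-material maps (e.g. normals,
    albedo, roughness) conditioned on a short object-motion video.

    `denoiser(x_t, t, video_frames)` is a hypothetical conditional network
    that predicts the added noise; it is not the authors' interface.
    """
    x = torch.randn(1, out_channels, size, size)            # start from noise
    for t in reversed(range(T)):
        eps = denoiser(x, torch.tensor([t]), video_frames)  # predicted noise
        a, ab = alphas[t], alpha_bars[t]
        mean = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise             # ancestral step
    return x  # different random seeds give different hypotheses
```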

Interdependence of Shape and Material

Static Object Examples: Multiple shape and material hypotheses on ambiguous online images. (a) A cat on an illusion carpet that is painted to resemble a hole. When interpreted as a hole (Sample 1), the predicted albedo is brighter in the center region (yellow box) than when interpreted as a plane (Sample 2). (b) A painting of a pumpkin is interpreted as either a 3D shape with orange color or a planar shape with painted texture.
Shiny or Matte Ambiguity (by Hartung & Kersten)
Moving Object Examples: An object (such as the teapot or the croissant shape above) can be ambiguous when viewed as a static observation. Once we see the object in motion, we can often easily perceive its underlying material. In both videos, the object is first rendered with a shiny surface reflecting the environment, and then switches to a diffuse object with painted texture.
Why Generative Perception?

Motion Disambiguation Examples: Each plot shows the 2D PCA embedding of 100 albedo samples corresponding to the frame (inset) that is common between two distinct input videos. Left: When a shiny teapot moves, the albedo samples converge to a subset of the ones produced for the static teapot. Right: The albedo samples for a matte-rendered motion video are clearly separated from those of a shiny-rendered video. Note that spatially-uniform albedos form tighter clusters in PCA space, while highly textured ones exhibit greater variation.
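The sketch below shows how such a plot can be produced: each sampled albedo map is flattened into a vector and projected to two dimensions with PCA. The variable names and array shapes are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np
from sklearn.decomposition import PCA

def embed_albedo_samples(albedo_samples):
    """2D PCA embedding of sampled albedo maps.

    `albedo_samples` is assumed to be an array of shape (N, H, W, 3),
    e.g. 100 albedo samples for the frame shared by two input videos.
    """
    flat = albedo_samples.reshape(len(albedo_samples), -1)  # one row per sample
    coords = PCA(n_components=2).fit_transform(flat)        # (N, 2) embedding
    return coords

# Usage idea: embed the static-video and motion-video samples jointly to see
# whether the motion samples collapse onto a subset of the static cluster.
```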
Model Architecture

Results Visualizations
(1) Objects with synthetic textures:
(2) Objects with original, artist-designed textures:
Motion Disambiguation - Perception Test
Extending to Longer Observations
Acknowledgements
We thank Jianbo Shi for reviewing the manuscript. We also thank Kohei Yamashita for guidance on the data rendering pipeline and Boyuan Chen for discussions about video diffusion modeling. This work was supported in part by the NSF cooperative agreement PHY-2019786 (an NSF AI Institute, http://iaifi.org) and by JSPS 21H04893 and JST JPMJAP2305.
The website template was borrowed from Michaël Gharbi.