TL;DR

We train a pixel-space diffusion model from scratch to jointly predict normals, albedo and reflectance from three-frame videos of object motion.

Abstract

Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they often move their head slightly or rotate the object to help resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel-space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: For static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single-view to continuous motion observations, and by using generative perception to capture visual ambiguities, our work suggests ways to improve visual reasoning in physically-embodied systems.

Qualitative Comparison

Here we compare to DiffusionRenderer (CVPR 2025), another video-conditioned model that can output albedo and normals. Objects 5 (Teapot) and 6 (Cactus) are from the Stanford-ORB dataset (NeurIPS 2023); for these, the input comes from irregular camera motion rather than the horizontal object motion we use for training.

Extending to Longer Observations

We generate longer videos of shape and material by jointly predicting several overlapping three-frame clips, regularized by inference-time consistency guidance; a minimal sketch of this guidance step follows the comparison below. We compare with the shape predictions of StableNormal, using their official video-mode demo.

Comparison panels, left to right: Input, Ours (normal and albedo), StableNormal.
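As a rough illustration of how overlapping clips can be tied together at inference time, here is a minimal sketch in PyTorch, assuming each clip is a (frames, channels, height, width) tensor; the function name consistency_guidance and the simple frame-averaging consensus are illustrative choices of ours, not the exact guidance used by the model.

import torch

def consistency_guidance(x0_clips, overlap=1, weight=0.5):
    """Illustrative sketch: nudge overlapping three-frame clips toward
    agreeing on their shared frames during sampling.

    x0_clips: list of predicted clean clips, each of shape (T, C, H, W),
              where clip i's last `overlap` frames depict the same time
              steps as clip i+1's first `overlap` frames.
    """
    guided = [c.clone() for c in x0_clips]
    for i in range(len(x0_clips) - 1):
        tail = x0_clips[i][-overlap:]       # shared frames as seen by clip i
        head = x0_clips[i + 1][:overlap]    # same frames as seen by clip i+1
        shared = 0.5 * (tail + head)        # simple average as the consensus
        # Pull both clips part-way toward the consensus on the shared frames.
        guided[i][-overlap:] = (1 - weight) * tail + weight * shared
        guided[i + 1][:overlap] = (1 - weight) * head + weight * shared
    return guided

In an actual sampler, a step like this would be applied to the clips' predicted clean frames at each denoising iteration before continuing the reverse process.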

Capturing Ambiguity

When the input is ambiguous, our model captures the ambiguity by generating diverse samples of plausible shape-and-material combinations. Here, our model correctly infers that a static image of a diffuse object can be explained in at least three different ways.

Reducing Ambiguity by Exploiting Motion

There is usually less ambiguity about shape and material in a video than in a static image. A good example is Hartung and Kersten's "Shiny or Matte" perceptual demonstration, shown here. The still image can be seen as either shiny or matte material, but once the object moves, we immediately perceive it as one or the other.
Demonstration panels: Static, Static Result, Diffuse, Shiny.

Our model behaves the same way. When it sees the static image, it produces a variety of albedo maps, including both matte and shiny interpretations. And when it sees the shiny motion video, its albedo predictions collapse to a smaller set of spatially-uniform possibilities. (The right figure shows a PCA embedding of 100 albedo-map samples for each of the static and motion cases.)
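For readers who want to reproduce this kind of analysis, the sketch below flattens a batch of sampled albedo maps and embeds them with PCA; the sample_albedo helper referenced in the comments is a hypothetical stand-in for running the diffusion model repeatedly with different noise seeds.

import numpy as np
from sklearn.decomposition import PCA

def embed_albedo_samples(albedo_samples, n_components=2):
    """albedo_samples: array of shape (N, H, W, 3) holding N sampled albedo maps.
    Returns an (N, n_components) PCA embedding for visualizing sample diversity."""
    X = np.asarray(albedo_samples).reshape(len(albedo_samples), -1)  # flatten each map
    return PCA(n_components=n_components).fit_transform(X)

# Example (hypothetical helper): compare the spread of samples conditioned on a
# static image vs. a motion video.
# static_emb = embed_albedo_samples(sample_albedo(static_frames, n=100))
# motion_emb = embed_albedo_samples(sample_albedo(motion_frames, n=100))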

Model Architecture

Our parameter-efficient denoising network, U-ViT3D-Mixer, takes in a channel-wise concatenation of conditional video frames and noisy shape-and-material frames. At high spatial resolutions, it uses efficient local 3D blocks (middle) with decoupled spatial, temporal, and channel-wise interactions. At lower spatial resolutions, it uses global transformer layers with full 3D attention.
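To make the channel-wise conditioning and the decoupled interactions concrete, here is a hedged PyTorch sketch; the block structure, channel counts, and layer choices below are illustrative assumptions rather than the exact U-ViT3D-Mixer configuration.

import torch
import torch.nn as nn

class LocalMixerBlock3D(nn.Module):
    """Illustrative local 3D block with decoupled spatial, temporal, and
    channel-wise interactions (details are assumptions, not the exact model)."""
    def __init__(self, channels):
        super().__init__()
        # Spatial mixing: depthwise 2D convolution applied within each frame.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), groups=channels)
        # Temporal mixing: depthwise 1D convolution across the three frames.
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), groups=channels)
        # Channel mixing: pointwise convolution.
        self.channel = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        x = x + self.spatial(x)
        x = x + self.temporal(x)
        return x + self.channel(x)

# Conditioning by channel-wise concatenation, as described above
# (channel counts are illustrative):
B, T, H, W = 1, 3, 64, 64
video = torch.randn(B, 3, T, H, W)         # RGB condition frames
noisy_maps = torch.randn(B, 9, T, H, W)    # noisy shape-and-material frames
x = torch.cat([video, noisy_maps], dim=1)  # (B, 12, T, H, W)
out = LocalMixerBlock3D(channels=12)(x)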

Results Visualizations

We visualize the input video sequence and the predicted surface normals and albedo for a moving object. Input videos are either synthetically rendered with a physics-based rendering engine (1, 2) or captured from real-world objects (3).

(1) Objects with synthetic textures:
(2) Objects with original, artist-designed textures:
(3) Captured real-world object-motion videos:

BibTeX

@article{han2025generative,
  title={Generative Perception of Shape and Material from Differential Motion},
  author={Han, Xinran Nicole and Nishino, Ko and Zickler, Todd},
  journal={arXiv preprint arXiv:2506.02473},
  year={2025}
}

Acknowledgements

We thank Jianbo Shi for reviewing the manuscript. We also thank Kohei Yamashita for guidance about the data rendering pipeline and Boyuan Chen for discussion about video diffusion modeling. This work was supported in part by the NSF cooperative agreement PHY-2019786 (an NSF AI Institute, http://iaifi.org) and by JSPS 21H04893 and JST JPMJAP2305.

The website template was borrowed from Michaël Gharbi.