Abstract

Perceiving the shape and material of an object from a single image is inherently ambiguous, especially when lighting is unknown and unconstrained. Despite this, humans can often disentangle shape and material, and when they are uncertain, they may move their head slightly or rotate the object to resolve the ambiguities. Inspired by this behavior, we introduce a novel conditional denoising-diffusion model that generates samples of shape-and-material maps from a short video of an object undergoing differential motions. Our parameter-efficient architecture allows training directly in pixel space, and it generates many disentangled attributes of an object simultaneously. Trained on a modest number of synthetic object-motion videos with supervision on shape and material, the model exhibits compelling emergent behavior: for static observations, it produces diverse, multimodal predictions of plausible shape-and-material maps that capture the inherent ambiguities; and when objects move, the distributions quickly converge to more accurate explanations. The model also produces high-quality shape-and-material estimates for less ambiguous, real-world objects. By moving beyond single views to continuous motion observations, our work suggests a generative perception approach for improving visual reasoning in physically embodied systems.

Interdependence of Shape and Material

Interpreting visual inputs requires joint reasoning about shape, texture, and reflectance. A static observation of an object can always be explained by a planar shape that carries suitable paint or pigment (a "postcard" explanation). It can also be explained by mirror-like shapes that reflect suitable lighting. The set of possibilities is extensive and profoundly ambiguous, but in all of them there is an interplay between shape and material. This suggests that shape and material, the latter represented here by reflectance and texture, ought to be inferred simultaneously. We argue that vision systems should embrace and model ambiguities in shape and material, so that other cues such as motion or touch can be leveraged when needed for disambiguation.


Static Object Examples: Multiple shape and material hypotheses on ambiguous online images. (a) A cat on an illusion carpet that is painted to resemble a hole. When interpreted as a hole (Sample 1), the predicted albedo is brighter in the center region (yellow box) than when interpreted as a plane (Sample 2). (b) A painting of a pumpkin is interpreted as either a 3D shape with orange color or a planar shape with painted texture.

Shiny or Matte Ambiguity (by Hartung & Kersten)


Moving Object Examples: An object (such as the teapot or the croissant shape above), when viewed in a single static frame, can be ambiguous. Once we see the object in motion, we can often easily perceive its underlying material. In both videos, the object is first rendered with a shiny surface reflecting the environment and then switches to a diffuse surface with a painted texture.

Why Generative Perception?

We advocate a generative perception approach, in which the vision system generates diverse samples of shape and material that explain the object appearance captured in the input observations. When the input observation is a single static image, the output samples are expected to be diverse and to express the ambiguities. Moreover, perception should not be limited to a single view or a single attribute. It should naturally incorporate multiple views as they become available and jointly disentangle the complete physical attributes of object appearance, namely shape, texture, and reflectance, all at the same time.
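To make this concrete, below is a minimal sketch of how multiple hypotheses could be drawn from a conditional diffusion model. The `denoiser(x, cond, t)` interface, its `out_channels` attribute, and the simple Euler sampler are illustrative assumptions, not our exact implementation.

```python
# Minimal sketch: sample several shape-and-material hypotheses for one observation.
# Assumes a hypothetical `denoiser(x, cond, t)` that predicts a velocity pointing
# from noise toward the clean maps (rectified-flow-style parameterization).
import torch

@torch.no_grad()
def sample_hypotheses(denoiser, video, num_samples=8, num_steps=50):
    """video: (T, C, H, W) conditioning frames (a single image is T = 1).
    Returns: (num_samples, T, out_channels, H, W) sampled shape/material maps."""
    T, _, H, W = video.shape
    cond = video.unsqueeze(0).expand(num_samples, -1, -1, -1, -1)
    x = torch.randn(num_samples, T, denoiser.out_channels, H, W)  # start from pure noise (t = 1)

    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((num_samples,), 1.0 - i * dt)
        v = denoiser(x, cond, t)   # predicted velocity toward the clean maps
        x = x + dt * v             # Euler step; integrates from t = 1 (noise) to t = 0 (data)
    return x
```

Because each hypothesis starts from an independent noise seed, an ambiguous static input yields a spread of plausible explanations, while motion observations should collapse that spread.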


Motion Disambiguation Examples: Each plot shows the 2D PCA embedding of 100 albedo samples corresponding to the frame (inset) that is common between two distinct input videos. Left: When a shiny teapot moves, the albedo samples converge to a subset of the ones produced for the static teapot. Right: The albedo samples for a matte-rendered motion video are clearly separated from those of a shiny-rendered video. Note that spatially-uniform albedos form tighter clusters in PCA space, while highly textured ones exhibit greater variation.
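As a rough illustration of how such a plot can be produced, the sketch below flattens the albedo samples for the shared frame and projects them with 2D PCA; array shapes and variable names are assumptions for illustration, not our analysis code.

```python
# Sketch: 2D PCA embedding of albedo samples for the frame shared by two videos.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_albedo_pca(samples_a, samples_b, labels=("video A", "video B")):
    """samples_*: (N, H, W, 3) arrays of albedo samples for the common frame."""
    X = np.concatenate([samples_a, samples_b], axis=0)
    Z = PCA(n_components=2).fit_transform(X.reshape(X.shape[0], -1))  # flatten, then embed

    n = len(samples_a)
    plt.scatter(Z[:n, 0], Z[:n, 1], alpha=0.6, label=labels[0])
    plt.scatter(Z[n:, 0], Z[n:, 1], alpha=0.6, label=labels[1])
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend()
    plt.show()
```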

Model Architecture

Our parameter-efficient denoising network, U-ViT3D-Mixer, takes a channel-wise concatenation of conditioning video frames and noisy shape-and-material frames as input. At high spatial resolutions, it uses efficient local 3D blocks (middle) with decoupled spatial, temporal, and channel-wise interactions. At lower spatial resolutions, it uses global transformer layers with full 3D attention.
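The sketch below conveys this high-level dataflow in PyTorch. Block designs, channel counts, and the omission of timestep conditioning are illustrative assumptions, not the actual U-ViT3D-Mixer implementation.

```python
# Hedged sketch of the architecture's dataflow: channel-wise concatenation of
# inputs, local mixer blocks at high resolution, full 3D attention at low resolution.
import torch
import torch.nn as nn

class LocalMixerBlock(nn.Module):
    """High-resolution block: decoupled spatial, temporal, and channel-wise mixing."""
    def __init__(self, dim):
        super().__init__()
        self.spatial = nn.Conv3d(dim, dim, (1, 3, 3), padding=(0, 1, 1), groups=dim)
        self.temporal = nn.Conv3d(dim, dim, (3, 1, 1), padding=(1, 0, 0), groups=dim)
        self.channel = nn.Conv3d(dim, dim, 1)   # pointwise mixing across channels

    def forward(self, x):                       # x: (B, C, T, H, W)
        x = x + self.spatial(x)
        x = x + self.temporal(x)
        return x + self.channel(x)

class GlobalBlock(nn.Module):
    """Low-resolution block: full 3D self-attention over all space-time tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, T*H*W, C): every 3D position is a token
        q = self.norm(tokens)
        tokens = tokens + self.attn(q, q, q, need_weights=False)[0]
        return tokens.transpose(1, 2).reshape(B, C, T, H, W)

class UViT3DMixerSketch(nn.Module):
    def __init__(self, cond_ch=3, out_ch=6, dim=64):
        super().__init__()
        # Conditioning video frames and noisy shape-and-material frames enter
        # as a channel-wise concatenation.
        self.stem = nn.Conv3d(cond_ch + out_ch, dim, 1)
        self.hi_res = LocalMixerBlock(dim)
        self.down = nn.Conv3d(dim, dim, (1, 2, 2), stride=(1, 2, 2))  # spatial downsample
        self.lo_res = GlobalBlock(dim)
        self.up = nn.Upsample(scale_factor=(1, 2, 2), mode="nearest")
        self.head = nn.Conv3d(dim, out_ch, 1)

    def forward(self, noisy_maps, video):       # both: (B, C, T, H, W), even H and W
        x = self.stem(torch.cat([video, noisy_maps], dim=1))
        x = self.hi_res(x)
        x = x + self.up(self.lo_res(self.down(x)))   # U-shaped skip around the global stage
        return self.head(x)
```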

Results Visualizations

We visualize the input video sequence and the predicted surface normals and albedo (diffuse color) for a moving object. Input videos are synthetically rendered using a physics-based rendering engine.

(1) Objects with synthetic textures:


(2) Objects with their original, artist-designed textures:

Motion Disambiguation - Perception Test

Visualization of the shiny/matte perception test designed by Hartung and Kersten. Note that in the croissant example, individual static frames can look similar for both the shiny and the diffuse material. Motion cues help disambiguate the two.

Pot (shiny)
Croissant (shiny)
Croissant (diffuse)

Extending to Longer Observations

We apply temporal consistency guidance to the predictions of the overlapping frame to extend our model to longer video sequences. This guided sampling method yields temporally consistent shape and material estimates. We compare with the shape predictions of StableNormal, using their official video-mode demo.
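A minimal sketch of this kind of overlapping-window guided sampling is shown below. The window length, guidance weight, and `denoiser(x, cond, t)` interface are assumptions for illustration rather than our exact procedure.

```python
# Sketch: process a long video in overlapping windows and guide each window so
# its first `overlap` predicted frames agree with the previous window's output.
import torch

@torch.no_grad()
def sample_long_video(denoiser, video, window=8, overlap=1,
                      num_steps=50, guidance_weight=2.0):
    """video: (T_total, C, H, W). Returns (T_total, out_channels, H, W) predictions."""
    outputs, prev_tail = [], None
    dt = 1.0 / num_steps
    for start in range(0, video.shape[0] - overlap, window - overlap):
        clip = video[start:start + window]                       # (T, C, H, W)
        x = torch.randn(1, clip.shape[0], denoiser.out_channels, *clip.shape[-2:])
        for i in range(num_steps):
            t = 1.0 - i * dt
            v = denoiser(x, clip.unsqueeze(0), torch.full((1,), t))
            if prev_tail is not None:
                # Estimate the clean maps, then nudge the overlapping frames
                # toward the previous window's prediction for those frames.
                x0_hat = x + t * v
                residual = torch.zeros_like(x)
                residual[:, :overlap] = x0_hat[:, :overlap] - prev_tail
                v = v - guidance_weight * residual
            x = x + dt * v                                       # Euler step, t: 1 -> 0
        prev_tail = x[:, -overlap:].clone()
        outputs.append(x[0] if not outputs else x[0, overlap:])  # drop duplicated frames
    return torch.cat(outputs, dim=0)
```

The guidance term only penalizes disagreement on the shared frames, leaving the rest of each window free to explain the new observations.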

Ours
StableNormal
Ours
StableNormal

Acknowledgements

We thank Jianbo Shi for reviewing the manuscript. We also thank Kohei Yamashita for guidance about the data rendering pipeline and Boyuan Chen for discussion about video diffusion modeling. This work was supported in part by the NSF cooperative agreement PHY-2019786 (an NSF AI Institute, http://iaifi.org) and by JSPS 21H04893 and JST JPMJAP2305.
The website template was borrowed from Michaël Gharbi.