Robust Conditional 3D Shape Generation from Casual Captures

Phone videos to full 3D rooms? Hype builds while folks ask about depth and Macs

TLDR: ShapeR turns everyday video into metrically accurate 3D scenes using multiple views and motion data. The comments fixate on whether it truly works without depth sensors and whether it runs on Apple silicon, sparking excitement over the accuracy claims and a practical debate about setup and compatibility.

ShapeR promises to turn your casual phone clips into full 3D scenes with proper sizes and layouts, and the crowd is split between "wow" and "wait, what do I need?" Fans love that it uses multiple views and motion data (think your phone's tracking) to recover real-world scale, unlike single-photo tools that guess and warp. But the comment spotlight fell hard on one question: do you need actual depth sensors or not? nico's blunt "Does this need depth data capture as well?" set the tone, with others echoing confusion over the phrase "casual captures." The team says it uses SLAM points (sparse 3D points that fall out of the device's motion tracking) and can even work from monocular images via tools like MapAnything that predict metric points, which has people asking whether their everyday phone video qualifies.
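To make the depth question concrete, here is a minimal sketch of what the conditioning inputs could look like, assuming a hypothetical `CaptureConditioning` container and a `lift_monocular_to_metric_points` stand-in (neither is ShapeR's or MapAnything's actual interface):

```python
# Hedged sketch of "casual capture" conditioning. The container and helper
# below are hypothetical stand-ins, NOT ShapeR's or MapAnything's real API.
# The takeaway: sparse metric 3D points can come from phone SLAM or from a
# monocular predictor; no depth sensor is in the loop either way.
from dataclasses import dataclass
import numpy as np

def lift_monocular_to_metric_points(frames: list) -> np.ndarray:
    """Hypothetical stand-in for a monocular metric point-map model
    (the paper points to MapAnything for this role)."""
    raise NotImplementedError("plug in a metric point-map predictor here")

@dataclass
class CaptureConditioning:
    frames: list              # RGB frames sampled from the clip
    points_xyz: np.ndarray    # (N, 3) sparse metric 3D points

def build_conditioning(frames: list,
                       slam_points: np.ndarray | None) -> CaptureConditioning:
    if slam_points is not None:
        # Phone tracking (SLAM) already yields sparse metric points as a
        # by-product of pose estimation -- this is the "motion data" path.
        points = slam_points
    else:
        # Plain monocular video: lift frames to metric points instead.
        points = lift_monocular_to_metric_points(frames)
    return CaptureConditioning(frames=frames, points_xyz=points)
```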

The second flame war: Apple silicon. “Will it run on my M2?” stole the show, spawning memes about “metric accuracy, but only if my laptop doesn’t melt.” Fans cheered the idea of combining ShapeR’s accurate shapes with SAM 3D’s photoreal textures—“best of both worlds”—while skeptics joked they just want their couch to be the right size for once. Trained on synthetic data yet generalizing to real scenes, ShapeR dropped a new in-the-wild dataset and a bold claim: no hand-holding, just results. The community? Half dazzled, half demanding receipts.

Key Points

  • ShapeR generates metrically accurate, object-centric 3D meshes from casual image sequences using multimodal conditioning.
  • It leverages off-the-shelf SLAM points, 3D instance detections, images, 2D projections, and VLM captions, with a rectified flow transformer denoising a latent VecSet (a minimal sampler sketch follows after this list).
  • Robustness is achieved via heavy on-the-fly compositional augmentations and curriculum training (see the augmentation sketch below).
  • A new in-the-wild evaluation dataset includes posed multi-view images, SLAM point clouds, and complete per-object 3D annotations for 178 objects across 7 scenes.
  • ShapeR, trained on synthetic data, outperforms single-view, interactive methods like SAM 3D in metric accuracy, and can be combined with SAM 3D; it generalizes to non-Aria data and monocular images via MapAnything.
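To unpack the generator bullet: rectified flow models learn a velocity field along a straight-line path between noise and data, so sampling is just Euler integration of that field. Below is a minimal PyTorch sketch of conditional rectified-flow sampling over a latent VecSet; the `velocity_model` signature, token count, and latent width are illustrative assumptions, not ShapeR's actual interface.

```python
# Minimal sketch: Euler sampling of a conditional rectified flow over a
# latent VecSet (a set of latent tokens later decoded into a mesh).
import torch

@torch.no_grad()
def sample_vecset(velocity_model, cond: torch.Tensor,
                  num_tokens: int = 512, dim: int = 64,
                  steps: int = 50, device: str = "cpu") -> torch.Tensor:
    # t = 0 is pure Gaussian noise; t = 1 is a clean latent set.
    x = torch.randn(1, num_tokens, dim, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=device)
        # The transformer predicts v(x_t, t, cond). Under the straight path
        # x_t = (1 - t) * noise + t * data, the target velocity is data - noise.
        v = velocity_model(x, t, cond)
        x = x + dt * v  # one Euler step along the learned flow
    return x  # denoised latent VecSet, ready for a shape decoder
```

The straight-line path is what makes rectified flow attractive here: a modest number of Euler steps is typically enough, which matters to the commenters hoping to run this on a laptop.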
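And for the robustness bullet, a hedged sketch of what on-the-fly augmentation under a curriculum could look like: corruption severity ramps up during training so the model graduates from clean synthetic conditioning to SLAM-like noise. The linear ramp, jitter scale, and dropout rate here are invented for illustration, not the paper's exact recipe.

```python
# Illustrative curriculum-scheduled augmentation of conditioning points.
import torch

def curriculum_severity(step: int, warmup: int = 10_000,
                        max_sev: float = 1.0) -> float:
    # Linear ramp: clean inputs early, full corruption after `warmup` steps.
    return min(step / warmup, 1.0) * max_sev

def augment_points(points: torch.Tensor, severity: float) -> torch.Tensor:
    # Jitter positions to mimic SLAM drift; drop points to mimic sparsity.
    noisy = points + severity * 0.02 * torch.randn_like(points)
    keep = torch.rand(points.shape[0], device=points.device) > severity * 0.5
    return noisy[keep]

# Usage inside a training loop (hypothetical):
#   sev = curriculum_severity(global_step)
#   cond_points = augment_points(clean_points, sev)
```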

Hottest takes

"Does this need depth data capture as well?" — nico
"Also, can it run on Apple silicon?" — nico
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.