November 26, 2025

Cat videos, meet spy-grade tracking

Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

AI photo tools can now follow objects in video — fans amazed, newbies confused

TLDR: Researchers found image generators can track objects across video frames without extra training, and their DRIFT system sets a new benchmark. Commenters oscillate between awe and confusion, with some cheering hobbyists for reviving old models and others side‑eyeing the surveillance implications.

Here’s the plot twist no one saw coming: models built to make pretty pictures just learned to follow stuff in videos. Researchers say the attention maps inside image generators can “pass the label baton” from frame to frame, meaning zero‑shot (no extra training) object tracking. They even packaged it into a new system called DRIFT, with a helper tool that cleans up outlines, and it’s clocking state‑of‑the‑art scores.

The crowd? Split between mind‑blown and “what does any of that mean?” One top comment begs, “Can someone smarter than me explain?” while another gushes about how hobbyists keep squeezing smarts from 2022‑era Stable Diffusion models like they’re toothpaste that never ends. The big drama: Is this genuine “emergent intelligence” or just clever rebranding of math we already had? Hype vs. homework, in classic tech‑forum fashion.

Privacy nerves flare too: if a photo model can track your dog across a video, what about your face at a protest? Cue jokes: “Finally, my cat’s chaotic zoomies get a highlight reel,” plus memes about Skynet learning to stalk coffee cups. The vibe: unexpected superpower unlocked, with the usual internet cocktail of awe, confusion, and a dash of surveillance side‑eye.

Key Points

  • Self-attention maps in image diffusion models can serve as semantic label propagation kernels for pixel-level correspondences.
  • Extending label propagation across frames creates a temporal kernel enabling zero-shot object tracking via segmentation.
  • Test-time optimizations (DDIM inversion, textual inversion, adaptive head weighting) improve robustness and consistency of propagation.
  • DRIFT leverages a pretrained image diffusion model with SAM-guided mask refinement for video tracking.
  • DRIFT achieves state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
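The first two key points boil down to one operation: treat a cross-frame attention map as a row-stochastic kernel and multiply it against the previous frame's masks. A minimal sketch of that idea, assuming a softmax dot-product affinity over per-pixel features (the function names, temperature value, and feature shapes here are illustrative, not the paper's exact formulation):

```python
import numpy as np

def affinity(q, k, temperature=0.07):
    """Row-stochastic propagation kernel from attention-style features.

    q: (N, d) features for current-frame pixels (queries).
    k: (M, d) features for previous-frame pixels (keys).
    Returns (N, M) matrix whose rows sum to 1.
    """
    logits = (q @ k.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def propagate(q, k, prev_masks, temperature=0.07):
    """Carry segmentation labels from the previous frame to the current one.

    prev_masks: (M, C) one-hot object labels for previous-frame pixels.
    Returns (N, C) soft labels for current-frame pixels; argmax over C
    gives a hard zero-shot segmentation.
    """
    return affinity(q, k, temperature) @ prev_masks
```

Chaining `propagate` frame to frame is what the summary calls a "temporal kernel"; the refinement steps (DDIM inversion, SAM-guided mask cleanup) would then sharpen the soft masks this produces.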

Hottest takes

"Can someone smarter than me explain what this is about?" — onesandofgrain
"squeeze intelligence out of models that were trained in 2022" — ttul
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.