Apple Releases Open Weights Video Model

Apple’s video AI drop: genius tool or more slop?

TLDR: Apple released STARFlow‑V, an open‑weights video AI that promises high‑quality clips and supports multiple generation modes. The comments split hard: excitement for accessibility and features versus worries about data sourcing, training opacity, and “AI slop,” making this a buzzy, consequential move for Apple and developers.

Apple just dropped STARFlow-V, an open-weights video AI that makes clips from text and can remix images or videos into fresh footage. The tech flex: it uses “normalizing flows” (a different recipe than trending diffusion) to promise cleaner, more consistent results and end-to-end training. Apple posted code and says it supports text‑to‑video, image‑to‑video, and video‑to‑video.
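For readers unfamiliar with the "normalizing flows" recipe: unlike diffusion, a flow is an invertible map between noise and data, so the model can compute an exact log-likelihood via the change-of-variables rule and train end-to-end on it. Here is a minimal one-dimensional sketch of that idea (a toy affine flow, not Apple's model; all names here are illustrative):

```python
import math

# Toy 1-D normalizing flow: an invertible affine map z = (x - shift) / scale.
# Training a flow maximizes the exact log-likelihood via change of variables:
#   log p(x) = log p_base(z) + log |dz/dx|
# STARFlow-V's real flow blocks are deep networks; the principle is the same.

def base_log_prob(z: float) -> float:
    """Log-density of a standard normal base distribution."""
    return -0.5 * (z * z + math.log(2 * math.pi))

def flow_log_prob(x: float, shift: float, scale: float) -> float:
    """Exact log p(x) under the affine flow, via change of variables."""
    z = (x - shift) / scale     # forward map: data -> base noise
    log_det = -math.log(scale)  # log |dz/dx| for the affine map
    return base_log_prob(z) + log_det

def sample(z: float, shift: float, scale: float) -> float:
    """Generation runs the flow in reverse: map base noise to data space."""
    return z * scale + shift
```

Because the map is invertible, the same parameters serve both likelihood training (forward direction) and sampling (inverse direction), which is the "end-to-end training" pitch in the release.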

Comments went wild. coolspot flagged the scale ("96 H100 GPUs" and "~20 million videos"), then side-eyed the missing training time, cueing carbon-footprint jokes and "was this running since the iPhone 8?" satvikpendem wondered what Apple's actual plan is: Vision Pro magic? iMovie auto-edits? Or just research chasing the zeitgeist.

The spiciest thread: data. nothrowaways asks, “Where did the videos come from?” and speculation flies about licensing and privacy. On the heartwarming side, devinprater dreams of accessibility wins—richer scene descriptions for blind users. Culture critics weigh in too: camillomiller begs Apple not to unleash “more terrible slop,” praising Apple’s taste while other platforms drown in AI mush. Memes landed fast: “H100 = Hundred Hype 100,” and “open weights, closed vibes.” The vibe? Excited, uneasy, and very online—classic Apple drop with a side of comment-section chaos.

Key Points

  • Apple researchers introduce STARFlow‑V, a normalizing flow–based causal video generator supporting T2V, I2V, and V2V.
  • The model uses a global‑local architecture: a deep causal Transformer for global temporal reasoning and shallow per‑frame flow blocks for local detail.
  • A learnable causal denoiser trained via flow‑score matching complements maximum likelihood training to improve consistency.
  • Sampling is accelerated with a video‑aware Jacobi iteration that parallelizes inner updates without violating causality.
  • Empirical results claim strong visual fidelity, temporal consistency, and practical throughput, matching diffusion models in quality.
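The Jacobi-iteration point deserves a quick unpacking: a causal generator normally produces frames one after another, but you can treat the whole sequence as a fixed-point problem and update every position in parallel from the previous guess, sweeping until nothing changes. A toy sketch of that pattern, under a made-up causal rule (the names and the recurrence are illustrative, not STARFlow-V's actual latent-frame update):

```python
# Toy sketch of Jacobi-style parallel decoding for a causal sequence.
# A causal model defines frame t from earlier frames: x[t] = f(x[:t]).
# Jacobi iteration updates all positions from the previous guess in one
# parallel sweep and repeats until the sequence reaches a fixed point,
# which by construction matches strictly sequential generation.

def causal_step(prefix: list[float]) -> float:
    """Hypothetical causal rule: next value is the mean of the prefix plus 1."""
    if not prefix:
        return 1.0
    return sum(prefix) / len(prefix) + 1.0

def sequential_decode(n: int) -> list[float]:
    """Baseline: generate positions strictly one after another."""
    seq: list[float] = []
    for _ in range(n):
        seq.append(causal_step(seq))
    return seq

def jacobi_decode(n: int, max_iters: int = 50) -> list[float]:
    """Fixed-point decoding: every position updated per sweep, in parallel."""
    guess = [0.0] * n
    for _ in range(max_iters):
        new = [causal_step(guess[:t]) for t in range(n)]  # parallelizable sweep
        if new == guess:  # fixed point: causality is satisfied everywhere
            break
        guess = new
    return guess
```

After each sweep, one more prefix position becomes exact, so the fixed point is reached in at most `n` sweeps; in practice many positions converge early, which is where the claimed throughput win comes from.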

Hottest takes

“96 H100 GPUs… ~20 million videos. They don’t say for how long” — coolspot
“Where do they get the video training data?” — nothrowaways
“not contribute to having just more terrible slop” — camillomiller
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.