April 6, 2026
Now you see them, now you don’t
Netflix Void Model: Video Object and Interaction Deletion
Netflix’s VOID erases people—and their impact—sparking ‘censorship vs VFX’ brawl
TL;DR: Netflix’s VOID can remove people and objects from video and also delete their effects, like a dropped guitar falling naturally. Commenters are split between calling it dystopian censorship fuel and defending it as standard VFX wizardry, with extra buzz over its hefty 40GB GPU requirement.
Netflix researchers just dropped VOID, a video tool that can erase objects—and even the ripples they cause. Remove a person holding a guitar, and the guitar drops like it would in real life. It’s built on the CogVideoX research model, runs in two passes for steadier results, and—plot twist—needs a beefy GPU with 40GB of VRAM. There’s even a mask step powered by Google’s Gemini and Meta’s SAM2. Tech flex? Absolutely. Community meltdown? You bet.
The comment section split fast. One camp went full dystopia, with a user calling it the “dream of Minitrue,” worrying this could scrub history clean—like quietly erasing cigarettes from old shows. Another camp rolled its eyes and said this is just modern movie magic, like wire removal, used to make scenes look cleaner, not to rewrite reality. A pragmatic voice chimed in: studios will love this for local censorship rules—fewer reshoots, more savings.
Meanwhile, the nerds cheered the underdog: one admirer said CogVideoX keeps powering big research, and they’re not wrong. Others joked about the 40GB GPU requirement—“BRB, strapping an A100 to my backpack”—and called the demo “auto-regressive theater.” The vibe: equal parts awe and anxiety. Whether you see a censorship machine or a smarter VFX brush, VOID is already the internet’s newest Rorschach test—part magic trick, part moral panic. Try it if you’ve got the hardware; debate it if you’ve got the time.
Key Points
- VOID is a video inpainting system that removes target objects and their induced physical interactions from scenes.
- It is built on CogVideoX and fine-tuned with interaction-aware mask conditioning.
- The pipeline uses two transformer passes: a base inpainting model (Pass 1) and a warped-noise refinement model (Pass 2) for temporal consistency.
- Running the quick-start notebook requires a 40GB+ VRAM GPU (e.g., NVIDIA A100), and setup includes Gemini (Google AI API) and SAM2 for mask generation.
- Input sequences require an input video, a `quadmask_0.mp4` with four-region semantics, and a `prompt.json` describing the clean background; example data and data-generation tools are provided.
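As a rough illustration of that per-sequence layout, here is a minimal sketch that scaffolds one sequence directory. The source only says `prompt.json` "describes the clean background," so the field name and directory names below are assumptions, not a documented schema:

```python
import json
from pathlib import Path

# Hypothetical sequence directory; only the file names quadmask_0.mp4
# and prompt.json come from the source, the rest is illustrative.
seq = Path("data/example_sequence")
seq.mkdir(parents=True, exist_ok=True)

# prompt.json describes the clean background after object removal.
# The "prompt" key is an assumed field name.
prompt = {"prompt": "an empty living room with a wooden floor and a sofa"}
(seq / "prompt.json").write_text(json.dumps(prompt, indent=2))

# The pipeline also expects, alongside prompt.json:
#   seq / "input.mp4"       (source clip; name illustrative)
#   seq / "quadmask_0.mp4"  (mask video with four-region semantics)
print(sorted(p.name for p in seq.iterdir()))  # → ['prompt.json']
```

In practice the repo's provided example data and data-generation tools would produce these files; the sketch only shows the shape of one sequence folder.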