Learnings from 4 months of Image-Video VAE experiments

From blurry battles to open release: tiny tricks, big opinions

TL;DR: Linum released open weights for its image‑video compressor after months of trial and error, even though its latest model uses a different one (Wan 2.1's). Commenters are split between excited tinkerers, tiny‑tweak evangelists pushing EQ‑VAE, and pragmatists grilling the team on how they run so many long experiments.

Linum just dropped its Image‑Video VAE with open weights after four months of wrestling with “NaNs, mysterious splotches, and instability.” Translation: they built a compressor for images and videos, learned that perfect reconstructions aren’t everything, and—plot twist—still used Wan 2.1’s VAE in their latest video model. They’ve published the gory details and say another round is coming in 2026. And yes, the comments are the real show.

The authors jumped into the thread to take questions, and the crowd split into delightfully internetty camps. Camp A: the tinkerers, hyped to fine‑tune it on personal art because it's small and permissively licensed. Camp B: the tweak‑seekers, poking the team for missing "one tiny trick", EQ‑VAE, which they claim can boost quality. Meanwhile, a chorus of pragmatists asked how you even run this many experiments without losing your mind, given that each training run takes ages.

Between applause for the clear write‑up and nitpicks about what’s “missing,” the vibe is peak hacker drama: open weights energize the DIY crowd, tiny tweaks ignite debate, and process questions press for real‑world grit. Also trending: jokes about “Not a Nice video” NaNs and “splotch‑core aesthetics.” Tech is serious—but the comments? Deliciously unserious and very alive.

Key Points

  • The team trained an Image-Video VAE from July to November 2024 and is releasing it along with implementation details and lessons learned.
  • Despite this work, their latest text-to-video model uses Wan 2.1’s VAE, suggesting reconstruction quality was less critical than expected.
  • VAEs are used to compress images/videos into latent spaces to make diffusion transformers computationally tractable, given quadratic attention costs and massive token counts (e.g., ~110M for a short 720p clip).
  • Their VAE training uses a near-zero KL weight (~1e-6), a Laplacian decoder with a learned global scale (Sigma-VAE style) for reconstruction, and adds perceptual (VGG) and adversarial (GAN-like) losses to improve sharpness.
  • For text-to-video pipelines, the model is first pretrained on images before videos, and the VAE’s objective sums both image and video losses.
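The ~110M token figure above is easy to sanity-check. A minimal sketch of the arithmetic, assuming one token per pixel before compression and illustrative (not from the post) 8×8 spatial and 4× temporal VAE downsampling factors:

```python
# Rough token arithmetic for why diffusion transformers need a VAE.
# Assumptions (illustrative, not from the post): one token per pixel
# before compression; a 5 s, 24 fps, 720p clip; 8x8 spatial and 4x
# temporal downsampling in the VAE.

width, height = 1280, 720
fps, seconds = 24, 5
frames = fps * seconds  # 120 frames

raw_tokens = width * height * frames
print(f"raw tokens: {raw_tokens:,}")  # 110,592,000 -- the post's ~110M

# Hypothetical VAE compression factors
sx = sy = 8  # spatial downsampling per axis
st = 4       # temporal downsampling
latent_tokens = (width // sx) * (height // sy) * (frames // st)
print(f"latent tokens: {latent_tokens:,}")  # 432,000

# Self-attention is O(n^2) in sequence length, so the saving compounds:
ratio = raw_tokens / latent_tokens  # 256x fewer tokens
print(f"per-layer attention cost shrinks by ~{ratio**2:,.0f}x")
```

A 256× reduction in sequence length is a ~65,536× reduction in per-layer attention cost, which is the whole case for compressing into a latent space first.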
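The loss described above can be sketched compactly. This is my reading of the recipe, not the authors' code: a Laplacian reconstruction NLL with one learned global scale (Sigma‑VAE style), plus a KL term with near‑zero weight. The perceptual (VGG) and adversarial losses are omitted; in their pipeline the full objective would also be summed over both image and video batches.

```python
import numpy as np

def laplace_nll(x, x_hat, log_b):
    """Per-pixel negative log-likelihood under Laplace(x_hat, b).

    log_b is a single learned scalar (Sigma-VAE style): one global
    scale shared across all pixels, rather than a fixed L1 weight.
    """
    b = np.exp(log_b)
    return np.mean(np.abs(x - x_hat) / b + np.log(2 * b))

def kl_diag_gaussian(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)) averaged over latent dims."""
    return 0.5 * np.mean(np.exp(log_var) + mu**2 - 1 - log_var)

def vae_loss(x, x_hat, mu, log_var, log_b, kl_weight=1e-6):
    # Near-zero KL weight (~1e-6): the latent is only lightly
    # regularized, prioritizing reconstruction quality.
    return laplace_nll(x, x_hat, log_b) + kl_weight * kl_diag_gaussian(mu, log_var)

# Toy usage with random tensors standing in for pixels and latents.
rng = np.random.default_rng(0)
x = rng.random((4, 3, 8, 8))                     # a small image batch
x_hat = x + 0.01 * rng.standard_normal(x.shape)  # near-perfect recon
mu = rng.standard_normal((4, 16))
log_var = rng.standard_normal((4, 16))
print(vae_loss(x, x_hat, mu, log_var, log_b=0.0))
```

With the KL weight at 1e‑6, the KL term contributes almost nothing numerically; its role is to keep the latent distribution loosely anchored rather than to enforce a strict prior.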

Hottest takes

This seems like a great model to experiment fine tuning with original art — lastdong
Tiny trick, huge impact! Have you tried it? — DonThomasitos
whats your process of trying these different techniques/architectures? — asaiacai
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.