December 18, 2025
Long memory, short patience
T5Gemma 2: The next generation of encoder-decoder models
Tiny model, big promises: no instruction-tuned release, chart drama, and everyone asking what it's actually for
TLDR: A small, efficient model that reads and writes, now with images and long memory, built on Gemma 3. The crowd loves the concept but slams the lack of an instruction-tuned release, calls the benchmark charts unfair, and asks what encoder-decoder models are even for.
T5Gemma 2 just dropped as a compact “reads and writes” AI: an encoder-decoder that handles text, images, and super-long documents up to 128K tokens. It uses efficiency tricks (tied embeddings and merged attention) to squeeze big brains into smaller sizes for on-device use. Sounds dreamy… until the comments erupted. The hottest take: no instruction-tuned checkpoints. That means no ready-to-use “taught” version, and the crowd isn’t having it. As one user put it, “just post-train it yourself” isn’t always feasible. Cue the drama.
Meanwhile, newbies want a decoder for the decoder: what even is an encoder-decoder? Think of it as a two-person team, one that listens (the encoder) and one that talks (the decoder), which makes it good for translation, summarizing long documents, and mixing text with images. Fans ask if this beats the usual chatbots; skeptics wonder if it’s only for “traditional” tasks.
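For the newbies, the two-person-team idea can be sketched in a few lines of toy Python. This is an illustration of the split, not the real T5Gemma 2 architecture: the encoder reads the whole input once and hands off a representation; the decoder then emits output one token at a time while consulting that representation. All function names here are made up for the example.

```python
# Toy sketch of the encoder-decoder split (NOT the real model):
# the "listener" ingests the full input up front, the "talker"
# generates output step by step against the listener's summary.

def encode(source_tokens):
    """Listener: turn the whole input into one representation.
    A token -> position map stands in for contextual embeddings."""
    return {tok: i for i, tok in enumerate(source_tokens)}

def decode(encoded, target_len):
    """Talker: emit one token per step, consulting the encoded
    input each time (a dict lookup crudely plays the role of
    cross-attention)."""
    ordered = sorted(encoded, key=encoded.get)
    return [ordered[i % len(ordered)] for i in range(target_len)]

source = ["long", "document", "goes", "here"]
summary = decode(encode(source), target_len=2)
print(summary)  # ['long', 'document']
```

The point of the shape, not the toy logic: the encoder sees the entire document before a single output token exists, which is why this layout suits translation and summarization of long inputs.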
Then comes chart-gate. One commenter calls out the comparisons: the flashy wins pit a 1B Gemma against a “1+1B” T5Gemma 2, so roughly double the parameters. Size wars! Others squint at a mysterious “X” on the pentagon chart and shrug. Still, the promise is clear: multimodal, multilingual (140+ languages), long memory, small footprint. The vibe? Excited, confused, and a little salty.
Key Points
- T5Gemma 2 introduces multimodal and long-context encoder-decoder models based on Gemma 3.
- Architectural changes include tied encoder-decoder embeddings and merged decoder self-/cross-attention to reduce parameters and complexity.
- Models are available in compact sizes: 270M-270M (~370M total), 1B-1B (~1.7B), and 4B-4B (~7B), suited for on-device use.
- Capabilities include image-text understanding via a vision encoder, 128K token context via alternating local/global attention, and support for 140+ languages.
- Performance claims show gains over Gemma 3 on multimodal, long-context, and general tasks; post-training used minimal SFT without RL, and no post-trained checkpoints were released.
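The tied-embeddings bullet also explains why "1B-1B" totals roughly 1.7B rather than 2B: the encoder and decoder share one vocabulary table, so those parameters are counted once. A back-of-the-envelope check, using illustrative numbers (a ~256K vocabulary in the Gemma family and a hypothetical hidden size; neither figure is official):

```python
# Why a "1B + 1B" encoder-decoder can total well under 2B parameters:
# with tied embeddings, one vocab table is shared, not duplicated.
# VOCAB and D_MODEL are illustrative, not official T5Gemma 2 values.

VOCAB = 256_000   # approximate Gemma-family vocabulary size
D_MODEL = 1152    # hypothetical hidden size for a ~1B model

embedding = VOCAB * D_MODEL                   # one vocab table
per_side_other = 1_000_000_000 - embedding    # non-embedding params per side

untied_total = 2 * (per_side_other + embedding)  # two separate tables
tied_total = 2 * per_side_other + embedding      # one shared table

print(f"untied: {untied_total/1e9:.2f}B, tied: {tied_total/1e9:.2f}B")
# untied: 2.00B, tied: 1.71B
```

With these assumed numbers the tied total lands near the ~1.7B the release quotes, which is the intuition: the savings scale with the vocab table, so they matter most at the small sizes this family targets.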