T5Gemma 2: The next generation of encoder-decoder models

Tiny model, big promises: no instruction-tuned release, chart drama, and everyone asking what it's for

TLDR: A small, efficient model that reads and writes, now with images and long memory, built on Gemma 3. The crowd loves the concept but slams the lack of an instruction-tuned release, calls the comparison charts unfair, and asks what encoder-decoder models are even for.

T5Gemma 2 just dropped as a compact "reads and writes" AI: an encoder-decoder that handles text, images, and super-long documents up to 128K tokens. It uses efficiency tricks (tied embeddings and merged attention) to squeeze big brains into smaller sizes for on-device use. Sounds dreamy… until the comments erupted. The hottest take: no instruction-tuned checkpoints. That means no ready-to-use "taught" version, and the crowd isn't having it. As one user put it, "just post-train it yourself" isn't always feasible. Cue the drama.

Meanwhile, newbies want a decoder for the decoder: what even is an encoder-decoder? Think of it as a two-person team: one listens (the encoder), one talks (the decoder). That split suits translation, summarizing long documents, and mixing text with images. Fans ask if this beats the usual chatbots; skeptics wonder if it's only for "traditional" tasks.
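The two-person-team picture, and the "merged attention" trick from the announcement, can be sketched in a few lines of NumPy. This is a conceptual toy, not T5Gemma 2's actual implementation: shapes, the lack of masking, and the single-head attention are all simplifications. The point is the structural difference between a classic T5-style decoder (separate self- and cross-attention) and a merged variant that runs one attention op over concatenated decoder and encoder states.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over whatever keys/values we pass in.
    # (Causal masking omitted for brevity.)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

d = 16
enc_out = np.random.randn(10, d)  # encoder states: the half that "listens"
dec_h   = np.random.randn(4, d)   # decoder states: the half that "talks"

# Classic T5-style decoder layer: two separate attention ops.
self_attn  = attention(dec_h, dec_h, dec_h)      # decoder attends to itself
cross_attn = attention(dec_h, enc_out, enc_out)  # decoder attends to encoder

# Merged variant (in the spirit of T5Gemma 2's description): one op whose
# keys/values concatenate decoder and encoder states, so each decoder
# layer needs one attention module instead of two.
kv = np.concatenate([dec_h, enc_out], axis=0)
merged = attention(dec_h, kv, kv)
assert merged.shape == dec_h.shape
```

Either way, the decoder's output keeps the decoder's sequence length; merging just collapses two lookups into one.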

Then comes chart-gate. One commenter calls out the comparisons: the flashy wins pit a 1B Gemma against a "1+1B" T5Gemma 2, so basically double the parameters. Size wars! Others squint at a mysterious "X" on the pentagon chart and shrug. Still, the promise is clear: multimodal, multilingual (140+ languages), long memory, small footprint. The vibe? Excited, confused, and a little salty.

Key Points

  • T5Gemma 2 introduces multimodal and long-context encoder-decoder models based on Gemma 3.
  • Architectural changes include tied encoder-decoder embeddings and merged decoder self-/cross-attention to reduce parameters and complexity.
  • Models are available in compact sizes: 270M-270M (~370M total), 1B-1B (~1.7B), and 4B-4B (~7B), suited for on-device use.
  • Capabilities include image-text understanding via a vision encoder, 128K token context via alternating local/global attention, and support for 140+ languages.
  • Reported results show improvements over Gemma 3 on multimodal, long-context, and general tasks; post-training used minimal SFT without RL, and no post-trained checkpoints are released.
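As a back-of-the-envelope check on those size totals: tying the encoder and decoder embeddings means the shared token-embedding table is counted once, not twice, which is why each total lands below 2× the per-tower size. The vocab and hidden sizes below are assumptions borrowed from Gemma 3's public configs, and the sketch ignores the vision encoder and rounding, so treat the numbers as illustrative.

```python
# Rough parameter accounting for tied encoder-decoder embeddings.
# Assumed dims (from Gemma 3's published configs; illustrative only):
# vocab ~262k, hidden sizes 640 / 1152 / 2560 for the three model sizes.

VOCAB = 262_144

def tied_total(per_tower_params: float, hidden: int) -> float:
    """Naive encoder+decoder sum minus one embedding table,
    since both towers share (tie) the same token-embedding matrix."""
    embedding = VOCAB * hidden
    return 2 * per_tower_params - embedding

for name, tower, hidden in [("270M-270M", 270e6, 640),
                            ("1B-1B", 1e9, 1152),
                            ("4B-4B", 4e9, 2560)]:
    print(f"{name}: ~{tied_total(tower, hidden) / 1e9:.2f}B total")
```

The results come out near the ~370M, ~1.7B, and ~7B figures quoted above, which suggests the embedding table accounts for most of the gap between the naive 2× sum and the official totals.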

Hottest takes

"Note: we are not releasing any post-trained / IT checkpoints." — minimaxir
"Obviously a model with twice more parameters can do more better." — killerstorm
"More traditional ML/NLP tasks?" — potatoman22
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.