Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Linear claims up to 6× faster decoding; fans hype it, spreadsheet sleuths circle

TLDR: Kimi Linear promises much faster reading of long texts with less memory, and devs can try it now. The crowd splits between hype (“Let’s go!”) and spreadsheet checkers demanding real-world benchmarks, making this a speed-vs-accuracy showdown worth watching.

Moonshot AI dropped Kimi Linear, an attention upgrade that helps big AI models “focus” better and chew through super-long text faster. Translation: it’s designed to read huge docs without melting your GPU, boasting up to 6× faster decoding and 75% less memory for the KV cache, the store of “what the model remembers.” The crowd came in hot: one camp is pure hype, shouting “Let’s GO!” while another camp rolled up with receipts and benchmarks.

The numbers flex hard: scores on tests like MMLU-Pro (general knowledge) and RULER (long-context stress test) look strong, with Pareto-optimal speed/quality tradeoffs at long context lengths. Supporters like Ethan312 say it could make giant models faster “without much accuracy loss.” Meanwhile, adt casually dropped the models table like a mic, inviting the classic “show me the scorecard” battle. Cue the memes: “Kimi goes brrr,” “KV cache on a diet,” and “free your GPUs from doom.” Skeptics still ask if cherry-picked benchmarks translate to real-world chat.

There’s drama brewing over tradeoffs: speed vs accuracy, flashy graphs vs everyday tasks. But with open-source kernels, 48B-parameter models, and 1M-token context windows, devs are already spinning up Hugging Face and vLLM demos to see if this rocket actually launches. The comments? Equal parts confetti and calculator.

Key Points

  • Kimi Linear introduces Kimi Delta Attention (KDA), refining Gated DeltaNet with fine-grained gating for efficient linear attention (a toy sketch of the update rule follows this list).
  • On MMLU-Pro (4k context), Kimi Linear scores 51.0 with similar speed to full attention; on RULER (128k), it achieves 84.3 and 3.98× speedup.
  • By interleaving KDA with full-attention layers (a reported 3:1 ratio, so only a quarter of layers keep a KV cache), the architecture cuts KV cache needs by up to 75% and boosts decoding throughput by up to 6× for contexts up to 1M tokens.
  • Two 48B-parameter checkpoints (Base and Instruct) with 3B activated params and 1M context length are released, trained on 5.7T tokens.
  • The KDA kernel is open-sourced in FLA (the flash-linear-attention library); usage and deployment guidance covers Hugging Face Transformers and vLLM (a minimal loading sketch also follows).
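
For the curious, here’s a toy sketch of the gated delta-rule update that KDA builds on. This is illustrative only, not Moonshot’s kernel: the real KDA recurrence, its chunked/parallel form, and the exact gate placement live in the open-sourced FLA code, and every name below is made up for clarity.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One toy recurrent step of a gated delta rule (illustrative only).

    S:     (d_k, d_v) running state, the fixed-size "fast weight" memory
           that stands in for a growing KV cache
    k, v:  (d_k,), (d_v,) key/value vectors for the current token
    alpha: (d_k,) per-channel forget gate in [0, 1]; this fine-grained
           gate is KDA's refinement, where Gated DeltaNet uses a scalar
    beta:  scalar write strength in [0, 1]
    """
    S = alpha[:, None] * S                  # channel-wise forgetting
    pred = S.T @ k                          # what memory currently recalls for k
    S = S + beta * np.outer(k, v - pred)    # delta rule: correct memory toward v
    return S

def readout(S, q):
    """Output for query q: a lookup into the fixed-size state."""
    return S.T @ q
```

The point of the sketch: S stays the same size no matter how long the sequence gets, which is where the constant-memory, fast-decoding story comes from.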
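
And for the “devs can try it now” part, a minimal loading sketch with Hugging Face Transformers. The repo id here is an assumption pieced together from the announced checkpoint names (48B total, 3B activated, Instruct); check Moonshot AI’s Hub org and the model card before running.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moonshotai/Kimi-Linear-48B-A3B-Instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half precision to fit the 48B weights
    device_map="auto",            # shard across available GPUs
    trust_remote_code=True,       # custom KDA layers ship as model code
)

messages = [{"role": "user", "content": "Give me a one-line summary of linear attention."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For serving, vLLM is the other suggested route; recent vLLM versions expose `vllm serve <repo-id> --trust-remote-code` for an OpenAI-compatible endpoint, though the exact flags depend on your vLLM version.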

Hottest takes

"https://lifearchitect.ai/models-table/" — adt
"make large models faster without much accuracy loss" — Ethan312
"Let’s GO!" — nekofneko
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.