June 4, 2026
Q, K... and community chaos
Do Transformers Need Three Projections? Systematic Study of QKV Variants
AI may not need all its usual moving parts — and the comments are losing it
TLDR: Researchers say a leaner version of today’s core AI design can keep similar performance while slashing memory use, which matters for running models on everyday devices. Commenters split fast: some cheered a possible simplification, while others mocked the paper’s math style and demanded proof it still works at massive scale.
A new AI paper basically walked into the room and asked: what if transformers, the engine behind today’s chatbots and image models, have been carrying extra baggage this whole time? The researchers tested whether all three of the usual internal parts (the query, key, and value projections) are really necessary, and found that in some setups, models worked almost as well, and sometimes even better, while using far less memory. The headline-grabber: one simplified version, which shares a single projection for keys and values, halves the memory that text generation spends caching past tokens, at a reported cost of 3.1% in perplexity. Combined with head-sharing tricks, the savings climb as high as 96.9%, which could make running AI on phones and laptops much easier.
But the real fireworks were in the comments. One camp was delighted by the possibility that the AI world has been overcomplicating things for years, with one user calling the idea "great and amusing" — before immediately twisting the knife by noting the promised code repo was missing. Another commenter went full professor mode, roasting the paper’s notation and basically saying, please stop using a minus sign like it means vibes. Then came the skeptics, who were not buying the victory lap just yet: the loudest hot take was essentially, "scaling curves or GTFO," arguing that lots of clever math looks fine at small sizes and then falls apart when the stakes get real.
And because no internet debate is complete without poetry, one user described attention — the model’s way of deciding what matters — as a bizarre act of spinning and squishing vectors until you can "see all the way through." In other words: the science is serious, but the crowd turned it into a glorious mix of nitpicking, hype, and math-fueled stand-up.
Key Points
- The study evaluates three projection-sharing variants in transformer attention: Q-K=V, Q=K-V, and Q=K=V.
- Experiments cover synthetic tasks, vision benchmarks including MNIST, CIFAR, and TinyImageNet, and language models with 300M and 1.2B parameters trained on 10B tokens.
- The article reports that shared-projection transformers match or sometimes exceed standard QKV transformer performance.
- In language modeling, Q-K=V reduces KV cache by 50% with a reported 3.1% perplexity degradation.
- Combining Q-K=V with head-sharing methods increases memory savings to 87.5% with GQA-4 and 96.9% with MQA, supporting on-device inference use cases.
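
For readers wondering where those percentages come from, here is a rough back-of-the-envelope sketch, not the paper's code. It assumes that Q-K=V means keys and values share one projection (so only one tensor per head needs caching instead of two), that the model has 16 attention heads, and that GQA-4 leaves 4 key/value heads. The head count is inferred from the reported numbers rather than stated above, and the function name `kv_cache_savings` is made up for illustration.

```python
def kv_cache_savings(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Fraction of KV cache saved relative to standard multi-head attention.

    Counts cached tensors per token per layer; head dimension, layer count
    and sequence length cancel out of the ratio.
    """
    baseline = 2 * num_heads                 # one K and one V tensor per head
    tensors_per_kv_head = 1 if share_kv else 2
    reduced = tensors_per_kv_head * num_kv_heads
    return 1 - reduced / baseline


NUM_HEADS = 16  # assumption: inferred from the reported percentages

# Q-K=V alone: every head keeps its own cache entry, but K and V are one tensor.
print(f"Q-K=V:         {kv_cache_savings(NUM_HEADS, NUM_HEADS, True):.1%}")
# Q-K=V plus GQA-4: assumed to leave 4 key/value heads.
print(f"Q-K=V + GQA-4: {kv_cache_savings(NUM_HEADS, 4, True):.1%}")
# Q-K=V plus MQA: a single key/value head shared by all query heads.
print(f"Q-K=V + MQA:   {kv_cache_savings(NUM_HEADS, 1, True):.1%}")
```

Under these assumptions the three lines print 50.0%, 87.5%, and 96.9%, matching the figures in the list. The halving from sharing K and V simply multiplies with the reduction from sharing heads.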