April 25, 2026
Size vs Think Time—Fight!
Which one is more important: more parameters or more computation? (2021)
Reddit says: both—bigger brains help, but more thinking time matters too
TLDR: Researchers show you can grow AI size or give it more “think time” separately, and both help. Commenters call it obvious, mock mega-model overkill, and hype clever hacks and fine-tuning—turning a tech demo into a brawl over brains, compute, and smarter shortcuts that actually matter.
Tech folks dropped a spicy debate: is AI better with more “brain cells” or more time to think? The paper says… why not both! Researchers showed you can make models bigger without adding compute per word by using hashing (a simple rule that sends each word to a specialist “expert”), and crank up thinking steps without adding new parameters using “staircase attention.” Translation: size and compute can be dialed up separately, and both can boost smarts. Cue the comments-section meltdown.
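For the curious, here is a minimal sketch of the hashing trick in PyTorch, with made-up sizes: instead of a learned router, each token goes to an expert feed-forward block picked by a fixed hash of its token id, so adding experts grows the parameter count while every token still touches only one expert. This is an illustration of the general idea, not the paper's exact implementation; the layer name, dimensions, and the `token_id % num_experts` hash are all assumptions for the example.

```python
import torch
import torch.nn as nn


class HashRoutedFFN(nn.Module):
    """Toy hash-routed mixture-of-experts feed-forward layer.

    Each token is routed to exactly one expert, chosen by a fixed hash of
    its token id (here simply token_id % num_experts). More experts means
    more parameters overall, but every token still runs through a single
    expert, so per-token compute stays roughly constant.
    """

    def __init__(self, d_model=64, d_hidden=256, num_experts=4):
        super().__init__()
        self.num_experts = num_experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden, token_ids):
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        expert_ids = token_ids % self.num_experts  # fixed "hash" routing
        out = torch.zeros_like(hidden)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e
            if mask.any():
                out[mask] = expert(hidden[mask])
        return out


# Tiny usage example with random data.
layer = HashRoutedFFN()
tokens = torch.randint(0, 1000, (2, 8))   # fake token ids
states = torch.randn(2, 8, 64)            # fake hidden states
print(layer(states, tokens).shape)        # torch.Size([2, 8, 64])
```

The design point the paper leans on: because the routing is a fixed hash rather than a learned gate, there is nothing extra to train or balance, which is where the “free upgrade” feel comes from.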
One camp yawned, calling it obvious: “it’s both,” said one reader, comparing model size to how curvy a shape can be, and compute to the image’s resolution. Another camp cheered the “free upgrade” energy, linking to a post about cutting duplicate layers and stacking the real “thinking layers” to get higher scores with “basically no overhead.” The crowd loved that hacky vibe.
Then came the jokes: one commenter roasted mega-models as “employing a million random people to play darts,” and likened overkill to “shooting sparrows with a nuclear bomb.” Others pushed a third path: smarter data selection, lightweight fine-tuning (LoRA), and mixture-of-experts (MoE) for efficiency. Bottom line: the study lit a fire under an old fight—size vs. sweat—while the community gleefully piled on with memes, metaphors, and DIY tricks.
Key Points
- The article argues that model parameters and computation should be considered separately in deep learning.
- A hashing-based MoE routing method increases model size without increasing per-input computation.
- Staircase attention models increase computation without adding parameters, improving performance.
- On the pushshift.io Reddit task, hashing-based MoE outperforms the Switch baseline, especially with more experts.
- Experiments include 1.28B-parameter models using ~17% of parameters per input and 4.5B-parameter models outperforming the BASE MoE model.
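The “~17% of parameters per input” figure in the last point is easier to see with back-of-the-envelope arithmetic: in a sparse MoE, only the shared weights plus the one expert a token is routed to are active, so the active fraction is roughly (shared + one expert) / (shared + all experts). The numbers below are hypothetical, chosen only to land in the same ballpark as the reported scale, and are not the paper's actual configuration.

```python
# Rough illustration of why a sparse MoE uses only a fraction of its
# parameters per input. All numbers are hypothetical, not from the paper.
shared_params = 150e6   # embeddings, attention, dense layers (always active)
expert_params = 70e6    # parameters in one expert feed-forward block
num_experts = 16        # experts per MoE layer; each token uses only one

total = shared_params + num_experts * expert_params
active = shared_params + expert_params  # shared weights + one expert

print(f"total parameters : {total / 1e9:.2f}B")        # ~1.27B
print(f"active per token : {active / 1e9:.2f}B ({active / total:.0%})")  # ~17%
```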