April 11, 2026
Chunk Wars ignite
How Much Linear Memory Access Is Enough?
1MB is enough? Dev drops data, coders cheer, 'Grug' asks now what, GPU crowd says go gigabytes
TLDR: New benchmarks claim most workloads hit full speed with 1MB memory chunks (and sometimes much smaller), challenging the “bigger is always better” mantra. The comments split fast: data lovers cheer the receipts, GPU folks boast about gigabyte-scale chunks, and “Grug programmers” just want simple, practical rules they can use.
A performance-obsessed dev just posted data saying you don’t need giant memory chunks to go fast: for most work, 1MB blocks hit full speed, 128KB is enough if the work per byte isn’t tiny, and even 4KB works when the compute is heavy. Charts, code, and receipts are all on GitHub: github.com/solidean/bench-linear-access. And yes, the internet is already fighting about it.
Fans of “data over dogma” are loving it. The author admits he chased numbers because everyone repeats “more contiguity = better” without proof, and commenters piled in with relief and sass. One engineer flexed that they “reach peak performance before 128KB,” while another, coming from the GPU database world, cracked a smile with “is this nerd sniping?” before bragging they run gigabytes per task when they can. The vibe: Chunk Wars have begun.
Meanwhile, the self-described “Grug Programmer” crowd asked what to actually do: rearrange your data for better use of the tiny fast memory (cache) if you’re doing simple sums, but chill if your code is expensive enough that compute time hides the memory cost. Side threads spun up on whether huge pages (bigger memory blocks the operating system can use) would change the picture, and someone even praised the project while confessing a “mistrust of all booleans.” It’s nerdy, punchy, and blessedly full of numbers.
Key Points
- Benchmarks quantify block-size thresholds for linear memory access on a Ryzen 9 7950X3D.
- Findings: ~1 MB blocks suffice for most workloads; ~128 KB suffices at ≥1 cycle/byte; ~4 KB suffices at ≥10 cycles/byte.
- Experiments fix the total working set and vary block size to measure amortization of block-jump costs.
- Runs randomize block order and placement within a 4 GB backing memory and flush caches to avoid warm-cache bias.
- A C++ scalar_stats kernel achieved ~7 GB/s on large blocks (~0.75 cycles/byte), with results and code available on GitHub.