April 11, 2026
Chunk Wars ignite
How Much Linear Memory Access Is Enough?
1MB is enough? Dev drops data, coders cheer, 'Grug' asks now what, GPU crowd says go gigabytes
TLDR: New benchmarks claim most workloads hit full speed with 1MB memory chunks (and sometimes much smaller), challenging the “bigger is always better” mantra. The comments split fast: data lovers cheer the receipts, GPU folks boast about gigabyte-scale chunks, and “Grug programmers” just want simple, practical rules they can use.
A performance-obsessed dev just posted data saying you don’t need giant memory chunks to go fast: for most work, 1MB blocks hit full speed, 128KB is enough if the work per byte isn’t tiny, and even 4KB works when the compute is heavy. Charts, code, and receipts are all on GitHub: github.com/solidean/bench-linear-access. And yes, the internet is already fighting about it.
Fans of “data over dogma” are loving it. The author admits he chased numbers because everyone repeats “more contiguity = better” without proof, and commenters piled in with relief and sass. One engineer flexed that they “reach peak performance before 128KB,” while another, coming from the GPU database world, cracked a smile with “is this nerd sniping?” before bragging they run gigabytes per task when they can. The vibe: Chunk Wars have begun.
Meanwhile, the self-described “Grug Programmer” crowd asked what to actually do: rearrange your data for better use of the tiny fast memory (cache) if you’re doing simple sums, but chill if your code is expensive enough that compute time hides the memory cost. Side threads spun up on whether huge pages (bigger memory blocks the operating system can use) would change the picture, and someone even praised the project while confessing a “mistrust of all booleans.” It’s nerdy, punchy, and blessedly full of numbers.
Key Points
- Benchmarks quantify block-size thresholds for linear memory access on a Ryzen 9 7950X3D.
- Findings: ~1 MB blocks suffice for most workloads; ~128 KB suffices at ≥1 cycle/byte; ~4 KB suffices at ≥10 cycles/byte.
- Experiments fix the total working set and vary block size to measure amortization of block-jump costs.
- Runs randomize block order and placement within a 4 GB backing memory and flush caches to avoid warm-cache bias.
- A C++ scalar_stats kernel achieved ~7 GB/s on large blocks (~0.75 cycles/byte), with results and code available on GitHub.