March 13, 2026
Speed hack meets comment smack
Prefix sums at gigabytes per second with ARM NEON
Speed hack thrills—then chaos: “AI post?” and “Why won’t Apple add the new stuff?”
TLDR: A clever chip trick speeds up running totals by processing numbers in parallel, promising throughput in the gigabytes per second. Comments erupted over two things: a wordless link insinuating the post might be AI‑made, and a debate about why newer chip features aren’t widely supported—especially on big‑name devices—highlighting a messy, fast‑moving hardware landscape.
A nerdy post promised to make “running totals” (the rolling sum of your daily numbers) fly by using a chip trick on ARM devices. The gist: split numbers into four lanes, add them up in parallel, and stitch the results—so you move data at near “gigabytes per second.” Cool! But the crowd quickly turned the speed run into a drama run.
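For the non-vector baseline the post is racing against, here's a minimal sketch of a "running total" in plain Python (the function name is ours, for illustration). The key property is that each step needs the previous sum, which is exactly the serial dependency the chip trick works around:

```python
def prefix_sum_scalar(xs):
    """Inclusive running total: out[i] = xs[0] + ... + xs[i].
    Each iteration depends on the previous total, so a CPU cannot
    overlap iterations -- roughly one element per cycle at best."""
    out, total = [], 0
    for x in xs:
        total += x       # serial dependency: needs last iteration's total
        out.append(total)
    return out
```

For example, `prefix_sum_scalar([1, 2, 3, 4])` returns `[1, 3, 6, 10]`.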
First spark: a drive‑by dunk linking only to HN’s “no AI-generated posts” rule. No words, just the link—pure side‑eye. Some readers read it as: “Is this post machine-made?” Suddenly, the fastest thing here wasn’t math—it was suspicion. The vibe: half impressed by the trick, half policing the feed for bots. Meta‑drama unlocked.
Second spark: a technical pivot with real-world stakes—“Why are we still doing this on older vector features when the newer stuff exists?” One commenter called out that even Apple’s latest chips still don’t support the newer vector tech (beyond a separate matrix unit), turning the thread into a mini state‑of‑the‑industry rant about fragmentation. In plain speak: the hack is neat, but chip makers can’t agree on the next step. Cue quips about neon lights still “glowing” and a chorus asking whether to optimize for today’s hardware or wait for tomorrow’s promised land. Speed vs standards: fight!
Key Points
- A scalar in-place prefix sum is limited by its serial data dependency to roughly one cycle per element, which still works out to billions of integers per second on modern CPUs.
- A 4-element NEON vector prefix sum (two shifts, two adds) may not outperform scalar code, because the instructions chain sequentially and add overhead.
- Using NEON interleaved loads (vld4) to process 16 elements de-interleaves the data into four vectors for parallel vertical accumulation.
- Three vector adds build running partials across the four vectors; the last partial holds the four local group sums, and a 4-lane prefix sum (vext+vadd) across it computes the cumulative offsets.
- This approach requires about eight sequential instructions per 16 elements, yielding a theoretical 2× speedup over the scalar loop.
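The steps above can be modeled in portable code. This is a pure-Python sketch of the data flow, not real intrinsics: 4-element lists stand in for 128-bit NEON registers, the nested comprehension stands in for the de-interleaving vld4 load, and the `shift` helper stands in for vext against a zero vector. Names like `prefix_sum_block16` are ours, for illustration:

```python
def prefix_sum_block16(x, carry=0):
    """Model of the 16-element NEON step. Lane j of vector k holds
    x[4*j + k], mirroring the de-interleaving vld4 load. Returns the
    16 prefix sums and the carry for the next block."""
    assert len(x) == 16
    v = [[x[4 * j + k] for j in range(4)] for k in range(4)]  # vld4 analog

    # Three vector adds: running partials within each 4-element group.
    p = [v[0]]
    for k in range(1, 4):
        p.append([p[k - 1][j] + v[k][j] for j in range(4)])
    group_sums = p[3]  # lane j = sum of elements 4j..4j+3

    def shift(vec, n):
        """vext-with-zero analog: insert n zero lanes at the front."""
        return [0] * n + vec[:4 - n]

    # 4-lane prefix sum over the group sums (vext + vadd, twice).
    g = [a + b for a, b in zip(group_sums, shift(group_sums, 1))]
    g = [a + b for a, b in zip(g, shift(g, 2))]  # inclusive lane prefix

    # Exclusive offset for group j = carry + sum of groups 0..j-1.
    offsets = [carry] + [carry + s for s in g[:3]]

    # vst4 analog: re-interleave, adding each group's offset.
    out = [p[k][j] + offsets[j] for j in range(4) for k in range(4)]
    return out, out[-1]
```

On real hardware the partials, shifts, and adds map to single NEON instructions, which is where the roughly-eight-instructions-per-16-elements count comes from.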