December 19, 2025
Zero to hero, literally
Parallelizing ClickHouse aggregation merge for fixed hash map
A tiny math tweak makes billion‑row queries fly—and the comments go wild
TLDR: A new ClickHouse tweak lets threads merge groups in parallel without converting data structures, cutting wall-clock time on wide merges. The community is stunned by a roughly 7× gap from a simple “0 +” trick, sparking debates over footguns, smarter defaults, and why billion‑row queries should never hinge on type quirks.
ClickHouse just hit a plot twist: a developer’s PR to speed up how big groups of data get merged revealed that adding “0 +” to a calculation makes a billion‑row query run more than 7× faster (≈62.8s versus ≈8.55s). The culprit? Type quirks. One version grouped by a UInt16 key, which ClickHouse keeps in a fixed hash map, a flat one‑dimensional array that trapped merges in a single “drawer,” while the other used a bigger type whose hash table unlocked multi‑threaded merging across “buckets.” In plain English: color‑coded bins versus one giant junk drawer. The PR lets threads work on disjoint key sets in place: no conversion, less drama.
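For readers who want the junk drawer in concrete terms, here is a minimal Python sketch of the idea, not ClickHouse's actual C++ code. It assumes each thread has already counted rows into its own 65,536‑slot array (one slot per possible UInt16 key), and the names `build_partial`, `merge_range`, and `parallel_merge` are hypothetical. Each worker owns a disjoint slice of the key range, so workers can add into the same destination array without locks and without converting it to another structure.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_KEYS = 1 << 16  # one slot for every possible UInt16 key


def build_partial(keys):
    """Count rows per key into a fixed array indexed directly by the key."""
    table = [0] * NUM_KEYS
    for k in keys:
        table[k] += 1
    return table


def merge_range(dest, others, lo, hi):
    """Add slots [lo, hi) of every other table into dest, in place."""
    for table in others:
        for k in range(lo, hi):
            dest[k] += table[k]


def parallel_merge(partials, num_workers=4):
    """Merge all partial tables into the first one using disjoint key ranges."""
    dest, others = partials[0], partials[1:]
    step = (NUM_KEYS + num_workers - 1) // num_workers  # ceiling division
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [
            pool.submit(merge_range, dest, others, lo, min(lo + step, NUM_KEYS))
            for lo in range(0, NUM_KEYS, step)
        ]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return dest


# Three "threads" worth of partial results, merged by four workers.
partials = [build_partial([1, 2, 2]), build_partial([2, 3]), build_partial([65535])]
merged = parallel_merge(partials)
```

Because of Python's global interpreter lock this sketch shows the structure of the merge rather than a real speedup; the point is that no two workers ever touch the same slot. Note that it overwrites the first partial table instead of allocating a new one.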
Comments exploded. Performance folks called it a footgun: “Why does math trivia decide speed?” Others defended ClickHouse’s low‑level control, arguing the optimizer shouldn’t guess types for users. A middle camp begged for “auto‑promote or warn” so beginners don’t ship the ‘0+’ hack. Then came a spooky memory deallocation assertion failure in CI; meme lords dropped “hash map ghosts” and declared the 0+ Gang. Profilers debated why the flamegraph looked unchanged. The short answer: a CPU‑time profile adds up work across threads, so it doesn’t show that work overlapping in time. Cue explainers and smug charts. The final vibe: applause for the clever per‑thread merge, side‑eye at defaults, and a loud chorus: make fast the default, or at least shout when you’re choosing the junk drawer.
Key Points
- Two similar ClickHouse queries over 1e9 rows differed greatly in runtime (≈62.8s vs ≈8.55s) due to differing group-by key types.
- Grouping by UInt16 used a fixed hash map (one-dimensional array), hindering parallel merge; wider types used standard/two-level hash tables enabling parallel merge.
- Initial idea to convert the fixed array to a two-level structure was slow; an in-place parallel merge over disjoint key subsets was proposed and implemented.
- Range-based key segmentation improved wall-clock time via parallelism but left CPU time and flamegraph profiles largely unchanged.
- CI uncovered a memory deallocation assertion failure (indicative of corruption) during development, logged on 2025-09-22.
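The point about range-based segmentation leaving CPU time unchanged can be checked with a little arithmetic. The sketch below is illustrative only: `merge_times` is a hypothetical helper and the per-range costs are invented numbers, not measurements from the PR. It assumes key ranges are handed to threads round-robin and that threads run fully in parallel.

```python
def merge_times(per_range_cpu_seconds, num_threads):
    """Return (cpu_time, wall_time) for ranges assigned round-robin to threads."""
    per_thread = [0.0] * num_threads
    for i, cost in enumerate(per_range_cpu_seconds):
        per_thread[i % num_threads] += cost
    cpu_time = sum(per_range_cpu_seconds)  # total work, what a CPU profile adds up
    wall_time = max(per_thread)            # the slowest thread, what the user waits for
    return cpu_time, wall_time


# Invented costs: eight key ranges that each take 2 seconds of CPU to merge.
costs = [2.0] * 8
single = merge_times(costs, num_threads=1)  # (16.0, 16.0)
eight = merge_times(costs, num_threads=8)   # (16.0, 2.0)
```

Under these assumptions the total CPU time is the same in both runs, which is why a CPU-time flamegraph looks alike before and after, while the wall-clock time drops with the number of threads.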