June 4, 2026
Sorting tea, no branches
Branchless Quicksort faster than std:sort and pdqsort with C and C++ API
New sort code beats the old champs, and the comments instantly turned into a nerd cage match
TLDR: A new sorting library says it can outpace common C and C++ options, with especially strong numbers on Apple and AMD chips. The comments split fast between praise, nitpicking, requests for tougher comparisons, and a side quest arguing what “branchless” even means.
A new sorting tool called blqsort just strutted into the room claiming it can beat familiar favorites like std::sort and pdqsort, and on the benchmark charts, it absolutely brought receipts. On an Apple M1, it sorted 50 million numbers faster than the standard option, and on Ryzen it widened the gap even more. The pitch is simple-ish: avoid the little split-second “guessing mistakes” chips make, and your program speeds up. The project even comes with easy C and C++ drop-in headers, which made some readers go, basically, wait… that’s it? One admirer summed up the vibe perfectly: it looked so simple they had to re-read it slowly.
But of course, the real action was in the peanut gallery. One camp was impressed; another immediately yelled “show us more comparisons”, asking why it wasn’t stacked against flashy vectorized sorting methods too. Then came the classic comment-thread spiral: one confused reader asked whether the supposedly “branchless” example still had… well, a branch. That kicked open the door for a mini semantics fight over what “branchless” actually means in plain English. And just when the thread needed more spice, a Rust developer casually slid in with a “by the way, Rust already has great sort options built in” flex. Meanwhile, C++ purists side-eyed the implementation details, warning that real-world custom data types might not enjoy being copied around so freely. In other words: fast code, faster opinions, and the comments sorted themselves into chaos immediately.
Key Points
- •The article benchmarks single-threaded blqsort against std::sort and pdqsort for sorting 50 million doubles on Apple M1 and AMD Ryzen 3 systems, with blqsort posting the fastest reported times in both listed setups.
- •blqsort uses branchless partitioning to reduce branch misprediction, including a 1024-element stack buffer and block-wise copying inspired by fluxsort.
- •To avoid worst-case O(n²) behavior, the implementation groups identical elements, checks for already sorted partitions, and can switch to heapsort when partitioning becomes highly imbalanced.
- •For larger partitions the algorithm uses a median-of-medians pivot strategy and loop unrolling, while partitions of 2 to 12 elements are handled with custom sorting networks.
- •The article provides both C++ and C APIs, including multithreaded variants using C++ threads and POSIX threads, and describes a BlockQuicksort-based path for expensive-to-copy non-trivially-copyable C++ types.