March 20, 2026
Bananas or benchmarks?
NumKong: 2'000 Mixed Precision Kernels for All
Tiny math gorilla drops: fans cheer 5MB speed, skeptics cry “banana benchmarks”
TLDR: NumKong launches 2,000 compact math routines across seven languages in a tiny 5MB package, promising fast CPU and browser math. Comments split between loving the no-bloat drop and questioning bold benchmark claims like “0% error” and single-thread tests—hype meets healthy skepticism over what the numbers really mean.
NumKong just flung 2,000 tiny math building blocks into the wild, promising faster, smaller number crunching across seven languages in a 5MB package. Fans are swooning over the “no bloat” vibe and the swagger of claiming OpenBLAS-sized features with browser-friendly WASM and GPU-free tricks like ColBERT scoring. The mood: hyped but suspicious.
The top fight: accuracy vs speed. NumKong’s charts boast wild gains (that 1,279 int8 speed number) and even “0% error” on float8, and the comments erupted. Benchmark cops demanded definitions and datasets; defenders say the author is upfront about trading speed for precision. One line summed it up: “Pick your poison, but at least the bottle is tiny.”
Memes landed fast. “Float118 is two doubles in a trench coat.” “Float6? My toaster can do that.” And of course: bananas-per-second jokes because … Kong. Naming drama too: “SimSIMD is dead, long live NumKong” became a rallying cry, while open-source purists grumbled: why not contribute to OpenBLAS instead of “starting a new empire”?
Underneath the chaos, practical builders are excited: multi-CPU support, USearch acceleration, and 99 Python wheels feel like a real dev gift. The verdict: bold drop, spicy thread.
Key Points
- NumKong is an open-source relaunch of SimSIMD, offering 2,000+ mixed-precision SIMD kernels across seven languages with binaries of 5 MB or less.
- It supports multiple ISAs and platforms, including RISC-V V, Intel AMX/AVX2/AVX-512, Arm SME/SVE, and Relaxed WebAssembly SIMD.
- Kernels span from Float6 to Float118, include Int4/UInt4 dot products, and implement precision techniques (Neumaier, Dot2, Ozaki) plus domain algorithms (Haversine, Vincenty, Kabsch, Umeyama, fused MaxSim for ColBERT).
- Benchmarks on Intel Xeon4 CPUs show single-threaded GEMM-like performance and error metrics across data types, alongside smaller binaries and wide language distribution (99 Python wheels).
- NumKong was built to accelerate Unum’s USearch and maintain compatibility; USearch powers vector search across systems like ClickHouse, DuckDB, ScyllaDB, TiDB, Yugabyte, and MemGraph.
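
For readers wondering what a "precision technique" like Neumaier actually buys, here is a minimal pure-Python sketch of Neumaier compensated summation next to a plain running sum. It is only an illustration of the textbook algorithm, not NumKong's code or API; the function names `naive_sum` and `neumaier_sum` and the sample data are made up for this example.

```python
import math

def naive_sum(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for x in values:
        total += x
    return total

def neumaier_sum(values):
    """Neumaier's variant of Kahan compensated summation (textbook form)."""
    total = 0.0
    compensation = 0.0  # accumulates the low-order bits lost in each addition
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            # total dominates: the low-order bits of x were lost
            compensation += (total - t) + x
        else:
            # x dominates: the low-order bits of total were lost
            compensation += (x - t) + total
        total = t
    return total + compensation

# Hypothetical sample data: two small values hidden between huge ones that cancel.
data = [1.0, 1e100, 1.0, -1e100]

print(naive_sum(data))     # 0.0 (both 1.0 terms are swallowed by 1e100)
print(neumaier_sum(data))  # 2.0 (the compensation term recovers them)
print(math.fsum(data))     # 2.0 (exactly rounded reference from the standard library)
```

The trade is the one the thread argues about: the compensated loop does several extra floating-point operations and a branch per element, so it costs speed to buy accuracy.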