December 16, 2025
Unicode wars: ß vs speed demons
Full Unicode Search at 50× ICU Speed with AVX‑512
Blazing-fast text search ignites ß vs 'Mass' fight, confusables panic, and transliteration dreams
TL;DR: A new library promises much faster Unicode text search than ICU, using modern CPU instructions. Commenters clashed over correctness and safety: a German commenter objected to “ß” folding to “ss,” others warned about look‑alike characters, and some asked for transliteration so searches match how people actually type.
A new open-source library claims it can search Unicode text up to 50× faster than the industry workhorse ICU (the thing big apps use to handle text), thanks to fancy CPU tricks. The crowd? Split right down the middle. Speed freaks cheered; language purists came out swinging. One German commenter fired the opening shot: “Mass is not a different casing of Maß,” arguing that smart case-insensitive search breaks real meanings. Another camp waved the correctness flag too, pointing out dangerous “look‑alike” characters—like the Kelvin symbol that looks like a regular K—prompting a mini panic about phishing and misread text. Cue Wikipedia receipts.
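For readers who want to see the two complaints in code, here is a minimal sketch using only Python's standard library. It is not StringZilla's API; it just shows what Unicode case folding does to "Maß" and why the Kelvin sign is a different character from the letter K.

```python
# Standard-library illustration only; this is not StringZilla's API.
german = "Maß"
kelvin = "\u212A"  # KELVIN SIGN, renders like an ordinary capital K

# Simple lowercasing keeps ß; full case folding expands it to "ss".
lowered = german.lower()
folded = german.casefold()

# After folding, "Maß" and "Mass" become indistinguishable.
same_after_folding = folded == "Mass".casefold()

# The Kelvin sign is not equal to "K", but it folds to ASCII "k".
kelvin_is_k = kelvin == "K"
kelvin_folded = kelvin.casefold()

print(lowered, folded, same_after_folding)
print(kelvin_is_k, kelvin_folded)
```

Whether that folding is a feature or a bug is exactly what the thread was arguing about.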
Meanwhile, the “keep it simple” squad insisted you should normalize text first and then compare bytes, grumbling that the new library doesn’t handle grapheme clusters (think: emojis, accents, and combined characters) as a first-class citizen. A pragmatic voice asked for transliteration rules—“let me find Armenian ‘Վարդանյան’ by typing ‘vardanyan’”—because users just want search that matches how people type. And in the back, someone deadpanned “its good,” which instantly became the thread’s meme reaction.
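The "normalize first, then compare bytes" approach and the transliteration request can both be sketched with Python's standard library. The helper names `normalized_bytes` and `transliterate` and the tiny Armenian table are hypothetical, written for this illustration; a real transliterator would need a full rule set.

```python
import unicodedata

def normalized_bytes(text: str) -> bytes:
    """NFC-normalize, case-fold, then encode to UTF-8 bytes."""
    return unicodedata.normalize("NFC", text).casefold().encode("utf-8")

composed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT

# The raw bytes differ even though both strings render the same way.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")
normalized_equal = normalized_bytes(composed) == normalized_bytes(decomposed)

# Hypothetical toy table covering only the letters in the example name.
ARMENIAN_TO_LATIN = {
    "\u057e": "v",  # vew
    "\u0561": "a",  # ayb
    "\u0580": "r",  # reh
    "\u0564": "d",  # da
    "\u0576": "n",  # now
    "\u0575": "y",  # yi
}

def transliterate(text: str) -> str:
    """Case-fold, then map each character through the toy table."""
    return "".join(ARMENIAN_TO_LATIN.get(ch, ch) for ch in text.casefold())

# The Armenian surname from the thread, spelled with escapes.
name = "\u054e\u0561\u0580\u0564\u0561\u0576\u0575\u0561\u0576"
latin = transliterate(name)

print(raw_equal, normalized_equal, latin)
```

Note that the decomposed string has one more code point than the composed one, which is the grapheme-cluster problem in miniature: what a reader sees as one character may be several code points.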
So yes, the tool is fast and claims to be correct. But the comments turned it into a front‑page Unicode culture war: performance vs precision, safety vs convenience, and whether the letter you see is really the letter you mean.
Key Points
- StringZilla uses AVX‑512 to accelerate Unicode/UTF‑8 processing, aiming for correctness via verifiers and tests.
- Tokenization (v4.3) supports 25 whitespace characters and 9 newline variants, with ~10× speedups.
- Case folding (v4.4) covers 1,400+ Unicode 17 rules and edge cases, with ~10× speedups.
- Case-insensitive substring search (v4.5) bypasses a separate folding step; the author claims 20–150× speedups over alternatives and up to 20,000× over PCRE2.
- The article explains UTF‑8’s variable-length encoding and notes that it reached 98% adoption on the Internet by 2024. StringZilla is tested against the Unicode specifications and ICU, which should ease updates for Unicode 18.0.
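The variable-length encoding mentioned in the last point can be sketched in a few lines of Python. The function name `utf8_sequence_length` is made up for this illustration; it reads the length of a UTF‑8 sequence from the high bits of its first byte and is checked against Python's own encoder.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, given its first byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII, 1 byte
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: 2 bytes
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: 3 bytes
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: 4 bytes
        return 4
    raise ValueError("continuation byte or invalid lead byte")

# ASCII K, German sharp s, KELVIN SIGN, and an emoji.
samples = ["K", "\u00df", "\u212a", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# The lead-byte rule agrees with the encoder for every sample.
decoded_lengths = [utf8_sequence_length(ch.encode("utf-8")[0]) for ch in samples]

print(lengths, decoded_lengths)
```

One consequence worth noticing: the Kelvin sign takes three bytes while the "k" it folds to takes one, so case-insensitive matching cannot assume that equivalent text has equal byte length.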