April 12, 2026
Float wars on tiny chips
Mark's Magic Multiply
Tiny chips, big math: fans rally behind a “halfway” speed boost
TLDR: A “firm float” approach promises faster math on small processors without adding expensive hardware, and readers are here for it. The comments cheer and swap Mark Owen’s classic library link, while the age‑old hardware‑vs‑software math debate quietly revs up in the background.
Float math on tiny, low‑power chips is usually a pain, but this post’s “firm float” idea — a halfway step between pure software and full hardware — has the crowd buzzing. The author claims fast add and multiply without bolting on a full math coprocessor, and the first reply immediately drops a tribute link to Mark Owen’s legendary soft‑float library for ARM: qfplib. Translation: respect paid to the OG, hype granted to the new trick.
Readers chuckled at the tongue‑in‑cheek “California warning label” about floating point panic, then got serious about what this means for real‑world projects: faster math on cheap chips without rewriting everything. The vibe so far? Warm, practical, and a little nostalgic — with one commenter straight‑up saying they like the “firm float” concept.
Meanwhile, a familiar storm cloud looms on the horizon: the eternal fight between hardware purists (“just add a floating‑point unit!”), software minimalists (“just use integers!”), and fans of the new middle path (“firm float”). Not much fire in the thread yet, but everyone knows this is the debate that never dies. For now, the mood is upbeat and link‑heavy: Mark’s trick resurfaces, the new RISC‑V‑flavored approach gets kudos, and embedded devs everywhere eye their tiny chips like, “so… math party?”
Key Points
- Xh3sfx is a custom RISC‑V extension that accelerates soft floating‑point operations using specialized ALU instructions, without requiring a forked compiler.
- The approach replaces the compiler's runtime routines (e.g., in libgcc or compiler‑rt) with drop‑in accelerated versions, leveraging a stable API surface.
- Reported performance: single‑precision addition in 14 cycles and multiplication in 16 cycles (excluding call overhead), for a small hardware cost (hundreds of gates).
- The single‑precision multiply routine unpacks the operands, handles special cases, performs a 32×32→64 significand multiply, gathers sticky bits, normalizes, and repacks the result.
- The work is inspired by a floating‑point multiplication trick by Mark Owen for 32‑bit embedded cores, which the article aims to dissect.