December 16, 2025
Poison pill, spicy comments
Reverse-Engineering the RK3588 NPU: Hacking Limits to Run Vision Transformers
Student outsmarts a $100 AI chip; fans cheer, skeptics nitpick, everyone begs for standards
TLDR: A student sliced and “poison‑pilled” a cheap AI chip to run modern image models despite a tiny memory limit. Commenters split between cheering real hacking, downplaying the “reverse‑engineering,” and demanding an open NPU standard like RISC‑V to end today’s confusing, closed toolchains.
Cue the hacker movie montage: a grad student took a bargain AI board—the Rockchip RK3588 in the Orange Pi 5—that bragged “6 TOPS” (translation: fast at AI), and made it actually run modern image AI by slicing the problem into bite‑size chunks and dropping a sneaky “poison pill” to stop a stubborn compiler from breaking it. Translation for humans: the chip’s tiny fast memory (think a thimble of workspace) kept crashing big models, so the student diced the job into mini tiles and tricked the software into keeping them separate. Boom—no more crash, and the NPU finally works instead of napping.
The comments? Pure popcorn. Fans yelled “finally real hacking!” with one quip that Hacker News needed a break from “vibe coding” and ChatGPT summaries. Purists rolled in to say there was “very little reverse engineering,” sparking a spicy semantics brawl: is outsmarting a black‑box toolchain reverse‑engineering or just clever hacking? Pragmatists shrugged: why not just use Rockchip’s newer black‑box SDK? Open‑source diehards dropped names like TEFLON and ROCKET and argued we need an open NPU standard—a RISC‑V for AI chips—before this mess gets worse. Old‑timers compared today’s NPU chaos to the wild west of early CPUs. The meme of the day: “6 TOPS on paper, 0 tops in practice—until the poison pill hit.”
Key Points
- The RK3588 NPU (Orange Pi 5) failed to compile SmolVLM-v1’s SigLIP Vision Transformer using rknn-toolkit2, producing an undocumented REGTASK Overflow (0xe010).
- Synthetic ONNX tests revealed a strict 32KB threshold, indicating a hardware-enforced 32KB L1 SRAM scratchpad for vector operations.
- Attention layers for a 1024-token sequence generate ≈25MB of activations, far exceeding the NPU scratchpad and causing compilation crashes.
- A PyTorch-based “Nano-Tiling” strategy (32×32 tiles) fit each computation into the scratchpad, but the compiler fused the tiles back into large blocks.
- A “Poison Pill” dummy dependency was inserted to prevent fusion, enabling tiled execution to proceed on the legacy rknn-toolkit2 stack.
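The tiling idea from the key points can be sketched in plain PyTorch. The arithmetic behind the crash is straightforward: a 1024-token attention matrix is 1024 × 1024 scores per head, so at fp16 with a typical 12-head layer that is roughly 12 × 1024 × 1024 × 2 bytes ≈ 25 MB, versus a 32 KB scratchpad. Below is a minimal, hypothetical sketch of tiling a matmul into 32×32 blocks so every intermediate stays tiny; the shapes, tile size, and function name are illustrative assumptions, not the author's actual code or Rockchip specifications.

```python
# Illustrative "nano-tiling" sketch: compute a large matmul one 32x32 output
# block at a time, so each intermediate is only 32*32*4 B = 4 KB (fp32 here),
# comfortably under an assumed 32 KB scratchpad. Not the author's real code.
import torch

TILE = 32  # tile edge; a 32x32 fp16 tile is 2 KB, fp32 is 4 KB

def tiled_matmul(a: torch.Tensor, b: torch.Tensor, tile: int = TILE) -> torch.Tensor:
    """Compute a @ b block-by-block; assumes dims divide evenly by `tile`."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % tile == 0 and n % tile == 0 and k % tile == 0
    out = torch.zeros(m, n, dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = torch.zeros(tile, tile, dtype=a.dtype)
            for p in range(0, k, tile):
                # each operand here is only tile*tile elements
                acc += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
            out[i:i+tile, j:j+tile] = acc
    return out

a = torch.randn(64, 64)
b = torch.randn(64, 64)
# matches the monolithic matmul up to float accumulation order
assert torch.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```

The trade-off is pure bookkeeping overhead: the math is identical, but the loop nest gives the compiler many small ops instead of one big one, which is exactly what invites the fusion problem described next.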
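The poison-pill trick can also be sketched: thread a dummy data dependency from each tile's output into the next tile's input, so a graph compiler that only fuses independent ops must keep the tiles separate. The helper below is a hypothetical illustration of the shape of the idea, assuming a zero-scaled link; a real optimizer might fold away a literal multiply-by-zero, so the actual trick presumably uses a dependency the toolchain cannot eliminate.

```python
# Hypothetical "poison pill" sketch: make tile N+1 formally depend on tile N
# without changing any values, so a fusion pass can't merge the tiles back
# into one oversized block. Illustrative only, not the author's actual method.
import torch

def poison_pill(tile_out: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
    # prev.sum() * 0.0 is numerically zero, but it records a graph edge from
    # the previous tile's result into this one when the model is traced.
    return tile_out + prev.sum() * 0.0

x = torch.randn(4, 32, 32)   # four 32x32 "tiles"
prev = torch.zeros(1)
outs = []
for t in x:
    y = t.relu()             # stand-in for the per-tile NPU computation
    y = poison_pill(y, prev) # chain a fake dependency on the previous tile
    prev = y
    outs.append(y)
result = torch.stack(outs)
# values are unchanged; only the dependency structure differs
assert torch.allclose(result, x.relu())
```

The point is that the pill is semantically a no-op but structurally a serializer: the exported graph now has a chain of edges that forces the compiler to schedule the tiles one after another instead of fusing them.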