January 6, 2026
PCIe panic vs GPU glory
High-Performance GPU Cuckoo Filter
GPU filter goes turbo; fans cheer, skeptics ask 'But PCIe?'
TLDR: A new GPU Cuckoo Filter claims huge speedups over CPU filters, with lookups up to 1,504× faster. The crowd is split: fans celebrate if data stays on the GPU, while shetaye’s PCIe bottleneck question fuels doubts about real-world use with giant datasets, making practicality the battleground.
Move over Bloom filters—this new GPU-powered “Cuckoo Filter” just lit up the benchmarks. Built with CUDA (NVIDIA’s tool for running code on graphics cards), it blasts through lookups up to 1,504× faster than the old-school CPU version and posts huge gains on inserts too. The dev flexes features like multi-GPU “gossip,” sorted insertion, and dramatic eviction strategies (think: reshuffling items when a spot is full). People cheered the idea of header-only, plug-and-play speed, and the meme machine went wild: “cuckoo birds evicting bad tenants” and “Marie Kondo your keys.” But the hype train hit a speed bump when someone asked the question nobody wants to face.
That someone was user shetaye, who dropped the reality check: what happens when your data isn’t on the GPU? Does the PCIe bus (the computer’s hallway to the graphics card) choke the whole thing? Suddenly it was team GPU glory vs. PCIe panic. Optimists insisted batch workloads and smart memory layouts still win big, especially for data that lives mostly on the card. Skeptics fired back: real networks, multi‑terabyte datasets, and live packet inspection don’t magically fit in HBM (the GPU’s on-board high-bandwidth memory). Meanwhile, Bloom‑filter loyalists waved the one chart where Bloom insert speed holds its own, and database folks begged for end‑to‑end tests. The thesis got spam-clicked.
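For readers who want to put rough numbers on the PCIe question, here is a back-of-envelope sketch in Python. It is a toy model, not a measurement: it assumes the keys are copied to the GPU first and only then looked up (no overlap of transfer and compute), and it ignores copying results back. The function names, the 8-byte key size, the 32 GB/s link figure, and the CPU lookup rate are all illustrative assumptions; only the 1,504× kernel-only ratio comes from the reported benchmark.

```python
def transfer_seconds(num_keys, bytes_per_key, link_gb_per_s):
    """Time to copy a batch of keys across the host-to-GPU link."""
    return num_keys * bytes_per_key / (link_gb_per_s * 1e9)


def lookup_seconds(num_keys, lookups_per_s):
    """Time to run the lookups once the keys are where they need to be."""
    return num_keys / lookups_per_s


def effective_speedup(num_keys, bytes_per_key, link_gb_per_s,
                      gpu_lookups_per_s, cpu_lookups_per_s):
    """CPU time divided by (transfer time + GPU lookup time)."""
    cpu_time = lookup_seconds(num_keys, cpu_lookups_per_s)
    gpu_time = (transfer_seconds(num_keys, bytes_per_key, link_gb_per_s)
                + lookup_seconds(num_keys, gpu_lookups_per_s))
    return cpu_time / gpu_time


# Hypothetical inputs, chosen only to show the shape of the trade-off.
CPU_RATE = 20e6              # assumed CPU lookups per second
GPU_RATE = 1504 * CPU_RATE   # kernel-only ratio quoted in the benchmark
LINK_GB_PER_S = 32           # assumed link bandwidth in GB/s

if __name__ == "__main__":
    ratio = effective_speedup(1_000_000, 8, LINK_GB_PER_S, GPU_RATE, CPU_RATE)
    print(f"effective speedup with transfer included: {ratio:.1f}x")
```

Under this model the kernel-only number only survives if the link is fast relative to the lookup rate, or if the keys already live on the card, which is exactly where the thread split.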
Key Points
- A CUDA-based, header-only GPU Cuckoo Filter supports batch insert, lookup, and delete with configurable fingerprint and bucket sizes.
- Benchmarked at 80% load factor on an NVIDIA GH200, showing large speedups over CPU Cuckoo Filter, TCF, GQF, and mixed results versus Blocked Bloom Filter.
- L2-resident (4M items) results: up to 973× faster queries vs CPU; 6×–585× faster than TCF/GQF; insert slower than Bloom (0.6×).
- DRAM-resident (268M items) results: up to 1,504× faster queries vs CPU; 1.9×–9.6× faster inserts vs TCF/GQF; insert slower than Bloom (0.7×).
- Features include sorted insertion, multi-GPU via gossip, IPC support, and a Meson/C++20/CUDA toolchain; detailed configuration via CuckooConfig.
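To make the insert, lookup, delete, and eviction ideas above concrete, here is a minimal single-threaded cuckoo filter in Python. It is a teaching sketch of the general technique (partial-key cuckoo hashing with random eviction), not the project's CUDA code or its CuckooConfig API. The class name, parameter names, and hash choice are all assumptions made for illustration.

```python
import hashlib
import random


class CuckooFilter:
    """Toy CPU cuckoo filter: buckets of short fingerprints with eviction."""

    def __init__(self, num_buckets=1024, bucket_size=4, fingerprint_bits=16,
                 max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR trick below reversible.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.fp_mask = (1 << fingerprint_bits) - 1
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _hash(self, data):
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def _fingerprint(self, key):
        return self._hash(b"fp:" + repr(key).encode()) & self.fp_mask

    def _index(self, key):
        return self._hash(b"ix:" + repr(key).encode()) % self.num_buckets

    def _alt_index(self, index, fp):
        # The partner bucket depends only on (index, fingerprint), so an
        # evicted fingerprint can be moved without knowing its original key.
        return (index ^ self._hash(fp.to_bytes(8, "little"))) % self.num_buckets

    def insert(self, key):
        fp = self._fingerprint(key)
        i1 = self._index(key)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets are full: swap with a random resident and
        # push the evicted fingerprint to its partner bucket, repeatedly.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # too full; the last evicted fingerprint is dropped

    def contains(self, key):
        fp = self._fingerprint(key)
        i1 = self._index(key)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, key):
        fp = self._fingerprint(key)
        i1 = self._index(key)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


if __name__ == "__main__":
    cf = CuckooFilter(num_buckets=1024, bucket_size=4)
    keys = [f"key-{n}" for n in range(3276)]  # roughly 80% of 4,096 slots
    stored = sum(cf.insert(k) for k in keys)
    print(f"stored {stored} of {len(keys)} keys")
```

A GPU version would process whole batches of keys in parallel and has to coordinate concurrent evictions, which this sequential sketch does not attempt. Deletion is the feature a plain Bloom filter lacks, and it is only safe for keys that were actually inserted.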