Falcon: A Reliable, Low Latency Hardware Transport

Falcon promises speed; commenters yell “just use the CPU”

TLDR: Falcon is a new hardware transport promising faster, smoother data on regular Ethernet and big gains over RoCE under congestion and packet loss. Comments erupt over whether we even need it, with one camp claiming a single CPU core can already handle 200 Gbps and others arguing that latency is what wins for AI-scale workloads.

Meet Falcon, a new hardware “fast lane” that claims 200 Gbps and smoother rides on regular datacenter Ethernet (no special switch magic). The pitch: fewer slowdowns under congestion and better throughput when things get messy. Think up to 8× lower operation completion times than NVIDIA’s CX-7 RoCE in traffic jams, and up to 65% higher goodput (useful data actually delivered) when packets drop. The community? Instantly split. The top vibe is: do we even need fancy hardware when a single CPU core can already terminate 200 Gbps? Veserv fires off the cost-efficiency cannon, arguing software can do the job, with a parallel hardware crypto accelerator added if you want encryption.
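The one-core claim is easier to judge with a back-of-envelope budget. The sketch below is only arithmetic: the 3 GHz clock and the packet sizes are assumptions picked for illustration, the function names are made up, and header and framing overhead is ignored. It does not model Falcon or any real software stack.

```python
def per_packet_budget(link_gbps: float, packet_bytes: int, core_ghz: float) -> tuple[float, float]:
    """Return (packets per second, CPU cycles available per packet).

    Assumes the link is fully used by packets of one size and that a
    single core at core_ghz handles every packet. Framing overhead is
    ignored, so real budgets are somewhat tighter.
    """
    bytes_per_sec = link_gbps * 1e9 / 8
    packets_per_sec = bytes_per_sec / packet_bytes
    cycles_per_packet = core_ghz * 1e9 / packets_per_sec
    return packets_per_sec, cycles_per_packet


def cycles_per_op(core_ghz: float, mops: float) -> float:
    """CPU cycles one core could spend per operation at a given Mops/sec rate."""
    return core_ghz * 1e9 / (mops * 1e6)


if __name__ == "__main__":
    for size in (4096, 64):  # hypothetical large and small packet sizes
        pps, cycles = per_packet_budget(200, size, 3.0)
        print(f"{size:>5} B packets: {pps / 1e6:8.1f} Mpps, {cycles:7.2f} cycles/packet")
    print(f"120 Mops/sec on one 3 GHz core: {cycles_per_op(3.0, 120):.0f} cycles/op")
```

Under these assumptions, 4 KB packets leave a core roughly 490 cycles each, which makes the bandwidth claim plausible for bulk transfers. At 64 bytes the budget falls below 8 cycles per packet, and matching the reported 120 Mops/sec would leave 25 cycles per operation. The argument therefore depends heavily on message size.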

Hardware fans clap back with “latency is life,” saying microseconds matter for AI training and real-time workloads, and Falcon’s delay-based congestion control plus multipath load balancing (it spreads traffic over multiple network paths) might save clusters from meltdown. Skeptics call it “another reinvent-TCP moment,” worried about vendor lock-in and NIC wizardry that ops teams will have to babysit. PFC (Priority Flow Control, a switch feature that pauses senders to avoid drops) gets roasted as a headache, so Falcon’s “no special switches” brag earns cheers and side-eyes alike. Meme corner: “Falcon Punch to RoCE,” “My Wi‑Fi cries at 200 Gbps,” and a chorus of “show real benchies,” pointing to workloads like Gromacs and WRF to prove this bird actually flies.

Key Points

  • Falcon is a hardware transport designed for general-purpose Ethernet datacenters that operate with losses and without special switch support.
  • It supports multiple Upper Layer Protocols via a layered design and a simple request-response transaction interface.
  • Falcon’s key mechanisms include delay-based congestion control with multipath load balancing, hardware retransmissions, and robust error handling.
  • A programmable engine in Falcon provides flexibility to adapt to heterogeneous application workloads.
  • Initial hardware results show 200 Gbps and 120 Mops/sec, with up to 8× lower operation completion times than CX-7 RoCE under congestion and up to 65% higher goodput under lossy conditions.
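
For readers who have not met delay-based congestion control, here is a minimal sketch of the general idea. It is not Falcon's actual algorithm, which this summary does not describe. The class name, the target delay, the gain constants and the "pick the lowest-delay path" rule are all assumptions chosen for illustration. Real designs usually also limit how often the window can shrink (for example, once per round trip), which is omitted here.

```python
from dataclasses import dataclass


@dataclass
class DelayBasedWindow:
    """Toy AIMD congestion window driven by measured delay (illustrative only)."""
    target_delay_us: float = 25.0   # hypothetical target fabric delay
    additive_increase: float = 1.0  # packets added per window's worth of ACKs
    beta: float = 0.8               # how strongly to react to excess delay
    max_decrease: float = 0.5       # never cut more than half in one step
    min_cwnd: float = 1.0
    max_cwnd: float = 1024.0
    cwnd: float = 16.0              # current window, in packets

    def on_ack(self, measured_delay_us: float) -> float:
        """Update the window from one delay sample and return the new window."""
        if measured_delay_us <= self.target_delay_us:
            # Under target: grow gently, about additive_increase per round trip.
            self.cwnd += self.additive_increase / self.cwnd
        else:
            # Over target: shrink in proportion to how far over we are.
            excess = (measured_delay_us - self.target_delay_us) / measured_delay_us
            factor = max(1.0 - self.beta * excess, 1.0 - self.max_decrease)
            self.cwnd *= factor
        self.cwnd = min(max(self.cwnd, self.min_cwnd), self.max_cwnd)
        return self.cwnd


def pick_path(path_delays_us: list[float]) -> int:
    """Simplistic multipath choice: index of the path with the lowest delay."""
    return min(range(len(path_delays_us)), key=path_delays_us.__getitem__)


if __name__ == "__main__":
    window = DelayBasedWindow()
    for delay in (10.0, 12.0, 100.0, 60.0, 15.0):
        print(f"delay={delay:6.1f} us -> cwnd={window.on_ack(delay):.3f}")
    print("chosen path:", pick_path([30.0, 12.0, 18.0]))
```

The point of the delay signal is that the sender backs off as queues build, before packets are dropped. That is one reason a design like this can work without lossless-switch features such as PFC.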

Hottest takes

"It only takes like 1 core to terminate 200 Gb/s" — Veserv
"If you want encrypted transport, then all you need is a parallel hardware crypto accelerator" — Veserv
"If you want to keep it off the memory bus..." — Veserv