GT: [experimental] multiplexing tensor framework

New tool to juggle your graphics cards — early fans call it elegant

TLDR: GT is a new experimental tool that quietly coordinates multiple graphics cards to speed up math while keeping a familiar Python feel. Early praise calls it elegant, but readers want benchmarks and worry about hidden complexity from background servers and configuration files; if it’s simple and fast, it matters.

A hacker‑friendly science project just dropped: GT, an experimental tool that promises to juggle work across multiple graphics cards without locking coders into stiff, step‑by‑step routines. The vibe out of the gate? Admiration. One commenter set the tone with a swoon ("Bram always manages to build quite elegant stuff"), and the fan energy for a lean alternative to heavyweight machine‑learning toolkits is palpable. The mood is very "small team, big brain," with people hoping this stays nimble as it grows.

So what is GT in mortal terms? Think "stage manager for your GPUs." You write normal Python math, and behind the curtain a dispatcher quietly spins itself up, hands tasks to each graphics card, and keeps everything moving asynchronously (read: no waiting around). It looks like PyTorch, chats over ZeroMQ (message pipes) for speed, takes optional YAML files (simple text configs) that tell it how to split big jobs, supports autograd (it can compute training gradients), and even ships a live dashboard.

The drama: curious lurkers want benchmarks and side‑eye the surprise background server; fans love the import‑and‑go simplicity. Jokes about "YAML summoning circles" and the dispatcher being "the boss" brought the memes, but the headline sentiment is clear: if GT stays as elegant as promised, the crowd will show up. Try it here: github.com/bwasti/gt
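For a concrete sense of what "looks like PyTorch" buys you, here is a minimal sketch written in plain PyTorch, since GT's actual module and function names aren't confirmed here; the project's pitch is that code in this style carries over unchanged while the dispatcher farms the work out across GPUs behind the scenes.

```python
# Plain PyTorch, used as a stand-in: GT advertises a PyTorch-compatible
# API, so (per the pitch) this style of program should carry over.
import torch

# Ordinary tensor math, with no device bookkeeping in sight.
x = torch.randn(1024, 1024, requires_grad=True)
w = torch.randn(1024, 1024, requires_grad=True)

# Ops build a compute graph; autograd then produces training
# gradients from a single backward() call.
loss = (x @ w).relu().sum()
loss.backward()

print(w.grad.shape)  # torch.Size([1024, 1024])
```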

Key Points

  • GT is an experimental tensor framework for distributed GPU computing emphasizing dynamic scheduling and asynchronous execution.
  • Its architecture comprises multiple clients, a single dispatcher, and per-GPU workers communicating via annotated instruction streams.
  • Clients emit functional, GPU-unaware instructions that the dispatcher rewrites for GPU execution; workers process asynchronously and may JIT compile.
  • Configuration and optimization hooks (signals plus YAML annotations for sharding and compilation) are optional, so unannotated code stays portable.
  • Features include ZeroMQ transport (DEALER/ROUTER; see the sketch after this list), client-side autograd, PyTorch-compatible APIs, real-time monitoring, instruction logging, and PyTorch/NumPy backends with torch.compile.
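The DEALER/ROUTER pairing above is a standard ZeroMQ pattern, and a small sketch shows why it fits a dispatcher fronting several workers: one ROUTER socket can address each connected DEALER by identity and collect replies asynchronously. The sketch below uses pyzmq directly; the inproc endpoint, the "gpu0" identity, and the text instruction format are illustrative stand-ins, not GT's actual wire protocol.

```python
# A minimal DEALER/ROUTER round trip with pyzmq. Endpoint, identities,
# and message bytes are illustrative, not GT's real protocol.
import threading
import zmq

ctx = zmq.Context.instance()
ENDPOINT = "inproc://dispatch"  # hypothetical transport address

def worker(ident: bytes) -> None:
    """Per-GPU worker: a DEALER that registers, then serves one instruction."""
    sock = ctx.socket(zmq.DEALER)
    sock.setsockopt(zmq.IDENTITY, ident)  # must be set before connect
    sock.connect(ENDPOINT)
    sock.send(b"READY")                   # announce ourselves to the dispatcher
    instruction = sock.recv()             # e.g. b"matmul t3 t1 t2"
    sock.send(b"done " + instruction)     # reply whenever we finish
    sock.close()

# Dispatcher side: a ROUTER that addresses workers by their identity frame.
router = ctx.socket(zmq.ROUTER)
router.bind(ENDPOINT)  # inproc requires bind before connect

t = threading.Thread(target=worker, args=(b"gpu0",))
t.start()

ident, _ready = router.recv_multipart()             # learn who connected
router.send_multipart([ident, b"matmul t3 t1 t2"])  # target that worker
ident, reply = router.recv_multipart()              # reply arrives tagged
print(ident.decode(), reply.decode())               # gpu0 done matmul t3 t1 t2

t.join()
router.close()
ctx.term()
```

The draw of the pattern for a system like this is that a single ROUTER multiplexes any number of workers without blocking on any one of them, which lines up with the asynchronous, per-GPU worker design described above.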

Hottest takes

"Bram always manages to build quite elegant stuff." — almostgotcaught