April 4, 2026
Comment wars go brrr
zml-smi: universal monitoring tool for GPUs, TPUs and NPUs
One tool to watch every AI chip: fans cheer, purists say use nvtop
TL;DR: zml-smi debuts as a single tool to monitor GPUs, Google TPUs, and AWS Trainium. The thread splits between praise for cross‑vendor coverage and criticism of a 'hacky' AMD file redirect, plus calls to upstream the work to nvtop, with extra side‑eye over 'NPU' meaning Trainium only.
zml-smi just dropped as a “one screen to rule them all” monitor for GPUs (graphics cards), Google TPUs (AI accelerator chips), and AWS Trainium (Amazon’s AI chip). Think a mashup of nvidia-smi and nvtop with a live “top” view, plus host and process stats, all in a tidy, sandboxed download from the official mirror. On paper: fewer windows, more vibes.
But the comments? Spicy. One camp loves the cross‑vendor coverage: finally, a single dashboard for NVIDIA, AMD, TPUs, and Trainium. The other camp is side‑eyeing the AMD trick, in which the tool intercepts the lookup of the amdgpu.ids device table to keep everything self‑contained; critics call it a fragile “hack,” not “sandboxing.” User mrflop lit the fuse, calling it a “brittle hack masquerading as ‘sandboxing,’” and asked why the work isn’t simply upstreamed to nvtop instead of “fragmenting” tools.
Then rdyro chimed in with a flex: nvtop can already handle TPUs via libtpuinfo, reviving the classic open‑source soap opera of building a new tool versus upgrading the old one. Meanwhile, 152334H poked the naming bear: does “NPU” here basically mean Trainium only? Cue memes about “universal” meaning “universal…ish,” and quips like “my htop is crying” as yet another monitor joins the party.
Bottom line: zml-smi brings real-time performance, temps, memory, and process info across vendors in one pane, but the thread is a tug‑of‑war between people who want a bold new hub and folks who’d rather see the features land in the tools they already trust.
Key Points
- zml-smi is a universal, sandboxed monitoring tool for GPUs, TPUs, and NPUs, combining features of nvidia-smi and nvtop.
- It supports NVIDIA, AMD, Google TPU, and AWS Trainium, with plans to add more platforms as ZML expands.
- Installation is via a downloadable archive; the tool lists devices and offers real-time monitoring with the --top flag.
- Host and process metrics are provided, plus platform-specific metrics sourced through NVML (NVIDIA), AMD SMI/libdrm (AMD), TPU runtime gRPC (Google TPU), and libnrt.so (AWS Trainium).
- For AMD, zml-smi merges amdgpu.ids from Mesa and ROCm and uses a zmlxrocm.so interposer to redirect file access, enabling updated device recognition within the sandbox.
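The AMD bullet packs two mechanisms into one line. The table-merging half can be illustrated with a minimal Python sketch, which is not zml-smi's code. It assumes the libdrm-style amdgpu.ids text layout ('#' comment lines, a version line, then `device_id,<TAB>revision_id,<TAB>product_name` rows). The function names `parse_ids`, `merge_ids`, and `render_ids`, as well as the "first table wins on conflicts" rule, are hypothetical choices made for illustration. The interposer half (zmlxrocm.so redirecting file opens to the merged copy) is a native shared library and is not modeled here.

```python
# Hypothetical sketch: merge two amdgpu.ids tables (e.g. a Mesa/libdrm copy
# and a ROCm copy). Assumed format: '#' comments, a version line, then rows of
# "device_id,\trevision_id,\tproduct_name". The precedence rule is an assumption.


def parse_ids(text):
    """Return (version, {(device_id, revision_id): product_name})."""
    version = None
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if version is None:
            version = line  # first non-comment line is the format version
            continue
        # maxsplit=2 keeps any commas inside the product name intact
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            continue  # ignore malformed rows
        device, revision, name = parts
        # Normalize hex IDs so "abcd" and "ABCD" collide as the same key
        entries[(device.upper(), revision.upper())] = name
    return version, entries


def merge_ids(primary, secondary):
    """Union of both tables; on a (device, revision) clash, primary wins."""
    version, merged = parse_ids(primary)
    _, extra = parse_ids(secondary)
    for key, name in extra.items():
        merged.setdefault(key, name)  # only add IDs the primary lacks
    return version, merged


def render_ids(version, entries):
    """Serialize back to the same tab-separated layout, sorted by key."""
    lines = ["# Merged amdgpu.ids (hypothetical output)"]
    if version:
        lines.append(version)
    for (device, revision), name in sorted(entries.items()):
        lines.append(f"{device},\t{revision},\t{name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample rows, not real entries from either upstream file
    mesa = "# Mesa copy\n1.0.0\n1234,\t00,\tExample GPU A\n"
    rocm = "# ROCm copy\n1.0.0\n1234,\t00,\tExample GPU A (ROCm)\nabcd,\tc1,\tExample GPU B\n"
    print(render_ids(*merge_ids(mesa, rocm)), end="")
```

Passing the Mesa copy as `primary` means its product names win on conflicting IDs, while IDs that only ROCm knows about are still added; swapping the arguments would favor ROCm's names instead. Which copy zml-smi actually prefers is not stated in the thread.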