October 29, 2025
GPU drama, served piping hot
Continuous Nvidia CUDA Profiling in Production
Open‑source GPU spy promises speed without pain—devs clutch popcorn
TLDR: Polar Signals released an open‑source tool to watch NVIDIA GPU work in production with minimal slowdown. The community is split between excitement over always‑on visibility and skepticism about the “world’s first” claim, with commenters asking for proof, overhead numbers, and evidence that it’s truly safe to run on live systems.
Polar Signals dropped a bold claim: the “world’s first” open‑source, always‑on NVIDIA GPU profiler with low overhead, baked into their parca‑agent. Translation for non‑engineers: a tool that watches how your graphics card actually spends its time, without slowing your app to a crawl.

Cue the community side‑eye. Some cheered the move: finally, a profiler that won’t turn production into molasses. Others smelled marketing and murmured the classic meme: “benchmarks or it didn’t happen.” One camp worries about tracing anything in live systems at all, while ops folks are already imagining dashboards with every kernel name lit up like a Christmas tree. The feature taps CUPTI (NVIDIA’s built‑in profiling hooks), USDT (lightweight trace points), and eBPF (kernel‑level data plumbing), but the vibe in the thread is delightfully chaotic: “Open‑source good, overhead bad, prove it,” basically.

An early moment of drama: the author pops in (“I’m here, ask me anything”), feeding a fresh wave of curiosity about how this compares to Nsight, NVIDIA’s heavy‑duty profiler that’s famously… heavy. The hottest takes swirl around whether this can truly be “always‑on,” how much overhead it adds to the GPU, and whether containerized Python apps will stay chill. Someone joked it’s “Fitbit for your GPU”; another asked who’s brave enough to turn it on in production. Popcorn secured.
Key Points
- Polar Signals released an open-source, low-overhead CUDA profiler in parca-agent v0.43.0 for continuous production profiling.
- The approach profiles individual CUDA kernel executions to determine where GPU time is spent, extending prior GPU metrics collection.
- The solution avoids file/socket overhead by using CUPTI to collect kernel data and exposing it via USDT probes to eBPF.
- A shim library (parcagpu) is injected via CUDA_INJECTION64_PATH to intercept CUDA API calls without modifying applications.
- Two USDT probes capture launch correlation and execution details (timestamps, device/stream IDs, graph IDs, kernel names) with minimal overhead.