June 19, 2026
GPU gossip, now in production
Show HN: Continuous Nvidia CUDA PC Sampling Profiler
This new Nvidia code tracker has geeks cheering, side-eyeing, and debating if prod should be a lab
TLDR: Parca Agent added a way to continuously watch Nvidia GPU code while real apps are running, aiming to spot slowdowns without causing much extra load. Commenters loved the visibility but argued over whether this is a breakthrough for real-world debugging or just proof developers should have tested better earlier.
A shiny new Show HN post dropped with a big promise: a way to keep watching Nvidia graphics-card workloads while they’re live in production, not just in a testing lab. In plain English, the tool helps developers see exactly where GPU programs slow down, down to tiny instruction-level hiccups, and then ships that data into the Polar Signals interface for analysis. The makers say the trick is doing this with low enough overhead that it won’t become the problem it’s trying to diagnose.
But the real fireworks were in the comments, where the community instantly split into two camps: “This is incredibly useful” versus “Why didn’t you already figure this out before launch?” One admirer, killamdiaz, gave the project a warm round of applause but also dropped the kind of thoughtful line that makes a thread go quiet: maybe the biggest win isn’t speed at all, but visibility. Translation: sometimes the real tea is finally seeing what your system is doing.
Then came the skeptical energy. saagarjha basically asked the question lurking in every performance thread: if your GPU jobs are so mysterious in production, did you skip homework during development? Ouch. That set up the classic nerd drama of the week: is live profiling a breakthrough, or a fancy flashlight for problems you should’ve caught earlier? Even without meme-heavy replies, the vibe was unmistakable: half the crowd saw a superpower, the other half saw a very expensive “check engine” light for code.
Key Points
- Parca Agent v0.48.0 adds continuous Nvidia CUDA PC sampling support using CUPTI and sends the resulting data to a backend for analysis.
- The article says PC sampling can potentially be used in production because the implementation is designed for low overhead, unlike traditional developer-only workflows.
- PC sampling was introduced with Nvidia Maxwell and later received a dedicated API in Volta.
- Sampling frequency is configurable through a power-of-two sampling factor from 5 to 31, and the project uses 20 as a default, yielding more than 2,000 raw hardware samples per second.
- The collected data consists of program counter offset and stall-reason buckets rather than call stacks or timestamps, enabling instruction-level analysis of GPU stalls such as memory latency, shared-memory waits, queuing delays, and synchronization stalls.
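For readers who want to picture the last two points, here is a rough Python sketch, not the project's actual code. It assumes the sampling factor is the exponent of the sampling period in GPU clock cycles (so a factor of 20 means one sample every 2^20 cycles), and it invents a simple `(pc_offset, stall_reason, count)` record shape plus the helper names `sampling_period_cycles`, `aggregate`, and `hottest` purely for illustration. The stall-reason labels are made up too.

```python
from collections import Counter, defaultdict

MIN_FACTOR, MAX_FACTOR = 5, 31  # range quoted in the key points above


def sampling_period_cycles(factor: int) -> int:
    """Assumed interpretation: sampling period = 2**factor GPU clock cycles."""
    if not MIN_FACTOR <= factor <= MAX_FACTOR:
        raise ValueError(f"factor must be in [{MIN_FACTOR}, {MAX_FACTOR}]")
    return 1 << factor


def aggregate(samples):
    """Sum sample counts per PC offset, split by stall reason.

    `samples` is an iterable of (pc_offset, stall_reason, count) tuples,
    a hypothetical record shape used only for this illustration.
    """
    by_pc = defaultdict(Counter)
    for pc_offset, stall_reason, count in samples:
        by_pc[pc_offset][stall_reason] += count
    return by_pc


def hottest(by_pc, n=1):
    """Return the n PC offsets with the most samples and their top stall reason."""
    ranked = sorted(by_pc.items(), key=lambda kv: sum(kv[1].values()), reverse=True)
    return [(pc, sum(c.values()), c.most_common(1)[0][0]) for pc, c in ranked[:n]]


if __name__ == "__main__":
    # Made-up samples: no call stacks, no timestamps, just offsets and buckets.
    samples = [
        (0x40, "memory_latency", 7),
        (0x40, "sync", 1),
        (0x90, "shared_memory", 3),
        (0x40, "memory_latency", 5),
    ]
    print(sampling_period_cycles(20))  # 1048576
    print(hottest(aggregate(samples)))  # [(64, 13, 'memory_latency')]
```

Because each record carries only an offset and a stall bucket, the analysis boils down to counting: the instruction with the most samples is where the GPU spent the most time waiting, and the dominant bucket hints at why.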