What happens when you run a CUDA kernel?

A simple 1+1 graphics chip trick turned into a wild tour of hidden computer chaos

TLDR: A deep dive showed that even a tiny graphics-chip math task triggers a surprisingly huge amount of behind-the-scenes work. Commenters loved the clarity, argued over how “mysterious” NVIDIA’s tools really are, and started speculating about who could win the future fight over making these chips faster.

A programmer set out to answer a deceptively simple question: what actually happens when your computer’s graphics chip is told to add numbers together? The punchline is deliciously absurd. A tiny program that just does “1 + 1” a million times apparently kicks off tens of millions of processor steps, hundreds of system calls, and a mysterious hardware ‘doorbell’ ring before the answer comes back. And the comment section was instantly split between awe, relief, and the usual “well, actually” energy.

The strongest reaction was basically: this is the kind of explainer people wish they'd had years ago. One reader who had studied high-performance computing said it would have saved them pain in class, especially during the brain-melting parts about how the chip decides what work can run. Others loved that the article peeled back the “magic trick” feeling around NVIDIA’s software stack, with one commenter arguing that if you skip some of CUDA’s friendlier layers, a lot of the supposed user-space “voodoo” disappears.

Then came the mini-drama: is NVIDIA hiding a black box, or is more of this documented than people think? One commenter swooped in with a link to open hardware docs like a receipts-dropping detective. Another spicy side thread wondered whether entire companies built around speeding up these programs are about to get eaten alive by open-source tools—or bought up as a juicy competitive moat. Meanwhile, one reader took a swipe at Vulkan, jokingly praising CUDA for handling synchronization so users don’t have to suffer immediately. In short: the article explained the machine, but the comments explained the mood.

Key Points

• The article uses a simple CUDA vector-add kernel to examine the full path from source code to GPU execution.
• It states that even a basic CUDA kernel launch on an RTX 4090 involves tens of millions of CPU instructions, multiple device files, about 900 ioctls, and a memory-mapped doorbell register.
• The CUDA build pipeline shown includes nvcc orchestrating host compilation plus device-code stages that generate PTX, SASS, a cubin, a fatbin, and host launch stubs.
• PTX is presented as a device-agnostic virtual ISA with typed virtual registers, which makes operations like address formation appear in multiple explicit instructions.
• The article explains how PTX instructions compute thread indices, perform bounds checks, convert generic pointers to global addresses, load operands, add them, and store results before later conversion to SASS by ptxas.
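For readers who have never seen one, here is a minimal sketch of the kind of vector-add kernel those points describe. It is not the article's actual code: the names `vectorAdd` and `blocksFor`, the element count, the block size of 256, and the use of managed memory are all assumptions made for illustration. The index computation, the bounds check, and the load-add-store line correspond to the PTX steps listed above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel name: each thread adds one pair of elements.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // bounds check
        c[i] = a[i] + b[i];                         // load, add, store
    }
}

// Hypothetical helper: number of blocks needed to cover n elements.
__host__ __device__ inline int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
}

int main() {
    const int n = 1 << 20;                 // roughly a million elements (assumed)
    const size_t bytes = n * sizeof(float);
    const int threads = 256;               // assumed block size

    // Managed memory keeps the sketch short; explicit cudaMalloc plus
    // cudaMemcpy would work just as well.
    float *a = nullptr, *b = nullptr, *c = nullptr;
    if (cudaMallocManaged(&a, bytes) != cudaSuccess ||
        cudaMallocManaged(&b, bytes) != cudaSuccess ||
        cudaMallocManaged(&c, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (int i = 0; i < n; ++i) {          // the "1 + 1" inputs
        a[i] = 1.0f;
        b[i] = 1.0f;
    }

    vectorAdd<<<blocksFor(n, threads), threads>>>(a, b, c, n);

    // Launch errors surface via cudaGetLastError; execution errors surface
    // when the host waits for the kernel to finish.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Verify every element on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (c[i] != 2.0f) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return mismatches == 0 ? 0 : 1;
}
```

Assuming the file is saved as `vector_add.cu` (a hypothetical name), `nvcc -arch=sm_89 -ptx vector_add.cu` should emit the PTX stage for an RTX 4090-class target, and running `cuobjdump -sass` on the compiled executable should show the SASS that ptxas produced from it.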

Hottest takes

"You don't actually need to read the kernel source" — fooblaster

"a lot of the user-space 'voodoo' is gone" — einpoklum

"dethroned by some sort of like open source library" — orliesaurus

June 29, 2026

When 1+1 causes maximum drama
