May 27, 2026
Hot chips, cold logic
Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data
Your GPU may run faster on ‘easy’ numbers — and the comments are losing it
TLDR: A developer found that a high-end graphics card can do the same math faster when the input numbers are more predictable, apparently because “busier” numbers make the chip burn more power and slow itself down. Commenters were split between amazed, skeptical, and deeply distracted by the card using 88 watts while idle.
A routine speed test turned into full-on computer gossip when a developer discovered that the actual numbers inside a matrix (basically a giant grid of values) could make the same graphics card run the same calculation at measurably different speeds. With all-zero inputs, the chip flew at about 295 teraflops; with normally distributed random values, it slowed to about 257. That sent readers straight into the comments with a collective: wait, what? For many, this broke the basic common-sense rule that the same job should take the same time no matter what numbers are inside.
The biggest reaction was pure disbelief. One commenter said they arrived expecting the old classic explanation — branch prediction — only to discover that modern chips are somehow even weirder. Another jumped in with a practical flex: people running AI models at home have already seen this strange effect, and some even claim that limiting power can make things faster overall, which sounds fake until it isn’t.
But not everyone was ready to crown the mystery solved. One skeptical voice asked the question everyone was thinking: is this actually proven, or is it just a neat theory with numbers that kind of fit? And then came the side quest nobody expected — readers becoming obsessed with the claim that a high-end card can sit there doing “nothing” while still gulping 88 watts. In other words: the article found a hardware quirk, but the comments delivered the real drama — confusion, skepticism, and a lot of “your idle PC is doing WHAT?” energy.
Key Points
- A large 8192 x 8192 x 8192 matrix multiplication benchmark showed 258 teraflops in PyTorch via cuBLAS and 288 teraflops in the CUTLASS profiler.
- When CUTLASS kernels were bound into Python and compared again, CUTLASS measured 257 teraflops versus cuBLAS at 258 teraflops, removing the earlier performance gap.
- The article traced the discrepancy to CUTLASS profiler input initialization, which by default used integer-valued inputs.
- Benchmarking different input values showed major performance differences: zero-valued inputs reached about 295 teraflops, while normally distributed random inputs reached about 257 teraflops.
- The article explains the effect as a consequence of GPU power limits and dynamic switching power, where different data patterns may change transistor switching activity and trigger clock throttling.
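
To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It is not the developer's benchmark code. It assumes the usual convention of counting 2 x M x N x K floating-point operations for a matrix multiplication, and it uses a toy "bit toggle" count over 16-bit words as a loose stand-in for switching activity. The function names, the 16-bit word size, and the uniform random words are all illustrative assumptions; real GPU power draw depends on far more than this.

```python
import random


def matmul_flops(m: int, n: int, k: int) -> int:
    """Operation count for an (m x k) @ (k x n) product: one multiply and one add per term."""
    return 2 * m * n * k


def seconds_at(tflops: float, flops: int) -> float:
    """Time to finish `flops` operations at a sustained rate of `tflops` teraflops."""
    return flops / (tflops * 1e12)


def bit_toggles(words: list[int]) -> int:
    """Toy proxy for switching activity: bits that flip between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# The benchmark size quoted in the key points.
flops = matmul_flops(8192, 8192, 8192)

# Throughput figures quoted in the key points.
for label, tflops in [("all zeros", 295.0), ("random normal", 257.0)]:
    ms = seconds_at(tflops, flops) * 1e3
    print(f"{label}: {ms:.2f} ms per multiplication")

# Assumption: uniform random 16-bit words stand in for "busy" data.
rng = random.Random(0)
zeros = [0] * 1024
noisy = [rng.getrandbits(16) for _ in range(1024)]
print("toggles, zeros:", bit_toggles(zeros))
print("toggles, noisy:", bit_toggles(noisy))
```

At the quoted rates, one multiplication takes roughly 3.7 milliseconds with zeros and roughly 4.3 milliseconds with random values. The toggle count shows the intuition behind the power explanation: a stream of zeros never flips a bit, while random data flips many, and more flipping is what dynamic switching power charges for.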