June 6, 2026
CUDA got everyone acting up
PyTorch Custom Operation
PyTorch’s DIY add-on guide drops, and the comments instantly demand receipts
TLDR: PyTorch showed how to plug custom code into models and still ship them into production-ready apps. Commenters were less impressed by the how-to than obsessed with the real drama: whether it works with newer tools, whether simpler options exist, and where the benchmark face-off is.
PyTorch just published a how-to for people who want to bolt their own custom pieces onto AI models, including faster low-level code and even reusable custom classes, then still run those models in both Python and C++ apps. In plain English: it’s a guide for power users who want to swap in their own special engine parts without breaking the whole car. The post says this can even work with PyTorch’s newer compile-and-export flow (torch.compile and torch.export), which is a big deal for anyone trying to move models from development to real-world deployment.
But the real show was in the comments, where the crowd immediately turned this into a mini courtroom drama. One camp basically said, “Cool demo, but can it fuse with normal ops, and does it really support pt2?” Translation: nice tutorial, now prove it works with the shiny new tools people actually care about. Another commenter came in with the classic “Do we really need to go this low-level?” energy, asking whether tools like JAX or CuPy might be easier than diving into C++ and CUDA. That sparked the familiar tech-world tension between “maximum control” and “please, anything but writing C++.”
Then came the wonderfully relatable chaos: version confusion, upgrade anxiety, and benchmarking envy. One person wanted to know exactly when TorchBind support landed, while another begged for a side-by-side showdown between Triton, CUDA/C++, and just letting torch.compile do its thing. The vibe was half curiosity, half skepticism, with a strong undertone of “show us the benchmarks or we’re not impressed.”
Key Points
- The article explains how to implement PyTorch custom functions and custom classes in C++ and CUDA using an identity convolution example.
- Custom functions are registered with TORCH_LIBRARY_IMPL, while custom parameterized classes are defined with torch::CustomClassHolder and registered with TORCH_LIBRARY.
- The custom code is built into a shared library, libidentity_conv_ops.so, and loaded into PyTorch with torch.ops.load_library.
- For torch.compile and torch.export compatibility, fake registrations of custom classes and functions are required so FakeTensor-based tracing can run without executing native code.
- Exported models can be packaged with torch._inductor.aoti_compile_and_package into model.pt2 and used in both Python and pure C++ inference programs, with shared library loading possible via dlopen.
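
For readers wondering what a "fake registration" actually buys you, here is a toy model of the idea in plain Python. It is a sketch, not PyTorch's implementation: the registry, the decorators, the `FakeTensor` stand-in, and the op name `my_ops::identity_conv` are all hypothetical names made up for illustration. The point is only that a tracer can ask the fake implementation for output shapes and never touch the "native" kernel.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy stand-ins only. None of these names are PyTorch APIs.


@dataclass(frozen=True)
class FakeTensor:
    """Carries metadata (here just a shape) and holds no data."""
    shape: Tuple[int, ...]


REAL_IMPLS: Dict[str, Callable] = {}  # op name -> implementation that computes
FAKE_IMPLS: Dict[str, Callable] = {}  # op name -> shape-only implementation


def register_real(name: str):
    """Hypothetical decorator: register the compute implementation of an op."""
    def deco(fn):
        REAL_IMPLS[name] = fn
        return fn
    return deco


def register_fake(name: str):
    """Hypothetical decorator: register the shape-only implementation of an op."""
    def deco(fn):
        FAKE_IMPLS[name] = fn
        return fn
    return deco


native_calls = 0  # counts how often the pretend native kernel really runs


@register_real("my_ops::identity_conv")
def identity_conv_real(x: List[List[float]]) -> List[List[float]]:
    """Stands in for the C++/CUDA kernel: output equals input."""
    global native_calls
    native_calls += 1
    return [row[:] for row in x]


@register_fake("my_ops::identity_conv")
def identity_conv_fake(x: FakeTensor) -> FakeTensor:
    """Describes the output (same shape as the input) without computing it."""
    return FakeTensor(x.shape)


def call_op(name: str, arg):
    """Use the fake impl for FakeTensor inputs, the real impl otherwise."""
    if isinstance(arg, FakeTensor):
        if name not in FAKE_IMPLS:
            raise RuntimeError(f"no fake implementation registered for {name}")
        return FAKE_IMPLS[name](arg)
    return REAL_IMPLS[name](arg)


# "Tracing": only shapes flow through, so the native kernel is never called.
traced = call_op("my_ops::identity_conv", FakeTensor((2, 3)))
calls_after_trace = native_calls

# "Eager" execution: real data goes through the real implementation.
eager = call_op("my_ops::identity_conv", [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(traced.shape, calls_after_trace, native_calls)
```

If the fake implementation were missing, the toy tracer would fail with an error instead of guessing, which is roughly why the article calls fake registrations a requirement rather than a nicety. In real PyTorch the Python-side hook for custom ops is `torch.library.register_fake`; the details for custom classes differ, so check the docs for your version.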