June 28, 2026

Code, CUDA, and comment-section carnage

Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch

He built a chatbot from scratch — and the comments instantly went for the jugular

TLDR: A developer showed off a tiny chatbot built almost completely from scratch, which is impressive because most people use giant software toolkits instead. But the comments immediately turned into a roast session over weird code style, possible AI-written text, and whether the graphics-card version truly works.

A developer rolled into Hacker News with a bold flex: a GPT-2-sized text generator built almost entirely by hand in plain C and CUDA, the low-level tools beloved by people who think using big machine-learning frameworks is cheating. On paper, it’s a wild science-fair masterpiece: custom tokenizer, custom training pipeline, custom chat fine-tuning, all trained on a single consumer graphics card. The creator is refreshingly honest that the result is more “fluent nonsense machine” than useful assistant, but that didn’t stop the crowd from zooming past the achievement and straight into the drama.

The strongest reactions? Suspicion, nitpicking, and a little bit of savage comedy. One commenter took one look at the code style and asked whether the C code had been run through a formatter set to Python style. Ouch. Another spotted a comment in the CUDA source saying the code was untested, pinned it on an LLM, and dropped the obvious grenade: so… does the GPU version actually work? Meanwhile, the README got put on trial too, with one critic saying the references to fancy math ideas didn’t really fit and accusing the writeup of having that unmistakable AI-generated vibe, right down to the em-dashes.

And then there was the classic internet reality check: amid all the philosophy and formatting snark, someone simply asked the question everyone else should have started with — how long was this thing trained, and on how much text? That split the mood perfectly. Half the room saw a heroic from-scratch engineering feat. The other half saw a possible case of “cool demo, but receipts please.”

Key Points

  • NanoEuler is a GPT-2-class language model implemented from scratch in C/CUDA without PyTorch, autograd, or other ML libraries.
  • The project includes a hand-written byte-level BPE tokenizer, manual forward and backward passes, pretraining on a books-and-web corpus, and supervised fine-tuning into a chat model.
  • Its CUDA training pipeline uses cuBLAS and a hand-written FlashAttention implementation, validated with a full-model gradient check against a CPU reference.
  • The article says the system can train an approximately 116M-parameter model on a single RTX 4070, while also supporting smaller CPU-based models.
  • The model architecture is a decoder-only transformer using RMSNorm, RoPE, SwiGLU, grouped-query attention, and multi-token prediction.
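
For readers wondering what "manual forward and backward passes" plus a "gradient check" look like in practice, here is a minimal sketch. It is not NanoEuler's code: it is written in Python rather than C for brevity, every function name is hypothetical, and it compares a hand-derived RMSNorm backward pass against finite differences instead of the CPU reference the project describes. The idea is the same, though: if the hand-written gradient disagrees with an independent one, something is broken.

```python
import math

def rmsnorm_forward(x, gain, eps=1e-6):
    """y_i = gain_i * x_i / rms, where rms = sqrt(mean(x^2) + eps)."""
    n = len(x)
    rms = math.sqrt(sum(v * v for v in x) / n + eps)
    return [g * v / rms for g, v in zip(gain, x)], rms

def rmsnorm_backward(x, gain, rms, dy):
    """Hand-derived gradients of the loss with respect to x and gain."""
    n = len(x)
    dgain = [d * v / rms for d, v in zip(dy, x)]
    dot = sum(d * g * v for d, g, v in zip(dy, gain, x))
    dx = [g * d / rms - v * dot / (n * rms ** 3)
          for g, d, v in zip(gain, dy, x)]
    return dx, dgain

def loss(x, gain, w):
    """Toy scalar loss: a fixed weighted sum of the outputs, so dL/dy = w."""
    y, _ = rmsnorm_forward(x, gain)
    return sum(wi * yi for wi, yi in zip(w, y))

def numeric_grad(f, params, h=1e-5):
    """Central finite differences; nudges params in place, then restores."""
    grads = []
    for i in range(len(params)):
        old = params[i]
        params[i] = old + h
        up = f()
        params[i] = old - h
        down = f()
        params[i] = old
        grads.append((up - down) / (2 * h))
    return grads

def max_rel_error(a, b):
    return max(abs(p - q) / max(1e-8, abs(p) + abs(q)) for p, q in zip(a, b))

# Arbitrary illustrative values, not taken from the project.
x = [0.5, -1.2, 2.0, 0.3]
gain = [1.0, 0.8, 1.1, 0.9]
w = [0.2, -0.4, 0.7, 1.5]

_, rms = rmsnorm_forward(x, gain)
dx, dgain = rmsnorm_backward(x, gain, rms, w)
dx_num = numeric_grad(lambda: loss(x, gain, w), x)
dgain_num = numeric_grad(lambda: loss(x, gain, w), gain)

err_dx = max_rel_error(dx, dx_num)
err_dgain = max_rel_error(dgain, dgain_num)
print("max relative error, dx:   ", err_dx)
print("max relative error, dgain:", err_dgain)
```

Both errors should come out tiny. A full-model check like the one the project describes would do the same comparison across every layer at once, which is exactly the kind of receipt the skeptics in the thread were asking for.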

Hottest takes

"did you run astyle --style=python on C code?" — Chu4eeno
"your LLM left a comment in the cuda source that it is untested" — Chu4eeno
"the README is pretty clearly AI generated" — tdesilva