Yes, you should understand backprop (2016)

Karpathy’s classic sparks a 2025 brawl: learn the guts or just push the AI button

TLDR: Karpathy’s 2016 case for learning backpropagation is trending again; his argument is that knowing the basics prevents nasty AI surprises. The crowd splits three ways: fundamentals-first hardliners, tool-trusting pragmatists, and a middle path, with everyone agreeing the question matters in an AI-everywhere world where “push the button” often isn’t enough.

Andrej Karpathy’s 2016 rallying cry that you should actually understand how neural nets learn just got dragged back into the spotlight, and the comments are throwing elbows. His point: backprop, the algorithm that computes the gradients a network learns from, is a “leaky abstraction,” so hand‑waving it away can break your models in weird ways. Think: buttons pressed, lights flash, nothing learns. Cue the community fireworks.

One camp rolled in with peak gatekeeping energy. User joshdavham basically asked the 2025 tech world: if you’ve got AI opinions, can you at least explain gradient descent? Translation: no takes without homework. Others pushed back with humility and curiosity; drivebyhooting wondered if fancy training tools (momentum, clipping, all the knobs, the kind sketched just below) mean the raw gradient math matters less now. Meanwhile the fan club showed up, calling Karpathy’s teaching contribution “immense,” linking throwback hits like “The Unreasonable Effectiveness of Recurrent Neural Networks,” and reviving the old HN thread for a nostalgia lap.
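
For readers who want to see what those knobs actually do, here is a minimal, hypothetical NumPy sketch of SGD with momentum plus gradient clipping; the function name and default values are illustrative, not anything Karpathy or the commenters wrote.

```python
# Minimal sketch (illustrative, not from the thread): the kind of optimizer
# "knobs" drivebyhooting mentions -- gradient clipping plus SGD with momentum,
# applied to a NumPy parameter vector. Names and defaults are made up.
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=1e-2, momentum=0.9, clip_norm=5.0):
    """One update: clip the gradient's global norm, then apply momentum SGD."""
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)        # rescale oversized gradients
    velocity = momentum * velocity - lr * grad  # running blend of past directions
    return w + velocity, velocity

# usage (hypothetical):
# w, v = np.random.randn(10), np.zeros(10)
# w, v = sgd_momentum_step(w, grad, v)
```

Even with these knobs, the update is still built from the raw gradient, which is drivebyhooting’s open question: how much does the underlying calculus still leak through?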

The vibe: a three-way cage match between “learn the fundamentals,” “modern tools are fine,” and “why not both.” Jokes flew (“Do you even backprop, bro?”) while pragmatists stressed a sweet spot: understand one layer deeper, but don’t drown in math. Karpathy’s post is old; the debate is not, and it’s messier than ever in the age of AI everywhere.

Key Points

  • CS231n at Stanford required students to implement forward and backward passes in raw NumPy.
  • Students questioned manual backprop since frameworks like TensorFlow auto-compute gradients.
  • Karpathy argues backpropagation is a leaky abstraction that developers should understand.
  • Sigmoid/tanh activations can saturate with large weights or poor preprocessing, causing vanishing gradients.
  • Sigmoid’s local gradient peaks at 0.25, shrinking gradients each layer and slowing learning, especially with basic SGD (see the sketch after this list).
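
To make the last two bullets concrete, here is a minimal NumPy sketch of a sigmoid layer’s forward and backward pass, in the spirit of the CS231n exercises (function names and the tiny demo are illustrative, not code from Karpathy’s post): the local gradient tops out at 0.25 and collapses toward zero once the unit saturates.

```python
# Minimal sketch, assuming a lone sigmoid "layer" in plain NumPy, in the
# spirit of the CS231n exercises (not Karpathy's actual assignment code).
import numpy as np

def sigmoid_forward(x):
    out = 1.0 / (1.0 + np.exp(-x))
    return out, out                  # second value is the cache for backward

def sigmoid_backward(dout, cache):
    s = cache
    local_grad = s * (1.0 - s)       # maxes out at 0.25 when s == 0.5 (x == 0)
    return dout * local_grad         # chain rule: upstream gradient shrinks

# Saturation demo: large |x| pushes the local gradient toward zero, so very
# little signal flows back to earlier layers (the vanishing-gradient effect).
for x in (0.0, 2.0, 10.0):
    _, cache = sigmoid_forward(np.array([x]))
    print(x, sigmoid_backward(np.array([1.0]), cache))
# 0.0  -> ~0.25      (the largest local gradient a sigmoid can ever pass)
# 2.0  -> ~0.105
# 10.0 -> ~0.000045  (saturated: the gradient has all but vanished)
```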

Hottest takes

"but can you at least tell me how gradient descent works?" — joshdavham
"Karpathy's contribution to teaching around deep learning is just immense." — gchadwick
"with more advanced optimizers the gradient is not really used directly" — drivebyhooting