Python: The Optimization Ladder

From 'too slow' to 'just glue it': the internet brawls over how to make Python fast

TLDR: A dev benchmarked how far common tools (NumPy, Numba, Cython, PyPy, Rust, GPUs) can push Python speed, showing trade‑offs instead of a single winner. Comments split between JIT optimism, “Numba/GPU first” pragmatism, and the “Python as glue, hot loops in C/C++” crowd—plus the usual “just use Rust” jokes.

Python’s latest “how fast can we make it?” saga arrives with real benchmarks and real drama. One dev climbed an “optimization ladder” on an Apple M4 Pro—testing classic math puzzles from the Benchmarks Game plus a real-world JSON pipeline—and confirmed what veterans keep saying: raw Python is slow, but the right tools can make it fly. The post walks through NumPy, Numba, Cython, PyPy, Rust and more, and explains (in plain English) that Python’s super-flexible design is exactly what makes it hard to turbocharge.
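To make the “ladder” concrete, here’s a minimal sketch of its first rung—the function names are illustrative, not from the post: the same sum-of-squares computed as an interpreted Python loop and as a NumPy call that pushes the loop into compiled C.

```python
import time
import numpy as np

def py_sum_squares(n):
    # Pure-Python loop: every iteration pays for bytecode dispatch,
    # type checks, and boxed float objects.
    total = 0.0
    for i in range(n):
        total += i * i
    return total

def np_sum_squares(n):
    # Same computation, but the loop runs inside NumPy's compiled C code.
    a = np.arange(n, dtype=np.float64)
    return float(np.dot(a, a))

n = 100_000
t0 = time.perf_counter(); slow = py_sum_squares(n); t_slow = time.perf_counter() - t0
t0 = time.perf_counter(); fast = np_sum_squares(n); t_fast = time.perf_counter() - t0
print(slow == fast)                      # same answer either way
print(f"~{t_slow / t_fast:.0f}x faster")  # speedup varies by machine
```

The exact multiplier depends on hardware and array size, but the shape of the result matches the post’s thesis: the win comes from escaping the interpreter loop, not from smarter math.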

But the comments are where the fireworks start. The JIT-hope crowd cheers that Python already has a small just‑in‑time compiler in 3.13, with one commenter claiming 3.15 will borrow PyPy’s tracing magic. The pragmatists fire back: stay in the pandas/polars sandbox, “reach for Numba first,” and don’t forget GPUs—because your data crunch might move faster on a graphics card than on a rewrite. Then the two‑language truthers chime in: Python is the perfect “glue,” so write the inner hot loops in C or C++ and keep the rest comfy.
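The “reach for Numba first” advice boils down to one decorator on a hot numeric loop. This is my illustration, not the commenter’s code, and the import guard is a hypothetical fallback so the snippet still runs where Numba isn’t installed:

```python
import numpy as np

try:
    from numba import njit  # JIT-compiles the decorated function to machine code
except ImportError:
    def njit(func):  # hypothetical no-op fallback: same code, plain-Python speed
        return func

@njit
def dot(xs, ys):
    # The same boxed-arithmetic loop that crawls in CPython;
    # under Numba it runs as native code over unboxed floats.
    total = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total

a = np.arange(1_000, dtype=np.float64)
print(dot(a, a))  # first call pays the compilation cost; repeat calls are fast
```

The appeal is exactly what the pragmatists describe: no rewrite, no second language, and the fallback path means the code still works (slowly) without the dependency.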

Meanwhile, the memes roll in: “Just use Rust,” “100x slower but 1000x happier,” and the classic “I/O bound, bro.” It’s equal parts speed science and identity crisis—is Python a sprinter, or the world’s best team coach? The answer depends on which comment you upvote.

Key Points

  • The author benchmarked CPython 3.13 against C (gcc) on n-body and spectral-norm, plus a JSON event pipeline, on an Apple M4 Pro.
  • Baseline results show large gaps: e.g., n-body 2.1s (C) vs 372s (CPython, 177x) and spectral-norm 0.4s (C) vs 350s (CPython, 875x).
  • Python’s maximally dynamic design causes heavy runtime overhead (type checks, dispatch, object allocation), making numeric loops slow.
  • A small Python integer occupies at least 28 bytes in CPython, versus 4 bytes for a C int, illustrating the cost of the object model.
  • CPython 3.11+ adaptive specialization speeds common operations; the GIL doesn’t affect single-threaded benchmarks, and CPython 3.13 adds an experimental free-threaded mode.
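The object-model cost in the bullets above is easy to check yourself; the stdlib `array` comparison is my addition, not from the post:

```python
import sys
from array import array

# A boxed Python int carries a refcount, a type pointer, and a size field
# before any digits of the number itself.
print(sys.getsizeof(1))          # ~28 bytes on a typical 64-bit CPython build
# A C-style machine int, as stored contiguously by the array module:
print(array('i', [1]).itemsize)  # typically 4 bytes per element
```

That per-object overhead, multiplied across millions of loop iterations and allocations, is a large part of why the pure-Python baselines above trail C by two to three orders of magnitude.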

Hottest takes

"Good news. Python 3.15 adapts Pypy tracing approach to JIT" — Ralfp
"reach for numba first" — __mharrison__
"Python is perfect as a "glue" language" — rusakov-field
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.