Pandas with Rows (2022)

Pandas melts down, DuckDB flexes in airport delay smackdown

TLDR: Loading ~13 GB of flight data into pandas crashed, so a streaming pure-Python approach saved the day. The crowd split between DIY code and “just use DuckDB,” with one user flexing M4 speed. The debate matters: picking the right tool can turn hours of waiting into seconds of results.

A coder tried to crunch 120 million U.S. flight records to find the most delay-prone airports, and their first move, loading everything into pandas, went boom. The fix? A scrappy, streaming, pure‑Python loop that tallies delays without drowning your laptop. And then the comments happened. One user dropped the mic: “DuckDB returns in seconds on my M4 MBP” and shared a gist, casually noting they didn’t bother fixing an encoding hiccup. Cue the clash: Python purists cheering the lean, low‑RAM approach versus the SQL squad yelling “Use the right tool!” while flexing benchmarks.

Nitpickers circled the time math like hawks—did the code treat flights crossing midnight right? Is mean (average) even the right choice, or should it be median to ignore extreme delays? Meanwhile, jokers piled on with “Excel would have exploded” memes and “M4 MBP speedrun” one‑liners. There’s also the eternal fight: compute it yourself with code you control, or lean on a database that’s basically a cheat code for big data. Either way, the community came for flight delays and stayed for the drama, turning a nerdy task into a full‑blown tool war.
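
To see why the median camp has a point, here is a tiny sketch with made-up delay values (illustrative numbers, not taken from the dataset). One extreme delay drags the mean far from what a typical flight looks like, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical departure delays in minutes for one airport:
# four ordinary flights and a single 10-hour outlier.
delays_minutes = [5, 8, 10, 12, 600]

average = mean(delays_minutes)   # pulled upward by the outlier
middle = median(delays_minutes)  # the middle value of the sorted list

print(f"mean={average} median={middle}")
```

Which statistic is "right" depends on the question: the mean reflects total time lost, the median reflects what a typical passenger experiences.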

Key Points

  • The article uses the Airline On-Time dataset (1987–2008) from Harvard Dataverse, totaling ~120 million records across 22 CSV files (~13 GB).
  • A naive pandas approach to concatenating all yearly CSVs is likely to cause a MemoryError on typical hardware.
  • A pure-Python streaming solution processes rows sequentially, tracking cumulative delay and flight counts per origin airport.
  • Delays are computed from parsed scheduled and actual departure times; rows with 'NA' or invalid times are skipped, and large negative delays are adjusted (if delay < -2, set delay = 24 - delay).
  • Top five airports by average delay are identified using heapq.nlargest after computing mean delay per airport.
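
The steps in the key points can be sketched roughly as follows. This is a minimal illustration, not the article's actual code: the helper names (`parse_hhmm`, `tally_delays`, `top_airports`) are hypothetical, and the column names `Origin`, `CRSDepTime`, and `DepTime` with times written as `hhmm` strings are assumptions about the CSV layout. One more assumption: for the midnight adjustment the sketch adds 24 hours to a large negative delay (`delay + 24`), which is what a wrap past midnight would need. Taken literally, the summarized rule `24 - delay` would turn a delay of -23.5 hours into 47.5.

```python
import csv
import heapq
import io
from collections import defaultdict


def parse_hhmm(value):
    """Convert an 'hhmm' string such as '1430' to hours; return None if unusable."""
    if not value or value == "NA" or not value.isdigit():
        return None
    hours, minutes = divmod(int(value), 100)
    if hours > 24 or minutes > 59:
        return None
    return hours + minutes / 60


def tally_delays(rows, totals=None):
    """Stream over rows, accumulating [total delay in hours, flight count] per origin."""
    if totals is None:
        totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        scheduled = parse_hhmm(row.get("CRSDepTime"))
        actual = parse_hhmm(row.get("DepTime"))
        if scheduled is None or actual is None:
            continue  # skip 'NA' or invalid times
        delay = actual - scheduled
        if delay < -2:
            delay += 24  # assumed intent: the departure slipped past midnight
        entry = totals[row["Origin"]]
        entry[0] += delay
        entry[1] += 1
    return totals


def top_airports(totals, n=5):
    """Return the n airports with the highest mean delay as (airport, mean) pairs."""
    means = {airport: total / count for airport, (total, count) in totals.items()}
    return heapq.nlargest(n, means.items(), key=lambda item: item[1])


# Tiny in-memory stand-in for one yearly CSV file (made-up rows).
sample = io.StringIO(
    "Origin,CRSDepTime,DepTime\n"
    "ORD,0900,0930\n"
    "ORD,2350,0020\n"
    "SFO,1200,1200\n"
    "SFO,1300,NA\n"
)
totals = tally_delays(csv.DictReader(sample))
ranking = top_airports(totals)
print(ranking)
```

For the real data, the same `totals` dictionary could be passed to `tally_delays` once per yearly file, each opened with `csv.DictReader`, so only the running per-airport sums stay in memory. The files may need an explicit `encoding` argument to `open`, given the encoding hiccup mentioned in the comments.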

Hottest takes

“DuckDB returns in seconds on my M4 MBP” — cjonas
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.