Comparing Python packages for A/B test analysis (with code examples)

Author compares his ‘tea-tasting’ to Pingouin, statsmodels, SciPy — and the crowd goes wild

TLDR: An author compared Python tools for A/B testing, including his own “tea-tasting,” without naming a winner. Comments erupted over bias, ease versus rigor, and whether fancy tricks like CUPED are overkill, with camps split between Pingouin convenience, statsmodels reliability, SciPy simplicity, and the inevitable “just use R.”

Move over coding tutorials — this comment section turned into a popcorn-fueled showdown. The author of tea-tasting dropped a comparison pitting his tool against Pingouin, statsmodels, and SciPy, promising no “best tool” verdict. That set off classic internet drama: purists demanded a winner, pragmatists cheered the nuance, and skeptics side-eyed the self-review. One faction yelled “bias,” another clapped for the disclosure, and a third asked for more copy‑paste‑ready outputs.

For the uninitiated, A/B tests are taste tests for apps: show two versions to different users, then check whether one truly wins. The thread explained “p‑values” as a rough check of whether the observed difference could be plain luck, and “confidence intervals” as the likely range of the true effect — all the numbers teams want in reports.
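To make those two numbers concrete, here is a minimal sketch using SciPy: Welch's t-test (the comparison's default for averages) on simulated data, plus a hand-rolled 95% confidence interval for the absolute difference. The metric values and group sizes are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated per-user metric for two variants (hypothetical data)
a = rng.normal(loc=10.0, scale=3.0, size=500)   # control
b = rng.normal(loc=10.5, scale=3.0, size=500)   # treatment

# Welch's t-test: does not assume equal variances across groups
t_stat, p_value = stats.ttest_ind(b, a, equal_var=False)

# 95% CI for the absolute difference in means, using the
# Welch-Satterthwaite approximation for the degrees of freedom
va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
se = np.sqrt(va + vb)
df = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
margin = stats.t.ppf(0.975, df) * se
diff = b.mean() - a.mean()
print(f"p-value: {p_value:.4f}")
print(f"diff: {diff:.3f}, 95% CI: [{diff - margin:.3f}, {diff + margin:.3f}]")
```

The p-value answers "could this gap be luck?"; the interval answers "how big is the gap, plausibly?" — which is why the thread wanted both in reports.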

Pingouin fans swooned over one‑liner convenience; statsmodels defenders flexed “court‑proof” rigor; SciPy minimalists said “keep it simple.” The Bayesian crew barged in screaming “stop worshipping p‑values,” and meme‑lords quipped: “Pingouin for vibes, statsmodels for court.” Product skeptics tossed shade at variance‑boosting tricks like CUPED (“fix the product first”), while the eternal brigade chimed in: “just use R.” No winner crowned, but plenty of spilled tea — and a feature matrix people actually bookmarked.

Key Points

  • Compares four Python packages—tea-tasting, Pingouin, statsmodels, SciPy—for A/B test analysis without selecting a single best tool.
  • Defines a standard A/B testing workflow: design (including power analysis), run, and analyze/report metrics with CIs and p-values.
  • Maps common metric types (averages, ratios, proportions) to typical tests (Welch’s t-test, delta-method variance, asymptotic and exact tests).
  • Emphasizes proper relative effect confidence intervals using the delta method or Fieller’s theorem, discouraging naive scaling of absolute CIs.
  • Highlights practical needs: variance reduction via CUPED, multiple testing correction (FWER/FDR), and efficiency using aggregated statistics.
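The relative-effect point deserves a sketch: dividing the absolute-difference CI by the control mean ignores the sampling noise in that mean, while the delta method accounts for it. The helper below is our illustration, not tea-tasting's API, and the data are simulated.

```python
import numpy as np
from scipy import stats

def relative_lift_ci(a, b, alpha=0.05):
    """Delta-method CI for the relative lift mean(b) / mean(a) - 1.

    Naive scaling of the absolute CI treats mean(a) as a constant;
    the delta method propagates its uncertainty into the interval.
    """
    ma, mb = a.mean(), b.mean()
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    lift = mb / ma - 1
    # First-order Taylor expansion of f(ma, mb) = mb / ma
    se = np.sqrt(vb / ma**2 + mb**2 * va / ma**4)
    z = stats.norm.ppf(1 - alpha / 2)
    return lift, lift - z * se, lift + z * se

rng = np.random.default_rng(0)
a = rng.normal(10.0, 3.0, size=2000)  # control (simulated)
b = rng.normal(10.4, 3.0, size=2000)  # treatment (simulated)
lift, lo, hi = relative_lift_ci(a, b)
print(f"lift: {lift:.2%}, 95% CI: [{lo:.2%}, {hi:.2%}]")
```

Fieller's theorem, also mentioned in the comparison, handles the same ratio problem exactly rather than via a first-order approximation.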

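For all the shade thrown at CUPED, it is a one-formula adjustment: subtract the part of the metric that a pre-experiment covariate already predicts. A minimal sketch on simulated data (variable names are illustrative, not any package's API):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
# Pre-experiment covariate and a correlated in-experiment metric
x = rng.normal(100.0, 20.0, size=n)           # e.g. pre-period activity
y = 0.8 * x + rng.normal(0.0, 10.0, size=n)   # in-experiment metric

# CUPED: y_cuped = y - theta * (x - mean(x)),
# with theta = cov(y, x) / var(x) minimizing the residual variance
theta = np.cov(y, x, ddof=1)[0, 1] / x.var(ddof=1)
y_cuped = y - theta * (x - x.mean())

# Mean is unchanged, variance drops -> tighter CIs at the same sample size
print(f"variance reduction: {1 - y_cuped.var(ddof=1) / y.var(ddof=1):.1%}")
```

The adjustment never changes the expected treatment effect; it only strips predictable noise, which is why the comparison treats it as a practical need rather than a fancy trick.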
Hottest takes

"Instead of choosing a single 'best' tool" — e10v_me
"Pingouin for vibes, statsmodels for court" — memeLord42
"Just use R and call it a day" — csvOverlord
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.