Show HN: Cancer diagnosis makes for an interesting RL environment for LLMs

AI tries spotting cancer by zooming around slides; HN presses on accuracy, training details, and jokes about calling Eric Topol

TLDR: A demo lets AI zoom around giant tissue slides to attempt cancer diagnoses, with promising but very early results. HN loved the concept but pressed for comparisons to specialized models, fine‑tuning details, and bigger benchmarks, and joked that the author should "call Eric Topol."

A YC founder just let an AI "drive the microscope" and the internet grabbed popcorn. David from Aluna built a demo where frontier chatbots zoom and pan across giant, digitized tissue slides like a pathologist would, but only seeing small regions at a time. The goal: hunt the right spots, then call the diagnosis. He shared videos of the AI tracing its path on a lung cancer case (link) and a benign breast case (link). Early numbers? GPT‑5 explored up to 30 regions and often agreed with a real pathologist; Claude 4.5 was similar; smaller models lagged. It's tiny-sample stuff, but undeniably cool.

Cue Hacker News drama. Skeptics immediately asked for receipts: how do humans classify cancers, and how does this compare to specialist pathology AI that's built for slides? Another chorus pushed for a bigger benchmark and clarification on whether the models were fine‑tuned or just "out of the box." The acronym drop (IHC, immunohistochemistry stain scoring) got translated for the crowd, while one commenter summoned the meme-y medical influencer: "Call Eric Topol." Meanwhile, builders suggested plugging in newer image‑segmenting models to supercharge the "AI pathologist videogame." The vibe: jaw‑drop demo, but the crowd wants head‑to‑head stats, training details, and a lot more cases before crowning Dr. GPT.

Key Points

  • Aluna built an RL environment that lets LLMs zoom and pan whole-slide images (WSIs) to identify regions for cancer diagnosis.
  • WSIs in TIF/SVS formats are multi-gigabyte and exceed LLM context windows; tiling is inefficient for large models.
  • Tool-augmented navigation and prompt engineering produced promising results on limited samples (a rough sketch of such a navigation tool follows this list).
  • GPT-5 agreed with an expert pathologist on 4/6 subtyping and 3/5 IHC tasks after exploring ~30 regions.
  • Claude 4.5 showed similar accuracy with fewer views; smaller models (GPT-4o, Claude 3.5 Haiku) were less accurate.
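
For intuition on the "tool-augmented navigation" idea: a single WSI can be on the order of 100,000 pixels on a side, far too large to hand to a model whole, so the environment instead serves small crops on demand. Below is a minimal, hypothetical sketch of such a pan/zoom tool using the openslide-python bindings; the class, tile size, and file name are assumptions for illustration, not Aluna's actual environment.

```python
# A minimal sketch (not Aluna's code) of a pan/zoom "tool" an LLM agent
# could call to explore a whole-slide image one small region at a time,
# using the openslide-python bindings.
import openslide

TILE = 512  # pixels per returned view; chosen for illustration


class SlideNavigator:
    """Read-only pan/zoom over a multi-gigabyte WSI.

    The full slide is never loaded; each call streams one small crop,
    which is the piece an agent would pass to a vision model.
    """

    def __init__(self, path: str):
        self.slide = openslide.OpenSlide(path)  # handles .svs / .tif pyramids

    def view(self, x: int, y: int, level: int):
        """Return a TILE x TILE RGB crop.

        (x, y) is the top-left corner in level-0 (full-resolution)
        coordinates; higher `level` means more zoomed out.
        """
        level = max(0, min(level, self.slide.level_count - 1))
        region = self.slide.read_region((x, y), level, (TILE, TILE))
        return region.convert("RGB")


# Hypothetical usage: an agent loop would take (x, y, level) from the
# model's tool call, render the crop, and return it as the next observation.
nav = SlideNavigator("example_slide.svs")             # hypothetical file
overview = nav.view(0, 0, nav.slide.level_count - 1)  # most zoomed-out view
```
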

Hottest takes

"How would a human classify the cancers?" — n2d4
"You should read out to Eric Topol..." — ytrt54e
"Did you fine tune GPT 5... or ... 'out of the box?'" — areoform
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.