'Comically bad' datasets used to train clinical models for stroke and diabetes

Researchers used celeb pics to spot strokes, and commenters say the real illness is bad data

TLDR: A medical AI study for stroke detection appears to have used a wildly unreliable online image set that included celebrities and duplicate photos, prompting a journal warning. Commenters were brutal, saying this proves too many researchers chase flashy models while ignoring the one thing that actually matters: trustworthy data.

The internet has found its latest "you had one job" scandal: researchers used a Kaggle image set to help train a stroke-detection system, and people digging through it reportedly found Rambo, George Clooney, Angelina Jolie, Daniel Craig, kids, duplicates, and images that may not even show stroke at all. After Springer Nature was contacted about the paper, the journal slapped on an editor’s warning, and one of the datasets vanished from Kaggle. Translation for non-specialists: a study meant to help spot a serious medical emergency may have been built on what critics are calling absolute junk.

And the comments? Merciless. One of the strongest community opinions is that this isn’t some tiny mistake — it’s a whole culture problem. As user Legend2440 put it, too many researchers seem obsessed with building fancy models while treating the actual data like an afterthought. Their scorching verdict: the model is the easy part; the hard part is getting trustworthy data. Another commenter, matusp, basically said this kind of mess is so common that if you peek at a handful of samples in almost any dataset, you’ll find something bizarre right away.
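For anyone who wants to try matusp's "peek at a handful of samples" advice, here is a minimal sketch. It is not anything the commenters or researchers posted; the function name `sample_for_review` and its parameters are made up for illustration. It simply pulls a small, reproducible random subset of file names so a human can open and eyeball them.

```python
import random
from typing import Sequence


def sample_for_review(items: Sequence[str], k: int = 10, seed: int = 0) -> list[str]:
    """Return up to k items picked uniformly at random (hypothetical helper).

    A fixed seed makes the pick reproducible, so two reviewers
    can look at the same handful of files.
    """
    rng = random.Random(seed)
    k = min(k, len(items))  # never ask for more items than exist
    return rng.sample(list(items), k)


if __name__ == "__main__":
    # Placeholder file names, not real dataset contents.
    files = [f"img_{i}.jpg" for i in range(100)]
    for name in sample_for_review(files, k=5, seed=1):
        print(name)
```

A random pick matters here: scrolling only the first few files in a folder can miss problems that cluster elsewhere in the set.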

The mood is a mix of rage, dark laughter, and exhausted déjà vu. The unspoken meme floating over the whole thing is: "Congrats, your stroke detector can now diagnose Sylvester Stallone." Beneath the jokes, though, the fear is real — if sloppy internet data can flow into published papers and possibly clinical tools, commenters think the problem isn’t just embarrassing. It’s dangerous.

Key Points

• Researchers found that a Kaggle image dataset used in a Scientific Reports stroke-detection paper contained duplicate celebrity photos and other inappropriate images.
• Adrian Barnett and Alexander Gibson said the case is part of a larger problem involving questionable Kaggle stroke and diabetes datasets used in published clinical prediction studies.
• Their medRxiv preprint traced two problematic datasets into 124 published papers and reported that both datasets failed a provenance checklist.
• One of the two image datasets used in the Scientific Reports paper has been removed from Kaggle, while the remaining "droopy" dataset still contained mislabeled and duplicate images when reviewed.
• After Springer Nature was contacted, the journal added an editor’s note warning about concerns over the reliability of the article’s data and said further editorial action may follow.
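Duplicate images like the ones described above are among the easier problems to check for. The sketch below is an illustration only, not the method Barnett and Gibson used; the function name `find_exact_duplicates` is hypothetical. It groups files by the SHA-256 hash of their bytes, so it catches only byte-for-byte copies. Resized or re-encoded copies of the same photo would need a perceptual-hashing approach instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Group files under `root` that have identical bytes (hypothetical helper).

    Returns a list of groups; each group is a list of paths
    whose contents hash to the same SHA-256 digest.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in groups.values() if len(paths) > 1]


if __name__ == "__main__":
    # "dataset_dir" is a placeholder for a local copy of an image folder.
    for group in find_exact_duplicates("dataset_dir"):
        print("identical files:", [str(p) for p in group])
```

A clean result from a check like this says nothing about whether the labels are right or where the images came from; those still take human review.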

Hottest takes

"The model is the easy part" — Legend2440

"Getting good data is 99% of the job" — Legend2440

"You will find out something weird going on instantly" — matusp

May 19, 2026
