May 21, 2026
Prompt and Circumstance
The famous o3 "GeoGuessr" prompt did not work
That ‘magic’ photo-location trick may have been hype all along, and commenters are not letting it slide
TLDR: A test found the famous long prompt for guessing where photos were taken didn’t beat a basic prompt, suggesting the “magic” may have been overhyped. Commenters split between mocking flimsy AI hype, questioning the benchmark, and floating a conspiracy that the skill was intentionally weakened.
The big reveal? The famous super-long prompt that supposedly turned OpenAI’s o3 into a photo-location wizard seems not to have helped much at all. After testing both prompts on 200 images, the author found the plain, simple prompt often did slightly better than the fancy “GeoGuessr” script people were passing around like secret sauce. In other words: the model was already good, and the internet may have mistaken that for prompt genius. Ouch.
That set off exactly the kind of comment-section fireworks you’d expect. One camp basically yelled, “called it!” and treated this as another lesson in how easy it is to get dazzled by AI and start giving credit to the wrong thing. A particularly sharp jab came from users saying the real reason nobody checked sooner is that bold AI claims are simply more retweetable than boring verification. Another crowd was immediately suspicious of the test itself, asking whether image metadata was stripped and whether using public images from places like Wikipedia meant the model might already have seen them. And then came the spiciest theory of all: maybe newer models seem worse at geolocation because the ability was deliberately nerfed.
The vibe was part skepticism, part detective drama, part “lol prompt engineering strikes again.” The running joke? AI will happily tell you your prompt tweak was brilliant, like a very flattering friend, which is exactly why it should not be grading its own homework.
Key Points
- The article revisits claims that OpenAI’s o3 model had exceptional photo geolocation ability and that a long custom prompt helped unlock it.
- The author created a 200-image benchmark using images from Wikimedia Commons, Geograph Britain and Ireland, and iNaturalist.
- o3 was tested twice on the same benchmark: once with a basic prompt and once with the long “GeoGuessr” prompt.
- The reported results showed the default prompt performed better on average and produced guesses closer to the true locations.
- The article argues that prompt effectiveness should be measured with benchmarks because anecdotal testing can overstate the value of prompt engineering.
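For readers curious what "closer to the true locations" could look like in practice, here is a minimal sketch of a two-prompt comparison. It assumes each guess is scored by great-circle (haversine) distance to the true coordinates, which may not be the metric the author used. The function names (`haversine_km`, `summarize`, `paired_wins`) and the two sample coordinates are hypothetical, made up purely for illustration, and are not taken from the benchmark.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def errors_km(truths, guesses):
    """Per-image error: distance from each guess to the true location."""
    return [haversine_km(t[0], t[1], g[0], g[1]) for t, g in zip(truths, guesses)]


def summarize(errors):
    """Mean and median error; the median is less swayed by a few wild misses."""
    return {"mean_km": sum(errors) / len(errors), "median_km": median(errors)}


def paired_wins(errors_a, errors_b):
    """Count images where prompt A's guess was strictly closer than prompt B's."""
    return sum(1 for a, b in zip(errors_a, errors_b) if a < b)


if __name__ == "__main__":
    # Illustrative placeholder data only, NOT results from the article.
    truths = [(48.8584, 2.2945), (51.5007, -0.1246)]
    basic_guesses = [(48.86, 2.29), (51.51, -0.13)]
    long_guesses = [(48.80, 2.40), (51.50, -0.12)]

    basic = errors_km(truths, basic_guesses)
    long_ = errors_km(truths, long_guesses)
    print("basic prompt:", summarize(basic))
    print("long prompt: ", summarize(long_))
    print("images where basic was closer:", paired_wins(basic, long_))
```

Because both prompts are run on the same images, comparing errors image by image (as `paired_wins` does) is a fairer check than eyeballing a handful of impressive examples, which is the kind of anecdotal testing the article warns about.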