June 2, 2026
Picture this: comment chaos
How we index images for RAG
AI found a cheaper way to read screenshots, and the comments instantly made it a fight
TLDR: Kapa says it can make AI helpers use screenshots and diagrams more cheaply by describing each image once instead of re-checking it every time. Commenters were split between “smart and useful” and “isn’t this just common sense?”, with one person nearly derailed by rage at the cookie popup.
A startup called Kapa says it has cracked a very annoying problem: how to make chatbots understand the screenshots, diagrams, and tables buried inside tech manuals without paying to re-read every image every time someone asks a question. Their trick, in plain English, is surprisingly simple: look at each image once, write down what’s in it, save that text, and reuse it whenever a question comes in. That means better answers for users trying to find the right button or setting, without the giant cost of sending piles of pictures into an AI on every search.
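For the curious, the "describe once, reuse forever" idea can be sketched in a few lines of Python. This is a toy illustration, not Kapa's actual code: the class name, the `describe` callable (a stand-in for whatever vision model does the describing), and the naive keyword search are all assumptions made up for this example.

```python
import hashlib
from typing import Callable, Dict, List


class ImageDescriptionIndex:
    """Stores one text description per unique image, keyed by content hash."""

    def __init__(self, describe: Callable[[bytes], str]) -> None:
        # `describe` is a hypothetical stand-in for a vision-model call.
        self._describe = describe
        self._descriptions: Dict[str, str] = {}
        self.model_calls = 0  # counts how often the (costly) model ran

    def add_image(self, image_bytes: bytes) -> str:
        """Describe the image at indexing time, unless it was already seen."""
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._descriptions:
            self._descriptions[key] = self._describe(image_bytes)
            self.model_calls += 1
        return key

    def search(self, query: str) -> List[str]:
        """Query time touches only stored text, never the images."""
        terms = query.lower().split()
        return [
            key
            for key, text in self._descriptions.items()
            if any(term in text.lower() for term in terms)
        ]


def fake_describe(image_bytes: bytes) -> str:
    # Placeholder description so the sketch runs without any model.
    return f"screenshot of the settings page ({len(image_bytes)} bytes)"
```

Indexing the same image twice calls `describe` only once, and `search` never looks at image bytes at all, which is the whole cost argument in miniature. A real system would presumably swap the keyword match for proper retrieval over the stored text.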
But in the comments, the real show began. One camp basically shrugged and said, “uh, yes, obviously”. A popular reaction compared it to old-school “eager” media processing, the same idea websites have used for years to pre-make thumbnails and save time later. Another commenter had big “I’ve been doing this already” energy, saying they use the same approach in their personal note system. That sparked the classic tech-thread drama: is this clever product work, or just packaging common sense?
Then came the practical crowd, cheering that the post actually explained a real solution instead of hand-wavy AI magic. One commenter flat-out said, in effect, “great, now I actually know how I’d solve this.” And of course, because no internet discussion can remain peaceful for long, someone ignored the whole breakthrough and went straight for the site itself, declaring the cookie popup so annoying it made them want to flee. In other words: useful idea, mild swagger, one open-source plug, and a cookie-banner villain stealing a little bit of the spotlight.
Key Points
- Kapa processes images in technical documentation at indexing time by generating text descriptions with a low-cost vision model and storing those descriptions for retrieval.
- The company reports that this design increases per-query overhead by only 1% to 6% compared with text-only retrieval.
- Kapa reviewed thousands of customer questions and found that documentation images are either illustrative or load-bearing sources of information.
- Across three customer projects and two models, answers with image context were preferred by an LLM judge with statistical significance using McNemar's test at p < 0.05.
- Kapa says query-time multimodal processing is hard to scale because of higher per-query costs, payload-size limits, and the limitations of CLIP-style embeddings for fine-grained technical images.
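
For readers wondering what the McNemar's test in the key points involves, here is a minimal sketch of the exact (binomial) version. It assumes each question is answered twice, once with image context and once without, and that only the pairs where the judge's verdict differs are counted. The function name and the example counts are made up for illustration, and this recap does not say how Kapa set up its pairs or which variant of the test it ran.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: pairs where only the with-image answer was judged acceptable
    c: pairs where only the text-only answer was judged acceptable
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    k = min(b, c)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With hypothetical counts of 15 and 3, `mcnemar_exact_p(15, 3)` comes out below 0.05, while an even 5 and 5 split gives 1.0. Pairs where both answers get the same verdict do not enter the calculation.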