February 27, 2026
Logs, Lies, and LLMs
We gave terabytes of CI logs to an LLM
Half the crowd is cheering, half is yelling “hallucination,” and a “bot” accusation steals the show
TLDR: An AI agent reportedly solved flaky build failures by querying billions of log lines in seconds using fast SQL. Commenters are split between excitement over real wins and fears of AI “hallucinations” in noisy logs, with extra drama over a “bot” accusation and calls for smarter, step-by-step methods.
CSI: Build Logs, but make it AI. A dev toolmaker says their robot assistant scanned months of company build logs—1.5 billion lines a week—wrote its own database questions, and found a flaky test’s culprit in seconds. It’s fueled by ClickHouse (a super-fast data store) and lets the bot ask open-ended questions in SQL instead of rigid, prebuilt commands. They claim it can follow breadcrumbs across jobs and log lines, sometimes reading hundreds of millions of rows to pin down when errors first appeared.
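The “follow breadcrumbs, find when the error first appeared” move is really just an aggregate query over timestamped log lines. Here is a minimal sketch of that question, using Python’s stdlib sqlite3 as a stand-in for ClickHouse; the `ci_log_lines` table, column names, and rows are all hypothetical, not the vendor’s actual schema.

```python
import sqlite3

# Hypothetical stand-in for a ClickHouse log table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ci_log_lines (job_id TEXT, ts TEXT, line TEXT)")
rows = [
    ("job-101", "2026-02-20T10:00:00", "test_login passed"),
    ("job-102", "2026-02-21T09:30:00", "ERROR: connection reset by peer"),
    ("job-103", "2026-02-22T11:15:00", "ERROR: connection reset by peer"),
    ("job-104", "2026-02-22T12:00:00", "test_login passed"),
]
conn.executemany("INSERT INTO ci_log_lines VALUES (?, ?, ?)", rows)

# "When did this error first show up, and in which job?" -- the kind of
# open-ended SQL the agent is described as writing for itself.
first_seen = conn.execute(
    """
    SELECT job_id, MIN(ts)
    FROM ci_log_lines
    WHERE line LIKE '%connection reset%'
    """
).fetchone()
print(first_seen)  # ('job-102', '2026-02-21T09:30:00')
```

The point of the open-ended SQL interface is exactly this: the agent can invent a filter like `LIKE '%connection reset%'` on the fly instead of being limited to prebuilt commands.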
The community? Absolutely split. One camp is hyped, calling this the kind of post you screenshot and sneak into your next team sprint. Another camp slams the brakes: LLMs (AI text models) hallucinate, warns one skeptic, arguing that real-world logs are chaos and that cause-and-effect often lives far outside what an AI can see. A third voice jumps in with a nerdy plot twist: “Use Recursive Language Models,” basically teaching the bot to code and reason step by step instead of cramming everything into its memory.
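The “recursive” idea the third commenter is pointing at is simple to sketch: rather than cramming a whole log into one context window, split it, ask “is this chunk relevant?”, and only descend into the branches that pass. This toy Python version uses a keyword check where a real system would make a model call; `looks_relevant`, the chunk size, and the sample log are all invented for illustration.

```python
def looks_relevant(chunk):
    # Placeholder for an LLM relevance judgment; here, a naive keyword check.
    return any("ERROR" in line for line in chunk)

def investigate(lines, chunk_size=2):
    # Recursively halve the log; once a chunk is small enough to "read",
    # keep only the lines flagged as relevant.
    if len(lines) <= chunk_size:
        return [line for line in lines if looks_relevant([line])]
    mid = len(lines) // 2
    return investigate(lines[:mid], chunk_size) + investigate(lines[mid:], chunk_size)

log = [
    "setup ok",
    "fetch deps ok",
    "ERROR: flaky_test timed out",
    "teardown ok",
]
print(investigate(log))  # ['ERROR: flaky_test timed out']
```

The appeal is that memory use stays bounded by the chunk size, not the log size; the cost is many more (cheaper) model calls.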
And then the side drama: someone calls another commenter “pretty obviously a bot,” which only stokes the fire. Meanwhile, battle-scarred engineers nod along about noise, false alarms, and errors jumping between containers like soap opera characters. The vibe: AI sleuth meets messy reality, with equal parts wow, whoa, and “are we sure this isn’t making stuff up?”
Key Points
- An LLM-driven agent analyzes CI logs via a SQL interface to diagnose failures quickly.
- The system ingests ~1.5B log lines and ~700K jobs weekly, stored in ClickHouse with ~35:1 compression.
- Agent investigations use job metadata 63% of the time and raw log lines 37%, starting broad then drilling down.
- Across 8,534 sessions, median total rows scanned per question is 335K (P75: 5.2M; P95: 940M); heavy raw-log cases reach 4.3B rows.
- Each log line stores 48 columns of context; denormalization performs well in ClickHouse’s columnar format, yielding millisecond queries.
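The last two points are connected: denormalization repeats the same job metadata on every log line, and columnar storage compresses that repetition extremely well, which is how ~35:1 ratios become plausible. A toy illustration with Python’s stdlib zlib (the column contents and row count are invented, and zlib is only a rough proxy for ClickHouse’s codecs):

```python
import zlib

# A denormalized metadata column: the same job context repeated on every line.
column = ("repo=acme/widgets;branch=main;runner=linux-x64\n" * 10_000).encode()

compressed = zlib.compress(column)
ratio = len(column) / len(compressed)
print(f"compression ratio ~{ratio:.0f}:1")
```

Highly repetitive columns like these compress far better than 35:1 in isolation; the real-world figure is lower because actual log text is noisier than this toy column.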