March 8, 2026
CI now stands for Community Infighting
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via CI
New SWE-CI test crowns Claude, sparks fairness fight and “not-so-long-term” backlash
TLDR: SWE-CI tests AI coders on long-haul code upkeep using real commit histories, with Claude leading the posted scores. Comments cheer the real-world focus but argue over small change sizes and outdated GPT comparisons, turning the launch into a fairness fight over what “maintainable” actually means.
SWE-CI just dropped a new way to test AI coders: not a one-and-done bug fix, but a marathon of keeping a real codebase healthy through the Continuous Integration loop—aka CI, where every change gets checked automatically. It packs 100 real projects, averaging 233 days and 71 commits, and asks AI agents to grind through many rounds of analysis and fixes. The crowd loved the premise—finally, a test about long-term maintainability, not just quick correctness. But the comments quickly turned into a leaderboard brawl: Claude from Anthropic blows past rivals, while OpenAI’s listed GPT entry looks… old.
One early reviewer cheered “significant improvements” yet flagged “really bad regression rates,” stirring fear that long projects still fall apart. Another doubted the “long term” part, noting the average change size is just ~500 lines—cue memes about counting steps instead of miles. The fairness police rolled in: “Why pit Opus 4.6 against GPT‑5.2?” demanded one user, while another posted the spicy scoreboard—“Claude wins by a large margin” (0.71 vs 0.23). Defenders say newer GPTs hide behind a proprietary tool, making apples-to-apples testing hard. Meanwhile, optimists imagine training on GitHub issues to teach AIs real upkeep, and jokers rebrand CI as “Chaos Intensified.” Verdict? A bold benchmark, a messy scoreboard, and a community asking whether we’re testing maintenance—or just starting another model wars season.
Key Points
- SWE-CI is introduced as a repository-level benchmark focused on long-term maintainability in software development.
- It is built around the Continuous Integration loop to reflect real-world development processes.
- The benchmark includes 100 tasks based on real repositories, averaging 233 days and 71 consecutive commits per task.
- Agents must perform dozens of iterative analysis and coding rounds to resolve tasks.
- SWE-CI aims to shift evaluation from short-term functional correctness to sustained code quality over long-term evolution.
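The iterate-until-CI-passes loop described above can be sketched in a few lines. This is a toy illustration, not SWE-CI's actual harness: every name here (`Repo`, `Agent`, `run_ci`, `evaluate`) is hypothetical, and the "agent" simply fixes one failing check per round.

```python
# Hypothetical sketch of a CI-driven evaluation loop: an agent repeatedly
# proposes changes and a CI gate checks each one, as the benchmark describes.
# All names are illustrative; none come from the real SWE-CI API.

from dataclasses import dataclass


@dataclass
class Repo:
    """Toy stand-in for a checked-out repository state."""
    bugs: int  # number of failing CI checks remaining


def run_ci(repo: Repo) -> bool:
    """CI gate: 'green' only when no checks fail."""
    return repo.bugs == 0


@dataclass
class Agent:
    """Toy agent that resolves one failing check per round."""
    rounds_used: int = 0

    def propose_fix(self, repo: Repo) -> None:
        repo.bugs = max(0, repo.bugs - 1)
        self.rounds_used += 1


def evaluate(repo: Repo, agent: Agent, max_rounds: int = 50) -> bool:
    """Iterate analyze-fix-verify until CI passes or the budget runs out."""
    for _ in range(max_rounds):
        if run_ci(repo):
            return True
        agent.propose_fix(repo)
    return run_ci(repo)


if __name__ == "__main__":
    repo, agent = Repo(bugs=3), Agent()
    print(evaluate(repo, agent), agent.rounds_used)  # True 3
```

The round budget (`max_rounds`) is what separates this style of test from a one-shot bug fix: the agent is scored on whether it can keep the repository green across many iterations, not on a single patch.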