March 27, 2026
Ship fast, roast faster
Improving Composer through real-time RL
AI learns from your keystrokes every 5 hours — commenters want proof, Kimi fans want credit
TLDR: Composer now trains on real usage and can ship new versions as often as every five hours. Commenters want hard data, debate whether this creates a lasting edge, and call out the missing credit to Kimi and Fireworks, while skeptics warn about regressions and the risks of “learning on the fly.” High stakes, big claims.
Cursor says its code assistant, Composer, is now “learning on the job,” training on real user activity and shipping new versions as often as every five hours. The company claims this real-time reinforcement learning (think: the AI gets feedback from how you use it) closes the gap between lab tests and real life, with evals like CursorBench guarding against regressions.
But the comment section? Spicy. The top vibe is: show us the receipts. One user begged, “I’d love to see some data,” while another wondered if the generous free usage is really a data play to build a big moat. Cue the business drama: is this a clever flywheel or just an expensive sprint that rivals can copy?
Then came the credit controversy. A user points out the model sits on a Kimi base with a Fireworks AI license, grumbling that the new post forgot to say the quiet part out loud. That lit up a “who built what” thread, with folks noting Fireworks already offers RL tuning for Kimi. Meanwhile, veterans dropped perspective: we used to call this “implicit user feedback,” but doing it on giant AI models is a different beast. Skeptics warn about “catastrophic forgetting,” cheering the ambition but demanding months of proof. The memes write themselves: “Real-time RL? More like real-time PR” and “Composer cramming every five hours” — but hey, if it works, it’s a power move.
Key Points
- Cursor applies a “real-time RL” approach that uses live inference tokens and user interactions as reward signals to train its coding model Composer.
- The method was previously used to train Tab and is now extended to Composer for continuous production improvement.
- A full training cycle, from data collection to deployment, takes about five hours, enabling multiple checkpoint updates per day.
- Updated checkpoints are validated against evaluation suites, including CursorBench, to prevent regressions before deployment.
- This rapid, on-policy training reduces train-test mismatch, particularly by capturing real user behavior that simulations struggle to model.
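
For readers who want a concrete picture of the "validate before deploy" step in the key points, here is a minimal sketch of a regression gate. It is an illustration only: the source does not describe Cursor's actual pipeline, so the `EvalResult` type, the `passes_regression_gate` function, the `tolerance` value, and the "higher score is better" convention are all assumptions. The helper at the end just does the arithmetic behind "multiple checkpoint updates per day."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One evaluation score for a checkpoint (hypothetical structure)."""
    suite: str    # e.g. "CursorBench"; any other suite name is an assumption
    score: float  # assumed convention: higher is better


def passes_regression_gate(
    baseline: list[EvalResult],
    candidate: list[EvalResult],
    tolerance: float = 0.01,  # assumed allowable drop, not a value from the source
) -> bool:
    """Return True only if the candidate stays within `tolerance` of the
    baseline score on every suite the baseline was evaluated on."""
    candidate_scores = {r.suite: r.score for r in candidate}
    for base in baseline:
        new_score = candidate_scores.get(base.suite)
        if new_score is None:
            return False  # a missing suite is treated as a failed check
        if new_score < base.score - tolerance:
            return False  # regression beyond the allowed tolerance
    return True


def full_cycles_per_day(cycle_hours: float = 5.0) -> int:
    """Number of complete training cycles that fit in 24 hours."""
    return int(24 // cycle_hours)
```

With a five-hour cycle, `full_cycles_per_day()` gives four complete cycles in a day. In a setup like this, a checkpoint that fails the gate would simply not ship, and the previous checkpoint would stay in production.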