The First Fully General Computer Action Model

Internet-trained clicker stuns, skeptics demand receipts

TLDR: A new model, FDM-1, learns computer actions by watching 11 million hours of screen video and can track minutes-to-hours of context. Commenters are split between amazement at the demos and demands for proof, sparking debates on audio, mouse input details, and whether some actions need “future” clues to label.

Move over, Clippy. FDM-1 just rolled in with 11 million hours of internet screen time and a promise to be your keyboard-and-mouse coworker. The devs say their video model can watch nearly two hours of computer footage in one go and predict the next clicks and keystrokes. Think CAD designs, finance workflows, even driving demos, all powered by continuous screen video instead of static screenshots and trained with an "inverse dynamics" trick that infers which keys were pressed from what showed up onscreen. The context length is the wild flex: previous systems struggled to get past a few seconds, while this one aims for minutes to hours, like OpenAI’s VPT but super-sized.
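
For the curious, here is roughly what that "inverse dynamics" trick looks like in code. This is a minimal, hypothetical sketch, not anything released with FDM-1: every class name, shape, and hyperparameter below is an assumption. The idea is that a small model looks at a short window of frames on both sides of a timestep (which is exactly why the comment thread argues some actions, like copy-paste, may need "future" clues) and predicts which key or click produced the change onscreen; once trained on a modest labeled set, it can pseudo-label unlabeled recordings.

    # Minimal inverse-dynamics sketch (assumed architecture, not FDM-1's code):
    # a tiny model sees a window of frames around time t, including *future*
    # frames, and predicts the action taken at t.
    import torch
    import torch.nn as nn

    NUM_ACTIONS = 512   # hypothetical vocabulary of keys, clicks, and a "no-op"
    WINDOW = 8          # frames of context on each side of the labeled step


    class InverseDynamicsModel(nn.Module):
        """Predicts the action at the center of a short frame window."""

        def __init__(self, frame_channels=3, hidden=256):
            super().__init__()
            # Toy 3D conv stack over (time, height, width); a real system
            # would use a far larger video backbone.
            self.encoder = nn.Sequential(
                nn.Conv3d(frame_channels, 32, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
                nn.Flatten(),
            )
            self.head = nn.Sequential(
                nn.Linear(64, hidden), nn.ReLU(), nn.Linear(hidden, NUM_ACTIONS)
            )

        def forward(self, frames):
            # frames: (batch, channels, 2 * WINDOW + 1, height, width)
            return self.head(self.encoder(frames))


    if __name__ == "__main__":
        idm = InverseDynamicsModel()
        # Pretend batch of frame windows cut from unlabeled screen recordings.
        clips = torch.randn(4, 3, 2 * WINDOW + 1, 64, 64)
        pseudo_labels = idm(clips).argmax(dim=-1)  # actions inferred from video alone
        print(pseudo_labels.shape)                 # torch.Size([4])

The point the commenters are circling is visible here: the labeler is non-causal, so it gets to peek at frames after the action, something an agent acting in real time would not have.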

Cue the comment section fireworks. The biggest vibe split: wow vs. prove it. One camp is dazzled by the car demo and asks what’s next (audio support? mouse tokenization?). Another camp is side-eyeing the science, with calls for ablation studies and hard numbers after a “typos were common” claim. A team member, Neel, parachutes in with backstage energy (“holed up in South Park for a year”) and confirms the monster dataset, which only fuels the hype. Meanwhile, a brainy side-thread debates whether copy-paste actions can only be labeled by looking into the future, launching a mini philosophy seminar. The memes? Let’s just say “Clippy finally graduated” is making the rounds.

Key Points

  • FDM-1 is a video-based foundation model for computer use, trained with long context and aimed at complex workflows.
  • An inverse dynamics model (IDM) labels actions across the 11-million-hour screen-recording corpus, enabling large-scale training without contractors hand-annotating every clip.
  • The training pipeline: train an IDM on 40,000 hours of recordings, use it to label the 11-million-hour dataset, then autoregressively train a forward dynamics model on next-action prediction (see the sketch after this list).
  • A custom video encoder compresses nearly two hours of 30 FPS video into 1M tokens, claimed to be 50× more efficient than prior SOTA and 100× more efficient than OpenAI’s encoder (quick arithmetic after this list).
  • The approach addresses limitations of VLM-based agents (short context, low framerate, task-specific RL) and demonstrates capabilities in CAD, car driving, and website fuzzing.
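
To make the pipeline bullet concrete, here is a deliberately tiny, self-contained sketch with stand-in models and toy data; it is not FDM-1's training code, and every name and number in it is a placeholder. Only the three-stage shape (train an IDM on a small labeled set, pseudo-label the huge corpus with it, then train an action model autoregressively on next-action prediction) comes from the post.

    # Toy end-to-end pipeline sketch (assumed structure, not FDM-1's code).
    import random
    from collections import Counter, defaultdict


    class ToyIDM:
        """Stand-in inverse dynamics model: memorizes frame -> most common action."""

        def __init__(self):
            self.counts = defaultdict(Counter)

        def fit(self, labeled_clips):
            for frames, actions in labeled_clips:
                for frame, action in zip(frames, actions):
                    self.counts[frame][action] += 1

        def label(self, frames):
            return [
                self.counts[f].most_common(1)[0][0] if self.counts[f] else "noop"
                for f in frames
            ]


    class ToyActionModel:
        """Stand-in autoregressive model: bigram of (previous action -> next action)."""

        def __init__(self):
            self.transitions = defaultdict(Counter)

        def fit(self, pseudo_labeled_clips):
            for _frames, actions in pseudo_labeled_clips:
                for prev, nxt in zip(actions, actions[1:]):
                    self.transitions[prev][nxt] += 1   # next-action prediction

        def predict_next(self, last_action):
            options = self.transitions[last_action]
            return options.most_common(1)[0][0] if options else "noop"


    # Stage 1: train the IDM on the small annotated corpus (the post cites ~40,000 hours).
    labeled = [(["f0", "f1", "f2"], ["click", "type", "noop"]) for _ in range(100)]
    idm = ToyIDM()
    idm.fit(labeled)

    # Stage 2: pseudo-label the much larger unlabeled corpus (~11 million hours).
    unlabeled = [[random.choice(["f0", "f1", "f2"]) for _ in range(5)] for _ in range(1000)]
    pseudo_labeled = [(frames, idm.label(frames)) for frames in unlabeled]

    # Stage 3: train the forward/action model autoregressively on next-action prediction.
    model = ToyActionModel()
    model.fit(pseudo_labeled)
    print(model.predict_next("click"))

As for the encoder bullet, the arithmetic is simple: taking "nearly two hours" as two full hours, 2 × 3,600 s × 30 FPS ≈ 216,000 frames squeezed into roughly 1M tokens works out to about 4 to 5 tokens per frame.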

Hottest takes

"hand-wavy without numbers" — rio_popper
"The car thing is very impressive" — ennucore
"the past does encode the future" — clemvonstengel
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.