There’s a new model out from [Standard Intelligence](https://si.inc/posts/fdm1/) called FDM-1, and it’s doing something genuinely different in the computer-use agent space. Instead of training on labeled screenshots — the way pretty much everyone else does it — they threw 11 million hours of raw screen recording video at the problem and let the model figure out how humans actually interact with software.
The trick is an inverse dynamics model that watches video frames and reconstructs what the user must have done between them — keystrokes, mouse movements, scrolls, drags. No human annotation needed. They trained that IDM on about 40,000 hours of contractor-labeled data, then turned it loose on the full 11-million-hour corpus to auto-generate action labels at scale. It’s a clever workaround for what has always been the biggest bottleneck in this field: you can’t hire enough people to manually label every click in millions of hours of footage.
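The shape of that pipeline is worth making concrete: train a small labeler on expensive human annotations, then use it to pseudo-label a vastly larger raw corpus. Here's a minimal toy sketch of the idea — every name and the frequency-count "model" are my illustrative stand-ins, not anything from Standard Intelligence's actual system:

```python
# Toy sketch of an inverse-dynamics pseudo-labeling pipeline.
# The real IDM is a neural net over video frames; here "frames" are
# strings and the "model" is just a frequency lookup, to show the flow.
from collections import Counter, defaultdict

def train_idm(labeled_pairs):
    """Fit a stand-in IDM: for each (frame_t, frame_t+1) pair,
    remember the action humans most often labeled between them."""
    counts = defaultdict(Counter)
    for frame_a, frame_b, action in labeled_pairs:
        counts[(frame_a, frame_b)][action] += 1
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}

def pseudo_label(idm, unlabeled_pairs):
    """Auto-generate action labels for raw, unannotated frame pairs."""
    return [(a, b, idm.get((a, b), "noop")) for a, b in unlabeled_pairs]

# Small, expensive human-labeled set (the ~40k-hour analogue)...
labeled = [
    ("menu", "file_open", "click:File"),
    ("menu", "file_open", "click:File"),
    ("doc", "doc_scrolled", "scroll:down"),
]
idm = train_idm(labeled)

# ...turned loose on the big raw corpus (the 11M-hour analogue).
labels = pseudo_label(idm, [("menu", "file_open"), ("doc", "doc_scrolled")])
```

The payoff is that human labeling cost scales with the small seed set, while the training corpus scales with however much raw video you can collect.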
What caught my attention is the context window. FDM-1 compresses nearly two hours of 30 FPS video into about 1 million tokens — roughly 50x more efficient than previous approaches. That means the model can follow along with extended workflows, not just isolated clicks. They showed it extruding a gear in Blender, navigating a car through San Francisco streets (with less than an hour of driving-specific finetuning), and fuzzing web apps to find bugs. The driving demo in particular is wild for a model trained primarily on screen recordings.
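The compression claim is easy to sanity-check with back-of-the-envelope arithmetic (mine, not a figure from the announcement): at 30 FPS, two hours is 216,000 frames, so a 1-million-token budget leaves under five tokens per frame on average.

```python
# Back-of-the-envelope token budget for FDM-1's stated context window.
fps = 30
hours = 2
frames = fps * 3600 * hours      # 216,000 frames in two hours of video
token_budget = 1_000_000         # ~1M tokens per the announcement
tokens_per_frame = token_budget / frames
print(frames, round(tokens_per_frame, 2))  # 216000 4.63
```

For comparison, typical vision-language models spend hundreds of tokens per image, which is roughly where the "50x more efficient" framing comes from.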
The whole thing [blew up on Hacker News](https://news.ycombinator.com/item?id=47125014) on Feb 26, pulling in over 200 points, and [Metaverse Post](https://mpost.io/standard-intelligence-launches-fdm-1-ai-system-capable-of-learning-complex-computer-tasks-from-video/) covered it the same day. The [Standard Intelligence GitHub org](https://github.com/Standard-Intelligence) exists but is still pretty sparse on public repos — so this is very much an announcement, not an open-source drop.
The real thesis here is a shift from data-constrained to compute-constrained. Once you solve the labeling problem with IDM, the ceiling becomes how much compute you can throw at training. The team says performance keeps scaling up, which, if it holds, means FDM-1 is the small version of something much bigger coming. Worth keeping an eye on this one.
