The web agent race has a new open-source contender, and the benchmarks are hard to ignore.
On March 24, the Allen Institute for AI (AI2) released MolmoWeb, a fully open-source visual web agent that navigates browsers by looking at screenshots — the same way a human would. The kicker: its 8B-parameter model outperforms agents built on GPT-4o across multiple web navigation benchmarks, despite being a fraction of the size.
In a space dominated by closed-source offerings from OpenAI, Google, and Anthropic, MolmoWeb arrives with everything open: model weights, training data, code, and the largest public dataset of human web task execution ever released. That combination is what’s driving the buzz across VentureBeat, GeekWire, SiliconANGLE, and developer communities this week.
How MolmoWeb Actually Works
Most web automation tools parse HTML, inspect the DOM, or rely on accessibility trees to understand what’s on a page. MolmoWeb takes a fundamentally different approach: it looks at screenshots.
Built on AI2’s Molmo 2 multimodal model family, MolmoWeb processes raw webpage screenshots and decides what to do next — click a button, type into a field, scroll down, navigate to a new page. It chains these actions together to complete multi-step tasks like searching for products, filling out forms, or extracting specific information from websites.
The architecture pairs Qwen3-8B as the language backbone with SigLIP 2 as the vision encoder. This vision-first design means MolmoWeb doesn’t need access to page source code or structured data. If a human can see it on screen, MolmoWeb can interact with it.
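Conceptually, that screenshot-in, action-out loop is simple. The sketch below is illustrative only: the model call is stubbed, and every name (`Action`, `predict_action`, `run_agent`) is hypothetical rather than part of AI2's actual API.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A single browser action predicted from a screenshot."""
    kind: str          # "click", "type", "scroll", or "done"
    x: int = 0         # screen coordinates for click targets
    y: int = 0
    text: str = ""     # text to type, or the final answer for "done"

def predict_action(screenshot: bytes, task: str, history: list[Action]) -> Action:
    """Stub for the vision-language model. A real agent would send the raw
    screenshot, the task, and the action history to the model and parse its
    output into an Action. Here we finish immediately, for illustration."""
    return Action(kind="done", text="stub answer")

def run_agent(task: str, take_screenshot, execute, max_steps: int = 20) -> str:
    """Screenshot-in, action-out loop: look at the page, pick one action,
    apply it, and repeat until the model declares the task done."""
    history: list[Action] = []
    for _ in range(max_steps):
        action = predict_action(take_screenshot(), task, history)
        if action.kind == "done":
            return action.text
        execute(action)  # click/type/scroll in the real browser
        history.append(action)
    return ""
```

The key point is what the loop *doesn't* take: no DOM, no accessibility tree, no page source — only pixels and a task description.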
Two model sizes are available: a 4B version for lighter workloads and an 8B version that delivers the strongest performance. Both are small enough to run locally, which matters for developers who want to build web automation pipelines without sending every screenshot to a cloud API.
The Numbers: Where MolmoWeb Stands
Here’s what makes MolmoWeb notable — the raw benchmark results:
- WebVoyager: 78.2% (pass@1), jumping to 94.7% with test-time compute scaling (pass@4)
- DeepShop: 42.3%
- WebTailBench: 49.5%
- Online-Mind2Web: 35.3% (pass@1), rising to 60.5% (pass@4)
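The pass@1 vs. pass@4 distinction measures success on the first attempt versus success at least once across four attempts. Given n recorded attempts per task with c successes, the standard unbiased pass@k estimator (popularized by OpenAI's HumanEval evaluation) is a few lines of Python:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts succeeds,
    estimated without bias from n attempts of which c were successful."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

# A task solved on 1 of 4 attempts scores 0.25 at pass@1 but 1.0 at pass@4,
# which is how a benchmark average can jump the way WebVoyager's does.
print(pass_at_k(4, 1, 1), pass_at_k(4, 1, 4))  # prints: 0.25 1.0
```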
The 8B model outperforms every other publicly available open-weight web agent at comparable scale, including Fara-7B, across all four benchmarks. On visual grounding specifically (ScreenSpot and ScreenSpot v2), a dedicated 8B grounding model trained on MolmoWeb’s data beats not just open-weight competitors but also larger proprietary systems including Claude 3.7 and OpenAI’s CUA.
The GPT-4o comparison deserves context. MolmoWeb outperforms GPT-4o-based agents that rely on annotated screenshots and structured page data — agents with substantially richer input representations and orders of magnitude more parameters. That’s a meaningful result for an 8B model running on consumer hardware.
But let’s be clear about the limits: current frontier agents from Anthropic, Google, and OpenAI still outperform MolmoWeb on most complex tasks. The gap is narrowing, but it hasn’t closed.
MolmoWebMix: The Dataset That Makes This Possible
The model is only half the story. AI2 also released MolmoWebMix, the training dataset behind MolmoWeb, and it’s arguably the more important contribution for the broader research community.
The numbers are staggering:
- 36,000 human task trajectories — crowdworkers performing real browsing tasks using a custom Chrome extension that captured actions and screenshots
- 623,000+ individual subtask demonstrations across 1,100+ websites
- 108,000 synthetically generated trajectories from automated agents operating on accessibility trees
- 362,000 screenshot QA pairs for GUI perception training, drawn from nearly 400 websites
- 2.2 million question-answer pairs for screenshot understanding
This is the largest publicly available dataset of human web task execution to date. Previous open-source web agents often relied on distilling behavior from proprietary vision models like GPT-4V or Claude. MolmoWeb deliberately avoided that path — its training data comes from human demonstrations and synthetic trajectories generated by text-only accessibility-tree agents, not from imitating closed-source competitors.
That independence matters. It means the entire pipeline — from data collection tools to training recipes to model weights — can be freely used, modified, and built upon without legal ambiguity around distillation from proprietary systems.
Everything ships under Apache 2.0.
MolmoWeb vs. the Closed-Source Competition
The web agent market is heating up, and the competitive landscape looks very different depending on whether you value performance ceilings or accessibility.
OpenAI Operator runs on GPT-4o inside a secure virtual browser environment. It’s polished and powerful, but locked behind a $200/month ChatGPT Pro subscription and currently limited to the US. On browser-specific benchmarks, Operator has posted scores around 87% — significantly above MolmoWeb’s numbers. But you can’t run it locally, modify it, or integrate it into custom pipelines.
Anthropic’s Computer Use takes a broader approach, enabling Claude to control entire desktops, not just browsers. It’s available through Anthropic’s API and has shown strong results in general computer interaction. For pure web tasks, though, MolmoWeb’s visual grounding on ScreenSpot v2 actually surpasses Claude 3.7’s performance.
Google’s Project Mariner remains more experimental, with limited public access and fewer published benchmarks. Google has been more cautious about shipping agentic web capabilities to production.
Where MolmoWeb stands apart from all three:
| Feature | MolmoWeb | OpenAI Operator | Claude Computer Use | Google Mariner |
|---|---|---|---|---|
| Open source | Yes (Apache 2.0) | No | No | No |
| Run locally | Yes | No | No | No |
| Training data available | Yes (36K+ trajectories) | No | No | No |
| Price | Free | $200/mo | API pricing | Limited access |
| Scope | Web browser | Web browser | Full desktop | Web browser |
For researchers and developers building custom web automation, MolmoWeb is in a category of its own — not because it’s the most capable agent overall, but because it’s the only one where you can inspect, modify, and retrain every component of the stack.
Why This Matters Beyond Benchmarks
The real significance of MolmoWeb isn’t that an 8B model beats GPT-4o on some benchmarks. It’s the full-stack openness.
Previous open-source web agents hit a ceiling because they lacked quality training data. You could get the model weights, maybe some code, but the actual trajectories showing how to navigate real websites? Those stayed locked behind the companies that collected them.
AI2 just removed that bottleneck. With MolmoWebMix, any research team can now train their own web agent from scratch without depending on proprietary data or model distillation. The included data collection tools mean teams can also extend the dataset with their own trajectories.
The test-time compute scaling results are also worth watching. MolmoWeb’s WebVoyager score jumps from 78.2% to 94.7% when given four attempts per task. Online-Mind2Web goes from 35.3% to 60.5%. These gains suggest that even without a larger model, smarter inference strategies can dramatically close the gap with closed-source agents.
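The simplest form of this strategy is best-of-k retry: rerun the stochastic agent up to k times and keep the first result that passes some task-specific check. The sketch below assumes you can supply such a check — the `verify` callback is hypothetical and depends entirely on your workflow.

```python
from typing import Callable, Optional

def best_of_k(attempt: Callable[[], str],
              verify: Callable[[str], bool],
              k: int = 4) -> Optional[str]:
    """Rerun a stochastic agent up to k times; return the first output the
    verifier accepts, or None if all k attempts fail."""
    for _ in range(k):
        result = attempt()
        if verify(result):
            return result
    return None

# Toy demo: an "agent" that fails twice, then succeeds on its third try.
attempts = iter(["fail", "fail", "ok"])
print(best_of_k(lambda: next(attempts), lambda r: r == "ok", k=4))  # prints: ok
```

In practice the hard part is the verifier: for tasks with checkable outcomes (a confirmation page, an extracted value you can validate), retries are nearly free wins; for open-ended tasks, pass@4 gains are harder to realize outside a benchmark harness.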
For companies evaluating web automation, MolmoWeb presents an interesting option: a web agent you can deploy on your own infrastructure, fine-tune on your specific workflows, and run without per-query API costs. The trade-off is lower peak performance compared to frontier commercial offerings — a trade-off many teams will happily accept for the control and cost benefits.
FAQ
Is MolmoWeb free to use?
Yes. MolmoWeb is released under the Apache 2.0 license, which allows free commercial and research use. The model weights, training data (MolmoWebMix), and all code are available on Hugging Face and GitHub at no cost.
How does MolmoWeb compare to OpenAI Operator and Claude Computer Use?
MolmoWeb outperforms GPT-4o-based agents on several web navigation benchmarks and beats Claude 3.7 on visual grounding tasks. However, current frontier agents from OpenAI, Anthropic, and Google still lead on the most complex web tasks. MolmoWeb’s key advantage is being fully open-source and runnable locally, while Operator costs $200/month and Claude Computer Use requires API access.
Can I run MolmoWeb on my own hardware?
Yes. Both the 4B and 8B models are designed to run locally. The 4B model is suitable for consumer GPUs, while the 8B model requires more memory but remains feasible on modern hardware without cloud infrastructure.
What websites does MolmoWeb work with?
MolmoWeb was trained on data spanning over 1,100 websites, covering common tasks like search, navigation, form filling, and e-commerce. Since it works by interpreting screenshots rather than parsing page-specific code, it generalizes to websites not included in its training data, though performance may vary.
What is MolmoWebMix?
MolmoWebMix is the training dataset released alongside MolmoWeb. It contains 36,000 human task trajectories, 108,000 synthetic trajectories, and over 2.2 million screenshot QA pairs — making it the largest publicly available dataset for training web agents. It’s available on Hugging Face under Apache 2.0.
