Most web agents being built right now, such as browser-use, UI-TARS, and OpenAI’s computer use, point a multimodal model at screenshots. Alibaba’s new open-source Page-Agent went the opposite way: no screenshots, no vision model, no browser extension, no headless browser. It’s plain JavaScript that lives inside your page, reads the live DOM as text, and clicks and types as the actual logged-in user. It picked up 700+ GitHub stars in a single day.
What it actually is
The trick is called DOM dehydration. A real page has thousands of nodes, and dumping raw HTML into an LLM is slow and expensive. Instead Page-Agent scans the DOM, pulls out every interactive element — buttons, links, inputs — tags each with an index, role, and label, and flattens it into a clean text map. A small, cheap text model can then act on it precisely. No pixels involved.
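To make the idea concrete, here is a minimal sketch of what a dehydration pass could look like. It is not Page-Agent's actual implementation: the names `ElementLike`, `ROLE_BY_TAG`, `collectInteractive`, and `dehydrate` are hypothetical, the tag list is deliberately short, and the `[index] role "label"` output format is an assumption chosen for illustration.

```typescript
// Hypothetical sketch of "DOM dehydration"; not Page-Agent's real code.
// ElementLike is the small slice of the DOM Element interface used here,
// so the same function can run on a real Element or on a plain object.
interface ElementLike {
  tagName: string;
  textContent: string | null;
  children: ArrayLike<ElementLike>;
  getAttribute(name: string): string | null;
}

interface InteractiveEntry {
  index: number;
  role: string;
  label: string;
}

// Assumption: only these tags count as interactive in this sketch.
const ROLE_BY_TAG = new Map<string, string>([
  ["BUTTON", "button"],
  ["A", "link"],
  ["INPUT", "input"],
  ["SELECT", "select"],
  ["TEXTAREA", "textarea"],
]);

// Walk the tree depth-first and record every interactive element in document order.
function collectInteractive(
  node: ElementLike,
  found: InteractiveEntry[] = []
): InteractiveEntry[] {
  const role = ROLE_BY_TAG.get(node.tagName.toUpperCase());
  if (role !== undefined) {
    // Prefer an explicit accessible label, then a placeholder, then visible text.
    const rawLabel =
      node.getAttribute("aria-label") ??
      node.getAttribute("placeholder") ??
      node.textContent ??
      "";
    found.push({
      index: found.length,
      role,
      label: rawLabel.replace(/\s+/g, " ").trim(),
    });
  }
  for (let i = 0; i < node.children.length; i++) {
    collectInteractive(node.children[i], found);
  }
  return found;
}

// Flatten the entries into one line per element, e.g. `[0] button "Save"`.
function dehydrate(root: ElementLike): string {
  return collectInteractive(root)
    .map((e) => `[${e.index}] ${e.role} "${e.label}"`)
    .join("\n");
}

// In a browser this could be called as: dehydrate(document.body)
```

The point of the index is that a text model can answer with something as short as "click element 2", and the script only needs to keep a lookup from index back to the live element. A production version would also have to handle visibility, disabled state, and custom widgets, none of which this sketch attempts.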
Why the SDK matters
It’s an MIT-licensed, TypeScript-first library. Point model and baseURL at any OpenAI-compatible backend — DashScope/Qwen, GPT, Claude, or a local Ollama — and a few lines drop an AI copilot into your SaaS. The pitch: turn a 20-click ERP/CRM/admin workflow into one sentence, with no backend rewrite. That’s the part enterprise teams will care about.
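As a rough illustration of why "OpenAI-compatible" makes the backend swappable, the sketch below builds the same chat-completions request for two different servers by changing only `model` and `baseURL`. This is not Page-Agent's own API: `LlmConfig`, `ChatRequest`, and `buildChatRequest` are hypothetical names, the `apiKey` field and the system prompt wording are assumptions, and the model names are placeholders that depend on what the backend actually serves.

```typescript
// Hypothetical config shape; the article only names `model` and `baseURL`.
interface LlmConfig {
  model: string;
  baseURL: string;
  apiKey?: string; // assumption: a local server may not require a key
}

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build a request for the OpenAI-style `/chat/completions` endpoint.
// `pageMap` is the flattened text map of the page; `instruction` is the user's sentence.
function buildChatRequest(
  config: LlmConfig,
  pageMap: string,
  instruction: string
): ChatRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (config.apiKey) {
    headers["Authorization"] = `Bearer ${config.apiKey}`;
  }
  return {
    // Strip trailing slashes so both ".../v1" and ".../v1/" work.
    url: `${config.baseURL.replace(/\/+$/, "")}/chat/completions`,
    headers,
    body: JSON.stringify({
      model: config.model,
      messages: [
        { role: "system", content: "Interactive elements on the page:\n" + pageMap },
        { role: "user", content: instruction },
      ],
    }),
  };
}

// Placeholder model names; swap in whatever the backend actually exposes.
const localConfig: LlmConfig = {
  model: "qwen2.5",
  baseURL: "http://localhost:11434/v1/", // Ollama's OpenAI-compatible endpoint
};
const hostedConfig: LlmConfig = {
  model: "qwen-plus",
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1", // DashScope compatible mode
  apiKey: "YOUR_API_KEY",
};
```

Passing the resulting `url`, `headers`, and `body` to `fetch` with `method: "POST"` would send the request. Nothing else in the calling code has to change between the local and hosted setups, which is the practical meaning of the SDK's backend-agnostic pitch.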