Qwen3.7-Plus is Alibaba’s entry into the multimodal-agent race. Qwen3.7-Max stayed a pure-text flagship. Plus instead accepts images and video as input (it understands them but does not generate them). It wires that perception into an agent loop that plans, calls tools, writes and tests its own code, and then iterates on real execution feedback.
## What it’s built for
This is a GUI agent. It reads a screen and acts on it the way a person would: clicking, navigating apps, and running multi-step tasks. On benchmarks aimed at this kind of agentic work, it scores 79.0 on ScreenSpot Pro, which tests locating on-screen GUI elements, and 70.3 on Terminal-Bench, which tests agents on command-line tasks. Those scores put it near the front of the API-available agent pack. It also ranks #16 on Vision Arena, placing Alibaba fifth among vision labs globally.
## Where to get it
It’s live now via API on Alibaba’s Bailian platform (Model Studio). The most interesting part under the hood is Agentic RL. Rather than relying only on training over static datasets, the model learns from real-world execution feedback. This is meant to sharpen its accuracy on the tasks people actually give it.