Alibaba released Qwen3.7-Max on May 19, unveiling it at the 2026 Alibaba Cloud Summit. It's a reasoning model engineered for long, multi-stage agentic projects rather than short chat, and the headline demo backs that up: it ran for 35 hours uninterrupted, called over 1,000 different tools, and wrote an optimized compute kernel that ran 10x faster than the manufacturer's official code.
## The benchmarks
On GPQA Diamond it scores 92.4, edging Claude Opus 4.6's 91.3. On HLE it scores 41.4 versus Opus 4.6's 40.0. It ranks #3 of 117 on coding benchmarks (average 92.7) and #2 overall on BenchLM's provisional leaderboard, and it has a 1M-token context window. This is the first Chinese model credibly trading blows with frontier Western models on the hardest agentic and reasoning tasks.
## The open-source split
The Plus variant will be open source; the Max flagship will not. Alibaba continues its shift toward monetizing its best model while giving developers the tier below — open enough to build a community, closed enough to capture the premium.
## Why it matters
The “35-hour autonomous run” framing is the real signal. The frontier labs are all converging on long-horizon agentic competence as the next battleground, and Alibaba just planted a flag with a concrete, verifiable demo rather than a benchmark table.
