Alibaba dropped Qwen 3.6-Plus on April 2nd, and the numbers are hard to ignore. On SWE-bench Verified — the benchmark that actually matters for coding — it scores 78.8%. Claude Opus 4.6 scores 80.9%. That’s a 2.1-point gap. On Terminal-Bench 2.0, Qwen 3.6-Plus flips the script entirely: 61.6% vs Claude’s 59.3%. And the pricing? Input tokens cost $0.29 per million on Alibaba’s Bailian platform. Claude Opus charges $5. That’s not a rounding error. That’s a 17x difference.
This is the same Alibaba that shipped Qwen 3.5 just two months ago. The pace is aggressive, and the gap to the best Western models is shrinking fast.
What the Benchmarks Actually Say
The headline numbers tell one story. The details tell a more nuanced one.
Qwen 3.6-Plus leads on Terminal-Bench 2.0 (61.6 vs 59.3), which tests terminal-based execution, multi-step planning, and tool usage — the bread and butter of agentic coding workflows. It also tops OmniDocBench v1.5 at 91.2 and RealWorldQA at 85.4. On GPQA, a graduate-level science benchmark, it hits 90.4 — highest among all compared models.
But SWE-bench Verified is still the gold standard for real-world software engineering, and Claude Opus 4.6 holds a clear edge there at 80.9 vs 78.8. If your job is resolving GitHub issues on production codebases, Claude is still the safer bet. The question is whether that 2-point lead justifies paying 17x more.
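That tradeoff is easy to put in dollars. Here's a back-of-envelope sketch using the prices cited in this article (Claude's $5/$25 per million tokens, and Qwen's expected production pricing of $0.50/$3); the workload size is a hypothetical assumption, not a measured figure:

```python
# Back-of-envelope cost comparison for a hypothetical agent workload.
# Prices are per million tokens, taken from the figures in this article.
# The workload (2M input tokens, 200k output tokens) is an illustrative
# assumption for a long agentic coding session.

def run_cost(input_tokens: int, output_tokens: int,
             in_price: float, out_price: float) -> float:
    """Dollar cost of a run, given per-million-token prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

workload = (2_000_000, 200_000)  # hypothetical session

claude = run_cost(*workload, in_price=5.00, out_price=25.00)
qwen = run_cost(*workload, in_price=0.50, out_price=3.00)

print(f"Claude: ${claude:.2f}, Qwen: ${qwen:.2f}, ratio: {claude / qwen:.1f}x")
# → Claude: $15.00, Qwen: $1.60, ratio: 9.4x
```

At those assumed prices, a ~$15 Claude session costs about $1.60 on Qwen: a 9x gap at production pricing, wider still at the $0.29 preview rate.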
For context, MiniMax M2.5 hit 80.2% on SWE-bench a couple months back, DeepSeek V4 targeted 80%+, and GPT-5.4 landed at 57.7%. The Chinese models are clustering right at the frontier while OpenAI’s latest is oddly far behind on this particular benchmark.
Speed Is the Real Story
Benchmarks aside, the speed advantage is what developers will actually feel day-to-day.
Qwen 3.6-Plus clocks 158 tokens per second on SpeedBench. Claude Opus 4.6 does 93.5. GPT-5.4 manages 76. That's roughly 1.7x Claude's output speed and more than 2x GPT's. For agentic workflows where the model is running in a loop (plan, execute, observe, iterate), that speed compounds into dramatically shorter feedback cycles.
The architecture behind this is a hybrid of linear attention and sparse mixture-of-experts routing. It’s not just a bigger model thrown at more GPUs. The engineering is designed for throughput.
There’s one significant caveat: time-to-first-token (TTFT) is 11.5 seconds. Over 11 seconds before you see any output. For interactive use cases — chatbots, quick code completions, real-time pair programming — that’s painful. For background agent tasks that run autonomously, it barely matters. This is a model that’s fast once it gets going but slow to start. Know your use case.
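The tradeoff is easy to quantify. A minimal sketch, using only the 11.5-second TTFT and 158 tok/s figures above, shows how TTFT dominates short interactive replies but fades into the noise for long autonomous generations:

```python
# How much of total wall-clock time is spent waiting for the first token?
# Figures from this article: TTFT = 11.5 s, throughput = 158 tokens/s.

TTFT = 11.5   # seconds before the first token appears
TPS = 158.0   # decode throughput, tokens per second

def total_time(output_tokens: int) -> float:
    """Wall-clock seconds to generate a reply of the given length."""
    return TTFT + output_tokens / TPS

for tokens in (50, 5_000):
    t = total_time(tokens)
    print(f"{tokens} tokens: {t:.1f} s total, TTFT is {TTFT / t:.0%} of it")
```

A 50-token chat reply takes about 11.8 s, roughly 97% of which is the wait for the first token; a 5,000-token agent run takes about 43.1 s, with TTFT down to about 27%. Hence the split: painful for chat, nearly irrelevant for background agents.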
The Tool Compatibility Play
Here’s where Alibaba is being smart. Qwen 3.6-Plus isn’t locked into a proprietary ecosystem. It works natively with Claude Code, OpenClaw, Cline, Kilo Code, and OpenCode. You don’t need to change your toolchain. Just swap the backend model.
This matters a lot right now. Anthropic recently tightened restrictions on OpenClaw usage for Claude Code, pushing some developers to look for alternatives. Qwen 3.6-Plus slots in as a direct drop-in replacement. Same tools, different brain, fraction of the cost.
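For Claude Code specifically, "swap the backend" usually means a couple of environment variables. A hedged sketch: `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, and `ANTHROPIC_MODEL` are standard Claude Code overrides, but the endpoint URL and model identifier below are illustrative placeholders, not confirmed Bailian values; check Alibaba's docs for the real ones.

```shell
# Point Claude Code at an Anthropic-compatible Qwen endpoint instead of
# Anthropic's API. The URL and model id below are PLACEHOLDERS, not
# confirmed Bailian values.
export ANTHROPIC_BASE_URL="https://example-bailian-endpoint/anthropic"
export ANTHROPIC_AUTH_TOKEN="$QWEN_API_KEY"   # your Bailian API key
export ANTHROPIC_MODEL="qwen3.6-plus"         # placeholder model id
claude   # launch Claude Code as usual; requests now hit the Qwen backend
```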
The 1-million-token context window helps too. Feed it an entire repository. The model also supports native multimodal input — you can hand it a screenshot or a Figma design mockup and it’ll generate code from the visual. Always-on chain-of-thought reasoning and native function calling round out the feature set. These aren’t experimental add-ons. They’re core to the model’s architecture.
Where It Falls Short
Production reliability is where Claude still wins, and it's not close. Developers running long autonomous sessions — 20+ minutes of continuous agent operation — consistently report fewer hallucinations and better context retention with Claude Opus 4.6. When you're shipping code to clients who care about compliance and output consistency, that reliability gap matters more than a 17x price difference.
The TTFT issue is real. 11.5 seconds of dead air before the first token is a hard sell for any interactive product. Alibaba will need to fix this before Qwen 3.6-Plus can compete seriously in chat-based developer tools.
And despite the impressive benchmarks, the model is still in preview. Production pricing is expected to land around $0.50/$3 per million tokens for input/output — still dramatically cheaper than Claude’s $5/$25, but a big jump from the current free preview. The real test comes when developers start paying and the honeymoon period ends.
The bigger picture: February and March 2026 saw GLM-5 trained entirely on Huawei chips, DeepSeek V4 pushing open-source frontiers, MiniMax M2.5 nearly matching Claude on SWE-bench, and now Qwen 3.6-Plus outpacing it on speed and terminal tasks. Chinese AI labs aren’t catching up anymore. On specific benchmarks and price-performance, they’ve already pulled ahead. The question isn’t whether they can compete. It’s whether Western developers will trust them enough to switch.