
Open-Weight Reality Check in 2025: Kimi K2 vs Qwen3 vs GLM vs OpenAI’s gpt-oss vs DeepSeek

The open-weight scene finally has real breadth. If you want to self-host, fine-tune weekly, and keep data in your own stack, you no longer have to compromise on capability to get permissive licenses and complete artifacts. The trade-offs now are mostly about license details, serving economics (dense vs MoE), ecosystem maturity, and how much post-training you need to hit your product’s tone and safety bar.

Kimi K2

K2 flipped the script this summer by publishing weights along with a straightforward model card and a permissive license. Under the hood it's a Mixture-of-Experts design: high total parameters with a smaller active set per token, so you can push strong quality while keeping tokens-per-dollar under control once your kernels and routing are tuned. If you had Kimi filed away as a closed benchmark for UX and speed, update the mental model: today you can actually clone the setup, test on your own evals, and fine-tune in your own environment. The license is a Modified MIT that's easy to run through legal, and the docs point you to mainstream inference stacks, which cuts bring-up time.
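To make the MoE economics concrete, here's a back-of-the-envelope sketch. The ~1T-total / ~32B-active shape is the publicly reported K2 configuration; the dense 70B comparison point is an assumption for illustration, not a claim about any specific model.

```python
# Back-of-the-envelope decode cost: a dense model touches all of its
# weights per generated token, while an MoE model only touches the
# active subset. Rough rule of thumb: ~2 FLOPs per active parameter
# per generated token.

def decode_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs for one generated token."""
    return 2.0 * active_params

# Publicly reported shape for Kimi K2 (treat as assumptions, not gospel):
K2_TOTAL = 1.0e12    # ~1T total parameters (what you must hold in memory)
K2_ACTIVE = 32.0e9   # ~32B active parameters per token (what you compute)

DENSE_70B = 70.0e9   # hypothetical dense comparison point

moe = decode_flops_per_token(K2_ACTIVE)
dense = decode_flops_per_token(DENSE_70B)
print(f"K2 per-token compute:      {moe:.2e} FLOPs")
print(f"Dense-70B per token:       {dense:.2e} FLOPs")
print(f"Compute ratio (dense/MoE): {dense / moe:.1f}x")

# The catch: memory and routing. You still pay to store ~1T weights,
# and the throughput only materializes once kernels and expert routing
# are tuned -- which is exactly the bring-up work the docs try to cut.
```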

Qwen3

Qwen3 is the most complete open family right now if you value range and predictability. You get tiny through huge, dense and MoE, "thinking" and "non-thinking" chat templates, and long-context variants that stretch into the seven-figure token range. The team ships sensible fine-tune guidance, tokenizer specs, and first-week support across vLLM, SGLang, MLX, and the usual frameworks. Just as important, the default license across the open weights is Apache 2.0, which tends to unblock procurement in multi-region deployments. If you want one code path from a laptop to a rack and you care about bilingual behavior out of the box, Qwen3 is the baseline that makes the least noise on day two.
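As a concrete example of the mode switch: in Qwen3 the thinking/non-thinking split is a chat-template flag, not a separate checkpoint. This minimal sketch follows the pattern in the Qwen3 model cards; the `enable_thinking` flag and the `Qwen/Qwen3-8B` repo id come from their docs, so verify both against the exact checkpoint you pull.

```python
# Minimal sketch of Qwen3's thinking vs. non-thinking modes via the
# chat template, following the pattern shown in the Qwen3 model cards.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # example checkpoint; any Qwen3 chat model works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize MoE routing in two sentences."}]

# enable_thinking=True emits a reasoning block before the answer;
# False gives a direct reply. Same weights, different template path.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.input_ids.shape[-1]:],
                       skip_special_tokens=True))
```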

GLM

GLM is excellent when your product is Chinese-first and you want alignment that feels natural in both Chinese and English with less SFT. The practical wrinkle is licensing: older GLM-4 open drops required a quick, free authorization for commercial use, which is manageable but adds a checkbox for compliance. Newer GLM-4.5 materials are positioned as open source and, depending on the release, come with permissive terms that remove that extra step. In short: pick the exact artifact you plan to ship, read that repo's license once, and you're fine; then enjoy the long-context options and agent-oriented post-training that save you cycles.
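Because the terms differ per artifact, it's worth scripting the license check before anything reaches legal. This sketch reads the declared license tag the Hugging Face Hub exposes per repo; the repo ids are illustrative, and the tag is only a pointer, so still read the LICENSE file in the repo you actually ship.

```python
# Quick license pre-check across candidate repos before a legal review.
# The Hub exposes the declared license as a "license:<id>" tag; treat it
# as a pointer, not a substitute for reading the repo's LICENSE file.
from huggingface_hub import model_info

CANDIDATES = [              # illustrative repo ids; swap in your shortlist
    "Qwen/Qwen3-8B",
    "zai-org/GLM-4.5",
    "deepseek-ai/DeepSeek-V3",
]

for repo in CANDIDATES:
    info = model_info(repo)
    licenses = [t.split(":", 1)[1] for t in info.tags if t.startswith("license:")]
    print(f"{repo}: {licenses or 'no license tag declared'}")
```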

OpenAI’s gpt-oss

OpenAI rejoined the open-weight world this month with gpt-oss models under Apache 2.0. It’s a notable shift: you get modern instruction-following and tool-use in weights you can actually run yourself, rather than yet another API box. The smaller checkpoint is sized for modest hardware; the larger one targets datacenter cards and gives you a clear path if you want an “OpenAI-flavored” agent stack that’s still fully self-hosted. It doesn’t change the closed status of the frontier GPT line, but for teams that need permissive licensing from a U.S. publisher and want to stay close to the familiar OpenAI developer ergonomics, this lands well.
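Bring-up really can reuse the familiar client-side ergonomics. A minimal sketch, assuming the published openai/gpt-oss-20b repo id and a vLLM build that supports it: serve the weights locally, then point the standard OpenAI client at your own endpoint.

```python
# Talk to a self-hosted gpt-oss checkpoint through vLLM's
# OpenAI-compatible endpoint. Assumes you've started the server with:
#   vllm serve openai/gpt-oss-20b
# (repo id and vLLM support are assumptions; check the release notes.)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM server, not api.openai.com
    api_key="EMPTY",                      # vLLM ignores the key by default
)

resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "List three risks of self-hosting LLMs."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```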

DeepSeek

DeepSeek is the cost-performance provocateur. The V3 series leans into MoE with a large total parameter budget (671B) but only ~37B active per token, plus inference-friendly attention (multi-head latent attention) that shrinks the KV cache. The recent checkpoints ship under MIT terms and are explicit about "thinking" vs. regular modes, tool-calling behavior, and how to turn the reasoning traces on or off. In practice you'll spend a bit of time making sure your serving stack is MoE-savvy; once it is, the throughput and unit economics are hard to ignore, especially for multi-turn reasoning apps where a base model plus a small, clean SFT can already hit production-grade answers.
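Don't take the unit-economics story on faith; measure it on your own stack. This crude probe fires concurrent requests at an OpenAI-compatible endpoint and reports completion tokens per second; the endpoint, model id, and concurrency level are all placeholders for your setup.

```python
# Crude throughput probe for an OpenAI-compatible serving stack:
# fire N concurrent chat requests and report completion tokens/sec.
# Endpoint, model id, and concurrency are placeholders for your setup.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "deepseek-ai/DeepSeek-V3"   # whatever your server registered
CONCURRENCY = 16
PROMPT = "Explain KV-cache reuse in multi-turn chat, step by step."

def one_request(_):
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    tokens = list(pool.map(one_request, range(CONCURRENCY)))
elapsed = time.perf_counter() - start

print(f"{sum(tokens)} completion tokens in {elapsed:.1f}s "
      f"= {sum(tokens) / elapsed:.1f} tok/s at concurrency {CONCURRENCY}")
```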

How to actually choose

If you want the least drama across legal, ops, and fine-tuning, start with Qwen3 and freeze an internal eval set that mirrors your production tasks. If your north star is reasoning quality per dollar and you're comfortable doing a short MoE optimization sprint, prototype on DeepSeek and K2 and measure under load. If stakeholders insist on Apache 2.0 and a U.S. publisher, gpt-oss is the new default to trial first, especially the smaller checkpoint for easy bring-up. If your users work primarily in Chinese and tone matters more than chasing the last point on a leaderboard, GLM will usually get you to "feels right" with less SFT.
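"Freeze an internal eval set" can be as simple as a JSONL file of prompts with cheap, deterministic checks that you replay against every candidate. A minimal sketch; the file format, endpoint, and substring scoring are assumptions to adapt to your tasks.

```python
# Minimal frozen-eval harness: replay a fixed JSONL of prompts against a
# candidate model and score with cheap, deterministic checks. The file
# format, endpoint, and substring scoring are assumptions to adapt.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Each line of the eval file: {"prompt": "...", "must_contain": ["...", ...]}
def run_eval(path: str, model: str) -> float:
    passed, total = 0, 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": case["prompt"]}],
                max_tokens=512,
                temperature=0,   # keep runs comparable across candidates
            )
            answer = resp.choices[0].message.content.lower()
            passed += all(s.lower() in answer for s in case["must_contain"])
            total += 1
    return passed / total

score = run_eval("frozen_evals.jsonl", model="Qwen/Qwen3-8B")
print(f"pass rate: {score:.1%}")
```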

The bigger lesson: “open-source” is an operations decision, not a moral one. The winner is the model your team can run cheaply, adapt weekly, and ship safely—without spending a month haggling over licenses or a quarter chasing kernel regressions. The good news is that in 2025, you finally have multiple ways to get there.
