Behind Kimi-K2’s Config: An Inference-Centric View of Four Key Design Decisions

Author: Liu Shaowei
Source: Zhihu
(Translated from Chinese; copyright belongs to the original author. Commercial reprinting requires permission; for non-commercial use, please cite the source.)


I was on the delivery team for Kimi-K2, so—like any proud parent—everything about it looks perfect to me.
Since its release, the open-source community has showered K2 with attention. Many have noticed that our architecture is almost identical to DeepSeek-V3 (henceforth DSv3) and have dug into the handful of remaining differences. As a low-profile member of Moonshot’s infra-side inference crew, I’d like to walk through why K2’s config looks the way it does from an inference perspective.
Disclaimer: the discussion inevitably touches on training details, but some parts are my own interpretations; colleagues, please correct me if I’m wrong.


1. Design philosophy of K2’s architecture

Before kicking off K2 training, we ran extensive scaling experiments on model structure. Every proposed variant that differed from DSv3 lost—at best, it tied. The question then became: should we force a different architecture just for the sake of being different? The answer was no. DSv3’s structure had already proven itself at large scale, whereas our “new” ideas lacked that validation. With the Muon optimizer and a larger parameter count already introducing major unknowns, we didn’t want extra uncertainty just to be “innovative.”
Thus Constraint #1: inherit DSv3’s architecture wholesale, then tune parameters that fit our setup.

Constraint #2 was cost—both training and inference. As a small company, our GPU budget is tight. After DSv3 shipped, we concluded its training + inference costs were near our ceiling. We therefore aimed to keep K2’s costs no higher than DSv3’s.

So the design task boiled down to: within DSv3’s skeleton, choose parameters that yield noticeably lower loss while matching DSv3’s train/infer cost. (Training-cost details are left to a future tech report—partly because I only half understand them 😉.)


2. Concrete changes and their motivations

As many have spotted by diffing the config files, we changed four items (a minimal sketch of the diff follows the list):

  1. num_experts = 384
  2. num_attention_heads = 64
  3. Only one dense layer at the front (first_k_dense = 1)
  4. Router without grouping (n_group = 1)
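
Here is that diff as a small Python sketch, using the field names from the public Hugging Face config.json files; the values are quoted from memory, so double-check them against the released configs.

```python
# Hypothetical side-by-side of the four changed fields; names follow the
# released config.json files, values are quoted from memory.
dsv3 = {
    "n_routed_experts": 256,     # total routed experts (8 activated per token)
    "num_attention_heads": 128,  # MLA head count
    "first_k_dense_replace": 3,  # leading dense (non-MoE) layers
    "n_group": 8,                # expert groups for group-limited routing
}

k2 = {
    "n_routed_experts": 384,     # 1.5x total experts, still 8 activated
    "num_attention_heads": 64,   # head count halved
    "first_k_dense_replace": 1,  # only layer 1 stays dense
    "n_group": 1,                # router picks top-k over all experts freely
}

for key in dsv3:
    print(f"{key:22s} {dsv3[key]:>4d} -> {k2[key]:>4d}")
```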

I’ll cover each in turn, reasoning purely from inference and assuming the EP+DP schedule described in the DSv3 tech report [1] and OpenDay slides [2], with communication ideally overlapped away.


2.1 num_experts = 384

This came from pretrain’s sparsity-scaling law: with activated parameters fixed, simply growing the MoE total parameters still obeys the scaling law, and both train & val loss keep dropping—no over-fitting risk. So 384 experts shoulder the primary duty of loss reduction.

Inference impact

  • Prefill: if we keep the same node count as for 256 experts and the sequences are long enough, runtime barely changes—prefill is compute-bound and activated FLOPs stay identical.
  • Decode: online TBOT (time-between-output-tokens) caps the usable batch size. MoE GEMMs remain memory-bound; 1.5× parameters ⇒ 1.5× time.
    • Under EP=128 (the GCD of 384 and 256), each DSv3 rank holds 2 routed + 1 shared expert ≈ 7.5 GB of MLP weights; each K2 rank holds 3 routed + 1 shared ≈ 10 GB, an extra ~2.5 GB.
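
For readers who want to check the arithmetic, here is a back-of-the-envelope sketch. It assumes the public config values (hidden_size = 7168, moe_intermediate_size = 2048, 61 layers, 1 shared expert) and FP8 weights at 1 byte per parameter; the exact per-rank figure depends on precision and on which layers you count.

```python
# Rough per-rank MoE weight footprint under EP=128, at 1 byte per parameter (FP8).
HIDDEN, MOE_FFN = 7168, 2048
PER_EXPERT = 3 * HIDDEN * MOE_FFN              # gate, up, down projections (~44M params)

def moe_gb_per_rank(total_experts, ep, moe_layers):
    routed_per_rank = total_experts // ep      # routed experts hosted on one rank
    experts_per_rank = routed_per_rank + 1     # plus the shared expert
    return experts_per_rank * PER_EXPERT * moe_layers / 1e9

dsv3 = moe_gb_per_rank(256, ep=128, moe_layers=61 - 3)  # first 3 layers are dense
k2   = moe_gb_per_rank(384, ep=128, moe_layers=61 - 1)  # only layer 1 is dense
print(f"DSv3 ~{dsv3:.1f} GB per rank, K2 ~{k2:.1f} GB per rank")
# -> roughly 7.7 GB vs 10.6 GB, in line with the ~7.5 GB / ~10 GB figures above
```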

2.2 num_attention_heads = 64

MoE got slower—can we claw latency back elsewhere? Start with attention heads.
In MLA, DSv3 doubles the usual head count to exploit the available bandwidth, buying a small improvement in loss at the cost of latency in both phases. Cutting the heads back to 64 yields:

Prefill
(1) MLA core-attention FLOPs scale as 2·h·s²·(d_nope + d_rope + d_v), where h is the head count and s the sequence length; everything else scales as 2·N·s, with N the activated parameter count. For long sequences (agent / vibe-coding use cases), the quadratic term dominates. Halving h directly shrinks that term.
(2) Q/K/V/O projections scale linearly with h. DSv3’s 37 B activated params vs K2’s 32 B—the 5 B gap is almost entirely these projections. Fewer params ⇒ fewer FLOPs.
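
To make the two terms in (1) and (2) concrete, here is a rough sketch that plugs in the MLA head dimensions from the public configs (d_nope = 128, d_rope = 64, d_v = 128) and the activated-parameter counts quoted above. It shows how the quadratic term's share of prefill FLOPs grows with sequence length and is halved when h drops from 128 to 64; treat it as an estimate, not a profiler trace.

```python
# Back-of-the-envelope prefill FLOP split between the quadratic core-attention
# term (2*h*s^2*(d_nope+d_rope+d_v)) and all weight matmuls (2*N*s).
D_NOPE, D_ROPE, D_V = 128, 64, 128

def prefill_flops(h, s, activated_params):
    attn = 2 * h * s * s * (D_NOPE + D_ROPE + D_V)  # quadratic core-attention term
    rest = 2 * activated_params * s                 # projections / MoE, i.e. 2*N*s
    return attn, rest

for s in (8_000, 32_000, 128_000):
    a128, r128 = prefill_flops(128, s, 37e9)        # DSv3-like
    a64,  r64  = prefill_flops(64,  s, 32e9)        # K2-like
    print(f"s={s:>7}: attn share {a128/(a128+r128):5.1%} (h=128) "
          f"vs {a64/(a64+r64):5.1%} (h=64)")
```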

Decode

  • Core attention latency depends on the KV-cache size, which in MLA is independent of the head count, so this item is a tie.
  • Q/K/V/O projection memory traffic drops from 10 GB to 5 GB. Under DP attention these tensors are replicated on every rank, so the 5 GB saving doesn’t shrink with higher parallelism.
    • Earlier MoE added 2.5 GB; here we save 5 GB—net win.
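
The 10 GB → 5 GB figure is easy to reproduce by counting only the MLA projection weights whose size scales with h (the q/kv up-projections and the output projection), using the public config values and 1 byte per parameter; the shared low-rank down-projections add a bit more on top but do not change with the head count. A sketch:

```python
# Per-rank bytes of head-count-dependent MLA projection weights, assuming the
# public config values and FP8 (1 byte per parameter). With DP attention these
# weights are replicated on every rank and streamed once per decode step.
HIDDEN, Q_RANK, KV_RANK = 7168, 1536, 512
D_NOPE, D_ROPE, D_V, LAYERS = 128, 64, 128, 61

def head_dependent_proj_gb(h):
    q_b  = Q_RANK * h * (D_NOPE + D_ROPE)   # q up-projection
    kv_b = KV_RANK * h * (D_NOPE + D_V)     # joint k/v up-projection
    o    = h * D_V * HIDDEN                 # output projection
    return (q_b + kv_b + o) * LAYERS / 1e9

print(f"h=128: {head_dependent_proj_gb(128):.1f} GB, h=64: {head_dependent_proj_gb(64):.1f} GB")
# -> roughly 10.5 GB vs 5.2 GB, matching the 10 GB -> 5 GB numbers above
```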

Algorithm ablations confirmed the loss hit from halving heads is far smaller than the gain from extra MoE parameters. Thus num_heads = 64 was locked in.
(Bonus: lower head count leaves more room for speculative-decoding speed-ups.)


2.3 first_k_dense = 1

Like DSv3, we observed that the router in layer 1 struggles to balance load, while later layers are fine. To maximize MoE utilization we keep only layer 1 dense (DSv3 keeps the first three layers dense). There is no measurable prefill impact, and decode adds only a few hundred MB per rank.


2.4 n_group = 1 (no expert grouping)

Grouping helps when one GPU hosts many experts, because it balances load within a device. At our scale, EP is so large that each device ends up hosting only one routed expert (or even none), which shifts the balancing problem to the node level. Node-level balancing plus dynamic redundancy (EPLB) already handles hotspots, so grouping adds little. A freer router expands the combinatorial space and slightly improves quality.
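
To illustrate what the router gains, here is a toy sketch of group-limited versus free top-k selection for a single token. The shapes follow the config fields above, but the statistic used to rank groups is simplified (DSv3's actual group score differs slightly), so treat this as an illustration rather than the production routing code.

```python
import torch

def route(scores, k=8, n_group=1, topk_group=1):
    """Pick k experts for one token from its router affinity scores."""
    if n_group > 1:
        g = scores.view(n_group, -1)                          # (n_group, experts_per_group)
        keep = g.max(dim=-1).values.topk(topk_group).indices  # best groups (simplified score)
        mask = torch.zeros(n_group, dtype=torch.bool)
        mask[keep] = True
        scores = g.masked_fill(~mask[:, None], float("-inf")).flatten()
    return scores.topk(k).indices                             # chosen expert ids

print(route(torch.randn(256), k=8, n_group=8, topk_group=4))  # DSv3-style group-limited choice
print(route(torch.randn(384), k=8, n_group=1))                # K2-style free choice over all 384
```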


3. Wrap-up

Those four tweaks give us an inference recipe that, at the same EP degree, increases total params 1.5× yet reduces both prefill and decode latency (communication excluded). Even after accounting for overlap and network effects, costs stay on par with DSv3. We’re proud that every parameter change was backed by solid theory and experiments.

We hope the open-source community will help validate these claims now that the weights are public. Thank you all for the enthusiasm around Kimi-K2!
