Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

Moonshot AI released Kimi K2.7-Code this week, an open-source update to its K2 coding model family that claims leaner reasoning and double-digit performance gains.

K2.7-Code is built on the same trillion-parameter mixture-of-experts architecture as its predecessor K2.6, and is served through an OpenAI-compatible API, which makes it a straightforward swap for teams already running K2.6 in production gateways.

When K2.6 launched in April, it topped OpenRouter’s weekly LLM leaderboard – a ranking based on actual API routing decisions by developers, not self-reported benchmark scores.

Moonshot AI says K2.7-Code addresses what it calls "overthinking," reducing thinking-token usage by 30% compared to K2.6 – a number that directly affects inference costs for teams running agentic workflows. Whether those efficiency gains hold up under independent benchmarks is a question practitioners have already begun to raise publicly.
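To see what a 30% cut in thinking tokens could mean for a bill, the arithmetic can be sketched roughly as follows. The token counts and the per-million price are invented placeholders, not Moonshot AI pricing, and the sketch assumes thinking tokens are billed at the same rate as other output tokens.

```python
def output_cost(thinking_tokens, answer_tokens, price_per_million):
    """Dollar cost of one response's output tokens (hypothetical flat rate)."""
    return (thinking_tokens + answer_tokens) * price_per_million / 1_000_000


def savings_fraction(thinking_tokens, answer_tokens, thinking_cut=0.30):
    """Fraction of output-token spend saved when only thinking tokens shrink."""
    before = thinking_tokens + answer_tokens
    after = thinking_tokens * (1 - thinking_cut) + answer_tokens
    return (before - after) / before


# Hypothetical agent step: 8,000 thinking tokens and 2,000 answer tokens.
print(round(savings_fraction(8_000, 2_000), 2))  # 0.24
```

The overall saving comes out below 30% because the answer tokens do not shrink; input tokens, which this sketch ignores, would dilute it further.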

What is Kimi K2.7-Code?

K2.7-Code is released under a modified MIT license, with weights available on Hugging Face. The model is deployable via vLLM or SGLang. It runs exclusively in thinking mode and doesn’t support temperature adjustments – Moonshot AI has fixed the temperature at 1.0, which means teams can’t tune output determinism the way they can with other models.

The main change from K2.6 is how the model generates low-level code. Where K2.6 produced implementations by wrapping existing libraries and routing them through established frameworks, K2.7-Code writes implementations directly. Moonshot AI says this produces more reliable generalization across Rust, Go and Python and across task types including frontend development, DevOps and performance optimization.

On benchmark performance, Moonshot AI claims gains of 21.8% on Kimi Code Bench v2, 11% on Program Bench and 31.5% on MLS Bench Lite. All three are proprietary benchmarks run by Moonshot AI. The model has not been submitted to DeepSWE, an independent coding benchmark that produces a 70-point spread across models – compared to SWE-Bench Pro’s 30-point spread – making it a more discriminating signal for teams configuring model routing systems.

More honest, and weaker for it

Outside of Moonshot AI’s own benchmarks, the picture is more complex.

Researcher Elliot Arledge ran K2.7-Code against K2.6 and CloudFable 5 on KernelBench-Hard, a public benchmark focused on GPU kernel optimization, and published his full run log on kernelbench.com.

"K2.7 is more honest but not more capable," Arledge wrote on X.

On five of the six problems, K2.7-Code wrote an actual Triton kernel where K2.6 had used library wrappers. Two of those kernels failed because of the model’s own bugs. On the MoE kernel, the score dropped from K2.6’s 0.222 to 0.157.

"For reference, the fable tops every cell, it honestly doesn’t fail," Arledge wrote.

Sugumaran Balasubramanian, a developer who built a model-task router for the Hermes agent platform using DeepSWE as his reference signal, publicly responded to the K2.7-Code release and directly challenged Moonshot AI on its choice of benchmarks.

"Respectfully, each model makes double-digit ‘improvements’ in its test suite," Balasubramanian wrote on X.

He noted that K2.6 scored 24% on DeepSWE, tied with GPT-5.4-mini, and asked whether Moonshot AI would run K2.7-Code on the same benchmark.

Balasubramanian said it took 13 review rounds to get the right benchmark data for his router, and that if independent numbers held up he would route coding tasks to K2.7-Code.

What this means for enterprises

Token efficiency gains are immediately usable. Teams running K2.6 in production can swap to K2.7-Code via the OpenAI-compatible API and expect lower inference costs on agentic workflows without architecture changes. The 30% token reduction is Moonshot AI’s own figure, but the integration path is low enough risk to test against your own workload before committing.

The practical question is whether those efficiency gains carry over to a team’s own workload distribution. Running K2.7-Code against your own workload before adjusting gateway routing is a low-risk way to find out.
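One way to run that check is a small side-by-side harness, sketched below. The model identifiers and the `call_model` callable are hypothetical stand-ins: in practice `call_model` would wrap your gateway's OpenAI-compatible client and return the completion-token count reported in the response's usage field. Whether thinking tokens are included in that count for a given deployment is an assumption worth verifying first.

```python
from statistics import mean


def compare_models(prompts, call_model, baseline, candidate):
    """Run every prompt through both models and compare output-token usage.

    call_model(model, prompt) must return the number of completion tokens
    used for that request (hypothetical callable supplied by the caller).
    """
    base = [call_model(baseline, p) for p in prompts]
    cand = [call_model(candidate, p) for p in prompts]
    base_avg, cand_avg = mean(base), mean(cand)
    return {
        "baseline_avg_tokens": base_avg,
        "candidate_avg_tokens": cand_avg,
        "reduction": 1 - cand_avg / base_avg,
    }


# Offline stand-in for a real client; the token counts are invented.
fake_usage = {"k2.6": 1000, "k2.7-code": 700}
report = compare_models(
    ["refactor this function", "write a unit test"],
    lambda model, prompt: fake_usage[model],
    baseline="k2.6",       # placeholder model id
    candidate="k2.7-code",  # placeholder model id
)
print(report)
```

Token counts alone say nothing about whether the output is correct, so a real comparison would pair this with the team's own pass/fail checks on each task.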


