Moonshot’s $4.6 million ‘Kimi K2 Thinking’ takes top spots on reasoning benchmarks

Moonshot reportedly trained an AI model for $4.6 million that beats OpenAI's GPT-5 on several reasoning benchmarks. While OpenAI talks up trillion-dollar infrastructure needs, Chinese labs are showing that the old cost math no longer holds. The twist: even they worry the tech works too well.


Open weights beat closed weights, at least on paper. Moonshot AI released Kimi K2 Thinking, a trillion-parameter sparse mixture-of-experts (MoE) model that posts higher scores than GPT-5 and Claude Sonnet 4.5 on several reasoning and agentic tests, according to the company's own materials and early testers. Moonshot also published a clear technical write-up of the system's design and serving regime in its K2 Thinking technical overview. The reported training bill: about $4.6 million, per a source cited by CNBC, which the outlet noted it could not independently verify.

That number is less than the price of a Bay Area house. Meanwhile, OpenAI has been discussing eye-watering infrastructure needs, figures in the trillions, before clarifying it wasn't asking Washington for a direct bailout. The contrast is stark.

What’s actually new

K2 Thinking is built to plan and act, not just predict the next token. It's a one-trillion-parameter mixture-of-experts model with 32 billion parameters active per token at inference, post-trained with quantization-aware training to run natively at INT4. The practical upshot: long "thinking" traces and tool chains at lower serving cost. One more thing: it sustains a 256k-token context window.
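
The memory arithmetic explains why INT4 matters for serving. Here is a minimal back-of-the-envelope sketch in Python using the parameter counts above; the bytes-per-weight figures are the standard ones, and the real checkpoint runs somewhat larger, likely because not every tensor is stored at INT4:

```python
# Back-of-the-envelope memory math for a 1T-parameter MoE served at INT4.
# Parameter counts are Moonshot's published figures; bytes-per-weight are
# the standard sizes per precision. KV cache and activations are ignored.

TOTAL_PARAMS = 1.0e12   # full mixture-of-experts weight count
ACTIVE_PARAMS = 32e9    # parameters touched per token at inference

BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for precision, nbytes in BYTES_PER_WEIGHT.items():
    total_gb = TOTAL_PARAMS * nbytes / 1e9
    active_gb = ACTIVE_PARAMS * nbytes / 1e9
    print(f"{precision}: ~{total_gb:,.0f} GB for all weights, "
          f"~{active_gb:,.0f} GB touched per token")

# INT4 lands near 500 GB for the full weights, the same ballpark as the
# ~594 GB checkpoint Moonshot distributes (real checkpoints carry extra
# metadata and some tensors stay at higher precision), with only ~16 GB
# of expert weights active for any single token.
```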

The headline capability is depth. Moonshot says K2 can execute 200–300 sequential tool calls while keeping its internal chain of thought coherent. That mirrors the “agentic” behavior the big closed models have marketed as their moat. It’s not a demo trick.
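
The shape of such a loop is simpler than the marketing suggests; sustaining it for hundreds of steps without drift is the hard part. A minimal sketch follows, assuming an OpenAI-compatible chat endpoint with tool calling; the base URL, API key, model name, and tool executor are illustrative placeholders, not Moonshot's documented values:

```python
import json
from openai import OpenAI  # any OpenAI-compatible client library

# Illustrative values; swap in the real endpoint, key, and model name.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")
MODEL = "kimi-k2-thinking"  # assumed model identifier

def run_agent(task: str, tools: list, execute_tool, max_steps: int = 300):
    """Loop until the model stops requesting tools or the step budget runs out."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=MODEL, messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        messages.append(msg)                 # keep the full trajectory in context
        if not msg.tool_calls:               # no more tools requested: final answer
            return msg.content
        for call in msg.tool_calls:          # run each tool, feed results back in
            result = execute_tool(call.function.name,
                                  json.loads(call.function.arguments))
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return None  # step budget exhausted without a final answer
```

The 256k-token context is what keeps that trajectory viable: every tool result accumulates in the message list, so a shorter window would force summarization or truncation long before step 300.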

The scores, and what they mean

On Humanity's Last Exam, a demanding reasoning suite, K2 Thinking posts 44.9%, ahead of GPT-5's 41.7% under the same test regime, per Moonshot's reporting. On BrowseComp, which probes web search and synthesis, it logs 60.2% versus 54.9% for GPT-5 and 24.1% for Claude Sonnet 4.5 (Thinking). It also shows strong results on SWE-Bench Verified and LiveCodeBench v6.

These are not backyard quizzes. HLE spans expert-level questions across disciplines; BrowseComp stresses real-world retrieval, planning, and citation under time pressure. Crucially, Moonshot reports these scores at INT4—the way the model is actually served. That matters.

Credit: Kimi K2 / Moonshot AI

Caveat time. Much of the data is self-reported by Moonshot or measured by early evaluators. On some coding tasks, closed models have the edge. And benchmarks never capture failure modes you meet in production. Keep that in mind.

A licensing and cost structure designed to travel

Moonshot’s modified MIT license is permissive with one twist: if your product tops 100 million monthly users or $20 million a month in revenue, you must display “Kimi K2” prominently in the UI. For almost everyone else, it behaves like MIT. That lowers procurement friction.

Pricing undercuts the incumbents. Moonshot lists $0.60 per million input tokens ($0.15 with cache hits) and $2.50 per million output tokens. That's far below typical GPT-5-tier quotes circulated to enterprises. You can also self-host: the INT4 checkpoint comes in at roughly 594 GB, small enough for serious on-prem experiments and even some dual-Mac-Studio setups that early adopters have shown off. That's new.

The bigger advantage is transparency. K2 Thinking emits an explicit reasoning_content field. Auditors can inspect intermediate steps, and ops teams can debug long workflows. Black-box incumbents will argue quality beats traceability. Many buyers won’t agree.
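
What that looks like to a client is a second text channel alongside the answer. Here is a minimal sketch, assuming an OpenAI-compatible response in which the trace is exposed as reasoning_content; the endpoint, key, model name, and prompt are placeholders:

```python
from openai import OpenAI  # any OpenAI-compatible client library

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="kimi-k2-thinking",  # assumed model identifier
    messages=[{"role": "user", "content": "Plan a three-step research task."}],
)

msg = resp.choices[0].message
answer = msg.content or ""
# reasoning_content carries the exposed intermediate reasoning; getattr
# guards against models or proxies that do not return the field.
trace = getattr(msg, "reasoning_content", "") or ""

print(answer)                        # user-facing answer
audit_record = {"reasoning": trace}  # persist the trace for auditors and ops
```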

The China factor: speed over spectacle

Chinese labs are moving faster. That’s the through-line from DeepSeek to Qwen to Kimi. As Nathan Lambert notes, they tend to release on tighter cadences, while U.S. labs, especially Anthropic, take longer to ship. When progress comes in months, not years, speed compounds.

Policy stories miss the operational point. Beijing is happy to present open-weight wins as proof that sanctions aren’t biting. Washington points to capital intensity as a strategic moat. But if open models approach parity at a sliver of cost, the old “scale wins” playbook looks fragile. It’s a shift from scale supremacy to systems engineering.

Distribution, not training, is the new choke point

Quality models are outrunning serving capacity. Early demand overwhelmed Moonshot’s endpoints and OpenRouter proxies. Hosting 256k contexts, interleaved “thinking” tokens, and hundreds of tool calls per task is brutal on infrastructure. Latency balloons. Bills do too.

This is where hyperscalers still matter. Whoever can provision stable, cheap inference at scale will set the pace for agentic adoption. If Alibaba Cloud or Tencent partners deeply with Moonshot, the balance of power could tilt quickly. Without that, performance on paper remains… on paper.

A sober note from inside the boom

At China's World Internet Conference, DeepSeek senior researcher Chen Deli warned that AI's short-term utility could give way to sweeping job displacement within a decade and broader social upheaval within two. He called on tech companies to act as "defenders." That's not a think-tank talking point; it comes from the lab credited with shrinking the open-closed performance gap through rapid open releases.

The timing is awkward. Just as the cost narrative flips, the social one darkens. Both can be true.

Limits and unknowns

Benchmarks can be gamed, and “agentic” tests vary in fidelity. Much of the score-keeping depends on publishers’ harnesses, not independent labs. The $4.6 million training figure is single-source. U.S. providers still lead on trust tooling, uptime, and global support. And the licensing clause, while mild, isn’t “pure” open source.

Still, the direction is unmistakable. Open weights are no longer a toy.

Why this matters

  • Open-weight models now threaten closed frontiers on capability and cost, forcing buyers to revisit long-term API dependencies and cloud budgets.
  • If hosting becomes the bottleneck, power shifts from model trainers to whoever can deliver fast, reliable agentic inference at scale.

❓ Frequently Asked Questions

Q: What does the modified MIT license actually mean for my company?

A: You can use K2 Thinking freely for commercial purposes. Only if you hit 100 million monthly active users or $20 million monthly revenue must you display "Kimi K2" in your interface. For 99.9% of companies, it works exactly like standard MIT licensing.

Q: Can this really run on Mac Studios? How much hardware do I need?

A: Yes. The INT4 quantized version is 594GB. Developer Awni Hannun demonstrated it running on two M3 Ultra Mac Studios at 15 tokens/second. For comparison, GPT-5-class models typically need server clusters. Enterprise deployments would still use proper servers for speed.

Q: What does "200-300 sequential tool calls" mean in practice?

A: The model can autonomously chain actions such as searching the web, reading documents, running calculations, and synthesizing results without stopping for human input. Think: "research this topic" becomes search → analyze 20 sources → compile report → format output, all automatic.

Q: Who are the "AI Tigers" and why do they matter?

A: China's six leading AI companies: DeepSeek, Moonshot, Qwen (Alibaba), MiniMax, Zhipu, and 01.ai. They've gone from little-known to matching OpenAI in under two years. DeepSeek started the trend; now they're releasing frontier models on monthly cadences while US labs take quarters.

Q: How exactly does K2's pricing compare to GPT-5 for a typical workload?

A: For 10 million input tokens and 10 million output tokens monthly (a typical enterprise workload): K2 costs $6 for input plus $25 for output, $31 total. GPT-5 costs $12.50 plus $100, or $112.50. That's roughly 72% cheaper, or you can self-host K2 and pay only infrastructure costs.
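
For readers who want that arithmetic explicit, a minimal sketch; the K2 rates are Moonshot's list prices quoted above, while the GPT-5 rates are back-calculated from the figures in this answer and should be treated as assumptions about the relevant tier:

```python
# Monthly cost comparison at the workload above:
# 10M input tokens and 10M output tokens per month.

RATES_PER_MILLION = {           # (input $/M tokens, output $/M tokens)
    "Kimi K2 Thinking": (0.60, 2.50),   # Moonshot list price, no cache hits
    "GPT-5": (1.25, 10.00),             # assumed tier, implied by the figures above
}

INPUT_M, OUTPUT_M = 10, 10      # millions of tokens per month

costs = {}
for model, (rate_in, rate_out) in RATES_PER_MILLION.items():
    costs[model] = INPUT_M * rate_in + OUTPUT_M * rate_out
    print(f"{model}: ${costs[model]:,.2f}/month")

savings = 1 - costs["Kimi K2 Thinking"] / costs["GPT-5"]
print(f"K2 is about {savings:.0%} cheaper at this volume")  # ~72%
```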

