Vibe Coding
GLM-4.6 puts receipts on the table: open weights, real coding runs, cheaper tokens
Zhipu AI releases GLM-4.6 with MIT-licensed weights and something unusual: complete test logs from 74 coding evaluations. The pitch is 15% fewer tokens than its predecessor at roughly 90% lower cost than Claude, with honest caveats about where it still trails.
💡 TL;DR - The 30-Second Version
🔓 Zhipu AI released GLM-4.6 on September 30, 2025 with MIT-licensed open weights, 200K context window, and published all 74 CC-Bench test trajectories on Hugging Face for independent verification.
💰 Pricing undercuts Claude Sonnet 4.5 by 90%—$0.60 per million input tokens and $2 per million output versus Claude's $3 and $15, with GLM-4.6 using 15% fewer tokens than GLM-4.5 on the same tasks.
📊 Performance shows 48.6% win rate against Claude Sonnet 4 on CC-Bench's published head-to-head, but Zhipu explicitly states GLM-4.6 still lags Sonnet 4.5 on coding benchmarks like SWE-bench Verified (64.2% vs 70.4%).
🏗️ The model is a 355B-parameter MoE with 32B active per token (per the GLM-4.5 report), half the size of DeepSeek-V3 (671B) and one-third of Kimi K2 (1043B), enabling cheaper inference on 8 H100s or 4 H200s.
🔍 Publishing complete multi-turn trajectories with prompts, tool calls, and corrections invites scrutiny but makes benchmark gaming harder and helps teams judge real-world fit before deployment.
🎯 Cost per solved task emerges as the selection metric when performance gaps narrow—GLM-4.6 bets that token efficiency plus transparency beats proprietary models on production economics.
Zhipu’s new GLM-4.6 doesn’t just promise better coding agents—it ships the logs. Alongside a 200K-token context window and open-weights release, the company published the complete CC-Bench trajectories for 74 human-evaluated, Docker-isolated coding tasks. The headline claim: near-parity against Claude Sonnet 4 at a fraction of the price.
What’s actually new
GLM-4.6 extends context from 128K to 200K tokens and raises the maximum output to 128K, aimed squarely at multi-file, multi-turn agent work. Weights are posted under an MIT license, with local serving supported via vLLM and SGLang. That’s the accessibility story.
The performance story is tighter token use on practical jobs. Zhipu’s CC-Bench update shows GLM-4.6 finishing tasks with roughly 15% fewer tokens than GLM-4.5—important when you run agents all day. It also integrates with popular IDE agents like Claude Code, Cline, Roo Code, and Kilo Code. Real tasks, not just leaderboards.
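Because both vLLM and SGLang expose OpenAI-compatible endpoints, wiring GLM-4.6 into an existing agent is mostly a configuration change. Here is a minimal sketch, assuming a local vLLM server and the placeholder model name "glm-4.6"; the exact repo ID and served name are assumptions, not Zhipu's documented values.

```python
# Minimal sketch: calling GLM-4.6 through an OpenAI-compatible endpoint.
# Assumes a server exposing the model under the name "glm-4.6", e.g. a local
# vLLM instance started with `vllm serve zai-org/GLM-4.6` (model ID assumed).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default OpenAI-compatible route
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="glm-4.6",  # placeholder; use whatever name your server registers
    messages=[
        {"role": "system", "content": "You are a coding agent working in a Git repo."},
        {"role": "user", "content": "Write a failing unit test for the bug in utils/parse.py, then fix it."},
    ],
    temperature=0.6,   # matches the temperature Zhipu reports for CC-Bench runs
    max_tokens=4096,
)

print(response.choices[0].message.content)
print("tokens used:", response.usage.total_tokens)  # track this if cost matters
```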
Evidence, not vibes
On CC-Bench’s published head-to-head, GLM-4.6 records a 48.6% win rate versus Claude Sonnet 4, with ties at 9.5% and losses at 41.9%. Zhipu also discloses per-task token counts: GLM-4.6 averaged about 651k tokens per trajectory—roughly a 14.6% reduction versus GLM-4.5 across the same expanded test set. You can audit every interaction step. That’s unusual, and useful.
Price and the token math
List pricing undercuts U.S. rivals. On OpenRouter and Z.ai, GLM-4.6 is roughly $0.60 per million input tokens and about $2 per million output tokens (provider-specific variants hover around $2–$2.20). Anthropic lists Claude Sonnet 4.5 at $3 per million input and $15 per million output. If the capability gap is small for your workload, the cost gap is not.
Put differently: when an agent spends millions of tokens a day on planning, reading, and patching, a double-digit cut in usage multiplies through your bill. That’s the moat Zhipu is trying to dig—unit economics, not just accuracy deltas.
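To make the multiplication concrete, here is a back-of-envelope sketch using the list prices above and the roughly 651k-token average trajectory from CC-Bench. The input/output split per trajectory is not published, so the 80/20 ratio below is purely an assumption for illustration.

```python
# Back-of-envelope cost per CC-Bench-style trajectory at the list prices above.
# The input/output split is NOT published; 80/20 is an assumption.
AVG_TOKENS_PER_TASK = 651_000
INPUT_SHARE = 0.80  # assumed: agent runs are usually read-heavy

prices = {                      # (USD per 1M input tokens, USD per 1M output tokens)
    "GLM-4.6":           (0.60, 2.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
}

for model, (p_in, p_out) in prices.items():
    tokens_in = AVG_TOKENS_PER_TASK * INPUT_SHARE
    tokens_out = AVG_TOKENS_PER_TASK * (1 - INPUT_SHARE)
    cost = tokens_in / 1e6 * p_in + tokens_out / 1e6 * p_out
    print(f"{model}: ${cost:.2f} per trajectory")

# Under these assumptions (equal token usage for both models):
# GLM-4.6 ~ $0.57 vs Claude Sonnet 4.5 ~ $3.52 per task.
```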
Where it fits—and where it doesn’t
GLM-4.6 is framed as a “near-parity” contender to Claude Sonnet 4 on practical coding tasks, backed by public logs. Zhipu is explicit about the ceiling, though: it still lags Sonnet 4.5 on coding. That caveat matters if your stack depends on top-end SWE-bench or OSWorld performance. Choose for your workload, not the press release.
There’s also a question of scope. CC-Bench is Zhipu-curated, albeit openly published. That transparency helps outside teams reproduce results, spot failure modes, and judge whether the tasks match their reality. But it’s not a substitute for your own evals on your own repos and tooling. Test before you bet. Always.
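If you want to run that comparison yourself, the skeleton below is one way to start: feed both models the same tasks through OpenAI-compatible endpoints and track tokens per solved task. It is a sketch under stated assumptions, not the CC-Bench harness; the endpoints, model names, and pass checks are placeholders you would replace with your own.

```python
# Skeleton for a do-it-yourself comparison: run the same tasks against two
# OpenAI-compatible endpoints and record tokens per solved task. This is NOT
# the CC-Bench harness; endpoints, model names, and pass checks are placeholders.
from openai import OpenAI

ENDPOINTS = {
    "glm-4.6": OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY"),
    "claude-sonnet-4-5": OpenAI(base_url="https://example.com/v1", api_key="..."),  # placeholder
}

tasks = [
    {"prompt": "Fix the off-by-one bug in pagination.py and explain the change.",
     "passes": lambda out: "range(" in out},  # replace with a real check (tests, linters, etc.)
]

for model, client in ENDPOINTS.items():
    solved, tokens = 0, 0
    for task in tasks:
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": task["prompt"]}],
            temperature=0.6,
        )
        tokens += r.usage.total_tokens
        solved += task["passes"](r.choices[0].message.content)
    if solved:
        print(f"{model}: {tokens / solved:,.0f} tokens per solved task ({solved}/{len(tasks)})")
    else:
        print(f"{model}: 0 tasks solved")
```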
The transparency calculation
Most labs post scores, not traces. Zhipu released full multi-turn trajectories—prompts, tool calls, and corrections—so developers can see how the model iterates, not just whether it “won.” That decision invites scrutiny and makes benchmark-gaming harder. It also hands competitors a map of failure cases. Zhipu decided the trust trade-off is worth it.
Limits and open questions
Active-parameter details for GLM-4.6 aren’t spelled out on the model card, though the 4.5 paper specified a 355B-parameter MoE with 32B activated per token; the 4.6 card lists 357B total under MIT. For now, the safer read is that Zhipu iterated on the agent workflow—context, token discipline, and reproducible tests—more than it chased a flagship “best on everything” crown. That’s fine. It’s focused.
Why this matters
- Cost per solved task is becoming the metric that moves budgets, and GLM-4.6 pushes that number down with lower prices and leaner token use.
- Publishing full test runs raises the bar on credibility, pressuring rivals to show their work—and helping teams pick models with eyes open.
❓ Frequently Asked Questions
Q: What hardware do I need to run GLM-4.6 locally?
A: GLM-4.6 requires 8 H100 GPUs or 4 H200 GPUs for standard inference when running in FP8 precision. For full 128K context length capability, you need 16 H100s or 8 H200s. The model supports vLLM and SGLang frameworks for local serving. Server memory must exceed 1TB to ensure stable operation during model loading.
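As a rough illustration of what local serving looks like, here is a minimal vLLM sketch sharding the model across 8 GPUs. The Hugging Face repo ID and the memory settings are assumptions, and you need a vLLM build recent enough to include GLM-4.6 support plus the GPU budget described above.

```python
# Minimal sketch of local inference with vLLM across 8 GPUs.
# Repo ID "zai-org/GLM-4.6" is assumed; adjust tensor_parallel_size to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-4.6",
    tensor_parallel_size=8,   # shard the 355B MoE across 8 H100s (4 on H200s)
    max_model_len=131072,     # raise toward 200K only if your memory allows it
)

params = SamplingParams(temperature=0.6, max_tokens=2048)
outputs = llm.generate(["Refactor this function to remove the global state: ..."], params)
print(outputs[0].outputs[0].text)
```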
Q: How is CC-Bench different from SWE-bench?
A: CC-Bench tests multi-turn coding tasks in Docker-isolated environments with human evaluators providing iterative feedback, simulating real IDE agent workflows. SWE-bench focuses on single-pass GitHub issue resolution with predefined test cases. CC-Bench's 74 tasks are Zhipu-curated but published with complete trajectories. SWE-bench Verified contains 500 human-filtered GitHub issues. Both measure different aspects of coding capability.
Q: Can I use GLM-4.6 commercially with the MIT license?
A: Yes. The MIT license on Hugging Face allows commercial use, modification, and redistribution without royalties. You can deploy GLM-4.6 in production systems, build derivative products, or offer it as a service. The only requirement is including the original copyright notice and license terms in any distribution of the model weights.
Q: Is the 200K context window actually usable or just theoretical?
A: The 200K input context is functional but requires substantial GPU memory. According to Zhipu's documentation, achieving full 128K context length during inference needs 16 H100 GPUs or 8 H200 GPUs in FP8 precision. The maximum output is capped at 128K tokens regardless of input length, making it practical for large repository analysis but resource-intensive for continuous agent operation.
Q: Where can I access the published CC-Bench trajectories?
A: Zhipu posted all 74 test trajectories on Hugging Face at huggingface.co/datasets/zai-org/CC-Bench-trajectories. Each trajectory includes the complete interaction sequence: initial prompts, tool calls, environment feedback, human evaluator responses, and model corrections across multiple turns. The dataset also documents evaluation parameters: OpenHands v0.34.0 framework, 100-iteration limits, and temperature 0.6 settings.
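A quick way to pull the logs down for inspection is a plain snapshot of the dataset repo. The sketch below assumes only that the repo is public; it makes no claims about how the trajectory files are laid out inside.

```python
# Sketch: download the published CC-Bench trajectories locally for inspection.
# snapshot_download works for any public Hugging Face dataset repo regardless of
# its internal file layout; how the trajectories are structured is up to Zhipu.
from pathlib import Path
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="zai-org/CC-Bench-trajectories",
    repo_type="dataset",
)

# List the first few files that came down, then inspect the multi-turn logs.
for path in sorted(Path(local_dir).rglob("*"))[:20]:
    print(path.relative_to(local_dir))
```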