GLM-5 Undercuts Claude Opus by 86% but Safety Reviews Warn

Two weeks ago, Moonshot crowned its Kimi K2.5 the most powerful open-source language model on the planet. Engineers circulated benchmarks on social media within the hour, and procurement teams started scheduling evaluations.

That reign lasted 14 days.

On Tuesday in Beijing, z.ai released GLM-5, a 744-billion-parameter model that outscores most Western competitors on code generation while costing a fraction of what they charge. The license is MIT. No contract required. Artificial Analysis ranked it the new open-source leader within hours.

The pricing alone should stop you mid-scroll. GLM-5 costs roughly $4.20 per million tokens combined. Claude Opus 4.6 costs $30. GPT-5.2 runs $15.75. For 14 cents on Anthropic's dollar, you get 96% of Claude Opus's code generation score. If your company processes millions of tokens a day, that arithmetic is devastating for everyone charging a premium.

But hours after launch, Lukas Petersson from AI safety startup Andon Labs posted something that should give enterprise buyers real pause. After reading through GLM-5's execution traces, the step-by-step logs of how the model reasons through tasks, he called it "incredibly effective, but far less situationally aware." The model hits goals through aggressive tactics, he wrote, without reasoning about its own situation. His conclusion was blunt. "This is how you get a paperclip maximizer."

Frontier AI gets cheaper by the month. Nobody has figured out how to make it safer at the same pace.

What $4.20 buys you now

The raw specs are hard to wave away. GLM-5 runs 744 billion parameters, more than double the 355 billion in its predecessor GLM-4.5, with only 40 billion activating per token through a Mixture-of-Experts design. Pre-training data jumped to 28.5 trillion tokens.

The Argument

• GLM-5 delivers 96% of Claude Opus's coding score at 14% of the price, with an MIT license

• Chinese open-source labs are dethroning each other every two weeks, compressing Western pricing power

• Early safety reviews flag aggressive task execution without situational awareness

• Enterprise buyers face a split: regulated industries pay the premium, everyone else does the math

On SWE-bench Verified, the code generation benchmark that tests models against real GitHub issues, GLM-5 scored 77.8. That beats Google's Gemini 3 Pro at 76.2 and lands within three points of Anthropic's Claude Opus 4.6 at 80.9. On hallucination resistance, GLM-5 posted a -1 on the AA-Omniscience Index, a 35-point improvement over the previous generation and the lowest hallucination rate any model has achieved in testing.

Z.ai also built a custom training framework it calls "slime," an asynchronous reinforcement learning system that eliminates the sequential bottleneck consuming over 90% of standard RL training time. The result is faster iteration on agent behavior at a fraction of the usual training cost.

Beyond coding, z.ai is positioning GLM-5 as an end-to-end office tool. The model can turn raw prompts into formatted Word documents, PDFs, and spreadsheets, from financial reports to project proposals. That puts it in direct competition with the enterprise productivity stacks Microsoft and Google are building around their own models.

And the license is MIT. No revenue caps or usage restrictions, and no enterprise sales team to call. Deploy it on your own servers and the only bill is compute.

The two-week throne

Forget the individual model. What matters is the conveyor belt underneath.

Moonshot's Kimi K2.5 arrived in late January and held the open-source crown for exactly two weeks before GLM-5 knocked it aside. Before Kimi, DeepSeek's V3 series had dominated the conversation about what open-weight models could accomplish for fractions of a cent. Each new release arrives with better benchmarks, lower pricing, and a permissive license. Each one turns the last into a footnote.

Stay ahead of the curve

Strategic AI news from San Francisco. No hype, no "AI will change everything" throat clearing. Just what moved, who won, and why it matters. Daily at 6am PST.

No spam. Unsubscribe anytime.

Z.ai is not a garage operation. It is the commercial arm of Zhipu AI, a company backed by serious Chinese venture capital and closely tied to Tsinghua University's research apparatus. The "slime" RL framework was published openly on GitHub, inviting the broader open-source community to build on their training infrastructure. Nobody publishes their training framework when they're defending a lead. Z.ai is racing to turn frontier intelligence into a commodity.

If you are an enterprise CTO trying to standardize on a foundation model, this rotation creates a strange procurement problem. By the time your evaluation team finishes testing one Chinese open-source release, the next has already shipped. The stability that procurement departments need most is exactly what this market refuses to provide.

Western labs run on a different clock. Anthropic spent months developing and testing Claude Opus 4.6 before release. OpenAI followed a similar cadence with GPT-5.2. Both companies treat model launches as marquee events. Chinese open-source labs treat them like biweekly software sprints.

That velocity produces impressive results. It also skips steps.

Benchmarks measure power, not judgment

Petersson's paperclip warning deserves more scrutiny than it received. He spent hours reading GLM-5's execution traces and found a model that hits targets with brutal efficiency but rarely stops to question whether it should.

The reference is worth explaining. Nick Bostrom, the Oxford philosopher, came up with the paperclip maximizer back in 2003. Picture an AI told to make paperclips. It does exactly that, converting every available resource on earth into paperclips, humans included. That scenario sounds less academic when a frontier model's own traces show it bulldozing through tasks without pausing for context.

GLM-5's hallucination score is real and worth respecting. Knowing when to say "I don't know" instead of fabricating an answer matters enormously for enterprise trust. A 35-point improvement on that metric represents serious engineering. But low hallucination and high situational awareness are different skills. A model can refuse to make things up while still executing tasks in ways that cause collateral damage.

Think of it as a surgeon who never misidentifies an organ but operates without reading the patient's chart.

And the benchmark gap between GLM-5 and Claude Opus, 77.8 versus 80.9 on SWE-bench, might encode exactly this distinction. Three points on a coding benchmark sounds trivial. But those three points could be the difference between a model that solves a problem cleanly and one that solves it by steamrolling edge cases. Benchmarks reward completion. They do not measure restraint.

Enterprise buyers evaluating GLM-5 would do well to run their own safety tests beyond the standard benchmark suites. Give the model ambiguous instructions. Feed it conflicting constraints. Watch how it resolves them. A model built for aggressive task completion will pick the fastest path every time. One with genuine situational awareness would flag the conflict or ask for clarification instead. That behavioral gap never shows up on a leaderboard.

The pricing squeeze Western labs can't dodge

For enterprise procurement, the numbers are uncomfortable.

GLM-5 at $4.20 per million tokens combined. Claude Opus at $30. GPT-5.2 at $15.75. Run ten million tokens through GLM-5 every day and your bill is $42. Do the same on Claude Opus and it's $300. Multiply that out over twelve months and you've saved $94,000. Scale to a hundred million tokens daily, where the largest enterprise deployments operate, and the annual savings approach a million dollars.

The awkward truth for Western AI companies is that most enterprise use cases do not need the last three points on a coding benchmark. Customer support automation, document summarization, data extraction, internal knowledge search. These workflows need reliability and speed at a manageable price. They do not need the model that scores 80.9 instead of 77.8 on a developer benchmark most procurement officers have never heard of.

Western labs have two options and neither is comfortable. Cut prices, which compresses margins that investors are already watching with visible anxiety. Or differentiate on something the pricing table cannot display. Safety and alignment work. The kind of situational awareness that Petersson found absent from GLM-5's execution logs.

Anthropic is leaning hard into the second path. The company built its brand on responsible AI development, and Claude's stronger performance on tasks requiring judgment gives it a story worth telling. But "we cost seven times more because we're more careful" is a rough pitch to a CFO reviewing quarterly token spend. Especially when the cheaper model passes most of the same public benchmarks.

OpenAI faces the same squeeze from a steeper angle. GPT-5.2 Pro at $189 per million tokens combined looks almost absurd next to $4.20. Even base GPT-5.2 at $15.75 costs nearly four times as much. Both companies are now anxious about something they didn't think about a year ago. Commodity-priced open-source AI pulling the middle of the enterprise market away before premium providers can prove their safety markup is worth paying.

What the next two weeks will tell you

For regulated industries, financial services, healthcare, defense, the calculation may already be settled. Data residency requirements alone make a Chinese-trained model a non-starter for many applications. Add the alignment questions Petersson raised, and the procurement decision gets simple. Pay the premium. Sleep soundly.

For everyone else, the math is messier and the consequences are lower. A startup building internal developer tools cares about cost and throughput above all. A mid-market company automating document workflows wants the cheapest model that clears its quality bar. These buyers are GLM-5's real addressable market, and most of them are not losing sleep over alignment scores. There are far more of them than there are defense contractors.

The real test will not come from leaderboards. It will come from the first enterprise that deploys GLM-5 at production scale and discovers what those execution traces look like when the model handles real customer data, real edge cases, and real stakes. Petersson read the traces in a controlled setting. Production will write its own verdict.

Somewhere in China right now, another lab is training the model that will dethrone GLM-5. It will arrive in weeks, not months, with higher benchmark scores and a lower price tag. And someone will read its traces and flag the same gap Petersson found this week.

The conveyor belt keeps accelerating. It does not slow down to read the patient's chart.

Frequently Asked Questions

Q: What is z.ai and how is it connected to Zhipu AI?

A: Z.ai is the commercial arm of Zhipu AI, a Chinese AI company backed by major venture capital and closely tied to Tsinghua University. Zhipu AI was also rumored to be behind 'Pony Alpha,' a stealth model that previously topped coding benchmarks on OpenRouter.

Q: What does the paperclip maximizer warning mean for GLM-5?

A: AI safety researcher Lukas Petersson found that GLM-5 pursues goals through aggressive tactics without reasoning about context or consequences. The paperclip maximizer, a thought experiment by philosopher Nick Bostrom, describes an AI that optimizes a single objective so relentlessly it causes catastrophic collateral damage. Petersson sees early signs of that pattern in GLM-5's execution traces.

Q: How does GLM-5 pricing compare to Western competitors?

A: GLM-5 costs roughly $4.20 per million tokens combined. Claude Opus 4.6 costs $30 and GPT-5.2 costs $15.75 for the same volume. At enterprise scale of ten million tokens daily, that translates to $42 versus $300 per day, or roughly $94,000 in annual savings.

Q: Can enterprises safely deploy a Chinese-trained open-source model?

A: It depends on the industry. Regulated sectors like financial services, healthcare, and defense face data residency restrictions that make Chinese-trained models impractical. For less regulated use cases like internal tools and document automation, the decision comes down to whether the cost savings outweigh the alignment and geopolitical risks.

Q: What is the slime training framework and why does it matter?

A: Slime is z.ai's custom asynchronous reinforcement learning system. Traditional RL training wastes over 90% of compute time on sequential bottlenecks. Slime breaks that pattern, allowing faster iteration on complex agent behavior at lower cost, which partly explains how Chinese labs ship frontier models every few weeks.

Analysis

Marcus Schuler

San Francisco

Editor-in-Chief and founder of Implicator.ai. Former ARD correspondent and senior broadcast journalist with 10+ years covering tech. Writes daily briefings on policy and market developments. Based in San Francisco. E-mail: editor@implicator.ai

GLM-5 Costs 86% Less Than Claude Opus. The Safety Gap Might Cost More.

What $4.20 buys you now

The two-week throne

Stay ahead of the curve

Benchmarks measure power, not judgment

The pricing squeeze Western labs can't dodge

What the next two weeks will tell you

Frequently Asked Questions

Marcus Schuler

Get the Morning Briefing in your inbox.

Related Stories

Thinking Machines’ Inkling Takes U.S. Open-Model Lead With 41 Score

Demis Hassabis Draws Praise for Up-to-30-Day Frontier AI Review Plan

Claude Fable 5 and GPT-5.6-Sol: How to Orchestrate Claude and Codex to Ship More Reliable Code