Artificial Analysis this week named Z.ai's GLM-5.2 the leading open-weight model on its Intelligence Index, the independent benchmarking firm said in a post on June 16. The model scored 51 on the v4.1 index, up 11 points from GLM-5.1, and ships under an MIT license with a 1-million-token context window. Independent reviewers who ran it over the past week reported that it matched or beat Anthropic's Opus 4.8 on several build tasks at a fraction of the cost.

AI-generated summary, reviewed by an editor. More on our AI guidelines.

Nate Herk, who runs the AI Automation channel on YouTube, switched Claude Code over to GLM-5.2 and ran it for a day. He clocked a one-shot website build at 3 minutes 59 seconds, against 14 minutes 59 seconds for Opus 4.8 on the same prompt, and said the GLM run used fewer tokens. Herk routed Claude Code to the model by setting ANTHROPIC_BASE_URL to z.ai in his settings.local.json, a swap he described to viewers as "switching out the engine of the car." He ran it on a $60-a-month Z.ai plan.
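The swap Herk describes amounts to a one-variable settings change. The sketch below is a minimal illustration, not his actual file. It assumes Claude Code reads environment overrides from an `env` object in its settings JSON. The endpoint URL is a placeholder, because the article says only that the variable pointed to z.ai, and the helper name `with_base_url` is hypothetical.

```python
import json

# Placeholder only: the article does not give the full endpoint.
# Substitute the address from Z.ai's own documentation.
PLACEHOLDER_BASE_URL = "https://example.invalid/REPLACE-ME"


def with_base_url(settings: dict, base_url: str) -> dict:
    """Return a copy of `settings` with ANTHROPIC_BASE_URL set under "env".

    Assumes (not confirmed by the article) that the settings file keeps
    environment overrides in a top-level "env" object.
    """
    updated = dict(settings)
    env = dict(updated.get("env", {}))  # copy so the input is not mutated
    env["ANTHROPIC_BASE_URL"] = base_url
    updated["env"] = env
    return updated


if __name__ == "__main__":
    existing = {"env": {"SOME_OTHER_VAR": "1"}}  # hypothetical prior settings
    print(json.dumps(with_base_url(existing, PLACEHOLDER_BASE_URL), indent=2))
```

Other keys in the file are left untouched, so removing that single entry would in principle point the tool back at its default endpoint.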

The model did not match Opus everywhere. In one head-to-head, Herk had OpenAI's Codex generate a homework task and then grade both outputs, and Codex judged Opus the more precise of the two. Codex cited a duplicate-records edge case that the GLM run missed, in which values such as true versus one tripped it up. Herk put the share of his work that genuinely needs Opus at maybe 10 to 20 percent and said GLM-5.2 handled most of the rest.

Julian Goldie, who tested the model on June 14 against Moonshot's Kimi K2.7 and Opus 4.8, said GLM-5.2 won four of his five build tests. He gave it the edge on a Temple Run-style voxel runner, a liquid metaball simulation, an Apple-style landing page, and a neon arcade game, and handed the fifth, an inner solar-system orbit map, to Kimi K2.7. "I can't believe GLM 5.2 is beating Opus 4.8," Goldie told viewers. He noted the model was so new it had not yet appeared on OpenRouter.

A third reviewer, the channel Zero to MVP, ran GLM-5.2 through an easy-to-hard coding gauntlet on June 17. The model handled a single sorting algorithm "perfectly" and built a six-algorithm visualization "very well." On the hardest task it compiled a Hantavirus-spread simulation written in Rust on the first try. "The UI is just slightly out of place, but overall I am impressed," the tester said in the video, and called the result a pass on "all my tests with excellent results."

The same Artificial Analysis report recorded a higher token cost. GLM-5.2 spends 43,000 output tokens per Intelligence Index task, of which 37,000 are reasoning tokens, up from 26,000 for GLM-5.1 and above the 35,000 Kimi K2.6 uses. That works out to about 46 cents per task, against roughly 5 cents for DeepSeek V4 Pro at its maximum setting. Artificial Analysis still placed the model on its intelligence-versus-cost Pareto frontier, citing the lowest cost per task among models at its intelligence level.
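A model can cost more per task and still sit on a Pareto frontier, as long as no rival is both at least as smart and at least as cheap. The sketch below illustrates that test with the two cost figures the article gives (about 46 cents and roughly 5 cents) and the index scores of 51 and 44 reported by Artificial Analysis. The third entry is invented purely to show what a dominated point looks like. This is an illustration of the general idea, not Artificial Analysis's own method.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    score: float          # Intelligence Index score
    cost_per_task: float  # approximate US dollars per index task


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Keep models that no other model matches or beats on both axes."""
    frontier = []
    for m in models:
        dominated = any(
            o.score >= m.score
            and o.cost_per_task <= m.cost_per_task
            and (o.score > m.score or o.cost_per_task < m.cost_per_task)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return frontier


models = [
    Model("GLM-5.2", 51, 0.46),
    Model("DeepSeek V4 Pro (max)", 44, 0.05),
    # HYPOTHETICAL entry, not from any report: lower score at higher cost.
    Model("hypothetical-model", 44, 0.60),
]

if __name__ == "__main__":
    for m in pareto_frontier(models):
        print(m.name)
```

With these inputs the two real models both survive: DeepSeek V4 Pro is cheaper but scores lower, and GLM-5.2 scores higher but costs more, so neither dominates the other.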

On per-token price, Herk put Z.ai's published rates at $1.40 per million input tokens and $4.40 per million output tokens, against $5 and $25 for Opus 4.8, a gap he said makes GLM-5.2 roughly five times cheaper for the same job. The model carries 744 billion total parameters and 40 billion active, the same footprint as GLM-5.1, according to Artificial Analysis; Z.ai's Hugging Face card lists 753 billion. The firm also reported GLM-5.2 ahead of MiniMax-M3 and DeepSeek V4 Pro on GDPval-AA v2, its agentic-work metric, at 1,524 to their 1,418 and 1,328, effectively level with GPT-5.5 at xhigh reasoning, which scored 1,514.
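The "roughly five times" figure depends on the mix of input and output tokens in a job, because the two price gaps differ. The sketch below works through the arithmetic using only the per-million-token rates Herk quoted. The function names are illustrative, and the token mixes are arbitrary examples rather than measurements from his tests.

```python
# Per-million-token prices in US dollars, as quoted by Herk in the article.
GLM_52 = {"input": 1.40, "output": 4.40}
OPUS_48 = {"input": 5.00, "output": 25.00}


def job_cost(prices: dict, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one job at the given per-million-token prices."""
    return (
        input_tokens * prices["input"] + output_tokens * prices["output"]
    ) / 1_000_000


def price_ratio(input_tokens: int, output_tokens: int) -> float:
    """How many times more the same job costs on Opus 4.8 than on GLM-5.2."""
    return job_cost(OPUS_48, input_tokens, output_tokens) / job_cost(
        GLM_52, input_tokens, output_tokens
    )


if __name__ == "__main__":
    print(round(price_ratio(1_000_000, 0), 2))          # input tokens only
    print(round(price_ratio(0, 1_000_000), 2))          # output tokens only
    print(round(price_ratio(1_000_000, 1_000_000), 2))  # equal mix
```

At these rates the gap is about 3.6 times on input tokens alone, about 5.7 times on output tokens alone, and about 5.2 times for an equal mix, so output-heavy jobs land closest to Herk's estimate. The comparison also assumes both models spend the same number of tokens on a job, which the token counts reported elsewhere suggest is not guaranteed.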

Z.ai attributes part of the long-context performance to IndexShare, a design in which every four sparse-attention layers share one lightweight indexer, which the company says cuts per-token compute by a factor of 2.9 at a 1-million-token context. The company released the weights on Hugging Face and said its own runs put the model at 81.0 on Terminal-Bench 2.1 Terminus-2, up from 63.5 for GLM-5.1, and 62.1 on SWE-bench Pro, up from 58.4.
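The sharing pattern in IndexShare can be pictured as a simple layer-to-indexer mapping. The sketch below is a toy illustration only. It assumes the four layers in each group are consecutive and uses a hypothetical 8-layer stack, and the article confirms neither detail. It says nothing about how the indexer itself works.

```python
LAYERS_PER_INDEXER = 4  # from the article: every four layers share one indexer


def indexer_for_layer(layer: int) -> int:
    """Index of the shared indexer used by a given sparse-attention layer.

    Assumes consecutive layers are grouped; the real grouping may differ.
    """
    return layer // LAYERS_PER_INDEXER


def indexers_needed(num_layers: int) -> int:
    """Number of indexers for a stack, rounding up for a partial last group."""
    return -(-num_layers // LAYERS_PER_INDEXER)  # ceiling division


if __name__ == "__main__":
    hypothetical_layers = 8  # illustrative depth, not GLM-5.2's real layer count
    print([indexer_for_layer(i) for i in range(hypothetical_layers)])
    print(indexers_needed(hypothetical_layers))
```

Under these assumptions an 8-layer stack would need two indexers instead of eight, which is the kind of reduction the design is described as targeting.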

Z.ai's own numbers also mark a ceiling. On FrontierSWE, which scores multi-hour autonomous engineering projects, the company put GLM-5.2 at 74.4 against Opus 4.8's 75.1, ahead of GPT-5.5 at 72.6. On SWE-Marathon, a test of the longest sustained tasks, GLM-5.2 scored 13.0 to Opus 4.8's 26.0, a gap the open-weight model has not closed.

GLM-5.2 is already live on third-party hosts including Fireworks, Baseten, and DeepInfra. It had not reached OpenRouter as of Goldie's June 14 test.

Frequently Asked Questions

What did Artificial Analysis say about GLM-5.2?

Artificial Analysis scored GLM-5.2 at 51 on its Intelligence Index v4.1, ahead of MiniMax-M3 and DeepSeek V4 Pro max at 44 each and Kimi K2.6 at 43, naming it the leading open-weight model on the index.

Why does the 43,000-token figure matter?

It shows GLM-5.2 spends more output tokens per benchmark task than leading open peers, with 37,000 of them reasoning tokens. That can raise production costs even when the model ranks well.

How much cheaper is GLM-5.2 than Opus 4.8?

Tester Nate Herk put Z.ai's API at $1.40 per million input tokens and $4.40 per million output tokens, against $5 and $25 for Opus 4.8, a gap he said makes GLM-5.2 roughly five times cheaper for the same job.

What is IndexShare?

IndexShare lets every four sparse-attention layers share one lightweight indexer. Z.ai says that reduces per-token compute by a factor of 2.9 at a 1-million-token context length.

Is GLM-5.2 ready to replace Opus 4.8?

Testers called it a strong, cheaper option for most work but still gave Opus the edge on the hardest reasoning-heavy tasks. Z.ai's own benchmarks show GLM-5.2 trailing Opus on the longest sustained engineering tests.


San Francisco

Editor-in-Chief and founder of Implicator.ai. Former ARD correspondent and senior broadcast journalist with 10+ years covering tech. Writes daily briefings on policy and market developments. Based in San Francisco. E-mail: editor@implicator.ai