For a few hours on April 7, Z.ai looked like it had won one of artificial intelligence's favorite parlor games: topping a coding benchmark. By evening, Anthropic had taken back the crown. But Z.ai may still have exposed something more consequential: that the industry's favorite benchmarks measure the wrong skill. GLM-5.1's real claim is not that it is marginally better at coding than rivals. It is that it can keep working long after most models run out of ideas.

GLM-5.1 scored 58.4 on SWE-Bench Pro, edging past GPT-5.4 at 57.7 and Claude Opus 4.6 at 57.3. VentureBeat ran a full feature. Simon Willison tested it with an SVG pelican and called the output his "new favorite from an open weights model." Before the day ended, Anthropic's Mythos Preview system card showed 77.8% on the same test. Nearly 20 points higher.

The crown lasted about as long as a standup meeting. The model that lost it may matter more than the one that reclaimed it. Because GLM-5.1 can work autonomously for eight continuous hours, iterating through hundreds of strategy changes, debugging its own code, optimizing without human intervention. If that claim survives independent testing, the relevant metric for AI coding agents just shifted from intelligence to endurance. And it arrives as open source, under the MIT license, from a company that trained it without a single American chip.

The clock nobody watches

SWE-Bench Pro evaluates whether a model can fix a real GitHub issue in a single session. It is a snapshot test. You hand the model a bug, it hands back a patch, and the exchange takes minutes.

GLM-5.1 was designed for something structurally different. Z.ai calls it "long-horizon agentic engineering." In practice: hand the model a vague objective, walk away, and check back when it finishes. In a published evaluation, the model spent 655 iterations and over 6,000 tool calls optimizing a vector database. It didn't plateau after 50 rounds or stall at 100. It kept restructuring its own approach. At iteration 90, it abandoned full-corpus scanning for IVF cluster probing with vector compression. At iteration 240, it introduced a two-stage pipeline using prescoring and reranking. Six structural breakthroughs across 600 iterations, reaching 21,500 queries per second, six times the best result any model achieved in a standard 50-turn session.
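To make those restructurings concrete, here is a minimal sketch, using FAISS, of the two shifts described above: replacing a brute-force full-corpus scan with an IVF index over compressed vectors, then adding a prescore-and-rerank stage. It illustrates the technique only; Z.ai has not published the agent's actual code, and the corpus size, cluster count, and candidate depth below are assumptions.

```python
# Illustrative sketch, not Z.ai's code: brute-force scan vs. IVF + PQ compression,
# plus a two-stage prescore/rerank pipeline, using FAISS.
import numpy as np
import faiss

d = 128                                            # vector dimensionality (assumed)
corpus = np.random.rand(100_000, d).astype("float32")
queries = np.random.rand(10, d).astype("float32")

# Baseline: exact full-corpus scan (the approach reportedly abandoned at iteration 90).
flat = faiss.IndexFlatL2(d)
flat.add(corpus)

# IVF + product quantization: probe a few clusters, store compressed codes.
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 1024, 16, 8)  # 1024 clusters, 16 sub-quantizers
ivfpq.train(corpus)
ivfpq.add(corpus)
ivfpq.nprobe = 8                                     # clusters probed per query

# Two-stage pipeline: cheap prescoring on compressed codes, exact rerank of the top 100.
_, candidate_ids = ivfpq.search(queries, 100)
for q, ids in zip(queries, candidate_ids):
    exact = np.linalg.norm(corpus[ids] - q, axis=1)  # exact distances on candidates only
    top10 = ids[np.argsort(exact)[:10]]
```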

Z.ai calls this trajectory the "staircase": long stretches of incremental tuning punctuated by structural leaps where the model drops one approach entirely and invents another. Most models exhaust their bag of tricks in the first hundred tool calls. GLM-5.1 was still finding new tricks at iteration 500.

The pattern repeated on KernelBench Level 3, which requires end-to-end GPU kernel optimization of complete machine learning architectures. GLM-5.1 sustained useful optimization past 1,000 tool-use turns, delivering a 3.6x geometric mean speedup across 50 problems. Claude Opus 4.6 leads this specific test at 4.2x, but the gap narrowed substantially from GLM-5, which plateaued early at 2.6x.
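For readers unfamiliar with the metric, a geometric mean speedup rewards consistent gains across all 50 problems rather than one outsized win. A quick sketch with toy numbers, not KernelBench data:

```python
# Minimal sketch of the reported metric, not KernelBench's harness.
import math

def geomean_speedup(baseline_times, optimized_times):
    ratios = [b / o for b, o in zip(baseline_times, optimized_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# One 10x win cannot mask mediocre results elsewhere:
print(geomean_speedup([10, 10, 10, 10], [1, 9, 9, 9]))  # ~1.9x, vs a 3.3x arithmetic mean
```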

If you have ever assigned an optimization task to a junior engineer and returned four hours later to find them cycling through the same approach, you understand what that staircase pattern is worth. Not smarter. More stubborn.

The most striking demonstration was also the least quantifiable. Given a prompt to build a Linux-style desktop environment as a web application, GLM-5.1 ran for eight continuous hours with no starter code and no intermediate guidance. Early on, it produced the expected skeleton, a taskbar and a placeholder window. But the model kept looping back through its own output, identifying gaps, and building more. File browser, terminal, text editor, system monitor, calculator, functional games. By the end, a complete desktop running in the browser. Polished. Consistent.

No coding benchmark captures that. No leaderboard ranks it. The model basically ran a shift.

The entity list that didn't hold

The geopolitics here are blunt. Z.ai was placed on the US Entity List in January 2025, cutting it off from every American chip manufacturer. The company trained GLM-5.1's parent model on 100,000 Huawei Ascend 910B chips using Huawei's MindSpore framework. Zero American silicon touched this model.

And the result beat Claude Opus 4.6 by 1.1 points on the coding benchmark that matters most to enterprise buyers. Export controls were supposed to slow Chinese AI development by choking off its access to compute. GLM-5.1 doesn't challenge that theory. It mocks it.

The model ships under the MIT license with weights available on Hugging Face, runnable on any infrastructure a developer chooses. What sanctions restrict in hardware, open source unlocks in distribution. Z.ai went public on the Hong Kong Stock Exchange in January 2026, raising $558 million at a $6.6 billion valuation; the stock has rallied more than 250 percent since. Three model releases have followed since February, from GLM-5 to GLM-5-Turbo to GLM-5.1, each building developer mindshare while Western labs charge five to eight times more per token. Z.ai looks emboldened. Washington should feel exposed. That release cadence would be aggressive for any lab. For one operating under active sanctions, it is a pointed demonstration.

Anthropic accused three Chinese AI labs, DeepSeek, Minimax, and Moonshot AI, of creating over 24,000 fraudulent accounts and training their models using more than 16 million exchanges with Claude. Z.ai was not named. But the accusation hangs over the entire Chinese AI ecosystem, and nobody at Z.ai has addressed it publicly.

Five dollars versus thirty

Pricing tells the story that benchmarks bury. GLM-5.1 costs $1.40 per million input tokens and $4.40 per million output tokens. Claude Opus 4.6 charges $5 and $25 respectively. That makes GLM-5.1 roughly 3.5 times cheaper on input and nearly 6 times cheaper on output.

For a quick coding question, the gap is pocket change. For an eight-hour autonomous agent session generating thousands of tool calls and hundreds of thousands of tokens, it determines whether the workflow is viable or unaffordable. The Z.ai Coding Plan starts at $27 per quarter. Claude Max runs $100 to $200 per month.
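A rough back-of-envelope comparison shows why. The token volumes below are illustrative assumptions, not measurements from Z.ai's demos; the prices are the published per-million-token rates.

```python
# Back-of-envelope cost of one long agent session. Token volumes are assumed.
def session_cost(input_mtok, output_mtok, price_in, price_out):
    return input_mtok * price_in + output_mtok * price_out

# Assume an 8-hour session burns ~5M input tokens (context re-reads, tool results)
# and ~1.5M output tokens (plans, patches, tool calls).
glm = session_cost(5, 1.5, 1.40, 4.40)    # GLM-5.1:  $1.40 / $4.40 per Mtok
opus = session_cost(5, 1.5, 5.00, 25.00)  # Opus 4.6: $5.00 / $25.00 per Mtok
print(f"GLM-5.1: ${glm:.2f}  Opus 4.6: ${opus:.2f}")  # ~$13.60 vs ~$62.50
```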

Early adopters have converged on a tiered strategy: GLM for the daily grind, Opus for the hard problems. That approach works only if the cheaper model is good enough for routine tasks. At 94.6% of Opus performance on Z.ai's own coding evaluation, it appears to be. One caveat: that figure is self-reported. No independent lab has corroborated it.

The MIT license adds another layer. Teams with GPU infrastructure can self-host GLM-5.1 and pay zero per-token costs beyond their own compute. At 744 billion parameters, this is not a laptop model. But for organizations that need data sovereignty, compliance, or simply refuse to route their code through a third-party API, the option exists. It didn't six months ago.
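For teams exploring that route, the serving setup might look something like the sketch below, using vLLM as one common open-source inference server. The repository id and parallelism settings are placeholders, not confirmed values for GLM-5.1, and a model of this size would need at least a multi-GPU node.

```python
# Hypothetical self-hosting sketch with vLLM. Repo id and settings are placeholders,
# not confirmed values for GLM-5.1.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-5.1",        # placeholder Hugging Face repo id
    tensor_parallel_size=8,          # a 744B-parameter model needs many GPUs per replica
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.2, max_tokens=1024)
outputs = llm.generate(["Write a unit test for the parser module."], params)
print(outputs[0].outputs[0].text)
```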

But the economics that matter most aren't per-token pricing. They are per-hour productivity. If a model works eight hours autonomously, you stop supervising and start delegating. Assign a task at the end of your workday. The pull request waits in the morning. That changes sprint planning, staffing math, and what a single developer can ship in a week. The solo developer who used to ship one feature per sprint might ship three. The startup that budgeted for four engineers might get by with two and a fleet of overnight agents. That is the math that keeps engineering managers anxious, and it only works if the model can actually sustain coherent work for eight hours. So far, Z.ai's demonstrations are the only evidence. Real-world production data doesn't exist yet.

What the numbers actually reveal

Step back from SWE-Bench Pro and GLM-5.1's profile gets more complicated. On Humanity's Last Exam, it scores 31.0 without tools, against 36.7 for Claude Opus 4.6 and 45.0 for Gemini 3.1 Pro. On GPQA-Diamond, the graduate-level science benchmark, GLM-5.1 posts 86.2 versus Gemini's 94.3. On AIME 2026, it reaches 95.3, trailing GPT-5.4's 98.7.

The pattern is consistent. GLM-5.1 leads or matches on coding and agentic tasks. It trails on pure reasoning and math. Z.ai deliberately targeted its reinforcement learning pipeline at practical coding and multi-step execution, trading mathematical reasoning for engineering persistence. The tradeoff reflects a strategic bet: customers willing to pay are building software, not proving theorems.

Then there is the caveat that shadows every Chinese AI benchmark: these numbers are self-reported. Z.ai's prior scores for GLM-5 held up under third-party testing, which earns some credibility. Anthropic's Dario Amodei has publicly argued that Chinese models "tend to be benchmark-optimized and distilled from US labs." Whether that critique applies here depends on independent verification nobody has yet conducted. GLM-5.1 went to Coding Plan subscribers March 27 and launched publicly today. Give it time. But the direction of the argument is clear: even if GLM-5.1's benchmark numbers are inflated by 10%, the eight-hour autonomy claim is the part that changes the competitive equation. Benchmarks can be gamed. Working eight hours cannot.

The crown that proved the crown is beside the point

On the same day GLM-5.1 claimed the top spot, Anthropic revealed its unreleased Mythos Preview model scores 77.8% on SWE-Bench Pro. Nearly 20 points ahead. On SWE-bench Verified, Mythos hits 93.9%, thirteen points above any publicly available model.

You cannot buy Mythos. Anthropic restricted it to closed security partners. But its existence confirms that the distance between internal lab capabilities and public products is wider than most observers assumed. GLM-5.1's moment at the top of a public leaderboard didn't reflect the actual state of the art. It reflected what companies choose to sell. The frontier is not where the leaderboard says it is. It is wherever Anthropic's internal cluster finished training last.

The AI industry has spent eighteen months treating coding benchmarks the way track and field treats the 100-meter dash: the definitive test of speed. GLM-5.1 suggests the race that matters is the ultramarathon. Not who writes the best patch in ten minutes, but who delivers a working system after eight hours of unattended labor. That competition hasn't been formalized yet. No leaderboard tracks it. No benchmark captures it.

Z.ai just ran the first heat. Open source, priced for volume, trained on sanctioned chips, available to anyone with a Hugging Face account and the hardware to run 744 billion parameters. The model that held a benchmark crown for half a day may have proven, more convincingly than any score, that the scoreboard needs replacing.

Frequently Asked Questions

What is GLM-5.1 and who made it?

GLM-5.1 is a 744-billion parameter open-source language model from Z.ai (formerly Zhipu AI), a publicly traded Chinese AI company. It is designed for long-horizon coding and engineering tasks and, according to Z.ai, can work autonomously for up to eight hours.

How does GLM-5.1 compare to Claude Opus 4.6 and GPT-5.4?

On SWE-Bench Pro, GLM-5.1 scored 58.4 versus GPT-5.4's 57.7 and Claude Opus 4.6's 57.3. However, it trails both on reasoning benchmarks like GPQA-Diamond and Humanity's Last Exam. Its main advantage is sustained autonomous execution over hours.

What does eight-hour autonomous execution mean?

Unlike standard coding models that handle single tasks in minutes, GLM-5.1 can work continuously on a complex objective for up to eight hours, restructuring its approach multiple times. Z.ai demonstrated this with a vector database optimization that ran 655 iterations and 6,000 tool calls.

Is GLM-5.1 really open source?

Yes. GLM-5.1 is released under the MIT license with model weights available on Hugging Face and ModelScope. The MIT license permits unrestricted commercial use, modification, and redistribution.

How much does GLM-5.1 cost compared to competitors?

GLM-5.1 costs $1.40 per million input tokens and $4.40 per million output tokens. Claude Opus 4.6 charges $5 and $25 respectively. The Z.ai Coding Plan subscription starts at $27 per quarter versus Claude Max at $100-200 per month.
