Jannis Fedoruk-Betschki, Austrian founder of managed Ghost hosting platform Magic Pages, built a 20-task TypeScript benchmark modeled on six months of actual commits and tested 16 models against it. Alibaba's open-weights Qwen3-Coder-Next scored 94.8% against Claude Opus 4.6's 98.8%, with the gap collapsing to zero after a single self-correction iteration. The results, published recently, land as Anthropic acknowledges its Claude Code subscribers are burning through session limits far faster than the company projected.
Key Takeaways
- Ghost hosting operator tested 16 coding models on a custom 20-task TypeScript benchmark built from real commits
- Open-weights Qwen3-Coder-Next scored 94.8% against Claude Opus 4.6's 98.8%, gap vanished after one self-correction pass
- Rate-limit frustrations with Claude Code's $200/month subscription triggered the investigation
- MiniMax, GLM, and Kimi models all score above 92% at fractions of proprietary pricing
The benchmark that SWE-bench can't replicate
Standard coding benchmarks test Python bug fixes in open-source repos. Fedoruk-Betschki's stack is TypeScript, Hono, Zod, Docker Swarm, and MongoDB. SWE-bench tells him nothing about his actual work.
He built a harness with synthetic codebases to avoid training data contamination, prompts describing real tasks, and hidden test suites. The tasks came from his git log: API endpoints with three-layer architecture, auth failures pulled from production incidents, webhook idempotency bugs that once created duplicate customer sites when Paddle fired a webhook twice. No academic toy problems. Just the work.
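The webhook idempotency task hints at the pattern the tests demand. A minimal sketch, not Magic Pages' actual code: record each provider event ID and refuse to repeat the side effect, so a retried delivery cannot create a duplicate site.

```typescript
// Hypothetical sketch of webhook idempotency: process each provider
// event ID at most once. Names and structure are illustrative, not
// Fedoruk-Betschki's implementation.
const processedEvents = new Set<string>();

function handleWebhook(eventId: string, createSite: () => string): string {
  if (processedEvents.has(eventId)) {
    // Paddle, like most providers, retries deliveries; acknowledge
    // the duplicate without re-running the side effect.
    return "duplicate-ignored";
  }
  processedEvents.add(eventId);
  return createSite();
}
```

In production the dedupe set would live in the database rather than memory, but the shape of the check is the same.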
A Microsoft Research paper found LLMs can identify buggy file paths 76% of the time from issue descriptions alone on SWE-bench. Performance dropped to 53% on outside repositories. OpenAI's own audit flagged 59.4% of SWE-bench Verified problems as flawed. Fedoruk-Betschki's approach sidesteps both contamination and broken test suites.
Thirteen models above 92%
The top 13 of 16 models all cleared 92%. Every one aced bug fixes at 100%. All wrote working Ghost API clients, DNS validators, Docker Compose configs.
Only one task separated first place from seventh: a Hono CRUD endpoint where Qwen imported the auth middleware but never applied it, so tests that expected a 401 got a 500 instead. Fed its own broken code and the failing test output, the model fixed the issue in a single pass. Obvious, once you see it.
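That failure mode is easy to reproduce in miniature. A hypothetical reconstruction in plain TypeScript rather than Hono, to keep it dependency-free: a middleware function that exists in the module does nothing until it is actually wired into the route.

```typescript
// Hypothetical reconstruction of the bug class: an auth check that is
// defined but never applied to the handler.
type Handler = (authHeader?: string) => number;

// "Middleware" that validates a bearer token before the handler runs.
const requireAuth =
  (next: Handler): Handler =>
  (auth) =>
    auth?.startsWith("Bearer ") ? next(auth) : 401;

// Handler assumes the middleware already ran; with no auth context it
// throws, which surfaces as a server error.
const listItems: Handler = (auth) => {
  if (auth === undefined) throw new Error("no auth context");
  return 200;
};

const respond = (route: Handler, auth?: string): number => {
  try {
    return route(auth);
  } catch {
    return 500; // unhandled error becomes a 500
  }
};

// Buggy wiring: requireAuth defined but never applied, so the test
// that expects 401 sees 500.
const buggyRoute: Handler = listItems;
// The one-pass fix: wrap the handler with the middleware.
const fixedRoute: Handler = requireAuth(listItems);
```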
Fedoruk-Betschki argues the self-correction loop isn't a workaround. Claude Code, Codex, and every other serious coding assistant run tests before declaring victory. Add that feedback cycle and the leaderboard flattens.
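The feedback cycle itself is simple to express. A hedged sketch of the generic loop, with the model call stubbed out; the harness interface here is an assumption, not Fedoruk-Betschki's actual code: generate, run the tests, feed failures back, repeat until green or out of attempts.

```typescript
// Generic test-and-fix loop, the shape most coding agents share.
// `generate` stands in for a model call; `runTests` for the hidden
// test suite. Both signatures are illustrative assumptions.
type TestResult = { pass: boolean; output: string };

function selfCorrect(
  generate: (prompt: string) => string,
  runTests: (code: string) => TestResult,
  task: string,
  maxIterations = 3,
): { code: string; iterations: number; pass: boolean } {
  let prompt = task;
  let code = "";
  for (let i = 1; i <= maxIterations; i++) {
    code = generate(prompt);
    const result = runTests(code);
    if (result.pass) return { code, iterations: i, pass: true };
    // Feed the broken code and the failing test output back in,
    // exactly as the benchmark did with Qwen's missing middleware.
    prompt = `${task}\n\nPrevious attempt:\n${code}\n\nTest output:\n${result.output}\nFix it.`;
  }
  return { code, iterations: maxIterations, pass: false };
}
```

With a loop like this in place, a model that fails once but repairs itself on the next pass scores the same as one that gets it right first try, which is why the gap collapsed.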
What triggered the investigation
Fedoruk-Betschki's Claude Code Max subscription costs $200 per month. It started throttling him mid-session with rate limit errors while his dashboard showed capacity remaining. Not once. Regularly. "Without the frustration I wouldn't have done this," he told The Implicator. "I would have had a normal day coding with Claude."
Anthropic is visibly anxious about it. The company admitted last week that users are hitting limits "way faster than expected," calling it "the top priority for the team." One user claimed to have found two bugs in the Claude Code binary that silently inflate token costs by 10-20x through broken prompt caching.
Qwen3-Coder-Next runs on 30GB of VRAM at Q4 quantization. Activates 3 billion of its 80 billion parameters per token through mixture-of-experts. Generates over 100 tokens per second locally. Apache 2.0 license. No rate limits. No per-token billing. No code leaves the machine.
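Local runtimes typically expose an OpenAI-compatible HTTP endpoint, so pointing existing tooling at a locally served model is mostly a request-payload change. A sketch under that assumption; the port, endpoint URL, and model name are placeholders, not details from the article.

```typescript
// Hypothetical client for a locally served model behind an
// OpenAI-compatible /v1/chat/completions endpoint. URL and model
// name are placeholder assumptions.
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
  temperature: number;
}

function buildRequest(task: string): ChatRequest {
  return {
    model: "qwen3-coder-next", // whatever name the local runtime registers
    messages: [
      { role: "system", content: "You are a TypeScript coding assistant." },
      { role: "user", content: task },
    ],
    temperature: 0.2, // low temperature for deterministic code edits
  };
}

async function complete(task: string): Promise<string> {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildRequest(task)),
  });
  const data = await res.json();
  // No per-token billing, no rate limits, no code leaving the machine.
  return data.choices[0].message.content;
}
```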
The test coverage caveat

Fedoruk-Betschki's argument assumes well-written tests exist, and he says so directly: "If you don't have these the gap will probably stay, though I'd argue that it's a lot smaller than a few months ago."
Qodo, a code verification startup, raised $70 million last week on a related bet. Their survey found 95% of developers don't fully trust AI-generated code, but only 48% consistently review it before committing. The takeaway might be less about which model you pick and more about the test infrastructure that makes any model viable.
A market thinning from below
MiniMax M2.5 hit 80.2% on SWE-bench Verified at one-tenth to one-twentieth of Claude's per-token cost. GLM-5.1, released March 27, reached 94.6% of Claude Opus's score on Z.ai's internal coding evaluation at $1 per million input tokens versus Claude's $5. Kimi K2.5 scored 99% on HumanEval with its trillion-parameter architecture.
Fedoruk-Betschki doesn't expect open-weights models to surpass proprietary ones on single-shot tasks any time soon. Claude Opus 4.6 remains unbeaten in his benchmark on that dimension. But if you run 1,300 Ghost sites on a well-understood TypeScript stack, the 4% edge no longer justifies the vendor lock-in, the recurring cost, or the session limits that keep interrupting the work.
The hardware investment, offset by Austrian business tax treatment for capital equipment, pays back in a few months. After that, the inference is free.
Full disclosure: Implicator AI is hosted on Magic Pages servers. This is not a paid advertisement.
Frequently Asked Questions
What benchmark was used to test the coding models?
A custom 20-task TypeScript harness built by Jannis Fedoruk-Betschki of Magic Pages. Tasks covered API endpoints, bug fixes, Docker configs, React components, and MongoDB aggregations, all modeled on six months of real commits from his Ghost hosting platform.
Which open-weights model performed best?
Qwen3-Coder-Next scored 94.8%, just 4 points behind Claude Opus 4.6 at 98.8%. It runs on 30GB of VRAM using a mixture-of-experts architecture that activates only 3 billion of its 80 billion parameters per token.
Why does the 4% gap matter less in practice?
The gap came down to one task where Qwen imported auth middleware but did not call it. Given the test output, the model self-corrected in a single iteration. Real agentic coding workflows include test-and-fix loops that close these gaps.
What triggered this benchmark investigation?
Persistent rate-limit errors on Fedoruk-Betschki's $200/month Claude Code Max subscription. Anthropic acknowledged users hitting limits faster than expected, with possible prompt caching bugs inflating token costs by 10-20x.
What hardware do you need to run Qwen3-Coder-Next locally?
At Q4 quantization it fits in roughly 30GB of VRAM. With context overhead, 64GB total memory is sufficient. It generates over 100 tokens per second locally under Apache 2.0 license.