Anthropic shipped Claude Opus 4.7 on Thursday, closing a strange ten-week stretch. Opus 4.6 had landed February 5. Google's Gemini 3.1 Pro arrived two weeks later. OpenAI's GPT-5.4 came out on March 5. Today's launch finishes the set.
Four flagship reasoning models. Ten weeks. One news cycle.
On paper the four look nearly identical now. GPQA Diamond? Everyone hovering inside a point of each other. Context window? Roughly a million tokens each. Training run? Each lab burned enough compute to light a small city. That's the convergence story and it's boring.
The interesting story is where they split. Coding scores don't line up. Tool use doesn't line up. Vision doesn't line up. Computer control doesn't line up. And the monthly bill definitely doesn't line up. This piece is a stack of tables. Read it that way.
Key Takeaways
- Four frontier reasoning models shipped in ten weeks. Opus 4.6, Gemini 3.1 Pro, GPT-5.4, and Opus 4.7 all cleared 94% on GPQA Diamond.
- Opus 4.7 leads coding. It jumped from 53.4% to 64.3% on SWE-bench Pro and holds the MCP-Atlas tool-use crown at 77.3%.
- GPT-5.4 leads terminal work and computer use. It beat the human expert on OSWorld and hit 75.1% on Terminal-Bench 2.0.
- Gemini 3.1 Pro leads price, speed, and modality breadth. Blended pricing comes in at less than half of Opus 4.7's rate, with native video and audio input.
AI-generated summary, reviewed by an editor. More on our AI guidelines.
Release timeline
| Lab | Model | Release date | Status |
|---|---|---|---|
| Anthropic | Opus 4.6 | 2026-02-05 | Generally available |
| Google | Gemini 3.1 Pro | 2026-02-19 | Preview (API + Vertex AI) |
| OpenAI | GPT-5.4 | 2026-03-05 | Generally available |
| Anthropic | Opus 4.7 | 2026-04-16 | Generally available today |
| Anthropic | Claude Mythos Preview | Not disclosed | Invitation-only (Project Glasswing) |
Mythos is the awkward extra entry. Anthropic says it sits above Opus 4.7 on every capability test the company ran. You can't have it. Access requires a vetted application, and approvals do not go out often.
Core specifications
| Dimension | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Context window | 1M tokens | 1M tokens | 272K default, 1M opt-in | 1,048,576 tokens |
| Max output | 128K tokens | 128K tokens | Not published | 65,536 tokens |
| Knowledge cutoff | May 2025 | Jan 2026 | Aug 2025 | Not published |
| Extended thinking | Yes | Adaptive only | Yes | Yes |
| Input modalities | Text, image | Text, image | Text, image | Text, image, speech, video |
| Output modalities | Text | Text | Text | Text |
| Native computer use | API-level | API-level | Native, no plugin | API-level |
Three quick reads from this table. Opus 4.7 is the freshest by roughly five months. Gemini swallows the widest set of inputs, including raw video. Only GPT-5.4 controls a desktop without a wrapper library in between. That last one matters less than the marketing suggests, though. A good wrapper is cheap.
Pricing (per million tokens)
| Tier | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Input (standard) | $5.00 | $5.00 | $2.50 | $2.00 |
| Output (standard) | $25.00 | $25.00 | $15.00 | $12.00 |
| Input (>200K or >272K) | $10.00 | $5.00 | 2x rate | $4.00 |
| Output (>200K or >272K) | $37.50 | $25.00 | 1.5x rate | $18.00 |
| Prompt cache discount | Up to 90% | Up to 90% | Available | Available |
| Batch discount | 50% | 50% | 50% | 50% |
| Blended $/M (in+out/2) | $15.00 | $15.00 | $8.75 | $7.00 |
Run the math. Opus 4.7 costs 2.14 times the Gemini rate at the base tier, and 1.71 times GPT-5.4. Anthropic did not cut prices today. It did not match the cheaper rivals either. That is a statement. The company believes the coding and agent work pays for itself.
But the sticker price hides a second lever. Opus 4.7 introduces a new effort tier called xhigh, plus task budgets in public beta. Both change how the bill actually lands. The next table is the one developers will keep open on a second monitor.
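The blended numbers above are easy to verify. A few lines of arithmetic reproduce the ratios; the prices are the standard-tier rates from the table, and the simple (input + output) / 2 blend is this article's convention, not an official metric:

```python
# Standard-tier prices per million tokens, taken from the table above.
PRICES = {
    "opus-4.7":       {"in": 5.00, "out": 25.00},
    "gpt-5.4":        {"in": 2.50, "out": 15.00},
    "gemini-3.1-pro": {"in": 2.00, "out": 12.00},
}

def blended(model: str) -> float:
    """Simple (input + output) / 2 blend used in the pricing table."""
    p = PRICES[model]
    return (p["in"] + p["out"]) / 2

opus = blended("opus-4.7")                          # 15.00
print(round(opus / blended("gemini-3.1-pro"), 2))   # 2.14
print(round(opus / blended("gpt-5.4"), 2))          # 1.71
```

Real bills skew toward whichever side of the ledger your workload stresses; an output-heavy agent loop will land further from this blend than a retrieval-heavy one.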
Opus 4.7 effort levels and what they cost you
| Effort level | What Claude does | Token usage vs `high` | When to pick it |
|---|---|---|---|
| `low` | Minimal thinking. Skips reasoning on simple tasks. Combines tool calls when possible. | Significantly lower | Classification, lookups, subagents, latency-sensitive chat |
| `medium` | Moderate thinking. Balanced speed and quality. Drop-in for typical workflows. | Lower | Cost-sensitive agentic workflows, average production use |
| `high` | Claude almost always thinks. Deep reasoning on complex tasks. | Baseline | Complex reasoning, nuanced analysis |
| `xhigh` | Extended exploration. More tool calls, deeper search, longer traces. | Meaningfully higher | Anthropic's recommended start for coding and agentic work |
| `max` | Absolute maximum. No constraints on thinking depth. | Highest | Frontier problems only. Can overthink structured tasks. |
Note what Anthropic actually recommends. For coding and agent jobs on Opus 4.7, the company says start at xhigh, not high. That is a deliberate push. Anthropic built the new tier because it thinks the extra token spend pays back in fewer failed runs. You will find out quickly if that is true for your workload. Anthropic also recommends a max_tokens of at least 64k when running at xhigh or max, so the model has headroom to think and act across subagents.
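In request terms, the recommendation looks something like the sketch below. Treat it as illustrative only: the exact field names (`effort` in particular) are assumptions drawn from this article, not a verified schema, so check the current API reference before copying it.

```python
import json

# Illustrative request body only. The "effort" field name and the exact
# request shape are assumptions based on the article, not a verified schema.
request = {
    "model": "claude-opus-4-7",
    "effort": "xhigh",      # Anthropic's recommended start for coding/agent work
    "max_tokens": 64_000,   # headroom for thinking plus subagent traffic
    "messages": [
        {"role": "user", "content": "Refactor the retry logic in client.py"}
    ],
}
print(json.dumps(request, indent=2))
```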
There is also a tokenizer catch. Opus 4.7 retokenizes text anywhere from 1.0x up to 1.35x the rate of Opus 4.6. Exact ratio depends on what you send. Run an actual token count before assuming flat prices mean flat bills, because 35% more billable tokens for the same input is very possible, and that's before you touch the effort level.
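The retokenization math is worth making concrete. A minimal sketch, assuming the 1.0x-1.35x range stated above and the $5/M standard input rate:

```python
def retokenized_cost(tokens_on_46: int, price_per_m: float, ratio: float) -> float:
    """Input cost on Opus 4.7 given a token count measured on Opus 4.6.

    `ratio` is the retokenization multiplier, 1.0-1.35 per the article.
    """
    return tokens_on_46 * ratio * price_per_m / 1_000_000

base  = retokenized_cost(800_000, 5.00, 1.0)   # same token count, flat price
worst = retokenized_cost(800_000, 5.00, 1.35)  # the 1.35x ceiling
print(round(base, 2), round(worst, 2))         # 4.0 5.4
```

Same prompt, same sticker price, up to 35% more billable input. That is why a real token count beats assuming flat prices mean flat bills.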
Task budgets, a new control surface
Opus 4.7 also introduces task budgets in public beta. These are not the same as max_tokens. A task budget is an advisory countdown the model sees and paces against across a full agentic loop, including thinking tokens, tool calls, tool results, and output.
| Parameter | What it controls | Scope | Behavior |
|---|---|---|---|
| `max_tokens` | Hard per-request cap on generated tokens | One API request | Truncates response when reached |
| `task_budget` | Advisory token target across an agentic loop | Full multi-request loop | Model self-regulates, finishes gracefully |
| `effort` | Depth of reasoning per step | Per response | Tunes thinking allocation |
Anthropic set the floor at 20,000 tokens. Anything lower comes back as a 400 error. The company also warns that an undersized budget can look like a refusal. The model may scope down, stop early, or decline outright if the budget looks too tight for what you asked. Measure your actual per-task spend first, then size the budget above the p99 of your distribution.
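Sizing above the p99 is a one-liner once you have per-task spend logs. A sketch using a nearest-rank percentile; the sample numbers are hypothetical:

```python
import math

def p99(samples: list[int]) -> int:
    """Nearest-rank 99th percentile of observed per-task token spend."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

# Hypothetical per-task token spends pulled from production logs.
spends = [22_000, 31_000, 28_500, 45_000, 38_000, 90_000, 27_000, 33_500]

# Budget sits 10% above p99, and never below the 20,000-token floor.
budget = max(20_000, int(p99(spends) * 1.1))
print(budget)  # 99000
```

The 10% cushion is a judgment call, not a documented rule; the point is to size from measured spend rather than guess and then misread a scoped-down answer as a refusal.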
The feature is Opus 4.7 only. It is not supported on Opus 4.6, Sonnet 4.6, Haiku 4.5, GPT-5.4, or Gemini 3.1 Pro.
Coding benchmarks
| Benchmark | What it tests | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| SWE-bench Verified | Real GitHub issue resolution | 80.8% | 87.6% | ~80% | 80.6% |
| SWE-bench Pro | Harder, contamination-resistant | 53.4% | 64.3% | 57.7% | 54.2% |
| Terminal-Bench 2.0 | Live terminal and CI/CD work | 65.4% | 69.4% | 75.1% | 68.5% |
| CursorBench | Cursor IDE production tasks | 58% | 70% | Not reported | Not reported |
| Rakuten-SWE-Bench | Production task resolution | 1x baseline | 3x baseline | Not reported | Not reported |
| CodeRabbit review recall | Code review quality | Baseline | +10% | Not reported | Not reported |
The headline jumps out. SWE-bench Pro is the harder, cleaner version of SWE-bench, scrubbed against training-data contamination. Opus 4.7 gains 10.9 points there. That's bigger than the gap between second and fourth place on the same table. Much bigger. GPT-5.4 wins terminal work, though, by a margin that anyone doing sysadmin, git, or CI-heavy automation will feel.
Customer-reported gains from Opus 4.6 to Opus 4.7
| Customer | Metric | Delta |
|---|---|---|
| GitHub | 93-task internal benchmark | +13% resolved |
| Cursor | CursorBench pass rate | 58% → 70% |
| Notion | Complex workflows | +14% at fewer tokens, −67% tool errors |
| Factory Droids | Task success | +10 to 15% |
| Databricks OfficeQA Pro | Error rate | −21% |
| XBOW | Visual-acuity for pen-testing | 54.5% → 98.5% |
Customer quotes are not neutral. They rarely are. But pointed the same way across six unrelated companies, the pattern is hard to wave off. Fewer dead loops. Fewer broken tool calls. Fewer tokens wasted chasing a failure state an agent should have caught three steps earlier.
Reasoning and knowledge work
| Benchmark | Opus 4.6 | Opus 4.7 | GPT-5.4 | GPT-5.4 Pro | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| GPQA Diamond | Not reported | 94.2% | 78.2% | 94.4% | 94.3% |
| MMMLU (multilingual) | 91.1% | 91.5% | Not reported | Not reported | 92.6% |
| Humanity's Last Exam | 53.0% | Not yet | 41.6% | 46% | 37.5% |
| ARC-AGI-2 | Not reported | Not reported | Not reported | Not reported | 77.1% |
| BigLaw Bench | Not reported | Not reported | 91% | Not reported | Not reported |
| GDPval-AA Elo | 1606 | Higher, not yet published | Not published | Not published | Not published |
Graduate-level reasoning scores at 94% are basically noise now. Three models, 0.2 points apart. Calling that a leaderboard is generous. It is a rounding error dressed up as a benchmark win. The tests that still separate these models are the ones plugged into real work. GDPval. BigLaw. Anthropic's Finance Agent. Look there, not at GPQA.
Vision and multimodal
| Dimension | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Max image resolution | ~1,568 px / 1.15 MP | 2,576 px / 3.75 MP | Full-res supported | Full-res supported |
| MMMU-Pro | Not reported | Not reported | Not reported | 81.0% |
| ScreenSpot-Pro | Not reported | Not reported | 3.5% | 72.7% |
| Video-MMMU | Not supported | Not supported | Not supported | 87.6% |
| XBOW visual-acuity | 54.5% | 98.5% | Not reported | Not reported |
| Native video input | No | No | No | Yes |
| Native audio input | No | No | No | Yes |
Opus 4.7's image jump is narrow but meaningful. Dense screenshots. UI dashboards. Patent drawings. A PhD student's messy handwritten notes. These now arrive at the model with labels intact, instead of getting downsampled into blur. Gemini still wins multimodal breadth outright. It reads video. The other three cannot. That is not a small deal for anyone doing meeting summarization, patent analysis, or compliance review.
Agents, tools, and computer control
| Benchmark | What it measures | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| MCP-Atlas | Scaled multi-tool use | 75.8% | 77.3% | 68.1% | 73.9% |
| OSWorld | Computer use on desktop apps | 72.7% | Not published | 75% | Not reported |
| BrowseComp | Web research agents | 86.8% | 79.3% | 89.3% | Not reported |
| Finance Agent | Multi-step financial reasoning | Not reported | 64.4% | Not reported | Not reported |
| τ2-bench Retail | Retail workflows | 91.9% | Not indexed | Not reported | Not reported |
| Vending-Bench 2 | Vending-machine agent | Not reported | Not reported | Not reported | Tops leaderboard |
Something to notice. OpenAI made GPT-5.4 the first model above the human expert baseline on OSWorld. The humans scored 72.4%. The model scored 75%. That is real. On MCP-Atlas, though, Anthropic leads OpenAI by nine points. And Anthropic wrote the MCP spec. So the lab that built the test wins the test. Surprised? Nobody is.
Safety, cyber access, and verification
| Dimension | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Safety framework | ASL-3 | ASL-3, below Mythos | Preparedness "high" cyber | FSF alert threshold |
| Cybench 35-challenge | Strong | 96% pass@1 | Not published | 11/12 v1 hard · 0/13 v2 |
| Restricted cyber variant | No | Mythos Preview | GPT-5.4-Cyber | None separate |
| Verification program | Enterprise controls | Cyber Verification Program | Trusted Access for Cyber | FSF mitigations |
| Independent evaluation | UK AISI on Mythos | UK AISI on Mythos | Preparedness disclosure | Google FSF |
Every lab has now built an access control system around its flagship. Anthropic has two models stacked: the public one you buy, and Mythos behind Glasswing. OpenAI shipped GPT-5.4-Cyber yesterday, a defensive-security variant for vetted researchers. Google has not separated a cyber model yet, but its own safety framework flagged Gemini 3 Pro at the early-warning threshold. The 2026 frontier product is no longer just a model weight file. It is a model weight file plus an identity system plus a use-case gate plus an exemption appeals process. All three labs are building the same shape of thing.
Cross-benchmark scoreboard
| Dimension | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| SWE-bench Verified | Opus 4.7 | Opus 4.6 | Gemini 3.1 Pro | GPT-5.4 |
| SWE-bench Pro | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro | Opus 4.6 |
| Terminal-Bench 2.0 | GPT-5.4 | Opus 4.7 | Gemini 3.1 Pro | Opus 4.6 |
| MCP-Atlas | Opus 4.7 | Opus 4.6 | Gemini 3.1 Pro | GPT-5.4 |
| GPQA Diamond | GPT-5.4 Pro | Gemini 3.1 Pro | Opus 4.7 | — |
| OSWorld | GPT-5.4 | Opus 4.6 | — | — |
| BrowseComp | GPT-5.4 | Opus 4.6 | Opus 4.7 | — |
| MMMU-Pro (vision) | Gemini 3.1 Pro | — | — | — |
| Output speed | Gemini 3.1 Pro | GPT-5.4 | Opus | — |
| Input modality breadth | Gemini 3.1 Pro | Opus / GPT-5.4 | — | — |
| Blended $/M tokens | Gemini 3.1 Pro | GPT-5.4 | Opus 4.6 / 4.7 | — |
| Knowledge freshness | Opus 4.7 | GPT-5.4 | Opus 4.6 | Gemini not disclosed |
So what wins? Depends who is asking. Opus 4.7 takes coding and tool use. GPT-5.4 takes terminal, computer control, and web research. Gemini takes multimodal breadth, speed, and the bill. Nobody collects the whole board. And that's actually a helpful sign. The market has enough models now that the right question shifted. From "which one is the best" to "which one is best for the thing I am doing Tuesday morning."
Pick by workload, not by leaderboard
| Your workload | Choose | Why |
|---|---|---|
| Long-running agentic coding on real repositories | Opus 4.7 | +10.9 on SWE-bench Pro vs Opus 4.6, +6.6 vs GPT-5.4. Fewer failed loops. |
| Terminal ops, CI/CD automation, sysadmin tasks | GPT-5.4 | 75.1% Terminal-Bench beats the field by six points. |
| Computer control without wrappers | GPT-5.4 | Native desktop use, first above human expert on OSWorld. |
| Video, audio, or mixed-modality input | Gemini 3.1 Pro | Only frontier model with native video and audio input. |
| Cost-sensitive scaling at similar intelligence | Gemini 3.1 Pro | Intelligence Index 57 at $7/M blended, versus $15/M for Opus. |
| Tool-heavy agents using MCP | Opus 4.7 | 77.3% MCP-Atlas. Anthropic wrote the protocol. |
| Defensive cybersecurity at a verified org | Mythos Preview or GPT-5.4-Cyber | Public models block many security tasks. Verified access unlocks them. |
| Stories from after mid-2025 | Opus 4.7 | Knowledge cutoff January 2026, five months ahead of GPT-5.4. |
| Pure speed on high-volume workloads | Gemini 3.1 Pro | 142 tokens per second, nearly double GPT-5.4's output rate. |
Buyers keep asking which one is best. Wrong question, though. Ask which one is best for this specific job. At this price. With this latency budget. At this access tier. The tables above answer that. Leaderboards will not.
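The workload table above collapses into a trivial routing function. The workload keys below are made up for illustration; the recommendations are the table's:

```python
# Illustrative router mirroring the workload table; the keys are invented.
ROUTING = {
    "agentic_coding":   "opus-4.7",
    "terminal_ops":     "gpt-5.4",
    "computer_control": "gpt-5.4",
    "video_audio":      "gemini-3.1-pro",
    "cost_sensitive":   "gemini-3.1-pro",
    "mcp_tools":        "opus-4.7",
    "high_volume":      "gemini-3.1-pro",
}

def pick_model(workload: str) -> str:
    """Return the table's pick, defaulting to the cheapest blended rate."""
    return ROUTING.get(workload, "gemini-3.1-pro")

print(pick_model("agentic_coding"))  # opus-4.7
```

A production version would also weigh latency budgets and access tiers, but the shape of the decision is this lookup, not a leaderboard scan.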
What to watch next
Three labs already converged on the reasoning ceiling. The next fight is about everything downstream. Anthropic has bet the premium on agentic coding and tool discipline. Google has bet its price and multimodal breadth. OpenAI has bet on computer control and unified distribution through ChatGPT, Codex, and the API at the same time. Each is also quietly building an access control system, and that is where the real interesting fights will happen over the next year.
Watch whether Anthropic's Cyber Verification Program opens beyond a handful of current partners. Watch whether OpenAI pushes GPT-5.4-Cyber deeper into infrastructure firms. Watch whether Google formally splits off a cyber variant after its safety framework flag. And watch how Anthropic handles Mythos when it eventually broadens, which it will.
The benchmark tables will keep moving. The distribution tables are where the market is now.
Frequently Asked Questions
Which model is best for agentic coding on real codebases?
Claude Opus 4.7. It jumped 10.9 points over Opus 4.6 on SWE-bench Pro to 64.3%, beating GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%. Customer reports from GitHub, Cursor, Notion, and Factory Droids show double-digit gains in task completion with fewer tool errors on long-running agent jobs.
Which model is cheapest at the API?
Gemini 3.1 Pro at $2 per million input tokens and $12 per million output tokens under 200K context. GPT-5.4 is next at $2.50 in and $15 out. Opus 4.7 held prices flat from Opus 4.6 at $5 in and $25 out, roughly 2.14 times the Gemini rate.
What is Claude Mythos Preview and can I use it?
Mythos Preview is Anthropic's higher-capability model, released through Project Glasswing with invitation-only access. Anthropic says it scores higher than Opus 4.7 on every axis measured, but restricts it to vetted defensive-cybersecurity partners. There is no self-serve signup.
How much did vision improve in Opus 4.7?
The pixel ceiling roughly doubled. Opus 4.7 now accepts images at 2,576 pixels on the long edge and about 3.75 megapixels, up from 1,568 pixels and 1.15 megapixels on Opus 4.6. On XBOW's visual-acuity benchmark for pen-testing, the score climbed from 54.5% to 98.5%.
Is GPT-5.4 the only model that controls a desktop natively?
Yes, at the API level. GPT-5.4 scored 75% on OSWorld, becoming the first model to beat the 72.4% human expert baseline on computer-use tasks. Claude and Gemini can drive a desktop through the API but typically need wrapper tooling to handle the control loop.


