Anthropic shipped Claude Opus 4.7 on Thursday, closing a strange ten-week stretch. Opus 4.6 had landed February 5. Google's Gemini 3.1 Pro arrived two weeks later. OpenAI's GPT-5.4 came out on March 5. Today's launch finishes the set.

Four flagship reasoning models. Ten weeks. One news cycle.

On paper the four look nearly identical now. GPQA Diamond? The top scores sit within a point of each other. Context window? Roughly a million tokens each. Training run? Each lab burned enough compute to light a small city. That's the convergence story, and it's boring.

The interesting story is where they split. Coding scores don't line up. Tool use doesn't line up. Vision doesn't line up. Computer control doesn't line up. And the monthly bill definitely doesn't line up. This piece is a stack of tables. Read it that way.


Release timeline

Table 01 · Launch cadence: four flagships, ten weeks

| Lab | Model | Release date | Status |
|---|---|---|---|
| Anthropic | Opus 4.6 | 2026-02-05 | Generally available |
| Google | Gemini 3.1 Pro | 2026-02-19 | Preview (API + Vertex AI) |
| OpenAI | GPT-5.4 | 2026-03-05 | Generally available |
| Anthropic | Opus 4.7 | 2026-04-16 | Generally available today |
| Anthropic | Claude Mythos Preview | Not published | Invitation-only (Project Glasswing) |

Sources: Anthropic, Google DeepMind, OpenAI

Mythos is the awkward extra entry. Anthropic says it sits above Opus 4.7 on every capability test the company ran. You can't have it. Developers need a vetted application, and those do not go out often.

Core specifications

Table 02 · Specs: Opus 4.7 highlighted throughout

| Dimension | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Context window | 1M tokens | 1M tokens | 272K default, 1M opt-in | 1,048,576 tokens |
| Max output | 128K tokens | 128K tokens | Not published | 65,536 tokens |
| Knowledge cutoff | May 2025 | Jan 2026 | Aug 2025 | Not published |
| Extended thinking | Yes | Adaptive only | Yes | Yes |
| Input modalities | Text, image | Text, image | Text, image | Text, image, speech, video |
| Output modalities | Text | Text | Text | Text |
| Native computer use | API-level | API-level | Native, no plugin | API-level |

Highlighted cells show the leader in each row

Three quick reads from this table. Opus 4.7 is the freshest by roughly five months. Gemini swallows the widest set of inputs, including raw video. Only GPT-5.4 controls a desktop without a wrapper library in between. That last one matters less than the marketing suggests, though. A good wrapper is cheap.

Pricing (per million tokens)

Table 03 · API pricing: leader per row = cheapest

| Tier | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Input (standard) | $5.00 | $5.00 | $2.50 | $2.00 |
| Output (standard) | $25.00 | $25.00 | $15.00 | $12.00 |
| Input (>200K or >272K) | $10.00 | $5.00 | 2x rate | $4.00 |
| Output (>200K or >272K) | $37.50 | $25.00 | 1.5x rate | $18.00 |
| Prompt cache discount | Up to 90% | Up to 90% | Available | Available |
| Batch discount | 50% | 50% | 50% | 50% |
| Blended $/M (in+out/2) | $15.00 | $15.00 | $8.75 | $7.00 |

Opus 4.7 is 2.14x the Gemini 3.1 Pro base rate, 1.71x GPT-5.4

Run the math. Opus 4.7 costs 2.14 times the Gemini rate at the base tier, and 1.71 times GPT-5.4. Anthropic did not cut prices today. It did not match the cheaper rivals either. That is a statement. The company believes the coding and agent work pays for itself.
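The multiples are easy to sanity-check. A minimal sketch using the table's own 50/50 blended convention; the figures are the list prices above, nothing here is an official calculator:

```python
# Blended $/M tokens, using the pricing table's convention: (input + output) / 2.
prices = {
    "Opus 4.7":       {"in": 5.00, "out": 25.00},
    "GPT-5.4":        {"in": 2.50, "out": 15.00},
    "Gemini 3.1 Pro": {"in": 2.00, "out": 12.00},
}

blended = {model: (p["in"] + p["out"]) / 2 for model, p in prices.items()}

ratio_vs_gemini = blended["Opus 4.7"] / blended["Gemini 3.1 Pro"]
ratio_vs_gpt = blended["Opus 4.7"] / blended["GPT-5.4"]

print(blended)                     # {'Opus 4.7': 15.0, 'GPT-5.4': 8.75, 'Gemini 3.1 Pro': 7.0}
print(round(ratio_vs_gemini, 2))   # 2.14
print(round(ratio_vs_gpt, 2))      # 1.71
```

Note the blended figure assumes equal input and output volume; a read-heavy workload with 10:1 input-to-output skews much closer to the input rate.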

But the sticker price hides a second lever. Opus 4.7 introduces a new effort tier called xhigh, plus task budgets in public beta. Both change how the bill actually lands. The next table is the one developers will keep open on a second monitor.

Opus 4.7 effort levels and what they cost you

Table 04 · Effort tiers: new `xhigh` level marked

| Effort level | What Claude does | Token usage vs `high` | When to pick it |
|---|---|---|---|
| low | Minimal thinking. Skips reasoning on simple tasks. Combines tool calls when possible. | Significantly lower | Classification, lookups, subagents, latency-sensitive chat |
| medium | Moderate thinking. Balanced speed and quality. Drop-in for typical workflows. | Lower | Cost-sensitive agentic workflows, average production use |
| high (default) | Claude almost always thinks. Deep reasoning on complex tasks. | Baseline | Complex reasoning, nuanced analysis |
| xhigh (new in 4.7) | Extended exploration. More tool calls, deeper search, longer traces. | Meaningfully higher | Anthropic's recommended start for coding and agentic work |
| max | Absolute maximum. No constraints on thinking depth. | Highest | Frontier problems only. Can overthink structured tasks. |

Source: Anthropic Claude API documentation for Opus 4.7

Note what Anthropic actually recommends. For coding and agent jobs on Opus 4.7, the company says start at xhigh, not high. That is a deliberate push. Anthropic built the new tier because it thinks the extra token spend pays back in fewer failed runs. You will find out quickly whether that holds for your workload. Anthropic also recommends setting max_tokens to at least 64K when running at xhigh or max, so the model has headroom to think and act across subagents.
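At the request level, that advice looks roughly like the sketch below. The field names ("effort", "xhigh"), the model ID, and the 64K floor are taken from this article's description, not verified against live API documentation; treat every name here as an assumption and check the current reference before shipping.

```python
# Hypothetical request body for Opus 4.7 at the xhigh effort tier.
# The "effort" field, the "claude-opus-4-7" model ID, and the 64K output
# floor are assumptions based on the article, not a verified API shape.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")

def build_request(prompt: str, effort: str = "xhigh", max_tokens: int = 64_000) -> dict:
    """Assemble a messages-style payload with an explicit effort tier."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort tier: {effort}")
    if effort in ("xhigh", "max") and max_tokens < 64_000:
        # Per the article, Anthropic recommends >= 64K headroom at the top tiers.
        raise ValueError("use max_tokens >= 64K at xhigh or max")
    return {
        "model": "claude-opus-4-7",  # assumed model ID
        "max_tokens": max_tokens,
        "effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor the retry logic in worker.py")
print(req["effort"], req["max_tokens"])  # xhigh 64000
```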

There is also a tokenizer catch. Opus 4.7 tokenizes the same text into anywhere from 1.0x to 1.35x as many tokens as Opus 4.6; the exact ratio depends on what you send. Run an actual token count before assuming flat prices mean flat bills, because 35% more billable tokens for the same input is very possible, and that is before you touch the effort level.
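A back-of-envelope sketch of what that range does to a bill. The 1.0x to 1.35x band is the guidance described above; the 10M-token monthly volume is a made-up illustration:

```python
# Flat $/M input price, but Opus 4.7 may count the same text as up to 1.35x
# the tokens Opus 4.6 did. Worst-case bill impact on a hypothetical 10M-token
# monthly input volume (measured with the 4.6 tokenizer):
PRICE_PER_M_INPUT = 5.00
TOKENS_UNDER_46 = 10_000_000

def monthly_input_cost(retokenize_ratio: float) -> float:
    tokens_under_47 = TOKENS_UNDER_46 * retokenize_ratio
    return tokens_under_47 / 1_000_000 * PRICE_PER_M_INPUT

for ratio in (1.0, 1.15, 1.35):
    print(f"{ratio:.2f}x -> ${monthly_input_cost(ratio):.2f}")
# 1.00x -> $50.00
# 1.15x -> $57.50
# 1.35x -> $67.50
```

Same list price, up to $17.50 more per 10M tokens. Multiply by an xhigh effort tier and the "prices held flat" headline gets softer still.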

Task budgets, a new control surface

Opus 4.7 also introduces task budgets in public beta. These are not the same as max_tokens. A task budget is an advisory countdown the model sees and paces against across a full agentic loop, including thinking tokens, tool calls, tool results, and output.

Table 05 · Cost controls: three orthogonal levers

| Parameter | What it controls | Scope | Behavior |
|---|---|---|---|
| max_tokens | Hard per-request cap on generated tokens | One API request | Truncates response when reached |
| task_budget (beta, Opus 4.7 only) | Advisory token target across an agentic loop | Full multi-request loop | Model self-regulates, finishes gracefully |
| effort | Depth of reasoning per step | Per response | Tunes thinking allocation |

Header to enable task budgets: task-budgets-2026-03-13

Anthropic set the floor at 20,000 tokens. Anything lower comes back as a 400 error. The company also warns that an undersized budget can look like a refusal. The model may scope down, stop early, or decline outright if the budget looks too tight for what you asked. Measure your actual per-task spend first, then size the budget above the p99 of your distribution.
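One way to operationalize that sizing rule, sketched in plain Python. The 20,000-token floor is the documented limit described above; the nearest-rank percentile helper and the 1.2x headroom factor are illustrative choices, not an official formula:

```python
# Size a task_budget above the p99 of observed per-task token spend,
# never dipping below the documented 20,000-token floor (lower values
# come back as a 400 error, per the article).
import math

MIN_TASK_BUDGET = 20_000

def p99(samples: list) -> int:
    """Nearest-rank 99th percentile of a non-empty sample."""
    ranked = sorted(samples)
    rank = max(1, math.ceil(0.99 * len(ranked)))
    return ranked[rank - 1]

def size_task_budget(token_spends: list, headroom: float = 1.2) -> int:
    """Budget = p99 of real spend plus headroom, clamped to the API floor."""
    return max(round(p99(token_spends) * headroom), MIN_TASK_BUDGET)

spends = [8_000, 12_000, 15_000, 31_000, 45_000]
print(size_task_budget(spends))   # 54000
print(size_task_budget([1_000]))  # 20000 (floor kicks in)
```

The clamp matters in both directions: too low and the API rejects the request; slightly low and the model may quietly scope down your task, which is worse than an error.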

The feature is Opus 4.7 only. It is not supported on Opus 4.6, Sonnet 4.6, Haiku 4.5, GPT-5.4, or Gemini 3.1 Pro.

Coding benchmarks

Table 06 · Software engineering: bars show score vs 100%

| Benchmark | What it tests | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| SWE-bench Verified | Real GitHub issue resolution | 80.8% | 87.6% | ~80% | 80.6% |
| SWE-bench Pro | Harder, contamination-resistant | 53.4% | 64.3% | 57.7% | 54.2% |
| Terminal-Bench 2.0 | Live terminal and CI/CD work | 65.4% | 69.4% | 75.1% | 68.5% |
| CursorBench | Cursor IDE production tasks | 58% | 70% | Not reported | Not reported |
| Rakuten-SWE-Bench | Production task resolution | 1x baseline | 3x baseline | Not reported | Not reported |
| CodeRabbit review recall | Code review quality | Baseline | +10% | Not reported | Not reported |

Opus 4.7 wins four of six coding benchmarks. GPT-5.4 leads terminal work.

The headline jumps out. SWE-bench Pro is the harder, cleaner version of SWE-bench, scrubbed against training-data contamination. Opus 4.7 gains 10.9 points there. That's bigger than the gap between second and fourth place on the same table. Much bigger. GPT-5.4 wins terminal work, though, by a margin that anyone doing sysadmin, git, or CI-heavy automation will feel.

Customer-reported gains from Opus 4.6 to Opus 4.7

Table 07 · Enterprise deltas: vendor-reported, not independent

| Customer | Metric | Delta |
|---|---|---|
| GitHub | 93-task internal benchmark | +13% resolved |
| Cursor | CursorBench pass rate | 58% → 70% |
| Notion | Complex workflows | +14% at fewer tokens, −67% tool errors |
| Factory Droids | Task success | +10 to 15% |
| Databricks | OfficeQA Pro error rate | −21% |
| XBOW | Visual acuity for pen-testing | 54.5% → 98.5% |

Customer metrics disclosed in Anthropic's Opus 4.7 launch materials

Customer quotes are not neutral. They rarely are. But pointed the same way across six unrelated companies, the pattern is hard to wave off. Fewer dead loops. Fewer broken tool calls. Fewer tokens wasted chasing a failure state an agent should have caught three steps earlier.

Reasoning and knowledge work

Table 08 · Reasoning: top three models within 0.2 pts on GPQA

| Benchmark | Opus 4.6 | Opus 4.7 | GPT-5.4 | GPT-5.4 Pro | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| GPQA Diamond | Not reported | 94.2% | 78.2% | 94.4% | 94.3% |
| MMMLU (multilingual) | 91.1% | 91.5% | Not reported | Not reported | 92.6% |
| Humanity's Last Exam | 53.0% (w/ tools) | Not yet | 41.6% | 46% | 37.5% (no tools) |
| ARC-AGI-2 | Not reported | Not reported | Not reported | Not reported | 77.1% (with code) |
| BigLaw Bench | Not reported | Not reported | 91% | Not reported | Not reported |
| GDPval-AA Elo | 1606 | Higher, not yet published | Not published | Not published | Not published |

GPQA Diamond top three within 0.2 points: effectively tied

Graduate reasoning scores at 94% are basically noise now. Three models, 0.2 points apart. Calling that a leaderboard is generous. It is a rounding error dressed up as a benchmark win. The tests that still separate these models are the ones plugged into real work. GDPval. BigLaw. Anthropic's Finance Agent. Look there, not at GPQA.

Vision and multimodal

Table 09 · Vision & multimodal: Gemini alone reads native video

| Dimension | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Max image resolution | ~1,568 px / 1.15 MP | 2,576 px / 3.75 MP | Full-res supported | Full-res supported |
| MMMU-Pro | Not reported | Not reported | Not reported | 81.0% |
| ScreenSpot-Pro | Not reported | Not reported | 3.5% (prior gen) | 72.7% |
| Video-MMMU | Not supported | Not supported | Not supported | 87.6% |
| XBOW visual-acuity | 54.5% | 98.5% | Not reported | Not reported |
| Native video input | No | No | No | Yes |
| Native audio input | No | No | No | Yes |

Opus 4.7 triples image resolution. Gemini owns multimodal breadth.

Opus 4.7's image jump is narrow but meaningful. Dense screenshots. UI dashboards. Patent drawings. A PhD student's messy handwritten notes. These now arrive at the model with labels intact, instead of getting downsampled into blur. Gemini still wins multimodal breadth outright. It reads video. The other three cannot. That is not a small deal for anyone doing meeting summarization, patent analysis, or compliance review.

Agents, tools, and computer control

Table 10 · Agents & tools: split field, no overall leader

| Benchmark | What it measures | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| MCP-Atlas | Scaled multi-tool use | 75.8% | 77.3% | 68.1% | 73.9% |
| OSWorld | Computer use on desktop apps | 72.7% | Not published | 75% (> 72.4% human) | Not reported |
| BrowseComp | Web research agents | 86.8% (multi-agent) | 79.3% | 89.3% (Pro) | Not reported |
| Finance Agent | Multi-step financial reasoning | Not reported | 64.4% | Not reported | Not reported |
| τ2-bench Retail | Retail workflows | 91.9% | Not indexed | Not reported | Not reported |
| Vending-Bench 2 | Vending-machine agent | Not reported | Not reported | Not reported | Tops leaderboard |

Four different winners across six agent benchmarks

Something to notice. OpenAI made GPT-5.4 the first model above the human expert baseline on OSWorld. The humans scored 72.4%. The model scored 75%. That is real. On MCP-Atlas, though, Anthropic leads OpenAI by nine points. And Anthropic wrote the MCP spec. The lab that built the test wins the test. Surprised? Nobody is.

Safety, cyber access, and verification

Table 11 · Safety tiers: all three labs now ship verified access

| Dimension | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Safety framework | ASL-3 | ASL-3 (below Mythos) | Preparedness "high" (cyber) | FSF alert threshold |
| Cybench 35-challenge | Strong | 96% pass@1 | Not published | 11/12 v1 hard · 0/13 v2 |
| Restricted cyber variant | No | Mythos Preview (invite-only) | GPT-5.4-Cyber (launched Apr 15) | None separate |
| Verification program | Enterprise controls | Cyber Verification Program | Trusted Access for Cyber | FSF mitigations |
| Independent evaluation | UK AISI on Mythos | UK AISI on Mythos | Preparedness disclosure | Google FSF |

The frontier product is now a model weight file plus an access-control system

Every lab has now built an access control system around its flagship. Anthropic has two models stacked: the public one you buy, and Mythos behind Glasswing. OpenAI shipped GPT-5.4-Cyber yesterday, a defensive-security variant for vetted researchers. Google has not separated a cyber model yet, but its own safety framework flagged Gemini 3 Pro at the early-warning threshold. The 2026 frontier product is no longer just a model weight file. It is a model weight file plus an identity system plus a use-case gate plus an exemption appeals process. All three labs are building the same shape of thing.

Cross-benchmark scoreboard

Table 12 · Scoreboard: twelve dimensions, twelve rankings

| Dimension | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| SWE-bench Verified | Opus 4.7 87.6 | Opus 4.6 80.8 | Gemini 3.1 Pro 80.6 | GPT-5.4 ~80 |
| SWE-bench Pro | Opus 4.7 64.3 | GPT-5.4 57.7 | Gemini 3.1 Pro 54.2 | Opus 4.6 53.4 |
| Terminal-Bench 2.0 | GPT-5.4 75.1 | Opus 4.7 69.4 | Gemini 3.1 Pro 68.5 | Opus 4.6 65.4 |
| MCP-Atlas | Opus 4.7 77.3 | Opus 4.6 75.8 | Gemini 3.1 Pro 73.9 | GPT-5.4 68.1 |
| GPQA Diamond | GPT-5.4 Pro 94.4 | Gemini 3.1 Pro 94.3 | Opus 4.7 94.2 | |
| OSWorld | GPT-5.4 75.0 | Opus 4.6 72.7 | | |
| BrowseComp | GPT-5.4 Pro 89.3 | Opus 4.6 86.8 | Opus 4.7 79.3 | |
| MMMU-Pro (vision) | Gemini 3.1 Pro 81.0 | | | |
| Output speed | Gemini 3.1 Pro 142 tok/s | GPT-5.4 75 tok/s | Opus moderate | |
| Input modality breadth | Gemini 3.1 Pro 4 modes | Opus / GPT-5.4 2 modes | | |
| Blended $/M tokens | Gemini 3.1 Pro $7.00 | GPT-5.4 $8.75 | Opus 4.6 / 4.7 $15.00 | |
| Knowledge freshness | Opus 4.7 Jan 2026 | GPT-5.4 Aug 2025 | Opus 4.6 May 2025 | Gemini not disclosed |

Opus 4.7: 4 firsts · GPT-5.4: 4 firsts · Gemini 3.1 Pro: 4 firsts

So what wins? Depends who is asking. Opus 4.7 takes coding and tool use. GPT-5.4 takes terminal, computer control, and web research. Gemini takes multimodal breadth, speed, and the bill. Nobody collects the whole board. And that's actually a helpful sign. The market has enough models now that the right question shifted. From "which one is the best" to "which one is best for the thing I am doing Tuesday morning."

Pick by workload, not by leaderboard

Table 13 · Buyer's guide: map workload to model

| Your workload | Choose | Why |
|---|---|---|
| Long-running agentic coding on real repositories | Opus 4.7 | +10.9 on SWE-bench Pro vs Opus 4.6, +6.6 vs GPT-5.4. Fewer failed loops. |
| Terminal ops, CI/CD automation, sysadmin tasks | GPT-5.4 | 75.1% Terminal-Bench beats the field by six points. |
| Computer control without wrappers | GPT-5.4 | Native desktop use, first above human expert on OSWorld. |
| Video, audio, or mixed-modality input | Gemini 3.1 Pro | Only frontier model with native video and audio input. |
| Cost-sensitive scaling at similar intelligence | Gemini 3.1 Pro | Intelligence Index 57 at $7/M blended, versus $15/M for Opus. |
| Tool-heavy agents using MCP | Opus 4.7 | 77.3% MCP-Atlas. Anthropic wrote the protocol. |
| Defensive cybersecurity at a verified org | Mythos Preview or GPT-5.4-Cyber | Public models block many security tasks. Verified access unlocks them. |
| Stories from after mid-2025 | Opus 4.7 | Knowledge cutoff January 2026, five months ahead of GPT-5.4. |
| Pure speed on high-volume workloads | Gemini 3.1 Pro | 142 tokens per second, nearly double GPT-5.4's output rate. |

Nine workloads, four answers, split mostly between Opus 4.7 and Gemini 3.1 Pro

Buyers keep asking which one is best. Wrong question, though. Ask which one is best for this specific job. At this price. With this latency budget. At this access tier. The tables above answer that. Leaderboards will not.

What to watch next

Three labs have already converged on the reasoning ceiling. The next fight is about everything downstream. Anthropic has bet the premium on agentic coding and tool discipline. Google has bet on its price and multimodal breadth. OpenAI has bet on computer control and unified distribution through ChatGPT, Codex, and the API at the same time. Each is also quietly building an access-control system, and that is where the most interesting fights will happen over the next year.

Watch whether Anthropic's Cyber Verification Program opens beyond a handful of current partners. Watch whether OpenAI pushes GPT-5.4-Cyber deeper into infrastructure firms. Watch whether Google formally splits off a cyber variant after its safety framework flag. And watch how Anthropic handles Mythos when it eventually broadens, which it will.

The benchmark tables will keep moving. The distribution tables are where the market is now.

Frequently Asked Questions

Which model is best for agentic coding on real codebases?

Claude Opus 4.7. It jumped 10.9 points over Opus 4.6 on SWE-bench Pro to 64.3%, beating GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%. Customer reports from GitHub, Cursor, Notion, and Factory Droids show double-digit gains in task completion with fewer tool errors on long-running agent jobs.

Which model is cheapest at the API?

Gemini 3.1 Pro at $2 per million input tokens and $12 per million output tokens under 200K context. GPT-5.4 is next at $2.50 in and $15 out. Opus 4.7 held prices flat from Opus 4.6 at $5 in and $25 out, roughly 2.14 times the Gemini rate.

What is Claude Mythos Preview and can I use it?

Mythos Preview is Anthropic's higher-capability model, released through Project Glasswing with invitation-only access. Anthropic says it scores higher than Opus 4.7 on every axis measured, but restricts it to vetted defensive-cybersecurity partners. There is no self-serve signup.

How much did vision improve in Opus 4.7?

The long-edge limit grew from 1,568 to 2,576 pixels, and the total pixel budget roughly tripled, from about 1.15 to 3.75 megapixels. On XBOW's visual-acuity benchmark for pen-testing, the score climbed from 54.5% to 98.5%.

Is GPT-5.4 the only model that controls a desktop natively?

Yes, among current flagships. GPT-5.4 scored 75% on OSWorld, becoming the first model to beat the 72.4% human expert baseline on computer-use tasks. Claude and Gemini can drive a desktop through the API but typically need wrapper tooling to handle the control loop.
