Anthropic shipped Claude Opus 4.7 on Thursday, closing a strange ten-week stretch. Opus 4.6 had landed February 5. Google's Gemini 3.1 Pro arrived two weeks later. OpenAI's GPT-5.4 came out on March 5. Today's launch finishes the set.
Four flagship reasoning models. Ten weeks. One news cycle.
On paper the four look nearly identical now. GPQA Diamond? Everyone hovering inside a point of each other. Context window? Roughly a million tokens each. Training run? Each lab burned enough compute to light a small city. That's the convergence story and it's boring.
The interesting story is where they split. Coding scores don't line up. Tool use doesn't line up. Vision doesn't line up. Computer control doesn't line up. And the monthly bill definitely doesn't line up. This piece is a stack of tables. Read it that way.
Key Takeaways
- Four frontier reasoning models shipped in ten weeks. Opus 4.6, Gemini 3.1 Pro, GPT-5.4, and Opus 4.7 all cleared 94% on GPQA Diamond.
- Opus 4.7 leads coding. It jumped from 53.4% to 64.3% on SWE-bench Pro and holds the MCP-Atlas tool-use crown at 77.3%.
- GPT-5.4 leads terminal work and computer use. It beat the human expert on OSWorld and hit 75.1% on Terminal-Bench 2.0.
- Gemini 3.1 Pro leads price, speed, and modality breadth. Blended pricing comes in at less than half of Opus 4.7's rate, with native video and audio input.
AI-generated summary, reviewed by an editor. More on our AI guidelines.
Release timeline
| Lab | Model | Release date | Status |
|---|---|---|---|
| Anthropic | Opus 4.6 | 2026-02-05 | Generally available |
| Google | Gemini 3.1 Pro | 2026-02-19 | Preview (API + Vertex AI) |
| OpenAI | GPT-5.4 | 2026-03-05 | Generally available |
| Anthropic | Opus 4.7 | 2026-04-16 | Generally available today |
| Anthropic | Claude Mythos Preview | Not disclosed | Invitation-only (Project Glasswing) |
Mythos is the awkward extra entry. Anthropic says it sits above Opus 4.7 on every capability test the company ran. You can't have it. Access requires a vetted application, and approvals do not go out often.
Core specifications
| Dimension | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Context window | 1M tokens | 1M tokens | 272K default, 1M opt-in | 1,048,576 tokens |
| Max output | 128K tokens | 128K tokens | Not published | 65,536 tokens |
| Knowledge cutoff | May 2025 | Jan 2026 | Aug 2025 | Not published |
| Extended thinking | Yes | Adaptive only | Yes | Yes |
| Input modalities | Text, image | Text, image | Text, image | Text, image, speech, video |
| Output modalities | Text | Text | Text | Text |
| Native computer use | API-level | API-level | Native, no plugin | API-level |
Three quick reads from this table. Opus 4.7 is the freshest by roughly five months. Gemini swallows the widest set of inputs, including raw video. Only GPT-5.4 controls a desktop without a wrapper library in between. That last one matters less than the marketing suggests, though. A good wrapper is cheap.
Pricing (per million tokens)
| Tier | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Input (standard) | $5.00 | $5.00 | $2.50 | $2.00 |
| Output (standard) | $25.00 | $25.00 | $15.00 | $12.00 |
| Input (>200K or >272K) | $10.00 | $5.00 | 2x rate | $4.00 |
| Output (>200K or >272K) | $37.50 | $25.00 | 1.5x rate | $18.00 |
| Prompt cache discount | Up to 90% | Up to 90% | Available | Available |
| Batch discount | 50% | 50% | 50% | 50% |
| Blended $/M (in+out/2) | $15.00 | $15.00 | $8.75 | $7.00 |
Run the math. Opus 4.7 costs 2.14 times the Gemini rate at the base tier, and 1.71 times GPT-5.4. Anthropic did not cut prices today. It did not match the cheaper rivals either. That is a statement. The company believes the coding and agent work pays for itself.
But the sticker price hides a second lever. Opus 4.7 introduces a new effort tier called xhigh, plus task budgets in public beta. Both change how the bill actually lands. The next table is the one developers will keep open on a second monitor.
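The blended numbers above are easy to verify. A few lines of arithmetic reproduce the ratios; the prices are the standard-tier rates from the table, and the simple (input + output) / 2 blend is this article's convention, not an official metric:

```python
# Standard-tier prices per million tokens, taken from the table above.
PRICES = {
    "opus-4.7":       {"in": 5.00, "out": 25.00},
    "gpt-5.4":        {"in": 2.50, "out": 15.00},
    "gemini-3.1-pro": {"in": 2.00, "out": 12.00},
}

def blended(model: str) -> float:
    """Simple (input + output) / 2 blend used in the pricing table."""
    p = PRICES[model]
    return (p["in"] + p["out"]) / 2

opus = blended("opus-4.7")                          # 15.00
print(round(opus / blended("gemini-3.1-pro"), 2))   # 2.14
print(round(opus / blended("gpt-5.4"), 2))          # 1.71
```

Real bills skew toward whichever side of the ledger your workload stresses; an output-heavy agent loop will land further from this blend than a retrieval-heavy one.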
Opus 4.7 effort levels and what they cost you
| Effort level | What Claude does | Token usage vs `high` | When to pick it |
|---|---|---|---|
| `low` | Minimal thinking. Skips reasoning on simple tasks. Combines tool calls when possible. | Significantly lower | Classification, lookups, subagents, latency-sensitive chat |
| `medium` | Moderate thinking. Balanced speed and quality. Drop-in for typical workflows. | Lower | Cost-sensitive agentic workflows, average production use |
| `high` | Claude almost always thinks. Deep reasoning on complex tasks. | Baseline | Complex reasoning, nuanced analysis |
| `xhigh` | Extended exploration. More tool calls, deeper search, longer traces. | Meaningfully higher | Anthropic's recommended start for coding and agentic work |
| `max` | Absolute maximum. No constraints on thinking depth. | Highest | Frontier problems only. Can overthink structured tasks. |
Note what Anthropic actually recommends. For coding and agent jobs on Opus 4.7, the company says start at xhigh, not high. That is a deliberate push. Anthropic built the new tier because it thinks the extra token spend pays back in fewer failed runs. You will find out quickly if that is true for your workload. Anthropic also recommends a max_tokens of at least 64k when running at xhigh or max, so the model has headroom to think and act across subagents.
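In request terms, the recommendation looks something like the sketch below. Treat it as illustrative only: the exact field names (`effort` in particular) are assumptions drawn from this article, not a verified schema, so check the current API reference before copying it.

```python
import json

# Illustrative request body only. The "effort" field name and the exact
# request shape are assumptions based on the article, not a verified schema.
request = {
    "model": "claude-opus-4-7",
    "effort": "xhigh",      # Anthropic's recommended start for coding/agent work
    "max_tokens": 64_000,   # headroom for thinking plus subagent traffic
    "messages": [
        {"role": "user", "content": "Refactor the retry logic in client.py"}
    ],
}
print(json.dumps(request, indent=2))
```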
There is also a tokenizer catch. Opus 4.7 retokenizes text anywhere from 1.0x up to 1.35x the rate of Opus 4.6. Exact ratio depends on what you send. Run an actual token count before assuming flat prices mean flat bills, because 35% more billable tokens for the same input is very possible, and that's before you touch the effort level.
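The retokenization math is worth making concrete. A minimal sketch, assuming the 1.0x-1.35x range stated above and the $5/M standard input rate:

```python
def retokenized_cost(tokens_on_46: int, price_per_m: float, ratio: float) -> float:
    """Input cost on Opus 4.7 given a token count measured on Opus 4.6.

    `ratio` is the retokenization multiplier, 1.0-1.35 per the article.
    """
    return tokens_on_46 * ratio * price_per_m / 1_000_000

base  = retokenized_cost(800_000, 5.00, 1.0)   # same token count, flat price
worst = retokenized_cost(800_000, 5.00, 1.35)  # the 1.35x ceiling
print(round(base, 2), round(worst, 2))         # 4.0 5.4
```

Same prompt, same sticker price, up to 35% more billable input. That is why a real token count beats assuming flat prices mean flat bills.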
Task budgets, a new control surface
Opus 4.7 also introduces task budgets in public beta. These are not the same as max_tokens. A task budget is an advisory countdown the model sees and paces against across a full agentic loop, including thinking tokens, tool calls, tool results, and output.
| Parameter | What it controls | Scope | Behavior |
|---|---|---|---|
| `max_tokens` | Hard per-request cap on generated tokens | One API request | Truncates response when reached |
| `task_budget` | Advisory token target across an agentic loop | Full multi-request loop | Model self-regulates, finishes gracefully |
| `effort` | Depth of reasoning per step | Per response | Tunes thinking allocation |
Anthropic set the floor at 20,000 tokens. Anything lower comes back as a 400 error. The company also warns that an undersized budget can look like a refusal. The model may scope down, stop early, or decline outright if the budget looks too tight for what you asked. Measure your actual per-task spend first, then size the budget above the p99 of your distribution.
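Sizing above the p99 is a one-liner once you have per-task spend logs. A sketch using a nearest-rank percentile; the sample numbers are hypothetical:

```python
import math

def p99(samples: list[int]) -> int:
    """Nearest-rank 99th percentile of observed per-task token spend."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

# Hypothetical per-task token spends pulled from production logs.
spends = [22_000, 31_000, 28_500, 45_000, 38_000, 90_000, 27_000, 33_500]

# Budget sits 10% above p99, and never below the 20,000-token floor.
budget = max(20_000, int(p99(spends) * 1.1))
print(budget)  # 99000
```

The 10% cushion is a judgment call, not a documented rule; the point is to size from measured spend rather than guess and then misread a scoped-down answer as a refusal.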
The feature is Opus 4.7 only. It is not supported on Opus 4.6, Sonnet 4.6, Haiku 4.5, GPT-5.4, or Gemini 3.1 Pro.
Coding benchmarks
| Benchmark | What it tests | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| SWE-bench Verified | Real GitHub issue resolution | 80.8% | 87.6% | ~80% | 80.6% |
| SWE-bench Pro | Harder, contamination-resistant | 53.4% | 64.3% | 57.7% | 54.2% |
| Terminal-Bench 2.0 | Live terminal and CI/CD work | 65.4% | 69.4% | 75.1% | 68.5% |
| CursorBench | Cursor IDE production tasks | 58% | 70% | Not reported | Not reported |
| Rakuten-SWE-Bench | Production task resolution | 1x baseline | 3x baseline | Not reported | Not reported |
| CodeRabbit review recall | Code review quality | Baseline | +10% | Not reported | Not reported |
The headline jumps out. SWE-bench Pro is the harder, cleaner version of SWE-bench, scrubbed against training-data contamination. Opus 4.7 gains 10.9 points there. That's bigger than the gap between second and fourth place on the same table. Much bigger. GPT-5.4 wins terminal work, though, by a margin that anyone doing sysadmin, git, or CI-heavy automation will feel.
Customer-reported gains from Opus 4.6 to Opus 4.7
| Customer | Metric | Delta |
|---|---|---|
| GitHub | 93-task internal benchmark | +13% resolved |
| Cursor | CursorBench pass rate | 58% → 70% |
| Notion | Complex workflows | +14% at fewer tokens, −67% tool errors |
| Factory Droids | Task success | +10 to 15% |
| Databricks OfficeQA Pro | Error rate | −21% |
| XBOW | Visual-acuity for pen-testing | 54.5% → 98.5% |
Customer quotes are not neutral. They rarely are. But pointed the same way across six unrelated companies, the pattern is hard to wave off. Fewer dead loops. Fewer broken tool calls. Fewer tokens wasted chasing a failure state an agent should have caught three steps earlier.
Reasoning and knowledge work
| Benchmark | Opus 4.6 | Opus 4.7 | GPT-5.4 | GPT-5.4 Pro | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| GPQA Diamond | Not reported | 94.2% | 78.2% | 94.4% | 94.3% |
| MMMLU (multilingual) | 91.1% | 91.5% | Not reported | Not reported | 92.6% |
| Humanity's Last Exam | 53.0% | Not yet | 41.6% | 46% | 37.5% |
| ARC-AGI-2 | Not reported | Not reported | Not reported | Not reported | 77.1% |
| BigLaw Bench | Not reported | Not reported | 91% | Not reported | Not reported |
| GDPval-AA Elo | 1606 | Higher, not yet published | Not published | Not published | Not published |
Graduate-level reasoning scores at 94% are basically noise now. Three models, 0.2 points apart. Calling that a leaderboard is generous. It is a rounding error dressed up as a benchmark win. The tests that still separate these models are the ones plugged into real work. GDPval. BigLaw. Anthropic's Finance Agent. Look there, not at GPQA.
Vision and multimodal
| Dimension | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Max image resolution | ~1,568 px / 1.15 MP | 2,576 px / 3.75 MP | Full-res supported | Full-res supported |
| MMMU-Pro | Not reported | Not reported | Not reported | 81.0% |
| ScreenSpot-Pro | Not reported | Not reported | 3.5% | 72.7% |
| Video-MMMU | Not supported | Not supported | Not supported | 87.6% |
| XBOW visual-acuity | 54.5% | 98.5% | Not reported | Not reported |
| Native video input | No | No | No | Yes |
| Native audio input | No | No | No | Yes |
Opus 4.7's image jump is narrow but meaningful. Dense screenshots. UI dashboards. Patent drawings. A PhD student's messy handwritten notes. These now arrive at the model with labels intact, instead of getting downsampled into blur. Gemini still wins multimodal breadth outright. It reads video. The other three cannot. That is not a small deal for anyone doing meeting summarization, patent analysis, or compliance review.
Agents, tools, and computer control
| Benchmark | What it measures | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|---|
| MCP-Atlas | Scaled multi-tool use | 75.8% | 77.3% | 68.1% | 73.9% |
| OSWorld | Computer use on desktop apps | 72.7% | Not published | 75% | Not reported |
| BrowseComp | Web research agents | 86.8% | 79.3% | 89.3% | Not reported |
| Finance Agent | Multi-step financial reasoning | Not reported | 64.4% | Not reported | Not reported |
| τ2-bench Retail | Retail workflows | 91.9% | Not indexed | Not reported | Not reported |
| Vending-Bench 2 | Vending-machine agent | Not reported | Not reported | Not reported | Tops leaderboard |
Something to notice. OpenAI made GPT-5.4 the first model above the human expert baseline on OSWorld. The humans scored 72.4%. The model scored 75%. That is real. On MCP-Atlas, though, Anthropic leads OpenAI by nine points. And Anthropic wrote the MCP spec. So the lab that built the test wins the test. Surprised? Nobody is.
Safety, cyber access, and verification
| Dimension | Opus 4.6 | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Safety framework | ASL-3 | ASL-3, below Mythos | Preparedness "high" cyber | FSF alert threshold |
| Cybench 35-challenge | Strong | 96% pass@1 | Not published | 11/12 v1 hard · 0/13 v2 |
| Restricted cyber variant | No | Mythos Preview | GPT-5.4-Cyber | None separate |
| Verification program | Enterprise controls | Cyber Verification Program | Trusted Access for Cyber | FSF mitigations |
| Independent evaluation | UK AISI on Mythos | UK AISI on Mythos | Preparedness disclosure | Google FSF |
Every lab has now built an access control system around its flagship. Anthropic has two models stacked: the public one you buy, and Mythos behind Glasswing. OpenAI shipped GPT-5.4-Cyber yesterday, a defensive-security variant for vetted researchers. Google has not separated a cyber model yet, but its own safety framework flagged Gemini 3 Pro at the early-warning threshold. The 2026 frontier product is no longer just a model weight file. It is a model weight file plus an identity system plus a use-case gate plus an exemption appeals process. All three labs are building the same shape of thing.
Cross-benchmark scoreboard
| Dimension | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| SWE-bench Verified | Opus 4.7 | Opus 4.6 | Gemini 3.1 Pro | GPT-5.4 |
| SWE-bench Pro | Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro | Opus 4.6 |
| Terminal-Bench 2.0 | GPT-5.4 | Opus 4.7 | Gemini 3.1 Pro | Opus 4.6 |
| MCP-Atlas | Opus 4.7 | Opus 4.6 | Gemini 3.1 Pro | GPT-5.4 |
| GPQA Diamond | GPT-5.4 Pro | Gemini 3.1 Pro | Opus 4.7 | — |
| OSWorld | GPT-5.4 | Opus 4.6 | — | — |
| BrowseComp | GPT-5.4 | Opus 4.6 | Opus 4.7 | — |
| MMMU-Pro (vision) | Gemini 3.1 Pro | — | — | — |
| Output speed | Gemini 3.1 Pro | GPT-5.4 | Opus | — |
| Input modality breadth | Gemini 3.1 Pro | Opus / GPT-5.4 | — | — |
| Blended $/M tokens | Gemini 3.1 Pro | GPT-5.4 | Opus 4.6 / 4.7 | — |
| Knowledge freshness | Opus 4.7 | GPT-5.4 | Opus 4.6 | Gemini not disclosed |
So what wins? Depends who is asking. Opus 4.7 takes coding and tool use. GPT-5.4 takes terminal, computer control, and web research. Gemini takes multimodal breadth, speed, and the bill. Nobody collects the whole board. And that's actually a helpful sign. The market has enough models now that the right question shifted. From "which one is the best" to "which one is best for the thing I am doing Tuesday morning."
Pick by workload, not by leaderboard
| Your workload | Choose | Why |
|---|---|---|
| Long-running agentic coding on real repositories | Opus 4.7 | +10.9 on SWE-bench Pro vs Opus 4.6, +6.6 vs GPT-5.4. Fewer failed loops. |
| Terminal ops, CI/CD automation, sysadmin tasks | GPT-5.4 | 75.1% Terminal-Bench beats the field by six points. |
| Computer control without wrappers | GPT-5.4 | Native desktop use, first above human expert on OSWorld. |
| Video, audio, or mixed-modality input | Gemini 3.1 Pro | Only frontier model with native video and audio input. |
| Cost-sensitive scaling at similar intelligence | Gemini 3.1 Pro | Intelligence Index 57 at $7/M blended, versus $15/M for Opus. |
| Tool-heavy agents using MCP | Opus 4.7 | 77.3% MCP-Atlas. Anthropic wrote the protocol. |
| Defensive cybersecurity at a verified org | Mythos Preview or GPT-5.4-Cyber | Public models block many security tasks. Verified access unlocks them. |
| Stories from after mid-2025 | Opus 4.7 | Knowledge cutoff January 2026, five months ahead of GPT-5.4. |
| Pure speed on high-volume workloads | Gemini 3.1 Pro | 142 tokens per second, nearly double GPT-5.4's output rate. |
Buyers keep asking which one is best. Wrong question, though. Ask which one is best for this specific job. At this price. With this latency budget. At this access tier. The tables above answer that. Leaderboards will not.
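The workload table above collapses into a trivial routing function. The workload keys below are made up for illustration; the recommendations are the table's:

```python
# Illustrative router mirroring the workload table; the keys are invented.
ROUTING = {
    "agentic_coding":   "opus-4.7",
    "terminal_ops":     "gpt-5.4",
    "computer_control": "gpt-5.4",
    "video_audio":      "gemini-3.1-pro",
    "cost_sensitive":   "gemini-3.1-pro",
    "mcp_tools":        "opus-4.7",
    "high_volume":      "gemini-3.1-pro",
}

def pick_model(workload: str) -> str:
    """Return the table's pick, defaulting to the cheapest blended rate."""
    return ROUTING.get(workload, "gemini-3.1-pro")

print(pick_model("agentic_coding"))  # opus-4.7
```

A production version would also weigh latency budgets and access tiers, but the shape of the decision is this lookup, not a leaderboard scan.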
What to watch next
Three labs already converged on the reasoning ceiling. The next fight is about everything downstream. Anthropic has bet the premium on agentic coding and tool discipline. Google has bet its price and multimodal breadth. OpenAI has bet on computer control and unified distribution through ChatGPT, Codex, and the API at the same time. Each is also quietly building an access control system, and that is where the real interesting fights will happen over the next year.
Watch whether Anthropic's Cyber Verification Program opens beyond a handful of current partners. Watch whether OpenAI pushes GPT-5.4-Cyber deeper into infrastructure firms. Watch whether Google formally splits off a cyber variant after its safety framework flag. And watch how Anthropic handles Mythos when it eventually broadens, which it will.
The benchmark tables will keep moving. The distribution tables are where the market is now.
Frequently Asked Questions
Which model is best for agentic coding on real codebases?
Claude Opus 4.7. It jumped 10.9 points over Opus 4.6 on SWE-bench Pro to 64.3%, beating GPT-5.4 at 57.7% and Gemini 3.1 Pro at 54.2%. Customer reports from GitHub, Cursor, Notion, and Factory Droids show double-digit gains in task completion with fewer tool errors on long-running agent jobs.
Which model is cheapest at the API?
Gemini 3.1 Pro at $2 per million input tokens and $12 per million output tokens under 200K context. GPT-5.4 is next at $2.50 in and $15 out. Opus 4.7 held prices flat from Opus 4.6 at $5 in and $25 out, roughly 2.14 times the Gemini rate.
What is Claude Mythos Preview and can I use it?
Mythos Preview is Anthropic's higher-capability model, released through Project Glasswing with invitation-only access. Anthropic says it scores higher than Opus 4.7 on every axis measured, but restricts it to vetted defensive-cybersecurity partners. There is no self-serve signup.
How much did vision improve in Opus 4.7?
The pixel ceiling roughly doubled. Opus 4.7 now accepts images at 2,576 pixels on the long edge and about 3.75 megapixels, up from 1,568 pixels and 1.15 megapixels on Opus 4.6. On XBOW's visual-acuity benchmark for pen-testing, the score climbed from 54.5% to 98.5%.
Is GPT-5.4 the only model that controls a desktop natively?
Yes, at the API level. GPT-5.4 scored 75% on OSWorld, becoming the first model to beat the 72.4% human expert baseline on computer-use tasks. Claude and Gemini can drive a desktop through the API but typically need wrapper tooling to handle the control loop.


