Claude Opus 4.5: Everything You Need to Know About Anthropic's New Flagship
Anthropic's Opus 4.5 reclaims coding leadership and cuts prices 67%. But security tests show 22-point variance between evaluations, and one developer couldn't distinguish it from Sonnet in real work. Here's what actually matters.
Anthropic's third major model in eight weeks arrives with price cuts, safety improvements, and benchmark claims that merit scrutiny.
Anthropic released Claude Opus 4.5 on Monday, completing its 4.5 model series after Sonnet 4.5 in September and Haiku 4.5 in October. The company calls it "the best model in the world for coding, agents, and computer use." That's a bold claim in a month that also saw OpenAI ship GPT-5.1-Codex-Max and Google launch Gemini 3 Pro.
The headline numbers look impressive. An 80.9% score on SWE-bench Verified, the industry's preferred coding benchmark. A 67% price reduction from the previous Opus. New integrations with Chrome and Excel. Memory improvements that enable what Anthropic calls "infinite chat."
But headlines compress complexity. The security evaluations show inconsistent refusal rates depending on which test you examine. The "beat all humans" claim on an engineering exam relies on methodology most users won't parse. And at least one prominent developer found the real-world difference from Sonnet 4.5 harder to detect than benchmarks suggest.
Here's what actually matters about Opus 4.5, broken down by what you're trying to accomplish.
The Breakdown
• Pricing drops 67% to $5/$25 per million tokens, making Opus viable for production workloads previously limited to cheaper models like Sonnet.
• Opus 4.5 scores 80.9% on SWE-bench Verified, reclaiming coding benchmark leadership from Google's Gemini 3 Pro by 4.7 percentage points.
• Security evaluations show inconsistent results: 100% refusal rate in one coding test, but only 78% in Claude Code malware scenarios.
• Developer Simon Willison found no distinguishable performance difference between Opus 4.5 and Sonnet 4.5 during two days of production coding.
The Coding Picture
Opus 4.5 scores 80.9% on SWE-bench Verified, reclaiming the top position from Google's Gemini 3 Pro (76.2%) released last week. OpenAI's GPT-5.1-Codex-Max sits at 77.9%. Anthropic's own Sonnet 4.5 manages 77.2%. These are close margins at the top of the leaderboard.
The more striking claim involves Anthropic's internal engineering assessment. The company gives prospective performance engineering candidates a notoriously difficult take-home exam with a two-hour time limit. Using a technique called parallel test-time compute, which aggregates multiple attempts and selects the best result, Opus 4.5 scored higher than any human candidate in Anthropic's history.
Two caveats deserve attention. First, without parallel test-time compute, and even with the time limit removed, the model only matched the best-ever human candidate rather than beating every one. Second, Anthropic acknowledges the test doesn't measure collaboration, communication, or professional instincts. The result signals something real about AI capability trajectories; it doesn't mean Opus 4.5 replaces senior engineers.
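For readers who want the methodology made concrete, here is a minimal sketch of the best-of-N pattern behind parallel test-time compute. The `generate_candidate` and `score` functions are toy stand-ins, not Anthropic's harness; a real run would call the model for each attempt and grade it against the exam's test suite or rubric.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate_candidate(problem: str, seed: int) -> str:
    # Stand-in for one independent model attempt; a real run would call the model API here.
    rng = random.Random(seed)
    return f"candidate {seed} (quality {rng.random():.2f})"

def score(candidate: str) -> float:
    # Stand-in for a verifier, e.g. a hidden test suite or grading script.
    return float(candidate.split("quality ")[1].rstrip(")"))

def best_of_n(problem: str, n: int = 16) -> str:
    """Parallel test-time compute: sample n independent attempts, keep the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda i: generate_candidate(problem, i), range(n)))
    return max(candidates, key=score)

print(best_of_n("take-home performance engineering exam", n=8))
```

Normal single-attempt use corresponds to n = 1, which is why the aggregated result overstates what a user typing one prompt will see.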
On Terminal Bench, which measures command-line task completion, Warp reports a 15% improvement over Sonnet 4.5. GitHub's chief product officer Mario Rodriguez says early testing shows Opus 4.5 "surpasses internal coding benchmarks while cutting token usage in half." Multiple customers report the model handles complex refactoring, code migration, and multi-file changes more reliably than predecessors.
The efficiency claims are specific and verifiable. At medium effort settings, Opus 4.5 matches Sonnet 4.5's best SWE-bench score while using 76% fewer output tokens. At high effort, it exceeds Sonnet 4.5 by 4.3 percentage points while still using 48% fewer tokens. For production workloads where token costs compound, these numbers matter.
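Assuming those token reductions carry over to your workload, which is a genuine assumption rather than a guarantee, the effective output cost is easy to check:

```python
# Output prices per million tokens (USD), from the figures above.
SONNET_OUT = 15.0
OPUS_OUT = 25.0

def opus_effective_cost(sonnet_output_tokens: int, token_reduction: float) -> float:
    """Output cost for Opus if it emits `token_reduction` fewer tokens than Sonnet on the same work."""
    opus_tokens = sonnet_output_tokens * (1 - token_reduction)
    return opus_tokens / 1e6 * OPUS_OUT

sonnet_tokens = 1_000_000  # whatever Sonnet emits for a batch of tasks
print(f"Sonnet 4.5:              ${sonnet_tokens / 1e6 * SONNET_OUT:.2f}")
print(f"Opus 4.5, high effort:   ${opus_effective_cost(sonnet_tokens, 0.48):.2f}")  # 48% fewer tokens
print(f"Opus 4.5, medium effort: ${opus_effective_cost(sonnet_tokens, 0.76):.2f}")  # 76% fewer tokens
# Prints roughly $15.00, $13.00, and $6.00 respectively.
```

Under those assumptions, Opus output actually comes out cheaper per unit of work than Sonnet despite the higher list price. Whether the reductions hold outside SWE-bench-style tasks is the open question.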
Pricing That Changes the Calculus
Opus 4.5 costs $5 per million input tokens and $25 per million output tokens. The previous Opus 4.1 ran $15/$75. That's a 67% reduction that fundamentally changes which workloads justify the flagship model.
Context for comparison: Sonnet 4.5 costs $3/$15. Haiku 4.5 runs $1/$5. Google's Gemini 3 Pro charges $2/$12 for standard context, $4/$18 for contexts exceeding 200,000 tokens. OpenAI's GPT-5.1 family sits at $1.25/$10.
Opus remains more expensive per token than competitors. The efficiency improvements complicate direct comparisons. If Opus uses 50% fewer tokens for equivalent work, effective costs approach parity with cheaper models. The math depends entirely on your specific workload and how the new "effort" parameter translates to actual token consumption.
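For a rough comparison at list prices, ignoring efficiency differences, caching, and batch discounts, the per-request arithmetic is straightforward. The token counts below are illustrative; plug in your own:

```python
# (input $/M tokens, output $/M tokens) at standard context, per the figures above.
PRICES = {
    "Opus 4.5":     (5.00, 25.00),
    "Sonnet 4.5":   (3.00, 15.00),
    "Haiku 4.5":    (1.00,  5.00),
    "Gemini 3 Pro": (2.00, 12.00),
    "GPT-5.1":      (1.25, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """List-price cost of a single request for the given model."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# Example: an agentic coding request with 30k tokens in, 8k tokens out.
for model in PRICES:
    print(f"{model:>12}: ${request_cost(model, 30_000, 8_000):.4f}")
```

At list prices Opus is the most expensive per request; the comparison only flips if the token-efficiency gains quoted above materialize for your workload.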
For Claude.ai subscribers, the changes created immediate confusion. Anthropic removed Opus-specific caps, stating Max and Team Premium users now have "roughly the same number of Opus tokens as you previously had with Sonnet." But Opus consumes more of that allocation per interaction. Multiple users report unclear allocation structures and automatic model switches without obvious opt-out mechanisms.
API customers face simpler economics. Pay per token at the stated rates. Use as much as you want. The subscription tier complexity only affects consumer and prosumer plans.
Security: The Numbers That Matter
Anthropic describes Opus 4.5 as "the most robustly aligned model we have released to date and, we suspect, the best-aligned frontier model by any developer." The supporting evidence comes from internal evaluations, with one external benchmark from Gray Swan on prompt injection.
The prompt injection results show genuine improvement. At a single attempt, attacks succeed roughly 5% of the time against Opus 4.5, compared to 7.3% for Sonnet 4.5 and 12.5% for Gemini 3 Pro. At ten attempts, success rates climb to 33.6% for Opus 4.5 versus 41.9% for Sonnet 4.5 and 60.7% for Gemini 3 Pro. Better than competitors. Still far from solved.
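A quick back-of-the-envelope shows why repeated attempts matter so much. If attempts were independent (real attacks typically aren't, which is why the measured ten-attempt figures land below this naive estimate), the single-attempt rates would compound like this:

```python
def naive_k_attempt_rate(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p_single) ** k

for name, p in [("Opus 4.5", 0.05), ("Sonnet 4.5", 0.073), ("Gemini 3 Pro", 0.125)]:
    print(f"{name:>12}: {naive_k_attempt_rate(p, 10):.1%} naive rate at 10 attempts")
# Roughly 40%, 53%, and 74% -- versus the measured 33.6%, 41.9%, and 60.7%.
```

Either way, every additional attempt an attacker gets raises the odds materially, for every model.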
The malicious use evaluations reveal more variance than the marketing suggests. In one "agentic coding evaluation" assessing 150 prohibited requests, Opus 4.5 refused 100% of the time. In Claude Code tests involving malware creation, DDoS attacks, and non-consensual monitoring software, the refusal rate dropped to 78%. For computer use scenarios involving surveillance, data collection, and harmful content generation, Opus 4.5 refused just over 88% of requests.
That's a 22-percentage-point gap between coding evaluations depending on which test you examine. The system card provides details. The announcement emphasizes the favorable numbers.
For anyone deploying agents with tool access, these distinctions matter. A model that refuses malicious requests 100% of the time in controlled conditions but only 78% of the time in Claude Code environments presents different risk profiles than the headline suggests.
Knowledge and Context
Opus 4.5 maintains the same 200,000-token context window as Sonnet 4.5, with a 64,000-token output limit. The knowledge cutoff is March 2025, slightly more recent than Sonnet's January cutoff and Haiku's February cutoff.
The meaningful change involves how the model manages long conversations. Anthropic introduced what it calls "infinite chat," which eliminates the frustrating context-limit errors that previously terminated extended sessions. The implementation is automatic summarization: when a conversation approaches the limit, Claude compresses earlier portions to free space for continuation.
This solves a real pain point. It doesn't preserve conversation content. Summarization loses information by design. Details that seemed unimportant during compression might become relevant later, with no mechanism for retrieval. For casual use, the tradeoff probably works. For professional work requiring precise reference to earlier exchanges, the limitation matters.
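Conceptually, the compaction pattern looks something like the sketch below. This is an illustration of the general technique, not Anthropic's implementation; `summarize` stands in for a model call, and the token counter is a crude word-count proxy.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str

def estimate_tokens(messages: list[Message]) -> int:
    # Crude proxy; a real system would use the model's tokenizer.
    return sum(len(m.content.split()) for m in messages)

def summarize(messages: list[Message]) -> Message:
    # Hypothetical stand-in for a model call that compresses older turns.
    # Whatever the summary omits is gone: the originals cannot be recovered later.
    return Message("system", f"[summary of {len(messages)} earlier messages]")

def compact_if_needed(history: list[Message], limit: int, keep_recent: int = 10) -> list[Message]:
    """When the conversation nears the context limit, replace older turns with a summary."""
    if estimate_tokens(history) < limit:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent
```

The lossy step is the summarize call: anything it drops is exactly the detail that, per the tradeoff above, can't be referenced later.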
Anthropic's head of product management for research, Dianne Na Penn, told TechCrunch that context windows alone aren't sufficient. "Knowing the right details to remember is really important in complement to just having a longer context window." The comment acknowledges both the improvement and its boundaries.
For agentic use cases, Anthropic made infrastructure changes that compound. Tool Search Tool defers loading tool definitions until needed, reducing context consumption from approximately 55,000 tokens to roughly 3,000 for typical setups. Programmatic Tool Calling lets Claude write Python to orchestrate tools, keeping intermediate results out of context. Combined, these changes enable agents that would previously fail from context overflow.
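The savings from deferred tool loading are easier to see in a sketch. The registry, search function, and schemas below are illustrative, not the actual Tool Search Tool API; the point is that only short descriptions stay in context until a tool is actually needed.

```python
# Conceptual sketch of deferred tool loading: instead of sending every tool schema
# with each request, keep a searchable registry and load full definitions on demand.

TOOL_REGISTRY = {
    # name -> (one-line description, full schema the model sees only if loaded)
    "query_database": ("Run a read-only SQL query", {"...": "large schema"}),
    "send_email":     ("Send an email to a contact", {"...": "large schema"}),
    "create_ticket":  ("Open an issue in the tracker", {"...": "large schema"}),
    # ...imagine dozens more, tens of thousands of tokens in total
}

def search_tools(query: str) -> list[str]:
    """Cheap keyword search over short descriptions; only names and blurbs stay in context."""
    q = query.lower()
    return [name for name, (desc, _) in TOOL_REGISTRY.items() if q in desc.lower() or q in name]

def load_tool_definitions(names: list[str]) -> list[dict]:
    """Pull full schemas into context only for the tools the agent actually needs."""
    return [TOOL_REGISTRY[n][1] for n in names]

matches = search_tools("sql")
print(matches, load_tool_definitions(matches))
```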
Real-World Testing: The Simon Willison Report
Simon Willison, a prominent developer and AI commentator, had preview access to Opus 4.5 over the weekend. He used it extensively in Claude Code, producing an alpha release of sqlite-utils that involved 20 commits, 39 files changed, 2,022 additions, and 1,173 deletions across two days.
His assessment: "It's clearly an excellent new model." His caveat reveals something benchmarks obscure.
When his preview expired Sunday evening with work remaining, Willison switched back to Sonnet 4.5 and "kept on working at the same pace I'd been achieving with the new model." Production coding, he concluded, is a less effective way of evaluating model differences than he'd expected.
"I'm not saying the new model isn't an improvement on Sonnet 4.5," Willison wrote, "but I can't say with confidence that the challenges I posed it were able to identify a meaningful difference in capabilities between the two."
The observation points to a growing evaluation problem. Frontier models beat each other by single-digit percentage points on benchmarks, but translating those margins into the real-world tasks users solve daily is increasingly difficult. Willison wants AI labs to accompany new releases with concrete examples of prompts that failed on previous models but succeed on new ones: "'Here's an example prompt which failed on Sonnet 4.5 but succeeds on Opus 4.5' would excite me a lot more than some single digit percent improvement on a benchmark."
The industry hasn't delivered that kind of demonstration. Until it does, the gap between benchmark leadership and practical utility remains hard to assess.
How It Compares
Against Gemini 3 Pro, released last week: Opus 4.5 wins on SWE-bench Verified (80.9% versus 76.2%) and prompt injection resistance. Gemini 3 Pro wins on GPQA Diamond, a graduate-level reasoning test (91.9% versus 87.0%). Gemini costs roughly half as much per token. The tradeoff depends on whether your workload emphasizes coding or general reasoning.
Against GPT-5.1-Codex-Max, released five days earlier: Opus 4.5 edges ahead on SWE-bench (80.9% versus 77.9%). OpenAI's model can work autonomously for up to 24 hours, a capability Anthropic hasn't matched. GPT-5.1 pricing ($1.25/$10) undercuts Opus significantly. For extended autonomous coding sessions, OpenAI's approach may suit certain workflows better despite lower benchmark scores.
Against Sonnet 4.5, Anthropic's own mid-tier model: The benchmark gap is narrower than flagship positioning suggests. Sonnet scores 77.2% on SWE-bench versus Opus's 80.9%. At $3/$15 per million tokens, Sonnet costs roughly 40% less than Opus. For many production workloads, particularly those sensitive to latency or cost, Sonnet remains the pragmatic choice.
None of these models has appeared yet on LMArena, the crowdsourced evaluation platform whose rankings are harder to game than lab-controlled benchmarks. The absence is notable: real comparative assessment requires third-party testing.
The Valuation Context
Microsoft and Nvidia announced multi-billion-dollar investments in Anthropic last week, boosting the company's valuation to approximately $350 billion. Anthropic reached $2 billion in annualized revenue during Q1 2025, more than doubling from $1 billion the prior period. Customers spending over $100,000 annually increased eightfold year-over-year.
At $2 billion ARR, a $350 billion valuation implies 175x revenue. Even with 2x year-over-year growth, the multiple requires sustained hypergrowth assumptions. Whether Opus 4.5 and its successors justify that valuation depends on capturing enterprise workloads that currently don't exist at scale.
The model improvements are real. The competitive position is contested. The economics remain speculative.
Why This Matters
For developers: The 67% price reduction makes Opus viable for workloads previously limited to Sonnet. The efficiency improvements compound the savings. Tool Search and Programmatic Tool Calling address real production pain points. Whether the benchmark gains translate to your specific tasks requires testing.
For enterprise buyers: Claude for Excel reaching general availability creates measurable automation opportunities. The Microsoft Azure availability, despite Microsoft's competing Copilot products, expands deployment options. Security evaluation variance between test scenarios warrants attention before deploying agents with sensitive tool access.
For the industry: Benchmark saturation is becoming visible. Three frontier models landing within five percentage points of each other on SWE-bench suggests the metric is approaching its useful ceiling. Real-world differentiation, as Willison observed, is harder to demonstrate than benchmark leadership. The company that solves evaluation may matter more than the company that wins the next benchmark.
❓ Frequently Asked Questions
Q: What is "parallel test-time compute" and why does it matter for the human benchmark claim?
A: Parallel test-time compute runs multiple attempts at the same problem and selects the best result. Anthropic used this method when Opus 4.5 beat all human candidates on their engineering exam. Without it, and without time limits, the model only matched the best-ever human candidate. The technique inflates apparent performance beyond what users experience in normal single-attempt use.
Q: How does the "effort" parameter work?
A: The effort parameter lets developers trade speed and cost for quality. Set to "low" or "medium," Opus responds faster using fewer tokens. At "medium," it matches Sonnet 4.5's best SWE-bench score while using 76% fewer output tokens. At "high" (the default), it exceeds Sonnet by 4.3 percentage points while still using 48% fewer tokens. It's essentially a quality dial.
Q: What does "infinite chat" actually lose when it compresses conversations?
A: When conversations hit context limits, Claude summarizes earlier portions to free space. This compression loses information by design. Specific details, calculations, or exchanges that seemed unimportant during summarization cannot be retrieved later. For casual chat, this works fine. For professional work requiring precise reference to earlier discussion, you lose access to specifics the model deemed expendable.
Q: Why do the security test results vary so much between evaluations?
A: Different tests measure different scenarios. The 100% refusal rate came from an "agentic coding evaluation" with 150 prohibited requests. The 78% rate came from Claude Code tests specifically involving malware creation, DDoS attacks, and monitoring software. The 88% rate tested computer use for surveillance and data collection. Real-world risk depends on which scenario matches your deployment.
Q: What is LMArena and why does its absence matter?
A: LMArena is a crowdsourced AI model evaluation platform where users compare model outputs without knowing which model produced them. Rankings emerge from real user preferences rather than lab-controlled benchmarks. Opus 4.5, Gemini 3 Pro, and GPT-5.1-Codex-Max haven't appeared there yet. Until they do, comparative claims rest entirely on company-reported benchmarks, which are easier to optimize for.