Anthropic Opus 4.6 Tops GPT-5.2 on Knowledge Work Benchmarks by Wide Margin

Anthropic's Claude Opus 4.6 beats GPT-5.2 by 144 Elo on knowledge-work benchmarks and introduces agent teams, 1M token context, and PowerPoint integration.

Anthropic on Thursday released Claude Opus 4.6, its most powerful AI model yet, claiming state-of-the-art performance in coding, financial analysis, and information retrieval. The model outperforms OpenAI's GPT-5.2 by roughly 144 Elo points on GDPval-AA, an independent benchmark measuring knowledge-work tasks in finance, legal, and other professional domains. It also tops every other frontier model on Terminal-Bench 2.0 for agentic coding and on Humanity's Last Exam, a multidisciplinary reasoning test.

The release lands during a rough week for software stocks. Anthropic's quiet rollout of open-source legal plug-ins for its Cowork tool last Friday helped trigger what Bloomberg called a trillion-dollar market meltdown, with investors panicking over which legacy software companies AI might swallow whole. Opus 4.6 does nothing to calm those nerves. If anything, it shows Anthropic is no longer circling enterprise software. It is eating through the walls.

What the model actually does better

Opus 4.6 improves on its predecessor in ways that matter for enterprise customers who build on the API. The biggest upgrade is a 1 million token context window, a first for any Opus-class model. In practical terms, you can feed it an entire codebase or a stack of regulatory filings and the model holds information across the full span with far less drift.

Key Takeaways

• Opus 4.6 beats GPT-5.2 by 144 Elo on GDPval-AA and leads on Terminal-Bench 2.0, Humanity's Last Exam, and BrowseComp

• First Opus-class model with 1M token context window; long-context retrieval scores 76% vs. 18.5% for Sonnet 4.5

• New agent teams feature lets multiple AI agents split and coordinate tasks in parallel inside Claude Code

• Claude now works inside PowerPoint and Excel, automating the spreadsheet-to-slide workflow for enterprise customers


How much less? On the 8-needle 1M variant of MRCR v2, a needle-in-a-haystack benchmark that hides information inside vast text, Opus 4.6 scored 76%. Sonnet 4.5 managed 18.5%. That gap represents a different category of capability. Maximum output jumps from 64k to 128k tokens, which means fewer broken-up requests when you need the model to write something long.
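For developers on the API, a long-context request is an ordinary Messages call with a much larger payload. Here is a minimal sketch using the Anthropic Python SDK; the repo path is hypothetical, and the beta header value is borrowed from the earlier Sonnet 4 long-context beta, so treat it as an assumption for Opus 4.6 rather than confirmed API surface:

```python
# Hypothetical sketch: feed a whole codebase into one request.
# Assumes the 1M-token window is gated behind a beta header; the
# header value below comes from the earlier Sonnet 4 long-context
# beta and may differ for Opus 4.6.
import pathlib
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Concatenate every Python file in a repo into one big context blob.
repo = pathlib.Path("my-project")  # hypothetical project directory
codebase = "\n\n".join(
    f"# FILE: {p}\n{p.read_text()}" for p in sorted(repo.rglob("*.py"))
)

response = client.messages.create(
    model="claude-opus-4-6",  # model ID from the announcement
    max_tokens=64_000,        # Opus 4.6 allows up to 128k output tokens
    extra_headers={"anthropic-beta": "context-1m-2025-08-07"},  # assumption
    messages=[{
        "role": "user",
        "content": f"{codebase}\n\nMap every public function to its callers.",
    }],
)
print(response.content[0].text)
```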

Then there is adaptive thinking. Opus 4.6 now reads the problem and decides for itself whether to reason harder or move on. Developers get four effort levels to tune the trade-off between intelligence and speed. Scott White, Anthropic's head of product, told the Financial Times that the company plans to push hard into cybersecurity, life sciences, healthcare, and financial services. "Those are areas where we're going to lean in really hard," he said.
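Anthropic has not published the parameter surface here, so the following is a hypothetical sketch of what per-request effort tuning could look like; the `effort` field and the level names are assumptions based only on the article's mention of four levels:

```python
# Hypothetical sketch of tuning reasoning effort per request.
# The article says Opus 4.6 exposes four effort levels; the field
# name "effort" and the level names below are assumptions, not
# confirmed API surface.
import anthropic

client = anthropic.Anthropic()

def ask(prompt: str, effort: str) -> str:
    """Send one prompt at a given effort level and return the text."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4_096,
        extra_body={"effort": effort},  # assumed parameter, one of four levels
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# Cheap and fast for routine lookups, maximum effort for hard problems.
print(ask("What does HTTP status 418 mean?", effort="low"))
print(ask("Find the race condition in this scheduler...", effort="max"))
```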

Agent teams and the coding angle

Developers will notice agent teams first. Available as a research preview in Claude Code, agent teams let you spin up multiple AI agents that split a large task, work their segments in parallel, and coordinate without human prompting. Think codebase reviews, large-scale refactoring, or migrations between frameworks. White compared it to managing a team of competent engineers. You delegate; they figure out how to divide the work. In early demos, a team of four agents chewed through a repository review that would have taken a single agent most of an afternoon.

Michael Truell, CEO and co-founder of Cursor, one of Anthropic's biggest customers, told the Financial Times that Opus 4.6 stands out on "harder problems." He put it bluntly. "Stronger tenacity, better code review and it stays on long-horizon tasks where others drop off."


Anthropic has carved out a clear lead in AI-assisted coding since Claude Code hit $1 billion in revenue within six months of launch last year. Opus 4.5, released in November, set off a vibe-coding wave over the holidays. The new model extends that edge while chewing deeper into work that non-developers do.

PowerPoint, Excel, and the knowledge worker play

Opus 4.6 now works directly inside Microsoft PowerPoint as a side panel, reading your layouts, fonts, and slide masters to build or edit presentations on brand. Previously, Claude could generate a PowerPoint file, but you had to export and open it separately. The in-app integration, available as a research preview for Max, Team, and Enterprise plans, removes that step entirely. A financial analyst can process earnings data in Excel, then have Claude build the investor deck in PowerPoint without switching windows. That workflow used to require a junior associate and half an afternoon.

Claude in Excel got a significant upgrade too. The model can now plan before acting on spreadsheet tasks, ingest messy unstructured data and infer the correct structure without hand-holding, and handle multi-step changes in one pass.

This is where legacy software vendors should feel cornered. Anthropic has more than 300,000 business customers. Guillaume Princen, Anthropic's head of digital native businesses, told the FT that the company sees itself as "the leader in the enterprise market" and wants to build "real-life agents for the enterprise." When a company with that kind of momentum starts eating through the daily spreadsheet-to-slide workflow, the product managers at Microsoft and Salesforce are not watching from a comfortable distance.

OpenAI counterpunches on the same day

OpenAI released an update to Codex, its own AI coding agent, within hours of Anthropic's announcement. The timing was defensive. Both companies are chasing enterprise budgets and trying to justify enormous valuations. The FT reported this week that Anthropic is trying to raise about $20 billion at a $350 billion valuation. OpenAI wants even more: up to $830 billion, if you believe the Bloomberg numbers from last month.

The benchmark numbers explain why OpenAI felt the urgency. Opus 4.6 beating GPT-5.2 by 144 Elo on GDPval-AA, run independently by Artificial Analysis, translates to Anthropic's model winning roughly 70% of head-to-head comparisons on real-world knowledge work. On BrowseComp, OpenAI's own benchmark for finding hard-to-locate information online, Opus 4.6 also leads all competitors. Losing on your own test is the kind of result that makes a fundraising pitch uncomfortable.
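The 70% figure is not hand-waving; it falls out of the standard Elo expected-score formula, assuming GDPval-AA uses the conventional 400-point scale. A quick check:

```python
# Standard Elo expected-score formula: probability that the
# higher-rated model wins a single head-to-head comparison.
def win_probability(elo_gap: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))

print(f"{win_probability(144):.1%}")  # 69.6% -- roughly 70%
```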

API pricing has not changed: still $5/$25 per million tokens for input and output. Prompts longer than 200,000 tokens carry a surcharge.
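For budgeting, the math is simple. A minimal sketch using the rates above and the premium figures quoted in the FAQ below; it assumes the entire request bills at the premium rate once the prompt crosses 200k tokens, which matches how Anthropic priced earlier long-context betas but is not confirmed here for Opus 4.6:

```python
# Back-of-envelope API cost estimate from the rates in this article.
# Assumption: once a prompt exceeds 200k input tokens, the whole
# request bills at the premium rate (how earlier long-context betas
# were priced; not confirmed for Opus 4.6).
def request_cost(input_tokens: int, output_tokens: int) -> float:
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50   # premium, $ per million tokens
    else:
        in_rate, out_rate = 5.00, 25.00    # standard, $ per million tokens
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(f"${request_cost(150_000, 8_000):.2f}")   # standard rate: $0.95
print(f"${request_cost(800_000, 20_000):.2f}")  # long-context rate: $8.75
```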

Safety and the cybersecurity trade-off

Anthropic ran its automated behavioral audit on Opus 4.6 and tracked deception, sycophancy, and willingness to help with misuse. The company says the results look comparable to Opus 4.5, its most aligned model before this release.

But the company acknowledges a tension. Opus 4.6 shows "enhanced cybersecurity abilities," which cuts both ways. Anthropic developed six new probes to detect harmful use of those skills and said it may eventually deploy real-time intervention to block abuse. At the same time, it is actively using the model to find and patch vulnerabilities in open-source software. The defensive logic is simple enough: if attackers will use frontier models, defenders need them too. Whether the safeguards hold under pressure from state-sponsored actors and criminal organizations is a question Anthropic cannot answer from a system card alone.

Opus 4.6 is available now on claude.ai, the API, and all major cloud platforms. For developers, the model ID is claude-opus-4-6. The agent teams, PowerPoint integration, and 1M context window remain in beta or research preview. Anthropic is emboldened. The benchmarks say so. And the enterprise software vendors watching their workflows get eaten through, one integration at a time, have every reason to be nervous.

Frequently Asked Questions

Q: How much does Claude Opus 4.6 cost on the API?

A: Pricing stays the same as Opus 4.5: $5 per million input tokens and $25 per million output tokens. Prompts exceeding 200,000 tokens carry premium pricing at $10 and $37.50 per million tokens respectively.

Q: What are agent teams in Claude Code?

A: Agent teams let developers spin up multiple AI agents that split a large task into segments, work in parallel, and coordinate autonomously. Available as a research preview, they are suited for codebase reviews, refactoring, and framework migrations.

Q: How does Opus 4.6 compare to GPT-5.2?

A: On GDPval-AA, an independent benchmark measuring real-world knowledge work in finance, legal, and other domains, Opus 4.6 outperforms GPT-5.2 by 144 Elo points. That translates to Anthropic's model winning about 70% of head-to-head comparisons.

Q: What is the 1M token context window?

A: Opus 4.6 is the first Opus-class model to support 1 million tokens of context, currently in beta. This lets it process entire codebases or large document sets while retaining information with far less degradation than previous models.

Q: Is Claude in PowerPoint available now?

A: Claude in PowerPoint launched as a research preview for Max, Team, and Enterprise plan subscribers. It works as a side panel inside PowerPoint, reading existing layouts and slide masters to build or edit presentations on brand.
