Anthropic’s Sonnet 4.5 tops coding tests, runs for 30 hours

Anthropic launched Sonnet 4.5 claiming it's the best coding model, with 30-hour autonomous runs and major infrastructure updates. The release lands days before OpenAI's developer event, and early tests reveal a gap: benchmark wins don't automatically translate to deployment success.

Claude Sonnet 4.5: Anthropic's Bid for Coding Supremacy

💡 TL;DR - The 30-Second Version

🚀 Anthropic released Claude Sonnet 4.5 on September 29, claiming it's the world's best coding model with 30-hour autonomous runtime (up from seven hours for Opus 4, released in May) at the same $3/$15 pricing.

📊 Sonnet 4.5 scored 77.2% on SWE-bench Verified (82% with parallel compute), beating GPT-5's 71.4% and Gemini 2.5 Pro's 69.8%, while leading OSWorld computer use tests at 61.4% versus 42.2% four months ago.

🔧 The release bundles context editing and memory tools that improve agent performance 39%, plus Claude Code checkpoints, VS Code extension, and the rebranded Claude Agent SDK for building custom agents beyond coding.

💰 Anthropic closed a $13 billion Series F at a $183 billion valuation in September, with Claude Code generating a $500 million run-rate as sovereign wealth funds increasingly back frontier AI development.

⚖️ Early tests show mixed results—Sonnet 4.5 finished code reviews in two minutes versus GPT-5 Codex's ten, but struggled more on complex bug hunts, revealing the persistent gap between benchmark performance and production deployment.

🎯 The release arrives days before OpenAI's developer event, as companies shift from pure model competition to infrastructure control—recognizing that deployment complexity, not just capability, determines enterprise value capture.

Priced like Sonnet 4, the new model arrives with agent tooling, context management, and tighter safety.

Anthropic shipped Claude Sonnet 4.5 today with a declaration that invites scrutiny: it's the best coding model in the world. The $183 billion startup says the model can code autonomously for 30 hours straight—up from seven hours for Opus 4, released in May. Pricing holds at $3/$15 per million tokens, matching the previous Sonnet.

The release arrives days before OpenAI's annual developer event. Microsoft added Claude models to Copilot 365 last week. OpenAI's Sam Altman conceded days earlier that Anthropic offers "the best AI for work-related tasks." The timing looks deliberate.

Sonnet 4.5 bundles performance claims with infrastructure updates across Anthropic's product stack and a new safety framework requiring stricter controls. Early testing shows the model excels at specific tasks while revealing gaps between benchmark scores and production deployment. The pattern reshaping the industry: models improve faster than enterprises can absorb them.

The benchmark narrative

Anthropic grounds its "best coding model" claim in SWE-bench Verified, which tests real-world software engineering using GitHub pull requests. Sonnet 4.5 scored 77.2% in standard configuration. With parallel test-time compute—running multiple attempts simultaneously and discarding failures—it reached 82%. OpenAI's GPT-5 scored 71.4%. Google's Gemini 2.5 Pro managed 69.8%.
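Anthropic doesn't detail the selection mechanism behind "parallel test-time compute," but the general pattern is best-of-n sampling. A minimal sketch follows, under stated assumptions: generate stands in for one model attempt at a patch and passes_tests for whatever filter discards failed attempts; both are placeholders, not Anthropic's actual harness.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

def best_of_n(generate: Callable[[], str],
              passes_tests: Callable[[str], bool],
              n: int = 8) -> Optional[str]:
    """Parallel test-time compute, sketched: sample n candidate patches
    concurrently, then return the first one that survives the filter.
    Both callables are caller-supplied placeholders."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: generate(), range(n)))
    for patch in candidates:
        if passes_tests(patch):
            return patch
    return None  # every attempt failed; fall back to single-shot output
```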

OSWorld, measuring computer use capabilities, shows steeper gains. Sonnet 4.5 leads at 61.4%. Four months ago, Sonnet 4 topped the board at 42.2%. Opus 4.1, released in August, scored around 44%.

The 30-hour autonomy claim requires context. Anthropic demonstrated the model rebuilding Claude.ai's web app—a task consuming five and a half hours and 3,000 tool calls. Simon Willison, an independent developer, initiated a complex coding task from his phone and watched Sonnet 4.5 check out his repository, install dependencies, run tests, and implement new features across dozens of operations without intervention.

Head-to-head tests complicate the picture. Dan Shipper at Every compared Sonnet 4.5 against GPT-5 Codex on code review. Sonnet 4.5 finished in two minutes. GPT-5 Codex took ten. Yet Shipper noted GPT-5 Codex handles "the trickiest production bug hunts" more reliably. Speed measures one dimension of capability, not comprehensive superiority.

Infrastructure as competitive wedge

The model release packages significant product updates, reflecting a strategic shift. Anthropic added checkpoints to Claude Code, letting developers save progress and revert when the model generates broken code. A native VS Code extension shows inline diffs as Claude makes real-time changes. The terminal interface received upgrades, including searchable prompt history.

The API gained context editing and memory tools addressing a fundamental constraint: context windows have limits, but real work doesn't. Context editing automatically removes stale tool calls when approaching token caps, effectively extending runtime. Memory stores information outside the context window through file-based systems that persist across sessions.
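Anthropic doesn't publish the pruning logic in this announcement, so the Python below is an illustrative approximation rather than the API's actual behavior: once a crude token estimate nears the cap, the oldest tool results get stubbed out while recent turns survive intact. The 4-characters-per-token estimate and the message shape are assumptions.

```python
# Illustrative sketch of context editing; not Anthropic's implementation.
# Assumption: ~4 characters per token. Real systems use a tokenizer.

def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate for a message history."""
    return sum(len(str(m.get("content", ""))) for m in messages) // 4

def prune_stale_tool_results(messages: list[dict],
                             token_cap: int = 180_000,
                             keep_recent: int = 10) -> list[dict]:
    """Stub out the oldest tool results once the history nears the cap.

    Recent turns are kept verbatim so the model retains working context;
    older tool payloads are replaced with a marker, so the transcript
    still records that each call happened.
    """
    if estimate_tokens(messages) < token_cap:
        return messages
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    pruned = [
        {"role": "tool", "content": "[stale result elided]"}
        if m.get("role") == "tool" else m
        for m in head
    ]
    return pruned + tail
```

Old tool payloads dominate long agent transcripts, which is why eliding them can cut token use so sharply.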

On Anthropic's internal agentic search evaluation, combining memory with context editing improved performance 39% over baseline. Context editing alone delivered 29% gains. In a 100-turn web search test, context editing enabled workflows that would otherwise fail while reducing token consumption 84%.

Anthropic rebranded its Claude Code SDK as the Claude Agent SDK, emphasizing broader application. The SDK includes agent orchestration, memory management, tool usage, and permission systems—the same foundation powering Claude Code, which generates over $500 million in run-rate revenue with 10x usage growth over three months.
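The SDK's internals aren't shown in the announcement, but the loop it packages is worth seeing concretely. Here is a minimal tool-use loop written directly against Anthropic's public Messages API; the run_shell tool, its schema, and the turn budget are invented for illustration, and the model ID should be verified against the live models list.

```python
import subprocess
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One illustrative tool; the Agent SDK layers many, plus permission checks.
TOOLS = [{
    "name": "run_shell",
    "description": "Run a shell command and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]

def execute_tool(name: str, args: dict) -> str:
    # Real agents sandbox this; running arbitrary shell commands is unsafe.
    if name == "run_shell":
        proc = subprocess.run(args["command"], shell=True,
                              capture_output=True, text=True, timeout=60)
        return proc.stdout + proc.stderr
    return f"unknown tool: {name}"

def agent_loop(task: str, max_turns: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-5",  # ID assumed; verify before use
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text  # model finished on its own
        # Echo the assistant turn, run each requested tool, feed results back.
        messages.append({"role": "assistant", "content": response.content})
        results = [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": execute_tool(block.name, block.input),
        } for block in response.content if block.type == "tool_use"]
        messages.append({"role": "user", "content": results})
    return "turn budget exhausted"
```

Checkpoints, persistent memory, and permission prompts are what the SDK adds on top of a loop like this; the loop itself is the commodity part.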

The message: model capability commoditizes; infrastructure creates lock-in. Anthropic chief product officer Mike Krieger stated this plainly: "A few things need to happen" before companies realize AI's full value. That includes models improving, workflow adaptation, and "a deeper level of partnership between some of the frontier labs and these enterprises."

Translation: deployment complexity remains the bottleneck. Owning the pipes matters more than winning benchmarks.

The alignment tradeoff

Anthropic positions Sonnet 4.5 as its "most aligned frontier model yet." Extensive safety training reduced concerning behaviors: sycophancy, deception, power-seeking, encouraging delusional thinking. Internal testing shows Sonnet 4.5 scoring substantially lower on misaligned behaviors versus previous models. Defenses against prompt injection attacks—when adversaries use crafted language to hijack model behavior—strengthened considerably.

But alignment introduces friction. Anthropic released Sonnet 4.5 under its AI Safety Level 3 framework, including classifiers detecting dangerous inputs around chemical, biological, radiological, and nuclear weapons. These filters generate false positives. When triggered, they redirect users to Sonnet 4, a lower-risk model.

Anthropic says it has reduced false positives by a factor of ten since the classifiers' initial rollout, and by half since Opus 4 launched, and that it's "continuing to make progress" on discernment. Cybersecurity and biological research customers can join an allowlist through account teams.

Both conditions hold: the model is safer and more restrictive. The equilibrium between capability and control remains unsettled.

Capital structure signals strategy

Anthropic closed its Series F at a $183 billion valuation in September, raising $13 billion. Amazon remains a major backer, but sovereign wealth funds increasingly participate. The pattern extends across frontier AI: as capital requirements compound and geopolitical stakes rise, nation-states fund development through investment vehicles.

Anthropic also agreed this month to pay $1.5 billion to settle a lawsuit by authors over alleged copyright infringement. The settlement removes legal uncertainty shadowing multiple AI companies.

The financial backdrop clarifies product decisions. Sonnet 4.5 maintains Sonnet 4 pricing despite performance gains. Anthropic optimizes for adoption over margin. Claude Code's $500 million run-rate and 10x usage growth validate the approach—developers select the model solving their problems, then expand usage across organizations.

Profitability remains distant. Training runs cost hundreds of millions. Inference at scale consumes capital continuously. Amazon, Google, and Microsoft subsidize AI losses with cloud revenue. Anthropic and OpenAI require continuous funding until revenue catches up with costs.

The deployment question

Multiple studies in recent weeks questioned whether AI delivers measurable business value. Skepticism centers on a persistent gap: impressive demos don't translate to production workflows. Models fail on edge cases. Integration proves complex. ROI remains elusive beyond narrow applications.

Early Sonnet 4.5 adopters suggest progress on specific dimensions. Canva, serving 240 million users, said the model helps with "complex, long-context tasks—from engineering in our codebase to in-product features and research." HackerOne reported Sonnet 4.5 reduced average vulnerability intake time for security agents 44% while improving accuracy 25%. Cursor CEO Michael Truell noted "state-of-the-art coding performance with significant improvements on longer horizon tasks."

These claims await independent verification. The pattern emerging: Sonnet 4.5 excels at defined tasks within existing workflows. Canva engineers navigating large codebases. Security teams processing vulnerability reports. Developers refactoring code. Practical automation of technical work, not transformative business redesign.

Chief science officer Jared Kaplan framed this intentionally: Sonnet 4.5 is "more of a colleague" that's "kind of fun to work with when encountering problems." The anthropomorphization aside, the positioning matters: useful tool, not autonomous worker.

Velocity creates instability

Anthropic shipped Opus 4 and Sonnet 4 in May. Opus 4.1 in August. Sonnet 4.5 in September. Kaplan said it's "very likely" that one or two more releases arrive before year-end, probably including a new Opus.

OpenAI released GPT-5 in August. Google ships Gemini updates continuously. Competitive dynamics force rapid iteration. Each release resets benchmark hierarchies. Models leapfrog on specific evaluations while struggling elsewhere.

The velocity compounds enterprise challenges. Production systems require stability. Developers need consistent APIs. But model improvements arrive monthly, tempting switches that introduce new failure modes. Anthropic addressed this by letting paid subscribers choose older Sonnet generations "if they aren't ready to migrate overnight."
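For API customers, the equivalent escape hatch is version pinning: ship against a dated model snapshot and promote the new release only after your own regression suite passes. A brief sketch, with both ID strings assumed rather than confirmed:

```python
import anthropic

client = anthropic.Anthropic()

# Pin a dated snapshot in production; move to the newer model only after
# regression tests pass against it. Both ID strings below are assumptions;
# confirm them against the provider's models endpoint.
PINNED_MODEL = "claude-sonnet-4-20250514"
CANDIDATE_MODEL = "claude-sonnet-4-5"

response = client.messages.create(
    model=PINNED_MODEL,  # swap to CANDIDATE_MODEL once validated
    max_tokens=1024,
    messages=[{"role": "user", "content": "Review this diff for bugs: ..."}],
)
print(response.content[0].text)
```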

The tension: faster innovation makes deployment harder. Companies solving this—through infrastructure, migration paths, predictable behavior—capture enterprise value. Companies optimizing only for benchmark wins lose customers to operational chaos.

Where incentives converge

Anthropic wants developer mindshare that drives enterprise adoption and justifies its capital raises. Developers want reliable tools solving specific problems without operational debt. Enterprises want measurable ROI and manageable risk.

Sonnet 4.5's positioning threads that needle: cheaper than Opus, better than Opus on most tasks, same price as the previous Sonnet, bundled with infrastructure reducing deployment friction. Safety improvements reduce legal and reputational exposure. The 30-hour autonomy claim signals capability without promising full automation.

Microsoft's decision to add Claude to Copilot validates the strategy. OpenAI's acknowledgment that Anthropic leads on work tasks confirms it. Whether Anthropic maintains the lead as OpenAI, Google, and others invest billions catching up depends less on model capability than on execution: reliable inference, predictable behavior, enterprise support, ecosystem development.

Anthropic's infrastructure investments—the Agent SDK, context management, memory tools—suggest the company understands this. The race remains open. Gemini 3 reportedly ships soon. OpenAI's developer event next week will showcase new capabilities. The frontier moves fast enough that "best coding model in the world" describes a weeks-long position, not a durable advantage.

The structural question persists: as models improve faster than enterprises deploy them, which companies capture value? The answer depends on solving operational complexity, not just winning benchmarks. Training runs and inference costs require revenue from sticky, production-deployed systems rather than one-time API calls. Anthropic's bundled release—model plus infrastructure plus safety framework—suggests the company is betting on comprehensive platforms over isolated models.

The bet makes sense. Developers already building on Claude Code don't switch easily. Memory and context management create stickiness. The Agent SDK enables custom agents using Anthropic's infrastructure. Each layer makes migration harder and lock-in deeper.

From competitors' perspectives, the timing matters. OpenAI's developer event next week will likely announce comparable capabilities. Google's Gemini 3 is rumored to advance computer use significantly. The benchmark leaderboard reshuffles continuously. But if Anthropic builds sufficient enterprise deployment momentum before competitors catch up, technical parity may not dislodge them.

The irony: in racing to build artificial general intelligence, the frontier labs discovered they're actually competing on enterprise software fundamentals—reliability, integration, support, migration paths. The companies that realize this fastest will capture disproportionate value as the technology matures.

Why this matters:

• Model releases accelerate while deployment challenges compound—companies solving operational complexity rather than just benchmark performance capture enterprise value and justify massive capital raises at sustainable margins.

• Anthropic's infrastructure investments signal a strategic shift from pure model competition to ecosystem control, recognizing that training costs and inference expenses require revenue from sticky, production-deployed systems rather than commoditized API access.

❓ Frequently Asked Questions

Q: How does Sonnet 4.5's pricing compare to GPT-5 and other competitors?

A: Sonnet 4.5 costs $3 per million input tokens and $15 per million output tokens—matching previous Sonnet pricing. That's significantly more than GPT-5 and GPT-5 Codex at $1.25/$10, but cheaper than Claude Opus at $15/$75. Anthropic prioritizes adoption over margins, keeping prices steady despite performance gains to encourage enterprise switching.
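The per-token arithmetic is easy to run for any workload. The sketch below prices a hypothetical long agent session at the rates quoted above; the token counts are invented for illustration.

```python
# Hypothetical workload; token counts are invented for illustration.
input_tokens = 2_000_000
output_tokens = 400_000

# (input $/M tokens, output $/M tokens), as quoted in this article.
rates = {
    "Sonnet 4.5":  (3.00, 15.00),
    "GPT-5":       (1.25, 10.00),
    "Claude Opus": (15.00, 75.00),
}

for model, (in_rate, out_rate) in rates.items():
    cost = input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate
    print(f"{model:12} ${cost:,.2f}")
# -> Sonnet 4.5: $12.00, GPT-5: $6.50, Claude Opus: $60.00
```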

Q: What happens when the AI Safety Level 3 filters trigger false positives?

A: When classifiers detect potential CBRN-related content (chemical, biological, radiological, nuclear weapons), the conversation stops and the user is redirected to Sonnet 4, a lower-risk model. Anthropic reduced these false positives by 10x since initial rollout and by 50% since Opus 4 launched. Cybersecurity and biological research customers can join an allowlist to avoid interruptions.

Q: Why does 30-hour autonomous runtime matter for developers?

A: Longer runtime lets the model tackle complex projects without human intervention—like rebuilding entire web apps or implementing features across large codebases. Opus 4 managed seven hours before needing checkpoints. Thirty hours means developers can assign tasks overnight or over weekends, checking back only when work completes. This transforms AI from assistant to colleague handling multi-day projects independently.

Q: What's the difference between the Claude Agent SDK and the old Claude Code SDK?

A: Anthropic rebranded it to emphasize broader application beyond coding. The SDK includes the same infrastructure powering Claude Code—agent orchestration, memory management, tool usage, permission systems—but developers can build any agent type, not just coding assistants. This shift signals Anthropic's move from single-purpose tools to platform control, creating lock-in across use cases.

Q: How does context editing prevent the model from hitting token limits?

A: Context editing automatically removes stale tool calls and results when approaching the context window cap, keeping only relevant information. In a 100-turn web search test, this enabled workflows that would otherwise fail due to context exhaustion while reducing token consumption 84%. Combined with the memory tool, which stores data outside the context window, agents can run indefinitely without manual intervention.
