Anthropic Says Cache Misses Are Production Incidents, Reveals Caching Shaped Claude Code

Anthropic's Claude Code team treats cache hit rate drops as production incidents. Manus arrived at the same conclusions independently.

Anthropic's Claude Code engineering team disclosed this week that prompt caching drove every major design decision in the company's AI coding agent. Thariq Shafi, a Claude Code engineer, wrote that the team "builds our entire harness around prompt caching" and treats drops in cache hit rate as production incidents, triggering severity alerts when the numbers slip. Alongside the disclosure, Anthropic shipped auto-caching, a single-parameter addition to its Messages API that automatically manages cache breakpoints as conversations grow.

A related post from Lance Martin, an Anthropic developer advocate, detailed how auto-caching reduces cached input token costs to 10% of standard pricing. For Claude Sonnet, that translates to $0.30 per million tokens versus $3.00. A compaction API, currently in beta, handles what happens when conversations outgrow the context window by summarizing older turns while keeping the cache prefix intact.

Together, the posts amount to an engineering confession. Model capability and training data dominate the public AI conversation. But the team that actually ships a production coding agent, staring at dashboards where a two-point drop in cache hit rate triggers a pager, found that the constraint shaping everything was a caching mechanism most developers have never thought about.

Key Takeaways

  • Anthropic treats drops in cache hit rate as production incidents, triggering severity alerts across Claude Code.
  • Auto-caching adds one API parameter; cached tokens cost 10% of standard pricing ($0.30 vs $3.00 per million on Sonnet).
  • Plan mode, tool search, and compaction were all architected to avoid breaking the cached prefix.
  • Manus founder Peak Ji independently confirmed KV-cache hit rate as the single most important production agent metric.


What prefix matching actually costs you

Prompt caching works through prefix matching. The API caches everything from the start of a request up to a designated breakpoint, generates a cryptographic hash of that content, and reuses the cached computation when a subsequent request carries an identical prefix. A single character of difference kills the match, and every token after the point where the prefixes diverge gets reprocessed at full price.

This fragility forced the Claude Code team into a rigid architecture. Static content goes first: system prompt, tool definitions, project configuration. Variable content, the actual conversation, sits at the end. Shafi described the ordering as "surprisingly fragile," citing past incidents where a detailed timestamp in the system prompt or non-deterministic tool ordering silently killed cache hit rates across the entire product.
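
A minimal sketch of that ordering, using the cache_control content-block marker from Anthropic's published prompt-caching API; the model id, system prompt, and tool schema below are placeholders, not Claude Code's actual values:

    import anthropic

    client = anthropic.Anthropic()

    # Static content: defined once, byte-identical on every request.
    SYSTEM_PROMPT = "You are a coding agent for this repository."  # no timestamps or per-request values
    TOOLS = [{
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}},
    }]

    def send(conversation, user_text):
        # Variable content, the conversation, goes after the cached prefix.
        conversation = conversation + [{"role": "user", "content": user_text}]
        response = client.messages.create(
            model="claude-sonnet-4-5",            # placeholder model id
            max_tokens=1024,
            system=[{
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Breakpoint: the tool definitions and system prompt up to here get cached.
                "cache_control": {"type": "ephemeral"},
            }],
            tools=TOOLS,                          # same tools, same order, every request
            messages=conversation,
        )
        return conversation, response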

Ripple effects touch everything a user does. Need to tell the model what day it is? Don't edit the system prompt. Inject a reminder tag into the next user message instead. The model gets the update. The cached prefix survives.
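
In code, that pattern might look like the snippet below; the reminder tag format is illustrative, not Claude Code's actual wire format:

    def with_reminder(user_text, today):
        # Per-request facts ride along in the user turn, so the cached
        # system prompt never has to change.
        return f"<reminder>Today's date is {today}.</reminder>\n\n{user_text}"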

Want to switch from Opus to Haiku for a quick question midway through a long session? Don't. A hundred thousand tokens of cached context on Opus would need rebuilding from scratch for Haiku. The Claude Code team routes these requests through subagents instead, where the primary model packages a focused handoff message for a separate session with its own cache.
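
A sketch of that handoff, assuming a simple delegate function rather than Claude Code's internal subagent machinery, and reusing the client from the earlier sketch:

    def ask_subagent(handoff_message, question):
        # The primary model writes a focused handoff; the quick question runs in
        # a separate Haiku session with its own, much smaller cache.
        reply = client.messages.create(
            model="claude-haiku-4-5",             # placeholder model id
            max_tokens=512,
            system=[{"type": "text", "text": handoff_message,
                     "cache_control": {"type": "ephemeral"}}],
            messages=[{"role": "user", "content": question}],
        )
        return reply.content[0].text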

Every feature bent around the same constraint

Plan mode is where the constraint becomes most visible.

When a user enters plan mode, the obvious engineering move is to swap the toolset. Strip out write permissions, keep only read-only tools. That would blow the cache. Tens of thousands of cached tokens, gone in a single API call.

So the Claude Code team kept every tool present in every request. Plan mode is itself a tool. EnterPlanMode and ExitPlanMode are callable functions, not configuration switches. When a user toggles plan mode, a system message tells the model to restrict itself to read operations. Tool definitions never change. Prefix stays warm. And because EnterPlanMode is a tool the model can invoke on its own, Claude Code sometimes drops into plan mode autonomously when it encounters a hard problem. No cache break needed.
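
A sketch of how that could be wired; only the tool names EnterPlanMode and ExitPlanMode come from Anthropic's post, and the schemas and toggle function are hypothetical:

    # Plan mode lives in the tool list itself, so the list never changes.
    PLAN_MODE_TOOLS = [
        {"name": "EnterPlanMode", "description": "Switch to read-only planning.",
         "input_schema": {"type": "object", "properties": {}}},
        {"name": "ExitPlanMode", "description": "Resume normal execution.",
         "input_schema": {"type": "object", "properties": {}}},
    ]

    def toggle_plan_mode(conversation, entering):
        # The mode switch is a reminder appended to the conversation, not an
        # edit to the cached system prompt or tool definitions.
        note = ("Plan mode is active: restrict yourself to read-only operations."
                if entering else "Plan mode is off: normal operation resumed.")
        return conversation + [{"role": "user", "content": note}]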

Tool search follows the same logic. Claude Code can load dozens of MCP tools. A full schema dump means thousands of tokens of definitions wedged between the system prompt and the first user message. But removing tools mid-conversation breaks the prefix.

Anthropic went with lightweight stubs. Each tool gets a placeholder entry, just a name with a defer_loading flag. When the model needs the full schema, it calls ToolSearch. The stubs sit in the prefix, unchanging, in the same order, every request.
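
A hedged sketch of that registry; the defer_loading flag and ToolSearch name come from Anthropic's post, while the surrounding structure and tool names are guesses:

    # Illustrative MCP tool names; the real list comes from connected servers.
    MCP_TOOL_NAMES = ["browser_open", "browser_click", "shell_exec"]
    FULL_SCHEMAS = {name: {"name": name, "input_schema": {"type": "object"}}
                    for name in MCP_TOOL_NAMES}

    # Stubs keep the prefix stable: every tool appears by name on every request,
    # in the same deterministic order.
    TOOL_STUBS = [{"name": name, "defer_loading": True}
                  for name in sorted(MCP_TOOL_NAMES)]

    def handle_tool_search(query):
        # Full schemas come back only when the model asks for them, as a tool
        # result appended to the conversation; the cached stub list never changes.
        return [schema for name, schema in FULL_SCHEMAS.items() if query in name]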

Compaction posed the most stubborn problem of the three. When a conversation fills the context window, the system has to summarize everything and start fresh. A naive implementation sends the history to the model with a different system prompt and no tools. Completely different prefix. You pay full input cost for the entire conversation all over again.

Their workaround was defensive. Compaction runs with the exact same system prompt, tools, and conversation prefix as the parent request, with the summarization instruction tacked on as the final user message. From the API's perspective, the request looks nearly identical to the previous one. The cache gets reused. Only the compaction instruction itself adds new tokens.
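
A sketch of that request shape, with a hypothetical compact() helper and placeholder instruction text, again reusing the client from the earlier sketch:

    COMPACTION_INSTRUCTION = ("Summarize the conversation so far, preserving open "
                              "tasks, key decisions, and file paths.")

    def compact(system_blocks, tools, conversation):
        # Same system prompt, same tools, same conversation prefix as the parent
        # request; only the final user message (the instruction) is new tokens.
        summary = client.messages.create(
            model="claude-sonnet-4-5",            # placeholder model id
            max_tokens=2048,
            system=system_blocks,
            tools=tools,
            messages=conversation + [{"role": "user", "content": COMPACTION_INSTRUCTION}],
        )
        # Start a fresh, short conversation seeded with the summary.
        return [{"role": "user",
                 "content": "Summary of the session so far:\n" + summary.content[0].text}]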

Anthropic turned its own pain into a product

Anthropic packaged these lessons for external developers rather than keeping them internal, and the commercial logic is easy to read: if every agent builder has to rediscover cache-safe compaction on their own, adoption stalls.

Auto-caching adds a single cache_control field at the top level of a Messages API request. One field. The system handles breakpoint placement from there, advancing it to the last cacheable block as conversations grow. Developers who want finer control can still place explicit breakpoints on individual content blocks, up to four total.
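
A hedged sketch of the request body as the post describes it, reusing SYSTEM_PROMPT, TOOLS, and conversation from the earlier sketch; the exact value of the top-level field ("auto" below) is an assumption, so check the Messages API documentation before relying on it:

    request_body = {
        "model": "claude-sonnet-4-5",             # placeholder model id
        "max_tokens": 1024,
        "cache_control": {"type": "auto"},        # assumed value shape for auto-caching
        "system": SYSTEM_PROMPT,
        "tools": TOOLS,
        "messages": conversation,
    }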

Cache lifetime defaults to five minutes, refreshing automatically on each reuse. A one-hour option costs twice the base input token price for writes. Standard five-minute writes cost 1.25 times the base rate. Reads cost one-tenth. For Sonnet, that breaks down to $3.75 per million tokens to write and $0.30 to read, versus $3.00 uncached. The math is not subtle.


Compaction, still in beta for Opus 4.6 and Sonnet 4.6, triggers when input tokens exceed a configurable threshold, 150,000 by default. The API generates a summary, wraps it in a compaction block, and continues with shortened context. Developers can pause after compaction to inject preserved messages or enforce total token budgets across a session.

The two features pair well. A cache breakpoint on the system prompt means system instructions stay cached even across multiple compaction events. Only the summary itself gets written as new cache content. That's the whole point.

Manus arrived at the same conclusions independently

Anthropic's disclosure would read as self-serving marketing if it stood alone. It carries weight because another major agent builder, working with different models and users, reached the same answers independently.

Yichao "Peak" Ji, who founded Manus before its acquisition by Meta, published a detailed engineering account of what he called "context engineering." Ji didn't hedge. The KV-cache hit rate, he wrote, is "the single most important metric for a production-stage AI agent."

Manus runs an average input-to-output token ratio of 100 to 1: roughly 99 of every 100 tokens in a request are input, which makes input processing the dominant cost. With Sonnet's cache pricing, the gap between cached and uncached input is tenfold. Ji described rebuilding the Manus agent framework four times, scrapping each version after hitting cost walls or latency spikes that made the product unusable. Frustration hardened into discipline.

His team's rules overlap almost perfectly with Anthropic's. Stable prefixes, append-only context, no mid-session tool changes.

Where they diverge is instructive. Manus masks token logits during decoding to block specific tools without altering tool definitions at all. Tool names share consistent prefixes: all browser tools start with browser_, all shell tools with shell_. Blocking an entire category of actions requires constraining just the first few tokens of a function name. Anthropic achieves similar results through stub-based deferral at the API layer, a cleaner interface but one that requires provider support.
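
A generic sketch of that idea, not Manus's implementation: a toy vocabulary of tool names and a prefix filter standing in for the real logit mask applied during constrained decoding.

    TOOL_NAMES = ["browser_open", "browser_click", "shell_exec", "file_read"]

    def allowed_tools(blocked_prefixes):
        # A real system would translate this allow-list into a logit mask over
        # the first tokens of the function-name field at decode time.
        return [name for name in TOOL_NAMES
                if not any(name.startswith(p) for p in blocked_prefixes)]

    print(allowed_tools(blocked_prefixes=["browser_"]))
    # -> ['shell_exec', 'file_read']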

Manus also treats the file system as extended context. When the active context window fills up, Manus writes structured data to files on disk and drops the content from the prompt while preserving the file path. The information stays recoverable. The prompt stays lean. Compression, in Ji's framework, must always be reversible. A web page can be dropped from context as long as the URL survives.
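
A toy sketch of that reversible-compression idea, with an arbitrary size cutoff and file layout; Manus's real system is presumably far more structured:

    from pathlib import Path

    def offload_to_disk(context_blocks, workdir="agent_scratch", max_chars=4000):
        # Drop bulky content from the prompt but keep a pointer to it on disk,
        # so the compression stays reversible: the agent can re-read the file.
        Path(workdir).mkdir(exist_ok=True)
        slimmed = []
        for i, block in enumerate(context_blocks):
            if len(block["text"]) > max_chars:
                path = Path(workdir) / f"block_{i}.txt"
                path.write_text(block["text"])
                slimmed.append({"text": f"[content saved to {path}]"})
            else:
                slimmed.append(block)
        return slimmed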

Ji's most counterintuitive finding cuts against every engineering instinct. Leave errors in the context. The engineering instinct is to clean up failed actions, retry cleanly, hide the mess. But errors carry signal. When the model sees a stack trace from a botched tool call, it shifts its predictions away from repeating the same mistake. Scrubbing the evidence removes the correction mechanism. Error recovery, Ji wrote, is "one of the clearest indicators of true agentic behavior."

The arithmetic behind the engineering anxiety

Behind the blog posts sits a straightforward cost calculation.

An AI coding agent processes hundreds of thousands of input tokens per session. Without caching, a 200,000-token conversation on Sonnet costs $0.60 in input alone each turn. With caching, subsequent turns drop to roughly $0.06. Over a 50-turn session, that's the difference between roughly $30 and $3.60 for one user.
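
Spelled out under simplifying assumptions (a flat 200,000-token context, the first turn at the uncached rate, output tokens and the 1.25x cache-write premium ignored), the arithmetic looks like this:

    TOKENS_PER_TURN = 200_000
    TURNS = 50
    UNCACHED_PER_M = 3.00       # Sonnet input, dollars per million tokens
    CACHED_READ_PER_M = 0.30    # 10% of standard input pricing

    uncached_session = TURNS * TOKENS_PER_TURN / 1e6 * UNCACHED_PER_M
    cached_session = (TOKENS_PER_TURN / 1e6 * UNCACHED_PER_M              # first turn, cold cache
                      + (TURNS - 1) * TOKENS_PER_TURN / 1e6 * CACHED_READ_PER_M)

    print(f"uncached: ${uncached_session:.2f}  cached: ${cached_session:.2f}")
    # uncached: $30.00  cached: $3.54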

Scale that across millions of Claude Code subscribers. At that volume, cache efficiency determines whether the product can exist at its current price. Every percentage point of cache miss rate translates directly into infrastructure spend. Shafi wrote that higher cache hit rates let Anthropic "create more generous rate limits for our subscription plans." The infrastructure constraint sets the pricing constraint.

This is why the Claude Code team monitors cache performance with the same intensity it watches uptime. A few percentage points of increased cache miss rate, spread across millions of requests, can crater infrastructure budgets overnight. Cache breaks are not tuning opportunities. They are production incidents, severity-rated and paged.

For developers building their own agents, the lesson from both Anthropic and Manus converges on a single architectural principle. Design around the prefix. Not around the model. Not around the features. Not around what feels architecturally clean. If you find yourself reaching for on-the-fly tool management or mid-session model switching, you're about to torch your cache hit rate, and your unit economics with it.

Production AI's most consequential decisions aren't the ones users notice. They're the ones that keep the prefix identical from one request to the next, and the pager silent.

Frequently Asked Questions

What is prompt caching and how does prefix matching work?

The API caches computation from the start of a request to a designated breakpoint. Subsequent requests with an identical prefix reuse that cached work. A single character of difference breaks the match, and everything after the divergence is reprocessed at standard token prices.

How much does prompt caching save on API costs?

Cached reads cost 10% of standard input pricing. For Claude Sonnet, that means $0.30 per million tokens versus $3.00 uncached. Cache writes cost 1.25x the base rate for a five-minute TTL or 2x for a one-hour TTL.

What is the compaction API?

Compaction triggers when input tokens exceed a configurable threshold, 150,000 by default. It summarizes older conversation turns while preserving the cache prefix. Currently in beta for Opus 4.6 and Sonnet 4.6.

Why does Claude Code keep all tools in every API request?

Removing or changing tool definitions mid-conversation would alter the cached prefix and invalidate the cache. Claude Code uses lightweight stubs with a defer_loading flag instead, loading full schemas only when the model calls ToolSearch.

How did Manus handle the same caching challenges differently from Anthropic?

Manus masks token logits during decoding to block tools without altering definitions, uses consistent tool name prefixes for category-level blocking, and treats the file system as extended context to keep prompts lean while preserving recoverability.
