AI Agents Need Context Compaction Before 100 Percent

Anthropic engineer Thariq Shihipar published a session-management guide for Claude Code on April 15, pairing the assistant's 1 million-token context window with a decision table that lists which command the developer should run in each session state. The table lists four scenarios for in-session decisions and assigns each a different command, ranging from /compact <hint> for bloated mid-task sessions to /clear for new tasks. Shihipar argued that the larger window did not remove the need for those choices.

Three recent vendor documents reach a similar conclusion. The Claude Code session guide, OpenAI's compaction API documentation, and Anthropic's context-engineering post each describe the context window as a resource the developer or the system must allocate before the window fills, not after.

Key Takeaways

Claude Code pairs a 1 million-token window with commands for compact, clear, rewind and subagents.
OpenAI compaction emits opaque encrypted items when rendered token counts cross a configured threshold.
Chroma found focused 300-token prompts beat full prompts of about 113,000 tokens on LongMemEval.
DeepSeek and Google show why stable prefixes matter for cache hits, cost and latency.

AI-generated summary, reviewed by an editor. More on our AI guidelines.

OpenAI's compaction items are opaque by design

OpenAI's compaction guide describes the API equivalent of that workflow. When the rendered token count of a Responses-API conversation crosses a configured compact_threshold, the API emits an encrypted compaction item and prunes the prior context before continuing. OpenAI's documentation states that the item is "opaque and not intended to be human-interpretable," a property consistent with its Zero Data Retention posture under store=false and one that, the guide acknowledges, complicates debugging when the handoff is wrong.
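The threshold behavior can be pictured with a small local simulation. This is a sketch only and does not call the OpenAI API: the `Item` type, the `maybe_compact` function, and the size given to the compaction item are hypothetical stand-ins, and only the name `compact_threshold` comes from the guide.

```python
from dataclasses import dataclass


@dataclass
class Item:
    kind: str     # "message" or "compaction"
    tokens: int   # rendered token count attributed to this item
    payload: str  # readable text, or an opaque blob for compaction items


def maybe_compact(items: list[Item], compact_threshold: int) -> list[Item]:
    """Return the items unchanged below the threshold, else one opaque item."""
    rendered = sum(item.tokens for item in items)
    if rendered <= compact_threshold:
        return items
    # Hypothetical stand-in for the encrypted item: the caller cannot read it.
    # The tokens // 10 size is an arbitrary illustration, not a documented ratio.
    return [Item(kind="compaction", tokens=rendered // 10, payload="<opaque>")]


history = [Item("message", 40_000, "plan"), Item("message", 70_000, "test logs")]
compacted = maybe_compact(history, compact_threshold=100_000)
print([item.kind for item in compacted])
```

After the call, the readable plan and logs are gone from the working list, which is the debugging cost the guide describes.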

Shihipar's guide identifies the same failure mode. He writes that bad auto-compactions occur when the model cannot predict where the work is heading, and that the model is "at its least intelligent point when compacting." Once compaction has run, the summary is the only state the next turn inherits. A compaction triggered after a failed test run can therefore pass that dead end forward and drop earlier warnings from the thread in the process.

Shihipar uses the risk to argue for manual compaction at task boundaries rather than at threshold limits. The guide recommends preserving the architectural decision and any unresolved bug or open file in the summary, while discarding failed log output and tool noise.
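A task-boundary handoff along those lines might look like the following sketch. The note kinds, the `build_handoff` helper, and the sample notes are all hypothetical; the guide names what to keep and what to discard but does not prescribe a data format.

```python
# Kinds the guide recommends preserving; everything else is discarded.
KEEP_KINDS = {"architecture_decision", "unresolved_bug", "open_file"}


def build_handoff(notes: list[dict[str, str]]) -> tuple[list[str], int]:
    """Return the texts worth carrying forward and the count of dropped notes."""
    kept = [note["text"] for note in notes if note["kind"] in KEEP_KINDS]
    return kept, len(notes) - len(kept)


session_notes = [
    {"kind": "architecture_decision", "text": "Use a queue between API and worker"},
    {"kind": "failed_log", "text": "pytest: 14 failed, traceback ..."},
    {"kind": "unresolved_bug", "text": "Retry loop double-sends on timeout"},
    {"kind": "tool_noise", "text": "ls -la output"},
    {"kind": "open_file", "text": "worker/retry.py"},
]

summary_lines, dropped_count = build_handoff(session_notes)
hint = "Keep: " + "; ".join(summary_lines)
print(hint)
print(f"dropped {dropped_count} notes")
```

The resulting string is the sort of text a developer could pass as the hint to /compact before moving to the next task.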

Chroma and Google document attention falloff in long prompts

Google's Gemini long-context documentation sets out the optimistic case. Its capacity examples list a 1 million-token window as roughly equivalent to 50,000 lines of code or eight average English novels, and to transcripts of more than 200 average podcast episodes. The same document encourages developers to load relevant material up front instead of retrieving or summarizing it on each call.

The same documentation also notes a limit. Accuracy on retrieving a single item, Google writes, does not carry over to retrieving many items from the same context. A model that scores 99 percent on a single-needle benchmark may still need 100 separate requests to surface 100 pieces of information from the same long context.
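The arithmetic behind that caution is short. The sketch below rests on two assumptions that are ours, not Google's: each of the 100 requests resends the full 1 million-token context with no cache discount, and each lookup is correct independently with probability 0.99.

```python
CONTEXT_TOKENS = 1_000_000   # window size cited in the article
ITEMS_NEEDED = 100           # pieces of information to surface
SINGLE_ITEM_ACCURACY = 0.99  # single-needle score from the example

# Assumption: one request per item, full context resent, no cache discount.
total_input_tokens = CONTEXT_TOKENS * ITEMS_NEEDED

# Assumption: independent lookups, each correct with probability 0.99.
all_correct = SINGLE_ITEM_ACCURACY ** ITEMS_NEEDED

print(f"{total_input_tokens:,} input tokens across {ITEMS_NEEDED} requests")
print(f"chance that all {ITEMS_NEEDED} lookups are correct: {all_correct:.1%}")
```

Under those assumptions the workload processes 100 million input tokens, and the chance of a clean sweep is about 36.6 percent even at 99 percent per-item accuracy.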

Chroma's Kelly Hong, Anton Troynikov, and Jeff Huber reach a similar conclusion empirically. In Chroma's context-rot study, the researchers evaluated 18 models and reported that "LLMs do not maintain consistent performance across input lengths." Their LongMemEval comparison ran full prompts of about 113,000 tokens against focused prompts of about 300 tokens, and the focused prompts produced more accurate answers because the model was not required to retrieve and reason inside the same call.

Hong, Troynikov, and Huber write that long-window demonstrations measure how much material fits in the window, not how much of it the model can attend to during a single response.

DeepSeek and Google publish prefix-cache pricing

DeepSeek attaches pricing to the same idea. Its August 2024 cache launch note listed cache-hit input at $0.014 per million tokens against $0.14 per million for a cache miss, and reported that a 128K prompt with a high reference rate dropped from 13 seconds to 500 milliseconds of first-token latency. DeepSeek attributed the cost and latency gains to repeated reuse of a stable prefix across requests.
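A worked example shows the size of the gap. The two per-million prices are the ones from the launch note; the `input_cost` helper, the reading of 128K as 128,000 tokens, and the 90 percent hit fraction are assumptions for illustration.

```python
HIT_PRICE_PER_M = 0.014   # USD per million cached input tokens (launch note)
MISS_PRICE_PER_M = 0.14   # USD per million uncached input tokens (launch note)


def input_cost(prompt_tokens: int, hit_fraction: float) -> float:
    """Blend hit and miss pricing for one request's input tokens."""
    hit_tokens = prompt_tokens * hit_fraction
    miss_tokens = prompt_tokens - hit_tokens
    return (hit_tokens * HIT_PRICE_PER_M + miss_tokens * MISS_PRICE_PER_M) / 1_000_000


PROMPT_TOKENS = 128_000  # assumption: "128K" read as 128,000 tokens

cold = input_cost(PROMPT_TOKENS, hit_fraction=0.0)
mostly_cached = input_cost(PROMPT_TOKENS, hit_fraction=0.9)
fully_cached = input_cost(PROMPT_TOKENS, hit_fraction=1.0)

print(f"cold: ${cold:.6f}  90% hit: ${mostly_cached:.6f}  100% hit: ${fully_cached:.6f}")
```

Per request the figures are fractions of a cent, so the difference matters mainly for agents that resend the same long preamble thousands of times.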

The note also describes the cache as "best-effort" and states that a 100 percent hit rate is not guaranteed. Google's documentation imposes similar conditions. Implicit caching activates at 1,024 input tokens for Gemini 3 Flash Preview and at 4,096 for Gemini 3 Pro Preview, and explicit caches default to a one-hour time to live, with one Google example setting a five-minute cache for a file named SherlockJr._10min.mp4. The two pricing structures target the same kind of workload, in which a long preamble stays identical across calls and only the per-turn query changes between requests.

Google's documentation notes that a preamble rewritten on each turn will not match the prior cached prefix and will register as a cache miss for billing purposes.
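The effect of a rewritten preamble can be shown with a toy cache. This sketch keys on the exact preamble text, which is a simplification: production caches match token prefixes and apply minimum sizes and expiry. The `PrefixCache` class and the sample prompts are hypothetical.

```python
import hashlib


class PrefixCache:
    """Toy cache keyed on an exact preamble; real caches match token prefixes."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def lookup(self, preamble: str) -> bool:
        """Return True on a hit, and record the preamble for later calls."""
        key = hashlib.sha256(preamble.encode("utf-8")).hexdigest()
        hit = key in self._seen
        self._seen.add(key)
        return hit


def run(preambles: list[str]) -> list[bool]:
    """Send each preamble through a fresh cache and report hit or miss."""
    cache = PrefixCache()
    return [cache.lookup(preamble) for preamble in preambles]


stable = run(["You are a code reviewer. Repo rules: ..."] * 3)
rewritten = run([f"Turn {n}. You are a code reviewer. Repo rules: ..." for n in range(3)])

print("stable preamble:   ", stable)
print("rewritten preamble:", rewritten)
```

The stable preamble misses once and then hits on every later call, while the version that changes its first few words never hits.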

Anthropic and OpenAI differ on session-management mechanics

Anthropic's context-engineering post and OpenAI's Codex best-practices documentation address the same set of session-management decisions with different specifics.

Pi creator Mario Zechner pushes the same idea further, but his preferred move is to avoid polluting the implementation thread in the first place. In a November 30 post on building Pi, Zechner wrote that "context engineering is paramount" because "exactly controlling what goes into the model's context yields better outputs, especially when it's writing code." For context gathering, his advice is to do it "first in its own session" and create "an artifact that you can later use in a fresh session," rather than drag every tool result forward. At the time, he wrote that there were still "a few more features I'd like to add, like compaction," but added that "missing compaction hasn't been a problem for me personally." Pi's current compaction docs now describe /compact [instructions], automatic compaction, a 16,384-token response reserve, and a 20,000-token recent-message keep window, while preserving the full JSONL history for /tree.
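How a response reserve and a keep window could interact is sketched below. Only the two token figures come from Pi's docs as cited here; the trigger rule, the backward walk over messages, the function names, and the 200,000-token window are assumptions and may not match Pi's actual implementation.

```python
RESPONSE_RESERVE = 16_384  # tokens held back for the model's reply (cited figure)
KEEP_RECENT = 20_000       # tokens of recent messages left unsummarized (cited figure)


def should_compact(used_tokens: int, context_window: int) -> bool:
    """Assumed trigger: compact once usage eats into the response reserve."""
    return used_tokens > context_window - RESPONSE_RESERVE


def split_for_compaction(message_tokens: list[int]) -> tuple[list[int], list[int]]:
    """Walk backward from the newest message, keeping what fits in KEEP_RECENT."""
    kept = 0
    cut = len(message_tokens)
    for index in range(len(message_tokens) - 1, -1, -1):
        if kept + message_tokens[index] > KEEP_RECENT:
            break
        kept += message_tokens[index]
        cut = index
    return message_tokens[:cut], message_tokens[cut:]


sizes = [50_000, 9_000, 8_000, 6_000]  # hypothetical per-message token counts
older, recent = split_for_compaction(sizes)
print(should_compact(190_000, context_window=200_000), older, recent)
```

In this example the two newest messages stay verbatim and the two older ones become candidates for the summary.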

Session-management mechanics, head to head

Mechanism	Anthropic Claude Code	OpenAI Codex
Automatic compaction	Summary preserves architectural decisions, unresolved bugs, and implementation specifics. Drops tool outputs. Resumes with summary plus the five most recently accessed files.	Encrypted item emitted when `compact_threshold` is crossed. Replaces pruned context. Opaque to inspection.
Subagent dispatch	Spawn only when the parent needs the conclusion, not the full output. Child consumes the exploratory tokens and returns the condensed result.	Scope one thread per coherent unit of work.
Default prompt shape	Not specified in the cited post.	Specify Goal, Context, Constraints, and a definition of Done in every default prompt.
Auto-loaded project file	Claude Code's project context file at session start.	`AGENTS.md`, kept short and accurate.

Anthropic's recommended test before spawning a child agent is a single question, in the post's wording: "will I need this tool output again, or just the conclusion?"
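The four-part prompt shape in the Codex column of the table can be kept consistent with a small template. This is a sketch: the `TaskPrompt` class, its label wording, and the sample task are hypothetical, and only the four section names come from the documentation cited above.

```python
from dataclasses import dataclass


@dataclass
class TaskPrompt:
    """Four-part prompt: goal, context, constraints, and a definition of done."""

    goal: str
    context: str
    constraints: str
    done_when: str

    def render(self) -> str:
        """Return the prompt as four labeled lines in a fixed order."""
        return "\n".join(
            [
                f"Goal: {self.goal}",
                f"Context: {self.context}",
                f"Constraints: {self.constraints}",
                f"Done when: {self.done_when}",
            ]
        )


prompt = TaskPrompt(
    goal="Fix the double-send in the retry loop",
    context="Bug is in worker/retry.py; queue design is settled",
    constraints="No new dependencies; keep the public API unchanged",
    done_when="Retry tests pass and no message is sent twice on timeout",
)
rendered = prompt.render()
print(rendered)
```

Because every field is required, a prompt that omits its definition of done fails at construction time instead of reaching the model.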

Frequently Asked Questions

What is context compaction?

Context compaction summarizes or prunes prior conversation state so a session can continue inside the model window. In OpenAI's Responses API, crossing a configured threshold can emit an encrypted compaction item. In Claude Code, developers can also run /compact with a hint.

Why compact before the context window reaches 100 percent?

Anthropic warns that automatic compaction can happen when the model is least able to judge what future turns will need. Compacting at task boundaries lets the developer preserve decisions, unresolved bugs and active files before threshold pressure decides for them.

Does a 1 million-token window solve long-context reliability?

No. Gemini's documentation shows the scale of a 1 million-token window, but it also cautions that retrieving many items is not the same as retrieving one. Chroma found focused prompts of about 300 tokens beat full prompts of about 113,000 tokens.

How do context caches change the cost calculation?

Caches reward repeated, stable prefixes. DeepSeek's August 2024 launch note listed cache-hit input at $0.014 per million tokens against $0.14 for a miss, and reported first-token latency dropping from 13 seconds to 500 milliseconds on a 128K prompt.

What belongs outside the chat thread?

Durable rules, project conventions and definitions of done should live in files or prompts that reload cleanly. Codex points users to AGENTS.md and prompts with Goal, Context, Constraints and Done when. Claude Code uses compact summaries and subagents for bounded work.

Marcus Schuler

San Francisco

Editor-in-Chief and founder of Implicator.ai. Former ARD correspondent and senior broadcast journalist with 10+ years covering tech. Writes daily briefings on policy and market developments. Based in San Francisco. E-mail: editor@implicator.ai