Sabrina Ramonov spent weeks burning through her Claude subscription before she figured out what was wrong. Not because she was building anything exotic. Because she was running Opus on every task, letting her context window bloat to 80,000 tokens before typing a word, and feeding Claude her entire project on every turn. She documented six fixes that cut her usage in half. Damian Galarza ran /context on a fresh Claude Code session and discovered that 50.6% of his 200,000-token context window was already consumed by system overhead. Before he sent a single message. Before he wrote a single line of code.

These aren't isolated complaints. They're dispatches from a growing cottage industry of token optimization content. YouTube tutorials promising 18 hacks in 18 minutes. GitHub repos offering a single CLAUDE.md file that cuts output by 63%. Paid toolkits with names like "Code Kit" that ship pre-built hooks for context recovery. An entire blog ecosystem teaching developers how to spend less money on the tool that was supposed to save them time.

The token optimization industry shouldn't exist. The fact that it does tells you something important about where AI coding tools actually stand.

Andrej Karpathy endorsed "context engineering" as a real discipline in June 2025, calling it "the delicate art and science of filling the context window with just the right information for the next step." Shopify CEO Tobi Lutke agreed. Google DeepMind's Philipp Schmid went further: "Most agent failures are not model failures anymore, they are context failures." The institutional consensus is clear. What goes into the context window matters. But endorsing context engineering as a discipline is not the same as endorsing an industry of CLAUDE.md tweaking tips. The distance between those two things is where the money gets made.

The return of manual resource management

If you wrote C in the 1990s, you remember malloc(). You allocated memory by hand, tracked every byte, freed it when you were done. Get it wrong and your program crashed, leaked, or corrupted itself silently. Then garbage collection arrived. Java, Python, and eventually every mainstream language automated the problem away. Developers stopped thinking about memory. They shipped faster.

Claude Code has reintroduced manual resource management, just with tokens instead of bytes.

The SFEIR Institute published a token budget table that reads like a systems programming reference manual. A 300-line file read costs 2,500 tokens. A full Claude response runs 1,500 to 3,000. A bash command result can eat 5,000. Your CLAUDE.md file loads at boot for another 800 to 2,000. The 200,000-token context window sounds enormous until you realize a focused developer can burn through it in under 30 minutes of active work.

So developers are counting again. They grep instead of reading whole files because it costs 200 tokens versus 3,000. They batch prompts into single messages because three separate exchanges cost three times as much. They watch a status bar like a fuel gauge, compacting at 60% because waiting until 80% degrades output quality. One developer tracked a 100-message chat and found 98.5% of all tokens were spent re-reading old conversation history. Not writing code. Re-reading.
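A rough sketch of why history dominates the bill. It rests on a simplifying assumption that is not from the developer's tracking: every message is the same size (the 500-token default is hypothetical), and each turn reprocesses all earlier messages plus the new one.

```python
def reread_share(num_messages: int, tokens_per_message: int = 500) -> float:
    """Fraction of processed tokens that are re-read history.

    Simplified model (an assumption, not a measurement): all messages are
    the same size, and turn k reprocesses messages 1..k.
    """
    new_tokens = num_messages * tokens_per_message
    # Turn k processes k messages, so the total is 1 + 2 + ... + n = n(n+1)/2.
    processed = tokens_per_message * num_messages * (num_messages + 1) // 2
    return 1 - new_tokens / processed


for n in (10, 50, 100):
    print(f"{n:>3} messages: {reread_share(n):.1%} of tokens are re-read history")
```

Under this model the share works out to 1 - 2/(n + 1), so a 100-message chat lands at about 98%, in the same range as the tracked figure. Real sessions will differ because messages and tool outputs vary in size.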

That's not a productivity tool. That's a resource management puzzle with a subscription fee.

Half your budget vanishes before you start

The invisible overhead cuts deepest. Galarza's fresh-session breakdown tells the story in raw numbers: 3,100 tokens eaten by the system prompt. Another 19,800 by built-in tools. MCP tool definitions swallowed 26,500 more. Custom agents took 2,800, memory files 4,000, and the autocompact buffer grabbed 45,000. Grand total: 101,200 tokens gone before Galarza typed a word.
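The arithmetic is easy to check. A minimal sketch that tallies the reported line items against the 200,000-token window; the dictionary keys are labels chosen here for illustration, not names taken from the /context output.

```python
CONTEXT_WINDOW = 200_000

# Fresh-session figures as reported in Galarza's breakdown.
overhead = {
    "system prompt": 3_100,
    "built-in tools": 19_800,
    "MCP tool definitions": 26_500,
    "custom agents": 2_800,
    "memory files": 4_000,
    "autocompact buffer": 45_000,
}

total = sum(overhead.values())

# Largest line items first, each shown as a share of the full window.
for name, tokens in sorted(overhead.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {tokens:>7,}  {tokens / CONTEXT_WINDOW:6.2%}")
print(f"{'total':<22} {total:>7,}  {total / CONTEXT_WINDOW:6.2%}")
```

The total comes to 101,200 tokens, or 50.6% of the window, with the autocompact buffer and MCP tool definitions as the two largest items.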

ClaudeFast analyzed the autocompact buffer and found Anthropic recently reduced it from 45,000 tokens to 33,000. Anthropic framed the reduction as a win, handing back roughly 12,000 usable tokens. Generous, right? Except those tokens were yours to begin with. A $200-per-month tool had been quietly reserving them, and the partial return landed as a product improvement in the changelog.
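The size of the handback can be sketched in a few lines. This assumes, for simplicity, that the buffer is the only reservation subtracted from the 200,000-token window; the function name is hypothetical.

```python
CONTEXT_WINDOW = 200_000


def usable_tokens(buffer_tokens: int, window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the session once the autocompact buffer is reserved."""
    return window - buffer_tokens


before = usable_tokens(45_000)
after = usable_tokens(33_000)

print(f"before: {before:,} usable")
print(f"after:  {after:,} usable ({after / CONTEXT_WINDOW:.1%} of the window)")
print(f"handed back: {after - before:,}")
```

With the smaller buffer, 167,000 of the 200,000 tokens are usable, which is 83.5% of the window.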

The GenAI Skills blog called setting MAX_THINKING_TOKENS to 10,000 "the single highest-impact change" for reducing consumption. This caps an internal reasoning process that users never see, can't read, and didn't ask for. HumanLayer's research found that Claude's built-in system prompt already occupies roughly 50 instruction slots, leaving about 100 to 150 for your actual project rules before the model starts ignoring them.
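For readers who script their sessions, here is a hedged sketch of applying that cap from Python. MAX_THINKING_TOKENS and the 10,000 value come from the recommendation above; the helper names are hypothetical, and the launcher assumes the CLI is installed and available on PATH as `claude`.

```python
import os
import subprocess


def env_with_thinking_cap(cap: int = 10_000) -> dict[str, str]:
    """Copy the current environment and add the thinking-token cap."""
    env = dict(os.environ)
    env["MAX_THINKING_TOKENS"] = str(cap)  # environment values must be strings
    return env


def launch_claude(cap: int = 10_000) -> int:
    """Start a session with the cap applied (assumes `claude` is on PATH)."""
    return subprocess.run(["claude"], env=env_with_thinking_cap(cap)).returncode


# Inspect the value without launching anything.
print(env_with_thinking_cap()["MAX_THINKING_TOKENS"])
```

Building a copy of the environment keeps the cap scoped to the child process instead of changing the parent shell.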

Here's the part nobody in the optimization ecosystem talks about. Galarza's own numbers show that CLAUDE.md files, the thing the entire cottage industry obsesses over, account for roughly 2% of total context consumption. Two percent. MCP tool definitions eat 13.3%. The autocompact buffer takes 22.5%. The optimization tips industry has built itself around the smallest line item on the invoice.

If you're an executive evaluating AI coding tools for your engineering team, sit with that number. Then sit with HumanLayer's: your developers get 100 to 150 instruction slots for their own project rules, because the tool has already taken roughly 50 before they arrive. The feeling among power users is something close to defensiveness, a low-grade anxiety that the tool they championed to their team is burning budget in ways they can't fully explain.

Who profits from the complexity

A secondary economy has formed around this friction. ClaudeFast hawks a Code Kit with SkillActivation hooks that load context on demand, claiming to claw back 15,000 tokens per session. MindStudio positions itself as an orchestration layer that keeps Claude focused on reasoning while infrastructure runs elsewhere. The drona23 GitHub repo ships a drop-in CLAUDE.md that reduced test outputs from 465 words to 170, a 63% cut with what the author calls "zero signal loss."

And the person who built Claude Code? Runs it stock. Boris Cherny, the Anthropic staff engineer behind the tool, posted his setup in early 2026. Nothing exotic. No elaborate CLAUDE.md rituals. Claude botches a task, Cherny adds a line to the CLAUDE.md. Problem goes away, he moves on. Five to ten parallel Opus sessions running at once, no token spreadsheet in sight. The guy whose tool spawned an entire optimization cottage industry doesn't bother optimizing it.

Even Anthropic contributes to the confusion. The company's official documentation dedicates an entire page to cost management, walking developers through model switching, thinking budget caps, MCP server pruning, and hook-based preprocessing. The average cost is $6 per developer per day, Anthropic says, with 90% of users staying below $12. But those figures assume competent token management. Without it, sessions drain within the hour.

The business incentive misalignment is worth naming. For API and team customers, Anthropic bills per token. For Pro and Max subscribers, usage limits gate access based on consumption. Either way, efficiency works against Anthropic's economics. The fact that Anthropic publishes optimization guides anyway suggests the alternative, frustrated and exposed developers abandoning the product, is worse. But it also means the platform has limited motivation to solve the problem at the infrastructure level. Why build garbage collection when your customers will build it themselves?

Robert Matsuoka caught something Anthropic never announced: compaction now kicks in at 75% capacity, not 90%. On a 200,000-token window, that change handed developers 50,000 tokens of headroom for reasoning where they'd previously had 20,000. Quality climbed. But Matsuoka found this through independent monitoring, not through Anthropic documentation. The improvement was silent. The burden of understanding it still falls on the developer.
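The headroom figures follow directly from the trigger point. A minimal sketch, assuming the full 200,000-token window is the base for the percentage; the function name is hypothetical.

```python
def headroom(threshold_pct: int, window: int = 200_000) -> int:
    """Tokens still free at the moment compaction triggers."""
    return window * (100 - threshold_pct) // 100


for pct in (90, 75):
    print(f"compaction at {pct}%: {headroom(pct):,} tokens of headroom")
```

Moving the trigger from 90% to 75% raises the free space from 20,000 to 50,000 tokens.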

The tool works best when you use it less

Anthropic's own best practices documentation says it plainly: "If your CLAUDE.md is too long, Claude ignores half of it because important rules get lost in the noise." The platform is telling developers to stop doing what the cottage industry teaches them to do. The documentation even distinguishes between CLAUDE.md instructions, which are advisory and followed roughly 80% of the time, and hooks, which are deterministic. The optimization tips ecosystem is largely tweaking an advisory channel when a deterministic one exists.

Researchers at ETH Zurich put numbers on the problem. Their February 2026 study tested 138 task instances across 12 real-world Python repositories with four coding agents. LLM-generated context files caused a 2-3% degradation in task success compared to no context file at all, while increasing inference costs by 20-23%. Human-written context files performed marginally better, adding roughly 4% improvement at 19% higher cost. The researchers' recommendation was blunt: omit LLM-generated context files entirely and limit human-written instructions to details the agent can't infer on its own.

Beyond the ETH Zurich data, multiple sources converge on the same counterintuitive finding. Samuel Lawrentz cut his costs in half by fixing five habits, the biggest being a CLAUDE.md file that contained a section explaining what React is. To an AI model trained on the entire internet. The Hugging Face community recommends compacting at 50% context usage, not the 95% where auto-compact triggers. Matsuoka's analysis shows that leaving 25% of the context window deliberately empty produces better output than filling it.

The pattern holds beyond Claude Code. A randomized controlled trial published by METR in 2025 found that experienced developers actually work 19% slower with AI coding assistants, even as they report feeling faster. The productivity gain is at least partly a mirage. And that gap between what developers feel and what the clock says gets worse as the tools pile on complexity.

You can draw a straight line from that finding to the token optimization ecosystem. Developers believe Claude Code is saving them time. Some of that time goes back into managing Claude Code. The net calculation is murkier than any vendor will admit.

Why Anthropic won't fix this first

This resolves in one of three ways. First, Anthropic builds automatic context management, the garbage collection equivalent, that handles compaction, model routing, and MCP optimization invisibly. Second, the optimization layer becomes a separate product category, with companies like ClaudeFast and MindStudio selling the tooling that Anthropic doesn't ship. Third, developers accept the cognitive tax as permanent, the way they once accepted build systems and dependency management as the price of modern software.

History favors option one. Manual resource management always gets automated. It happened with memory. It happened with deployment. It happened with dependency resolution. The question is whether Anthropic moves first or whether a competitor ships the abstraction that makes token management invisible, and takes the market with it.

When Anthropic bet on the terminal over flashy IDE integrations, it won developers who valued control. But control and cognitive load live on the same spectrum. Right now the developer who opens Claude Code at 8 AM spends her first ten minutes disconnecting MCP servers, checking her context budget, and trimming her CLAUDE.md to stay under 200 lines. She isn't writing code. She's managing the machine that promised to write it for her. And somewhere on YouTube, another tutorial is teaching the next developer how to do the same thing, one hack at a time.

Frequently Asked Questions

How much does Claude Code actually cost per developer?

Anthropic reports an average of $6 per developer per day, with 90% of users staying below $12. But those figures assume competent token management. Subscriptions range from $20/month (Pro) to $200/month (Max 20x). API and team users pay per token consumed.

Why does Claude Code's context window fill up so fast?

System overhead consumes roughly half the 200,000-token window before you start. System tools, MCP definitions, memory files, and the autocompact buffer collectively eat over 100,000 tokens. Every message then reprocesses the entire conversation history, compounding costs.

What is the most effective way to reduce Claude Code token usage?

Cap extended thinking with MAX_THINKING_TOKENS=10000, switch to Sonnet for most tasks (reserving Opus for architecture decisions), disconnect unused MCP servers, keep CLAUDE.md under 150 lines, and use /clear between unrelated tasks.

What is the autocompact buffer and why does it matter?

Anthropic sets aside around 33,000 tokens (down from 45,000) for automatic context compaction. In practice, you get about 167,000 usable tokens out of the advertised 200,000. The compaction process kicks in near 83.5% capacity.

Will token management eventually be automated?

History suggests yes. Manual resource management in computing has always been automated eventually, from memory allocation to deployment to dependency resolution. The question is whether Anthropic builds this abstraction first or a competitor does.

AI-generated summary, reviewed by an editor. More on our AI guidelines.

San Francisco

Editor-in-Chief and founder of Implicator.ai. Former ARD correspondent and senior broadcast journalist with 10+ years covering tech. Writes daily briefings on policy and market developments. Based in San Francisco.