💡 TL;DR - The 30-Second Version
🤖 Anthropic built Claude's research feature using multiple AI agents working together, claiming 90.2% better performance than single agents.
💰 Multi-agent systems burn 15x more tokens than regular chat but justify costs for complex research tasks.
⚡ Parallel processing cuts research time by up to 90% - tasks that took hours now finish in minutes.
🔧 Token usage explains 80% of performance differences, making smart distribution more important than raw model size.
🏭 Production required new debugging approaches since agent errors cascade into completely different behaviors than traditional software.
🚀 This suggests coordination between smaller models might beat building ever-larger single models for complex tasks.
Anthropic published details about how they built Claude's research feature. Their approach uses multiple AI agents working together instead of one large model.
The company claims their multi-agent system beats single-agent performance by 90.2%. They tested this using Claude Opus 4 as the main coordinator and Claude Sonnet 4 as worker agents.
The system sidesteps context limits by giving each sub-agent its own context window. This lets them process more information than single agents constrained by 200,000 token limits.
How the architecture works
Anthropic's system mimics a research team structure. A lead agent analyzes queries and creates specialized sub-agents to handle different aspects simultaneously.
Each sub-agent explores its assigned angle independently, then reports compressed insights back to the coordinator. This differs from traditional approaches where one agent handles everything sequentially.
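A minimal sketch of that orchestrator-worker flow, written against the Anthropic Python SDK. The prompts, model IDs, and the three-way split are illustrative assumptions rather than Anthropic's actual implementation, and the sub-agents run sequentially here for brevity (a parallel variant appears further down).

```python
# Orchestrator-worker sketch (assumed structure, not Anthropic's code).
# A lead agent decomposes the query, each sub-agent researches one angle in
# its own fresh context window, and the lead synthesizes compressed findings.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LEAD_MODEL = "claude-opus-4-20250514"      # assumed coordinator model
WORKER_MODEL = "claude-sonnet-4-20250514"  # assumed sub-agent model

def ask(model: str, prompt: str) -> str:
    """Single-turn call; a real system would keep per-agent conversation state."""
    reply = client.messages.create(
        model=model, max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text

def research(query: str) -> str:
    # 1. Lead agent splits the query into independent research angles.
    plan = ask(LEAD_MODEL,
               "Break this research question into 3 independent sub-tasks, "
               f"one per line, no numbering:\n{query}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()][:3]

    # 2. Each sub-agent explores its angle and reports a compressed summary
    #    instead of its full working transcript.
    findings = [
        ask(WORKER_MODEL,
            "Research the following and report only key findings with "
            f"sources, under 300 words:\n{task}")
        for task in subtasks
    ]

    # 3. Lead agent synthesizes the compressed findings into one report.
    return ask(LEAD_MODEL,
               "Synthesize these findings into a single research report:\n\n"
               + "\n\n".join(findings))
```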
Three factors explain 95% of performance differences in their testing: token usage (accounting for 80%), number of tool calls, and model choice. More tokens help, but only when distributed intelligently across agents.
Token costs mount quickly
Multi-agent systems consume tokens fast. Anthropic's data shows agents use 4x more tokens than chat interactions, while multi-agent systems use 15x more than regular chats.
Model quality matters more than raw compute power. Upgrading from Claude Sonnet 3.7 to Sonnet 4 delivered bigger gains than doubling the token budget on the older model.
This creates an economic threshold. Multi-agent systems only make sense for complex tasks where improved results justify the higher costs.
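For a rough sense of that threshold, here is the arithmetic with the 4x and 15x multipliers plugged in; the baseline chat cost is a made-up placeholder, not an Anthropic figure.

```python
# Back-of-the-envelope cost comparison using the multipliers reported above.
# The baseline chat cost is an assumed placeholder value.
BASELINE_CHAT_COST = 0.05       # assumed cost of an ordinary chat turn, in dollars
SINGLE_AGENT_MULTIPLIER = 4     # single agents use ~4x the tokens of chat
MULTI_AGENT_MULTIPLIER = 15     # multi-agent runs use ~15x the tokens of chat

single_agent_cost = BASELINE_CHAT_COST * SINGLE_AGENT_MULTIPLIER  # $0.20
multi_agent_cost = BASELINE_CHAT_COST * MULTI_AGENT_MULTIPLIER    # $0.75

# The multi-agent run only pays off when the quality gain is worth more than
# the premium over a single agent; here that is about $0.55 per query.
premium = multi_agent_cost - single_agent_cost
print(f"Multi-agent premium per query: ${premium:.2f}")
```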
Engineering challenges emerge
Building production-ready agent systems revealed coordination problems. Early versions spawned 50 sub-agents for simple queries or searched endlessly for information that didn't exist.
Prompt engineering became their main control mechanism. Each sub-agent needs specific objectives, output formats, tool guidance, and clear boundaries. Vague instructions cause duplicated work and missed information.
Teaching the lead agent to delegate effectively proved crucial. Simple instructions like "research the semiconductor shortage" failed because sub-agents interpreted tasks differently or repeated each other's searches.
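One way to picture effective delegation is a structured task brief for each sub-agent covering the four elements above: objective, output format, tool guidance, and boundaries. The field names and example text below are illustrative, not Anthropic's prompt format.

```python
# Illustrative sub-agent task brief covering the four elements mentioned above:
# objective, output format, tool guidance, and boundaries. Field names and the
# example text are assumptions, not Anthropic's actual prompt format.
from dataclasses import dataclass

@dataclass
class SubAgentTask:
    objective: str       # the specific question this sub-agent must answer
    output_format: str   # what the compressed report back to the lead looks like
    tool_guidance: str   # which tools to prefer and how to use them
    boundaries: str      # what is explicitly out of scope, to avoid overlap

    def to_prompt(self) -> str:
        return (
            f"Objective: {self.objective}\n"
            f"Output format: {self.output_format}\n"
            f"Tool guidance: {self.tool_guidance}\n"
            f"Boundaries: {self.boundaries}"
        )

# Instead of a vague "research the semiconductor shortage", the lead agent
# hands each sub-agent a non-overlapping brief like this one:
task = SubAgentTask(
    objective="Identify the main supply-side causes of the 2021 chip shortage.",
    output_format="Bullet list of causes, each with one cited source.",
    tool_guidance="Use web search; prefer industry reports over news aggregators.",
    boundaries="Do not cover demand-side factors; another agent handles those.",
)
print(task.to_prompt())
```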
Parallel processing shows promise
Speed improvements came from two changes: running 3-5 sub-agents simultaneously instead of one after another, and having sub-agents use multiple tools at once.
These modifications cut research time by up to 90% for complex queries, according to Anthropic. Tasks that previously took hours now finish in minutes while covering more sources.
The parallel approach works because research naturally involves exploring multiple sources and angles. Sequential processing creates unnecessary bottlenecks.
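A sketch of the concurrency side, using the async Anthropic client and asyncio.gather to fire the sub-agent calls at once; the sub-tasks and model ID are placeholders.

```python
# Sketch of running sub-agents concurrently instead of one after another.
# Uses the async Anthropic client; task texts and model ID are assumptions.
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def run_subagent(task: str) -> str:
    reply = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": task}],
    )
    return reply.content[0].text

async def run_in_parallel(tasks: list[str]) -> list[str]:
    # 3-5 sub-agents fire at once; total wall-clock time is roughly the
    # slowest single sub-agent instead of the sum of all of them.
    return await asyncio.gather(*(run_subagent(t) for t in tasks))

findings = asyncio.run(run_in_parallel([
    "Summarize chip fab capacity expansions announced since 2021, with sources.",
    "Summarize automotive-sector chip demand trends since 2021, with sources.",
    "Summarize export-control impacts on chip supply since 2022, with sources.",
]))
```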
Production presents new problems
Moving from prototype to production revealed unique challenges. Agent systems maintain state across long processes, and errors compound differently than in traditional software.
Regular software bugs break specific features. Agent bugs can cascade into completely different behaviors, making debugging harder.
Anthropic built systems that resume from failure points rather than restart completely. They also programmed agents to handle tool failures without stopping entirely.
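A minimal sketch of both ideas: checkpointing finished sub-tasks so a crashed run resumes where it stopped, and retrying flaky tool calls instead of aborting. The file layout and function names are illustrative, not Anthropic's implementation.

```python
# Minimal sketch of the two recovery ideas above: checkpoint completed sub-tasks
# so a failed run resumes where it stopped, and retry flaky tool calls rather
# than aborting the whole research job. All names here are illustrative.
import json
import time
from pathlib import Path

CHECKPOINT = Path("research_checkpoint.json")

def load_checkpoint() -> dict:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def save_checkpoint(done: dict) -> None:
    CHECKPOINT.write_text(json.dumps(done))

def call_tool_with_retry(tool, arg, attempts: int = 3):
    # Used inside sub-agents: tolerate transient tool failures instead of
    # killing the agent; after the last attempt, degrade gracefully.
    for i in range(attempts):
        try:
            return tool(arg)
        except Exception:
            if i == attempts - 1:
                return None  # the agent notes the gap and moves on
            time.sleep(2 ** i)

def run_research(subtasks: list[str], run_subagent) -> dict:
    done = load_checkpoint()
    for task in subtasks:
        if task in done:           # already finished before the crash; skip it
            continue
        done[task] = run_subagent(task)
        save_checkpoint(done)      # persist progress after each sub-task
    return done
```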
Production monitoring required new approaches. Standard observability wasn't enough - they needed to track agent decision patterns and how agents interact with each other.
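One plausible shape for that kind of tracing is a structured log of every agent decision, so coordination failures show up in the traces; the schema below is an assumption.

```python
# Thin sketch of decision-level tracing: log which tools each agent chose and
# why, so coordination problems show up in the traces. Schema is an assumption.
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
tracer = logging.getLogger("agent_trace")

def log_decision(agent_id: str, decision: str, detail: dict) -> None:
    tracer.info(json.dumps({
        "ts": time.time(),
        "agent": agent_id,       # which agent acted
        "decision": decision,    # e.g. "tool_call", "spawn_subagent", "finish"
        "detail": detail,        # tool name, query, or hand-off target
    }))

# Example: the lead agent records why it spawned a sub-agent, and the
# sub-agent records each search it runs.
log_decision("lead", "spawn_subagent", {"task": "supply-side causes"})
log_decision("sub-1", "tool_call", {"tool": "web_search", "query": "2021 fab capacity"})
```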
Testing gets complicated
Evaluating multi-agent systems breaks normal testing methods. Unlike regular software that follows predictable paths, agents might use completely different approaches to reach the same goal.
One agent might search three sources while another searches ten. Both approaches could be valid.
Anthropic started with small test sets of 20 queries. Early changes had large effects, so small samples caught major improvements effectively.
They used AI judges to evaluate outputs based on factual accuracy, citation quality, completeness, source quality, and tool efficiency. One judge with a single prompt worked better than multiple specialized judges.
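A sketch of that single-judge setup: one rubric prompt scores a report on the listed criteria and returns JSON. The rubric wording, 0-10 scale, and judge model are assumptions, and the sketch assumes the judge replies with bare JSON.

```python
# Sketch of the single-judge approach: one prompt scores an output against the
# criteria listed above and returns JSON. Rubric wording and scale are assumptions.
import json
import anthropic

client = anthropic.Anthropic()

JUDGE_PROMPT = """Score the research report below on each criterion from 0 to 10:
factual accuracy, citation quality, completeness, source quality, tool efficiency.
Reply with JSON only, for example {{"factual_accuracy": 8}}.

Question: {question}

Report:
{report}"""

def judge(question: str, report: str) -> dict:
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",   # assumed judge model
        max_tokens=300,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, report=report)}],
    )
    # Assumes the judge returns bare JSON with no surrounding prose.
    return json.loads(reply.content[0].text)
```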
Human testing caught problems that automated evaluation missed, like agents consistently choosing low-quality content over authoritative sources.
Agent-tool interfaces proved as important as user interfaces. Wrong tool choices doom agents from the start.
Poor tool descriptions send agents down incorrect paths. Anthropic created tool-testing agents that used flawed tools repeatedly, then rewrote descriptions to prevent failures.
This process cut task completion time by 40% for future agents using the improved descriptions.
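A plausible sketch of that loop: gather transcripts where agents misused a tool, then ask a model to rewrite the tool's description so future agents avoid the same failure modes. The prompt and function name are illustrative.

```python
# Sketch of the tool-testing idea: collect an agent's failures with a tool,
# then ask a model to rewrite the tool description so future agents avoid the
# same mistakes. All prompts and names here are illustrative.
import anthropic

client = anthropic.Anthropic()

def improve_description(tool_name: str, old_description: str,
                        failure_transcripts: list[str]) -> str:
    prompt = (
        f"Tool: {tool_name}\n"
        f"Current description:\n{old_description}\n\n"
        "Transcripts where agents misused this tool:\n"
        + "\n---\n".join(failure_transcripts)
        + "\n\nRewrite the description so agents avoid these failure modes. "
          "Return only the new description."
    )
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text
```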
Implications for AI development
Anthropic's work suggests scaling through coordination rather than just model size. Their performance analysis shows token usage drives results, and intelligent distribution across parallel agents amplifies the effect.
This might create a new category of AI applications: complex tasks that justify 15x token costs for significant performance improvements.
The engineering challenges are substantial. Multi-agent systems require careful state management, robust error handling, and sophisticated coordination mechanisms.
Early user reports suggest the approach works. People describe finding business opportunities they missed, navigating complex decisions, and saving substantial research time.
Why this matters:
- Token distribution across agents might matter more than raw model size - suggesting coordination could be more important than scale
- Multi-agent coordination potentially solves context limits that constrain single models, though long-term viability depends on cost-benefit ratios
❓ Frequently Asked Questions
Q: How much more expensive are multi-agent systems to run?
A: Multi-agent systems use 15x more tokens than regular chat interactions. Single agents use 4x more tokens than chat. This means a complex research task that costs $1 in regular chat might cost $15 with multi-agent processing. The cost only makes sense for high-value tasks.
Q: What types of tasks work best with multiple agents?
A: Tasks that need heavy parallelization, exceed single context windows, or use many complex tools. Examples include finding board members across S&P 500 companies, comparing healthcare options across states, or researching business opportunities in new markets. Sequential tasks like most coding don't benefit as much.
Q: How many sub-agents does the system typically create?
A: Simple fact-finding uses 1 agent with 3-10 tool calls. Direct comparisons need 2-4 sub-agents with 10-15 calls each. Complex research uses 10+ sub-agents with clearly divided responsibilities. The lead agent typically runs 3-5 sub-agents in parallel rather than sequentially.
Q: Why does token usage matter more than model size?
A: Token usage explains 80% of performance differences in Anthropic's testing. More tokens let agents explore more sources and think longer about problems. Upgrading from Claude Sonnet 3.7 to Sonnet 4 beat doubling the token budget on the older model, showing quality and quantity both matter.
Q: What happens when agents make mistakes or get stuck?
A: Early versions spawned 50 sub-agents for simple queries or searched endlessly for non-existent information. Anthropic built resume systems that pick up from failure points rather than starting over from the beginning. Agents can also adapt when tools fail during research.
Q: How do they test something that works differently each time?
A: Traditional testing breaks because agents take different valid paths to the same answer. Anthropic uses AI judges to score results on factual accuracy, citation quality, and tool efficiency. They start with small test sets of 20 queries since early changes have dramatic effects.
Q: Can other companies build similar systems?
A: Anthropic published some example prompts in their open-source cookbook. The engineering challenges include state management, error handling, and coordination between agents. Success requires careful prompt design, robust monitoring, and substantial token budgets for testing and operation.