Google Turns Research Agents Into Infrastructure, and Writes the Test They're Graded On

Google launched a research agent and wrote the test that grades it. Unsurprisingly, Google's tool leads the leaderboard. Competitors must now replicate Google's search infrastructure or accept permanent disadvantage on web research tasks.

When Google announced the Gemini Deep Research agent today, the headline focused on developer access. Third parties can now embed Google's autonomous research capabilities directly into their applications through a new Interactions API. The company also open-sourced DeepSearchQA, a 900-task benchmark designed to evaluate multi-step web research.

The real story isn't about opening access to AI research tools. Google simultaneously built the infrastructure, defined the evaluation criteria, and positioned itself atop the resulting leaderboard. The company just sold the industry a ruler that measures 11 inches, then declared itself tallest.

The Breakdown

• Google released Gemini Deep Research agent via Interactions API while open-sourcing DeepSearchQA, a benchmark designed around Google's architectural strengths.

• Gemini leads DeepSearchQA at 66.1% accuracy. Claude models cluster at the bottom (12.8%-24.0%), disadvantaged by their lack of integrated search infrastructure.

• Google's FACTS benchmark shows a 31.2% error rate for Gemini 3 Pro on factuality tasks. Enterprise testimonials from GV and Axiom Bio describe only "preliminary" research phases.

• The Interactions API bundles models, agents, and tools into integrated offerings. Each feature adopted increases switching costs and platform dependency.

The Ruler Google Designed

Here's what using the Gemini Deep Research agent actually looks like. A developer sends a query. Then waits. Research tasks run asynchronously because they exceed standard API timeout limits, typically running 5-15 minutes, sometimes longer for complex jobs. The developer must set background=True and poll for results, watching a loading spinner while a meter runs in the background. Gemini 3 Pro token rates apply to input, output, and all the intermediate reasoning the agent generates while it scrapes messy HTML, hits paywalls, gets confused by SEO spam, and retries.
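Here's a minimal sketch of that submit-and-poll pattern. The endpoint path, agent identifier, field names, and response shape below are illustrative assumptions, not Google's documented schema; only the background flag and the polling loop come from the product description.

```python
# Sketch of the submit-then-poll workflow described above. Endpoint path,
# agent identifier, and response fields are assumptions for illustration,
# not Google's documented schema.
import os
import time

import requests

API_KEY = os.environ["GEMINI_API_KEY"]
BASE = "https://generativelanguage.googleapis.com/v1beta"  # assumed base URL

# Submit a long-running research task. Background execution means the call
# returns a handle immediately instead of blocking for 5-15 minutes.
resp = requests.post(
    f"{BASE}/interactions",                # assumed path for the Interactions API
    headers={"x-goog-api-key": API_KEY},
    json={
        "agent": "deep-research",          # hypothetical agent identifier
        "input": "Trace the supply chain dependencies of solid-state battery startups.",
        "background": True,                # run asynchronously, as the docs require
    },
    timeout=30,
)
resp.raise_for_status()
interaction_id = resp.json()["id"]         # assumed response field

# Poll until the agent finishes. Every intermediate reasoning token generated
# while it searches, reads, and retries is billed at Gemini 3 Pro rates.
while True:
    status = requests.get(
        f"{BASE}/interactions/{interaction_id}",
        headers={"x-goog-api-key": API_KEY},
        timeout=30,
    ).json()
    if status.get("state") in ("completed", "failed"):   # assumed state values
        break
    time.sleep(30)                          # research runs take minutes, not seconds

print(status.get("output", "no report returned"))
```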

That's the product. Now consider the benchmark that evaluates it.

DeepSearchQA contains 900 hand-crafted "causal chain" tasks across 17 fields. Each step depends on prior analysis, requiring agents to build answers sequentially rather than retrieve single facts. Google frames this as moving beyond traditional benchmarks that fail to capture real-world research complexity. A more accurate framing: Google designed a test that rewards Google's architecture.

The benchmark measures the ability to execute complex search plans and generate exhaustive answer sets. This tests research precision and retrieval recall simultaneously. An agent without tight integration into search infrastructure will struggle because it cannot efficiently navigate the information retrieval patterns the benchmark demands. Google controls the search infrastructure. Google designed the benchmark. Google's agent tops the benchmark.

Look at Kaggle's DeepSearchQA leaderboard as of December 10. Gemini Deep Research Agent leads at 66.1% accuracy. GPT-5 Pro follows at 65.2%. GPT-5 scores 59.4%. Then the gap widens dramatically.

[DeepSearchQA leaderboard chart. Credit: Google]

The Claude models cluster at the bottom. Claude Opus 4.5 with thinking enabled scores 24.0%, Sonnet 4.5 reaches 16.0%, Haiku 4.5 lands at 12.8%. The "Fully Incorrect" column tells a starker story: Claude Opus 4.5 was fully incorrect on 50.7% of tasks. Claude Sonnet 4.5 failed completely on 64.3%. Haiku 4.5 hit 71.0%.

Anthropic's models weren't designed as research agents with integrated search loops. Testing them on DeepSearchQA measures something other than their intended use case. But context doesn't appear on leaderboards. Google can now claim its approach is objectively superior, as measured by an objective benchmark that Google itself designed. The circular logic doesn't make the capability claims false. It makes the competitive framing a marketing exercise dressed as science.

The Testimonials That Aren't

Google's announcement includes praise from financial services firms and Axiom Bio. The financial firms, among them Google's own venture arm GV, report that Gemini Deep Research shortened research cycles from days to hours during due diligence. Axiom Bio, which builds AI systems to predict drug toxicity, says the agent surfaces granular data across biomedical literature.

A company praising its corporate sibling's tool is not validation. It's internal messaging with extra steps.

Both testimonials describe "preliminary" and "initial" research phases. Neither source discusses accuracy rates, hallucination frequency, or the verification burden that follows agent-generated reports. The financial firms claim due diligence happens "without loss of fidelity or quality." That claim deserves interrogation against actual benchmark results.

Google's FACTS Benchmark Suite, released two days before this announcement, shows Gemini 3 Pro achieving 68.8% overall accuracy across grounding, multimodal, parametric, and search tasks. That's the best score among evaluated models. It also means roughly one-third of benchmark items come back wrong.

Due diligence requires accuracy. That error rate doesn't disappear because the tool runs fast. Either these financial firms accept error rates that traditional due diligence wouldn't tolerate, or the benchmark results don't translate to their specific use case, or the testimonial overstates the capability. Pick one.

The drug discovery claim raises identical problems. Can pharmaceutical companies trust agent-generated data for regulatory submissions? The testimonial carefully limits scope to "initial research depth" and building "a foundation for agentic systems." Production deployment in FDA-regulated environments requires different validation than a press release provides.

The Trap With Open Doors

The Interactions API represents Google's bet on how agent development should work. It offers server-side state management, interpretable data models for complex agentic histories, background execution for long-running tasks, and remote MCP tool support. These features address real friction in building agent applications.

Google positions this as infrastructure, not just a product. The company plans to expand built-in agents and introduce the ability to build custom agents using the same API. Gemini models, Google's agents, and third-party agents would share a unified interface.

This is a walled garden with the gate standing open. Walk in, and the walls grow higher.

Future updates include native chart generation for visual reports and Model Context Protocol support for custom data sources. Gemini Deep Research will appear in Google Search, NotebookLM, Google Finance, and the Gemini App. Enterprise deployment through Vertex AI is planned. Each integration point deepens platform dependency. Switching costs accumulate with each feature adopted.

The File Search tool illustrates the pattern. Developers can give the agent access to their own data by specifying file search store names. The agent then synthesizes proprietary documents alongside public web data. Valuable capability. Also a mechanism for moving enterprise data into Google's infrastructure, where it becomes one more reason leaving gets harder.

Compare this to how Google approached Android. Open source the core, control the services layer, make the valuable features dependent on Google infrastructure. The Interactions API follows identical logic. The basic agent runs on documented interfaces, but optimal performance requires Google's tools, Google's search, Google's cloud.

What Competitors Actually Face

OpenAI's Deep Research products scored meaningfully lower than Gemini on DeepSearchQA. The o3 Deep Research variant hit 44.2%, and o4-mini Deep Research reached 40.4%. Both trail GPT-5 Pro's 65.2%, suggesting OpenAI's research-focused agents underperform their general-purpose models on this particular benchmark.

Anthropic faces a steeper problem. Without integrated search infrastructure, Claude models cannot compete on tasks designed around iterative web retrieval. The benchmark's "causal chain" design requires completing one step before the next becomes possible. Agents without tight search integration struggle by definition.

The response options for competitors are limited and expensive. Build comparable search infrastructure from scratch. Partner with search providers, creating different dependencies. Concede the research agent category and compete elsewhere. None of these are good choices, which is precisely the point.

Google didn't design DeepSearchQA to measure research capability in the abstract. Google designed it to measure research capability as Google implements it. The benchmark becomes a moat.

The Error Rate Nobody Wants to Discuss

Both the Deep Research announcement and the FACTS Benchmark Suite reveal uncomfortable truths about AI factuality that the marketing materials skip past.

Across the FACTS suite, all evaluated models scored under 70% overall, with the Multimodal slice showing the weakest performance. The benchmark's creators acknowledge "considerable headroom for future progress." Translation: current models aren't reliable enough for high-stakes applications without human verification.

Gemini 3 Pro reduced error rates by 55% on FACTS Search compared to Gemini 2.5 Pro, and by 35% on FACTS Parametric. Progress from unreliable to somewhat less unreliable. The absolute numbers still don't support unverified deployment in domains where accuracy matters.

A research agent that runs for 10 minutes, bills by the token, and produces outputs where roughly one-third of benchmark items fail creates a specific workflow. A human researcher receives the agent's report. That researcher spends time verifying claims, checking citations, and fixing mistakes. The automation didn't eliminate the work. It shifted the work from research to quality control, while adding a cloud computing bill.

For applications where errors carry consequences, this math doesn't favor automation. A rejected FDA application costs more than the hours saved on literature review. A due diligence report that misses a compliance risk exposes the firm to liability that dwarfs research efficiency gains. An agent hallucinating a drug interaction could delay a discovery pipeline by months while researchers trace the error.

Google's announcement treats these scenarios as edge cases. The benchmark results suggest they're baseline expectations.

Why This Matters

Google now controls the infrastructure that research agents query, the benchmark that evaluates research capability, and the API that developers will use to build research applications. The company that already dominates web search has extended that dominance into agentic AI. Competitors must either replicate Google's search infrastructure, which requires billions of dollars and years of development, or accept structural disadvantage on any task involving web retrieval.

Benchmark design becomes a competitive battleground where neutrality is impossible. DeepSearchQA measures capabilities Google optimized for. Future benchmarks from other companies will measure different capabilities. The industry lacks neutral ground for capability comparison. That problem will worsen as agents become more specialized and the benchmarks that evaluate them become more numerous.

Enterprise AI adoption will increasingly involve platform commitments rather than model choices. The Interactions API bundles models, agents, tools, and infrastructure into integrated offerings. Developers gain capabilities by accepting dependencies. The trade-off looks attractive when you're standing outside the garden. It looks different three years later when your proprietary data lives in Google's file stores, your workflows depend on Google's agents, and the switching costs have compounded into something that makes the finance team nervous.

Google didn't just launch a research agent. It established the terms under which research agents will be evaluated, deployed, and purchased. The benchmark numbers will dominate coverage. The infrastructure dependency will determine outcomes.

❓ Frequently Asked Questions

Q: How does pricing work for Gemini Deep Research?

A: You pay standard Gemini 3 Pro rates for all tokens, including input, output, and intermediate reasoning generated during the research loop. Tool fees apply separately. Search Grounding excludes retrieved tokens from billing, but File Search and URL context include them. A 10-minute research task generates substantial reasoning tokens, making total costs hard to predict before running production workloads.
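To make that concrete, here's a back-of-envelope estimate. Every number below, token counts and per-million-token rates alike, is a placeholder assumption for illustration; plug in the current Gemini 3 Pro price sheet and your own observed usage.

```python
# Hypothetical cost estimate for a single research run. All rates and token
# counts are placeholder assumptions, not Google's published pricing.
PRICE_PER_M_INPUT = 2.00      # USD per 1M input tokens (assumed)
PRICE_PER_M_OUTPUT = 12.00    # USD per 1M output tokens (assumed)

input_tokens = 150_000        # prompt plus retrieved pages fed back into context (assumed)
reasoning_tokens = 400_000    # intermediate planning and reasoning across the loop (assumed)
output_tokens = 20_000        # the final cited report (assumed)

# Assumes reasoning tokens bill at the output rate, as is common for
# thinking models; that term is what makes long agent runs expensive.
cost = (
    input_tokens / 1e6 * PRICE_PER_M_INPUT
    + (reasoning_tokens + output_tokens) / 1e6 * PRICE_PER_M_OUTPUT
)
print(f"Estimated cost for this run: ${cost:.2f}")   # about $5.34 under these assumptions
```

The point isn't the specific dollar figure. It's that the reasoning-token term dominates, and that term is the one you can't predict before running the workload.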

Q: What exactly is the Interactions API?

A: The Interactions API is Google's new interface for working with Gemini models and agents. It handles server-side state management, supports background execution for long-running tasks, and provides a structured format for complex agent histories. Developers access it through a single /interactions endpoint. Google plans to add more built-in agents and let developers bring custom agents to the same API.

Q: Can Deep Research access my company's private documents?

A: Yes. The File Search tool lets you create named stores, upload documents (PDFs, CSVs, docs), and have the agent synthesize them alongside public web data. You specify file_search_store_names when calling the API. The agent can then cross-reference your internal files with information it retrieves from the web during research tasks.
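For illustration, a request that points the agent at a private store might look like the sketch below. Only the file_search_store_names parameter name comes from the description above; the surrounding payload structure and the store name are assumptions.

```python
# Hypothetical request body giving Deep Research access to a private
# document store. Only file_search_store_names is taken from the docs;
# everything else is an illustrative assumption.
payload = {
    "agent": "deep-research",                       # hypothetical identifier
    "input": "Compare our internal assay results with recent published "
             "findings on hepatotoxicity markers.",
    "background": True,
    "tools": [
        {
            "file_search": {
                # Named stores you created and uploaded PDFs, CSVs, and docs into.
                "file_search_store_names": ["fileSearchStores/assay-reports"],
            }
        }
    ],
}
```

That payload would slot into the submit call sketched earlier; once uploaded, those documents live in Google's file stores, which is exactly the dependency discussed above.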

Q: How is Deep Research different from regular Gemini 3 Pro?

A: Standard Gemini 3 Pro responds in seconds with a single generation pass. Deep Research runs an autonomous loop: it plans queries, searches the web, reads results, identifies gaps, and searches again. This cycle typically takes 5-15 minutes. The output is a detailed, cited report rather than conversational text. Deep Research is an agent built on top of Gemini 3 Pro, not a different model.
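Conceptually, that loop looks something like the sketch below. This is not Google's implementation; plan_queries, search, read, find_gaps, and write_report are stand-ins for whatever the real agent does internally.

```python
# Conceptual sketch of a plan/search/read/refine research loop. The llm and
# search objects and their methods are hypothetical stand-ins, not Google's API.
def deep_research(question: str, llm, search, max_rounds: int = 8) -> str:
    notes: list[str] = []
    queries = llm.plan_queries(question)            # initial search plan
    for _ in range(max_rounds):
        for q in queries:
            for result in search(q):                # iterative web retrieval
                notes.append(llm.read(result))      # extract relevant facts
        gaps = llm.find_gaps(question, notes)       # what's still unanswered?
        if not gaps:
            break
        queries = llm.plan_queries(gaps)            # search again to fill the gaps
    return llm.write_report(question, notes)        # detailed, cited final report
```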

Q: What is MCP support and why does it matter?

A: MCP (Model Context Protocol) is a standard for connecting AI models to external data sources. Google's planned MCP support would let Deep Research pull from databases, APIs, and tools beyond Google's ecosystem. This reduces lock-in by enabling connections to third-party services. However, Google controls which MCP servers work within the Interactions API framework, limiting true openness.
