Google Gemini 3 Deep Think Hits 84.6% on ARC-AGI-2, Beating GPT-5.2 and Claude

Google's Gemini 3 Deep Think scored 84.6% on ARC-AGI-2, beating GPT-5.2 (52.9%) and Claude Opus 4.6 (68.8%). Verified cost: $13.62 per task.

Google Gemini 3 Deep Think Scores 84.6% on ARC-AGI-2

On Thursday, Google released a major upgrade to Gemini 3 Deep Think, its specialized reasoning model built for scientific and engineering work. The model posted an 84.6% score on ARC-AGI-2, a reasoning benchmark designed to resist memorization, according to results the ARC Prize Foundation verified independently.

That score beats OpenAI's GPT-5.2 Thinking at 52.9% and Anthropic's Claude Opus 4.6 Thinking at 68.8% on the same test. Humans average about 60%. The gap between Google's model and its closest rival, 15.8 percentage points, is nearly double the distance between that rival and average human performance.

Google said the update also reached 48.4% on Humanity's Last Exam without external tools, hit a 3,455 Elo on Codeforces competitive programming, and achieved gold-medal-level results on the 2025 International Olympiads in math, physics, and chemistry.

Key Takeaways

• Gemini 3 Deep Think scored 84.6% on ARC-AGI-2, beating GPT-5.2 (52.9%) and Claude Opus 4.6 (68.8%)

• The model also leads on Humanity's Last Exam, Codeforces Elo, and international science olympiads

• Verified inference cost: $13.62 per ARC-AGI-2 task, raising questions about production-scale economics

• Access limited to Google AI Ultra subscribers and an early API program for researchers


A benchmark built to resist AI is running out of room

Most benchmarks reward pattern recall from training data. ARC-AGI-2 does not. It tests whether a model can solve problems it has never encountered. It presents novel visual reasoning puzzles that require flexible, on-the-fly logic. Previous AI models struggled to break 20% on the original ARC-AGI. Gemini 3 Deep Think now scores 96% on that first version and 84.6% on the harder sequel.
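
For readers unfamiliar with the format, an ARC task is a handful of input/output grid pairs plus a held-out test input; the solver must infer the transformation rule from the examples alone. A minimal sketch in Python (the task below is an invented toy example, not one from the actual benchmark):

```python
# Toy ARC-style task: each grid is a 2D list of color indices (0-9).
# The hidden rule in this invented example: mirror the grid left-to-right.
train_pairs = [
    {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
    {"input": [[3, 4], [0, 5]], "output": [[4, 3], [5, 0]]},
]
test_input = [[7, 0], [0, 8]]

def mirror(grid):
    """Candidate transformation: flip each row horizontally."""
    return [list(reversed(row)) for row in grid]

# A solver earns credit only if its rule reproduces every training pair
# exactly and then generalizes to the unseen test input.
assert all(mirror(p["input"]) == p["output"] for p in train_pairs)
print(mirror(test_input))  # [[0, 7], [8, 0]]
```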

The ARC Prize Foundation confirmed these numbers at a cost of $13.62 per task. That price matters. Deep Think's reasoning mode uses extended "test-time compute," spending more processing cycles before answering. Better scores come with higher inference bills. No enterprise customer has disclosed what that tradeoff looks like at production volume.
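
The verified per-task figure makes the scaling math easy to run yourself. A back-of-envelope sketch (the workload volumes below are illustrative assumptions, not disclosed figures):

```python
# Rough inference-cost projection from the verified $13.62/task figure.
# Workload sizes are hypothetical; no enterprise volumes have been disclosed.
COST_PER_TASK = 13.62  # USD, ARC Prize Foundation's verified figure

for tasks_per_day in (100, 10_000, 1_000_000):
    monthly = COST_PER_TASK * tasks_per_day * 30
    print(f"{tasks_per_day:>9,} tasks/day -> ${monthly:,.0f}/month")
# 100 tasks/day runs about $40.9k/month; a million/day, roughly $409M/month.
```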

For comparison, Google's own Gemini 3 Pro Preview, the non-reasoning variant, scored 31.1% on ARC-AGI-2. The specialized reasoning mode is doing the heavy work.

The lead is consistent across every major test

The ARC-AGI-2 gap is not a one-benchmark anomaly. Deep Think scored 48.4% on Humanity's Last Exam without tools, beating Claude Opus 4.6 at 40% and GPT-5.2 at 34.5%. That benchmark throws thousands of PhD-level questions at models across specialized fields. Google's own Gemini 3 Pro Preview managed 37.5%.

In competitive programming, the margin is even wider. Deep Think holds a 3,455 Elo on Codeforces, "Legendary Grandmaster" territory that only a sliver of human programmers reach. Claude Opus 4.6 sits at 2,352. Anthropic and OpenAI should find that uncomfortable.
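
Elo gaps translate directly into expected head-to-head results. Plugging the two ratings from the article into the standard logistic Elo formula shows how lopsided an 1,103-point gap is:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Ratings from the article: Deep Think at 3,455 vs Claude Opus 4.6 at 2,352.
p = elo_expected(3455, 2352)
print(f"Expected score: {p:.4f}")  # ~0.9983, a near-certain head-to-head win
```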

Multimodal reasoning and theoretical physics tell the same story. Deep Think scored 81.5% on MMMU-Pro versus GPT-5.2 at 79.5% and Claude at 73.9%, and hit 50.5% on the CMT-Benchmark for advanced physics. Google is selling the model as a research partner, not a chatbot. These numbers make the pitch easier.


From benchmark charts to lab benches

Google paired the numbers with real-world demonstrations. Lisa Carbone, a mathematician at Rutgers University working on structures that connect general relativity and quantum mechanics, used Deep Think to review a technical paper. The model caught a subtle logical flaw that had passed through human peer review unnoticed, according to Google.

At Duke University, the Wang Lab used Deep Think to optimize fabrication methods for crystal growth. It designed a recipe for thin films larger than 100 micrometers, hitting a precision target that earlier methods had missed.

Google also introduced Aletheia, a math research agent powered by Deep Think that can run autonomous investigations or collaborate with humans. The agent can "admit failure to solve a problem," which Google said improved efficiency for researchers by avoiding dead-end paths. The company published several papers produced with the technology, from information theory to mechanism design.
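
Google has not published Aletheia's internals, but the "admit failure" behavior maps onto a familiar agent pattern: cap the attempt budget and return an explicit unsolved status instead of forcing an answer. A hypothetical sketch (all names here, including `propose_and_verify`, are invented for illustration):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Result:
    solved: bool
    answer: Optional[str] = None

def research_loop(problem: str,
                  propose_and_verify: Callable[[str, int], Optional[str]],
                  max_attempts: int = 5) -> Result:
    """Try a bounded number of attempts; admit failure rather than guess.

    `propose_and_verify` stands in for a model call plus an answer/proof
    checker -- entirely hypothetical, not Aletheia's actual interface.
    """
    for attempt in range(max_attempts):
        answer = propose_and_verify(problem, attempt)
        if answer is not None:  # only verified answers count as solved
            return Result(solved=True, answer=answer)
    # An explicit "cannot solve" spares researchers a dead-end path.
    return Result(solved=False)
```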

Google demonstrated one more trick: feeding Deep Think a hand-drawn sketch and getting back a 3D-printable file. The model reads the geometry, writes code to recreate the shape, and spits out something a printer can actually use. Napkin to prototype, no CAD software in between.
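
Google did not publish the pipeline behind that demo, but it implies a two-stage flow: a multimodal call that turns the image into parametric geometry code, then a mesh export. A hypothetical sketch using OpenSCAD as the intermediate format (the model ID and prompt are assumptions, and this is not Google's published workflow; OpenSCAD's CLI is real):

```python
# Hypothetical napkin-to-STL pipeline sketch.
import subprocess
from google import genai
from PIL import Image

client = genai.Client()  # reads the API key from the environment
sketch = Image.open("napkin_sketch.png")

resp = client.models.generate_content(
    model="gemini-3-deep-think",  # placeholder ID; not a published endpoint
    contents=[sketch, "Recreate this shape as a complete OpenSCAD program. "
                      "Return only the code."],
)

with open("model.scad", "w") as f:
    f.write(resp.text)

# OpenSCAD renders the generated geometry to a printable STL file.
subprocess.run(["openscad", "-o", "model.stl", "model.scad"], check=True)
```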

Who gets access, and what it costs

If you are not a Google AI Ultra subscriber, you will not see this model for a while. The updated Deep Think is available now in the Gemini app for Ultra users. For the first time, Google is also offering it through the Gemini API, but access is limited to an early access program targeting researchers and enterprise users.

That restricted rollout reflects the model's inference costs. At $13.62 per ARC-AGI-2 task, running Deep Think at scale gets expensive fast.

Bloomberg reported the update as part of a broader push by AI labs to build tools for scientific research and complex coding. Anthropic recently released a new version of its most powerful model for financial research and legal analysis, triggering a selloff in traditional software stocks.

The benchmark saturation question hangs over all of these releases. ARC-AGI-1 is already at 96%. If its harder successor follows the same trajectory, the industry will need tougher tests to separate models, or shift the conversation to real-world output entirely.

Frequently Asked Questions

Q: What is ARC-AGI-2 and why does it matter?

A: ARC-AGI-2 is a reasoning benchmark that tests whether AI models can solve novel visual puzzles they have never seen before. Unlike most benchmarks, it resists memorization from training data, making it a harder measure of genuine reasoning ability.

Q: How much does it cost to run Gemini 3 Deep Think?

A: The ARC Prize Foundation verified Deep Think's scores at $13.62 per task. The model uses extended test-time compute, spending more processing cycles per answer. Google has not published enterprise pricing or production-volume cost estimates.

Q: Who can access Gemini 3 Deep Think right now?

A: The model is available in the Gemini app for Google AI Ultra subscribers. Google also opened a limited early access program through the Gemini API for researchers and enterprise users. No general availability date has been announced.

Q: What is Aletheia, Google's new math research agent?

A: Aletheia is an autonomous research agent powered by Deep Think that can run mathematical investigations independently or alongside human researchers. It can identify when it cannot solve a problem and stop, which Google said avoids wasting researcher time on dead-end paths.

Q: How does Deep Think compare to the standard Gemini 3 Pro?

A: Gemini 3 Pro Preview scored 31.1% on ARC-AGI-2 compared to Deep Think's 84.6%. The gap shows that the specialized reasoning mode, which uses extended compute time, is responsible for the performance gains rather than the base model architecture alone.
