Google’s Most Advanced AI Raises Alarms Over Weaponization Risks

Google launches its most powerful AI yet, but at $250/month and with safety warnings about weapons knowledge, Deep Think raises questions about who gets access to advanced reasoning capabilities.

Deep Think Launches With Nuclear Safety Warnings and $250 Price Tag

💡 TL;DR - The 30-Second Version

🚨 Google launched its Deep Think AI today at $250/month, but safety evaluations found it reached the "early warning alert threshold" for weapons knowledge.

📊 Deep Think outperformed competitors on key benchmarks, scoring 34.8% on Humanity's Last Exam vs OpenAI's o3 at 20.3%.

🏭 Only Google AI Ultra subscribers get access, and even they face daily prompt limits, making it one of the most expensive consumer AI services.

🔬 A more powerful research version achieved gold-medal performance at the International Mathematical Olympiad by solving 5 of 6 problems.

🌍 The premium pricing could create digital divides, limiting advanced reasoning capabilities to wealthy individuals and large enterprises.

🚀 Multiple AI companies are converging on similar multi-agent reasoning approaches, suggesting this technology will become standard.

Google just launched its most powerful AI model, and it comes with two catches: you'll pay $250 a month to use it, and safety researchers are worried it knows too much about making weapons.

Gemini 2.5 Deep Think went live today for Google AI Ultra subscribers. The model spawns multiple AI agents that work the same problem simultaneously, like having several experts debate an answer before settling on the best solution. This "parallel thinking" approach takes more time and computing power than regular AI models, which explains the premium pricing.

The timing isn't coincidental. OpenAI's o1 and o3 reasoning models have grabbed attention for their ability to think through complex problems step by step. Google's response uses a different approach but aims for the same goal: AI that can reason more like humans do.

The $250 Question

Google AI Ultra costs $249.99 per month, making it one of the most expensive consumer AI subscriptions available. For comparison, ChatGPT Plus costs $20 monthly, and Claude Pro runs $20 as well. Google is betting that reasoning models justify charging roughly 12.5 times as much.

What do you get for that money? Access to Deep Think through the Gemini app, but with daily limits on how many prompts you can send. Google won't say exactly how many prompts users get, calling it "a fixed set" per day. That vague language suggests the limits are tight enough that Google doesn't want to advertise them.

The model accepts text, images, audio, and video files up to 1 million tokens of input. It can generate responses up to 192,000 tokens long, roughly equivalent to a short book. Those lengthy responses reflect Deep Think's tendency to show its work, explaining the reasoning steps that led to its conclusions.

How Deep Think Works

Traditional AI models generate responses in sequence, predicting one word at a time based on everything that came before. Deep Think takes a different approach, spinning up multiple reasoning paths simultaneously and letting them compete.

Google describes this as parallel thinking. The model generates several hypotheses about how to solve a problem, tests each approach, and can even combine different ideas before settling on a final answer. This process requires significantly more computational resources than standard models, but Google claims it produces better results.
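
Google hasn't published how these parallel paths are generated or merged, but the general pattern resembles sampling several candidate reasoning chains and then picking, or combining, the strongest one. The sketch below is a toy illustration of that idea, not Google's implementation: the `generate_candidate` function and confidence-weighted vote are assumptions standing in for real model calls and a real selection step.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for one reasoning path. In a real system this would be a full
# chain-of-thought sample from the model; here it just returns an answer with
# a self-assessed confidence. Hypothetical, for illustration only.
def generate_candidate(problem: str, seed: int) -> dict:
    rng = random.Random(seed)
    return {
        "reasoning": f"candidate path {seed} for: {problem}",
        "answer": rng.choice(["A", "B", "C"]),
        "confidence": rng.random(),
    }

def parallel_think(problem: str, n_paths: int = 4) -> dict:
    # Spin up several reasoning paths at once...
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        candidates = list(pool.map(
            lambda seed: generate_candidate(problem, seed), range(n_paths)
        ))
    # ...then let them compete: here, a confidence-weighted vote on the answer.
    scores: dict[str, float] = {}
    for c in candidates:
        scores[c["answer"]] = scores.get(c["answer"], 0.0) + c["confidence"]
    best_answer = max(scores, key=scores.get)
    return {"answer": best_answer, "paths_considered": len(candidates)}

print(parallel_think("Which option satisfies the constraints?"))
```

The real system presumably merges reasoning content rather than just voting on final answers, but the trade-off is the same: more parallel paths tend to mean better answers at a higher compute cost, which is what the premium price is paying for.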

The company trained Deep Think using novel reinforcement learning techniques that reward the model for exploring longer reasoning chains. Unlike the original version that won a gold medal at the International Mathematical Olympiad by spending hours on each problem, the commercial version aims for faster responses while maintaining strong performance.

This architecture puts Deep Think in the same category as other multi-agent systems. Elon Musk's xAI recently released Grok 4 Heavy, which also uses multiple agents working in parallel. OpenAI's unreleased model that won the Math Olympiad reportedly uses similar techniques. The industry seems to be converging on this approach for complex reasoning tasks.

Performance Claims vs Reality

Google's benchmark numbers look impressive. On Humanity's Last Exam, a test spanning 100+ subjects with 2,500 questions, Deep Think scored 34.8% compared to OpenAI's o3 at 20.3% and Grok 4 at 25.4%. That's a substantial lead, though all models still fail most questions on this particularly challenging test.

For coding tasks, Deep Think achieved 87.6% on LiveCodeBench, beating Grok 4's 79% and o3's 72%. Google has improved that score from 80.4% since the model's May preview, suggesting continued refinement based on tester feedback.

The Mathematical Olympiad results tell an interesting story. Google's research version achieved gold-medal performance by solving five of six problems perfectly, but took hours per problem. The commercial Deep Think manages bronze-level performance while running fast enough for daily use. Bronze-level performance is nothing to scoff at - most of us would struggle to solve even one Olympiad problem with unlimited time.

Numbers only tell part of the story, though. Real users report mixed results. One Hacker News user tested the classic "pelican riding a bicycle" SVG generation prompt and got a solid result, with the bird clearly recognizable as a pelican thanks to its distinctive beak shape. That's better than most models manage on this particular test.

However, the model sometimes over-refuses benign requests, a common problem with safety-focused AI systems. Google acknowledges this limitation in their documentation but suggests it's preferable to being too permissive.

Safety Red Flags

Deep Think's model card reveals concerning safety findings that Google downplays in their marketing materials. The model reached the "early warning alert threshold" for CBRN risks - chemical, biological, radiological, and nuclear weapons information.

Internal evaluations found that Deep Think "has enough technical knowledge in certain CBRN scenarios and stages to be considered at early alert threshold." Translation: the model knows dangerous information about making weapons of mass destruction and could potentially help bad actors access that knowledge.

Google's response was to implement additional safety measures including threat modeling, usage monitoring, and account enforcement for misuse. Google also built filters into the model itself, trying to block dangerous responses without breaking normal uses.

Google wouldn't mention these risks unless they were genuinely worried. Most AI companies keep quiet about safety problems until forced to speak up. Google's transparency here might reflect lessons learned from previous AI safety incidents across the industry.

External safety testers found that while Deep Think can synthesize complex information and provide high-level strategies, it often falls short on "consistent and verified details required for real-world execution by a low-resourced actor." In other words, it knows enough to be concerning but not quite enough to be immediately dangerous.

The Competition Heats Up

The AI reasoning race is moving fast. OpenAI kicked things off with o1, which showed users a step-by-step thinking process for complex problems. Its follow-up, o3, pushed performance further and is already available through ChatGPT's paid plans.

Meanwhile, Anthropic's research agent uses multi-agent techniques to generate thorough research briefs. The company hasn't launched a direct competitor to Deep Think yet, but the underlying technology suggests it could.

All these companies face the same fundamental challenge: reasoning models require massive computational resources. That makes them expensive to run and difficult to scale. The result is premium pricing that puts advanced AI capabilities out of reach for most users.

Google's $250 monthly fee represents one approach to this problem. By charging high prices, they can limit usage while covering their costs. But this strategy only works if the performance justifies the expense.

Early user reports suggest mixed results. The model excels at complex problems that benefit from extended reasoning, particularly in mathematics, coding, and scientific analysis. But for simpler tasks, the extra thinking time doesn't add much value, and users might prefer faster, cheaper alternatives.

What Users Actually Get

Google AI Ultra subscribers can enable Deep Think by toggling a switch in the Gemini app when using the 2.5 Pro model. The interface looks identical to regular Gemini, but responses take longer to generate as the model works through its reasoning process.

The model automatically works with tools like code execution and Google Search, potentially making it more capable than isolated reasoning models. This tool integration could be a key differentiator, allowing Deep Think to gather current information and test code while reasoning through problems.
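
Google hasn't documented how Deep Think decides when to reach for a tool, but tool-augmented reasoning generally follows a propose-execute-observe cycle: the model proposes an action, a harness runs it, and the result is fed back into the model's context. The snippet below is a generic sketch of that cycle with stand-in tools and a scripted toy policy; none of it is Gemini's actual interface.

```python
from typing import Callable

# Stand-in tools. In Deep Think these roles are played by Google Search
# and a code-execution sandbox.
def web_search(query: str) -> str:
    return f"(search results for {query!r})"

def run_python(source: str) -> str:
    return f"(output of running {source!r})"

TOOLS: dict[str, Callable[[str], str]] = {"search": web_search, "python": run_python}

def toy_model_step(task: str, observations: list[str]) -> dict:
    # A real model chooses its next action itself; this toy policy simply
    # searches once, runs code once, then answers.
    if not observations:
        return {"action": "search", "content": task}
    if len(observations) == 1:
        return {"action": "python", "content": "print(2 + 2)"}
    return {"action": "final", "content": f"answer based on {len(observations)} observations"}

def reasoning_loop(task: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        step = toy_model_step(task, observations)
        if step["action"] == "final":
            return step["content"]
        # Execute the requested tool and feed the result back as context.
        observations.append(TOOLS[step["action"]](step["content"]))
    return "stopped: step budget exhausted"

print(reasoning_loop("What changed in the latest LiveCodeBench release?"))
```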

Response length varies significantly based on the complexity of the query. Simple questions might get brief answers, while complex problems can generate thousands of words as the model walks through each reasoning step.

Daily prompt limits mean users need to choose their queries carefully. Google hasn't specified exact numbers, but the "fixed set" language suggests limits tight enough that power users will hit them regularly. This constraint could limit the model's utility for intensive workflows.

API access is coming "in the coming weeks" for select developers and enterprise customers. Google plans to offer versions both with and without tool access, allowing developers to choose the configuration that best fits their needs.
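
Once the API does arrive, calling it will presumably look like any other Gemini model call through Google's existing SDK. The snippet below is a guess at that shape using the current google-generativeai Python package; the model identifier "gemini-2.5-deep-think" is a placeholder since Google hasn't published the real one, and the output cap mirrors the 192,000-token limit mentioned above.

```python
# pip install google-generativeai  (Google's existing Gemini SDK)
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Placeholder model ID: Google hasn't announced the actual Deep Think
# identifier yet, so this name is an assumption for illustration.
model = genai.GenerativeModel("gemini-2.5-deep-think")

response = model.generate_content(
    "Prove or disprove: the sum of two odd integers is always even.",
    generation_config={
        # Deep Think's stated output ceiling is 192,000 tokens.
        "max_output_tokens": 192_000,
    },
)
print(response.text)
```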

The Academic Connection

Google is sharing a more powerful version of Deep Think with select mathematicians and academics. This research model achieved gold-medal performance at the International Mathematical Olympiad but takes hours to solve individual problems.

The company hopes academic feedback will help improve future versions of the model. This approach mirrors strategies used by other AI companies, which often test advanced capabilities with domain experts before public release.

Mathematician Michel van Garrel, who tested the model, appears in Google's promotional video discussing its potential for exploring mathematical conjectures. These academic partnerships could accelerate research in fields where reasoning capabilities matter most.

The academic version represents what consumer models might become as computational costs decrease. Today's hours-long reasoning sessions could become tomorrow's real-time responses as hardware improves and algorithms become more efficient.

Market Reality Check

The $250 price point raises questions about market demand for premium AI reasoning. Most consumers won't pay car-payment prices for an AI assistant, regardless of capability. This suggests Google is targeting enterprise users and researchers rather than general consumers.

Enterprise customers might justify the expense if Deep Think genuinely accelerates complex workflows. Software development, scientific research, and mathematical analysis could all benefit from extended reasoning capabilities. But the model needs to demonstrate clear ROI to justify the premium.

The limited daily prompts create another challenge. Enterprise users often need to run many queries to complete projects. Strict limits could force organizations to carefully ration their AI usage, potentially reducing effectiveness.

Competition from cheaper alternatives adds pressure. OpenAI's o-series models offer similar step-by-step reasoning at a fraction of the cost through ChatGPT Plus. While Deep Think might be technically superior, the price difference could drive users to good-enough alternatives.

Looking Forward

Deep Think represents Google's bet that AI reasoning is worth premium pricing. The model's technical capabilities are impressive, particularly in mathematics and coding where extended reasoning provides clear benefits.

But several factors could limit adoption. The high price excludes most individual users, while daily limits constrain enterprise usage. Safety concerns around dangerous knowledge create additional complications, requiring careful monitoring and potential restrictions.

The broader trend toward reasoning models seems inevitable. As AI systems become more capable, the ability to think through complex problems step by step becomes increasingly valuable. The question isn't whether reasoning models have a future, but whether Google's particular approach and pricing will succeed.

OpenAI keeps prices low to reach more users. Anthropic prioritizes safety and reliability. Google's high-end approach could backfire if competitors match its performance for less money.

Why this matters:

• Google is betting that truly advanced AI justifies luxury pricing, but the $250 monthly cost could limit this technology to wealthy individuals and large enterprises, potentially creating new digital divides in access to reasoning capabilities.

• The safety warnings about weapons knowledge reveal how powerful AI models are becoming genuine dual-use technologies that require careful oversight, suggesting we're entering a new phase where AI capabilities outpace our ability to safely deploy them.

❓ Frequently Asked Questions

Q: How many prompts do you get per day with the $250 subscription?

A: Google won't specify the exact number, calling it "a fixed set" per day. The vague language suggests the limits are tight enough that Google doesn't want to advertise them publicly, likely indicating power users will hit the limits regularly.

Q: What does "parallel thinking" actually mean in technical terms?

A: Instead of generating one response sequentially, Deep Think creates multiple reasoning paths simultaneously. These different approaches compete and can combine ideas before settling on a final answer. Think of it like having several experts work the same problem independently, then comparing notes.

Q: How does Deep Think's $250 cost compare to other AI subscriptions?

A: At $249.99 a month, it costs roughly 12.5 times as much as ChatGPT Plus ($20/month) or Claude Pro ($20/month). Google AI Ultra is currently one of the most expensive consumer AI subscriptions available, targeting enterprise users and researchers rather than general consumers.

Q: What does CBRN stand for and why is it concerning?

A: CBRN stands for Chemical, Biological, Radiological, and Nuclear weapons. Google's evaluations found that Deep Think has enough technical knowledge in some CBRN scenarios to cross an early alert threshold. That knowledge could help the wrong people, so Google added extra monitoring and safety filters to watch for misuse.

Q: When will developers get API access to Deep Think?

A: Google plans to release Deep Think via the Gemini API "in the coming weeks" for select developers and enterprise customers. They'll offer versions both with and without tool access to fit different use cases.

Q: How long do Deep Think responses take compared to regular AI?

A: Deep Think takes "several minutes" to generate responses because of its extended reasoning process. The research version that won the Math Olympiad took hours per problem, but the commercial version is optimized for faster daily use.

Q: What safety measures did Google actually implement?

A: Google deployed threat modeling, multi-tier usage monitoring with human review, account enforcement for misuse, and model-level filters to block dangerous responses. They also conduct ongoing red team testing to find ways around these protections.

Q: Can you use Deep Think for simple tasks or just complex problems?

A: You can use it for any task, but the extra thinking time doesn't add much value for simple questions. Deep Think excels at complex problems in mathematics, coding, and scientific analysis where extended reasoning provides clear benefits.
